Run Qwen2: The next gen of LLMs on an NVIDIA A5000 Cloud Server

In this article, I'm excited to show you the easiest way to run Qwen2 using Olama. You'll be thrilled to discover how this model has surpassed Mistral and Llama3 in performance.

5 days ago   •   5 min read

By Shaoni Mukherjee

Sign up FREE

Build & scale AI models on low-cost cloud GPUs.

Get started Talk to an expert
Table of contents

Welcome to our tutorial on running Qwen2:7b with Ollama. In this guide, we'll use one of our favorite GPUs, A5000, offered by Paperspace.

A5000, powered by NVIDIA and built on ampere architecture, is a powerful GPU known to enhance the performance of rendering, graphics, AI, and computing workloads. A5000 offers 8192 CUDA cores and 24 GB of GDDR6 memory, providing exceptional computational power and memory bandwidth.

The A5000 supports advanced features like real-time ray tracing, AI-enhanced workflows, and NVIDIA's CUDA and Tensor cores for accelerated performance. With its robust capabilities, the A5000 is ideal for handling complex simulations, large-scale data analysis, and rendering high-resolution graphics.

Join our Discord Community

Get started Join the community

What is Qwen2-7b?

Qwen2 is the latest series of large language models, offering base models and instruction-tuned versions ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. The best thing about the model is that it is open-sourced on Hugging Face.

Compared to other open-source models like Qwen1.5, Qwen2 generally outperforms them across various benchmarks, including language understanding, generation, multilingual capability, coding, mathematics, and reasoning. The Qwen2 series is based on the Transformer architecture with enhancements like SwiGLU activation, attention QKV bias, group query attention, and an improved tokenizer adaptable to multiple languages and codes.

Further, Qwen2-72B is said to outperform Meta's Llama3-70B in all tested benchmarks by a wide margin.

Qwen Performance Comparison (Source)

A comprehensive evaluation of Qwen2-72B-Instruct across different benchmarks on various domains is shown in the image below. This model achieves a balance between enhanced capabilities and alignment with human values. Also, the model significantly outperforms Qwen1.5-72B-Chat on all benchmarks and demonstrates competitive performance compared to Llama-3-70B-Instruct. Even smaller Qwen2 models surpass state-of-the-art models of similar or larger sizes. Qwen2-7B-Instruct maintains advantages across benchmarks, particularly excelling in coding and Chinese-related metrics.

Qwen Performance Comparison (Source)

Available Model

Qwen2 is trained on a dataset comprising 29 languages, including English and Chinese. It has five parameter sizes: 0.5B, 1.5B, 7B, 57B and 72B. The context length of the 7B and 72B models has been expanded to 128k tokens.

The Qwen2 series comprises base and instruction-tuned models in five different sizes(Source)

Brief Introduction to Ollama

This article will show you the easiest method to run the Qwen2 using Ollama. Ollama is an open-source project offering a user-friendly platform for executing large language models (LLMs) on your personal computer or using a platform like Paperspace.

Ollama provides access to a diverse library of pre-trained models, offers effortless installation and setup across different operating systems, and exposes a local API for seamless integration into applications and workflows. Users can customize and fine-tune LLMs, optimize performance with hardware acceleration, and benefit from interactive user interfaces for intuitive interactions.

Run Qwen2-7b on Paperspace with Ollama

Bring this project to life

Before we start, let us first check the GPU specification.

NVIDIA A5000 Specification
Displaying the NVIDIA A5000 Specification

Next, open up a terminal, and we will start downloading Ollama. To download Ollama, paste the code below into the terminal and press enter.

curl -fsSL | sh

This one line of code will start downloading Ollama.

Once this is done, clear the screen, type the below command, and press enter to run the model.

ollama run qwen2:7b
In case of Error: could not connect to ollama app, is it running? Try running the below code, this will help start ollama service
ollama serve

and open up another terminal and try the command again.

or try enabling the systemctl service manually by running the below command.

sudo systemctl enable ollama
sudo systemctl start ollama

Now, we can run the model for inferencing.

ollama run qwen2:7b

This will download the layers of the model, also please note this is a quantized model. Hence the downloading process will not take much time.

Next, we will start using our model to answer few question and check how the model works.

  • Write a python code for a fibonacci series.
Python code for a fibonacci by Qwen2:7b model
Python code for a fibonacci generated by Qwen2:7b model
  • What is mindful eating?
  • Explain quantum Physics.

Response generated by Qwen2:7b

Please feel free to try out other model versions however 7b is the latest version and is available with Ollama.

The model demonstrates impressive performance in various aspects, matching the overall performance of GPT in comparison to an earlier model.

Test data, sourced from Jailbreak and translated into multiple languages, was used for evaluation. Notably, Llama-3 struggled with multilingual prompts and was thus excluded from comparison. The findings reveals that the Qwen2-72B-Instruct model achieves safety levels comparable to GPT-4 and significantly outperforms the Mistral-8x22B model according to significance testing (P-value).

The occurrence of harmful responses generated by large models across four categories of multilingual unsafe queries: Illegal Activity, Fraud, Pornography, and Privacy Violence (Source)


In conclusion, we can say that the Qwen2-72B-Instruct model, demonstrates its remarkable performance across various benchmarks. Notably, Qwen2-72B-Instruct surpasses previous iterations such as Qwen1.5-72B-Chat and even competes favorably with state-of-the-art models like GPT-4, as evidenced by significance testing results. Moreover, it significantly outperforms models like Mistral-8x22B, underscoring its effectiveness in ensuring safety across multilingual contexts.

Looking ahead, the rapid increase in the use of large language models like Qwen2 hints at a future where AI-driven applications and solutions becoming more and more sophisticated. These models have the potential to revolutionize various domains, including natural language understanding, generation, multilingual communication, coding, mathematics, and reasoning. With continued advancements and refinements in these models, we can anticipate even greater strides in AI technology, leading to the development of more intelligent and human-like systems that better serve society's needs while adhering to ethical and safety standards.

We hope you enjoyed reading this article!

Add speed and simplicity to your Machine Learning workflow today

Get startedTalk to an expert


Spread the word

Keep reading