In the ever-evolving world of artificial intelligence, the recent launch of Meta's Llama 2 large language model has sparked interest among tech enthusiasts. A fascinating demonstration shows Llama 2 13B running on an Intel Arc GPU, an Intel iGPU, and a CPU, offering a glimpse of what these devices can do when paired with this advanced language model.
Llama 2 is an open-source language model released by Meta. It comes in several sizes, with some versions specialized for chat. It is pretrained on publicly available data and fine-tuned for specific purposes. Its performance is notable, and it is designed as a tool for developers to innovate in AI projects. Meta's underlying philosophy is to promote open collaboration in AI.
Features of Llama 2
- Availability:
- Llama 2 is an open-source language model.
- Anyone, whether they are individuals, creators, researchers, or businesses, can access it for free.
- It’s a part of Meta’s efforts to encourage openness in the field of artificial intelligence.
- The goal is to allow as many people as possible to test, innovate, and make improvements on the model.
- Versions and Models:
- Llama 2 isn’t just one model; it’s a collection of models.
- These models vary in size, with the smallest having 7 billion parameters and the largest having 70 billion parameters.
- One specific version, Llama-2-chat, is designed especially for conversations.
- Training and Fine-Tuning:
- The initial training of Llama 2 used publicly available data.
- For the chat-optimized version (Llama-2-chat), additional training was done. This training is known as supervised fine-tuning.
- The fine-tuning process also used reinforcement learning from human feedback (RLHF), applying methods such as rejection sampling and proximal policy optimization (PPO).
- Performance:
- Compared to other open-source chat models, Llama 2 and its variants come out ahead on most of the benchmarks Meta reports.
- It can potentially replace some proprietary models.
- The model has been assessed to ensure it provides useful and safe responses.
- Use Cases:
- Meta’s goal with releasing Llama 2 is to give developers a powerful AI tool.
- This tool can help in various AI-related projects, enabling them to experiment, innovate, and scale their ideas responsibly.
- Philosophy:
- Releasing Llama 2 is in line with Meta’s vision of having a more open AI ecosystem.
- Meta believes in collaboration and wants a wide community of developers and researchers to work together in the AI field.
You can apply to download the latest Llama 2 LLM over on the official Meta AI website.
Running Llama 2 on Intel Arc GPU, iGPU and CPU
The demonstration below involves running the Llama 2 model, in both its 13-billion-parameter and 7-billion-parameter versions, on the Intel Arc GPU. This was achieved using the llama.cpp library together with the CLBlast library, both of which are instrumental in accelerating matrix multiplications and other mathematical operations. However, it's important to note that the method is not fully optimized for Intel Arc devices, meaning it doesn't fully exploit their capabilities. Despite this, the speed and usability of the process are commendable.
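If you want to reproduce something like this, a useful first sanity check (not a step from the demonstration itself) is confirming that your Intel GPUs are visible to OpenCL, which CLBlast builds on. The clinfo utility can list them:

```
# List OpenCL platforms and their devices; the Arc GPU and the
# integrated GPU should each show up under an Intel platform.
clinfo -l
```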
Other articles you may find of interest on the subject of Llama 2:
- Llama 2 vs ChatGPT
- How to install a private Llama 2 AI assistant with local memory
- How to train Llama 2 by creating custom datasets
- LLaMA 2 vs Claude 2 vs GPT-4
- How to train Llama 2 using your own data
The process of compiling the libraries from source is admittedly long and tedious, but the results are evident on the Intel Arc GPU, the Intel integrated GPU, and the CPU. CMake, a crucial tool for this process, needs to be installed on the system. Additionally, the CLBlast library, which provides accelerated math routines, needs to be downloaded and built. CLBlast depends on OpenCL, which must be available on the system when it is built.
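As a rough sketch, the CLBlast build might look like the following; this assumes a working OpenCL installation and a recent CMake, and the exact commands may differ on your system:

```
# Fetch and build CLBlast, which supplies tuned OpenCL kernels for
# matrix multiplication and other linear algebra operations
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
cmake -B build -DCMAKE_BUILD_TYPE=Release   # CMake locates your OpenCL headers/libraries
cmake --build build --config Release        # on Windows this yields clblast.dll and clblast.lib
cmake --install build                       # optional: install so other builds can find it
```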
Once CLBlast is built, the clblast.dll and clblast.lib files are generated. These files are then used when compiling the llama.cpp library. The final executables are what run the Llama models on Arc GPUs and on integrated GPUs. It's important to add the folder containing clblast.dll to the PATH environment variable so the executables can locate it at run time.
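Here is a minimal sketch of the llama.cpp build step, assuming CLBlast was installed somewhere CMake can find it (otherwise point CMAKE_PREFIX_PATH at it):

```
# Build llama.cpp with its CLBlast backend enabled
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_CLBLAST=ON     # link the OpenCL-accelerated CLBlast path
cmake --build build --config Release  # produces the main inference executable
```

At run time, llama.cpp's OpenCL backend also honors the GGML_OPENCL_PLATFORM and GGML_OPENCL_DEVICE environment variables, which give you a way to steer it toward the Intel platform and a specific GPU when several OpenCL implementations are installed.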
The Llama 2 13-billion-parameter model, 8-bit quantized, can run on the Arc GPU and provides fast predictions. The 7-billion-parameter model can also run on the GPU and delivers even faster results. The models run on the integrated GPU as well, and while the speed is lower, it remains usable. Running on the CPU drives CPU utilization high, but it is still a viable option.
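For illustration, launching the quantized 13B model with layers offloaded to the GPU might look like this; the model filename and layer count here are assumptions for the sketch, not values from the demonstration:

```
# Prefer the Intel OpenCL platform, then run the 8-bit quantized 13B model;
# -ngl controls how many transformer layers are offloaded to the GPU
export GGML_OPENCL_PLATFORM=Intel
./build/bin/main -m models/llama-2-13b.ggmlv3.q8_0.bin \
    -p "Explain quantization in one paragraph." \
    -ngl 40
```

Lowering the -ngl value keeps more layers on the CPU, which is one way to trade GPU memory for speed when running on the iGPU or on cards with less VRAM.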
This exploration into running Llama 2 13B on an Intel Arc GPU, iGPU, and CPU is a testament to the exciting pace of advancement in artificial intelligence, and to the fact that powerful large language models can now run locally on an affordable computer.
Llama 2 comparisons and articles on coding and setting up the large language model locally:
- How to use Llama 2 with Python
- Llama 2 unrestricted version tested
- How to set up Llama 2 open source AI locally
- Llama 2 Retrieval Augmented Generation (RAG) tutorial
- Llama 1 vs Llama 2 AI architecture compared and tested