---
title: Launching an OpenAI-Compatible Server with Llama 3 Using llama-cpp-python and Accessing It from Spring AI
tags: ["Python", "llama.cpp", "OpenAI", "Machine Learning", "MPS", "Llama 3", "Spring AI"]
categories: ["AI", "LLM", "llama.cpp"]
date: 2024-05-10T08:22:47Z
updated: 2024-05-10T09:18:33Z
---

> ⚠️ This article was automatically translated by OpenAI (gpt-4o).
> It may be edited eventually, but please be aware that it may contain incorrect information at this time.

As in "[Launching an OpenAI-Compatible Server Using Gemma Model with llama-cpp-python and Access It from Spring AI](/entries/778)", this time we will try Meta's [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B) in the same way.

**Table of Contents**
<!-- toc -->

### Installing llama-cpp-python

First, create a virtual environment.

```
mkdir -p $HOME/work/llm
cd $HOME/work/llm
python3 -m venv .venv
source .venv/bin/activate
```

Install llama-cpp-python together with its `server` extra.

```
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'
```

> ℹ️ If you encounter errors on an Apple Silicon Mac, try the setup described at https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md.

Support for `chat_format="llama-3"` was added in the following commit, so use v0.2.64 or later.

https://github.com/abetlen/llama-cpp-python/commit/8559e8ce88b7c7343004eeccb7333b806034b01c
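To confirm that the installed package meets this requirement, a small check along the following lines may help. This is a minimal sketch: the helper names `parse_version`, `supports_llama3`, and `installed_version` are made up for illustration, and the `0.2.64` threshold is simply taken from the note above.

```python
from importlib.metadata import PackageNotFoundError, version

# Threshold taken from the note above (first version with chat_format="llama-3").
MIN_VERSION = (0, 2, 64)


def parse_version(text: str) -> tuple:
    """Turn '0.2.64' into (0, 2, 64), ignoring any non-numeric suffix on a component."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def supports_llama3(installed: str) -> bool:
    """True if the given version string is at least MIN_VERSION."""
    return parse_version(installed) >= MIN_VERSION


def installed_version():
    """Return the installed llama-cpp-python version, or None if it is not installed."""
    try:
        return version("llama-cpp-python")
    except PackageNotFoundError:
        return None
```

Inside the virtual environment, `supports_llama3(installed_version() or "0")` should return `True` once a recent enough version is installed.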

### Downloading Llama 3

First, create a directory to store the model.

```
sudo mkdir -p /opt/models
sudo chown -R $USER /opt/models
```

llama-cpp-python requires models in GGUF format, so we will use the following model, which has already been converted to GGUF.

https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF

Download `Meta-Llama-3-8B-Instruct.Q4_K_M.gguf` to `/opt/models/`.
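The download can also be scripted. The following is a minimal sketch using only the Python standard library. It assumes that Hugging Face serves the raw file under `/resolve/main/` of the repository, and the function names `model_url` and `download_model` are hypothetical. The file is several gigabytes, so the body is streamed to disk rather than read into memory.

```python
import shutil
import urllib.request
from pathlib import Path

REPO = "QuantFactory/Meta-Llama-3-8B-Instruct-GGUF"
FILENAME = "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
MODEL_DIR = Path("/opt/models")


def model_url(repo: str = REPO, filename: str = FILENAME) -> str:
    # Assumption: raw files are served at /<repo>/resolve/main/<file>.
    return f"https://huggingface.co/{repo}/resolve/main/{filename}"


def download_model(dest_dir: Path = MODEL_DIR) -> Path:
    """Stream the GGUF file into dest_dir, skipping the download if it already exists."""
    dest = dest_dir / FILENAME
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted download is not mistaken for a model.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(model_url()) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.rename(dest)
    return dest
```

Calling `download_model()` should leave the file at `/opt/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf`, the path used by the server command in the next section.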

### Starting the OpenAI-Compatible Server

Start the server with the following command. You need to specify `--chat_format=llama-3`.

```
python3 -m llama_cpp.server --chat_format=llama-3 --model /opt/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --n_gpu_layers 1
```
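Loading the model takes a moment, so a script that talks to the server may want to wait until it responds. Here is a minimal sketch that polls `/v1/models`. That the server exposes this endpoint and returns a `data` list of model entries is an assumption based on the OpenAI API shape, and `fetch_models` and `wait_until_ready` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_models(base_url: str) -> dict:
    """GET /v1/models (assumed to be exposed by the server) and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(base_url="http://localhost:8000", attempts=30, delay=1.0,
                     fetch=fetch_models):
    """Poll until the server answers; return the model ids, or raise TimeoutError."""
    for _ in range(attempts):
        try:
            body = fetch(base_url)
            return [m["id"] for m in body.get("data", [])]
        except OSError:
            # URLError and connection-refused errors are OSError subclasses.
            time.sleep(delay)
    raise TimeoutError(f"server at {base_url} did not respond after {attempts} attempts")
```

The `fetch` parameter exists only so that the polling logic can be exercised without a running server.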

The API documentation is available at the following URL.

http://localhost:8000/docs

> The OpenAI ["Create chat completion" API](https://platform.openai.com/docs/api-reference/chat/create) requires the `model` parameter, but llama-cpp-python does not appear to require it.

#### Accessing with curl

```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "messages": [
      {"role": "user", "content": "Give me a joke."}
   ]
 }' | jq .
```

A joke was returned.

```json
{
  "id": "chatcmpl-755f1cc0-39e1-408c-b119-11034492b500",
  "object": "chat.completion",
  "created": 1715328366,
  "model": "/opt/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(Wait for it...)\n\nBecause it was two-tired!\n\nHope that made you laugh!",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 32,
    "total_tokens": 49
  }
}
```
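The same request can be sent from Python without any extra dependencies. The sketch below mirrors the curl call and pulls the assistant text out of a response shaped like the one above. The function names `build_chat_request`, `extract_reply`, and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str) -> dict:
    # No "model" field, matching the curl example above.
    return {"messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Return the assistant text of the first choice."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = "http://localhost:8000") -> dict:
    """POST the prompt to /v1/chat/completions and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `print(extract_reply(chat("Give me a joke.")))` should print only the joke text.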

#### Accessing with Spring AI

Let's access it from an app using [Spring AI](https://docs.spring.io/spring-ai/reference/index.html).
Since it is OpenAI-compatible, you can use the [OpenAI Chat Client](https://docs.spring.io/spring-ai/reference/api/clients/openai-chat.html) of Spring AI.

Here is a sample app.<br>
https://github.com/making/hello-spring-ai

```
git clone https://github.com/making/hello-spring-ai
cd hello-spring-ai
./mvnw clean package -DskipTests=true
java -jar target/hello-spring-ai-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy
```

```
$ curl localhost:8080
Here's one:

Why couldn't the bicycle stand up by itself?

(Wait for it...)

Because it was two-tired!

Hope that made you smile! Do you want to hear another one?
```

The app itself is written for OpenAI, but because llama-cpp-python exposes an OpenAI-compatible API, you can switch it to Llama 3 just by changing the properties.

If you don't care about compatibility with the OpenAI API and just want to use Llama 3, you can also use [Ollama](https://ollama.com/) via [spring-ai-ollama](https://docs.spring.io/spring-ai/reference/api/clients/ollama-chat.html).
