


We will run DeepSeek-R1-Distill-Qwen-32B-Japanese (a Japanese distilled model of DeepSeek-R1) released by CyberAgent with llama.cpp and access it via the OpenAI-compatible API.

The execution environment is a Mac with an Apple M4 Max running llama.cpp build 4589, as the server logs below show.

Installing llama.cpp

brew install llama.cpp
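
To confirm the installation, you can print the build information (assuming the --version flag, which recent llama.cpp builds support):

llama-server --version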

Starting the OpenAI-Compatible API Server

We will use mmnga/cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf, a GGUF-format conversion of the model. It will be downloaded automatically on the first run.

llama-server -hf mmnga/cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf --port 8000
build: 4589 (eb7cf15a) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
system info: n_threads = 12, n_threads_batch = 12, total_threads = 16

system_info: n_threads = 12 (n_threads_batch = 12) / 16 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 8000, http threads: 15
main: loading model
srv    load_model: loading model '/Users/toshiaki/Library/Caches/llama.cpp/mmnga_cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf_cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-Q4_K_M.gguf'
common_download_file: previous metadata file found /Users/toshiaki/Library/Caches/llama.cpp/mmnga_cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf_cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-Q4_K_M.gguf.json: {"etag":"\"a3e66d7c746c3f4bf60dd74c668d5a38-1000\"","lastModified":"Mon, 27 Jan 2025 11:33:22 GMT","url":"https://huggingface.co/mmnga/cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf/resolve/main/cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-Q4_K_M.gguf"}
curl_perform_with_retry: Trying to download from https://huggingface.co/mmnga/cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf/resolve/main/cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-Q4_K_M.gguf (attempt 1 of 3)...
llama_model_load_from_file_impl: using device Metal (Apple M4 Max) - 98303 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /Users/toshiaki/Library/Caches/llama.cpp/mmnga_cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-gguf_cyberagent-DeepSeek-R1-Distill-Qwen-32B-Japanese-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 32B Japanese
llama_model_loader: - kv   3:                           general.finetune str              = Japanese
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Deepseek Ai
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv  11:                               general.tags arr[str,3]       = ["japanese", "qwen2", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["ja"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = DeepSeek R1 Distill Qwen 32B Japanese
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 148848 'ÄĬ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
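
Once the model has finished loading, we can check that the server is up (llama-server exposes a /health endpoint):

curl -s http://localhost:8000/health
# should return {"status":"ok"} once the model is loaded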

Access the simple UI served at http://localhost:8000.


Paste the following prompt into the UI and submit.

Among A to D, three are honest and one is a liar. Who is the liar?

A: D is the liar.
B: I am not lying.
C: A is not lying.
D: B is the liar.


The logical reasoning streams out, and the answer is eventually returned. The correct answer is D.

Next, let's send the same prompt we tried in the UI to the OpenAI-compatible API using curl.

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Among A to D, three are honest and one is a liar. Who is the liar?\n\nA: D is the liar.\nB: I am not lying.\nC: A is not lying.\nD: B is the liar.\n"}
    ]
  }' | jq .

The following JSON is returned. The `<think></think>` part is the thought process.

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "\n\n**Answer: D**\n\n**Explanation:**\n1. **Organizing each person's statements**  \n   - A: D is the liar  \n   - B: I am honest  \n   - C: A is honest  \n   - D: B is the liar  \n\n2. **Searching for a contradiction-free assumption**  \n   - **If A is the liar**  \n     → D is honest (A's statement is false)  \n     → C's statement \"A is honest\" is false → C is also a liar → Contradiction (two liars)  \n\n   - **If B is the liar**  \n     → B's statement is false → B is a liar  \n     → D's statement \"B is the liar\" is true → D is honest  \n     → A's statement \"D is the liar\" is false → A is also a liar → Contradiction (two liars)  \n\n   - **If C is the liar**  \n     → C's statement \"A is honest\" is false → A is a liar  \n     → A's statement \"D is the liar\" is false → D is honest  \n     → D's statement \"B is the liar\" is true → B is a liar → Contradiction (two liars)  \n\n   - **If D is the liar**  \n     → D's statement \"B is the liar\" is false → B is honest  \n     → If B is honest, then \"I am not lying\" is true → B is honest  \n     → A's statement \"D is the liar\" is true → A is honest  \n     → C's statement \"A is honest\" is true → C is honest  \n     → **The only liar is D** (satisfies the conditions)  \n\n**Conclusion: D is the liar**",
        "role": "assistant"
      }
    }
  ],
  "created": 1738243717,
  "model": "g3.5-turbo",
  "system_fingerprint": "b4589-eb7cf15a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 1207,
    "prompt_tokens": 62,
    "total_tokens": 1269
  },
  "id": "chatcmpl-GyOOzEH08sntwpx8tio4HFDrqpFC7hTS",
  "timings": {
    "prompt_n": 62,
    "prompt_ms": 401.295,
    "prompt_per_token_ms": 6.4725,
    "prompt_per_second": 154.4998068752414,
    "predicted_n": 1207,
    "predicted_ms": 70933.127,
    "predicted_per_token_ms": 58.76812510356255,
    "predicted_per_second": 17.01602694041671
  }
}
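
The server also supports streaming. Adding "stream": true (a standard OpenAI API parameter) to the same request makes the tokens arrive as server-sent events (data: lines), which is what the Spring AI example below consumes:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {"role": "user", "content": "Among A to D, three are honest and one is a liar. Who is the liar?"}
    ]
  }'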

Accessing with Spring AI

Let's try accessing it from an application using Spring AI.
Since the server is OpenAI compatible, we can use Spring AI's OpenAI Chat Client as-is.

Here is the sample application.
https://github.com/making/hello-spring-ai

git clone https://github.com/making/hello-spring-ai
cd hello-spring-ai
./mvnw clean package -DskipTests=true
java -jar target/hello-spring-ai-0.0.1-SNAPSHOT.jar --spring.ai.openai.base-url=http://localhost:8000 --spring.ai.openai.api-key=dummy
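
The two --spring.ai.openai.* arguments point Spring AI's OpenAI client at the local llama-server. The same settings can also be placed in application.properties instead of being passed on the command line:

spring.ai.openai.base-url=http://localhost:8000
spring.ai.openai.api-key=dummy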

We will send the same prompt.

$ curl http://localhost:8080 -H "Content-Type: text/plain" -d "Among A to D, three are honest and one is a liar. Who is the liar?\n\nA: D is the liar.\nB: I am not lying.\nC: A is not lying.\nD: B is the liar.\n"

data:
data:
data:**Answer: D is the liar.**

data:**
data:
data:**Reasoning:**  
data:We verify each person's statements and derive a situation without contradictions.

data:  
data:- **If D is the liar:**  
data:  - D's statement "B is the liar" is false → B is honest.

data:  
data:  - If B is honest, then "I am not lying" is true.

data:  
data:  - C's statement "A is not lying" is true → A is also honest.

data:  
data:  - If A is honest, then "D is the liar" is true.

data:  
data:  - This means A, B, and C are honest, and D is the only liar, with no contradictions.

data:  
data:
data:Other cases (A, B, C being liars) result in **more than two liars**, making them invalid.

data:  
data:Thus, **D being the liar** is the only solution.