llama3.2-vision doesn't utilize my GPU. #8310

Open
blueApple12 opened this issue Jan 5, 2025 · 23 comments
Labels: bug (Something isn't working)

Comments

@blueApple12

What is the issue?

I bought a new PC with an RTX 4070 Super to do some AI tasks using Ollama, but when I run llama3.2-vision it doesn't utilize my GPU at all and only uses my CPU. llama3.2 does utilize my GPU, so why is that? Thank you.

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.5.4

blueApple12 added the bug label on Jan 5, 2025
@rick-github
Collaborator

Maybe not enough free VRAM on your system, depending on what else you are running. The output of nvidia-smi and server logs will aid in identifying the cause.
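
For reference, a minimal way to collect that information on Windows (assuming a default install, with ollama and the NVIDIA driver tools on the PATH) would be something like the following in PowerShell:

    # Driver/CUDA version, total and used VRAM, and the processes currently holding VRAM
    nvidia-smi

    # Models Ollama currently has loaded and whether each runs on CPU, GPU, or a split of the two
    ollama ps

The PROCESSOR column of ollama ps (for example "100% GPU", "100% CPU", or a percentage split) makes a partial offload easy to spot.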

@blueApple12
Author

This is my nvidia-smi output:
Sun Jan 5 18:15:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 On | N/A |
| 0% 42C P0 33W / 220W | 1350MiB / 12282MiB | 14% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

@rick-github
Collaborator

The server logs will aid in identifying the cause.
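
On a default Windows install the server log is typically written to %LOCALAPPDATA%\Ollama\server.log (rotated copies may sit alongside it), so one way to pull it up in PowerShell is:

    # Open the Ollama log directory in Explorer (contains server.log)
    explorer "$env:LOCALAPPDATA\Ollama"

    # Or print the tail of the current server log directly
    Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 200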

@blueApple12
Author

This is my server log:
2025/01/05 16:43:45 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\avish\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-05T16:43:45.920+02:00 level=INFO source=images.go:757 msg="total blobs: 12"
time=2025-01-05T16:43:45.926+02:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-05T16:43:45.929+02:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-05T16:43:45.930+02:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx cpu]"
time=2025-01-05T16:43:45.931+02:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-05T16:43:45.932+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-05T16:43:45.932+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-01-05T16:43:46.096+02:00 level=INFO source=amd_hip_windows.go:103 msg="AMD ROCm reports no devices found"
time=2025-01-05T16:43:46.096+02:00 level=INFO source=amd_windows.go:50 msg="no compatible amdgpu devices detected"
time=2025-01-05T16:43:46.099+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/01/05 - 16:43:59 | 200 | 500µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:43:59 | 200 | 2.5007ms | 127.0.0.1 | GET "/api/tags"
time=2025-01-05T16:44:53.262+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:44:53.331+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.4 GiB" free_swap="10.1 GiB"
time=2025-01-05T16:44:53.336+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:44:53.346+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59288"
time=2025-01-05T16:44:53.352+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:44:53.353+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:44:53.353+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:44:53.387+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-05T16:44:53.404+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:44:53.406+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59288"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-05T16:44:53.607+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:45:03.649+02:00 level=INFO source=server.go:594 msg="llama runner started in 10.30 seconds"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:45:09 | 200 | 16.5585361s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:09.804+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:19 | 200 | 9.9060809s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:36.380+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:41 | 200 | 5.0508025s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:41.512+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:45 | 500 | 4.0971502s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:47:36.667+02:00 level=INFO source=runner.go:662 msg="aborting completion request due to client closing the connection"
time=2025-01-05T16:47:38.948+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:47:43 | 200 | 4.9453625s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:47:43.887+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:47:52 | 200 | 8.9424866s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:08.430+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:14 | 200 | 5.8496372s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:14.287+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:27 | 200 | 13.5327677s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:45.398+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:51 | 200 | 5.8480718s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:51.241+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:49:08 | 200 | 17.0670151s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:49:41.721+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:49:47 | 200 | 5.3733708s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:49:47.151+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:14 | 500 | 27.8632648s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:50:22 | 200 | 997.4µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:50:22 | 200 | 26.5029ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:50:22.893+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:22 | 200 | 24.5118ms | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:50:24.739+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:40 | 200 | 15.6020101s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:51:35 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:51:35 | 200 | 62.5791ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:51:35.660+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="10.5 GiB"
time=2025-01-05T16:51:35.661+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11251539968 required="3.7 GiB"
time=2025-01-05T16:51:35.683+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.8 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:51:35.683+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T16:51:35.688+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 59523"
time=2025-01-05T16:51:35.695+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
time=2025-01-05T16:51:35.695+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:51:35.695+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:51:36.419+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:51:36.460+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:51:36.463+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59523"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
time=2025-01-05T16:51:36.702+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T16:51:38.959+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.26 seconds"
[GIN] 2025/01/05 - 16:51:38 | 200 | 3.364194s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/01/05 - 16:51:51 | 200 | 676.757ms | 127.0.0.1 | POST "/api/chat"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:51:57 | 200 | 1.1356145s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:52:11 | 200 | 5.1867467s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:52:20 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:52:20 | 200 | 15.4987ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:52:20.927+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:52:25.947+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0186925 model=C:\Users\avish.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-01-05T16:52:26.009+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.1 GiB"
time=2025-01-05T16:52:26.196+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2687383 model=C:\Users\avish.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-01-05T16:52:26.356+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.8 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:52:26.358+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=33 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.3 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:52:26.363+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 33 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59564"
time=2025-01-05T16:52:26.368+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:52:26.368+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:52:26.368+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:52:26.470+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:52:26.509+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:52:26.510+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59564"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T16:52:26.620+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1306.52 MiB
llm_load_tensors: CUDA0 model buffer size = 4090.98 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 48.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 558.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 71 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:52:35.902+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.53 seconds"
[GIN] 2025/01/05 - 16:52:35 | 200 | 14.9908729s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:52:44.743+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:52:54 | 200 | 10.2073533s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:53:23.080+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:53:35 | 200 | 12.800509s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:55:02.546+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="644.1 MiB"
time=2025-01-05T16:55:02.868+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11424337920 required="3.7 GiB"
time=2025-01-05T16:55:02.891+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.2 GiB" free_swap="19.5 GiB"
time=2025-01-05T16:55:02.891+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T16:55:02.896+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 59620"
time=2025-01-05T16:55:02.900+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:55:02.900+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:55:02.901+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:55:02.994+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:55:03.031+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:55:03.032+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59620"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T16:55:03.152+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T16:55:03.904+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.00 seconds"
[GIN] 2025/01/05 - 16:55:07 | 200 | 4.7508347s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:55:38 | 200 | 15.5506907s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:56:03 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:56:03 | 404 | 497.8µs | 127.0.0.1 | POST "/api/show"
[GIN] 2025/01/05 - 16:56:04 | 200 | 1.0677276s | 127.0.0.1 | POST "/api/pull"
[GIN] 2025/01/05 - 16:56:12 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:56:12 | 200 | 16.0004ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:56:12.092+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:56:12.138+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.2 GiB"
time=2025-01-05T16:56:12.485+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.4 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:56:12.488+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=34 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.5 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.5 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:56:12.492+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 34 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59649"
time=2025-01-05T16:56:12.497+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:56:12.497+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:56:12.497+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:56:12.583+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:56:12.618+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:56:12.619+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59649"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-05T16:56:12.748+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 34 repeating layers to GPU
llm_load_tensors: offloaded 34/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1189.49 MiB
llm_load_tensors: CUDA0 model buffer size = 4208.01 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 40.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 566.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 60 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:56:15.507+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.01 seconds"
[GIN] 2025/01/05 - 16:56:15 | 200 | 3.4331242s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:56:16.504+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:56:17 | 200 | 1.1631321s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:56:25.494+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:56:32 | 200 | 7.4890539s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:57:25 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:57:25 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:17:12 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:17:12 | 200 | 16.5005ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:17:12.973+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:17:13.029+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.5 GiB" free_swap="19.4 GiB"
time=2025-01-05T17:17:13.033+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:17:13.038+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59968"
time=2025-01-05T17:17:13.043+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:17:13.043+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:17:13.043+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:17:13.146+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:17:13.182+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:17:13.183+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59968"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T17:17:13.295+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1439.02 MiB
llm_load_tensors: CUDA0 model buffer size = 3958.48 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 56.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 550.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 82 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:17:16.304+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.26 seconds"
[GIN] 2025/01/05 - 17:17:16 | 200 | 3.3481249s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:17:18.345+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:17:19 | 200 | 690.9205ms | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T17:17:31.745+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:17:37 | 200 | 5.7424355s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:17:49 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:17:49 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-01-05T17:32:26.860+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=1 available=3969449984 required="2.9 GiB"
time=2025-01-05T17:32:26.881+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.6 GiB" free_swap="9.6 GiB"
time=2025-01-05T17:32:26.882+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[3.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.9 GiB" memory.required.partial="2.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[2.9 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T17:32:26.887+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 1 --port 60186"
time=2025-01-05T17:32:26.892+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:32:26.892+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:32:26.892+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:32:27.004+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:32:27.040+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:32:27.041+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60186"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-05T17:32:27.144+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 224.00 MiB
llama_new_context_with_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 256.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T17:32:28.148+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.26 seconds"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:32:28 | 200 | 1.9139322s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T17:32:28.741+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:32:28.779+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="1.1 GiB"
time=2025-01-05T17:32:29.126+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.6 GiB" free_swap="9.6 GiB"
time=2025-01-05T17:32:29.130+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:32:29.131+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 60190"
time=2025-01-05T17:32:29.138+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:32:29.138+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:32:29.138+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:32:29.154+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-05T17:32:29.171+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:32:29.172+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60190"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-05T17:32:29.389+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:32:38.426+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.29 seconds"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:32:50 | 200 | 22.0606032s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:44:48 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:44:48 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:44:58 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:44:58 | 200 | 16.9993ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:44:58.564+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11475161088 required="3.7 GiB"
time=2025-01-05T17:44:58.585+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.3 GiB" free_swap="17.9 GiB"
time=2025-01-05T17:44:58.585+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T17:44:58.590+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 63944"
time=2025-01-05T17:44:58.599+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:44:58.599+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:44:58.599+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:44:58.726+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:44:58.763+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:44:58.764+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63944"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-05T17:44:58.850+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T17:45:00.104+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.51 seconds"
[GIN] 2025/01/05 - 17:45:00 | 200 | 1.5937332s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/01/05 - 17:45:04 | 200 | 648.2546ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:45:11 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:11 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:45:20 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:20 | 200 | 16.4985ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:45:20.728+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:45:20.764+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.3 GiB"
time=2025-01-05T17:45:21.110+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.4 GiB" free_swap="18.0 GiB"
time=2025-01-05T17:45:21.112+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=35 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.6 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.6 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:45:21.118+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 35 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 63956"
time=2025-01-05T17:45:21.123+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:45:21.123+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:45:21.123+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:45:21.205+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:45:21.240+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:45:21.241+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63956"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-05T17:45:21.375+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1072.46 MiB
llm_load_tensors: CUDA0 model buffer size = 4325.04 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 32.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 574.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 49 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:45:24.635+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.51 seconds"
[GIN] 2025/01/05 - 17:45:24 | 200 | 3.922538s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:45:27.666+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:45:28 | 200 | 1.3313811s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:45:31 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:31 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:45:43 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:43 | 200 | 20.9989ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:45:43.977+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:45:43 | 200 | 16.001ms | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:45:56.150+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:46:33 | 200 | 37.3366449s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:46:36 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:46:36 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:49:35 | 200 | 500.9µs | 127.0.0.1 | GET "/api/version"
[GIN] 2025/01/06 - 17:54:18 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/06 - 17:54:18 | 200 | 16.4978ms | 127.0.0.1 | POST "/api/show"
time=2025-01-06T17:54:18.334+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-06T17:54:18.418+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.8 GiB" free_swap="18.0 GiB"
time=2025-01-06T17:54:18.421+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=36 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.8 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-06T17:54:18.426+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 36 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 57521"
time=2025-01-06T17:54:18.436+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-06T17:54:18.436+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-06T17:54:18.437+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-06T17:54:18.577+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-06T17:54:18.622+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-06T17:54:18.623+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:57521"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-06T17:54:18.688+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloaded 36/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 939.96 MiB
llm_load_tensors: CUDA0 model buffer size = 4457.54 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 24.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 582.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 38 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-06T17:54:22.199+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.76 seconds"
[GIN] 2025/01/06 - 17:54:22 | 200 | 3.8828603s | 127.0.0.1 | POST "/api/generate"
time=2025-01-06T17:54:26.611+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/06 - 17:54:27 | 200 | 1.042562s | 127.0.0.1 | POST "/api/chat"
time=2025-01-06T17:54:36.996+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/06 - 17:55:04 | 200 | 27.6056929s | 127.0.0.1 | POST "/api/chat"

@ChandlerHooley

I'm having this same issue as well. Latest version of Ollama and an NVIDIA GTX 1650 SUPER graphics card. (Yes, I know it isn't powerful; this is just for a POC.) Here are my logs from running "ollama serve" and then, in another window, "ollama run llama3.2-vision". If I can provide any other information that would help, please let me know.

2025/01/06 22:07:11 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\chand\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-06T22:07:11.022-06:00 level=INFO source=images.go:757 msg="total blobs: 11"
time=2025-01-06T22:07:11.023-06:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-01-06T22:07:11.158-06:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dd59afa0-5797-0eb5-41fa-a4e67e77623f library=cuda variant=v12 compute=7.5 driver=12.6 name="NVIDIA GeForce GTX 1650 SUPER" total="4.0 GiB" available="3.2 GiB"
[GIN] 2025/01/06 - 22:07:22 | 200 | 544.5µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/06 - 22:07:22 | 200 | 54.7107ms | 127.0.0.1 | POST "/api/show"
time=2025-01-06T22:07:22.747-06:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-06T22:07:22.793-06:00 level=INFO source=server.go:104 msg="system memory" total="63.7 GiB" free="42.7 GiB" free_swap="45.0 GiB"
time=2025-01-06T22:07:22.796-06:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[2.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-06T22:07:22.802-06:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\chand\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\chand\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\chand\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 58622"
time=2025-01-06T22:07:22.960-06:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-06T22:07:22.960-06:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-06T22:07:22.962-06:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-06T22:07:22.967-06:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-06T22:07:22.969-06:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2025-01-06T22:07:22.970-06:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:58622"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\chand\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-06T22:07:23.213-06:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-06T22:07:37.235-06:00 level=INFO source=server.go:594 msg="llama runner started in 14.27 seconds"
[GIN] 2025/01/06 - 22:07:37 | 200 | 14.5041464s | 127.0.0.1 | POST "/api/generate"

@rick-github
Collaborator

time=2025-01-05T16:43:46.099+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
time=2025-01-05T16:44:53.336+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"

When ollama started, there was 10.8G free VRAM. When it came time to load a model, something else was running and only 3.5G was free. The llama3.2-vision model won't fit, so it loads it into RAM.

time=2025-01-05T16:52:26.358+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=33 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.3 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"

The model is unloaded after 5 minutes and then a bit later another request comes in for the model. This time there is 10.5G available and ollama does a partial load (33 of 41 layers) into the GPU.

NAME                      ID              SIZE     PROCESSOR    UNTIL   
llama3.2-vision:latest    085a1fdae525    12 GB    100% GPU     Forever    

Your GPU is too small to host the entire model, and other GPU users are occasionally taking VRAM to the point where ollama can't even do a partial load.

@blueApple12
Author

blueApple12 commented Jan 7, 2025

So I don't have enough VRAM?

@rick-github
Collaborator

Correct.

@blueApple12
Author

Is there a way to get around this?

@rick-github
Collaborator

@blueApple12
Author

Is there any other way to use less VRAM, like a low-VRAM mode?

@rick-github
Collaborator

There are two components that take up VRAM - context and weights. The usual ways of reducing context size (num_ctx, OLLAMA_NUM_PARALLEL, OLLAMA_FLASH_ATTENTION) won't help because you are already using the minimum context. Other models (eg llama3.2:3b) come in a variety of quantizations which can be used to reduce the size of the weights. The default quant for llama3.2:3b is q4_K_M, which is 2G, but the size can be as low as 1.4G with the q2_K quant. Unfortunately llama3.2-vision doesn't offer anything smaller than q4_K_M at 7.9G. I haven't tried this, but in theory you could take the base model and quantize it yourself to something smaller. However, I don't think the tool that I usually use for quantizing models (llama.cpp) supports the llama3.2-vision architecture (mllama), so you'd need to find suitable tools.
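For example, something like this (untested) pulls the smaller quant through the REST API; the tag name is taken from the Ollama library listing and may change, so check ollama.com/library/llama3.2/tags first:

# untested sketch: pull a smaller quantization of llama3.2:3b via the REST API
# (assumes the default 127.0.0.1:11434 server address; the q2_K tag name is an
#  assumption based on the library tag list and may differ)
import requests

requests.post(
    "http://127.0.0.1:11434/api/pull",
    json={"model": "llama3.2:3b-instruct-q2_K", "stream": False},
    timeout=3600,
)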

One last alternative would be to force llama.cpp to load all layers into VRAM and then have the GPU overflow to RAM, rather than having ollama decide on the RAM allocation. This will maximize VRAM usage at the cost of a performance penalty for the layers residing in RAM. However, because you can almost fit the model in VRAM, only a few layers will spill into RAM, and the penalty might not be noticeable. You can force this by setting num_gpu to the number of layers (or really any number greater than or equal to the layer count). See here for ways to adjust num_gpu.
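As a rough, untested sketch, forcing the full offload through the REST API would look something like this (41 comes from the layers.model=41 line in your log; anything at or above the layer count has the same effect, and /set parameter num_gpu 41 in the interactive CLI is the equivalent):

# untested sketch: request full GPU offload by setting num_gpu in the request options
# (assumes the default 127.0.0.1:11434 server address)
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "llama3.2-vision",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": 41},  # >= layer count, so all layers are sent to the GPU
    },
    timeout=600,
)
print(resp.json()["response"])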

@rick-github
Collaborator

@ChandlerHooley

time=2025-01-06T22:07:11.158-06:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dd59afa0-5797-0eb5-41fa-a4e67e77623f library=cuda variant=v12 compute=7.5 driver=12.6 name="NVIDIA GeForce GTX 1650 SUPER" total="4.0 GiB" available="3.2 GiB"
time=2025-01-06T22:07:22.796-06:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[2.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"

Your GPU has 3.2G free. Just the projector (2.8G) and context space (656M) add up to more than this, so there is no way to run llama3.2-vision on your GPU, even with the num_gpu hack from above.

@blueApple12
Author

Why is my GPU so full? I just built this PC a week ago. Will the full output of nvidia-smi help identify what is taking all of the VRAM?

@rick-github
Collaborator

I'm not a Windows user so fine details of process usage escape me. Try this for help: https://saturncloud.io/blog/how-to-find-and-limit-gpu-usage-by-process-in-windows/#finding-gpu-usage-by-process

@blueApple12
Author

I really didn't understand this page; if someone can understand it and help me, that would be great.

Tue Jan 7 16:34:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 On | N/A |
| 30% 35C P5 15W / 220W | 895MiB / 12282MiB | 28% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1244 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 2748 C+G ...air\Corsair iCUE5 Software\iCUE.exe N/A |
| 0 N/A N/A 2836 C+G ...nt.CBS_cw5n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 5704 C+G ...al\Discord\app-1.0.9175\Discord.exe N/A |
| 0 N/A N/A 6336 C+G ...\Cef\CefSharp.BrowserSubprocess.exe N/A |
| 0 N/A N/A 11832 C+G ....0_x64__8wekyb3d8bbwe\XboxPcApp.exe N/A |
| 0 N/A N/A 12004 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 14244 C+G ...6.0_x64__cv1g1gvanyjgm\WhatsApp.exe N/A |
| 0 N/A N/A 15152 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 15476 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 21852 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 23444 C+G ...ces\Razer Central\Razer Central.exe N/A |
| 0 N/A N/A 23724 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 23812 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 24456 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 24876 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 27828 C+G ...x64__97hta09mmv6hy\Build\Lively.exe N/A |
| 0 N/A N/A 30528 C+G ... Synapse 3 Host\Razer Synapse 3.exe N/A |
| 0 N/A N/A 32416 C+G ...nr4m\radeonsoftware\AMDRSSrcExt.exe N/A |
| 0 N/A N/A 34800 C+G ...m\radeonsoftware\RadeonSoftware.exe N/A |
| 0 N/A N/A 35960 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 36928 C+G ...t.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 38648 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 41612 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 41720 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 42312 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 44232 C+G C:\Windows\System32\ShellHost.exe N/A |
| 0 N/A N/A 45884 C+G ...s\System32\ApplicationFrameHost.exe N/A |
| 0 N/A N/A 49344 C+G ...Programs\Microsoft VS Code\Code.exe N/A |
| 0 N/A N/A 51312 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
+-----------------------------------------------------------------------------------------+

@rick-github
Collaborator

Unfortunately this is not a very useful output, as it doesn't contain the VRAM usage and the process names are incomplete, so it's not possible to identify the large users of VRAM. But there may be low-hanging fruit. Does your machine have an integrated graphics processor? If so, it may be possible to set that as the default GPU for the system in the BIOS, so that when Windows starts it doesn't allocate VRAM from the 4070. The alternative is to set the preferred GPU on a program-by-program basis as discussed here.

@blueApple12
Author

I completely disabled the integrated graphics; could that be the cause? I thought Ollama might use the integrated graphics instead of my GPU.

@rick-github
Collaborator

ollama will not use integrated graphics; there is very little support for those types of GPUs. Enable it and make it the default.

@kreier

kreier commented Jan 9, 2025

Your VRAM should be sufficient. This is really strange. I found conflicting statements about your available VRAM in your logfile #8310 (comment), just a second apart:

time=2025-01-05T16:43:46.099+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
time=2025-01-05T16:44:53.336+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"

I observed a similar behavior to your 4070 Super with two GTX 1060 6GB cards. Starting llama3.2-vision runs 100% on the CPU (check with ollama ps after leaving with /bye). Then I started a similar-sized model, phi4, and it went 100% to the GPU, split between both graphics cards, and used 11GB. I tried a few others, and the vision model was the outlier.

Can you try other models (like phi4) that should fit into your VRAM, to narrow down this behavior? You have more than 10 GB free, and llama3.2-vision usually needs a little more than 9 GB, even though the files are only 7.9 GB. And even when it can't fit completely into VRAM, it should split some layers off and process them in regular RAM. With my 8 GB card under Windows I get the following result after running ollama, exiting it, and calling ollama ps:

mk@i3:~$ ollama ps
NAME                      ID              SIZE     PROCESSOR          UNTIL
llama3.2-vision:latest    38107a0cd119    12 GB    43%/57% CPU/GPU    3 minutes from now

I checked my logfile, and it reports layers.offload=7 where you got a zero. I don't know the reason yet:

time=2025-01-10T01:13:34.533+07:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-ef7243b2-74b9-5dfb-ed58-a2f776e2ae78 library=cuda variant=v12 compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3070 Ti" total="8.0 GiB" available="6.9 GiB"
time=2025-01-10T01:19:07.132+07:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=7 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.0 GiB" memory.required.partial="6.8 GiB" memory.required.kv="656.2 MiB" memo>

@kreier

kreier commented Jan 9, 2025

A similar behavior was noticed with 6GB graphics cards in November 2024: #7509. It works with my 8GB card, and the problem described here occurs even with more than 10 GB of available VRAM.

@rick-github
Collaborator

Your VRAM should be sufficient.

It's sufficient if there are no other processes using the GPU. Switching to integrated graphics will help.

I found conflicting statements about your available VRAM in your logfile #8310 (comment) just a second apart:

These are 67 seconds apart.

Starting llama3.2-vision runs 100% on the CPU (check with ollama ps after leaving with \bye). Then I started a similar sized model phi4 and it went 100% to the GPU,

Vision models have extra requirements that make it harder to fit them in limited VRAM as discussed in #7509 (comment).

@kreier

kreier commented Jan 11, 2025

Thanks @rick-github for the feedback and double-checking my comment. Sorry for the mistake, I should learn how to read the time!

I tested this scenario again, and I'm not sure llama3.2-vision will fit entirely into 12GB of VRAM. Using the integrated graphics for the display might be the only way, as pointed out by others above.

First I tried to run llama3.2-vision on the CPU only. To do this I set /set parameter num_gpu 0 after starting ollama, and then gave it a prompt to process. I checked the RAM usage afterwards with ollama ps and got a result of 11 GB, which is less than 12 GB, so a 3060 with 12GB might work. Surprisingly, when using the 8 GB GPU partially, the speed went down from 5.4 token/s to 4.5 token/s. The stated utilization from ollama ps was 43%/57% CPU/GPU, but I think this only relates to memory placement, not token generation speed. The GPU seems to be used only for the projector (see below) and the token generation is done entirely by the CPU.
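For anyone who wants to check that split programmatically rather than reading the ollama ps table, here is a small untested sketch against the same endpoints that appear in the server logs above (default 127.0.0.1:11434 address assumed):

# untested sketch: load the model CPU-only (num_gpu 0), then ask the server how
# it was placed; assumes the default 127.0.0.1:11434 address
import requests

base = "http://127.0.0.1:11434"

# a throwaway generation just to load the model with the chosen option
requests.post(
    f"{base}/api/generate",
    json={"model": "llama3.2-vision", "prompt": "hi", "stream": False,
          "options": {"num_gpu": 0}},
    timeout=600,
)

# /api/ps lists loaded models; size_vram shows how much of each sits in VRAM
for m in requests.get(f"{base}/api/ps", timeout=10).json().get("models", []):
    print(m["name"], m["size"], m.get("size_vram"))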

On another system with an 8GB card and a 6GB card I got llama3.2-vision almost entirely into VRAM; just 4% was still processed by the CPU. It resulted in 15 token/s. Following the advice given in this thread I switched the display output to my processor's iGPU and gained a few megabytes on the larger card - and finally got 100% GPU utilization. The responsiveness increased by 55% to 23.3 token/s! That's the reward for having all layers in the fast GPU memory!

Here I checked the combined utilization of the GPUs with nvtop. The larger card used 6.937 GiB and the smaller one 4.583 GiB; combined this equals 11.52 GiB. There is not much space left if this should fit into a 12 GB card. ollama ps even reported 13 GB RAM used. The distribution of processing power was heavily skewed: the big 8GB card used only 12% of its GPU power, while the smaller 6GB card got up to 84%.

One thing I still don't understand is how the combined memory requirements for the projector end up being something very close to 8GB, so any system with graphics cards smaller than 8GB might not even split the model to use the combined VRAM. It was already stated that the vision model is unique in this regard and needs one contiguous chunk of VRAM to operate. The logfile states:

projector.weights="1.8 GiB"
projector.graph="2.8 GiB"
layers.offload=41 layers.split=13,28
memory.available="[7.6 GiB 5.9 GiB]"
memory.required.full="10.6 GiB"
memory.required.kv="656.2 MiB"
memory.required.allocations="[7.6 GiB 5.4 GiB]"

I can't see how 1.8 + 2.8 results in something like 6.837 GiB, even if I add the 656 MiB for the KV cache. Can someone explain the math to me? When using the system with only one 8 GB card, the logfile (see above) states that only 7 of the 41 layers were offloaded to the GPU:

Jan 10 01:19:07 i3 ollama[230]: time=2025-01-10T01:19:07.132+07:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=7 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.0 GiB" memory.required.partial="6.8 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
Jan 10 01:19:07 i3 ollama[230]: time=2025-01-10T01:19:07.133+07:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 7 --mmproj /usr/share/ollama/.ollama/models/blobs/sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 4 --parallel 1 --port 35025"
...
Jan 10 01:19:17 i3 ollama[230]: llm_load_tensors: offloading 7 repeating layers to GPU
Jan 10 01:19:17 i3 ollama[230]: llm_load_tensors: offloaded 7/41 layers to GPU
Jan 10 01:19:17 i3 ollama[230]: llm_load_tensors:   CPU_Mapped model buffer size =  5679.33 MiB
Jan 10 01:19:17 i3 ollama[230]: llm_load_tensors:        CUDA0 model buffer size =   911.00 MiB

This seems to be the "minimum pieces of the model that have to be loaded in VRAM in their entirety for anything to run on the GPU" that @jessegross mentioned in issue #7509 on November 6, 2024. #7509 (comment)
