LLaVA is an open-source chatbot/assistant trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. So it is able to take in an image and a prompt and answer questions about the image (this is the instruction-following part).
An interesting part of the training process is that each image needs some text
to go with it, but how do we get that text?
For LLaVA this text was generated by a GPT model (a text-only GPT-4/ChatGPT).
Keep in mind that this is a text-only GPT, so it cannot process the image
itself.
There are annotated data sets like COCO where humans have provided annotations for the images. Each image is described in text format: a caption plus bounding boxes (coordinates describing where things are in the image) and what they contain. So we have an image, its caption, and a description of the image with bounding boxes, and with this textual description the text-only LLM can "understand" the image.
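To make this concrete, here is a rough sketch (not the actual LLaVA data-generation pipeline; the caption, labels and coordinates are made up) of how a caption plus bounding boxes could be flattened into plain text for a text-only GPT:

caption = "An astronaut standing next to a flag on the moon."
boxes = [
    ("astronaut", (0.32, 0.18, 0.55, 0.92)),   # (label, normalized x1, y1, x2, y2)
    ("flag",      (0.60, 0.10, 0.75, 0.60)),
]

# The GPT never sees pixels, only this textual description of the image.
context = caption + "\n" + "\n".join(
    f"{label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
    for label, (x1, y1, x2, y2) in boxes
)
print(context)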
The most recent version of LLaVA is LLaVA v1.6/LLaVA-Next. And the paper for 1.5 can be found here.
LLaVA has an LLM, a Vision Transformer (CLIP with ViT-L/14, i.e. ViT Large using 14x14 patches), and adds a trainable projection layer (a multi-layer projector in LLaVA-1.5). The projection layer is used to project/transform the patch embeddings into the token embedding space.
So a ViT will produce a patch embedding for the image, and the LLM will produce token embeddings:
+-------------+       +----------------+
| Text input  |------>|     Vicuna     |---> token embeddings
+-------------+       +----------------+
+-------------+       +----------------+
| Image input |------>|    ViT-L/14    |---> patch embeddings
+-------------+       +----------------+
But these are embeddings from different embedding spaces, so we need to project them into the same space. This is done by a trainable projection layer (W):
Zᵥ = the patch embeddings from the vision encoder
Hᵥ = W · Zᵥ
Hᵥ = the patch embeddings, now in the same space as the token embeddings
W  = the trainable projection matrix
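As a minimal sketch (shapes only, random values), assuming 576 patch embeddings of dimension 1024 from the vision encoder and a 4096-dimensional token embedding space, the original linear projection would look like this (LLaVA-1.5 replaces the single matrix with a small MLP):

import torch

num_patches, d_vision, d_model = 576, 1024, 4096   # 24x24 patches, CLIP hidden size, LLM hidden size

Z_v = torch.randn(num_patches, d_vision)   # patch embeddings from the vision encoder
W   = torch.randn(d_model, d_vision)       # trainable projection matrix

H_v = Z_v @ W.T                            # Hᵥ = W·Zᵥ, now in the token embedding space
print(H_v.shape)                           # torch.Size([576, 4096])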
Now, during training the input data set consists of pairs of images and text inputs. At inference time there might not be any text; we could just pass an image and have the model generate a response for it, perhaps describing what it sees in the image. But we could also pass a prompt along with the image and ask a specific question about the image.
For example, we can pass the following image to llava-cli, which is part of llama.cpp, and it will describe it:
$ ~/work/ai/llama.cpp/llava-cli --no-display-prompt --log-disable --n-gpu-layers 25 -m ~/work/ai/llava-v1.5-7b/ggml-model-f16.gguf --mmproj ~/work/ai/llava-v1.5-7b/mmproj-model-f16.gguf --image ~/work/ai/learning-ai/notes/apollo11.jpg
The image features a man wearing a white space suit, standing on the moon and posing in front of an American flag. He is carrying several items in his suit, including a backpack and two handheld objects. There are also a couple of additional smaller figures visible in the distance, likely also part of the same scene. The overall atmosphere suggests that this picture was taken during a historical moon landing event or space exploration mission, capturing the moment as mankind achieved another milestone in its cosmic journey.
And we can also pass in a prompt along with the image and ask a specific question:
$ ~/work/ai/llama.cpp/llava-cli --no-display-prompt --log-disable --n-gpu-layers 25 -m ~/work/ai/llava-v1.5-7b/ggml-model-f16.gguf --mmproj ~/work/ai/llava-v1.5-7b/mmproj-model-f16.gguf --image ~/work/ai/learning-ai/notes/apollo11.jpg -p "Is there a banan in the image?"
No, there is no banana in the image.
During training, the first stage trains only the projection layer W, while the ViT and LLM are frozen (not updated). So this stage is mostly about enabling the patch embeddings to be projected into the same space as the token embeddings. In the second stage the projector is trained together with the language model, which is the instruction-tuning stage.
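As a minimal sketch of the two-stage setup (my own illustration with tiny stand-in modules, not LLaVA's actual training code), the freezing could look like this:

import torch.nn as nn

vit       = nn.Linear(1024, 1024)   # stand-in for the CLIP vision encoder
projector = nn.Linear(1024, 4096)   # stand-in for the projection layer W
llm       = nn.Linear(4096, 32000)  # stand-in for the language model

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: feature alignment - only the projector is updated.
set_trainable(vit, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: instruction tuning - the projector and the LLM are updated, the ViT stays frozen.
set_trainable(llm, True)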
First we clone https://huggingface.co/liuhaotian/llava-v1.5-7b, which is the instruction-tuned model. Recall that it was trained on prompts containing a textual description of an image followed by user/assistant interactions, together with the images themselves. This model contains the trained projector.
I did the following in the directory above my checked out llama.cpp directory. We need to checkout the LLaVA model:
$ git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
And we need the Vision Transformer (ViT) model:
$ git clone https://huggingface.co/openai/clip-vit-large-patch14-336
Create a Python virtual environment and install the required packages:
$ python3.11 -m venv llava-venv
$ source llava-venv/bin/activate
(llava-venv) $ pip install torch numpy gguf transformers pillow sentencepiece
Then we can run the llava-surgery.py script:
(llava-venv) $ python examples/llava/llava-surgery.py -m ../llava-v1.5-7b/
Done!
Now you can convert ../llava-v1.5-7b/ to a regular LLaMA GGUF file.
Also, use ../llava-v1.5-7b//llava.projector to prepare a llava-encoder.gguf file.
What this script does is look up the pretrained PyTorch weight files
(the last one), which in my case is pytorch_model-00002-of-00002.bin. Hmm, could
this not be looked up using pytorch_model.bin.index.json instead?
The following tensors are retrieved and stored:
"model.mm_projector.0.bias": "pytorch_model-00002-of-00002.bin",
"model.mm_projector.0.weight": "pytorch_model-00002-of-00002.bin",
"model.mm_projector.2.bias": "pytorch_model-00002-of-00002.bin",
"model.mm_projector.2.weight": "pytorch_model-00002-of-00002.bin",
Then torch.save (which uses the pickle format) is used to save the
projector weights in a file called llava.projector:
torch.save(projector, f"{args.model}/llava.projector")
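Roughly, the extraction could be sketched like this (my own approximation; it uses pytorch_model.bin.index.json to locate the shard, which is the variation I wondered about above rather than what llava-surgery.py actually does):

import json
import torch

model_dir = "../llava-v1.5-7b"

# Find which shard(s) contain the projector tensors via the index file.
with open(f"{model_dir}/pytorch_model.bin.index.json") as f:
    weight_map = json.load(f)["weight_map"]
shards = {fname for name, fname in weight_map.items() if "mm_projector" in name}

projector = {}
for shard in shards:
    weights = torch.load(f"{model_dir}/{shard}", map_location="cpu")
    projector.update({k: v for k, v in weights.items() if "mm_projector" in k})

torch.save(projector, f"{model_dir}/llava.projector")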
$ cd fundamentals/python && source fund/bin/activate
(fund) $ python src/list-pytorch-model.py
model.mm_projector.0.weight: torch.Size([4096, 1024])
model.mm_projector.0.bias: torch.Size([4096])
model.mm_projector.2.weight: torch.Size([4096, 4096])
model.mm_projector.2.bias: torch.Size([4096])
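Roughly, something like the following (a sketch, not necessarily the actual list-pytorch-model.py) produces that output:

import torch

weights = torch.load("../llava-v1.5-7b/pytorch_model-00002-of-00002.bin", map_location="cpu")
for name, tensor in weights.items():
    if "mm_projector" in name:
        print(f"{name}: {tensor.size()}")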
The surgery script also removes these weights from the model file and saves it, so we need to revert the changes in ../llava-v1.5-7b/ before we can run the script again. So at this point pytorch_model-00002-of-00002.bin does not contain the tensor weights related to the projector.
The llava.projector file contains the projector tensors, which we will use with the
image encoder so that it can convert/transform the image embeddings into the
same space as the text embeddings.
The image encoder also has to be in GGUF format, so the following command converts it:
(llava-venv) $ python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
gguf: This GGUF file is for Little Endian only
Projector tensors added
...
Done. Output file: ../llava-v1.5-7b/mmproj-model-f16.gguf
So that is the ViT with the addition of the projector tensors, converted to
GGUF format which is now in mmproj-model-f16.gguf
Then we need to convert the language model part of LLaVA (the one we removed the projector tensors from) to GGUF format:
(llava-venv) $ python ./convert.py ../llava-v1.5-7b
Loading model file ../llava-v1.5-7b/pytorch_model-00001-of-00002.bin
Loading model file ../llava-v1.5-7b/pytorch_model-00001-of-00002.bin
Loading model file ../llava-v1.5-7b/pytorch_model-00002-of-00002.bin
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('../llava-v1.5-7b'))
Found vocab files: {'tokenizer.model': PosixPath('../llava-v1.5-7b/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': None}
Loading vocab file '../llava-v1.5-7b/tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'pad': 0}, add special tokens {'bos': True, 'eos': False}>
...
Writing ../llava-v1.5-7b/ggml-model-f16.gguf, format 1
Ignoring added_tokens.json since model matches vocab size without it.
gguf: This GGUF file is for Little Endian only
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type pad to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
...
Wrote ../llava-v1.5-7b/ggml-model-f16.gguf
So the removal of the projector tensors from the model confused me somewhat: I understand that they need to be added to the ViT model, but I don't understand why they need to be removed from the LLaMA model. Without removing them we would run into an error when converting the model:
(llava-venv) $ python ./convert.py ../llava-v1.5-7b
...
model.layers.31.post_attention_layernorm.weight -> blk.31.ffn_norm.weight | F16 | [4096]
model.norm.weight -> output_norm.weight | F16 | [4096]
Traceback (most recent call last):
File "/home/danielbevenius/work/ai/llama.cpp/./convert.py", line 1483, in <module>
main()
File "/home/danielbevenius/work/ai/llama.cpp/./convert.py", line 1469, in main
model = convert_model_names(model, params, args.skip_unknown)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danielbevenius/work/ai/llama.cpp/./convert.py", line 1206, in convert_model_names
raise Exception(f"Unexpected tensor name: {name}. Use --skip-unknown to ignore it (e.g. LLaVA)")
Exception: Unexpected tensor name: model.mm_projector.0.weight. Use --skip-unknown to ignore it (e.g. LLaVA)
This is because these tensors are not defined in gguf-py/gguf/tensor_mapping.py,
so they will not be recognized, hence the error. I think it would be alright to
skip the removal of the projector tensors from the LLaMA model and instead pass
the --skip-unknown flag to the convert.py script.
Using --skip-unknown as suggested seems to work:
$ python ./convert.py ../llava-v1.5-7b --skip-unknown
Perhaps this could be changed, as it would be nice to not have to modify the original model files.
Now, we can pass the following image to llava-cli
and it will describe it:
(llava-venv) $ ./llava-cli --no-display-prompt --log-disable --n-gpu-layers 25 -m ../llava-v1.5-7b/ggml-model-f16.gguf --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf --image apollo11.jpg
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from ../llava-v1.5-7b/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 595.53 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 595.53 MB (377 tensors)
clip_model_load: compute allocated memory: 36.18 MB
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../llava-v1.5-7b/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 12.55 GiB (16.00 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 25 repeating layers to GPU
llm_load_tensors: offloaded 25/33 layers to GPU
llm_load_tensors: CPU buffer size = 12853.02 MiB
llm_load_tensors: CUDA0 buffer size = 9650.78 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 224.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 800.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 12.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 171.60 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 167.20 MiB
llama_new_context_with_model: graph splits (measure): 5
encode_image_with_clip: image encoded in 84.79 ms by CLIP ( 0.15 ms per image patch)
In the image, a person wearing a white and red space suit stands on top of a moon surface. They have their back turned to the American flag that is waving nearby. The person appears to be an astronaut or possibly an alien character in an iconic scene from science fiction movies. They are holding a suitcase, which could potentially be carrying supplies for their journey or a prop from the movie set.
llama_print_timings: load time = 6637.59 ms
llama_print_timings: sample time = 56.21 ms / 87 runs ( 0.65 ms per token, 1547.82 tokens per second)
llama_print_timings: prompt eval time = 5576.01 ms / 616 tokens ( 9.05 ms per token, 110.47 tokens per second)
llama_print_timings: eval time = 16764.83 ms / 87 runs ( 192.70 ms per token, 5.19 tokens per second)
llama_print_timings: total time = 27646.54 ms / 703 tokens
First clone the LLaVA 1.6 model:
$ git clone -v --progress --depth 1 --single-branch --branch main https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
One thing to note about this model is that it includes the vision part in the
model files, in contrast to LLaVA-1.5 where the vision part is in a separate
model that we checked out. So the surgery script for this version will extract
the vision tensors as well.
Then we run the llava-surgery-v2.py script:
$ python examples/llava/llava-surgery-v2.py -C -m ../llava-v1.6-vicuna-7b/
Then copy the following files to a new directory:
$ mkdir vit
$ cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
$ cp ../llava-v1.6-vicuna-7b/llava.projector vit/
$ curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
$ python ./examples/llava/convert-image-encoder-to-gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
Then we can convert the model to gguf format:
$ python ./convert.py ../llava-v1.6-vicuna-7b/
And finally we can run the llava-cli using the 1.6 model version:
~/work/ai/llama.cpp/llava-cli --no-display-prompt --log-disable --n-gpu-layers 25 -m ~/work/ai/llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj ~/work/ai/llama.cpp/vit/mmproj-model-f16.gguf --image ~/work/ai/learning-ai/notes/apollo11.jpg
...
The image shows an astronaut standing on the surface of the moon, looking towards the camera. He is wearing a white space suit with the American flag patch visible on his chest, and he has a backpack strapped to his shoulders. In front of him stands a small wooden pole with an American flag attached to it. This scene depicts a historical moment from the Apollo missions when astronauts planted flags on the moon as part of their mission objectives. The environment around them is barren and rocky, characteristic of the moon's surface.
This is a Mistral 7B base model augmented with the LLaVA-1.5 architecture, in contrast to the version above which was based on Vicuna (recall that Vicuna is a LLaMA base model fine-tuned for chat on conversations collected from ShareGPT).
First clone BakLLaVA-1:
$ git clone https://huggingface.co/SkunkworksAI/BakLLaVA-1
Then we run the llava-surgery.py script:
(llava-venv) $ python examples/llava/llava-surgery.py -m ../BakLLaVA-1/
Done!
Now you can convert ../BakLLaVA-1/ to a regular LLaMA GGUF file.
Also, use ../BakLLaVA-1//llava.projector to prepare a llava-encoder.gguf file.
Then we convert the vision model to gguf format:
(llava-venv) $ python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../BakLLaVA-1/llava.projector --output-dir ../BakLLaVA-1
...
Done. Output file: ../BakLLaVA-1/mmproj-model-f16.gguf
Then we convert the BakLLaVA-1 to gguf format:
$ python ./convert.py ../BakLLaVA-1
...
Wrote ../BakLLaVA-1/ggml-model-f16.gguf
And then we can run the llava-cli using the BakLLaVA-1 model:
(llava-venv) $ ~/work/ai/llama.cpp/llava-cli --no-display-prompt --log-disable --n-gpu-layers 25 -m ~/work/ai/BakLLaVA-1/ggml-model-f16.gguf --mmproj ~/work/ai/BakLLaVA-1/mmproj-model-f16.gguf --image ~/work/ai/learning-ai/notes/apollo11.jpg
...
The image is a photograph of an astronaut standing on the surface of the Moon during the Apollo 11 mission.
The projector type is an option that is available in the convert-image-encoder-to-gguf.py script:
(llava-venv) $ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] -m MODEL_DIR [--use-f32] [--text-only] [--vision-only]
[--clip_model_is_vision] [--llava-projector LLAVA_PROJECTOR]
[--projector-type {mlp,ldp}]
[--image-mean IMAGE_MEAN IMAGE_MEAN IMAGE_MEAN]
[--image-std IMAGE_STD IMAGE_STD IMAGE_STD] [-o OUTPUT_DIR]
[--image_mean IMAGE_MEAN [IMAGE_MEAN ...]]
[--image_std IMAGE_STD [IMAGE_STD ...]]
options:
-h, --help show this help message and exit
-m MODEL_DIR, --model-dir MODEL_DIR
Path to model directory cloned from HF Hub
--llava-projector LLAVA_PROJECTOR
Path to llava.projector file. If specified, save an image encoder for LLaVA models.
--projector-type {mlp,ldp}
Type of projector. Possible values: mlp, ldp
...
https://arxiv.org/pdf/2402.03766.pdf
This section will step through the llava-cli example to understand how it works.
First we will build the llama.cpp project with debugging symbols enabled and with CUDA support:
$ cmake -S . -B build -DGGML_CUDA=On -DCMAKE_BUILD_TYPE=Debug
$ cmake --build build
Let's try running the example first to see that it works as expected:
$ ./build/bin/llama-llava-cli -m models/vicuna-7b-q5_k.gguf --mmproj models/mmproj-vicuna7b-f16.gguf --image ~/work/ai/learning-ai/notes/apollo11.jpg -c 4096 -ngl 15
This will output the following (cleaned up from debugging output):
The image you've shared depicts an astronaut on the surface of the moon,
standing next to a flag of the United States. This photograph captures a
significant moment in history, as the Apollo missions were the first time that
humans walked on the moon and planted the flag of the United States. The image
is often remembered as a symbol of national pride and the triumph of human
ingenuity. It was taken on July 20, 1969, during the Apollo 11 mission.
The actual output was the following, which I'm saving here so that I can later look into where these messages are coming from:
The image you've shared depicts an astronaut on the surface of the moon, standing next to a flagggml_gallocr_needs_realloc: src 1 (KQ_mask) of node kq_soft_max_ext-0 is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
of the United States. This photograph captures a significant moment in history, as the Apollo missions were the first time that humans walked on the moon andggml_gallocr_needs_realloc: node kq-0 is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
planted the flag of the United States. The image is often remembered as a symbol of national pride and the triumph of human ingenuity. It was takenggml_gallocr_needs_realloc: node kq-0 is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
on July 20, 1969, during the Apollo 11 mission.
llama_perf_context_print: load time = 36031.05 ms
llama_perf_context_print: prompt eval time = 30555.63 ms / 2920 tokens ( 10.46 ms per token, 95.56 tokens per second)
llama_perf_context_print: eval time = 54394.93 ms / 109 runs ( 499.04 ms per token, 2.00 tokens per second)
llama_perf_context_print: total time = 90945.80 ms / 3029 tokens
Great, so let's start stepping through the code:
$ gdb --args ./build/bin/llama-llava-cli -m models/vicuna-7b-q5_k.gguf --mmproj models/mmproj-vicuna7b-f16.gguf --image ~/work/ai/learning-ai/notes/apollo11.jpg -c 4096 -ngl 15
(gdb) br llava-cli.cpp:273
Breakpoint 1 at 0x5e566: file /home/danbev/work/ai/llama.cpp/examples/llava/llava-cli.cpp, line 273.
Skipping the parsing of command line arguments and some checks we get to:
auto * model = llava_init(&params);
This will load the language model which in our case is the Vicuna 7B model. This uses common.cpp to load and is something we have gone through before.
static struct llama_model * llava_init(gpt_params * params) {
llama_backend_init();
llama_numa_init(params->numa);
llama_model_params model_params = llama_model_params_from_gpt_params(*params);
llama_model * model = llama_load_model_from_file(params->model.c_str(), model_params);
if (model == NULL) {
LOG_ERR("%s: unable to load model\n" , __func__);
return NULL;
}
return model;
}
After that we have a check to see if the prompt contains an embedded base64-encoded image (I think, I need to look into this more and try it out), which is not the case for this session, so the else block will be executed:
if (prompt_contains_image(params.prompt)) {
auto * ctx_llava = llava_init_context(&params, model);
auto * image_embed = load_image(ctx_llava, &params, "");
// process the prompt
process_prompt(ctx_llava, image_embed, &params, params.prompt);
llama_perf_context_print(ctx_llava->ctx_llama);
llava_image_embed_free(image_embed);
ctx_llava->model = NULL;
llava_free(ctx_llava);
} else {
for (auto & image : params.image) {
auto * ctx_llava = llava_init_context(&params, model);
auto * image_embed = load_image(ctx_llava, &params, image);
if (!image_embed) {
LOG_ERR("%s: failed to load image %s. Terminating\n\n", __func__, image.c_str());
return 1;
}
// process the prompt
process_prompt(ctx_llava, image_embed, &params, params.prompt);
llama_perf_context_print(ctx_llava->ctx_llama);
llava_image_embed_free(image_embed);
ctx_llava->model = NULL;
llava_free(ctx_llava);
}
}
So we can see that params.image
is in fact a vector so we can supply multiple
images on the command line:
(gdb) ptype params.image
type = std::vector<std::string>
Next a llava_context
will be initialized.
static struct llava_context * llava_init_context(gpt_params * params, llama_model * model) {
const char * clip_path = params->mmproj.c_str();
auto prompt = params->prompt;
if (prompt.empty()) {
prompt = "describe the image in detail.";
}
auto ctx_clip = clip_model_load(clip_path, /*verbosity=*/ 1);
llama_context_params ctx_params = llama_context_params_from_gpt_params(*params);
ctx_params.n_ctx = params->n_ctx < 2048 ? 2048 : params->n_ctx; // we need a longer context size to process image embeddings
llama_context * ctx_llama = llama_new_context_with_model(model, ctx_params);
if (ctx_llama == NULL) {
LOG_ERR("%s: failed to create the llama_context\n" , __func__);
return NULL;
}
auto * ctx_llava = (struct llava_context *)malloc(sizeof(llava_context));
ctx_llava->ctx_llama = ctx_llama;
ctx_llava->ctx_clip = ctx_clip;
ctx_llava->model = model;
return ctx_llava;
}
Notice that we did not specify a prompt on the command line so the default prompt
will be used which is "describe the image in detail.".
Next clip_model_load
will be called which is defined in clip.cpp:
struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
struct ggml_context * meta = NULL;
struct gguf_init_params params = {
/*.no_alloc = */ true,
/*.ctx = */ &meta,
};
So first a ggml_context pointer named meta is created and used to initialize
gguf_init_params, which is then passed to gguf_init_from_file:
struct gguf_context * ctx = gguf_init_from_file(fname, params);
(gdb) p *ctx
$22 = {header = {magic = "GGUF", version = 2, n_tensors = 378, n_kv = 25},
kv = 0x555556b33520, infos = 0x555555b602b0, alignment = 32, offset = 22112,
size = 624429056, data = 0x0}
This image encoder ("image encoder for LLaVA") has 378 tensors, which will be
iterated over:
std::map<enum ggml_type, uint32_t> n_type;
for (int i = 0; i < n_tensors; i++) {
enum ggml_type type = gguf_get_tensor_type(ctx, i);
n_type[type]++;
}
enum ggml_type gguf_get_tensor_type(const struct gguf_context * ctx, int i) {
return ctx->infos[i].type;
}
And infos is of type struct gguf_tensor_info:
(gdb) ptype *ctx->infos
type = struct gguf_tensor_info {
gguf_str name;
uint32_t n_dims;
uint64_t ne[4];
ggml_type type;
uint64_t offset;
const void *data;
size_t size;
}
(gdb) p ctx->infos[0]
$28 = {name = {n = 9, data = 0x555555b4afe0 "mm.0.bias"}, n_dims = 1,
ne = {4096, 1, 1, 1}, type = GGML_TYPE_F32, offset = 0, data = 0x0, size = 0}
The above for loop is only used to count the number of tensors of each type.
(gdb) until 1080
Actually this whole block is just for logging, so let's skip it, but I'll show the output here:
(gdb) until 1109
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = vit-large336-custom
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_grid_pinpoints arr[i32,10] = [336, 672, 672, 336, 672, 672, 1008, ...
clip_model_load: - kv 17: clip.vision.image_crop_resolution u32 = 224
clip_model_load: - kv 18: clip.vision.image_aspect_ratio str = anyres
clip_model_load: - kv 19: clip.vision.image_split_resolution u32 = 224
clip_model_load: - kv 20: clip.vision.mm_patch_merge_type str = spatial_unpad
clip_model_load: - kv 21: clip.vision.mm_projector_type str = mlp2x_gelu
clip_model_load: - kv 22: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 23: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 24: clip.use_gelu bool = false
clip_model_load: - type f32: 236 tensors
clip_model_load: - type f16: 142 tensors
clip_model_load (fname=0x555555b548e0 "models/mmproj-vicuna7b-f16.gguf", verbosity=1)
at /home/danbev/work/ai/llama.cpp/examples/llava/clip.cpp:1109
clip_ctx * new_clip = new clip_ctx{};
The clip_ctx
struct is defined in clip.cpp:
struct clip_ctx {
bool has_text_encoder = false;
bool has_vision_encoder = false;
bool has_llava_projector = false;
bool has_minicpmv_projector = false;
int minicpmv_version = 2;
struct clip_vision_model vision_model;
projector_type proj_type = PROJECTOR_TYPE_MLP;
float image_mean[3];
float image_std[3];
bool use_gelu = false;
int32_t ftype = 1;
bool has_class_embedding = true;
bool has_pre_norm = true;
bool has_post_norm = false;
bool has_patch_bias = false;
struct gguf_context * ctx_gguf;
struct ggml_context * ctx_data;
std::vector<uint8_t> buf_compute_meta;
// memory buffers to evaluate the model
ggml_backend_buffer_t params_buffer = NULL;
ggml_backend_t backend = NULL;
ggml_gallocr_t compute_alloc = NULL;
struct clip_image_size * load_image_size;
};
Next the type of the projector will be read from the model information:
const std::string proj_type = gguf_get_val_str(ctx, idx);
new_clip->proj_type = clip_projector_type_from_string(proj_type);
(gdb) p proj_type
$2 = "mlp
The following projector types are currently defined:
static std::map<projector_type, std::string> PROJECTOR_TYPE_NAMES = {
{ PROJECTOR_TYPE_MLP, "mlp" },
{ PROJECTOR_TYPE_LDP, "ldp" },
{ PROJECTOR_TYPE_LDPV2, "ldpv2"},
{ PROJECTOR_TYPE_RESAMPLER, "resampler"},
};
So these are different types of approaches to mapping the image embeddings into a space that can be processed alongside the text token embeddings.
- MLP is a multi-layer perceptron, a feedforward neural network which transforms the image embeddings from CLIP's image encoder into a format compatible with the language model.
- LDP TODO: what is this? There is a section above but it needs more info.
- LDPV2 TODO: what is this?
- RESAMPLER TODO: what is this?
So the above will set the projector type on the clip context. Specific to the MLP type projector is the following:
if (new_clip->proj_type == PROJECTOR_TYPE_MLP) {
if (gguf_find_tensor(ctx, format(TN_LLAVA_PROJ, 3, "weight").c_str()) != -1) {
new_clip->proj_type = PROJECTOR_TYPE_MLP_NORM;
}
}
In our case this will not be true:
(gdb) p gguf_find_tensor(ctx, format("mm.%d.%s", 3, "weight"))
$3 = -1
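Since the projector here is the plain mlp type, and the mm.0/mm.2 tensor names skip index 1 (which I assume is the activation, matching the mlp2x_gelu naming seen in the LLaVA-1.6 metadata), a minimal sketch of what the projector computes could look like this (shapes taken from the mm_projector tensors listed earlier):

import torch
import torch.nn as nn

# Sketch of the 2-layer MLP projector (mm.0 and mm.2 in the GGUF file).
mlp_projector = nn.Sequential(
    nn.Linear(1024, 4096),   # mm.0.weight [4096, 1024], mm.0.bias [4096]
    nn.GELU(),               # assumed activation between the two linear layers
    nn.Linear(4096, 4096),   # mm.2.weight [4096, 4096], mm.2.bias [4096]
)

patch_embeddings = torch.randn(576, 1024)    # 24x24 patches from the vision encoder
projected = mlp_projector(patch_embeddings)  # now in the LLM's 4096-dim embedding space
print(projected.shape)                       # torch.Size([576, 4096])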
Following that we have the initialization of the CUDA backend:
#ifdef GGML_USE_CUDA
new_clip->backend = ggml_backend_cuda_init(0);
LOG_INF("%s: CLIP using CUDA backend\n", __func__);
#endif
After that we have a block which populates the clip context:
int idx = get_key_idx(ctx, KEY_HAS_TEXT_ENC);
new_clip->has_text_encoder = gguf_get_val_bool(ctx, idx);
...
This block is setting some of the model values that we saw in the output above.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
After that we have the loading of tensors:
// load tensors
{
std::vector<uint8_t> read_buf;
struct ggml_init_params params = {
/*.mem_size =*/ (n_tensors + 1) * ggml_tensor_overhead(),
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ true,
};
new_clip->ctx_data = ggml_init(params);
Notice that no_alloc is set to true, which means that data for the tensors
will not be allocated. See ggml.md for more details on this.
Then we open an input file stream to models/mmproj-vicuna7b-f16.gguf:
auto fin = std::ifstream(fname, std::ios::binary);
And then we will iterate over all the tensors. Keep in mind that ctx in this
case is a gguf_context and not a ggml_context, while meta is a ggml_context:
// add tensors to context
for (int i = 0; i < n_tensors; ++i) {
const char * name = gguf_get_tensor_name(ctx, i);
struct ggml_tensor * t = ggml_get_tensor(meta, name);
struct ggml_tensor * cur = ggml_dup_tensor(new_clip->ctx_data, t);
ggml_set_name(cur, name);
}
So we have 378 tensors for this model. Let's go through one of them:
(gdb) p name
$11 = 0x555555b4afe0 "mm.0.bias"
(gdb) p ctx->infos[i]
$12 = {name = {n = 9, data = 0x555555b4afe0 "mm.0.bias"}, n_dims = 1, ne = {4096, 1, 1, 1}, type = GGML_TYPE_F32, offset = 0,
data = 0x0, size = 0}
(gdb) p ctx->infos[i].name
$13 = {n = 9, data = 0x555555b4afe0 "mm.0.bias"}
(gdb) p ctx->infos[i].name.data
$14 = 0x555555b4afe0 "mm.0.bias"
Then the tensor will be looked up in the ggml_context
meta:
(gdb) p *t
$15 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 1, 1, 1}, nb = {4, 16384, 16384, 16384},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x0, name = "mm.0.bias", '\000' <repeats 54 times>, extra = 0x0}
And this will be duplicated into a different ggml_context, which is a field
of the clip_ctx struct. The name of a tensor is not duplicated, so we need to
set the name of the new tensor.
After all the tensors have been copied into the new_clip ggml_context
(ctx_data), these tensors will be allocated in the backend, which is:
(gdb) p new_clip->backend.iface.get_name(new_clip->backend)
$21 = 0x555556b328a8 "CUDA0"
And the allocation is done like this:
// alloc memory and offload data
new_clip->params_buffer = ggml_backend_alloc_ctx_tensors(new_clip->ctx_data, new_clip->backend);
ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors(struct ggml_context * ctx, ggml_backend_t backend) {
return ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_get_default_buffer_type(backend));
}
Note that the function name has the suffix buft, which stands for buffer type.
This will end up in ggml-alloc.c:
ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx,
ggml_backend_buffer_type_t buft) {
GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
size_t alignment = ggml_backend_buft_get_alignment(buft);
size_t max_size = ggml_backend_buft_get_max_size(buft);
GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
And notice here that the assert is checking that no_alloc is set to true, which
makes sense as the data for the tensors is to be allocated in the backend.
This will get the first tensor from the new_clip->ctx_data
ggml context:
struct ggml_tensor * first = ggml_get_first_tensor(ctx);
And then start from that tensor and iterate over all tensors to calculate the size of the buffer required to store all the tensors:
for (struct ggml_tensor * t = first; t != NULL; t = ggml_get_next_tensor(ctx, t)) {
size_t this_size = 0;
if (t->data == NULL && t->view_src == NULL) {
this_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
}
if ((cur_buf_size + this_size) > max_size) {
// allocate tensors in the current buffer
if (!alloc_tensor_range(ctx, first, t, buft, cur_buf_size, &buffers, &n_buffers)) {
return NULL;
}
first = t;
cur_buf_size = this_size;
} else {
cur_buf_size += this_size;
}
}
Notice that this iterates over the tensors and accumulates the size of the
buffer required to store them. If the accumulated size would exceed the maximum
buffer size, the tensors up to (but not including) the current tensor are
allocated on the CUDA device by calling alloc_tensor_range.
// allocate remaining tensors
if (cur_buf_size > 0) {
if (!alloc_tensor_range(ctx, first, NULL, buft, cur_buf_size, &buffers, &n_buffers)) {
return NULL;
}
}
The first thing that happens here is that a buffer will be allocated using the backend type:
static bool alloc_tensor_range(struct ggml_context * ctx,
struct ggml_tensor * first, struct ggml_tensor * last,
ggml_backend_buffer_type_t buft, size_t size,
ggml_backend_buffer_t ** buffers, size_t * n_buffers) {
ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, size);
if (buffer == NULL) {
...
free(*buffers);
return false;
}
struct ggml_tallocr tallocr = ggml_tallocr_new(buffer);
for (struct ggml_tensor * t = first; t != last; t = ggml_get_next_tensor(ctx, t)) {
if (t->data == NULL) {
if (t->view_src == NULL) {
ggml_tallocr_alloc(&tallocr, t);
} else if (t->buffer == NULL) {
ggml_backend_view_init(t);
}
} else {
if (t->view_src != NULL && t->buffer == NULL) {
// view of a pre-allocated tensor
ggml_backend_view_init(t);
}
}
}
*buffers = realloc(*buffers, sizeof(ggml_backend_buffer_t) * (*n_buffers + 1));
(*buffers)[(*n_buffers)++] = buffer;
return true;
}
So the ggml_backend_buft_alloc_buffer
will end up in ggml-cuda.cu
where we
first set the device to be used:
GGML_CALL static ggml_backend_buffer_t ggml_backend_cuda_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
ggml_backend_cuda_buffer_type_context * buft_ctx = (ggml_backend_cuda_buffer_type_context *)buft->context;
ggml_cuda_set_device(buft_ctx->device);
size = std::max(size, (size_t)1); // cudaMalloc returns null for size 0
void * dev_ptr;
cudaError_t err = ggml_cuda_device_malloc(&dev_ptr, size, buft_ctx->device);
if (err != cudaSuccess) {
...
return nullptr;
}
ggml_backend_cuda_buffer_context * ctx = new ggml_backend_cuda_buffer_context(buft_ctx->device, dev_ptr);
return ggml_backend_buffer_init(buft, ggml_backend_cuda_buffer_interface, ctx, size);
}
And we can see that a new ggml_backend_cuda_buffer_context
is created:
(gdb) p *buft_ctx
$35 = {device = 0, dev_ptr = 0x7ffdc6000000, name = "CUDA0"}
And notice that the dev_ptr
is a pointer to the device memory, so this is
memory on the CUDA device.
And finally ggml_backend_buffer_init
is called to initialize the buffer:
GGML_CALL ggml_backend_buffer_t ggml_backend_buffer_init(
ggml_backend_buffer_type_t buft,
struct ggml_backend_buffer_i iface,
ggml_backend_buffer_context_t context,
size_t size) {
ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));
(*buffer) = (struct ggml_backend_buffer) {
/* .interface = */ iface,
/* .buft = */ buft,
/* .context = */ context,
/* .size = */ size,
/* .usage = */ GGML_BACKEND_BUFFER_USAGE_ANY
};
return buffer;
}
Now, the ggml_backend_buffer is allocated on the host, but the context has
its dev_ptr pointing to memory on the CUDA device.
This will be returned to alloc_tensor_range, where we will call
ggml_tallocr_new (talloc stands for tensor allocator):
struct ggml_tallocr tallocr = ggml_tallocr_new(buffer);
// Tensor allocator
struct ggml_tallocr {
ggml_backend_buffer_t buffer;
void * base;
size_t alignment;
size_t offset;
};
After that all the tensors will be iterated over. Recall that we set no_alloc
to true, so the tensors in this case will not have a data pointer and their
view_src will also be NULL:
for (struct ggml_tensor * t = first; t != last; t = ggml_get_next_tensor(ctx, t)) {
if (t->data == NULL) {
if (t->view_src == NULL) {
ggml_tallocr_alloc(&tallocr, t);
} else if (t->buffer == NULL) {
ggml_backend_view_init(t);
}
} else {
if (t->view_src != NULL && t->buffer == NULL) {
// view of a pre-allocated tensor
ggml_backend_view_init(t);
}
}
So ggml_tallocr_alloc will be called for each tensor. Let's take a look at the
first tensor before this call and then we can inspect it again afterwards:
(gdb) p *tensor
$45 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {4096, 1, 1, 1}, nb = {4, 16384, 16384, 16384}, op = GGML_OP_NONE,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0}, view_src = 0x0, view_offs = 0, data = 0x0, name = "mm.0.bias", '\000' <repeats 54 times>, extra = 0x0}
void ggml_tallocr_alloc(struct ggml_tallocr * talloc, struct ggml_tensor * tensor) {
size_t size = ggml_backend_buffer_get_alloc_size(talloc->buffer, tensor);
size = GGML_PAD(size, talloc->alignment);
if (talloc->offset + size > ggml_backend_buffer_get_size(talloc->buffer)) {
fprintf(stderr, "%s: not enough space in the buffer to allocate %s (needed %zu, available %zu)\n",
__func__, tensor->name, size, ggml_backend_buffer_get_size(talloc->buffer) - talloc->offset);
GGML_ABORT("not enough space in the buffer");
}
void * addr = (char *)ggml_backend_buffer_get_base(talloc->buffer) + talloc->offset;
talloc->offset += size;
assert(((uintptr_t)addr % talloc->alignment) == 0);
ggml_backend_tensor_alloc(talloc->buffer, tensor, addr);
}
In our case size will be 16384 bytes (4096 f32 elements * 4 bytes).
void ggml_backend_tensor_alloc(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, void * addr) {
// removed asserts for readability
...
tensor->buffer = buffer;
tensor->data = addr;
ggml_backend_buffer_init_tensor(buffer, tensor);
}
Notice that this is updating the buffer field and the data field is set to the memory address on the CUDA device.
And ggml_backend_buffer_init_tensor looks like this:
GGML_CALL void ggml_backend_buffer_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
// init_tensor is optional
if (buffer->iface.init_tensor) {
buffer->iface.init_tensor(buffer, tensor);
}
}
This will land in ggml-cuda.cu:
GGML_CALL static void ggml_backend_cuda_buffer_init_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor) {
ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;
...
if (ggml_is_quantized(tensor->type) && tensor->view_src == nullptr &&
ggml_backend_buffer_get_usage(buffer) != GGML_BACKEND_BUFFER_USAGE_COMPUTE) {
// initialize padding to 0 to avoid possible NaN values
size_t original_size = ggml_nbytes(tensor);
size_t padded_size = ggml_backend_buft_get_alloc_size(buffer->buft, tensor);
if (padded_size > original_size) {
ggml_cuda_set_device(ctx->device);
CUDA_CHECK(cudaMemset((char *)tensor->data + original_size, 0, padded_size - original_size));
}
}
}
(gdb) p *ctx
$5 = {device = 0, dev_ptr = 0x7ffdc6000000, name = "CUDA0"}
In our case the above will do nothing as the tensor is not quantized. All of the tensors go through the same process and have their buffer and data fields updated. So after this each tensor's data pointer points to memory on the CUDA device.
So to recap, we have been looking into alloc_tensor_range
which looks like
this:
static bool alloc_tensor_range(struct ggml_context * ctx,
struct ggml_tensor * first, struct ggml_tensor * last,
ggml_backend_buffer_type_t buft, size_t size,
ggml_backend_buffer_t ** buffers, size_t * n_buffers) {
ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, size);
if (buffer == NULL) {
for (size_t i = 0; i < *n_buffers; i++) {
ggml_backend_buffer_free((*buffers)[i]);
}
free(*buffers);
return false;
}
struct ggml_tallocr tallocr = ggml_tallocr_new(buffer);
for (struct ggml_tensor * t = first; t != last; t = ggml_get_next_tensor(ctx, t)) {
if (t->data == NULL) {
if (t->view_src == NULL) {
ggml_tallocr_alloc(&tallocr, t);
} else if (t->buffer == NULL) {
ggml_backend_view_init(t);
}
} else {
if (t->view_src != NULL && t->buffer == NULL) {
// view of a pre-allocated tensor
ggml_backend_view_init(t);
}
}
}
*buffers = realloc(*buffers, sizeof(ggml_backend_buffer_t) * (*n_buffers + 1));
(*buffers)[(*n_buffers)++] = buffer;
return true;
}
And we have seen most of this apart from the last three lines. Recall that this
function was called when max_size was reached, and the realloc makes room for
the new buffer we created at the start of this function. realloc behaves like
malloc if *buffers is NULL; otherwise, if there is not enough space, it will
copy the old array to a new location and free the old one.
(gdb) p *buffers
$55 = (ggml_backend_buffer_t *) 0x0
(gdb) p *buffers
$60 = (ggml_backend_buffer_t *) 0x555556b2fda0
So now there will be a space for the new buffer in the array since it was expanded:
(gdb) p (*buffers)[0]
$68 = (ggml_backend_buffer_t) 0x555003e796ef
So we dereference the pointer to the buffers array and index it using the value
that n_buffers currently points to (notice the parentheses, which make the
dereferenced value the index), and set that element to the new buffer. The
value n_buffers is then incremented by one.
(gdb) p *n_buffers
$72 = 0
(*buffers)[(*n_buffers)++] = buffer;
(gdb) p *n_buffers
$73 = 1
(gdb) p *(*buffers)[0]
$75 = {iface = {get_name = 0x7fffe1f741c8 <ggml_backend_cuda_buffer_get_name(ggml_backend_buffer_t)>,
free_buffer = 0x7fffe1f74218 <ggml_backend_cuda_buffer_free_buffer(ggml_backend_buffer_t)>,
get_base = 0x7fffe1f74255 <ggml_backend_cuda_buffer_get_base(ggml_backend_buffer_t)>,
init_tensor = 0x7fffe1f74277 <ggml_backend_cuda_buffer_init_tensor(ggml_backend_buffer_t, ggml_tensor*)>,
memset_tensor = 0x7fffe1f743e6 <ggml_backend_cuda_buffer_memset_tensor(ggml_backend_buffer_t, ggml_tensor*, uint8_t, size_t, size_t)>,
set_tensor = 0x7fffe1f744d0 <ggml_backend_cuda_buffer_set_tensor(ggml_backend_buffer_t, ggml_tensor*, void const*, size_t, size_t)>,
get_tensor = 0x7fffe1f745c0 <ggml_backend_cuda_buffer_get_tensor(ggml_backend_buffer_t, ggml_tensor const*, void*, size_t, size_t)>, cpy_tensor = 0x7fffe1f746b0 <ggml_backend_cuda_buffer_cpy_tensor(ggml_backend_buffer_t, ggml_tensor const*, ggml_tensor*)>,
clear = 0x7fffe1f74860 <ggml_backend_cuda_buffer_clear(ggml_backend_buffer_t, uint8_t)>, reset = 0x0},
buft = 0x7ffff781a7e0 <ggml_backend_cuda_buffer_type::ggml_backend_cuda_buffer_types>, context = 0x555555b499e0, size = 624429056,
usage = GGML_BACKEND_BUFFER_USAGE_ANY}
That will return us back to ggml_backend_alloc_ctx_tensors_from_buft:
ggml_backend_buffer_t buffer;
if (n_buffers == 1) {
buffer = buffers[0];
} else {
buffer = ggml_backend_multi_buffer_alloc_buffer(buffers, n_buffers);
}
free(buffers);
return buffer;
And that will return us back to clip_model_load. Now the buffers for the
tensors have been allocated on the backend, but they don't contain any data yet.
// alloc memory and offload data
new_clip->params_buffer = ggml_backend_alloc_ctx_tensors(new_clip->ctx_data, new_clip->backend);
for (int i = 0; i < n_tensors; ++i) {
const char * name = gguf_get_tensor_name(ctx, i);
struct ggml_tensor * cur = ggml_get_tensor(new_clip->ctx_data, name);
const size_t offset = gguf_get_data_offset(ctx) + gguf_get_tensor_offset(ctx, i);
fin.seekg(offset, std::ios::beg);
...
int num_bytes = ggml_nbytes(cur);
if (ggml_backend_buffer_is_host(new_clip->params_buffer)) {
// for the CPU and Metal backend, we can read directly into the tensor
fin.read(reinterpret_cast<char *>(cur->data), num_bytes);
} else {
// read into a temporary buffer first, then copy to device memory
read_buf.resize(num_bytes);
fin.read(reinterpret_cast<char *>(read_buf.data()), num_bytes);
ggml_backend_tensor_set(cur, read_buf.data(), 0, num_bytes);
}
}
In our case the else block will be executed as the backend is CUDA. Let's take a
closer look at ggml_backend_tensor_set:
GGML_CALL void ggml_backend_tensor_set(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
...
buf->iface.set_tensor(buf, tensor, data, offset, size);
}
GGML_CALL static void ggml_backend_cuda_buffer_set_tensor(
ggml_backend_buffer_t buffer,
ggml_tensor * tensor,
const void * data,
size_t offset,
size_t size) {
ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;
ggml_cuda_set_device(ctx->device);
CUDA_CHECK(cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, cudaStreamPerThread));
CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
}
This copies data from the host: the tensor data that was read from the file into a buffer and then passed to this function. Notice that the destination is the tensor's data pointer, which is on the CUDA device. And this is done for all the tensors.
Next we have the vision model loading (in clip.cpp):
// vision model
if (new_clip->has_vision_encoder) {
// load vision model
auto & vision_model = new_clip->vision_model;
auto & hparams = vision_model.hparams;
hparams.hidden_size = get_u32(ctx, format(KEY_N_EMBD, "vision"));
hparams.n_head = get_u32(ctx, format(KEY_N_HEAD, "vision"));
hparams.n_intermediate = get_u32(ctx, format(KEY_N_FF, "vision"));
hparams.n_layer = get_u32(ctx, format(KEY_N_BLOCK, "vision"));
hparams.image_size = get_u32(ctx, KEY_IMAGE_SIZE);
hparams.patch_size = get_u32(ctx, KEY_PATCH_SIZE);
hparams.projection_dim = get_u32(ctx, format(KEY_PROJ_DIM, "vision"));
hparams.eps = get_f32(ctx, format(KEY_LAYER_NORM_EPS, "vision"));
The above is getting values for the keys in the model:
$ ./inspect-model.sh models/mmproj-vicuna7b-f16.gguf
INFO:gguf-dump:* Loading: models/mmproj-vicuna7b-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 28 key/value pair(s)
...
13: UINT32 | 1 | clip.vision.patch_size = 14
14: UINT32 | 1 | clip.vision.embedding_length = 1024
15: UINT32 | 1 | clip.vision.feed_forward_length = 4096
16: UINT32 | 1 | clip.vision.projection_dim = 768
17: UINT32 | 1 | clip.vision.attention.head_count = 16
18: FLOAT32 | 1 | clip.vision.attention.layer_norm_epsilon = 9.999999747378752e-06
Following that we have:
try {
int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
int n = gguf_get_arr_n(ctx, idx);
const int32_t * pinpoints = (const int32_t *)gguf_get_arr_data(ctx, idx);
for (int i = 0; i < 32 && i < n && pinpoints[i] != 0; ++i) {
hparams.image_grid_pinpoints[i] = pinpoints[i];
}
if (n < 32)
hparams.image_grid_pinpoints[n] = 0;
} catch (std::runtime_error & /*e*/) {
hparams.image_grid_pinpoints[0]=0;
}
Now, if we inspect the output of the model we can see the following:
20: [INT32] | 10 | clip.vision.image_grid_pinpoints
This is an array of 10 integers representing something called grid pinpoints:
pinpoints[0] = 336
pinpoints[1] = 672
pinpoints[2] = 672
pinpoints[3] = 336
pinpoints[4] = 672
pinpoints[5] = 672
pinpoints[6] = 1008
pinpoints[7] = 336
pinpoints[8] = 336
pinpoints[9] = 1008
clip.vision.image_size = 336
clip.vision.patch_size = 14
clip.vision.image_grid_pinpoints = [336, 672, 672, 336, 672, 672, 1008, 336, 336, 1008]
So we have an image size of 336x336
pixels, and a patch size of 14x14
pixels. So if we divide 336 by 14 we get 24 which means that the image will be
divided into a grid of 24x24 patches.
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |20 |21 |22 |23 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
1 |24 |25 |26 |... ...|47 |
+---+---+---+-------------------------------------------------------------------------------+---+
|...| |...|
+---+---------------------------------------------------------------------------------------+---+
23|552|553|554|... ...|575|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Each box is 14x14 pixels
Each row is 24 boxes
And we have 24x24=576 boxes in total
576 * 14 * 14 =
576 * 196 = 112896
336 * 336 = 112896
The values in the pinpoints array are possible resolutions to use for processing high resolution images.
(336, 672)
(672, 336)
(672, 672)
(1008, 336)
(336, 1008)
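As a small sketch of the arithmetic (my own illustration, not code from clip.cpp), the number of patches per 336x336 tile and the tile grid for each pinpoint resolution can be computed like this:

image_size, patch_size = 336, 14
pinpoints = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]

patches_per_side = image_size // patch_size    # 24
patches_per_tile = patches_per_side ** 2       # 576

for w, h in pinpoints:
    tiles = (w // image_size) * (h // image_size)   # how many 336x336 tiles fit
    print(f"{w}x{h}: {tiles} tiles, {tiles * patches_per_tile} patches")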
Next we have the merge type:
try {
int idx = get_key_idx(ctx, KEY_MM_PATCH_MERGE_TYPE);
strcpy(hparams.mm_patch_merge_type, gguf_get_val_str(ctx, idx));
} catch (std::runtime_error & /*e*/) {
strcpy(hparams.mm_patch_merge_type, "flat");
}
We can see that the default type is flat, which I believe means that the grid
of patches will simply be flattened into a single sequence:
Sequence = [Patch at (336, 672), Patch at (672, 336), Patch at (672, 672), Patch at (1008, 336), Patch at (336, 1008)]
The type that is used in this model is:
(gdb) p gguf_get_val_str(ctx, idx)
$119 = 0x555555b4b240 "spatial_unpad"
So what would it look like:
Input Sequence (flattened grid with empty slots):
[0, 0, 0, 0, 0, P(672,336), P(1008,336), 0, P(336,672), P(672,672), 0, 0, P(336,1008), 0, 0, 0]
Possible mask:
[0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
Notice here that there is still an order to the patches. My understanding of
spatial_unpad
is that it will remove the empty slots after the model has
processed the image. TODO: see how this works where hparams.mm_patch_merge_type
is used.
After that we have:
try {
hparams.image_crop_resolution = get_u32(ctx, KEY_IMAGE_CROP_RESOLUTION); // llava-1.6
} catch(const std::exception& /*e*/) {
hparams.image_crop_resolution = hparams.image_size;
}
This is the resolution in pixels to which images are cropped during pre-processing before being fed into the model. So if we input an image it will be cropped. Notice that this first crops to 224 and then resizes to 336, which sounded a little strange to me. But reading some more about this, it seems this may be because of the way the model was trained. If the model was trained on 224x224 images (like ResNet) then having the same value here might ensure compatibility and optimal performance.
- Load image using its original dimensions.
- Crop to 224x224 (center crop?)
- Resize to 336x336
- Normalize (mean and std)
Next we have the mean and std used for the normalization:
int idx_mean = get_key_idx(ctx, KEY_IMAGE_MEAN);
int idx_std = get_key_idx(ctx, KEY_IMAGE_STD);
These are also arrays:
26: [FLOAT32] | 3 | clip.vision.image_mean
27: [FLOAT32] | 3 | clip.vision.image_std
for (int i = 0; i < 3; ++i) {
new_clip->image_mean[i] = mean_data[i];
new_clip->image_std[i] = std_data[i];
}
(gdb) p new_clip->image_mean[0]
$1 = 0.48145467
(gdb) p new_clip->image_mean[1]
$2 = 0.457827508
(gdb) p new_clip->image_mean[2]
$3 = 0.408210725
(gdb) p new_clip->image_std[0]
$4 = 0.268629551
(gdb) p new_clip->image_std[1]
$5 = 0.26130259
(gdb) p new_clip->image_std[2]
$6 = 0.275777102
I think the intuition here is that since we have rescaled the image we want to adjust the pixel values to match what the model was trained on.
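Just to make the use of these values concrete, here is a minimal standalone sketch (plain C++, not the actual llama.cpp code, and the pixel values are made up) of how a single RGB pixel is normalized with them, using the formula we will see later in normalize_image_u8_to_f32:
#include <cstdint>
#include <cstdio>

int main() {
    // normalization constants read from the gguf file above (rounded)
    const float mean[3]    = {0.48145467f, 0.45782751f, 0.40821073f};
    const float std_dev[3] = {0.26862955f, 0.26130259f, 0.27577710f};

    // one RGB pixel as stored in the uint8 image buffer (made up values)
    const uint8_t pixel[3] = {200, 120, 30};

    for (int c = 0; c < 3; c++) {
        // normalized_value = (pixel_value / 255.0 - mean) / std
        const float v = (pixel[c] / 255.0f - mean[c]) / std_dev[c];
        printf("channel %d: %f\n", c, v);
    }
    return 0;
}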
Next, we have a number of tensors that will be set on the vision model:
try {
vision_model.class_embedding = get_tensor(new_clip->ctx_data, TN_CLASS_EMBD);
new_clip->has_class_embedding = true;
} catch (const std::exception& /*e*/) {
new_clip->has_class_embedding = false;
}
This is the class token which aggregates information from all patches and is used for classification.
(gdb) p *vision_model.class_embedding
$8 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0, ne = {1024, 1, 1, 1}, nb = {4, 4096, 4096,
4096}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc880c000, name = "v.class_embd", '\000' <repeats 51 times>,
extra = 0x0}
After that we have the weights and bias for the pre-layer normalization (this is applied before the transformer blocks):
try {
vision_model.pre_ln_w = get_tensor(new_clip->ctx_data, format(TN_LN_PRE, "v", "weight"));
vision_model.pre_ln_b = get_tensor(new_clip->ctx_data, format(TN_LN_PRE, "v", "bias"));
new_clip->has_pre_norm = true;
} catch (std::exception & /*e*/) {
new_clip->has_pre_norm = false;
}
(gdb) p *vision_model.pre_ln_w
$10 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0, ne = {1024, 1, 1, 1}, nb = {4, 4096, 4096,
4096}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc8a53800, name = "v.pre_ln.weight", '\000' <repeats 48 times>,
extra = 0x0}
(gdb) p *vision_model.pre_ln_b
$11 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0, ne = {1024, 1, 1, 1}, nb = {4, 4096, 4096,
4096}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc8a54800, name = "v.pre_ln.bias", '\000' <repeats 50 times>,
extra = 0x0}
Next we have the weights and bias for the post-layer normalization (this is applied after the transformer blocks):
try {
vision_model.post_ln_w = get_tensor(new_clip->ctx_data, format(TN_LN_POST, "v", "weight"));
vision_model.post_ln_b = get_tensor(new_clip->ctx_data, format(TN_LN_POST, "v", "bias"));
new_clip->has_post_norm = true;
} catch (std::exception & /*e*/) {
new_clip->has_post_norm = false;
}
I'm skipping ahead as some of the tensors do not exist in this model. One thing to note when debugging is that if you are stepping, then get_tensor may throw an exception which will cause the debugger to continue executing. Setting a breakpoint outside of the block, or in the catch block, will allow you to continue stepping.
Next, we have the patch embeddings, which are used to project the flattened image patches into the model's embedding space, and the position embeddings, which encode the position of each patch in the image grid:
try {
vision_model.patch_embeddings = get_tensor(new_clip->ctx_data, TN_PATCH_EMBD);
vision_model.position_embeddings = get_tensor(new_clip->ctx_data, format(TN_POS_EMBD, "v"));
} catch(const std::exception& /*e*/) {
LOG_ERR("%s: failed to load vision model tensors\n", __func__);
}
(gdb) p *vision_model.patch_embeddings
$13 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0, ne = {14, 14, 3, 1024}, nb = {2, 28, 392,
1176}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc880d000, name = "v.patch_embd.weight", '\000' <repeats 44 times>,
extra = 0x0}
(gdb) p *vision_model.position_embeddings
$14 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0, ne = {1024, 577, 1, 1}, nb = {2, 2048,
1181696, 1181696}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc8933000,
name = "v.position_embd.weight", '\000' <repeats 41 times>, extra = 0x0}
Following that we have:
// LLaVA projection
if (new_clip->proj_type == PROJECTOR_TYPE_MLP || new_clip->proj_type == PROJECTOR_TYPE_MLP_NORM) {
vision_model.mm_0_w = get_tensor(new_clip->ctx_data, format(TN_LLAVA_PROJ, 0, "weight"));
vision_model.mm_0_b = get_tensor(new_clip->ctx_data, format(TN_LLAVA_PROJ, 0, "bias"));
So these are the weights and bias for the first projection linear layer
(gdb) p *vision_model.mm_0_w
$17 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0, ne = {1024, 4096, 1, 1}, nb = {2, 2048,
8388608, 8388608}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc6004000, name = "mm.0.weight", '\000' <repeats 52 times>,
extra = 0x0}
Following that there are a number of try/catch blocks for loading tensors for other model types(?) like Yi.
Next the vision model's layers vector is resized to the number of layers in the model, which is 23 in this case:
vision_model.layers.resize(hparams.n_layer);
for (int il = 0; il < hparams.n_layer; ++il) {
auto & layer = vision_model.layers[il];
(gdb) ptype vision_model.layers
type = std::vector<clip_layer>
So what does a clip_layer
look like
(gdb) ptype clip_layer
type = struct clip_layer {
ggml_tensor *k_w;
ggml_tensor *k_b;
ggml_tensor *q_w;
ggml_tensor *q_b;
ggml_tensor *v_w;
ggml_tensor *v_b;
ggml_tensor *o_w;
ggml_tensor *o_b;
ggml_tensor *ln_1_w;
ggml_tensor *ln_1_b;
ggml_tensor *ff_i_w;
ggml_tensor *ff_i_b;
ggml_tensor *ff_o_w;
ggml_tensor *ff_o_b;
ggml_tensor *ln_2_w;
ggml_tensor *ln_2_b;
}
So each layer has the attention query/key/value/output projections (q/k/v/o), two layer norms (ln_1 and ln_2), and the feed-forward input and output projections (ff_i and ff_o). These tensors will get populated from the gguf data.
After that the current gguf
context is set on the clip context:
new_clip->ctx_gguf = ctx;
And the last things that happen before returning the clip context is:
// measure mem requirement and allocate
{
new_clip->buf_compute_meta.resize(GGML_DEFAULT_GRAPH_SIZE * ggml_tensor_overhead() + ggml_graph_overhead());
new_clip->compute_alloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(new_clip->backend));
clip_image_f32_batch batch;
batch.size = 1;
ggml_cgraph * gf = clip_image_build_graph(new_clip, &batch, nullptr, false);
ggml_gallocr_reserve(new_clip->compute_alloc, gf);
size_t compute_memory_buffer_size = ggml_gallocr_get_buffer_size(new_clip->compute_alloc, 0);
LOG_INF("%s: compute allocated memory: %.2f MB\n", __func__, compute_memory_buffer_size /1024.0/1024.0);
}
After the computation graph allocator has been created we have the following:
clip_image_f32_batch batch;
batch.size = 1;
ggml_cgraph * gf = clip_image_build_graph(new_clip, &batch, nullptr, false);
struct clip_image_f32_batch {
struct clip_image_f32 * data;
size_t size;
};
// RGB float32 image (NHWC)
// Memory layout: RGBRGBRGB...
struct clip_image_f32 {
int nx;
int ny;
std::vector<float> buf;
};
So clip_image_f32
is storing an image in a float32 vector. The width of the
image is nx
and the height is ny
. The NHWC is a common format in ML for
representing images where N is the number of images, H is the height, W is the
width and C is the number of channels. For example if nx=4 and ny=3 we would
have:
buf.size = 4 * 3 * 3 = 36
[R1, G1, B1, R2, G2, B2, R3, G3, B3, R4, G4, B4]
[R5, G5, B5, R6, G6, B6, R7, G7, B7, R8, G8, B8]
[R9, G9, B9, R10, G10, B10, R11, G11, B11, R12, G12, B12]
So we are passing in a batch of size 1 to clip_image_build_graph
:
static ggml_cgraph * clip_image_build_graph(clip_ctx * ctx,
const clip_image_f32_batch * imgs,
struct clip_image_size * load_image_size,
bool is_inf = false) {
...
struct ggml_init_params params = {
/*.mem_size =*/ ctx->buf_compute_meta.size(),
/*.mem_buffer =*/ ctx->buf_compute_meta.data(),
/*.no_alloc =*/ true,
};
struct ggml_context * ctx0 = ggml_init(params);
struct ggml_cgraph * gf = ggml_new_graph(ctx0);
This is familiar and we are creating a new computation graph.
(gdb) p ggml_graph_print(gf)
=== GRAPH ===
n_nodes = 0
n_leafs = 0
========================================
And now the computation graph will be built up.
struct ggml_tensor * inp_raw = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, image_size_width, image_size_height, 3, batch_size);
ggml_set_name(inp_raw, "inp_raw");
ggml_set_input(inp_raw);
We start with the input layer, which is the raw image data. Notice that the dimensions are 336x336x3x1, where 3 is the number of channels (RGB) and 1 is the batch size. We then give this tensor a name and mark it as a graph input:
(gdb) p *inp_raw
$1 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {336, 336, 3, 1}, nb = {4, 1344, 451584, 1354752},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 1, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0}, view_src = 0x0, view_offs = 0, data = 0x0, name = "inp_raw", '\000' <repeats 56 times>, extra = 0x0}
Next we have a convolution operation. I was not familiar with the ggml_conv_2d function and I had to take a brief detour to understand what it does; the notes and examples can be found in ggml.md.
struct ggml_tensor * inp = ggml_conv_2d(ctx0, model.patch_embeddings, inp_raw, patch_size, patch_size, 0, 0, 1, 1);
Notice that the inputs are model.patch_embeddings, which will be used as the kernel:
$4 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555b49a20,
ne = {14, 14, 3, 1024}, nb = {2, 28, 392, 1176}, op = GGML_OP_NONE,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffe6c20d020,
name = "v.patch_embd.weight", '\000' <repeats 44 times>, extra = 0x0}
So we can see that we have a 14x14 kernel with 3 channels and 1024 embedding
dimensions.
inp_raw is the input data. The patch_size is the stride for both x and y, which is 14 in this case. So the kernel moves 14 pixels in the x and y direction, which means there is no overlap between patches. We have 0 padding for x and y. The last two arguments are the dilation for x and y, which is 1, meaning the kernel is not dilated (does not have any gaps).
So, what we are doing here is that we are dividing the input image into patches and then embedding each patch into a high-dimensional space, which is 1024 in this case. We can think of each patch as a token in an NLP model, where each token needs to get a token embedding. The embedding process for these patches, using convolution in this case, is analogous to word embeddings in NLP. It transforms the raw pixel data into a format that the transformer can process, just like word embeddings transform words into vector representations.
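To sanity check the 24x24 grid, we can plug the values into the standard 2d convolution output-size formula (this is just a small sketch, not llama.cpp code):
#include <cstdio>

// standard conv2d output size: (in + 2*pad - dilation*(kernel - 1) - 1) / stride + 1
static int conv_out(int in, int kernel, int stride, int pad, int dilation) {
    return (in + 2*pad - dilation*(kernel - 1) - 1) / stride + 1;
}

int main() {
    const int out_w = conv_out(336, 14, 14, 0, 1); // 24
    const int out_h = conv_out(336, 14, 14, 0, 1); // 24
    printf("%d x %d patches, %d in total, each embedded into 1024 dimensions\n",
           out_w, out_h, out_w * out_h);           // 24 x 24 patches, 576 in total
    return 0;
}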
After the convolution we have:
(gdb) p *inp
$6 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {24, 24, 1024, 1}, nb = {4, 96, 2304, 2359296}, op = GGML_OP_CONT,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x555555bbd050, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x0,
name = " (reshaped) (permuted) (cont)", '\000' <repeats 34 times>, extra = 0x0}
Lets try to visualize this:
336
+---------------------------------------+
| | Image
| |
| |
| |
| | 336
| |
| |
| |
| |
| |
+---------------------------------------+
14 Patch
+-------------+
| |
| |
| | 14
| |
| |
+-------------+
x = 336 / 14 = 24
y = 336 / 14 = 24
z = 1024 (embedding size)
The input image is divided into a grid of 336/14=24 patches per side:
x
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 | 14|15 |16 |17 |18 |19 |20 |21 |22 |23 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 | 14|15 |16 |17 |18 |19 |20 |21 |22 |23 | y
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| ... |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
23 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 | 14|15 |16 |17 |18 |19 |20 |21 |22 |23 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Each cell in the above grid can be identified using an x and y coordinate. This represents a patch in the input, and we can think of this in a similar way as token embeddings in NLP. In NLP I'm used to thinking about token embeddings as rows in a matrix, where one row represents a token and the columns are the actual embedding values:
token 0 : [0 ... 1023]
token 1 : [0 ... 1023]
token 2 : [0 ... 1023]
...
But in the case of images we will keep the x and y dimensions of the patch, and the embeddings for each patch will be in the z dimension:
x_0, y_0 -> [0, ... 1023]
x_0, y_1 -> [0, ... 1023]
...
x_0, y_23 -> [0, ... 1023]
x_1, y_0 -> [0, ... 1023]
...
x_23, y_23 -> [0, ... 1023]
By doing this the spatial information is preserved and the model can learn about the spatial relationships between the patches. This is important as the model needs to understand the spatial relationships between the patches in order to understand the image. So in effect the convolution operation above has created the embeddings for the image.
Next the embeddings are reshaped and permuted:
inp = ggml_reshape_3d(ctx0, inp, num_patches, hidden_size, batch_size);
inp = ggml_cont(ctx0, ggml_permute(ctx0, inp, 1, 0, 2, 3));
(gdb) p inp->ne
$10 = {24, 24, 1024, 1}
(gdb) p num_patches
$7 = 576
(gdb) p hidden_size
$8 = 1024
(gdb) p batch_size
$9 = 1
So that will reshape the tensor to 576x1024x1:
(gdb) p inp->ne
$11 = {576, 1024, 1, 1}
And then this tensor is permuted where the order of the first two dimensions are swapped:
(gdb) p ggml_permute(ctx0, inp, 1, 0, 2, 3)->ne
$14 = {1024, 576, 1, 1}
And then the permuted tensor is made contiguous (a new tensor is created with the data now guaranteed to be contiguous in memory). So we went from 24x24x1024x1 -> 1024x576x1x1, which is more like what we have in NLP where we have a row per token; in this case we have a row per patch:
NLP:
token 0   : [0 ... 1023]
...
token 575 : [0 ... 1023]
Vision:
patch 0   : [0 ... 1023]
...
patch 575 : [0 ... 1023]
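And a hypothetical little helper to make the mapping concrete, assuming the same row-major ordering as the 0..575 numbering in the grid earlier (patch 0 top-left, patch 575 bottom-right):
// map a patch's (x, y) grid coordinate to its row in the flattened 576x1024 tensor
static int patch_row(int x, int y, int patches_per_row = 24) {
    return y * patches_per_row + x;
}
// patch_row(0, 0)   ==   0  (top-left patch)
// patch_row(23, 0)  ==  23  (top-right patch)
// patch_row(23, 23) == 575  (bottom-right patch)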
Next, if the clip context has a patch bias it will be applied:
if (ctx->has_patch_bias) {
// inp = ggml_add(ctx0, inp, ggml_repeat(ctx0, model.patch_bias, inp));
inp = ggml_add(ctx0, inp, model.patch_bias);
}
This is not the case for this model so this will be skipped.
Following that we have:
struct ggml_tensor * embeddings = inp;
struct ggml_tensor * pos_embed = nullptr;
if (ctx->has_llava_projector) {
// concat class_embeddings and patch_embeddings
if (ctx->has_class_embedding) {
embeddings = ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, hidden_size, num_positions, batch_size);
ggml_set_name(embeddings, "embeddings");
ggml_set_input(embeddings);
embeddings = ggml_acc(ctx0, embeddings, model.class_embedding,
embeddings->nb[1], embeddings->nb[2], embeddings->nb[3], 0);
embeddings = ggml_acc(ctx0, embeddings, inp,
embeddings->nb[1], embeddings->nb[2], embeddings->nb[3], model.class_embedding->nb[1]);
}
}
Notice that the above code will be executed in our case and that the embeddings tensor will be reassigned to a new 3d tensor with the shape 1024x577x1 (hidden_size x num_positions x batch_size). Then the name and input flag are set on this new tensor.
The first ggml_acc
operation will use the embeddings
tensor as the
destination, and the values in the model.class_embedding
tensor will be added
to destination and the result will be a new tensor. So at this point embeddings
looks like this:
(gdb) p embeddings->ne
$32 = {1024, 577, 1, 1}
(gdb) p embeddings->nb
$37 = {4, 4096, 2363392, 2363392}
So the following is using the same strides as embeddings currently has and an offset of 0:
embeddings = ggml_acc(ctx0, embeddings, model.class_embedding,
embeddings->nb[1], embeddings->nb[2], embeddings->nb[3], 0);
So the model.class_embedding values will be added to the embeddings tensor starting at the first element.
Then we have another ggml_acc using the same destination as before, but the source is now inp. Notice that the offset is 4096 bytes (the first dimension has 1024 elements of type float32, which is 4 bytes each, and 1024x4=4096). So inp will be added to the embeddings tensor starting at byte offset 4096, that is from the second row onwards.
embeddings = ggml_acc(ctx0, embeddings, inp,
embeddings->nb[1], embeddings->nb[2], embeddings->nb[3], model.class_embedding->nb[1]);
In summary, this creates a new tensor, assigns it to embeddings, and it will have the class embedding first and then the patch embeddings:
[CLASS_TOKEN] [PATCH_1] [PATCH_2] ... [PATCH_N]
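In terms of byte offsets into this new tensor (nb[1] = 4096 bytes per row) the layout is:
row 0   (offset 0)          : class embedding
row 1   (offset 4096)       : patch 0
row 2   (offset 8192)       : patch 1
...
row 576 (offset 576 * 4096) : patch 575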
Following that a tensor for positions is created where the number of positions for this model is 577 (576 + 1 for the class token):
struct ggml_tensor * positions = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, num_positions);
ggml_set_name(positions, "positions");
ggml_set_input(positions);
embeddings = ggml_add(ctx0, embeddings, ggml_get_rows(ctx0, model.position_embeddings, positions));
Notice that the ggml_get_rows function is used to get the rows from the model.position_embeddings tensor, and the indices are given by the positions tensor. These are then added to the embeddings tensor. So this is using absolute positional encodings, which is different from LLMs which nowadays mostly use RoPE or extensions of that (relative rather than absolute positional encodings). These are learned and, as we can see above, are part of the model. In a vision transformer each patch is like a token in NLP, and the patches have an extra dimension (x, y like we discussed above). These are flattened but their 2d relationships must be preserved. Vision models deal with fixed-size inputs, so absolute positional encodings are suitable, compared to LLMs which have to deal with variable length sequences where relative positional encoding is a better option. So each element in the embedding tensor, where each one represents a patch from the input image (apart from the class token), will have a positional encoding added to it.
Reflection:
How can I think about the absolute position values, that is how does adding this value to each embedding (patch or class token) enable positions to be encoded? Is it that during training this has been learned, and a patch will be "moved in the embedding space" appropriately for the model to understand that this is the top left, top right, etc.?
I think this might be one way of imagining what is happening. So each
patch/token starts with its own embedding which is a unique vector in the
embedding space. Then the positional encoding is added to this vector and this
will move the vector in the embedding space. During training the model learns to
assign unique vectors to each position (the x y grid or patches). It can also
capture relationships between positions as well. So adjacent patches might be
more similar to each other, located closer, than other patches further away.
Following that we have the following check:
if (ctx->has_minicpmv_projector) {
...
}
TODO: revisit when trying out MiniCPMV model.
Next we have a layer normalization, and then a multiplication with the
model.pre_ln_w
weights and addition of the model.pre_ln_b
bias:
// pre-layernorm
if (ctx->has_pre_norm) {
embeddings = ggml_norm(ctx0, embeddings, eps);
ggml_set_name(embeddings, "pre_ln");
embeddings = ggml_add(ctx0, ggml_mul(ctx0, embeddings, model.pre_ln_w), model.pre_ln_b);
}
This is the pre-layer normalization, which is applied before the transformer blocks. ggml_norm normalizes each row to zero mean and unit variance, and the multiply/add then applies the learned scale (pre_ln_w) and shift (pre_ln_b), which stabilizes the values going into the transformer layers.
Following that there is a loop over the layers in the model, of which there are 23 in this model. At this point the embeddings have the shape:
(gdb) p embeddings->ne
$3 = {1024, 577, 1, 1}
So there is one row for each patch plus one row for the class token.
for (int il = 0; il < n_layer - 1; il++) {
struct ggml_tensor * cur = embeddings; // embeddings = residual, cur = hidden_states
...
}
First we have a layer normalization:
// layernorm1
{
cur = ggml_norm(ctx0, cur, eps);
cur = ggml_add(ctx0, ggml_mul(ctx0, cur, model.layers[il].ln_1_w),
model.layers[il].ln_1_b);
}
To recap the layer normalization formula is:
y = γ * ((x - μ) / σ) + β
Where:
- x is the input tensor
- μ is the mean of the input tensor (ggml_norm)
- σ is the standard deviation of the input tensor (ggml_norm)
- γ is the scale parameter (model.layers[i].ln_1_w)
- β is the shift parameter (model.layers[i].ln_1_b)
After ggml_norm has normalized the tensor, the code above applies the scale and shift:
cur = model.layers[il].ln_1_w * cur + model.layers[il].ln_1_b
Notice that ln_1_w and ln_1_b are specific to each layer, so each layer can scale and shift the normalized tensor values differently.
(gdb) p cur->ne
$6 = {1024, 577, 1, 1}
(gdb) p model.layers[il].ln_1_w->ne
$5 = {1024, 1, 1, 1}
This could look something like this:
cur model.layers[il].ln_1_w
0 [0 ... 1023] * [0 ... 1023] = [0]
1 [0 ... 1023] [1]
... ...
576 [0 ... 1023] [576]
But notice that if we just performed this type of multiplication we would get
a tensor with the shape 1x577. What happens in ggml_mul
though is that
model.layers[il].ln_1_w
is broadcasted to the shape of cur
:
cur model.layers[il].ln_1_w
0 [0 ... 1023] * [0 ... 1023] = [0 ... 1023]
1 [0 ... 1023] [0 ... 1023] [0 ... 1023]
... ... ...
576 [0 ... 1023] [0 ... 1023] [0 ... 1023]
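To make the whole layernorm1 step concrete, here is a minimal sketch of what ggml_norm plus the mul/add amounts to for a single row (plain C++, not ggml code; eps is assumed to be a small constant like 1e-5). In the actual graph this is done for all 577 rows at once using the broadcasting shown above:
#include <cmath>
#include <vector>

// Normalize one row of 1024 values and apply the learned scale/shift.
// gamma corresponds to ln_1_w and beta to ln_1_b (both of length 1024).
static std::vector<float> layer_norm_row(const std::vector<float> & x,
                                         const std::vector<float> & gamma,
                                         const std::vector<float> & beta,
                                         float eps = 1e-5f) {
    const size_t n = x.size();
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= n;

    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;

    std::vector<float> y(n);
    for (size_t i = 0; i < n; i++) {
        y[i] = gamma[i] * ((x[i] - mean) / std::sqrt(var + eps)) + beta[i];
    }
    return y;
}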
After the layer normalization we have the attention layer, and going into this layer the shape of the current tensor is:
(gdb) p cur->ne
$8 = {1024, 577, 1, 1}
// self-attention
{
struct ggml_tensor * Q =
ggml_add(ctx0, ggml_mul_mat(ctx0, model.layers[il].q_w, cur), model.layers[il].q_b);
Now, this time we are doing a matrix multiplication using model.layers[il].q_w and the result of the normalization above.
(gdb) p model.layers[il].q_w->ne
$11 = {1024, 1024, 1, 1}
(gdb) p model.hparams.hidden_size
$13 = 1024
0 [0 ... 1023] x [0 ... 1023]
[0 ... 1023] ...
[0 ... 1023] [0 ... 1023] 576
...
...
1023 [0 ... 1023]
Now, notice that the dimensions of cur don't seem to add up with the dimensions of the weight matrix model.layers[il].q_w. I usually think of the weight matrix as functions, one per row, that take a number of parameters; when we perform matrix multiplication we take a column from the other matrix and pass it down the rows of the weight matrix. Just like in programming we need to pass the correct number of parameters to the function, so the above would not seem to work. So what is going on? It turns out that ggml_mul_mat will transpose the cur tensor (the second argument) internally before performing the multiplication. We can find this information in the header documentation:
// A: k columns, n rows => [ne03, ne02, n, k]
// B: k columns, m rows (i.e. we transpose it internally) => [ne03 * x, ne02 * y, m, k]
// result is n columns, m rows => [ne03 * x, ne02 * y, m, n]
GGML_API struct ggml_tensor * ggml_mul_mat(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b);
So the operation will actually look something like this instead internally in that function:
0 [0 ... 1023] x [0...576]
[0 ... 1023] ...
... ...
... ...
... ...
1023 [0 ... 1023] [0...576]
The output of this multiplication will be:
(gdb) p ggml_mul_mat(ctx0, model.layers[il].q_w, cur)->ne
$19 = {1024, 577, 1, 1}
A bias for this layer is also added which then results in the Q (Query) tensor.
Recall that it is called scaled dot-product attention and the formula for the attention is:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
And the Q tensor will now be used in the following operations:
Q = ggml_scale_inplace(ctx0, Q, 1.0f / sqrt((float)d_head));
Q = ggml_reshape_4d(ctx0, Q, d_head, n_head, num_positions, batch_size);
Q = ggml_cont(ctx0, ggml_permute(ctx0, Q, 0, 2, 1, 3));
Q = ggml_reshape_3d(ctx0, Q, d_head, num_positions, n_head * batch_size);
The ggml_scale_inplace operation performs the scaling part of the above formula. Scaling Q up front is equivalent to computing Q * K^T first and then dividing by the square root of the dimension of the key vectors; it could just as well have scaled K, as long as one of them is scaled, since the product will automatically carry the scaling when Q and K^T are multiplied.
After the scaling the Q tensor is reshaped to a 4d tensor.
(gdb) p Q->ne
$20 = {1024, 577, 1, 1}
(gdb) p d_head
$21 = 64
(gdb) p n_head
$22 = 16
(gdb) p num_positions
$23 = 577
(gdb) p batch_size
$24 = 1
(gdb) p Q->ne
$25 = {64, 16, 577, 1}
So this has split the 1024 features into 16 heads with 64 features each.
z
0 x
0 [0 ... 63]
... y
15 [0 ... 63]
...
z
576 x
0 [0 ... 63]
... y
15 [0 ... 63]
This is then permuted, swapping the second and third dimensions:
(gdb) p ggml_permute(ctx0, Q, 0, 2, 1, 3)->ne
$26 = {64, 577, 16, 1}
z
0 x
0 [0 ... 63]
...
... y
...
576 [0 ... 63]
...
z
16 x
0 [0 ... 63]
...
... y
...
576 [0 ... 63]
By moving the positions (577) to the second dimension, we ensure that when we perform operations along the head dimension (64) and across positions, we're accessing contiguous memory. Most optimized linear algebra libraries (like BLAS) are designed to work most efficiently when operating on contiguous memory as well.
Now, this is nice because we can think of having 16 (64x577) matrices which can be computed in parallel. We will see this later when we look at the key matrix K but one way to think of this might be:
Head 1:
Q (64x577) K (64x577) = Attention (577x577)
...
Head 16:
Q (64x577) K (64x577) = Attention (577x577)
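As a reference, here is a minimal single-head sketch of the scaled dot-product formula above (plain C++, not ggml code, row-major matrices). In this sketch the 1/sqrt(d_head) scaling is applied to the scores, whereas in the graph it was folded into Q up front; the result is the same, and the real graph computes all 16 heads at once:
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<float>>; // [rows][cols]

// One attention head: Q, K, V are (n_pos x d_head) matrices.
// Returns softmax(Q K^T / sqrt(d_head)) * V, which is (n_pos x d_head).
static Mat attention_head(const Mat & Q, const Mat & K, const Mat & V) {
    const size_t n = Q.size();     // 577 positions in our case
    const size_t d = Q[0].size();  // 64 features per head
    const float scale = 1.0f / std::sqrt((float) d);

    Mat out(n, std::vector<float>(d, 0.0f));
    for (size_t i = 0; i < n; i++) {
        // scores of query position i against every key position j
        std::vector<float> scores(n);
        float max_s = -1e30f;
        for (size_t j = 0; j < n; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < d; k++) s += Q[i][k] * K[j][k];
            scores[j] = s * scale;
            max_s = std::max(max_s, scores[j]);
        }
        // softmax over the scores (subtracting the max for numerical stability)
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++) { scores[j] = std::exp(scores[j] - max_s); sum += scores[j]; }
        for (size_t j = 0; j < n; j++) scores[j] /= sum;
        // weighted sum of the value rows
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < d; k++)
                out[i][k] += scores[j] * V[j][k];
    }
    return out;
}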
And since we have permuted the tensor we want to make it contiguous in memory for efficiency (something that we have discussed before and that is often done after a reshape/permute). This is then again reshaped to a 3d tensor:
(gdb) p Q->ne
$28 = {64, 577, 16, 1}
Next we have the Key tensor:
struct ggml_tensor * K =
ggml_add(ctx0, ggml_mul_mat(ctx0, model.layers[il].k_w, cur), model.layers[il].k_b);
K = ggml_reshape_4d(ctx0, K, d_head, n_head, num_positions, batch_size);
K = ggml_cont(ctx0, ggml_permute(ctx0, K, 0, 2, 1, 3));
K = ggml_reshape_3d(ctx0, K, d_head, num_positions, n_head * batch_size);
And this is very similar to the Q tensor above (apart from the scaling, which is only applied to Q) so I won't repeat it. The same operations are created for the Value tensor:
struct ggml_tensor * V =
ggml_add(ctx0, ggml_mul_mat(ctx0, model.layers[il].v_w, cur), model.layers[il].v_b);
V = ggml_reshape_4d(ctx0, V, d_head, n_head, num_positions, batch_size);
V = ggml_cont(ctx0, ggml_permute(ctx0, V, 1, 2, 0, 3));
V = ggml_reshape_3d(ctx0, V, num_positions, d_head, n_head * batch_size);
Notice that V is permuted differently from Q and K. The result of Q x K will be a (577x577) attention matrix for each of the 16 heads, so V is laid out with the 577 positions in its first dimension so that it can be multiplied with that matrix. After the permute V looks like this:
(gdb) p V->ne
$31 = {577, 64, 16, 1}
After that we have the matrix multiplication for Q x K; recall that Q was already scaled, so this gives us the scaled dot product:
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
The shape of KQ will be:
(gdb) p KQ->ne
$35 = {577, 577, 16, 1}
After that a softmax operation is created:
KQ = ggml_soft_max_inplace(ctx0, KQ);
And then we have the multiplication operation for QK x V:
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ);
KQV = ggml_reshape_4d(ctx0, KQV, d_head, num_positions, n_head, batch_size);
KQV = ggml_permute(ctx0, KQV, 0, 2, 1, 3);
This will result in a tensor with the shape:
(gdb) p KQV->ne
$36 = {64, 577, 16, 1}
After the permute operation:
(gdb) p KQV->ne
$39 = {64, 16, 577, 1}
And then KQV is made contiguous in memory:
cur = ggml_cont_3d(ctx0, KQV, hidden_size, num_positions, batch_size);
And that is the last of the self-attention block.
The output of the self-attention operation is then multiplied with the output projection weights and the bias is added:
// attention output
cur = ggml_add(ctx0, ggml_mul_mat(ctx0, model.layers[il].o_w, cur), model.layers[il].o_b);
The shape of this tensor will be:
(gdb) p cur->ne
$40 = {1024, 577, 1, 1}
And then the original embeddings (the one before this layer) are added to the result of the above operation (this is the residual connection):
cur = ggml_add(ctx0, cur, embeddings);
And then the current tensor is assigned to the embeddings tensor so it can be used as the residual for the second residual connection (around layernorm2 and the feed-forward block):
embeddings = cur; // embeddings = residual, cur = hidden_states
Following that we have the layernorm2:
// layernorm2
{
cur = ggml_norm(ctx0, cur, eps);
cur = ggml_add(ctx0, ggml_mul(ctx0, cur, model.layers[il].ln_2_w), model.layers[il].ln_2_b);
}
Which creates the same type of operation as we saw before with the layernorm1. Then we have a Feed-Forward layer:
cur = ggml_mul_mat(ctx0, model.layers[il].ff_i_w, cur);
cur = ggml_add(ctx0, cur, model.layers[il].ff_i_b);
(gdb) p cur->ne
$41 = {4096, 577, 1, 1}
So we can see that the dimensions are expanded to 4096, which is part of the feed-forward block: it expands to a higher dimension, applies a non-linearity in that higher dimension, and then compresses back to the original dimension. The non-linearity will be one of the following:
if (ctx->use_gelu) {
cur = ggml_gelu_inplace(ctx0, cur);
} else {
cur = ggml_gelu_quick_inplace(ctx0, cur);
}
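For reference, to my understanding these two activations correspond to the tanh approximation of GELU and the sigmoid based "quick" approximation, roughly:
#include <cmath>

// tanh approximation of GELU (what I believe ggml_gelu computes)
static float gelu(float x) {
    const float c = std::sqrt(2.0f / 3.14159265f);
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}

// "quick" GELU: x * sigmoid(1.702 * x) (what I believe ggml_gelu_quick computes)
static float gelu_quick(float x) {
    return x / (1.0f + std::exp(-1.702f * x));
}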
In this case it will be the else branch which is the quick version of the GELU. And then we have the last operation for the feed-forward layer:
cur = ggml_mul_mat(ctx0, model.layers[il].ff_o_w, cur);
cur = ggml_add(ctx0, cur, model.layers[il].ff_o_b);
(gdb) p cur->ne
$43 = {1024, 577, 1, 1}
And then we have the second residual connection, which uses the embeddings that were saved before layernorm2:
// residual 2
cur = ggml_add(ctx0, embeddings, cur);
And finally embeddings is set to the current tensor for the next iteration:
embeddings = cur;
And that is a complete layer!
So the input was the image divided into patches and then embedded into a high-dimensional space. The embeddings were then reshaped and permuted and class embeddings and positional embeddings were added. The embeddings were then passed through a series of transformer layers where each layer consists of a self-attention block and a feed-forward block. The output of the model is the embeddings after the last layer.
After that we have:
// llava projector
if (ctx->has_llava_projector) {
embeddings = ggml_reshape_2d(ctx0, embeddings, embeddings->ne[0], embeddings->ne[1]);
So this is the projector for the LLaVA model, which takes the patch embeddings produced above and transforms them into the embedding space of the LLM. This way both text token embeddings and image patch embeddings end up in the same space and can be processed by the same transformer model.
(gdb) p embeddings->ne
$45 = {1024, 577, 1, 1}
First a tensor is created for the patches (576):
struct ggml_tensor * patches = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, num_patches);
ggml_set_name(patches, "patches");
ggml_set_input(patches);
Next, an operation is created to extract/get rows from the embeddings tensor for the indices given by the patches tensor:
embeddings = ggml_get_rows(ctx0, embeddings, patches);
Following that we have:
// llava projector
if (ctx->proj_type == PROJECTOR_TYPE_MLP) {
embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
This is the code path taken in our case and this is the MLP (Multi-Layer
Perceptron) projector. The embeddings are first multiplied with the
model.mm_0_w
weights:
(gdb) p *model.mm_0_w
$53 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555cc12e0,
ne = {1024, 4096, 1, 1}, nb = {2, 2048, 8388608, 8388608}, op = GGML_OP_NONE,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffdc6004000,
name = "mm.0.weight", '\000' <repeats 52 times>, extra = 0x0}
Before this operation the embeddings looks like this:
(gdb) p embeddings->ne
$54 = {1024, 576, 1, 1}
And after the multiplication:
(gdb) p embeddings->ne
$55 = {4096, 576, 1, 1}
So we can see that this has expanded the dimensions from 1024->4096. And we can see above that a bias is also added. Then we have a GELU activation function, a matrix multiplication and an addition:
embeddings = ggml_gelu(ctx0, embeddings);
embeddings = ggml_mul_mat(ctx0, model.mm_2_w, embeddings);
embeddings = ggml_add(ctx0, embeddings, model.mm_2_b);
}
After these operations the embeddings tensor will have the shape:
(gdb) p embeddings->ne
$58 = {4096, 576, 1, 1}
So my understanding of this is that it takes the embeddings output by the vision transformer and projects them into the embedding space of the LLM. And for this model that is the last thing done when building the graph.
// build the graph
ggml_build_forward_expand(gf, embeddings);
ggml_free(ctx0);
return gf;
}
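So the shape trace for the MLP projector for one image is roughly the following (mm_2_w is presumably a 4096x4096 matrix, since the width stays at 4096 after the second multiplication):
output of vision transformer (class row dropped by ggml_get_rows) : 576 x 1024
mm_0_w (1024 -> 4096) + mm_0_b                                    : 576 x 4096
GELU                                                              : 576 x 4096
mm_2_w (4096 -> 4096) + mm_2_b                                    : 576 x 4096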
TODO: Take a closer look at the other types of projectors like PROJECTOR_TYPE_MLP_NORM, PROJECTOR_TYPE_LDP, PROJECTOR_TYPE_LDPV2, and PROJECTOR_TYPE_RESAMPLER. There does not seem to be an if clause for the last one at the moment.
So that returns us back to clip_model_load:
ggml_cgraph * gf = clip_image_build_graph(new_clip, &batch, nullptr, false);
ggml_gallocr_reserve(new_clip->compute_alloc, gf);
size_t compute_memory_buffer_size = ggml_gallocr_get_buffer_size(new_clip->compute_alloc, 0);
LOG_INF("%s: compute allocated memory: %.2f MB\n", __func__, compute_memory_buffer_size /1024.0/1024.0);
}
return new_clip;
The graph allocator is something that I've written about, actually I stopped
here to write about it and the notes can be found in ggml.md.
And after that the clip_ctx
is returned from clip_model_load
.
As a reminder, a llava_context
is defined as:
struct llava_context {
struct clip_ctx * ctx_clip = NULL;
struct llama_context * ctx_llama = NULL;
struct llama_model * model = NULL;
};
The clip context is what we have seen being populated so far in this section and represents the image encoder part of CLIP. The llama context is for the Vicuna model, which is the LLM part.
So we are still in the function llava_init_context
and we have the following:
llama_context_params ctx_params = llama_context_params_from_gpt_params(*params);
ctx_params.n_ctx = params->n_ctx < 2048 ? 2048 : params->n_ctx;
llama_context * ctx_llama = llama_new_context_with_model(model, ctx_params);
This is the "normal" llama context params we have seen in other examples ad
we are creating a llama_context
using the model (vicuna-7b-q5_k.gguf
)
The llava context will be malloced and populated with the llama context, the clip context and the model:
auto * ctx_llava = (struct llava_context *)malloc(sizeof(llava_context));
ctx_llava->ctx_llama = ctx_llama;
ctx_llava->ctx_clip = ctx_clip;
ctx_llava->model = model;
return ctx_llava;
}
This will return us to main
:
for (auto & image : params.image) {
auto * ctx_llava = llava_init_context(¶ms, model);
----> auto * image_embed = load_image(ctx_llava, ¶ms, image);
if (!image_embed) {
LOG_ERR("%s: failed to load image %s. Terminating\n\n", __func__, image.c_str());
return 1;
}
This loads the embeddings for the image, and notice that the llava context is passed in here so that it has access to the image encoder.
static struct llava_image_embed * load_image(llava_context * ctx_llava, gpt_params * params, const std::string & fname) {
// load and preprocess the image
llava_image_embed * embed = NULL;
auto prompt = params->prompt;
if (prompt_contains_image(prompt)) {
if (!params->image.empty()) {
LOG_INF("using base64 encoded image instead of command line image path\n");
}
embed = llava_image_embed_make_with_prompt_base64(ctx_llava->ctx_clip, params->cpuparams.n_threads, prompt);
if (!embed) {
LOG_ERR("%s: can't load image from prompt\n", __func__);
return NULL;
}
params->prompt = remove_image_from_prompt(prompt);
} else {
---> embed = llava_image_embed_make_with_filename(ctx_llava->ctx_clip, params->cpuparams.n_threads, fname.c_str());
if (!embed) {
fprintf(stderr, "%s: is %s really an image file?\n", __func__, fname.c_str());
return NULL;
}
}
return embed;
}
And in our case this will invoke llava_image_embed_make_with_filename
and
notice that the clip context is passed in here.
struct llava_image_embed * llava_image_embed_make_with_filename(struct clip_ctx * ctx_clip, int n_threads, const char * image_path) {
unsigned char* image_bytes;
long image_bytes_length;
auto loaded = load_file_to_bytes(image_path, &image_bytes, &image_bytes_length);
if (!loaded) {
LOG_ERR("%s: failed to load %s\n", __func__, image_path);
return NULL;
}
llava_image_embed *embed = llava_image_embed_make_with_bytes(ctx_clip, n_threads, image_bytes, image_bytes_length);
free(image_bytes);
return embed;
}
load_file_to_bytes
is a function that reads a file into a buffer and returns
the buffer and the length of the buffer (inout parameters). So lets focus on
the function llava_image_embed_make_with_bytes
:
struct llava_image_embed * llava_image_embed_make_with_bytes(struct clip_ctx * ctx_clip,
int n_threads, const unsigned char * image_bytes, int image_bytes_length) {
clip_image_u8 * img = clip_image_u8_init();
if (!clip_image_load_from_bytes(image_bytes, image_bytes_length, img)) {
clip_image_u8_free(img);
LOG_ERR("%s: can't load image from bytes, is it a valid image?", __func__);
return NULL;
}
float* image_embed = NULL;
int n_image_pos = 0;
bool image_embed_result = llava_image_embed_make_with_clip_img(ctx_clip, n_threads, img, &image_embed, &n_image_pos);
if (!image_embed_result) {
clip_image_u8_free(img);
LOG_ERR("%s: coulnd't embed the image\n", __func__);
return NULL;
}
clip_image_u8_free(img);
auto result = (llava_image_embed*)malloc(sizeof(llava_image_embed));
result->embed = image_embed;
result->n_image_pos = n_image_pos;
return result;
}
A clip_image_u8
is created to hold the loaded image in memory:
struct clip_image_u8 * clip_image_u8_init() {
return new clip_image_u8();
}
// RGB uint8 image
struct clip_image_u8 {
int nx;
int ny;
std::vector<uint8_t> buf;
};
And this will be passed to clip_image_load_from_bytes
:
bool clip_image_load_from_bytes(const unsigned char * bytes, size_t bytes_length, struct clip_image_u8 * img) {
int nx, ny, nc;
auto * data = stbi_load_from_memory(bytes, bytes_length, &nx, &ny, &nc, 3);
if (!data) {
LOG_ERR("%s: failed to decode image bytes\n", __func__);
return false;
}
build_clip_img_from_data(data, nx, ny, img);
stbi_image_free(data);
return true;
}
This uses the stb_image library to decode the image bytes into raw RGB data, and build_clip_img_from_data then copies the decoded pixels into the clip_image_u8 struct:
static void build_clip_img_from_data(const stbi_uc * data, int nx, int ny, clip_image_u8 * img) {
img->nx = nx;
img->ny = ny;
img->buf.resize(3 * nx * ny);
memcpy(img->buf.data(), data, img->buf.size());
}
(gdb) p *img
$2 = {nx = 800, ny = 663, buf = std::vector of length 1591200, capacity 1591200 = {0 '\000'
...
struct llava_image_embed * llava_image_embed_make_with_bytes(struct clip_ctx * ctx_clip,
int n_threads, const unsigned char * image_bytes, int image_bytes_length) {
...
float* image_embed = NULL;
int n_image_pos = 0;
bool image_embed_result = llava_image_embed_make_with_clip_img(ctx_clip, n_threads, img, &image_embed, &n_image_pos);
if (!image_embed_result) {
clip_image_u8_free(img);
LOG_ERR("%s: coulnd't embed the image\n", __func__);
return NULL;
}
clip_image_u8_free(img);
auto result = (llava_image_embed*)malloc(sizeof(llava_image_embed));
result->embed = image_embed;
result->n_image_pos = n_image_pos;
return result;
}
bool llava_image_embed_make_with_clip_img(clip_ctx * ctx_clip, int n_threads,
const clip_image_u8 * img, float ** image_embd_out, int * n_img_pos_out) {
int num_max_patches = 6;
if (clip_is_minicpmv(ctx_clip)) {
num_max_patches = 10;
}
float * image_embd = (float *)malloc(clip_embd_nbytes(ctx_clip)*num_max_patches); // TODO: base on gridsize/llava model
if (!image_embd) {
LOG_ERR("Unable to allocate memory for image embeddings\n");
return false;
}
int n_img_pos;
if (!encode_image_with_clip(ctx_clip, n_threads, img, image_embd, &n_img_pos)) {
LOG_ERR("%s: cannot encode image, aborting\n", __func__);
free(image_embd);
return false;
}
*image_embd_out = image_embd;
*n_img_pos_out = n_img_pos;
return true;
}
size_t clip_embd_nbytes(const struct clip_ctx * ctx) {
return clip_n_patches(ctx) * clip_n_mmproj_embd(ctx) * sizeof(float);
}
int clip_n_patches(const struct clip_ctx * ctx) {
const auto & params = ctx->vision_model.hparams;
int n_patches = (params.image_size / params.patch_size) * (params.image_size / params.patch_size);
if (ctx->proj_type == PROJECTOR_TYPE_LDP || ctx->proj_type == PROJECTOR_TYPE_LDPV2) {
n_patches /= 4;
} else if (ctx->proj_type == PROJECTOR_TYPE_RESAMPLER) {
if (ctx->minicpmv_version == 2) {
n_patches = 96;
}
else if (ctx->minicpmv_version == 3) {
n_patches = 64;
}
}
return n_patches;
}
(gdb) p params.image_size
$8 = 336
(gdb) p params.patch_size
$9 = 14
(gdb) p params.image_size / params.patch_size
$10 = 24
(gdb) p n_patches
$12 = 576
So 576 will be returned from clip_n_patches
.
int clip_n_mmproj_embd(const struct clip_ctx * ctx) {
if (ctx->proj_type == PROJECTOR_TYPE_LDP) {
return ctx->vision_model.mm_model_block_1_block_2_1_b->ne[0];
}
if (ctx->proj_type == PROJECTOR_TYPE_LDPV2) {
return ctx->vision_model.mm_model_peg_0_b->ne[0];
}
if (ctx->proj_type == PROJECTOR_TYPE_MLP) {
return ctx->vision_model.mm_2_b->ne[0];
}
if (ctx->proj_type == PROJECTOR_TYPE_MLP_NORM) {
return ctx->vision_model.mm_3_b->ne[0];
}
if (ctx->proj_type == PROJECTOR_TYPE_RESAMPLER) {
if (ctx->minicpmv_version == 2) {
return 4096;
}
else if (ctx->minicpmv_version == 3) {
return 3584;
}
}
std::string proj_type = PROJECTOR_TYPE_NAMES[ctx->proj_type];
throw std::runtime_error(format("%s: don't support projector with: %s currently\n", __func__, proj_type.c_str()));
}
(gdb) p ctx->vision_model.mm_2_b->ne[0]
$13 = 4096
So clip_embd_nbytes will return 576 * 4096 * 4 = 9437184 bytes. The malloc above multiplies this by num_max_patches (6), so roughly 56.6 MB is allocated for the image embeddings.
Next we have:
int n_img_pos;
if (!encode_image_with_clip(ctx_clip, n_threads, img, image_embd, &n_img_pos)) {
LOG_ERR("%s: cannot encode image, aborting\n", __func__);
free(image_embd);
return false;
}
*image_embd_out = image_embd;
*n_img_pos_out = n_img_pos;
return true;
}
static bool encode_image_with_clip(clip_ctx * ctx_clip, int n_threads,
const clip_image_u8 * img, float * image_embd, int * n_img_pos) {
clip_image_f32_batch img_res_v;
img_res_v.size = 0;
img_res_v.data = nullptr;
if (!clip_image_preprocess(ctx_clip, img, &img_res_v)) {
LOG_ERR("%s: unable to preprocess image\n", __func__);
delete[] img_res_v.data;
return false;
}
This will call clip_image_preprocess
which is a pretty big/long function but
I'll try to go over the parts that are relevant to this session and also
connect things back to the earlier parts of the code:
bool clip_image_preprocess(struct clip_ctx * ctx, const clip_image_u8 * img, clip_image_f32_batch * res_imgs) {
...
bool pad_to_square = true;
...
if (strcmp(params.mm_patch_merge_type, "spatial_unpad") == 0) {
pad_to_square = false;
}
clip_image_u8 * temp = clip_image_u8_init(); // we will keep the input image data here temporarily
...
} else {
if (params.image_grid_pinpoints[0] != 0) {
// "spatial_unpad" with "anyres" processing for llava-1.6
std::vector<std::pair<int, int>> possible_resolutions;
for (int i = 0; i < 32 && params.image_grid_pinpoints[i] != 0; i+=2) {
possible_resolutions.push_back({params.image_grid_pinpoints[i], params.image_grid_pinpoints[i+1]});
}
std::pair<int, int> best_resolution = select_best_resolution({img->nx, img->ny}, possible_resolutions);
Here we can see how the pinpoints are used which I was not 100% sure about above. First the possible resolutions are extracted from the pinpoints.
(gdb) p params.image_grid_pinpoints
$17 = {336, 672, 672, 336, 672, 672, 1008, 336, 336, 1008, 0 <repeats 22 times>}
(gdb) p possible_resolutions
$19 = std::vector of length 5, capacity 8 = {
{first = 336, second = 672},
{first = 672, second = 336},
{first = 672, second = 672},
{first = 1008, second = 336},
{first = 336, second = 1008}}
(gdb) p img->nx
$20 = 800
(gdb) p img->ny
$21 = 663
These will be passed to select_best_resolution, which will try to figure out how the input image can be resized to fit an image resolution that this clip model supports.
static std::pair<int, int> select_best_resolution(const std::pair<int, int> & original_size,
const std::vector<std::pair<int, int>> & possible_resolutions) {
int original_width = original_size.first;
int original_height = original_size.second;
std::pair<int, int> best_fit;
int max_effective_resolution = 0;
int min_wasted_resolution = std::numeric_limits<int>::max();
for (const auto& resolution : possible_resolutions) {
int width = resolution.first;
int height = resolution.second;
float scale = std::min(static_cast<float>(width) / original_width, static_cast<float>(height) / original_height);
int downscaled_width = static_cast<int>(original_width * scale);
int downscaled_height = static_cast<int>(original_height * scale);
int effective_resolution = std::min(downscaled_width * downscaled_height, original_width * original_height);
int wasted_resolution = (width * height) - effective_resolution;
if (effective_resolution > max_effective_resolution || (effective_resolution == max_effective_resolution && wasted_resolution < min_wasted_resolution)) {
max_effective_resolution = effective_resolution;
min_wasted_resolution = wasted_resolution;
best_fit = resolution;
}
}
return best_fit;
}
So in our case the above will calculate how we can resize the input image to a resolution that is supported by the clip model. The best resolution is the one that preserves the most of the original image (largest effective resolution), using the least wasted resolution as a tie-breaker. This is what we want to do:
original image rescaled image suggestion for
first possible resolution
800 336
+-------------------------+ +--------------+
| | | |
| | 663 | | 672
| | => | |
| | | |
| | | |
| | | |
+-------------------------+ | |
+--------------+
original_width = 800 (img->nx)
original_height = 663 (img->ny)
width x height
336 x 672:
scale = min(336 / 800, 672 / 663) = min(0.42, 1.01) = 0.42
downscaled_width  = 800 * 0.42 = 336
downscaled_height = 663 * 0.42 = 278 (278.46 truncated to an int)
So this is scaling the image to fit the resolution 336x672, which results in a 336x278 image if the first candidate resolution is used.
effective_resolution_new = 336 * 278 = 93408
effective_resolution_org = 800 * 663 = 530400
effective_resolution = min(93408, 530400) = 93408
int wasted_resolution = (width * height) - effective_resolution;
wasted_resolution = (336 * 672) - 93408 = 225792 - 93408 = 132384
Wasted resolution in this case would look something like this:
Possible resolution (336x672)
+-------------------+
| |
| +-------------+ |
| | 336x278 | |
| | (Effective | |
| | Resolution) | |
| | | |
| +-------------+ |
| |
| Wasted Resolution|
| |
+-------------------+
So the effective resolution is the area used up by the rescaled image. The wasted resolution is the space outside which is just unused pixels. So we want a large effective resolution and a small wasted resolution.
So the above function's job is to find the candidate resolution that keeps the largest effective resolution, breaking ties by the least wasted resolution.
if (effective_resolution > max_effective_resolution ||
(effective_resolution == max_effective_resolution && wasted_resolution < min_wasted_resolution)) {
I can understand this part of the first check if the effective resolution is
larger than the previous max effective resolution then we have a new best fit.
But I'm not sure I understand how the effective resolution can be the same as
the max effective resolution and the wasted resolution less than the min wasted
resolution?
This can happen, for example, with an 800x600 image and two candidate resolutions, A = 400x400 and B = 400x300:
Resolution A (400x400):
Scale factor = min(400/800, 400/600) = min(0.5, 0.6667) = 0.5
Downscaled width  = 800 * 0.5 = 400
Downscaled height = 600 * 0.5 = 300
Effective resolution = 400 * 300 = 120,000 pixels
Resolution B (400x300):
Scale factor = min(400/800, 300/600) = min(0.5, 0.5) = 0.5
Downscaled width  = 800 * 0.5 = 400
Downscaled height = 600 * 0.5 = 300
Effective resolution = 400 * 300 = 120,000 pixels
Original Image (800x600)
+------------------------+
| |
| |
| |
| 800 x 600 |
| |
| |
+------------------------+
Resolution A (400x400) Resolution B (400x300)
+------------------+ +------------------+
| +------------+ | | |
| | | | | 400x300 |
| | (Effective)| | | (Effective) |
| | | | | |
| | | | | |
| +------------+ | +------------------+
| |
| Wasted Space |
| |
+------------------+
Effective Resolution (both): 400 * 300 = 120,000 pixels
Wasted Resolution A: (400 * 400) - 120,000 = 40,000 pixels
Wasted Resolution B: (400 * 300) - 120,000 = 0 pixels
In this case the best fit will be:
(gdb) p best_resolution
$31 = {first = 672, second = 672}
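To see why, here is the same calculation for all five candidates (values truncated to integers like in the code, so they may be off by one):
(336, 672) : scale = min(336/800, 672/663)  = 0.42  -> 336 x 278 -> effective ≈  93408
(672, 336) : scale = min(672/800, 336/663)  ≈ 0.507 -> 405 x 336 -> effective ≈ 136080
(672, 672) : scale = min(672/800, 672/663)  = 0.84  -> 672 x 556 -> effective ≈ 373632
(1008, 336): scale = min(1008/800, 336/663) ≈ 0.507 -> 405 x 336 -> effective ≈ 136080
(336, 1008): scale = min(336/800, 1008/663) = 0.42  -> 336 x 278 -> effective ≈  93408
672x672 preserves by far the largest part of the original 800x663 image, so it wins.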
Next, we'll use the best resolution to actually resize the input image:
resize_and_pad_image(*img, *temp, best_resolution); // we do not pad with mean-bg color anymore in llava-1.6
This will produce a resized image, something like this:
Original Image Resized Image Padded Output
+-------------+ +----------+ +-------------+
| | | | | black |
| | | | | +---------+ |
| | => | | => | | resized | |
| | | | | | image | |
| | | | | +---------+ |
+-------------+ +----------+ | black |
+-------------+
This approach is common in image processing for machine learning models, as it ensures all images are the same size without distorting their content.
TODO: Get familiar with the resize_and_pad_image
function and how this
actually works.
Next the image is going to be split into patches:
// prepare spatial sorted main patches of image_size each (336 in llava-1.6)
std::vector<clip_image_u8 *> patches = divide_to_patches_u8(*temp, params.image_size);
static std::vector<clip_image_u8*> divide_to_patches_u8(const clip_image_u8 & image, int patch_size) {
std::vector<clip_image_u8*> patches;
int width = image.nx;
int height = image.ny;
for (int i = 0; i < height; i += patch_size) {
for (int j = 0; j < width; j += patch_size) {
clip_image_u8 *patch = clip_image_u8_init();
patch->nx = std::min(patch_size, width - j);
patch->ny = std::min(patch_size, height - i);
patch->buf.resize(3 * patch->nx * patch->ny);
for (int y = 0; y < patch->ny; ++y) {
for (int x = 0; x < patch->nx; ++x) {
for (int c = 0; c < 3; ++c) {
patch->buf[3 * (y * patch->nx + x) + c] = image.buf[3 * ((i + y) * width + (j + x)) + c];
}
}
}
patches.push_back(patch);
}
}
return patches;
}
So this will split the image which is 672x672 into 336x336 patches, which results in 4 patches.
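With a 672x672 padded image and params.image_size = 336, the loop indices i and j each take the values 0 and 336, so we get:
(i=0,   j=0)   -> top-left     336x336 patch
(i=0,   j=336) -> top-right    336x336 patch
(i=336, j=0)   -> bottom-left  336x336 patch
(i=336, j=336) -> bottom-right 336x336 patch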
Now, next we will resize the original image into a 336x336 image, and then insert this as an additional patch to the beginning of the patches.
clip_image_u8 *image_original_resize = clip_image_u8_init();
bicubic_resize(*img, *image_original_resize, params.image_size, params.image_size);
patches.insert(patches.begin(), image_original_resize);
So we will now have 5 patches, the first being the resized original image and the rest being the patches of the rescaled image using the best resolution.
Next the res_imgs
which is of type clip_image_f32_batch
has its fields set:
res_imgs->size = patches.size();
res_imgs->data = new clip_image_f32[res_imgs->size];
int num=0;
for (auto& patch : patches) {
normalize_image_u8_to_f32(patch, &res_imgs->data[num], ctx->image_mean, ctx->image_std);
num++;
}
The function normalize_image_u8_to_f32 will convert the image from uint8 to float32 and normalize it using the image_mean and image_std values which we saw previously, and now we can see how they are used. Recall that these are arrays of size 3, one entry per RGB channel.
For each patch the following will be performed:
static void normalize_image_u8_to_f32(const clip_image_u8* src, clip_image_f32* dst, const float mean[3], const float std[3]) {
dst->nx = src->nx;
dst->ny = src->ny;
dst->buf.resize(src->buf.size());
for (size_t i = 0; i < src->buf.size(); ++i) {
int c = i % 3; // rgb
dst->buf[i] = (static_cast<float>(src->buf[i]) / 255.0f - mean[c]) / std[c];
}
}
// normalized_value = (pixel_value / 255.0 - mean) / std
So prior to the normalization, a single entry in the u8 image buffer represents one channel of one pixel, and the entries are grouped in threes for the RGB channels. We take each channel value, which is a u8 between 0-255, and divide it by 255.0 to get a value between 0-1. This is then normalized by subtracting the mean and dividing by the standard deviation for the specific channel being processed. This prepares the data to be processed by the neural network, and even though we have normalized the data, the relationships between the pixel values are preserved.
After this the patches are freed and the function returns true. The normalized
patches are now in the in-out parameter res_imgs
.
So this will bring us back into encode_image_with_clip
in llava.cpp:
const int64_t t_img_enc_start_us = ggml_time_us();
const char * mm_patch_merge_type = clip_patch_merge_type(ctx_clip);
if (clip_is_minicpmv(ctx_clip)) {
...
}
...
else {
// spatial_unpad llava-1.6 type embedding
std::vector<float *> image_embd_v;
image_embd_v.resize(img_res_v.size);
for (size_t i = 0; i < img_res_v.size; i++) {
image_embd_v[i] = (float *) malloc(clip_embd_nbytes(ctx_clip)); // 576 patches * 4096 embeddings * 4 bytes = 9437184
const bool encoded = clip_image_encode(ctx_clip, n_threads, &img_res_v.data[i], image_embd_v[i]);
if (!encoded) {
LOG_ERR("Unable to encode image - spatial_unpad - subimage %d of %d\n", (int) i+1, (int) img_res_v.size);
return false;
}
}
So the above will create a vector of float pointers which will hold 5 elements.
Then it will loop over all the normalized patches and encode them using the
clip_image_encode
function.
bool clip_image_encode(struct clip_ctx * ctx, const int n_threads, clip_image_f32 * img, float * vec) {
clip_image_f32_batch imgs{};
imgs.size = 1;
imgs.data = img;
return clip_image_batch_encode(ctx, n_threads, &imgs, vec);
}
bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs, float * vec) {
if (!ctx->has_vision_encoder) {
LOG_ERR("This gguf file seems to have no vision encoder\n");
return false;
}
int batch_size = imgs->size;
ggml_cgraph * gf = clip_image_build_graph(ctx, imgs, ctx->load_image_size, true);
ggml_gallocr_alloc_graph(ctx->compute_alloc, gf);
Now, we have gone through clip_image_build_graph
before, but this time we are
passing in the normalized patches.
And recall that this is the computation graph for images.
Next the inputs will be set:
const auto & model = ctx->vision_model;
const auto & hparams = model.hparams;
const int image_size = hparams.image_size;
int image_size_width = image_size;
int image_size_height = image_size;
Lets inspect some of these values so we know what they are later if needed:
(gdb) p image_size
$79 = 336
(gdb) p image_size_width
$80 = 336
(gdb) p image_size_height
$81 = 336
(gdb) p patch_size
$82 = 14
(gdb) p num_patches
$83 = 576
(gdb) p num_positions
$84 = 577
Next we have:
if(ctx->load_image_size==nullptr){
ctx->load_image_size= clip_image_size_init();
}
const int pos_w = ctx->load_image_size->width/patch_size;
const int pos_h = ctx->load_image_size->height/patch_size;
This statement is true and will create a new clip_image_size
struct:
(gdb) p *ctx->load_image_size
$87 = {width = 448, height = 448}
This load image size, what is this about? We already loaded the image, but perhaps this is something else; let's see how it is used later.
Next we will get the tensor inp_raw
from the computation graph:
struct ggml_tensor * inp_raw = ggml_graph_get_tensor(gf, "inp_raw");
float * data = (float *)malloc(ggml_nbytes(inp_raw));
And we are now going to populate this tensor's data with the normalized patches. Let's first take a look at imgs:
(gdb) p imgs->size
$98 = 1
(gdb) p imgs->data[0]
$99 = {nx = 336, ny = 336, buf = std::vector of length 338688, capacity 338688 = {-1.79226255, -1.73708928, -1.39489937,
-1.79226255, -1.73708928, -1.39489937, -1.79226255, -1.73708928, -1.39489937, -1.79226255, -1.73708928, -1.39489937,
...
(gdb) p batch_size
$102 = 1
So the following will loop over the single normalized patch:
for (size_t i = 0; i < imgs->size; i++) { // could be batch_size for clarity perhaps?
const int nx = imgs->data[i].nx;
const int ny = imgs->data[i].ny;
const int n = nx * ny;
for (int b = 0; b < batch_size; b++) {
for (int k = 0; k < 3; k++) {
for (int y = 0; y < ny; y++) {
for (int x = 0; x < nx; x++) {
data[(b * 3 * n) + k * n + y * nx + x] = imgs->data[b].buf[3 * (y * nx + x) + k];
}
}
}
}
}
ggml_backend_tensor_set(inp_raw, data, 0, ggml_nbytes(inp_raw));
free(data);
So the for loop (the first inner one) will iterate over the batches, which is just one at the moment. The second for loop (k) iterates over the RGB channels. The for loop with y is the height of the image and the for loop with x is the width of the image. The actual setting is done by this line:
data[(b * 3 * n) + k * n + y * nx + x] = imgs->data[b].buf[3 * (y * nx + x) + k];
(b * 3 * n) moves to the start of the current image in the batch. Will always be 0 in this case.
k * n selects the current color channel (k), and n is size of one channel (nx * ny).
y * nx selects the current row, and nx is the width of the image (336 in this case).
+ x selects the current column within the row.
Let's take a look at a simpler example to understand this better:
3x3 image with 3 channels (RGB):
R G B | R G B | R G B
R G B | R G B | R G B
R G B | R G B | R G B
Same with numbers so we can reference them below:
R1 G1 B1 | R2 G2 B2 | R3 G3 B3
R4 G4 B4 | R5 G5 B5 | R6 G6 B6
R7 G7 B7 | R8 G8 B8 | R9 G9 B9
Idx = 3 * (y * nx + x) + k
3 = channels (Red Green Blue)
y = 0-2 (row)
nx = 3 (width)
k = 0-2 (channel, where 0=R, 1=G, 2=B)
y nx x k idx
R1: 3 * (0 * 3 + 0) + 0 = 0
G1: 3 * (0 * 3 + 0) + 1 = 1
B1: 3 * (0 * 3 + 0) + 2 = 2
R2: 3 * (0 * 3 + 1) + 0 = 3
G2: 3 * (0 * 3 + 1) + 1 = 4
B2: 3 * (0 * 3 + 1) + 2 = 5
R3: 3 * (0 * 3 + 2) + 0 = 6
G3: 3 * (0 * 3 + 2) + 1 = 7
B3: 3 * (0 * 3 + 2) + 2 = 8
R4: 3 * (1 * 3 + 0) + 0 = 9
G4: 3 * (1 * 3 + 0) + 1 = 10
B4: 3 * (1 * 3 + 0) + 2 = 11
R5: 3 * (1 * 3 + 1) + 0 = 12
G5: 3 * (1 * 3 + 1) + 1 = 13
B5: 3 * (1 * 3 + 1) + 2 = 14
R6: 3 * (1 * 3 + 2) + 0 = 15
G6: 3 * (1 * 3 + 2) + 1 = 16
B6: 3 * (1 * 3 + 2) + 2 = 17
R7: 3 * (2 * 3 + 0) + 0 = 18
G7: 3 * (2 * 3 + 0) + 1 = 19
B7: 3 * (2 * 3 + 0) + 2 = 20
R8: 3 * (2 * 3 + 1) + 0 = 21
G8: 3 * (2 * 3 + 1) + 1 = 22
B8: 3 * (2 * 3 + 1) + 2 = 23
R9: 3 * (2 * 3 + 2) + 0 = 24
G9: 3 * (2 * 3 + 2) + 1 = 25
B9: 3 * (2 * 3 + 2) + 2 = 26
Reorganized 3x3 image with 3 channels (RGB):
R R R | R R R | R R R
G G G | G G G | G G G
B B B | B B B | B B B
With numbers:
R1 R2 R3 | R4 R5 R6 | R7 R8 R9
G1 G2 G3 | G4 G5 G6 | G7 G8 G9
B1 B2 B3 | B4 B5 B6 | B7 B8 B9
Index = (0 * 3 * 9) + k * 9 + y * 3 + x
R1: (0 * 3 * 9) + 0 * 9 + 0 * 3 + 0 = 0
R2: (0 * 3 * 9) + 0 * 9 + 0 * 3 + 1 = 1
R3: (0 * 3 * 9) + 0 * 9 + 0 * 3 + 2 = 2
R4: (0 * 3 * 9) + 0 * 9 + 1 * 3 + 0 = 3
R5: (0 * 3 * 9) + 0 * 9 + 1 * 3 + 1 = 4
R6: (0 * 3 * 9) + 0 * 9 + 1 * 3 + 2 = 5
R7: (0 * 3 * 9) + 0 * 9 + 2 * 3 + 0 = 6
R8: (0 * 3 * 9) + 0 * 9 + 2 * 3 + 1 = 7
R9: (0 * 3 * 9) + 0 * 9 + 2 * 3 + 2 = 8
G1: (0 * 3 * 9) + 1 * 9 + 0 * 3 + 0 = 9
G2: (0 * 3 * 9) + 1 * 9 + 0 * 3 + 1 = 10
G3: (0 * 3 * 9) + 1 * 9 + 0 * 3 + 2 = 11
G4: (0 * 3 * 9) + 1 * 9 + 1 * 3 + 0 = 12
G5: (0 * 3 * 9) + 1 * 9 + 1 * 3 + 1 = 13
G6: (0 * 3 * 9) + 1 * 9 + 1 * 3 + 2 = 14
G7: (0 * 3 * 9) + 1 * 9 + 2 * 3 + 0 = 15
G8: (0 * 3 * 9) + 1 * 9 + 2 * 3 + 1 = 16
G9: (0 * 3 * 9) + 1 * 9 + 2 * 3 + 2 = 17
B1: (0 * 3 * 9) + 2 * 9 + 0 * 3 + 0 = 18
B2: (0 * 3 * 9) + 2 * 9 + 0 * 3 + 1 = 19
B3: (0 * 3 * 9) + 2 * 9 + 0 * 3 + 2 = 20
B4: (0 * 3 * 9) + 2 * 9 + 1 * 3 + 0 = 21
B5: (0 * 3 * 9) + 2 * 9 + 1 * 3 + 1 = 22
B6: (0 * 3 * 9) + 2 * 9 + 1 * 3 + 2 = 23
B7: (0 * 3 * 9) + 2 * 9 + 2 * 3 + 0 = 24
B8: (0 * 3 * 9) + 2 * 9 + 2 * 3 + 1 = 25
B9: (0 * 3 * 9) + 2 * 9 + 2 * 3 + 2 = 26
Notice what the for loops are doing in the innermost loop:
for (int x = 0; x < nx; x++) {
data[(b * 3 * n) + k * n + y * nx + x] = imgs->data[b].buf[3 * (y * nx + x) + k];
}
For example, with k = 1 (the G channel), y = 0 and x = 0 (G1):
data[(0 * 3 * n) + 1 * n + y * nx + x] = imgs->data[b].buf[3 * (y * nx + x) + 1];
data[(0 * 3 * 9) + 1 * 9 + 0 * 3 + 0]  = imgs->data[b].buf[3 * (0 * 3 + 0) + 1];
data[9]                                = imgs->data[b].buf[1];
Org G1  : 3 * (0 * 3 + 0) + 1             = 1
Reorg G1: (0 * 3 * 9) + 1 * 9 + 0 * 3 + 0 = 9
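The following standalone sketch performs the same interleaved (HWC) to planar (CHW) reordering as the loop above on the 3x3 example, so the index walk-through can be verified by running it; the names here are made up for illustration:
#include <cstdio>

int main() {
    const int nx = 3, ny = 3, n = nx * ny;

    float interleaved[3 * 9];                     // R1 G1 B1 R2 G2 B2 ... (HWC)
    for (int i = 0; i < 3 * n; i++) interleaved[i] = (float) i;

    float planar[3 * 9];                          // R1..R9 G1..G9 B1..B9 (CHW)
    for (int k = 0; k < 3; k++) {                 // channel
        for (int y = 0; y < ny; y++) {            // row
            for (int x = 0; x < nx; x++) {        // column
                planar[k * n + y * nx + x] = interleaved[3 * (y * nx + x) + k];
            }
        }
    }

    // planar[9] now holds G1, matching the index walk-through above.
    printf("planar[9] = %.0f (was interleaved[1])\n", planar[9]);
    return 0;
}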
This data is then set on the backend tensor (device memory):
ggml_backend_tensor_set(inp_raw, data, 0, ggml_nbytes(inp_raw));
free(data);
Next we have the class embedding, which recall is the classification token. The embeddings tensor is initialized to zero here, which makes sense as it will be populated during the forward pass (we will see an acc operation adding v.class_embd into it later):
{
if (ctx->has_class_embedding) {
struct ggml_tensor * embeddings = ggml_graph_get_tensor(gf, "embeddings");
void* zero_mem = malloc(ggml_nbytes(embeddings));
memset(zero_mem, 0, ggml_nbytes(embeddings));
ggml_backend_tensor_set(embeddings, zero_mem, 0, ggml_nbytes(embeddings));
free(zero_mem);
}
}
Next we have the positions tensor, which will be populated with the positions 0 up to num_positions (577), that is 0..576:
{
struct ggml_tensor * positions = ggml_graph_get_tensor(gf, "positions");
int* positions_data = (int*)malloc(ggml_nbytes(positions));
for (int i = 0; i < num_positions; i++) {
positions_data[i] = i;
}
ggml_backend_tensor_set(positions, positions_data, 0, ggml_nbytes(positions));
free(positions_data);
}
Next we have patches, which is also initialized in a similar way to the positions, but starting at 1 so that the first position (the class token) is skipped:
{
struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
int* patches_data = (int*)malloc(ggml_nbytes(patches));
for (int i = 0; i < num_patches; i++) {
patches_data[i] = i + 1;
}
ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
free(patches_data);
}
If I recall correctly I think patches is used with a get_rows
operation.
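To illustrate what that would look like, here is a hypothetical, minimal ggml sketch (not the actual clip.cpp graph) where a patches index tensor is used with ggml_get_rows to drop row 0, the class token, from an embeddings tensor:
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = { 16 * 1024 * 1024, nullptr, false };
    struct ggml_context * ctx = ggml_init(params);

    // 4 "positions" (1 class token + 3 patches), embedding size 2.
    struct ggml_tensor * embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
    for (int i = 0; i < 8; i++) ggml_set_f32_1d(embd, i, (float) i);

    // Indices 1..3, i.e. skip row 0 (the class token).
    struct ggml_tensor * patches = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);
    for (int i = 0; i < 3; i++) ggml_set_i32_1d(patches, i, i + 1);

    struct ggml_tensor * rows = ggml_get_rows(ctx, embd, patches);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, rows);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    // rows is [2, 3]: the three patch rows without the class token row.
    printf("rows->ne = [%lld, %lld]\n", (long long) rows->ne[0], (long long) rows->ne[1]);
    ggml_free(ctx);
    return 0;
}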
Next for CPU backends we set the threads to be used:
if (ggml_backend_is_cpu(ctx->backend)) {
ggml_backend_cpu_set_n_threads(ctx->backend, n_threads);
}
Then we compute the graph:
ggml_backend_graph_compute(ctx->backend, gf);
This will perform the forward pass of the graph that was built previously.
static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *)backend->context;
ggml_cuda_set_device(cuda_ctx->device);
Let's start by inspecting the backend CUDA context:
(gdb) p *cuda_ctx
$2 = {
device = 0, name = "CUDA0", copy_event = 0x0,
streams = {{0x5555566d1690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0} <repeats 15 times>},
cublas_handles = {0x555557534530, 0x0 <repeats 15 times>},
cuda_graph = std::unique_ptr<ggml_cuda_graph> = {
get() = 0x555556a4d560},
pools = {std::unique_ptr<ggml_cuda_pool> = {
get() = 0x5555573e8620},
std::unique_ptr<ggml_cuda_pool> = {
get() = 0x0
} <repeats 15 times>
}
}
(gdb) ptype ggml_cuda_graph
type = struct ggml_cuda_graph {
cudaGraph_t graph;
cudaGraphExec_t instance;
size_t num_nodes;
std::vector<CUgraphNode_st*> nodes;
std::vector<cudaKernelNodeParams> params;
bool disable_due_to_gpu_arch;
bool disable_due_to_too_many_updates;
bool disable_due_to_failed_graph_capture;
int number_consecutive_updates;
std::vector<ggml_graph_node_properties> ggml_graph_properties;
std::vector<char**> updated_kernel_arg;
~ggml_cuda_graph(void);
}
(gdb) ptype ggml_cuda_pool
type = struct ggml_cuda_pool {
~ggml_cuda_pool(void);
virtual void * alloc(size_t, size_t *);
virtual void free(void *, size_t);
}
These structs are defined in ggml/src/ggml-cuda/common.cuh
.
So above we can see that the CUDA device is set. This is done in case a different
device is to be used: ggml_cuda_set_device checks whether the requested device is
already the current one, and if so it does nothing.
#ifdef USE_CUDA_GRAPH
static const bool disable_cuda_graphs_due_to_env = (getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr);
// Objects required for CUDA Graph
if (cuda_ctx->cuda_graph == nullptr) {
cuda_ctx->cuda_graph.reset(new ggml_cuda_graph());
}
bool use_cuda_graph = true;
bool cuda_graph_update_required = false;
// vector of pointers to CUDA cpy kernels, which are required to identify
// kernel parameters which need updated in the graph for each token
std::vector<void *> ggml_cuda_cpy_fn_ptrs;
In our case USE_CUDA_GRAPH
is defined so we will use CUDA graphs.
Next we have:
if (cuda_ctx->cuda_graph->graph == nullptr) {
if (ggml_cuda_info().devices[cuda_ctx->device].cc < CC_AMPERE) {
cuda_ctx->cuda_graph->disable_due_to_gpu_arch = true;
#ifndef NDEBUG
GGML_LOG_WARN("%s: disabling CUDA graphs due to GPU architecture\n", __func__);
#endif
}
}
The above is using ggml_cuda_info()
which returns the following:
(gdb) p ggml_cuda_info()
$7 = (const ggml_cuda_device_info &) @0x7ffff78ec120: {
device_count = 1,
devices = {
{cc = 890, nsm = 46, smpb = 49152, smpbo = 101376, vmm = true, vmm_granularity = 2097152, total_vram = 0},
{cc = 0, nsm = 0, smpb = 0, smpbo = 0, vmm = false, vmm_granularity = 0, total_vram = 0} <repeats 15 times>},
default_tensor_split = {_M_elems = {0 <repeats 16 times>}}}
cc is the compute capability of the GPU; 8.9 in this case matches the NVIDIA GeForce RTX 4070 that I have.
nsm is the number of streaming multiprocessors.
smpb is the shared memory per block and smpbo is the opt-in maximum shared memory per block.
vmm indicates whether virtual memory management is supported.
vmm_granularity is the granularity of the virtual memory management.
total_vram is the total video memory. Strange that this is zero? I added the total memory of my GPU to minimal.cu and it prints the following information:
./minimal
CUDA Runtime version: 12.6
CUDA Driver version: 12.6
CUDA device count: 1
Device 0 - Total VRAM: 11.62 GB
Device 0:
Name: NVIDIA GeForce RTX 4070
Compute Capability: 8.9
Multiprocessors: 46
Clock Rate: 2505 MHz
Total Global Memory: 11.62 GB
L2 Cache Size: 36.00 MB
So the above check of .cc is getting the compute capability of the GPU and
checking if it is less than CC_AMPERE, in which case CUDA graphs are not
supported and are disabled.
#define CC_PASCAL 600
#define MIN_CC_DP4A 610 // minimum compute capability for __dp4a, an intrinsic for byte-wise dot products
#define CC_VOLTA 700
#define CC_TURING 750
#define CC_AMPERE 800
if (disable_cuda_graphs_due_to_env
|| cuda_ctx->cuda_graph->disable_due_to_gpu_arch
|| cuda_ctx->cuda_graph->disable_due_to_too_many_updates
|| cuda_ctx->cuda_graph->disable_due_to_failed_graph_capture) {
use_cuda_graph = false;
}
The following will iterate over all the nodes in the compute graph properties.
This is not something I've seen before so let's take a look at the
ggml_graph_node_properties
struct:
struct ggml_graph_node_properties {
void * node_address;
ggml_op node_op;
int64_t ne[GGML_MAX_DIMS];
size_t nb[GGML_MAX_DIMS];
void * src_address[GGML_MAX_SRC];
int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
};
So the following will use the tensor information to set these properties:
for (int i = 0; i < cgraph->n_nodes; i++) {
bool has_matching_properties = true;
if (!cuda_graph_update_required) {
has_matching_properties = ggml_graph_node_has_matching_properties(cgraph->nodes[i], &cuda_ctx->cuda_graph->ggml_graph_properties[i]);
}
if (!has_matching_properties) {
cuda_graph_update_required = true;
}
set_ggml_graph_node_properties(cgraph->nodes[i], &cuda_ctx->cuda_graph->ggml_graph_properties[i]);
}
#ifdef USE_CUDA_GRAPH
static void set_ggml_graph_node_properties(ggml_tensor * node, ggml_graph_node_properties * graph_node_properties) {
graph_node_properties->node_address = node->data;
graph_node_properties->node_op = node->op;
for (int i = 0; i < GGML_MAX_DIMS; i++) {
graph_node_properties->ne[i] = node->ne[i];
graph_node_properties->nb[i] = node->nb[i];
}
for (int i = 0; i < GGML_MAX_SRC; i++) {
graph_node_properties->src_address[i] = node->src[i] ? node->src[i]->data : nullptr;
}
memcpy(graph_node_properties->op_params, node->op_params, GGML_MAX_OP_PARAMS);
}
Notice that the above is using node->data
which is the data pointer of the
tensor and setting that as the node address. I'm not sure how this information
is used yet but hopefully this will become clear later.
Next there is another loop over all the nodes in the compute graph:
// Loop over nodes in GGML graph to obtain info needed for CUDA graph
cuda_ctx->cuda_graph->updated_kernel_arg.clear();
for (int i = 0; i < cgraph->n_nodes; i++) {
ggml_tensor * node = cgraph->nodes[i];
if (ggml_is_empty(node) || node->op == GGML_OP_RESHAPE ||
node->op == GGML_OP_TRANSPOSE || node->op == GGML_OP_VIEW ||
node->op == GGML_OP_PERMUTE || node->op == GGML_OP_NONE) {
continue;
}
if (node->src[0] && node->src[0]->buffer && ggml_backend_buffer_is_cuda_split(node->src[0]->buffer)) {
use_cuda_graph = false; // Split buffers are not supported by CUDA graph capture
#ifndef NDEBUG
GGML_LOG_WARN("%s: disabling CUDA graphs due to split buffer\n", __func__);
#endif
}
if (node->op == GGML_OP_MUL_MAT_ID) {
use_cuda_graph = false; // This node type is not supported by CUDA graph capture
GGML_LOG_WARN("%s: disabling CUDA graphs due to mul_mat_id\n", __func__);
}
updated_kernel_arg
is a vector of char** (pointers to kernel argument pointers):
struct ggml_cuda_graph {
#ifdef USE_CUDA_GRAPH
~ggml_cuda_graph() {
if (instance != nullptr) {
CUDA_CHECK(cudaGraphExecDestroy(instance));
}
if (graph != nullptr) {
CUDA_CHECK(cudaGraphDestroy(graph));
}
}
cudaGraph_t graph = nullptr;
cudaGraphExec_t instance = nullptr;
size_t num_nodes = 0;
std::vector<cudaGraphNode_t> nodes;
std::vector<cudaKernelNodeParams> params;
bool disable_due_to_gpu_arch = false;
bool disable_due_to_too_many_updates = false;
bool disable_due_to_failed_graph_capture = false;
int number_consecutive_updates = 0;
std::vector<ggml_graph_node_properties> ggml_graph_properties;
std::vector<char **> updated_kernel_arg;
#endif
};
Skipping ahead...
bool ok = ggml_cuda_compute_forward(*cuda_ctx, node);
(gdb) p node->src[0]->name
$8 = "embeddings", '\000' <repeats 53 times>
(gdb) p node->src[1]->name
$9 = "v.class_embd", '\000' <repeats 51 times>
(gdb) p node->op
$10 = GGML_OP_ACC
static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct ggml_tensor * dst) {
// why is this here instead of mul_mat?
if (dst->src[0] != nullptr && ggml_backend_buffer_is_cuda_split(dst->src[0]->buffer)) {
ggml_cuda_set_peer_access(dst->src[1]->ne[1], ctx.device);
}
switch (dst->op) {
...
case GGML_OP_ACC:
ggml_cuda_op_acc(ctx, dst);
break;
...
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
GGML_LOG_ERROR("%s: %s failed\n", __func__, ggml_op_desc(dst));
CUDA_CHECK(err);
}
return true;
This will land in ggml/src/ggml-cuda/acc.cu
:
void ggml_cuda_op_acc(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
const float * src0_d = (const float *)src0->data;
const float * src1_d = (const float *)src1->data;
float * dst_d = (float *)dst->data;
cudaStream_t stream = ctx.stream();
int nb1 = dst->op_params[0] / 4; // 4 bytes of float32
int nb2 = dst->op_params[1] / 4; // 4 bytes of float32
int offset = dst->op_params[3] / 4; // offset in bytes
acc_f32_cuda(src0_d, src1_d, dst_d, ggml_nelements(dst), src1->ne[0], src1->ne[1], src1->ne[2], nb1, nb2, offset, stream);
}
static void acc_f32_cuda(const float * x, const float * y, float * dst, const int n_elements,
const int ne10, const int ne11, const int ne12,
const int nb1, const int nb2, const int offset, cudaStream_t stream) {
int num_blocks = (n_elements + CUDA_ACC_BLOCK_SIZE - 1) / CUDA_ACC_BLOCK_SIZE;
acc_f32<<<num_blocks, CUDA_ACC_BLOCK_SIZE, 0, stream>>>(x, y, dst, n_elements, ne10, ne11, ne12, nb1, nb2, offset);
}
Notice that this is launching a CUDA kernel, acc_f32, and that a CUDA stream is
being specified. This is an asynchronous call which places the launch operation
into the stream (a command/operation queue).
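To make it more concrete what this acc operation computes here (adding v.class_embd into the zero-initialized embeddings tensor), the following is a rough CPU sketch of an acc-style operation; it is an illustration under my own assumptions, not the ggml/CUDA implementation:
#include <cstdio>
#include <vector>

// dst starts as a copy of src0 and src1 is accumulated into a region of it
// starting at the given element offset.
std::vector<float> acc_f32(const std::vector<float> & src0,
                           const std::vector<float> & src1,
                           size_t offset) {
    std::vector<float> dst = src0;
    for (size_t i = 0; i < src1.size(); i++) {
        dst[offset + i] += src1[i];
    }
    return dst;
}

int main() {
    std::vector<float> embeddings(8, 0.0f);        // zero-initialized "embeddings"
    std::vector<float> class_embd = {1.0f, 2.0f};  // stand-in for v.class_embd
    auto out = acc_f32(embeddings, class_embd, 0); // place the class embedding at row 0
    for (float v : out) printf("%.1f ", v);        // 1.0 2.0 0.0 0.0 ...
    printf("\n");
    return 0;
}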
After all the computation has been performed we will be back in
clip_image_batch_encode
:
ggml_backend_graph_compute(ctx->backend, gf);
// the last node is the embedding tensor
--> struct ggml_tensor * embeddings = ggml_graph_node(gf, -1);
// copy the embeddings to the location passed by the user
ggml_backend_tensor_get(embeddings, vec, 0, ggml_nbytes(embeddings));
return true;
}
The above will get the last node in the graph which is the embeddings tensor, so this is the embeddings for the image:
(gdb) p embeddings->ne
$7 = {4096, 576, 1, 1}
And we copy the data from this tensor, which is on the backend device, into the
passed-in vec pointer. So after this the embeddings for the image will be in
the vec pointer. This will then return true to:
bool clip_image_encode(struct clip_ctx * ctx, const int n_threads, clip_image_f32 * img, float * vec) {
...
return clip_image_batch_encode(ctx, n_threads, &imgs, vec);
}
Which in turn will return to llava.cpp. And recall that we have 5 patches; we
have now iterated over the first one and will continue with the rest, which
will all populate image_embd_v:
(gdb) p image_embd_v
$10 = std::vector of length 5, capacity 5 = {0x7fffb4200010, 0x0, 0x0, 0x0, 0x0}
(gdb) p image_embd_v[0]
$11 = (float *) 0x7fffb4200010
(gdb) p image_embd_v[1]
$12 = (float *) 0x0
for (size_t i = 0; i < img_res_v.size; i++) {
image_embd_v[i] = (float *)malloc(clip_embd_nbytes(ctx_clip));
const bool encoded = clip_image_encode(ctx_clip, n_threads, &img_res_v.data[i], image_embd_v[i]);
if (!encoded) {
LOG_ERR("Unable to encode image - spatial_unpad - subimage %d of %d\n", (int) i+1, (int) img_res_v.size);
return false;
}
}
Let's skip over the rest and continue the exploration:
(gdb) until 326
encode_image_with_clip: 5 segments encoded in 1284690.74 ms
Thread 1 "llama-llava-cli" hit Breakpoint 7, encode_image_with_clip (ctx_clip=0x555555bf1380, n_threads=4, img=0x5555567dee50,
image_embd=0x7ffd1a800010, n_img_pos=0x7fffffffc5a8) at /home/danbev/work/ai/llama.cpp/examples/llava/llava.cpp:329
329 const int32_t * image_grid = clip_image_grid(ctx_clip)
This is getting the pinpoints from the model and storing a pointer to them in
image_grid
. These will then be stored as pairs in a vector:
std::vector<std::pair<int, int>> grid_pinpoints;
for (int i = 0; i < 32 && image_grid[i] != 0; i += 2) {
grid_pinpoints.push_back({image_grid[i], image_grid[i+1]});
}
(gdb) p grid_pinpoints
$16 = std::vector of length 5, capacity 8 = {{first = 336, second = 672}, {first = 672, second = 336}, {first = 672, second = 672}, {
first = 1008, second = 336}, {first = 336, second = 1008}}
I was a little confused that the pinpoints are used again, as I thought they
would only be used in the pre-processing stage of the image processing. Perhaps
this will make more sense when we see how they are used to compute the
clip_image_grid_shape:
const int32_t image_size = clip_image_size(ctx_clip);
struct clip_image_grid_shape grid_shape = get_anyres_image_grid_shape({img->nx,img->ny}, grid_pinpoints, image_size);
int n_img_pos_out;
clip_llava_handle_patches(ctx_clip, image_embd_v, grid_shape, image_embd, &n_img_pos_out);
*n_img_pos = n_img_pos_out;
(gdb) p grid_shape
$24 = {first = 2, second = 2}
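Before stepping into clip_llava_handle_patches, here is a hedged sketch of how a grid shape could be derived from the pinpoints. The selection heuristic below (the smallest pinpoint resolution that covers the image, falling back to the first pinpoint) is my own assumption for illustration and not necessarily the exact logic of get_anyres_image_grid_shape:
#include <climits>
#include <cstdio>
#include <utility>
#include <vector>

std::pair<int, int> grid_shape_for(int img_w, int img_h,
                                   const std::vector<std::pair<int, int>> & pinpoints,
                                   int image_size) {
    std::pair<int, int> best = pinpoints[0];   // fallback if nothing covers the image
    long best_area = LONG_MAX;
    for (const auto & p : pinpoints) {
        if (p.first >= img_w && p.second >= img_h) {
            const long area = (long) p.first * p.second;
            if (area < best_area) { best_area = area; best = p; }
        }
    }
    return { best.first / image_size, best.second / image_size };
}

int main() {
    std::vector<std::pair<int, int>> pinpoints = {
        {336, 672}, {672, 336}, {672, 672}, {1008, 336}, {336, 1008}
    };
    auto grid = grid_shape_for(640, 480, pinpoints, 336);
    printf("grid = %d x %d\n", grid.first, grid.second); // 2 x 2 (672x672 / 336)
    return 0;
}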
static bool clip_llava_handle_patches(clip_ctx * ctx_clip,
std::vector<float *> & image_embd_v,
struct clip_image_grid_shape grid_shape,
float * image_embd_out,
int * n_img_pos_out) {
struct {
struct ggml_context * ctx;
} model;
const int32_t image_size = clip_image_size(ctx_clip);
const int32_t patch_size = clip_patch_size(ctx_clip);
int32_t num_patches_per_side = image_size / patch_size;
So we have the following patch embeddings:
(gdb) p image_embd_v
$45 = std::vector of length 5, capacity 5 = {0x7fffb4200010, 0x7fff9ac00010, 0x7fff9a200010, 0x7fff97600010, 0x7fff96c00010}
The first one is the embedding of the base image (the whole image resized), which is used as global context, and the rest are the embeddings for the grid sub-images.
So the patch embeddings look something like this in the image_embd_v
vector:
image_embd_v[0] = 0 [0 ... 4095] (base image / global context)
...
575 [0 ... 4095]
image_embd_v[1] = 0 [0 ... 4095] (patch embedding 1)
...
575 [0 ... 4095]
image_embd_v[2] = 0 [0 ... 4095] (patch embedding 2)
...
575 [0 ... 4095]
image_embd_v[3] = 0 [0 ... 4095] (patch embedding 3)
...
575 [0 ... 4095]
image_embd_v[4] = 0 [0 ... 4095] (patch embedding 4)
...
575 [0 ... 4095]
(gdb) p image_size
$17 = 336
(gdb) p patch_size
$22 = 14
(gdb) p num_patches_per_side
$23 = 24
Next we have:
int num_patches_width = grid_shape.first; // grid 1-4
int num_patches_height = grid_shape.second; // grid 1-4
const size_t num_images = num_patches_width * num_patches_height + 1;
(gdb) p num_patches_width
$25 = 2
(gdb) p num_patches_height
$26 = 2
(gdb) p num_images
$27 = 5
size_t ctx_size = 0;
{
ctx_size += clip_embd_nbytes(ctx_clip) * num_images * 8; // image_features
ctx_size += 1024*1024 * ggml_type_size(GGML_TYPE_F32);
}
struct ggml_init_params params {
/*.mem_size =*/ ctx_size,
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ false, // NOTE: this should be false when using the legacy API
};
model.ctx = ggml_init(params);
struct ggml_tensor * image_features = ggml_new_tensor_3d(model.ctx, GGML_TYPE_F32,
clip_n_mmproj_embd(ctx_clip), clip_n_patches(ctx_clip), num_images - 1);
(gdb) p clip_n_mmproj_embd(ctx_clip)
$30 = 4096
(gdb) p clip_n_patches(ctx_clip)
$31 = 576
(gdb) p image_features->ne
$33 = {4096, 576, 4, 1}
Next we will iterate over the image patch embeddings (4 in this case; we skip the first embedding, which is the base-image embedding used as global context and not one of the grid sub-images) and set the data of the tensor created above:
for (size_t i = 1; i < num_images; i++) {
size_t offset = (i-1) * clip_embd_nbytes(ctx_clip);
memcpy((uint8_t *)(image_features->data) + offset, image_embd_v[i], clip_embd_nbytes(ctx_clip));
}
So we have something like this:
z0
0 [0 ... 4095]
...
575 [0 ... 4095]
z1
0 [0 ... 4095]
...
575 [0 ... 4095]
z2
0 [0 ... 4095]
...
575 [0 ... 4095]
z3
0 [0 ... 4095]
...
575 [0 ... 4095]
So we have 4 patches (z), and each patch is 576 embeddings (y) (the image size is 336x336 and each patch is 14x14, and 336/14=24, and 24x24=576) and each embedding has 4096 features/dimensions (x).
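As a quick sanity check of these numbers (and of the 9437184 byte figure in the malloc comment earlier):
#include <cstdio>

int main() {
    const int image_size = 336, patch_size = 14, n_embd = 4096, n_sub_images = 4;
    const int per_side  = image_size / patch_size;                      // 24
    const int n_patches = per_side * per_side;                          // 576
    const size_t bytes  = (size_t) n_patches * n_embd * sizeof(float);  // 9437184
    printf("per_side=%d n_patches=%d bytes=%zu rows=%d\n",
           per_side, n_patches, bytes, n_patches * n_sub_images);       // rows=2304
    return 0;
}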
Then we can see that a computation graph is created.
struct ggml_cgraph * gf = ggml_new_graph(model.ctx);
size_t size_ele = ggml_type_size(GGML_TYPE_F32);
struct ggml_tensor *image_features_patchview = ggml_view_4d(model.ctx, image_features,
num_patches_per_side * clip_n_mmproj_embd(ctx_clip),
num_patches_per_side,
num_patches_width,
num_patches_height,
size_ele * num_patches_per_side * clip_n_mmproj_embd(ctx_clip),
size_ele * num_patches_per_side * clip_n_mmproj_embd(ctx_clip) * num_patches_per_side,
size_ele * num_patches_per_side * clip_n_mmproj_embd(ctx_clip) * num_patches_per_side * num_patches_width, 0);
(gdb) p num_patches_per_side
$38 = 24
(gdb) p clip_n_mmproj_embd(ctx_clip)
$40 = 4096
(gdb) p image_features_patchview->ne
$41 = {98304, 24, 2, 2}
Now, we have a spatial grid of 2x2 patches which we can visualize like this:
q0
z0
0 [0 ... 98303]
...
23 [0 ... 98303]
z1
0 [0 ... 98303]
...
23 [0 ... 98303]
q1
z0
0 [0 ... 98303]
...
23 [0 ... 98303]
z1
0 [0 ... 98303]
...
23 [0 ... 98303]
+-------+--------+
|q0 z0 | q0 z1 |
+----------------+
|q1 z0 | q1 z1 |
+-------+--------+
q0 z0 = represents a patch embedding 1: 24x24 patches each with 4096 features.
q0 z1 = represents a patch embedding 2: 24x24 patches each with 4096 features.
q1 z0 = represents a patch embedding 3: 24x24 patches each with 4096 features.
q1 z1 = represents a patch embedding 4: 24x24 patches each with 4096 features.
So we are preserving the spatial arrangement of the patches in the image, which is something we need to do for the LLM to understand the visual context. By preserving the 2x2 grid of patch embeddings, this spatial arrangement allows the LLM to understand not just each sub-image individually, but also how they relate to each other in the grid. This can be crucial for tasks that require understanding the context across multiple patch embeddings.
Next the image_features_patchview
tensor is permuted and made contiguous:
struct ggml_tensor *permuted_cont = ggml_cont(model.ctx, ggml_permute(model.ctx, image_features_patchview, 0, 2, 1, 3));
Notice that the permutation is swapping the second and third dimensions.
So in memory we go from the sub-images stored one after the other, something like this:
[patch1 row1]
[patch1 row2]
...
[patch1 row24]
[patch2 row1]
...
[patch2 row24]
[patch3 row1]
...
[patch3 row24]
[patch4 row1]
...
[patch4 row24]
To something like this, where the rows of horizontally adjacent sub-images are interleaved:
[patch1 row1] [patch2 row1]
[patch1 row2] [patch2 row2]
...
[patch1 row24] [patch2 row24]
[patch3 row1] [patch4 row1]
[patch3 row2] [patch4 row2]
...
[patch3 row24] [patch4 row24]
Notice that this rearrangement reconstructs the spatial arrangement of the original image: patch1 row1 and patch2 row1 now sit next to each other, forming the first row of the stitched 48x48 grid of patches, and likewise patch3 row1 and patch4 row1 form the first row of the lower half.
The permuted shape will be:
(gdb) p permuted_cont->ne
$43 = {98304, 2, 24, 2}
q0
y0
0 [0 ... 98303]
1 [0 ... 98303]
...
y23
0 [0 ... 98303]
1 [0 ... 98303]
q1
y0
0 [0 ... 98303]
1 [0 ... 98303]
...
y23
0 [0 ... 98303]
1 [0 ... 98303]
This will then be flattened into a 2d tensor:
struct ggml_tensor *flatten = ggml_view_2d(model.ctx, permuted_cont, clip_n_mmproj_embd(ctx_clip), num_patches_height * num_patches_width * num_patches_per_side * num_patches_per_side, size_ele * clip_n_mmproj_embd(ctx_clip), 0);
The flattened shape will be:
(gdb) p flatten->ne
$44 = {4096, 2304, 1, 1}
0 [0 ... 4095]
...
...
2303 [0 ... 4095]
There are 24x24 patches per sub-image, and we have 4 sub-images so 24x24x4=2304. Each patch embedding has 4096 features. And recall that after the permutation the memory order looked something like this:
[patch1 row1] [patch2 row1]
[patch1 row2] [patch2 row2]
...
[patch1 row24] [patch2 row24]
[patch3 row1] [patch4 row1]
...
[patch3 row24] [patch4 row24]
And with this flattening it becomes something like this:
      0          4095
   0  [patch1 row1]
      [patch2 row1]
      [patch1 row2]
      [patch2 row2]
      ...
      [patch1 row24]
      [patch2 row24]
      [patch3 row1]
      [patch4 row1]
      ...
      [patch3 row24]
2303  [patch4 row24]
Notice how this resembles the shape of inputs to an LLM model where each row would represent a token embedding in a sequence. Each [patchN rowM] label above actually spans 24 rows of the flattened tensor (one 4096-dimensional embedding per 14x14 patch), and together the 2304 rows walk the stitched 48x48 grid of patches in row-major order.
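The following standalone sketch (plain C++, not ggml) mimics the view/permute/flatten reordering on a tiny 2x2 grid of sub-images, each with 2 rows of 2 patches and a single feature per patch, so the resulting order is easy to verify by eye; the encoding of the values is made up for illustration:
#include <cstdio>
#include <vector>

int main() {
    const int grid_w = 2, grid_h = 2;   // 2x2 grid of sub-images
    const int rows = 2, cols = 2;       // patches per sub-image (stand-in for 24x24)

    // Sub-images stored one after another, row-major within each sub-image.
    // Value encodes: sub-image * 100 + row * 10 + column.
    std::vector<int> features;
    for (int s = 0; s < grid_w * grid_h; s++)
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                features.push_back((s + 1) * 100 + (r + 1) * 10 + (c + 1));

    // Equivalent of view_4d + permute(0, 2, 1, 3) + cont + view_2d:
    // for each grid row, for each sub-image row, for each grid column.
    std::vector<int> flattened;
    for (int h = 0; h < grid_h; h++)
        for (int r = 0; r < rows; r++)
            for (int w = 0; w < grid_w; w++) {
                const int s = h * grid_w + w;
                for (int c = 0; c < cols; c++)
                    flattened.push_back(features[(s * rows + r) * cols + c]);
            }

    for (int v : flattened) printf("%d ", v);
    printf("\n"); // 111 112 211 212 121 122 221 222 311 312 411 412 321 322 421 422
    return 0;
}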
So next we forward expand the nodes in the graph and then execute the graph.
ggml_build_forward_expand(gf, flatten);
ggml_graph_compute_with_ctx(model.ctx, gf, 1);
struct ggml_tensor* result = ggml_graph_node(gf, -1);
And we can inspect the shape of the result tensor using:
(gdb) p result->ne
$64 = {4096, 2304, 1, 1}
(gdb) p gf->nodes[gf->n_nodes-1]->ne
$69 = {4096, 2304, 1, 1}
(gdb) p gf->nodes[gf->n_nodes-1]
$70 = (ggml_tensor *) 0x7ffd08210880
(gdb) p result
$71 = (ggml_tensor *) 0x7ffd08210880
Following that we have:
memcpy(image_embd_out, image_embd_v[0], clip_embd_nbytes(ctx_clip)); // main image as global context
Recall that the original image (resized to 336x336) was encoded and stored
in the first element of the image_embd_v
vector, and the above is copying this into image_embd_out as the global context.
Next we copy the data from the result tensor:
memcpy(image_embd_out + clip_n_patches(ctx_clip) * clip_n_mmproj_embd(ctx_clip),
(float*)result->data,
clip_embd_nbytes(ctx_clip) * (num_images-1));
*n_img_pos_out = static_cast<int>(result->ne[1]+clip_n_patches(ctx_clip));
ggml_free(model.ctx);
return true;
}
(gdb) p clip_n_patches(ctx_clip)
$73 = 576
(gdb) p clip_n_mmproj_embd(ctx_clip)
$74 = 4096
The above will return us to encode_image_with_clip. Note that n_img_pos_out was
set to result->ne[1] + clip_n_patches = 2304 + 576 = 2880, which matches the log
output further down:
clip_llava_handle_patches(ctx_clip, image_embd_v, grid_shape, image_embd, &n_img_pos_out);
---> *n_img_pos = n_img_pos_out;
for (size_t i = 0; i < image_embd_v.size(); i++) {
free(image_embd_v[i]);
}
image_embd_v.clear();
}
LOG_INF("%s: image embedding created: %d tokens\n", __func__, *n_img_pos);
return true;
}
encode_image_with_clip: image embedding created: 2880 tokens
And then return us to llava_image_embed_make_with_clip_img
:
if (!encode_image_with_clip(ctx_clip, n_threads, img, image_embd, &n_img_pos)) {
LOG_ERR("%s: cannot encode image, aborting\n", __func__);
free(image_embd);
return false;
}
--> *image_embd_out = image_embd;
*n_img_pos_out = n_img_pos;
return true;
}
And this will return us to llava_image_embed_make_with_bytes
:
struct llava_image_embed * llava_image_embed_make_with_bytes(struct clip_ctx * ctx_clip, int n_threads, const unsigned char * image_bytes, int image_bytes_length) {
...
bool image_embed_result = llava_image_embed_make_with_clip_img(ctx_clip, n_threads, img, &image_embed, &n_image_pos);
--> if (!image_embed_result) {
clip_image_u8_free(img);
LOG_ERR("%s: coulnd't embed the image\n", __func__);
return NULL;
}
clip_image_u8_free(img);
auto result = (llava_image_embed*)malloc(sizeof(llava_image_embed));
result->embed = image_embed;
result->n_image_pos = n_image_pos;
return result;
}
And that will return us to llava_image_embed_make_with_filename
:
llava_image_embed *embed = llava_image_embed_make_with_bytes(ctx_clip, n_threads, image_bytes, image_bytes_length);
--> free(image_bytes);
return embed;
}
And this will return to load_image
in llava-cli.cpp
:
if (!embed) {
fprintf(stderr, "%s: is %s really an image file?\n", __func__, fname.c_str());
return NULL;
}
}
return embed;
}
And that will return us to the main
function in llava-cli.cpp
:
auto * image_embed = load_image(ctx_llava, ¶ms, image);
if (!image_embed) {
LOG_ERR("%s: failed to load image %s. Terminating\n\n", __func__, image.c_str());
return 1;
}
// process the prompt
process_prompt(ctx_llava, image_embed, ¶ms, params.prompt);
So we can now see that everything we have gone through so far loads the image
and generates the patch embeddings for the image, plus the base-image embedding
used as global context. We will now pass these embeddings to process_prompt.
static void process_prompt(struct llava_context * ctx_llava,
struct llava_image_embed * image_embed,
common_params * params,
const std::string & prompt) {
int n_past = 0;
const int max_tgt_len = params->n_predict < 0 ? 256 : params->n_predict;
The above sets the maximum number of tokens to generate (the max target length) to 256 in this case.
std::string system_prompt, user_prompt;
size_t image_pos = prompt.find("<image>");
if (image_pos != std::string::npos) {
...
} else {
// llava-1.5 native mode
system_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:";
user_prompt = prompt + "\nASSISTANT:";
if (params->verbose_prompt) {
auto tmp = common_tokenize(ctx_llava->ctx_llama, user_prompt, true, true);
for (int i = 0; i < (int) tmp.size(); i++) {
LOG_INF("%6d -> '%s'\n", tmp[i], common_token_to_piece(ctx_llava->ctx_llama, tmp[i]).c_str());
}
}
}
We can see that the system prompt and user prompt are set.
Following that the system prompt will be passed to eval_string
:
eval_string(ctx_llava->ctx_llama, system_prompt.c_str(), params->n_batch, &n_past, true);
static bool eval_string(struct llama_context * ctx_llama, const char* str, int n_batch, int * n_past, bool add_bos){
std::string str2 = str;
std::vector<llama_token> embd_inp = common_tokenize(ctx_llama, str2, add_bos, true);
eval_tokens(ctx_llama, embd_inp, n_batch, n_past);
return true;
}
So this will first tokenize the string, which creates the tokens for the system prompt:
(gdb) p embd_inp
$17 = std::vector of length 34, capacity 164 = {1, 319, 13563, 1546, 263, 12758, 5199, 322, 385, 23116, 21082, 20255, 29889, 450,
20255, 4076, 8444, 29892, 13173, 29892, 322, 1248, 568, 6089, 304, 278, 5199, 29915, 29879, 5155, 29889, 13, 11889, 29901}
(gdb) p ctx_llama->model->vocab->id_to_token[319]
$18 = {text = "▁A", score = -60, attr = LLAMA_TOKEN_ATTR_NORMAL}
(gdb) p ctx_llama->model->vocab->id_to_token[13563]
$19 = {text = "▁chat", score = -13304, attr = LLAMA_TOKEN_ATTR_NORMAL}
And this will be passed to eval_tokens
:
static bool eval_tokens(struct llama_context * ctx_llama, std::vector<llama_token> tokens, int n_batch, int * n_past) {
int N = (int) tokens.size();
for (int i = 0; i < N; i += n_batch) {
int n_eval = (int) tokens.size() - i;
if (n_eval > n_batch) {
n_eval = n_batch;
}
if (llama_decode(ctx_llama, llama_batch_get_one(&tokens[i], n_eval, *n_past, 0))) {
LOG_ERR("%s : failed to eval. token %d/%d (batch size %d, n_past %d)\n", __func__, i, N, n_batch, *n_past);
return false;
}
*n_past += n_eval;
}
return true;
}
So the above will iterate over all the tokens, 34 in this case, but notice that
the increment is n_batch, which is 2048 in this case. And n_past is 0 to start
with.
Stepping into llama_decode I realized that I've never really looked at (or
it has changed since I last looked at it) the u_batch handling.
(gdb) ptype new_clip->vision_model.hparams
type = struct clip_hparams {
int32_t image_size;
int32_t patch_size;
int32_t hidden_size;
int32_t n_intermediate;
int32_t projection_dim;
int32_t n_head;
int32_t n_layer;
float eps;
char mm_patch_merge_type[32];
int32_t image_grid_pinpoints[32];
int32_t image_crop_resolution;
}
The clip implementation in the example (clip.cpp) uses stb_image.h, which is
from the stb single-file public domain libraries created by Sean T. Barrett,
hence the stb prefix. The stb_image.h header contains code related to loading
and decoding images from files and memory in various formats like PNG, JPEG,
BMP, PSD, GIF, HDR, and PIC.