llama-kv-cache.md

Overview

I've gone through the theory of Key-Value caching in the transformer architecture in llama.md. This document is a more detailed look at the implementation of the key-value cache in the llama.cpp codebase.

Table of Contents

Inference with KV-Cache

Let's set a breakpoint before llama_decode and see how this interacts with the kv-cache.

$ cd fundamentals/llama.cpp
$ make simple_prompt
$ gdb --args simple_prompt
(gdb) br llama_decode_internal

In llama_decode_internal we find the following:

        // non-causal masks do not use the KV cache
        if (hparams.causal_attn) {
            llama_kv_cache_update(&lctx);

            // if we have enough unused cells before the current head ->
            //   better to start searching from the beginning of the cache, hoping to fill it
            if (kv_self.head > kv_self.used + 2*n_tokens) {
                kv_self.head = 0;
            }

            if (!llama_kv_cache_find_slot(kv_self, u_batch)) {
                return 1;
            }

            if (!kv_self.recurrent) {
                // a heuristic, to avoid attending the full cache if it is not yet utilized
                // after enough generations, the benefit from this heuristic disappears
                // if we start defragmenting the cache, the benefit from this will be more important
                const uint32_t pad = llama_kv_cache_get_padding(cparams);
                kv_self.n = std::min(kv_self.size, std::max(pad, GGML_PAD(llama_kv_cache_cell_max(kv_self), pad)));
                //kv_self.n = llama_kv_cache_cell_max(kv_self);
            }
        }

The first call is to llama_kv_cache_update which actually does not do anything in our case. But this checks the has_shift flag of the cache and would apply a shift of the keys if it had been set, which is done by the add/div functions. TODO: Incorporate notes from self-extend which I think went through this process in some detail.
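For reference, this is a hedged sketch (not from the example program) of the public llama.h calls that set has_shift; the sequence id and the position range below are made up for illustration:

#include "llama.h"

// Assuming `ctx` already holds a decoded sequence 0 with positions 0..5. These calls only
// record a delta on the affected cells and set has_shift; llama_kv_cache_update (run at the
// start of the next decode, as seen above) then applies the shift by re-roping the cached keys.
static void shift_example(llama_context * ctx) {
    // move positions [0, 6) of sequence 0 back by 2
    llama_kv_cache_seq_add(ctx, /*seq_id*/ 0, /*p0*/ 0, /*p1*/ 6, /*delta*/ -2);

    // or divide the positions in [0, 6) by 2 (used by self-extend style grouped attention)
    llama_kv_cache_seq_div(ctx, /*seq_id*/ 0, /*p0*/ 0, /*p1*/ 6, /*d*/ 2);
}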

At this stage the cache is empty and head is zero, so let's look at llama_kv_cache_find_slot (in this case n_tokens is 6 and the cache size is 1024):

static bool llama_kv_cache_find_slot(
           struct llama_kv_cache & cache,
        const struct llama_batch & batch) { <--- I thought I'd renamed these to ubatch?
    const uint32_t n_tokens = batch.n_tokens;

    if (cache.recurrent) {
        ...
    }

    if (n_tokens > cache.size) {
        LLAMA_LOG_ERROR("%s: n_tokens=%d > cache.size=%d\n", __func__, n_tokens, cache.size);
        return false;
    }

    uint32_t n_tested = 0;
    while (true) {
        if (cache.head + n_tokens > cache.size) {
            n_tested += cache.size - cache.head;
            cache.head = 0;
            continue;
        }

        bool found = true;
        for (uint32_t i = 0; i < n_tokens; i++) {
            if (cache.cells[cache.head + i].pos >= 0) {
                found = false;
                cache.head += i + 1;
                n_tested   += i + 1;
                break;
            }
        }

        if (found) {
            break;
        }

        if (n_tested >= cache.size) {
            //LLAMA_LOG_ERROR("%s: failed to find a slot for %d tokens\n", __func__, n_tokens);
            return false;
        }
    }

Let's look at what is happening in the for loop. We know that n_tokens is 6 so i will go from 0 to 5. There is nothing in the cache yet so all cells will have positions that are -1 (unused).

(gdb) p cache.cells[cache.head + i]
$85 = {pos = -1, delta = 0, src = 0, seq_id = std::set with 0 elements}

So our cache head is 0 and we are checking the first 6 cells in the cache to see if their position is greater than or equal to 0 (which would indicate that they are in use). Since they are all -1, found will therefore still be true and we will exit the loop.

Next we iterate over all the sequences.

    for (uint32_t s = 0; s < n_seqs; s++) {
        for (uint32_t i = 0; i < n_seq_tokens; ++i) {
            uint32_t k = s*n_seq_tokens + i;
            cache.cells[cache.head + k].pos = batch.pos[k];

            for (int32_t j = 0; j < batch.n_seq_id[s]; j++) {
                cache.cells[cache.head + k].seq_id.insert(batch.seq_id[s][j]);
            }
        }
    }

    cache.used += n_tokens;

    return true;

We will again iterate over our 6 tokens, this time using the ubatch's n_seqs. This sets the position of each cell to the position from the ubatch, so the cell will then be in use as it has a pos value that is not less than 0. Each cell also has a set of sequence ids which identify the sequences that the token is part of. This is done for all 6 tokens in the ubatch.

Finally we update cache.used to 6 and then return. This returns us to llama_decode_internal and we continue from there.

            if (!kv_self.recurrent) {
                // a heuristic, to avoid attending the full cache if it is not yet utilized
                // after enough generations, the benefit from this heuristic disappears
                // if we start defragmenting the cache, the benefit from this will be more important
                const uint32_t pad = llama_kv_cache_get_padding(cparams);
                kv_self.n = std::min(kv_self.size, std::max(pad, GGML_PAD(llama_kv_cache_cell_max(kv_self), pad)));
                //kv_self.n = llama_kv_cache_cell_max(kv_self);
            }

Now, the padding is different depending on the type of attention in use. If flash attention (FA) is used the padding will be 256, otherwise 32.

(gdb) p llama_kv_cache_get_padding(cparams)
$15 = 32
(gdb) p llama_kv_cache_cell_max(kv_self)
$14 = 6

(gdb) p kv_self.n
$17 = 32

So this is where kv_self.n is set, which is something that I've been wondering about.
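To make the padding arithmetic concrete, here is a small standalone sketch (the GGML_PAD macro is copied from ggml.h, the other values are the ones from the gdb session above):

#include <algorithm>
#include <cstdint>
#include <cstdio>

#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    uint32_t size     = 1024;  // kv_self.size (kv_size)
    uint32_t cell_max = 6;     // llama_kv_cache_cell_max(kv_self) for our 6 token prompt
    uint32_t pad      = 32;    // llama_kv_cache_get_padding: 32, or 256 with flash attention

    // mirrors: kv_self.n = std::min(kv_self.size, std::max(pad, GGML_PAD(cell_max, pad)));
    uint32_t n = std::min(size, std::max(pad, (uint32_t) GGML_PAD(cell_max, pad)));
    std::printf("kv_self.n = %u\n", n);   // prints 32, matching the gdb output above
    return 0;
}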

The next interesting thing that happens with regards to the kv-cache is that the graph is built:

        ggml_cgraph * gf = llama_build_graph(lctx, ubatch, false);

I'm just noting that in the callback we have the following, which is related to the kv cache, but I'm not sure what it is about yet:

    llm_build_cb cb = [&](struct ggml_tensor * cur, const char * name, int il) {
        ...

        if (!lctx.cparams.offload_kqv) {
            if (strcmp(name, "kqv_merged_cont") == 0) {
                // all nodes between the KV store and the attention output are run on the CPU
                ggml_backend_sched_set_tensor_backend(lctx.sched, cur, lctx.backend_cpu);
            }
        }

The model I'm using for this example is a llama model so llm.build_llama will be called.

    struct ggml_cgraph * build_llama() {
        ...

        const int64_t n_embd_head = hparams.n_embd_head_v;
(gdb) p n_embd_head
$18 = 128

Now, build_llama is a method/member of the struct llm_build_context which has a field named kv_head:

(gdb) p this.kv_head
(gdb) 0

This is very important, and for the first prompt it will be zero; it was something that I overlooked the first time I stepped through the code and it caused me some confusion. For the next token processed this value will be the number of that token in the sequence. So if we had 6 tokens in the initial prompt this would be 6 for the next token to be decoded.

First we have the input layer which will be built using either the tokens in the ubatch or the embeddings in the ubatch:

        struct ggml_tensor * cur;
        struct ggml_tensor * inpL;

        inpL = llm_build_inp_embd(ctx0, lctx, hparams, ubatch, model.tok_embd, cb);
        // inp_pos - contains the positions
        struct ggml_tensor * inp_pos = build_inp_pos();
        // KQ_mask (mask for 1 head, it will be broadcasted to all heads)
        struct ggml_tensor * KQ_mask = build_inp_KQ_mask();
    struct ggml_tensor * build_inp_KQ_mask(bool causal = true) {
        lctx.inp_KQ_mask = causal
            ? ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_kv,     GGML_PAD(n_tokens, GGML_KQ_MASK_PAD))
            : ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_tokens, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
        cb(lctx.inp_KQ_mask, "KQ_mask", -1);
        ggml_set_input(lctx.inp_KQ_mask);

        return flash_attn ? ggml_cast(ctx0, lctx.inp_KQ_mask, GGML_TYPE_F16) : lctx.inp_KQ_mask;
    }

In our case this will create a 2d tensor with dimensions 32 (n_kv) by 32 (n_tokens padded to GGML_KQ_MASK_PAD):

(gdb) p lctx.inp_KQ_mask->ne
$22 = {32, 32, 1, 1}

This mask will be used to mask out tokens that come after the current token in the sequence (and cells that do not belong to the same sequence), so each token can only attend to itself and earlier tokens. And notice that the comment says that it will be broadcast to all the heads, which is the reason why it may seem small. Interesting to see ggml_cast, which I have not used before; this makes sure that if flash attention is used the tensor will be cast to f16.

Next all layer operations will be built:

        const float kq_scale = hparams.f_attention_scale == 0.0f ? 1.0f/sqrtf(float(n_embd_head)) : hparams.f_attention_scale;
        for (int il = 0; il < n_layer; ++il) {
            struct ggml_tensor * inpSA = inpL;
            ...

            // self-attention
            {
                ...
                struct ggml_tensor * Vcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wv, cur);
                cb(Vcur, "Vcur", il);
                if (model.layers[il].bv) {
                    Vcur = ggml_add(ctx0, Vcur, model.layers[il].bv);
                    cb(Vcur, "Vcur", il);
                }

                Qcur = ggml_rope_ext(
                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, rope_factors,
                    n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
                    ext_factor, attn_factor, beta_fast, beta_slow
                );
                cb(Qcur, "Qcur", il);

                Kcur = ggml_rope_ext(
                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, rope_factors,
                    n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
                    ext_factor, attn_factor, beta_fast, beta_slow
                );
                cb(Kcur, "Kcur", il);

                cur = llm_build_kv(ctx0, lctx, kv_self, gf,
                        model.layers[il].wo, model.layers[il].bo,
                        Kcur, Vcur, Qcur, KQ_mask, n_tokens, kv_head, n_kv, kq_scale, cb, il);

Notice here that the roped Key and Query operations are created and then passed into llm_build_kv. And notice that kv_head is passed in which, as mentioned above, will be important for the next tokens.

There are a lot of parameters passed to llm_build_kv but if we look through them we have seen most of them before.

static struct ggml_tensor * llm_build_kv(
        struct ggml_context * ctx,
       struct llama_context & lctx,
       const llama_kv_cache & kv,
         struct ggml_cgraph * graph,
         struct ggml_tensor * wo,          // model.layers[il].wo
         struct ggml_tensor * wo_b,        // model.layers[il].bo
         struct ggml_tensor * k_cur,
         struct ggml_tensor * v_cur,
         struct ggml_tensor * q_cur,
         struct ggml_tensor * kq_mask,
                    int32_t   n_tokens,
                    int32_t   kv_head,
                    int32_t   n_kv,
                    float     kq_scale,
         const llm_build_cb & cb,
                    int       il) {
    const llama_hparams & hparams = lctx.model.hparams;
    const llama_cparams & cparams = lctx.cparams;

    // these nodes are added to the graph together so that they are not reordered
    // by doing so, the number of splits in the graph is reduced
    ggml_build_forward_expand(graph, q_cur);
    ggml_build_forward_expand(graph, k_cur);
    ggml_build_forward_expand(graph, v_cur);

    llm_build_kv_store(ctx, hparams, cparams, kv, graph, k_cur, v_cur, n_tokens, kv_head, cb, il);

    struct ggml_tensor * cur;

    cur  = llm_build_kqv(ctx, lctx, kv, graph, wo, wo_b, q_cur, kq_mask, n_tokens, n_kv, kq_scale, cb, il);
    cb(cur, "kqv_out", il);

    return cur;

I'm showing the complete function as I want to point out that first llm_build_kv_store is called and then llm_build_kqv, which performs the QK^T operation.

What is happening, which I'll go through below, is that the llm_build_kv_store function will copy the current roped key values (all heads, for all tokens in the ubatch) into the cache. And then later in llm_build_kqv the k tensor that will be used in the attention matrix multiplication will use a view into the layer's cache:

    struct ggml_tensor * k =
        ggml_view_3d(ctx, kv.k_l[il],
                n_embd_head_k, n_kv, n_head_kv,
                ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
                ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
                0);
    cb(k, "k", il);
    ...

        struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);

And this will operate on all the values in the cache up to this point.

Let's take a look at a few of the parameters passed to llm_build_kv:

(gdb) p model.layers[il].wo->ne
$26 = {4096, 4096, 1, 1}
(gdb) p model.layers[il].bo
$27 = (ggml_tensor *) 0x0
(gdb) p Kcur->ne
$28 = {128, 32, 6, 1}
(gdb) p Qcur->ne
$29 = {128, 32, 6, 1}
(gdb) p KQ_mask->ne
$30 = {32, 32, 1, 1}
(gdb) p kv_head
$31 = 0
(gdb) p kq_scale
$32 = 0.0883883461

So this is what these matrices look like for a single layer:

           Kcur                     Qcur                 KQ_mask
0   [0     ...     127]      0  [0     ...  127]      0  [0...31]
    .                     .                   .
    .                     .                   .
    .                     .                   .
31  [0     ...     127]      31 [0     ...  127]      31 [0...31]

So we have an embedding size of 4096 and we divide this into 32 heads, which means that each head will have an embedding size of 128.

    llm_build_kv_store(ctx, hparams, cparams, kv, graph, k_cur, v_cur, n_tokens, kv_head, cb, il);

llm_build_kv_store also has quite a few parameters but again we have seen most of them before:

static void llm_build_kv_store(
        struct ggml_context * ctx,
        const llama_hparams & hparams,
        const llama_cparams & cparams,
       const llama_kv_cache & kv,
         struct ggml_cgraph * graph,
         struct ggml_tensor * k_cur,
         struct ggml_tensor * v_cur,
                    int32_t   n_tokens,
                    int32_t   kv_head,
         const llm_build_cb & cb,
                    int64_t   il) {
    const int64_t n_ctx = cparams.n_ctx;

    const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa(il);
    const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa(il);
(lldb) p n_ctx
(const int64_t) 1024
(lldb) p n_embd_k_gqa
(const int64_t) 4096
(lldb) p n_embd_v_gqa
(const int64_t) 4096

So we have a context size of 1024 and an embedding dimension size of 4096. Next we will create a view of the kv cache k_l tensor for this layer; the number of elements will be n_tokens * n_embd_k_gqa, and the offset is the last argument. The cache is empty in this case. One note here is that if kv_head is a larger value, like 512, and not the number of tokens representing the prompt, then this is probably because there is a call to this function as part of llama_new_context_with_model, which can happen if you set a breakpoint on a line somewhere in this function. Just continue in gdb or lldb to get past this and break in the function when it builds the operations for the prompt instead.

Next a view of the k matrix is created:

    struct ggml_tensor * k_cache_view = ggml_view_1d(ctx,
        kv.k_l[il],
        n_tokens*n_embd_k_gqa,
        ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)*kv_head);

We can see here that we are creating a view of the tensor k_l for the current layer, and notice that the offset used takes the kv_head value into account. So this creates a new view whose size (in elements) is the number of tokens being processed (6 in this case) times the embedding size. And the offset will be 0 in this case because this is the first time we are processing tokens:

(lldb) p ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)
(size_t) 8192
(lldb) p kv_head
(int32_t) 0
(gdb) p ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)*kv_head
$69 = 0
(lldb) p k_cache_view->ne
(int64_t[4])  ([0] = 24576, [1] = 1, [2] = 1, [3] = 1)

So in this case this will produce a view of the tensor and the view will span the first 24576 (n_tokens * n_embd_k_gqa, which is 6 * 4096 in this case) elements.

Now, just to make this clear I'll also show what this would look like when decoding the next token.

(gdb) p kv_head
$75 = 6
(gdb) p n_tokens
$76 = 1
(gdb) p n_embd_k_qga
No symbol "n_embd_k_qga" in current context.
(gdb) p n_embd_k_gqa
$77 = 4096
(gdb) p n_embd_k_gqa * 1
$78 = 4096

(gdb) p ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)*kv_head
$79 = 49152

Notice that the offset is now 49152, which is the size (in bytes) of the previous view, and we can visualize this as follows:

                                                   offset
     0   [0   ...        4095]  (prompt token 1)   0
     1   [0   ...        4095]  (prompt token 2)   8192
     2   [0   ...        4095]  (prompt token 3)   16384
     3   [0   ...        4095]  (prompt token 4)   24576
     4   [0   ...        4095]  (prompt token 5)   32768 
     5   [0   ...        4095]  (prompt token 6)   40960 
     6   [0   ...        4095]  (next token 1)     49152
     ...
     ...
     ...
  1023   [0   ...        4095]

At this point we have created an operation to create a view with the correct offset.
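To tie the numbers together, here is a tiny sketch of the same offset calculation (the row size and kv_head values are the ones from this session; the cache type is assumed to be F16 so an element is 2 bytes):

#include <cstdio>
#include <cstddef>

int main() {
    const size_t n_embd_k_gqa = 4096;                     // embedding dim per token (all kv heads)
    const size_t f16_size     = 2;                        // cache type is GGML_TYPE_F16
    const size_t row_size     = n_embd_k_gqa * f16_size;  // ggml_row_size(...) == 8192 bytes

    // offset of the k_cache_view for the prompt and for the first generated token
    std::printf("prompt     (kv_head=0): offset = %zu bytes\n", row_size * 0);  // 0
    std::printf("next token (kv_head=6): offset = %zu bytes\n", row_size * 6);  // 49152
    return 0;
}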

Next a copy operation is created that will copy the current tokens' roped key values (k_cur) into this slot of the cache (the k_cache_view):

    ggml_build_forward_expand(graph, ggml_cpy(ctx, k_cur, k_cache_view));

So this is how the new tokens' roped key values are added to the cache. We have a cache which can store 1024 tokens, each having an embedding dimension of 4096, and kv_head is used to create the offset into the k_l tensor for each layer.

These tensors have different dimensions but the same number of elements:

(gdb) p ggml_n_dims(k_cur)
$16 = 3
(gdb) p ggml_n_dims(k_cache_view)
$17 = 1

(gdb) p k_cache_view->ne
$13 = {24576, 1, 1, 1}
(gdb) p k_cur->ne
$14 = {128, 32, 6, 1}
(gdb) p 128 * 32 * 6
$15 = 24576

So this is creating a copy operation that copies the whole roped key tensor (all heads) into the k cache view. And we have 6 tokens in this batch, which is indicated by the z axis below (one 32x128 slice per token):

z_0

  0   [0     ...     127]
  .
  .
  .
  31  [0     ...     127]

.
.
.

z_5

  0   [0     ...     127]
  .
  .
  .
  31  [0     ...     127]

Now, k_cur was passed in and is the tensor representing the roped key values for these tokens. So my understanding is that the cache is empty in this case and this is taking the roped key values and copying them into the cache.

This is only processing the current batch, which is the prompt in our case, so there is nothing in the cache at this point. But if we had already processed some tokens this would just be adding the currently processed token to the key cache. So we are adding a row (one token's embeddings) to the key cache for each token we process.

Then we also create a view for the value matrix:

    struct ggml_tensor * v_cache_view = nullptr;
    if (cparams.flash_attn) {
        v_cache_view = ggml_view_1d(ctx, kv.v_l[il],
            n_tokens*n_embd_v_gqa, ggml_row_size(kv.v_l[il]->type, n_embd_v_gqa) * kv_head);
    } else {
        // note: the V cache is transposed when not using flash attention
        v_cache_view = ggml_view_2d(ctx, kv.v_l[il], n_tokens, n_embd_v_gqa,
                (  n_ctx)*ggml_element_size(kv.v_l[il]),
                (kv_head)*ggml_element_size(kv.v_l[il]));

        v_cur = ggml_transpose(ctx, v_cur);
    }
    ggml_build_forward_expand(graph, ggml_cpy(ctx, v_cur, v_cache_view));
(gdb) p v_cache_view->ne
$31 = {6, 4096, 1, 1}

And this is then transposed and also copied into the value cache view. So the above will have populated the cache with the key and value for one single layer. The next time we process a token which belongs to the same sequence, this will add a row to the key cache matrix and, since the value cache is stored transposed, a column to the value cache matrix.
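As a sketch of the two layouts described above (assuming F16 storage and the sizes from this session), this is how the same (token, dimension) element would be addressed in the key cache versus the transposed value cache:

#include <cstdio>
#include <cstddef>

int main() {
    const size_t n_ctx  = 1024;  // number of cache cells
    const size_t n_embd = 4096;  // n_embd_k_gqa == n_embd_v_gqa in this session
    const size_t esz    = 2;     // element size for GGML_TYPE_F16

    const size_t token = 6, dim = 10;  // arbitrary element: the token in cell 6, embedding dim 10

    // K cache: one contiguous row of n_embd values per token
    size_t k_off = (token * n_embd + dim) * esz;

    // V cache (without flash attention): transposed, one row of n_ctx values per embedding dim
    size_t v_off = (dim * n_ctx + token) * esz;

    std::printf("k offset = %zu bytes, v offset = %zu bytes\n", k_off, v_off);
    return 0;
}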

This is the last thing to happen in llm_build_kv_store and we will be back in llm_build_kv, just after this call:

    llm_build_kv_store(ctx, hparams, cparams, kv, graph, k_cur, v_cur, n_tokens, kv_head, cb, il);

Next llm_build_kqv is called:
static struct ggml_tensor * llm_build_kqv(
        struct ggml_context * ctx,
       struct llama_context & lctx,
       const llama_kv_cache & kv,
         struct ggml_cgraph * graph,
         struct ggml_tensor * wo,
         struct ggml_tensor * wo_b,
         struct ggml_tensor * q_cur,
         struct ggml_tensor * kq_mask,
                    int32_t   n_tokens,
                    int32_t   n_kv,
                    float     kq_scale,
         const llm_build_cb & cb,
                    int       il) {

    const llama_model   & model   = lctx.model;
    const llama_hparams & hparams = lctx.model.hparams;
    const llama_cparams & cparams = lctx.cparams;

    const int64_t n_ctx         = cparams.n_ctx;
    const int64_t n_head        = hparams.n_head(il);
    const int64_t n_head_kv     = hparams.n_head_kv(il);
    const int64_t n_embd_head_k = hparams.n_embd_head_k;
    const int64_t n_embd_k_gqa  = hparams.n_embd_k_gqa(il);
    const int64_t n_embd_head_v = hparams.n_embd_head_v;
    const int64_t n_embd_v_gqa  = hparams.n_embd_v_gqa(il);

    struct ggml_tensor * q = ggml_permute(ctx, q_cur, 0, 2, 1, 3);

So this will swap the second and third dimensions of q_cur:

(lldb) p q_cur->ne
(int64_t[4])  ([0] = 128, [1] = 32, [2] = 6, [3] = 1)

(lldb) p q->ne
(int64_t[4])  ([0] = 128, [1] = 6, [2] = 32, [3] = 1)

So this will be restructured to something like this:

q matrix:

z0
          x-axis ->
  0   [0  ...   127]          y-axis
      .                         ↓
      .
  5   [0  ...   127]

.
.
.

z31
  0   [0  ...   127]
      .
      .
  5   [0  ...   127]

So we have 32 heads, and for each head a matrix with 6 rows of 128 dimensions each, which is what I'm trying to convey here.

Next something similar is done for the Key matrix:

    struct ggml_tensor * k =
        ggml_view_3d(ctx, kv.k_l[il],
                n_embd_head_k, n_kv, n_head_kv,
                ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
                ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
                0);
    cb(k, "k", il);

Notice that this is using kv.k_l[il], which is the tensor of the cache for this layer. So when the k q multiplication is done below it will be using this view of the cache.

But notice that the dimensions are different:

z0
  0   [0  ...   127]
      .
      .
      .
      .
  31  [0  ...   127]
.  
.
.

z31
  0   [0  ...   127]
      .
      .
      .
      .
  31  [0  ...   127]

(lldb) p k->ne
(int64_t[4])  ([0] = 128, [1] = 32, [2] = 32, [3] = 1)

Next there is a block for flash attention: TODO: try out flash attention.

    if (cparams.flash_attn) {

    } else {
        struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
        cb(kq, "kq", il);

Next the actual QK^T operation is created:

struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);

(lldb) p k->ne
(int64_t[4])  ([0] = 128, [1] = 32, [2] = 32, [3] = 1)
(lldb) p q->ne
(int64_t[4])  ([0] = 128, [1] = 6, [2] = 32, [3] = 1)

We can visualize this as we are performing 32 separate 2d multiplications:

      k matrix
z0
  0   [0  ...   127]
      .
      .
      .
      .
  31  [0  ...   127]

       q matrix
z0
          x-axis ->
  0   [0  ...   127]          y-axis
      .                         ↓
      .                       
  5   [0  ...   127]          

We need to keep in mind that ggml will transpose the second tensor so this becomes:

      k matrix                   q matrix

  0   [0  ...   127]        0    [0  ...  5]     0  [0 ... 31]
      .                          .                  .
      .                x         .            =     .
      .                          .               5  [0 ... 31]
  31  [0  ...   127]             .

                           127   [0  ...  5]

And this will enable the multiplication to work.
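If you want to confirm these shapes outside of llama.cpp, a minimal standalone ggml sketch (F32 tensors instead of the cache's F16, and an arbitrary memory pool size, since we only build the operations and never compute them) would look something like this:

#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // q_cur: {n_embd_head, n_head, n_tokens} = {128, 32, 6}
    struct ggml_tensor * q_cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 128, 32, 6);
    struct ggml_tensor * q     = ggml_permute(ctx, q_cur, 0, 2, 1, 3);  // -> {128, 6, 32}

    // k: stand-in for the view of the cache, {n_embd_head_k, n_kv, n_head_kv} = {128, 32, 32}
    struct ggml_tensor * k  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 128, 32, 32);
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);

    std::printf("q  ne = {%d, %d, %d}\n", (int) q->ne[0],  (int) q->ne[1],  (int) q->ne[2]);
    std::printf("kq ne = {%d, %d, %d}\n", (int) kq->ne[0], (int) kq->ne[1], (int) kq->ne[2]);
    // prints q ne = {128, 6, 32} and kq ne = {32, 6, 32}:
    // per head, an n_tokens x n_kv matrix of attention scores.

    ggml_free(ctx);
    return 0;
}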

Next we have:

        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 ||
            model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2 ||
            model.arch == LLM_ARCH_NEMOTRON || model.arch == LLM_ARCH_CHATGLM) {
            // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
            // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
            ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
        }

This is not the case for this session but something that might be good to be aware of.

Next there is a special case for GROK:

        if (model.arch == LLM_ARCH_GROK) {
            // need to do the following:
            // multiply by attn_output_multiplyer of 0.08838834764831845
            // and then :
            // kq = 30 * tanh(kq / 30)
            // before the softmax below

            //try from phi2
            //ggml_mul_mat_set_prec(kq, GGML_PREC_F32);

            kq = ggml_tanh(ctx, ggml_scale(ctx, kq, 0.08838834764831845f/30.0f));
            kq = ggml_scale(ctx, kq, 30);
        }

That is pretty much it for llama_build_graph related to the kv cache.

Now, let's take a look at llama_set_inputs to see if the kv cache is used there.

    if (lctx.inp_KQ_mask || lctx.inp_KQ_mask_swa) {
        // NOTE: hparams.causal_attn indicates the model is capable of generation and uses the kv cache.
        if (cparams.causal_attn && !lctx.is_encoding) {
            const int64_t n_kv         = kv_self.n;
            const int64_t n_tokens     = ubatch.n_tokens;
            const int64_t n_seq_tokens = ubatch.n_seq_tokens;
            const int64_t n_seqs       = ubatch.n_seqs;


            float * data     = nullptr;
            float * data_swa = nullptr;

            if (lctx.inp_KQ_mask) {
                GGML_ASSERT(ggml_backend_buffer_is_host(lctx.inp_KQ_mask->buffer));
                data = (float *) lctx.inp_KQ_mask->data;
            }

Next we have the following for loop which is setting the mask (lctx.inp_KQ_mask) for a single head (h):

            for (int h = 0; h < 1; ++h) {
                for (int s = 0; s < n_seqs; ++s) {
                    const llama_seq_id seq_id = ubatch.seq_id[s][0];

                    for (int j = 0; j < n_seq_tokens; ++j) {
                        const llama_pos pos = ubatch.pos[s*n_seq_tokens + j];

                        for (int i = 0; i < n_kv; ++i) {
                            float f;
                            if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
                                f = -INFINITY;
                            } else {
                                if (hparams.use_alibi) {
                                    f = -std::abs(kv_self.cells[i].pos - pos);
                                } else {
                                    f = 0.0f;
                                }
                            }

                            if (data) {
                                data[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }

                            // may need to cut off old tokens for sliding window
                            if (data_swa) {
                                if (pos - kv_self.cells[i].pos >= (int32_t)hparams.n_swa) {
                                    f = -INFINITY;
                                }
                                data_swa[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }
                        }
                    }
                }

                if (data) {
                    for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i) {
                        for (int j = 0; j < n_kv; ++j) {
                            data[h*(n_kv*n_tokens) + i*n_kv + j] = -INFINITY;
                        }
                    }
                }

                if (data_swa) {
                    for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i) {
                        for (int j = 0; j < n_kv; ++j) {
                            data_swa[h*(n_kv*n_tokens) + i*n_kv + j] = -INFINITY;
                        }
                    }
                }
            }

Since h will always be zero we can simplify this a little for readability:

                for (int s = 0; s < n_seqs; ++s) {
                    const llama_seq_id seq_id = ubatch.seq_id[s][0];

                    for (int j = 0; j < n_seq_tokens; ++j) {
                        const llama_pos pos = ubatch.pos[s*n_seq_tokens + j];

                        for (int i = 0; i < n_kv; ++i) {
                            float f;
                            if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
                                f = -INFINITY;
                            } else {
                                if (hparams.use_alibi) {
                                    f = -std::abs(kv_self.cells[i].pos - pos);
                                } else {
                                    f = 0.0f;
                                }
                            }

                            if (data) {
                                data[s * (n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }

                            // may need to cut off old tokens for sliding window
                            if (data_swa) {
                                if (pos - kv_self.cells[i].pos >= (int32_t)hparams.n_swa) {
                                    f = -INFINITY;
                                }
                                data_swa[s * (n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }
                        }
                    }
                }

The mask (for a single head) is a square matrix:

(gdb) p lctx.inp_KQ_mask->ne
$23 = {32, 32, 1, 1}

The above will iterate over the sequences in the ubatch (n_seqs, which is 6 in this case since each token here is its own sequence entry in the ubatch), and for each sequence over its tokens (n_seq_tokens, which is 1 here). Then we iterate over all the entries in the kv cache (32 here, even though we only have 6 tokens, due to the padding we saw earlier). For each entry in the cache the code checks if the current cache cell does not have the current token's sequence id, or if the current cell's position is greater than the current token's position.

This is what the cache cells looks like at this point:

cell_0    {pos = 0, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}}
cell_1    {pos = 1, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}}
cell_2    {pos = 2, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}}
cell_3    {pos = 3, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}}
cell_4    {pos = 4, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}}
cell_5    {pos = 5, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}}
cell_6    {pos = -1, delta = 0, src = -1, tail = -1, seq_id = std::set with 0 elements}
.
.
.
cell_1023 {pos = -1, delta = 0, src = -1, tail = -1, seq_id = std::set with 0 elements}

So for the first entry the else clause will be executed and the value of f will be 0.0f. And recall that data is a pointer to the tensor lctx.inp_KQ_mask's data.

    data[s * (n_kv * n_seq_tokens) + j * n_kv + i] = f;

In the first iteration s is 0, j is 0, and i is 0:

    data[0] = 0.0f;

For the next iteration (of 32) the value of f will be -INFINITY:

    data[1] = -INFINITY;

And this will continue until all 32 entries have been processed for the first token in the sequence.


      0    1    2    3    4    5    6   ...        31
    +----+----+----+----+----+----+----+----+----+----+
    |0.0f|-inf|-inf|-inf|-inf|-inf|-inf|  ...    |    |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+

Notice that the values in this matrix are initially zero.

Then we do the same for the next token in the sequence (s=1).

    data[1 * (n_kv * n_seq_tokens) + j * n_kv + i] = f;
    data[1 * (n_kv * n_seq_tokens)] = f;
    data[1 * (32 * 1)] = f;
    data[1 * (32)] = f;
    data[(32)] = f;

      0    1    2    3    4    5    6   ...        31
    +----+----+----+----+----+----+----+----+---------+
    |0.0f|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+---------+
    |0.0f|0.0f|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+

Then we do the same for the next token in the sequence (s=2).

    data[2 * (n_kv * n_seq_tokens) + j * n_kv + i] = f;
    data[2 * (n_kv * n_seq_tokens)] = f;
    data[2 * (32 * 1)] = f;
    data[2 * (32)] = f;
    data[64] = f;

      0    1    2    3    4    5    6   ...        31
    +----+----+----+----+----+----+----+----+---------+
    |0.0f|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+---------+
    |0.0f|0.0f|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    |0.0f|0.0f|0.0f|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+

And this continues until all of the ubatch's n_seqs entries have been processed, which is 6 in this case:

      0    1    2    3    4    5    6   ...        31
    +----+----+----+----+----+----+----+----+---------+
0   |0.0f|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+---------+
1   |0.0f|0.0f|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
2   |0.0f|0.0f|0.0f|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
3   |0.0f|0.0f|0.0f|0.0f|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
4   |0.0f|0.0f|0.0f|0.0f|0.0f|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
5   |0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
6   | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
7   | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    .
    .
    .
    +----+----+----+----+----+----+----+----+----+----+
31  | 0  | 0  | 0  | 0  | 0  |  0 |  0 | 0  |   ...   |
    +----+----+----+----+----+----+----+----+----+----+

Then we have the following loop:

                if (data) {
                    for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i) {
                        for (int j = 0; j < n_kv; ++j) {
                            data[h*(n_kv*n_tokens) + i*n_kv + j] = -INFINITY;
                        }
                    }
                }

The GGML_PAD macro is defined as:

#define GGML_KQ_MASK_PAD 32
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

GGML_PAD(n_tokens, GGML_KQ_MASK_PAD);
GGML_PAD(6, 32);
GGML_PAD(6, 32) (((6) + (32) - 1) & ~((32) - 1))
= 32

This rounds n_tokens up to the nearest multiple of 32, which is 32 in this case. So in our case this will iterate over the last 26 rows of the mask (rows 6 through 31, the padding rows) and set all of their entries to -INFINITY, so these padding rows look something like this:

      0    1    2    3    4    5    6   ...        31
    +----+----+----+----+----+----+----+----+---------+
0   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+---------+
1   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
2   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
3   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
4   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
5   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
6   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
7   |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
    .
    .
    .
    +----+----+----+----+----+----+----+----+----+----+
31  |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|   ...   |
    +----+----+----+----+----+----+----+----+----+----+
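Putting the two loops together, here is a stripped-down standalone sketch that reproduces the mask above (single head, single sequence, no alibi and no sliding window; the cell positions are the ones from this session, and an empty cell is represented by pos = -1 instead of checking seq_id):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_kv     = 32;  // kv_self.n after padding
    const int n_tokens = 6;   // tokens in the ubatch (the prompt)
    const int padded   = 32;  // GGML_PAD(n_tokens, GGML_KQ_MASK_PAD)

    // cells 0..5 are in use for sequence 0 with positions 0..5, the rest are empty (-1)
    std::vector<int> cell_pos(n_kv, -1);
    for (int i = 0; i < n_tokens; ++i) cell_pos[i] = i;

    std::vector<float> data(padded * n_kv, 0.0f);

    // causal part: token j may only attend to cells that are in use and not in its future
    for (int j = 0; j < n_tokens; ++j) {
        for (int i = 0; i < n_kv; ++i) {
            const bool masked = cell_pos[i] < 0 || cell_pos[i] > j;
            data[j * n_kv + i] = masked ? -INFINITY : 0.0f;
        }
    }
    // padding rows are fully masked
    for (int j = n_tokens; j < padded; ++j) {
        for (int i = 0; i < n_kv; ++i) data[j * n_kv + i] = -INFINITY;
    }

    // print '0' for an unmasked entry and '-' for -INFINITY
    for (int j = 0; j < padded; ++j) {
        for (int i = 0; i < n_kv; ++i) std::putchar(data[j * n_kv + i] == 0.0f ? '0' : '-');
        std::putchar('\n');
    }
    return 0;
}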

After this we will continue in llama_set_inputs but there is nothing more related to the kv cache in this function.

So to recap: before the QK^T operation is performed, the roped keys are copied into the layer's kv.k_l tensor, and the multiplication then uses a view of that cache tensor. The same goes for the value cache, which is used in the operation:

        struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);
        cb(kqv, "kqv", il);

The next thing that happens is that the graph is computed, which will perform the operations that have been built up in the graph. Following that the cache head is updated:

        // update the kv ring buffer
        {
            kv_self.head += n_tokens;

            // Ensure kv cache head points to a valid index.
            if (kv_self.head >= kv_self.size) {
                kv_self.head = 0;
            }
        }

So for the next decode kv_self.head will be 6, and the offsets we saw above for the key and value caches will use 6 when calculating where to store the roped key and value cache entries for the next token.
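Just to trace how kv_self.head evolves with the code above (assuming a 6-token prompt followed by single-token decodes):

    head: 0     --decode 6-token prompt-->  6
    head: 6     --decode 1 token--------->  7
    ...
    head: 1023  --decode 1 token--------->  1024 >= size (1024), so head wraps to 0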

llama_kv_cache

A llama_context contains a member named kv_self (self as in self attention) which is of type llama_kv_cache. This struct is defined in llama.cpp:

struct llama_context {
    ...
    // key + value cache for the self attention
    struct llama_kv_cache kv_self;

So every llama_context will have a key-value cache for self attention.

And the llama_kv_cache struct is defined as follows:

struct llama_kv_cache {
    bool has_shift = false;
    bool do_defrag = false;
    bool do_copy   = false;
    bool recurrent = false; // with recurrent state models, a cell can hold the state for more than one past token
    bool v_trans   = true;  // the value tensor is transposed
    uint32_t head = 0;
    uint32_t size = 0;
    uint32_t used = 0; // used cells (i.e. at least one seq_id)

    // computed before each graph build
    uint32_t n = 0;

    ggml_type type_k = GGML_TYPE_F16;
    ggml_type type_v = GGML_TYPE_F16;

    std::vector<llama_kv_cell> cells;

    std::vector<struct ggml_tensor *> k_l; // per layer
    std::vector<struct ggml_tensor *> v_l;

    std::vector<struct ggml_context *> ctxs;
    std::vector<ggml_backend_buffer_t> bufs;

Recall that there is a KV-cache per layer in the transformer architecture. And notice that there is a vector of ggml_tensor pointers for the keys and one for the values, with one entry per layer. So for each layer there is a tensor which we will see later is a 1d tensor, or just a list of values. The cache also holds a vector of ggml_contexts and a vector of ggml_backend_buffers (as we will see, one per buffer type used).

So when a llama_context is created the kv_self will also be created using default initialization (so just default values will be assigned to it).

struct llama_context {
    llama_context(const llama_model & model) : model(model), t_start_us(model.t_start_us), t_load_us(model.t_load_us) {}
    ...
}
$ gdb --args simple_prompt
(gdb) br llama_new_context_with_model
(gdb) r
(gdb) 
16368	    llama_context * ctx = new llama_context(*model);
(gdb) n
(gdb) p ctx.kv_self 
$1 = {has_shift = false, do_defrag = false, do_copy = false, recurrent = false,
v_trans = true, head = 0, size = 0, used = 0, n = 0,
type_k = GGML_TYPE_F16,
type_v = GGML_TYPE_F16,
cells = std::vector of length 0, capacity 0, 
k_l = std::vector of length 0, capacity 0,
v_l = std::vector of length 0, capacity 0,
ctxs = std::vector of length 0, capacity 0, 
bufs = std::vector of length 0, capacity 0}

So after the construction of a llama_context the kv_self member is initialized to default values; there have not been any explicit assignments to any of the members of kv_self.

Further down in llama_new_context_with_model kv_self is initialized with:

        if (!llama_kv_cache_init(ctx->kv_self, ctx, type_k, type_v, kv_size, cparams.offload_kqv)) {
            LLAMA_LOG_ERROR("%s: llama_kv_cache_init() failed for self-attention cache\n", __func__);
            llama_free(ctx);
            return nullptr;
        }

So we are passing in ctx->kv_self which will have a local name of cache below:

static bool llama_kv_cache_init(
             struct llama_kv_cache & cache,
               const llama_context * ctx,
                         ggml_type   type_k,
                         ggml_type   type_v,
                          uint32_t   kv_size,
                              bool   offload) {

    const int64_t  n_layer = hparams.n_layer;

    cache.has_shift = false;

    cache.recurrent = llama_model_is_recurrent(&model);
    cache.v_trans   = !cache.recurrent && !cparams.flash_attn;

    cache.head = 0;
    cache.size = kv_size;
    cache.used = 0;

    cache.type_k = type_k;
    cache.type_v = type_v;

    cache.cells.clear();
    cache.cells.resize(kv_size);

The kv_size that is passed in will be the size of the context parameter n_ctx, unless the model is a recurrent model like Mamba.

(gdb) p n_layer
$3 = 32

(gdb) p kv_size
$12 = 1024

(gdb) p cache.v_trans
$4 = true

(gdb) p cache.size
$5 = 1024

(gdb) p cache.type_v
$9 = GGML_TYPE_F16

(gdb) p cache.cells.size()
$10 = 1024

So we can see that we have 1024 cells in this cache.

Next, a map from ggml_backend_buffer_type_t to a count of how many layers use each type of backend buffer is built for the kv cache. In our case offload is true (it comes from cparams.offload_kqv) so that is the path that will be taken. But also notice that if this was not the case then the default buffer type llama_default_buffer_type_cpu(true) would be used as the key, with a count of n_layer (32).

But for the offload case we iterate over the number of layers and count the number of different buffer types used:

    // count used buffer types
    std::map<ggml_backend_buffer_type_t, int> buft_layer_count;
    if (offload) {
        for (int64_t i = 0; i < n_layer; ++i) {
            buft_layer_count[model.buft_layer[i].buft]++;
        }
    } else {
        buft_layer_count[llama_default_buffer_type_cpu(true)] = n_layer;
    }

So the model struct has a field buft_layer which is a vector of layer_buft entries:

(gdb) ptype model.buft_layer
type = std::vector<llama_model::layer_buft>

(gdb) p model.buft_layer.size()
$14 = 32

This vector is populated by llama_load_tensors.

After this the map looks like this:

(gdb) p buft_layer_count
$25 = std::map with 1 element = {[0x555555978400 <ggml_backend_cpu_buffer_type>] = 32}

Next, for each entry in the buft_layer_count map we create a ggml context for that buffer type; in this case there is only one entry, which has a count of 32:

    // create a context for each buffer type
    std::map<ggml_backend_buffer_type_t, ggml_context *> ctx_map;
    for (auto & it : buft_layer_count) {
        int n_layers = it.second;
        struct ggml_init_params params = {
            /*.mem_size   =*/ 2u*n_layers*ggml_tensor_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,
        };
        ggml_context * ctx = ggml_init(params);
        if (!ctx) {
            LLAMA_LOG_ERROR("%s: failed to allocate context for kv cache\n", __func__);
            return false;
        }
        ctx_map[it.first] = ctx;
        cache.ctxs.push_back(ctx);
    }

So after this there will be one entry in cache.ctxs.

Next the key and value tensor vectors are reserved with a capacity equal to the number of layers in the model; the following happens in llama_kv_cache_init:

    cache.k_l.reserve(n_layer);
    cache.v_l.reserve(n_layer);

And these vectors store elements of type ggml_tensor*:

(lldb) p n_layer
(const int64_t) 32

(gdb) ptype cache.k_l
type = std::vector<ggml_tensor*>

So each of these vectors will be able to hold 32 ggml_tensor pointers, one for each layer.

Next, these vectors are populated with the following:

    for (int i = 0; i < (int) n_layer; i++) {
        const uint32_t n_embd_k_gqa = hparams.n_embd_k_gqa(i) + hparams.n_embd_k_s();
        const uint32_t n_embd_v_gqa = hparams.n_embd_v_gqa(i) + hparams.n_embd_v_s();

        struct ggml_context * ctx = offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
        ggml_tensor * k = ggml_new_tensor_1d(ctx, type_k, n_embd_k_gqa*kv_size);
        ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, n_embd_v_gqa*kv_size);
        ggml_format_name(k, "cache_k_l%d", i);
        ggml_format_name(v, "cache_v_l%d", i);
        cache.k_l.push_back(k);
        cache.v_l.push_back(v);
    }

So we have this function n_embd_k_gqa which returns the number of embedding dimensions for the Key matrix for grouped query attention (GQA). Notice that we are passing in the layer, which suggests that there can be different embedding sizes for different layers.

    uint32_t n_embd_k_gqa(uint32_t il = 0) const { // dimension of key embeddings across all k-v heads
        const uint32_t n_head_kv = this->n_head_kv(il);

        return n_embd_head_k * n_head_kv;
    }

And n_embd_head_k is the embedding size per key head, while n_head_kv returns the number of key-value heads for the given layer:

    uint32_t n_head_kv(uint32_t il = 0) const {
        if (il < n_layer) {
            return n_head_kv_arr[il];
        }

        GGML_ABORT("fatal error");
    }

n_head_kv_arr is a fixed size array, the size of which is LLAMA_MAX_LAYERS, currently 512:

#define LLAMA_MAX_LAYERS  512

    std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_arr;
    std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
    std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;

Notice that there are other per-layer parameters: n_head_arr holds the number of heads per layer, and n_ff_arr holds the feed-forward (MLP) length per layer. The values are loaded from the model in llm_load_hparams:

    // zero-out the per-layer hparams
    std::fill(hparams.n_head_arr.begin(),    hparams.n_head_arr.end(),    0);
    std::fill(hparams.n_head_kv_arr.begin(), hparams.n_head_kv_arr.end(), 0);
    std::fill(hparams.n_ff_arr.begin(),      hparams.n_ff_arr.end(),      0);

    ml.get_key_or_arr(LLM_KV_FEED_FORWARD_LENGTH,  hparams.n_ff_arr,   hparams.n_layer);
    ml.get_key_or_arr(LLM_KV_ATTENTION_HEAD_COUNT, hparams.n_head_arr, hparams.n_layer);

    // n_head_kv is optional, default to n_head
    hparams.n_head_kv_arr = hparams.n_head_arr;

    ml.get_key_or_arr(LLM_KV_ATTENTION_HEAD_COUNT_KV, hparams.n_head_kv_arr, hparams.n_layer, false);

So the model can indeed have a different number of heads for each layer, but in our case we have the same number of heads for each layer.

(lldb) p this
(const llama_hparams *) 0x0000000132812228

(gdb) p n_head_kv_arr.size()
$23 = 512
(gdb) p n_head_kv_arr[il]
$25 = 32

So the dimensions for these tensors will be:

(gdb) p n_embd_k_gqa
$35 = 4096
(gdb) p n_embd_v_gqa
$36 = 4096

And the size of the cache in this case is kv_size, so the above will create a 1d tensor of size n_embd_k_gqa*kv_size (4096*1024) for the key and the value tensors. And hparams.n_embd_k_s() is zero in this case as it is only used for recurrent models, I think, which is not the case here. So the embedding dimension for each layer is 4096.
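Plugging in the values from this session:

    n_embd_k_gqa  = n_embd_head_k * n_head_kv = 128 * 32    = 4096
    k tensor size = n_embd_k_gqa  * kv_size   = 4096 * 1024 = 4194304 elements (per layer)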

Just to recap where we are:

    for (int i = 0; i < (int) n_layer; i++) {
        const uint32_t n_embd_k_gqa = hparams.n_embd_k_gqa(i) + hparams.n_embd_k_s();
        const uint32_t n_embd_v_gqa = hparams.n_embd_v_gqa(i) + hparams.n_embd_v_s();

        struct ggml_context * ctx = offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
        ggml_tensor * k = ggml_new_tensor_1d(ctx, type_k, n_embd_k_gqa*kv_size);
        ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, n_embd_v_gqa*kv_size);
        ggml_format_name(k, "cache_k_l%d", i);
        ggml_format_name(v, "cache_v_l%d", i);
        cache.k_l.push_back(k);
        cache.v_l.push_back(v);
    }

So each of the k tensors will have a size equal to the embedding dimension times the number of items that can be stored in the cache.

(gdb) p n_embd_k_gqa 
$39 = 4096
(gdb) p kv_size
$40 = 1024

And we can take a look at the shape of k:

(gdb) p *k
$3 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {4194304, 1, 1, 1}, nb = {2, 8388608, 8388608, 8388608}, 
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>},
flags = 0,
grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0,
view_offs = 0,
data = 0x0,
name = '\000' <repeats 63 times>, 
extra = 0x0, padding = "\000\000\000\000\000\000\000"}

So this is a 1d tensor:

   [0                                                        4194304]

And these tensors will then be added to the cache.k_l and cache.v_l vectors:

        cache.k_l.push_back(k);
        cache.v_l.push_back(v);

So each layer will have a key tensor with 4194304 elements, which we can think of as 1024 rows (one row for each entry in the cache), each with 4096 dimensions (the embedding dimension size):

     0   [0   ...        4095]
     ...
     ...
     ...
  1023   [0   ...        4095]

And recall that this will be done for each n_layers in the model.
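As a sanity check on the sizes, a quick back-of-the-envelope calculation (values from this session) gives the 512 MiB that we will see allocated for the backend buffer further down:

#include <cstdio>

int main() {
    const long long n_layer      = 32;
    const long long kv_size      = 1024;  // number of cache cells (n_ctx here)
    const long long n_embd_k_gqa = 4096;  // same value for n_embd_v_gqa in this model
    const long long f16_size     = 2;     // type_k == type_v == GGML_TYPE_F16

    const long long per_layer_k = n_embd_k_gqa * kv_size * f16_size;  // bytes for one cache_k_lN
    const long long total       = n_layer * 2 * per_layer_k;          // K and V for every layer

    std::printf("per layer K: %lld MiB\n", per_layer_k / (1024 * 1024));  // 8 MiB
    std::printf("total KV   : %lld MiB\n", total / (1024 * 1024));        // 512 MiB
    return 0;
}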

Again, recall that the ctx_map only contains one entry.

    // allocate tensors and initialize the buffers to avoid NaNs in the padding
    for (auto it : ctx_map) {
        ggml_backend_buffer_type_t buft = it.first;
        ggml_context * ctx = it.second;
        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
        if (!buf) {
            LLAMA_LOG_ERROR("%s: failed to allocate buffer for kv cache\n", __func__);
            return false;
        }
        ggml_backend_buffer_clear(buf, 0);
        cache.bufs.push_back(buf);
    }

Now, buft describes the buffer type and we can inspect it like this:

(gdb) ptype buft
type = struct ggml_backend_buffer_type {
    ggml_backend_buffer_type_i iface;
    ggml_backend_buffer_type_context_t context;
} *

(gdb) ptype buft.iface
type = struct ggml_backend_buffer_type_i {
    const char *(*get_name)(ggml_backend_buffer_type_t);
    ggml_backend_buffer_t (*alloc_buffer)(ggml_backend_buffer_type_t, size_t);
    size_t (*get_alignment)(ggml_backend_buffer_type_t);
    size_t (*get_max_size)(ggml_backend_buffer_type_t);
    size_t (*get_alloc_size)(ggml_backend_buffer_type_t, const ggml_tensor *);
    _Bool (*is_host)(ggml_backend_buffer_type_t);
}

(gdb) p buft.iface.get_name(buft)
$67 = 0x555555882d22 "CPU"

(gdb) p buft.iface.is_host(buft)
$70 = true

Notice that buf is of type ggml_backend_buffer_t which is different from buft which is of type ggml_backend_buffer_type_t; I have some trouble keeping these apart as the names are somewhat similar. I try to think of one as a description of a backend buffer type and the other as an actual buffer. So in the following we are passing in the buffer type to allocate a new buffer of that type:

        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);

    size_t alignment = ggml_backend_buft_get_alignment(buft);
    size_t max_size = ggml_backend_buft_get_max_size(buft);

    ggml_backend_buffer_t * buffers = NULL;
    size_t n_buffers = 0;
(gdb) p alignment 
$71 = 32
(gdb) p max_size
$72 = 18446744073709551615
    ggml_backend_buffer_t * buffers = NULL;
    size_t n_buffers = 0;

    size_t cur_buf_size = 0;
    struct ggml_tensor * first = ggml_get_first_tensor(ctx);

first will be cache_k_l0:

(gdb) p *first
$74 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4194304, 1, 1, 1}, nb = {2, 8388608, 8388608, 
    8388608}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
    0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0, 
  name = "cache_k_l0", '\000' <repeats 53 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}

Then the following loop will iterate over the tensors in the passed in context and calculate the size for the buffer:

    for (struct ggml_tensor * t = first; t != NULL; t = ggml_get_next_tensor(ctx, t)) {
        size_t this_size = 0;
        if (t->data == NULL && t->view_src == NULL) {
            this_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
        }

        if (this_size > max_size) {
            ...
        }

        if ((cur_buf_size + this_size) > max_size) {
            // allocate tensors in the current buffer
            if (!alloc_tensor_range(ctx, first, t, buft, cur_buf_size, &buffers, &n_buffers)) {
                return NULL;
            }
            first = t;
            cur_buf_size = this_size;
        } else {
            cur_buf_size += this_size;
        }
    }

The second time through the loop t will be:

(gdb) p *t
$76 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4194304, 1, 1, 1}, nb = {2, 8388608, 8388608, 
    8388608}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
    0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0, 
  name = "cache_v_l0", '\000' <repeats 53 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}

When the loop has completed the following will be run:

    if (cur_buf_size > 0) {
        if (!alloc_tensor_range(ctx, first, NULL, buft, cur_buf_size, &buffers, &n_buffers)) {
            return NULL;
        }
    }

We can find alloc_tensor_range in ggml-alloc.c:

static bool alloc_tensor_range(struct ggml_context * ctx,
        struct ggml_tensor * first, struct ggml_tensor * last,
        ggml_backend_buffer_type_t buft, size_t size,
        ggml_backend_buffer_t ** buffers, size_t * n_buffers) {
    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, size);

And in ggml-backend.c we have:

GGML_CALL ggml_backend_buffer_t ggml_backend_buft_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    return buft->iface.alloc_buffer(buft, size);
}

Which in our case will call the following function:

GGML_CALL static ggml_backend_buffer_t ggml_backend_cpu_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    size += TENSOR_ALIGNMENT;   // malloc may return an address that is not aligned
    void * data = malloc(size); // TODO: use GGML_ALIGNED_MALLOC (move to ggml-impl.h)
    if (data == NULL) {
        fprintf(stderr, "%s: failed to allocate buffer of size %zu\n", __func__, size);
        return NULL;
    }

    return ggml_backend_buffer_init(buft, cpu_backend_buffer_i, data, size);
}

Now, malloc will take the size (in bytes), so that will be:

(gdb) p size / 1024.0/1024.0
$82 = 512.00003051757812

So 512 MB will be allocated for the buffer. If we were using a different backend, for example CUDA, this would instead call ggml_backend_cuda_buffer_type_alloc_buffer in llama.cpp/ggml-backend.cu, which uses ggml_cuda_device_malloc to allocate memory on the device instead.
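
We can account for this size from the tensors we saw above: each per-layer cache tensor (cache_k_l0, cache_v_l0, and so on) holds 4194304 F16 elements, and with 32 layers there are 64 such tensors (assuming, as for this model, that all layers have the same shape):

per-tensor size = 4194304 elements * 2 bytes (F16)  = 8388608 bytes (8 MB)
total           = 8 MB * 2 (K and V) * 32 layers    = 512 MB

The extra 0.00003 MB in the gdb output above is just the 32 bytes of TENSOR_ALIGNMENT added before the malloc call.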

The last call in the function is:

GGML_CALL ggml_backend_buffer_t ggml_backend_buffer_init(
               ggml_backend_buffer_type_t      buft,
        struct ggml_backend_buffer_i           iface,
               ggml_backend_buffer_context_t   context,
               size_t                          size) {
    ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));

    (*buffer) = (struct ggml_backend_buffer) {
        /* .interface = */ iface,
        /* .buft      = */ buft,
        /* .context   = */ context,
        /* .size      = */ size,
        /* .usage     = */ GGML_BACKEND_BUFFER_USAGE_ANY
    };

    return buffer;
}

Following this code there is some logging of the memory usage:

        {
            size_t memory_size_k = 0;
            size_t memory_size_v = 0;

            for (auto & k : ctx->kv_self.k_l) {
                memory_size_k += ggml_nbytes(k);
            }

            for (auto & v : ctx->kv_self.v_l) {
                memory_size_v += ggml_nbytes(v);
            }

            LLAMA_LOG_INFO("%s: KV self size  = %7.2f MiB, K (%s): %7.2f MiB, V (%s): %7.2f MiB\n", __func__,
                (float)(memory_size_k + memory_size_v) / (1024.0f * 1024.0f),
                ggml_type_name(type_k), (float)memory_size_k / (1024.0f * 1024.0f),
                ggml_type_name(type_v), (float)memory_size_v / (1024.0f * 1024.0f));
        }
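
With the sizes from this session the log line printed by that block would read something like this (the prefix is whatever __func__ expands to at that point):

KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB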

Next we have llama_output_reserve which is called in a number of places, so what does it actually do?

            // resized during inference when a batch uses more outputs
            if (llama_output_reserve(*ctx, params.n_seq_max) < params.n_seq_max) {
                LLAMA_LOG_ERROR("%s: failed to reserve initial output buffer\n", __func__);
                llama_free(ctx);
                return nullptr;
            }

In this case the n_seq_max is 1:

(gdb) p params.n_seq_max
$14 = 1
Later in llama_new_context_with_model the compute meta buffer is resized and the backend scheduler is created:

            ctx->buf_compute_meta.resize(
                ggml_tensor_overhead()*LLAMA_MAX_NODES + 
                ggml_graph_overhead_custom(LLAMA_MAX_NODES, false));
            ctx->sched = ggml_backend_sched_new(ctx->backends.data(),
                backend_buft.data(),
                ctx->backends.size(),
                LLAMA_MAX_NODES,
                pipeline_parallel);

What is a GGML Backend Scheduler? TODO: Look into this.

Multiple sequences

This section will take a closer look at how multiple sequences behave with regards to the kv-cache.

The example simple-prompt-multi sets up an initial batch with two sequences and then performs decoding using one of the two sequences (this is done by setting the sequence id to 0 for the first sequence and 1 for the second sequence).

What I'm curious about is how the k_l and v_l tensors and their offsets are handled.

For example, after the initial decoding of the first batch which contains two prompts the kv-cache will look like this:

(gdb) p ctx.kv_self.head
$3 = 13

(gdb) p ctx.kv_self.cells
$4 = std::vector of length 1024, capacity 1024 = {
{pos = 0,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 1,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 2,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 3,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 4,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 5,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 6,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 7,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 8,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 9,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 10, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 11, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 12, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = -1, delta = 0, src = -1, tail = -1, seq_id = std::set with 0 elements},

Inspecting the view created in llm_build_kv_store:

    struct ggml_tensor * k_cache_view = ggml_view_1d(ctx, kv.k_l[il], n_tokens*n_embd_k_gqa, ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)*kv_head);

We can see that the offset will be:

(gdb) p ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)*kv_head
$11 = 106496
(gdb) p kv_head
$12 = 13
(gdb) p n_embd_k_gqa
$13 = 4096

Recall that this is just setting up an operation to create a view so that the roped key values can be copied into this location.
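
Plugging in the numbers from the session, the byte offset into kv.k_l[il] works out as:

ggml_row_size(GGML_TYPE_F16, 4096) = 4096 * 2 bytes = 8192 bytes per position
offset                             = 8192 * 13 (kv_head) = 106496 bytes
view size                          = n_tokens * n_embd_k_gqa = 1 * 4096 elements (for this single-token decode)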

Now, if we turn our attention to llama_set_inputs we can take a look at the mask that is created. I went through this above but only for the initial batch/prompt and I did not go through this case where we are continuing a sequence and have multiple sequences in the cache already.

static void llama_set_inputs(llama_context & lctx, const llama_ubatch & ubatch) {
    const auto & hparams = lctx.model.hparams;
    const auto & cparams = lctx.cparams;
    const auto & kv_self = lctx.kv_self;
    ...

            for (int h = 0; h < 1; ++h) {
                for (int s = 0; s < n_seqs; ++s) {
                    const llama_seq_id seq_id = ubatch.seq_id[s][0];

                    for (int j = 0; j < n_seq_tokens; ++j) {
                        const llama_pos pos = ubatch.pos[s*n_seq_tokens + j];

                        for (int i = 0; i < n_kv; ++i) {
                            float f;
                            if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
                                f = -INFINITY;
                            } else {
                                if (hparams.use_alibi) {
                                    f = -std::abs(kv_self.cells[i].pos - pos);
                                } else {
                                    f = 0.0f;
                                }
                            }

                            if (data) {
                                data[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }

                            // may need to cut off old tokens for sliding window
                            if (data_swa) {
                                if (pos - kv_self.cells[i].pos >= (int32_t)hparams.n_swa) {
                                    f = -INFINITY;
                                }
                                data_swa[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }
                        }
                    }
                }

So like before this will iterate over the sequences (n_seqs) in the ubatch, which is just 1 now, and the number of tokens that this sequence has, which is also just one. Then it will iterate over all the entries in the cache (the cells) that are to be considered (n_kv); the cache in this case can hold 1024 entries like we discussed above, but n_kv is 32 (the number of used cells, 14, padded up to 32).

(gdb) p n_seqs
$18 = 1
(gdb) p n_seq_tokens
$19 = 1
(gdb) p n_kv
$20 = 32
(gdb) p seq_id
$22 = 1
(gdb) p kv_self.cells
{
{pos = 0,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 1,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 2,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 3,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 4,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 5,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 6,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 7,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 8,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 9,  delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 10, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 11, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 12, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}},
{pos = 13, delta = 0, src = -1, tail = -1, seq_id = std::set with 1 element = {[0] = 1}}

We can see that the first 6 entries belong to sequence 0 so the following will be true for them:

if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
    f = -INFINITY;
}...
if (data) {
    data[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
}
      0    1    2    3    4    5    6    7    8    9   10    11   12   13   14   15           31
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+----+
    |-inf|-inf|-inf|-inf|-inf|-inf|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+
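
With h = 0, s = 0, j = 0 and n_kv = 32, the index expression used in the loop collapses to just i, so the row above is written at data[0..31]:

data[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i]
  = data[0*(32*1) + 0*(32*1) + 0*32 + i]
  = data[i]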

This is setting the tensor data for lctx.inp_KQ_mask. This is a tensor that was created in build_llama and then passed in to llm_build_kv for each layer:

        // KQ_mask (mask for 1 head, it will be broadcasted to all heads)
        struct ggml_tensor * KQ_mask = build_inp_KQ_mask();
        ...
        for (int il = 0; il < n_layer; ++il) {
                ...
                cur = llm_build_kv(ctx0, lctx, kv_self, gf,
                        model.layers[il].wo, model.layers[il].bo,
                        Kcur, Vcur, Qcur, KQ_mask, n_tokens, kv_head, n_kv, kq_scale, cb, il);

Inside llm_build_kv this is then passed on to llm_build_kqv:

    cur  = llm_build_kqv(ctx, lctx, kv, graph, wo, wo_b, q_cur, kq_mask, n_tokens, n_kv, kq_scale, cb, il);

This tensor is passed to the softmax tensor operation:

        kq = ggml_soft_max_ext(ctx, kq, kq_mask, kq_scale, hparams.f_max_alibi_bias);
        cb(kq, "kq_soft_max_ext", il);
(gdb) p k->ne
$62 = {128, 32, 32, 1}

(gdb) p q->ne
$63 = {128, 1, 32, 1}

(gdb) p kq->ne
$59 = {32, 1, 32, 1}

(gdb) p kq_mask->ne
$61 = {32, 32, 1, 1}
          k
z_0
    0  [0  ... 127]
       ...
       ...
       ...
   31  [0  ... 127]

...

z_31
    0  [0  ... 127]
       ...
       ...
       ...
   31  [0  ... 127]

TODO: Add this information to the above section that walks through this part of the code (but it does so for the initial prompt). For a single token decode the matrix multiplication between Q and K looks like this:

struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);

Now, if we inspect the shape of the K and Q tensors:

(gdb) p q->ne
$2 = {128,  1, 32, 1}

(gdb) p k->ne
$1 = {128, 32, 32, 1}

(gdb) p kq_mask->ne
$6 = {32, 32, 1, 1}

Recall that the embedding dimension is 4096 and the number of heads is 32. So each head will have 128 dimensions/features.

We can visualize this something like this:

         Query (q)                    Key(k)

z0                              z_0
   0   [0      ...    127]           0  [0...31]
                                        ...
                                        ...
                                        ...
                                   127  [0...31]
...                                ...  
...                                ...
...                                ...

z_31                            z_31
   0   [0      ...    127]          0  [0...31]
                                       ...
                                       ...
                                       ...
                                   127  [0...31]

And notice that the Key matrix has 32 rows (tokens) which in this case contain the previously roped key values. Now, my understanding is that the values in this matrix might not all belong to the same sequence as the current token. But this is where the mask comes in and will mask out those values when the softmax is calculated. This works because when the matrix multiplication is done, the dot products against the roped k-values that are not part of the current sequence are independent of each other. What I mean is, if we have the following simplified example:

  q = [1, 2, 3]

  k = [1, 2, 3] (seq0)   [1  2  3]
      [4, 5, 6] (seq1)   [4  5  6] x [1  2  3] = [14  32  50]
      [7, 8, 9] (seq1)   [7  8  9]

  mask = [0    1   1]
         [14  32  50] -> [32  50]

The computation for seq0 is still being performed, but it will not contribute to the softmax calculation, as applying the mask removes the values that belong to seq0.
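
To make that concrete, here is a minimal standalone sketch (not llama.cpp code) of a masked softmax over the tiny example above; the -inf entry drives the weight of the seq0 position to zero even though its dot product was computed:

#include <cmath>
#include <cstdio>

int main() {
    float logits[3] = {14.0f, 32.0f, 50.0f};   // q.k for each cached position
    float mask[3]   = {-INFINITY, 0.0f, 0.0f}; // position 0 belongs to another sequence

    // add the mask, track the max for numerical stability
    float max_val = -INFINITY;
    for (int i = 0; i < 3; i++) {
        logits[i] += mask[i];
        if (logits[i] > max_val) max_val = logits[i];
    }

    // softmax over the masked logits
    float p[3];
    float sum = 0.0f;
    for (int i = 0; i < 3; i++) { p[i] = std::exp(logits[i] - max_val); sum += p[i]; }
    for (int i = 0; i < 3; i++) std::printf("p[%d] = %g\n", i, p[i] / sum);
    return 0;
}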

My initial reaction to this was that it seemed wasteful to compute the dot product against cached key values that don't belong to the current token's sequence. But modern CPUs and GPUs are optimized for these types of dense matrix operations; filtering or breaking up the computation would require irregular memory access patterns which would be less efficient. It sounds like this is a common pattern in transformer implementations: compute everything, then mask. This part of the code is just creating the operation and we don't have any actual values yet. The values are set in the llama_set_inputs function, which we discussed above before I derailed the discussion to this part; that is what connects the tensors with the actual data.

static void llama_set_inputs(llama_context & lctx, const llama_ubatch & ubatch) {
    ...
    if (lctx.inp_KQ_mask || lctx.inp_KQ_mask_swa) {
        if (cparams.causal_attn && !lctx.is_encoding) {
            const int64_t n_kv         = kv_self.n;
            const int64_t n_tokens     = ubatch.n_tokens;
            const int64_t n_seq_tokens = ubatch.n_seq_tokens;
            const int64_t n_seqs       = ubatch.n_seqs;


            float * data     = nullptr;

            if (lctx.inp_KQ_mask) {
                GGML_ASSERT(ggml_backend_buffer_is_host(lctx.inp_KQ_mask->buffer));
                data = (float *) lctx.inp_KQ_mask->data;
            }
(gdb) p lctx->inp_KQ_mask->ne
$8 = {32, 32, 1, 1}
(gdb) p seq_id
$10 = 1
(gdb) p n_seq_tokens
$11 = 1
(gdb) p ubatch.pos[0]
$16 = 13
(gdb) p n_kv
$17 = 32
            for (int h = 0; h < 1; ++h) {
                for (int s = 0; s < n_seqs; ++s) {
                    const llama_seq_id seq_id = ubatch.seq_id[s][0];

                    for (int j = 0; j < n_seq_tokens; ++j) {
                        const llama_pos pos = ubatch.pos[s*n_seq_tokens + j];

        (1) ---->       for (int i = 0; i < n_kv; ++i) {
                            float f;
                            if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
                                f = -INFINITY;
                            } else {
                                if (hparams.use_alibi) {
                                    f = -std::abs(kv_self.cells[i].pos - pos);
                                } else {
                                    f = 0.0f;
                                }
                            }

                            if (data) {
                                data[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }

                            // may need to cut off old tokens for sliding window
                            if (data_swa) {
                                if (pos - kv_self.cells[i].pos >= (int32_t)hparams.n_swa) {
                                    f = -INFINITY;
                                }
                                data_swa[h*(n_kv*n_tokens) + s*(n_kv*n_seq_tokens) + j*n_kv + i] = f;
                            }
                        }
                    }
                }

    (2)  -----> if (data) {
                    for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i) {
                        for (int j = 0; j < n_kv; ++j) {
                            data[h*(n_kv*n_tokens) + i*n_kv + j] = -INFINITY;
                        }
                    }
                }

So this iterates over all 32 cells considered (n_kv) and masks out the ones that don't belong to the current token's sequence:

      0    1    2    3    4    5    6    7    8    9   10    11   12   13   14   15           31
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+----+
    |-inf|-inf|-inf|-inf|-inf|-inf|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+

Notice how this inner loop (1) is just iterating over a single token in this case. The second loop (2) will then start from the row after the last real token and fill the remaining rows, up to GGML_PAD(n_tokens, GGML_KQ_MASK_PAD) which is 32 in this case:

      0    1    2    3    4    5    6    7    8    9   10    11   12   13   14   15           31
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+----+
0   |-inf|-inf|-inf|-inf|-inf|-inf|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+
    |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+

And this will continue until all 32 rows of the mask have been filled:

      0    1    2    3    4    5    6    7    8    9   10    11   12   13   14   15           31
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-------+----+
0   |-inf|-inf|-inf|-inf|-inf|-inf|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|0.0f|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+
    |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+
    ...
31  |-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf|-inf| ...   |-inf|
    +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+------------+

So now that we know what the mask looks like, we can revisit how it is used, and for that we have to turn back to llm_build_kqv:

        kq = ggml_soft_max_ext(ctx, kq, kq_mask, kq_scale, hparams.f_max_alibi_bias);
(gdb) p q->ne
$2 = {128,  1, 32, 1}

(gdb) p k->ne
$1 = {128, 32, 32, 1}

(gdb) p kq_mask->ne
$6 = {32, 32, 1, 1}

(lldb) p kq->ne
(int64_t[4])  ([0] = 32, [1] = 1, [2] = 32, [3] = 1)
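
A quick sanity check on these shapes, assuming the usual ggml_mul_mat convention that the result's first dimension comes from k->ne[1] and its second from q->ne[1]:

k:  {128, 32, 32, 1}   (head_dim, n_kv,     n_head, 1)
q:  {128,  1, 32, 1}   (head_dim, n_tokens, n_head, 1)
kq: { 32,  1, 32, 1}   one row of 32 attention logits per head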

So as we can expect, the result of the Q and K multiplication has one row of 32 logits per head (one logit per cached position considered, n_kv = 32), and recall that this is per layer. So what this is doing is calculating the softmax of the logits in kq which, like we said, contains the dot products of the current token with all the cached key values. In this case the first 6 entries in the key cache belong to sequence 0, and the entries at indices 6-13 belong to sequence 1, which the current token belongs to:

      kq                                 kq_mask
z0                0     1    2    3    4    5   6     7    8    9    10  11    12   13   14  ... 31]
   [0 ... 31]   [-inf -inf -inf -inf -inf -inf  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  -inf ...-inf]
                [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf  -inf ...-inf]
                ...
             31 [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf  -inf ...-inf]

...

z31
   [0 ... 31] 0 [-inf -inf -inf -inf -inf -inf  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  -inf ...-inf]
                [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf  -inf ...-inf]
                ...
             31 [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf  -inf ...-inf]

So this will include only the logits that belong to the current token's sequence. To get a feel for how this works there is a standalone example in llama-att-softmax.c.

has_shift

This is a boolean field in the llama_kv_cache struct. This is closely related to the self-extend attention mechanism and details can be found in Self-Extend Attention.

do_defrag

This is a boolean field in the llama_kv_cache struct. There is a function exposed via llama.h which can trigger a defrag operation:

void llama_kv_cache_defrag(struct llama_context * ctx) {
    llama_kv_cache_defrag(ctx->kv_self);
}

static void llama_kv_cache_defrag(struct llama_kv_cache & cache) {
    if (!cache.recurrent) {
        cache.do_defrag = true;
    }
}

So this will set the do_defrag field to true and upon the next decode operation (in llama_decode_internal) this field will be checked by the llama_kv_cache_update function:

        // non-causal masks do not use the KV cache
        if (hparams.causal_attn) {
            llama_kv_cache_update(&lctx);
And llama_kv_cache_update calls llama_kv_cache_update_internal:

static void llama_kv_cache_update_internal(struct llama_context & lctx) {
    bool need_reserve = false;

    ...

    // defragment the KV cache if needed
    if (lctx.kv_self.do_defrag) {
        llama_kv_cache_defrag_internal(lctx);

        need_reserve = true;

        lctx.kv_self.do_defrag = false;
    }
}

Recurrent models and the cache

Recurrent models like Mamba or RWKV don't have a kv-cache in the same way that non-recurrent models do. In a non-recurrent model the kv-cache stores an entry per token, but for a recurrent model the kv-cache stores an entry per sequence. We have seen an example of this earlier where we had a non-recurrent model which used two sequences: that stored an entry for each token in each sequence as they were decoded. For a recurrent model only two entries would be stored in the cache, one for each sequence.

For this section I'm going to use a batch with two sequences, and the reason for this is that I initially started with a single sequence but this made it somewhat more difficult to understand some parts of the code. So I'll be using the simple-prompt-multi example here.

First in llama_new_context_with_model we have the following which is setting the size of the kv-cache to the max number of sequences:

    // Mamba only needs a constant number of KV cache cells per sequence
    if (llama_model_is_recurrent(model)) {
        // Mamba needs at least as many KV cells as there are sequences kept at any time
        kv_size = std::max((uint32_t) 1, params.n_seq_max);
        // it's probably best to keep as much precision as possible for the states
        type_k = GGML_TYPE_F32; // required by ggml_ssm_conv for Mamba's conv_states
        type_v = GGML_TYPE_F32; // required by ggml_ssm_scan for Mamba's ssm_states
    }

So kv_size will be 2 in this case as we have two sequences:

(gdb) p params.n_seq_max
$21 = 2
(gdb) p kv_size
$22 = 2

The llama_batch that is passed to llama_decode looks like this:

$5 = (const llama_batch &) @0x7fffffffd538: {n_tokens = 9, token = 0x555555a48f10, embd = 0x0, pos = 0x555555a8a9b0, 
  n_seq_id = 0x555555a32c40, seq_id = 0x555555a33450, logits = 0x555555b071a0 ""}

Notice that we have 9 tokens in this batch and 2 sequences.

Lets now look at llama_kv_cache_init, which is pretty much the same as we discussed before, but this time recurrent will be true (this is the part we skipped earlier):

(gdb) p cache.recurrent
$11 = true

Next lets turn our attention to llama_decode_internal:

static int llama_decode_internal(
         llama_context & lctx,
           llama_batch   inp_batch) {
    ...
    // temporary allocate memory for the input batch if needed
    llama_batch_allocr batch_allocr(lctx, inp_batch);
    const llama_batch & batch = batch_allocr.batch;
    const uint32_t n_tokens_all = batch.n_tokens;
(lldb) expr n_tokens_all
(const uint32_t) $0 = 9

There will be a check to make sure that all the tokens in the batch are in the model's vocabulary.

Then we will set/update/initialize the sbatch member of llama_context, and notice that kv_self.recurrent is used to set the simple_split parameter:

    lctx.sbatch.from_batch(batch, n_embd,
        /* simple_split */ !kv_self.recurrent,
        /* logits_all   */ n_outputs == n_tokens_all);

So in this case it is not a simple_split, which has been the case before when we went through this. I actually think I've gone through this as well so I'll link to it later. But lets take a look at the sbatch to get an overview:

(lldb) p lctx.sbatch
(llama_sbatch) {
  n_tokens = 9
  n_embd = 2048
  logits_all = false
  ids = size=9 {
    [0] = 0
    [1] = 1
    [2] = 2
    [3] = 3
    [4] = 4
    [5] = 5
    [6] = 6
    [7] = 7
    [8] = 8
  }
  out_ids = size=0 {}
  seq = size=2 {
    [0] = {
      n_seq_id = 1
      seq_id = 0x00006000015c5170
      offset = 0
      length = 5
    }
    [1] = {
      n_seq_id = 1
      seq_id = 0x00006000015c65f0
      offset = 5
      length = 4
    }
  }
  batch = 0x000000016fdfe630
  ubatch_token = size=0 {}
  ubatch_embd = size=0 {}
  ubatch_pos = size=0 {}
  ubatch_n_seq_id = size=0 {}
  ubatch_seq_id = size=0 {}
  ubatch_output = size=0 {}
}

And then there will be a while loop over the n_tokens in the sbatch:

    while (lctx.sbatch.n_tokens > 0) {
        llama_ubatch ubatch;
        if (kv_self.recurrent) {
            if (embd_pooled) {
                // Pooled embeddings cannot be split across ubatches (yet)
                ubatch = lctx.sbatch.split_seq(n_ubatch);
            } else {
                // recurrent model architectures are easier to implement
                // with equal-length sequences
                ubatch = lctx.sbatch.split_equal(n_ubatch);
            }
        } else {
            ubatch = lctx.sbatch.split_simple(n_ubatch);
        }
        const uint32_t n_tokens = ubatch.n_tokens;

In our case this will call lctx.sbatch.split_equal. Now it is good to keep in mind that recurrent models differ from causal models in that they process tokens in sequence, whereas a causal model can process all tokens in one go (using a causal mask):

  SSM (Mamba)                  Transformer
 +--+    +--+    +--+         +--------------------+
 |t1| -> |t2| -> |t3| ...     | t1   t2  t3 ...    |
 +--+    +--+    +--+         +--------------------+  

After this the returned ubatch will look like this:

(llama_ubatch) {
  equal_seqs = true
  n_tokens = 8
  n_seq_tokens = 4
  n_seqs = 2
  token = 0x00006000019c3180
  embd = 0x0000000000000000
  pos = 0x00006000019c3150
  n_seq_id = 0x00006000019c31e0
  seq_id = 0x00006000034dbf20
  output = 0x00006000015d0000
}

So this has split the batch into a ubatch with two equal-length sequences (note that we have 9 tokens in total, so there is one token left over which I'll return to later). We can see that this ubatch has two sequences and that each sequence is of equal length, 4 tokens each.

(lldb) p ubatch.token[0]
(llama_token) 15961
(lldb) p ubatch.token[1]
(llama_token) 14528
(lldb) p ubatch.token[2]
(llama_token) 6749
(lldb) p ubatch.token[3]
(llama_token) 10777
(lldb) p ubatch.token[4]
(llama_token) 1276
(lldb) p ubatch.token[5]
(llama_token) 310
(lldb) p ubatch.token[6]
(llama_token) 9497
(lldb) p ubatch.token[7]
(llama_token) 5214

(lldb) expr lctx.model.vocab.id_to_token[15961].text
(const llama_vocab::token) $2 = "Dan"
(lldb) expr lctx.model.vocab.id_to_token[14528].text
(const llama_vocab::token) $3 = "Ġloves"
(lldb) expr lctx.model.vocab.id_to_token[6749].text
(const llama_vocab::token) $4 = "Ġice"
(lldb) expr lctx.model.vocab.id_to_token[10777].text
(const llama_vocab::token) $5 = "Ġcream"

So now lets look at llama_kv_cache_find_slot and the recurrent block that we skipped earlier:

static bool llama_kv_cache_find_slot(
           struct llama_kv_cache & cache,
       const struct llama_ubatch & batch) {
    const uint32_t n_tokens = batch.n_tokens;         // 8
    const uint32_t n_seqs   = batch.n_seqs;           // 2
    const uint32_t n_seq_tokens = batch.n_seq_tokens; // 4
    ...
    if (cache.recurrent) {
        // can only process batches with an equal number of new tokens in each sequence
        GGML_ASSERT(batch.equal_seqs);

        int32_t min = cache.size - 1;   // cache.size = 2, so min = 1
        int32_t max = 0;

        // everything should fit if all seq_ids are smaller than the max
        for (uint32_t s = 0; s < n_seqs; ++s) {
            const uint32_t n_seq_id = batch.n_seq_id[s];
            for (uint32_t j = 0; j < n_seq_id; ++j) {
                const llama_seq_id seq_id = batch.seq_id[s][j];

                if (seq_id < 0 || (uint32_t) seq_id >= cache.size) {
                    // too big seq_id
                    // TODO: would it be possible to resize the cache instead?
                    LLAMA_LOG_ERROR("%s: seq_id=%d >= n_seq_max=%d Try using a bigger --parallel value\n", __func__, seq_id, cache.size);
                    return false;
                }
                if (j > 0) {
                    llama_kv_cell & seq = cache.cells[seq_id];
                    if (seq.tail >= 0) {
                        llama_kv_cell & cell = cache.cells[seq.tail];
                        // clear cells from seq_ids that become shared
                        // (should not normally happen, but let's handle it anyway)
                        cell.seq_id.erase(seq_id);
                        seq.tail = -1;
                        if (cell.seq_id.empty()) {
                            cell.pos = -1;
                            cell.src = -1;
                            cache.used -= 1;
                        }
                    }
                }
            }
        }
    }

This is what the cache cells look like before any updates are made:

(lldb) p cache.cells
(std::vector<llama_kv_cell>) size=2 {
  [0] = {
    pos = -1
    delta = 0
    src = -1
    tail = -1
    seq_id = size=0 {}
  }
  [1] = {
    pos = -1
    delta = 0
    src = -1
    tail = -1
    seq_id = size=0 {}
  }
}

And this is what it looks like afterwards:

(lldb) p cache.cells
(std::vector<llama_kv_cell>) size=2 {
  [0] = {
    pos = 8
    delta = 0
    src = -1
    tail = 1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [1] = {
    pos = 3
    delta = 0
    src = -1
    tail = 0
    seq_id = size=1 {
      [0] = 0
    }
  }
}

So for cell 0, position 8 is the last token of that sequence that will be processed. We can verify this by taking the sequence id that this cell has, 1, and looking it up in the batch:

(lldb) expr lctx.sbatch.batch->seq_id[8][0]
(llama_seq_id) $18 = 1

For the second entry notice that pos is 3, as that is the last token processed for that sequence in this ubatch; that sequence still has one more token left to process.

Now, tail for cell 0 is set to 1 and for cell 1 it is set to 0. So this is a circular kind of reference here.

I still don't understand what tail is, so lets try to add a call to llama_kv_cache_seq_rm to perhaps help understand this:

    llama_kv_cache_seq_rm(ctx, 0, 0, 2);
static bool llama_kv_cache_seq_rm(
        struct llama_kv_cache & cache,
                 llama_seq_id   seq_id,
                    llama_pos   p0,
                    llama_pos   p1) {
    ...
    if (cache.recurrent) {
        ...
            int32_t & tail_id = cache.cells[seq_id].tail;
            if (tail_id >= 0) {
                const llama_kv_cell & cell = cache.cells[tail_id];

So the seq_id that we passed in, which is 0, is looked up in the cache cells, and a reference to that cell's tail is taken as tail_id:

(lldb) p cache.cells[0].tail
(int32_t) 1

And this is greater than or equal to 0, so this will then look up the cell that the tail points to:

(lldb) p cell
(const llama_kv_cell &) 0x000060000198d2e8: {
  pos = 4
  delta = 0
  src = 1
  tail = 0
  seq_id = size=1 {
    [0] = 0
  }
}

Next we check the range:

                if ((0 < p0 && p0 <= cell.pos) || (0 < p1 && p1 <= cell.pos)) {
                    return false;
                }
                // invalidate tails which will be cleared
                if (p0 <= cell.pos && cell.pos < p1) {
                    tail_id = -1;
                }

So this is checking that the range covers the whole state: if p0 or p1 is positive and less than or equal to the cell's position the removal would only be partial, which is not supported for recurrent states, and false is returned. Otherwise, if the cell's position falls within [p0, p1) the tail is invalidated.

(lldb) p batch.seq_id[0][0]
(llama_seq_id) 0
(lldb) p batch.seq_id[1][0]
(llama_seq_id) 0
(lldb) p batch.seq_id[2][0]
(llama_seq_id) 0
(lldb) p batch.seq_id[3][0]
(llama_seq_id) 0
(lldb) p batch.seq_id[4][0]
(llama_seq_id) 0
(lldb) p batch.seq_id[5][0]
(llama_seq_id) 1
(lldb) p batch.seq_id[6][0]
(llama_seq_id) 1
(lldb) p batch.seq_id[7][0]
(llama_seq_id) 1
(lldb) p batch.seq_id[8][0]
(llama_seq_id) 1
(lldb) p batch.seq_id[9][0]
(llama_seq_id) 0

So we can specify that we would like to remove positions [0, 5), that is 0-4, for sequence 0 using:

    llama_kv_cache_seq_rm(ctx, 0, 0, 5);
(lldb) p cache.cells
(std::vector<llama_kv_cell>) size=2 {
  [0] = {
    pos = 8
    delta = 0
    src = 0
    tail = 1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [1] = {
    pos = 4
    delta = 0
    src = 1
    tail = 0
    seq_id = size=1 {
      [0] = 0
    }
  }
}

Now we can see that the first entry in the cache is not the first sequence in our batch, but that does not matter, as the sequence id we pass in will look up the cell at the index of the sequence id but use its tail.

            const int32_t tail_id = cache.cells[seq_id].tail;

And this will be the index of the cell that contains the sequence:

(const llama_kv_cell &) 0x000060000214dce8: {
  pos = 4
  delta = 0
  src = 1
  tail = 0
  seq_id = size=1 {
    [0] = 0
  }
}

And what will happen is that tail_id, which is a reference to the tail of the cell we looked up, will be set to -1.

If we look again at llama_kv_cache_find_slot we have the following code:

        for (uint32_t s = 0; s < n_seqs; ++s) {
            const llama_seq_id seq_id = batch.seq_id[s][0];
            llama_kv_cell & seq_meta = cache.cells[seq_id];

This will iterate over the number of sequences in the ubatch, which is 2 in this case. So the sequence id for s = 0 will be looked up in the ubatch, and this will be 1 as that is how the ubatch was created. With that we look up the cell for that sequence.

I think that using the tail is a way to enable things like speculative decoding and allows for not having to copy the state. For example, lets say we have decoded a sequence and our cache looks like this:

(lldb) p kv_self.cells
(std::vector<llama_kv_cell>) size=2 {
  [0] = {
    pos = 8
    delta = 0
    src = -1
    tail = 1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [1] = {
    pos = 3
    delta = 0
    src = -1
    tail = 0
    seq_id = size=1 {
      [0] = 0
    }
  }
}

If we wanted to speculatively decode the next token for sequence 0 we could use the tail:

  [2] = {
    pos = 8
    delta = 0
    src = -1
    tail = 1
    seq_id = size=1 {
      [0] = 2
    }
  }

I'm still not 100% sure about this and I need to revisit it later.

wip

        // find usable cell range
        for (uint32_t s = 0; s < n_seqs; ++s) {
            const llama_seq_id seq_id = batch.seq_id[s][0];
            llama_kv_cell & seq_meta = cache.cells[seq_id];
            bool has_cell = false;
            if (seq_meta.tail >= 0) {
                llama_kv_cell & cell = cache.cells[seq_meta.tail];
                GGML_ASSERT(cell.has_seq_id(seq_id));
                // does this seq_id "own" the cell?
                if (cell.seq_id.size() == 1) { has_cell = true; }
            }
            if (!has_cell) {
                llama_kv_cell & empty_cell = cache.cells[next_empty_cell];
                GGML_ASSERT(empty_cell.is_empty());
                // copy old tail into the empty cell
                if (seq_meta.tail >= 0) {
                    llama_kv_cell & orig_cell = cache.cells[seq_meta.tail];
                    empty_cell.pos = orig_cell.pos;
                    empty_cell.src = orig_cell.src;
                    orig_cell.seq_id.erase(seq_id);
                    empty_cell.seq_id.insert(seq_id); // will be overwritten
                }
(1)  ----->     seq_meta.tail = next_empty_cell;
                // find next empty cell
                if (s + 1 < n_seqs) {
                    next_empty_cell += 1;
                    for (uint32_t i = 0; i < cache.size; ++i) {
                        if (next_empty_cell >= cache.size) { next_empty_cell -= cache.size; }
                        llama_kv_cell & cell = cache.cells[next_empty_cell];
                        if (cell.is_empty()) { break; }
                        next_empty_cell += 1;
                    }
                }
            }
            if (min > seq_meta.tail) { min = seq_meta.tail; }
            if (max < seq_meta.tail) { max = seq_meta.tail; }
        }

At (1) this is what seq_meta looks like:

(gdb) p seq_meta
{pos = -1, delta = 0, src = -1, tail = -1, seq_id = std::set with 0 elements}
(gdb) p next_empty_cell
$40 = 0

So this will set its tail to 0.

Now, I've found myself thinking that this should be adding 5 entries to the cache because that is what would happen in the non-recurrent case. But I need to forget that: for recurrent models there will only be one entry per sequence. So the cache will look like this initially:

(gdb) p cache.cells
$71 = std::vector of length 1, capacity 1 = {{pos = -1, delta = 0, src = -1, tail = -1, seq_id = std::set with 0 elements}}

And our batch looks like this:

(gdb) p ubatch
$69 = {equal_seqs = true, n_tokens = 5, n_seq_tokens = 5, n_seqs = 1, token = 0x555555b0a1e0, embd = 0x0,
  pos = 0x555555b0a200, n_seq_id = 0x555555b0a220, seq_id = 0x555555a15780, output = 0x555555b0a240 ""}

And the cache will end up looking like this:

(gdb) p cache.cells
$72 = std::vector of length 1, capacity 1 = {{pos = 4, delta = 0, src = -1, tail = 0, seq_id = std::set with 1 element = {
      [0] = 0}}}

I still don't understand what tail is for.

Connection/link between cells and cache tensors

This might seem obvious but this was not clear to me initially which is why I'm adding this section. The cells member of llama_kv_cache is a vector of llama_kv_cell:

struct llama_kv_cache {
    ...
    std::vector<llama_kv_cell> cells;
    ...
}

struct llama_kv_cell {
    llama_pos pos   = -1;
    llama_pos delta = 0;
    int32_t   src   = -1; // used by recurrent state models to copy states
    int32_t   tail  = -1;

    std::set<llama_seq_id> seq_id;
    ...
};

As tokens are decoded they are added one by one to the cells vector. In the previous example, when we had two prompts with different sequences 0 and 1, they populated the first 13 positions in the cells vector. As we saw in the previous section, llama_set_inputs iterates through the cells and checks if the current token's sequence id is in that cell.

If we inspect the cache we can see that the first 6 cells belong to sequence 0 and the next 7 belong to sequence 1:

(lldb) p ctx->kv_self->cells
(std::vector<llama_kv_cell>) size=1024 {
  [0] = {
    pos = 0
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 0
    }
  }
  [1] = {
    pos = 1
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 0
    }
  }
  [2] = {
    pos = 2
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 0
    }
  }
  [3] = {
    pos = 3
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 0
    }
  }
  [4] = {
    pos = 4
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 0
    }
  }
  [5] = {
    pos = 5
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 0
    }
  }
  [6] = {
    pos = 6
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [7] = {
    pos = 7
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [8] = {
    pos = 8
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [9] = {
    pos = 9
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [10] = {
    pos = 10
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [11] = {
    pos = 11
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [12] = {
    pos = 12
    delta = 0
    src = -1
    tail = -1
    seq_id = size=1 {
      [0] = 1
    }
  }
  [13] = {
    pos = -1
    delta = 0
    src = -1
    tail = -1
    seq_id = size=0 {}
  }
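
As a minimal standalone sketch of that membership check (this mirrors the fields above but is not llama.cpp code), the mask decision for each cell boils down to:

#include <cstdio>
#include <set>
#include <vector>

// Minimal stand-in for llama_kv_cell: a position plus the set of sequence ids
// that the cached K/V entry at this index belongs to.
struct cell {
    int pos = -1;
    std::set<int> seq_id;
    bool has_seq_id(int id) const { return seq_id.count(id) > 0; }
};

int main() {
    std::vector<cell> cells(8);
    for (int i = 0; i < 6; i++) { cells[i].pos = i; cells[i].seq_id.insert(0); } // sequence 0
    for (int i = 6; i < 8; i++) { cells[i].pos = i; cells[i].seq_id.insert(1); } // sequence 1

    const int seq_id = 1, pos = 7; // the token currently being decoded
    for (int i = 0; i < (int) cells.size(); i++) {
        // same check as in llama_set_inputs: mask out other sequences and future positions
        bool masked = !cells[i].has_seq_id(seq_id) || cells[i].pos > pos;
        std::printf("cell %d: %s\n", i, masked ? "-inf" : "0.0");
    }
    return 0;
}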

Size of the cache

We can calculate the size of the cache using the following formula.

So we have two caches, one for the keys and one for the values (the type_k/type_v factors below stand for the per-element size in bytes, 2 for F16):

K cache size = ctx.cparams.n_ctx *
               ctx.model.hparams.n_layer *
               ctx.model.hparams.n_head_kv(0) *
               ctx.model.hparams.n_embd_head_k *
               ctx.kv_self.type_k

V cache size = ctx.cparams.n_ctx *
               ctx.model.hparams.n_layer *
               ctx.model.hparams.n_head_kv(0) *
               ctx.model.hparams.n_embd_head_v *
               ctx.kv_self.type_v

Lets take a concrete example:

(gdb) p ctx.cparams.n_ctx
$11 = 512
(gdb) p ctx.model.hparams.n_layer
$12 = 32
(gdb) p ctx.model.hparams.n_head_kv(0)
$13 = 32
(gdb) p ctx.model.hparams.n_embd_head_k
$14 = 128
(gdb) p ctx.model.hparams.n_embd_head_v
$17 = 128
(gdb) p ctx.kv_self.type_k
$15 = GGML_TYPE_F16
(gdb) p ctx.kv_self.type_v
$16 = GGML_TYPE_F16

This can also be written as a single value, doubling the key size, when the parameters are the same for both the keys and the values:

kv-cache size = 2 *                     // both keys and values
                ctx.cparams.n_ctx *
                ctx.model.hparams.n_layer *
                ctx.model.hparams.n_head_kv(0) *
                ctx.model.hparams.n_embd_head_k *
                ctx.kv_self.type_k

kv-cache size = 2 * 512 * 32 * 32 * 128 * 2 = 268435456 bytes = 256 MB

And for other models:

kv-cache size = 2 *                     // both keys and values
                ctx.cparams.n_ctx *
                ctx.model.hparams.n_layer *
                ctx.model.hparams.n_head_kv(0) *
                ctx.model.hparams.n_embd_head_k *
                ctx.kv_self.type_k

kv-cache size = 2 * 30016 * 32 * 8 * 128 * 2 bytes
        = 2 * 30016 * 32 * 8 * 128 * 2
        = 3934257152
        = 3934257152 / (1024*1024)
        = 3752 MB
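
To double-check these numbers, here is a small standalone sketch of the same formula (the function and parameter names are mine, not llama.cpp's):

#include <cstdint>
#include <cstdio>

// kv-cache size in bytes, assuming the key and value head sizes and types match.
static uint64_t kv_cache_bytes(uint64_t n_ctx, uint64_t n_layer, uint64_t n_head_kv,
                               uint64_t n_embd_head_k, uint64_t type_size) {
    return 2 * n_ctx * n_layer * n_head_kv * n_embd_head_k * type_size;
}

int main() {
    // first example: n_ctx=512, 32 layers, 32 KV heads, 128 dims per head, F16 (2 bytes)
    uint64_t a = kv_cache_bytes(512, 32, 32, 128, 2);
    // second example: n_ctx=30016, 32 layers, 8 KV heads (GQA), 128 dims per head, F16
    uint64_t b = kv_cache_bytes(30016, 32, 8, 128, 2);
    std::printf("example 1: %llu bytes = %.0f MB\n", (unsigned long long) a, a / (1024.0 * 1024.0));
    std::printf("example 2: %llu bytes = %.0f MB\n", (unsigned long long) b, b / (1024.0 * 1024.0));
    return 0;
}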

wip