
Dev Notes

These are general free-form notes with pointers to good jumping-off points for understanding the llama.cpp codebase.

(@<symbol> is a VS Code jump-to-symbol reference for your convenience. There is also a feature request open for VS Code to support jumping to a file and symbol via <file>:@<symbol>.)

Where are the definitions for GGUF in llama.cpp?

GGUF file structure spec (WARN: As of 2024-06-11 the llama.cpp implementation is the canonical source for now)

All of the GGUF structures can be found in gguf.c unless stated otherwise.

| GGUF Structure Of Interest | gguf.c reference | vscode search line |
| --- | --- | --- |
| Overall File Structure | `struct gguf_context` | `@gguf_context` |
| File Header Structure | `struct gguf_header` | `@gguf_header` |
| Key Value Structure | `struct gguf_kv` | `@gguf_kv` |
| Tensor Info Structure | `struct gguf_tensor_info` | `@gguf_tensor_info` |

Elements of Interest (think of this as an index lookup reference)

Please use this as an index, not as a canonical reference. The purpose of this table is to let you quickly locate the major elements of the GGUF file standard.

| Header Name | GGUF Elements Of Interest | c name | c type | vscode search line |
| --- | --- | --- | --- | --- |
| GGUF Header | Magic | `magic` | `uint8_t[4]` | gguf.c:@gguf_header |
| GGUF Header | Version | `version` | `uint32_t` | gguf.c:@gguf_header |
| GGUF Header | Tensor Count | `n_tensors` | `uint64_t` | gguf.c:@gguf_header |
| GGUF Header | Key Value Count | `n_kv` | `uint64_t` | gguf.c:@gguf_header |
| GGUF Context | Key Value Linked List | `kv` | `gguf_kv *` | gguf.c:@gguf_context |
| GGUF Context | Tensor Info Linked List | `infos` | `gguf_tensor_info *` | gguf.c:@gguf_context |
| Key Value Entry | Key | `gguf_kv.key` | `gguf_str` | gguf.c:@gguf_kv |
| Key Value Entry | Type | `gguf_kv.type` | `gguf_type` | gguf.c:@gguf_kv |
| Key Value Entry | Value | `gguf_kv.value` | `gguf_value` | gguf.c:@gguf_kv |
| Tensor Info Entry | Name | `gguf_tensor_info.name` | `gguf_str` | gguf.c:@gguf_tensor_info |
| Tensor Info Entry | Tensor shape dimension count | `gguf_tensor_info.n_dims` | `uint32_t` | gguf.c:@gguf_tensor_info |
| Tensor Info Entry | Tensor shape sizing array | `gguf_tensor_info.ne` | `uint64_t[GGML_MAX_DIMS]` | gguf.c:@gguf_tensor_info |
| Tensor Info Entry | Tensor Encoding Scheme / Strategy | `gguf_tensor_info.type` | `ggml_type` | gguf.c:@gguf_tensor_info |
| Tensor Info Entry | Offset from start of 'data' | `gguf_tensor_info.offset` | `uint64_t` | gguf.c:@gguf_tensor_info |
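
Pieced together from the table above, the serialized structures look roughly like this. This is a simplified sketch for orientation, not a verbatim copy of gguf.c; field order and exact types (e.g. gguf_str, gguf_value) should be checked against the real source.

```c
// Simplified sketch of the GGUF structures indexed above (illustrative only;
// see gguf.c for the authoritative definitions).
struct gguf_header {
    uint8_t  magic[4];    // "GGUF"
    uint32_t version;     // format version
    uint64_t n_tensors;   // tensor count
    uint64_t n_kv;        // key/value count
};

struct gguf_kv {
    struct gguf_str   key;    // length-prefixed key string
    enum   gguf_type  type;   // selects which member of the value union is valid
    union  gguf_value value;  // scalar / string / array payload
};

struct gguf_tensor_info {
    struct gguf_str name;        // tensor name
    uint32_t n_dims;             // tensor shape dimension count
    uint64_t ne[GGML_MAX_DIMS];  // tensor shape sizing array
    enum ggml_type type;         // tensor encoding scheme / strategy
    uint64_t offset;             // offset from start of 'data'
    // ...plus the internal bookkeeping fields listed in the next table
};
```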

Also note that the values below are not actually part of the GGUF file format, but are there for internal usage and are calculated during model loading. In other words, they are for the writing/reading API.

| Header Name | GGML Elements Of Interest | c name | c type | vscode search line |
| --- | --- | --- | --- | --- |
| GGUF Context | Alignment | `alignment` | `size_t` | gguf.c:@gguf_context |
| GGUF Context | Offset Of 'Data' From Beginning Of File | `offset` | `size_t` | gguf.c:@gguf_context |
| GGUF Context | Size Of 'Data' In Bytes | `size` | `size_t` | gguf.c:@gguf_context |
| Tensor Info Entry | Tensor memory mapped pointer location in computer | `data` | `void *` | gguf.c:@gguf_tensor_info |
| Tensor Info Entry | Tensor memory mapped size of layer data in computer | `size` | `size_t` | gguf.c:@gguf_tensor_info |
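
gguf_context is what ties all of the above together. Again a rough sketch (not the authoritative definition), with the internal, non-serialized fields from the table above included:

```c
// Rough sketch of gguf_context (illustrative only; see gguf.c for the real definition).
struct gguf_context {
    struct gguf_header header;

    struct gguf_kv          * kv;     // key/value entries
    struct gguf_tensor_info * infos;  // tensor info entries

    size_t alignment;  // tensor data alignment (GGUF_DEFAULT_ALIGNMENT is 32 bytes)
    size_t offset;     // offset of 'data' from beginning of file
    size_t size;       // size of 'data' in bytes

    void * data;       // pointer to the start of the (possibly memory mapped) tensor data
};

// gguf_tensor_info additionally carries per-tensor bookkeeping:
//   const void * data;  // where this tensor's data lives in memory
//   size_t       size;  // how many bytes that data occupies
```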

Is there a simple example of gguf being used?

There is a C++ example program (examples/gguf/gguf.cpp in the repository) that performs a test GGUF write and read.
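
For a quick taste of the public API without digging into that example, something along these lines works. This is a minimal sketch (error handling omitted), loosely following the example; the gguf_* functions used are the ones declared in ggml.h.

```c
// Minimal sketch: write a GGUF file with some metadata, then read it back.
#include "ggml.h"
#include <stdio.h>

int main(void) {
    // write: create an empty context, add some key/value metadata, save it
    struct gguf_context * ctx_out = gguf_init_empty();
    gguf_set_val_u32(ctx_out, "example.answer", 42);
    gguf_set_val_str(ctx_out, "example.name",   "hello world");
    // tensors would be added here via gguf_add_tensor(ctx_out, tensor)
    gguf_write_to_file(ctx_out, "test.gguf", /*only_meta =*/ true);
    gguf_free(ctx_out);

    // read: load the file back and inspect the header-level counts
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx_in = gguf_init_from_file("test.gguf", params);
    printf("version: %d, n_kv: %d, n_tensors: %d\n",
           gguf_get_version(ctx_in),
           (int) gguf_get_n_kv(ctx_in),
           (int) gguf_get_n_tensors(ctx_in));
    gguf_free(ctx_in);
    return 0;
}
```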

If we don't store the size of tensor array elements etc. in GGUF, where is that stored?

In ggml.c, refer to static const ggml_type_traits_t type_traits[GGML_TYPE_COUNT], which is a lookup table containing enough information to deduce the size of a tensor layer in bytes, given an offset and element dimension count.

One good example is shown below (but annotated for clarity):

```c
static const ggml_type_traits_t type_traits[GGML_TYPE_COUNT] = {
    ...
    [GGML_TYPE_F16] = {
        // General Specs About This Tensor Encoding Scheme
        .type_name            = "f16",
        .blck_size            = 1,
        .type_size            = sizeof(ggml_fp16_t),
        .is_quantized         = false,
        // C function methods for interpreting the blocks
        .to_float             = (ggml_to_float_t) ggml_fp16_to_fp32_row,
        .from_float           = (ggml_from_float_t) ggml_fp32_to_fp16_row,
        .from_float_reference = (ggml_from_float_t) ggml_fp32_to_fp16_row,
        // C function methods plus extra specs required for dot product handling
        .vec_dot              = (ggml_vec_dot_t) ggml_vec_dot_f16,
        .vec_dot_type         = GGML_TYPE_F16,
        .nrows                = 1,
    },
    ...
}
```

So basically these traits are used in various places to let developers reason about a tensor's encoding spec and sizing, as you can see with the getter methods below. (Note: the other functions within ggml.c that use these values directly were not traced fully; the few in this graph are just for illustrative purposes.)

```mermaid
graph LR;
    type_traits{"type_traits[]\n Lookup Table"}
    type_traits-->type_name
    type_traits-->blck_size
    type_traits-->type_size
    type_traits-->is_quantized
    %%type_traits-->to_float
    %%type_traits-->from_float
    %%type_traits-->from_float_reference
    %%type_traits-->vec_dot
    %%type_traits-->vec_dot_type
    %%type_traits-->nrows
    subgraph getter functions / methods
        ggml_type_name(["ggml_type_name()"])
        ggml_blck_size(["ggml_blck_size()"])
        ggml_type_size(["ggml_type_size()"])
        ggml_is_quantized(["ggml_is_quantized()"])
    end
    type_name --> ggml_type_name(["ggml_type_name()"])
    blck_size --> ggml_blck_size(["ggml_blck_size()"])
    type_size --> ggml_type_size(["ggml_type_size()"])
    is_quantized --> ggml_is_quantized(["ggml_is_quantized()"])
    blck_size --> ggml_type_sizef(["ggml_type_sizef()"])
    blck_size --> ggml_quantize_chunk(["ggml_quantize_chunk()"])
```
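
Those getters are what lets the rest of the codebase work out how many bytes a run of tensor elements occupies. Below is a rough sketch of that calculation (an illustration of the idea, not code from ggml.c; ggml itself provides an equivalent helper, ggml_row_size()).

```c
#include "ggml.h"
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Rough sketch: bytes needed for one row of ne elements of a given type.
// Elements are stored in blocks of blck_size elements, each block taking
// type_size bytes (for f16: blck_size = 1, type_size = 2; quantized types
// pack e.g. 32 elements into one block).
static size_t example_row_size(enum ggml_type type, int64_t ne) {
    const int64_t blck_size = ggml_blck_size(type);  // elements per block
    const size_t  type_size = ggml_type_size(type);  // bytes per block
    assert(ne % blck_size == 0);                      // rows are block aligned
    return type_size * (size_t)(ne / blck_size);
}
```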

This is how the LUT is used to convert a tensor data area to/from float for processing. (However, if I understand correctly, these methods are not used on the GPU, as those data areas are processed directly using GPU-specific instruction code. This is also why the tensor elements have to be packed in a certain way.)

The analysis below only covers connections within ggml.c.

```mermaid
graph LR;
    type_traits{"type_traits[]\n Lookup Table"}
    %%type_traits-->type_name
    %%type_traits-->blck_size
    %%type_traits-->type_size
    %%type_traits-->is_quantized
    type_traits-->to_float
    type_traits-->from_float
    type_traits-->from_float_reference
    %%type_traits-->vec_dot
    %%type_traits-->vec_dot_type
    %%type_traits-->nrows
    ggml_compute_forward_add_q_f32(["ggml_compute_forward_add_q_f32()"])
    to_float --> ggml_compute_forward_add_q_f32
    ggml_compute_forward_out_prod_q_f32(["ggml_compute_forward_out_prod_q_f32()"])
    to_float --> ggml_compute_forward_out_prod_q_f32
    ggml_compute_forward_get_rows_q(["ggml_compute_forward_get_rows_q()"])
    to_float --> ggml_compute_forward_get_rows_q
    ggml_compute_forward_flash_attn_ext_f16(["ggml_compute_forward_flash_attn_ext_f16()"])
    to_float --> ggml_compute_forward_flash_attn_ext_f16
    ggml_compute_forward_dup_f16(["ggml_compute_forward_dup_f16()"])
    from_float --> ggml_compute_forward_dup_f16
    ggml_compute_forward_dup_bf16(["ggml_compute_forward_dup_bf16()"])
    from_float --> ggml_compute_forward_dup_bf16
    ggml_compute_forward_dup_f32(["ggml_compute_forward_dup_f32()"])
    from_float --> ggml_compute_forward_dup_f32
    ggml_compute_forward_add_q_f32(["ggml_compute_forward_add_q_f32()"])
    from_float --> ggml_compute_forward_add_q_f32
    ggml_compute_forward_mul_mat(["ggml_compute_forward_mul_mat()"])
    from_float --> ggml_compute_forward_mul_mat
    ggml_compute_forward_mul_mat_id(["ggml_compute_forward_mul_mat_id()"])
    from_float --> ggml_compute_forward_mul_mat_id
    ggml_compute_forward_flash_attn_ext_f16(["ggml_compute_forward_flash_attn_ext_f16()"])
    from_float --> ggml_compute_forward_flash_attn_ext_f16
```
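
As a concrete illustration of the to_float path, dequantizing one contiguous row of a tensor into floats could look something like this. This is a sketch (not an excerpt from ggml.c), using the ggml_internal_get_type_traits() accessor exposed in ggml.h; it assumes the tensor data is resident in host memory and that the type has a to_float handler (f32 itself does not).

```c
#include "ggml.h"

// Illustrative sketch: expand one row of a (possibly quantized) tensor into
// 32-bit floats using the to_float handler from the type_traits lookup table.
static void example_dequantize_row(const struct ggml_tensor * t, int64_t i1, float * dst) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(t->type);
    const char * row = (const char *) t->data + i1 * t->nb[1];  // start of row i1
    traits.to_float(row, dst, t->ne[0]);                        // ne[0] elements out
}
```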

Users Guide

Useful information for users that doesn't fit into the README.

Technical Details

Information useful for maintainers and developers that does not fit into code comments.

GitHub Actions Main Branch Status

Click on a badge to jump to its workflow. This is here as a general overview of all the actions, so that we can notice more quickly if and where main-branch automation is broken.

  • bench action status
  • build action status
  • close-issue action status
  • code-coverage action status
  • docker action status
  • editorconfig action status
  • gguf-publish action status
  • labeler action status
  • nix-ci-aarch64 action status
  • nix-ci action status
  • nix-flake-update action status
  • nix-publish-flake action status
  • python-check-requirements action status
  • python-lint action status
  • server action status
