
Conversation

@DisOOM commented Apr 3, 2024

Statement: This has nothing to do with the fine-grained MoE architecture in Qwen/Qwen1.5-MoE-A2.7B. It is more akin to a traditional MoE, except that its experts are derived from the qwen2 (qwen1.5) model.

I was previously using mergekit-moe to merge qwen1.5 models into an MoE, but the resulting models were corrupted after being converted to the gguf format.
Subsequently, I discovered this custom mergekit script that successfully merges qwen1.5 models into a qwen2 MoE: https://github.com/Aratako/mergekit-qwen2. Following the example of #4912, I made some modifications to llama.cpp, enabling it to correctly convert, quantize, and run MoEs merged with this custom script.
This works well on older versions, but I encountered errors with the latest version: it can correctly convert and quantize but fails to run. I believe the issue is an incompatibility with the changes made to llama.cpp in #6122, but I am unsure how to resolve it.
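For context, the conversion-side change is essentially about mapping the per-expert FFN weights that the merge produces onto stacked expert tensors. The sketch below only illustrates that idea and is not the actual patch; the HF tensor names, the helper name, and the GGUF tensor names are assumptions modeled on how other MoE conversions look.

```python
# Illustrative sketch only (not the actual patch): collect the per-expert FFN
# weights that a mergekit-style qwen2 MoE merge produces and stack them into
# one 3D tensor per projection, which is roughly what an MoE-aware conversion
# has to emit. Tensor names here are assumptions, not taken from the script.
import numpy as np

def stack_expert_tensors(weights: dict, layer: int, n_expert: int) -> dict:
    """weights: HF tensor name -> numpy array (hypothetical checkpoint layout)."""
    stacked = {}
    for proj, gguf_name in (("gate_proj", "ffn_gate_exps"),
                            ("up_proj", "ffn_up_exps"),
                            ("down_proj", "ffn_down_exps")):
        experts = [
            weights[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
            for e in range(n_expert)
        ]
        # Resulting shape: [n_expert, out_features, in_features].
        stacked[f"blk.{layer}.{gguf_name}.weight"] = np.stack(experts, axis=0)
    return stacked
```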

I am new to coding and this is my first PR, so please be lenient.

Converting with convert-hf-to-gguf.py and quantizing with quantize.exe worked without issue, but running main.exe produced the following error:

PS D:\llama.cpp\llama.cpp> ./build/bin/Release/main.exe -m D:/model/ggml-model-f16.gguf -n 128
Log start
main: build = 2585 (f87f7b89)
main: built with MSVC 19.39.33523.0 for x64
main: seed = 1712122664
llama_model_loader: loaded meta data with 21 key-value pairs and 643 tensors from D:/model-merge/Merged/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Merged
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: qwen2.expert_count u32 = 2
llama_model_loader: - kv 11: qwen2.expert_used_count u32 = 2
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t", ...
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 151645
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type f16: 442 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: n_expert = 2
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 22.58 B
llm_load_print_meta: model size = 42.07 GiB (16.00 BPW)
llm_load_print_meta: general.name = Merged
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151645 '<|im_end|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: CPU buffer size = 43074.71 MiB
................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 353.76 MiB
llama_new_context_with_model: graph nodes = 2164
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: D:\llama.cpp\llama.cpp:9701: lctx.inp_out_ids && "every model that can must skip unused outputs"
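For anyone hitting the same assert, a quick sanity check is to confirm that the converted file really carries the expert metadata and tensors before digging into the runtime. Here is a minimal sketch, assuming the gguf Python package from the llama.cpp repo (gguf-py); the path and the tensor-name filters are placeholders:

```python
# Minimal GGUF inspection sketch (path and name filters are placeholders).
from gguf import GGUFReader

reader = GGUFReader("D:/model/ggml-model-f16.gguf")

# List the metadata keys (should include qwen2.expert_count and expert_used_count).
for key in reader.fields:
    print("kv:", key)

# Print the MoE-related tensors and their shapes.
for t in reader.tensors:
    if "exps" in t.name or "ffn_gate_inp" in t.name:
        print(t.name, list(t.shape), t.tensor_type.name)
```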

@DisOOM changed the title from "Adding Support for Custom Qwen2moe Architectures Using mergekit-qwen2" to "Adding Support for Custom Qwen2moe Architectures with mergekit-qwen2" on Apr 3, 2024
@github-actions (Contributor) commented

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 503 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9295.2ms p(90)=26525.97ms fails=0, finish reason: stop=503 truncated=0
  • Prompt processing (pp): avg=241.95tk/s p(90)=732.6tk/s total=200.08tk/s
  • Token generation (tg): avg=98.97tk/s p(90)=277.24tk/s total=130.21tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=115f49a08a1c9fd59c60ed1425827d9ae2614565
Time series charts (interactive plots omitted): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

@maziyarpanahi commented

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried this PR to see whether MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve MoE issues in general (I'm not sure whether there is a difference between mergekit MoE and others like Qwen, Mixtral, and DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

@ggerganov (Member) commented

Qwen MoE models should be able to work after merging #6387 and then #6074

DBRX models likely also depend on #6387 + we need conversion scripts and compute graph implementation

@DisOOM (Author) commented Apr 3, 2024

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried this PR to see whether MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve MoE issues in general (I'm not sure whether there is a difference between mergekit MoE and others like Qwen, Mixtral, and DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

I haven't tried this PR yet. I will give it a try later.

@maziyarpanahi commented

I have pulled and used the latest changes from the master branch. I have successfully converted this model into fp16 GGUF: https://huggingface.co/MaziyarPanahi/Qwen1.5-8x7b-v0.1

It works fine and produces coherent output. However, any model quantized from this fp16 results in the following error:

..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 343.26 MiB
llama_new_context_with_model: graph nodes = 1638
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: ggml.c:11015: wdata == wdata_src1_end
Aborted (core dumped)

@ggerganov I am not sure what causes this error. This is an MoE made by mergekit based on Qwen models (one of those situations where the fp16 GGUF model works fine, but the quantized one either crashes or outputs nonsense).
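Not a fix, but one way to narrow this down (a hedged sketch; the file names are placeholders and it assumes the gguf-py package) is to diff tensor names, shapes, and types between the working fp16 GGUF and the crashing quantized one: if they diverge, the problem is on the quantization side; if they match, the crash is more likely in the runtime mul_mat_id path.

```python
# Hedged diagnostic sketch: compare tensor metadata between two GGUF files.
# File names below are placeholders, not the actual artifacts from this thread.
from gguf import GGUFReader

def tensor_map(path: str) -> dict:
    return {t.name: (list(t.shape), t.tensor_type.name) for t in GGUFReader(path).tensors}

f16 = tensor_map("qwen-moe-f16.gguf")    # the fp16 conversion that works
quant = tensor_map("qwen-moe-q4.gguf")   # the quantized file that aborts

for name, (shape, _) in f16.items():
    q_shape, q_type = quant.get(name, (None, None))
    if q_shape != shape:
        print(f"mismatch: {name} f16={shape} quant={q_shape} ({q_type})")
```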

@mofosyne added the labels "Review Complexity: High" (Generally require in-depth knowledge of LLMs or GPUs) and "model" (Model specific) on May 10, 2024