
Conversation

@DisOOM commented Apr 3, 2024

Statement: This has nothing to do with the fine-grained MoE architecture in Qwen/Qwen1.5-MoE-A2.7B. It is more akin to a traditional MoE, except that its experts are derived from the qwen2 (qwen1.5) model.

I was previously using mergekit-moe to merge qwen1.5 models into an MoE, but the resulting models were corrupted after being converted to the gguf format.
Subsequently, I discovered this custom mergekit script that successfully merges qwen1.5 models into a qwen2 MoE: https://github.com/Aratako/mergekit-qwen2. Following the example of #4912, I made some modifications to llama.cpp, enabling it to correctly convert, quantize, and run MoEs merged with this custom script.
This works well on older versions, but I encountered errors with the latest version: it can correctly convert and quantize but fails to run. I believe the issue is an incompatibility with the changes made to llama.cpp in #6122, but I am unsure how to resolve it.
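For context, the conversion-side change is essentially about mapping the per-expert FFN weights that the merge produces onto stacked expert tensors. The sketch below only illustrates that idea and is not the actual patch; the HF tensor names, the helper name, and the GGUF tensor names are assumptions modeled on how other MoE conversions look.

```python
# Illustrative sketch only (not the actual patch): collect the per-expert FFN
# weights that a mergekit-style qwen2 MoE merge produces and stack them into
# one 3D tensor per projection, which is roughly what an MoE-aware conversion
# has to emit. Tensor names here are assumptions, not taken from the script.
import numpy as np

def stack_expert_tensors(weights: dict, layer: int, n_expert: int) -> dict:
    """weights: HF tensor name -> numpy array (hypothetical checkpoint layout)."""
    stacked = {}
    for proj, gguf_name in (("gate_proj", "ffn_gate_exps"),
                            ("up_proj", "ffn_up_exps"),
                            ("down_proj", "ffn_down_exps")):
        experts = [
            weights[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
            for e in range(n_expert)
        ]
        # Resulting shape: [n_expert, out_features, in_features].
        stacked[f"blk.{layer}.{gguf_name}.weight"] = np.stack(experts, axis=0)
    return stacked
```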

I am new to coding and this is my first PR, so please be lenient.

Converting with convert-hf-to-gguf.py and quantizing with quantize.exe worked without issue, but running main.exe produced the following error:

PS D:\llama.cpp\llama.cpp> ./build/bin/Release/main.exe -m D:/model/ggml-model-f16.gguf -n 128
Log start
main: build = 2585 (f87f7b89)
main: built with MSVC 19.39.33523.0 for x64
main: seed = 1712122664
llama_model_loader: loaded meta data with 21 key-value pairs and 643 tensors from D:/model-merge/Merged/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Merged
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: qwen2.expert_count u32 = 2
llama_model_loader: - kv 11: qwen2.expert_used_count u32 = 2
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t", ...
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 151645
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type f16: 442 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: n_expert = 2
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 22.58 B
llm_load_print_meta: model size = 42.07 GiB (16.00 BPW)
llm_load_print_meta: general.name = Merged
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151645 '<|im_end|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size = 0.25 MiB
llm_load_tensors: CPU buffer size = 43074.71 MiB
................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 353.76 MiB
llama_new_context_with_model: graph nodes = 2164
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: D:\llama.cpp\llama.cpp:9701: lctx.inp_out_ids && "every model that can must skip unused outputs"
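For anyone hitting the same assert, a quick sanity check is to confirm that the converted file really carries the expert metadata and tensors before digging into the runtime. Here is a minimal sketch, assuming the gguf Python package from the llama.cpp repo (gguf-py); the path and the tensor-name filters are placeholders:

```python
# Minimal GGUF inspection sketch (path and name filters are placeholders).
from gguf import GGUFReader

reader = GGUFReader("D:/model/ggml-model-f16.gguf")

# List the metadata keys (should include qwen2.expert_count and expert_used_count).
for key in reader.fields:
    print("kv:", key)

# Print the MoE-related tensors and their shapes.
for t in reader.tensors:
    if "exps" in t.name or "ffn_gate_inp" in t.name:
        print(t.name, list(t.shape), t.tensor_type.name)
```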

@DisOOM changed the title from "Adding Support for Custom Qwen2moe Architectures Using mergekit-qwen2" to "Adding Support for Custom Qwen2moe Architectures with mergekit-qwen2" on Apr 3, 2024
@github-actions (Contributor) commented

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 503 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9295.2ms p(90)=26525.97ms fails=0, finish reason: stop=503 truncated=0
  • Prompt processing (pp): avg=241.95tk/s p(90)=732.6tk/s total=200.08tk/s
  • Token generation (tg): avg=98.97tk/s p(90)=277.24tk/s total=130.21tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=115f49a08a1c9fd59c60ed1425827d9ae2614565
Time series charts (interactive plots omitted): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

@maziyarpanahi commented

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried this PR to see whether MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve MoE issues in general (I'm not sure whether there is a difference between mergekit MoE and others like Qwen, Mixtral, and DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

@ggerganov (Member) commented

Qwen MoE models should be able to work after merging #6387 and then #6074

DBRX models likely also depend on #6387 + we need conversion scripts and compute graph implementation

@DisOOM (Author) commented Apr 3, 2024

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried this PR to see whether MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve MoE issues in general (I'm not sure whether there is a difference between mergekit MoE and others like Qwen, Mixtral, and DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

I haven't tried this PR yet. I will give it a try later.

@maziyarpanahi commented

I have pulled and used the latest changes from the master branch. I have successfully converted this model into fp16 GGUF: https://huggingface.co/MaziyarPanahi/Qwen1.5-8x7b-v0.1

It works fine and produces coherent output. However, any model quantized from this fp16 results in the following error:

..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 343.26 MiB
llama_new_context_with_model: graph nodes = 1638
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: ggml.c:11015: wdata == wdata_src1_end
Aborted (core dumped)

@ggerganov I am not sure what causes this error. This is an MoE made by mergekit based on Qwen models (one of those situations where the fp16 GGUF model works fine, but the quantized one either crashes or outputs nonsense).
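Not a fix, but one way to narrow this down (a hedged sketch; the file names are placeholders and it assumes the gguf-py package) is to diff tensor names, shapes, and types between the working fp16 GGUF and the crashing quantized one: if they diverge, the problem is on the quantization side; if they match, the crash is more likely in the runtime mul_mat_id path.

```python
# Hedged diagnostic sketch: compare tensor metadata between two GGUF files.
# File names below are placeholders, not the actual artifacts from this thread.
from gguf import GGUFReader

def tensor_map(path: str) -> dict:
    return {t.name: (list(t.shape), t.tensor_type.name) for t in GGUFReader(path).tensors}

f16 = tensor_map("qwen-moe-f16.gguf")    # the fp16 conversion that works
quant = tensor_map("qwen-moe-q4.gguf")   # the quantized file that aborts

for name, (shape, _) in f16.items():
    q_shape, q_type = quant.get(name, (None, None))
    if q_shape != shape:
        print(f"mismatch: {name} f16={shape} quant={q_shape} ({q_type})")
```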

@mofosyne added the labels "Review Complexity: High" (Generally require in-depth knowledge of LLMs or GPUs) and "model" (Model specific) on May 10, 2024