
Conversation

@YaelGitAccount
Contributor

This PR adds initial support for the Eagle2-VL multimodal models (1B / 2B) in the MTMD pipeline.

The update introduces a dedicated converter path and runtime builder for the Eagle2-VL vision tower and its 2-layer projector.
All changes are fully self-contained and do not affect any existing model architectures.

Converter (convert_hf_to_gguf.py)

  • Registers a new model handler Eagle2VLVisionModel.
  • Writes VisionProjectorType=EAGLE2VL into GGUF metadata.
  • Extracts Eagle2-VL vision metadata (image/patch size, mean/std, block count, RMSNorm eps).
  • Supports metadata-driven spatial merge (spatial_merge_size, default: 2×2).
  • Canonicalizes projector weights (mm.0, mm.2) to [n_in, n_out]; supports optional biases.
  • Converts Conv3D patch-embed kernels into two Conv2D kernels when present.
  • Normalizes HF checkpoint prefixes to align with MTMD conventions.
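
As a rough illustration of the Conv3D → Conv2D step above, here is a NumPy sketch (the kernel shapes are illustrative assumptions, not the converter's actual values): a patch-embed kernel with temporal depth 2 is sliced along its temporal axis into two ordinary Conv2D kernels.

```python
import numpy as np

# Hypothetical Conv3D patch-embed kernel: [out_ch, in_ch, t, kh, kw] with t == 2.
conv3d = np.random.randn(1280, 3, 2, 14, 14).astype(np.float32)

# Slice along the temporal axis -> two [out_ch, in_ch, kh, kw] Conv2D kernels.
conv2d_a = conv3d[:, :, 0, :, :]
conv2d_b = conv3d[:, :, 1, :, :]
```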

GGUF (gguf-py/gguf/constants.py)

  • Adds the new projector type EAGLE2VL.

Runtime (tools/mtmd/clip.cpp)

  • Adds a dedicated build_eagle2vl() vision path:
    • ViT with learned absolute position embeddings (including dynamic-resize support).
    • Metadata-driven spatial merge prior to the projector.
    • 2-layer MLP projector (mm.0 → GELU → mm.2) using canonical [n_in, n_out] weights.
  • Updates dispatcher to route PROJECTOR_TYPE_EAGLE2VL to the new builder.
  • Final embedding dimension derived from mm_2_w->ne[1].
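
Conceptually, the 2×2 spatial merge folds each 2×2 patch neighborhood into the channel dimension, quartering the token count before the projector. A minimal NumPy sketch (grid size and channel count are illustrative, not taken from the model):

```python
import numpy as np

merge = 2                  # spatial_merge_size from metadata (default 2)
h, w, c = 4, 4, 8          # illustrative patch grid and embedding dim
tokens = np.arange(h * w * c, dtype=np.float32).reshape(h, w, c)

# [h, w, c] -> [h/m, m, w/m, m, c] -> [h/m, w/m, m, m, c] -> [h*w/m^2, m*m*c]
merged = (tokens.reshape(h // merge, merge, w // merge, merge, c)
                .transpose(0, 2, 1, 3, 4)
                .reshape((h // merge) * (w // merge), merge * merge * c))
```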

Integration & Compatibility

  • Loader extended to read Eagle2-VL projector tensors.
  • No CLI changes.
  • No impact on other projector types or existing model architectures.

Validation

Tested locally on Eagle2-VL 1B and 2B checkpoints:

  • GGUF conversion produces expected metadata.
  • Vision tower + spatial merge + projector run end-to-end.
  • All matmuls operate on canonical weights (no runtime transposes).
  • Inference completes successfully.

Scope

This PR focuses on Eagle2-VL (1B / 2B).
Support for additional Eagle2 variants (e.g., 9B) will be handled in a follow-up.

Closes #16704


@ngxson ngxson left a comment


Can you explicitly confirm whether part of this PR was generated by AI? I feel very suspicious about some of the redundant code.

While you said in the PR description that you tested it, you haven't even mentioned a link to the model, nor how you tested it.

learned_pos_embd,
nullptr);

// keep runtime quiet in normal runs; shapes are correct by construction

Some indentations seem off here.

Comment on lines 1128 to 1142
if (model.mm_0_b) {
    embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
}

embeddings = ggml_gelu(ctx0, embeddings);

GGML_ASSERT(model.mm_2_w != nullptr);
// keep [n_in, n_tokens] layout for the second matmul as well
embeddings = ggml_reshape_2d(ctx0, embeddings, embeddings->ne[0], embeddings->ne[1]);
embeddings = ggml_cont_2d(ctx0, embeddings, embeddings->ne[0], embeddings->ne[1]);
// Weights are canonicalized at conversion time to [n_in, n_out]; multiply directly.
embeddings = ggml_mul_mat(ctx0, model.mm_2_w, embeddings);
if (model.mm_2_b) {
    embeddings = ggml_add(ctx0, embeddings, model.mm_2_b);
}

Better to replace this whole block with build_ffn.

Comment on lines 3916 to 3925
mlp_pos = name.find("mlp1.")
if mlp_pos != -1:
    mlp_suffix = name[mlp_pos + len("mlp1."):]
    # Skip LayerNorm (mlp1.0.*)
    if mlp_suffix.startswith("0."):
        return []
    # Map first Linear (mlp1.1.*) -> mm.0.*
    if mlp_suffix.startswith("1."):
        new_name = "mm.0." + mlp_suffix[2:]
        if new_name.endswith(".weight"):

I think all of this code is redundant. This model (https://huggingface.co/nvidia/Eagle2-1B) has a simple .mlp.fc1 and .mlp.fc2 MLP; there is no nested mlp1.1.* as you described.

]

# 5) Conv3D patch embed -> two Conv2D kernels
if name.endswith("patch_embed.proj.weight") and data_torch.ndim == 5:

Are you sure about this? It seems like bad copy-paste code from QwenVL.


Labels

examples, python (python script changes)


Development

Successfully merging this pull request may close these issues.

Feature Request: Add support for Eagle2_VL (Eagle2_5_VLForConditionalGeneration) multimodal models
