UPSTREAM PR #15641: Modern Bert Support #362
Conversation
…orted yet but working on getting conversion to work for encoder only
…ated gate split with views, GEGLU is now used which does exactly this
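To make the commit concrete, here is a minimal sketch of the two equivalent formulations, assuming ggml's view and GLU ops and hypothetical tensor/variable names (`layer.ffn_up`, `cur`, `n_ff`, `n_tokens`); it is not the PR's actual code:

```cpp
// Hedged sketch, not the PR's code. The fused up-projection produces
// [2*n_ff, n_tokens]; GEGLU gates one half through GELU and multiplies
// it elementwise with the other half (the half order is an assumption).
ggml_tensor * up = ggml_mul_mat(ctx, layer.ffn_up, cur);

// Earlier attempt: manual gate split with views.
ggml_tensor * gate = ggml_view_2d(ctx, up, n_ff, n_tokens, up->nb[1], 0);
ggml_tensor * val  = ggml_view_2d(ctx, up, n_ff, n_tokens, up->nb[1],
                                  n_ff * ggml_element_size(up));
ggml_tensor * ffn  = ggml_mul(ctx, ggml_gelu(ctx, ggml_cont(ctx, gate)),
                              ggml_cont(ctx, val));

// What the commit switched to: the fused op that does exactly this.
ffn = ggml_geglu(ctx, up);
```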
…when building attention keeps failing; running llama-embedding with --ubatch-size 1 makes it work, but this needs more investigation
…ecking out the rest
… per previous attempt, added local sliding window attention that alternates every third layer
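A sketch of that alternation, assuming the ModernBERT convention that every third layer (0, 3, 6, ...) uses full global attention while the rest use the local sliding window; the helper name is hypothetical:

```cpp
// Hedged sketch: which layers are local (sliding-window) vs global.
// Assumes ModernBERT's every-third-layer-global layout; the PR's actual
// selection logic may differ.
static bool is_sliding_window_layer(int il) {
    return (il % 3) != 0; // layers 0, 3, 6, ... stay global
}
```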
…onstruction in graph build
…rope_freq_base_train_swa were the same and I set them to correct values
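Presumably this means the local sliding-window layers use a different RoPE frequency base than the global layers. A hedged sketch of the per-layer selection, using llama.cpp-style names (`hparams.is_swa(il)` and the two `rope_freq_base_train*` fields); the exact wiring in the PR may differ:

```cpp
// Hedged sketch: pick the RoPE base per layer. SWA (local) layers use
// rope_freq_base_train_swa; global layers use rope_freq_base_train.
const float freq_base = hparams.is_swa(il)
    ? hparams.rope_freq_base_train_swa
    : hparams.rope_freq_base_train;
```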
Force-pushed from ca4155f to b86b588
Explore the complete analysis inside the Version Insights.

Performance Analysis Summary: PR #362 - Modern BERT Support

Overview
This PR introduces Modern BERT architecture support across 15 files with 308 additions. The changes add a new model type without modifying existing inference paths. Performance analysis shows the primary binary …

Key Findings

Performance-Critical Functions Impact
The analysis identified performance variations in several functions, but these are not caused by this PR. The changes stem from build configuration or compiler differences affecting STL implementations across all models.

Most Impacted Functions (Absolute Changes): …

These functions are not in the inference hot path. The PR adds new code paths for Modern BERT without touching existing tokenization or inference functions.

Tokens Per Second Impact
No impact on tokens per second. The core inference functions remain unchanged: …

The PR adds a new graph builder (…).

Power Consumption Analysis
Binary-Level Impact: …

The power reduction in …

Code Changes Assessment
The PR implements: …

All changes are isolated to new code paths. The architecture follows established patterns from existing BERT variants (NOMIC_BERT, NEO_BERT), ensuring consistency with the codebase design.
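As an illustration of those established patterns, a new architecture in llama.cpp typically adds an enum entry next to the existing encoder variants; the identifier below is assumed, not taken from the PR:

```cpp
// Hedged sketch of llama-arch.h: the new entry (name assumed) sits
// alongside the BERT variants the summary mentions.
enum llm_arch {
    // ...
    LLM_ARCH_BERT,
    LLM_ARCH_NOMIC_BERT,
    LLM_ARCH_NEO_BERT,
    LLM_ARCH_MODERN_BERT, // added by this PR (identifier assumed)
    // ...
};
```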
Force-pushed from 84f6117 to 91eb894
Mirrored from ggml-org/llama.cpp#15641
Adding support to run granite embedding small, which primarily pulls in the Modern BERT architecture - https://huggingface.co/ibm-granite/granite-embedding-small-english-r2. Still working on it: I haven't figured out the pre-tokenizer type or whether I need to implement it. Also, the ubatch-size assert fails in llama-graph.cpp; I hacked it to accept a ubatch size of 1 for testing, but it keeps failing there and I'm not sure why. If I comment out the line in llama-graph.cpp, then it works.
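For anyone reproducing the workaround: the fix described in this thread is to force a micro-batch size of 1 when embedding. A hedged invocation sketch (the model filename and prompt are hypothetical; `--ubatch-size` is the flag named above):

```sh
# Workaround from the thread: run llama-embedding with a ubatch size of 1.
llama-embedding -m granite-embedding-small-english-r2.gguf \
    -p "test sentence" --ubatch-size 1
```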