v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine
New models
ModernBERT
The ModernBERT model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
- Rotary Positional Embeddings to support sequences of up to 8192 tokens.
- Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
- GeGLU layers replacing the original MLP layers, shown to improve performance.
- Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
- Flash Attention to speed up processing.
- A model design that follows the recent paper The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
- Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
- Add ModernBERT to Transformers by @warner-benjamin in #35158
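To make the alternating attention pattern above more concrete, here is a minimal sketch in plain Python. It is not the ModernBERT implementation: the function names are hypothetical, and it assumes that layers 0, 3, 6, ... are the global ones and that the local window is symmetric (half of the window on each side of the query token).

```python
def uses_global_attention(layer_idx: int, global_every: int = 3) -> bool:
    """Assumption: layers 0, 3, 6, ... attend globally; the others are local."""
    return layer_idx % global_every == 0


def attention_mask(seq_len: int, layer_idx: int, window: int = 128) -> list[list[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption: the sliding window is symmetric and covers window // 2
    tokens on each side of the query position.
    """
    if uses_global_attention(layer_idx):
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


# Tiny demonstration with a window of 4 tokens instead of 128.
local = attention_mask(seq_len=8, layer_idx=1, window=4)
global_ = attention_mask(seq_len=8, layer_idx=3, window=4)
print(sum(local[0]), sum(global_[0]))  # tokens visible to token 0 in each layer
```

With these toy settings, token 0 sees 3 tokens in the local layer and all 8 tokens in the global layer, which is where the compute savings on long sequences come from.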
Aria
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
- Add Aria by @aymeric-roucher in #34157
TimmWrapper
We add a TimmWrapper set of classes so that timm models can be loaded into the library as transformers models.
Here's a general usage example:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor
checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)
with torch.no_grad():
    logits = model(**inputs).logits
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc.:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import pipeline
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
- Add TimmWrapper by @qubvel and @amyeroberts in #34564
Pixtral-Large
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
- Update Pixtral conversion script to support large format! by @ArthurZucker in #34801
ColPali
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
- Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736
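The late interaction scoring mentioned above can be sketched in a few lines of plain Python. This is an illustration of the ColBERT-style "MaxSim" idea only, not the ColPali code: the function name and the toy 2-dimensional embeddings are made up for the example.

```python
def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the best dot product against any document vector."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy multi-vector embeddings (made-up values): one vector per query token
# and one vector per document image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.1], [0.3, 0.2]]

scores = {
    "doc_a": late_interaction_score(query, doc_a),
    "doc_b": late_interaction_score(query, doc_b),
}
best = max(scores, key=scores.get)
print(best, scores)
```

Each query vector picks its best-matching document vector independently, so a document scores highly when every part of the query finds a good match somewhere on the page.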
Falcon3
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performance while reducing training costs:
- One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
- Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens (2TT) of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
- Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100 billion tokens (100GT) of curated high-quality data, thereby redefining pre-training efficiency.
- Add Falcon3 documentation by @mokeddembillel in #35307
Bamba
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
- Add the Bamba Model by @fabianlim in #34982
VitPose
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
- Add VitPose by @SangbumChoi and @NielsRogge in #30530
DINOv2 with registers
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. These arise because the model uses some image patches as “registers”. The authors propose a fix: add some new tokens (called “register” tokens), which are only used during pre-training and thrown away afterwards. This results in:
- no artifacts
- interpretable attention maps
- and improved performance.
- Add DINOv2 with registers by @NielsRogge in #35348
Emu3
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on the VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
- Add Emu3 by @zucchini-nlp in #33770
Cohere2
A new Cohere update was added through a new "Cohere2" set of classes.
- Add Cohere2 model by @alexrs-cohere in #35224
TextNet
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
- Add TextNet by @jadechoghari in #34979
DiffLlama
DiffLlama combines the Llama architecture with the Differential Transformer's attention.
- Add DiffLllama by @weak-kajuma in #34083
PixtralLarge
The conversion script needed a few updates, while the modeling code was barely changed!
- [PixtralLarge] Update Pixtral conversion script to support large format! (#34801)
Moonshine
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
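As a rough illustration of why rotary embeddings are not tied to a fixed input length, the sketch below rotates query and key vectors by a position-dependent angle and checks that their dot product depends only on the distance between the two positions. It is a generic RoPE sketch in plain Python, not Moonshine's code: the interleaved pairing of dimensions and the base of 10000 are assumptions, and real implementations differ in such details.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector.

    The pair starting at index i, (vec[i], vec[i + 1]), is rotated by the
    angle position * base ** (-i / len(vec)).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two very different absolute positions.
near = dot(rotate(q, 5), rotate(k, 2))
far = dot(rotate(q, 1005), rotate(k, 1002))
print(abs(near - far) < 1e-9)
```

Because the attention score only sees the relative offset, no table of absolute positions has to be sized in advance.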
Quantization methods
VPTQ Quantization
From the VPTQ contributors:
VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy. More details here: https://github.com/microsoft/vptq
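To give a feel for how vector quantization reaches such low bit-widths, here is a minimal sketch in plain Python. It shows generic nearest-centroid vector quantization with a hand-written codebook, not the VPTQ algorithm itself; all function names and values are hypothetical.

```python
import math


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], vec)),
    )


def quantize(weights, codebook, vec_len):
    """Split a flat weight list into vectors and keep one codebook index per vector."""
    vectors = [weights[i:i + vec_len] for i in range(0, len(weights), vec_len)]
    return [nearest(codebook, v) for v in vectors]


def dequantize(indices, codebook):
    """Rebuild approximate weights by looking up each stored index."""
    return [x for idx in indices for x in codebook[idx]]


def index_bits_per_weight(codebook_size, vec_len):
    """Bits spent on indices per weight (ignores the cost of storing the codebook)."""
    return math.log2(codebook_size) / vec_len


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]  # made-up centroids
weights = [0.9, 1.1, -0.8, 1.2, 0.1, -0.1, 0.7, -1.3]
indices = quantize(weights, codebook, vec_len=2)
print(indices, dequantize(indices, codebook))
print(index_bits_per_weight(codebook_size=4, vec_len=2))
```

Since one index is shared by a whole vector of weights, the per-weight cost drops as the vector length grows: for instance, a 256-entry codebook over vectors of 8 weights costs 1 bit per weight for the indices.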
HIGGS Quantization
From the contributors:
HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.
Runtime support for HIGGS is implemented through FLUTE, and its library.
This PR adds support for HIGGS+FLUTE into transformers allowing for low-error 0-shot quantization and fast LLM inference.
- HIGGS Quantization Support by @BlackSamorez in #34997
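The sketch below illustrates, under simplifying assumptions, why a Hadamard transform can help before quantizing: it spreads a single outlier evenly over all coordinates, so a small grid fits the values better. This is not the HIGGS algorithm or the FLUTE kernels; the 4-level grid is a hypothetical stand-in for an MSE-optimal grid, and all names are illustrative.

```python
import math


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_to_grid(x, grid):
    """Round every value to the nearest grid point."""
    return [min(grid, key=lambda g: abs(g - v)) for v in x]


def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


weights = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one large outlier
grid = [-3.0, -1.0, 1.0, 3.0]  # hypothetical 2-bit grid

# Quantize directly versus quantize in the rotated space and rotate back.
# The orthonormal transform is its own inverse.
direct_error = l2_error(weights, quantize_to_grid(weights, grid))
rotated = hadamard_transform(weights)
restored = hadamard_transform(quantize_to_grid(rotated, grid))
rotated_error = l2_error(weights, restored)
print(rotated_error < direct_error)
```

In this toy case the outlier is clipped heavily when quantized directly, while after the transform every coordinate has the same moderate magnitude and lands close to a grid point.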
Cleanup
We merged a cleanup for vision language models, to make sure all models are standardized.
- VLMs: major clean up 🧼 (#34502)
Breaking changes
Conversion scripts
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
- 🚨🚨🚨 Delete conversion scripts when making release wheels by @Rocketknight1 in #35296
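If you want to check whether a given checkout or installation still contains conversion scripts, the glob pattern above can be applied with pathlib. The sketch below runs it on a throwaway directory tree; the helper name and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_conversion_scripts(package_root: Path) -> list[Path]:
    """Return files matching models/**/convert_*.py under package_root, sorted."""
    return sorted(package_root.glob("models/**/convert_*.py"))


# Build a small temporary tree (made-up file names) to demonstrate the pattern.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "models/foo/convert_foo_to_hf.py",
        "models/foo/modeling_foo.py",
        "models/bar/convert_bar_checkpoint.py",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = [p.relative_to(root).as_posix() for p in find_conversion_scripts(root)]

print(found)
```

Pointing the same helper at the directory of an installed transformers package should return an empty list for wheels built from this release onwards, and a non-empty list for a checkout of the main branch.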
Backtracking in Nougat
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.
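For readers unfamiliar with this class of problem, the sketch below shows the general idea with a pair of hypothetical patterns; these are not the expressions used in the Nougat code. A pattern with nested quantifiers can take exponential time on inputs that almost match, while an equivalent pattern with a single quantifier cannot.

```python
import re

# Hypothetical patterns, chosen only to illustrate the technique.
# Nested quantifiers: many ways to split a run of digits, so a failing
# match may try all of them before giving up.
BACKTRACKING_PRONE = re.compile(r"(\d+\s*)+")
# Same accepted strings (a digit followed by digits or whitespace),
# written with a single quantifier.
LINEAR = re.compile(r"\d[\d\s]*")

# Inputs are kept short on purpose so the slow pattern still finishes quickly.
samples = ["12 34 56", "7", "12 x", "", " 12", "1" * 12 + "x"]
agree = all(
    bool(BACKTRACKING_PRONE.fullmatch(s)) == bool(LINEAR.fullmatch(s))
    for s in samples
)
print(agree)
```

The last sample is the problematic shape: a long run that matches followed by one character that does not. Lengthening that run makes the first pattern dramatically slower, while the second stays fast.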
Whisper decoding
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In these cases, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
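The new rule can be written down as a small predicate. The function below only restates the condition from this note so you can check your own combination of arguments; it is not part of the library, and its name is hypothetical.

```python
def includes_decoder_ids_and_eos(
    return_dict_in_generate: bool,
    return_timestamps: bool,
    force_unique_generate_call: bool = False,
) -> bool:
    """True when the output still contains decoder input IDs and the EOS token ID."""
    return return_dict_in_generate and (
        not return_timestamps or force_unique_generate_call
    )


# A few combinations of the generate arguments named in this note.
print(includes_decoder_ids_and_eos(True, return_timestamps=False))
print(includes_decoder_ids_and_eos(True, return_timestamps=True))
print(includes_decoder_ids_and_eos(False, return_timestamps=False))
```

If your post-processing strips the decoder prompt or the EOS token by position, this is the condition under which that stripping is still needed.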
Attention refactor
To make the attention code cleaner, more isolated, and future-proof, the attention layers have been refactored: each model keeps its own attention code in its modeling file, while the implementations of SDPA, Flash Attention, and other attention types have been moved to a common file.
- 🚨All attention refactor🚨 by @ArthurZucker in #35235
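The shape of this refactor can be pictured as a shared registry of attention functions that model-specific layers look up by name. The sketch below is a toy version in plain Python under that assumption; the registry, the class, and the function names are hypothetical and do not mirror the library's actual interfaces.

```python
import math


def eager_attention(query, key, value):
    """Plain scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, row)) for row in key]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * row[j] for w, row in zip(weights, value))
        for j in range(len(value[0]))
    ]


# Hypothetical shared registry: one common place mapping a name to a kernel.
ATTENTION_FUNCTIONS = {"eager": eager_attention}


class ToyAttentionLayer:
    """Model-specific layer: owns its own logic, looks up the kernel by name."""

    def __init__(self, attn_implementation="eager"):
        self.attn_implementation = attn_implementation

    def forward(self, query, key, value):
        attention_fn = ATTENTION_FUNCTIONS[self.attn_implementation]
        return attention_fn(query, key, value)


layer = ToyAttentionLayer()
out = layer.forward(
    [1.0, 0.0],
    key=[[1.0, 0.0], [0.0, 1.0]],
    value=[[1.0, 2.0], [3.0, 4.0]],
)
print(out)
```

With this split, adding a new attention backend means registering one function in the common place rather than editing every model file.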
Bugfixes and improvements
- [tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer (#35593)
- Setup loss_type in config at model init time (#34616)
- [docs] Update Python version in translations by @jla524 in #35096
- [docs] top_p, top_k, temperature docstrings by @stevhliu in #35065
- Fix private forked repo. CI by @ydshieh in #35114
- Add feature dim attributes to BitLinear for easier PEFT integration by @agostinv in #34946
- Update I-JEPA checkpoints path by @qubvel in #35120
- Fix GA loss bugs and add unit test by @techkang in #35121
- [I-JEPA] Update docs by @NielsRogge in #35148
- Corrected typo in agent system prompts by @Uvi-12 in #35143
- Option to set 'non_blocking' for to(device) in BatchEncoding and BatchFeature by @daniel-bogdoll in #34883
- Fix typo in EETQ Tests by @MekkCyber in #35160
- Cleanup: continue the init refactor by @LysandreJik in #35167
- Super tiny fix logging message by @fzyzcjy in #35132
- Fixed typo of 'avilable' in prompts.py by @Uvi-12 in #35145
- [CI] Fix bnb quantization tests with accelerate>=1.2.0 by @matthewdouglas in #35172
- Fix num_items_in_batch not being an integer by @xspirus in #35115
- Assisted decoding multi-gpu by @zucchini-nlp in #35116
- Fix file path for shard_num 1 with mllama converter by @strangiato in #35053
- Support BatchNorm in Hubert pos_conv_emb as in fairseq by @gallilmaimon in #34389
- Remove unnecessary masked_fill in deberta models by @xadupre in #35182
- Fix DBRX LayerNorm init method by @hgt312 in #35177
- Fixing GGUF support for StableLm by @MekkCyber in #35060
- [i18n-ar] Translated file: docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
- Multiple typo fixes in NLP, Audio docs by @henryhmko in #35181
- Only import torch.distributed if it is available by @GaetanLepage in #35133
- [i18n-] Translating Benchmarks.md to Chinese by @asdkfjsd in #35137
- [docs] Fix FlashAttention link by @stevhliu in #35171
- Update data collator docstrings to accurately reference Nvidia tensor core compute capability version by @johngrahamreynolds in #35188
- [i18n-] Translating agents.md to Chinese by @HMJ0628 in #35139
- BLIP: enable device map by @zucchini-nlp in #34850
- 🧹 Remove deprecated RotaryEmbedding parts in the Attention layers by @Cyrilvallez in #34858
- [PEFT] Better Trainer error when prompt learning with loading best model at the end by @BenjaminBossan in #35087
- Cleanup: continue the init refactor by @LysandreJik in #35170
- Fix CI by @Cyrilvallez in #35208
- Fix seamless TTS generate by @ylacombe in #34968
- docs: clarify initializer_range parameter description in Idefics3VisionConfig by @h3110Fr13nd in #35215
- Fixed typo of 'indentifier' in audio_utils.py by @Uvi-12 in #35226
- Fix type hints for apply_chat_template by @Rocketknight1 in #35216
- Support Python 3.10+ Union style in chat template type hints parsing by @RezaRahemtola in #35103
- Refactoring AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
- Change back to Thread for SF conversion by @ydshieh in #35236
- [Init refactor] Modular changes by @LysandreJik in #35240
- Fix typo in chat template example by @EricWinsorDSIT in #35250
- Run model as compressed/uncompressed mode by @horheynm in #34719
- skip Fuyu from test_generate by @nhamanasu in #35246
- [tests] fix "Tester object has no attribute '_testMethodName'" by @faaany in #34910
- Use rsfE with pytest by @ydshieh in #35119
- Update AMD docker image (rocm 6.1) by @ivarflakstad in #35259
- Fixed typos in Audio Classification Documentation by @Uvi-12 in #35263
- Translating agents_advanced.md to Chinese by @HMJ0628 in #35231
- Fix FSDP no longer working by @muellerzr in #35212
- don't use no_sync when deepspeed doesn't support it for certain zero stages by @winglian in #35157
- [i18n-Chinese] Translating perf_train_cpu.md to Chinese by @asdkfjsd in #35242
- Fall back to slow image processor in ImageProcessingAuto when no fast processor available by @yonigozlan in #34785
- Aggeregate test summary files in CircleCI workflow runs by @ydshieh in #34989
- Blip: fix offloading and MP tests by @zucchini-nlp in #35239
- Fix : model used to test ggml conversion of Falcon-7b is incorrect by @MekkCyber in #35083
- Temporarily disable amd push ci by @ivarflakstad in #35293
- Delete redundancy for loop checks. by @zhanluxianshen in #35288
- [Whisper] patch float type on mps by @eustlb in #35295
- Fix typos in Translated Audio Classification Docs by @jla524 in #35287
- Translating "translate perf_infer_gpu_multi.md" to Chinese by @HMJ0628 in #35271
- Fix wrongs in quicktour[zh] by @zhanluxianshen in #35272
- Improved documentation of Automatic speech recognition by @Uvi-12 in #35268
- fix modular order by @ArthurZucker in #35297
- Add sdpa for Beit by @OmarManzoor in #34941
- Support for SDPA for SAM models by @MagnusS0 in #34110
- remove benchmark job in push-important-models.yml by @ydshieh in #35292
- Fix typos in translated quicktour docs by @jla524 in #35302
- Fix image preview in multi-GPU inference docs by @jla524 in #35303
- Fix remove unused parameter in docs by @zzzzzsa in #35306
- Add Cohere2 docs details by @alexrs-cohere in #35294
- Fixed typo in audio_classification.md by @Uvi-12 in #35305
- [docs] Improve register_pipeline by @stevhliu in #35300
- Fix loading with only state dict and low_cpu_mem_usage = True by @SunMarc in #35217
- [tests] make cuda-only tests device-agnostic by @faaany in #35222
- Trigger GitHub CI with a comment on PR by @ydshieh in #35211
- change bnb tests by @jiqing-feng in #34713
- [Whisper] fix docstrings typo by @eustlb in #35319
- feat: add benchmarks_entrypoint.py by @McPatate in #34495
- Fix documentation for ColPali by @tonywu71 in #35321
- Update comment CI bot by @ydshieh in #35323
- PaliGemma: Make sure to add to suffix if is present in text by @probicheaux in #35201
- Fix some fa2 tests by @ArthurZucker in #35340
- Modernbert Release Fixes by @warner-benjamin in #35344
- [docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
- fix onnx export of speech foundation models by @nikosanto13 in #34224
- [Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
- Reduce CircleCI usage by @ydshieh in #35355
- Implement AsyncTextIteratorStreamer for asynchronous streaming by @CISC in #34931
- Cleaner attention interfaces by @Cyrilvallez in #35342
- Add Tensor Parallel support for Qwen2VL by @jla524 in #35050
- fix zoedepth initialization error under deepspeed zero3 by @Tavish9 in #35011
- Aurevoir PyTorch 1 by @ydshieh in #35358
- bugfix: torch.export failure caused by _make_causal_mask by @jiwoong-choi in #35291
- update codecarbon by @nhamanasu in #35243
- Update test fetcher when we want to test all by @ArthurZucker in #35364
- Use weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
- Make test_generate_with_static_cache even less flaky by @ydshieh in #34995
- Improve modular transformers documentation by @joelpaulkoch in #35322
- Improved Documentation Of Audio Classification by @Uvi-12 in #35368
- [docs] Follow up register_pipeline by @stevhliu in #35310
- owlvit/2 dynamic input resolution by @bastrob in #34764
- Fix new FA2 if is_causal is passed explicitly by @Cyrilvallez in #35390
- bitsandbytes: simplify 8bit dequantization by @matthewdouglas in #35068
- make LlamaModel._update_causal_mask torch compilable by @winglian in #35187
- Patch GPTNeoX to use adequate FA2 if position_ids is provided by @taha-yassine in #35318
- uniformize kwargs for SAM by @tibor-reiss in #34578
- Deprecate _is_quantized_training_enabled by @MekkCyber in #34991
- Scale loss before backward by @qgallouedec in #35207
- Fix typing in docstring for PaliGemmaProcessor by @alvarobartt in #35278
- Fix : VPTQ test by @MekkCyber in #35394
- add bnb support for Ascend NPU by @statelesshz in #31512
- bugfix Idefics3 processor - handle gracefully cases with text and no images by @mfarre in #35363
- Adding logger.info about update_torch_dtype in some quantizers by @MekkCyber in #35046
- Add compile test for fast image processor by @yonigozlan in #35184
- Disable .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
- enable non-cuda awq model support without modify version by @jiqing-feng in #35334
- [GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
- Drop inplace operation for loss computation with gradient accumulation by @qgallouedec in #35416
- Fix: Rename keyword argument in_channels to num_channels by @ningyuv in #35289
- CLIP conversion script - Change fairseq to OpenAI by @gau-nernst in #35384
- Fix f-string to show ACCELERATE_MIN_VERSION on error by @KSafran in #35189
- Fix model_accepts_loss_kwargs for timm model by @qubvel in #35257
- Update perf_infer_gpu_one.md: fix a typo by @martin0258 in #35441
- Add compute_loss_func to Seq2SeqTrainer by @d223302 in #35136
- Update docs for sdpa_kernel by @jla524 in #35410
- [i18n-ar] Translated file: docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
- [i18n-ar] Translated file: docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
- Update translated docs for sdpa_kernel by @jla524 in #35461
- Reintroduce Python 3.9 support for ModernBERT by @tomaarsen in #35458
- Fix new BNB test failures by @matthewdouglas in #35345
- Fix docs typos. by @zhanluxianshen in #35465
- Fix paligemma warning message by @hiyouga in #35486
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ydshieh
  - Fix private forked repo. CI (#35114)
  - Change back to Thread for SF conversion (#35236)
  - Use rsfE with pytest (#35119)
  - Aggeregate test summary files in CircleCI workflow runs (#34989)
  - remove benchmark job in push-important-models.yml (#35292)
  - Trigger GitHub CI with a comment on PR (#35211)
  - Update comment CI bot (#35323)
  - Reduce CircleCI usage (#35355)
  - Aurevoir PyTorch 1 (#35358)
  - Use weights_only=True with torch.load for transfo_xl (#35241)
  - Make test_generate_with_static_cache even less flaky (#34995)
  - Disable .github/workflows/self-comment-ci.yml for now (#35366)
- @aymeric-roucher
  - Add Aria (#34157)
- @NielsRogge
- @HMJ0628
- @alexrs-cohere
- @ArthurZucker
- @tonywu71
- @OmarManzoor
  - Add sdpa for Beit (#34941)
- @fabianlim
  - Add the Bamba Model (#34982)
- @warner-benjamin
- @wejoncy
  - FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)
- @bastrob
  - owlvit/2 dynamic input resolution (#34764)
- @BlackSamorez
  - HIGGS Quantization Support (#34997)