NeoBERT: A New Frontier for Open-Source Encoder Language Models

A digital picture of Bert from Sesame Street, wearing a black trench coat and sunglasses

Large language models are costly to train and need a lot of computing power to run. This is why we developed NeoBERT, a compact, efficient, and open-source state-of-the-art language model that will provide researchers and organizations with a strong foundation for training their own models.

BERT, released in 2018, was a significant advancement in the pre-training of language models. Decoder models, like GPT, focus on generating new text by predicting the next word in a sequence. Encoder models such as BERT, by contrast, excel at understanding the context of existing text and producing powerful representations for tasks like retrieval, classification, or clustering.
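To make the contrast concrete, here is a minimal sketch using Hugging Face transformers pipelines with the classic public checkpoints (bert-base-uncased and gpt2, named purely for illustration, not NeoBERT itself): the encoder fills in a masked token using context from both sides, while the decoder generates a continuation left to right.

```python
# Illustrative contrast between an encoder and a decoder, using the classic
# public checkpoints (not NeoBERT).
from transformers import pipeline

# Encoder: predicts the masked word from context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# Decoder: predicts the next tokens left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=5))
```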

Decoder models have become very popular with the rise of generative AI, but comparatively little work has gone into improving encoder models. Many researchers still use the original versions of BERT or RoBERTa, even though their performance falls short of modern standards and their knowledge is now outdated: if you asked BERT who the current head of state of the United Kingdom is, it would probably answer “Queen Elizabeth II”!

Small model, big results

Encoder models are still widely used by both academic groups and industry practitioners as strong alternatives to decoders for certain tasks like text representation. They are far more efficient than decoders and even demonstrate impressive in-context learning abilities.

This is why we brought the research advances developed for decoders over to encoders, bringing these models up to speed, and open-sourced the results to benefit everyone. NeoBERT is not alone in that quest: ModernBERT was announced around the same time, but under identical fine-tuning strategies, our model performs better on real-world, large-scale benchmarks.

Instead of training a larger model, we decided to build a small, compact 250M-parameter model and train it for longer, since we know that these models keep learning as training continues.

For users, this means the model is less costly to run without sacrificing accuracy or performance, because we put the effort into training rather than relying solely on model scale. In other words, we put our resources where they mattered most and optimized the model for end users.

Increased accuracy 

We trained on RefinedWeb, an enormous, diverse, and high-quality open-source dataset, to improve the model's robustness.

Improved architecture

Building on the latest literature, we've optimized our model's shape and incorporated state-of-the-art architecture improvements for peak performance.
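As a flavour of the kind of architectural change adopted in recent models, here is a minimal sketch of a SwiGLU feed-forward block, shown purely for illustration; see the paper for NeoBERT's exact recipe.

```python
# Generic sketch of a SwiGLU feed-forward block, an architectural component
# common in recent transformers (illustrative only; see the NeoBERT paper
# for the exact architecture choices).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit: down(silu(gate(x)) * up(x))
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLU(dim=768, hidden_dim=2048)
x = torch.randn(2, 16, 768)
print(block(x).shape)  # torch.Size([2, 16, 768])
```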

Trained for efficiency

We used modern techniques like FlashAttention, DeepSpeed, and unpadding to make training and fine-tuning fast and efficient. Our model is up to date, more efficient, performs better on benchmarks, and can handle sequences of up to 4096 tokens.
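As a rough illustration of what FlashAttention buys you (a generic sketch, not NeoBERT's training code), PyTorch's scaled_dot_product_attention dispatches to a fused FlashAttention kernel when the hardware and data types allow it, which avoids materializing the full 4096×4096 attention matrix.

```python
# Generic sketch (not NeoBERT's actual code): F.scaled_dot_product_attention
# dispatches to a fused FlashAttention kernel on a GPU with fp16/bf16 inputs,
# and falls back to a standard implementation otherwise.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

batch, heads, seq_len, head_dim = 2, 12, 4096, 64
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: computes softmax(QK^T / sqrt(d)) V without materializing
# the full seq_len x seq_len attention matrix.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 12, 4096, 64])
```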

The best thing about NeoBERT? It’s free, open-source, and plug-and-play: since NeoBERT has the same hidden size as BERT base, developers can swap it into their projects without making any changes to their architecture.
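Below is a minimal sketch of such a swap with the Hugging Face transformers library, assuming the model is published under the identifier "chandar-lab/NeoBERT" and loads with trust_remote_code; check the model card for the exact usage.

```python
# Minimal sketch of a drop-in swap. The model ID "chandar-lab/NeoBERT" and
# the trust_remote_code flag are assumptions; see the model card for details.
from transformers import AutoModel, AutoTokenizer

model_id = "chandar-lab/NeoBERT"  # previously: "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("NeoBERT is a drop-in replacement for BERT base.", return_tensors="pt")
outputs = model(**inputs)

# The hidden size matches BERT base (768), so downstream heads such as
# classifiers or retrieval projections keep their shapes.
# (Attribute name follows the standard transformers convention; it may differ
# for the custom model class.)
embeddings = outputs.last_hidden_state  # (batch, seq_len, 768)
```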

The code, training pipeline, data pipeline, models, and checkpoints will all be open-sourced so that the whole scientific community can benefit from them, build on top of them, and come up with even better models in the future.

You can find NeoBERT here. The paper can be found here.