Enhancing Sentiment Analysis with ModernBERT
Since its introduction in 2018, BERT has transformed Natural Language Processing. It performs well in tasks like sentiment analysis, question answering, and language inference. Using bidirectional training and transformer-based self-attention, BERT introduced a new way to understand relationships between words in text. However, despite its success, BERT has limitations. It struggles with computational efficiency, handling longer texts, and providing interpretability. This led to the development of ModernBERT, a model designed to address these challenges. ModernBERT improves processing speed, handles longer texts better, and offers more transparency for developers. In this article, we’ll explore how to use ModernBERT for sentiment analysis, highlighting its features and improvements over BERT.
Learning Objectives
- Brief introduction to BERT and why ModernBERT came into existence
- Understand the features of ModernBERT
- How to implement ModernBERT in practice through a sentiment analysis example
- Limitations of ModernBERT
This article was published as a part of the Data Science Blogathon.
Table of contents
- What is BERT?
- What is ModernBERT?
- BERT vs ModernBERT
- Understanding the Features of ModernBERT
- Sentiment Analysis Using ModernBERT
- Limitations of ModernBERT
- Conclusion
- Frequently Asked Questions
What is BERT?
BERT, which stands for Bidirectional Encoder Representations from Transformers, has been a game-changer since its introduction by Google in 2018. BERT introduced the concept of bidirectional training, which allows the model to understand context by looking at surrounding words in all directions. This led to significantly better performance on a number of NLP tasks, including question answering, sentiment analysis, and language inference. BERT’s architecture is based on encoder-only transformers, which use self-attention mechanisms to weigh the influence of different words in a sentence. Being encoder-only, these models understand and encode input but do not reconstruct or generate output. This makes BERT excellent at capturing contextual relationships in text, and one of the most powerful and widely adopted NLP models in recent years.
What is ModernBERT?
Despite the groundbreaking success of BERT, it has certain limitations. Some of them are:
- Computational Resources: BERT is a computationally expensive, memory-intensive model, which is constraining for real-time applications or for setups without access to powerful computing infrastructure.
- Context Length: BERT has a fixed-length context window, which becomes a limitation when handling long-range inputs like lengthy documents.
- Interpretability: The model’s complexity makes it less interpretable than simpler models, leading to challenges in debugging and modifying the model.
- Common Sense Reasoning: BERT lacks common sense reasoning and struggles to understand context, nuance, and logical relationships beyond the given information.
BERT vs ModernBERT
| BERT | ModernBERT |
|------|------------|
| It has fixed positional embeddings | It uses Rotary Positional Embeddings (RoPE) |
| Standard self-attention | Flash Attention for improved efficiency |
| It has fixed-length context windows | It supports longer contexts with Local-Global Alternating Attention |
| Complex and less interpretable | Improved interpretability |
| Primarily trained on English text | Primarily trained on English and code data |
ModernBERT addresses these limitations by incorporating more efficient algorithms such as Flash Attention and Local-Global Alternating Attention, which optimize memory usage and improve processing speed. Additionally, ModernBERT handles longer context lengths more effectively by integrating techniques like Rotary Positional Embeddings (RoPE).
It also aims to be more transparent and user-friendly, making it easier for developers to debug and adapt the model to specific tasks. Furthermore, ModernBERT incorporates advancements in common sense reasoning, allowing it to better understand context, nuance, and logical relationships beyond the explicit information provided. It runs well on common GPUs like the NVIDIA T4, A100, and RTX 4090.
ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion unique tokens, unlike the 20-40 repetitions of the same data common in previous encoders.
It is released in the following sizes:
- ModernBERT-base which has 22 layers and 149 million parameters
- ModernBERT-large which has 28 layers and 395 million parameters
Understanding the Features of ModernBERT
Some of the unique features of ModernBERT are:
Flash Attention
Flash Attention is a newer algorithm developed to speed up the attention mechanism of transformer models in terms of both time and memory usage. It rearranges the attention computation using tiling and recomputation: tiling breaks large inputs into manageable chunks, and recomputation reduces memory usage by recalculating intermediate results as needed. This cuts the quadratic memory usage of attention down to linear, making it much more efficient for long sequences and reducing computational overhead. It is roughly 2-4x faster than traditional attention mechanisms, and is used to speed up both training and inference of transformer models.
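As an optional, concrete illustration (not something the later tutorial steps depend on), recent Hugging Face Transformers versions let you request the Flash Attention 2 kernel through the `attn_implementation` argument; this assumes the `flash-attn` package is installed and a supported GPU is available.

```python
from transformers import AutoModelForSequenceClassification

# Sketch: load ModernBERT with the Flash Attention 2 kernel enabled
# (requires the flash-attn package and a compatible GPU)
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype="auto",
)
```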
Local-Global Alternating Attention
One of the most novel features of ModernBERT is Local-Global Alternating Attention, used in place of full global attention (a toy sketch of the schedule follows this list).
- Only every third layer attends to the full input. This is global attention.
- All other layers use a sliding window in which every token attends only to its nearest 128 tokens. This is local attention.
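Below is a toy sketch of the schedule described above, purely for illustration; it is not ModernBERT's actual implementation, and the exact layer indexing is an assumption.

```python
# Toy illustration of the alternating attention schedule (assumed indexing)
GLOBAL_EVERY = 3    # every third layer attends to the full input
LOCAL_WINDOW = 128  # remaining layers attend to a 128-token sliding window

def attention_type(layer_idx: int) -> str:
    """Return which attention a layer would use under this scheme."""
    if layer_idx % GLOBAL_EVERY == 0:
        return "global"
    return f"local (window={LOCAL_WINDOW})"

for layer in range(6):
    print(f"layer {layer}: {attention_type(layer)}")
# layer 0: global, layers 1-2: local, layer 3: global, ...
```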
Rotary Positional Embeddings (RoPE)
Rotary Positional Embeddings (RoPE) is a transformer technique that encodes the position of tokens in a sequence using rotation matrices. It captures both absolute and relative positional information, adjusting the attention mechanism to understand the order of and distance between tokens.
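Here is a minimal numerical sketch of the rotation idea in two dimensions, simplified for illustration rather than ModernBERT's actual RoPE implementation: each token's vector is rotated by an angle proportional to its position, so the dot product between a query and a key depends only on their relative distance.

```python
import numpy as np

def rotate_pair(vec: np.ndarray, position: int, theta: float = 0.1) -> np.ndarray:
    """Rotate a 2-D vector by an angle proportional to its position in the sequence."""
    angle = position * theta
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return rotation @ vec

q = np.array([1.0, 0.5])
k = np.array([0.3, 0.8])

# Attention scores depend only on the relative distance (5 - 2 == 8 - 5)
score_a = rotate_pair(q, 5) @ rotate_pair(k, 2)
score_b = rotate_pair(q, 8) @ rotate_pair(k, 5)
print(np.isclose(score_a, score_b))  # True
```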
Unpadding and Sequence Packing
Unpadding and sequence packing are techniques designed to optimize memory and computational efficiency.
- Usually, padding is used to equalize sequence lengths in a batch: the longest sequence sets the length, and shorter sequences are filled with meaningless padding tokens, which wastes computation on those tokens. Unpadding removes unnecessary padding tokens from sequences, reducing wasted computation (see the sketch after this list).
- Sequence Packing reorganizes batches of text into compact forms, grouping shorter sequences together to maximize hardware utilization.
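As a conceptual sketch of unpadding (not the library's internal implementation), the attention mask marks the real tokens, so padded positions can simply be dropped before computation; the token ids below are arbitrary illustrative values.

```python
import torch

# Toy padded batch: 0 is the padding token id (illustrative values)
input_ids = torch.tensor([
    [101, 2023, 3185, 102,   0,   0],  # 4 real tokens + 2 pad
    [101, 2307,  102,   0,   0,   0],  # 3 real tokens + 3 pad
])
attention_mask = (input_ids != 0).long()

# Unpadding: keep only the real tokens, flattened into one packed sequence
unpadded = input_ids[attention_mask.bool()]
print(unpadded)          # tensor([ 101, 2023, 3185,  102,  101, 2307,  102])
print(unpadded.numel())  # 7 tokens processed instead of 12
```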
Sentiment Analysis Using ModernBERT
Let’s implement sentiment analysis with ModernBERT in practice. Sentiment analysis is a specific type of text classification task that aims to classify text (e.g., reviews) as positive or negative.
We will use the IMDb movie reviews dataset to classify reviews as expressing either positive or negative sentiment.
Note:
- I used an A100 GPU on Google Colab for faster processing. For more information, refer to: answerdotai/ModernBERT-base.
- The training process needs a wandb API key. You can create one via Weights & Biases.
Step 1: Install Necessary Libraries
Install the libraries needed to work with Hugging Face Transformers.
```python
# Install libraries
!pip install git+https://github.com/huggingface/transformers.git datasets accelerate scikit-learn -Uqq
!pip install -U "transformers>=4.48.0"

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForMaskedLM,
    AutoConfig,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
```
Step 2: Load the IMDb Dataset Using load_dataset Function
The command imdb["test"][0] will print the first sample in the test split of the IMDb movie review dataset, i.e., the first test review along with its associated label.
```python
# Load the dataset
from datasets import load_dataset

imdb = load_dataset("imdb")

# Print the first test sample
imdb["test"][0]
```
Step 3: Tokenization
Tokenize the dataset using the pre-trained ModernBERT-base tokenizer. This process converts text into numerical inputs suitable for the model. The command "tokenized_test_dataset[0]" will print the first sample of the tokenized test dataset, including tokenized inputs such as input IDs and labels.
```python
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Define the tokenizer function
def tokenizer_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  # max length can be modified
        return_tensors="pt",
    )

# Tokenize the training and testing datasets with the tokenizer function defined above
tokenized_train_dataset = imdb["train"].map(tokenizer_function, batched=True)
tokenized_test_dataset = imdb["test"].map(tokenizer_function, batched=True)

# Print the tokenized output of the first test sample
print(tokenized_test_dataset[0])
```
Step 4: Initialize the ModernBERT-base Model for Sentiment Classification
```python
# Initialize the ModernBERT-base configuration and a sequence classification model
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_config(config)
```
Step 5: Prepare the Datasets
Prepare the datasets by renaming the sentiment label column (label) to 'labels' and removing unnecessary columns.
```python
# Data preparation step
train_dataset = tokenized_train_dataset.remove_columns(['text']).rename_column('label', 'labels')
test_dataset = tokenized_test_dataset.remove_columns(['text']).rename_column('label', 'labels')
```
Step 6: Define Compute Metrics
Let’s use the F1 score as the metric to evaluate our model. We will define a function that processes the evaluation predictions and calculates their F1 score. This lets us compare the model’s predictions with the true labels.
```python
import numpy as np
from sklearn.metrics import f1_score

# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    score = f1_score(
        labels, predictions, labels=labels, pos_label=1, average="weighted"
    )
    return {"f1": float(score) if score == 1 else score}
```
Step 7: Set the Training Arguments
Define the hyperparameters and other configurations for fine-tuning the model using Hugging Face’s TrainingArguments. Let us understand some arguments:
- train_bsz, val_bsz: Indicates batch size for training and validation. Batch size determines the number of samples processed before the model’s internal parameters are updated.
- lr: Learning rate controls the adjustment of the model’s weights with respect to the loss gradient.
- betas: These are the beta parameters for the Adam optimizer.
- n_epochs: Number of epochs, indicating a complete pass through the entire training dataset.
- eps: A small constant added to the denominator to improve numerical stability in the Adam optimizer.
- wd: Stands for weight decay, a regularization technique to prevent overfitting by penalizing large weights.
```python
# Define training arguments
train_bsz, val_bsz = 32, 32
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

training_args = TrainingArguments(
    output_dir="fine_tuned_modern_bert",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    weight_decay=wd,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)
```
Step 8: Model Training
Use the Trainer class to perform the model training and evaluation process.
```python
# Create a Trainer instance
trainer = Trainer(
    model=model,                      # The model to fine-tune
    args=training_args,               # Training arguments
    train_dataset=train_dataset,      # Tokenized training dataset
    eval_dataset=test_dataset,        # Tokenized test dataset
    compute_metrics=compute_metrics,  # Without this, the F1 score won't appear in the output
)
```
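With the trainer configured, a single call starts fine-tuning. This is a minimal sketch of the standard Trainer usage:

```python
# Start fine-tuning
trainer.train()
```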
Step 9: Evaluation
Evaluate the trained model on the test dataset.
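A minimal evaluation sketch, assuming the trainer and test_dataset defined above; trainer.evaluate() reports the metrics from compute_metrics, including the weighted F1 score.

```python
# Evaluate the fine-tuned model on the test split
eval_results = trainer.evaluate(eval_dataset=test_dataset)
print(eval_results)  # includes eval_loss and eval_f1
```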
Step 10: Save the Fine-tuned Model
Save the fine-tuned model and tokenizer for later re-use.
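A minimal sketch using the standard Hugging Face save APIs; the directory name below is a hypothetical placeholder.

```python
# Save the fine-tuned model and tokenizer for later re-use
save_dir = "fine_tuned_modern_bert_imdb"  # hypothetical output directory
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
```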
Step 11: Predict the Sentiment of the Review
Here, 0 indicates a negative review and 1 indicates a positive review. For the new examples below, the expected output is [0, 1]: "boring" signals a negative review (0), while "spectacular" signals a positive one, so 1 is output.
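A minimal inference sketch built around the "boring"/"spectacular" examples described above; the review texts are illustrative placeholders, and this is standard Transformers usage rather than the author's exact listing.

```python
import torch

# Predict the sentiment of new reviews (0 = negative, 1 = positive)
new_reviews = [
    "The movie was so boring that I left halfway through.",       # expected: 0
    "A spectacular film with brilliant performances throughout.",  # expected: 1
]

inputs = tokenizer(
    new_reviews, padding=True, truncation=True, max_length=512, return_tensors="pt"
).to(model.device)

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=-1)
print(predictions.tolist())  # expected: [0, 1]
```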
Limitations of ModernBERT
While ModernBERT brings several improvements over traditional BERT, it still has some limitations:
- Training Data Bias: ModernBERT is trained mainly on English and code data, so it may not perform as efficiently on other languages or non-code text.
- Complexity: The architectural enhancements and new techniques like Flash Attention and Rotary Positional Embeddings add complexity to the model, which can make it harder to implement and fine-tune for specific tasks.
- Inference Speed: While Flash Attention improves inference speed, using the full 8,192-token context window can still be slower than working with shorter inputs.
Conclusion
ModernBERT takes BERT’s foundation and improves it with faster processing, better handling of long texts, and enhanced interpretability. While it still faces challenges like training data bias and complexity, it represents a significant leap in NLP. ModernBERT opens new possibilities for tasks like sentiment analysis and text classification, making advanced language understanding more efficient and accessible.
Key Takeaways
- ModernBERT improves on BERT by fixing issues like inefficiency and limited context handling.
- It uses Flash Attention and Rotary Positional Embeddings for faster processing and longer text support.
- ModernBERT is great for tasks like sentiment analysis and text classification.
- It still has some limitations, like bias toward English and code data.
- Tools like Hugging Face and wandb make it easy to implement and use.
References:
- ModernBERT Blog
- ModernBERT Documentation
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Frequently Asked Questions
Q1. What are encoder-only architectures? Ans. Encoder-only architectures process input sequences without generating output sequences, focusing on understanding and encoding the input.
Q2. What are the limitations of BERT? Ans. Some limitations of BERT include high computational resource requirements, fixed context length, inefficiency, complexity, and a lack of common sense reasoning.
Q3. What is an attention mechanism? Ans. An attention mechanism is a technique that allows the model to focus on specific parts of the input to determine which parts are more or less important.
Q4. What is alternating attention? Ans. This mechanism alternates between focusing on local and global contexts within text sequences. Local attention highlights adjacent words or phrases, collecting fine-grained information, whereas global attention recognises overall patterns and relationships across the text.
Q5. What are Rotary Positional Embeddings? How are they different from fixed positional embeddings? Ans. In contrast to fixed positional embeddings, which only capture absolute positions, Rotary Positional Embeddings (RoPE) use rotation matrices to encode both absolute and relative positions. RoPE performs better with extended sequences.
Q6. What are the potential applications of ModernBERT? Ans. Some applications of ModernBERT are in areas such as text classification, sentiment analysis, question answering, named-entity recognition, legal text analysis, and code understanding.
Q7. What is the wandb API and why is it needed? Ans. Weights & Biases (W&B) is a platform for tracking, visualizing, and sharing ML experiments. It helps track model metrics such as accuracy, visualize experiment data, tune hyperparameters, keep track of model versions, and share results.