Enhancing Sentiment Analysis with ModernBERT
Since its introduction in 2018, BERT has transformed Natural Language Processing. It performs well in tasks like sentiment analysis, question answering, and language inference. Using bidirectional training and transformer-based self-attention, BERT introduced a new way to understand relationships between words in text. However, despite its success, BERT has limitations. It struggles with computational efficiency, handling longer texts, and providing interpretability. This led to the development of ModernBERT, a model designed to address these challenges. ModernBERT improves processing speed, handles longer texts better, and offers more transparency for developers. In this article, we’ll explore how to use ModernBERT for sentiment analysis, highlighting its features and improvements over BERT.
Learning Objectives
- Brief introduction to BERT and why ModernBERT came into existence
- Understand the features of ModernBERT
- How to implement ModernBERT in practice through a sentiment analysis example
- Limitations of ModernBERT
This article was published as a part of the Data Science Blogathon.
Table of contents
- What is BERT?
- What is ModernBERT?
- BERT vs ModernBERT
- Understanding the Features of ModernBERT
- Sentiment Analysis Using ModernBERT
- Limitations of ModernBERT
- Conclusion
- Frequently Asked Questions
What is BERT?
BERT, which stands for Bidirectional Encoder Representations from Transformers, has been a game-changer since its introduction by Google in 2018. BERT introduced the concept of bidirectional training, which allows the model to understand context by looking at surrounding words in all directions. This led to significantly better performance on a number of NLP tasks, including question answering, sentiment analysis, and language inference. BERT’s architecture is based on encoder-only transformers, which use self-attention mechanisms to weigh the influence of different words in a sentence. Being encoder-only, these models understand and encode input but do not reconstruct or generate output. This makes BERT excellent at capturing contextual relationships in text, and one of the most powerful and widely adopted NLP models in recent years.
What is ModernBERT?
Despite the groundbreaking success of BERT, it has certain limitations. Some of them are:
- Computational Resources: BERT is a computationally expensive, memory-intensive model, which is constraining for real-time applications or for setups without access to powerful computing infrastructure.
- Context Length: BERT has a fixed-length context window, which becomes a limitation when handling long-range inputs like lengthy documents.
- Interpretability: The model’s complexity makes it less interpretable than simpler models, leading to challenges in debugging and modifying the model.
- Common Sense Reasoning: BERT lacks common sense reasoning and struggles to understand context, nuance, and logical relationships beyond the given information.
BERT vs ModernBERT
| BERT | ModernBERT |
|------|------------|
| It has fixed positional embeddings | It uses Rotary Positional Embeddings (RoPE) |
| Standard self-attention | Flash Attention for improved efficiency |
| It has fixed-length context windows | It supports longer contexts with Local-Global Alternating Attention |
| Complex and less interpretable | Improved interpretability |
| Primarily trained on English text | Primarily trained on English and code data |
ModernBERT addresses these limitations by incorporating more efficient algorithms such as Flash Attention and Local-Global Alternating Attention, which optimize memory usage and improve processing speed. Additionally, ModernBERT handles longer context lengths more effectively by integrating techniques like Rotary Positional Embeddings (RoPE).
It also aims to be more transparent and user-friendly, making it easier for developers to debug and adapt the model to specific tasks. Furthermore, ModernBERT incorporates advancements in common sense reasoning, allowing it to better understand context, nuance, and logical relationships beyond the explicit information provided. It runs well on common GPUs like the NVIDIA T4, A100, and RTX 4090.
ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion unique tokens, unlike the 20-40 repetitions of the same data common in previous encoders.
It is released in the following sizes:
- ModernBERT-base which has 22 layers and 149 million parameters
- ModernBERT-large which has 28 layers and 395 million parameters
Understanding the Features of ModernBERT
Some of the unique features of ModernBERT are:
Flash Attention
Flash Attention is a newer algorithm developed to speed up the attention mechanism of transformer models in terms of both time and memory usage. It rearranges the attention computation using tiling and recomputation: tiling breaks large inputs into manageable chunks, and recomputation reduces memory usage by recalculating intermediate results as needed. This cuts the quadratic memory usage of attention down to linear, making it much more efficient for long sequences and reducing computational overhead. It is roughly 2-4x faster than traditional attention mechanisms, and is used to speed up both training and inference of transformer models.
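As an optional, concrete illustration (not something the later tutorial steps depend on), recent Hugging Face Transformers versions let you request the Flash Attention 2 kernel through the `attn_implementation` argument; this assumes the `flash-attn` package is installed and a supported GPU is available.

```python
from transformers import AutoModelForSequenceClassification

# Sketch: load ModernBERT with the Flash Attention 2 kernel enabled
# (requires the flash-attn package and a compatible GPU)
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype="auto",
)
```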
Local-Global Alternating Attention
One of the most novel features of ModernBERT is Local-Global Alternating Attention, used in place of full global attention (a toy sketch of the schedule follows this list).
- Only every third layer attends to the full input. This is global attention.
- All other layers use a sliding window in which every token attends only to its nearest 128 tokens. This is local attention.
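Below is a toy sketch of the schedule described above, purely for illustration; it is not ModernBERT's actual implementation, and the exact layer indexing is an assumption.

```python
# Toy illustration of the alternating attention schedule (assumed indexing)
GLOBAL_EVERY = 3    # every third layer attends to the full input
LOCAL_WINDOW = 128  # remaining layers attend to a 128-token sliding window

def attention_type(layer_idx: int) -> str:
    """Return which attention a layer would use under this scheme."""
    if layer_idx % GLOBAL_EVERY == 0:
        return "global"
    return f"local (window={LOCAL_WINDOW})"

for layer in range(6):
    print(f"layer {layer}: {attention_type(layer)}")
# layer 0: global, layers 1-2: local, layer 3: global, ...
```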
Rotary Positional Embeddings (RoPE)
Rotary Positional Embeddings (RoPE) is a transformer technique that encodes the position of tokens in a sequence using rotation matrices. It captures both absolute and relative positional information, adjusting the attention mechanism to understand the order of and distance between tokens.
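Here is a minimal numerical sketch of the rotation idea in two dimensions, simplified for illustration rather than ModernBERT's actual RoPE implementation: each token's vector is rotated by an angle proportional to its position, so the dot product between a query and a key depends only on their relative distance.

```python
import numpy as np

def rotate_pair(vec: np.ndarray, position: int, theta: float = 0.1) -> np.ndarray:
    """Rotate a 2-D vector by an angle proportional to its position in the sequence."""
    angle = position * theta
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return rotation @ vec

q = np.array([1.0, 0.5])
k = np.array([0.3, 0.8])

# Attention scores depend only on the relative distance (5 - 2 == 8 - 5)
score_a = rotate_pair(q, 5) @ rotate_pair(k, 2)
score_b = rotate_pair(q, 8) @ rotate_pair(k, 5)
print(np.isclose(score_a, score_b))  # True
```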
Unpadding and Sequence Packing
Unpadding and sequence packing are techniques designed to optimize memory and computational efficiency.
- Usually, padding is used to equalize sequence lengths in a batch: the longest sequence sets the length, and shorter sequences are filled with meaningless padding tokens, which wastes computation on those tokens. Unpadding removes unnecessary padding tokens from sequences, reducing wasted computation (see the sketch after this list).
- Sequence Packing reorganizes batches of text into compact forms, grouping shorter sequences together to maximize hardware utilization.
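As a conceptual sketch of unpadding (not the library's internal implementation), the attention mask marks the real tokens, so padded positions can simply be dropped before computation; the token ids below are arbitrary illustrative values.

```python
import torch

# Toy padded batch: 0 is the padding token id (illustrative values)
input_ids = torch.tensor([
    [101, 2023, 3185, 102,   0,   0],  # 4 real tokens + 2 pad
    [101, 2307,  102,   0,   0,   0],  # 3 real tokens + 3 pad
])
attention_mask = (input_ids != 0).long()

# Unpadding: keep only the real tokens, flattened into one packed sequence
unpadded = input_ids[attention_mask.bool()]
print(unpadded)          # tensor([ 101, 2023, 3185,  102,  101, 2307,  102])
print(unpadded.numel())  # 7 tokens processed instead of 12
```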
Sentiment Analysis Using ModernBERT
Let’s implement sentiment analysis with ModernBERT in practice. Sentiment analysis is a specific type of text classification task that aims to classify text (e.g., reviews) as positive or negative.
We will use the IMDb movie reviews dataset to classify reviews as expressing either positive or negative sentiment.
Note:
- I used an A100 GPU on Google Colab for faster processing. For more information, refer to: answerdotai/ModernBERT-base.
- The training process needs a wandb API key. You can create one via Weights & Biases.
Step 1: Install Necessary Libraries
Install the libraries needed to work with Hugging Face Transformers.
```python
# Install libraries
!pip install git+https://github.com/huggingface/transformers.git datasets accelerate scikit-learn -Uqq
!pip install -U "transformers>=4.48.0"

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForMaskedLM,
    AutoConfig,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
```
Step 2: Load the IMDb Dataset Using load_dataset Function
The command imdb["test"][0] will print the first sample in the test split of the IMDb movie review dataset, i.e., the first test review along with its associated label.
```python
# Load the dataset
from datasets import load_dataset

imdb = load_dataset("imdb")

# Print the first test sample
imdb["test"][0]
```
Step 3: Tokenization
Tokenize the dataset using the pre-trained ModernBERT-base tokenizer. This process converts text into numerical inputs suitable for the model. The command "tokenized_test_dataset[0]" will print the first sample of the tokenized test dataset, including tokenized inputs such as input IDs and labels.
```python
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Define the tokenizer function
def tokenizer_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  # max length can be modified
        return_tensors="pt",
    )

# Tokenize the training and testing datasets with the tokenizer function defined above
tokenized_train_dataset = imdb["train"].map(tokenizer_function, batched=True)
tokenized_test_dataset = imdb["test"].map(tokenizer_function, batched=True)

# Print the tokenized output of the first test sample
print(tokenized_test_dataset[0])
```
Step 4: Initialize the ModernBERT-base Model for Sentiment Classification
```python
# Initialize the ModernBERT-base configuration and a sequence classification model
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_config(config)
```
Step 5: Prepare the Datasets
Prepare the datasets by renaming the sentiment label column (label) to 'labels' and removing unnecessary columns.
```python
# Data preparation step
train_dataset = tokenized_train_dataset.remove_columns(['text']).rename_column('label', 'labels')
test_dataset = tokenized_test_dataset.remove_columns(['text']).rename_column('label', 'labels')
```
Step 6: Define Compute Metrics
Let’s use the F1 score as the metric to evaluate our model. We will define a function that processes the evaluation predictions and calculates their F1 score. This lets us compare the model’s predictions with the true labels.
```python
import numpy as np
from sklearn.metrics import f1_score

# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    score = f1_score(
        labels, predictions, labels=labels, pos_label=1, average="weighted"
    )
    return {"f1": float(score) if score == 1 else score}
```
Step 7: Set the Training Arguments
Define the hyperparameters and other configurations for fine-tuning the model using Hugging Face’s TrainingArguments. Let us understand some arguments:
- train_bsz, val_bsz: Indicates batch size for training and validation. Batch size determines the number of samples processed before the model’s internal parameters are updated.
- lr: Learning rate controls the adjustment of the model’s weights with respect to the loss gradient.
- betas: These are the beta parameters for the Adam optimizer.
- n_epochs: Number of epochs, indicating a complete pass through the entire training dataset.
- eps: A small constant added to the denominator to improve numerical stability in the Adam optimizer.
- wd: Stands for weight decay, a regularization technique to prevent overfitting by penalizing large weights.
```python
# Define training arguments
train_bsz, val_bsz = 32, 32
lr = 8e-5
betas = (0.9, 0.98)
n_epochs = 2
eps = 1e-6
wd = 8e-6

training_args = TrainingArguments(
    output_dir="fine_tuned_modern_bert",
    learning_rate=lr,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=val_bsz,
    num_train_epochs=n_epochs,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    adam_beta1=betas[0],
    adam_beta2=betas[1],
    adam_epsilon=eps,
    weight_decay=wd,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,
    bf16_full_eval=True,
    push_to_hub=False,
)
```
Step 8: Model Training
Use the Trainer class to perform the model training and evaluation process.
```python
# Create a Trainer instance
trainer = Trainer(
    model=model,                      # The model to fine-tune
    args=training_args,               # Training arguments
    train_dataset=train_dataset,      # Tokenized training dataset
    eval_dataset=test_dataset,        # Tokenized test dataset
    compute_metrics=compute_metrics,  # Without this, the F1 score won't appear in the output
)
```
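With the trainer configured, a single call starts fine-tuning. This is a minimal sketch of the standard Trainer usage:

```python
# Start fine-tuning
trainer.train()
```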
Step 9: Evaluation
Evaluate the trained model on the test dataset.
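A minimal evaluation sketch, assuming the trainer and test_dataset defined above; trainer.evaluate() reports the metrics from compute_metrics, including the weighted F1 score.

```python
# Evaluate the fine-tuned model on the test split
eval_results = trainer.evaluate(eval_dataset=test_dataset)
print(eval_results)  # includes eval_loss and eval_f1
```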
Step 10: Save the Fine-tuned Model
Save the fine-tuned model and tokenizer for later re-use.
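A minimal sketch using the standard Hugging Face save APIs; the directory name below is a hypothetical placeholder.

```python
# Save the fine-tuned model and tokenizer for later re-use
save_dir = "fine_tuned_modern_bert_imdb"  # hypothetical output directory
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
```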
Step 11: Predict the Sentiment of the Review
Here, 0 indicates a negative review and 1 indicates a positive review. For the new examples below, the expected output is [0, 1]: "boring" signals a negative review (0), while "spectacular" signals a positive one, so 1 is output.
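A minimal inference sketch built around the "boring"/"spectacular" examples described above; the review texts are illustrative placeholders, and this is standard Transformers usage rather than the author's exact listing.

```python
import torch

# Predict the sentiment of new reviews (0 = negative, 1 = positive)
new_reviews = [
    "The movie was so boring that I left halfway through.",       # expected: 0
    "A spectacular film with brilliant performances throughout.",  # expected: 1
]

inputs = tokenizer(
    new_reviews, padding=True, truncation=True, max_length=512, return_tensors="pt"
).to(model.device)

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=-1)
print(predictions.tolist())  # expected: [0, 1]
```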
Limitations of ModernBERT
While ModernBERT brings several improvements over traditional BERT, it still has some limitations:
- Training Data Bias: ModernBERT is trained mainly on English and code data, so it may not perform as efficiently on other languages or non-code text.
- Complexity: The architectural enhancements and new techniques like Flash Attention and Rotary Positional Embeddings add complexity to the model, which can make it harder to implement and fine-tune for specific tasks.
- Inference Speed: While Flash Attention improves inference speed, using the full 8,192-token context window can still be slower than working with shorter inputs.
Conclusion
ModernBERT takes BERT’s foundation and improves it with faster processing, better handling of long texts, and enhanced interpretability. While it still faces challenges like training data bias and complexity, it represents a significant leap in NLP. ModernBERT opens new possibilities for tasks like sentiment analysis and text classification, making advanced language understanding more efficient and accessible.
Key Takeaways
- ModernBERT improves on BERT by fixing issues like inefficiency and limited context handling.
- It uses Flash Attention and Rotary Positional Embeddings for faster processing and longer text support.
- ModernBERT is great for tasks like sentiment analysis and text classification.
- It still has some limitations, like bias toward English and code data.
- Tools like Hugging Face and wandb make it easy to implement and use.
References:
- ModernBERT Blog
- ModernBERT Documentation
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Frequently Asked Questions
Q1. What are encoder-only architectures? Ans. Encoder-only architectures process input sequences without generating output sequences, focusing on understanding and encoding the input.
Q2. What are the limitations of BERT? Ans. Some limitations of BERT include high computational resource requirements, fixed context length, inefficiency, complexity, and a lack of common sense reasoning.
Q3. What is an attention mechanism? Ans. An attention mechanism is a technique that allows the model to focus on specific parts of the input to determine which parts are more or less important.
Q4. What is alternating attention? Ans. This mechanism alternates between focusing on local and global contexts within text sequences. Local attention highlights adjacent words or phrases, collecting fine-grained information, whereas global attention recognises overall patterns and relationships across the text.
Q5. What are Rotary Positional Embeddings? How are they different from fixed positional embeddings? Ans. In contrast to fixed positional embeddings, which only capture absolute positions, Rotary Positional Embeddings (RoPE) use rotation matrices to encode both absolute and relative positions. RoPE performs better with extended sequences.
Q6. What are the potential applications of ModernBERT? Ans. Some applications of ModernBERT are in areas such as text classification, sentiment analysis, question answering, named-entity recognition, legal text analysis, and code understanding.
Q7. What is the wandb API and why is it needed? Ans. Weights & Biases (W&B) is a platform for tracking, visualizing, and sharing ML experiments. It helps track model metrics such as accuracy, visualize experiment data, tune hyperparameters, keep track of model versions, and share results.