Comparison of Gemini Embedding with Multilingual-e5-large & Jina

Christopher Nolan
Release: 2025-03-20 15:02:13

Gemini Embedding: Multilingual text embedding model under Google Gemini AI framework

Word embeddings are crucial for natural language processing (NLP) tasks in Hindi, such as machine translation, question answering, and information retrieval. They capture the semantic properties of words, enabling more accurate, context-aware NLP applications. Given the large number of Hindi speakers and the growing volume of Hindi-language content, high-quality embeddings are critical to improving NLP performance in Indic languages. Customized embeddings can address the distinctive linguistic characteristics and resource limitations of the Indic language family. The newly released Gemini Embedding model represents a significant advance in multilingual text embedding, leveraging Google's Gemini AI framework to achieve state-of-the-art performance in over 100 languages.

The Gemini Embedding model performs well on tasks such as classification, retrieval, and semantic search, offering improved efficiency and accuracy. By supporting longer inputs and higher-dimensional outputs, it provides richer text representations that suit a wide variety of applications.

Learning Objectives

  • Learn about Gemini Embedding and its integration with the Gemini LLM.
  • Follow a hands-on tutorial for retrieving Hindi documents using Gemini Embedding.
  • See a comparative analysis with Jina AI embeddings and Multilingual-e5-large.
  • Gain insights into multilingual text retrieval capabilities and applications.

This article was published as part of the Data Science Blog Marathon.

Table of contents

  • What is Gemini Embedding?
  • Key Features of Gemini Embedding
  • Gemini Embedding model architecture
  • Comparison with other multilingual embedding models
  • Retrieval using Gemini Embedding, compared with Jina AI embeddings and Multilingual-e5-large
    • Step 1. Install the necessary libraries
    • Step 2. Load the data
    • Step 3. Chunk the data
    • Step 4. Store the data in the vector database
    • Step 5. Query the database
    • Step 6. Compare with Jina AI Embedding
  • Comparison of embedding retrieval output
    • Explanation
  • Conclusion
  • Frequently Asked Questions

What is Gemini embedding?

In March 2025, Google released a new experimental Gemini Embedding text model (gemini-embedding-exp-03-07), available through the Gemini API.

This embedding model is derived from the Gemini model and is said to inherit Gemini's understanding of linguistic nuance and subtle context, which makes it suitable for a wide range of applications. It ranks first on the MTEB multilingual leaderboard.

Gemini Embedding represents text as dense vectors, mapping semantically similar inputs to vectors that lie close together in the vector space. It currently supports over 100 languages, and its embeddings can be used for a variety of tasks such as retrieval and classification.
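To make "close to each other in vector space" concrete, here is a minimal sketch that scores closeness with cosine similarity. The three-dimensional vectors are hypothetical stand-ins rather than real Gemini output, and cosine similarity is assumed here as the closeness measure; other distance functions are also common.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real Gemini vectors have up to 3072 dimensions
pregnancy_test = [0.9, 0.1, 0.2]
pregnancy_symptoms = [0.8, 0.2, 0.3]
cricket_score = [0.1, 0.9, 0.1]

related = cosine_similarity(pregnancy_test, pregnancy_symptoms)
unrelated = cosine_similarity(pregnancy_test, cricket_score)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

Texts on related topics should score near 1, while unrelated texts score noticeably lower.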

Key Features of Gemini Embedding

  • Strong multilingual capabilities : The model demonstrates outstanding performance in over 100 languages, not only in high-resource languages such as English but also in low-resource languages such as Assamese and Macedonian.
  • Up to 8K input tokens : The model can handle lengthy documents or complex queries without truncation, preserving context and meaning beyond what many existing embedding models allow.
  • 3K output dimensions : The model generates embeddings of up to 3072 dimensions and supports smaller dimensions such as 768 and 1536 for task-specific optimization.
  • Impressive performance : Gemini Embedding ranked first on the Massive Text Embedding Benchmark (MTEB) multilingual leaderboard, with a mean task score of 68.32, significantly surpassing its closest competitor.
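The smaller output sizes mentioned above can be pictured as truncating a full-length vector. This is a minimal sketch, assuming truncation means keeping the leading dimensions and rescaling to unit length (Matryoshka-style); `truncate_embedding` is a hypothetical helper, and the API may provide its own option for this.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if dim > len(vector):
        raise ValueError("dim cannot exceed the original dimension")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Toy 4-dimensional vector standing in for a 3072-dimensional embedding
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
print(len(short), short)
```

Rescaling keeps dot products comparable across truncated vectors, at the cost of discarding whatever the dropped dimensions encoded.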

Gemini Embedding model architecture

Gemini Embedding is built on the Transformer architecture and initialized from the Gemini LLM. This foundation gives the model a deep understanding of language structure and semantics. The model processes input sequences with bidirectional attention, so it can take the full context of a word or phrase into account when generating an embedding.

  1. The input sequence T is processed by M (a Transformer with bidirectional attention, initialized from Gemini), producing a sequence of token embeddings.
  2. To generate a single embedding representing all the information in the input, a pooling function is applied.
  3. Finally, linear projection is applied to scale the embedding to the target dimension, resulting in the final output embedding.
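Steps 2 and 3 above can be sketched with toy numbers. The sketch assumes mean pooling as the pooling function (the text only says "a pooling function") and uses a made-up projection matrix; the real model's weights and sizes are not given here.

```python
def mean_pool(token_embeddings):
    """Average the token vectors of one sequence into a single vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / n for j in range(dim)]

def linear_projection(vector, weights):
    """Multiply a vector by a weight matrix that has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

# Two tokens with hidden size 3 (hypothetical values)
token_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
# Hypothetical projection from 3 dimensions down to 2
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

pooled = mean_pool(token_embeddings)
embedding = linear_projection(pooled, projection)
print(pooled, embedding)
```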

Loss function : The Gemini Embedding model is trained with a noise-contrastive estimation (NCE) loss that uses in-batch negatives. The exact loss varies slightly depending on the training phase. In general, a training example consists of a query, a positive target, and (optionally) a hard negative target.
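The idea of in-batch negatives can be illustrated with a simplified contrastive loss. This is a sketch under stated assumptions, not the exact training objective: it scores pairs with a dot product, omits hard negatives, and treats `temperature` as a hypothetical parameter.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries, positives, temperature=1.0):
    """Average softmax cross-entropy where query i should match positive i.

    Every other positive in the batch acts as a negative for query i.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, p) / temperature for p in positives]
        max_s = max(scores)  # subtract the maximum for numerical stability
        log_denominator = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
        total += log_denominator - scores[i]
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong positive

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is lower when each query is most similar to its own positive, which is the behavior training pushes the model toward.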

Training strategies

  1. Pre-fine-tuning : In this stage, the model is trained on a large, diverse dataset of query-target pairs. This exposure adapts the parameters of the large language model to encoding tasks, laying the foundation for its adaptability.
  2. Fine-tuning : In the second stage, the model is trained on task-specific datasets of (query, positive, hard negative) triples. This stage uses smaller batch sizes and carefully curated datasets to improve performance on the target tasks.

Read Also: Gemini Embedding: Universal Embedding from Gemini

Comparison with other multilingual embedding models

We compare Hindi document retrieval using the newly released Gemini embeddings against Jina AI embeddings and Multilingual-e5-large embeddings. As the following table shows, Gemini Embedding and Jina AI embeddings both support a high maximum token count, allowing them to handle long documents or complex queries. Gemini embeddings also have a higher embedding dimension, which can capture more detailed semantic relationships between words and represent subtle differences in complex language patterns and meanings.

Model | Number of parameters | Embedding dimension | Maximum tokens | Number of languages | Matryoshka embedding
gemini-embedding-exp-03-07 | Unknown | 3072 | 8192 | 100 | Supports truncation of embeddings to various sizes, such as 2048, 1024, 512, 256, and 128 dimensions
jinaai/jina-embeddings-v3 | 572 million | 1024 | 8194 | 100 | Supports flexible embedding sizes (32, 64, 128, 256, 512, 768, 1024), allowing embeddings to be truncated to fit your application
multilingual-e5-large-instruct | 560 million | 1024 | 514 | 94 | NA

Retrieval using Gemini Embedding, compared with Jina AI embeddings and Multilingual-e5-large

In the following hands-on tutorial, we retrieve passages from a Hindi document using Gemini embeddings and then repeat the retrieval with Jina AI embeddings and Multilingual-e5-large embeddings.

Step 1. Install the necessary libraries

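 <code>!pip install langchain-community
!pip install chromadb
# Assumption: later steps also import these packages, which the original omits
!pip install google-genai transformers beautifulsoup4</code>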

Step 2. Load the data

We use Hindi text from a hospital blog page to evaluate how well Gemini Embedding retrieves Hindi-language content.

 <code>from langchain_community.document_loaders import WebBaseLoader

# Load the Hindi blog post as LangChain documents
loader = WebBaseLoader("https://ckbirlahospitals.com/rbh/blog/pregnancy-early-symptoms-in-hindi")
data = loader.load()</code>

Step 3. Chunk the data

The following code uses RecursiveCharacterTextSplitter to split the large text document into 500-character chunks with no overlap. It applies the split to the data variable and stores the result in all_splits. Because of the rate limits of the Gemini Embedding API, we use only the first 10 chunks.

 <code>from langchain_text_splitters import RecursiveCharacterTextSplitter

# 500-character chunks with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
# Keep only the first 10 chunks to stay within the API rate limits
all_splits = all_splits[:10]</code>
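To show what chunk_size and chunk_overlap control, here is a simplified fixed-window splitter. `split_fixed` is a hypothetical helper written for illustration; RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and lines, so its real chunks will usually differ.

```python
def split_fixed(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1,200 Devanagari characters as stand-in text
sample = "क" * 1200
chunks = split_fixed(sample)
lengths = [len(c) for c in chunks]
print(lengths)

# Mirrors the tutorial's cap on how many chunks are embedded
first_chunks = chunks[:10]
```

With no overlap, each character lands in exactly one chunk; a positive overlap would repeat the tail of one chunk at the head of the next.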

Step 4. Store the data in the vector database

We first create a class called "GeminiEmbeddingFunction", which calls the Gemini Embedding API and returns the embeddings for the input texts. We then create a function called "create_chroma_db", which creates a ChromaDB collection that stores the text chunks along with their embeddings.

 <code>import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from google import genai

# Assumption: the original snippet uses `client` without defining it, so a
# google-genai client is created here. Replace the placeholder with your API key.
client = genai.Client(api_key="YOUR_API_KEY")

class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Embed every input text and return one vector per document
        response = client.models.embed_content(
            model="gemini-embedding-exp-03-07", contents=input)
        return [e.values for e in response.embeddings]

def create_chroma_db(documents, name):
    chroma_client = chromadb.Client()
    db = chroma_client.create_collection(name=name, embedding_function=GeminiEmbeddingFunction())
    # Add each chunk with its index as the ID
    for i, d in enumerate(documents):
        db.add(documents=d.page_content, ids=str(i))
    return db

db = create_chroma_db(all_splits, "datab")</code>

Step 5. Query the database

 <code>def get_relevant_passage(query, db):
    # Return the single closest chunk for the query
    passage = db.query(query_texts=[query], n_results=1)['documents'][0][0]
    return passage

passage = get_relevant_passage("आपको प्रेगनेंसी टेस्ट कब करवाना चाहिए?", db)
print(passage)</code>
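Conceptually, the query step embeds the question and returns the stored chunk whose vector is nearest. The sketch below imitates that with a brute-force search over hypothetical two-dimensional vectors; it assumes squared Euclidean distance, which is believed to be Chroma's default metric, and `nearest_passages` is an illustrative helper rather than part of any library.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_passages(query_vec, passages, vectors, n_results=1):
    """Return the n_results passages whose vectors are closest to query_vec."""
    order = sorted(range(len(vectors)), key=lambda i: squared_l2(query_vec, vectors[i]))
    return [passages[i] for i in order[:n_results]]

# Hypothetical passages and toy embeddings
passages = ["passage about test timing", "passage about symptoms", "site navigation text"]
vectors = [[0.9, 0.1], [0.6, 0.5], [0.1, 0.9]]
query_vec = [1.0, 0.0]

top = nearest_passages(query_vec, passages, vectors)
print(top)
```

A vector database performs the same ranking, typically with an approximate index so that it scales beyond a handful of chunks.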

Step 6. Compare with Jina AI Embedding

The following code uses a Hugging Face Transformers model to define a custom embedding function that processes text input and generates embeddings.

  1. AutoTokenizer and AutoModel from transformers load the pretrained model (jinaai/jina-embeddings-v3), and EmbeddingFunction is imported from chromadb to create the custom embedding function.
  2. average_pool function: aggregates the model's hidden states by averaging over the sequence length, using the attention mask to ignore padding tokens.
  3. CustomHuggingFace class: tokenizes the text, feeds it into the model, and computes the embeddings with average_pool. The result is returned as a list of embeddings.
 <code>from transformers import AutoTokenizer, AutoModel
from chromadb import EmbeddingFunction

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
# trust_remote_code=True is assumed here because this model ships custom modeling code
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

# The model returns one hidden state per token, so we aggregate them per document
def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then divide by the number of real tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

class CustomHuggingFace(EmbeddingFunction):
    # Recent chromadb versions expect the parameter to be named `input`
    def __call__(self, input):
        batch_dict = tokenizer(input, max_length=512, padding=True, truncation=True, return_tensors='pt')
        outputs = model(**batch_dict)
        embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings.tolist()</code>
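The effect of the attention mask in average_pool can be checked by hand on a tiny example. This is a pure-Python sketch for a single sequence with made-up numbers; `masked_mean_pool` is a hypothetical stand-in for the tensor version above.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, skipping positions whose mask is 0."""
    dim = len(hidden_states[0])
    kept = sum(attention_mask)
    return [
        sum(vec[j] * m for vec, m in zip(hidden_states, attention_mask)) / kept
        for j in range(dim)
    ]

# Three positions; the last one is padding and must not affect the result
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)
```

Without the mask, the padding vector would pull the average toward values that carry no meaning for the text.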

Query

 <code>def get_relevant_passage(query, db):
    # Return the single closest chunk for the query
    passage = db.query(query_texts=[query], n_results=1)['documents'][0][0]
    return passage

passage = get_relevant_passage("आपको प्रेगनेंसी टेस्ट कब करवाना चाहिए?", db)
print(passage)</code>

To use Multilingual-e5-large embeddings instead, we simply replace the tokenizer and model names with "intfloat/multilingual-e5-large-instruct".

Comparison of embedding retrieval output

Query 1: आपको प्रेगनेंसी टेस्ट कब करवाना चाहिए?

  • Gemini Embedding (incorrect): If you want to learn more about the early symptoms of pregnancy, this blog post is perfect for you. When should you have a pregnancy test?
  • jinaai/jina-embeddings-v3 (incorrect): If you want to learn more about the early symptoms of pregnancy, this blog post is perfect for you. When should you have a pregnancy test?
  • intfloat/multilingual-e5-large-instruct (incorrect): If you want to learn more about the early symptoms of pregnancy, this blog post is perfect for you. When should you have a pregnancy test?

Query 2: Pregnancy के kuch symbols क्या होते हैं?

  • Gemini Embedding (correct): What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post.
  • jinaai/jina-embeddings-v3 (incorrect): Signs of pregnancy: Complete information on early symptoms! Home Quick Consultation Patient Login Contact Us: 08062136530 Emergency Phone: 07340054470 Open the main menu to serve patients and visitors International Patients About Us Make an appointment to call back WhatsApp to learn about the early symptoms of pregnancy. Obstetrics and Gynecology | Author: Dr. CP Dadhich | Release Date: February 6, 2025 Contents When should you have a pregnancy test? What are the early symptoms of pregnancy? Early symptoms of pregnancy Pregnancy
  • intfloat/multilingual-e5-large-instruct (correct): What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post.

Query 3: गर्भावस्था के दौरान एंटीबायोटिक दवा लेने से कब बचा हिए?

  • Gemini Embedding (correct): During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as
  • jinaai/jina-embeddings-v3 (correct): During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as
  • intfloat/multilingual-e5-large-instruct (incorrect): What every woman should know. For any pregnancy-related questions, we recommend that you contact our gynecologist to eliminate all complications.

Query 4: कब गर्भावस्था में एंटीबायोटिक दवा लेने से बचाया जाए?

  • Gemini Embedding (correct): During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as
  • jinaai/jina-embeddings-v3 (correct): During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as
  • intfloat/multilingual-e5-large-instruct (incorrect): What every woman should know. For any pregnancy-related questions, we recommend that you contact our gynecologist to eliminate all complications.

Query 5: गर्भधारण का सबसे पहला सामान्य लक्षण क्या है?

  • Gemini Embedding (correct): Delayed menstruation: This is the earliest and most common symptom of pregnancy. Confirmation of pregnancy based solely on this symptom is not entirely correct. However, if menstruation is delayed for one week or more, pregnancy tests are recommended. Breast changes: During pregnancy, the breasts will swell, become tender or change in color. It mainly changes in the size and color of the nipple (areola).
  • jinaai/jina-embeddings-v3 (incorrect): With this in mind, how to confirm pregnancy? How to take care of the first month of pregnancy? How to do pregnancy checkups? How should I sit during pregnancy? Should sex occur during pregnancy? What fruits should you eat during pregnancy? How much water should you drink during pregnancy? The joy of becoming a mother is the greatest happiness in the world. During pregnancy, there are many changes in women's physical and psychological changes. You call these changes early symptoms of pregnancy
  • intfloat/multilingual-e5-large-instruct (correct): What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post.

Query 6: गर्भधारण के पहले संकेत क्या होते हैं?

  • Gemini Embedding (incorrect): Signs of pregnancy: Complete information on early symptoms! Home Quick Consultation Patient Login Contact Us: 08062136530 Emergency Phone: 07340054470 Open the main menu to serve patients and visitors International Patients About Us Make an appointment to call back WhatsApp to learn about the early symptoms of pregnancy. Obstetrics and Gynecology | Author: Dr. CP Dadhich | Release Date: February 6, 2025 Contents When should you have a pregnancy test? What are the early symptoms of pregnancy? Early symptoms of pregnancy Pregnancy
  • jinaai/jina-embeddings-v3 (incorrect): With this in mind, how to confirm pregnancy? How to take care of the first month of pregnancy? How to do pregnancy checkups? How should I sit during pregnancy? Should sex occur during pregnancy? What fruits should you eat during pregnancy? How much water should you drink during pregnancy? The joy of becoming a mother is the greatest happiness in the world. During pregnancy, there are many changes in women's physical and psychological changes. You call these changes early symptoms of pregnancy
  • intfloat/multilingual-e5-large-instruct (correct): What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post.

Query 7: गर्भावस्था की पुष्टि के लिए कौन से हार्मोन का पता लगाना होता है?

  • Gemini Embedding (correct): The best time to have a pregnancy test is after menstruation is delayed by at least 7 days. You can use the home pregnancy testing tool to detect hCG levels at home. During pregnancy, the levels of this hormone will increase significantly. One thing you need to note is that premature testing can also lead to wrong results, so if your period is delayed and the test is negative, it is recommended that you wait at least 3 more days before you test again.
  • jinaai/jina-embeddings-v3 (correct): There is also a correct way to do this, which you can also see on the test tool manual. To get accurate results, you should use the first urine in the morning, as the correct level of hCG hormone can be measured. Also, if you experience early symptoms of pregnancy and the test results are negative, see your doctor for a blood test immediately. In any case, you must consult a doctor if you have any questions.
  • intfloat/multilingual-e5-large-instruct (incorrect): What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post.

Explanation

As the Hindi outputs above show, Gemini Embedding returns the correct passage for 5 of the 7 queries, while Jina AI embeddings and Multilingual-e5-large each return the correct passage for only 3.

This suggests that, consistent with the MTEB benchmark, Gemini embeddings handle languages such as Hindi better than the other two embedding models in this small test.
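The tallies can be reproduced from the per-query verdicts in the comparison above. The lists below are transcribed from those verdicts; with only seven queries, the percentages are a rough indication rather than a benchmark.

```python
# Verdicts per query, in order 1 to 7 (True = correct passage retrieved)
verdicts = {
    "Gemini Embedding": [False, True, True, True, True, False, True],
    "jinaai/jina-embeddings-v3": [False, False, True, True, False, False, True],
    "intfloat/multilingual-e5-large-instruct": [False, True, False, False, True, True, False],
}

def accuracy(results):
    """Fraction of queries for which the correct passage was retrieved."""
    return sum(results) / len(results)

for name, results in verdicts.items():
    print(f"{name}: {sum(results)}/{len(results)} correct ({accuracy(results):.0%})")
```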

Conclusion

In short, Gemini Embedding represents a significant advance in multilingual NLP, especially for Indic languages such as Hindi. With its strong multilingual capabilities, support for long inputs, and top ranking on benchmarks such as MTEB, it performs well on tasks such as retrieval, classification, and semantic search. In our practical comparison, Gemini Embedding retrieved the correct passage more often than the other two models, making it a valuable tool for multilingual NLP.

Key Takeaways

  • Importance of Hindi word embeddings : High-quality embeddings improve NLP tasks such as translation, question answering, and retrieval, addressing language-specific challenges and resource gaps.
  • Gemini Embedding model : Google's Gemini Embedding builds on the Gemini AI framework for multilingual text processing, covering more than 100 languages, including low-resource languages.
  • Key features : Supports 8K input tokens and 3072-dimensional embeddings, enabling efficient processing of long documents and complex queries.
  • Impressive performance : Ranked first on the MTEB multilingual leaderboard with a mean task score of 68.32, demonstrating its strength in multilingual NLP.

The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.

Frequently Asked Questions

Q1. What is the Gemini Embedding model? A: The Gemini Embedding model is built on Google's Gemini AI and provides state-of-the-art multilingual text embeddings for more than 100 languages, including Hindi.

Q2. What is unique about Gemini Embedding compared to other models? A: Gemini Embedding excels in multilingual support, accepts up to 8K input tokens, and outputs embeddings of up to 3072 dimensions, which supports classification, retrieval, and semantic search.

Q3. How does Gemini Embedding perform in multilingual tasks? A: Gemini Embedding performs well in high-resource languages such as English and in low-resource languages such as Assamese and Macedonian. It ranks first on the MTEB multilingual leaderboard, demonstrating its strong multilingual capabilities.

Q4. What is the architecture of the Gemini Embedding model? A: The model is initialized from the Gemini LLM and uses a Transformer architecture with bidirectional attention to generate high-quality text embeddings that capture context and meaning.

Q5. How is the Gemini Embedding model trained? A: Gemini Embedding is trained with a noise-contrastive estimation (NCE) loss that uses in-batch negatives. Training has two stages: pre-fine-tuning on a large dataset of query-target pairs, followed by fine-tuning on task-specific datasets to improve performance on target tasks.
