DeepSeek Janus Pro 1B, launched on January 27, 2025, is an advanced multimodal AI model that can both understand images and generate them from textual prompts. The 1-billion-parameter (1B) version delivers efficient performance across a wide range of applications, including text-to-image generation and image understanding. It also excels at producing detailed captions from photos, making it a versatile tool for both creative and analytical tasks.
This article was published as a part of the Data Science Blogathon.
DeepSeek Janus Pro is a multimodal AI model that integrates text and image processing, capable of understanding and generating images from text prompts. The 1 billion parameter version (1B) is designed for efficient performance across applications like text-to-image generation and image understanding tasks.
Under DeepSeek’s Janus Pro series, the primary models available are “Janus Pro 1B” and “Janus Pro 7B”. They differ mainly in parameter count, with the 7B model being significantly larger and offering improved performance in text-to-image generation tasks; both are multimodal models capable of visual understanding and of text generation grounded in visual context.
Also read: How to Access DeepSeek Janus Pro 7B?
Janus-Pro diverges from previous multimodal models by employing separate, specialized pathways for visual encoding, rather than relying on a single visual encoder for both image understanding and generation.
This decoupled architecture facilitates task-specific optimization, mitigating conflicts between interpretation and creative synthesis. The independent encoders transform the raw inputs into features, which are then processed by a unified autoregressive transformer. This allows the multimodal understanding and generation components to independently select their most suitable encoding methods.
Also read: How DeepSeek’s Janus Pro Stacks Up Against DALL-E 3?
A shared transformer backbone is used for text and image feature fusion: whichever encoding method each pathway uses to convert raw inputs into features, the resulting features are processed by the same unified autoregressive transformer.
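To make the idea concrete, here is a minimal, purely illustrative PyTorch sketch of the decoupled design. The class name, module names, and dimensions are hypothetical placeholders, not DeepSeek’s actual implementation:

import torch
import torch.nn as nn

class DecoupledJanusSketch(nn.Module):
    # Toy illustration of Janus-style decoupled visual encoding:
    # two independent image pathways map into the transformer's embedding
    # space, while a single shared backbone fuses text and image features.
    def __init__(self, d_model: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Understanding path: project continuous vision features (e.g. 768-dim
        # patch embeddings) into the model dimension.
        self.understanding_encoder = nn.Linear(768, d_model)
        # Generation path: embed discrete image-token ids (e.g. from a VQ
        # codebook); in the real model an image head predicts these tokens.
        self.generation_encoder = nn.Embedding(16384, d_model)
        # Shared backbone; a plain TransformerEncoder stands in for the
        # autoregressive LLM backbone here.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_feats):
        # Understanding: concatenate projected image features with text
        # embeddings and let the shared backbone fuse both modalities.
        tokens = torch.cat(
            [self.understanding_encoder(image_feats), self.text_embed(text_ids)],
            dim=1,
        )
        return self.backbone(tokens)

    def embed_for_generation(self, image_token_ids):
        # Generation path uses its own, separately optimized encoder.
        return self.generation_encoder(image_token_ids)

# Example shapes: 16 image patches with 768-dim features plus 8 text tokens
out = DecoupledJanusSketch()(torch.randint(0, 32000, (1, 8)), torch.randn(1, 16, 768))

The point of the sketch is the shape of the design: two independent image pathways that each map into the same embedding space, with one shared backbone fusing text and image features.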
Previous Janus training used a three-stage process. Stage I focused on training the adaptors and the image head. Stage II handled unified pretraining, during which all components except the understanding encoder and the generation encoder had their parameters updated. Stage III covered supervised fine-tuning, building on Stage II by further unlocking the parameters of the understanding encoder.
Janus Pro improves on this strategy: it increases the number of training steps, drops the ImageNet dataset in favor of specialized text-to-image data, and refines the fine-tuning stage for better efficiency and performance. A sketch of the staged freeze/unfreeze schedule follows below.
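Here is a minimal, purely illustrative PyTorch-style sketch of the three-stage freeze/unfreeze schedule described above; the attribute names (adaptors, image head, encoders) are hypothetical placeholders, not the actual Janus module names:

import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    # Freeze everything first, then selectively unlock parameters per stage.
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage I: train only the adaptors and the image head.
        trainable = [model.understanding_adaptor, model.generation_adaptor,
                     model.image_head]
    elif stage == 2:
        # Stage II: unified pretraining -- update everything except the
        # understanding and generation encoders.
        trainable = [m for name, m in model.named_children()
                     if name not in ("understanding_encoder", "generation_encoder")]
    else:
        # Stage III: supervised fine-tuning -- additionally unlock the
        # understanding encoder (the generation encoder stays frozen).
        trainable = [m for name, m in model.named_children()
                     if name != "generation_encoder"]

    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True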
Now, let’s build a multimodal RAG pipeline with DeepSeek Janus Pro. In the following steps, we will build a system that answers queries over document images using the DeepSeek Janus Pro 1B model.
!pip install byaldi ollama pdf2image
!sudo apt-get install -y poppler-utils
!git clone https://github.com/deepseek-ai/Janus.git
!pip install -e ./Janus
import os
from pathlib import Path

from byaldi import RAGMultiModalModel
import ollama

# Initialize RAGMultiModalModel with the ColQwen2 retriever checkpoint
model1 = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")
Byaldi provides an easy-to-use framework for setting up multimodal RAG systems. As the code above shows, we load ColQwen2, a model designed for efficient document indexing using visual features.
# Use ColQwen2 to index and store the presentation
index_name = "image_index"
model1.index(
    input_path=Path("/content/PublicWaterMassMailing.pdf"),
    index_name=index_name,
    store_collection_with_index=True,  # Stores base64 images along with the vectors
    overwrite=True,
)
We will query this PDF to build the RAG system in the next steps. In the code above, we index the PDF and store the base64-encoded page images along with the vectors.
query = "How many clients drive more than 50% revenue?" returned_page = model1.search(query, k=1)[0] import base64 # Example Base64 string (truncated for brevity) base64_string = returned_page['base64'] # Decode the Base64 string image_data = base64.b64decode(base64_string) with open('output_image.png', 'wb') as image_file: image_file.write(image_data)
Based on the query, the most relevant page of the PDF is retrieved and saved as output_image.png.
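Optionally, you can open the saved page to verify that the retriever found the right slide (this small display step is an addition, not part of the original code):

from PIL import Image

# Render the retrieved page; in a notebook, the image displays inline
Image.open("output_image.png")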
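Next, we load Janus Pro 1B and ask it the query over the retrieved page image. The snippet below is a sketch that follows the inference example shipped in the Janus repository we cloned earlier; the imports, chat-role tags, and the deepseek-ai/Janus-Pro-1B model ID are taken from that repo and should be treated as assumptions if the API changes:

import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

# Load the processor, tokenizer, and model (per the Janus repo example)
model_path = "deepseek-ai/Janus-Pro-1B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Pair the user query with the retrieved page image
conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{query}",
        "images": ["output_image.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# Fuse text and image inputs into embeddings for the language model
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

# Decode the generated token IDs back into human-readable text
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)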
The code generates a response from the DeepSeek Janus Pro 1B model using the prepared input embeddings (text and image). It uses several configuration settings like padding, start/end tokens, max token length, and whether to use caching and sampling. After the response is generated, it decodes the token IDs back into human-readable text using the tokenizer. The decoded output is stored in the answer variable.
The complete code is available in this Colab notebook.
“What has been the revenue in France?”
The above response is not accurate: even though the ColQwen2 retriever fetched the relevant page, the DeepSeek Janus Pro 1B model could not generate the correct answer from it. The exact answer should be $2B.
“What has been the number of promotions since beginning of FY20?”
The above response is correct, as it matches the text in the PDF.
In conclusion, the DeepSeek Janus Pro 1B model represents a significant advancement in multimodal AI, with a decoupled architecture that optimizes both image understanding and generation. By using separate visual encoders for these tasks and refining its training strategy, Janus Pro delivers strong performance in text-to-image generation and image analysis. This approach, combined with the model’s open-source accessibility, makes multimodal RAG with DeepSeek Janus Pro a powerful tool for AI-driven visual comprehension and creation.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Q1. What is DeepSeek Janus Pro 1B?
Ans. DeepSeek Janus Pro 1B is a multimodal AI model designed to integrate both text and image processing, capable of understanding and generating images from text descriptions. It features 1 billion parameters for efficient performance in tasks like text-to-image generation and image understanding.
Q2. How does the architecture of Janus Pro 1B work?
Ans. Janus Pro uses a unified transformer architecture with decoupled visual encoding. This means it employs separate pathways for image understanding and generation, allowing task-specific optimization for each.

Q3. How does the training process of Janus Pro differ from previous versions?
Ans. Janus Pro improves on previous training strategies by increasing training steps, dropping the ImageNet dataset in favor of specialized text-to-image data, and focusing on better fine-tuning for enhanced efficiency and performance.

Q4. What kind of applications can benefit from using Janus Pro 1B?
Ans. Janus Pro 1B is particularly useful for tasks involving text-to-image generation, image understanding, and multimodal AI applications that require both image and text processing capabilities.

Q5. How does Janus-Pro compare to other models like DALL-E 3?
Ans. Janus-Pro-7B outperforms DALL-E 3 in benchmarks such as GenEval and DPG-Bench, according to DeepSeek. Janus-Pro separates understanding/generation, scales data/models for stable image generation, and maintains a unified, flexible, and cost-efficient structure. While both models perform text-to-image generation, Janus-Pro also offers image captioning, which DALL-E 3 does not.