This article showcases SmolVLM-500M-Instruct, a cutting-edge, compact vision-to-text model. Despite its relatively small size (500 million parameters), it demonstrates impressive capabilities.
Here's the Python code:
<code class="language-python">import torch from transformers import AutoProcessor, AutoModelForVision2Seq from PIL import Image import warnings warnings.filterwarnings("ignore", message="Some kwargs in processor config are unused") def describe_image(image_path): processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct") model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct") image = Image.open(image_path) prompt = "Describe the image content in detail. Provide a concise textual response." inputs = processor(text=[prompt], images=[image], return_tensors="pt") with torch.no_grad(): outputs = model.generate( pixel_values=inputs["pixel_values"], input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=150, do_sample=True, temperature=0.7 ) description = processor.batch_decode(outputs, skip_special_tokens=True)[0] return description.strip() if __name__ == "__main__": image_path = "images/bender.jpg" try: description = describe_image(image_path) print("Image Description:", description) except Exception as e: print(f"Error: {e}")</code>
This script leverages the Hugging Face Transformers library to generate a textual description from an image. It loads a pre-trained model and processor, processes the image, and outputs a descriptive text. Error handling is included.
The code is available here: https://www.php.cn/link/042886829869470b75f63dddfd7e9d9d
Using the following non-stock image (placed in the project's image directory):
The model generates a description (the prompt and parameters can be adjusted for finer control): A robot, seated on a couch, is engrossed in reading a book. Bookshelves and a door are visible in the background. A white chair with a cushion is also in the scene.
The model's speed and efficiency are noteworthy compared to larger language models.
The above is the detailed content of Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-M Model. For more information, please follow other related articles on the PHP Chinese website!