Musk’s first-generation multi-modal model Grok-1.5V is finally here!
According to xAI, Grok-1.5V can connect the digital world and the physical world.
In addition to its text capabilities, it can process a wide range of visual information, including documents, charts, screenshots and photos.
Grok-1.5V will soon be available to early testers and existing Grok users.
Jim Fan, a senior scientist at NVIDIA, singled out what he sees as the biggest highlight of Grok-1.5V: in his view, the most interesting thing about it is its potential to solve the edge cases of autonomous driving.
Using chain-of-thought reasoning, the model could help a car work through complex scenarios, reason with rules and counterfactuals, and explain its decisions. Grok-1.5V could also push perception down to the pixel level, moving toward a direct pixel->action mapping.
If so, Tesla FSD V13 may well end up understanding language tokens!
Tesla's advantage is its highly mature data pipeline, so labeling a large number of edge cases with high-quality human explanation traces should not be difficult. For multi-modal FSD reasoning, a fine-tuned Grok is likely to perform far better than GPT-4V and Gemini.
You might say, didn’t Wayve’s LINGO-1 have similar ideas before? But don’t forget, what Tesla is spinning is an unparalleled data flywheel, far larger than before.
Musk also showed up in the replies at just the right moment to elaborate:
Yes. Two data sources scale infinitely: synthetic data and real-world video. The former raises a question of authenticity; the latter does not.
Some netizens remarked that this sounds like the most feasible path to AGI: multi-modal models that use explicit linguistic reasoning to consistently infer world models.
Here is how xAI's official website introduces Grok-1.5V.
It can compete with the top multi-modal models such as GPT-4V, Claude 3 Sonnet, Claude 3 Opus, etc.
Whether it is multi-disciplinary reasoning, or document understanding, scientific diagrams, tables, screenshots and photos, Grok-1.5V can compete with them.
What is particularly surprising is its outstanding ability to understand the physical world.
On the RealWorldQA benchmark, Grok was evaluated against comparable models in a zero-shot setting without chain-of-thought prompting, and according to xAI's results it outperformed them.
The following 7 representative examples illustrate how Grok-1.5V performs in different scenarios.
The following figure shows the flowchart of a simple guessing game. Grok can work out the logic from the flowchart and implement the process in Python.
User:
Can you translate it into Python code?
Grok:
Of course! This flowchart depicts a simple guessing game in which the computer generates a random number and the user needs to guess what it is. Here is the Python code that represents the logic in the flowchart:
    import random

    def guess_number():
        # Generate a random number between 1 and 10
        target = random.randint(1, 10)
        while True:
            guess = int(input("Guess a number between 1 and 10: "))
            if guess == target:
                print("You won!")
                break
            else:
                print("Wrong guess, try again!")

    if __name__ == "__main__":
        guess_number()
This code will run the guessing game described in the flowchart. The user is prompted to enter the guessed number until the user guesses the correct number.
From the food's nutrition label, Grok can read that a serving is 3 slices containing 60 calories, and from that it calculates that the 5 slices the user asks about contain 100 calories.
User:
How many calories are in 5 slices?
Grok:
If a serving is 3 slices and contains 60 calories, then 5 slices contain about 100 calories.
First, determine the calories per slice: 60 calories ÷ 3 slices = 20 calories/slice.
Next, multiply the calories per slice by the number of slices: 20 calories x 5 slices = 100 calories.
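The same two-step calculation can be written as a short Python sketch. The function name `calories_for` and its parameter names are made up for illustration; the default values (3 slices and 60 calories per serving) are the figures quoted from the label above.

```python
def calories_for(slices, serving_slices=3, serving_calories=60):
    """Scale the label's per-serving calories to an arbitrary number of slices."""
    # Step 1: calories in a single slice (60 / 3 = 20 with the defaults).
    per_slice = serving_calories / serving_slices
    # Step 2: multiply by the number of slices actually eaten.
    return per_slice * slices


print(calories_for(5))  # 100.0
```

Keeping the per-slice value as an intermediate step mirrors Grok's explanation and makes it easy to check each stage by hand.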
Grok can tell a short bedtime story based on the content of the painting.
For example, the following doodle:
User:
This is what my son drew. Can you tell a bedtime story based on his drawing?
Grok:
A long time ago, in a land full of adventure and wonder, there lived a brave little boy named Timmy. He lived in a cozy cabin with his family and loved nothing more than exploring the world around him.
One sunny day, Timmy decided to explore the nearby forest. He packed his favorite snacks, a compass, and his trusty notebook to record all the amazing things he was about to see.
After going deep into the forest, Timmy discovered a magical river that sparkled in the sun. The charming scenery of the river made him decide to make a small paper boat and place it on the water as a special greeting to the forest spirits.
Timmy carefully folded the paper boat and drew a smiley face on it to express his love and good wishes for the forest spirits. Then he gently placed the paper boat in the river and watched it drift away with the water.
As the sun set, Timmy knew it was time to go home. He thought back to his day in the forest, knowing that this wonderful memory would stay with him forever.
Back home, Timmy told his family about his adventures. The family members were very happy and believed that he would experience more exciting adventures in the future.
So, with a heart full of love and dreams, Timmy slowly fell into sleep, looking forward to exploring new adventures in his imaginary world.
In the meme below, the left side shows a startup and the right side a large company. What does it mean?
It is not obvious at first glance, but Grok's explanation makes it clear.
User:
I don’t quite understand this picture, please explain it.
Grok:
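This picture humorously illustrates the difference in how startups and large companies work.
On the left, labeled "Startup", a group of people work together to dig a deep hole, and everyone is actively involved.
On the right, labeled "Big company", only one person is digging while the others just stand by and watch.
The humor comes from exaggerating the difference between the two kinds of organization.
Startups usually have everyone directly involved in the core task, whereas in large companies the sheer size and bureaucracy may keep team members away from the details of a project.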
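Grok can convert the contents of a table into CSV format.
User:
Please convert this table to CSV format.
Grok:
The table in the image can be converted to CSV format as follows: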
    medal,name,sport,event,date
    1 gold,El Amin Chentouf,Athletics,Men's marathon - T13,11 September
    1 gold,Mohamed Amguoun,Athletics,Men's 400m - T13,15 September
    1 gold,Azeddine Nouiri,Athletics,Men's shot put - T34,15 September
    2 silver,Mahdi Afri,Athletics,Men's 400m - T12,9 September
    2 silver,El Amin Chentouf,Athletics,Men's 5000m - T13,15 September
    3 bronze,Mahdi Afri,Athletics,Men's 200m - T12,17 September
    3 bronze,Mohamed Lahna,Paratriathlon,Men's PT2,10 September
Note that CSV does not support styling or formatting, so the information is presented as plain text.
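Because the result is plain CSV, it can be loaded directly with Python's standard `csv` module. The sketch below is only an illustration: the row boundaries are an assumption reconstructed from Grok's output (with the medal column read as "1 gold", "2 silver", "3 bronze"), and the helper name `medal_counts` is made up for this example.

```python
import csv
import io
from collections import Counter

# Rows as reconstructed from Grok's CSV output (an assumption about line breaks).
CSV_TEXT = """medal,name,sport,event,date
1 gold,El Amin Chentouf,Athletics,Men's marathon - T13,11 September
1 gold,Mohamed Amguoun,Athletics,Men's 400m - T13,15 September
1 gold,Azeddine Nouiri,Athletics,Men's shot put - T34,15 September
2 silver,Mahdi Afri,Athletics,Men's 400m - T12,9 September
2 silver,El Amin Chentouf,Athletics,Men's 5000m - T13,15 September
3 bronze,Mahdi Afri,Athletics,Men's 200m - T12,17 September
3 bronze,Mohamed Lahna,Paratriathlon,Men's PT2,10 September
"""


def medal_counts(text):
    """Parse the CSV text and count how many rows fall under each medal value."""
    rows = csv.DictReader(io.StringIO(text))
    return Counter(row["medal"] for row in rows)


counts = medal_counts(CSV_TEXT)
print(dict(counts))  # {'1 gold': 3, '2 silver': 2, '3 bronze': 2}
```

`csv.DictReader` uses the header row as keys, so each record can be accessed by column name rather than by position.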
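What should you do if the wood in your floor is rotting?
Send Grok a photo, and it can help assess the rot and offer sensible advice on how to deal with it.
User:
These holes have appeared around the screws in my floor. Is it rot? Should I replace the boards?
Grok:
These boards appear to have started rotting. The holes around the screws are a typical sign of wood rot.
To keep the floor safe and extend its service life, it is advisable to replace the damaged boards. Doing so prevents further damage and keeps the structure sound.
If you are unsure how to replace the boards, consider asking a professional floor builder or contractor for help. They can assess the extent of the damage and give you expert advice on maintaining the floor.
Coding problems are no trouble for Grok-1.5V either.
User:
Can you write Python code that solves this problem?
Grok: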
    def find_triplets(nums):
        result = []
        nums.sort()
        for i in range(len(nums) - 2):
            if i > 0 and nums[i] == nums[i - 1]:
                continue
            target = -nums[i]
            left = i + 1
            right = len(nums) - 1
            while left
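Grok's answer breaks off at `while left` in this copy, and the problem statement itself was in an image that is not reproduced here. The visible part (sorting, skipping duplicate values of `nums[i]`, and searching for `-nums[i]` with a left and a right pointer) looks like the standard sort-and-two-pointer approach to finding all unique triplets that sum to zero. The following is a minimal sketch under that assumption; it is not Grok's verbatim output, and the name `three_sum` is chosen here to keep the two apart.

```python
def three_sum(nums):
    """Return all unique triplets in nums that sum to zero (assumed problem)."""
    nums = sorted(nums)  # work on a sorted copy so the caller's list is untouched
    result = []
    for i in range(len(nums) - 2):
        # Skip repeated anchor values so each triplet is reported once.
        if i > 0 and nums[i] == nums[i - 1]:
            continue
        left, right = i + 1, len(nums) - 1
        while left < right:
            total = nums[i] + nums[left] + nums[right]
            if total == 0:
                result.append([nums[i], nums[left], nums[right]])
                left += 1
                right -= 1
                # Move past duplicates of the values just used.
                while left < right and nums[left] == nums[left - 1]:
                    left += 1
                while left < right and nums[right] == nums[right + 1]:
                    right -= 1
            elif total < 0:
                left += 1   # sum too small: move to a larger value
            else:
                right -= 1  # sum too large: move to a smaller value
    return result


print(three_sum([-1, 0, 1, 2, -1, -4]))  # [[-1, -1, 2], [-1, 0, 1]]
```

Sorting first is what makes the two-pointer scan valid: once the list is ordered, moving `left` right can only increase the sum and moving `right` left can only decrease it, giving O(n^2) time overall.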
To develop useful real-world AI assistants, it is crucial to improve a model's understanding of the physical world.
To that end, the research team also launched a new benchmark, RealWorldQA.
This benchmark is designed to evaluate the basic real-world spatial understanding of multi-modal models.
While many examples in current benchmarks are relatively easy for humans to understand, they are still challenging for cutting-edge models.
The initial release of RealWorldQA contains over 700 images, each with a question and easily verifiable answer.
For example, if we ask it: Which object is bigger, the pizza cutter or the scissors in the picture?
The answer given by Grok is: they are about the same size.
Where can I go from the current lane? A. Turn left B. Go straight C. Turn left and go straight D. Turn right
Grok will choose A.
Judging from the camera in front of the car, is there enough space to get around the gray car in front?
The answer given by Grok is: yes.
From this picture, which direction is the dinosaur facing?
Grok answered: East.
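Since every RealWorldQA item pairs a question with an easily verifiable answer, scoring a model can be as simple as comparing answers. The article does not describe how xAI actually scores the benchmark, so the sketch below assumes a normalized exact-match metric. The reference answers are the four quoted above; the `predictions` list and the function names are hypothetical and exist only to exercise the scorer.

```python
def normalize(answer):
    """Lowercase and drop surrounding whitespace and trailing periods."""
    return answer.strip().rstrip(".").strip().lower()


def accuracy(references, predictions):
    """Fraction of predictions that match the reference after normalization."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(r) == normalize(p) for r, p in zip(references, predictions))
    return hits / len(references)


# Reference answers taken from the four examples in the article.
references = ["They are about the same size", "A", "Yes", "East"]
# Hypothetical model outputs, invented purely for illustration.
predictions = ["they are about the same size.", "A", "no", "East"]

print(accuracy(references, predictions))  # 0.75
```

Exact matching works here only because the answers are short and closed-form; free-form answers would need a more forgiving comparison.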
This dataset includes anonymized images taken from vehicles, in addition to other real-world images.
The research team has released the data set publicly and hopes to keep expanding it as its multi-modal models improve.
Finally, xAI's official website concludes that improving multi-modal understanding and generation capabilities is an important step toward building beneficial AGI that can understand the universe.
Over the next few months, xAI expects to make major improvements in both capabilities across modalities such as images, audio and video.