
Peking University launches new multi-modal robot model! Efficient reasoning and operations for general and robotic scenarios

王林
Release: 2024-07-16 03:51:40
The AIxiv column is where this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 contributions covering top laboratories from major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

This work was completed by HMI Lab. Built on two major platforms at Peking University, the National Engineering Research Center for Video and Visual Technology and the National Key Laboratory of Multimedia Information Processing, HMI Lab has long been engaged in research on machine learning, multi-modal learning, and embodied intelligence. The first author, Dr. Liu Jiaming, works on multi-modal embodied large models and continual learning for the open world. The second author, Liu Mengzhen, works on vision foundation models and robot manipulation. The advisor, Chen Shanghang, is a researcher and doctoral supervisor at the School of Computer Science, Peking University, and a Boya Young Scholar. Chen's research on multi-modal large models and embodied intelligence has produced a series of important results, including more than 80 papers in top artificial intelligence journals and conferences with over 9,700 Google Scholar citations, a Best Paper Award from AAAI, one of the world's top artificial intelligence conferences, and a first-place ranking on Trending Research, the world's largest academic source-code repository.

To give robots end-to-end reasoning and manipulation capabilities, this work innovatively integrates a visual encoder with an efficient state space language model to build a new multi-modal large model, RoboMamba, which handles visual common-sense tasks and robot-related reasoning tasks and achieves advanced performance. The authors also found that once RoboMamba has strong reasoning capabilities, it can master multiple manipulation pose prediction skills at extremely low training cost.


  • Paper: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

  • Paper link: https://arxiv.org/abs/2406.04339

  • Project homepage: https://sites.google.com/view/robomamba-web

  • Github: https://github.com/lmzpai/roboMamba


Figure 1. Robot-related capabilities of RoboMamba, including task planning, prompted task planning, long-range task planning, maneuverability judgment, maneuverability generation, future and past prediction, end-effector pose prediction, etc.

Abstract

A basic goal of robot manipulation is to enable the model to understand the visual scene and perform actions. Although existing robotic multimodal large language models (MLLMs) can handle a range of basic tasks, they still face challenges in two respects: 1) insufficient reasoning ability for complex tasks; 2) high computational cost for MLLM fine-tuning and inference. A recently proposed state space model (SSM), Mamba, offers linear inference complexity while demonstrating promising sequence modeling capabilities. Inspired by this, we introduce an end-to-end robotic MLLM, RoboMamba, which uses the Mamba model to provide robot reasoning and action capabilities while keeping fine-tuning and inference efficient.

Specifically, we first integrate the visual encoder with Mamba and align visual data with language embeddings through joint training, giving our model visual common sense and robot-related reasoning capabilities. To further enhance RoboMamba's manipulation pose prediction capabilities, we explore an efficient fine-tuning strategy using only a simple policy head. We found that once RoboMamba has sufficient reasoning capabilities, it can master multiple manipulation skills with very few fine-tuning parameters (0.1% of the model) and little fine-tuning time (20 minutes). In experiments, RoboMamba demonstrates excellent reasoning capabilities on general and robotic evaluation benchmarks, as shown in Figure 2. At the same time, our model delivers impressive manipulation pose prediction in both simulation and real-world experiments, with inference speeds up to 7 times faster than existing robotic MLLMs.


Figure 2. Overview: RoboMamba is an efficient robotic multi-modal large model with powerful reasoning and manipulation capabilities. RoboMamba-2.8B achieves inference performance competitive with other 7B MLLMs on general-purpose MLLM benchmarks while demonstrating long-range reasoning capabilities on robotic tasks. We then introduce an extremely efficient fine-tuning strategy that gives RoboMamba the ability to predict manipulation poses, taking only 20 minutes to fine-tune a simple policy head.

The main contributions of this article are summarized as follows:

  • We innovatively integrate the visual encoder with the efficient Mamba language model to build a new end-to-end robotic multi-modal large model, RoboMamba, which has visual common sense and comprehensive robot-related reasoning capabilities.
  • In order to equip RoboMamba with end-effector manipulation pose prediction capabilities, we explored an efficient fine-tuning strategy using a simple Policy Head. We found that once RoboMamba reaches sufficient reasoning capabilities, it can master manipulation pose prediction skills at very low cost.
  • In our extensive experiments, RoboMamba performs well on general and robotic inference evaluation benchmarks, and demonstrates impressive pose prediction results in simulators and real-world experiments.

Research background

Data scaling has greatly advanced research on large language models (LLMs), which demonstrate remarkable reasoning and generalization capabilities in natural language processing (NLP). To understand multimodal information, multimodal large language models (MLLMs) emerged, giving LLMs the ability to follow visual instructions and understand scenes. Inspired by the strong capabilities of MLLMs in general-purpose settings, recent research aims to apply MLLMs to the field of robot manipulation. Some works enable robots to understand natural language and visual scenes and automatically generate task plans, while others exploit the inherent capabilities of MLLMs to predict manipulation poses.

Robot operation involves interacting with objects in a dynamic environment, requiring human-like reasoning capabilities to understand the semantic information of the scene, as well as powerful manipulation pose prediction capabilities. Although existing robot-based MLLMs can handle a range of basic tasks, they still face challenges in two aspects.

1) First, the reasoning ability of pre-trained MLLMs in robotics scenarios is found to be insufficient. As shown in Figure 2, this shortcoming creates challenges when fine-tuned robotic MLLMs encounter complex reasoning tasks.
2) Second, due to the high computational complexity of the attention mechanism in existing MLLMs, fine-tuning them and using them to generate robot actions incurs high computational costs.

To balance reasoning ability and efficiency, several studies have emerged in the field of NLP. In particular, Mamba introduces the innovative selective state space model (SSM), which enables content-aware reasoning while maintaining linear complexity.

Inspired by this, we asked a question: "Can we develop an efficient robotic MLLM that not only has strong reasoning capabilities, but also acquires robotic operation skills in a very economical way?"

RoboMamba method

1. Background knowledge

  • Problem statement

For robot visual reasoning, RoboMamba generates a language answer a from an image I and a language question q, expressed as a = RoboMamba(I, q). Reasoning answers often contain separate subtasks {s_1, ..., s_K} for a question q. For example, when faced with a planning problem such as "How to clear the table?", responses typically include steps such as "Step 1: Pick up the object" and "Step 2: Put the object into the box." For action prediction, we utilize an efficient and simple policy head π to predict actions. Following previous work, we use a 6-DoF representation for the end-effector pose of the Franka Emika Panda robotic arm: the end-effector position p ∈ R^3 (the three-dimensional coordinates) and the direction R ∈ R^{3×3} (the rotation matrix). When training on a grasping task, we add the gripper state to the pose prediction, enabling 7-DoF control.
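To make this action representation concrete, the following is a minimal sketch (not code from the paper) of how a predicted 6-DoF or 7-DoF end-effector pose could be stored; all names and the example values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class EndEffectorPose:
    """Illustrative container for the action predicted by the policy head."""
    position: np.ndarray                    # shape (3,): end-effector position (x, y, z)
    rotation: np.ndarray                    # shape (3, 3): rotation matrix giving the direction
    gripper_open: Optional[float] = None    # optional gripper state for 7-DoF grasping control

    def flatten(self) -> np.ndarray:
        """Concatenate position, flattened rotation, and optional gripper state."""
        parts = [self.position, self.rotation.reshape(-1)]
        if self.gripper_open is not None:
            parts.append(np.array([self.gripper_open]))
        return np.concatenate(parts)

# Example: identity orientation 20 cm in front of the camera, gripper open.
pose = EndEffectorPose(np.array([0.0, 0.0, 0.2]), np.eye(3), gripper_open=1.0)
```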

  • State Space Model (SSM)
This article chooses Mamba as the large language model. Mamba is composed of many Mamba blocks, whose most critical component is the SSM. The SSM is based on a continuous system that projects a 1D input sequence x(t) to a 1D output sequence y(t) through a hidden state h(t). The SSM has three key parameters: the state matrix A, the input matrix B, and the output matrix C. The SSM can be expressed as:

h'(t) = A h(t) + B x(t),    y(t) = C h(t)

Recent SSMs (e.g., Mamba) discretize this continuous system using a time scale parameter Δ, which converts the continuous parameters A and B into discrete parameters Ā and B̄. The discretization adopts the zero-order hold method, defined as follows:

Ā = exp(ΔA),    B̄ = (ΔA)^(-1) (exp(ΔA) − I) ΔB

Mamba introduces a selective scanning mechanism (S6) to form the SSM operation in each Mamba block, where the SSM parameters B, C, and Δ become functions of the input for better content-aware reasoning. The details of the Mamba block are shown in Figure 3 below.
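As a concrete illustration of the equations above, here is a minimal NumPy sketch of zero-order-hold discretization and the resulting recurrence. It is a toy, not Mamba's actual implementation, and it omits the selectivity that would make Δ, B, and C input-dependent.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^(-1) (exp(dA) - I) dB, with dA = delta*A."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Discrete recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                          # x is a 1-D input sequence
        h = A_bar @ h + B_bar[:, 0] * x_t
        ys.append((C @ h).item())
    return np.array(ys)

# Toy usage: 4-dimensional hidden state, scalar input/output channel.
A = -np.eye(4)                             # stable state matrix
B = np.ones((4, 1))                        # input matrix
C = np.ones((1, 4)) / 4                    # output matrix
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, np.sin(np.linspace(0.0, 3.0, 50)))
```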

2. RoboMamba model structure


Figure 3. RoboMamba overall framework. RoboMamba projects images into Mamba's language embedding space through a visual encoder and a projection layer; the resulting visual tokens are concatenated with text tokens and fed into the Mamba model. To predict the position and orientation of the end effector, we introduce a simple MLP policy head and use a pooling operation over the language output tokens to generate a global token as its input. RoboMamba's training strategy: we divide training into two stages. In Stage 1, we introduce alignment pre-training (Stage 1.1) and instruction co-training (Stage 1.2) to equip RoboMamba with common-sense and robot-related reasoning capabilities. In Stage 2, we propose robot manipulation fine-tuning to efficiently equip RoboMamba with low-level manipulation skills.

To equip RoboMamba with visual reasoning and manipulation capabilities, we build an efficient MLLM architecture starting from pre-trained large language models (LLMs) and vision models. As shown in Figure 3 above, we use the CLIP visual encoder to extract visual features F_v from the input image I, where B and N denote the batch size and the number of tokens, respectively. Unlike recent MLLMs, we do not employ visual encoder ensembles, which use multiple backbone networks (e.g., DINOv2, CLIP-ConvNeXt, CLIP-ViT) for image feature extraction; such ensembling introduces additional computational cost and severely limits the practicality of robotic MLLMs in the real world. Instead, we demonstrate that a simple and straightforward model design can also achieve powerful reasoning capabilities when combined with high-quality data and an appropriate training strategy. To make the LLM understand visual features, we use a multilayer perceptron (MLP) to connect the visual encoder to the LLM. With this simple cross-modal connector, RoboMamba can transform visual information into the language embedding space X_v.

Note that model efficiency is crucial in robotics, since robots need to respond quickly to human instructions. We therefore choose Mamba as our large language model for its content-aware reasoning capabilities and linear computational complexity. Textual prompts are encoded into an embedding space X_t using a pretrained tokenizer, then concatenated (cat) with the visual tokens and fed into Mamba. We leverage Mamba's powerful sequence modeling to understand the multimodal information and use effective training strategies to develop visual reasoning capabilities (as described in the next section). The output tokens are then decoded (det) to generate a natural language answer a. The forward process of the model can be expressed as follows:

a = Det(Mamba(Cat(X_v, X_t))),    where X_v = MLP(F_v)
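The forward process can be sketched in PyTorch as below. Every submodule is a light stand-in (for instance, a GRU in place of the Mamba blocks and a small convolution in place of CLIP), and all dimensions and names are illustrative, not the official RoboMamba implementation.

```python
import torch
import torch.nn as nn

class RoboMambaSketch(nn.Module):
    """Minimal sketch: visual features -> MLP projector -> concatenate with
    text embeddings -> language model -> answer-token logits."""

    def __init__(self, vis_dim=256, lm_dim=512, vocab_size=32000):
        super().__init__()
        self.visual_encoder = nn.Conv2d(3, vis_dim, kernel_size=16, stride=16)  # stand-in for CLIP patch features
        self.projector = nn.Sequential(                                          # cross-modal MLP connector
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))
        self.text_embed = nn.Embedding(vocab_size, lm_dim)                       # stand-in for tokenizer + embedding
        self.language_model = nn.GRU(lm_dim, lm_dim, batch_first=True)           # stand-in for the Mamba blocks
        self.lm_head = nn.Linear(lm_dim, vocab_size)                             # maps hidden states back to tokens

    def forward(self, image, text_ids):
        feat = self.visual_encoder(image)                 # (B, vis_dim, H/16, W/16)
        vis_tokens = feat.flatten(2).transpose(1, 2)      # (B, N, vis_dim) visual tokens
        vis_tokens = self.projector(vis_tokens)           # align to the LM embedding space (X_v)
        txt_tokens = self.text_embed(text_ids)            # (B, T, lm_dim) text embeddings (X_t)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)  # Cat(X_v, X_t)
        hidden, _ = self.language_model(seq)              # sequence modeling over the joint tokens
        return self.lm_head(hidden)                       # logits that a decoder turns into the answer

# Toy usage with random inputs.
model = RoboMambaSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
```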

3. RoboMamba general vision and robot reasoning ability training

After building the RoboMamba architecture, the next goal is to train the model to learn general visual reasoning and robot-related reasoning abilities. As shown in Figure 3, we divide Stage 1 training into two sub-steps: alignment pre-training (Stage 1.1) and instruction co-training (Stage 1.2). Unlike previous MLLM training pipelines, we aim to enable RoboMamba to understand both general visual scenes and robotics scenarios. Given that robotics involves many complex and novel tasks, RoboMamba requires stronger generalization capabilities. We therefore adopt a co-training strategy in Stage 1.2 that combines high-level robot data (e.g., task planning) with general instruction data. We find that co-training not only yields more generalizable robot policies but also enhances general-scene reasoning, thanks to the complex reasoning tasks contained in the robot data. The training details are as follows:

  • Stage 1.1: Alignment pre-training.

We adopt the 558K image-text pair dataset filtered by LLaVA for cross-modal alignment. As shown in Figure 3, we freeze the parameters of the CLIP encoder and the Mamba language model and only update the projection layer. In this way, we align image features with the pre-trained Mamba word embeddings.
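A hedged PyTorch sketch of the Stage 1.1 setup described above, reusing the RoboMambaSketch stand-in from the architecture example; the optimizer choice and learning rate are assumptions, not values from the paper.

```python
import torch

def configure_stage_1_1(model, lr=1e-3):
    """Alignment pre-training: freeze everything, then unfreeze only the projection layer."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.projector.parameters():
        param.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)

optimizer = configure_stage_1_1(model)   # `model` is the RoboMambaSketch instance from above
```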

  • Stage 1.2: Instruction co-training.

In this stage, we first follow previous MLLM work to collect general visual instruction data. We employ the 655K LLaVA mixed instruction dataset and the 400K LRV-Instruct dataset to learn visual instruction following and to mitigate hallucinations, respectively. Note that mitigating hallucinations is important in robotic scenarios, because a robotic MLLM needs to generate task plans based on the real scene rather than an imagined one. For example, existing MLLMs may formulaically answer "Open the microwave" with "Step 1: Find the handle," yet many microwave ovens have no handle. Next, we add the 800K RoboVQA dataset to learn high-level robotic skills such as long-range task planning, maneuverability judgment, maneuverability generation, future and past prediction, etc. During co-training, as shown in Figure 3, we freeze the parameters of the CLIP encoder and fine-tune the projection layer and Mamba on the 1.8M merged dataset. All outputs of the Mamba language model are supervised with a cross-entropy loss.
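To make the supervision concrete, here is a hedged sketch of a single co-training step with next-token cross-entropy over the merged instruction data, again using the stand-in model from the architecture example; the batch keys, shift/masking, and hyperparameters are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def co_training_step(model, optimizer, batch):
    """One Stage 1.2 step: forward an image + instruction pair and supervise the
    language-model outputs with next-token cross-entropy."""
    input_ids = batch["input_ids"]                       # (B, T) instruction + answer tokens
    logits = model(batch["image"], input_ids)            # (B, N_vis + T, vocab)
    text_logits = logits[:, -input_ids.size(1):, :]      # keep positions aligned with the text tokens
    preds = text_logits[:, :-1, :]                       # position t predicts token t + 1
    targets = input_ids[:, 1:]
    loss = F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step with random data; the CLIP stand-in stays frozen while the
# projector and language model (plus output head here) are trainable.
optimizer = torch.optim.AdamW(
    list(model.projector.parameters()) + list(model.language_model.parameters())
    + list(model.lm_head.parameters()), lr=2e-5)
loss = co_training_step(model, optimizer, {
    "image": torch.randn(2, 3, 224, 224),
    "input_ids": torch.randint(0, 32000, (2, 16)),
})
```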

4. RoboMamba manipulation ability fine-tuning

Building on RoboMamba's strong reasoning capabilities, this section introduces our robot manipulation fine-tuning strategy, referred to as training Stage 2 in Figure 3. Existing MLLM-based robot manipulation methods update the projection layer and the entire LLM during manipulation fine-tuning. Although this paradigm gives the model action pose prediction capabilities, it also damages the inherent capabilities of the MLLM and requires substantial training resources. To address these challenges, we propose an efficient fine-tuning strategy, as shown in Figure 3: we freeze all parameters of RoboMamba and introduce a simple policy head to model Mamba's output tokens. The policy head contains two MLPs that learn the end-effector position and direction, respectively, accounting for only 0.1% of the parameters of the entire model. Following the previous work Where2Act, the position and direction losses are defined as follows:

L_pos = (1/N) Σ_i ||p_i − p̂_i||²,    L_dir = (1/N) Σ_i arccos( (Tr(R_iᵀ R̂_i) − 1) / 2 )

where N is the number of training samples, Tr(A) is the trace of matrix A, and p̂_i, R̂_i denote the predictions. RoboMamba only predicts the 2D position (x, y) of the contact pixel in the image and then uses depth information to convert it into 3D space. To evaluate this fine-tuning strategy, we generated a dataset of 10,000 end-effector pose predictions using the SAPIEN simulator.
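A minimal sketch of a policy head of this kind and the position/direction losses above; the hidden sizes, pooling choice, and loss weighting are illustrative assumptions rather than the released RoboMamba implementation.

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Two small MLPs on a pooled language token: one for the 2D contact position,
    one for a 3x3 end-effector direction."""

    def __init__(self, lm_dim=512, hidden=256):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(lm_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.rot_mlp = nn.Sequential(nn.Linear(lm_dim, hidden), nn.ReLU(), nn.Linear(hidden, 9))

    def forward(self, lm_tokens):
        pooled = lm_tokens.mean(dim=1)               # pool language output tokens into one global token
        pos = self.pos_mlp(pooled)                   # (B, 2) contact pixel (x, y)
        rot = self.rot_mlp(pooled).view(-1, 3, 3)    # (B, 3, 3) direction (not orthogonalized here)
        return pos, rot

def pose_loss(pos_pred, rot_pred, pos_gt, rot_gt):
    """Position: mean squared error. Direction: geodesic angle between rotation
    matrices via the trace identity, a common Where2Act-style choice."""
    l_pos = ((pos_pred - pos_gt) ** 2).sum(dim=-1).mean()
    # Tr(R_pred^T R_gt) equals the elementwise sum of products of the two matrices.
    cos = (torch.einsum("bij,bij->b", rot_pred, rot_gt) - 1.0) / 2.0
    l_dir = torch.acos(cos.clamp(-1.0, 1.0)).mean()
    return l_pos + l_dir

# Toy usage on random tensors (lm_dim must match the language model's hidden size).
head = PolicyHead()
pos, rot = head(torch.randn(4, 212, 512))
loss = pose_loss(pos, rot, torch.rand(4, 2), torch.eye(3).expand(4, 3, 3))
```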

After manipulation fine-tuning, we found that once RoboMamba has sufficient reasoning capabilities, it can acquire pose prediction skills through extremely efficient fine-tuning. Thanks to the minimal number of fine-tuning parameters (7 MB) and the efficient model design, a new manipulation skill can be learned in just 20 minutes. This finding highlights the importance of reasoning ability for learning manipulation skills and suggests a new perspective: we can efficiently equip MLLMs with manipulation capabilities without affecting their inherent reasoning capabilities. Finally, RoboMamba uses language responses for common-sense and robot-related reasoning, and the policy head for action pose prediction.

Quantitative Experiment

1. General Reasoning Ability Assessment (MLLM Benchmarks)

To evaluate reasoning ability, we use several popular benchmarks, including VQAv2, OKVQA, GQA, OCRVQA, VizWiz, POPE, MME, MMBench, and MM-Vet. In addition, we directly evaluate RoboMamba's robot-related reasoning capabilities on RoboVQA's 18K validation set, covering robot tasks such as task planning, prompted task planning, long-range task planning, maneuverability judgment, maneuverability generation, past description, and future prediction.

Table 1. Comparison of RoboMamba and existing MLLMs on multiple benchmarks.
As shown in Table 1, we compare RoboMamba with previous state-of-the-art (SOTA) MLLMs on common VQA benchmarks and recent MLLM benchmarks. First, we find that RoboMamba achieves satisfactory results on all VQA benchmarks using only a 2.7B language model, showing that the simple structural design is effective and that alignment pre-training and instruction co-training significantly improve the reasoning capabilities of the MLLM. For example, RoboMamba's spatial recognition performance on the GQA benchmark improves thanks to the large amount of robot data introduced in the co-training phase. We also tested RoboMamba on recently proposed MLLM benchmarks and observe that it achieves competitive results across all of them. Although RoboMamba still trails the state-of-the-art 7B MLLMs (e.g., LLaVA1.5 and SPHINX) on some metrics, we prioritize the smaller and faster Mamba-2.7B to balance the efficiency of the robot model. In the future, we plan to develop RoboMamba-7B for resource-unconstrained scenarios.

2. Robot reasoning ability evaluation (RoboVQA Benchmark)

In addition, to comprehensively compare RoboMamba's robot-related reasoning capabilities, we benchmark it against LLaMA-AdapterV2 on the RoboVQA validation set. We choose LLaMA-AdapterV2 as the baseline because it is the base model of the current SOTA robotic MLLM (ManipLLM). For a fair comparison, we loaded the LLaMA-AdapterV2 pre-trained parameters and fine-tuned them on the RoboVQA training set for two epochs using its official instruction fine-tuning method. As shown in Figure 4 a), RoboMamba achieves superior performance from BLEU-1 through BLEU-4. The results demonstrate that our model has advanced robot-related reasoning capabilities and confirm the effectiveness of our training strategy. In addition to higher accuracy, our model achieves inference speeds up to 7 times faster than LLaMA-AdapterV2 and ManipLLM, which can be attributed to the content-aware reasoning capability and efficiency of the Mamba language model.

Figure 4. Comparison of robot-related reasoning on RoboVQA.
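For reference, BLEU-1 through BLEU-4 scores of this kind can be computed along the following lines using NLTK; the whitespace tokenization and smoothing here are simplifying assumptions and may differ from the paper's evaluation script.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_1_to_4(references, hypotheses):
    """references: list of reference answer strings; hypotheses: list of model answers."""
    refs = [[r.lower().split()] for r in references]   # one reference per sample
    hyps = [h.lower().split() for h in hypotheses]
    smooth = SmoothingFunction().method1
    scores = {}
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
        scores[f"BLEU-{n}"] = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    return scores

print(bleu_1_to_4(["pick up the cup and place it in the sink"],
                  ["pick up the cup and put it in the sink"]))
```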

3. Robot Manipulation Ability Evaluation (SAPIEN)

To evaluate RoboMamba's manipulation ability, we compare our model with four baselines: UMPNet, Flowbot3D, RoboFlamingo, and ManipLLM. Before comparison, we reproduce all baselines and train them on the dataset we collected. For UMPNet, we operate at the predicted contact points with the end effector oriented perpendicular to the object surface. Flowbot3D predicts the direction of motion on the point cloud, selects the largest flow as the interaction point, and uses the flow direction as the end-effector direction. RoboFlamingo and ManipLLM load OpenFlamingo and LLaMA-AdapterV2 pre-trained parameters, respectively, and follow their own fine-tuning and model update strategies. As shown in Table 2, compared to the previous SOTA ManipLLM, RoboMamba achieves a 7.0% improvement on seen categories and a 2.0% improvement on unseen categories. In terms of efficiency, RoboFlamingo updates 35.5% (1.8B) of its model parameters and ManipLLM updates adapters in the LLM (41.3M, 0.5% of model parameters), while our fine-tuned policy head (3.7M) accounts for only 0.1% of model parameters. RoboMamba updates 10x fewer parameters than previous MLLM-based methods while inferring 7x faster. The results show that RoboMamba not only has strong reasoning capabilities but can also acquire manipulation capabilities at low cost.

Table 2. Comparison of success rates between RoboMamba and other baselines.

Qualitative results

[Figure: qualitative results of RoboMamba on robotic downstream tasks]

As shown in Figure 4, we visualize RoboMamba's reasoning results on various robotic downstream tasks. For task planning, compared to LLaMA-AdapterV2, RoboMamba demonstrates more accurate and longer-horizon planning thanks to its stronger reasoning capabilities. For a fair comparison, we also fine-tune the LLaMA-AdapterV2 baseline on the RoboVQA dataset. For manipulation pose prediction, we used a Franka Emika robotic arm to interact with various household objects. We project the 3D pose predicted by RoboMamba onto the 2D image, using red points to mark the contact points and visualizing the end-effector direction, as shown in the lower right corner of the figure.
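Projecting a predicted 3D contact point onto the 2D image, and the inverse lift from a contact pixel plus depth back to 3D mentioned earlier, both follow the standard pinhole camera model. The sketch below illustrates this; the intrinsics are made-up values, not the real camera calibration.

```python
import numpy as np

def project_point(p_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole model)."""
    x, y, z = p_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])

def lift_pixel(u, v, depth, fx, fy, cx, cy):
    """Inverse operation: lift a 2D contact pixel plus its depth back to 3D."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

# Illustrative intrinsics.
fx = fy = 600.0
cx, cy = 320.0, 240.0
pixel = project_point(np.array([0.1, -0.05, 0.8]), fx, fy, cx, cy)
point = lift_pixel(pixel[0], pixel[1], 0.8, fx, fy, cx, cy)   # recovers the original 3D point
```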

