


A single GPU achieves 20 Hz online decision-making: interpreting the latest efficient trajectory planning method based on sequence generation models
Previously we introduced sequence-modeling methods based on Transformers and diffusion models for reinforcement learning, especially for offline continuous control. Among them, the Trajectory Transformer (TT) and Diffuser are model-based planning algorithms. They offer very accurate trajectory prediction and good flexibility, but their decision latency is relatively high. TT in particular discretizes each dimension independently into its own token in the sequence, which makes the whole sequence very long, so the time needed to generate a sequence grows rapidly as the state and action dimensions increase.
To bring trajectory generation models up to a practical decision-making speed, we started this project on efficient trajectory generation and decision-making in parallel with Diffuser (overlapping in time, though starting somewhat later). Our first idea was to fit the full trajectory distribution in continuous space with a Transformer producing a Gaussian mixture rather than a discrete distribution. Although implementation issues cannot be ruled out, we were unable to obtain a reasonably stable generative model with this approach. We then tried a Variational Autoencoder (VAE) and made some progress, but the VAE's reconstruction accuracy was not ideal, leaving a clear gap in downstream control performance compared to TT. After several rounds of iteration, we finally settled on VQ-VAE as the base model for trajectory generation, obtaining a new algorithm that can sample and plan efficiently and performs far better than other model-based methods on high-dimensional control tasks. We call it the Trajectory Autoencoding Planner (TAP).
- Project homepage: https://sites.google.com/view/latentplan
- Paper homepage: https://arxiv.org/abs/2208.10291
On a single GPU, TAP can comfortably make online decisions at 20 Hz; on the low-dimensional D4RL tasks its decision latency is only around 1% of TT's. More importantly, as the state and action dimensionality D of a task grows, TT's theoretical decision latency grows roughly cubically (TT turns every state and action dimension into its own token, so the sequence length grows linearly with D, and generating such a sequence token by token with quadratic attention compounds the cost), Diffuser's latency grows linearly, while TAP's planning speed is essentially unaffected by the dimensionality. In terms of decision performance, TAP's advantage over other methods grows as the action dimension increases, and the gap over model-based methods such as TT is especially pronounced.
The importance of decision latency for decision-making and control tasks is obvious. Algorithms such as MuZero perform well in simulated environments, but for real-world tasks that demand real-time, rapid responses, excessive decision latency becomes a major obstacle to deployment. In addition, even when a simulator is available, slow decision-making drives up the cost of evaluating such algorithms and also makes them expensive to use within online reinforcement learning.
We also believe that allowing sequence-generation methods to scale smoothly to higher-dimensional tasks is an important contribution of TAP. In the real world, most of the problems we ultimately hope reinforcement learning can solve have higher state and action dimensions. For autonomous driving, for example, the inputs from the various sensors are unlikely to be fewer than 100 dimensions even after preprocessing at the perception level. Complex robot control likewise often involves a high-dimensional action space: the human body has roughly 240 joint degrees of freedom, which corresponds to an action space of at least 240 dimensions, and a robot as dexterous as a human would require an equally high-dimensional action space.
Four sets of tasks with gradually increasing dimensions
Changes in decision latency and relative model performance as task dimensions grow
Method Overview
First, we train the autoencoder part of VQ-VAE. It differs from the original VQ-VAE in two ways. The first difference is that both the encoder and the decoder are Causal Transformers rather than CNNs. The second is that we learn a conditional distribution: every modeled trajectory must start from the current state. The autoencoder thus learns a bidirectional mapping between trajectories starting from the current state and sequences of latent codes. These latent codes are arranged in the same chronological order as the original trajectory, and each latent code corresponds to an L-step segment of the trajectory. Because the networks are Causal Transformers, information flows only forward in time: the earlier part of a decoded trajectory does not depend on latent codes that come later in the sequence. This allows TAP to decode a trajectory of length N·L from just the first N latent codes, which is very useful in the planning stage described below.
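To make the mapping concrete, here is a minimal, heavily simplified sketch of the trajectory-autoencoding idea: the trajectory is split into chunks of L transition steps, each chunk is encoded into a continuous vector, quantized to its nearest codebook entry (the VQ step), and the decoder reconstructs the chunk from that code. All module names and sizes are illustrative assumptions, and plain MLPs stand in for the Causal Transformers (and the conditioning on the current state) that TAP actually uses.

```python
import torch
import torch.nn as nn

class ChunkedTrajectoryVQAE(nn.Module):
    """Toy VQ autoencoder over L-step trajectory chunks (illustrative only)."""
    def __init__(self, step_dim, chunk_len, code_dim=64, codebook_size=512):
        super().__init__()
        self.chunk_len = chunk_len
        in_dim = step_dim * chunk_len
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def quantize(self, z):
        # Nearest codebook entry for each continuous latent (the "VQ" step).
        dists = torch.cdist(z, self.codebook.weight)   # (N, codebook_size)
        idx = dists.argmin(dim=-1)                      # discrete latent codes
        return self.codebook(idx), idx

    def forward(self, traj):
        # traj: (B, T, step_dim) with T divisible by chunk_len.
        B, T, D = traj.shape
        chunks = traj.reshape(B * (T // self.chunk_len), self.chunk_len * D)
        z = self.encoder(chunks)
        z_q, idx = self.quantize(z)
        # Straight-through estimator so gradients still reach the encoder.
        z_st = z + (z_q - z).detach()
        recon = self.decoder(z_st).reshape(B, T, D)
        return recon, idx.reshape(B, T // self.chunk_len)
```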
We then use a second, GPT-2-style Transformer to model the conditional probability distribution of these latent codes, i.e. an autoregressive prior of the form p(x_1, …, x_M | s) = ∏_i p(x_i | x_1, …, x_{i-1}, s), where s is the current state.
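As a hedged illustration of what such a prior can look like, the snippet below defines a small causal Transformer over discrete code indices, conditioned on an embedding of the current state, and samples code sequences from it autoregressively. The layer sizes, the BOS token, and the way the state is injected are assumptions made for this sketch, not TAP's exact architecture.

```python
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Autoregressive prior p(x_1..x_M | s) over discrete latent codes (sketch)."""
    def __init__(self, codebook_size, state_dim, d_model=128, n_layers=4, max_len=16):
        super().__init__()
        self.code_emb = nn.Embedding(codebook_size + 1, d_model)  # +1 for a BOS token
        self.state_proj = nn.Linear(state_dim, d_model)
        self.pos_emb = nn.Embedding(max_len + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)
        self.bos = codebook_size

    def logits(self, state, codes):
        # Prepend a BOS token and add the state embedding to every position.
        bos = torch.full((codes.shape[0], 1), self.bos, dtype=torch.long,
                         device=codes.device)
        tok = torch.cat([bos, codes], dim=1)
        h = self.code_emb(tok) + self.state_proj(state).unsqueeze(1)
        h = h + self.pos_emb(torch.arange(tok.shape[1], device=tok.device))
        mask = nn.Transformer.generate_square_subsequent_mask(tok.shape[1]).to(tok.device)
        h = self.backbone(h, mask=mask)   # causal self-attention
        return self.head(h)               # (B, len+1, codebook_size)

    @torch.no_grad()
    def sample(self, state, n_codes):
        codes = torch.zeros(state.shape[0], 0, dtype=torch.long, device=state.device)
        for _ in range(n_codes):
            next_logits = self.logits(state, codes)[:, -1]
            nxt = torch.multinomial(next_logits.softmax(-1), 1)
            codes = torch.cat([codes, nxt], dim=1)
        return codes
```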
When making decisions, we can search for the best future trajectory by optimizing in the latent space rather than in the original action space. A very simple but effective approach is to sample latent-code sequences directly from the learned prior, decode them, and then select the best-performing trajectory, as shown below:
The objective score used to select the optimal trajectory considers both the expected return of the trajectory (predicted rewards plus the value estimate at the last step) and the feasibility, i.e. the probability, of the trajectory itself. When a trajectory's probability under the learned prior is above a threshold, the trajectory is judged by its expected return; otherwise its score is its own probability shifted down by a constant much larger than the highest achievable return, so the probability term dominates. In other words, among all trajectories whose probability exceeds the threshold, TAP selects the one with the highest expected return.
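A minimal sketch of this selection rule, with hypothetical function and variable names and illustrative threshold/penalty values: candidates whose log-probability under the latent prior exceeds the threshold are ranked by their predicted return, while the rest are pushed below all feasible candidates by a constant much larger than any achievable return.

```python
import numpy as np

def trajectory_scores(returns, log_probs, log_prob_threshold, penalty=1e4):
    """Piecewise objective sketched above (names and values are illustrative).

    returns:   predicted return of each candidate (sum of rewards + final value estimate)
    log_probs: log-probability of each candidate under the learned latent prior
    penalty:   a constant much larger than any achievable return, so every
               below-threshold candidate ranks below every feasible one
    """
    feasible = log_probs > log_prob_threshold
    # Feasible candidates are ranked by return; infeasible ones only by probability.
    return np.where(feasible, returns, -penalty + log_probs)

# Toy usage with four sampled candidates: the third has a huge predicted return
# but is implausible under the prior, so it is not selected.
returns = np.array([120.0, 95.0, 300.0, 80.0])
log_probs = np.array([-2.1, -1.0, -9.7, -3.0])
scores = trajectory_scores(returns, log_probs, log_prob_threshold=-5.0)
best = int(np.argmax(scores))   # -> 0
```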
When the number of samples is large enough and the prediction horizon is short, direct sampling can also be very effective. However, under a fixed budget of samples and total planning time, a better optimizer leads to better performance. The following two animations show the difference between trajectories produced by direct sampling and by beam search when predicting 144 steps into the future. The trajectories are sorted by their final objective score: those in front have higher scores, those stacked behind have lower scores, and low-scoring trajectories are also drawn with lower transparency.
In the animations we can see that many of the trajectories generated by direct sampling have unstable dynamics that do not obey physical laws; in particular, the fainter trajectories in the background are almost all floating away. These are trajectories with relatively low probability, and they are eliminated when the final plan is selected. The trajectories in the front row look dynamically plausible, but their performance is relatively poor, and the agent seems about to fall. In contrast, beam search dynamically accounts for the probability of the trajectory as it expands each new latent code, so that branches with very low probability are terminated early and the candidate trajectories concentrate around those that both perform well and remain likely.
Direct sampling
Beam search
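To make the search concrete, below is a minimal sketch of probability-pruned beam search in the latent-code space, reusing the LatentPrior interface from the earlier sketch. It is an illustrative assumption, not TAP's exact procedure: here candidates are pruned purely by accumulated log-probability, whereas TAP additionally decodes the surviving candidates and ranks them with the return/probability objective above.

```python
import torch

@torch.no_grad()
def latent_beam_search(prior, state, n_codes, beam_width=64, expand_per_beam=4):
    """Probability-pruned beam search over latent codes (illustrative sketch).

    `prior` follows the LatentPrior interface sketched earlier; `state` is a
    (1, state_dim) tensor. Only the `beam_width` partial code sequences with
    the highest accumulated log-probability survive each step, so implausible
    branches are cut early.
    """
    beams = torch.zeros(1, 0, dtype=torch.long, device=state.device)  # one empty prefix
    beam_logps = torch.zeros(1, device=state.device)
    for _ in range(n_codes):
        # Log-probabilities of every possible next code, for every current beam.
        logp = prior.logits(state.expand(beams.shape[0], -1), beams)[:, -1].log_softmax(-1)
        top_logp, top_idx = logp.topk(expand_per_beam, dim=-1)
        # Expand each beam with its top `expand_per_beam` next codes.
        cand = torch.cat([beams.repeat_interleave(expand_per_beam, dim=0),
                          top_idx.reshape(-1, 1)], dim=1)
        cand_logps = beam_logps.repeat_interleave(expand_per_beam) + top_logp.reshape(-1)
        # Prune: keep only the most probable partial sequences.
        keep = cand_logps.topk(min(beam_width, cand_logps.shape[0])).indices
        beams, beam_logps = cand[keep], cand_logps[keep]
    return beams, beam_logps
```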
Experimental results

Without any more advanced value estimation or policy improvement, relying purely on the accuracy of its predictions, TAP achieves performance comparable to other offline reinforcement learning methods on low-dimensional tasks:
gym locomotion control
On high-dimensional tasks, TAP performs far better than other model-based methods and also outperforms common model-free methods. This leaves two open questions that we have not yet answered: first, why previous model-based methods performed poorly on these high-dimensional offline reinforcement learning tasks, and second, why TAP can outperform many model-free methods on them. One of our hypotheses is that it is very hard to optimize a policy in a high-dimensional problem while simultaneously keeping the policy from deviating too far from the behavior policy; when a model is learned, errors in the model itself may amplify this difficulty. TAP moves the optimization into a small, discrete latent space, which makes the whole optimization process more robust.
Adroit robotic hand control

Ablation studies
We also ran a series of ablation studies on the gym locomotion control tasks for many of TAP's design choices. The first concerns the number of trajectory steps that each latent code actually corresponds to (yellow histogram): letting one latent code cover a multi-step transition not only brings computational benefits but also improves the final model's performance. We then tested TAP with direct sampling instead of beam search (green histogram). Note that here 2048 samples are drawn, whereas the animation above uses only 256, and the animation plans 144 steps into the future while our base model plans only 15 steps ahead. The conclusion is that direct sampling can match beam search when the number of samples is sufficient and the planning horizon is not long, but only when sampling from the learned conditional distribution over latent codes; sampling latent codes uniformly at random still ends up far worse than the full TAP model. By adjusting the threshold that triggers the low-probability penalty in the search objective (red histogram), we also confirmed that both components of the objective function contribute to the final performance. Finally, the number of steps planned into the future (the planning horizon, blue histogram) has little impact on performance: even when only a single latent code is expanded in the search after deployment, the agent's performance drops by only about 10%.