
Wayformer: A simple and effective attention network for motion prediction


The arXiv paper "Wayformer: Motion Forecasting via Simple & Efficient Attention Networks", posted in July 2022, is work from Waymo.


Motion prediction for autonomous driving is a challenging task because complex driving scenarios yield a mix of static and dynamic inputs. How best to represent and fuse heterogeneous information about road geometry, lane connectivity, time-varying traffic light state, and a dynamic set of agents and their interactions into an effective encoding remains an open problem. To model this diverse set of input features, many approaches design equally complex systems with different sets of modality-specific modules. The resulting systems are difficult to scale, extend, or trade off between quality and efficiency in a rigorous way.

Wayformer, introduced in this paper, is a family of simple and homogeneous attention-based motion prediction architectures. Wayformer offers a compact model description consisting of an attention-based scene encoder and a decoder. In the scene encoder, the choice among pre-fusion, post-fusion, and hierarchical fusion of the input modalities is studied. For each fusion type, strategies that trade off efficiency and quality through factorized attention or latent query attention are explored. The pre-fusion structure is not only simple and modality-agnostic, but also achieves state-of-the-art results on both the Waymo Open Motion Dataset (WOMD) and the Argoverse leaderboards.

A driving scenario consists of multi-modal data, such as road information, traffic light state, agent history, and interactions. Each modality carries a fourth, "context" dimension, representing the set of context agents around each modeled agent (i.e., representations of other road users).

Agent history contains a sequence of past agent states together with the current state. For each time step, features that define the agent's state are considered, such as x, y, velocity, acceleration, and bounding box, along with the context dimension.

The interaction tensor represents the relationships between agents. For each modeled agent, a fixed number of nearest-neighbor context agents around it are considered; these are the agents that influence the modeled agent's behavior.

The road map contains road features around the agent. Road map segments are represented as polylines, collections of segments specified by their endpoints and annotated with type information, which approximate the shape of the road. The road map segments closest to the modeled agent are used. Note that road features have no time dimension; a time dimension of 1 is added for consistency.

For each agent, traffic light information contains the state of the traffic signals closest to that agent. Each traffic signal point has features describing the signal's position and confidence.
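
To make the input layout concrete, here is a minimal sketch of the four modalities as tensors. The shapes follow the [agents, time, context, features] framing described above; all sizes and the shared feature width are hypothetical placeholders, not the paper's actual dimensions.

```python
import torch

A = 8      # number of modeled agents (hypothetical)
T = 11     # history timesteps, past plus current (hypothetical)
D = 64     # shared feature width after projection (hypothetical)

# Agent history: one entry per modeled agent, so the context size is 1.
agent_history = torch.randn(A, T, 1, D)      # [A, T, S=1, D]

# Interactions: the S_i nearest context agents per modeled agent.
S_i = 16
interactions = torch.randn(A, T, S_i, D)     # [A, T, S_i, D]

# Road map: the S_r closest polyline segments; no time dimension,
# so a time dimension of 1 is added as the text notes.
S_r = 128
roadmap = torch.randn(A, 1, S_r, D)          # [A, T=1, S_r, D]

# Traffic lights: the S_t closest signal points per agent, over time.
S_t = 4
traffic_lights = torch.randn(A, T, S_t, D)   # [A, T, S_t, D]
```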

The Wayformer model family consists of two main components: a scene encoder and a decoder. The scene encoder consists of one or more attention encoders that summarize the driving scene. The decoder is a stack of one or more standard transformer cross-attention blocks, which take learned initial queries as input and cross-attend to the scene encoding to produce trajectories.

As shown in the figure, the Wayformer model processes multi-modal inputs to produce a scene encoding. This scene encoding serves as the context for the decoder, which generates k possible trajectories covering the multimodality of the output space.

[Figure: Wayformer scene encoder and trajectory decoder]

The diversity of inputs makes this integration a non-trivial task for the scene encoder. Modalities may not be represented at the same abstraction level or scale ({pixels vs. objects}), so some modalities may require more computation than others. How to split computation across modalities is application-dependent and an important engineering decision. To simplify this process, three fusion levels are proposed, {post, pre, hierarchical}, as shown in the figure:

[Figure: post-fusion, pre-fusion, and hierarchical fusion of input modalities]

Post-fusion is the most common approach in motion prediction models: each modality has its own dedicated encoder. Setting these encoders to equal width avoids introducing extra projection layers at the output, and sharing the same depth across all encoders shrinks the exploration space to a manageable size. Information is allowed to cross modalities only in the cross-attention layers of the trajectory decoder.

Pre-fusion: instead of dedicating a self-attention encoder to each modality, modality-specific parameters are reduced to the projection layers. The scene encoder then consists of a single self-attention encoder (the "cross-modal encoder"), giving the network maximum flexibility in assigning importance across modalities with minimal inductive bias.

Hierarchical fusion: as a compromise between the two extremes, capacity is split hierarchically between modality-specific self-attention encoders and the cross-modal encoder. As in post-fusion, width and depth are shared across the attention encoders and the cross-modal encoder. This effectively splits the depth of the scene encoder between modality-specific encoders and the cross-modal encoder.
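
The three fusion levels can be summarized in a short sketch. This is a hedged illustration assuming each modality has already been projected to a shared width D and flattened to [batch, length, D]; the encoder stacks, class names, and hyperparameters are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

def make_encoder(depth, D=64, heads=4):
    """A stack of vanilla transformer self-attention layers."""
    layer = nn.TransformerEncoderLayer(D, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class PostFusion(nn.Module):
    """One dedicated encoder per modality; cross-modal mixing is left
    to the trajectory decoder's cross-attention."""
    def __init__(self, n_modalities, depth=4):
        super().__init__()
        self.encoders = nn.ModuleList([make_encoder(depth) for _ in range(n_modalities)])
    def forward(self, modalities):               # list of [B, L_m, D]
        return torch.cat([enc(x) for enc, x in zip(self.encoders, modalities)], dim=1)

class PreFusion(nn.Module):
    """Concatenate all modalities, then a single cross-modal encoder."""
    def __init__(self, depth=4):
        super().__init__()
        self.encoder = make_encoder(depth)
    def forward(self, modalities):
        return self.encoder(torch.cat(modalities, dim=1))

class HierarchicalFusion(nn.Module):
    """Split the encoder depth between per-modality encoders and a
    cross-modal encoder."""
    def __init__(self, n_modalities, depth=4):
        super().__init__()
        self.per_modality = nn.ModuleList([make_encoder(depth // 2) for _ in range(n_modalities)])
        self.cross_modal = make_encoder(depth - depth // 2)
    def forward(self, modalities):
        fused = torch.cat([enc(x) for enc, x in zip(self.per_modality, modalities)], dim=1)
        return self.cross_modal(fused)
```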

The Transformer network does not scale well to large multi-dimensional sequences due to the following two factors:

  • (a) Self-attention is quadratic in the input sequence length (see the quick calculation below).
  • (b) Position-wise feed-forward networks are expensive subnetworks.
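
A back-of-the-envelope calculation makes factor (a) concrete. The sizes below are made up for illustration: flattening the spatial and temporal axes multiplies the sequence length, and self-attention cost grows with its square.

```python
S, T = 64, 80            # hypothetical spatial and temporal sizes
L = S * T                # flattened multi-axis sequence length: 5120
print(L ** 2)            # ~26.2M attention entries per head per layer
print(S**2 + T**2)       # factorized along each axis: 10,496 -- far smaller
```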

Speed-up techniques are discussed below (S is the spatial dimension, T is the temporal dimension); the framework is shown in the figure:

[Figure: multi-axis, factorized, and latent query attention variants]

Multi-axis attention: this refers to the default transformer setting, applying self-attention jointly across the spatial and temporal dimensions, and is expected to be the most computationally expensive. The computational complexity of pre-, post-, and hierarchical fusion with multi-axis attention is O(S_m² × T²).

Factorized attention: the computational complexity of self-attention is quadratic in the input sequence length. This becomes even more pronounced for multi-dimensional sequences, since each additional dimension increases the input size by a multiplicative factor. For example, some input modalities have both time and space dimensions, so the computational cost scales as O(S_m² × T²). To alleviate this, attention can be factorized along the two dimensions. This approach exploits the multi-dimensional structure of the input sequence and reduces the cost of the self-attention subnetwork from O(S² × T²) to O(S²) + O(T²) by applying self-attention along each dimension individually.

While factorized attention can reduce computation compared to multi-axis attention, it introduces ambiguity about the order in which self-attention is applied along each dimension. Two factorized attention paradigms are compared (see the sketch after this list):

  • Sequential attention: an N-layer encoder consists of N/2 temporal encoder blocks followed by N/2 spatial encoder blocks.
  • Interleaved attention: an N-layer encoder alternates between temporal and spatial encoder blocks N/2 times.
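
The following hedged sketch shows one way to realize factorized attention over a [B, S, T, D] tensor: self-attention is applied along one axis at a time by folding the other axis into the batch, and the two paradigms differ only in block ordering. The class and helper names are my own, not the paper's.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """One transformer layer applied along a single axis ('time' or 'space')."""
    def __init__(self, axis, D=64, heads=4):
        super().__init__()
        self.axis = axis
        self.layer = nn.TransformerEncoderLayer(D, heads, batch_first=True)

    def forward(self, x):                      # x: [B, S, T, D]
        B, S, T, D = x.shape
        if self.axis == "time":                # attend over T for each (B, S)
            return self.layer(x.reshape(B * S, T, D)).reshape(B, S, T, D)
        # attend over S for each (B, T)
        xt = x.transpose(1, 2).reshape(B * T, S, D)
        return self.layer(xt).reshape(B, T, S, D).transpose(1, 2)

def sequential_encoder(n):
    """N/2 temporal blocks followed by N/2 spatial blocks."""
    return nn.Sequential(*[AxisAttention("time") for _ in range(n // 2)],
                         *[AxisAttention("space") for _ in range(n // 2)])

def interleaved_encoder(n):
    """Temporal and spatial blocks alternating N/2 times."""
    blocks = []
    for _ in range(n // 2):
        blocks += [AxisAttention("time"), AxisAttention("space")]
    return nn.Sequential(*blocks)
```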

Latent query attention: another way to address the computational cost of large input sequences is to use latent queries in the first encoder block, where the input is mapped to a latent space. These latents are then processed by a series of encoder blocks that read from and write back to the latent space. This gives complete freedom in setting the latent space resolution, reducing the computational cost of both the self-attention components and the position-wise feed-forward networks in each block. The reduction factor (R = Lout/Lin) is set as a percentage of the input sequence length; in post-fusion and hierarchical fusion, R is kept the same for all attention encoders.
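
A minimal latent-query encoder in the spirit of this description (Perceiver-style) might look as follows. This is a sketch under assumed sizes and names; the real Wayformer implementation may differ.

```python
import torch
import torch.nn as nn

class LatentQueryEncoder(nn.Module):
    def __init__(self, L_in, R=0.25, D=64, heads=4, depth=2):
        super().__init__()
        L_out = int(L_in * R)                    # reduction factor R = L_out / L_in
        self.latents = nn.Parameter(torch.randn(L_out, D))
        self.read_in = nn.MultiheadAttention(D, heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(D, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                        # x: [B, L_in, D]
        q = self.latents.expand(x.size(0), -1, -1)
        z, _ = self.read_in(q, x, x)             # first block maps input to latents
        return self.blocks(z)                    # later blocks stay in latent space
```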

The Wayformer predictor outputs a Gaussian mixture representing the trajectories an agent may take. To generate predictions, a transformer decoder takes a set of k learned initial queries (Si) as input and cross-attends to the encoder's scene embeddings to produce an embedding for each component of the Gaussian mixture.

Given the embedding of a particular mixture component, a linear projection layer produces the component's unnormalized log-likelihood, from which the full mixture likelihood is estimated. To generate trajectories, another linear projection outputs 4 time series corresponding to the mean and log standard deviation of the predicted Gaussian at each time step.
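
The decoder heads can be sketched as below: k learned queries cross-attend to the scene encoding, then two linear heads produce a mixture logit per component and per-step Gaussian parameters (mean x, mean y, log sigma x, log sigma y). Sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, k=6, D=64, heads=4, horizon=80):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, D))      # learned initial queries
        self.cross = nn.MultiheadAttention(D, heads, batch_first=True)
        self.logit_head = nn.Linear(D, 1)                   # unnormalized log-likelihood
        self.traj_head = nn.Linear(D, horizon * 4)          # 4 series per timestep
        self.horizon = horizon

    def forward(self, scene):                               # scene: [B, L, D]
        q = self.queries.expand(scene.size(0), -1, -1)
        emb, _ = self.cross(q, scene, scene)                # [B, k, D]
        logits = self.logit_head(emb).squeeze(-1)           # [B, k]
        params = self.traj_head(emb)                        # [B, k, horizon*4]
        return logits, params.view(*emb.shape[:2], self.horizon, 4)
```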

During training, the loss decomposes into separate classification and regression losses. Given k predicted Gaussians, the mixture likelihood is trained to maximize the log probability of the ground-truth trajectory.
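
One common way to realize such a decomposed mixture loss is hard assignment: classify toward the component closest to the ground truth and regress only that component's Gaussian. This is a hedged sketch of that standard reading, not necessarily Waymo's exact recipe.

```python
import torch
import torch.nn.functional as F

def mixture_loss(logits, params, gt):
    # logits: [B, k]; params: [B, k, T, 4] = (mu_x, mu_y, log_sx, log_sy)
    # gt: [B, T, 2] ground-truth trajectory
    mu, log_sigma = params[..., :2], params[..., 2:]
    # Hard assignment: pick the component with the smallest average
    # displacement error from the ground truth.
    with torch.no_grad():
        err = (mu - gt.unsqueeze(1)).norm(dim=-1).mean(-1)  # [B, k]
        best = err.argmin(dim=1)                            # [B]
    cls_loss = F.cross_entropy(logits, best)                # classification term
    b = torch.arange(gt.size(0))
    mu_b, ls_b = mu[b, best], log_sigma[b, best]            # [B, T, 2]
    # Negative log-likelihood of an axis-aligned Gaussian at each step.
    nll = (ls_b + 0.5 * ((gt - mu_b) / ls_b.exp()) ** 2).sum(-1).mean()
    return cls_loss + nll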

Because the predictor outputs a Gaussian mixture with many modes, inference is difficult, and benchmark metrics typically limit the number of trajectories considered. Therefore, during evaluation, trajectory aggregation is applied to reduce the number of modes considered while preserving the diversity of the original output mixture.
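
As a rough illustration of what aggregation does, the sketch below merges predicted modes whose endpoints fall close together, summing their mixture probabilities. This simple greedy scheme and its distance threshold are stand-ins I chose for illustration, not the paper's aggregation algorithm.

```python
import torch

def aggregate(probs, trajs, n_out=6, radius=2.0):
    # probs: [k]; trajs: [k, T, 2]; returns up to n_out (prob, traj) modes.
    order = probs.argsort(descending=True)
    keep, out_p = [], []
    for i in order.tolist():
        end = trajs[i, -1]
        merged = False
        for j, m in enumerate(keep):
            # Merge into an existing mode if the endpoints are close.
            if (trajs[m, -1] - end).norm() < radius:
                out_p[j] = out_p[j] + probs[i]
                merged = True
                break
        if not merged and len(keep) < n_out:
            keep.append(i)
            out_p.append(probs[i].clone())
        # Leftover low-probability modes are simply dropped in this toy version.
    return torch.stack(out_p), trajs[torch.tensor(keep)]
```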

The experimental results are as follows:

[Results figures omitted; captions: factorized attention, latent query]

