The arXiv paper "Wayformer: Motion Forecasting via Simple & Efficient Attention Networks", uploaded in July 2022, is the work of Google Waymo.
Motion forecasting for autonomous driving is a challenging task because complex driving scenarios produce a heterogeneous mix of static and dynamic inputs. How best to represent and fuse historical information about road geometry, lane connectivity, time-varying traffic light states, and a dynamic set of agents and their interactions into an efficient encoding is an unsolved problem. To model this diverse set of input features, many existing approaches design equally complex systems built from different modality-specific modules. The resulting systems are difficult to scale, extend, or tune in a rigorous way to trade off quality and efficiency.
Wayformer, the subject of this paper, is a family of simple and homogeneous attention-based architectures for motion forecasting. It offers a compact model description consisting of an attention-based scene encoder and a decoder. For the scene encoder, the paper studies the choice between early fusion, late fusion and hierarchical fusion of the input modalities. For each fusion type, it explores strategies that trade off efficiency and quality through factorized attention or latent query attention. The early fusion structure is simple and modality-agnostic, yet achieves state-of-the-art results on both the Waymo Open Motion Dataset (WOMD) and the Argoverse leaderboard.
A driving scenario consists of multi-modal data, such as road information, traffic light states, agent history and agent interactions. Each modality has a context dimension as its fourth dimension, which represents the "set of contextual objects" for each modeled agent (for example, representations of other road users).
Agent history contains a sequence of past agent states along with the current state. For each time step, it holds the features that define the agent's state, such as x, y, velocity, acceleration and bounding box, as well as a context dimension.
The interaction tensor represents the relationships between agents. For each modeled agent, a fixed number of the nearest neighboring agents are considered. These context agents are the agents that influence the behavior of the modeled agent.
The road map contains the road features around the agent. Road map segments are represented as polylines: collections of line segments, specified by their endpoints and annotated with type information, that approximate the shape of the road. Only the road map segments closest to the modeled agent are used. Note that road features have no time dimension; a time dimension of size 1 can be added for consistency with the other modalities.
For each agent, the traffic light information contains the states of the traffic signals closest to that agent. Each traffic signal point has features describing the signal's position and confidence.
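A small sketch of how these four inputs might be laid out, assuming each modality is stored as a 4-D array of shape (agents, time steps, context elements, features). All sizes and dictionary keys below are hypothetical and chosen only for illustration; the text above does not specify them.

```python
# Hypothetical sizes: 8 modeled agents, 11 past + current time steps.
A, T = 8, 11

# Assumed layout per modality: (agents, time steps, context elements, features).
shapes = {
    "agent_history": (A, T, 1, 7),        # the modeled agent itself
    "agent_interactions": (A, T, 16, 7),  # 16 nearest neighboring agents
    "road_map": (A, 1, 128, 6),           # static: time dimension of size 1
    "traffic_lights": (A, T, 4, 5),       # 4 nearest signal points
}

def tokens_per_agent(shape):
    """Sequence length seen by an attention encoder for one modeled agent."""
    _, time_steps, context, _ = shape
    return time_steps * context

tokens = {name: tokens_per_agent(shape) for name, shape in shapes.items()}
```

Even with these toy sizes, the modalities differ widely in sequence length, which is what makes the choice of fusion strategy and attention variant matter.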
The Wayformer model family consists of two main components: a scene encoder and a decoder. The scene encoder consists of one or more attention encoders that summarize the driving scene. The decoder is a stack of one or more standard transformer cross-attention blocks, which take learned initial queries as input and cross-attend to the scene encoding to generate trajectories.
As shown in the figure, the Wayformer model processes the multi-modal inputs to produce a scene encoding. This scene encoding serves as the context for the decoder, which generates k possible trajectories covering the multi-modality of the output space.
The diversity of inputs to the scene encoder makes this integration a non-trivial task. Modalities may not be represented at the same abstraction level or scale (for example, pixels versus objects), so some modalities may require more computation than others. How to split computation between modalities is application dependent and important to engineer correctly. Three levels of fusion are proposed to simplify this process: late, early and hierarchical, as shown in the figure:
Late fusion: This is the most common approach for motion forecasting models, in which each modality has its own dedicated encoder. Setting the widths of these encoders to be equal avoids introducing extra projection layers at their outputs. Furthermore, sharing the same depth across all encoders narrows the exploration space to a manageable size. Information is only allowed to flow across modalities in the cross-attention layers of the trajectory decoder.
Early fusion: Instead of dedicating a self-attention encoder to each modality, early fusion reduces the modality-specific parameters to the projection layers only. The scene encoder in the figure then consists of a single self-attention encoder (the "cross-modal encoder"), giving the network maximum flexibility in assigning importance across modalities with minimal inductive bias.
Hierarchical fusion: As a compromise between the two previous extremes, capacity is split hierarchically between modality-specific self-attention encoders and a cross-modal encoder. As in late fusion, width and depth are shared across the modality-specific attention encoders and the cross-modal encoder. This effectively splits the depth of the scene encoder between the modality-specific encoders and the cross-modal encoder.
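The three fusion levels differ only in where tokens from different modalities are first allowed to attend to each other. The NumPy sketch below illustrates that difference under strong simplifications: one agent, a single attention layer per encoder, identity query/key/value projections, and made-up token counts and feature sizes. All function names are hypothetical and this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared model width (hypothetical)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections, for brevity."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

# Tokens for one agent, per modality: (num_tokens, feature_dim). Sizes are made up.
modalities = {
    "agent_history": rng.normal(size=(11, 7)),
    "road_map": rng.normal(size=(128, 6)),
    "traffic_lights": rng.normal(size=(44, 5)),
}

# Modality-specific projection layers bring every modality to the same width D.
proj = {m: rng.normal(size=(x.shape[1], D)) / np.sqrt(x.shape[1])
        for m, x in modalities.items()}
projected = {m: x @ proj[m] for m, x in modalities.items()}

def early_fusion(tokens):
    # Concatenate first; one cross-modal encoder sees every token.
    return self_attention(np.concatenate(list(tokens.values()), axis=0))

def late_fusion(tokens):
    # Each modality is encoded on its own; modalities only mix in the decoder.
    return np.concatenate([self_attention(x) for x in tokens.values()], axis=0)

def hierarchical_fusion(tokens):
    # Modality-specific encoders first, then a cross-modal encoder on top.
    return self_attention(late_fusion(tokens))

early = early_fusion(projected)
late = late_fusion(projected)
hier = hierarchical_fusion(projected)
```

All three variants return a scene encoding of the same shape (183 tokens of width 16 here), so the decoder can stay unchanged while the fusion strategy is swapped.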
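Transformer networks do not scale well to large multi-dimensional sequences, for two reasons: self-attention is quadratic in the input sequence length, and the position-wise feed-forward networks are expensive sub-networks.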
The methods used to reduce this cost are discussed below (S is the spatial dimension and T is the temporal dimension); the overall framework is shown in the figure:
Multi-axis attention: This refers to the default transformer setting, which applies self-attention across the spatial and temporal dimensions jointly, and is expected to be the most computationally expensive. The computational complexity of early, late and hierarchical fusion with multi-axis attention is $O(S_m^2 \times T^2)$.
Factorized attention: The computational complexity of self-attention is quadratic in the input sequence length. This becomes even more pronounced for multi-dimensional sequences, since each additional dimension increases the size of the input by a multiplicative factor. For example, some input modalities have both time and space dimensions, so the computational cost scales as $O(S_m^2 \times T^2)$. To alleviate this, attention can be factorized along the two dimensions. This method exploits the multi-dimensional structure of the input sequence and reduces the cost of the self-attention sub-network from $O(S^2 \times T^2)$ to $O(S^2) + O(T^2)$ by applying self-attention over each dimension individually.
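To make the saving concrete, the following sketch counts attention scores (query-key pairs) for one modality under both schemes. The sizes are hypothetical. Unlike the asymptotic expression above, which treats the other axis as a batch dimension, the factorized count here includes that batch factor explicitly.

```python
def multi_axis_pairs(S, T):
    """Attention scores when all S*T tokens attend to each other."""
    return (S * T) ** 2

def factorized_pairs(S, T):
    """Attention scores when attending over time and over space separately."""
    temporal = S * T ** 2  # one T x T attention map per spatial element
    spatial = T * S ** 2   # one S x S attention map per time step
    return temporal + spatial

S, T = 16, 11  # hypothetical spatial and temporal sizes
full = multi_axis_pairs(S, T)  # (16 * 11)^2 = 30976
fact = factorized_pairs(S, T)  # 16 * 121 + 11 * 256 = 4752
```

For these sizes factorized attention computes roughly 6.5 times fewer scores, and the gap widens as either dimension grows.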
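While factorized attention has the potential to reduce computation compared to multi-axis attention, it introduces a new design choice: the order in which self-attention is applied to each dimension. Two factorized attention paradigms are compared: sequential attention, in which an N-layer encoder consists of N/2 temporal encoder blocks followed by N/2 spatial encoder blocks, and interleaved attention, in which temporal and spatial encoder blocks alternate N/2 times.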
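A minimal NumPy sketch of factorized attention under two block orderings, here called "sequential" (all temporal blocks, then all spatial blocks) and "interleaved" (alternating). It assumes an even number of layers, an input of shape (T, S, D), and attention blocks without learned projections, residuals or feed-forward layers; the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Self-attention over the second-to-last axis; leading axes act as batch."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def temporal_block(x):  # x: (T, S, D)
    # Treat S as batch: attend across the T time steps of each spatial element.
    return np.swapaxes(attend(np.swapaxes(x, 0, 1)), 0, 1)

def spatial_block(x):  # x: (T, S, D)
    # Treat T as batch: attend across the S elements at each time step.
    return attend(x)

def block_order(num_layers, mode):
    half = num_layers // 2
    if mode == "sequential":
        return ["temporal"] * half + ["spatial"] * half
    if mode == "interleaved":
        return ["temporal", "spatial"] * half
    raise ValueError(f"unknown mode: {mode}")

def encode(x, num_layers, mode):
    blocks = {"temporal": temporal_block, "spatial": spatial_block}
    for name in block_order(num_layers, mode):
        x = blocks[name](x)
    return x

x = np.random.default_rng(0).normal(size=(11, 16, 8))  # (T, S, D), made-up sizes
seq_out = encode(x, num_layers=4, mode="sequential")
int_out = encode(x, num_layers=4, mode="interleaved")
```

Both orderings use the same number of temporal and spatial blocks and therefore the same compute; only the order in which information mixes across time and space differs.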
Latent query attention: Another way to address the computational cost of large input sequences is to use latent queries in the first encoder block, where the input is mapped to a latent space. These latents are then processed by a series of encoder blocks that take in and return arrays in this latent space. This gives full freedom in setting the latent space resolution, reducing the computational cost of both the self-attention component and the position-wise feed-forward network in each block. The reduction $R = L_{out}/L_{in}$ is set as a percentage of the input sequence length. In late and hierarchical fusion, the reduction factor R is kept the same across all attention encoders.
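The idea can be sketched as a single cross-attention step in which a short array of latent queries attends to the full input sequence. The sketch below assumes single-head attention without learned key/value projections, random values standing in for the learned latents, and hypothetical sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_query_block(inputs, latents):
    """Cross-attention: latents are the queries, inputs are keys and values."""
    scores = latents @ inputs.T / np.sqrt(inputs.shape[-1])  # (L_out, L_in)
    return softmax(scores) @ inputs                           # (L_out, D)

L_in, D = 256, 16       # hypothetical input length and width
R = 0.25                # reduction R = L_out / L_in
L_out = int(R * L_in)   # 64 latent tokens

inputs = rng.normal(size=(L_in, D))
latents = rng.normal(size=(L_out, D))  # learned parameters in a real model

z = latent_query_block(inputs, latents)

first_block_scores = L_out * L_in  # cross-attention in the first block
later_block_scores = L_out ** 2    # self-attention on latents in later blocks
full_scores = L_in ** 2            # self-attention on the raw input
```

With R = 0.25, every block after the first computes 16 times fewer attention scores than self-attention on the full input, and the feed-forward network runs on 4 times fewer tokens.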
The Wayformer predictor outputs a mixture of Gaussians representing the possible trajectories the agent may take. To generate predictions, a transformer decoder is used, which takes a set of k learned initial queries $S_i$ as input and cross-attends them with the scene embeddings from the encoder to produce an embedding for each component of the Gaussian mixture.
Given the embedding of a specific component of the mixture, a linear projection layer produces the unnormalized log-likelihood of that component, from which the full mixture likelihood is estimated. To generate the trajectory, another linear projection layer outputs 4 time series, corresponding to the means and log standard deviations of the predicted Gaussian at each time step.
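The decoder and its two output heads can be sketched as follows. This is a simplified illustration assuming a single cross-attention step, random weights standing in for learned parameters, and hypothetical sizes; the four output channels are read as the x and y means followed by the x and y log standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: modes, width, scene tokens, future time steps.
k, h, L, T_future = 6, 16, 64, 80

scene = rng.normal(size=(L, h))    # scene encoding from the encoder
queries = rng.normal(size=(k, h))  # learned initial queries in a real model

# Cross-attention: each query summarizes the scene into one component embedding.
attn = softmax(queries @ scene.T / np.sqrt(h))  # (k, L)
embeddings = attn @ scene                       # (k, h)

# Two linear heads (random stand-ins for learned weights).
W_logit = rng.normal(size=(h,)) * 0.1
W_traj = rng.normal(size=(h, T_future * 4)) * 0.1

logits = embeddings @ W_logit  # unnormalized log-likelihoods, shape (k,)
mode_probs = softmax(logits)   # mixture weights

params = (embeddings @ W_traj).reshape(k, T_future, 4)
mean_xy = params[..., :2]      # Gaussian means per time step
log_std_xy = params[..., 2:]   # log standard deviations per time step
std_xy = np.exp(log_std_xy)    # always positive
```

Predicting the log of the standard deviation rather than the standard deviation itself keeps the scale positive without any explicit constraint on the linear layer.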
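During training, the loss is decomposed into separate classification and regression losses. Given k predicted Gaussians, the mixture likelihoods are trained to select the Gaussian whose mean is closest to the ground-truth trajectory, and that Gaussian is trained to maximize the log-probability of the ground-truth trajectory.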
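A sketch of such a decomposed loss for one agent, assuming the common recipe in which the component with the mean closest to the ground truth is selected, the classification term is the cross-entropy of that index, and the regression term is the negative log-likelihood under a diagonal Gaussian. The function name and the equal weighting of the two terms are assumptions.

```python
import numpy as np

def mixture_loss(logits, mean, log_std, gt):
    """logits: (k,), mean and log_std: (k, T, 2), gt: (T, 2)."""
    # Select the component whose mean trajectory is closest to the ground truth.
    dist = np.sum((mean - gt) ** 2, axis=(1, 2))
    i_hat = int(np.argmin(dist))

    # Classification: negative log-softmax probability of the selected index.
    m = np.max(logits)
    log_probs = logits - (m + np.log(np.sum(np.exp(logits - m))))
    cls_loss = -log_probs[i_hat]

    # Regression: negative log-likelihood of gt under the selected Gaussian.
    z = (gt - mean[i_hat]) / np.exp(log_std[i_hat])
    reg_loss = np.sum(0.5 * z ** 2 + log_std[i_hat] + 0.5 * np.log(2 * np.pi))

    return cls_loss + reg_loss, i_hat

# Toy usage: 3 modes, 2 time steps; mode 1 matches the ground truth exactly.
mean = np.stack([np.zeros((2, 2)), np.ones((2, 2)), 5 * np.ones((2, 2))])
loss, i_hat = mixture_loss(np.zeros(3), mean, np.zeros((3, 2, 2)), np.ones((2, 2)))
```

Only the selected component receives a regression gradient, which lets the other components specialize on different modes instead of collapsing to the average trajectory.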
If the predictor outputs a Gaussian mixture with many modes, the output can be difficult to reason about, and benchmark metrics often restrict the number of trajectories considered. Therefore, during evaluation, trajectory aggregation is applied, reducing the number of modes considered while still preserving the diversity of the original output mixture.
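The text does not describe the aggregation procedure itself. Purely to illustrate the goal (fewer modes, preserved diversity), the sketch below uses a simple greedy merge in the style of non-maximum suppression on trajectory endpoints. It is a stand-in, not the paper's method; the function name, the endpoint distance and the radius are all assumptions.

```python
import numpy as np

def aggregate(probs, endpoints, radius, max_modes):
    """Greedy merge. probs: (k,), endpoints: (k, 2). Returns kept indices and weights."""
    remaining = [int(i) for i in np.argsort(-probs, kind="stable")]
    kept, kept_probs = [], []
    while remaining and len(kept) < max_modes:
        i = remaining.pop(0)  # most likely mode still available
        near = [j for j in remaining
                if np.linalg.norm(endpoints[j] - endpoints[i]) <= radius]
        kept.append(i)
        # Absorb the probability mass of nearby, lower-ranked modes.
        kept_probs.append(float(probs[i] + sum(probs[j] for j in near)))
        remaining = [j for j in remaining if j not in near]
    total = sum(kept_probs)
    return kept, [p / total for p in kept_probs]

# Toy usage: two tight clusters of endpoints collapse into two modes.
probs = np.array([0.4, 0.3, 0.2, 0.1])
endpoints = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.2, 0.0]])
kept, weights = aggregate(probs, endpoints, radius=1.0, max_modes=2)
```

In this toy case modes 0 and 2 are kept with weights of about 0.7 and 0.3, so the two distinct outcomes survive while the near-duplicates are merged.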
The experimental results are as follows:
Factorized attention
Latent query attention