With the development of large language models (LLMs), their integration with 3D spatial data (3D LLMs) has progressed rapidly, offering unprecedented capabilities for understanding and interacting with physical spaces. This article provides a comprehensive overview of approaches that enable LLMs to process, understand, and generate 3D data. We highlight the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and broad world knowledge, and underline their potential to advance spatial understanding and interaction in embodied artificial intelligence (AI) systems. Our survey covers 3D data representations ranging from point clouds to Neural Radiance Fields (NeRF). We analyze their integration with LLMs for tasks such as 3D scene understanding, captioning, question answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also briefly reviews other related approaches that combine 3D and language, highlighting the significant progress made while emphasizing the need to exploit the full potential of 3D LLMs. Through this survey, we aim to chart a path for future research that explores and extends the capabilities of 3D LLMs in understanding and interacting with complex 3D worlds.
Open source link: https://github.com/ActiveVisionLab/Awesome-LLM-3D
Point cloud: A set of data points in space that represents a three-dimensional shape, with the position of each point stored in a three-dimensional Cartesian coordinate system. Beyond position, each point can store additional attributes (e.g., color, normal). Point-cloud-based methods are known for their low storage footprint but lack surface topology information. Typical sources of point clouds include lidar sensors, structured-light scanners, time-of-flight cameras, stereo views, and photogrammetry.
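As a minimal illustration (not tied to any specific method in the survey), a point cloud can simply be stored as an N×6 array of positions and per-point attributes; the sizes and values below are made up:

```python
import numpy as np

# A toy point cloud: N points, each with an (x, y, z) position and an (r, g, b) color.
num_points = 1024
positions = np.random.uniform(-1.0, 1.0, size=(num_points, 3))   # Cartesian coordinates
colors = np.random.randint(0, 256, size=(num_points, 3))          # optional per-point attributes

point_cloud = np.hstack([positions, colors]).astype(np.float32)   # shape: (N, 6)
print(point_cloud.shape)  # (1024, 6)
```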
Voxel grid: Composed of unit cubes in three-dimensional space, analogous to pixels in two-dimensional space. Each voxel minimally encodes occupancy information (binary or probabilistic), but can additionally encode the distance to the nearest surface, as in a signed distance function (SDF) or a truncated signed distance function (TSDF). However, when high-resolution detail is required, the memory footprint can become excessive.
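A small sketch of how a TSDF might be stored in a voxel grid, using a sphere as an illustrative surface (the resolution and truncation value are arbitrary assumptions):

```python
import numpy as np

resolution = 64          # voxels per axis; memory grows cubically with this value
truncation = 0.1         # distances are clipped to +/- truncation (the "T" in TSDF)

# Voxel centers on a regular grid in [-1, 1]^3.
coords = np.linspace(-1.0, 1.0, resolution)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")

# Signed distance to a sphere of radius 0.5 centered at the origin (illustrative surface).
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5
tsdf = np.clip(sdf, -truncation, truncation)   # shape: (64, 64, 64)

occupancy = (sdf < 0).astype(np.uint8)         # binary occupancy variant
print(tsdf.shape, occupancy.sum())
```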
Polygon mesh: Represents shapes with vertices and faces, describing complex three-dimensional geometry compactly. However, its unstructured and non-differentiable nature poses challenges for integration with neural networks in end-to-end differentiable pipelines. Some solutions, such as methods based on gradient approximation, rely on handcrafted gradient calculations. Others, such as differentiable rasterizers, may produce inaccurate rendering results such as blurred content.
In recent years, there has been increasing interest in neural fields within the 3D research community, which differ from traditional representations that rely on geometric elements. Neural fields are mappings from spatial coordinates to scene properties (such as occupancy, color, intensity, etc.), but unlike voxel grids, in a neural field the mapping is a learned function, typically a multi-layer perceptron (MLP). In this way, neural fields implicitly learn continuous and differentiable representations of 3D geometry and scenes.

One group of neural-field methods focuses on implicit surface representations. Occupancy networks encode shapes in a continuous 3D occupancy function represented by a neural network, using 3D point locations and features from point clouds, low-resolution voxels, or images to estimate occupancy probabilities. Meanwhile, deep SDF networks use a neural network to estimate the SDF from 3D coordinates and gradients. Recent methods, such as NeuS and NeuS2, have been shown to improve surface reconstruction fidelity and efficiency for both static and dynamic objects.

Another group of methods, Neural Radiance Fields (NeRF), has shown powerful photorealistic rendering capabilities for 3D worlds. These methods use positional encoding techniques to encode scene details and leverage MLPs to predict the radiance values (color and opacity) along camera rays. However, because the MLP must infer the color and occupancy of every sample point in space (including samples in empty space), significant computational resources are required. There is therefore a strong incentive to reduce the computational overhead of NeRF for real-time applications.

Hybrid representations attempt to combine NeRF techniques with traditional volume-based methods to enable high-quality real-time rendering. For example, combining voxel grids or multi-resolution hash grids with neural networks significantly reduces NeRF training and inference times.

3D Gaussian Splatting is a variation of point clouds in which each point carries additional information representing the radiance emitted in the surrounding region of space as an anisotropic 3D Gaussian "splat". These 3D Gaussians are typically initialized from SfM point clouds and optimized using differentiable rendering. 3D Gaussian Splatting enables state-of-the-art novel view synthesis at a fraction of NeRF's computation by leveraging efficient rasterization instead of ray tracing.
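To make the idea of a learned coordinate-to-property mapping concrete, here is a minimal NeRF-style MLP sketch in PyTorch; the positional-encoding frequencies, layer sizes, and sampling are illustrative assumptions, not those of any specific paper:

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    # Map (x, y, z) to sin/cos features at increasing frequencies.
    feats = [x]
    for i in range(num_freqs):
        feats += [torch.sin((2.0 ** i) * math.pi * x), torch.cos((2.0 ** i) * math.pi * x)]
    return torch.cat(feats, dim=-1)

class TinyRadianceField(nn.Module):
    # Learned mapping: spatial coordinate -> (RGB color, density).
    def __init__(self, num_freqs=6, hidden=128):
        super().__init__()
        in_dim = 3 + 3 * 2 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                  # 3 color channels + 1 density
        )

    def forward(self, xyz):
        out = self.mlp(positional_encoding(xyz))
        color = torch.sigmoid(out[..., :3])
        density = torch.relu(out[..., 3:])
        return color, density

points = torch.rand(4096, 3)                       # sample points along camera rays
color, density = TinyRadianceField()(points)
print(color.shape, density.shape)                  # (4096, 3), (4096, 1)
```

In a full pipeline, these per-sample colors and densities would be composited along each ray by volume rendering, which is the step the hybrid and splatting-based methods above accelerate.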
Traditional natural language processing (NLP) encompasses a wide range of tasks designed to enable systems to understand, generate, and manipulate text. Early approaches to NLP relied on techniques such as rule-based systems, statistical models, and early neural architectures such as recurrent neural networks. Recently introduced large language models (LLMs), which adopt transformer architectures and are trained on massive text corpora, have achieved unprecedented performance and sparked a new wave of interest in the field. Since the focus of this article is 3D LLMs, we provide the relevant LLM background here; for an in-depth treatment of LLMs, we refer readers to recent surveys in the area.
In the context of LLMs, "encoder-decoder" and "decoder-only" architectures are the ones mainly used for NLP tasks.
A major difference between LLMs and traditional non-LLM methods is the emergent abilities that appear in large models but are absent in smaller ones. The term "emergent abilities" refers to new, complex capabilities that arise as LLMs grow in size and complexity. These abilities enable deep understanding and generation of natural language, problem solving across a variety of domains without task-specific training, and adaptation to new tasks through in-context learning. In the following, we introduce several common emergent abilities within the scope of LLMs.
In-context learning refers to the ability of an LLM to understand and respond to new tasks or queries based on the context provided in the prompt, without explicit retraining or fine-tuning. Landmark papers (GPT-2/GPT-3) demonstrated in-context learning in a few-shot setting, where the model is given several task examples in the prompt and is then asked to process further examples without any prior explicit training. State-of-the-art LLMs, such as GPT-4, exhibit extraordinary in-context learning capabilities, understanding complex instructions and performing a wide range of tasks, from simple translation to code generation and creative writing, all based on the context provided in the prompt.
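A small few-shot prompt of the kind described above; the task, examples, and labels are made up, and the resulting string could be sent to any instruction-capable LLM:

```python
# Few-shot in-context learning: the task is specified entirely in the prompt,
# with no gradient updates to the model.
examples = [
    ("The food was wonderful.", "positive"),
    ("I waited an hour and left.", "negative"),
]
query = "The staff were friendly and helpful."

prompt = "Classify the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```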
Reasoning in LLMs, often elicited through "chain-of-thought" prompting, involves the model generating intermediate steps or reasoning paths when tackling complex problems or questions. This approach allows the LLM to break tasks down into smaller, manageable parts, promoting a more structured and interpretable solution process. To achieve this, training involves datasets containing a variety of problem-solving tasks, logic puzzles, and data designed to simulate reasoning under uncertainty. Current state-of-the-art LLMs typically exhibit advanced reasoning capabilities when model sizes exceed 60B to 100B parameters.
Instruction following refers to a model's ability to understand and execute commands specified by the user. This includes parsing the instruction, understanding its intent, and generating an appropriate response or action. Adapting this ability to new tasks typically requires instruction tuning on a dataset containing a variety of instructions paired with correct responses or actions. Techniques such as supervised learning, reinforcement learning from human feedback, and interactive learning can further improve performance.
In the context of 3D LLMs, the LLM is either used directly in its pre-trained state or fine-tuned for new multi-modal tasks. However, fine-tuning all of an LLM's parameters poses significant computational and memory challenges due to the large number of parameters involved. Parameter-efficient fine-tuning (PEFT) has therefore become increasingly popular for adapting LLMs to specific tasks by updating only a relatively small subset of model parameters rather than retraining the entire model. The following section lists four common PEFT methods used with LLMs.
Low-Rank Adaptation (LoRA) and its variants update parameters via low-rank matrices. Mathematically, the forward pass with LoRA during fine-tuning can be expressed as h = W0 x + BAx, where W0 is the frozen weight of the LLM and BA is a low-rank update parameterized by the newly introduced matrices A and B, which are updated during fine-tuning. This approach has several clear benefits. During fine-tuning, only B and A are optimized, significantly reducing the computational overhead associated with gradient calculation and parameter updates. Once fine-tuning is complete and the weights are merged, there is no additional inference cost compared to the original model, as shown by h = (W0 + BA) x. Furthermore, there is no need to store multiple copies of the LLM for different tasks, since only the much smaller LoRA matrices need to be kept per task, reducing the storage footprint.
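A minimal sketch of a LoRA-augmented linear layer in PyTorch following the notation above; the layer size, rank, and scaling factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # h = W0 x + B A x, with W0 frozen and only A, B trained.
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        self.w0 = nn.Linear(in_dim, out_dim, bias=False)
        self.w0.weight.requires_grad_(False)               # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))   # zero init => no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.w0(x) + self.scale * (x @ self.A.T) @ self.B.T

    def merge(self):
        # After fine-tuning, fold BA into W0 so inference cost matches the original layer.
        self.w0.weight.data += self.scale * (self.B @ self.A)

layer = LoRALinear(768, 768)
y = layer(torch.randn(2, 768))
print(y.shape)  # (2, 768)
```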
Layer freezing: Selected layers of the pre-trained model are frozen while other layers are updated during training. This is typically applied to layers closer to the model's input or output, depending on the nature of the task and the model architecture. For example, in the 3D-LLM method, all layers except the input and output embeddings can be frozen to mitigate the risk of overfitting on task-specific datasets, retain pre-trained general knowledge, and reduce the number of parameters to optimize.
Prompt tuning guides the LLM to perform specific tasks by framing the task in the prompt, adjusting the model's inputs rather than its parameters as in traditional fine-tuning. Manual prompt engineering is the most intuitive approach, but finding the best prompt can be difficult even for experienced prompt engineers. Another line of work is automated prompt generation and optimization. One popular approach searches for the best exact input prompt text, known as a hard prompt. Alternatively, optimization methods can be used to optimize prompt embeddings (soft prompts).
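A sketch of soft prompt tuning: a small bank of learnable prompt embeddings is prepended to the embedded input tokens while the LLM itself stays frozen. The vocabulary size, embedding dimension, and number of soft tokens are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, num_soft_tokens = 32000, 4096, 20

# Stand-in for a frozen LLM's input embedding table.
token_embedding = nn.Embedding(vocab_size, embed_dim)
token_embedding.weight.requires_grad_(False)

# The only trainable parameters: the soft prompt vectors.
soft_prompt = nn.Parameter(torch.randn(num_soft_tokens, embed_dim) * 0.02)

def build_inputs(input_ids):
    # Prepend the soft prompt to the embedded text tokens for every sequence in the batch.
    text_embeds = token_embedding(input_ids)                        # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    return torch.cat([prompt, text_embeds], dim=1)                  # (B, num_soft_tokens + T, D)

inputs = build_inputs(torch.randint(0, vocab_size, (2, 16)))
print(inputs.shape)  # (2, 36, 4096)
```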
Adapter fine-tuning customizes the model architecture for specific tasks by adding or removing layers or modules. This can include integrating new data modalities, such as visual information alongside textual data. The core idea is to insert small neural network modules (adapters) between the layers of a pre-trained model; during fine-tuning, only the parameters of these adapter modules are updated, while the original model weights remain unchanged.
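A sketch of a bottleneck adapter module of the kind described above; the hidden and bottleneck sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Small bottleneck module inserted between frozen transformer layers:
    # down-project -> nonlinearity -> up-project, with a residual connection.
    def __init__(self, dim=4096, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)     # start near identity so training is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

adapter = Adapter()
out = adapter(torch.randn(2, 16, 4096))
print(out.shape)  # (2, 16, 4096)
```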
Vision-language models (VLMs) are a family of models designed to capture and exploit the relationship between text and images/videos and to perform tasks that involve interaction between the two modalities. Most VLMs have Transformer-based architectures. By leveraging attention modules, visual and textual content condition each other, achieving mutual interaction. In the following paragraphs, we briefly introduce the application of VLMs to discriminative and generative tasks.
Discriminative tasks involve predicting certain properties of the data. VLMs such as CLIP and ALIGN have shown extraordinary zero-shot transferability to unseen data in image classification. Both models comprise two modules: a visual encoder and a text encoder. Given an image and its category, CLIP and ALIGN are trained by maximizing the similarity between the image embedding and the text embedding of the sentence "a photo of {image category}". Zero-shot transferability is achieved by replacing "{image category}" with candidate categories during inference and searching for the sentence that best matches the image. These two works have inspired numerous follow-ups, further improving image classification accuracy. The learned knowledge can also be transferred to other tasks, including object detection, image segmentation, document understanding, and video recognition.
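A sketch of this zero-shot classification recipe using the open-source CLIP package from OpenAI; the image path and candidate categories are placeholders:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate categories.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
categories = ["cat", "dog", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in categories]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(categories, probs.squeeze(0).tolist())))
```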
Generative tasks use VLMs to generate text or images from input data. By leveraging large-scale training data, a single VLM can often perform multiple image-to-text generation tasks, such as image captioning and visual question answering (VQA). Notable examples include SimVLM, BLIP, and OFA, among others. More powerful VLMs, such as BLIP-2, Flamingo, and LLaVA, can handle multi-turn dialogue and reasoning grounded in input images. With the introduction of diffusion models, text-to-image generation has also become a focus of the research community. By training on large numbers of image-text pairs, diffusion models can generate high-quality images from text input. This capability also extends to generating videos, 3D scenes, and dynamic 3D objects. Beyond generation, existing images can also be edited via text prompts.
Vision foundation models (VFMs) are large neural networks designed to extract image representations that are sufficiently diverse and expressive to be deployed directly on a variety of downstream tasks, mirroring the role that pre-trained LLMs play in downstream NLP tasks. One notable example is DINO, which uses a self-supervised teacher-student training paradigm. The learned representations achieve good results in both image classification and semantic image matching. Attention weights in DINO can also serve as segmentation masks for semantic components of the observed scene. Subsequent works such as iBOT and DINOv2 further improve the representations by introducing a masked image modeling (MIM) loss. SAM is a transformer-based image segmentation model trained on a dataset of 1.1 billion segmentation masks and exhibits strong zero-shot transfer capabilities. DINO (Zhang et al.), not to be confused with DINO (Caron et al.), adopts a DETR-like architecture and hybrid query selection for object detection. The follow-up work Grounding DINO introduces text supervision to improve accuracy. Stable Diffusion is a text-to-image generator that is also used as a feature extractor for "real" images by running a single diffusion step on a clean or artificially noised image and extracting intermediate features or attention masks. These features have recently been exploited for segmentation and image-matching tasks, owing to the size and diversity of the training sets used for diffusion models and to the observed emergent properties of diffusion features, such as zero-shot correspondence between images.
As mentioned earlier, given the diversity of 3D representations, there are multiple ways to obtain 3D features. As shown in the "3D Geometry" column of Table 1, point clouds are the most common due to their simplicity and compatibility with various pre-trained 3D encoders, making them a popular choice for multi-task and multi-modal learning methods. Multi-view images are also frequently used because 2D feature extraction is well studied, so 3D feature extraction only requires an additional 2D-to-3D lifting scheme. RGB-D data, easily obtained with depth cameras, is often used in 3D embodied agent systems to extract viewpoint-related information for navigation and understanding. 3D scene graphs are a more abstract 3D representation, well suited to modeling the presence of objects and their relationships and to capturing high-level information about a scene; they are frequently used for 3D scene classification and planning tasks. NeRFs are currently less used in 3D-LLM methods. We believe this is due to their implicit nature, which makes them harder to tokenize and integrate with feed-forward neural networks.
LLMs trained on large amounts of data have been shown to acquire commonsense knowledge about the world. The potential of the LLM's world knowledge and reasoning capabilities has been explored to enhance 3D scene understanding and to reformulate the pipelines of several 3D tasks. In this section, we focus on methods that aim to use LLMs to improve the performance of existing approaches on 3D vision-language tasks. When applying LLMs to 3D tasks, we can divide their use into two groups: knowledge-enhanced and reasoning-enhanced methods. Knowledge-enhanced methods exploit the vast world knowledge embedded in the LLM to improve 3D task performance, providing contextual insights, filling knowledge gaps, or enhancing semantic understanding of the 3D environment. Reasoning-enhanced methods, by contrast, rely less on world knowledge and instead leverage the LLM's ability to reason step by step, providing better generalization to more complex 3D challenges. The following two sections describe each of these in turn.
Many works focus on using the instruction-following and in-context learning capabilities of LLMs to unify multiple 3D tasks in language space. By using different text prompts to denote different tasks, these studies aim to make the LLM a unified conversational interface. Implementing multi-task learning with an LLM usually involves several key steps, starting with the construction of 3D-text data pairs. These pairs require crafting task instructions in text form and defining the output for each distinct task. Next, the 3D data (usually in the form of point clouds) is fed to a 3D encoder to extract 3D features. An alignment module is then used to (i) align 3D features with the LLM's text embeddings at multiple levels (object level, relationship level, and scene level) and (ii) translate 3D features into tokens the LLM can interpret. Finally, an appropriate training strategy must be chosen, such as single-stage or multi-stage 3D-language alignment training and multi-task instruction fine-tuning.
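A highly simplified sketch of that pipeline: a 3D encoder produces scene features, a projection (alignment) module maps them into the LLM's embedding space, and the resulting "3D tokens" are concatenated with the embedded text instruction. All module shapes and names here are assumptions, not the design of any specific method:

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    # Stand-in for a pre-trained 3D encoder (e.g. a point-cloud backbone).
    def __init__(self, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, points):                        # points: (B, N, 3)
        return self.mlp(points).max(dim=1).values     # global feature: (B, out_dim)

class AlignmentProjector(nn.Module):
    # Maps 3D features into the LLM's token-embedding space as a few "3D tokens".
    def __init__(self, in_dim=512, llm_dim=4096, num_tokens=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, llm_dim * num_tokens)
        self.num_tokens, self.llm_dim = num_tokens, llm_dim

    def forward(self, feat):                          # (B, in_dim) -> (B, num_tokens, llm_dim)
        return self.proj(feat).view(-1, self.num_tokens, self.llm_dim)

encoder, projector = PointCloudEncoder(), AlignmentProjector()
text_embeds = torch.randn(2, 32, 4096)                # embedded task instruction (placeholder)
scene_tokens = projector(encoder(torch.randn(2, 2048, 3)))
llm_inputs = torch.cat([scene_tokens, text_embeds], dim=1)   # fed to the (frozen or tuned) LLM
print(llm_inputs.shape)  # (2, 40, 4096)
```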
In the remainder of this section, we will explore these aspects in detail. We also summarize the scope and capabilities of each method reviewed in this section in Table 2.
In addition to exploring 3D multi-task learners, some recent studies combine information from different modalities to further improve model capabilities and enable new interactions. Beyond text and 3D scenes, multi-modal 3D LLMs can also take 2D images, audio, or touch information from the scene as input.
Most works aim to build a common representation space across the different modalities. Since some existing works already provide pre-trained encoders that map text, images, or audio into a common space, some methods choose to learn a 3D encoder that aligns 3D embeddings with the embedding spaces of these pre-trained encoders for other modalities. JM3D-LLM learns a 3D point cloud encoder that aligns the embedding space of point clouds with the text-image embedding space of SLIP. It renders image sequences of the point cloud and builds hierarchical text trees during training to achieve detailed alignment. Point-Bind likewise learns a 3D encoder and aligns it with ImageBind to unify the embedding spaces of images, text, audio, and point clouds. This enables different task heads to handle tasks such as retrieval, classification, and generation across the various modalities. However, a notable limitation is that this approach is only suitable for small object-level scenes, as it is computationally expensive for 3D encoders to process large scenes with millions of points. Furthermore, most pre-trained multi-modal encoders, such as CLIP, are designed for single-object scenes and are not well suited to large-scale scenes with multiple objects and local detail.
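A sketch of the alignment objective used by this family of methods: a trainable 3D encoder is pulled toward the (frozen) embedding space of a pre-trained text/image encoder with a symmetric contrastive (InfoNCE-style) loss. The encoders, batch, and dimensions are placeholders:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(pc_embed, ref_embed, temperature=0.07):
    # pc_embed:  (B, D) embeddings from the trainable 3D encoder
    # ref_embed: (B, D) embeddings of the paired text/image from a frozen encoder
    pc_embed = F.normalize(pc_embed, dim=-1)
    ref_embed = F.normalize(ref_embed, dim=-1)
    logits = pc_embed @ ref_embed.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(pc_embed.size(0), device=pc_embed.device)
    # Matched pairs lie on the diagonal; train in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```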
In contrast, large scenes require more careful design to incorporate multiple modalities. ConceptFusion builds an enhanced feature map that fuses the global information and local details of each constituent image of a large scene. This is achieved with pre-trained feature extractors that are already aligned with different modalities, including text and audio. It then uses traditional SLAM methods to map the feature map onto the scene's point cloud. MultiPLY uses a representation similar to ConceptGraph. It identifies all salient objects in the scene, obtains a global embedding for each object, and finally builds a scene graph. The resulting representation is a scene embedding aligned with Llama's embedding space. Embeddings of other modalities, including audio, temperature, and haptics, can also be mapped into the same space using linear projections. All embeddings are tokenized and sent to the LLM. Both approaches to handling large-scale scenes reduce cost by relying on pre-trained encoders to bridge modality gaps rather than learning new encoders from scratch.
The planning, tool-use, and decision-making capabilities of LLMs can be used to create 3D embodied agents. These capabilities enable the LLM to make intelligent decisions, including navigating 3D environments, interacting with objects, and selecting appropriate tools to perform specific tasks. This section describes how 3D embodied agents perform planning, navigation, and manipulation tasks.
Traditionally, 3D modeling has been a complex, time-intensive process with a high barrier to entry, requiring detailed attention to geometry, texture, and lighting to achieve realistic results. In this section, we take a closer look at the integration of LLMs with 3D generation technologies, showing how language provides a way to generate contextualized objects within scenes and offers innovative solutions for 3D content creation and manipulation.
Open-vocabulary 3D scene understanding aims to identify and describe scene elements using natural language descriptions instead of predefined category labels. OpenScene adopts a zero-shot approach, predicting dense features for 3D scene points that are co-embedded in a shared feature space with CLIP's text and image-pixel embeddings, enabling task-agnostic training and open-vocabulary querying to identify objects, materials, affordances, activities, and room types. CLIP-FO3D follows a similar approach, modifying CLIP to extract dense pixel features from 3D scenes projected into point clouds and then training a 3D model via distillation to transfer CLIP's knowledge. Semantic Abstraction extracts relevancy maps from CLIP as abstract object representations to generalize to new semantics, vocabulary, and domains. Open-Fusion combines the SEEM vision-language model with TSDF 3D mapping, leveraging region-based embeddings and confidence maps for real-time open-vocabulary scene creation and querying.
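Once per-point features live in the same space as text embeddings, open-vocabulary querying reduces to a similarity search. A sketch under that assumption; the features, query embedding, and threshold are all placeholders rather than the output of any particular system:

```python
import torch
import torch.nn.functional as F

# Assume an OpenScene-style pipeline has produced one CLIP-aligned feature per 3D point
# and a text embedding for the user's query (both random placeholders here).
point_features = F.normalize(torch.randn(100_000, 512), dim=-1)   # (num_points, D)
query_embedding = F.normalize(torch.randn(512), dim=-1)           # e.g. text "a wooden chair"

similarity = point_features @ query_embedding                      # cosine similarity per point
mask = similarity > 0.25                                           # threshold is illustrative
print(f"{mask.sum().item()} points matched the query")
```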
Here we survey text-to-3D generation methods that use guidance from 2D VLMs and text-to-image diffusion models via differentiable rendering. Early works such as DreamFields, CLIP-Mesh, CLIP-Forge, and Text2Mesh explored CLIP-guided zero-shot 3D generation.
DreamFusion introduces Score Distillation Sampling (SDS), in which the parameters of a 3D representation are optimized so that renderings from any angle look highly realistic, as evaluated by a pre-trained 2D diffusion model. It uses the text-to-image Imagen model to optimize a NeRF representation via SDS. Magic3D proposes a two-stage framework: generating a coarse model with a low-resolution diffusion prior and a sparse 3D hash grid, then optimizing a textured 3D mesh model using an efficient differentiable renderer and a high-resolution latent diffusion model. Fantasia3D uses a hybrid DMTet representation and spatially varying BRDFs to disentangle geometry and appearance. ProlificDreamer introduces Variational Score Distillation (VSD), a particle-based framework that treats 3D parameters as random variables to increase fidelity and diversity. Dream3D leverages explicit 3D shape priors and text-to-image diffusion models to enhance text-guided 3D synthesis. MVDream adopts a multi-view consistent diffusion model that can be fine-tuned on a small amount of data for personalized generation. Text2NeRF combines NeRF representations with pre-trained text-to-image diffusion models to generate diverse indoor/outdoor 3D scenes from language. In addition to generating geometry and appearance simultaneously, some research also explores synthesizing textures for a given geometry.
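A schematic of one SDS update, assuming access to a differentiable renderer and a frozen diffusion model's noise predictor; `render_fn`, `noise_predictor`, and the noise schedule are placeholders, and the gradient follows the commonly used form w(t)(predicted noise minus injected noise), skipping back-propagation through the diffusion model:

```python
import math
import torch

def sds_step(render_fn, theta, noise_predictor, text_embedding, num_timesteps=1000):
    # render_fn(theta) -> image rendered differentiably from the 3D parameters theta
    #                     (theta must require gradients; an optimizer step follows).
    # noise_predictor(noisy_image, t, text_embedding) -> predicted noise (frozen model).
    image = render_fn(theta)
    t = torch.randint(20, num_timesteps - 20, (1,))                  # random diffusion timestep
    noise = torch.randn_like(image)
    alpha_bar = torch.cos(0.5 * math.pi * t / num_timesteps) ** 2    # placeholder schedule
    noisy = alpha_bar.sqrt() * image + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():                                             # no grad through the diffusion model
        pred_noise = noise_predictor(noisy, t, text_embedding)

    w = 1.0 - alpha_bar                                               # timestep-dependent weight
    grad = w * (pred_noise - noise)                                   # treated as d(loss)/d(image)
    image.backward(gradient=grad)                                     # accumulates into theta.grad
```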
Transformer models pre-trained on large 3D-text datasets learn powerful joint representations that connect the visual and linguistic modalities. 3D-VisTA is a Transformer model that uses self-attention to jointly model 3D visual and text data, enabling effective pre-training on objectives such as masked language/object modeling and scene-text matching. UniT3D uses a unified Transformer approach, combining a PointGroup 3D detection backbone, a BERT text encoder, and a multi-modal fusion module, jointly pre-trained on synthesized 3D-language data. SpatialVLM takes a different strategy, jointly training a VLM on a large synthetic 3D spatial-reasoning dataset, improving performance on 3D spatial visual question answering and supporting applications such as chain-of-thought reasoning for robotics. Multi-CLIP pre-trains a 3D scene encoder to align scene features with CLIP's text and image embeddings, aiming to transfer CLIP's knowledge for improved 3D understanding on tasks such as visual question answering.
Improved benchmarks are critical for fully evaluating and improving the capabilities of multi-modal LLMs on 3D tasks. The limited scope of current benchmarks, especially for 3D reasoning, hinders the assessment of spatial reasoning skills and the development of 3D decision-making/interaction systems. Furthermore, the metrics currently in use do not fully capture the capabilities of LLMs in 3D environments. Developing task-specific metrics to more accurately measure performance across different 3D tasks is crucial. Finally, the granularity of current scene-understanding benchmarks is too coarse, limiting in-depth understanding of complex 3D environments. A more diverse set of tasks is required.
Safety and ethical implications must be considered when using LLMs for 3D understanding. LLMs can hallucinate and output inaccurate or unsafe information, leading to incorrect decisions in critical 3D applications. Furthermore, LLMs often fail in unpredictable and difficult-to-explain ways. They may also inherit social biases present in the training data, penalizing certain groups when making predictions in real-world 3D scenes. It is crucial that LLMs be used prudently in 3D environments, employing strategies to create more inclusive datasets, robust evaluation frameworks for bias detection and correction, and mechanisms to minimize hallucination, in order to ensure accountable and fair outcomes.
This article has conducted an in-depth exploration of the integration of LLMs and 3D data. The survey systematically reviews the methods, applications, and emergent capabilities of LLMs in processing, understanding, and generating 3D data, highlighting the transformative potential of LLMs across a range of 3D tasks. From enhancing spatial understanding and interaction in three-dimensional environments to advancing the capabilities of embodied artificial intelligence systems, LLMs play a key role in advancing the field.
Key findings include identifying the unique strengths of LLMs, such as zero-shot learning, advanced reasoning, and broad world knowledge, which help bridge the gap between textual information and spatial interpretation. The paper demonstrates the integration of LLMs with 3D data across a wide range of tasks. Exploring other 3D vision-language methods alongside LLMs reveals rich research prospects aimed at deepening our understanding of the 3D world.
Additionally, the survey highlights significant challenges such as data representation, model scalability, and computational efficiency, demonstrating that overcoming these obstacles is critical to fully realizing the potential of LLM in 3D applications. In conclusion, this survey not only provides a comprehensive overview of the current state of 3D tasks using LLM, but also lays the foundation for future research directions. It calls for collaboration to explore and expand LLM's capabilities in understanding and interacting with complex 3D worlds, paving the way for further advances in the field of spatial intelligence.