


CVPR 2024 full-score paper: Zhejiang University proposes a new high-quality monocular dynamic reconstruction method based on deformable 3D Gaussians
A monocular dynamic scene is a dynamic environment observed and analyzed with a single camera, in which objects can move freely. Reconstructing such scenes is of critical significance for tasks such as understanding dynamic changes in an environment, predicting object motion trajectories, and generating dynamic digital assets. With monocular vision, 3D reconstruction and model estimation of dynamic scenes become possible, helping us better understand and handle the situations that arise in dynamic environments. Beyond computer vision, this technology plays an important role in fields such as autonomous driving, augmented reality, and virtual reality, allowing the motion of objects in the environment to be captured more accurately.
With the rise of neural rendering represented by Neural Radiance Fields (NeRF), more and more work has begun to use implicit representations for 3D reconstruction of dynamic scenes. Although representative NeRF-based works such as D-NeRF, Nerfies, and K-Planes have achieved satisfactory rendering quality, they are still far from true photorealistic rendering.
The research team from Zhejiang University and ByteDance pointed out that the core of the problem is that the ray-casting-based NeRF pipeline maps the observation space back to the canonical space through a backward flow, which poses challenges for accuracy and clarity. This inverse mapping is not conducive to the convergence of the learned structure, so current methods only reach a PSNR of around 30 dB on the D-NeRF dataset.
To address this challenge, the research team proposed a monocular dynamic scene modeling pipeline based on rasterization. They combined deformation fields with 3D Gaussians for the first time, creating a new method that enables high-quality reconstruction and novel-view rendering. The paper, "Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction," has been accepted by CVPR 2024, a top international academic conference in computer vision. What makes this work unique is that it is the first study to apply deformation fields to 3D Gaussians, extending them to monocular dynamic scenes.
Project homepage: https://ingra14m.github.io/Deformable-Gaussians/
Paper link: https://arxiv.org/abs/2309.13101
Code: https://github.com/ingra14m/Deformable-3D-Gaussians
Experimental results show that the deformation field can accurately forward-map 3D Gaussians from the canonical space to the observation space. On the D-NeRF dataset, a PSNR improvement of more than 10% was achieved. Moreover, in real scenes, the method recovers richer rendering detail even when the camera poses are not sufficiently accurate.
# Figure 1 Experimental results on real scenes from the HyperNeRF dataset.
Related work
Dynamic scene reconstruction has long been a hot topic in 3D reconstruction. As neural rendering represented by NeRF achieved high-quality results, a series of works based on implicit representations emerged in the field of dynamic reconstruction. D-NeRF and Nerfies introduce deformation fields on top of the NeRF ray-casting pipeline to achieve robust dynamic scene reconstruction. TiNeuVox, K-Planes, and HexPlane further introduce grid structures, which greatly accelerate training and improve rendering speed. However, these methods are all based on inverse mapping and cannot truly achieve a high-quality decoupling of the canonical space and the deformation field. 3D Gaussian Splatting is a rasterization-based point-cloud rendering pipeline. Its CUDA-customized differentiable Gaussian rasterizer and innovative densification enable 3D Gaussians to achieve not only SOTA rendering quality but also real-time rendering. Dynamic 3D Gaussians first extended static 3D Gaussians to the dynamic domain; however, its restriction to multi-view scenes severely limits its application in more general settings, such as the single-view scenes captured by a mobile phone.
Research Approach
The core of Deformable-GS is to extend static 3D Gaussians to monocular dynamic scenes. Each 3D Gaussian carries a position, rotation, scale, opacity, and SH coefficients for image-level rendering. From the 3D Gaussian alpha-blending formula, it is not hard to see that the time-varying position, together with the rotation and scale that control a Gaussian's shape, are the decisive parameters of a dynamic 3D Gaussian. Unlike traditional point-cloud-based rendering methods, however, after the 3D Gaussians are initialized, parameters such as position and opacity are continuously updated during optimization, which makes learning dynamic Gaussians harder. This study innovatively proposes a dynamic scene rendering framework in which the deformation field and the 3D Gaussians are jointly optimized. Specifically, the study treats the 3D Gaussians initialized from COLMAP or a random point cloud as a canonical space, then feeds the canonical-space coordinates of each 3D Gaussian, together with time, into the deformation field to predict that Gaussian's time-varying position and shape parameters. Through the deformation field, the 3D Gaussians can be transformed from the canonical space to the observation space for rasterized rendering. This strategy does not affect the differentiable rasterization pipeline of 3D Gaussians, and the gradients it computes can be used to update the parameters of the canonical-space 3D Gaussians.
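The idea above can be sketched as a small MLP that takes a canonical-space Gaussian center and a timestamp and predicts offsets for position, rotation, and scale. This is an illustrative sketch, not the paper's released code; the network width, encoding frequencies, and output layout are assumptions for the example.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """Map coordinates to sin/cos features of exponentially increasing frequency."""
    feats = [x]
    for i in range(num_freqs):
        feats.append(torch.sin((2.0 ** i) * x))
        feats.append(torch.cos((2.0 ** i) * x))
    return torch.cat(feats, dim=-1)

class DeformField(nn.Module):
    """MLP mapping (canonical center, time) to offsets: position (3),
    rotation quaternion (4), and scale (3)."""
    def __init__(self, pos_freqs=10, time_freqs=6, hidden=256):
        super().__init__()
        self.pos_freqs, self.time_freqs = pos_freqs, time_freqs
        in_dim = 3 * (2 * pos_freqs + 1) + 1 * (2 * time_freqs + 1)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # delta_xyz | delta_rot | delta_scale
        )

    def forward(self, xyz, t):
        # xyz: (N, 3) canonical centers; t: (N, 1) timestamps in [0, 1]
        h = torch.cat([positional_encoding(xyz, self.pos_freqs),
                       positional_encoding(t, self.time_freqs)], dim=-1)
        out = self.mlp(h)
        return out[:, :3], out[:, 3:7], out[:, 7:]

# Deform 1000 canonical Gaussians to the observation space at time t = 0.5;
# the deformed Gaussians would then be passed to the differentiable rasterizer.
xyz = torch.randn(1000, 3)
t = torch.full((1000, 1), 0.5)
field = DeformField()
d_xyz, d_rot, d_scale = field(xyz, t)
observed_xyz = xyz + d_xyz
```

Because the rasterizer stays untouched, gradients flow through `d_xyz`, `d_rot`, and `d_scale` back into both the MLP weights and the canonical-space Gaussian parameters, which is exactly the joint optimization described above.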
In addition, introducing the deformation field benefits Gaussian densification in regions with larger motion. Because the deformation field's gradients are relatively higher in areas with large motion amplitudes, the corresponding regions are densified more finely. Even though the number and position parameters of the canonical-space 3D Gaussians are constantly updated early in training, the experimental results show that this joint optimization strategy eventually converges robustly: after approximately 20,000 iterations, the position parameters of the canonical-space 3D Gaussians barely change anymore.
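In the standard 3D Gaussian Splatting densification scheme that this builds on, each Gaussian accumulates the magnitude of its screen-space positional gradient over several views, and Gaussians whose average exceeds a threshold are cloned or split. A minimal sketch of that selection step, with an assumed threshold value:

```python
import numpy as np

def select_for_densification(grad_accum, view_count, threshold=0.0002):
    """Average each Gaussian's accumulated 2D positional gradient over the
    views it was seen in, and flag those above the threshold for
    cloning/splitting. Fast-moving regions accumulate larger gradients,
    so they are densified more finely."""
    avg_grad = grad_accum / np.maximum(view_count, 1)
    return avg_grad >= threshold

# Three Gaussians, each observed in 10 views; the third sits in a
# high-motion region and accumulated a much larger gradient.
grad_accum = np.array([0.001, 0.00005, 0.003])
view_count = np.array([10, 10, 10])
mask = select_for_densification(grad_accum, view_count)  # → [False, False, True]
```

The threshold of 2e-4 is the value commonly used in 3D Gaussian Splatting implementations; the paper may tune it differently.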
The research team also found that camera poses in real scenes are often not accurate enough, and dynamic scenes exacerbate this problem. This has little impact on methods based on neural radiance fields, because a neural radiance field, built on a multilayer perceptron (MLP), is a very smooth structure. 3D Gaussians, however, are an explicit point-cloud-based structure, and slightly inaccurate camera poses are hard to correct robustly through Gaussian splatting.
To alleviate this problem, the study innovatively introduces Annealing Smooth Training (AST). This training mechanism is designed to smooth the learning of the 3D Gaussians in the early stage and to restore rendering detail in the later stage. Its introduction not only improves rendering quality but also greatly improves the stability and smoothness of temporal interpolation tasks.
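One way to realize such a schedule is to perturb the time input with Gaussian noise whose scale decays linearly to zero over the first part of training, so the deformation field learns a smooth motion early and sharp detail late. The formulation below is a plausible sketch consistent with the description; the decay horizon `tau`, scale `beta`, and frame interval are assumed hyperparameters.

```python
import numpy as np

def annealed_time_noise(iteration, tau=20000, beta=0.1,
                        frame_interval=0.01, rng=None):
    """Noise added to the time input during training. The scale decays
    linearly from beta * frame_interval at iteration 0 down to zero at
    iteration tau, after which the time input is left unperturbed."""
    if iteration >= tau:
        return 0.0
    rng = rng if rng is not None else np.random.default_rng()
    scale = beta * frame_interval * (1.0 - iteration / tau)
    return rng.standard_normal() * scale

noise_early = annealed_time_noise(0, rng=np.random.default_rng(0))  # small, nonzero
noise_late = annealed_time_noise(20000)                             # exactly 0.0
```

Early in training the perturbed timestamps blur neighboring frames together, which regularizes the deformation field against pose error; once the noise anneals away, the field is free to fit per-frame detail.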
Figure 2 shows the pipeline of this research. For details, please see the original text of the paper.
Result Display
This study first conducted experiments on the D-NeRF dataset, a synthetic benchmark widely used in the field of dynamic reconstruction. From the visualization results in Figure 3, it is not hard to see that Deformable-GS delivers a huge improvement in rendering quality over previous methods.
# Figure 3 Qualitative experimental comparison results of this study on the D-NeRF data set.
The method proposed in this study not only achieves substantial improvements in visual quality but also improves quantitative rendering metrics accordingly. Notably, the research team found an error in the Lego scene of the D-NeRF dataset: the training and test sets differ slightly, visible in the inconsistent flip angle of the Lego model's shovel. This is the fundamental reason why previous methods could not improve their metrics on the Lego scene. To enable meaningful comparisons, the study used Lego's validation set as the baseline for metric measurement.
# Figure 4 Quantitative comparison on synthetic datasets.
As shown in Figure 4, this study compared against SOTA methods at full resolution (800x800), including D-NeRF (CVPR 2021), TiNeuVox (SIGGRAPH Asia 2022), and Tensor4D and K-Planes (CVPR 2023). The proposed method achieves substantial improvements across all rendering metrics (PSNR, SSIM, LPIPS) and all scenes. It is not only applicable to synthetic scenes but also achieves SOTA results in real scenes where the camera poses are not accurate enough. As shown in Figure 5, this study compares against SOTA methods on the NeRF-DS dataset. The experimental results show that even without special treatment of highly reflective surfaces, the proposed method still surpasses NeRF-DS, which was specifically designed for highly reflective scenes, and achieves the best rendering quality.
# Figure 5 Real scene method comparison.
In addition, this research is the first to apply a differentiable Gaussian rasterization pipeline with both forward and backward depth propagation. As shown in Figure 6, the rendered depth demonstrates that Deformable-GS also obtains a robust geometric representation. Depth backpropagation can in the future support many tasks that require depth supervision, such as inverse rendering, SLAM, and autonomous driving.
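Forward depth propagation here means compositing per-Gaussian depths along each ray with the same alpha-blending weights used for color, D = Σᵢ dᵢ αᵢ Πⱼ<ᵢ (1 − αⱼ); because every term is differentiable, gradients from a depth loss can flow back to the Gaussians. A minimal single-ray sketch of that forward pass (not the CUDA rasterizer itself):

```python
import numpy as np

def composite_depth(depths, alphas):
    """Front-to-back alpha compositing of per-Gaussian depths along one ray:
    D = sum_i d_i * alpha_i * prod_{j<i} (1 - alpha_j).
    `depths` and `alphas` are sorted near-to-far."""
    # Transmittance before each Gaussian: T_0 = 1, T_i = prod_{j<i} (1 - alpha_j)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * transmittance
    return float(np.sum(depths * weights))

# A 90%-opaque Gaussian at depth 1.0 in front of an opaque one at depth 5.0:
# weights are [0.9, 0.1], so the composited depth is 0.9*1.0 + 0.1*5.0 = 1.4.
d = composite_depth(np.array([1.0, 5.0]), np.array([0.9, 1.0]))
```

In the real pipeline the same weights are already computed for color blending, so depth comes almost for free, and its backward pass is what enables depth-supervised applications.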
# Figure 6 Depth visualization.
About the Author
The corresponding author of the paper is Professor Jin Xiaogang from the School of Computer Science and Technology, Zhejiang University.
Email: jin@cad.zju.edu.cn
Personal homepage: http://www.cad.zju.edu.cn/home/jin/