The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has received and reported on more than 2,000 pieces of content, covering top laboratories at major universities and companies around the world and effectively promoting academic exchange and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
Lei Jiahui is a PhD student in the Department of Computer Science at the University of Pennsylvania (2020 to present), advised by Professor Kostas Daniilidis. His current research focuses on geometric modeling representations and algorithms for four-dimensional dynamic scenes, and on their applications. He has published 7 papers as first author or co-author at top computer vision and machine learning conferences (CVPR, NeurIPS, ICML, ECCV). He received his undergraduate degree (2016-2020) from the Control Department of Zhejiang University and the mixed class of Zhu Kezhen College, graduating first in his major.
Reconstructing renderable dynamic scenes from arbitrary monocular video is a holy grail in computer vision research. In this paper, a team of researchers from the University of Pennsylvania and Stanford University attempts to take a small step toward this goal.
The Internet hosts a massive number of monocular videos, which contain a large amount of information about the physical world. However, 3D vision still lacks effective means of extracting 3D dynamic information from these videos to support future large 3D models and an understanding of the dynamic physical world. Although important, this inverse problem is extremely challenging.
First, real-world 2D videos often lack multi-view information, so multi-view geometry cannot be used for 3D reconstruction. In many cases, even the camera poses and intrinsics cannot be solved with existing software (such as COLMAP).
Second, dynamic scenes have an extremely high number of degrees of freedom, and four-dimensional representations of their deformation and of long-term information fusion are still immature, which makes this difficult inverse problem even more complicated.
This article proposes MoSca, a novel neural information processing system. Given only a sequence of video frames and no additional information, it can reconstruct renderable dynamic scenes from monocular in-the-wild videos, including videos generated by SORA, film and TV clips, Internet videos, and public datasets.
Method Overview
To overcome the difficulties above, MoSca first uses the strong prior knowledge stored in computer vision foundation models to shrink the solution space of the problem.
Specifically, MoSca uses the monocular metric-depth estimation model UniDepth, the long-term track-any-point model CoTracker, the epipolar error computed from the optical flow estimation model RAFT, and the semantic features provided by the pre-trained semantic model DINO-v2. See Section 3.1 of the paper for details.
We observe that most real-world dynamic deformations are compact and sparse in nature, and their complexity is often much lower than that of the real geometric structures. For example, the motion of a rigid object can be represented by a rotation and a translation, and the motion of a person can be roughly approximated by the rotations and translations of multiple joints.
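To make this observation concrete, here is a minimal sketch (not taken from the paper) showing that the motion of many points on a rigid object is fully described by one rotation and one translation. It uses the standard Kabsch algorithm with NumPy; the function name `fit_rigid_motion` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def fit_rigid_motion(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t (Kabsch)."""
    src_center = src.mean(axis=0)
    dst_center = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_center).T @ (dst - dst_center)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_center - R @ src_center
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(1000, 3))  # hypothetical points on a rigid object
    angle = 0.3
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.5, -1.0, 2.0])
    moved = points @ R_true.T + t_true
    R, t = fit_rigid_motion(points, moved)
    # 3000 coordinates of motion are explained by 6 degrees of freedom.
    print(np.abs(moved - (points @ R.T + t)).max())
```

The 1,000 points move together, yet only six degrees of freedom are needed to describe the motion, which is the kind of compactness the observation refers to.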
Based on this observation, this article proposes a novel compact dynamic scene representation, the 4D Motion Scaffold, which lifts the foundation model outputs above from two dimensions to four dimensions and fuses them, while also integrating a physics-inspired deformation regularization (ARAP). The 4D motion scaffold is a graph. Each node of the graph is a trajectory of rigid-body motions (SE(3)), and the topology of the graph consists of nearest-neighbor edges built from the distances between these rigid-body trajectory curves, considered globally over time. The deformation at any point in space-time can be expressed by smoothly interpolating the rigid-body trajectories of nearby graph nodes using dual quaternions. This representation greatly reduces the number of motion parameters that need to be solved (see Section 3.2 of the paper for details).
Another major advantage of the 4D motion scaffold is that it can be initialized directly from monocular depth and 2D video point tracks. The unknown positions of occluded points and the orientations of the local coordinate frames are then solved through efficient optimization of the physics-inspired regularization terms. Please refer to Section 3.3 of the paper for details.
With the 4D motion scaffold, any point at any time can be deformed to any target time, which makes it possible to fuse observations globally. Specifically, each frame of the video can be back-projected into three-dimensional space using the estimated depth map and used to initialize 3D Gaussians (3DGS). These Gaussians are "bound" to the 4D motion scaffold and can travel freely to any time. To render the scene at a given moment, one only needs to transport the Gaussians from all other moments to the current moment through the scaffold and fuse them.
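The dual-quaternion interpolation of node motions described above can be sketched as follows. This is a minimal illustration of standard dual-quaternion linear blending, not the paper's implementation: the function names, the `(w, x, y, z)` quaternion layout, and the blending weights are all assumptions (in practice the weights would depend on how close a point is to each scaffold node).

```python
import numpy as np


def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])


def to_dual_quat(q_rot, t):
    """Encode a rigid motion (unit quaternion q_rot, translation t) as a dual quaternion."""
    q_rot = np.asarray(q_rot, dtype=float)
    q_rot = q_rot / np.linalg.norm(q_rot)
    q_dual = 0.5 * quat_mul(np.array([0.0, *t]), q_rot)
    return q_rot, q_dual


def blend(dual_quats, weights):
    """Weighted dual-quaternion linear blending of several node motions."""
    reference = dual_quats[0][0]
    real = np.zeros(4)
    dual = np.zeros(4)
    for (q_rot, q_dual), w in zip(dual_quats, weights):
        # q and -q encode the same rotation; keep all inputs in one hemisphere.
        sign = 1.0 if np.dot(q_rot, reference) >= 0.0 else -1.0
        real += sign * w * q_rot
        dual += sign * w * q_dual
    norm = np.linalg.norm(real)
    return real / norm, dual / norm


def apply(dual_quat, point):
    """Apply the rigid motion encoded by a dual quaternion to a 3D point."""
    q_rot, q_dual = dual_quat
    translation = 2.0 * quat_mul(q_dual, quat_conj(q_rot))[1:]
    rotated = quat_mul(quat_mul(q_rot, np.array([0.0, *point])), quat_conj(q_rot))[1:]
    return rotated + translation


if __name__ == "__main__":
    # Two hypothetical scaffold nodes at one time step, blended with equal weights.
    node_a = to_dual_quat([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
    node_b = to_dual_quat([1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0])
    print(apply(blend([node_a, node_b], [0.5, 0.5]), np.zeros(3)))
```

Blending in dual-quaternion space keeps the result a rigid motion, whereas averaging transformation matrices directly generally does not.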
This dynamic scene representation, based on the 4D motion scaffold and Gaussians, can be optimized efficiently with a Gaussian renderer (see Section 3.4 of the paper for details).
Finally, it is worth mentioning that MoSca does not require camera intrinsics or extrinsics as input. By using the epipolar error output by the foundation models above to determine a static background mask, together with the depth and point tracks those models provide, MoSca can efficiently optimize the reprojection error in a global bundle adjustment that directly outputs the camera intrinsics and poses, and it continues to refine the cameras through subsequent rendering (see Section 3.5 of the paper for details).
Experimental results
MoSca can reconstruct dynamic scenes from videos in the DAVIS dataset. Notably, MoSca flexibly supports multiple Gaussian-based renderers. In addition to the native 3DGS renderer, this article also tests the recent Gaussian surface reconstruction renderer GOF (Gaussian Opacity Fields). As shown by the train example on the far right of the figure, GOF can render higher-quality normals and depth.
MoSca achieves significant improvements on the challenging iPhone DyCheck dataset, and is also comparable to other methods on the NVIDIA dataset, which is widely used for comparison.
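As a rough illustration of the reprojection error that the camera-solving step minimizes over static background tracks, the sketch below back-projects tracked pixels with their depth, moves them into a second camera, and measures the pixel distance to the tracked locations there. It assumes a simple pinhole camera with a single focal length; the function names and parameters (`f`, `cx`, `cy`, `R_ab`, `t_ab`) are hypothetical and not taken from the paper.

```python
import numpy as np


def backproject(uv, depth, f, cx, cy):
    """Lift pixels (N, 2) with depth (N,) to 3D points in the camera frame."""
    x = (uv[:, 0] - cx) / f * depth
    y = (uv[:, 1] - cy) / f * depth
    return np.stack([x, y, depth], axis=1)


def project(points, f, cx, cy):
    """Project 3D camera-frame points (N, 3) to pixels (N, 2)."""
    u = f * points[:, 0] / points[:, 2] + cx
    v = f * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=1)


def reprojection_error(uv_a, depth_a, uv_b, R_ab, t_ab, f, cx, cy):
    """Per-track pixel error after moving frame-A points into frame B."""
    points_a = backproject(uv_a, depth_a, f, cx, cy)
    points_b = points_a @ R_ab.T + t_ab
    return np.linalg.norm(project(points_b, f, cx, cy) - uv_b, axis=1)


if __name__ == "__main__":
    # Two hypothetical static background tracks seen in frames A and B.
    uv_a = np.array([[320.0, 240.0], [100.0, 50.0]])
    depth_a = np.array([2.0, 2.0])
    uv_b = uv_a.copy()
    error = reprojection_error(
        uv_a, depth_a, uv_b, np.eye(3), np.array([0.1, 0.0, 0.0]),
        f=500.0, cx=320.0, cy=240.0,
    )
    print(error)
```

A bundle adjustment would treat the focal length and the per-frame poses as unknowns and minimize this error summed over all static tracks and frame pairs.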