


A single GPU can produce 3D models in seconds! OpenAI's new work: Point-E can generate 3D point cloud models from text
Where is the next breakthrough that will take the AI world by storm?
Many people predict that it is a 3D model generator.
After DALL-E 2, launched earlier this year, surprised everyone with its virtuoso brushwork, OpenAI released its latest generative model, Point-E, on Tuesday, which can generate 3D models directly from text.
Paper link: https://arxiv.org/pdf/2212.08751.pdf
Compared with competitors such as Google's DreamFusion, which need multiple GPUs running for hours, Point-E can generate a 3D model in minutes on a single GPU.
In the editor's hands-on test, Point-E produced 3D output within seconds of entering a prompt, and the output also supports custom editing, saving, and other functions.
Address: https://huggingface.co/spaces/openai/point-e
Netizens also began to try different prompt inputs.
But the output results are not always satisfactory.
Some netizens even wondered whether Point-E might be what realizes Meta's metaverse vision.
It should be noted that Point-E generates 3D models as point clouds, i.e., sets of data points in space.
Simply put, data is sampled from a three-dimensional model to obtain a point cloud that represents the 3D shape in space.
From a computational perspective, point clouds are easier to synthesize, but they cannot capture an object's fine-grained shape or texture, which is currently a shortcoming of Point-E.
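To make the data structure concrete, here is a minimal sketch (not from the paper) of how such a colored point cloud can be represented in NumPy; the array layout is just one common convention, not Point-E's internal format:

```python
import numpy as np

# A point cloud is just a set of points in space. Here each point carries an
# (x, y, z) position plus an (r, g, b) color, so a cloud of N points is an
# (N, 6) array; 4,096 matches the size of Point-E's fine output.
num_points = 4096
positions = np.random.uniform(-0.5, 0.5, size=(num_points, 3))  # placeholder coordinates
colors = np.random.uniform(0.0, 1.0, size=(num_points, 3))      # placeholder RGB values in [0, 1]
point_cloud = np.concatenate([positions, colors], axis=1)

print(point_cloud.shape)  # (4096, 6)
```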
To address this limitation, the Point-E team trained an additional artificial intelligence system to convert Point-E’s point clouds into meshes.
Convert Point-E point cloud to mesh
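Point-E's own converter is the separately trained model described above, and its details are not reproduced here. As a rough, non-equivalent stand-in that illustrates the point-cloud-to-mesh step, classical Poisson surface reconstruction with the Open3D library can be sketched as follows (the library calls and the hypothetical input file are assumptions, not part of Point-E):

```python
import numpy as np
import open3d as o3d

# point_cloud: (N, 6) array of xyz + rgb, e.g. exported from a Point-E run.
point_cloud = np.load("point_cloud.npy")  # hypothetical file name

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(point_cloud[:, :3])
pcd.colors = o3d.utility.Vector3dVector(point_cloud[:, 3:6])

# Poisson reconstruction needs per-point normals.
pcd.estimate_normals()

# Classical Poisson surface reconstruction; a stand-in for the paper's learned model.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
o3d.io.write_triangle_mesh("mesh.ply", mesh)
```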
In addition to the standalone mesh-generation model, Point-E consists of two models:
a text-to-image model and an image-to-3D model.
The text-to-image model, like OpenAI's DALL-E 2 and Stable Diffusion, is trained on captioned images to learn the associations between words and visual concepts.
The image-to-3D model is then trained on pairs of images and 3D objects so that it learns to map efficiently between the two.
When a prompt is entered, the text-to-image model first generates a synthetic rendered view of the object, which is fed to the image-to-3D model to produce a point cloud.
OpenAI researchers say Point-E was trained on a dataset of millions of 3D objects and associated metadata.
But it is not perfect: Point-E's image-to-3D model sometimes fails to interpret the image produced by the text-to-image model, yielding shapes that do not match the text prompt. Still, it is orders of magnitude faster than previous state-of-the-art methods.
They wrote in the paper:
While our method performs worse than the state of the art in evaluations, it generates samples in a fraction of the time. This can make it more practical for certain applications, or allow the discovery of higher-quality 3D objects.
Point-E architecture and operating mechanism
Point-E first uses a text-to-image diffusion model to generate a single synthetic view, and then uses a second diffusion model to generate a 3D point cloud conditioned on the generated image.
While this method is still not state-of-the-art in terms of sampling quality, it is one to two orders of magnitude faster, providing a practical trade-off for some use cases.
The following picture is a high-level pipeline diagram of the model:
Rather than training a single generative model to produce point clouds conditioned on text directly, the generation process is divided into three steps.
First, a synthetic view is generated conditioned on the text caption.
Next, a coarse point cloud (1,024 points) is generated conditioned on the synthetic view.
Finally, a fine point cloud (4,096 points) is generated conditioned on the low-resolution point cloud and the synthetic view.
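For readers who want to try the released code, a minimal sketch of this coarse-to-fine sampling, modeled on the example notebooks in the openai/point-e repository, might look like the following. Two caveats: the module paths, config names (such as 'base40M-textvec' and 'upsample') and sampler arguments are taken from the public repo and may differ between versions, and this variant conditions the base model on text directly rather than going through a synthetic image first, as the paper's full pipeline does.

```python
import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Coarse stage: a 40M-parameter base diffusion model conditioned directly on
# text ('base40M-textvec'); the paper's full pipeline instead conditions an
# image-based model on a synthetic view produced by a GLIDE-style model.
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Fine stage: an upsampler diffusion model that grows the cloud to 4,096 points.
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],   # coarse 1,024 points, then 3,072 more
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=('texts', ''),  # do not condition the upsampler on text
)

# Run the progressive sampler and keep the final output.
samples = None
for x in tqdm(sampler.sample_batch_progressive(batch_size=1,
                                               model_kwargs=dict(texts=['a red motorcycle']))):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]  # colored point cloud with 4,096 points
```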
When training the model on millions of 3D models, we found that data formats and quality varied widely across the dataset, which led us to develop various post-processing steps to ensure higher data quality.
To convert all the data into a common format, we used Blender to render each 3D model into RGBAD images from 20 random camera angles (Blender supports many 3D formats and has an optimized rendering engine).
For each model, a Blender script normalizes the model into a bounding cube, configures standard lighting settings, and finally exports an RGBAD image using Blender's built-in real-time rendering engine.
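The normalization step itself is simple; as a conceptual illustration outside Blender (the actual preprocessing is a Blender script, not this NumPy function), it amounts to:

```python
import numpy as np

def normalize_to_bounding_cube(vertices: np.ndarray) -> np.ndarray:
    """Center a model and scale it so it fits inside a unit cube.

    `vertices` is an (N, 3) array of mesh vertex positions. The paper's
    preprocessing is done inside Blender; this is just the same idea
    expressed in NumPy for illustration.
    """
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()
    return (vertices - center) / scale

# Example: a hypothetical off-center model gets mapped into roughly [-0.5, 0.5]^3.
verts = np.random.uniform(2.0, 7.0, size=(1000, 3))
normalized = normalize_to_bounding_cube(verts)
print(normalized.min(), normalized.max())
```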
Renders are then used to convert each object into a colored point cloud. First, a dense point cloud is constructed for each object by computing a point for each pixel in each RGBAD image. These point clouds typically contain hundreds of thousands of unevenly distributed points, so we also use farthest point sampling to create a uniform 4,096-point cloud.
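Farthest point sampling itself is a simple greedy algorithm; a minimal NumPy version, written here for illustration (the paper's own implementation is not shown), looks like this:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedy farthest point sampling.

    Repeatedly picks the point that is farthest from everything selected so
    far, turning a dense, unevenly distributed cloud into k well-spread
    points. Illustrative version, not the paper's exact code.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = rng.integers(n)
    # Distance from every point to the nearest already-selected point.
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = np.argmax(dist)
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[i]], axis=1))
    return points[selected]

# Reduce a dense cloud of ~100k points to a uniform 4,096-point cloud.
dense = np.random.randn(100_000, 3)
uniform_4k = farthest_point_sampling(dense, 4096)
print(uniform_4k.shape)  # (4096, 3)
```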
By building point clouds directly from renders, we avoid various problems that can arise from sampling points directly from a 3D mesh, such as sampling points contained inside the model or dealing with 3D models stored in unusual file formats.
Finally, we employ various heuristics to reduce the frequency of low-quality models in our dataset.
First, we eliminate planar objects by calculating the SVD of each point cloud, retaining only those objects whose minimum singular value is higher than a certain threshold.
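This flatness check is easy to reproduce in a few lines; the sketch below is an illustrative version, and the threshold and normalization are placeholders rather than the values used in the paper:

```python
import numpy as np

def is_flat(point_cloud: np.ndarray, threshold: float = 0.02) -> bool:
    """Flag near-planar point clouds via SVD.

    Center the points and take the singular values of the (N, 3) matrix; if
    the smallest one is tiny, the cloud has almost no extent along its third
    principal axis, i.e. it is essentially a plane. The threshold and the
    sqrt(N) normalization here are arbitrary illustrative choices.
    """
    xyz = point_cloud[:, :3] - point_cloud[:, :3].mean(axis=0)
    singular_values = np.linalg.svd(xyz, compute_uv=False)
    return singular_values[-1] / np.sqrt(len(xyz)) < threshold

# Keep only non-flat objects.
dataset = [np.random.randn(4096, 3), np.random.randn(4096, 3) * [1, 1, 1e-4]]
kept = [pc for pc in dataset if not is_flat(pc)]
print(len(kept))  # 1 -- the second (flat) cloud is filtered out
```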
Next, we cluster the dataset by CLIP features (for each object, we average the features across all renders).
We found that some clusters contained many low-quality model categories, while other clusters appeared more diverse or interpretable.
We split these clusters into several buckets of different qualities and use a weighted mixture of the resulting buckets as our final dataset.
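A sketch of this clustering-and-bucketing step is shown below. It assumes CLIP features have already been computed per render, and the number of clusters, the bucket weights, and the hypothetical `render_features` mapping are all illustrative choices, since the paper does not spell these out:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume `render_features` maps each object id to an array of CLIP image
# features, one row per rendered view (here: 20 views, 512-dim features).
render_features = {f"object_{i}": np.random.randn(20, 512) for i in range(1000)}

object_ids = list(render_features)
# One feature vector per object: the average over all of its renders.
object_features = np.stack([render_features[oid].mean(axis=0) for oid in object_ids])

# Cluster objects by their averaged CLIP features.
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(object_features)

# After inspecting clusters, assign each cluster a quality bucket and a
# sampling weight; the uniform weights below are placeholders.
cluster_weight = {c: 1.0 for c in range(50)}   # e.g. downweight low-quality clusters
object_weight = np.array([cluster_weight[c] for c in labels])
sampling_prob = object_weight / object_weight.sum()
print(sampling_prob.shape)  # one sampling probability per object
```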
Application prospects
OpenAI researchers pointed out that Point-E's point clouds could also be used to fabricate real-world objects, for example via 3D printing.
With the additional mesh-conversion model, the system could also feed into game and animation development workflows.
While all eyes are currently on 2D art generators, model synthesis artificial intelligence could be the next big industry disruptor.
3D models are widely used in film and television, interior design, architecture and various scientific fields.
Today, producing a 3D model usually takes hours, and Point-E addresses exactly this shortcoming.
Researchers say that Point-E still has many flaws at this stage, such as biases inherited from its training data and a lack of safeguards against the models being used to create dangerous objects.
Point-E is just a starting point, and they hope it will inspire "further work" in the field of text-to-3D synthesis.