Where is the next breakthrough that will take the AI world by storm?
Many people predict that it is a 3D model generator.
After DALL-E 2, launched earlier this year, surprised everyone with its image generation, OpenAI released its latest generative model, Point-E, on Tuesday. It can generate 3D models directly from text.
Paper link: https://arxiv.org/pdf/2212.08751.pdf
Unlike competitors such as Google's DreamFusion, which need several GPUs running for several hours, Point-E can generate a 3D model in minutes on a single GPU.
In the editor's hands-on test, Point-E returned a 3D result within seconds of entering a prompt. The output can also be edited, saved, and so on.
Address: https://huggingface.co/spaces/openai/point-e
Netizens also began to try different prompt inputs.
But the output results are not always satisfactory.
Some netizens wondered whether Point-E might help realize Meta's metaverse vision.
Note that Point-E generates 3D objects as point clouds, that is, sets of data points in space.
Simply put, a point cloud is a set of points sampled from a three-dimensional model that together represent a 3D shape.
From a computational perspective, point clouds are easier to synthesize, but they cannot capture an object's fine-grained shape or texture, which is currently one of Point-E's shortcomings.
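To make the idea concrete, here is a minimal sketch of a colored point cloud stored as a NumPy array. The six-column layout (x, y, z, r, g, b) and the toy values are assumptions chosen for illustration, not Point-E's actual storage format.

```python
import numpy as np

# Hypothetical toy cloud: each row is one point, columns are x, y, z, r, g, b.
points = np.array([
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
])

xyz = points[:, :3]   # positions
rgb = points[:, 3:]   # colors in [0, 1]

# A point cloud has no faces or edges, only a set of points,
# so simple statistics such as the centroid are one-liners.
centroid = xyz.mean(axis=0)
```

Because nothing connects the points, surface detail between them is simply not represented, which is why a separate mesh step is needed.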
To address this limitation, the Point-E team trained an additional artificial intelligence system to convert Point-E’s point clouds into meshes.
Convert Point-E point cloud to mesh
Besides the standalone mesh generation model, Point-E consists of two models:
a text-to-image model and an image-to-3D model.
The text-to-image conversion model is similar to OpenAI’s DALL-E 2 and Stable Diffusion, trained on labeled images to understand the association between words and visual concepts.
The image-to-3D model, in turn, is fed a set of images paired with 3D objects so that it learns to translate efficiently between the two.
When a prompt is entered, the text-to-image model generates a synthetic rendered object, which is fed to the image-to-3D model, which then generates a point cloud.
OpenAI researchers say Point-E was trained on a dataset of millions of 3D objects and associated metadata.
But it is not perfect: Point-E's image-to-3D model sometimes fails to understand the image from the text-to-image model, resulting in a shape that does not match the text prompt. Still, it is orders of magnitude faster than the previous state of the art.
They wrote in the paper:
While our method performs worse than the state of the art in this evaluation, it produces samples in a fraction of the time. This could make it more practical for certain applications, or allow for the discovery of higher-quality 3D objects.
Point-E architecture and operating mechanism
The Point-E model first uses a text-to-image diffusion model to generate a single synthetic view, and then uses a second diffusion model to generate a 3D point cloud conditioned on the generated image.
While this method is still not state-of-the-art in terms of sampling quality, it is one to two orders of magnitude faster, providing a practical trade-off for some use cases.
The following picture is a high-level pipeline diagram of the model:
Rather than training a single generative model to produce point clouds directly conditioned on text, we break the generation process into three steps.
First, generate a synthetic view conditioned on a text caption.
Next, generate a coarse point cloud (1,024 points) conditioned on the synthetic view.
Finally, generate a fine point cloud (4,096 points) conditioned on the low-resolution point cloud and the synthetic view.
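The data flow of these three steps can be sketched with placeholder functions. Everything below is hypothetical: the function names, the 64-pixel view size, and the random outputs are stand-ins for the real diffusion models, and having the upsampler keep the coarse points and add new ones is just one possible design. Only the point counts (1,024 and 4,096) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_to_image(prompt, size=64):
    """Stand-in for the text-to-image diffusion model: returns an RGB view."""
    return rng.random((size, size, 3))

def image_to_coarse_cloud(image, n_points=1024):
    """Stand-in for the image-conditioned diffusion model: (n_points, 6) array."""
    return rng.random((n_points, 6))

def upsample_cloud(coarse, image, n_points=4096):
    """Stand-in for the upsampler: keeps the coarse points and adds new ones."""
    extra = rng.random((n_points - len(coarse), coarse.shape[1]))
    return np.concatenate([coarse, extra], axis=0)

def generate(prompt):
    view = text_to_image(prompt)            # step 1: text -> synthetic view
    coarse = image_to_coarse_cloud(view)    # step 2: view -> 1,024 points
    fine = upsample_cloud(coarse, view)     # step 3: coarse + view -> 4,096 points
    return view, coarse, fine

view, coarse, fine = generate("a red traffic cone")
print(view.shape, coarse.shape, fine.shape)
```

Note that the text prompt is used only in the first step; the later stages see the synthetic view rather than the words.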
Working with a dataset of millions of 3D models, we found that data formats and quality varied greatly across the dataset, which led us to develop various post-processing steps to ensure higher data quality.
To convert all the data into a common format, we used Blender to render each 3D model as RGBAD images from 20 random camera angles (Blender supports multiple 3D formats and ships with an optimized rendering engine).
For each model, a Blender script normalizes the model into a bounding cube, configures standard lighting settings, and finally exports an RGBAD image using Blender's built-in real-time rendering engine.
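As a rough illustration of the normalization step, the following sketch centers a set of vertices and scales them uniformly to fit a unit cube. It works on a plain NumPy array rather than inside Blender, and the target cube `[-0.5, 0.5]^3` is an assumption; the article does not state the exact convention used.

```python
import numpy as np

def normalize_to_unit_cube(vertices):
    """Center vertices and scale them uniformly to fit inside [-0.5, 0.5]^3."""
    vertices = np.asarray(vertices, dtype=float)
    lo = vertices.min(axis=0)
    hi = vertices.max(axis=0)
    center = (lo + hi) / 2.0
    extent = (hi - lo).max()          # longest side of the bounding box
    if extent == 0:                   # degenerate model: a single point
        return vertices - center
    return (vertices - center) / extent

verts = np.array([[0.0, 0.0, 0.0], [4.0, 2.0, 1.0], [2.0, 1.0, 0.5]])
normalized = normalize_to_unit_cube(verts)
```

Dividing by the longest side, rather than scaling each axis separately, preserves the object's proportions.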
The renderings are then used to convert each object into a colored point cloud. First, a dense point cloud is constructed for each object by computing a point for each pixel in each RGBAD image. These point clouds typically contain hundreds of thousands of unevenly distributed points, so we also use farthest point sampling to create a uniform point cloud of 4K points.
By building point clouds directly from renderings, we avoid various problems that can arise from sampling points directly from a 3D mesh, such as sampling points that lie inside the model or dealing with 3D models stored in uncommon file formats.
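The two operations in this paragraph can be sketched as follows, under simplifying assumptions: a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a single view, and no alpha masking, colors, or camera-to-world transform. The farthest point sampling routine is the standard greedy algorithm, not necessarily the exact implementation the authors used.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Turn an (H, W) depth map into (H*W, 3) camera-space points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column / row indices
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def farthest_point_sampling(points, k, start=0):
    """Pick k indices so that each new point is as far as possible from those chosen."""
    points = np.asarray(points, dtype=float)
    chosen = np.empty(k, dtype=int)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(dist))
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen

depth = np.full((32, 32), 2.0)                       # hypothetical flat depth map
dense = unproject_depth(depth, fx=32.0, fy=32.0, cx=16.0, cy=16.0)
idx = farthest_point_sampling(dense, k=64)
sparse = dense[idx]
```

Each iteration costs one pass over the dense cloud, so sampling k points from N takes O(N·k) time, which is manageable for clouds of a few hundred thousand points reduced to about 4K.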
Finally, we employ various heuristics to reduce the frequency of low-quality models in our dataset.
First, we eliminate planar objects by calculating the SVD of each point cloud, retaining only those objects whose minimum singular value is higher than a certain threshold.
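A minimal sketch of this flatness check is shown below. Centering the cloud before the SVD and the threshold value `1e-3` are assumptions made for illustration; the article only says that the minimum singular value must exceed "a certain threshold".

```python
import numpy as np

def smallest_singular_value(xyz):
    """Smallest singular value of the centered (N, 3) coordinate matrix."""
    centered = xyz - xyz.mean(axis=0)
    return np.linalg.svd(centered, compute_uv=False)[-1]

def is_flat(xyz, threshold=1e-3):
    """Flag a cloud as planar; the threshold value here is purely illustrative."""
    return smallest_singular_value(xyz) < threshold

rng = np.random.default_rng(0)
box = rng.uniform(-0.5, 0.5, size=(1000, 3))         # fills a volume
plane = box.copy()
plane[:, 2] = 0.0                                     # squashed onto z = 0
```

Here `is_flat(plane)` is true and `is_flat(box)` is false. The singular values measure how far the points spread along each principal direction, so a near-zero smallest value means the object has almost no thickness in one direction.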
Next, we cluster the dataset by CLIP features (for each object, we average the features across all renders).
We found that some clusters contained many low-quality model categories, while other clusters appeared more diverse or interpretable.
We split these clusters into several buckets of different qualities and use a weighted mixture of the resulting buckets as our final dataset.
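The averaging and mixture steps might look roughly like this. The random vectors stand in for real CLIP embeddings, and the bucket names, bucket contents, and weights are all hypothetical; the article does not say which clustering algorithm or weights were used, so the clustering itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CLIP embeddings: 6 objects x 20 renders x 8 feature dimensions.
render_features = rng.normal(size=(6, 20, 8))
object_features = render_features.mean(axis=1)       # one vector per object

# Hypothetical result of clustering and manual review: object ids per bucket.
buckets = {"high": [0, 1, 2], "medium": [3, 4], "low": [5]}
weights = {"high": 0.7, "medium": 0.25, "low": 0.05}

def sample_object(rng):
    """Draw a bucket by weight, then an object uniformly within that bucket."""
    names = list(buckets)
    probs = np.array([weights[n] for n in names])
    i = rng.choice(len(names), p=probs / probs.sum())
    return int(rng.choice(buckets[names[i]]))

draws = [sample_object(rng) for _ in range(1000)]
```

Sampling by bucket weight keeps low-quality objects in the training mix while making them appear far less often than they would under uniform sampling.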
OpenAI researchers pointed out that Point-E's point clouds can also be used to fabricate real-world objects, for example through 3D printing.
With the additional mesh conversion model, the system could also fit into game and animation development workflows.
While all eyes are currently on 2D art generators, 3D model synthesis AI could be the next big industry disruptor.
3D models are widely used in film and television, interior design, architecture and various scientific fields.
Creating a 3D model currently takes several hours, and Point-E addresses exactly this shortcoming.
Researchers say that Point-E still has many flaws at this stage, such as biases inherited from training data and a lack of protection measures for models that may be used to create dangerous objects.
Point-E is just a starting point, and they hope it will inspire "further work" in the field of text-to-3D synthesis.