While the world is still recovering, research has not slowed down its frenetic pace, especially in the field of artificial intelligence.
Additionally, this year has seen a new emphasis on AI ethics, bias, governance, and transparency.
Our understanding of artificial intelligence, of the human brain, and of the connection between the two is constantly evolving, and applications that improve the quality of our lives will soon shine.
In his blog, well-known blogger Louis Bouchard has counted no fewer than 32 AI technology breakthroughs in 2022.
Let’s take a look at what these amazing studies are!
Article address: https://www.louisbouchard.ai/2022-ai-recap/
LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions

You must have experienced this situation: you and your friends take a great photo, only to discover that someone in the background has ruined the shot you wanted to post to Moments or Xiaohongshu. Now, that is no longer a problem.
This resolution-robust large-mask inpainting method, built on Fourier convolutions, lets users easily remove unwanted content from images; people and trash cans alike disappear without a trace.
It is like carrying a professional Photoshop retoucher in your pocket: one click and the clutter is gone.
Although it looks simple, image inpainting is a problem AI researchers have struggled with for a long time (a minimal sketch of the Fourier-convolution idea follows the links below).
Paper link: https://arxiv.org/abs/2109.07161
Project address: https://github.com/saic-mdal/lama
Colab Demo: https://colab.research.google.com/github/saic-mdal/lama/blob/master/colab/LaMa_inpainting.ipynb
Video explanation: https://youtu.be/Ia79AvGzveQ
Short analysis: https://www.louisbouchard.ai/lama/
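To make the core trick concrete, here is a minimal PyTorch sketch of a spectral (Fourier) convolution block in the spirit of LaMa's fast Fourier convolutions. It is not the authors' implementation; the channel counts, the single 1x1 frequency-domain convolution, and the mask handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Toy Fourier-convolution block: convolve in the frequency domain so
    every output pixel effectively sees the whole image at once."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution applied to the stacked real/imaginary spectrum
        self.freq_conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")            # complex spectrum
        spec = torch.cat([spec.real, spec.imag], dim=1)    # to real channels
        spec = torch.relu(self.freq_conv(spec))            # mix frequencies globally
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

# Inpainting-style usage: an image plus a binary mask marking the region to erase.
img = torch.randn(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[..., 100:160, 100:160] = 1.0                          # the "unwanted object"
block = SpectralTransform(channels=4)                      # 3 image + 1 mask channels
out = block(torch.cat([img * (1 - mask), mask], dim=1))
print(out.shape)                                           # torch.Size([1, 4, 256, 256])
```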
STIT: Real video face editing based on GANs
Will Smith in "Gemini Man"
Previously, this kind of work required professionals to spend hundreds or even thousands of hours manually editing the scenes in which the actors appear. With AI, it can be done in minutes.
In fact, many techniques can widen your smile or make you look younger or older, all fully automatically with AI-based algorithms. This line of work is called AI-based face manipulation in video, and STIT represents the state of the art in 2022.
Paper link: https://arxiv.org/abs/2201.08361
Project address: https://github.com/rotemtzaban/STIT
Video explanation: https://youtu.be/mqItu9XoUgk
Short analysis: https://www.louisbouchard.ai/stitch-it-in-time/

NeROIC: Neural Rendering with Online Gallery

Neural rendering can generate realistic 3D models in space from pictures of objects, people, or scenes. With this technology, you only need a few pictures of an object, and you can ask the machine to understand those pictures and simulate what the object looks like in space. Understanding the physical shape of objects from images is easy for humans, because we understand the real world; for a machine that only sees pixels, it is a completely different challenge. How can the generated model be integrated into a new scene? What if the lighting conditions and camera angles of the photos differ, so that the resulting model changes accordingly? These are the questions Snapchat and the University of Southern California set out to answer in this new study.

Paper link: https://arxiv.org/abs/2201.02533
Project address: https://github.com/snap-research/NeROIC
Video explanation: https://youtu.be/88Pl9zD1Z78
Short analysis: https://www.louisbouchard.ai/neroic/

Speech inpainting

For images, machine-learning-based inpainting can not only remove content but also fill in the missing parts of an image from background information. For video, the challenge is to stay consistent from frame to frame while avoiding spurious artifacts. And once you have successfully "kicked" a person out of a video, you also need to remove his or her voice. To this end, Google researchers have proposed a speech inpainting method that can correct grammar and pronunciation and even remove background noise from videos.

Paper link: https://arxiv.org/abs/2202.07273
Video explanation: https://youtu.be/zIIc4bRf5Hg
Short analysis: https://www.louisbouchard.ai/speech-inpainting-with-ai/

GFP-GAN: Blind face restoration

Do you have old photos whose quality has blurred with age? Don't worry: with blind face restoration, your memories can last forever. This new, free AI model can repair most of your old photos in a flash, and it works well even when the input photo is of very low quality, which used to be quite a challenge. Even better, you can try it any way you like: the authors have open-sourced the code and built a demo and an online application for everyone. This technology will surprise you!

Paper link: https://arxiv.org/abs/2101.04061
Project address: https://github.com/TencentARC/GFPGAN
Colab Demo: https://colab.research.google.com/drive/1sVsoBd9AjckIXThgtZhGrHRfFI6UUYOo
Online application: https://huggingface.co/spaces/akhaliq/GFPGAN
Video explanation: https://youtu.be/nLDVtzcSeqM
Short analysis: https://www.louisbouchard.ai/gfp-gan/

4D-Net: Learning of multi-modal alignment

How do self-driving cars "see" in every direction? You may have heard about the LiDAR sensors and other unusual cameras that car companies use. But how do they work, how do they see the world, and what exactly do they see differently from us?

Paper link: https://arxiv.org/abs/2109.01066

Unlike Tesla, which uses only cameras to understand the world, most self-driving car makers, such as Waymo, combine ordinary cameras with 3D LiDAR sensors. These do not produce images the way a camera does; they produce 3D point clouds, using RGB sensing information and measuring the distance to objects by timing the propagation of the laser pulses they fire at them (a toy time-of-flight sketch follows the links below). Still, how do we effectively combine this information and make the vehicle understand it? What does the vehicle end up seeing? Is autonomous driving safe enough? A new research paper from Waymo and Google answers these questions.

Video explanation: https://youtu.be/0nJMnw1Ldks
Short analysis: https://www.louisbouchard.ai/waymo-lidar/
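As a toy illustration of the time-of-flight idea mentioned above (this is not Waymo's or Google's pipeline, just the basic geometry), a single LiDAR range measurement and its conversion to a 3D point might look like this:

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def range_from_time_of_flight(round_trip_seconds: float) -> float:
    """The pulse travels out and back, so the one-way distance is c * t / 2."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

def lidar_point(range_m: float, azimuth_rad: float, elevation_rad: float):
    """Convert one beam's range and firing angles into an (x, y, z) point."""
    x = range_m * math.cos(elevation_rad) * math.cos(azimuth_rad)
    y = range_m * math.cos(elevation_rad) * math.sin(azimuth_rad)
    z = range_m * math.sin(elevation_rad)
    return x, y, z

r = range_from_time_of_flight(200e-9)   # a 200 ns round trip is roughly 30 m away
print(r, lidar_point(r, math.radians(30.0), math.radians(2.0)))
```

Sweeping thousands of such beams per rotation is what produces the dense point clouds the 4D-Net paper fuses with camera images.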
Instant NeRF: Instant neural graphics primitives with a multiresolution hash encoding

Using AI models, people can turn captured images into high-quality 3D models. This challenging task lets researchers use 2D images to create how an object or person would look in the three-dimensional world. Through hash-encoded neural graphics primitives, Nvidia can train a NeRF in about 5 seconds and still achieve better results; in less than two years of research, NeRF training has been sped up by more than 1,000 times. A toy sketch of the multiresolution hash encoding follows the links below.
Paper link: https://arxiv.org/abs/2201.05989
Video explanation: https://youtu.be/UHQZBQOVAIU
Short analysis: https://www.louisbouchard.ai/nvidia-photos-into-3d-scenes/
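Below is a toy NumPy sketch of the multiresolution spatial hash at the heart of the paper. The XOR-with-primes hash follows the paper, but the table size, the grid resolutions, the nearest-vertex lookup, and the random (rather than learned) feature tables are all simplifications made for illustration.

```python
import numpy as np

PRIMES = (1, 2_654_435_761, 805_459_861)    # hash primes used in the paper
TABLE_SIZE = 2 ** 14                         # feature-table entries per level (toy value)

def hash_vertex(ix: int, iy: int, iz: int) -> int:
    """XOR spatial hash mapping an integer grid vertex to a table slot."""
    return (ix * PRIMES[0] ^ iy * PRIMES[1] ^ iz * PRIMES[2]) % TABLE_SIZE

def encode(xyz: np.ndarray, resolutions=(16, 32, 64, 128), feat_dim=2) -> np.ndarray:
    """Concatenate hashed features from several grid resolutions.
    The real method trilinearly interpolates the 8 surrounding vertices and
    learns the tables; here we take the nearest vertex and use random tables."""
    rng = np.random.default_rng(0)
    tables = [rng.normal(size=(TABLE_SIZE, feat_dim)) for _ in resolutions]
    feats = []
    for table, res in zip(tables, resolutions):
        ix, iy, iz = np.floor(xyz * res).astype(int)
        feats.append(table[hash_vertex(ix, iy, iz)])
    return np.concatenate(feats)             # this vector feeds a tiny MLP

print(encode(np.array([0.3, 0.7, 0.1])).shape)   # (8,) = 4 levels * 2 features each
```

Because the tables are small and the MLP is tiny, lookups dominate the cost, which is why training drops from hours to seconds.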
DALL·E 2: Text-to-image generation based on CLIP features

DALL·E 2 not only generates realistic images from text, its output also has four times the resolution! The performance boost did not seem to be enough to satisfy OpenAI, however, so they also taught DALL·E 2 a new skill: image inpainting. In other words, you can edit an image with DALL·E 2 and add any new element you want, such as a flamingo in the background.

Paper link: https://arxiv.org/abs/2204.06125
Video explanation: https://youtu.be/rdGVbPI42sA
Short analysis: https://www.louisbouchard.ai/openais-new-model-dall-e-2-is-amazing/

MyStyle: A personalized generative prior

Google and Tel Aviv University propose a very powerful DeepFake technique. With it, you can do almost anything: simply take hundreds of photos of a person, encode their images, and then fix, edit, or create any look you want. It is both amazing and scary, especially when you see the results.

Paper link: https://arxiv.org/abs/2203.17272
Project address: https://mystyle-personalized-prior.github.io/
Video explanation: https://youtu.be/BNWAEvFfFvQ
Short analysis: https://www.louisbouchard.ai/mystyle/

OPT: Open Pre-trained Transformer language models

GPT-3 has 175 billion parameters, roughly twice the number of neurons in the human brain! Such a large neural network let the model learn from nearly the entire internet and understand how we write, exchange, and understand text. Just as people were marveling at GPT-3's capabilities, Meta took a big step toward the open-source community: they released an equally powerful model that is fully open source. Not only does OPT-175B have more than one hundred billion parameters, but compared with GPT-3 it is far more open and accessible.
Paper link: https://arxiv.org/abs/2205.01068
Video link: https://youtu.be/Ejg0OunCi9U
Short analysis: https://www.louisbouchard.ai/opt-meta/
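As a hedged illustration of that accessibility, the smaller open OPT checkpoints can be loaded with the Hugging Face transformers library. The model name and generation settings below are assumptions for the example; the full 175B model requires a separate access request and multi-GPU serving.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"            # a small, freely downloadable sibling of OPT-175B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Open-sourcing large language models matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```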
BlobGAN: Spatially disentangled scene representations

BlobGAN uses "blobs" to describe the objects in a scene. Researchers can move these blobs around, make them larger or smaller, or even delete them, and the objects they represent in the image change accordingly. As the authors show in their results, you can even create new images for a dataset by duplicating blobs. The BlobGAN code is now open source; if you are interested, go try it!

Paper link: https://arxiv.org/abs/2205.02837
Project address: https://github.com/dave-epstein/blobgan
Colab Demo: https://colab.research.google.com/drive/1clvh28Yds5CvKsYYENGLS3iIIrlZK4xO?usp=sharing#scrollTo=0QuVIyVplOKu
Video explanation: https://youtu.be/mnEzjpiA_4E
Short analysis: https://www.louisbouchard.ai/blobgan/

Gato: A generalist agent

DeepMind built a single "generalist" agent, Gato, that can play Atari games, caption images, chat with people, and control a robotic arm. Even more striking, it completes all of these tasks after being trained only once, with the same set of weights. Gato is a multi-modal agent: it can both caption images and act as a chatbot answering questions. Although GPT-3 can also chat with you, Gato clearly does more; after all, AIs that can chat are common, but ones that can also play games with you are not.

Paper link: https://arxiv.org/abs/2205.06175
Video explanation: https://youtu.be/xZKSWNv6Esc

Imagen: Text-to-Image Diffusion Model with Deep Language Understanding

If you think DALL·E 2 is great, wait until you see what this new model from Google Brain, Imagen, can do. DALL·E is amazing, but the generated images often lack realism, and that is the problem the Google team built Imagen to solve. On benchmarks comparing text-to-image models, Imagen achieves remarkable results in text-to-image synthesis by using the text embeddings of large language models; the resulting images are both imaginative and realistic.
Paper link: https://arxiv.org/abs/2205.11487
Project address: https://imagen.research.google/
Video explanation: https://youtu.be/qhtYPhPWCsI
Short analysis: https://www.louisbouchard.ai/google-brain-imagen/

DALL·E mini

A set of creepy pictures of Mark Zuckerberg went viral on Twitter for a while; these sanity-draining works were created by DALL·E mini. As the "lite" member of the DALL·E family, DALL·E mini is free and open source. The code is right there; who will be the next character to get a magic makeover?

Project address: https://github.com/borisdayma/dalle-mini
Online experience: https://huggingface.co/spaces/dalle-mini/dalle-mini
Video explanation: https://youtu.be/K3bZXXjW788
Short analysis: https://www.louisbouchard.ai/dalle-mini/

NLLB: No Language Left Behind

Meta AI's NLLB-200 model takes its name from the goal of "No Language Left Behind", and it can translate between more than 200 languages in any direction. The highlight of the research is that the researchers improved training for most low-resource languages by multiple orders of magnitude while achieving SOTA results across the 200-language translation task.

Paper link: https://research.facebook.com/publications/no-language-left-behind/
Project address: https://github.com/facebookresearch/fairseq/tree/nllb
Online experience: https://nllb.metademolab.com/
Video explanation: https://youtu.be/2G4NeG17Eis
Short analysis: https://www.louisbouchard.ai/no-language-left-behind/

Dual-Shutter optical vibration sensing

This research, which received a CVPR 2022 Best Paper Honorable Mention, proposes a novel Dual-Shutter method that uses "slow" cameras (130 FPS) to simultaneously detect high-speed (up to 63 kHz) surface vibrations from multiple scene sources, by capturing the vibrations that audio sources induce on surfaces. This enables applications such as separating individual musical instruments and removing noise.

Paper link: https://openaccess.thecvf.com/content/CVPR2022/papers/Sheinin_Dual-Shutter_Optical_Vibration_Sensing_CVPR_2022_paper.pdf
Project address: https://imaging.cs.cmu.edu/vibration/
Video explanation: https://youtu.be/n1M8ZVspJcs
Short analysis: https://www.louisbouchard.ai/cvpr-2022-best-paper/

Make-A-Scene: Text- and sketch-conditioned image generation

Make-A-Scene is more than just "another DALL·E". DALL·E can generate random images from text prompts, which is really cool, but it also limits the user's control over the result. Meta's goal is to promote creative expression by combining the text-to-image trend with the earlier sketch-to-image models, resulting in "Make-A-Scene": a fantastic blend of text- and sketch-conditioned image generation.

Paper link: https://arxiv.org/abs/2203.13131
Video explanation: https://youtu.be/K3bZXXjW788
Short analysis: https://www.louisbouchard.ai/make-a-scene/

BANMo: Build an animatable 3D model of a target from any video

Based on this research from Meta, you only need to provide videos that capture a deformable object, for example a few clips of cats or dogs, and BANMo can reconstruct an editable, animatable 3D model by integrating 2D cues from thousands of frames into a canonical space, with no predefined shape template required.

Paper link: https://arxiv.org/abs/2112.12761
Project address: https://github.com/facebookresearch/banmo
Video explanation: https://youtu.be/jDTy-liFoCQ
Short analysis: https://www.louisbouchard.ai/banmo/

High-resolution image synthesis with latent diffusion models

Diffusion models have recently achieved SOTA results on most image tasks, including text-to-image with DALL·E, and on many other image generation tasks such as image inpainting, style transfer, and image super-resolution. A minimal usage sketch follows the links below.

Paper link: https://arxiv.org/abs/2112.10752
Video explanation: https://youtu.be/RGBNdD3Wn-g
Short analysis: https://www.louisbouchard.ai/latent-diffusion-models/
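Stable Diffusion is the best-known system built on this latent-diffusion work, and a minimal usage sketch with the Hugging Face diffusers library looks roughly like this (the model ID, prompt, and GPU assumption are illustrative, not part of the paper itself):

```python
import torch
from diffusers import StableDiffusionPipeline

# The denoising runs in a compressed latent space, which is what makes
# 512x512 generation tractable on a single consumer GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```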
PSG: Panoptic scene graph generation

Researchers from Nanyang Technological University propose the panoptic scene graph generation (PSG) task, built on panoptic segmentation. Compared with traditional detection-box-based scene graph generation, the PSG task requires a comprehensive output of all relationships in the image (object-to-object, object-to-background, and background-to-background) and uses accurate segmentation masks to locate objects.
Paper link: https://arxiv.org/abs/2207.11247
Project address: https://psgdataset.org/
Online application: https://huggingface.co/spaces/ECCV2022/PSG
Video explanation: https://youtu.be/cSsE_H_0Cr8
Short analysis: https://www.louisbouchard.ai/psg/
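Purely as an illustration of what a panoptic scene graph contains (this is a hypothetical data structure, not the authors' file format), each image is described by pixel-accurate segments plus subject-predicate-object relations between them:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Segment:
    category: str        # e.g. "person", or a stuff class such as "grass"
    mask: np.ndarray     # boolean pixel mask of shape (H, W)

@dataclass
class Relation:
    subject_index: int   # index into the segment list
    predicate: str       # e.g. "standing on", "holding"
    object_index: int

H, W = 480, 640
segments = [
    Segment("person", np.zeros((H, W), dtype=bool)),
    Segment("grass", np.ones((H, W), dtype=bool)),
]
relations = [Relation(subject_index=0, predicate="standing on", object_index=1)]
print(relations[0])
```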
Textual Inversion: Personalized text-to-image generation

This year, every major vendor's image generation model has shown off its own special powers, but how do you make a model generate images in a specific, personal style? Scholars from Tel Aviv University and NVIDIA collaborated on a personalized generation method that lets you customize the images you want.

Paper link: https://arxiv.org/abs/2208.01618
Project address: https://textual-inversion.github.io/
Video explanation: https://youtu.be/f3oXa7_SYek
Short analysis: https://www.louisbouchard.ai/imageworthoneword/

X-CLIP: Expanding language-image pretrained models for general video recognition

Vision-language model learning has undoubtedly been a great success, but how to extend this new language-image pretraining paradigm to the video domain remains an open question. Scholars from Microsoft and the Chinese Academy of Sciences propose a simple and effective method that directly adapts pretrained language-image models to video recognition instead of pretraining new models from scratch.

Paper link: https://arxiv.org/abs/2208.02816
Project address: https://github.com/microsoft/VideoX/tree/master/X-CLIP
Video explanation: https://youtu.be/seb4lmVPEe8
Short analysis: https://www.louisbouchard.ai/general-video-recognition/

Make-A-Video: Text-to-video generation

A painter paints on the canvas to their heart's content, the picture clear and smooth; would you believe every frame of the video was generated by AI? Make-A-Video from Meta AI can generate videos in different styles within seconds from just a few input words; calling it the "video version of DALL·E" is no exaggeration.

Paper link: https://arxiv.org/abs/2209.14792
Video explanation: https://youtu.be/MWwESVyHWto

Whisper: Large-scale weakly supervised speech recognition

Have you ever wished for translation software that could quickly transcribe and translate the speech in a video, even in a language you do not understand? OpenAI's open-source Whisper does exactly that. Whisper was trained on more than 680,000 hours of multilingual data; it can recognize speech in many languages against noisy backgrounds, convert it into text, and also translate professional terminology. A hedged usage sketch follows the links below.
Paper link: https://arxiv.org/abs/2212.04356
Project address: https://github.com/openai/whisper
Video explanation: https://youtu.be/uFOkMme19Zs
Short analysis: https://www.louisbouchard.ai/whisper/
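A minimal usage sketch with the open-source whisper package (installable as openai-whisper); the audio file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")                 # multilingual checkpoint
result = model.transcribe("meeting.mp3")           # placeholder file name
print(result["language"], result["text"])

# task="translate" asks Whisper to translate the recognized speech into English.
english = model.transcribe("meeting.mp3", task="translate")
print(english["text"])
```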
DreamFusion: Using 2D diffusion models to generate 3D models

Text can already generate images and videos, and now 3D models too. DreamFusion from Google can generate 3D models in one click by using a pretrained 2D text-to-image diffusion model; a diffusion model trained on billions of image-text pairs drives this latest breakthrough in text-to-3D synthesis.

Paper link: https://arxiv.org/abs/2209.14988
Video explanation: https://youtu.be/epuU0VRIcjE
Short analysis: https://www.louisbouchard.ai/dreamfusion/

Imagic: Real-image editing based on diffusion models

Researchers from Google, the Technion-Israel Institute of Technology, and the Weizmann Institute of Science introduce Imagic, a diffusion-based method that can "Photoshop" real photos using only text. For example, it can change a person's pose and composition while preserving their original features, make a standing dog sit down, or make a bird spread its wings.
Paper link: https://arxiv.org/abs/2210.09276
Video explanation: https://youtu.be/gbpPQ5kVJhM
Short analysis: https://www.louisbouchard.ai/imagic/

eDiffi: Higher-quality text-to-image synthesis

An image synthesis model stronger than DALL·E and Stable Diffusion is here: NVIDIA's eDiffi generates higher-quality images more accurately. In addition, its brush templates add even more creativity and flexibility to your work.
Paper link: https://arxiv.org/abs/2211.01324
Project address: https://deepimagination.cc/eDiff-I/
Video explanation: https://youtu.be/grwp-ht_ixo
Short analysis: https://www.louisbouchard.ai/ediffi/

InfiniteNature-Zero: Learning infinite view generation of natural scenes from a single image

Have you ever imagined taking a photo, opening it like a door, and flying into the picture? Scholars from Google and Cornell University have turned this fantasy into reality with InfiniteNature-Zero, which can generate unlimited views of a natural scene from a single image.

Paper link: https://arxiv.org/abs/2207.11148
Project address: https://infinite-nature.github.io/
Video explanation: https://youtu.be/FQzGhukV-l0
Short analysis: https://www.louisbouchard.ai/infinitenature-zero

Galactica: A large language model for science

Galactica, developed by Meta, is a large language model comparable in size to GPT-3, but specialized in scientific knowledge. The model can write government white papers, news commentary, Wikipedia pages, and code; it also knows how to cite sources and how to write equations. This is a big deal for artificial intelligence and for science.

Paper link: https://arxiv.org/abs/2211.09085
Video explanation: https://youtu.be/2GfxkCWWzLU

RAD-NeRF: Real-time portrait synthesis based on audio-spatial decomposition

Since the emergence of DeepFakes and NeRF, AI face swapping has seemed commonplace, but there is a problem: AI-swapped faces sometimes give themselves away because the mouth shape does not match the speech. RAD-NeRF solves this problem: it can synthesize the speakers appearing in a video in real time, and it also supports custom avatars.
Paper link: https://arxiv.org/abs/2211.12368
Project address: https://me.kiui.moe/radnerf/
Video explanation: https://youtu.be/JUqnLN6Q4B0
Short analysis: https://www.louisbouchard.ai/rad-nerf/

ChatGPT: A language model optimized for dialogue

Released by OpenAI at the end of 2022, ChatGPT is a language model fine-tuned for conversational dialogue, and it quickly took the internet by storm.
Video explanation: https://youtu.be/AsFgn8vU-tQ
Short analysis: https://www.louisbouchard.ai/chatgpt/

FRAN: Production-ready video face re-aging

Current computer-vision models can change the apparent age of a face, transfer styles, and more, but these results only look cool and are rarely usable in real productions: existing techniques typically suffer from loss of facial features, low resolution, and unstable results across consecutive video frames, so manual touch-up is still required afterwards. Disney has now released FRAN (Face Re-Aging Network), the first practical, fully automatic, production-ready method for re-aging faces in video, officially ending the era of relying on makeup artists to change an actor's visual age in movies.

Paper link: https://dl.acm.org/doi/pdf/10.1145/3550454.3555520
Project address: https://studios.disneyresearch.com/2022/11/30/production-ready-face-re-aging-for-visual-effects/
Video explanation: https://youtu.be/WC03N0NFfwk
Short analysis: https://www.louisbouchard.ai/disney-re-age/