
[Paper Interpretation] Image-based self-supervised learning with a joint-embedding predictive architecture (I-JEPA)

Oct 10, 2023, 01:41 PM

1. Brief introduction

This paper demonstrates a method for learning highly semantic image representations without relying on hand-crafted data augmentation. It introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of several target blocks in the same image. The core design choice guiding I-JEPA toward semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks at a sufficiently large scale (15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, the paper finds that I-JEPA is highly scalable when combined with Vision Transformers. For example, the paper trains a ViT-Huge/16 on ImageNet in 38 hours using 32 A100 GPUs and achieves strong downstream performance across a wide range of tasks requiring different levels of abstraction, from linear classification to object counting and depth prediction.

2. Research background

In computer vision, there are two common families of image self-supervised learning methods: invariance-based methods and generative methods.

Invariance-based pre-training optimizes an encoder to produce similar embeddings for two or more views of the same image. Typically, the image views are constructed with a set of hand-crafted data augmentations, such as random scaling, cropping, and color jittering. These pre-training methods can produce representations with a high semantic level, but they also introduce strong biases that may be detrimental to certain downstream tasks, or even to pre-training tasks with different data distributions.

A view from cognitive learning theory is that one driving mechanism behind representation learning in biological systems is the adaptation of an internal model to predict responses to sensory input. This idea is at the heart of self-supervised generative methods, which remove or corrupt parts of the input and learn to predict the corrupted content. In particular, masked denoising methods learn representations by reconstructing randomly masked patches of the input at the pixel or token level. Compared with view-invariance methods, masked pre-training tasks require less prior knowledge and generalize easily beyond the image modality. However, the resulting representations usually sit at a lower semantic level and lag behind invariance-based pre-training in off-the-shelf evaluations such as linear probing, as well as in transfer settings with limited supervision on semantic classification tasks. Therefore, a more involved adaptation mechanism (e.g., end-to-end fine-tuning) is required to reap the full benefits of these methods.

In this work, the paper explores how to improve the semantic level of self-supervised representations without encoding additional prior knowledge about image transformations. To this end, the paper introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA); Figure 3 provides an illustration of the approach. The idea behind I-JEPA is to predict missing information in an abstract representation space: for example, given a single context block, predict the representations of several target blocks in the same image, where the target representations are computed by a learned target encoder network.

Compared with generative methods that predict in pixel/token space, I-JEPA uses abstract prediction targets that may eliminate unnecessary pixel-level details, leading the model to learn more semantic features. Another core design choice guiding I-JEPA toward semantic representations is the proposed multi-block masking strategy. Specifically, the paper demonstrates the importance of using an informative (spatially distributed) context block to predict several target blocks (of sufficiently large scale) in an image. Based on extensive empirical evaluation, the paper shows the following:

I-JEPA learns powerful off-the-shelf semantic representations without using hand-crafted view augmentations (Figure 1). It outperforms pixel-reconstruction methods such as MAE on ImageNet-1K linear probing, semi-supervised 1% ImageNet-1K, and semantic transfer tasks.

I-JEPA is competitive with view-invariance-based pre-training methods on semantic tasks and achieves better performance on low-level vision tasks such as object counting and depth prediction. With a simpler model and a less rigid inductive bias, I-JEPA is applicable to a wider set of tasks.

I-JEPA is also scalable and efficient. Pre-training a ViT-H/14 on ImageNet takes approximately 2400 GPU hours, which is 50% faster than ViT-B/16 pre-trained with iBOT and 140% faster than ViT-L/16 pre-trained with MAE. Predicting in representation space significantly reduces the total computation required for self-supervised pre-training.

Self-supervised learning is an approach to representation learning in which a system learns to capture the relationships between its inputs. This goal can be readily described in the framework of energy-based models (EBMs), where the aim of self-supervision is to assign high energy to incompatible inputs and low energy to compatible inputs. Many existing generative and non-generative self-supervised learning methods can indeed be cast in this framework; see Figure 2.

Joint-Embedding Architectures. Invariance-based pre-training can be cast in the EBM framework using a joint-embedding architecture; see Figure 2a. The learning objective of a joint-embedding architecture is to output similar embeddings for compatible inputs x and y, and dissimilar embeddings for incompatible inputs. In image-based pre-training, compatible x, y pairs are typically constructed by randomly applying hand-crafted data augmentations to the same input image. The main challenge of joint-embedding architectures is representation collapse, in which the energy landscape is flat (i.e., the encoder produces a constant output regardless of the input). Over the past few years, several approaches to preventing collapse have been studied, such as contrastive losses that explicitly push apart the embeddings of negative examples, non-contrastive losses that minimize the informational redundancy across embeddings, and clustering-based methods that maximize the average entropy of the embeddings. There are also heuristic approaches that use an asymmetric architectural design between the x-encoder and the y-encoder to avoid collapse.

Generative Architectures. Reconstruction-based self-supervised learning methods can also be cast in the EBM framework using generative architectures; see Figure 2b.

Generative architectures learn to reconstruct a signal y directly from a compatible signal x, using a decoder network conditioned on additional (possibly latent) variables z to facilitate reconstruction. In image-based pre-training, a common approach is to use masking to produce compatible x, y pairs, where x is a copy of the image y with some of its patches masked. The conditioning variable z then corresponds to a set of (possibly learnable) mask and position tokens that tell the decoder which image patches to reconstruct. As long as the informational capacity of z is low compared to the signal y, these architectures are not prone to representation collapse.

Joint-Embedding Predictive Architectures. As shown in Figure 2c, joint-embedding predictive architectures are conceptually similar to generative architectures; a key difference, however, is that the loss function is applied in embedding space rather than input space. A JEPA learns to predict the embedding of a signal y from a compatible signal x, using a predictor network conditioned on additional (possibly latent) variables z to facilitate prediction. The proposed I-JEPA instantiates this architecture in the context of images using masking; see Figure 3. In contrast to joint-embedding architectures, a JEPA does not seek representations that are invariant to a set of hand-crafted data augmentations, but rather representations that are predictive of each other when conditioned on the additional information z. However, as with joint-embedding architectures, representation collapse is also a concern for JEPAs; the paper uses an asymmetric architecture between the x- and y-encoders to avoid representation collapse in I-JEPA.
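To make the distinction between the three architectures concrete, the following is a minimal PyTorch-style sketch of where the energy/loss is computed in each case, mirroring Figure 2. The x_encoder, y_encoder, decoder, and predictor arguments are stand-in modules, and the use of a mean-squared error is an illustrative assumption; this is a schematic sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_embedding_energy(x_encoder, y_encoder, x, y):
    # Figure 2a: similar embeddings for compatible views; a separate mechanism
    # (contrastive / non-contrastive / clustering) is needed to prevent collapse.
    return F.mse_loss(x_encoder(x), y_encoder(y))

def generative_energy(x_encoder, decoder, x, y, z):
    # Figure 2b: reconstruct the signal y directly in input (pixel/token) space,
    # with the decoder conditioned on mask/position variables z.
    return F.mse_loss(decoder(x_encoder(x), z), y)

def jepa_energy(x_encoder, y_encoder, predictor, x, y, z):
    # Figure 2c: predict the *embedding* of y from the embedding of x,
    # conditioned on z; the loss lives in representation space.
    with torch.no_grad():  # asymmetry: no gradient flows to the target branch
        target = y_encoder(y)
    return F.mse_loss(predictor(x_encoder(x), z), target)
```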

3. Method introduction

The paper now describes the proposed Image-based Joint-Embedding Predictive Architecture (I-JEPA), illustrated in Figure 3. The overall objective is as follows: given a context block, predict the representations of several target blocks in the same image. The paper uses Vision Transformer (ViT) architectures for the context encoder, the target encoder, and the predictor. A ViT consists of a stack of Transformer layers, each comprising a self-attention operation followed by a fully connected MLP. The encoder/predictor design is reminiscent of the generative masked autoencoder (MAE) approach; a key difference, however, is that I-JEPA is non-generative and its predictions are made in representation space.
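Under the description above, one I-JEPA pre-training step could be sketched roughly as follows. The module interfaces (in particular predictor(ctx, target_positions=...)), the EMA momentum value, and the use of a mean-squared error in representation space are illustrative assumptions rather than the paper's exact implementation; the target encoder is typically initialized as a copy of the context encoder and is updated only via the EMA.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    # Target encoder follows the context encoder via an exponential moving average.
    # (Momentum value is an illustrative assumption.)
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(momentum).add_(pc, alpha=1.0 - momentum)

def ijepa_step(context_encoder, target_encoder, predictor, optimizer,
               patches, context_idx, target_idx_list):
    """One pre-training step (sketch).

    patches:         [B, N, D] patch embeddings of the full image (with positions)
    context_idx:     indices of the visible context patches
    target_idx_list: list of index tensors, one per target block
    """
    # 1) Targets: representations of the *full* image from the EMA target encoder,
    #    from which the patches of each target block are selected.
    with torch.no_grad():
        h = target_encoder(patches)                    # [B, N, D]
        targets = [h[:, idx] for idx in target_idx_list]

    # 2) Context: the context encoder only processes the visible context patches.
    ctx = context_encoder(patches[:, context_idx])     # [B, |ctx|, D]

    # 3) Predictor: for each target block, predict its representations from the
    #    context, conditioned on mask tokens carrying the target positions.
    loss = 0.0
    for idx, tgt in zip(target_idx_list, targets):
        pred = predictor(ctx, target_positions=idx)    # assumed interface
        loss = loss + F.mse_loss(pred, tgt)            # loss in representation space
    loss = loss / len(target_idx_list)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(target_encoder, context_encoder)
    return loss.item()
```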


4. Image Classification

To demonstrate that I-JEPA learns high-level representations without relying on hand-crafted data augmentation, the paper reports results on various image classification tasks using linear probing and partial fine-tuning protocols. In this section, the paper considers self-supervised models pre-trained on the ImageNet-1K dataset; see Appendix A of the paper for implementation details of pre-training and evaluation. Unless explicitly stated otherwise, all I-JEPA models are trained at resolution 224×224.

ImageNet-1K. Table 1 shows performance on the common ImageNet-1K linear evaluation benchmark. After self-supervised pre-training, the model weights are frozen and a linear classifier is trained on top using the full ImageNet-1K training set. Compared with the popular masked autoencoder (MAE) and data2vec methods, which likewise do not rely on extensive hand-crafted data augmentation during pre-training, I-JEPA significantly improves linear probing performance while using less computation. Additionally, I-JEPA benefits from scale: a ViT-H/16 trained at resolution 448 matches the performance of view-invariant methods such as iBOT without requiring additional hand-crafted data augmentation.
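For reference, the linear evaluation protocol used here can be sketched as follows: the pre-trained encoder is frozen and only a linear classifier trained on average-pooled patch features receives gradients. The feature dimension, optimizer, and schedule below are placeholder assumptions, not the paper's evaluation recipe.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim=1280, num_classes=1000,
                 epochs=10, lr=1e-3, device="cuda"):
    # Freeze the pre-trained encoder; only the linear head receives gradients.
    encoder.eval().requires_grad_(False)
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images).mean(dim=1)  # average-pool patch tokens
            loss = ce(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```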


Low-shot ImageNet-1K. Table 2 shows performance on the 1% ImageNet benchmark. Here, the pre-trained models are adapted for ImageNet classification using only 1% of the ImageNet labels, roughly 12 or 13 images per class; each model is adapted via fine-tuning or linear probing, whichever works best for that method. With a comparable encoder architecture, I-JEPA outperforms MAE while requiring fewer pre-training epochs. With the ViT-H/14 architecture, I-JEPA matches the performance of a ViT-L/16 pre-trained with data2vec while using significantly less compute. By increasing the image input resolution, I-JEPA outperforms previous methods, including joint-embedding methods that leverage additional hand-crafted data augmentation during pre-training, such as MSN, DINO, and iBOT.

Transfer learning. Table 3 shows performance on various downstream image classification tasks using linear probes. I-JEPA significantly outperforms previous methods that do not use augmentations (MAE and data2vec) and narrows the gap with the best methods that leverage hand-crafted view invariance during pre-training, even surpassing the popular DINO on CIFAR100 and Places205.

5. Local Prediction Tasks

I-JEPA learns semantic image representations that significantly improve downstream image classification performance over previous methods such as MAE and data2vec. Furthermore, I-JEPA benefits from scale and can close the gap with, and even surpass, view-invariance-based methods that leverage additional hand-crafted data augmentation. In this section, the paper finds that I-JEPA also learns local image features and outperforms view-invariance-based methods on low-level and dense prediction tasks such as object counting and depth prediction.

Table 4 shows performance on various low-level tasks using linear probing. After pre-training, the model weights are frozen and a linear model is trained on top for object counting and depth prediction on the Clevr dataset. Compared with view-invariance methods such as DINO and iBOT, the I-JEPA method effectively captures low-level image features during pre-training and outperforms them on object counting (Clevr/Count) and, by a large margin, depth prediction (Clevr/Dist).

6. Scalability

Compared with previous methods, I-JEPA is highly scalable in terms of model efficiency. Figure 5 shows semi-supervised evaluation on 1% ImageNet-1K as a function of GPU hours. I-JEPA requires less compute than previous methods and achieves strong performance without relying on hand-crafted data augmentation. Compared with reconstruction-based methods such as MAE, which use pixels directly as targets, I-JEPA does introduce extra overhead from computing targets in representation space (approximately 7% slower per iteration).

Scaling data size. The paper also finds that I-JEPA benefits from pre-training on larger datasets. Table 5 shows transfer learning performance on semantic and low-level tasks when the size of the pre-training dataset is increased (IN1K vs. IN22K): transfer performance on these conceptually different tasks improves when pre-training on a larger, more diverse dataset.

Scaling model size. Table 5 also shows that I-JEPA benefits from larger model size when pre-trained on IN22K. Compared with the ViT-H/14 model, pre-training a ViT-G/16 significantly improves downstream performance on image classification tasks such as Places205 and iNat18. The ViT-G/16 model does not improve performance on low-level downstream tasks; ViT-G/16 uses a larger input patch size, which may be detrimental to local prediction tasks.


7. Predictor Visualizations

The role of the predictor in I-JEPA is to take the output of the context encoder and, conditioned on positional mask tokens, predict the representations of a target block at the location specified by those mask tokens. One question is whether a predictor conditioned on positional mask tokens learns to correctly capture positional uncertainty in the target. To study this question qualitatively, the paper visualizes the outputs of the predictor. After pre-training, the weights of the context encoder and predictor are frozen, and a decoder is trained following the RCDM framework to map the average-pooled predictor outputs back to pixel space. Figure 6 shows decoder outputs for various random seeds. Characteristics that are common across samples represent information contained in the average-pooled predictor representation: the I-JEPA predictor correctly captures positional uncertainty and produces high-level object parts with correct poses (e.g., the back of a bird and the top of a car). Characteristics that vary across samples represent information not contained in the representation; in this case, the I-JEPA predictor discards precise low-level details and background information.


8. Ablations

Predicting in representation space. Table 7 compares low-shot performance on 1% ImageNet-1K when the loss is computed in pixel space versus representation space. The paper conjectures that a key component of I-JEPA is that the loss is computed entirely in representation space, allowing the target encoder to produce abstract prediction targets that discard irrelevant pixel-level details. As is clear from Table 7, predicting in pixel space leads to a significant degradation in linear probing performance.
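This ablation amounts to changing only the prediction target, as in the following sketch: either representations produced by the target encoder (I-JEPA) or the raw pixels of the masked patches, here with a per-patch normalization. The function and argument names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def prediction_loss(pred, patches, target_encoder, target_idx,
                    target_space="representation"):
    """Compute the loss either in representation space or in pixel space (sketch)."""
    if target_space == "representation":
        with torch.no_grad():
            target = target_encoder(patches)[:, target_idx]  # abstract targets
    else:
        # Pixel-space ablation: regress the (per-patch normalized) raw pixels.
        target = patches[:, target_idx]
        target = (target - target.mean(-1, keepdim=True)) / \
                 (target.std(-1, keepdim=True) + 1e-6)
    return F.mse_loss(pred, target)
```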


Masking strategy. Table 8 ablates the multi-block masking strategy proposed for I-JEPA pre-training by reducing the number of target blocks and adjusting the scales of the context and target blocks, as illustrated in Figure 4. I-JEPA is trained for 300 epochs under the various multi-block settings, and performance is compared on the 1% ImageNet-1K benchmark using linear probes. In summary, the paper finds that predicting several relatively large (semantic) target blocks, combined with an informative (spatially distributed) context block, is very important.
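A minimal sketch of such a multi-block mask sampler is shown below. The target scale range (15%-20% of the image) and the use of a single large context block with overlapping target regions removed follow the description in the text; the number of target blocks, the aspect-ratio range, and the context scale range are illustrative assumptions.

```python
import math
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range, rng=random):
    """Sample one rectangular block of patch indices on a ViT patch grid.

    scale_range: fraction of all patches covered by the block.
    aspect_range: allowed height/width aspect ratios of the block.
    (Illustrative helper; the ranges used below are assumptions.)
    """
    num_patches = grid_h * grid_w
    scale = rng.uniform(*scale_range)
    aspect = rng.uniform(*aspect_range)
    # Block height/width (in patches) matching the requested area and aspect ratio.
    h = max(1, min(grid_h, round(math.sqrt(num_patches * scale * aspect))))
    w = max(1, min(grid_w, round(math.sqrt(num_patches * scale / aspect))))
    top = rng.randint(0, grid_h - h)
    left = rng.randint(0, grid_w - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

def sample_masks(grid_h=14, grid_w=14, num_targets=4):
    """Multi-block masking: several semantic-scale targets + one informative context."""
    # (a)+(b): several target blocks, each covering roughly 15%-20% of the image.
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    # (c): a large, spatially distributed context block.
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    # Remove any patch that belongs to a target, so context and targets do not overlap.
    for t in targets:
        context -= t
    return context, targets

context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```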


Table 6 performs similar ablations, comparing against other masking strategies. The paper compares with a rasterized masking strategy, in which the image is split into four large quadrants and the goal is to use one quadrant as context to predict the other three. The paper also compares with the traditional block and random masking strategies commonly used in reconstruction-based methods: in block masking, the target is a single image block and the context is the complement of the image; in random masking, the target is a random (possibly discontiguous) set of image patches, and the context is again the image complement. Note that in all masking strategies considered, there is no overlap between the context and target blocks. The proposed multi-block masking strategy is key to I-JEPA learning semantic representations: even switching to traditional block masking reduces ImageNet performance by more than 24%.
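Under the same grid-of-patches setup as the sampler above, the rasterized and random baselines could be generated roughly as follows; which quadrant serves as context in the rasterized case and the 75% masking ratio in the random case are illustrative assumptions, not values from the paper.

```python
import random

def rasterized_mask(grid_h, grid_w, rng=random):
    """Split the patch grid into four quadrants: one is the context, three are targets."""
    quads = [{(r, c) for r in rows for c in cols}
             for rows in (range(grid_h // 2), range(grid_h // 2, grid_h))
             for cols in (range(grid_w // 2), range(grid_w // 2, grid_w))]
    ctx_id = rng.randrange(4)          # assumed: context quadrant chosen at random
    context = quads[ctx_id]
    targets = [q for i, q in enumerate(quads) if i != ctx_id]
    return context, targets

def random_mask(grid_h, grid_w, ratio=0.75, rng=random):
    """Random masking: a random (possibly discontiguous) set of patches is the target,
    and the complement of the image is the context."""
    all_patches = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    rng.shuffle(all_patches)
    cut = int(len(all_patches) * ratio)
    target = set(all_patches[:cut])
    context = set(all_patches[cut:])
    return context, [target]
```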


9. Conclusion

The paper proposes I-JEPA, a method for learning semantic image representations that does not rely on hand-crafted data augmentation. The study shows that, by making predictions in representation space, I-JEPA converges faster than pixel-reconstruction methods and learns representations with a high semantic level. Compared with view-invariance-based methods, I-JEPA points to a path for learning general representations with joint-embedding architectures without relying on hand-crafted view augmentations.

Appendix: see the original paper at https://arxiv.org/abs/2301.08243
