2022 was a big year for AI, and also for data competitions, with total prize money across all platforms exceeding $5 million.
Recently, ML Contests, a platform that tracks and analyzes machine learning competitions, published a large-scale statistical analysis of the data competitions held in 2022. The new report looks at all the noteworthy developments of the year. The following is a compilation of the original text.
Highlights:
The competition with the largest prize money was DrivenData’s Snowcast Showdown, sponsored by the U.S. Bureau of Reclamation. It offered $500,000 in prize money and was designed to help improve water supply management by providing accurate snow water equivalent estimates for different regions across the Western US. As always, DrivenData has published a detailed write-up of the competition and a detailed solution report, both well worth a read.
The most popular competition of 2022 was Kaggle’s American Express Default Prediction competition, which aimed to predict whether customers would repay their loans. More than 4,000 teams competed, with $100,000 in prize money distributed among the top four teams. The winner was a first-time competitor, entering as a one-person team, who used an ensemble of neural networks and LightGBM models.
The largest independent competition was Stanford University’s AI Audit Challenge, which offered a $71,000 prize pool for the best “models, solutions, datasets, and tools” for auditing AI systems for illegal discrimination.
The three competitions based on financial prediction were all hosted on Kaggle: JPX’s Tokyo Stock Exchange prediction, Ubiquant’s market prediction, and G-Research’s crypto prediction.
Broken down by problem type, computer vision accounted for the largest share of competitions, NLP ranked second, and sequential decision-making problems (reinforcement learning) were on the rise. Kaggle responded to this growth in popularity by introducing simulation competitions in 2020. AIcrowd also hosts many reinforcement learning competitions. In 2022, 25 of these interactive competitions offered more than $300,000 in total prize money.
In the Real Robot Challenge, an official NeurIPS 2022 competition, participants had to learn to control a three-fingered robot so that it moved a cube to a target location or positioned it at a specific point in space, facing the right direction. Participants' policies were run on the physical robot every week, and the results were updated on the leaderboard. The prize was $5,000 plus the academic honor of speaking at the NeurIPS Symposium.
Although Kaggle and Tianchi are the names most people know, many other machine learning competition platforms are now active, and together they form a lively ecosystem.
The picture below shows the 2022 platform comparison:
Most of the prize money for competitions run on the large platforms comes from industry, but machine learning competitions clearly have a richer history in academia, as Isabelle Guyon discussed in her NeurIPS invited talk this year.
NeurIPS is one of the most prestigious academic machine learning conferences in the world. Many of the most important machine learning papers of the past decade were presented there, including AlexNet, GAN, Transformer and GPT-3.
NeurIPS first held the Challenges in Machine Learning (CiML) workshop in 2014, and the conference has had a competition track since 2017. Since then, the number of competitions and the total prize money have continued to grow, with prize money reaching nearly $400,000 in December 2022.
Other machine learning conferences also host competitions, including CVPR, ICPR, IJCAI, ICRA, ECCV, PCIC, and AutoML.
About half of all machine learning competitions had prize pools of over $10,000. Many interesting competitions undoubtedly offer only small prizes, but this report considers only those with monetary prizes or academic honors. Data competitions associated with prestigious academic conferences often provide winners with travel grants to attend the conference.
While some competition platforms do tend to have larger prize pools on average than others (see the platform comparison chart), many platforms hosted at least one competition with a very large prize pool in 2022: the top ten competitions by total prize money include ones run on DrivenData, Kaggle, CodaLab and AIcrowd.
The survey analyzes the techniques used in winning solutions, based on questionnaires and inspection of the winners' code.
Quite consistently, Python was the language of choice for competition winners, which will surprise few people. Of those who used Python, about half worked primarily in Jupyter Notebooks, and the other half used standard Python scripts.
One winning solution used mostly R: Amir Ghazi won the Kaggle competition to predict the results of the 2022 US men’s college basketball tournament. He did this by using, apparently copied verbatim, code from a 2018 competition-winning solution written by Kaggle Grandmaster Darius Barušauskas. Remarkably, Darius also entered this competition in 2022, using a new approach, and finished 593rd.
Looking at the packages used in the winning solutions, all winners who used Python used the PyData stack to some extent.
The most popular software packages fall into three categories: core toolkits, NLP, and computer vision.
Among them, the deep learning framework PyTorch has grown steadily, and its jump from 2021 to 2022 is striking: PyTorch's share of winning solutions rose from 77% to 96%.
Of the 46 winning solutions using deep learning, 44 (about 96%) used PyTorch as their primary framework and only two used TensorFlow. Even more tellingly, one of the two competitions won using TensorFlow, Kaggle's Great Barrier Reef Competition, offered an additional $50,000 in prize money to a winning team that used TensorFlow. The other competition won using TensorFlow used the high-level Keras API.
Three winners used pytorch-lightning and one used fastai, both of which are built on top of PyTorch, but the vast majority used PyTorch directly.
It can now fairly be said that, at least in data competitions, PyTorch has won the machine learning framework battle. This is consistent with broader trends in machine learning research.
Notably, we found no instances of a winning team using any other neural network library, such as JAX (built by Google and used by DeepMind), PaddlePaddle (developed by Baidu) or MindSpore (developed by Huawei).
Tools tend toward a winner-takes-all outcome, but techniques do not. At CVPR 2022, the ConvNeXt architecture was introduced as the “ConvNet for the 2020s” and was shown to outperform recent Transformer-based models. It was used in at least two competition-winning computer vision solutions, and CNNs overall remain the most popular neural network architecture among computer vision competition winners to date.
Computer vision closely resembles language modeling in its use of pre-trained models: well-understood architectures trained on public datasets such as ImageNet. The most popular repository is the Hugging Face Hub, accessible through timm, which makes it extremely convenient to load pre-trained versions of dozens of different computer vision models.
The advantages of using pre-trained models are obvious: real-world images and human-generated text share some common characteristics, and a pre-trained model brings in this general knowledge, much as if a larger and more general training dataset had been used.
Typically, pre-trained models are fine-tuned, that is, trained further on task-specific data (such as data provided by the competition organizers), but not always. The winner of the Image Matching Challenge used pre-trained models without any fine-tuning at all: "Due to the (different) quality of the training and test data in this competition, we did not fine-tune using the provided training data because we thought it would not be very effective." The decision paid off.
By far the most popular type of pre-trained computer vision model among the 2022 winners was EfficientNet, which, as the name suggests, has the advantage of being less resource-intensive than many other models.
Transformer-based models have dominated the field of natural language processing (NLP) since their introduction in 2017. The Transformer is the "T" in BERT and GPT, and is also at the core of ChatGPT.
So it’s no surprise that all winning solutions in natural language processing competitions had Transformer-based models at their core. Nor is it a surprise that they were all implemented in PyTorch. They all used pre-trained models, loaded using Hugging Face’s Transformers library, and almost all used Microsoft Research’s DeBERTa model, usually deberta-v3-large.
Many of them required large amounts of computing resources. For example, the Google AI4Code winner ran an A100 (80GB) for approximately 10 days to train a single deberta-v3-large for their final solution. That approach, using a single main model and a fixed train/evaluation split, was the exception: all other solutions made heavy use of ensembles, and almost all used some form of k-fold cross-validation. For example, the winner of the Jigsaw Toxic Comments competition used a weighted average of the outputs of 15 models.
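To make the two ideas in this paragraph concrete, here is a minimal sketch of k-fold splitting and weighted-average ensembling in plain Python. It is not taken from any winner's code: the function names, the three sets of predictions and the weights are all made up for illustration, and real solutions would normally use a library such as scikit-learn for the splits.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx


def weighted_average(predictions, weights):
    """Blend per-model prediction lists into one list using normalised weights."""
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[row] for w, preds in zip(weights, predictions)) / total
        for row in range(n_rows)
    ]


# Hypothetical predicted probabilities from three models for four test rows.
model_preds = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.5, 0.2],
    [0.7, 0.1, 0.9, 0.3],
]
# Illustrative weights; in practice they are often tuned on out-of-fold predictions.
model_weights = [0.5, 0.3, 0.2]

blended = weighted_average(model_preds, model_weights)
splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample lands in exactly one validation fold, so every model is scored on data it did not train on, and the blended prediction is simply a per-row weighted mean of the individual models' outputs.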
Transformer-based ensembles were sometimes combined with LSTM or LightGBM models, and there were also at least two instances of pseudo-labeling being used effectively in winning solutions.
XGBoost was once synonymous with Kaggle. However, LightGBM was clearly the favorite gradient-boosted decision tree (GBDT) library among the 2022 winners: they mentioned LightGBM in their solution reports or questionnaires as many times as CatBoost and XGBoost combined. CatBoost came second and, surprisingly, XGBoost third.
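For readers unfamiliar with pseudo-labeling, the sketch below shows one round of it in plain Python: train on the labeled data, label the unlabeled points the model is confident about, then retrain on the enlarged set. The toy one-dimensional nearest-centroid classifier, the confidence measure, the 0.9 threshold and the data are all assumptions chosen to keep the example self-contained; the winning solutions applied the same idea to much larger models.

```python
def fit_centroids(xs, ys):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class label."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def predict_with_confidence(centroids, x):
    """Return (label, confidence) for x; assumes at least two classes.

    Confidence is the share of the combined distance to the two nearest
    centroids that belongs to the runner-up (1.0 means x sits on a centroid).
    """
    dists = sorted((abs(x - c), label) for label, c in centroids.items())
    (d_best, best), (d_next, _) = dists[0], dists[1]
    confidence = 1.0 if d_best + d_next == 0 else d_next / (d_best + d_next)
    return best, confidence


def pseudo_label(train_x, train_y, unlabeled_x, threshold=0.9):
    """One round of pseudo-labeling: train, label confident points, retrain."""
    model = fit_centroids(train_x, train_y)
    new_x, new_y = [], []
    for x in unlabeled_x:
        label, conf = predict_with_confidence(model, x)
        if conf >= threshold:  # keep only predictions the model is sure about
            new_x.append(x)
            new_y.append(label)
    # Retrain on the original labels plus the confident pseudo-labels.
    model = fit_centroids(train_x + new_x, train_y + new_y)
    return model, list(zip(new_x, new_y))


# Hypothetical data: two well-separated classes and three unlabeled points.
train_x, train_y = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]
unlabeled_x = [0.5, 5.2, 9.8]
model, added = pseudo_label(train_x, train_y, unlabeled_x)
```

Here the ambiguous point near the middle (5.2) falls below the threshold and is left out, while the two points close to a class centre are added to the training set with their predicted labels.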
As expected, most winners used GPUs for training, which can greatly improve training performance for gradient-boosted trees and is practically a requirement for deep neural networks. A significant number of winners had access to clusters provided by their employer or university, often including GPUs.
Somewhat surprisingly, we didn’t find any instances of Google’s Tensor Processing Units (TPUs) being used to train a winning model. We also didn’t see any winning models trained on Apple’s M-series chips, which PyTorch has supported since May 2022.
Google's cloud notebook solution Colab was popular, with one winner on the free plan, one on the Pro plan, and another on Pro+ (we could not confirm which plan a fourth winner who used Colab was on).
Local personal hardware was more popular than cloud hardware, although nine winners named the GPU they used for training without specifying whether it was a local or a cloud GPU.
The most popular GPU was NVIDIA's latest high-end AI accelerator, the A100 (the A100 40GB and A100 80GB are grouped together here, since winners did not always distinguish between them), and winners often used several of them: for example, the winner of Zindi's Turtle Recall competition used eight A100 (40GB) GPUs, and two other winners used four A100s.
Team Formation
Many competitions allow up to five participants per team, and teams can be formed by individuals or smaller teams "merging" at some point before the submission deadline.
Some competitions allow for larger teams, for example, Waymo’s Open Data Challenge allows up to 10 people per team.
Conclusion
This has been a rough look at the machine learning competitions of 2022. We hope you find some useful information in it.
There are many exciting new competitions in 2023, and we look forward to releasing more insights as they wrap up.