The AIxiv column is a column where academic and technical content is published on this site. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
With the deepening of research on large models, how to promote them to more modalities has become a hot topic in academia and industry. Recently released large closed-source models such as GPT-4o and Claude 3.5 already have strong image understanding capabilities, and open-source field models such as LLaVA-NeXT, MiniCPM, and InternVL have also shown performance that is getting closer to closed-source.
In this era of "80,000 kilograms per mu" and "one SoTA every 10 days", a multi-modal evaluation framework that is easy to use, has transparent standards and is reproducible has become increasingly important, and this is not easy.
In order to solve the above problems, researchers from Nanyang Technological University's LMMs-Lab jointly open sourced LMMs-Eval, which is an evaluation framework specially designed for multi-modal large-scale models and provides evaluation of multi-modal models (LMMs). A one-stop, efficient solution.
Code repository: https://github.com/EvolvingLMMs-Lab/lmms-eval
Official homepage: https://lmms-lab.github.io/
Paper address : https://arxiv.org/abs/2407.12772
List address: https://huggingface.co/spaces/lmms-lab/LiveBench
Since its release in March 2024, LMMs-Eval The framework has received collaborative contributions from the open source community, companies, and universities. It has now received 1.1K Stars on Github, with more than 30+ contributors, including a total of more than 80 data sets and more than 10 models, and it continues to increase.
Standardized assessment framework
In order to provide a standardized assessment platform, LMMs-Eval includes the following features:
Unified interface: LMMs-Eval is based on the text assessment framework lm-evaluation-harness It has been improved and expanded to facilitate users to add new multi-modal models and data sets by defining a unified interface for models, data sets and evaluation indicators.
One-Click Launch: LMMs-Eval hosts over 80 (and growing) datasets on HuggingFace, carefully transformed from the original sources, including all variants, versions, and splits. Users do not need to make any preparations. With just one command, multiple data sets and models will be automatically downloaded and tested, and the results will be available in a few minutes.
Transparent and reproducible: LMMs-Eval has a built-in unified logging tool. Each question answered by the model and whether it is correct or not will be recorded, ensuring reproducibility and transparency. It also facilitates comparison of the advantages and disadvantages of different models.
The vision of LMMs-Eval is that future multi-modal models no longer need to write their own data processing, inference and submission code. In today's environment where multi-modal test sets are highly concentrated, this approach is unrealistic, and the measured scores are difficult to directly compare with other models. By accessing LMMs-Eval, model trainers can focus more on improving and optimizing the model itself, rather than spending time on evaluation and alignment results.
The "Impossible Triangle" of Evaluation
The ultimate goal of LMMs-Eval is to find a 1. wide coverage 2. low cost 3. zero data leakage method to evaluate LMMs. However, even with LMMs-Eval, the author team found it difficult or even impossible to do all three at the same time.
As shown in the figure below, when they expanded the evaluation dataset to more than 50, it became very time-consuming to perform a comprehensive evaluation of these datasets. Furthermore, these benchmarks are also susceptible to contamination during training. To this end, LMMs-Eval proposed LMMs-Eval-Lite to take into account wide coverage and low cost. They also designed LiveBench to be low cost and with zero data leakage.
LMMs-Eval-Lite: Wide coverage lightweight evaluation
大規模なモデルを評価する場合、膨大な数のパラメータとテストタスクにより、評価タスクの時間とコストが大幅に増加することが多いため、誰もが小規模なモデルを使用することを選択することがよくあります。データ セットを使用するか、評価に特定のデータ セットを使用します。ただし、限定された評価では、モデルの機能の理解が不足することがよくあります。評価の多様性と評価コストの両方を考慮するために、LMMs-Eval は LMMs-Eval-Lite
LiveBench: LMM 動的テスト
従来のベンチマークは、固定された質問と回答を使用した静的な評価に重点を置いています。マルチモーダル研究の進歩により、スコア比較ではオープンソース モデルが GPT-4V などの商用モデルよりも優れていることがよくありますが、実際のユーザー エクスペリエンスでは劣ります。動的なユーザー主導のチャットボット Arenas と WildVision は、モデルの評価にますます人気が高まっていますが、何千ものユーザー設定を収集する必要があり、評価には非常にコストがかかります。 LiveBench の中心的なアイデアは、汚染ゼロを達成し、コストを低く抑えるために、継続的に更新されるデータセットでモデルのパフォーマンスを評価することです。著者チームは Web から評価データを収集し、ニュースやコミュニティ フォーラムなどの Web サイトから最新のグローバル情報を自動的に収集するパイプラインを構築しました。情報の適時性と信頼性を確保するために、著者チームは CNN、BBC、日本の朝日新聞、中国の新華社通信を含む 60 以上の報道機関や Reddit などのフォーラムから情報源を選択しました。具体的な手順は次のとおりです。The above is the detailed content of The multimodal model evaluation framework lmms-eval is released! Comprehensive coverage, low cost, zero pollution. For more information, please follow other related articles on the PHP Chinese website!