
Large models are ushering in the 'open source season', taking stock of the open source LLM and data sets in the past month

May 18, 2023

Some time ago, a leaked internal Google document argued that although OpenAI and Google appear to be chasing each other on large AI models, the real winner may not come from either of them, because a third force is quietly rising. That force is open source.

Around Meta's open-source LLaMA model, the community is rapidly building models with capabilities comparable to the large models from OpenAI and Google. Moreover, open-source models iterate faster, are more customizable, and offer more privacy.

Recently, Sebastian Raschka, former assistant professor at the University of Wisconsin-Madison and lead AI educator at the startup Lightning AI, said that the past month has been great for open source.

However, with so many large language models (LLMs) emerging one after another, it is not easy to keep track of them all. So, in this article, Sebastian shares resources and research insights on the latest open-source LLMs and datasets.


So many research papers have appeared in the past month that it is difficult to pick favorites for in-depth discussion. Sebastian prefers papers that provide additional insights rather than simply demonstrate more powerful models. In view of this, what first caught his attention was the Pythia paper co-authored by researchers from EleutherAI, Yale University, and other institutions.


Paper address: https://arxiv.org/pdf/2304.01373.pdf

Pythia: Gaining insights from large-scale training

The open-source Pythia family of large models is an interesting alternative to other autoregressive decoder-style models (i.e., GPT-like models). The paper reveals some interesting insights into the training mechanism and introduces the corresponding models, which range from 70M to 12B parameters.

The Pythia model architecture is similar to GPT-3 but includes improvements such as Flash Attention (as in LLaMA) and rotary positional embeddings (as in PaLM). Pythia was trained on 300B tokens of the 800GB diverse text dataset The Pile (1 epoch on the regular Pile, about 1.5 epochs on the deduplicated Pile).
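For readers who want to try the models, the Pythia checkpoints are published on the Hugging Face Hub under the EleutherAI organization. Below is a minimal sketch of loading the smallest checkpoint with the transformers library; the model name and generation settings are illustrative choices, not a recommendation.

```python
# Minimal sketch: loading a small Pythia checkpoint from the Hugging Face Hub.
# Assumes the `transformers` library is installed; checkpoint names follow the
# EleutherAI/pythia-<size> naming scheme used on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # smallest model in the suite
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Open-source language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```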


The following are some insights and reflections from the Pythia paper:

  • Does training on repeated data (i.e., more than one training epoch) have any impact? Results show that data deduplication neither improves nor harms performance;
  • Does training order affect memorization? Unfortunately, it turns out it does not. "Unfortunately," because if it did, the nasty verbatim-memorization problem could be mitigated by reordering the training data;
  • Doubling the batch size can halve training time without impairing convergence.

Open Source Data

The past month has been particularly exciting for open-source AI, with several open-source LLM implementations and a number of open-source datasets released. These datasets include Databricks-Dolly-15k and OpenAssistant Conversations (OASST1) for instruction fine-tuning, and RedPajama for pre-training. These dataset efforts are especially laudable because data collection and cleaning accounts for roughly 90% of real-world machine learning projects, yet few people enjoy this work.

Databricks-Dolly-15k dataset

Databricks-Dolly-15k is a dataset for LLM fine-tuning containing more than 15,000 instruction pairs written by thousands of Databricks employees (similar to the kind of data used to train systems like InstructGPT and ChatGPT).
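If you want to inspect the data yourself, the dataset is hosted on the Hugging Face Hub. The sketch below assumes the `datasets` library and the `databricks/databricks-dolly-15k` dataset id with its published field names (instruction, context, response, category).

```python
# Minimal sketch: inspecting databricks-dolly-15k with the Hugging Face `datasets` library.
# Dataset id and field names follow the public Hub release and may change over time.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly))             # ~15k instruction records

example = dolly[0]
print(example["instruction"]) # the task the annotator wrote
print(example["context"])     # optional supporting context (may be empty)
print(example["response"])    # the human-written answer
print(example["category"])    # e.g. open_qa, brainstorming, summarization
```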

OASST1 Dataset

The OASST1 dataset is used to fine-tune pretrained LLMs on a collection of ChatGPT-assistant-like conversations created and annotated by humans. It contains 161,443 messages written in 35 languages along with 461,292 quality assessments, organized into more than 10,000 fully annotated conversation trees.
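Because the data is organized as conversation trees rather than flat instruction pairs, a common preprocessing step is to pair each assistant reply with its parent prompt. The sketch below shows one simple way to do this, assuming the `datasets` library and the `OpenAssistant/oasst1` release with its `message_id`, `parent_id`, `role`, and `text` fields.

```python
# Minimal sketch: turning OASST1 messages into (prompt, reply) pairs for fine-tuning.
# Field names are taken from the public OpenAssistant/oasst1 release on the Hub.
from datasets import load_dataset

oasst1 = load_dataset("OpenAssistant/oasst1", split="train")
by_id = {row["message_id"]: row for row in oasst1}

pairs = []
for row in oasst1:
    parent = by_id.get(row["parent_id"])
    # keep assistant replies whose parent message is a user ("prompter") turn
    if parent is not None and row["role"] == "assistant" and parent["role"] == "prompter":
        pairs.append({"prompt": parent["text"], "reply": row["text"]})

print(f"{len(pairs)} prompt/reply pairs extracted")
```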

RedPajama dataset used for pre-training


RedPajama is an open-source dataset for LLM pre-training, created to mirror the training data of Meta's state-of-the-art LLaMA model. It aims to enable an open-source competitor to the most popular LLMs, which are currently either closed-source commercial models or only partially open source.

The bulk of RedPajama consists of CommonCrawl data filtered to English-language sites, while its Wikipedia portion covers articles in 20 different languages.
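Given the dataset's size (on the order of a trillion tokens), streaming a small slice is the practical way to peek at it. The sketch below assumes the `datasets` library and the smaller `togethercomputer/RedPajama-Data-1T-Sample` release on the Hub; the dataset id and field names are assumptions and may differ between versions.

```python
# Minimal sketch: streaming a few RedPajama records instead of downloading the full corpus.
# Dataset id and fields are assumed from the public Together release and may change.
from datasets import load_dataset

sample = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample", split="train", streaming=True
)
for i, record in enumerate(sample):
    print(record["text"][:200])  # raw pre-training text
    if i == 2:
        break
```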


LongForm Dataset

The paper "The LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction" introduces a collection of human-written documents drawn from existing corpora such as C4 and Wikipedia, together with instructions for these documents, thereby creating an instruction-tuning dataset suitable for long text generation.

Paper address: https://arxiv.org/abs/2304.08460
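The core recipe is to take a document from an existing corpus and have an LLM write the instruction that the document would answer. The sketch below illustrates the idea only; the `call_llm` helper and the prompt wording are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch of the LongForm-style "reverse instruction" idea: given a document from an
# existing corpus (e.g. C4 or Wikipedia), ask an LLM to write the instruction it answers.
def call_llm(prompt: str) -> str:
    # placeholder: swap in a real completion call (e.g. an open-source instruction model)
    return "Summarize the history of open-source language models."

def make_longform_example(document: str) -> dict:
    prompt = (
        "Below is a passage. Write a single instruction or question for which "
        "this passage would be a good long-form answer.\n\n"
        f"Passage:\n{document}\n\nInstruction:"
    )
    instruction = call_llm(prompt)
    # the resulting pair can be used directly for instruction tuning
    return {"instruction": instruction.strip(), "output": document}

print(make_longform_example("Open-source language models have evolved rapidly since 2019..."))
```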

Alpaca Libre project

The Alpaca Libre project aims to recreate the Alpaca project by converting 100k MIT-licensed demonstrations from Anthropic's HH-RLHF repository into an Alpaca-compatible format.
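The kind of conversion involved can be sketched as follows: an HH-RLHF style record stores a whole dialogue as a single string with Human:/Assistant: turns, which gets split into Alpaca's instruction/input/output fields. The parsing below is deliberately simplified (single-turn only) and is an illustration, not the project's actual code.

```python
# Minimal sketch of an HH-RLHF -> Alpaca format conversion. Field names follow the public
# HH-RLHF release ("chosen"/"rejected" dialogue strings); only single-turn dialogues handled.
def hh_to_alpaca(record: dict) -> dict:
    dialogue = record["chosen"]
    human_part, _, assistant_part = dialogue.partition("\n\nAssistant:")
    instruction = human_part.replace("\n\nHuman:", "").strip()
    return {
        "instruction": instruction,
        "input": "",
        "output": assistant_part.strip(),
    }

example = {"chosen": "\n\nHuman: Name three open-source LLMs.\n\nAssistant: Pythia, LLaMA, and MPT."}
print(hh_to_alpaca(example))
```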

Extending open-source datasets

Instruction fine-tuning is the key to evolving from a GPT-3-like pretrained base model to a more powerful, ChatGPT-like large language model. Open-source, human-generated instruction datasets such as Databricks-Dolly-15k help achieve this. But how do we scale further? Could it be done without collecting additional data? One approach is to bootstrap an LLM off its own generations. Although the Self-Instruct method was proposed five months ago (and is somewhat dated by today's standards), it is still a very interesting approach. It is worth emphasizing that, thanks to Self-Instruct, it is possible to align a pretrained LLM with instructions while requiring almost no annotation.

How does it work? In short, it can be divided into the following four steps (a minimal code sketch follows the list):

  • Start with a seed task pool containing a set of human-written instructions (175 in this case) and sample instances;
  • Use a pretrained LLM (such as GPT-3) to determine the task category;
  • Given a new instruction, let the pretrained LLM generate a response;
  • Finally, collect, prune, and filter the responses before adding the instructions to the task pool.
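The following is a minimal sketch of one Self-Instruct iteration as described above. The `call_llm` stand-in, the prompt templates, and the duplicate filter are heavily simplified placeholders; the actual method uses carefully designed prompts and a ROUGE-based overlap filter.

```python
# Minimal sketch of one Self-Instruct iteration: sample demonstrations from the task pool,
# have a pretrained LLM propose a new instruction, classify it, generate a response, and
# filter before adding it back to the pool. Prompts and filtering are simplified placeholders.
import random

def call_llm(prompt: str) -> str:
    # placeholder for a pretrained LLM completion call (e.g. a GPT-3-style API)
    return "Translate the following sentence into French."

def self_instruct_step(task_pool: list[dict]) -> dict:
    # 1) sample in-context demonstrations from the seed/task pool
    demos = random.sample(task_pool, k=min(8, len(task_pool)))
    demo_text = "\n".join(f"Task: {d['instruction']}" for d in demos)

    # 2) ask the LLM to propose a new instruction and determine its task type
    new_instruction = call_llm(f"{demo_text}\nTask:").strip()
    task_type = call_llm(
        f"Is the following task a classification task? Yes or No.\n{new_instruction}"
    ).strip()

    # 3) let the pretrained LLM generate a response for the new instruction
    response = call_llm(f"Instruction: {new_instruction}\nResponse:").strip()

    # 4) filter (the real method drops near-duplicates via ROUGE overlap) before adding to the pool
    candidate = {"instruction": new_instruction, "task_type": task_type, "response": response}
    if not any(new_instruction.lower() == t["instruction"].lower() for t in task_pool):
        task_pool.append(candidate)
    return candidate

seed_pool = [{"instruction": "Summarize the following article.", "task_type": "No", "response": ""}]
print(self_instruct_step(seed_pool))
```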


In practice, this works reasonably well based on ROUGE scores. For example, a Self-Instruct fine-tuned LLM performs better than the GPT-3 base LLM and can compete with LLMs pretrained on large human-written instruction sets. At the same time, Self-Instruct can also benefit LLMs that have already been fine-tuned on human instructions.

But of course, the gold standard for evaluating LLMs is human evaluation. Based on human evaluation, Self-Instruct outperforms base LLMs as well as LLMs trained on human instruction datasets in a supervised manner (such as SuperNI and T0 Trainer). Interestingly, however, Self-Instruct does not outperform methods trained with reinforcement learning from human feedback (RLHF).

Human-generated vs. synthetic training datasets

Which is more promising: human-generated instruction datasets or self-instruct datasets? Sebastian sees a future in both. Why not start with a human-generated instruction dataset (e.g., the 15k instructions in databricks-dolly-15k) and then extend it with self-instruct? The paper "Synthetic Data from Diffusion Models Improves ImageNet Classification" shows that combining a real image training set with AI-generated images can improve model performance. It would be interesting to explore whether the same holds for text data.

Paper address: https://arxiv.org/abs/2304.08466

Recent paper "Better Language" Models of Code through Self-Improvement" is research in this direction. The researchers found that code generation tasks can be improved if a pretrained LLM uses its own generated data.

Paper address: https://arxiv.org/abs/2304.01228

Less is more?

Besides pre-training and fine-tuning models on ever larger datasets, how can we improve efficiency with smaller datasets? The paper "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes" proposes a distillation mechanism for training smaller, task-specific models that use less training data yet outperform standard fine-tuning.

Paper address: https://arxiv.org/abs/2305.02301
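The core training objective is easy to sketch: the small model is trained both to predict the label and to reproduce the rationale extracted from a larger LLM, and the two losses are summed. The snippet below is an illustrative approximation using a small T5 model; the task prefixes and the loss weight are assumptions, not the paper's exact setup.

```python
# Illustrative sketch of a "distilling step-by-step"-style objective: a small seq2seq model
# learns to produce both the label and the LLM-extracted rationale, with a weighted sum of
# the two losses. Model choice, task prefixes, and the lambda weight are illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def step_by_step_loss(question: str, label: str, rationale: str, lam: float = 0.5):
    # task prefixes distinguish the two outputs the small model must learn
    label_in = tokenizer("[label] " + question, return_tensors="pt")
    label_tgt = tokenizer(label, return_tensors="pt").input_ids
    rationale_in = tokenizer("[rationale] " + question, return_tensors="pt")
    rationale_tgt = tokenizer(rationale, return_tensors="pt").input_ids

    label_loss = model(**label_in, labels=label_tgt).loss
    rationale_loss = model(**rationale_in, labels=rationale_tgt).loss
    return label_loss + lam * rationale_loss

loss = step_by_step_loss(
    "Is the sky blue during the day?",
    "yes",
    "Sunlight scattered by the atmosphere makes the sky appear blue in daylight.",
)
loss.backward()  # gradients flow through both the label and rationale objectives
```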


Tracking open-source LLMs

The number of open-source LLMs is exploding. On the one hand, this is a very welcome trend (compared with models gated behind paid APIs), but on the other hand, keeping track of them all can be cumbersome. The following four resources provide different summaries of the most relevant models, including their relationships, underlying datasets, and licensing information.

The first resource is the ecosystem-graphs website based on the paper "Ecosystem Graphs: The Social Footprint of Foundation Models", which provides tables and an interactive dependency graph (not shown here).

This ecosystem graph is the most comprehensive list Sebastian has seen to date, though it can be a bit confusing because it includes many less popular LLMs. Judging from the corresponding GitHub repository, it has been updated within the last month or so; it is also unclear whether newer models will be added.


  • Paper address: https://arxiv.org/abs/2303.15772
  • Ecosystem graph website address: https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table

The second resource is the beautifully drawn evolutionary tree from the recent paper "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond", which focuses on the most popular LLMs and their relationships.

Although the visualized LLM evolutionary tree is beautiful and clear, there are a few minor caveats. It is not clear why the tree does not start from the original transformer architecture at the bottom. Also, the open-source labels are not entirely accurate: for example, LLaMA is listed as open source, but its weights are not available under an open-source license (only the inference code is).


Paper address: https://arxiv.org/abs/2304.13712

The third resource is a table drawn by Sebastian's colleague Daniela Dapena, from the blog "The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs...".

Although the table below is smaller than the other resources, it has the advantage of including model sizes and licensing information. It will be very useful if you plan to use any of these models in a project.


Blog address: https://lightning.ai/pages/community/community-discussions/the-ultimate-battle-of-language-models-lit-llama-vs-gpt3.5-vs-bloom-vs/

The fourth resource is the LLaMA-Cult-and-More overview table, which provides additional information on fine-tuning methods and hardware costs.


Overview table address: https://github.com/shm007g/LLaMA-Cult-and-More/blob/main/chart.md

Fine-tuning multi-modal LLMs with LLaMA-Adapter V2

Sebastian predicts that we will see more multi-modal LLMs this month, so it is worth discussing the recently released paper "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model". First, a quick refresher: what is LLaMA-Adapter? It is a parameter-efficient LLM fine-tuning technique that modifies the earlier transformer blocks and introduces a gating mechanism to stabilize training.

Paper address: https://arxiv.org/abs/2304.15010

Using the LLaMA-Adapter method, the researchers were able to fine-tune a 7B-parameter LLaMA model on 52k instruction pairs in just 1 hour (on 8 A100 GPUs). Only the newly added 1.2M parameters (the adapter layers) were fine-tuned, while the 7B LLaMA model itself remained frozen.
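The parameter-efficiency argument is easy to see in code: freeze every base weight and hand only the newly added adapter parameters to the optimizer. The sketch below uses a small stand-in transformer and generic prompt-style adapter tensors rather than the actual LLaMA-Adapter implementation.

```python
# Minimal sketch of the parameter-efficient setup: the pretrained base model is frozen and
# only newly added adapter parameters are optimized. `base_model` and `adapter_params`
# are placeholders, not the actual LLaMA-Adapter code.
import torch
import torch.nn as nn

base_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
)
adapter_params = nn.ParameterList(
    [nn.Parameter(torch.zeros(10, 512)) for _ in range(6)]  # learnable adaption prompts per layer
)

for p in base_model.parameters():
    p.requires_grad = False  # the 7B base weights stay frozen in the real setup

trainable = sum(p.numel() for p in adapter_params)
total = trainable + sum(p.numel() for p in base_model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")

optimizer = torch.optim.AdamW(list(adapter_params), lr=1e-3)  # only adapters are optimized
```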

The focus of LLaMA-Adapter V2 is multi-modality, that is, building a visual instruction model that can receive image input. Although the original V1 could receive text tokens and image tokens, images were not fully explored.

From V1 to V2, the researchers improved the adapter method through the following three main techniques.

  • Early fusion of visual knowledge: instead of fusing visual and adaptation prompts in each adapted layer, the visual tokens are concatenated with the word tokens in the first transformer block;
  • Use more parameters: unfreeze all normalization layers and add bias units and scaling factors to each linear layer in the transformer block (see the sketch after this list);
  • Joint training with disjoint parameters: for captioning data, only the visual projection layers are trained; for instruction-following data, only the adaptation layers (and the newly added parameters mentioned above) are trained.
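The sketch below illustrates the second technique: wrapping each linear layer with a learnable scale and bias and unfreezing all normalization layers while the pretrained weights stay frozen. It is an illustration of the idea, not the reference LLaMA-Adapter V2 code.

```python
# Illustrative sketch: add trainable bias/scale to every linear layer and unfreeze the
# normalization layers, while the pretrained weights remain frozen.
import torch
import torch.nn as nn

class BiasScaleLinear(nn.Module):
    """Wraps a frozen nn.Linear with a learnable per-output scale and bias."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.scale = nn.Parameter(torch.ones(linear.out_features))
        self.bias = nn.Parameter(torch.zeros(linear.out_features))

    def forward(self, x):
        return self.scale * self.linear(x) + self.bias

def add_bias_and_scale(module: nn.Module) -> None:
    # recursively replace every nn.Linear with a wrapped version
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, BiasScaleLinear(child))
        else:
            add_bias_and_scale(child)

def prepare_v2_style(model: nn.Module) -> nn.Module:
    for p in model.parameters():
        p.requires_grad = False        # freeze the pretrained weights
    add_bias_and_scale(model)          # new scale/bias parameters are trainable by default
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True  # unfreeze all normalization layers
    return model

toy = prepare_v2_style(nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8), num_layers=2))
print(sum(p.numel() for p in toy.parameters() if p.requires_grad), "trainable parameters")
```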

LLaMA-Adapter V2 (14M parameters) has many more parameters than LLaMA-Adapter V1 (1.2M), but it is still lightweight, accounting for only 0.02% of the total parameters of the 65B LLaMA model. Particularly impressive is that by fine-tuning only 14M parameters of the 65B LLaMA model, the resulting LLaMA-Adapter V2 performs on par with ChatGPT (when evaluated using GPT-4 as the judge). LLaMA-Adapter V2 also outperforms the 13B Vicuna model, which uses full fine-tuning.

Unfortunately, the LLaMA-Adapter V2 paper omits the computational performance benchmarks included in the V1 paper, but we can assume that V2 is still much faster than full fine-tuning.


Other open-source LLMs

Large models are developing so fast that we cannot list them all. Some of the notable open-source LLMs and chatbots launched this month include Open-Assistant, Baize, StableVicuna, ColossalChat, and Mosaic's MPT. In addition, below are two particularly interesting multimodal LLMs.

OpenFlamingo

OpenFlamingo is an open-source reproduction of the Flamingo model released by Google DeepMind last year. OpenFlamingo aims to provide multi-modal image-reasoning capabilities for LLMs, allowing people to interleave text and image inputs.

MiniGPT-4

MiniGPT-4 is another open-source model with vision-language capabilities. It is based on a frozen visual encoder from BLIP-2 and the frozen Vicuna LLM.


NeMo Guardrails

With the emergence of these large language models, many companies are thinking about whether and how they should deploy them, and safety concerns are particularly prominent. There are no good solutions yet, but there is at least one promising approach: NVIDIA has open-sourced a toolkit to address the LLM hallucination problem.

In a nutshell, the method maintains a database of hard-coded prompts that must be curated manually. When a user enters a prompt, it is first matched against the most similar entry in that database. The database then returns a hard-coded prompt, which is passed to the LLM. So if one carefully tests the hard-coded prompts, one can ensure that interactions do not deviate from allowed topics, and so on.
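Schematically, the flow looks like the sketch below: embed the user input, retrieve the most similar curated entry, and send that entry's hard-coded prompt to the LLM. This is an illustration of the described idea, not the NeMo Guardrails API; the toy `embed` function stands in for a real sentence-embedding model.

```python
# Schematic of the described guardrails flow (not the actual NeMo Guardrails API): match the
# user input against a small, manually curated database and pass the closest hard-coded
# prompt on to the LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    # toy character-count embedding; replace with a real sentence-embedding model
    vec = np.zeros(64)
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    return vec

canonical_prompts = {
    "ask about product features": "You are a support assistant. Answer questions about product features only.",
    "ask for a refund": "You are a support assistant. Explain the refund policy politely.",
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def guarded_prompt(user_input: str) -> str:
    user_vec = embed(user_input)
    best_key = max(canonical_prompts, key=lambda k: cosine(user_vec, embed(k)))
    return canonical_prompts[best_key]  # the hard-coded prompt that is passed to the LLM

print(guarded_prompt("Hi, can I get my money back?"))
```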


This is an interesting but not groundbreaking approach, as it does not add anything new or better to LLM capabilities; it simply limits the extent to which a user can interact with the LLM. Still, until researchers find alternative ways to mitigate hallucinations and undesirable behaviors in LLMs, this may be a viable approach.

The guardrails approach can also be combined with other alignment techniques, such as the popular reinforcement learning with human feedback (RLHF) training paradigm that the author covered in a previous issue of Ahead of AI.

Consistency Model

To close with an interesting model outside the LLM space: OpenAI has finally open-sourced the code for its consistency models: https://github.com/openai/consistency_models.

Consistency models are considered a viable and efficient alternative to diffusion models. You can find more information in the consistency models paper.
