
Full of useful information! The first text version of AI master Karpathy's two-hour AI course, and a new workflow that automatically converts videos into articles

Feb 26, 2024, 11:00 AM

Some time ago, the AI course launched by AI master Karpathy reached 150,000 views across the entire network.

At that time, some netizens said that the value of this 2-hour course was equivalent to 4 years of college.


Just in the past few days, Karpathy had a new idea:

convert the 2-hour-13-minute "Building a GPT Tokenizer from Scratch" video into a book chapter or blog post focusing on the topic of "tokenization."

The specific steps are as follows:

- Add subtitles or narration text to the video.

- Cut the video into paragraphs with matching images and text.

- Use large language model prompt engineering techniques to translate the content paragraph by paragraph.

- Output the results as a web page with links to the corresponding parts of the original video.

More broadly, such a workflow can be applied to any video input, automatically generating "companion guides" for various tutorials in a format that is easier to read, browse, and search.

This sounds feasible, but also quite challenging.


He wrote an example under the GitHub project minbpe to illustrate what he envisions.


Address: https://github.com/karpathy/minbpe/blob/master/lecture.md

Karpathy said this was a task he completed manually: he watched the video and translated it into an article in markdown format.

"I only watched about 4 minutes of the video (i.e. 3% done), and this already took about 30 minutes to write, so it would be great if something like this could be done automatically Very good".


Next, it’s class time!

Text version of the "LLM tokenization" course

Hello everyone, today we will discuss the issue of "tokenization" in LLMs.

Unfortunately, tokenization is a relatively complex and tricky component of the most advanced large models, but it is necessary to understand it in detail.

Many of the flaws of LLMs get attributed to the neural network or other seemingly mysterious factors, when these flaws can actually be traced back to tokenization.

Character-level tokenization

So, what is tokenization?

In fact, in the previous video "Let's build GPT from scratch", I already introduced tokenization, but that was only a very simple character-level version.

If you go to Google Colab and check out that video, you'll see that we start with the training data (Shakespeare), which is just a big string in Python:

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

But how do we feed strings into an LLM?

We can see that we first need to build a vocabulary for all possible characters in the entire training set:

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# 65

Then based on the vocabulary above, create a lookup table for converting between single characters and integers. This lookup table is just a Python dictionary:

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hii there"))
print(decode(encode("hii there")))
# [46, 47, 47, 1, 58, 46, 43, 56, 43]
# hii there

Once we convert a string into a sequence of integers, each integer is used as an index into a 2D embedding table of trainable parameters.

Because our vocabulary size is vocab_size=65, this embedding table will also have 65 rows:

import torch.nn as nn

n_embd = 32  # embedding dimension; defined elsewhere in the lecture notebook

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # one learnable row per token in the vocabulary
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx) # (B,T,C)

Here, the integer "plucks out" a row from the embedding table, and this row is the vector that represents the token. The vector is then fed into the Transformer as the input for the corresponding time step.
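To make the lookup concrete, here is a minimal sketch (the integer sequence is the "hii there" encoding from above; the embedding dimension of 32 is an arbitrary choice for illustration, not a value from the lecture):

import torch
import torch.nn as nn

torch.manual_seed(42)
emb = nn.Embedding(65, 32)  # vocab_size=65 rows, one 32-dim vector per character

idx = torch.tensor([46, 47, 47, 1, 58, 46, 43, 56, 43])  # encode("hii there")
tok_emb = emb(idx)  # each integer plucks out one row of the table
print(tok_emb.shape)                        # torch.Size([9, 32]): one vector per time step
print(torch.equal(tok_emb[1], tok_emb[2]))  # True: positions 1 and 2 are both token 47 ("i")

Each of these nine vectors then enters the Transformer at its respective time step.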

Using the BPE algorithm for "character chunk" tokenization

For the naive setting of a character-level language model, this is all fine.

But in practice, state-of-the-art language models use far more complicated schemes to build these token vocabularies.

Specifically, these schemes do not work at the character level but at the "character chunk" level. These chunk vocabularies are built using algorithms such as Byte Pair Encoding (BPE), which we describe in detail below; a minimal sketch of the core idea follows.
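Since this text version stops before reaching the implementation, here is a minimal sketch of the core BPE training loop in the spirit of the minbpe project (the toy string, merge count, and variable names are my own illustration, not taken from the lecture): repeatedly count adjacent token pairs, replace the most frequent pair with a newly minted token, and repeat.

from collections import Counter

def get_pair_counts(ids):
    # count occurrences of every adjacent pair of tokens
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` in `ids` with the single token `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes (integers 0..255)
merges = {}
for step in range(3):
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)  # most frequent adjacent pair
    new_id = 256 + step                 # mint a new token id beyond the byte range
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
    print(f"merge {pair} -> {new_id}, sequence is now {ids}")

Decoding simply plays the merges back in reverse; the GPT-2 tokenizer is this same idea applied at the byte level, with a regex-based splitting pattern layered on top.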

Let's briefly review the history of this method. The paper that popularized the use of the byte-level BPE algorithm for language model tokenization is the GPT-2 paper, Language Models are Unsupervised Multitask Learners, published by OpenAI in 2019.


Paper address: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Scroll down to Section 2.2, “Input Representation”, where they describe and motivate this algorithm. At the end of this section, you'll see them say:

The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.

Recall that in the Transformer's attention layer, each token attends to a finite list of tokens that precede it in the sequence.

The paper points out that the context length of the GPT-2 model increased from 512 tokens in GPT-1 to 1024 tokens.

In other words, the token is the basic "atom" of the LLM's input.


"Tokenization" is the process of converting the original string in Python into a token list, and vice versa.

Another popular example demonstrates the universality of this abstraction: if you search for "token" in the Llama 2 paper, you will get 63 matches.

For example, the paper claims that they trained on 2 trillion tokens, etc.


Paper address: https://arxiv.org/pdf/2307.09288.pdf

A brief discussion on the complexity of tokenization

Before we delve into the implementation details, let us briefly explain why it is necessary to understand the "tokenization" process in detail.

Tokenization is at the heart of many, many weird problems in LLMs, and I suggest you don't ignore it.

Many problems that appear to be issues with the neural network architecture are actually related to tokenization. Here are just a few examples:

- Why can't LLMs spell words? Tokenization.

- Why can't LLMs perform super simple string processing tasks, such as reversing a string? Tokenization.

- Why are LLMs worse at non-English languages (such as Japanese)? Tokenization.

- Why are LLMs bad at simple arithmetic? Tokenization.

- Why did GPT-2 run into more problems than necessary when coding in Python? Tokenization.

- Why does my LLM suddenly stop when it sees the string "<|endoftext|>"? Tokenization.

- What is this strange warning I received about "trailing whitespace"? Tokenization.

- Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

- Why should I use YAML with LLMs instead of JSON? Tokenization.

- Why is an LLM not true end-to-end language modeling? Tokenization.


We will return to these questions at the end of the video.

Visual preview of tokenization

Next, let us load this tokenization web app.


Address: https://tiktokenizer.vercel.app/

The advantage of this web app is that tokenization runs live in your web browser, allowing you to easily enter a text string on the input side and see the tokenization result on the right.

At the top, you can see that we are currently using the gpt2 tokenizer, and that the string pasted in this example is currently being tokenized into 300 tokens.

Here, they are shown clearly with colors:


For example, the string "Tokenization" is encoded into token 30642, followed by token 1634.

The token " is" (note that this is three characters, including the preceding space; that is very important!) is 318.

Pay attention to the space, because it is absolutely present in the string and must be tokenized along with all the other characters. However, it is usually omitted in the visualization for clarity.

You can turn this visualization on and off at the bottom of the app. Likewise, the token " at" is 379, " the" is 262, and so on.

Next, we have a simple arithmetic example.

Here we see that the tokenizer can be inconsistent in how it decomposes numbers. For example, the number 127 is a single 3-character token, but the number 677 becomes 2 tokens: " 6" (again, note the preceding space) and "77".

We rely on the LLM to make sense of this arbitrariness. It must learn, within its parameters and over the course of training, that these two tokens (" 6" and "77") actually combine to form the number 677.

Similarly, if the LLM wants to predict that the result of this sum is the number 804, it must output it over two time steps: first it must emit the token " 8", and then the token "04".

Note that all of these splits look completely arbitrary. In the example below, we can see that 1275 becomes "12" then "75", 6773 is actually three tokens " 6", "77" and "3", and 8041 is " 8" then "041".
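If you want to check these splits yourself, the sketch below runs the same gpt2 encoding through tiktoken (my own illustration; note that the leading space in each string matters and changes the split):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
for s in [" 127", " 677", " 1275", " 6773", " 8041"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(s), ids, pieces)  # e.g. " 677" should come out as " 6" and "77"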

(To be continued...)

(TODO: continue the text version of the content, once we figure out how to generate it automatically from the video)


Netizens chime in with suggestions

One netizen said: "Great. Actually, I prefer reading these posts to watching videos; it's easier to pace myself."


Some netizens also gave Karpathy advice:

"Feels tricky, but it might be possible using LangChain. I was wondering if I could use whisper transcription to produce a high-level outline with clear chapters, and then process those chapter chunks in parallel, in the context of the overall outline , focus on the specific content of the respective chapter blocks (also generate illustrations for each parallel-processed chapter). Then all generated reference marks are compiled to the end of the article through LLM."


Someone has written a pipeline for this, and it will be open source soon.

