Artificial Intelligence is everywhere these days, and language models are a big part of that. When ChatGPT was introduced, you might have wondered how the AI could predict the next word in a sentence or even write entire paragraphs. In this tutorial, we’ll build a super simple language model without relying on fancy frameworks like TensorFlow or PyTorch—just plain Python and NumPy.
Before I begin the tutorial, let me explain what is a large language model (LLM).
In this tutorial, we’re creating a much simpler version of this—a bigram model—to
Sounds cool? Let’s get started!?
We'll be creating a bigram model, which will give you a basic idea of how language models work. It predicts the next word in a sentence based on the current word. We’ll keep it straightforward and easy to follow so you’ll learn how things work without getting buried in too much detail.??
Before we begin, let's make sure you’ve got Python and NumPy ready to go. If you don’t have NumPy installed, quickly install it with:
pip install numpy
A language model predicts the next word in a sentence. We’ll keep things simple and build a bigram model. This just means that our model will predict the next word using only the current word.
We’ll start with a short text to train the model. Here’s a small sample we’ll use:
import numpy as np # Sample dataset: A small text corpus corpus = """Artificial Intelligence is the new electricity. Machine learning is the future of AI. AI is transforming industries and shaping the future."""
First things first, we need to break this text into individual words and create a vocabulary (basically a list of all unique words). This gives us something to work with.
# Tokenize the corpus into words words = corpus.lower().split() # Create a vocabulary of unique words vocab = list(set(words)) vocab_size = len(vocab) print(f"Vocabulary: {vocab}") print(f"Vocabulary size: {vocab_size}")
Here, we’re converting the text to lowercase and splitting it into words. After that, we create a list of unique words to serve as our vocabulary.
Computers work with numbers, not words. So, we’ll map each word to an index and create a reverse mapping too (this will help when we convert them back to words later).
word_to_idx = {word: idx for idx, word in enumerate(vocab)} idx_to_word = {idx: word for word, idx in word_to_idx.items()} # Convert the words in the corpus to indices corpus_indices = [word_to_idx[word] for word in words]
Basically, we’re just turning words into numbers that our model can understand. Each word gets its own number, like “AI” might become 0, and “learning” might become 1, depending on the order.
Now, let’s get to the heart of it: building the bigram model. We want to figure out the probability of one word following another. To do that, we’ll count how often each word pair (bigram) shows up in our dataset.
pip install numpy
Here’s what’s happening:
We’re counting how often each word follows another (that's the bigram).
Then, we turn those counts into probabilities by normalizing them.
In simple terms, this means that if "AI" is often followed by "is," the probability for that pair will be higher.
Let’s now test our model by making it predict the next word based on any given word. We do this by sampling from the probability distribution of the next word.
import numpy as np # Sample dataset: A small text corpus corpus = """Artificial Intelligence is the new electricity. Machine learning is the future of AI. AI is transforming industries and shaping the future."""
This function takes a word, looks up its probabilities, and randomly selects the next word based on those probabilities. If you pass in "AI," the model might predict something like "is" as the next word.
Finally, let's generate a whole sentence! We’ll start with a word and keep predicting the next word a few times.
# Tokenize the corpus into words words = corpus.lower().split() # Create a vocabulary of unique words vocab = list(set(words)) vocab_size = len(vocab) print(f"Vocabulary: {vocab}") print(f"Vocabulary size: {vocab_size}")
This function takes an initial word and predicts the next one, then uses that word to predict the following one, and so on. Before you know it, you’ve got a full sentence!
There you have it—a simple bigram language model built from scratch using just Python and NumPy. We didn’t use any fancy libraries, and you now have a basic understanding of how AI can predict text. You can play around with this code, feed it different text, or even expand it by using more advanced models.
Give it a try, and let me know how it goes. Happy coding!
The above is the detailed content of Build Your Own Language Model: A Simple Guide with Python and NumPy. For more information, please follow other related articles on the PHP Chinese website!