How to implement the Naive Bayes algorithm for machine learning in PHP

This article introduces the Naive Bayes algorithm for machine learning in PHP. It explains the concepts and principles behind the algorithm and walks through a PHP implementation by example.

Machine learning has become ubiquitous in our lives, from the thermostat that runs when you're at home to smart cars and the smartphones in our pockets. It seems to be everywhere and is an area well worth exploring. But what is machine learning? Generally speaking, it is about letting a system learn continuously so that it can make predictions about new problems, from simple predictions of shopping items to the complex predictions made by digital assistants.

In this article I will introduce the Naive Bayes algorithm by implementing it as a Classifier class. It is a simple algorithm that is easy to implement and gives satisfactory results, although understanding it requires a little knowledge of statistics. In the last part of the article you can see some example code and even try some machine learning of your own.

Getting Started

So, what is this Classifier used for? Its main job is to determine whether a given statement is positive or negative. For example, "Symfony is the best" is a positive statement, and "No Symfony is bad" is a negative statement. Given a new statement, I want the Classifier to return its type without my having to write a new rule for it.

I created a class named Classifier that contains a guess method. This method accepts a statement as input and returns whether the statement is positive or negative. The class looks like this:

class Classifier
{
    public function guess($statement)
    {
    }
}

I prefer to use an enum-style class instead of plain strings for my return values. I named this class Type, and it contains two constants: POSITIVE and NEGATIVE. These two constants are used as the return values of the guess method.

class Type
{
    const POSITIVE = 'positive';
    const NEGATIVE = 'negative';
}

With the setup complete, the next step is to write the algorithm that makes the predictions.

Naive Bayes

The Naive Bayes algorithm makes its predictions from a training set, using simple statistics and a bit of mathematics to calculate the results. For example, suppose the training set consists of the following four statements:

Statement	Type
Symfony is the best	Positive
PhpStorm is great	Positive
Iltar complains a lot	Negative
No Symfony is bad	Negative

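If the given statement is "Symfony is the best", you can say that it is positive. You normally make decisions like this based on what you have learned before, and the Naive Bayes algorithm works the same way: it uses its training set to decide which type a statement most closely resembles.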

Learning

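Before the algorithm can do any real work, it needs a large amount of historical information as its training set. It needs to know two things: how many times each word occurs for each type, and how many statements belong to each type. In the implementation we store this information in two arrays. One array holds the word counts per type, and the other holds the statement counts per type. All other information can be derived from these two arrays. The code looks like this: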

function learn($statement, $type)
{
    $words = $this->getWords($statement);
    foreach ($words as $word) {
        if (!isset($this->words[$type][$word])) {
            $this->words[$type][$word] = 0;
        }
        $this->words[$type][$word]++; // increment the word count for the type
    }
    $this->documents[$type]++; // increment the document count for the type
}

With this data collected, the algorithm can now make predictions based on what it has learned.
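
As a rough illustration of what the two arrays hold after training, the sketch below applies the same counting logic to the four example statements. The standalone `learnCounts` function and the shape of its return value are assumptions made for this example; they are not part of the original class.

```php
<?php
// Hypothetical standalone version of learn(): counts words and documents per type.
function learnCounts(array $samples): array
{
    $words = [];
    $documents = [];
    foreach ($samples as [$statement, $type]) {
        // Same tokenization as getWords(): lowercase, strip punctuation, split on whitespace.
        $clean = preg_replace('/[^A-Za-z0-9\s]/', '', strtolower($statement));
        foreach (preg_split('/\s+/', $clean) as $word) {
            $words[$type][$word] = ($words[$type][$word] ?? 0) + 1;
        }
        $documents[$type] = ($documents[$type] ?? 0) + 1;
    }
    return ['words' => $words, 'documents' => $documents];
}

$state = learnCounts([
    ['Symfony is the best', 'positive'],
    ['PhpStorm is great', 'positive'],
    ['Iltar complains a lot', 'negative'],
    ['No Symfony is bad', 'negative'],
]);

print_r($state['documents']);         // positive => 2, negative => 2
print_r($state['words']['positive']); // symfony => 1, is => 2, the => 1, best => 1, phpstorm => 1, great => 1
```

The positive type ends up with 7 counted words and the negative type with 8, which are the totals the probability calculations rely on.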

Definitions

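A few definitions are needed to explain how the algorithm works. First, let's define the probability that an input statement belongs to a given type. This is written P(Type). It is computed by dividing the number of training documents of that type by the total number of documents in the training set, where a document is a single statement from the training set. We will name this method pTotal:

function pTotal($type)
{
    return ($this->documents[$type] + 1) / (array_sum($this->documents) + 1);
}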

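Note that 1 is added to both the numerator and the denominator. This avoids the case where both are 0, for example before anything has been learned.

With the example training set above, both the positive and the negative type get a probability of 0.6. Each type has 2 documents out of 4 in total, so the result is (2+1)/(4+1).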
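
Written as a formula, the calculation looks as follows. The symbols N_Type (number of training statements of a type) and N (total number of training statements) are notation introduced here for illustration.

```latex
\begin{aligned}
P(\text{Type}) &= \frac{N_{\text{Type}} + 1}{N + 1} \\
P(\text{Positive}) = P(\text{Negative}) &= \frac{2 + 1}{4 + 1} = 0.6
\end{aligned}
```

With this form of smoothing the two values add up to 1.2 rather than 1. That is harmless here, because the classifier only compares the scores of the types against each other.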

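The second definition is the probability that a given word belongs to a particular type. We write this as P(word, Type). First we count how many times the word occurs in the training data for the given type, then we divide that count by the total number of words in the training data for that type. We define this method as p: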

function p($word, $type)
{
    $count = isset($this->words[$type][$word]) ? $this->words[$type][$word] : 0;
    return ($count + 1) / (array_sum($this->words[$type]) + 1);
}

In this training set, the probability of "is" for the positive type is 0.375. The word accounts for 2 of the 7 words in the positive data, so the result is (2+1)/(7+1).
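
The same calculation can be written as a formula. Here the conditional bar stands for what the article writes as P(word, Type), and count(word, Type) is shorthand introduced for this sketch. The second example uses "bad", which never occurs in the positive statements, to show the effect of adding 1.

```latex
\begin{aligned}
P(\text{word} \mid \text{Type}) &= \frac{\operatorname{count}(\text{word}, \text{Type}) + 1}{\sum_{w} \operatorname{count}(w, \text{Type}) + 1} \\
P(\text{is} \mid \text{Positive}) &= \frac{2 + 1}{7 + 1} = 0.375 \\
P(\text{bad} \mid \text{Positive}) &= \frac{0 + 1}{7 + 1} = 0.125
\end{aligned}
```

Without the added 1, an unseen word would get a probability of 0 and wipe out the whole product for that type.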

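Finally, the algorithm should only look at the words themselves and ignore everything else. A simple approach is to split the given string into its individual words: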

function getWords($string)
{
    return preg_split('/\s+/', preg_replace('/[^A-Za-z0-9\s]/', '', strtolower($string)));
}
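
To see what this tokenizer produces, here is a small sketch that copies the method body into a standalone function. The input strings are made up for illustration, and the use of `PREG_SPLIT_NO_EMPTY` in the second part is a suggested variation, not something the original code does.

```php
<?php
// Same body as the getWords() method, as a standalone function for illustration.
function getWords(string $string): array
{
    return preg_split('/\s+/', preg_replace('/[^A-Za-z0-9\s]/', '', strtolower($string)));
}

print_r(getWords('Symfony is the BEST!'));
// Array ( [0] => symfony [1] => is [2] => the [3] => best )

// Leading or trailing whitespace produces empty-string tokens with the original pattern.
// Passing the PREG_SPLIT_NO_EMPTY flag is one way to drop them.
$tokens = preg_split('/\s+/', ' hello  world ', -1, PREG_SPLIT_NO_EMPTY);
print_r($tokens);
// Array ( [0] => hello [1] => world )
```

Because the tokenizer does no stemming, "complain" and "complains" are counted as two different words.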

With the groundwork done, let's get to the real implementation!

Prediction

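To predict the type of a statement, the algorithm should calculate the probability of each of the two types for the given statement. As above, we define this as P(Type, sentence). The type with the higher probability is the result that the Classifier returns.

To calculate P(Type, sentence), the algorithm uses Bayes' theorem, which is defined as: P(Type, sentence) = P(Type) * P(sentence, Type) / P(sentence). In other words, the probability of a type given the statement equals the probability of the type multiplied by the probability of the statement given the type, divided by the probability of the statement.

Because the algorithm calculates P(Type, sentence) for the same statement each time, P(sentence) stays the same for every type. This means the algorithm can drop that factor, since we only care which probability is highest, not about the actual values. The calculation becomes: P(Type, sentence) = P(Type) * P(sentence, Type).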

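Finally, to calculate P(sentence, Type), we can apply the chain rule across the words in the statement, treating each word as independent of the others (this is the "naive" assumption that gives the algorithm its name). So for a statement with n words, it equals P(word_1, Type) * P(word_2, Type) * P(word_3, Type) * ... * P(word_n, Type). The probability of each word is calculated using the definition we saw earlier.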
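
Put together, the steps of this section can be summarized in one line. The conditional bar is standard notation for what the article writes with a comma, so P(Type | sentence) corresponds to P(Type, sentence).

```latex
P(\text{Type} \mid \text{sentence})
  = \frac{P(\text{Type})\, P(\text{sentence} \mid \text{Type})}{P(\text{sentence})}
  \propto P(\text{Type}) \prod_{i=1}^{n} P(\text{word}_i \mid \text{Type})
```

The proportionality sign reflects the dropped P(sentence) factor, and the product reflects the assumption that the words are independent of each other given the type.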

Now that everything has been explained, it's time to put it into practice in PHP:

function guess($statement)
{
    $words = $this->getWords($statement); // get the words
    $best_likelihood = 0;
    $best_type = null;
    foreach ($this->types as $type) {
        $likelihood = $this->pTotal($type); // calculate P(Type)
        foreach ($words as $word) {
            $likelihood *= $this->p($word, $type); // calculate P(word, Type)
        }
        if ($likelihood > $best_likelihood) {
            $best_likelihood = $likelihood;
            $best_type = $type;
        }
    }
    return $best_type;
}
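
One practical caveat: multiplying many probabilities below 1 can underflow to 0 for very long statements, in which case no likelihood exceeds the initial 0 and guess returns null. A common workaround is to add logarithms instead of multiplying. The sketch below shows that variation. It is not the author's method, and the `guessWithLogs` function and its two array parameters are hypothetical names chosen for this example.

```php
<?php
// Hypothetical helper: picks the type with the highest log-score.
// $priors maps type => P(Type); $wordProbs maps type => list of P(word, Type) values.
function guessWithLogs(array $priors, array $wordProbs): ?string
{
    $bestScore = -INF;
    $bestType = null;
    foreach ($priors as $type => $prior) {
        // log(a * b * c) = log(a) + log(b) + log(c), so the product becomes a sum.
        $score = log($prior);
        foreach ($wordProbs[$type] as $p) {
            $score += log($p);
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestType = $type;
        }
    }
    return $bestType;
}

// Smoothed probabilities for "symfony", "is", "great" from the four-statement training set:
// positive has 7 counted words (denominator 8), negative has 8 (denominator 9).
$best = guessWithLogs(
    ['positive' => 0.6, 'negative' => 0.6],
    [
        'positive' => [2 / 8, 3 / 8, 2 / 8],
        'negative' => [2 / 9, 2 / 9, 1 / 9],
    ]
);
var_dump($best); // string(8) "positive"
```

Since the logarithm is monotonic, the ranking of the types is the same as with the plain product whenever the product does not underflow.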

That's all there is to it; the algorithm can now predict the type of a statement. All you have to do is train it:

$classifier = new Classifier();
$classifier->learn('Symfony is the best', Type::POSITIVE);
$classifier->learn('PhpStorm is great', Type::POSITIVE);
$classifier->learn('Iltar complains a lot', Type::NEGATIVE);
$classifier->learn('No Symfony is bad', Type::NEGATIVE);
var_dump($classifier->guess('Symfony is great')); // string(8) "positive"
var_dump($classifier->guess('I complain a lot')); // string(8) "negative"
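
To check these two results by hand, the scores can be worked out from the training counts: 7 counted words for the positive type and 8 for the negative type, with a prior of 0.6 each. The function name score is shorthand introduced here for the unnormalized product that guess compares.

```latex
\begin{aligned}
\text{"Symfony is great":}\quad
\operatorname{score}(\text{Positive}) &= 0.6 \cdot \tfrac{2}{8} \cdot \tfrac{3}{8} \cdot \tfrac{2}{8} \approx 0.0141 \\
\operatorname{score}(\text{Negative}) &= 0.6 \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} \cdot \tfrac{1}{9} \approx 0.0033 \\[6pt]
\text{"I complain a lot":}\quad
\operatorname{score}(\text{Positive}) &= 0.6 \cdot \tfrac{1}{8} \cdot \tfrac{1}{8} \cdot \tfrac{1}{8} \cdot \tfrac{1}{8} \approx 0.00015 \\
\operatorname{score}(\text{Negative}) &= 0.6 \cdot \tfrac{1}{9} \cdot \tfrac{1}{9} \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} \approx 0.00037
\end{aligned}
```

In the second statement, "i" and "complain" are unseen for both types (the training data contains "complains", not "complain"), so the decision rests on "a" and "lot", which occur only in the negative statements.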

I have uploaded all of the code to GitHub: https://github.com/yannickl88/blog-articles/blob/master/src/machine-learning-naive-bayes/Classifier.php

The complete PHP code on GitHub is as follows:

<?php
class Type
{
    const POSITIVE = 'positive';
    const NEGATIVE = 'negative';
}
class Classifier
{
    private $types = [Type::POSITIVE, Type::NEGATIVE];
    private $words = [Type::POSITIVE => [], Type::NEGATIVE => []];
    private $documents = [Type::POSITIVE => 0, Type::NEGATIVE => 0];
    public function guess($statement)
    {
        $words = $this->getWords($statement); // get the words
        $best_likelihood = 0;
        $best_type = null;
        foreach ($this->types as $type) {
            $likelihood = $this->pTotal($type); // calculate P(Type)
            foreach ($words as $word) {
                $likelihood *= $this->p($word, $type); // calculate P(word, Type)
            }
            if ($likelihood > $best_likelihood) {
                $best_likelihood = $likelihood;
                $best_type = $type;
            }
        }
        return $best_type;
    }
    public function learn($statement, $type)
    {
        $words = $this->getWords($statement);
        foreach ($words as $word) {
            if (!isset($this->words[$type][$word])) {
                $this->words[$type][$word] = 0;
            }
            $this->words[$type][$word]++; // increment the word count for the type
        }
        $this->documents[$type]++; // increment the document count for the type
    }
    public function p($word, $type)
    {
        $count = 0;
        if (isset($this->words[$type][$word])) {
            $count = $this->words[$type][$word];
        }
        return ($count + 1) / (array_sum($this->words[$type]) + 1);
    }
    public function pTotal($type)
    {
        return ($this->documents[$type] + 1) / (array_sum($this->documents) + 1);
    }
    public function getWords($string)
    {
        return preg_split('/\s+/', preg_replace('/[^A-Za-z0-9\s]/', '', strtolower($string)));
    }
}
$classifier = new Classifier();
$classifier->learn('Symfony is the best', Type::POSITIVE);
$classifier->learn('PhpStorm is great', Type::POSITIVE);
$classifier->learn('Iltar complains a lot', Type::NEGATIVE);
$classifier->learn('No Symfony is bad', Type::NEGATIVE);
var_dump($classifier->guess('Symfony is great')); // string(8) "positive"
var_dump($classifier->guess('I complain a lot')); // string(8) "negative"
