This article was peer-reviewed by Wern Ancheta. Thanks to all of SitePoint's peer reviewers for making SitePoint content the best it can be!
Recently, it seems that everyone is talking about machine learning. Your social media stream is filled with posts about ML, Python, TensorFlow, Spark, Scala, Go, and more; if you're like me, you might be wondering, what about PHP?
Yes, what about machine learning and PHP? Fortunately, someone was crazy enough not only to ask this question, but also to develop a general-purpose machine learning library that we can use in our next project. In this post, we will take a look at PHP-ML, a machine learning library for PHP, and write a sentiment analysis class that we can later reuse for our own chatbot or Twitter bot. The main goals of this article are:
Machine learning is a subset of artificial intelligence that focuses on giving "computers the ability to learn without being explicitly programmed." This is achieved by using generic algorithms that can "learn" from a specific data set.
A common use of machine learning, for example, is classification. Classification algorithms are used to divide data into different groups or categories. Some examples of classification applications include:
Machine learning is an umbrella term for generic algorithms covering many different tasks. It is mainly divided into two types according to how the algorithms learn: supervised learning and unsupervised learning.
In supervised learning, we train our algorithm on labeled data in the form of an input object (typically a vector) paired with a desired output value; the algorithm analyzes the training data and produces what is called an inferred function, which we can then apply to a new, unlabeled dataset.
For the rest of this post, we will focus on supervised learning because its relationships are easier to see and verify. Keep in mind that both kinds of learning are equally important and interesting; some would argue that unsupervised learning is more useful because it removes the need for labeled data.
Unsupervised learning, on the other hand, works with unlabeled data from the start. We don't know the desired output values for the dataset, so we let the algorithm draw inferences from it; unsupervised learning is especially handy for exploratory data analysis, where the goal is to find hidden patterns in the data.
Meet PHP-ML, a library that describes itself as a fresh approach to machine learning in PHP. The library implements algorithms, neural networks, and tools for data preprocessing, cross-validation, and feature extraction.
I'll be the first to admit that PHP is an unusual choice for machine learning, as the language's strengths are not well suited to machine learning applications. That said, not every machine learning application needs to process petabytes of data and perform massive computations; for simple applications, we should be able to get away with PHP and PHP-ML.
The best use case for this library I can see now is the implementation of classifiers, whether it is spam filters or sentiment analysis. We will define a classification problem and build a solution step by step to understand how to use PHP-ML in our project.
To illustrate the process of implementing PHP-ML and adding some machine learning to our applications, I wanted to find an interesting problem to tackle, and what better way to showcase a classifier than building a tweet sentiment analysis class?
One of the key requirements for a successful machine learning project is a good starting dataset. Datasets are crucial because they allow us to train our classifier against already-classified examples. With the recent noise in the media around airlines, what better dataset to use than customer tweets to airlines?
Luckily, thanks to Kaggle, a suitable tweet dataset is already available. You can download the Twitter US Airline Sentiment dataset from their website.
Let's first look at the dataset we're going to work on. The original dataset contains 15 columns and looks like the following example (a wide table; empty cells are left blank):
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
| 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials to the experience… tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today… Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | “@VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse” | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |
| 570300767074181121 | negative | 1.0 | Can't Tell | 0.6842 | Virgin America | | jnardino | | 0 | “@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA” | | 2015-02-24 11:14:33 -0800 | | Pacific Time (US & Canada) |
| 570300616901320704 | positive | 0.6745 | | 0.0 | Virgin America | | cjmcginnis | | 0 | “@VirginAmerica yes nearly every time I fly VX this “ear worm” won't go away :)” | | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 570300248553349120 | neutral | 0.634 | | | Virgin America | | pilot | | 0 | “@VirginAmerica Really missed a prime opportunity for Men Without Hats parody there. https://www.php.cn/link/76379ed89eafe43c8f6bd64fd09e3852” | | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |
This file contains 14,640 tweets, so it is a good working dataset for us. With this many columns, we have more data than the example needs; for practical purposes we only care about the following columns: text and airline_sentiment,
where text will be our feature and airline_sentiment will be our target. The remaining columns can be discarded, as they will not be used in our exercise. Let's start by creating the project and initializing Composer with the following composer.json file:
<code>{
    "name": "amacgregor/phpml-exercise",
    "description": "Example implementation of a Tweet sentiment analysis with PHP-ML",
    "type": "project",
    "require": {
        "php-ai/php-ml": "^0.4.1"
    },
    "license": "Apache License 2.0",
    "authors": [
        {
            "name": "Allan MacGregor",
            "email": "amacgregor@allanmacgregor.com"
        }
    ],
    "autoload": {
        "psr-4": {"PhpmlExercise\\": "src/"}
    },
    "minimum-stability": "dev"
}</code>
<code>composer install </code>
If you need a Composer introduction, see here.
To make sure we set everything up correctly, let's create a quick script that loads our Tweets.csv data file and confirms it has the data we need. Save the following code as reviewDataset.php in the project root directory:
<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Dataset\CsvDataset;

// Load the raw CSV, treating only the first column as a feature.
$dataset = new CsvDataset('datasets/raw/Tweets.csv', 1);

foreach ($dataset->getSamples() as $sample) {
    print_r($sample);
}
Now, run the script with php reviewDataset.php and let's review the output:
<code>Array( [0] => 569587371693355008 ) Array( [0] => 569587242672398336 ) Array( [0] => 569587188687634433 ) Array( [0] => 569587140490866689 ) </code>
This looks useless now, doesn't it? Let's take a look at the CsvDataset class to better understand what's happening inside:
<?php
public function __construct(string $filepath, int $features, bool $headingRow = true)
{
    if (!file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, 'rb')) {
        throw FileException::cantOpenFile(basename($filepath));
    }

    if ($headingRow) {
        $data = fgetcsv($handle, 1000, ',');
        $this->columnNames = array_slice($data, 0, $features);
    } else {
        $this->columnNames = range(0, $features - 1);
    }

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $this->samples[] = array_slice($data, 0, $features);
        $this->targets[] = $data[$features];
    }

    fclose($handle);
}
The CsvDataset constructor takes three parameters: the path to the source CSV file ($filepath), the number of feature columns to read from the start of each row ($features), and a boolean indicating whether the first row is a heading row ($headingRow, true by default).
If we look closely, we can see that the class maps the CSV file to two internal arrays: samples and targets. Samples holds all the features provided by the file, while targets holds the known values (negative, positive, or neutral).
Based on the above content, we can see that the format that our CSV file needs to follow is as follows:
<code>| feature_1 | feature_2 | feature_n | target | </code>
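To make that layout concrete, here is a small plain-PHP sketch of the same slicing idea. It does not use PHP-ML; the helper name `splitRow` and the two sample rows are made up for illustration.

```php
<?php
// Hypothetical helper: split one CSV row into features and target,
// mirroring the array_slice() logic in the constructor shown earlier.
function splitRow(array $row, int $features): array
{
    return [
        'sample' => array_slice($row, 0, $features), // first N columns
        'target' => $row[$features],                 // the column right after them
    ];
}

// Made-up rows in the "feature, target" layout.
$rows = [
    ['Great flight, thanks!', 'positive'],
    ['Two hours late again', 'negative'],
];

$samples = [];
$targets = [];
foreach ($rows as $row) {
    $split = splitRow($row, 1);
    $samples[] = $split['sample'];
    $targets[] = $split['target'];
}

print_r($samples);
print_r($targets);
```

With a feature count of 1, every sample is a one-element array, which is why the review script printed single-item arrays.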
We will need to generate a clean dataset that contains only the columns we need to continue working. Let's call this script generateCleanDataset.php:
<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Exception\FileException;

$sourceFilepath = __DIR__ . '/datasets/raw/Tweets.csv';
$destinationFilepath = __DIR__ . '/datasets/clean_tweets.csv';

$rows = [];
$rows = getRows($sourceFilepath, $rows);
writeRows($destinationFilepath, $rows);

/**
 * Read the source CSV, keeping only the text (column 10)
 * and airline_sentiment (column 1) of every row.
 *
 * @param $filepath
 * @param $rows
 * @return array
 */
function getRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath);

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $rows[] = [$data[10], $data[1]];
    }

    fclose($handle);
    return $rows;
}

/**
 * @param $filepath
 * @param string $mode
 * @return bool|resource
 * @throws FileException
 */
function checkFilePermissions($filepath, $mode = 'rb')
{
    // Only a file opened for reading has to exist already;
    // the destination file is created by fopen() in write mode.
    if ($mode[0] === 'r' && !file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, $mode)) {
        throw FileException::cantOpenFile(basename($filepath));
    }

    return $handle;
}

/**
 * @param $filepath
 * @param $rows
 * @internal param $list
 */
function writeRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath, 'wb');

    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }

    fclose($handle);
}
Nothing too complicated, just enough to do the job. Let's execute it with php generateCleanDataset.php.
Now, let's point the reviewDataset.php script at the clean dataset and run it again:
<code>Array ( [0] => @AmericanAir That will be the third time I have been called by 800-433-7300 an hung on before anyone speaks. What do I do now??? ) Array ( [0] => @AmericanAir How clueless is AA. Been waiting to hear for 2.5 weeks about a refund from a Cancelled Flightled flight & been on hold now for 1hr 49min )</code>
BAM! This is data we can work with! So far, we have been creating simple scripts for manipulating data. Next, we will create a new class in src/Classification/SentimentAnalysis.php (the directory name must match the namespace for PSR-4 autoloading on case-sensitive filesystems).
<?php
namespace PhpmlExercise\Classification;

/**
 * Class SentimentAnalysis
 * @package PhpmlExercise\Classification
 */
class SentimentAnalysis
{
    public function train() {}
    public function predict() {}
}
Our sentiment analysis class needs two methods: train, which takes the training samples and their labels, and predict, which takes unlabeled samples and returns a predicted label for each.
Create a script named classifyTweets.php in the root directory of the project. We will use this script to instantiate and test our sentiment analysis class. Here is the template we will use:
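<?php
namespace PhpmlExercise;

use PhpmlExercise\Classification\SentimentAnalysis;

require __DIR__ . '/vendor/autoload.php';

// Step 1: Load the dataset

// Step 2: Prepare the dataset

// Step 3: Generate the training/testing datasets

// Step 4: Train the classifier

// Step 5: Test the classifier's accuracy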
We already have code that can be used to load CSV into the dataset object in our earlier examples. We will use the same code and make some tweaks:
<?php
...
use Phpml\Dataset\CsvDataset;
...
$dataset = new CsvDataset('datasets/clean_tweets.csv', 1);

$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}
This will generate a flat array containing only features (in this case the tweet text) which we will use to train our classifier.
Now, passing the raw text straight to the classifier would be neither useful nor accurate, because every tweet is essentially different. Fortunately, there are ways of preparing text before applying classification or other machine learning algorithms. For this example, we will use the following two classes: a token count vectorizer, which turns each tweet into a vector of word counts, and a Tf-idf transformer, which re-weights those counts.
Let's start with the text vectorizer:
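As a rough illustration of what a token-count vectorizer does, here is a plain-PHP sketch. It does not call PHP-ML; the function names `tokenize`, `buildVocabulary`, and `countVector`, the simple tokenizing regex, and the sample texts are all assumptions made for this example.

```php
<?php
// Naive tokenizer: lowercase, then keep runs of letters, digits and apostrophes.
function tokenize(string $text): array
{
    preg_match_all("/[a-z0-9']+/", strtolower($text), $matches);
    return $matches[0];
}

// Build a vocabulary (token => column index) from all samples.
function buildVocabulary(array $texts): array
{
    $vocabulary = [];
    foreach ($texts as $text) {
        foreach (tokenize($text) as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }
    return $vocabulary;
}

// Turn one text into a vector of token counts, one slot per vocabulary entry.
function countVector(string $text, array $vocabulary): array
{
    $vector = array_fill(0, count($vocabulary), 0);
    foreach (tokenize($text) as $token) {
        if (isset($vocabulary[$token])) {
            $vector[$vocabulary[$token]]++;
        }
    }
    return $vector;
}

$texts = ['late flight late bag', 'great flight'];
$vocabulary = buildVocabulary($texts);

$vectors = [];
foreach ($texts as $text) {
    $vectors[] = countVector($text, $vocabulary);
}

print_r($vocabulary);
print_r($vectors);
```

In PHP-ML itself, this step is covered by the `Phpml\FeatureExtraction\TokenCountVectorizer` class together with a tokenizer such as `Phpml\Tokenization\WordTokenizer`; the vectorizer is fitted on the samples and then transforms them in place.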
Next, apply the Tf-idf converter:
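To show the idea behind Tf-idf weighting, here is a plain-PHP sketch that does not use PHP-ML. It assumes one common weighting, count × log10(N / df), where N is the number of samples and df is the number of samples containing the token; the function name `tfIdf` and the input vectors are made up.

```php
<?php
// Weight raw token counts by inverse document frequency:
// weight = count * log10(N / df).
function tfIdf(array $vectors): array
{
    $n = count($vectors);

    // df[column] = number of samples in which that token appears at least once.
    $df = [];
    foreach ($vectors as $vector) {
        foreach ($vector as $index => $count) {
            if (!isset($df[$index])) {
                $df[$index] = 0;
            }
            if ($count > 0) {
                $df[$index]++;
            }
        }
    }

    $weighted = [];
    foreach ($vectors as $vector) {
        $row = [];
        foreach ($vector as $index => $count) {
            // A token that appears in no sample gets an IDF of 0 to avoid dividing by zero.
            $idf = $df[$index] > 0 ? log10($n / $df[$index]) : 0.0;
            $row[$index] = $count * $idf;
        }
        $weighted[] = $row;
    }
    return $weighted;
}

// Columns: late, flight, bag, great (made-up counts for two tweets).
$vectors = [[2, 1, 1, 0], [0, 1, 0, 1]];
$weighted = tfIdf($vectors);
print_r($weighted);
```

Note how "flight", which appears in both samples, ends up with a weight of zero: a word that shows up everywhere says little about any one tweet. PHP-ML provides this step as `Phpml\FeatureExtraction\TfIdfTransformer`.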
Our samples array is now in a format our classifier can easily work with. We are not done yet, though: we still need to label each sample with its corresponding sentiment.
Luckily, PHP-ML already covers this requirement, and the code is very simple:
We could go ahead and train our classifier with this dataset. However, we lack a separate test dataset to use for validation, so we'll "cheat" a little and split our original dataset into two parts: a training dataset and a much smaller dataset used for testing the model's accuracy.
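The idea behind the split can be sketched in plain PHP as follows. This is an illustration rather than the library's implementation: the helper name `stratifiedSplit`, the seed, and the toy data are assumptions, and the split is only repeatable on PHP versions where shuffle() honors mt_srand() (7.1 and later).

```php
<?php
// Hypothetical helper: hold out roughly $testSize of each class for testing,
// so the test set keeps the same label proportions as the full dataset.
function stratifiedSplit(array $samples, array $labels, float $testSize, int $seed): array
{
    mt_srand($seed); // fixed seed so the split can be repeated

    // Group sample indices by label so each class is split separately.
    $byLabel = [];
    foreach ($labels as $index => $label) {
        $byLabel[$label][] = $index;
    }

    $split = [
        'trainSamples' => [], 'trainLabels' => [],
        'testSamples'  => [], 'testLabels'  => [],
    ];
    foreach ($byLabel as $indices) {
        shuffle($indices);
        $testCount = (int) round(count($indices) * $testSize);
        foreach ($indices as $position => $index) {
            $bucket = $position < $testCount ? 'test' : 'train';
            $split[$bucket . 'Samples'][] = $samples[$index];
            $split[$bucket . 'Labels'][]  = $labels[$index];
        }
    }
    return $split;
}

// Toy data: 10 negative and 10 positive samples.
$samples = range(1, 20);
$labels = array_merge(array_fill(0, 10, 'negative'), array_fill(0, 10, 'positive'));

$split = stratifiedSplit($samples, $labels, 0.2, 42);
echo count($split['trainSamples']), ' training / ', count($split['testSamples']), " test\n";
```

In PHP-ML, samples and labels can be paired in a `Phpml\Dataset\ArrayDataset`, and `Phpml\CrossValidation\StratifiedRandomSplit` performs this kind of split, exposing the four resulting arrays through getTrainSamples(), getTrainLabels(), getTestSamples(), and getTestLabels().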
This approach is a simple form of cross-validation (a single hold-out split). The term comes from statistics and can be defined as follows:
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. (Source: Wikipedia)
Finally, we are ready to go back and implement the SentimentAnalysis class. If you haven't noticed by now, a large part of machine learning is about gathering and manipulating data; the actual implementation of the machine learning model tends to be a much smaller part of the work.
To implement our sentiment analysis class, we have three available classification algorithms:
For this exercise, we will use the simplest one, the naive Bayes classifier, so let's continue to update our class to implement the train method:
As you can see, we let PHP-ML do all the heavy lifting for us. We just created a nice abstraction for our project. But how do we know whether our classifier is actually trained and working? It's time to use our testSamples and testLabels.
Before we can test our classifier, we have to implement the predict method:
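To give a feel for what happens behind train and predict, here is a toy multinomial naive Bayes classifier in plain PHP. It is an illustration of the general technique, not PHP-ML's implementation; the class name `TinyNaiveBayes`, the add-one smoothing, and the training vectors are all assumptions made for this sketch.

```php
<?php
// Toy multinomial naive Bayes over token-count vectors, with add-one smoothing.
class TinyNaiveBayes
{
    private $logPriors = [];      // label => log P(label)
    private $logLikelihoods = []; // label => [column => log P(token | label)]

    public function train(array $samples, array $labels)
    {
        $columns = count($samples[0]);
        $totals = [];    // label => summed count vector
        $documents = []; // label => number of training samples

        foreach ($samples as $i => $vector) {
            $label = $labels[$i];
            if (!isset($totals[$label])) {
                $totals[$label] = array_fill(0, $columns, 0);
                $documents[$label] = 0;
            }
            $documents[$label]++;
            foreach ($vector as $column => $count) {
                $totals[$label][$column] += $count;
            }
        }

        foreach ($totals as $label => $counts) {
            $this->logPriors[$label] = log($documents[$label] / count($samples));
            $denominator = array_sum($counts) + $columns; // add-one smoothing
            foreach ($counts as $column => $count) {
                $this->logLikelihoods[$label][$column] = log(($count + 1) / $denominator);
            }
        }
    }

    public function predict(array $samples): array
    {
        $predictions = [];
        foreach ($samples as $vector) {
            $best = null;
            $bestScore = -INF;
            // Pick the label with the highest log-probability score.
            foreach ($this->logPriors as $label => $logPrior) {
                $score = $logPrior;
                foreach ($vector as $column => $count) {
                    $score += $count * $this->logLikelihoods[$label][$column];
                }
                if ($score > $bestScore) {
                    $bestScore = $score;
                    $best = $label;
                }
            }
            $predictions[] = $best;
        }
        return $predictions;
    }
}

// Columns: late, lost, great, thanks (made-up counts).
$classifier = new TinyNaiveBayes();
$classifier->train(
    [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]],
    ['negative', 'negative', 'positive', 'positive']
);
$predicted = $classifier->predict([[1, 0, 0, 0], [0, 0, 1, 0]]);
print_r($predicted);
```

PHP-ML's own classifier, `Phpml\Classification\NaiveBayes`, exposes train and predict methods with the same shape, so a wrapper class like SentimentAnalysis can presumably just delegate to it.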
Once again, PHP-ML does all the heavy lifting for us. Let's update the classifyTweets.php script accordingly:
Finally, we need a way to test the accuracy of our training model; thankfully, PHP-ML covers this, too, and they have several metric classes. In our case, we are interested in the accuracy of the model. Let's look at the code:
We should see something similar to the following:
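The accuracy metric itself is simple: the fraction of predictions that match the known labels. Here is a plain-PHP sketch under that definition; the function name `accuracy` and the label arrays are made up, and the result shown by this toy data says nothing about the real model's score.

```php
<?php
// Fraction of predictions that match the known labels.
function accuracy(array $actual, array $predicted): float
{
    if (count($actual) !== count($predicted) || count($actual) === 0) {
        throw new InvalidArgumentException('Label arrays must be non-empty and the same size.');
    }

    $correct = 0;
    foreach ($actual as $i => $label) {
        if ($label === $predicted[$i]) {
            $correct++;
        }
    }
    return $correct / count($actual);
}

// Made-up labels: three of the four predictions are right.
$actual    = ['negative', 'negative', 'positive', 'neutral'];
$predicted = ['negative', 'positive', 'positive', 'neutral'];

$score = accuracy($actual, $predicted);
echo 'Accuracy: ' . $score . "\n";
```

In PHP-ML, the equivalent call is the static method `Phpml\Metric\Accuracy::score()`, which takes the actual labels followed by the predicted labels.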
This post is a bit long, so let's review what we've learned so far:
This article also serves as an introduction to the PHP-ML library and hopes to give you a good understanding of the library's features and how to embed it in your own projects.
Finally, this article is by no means comprehensive, and there is still plenty to learn, improve, and experiment with; here are some ideas that can help you take things further:
I hope you find this article useful. If you have some application ideas or any questions about PHP-ML, feel free to mention them in the comment section below!
Improving the accuracy of sentiment analysis involves a variety of strategies. First, make sure your training data is as clean and relevant as possible. This means removing noise such as stop words, punctuation marks, and URLs. Second, consider using more sophisticated algorithms. While a naive Bayes classifier is a great starting point, other algorithms such as support vector machines (SVMs) or deep learning models may provide better results. Finally, consider using a larger dataset for training. The more data your model can learn from, the more accurate it is likely to be.
Yes, you can use other programming languages for sentiment analysis. Python, for example, has become a popular choice thanks to its extensive machine learning libraries such as NLTK, TextBlob, and scikit-learn. However, PHP can also be used effectively for sentiment analysis, especially if you are already familiar with the language or if your project is built on a PHP framework.
Handling sarcasm and irony in sentiment analysis is a challenging task. These language features often involve saying one thing while meaning the opposite, which is difficult for machine learning models to pick up. One approach is to use more complex models that can take context into account, such as deep learning models. Another approach is to use a specialized sarcasm detection model trained on a dataset of sarcastic comments.
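As a minimal sketch of the cleaning step, the following plain-PHP function lowercases a tweet and strips URLs, punctuation, and a handful of stop words. The function name `cleanTweet`, the tiny stop-word list, and the sample tweet are assumptions for illustration, and the character filter only keeps ASCII letters and digits.

```php
<?php
// Hypothetical cleaner: strip URLs, punctuation and a few stop words.
function cleanTweet(string $text, array $stopWords): string
{
    $text = strtolower($text);
    $text = preg_replace('~https?://\S+~', ' ', $text);  // URLs
    $text = preg_replace("/[^a-z0-9'\s]/", ' ', $text);  // punctuation and symbols
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_diff($words, $stopWords);             // drop stop words
    return implode(' ', $words);
}

$stopWords = ['the', 'a', 'to', 'is', 'and']; // tiny illustrative list
$clean = cleanTweet('The flight is late AGAIN!!! http://example.com/x', $stopWords);
echo $clean, "\n";
```

A cleaner like this would run on each tweet before the text is vectorized, so the vocabulary is not cluttered with tokens that carry no sentiment.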
The principles of sentiment analysis can be applied to any text data, including posts from other social media platforms. The main difference is how you collect data. Each social media platform has its own API for accessing user posts, so you need to be familiar with the API of the platform you are interested in.
Yes, sentiment analysis can be used in any language. However, the effectiveness of the analysis will depend on the quality of your training data. If you are using a language other than English, you need to use the dataset of that language to train your model. Some machine learning libraries also directly support multiple languages.
There are many ways to visualize sentiment analysis results. A common approach is to use a bar chart to show the number of positive, negative, and neutral tweets. Another approach is to use a word cloud to visualize the most frequently used words in the data. PHP has several libraries for creating these visualizations, such as pChart and GD.
Sentiment analysis has many practical applications. Businesses can use it to monitor customers' opinions of their products or services, politicians can use it to gauge public opinion on policy issues, and researchers can use it to study social trends. The possibilities are endless.
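Before reaching for a charting library, the label counts behind such a bar chart can be computed and printed as a quick text chart. This is a small sketch with made-up predictions; in practice the array would come from the classifier.

```php
<?php
// Made-up predictions standing in for the classifier's output.
$predicted = ['negative', 'negative', 'positive', 'neutral', 'negative'];

$counts = array_count_values($predicted); // label => number of tweets
arsort($counts);                          // most frequent label first

foreach ($counts as $label => $count) {
    // One '#' per tweet, followed by the count.
    printf("%-9s %s %d\n", $label, str_repeat('#', $count), $count);
}
```

The same `$counts` array can be handed to whichever charting library you choose as the data series for the bars.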
Emojis can carry important emotional information, so it is important to include them in your analysis. One way is to replace each emoji with its text description before entering the data into the model. There are libraries that can help you do this, such as PHP's Emojione.
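A minimal sketch of the replacement idea, using a tiny hand-written map instead of a library: the function name `describeEmojis` and the text labels are assumptions, and a real project would need a far more complete list.

```php
<?php
// Hypothetical, hand-written emoji map covering just two emojis.
function describeEmojis(string $text): string
{
    $map = [
        "\u{1F600}" => ' grinning_face ', // U+1F600
        "\u{1F620}" => ' angry_face ',    // U+1F620
    ];
    // Replace each emoji with its label, then collapse the extra whitespace.
    return trim(preg_replace('/\s+/u', ' ', strtr($text, $map)));
}

$described = describeEmojis("Lost my bag \u{1F620}");
echo $described, "\n";
```

After this step the emoji behaves like any other word, so the tokenizer and vectorizer can count it alongside the rest of the tweet.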
Spelling errors can be a challenge in sentiment analysis. One option is to run a spell checker over the text to correct errors before feeding the data into the model. Another is to use models that are more tolerant of spelling errors, such as deep learning models.
Keeping your sentiment analysis model up-to-date involves retraining it regularly using new data. This ensures that your model is in sync with language usage and emotional changes. You can automate this process by setting up a plan to retrain the model.