How to Analyze Tweet Sentiments with PHP Machine Learning

Lisa Kudrow
Release: 2025-02-09 10:09:10

This article was peer-reviewed by Wern Ancheta. Thanks to all of SitePoint's peer reviewers for making SitePoint content the best it can be!


Recently, it seems that everyone is talking about machine learning. Your social media stream is filled with posts about ML, Python, TensorFlow, Spark, Scala, Go, and more; if you're like me, you might be wondering, what about PHP?

Yes, what about machine learning and PHP? Fortunately, someone was crazy enough not only to ask this question but also to develop a general-purpose machine learning library that we can use in our next project. In this post, we will take a look at PHP-ML, a machine learning library for PHP, and write a sentiment analysis class that we can later reuse for our own chatbots or Twitter bots. The main goals of this article are:

  • Explore general concepts around machine learning and sentiment analysis
  • Review the capabilities and shortcomings of PHP-ML
  • Define the problem we will deal with
  • Prove that trying to do machine learning in PHP is not a totally crazy goal (optional)

Key Points

  • PHP-ML is a general-purpose machine learning library for PHP, suitable for small applications such as sentiment analysis.
  • This tutorial demonstrates how to use PHP-ML to build a sentiment analysis tool for tweets, focusing on supervised learning techniques.
  • The key step in sentiment analysis is preparing the data, which involves selecting the relevant features and labels from the dataset.
  • Text data requires specific preprocessing, such as tokenization and vectorization, to convert tweets into a format suitable for machine learning models.
  • The Naive Bayes classifier is used in the example because it handles categorical data simply and efficiently.
  • The article emphasizes the importance of a clean, relevant dataset for training models that classify sentiment accurately.

What is machine learning?

Machine learning is a subset of artificial intelligence that focuses on giving "computers the ability to learn without being explicitly programmed." This is achieved by using generic algorithms that can "learn" from a specific dataset.

A common use of machine learning, for example, is classification. Classification algorithms are used to divide data into different groups or categories. Some examples of classification applications include:

  • Email spam filters
  • Market segmentation
  • Fraud detection

Machine learning is an umbrella term that covers generic algorithms for many different tasks. These algorithms fall into two main types, depending on how they learn: supervised learning and unsupervised learning.

Supervised Learning

In supervised learning, we train our algorithm with labeled data in the form of an input object (a vector) and a desired output value; the algorithm analyzes the training data and produces what is called an inferred function, which we can then apply to new, unlabeled data.

For the rest of this post, we will focus on supervised learning because its relationships are easier to see and verify. Keep in mind that both types of learning are equally important and interesting; one could argue that unsupervised learning is more useful, because it removes the need for labeled data.

Unsupervised Learning

This type of learning, on the other hand, works with unlabeled data from the start. We don't know the desired output values for the dataset, so we let the algorithm draw inferences from the data itself; unsupervised learning is especially handy for exploratory data analysis, where the goal is to find hidden patterns in the data.

PHP-ML

Meet PHP-ML, a library that bills itself as a fresh approach to machine learning in PHP. The library implements algorithms, neural networks, and tools for data preprocessing, cross-validation, and feature extraction.

I'll be the first to admit that PHP is an unusual choice for machine learning, as the language's strengths are not a great match for machine learning applications. That said, not every machine learning application needs to process petabytes of data and perform massive calculations; for simple applications, we should be able to get away with PHP and PHP-ML.

The best use case for this library I can see now is the implementation of classifiers, whether it is spam filters or sentiment analysis. We will define a classification problem and build a solution step by step to understand how to use PHP-ML in our project.

Problem

To illustrate the process of using PHP-ML and adding some machine learning to our applications, I wanted to find an interesting problem to tackle, and what better way to showcase a classifier than by building a tweet sentiment analysis class?

One of the key requirements for a successful machine learning project is a good starting dataset. Datasets are crucial because they allow us to train our classifier against already classified examples. With all the recent noise around airlines, what better dataset to use than customer tweets to airlines?

Luckily, a dataset of such tweets is already available thanks to Kaggle: the Twitter US Airline Sentiment dataset can be downloaded from their website.

Solution

Let's first look at the dataset we're going to work on. The original dataset contains the following columns:

  • tweet_id
  • airline_sentiment
  • airline_sentiment_confidence
  • negativereason
  • negativereason_confidence
  • airline
  • airline_sentiment_gold
  • name
  • negativereason_gold
  • retweet_count
  • text
  • tweet_coord
  • tweet_created
  • tweet_location
  • user_timezone

and looks like the following example (empty cells are left blank):

| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
| 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials to the experience… tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today… Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |
| 570300767074181121 | negative | 1.0 | Can't Tell | 0.6842 | Virgin America | | jnardino | | 0 | @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA | | 2015-02-24 11:14:33 -0800 | | Pacific Time (US & Canada) |
| 570300616901320704 | positive | 0.6745 | | 0.0 | Virgin America | | cjmcginnis | | 0 | @VirginAmerica yes nearly every time I fly VX this “ear worm” won't go away :) | | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 570300248553349120 | neutral | 0.634 | | | Virgin America | | pilot | | 0 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody there. https://www.php.cn/link/76379ed89eafe43c8f6bd64fd09e3852 | | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |

This file contains 14,640 tweets, so it is a good working dataset for us. With this many columns, we have more data than the example needs; for practical purposes we only care about the following columns:

  • text
  • airline_sentiment

where text will become our feature and airline_sentiment will become our target. The remaining columns can be discarded because they will not be used in this exercise. Let's start by creating the project and initializing Composer with the following composer.json file:

<code>{
    "name": "amacgregor/phpml-exercise",
    "description": "Example implementation of a Tweet sentiment analysis with PHP-ML",
    "type": "project",
    "require": {
        "php-ai/php-ml": "^0.4.1"
    },
    "license": "Apache-2.0",
    "authors": [
        {
            "name": "Allan MacGregor",
            "email": "amacgregor@allanmacgregor.com"
        }
    ],
    "autoload": {
        "psr-4": {"PhpmlExercise\\": "src/"}
    },
    "minimum-stability": "dev"
}</code>

Then install the dependencies:

<code>composer install
</code>

If you need an introduction to Composer, see the Composer documentation.

To make sure everything is set up correctly, let's create a quick script that loads our Tweets.csv data file and checks that it has the data we need. Save the following code as reviewDataset.php in the project root directory:

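<?php namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Dataset\CsvDataset;

// Second argument: number of feature columns to read from each row
$dataset = new CsvDataset('datasets/raw/Tweets.csv', 1);

// Print every sample so we can inspect what was loaded
foreach ($dataset->getSamples() as $sample) {
    print_r($sample);
}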

Now run the script with php reviewDataset.php and look at the output:

<code>Array( [0] => 569587371693355008 )
Array( [0] => 569587242672398336 )
Array( [0] => 569587188687634433 )
Array( [0] => 569587140490866689 )
</code>

That doesn't look very useful, does it? Let's take a look at the CsvDataset class to better understand what's happening inside:

<?php 
    public function __construct(string $filepath, int $features, bool $headingRow = true)
    {
        if (!file_exists($filepath)) {
            throw FileException::missingFile(basename($filepath));
        }

        if (false === $handle = fopen($filepath, 'rb')) {
            throw FileException::cantOpenFile(basename($filepath));
        }

        if ($headingRow) {
            $data = fgetcsv($handle, 1000, ',');
            $this->columnNames = array_slice($data, 0, $features);
        } else {
            $this->columnNames = range(0, $features - 1);
        }

        while (($data = fgetcsv($handle, 1000, ',')) !== false) {
            $this->samples[] = array_slice($data, 0, $features);
            $this->targets[] = $data[$features];
        }
        fclose($handle);
    }

The CsvDataset constructor takes three parameters:

  • The file path to the source CSV
  • An integer specifying the number of feature columns in the file
  • A boolean indicating whether the first row is a heading row

If we look closely, we can see that the class maps the CSV file to two internal arrays: samples and targets. Samples contains all the features provided by the file, while targets contains the known values (negative, positive, or neutral). Since we passed 1 as the number of features, only the first column (tweet_id) ended up in the samples, which explains the output above.

Based on the above, we can see that our CSV file needs to follow this format:

<code>| feature_1 | feature_2 | feature_n | target | </code>
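
To make this mapping concrete, here is a minimal plain-PHP sketch of how a single parsed CSV row is split when the feature count is 1. The helper name splitRow and the two example rows are made up for illustration; only the array_slice logic mirrors the constructor shown above.

```php
<?php
// Hypothetical helper: mirrors the slicing done in the CsvDataset
// constructor above, but for a single already-parsed CSV row.
function splitRow(array $row, int $features): array
{
    return [
        'sample' => array_slice($row, 0, $features), // first N columns
        'target' => $row[$features],                 // column right after them
    ];
}

// Raw file layout: tweet_id comes first, airline_sentiment second.
$rawRow = ['570306133677760513', 'neutral', '1.0'];
// Layout with the feature first and the target right after it.
$cleanRow = ['@VirginAmerica What @dhepburn said.', 'neutral'];

print_r(splitRow($rawRow, 1));   // the sample holds the tweet ID, not the text
print_r(splitRow($cleanRow, 1)); // the sample holds the tweet text
```

With the raw column order, the only "feature" the dataset sees is the tweet ID, which is why the file has to be rearranged before it is useful.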

We will need to generate a clean dataset that contains only the columns we need to continue working. Let's call this script generateCleanDataset.php:

<?php namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Exception\FileException;

$sourceFilepath         = __DIR__ . '/datasets/raw/Tweets.csv';
$destinationFilepath    = __DIR__ . '/datasets/clean_tweets.csv';

$rows =[];

$rows = getRows($sourceFilepath, $rows);
writeRows($destinationFilepath, $rows);


/**
 * @param $filepath
 * @param $rows
 * @return array
 */
function getRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath);

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        // Column 10 is the tweet text, column 1 is airline_sentiment
        $rows[] = [$data[10], $data[1]];
    }
    fclose($handle);
    return $rows;
}

/**
 * @param $filepath
 * @param string $mode
 * @return bool|resource
 * @throws FileException
 */
function checkFilePermissions($filepath, $mode = 'rb')
{
    // Only files opened for reading must already exist;
    // in write mode fopen() creates the file if needed.
    if ($mode[0] === 'r' && !file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, $mode)) {
        throw FileException::cantOpenFile(basename($filepath));
    }
    return $handle;
}

/**
 * @param $filepath
 * @param $rows
 * @internal param $list
 */
function writeRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath, 'wb');

    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }

    fclose($handle);
}

Nothing too complicated, just enough to do the job. Let's execute it with php generateCleanDataset.php.

Now, let's point the reviewDataset.php script at the clean dataset (datasets/clean_tweets.csv) and run it again:

<code>Array
(
    [0] => @AmericanAir That will be the third time I have been called by 800-433-7300 an hung on before anyone speaks. What do I do now???
)
Array
(
    [0] => @AmericanAir How clueless is AA. Been waiting to hear for 2.5 weeks about a refund from a Cancelled Flightled flight & been on hold now for 1hr 49min
)</code>

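BAM! This is data we can work with! So far, we have been creating simple scripts for manipulating data. Next, we will create a new class in src/Classification/SentimentAnalysis.php (the directory name has to match the Classification segment of the namespace for PSR-4 autoloading to find the class on case-sensitive file systems).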

<?php namespace PhpmlExercise\Classification;

/**
 * Class SentimentAnalysis
 * @package PhpmlExercise\Classification
 */
class SentimentAnalysis { 
    public function train() {}
    public function predict() {}
}

Our sentiment analysis class will need two methods:

  • A train method, which takes the training samples and labels from our dataset, plus some optional parameters.
  • A predict method, which takes an unlabeled dataset and assigns it a set of labels based on the training data.

Create a script named classifyTweets.php in the root directory of the project. We will use this script to instantiate and test our sentiment analysis class. Here is the template we will use:

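<?php
namespace PhpmlExercise;
use PhpmlExercise\Classification\SentimentAnalysis;

require __DIR__ . '/vendor/autoload.php';

// Step 1: Load the dataset

// Step 2: Prepare the dataset

// Step 3: Generate the training/testing datasets

// Step 4: Train the classifier

// Step 5: Test the classifier's accuracy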

Step 1: Load the dataset

We already have code that can be used to load CSV into the dataset object in our earlier examples. We will use the same code and make some tweaks:

<?php ...
use Phpml\Dataset\CsvDataset;
...
$dataset = new CsvDataset('datasets/clean_tweets.csv',1);

$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}

This will generate a flat array containing only features (in this case the tweet text) which we will use to train our classifier.

Step 2: Prepare the dataset

Passing the raw text straight to the classifier would be neither useful nor accurate, because every tweet is essentially different. Fortunately, there are established ways of preparing text for classification and other machine learning algorithms. For this example, we will use the following two classes:

  • Token Count Vectorizer: converts a collection of text samples into vectors of token counts. Essentially, every word in our tweets is assigned a unique number, and the vectorizer keeps track of how many times each word occurs in a particular text sample.
  • Tf-idf Transformer: tf-idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

Let's start with the text vectorizer. We create a TokenCountVectorizer (from the Phpml\FeatureExtraction namespace) and pass it a WordTokenizer (from Phpml\Tokenization). Calling fit($samples) builds the vocabulary, and calling transform($samples) then replaces every tweet in the array with its vector of token counts.
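
To show what a token count vector actually looks like, here is a small plain-PHP sketch of the idea. It is not PHP-ML's implementation: the function names (tokenize, buildVocabulary, countVector) and the two toy tweets are assumptions made up for this illustration.

```php
<?php
// Illustrative only: a tiny bag-of-words vectorizer in plain PHP.

// Split a text into lowercase word tokens.
function tokenize(string $text): array
{
    preg_match_all('/\w+/u', strtolower($text), $matches);
    return $matches[0];
}

// Build the vocabulary: every distinct token gets an integer index.
function buildVocabulary(array $texts): array
{
    $vocabulary = [];
    foreach ($texts as $text) {
        foreach (tokenize($text) as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }
    return $vocabulary;
}

// Turn one text into a vector of counts, one slot per vocabulary entry.
function countVector(string $text, array $vocabulary): array
{
    $vector = array_fill(0, count($vocabulary), 0);
    foreach (tokenize($text) as $token) {
        if (isset($vocabulary[$token])) {
            $vector[$vocabulary[$token]]++;
        }
    }
    return $vector;
}

$tweets = ['great flight great crew', 'late flight'];
$vocabulary = buildVocabulary($tweets);
// $vocabulary: great => 0, flight => 1, crew => 2, late => 3

foreach ($tweets as $tweet) {
    echo implode(' ', countVector($tweet, $vocabulary)), PHP_EOL;
}
```

Every tweet ends up as a vector of the same length, which is what lets a classifier compare tweets that share no exact wording.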

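Next, apply the Tf-idf transformer in the same way: create a TfIdfTransformer (also from Phpml\FeatureExtraction), call fit($samples), and then call transform($samples) to convert the raw counts into tf-idf weights.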
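
The weighting itself can be sketched in a few lines of plain PHP. The version below uses one common variant, the raw count multiplied by log(N / df); libraries differ in log base and smoothing, so treat the numbers as illustrative rather than as what PHP-ML produces. The function name tfIdf and the toy vocabulary are assumptions.

```php
<?php
// Illustrative only: one common tf-idf variant, count * log(N / df).

// $countVectors: one token-count vector per document, all the same length.
function tfIdf(array $countVectors): array
{
    $documentCount = count($countVectors);
    $termCount = count($countVectors[0]);

    // Document frequency: in how many documents does each term occur?
    $documentFrequency = array_fill(0, $termCount, 0);
    foreach ($countVectors as $vector) {
        foreach ($vector as $term => $count) {
            if ($count > 0) {
                $documentFrequency[$term]++;
            }
        }
    }

    // Weight each raw count by the inverse document frequency of its term.
    $weighted = [];
    foreach ($countVectors as $vector) {
        $row = [];
        foreach ($vector as $term => $count) {
            $idf = $documentFrequency[$term] > 0
                ? log($documentCount / $documentFrequency[$term])
                : 0.0;
            $row[] = $count * $idf;
        }
        $weighted[] = $row;
    }
    return $weighted;
}

// Columns: great, flight, crew, late (hypothetical vocabulary).
$counts = [
    [2, 1, 1, 0], // "great flight great crew"
    [0, 1, 0, 1], // "late flight"
];
$weights = tfIdf($counts);
// "flight" occurs in every document, so its weight drops to 0,
// while "great" (twice, in one document only) gets the highest weight.
print_r($weights);
```

Words that show up everywhere carry little information about sentiment, and this weighting is what pushes them into the background.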

Our samples array is now in a format our classifier can easily understand. We are not done yet, though: we still need to label each sample with its corresponding sentiment.

Step 3: Generate the training dataset

Luckily, PHP-ML already covers this requirement, and the code is very simple: we pass the transformed samples, together with the targets from the CSV dataset ($dataset->getTargets()), to the constructor of an ArrayDataset (from the Phpml\Dataset namespace).

We could go ahead and use this dataset to train our classifier. However, we lack a separate test dataset to validate against, so we'll "cheat" a little and split our original dataset into two parts: a training dataset and a much smaller dataset for testing the model's accuracy. PHP-ML's StratifiedRandomSplit (from the Phpml\CrossValidation namespace) does this for us: it takes the dataset and the share of samples to hold out for testing (0.1 would hold out 10%), and exposes the two parts through getTrainSamples(), getTrainLabels(), getTestSamples(), and getTestLabels(). We keep the test part in $testSamples and $testLabels.
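
For readers curious about what "stratified" means here, the following plain-PHP sketch splits a toy dataset so that every label keeps the same share in both parts. It is an illustration under assumed names (stratifiedSplit, the array keys, the seed parameter), not the code PHP-ML runs.

```php
<?php
// Illustrative only: a plain-PHP stratified train/test split.

function stratifiedSplit(array $samples, array $labels, float $testSize, int $seed): array
{
    mt_srand($seed); // fixed seed so the split is repeatable

    // Group sample indexes by label.
    $byLabel = [];
    foreach ($labels as $index => $label) {
        $byLabel[$label][] = $index;
    }

    $split = ['trainSamples' => [], 'trainLabels' => [], 'testSamples' => [], 'testLabels' => []];
    foreach ($byLabel as $indexes) {
        shuffle($indexes);
        // Hold out the same share of every label for testing.
        $testCount = (int) round(count($indexes) * $testSize);
        foreach ($indexes as $position => $index) {
            $part = $position < $testCount ? 'test' : 'train';
            $split[$part . 'Samples'][] = $samples[$index];
            $split[$part . 'Labels'][] = $labels[$index];
        }
    }
    return $split;
}

// Toy data: 10 negative and 10 positive "tweets" (numbers as placeholders).
$samples = range(1, 20);
$labels = array_merge(array_fill(0, 10, 'negative'), array_fill(0, 10, 'positive'));

$split = stratifiedSplit($samples, $labels, 0.2, 42);
echo count($split['trainSamples']), ' training / ', count($split['testSamples']), ' test', PHP_EOL;
```

Splitting per label matters for this dataset because the sentiments are not evenly represented; a purely random split could leave the test set with very few examples of the rarer labels.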

This approach is called cross-validation. The term comes from statistics and can be defined as follows:

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. (Source: Wikipedia)

Step 4: Training the classifier

Finally, we are ready to go back and implement the SentimentAnalysis class. If you haven't noticed by now, a large part of machine learning is about gathering and manipulating data; the actual implementation of the model tends to be much less involved.

To implement our sentiment analysis class, we have three classification algorithms available:

  • Support Vector Classification
  • K-Nearest Neighbors
  • Naive Bayes

For this exercise, we will use the simplest of them all, the Naive Bayes classifier, so let's update our class to implement the train method. The class creates a NaiveBayes instance (from the Phpml\Classification namespace) in its constructor and keeps it in a property, and train($samples, $labels) simply hands both arrays to the classifier's own train method.
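
To give a feel for what a Naive Bayes classifier does with text, here is a tiny multinomial variant with add-one (Laplace) smoothing, written in plain PHP. It is a teaching sketch, not PHP-ML's NaiveBayes: the class name TinyNaiveBayes, its token-array input format, and the toy training data are all assumptions.

```php
<?php
// Illustrative only: a tiny multinomial Naive Bayes text classifier
// with add-one (Laplace) smoothing.

class TinyNaiveBayes
{
    private array $labelCounts = []; // label => number of training documents
    private array $tokenCounts = []; // label => [token => count]
    private array $vocabulary = [];  // token => true

    // $documents: list of token arrays; $labels: one label per document.
    public function train(array $documents, array $labels): void
    {
        foreach ($documents as $i => $tokens) {
            $label = $labels[$i];
            $this->labelCounts[$label] = ($this->labelCounts[$label] ?? 0) + 1;
            foreach ($tokens as $token) {
                $this->tokenCounts[$label][$token] = ($this->tokenCounts[$label][$token] ?? 0) + 1;
                $this->vocabulary[$token] = true;
            }
        }
    }

    // Return the label with the highest log-probability for the tokens.
    public function predict(array $tokens): string
    {
        $total = array_sum($this->labelCounts);
        $best = '';
        $bestScore = -INF;
        foreach ($this->labelCounts as $label => $count) {
            $score = log($count / $total); // log prior of the label
            $labelTokens = array_sum($this->tokenCounts[$label] ?? []);
            foreach ($tokens as $token) {
                $seen = $this->tokenCounts[$label][$token] ?? 0;
                // Smoothed log likelihood of this token given the label.
                $score += log(($seen + 1) / ($labelTokens + count($this->vocabulary)));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = (string) $label;
            }
        }
        return $best;
    }
}

$classifier = new TinyNaiveBayes();
$classifier->train(
    [['great', 'flight'], ['love', 'the', 'crew'], ['late', 'again'], ['lost', 'my', 'bag']],
    ['positive', 'positive', 'negative', 'negative']
);
echo $classifier->predict(['great', 'crew']), PHP_EOL;
echo $classifier->predict(['late', 'bag']), PHP_EOL;
```

The "naive" part is the assumption that tokens are independent given the label, which is why the score is just a sum of per-token log probabilities.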

Here, we are letting PHP-ML do all the heavy lifting; we are just creating a nice abstraction for our project. But how do we know whether our classifier is actually being trained and working? It's time to use our testSamples and testLabels.

Step 5: Test the accuracy of the classifier

Before we can test our classifier, we have to implement the predict method. It is just as short: predict($samples) returns the result of calling predict on the underlying classifier.

Once again, PHP-ML does all the heavy lifting for us. Let's update the classifyTweets.php script accordingly: instantiate SentimentAnalysis, train it with the training samples and labels, and then call predict($testSamples), keeping the returned labels in a variable such as $predictedLabels.

Finally, we need a way to test the accuracy of our trained model; thankfully, PHP-ML has this covered too, with several metric classes. In our case, we are interested in the model's accuracy, which Accuracy::score($testLabels, $predictedLabels) (from the Phpml\Metric namespace) returns as the fraction of predictions that match the known labels.

Printing that value at the end of classifyTweets.php should give us a number between 0 and 1; the exact figure may vary between runs unless the split is seeded.
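
The metric is simple enough to spell out by hand. The sketch below is a plain-PHP illustration of what an accuracy score measures; the function name accuracy and the toy labels are made up for this example.

```php
<?php
// Illustrative only: accuracy is the share of predictions that match
// the known labels.
function accuracy(array $actualLabels, array $predictedLabels): float
{
    if (count($actualLabels) !== count($predictedLabels) || count($actualLabels) === 0) {
        throw new InvalidArgumentException('Both arrays must be non-empty and of equal size.');
    }

    $correct = 0;
    foreach ($actualLabels as $index => $actual) {
        if ($actual === $predictedLabels[$index]) {
            $correct++;
        }
    }
    return $correct / count($actualLabels);
}

$actual    = ['negative', 'negative', 'positive', 'neutral'];
$predicted = ['negative', 'positive', 'positive', 'neutral'];

echo 'Accuracy: ', accuracy($actual, $predicted), PHP_EOL; // 3 of 4 correct
```

Keep in mind that accuracy alone can flatter a model when one label dominates the dataset, so it is worth looking at per-label results as well.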

Conclusion

This post is a bit long, so let's review what we've learned so far:

  • Having a good dataset from the beginning is essential to implementing machine learning algorithms.
  • The difference between supervised learning and unsupervised learning.
  • The meaning and use of cross-validation in machine learning.
  • Vectorization and transformation are essential for preparing text datasets for machine learning.
  • How to implement tweet sentiment analysis using PHP-ML's Naive Bayes classifier.

This article also serves as an introduction to the PHP-ML library and will hopefully give you a good idea of what the library can do and how it can be embedded in your own projects.

Finally, this article is by no means comprehensive, and there is plenty left to learn, improve, and experiment with; here are some ideas to take things further:

  • Replace the Naive Bayes algorithm with the Support Vector Machine algorithm.
  • If you try running against the full dataset (over 14,000 rows), you will probably notice how memory-intensive the process can be. Try implementing model persistence so that the model doesn't have to be trained on every run.
  • Move the dataset generation into its own helper class.
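
As a starting point for the persistence idea, here is a minimal plain-PHP sketch that stores a trained model on disk with serialize() and loads it back. The helper names saveModel and loadModel, the file name, and the placeholder $model array are assumptions; a real classifier object would take the array's place.

```php
<?php
// Illustrative only: persisting a trained model with PHP's serialize().
// The $model array stands in for whatever your classifier learned.

function saveModel($model, string $path): void
{
    if (file_put_contents($path, serialize($model)) === false) {
        throw new RuntimeException("Could not write model to $path");
    }
}

function loadModel(string $path)
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Could not read model from $path");
    }
    // Only unserialize files you created yourself; untrusted input is unsafe.
    return unserialize($contents);
}

$model = ['labelCounts' => ['positive' => 2, 'negative' => 2]];
$path = sys_get_temp_dir() . '/sentiment-model.bin';

saveModel($model, $path);
$restored = loadModel($path);
var_dump($restored === $model); // the restored copy matches the original
```

With something like this in place, the expensive training step only has to run when the dataset changes.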

I hope you find this article useful. If you have some application ideas or any questions about PHP-ML, feel free to mention them in the comment section below!

Frequently Asked Questions (FAQs) on PHP Machine Learning for Tweet Sentiment Analysis

How to improve the accuracy of sentiment analysis?

Improving the accuracy of sentiment analysis involves several strategies. First, make sure your training data is as clean and relevant as possible. This means removing noise such as stop words, punctuation, and URLs. Second, consider using more sophisticated algorithms. While the Naive Bayes classifier is a great starting point, other algorithms such as support vector machines (SVMs) or deep learning models may give better results. Finally, consider training on a larger dataset. The more data your model can learn from, the more accurate it tends to be.
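
The clean-up step could look roughly like the plain-PHP sketch below. The function name cleanTweet and the tiny stop-word list are made up for illustration, and the character filter only keeps ASCII letters, digits, @, # and apostrophes, so it would need adjusting for other languages or emojis.

```php
<?php
// Illustrative only: basic tweet clean-up before vectorizing.
// The stop-word list is a tiny sample, not a complete one.

function cleanTweet(string $tweet, array $stopWords): string
{
    $text = strtolower($tweet);
    $text = preg_replace('~https?://\S+~', ' ', $text);     // drop URLs
    $text = preg_replace("/[^a-z0-9@#'\\s]/", ' ', $text);  // drop punctuation
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_diff($words, $stopWords);                // drop stop words
    return implode(' ', $words);
}

$stopWords = ['a', 'an', 'the', 'to', 'is', 'it'];
echo cleanTweet('The flight to NYC is LATE!!! http://example.com/x', $stopWords), PHP_EOL;
```

Running this kind of filter before the vectorizer shrinks the vocabulary, which also helps with memory use.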

Can I use languages other than PHP for sentiment analysis?

Yes, you can use other programming languages for sentiment analysis. Python, for example, is a popular choice thanks to its extensive machine learning libraries such as NLTK, TextBlob, and scikit-learn. However, PHP can also be used effectively for sentiment analysis, especially if you are already familiar with the language or your project is built on a PHP framework.

How to deal with sarcasm and irony in sentiment analysis?

Handling sarcasm and irony in sentiment analysis is a challenging task. These language features often involve saying one thing but meaning the opposite, which is difficult for machine learning models to understand. One approach is to use more sophisticated models that can take context into account, such as deep learning models. Another approach is to use a specialized sarcasm detection model trained on a dataset of sarcastic comments.

How to use sentiment analysis for other social media platforms?

The principles of sentiment analysis can be applied to any text data, including posts from other social media platforms. The main difference is how you collect data. Each social media platform has its own API for accessing user posts, so you need to be familiar with the API of the platform you are interested in.

Can I use sentiment analysis for languages other than English?

Yes, sentiment analysis can be used in any language. However, the effectiveness of the analysis will depend on the quality of your training data. If you are using a language other than English, you need to use the dataset of that language to train your model. Some machine learning libraries also directly support multiple languages.

How to visualize the results of sentiment analysis?

There are many ways to visualize sentiment analysis results. A common approach is to use a bar chart to show the number of positive, negative, and neutral tweets. Another approach is to use a word cloud to visualize the most frequently used words in the data. PHP has several libraries for creating these visualizations, such as pChart and GD.

How to use sentiment analysis in practical applications?

Sentiment analysis has many practical applications. Businesses can use it to monitor customers' opinions of their products or services, politicians can use it to gauge public opinion on policy issues, and researchers can use it to study social trends. The possibilities are endless.

How to deal with emojis in sentiment analysis?

Emojis can carry important emotional information, so it is important to include them in your analysis. One way is to replace each emoji with its text description before entering the data into the model. There are libraries that can help you do this, such as PHP's Emojione.

How to deal with spelling errors in sentiment analysis?

Spelling errors can be a challenge in sentiment analysis. One way to handle them is to run a spell checker over the data before feeding it into the model. Another approach is to use models that can cope with spelling errors, such as deep learning models.

How to keep my sentiment analysis model up to date?

Keeping your sentiment analysis model up to date involves retraining it regularly with new data. This keeps your model in sync with changes in language usage and sentiment. You can automate this process by setting up a schedule for retraining the model.
