Machine learning is actually simpler than you think

Many people see machine learning as out of reach: a mysterious technology that only a handful of academics understand.

After all, you are letting a machine that lives in a binary world form its own understanding of the real world. You are teaching it how to think. This article, however, is not the obscure, formula-laden piece you might expect. Like the basic frameworks that help us understand the world (for example, Newton's laws of motion, Jobs to Be Done, and supply and demand), the best methods and concepts in machine learning are concise and clear. Unfortunately, most of the literature on machine learning is filled with complex symbols, obscure formulas, and unnecessary jargon. That is what builds a thick wall around the simple, basic ideas of machine learning.

Now consider a practical example: we want to add a "you may like" recommendation feature at the end of each article. How do we implement it?

To implement this idea we have a simple solution:

1. Get the title of the current article and split it into individual words (Translator’s Note: The original text is in English, so splitting on spaces is enough; Chinese text requires a word segmenter)
2. Get all articles except the current article
3. Sort these articles according to the overlap between their content and the title of the current article

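def similar_posts(post)
  title_keywords = post.title.split(' ')
  # Step 2: every article except the current one.
  others = Post.all.to_a.reject { |other| other == post }
  # Step 3: sort by how many title keywords each body contains, best match first.
  others.sort do |post1, post2|
    post1_title_intersection = post1.body.split(' ') & title_keywords
    post2_title_intersection = post2.body.split(' ') & title_keywords
    post2_title_intersection.length <=> post1_title_intersection.length
  end.first(10)
end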
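To try the same idea without a database, here is a minimal sketch that runs on in-memory data. The `Post` struct, the extra `all_posts` and `limit` parameters, and the sample titles and bodies are assumptions made for illustration; they are not part of the original application.

```ruby
# Hypothetical stand-in for the article's Post model (an assumption for illustration).
Post = Struct.new(:title, :body)

# Return up to `limit` posts, ordered by how many distinct body words
# each one shares with the current post's title (most shared first).
def similar_posts(post, all_posts, limit = 10)
  title_keywords = post.title.split(' ')
  others = all_posts.reject { |other| other.equal?(post) }
  ranked = others.sort_by do |other|
    shared = (other.body.split(' ') & title_keywords).length
    -shared # negate so the largest overlap sorts first
  end
  ranked.first(limit)
end

# Made-up sample data.
current = Post.new("support teams improve quality", "placeholder body")
posts = [
  current,
  Post.new("A", "our support teams care about quality"),
  Post.new("B", "notes on icon design"),
  Post.new("C", "support matters")
]

# Prints the titles of the two best matches: A (3 shared words), then C (1).
similar_posts(current, posts, 2).each { |p| puts p.title }
```

Because the score is a raw count of shared words, any body that happens to repeat a common title word ranks well, which goes some way toward explaining noisy results.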

Using this method to find articles similar to the blog post "How Support Teams Improve Product Quality", we get the following ten most relevant articles:

How to get started with a proven solution
Understand how your customers make decisions
Design the first-run interface to delight your users
How to recruit designers
Discussion on icon design
Interview with Ryan Singer
Actively support customers through internal communications
Why being first doesn’t matter
Interview with Joshua Porter
Customer retention, group analysis and visualization

As you can see, the reference article is about how a support team can work effectively, which has little to do with customer group analysis or the merits of design. We can do better.

Now let's solve this problem with a real machine learning method, in two steps:

1. Express the articles in mathematical form;
2. Use the K-means clustering algorithm to cluster the resulting data points.

1. Express the article in mathematical form

If we can express the articles in mathematical form, we can plot them according to their similarity and identify distinct clusters:

As shown in the figure above, it is not difficult to map each article to a point in a coordinate system. It takes two steps:

1. Find all the words in each article;
2. Create an array for each article. Each element is 0 or 1, indicating whether a given word appears in that article. Every article's array uses the same word order; only the values differ.

The Ruby code is as follows:

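@posts = Post.all
# Vocabulary: every distinct word that appears in any article body.
@words = @posts.map do |p|
  p.body.split(' ')
end.flatten.uniq
# One 0/1 array per article, aligned with @words.
@vectors = @posts.map do |p|
  post_words = p.body.split(' ')
  @words.map do |w|
    # Compare whole words; a substring check would also match inside longer words.
    post_words.include?(w) ? 1 : 0
  end
end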
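The same encoding can be tried on plain strings, without a `Post` model. This is a minimal sketch: the helper names `build_vocabulary` and `to_vector` and the three sample bodies are made up for illustration. It matches whole words rather than substrings.

```ruby
# Collect every distinct word across all bodies, in first-seen order.
def build_vocabulary(bodies)
  bodies.flat_map { |body| body.split(' ') }.uniq
end

# Encode one body as a 0/1 array aligned with the vocabulary.
def to_vector(body, vocabulary)
  present = body.split(' ')
  vocabulary.map { |word| present.include?(word) ? 1 : 0 }
end

# Made-up sample bodies.
bodies = ["Hello Blog Reader", "Blog Post", "Hello Post"]
vocabulary = build_vocabulary(bodies)
vectors = bodies.map { |body| to_vector(body, vocabulary) }

puts vocabulary.inspect # ["Hello", "Blog", "Reader", "Post"]
vectors.each { |vector| puts vector.inspect }
# [1, 1, 1, 0]
# [0, 1, 0, 1]
# [1, 0, 0, 1]
```

Every vector has one slot per vocabulary word, so all articles end up as points in the same space, whatever their length.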

Suppose the value of @words is:

["Hello","Inside","Intercom","Reader","Blog","Post"]

If the content of an article is "Hello Blog Post Reader", then its corresponding array is:

[1,0,0,1,1,1]

Of course, we have no simple tool for drawing a six-dimensional point the way we draw a two-dimensional one, but the basic concepts involved, such as the distance between two points, carry over and generalize to higher dimensions (so a two-dimensional example still illustrates the problem).
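To make the point about distance concrete, here is a minimal sketch of a distance function that works for any number of dimensions. It assumes Euclidean (straight-line) distance, which the text does not specify; the sample points are made up.

```ruby
# Euclidean distance between two points given as equal-length arrays.
def distance(a, b)
  raise ArgumentError, "points must have the same number of dimensions" unless a.length == b.length
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Two dimensions: the familiar 3-4-5 right triangle.
puts distance([0, 0], [3, 4]) # 5.0

# Six dimensions: these two 0/1 vectors differ in three positions,
# so the distance is the square root of 3 (about 1.732).
puts distance([1, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1])
```

The function body is the same in both cases; only the length of the arrays changes.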

2. Use K-means clustering algorithm to perform cluster analysis on data points

Now that we have coordinates for all the articles, we can look for clusters of similar articles. Here we use a fairly simple clustering algorithm, K-means, which can be summarized in five steps:

1. Choose a number K, the number of clusters to find;
2. Randomly select K objects from the data as the initial cluster centers;
3. Go through all objects and assign each one to the cluster whose center is closest to it;
4. Update each cluster center: compute the mean of the objects in the cluster and use it as the new center;
5. Repeat steps 3 and 4 until the cluster centers no longer change.

Next we illustrate these steps with diagrams. First we randomly pick two points (K=2) from the article coordinates as the initial centers:

We assign each article to the cluster closest to it:

We calculate the mean coordinate of all objects in each cluster as the new center of the cluster.

This completes the first iteration. Now we reassign the articles to clusters based on the new cluster centers.

At this point, we have found the cluster for each article! Further iterations will no longer move the cluster centers, so the cluster assigned to each article will not change either.

The Ruby code for the above process is as follows:

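# rand_point, distance and average are helpers that are not shown here.
# Start from two random centers (K = 2).
@cluster_centers = [rand_point(), rand_point()]
15.times do
  @clusters = [[], []]
  # Assign each post to its nearest center.
  @posts.each do |post|
    min_distance, min_point = Float::INFINITY, nil
    @cluster_centers.each.with_index do |center, i|
      if distance(center, post) < min_distance
        min_distance = distance(center, post)
        min_point = i
      end
    end
    @clusters[min_point] << post
  end
  # Move each center to the mean of the posts assigned to it.
  @cluster_centers = @clusters.map do |posts|
    average(posts)
  end
end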
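The snippet above calls `distance` and `average` without defining them. The sketch below fills them in so the loop can be run end to end. It rests on several assumptions: each post is represented by its numeric vector, distance is Euclidean, and the initial centers come from `Array#sample` (standing in for `rand_point`) so that the two are distinct. It also stops early once the centers stop moving, instead of always running 15 rounds. The `k_means` name and the sample points are made up for illustration.

```ruby
# Euclidean distance between two equal-length vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise mean of a non-empty list of equal-length vectors.
def average(vectors)
  vectors.transpose.map { |column| column.sum.to_f / column.length }
end

# Run K-means from the given initial centers; returns [centers, clusters].
def k_means(vectors, centers, max_iterations = 15)
  clusters = []
  max_iterations.times do
    # Assignment step: put each vector in the cluster of its nearest center.
    clusters = Array.new(centers.length) { [] }
    vectors.each do |vector|
      nearest = centers.each_index.min_by { |i| distance(centers[i], vector) }
      clusters[nearest] << vector
    end
    # Update step: an empty cluster keeps its previous center.
    new_centers = clusters.each_with_index.map do |members, i|
      members.empty? ? centers[i] : average(members)
    end
    break if new_centers == centers # converged: no center moved
    centers = new_centers
  end
  [centers, clusters]
end

# Made-up sample data: two visibly separate groups of 2D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, clusters = k_means(points, points.sample(2))
clusters.each_with_index { |members, i| puts "cluster #{i}: #{members.inspect}" }
```

The initial centers are random, so which group ends up as cluster 0 can differ from run to run.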

The following are the top ten articles obtained by this method that are similar to the blog post "How Support Teams Improve Product Quality":

Do you know better about this or are you smarter
Three Guidelines for Customer Feedback
Get the information you need from your customers
Product delivery is just the beginning
What do you think feature extensions look like
Know your user base
Convert customers with the right message and the right time
Communicate with your customers
Does your app have message push arrangements?
Have you tried communicating with your customers

The results speak for themselves.

We used fewer than 40 lines of code and a brief introduction to one simple algorithm to implement this idea. Yet if you read academic papers, you would never guess how simple it can be. The following is the abstract of a paper introducing the K-means algorithm (I don’t know who first proposed the algorithm, but this is where the term "K-means" first appeared).

If you like to express ideas in mathematical notation, academic papers are undoubtedly useful. However, there are plenty of high-quality resources, more practical and approachable, that can stand in for these complicated formulas:

Wikipedia (e.g. latent semantic indexing, cluster analysis)
Source code of open source machine learning libraries (e.g. SciPy’s K-Means, scikit-learn’s DBSCAN)
Books written from a programmer’s perspective (e.g. Programming Collective Intelligence, Hacking Machine Learning)
Khan Academy

Give it a try

How would you recommend tags in your project management tool? How would you design your customer support tool? How are users grouped in a social network? All of these can be tackled with simple code and simple algorithms, and they make good practice. So if you think a problem in your project can be solved with machine learning, why hesitate?

Machine learning is actually simpler than you think!

Original link: Intercom Translation: Bole Online - zhibinzeng
Translation link: http://blog.jobbole.com/53546/
