Detailed explanation of the principle of t-SNE algorithm and Python code implementation


T-distributed stochastic neighbor embedding (t-SNE) is an unsupervised machine learning algorithm for visualization. It is a nonlinear dimensionality reduction technique: it models the similarity between data points as conditional probabilities and then tries to minimize the difference between these conditional probabilities (similarities) in the high-dimensional and low-dimensional spaces, so that the low-dimensional embedding represents the data points as faithfully as possible.

t-SNE is therefore well suited to embedding high-dimensional data into a two- or three-dimensional space for visualization. Note that in the low-dimensional space t-SNE measures the similarity between two points with a heavy-tailed Student-t distribution rather than a Gaussian, which helps relieve the crowding problem and eases optimization, and it also makes the embedding less sensitive to outliers.
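To make the heavy-tail point concrete, here is a tiny illustrative sketch (not from the original article; the function names and example distances are chosen purely for illustration) comparing the Gaussian kernel used for high-dimensional similarities with the Student-t kernel (one degree of freedom) used in the low-dimensional map:

import numpy as np

def gaussian_similarity(distance, sigma=1.0):
    # High-dimensional space: Gaussian kernel with bandwidth sigma
    return np.exp(-distance ** 2 / (2 * sigma ** 2))

def student_t_similarity(distance):
    # Low-dimensional space: Student-t kernel with one degree of freedom
    return 1.0 / (1.0 + distance ** 2)

for dist in [0.5, 2.0, 4.0]:
    print(dist, gaussian_similarity(dist), student_t_similarity(dist))

At larger distances the Student-t kernel decays much more slowly than the Gaussian, so moderately dissimilar points are allowed to sit farther apart in the map, which is what relieves the crowding problem mentioned above.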

t-SNE algorithm steps

1. Find the pairwise similarities between neighboring points in the high-dimensional space.

2. Based on these pairwise similarities, map each point in the high-dimensional space to a point in a low-dimensional map.

3. Use gradient descent based on Kullback-Leibler divergence (KL divergence) to find a low-dimensional data representation that minimizes the mismatch between conditional probability distributions.

4. Use a Student-t distribution to calculate the similarity between two points in the low-dimensional space. (A minimal sketch of these steps is shown below.)
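The following is a minimal NumPy sketch of these four steps, added here for illustration. It uses a single fixed Gaussian bandwidth sigma instead of the per-point perplexity search, and plain gradient descent without momentum or early exaggeration, so it is a teaching sketch rather than a faithful reimplementation; the actual runs below rely on scikit-learn's TSNE.

import numpy as np

def pairwise_sq_dists(X):
    # Squared Euclidean distances between all pairs of rows of X
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)

def high_dim_affinities(X, sigma=1.0):
    # Step 1: Gaussian similarities in the high-dimensional space,
    # normalized and symmetrized into a joint distribution P
    D = pairwise_sq_dists(X)
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)   # conditional p_{j|i}
    P = (P + P.T) / (2 * X.shape[0])       # joint p_{ij}
    return np.maximum(P, 1e-12)

def tsne_sketch(X, n_iter=500, learning_rate=100.0, seed=0):
    rng = np.random.default_rng(seed)
    P = high_dim_affinities(X)
    # Step 2: start from a small random 2-D map
    Y = rng.normal(scale=1e-4, size=(X.shape[0], 2))
    for _ in range(n_iter):
        # Step 4: Student-t similarities in the low-dimensional map
        num = 1.0 / (1.0 + pairwise_sq_dists(Y))
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)
        # Step 3: gradient of KL(P || Q) with respect to the map points Y
        W = (P - Q) * num
        grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
        Y = Y - learning_rate * grad
    return Y

Scikit-learn's TSNE, used below, adds the per-point perplexity calibration, early exaggeration and momentum that this sketch deliberately omits.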

Python code to implement t-SNE on the MNIST data set

Import modules

# Importing the necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn  # used below for the FacetGrid scatter plot
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

Read data

# Reading the data using pandas
df = pd.read_csv('mnist_train.csv')

# print the first five rows of df
print(df.head())

# save the labels into a variable called labels
labels = df['label']

# drop the label column and store the pixel data in a variable called data
data = df.drop("label", axis=1)
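This assumes the usual CSV layout of this MNIST dump: a label column followed by 784 pixel columns (28 x 28). If you want to verify that assumption, a quick check like the following (not part of the original code) works:

# Expect 785 columns in df: 1 label column + 28*28 = 784 pixel values
print(df.shape)
print(data.shape, labels.shape)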

Data preprocessing

# Data-preprocessing: Standardizing the data
from sklearn.preprocessing import StandardScaler

standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)

Apply t-SNE and plot the result

# t-SNE
# Picking the top 1000 points, as t-SNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = labels[0:1000]

# configuring the parameters:
# number of components = 2
# default perplexity = 30
# default learning rate = 200
# default maximum number of iterations for the optimization = 1000
model = TSNE(n_components=2, random_state=0)

tsne_data = model.fit_transform(data_1000)

# creating a new data frame which
# helps us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dim_1", "Dim_2", "label"))

# Plotting the result of t-SNE
sn.FacetGrid(tsne_df, hue="label", height=6).map(
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()

plt.show()
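If the clusters look overly fragmented or overly merged, perplexity (roughly, the effective number of neighbours each point considers) is the first parameter worth varying. The snippet below is an optional variation, not part of the original walkthrough; the perplexity value of 50 and the PCA initialization are just illustrative choices:

# Optional: re-run t-SNE with a different perplexity and PCA initialization
model_p50 = TSNE(n_components=2, perplexity=50, init='pca', random_state=0)
tsne_data_p50 = model_p50.fit_transform(data_1000)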

