T-distributed stochastic neighbor embedding (t-SNE) is an unsupervised machine learning algorithm for visualization. It uses nonlinear dimensionality reduction technology and based on the relationship between data points and features. Similarity attempts to minimize the difference between these conditional probabilities (or similarities) in high- and low-dimensional spaces to perfectly represent the data points in the low-dimensional space.
Therefore, t-SNE is good at embedding high-dimensional data in a two-dimensional or three-dimensional low-dimensional space for visualization. It should be noted that t-SNE uses a heavy-tailed distribution to calculate the similarity between two points in a low-dimensional space instead of a Gaussian distribution, which helps solve crowding and optimization problems. And outliers do not affect t-SNE.
#1. Find the pairwise similarity between adjacent points in high-dimensional space.
2. Based on the pairwise similarity of the points in the high-dimensional space, map each point in the high-dimensional space to a low-dimensional map.
3. Use gradient descent based on Kullback-Leibler divergence (KL divergence) to find a low-dimensional data representation that minimizes the mismatch between conditional probability distributions.
4. Use Student-t distribution to calculate the similarity between two points in low-dimensional space.
Import module
# Importing Necessary Modules. import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.manifold import TSNE from sklearn.preprocessing import StandardScaler
Read data
# Reading the data using pandas df = pd.read_csv('mnist_train.csv') # print first five rows of df print(df.head(4)) # save the labels into a variable l. l = df['label'] # Drop the label feature and store the pixel data in d. d = df.drop("label", axis = 1)
Data pre- Processing
# Data-preprocessing: Standardizing the data from sklearn.preprocessing import StandardScaler standardized_data = StandardScaler().fit_transform(data) print(standardized_data.shape)
Output
# TSNE # Picking the top 1000 points as TSNE # takes a lot of time for 15K points data_1000 = standardized_data[0:1000, :] labels_1000 = labels[0:1000] model = TSNE(n_components = 2, random_state = 0) # configuring the parameters # the number of components = 2 # default perplexity = 30 # default learning rate = 200 # default Maximum number of iterations # for the optimization = 1000 tsne_data = model.fit_transform(data_1000) # creating a new data frame which # help us in plotting the result data tsne_data = np.vstack((tsne_data.T, labels_1000)).T tsne_df = pd.DataFrame(data = tsne_data, columns =("Dim_1", "Dim_2", "label")) # Plotting the result of tsne sn.FacetGrid(tsne_df, hue ="label", size = 6).map( plt.scatter, 'Dim_1', 'Dim_2').add_legend() plt.show()
The above is the detailed content of Detailed explanation of the principle of t-SNE algorithm and Python code implementation. For more information, please follow other related articles on the PHP Chinese website!