


Numerical distances in machine learning: distances between points in space
This article is reproduced from the WeChat public account "Living in the Information Age". To reprint it, please contact that account.
In machine learning, a basic question is how to judge the difference between two samples, so that we can evaluate their similarity and, from that, their category relationship. The measure used for this judgment is the distance between the two samples in feature space.
There are many measurement methods suited to different data characteristics. Generally speaking, for two data samples x and y we define a function d(x, y). For d(x, y) to be called a distance between the two samples, it must satisfy the following basic properties:
- Non-negativity: d(x, y) ≥ 0
- Identity: d(x, y) = 0 ⇔ x = y
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)
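As a quick sanity check, here is a minimal NumPy sketch (the three sample points are arbitrary) verifying all four properties for the Euclidean distance:

```python
import numpy as np

def d(x, y):
    return np.linalg.norm(x - y)  # Euclidean distance

x, y, z = np.array([1.0, 2.0]), np.array([4.0, 6.0]), np.array([0.0, -1.0])

assert d(x, y) >= 0                    # non-negativity
assert d(x, x) == 0                    # identity
assert d(x, y) == d(y, x)              # symmetry
assert d(x, y) <= d(x, z) + d(z, y)    # triangle inequality
```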
Generally speaking, common distance measures fall into several groups: distances between points in space, distances between strings, similarity between sets, and distances between variable or concept distributions.
Today we begin with the most commonly used group: distances between points in space.
These include the following types:
1. Euclidean Distance
There is no doubt that Euclidean distance is the distance people are most familiar with: the straight-line distance between two points. Anyone who has studied junior-high mathematics knows how to compute the distance between two points in the two-dimensional Cartesian plane.
The calculation formula in two dimensions is

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

and more generally, for two n-dimensional points,

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
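A minimal NumPy sketch (the two points are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Square the coordinate differences, sum them, and take the square root
d = np.sqrt(np.sum((x - y) ** 2))
assert np.isclose(d, np.linalg.norm(x - y))  # equivalent shortcut
print(d)  # sqrt(9 + 16 + 25) = 7.07...
```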
2. Manhattan Distance
Manhattan distance is also called taxicab distance. The concept comes from the grid of horizontal and vertical blocks in Manhattan, New York. In such a neighborhood, the straight-line distance between two points is of little use to a taxi driver, because a taxi cannot fly over buildings. Instead, the actual driving distance is obtained by adding the east-west and north-south separations of the two points:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

3. Chebyshev Distance
Chebyshev distance is defined as the maximum absolute difference between the coordinates of two points:

$$d(x, y) = \max_i |x_i - y_i|$$

4. Minkowski Distance
The Minkowski distance is not a single distance but a family of distances that unifies the Manhattan, Euclidean, and Chebyshev distances. For two n-dimensional points, the Minkowski distance is:

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
When p = 1,

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

which is the Manhattan distance.
When p = 2,

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

which is the Euclidean distance.
When p → ∞,

$$d(x, y) = \max_i |x_i - y_i|$$

which is the Chebyshev distance. The sketch below illustrates these special cases.
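A minimal sketch, assuming SciPy is available, confirming that the Minkowski distance reproduces the Manhattan, Euclidean, and Chebyshev distances at p = 1, p = 2, and large p (the vectors are illustrative):

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean, minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# p = 1 recovers the Manhattan (city-block) distance
assert np.isclose(minkowski(x, y, p=1), cityblock(x, y))   # 3 + 4 + 0 = 7
# p = 2 recovers the Euclidean distance
assert np.isclose(minkowski(x, y, p=2), euclidean(x, y))   # sqrt(9 + 16) = 5
# large p approaches the Chebyshev distance
assert np.isclose(minkowski(x, y, p=100), chebyshev(x, y), atol=1e-2)  # max = 4
```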
5. Standardized Euclidean Distance
Euclidean distance measures the straight-line distance between two points, but in some cases it is distorted by differing units: a difference of 5 mm in height and a difference of 5 kg in weight are perceived completely differently. For example, suppose we want to cluster three models whose attributes are as follows:
A: 65,000,000 mg (i.e., 65 kg), 1.74 m
B: 60,000,000 mg (i.e., 60 kg), 1.70 m
C: 65,000,000 mg (i.e., 65 kg), 1.40 m
By our normal understanding, A and B have similar figures and should be grouped into the same category. But when we actually compute with the units above, the difference between A and B comes out far greater than the difference between A and C, because the inconsistent measurement units inflate the numerical differences in the weight dimension. If the same data is expressed in different units:
A: 65 kg, 174 cm
B: 60 kg, 170 cm
C: 65 kg, 140 cm
then we get the result we expect: A and B fall into the same category. To avoid such discrepancies caused by different units of measurement, we introduce the standardized Euclidean distance, in which every component is first standardized so that all components have equal mean and variance.
Suppose the sample set X has mean m and standard deviation s. The "standardized variable" of X is then

$$X^* = \frac{X - m}{s}$$

and the standardized Euclidean distance between two n-dimensional points x and y is

$$d(x, y) = \sqrt{\sum_{i=1}^{n} \left( \frac{x_i - y_i}{s_i} \right)^2}$$

where s_i is the standard deviation of the i-th component over the sample.
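A minimal NumPy sketch of the idea, using the three models above (SciPy's `seuclidean` offers the same computation):

```python
import numpy as np

# Rows: models A, B, C; columns: weight (kg), height (cm)
X = np.array([[65.0, 174.0],
              [60.0, 170.0],
              [65.0, 140.0]])

s = X.std(axis=0)  # per-feature standard deviation over the sample

def std_euclidean(x, y, s):
    # Scale each coordinate difference by that feature's standard deviation
    return np.sqrt(np.sum(((x - y) / s) ** 2))

print(std_euclidean(X[0], X[1], s))  # A-B: ~2.14
print(std_euclidean(X[0], X[2], s))  # A-C: ~2.24, larger, matching intuition

# The result no longer depends on the units: convert weight from kg to mg
X_mg = X * np.array([1e6, 1.0])
assert np.isclose(std_euclidean(X_mg[0], X_mg[1], X_mg.std(axis=0)),
                  std_euclidean(X[0], X[1], s))
```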
6. Lance and Williams Distance
The Lance-Williams distance, also called the Canberra distance, is a weighted form of the Manhattan distance in which each coordinate difference is normalized by the sum of the magnitudes of the coordinates:

$$d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$$

It is insensitive to the overall scale of the data, but very sensitive to values close to zero.
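A minimal sketch, assuming SciPy is available (the vectors are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import canberra

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 3.0])

# sum_i |x_i - y_i| / (|x_i| + |y_i|), term by term: 1/3 + 1/3 + 0
manual = np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y)))
assert np.isclose(manual, canberra(x, y))
print(manual)  # 0.666...
```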
7. Mahalanobis Distance
After standardizing the values, are all problems solved? Not necessarily. Consider a one-dimensional example with two classes: one has mean 0 and variance 0.1, the other has mean 5 and variance 5. Which class should a point with value 2 belong to? Intuitively it must be the second class, because the first class, with its tiny variance, is extremely unlikely to produce a value as large as 2. Yet judged purely by the distance to the class means (|2 - 0| = 2 versus |2 - 5| = 3), the point would be assigned to the first class.
So in a dimension with small variance, even a small difference can signal an outlier. Imagine two points A and B that are equally far from the origin, while the whole sample is distributed along the horizontal axis: B, lying along that axis, is plausibly an ordinary sample point, whereas A, lying off the axis, is more likely an outlier.
Problems also arise when the dimensions are not independently and identically distributed. If A and B are again equidistant from the origin but the sample is mainly distributed along the line f(x) = x, then A, lying off that line, again looks like an outlier.
In such cases the standardized Euclidean distance also runs into trouble, so we need to introduce the Mahalanobis distance.
The Mahalanobis distance first rotates the variables onto the principal components so that the dimensions become mutually independent, and then scales them so that every dimension has equal variance. The principal components are the directions of the eigenvectors of the covariance matrix, so it suffices to rotate onto the eigenvector directions and then divide each axis by the square root of the corresponding eigenvalue (a whitening transformation). After this transformation, the outliers in the examples above are cleanly separated.
The Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis and represents the covariance distance of the data. It is an effective way to compute the similarity of two unknown sample sets.
For a multivariate vector x with mean μ and covariance matrix Σ, the Mahalanobis distance of the single data point is:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

To measure the degree of difference between two random vectors x and y that follow the same distribution with covariance matrix Σ, the Mahalanobis distance between the data points x and y is:

$$D(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$$
If the covariance matrix is the identity matrix, then the Mahalanobis distance is simplified to the Euclidean distance. If the covariance matrix is a diagonal matrix, then the Mahalanobis distance becomes the standardized Euclidean distance.
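A minimal NumPy sketch on an assumed toy dataset of correlated two-dimensional samples, reproducing the outlier example above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy data: 500 strongly correlated 2-D samples
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 2.8], [2.8, 3.0]], size=500)

mu = X.mean(axis=0)
VI = np.linalg.inv(np.cov(X.T))  # inverse of the sample covariance matrix

def mahalanobis(x, y, VI):
    d = x - y
    return np.sqrt(d @ VI @ d)

a = np.array([2.0, -2.0])  # off the main axis of the point cloud
b = np.array([2.0, 2.0])   # along the main axis

# a and b are (essentially) equally far from the mean in Euclidean terms,
# but the Mahalanobis distance flags a as the likely outlier
print(mahalanobis(a, mu, VI))  # large (roughly 6)
print(mahalanobis(b, mu, VI))  # small (roughly 1)
```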
8. Cosine Distance
As the name suggests, cosine distance derives from the cosine of the included angle in geometry and measures the difference in direction between two vectors rather than their distance or length. When the cosine is 0, the two vectors are orthogonal and the included angle is 90 degrees; the smaller the angle, the closer the cosine is to 1 and the more closely the directions agree.
In n-dimensional space, the cosine similarity is

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

and the cosine distance is defined as 1 - cos θ.
It is worth pointing out that the cosine distance does not satisfy the triangle inequality.
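A minimal NumPy sketch; SciPy's `cosine` returns the same 1 - cos θ quantity (the vectors are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cosine

x = np.array([1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0])

cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
cos_dist = 1.0 - cos_sim                   # cosine distance = 1 - cos(theta)
assert np.isclose(cos_dist, cosine(x, y))  # SciPy computes the same quantity
print(cos_sim, cos_dist)                   # 0.5 0.5
```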
9. Geodesic Distance
Geodesic distance originally refers to the shortest path between two points on the surface of a sphere. When the feature space is a flat plane, the geodesic distance reduces to the Euclidean distance. In non-Euclidean geometry, the shortest line between two points on a sphere is the great-circle arc connecting them, and the sides of triangles and polygons on a sphere are likewise made up of such arcs.
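For points on the Earth's surface (treated as a sphere) given by latitude and longitude, the geodesic distance is the great-circle distance, commonly computed with the haversine formula. A minimal sketch, with approximate, illustrative coordinates:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle (geodesic) distance on a sphere, inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2.0 * radius_km * np.arcsin(np.sqrt(a))

# Approximate coordinates of Beijing and Shanghai (illustrative values)
print(haversine(39.9, 116.4, 31.2, 121.5))  # about 1070 km
```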
10. Bray Curtis Distance
Bray-Curtis distance is used mainly in botany, ecology, and environmental science to quantify the difference between samples. The formula is:

$$d(x, y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$$

For non-negative data its value lies between 0 and 1: 0 means the two samples are identical, and 1 means they have nothing in common.
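A minimal sketch, assuming SciPy is available (the counts are illustrative species abundances):

```python
import numpy as np
from scipy.spatial.distance import braycurtis

# Illustrative species counts observed at two sites
x = np.array([6.0, 7.0, 4.0])
y = np.array([10.0, 0.0, 6.0])

manual = np.sum(np.abs(x - y)) / np.sum(x + y)  # (4 + 7 + 2) / 33
assert np.isclose(manual, braycurtis(x, y))
print(manual)  # 0.3939...
```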