Translator | Li Rui
Reviewer | Sun Shujuan
Optimal transport originated in economics and has since developed into a general tool for allocating resources efficiently. Its origins trace back to 1781, when the French scientist Gaspard Monge studied how to most efficiently move earth to build fortifications for Napoleon's army. In general, optimal transport is the problem of moving all resources (say, iron ore) from a set of origins (mines) to a set of destinations (steel plants) while minimizing the total distance the resources must travel. Mathematically, the goal is to find a function that maps each origin to a destination while minimizing the total distance between the origins and their corresponding destinations. Despite its innocuous description, progress on this original version of the problem, known as Monge's formulation, stalled for nearly 200 years.
In the 1940s, the Soviet mathematician Leonid Kantorovich adapted the problem into its modern form, now known as the Monge-Kantorovich formulation, which was the first step toward a solution. The key novelty was to allow iron ore from the same mine to be supplied to different steel plants. For example, 60% of the ore from one mine might go to one steel plant, while the remaining 40% goes to another. Mathematically, this is no longer a function, since the same origin may now map to several destinations. Instead, it is known as a coupling between the origin distribution and the destination distribution, as shown in the figure below: selecting a mine from the blue distribution (origins) and moving vertically along the figure shows where its iron ore is sent in the distribution of steel plants (destinations).
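To make the coupling concrete, here is a minimal sketch of the Monge-Kantorovich problem written as a linear program in Python with SciPy. The mine and plant locations, supplies, and demands are made-up numbers for illustration, not from the article.

```python
# A minimal sketch of the Monge-Kantorovich problem as a linear program.
# All locations, supplies, and demands below are illustrative.
import numpy as np
from scipy.optimize import linprog

mines = np.array([0.0, 1.0, 3.0])    # locations of 3 mines (origins)
plants = np.array([2.0, 5.0])        # locations of 2 steel plants (destinations)
supply = np.array([0.5, 0.3, 0.2])   # share of total ore at each mine
demand = np.array([0.6, 0.4])        # share of total ore each plant needs

# Cost of moving one unit of ore from mine i to plant j: the distance between them.
C = np.abs(mines[:, None] - plants[None, :])
n, m = C.shape

# Unknowns: the coupling matrix P (n x m), flattened row-major.
# Constraints: row i of P sums to supply[i]; column j sums to demand[j].
A_eq, b_eq = [], []
for i in range(n):                   # row-sum (supply) constraints
    row = np.zeros(n * m)
    row[i * m:(i + 1) * m] = 1.0
    A_eq.append(row); b_eq.append(supply[i])
for j in range(m):                   # column-sum (demand) constraints
    col = np.zeros(n * m)
    col[j::m] = 1.0
    A_eq.append(col); b_eq.append(demand[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (n * m))
P = res.x.reshape(n, m)   # the optimal coupling
print(P)                  # at least one mine splits its ore between both plants
print(res.fun)            # total transport cost (2.3 for these numbers)
```

Note that, unlike in Monge's formulation, a row of `P` can have two nonzero entries: the same mine supplies both plants, which is exactly the 60/40-style split described above.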
As part of this new formulation, Kantorovich introduced an important concept called the Wasserstein distance. Analogous to the distance between two points on a map, the Wasserstein distance (also known as the earth mover's distance, after the original scenario that inspired it) measures the distance between two distributions, such as the blue and magenta distributions here. If all the iron mines are far from all the steel plants, then the Wasserstein distance between the distribution of mine locations and the distribution of plant locations will be large. Even with these improvements, it remained unclear whether a best way to transport the ore actually exists, let alone what it might be. Finally, in the 1990s, the theory began to develop rapidly, as advances in mathematical analysis and optimization led to partial solutions of the problem. In the 21st century, optimal transport began to spread to other fields, such as particle physics, fluid dynamics, and even statistics and machine learning.
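For one-dimensional distributions like these, the Wasserstein distance can be computed directly. The sketch below uses SciPy's `wasserstein_distance` on the same made-up mine/plant numbers as the linear-program sketch above; its value agrees with the LP's optimal cost.

```python
# A sketch of the Wasserstein (earth mover's) distance between two weighted
# 1-D distributions; same made-up numbers as the linear-program sketch above.
from scipy.stats import wasserstein_distance

mine_locations = [0.0, 1.0, 3.0]
plant_locations = [2.0, 5.0]
mine_weights = [0.5, 0.3, 0.2]   # share of total ore at each mine
plant_weights = [0.6, 0.4]       # share of total ore each plant needs

d = wasserstein_distance(mine_locations, plant_locations,
                         u_weights=mine_weights, v_weights=plant_weights)
print(d)   # 2.3 -- matches the optimal cost of the linear program
```

Moving the plants farther from the mines makes this number grow, which is exactly the behavior described above.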
With this explosion of new theory, optimal transport has become central to many new statistical and artificial-intelligence algorithms over the past two decades. In almost every statistical algorithm, the data are modeled, explicitly or implicitly, as having some underlying probability distribution. For example, if data on individual income are collected in several countries, each country yields a probability distribution over its population's income. To compare two countries by the income distributions of their populations, we need a way to measure the gap between two distributions. This is exactly why optimal transport (in particular, the Wasserstein distance) is so useful in data science. The Wasserstein distance is not, however, the only measure of the distance between two probability distributions. In fact, owing to their connections to physics and information theory, two alternatives, the L-2 distance and the Kullback-Leibler (KL) divergence, have historically been more common. The main advantage of the Wasserstein distance over these alternatives is that it takes into account both the values and their probabilities when computing the distance, whereas the L-2 distance and KL divergence take only the probabilities into account. The image below shows an artificial dataset of incomes for three fictional countries.
In this case, since the distributions do not overlap, the L-2 distance (or KL divergence) between the blue and magenta distributions will be roughly the same as the L-2 distance between the blue and green distributions. The Wasserstein distance between the blue and magenta distributions, on the other hand, will be much smaller than the Wasserstein distance between the blue and green distributions, because there is a significant difference between the values themselves (the horizontal separation). This property makes the Wasserstein distance ideal for quantifying differences between distributions, and in particular between datasets.
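This effect is easy to check numerically. The sketch below builds three made-up, non-overlapping "income" histograms on a shared grid: the L-2 distance and KL divergence come out the same for both pairs, while the Wasserstein distance reflects the horizontal separation.

```python
# Contrast the Wasserstein distance with the L-2 distance and KL divergence
# on non-overlapping histograms (synthetic "income" data on a shared grid).
import numpy as np
from scipy.stats import wasserstein_distance, entropy

grid = np.arange(100.0)   # shared income grid, e.g. in thousands of dollars

def block(lo, hi):
    """Uniform histogram on [lo, hi) over the shared grid."""
    p = ((grid >= lo) & (grid < hi)).astype(float)
    return p / p.sum()

blue, magenta, green = block(15, 25), block(30, 40), block(75, 85)

# With disjoint supports, L-2 cannot tell the two pairs apart:
print(np.linalg.norm(blue - magenta),   # ~0.447
      np.linalg.norm(blue - green))     # ~0.447 (identical)

# KL divergence is likewise blind to *how far apart* the supports are
# (a tiny epsilon avoids log(0); both values come out the same).
eps = 1e-12
print(entropy(blue, magenta + eps), entropy(blue, green + eps))

# The Wasserstein distance sees the horizontal separation of the values:
print(wasserstein_distance(grid, grid, u_weights=blue, v_weights=magenta),  # 15.0
      wasserstein_distance(grid, grid, u_weights=blue, v_weights=green))    # 60.0
```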
With massive amounts of data collected every day and machine learning becoming more common across industries, data scientists must be increasingly careful not to let their analyses and algorithms perpetuate biases already present in the data. For example, if a home-mortgage-approval dataset contains information about applicants' race, but minorities were discriminated against during collection because of the methods used or because of unconscious bias, then a model trained on that data will reflect the underlying bias.
Optimal transport can help mitigate this bias and improve fairness in two ways. The first and simplest is to use the Wasserstein distance to determine whether a dataset contains potential bias. For example, one can estimate the Wasserstein distance between the distribution of loan amounts approved for women and the distribution of loan amounts approved for men; if this distance is large enough to be statistically significant, a potential bias may be suspected. This idea of testing whether two groups differ is known in statistics as a two-sample hypothesis test.
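One concrete way to run such a test is a permutation test with the Wasserstein distance as the test statistic. This is a sketch on synthetic loan amounts, not the article's data.

```python
# A permutation two-sample test using the Wasserstein distance as the
# statistic. The loan amounts are synthetic, for illustration only.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
loans_women = rng.lognormal(mean=11.0, sigma=0.4, size=500)  # approved amounts
loans_men = rng.lognormal(mean=11.3, sigma=0.4, size=500)

observed = wasserstein_distance(loans_women, loans_men)

# Under the null hypothesis the two groups are identically distributed, so
# the labels are exchangeable: shuffle them and recompute the statistic.
pooled = np.concatenate([loans_women, loans_men])
n = len(loans_women)
null_stats = []
for _ in range(1000):
    rng.shuffle(pooled)
    null_stats.append(wasserstein_distance(pooled[:n], pooled[n:]))

p_value = np.mean(np.array(null_stats) >= observed)
print(observed, p_value)   # a small p-value flags a potential bias
```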
Alternatively, optimal transport can be used to enforce fairness in a model even when the underlying dataset itself is biased. This is useful in practice, since many real-world datasets exhibit some degree of bias, and collecting unbiased data can be very expensive, time-consuming, or simply infeasible. It is therefore more practical to use the existing data, however imperfect, and try to ensure that the model mitigates the bias. This is accomplished by enforcing a constraint on the model called strong demographic parity, which forces the model's predictions to be statistically independent of any sensitive attributes. One approach maps the distribution of model predictions to a distribution of adjusted predictions that does not depend on the sensitive attributes. However, adjusting the predictions also changes the model's performance and accuracy, so there is a trade-off between model performance and how much the model relies on the sensitive attributes (i.e., fairness).
Optimal transport achieves this by changing the predictions as little as possible, preserving model performance while still ensuring that the new predictions are independent of the sensitive attributes. The distribution of the adjusted model's predictions is called the Wasserstein barycenter, which has been the subject of much research over the past decade. The Wasserstein barycenter is analogous to the mean of a set of probability distributions, in that it minimizes the total distance from itself to all of the other distributions. The image below shows three distributions (green, blue, and magenta) along with their Wasserstein barycenter (red); a one-dimensional sketch of computing such a barycenter follows.
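In one dimension the (2-)Wasserstein barycenter has a convenient closed form: its quantile function is the average of the input quantile functions. The sketch below uses this fact on three made-up prediction distributions.

```python
# A 1-D Wasserstein barycenter: average the quantile functions of the
# input distributions. The three sample sets are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
blue = rng.normal(40, 5, size=1000)     # e.g. predictions for one group
green = rng.normal(55, 8, size=1000)
magenta = rng.normal(70, 4, size=1000)

qs = np.linspace(0.0, 1.0, 201)         # shared quantile levels
quantiles = [np.quantile(x, qs) for x in (blue, green, magenta)]

# Averaging the quantile functions gives the barycenter's quantile function
# (the red distribution in the figure).
barycenter_q = np.mean(quantiles, axis=0)
print(barycenter_q[100])                # the barycenter's median, ~55 here
```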
In the figure above, suppose a model predicts a person's age and income from a dataset containing a sensitive attribute, marital status, with three possible values: single (blue), married (green), and widowed/divorced (magenta). The scatter plot shows the distribution of model predictions for each value. To adjust the predictions so that the new model is blind to a person's marital status, each of these distributions can be mapped to the red barycenter using optimal transport, as sketched below. Because all the values map to the same distribution, one can no longer infer a person's marital status from income and age, or vice versa. Mapping to the barycenter preserves the model's fidelity as much as possible.
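Here is a one-dimensional sketch of that repair step (the figure's example is bivariate; this keeps only one coordinate, say income). Each group's predictions are pushed onto the barycenter by quantile mapping, which is the optimal transport map in one dimension; the variables continue from the barycenter sketch above.

```python
# Map each group's predictions onto the shared barycenter via quantile
# mapping (the 1-D optimal transport map). Continues the previous sketch.
def map_to_barycenter(x, qs, barycenter_q):
    """Send each value to the barycenter point with the same quantile rank."""
    ranks = np.searchsorted(np.sort(x), x) / len(x)   # empirical CDF of x
    return np.interp(ranks, qs, barycenter_q)

adjusted_blue = map_to_barycenter(blue, qs, barycenter_q)
adjusted_green = map_to_barycenter(green, qs, barycenter_q)
adjusted_magenta = map_to_barycenter(magenta, qs, barycenter_q)
# All three adjusted groups now share (approximately) one distribution, so
# the predictions no longer reveal the sensitive attribute.
```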
The increasing ubiquity of data and machine learning models in business and government decision-making has raised new social and ethical questions about how to ensure these models are applied fairly. Many datasets contain some form of bias by the very nature of how they were collected, so it is important that models trained on them do not exacerbate that bias or any historical discrimination. Optimal transport is just one approach to this problem, and it has been gaining ground in recent years: there are now fast and efficient ways to compute optimal transport maps and distances, making the approach suitable for modern, large datasets. As people rely ever more on data-driven models and insights, fairness has become, and will remain, a core issue in data science, and optimal transport will play a key role in achieving it.
Original title: Optimal Transport and its Applications to Fairness, author: Terrence Alsup