Many people are unsure what the mathematics they have learned is actually useful for. R&D staff at IT companies often feel that they need to study some mathematics before moving into big data roles. But mathematics is a vast field: how much of it does data technology really require?
When data technology comes up, the first thing many people think of is mathematics, probably because numbers hold such a central place in the mathematical system, and this is natural. This article discusses the mathematical foundations of data technology.
Mathematics has three major branches: algebra, geometry and analysis. Each has split into many sub-branches as research has developed. Within this system, the mathematical foundations most closely related to big data technology fall mainly into the following categories. (For the application of these mathematical methods in big data technology, please refer to the book "Internet Big Data Processing Technology and Application", 2017, Tsinghua University Press.)
(1) Probability Theory and Mathematical Statistics
This area is very closely tied to the development of big data technology. Basic concepts such as conditional probability and independence, random variables and their distributions, multi-dimensional random variables and their distributions, analysis of variance and regression analysis, stochastic processes (especially Markov processes), parameter estimation, and Bayesian theory are all very important in big data modeling and mining. Big data is naturally high-dimensional, so designing and analyzing data models in high-dimensional space requires some grounding in multi-dimensional random variables and their distributions. Bayes' theorem is one of the foundations of classifier construction. Beyond these basics, conditional random fields (CRF), hidden Markov models, n-gram models and similar methods can be used to analyze vocabulary and text in big data analysis, and to build predictive classification models.
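As a small illustration of how Bayes' theorem underlies a classifier, the sketch below computes a posterior probability for a binary hypothesis. The function name `posterior` and the spam figures are hypothetical and chosen only for illustration; they do not come from the referenced book.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(H | E) for a binary hypothesis H using Bayes' theorem.

    prior                -- P(H)
    likelihood           -- P(E | H)
    likelihood_given_not -- P(E | not H)
    """
    # Total probability of the evidence E over both cases of H.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of messages are spam; the word "offer"
# appears in 60% of spam messages and in 5% of non-spam messages.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
print(round(p_spam_given_word, 3))
```

A naive Bayes classifier applies the same rule with one likelihood term per feature, under the assumption that features are conditionally independent given the class.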
Of course, information theory, which builds on probability theory, also plays a role in big data analysis. Information gain and mutual information, both used for feature analysis, are concepts from information theory.
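To make information gain concrete, here is a minimal sketch that scores a discrete feature against class labels. The helper names `entropy` and `information_gain` and the toy data are assumptions made for this example. For discrete variables, this quantity equals the mutual information between the feature and the label.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy data (hypothetical): one feature that separates the classes
# perfectly and one that carries no information about them.
labels = ["spam", "spam", "ham", "ham"]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

A feature selection step would typically rank features by this score and keep the highest-scoring ones.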
(2) Linear Algebra
This area is also closely tied to the development of data technology. Matrices, transposes, rank, block matrices, vectors, orthogonal matrices, vector spaces, eigenvalues and eigenvectors are all commonly used in big data modeling and analysis.
In Internet big data, the objects analyzed in many application scenarios can be abstracted as matrices: for example, a large set of Web pages and the relationships among them, Weibo users and their relationships, or the relationship between texts and vocabulary in a text collection. When Web pages and their relationships are represented by a matrix, each matrix element represents the relationship between a page a and another page b. This can be a pointing relationship: 1 means there is a hyperlink from a to b, and 0 means there is none. The famous PageRank algorithm uses this matrix to quantify the importance of pages, and its convergence can be proved.
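The link matrix described above can be turned into page scores by repeated multiplication, as in the following sketch of PageRank by power iteration. Several things here are assumptions of the example rather than statements from the article: the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the reading of `links[a][b]` as "a links to b", and the three-page graph.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    links[a][b] is 1 if page a has a hyperlink to page b, else 0.
    Returns a list of scores that sums to 1.
    """
    n = len(links)
    out_degree = [sum(row) for row in links]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[a] for a in range(n) if out_degree[a] == 0)
        new_rank = []
        for b in range(n):
            incoming = sum(
                rank[a] * links[a][b] / out_degree[a]
                for a in range(n)
                if out_degree[a] > 0
            )
            new_rank.append((1.0 - damping) / n + damping * (incoming + dangling / n))
        # Stop once the scores have stabilized.
        if sum(abs(x - y) for x, y in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical three-page Web graph.
links = [
    [0, 1, 1],  # page 0 links to pages 1 and 2
    [0, 0, 1],  # page 1 links to page 2
    [1, 0, 0],  # page 2 links to page 0
]
ranks = pagerank(links)
print([round(r, 4) for r in ranks])
```

In this toy graph page 2 receives links from both other pages and ends up with the highest score, while page 1, linked only from page 0, ends up with the lowest.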
Matrix-based operations such as matrix decomposition are a way to extract features of the objects being analyzed. Because a matrix represents a transformation or mapping, the matrices obtained from a decomposition describe new characteristics of the analyzed objects in a new space. For this reason, singular value decomposition (SVD), PCA, NMF, MF and similar methods are widely used in big data analysis.
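As a rough illustration of what a decomposition extracts, the sketch below approximates only the largest singular value of a matrix and its right singular vector, using power iteration on AᵀA. This is a simplified stand-in for a full SVD, not a replacement for a numerical library routine. The function name `top_singular` and the example matrix are assumptions. The method also assumes that the starting vector is not orthogonal to the dominant direction and that the matrix is nonzero.

```python
from math import sqrt


def top_singular(A, iters=100):
    """Approximate the largest singular value of A and its right singular vector."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / sqrt(cols)] * cols  # starting guess with unit length
    for _ in range(iters):
        # u = A v, then w = A^T u: one power-iteration step on A^T A.
        u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The singular value is the length of A v for the converged unit vector v.
    u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = sqrt(sum(x * x for x in u))
    return sigma, v


# Example matrix whose singular values are 3 and 1 by construction.
A = [
    [3.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
sigma, v = top_singular(A)
print(round(sigma, 6), [round(x, 6) for x in v])
```

The returned vector is the direction in the column space along which the data varies most, which is the sense in which a decomposition exposes "new characteristics in a new space".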
(3) Optimization Methods
Learning and training a model is how many analysis and mining models solve for their parameters. The basic problem is: given a function f: A → R, find an element a0 ∈ A such that f(a0) ≤ f(a) for all a in A (minimization), or f(a0) ≥ f(a) for all a in A (maximization). The choice of optimization method depends on the form of the function. At present, optimization methods are usually based on differentials and derivatives, for example gradient descent, hill climbing, least squares, and the conjugate gradient method.
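The minimization problem above can be illustrated with gradient descent on a least-squares objective. This is a minimal sketch: the function name `fit_line`, the learning rate, the step count and the data points are all assumptions chosen so that this small example converges, and other problems would need different settings.

```python
def fit_line(xs, ys, lr=0.05, steps=5000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))
```

Here the parameter pair (w, b) plays the role of the element a0 and the mean squared error plays the role of f.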
(4) Discrete Mathematics
The importance of discrete mathematics is self-evident: it is the foundation of every branch of computer science, and naturally it is an important foundation for data technology as well. It will not be expanded on here.
Finally, it should be said that many people believe that being weak at mathematics means they cannot do well in developing and applying data technology, but this is not the case. Think clearly about what role you play in big data development and applications, considering the different entry points for big data technology research and application. The mathematics described above is used mainly at the data mining and model layer, where this knowledge and these methods do need to be mastered.
Of course, at other layers these mathematical methods are also valuable for improving algorithms. For example, at the data acquisition layer, a probability model can be used to estimate the value of the pages a crawler collects, so that it can make better decisions. At the big data computing and storage layer, block matrix computation is used to achieve parallel computing.
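The idea behind block matrix computation can be sketched as follows: the product is assembled from products of smaller blocks, and each output block depends only on one block row and one block column, so the blocks could be computed by separate workers. The sketch runs sequentially for simplicity. The helper names and the assumption that the block size divides the side of a square matrix are choices made for this example.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def block(M, i, j, s):
    """Return the s-by-s block of M at block row i, block column j."""
    return [row[j * s:(j + 1) * s] for row in M[i * s:(i + 1) * s]]


def blocked_matmul(A, B, s):
    """Multiply square matrices A and B using s-by-s blocks (s must divide n)."""
    n = len(A)
    k = n // s
    C = [[0] * n for _ in range(n)]
    for i in range(k):
        for j in range(k):
            # Output block (i, j) is independent of every other output block,
            # so this loop body is the unit that could run in parallel.
            acc = [[0] * s for _ in range(s)]
            for t in range(k):
                acc = add(acc, matmul(block(A, i, t, s), block(B, t, j, s)))
            for r in range(s):
                C[i * s + r][j * s:(j + 1) * s] = acc[r]
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(blocked_matmul(A, B, 2) == matmul(A, B))
```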