Concept
The basic goal of simple linear regression modeling is to find the best-fitting straight line through the two-dimensional plane formed by paired X and Y values (i.e., X and Y measurements). Once the line has been found using the least squares method, various statistical tests can be performed to determine how well the line fits the observed Y values.
The linear equation (y = mx + b) has two parameters that must be estimated from the supplied X and Y data: the slope (m) and the y-intercept (b). Once these two parameters have been estimated, you can plug observed X values into the linear equation and examine the Y values it predicts.
To estimate the m and b parameters using the least squares method, you need to find the values of m and b for which, across all X values, the differences between the observed and predicted Y values are as small as possible. The difference between an observed value and its predicted value is called the error (y_i - (m x_i + b)). If you square each error and then sum the results, you get a number called the squared prediction error. Using the least squares method to determine the best fit means finding the estimates of m and b that minimize the squared prediction error.
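The quantity being minimized can be sketched in a few lines of PHP. The function name squaredPredictionError and the sample data are assumptions made for illustration; they are not part of the SimpleLinearRegression class.

```php
<?php
// Hypothetical helper, not taken from the SimpleLinearRegression class.
// Sums the squared differences between observed and predicted Y values
// for a candidate line y = m * x + b.
function squaredPredictionError(array $x, array $y, float $m, float $b): float
{
    $sum = 0.0;
    foreach ($x as $i => $xi) {
        $error = $y[$i] - ($m * $xi + $b); // observed minus predicted
        $sum += $error * $error;
    }
    return $sum;
}

// Made-up sample data, for illustration only.
$x = [1, 2, 3, 4];
$y = [2, 4, 5, 8];

// Compare two candidate lines; the smaller result is the better fit.
$errorA = squaredPredictionError($x, $y, 2.0, 0.0); // y = 2x
$errorB = squaredPredictionError($x, $y, 1.9, 0.0); // y = 1.9x
echo "y = 2x:   $errorA\n";
echo "y = 1.9x: $errorB\n";
```

For this data the line y = 1.9x gives a squared prediction error of about 0.7, compared with 1 for y = 2x, so it is the better of the two candidates.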
Two basic methods can be used to find the estimates of m and b that satisfy the least squares criterion. In the first approach, you use a numerical search process to try different values of m and b and evaluate each pair, ultimately settling on the estimates that yield the smallest squared prediction error. The second approach is to use calculus to derive equations for estimating m and b. I'm not going to get into the calculus involved in deriving these equations, but I did use these analytical equations in the SimpleLinearRegression class to find the least squares estimates of m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).
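As a minimal sketch of the second approach, the standard textbook closed-form estimates can be written as below. The function name leastSquaresEstimates and the sample data are assumptions; the getSlope() and getYIntercept() methods themselves are not shown in this article and may be organized differently.

```php
<?php
// Hypothetical standalone function using the standard closed-form
// least squares estimates:
//   m = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   b = meanY - m * meanX
// Assumes at least two distinct X values (otherwise the divisor is zero).
function leastSquaresEstimates(array $x, array $y): array
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0; // sum of cross-products of deviations
    $sxx = 0.0; // sum of squared X deviations
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) * ($xi - $meanX);
    }

    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;
    return ['m' => $m, 'b' => $b];
}

// Made-up sample data, for illustration only.
$estimates = leastSquaresEstimates([1, 2, 3, 4], [2, 4, 5, 8]);
printf("m = %.4f, b = %.4f\n", $estimates['m'], $estimates['b']);
```

For this sample the estimates come out at a slope of approximately 1.9 and an intercept of approximately 0. A numerical search over m and b should converge toward the same values, only more slowly.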
Even if you have equations that can be used to find the least squares estimates of m and b, it does not mean that plugging these parameters into the linear equation will produce a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining squared prediction error is acceptable.
You can use a statistical decision procedure to evaluate the alternative hypothesis that a straight line fits the data. The procedure is based on calculating a T statistic and using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values; one of the most important is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic tends to be large; if the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the average of a set of values can often be a useful predictor of the next observation).
To test whether the T statistic is large enough to rule out the mean of the Y values as the best predictor, you need to calculate the probability of obtaining a T statistic that large by chance. If the probability is low, the null hypothesis that the mean is the best predictor can be rejected, and you can be correspondingly more confident that a simple linear model is a good fit to the data. (See Part 1 for more information on calculating the probability of a T statistic.)
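To make the T statistic concrete, here is a minimal sketch that computes it for the slope from the correlation coefficient r, using the common textbook form t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The function name slopeTStatistic and the sample data are assumptions; the SimpleLinearRegression class from Part 1 may compute the statistic differently.

```php
<?php
// Hypothetical helper, not taken from the SimpleLinearRegression class.
// Computes the T statistic for the slope of a simple linear regression
// from the correlation coefficient r. Assumes n > 2, that both variables
// vary, and that the fit is not perfect (r^2 < 1), since a perfect fit
// would divide by zero.
function slopeTStatistic(array $x, array $y): float
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    $syy = 0.0;
    foreach ($x as $i => $xi) {
        $dx = $xi - $meanX;
        $dy = $y[$i] - $meanY;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }

    $r = $sxy / sqrt($sxx * $syy);               // correlation coefficient
    return $r * sqrt(($n - 2) / (1 - $r * $r));  // T with n - 2 degrees of freedom
}

// Made-up sample data, for illustration only.
$t = slopeTStatistic([1, 2, 3, 4], [2, 4, 5, 8]);
printf("t = %.3f with %d degrees of freedom\n", $t, 4 - 2);
```

The resulting T value and its degrees of freedom are what you would pass to a Student T probability function to decide whether the null hypothesis can be rejected.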
Let's return to the statistical decision procedure. It tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the alternative hypothesis of a linear model needs to be established through theoretical and statistical arguments.
The data research tool you will build implements the statistical decision procedure for linear models (the t-test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. Data research tools of this kind can be classed as decision support tools that help knowledge workers study patterns in small to medium-sized data sets.
From a learning perspective, simple linear regression modeling is worth studying because it is a gateway to more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay a good foundation for understanding multiple regression, factor analysis, and time series analysis.
Simple linear regression is also a versatile modeling technique. It can be used to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations linearize the data so that it can be modeled using simple linear regression. The resulting linear model is expressed as a linear formula in the transformed values.
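As an illustration of the transformation idea, the sketch below fits an exponential curve y = a * e^(k x) by regressing ln(y) on x, since ln(y) = ln(a) + k x is linear in x. The helper fitLine, the variable names, and the generated data are all assumptions made for this example.

```php
<?php
// Hypothetical helper: closed-form least squares fit of y = m * x + b.
function fitLine(array $x, array $y): array
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) * ($xi - $meanX);
    }
    $m = $sxy / $sxx;
    return ['m' => $m, 'b' => $meanY - $m * $meanX];
}

// Made-up curvilinear data generated from y = 2 * e^(0.5 * x).
$x = [1, 2, 3, 4, 5];
$y = array_map(function ($xi) {
    return 2.0 * exp(0.5 * $xi);
}, $x);

// Log-transform Y so the relationship becomes linear, then fit a line.
$logY = array_map('log', $y);
$fit = fitLine($x, $logY);

// The slope is the growth rate k; the intercept is ln(a).
$growthRate = $fit['m'];
$scale = exp($fit['b']);
printf("k = %.4f, a = %.4f\n", $growthRate, $scale);
```

Note that the fitted slope and intercept describe the transformed values, so the intercept has to be exponentiated to recover the scale factor on the original Y axis. The logarithm also requires every Y value to be positive.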
Probability function
In the previous article, I used R to find the probability values, which let me sidestep the problem of implementing the probability functions in PHP. I wasn't completely satisfied with this solution, so I started to research the question: what would it take to develop probability functions in PHP?
I started looking online for information and code. One source for both is the Probability Functions material in the book Numerical Recipes in C. I reimplemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I still wasn't satisfied with the results: it seemed to need a bit more code than some other implementations. In addition, I needed the inverse probability functions.
Fortunately, I stumbled upon John Pezzullo's Interactive Statistical Calculation. John's page on Probability Distribution Functions has all the functions I needed, implemented in JavaScript to make them easier to learn from.
I ported the Student T and Fisher F functions to PHP. I changed the API a bit to conform to Java naming conventions and embedded all the functions in a class called Distribution. A great feature of this implementation is the doCommonMath method, which is reused by all the functions in this library. The other tests that I didn't bother to implement (normal and chi-square) also use the doCommonMath method.
Another aspect of this port is also worth noting. In JavaScript, you can assign dynamically determined values to instance variables, such as:
var PiD2 = pi() / 2
PHP does not allow a function call in a property declaration, so the ported class computes pi() / 2 inline wherever it is needed.
<?php
// Distribution.php
// Copyright John Pezullo
// Released under same terms as PHP.
// PHP Port and OOfying by Paul Meagher

class Distribution {

  // Series summation shared by the distribution functions.
  function doCommonMath($q, $i, $j, $b) {
    $zz = 1;
    $z  = $zz;
    $k  = $i;
    while ($k <= $j) {
      $zz = $zz * $q * $k / ($k - $b);
      $z  = $z + $zz;
      $k  = $k + 2;
    }
    return $z;
  }

  // Two-tailed probability of a T value at least as large as |$t|
  // with $df degrees of freedom.
  function getStudentT($t, $df) {
    $t  = abs($t);
    $w  = $t / sqrt($df);
    $th = atan($w);
    if ($df == 1) {
      return 1 - $th / (pi() / 2);
    }
    $sth = sin($th);
    $cth = cos($th);
    if (($df % 2) == 1) {
      // Odd degrees of freedom
      return 1 - ($th + $sth * $cth * $this->doCommonMath($cth * $cth, 2, $df - 3, -1)) / (pi() / 2);
    } else {
      // Even degrees of freedom
      return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
    }
  }

  // Bisection search on $v in (0, 1), where $t = 1 / $v - 1, for the
  // T value whose getStudentT() probability is $p.
  function getInverseStudentT($p, $df) {
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0;
    while ($dv > 1e-6) {
      $t  = (1 / $v) - 1;
      $dv = $dv / 2;
      if ($this->getStudentT($t, $df) > $p) {
        $v = $v - $dv;
      } else {
        $v = $v + $dv;
      }
    }
    return $t;
  }

  function getFisherF($f, $n1, $n2) {
    // implemented but not shown
  }

  function getInverseFisherF($p, $n1, $n2) {
    // implemented but not shown
  }

}
?>
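As a sanity check on getStudentT, it helps to trace the two smallest cases by hand. Reading the code with th = atan(|t| / sqrt(df)), the df = 1 branch returns directly, and for df = 2 the loop in doCommonMath never runs (its start value 1 exceeds its end value df - 3 = -1), so the helper returns 1. Under that reading, the method reduces to these closed forms:

```latex
p_{df=1}(t) = 1 - \frac{2}{\pi}\arctan\lvert t\rvert ,
\qquad
p_{df=2}(t) = 1 - \sin\!\left(\arctan\frac{\lvert t\rvert}{\sqrt{2}}\right)
            = 1 - \frac{\lvert t\rvert}{\sqrt{2 + t^{2}}}
```

Plugging in the familiar two-tailed 5% critical values from a T table gives the expected result in both cases: t = 12.706 with df = 1 and t = 4.303 with df = 2 each yield p of approximately 0.05. This suggests the method returns a two-tailed probability.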