date = [ { "study_time": 1, "salary": 350, "absences": 5, "city": "San Francisco" }, { "study_time": 2, "salary": 1600, "absences": 4, "city": "London" }, { "study_time": 3, "salary": 2450, "absences": 3, "city": "Paris" }, { "study_time": 4, "salary": 5150, "absences": 5, "city": "San Francisco" }, { "study_time": 5, "salary": 5800, "absences": 4, "city": "London" }, { "study_time": 6, "salary": 6050, "absences": 3, "city": "Paris" } ]
{ "study_time": 13, "salary": ???, "absences": 5, "city": "San Francisco" }
Using Polynomial Regression the value 13 of this sequence would be: 24814
But the correct value was: 19550
Error: 5264
If I were to predict position 49 it would be: 182441
But the correct value was: 77150
Error: 105291
This was the "hidden algorithm" that produces the progression:
x = 0 absences_base = 50 salary_base = 1000 data = [] for i in range(50): if x == 0: x += 1 data.append({ "study_time": i + 1, "salary": (i * salary_base + (300 * 2 * (i + 1))) - (5 * absences_base), "absences": 5, "city": "San Francisco" }) elif x == 1: x += 1 data.append({ "study_time": i + 1, "salary": (i * salary_base + (200 * 2 * (i + 1))) - (4 * absences_base), "absences": 4, "city": "London" }) else: x = 0 data.append({ "study_time": i + 1, "salary": (i * salary_base + (100 * 2 * (i + 1))) - (3 * absences_base), "absences": 3, "city": "Paris" }) for entry in data: print(entry)
{'study_time': 1, 'salary': 350, 'absences': 5, 'city': 'San Francisco'} {'study_time': 2, 'salary': 1600, 'absences': 4, 'city': 'London'} {'study_time': 3, 'salary': 2450, 'absences': 3, 'city': 'Paris'} {'study_time': 4, 'salary': 5150, 'absences': 5, 'city': 'San Francisco'} {'study_time': 5, 'salary': 5800, 'absences': 4, 'city': 'London'} {'study_time': 6, 'salary': 6050, 'absences': 3, 'city': 'Paris'} {'study_time': 7, 'salary': 9950, 'absences': 5, 'city': 'San Francisco'} {'study_time': 8, 'salary': 10000, 'absences': 4, 'city': 'London'} {'study_time': 9, 'salary': 9650, 'absences': 3, 'city': 'Paris'} {'study_time': 10, 'salary': 14750, 'absences': 5, 'city': 'San Francisco'} {'study_time': 11, 'salary': 14200, 'absences': 4, 'city': 'London'} {'study_time': 12, 'salary': 13250, 'absences': 3, 'city': 'Paris'} {'study_time': 13, 'salary': 19550, 'absences': 5, 'city': 'San Francisco'}
Polynomial regression is a statistical technique that can be used to model and predict the relationship between two variables. However, in cases like this, where there are several variables involved (study time, salary, absences and city), polynomial regression may not be sufficient to capture all patterns in the time series.
The problem in question is a classic example of time series, where we need to predict future values based on patterns observed in the past.
This problem could be solved with machine learning
Furthermore, it can be essential to analyze all relationships between variables and test various hypotheses to discover what produces the progression. This may include:
Exploratory analysis: Use exploratory analysis techniques to better understand the nature of the time series and identify possible patterns or relationships between variables.
Statistical tests: Carry out statistical tests to check whether there is significance in the relationships observed between the variables.
Another solution would be to create an algorithm that does this with the most basic hypotheses:
This algorithm for testing "relational operations", it would be a direct machine learning (or explicit machine learning) approach. This means that the algorithm does not use advanced machine learning techniques, but rather implements rules and logical structures to learn time series patterns.
And by testing only basic hypotheses, the limitations would be:
While a machine learning model can:
Before looking for more complex solutions, it is best to make sure that a simpler solution has been adequately tested.
If we include just 3 more lines of the progression sequence, we can predict the exact value using polynomial progression
date = [ { "study_time": 1, "salary": 350, "absences": 5, "city": "San Francisco" }, { "study_time": 2, "salary": 1600, "absences": 4, "city": "London" }, { "study_time": 3, "salary": 2450, "absences": 3, "city": "Paris" }, { "study_time": 4, "salary": 5150, "absences": 5, "city": "San Francisco" }, { "study_time": 5, "salary": 5800, "absences": 4, "city": "London" }, { "study_time": 6, "salary": 6050, "absences": 3, "city": "Paris" }, {'study_time': 7, 'salary': 9950, 'absences': 5, 'city': 'San Francisco'}, {'study_time': 8, 'salary': 10000, 'absences': 4, 'city': 'London'}, {'study_time': 9, 'salary': 9650, 'absences': 3, 'city': 'Paris'} ]
So this problem can be solved with polynomial regression, as long as the data sample is sufficient
It is interesting to note that the model only needs a sample of the data up to row 9 to make accurate predictions. This suggests that there is a regular pattern in the time series that can be captured with a limited amount of data. And there really was.
import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import PolynomialFeatures from sklearn.linear_model import LinearRegression data = pd.DataFrame({ "study_time": [1, 2, 3, 4, 5, 6, 7, 8, 9], "absences": [5, 4, 3, 5, 4, 3, 5, 4, 3], "San Francisco": [0, 1, 0, 0, 1, 0, 0, 1, 0], # dummy variables "London": [0, 0, 1, 0, 0, 1, 0, 0, 1], # dummy variables "Paris": [1, 0, 0, 1, 0, 0, 1, 0, 0], # dummy variables "salary": [350, 1600, 2450, 5150, 5800, 6050, 9950, 10000, 9650] }) # Independent and dependent variables X = data[["study_time", "absences", "San Francisco", "London", "Paris"]] y = data["salary"] # Creating polynomial characteristics of degree 2 characteristics_2 = PolynomialFeatures(degree=2) x_pol_2 = characteristics_2.fit_transform(X) y_pol_2 = model2.predict(x_pol_2) # Fitting the linear regression model model2 = LinearRegression() model2.fit(x_pol_2, y) # New data provided for prediction new_data = pd.DataFrame({ "study_time": [13], "absences": [5], "San Francisco": [0], "London": [0], "Paris": [1] }) # Polynomial transformation of the new data new_data_pol_2 = characteristics_2.transform(new_data) predicted_salary = model2.predict(new_data_pol_2) print("Predicted Salary:", int(predicted_salary[0]) ) # Plot plt.subplot(1, 1, 1) plt.scatter(new_data["study_time"], predicted_salary, color='green', label='Predicted Salary') plt.scatter(data["study_time"], y, color='blue', label='Real Salary') plt.scatter(data["study_time"], y_pol_2, color='red', label='Polynomial Fit', marker='x') plt.title("Polynomial Regression - Salary and Study Time") plt.xlabel("Study Time") plt.ylabel("Salary") plt.legend() plt.show()
The above is the detailed content of Unlocking Accurate Predictions with Polynomial Regression. For more information, please follow other related articles on the PHP Chinese website!