In my previous post, I wrote a script that scrapes PCComponentes and generates a CSV with laptop data.
The idea came up while I was trying to build a Machine Learning model that predicts the price of a laptop from the components you provide. While researching, I found a public dataset that could have been used to train the model, but it had a problem: its prices dated back to 2015, which made it of little use.
For this reason, I decided to build a dataset directly from the PCComponentes website, which gives me up-to-date and reliable data. This process could also be automated in the future (at least until PCComponentes changes the structure of its website).
Let's get into it!
Before training the model, the data has to be organized and cleaned so that it is easier to read and process. For this, we will use NumPy, pandas and Matplotlib, libraries widely used in data analysis and processing.
The first thing is to import these libraries and open the generated CSV:
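import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the CSV generated by the scraper (the filename here is an assumption)
df = pd.read_csv('laptops.csv')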
Then, we delete the rows with empty or null values:
df = df.dropna()
Let's start by analyzing the different types of CPUs available. To view them, we will use the Seaborn library:
import seaborn as sns

sns.countplot(data=df, x='CPU')
Here we see that there are 207 different CPU types. Training a model on all of these values could be problematic, since many of them would be irrelevant and would only add noise that hurts performance.
Instead of removing the entire column, we will group the values into a few relevant categories:
def cpu_type_define(text):
    # Expects a lowercase CPU name; the 'low gamma' labels mark low-end parts
    text = text.split(' ')
    if text[0] == 'intel':
        if 'i' in text[-1]:
            # The last token holds the family, i.e. the part before the '-'
            if text[-1].split('-')[0] == 'i3':
                return 'low gamma intel processor'
            return text[0] + ' ' + text[1] + ' ' + text[-1].split('-')[0]
        return 'low gamma intel processor'
    elif text[0] == 'amd':
        if text[1] == 'ryzen':
            if text[2] == '3':
                return 'low gamma amd processor'
            return text[0] + ' ' + text[1] + ' ' + text[2]
        return 'low gamma amd processor'
    elif 'm' in text[0]:
        # Any other name whose first token contains an 'm'
        return 'Mac Processor'
    else:
        return 'Other Processor'

df['Cpu'] = df['Cpu'].apply(cpu_type_define)
sns.histplot(data=df, x='Cpu')
df['Cpu'].value_counts()
Resulting in: [Figure: distribution of laptops across the grouped CPU categories]
We carry out a similar process with graphics cards (GPU), reducing the number of categories to avoid noise in the data:
def gpu_type_define(text):
    if 'rtx' in text:
        # Join every digit in the name to get the model number
        num = int(''.join([char for char in text if char.isdigit()]))
        if num in (4080, 4090, 3080):
            return 'Nvidia High gamma'
        elif num in (4070, 3070, 4060, 2080):
            return 'Nvidia medium gamma'
        elif num in (3050, 3060, 4050, 2070):
            return 'Nvidia low gamma'
        else:
            return 'Other nvidia graphics card'
    elif 'radeon' in text:
        if 'rx' in text:
            return 'Amd High gamma'
        else:
            return 'Amd low Gamma'
    elif 'gpu' in text:
        return 'Apple integrated graphics'
    # Anything else keeps its original name
    return text

df['Gpu'] = df['Gpu'].apply(gpu_type_define)
sns.histplot(data=df, x='Gpu')
df['Gpu'].value_counts()
Result: [Figure: distribution of laptops across the grouped GPU categories]
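One detail of `gpu_type_define` worth checking is the digit-joining step: it concatenates every digit in the name, so any extra number in the string changes the result. The raw GPU strings are not shown here, so the example name below is made up, and `rtx_model_number` is a hypothetical helper rather than part of the original script. It is a minimal sketch of pulling out only the four-digit model number with a regular expression:

```python
import re

def all_digits(text):
    """Digit-joining approach: concatenate every digit found in the name."""
    return int(''.join(char for char in text if char.isdigit()))

def rtx_model_number(text):
    """Return the first standalone 4-digit number in the name, or None."""
    match = re.search(r'(?<!\d)(\d{4})(?!\d)', text)
    return int(match.group(1)) if match else None

name = 'nvidia geforce rtx 4060 8gb'  # made-up example string
print(all_digits(name))        # 40608
print(rtx_model_number(name))  # 4060
```

If the scraped names never contain a second number, both approaches give the same model number and the simpler one is enough.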
To simplify the storage data, we add up the capacity of all drives into a single value in GB:
def filter_ssd(text):
    # When two drives are listed (separated by '+'), sum both capacities
    two_discs = text.split('+')
    if len(two_discs) == 2:
        return int(''.join([char for char in two_discs[0] if char.isdigit()])) + int(''.join([char for char in two_discs[1] if char.isdigit()]))
    else:
        return int(''.join([char for char in text if char.isdigit()]))

# Normalize the units first: 'tb' becomes '000' (so 1tb reads as 1000 GB) and the other labels are dropped
df['SSD'] = df['SSD'].str.replace('tb', '000')
df['SSD'] = df['SSD'].str.replace('gb', '')
df['SSD'] = df['SSD'].str.replace('emmc', '')
df['SSD'] = df['SSD'].str.replace('ssd', '')
# Convert the cleaned strings into a total capacity
df['SSD'] = df['SSD'].apply(filter_ssd)
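To see what the unit replacements and the sum do to a single value, here is a small standalone sketch in plain Python. The input strings are made up (the raw column values are not shown in this post), and `total_storage_gb` is a hypothetical helper that folds both steps into one function and accepts any number of drives separated by `+`:

```python
def total_storage_gb(text):
    """Total capacity in GB for a lowercase storage string."""
    # Same substitutions as in the post: 'tb' -> '000', then drop the labels
    cleaned = text.replace('tb', '000')
    for label in ('gb', 'emmc', 'ssd'):
        cleaned = cleaned.replace(label, '')
    total = 0
    for part in cleaned.split('+'):
        digits = ''.join(char for char in part if char.isdigit())
        total += int(digits)
    return total

print(total_storage_gb('1tb ssd'))              # 1000
print(total_storage_gb('512gb ssd + 1tb ssd'))  # 1512
print(total_storage_gb('64gb emmc'))            # 64
```

Treating 1 TB as 1000 GB keeps every row in the same unit, which is what matters for the model.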
Finally, we filter the RAM values to keep only the numbers.
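A minimal sketch of that kind of filtering, assuming the RAM values are lowercase strings such as '16gb ddr5' (the real format is not shown here, so this is an assumption, and `ram_gb` is a hypothetical helper). It keeps only the number directly in front of 'gb', so that other digits in the string are ignored:

```python
import re

def ram_gb(text):
    """Return the number in front of 'gb' as an int, or None if absent."""
    match = re.search(r'(\d+)\s*gb', text)
    return int(match.group(1)) if match else None

print(ram_gb('16gb ddr5'))  # 16
print(ram_gb('8 gb'))       # 8
```

With pandas, a function like this could be applied to the column with `.apply`, in the same way as the CPU and GPU functions.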
Before training the model, the non-numeric columns have to be transformed into data that the algorithm can interpret. For this, we use ColumnTransformer and OneHotEncoder from the sklearn library.
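To illustrate what one-hot encoding produces, here is a plain-Python sketch. It is not sklearn's implementation and not the code used in the project. The category labels come from the CPU grouping earlier in the post, and `one_hot` is a hypothetical helper:

```python
def one_hot(values):
    """Encode a list of category labels as rows of 0/1 indicators."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

cpus = ['intel core i7', 'amd ryzen 5', 'intel core i7', 'Mac Processor']
categories, rows = one_hot(cpus)
print(categories)  # ['Mac Processor', 'amd ryzen 5', 'intel core i7']
for row in rows:
    print(row)
# [0, 0, 1]
# [0, 1, 0]
# [0, 0, 1]
# [1, 0, 0]
```

Each category becomes its own column, so the model never sees an artificial ordering between, say, an Intel and an AMD processor. In sklearn, a ColumnTransformer would typically apply OneHotEncoder to the categorical columns only and leave the numeric ones unchanged.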
I tested several Machine Learning algorithms to determine which one was most efficient according to the coefficient of determination (R2 Score). Here are the results:
Model | R2 Score |
---|---|
Logistic Regression | -4086280.26 |
Random Forest | 0.8025 |
ExtraTreeRegressor | 0.7531 |
GradientBoostingRegressor | 0.8025 |
XGBRegressor | 0.7556 |
The best results were obtained with Random Forest and GradientBoostingRegressor, both with an R2 Score of 0.8025.
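For reference, the coefficient of determination compares the model's squared error with that of always predicting the mean price: R2 = 1 - SS_res / SS_tot. The sketch below computes it in plain Python on made-up prices (`r_squared` is a hypothetical helper, not sklearn's function). It also shows why a score can be negative, as in the first row of the table: a model that does worse than predicting the mean ends up below 0.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

prices = [500.0, 1000.0, 1500.0]  # made-up prices

print(r_squared(prices, prices))                   # 1.0  (perfect predictions)
print(r_squared(prices, [1000.0, 1000.0, 1000.0])) # 0.0  (always the mean)
print(r_squared(prices, [1500.0, 1000.0, 500.0]))  # -3.0 (worse than the mean)
```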
To improve further, I combined these algorithms using a Voting Regressor, achieving an R2 Score of 0.8085.
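A voting regressor such as sklearn's VotingRegressor predicts by averaging the predictions of its fitted estimators. The sketch below shows that averaging step in plain Python. All numbers are made up, with the two lists standing in for the outputs of the Random Forest and Gradient Boosting models, and `average_predictions` is a hypothetical helper:

```python
def average_predictions(*prediction_lists):
    """Average several models' predictions, sample by sample."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]

true_prices = [1000.0, 1500.0]    # made-up targets
forest_preds = [1100.0, 1400.0]   # stand-in for Random Forest output
boosting_preds = [950.0, 1650.0]  # stand-in for Gradient Boosting output

combined = average_predictions(forest_preds, boosting_preds)
print(combined)  # [1025.0, 1525.0]
```

When the individual models err in different directions, as in this toy example, the average lands closer to the target than either model does on its own, which is the intuition behind the small gain reported above.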
The model trained with the Voting Regressor was the most accurate. It is now ready to be integrated into a web application, which I will explain in detail in the next post.
Link to the project