Since its release, Apache Spark, an open-source framework for processing Big Data, has become one of the most widely used technologies for processing large amounts of data in parallel across multiple nodes. It prides itself on efficiency and speed compared to similar software that came before it.
Working with this technology in Python is possible through PySpark, a Python API that lets you interact with and tap into Apache Spark's potential using the Python programming language.
In this article, you will get started with PySpark by building a machine-learning model with the Logistic Regression algorithm.
Note: Prior knowledge of Python, an IDE like VSCode, the command prompt/terminal, and familiarity with machine learning concepts are essential for a proper understanding of the concepts covered in this article.
By the end of this article, you should understand what PySpark is and be able to build, train, and evaluate a classification model with it.
According to the Apache Spark official website, PySpark lets you utilize the combined strengths of Apache Spark (simplicity, speed, scalability, versatility) and Python (rich ecosystem, mature libraries, simplicity) for “data engineering, data science, and machine learning on single-node machines or clusters.”
PySpark is the Python API for Apache Spark, which means it serves as an interface that lets code written in Python communicate with the Apache Spark engine, which is written in Scala. This way, professionals already familiar with the Python ecosystem can quickly adopt Apache Spark, and existing Python libraries remain relevant.
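To illustrate what this interface looks like in practice, here is a minimal sketch (not part of this article's dataset) that creates a SparkSession and a small DataFrame entirely from Python:

from pyspark.sql import SparkSession

# Create a local SparkSession, the entry point to Spark from Python
spark = SparkSession.builder.appName("QuickTour").getOrCreate()

# Build a tiny DataFrame from plain Python objects; Spark runs the work on the JVM
df = spark.createDataFrame([("Ada", 36), ("Grace", 45)], ["name", "age"])
df.filter(df.age > 40).show()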
In the following steps, we will build a machine-learning model using the Logistic Regression algorithm:
First, install PySpark using pip:

pip install pyspark
You can also install these additional Python libraries if you do not have them:
pip install pandas numpy
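To confirm the installation succeeded, you can print the installed PySpark version as a quick sanity check:

python -c "import pyspark; print(pyspark.__version__)"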
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, when, isnull
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import pandas as pd
# Create a SparkSession, the entry point for working with Spark
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()
# Load the dataset; header=True reads column names from the first row,
# and inferSchema=True detects each column's data type
data = spark.read.csv("data.csv", header=True, inferSchema=True)
# Display the schema
data.printSchema()

# Show the first ten rows
data.show(10)

# Count null values in each column
missing_values = data.select(
    [count(when(isnull(c), c)).alias(c) for c in data.columns]
)

# Show the result
missing_values.show()

Optionally, if you are working with a small dataset, you can convert it to a Pandas DataFrame and use Pandas to check for missing values.

# Convert to a Pandas DataFrame
pandas_df = data.toPandas()

# Use Pandas to check missing values
print(pandas_df.isna().sum())

Use VectorAssembler to combine all feature columns into a single vector column.
# Combine feature columns into a single vector column
feature_columns = [col for col in data.columns if col != "label"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Transform the data
data = assembler.transform(data)

# Select only the 'features' and 'label' columns for training
final_data = data.select("features", "label")

# Show the transformed data
final_data.show(5)
Split the data into training and test sets, then create an instance of the LogisticRegression class and fit the model on the training set.
# Split the data 70/30 into training and test sets, with a fixed seed for reproducibility
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)
# Create the Logistic Regression estimator
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Train the model
lr_model = lr.fit(train_data)
Use the trained model to generate predictions on the test set, then evaluate it using the AUC (Area Under the ROC Curve) metric.
# Generate predictions on the test data
predictions = lr_model.transform(test_data)

# Show predictions
predictions.select("features", "label", "prediction", "probability").show(5)
# Create an evaluator that computes the area under the ROC curve
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)

# Compute the AUC
auc = evaluator.evaluate(predictions)
print(f"Area Under ROC: {auc}")

The end-to-end code used in this article, assembled from the snippets above, is shown below:
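from pyspark.sql import SparkSession
from pyspark.sql.functions import count, when, isnull
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Start a Spark session
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()

# Load the dataset
data = spark.read.csv("data.csv", header=True, inferSchema=True)

# Inspect the schema and check for missing values
data.printSchema()
missing_values = data.select(
    [count(when(isnull(c), c)).alias(c) for c in data.columns]
)
missing_values.show()

# Combine feature columns into a single vector column
feature_columns = [col for col in data.columns if col != "label"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
final_data = assembler.transform(data).select("features", "label")

# Split into training and test sets
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)

# Train the logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_data)

# Generate predictions and evaluate with AUC
predictions = lr_model.transform(test_data)
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)
auc = evaluator.evaluate(predictions)
print(f"Area Under ROC: {auc}")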
We have reached the end of this article. By following the steps above, you have built a machine-learning model with PySpark.
Always ensure that your dataset is clean and free of null values before training. Lastly, make sure your features all contain numerical values, since VectorAssembler and the model expect numeric input.
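Both checks can be handled in PySpark itself. Below is a minimal sketch, assuming a hypothetical string column named "category" in your dataset, that drops rows containing nulls and encodes the strings as numbers with StringIndexer:

from pyspark.ml.feature import StringIndexer

# Drop rows that contain any null values
data = data.na.drop()

# Encode a string column into a numeric index so it can be fed to VectorAssembler
indexer = StringIndexer(inputCol="category", outputCol="category_index")
data = indexer.fit(data).transform(data)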