Currently, we live in a world where petabytes of data are generated every second. As such, analyzing and processing this data in real time becomes essential for a company looking to generate accurate business insights as more and more data is produced.
Today, we will develop a real-time data analysis based on fictitious air traffic data using Spark Structured Streaming and Apache Kafka. If you don't know these technologies, I suggest reading my introductory article about them, which also covers other concepts that will come up throughout this article. So, don't forget to check it out!
You can check out the complete project on my GitHub.
Well, imagine that you, a data engineer, work at an airline called SkyX, where data about air traffic is generated every second.
You were asked to develop a dashboard that displays real-time data from these flights, such as a ranking of the most visited cities abroad; the cities where most people leave; and the aircraft that transport the most people around the world.
This is the data that is generated with each flight:
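Each record sent to the topic can be pictured as a small JSON document carrying the departure city, the destination city, the aircraft and the number of passengers, something like the sketch below (the city and aircraft names are hypothetical, chosen only to illustrate the format):

{
    "from": "London",
    "to": "Tokyo",
    "aircraft": "SKY-03",
    "passengers": 72
}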
Following is the basic architecture of our project:
This tutorial assumes you already have PySpark installed on your machine. If you haven't already, check out the steps in the documentation itself.
As for Apache Kafka, we will use it through containerization via Docker.
And finally, we will use Python through a virtual environment.
Without further ado, create a folder called skyx and add the file docker-compose.yml inside it.
$ mkdir skyx
$ cd skyx
$ touch docker-compose.yml
Now, add the following content inside the docker-compose file:
version: '3.9'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 2181:2181

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - 29092:29092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
Done! We can now bring up our Kafka server. To do this, type the following commands in the terminal:
$ docker compose up -d
$ docker compose ps
NAME               COMMAND                  SERVICE     STATUS    PORTS
skyx-kafka-1       "/etc/confluent/dock…"   kafka       running   9092/tcp, 0.0.0.0:29092->29092/tcp
skyx-zookeeper-1   "/etc/confluent/dock…"   zookeeper   running   2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp
Note: This tutorial is using version 2.0 of Docker Compose. This is why there is no "-" between docker and compose ☺.
Now, we need to create a topic within Kafka that will store the data sent in real time by the producer. To do this, let's access Kafka inside the container:
$ docker compose exec kafka bash
And finally create the topic, called airtraffic.
$ kafka-topics --create --topic airtraffic --bootstrap-server localhost:29092
Created topic airtraffic.
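If you want to confirm that the topic really exists, you can list the topics on the broker while still inside the container:

$ kafka-topics --list --bootstrap-server localhost:29092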
To develop our producer, that is, the application that will be responsible for sending real-time air traffic data to the Kafka topic, we need to use the kafka-python library. kafka-python is a community-developed library that allows us to develop producers and consumers that integrate with Apache Kafka.
First, let's create a file called requirements.txt and add the following dependency inside it:
kafka-python
Second, we will create a virtual environment and install the dependencies in the requirements.txt file:
$ python -m venv venv
$ venv\scripts\activate
$ pip install -r requirements.txt
Done! Now our environment is ready for development.
Now let's create our producer. As mentioned, the producer will be responsible for sending air traffic data to the newly created Kafka topic.
As mentioned in the architecture, SkyX only flies between five cities around the world and has only five aircraft available. It is worth mentioning that each aircraft carries between 50 and 100 people.
Note that the data is generated randomly and sent to the topic in JSON format at a time interval of between 1 and 6 seconds.
Let's go! Create a subdirectory called src and another subdirectory called kafka. Inside the kafka directory, create a file called airtraffic_producer.py and add the following code inside it:
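The listing below is a minimal sketch of what this producer could look like with kafka-python, assuming the flight fields described earlier (from, to, aircraft and passengers); the city and aircraft names are hypothetical:

import json
import random
import time

from kafka import KafkaProducer

# Hypothetical city and aircraft names: SkyX flies between five cities
# and owns five aircraft, but the article does not name them.
CITIES = ["Sao Paulo", "London", "Tokyo", "New York", "Sydney"]
AIRCRAFTS = ["SKY-01", "SKY-02", "SKY-03", "SKY-04", "SKY-05"]

producer = KafkaProducer(
    bootstrap_servers="localhost:29092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

while True:
    # Pick two different cities for departure and arrival
    origin, destination = random.sample(CITIES, 2)
    flight = {
        "from": origin,
        "to": destination,
        "aircraft": random.choice(AIRCRAFTS),
        "passengers": random.randint(50, 100),  # each aircraft carries 50 to 100 people
    }
    producer.send("airtraffic", value=flight)
    print(flight)
    time.sleep(random.randint(1, 6))  # a new flight every 1 to 6 seconds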
Done! We have developed our producer. Run it and let it run for a while.
$ python airtraffic_producer.py
Now let's develop our consumer. This will be a very simple application: it will just display, in the terminal, the data arriving in the Kafka topic in real time.
Still inside the kafka directory, create a file called airtraffic_consumer.py and add the following code inside it:
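A minimal sketch of such a consumer with kafka-python could look like this:

import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "airtraffic",
    bootstrap_servers="localhost:29092",
    auto_offset_reset="latest",
    value_deserializer=lambda value: json.loads(value.decode("utf-8")),
)

# Print each flight as soon as it arrives in the topic
for message in consumer:
    print(message.value)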
See, I told you it was very simple. Run it and watch the data that will be displayed in real time as the producer sends data to the topic.
$ python airtraffic_consumer.py
Now we start with our data analysis. At this point, we will develop a dashboard, that is, an application that displays in real time a ranking of the cities that receive the most tourists. In other words, we will group the data by the to column and sum the passengers column. Very simple!
To do this, within the src directory, create a subdirectory called dashboards and create a file called tourists_analysis.py. Then add the following code inside it:
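A minimal sketch of this analysis with Spark Structured Streaming, assuming the same record fields used by the producer, could look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("tourists_analysis").getOrCreate()

# Assumed schema of the records sent by airtraffic_producer.py
schema = StructType([
    StructField("from", StringType()),
    StructField("to", StringType()),
    StructField("aircraft", StringType()),
    StructField("passengers", IntegerType()),
])

# Read the airtraffic topic as a stream and parse the JSON payload
flights = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:29092")
    .option("subscribe", "airtraffic")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("flight"))
    .select("flight.*")
)

# Rank the destination cities by the total number of passengers received
tourists = (
    flights.groupBy("to")
    .sum("passengers")
    .withColumnRenamed("sum(passengers)", "total_passengers")
    .orderBy(col("total_passengers").desc())
)

# Print the updated ranking on the console after each micro-batch
query = (
    tourists.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)

query.awaitTermination()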
And we can now execute our file through spark-submit. But hold on! When we are integrating PySpark with Kafka, we must run spark-submit differently: we need to specify the Apache Kafka package, matching the current version of Apache Spark, through the --packages parameter.
If this is your first time integrating Apache Spark with Apache Kafka, spark-submit may take a while to run. This is because it needs to download the necessary packages.
Make sure the producer is still running so we can see the data analysis in real time. Inside the dashboards directory, run the following command:
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 tourists_analysis.py
This analysis is very similar to the previous one. However, instead of analyzing in real time the cities that receive the most tourists, we will analyze the cities where the most people leave. To do this, create a file called leavers_analysis.py and add the following code inside it:
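A minimal sketch, reusing the same reading logic as the previous analysis and only swapping the grouping column to from:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("leavers_analysis").getOrCreate()

# Assumed schema of the records sent by airtraffic_producer.py
schema = StructType([
    StructField("from", StringType()),
    StructField("to", StringType()),
    StructField("aircraft", StringType()),
    StructField("passengers", IntegerType()),
])

flights = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:29092")
    .option("subscribe", "airtraffic")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("flight"))
    .select("flight.*")
)

# Same aggregation as before, but grouping by the departure city
leavers = (
    flights.groupBy("from")
    .sum("passengers")
    .withColumnRenamed("sum(passengers)", "total_passengers")
    .orderBy(col("total_passengers").desc())
)

leavers.writeStream.outputMode("complete").format("console").start().awaitTermination()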
Make sure the producer is still running so we can see the data analysis in real time. Inside the dashboards directory, run the following command:
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 leavers_analysis.py
This analysis is much simpler than the previous ones. Let's analyze in real time the aircraft that transport the most passengers between cities around the world. Create a file called aircrafts_analysis.py and add the following code inside it:
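A minimal sketch, grouping by the aircraft column this time:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("aircrafts_analysis").getOrCreate()

# Assumed schema of the records sent by airtraffic_producer.py
schema = StructType([
    StructField("from", StringType()),
    StructField("to", StringType()),
    StructField("aircraft", StringType()),
    StructField("passengers", IntegerType()),
])

flights = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:29092")
    .option("subscribe", "airtraffic")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("flight"))
    .select("flight.*")
)

# Rank the aircraft by the total number of passengers transported
aircrafts = (
    flights.groupBy("aircraft")
    .sum("passengers")
    .withColumnRenamed("sum(passengers)", "total_passengers")
    .orderBy(col("total_passengers").desc())
)

aircrafts.writeStream.outputMode("complete").format("console").start().awaitTermination()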
Make sure the producer is still running so we can see the data analysis in real time. Inside the dashboards directory, run the following command:
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 aircrafts_analysis.py
And we finish here, guys! In this article, we developed a real-time data analysis based on fictitious air traffic data using Spark Structured Streaming and Apache Kafka.
To do this, we developed a producer that sends this data in real time to the Kafka topic, and then we developed 3 dashboards to analyze this data in real time.
I hope you liked it. See you next time!