Home Backend Development Python Tutorial How to use Python's visualization tools

How to use Python's visualization tools

May 08, 2023 pm 10:19 PM
python

Exploring the Dataset

Before we discuss visualizing the data, let’s take a quick look at the dataset we’ll be working with. The data we are going to use comes from openlights. We're going to use the route data set, the airport data set, the airline data set. Among them, each row of path data corresponds to the flight path between two airports; each row of airport data corresponds to an airport in the world, and provides relevant information; each row of airline data provides of every airline.

First we read the data:

# Import the pandas library. import pandas # Read in the airports data. airports = pandas.read_csv("airports.csv", header=None, dtype=str) airports.columns = ["id", "name", "city", "country", "code", "icao", "latitude", "longitude", "altitude", "offset", "dst", "timezone"] # Read in the airlines data. airlines = pandas.read_csv("airlines.csv", header=None, dtype=str) airlines.columns = ["id", "name", "alias", "iata", "icao", "callsign", "country", "active"] # Read in the routes data. routes = pandas.read_csv("routes.csv", header=None, dtype=str) routes.columns = ["airline", "airline_id", "source", "source_id", "dest", "dest_id", "codeshare", "stops", "equipment"]
Copy after login

These data do not have the first item of the column, so we add the first item of the column by assigning the column attribute . We want to read each column as a string because doing so simplifies the subsequent step of comparing different data frames using row ids as matches. We achieve this by setting the dtype attribute value when reading data.

So before that we need to do some data cleaning work.

routes = routes[routes["airline_id"] != "//N"]

This line of command ensures that we only contain numerical data in the airline_id column.

Making a Histogram

Now that we understand the structure of the data, we can further start to draw points to continue exploring this problem. First, we will use the tool matplotlib. matplotlib is a relatively low-level plotting library in the Python stack, so it requires more commands than other tool libraries to make a good-looking curve. On the other hand, you can use matplotlib to make almost any curve because it is very flexible, and the price of flexibility is that it is very difficult to use.

We first show the route length distribution of different airlines by making a histogram. A histogram divides the lengths of all routes into different value ranges, and then counts the routes that fall within the different value ranges. From this we can know which airlines have long routes and which airlines have short routes.

In order to achieve this, we need to first calculate the length of the route. The first step is to use the distance formula. We will use the cosine semisine distance formula to calculate the distance between two points depicted by longitude and latitude. distance.

import math def haversine(lon1, lat1, lon2, lat2):     # Convert coordinates to floats.     lon1, lat1, lon2, lat2 = [float(lon1), float(lat1), float(lon2), float(lat2)]     # Convert to radians from degrees.     lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])     # Compute distance.     dlon = lon2 - lon1      dlat = lat2 - lat1      a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2     c = 2 * math.asin(math.sqrt(a))      km = 6367 * c     return km
Copy after login

We can then use a function to calculate the one-way distance between the origin airport and the destination airport. We need to get the source_id and dest_id corresponding to the airport data frame from the route data frame, and then match them with the id column of the airport data set. Then we only need to calculate. This function is like this:

def calc_dist(row):     dist = 0     try:         # Match source and destination to get coordinates.         source = airports[airports["id"] == row["source_id"]].iloc[0]         dest = airports[airports["id"] == row["dest_id"]].iloc[0]         # Use coordinates to compute distance.         dist = haversine(dest["longitude"], dest["latitude"], source["longitude"], source["latitude"])     except (ValueError, IndexError):         pass     return dist
Copy after login

If If the source_id and dest_id columns do not have valid values, then this function will report an error. Therefore, we need to add a try/catch module to catch this invalid situation.

***, we will use pandas to apply the distance calculation function to the routes data frame. This will give us a pandas sequence containing all route lengths in kilometers.

route_lengths = routes.apply(calc_dist, axis=1)

Now that we have the sequence of route distances, we will create a histogram that will categorize the data to the corresponding range, and then count how many routes fall into each different range:

import matplotlib.pyplot as plt  %matplotlib inline   plt.hist(route_lengths, bins=20)
Copy after login

We use import matplotlib.pyplot as plt to import the matplotlib point plot function. Then we used %matplotlib inline to set matplotlib to draw points in the ipython notebook. Finally, we used plt.hist(route_lengths, bins=20) to get a histogram. As we have seen, airlines tend to operate short-haul routes over close distances rather than long-haul routes over long distances.

Using seaborn

We can use seaborn to do similar point drawing. seaborn is a high-level library for Python. Seaborn is built on the basis of matplotlib and does some types of plotting, which are often related to simple statistical work. We can use the distplot function to plot a histogram based on a core probability density expectation. A core density expectation is a curve - essentially a curve that is smoother than a histogram and easier to see the rules.

import seaborn  seaborn.distplot(route_lengths, bins=20)
Copy after login

seaborn also has a more beautiful default style. seaborn does not contain a version corresponding to each version of matplotlib, but it is indeed a good quick point drawing tool, and compared to matplotlib's default chart, it can better help us understand the meaning behind the data. If you want to do more in-depth statistical work, seaborn is also a good library.

Bar chart

柱状图也虽然很好,但是有时候我们会需要航空公司的平均路线长度。这时候我们可以使用条形图--每条航线都会有一个单独的状态条,显示航空公司航线 的平均长度。从中我们可以看出哪家是国内航空公司哪家是国际航空公司。我们可以使用pandas,一个python的数据分析库,来酸楚每个航空公司的平 均航线长度。

import numpy # Put relevant columns into a dataframe. route_length_df = pandas.DataFrame({"length": route_lengths, "id": routes["airline_id"]}) # Compute the mean route length per airline. airline_route_lengths = route_length_df.groupby("id").aggregate(numpy.mean) # Sort by length so we can make a better chart. airline_route_lengths = airline_route_lengths.sort("length", ascending=False)
Copy after login

我们首先用航线长度和航空公司的id来搭建一个新的数据框架。我们基于airline_id把route_length_df拆分成组,为每个航空 公司建立一个大体的数据框架。然后我们调用pandas的aggregate函数来获取航空公司数据框架中长度列的均值,然后把每个获取到的值重组到一个 新的数据模型里。之后把数据模型进行排序,这样就使得拥有最多航线的航空公司拍到了前面。

这样就可以使用matplotlib把结果画出来。

plt.bar(range(airline_route_lengths.shape[0]), airline_route_lengths["length"])

Matplotlib的plt.bar方法根据每个数据模型的航空公司平均航线长度(airline_route_lengths["length"])来做图。

问题是我们想看出哪家航空公司拥有的航线长度是什么并不容易。为了解决这个问题,我们需要能够看到坐标轴标签。这有点难,毕竟有这么多的航空公司。 一个能使问题变得简单的方法是使图表具有交互性,这样能实现放大跟缩小来查看轴标签。我们可以使用bokeh库来实现这个--它能便捷的实现交互性,作出 可缩放的图表。

要使用booked,我们需要先对数据进行预处理:

def lookup_name(row):     try:         # Match the row id to the id in the airlines dataframe so we can get the name.         name = airlines["name"][airlines["id"] == row["id"]].iloc[0]     except (ValueError, IndexError):         name = ""     return name # Add the index (the airline ids) as a column. airline_route_lengths["id"] = airline_route_lengths.index.copy() # Find all the airline names. airline_route_lengths["name"] = airline_route_lengths.apply(lookup_name, axis=1) # Remove duplicate values in the index. airline_route_lengths.index = range(airline_route_lengths.shape[0])
Copy after login

上面的代码会获取airline_route_lengths中每列的名字,然后添加到name列上,这里存贮着每个航空公司的名字。我们也添加到id列上以实现查找(apply函数不传index)。

***,我们重置索引序列以得到所有的特殊值。没有这一步,Bokeh 无法正常运行。

现在,我们可以继续说图表问题:

import numpy as np from bokeh.io import output_notebook from bokeh.charts import Bar, show output_notebook() p = Bar(airline_route_lengths, 'name', values='length', title="Average airline route lengths") show(p)
Copy after login

用 output_notebook 创建背景虚化,在 iPython 的 notebook 里画出图。然后,使用数据帧和特定序列制作条形图。***,显示功能会显示出该图。

这个图实际上不是一个图像--它是一个 JavaScript 插件。因此,我们在下面展示的是一幅屏幕截图,而不是真实的表格。

有了它,我们可以放大,看哪一趟航班的飞行路线最长。上面的图像让这些表格看起来挤在了一起,但放大以后,看起来就方便多了。

水平条形图

Pygal 是一个能快速制作出有吸引力表格的数据分析库。我们可以用它来按长度分解路由。首先把我们的路由分成短、中、长三个距离,并在 route_lengths 里计算出它们各占的百分比。

long_routes = len([k for k in route_lengths if k > 10000]) / len(route_lengths) medium_routes = len([k for k in route_lengths if k < 10000 and k > 2000]) / len(route_lengths) short_routes = len([k for k in route_lengths if k < 2000]) / len(route_lengths)
Copy after login

然后我们可以在 Pygal 的水平条形图里把每一个都绘成条形图:

import pygal from IPython.display import SVG chart = pygal.HorizontalBar() chart.title = 'Long, medium, and short routes' chart.add('Long', long_routes * 100) chart.add('Medium', medium_routes * 100) chart.add('Short', short_routes * 100) chart.render_to_file('routes.svg') SVG(filename='routes.svg')
Copy after login

首先,我们使用 pandasapplymethod 计算每个名称的长度。它将找到每个航空公司的名字字符的数量。然后,我们使用  matplotlib 做一个散点图来比较航空 id 的长度。当我们绘制时,我们把 theidcolumn of airlines  转换为整数类型。如果我们不这样做是行不通的,因为它需要在 x 轴上的数值。我们可以看到不少的长名字都出现在早先的 id  中。这可能意味着航空公司在成立前往往有较长的名字。

我们可以使用 seaborn 验证这个直觉。Seaborn 增强版的散点图,一个联合的点,它显示了两个变量是相关的,并有着类似地分布。

data = pandas.DataFrame({"lengths": name_lengths, "ids": airlines["id"].astype(int)})
seaborn.jointplot(x="ids", y="lengths", data=data)

画弧线

在地图上看到所有的航空路线是很酷的,幸运的是,我们可以使用 basemap  来做这件事。我们将画弧线连接所有的机场出发地和目的地。每个弧线想展示一个段都航线的路径。不幸的是,展示所有的线路又有太多的路由,这将会是一团糟。 替代,我们只现实前 3000 个路由。

# Make a base map with a mercator projection.  Draw the coastlines. m = Basemap(projection='merc',llcrnrlat=-80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180,lat_ts=20,resolution='c') m.drawcoastlines() # Iterate through the first 3000 rows. for name, row in routes[:3000].iterrows():     try:         # Get the source and dest airports.         source = airports[airports["id"] == row["source_id"]].iloc[0]         dest = airports[airports["id"] == row["dest_id"]].iloc[0]         # Don't draw overly long routes.         if abs(float(source["longitude"]) - float(dest["longitude"])) < 90:             # Draw a great circle between source and dest airports.             m.drawgreatcircle(float(source["longitude"]), float(source["latitude"]), float(dest["longitude"]), float(dest["latitude"]),linewidth=1,color='b')     except (ValueError, IndexError):         pass  # Show the map. plt.show()
Copy after login

我们将做的最终的探索是画一个机场网络图。每个机场将会是网络中的一个节点,并且如果两点之间有路由将划出节点之间的连线。如果有多重路由,将添加线的权重,以显示机场连接的更多。将使用 networkx 库来做这个功能。

首先,计算机场之间连线的权重。

# Initialize the weights dictionary. weights = {} # Keep track of keys that have been added once -- we only want edges with a weight of more than 1 to keep our network size manageable. added_keys = [] # Iterate through each route. for name, row in routes.iterrows():     # Extract the source and dest airport ids.     source = row["source_id"]     dest = row["dest_id"]      # Create a key for the weights dictionary.     # This corresponds to one edge, and has the start and end of the route.     key = "{0}_{1}".format(source, dest)     # If the key is already in weights, increment the weight.     if key in weights:         weights[key] += 1     # If the key is in added keys, initialize the key in the weights dictionary, with a weight of 2.     elif key in added_keys:         weights[key] = 2     # If the key isn't in added_keys yet, append it.     # This ensures that we aren't adding edges with a weight of 1.     else:         added_keys.append(key)
Copy after login

一旦上面的代码运行,这个权重字典就包含了每两个机场之间权重大于或等于 2 的连线。所以任何机场有两个或者更多连接的路由将会显示出来。

# Import networkx and initialize the graph. import networkx as nx graph = nx.Graph() # Keep track of added nodes in this set so we don't add twice. nodes = set() # Iterate through each edge. for k, weight in weights.items():     try:         # Split the source and dest ids and convert to integers.         source, dest = k.split("_")         source, dest = [int(source), int(dest)]         # Add the source if it isn't in the nodes.         if source not in nodes:             graph.add_node(source)         # Add the dest if it isn't in the nodes.         if dest not in nodes:             graph.add_node(dest)         # Add both source and dest to the nodes set.         # Sets don't allow duplicates.         nodes.add(source)         nodes.add(dest)          # Add the edge to the graph.         graph.add_edge(source, dest, weight=weight)     except (ValueError, IndexError):         pass pos=nx.spring_layout(graph) # Draw the nodes and edges. nx.draw_networkx_nodes(graph,pos, node_color='red', node_size=10, alpha=0.8) nx.draw_networkx_edges(graph,pos,width=1.0,alpha=1) # Show the plot. plt.show()
Copy after login

The above is the detailed content of How to use Python's visualization tools. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

PHP and Python: Different Paradigms Explained PHP and Python: Different Paradigms Explained Apr 18, 2025 am 12:26 AM

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

Choosing Between PHP and Python: A Guide Choosing Between PHP and Python: A Guide Apr 18, 2025 am 12:24 AM

PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.

Can vs code run in Windows 8 Can vs code run in Windows 8 Apr 15, 2025 pm 07:24 PM

VS Code can run on Windows 8, but the experience may not be great. First make sure the system has been updated to the latest patch, then download the VS Code installation package that matches the system architecture and install it as prompted. After installation, be aware that some extensions may be incompatible with Windows 8 and need to look for alternative extensions or use newer Windows systems in a virtual machine. Install the necessary extensions to check whether they work properly. Although VS Code is feasible on Windows 8, it is recommended to upgrade to a newer Windows system for a better development experience and security.

Is the vscode extension malicious? Is the vscode extension malicious? Apr 15, 2025 pm 07:57 PM

VS Code extensions pose malicious risks, such as hiding malicious code, exploiting vulnerabilities, and masturbating as legitimate extensions. Methods to identify malicious extensions include: checking publishers, reading comments, checking code, and installing with caution. Security measures also include: security awareness, good habits, regular updates and antivirus software.

How to run programs in terminal vscode How to run programs in terminal vscode Apr 15, 2025 pm 06:42 PM

In VS Code, you can run the program in the terminal through the following steps: Prepare the code and open the integrated terminal to ensure that the code directory is consistent with the terminal working directory. Select the run command according to the programming language (such as Python's python your_file_name.py) to check whether it runs successfully and resolve errors. Use the debugger to improve debugging efficiency.

Can visual studio code be used in python Can visual studio code be used in python Apr 15, 2025 pm 08:18 PM

VS Code can be used to write Python and provides many features that make it an ideal tool for developing Python applications. It allows users to: install Python extensions to get functions such as code completion, syntax highlighting, and debugging. Use the debugger to track code step by step, find and fix errors. Integrate Git for version control. Use code formatting tools to maintain code consistency. Use the Linting tool to spot potential problems ahead of time.

Can vscode be used for mac Can vscode be used for mac Apr 15, 2025 pm 07:36 PM

VS Code is available on Mac. It has powerful extensions, Git integration, terminal and debugger, and also offers a wealth of setup options. However, for particularly large projects or highly professional development, VS Code may have performance or functional limitations.

Can vscode run ipynb Can vscode run ipynb Apr 15, 2025 pm 07:30 PM

The key to running Jupyter Notebook in VS Code is to ensure that the Python environment is properly configured, understand that the code execution order is consistent with the cell order, and be aware of large files or external libraries that may affect performance. The code completion and debugging functions provided by VS Code can greatly improve coding efficiency and reduce errors.

See all articles