DataFrame using pandas for data processing

coldplay.xixi
Release: 2020-09-15 16:20:05

This is the second article in the pandas data processing series. This time we look at the most important data structure in pandas: DataFrame.

In the previous article, we introduced Series and noted that a Series is essentially a one-dimensional array that pandas wraps in many convenient, easy-to-use APIs. A DataFrame can be understood simply as a dict of Series, which splices the data into a two-dimensional table. It also provides many interfaces for table-level and batch data processing, which greatly reduces the difficulty of data processing.

Create DataFrame

DataFrame is a tabular data structure with two indexes, a row index and a column index, which let us easily retrieve the corresponding rows and columns. This makes it much easier to look up data during processing.

First, let’s start with the simplest one, how to create a DataFrame.

Create from dictionary

We create a dict whose keys are the column names and whose values are lists. When we pass this dict to the DataFrame constructor, it creates a DataFrame with each key as a column name and each list as that column's values. When we output it in Jupyter, the contents of the DataFrame are automatically displayed as a table.
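As a minimal sketch of the dict-based construction described above (the column names and values here are made-up sample data, not taken from the original article):

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column name,
# each list becomes the values of that column.
data = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [25, 32, 47],
}

df = pd.DataFrame(data)
print(df)
```

All lists should have the same length, since each one fills a column of the same table.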

We can also create a DataFrame from a two-dimensional numpy array. If we pass in just the numpy array without specifying column names, pandas uses integers as the column labels.
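A small sketch of this case, assuming an arbitrary 3x4 array as input:

```python
import numpy as np
import pandas as pd

# An arbitrary 3x4 array used purely for illustration.
arr = np.arange(12).reshape(3, 4)

# No column names given, so pandas labels the columns with integers.
df = pd.DataFrame(arr)
print(df.columns.tolist())
```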

When creating the DataFrame, we can pass a list of strings to the columns parameter to specify the column names.
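For instance, with the same kind of illustrative array, the column names could be supplied like this (the names "a" to "d" are placeholders):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)

# The list passed to `columns` needs one name per column of the array.
df = pd.DataFrame(arr, columns=["a", "b", "c", "d"])
print(df)
```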

Reading from a file

Another very powerful feature of pandas is that it can read data from files in various formats to create a DataFrame, such as the commonly used Excel and CSV formats, or even databases.

For structured data such as Excel, CSV, and JSON, pandas provides dedicated APIs (for example read_excel, read_csv, and read_json). We just find the corresponding API and use it.
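The sketch below illustrates two of these readers. To keep it self-contained, it reads from in-memory text through `StringIO` instead of real files; the sample contents are invented for illustration. In practice you would usually pass a file path instead.

```python
from io import StringIO

import pandas as pd

# Invented CSV content standing in for a file on disk.
csv_text = "id,score\n1,90\n2,85\n"
df_csv = pd.read_csv(StringIO(csv_text))

# Invented JSON content: a list of records, one per row.
json_text = '[{"id": 1, "score": 90}, {"id": 2, "score": 85}]'
df_json = pd.read_json(StringIO(json_text), orient="records")

print(df_csv)
print(df_json)
```

`pd.read_excel` follows the same pattern for Excel files, but it needs an additional Excel engine package installed, so it is left out of this sketch.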

If the data is in some special format, that is not a problem either. We can use read_table, which reads data from various text files; we complete the creation by passing in the separator and other parameters. For example, in the previous article verifying the dimensionality reduction effect of PCA, we read data from a .data file. The delimiter between columns in that file is a space, rather than the comma used by CSV or a tab character. We pass the sep parameter to specify the delimiter and complete the data reading.
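A rough sketch of reading space-delimited text with `read_table` follows. The original .data file is not available here, so the rows below are invented stand-ins, and `StringIO` replaces the file path.

```python
from io import StringIO

import pandas as pd

# Invented space-separated rows standing in for the .data file.
raw = "5.1 3.5 1.4\n4.9 3.0 1.4\n"

# sep=" " tells pandas the columns are separated by a single space;
# header=None says the text has no line of column names.
df = pd.read_table(StringIO(raw), sep=" ", header=None)
print(df)
```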

The header parameter of these read functions indicates which line or lines of the file are used as the column names. By default the first line is used as the column names (equivalent to header=0). If the data contains no column names, header=None must be specified; otherwise the first row of data is consumed as column names. We rarely need multi-level column names, so in practice we usually either keep the default or set it to None.
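To make the difference concrete, here is a small sketch that reads the same header-less text twice, once with the default and once with `header=None` (the data is invented):

```python
from io import StringIO

import pandas as pd

# Invented data with no line of column names.
text = "1,2\n3,4\n"

# Default behaviour: the first data row is mistaken for column names.
wrong = pd.read_csv(StringIO(text))

# header=None: every line is treated as data, columns get integer labels.
right = pd.read_csv(StringIO(text), header=None)

print(wrong.shape, wrong.columns.tolist())
print(right.shape, right.columns.tolist())
```

With the default, one row of data silently disappears into the column labels, which is the kind of problem the text warns about.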

Among all these ways to create a DataFrame, the most commonly used is the last one, reading from a file. When we do machine learning or take part in competitions on Kaggle, the data is usually ready-made and given to us as files; we very rarely need to create data ourselves. In real work scenarios the data may not be stored in files, but it still comes from a source, usually a big data platform, and the model obtains its training data from there.

So in general we rarely use the other ways of creating a DataFrame. It is enough to be aware of them and to focus on mastering reading from files.

Common operations

The following introduces some common pandas operations. I already knew these operations before I learned pandas systematically. The reason is simple: they are used so often that they can be considered must-know common knowledge.

View data

When we evaluate a DataFrame instance in Jupyter, all of its data is printed for us. If there are too many rows, the middle part is replaced with an ellipsis. For a DataFrame with a large amount of data, we generally do not display it directly like this, but instead show only the first few or last few rows. Two APIs are used for this.

The method for displaying the first few rows is called head. It accepts a parameter that specifies how many rows to display from the beginning.

Since there is an API for displaying the first few rows, there is also one for displaying the last few. It is called tail, and it lets us view the last specified number of rows in the DataFrame.
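A quick sketch of `head`, using a made-up 100-row DataFrame:

```python
import pandas as pd

# A made-up DataFrame with 100 rows, just to have something long.
df = pd.DataFrame({"n": range(100)})

first_three = df.head(3)  # the first 3 rows
first_default = df.head() # no argument: pandas returns 5 rows

print(first_three)
```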
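The counterpart for the end of the table, again on a made-up 100-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

# The last 3 rows; the original row index is preserved.
last_three = df.tail(3)
print(last_three)
```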

Add, delete and modify columns

We mentioned earlier that a DataFrame is essentially a dict of Series. Since it behaves like a dict, we can naturally obtain a specific Series by its key.

There are two ways to get a specific column from a DataFrame: access the column name as an attribute, or look it up with dict-style indexing.

We can also read multiple columns at the same time. For multiple columns only one method is supported, dict-style indexing. It accepts a list and looks up the data for the columns in that list. The result is a new DataFrame made up of those columns.

We can use del to delete a column we don't need.
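Both access styles, sketched on a small invented DataFrame:

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 32]})

by_attribute = df.age  # attribute-style access
by_key = df["age"]     # dict-style access

# Both return the same Series.
print(by_attribute.equals(by_key))
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the dict style is the safer habit.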
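Selecting several columns at once might look like this (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 32],
    "city": ["Paris", "Oslo"],
})

# Note the double brackets: the inner list names the columns to keep.
subset = df[["name", "city"]]
print(subset)
```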
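A minimal sketch of deleting a column with `del`, on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 32],
    "city": ["Paris", "Oslo"],
})

# Removes the column in place, just like deleting a key from a dict.
del df["city"]
print(df.columns.tolist())
```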

Creating a new column is also very simple. We can assign to the DataFrame directly, just like assigning to a dict.

The assigned value does not have to be a single number; it can also be an array.

Modifying a column is just as simple: we overwrite the original data with the same assignment syntax.
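As a sketch, assigning a single value to a new key creates the column and fills every row with that value (the column name `bonus` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "age": [25, 32, 47]})

# "bonus" does not exist yet, so this creates it; the scalar is
# repeated for every row.
df["bonus"] = 1
print(df)
```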
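An array works the same way, as long as its length matches the number of rows. A sketch with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "age": [25, 32, 47]})

# One array element per row; a length mismatch raises a ValueError.
df["score"] = np.array([88.5, 92.0, 79.0])
print(df)
```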
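For example, overwriting an existing column could look like this (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 32]})

# Assigning to an existing key replaces the column's contents.
df["age"] = df["age"] + 1
print(df)
```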

Convert to numpy array

Sometimes it is inconvenient to work with pandas and we want the underlying raw data. We can use .values directly to obtain the numpy array corresponding to the DataFrame.

Each column in a DataFrame has its own type, but after conversion to a numpy array all the data shares a single type. pandas therefore finds a common type for all columns, which is why you often end up with the object type. So it is best to check the types before using .values to make sure no errors arise from them.
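A short sketch of the conversion on an invented all-integer DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# .values returns the data as a 2-D numpy array, one row per DataFrame row.
arr = df.values
print(type(arr), arr.shape)
```

The pandas documentation recommends `df.to_numpy()` for the same purpose in newer code; `.values` as used in this article still works.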
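The effect of mixed column types can be checked with a small experiment like the one below (columns invented for illustration). Inspecting `df.dtypes` first is one way to do the type check suggested above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# Check the per-column types before converting.
print(df.dtypes)

# Integer and float columns share a numeric common type.
numeric = df[["a", "b"]].values

# Adding a string column forces the generic object type.
mixed = df.values

print(numeric.dtype, mixed.dtype)
```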

Summary

In today's article we learned about the relationship between DataFrame and Series, as well as some DataFrame basics and common usage. Although a DataFrame can be roughly regarded as a dict of Series, it is a separate data structure with many APIs of its own. It supports many fancy operations and is a powerful tool for processing data.

Some organizations have reported that an algorithm engineer spends about 70% of their time on data processing, while the time spent actually writing models and tuning parameters may be less than 20%. This shows how necessary and important data processing is. In the Python world, pandas is the best scalpel and toolbox for data processing. I hope everyone can master it.

source:juejin.im