pandas basics
pandas is a data analysis package built on NumPy that provides higher-level data structures and tools.
Just as NumPy revolves around the ndarray, pandas revolves around two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table respectively. The conventional way to import pandas is:
from pandas import Series, DataFrame
import pandas as pd
Series
A Series can be regarded as a fixed-length, ordered dictionary. Almost any one-dimensional data can be used to construct a Series object:
>>> s = Series([1,2,3.0,'abc'])
>>> s
0      1
1      2
2      3
3    abc
dtype: object
Although dtype: object can hold a mix of basic data types, it is likely to hurt performance. It is best to keep to a simple, uniform dtype.
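As a small illustration of the dtype remark, the sketch below builds one mixed Series and one purely numeric Series and compares their dtypes. The variable names `mixed` and `numeric` are made up for this example.

```python
import pandas as pd

# Numbers mixed with a string: pandas falls back to the generic object dtype.
mixed = pd.Series([1, 2, 3.0, 'abc'])

# Numbers only: pandas can pick a single numeric dtype.
numeric = pd.Series([1, 2, 3.0])

print(mixed.dtype)
print(numeric.dtype)
```

The numeric Series is stored as float64, which lets pandas use NumPy's vectorised operations on it.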
A Series object has two main attributes, index and values, which are the left and right columns in the example above. Because a list was passed to the constructor, the index is a run of integers increasing from 0. If a dictionary-like structure of key-value pairs is passed in, the Series is built with the keys as the index and the values as the values. The index can also be given explicitly with a keyword parameter at construction time:
>>> s = Series(data=[1,3,5,7],index = ['a','b','x','y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)
The elements of a Series are constructed strictly according to the given index. This means that if the data parameter holds key-value pairs, only the keys that appear in index are used, and if a key in index is missing from data, it is still added, with NaN as its value.
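The rule above can be sketched as follows. The dictionary contents and the labels `'a'`, `'x'`, `'z'` are invented for illustration.

```python
import pandas as pd

data = {'a': 1, 'b': 3, 'x': 5}

# Only the keys listed in index are used:
#   'b' is in data but not in index, so it is left out;
#   'z' is in index but not in data, so it is added with NaN.
s = pd.Series(data, index=['a', 'x', 'z'])

print(s)
```

Because NaN is a floating-point value, the resulting Series has a float dtype even though the dictionary held integers.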
Note that although each index label corresponds to an element of values, this is not the same as a dictionary mapping. index and values are still stored as separate arrays, so the performance of Series objects is perfectly acceptable.
The biggest advantage of this key-value structure is that indexes are aligned automatically when arithmetic is performed between Series.
In addition, both the Series object and its index have a name attribute:
>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a            1
b            3
x            5
y            7
Name: a_series, dtype: int64
DataFrame
A DataFrame is a tabular data structure that contains an ordered set of columns (similar to an index), and each column can hold a different value type (unlike an ndarray, which has a single dtype). In essence, you can think of a DataFrame as a collection of Series that share the same index.
The construction method of DataFrame is similar to Series, except that it can accept multiple one-dimensional data sources at the same time, and each one will become a separate column:
>>> data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
...         'year':[2000,2001,2002,2001,2002],
...         'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5   Ohino  2000
1  1.7   Ohino  2001
2  3.6   Ohino  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]
Although the data parameter looks like a dictionary, its keys do not become the index of the DataFrame. They become the column names, that is, the name attribute of each column Series. The index generated here is still the default 0 to 4.
A fuller form of the DataFrame constructor is DataFrame(data=None, index=None, columns=None), where columns gives the column names:
>>> df = DataFrame(data,index=['one','two','three','four','five'],
...                columns=['year','state','pop','debt'])
>>> df
       year   state  pop debt
one    2000   Ohino  1.5  NaN
two    2001   Ohino  1.7  NaN
three  2002   Ohino  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]
As before, missing values are filled with NaN. Take a look at the index, the columns, and the type of a single column:
>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>
Row-oriented and column-oriented operations on a DataFrame are largely symmetric, and any column you extract is a Series.
Object properties
Reindex
A Series is reindexed with its .reindex(index=None, **kwargs) method. Two parameters in **kwargs are used most often: method=None and fill_value=np.nan:
>>> ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> a = ['a','b','c','d','e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a,fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.reindex(a,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.reindex(a,fill_value=0,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
The .reindex() method returns a new object whose index strictly follows the given argument. The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, specifies the interpolation (filling) method. When it is not given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad and bfill the same as backfill; they take the previous or the next value, respectively, when filling.)
The reindexing method of a DataFrame is .reindex(index=None, columns=None, **kwargs). It has one more optional parameter than the Series version, columns, which reindexes the columns. Usage is similar to the example above, except that the interpolation parameter method can only be applied to rows, that is, axis 0.
>>> state = ['Texas','Utha','California']
>>> df.reindex(columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[3 rows x 3 columns]
>>> df.reindex(index=['a','b','c','d'],columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[4 rows x 3 columns]
fill_value, however, is still valid for columns. You may already have wondered whether interpolation on columns can be done with df.T.reindex(index, method='**').T. The answer is yes. Also note that when using reindex(index, method='**'), the index must be monotonic, otherwise a ValueError: Must be monotonic for forward fill is raised. For example, the last call in the example above would fail if index=['a','b','d','c'] were used.
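The transpose trick and the monotonic requirement can be sketched as below. This is a minimal example under assumptions: the frame is rebuilt here with its columns already in alphabetical order (so that the transposed index is monotonic), the label 'Utah' is invented, and the failing case shown is the one where the object's own index is unsorted. The exact error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 0, 1], [5, 3, 4], [8, 6, 7]],
    index=['a', 'c', 'd'],
    columns=['California', 'Ohio', 'Texas'],  # sorted, hence monotonic
)

# Forward fill along the rows: the new row 'b' copies row 'a'.
rows_filled = df.reindex(index=['a', 'b', 'c', 'd'], method='ffill')

# Forward fill along the columns by transposing first:
# the new column 'Utah' copies column 'Texas'.
cols_filled = df.T.reindex(
    ['California', 'Ohio', 'Texas', 'Utah'], method='ffill'
).T

# An object whose index is not monotonic cannot be forward-filled.
unsorted = df.reindex(['d', 'a', 'c'])
try:
    unsorted.reindex(['a', 'b', 'c', 'd'], method='ffill')
    raised = False
except ValueError:
    raised = True

print(rows_filled)
print(cols_filled)
print(raised)
```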
Deleting items on the specified axis
This means deleting an element of a Series, or a row (column) of a DataFrame, through the object's .drop(labels, axis=0) method:
>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.drop(['Ohio','Texas'],axis=1)
   California
a           2
c           5
d           8

[3 rows x 1 columns]
.drop() returns a new object; the original object is not changed.
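A short check that .drop() leaves the original untouched, reusing the ser values from the example above (the name `dropped` is made up here):

```python
import pandas as pd

ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# drop() hands back a new Series without the label 'c' ...
dropped = ser.drop('c')

# ... while ser itself still holds all four elements.
print(len(dropped), len(ser))
```

To keep the shortened result, assign it back to a name, for example ser = ser.drop('c').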
Indexing and slicing
Like NumPy, pandas supports indexing and slicing through obj[::], as well as filtering with boolean arrays.
Note, however, that because a pandas index is not limited to integers, a slice that uses non-integer labels includes its end point.
>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64
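Here foo and bar differ only in their index: bar's index is a sequence of integers. With integer slicing the result matches the default behaviour of a Python list or NumPy; with a string label such as 'c', the result includes that boundary element.
Another special case is the indexing of a DataFrame, because it has two axes (a double index).
Think of it this way: the standard slicing syntax for a DataFrame is .ix[::,::]. The ix object accepts two slices, one for the rows (axis=0) and one for the columns (axis=1). (.ix was removed in pandas 1.0; in current versions .loc and .iloc take its place.)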
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.ix[:2,:2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]
>>> df.ix['a','Ohio']
0
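Because .ix is no longer available in pandas 1.0 and later, the same two-axis selections are written today with .iloc (by position) and .loc (by label). The sketch below rebuilds the frame from the example; the result names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California'],
)

# Position-based: first two rows, first two columns (end point excluded).
by_position = df.iloc[:2, :2]

# Label-based: rows up to 'c', columns up to 'Texas' (end points included).
by_label = df.loc[:'c', :'Texas']

# A single cell, addressed by row label and column label.
single = df.loc['a', 'Ohio']

print(by_position)
print(by_label)
print(single)
```

Splitting the two behaviours across .loc and .iloc removes the ambiguity .ix had when the index itself contained integers.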
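Slicing directly, without ix, is a special case:
When indexing, columns are selected
When slicing, rows are selected
This looks a little illogical, but the author explains that "this syntax arose out of practicality", so we take his word for it.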
>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
When using boolean arrays, note the different ways of selecting rows and columns (the : in the column form cannot be omitted):
>>> df['Texas']>=4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.ix[:,df.ix['c']>=4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
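In pandas versions without .ix, the same boolean selections can be written with .loc. This is a sketch on the same frame; `rows` and `cols` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California'],
)

# Row filter: keep the rows whose 'Texas' value is at least 4.
rows = df[df['Texas'] >= 4]

# Column filter: keep the columns whose value in row 'c' is at least 4.
# The leading ':' selects all rows and cannot be left out.
cols = df.loc[:, df.loc['c'] >= 4]

print(rows)
print(cols)
```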
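Arithmetic and data alignment
One of the most important features of pandas is that it can perform arithmetic on objects with different indexes. When objects are added, the index of the result is the union of the two indexes. Automatic data alignment introduces missing values, NaN by default, where the indexes do not overlap.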
>>> foo = Series({'a':1,'b':2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b':3,'d':4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64
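On a DataFrame, alignment happens on both rows and columns.
When you do not want NA values in the result, you can use the fill_value parameter mentioned earlier under reindex. To pass it, however, you have to use the object's method rather than the operator: df1.add(df2, fill_value=0). The other arithmetic methods are sub(), div(), and mul().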
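The difference between the operator and the method form can be sketched with the foo and bar Series from the example above (`plain` and `filled` are illustrative names):

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

# Operator form: labels present on only one side become NaN.
plain = foo + bar

# Method form: a label missing on one side is treated as 0 before adding.
filled = foo.add(bar, fill_value=0)

print(plain)
print(filled)
```

With fill_value=0 the labels 'a' and 'd' keep the value from the side that has them, while 'b' is the ordinary sum.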
Arithmetic between a Series and a DataFrame involves broadcasting, which is not covered for now.
Function application and mapping
NumPy ufuncs (element-wise array methods) also work on pandas objects.
To apply a function along the rows or columns of a DataFrame, use the .apply(func, axis=0, args=(), **kwds) method.
>>> f = lambda x:x.max()-x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f,axis=1)
a    2
c    2
d    2
dtype: int64
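Sorting and ranking
The Series method sort_index(ascending=True) sorts by index. The ascending parameter controls ascending or descending order and defaults to ascending.
To sort a Series by value, use the .order() method; any missing values are placed at the end of the Series by default. (.order() was later removed from pandas; sort_values() is its replacement.)
On a DataFrame, the .sort_index(axis=0, by=None, ascending=True) method adds an axis parameter and a by parameter. (In current pandas the by parameter belongs to sort_values() instead.) The by parameter sorts according to one or more columns (by cannot be used on rows):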
>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(by=['California','Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7

[3 rows x 3 columns]
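In recent pandas versions, sorting by value is done with sort_values() rather than .order() or sort_index(by=...). The sketch below uses that method; the Series values are invented, and the frame is sorted in descending order so that the effect is visible.

```python
import pandas as pd

ser = pd.Series([3.0, float('nan'), -1.0, 2.0], index=list('abcd'))

# Sort a Series by value; the missing value ends up last.
by_value = ser.sort_values()

df = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California'],
)

# Sort the rows by the values in one column, largest first.
by_column = df.sort_values(by='Texas', ascending=False)

# Sort the column labels themselves.
by_labels = df.sort_index(axis=1)

print(by_value)
print(by_column)
print(by_labels)
```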
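Ranking (Series.rank(method='average', ascending=True)) differs from sorting in that it replaces the object's values with ranks (from 1 to n). The only question is how to handle ties, which is what the method parameter is for. It has four possible values: average, min, max, first.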
>>> ser=Series([3,2,0,3],index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64
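Note the different ranks that each method value gives the tied pair ser[0] = ser[3].
The DataFrame method .rank(axis=0, method='average', ascending=True) adds an axis parameter for ranking along rows or columns separately. For now there seems to be no method that ranks across all elements.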
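One possible workaround for ranking across all elements, offered only as a sketch and not as a built-in pandas feature, is to flatten the values, rank them as a single Series, and reshape the ranks back into the original layout. The names `flat_ranks` and `overall` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California'],
)

# Flatten row by row, rank the nine values together ...
flat_ranks = pd.Series(df.to_numpy().ravel()).rank(method='average')

# ... then put the ranks back into the original shape and labels.
overall = pd.DataFrame(
    flat_ranks.to_numpy().reshape(df.shape),
    index=df.index,
    columns=df.columns,
)

print(overall)
```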
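Statistical methods
pandas objects have a set of statistical methods. Most are reductions or summary statistics, which extract a single value from a Series, or a Series from the rows or columns of a DataFrame.
Take DataFrame.mean(axis=0, skipna=True): when the data contains NA values, they are simply skipped, unless the whole slice (row or column) is NA. To turn this off, pass skipna=False: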
>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1,skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
Other commonly used statistical methods:
Method | Description |
count | Number of non-NA values |
describe | Compute summary statistics for a Series or for each column of a DataFrame |
min, max | Minimum and maximum values |
argmin, argmax | Index positions (integers) of the minimum and maximum values |
idxmin, idxmax | Index labels of the minimum and maximum values |
quantile | Sample quantile (0 to 1) |
sum | Sum |
mean | Mean |
median | Median |
mad | Mean absolute deviation from the mean |
var | Variance |
std | Standard deviation |
skew | Skewness of the sample values (third moment) |
kurt | Kurtosis of the sample values (fourth moment) |
cumsum | Cumulative sum of the sample values |
cummin, cummax | Cumulative minimum and cumulative maximum of the sample values |
cumprod | Cumulative product of the sample values |
diff | First difference (useful for time series) |
pct_change | Percent change |
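A few of these methods can be tried on the frame from the mean() example. This is a minimal sketch; the result names are made up, and NA values are skipped by default in each call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'one': [1.40, 7.10, np.nan, 0.75],
     'two': [np.nan, -4.5, np.nan, -1.3]},
    index=list('abcd'),
)

counts = df.count()    # number of non-NA values per column
largest = df.idxmax()  # index label of each column's maximum
running = df.cumsum()  # cumulative sum down each column; NaN stays NaN

print(counts)
print(largest)
print(running)
print(df.describe())
```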
Handling missing data
The main representation of NA in pandas is np.nan. Python's built-in None is also treated as NA.
There are four methods for handling NA: dropna, fillna, isnull, and notnull.
is(not)null
This pair of methods is applied element-wise to the object and returns a boolean array, which is typically used for boolean indexing.
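A minimal sketch of using the boolean result as a filter (the values and the names `mask` and `clean` are invented for illustration):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.5, None])

# notnull() marks each element: True where a value is present.
mask = ser.notnull()

# The boolean Series can be used directly as an index.
clean = ser[mask]

print(mask)
print(clean)
```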
dropna
For a Series, dropna returns a Series containing only the non-null data and their index values.
The difficulty lies with DataFrame, because dropping anything means losing at least one whole row (column). The solution, as before, is to pass extra parameters: dropna(axis=0, how='any', thresh=None). The how parameter can be any or all; all discards a row (column) only if every element in it is NA. Another interesting parameter is thresh, an integer: with thresh=3, for example, a row is kept only when it has at least 3 non-NA values.
fillna
Besides a basic type, the value parameter of fillna can also be a dictionary, which makes it possible to fill different columns with different values. The method parameter is used in the same way as in .reindex(), so it is not repeated here.
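The dropna options and the dictionary form of fillna can be sketched as follows. The frame and its column names are invented for illustration, and thresh is set to 2 here because the frame has only three columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one':   [1.0, np.nan, np.nan, 4.0],
    'two':   [np.nan, np.nan, 3.0, 4.0],
    'three': [1.0, np.nan, 3.0, 4.0],
})

# how='any' (the default): drop every row that has at least one NA.
any_dropped = df.dropna()

# how='all': drop only the rows in which every value is NA.
all_dropped = df.dropna(how='all')

# thresh=2: keep the rows that have at least two non-NA values.
at_least_two = df.dropna(thresh=2)

# A dictionary fills each listed column with its own value;
# column 'three' is not listed and keeps its NaN.
filled = df.fillna({'one': 0, 'two': -1})

print(any_dropped)
print(all_dropped)
print(at_least_two)
print(filled)
```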
inplace parameter