Pandas is an open-source data manipulation and analysis library for Python, built on top of NumPy. It provides easy-to-use data structures, DataFrame and Series, that simplify data handling for most analysis tasks. It is widely used for working with structured data and for data cleaning and preparation, a crucial step in data science workflows. Whether you have time series data, heterogeneous data, or data stored in CSV, Excel, SQL databases, or JSON, Pandas offers powerful tools that make it much easier to work with.
Before using any Pandas functionality, you need to import the library. It is commonly imported as pd to keep the syntax concise.
import pandas as pd
A Series is a one-dimensional labeled array, capable of holding any data type (integer, string, float, etc.). It can be created from a list, NumPy array, or a dictionary.
# Create a Pandas Series from a list
s = pd.Series([1, 2, 3, 4])
print(s)
Expected Output:
0    1
1    2
2    3
3    4
dtype: int64
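The text above also mentions building a Series from a dictionary or a NumPy array. A minimal sketch of both follows; the names `prices` and `readings` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
prices = pd.Series({"apple": 1.5, "banana": 0.25, "cherry": 3.0})

# From a NumPy array, with an explicit index
# (without one, pandas uses the default 0..n-1 labels).
readings = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

print(prices["banana"])         # label-based lookup -> 0.25
print(readings.index.tolist())  # ['a', 'b', 'c']
```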
A DataFrame is a two-dimensional labeled data structure, similar to a table in a database or an Excel spreadsheet. It consists of rows and columns. Each column can have a different data type.
# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'London', 'Berlin']}
df = pd.DataFrame(data)
print(df)
Expected Output:
      Name  Age      City
0    Alice   24  New York
1      Bob   27    London
2  Charlie   22    Berlin
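# Create a DataFrame from a dictionary of columns
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data)

# Create a DataFrame from a list of rows, naming the columns explicitly
data = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data, columns=["A", "B", "C"])
print(df)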
Expected Output:
   A  B  C
0  1  2  3
1  4  5  6
Pandas provides several methods to inspect and get information about your data.
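# Inspecting the DataFrame
print(df.head())      # First 5 rows (all rows here, since df has only 2)
print(df.tail())      # Last 5 rows
df.info()             # Prints its summary directly and returns None
print(df.describe())  # Summary statistics for the numeric columns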
Expected Output:
   A  B  C
0  1  2  3
1  4  5  6
   A  B  C
0  1  2  3
1  4  5  6
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       2 non-null      int64
 1   B       2 non-null      int64
 2   C       2 non-null      int64
dtypes: int64(3)
memory usage: 128.0 bytes
              A         B         C
count  2.000000  2.000000  2.000000
mean   2.500000  3.500000  4.500000
std    2.121320  2.121320  2.121320
min    1.000000  2.000000  3.000000
25%    1.750000  2.750000  3.750000
50%    2.500000  3.500000  4.500000
75%    3.250000  4.250000  5.250000
max    4.000000  5.000000  6.000000
The memory usage figure depends on the pandas version and platform.
You can access columns either using dot notation or by indexing with square brackets.
# Dot notation
print(df.A)

# Bracket notation
print(df["B"])
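Dot notation only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. Bracket notation always works. The sketch below illustrates this with a hypothetical `orders` DataFrame whose column names were chosen to trip up dot notation.

```python
import pandas as pd

# Hypothetical column names chosen to break dot notation.
orders = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

unit_prices = orders["unit price"]  # the space makes dot access impossible
counts = orders["count"]            # the column named "count"
count_method = orders.count         # the DataFrame.count method, not the column

print(counts.tolist())         # [3, 4]
print(callable(count_method))  # True
```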
You can use .iloc[] for integer-location based indexing and .loc[] for label-based indexing.
# Using iloc (position-based)
print(df.iloc[0])  # Access the first row by integer position

# Using loc (label-based)
print(df.loc[0])   # Access the row whose index label is 0
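With the default integer index, `.iloc[0]` and `.loc[0]` return the same row, which can hide the difference between them. A small sketch with string labels makes it visible; the `people` DataFrame and its labels are illustrative only.

```python
import pandas as pd

# Illustrative data with string index labels instead of 0, 1, 2.
people = pd.DataFrame(
    {"Age": [24, 27, 22]},
    index=["alice", "bob", "charlie"],
)

first_by_position = people.iloc[0]  # first row, whatever its label is
bob_by_label = people.loc["bob"]    # the row labeled "bob"

print(first_by_position["Age"])  # 24
print(bob_by_label["Age"])       # 27
```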
You can slice DataFrames to get subsets of data. You can slice rows or columns.
# Select specific rows and columns
subset = df.loc[0:1, ["A", "C"]]
print(subset)
Expected Output:
   A  C
0  1  3
1  4  6
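Note that `.loc` slices include the end label, while `.iloc` slices follow the usual Python convention and exclude the end position. The following sketch compares the two on a three-row DataFrame made up for the purpose.

```python
import pandas as pd

# Illustrative 3x3 table with the default integer index.
table = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["A", "B", "C"],
)

by_label = table.loc[0:1, ["A", "C"]]  # labels 0 and 1 -> two rows
by_position = table.iloc[0:1, [0, 2]]  # position 0 only -> one row

print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 2)
```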
You can add columns directly to the DataFrame by assigning values.
df['D'] = [7, 8] # Adding a new column
You can modify the values of a column by accessing it and assigning new values.
df['A'] = df['A'] * 2 # Modify the 'A' column
You can drop rows or columns using the drop() function.
df = df.drop(columns=['D'])  # Dropping a column
df = df.drop(index=1)        # Dropping a row by index
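By default, `drop()` returns a new DataFrame and leaves the original untouched, which is why the result is assigned back. A minimal sketch, using made-up data held in a variable named `original`:

```python
import pandas as pd

# Illustrative data; "D" is the column to remove.
original = pd.DataFrame({"A": [1, 4], "B": [2, 5], "D": [7, 8]})

# Chain the two drops; each call returns a new DataFrame.
trimmed = original.drop(columns=["D"]).drop(index=1)

print(list(original.columns))  # ['A', 'B', 'D'], the original is unchanged
print(trimmed.shape)           # (1, 2)
```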
Handling missing data is a critical task. Pandas provides several functions to handle missing data.
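# Option 1: fill missing values with 0
df = df.fillna(0)

# Option 2: drop rows that contain any missing value
# (an alternative to fillna; after fillna(0) there is nothing left to drop)
df = df.dropna()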
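To see what these calls actually do, it helps to run them on data that contains missing values. The sketch below uses a hypothetical `scores` DataFrame with `NaN` entries and also counts the gaps with `isna()`.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each column.
scores = pd.DataFrame({
    "math": [90.0, np.nan, 70.0],
    "art": [np.nan, 85.0, 60.0],
})

missing_per_column = scores.isna().sum()  # count NaN values per column
filled = scores.fillna(0)                 # replace every NaN with 0
complete_rows = scores.dropna()           # keep only rows with no NaN

print(missing_per_column["math"])    # 1
print(complete_rows.index.tolist())  # [2]
```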
The groupby() function is used for splitting the data into groups, applying a function, and then combining the results.
# Grouping by a column and calculating the sum
# (assumes df is the Name/Age/City DataFrame created earlier)
grouped = df.groupby('City')['Age'].sum()
You can apply various aggregation functions like sum(), mean(), min(), max(), etc.
# Aggregating data using mean
# (numeric_only=True skips text columns such as 'Name', which cannot be averaged)
df.groupby('City').mean(numeric_only=True)
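As a worked illustration of split-apply-combine, the sketch below groups a small made-up `people` DataFrame by city and computes several aggregates at once with `agg()`. The data is invented for the example.

```python
import pandas as pd

# Illustrative data: two people per city.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [24, 27, 22, 31],
    "City": ["London", "London", "Berlin", "Berlin"],
})

# Split by City, apply three aggregations to Age, combine into one table.
age_stats = people.groupby("City")["Age"].agg(["mean", "min", "max"])

print(age_stats.loc["London", "mean"])  # 25.5
print(age_stats.loc["Berlin", "max"])   # 31
```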
You can sort a DataFrame by one or more columns using the sort_values() function.
# Sorting by a column in ascending order
df_sorted = df.sort_values(by='Age')

# Sorting by multiple columns
df_sorted = df.sort_values(by=['Age', 'Name'], ascending=[True, False])
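The second sort key only matters when the first one has ties. The sketch below uses made-up data in which two people share an age, so the descending `Name` order decides between them.

```python
import pandas as pd

# Illustrative data: Alice and Charlie are both 24.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 22, 24],
})

# Age ascending first; within equal ages, Name descending.
ordered = people.sort_values(by=["Age", "Name"], ascending=[True, False])

print(ordered["Name"].tolist())  # ['Bob', 'Charlie', 'Alice']
```

The original index labels travel with the rows. Call `reset_index(drop=True)` afterwards if you want a fresh 0..n-1 index.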
You can rank the values in a DataFrame using rank().
df['Rank'] = df['Age'].rank()
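By default `rank()` ranks in ascending order and gives tied values the average of the ranks they span. The `method` and `ascending` parameters change that. A short sketch on a made-up `ages` Series with one tie:

```python
import pandas as pd

# Illustrative data: 27 appears twice.
ages = pd.Series([24, 27, 22, 27])

average_rank = ages.rank()  # default method="average"
dense_rank = ages.rank(method="dense", ascending=False)

print(average_rank.tolist())  # [2.0, 3.5, 1.0, 3.5]
print(dense_rank.tolist())    # [2.0, 1.0, 3.0, 1.0]
```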
You can merge two DataFrames based on a common column or index.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'C': ['C0', 'C1', 'C2']})
merged_df = pd.merge(df1, df2, on='A')
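When the key columns do not match exactly, the `how` parameter decides which rows survive. The default is `"inner"`, which keeps only keys present in both frames. The sketch below contrasts it with `"outer"`, using two made-up frames named `left` and `right`.

```python
import pandas as pd

# Illustrative frames: k0 exists only on the left, k3 only on the right.
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["k1", "k2", "k3"], "right_val": [20, 30, 40]})

inner = pd.merge(left, right, on="key")               # keys in both frames
outer = pd.merge(left, right, on="key", how="outer")  # keys in either frame

print(inner["key"].tolist())  # ['k1', 'k2']
print(len(outer))             # 4; unmatched cells are filled with NaN
```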
You can concatenate DataFrames along rows or columns using concat().
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
concat_df = pd.concat([df1, df2], axis=0)
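Row-wise concatenation keeps each frame's original index, so labels can repeat. Passing `ignore_index=True` renumbers them, and `axis=1` places the frames side by side instead. A minimal sketch with made-up frames `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
bottom = pd.DataFrame([[5, 6], [7, 8]], columns=["A", "B"])

stacked = pd.concat([top, bottom], axis=0)                        # index 0, 1, 0, 1
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)  # index 0..3

# Column-wise: prefix the second frame's columns to avoid duplicate names.
side_by_side = pd.concat([top, bottom.add_prefix("other_")], axis=1)

print(stacked.index.tolist())      # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
print(list(side_by_side.columns))  # ['A', 'B', 'other_A', 'other_B']
```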
Pandas is a versatile tool for data manipulation, from importing and cleaning data to performing complex operations. This cheat sheet provides a quick overview of some of the most common Pandas features, helping you make your data analysis workflow more efficient.