Removing Duplicate Rows Based on Multiple Columns in Python Pandas
The drop_duplicates function in Pandas provides an efficient way to remove duplicate rows from a DataFrame. But what if you want to treat rows as duplicates only when they match on a specific subset of columns?
Problem:
Consider a DataFrame with columns "A," "B," and "C." You want to remove every row that shares the same values in columns "A" and "C" with another row. In other words, you need to identify and drop rows 0 and 1 from this example DataFrame:
|   | A   | B | C |
|---|-----|---|---|
| 0 | foo | 0 | A |
| 1 | foo | 1 | A |
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
Solution:
You can achieve this with the drop_duplicates function and its subset parameter:
```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

df.drop_duplicates(subset=['A', 'C'], keep=False)
```
The keep parameter controls which duplicates, if any, to retain: 'first' (the default) keeps the first occurrence of each duplicate group, 'last' keeps the last occurrence, and False drops every duplicated row, including the first occurrence.
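To see the three keep options side by side, here is a short sketch using the same example DataFrame and comparing which index labels survive each call:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# Rows 0 and 1 are duplicates of each other on columns "A" and "C".
first = df.drop_duplicates(subset=["A", "C"], keep="first")  # keeps row 0
last = df.drop_duplicates(subset=["A", "C"], keep="last")    # keeps row 1
none = df.drop_duplicates(subset=["A", "C"], keep=False)     # drops both

print(list(first.index))  # [0, 2, 3]
print(list(last.index))   # [1, 2, 3]
print(list(none.index))   # [2, 3]
```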
The result is a DataFrame with rows 0 and 1 removed, leaving only the rows that are unique on columns "A" and "C". Note that pandas preserves the original index labels:

|   | A   | B | C |
|---|-----|---|---|
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
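If you would rather have the result numbered from 0 again instead of keeping the original labels, you can chain reset_index(drop=True) onto the call, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# drop=True discards the old index labels instead of keeping them as a column
result = df.drop_duplicates(subset=["A", "C"], keep=False).reset_index(drop=True)

print(list(result.index))       # [0, 1]
print(result["A"].tolist())     # ['foo', 'bar']
```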