The pandas drop_duplicates function is a powerful tool for removing duplicate rows from a DataFrame, but what if you only want to drop rows that are duplicates across a subset of columns?
Consider the following DataFrame:
| A   | B | C |
|-----|---|---|
| foo | 0 | A |
| foo | 1 | A |
| foo | 1 | B |
| bar | 1 | A |
Suppose you want to drop rows that match on columns A and C. In this case, you would want to drop rows 0 and 1.
To achieve this, use drop_duplicates with the keep parameter set to False. The keep parameter controls which duplicates, if any, to retain. By default, keep='first', meaning the first occurrence in each group of duplicates is kept and the rest are dropped. Setting keep=False instead drops every row that has a duplicate, leaving only rows that are unique with respect to the chosen columns.
The following code demonstrates how to drop rows with duplicate values in columns A and C:
```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# Drop rows with duplicate values in columns 'A' and 'C'
df = df.drop_duplicates(subset=['A', 'C'], keep=False)
print(df)
```
Output:

```
     A  B  C
2  foo  1  B
3  bar  1  A
```
As you can see, rows 0 and 1 have been dropped, as they are duplicates with respect to columns A and C.
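For contrast, here is a sketch of what the same subset produces with the default keep='first': instead of dropping every duplicated row, one representative of each (A, C) group is kept.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep='first' (the default) retains the first row of each (A, C) group,
# so row 0 survives while its duplicate (row 1) is dropped
kept_first = df.drop_duplicates(subset=['A', 'C'], keep='first')
print(kept_first)
#      A  B  C
# 0  foo  0  A
# 2  foo  1  B
# 3  bar  1  A
```

Choose keep=False when duplicated keys signal rows you want gone entirely, and keep='first' (or keep='last') when you want exactly one row per key.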