Dropping Duplicate Rows across Multiple Columns in Python Pandas
The pandas drop_duplicates method removes duplicate rows from a DataFrame, making it an invaluable tool for data cleansing. By default it compares all columns, but you can restrict the uniqueness check to a subset of columns.
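As a quick sketch of the default behavior (illustrative data, not the example below): with no arguments, drop_duplicates compares all columns and keeps the first occurrence of each duplicate group.

```python
import pandas as pd

# A small DataFrame with one fully duplicated row.
df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 1, 2]})

# Default: compare every column, keep the first occurrence.
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 2] -- row 1 duplicated row 0
```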
For instance, consider the following DataFrame:
```
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
```
Suppose you want to remove every row that shares identical values in columns 'A' and 'C'. In this case, rows 0 and 1 would both be eliminated, since they match on A='foo' and C='A'.
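Before dropping anything, you can inspect which rows form a duplicate group with the companion duplicated method; passing keep=False flags every member of each group rather than sparing the first:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep=False marks all rows that share values in 'A' and 'C',
# so rows 0 and 1 (both A='foo', C='A') are flagged True.
mask = df.duplicated(subset=["A", "C"], keep=False)
print(mask.tolist())  # [True, True, False, False]
```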
Previously, dropping every member of a duplicate group required manual filtering or a groupby workaround. The keep parameter of drop_duplicates makes this a one-liner: you can keep the first occurrence, keep the last, or drop all duplicates entirely.
To check for duplicates only on specific columns, use the subset parameter. Setting keep to False instructs pandas to drop every row in a duplicate group rather than keeping one representative:
```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

df.drop_duplicates(subset=['A', 'C'], keep=False)
```
Output:
```
     A  B  C
2  foo  1  B
3  bar  1  A
```
As you can see, rows 0 and 1 are successfully removed, leaving only the rows that are unique based on the values in columns 'A' and 'C.'
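To make the role of keep concrete, here is a side-by-side sketch contrasting the default keep='first' with keep=False on the same data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep='first' (the default) retains one representative per group,
# so row 0 survives even though row 1 matches it on 'A' and 'C'.
print(df.drop_duplicates(subset=["A", "C"]).index.tolist())
# [0, 2, 3]

# keep=False discards every member of a duplicate group.
print(df.drop_duplicates(subset=["A", "C"], keep=False).index.tolist())
# [2, 3]
```

Use the default when one copy of each record should survive; use keep=False when any ambiguity disqualifies the whole group.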