Efficient Dropping of Consecutive Duplicates in Pandas
When working with pandas DataFrames and Series, it's often necessary to remove duplicate values. The built-in drop_duplicates() method, however, removes every repeated value regardless of where it appears, so a value that shows up again later in the data is dropped even when it is not adjacent to its duplicate. When only consecutive duplicates should be collapsed, other approaches work better.
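As a quick illustration, consider a small hypothetical Series (the data below is an assumption, not from the original question); drop_duplicates() also removes the trailing 1, even though it is separated from the other 1s:

import pandas as pd

# Hypothetical example data, not from the original post
a = pd.Series([1, 1, 2, 2, 3, 1])

# drop_duplicates() removes repeats regardless of position
print(a.drop_duplicates().tolist())  # [1, 2, 3]

# The goal here is [1, 2, 3, 1]: collapse runs, but keep values that reappear later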
One approach uses the shift() function. Comparing the Series (or DataFrame column) against a shifted copy (a.shift(-1)) produces a boolean mask that is True wherever an element differs from the one after it; indexing with this mask keeps a single value from each consecutive run, as in the following example:
a.loc[a.shift(-1) != a]
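A minimal runnable sketch of that approach, using the same assumed sample data:

import pandas as pd

a = pd.Series([1, 1, 2, 2, 3, 1])  # assumed example data

# a.shift(-1) aligns each element with the one that follows it; keeping the rows
# where the two differ retains one value (the last) from each consecutive run
deduped = a.loc[a.shift(-1) != a]
print(deduped.tolist())  # [1, 2, 3, 1]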
Another method uses the diff() function, which computes the difference between each element and the previous one; a difference of zero marks a consecutive duplicate. It only works on numeric data and tends to be slower than the shift() approach on large datasets:
a.loc[a.diff() != 0]
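A self-contained version of the diff()-based approach, assuming the same numeric sample data:

import pandas as pd

a = pd.Series([1, 1, 2, 2, 3, 1])  # assumed numeric example data

# diff() is NaN for the first row and 0 wherever a value repeats its predecessor,
# so keeping the non-zero rows drops consecutive duplicates (numeric data only)
deduped = a.loc[a.diff() != 0]
print(deduped.tolist())  # [1, 2, 3, 1]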
The original answer used shift() with a period of -1, which keeps the last element of each consecutive run. To keep the first element of each run instead, compare against shift(1) (or simply shift(), since the default period is 1):
a.loc[a.shift(1) != a]
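The following sketch (with an assumed string Series) shows how the two shift directions differ: both keep one value per run, but shift(1) keeps the row where each run starts, while shift(-1) keeps the row where it ends:

import pandas as pd

a = pd.Series(['a', 'a', 'b', 'b', 'c', 'a'])  # assumed example data

# shift(1) compares each element with the previous one -> first row of each run
first_of_run = a.loc[a.shift() != a]
print(first_of_run.index.tolist())  # [0, 2, 4, 5]

# shift(-1) compares each element with the next one -> last row of each run
last_of_run = a.loc[a.shift(-1) != a]
print(last_of_run.index.tolist())   # [1, 3, 4, 5]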
Both the shift() and diff() approaches drop consecutive duplicates efficiently in pandas; choose between them based on the data type and performance requirements of your use case.