How to delete duplicates from Pandas DataFrame in Python

Removing duplicate values from a DataFrame is a common task in data cleaning and preprocessing. In the Pandas library, there are several methods to accomplish this task.

One way to delete duplicates is with the drop_duplicates() method. It removes duplicated rows and returns a new DataFrame; the original is left unchanged unless you pass inplace=True. By default it considers all columns when identifying duplicates, but you can restrict the comparison to a subset of columns with the subset parameter.
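A minimal sketch of both variants, using made-up sample data:

```python
import pandas as pd

# Sample data: row 2 is an exact duplicate of row 0,
# and row 3 repeats only the "name" value
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Alice"],
    "city": ["NY", "LA", "NY", "SF"],
})

# Drop rows that are duplicated across all columns
deduped = df.drop_duplicates()

# Drop rows that share the same value in a subset of columns
by_name = df.drop_duplicates(subset=["name"])

print(deduped)   # rows 0, 1, 3 remain
print(by_name)   # rows 0, 1 remain
```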

Another approach is the duplicated() method. It returns a boolean mask indicating whether each row is a duplicate of an earlier row. You can then invert this mask to filter out the duplicate rows, for example:

df = df[~df.duplicated()]
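A short sketch of the mask approach with hypothetical data; note that the keep parameter controls which occurrence is treated as the original:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 1, 3], "score": [10, 20, 10, 30]})

# duplicated() marks every occurrence after the first as True
mask = df.duplicated()            # [False, False, True, False]

# Invert the mask to keep only the first occurrence of each row
first_only = df[~mask]

# keep="last" marks the earlier occurrences instead, keeping the last one
last_only = df[~df.duplicated(keep="last")]

print(first_only)
print(last_only)
```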

You can also use pd.concat() combined with drop_duplicates(). This is useful when combining multiple DataFrames, and you can still specify which columns to use to identify the duplicates:

df = pd.concat([df1, df2]).drop_duplicates(subset=['col1', 'col2'])

In addition, you can drop duplicates based on specific column(s) and keep the first occurrence, the last occurrence, or only the row with the maximum or minimum value of a particular column.
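For instance, keeping only the row with the highest value per group can be done by sorting before deduplicating. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "score": [5, 9, 3, 7],
})

# Keep the first occurrence per user (the default behaviour)
first = df.drop_duplicates(subset=["user"], keep="first")

# Keep the row with the highest score per user:
# sort so the maximum comes first, then keep the first occurrence
best = df.sort_values("score", ascending=False).drop_duplicates(subset=["user"])

print(best.sort_values("user"))   # a -> 9, b -> 7
```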

It’s important to keep in mind that drop_duplicates() only removes exact duplicate rows. If you want to remove near-duplicates based on a custom condition or threshold, you will need to create a new column that flags those rows and then drop them.
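As a sketch of that flagging approach (the column names and the case-insensitive condition here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice", "Bob"],
    "age": [30, 30, 25],
})

# Flag rows that duplicate an earlier row once names are normalised to lowercase
df["is_dup"] = df.assign(name=df["name"].str.lower()).duplicated(subset=["name", "age"])

# Drop the flagged rows, then discard the helper column
cleaned = df[~df["is_dup"]].drop(columns="is_dup")

print(cleaned)   # "alice" is dropped as a near-duplicate of "Alice"
```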

In summary, Pandas provides several methods for removing duplicates from a DataFrame. Depending on your use case, you can use the drop_duplicates() method, the duplicated() boolean mask, or pd.concat() combined with drop_duplicates() to remove the duplicate rows.

In this Learn through Codes example, you will learn: How to delete duplicates from Pandas DataFrame in Python.
