Data Manipulation

Dedup DataFrame

Sometimes we want to drop all the duplicated data in our DataFrame, and we can use the drop_duplicates() function.

For Example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, the function will remove duplicates based on all columns.

1
2
3
4
5
6
df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

We can also specify specific columns via subset.

1
2
3
4
5
df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0

Enhancing Performance

df.query() vs df.eval()

When considering whether to use these functions, there are always two considerations:

  • computation time
  • memory use

Memory use is the most predictable aspect.

Every compound expression invoving NumPy arrays or Pandas DataFrame will result in implicit creation of temporary arrays.

If the size of the temporary DataFrame is significant compared to your available system memory (typically several gigabytes), then it’s a good idea to use an eval() or query().

On the performance side, eval() can be faster even when you are not maxing-out your system memory. The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches.

Optimise Column Type

Reference