Data Manipulation
Dedup DataFrame
Sometimes we want to drop all duplicated rows from a DataFrame, and we can use the drop_duplicates() function to do so.
By default, the function removes duplicates based on all columns, keeping the first occurrence of each duplicated row.
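The original example did not survive on this page, so the following is only a minimal sketch of the default behaviour. The sample data and the column names `name` and `city` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical in every column.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Bob"],
    "city": ["Paris", "Rome", "Paris", "Oslo"],
})

# With no arguments, a row counts as a duplicate only if every column
# matches an earlier row; the first occurrence is kept.
deduped = df.drop_duplicates()
print(deduped)
```

Note that drop_duplicates() returns a new DataFrame and keeps the original index labels, so `df` itself is unchanged here.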
We can also restrict the comparison to specific columns via the subset parameter.
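As a rough illustration of subset, the sketch below deduplicates on a single column. The data and column names are hypothetical, and the `keep="last"` variant is included only to show that the retained row can be changed.

```python
import pandas as pd

# Hypothetical sample data: "name" repeats, but "city" differs for Bob.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Bob"],
    "city": ["Paris", "Rome", "Paris", "Oslo"],
})

# Only the "name" column is compared, so the second Alice and the
# second Bob are both dropped even though Bob's city differs.
by_name = df.drop_duplicates(subset=["name"])

# keep="last" retains the last occurrence of each name instead.
by_name_last = df.drop_duplicates(subset=["name"], keep="last")

print(by_name)
print(by_name_last)
```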
Enhancing Performance
df.query() vs df.eval()
When considering whether to use these functions, there are always two considerations:
- computation time
- memory use
Memory use is the most predictable aspect.
Every compound expression involving NumPy arrays or Pandas DataFrames results in the implicit creation of temporary arrays.
If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it's a good idea to use an eval() or query() expression.
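To make the point about temporaries concrete, here is a small sketch comparing plain pandas expressions with their eval() and query() equivalents. The column names `A`, `B`, `C` and the thresholds are arbitrary choices for illustration. The memory saving applies when pandas can use the numexpr engine; if numexpr is not installed, pandas falls back to its Python engine and the string form is mostly a convenience.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data; in practice the frame would be far larger.
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

# Plain pandas: df.A + df.B, each comparison, and the combined mask
# are all materialised as separate temporary objects.
mask_result = df[(df.A + df.B > 1.0) & (df.C < 0.5)]

# query() takes the same filter as a string expression.
query_result = df.query("A + B > 1.0 and C < 0.5")

# The same idea for arithmetic: the plain form builds intermediates,
# while eval() evaluates the whole expression from a string.
ratio_plain = (df.A + df.B) / (df.C + 1)
ratio_eval = df.eval("(A + B) / (C + 1)")

print(mask_result.equals(query_result))
print(np.allclose(ratio_plain, ratio_eval))
```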
On the performance side, eval() can be faster even when you are not maxing out your system memory. The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches.
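Whether eval() actually wins depends on data size, hardware, and whether numexpr is installed, so it is worth measuring on your own data. The following is a minimal timing sketch, not a benchmark: the frame sizes, the repeat count, and the variable names are all assumptions made for illustration.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 10  # hypothetical sizes; adjust to your data

# Four frames of random floats, passed to pd.eval by name.
frames = {
    f"df{i}": pd.DataFrame(rng.random((n_rows, n_cols)))
    for i in range(1, 5)
}
df1, df2, df3, df4 = frames.values()


def plain_sum():
    # Each "+" allocates a full-size temporary DataFrame.
    return df1 + df2 + df3 + df4


def eval_sum():
    # The whole expression is handed to pd.eval as one string;
    # local_dict supplies the names used inside it.
    return pd.eval("df1 + df2 + df3 + df4", local_dict=frames)


t_plain = timeit.timeit(plain_sum, number=5)
t_eval = timeit.timeit(eval_sum, number=5)
print(f"plain: {t_plain:.3f}s  eval: {t_eval:.3f}s")
```

For small frames the parsing overhead of eval() can outweigh any gain, so the plain form is often the better default until the data grows well beyond the CPU cache.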