Sometimes we want to drop all the duplicated data in our DataFrame, and we can use the
By default, the function will remove duplicates based on all columns.
We can also specify specific columns via
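As a minimal sketch, assuming the function meant here is pandas' DataFrame.drop_duplicates and that the columns are chosen through its subset parameter (both are assumptions, since the text above does not name them), the two behaviours could look like this. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 are identical in every column,
# rows 2 and 3 share a name but differ in city.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Bob"],
    "city": ["Oslo", "Oslo", "Rome", "Paris"],
})

# Default: a row counts as a duplicate only if all columns match an earlier row.
all_cols = df.drop_duplicates()

# subset: only the listed columns are compared; the first occurrence is kept.
by_name = df.drop_duplicates(subset=["name"])

print(all_cols)
print(by_name)
```

Note that the original index labels are preserved in both results; passing ignore_index=True would renumber them.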
When deciding whether to use these functions, there are two considerations:
- computation time
- memory use
Memory use is the most predictable aspect.
Every compound expression involving NumPy arrays or Pandas DataFrames will result in the implicit creation of temporary arrays.
If the size of the temporary DataFrame is significant compared to your available system memory (typically several gigabytes), then it’s a good idea to use an
On the performance side, eval() can be faster even when you are not maxing out your system memory. The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches.
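The trade-off can be illustrated with a small sketch. The DataFrame names and sizes below are hypothetical, and the memory benefit of pd.eval is assumed to depend on the optional numexpr package being installed; without it, pandas falls back to its plain Python engine and the results are the same but the temporaries are not avoided.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Four hypothetical DataFrames of 1000 x 10 float64 values each.
df1, df2, df3, df4 = (pd.DataFrame(rng.random((1000, 10))) for _ in range(4))

# Plain pandas: each binary operation allocates a full-size intermediate result.
plain = df1 + df2 + df3 + df4

# Roughly what the line above does, with the temporaries spelled out.
tmp1 = df1 + df2
tmp2 = tmp1 + df3
explicit = tmp2 + df4

# pd.eval receives the whole expression as a string, so it can be evaluated
# without materialising every intermediate (when numexpr is available).
via_eval = pd.eval("df1 + df2 + df3 + df4")

# Size of one temporary: 10,000 float64 values, i.e. 80,000 bytes here.
tmp_bytes = tmp1.to_numpy().nbytes
print(tmp_bytes, np.allclose(plain, via_eval))
```

Comparing tmp_bytes against available RAM and against the CPU cache size gives a rough sense of whether either consideration applies; at this toy size neither does, so timing the two forms on realistic data is the more reliable guide.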