How to show a progress bar while unzipping tons of files

2019/04/14 12:00am

The original article is here.

In the machine learning field, there are plenty of public datasets for model training. Usually, such a dataset is provided as a zip archive, so we can just download it and unarchive it with our good old friend unzip:

$ unzip /home/data/large_dataset.zip

But if you are working in a Jupyter notebook, a plain unzip command might flood your screen with output, and you’ll soon notice the frontend UI slowing down as the output grows.

!unzip /home/data/large_dataset.zip -d /home/data/
Archive:  /home/data/large_dataset.zip
  inflating: /home/data/large_dataset/1/1900_753325_0060.png  
  inflating: /home/data/large_dataset/1/1900_754949_0023.png  
  inflating: /home/data/large_dataset/1/1900_758495_0075.png  
  inflating: /home/data/large_dataset/1/1900_761460_0029.png  
  inflating: /home/data/large_dataset/1/1900_766994_0030.png  
  inflating: /home/data/large_dataset/1/1900_776319_0015.png  
  ...
  (80K more lines)

So people often silence it by redirecting the output:

!unzip /home/data/large_dataset.zip -d /home/data/ > /dev/null

or with the short -q option:

!unzip -q /home/data/large_dataset.zip -d /home/data/

Got it! The unzip command now silently unarchives the files, no performance penalty, everything is all right, just waiting … (a few minutes pass) … How soon can I expect unzip to finish?

pv command to the rescue!

To keep unzip from making us anxious or sleepy, we want a progress bar that periodically reports how far the extraction has come. The pv command solves this problem: it can display a progress bar for any data piped through it. After a few tries, I finally got the result I wanted:

n_files = !unzip -l /home/data/large_dataset.zip | grep .png | wc -l
!unzip -o /home/data/large_dataset.zip -d /home/data/ | pv -l -s {n_files[0]} > /dev/null
...
80.1k 0:00:09 [8.54k/s] [====================================>] 100%

Nice! Very helpful. What’s going on here is:

  1. Run unzip -l to get the number of files to process. We have to filter the lines with grep and count them with wc -l because the unzip -l output also includes directories, as well as header and summary lines.
  2. Pipe the output of unzip into pv with -l, so that pv counts lines instead of bytes, and pass the number of files to the -s (size) option so the indicator is meaningful.
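
If pv is not available, the same two steps (count the files first, then report progress against that total) can be sketched in plain Python with the standard zipfile module. This is only a minimal sketch and not what the commands above do: the function name extract_with_progress and its report_every and out parameters are made up for illustration, and extracting entries one at a time from Python may well be slower than unzip.

```python
import sys
import zipfile


def extract_with_progress(zip_path, dest_dir, report_every=1000, out=sys.stderr):
    """Extract zip_path into dest_dir, rewriting a one-line progress report.

    Hypothetical helper for illustration. Returns the number of files
    extracted (directory entries are not counted).
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Step 1: count the entries to process, skipping directory entries.
        members = [info for info in archive.infolist() if not info.is_dir()]
        total = len(members)

        # Step 2: extract one entry at a time and report against the total.
        for done, info in enumerate(members, start=1):
            archive.extract(info, dest_dir)
            if done % report_every == 0 or done == total:
                percent = 100 * done // total
                # "\r" rewrites the same line instead of adding a new one,
                # so the notebook output stays a single line.
                out.write(f"\r{done}/{total} files ({percent}%)")
                out.flush()
        out.write("\n")
    return total
```

In a notebook cell this could be called as extract_with_progress("/home/data/large_dataset.zip", "/home/data/"). Reporting only every report_every files keeps the output small, which is the same concern that motivated silencing unzip in the first place.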

That’s all! It’s about time to return to my notebook and try some experiments 😋