
Find the largest directories

With human readable sizes.

📂 📂 📂

This command lists the subdirectories of a given path, along with the total disk space usage of each, sorted by largest first:

find /some/path -maxdepth 1 -mindepth 1 -type d -print0|\
    xargs -0 du -bs|\
    sort -nr -k1|\
    awk -F\\t '{
      printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2)
    }'

Example output:

0,98 GiB	/some/path/x
0,91 GiB	/some/path/y
0,50 GiB	/some/path/.cache
0,00 GiB	/some/path/.smalldir

You can use it to do a quick analysis of disk space usage in your home directory, or anywhere else.
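For repeated use, the one-liner can be wrapped in a small shell function. This is just a sketch; the function name biggest_dirs and the default path of the current directory are choices of convenience, not part of the original one-liner:

```shell
# Sketch of a reusable wrapper around the pipeline from this post.
# The name "biggest_dirs" is a hypothetical choice.
biggest_dirs() {
    find "${1:-.}" -maxdepth 1 -mindepth 1 -type d -print0 |
        xargs -0 du -bs |
        sort -nr -k1 |
        awk -F'\t' '{ printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2) }'
}

# Demo on a scratch directory with two subdirectories:
demo=$(mktemp -d)
mkdir "$demo/x" "$demo/y"
biggest_dirs "$demo"
```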

Explanation

The find(1) command lists all immediate subdirectories below the starting point /some/path. This is done using a combination of the -type d predicate with -mindepth 1 and -maxdepth 1. The root path /some/path itself has depth 0 and so will not be included. For each directory found, the path is printed with the -print0 action, which prints the path string followed by a single null byte. This byte works as a record separator, in the same fashion as the newline characters more commonly used in pipelines.

Output will look something like this:

$ find /some/path -maxdepth 1 -mindepth 1 -type d -print0
/some/path/x/some/path/.cache/some/path/y/some/path/.smalldir

Since the null byte is not a printable character, you will not see it directly in a terminal, and it looks like all the paths are mashed together into a single string. So why use this character? It is the absolute safest choice with respect to special characters and whitespace, since a null byte cannot occur in a file path. So you are pretty much guaranteed that it will always work as an unambiguous separator. Also, the xargs command, which consumes the find output in the pipeline, supports using the null byte as separator, so why not.
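To make the otherwise invisible record boundaries visible, you can translate each null byte to a newline with tr(1). This is for inspection only; a path that itself contains a newline would then be indistinguishable from two paths:

```shell
# Create a scratch directory so the example is runnable as-is:
demo=$(mktemp -d)
mkdir "$demo/x" "$demo/.cache"

# tr swaps the invisible NUL separators for visible newlines:
find "$demo" -maxdepth 1 -mindepth 1 -type d -print0 | tr '\0' '\n'
```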

|

Which brings us to the next command in the pipeline: xargs(1). This program reads input records according to rules documented in its manual page, and executes a specified command with the records passed along as command-line arguments. In this case, we tell xargs to invoke the command du -bs. We also tell xargs to treat the null byte as the input record separator, with the -0 option.

The effect is that du is invoked like this:

$ du -bs /some/path/x /some/path/.cache /some/path/y /some/path/.smalldir
1048580096	/some/path/x
536875008	/some/path/.cache
978329600	/some/path/y
4096	/some/path/.smalldir

The du(1) command can calculate disk space usage for file system objects. We tell it to output sizes in bytes with -b. The -s option tells du to summarize each argument, which means only print total usage for exactly the provided arguments (skip info about their contents).

As we can see, du prints lines by default, where each line consists of two columns, and the columns are separated by a single tab character. The first column is total size in bytes and the second column is the directory path.
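As an aside, GNU du can also read a null-delimited path list directly from standard input with --files0-from=-, which would make the xargs step optional. This is a GNU coreutils extension, so the xargs variant in this post is the more portable one:

```shell
# GNU-specific alternative sketch: du consumes the NUL-delimited list itself.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b"
find "$demo" -maxdepth 1 -mindepth 1 -type d -print0 |
    du -bs --files0-from=-
```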

|

Next, the lines are sorted using sort(1). Option -k1 means use the first column as sorting key, -n tells sort to interpret the values numerically and -r reverses the natural ordering, so we get the biggest numbers first. Sort will output the same set of lines, but of course, in sorted order:

$ ...|sort -nr -k1
1048580096	/some/path/x
978329600	/some/path/y
536875008	/some/path/.cache
4096	/some/path/.smalldir
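On GNU systems there is also sort -h, which understands human-readable size suffixes (K, M, G). With it you could let du -hs do the formatting and skip the awk step entirely, at the cost of du's coarser rounding. A sketch, assuming GNU du and sort:

```shell
# GNU alternative: human-readable du output sorted with sort -h.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b"
find "$demo" -maxdepth 1 -mindepth 1 -type d -print0 |
    xargs -0 du -hs |
    sort -hr -k1
```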

|

Lastly, awk(1) is used to transform the lines so that the raw byte sizes are converted to a more human readable form. With -F\\t we tell awk to treat only the tab character as the column separator, which ensures every line is parsed as exactly two fields¹. Otherwise, paths containing spaces would give us problems here.

The awk program simply executes a printf("%.2f GiB\t%s\n"...) call for each line of input, using the size (in gibibytes²) as the first format argument and the directory path as the second. Field values are automatically available in the numbered variables $1, $2, ... Numbers are formatted as floating point, rounded to two decimals.

$ ...|awk -F\\t '{
      printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2)
    }'
0,98 GiB	/some/path/x
0,91 GiB	/some/path/y
0,50 GiB	/some/path/.cache
0,00 GiB	/some/path/.smalldir
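The conversion step can be tried in isolation on a fixed input line. Note that with the C locale, printf uses a dot as the decimal separator; the commas in the output above come from the author's locale:

```shell
# Demo of just the awk step on a fixed, known input.
# LC_ALL=C pins the decimal separator to "." for a predictable result.
printf '1073741824\t/tmp/example\n' |
    LC_ALL=C awk -F'\t' '{ printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2) }'
# prints: 1.00 GiB	/tmp/example
```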
  1. In awk terminology, columns are called fields and lines are called records. F is a mnemonic for Field separator. ↩︎
  2. Often confused with gigabytes, and nearly the same thing. 1 gibibyte is 1024³ bytes, while 1 gigabyte is 1000³ bytes. You can modify the awk program to suit your human readable size preference. ↩︎
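As the second footnote suggests, the awk program is easy to adapt to other units, for example decimal gigabytes instead of gibibytes:

```shell
# Same conversion, but dividing by powers of 1000 for decimal gigabytes.
# LC_ALL=C pins the decimal separator to "." for a predictable result.
printf '1000000000\t/tmp/example\n' |
    LC_ALL=C awk -F'\t' '{ printf("%.2f GB\t%s\n", $1/1000/1000/1000, $2) }'
# prints: 1.00 GB	/tmp/example
```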