
Find the largest directories

With human readable sizes.

📂 📂 📂

This command lists the subdirectories of a given path, along with the total disk space usage of each, sorted largest first:

find /some/path -maxdepth 1 -mindepth 1 -type d -print0|\
    xargs -0 du -bs|\
    sort -nr -k1|\
    awk -F\\t '{
      printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2)
    }'

Example output:

0,98 GiB	/some/path/x
0,91 GiB	/some/path/y
0,50 GiB	/some/path/.cache
0,00 GiB	/some/path/.smalldir

You can use it to do a quick analysis of disk space usage in your home directory, or anywhere else.
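
For example, to run the same analysis on your own home directory, substitute $HOME for /some/path:

find "$HOME" -maxdepth 1 -mindepth 1 -type d -print0|\
    xargs -0 du -bs|\
    sort -nr -k1|\
    awk -F\\t '{
      printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2)
    }'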

Explanation

The find(1) command lists all immediate subdirectories below the starting point /some/path, using a combination of the -type d predicate with -mindepth 1 and -maxdepth 1. The starting point /some/path itself has depth 0 and so is not included. For each directory found, the -print0 action prints the path string followed by a single null byte. This byte works as a record separator, much like the newline characters more commonly used in pipelines.

Output will look something like this:

$ find /some/path -maxdepth 1 -mindepth 1 -type d -print0
/some/path/x/some/path/.cache/some/path/y/some/path/.smalldir

Since the null byte is not a printable character, you will not see it directly in a terminal, and it looks like all the paths are mashed together into a single string. So why use this character? It is the safest choice with respect to special characters and whitespace: a null byte cannot occur in a file path, so it is guaranteed to work as an unambiguous separator. Conveniently, the xargs command, which consumes the find output in the pipeline, supports the null byte as an input separator.
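
If you want to see the record boundaries for yourself, a quick sketch is to translate each null byte to a newline with tr(1):

$ find /some/path -maxdepth 1 -mindepth 1 -type d -print0 | tr '\0' '\n'
/some/path/x
/some/path/.cache
/some/path/y
/some/path/.smalldir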

|

Which brings us to the next command in the pipeline: xargs(1). This program reads input records and executes a specified command with one or more of those records as command line arguments (the batching rules are documented in its manual page). In this case, we tell xargs to invoke du -bs, and the -0 option makes it treat the null byte as the input record separator.
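
As a minimal standalone illustration (with made-up input, not part of the pipeline), you can feed xargs two null-separated records and watch them survive intact as arguments, whitespace and all:

$ printf '%s\0' 'a b' 'c d' | xargs -0 printf '[%s]\n'
[a b]
[c d]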

The effect is that du is invoked like this:

$ du -bs /some/path/x /some/path/.cache /some/path/y /some/path/.smalldir
1048580096	/some/path/x
536875008	/some/path/.cache
978329600	/some/path/y
4096	/some/path/.smalldir

The du(1) command calculates disk space usage for file system objects. The -b option makes it report sizes in bytes (in GNU du it is shorthand for --apparent-size --block-size=1). The -s option tells du to summarize: print only the total for each provided argument, skipping details about their contents.

As we can see, du prints one line per argument by default, with two columns separated by a single tab character: the first column is the total size in bytes and the second column is the directory path.
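
If you are curious, you can make the tab separator visible; with GNU cat, the -A option renders tabs as ^I and line ends as $ (a sketch reusing the example path above):

$ du -bs /some/path/x | cat -A
1048580096^I/some/path/x$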

|

Next, the lines are sorted using sort(1). The -k1 option selects the first column as the sorting key, -n tells sort to interpret the values numerically, and -r reverses the order, so we get the biggest numbers first. Sort outputs the same set of lines, just in sorted order:

$ ...|sort -nr -k1
1048580096	/some/path/x
978329600	/some/path/y
536875008	/some/path/.cache
4096	/some/path/.smalldir
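
The -n option matters here: without it, sort compares lines lexicographically, and 978329600 would sort above 1048580096 because "9" is greater than "1". A quick demonstration with made-up input:

$ printf '978329600\tb\n1048580096\ta\n' | sort -nr -k1
1048580096	a
978329600	b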

|

Lastly, awk(1) is used to transform the lines so that the raw byte sizes are converted to a more human readable form. We tell awk to only consider the tab character as column separator with -F\\t, which ensures all lines are parsed as exactly two fields [1]. Otherwise, paths with spaces would cause problems here.

The awk program executes the printf("%.2f GiB\t%s\n"...) function for each line of input, using the size (converted to gibibytes [2]) as the first format argument and the directory path as the second. Field values are automatically available in the numbered variables $1, $2, ... Numbers are formatted as floating point, rounded to two decimals.

$ ...|awk -F\\t '{
      printf("%.2f GiB\t%s\n", $1/1024/1024/1024, $2)
    }'
0,98 GiB	/some/path/x
0,91 GiB	/some/path/y
0,50 GiB	/some/path/.cache
0,00 GiB	/some/path/.smalldir
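
If you prefer decimal gigabytes over gibibytes, only the divisor and the label in the awk stage need to change; a minimal variant:

awk -F\\t '{
  printf("%.2f GB\t%s\n", $1/1000/1000/1000, $2)
}'
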
  1. In awk terminology, columns are called fields and lines are called records. F is a mnemonic for Field separator.
  2. Often confused with gigabytes, and nearly the same thing. 1 gibibyte is 1024³ bytes, while 1 gigabyte is 1000³ bytes. You can modify the awk program to suit your human readable size preference.

Find the largest files in a directory tree

Display human readable sizes.

🗄 🗄 🗄

This command lists the 20 biggest files anywhere in a directory tree, biggest first:

find /some/path -type f -printf '%s\t%P\n'|\
     sort -nr -k1|\
     awk -F\\t '{
         printf("%.2f MiB\t%s\n", $1/1024/1024, $2);
       }'|\
     head -n 20

Example output:

1000,00 MiB	x/file4.dat
500,00 MiB	y/file5.dat
433,00 MiB	y/z/file6.dat
300,00 MiB	file3.dat
20,00 MiB	file2.dat
1,00 MiB	file1.dat
0,00 MiB	tiny.txt

Replace /some/path with a directory path of your choice.

Explanation

The find(1) command is used to gather all the necessary data. For each file in the directory tree /some/path, the -printf action is invoked to produce a line containing the file size in bytes (%s) and the path relative to the starting point (%P).

$ find /some/path -type f -printf '%s\t%P\n'
1048576000	x/file4.dat
1048576	file1.dat
314572800	file3.dat
524288000	y/file5.dat
454033408	y/z/file6.dat
0	tiny.txt
20971520	file2.dat

The lines use a tab character as a column separator, which is typical.

These lines are piped to sort(1), which sorts on the first column (-k1) numerically (-n) and in reverse order (-r), biggest first. Then, awk(1) is used to format the lines with human readable file sizes in MiB [1] using printf. Awk is explicitly told to use a single tab character as the field/column separator with -F\\t. Finally, we apply head(1) to limit the output to 20 lines.
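
As a side note: if GNU coreutils' numfmt(1) is available, it can replace the awk stage. It is not a drop-in equivalent, since numfmt picks a suitable unit per line (K, M, G, ...) instead of a fixed MiB, but it saves some typing:

find /some/path -type f -printf '%s\t%P\n'|\
     sort -nr -k1|\
     numfmt --field=1 --to=iec|\
     head -n 20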

  1. Short for mebibytes, which are commonly confused with megabytes and nearly the same thing. 1 mebibyte is 1024² bytes, while 1 megabyte is 1000² bytes.

Find most recently modified files in a directory tree

Display human readable timestamps.

🕒 🕑 🕐

find . -printf '%T@\t%p\n'|\
     sort -nr -k1|\
     awk -F\\t '{print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}'

This command lists all files and directories recursively and sorts the output such that most recently modified files and directories come first. For each output line, the modification timestamp is displayed in a human readable format. It uses find(1), sort(1) and awk(1) to efficiently accomplish this task.

Example:

$ find /tmp/tree -printf '%T@\t%p\n'|sort -nr -k1|awk -F\\t '{print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}'
2023-08-28 11:49        /tmp/tree/c
2023-08-28 11:46        /tmp/tree/d
2023-08-28 11:46        /tmp/tree
2023-08-25 11:49        /tmp/tree/c/1.txt
2023-07-29 11:51        /tmp/tree/a
2022-01-05 10:51        /tmp/tree/a/b/2.txt
2022-01-05 10:51        /tmp/tree/a/b/1.txt
2022-01-05 10:51        /tmp/tree/a/b
2018-03-07 10:50        /tmp/tree/c/2.txt
2015-06-11 11:50        /tmp/tree/c/3.txt

Command pipeline breakdown and explanation

The find command lists all files and directories under the provided path in file system order and invokes the -printf action for each path. The format string specifies that the first column should be the path's modification time as a number (%T@), and the second column the path itself (%p).

$ find /tmp/tree -printf '%T@\t%p\n'
1693215986.8473050140   /tmp/tree
1693216167.2705184680   /tmp/tree/c
1434016210.0000000000   /tmp/tree/c/3.txt
1520416204.0000000000   /tmp/tree/c/2.txt
1692956994.0000000000   /tmp/tree/c/1.txt
1690624265.0000000000   /tmp/tree/a
1641376305.0000000000   /tmp/tree/a/b
1641376313.0000000000   /tmp/tree/a/b/2.txt
1641376313.0000000000   /tmp/tree/a/b/1.txt
1693215986.8473050140   /tmp/tree/d

|

The sort command reads lines from standard input and sorts them using the first column as sorting key and interpreting those values numerically. It outputs the sorted lines in descending/reverse order:

$ ...|sort -nr -k1
1693216167.2705184680   /tmp/tree/c
1693215986.8473050140   /tmp/tree/d
1693215986.8473050140   /tmp/tree
1692956994.0000000000   /tmp/tree/c/1.txt
1690624265.0000000000   /tmp/tree/a
1641376313.0000000000   /tmp/tree/a/b/2.txt
1641376313.0000000000   /tmp/tree/a/b/1.txt
1641376305.0000000000   /tmp/tree/a/b
1520416204.0000000000   /tmp/tree/c/2.txt
1434016210.0000000000   /tmp/tree/c/3.txt

Options breakdown:

  -n    numeric interpretation of the values used for sorting lines
  -r    descending/reversed order
  -k1   sorting key: use the first column of each line as the sorting value

|

The awk command transforms the lines so that timestamps are presented in a human readable form.

$ ...|awk -F\\t '{print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}'
2023-08-28 11:49        /tmp/tree/c
2023-08-28 11:46        /tmp/tree/d
2023-08-28 11:46        /tmp/tree
2023-08-25 11:49        /tmp/tree/c/1.txt
2023-07-29 11:51        /tmp/tree/a
2022-01-05 10:51        /tmp/tree/a/b/2.txt
2022-01-05 10:51        /tmp/tree/a/b/1.txt
2022-01-05 10:51        /tmp/tree/a/b
2018-03-07 10:50        /tmp/tree/c/2.txt
2015-06-11 11:50        /tmp/tree/c/3.txt

Options breakdown:

  -F\\t
      tell awk to use the tab character as field separator, used to split each input line into columns

  {print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}
      the awk program itself; for each line, print() a formatted timestamp by parsing the first column with strftime(), then a tab character, then the path from the second column unchanged, forming one line of output. (In awk syntax, values written in sequence are concatenated, so the print function receives a single string argument.)
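
The concatenation rule can be seen in isolation with a trivial one-liner (no input needed; note that strftime() is a GNU awk extension, not part of POSIX awk):

$ awk 'BEGIN { print "2023-08-28 11:46" "\t" "/tmp/tree/d" }'
2023-08-28 11:46	/tmp/tree/d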

Variations

List oldest first

By removing the -r option to sort, we get ascending order, i.e. older timestamps before newer ones:

find /tmp/tree -printf '%T@\t%p\n'|\
    sort -n -k1|\
    awk -F\\t '{print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}'

Limit number of lines output

By adding head(1) to the pipeline we can limit the output to any desired number of lines:

find /tmp/tree -printf '%T@\t%p\n'|\
    sort -nr -k1|\
    awk -F\\t '{print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}'|\
    head -n 10

List only files

By adding the -type f predicate to the find command, we can exclude all directories from the output:

find /tmp/tree -type f -printf '%T@\t%p\n'|\
    sort -nr -k1|\
    awk -F\\t '{print strftime("%Y-%m-%d %H:%M",$1) "\t" $2}'

(Or vice versa, we can use -type d to include only directories.)

Find single most recently modified path under each top level directory

This one is more specialized: for each top level directory, it lists only the single most recently modified sub-path, sorted by modification time:

find /tmp/tree -mindepth 2 -printf '%T@\t%P\n'|\
       sort -nr -k1|\
       awk -F\\t '
         {
           topdir=$2;
           gsub(/\/.*/, "", topdir);
           if (topdir=="") { topdir="." };
           if (!seen[topdir]) { 
             seen[topdir]=1;
             print strftime("%Y-%m-%d %H:%M",$1) "\t" $2;
           }
         }'

This might be useful to get an indication of which directory trees have recently been updated vs old trees that rarely see any file system writing activity. (The modification time of the top level directories themselves is often not a good indicator, since it does not reflect modifications deeper down in the tree.)

To accomplish this we tell find to only print paths two or more levels below the search path root /tmp/tree with -mindepth 2 and add a subtle change to the format pattern which strips the search path prefix: %P. Also, we need to expand the awk program so that it only prints the first path per encountered top level directory.
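
The seen array is the classic awk first-occurrence filter, equivalent to the well known !seen[key]++ idiom; in isolation, with made-up input, it behaves like this:

$ printf 'c\nd\nc\na\nc\n' | awk '!seen[$0]++'
c
d
a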

Example output:

2023-08-25 11:49	c/1.txt
2022-01-05 10:51	a/b/2.txt

So the most recently modified subtree is c/, with most recently modified file c/1.txt. (To include all files directly below the search path, you can change -mindepth to 1.)