
Locking critical sections in shell scripts

A critical section is some piece of code which, due to its nature and effects, should be executed by at most one thread or process at a time. If such code is executed concurrently, the results often become undefined and arbitrary. Atomically locking a shared resource is a common pattern to synchronize execution of critical sections and ensure mutually exclusive access.

Shell scripts mostly deal in processes and files, and there are several common scenarios where code is actually a critical section. If such code is run in several processes concurrently, it could introduce race conditions and arbitrary results. Consider a script that starts a background process if it is not already running – a typical pattern to start a singleton daemon process. Such code is a critical section, because you can end up with two running daemons if the code runs concurrently (and the daemon does not check for other instances of itself). Another good example is multiple scripts writing to a shared file, or even a shared directory structure.

There are a few strategies to implement locking in shell scripts, some better than others. In this post, I will focus on one of the most robust ways: using flock(1). This nice tool gives you access to kernel-level file locking from your shell. It has some clear advantages over traditional existence-based file locking:

  1. It is truly atomic.
  2. The kernel manages the locks and releases them automatically when lock owning processes die. So no more stale lock files to clean up.
  3. You can block and wait for a lock, indefinitely or with a timeout, and instantly get it when another process frees it. (No more lock-polling loops with sleeps.)

It is important to understand that file locks are tied to both a file on the file system and running processes with open file descriptors to it. Even if a file used for locking exists on the file system, it does not mean the lock is taken! The file system acts as a namespace of shared resources to which we can attach locks. Also note that the locks are advisory only – if a process does not care to check for locks, it will not participate in any synchronization and can do whatever it pleases.
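
Since the locks live in the kernel, you can probe whether anything currently holds a lock on a file without blocking. A quick sketch, using flock's ability to run a command under the lock (here against the job.lock file used in the examples below):

if flock -n job.lock true; then
    echo "nothing holds a lock on job.lock right now"
else
    echo "some process holds a lock on job.lock"
fi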

Shell script with locking functions

We will look at a script which needs to protect a critical section with locking. The locking shall be done on a common file job.lock, which means any process with access to that file can obtain or check for a lock. To code along, you can copy the script to your own file and run the examples.

job.sh

#!/bin/bash

lock_acquire() {
    # Open a file descriptor to lock file
    exec {LOCKFD}>job.lock || return 1

    # Block until an exclusive lock can be obtained on the file descriptor
    flock -x $LOCKFD
}

lock_release() {
    test "$LOCKFD" || return 1
    
    # Close lock file descriptor, thereby releasing exclusive lock
    exec {LOCKFD}>&- && unset LOCKFD
}

lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }

# --- Begin critical section ---

if [ -f job.dat ]; then
    value=$(<job.dat)
else
    value=0
fi
value=$((value + 1))

echo $value >job.dat

# --- End critical section ---

lock_release

The lock_acquire function uses flock -x N to obtain an exclusive lock on file descriptor N. Since the file descriptor is opened by the script process itself, it will be the owner of the lock after flock exits. flock is able to lock the descriptor because the descriptor is inherited from the shell process that started it. The critical section reads a number from a file if it exists, increments it by one, and writes the updated number back to the file.
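
If blocking indefinitely on the lock is undesirable, flock's -w (wait with timeout) and -n (non-blocking) options let the acquisition fail instead. A sketch of a timeout variant of the function above (the name and default timeout are illustrative):

lock_acquire_timeout() {
    local timeout=${1:-10}

    # Open a file descriptor to lock file
    exec {LOCKFD}>job.lock || return 1

    # Give up if an exclusive lock cannot be obtained within $timeout seconds
    if ! flock -x -w "$timeout" "$LOCKFD"; then
        exec {LOCKFD}>&- && unset LOCKFD
        return 1
    fi
}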

Testing

First we’ll run the job script once:

$ bash job.sh 
$ ls
job.dat  job.lock  job.sh
$ cat job.dat 
1

A job.dat file is produced with a value of 1, which is entirely expected and not very interesting.

Next we’ll start 100 job processes asynchronously as fast as possible in the background, which means that many of them will run concurrently. We do this two times:

$ rm job.dat 
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat 
100
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat 
200

The for loop is started in a subshell to avoid job control messages. The data has been incremented exactly 100 times after the first run, and by another 100 after the second. The code in the critical section reads, updates and then writes to the shared file, and doing this without locking would not work consistently.

Actually, let us try that, by commenting out the lock_acquire call in the script:

[...]

#lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }

# --- Begin critical section ---

Then we run the test again:

$ rm job.dat
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
3
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
14

This ends with a final result of 14, which is clearly incorrect and arbitrary. The results will vary with each run and depend on things like the speed of your computer.

Releasing the lock?

In this case the script does not actually need to release the lock right before it exits, because the kernel will do that automatically when the process exits anyway. We will try it by re-enabling the locking call and commenting out the lock_release call:

[...]

lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }

# --- Begin critical section ---
[...]
# --- End critical section ---

#lock_release

And run the test:

$ rm job.dat
$ (for i in {1..100}; do bash job.sh & done; wait)
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
200

It still works fine.

Starting a daemon process from your shell init scripts

A common use case is starting a single daemon process from your shell init scripts, unless one is already running – you only ever want one instance of it. Consider the following:

if ! ps -ef|grep some-daemon|grep -qv grep; then
    some-daemon & pid=$!
    echo Started some-daemon with pid $pid
fi

This code is racy unless it is protected by locking. If you were to start two terminals more or less simultaneously, both executing your shell init scripts, you could possibly end up with two running daemon processes.

To protect this with flock, you could do the following:

if exec {bashrc_fd}<~/.bashrc && flock -nx $bashrc_fd; then
    # --- Begin critical section ---
    if ! ps -ef|grep some-daemon|grep -qv grep; then
        some-daemon & pid=$!
        echo Started some-daemon with pid $pid
    fi
    # --- End critical section ---

    flock -u $bashrc_fd && exec {bashrc_fd}>&-
fi

Here we open a read-only file descriptor to ~/.bashrc and then try to grab an exclusive lock on it, non-blocking, using option -n. If some other bash process is already executing that part of the init file, flock will fail and exit immediately with a non-zero code, so the block is skipped. The effect is that only one bash process will execute the code, and others running at the same time will skip it.

You may notice that we explicitly release the lock using flock -u $bashrc_fd after the critical section. Normally it is enough to close the file descriptor used for locking, but child processes may inherit such descriptors and keep them open. So the parent process closing its copy of the descriptor may not be enough to actually release the lock. Therefore we do it explicitly.
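
If your bash supports it, an alternative is to prevent the inheritance in the first place, by closing the lock descriptor in the daemon's own redirections when starting it (a sketch, relying on the same {varname} redirection form used above):

some-daemon {bashrc_fd}>&- & pid=$!

With this, the daemon never holds a copy of the locked descriptor, so closing it in the parent is enough.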

Closing notes

The manual page for the flock command lists a few good examples of how you can use it in your scripts. However, none of those examples show how you can make the current shell process own and control the locks, without wrapping critical sections/commands in subshells or extra flock invocations.

The manual page for the flock(2) system call is a good read if you are interested in more details about how it works.

Read more about handling file descriptors with Bash in this part of the bash(1) manual.


How to make a shell script log JSON-messages

If you have a shell script running in some environment where logs are expected to be formatted as JSON, it can be cumbersome to ensure that all commands in the script output valid single line JSON-formatted messages, instead of the raw lines of text a shell script commonly produces. Here I present a technique that requires very few modifications to the script to make it output structured JSON instead of raw text lines.

We will set up a bash script so that its regular output is redirected to a JSON encoder co-process automatically. This is done at the beginning of the script, and subsequent commands' output will automatically be wrapped in JSON messages. It requires the jq command to be present on the system where the script runs.

#!/usr/bin/env bash

# 1 Make copies of shell's current stdout and stderr file
# descriptors:
exec 100>&1 200>&2

# 2 Define function which logs arguments or stdin as JSON-message to stdout:
log() {
    if [ "$1" = - ]; then
        jq -Rsc '{"@timestamp": now|strftime("%Y-%m-%dT%H:%M:%S%z"),
                  "message":.}' 1>&100 2>&200
    else
        jq --arg m "$*" -nc '{"@timestamp": now|strftime("%Y-%m-%dT%H:%M:%S%z"),
                              "message":$m}' 1>&100 2>&200
    fi
}

# 3 Start a co-process which transforms input lines to JSON messages:
coproc JSON_LOGGER { jq --unbuffered -Rc \
      '{"@timestamp": now|strftime("%Y-%m-%dT%H:%M:%S%z"),
        "message":.}' 1>&100 2>&200; }

# 4 Finally redirect shell's stdout/stderr to JSON logger
# co-process:
exec 1>&${JSON_LOGGER[1]} 2>&${JSON_LOGGER[1]}

# What follows is whatever you need your script to do

echo Hello brave world
echo '  testing "escaping" and white  space  '
echo >&2 this goes to stderr

uname

# If we want multiple output lines from a single command
# wrapped in a single JSON-message, we need to pipe it to the log function:
curl -sS --head https://api.github.com/|head -n 3|log -

# .. otherwise, each curl output line would become its own
# JSON-encoded message, which may not be desirable.

Output to a terminal with color support should look something like this:

{"@timestamp":"2021-07-01T14:55:15+02:00","message":"Hello brave world"}
{"@timestamp":"2021-07-01T14:55:15+02:00","message":"  testing \"escaping\" and white  space  "}
{"@timestamp":"2021-07-01T14:55:15+02:00","message":"this goes do stderr"}
{"@timestamp":"2021-07-01T14:55:15+02:00","message":"Linux"}
{"@timestamp":"2021-07-01T14:55:16+02:00","message":"HTTP/2 200 \r\nserver: GitHub.com\r\ndate: Thu, 01 Jul 2021 12:55:10 GMT\r"}

Jq will take care of all the necessary escaping and always produce valid single line JSON-structured messages, regardless of message payload.

Notes

  • You could make the above setup reusable, by putting the code in its own file and sourcing it at the beginning of scripts that need it.
  • High precision timestamps cannot be generated natively in jq. If you need millisecond or better precision on the log event timestamps, there are ways to do it, depending on the shell and command line tools available. If you have bash >= 5, you can modify the log() function so that the timestamp string is generated using the expression:
    $(date +%Y-%m-%dT%H:%M:%S).${EPOCHREALTIME#*[.,]}$(date +%z). You'll need to pass this as an --arg to jq for each invocation and use it in the JSON template. It is harder to accomplish for the log transformer co-process, because ideally we'd like only a single persistent jq process running, both for efficiency and to avoid pipeline buffering.
  • You could easily expand the structured log messages by adding JSON fields to the jq templates. For example, you could add a level field, to indicate log level. Or a hostname field using the $HOSTNAME shell variable.
  • Building on the previous point, you could create two separate JSON encoder functions, where one handles stderr messages and logs them at level ERROR, while the other logs regular stdout at level INFO. Then create two co-processes for stdout and stderr, with different jq JSON templates respectively. Finally, redirect the main stdout to the first co-process and stderr to the second (see the sketch after this list).
  • For the log transformer co-process, be aware that pipeline buffering can have unfortunate effects, for instance missing the last log events before script exits. This is why jq is invoked with --unbuffered.
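
Building on those notes, here is a minimal sketch of such a two-channel setup. The field names and levels are illustrative, and note that bash formally supports only one active co-process at a time (it may print a warning when a second one is started), so we copy each co-process' write descriptor right away:

#!/usr/bin/env bash

# Keep copies of the real stdout/stderr for the encoders to write to:
exec 100>&1 200>&2

# INFO encoder for regular stdout:
coproc INFO_LOGGER { jq --unbuffered -Rc \
      '{"@timestamp": now|strftime("%Y-%m-%dT%H:%M:%S%z"),
        "level":"INFO", "message":.}' 1>&100 2>&200; }
exec {info_fd}>&"${INFO_LOGGER[1]}"

# ERROR encoder for stderr:
coproc ERROR_LOGGER { jq --unbuffered -Rc \
      '{"@timestamp": now|strftime("%Y-%m-%dT%H:%M:%S%z"),
        "level":"ERROR", "message":.}' 1>&100 2>&200; }
exec {error_fd}>&"${ERROR_LOGGER[1]}"

# Route the script's stdout and stderr to their respective encoders:
exec 1>&$info_fd 2>&$error_fd

echo this becomes an INFO message
echo >&2 this becomes an ERROR message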


Non-blocking I/O server

.. in Kotlin on the JVM

Recently I did a little experiment to learn more about non-blocking I/O on the JVM, using classes from the java.nio package. Until recently this has been unexplored territory for me, except for occasional use of the java.nio.ByteBuffer and file channel classes. It is interesting to see what kind of performance you can get out of a single thread serving thousands of concurrent clients, letting the operating system do all the heavy lifting of multiplexing the I/O requests. Traditionally, you would write a server using dedicated per-client threads, which all block and wait while communicating with a client, while the main thread is only responsible for accepting new clients and managing the other threads.

This post will go through the main server code and explain it in detail, mostly so that I can look back at it later for reference. All the code is available in a Github repository: nioserver.

The service

The experiment implements a server that functions solely as a receiver of messages over TCP, storing those messages in memory and allowing other code to fetch them concurrently. It supports any number of clients, and a client can send any number of messages. The messages themselves are variable length UTF-8 encoded strings, and the stream protocol requires each of them to be terminated by a single null byte (an end message marker).
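
To make the stream protocol concrete, a minimal hypothetical client could look like this (host and port are assumptions for illustration):

import java.net.Socket

fun main() {
    // Connect and send two null-terminated UTF-8 messages
    Socket("localhost", 9999).use { socket ->
        val out = socket.getOutputStream()
        for (msg in listOf("hello", "nio")) {
            out.write(msg.toByteArray(Charsets.UTF_8))
            out.write(0)   // the end message marker
        }
        out.flush()
    }
}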

The server code

The server is implemented in file NIOServer.kt. It is all written in Kotlin.

class NIOServer: Runnable, AutoCloseable {

The server class implements Runnable to become the target of a dedicated Java thread, while also implementing AutoCloseable to enable Kotlin use { } blocks with server instances.
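
For example, a server instance can be scoped with use { } like this (the port is chosen for illustration):

NIOServer(port = 9999).use { server ->
    // interact with the server here; close() is called automatically at the end
}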

Constructor and server initialization

  constructor(host: String = "localhost",
              port: Int = 0,
              receiveBufferSize: Int = 32 * 1024,
              messageStoreCapacity: Int = 1024) {

By default, server instances will listen on some free port on localhost. The constructor allows adjusting the per-client message reception buffer size, which impacts server memory usage, as well as the maximum number of decoded messages the server will queue in memory (as String objects).

    recvBufSize = receiveBufferSize
    messages = ArrayBlockingQueue(messageStoreCapacity)
    selector = Selector.open()
    serverSocket = ServerSocketChannel.open().apply {
      bind(InetSocketAddress(host, port))
      configureBlocking(false)
      register(selector, SelectionKey.OP_ACCEPT)
    }

    // Server runs in a single dedicated daemon thread
    Thread(this).apply {
      name = "NIOServer-thread-${getPort()}"
      isDaemon = true
      start()
    }

  }

The rest of the constructor initializes various networking objects. It binds the server socket, opens a java.nio.channels.Selector and registers the server channel for client accept requests, then immediately starts the server daemon thread so that clients can connect. Importantly, we configure the java.nio.channels.ServerSocketChannel as non-blocking, which is not the default.

The selector is what enables the single server thread to handle many clients simultaneously in an efficient manner. It allows the server code to react to I/O events from the operating system that are ready to be processed immediately without blocking.

Main event loop

After the server instance is constructed, it will be possible to connect to the port it is bound to. If we follow the server side code flow, we jump to the run() method which the server daemon thread executes:

override fun run() {
  while (selector.isOpen) {
    try {
      selector.select { selectionKey ->
        if (selectionKey.isAcceptable) {
          selectionKey.acceptClient()
        }
        if (selectionKey.isReadable) {
          selectionKey.receive()
        }
      }
    } catch (closed: ClosedSelectorException) {
    } catch (e: IOException) {
      log("error: ${e.javaClass.simpleName}: ${e.message}")
    }
  }
  log("closed")
}

This is the server main event loop, which runs until the server selector is closed. From this point on, the selector is what drives the server code to do useful things. This server can do only two things:

  1. Accept new clients that connect to it.
  2. Receive messages from connected clients.

When either of these two events happens, the selector will make the event available via the select() call, which is where the server thread rests when it is idle.

For anything to start happening, a client needs to connect. Initially, only one channel is registered with the selector: the server socket channel which accepts new connections. When a client connects, the selector will signal that the server socket channel is ready for a non-blocking accept of a new client, via its java.nio.channels.SelectionKey. A selection key can be described as a handle specific to a single channel; from it you can query what kind of non-blocking operations are ready and access the channel instance itself. Notice that the server does not need any data structures of its own to keep track of individual clients – everything is hidden behind the selector and the selection keys it provides.

The server handles client connects and data reads through local Kotlin extension functions on the SelectionKey class, since these are both events which are specific to a single client.

Event driven server code diagram

The following diagram shows how events propagate from network activity through the kernel and JVM, and how they are then processed serially, without blocking, by the single server thread as they become ready.

Figure 1: event driven worker thread

Accepting new clients

private fun SelectionKey.acceptClient() {
  val clientChannel = (channel() as ServerSocketChannel).accept()

  clientChannel.configureBlocking(false)

  clientChannel.register(selector, SelectionKey.OP_READ)
   .attach(ByteBuffer.allocate(recvBufSize))
}

The server code for accepting a new client is rather simple. It gains access to the server socket channel through the selection key channel() method, which it uses to actually call accept(). The result of the accept call is a client channel, which can be used to communicate with the client. Here we ensure the client channel is configured as non-blocking, and we then register the channel in our selector for read operations. (This server never writes back to clients.)

In addition to registering the client channel, we attach an allocated ByteBuffer to the returned client SelectionKey instance. Selection key attachments can be any object which your code requires to be associated with a client channel.

Receiving messages from client

After a client has been registered with the server selector, the server can begin reading whatever the client sends.

private fun SelectionKey.receive() {
  val channel = channel() as SocketChannel
  val buffer = attachment() as ByteBuffer

The message reception is handled by the SelectionKey.receive() extension function, which is called whenever data can be received from a client without blocking. Both the client channel and the client specific buffer are accessible via the selection key, as can be seen above. The main job of this function is to read bytes from the client channel, extract and decode messages present in the stream, and store those messages in a queue. Also, when a client disconnects, the receive function will unregister it from the selector and close the channel.

ByteBuffer basics

The ByteBuffer class is typically used for data exchange when reading or writing data to channels. I think of it as a very convenient wrapper around a piece of memory (a byte[] array), although its inner workings are more complex. It also functions as an abstraction over native operating system memory allocation and memory mapped files, but neither of those advanced features is relevant for this project.

We create an initially empty ByteBuffer by using allocate(n), where n is the desired number of bytes the buffer can hold – its capacity(). Initially when empty, this buffer’s position() will be 0, and its limit() will be equal to its capacity(). These concepts are essential to understanding how to use it and the other methods it provides.

Figure 2: initial empty ByteBuffer of size n.

The limit() functions as a marker and is not always equal to the capacity() of the buffer, as we shall see. When writing to the buffer, the limit tells how far you should write bytes, and when reading it tells how far you should read before stopping.

We can write data into the buffer in several different ways, but here we will focus only on perhaps the simplest form: a relative put (write) operation. Assuming we write three byte values into the buffer by invoking put(1) three times, we end up in the following state:

Figure 3: state after having written three bytes into the buffer

Each put() writes a byte into the current position, then advances the position by one. When the position reaches the limit(), no more bytes can be written.

Following a series of writes, the next typical thing to do would be to read those bytes out of the buffer. Again, there are several ways to do this, but we will focus on simple get() calls, reading one byte at a time. Since get() just reads from the current position, we cannot call it immediately after the put() calls, because we would get the byte at position 3, which is just a zero we never wrote. This is where the flip() method comes into play; it sets the limit to the current position and the position back to zero:

Figure 4: state after flipping the ByteBuffer

Now we can invoke get() three times to read the first three bytes back. For each call, the byte at the current position is returned and the position is advanced by one. The hasRemaining() method can be used to check if there are bytes left to read (or space left to write into) between the position and the limit. It will return true as long as the position is smaller than the limit.

When working with channels, you will often pass around ByteBuffer instances and do bulk read or write calls, meaning that you don’t have to deal with each byte on its own, but rather read/write some amount of bytes (often unknown in advance). This is typically also more efficient. The position, limit and capacity of the ByteBuffer still behave in the exact same way as described in this section.

Lastly, to prepare the buffer for future writes into it, we can clear() it. This would put the buffer back into the following state:

Figure 5: state after the buffer has been cleared

Notice that it is basically like the state in figure 2, except that there are still 1's written into the first three cells. So a clear is just a pair of marker updates, with no actual zeroing of memory. Future writes will simply overwrite the old garbage bytes, and the position and limit will keep track of the number of valid bytes.
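
The whole put/flip/get/clear cycle described above can be condensed into a few lines (a standalone snippet, not taken from the server code):

import java.nio.ByteBuffer

fun main() {
    val buf = ByteBuffer.allocate(8)     // position=0, limit=8, capacity=8 (figure 2)
    repeat(3) { buf.put(1.toByte()) }    // position=3, limit=8 (figure 3)
    buf.flip()                           // position=0, limit=3 (figure 4)
    while (buf.hasRemaining()) {
        println(buf.get())               // prints 1 three times
    }
    buf.clear()                          // position=0, limit=8 again (figure 5)
}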

Back to the NIOServer message reception code

Now that we know some basics about the ByteBuffer, let’s go back to the message reception code in NIOServer. We start by reading however many bytes the client channel has to offer us:

  try {
    val read = channel.read(buffer)

We do this by passing our ByteBuffer instance to channel.read(), which actually writes into the buffer. The method returns the number of bytes read, which may be zero, and we store that in variable read. The following figure shows how the buffer may look after a read of 4 bytes:

Figure 6: buffer after reading 4 bytes from client channel

Next we need to scan our buffer for messages. From the example above, the buffer contains one complete message "nio" ended with a null byte. The logic is built to handle zero or more complete messages in the buffer after a single client read.

    var i = 0
    while (i < buffer.position()) {
      if (buffer.get(i) == END_MESSAGE_MARKER) {
        storeMessage(buffer.duplicate().position(i).flip())
        buffer.limit(buffer.position()).position(i + 1).compact()
        i = 0
      } else i += 1
    }

We always start at the beginning of the buffer and scan for the next end message marker byte, but never beyond the buffer position, which tells us where the client channel stopped writing into the buffer. When an end message null byte is found, we first pass a duplicate().position(i).flip() of the client byte buffer to the storeMessage() method. This is only a shallow copy of the buffer, sharing the same memory allocation, but with its own set of markers. We adjust it so that the duplicate is suitable for reading exactly the part that contains the next message, excluding the null byte. The adjusted buffer will have the following state:

Figure 7: state of ephemeral ByteBuffer duplicate used for message decoding

The storeMessage() method simply reads from the provided buffer from position to limit and decodes the bytes as UTF-8, then stores the string message in an internal queue. Next we reset scanning state and remove the decoded message bytes from the client reception buffer, to free up space:

buffer.limit(buffer.position()).position(i + 1).compact()
i = 0

We set the limit marker to the current position, which at this stage tells us where the client channel stopped writing into the buffer as a whole. The position is then updated to one past the end message null byte marker, and lastly the buffer is compacted. The following figure illustrates the effect of buffer compaction in general:

Figure 8: the effect of the compact() operation

After compaction the buffer is again set up for more writing from the client channel, and we keep only the so far unprocessed bytes, which are copied to the beginning of the buffer. The scanning loop runs until there are no more complete messages present in the buffer, or the end of the client written bytes is reached.

    if (!buffer.hasRemaining()) {
      log("error: client buffer overflow, too big message, discarding buffer")
      buffer.clear()
    }

If no complete message has been read from the buffer and it is also full, the server enforces a limit and discards the buffer, for reasons of simplicity. Another way to handle this would be to dynamically grow the client buffer.

    if (read == -1) {
      if (buffer.position() > 0) {
        storeMessage(buffer.flip())
      }
      cancel()
      channel.close()
    }

Finally, the last part of the SelectionKey.receive() code is responsible for closing client channels, which is triggered by the read call returning -1. If there are remaining unprocessed bytes, they are stored as a last message from the client before the selection key is cancelled and the channel closed. (I am not sure whether it is necessary to do both.)
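
For reference, a storeMessage() along the lines described above could look roughly like this (a sketch based on the description; the actual implementation is in the repository, and how a full queue is handled here is an assumption):

private fun storeMessage(buffer: ByteBuffer) {
    // Decode the bytes between position and limit as UTF-8
    val message = Charsets.UTF_8.decode(buffer).toString()

    // Store in the shared queue; offer() drops the message if the queue is full
    messages.offer(message)
}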

Thread safety

Since only a single thread runs the server code, thread safety is not required there. However, this server allows other threads to concurrently fetch messages from an in-memory queue, and uses java.util.concurrent.ArrayBlockingQueue for this purpose. This allows any number of external threads to safely fetch messages from a single NIOServer instance, concurrently with the server thread adding new incoming messages to the queue.
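
A consumer thread could then fetch messages along these lines (hypothetical sketch; it assumes the server exposes the queue through a method such as takeMessage(), which is not necessarily the actual API in the repository):

Thread {
    while (true) {
        val message = server.takeMessage()   // hypothetical accessor; blocks until a message is available
        println("received: $message")
    }
}.apply {
    isDaemon = true
    start()
}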

Automated tests

Test cases are implemented in NIOServerTests.kt. Perhaps the most interesting case is the last one, which spawns 1000 threads that concurrently connect and send messages, then asserts that the correct number of messages was received by the server.

Notes about performance

As this was only an experiment, not much effort has been put into optimizing the code. The message scanning could be made more efficient by avoiding re-scan from the start of the buffer at every receive(), perhaps by using the mark() method of ByteBuffer to remember the current scan position. Also, the number of buffer compact() operations could be reduced to at most one.

Closing

We will mark the end of this lengthy blog post by showing the code that closes down the server cleanly:

override fun close() {
  selector.close()
  serverSocket.close()
}

When the selector is closed, the server thread stops looping and dies.