A critical section is some piece of code which, due to its nature and effects, should be executed by at most one thread or process at a time. If such code is executed concurrently, the results often become undefined and arbitrary. Atomically locking a shared resource is a common pattern to synchronize execution of critical sections and ensure mutually exclusive access.
Shell scripts mostly deal in processes and files, and there are several common scenarios where code is actually a critical section. If such code is run in several processes concurrently, it could introduce race conditions and arbitrary results. Consider a script that starts a background process if it is not already running – a typical pattern to start a singleton daemon process. Such code is a critical section, because you can end up with two running daemons if the code runs concurrently (and the daemon does not check for other instances of itself). Another good example is multiple scripts writing to a shared file, or even a shared directory structure.
There are a few strategies for implementing locking in shell scripts, and some are better than others. In this post, I will focus on one of the most robust: using flock(1). This nice tool gives you access to kernel-level file locking from your shell. It has some clear advantages over traditional existence-based file locking:
- It is truly atomic.
- The kernel manages the locks and releases them automatically when lock-owning processes die, so there are no more stale lock files to clean up.
- You can block and wait for a lock, indefinitely or with a timeout, and acquire it instantly when another process frees it. This avoids lock polling loops with sleeps (see the sketch below).
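For example, a lock attempt that gives up after ten seconds could look like this (a sketch; mylock.lock is an arbitrary file name):
exec {fd}>mylock.lock
if flock -w 10 -x $fd; then
    echo "lock acquired within 10 seconds"
    # ... do the protected work here ...
    exec {fd}>&-    # closing the descriptor releases the lock
else
    echo >&2 "gave up waiting for the lock"
fi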
It is important to understand that file locks are tied to both a file on the file system and running processes with open file descriptors to it. Even if a file used for locking exists on the file system, that does not mean the lock is held! The file system acts as a namespace of shared resources to which we can attach locks. Also note that the locks are advisory only – if a process does not bother to check for locks, it will not participate in any synchronization and can do whatever it pleases.
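To convince yourself of this, you can probe the lock without blocking: the -n option makes flock fail immediately instead of waiting. A small sketch, assuming the job.lock file from the script below (the file is created if missing):
exec {fd}>>job.lock    # open (and create if missing) without truncating
if flock -n -x $fd; then
    echo "nobody held the lock, even though job.lock exists"
else
    echo "the lock is currently held by another process"
fi
exec {fd}>&-    # closing the descriptor releases our lock, if we took it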
Shell script with locking functions
We will look at a script which needs to protect a critical section with locking. The locking is done on a common file, job.lock, which means any process with access to that file can obtain or check for the lock. To code along, you can copy the script to your own file and run the examples.
job.sh
#!/bin/bash
lock_acquire() {
    # Open a file descriptor to lock file
    exec {LOCKFD}>job.lock || return 1

    # Block until an exclusive lock can be obtained on the file descriptor
    flock -x $LOCKFD
}

lock_release() {
    test "$LOCKFD" || return 1

    # Close lock file descriptor, thereby releasing exclusive lock
    exec {LOCKFD}>&- && unset LOCKFD
}
lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }
# --- Begin critical section ---
if [ -f job.dat ]; then
    value=$(<job.dat)
else
    value=0
fi
value=$((value + 1))
echo $value >job.dat
# --- End critical section ---
lock_release
The lock_acquire function uses flock -x N to obtain an exclusive lock on file descriptor N. Since the file descriptor is opened by the script process itself, the script will own the lock after flock exits. Flock is able to lock the descriptor because it is inherited from the shell process that started it. The critical section reads a number from a file if it exists, increments it by one, and writes the updated number back to the file.
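To see the blocking in action, you can hold the lock from another process for a while and watch job.sh wait for it. A rough interactive sketch (the 30-second hold is arbitrary):
$ ( exec {fd}>job.lock && flock -x $fd && sleep 30 ) &
$ time bash job.sh
The second command will not finish until the background subshell exits and the kernel releases its lock.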
Testing
First we’ll run the job script once:
$ bash job.sh
$ ls
job.dat job.lock job.sh
$ cat job.dat
1
A job.dat file is produced with a value of 1, which is entirely expected and not very interesting.
Next we’ll start 100 job processes asynchronously in the background, as fast as possible, which means that many of them will run concurrently. We do this twice:
$ rm job.dat
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
100
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
200
The for loop is started in a subshell to avoid job control messages. The data has been incremented exactly 100 times after the first run, and by another 100 after the second. The code in the critical section reads, updates, and then writes to the shared file, and doing this without locking would not work consistently.
Actually, let us try that by commenting out the lock_acquire call in the script:
[...]
#lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }
# --- Begin critical section ---
Then we run the test again:
$ rm job.dat
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
3
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
14
This ends with a final result of 14, which is clearly incorrect and arbitrary. The results will vary with each run and depend on things like the speed of your computer.
Releasing the lock?
In this case the script does not actually need to release the lock right before it exits, because the kernel will automatically do that when the process exits anyway. We will try it by re-enabling the locking call and commenting out the lock_release call:
[...]
lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }
# --- Begin critical section ---
[...]
# --- End critical section ---
#lock_release
And run the test:
$ rm job.dat
$ (for i in {1..100}; do bash job.sh & done; wait)
$ (for i in {1..100}; do bash job.sh & done; wait)
$ cat job.dat
200
It still works fine.
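The automatic cleanup only happens when the process exits, though. If the script continued with unrelated work after the critical section, you would want to keep the explicit lock_release call, so that other waiting processes can proceed immediately. A sketch using the functions above (do_unrelated_work is a hypothetical placeholder):
lock_acquire || { echo >&2 "Error: failed to acquire lock"; exit 1; }
# --- Begin critical section ---
# ...
# --- End critical section ---
lock_release    # free the lock now; waiting processes proceed immediately
do_unrelated_work    # hypothetical follow-up work that needs no lock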
Starting a daemon process from your shell init scripts
A common use case is the need to start a single daemon process from your shell init scripts, unless one is already running – you only ever want one instance of it. Consider the following:
if ! ps -ef|grep some-daemon|grep -qv grep; then
    some-daemon & pid=$!
    echo Started some-daemon with pid $pid
fi
This code is racy unless it is protected by locking. If you were to start two terminals more or less simultaneously, both executing your shell init scripts, you could possibly end up with two running daemon processes.
To protect this with flock, you could do the following:
if exec {bashrc_fd}<~/.bashrc && flock -nx $bashrc_fd; then
    # --- Begin critical section ---
    if ! ps -ef|grep some-daemon|grep -qv grep; then
        some-daemon & pid=$!
        echo Started some-daemon with pid $pid
    fi
    # --- End critical section ---
    flock -u $bashrc_fd && exec {bashrc_fd}>&-
fi
Here we open a read-only file descriptor to ~/.bashrc and then try to grab an exclusive lock on it, but we do it non-blocking with option -n. If some other bash process is already executing that part of the init file, flock will not succeed and will immediately exit with a non-zero code, so the block is skipped. The effect is that only one bash process will execute the code, and others running at the same time will skip it.
You may notice that we explicitly release the lock using flock -u $bashrc_fd after the critical section. Normally it is enough to close the file descriptor used for locking, but when starting child processes, those may inherit and keep such descriptors open. So the parent process closing its copy of the descriptor may not be enough to actually release the lock. Therefore we do it explicitly.
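An alternative is to close the descriptor in the child itself at launch time, so the daemon never inherits it in the first place. A sketch of the relevant line (Bash closes the file descriptor whose number is stored in bashrc_fd, for this command only):
some-daemon {bashrc_fd}>&- & pid=$!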
Closing notes
The manual page for the flock command lists a few good examples of how you can use it in your scripts. However, none of those examples show how you can make the current shell process own and control the locks without wrapping critical sections/commands in subshells or flock invocations.
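For comparison, the subshell-wrapping style documented in flock(1) looks roughly like this – here the lock is held by the subshell, not by your current shell:
(
    flock -n 9 || exit 1
    # ... commands executed under the lock ...
) 9>job.lock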
The manual page for the flock(2) system call is a good read if you are interested in more details about how it works.
Read more about handling file descriptors with Bash in this part of the bash(1) manual.