Keeping your Drupal backups under control

One of the good practices a Drupal developer must adhere to is keeping backups of a site, even if it's just a dev site. Since the advent of drush archive-dump, much of this task is already automated; but there is still a missing piece to make it work seamlessly: purging old backups and keeping only the recent ones. To fill that gap, I wrote a simple bash script that you can set up and use very easily.

Let's start by showing an actual usage example: consider the scenario where you want to keep hourly backups of a Drupal site. A simple option would be to write a script that runs drush archive-dump on the site's directory and set up a cron job to run it hourly. It works correctly, but it will cause a bit of a headache a few days later when you go to check the backups folder... there will be lots of files! It's even worse if you want to keep more than one backup schedule (hourly, daily, monthly, etc.), since they will all be mixed together. Of course, hourly backups are not a good idea for a production site, but they are fine for a small dev site.
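For illustration, that naive approach could be as small as the following sketch (the site path and script name are hypothetical, and drush is assumed to be on the PATH):

#!/bin/sh
# Naive hourly backup: a new file is created on every run,
# and nothing ever gets deleted
cd /path/to/mydrupalsite || exit 1
drush archive-dump

paired with a cron entry like:

0   *   *   *   *   /path/to/naivebackup.sh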

Now, what if you could just call a script, passing a few arguments with obvious meanings, and get your backups without having to remove old ones by hand? On the command line, that could look like this:

smartbackups.sh /path/to/mydrupalsite mydrupalsite/weekly 4

The previous line would fire up the script, telling it where to locate the site (mydrupalsite in this example), giving it a subdirectory in which to store the backups, under ~/drush-backups/scheduled, and instructing it to make sure only the 4 most recent ones are kept there. The good news is that this "hypothetical" command already works with the script you'll find right below.
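Just to picture the end result: after a month of weekly runs, the backups folder would look something like this (the timestamps are made up, but the bk-DATE naming is what the script below generates):

$ ls ~/drush-backups/scheduled/mydrupalsite/weekly
bk-2013-05-04-020000.tar.gz
bk-2013-05-11-020000.tar.gz
bk-2013-05-18-020000.tar.gz
bk-2013-05-25-020000.tar.gz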

Let's take a look at the entire code for the script, and explain its relevant parts after that:

#!/bin/sh

# Recent X backups script for use with Drush
#
# Arguments:
# $1: Drupal's directory path
# $2: subfolder, relative to the default backup folder, do not include
#     leading or trailing slash
# $3: number of recent backups to keep in the folder where the backup
#     will be saved (either the default one or a subfolder)
#

# Check params
USAGEMSG="\nMissing args error, usage is:\n  script.sh /path/to/drupal backup/subpath last-N-to-keep"
if [ ! -n "$1" ] ; then
  echo "$USAGEMSG"
  exit
elif [ ! -n "$2" ] ; then
  echo "$USAGEMSG"
  echo "\nMissing arg 2.\n  Second argument is path relative to ~/drush-backups/scheduled.\n  If the given subfolder does not exist, it will be created.\n  Do not include leading or trailing slash."
  exit
elif [ ! -n "$3" ] ; then
  echo "$USAGEMSG"
  echo "\nMissing arg 3.\n  Specify the number of recent copies to keep in \n  given backup-subfolder"
  exit
fi

# Unique suffix - use dates to easily find oldest files to delete
SUFFIX=$(date +%F-%H%M%S)

# Base path on which to perform all tasks
BASEPATH="$HOME/drush-backups/scheduled"

# Define the full path to the backup folder, allowing a subfolder to be passed to the script
BKPATH="$BASEPATH/$2"
BKFILE="$BKPATH/bk-$SUFFIX.tar.gz"

mkdir -p "$BKPATH"

cd "$1" || exit 1
~/drush/drush arb --destination="$BKFILE"

cd "$BKPATH" || exit 1
COUNT=$(find . -maxdepth 1 -type f | wc -l)
while [ "$COUNT" -gt "$3" ]; do
  find . -maxdepth 1 -type f | sort | head -n 1 | xargs rm
  COUNT=$(find . -maxdepth 1 -type f | wc -l)
done

The first block of comments is just a brief explanation of the required arguments; right after that comes a basic check of those arguments, which prints usage documentation when something is missing. Be careful: since this version of the script only checks that the arguments are present, not that they are valid, you must be sure to invoke it with correct values. Of course, usage of this script on your part is under your sole responsibility ;).
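If you want the script to be a little more defensive, a couple of extra checks could be appended after the existing ones; the following is just a sketch of what they might look like, not part of the script above:

# Extra (optional) checks: make sure the site path is a real directory
# and that the third argument is a number
if [ ! -d "$1" ]; then
  printf '%s\n' "Error: $1 is not a directory"
  exit 1
fi
case "$3" in
  ''|*[!0-9]*)
    printf '%s\n' "Error: $3 must be a number"
    exit 1
    ;;
esac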

The script works by creating a date-based suffix for the backup file's name (line 29); using dates makes it easy to purge old files later. Next, it defines a base path under which to store the backups, in this case the ~/drush-backups/scheduled directory (~/drush-backups/archive-dump is the default location for files generated by drush archive-dump). After that, the full path to the backup folder and the actual file path are put together, followed by an invocation of mkdir -p that creates the full path if it does not exist yet. Then it's time to cd into the site's directory and run drush arb, passing the desired file path (lines 40 and 41); here we invoke drush archive-dump (using its arb alias) to get the backup created and saved in the right location. Finally, there's a bit of script-fu to count the number of files in the working backups path (line 44), and a while loop that keeps deleting files one by one (lines 45 to 48) until the count matches the third argument given to the script.
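As a quick sanity check of that idea, you can verify in a terminal that file names built with the %F-%H%M%S date format sort chronologically when sorted alphabetically (the outputs shown are just illustrative):

$ date +%F-%H%M%S
2013-06-07-154210
$ printf '%s\n' bk-2013-06-07-154210.tar.gz bk-2013-05-30-090000.tar.gz | sort
bk-2013-05-30-090000.tar.gz
bk-2013-06-07-154210.tar.gz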

Since lines 44 and 46 are quite cryptic, let's break them down a little bit:

  • Line 44 runs find on the current backups location, restricting the results to that same directory (-maxdepth 1) and to actual files, not directories or other exotic file types (-type f); then comes a "|" (called a pipe), which passes the results of find to the word-count command, wc, which is instructed to count entire lines instead of words (the -l flag). As a result, the COUNT variable stores the current number of files in the path used to save the backup.
  • The find on line 46 does the same job as the one on line 44, but this time the results are passed to sort (which sorts the lines alphabetically, and thanks to the date suffix, chronologically), which in turn passes the sorted lines to head, instructed to keep only the first line (-n 1); finally, rm is given that line (the oldest file at that very moment) by means of xargs. Each time this line runs, it deletes the oldest file in the current backup directory; a step-by-step walkthrough of the pipeline follows this list.
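To make that pipeline less magical, here is a step-by-step run in a hypothetical backup folder (file names and outputs are illustrative):

$ find . -maxdepth 1 -type f | sort
./bk-2013-05-30-090000.tar.gz
./bk-2013-06-06-090000.tar.gz
./bk-2013-06-07-090000.tar.gz
$ find . -maxdepth 1 -type f | sort | head -n 1
./bk-2013-05-30-090000.tar.gz
$ find . -maxdepth 1 -type f | sort | head -n 1 | xargs rm

After the last command, the oldest file (bk-2013-05-30-090000.tar.gz) is gone and the count has dropped by one.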

The final ingredient is to create cron jobs, scheduling the script to run at the desired times and passing it the appropriate arguments each time; since the path to the Drupal site is passed as the first argument, you can use the very same script to do the job for as many sites as you want (as long as they are on the same machine). This way, creating backup schedules without being flooded by old files is all accomplished by a simple cron job. A sample cron job that creates hourly backups and keeps only the 3 newest would look like:

0   *   *   *   *   /path/to/smartbackups.sh /path/to/mydrupalsite mydrupalsite/hourly 3
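And nothing stops you from combining several schedules for the same site in one crontab, since the subfolder argument keeps them neatly separated; a hypothetical example:

# Hourly backups, keep the 3 newest
0    *   *   *   *   /path/to/smartbackups.sh /path/to/mydrupalsite mydrupalsite/hourly 3
# Daily backups at 2:30 am, keep the 7 newest
30   2   *   *   *   /path/to/smartbackups.sh /path/to/mydrupalsite mydrupalsite/daily 7
# Weekly backups on Sunday at 3:30 am, keep the 4 newest
30   3   *   *   0   /path/to/smartbackups.sh /path/to/mydrupalsite mydrupalsite/weekly 4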

As easy as pie!