Diffing Two States of a Drupal DB (MySQL)

In certain situations, you may need to have a deep look at the result of some process (deep as in at the DB level); one such situation is when you perform a mass update of data and verifying all went correctly would be relatively easy by diffing the before/after states, then splitting the removals/additions and doing some matching from one group to the other. Not all cases fall under this description, but when they do it's best to have some tooling to avoid spending lots of time and risking human error. Let's take a look at the script I wrote for that very purpose.

Let's go straight into the script (let's call it db-dumper-per-table.sh):

#!/bin/bash

#
# Usage: script.sh /path/to/output/dir DBname
#                                     ^ (no trailing slash!)
#
# DBName is used for DBUser and DBPass, host is localhost.
#

# Uncomment for debugging
#set -x

#
# Usage: dumptables --data=[yes|no] table1 table2 ... tableN
#
function dumptables {
  if [[ $1 == "--data=yes" ]]
  then
    DATAFLAG=""
    DUMPMSG="full table"
  else
    DATAFLAG="--no-data"
    DUMPMSG="structure for table"
  fi

  # Remove data setting from list of tables
  shift

  # Export one table at a time, in a separate file.
  # The last three flags are the ones producing a dump that is diff-friendly.
  for arg in $@; do
    DUMPFILE=${args[0]}/$arg.sql
    echo "Dumping $DUMPMSG $arg into file $DUMPFILE"
    mysqldump --user=${args[1]} --password=${args[1]} ${args[1]} $DATAFLAG --order-by-primary --skip-extended-insert --skip-comments $arg > $DUMPFILE
  done
}

args=("$@")

echo "=================================================="
echo "Output dir: $1"
echo "Ensuring the output dir exists..."
mkdir -p $1
echo "DBName: $2"
echo ""

# Exclude caches from full dump (-v flag for grep)
FULLTABLES=$(echo "USE $2; SHOW TABLES;" | mysql -u$2 -p$2 $2 | tail -n +2 | grep -v "^cache" | sort)
echo "$FULLTABLES" > $1/tables-list.txt
dumptables --data=yes $FULLTABLES

# Only dump structure for caches
STRUCTURETABLES=$(echo "USE $2; SHOW TABLES;" | mysql -u$2 -p$2 $2 | tail -n +2 | grep "^cache" | sort)
echo "$STRUCTURETABLES" >> $1/tables-list.txt
dumptables --data=no $STRUCTURETABLES

The process is very straightforward: a small function performs the heavy lifting and allows choosing a full-dump of a table or just dumping the structure (useful for caches, which will just pollute your diff with meaningless garbage if you export the cache contents); that function is invoked from code that will get the full list of tables for the chosen DB and pick the desired ones via grep. As a nice extra you'll get a tables-list.txt file that will let you know which tables were removed/added between the two states, pretty useful!.

As a bonus, here's an example script that will let you automate a full test (capture state before the process, run the process, then capture the state after and, finally, leverage git to generate the diff):

#!/bin/bash

# Uncomment for debugging
#set -x

# Initialize and ensure output dir exists
BASE_DIR="/path/to/some/dir"
mkdir -p $BASE_DIR

# Dump state BEFORE
./db-dumper-per-table.sh $BASE_DIR/before DBName

# Run the process of your interest
drush ev "trigger_some_process()"

# Dump state AFTER
./db-dumper-per-table.sh $BASE_DIR/after DBName

# Diff the two dumps
cd $BASE_DIR
mkdir diff
cd diff
cp ../before/*.sql .
git init
git add .
git commit -m "before"
rm *.sql *.txt
cp ../after/*.sql .
git add .
git commit -m "after"
git diff HEAD~1..HEAD --stat

When this script finishes, you'll get a nice summary showing which tables changed + some statistics about the removals/additions. Seeing the actual diff is as easy as running:

git diff HEAD~1..HEAD # check all the tables dumped
git diff HEAD~1..HEAD tablename.sql # check just one table

Splitting the removals/additions from the diff is left as an exercise to the reader (hint: a couple greps piped will get the job done).