Update 4/27/2012: See my more recent post Checking for Ticks for a more up-to-date and capable version of the script in this post.
Update 3/13/2012: Updated to make much faster, and to fix the “Files that have different sizes on disk than what’s listed in the DB” command, which was incorrect.
Recently, we moved the alf_data directory for our production system to a new set of disks. As we did so, a glitch meant that a few of the files were not correctly copied over. We got it sorted out fairly quickly, but it did serve to make me paranoid about the consistency between what we have listed in the Alfresco DB and the files we have on disk.
So, this morning, I decided to write up a little unix shell script to do that comparison. Fortunately, it’s made a lot easier by the fact that there’s a alf_content_url table in Alfresco’s database which lists both every file in alf_data/contentstore as well as the size that Alfresco thinks it is.
This script then became simply an exercise in getting a file listing and database dump from the alf_content_url table into the same format so they can be diff-ed.
- Use the script below at your own risk, especially the “Files on disk that are not in the Alfresco DB. Can be backed up, then deleted.” bit. Unless you thoroughly understand what this script is doing and are comfortable with it, be very careful here. You could badly damage your Alfresco repository.
- I’m running EE 3.4.7 on RHEL5 and MySQL, and that’s what this script is written for. It shouldn’t be that hard to modify it for use on other platforms or other databases.
You’ll need to make the following substitutions in the script below. You should remove the square brackets when you perform the substitution.
- [username]: The username Alfresco uses to access its database
- [password]: The password Alfresco uses to access its database
- [host]: The hostname of the database server
- [port]: The port on the database server to connect to
- [database]: The name of the database on the database server
- [alf_data directory]: The path to the ‘alf_data’ directory, for example
- [escaped alf_data directory]: The path to the ‘alf_data’ directory with all the ‘/’s replaced with ‘\/’s, for example
#!/bin/bash TEMP="$(dirname "$0")" cd "$TEMP"; /usr/bin/mysql --skip-column-names -u [username] --password='[password]' \ -h [host] -P [port] -e "select content_size, content_url FROM alf_content_url " \ [database] | /bin/sed 's/store:\//\./' | /bin/sort -n --field-separator='/' -k2 -k3 -k4 \ -k5 -k6 -k7 > ./compare_db_results.txt /usr/bin/find [alf_data directory] -type f \ | /usr/bin/xargs du -b | /bin/sed 's/[escaped alf_data directory]\/contentstore/\./' \ | /bin/sort -n --field-separator='/' -k2 -k3 -k4 -k5 -k6 -k7 > ./compare_file_results.txt /usr/bin/diff compare_db_results.txt compare_file_results.txt \ | /bin/sort --field-separator='/' -k2 -k3 -k4 -k5 -k6 -k7 > compare_store_diff.txt echo "WARNING: Files that have different sizes on disk than what's listed in the DB" /usr/bin/uniq -f2 -D compare_store_diff.txt | /bin/grep "[<>]" | /bin/sed 's/^>/---\n>/' echo echo "WARNING: Files listed in the Alfresco DB that are not on disk" /usr/bin/uniq -f2 -u compare_store_diff.txt | /bin/grep "<" echo echo "Files on disk that are not in the Alfresco DB. Can be deleted." /usr/bin/uniq -f2 -u compare_store_diff.txt | /bin/grep ">"
The alf_content_url table also includes a CRC for each file. It would be nice to generate and check that CRC against the files on disk, but that might make this script unusably slow.