Checking for Ticks

This is a follow-on post to my “Counting the Livestock” post from a couple of weeks ago. Reading that post first will give some background to this one.

With many thanks to Alex Strachan and Chris Turner, I’ve been able to greatly improve the script and have it do a somewhat deeper examination of how the files on disk are referenced by tables in the repository. I’ve also greatly cleaned up the script, so it should be easier to understand and modify for use in your environment. (For example, the 7 variables at the start of the script are all you need to change, rather than having to do a search-and-replace for that info throughout.)

First, some background on the structure of the Alfresco database in terms of how it refers to files:

  1. At the lowest level, the alf_content_url table directly contains the paths to the files on disk corresponding to nodes in the repository, as well as their sizes.
  2. The alf_content_data table contains things like the mime type and content encoding, and refers to the alf_content_url table.
  3. Finally, 3 tables refer to entries in the alf_content_data table: alf_node_properties, avm_nodes, and alf_audit_model.

The script below is designed to check all of that for consistency. That is, it will let you know if the 5 tables above all map correctly down to files on disk.

Important notes:

  1. Use the script below at your own risk, especially the “Files on disk that are not in the Alfresco DB. Can be backed up, then deleted.” bit. Unless you thoroughly understand what this script is doing and are comfortable with it, be very careful here. You could badly damage your Alfresco repository.
  2. This is only a reporting tool. If it reports errors in your repository, some may be harmless, some may be indicative of a real problem. This is meant as a starting point for investigations into potential problems, not as a fix for those problems.
  3. Even though the script below only reads from the database and files on disk, and doesn’t make any changes, it’s still a really good idea to have a good backup of the repository and database before running this.

The Script

I’m running EE 3.4.8 on RHEL5 and MySQL, and that’s what this script is written for. It shouldn’t be that hard to modify it for use on other platforms or other databases.

#!/bin/bash

#update the 7 variables below to correspond to your environment
DB_HOST=dbhost.company.com
DB_PORT=3306
DB_NAME=alfresco
DB_USER=alfresco
DB_PASS='alfresco'
REPO_PATH="/var/lib/alf_data/contentstore"
TMP_PATH="/tmp/check_db/"

#You may not need to modify anything below this line unless you are adapting
#this script to a different platform
ORIG_WD=`pwd`
mkdir -p $TMP_PATH
cd $TMP_PATH

ESCAPED_REPO_PATH=$(printf "%s\n" "$REPO_PATH" | sed 's/[][\.*^$/]/\\&/g')

#Get the list of files the DB thinks are on disk
/usr/bin/mysql --skip-column-names -u $DB_USER --password=$DB_PASS \
-h $DB_HOST -P $DB_PORT -e "select content_size, content_url FROM alf_content_url " \
$DB_NAME | /bin/sed 's/store:\//\./' | /bin/sort -n --field-separator='/' -k2 -k3 -k4 \
-k5 -k6 -k7 > compare_db_results.txt

#Get the list of files that are actually on disk
/usr/bin/find $REPO_PATH -type f \
| /usr/bin/xargs du -b | /bin/sed "s/$ESCAPED_REPO_PATH/\./" \
| /bin/sort -n --field-separator='/' -k2 -k3 -k4 -k5 -k6 -k7 > compare_file_results.txt

#Compare the list of files from the DB to the list from disk
/usr/bin/diff compare_db_results.txt compare_file_results.txt \
| /bin/sort --field-separator='/' -k2 -k3 -k4 -k5 -k6 -k7 > compare_store_diff.txt

diff_sizes=`/usr/bin/uniq -f2 -D compare_store_diff.txt | grep "[<>]" | \
/bin/sed 's/^>/---\n>/'`
if [ -n "$diff_sizes" ]
then
    echo "WARNING: Files that have different sizes on disk than what's listed in the DB"
    echo "$diff_sizes"
fi

not_on_disk=`/usr/bin/uniq -f2 -u compare_store_diff.txt | /bin/grep "<"`
if [ -n "$not_on_disk" ]
then
    echo
    echo "WARNING: Files listed in the Alfresco DB that are not on disk"
    echo "$not_on_disk"
fi

not_in_db=`/usr/bin/uniq -f2 -u compare_store_diff.txt | /bin/grep ">"`
if [ -n "$not_in_db" ]
then
    echo
    echo "Files on disk that are not in the Alfresco DB. Can be backed up then deleted."
    echo "$not_in_db"
fi

#get the list of nodes that appear in alf_content_data
mysql -N -u $DB_USER --password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
"select id FROM alf_content_data" $DB_NAME | sort -un > alf_content_data_ids.txt

#get the list of nodes that appear in alf_node_properties
mysql -N -u $DB_USER --password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
"select long_value FROM alf_node_properties anp WHERE (anp.actual_type_n=21)" $DB_NAME | \
sort -un > alf_node_properties_long_value.txt

#get the list of nodes that appear in avm_nodes
mysql -N -u $DB_USER --password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
"select substring(content_url,4) FROM avm_nodes" $DB_NAME  | sort -un > \
avm_nodes_content_url.txt

#get the list of nodes that appear in alf_audit_model
mysql -N -u $DB_USER --password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
"select content_data_id FROM alf_audit_model" $DB_NAME | sort -un > \
alf_audit_model_content_data_id.txt

echo
echo "Unique entries in alf_content_data"
perl -nle 'print if /^\d+$/' alf_content_data_ids.txt | wc -l

echo "Unique entries in alf_node_properties"
perl -nle 'print if /^\d+$/' alf_node_properties_long_value.txt | wc -l

echo "Unique entries in avm_nodes"
perl -nle 'print if /^\d+$/' avm_nodes_content_url.txt | wc -l

echo "Unique entries in alf_audit_model"
perl -nle 'print if /^\d+$/' alf_audit_model_content_data_id.txt | wc -l

echo
echo "Unique entries in use (alf_node_properties + avm_nodes + alf_audit_model)"
cat alf_node_properties_long_value.txt avm_nodes_content_url.txt \
alf_audit_model_content_data_id.txt | sort -un | perl -nle 'print if /^\d+$/' > \
entries_in_use.txt
cat entries_in_use.txt | wc -l

echo
echo "In alf_content_data but not in any of alf_node_properties, avm_nodes, or \
alf_audit_model"
diff alf_content_data_ids.txt entries_in_use.txt | perl -nle 'print if /^< \d+$/' | wc -l

echo "In alf_node_properties but not alf_content_data"
diff alf_content_data_ids.txt alf_node_properties_long_value.txt | \
perl -nle 'print if /^> \d+$/' | wc -l

echo "In avm_nodes but not alf_content_data"
diff alf_content_data_ids.txt avm_nodes_content_url.txt | \
perl -nle 'print if /^> \d+$/' | wc -l

echo "In alf_audit_model but not alf_content_data"
diff alf_content_data_ids.txt alf_audit_model_content_data_id.txt | \
perl -nle 'print if /^> \d+$/' | wc -l

diff alf_content_data_ids.txt entries_in_use.txt | perl -nle 'print if /^< \d+$/' | \
/bin/sed 's/^< //'  >limbo_nodes.txt

nodes_to_check=`perl -pe 's/\n/, /g' limbo_nodes.txt | /bin/sed 's/, $//'`
if [ -n "$nodes_to_check" ]
then
    echo
    echo "Listing of node IDs in alf_content_data that are not in any of \
alf_node_properties, avm_nodes, or alf_audit_model"
    echo "$nodes_to_check"

    echo
    echo "Files corresponding to the node IDs in alf_content_data that are not in any of \
alf_node_properties, avm_nodes, or alf_audit_model"
    mysql -N -u $DB_USER --password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
    "select acu.content_url from alf_content_data acd, alf_content_url acu where acd.id \
in ($nodes_to_check) AND acd.content_url_id=acu.id;" $DB_NAME | \
    sed "s/store:\//$ESCAPED_REPO_PATH/"
fi

cd $ORIG_WD
rm -rf $TMP_PATH

Example Results

Clean repository

The output from running this script against a fully-intact repository looks like:

Unique entries in alf_content_data
15426
Unique entries in alf_node_properties
14459
Unique entries in avm_nodes
962
Unique entries in alf_audit_model
5

Unique entries in use (alf_node_properties + avm_nodes + alf_audit_model)
15426

In alf_content_data but not in any of alf_node_properties, avm_nodes, or alf_audit_model
0
In alf_node_properties but not alf_content_data
0
In avm_nodes but not alf_content_data
0
In alf_audit_model but not alf_content_data
0

Less-clean repository

The output from running this script against a repository thats still functioning perfectly normally, but has a few issues, looks like the following. In other words, if you get output like this, don’t panic! Everything may still be OK:

WARNING: Files that have different sizes on disk than what's listed in the DB
---
> 0	./2012/3/13/14/29/12296f59-d5f6-4c4c-af0d-16583aa299ad.bin
< 270	./2012/3/13/14/29/12296f59-d5f6-4c4c-af0d-16583aa299ad.bin
---
> 0	./2012/3/13/14/29/74f4bb37-440b-4c4d-b118-85aa7c2d3602.bin
< 274	./2012/3/13/14/29/74f4bb37-440b-4c4d-b118-85aa7c2d3602.bin
---
> 0	./2012/3/13/14/29/a03d4b7c-090f-4535-ae6f-fd7ceebf3233.bin
< 352	./2012/3/13/14/29/a03d4b7c-090f-4535-ae6f-fd7ceebf3233.bin
---
> 0	./2012/3/13/14/29/dd9d0e21-78a2-419e-a640-d413243cbb33.bin
< 322	./2012/3/13/14/29/dd9d0e21-78a2-419e-a640-d413243cbb33.bin
---
> 0	./2012/3/13/14/29/dfba42f6-4de8-4ba8-bfac-570dc83919c0.bin
< 270	./2012/3/13/14/29/dfba42f6-4de8-4ba8-bfac-570dc83919c0.bin

Files on disk that are not in the Alfresco DB. Can be deleted.
> 304	./dsmerror.log

Unique entries in alf_content_data
318462
Unique entries in alf_node_properties
305564
Unique entries in avm_nodes
12865
Unique entries in alf_audit_model
5

Unique entries in use (alf_node_properties + avm_nodes + alf_audit_model)
318434

In alf_content_data but not in any of alf_node_properties, avm_nodes, or alf_audit_model
28
In alf_node_properties but not alf_content_data
0
In avm_nodes but not alf_content_data
0
In alf_audit_model but not alf_content_data
0

Listing of node IDs in alf_content_data that are not in any of alf_node_properties, 
avm_nodes, or alf_audit_model
1178218, 1185143, 1185851, 1185858, 1189818, 1189883, 1189887, 1189889, 1189891, 1190182, 
1192920, 1193702, 1193985, 1195720, 1200796, 1228477, 1232014, 1234133, 1234135, 1246246, 
1257045, 1257046, 1307167, 1315011, 1315014, 1315018, 1315306, 1325256

Files corresponding to the node IDs in alf_content_data that are not in any of 
alf_node_properties, avm_nodes, or alf_audit_model
/var/lib/alf_data/contentstore/2011/9/19/19/16/88d5fe33-b713-48e8-9ea5-0c8f9b8e67bc.bin
/var/lib/alf_data/contentstore/2011/10/4/14/45/b8c81f19-c1d7-4742-8ac6-0afdddd25662.bin
/var/lib/alf_data/contentstore/2011/10/5/16/27/653e0bc9-12d6-4793-b193-2cb586d3764c.bin
/var/lib/alf_data/contentstore/2011/10/5/16/27/6bb5309c-bd3c-41fd-9da9-fd15e0ce5775.bin
/var/lib/alf_data/contentstore/2011/10/17/13/32/edd92f96-2592-4665-ae00-43071461ba99.bin
/var/lib/alf_data/contentstore/2011/10/18/14/56/6dfe33b9-bb47-4b4b-9fc4-b3e2a5b5b907.bin
/var/lib/alf_data/contentstore/2011/10/18/14/58/f9a0e767-edd1-48ce-9370-4d14e3905841.bin
/var/lib/alf_data/contentstore/2011/10/18/14/59/82b483df-5993-454a-98a8-1367b19fd833.bin
/var/lib/alf_data/contentstore/2011/10/18/14/59/df59b441-b67d-4a26-a5ee-01911703b9b5.bin
/var/lib/alf_data/contentstore/2011/10/19/16/13/3a664989-fa0b-4e2c-9222-ebd8fe02cf32.bin
/var/lib/alf_data/contentstore/2011/10/26/11/12/06efd829-3468-4b39-8c92-c4822d4b3341.bin
/var/lib/alf_data/contentstore/2011/10/26/17/38/7bae34d5-8097-49f3-b7a9-a2b2cb4c63c8.bin
/var/lib/alf_data/contentstore/2011/10/27/11/16/389f3450-477a-495a-b509-8800c6c6b046.bin
/var/lib/alf_data/contentstore/2011/10/28/13/8/1cefb1b5-664f-427f-b325-906012eba680.bin
/var/lib/alf_data/contentstore/2011/11/4/12/30/bf52a856-9516-4d11-af6a-645c0322d1c2.bin
/var/lib/alf_data/contentstore/2011/12/5/14/45/856d2b67-28e8-40ce-8859-26a29a23afe3.bin
/var/lib/alf_data/contentstore/2011/12/7/16/36/6d4b2dd7-551b-4472-a075-2f9f3e4ecaf6.bin
/var/lib/alf_data/contentstore/2011/12/14/12/42/1910cb27-6812-420e-abcb-05e6793d2e54.bin
/var/lib/alf_data/contentstore/2011/12/14/12/42/f38205bf-2d48-452c-bf5f-36cdb0480f30.bin
/var/lib/alf_data/contentstore/2012/1/12/14/14/c6de3fe1-74c3-410d-903d-b89d3739fb47.bin
/var/lib/alf_data/contentstore/2012/1/30/15/10/2c49054d-b6d4-4864-b2e2-46ce242c1962.bin
/var/lib/alf_data/contentstore/2012/1/30/15/10/59b8d996-6dcc-44dc-9a5e-caf7a9453dc1.bin
/var/lib/alf_data/contentstore/2012/3/7/19/50/9378ca45-851e-4ad2-833c-e7bceed3ded8.bin
/var/lib/alf_data/contentstore/2012/3/14/11/28/80bcceb1-c7bd-485f-aa40-41e6ef0a17d4.bin
/var/lib/alf_data/contentstore/2012/3/14/11/28/2be42bd0-e658-435a-9b07-3c3b06537604.bin
/var/lib/alf_data/contentstore/2012/3/14/11/28/6020177f-ae02-4833-8465-a21cbe24dd09.bin
/var/lib/alf_data/contentstore/2012/3/14/16/32/d42425d9-4139-4c1e-8b08-a394474eafcb.bin
/var/lib/alf_data/contentstore/2012/3/23/14/33/367ff98d-8bce-4970-a67e-be2b41a63ed7.bin
This entry was posted in Scripts. Bookmark the permalink.

5 Responses to Checking for Ticks

  1. Pingback: Counting the livestock [Updated] | Wandering Alfresco

  2. Paul Price says:

    I converted over to use SQL Server on Windows within Cygwin bash window. Here is what I ended up with…

    #!/bin/bash

    ## 3/6/2013 Paul Price
    ## Changed to use SQL Server from within a Cygwin bash session on Windows.
    ## I kept the orriginal mysql commands as comments.
    ## Note that I used a trusted connection (Windows User Credentials) instead of
    ## using a SQL User ID and Password. To switch, remove “-E” and replace
    ## with “-U $DB_USER -P DB_PASS” in the sqlcmd lines.

    #update the variables below to correspond to your environment
    DB_HOST=YourHostName
    DB_INSTANCE=YourSqlInstance
    DB_NAME=YourAlfrescoDbName
    REPO_PATH=”/cygdrive/d/Alfresco/alf_data/contentstore”
    TMP_PATH=”/cygdrive/d/tmp/check_db”

    #You may not need to modify anything below this line unless you are adapting
    #this script to a different platform
    ORIG_WD=`pwd`
    mkdir -p $TMP_PATH
    cd $TMP_PATH

    ESCAPED_REPO_PATH=$(printf “%s\n” “$REPO_PATH” | sed ‘s/[][\.*^$/]/\\&/g’)

    #Get the list of files the DB thinks are on disk
    sqlcmd -E -S $DB_HOST\\$DB_INSTANCE -h -1 -W -d $DB_NAME \
    -Q “select content_size, content_url FROM alf_content_url ” \
    | /bin/sed ‘s/store:\//\./’ | /bin/sort -n –field-separator=’/’ \
    -k2 -k3 -k4 -k5 -k6 -k7 > compare_db_results.txt

    # PPRICE: Here is the code I replaced:
    ##/usr/bin/mysql –skip-column-names -u $DB_USER –password=$DB_PASS \
    ##-h $DB_HOST -P $DB_PORT -e “select content_size, content_url FROM alf_content_url ” \
    ##$DB_NAME | /bin/sed ‘s/store:\//\./’ | /bin/sort -n –field-separator=’/’ -k2 -k3 -k4 \
    ##-k5 -k6 -k7 > compare_db_results.txt

    #Get the list of files that are actually on disk
    /usr/bin/find $REPO_PATH -type f \
    | /usr/bin/xargs du -b | /bin/sed “s/$ESCAPED_REPO_PATH/\./” \
    | /bin/sort -n –field-separator=’/’ -k2 -k3 -k4 -k5 -k6 -k7 > compare_file_results.txt

    #Compare the list of files from the DB to the list from disk
    # PPRICE: Added the “-b” flag to account for tabs in one file vs. spaces in the other
    /usr/bin/diff -b compare_db_results.txt compare_file_results.txt \
    | /bin/sort –field-separator=’/’ -k2 -k3 -k4 -k5 -k6 -k7 > compare_store_diff.txt

    diff_sizes=`/usr/bin/uniq -f2 -D compare_store_diff.txt | grep “[]” | \
    /bin/sed ‘s/^>/—\n>/’`
    if [ -n “$diff_sizes” ]
    then
    echo “WARNING: Files that have different sizes on disk than what’s listed in the DB”
    echo “$diff_sizes”
    fi

    not_on_disk=`/usr/bin/uniq -f2 -u compare_store_diff.txt | /bin/grep “”`
    if [ -n “$not_in_db” ]
    then
    echo
    echo “Files on disk that are not in the Alfresco DB. Can be backed up then deleted.”
    echo “$not_in_db”
    fi

    #get the list of nodes that appear in alf_content_data
    sqlcmd -E -S $DB_HOST\\$DB_INSTANCE -h -1 -W -d $DB_NAME \
    -Q “select id FROM alf_content_data” \
    | sort -un > alf_content_data_ids.txt

    # PPRICE: Here is the code I replaced:
    #mysql -N -u $DB_USER –password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
    #”select id FROM alf_content_data” $DB_NAME | sort -un > alf_content_data_ids.txt

    #get the list of nodes that appear in alf_node_properties
    sqlcmd -E -S $DB_HOST\\$DB_INSTANCE -h -1 -W -d $DB_NAME \
    -Q “select long_value FROM alf_node_properties anp WHERE (anp.actual_type_n=21)” \
    | sort -un > alf_node_properties_long_value.txt

    # PPRICE: Here is the code I replaced:
    #get the list of nodes that appear in alf_node_properties
    mysql -N -u $DB_USER –password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
    “select long_value FROM alf_node_properties anp WHERE (anp.actual_type_n=21)” $DB_NAME | \
    sort -un > alf_node_properties_long_value.txt

    #get the list of nodes that appear in avm_nodes
    sqlcmd -E -S $DB_HOST\\$DB_INSTANCE -h -1 -W -d $DB_NAME \
    -Q “select substring(content_url,4) FROM avm_nodes” \
    | sort -un > avm_nodes_content_url.txt

    # PPRICE: Here is the code I replaced:
    ##mysql -N -u $DB_USER –password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
    ##”select substring(content_url,4) FROM avm_nodes” $DB_NAME | sort -un > \
    ##avm_nodes_content_url.txt

    #get the list of nodes that appear in alf_audit_model
    sqlcmd -E -S $DB_HOST\\$DB_INSTANCE -h -1 -W -d $DB_NAME \
    -Q “select content_data_id FROM alf_audit_model” \
    | sort -un > alf_audit_model_content_data_id.txt

    # PPRICE: Here is the code I replaced:
    ##mysql -N -u $DB_USER –password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
    ##”select content_data_id FROM alf_audit_model” $DB_NAME | sort -un > \
    ##alf_audit_model_content_data_id.txt

    echo
    echo “Unique entries in alf_content_data”
    perl -nle ‘print if /^\d+$/’ alf_content_data_ids.txt | wc -l

    echo “Unique entries in alf_node_properties”
    perl -nle ‘print if /^\d+$/’ alf_node_properties_long_value.txt | wc -l

    echo “Unique entries in avm_nodes”
    perl -nle ‘print if /^\d+$/’ avm_nodes_content_url.txt | wc -l

    echo “Unique entries in alf_audit_model”
    perl -nle ‘print if /^\d+$/’ alf_audit_model_content_data_id.txt | wc -l

    echo
    echo “Unique entries in use (alf_node_properties + avm_nodes + alf_audit_model)”
    cat alf_node_properties_long_value.txt avm_nodes_content_url.txt \
    alf_audit_model_content_data_id.txt | sort -un | perl -nle ‘print if /^\d+$/’ > \
    entries_in_use.txt
    cat entries_in_use.txt | wc -l

    echo
    echo “In alf_content_data but not in any of alf_node_properties, avm_nodes, or \
    alf_audit_model”
    diff alf_content_data_ids.txt entries_in_use.txt | perl -nle ‘print if /^ \d+$/’ | wc -l

    echo “In avm_nodes but not alf_content_data”
    diff alf_content_data_ids.txt avm_nodes_content_url.txt | \
    perl -nle ‘print if /^> \d+$/’ | wc -l

    echo “In alf_audit_model but not alf_content_data”
    diff alf_content_data_ids.txt alf_audit_model_content_data_id.txt | \
    perl -nle ‘print if /^> \d+$/’ | wc -l

    diff alf_content_data_ids.txt entries_in_use.txt | perl -nle ‘print if /^< \d+$/' | \
    /bin/sed 's/^limbo_nodes.txt

    nodes_to_check=`perl -pe ‘s/\n/, /g’ limbo_nodes.txt | /bin/sed ‘s/, $//’`
    if [ -n “$nodes_to_check” ]
    then
    echo
    echo “Listing of node IDs in alf_content_data that are not in any of \
    alf_node_properties, avm_nodes, or alf_audit_model”
    echo “$nodes_to_check”

    echo
    echo “Files corresponding to the node IDs in alf_content_data that are not in any of \
    alf_node_properties, avm_nodes, or alf_audit_model”
    sqlcmd -E -S $DB_HOST\\$DB_INSTANCE -h -1 -W -d $DB_NAME \
    -Q “select acu.content_url from alf_content_data acd, alf_content_url acu where acd.id \
    in ($nodes_to_check) AND acd.content_url_id=acu.id;” \
    | sed “s/store:\//$ESCAPED_REPO_PATH/”

    # PPRICE: Here is the code I replaced:
    ##mysql -N -u $DB_USER –password=$DB_PASS -h $DB_HOST -P $DB_PORT -e \
    ##”select acu.content_url from alf_content_data acd, alf_content_url acu where acd.id \
    ##in ($nodes_to_check) AND acd.content_url_id=acu.id;” $DB_NAME | \
    ##sed “s/store:\//$ESCAPED_REPO_PATH/”
    fi

    cd $ORIG_WD
    rm -rf $TMP_PATH

  3. Imran Iqbal says:

    Firstly, I’d like to thank you for sharing your script. I’ve just conducted a massive overhaul of the system I’m working with and this was very helpful. I’ve converted it to a work for a PostgreSQL-based installation (Alfresco 4.2.c) although I’m not an expert by any means so the changes are probably far from ideal. I’ve mentioned the differences in the script comments:

    #!/bin/bash
    ################################################################################
    # Script basis: https://wanderingalfresco.wordpress.com/2012/04/20/checking-for-ticks/
    # Modified by: Imran Iqbal
    # Date: 2013-10-13
    # For system: Alfresco Community 4.2.c
    # PostgreSQL 9.1
    # Ubuntu 10.04
    # Changes: (1) Converted mysql commands to psql equivalents
    # (2) Minor changes in various places to deal with whitespace
    # issues in psql output
    # (3) Using command variables instead of literal command paths
    # – Some of the original paths were different on my system
    # (4) Minor formatting for consistency
    ################################################################################

    #update the 7 variables below to correspond to your environment
    DB_HOST=dbhost.company.com
    DB_PORT=5432
    DB_NAME=alfresco
    DB_USER=alfresco
    DB_PASS=’alfresco’
    REPO_PATH=”/var/lib/alf_data/contentstore”
    TMP_PATH=”/tmp/check_db/”

    #Commands
    CMD_sed=$( which sed )
    CMD_psql=$( which psql )
    CMD_sort=$( which sort )
    CMD_find=$( which find )
    CMD_xargs=$( which xargs )
    CMD_du=$( which du )
    CMD_diff=$( which diff )
    CMD_uniq=$( which uniq )
    CMD_grep=$( which grep )
    CMD_perl=$( which perl )

    #You may not need to modify anything below this line unless you are adapting
    #this script to a different platform
    ORIG_WD=`pwd`
    mkdir -p $TMP_PATH
    cd $TMP_PATH

    ESCAPED_REPO_PATH=$(printf “%s\n” “$REPO_PATH” | $CMD_sed ‘s/[][\.*^$/]/\\&/g’)

    #Get the list of files the DB thinks are on disk
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select content_size, content_url FROM alf_content_url ” | $CMD_sed ‘s/store:\//\./’ | \
    $CMD_sort -n –field-separator=’/’ -k2 -k3 -k4 -k5 -k6 -k7 > compare_db_results.txt

    #Remove unwanted column separation character and then any blank lines – normally one right at the beginning
    $CMD_sed -i ‘s:\ |\ :\ :g’ compare_db_results.txt
    $CMD_sed -i ‘/^$/d’ compare_db_results.txt

    #Get the list of files that are actually on disk
    $CMD_find $REPO_PATH -type f | \
    $CMD_xargs $CMD_du -b | $CMD_sed “s/$ESCAPED_REPO_PATH/\./” | \
    $CMD_sort -n –field-separator=’/’ -k2 -k3 -k4 -k5 -k6 -k7 > compare_file_results.txt

    #Compare the list of files from the DB to the list from disk
    $CMD_diff -w compare_db_results.txt compare_file_results.txt | \
    $CMD_sort –field-separator=’/’ -k2 -k3 -k4 -k5 -k6 -k7 > compare_store_diff.txt

    diff_sizes=`$CMD_uniq -f2 -D compare_store_diff.txt | $CMD_grep “[]” | \
    $CMD_sed ‘s/^>/—\n>/’`
    if [ -n “$diff_sizes” ]
    then
    echo “WARNING: Files that have different sizes on disk than what’s listed in the DB”
    echo “$diff_sizes”
    fi

    not_on_disk=`$CMD_uniq -f2 -u compare_store_diff.txt | $CMD_grep “”`
    if [ -n “$not_in_db” ]
    then
    echo
    echo “Files on disk that are not in the Alfresco DB. Can be backed up then deleted.”
    echo “$not_in_db”
    fi

    #get the list of nodes that appear in alf_content_data
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select id FROM alf_content_data” | \
    $CMD_sort -un > alf_content_data_ids.txt
    $CMD_sed -i ‘s/^\s\+//’ alf_content_data_ids.txt

    #get the list of nodes that appear in alf_node_properties
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select long_value FROM alf_node_properties anp WHERE (anp.actual_type_n=21)” | \
    $CMD_sort -un > alf_node_properties_long_value.txt
    $CMD_sed -i ‘s/^\s\+//’ alf_node_properties_long_value.txt

    #get the list of nodes that appear in avm_nodes
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select substring(content_url,4) FROM avm_nodes” | \
    $CMD_sort -un > \ avm_nodes_content_url.txt
    $CMD_sed -i ‘s/^\s\+//’ avm_nodes_content_url.txt

    #get the list of nodes that appear in alf_audit_model
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select content_data_id FROM alf_audit_model” | \
    $CMD_sort -un > \ alf_audit_model_content_data_id.txt
    $CMD_sed -i ‘s/^\s\+//’ alf_audit_model_content_data_id.txt

    echo
    echo “Unique entries in alf_content_data”
    $CMD_perl -nle ‘print if /^\d+$/’ alf_content_data_ids.txt | wc -l

    echo “Unique entries in alf_node_properties”
    $CMD_perl -nle ‘print if /^\d+$/’ alf_node_properties_long_value.txt | wc -l

    echo “Unique entries in avm_nodes”
    $CMD_perl -nle ‘print if /^\d+$/’ avm_nodes_content_url.txt | wc -l

    echo “Unique entries in alf_audit_model”
    $CMD_perl -nle ‘print if /^\d+$/’ alf_audit_model_content_data_id.txt | wc -l

    echo
    echo “Unique entries in use (alf_node_properties + avm_nodes + alf_audit_model)”
    cat alf_node_properties_long_value.txt avm_nodes_content_url.txt \
    alf_audit_model_content_data_id.txt | $CMD_sort -un | $CMD_perl -nle ‘print if /^\d+$/’ > \
    entries_in_use.txt
    cat entries_in_use.txt | wc -l

    echo
    echo “In alf_content_data but not in any of alf_node_properties, avm_nodes, or \
    alf_audit_model”
    $CMD_diff alf_content_data_ids.txt entries_in_use.txt | $CMD_perl -nle ‘print if /^ \d+$/’ | wc -l

    echo “In avm_nodes but not alf_content_data”
    $CMD_diff alf_content_data_ids.txt avm_nodes_content_url.txt | \
    $CMD_perl -nle ‘print if /^> \d+$/’ | wc -l

    echo “In alf_audit_model but not alf_content_data”
    $CMD_diff alf_content_data_ids.txt alf_audit_model_content_data_id.txt | \
    $CMD_perl -nle ‘print if /^> \d+$/’ | wc -l

    $CMD_diff alf_content_data_ids.txt entries_in_use.txt | $CMD_perl -nle ‘print if /^< \d+$/' | \
    $CMD_sed 's/^limbo_nodes.txt

    nodes_to_check=`$CMD_perl -pe ‘s/\n/, /g’ limbo_nodes.txt | $CMD_sed ‘s/, $//’`
    if [ -n “$nodes_to_check” ]
    then
    echo
    echo “Listing of node IDs in alf_content_data that are not in any of \
    alf_node_properties, avm_nodes, or alf_audit_model”
    echo “$nodes_to_check”

    echo
    echo “Files corresponding to the node IDs in alf_content_data that are not in any of \
    alf_node_properties, avm_nodes, or alf_audit_model”
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select acu.content_url from alf_content_data acd, alf_content_url acu where acd.id \
    in ($nodes_to_check) AND acd.content_url_id=acu.id;” | \
    $CMD_sed “s/store:\//$ESCAPED_REPO_PATH/”
    fi

    cd $ORIG_WD
    rm -rf $TMP_PATH

  4. Imran Iqbal says:

    Incredibly sorry but I joined a couple of the script lines, forgetting to remove the backslashes. The following lines need to be substituded above:

    #get the list of nodes that appear in avm_nodes
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select substring(content_url,4) FROM avm_nodes” | \
    $CMD_sort -un > avm_nodes_content_url.txt
    $CMD_sed -i ‘s/^\s\+//’ avm_nodes_content_url.txt

    #get the list of nodes that appear in alf_audit_model
    PGPASSWORD=$DB_PASS $CMD_psql –tuples-only -U $DB_USER -h $DB_HOST -p $DB_PORT -d $DB_NAME -c \
    “select content_data_id FROM alf_audit_model” | \
    $CMD_sort -un > alf_audit_model_content_data_id.txt
    $CMD_sed -i ‘s/^\s\+//’ alf_audit_model_content_data_id.txt

Leave a comment