I was given a USB stick with 16,000 PowerPoint files with the .ppt extension dating from 2005, each of which is a manually scanned page of sheet music. This was done by a very diligent soul all those years ago, who knew little of the pernicious ways of Big Tech, which regularly changes the data formats of digital artefacts and so forces users to constantly upgrade their expensive software. No surprise, then, that files produced by the 2005 version of PowerPoint are not recognised by more recent versions of the PowerPoint application.
In this case study, there are further problems: each document page is a separate PowerPoint file, there are spelling errors in the titles, the naming is inconsistent, and a host of other iniquities were introduced during the manual scanning of this corpus magnus.
All this wonderful sheet music should be made available to the band members and serve as a digital repo for the band's music, but left in this format and file organization, the hard work now seems to be totally wasted. This is a typical Data Cleansing problem, so let's have a look at how bad it is and what could be done to fix this data mess, so that mortals can read and enjoy it again. We are only going to fix the directory and file names, not the contents, which are essentially scanned bitmapped images, and that is an OCR problem for another day!
Data Cleansing with BASH & LibreOffice
Data Cleansing operations typically entail an issue discovery phase followed by a fixing process spread over a number of steps, depending on how bad a shape your source data is in and what you are trying to achieve. A good practice is to have a new working directory for every step. Simply call each working directory step1, step2, etc., and document what you do in every step. If the data cleansing step in your current working directory fails, simply delete everything in it, copy everything from the previous step's working directory, and try again. You will thank me for this soon enough. It is not unusual to find further data issues as you progress through your data cleansing steps.
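The reset pattern looks like this in practice. This is a sketch played out in a throwaway workspace under /tmp rather than the real ~/stepN directories, with made-up file contents:

```shell
# A sketch of the reset pattern, using a throwaway workspace
# instead of the real ~/stepN directories.
ws=$(mktemp -d)
mkdir -p "$ws/step2" "$ws/step3"
echo "good data" > "$ws/step2/data.txt"
echo "corrupted" > "$ws/step3/data.txt"

# The reset: wipe the failed step and restore the previous step's snapshot.
rm -rf "$ws/step3"/*
cp -r "$ws/step2"/. "$ws/step3"/
```

After the reset, step3 once again holds an exact copy of step2, and the failed attempt can be retried.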
Start off by creating your workspace of, say, 6 "step" directories:
gerrit@z3:~$ mkdir step{1..6}
What does this command do?
By providing a range of values in the {} bit, the command is executed once for every value in the series. This is the same as saying: mkdir step1; mkdir step2; mkdir step3; ... ; mkdir step6
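You can see the expansion for yourself, since the shell performs it before the command ever runs:

```shell
# Brace expansion is performed by the shell before the command runs,
# so mkdir simply receives six separate arguments.
expanded=$(echo step{1..6})
echo "$expanded"   # step1 step2 step3 step4 step5 step6
```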
Step 1: Unpack, Observe & some initial cleansing
Unpack or copy the content into the directory step1, and have a look at its contents using the ls and tree commands. I get something like this:
gerrit@z3:~/step1$ ls
...
STAR LAKE  - MARCH ⇐ double space
STAR WARS
ST KILDA
STRAWBERRY FAIR
STRIDING EDGE
STRIKE UP THE BAND
...
ZORBA'S DANCE
And:
gerrit@z3:~/step1$ tree
...
── ZORBA'S DANCE
├── ZORBA'S DANCE  1ST BARITONE 1.ppt ⇐ double space
├── ZORBA'S DANCE 1ST BARITONE 2.ppt
├── ZORBA'S DANCE 1ST HORN 1.ppt
├── ZORBA'S DANCE 1ST HORN 2.ppt
├── ZORBA'S DANCE 2ND CORNET 1.ppt
├── ZORBA'S DANCE 2ND CORNET 2.ppt
├── ZORBA'S DANCE FLUGAL 1.ppt ⇐ spelling error: Flugel, or Flügel/Fluegel
├── ZORBA'S DANCE 2ND HORN 1.ppt
├── ZORBA'S DANCE 2ND HORN 2.ppt
├── ZORBA'S DANCE 2ND TROMBONE1.ppt ⇐ missing space before page number
├── ZORBA'S DANCE 2ND TROMBONE 2.ppt
├── ZORBA'S DANCE REPIANO.ppt ⇐ spelling error: Ripieno
etc...
Initial observations:
We have directories named by song title, which contain the song's instrument files. The files are more or less named as follows: [title][space][instrument][space][page number]
.ppt
The problems that we can see are:
- All files seem to be old-style PowerPoint files with the .ppt extension. These should be made readable by converting them to PDF files.
- Every page of a multipage document is a separate file, and the document's page number is embedded at the end of the file name.
- There are double spaces in a few file names and directory names. This looks sloppy and could cause confusion later on as we parse the names of the files.
- Sometimes the space before the page number is missing.
- What happens when there are documents with pages that go into the double digits - are the pages left-padded with 0s for correct sorting?
- These files were awkwardly named in capitals. It's ugly and hinders reading. We are not mainframe programmers.
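The double-digit question can be answered with a quick glob search. This sketch plays it out on made-up file names in a throwaway directory, following the naming convention above:

```shell
# Sketch: detect files whose page number runs into two digits,
# using made-up file names in a throwaway directory.
ws=$(mktemp -d)
cd "$ws"
touch "SONG CONDUCTOR 9.ppt" "SONG CONDUCTOR 10.ppt"

# Any hit here means pages go into double digits and may need 0-padding.
double=$(find . -type f -name "* [0-9][0-9].ppt")
echo "$double"   # ./SONG CONDUCTOR 10.ppt
```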
We will deal with these issues in the next step, but first another sanity check:
Are we sure that there are only .ppt files here?
All files seem to be old-style PowerPoint files with the .ppt extension, but let's double-check by getting the extension of each file with the sed command and then counting each unique occurrence by piping all the extensions through the uniq command:
gerrit@z3:~/step1$ find . -type f | sed 's/.*\.//' | sort | uniq -c
1 doc
15876 ppt
Oh dear, there is a single .doc
file in there! The quickest way to find it is to use the versatile find
command again:
gerrit@z3:~/step1$ find . -name "*.doc"
./ALL THROUGH THE NIGHT/ALL THROUGH THE NIGHT MISSING.doc
On inspection, this file contains some irrelevant notes, so we can delete it. There may be other irrelevant files in your collection, so get rid of them as early on as possible in your process!
Bulk-remove files with a particular file extension:
This is how to bulk-remove files with a file extension that is known to be irrelevant:
gerrit@z3:~/step1$ find . -name "*.doc" -exec rm {} \; -print
./ALL THROUGH THE NIGHT/ALL THROUGH THE NIGHT MISSING.doc
What did we do here?
The versatile find command can also be used to execute a command for every file that matches the search criteria.
In this case, the search criterion is "*.doc", specified in the -name parameter, and we delete every occurrence of it in the command that follows the -exec parameter.
There is a placeholder for where the filepath would normally go in that command, which is the {} thingy. This is a common convention for indicating placeholders in BASH.
The end of the embedded command is signalled with the \; marker, which is also a common BASH convention.
For good measure, we also add the -print parameter at the end, which prints the path of each file that was removed.
Step 2: Convert all obsolete file formats to PDF file format
Since we have no intention to modify the contents of these old-skool PowerPoint files, one might as well convert them to a read-only format that may last a little longer and is likely to be upwardly compatible with future versions of the file format's reading application. And while we are at it, let's collate the loose files into the multipage documents that they are supposed to be.
First of all, if you are happy with what you have achieved in Step 1, prepare the work area for Step 2: cd ~/step2; cp -r ../step1/* .
Why PDF?
PDF is the commonly accepted and open standard file format, and nearly everyone has a PDF file reader installed on their computing device at no extra cost. The tools to modify PDF files can be expensive, but we will not be editing these; we will just curate and make them readable again for everybody. Furthermore, the PDF file format has a far better chance of "not going out of date", unlike the bespoke and closed-source file formats, as we have here, that are at the behest of the vendor to lock consumers into their product line. We can be sure that all our hard work will still be readable in years to come. After all, billions of documents are produced daily in PDF format, so let's produce a few more, then.
To read these obsolete PowerPoint files in the first place, we can use LibreOffice (bundled with most Linux distros), which can read many document types and their legacy versions from as far back as the 1990s. The command to do the conversion on a sample file is:
gerrit@z3:~$
soffice --headless --invisible --convert-to pdf --outdir "/tmp" samplefile.ppt
About LibreOffice's command-line parameters
Since we don't want the GUI to open up every time we process a file, we add the --headless and --invisible parameters to the command that invokes LibreOffice, and instruct it to read our sample file and save it as a PDF file. The --outdir parameter is added to the soffice command to tell it where to put the output; otherwise, the output lands in the current working directory from where this command is invoked.
Applying this to all 16,000 files
Luckily, it is possible to execute this complex command on every file that our new friend, the find command, can find. We can solve this problem with a single BASH incantation that would make Harry Potter blush:
gerrit@z3:~/step2$
find . \
> -name "*.ppt" \
└─ search criteria ─┘
> -type f \
└─ only search files ─┘
> -exec \
└─ execute this for every file found ─┘
> bash -c 'd="${0%/*}" ; soffice --headless --invisible --convert-to pdf "$0" --outdir "$d"' \
└─ These are the subshell commands ──────────────────────────────────────────────┘
>{} \;
└─ found filepath placeholder ─┘└─ end of -exec marker ─┘
Note that a long command can be extended to the next line by using a "\" character and then hitting Return. The following line has a ">" preamble to indicate that it is a continuation of the previous line.
You still with me? This is what happens:
There are two things that need to happen when the find
finds the file:
- Determine the directory name where the converted file should go. The behaviour of this call to LibreOffice is to dump the converted PDF output file in our current working directory, which would leave all the PDF files in one directory, with the likelihood of some files overwriting each other. What we really want is to put each converted PDF file alongside its old .ppt file in their respective directories. Luckily, we can force LibreOffice to dump the output file in any desired directory by setting the --outdir [some directory] parameter.
- Perform the conversion
The -exec parameter is limited to a single operation, so if multiple operations are required, as is the case here, then the commands need either to be scripted in a separate BASH script that is called from here, or we can use a subshell inside the -exec command, which is the more elegant and maintainable approach.
More on BASH subshells
Think of the invoked subshell from -exec
as an embedded mini-script of commands, in the form:
$ find ...stuff... -exec bash -c '...a bunch of commands that use $0...' {} \;
└─ These are the subshell commands ───┘
Note that in a self-contained bash program, parameter $0 is /usr/bin/bash and parameters $1, $2, onwards are the first, second, etc. command-line parameters passed when invoking the BASH script. In a BASH subshell as invoked by bash -c, however, the parameter indexes start at 0. This means that the first passed-on parameter inside the subshell is referenced as $0. And since we pass the filepath placeholder {} as the first parameter to the subshell, the filepath is what the subshell will get when it refers to $0. Since the subshell typically only contains a few commands, they are often left on the same line and separated by the ";" character.
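A tiny demonstration makes the $0 behaviour obvious (the file path here is made up):

```shell
# In a "bash -c" subshell, the first argument after the command string
# arrives as $0, not $1.
got=$(bash -c 'echo "first parameter is: $0"' "/some/file.ppt")
echo "$got"   # first parameter is: /some/file.ppt
```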
And now for some string manipulations in BASH...
It gets even more interesting: We only want the directory part of the file path in variable $d
to tell LibreOffice in which directory to put its results. Remember that the PowerPoint filepath in {}
arrives in the sub-shell as parameter 0, which we access as $0
, or ${0}
. Using the curly-bracket form allows us to add in-line operators that manipulate the content of the variable "on the fly", such as removing bits of the string that match a pattern, changing its case, and much more. Back to the issue at hand: we want to remove the filename from the filepath, so that we end up with just the directory, which we do with the %[pattern] BASH substitution operator. Note that these patterns are glob patterns, like the wildcards used for filename matching, not regular expressions. The % operator removes the shortest suffix of the string that matches the pattern, and we call this the "non-greedy" mode. If we used the %%[pattern] operator instead, the longest matching suffix would be removed, so this mode is unsurprisingly called the "greedy" mode. The #[pattern] and ##[pattern] operators work in a similar way, except that they remove a matching prefix from the beginning of the string. Applying our new-found knowledge to extract the file directory from the filepath, we use BASH substitution to set d=${0%/*}. This is a nice way of saying: strip the shortest trailing part that matches /*, i.e. remove the file name from the path, leaving us with just the file's full directory.
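Here it is on one of the file paths from our collection:

```shell
# ${path%/*} strips the shortest trailing match of "/*",
# leaving just the directory part of the path.
path="./ZORBA'S DANCE/ZORBA'S DANCE 1ST HORN 1.ppt"
dir="${path%/*}"
echo "$dir"   # ./ZORBA'S DANCE
```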
Likewise, if we want to construct the name of the resulting PDF file from the PPT file, we can strip the .ppt file extension and replace it with the string .pdf. The following BASH substitution in the subshell will do this: t=${0%\.ppt}.pdf, which is another nice way to say things: remove the trailing .ppt from the filepath, and then add .pdf to get the PDF filepath in variable t. Since this is a glob pattern, the period is matched literally; escaping it as \. is harmless but not required.
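Again, a quick demonstration on a sample path from the collection:

```shell
# Strip the ".ppt" suffix and append ".pdf" -- no escaping needed,
# since the suffix is a glob pattern, not a regex.
src="./STAR WARS/STAR WARS FLUGEL 1.ppt"
dst="${src%.ppt}.pdf"
echo "$dst"   # ./STAR WARS/STAR WARS FLUGEL 1.pdf
```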
Restart from where we left off
This can be a lengthy process, depending on your computational resources. If there were failures along the way, or the process had to be stopped, could we restart from the last successful conversion and so avoid repeating the conversion for the files that have already been converted? Yes, we can! With a simple check of whether the converted PDF file already exists, the conversion can be skipped; otherwise the file conversion is performed. This is done by adding to the BASH subshell command a conditional clause that only invokes the LibreOffice conversion utility if the PowerPoint file's corresponding PDF file does not yet exist.
First, using BASH substitution, we determine the name of the PDF file from the PowerPoint file: pdffile=${0%\.ppt}.pdf.
Putting the -f
file-exists operator in a BASH-style if-then-else ternary expression, which has the structure of [condition] && [do this if true] || [do this if false]
, the subshell script becomes:
d="${0%/*}"
└ directory ┘
pdffile=${0%\.ppt}.pdf
└ make up PDF filename ┘
[[ -f $pdffile ]] \
└ condition ───┘
  && echo "File $pdffile exists. Skipping..." \
  └ do this if true ──────────────────────────┘
  || libreoffice --headless --invisible --convert-to pdf "$0" --outdir "$d"
  └ do this if false ─────────────────────────────────────────────────────┘
This can be simplified by only testing for the inverse condition, i.e. that the file does not exist, using the "!" negation operator:
t=${0%\.ppt}.pdf; [[ ! -f $t ]] && libreoffice --headless --invisible --convert-to pdf "$0" --outdir "$d"
Putting this all together, the final command to process all our files is:
gerrit@z3:~/step2$ find . -name "*.ppt" -type f -exec bash -c 'd="${0%/*}"; t=${0%\.ppt}.pdf; [[ ! -f $t ]] && libreoffice --headless --invisible --convert-to pdf "$0" --outdir "$d"' {} \;
This process took 5 hours to run for the 16,000 files on an old quad-core laptop, which quickly broke out in a sweat and had the fans whirring. But computers are our slaves, right?
Step 3: Fixing text errors in the file names
First of all, if you are happy with what you have achieved in the previous step, prepare the work area for Step 3 by copying only the .pdf files into the new work area. The simplest method is to copy everything over and then remove the source .ppt files using find:
gerrit@z3:~$ cd ~/step3
gerrit@z3:~/step3$ cp -r ~/step2/* .
gerrit@z3:~/step3$ find . -name "*.ppt" -exec rm {} \;
Cleaning the Filenames up
Use the rename command to rename many files in the BASH shell. It is very useful because a sed-style regular expression can be used to manipulate the file name, and it can be applied to an entire directory of files. Be sure to read its man-page (just type man rename). Since the command works on an entire directory, we only need to iterate through all the directories and then apply the command in each one.
Spelling fixes
Let's start by fixing those clunky spelling errors with the now-familiar find command plus BASH subshell:
gerrit@z3:~/step3$
find . -type d -exec bash -c 'cd "$0"; rename "s/REPIANO/RIPIENO/" *; cd - > /dev/null' {} \;
gerrit@z3:~/step3$
find . -type d -exec bash -c 'cd "$0"; rename "s/FLUGAL/FLUGEL/" *; cd - > /dev/null' {} \;
The find command only looks for directories (-type d). In the subshell, we enter the passed-in directory and rename every file whose name matches the regular expression (the misspelt word), substituting that text (hence the "s" operator) with the correctly-spelt expression. We then return to the original directory (cd -) without showing any output by redirecting this command's output to oblivion (the > /dev/null bit).
Similarly, any double spaces in the files can be converted to single spaces:
gerrit@z3:~/step3$
find . -type d -exec bash -c 'cd "$0"; rename "s/  / /g" *; cd - > /dev/null' {} \;
Note the "g
" modifier, which tells the substitution operator not to stop after the first instance of a double-space has been found - there may be more cases of this in the same file name.
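You can preview the effect of the substitution before letting rename loose on real files. The sample name below is made up for illustration:

```shell
# Preview the double-space fix by running the same substitution
# through sed on a sample name -- no files are touched.
name="STAR LAKE  - MARCH"
fixed=$(echo "$name" | sed "s/  / /g")
echo "$fixed"   # STAR LAKE - MARCH
```

If your rename is the Perl version (the one that accepts s/// expressions, as used throughout this article), its -n flag performs a similar dry run, printing the renames it would do without executing them.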
Page numbering issues
Every page of a multipage document is a separate file with the page number embedded in the file name. Sometimes the space before the page number is missing, which can be fixed with the magic of regular expressions. A quick listing of the files that fall foul of this problem:
$ find . -type f -name "*[A-Z][0-9]\.pdf"
./OKLAHOMA/OKLAHOMA SOPRANO1.pdf
./THE GAY 90'S/THE GAY 90'S SOLO CORNET2.pdf
./VALSE ESTUDIANTINA/VALSE ESTUDIANTINA 3RD CORNET1.pdf
etc...
To apply a fix, all that is required is to add a rename command in the -exec parameter of find that inserts a space before the page number. The pattern ..pdf matches the page digit, the period and the extension; the parentheses () capture this match so that the replacement can put it back, referenced as \1, with a space added in front:
~/step3$ find . -type f -name "*[A-Z][0-9]\.pdf" -exec rename 's/(..pdf)/ \1/' {} \;
Test if it worked - this should not return anything:
~/step3$ find . -type f -name "*[A-Z][0-9]\.pdf"
The next page numbering problem is that when there are more than 9 pages in the document, the page numbers are not 0-padded, which causes the pages not to be listed in their intended order, as shown in this example:
~/step3$ find . -type f | sort
...
./VANGUARD/VANGUARD CONDUCTOR 10.PDF
./VANGUARD/VANGUARD CONDUCTOR 11.PDF
./VANGUARD/VANGUARD CONDUCTOR 12.PDF
./VANGUARD/VANGUARD CONDUCTOR 13.PDF
./VANGUARD/VANGUARD CONDUCTOR 14.PDF
./VANGUARD/VANGUARD CONDUCTOR 15.PDF
./VANGUARD/VANGUARD CONDUCTOR 1.PDF
./VANGUARD/VANGUARD CONDUCTOR 2.PDF
./VANGUARD/VANGUARD CONDUCTOR 3.PDF
./VANGUARD/VANGUARD CONDUCTOR 4.PDF
./VANGUARD/VANGUARD CONDUCTOR 5.PDF
./VANGUARD/VANGUARD CONDUCTOR 6.PDF
./VANGUARD/VANGUARD CONDUCTOR 7.PDF
./VANGUARD/VANGUARD CONDUCTOR 8.PDF
./VANGUARD/VANGUARD CONDUCTOR 9.PDF
etc.
A quick survey shows that there are no documents that hold more than 99 pages, so left-padding the page number to two digits is sufficient. The quick fix is to add a leading 0 in front of every page number that is only one digit long. We search for these files in the -name
parameter and apply rename
's search-and-replace that prefixes a 0 to the found expression:
~/step3$ find . -type f -name "* [0-9]\.pdf" -exec rename 's/ ([0-9]\.pdf)/ 0\1/' {} \;
We now have files named:
./VANGUARD/VANGUARD CONDUCTOR 01.PDF
./VANGUARD/VANGUARD CONDUCTOR 02.PDF
./VANGUARD/VANGUARD CONDUCTOR 03.PDF
...
./VANGUARD/VANGUARD CONDUCTOR 15.PDF
Sanity checking the results so far
In this case, the general format of the directory and files is [song title]/
[song title] [instrument].pdf
. A good check would be to see where this is not the case and then manually correct mismatches. This sort of check throws up all sorts of other errors that would bedevil the subsequent processes. So, depending on your data set, run this check repeatedly until all issues are fixed. The example shows that there are still some mismatches that originated from the original manual scanning process.
~/step3$ find . -type f -exec bash -c 'dir=${0%/*}; dir=${dir#\.\/}; file=${0##*/}; [[ $file =~ $dir ]] && : || printf "\nDirectory song name [%s] is not in instrument part file [%s]" "$dir" "$file" ' {} \; | sort
Directory song name [JUST ONE CORNETTO] is not in instrument part file [JUST ONE CONETTO 2ND BARITONE.pdf]
Directory song name [TIME TO SAY GOODBYE] is not in instrument part file [TIME TO SAY GOODBY FLUGEL.pdf]
etc...
Since these manual corrections can take some time, do not do any more work in this step's workspace and proceed to the next step.
Did you see what we did there?
In the ternary if-then-else of [[ $file =~ $dir ]] && : || printf..., we need to do nothing when the condition is true, so we put a do-nothing operator in the true-clause in the form of a ":". Leaving the true-clause out would create a BASH syntax error. The output is piped through the sort command, which will reduce your directory-jumping while manually fixing things, since find does not return results in any specific order.
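The ":" trick and the =~ match can be seen in isolation, here replayed on one of the mismatches the check actually found:

```shell
# The ":" builtin does nothing and always succeeds, which is what makes
# the empty true-branch of the ternary expression syntactically legal.
dir="JUST ONE CORNETTO"
file="JUST ONE CONETTO 2ND BARITONE.pdf"
out=$([[ $file =~ $dir ]] && : || echo "mismatch: $file")
echo "$out"   # mismatch: JUST ONE CONETTO 2ND BARITONE.pdf
```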
Step 4: Tarting up the file names
Remember that the general format of the directory and files is [song title]/[song title] [instrument].pdf. It is useful to mark the separation between the song title and the instrument with a hyphen in the filename, so that it is in the form [song title] - [instrument].pdf. For example: ./VANGUARD/VANGUARD CONDUCTOR 1.PDF becomes ./VANGUARD/VANGUARD - CONDUCTOR 1.PDF
. This time we use the sed
utility with a bit of regex-magic inside a find
command, like so:
~/step4$ find . -type f -exec bash -c 'dir=${0%/*}; dir=${dir#\.\/}; newfile=$( echo "$0" | sed "s/\($dir\/$dir\)/\1 -/" ); mv "$0" "$newfile"' {} \;
Interesting to know:
The most frequent use of sed (a "stream editor") is its "s" command, which searches for a regex and replaces it with something else, in the form: 's/regex/replacement/'. Traditionally, we use the "/" character to delimit the terms in a sed expression, which means that any "/" characters in the sed expression itself need to be escaped with a "\" character. This means that "s/\($dir\/$dir\)/\1 -/" could be rewritten as "s|\($dir/$dir\)|\1 -|", using the "|" character as the delimiter instead, so that the "/" inside the regex no longer needs to be escaped. When we use BASH variables inside sed, the command needs to be wrapped in double quotes so that the variables are interpolated, i.e. the value of whatever is in $dir is processed, instead of the literal string "$dir" itself.
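Here is the alternate-delimiter form at work on a sample path from our collection:

```shell
# The same substitution written with "|" as the sed delimiter, so the
# "/" between the two directory names needs no backslash.
dir="VANGUARD"
new=$(echo "./$dir/$dir CONDUCTOR 1.PDF" | sed "s|\($dir/$dir\)|\1 -|")
echo "$new"   # ./VANGUARD/VANGUARD - CONDUCTOR 1.PDF
```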
Step 5: Friendlier filename capitalization
We still take care to be able to recover when things don't go as planned, right?
~$ cd ~/step5
~/step5$ cp -r ~/step4/* .
Filename Readability
All our file and directory names are in upper case. Let's make them more readable!
A simple approach is to convert everything into lowercase, except for the first letter of every word. This is often called American-style capitalization. An improvement on this approach would be to only capitalize sentence beginnings, proper names and nouns. Since we can't identify nouns without the use of a cumbersome dictionary, an effective although imperfect approach is to apply American-style capitalization to all the words in the file and directory names, and then convert certain words back to lowercase, such as articles, adjectives, adverbs and pronouns (except for "I"). All of this can be done in sed, and the basic premise is demonstrated by running the directory listing through this sed command:
~/step5$ ls | sed -e "s/\(\w\)\(\w*\)/\1\L\2/g" \ ⇐ American capitalization regex
-e "s/ Me / me /g" \ ⇐ regex to lowercase this free-standing word
-e 's/ Are / are /g' \ ⇐ and so on...
-e 's/ Own / own /g' \
-e 's/ The / the /g' \
etc...
Which gives:
...
You Raise me Up ⇐ sed did this
Your own Tale ⇐ sed did this too!
Yuletide Gallop
Y Viva Espana
Zorba's Dance
This could become a very long and unmanageable sed command if we want to treat all the possible articles, adjectives, adverbs and pronoun exceptions. Luckily, it can all be put into a single sed script, "rename.sed", which is passed to sed with the -f parameter:
~/step5$
ls | sed -f rename.sed
The sed script contains the American-style capitalization replacement in the first line, and depending on your case, all the exceptions that need to be forced to lower case:
~/step5$ nano rename.sed
s/\(\w\)\(\w*\)/\1\L\2/g ⇐ American capitalization regex
s/ Me / me /g ⇐ regex to lowercase this free-standing word
s/ Are / are /g
s/ The / the /g
s/ Of / of /g
s/ To / to /g
s/ To$/ to/g ⇐ deal with last word in string
etc...
Note that the words that need to be cast to lowercase are left- and right-padded with a space in the search terms, since we do not want to apply this where the word is the first in the title. When the word is the last in the title string, the regex needs an end-of-string marker, "$", as shown in the example. Putting all this in a BASH subshell in find's -exec parameter, we get:
~/step5$ find . -type f -exec bash -c 'newfile=$(echo "$0" | sed -f rename.sed); mv "$0" "$newfile"' {} \;
How was this put together?
In the -exec parameter, the found-file placeholder {} is the AS-IS file and contains the full file path of the next found file, which is munged by the sed command to form the TO-BE filename, so that we can construct a command along the lines of mv "AS-IS" "TO-BE". Note the use of the double quotes to prevent file names that contain spaces from splitting, and to avoid other wicked shell-trickery when a processed string contains parentheses, ampersands and apostrophes.
This first needs to be done on the directory names, which contain the song names, and then on the file names, which contain the song's instrument parts. Note the use of the -type d
parameter for directories, and -type f
for files:
~/step5$ find . -type d -exec bash -c 'newdir=$(echo "$0" | sed -f rename.sed); mv "$0" "$newdir"' {} \;
~/step5$ find . -type f -exec bash -c 'newfile=$(echo "$0" | sed -f rename.sed); mv "$0" "$newfile"' {} \;
Looking at the results, the song title directories are much more readable now:
~/step5$ find .
./Cavalry of the Steppes
./Cavalry of the Steppes/Cavalry of the Steppes - 3rd 4th Horn 01.pdf
./Cavalry of the Steppes/Cavalry of the Steppes - 3rd 4th Horn 02.pdf
./Cavalry of the Steppes/Cavalry of the Steppes - Euphonium 01.pdf
./Cavalry of the Steppes/Cavalry of the Steppes - Euphonium 02.pdf
etc...
Step 6: Collating loose pages into documents
This can get messy, so we want to make a new working space:
$ cd ~/step6
~/step6$ cp -r ../step5/* .
Using the listing from the previous step as an example, we want to collate the pages that belong to each document into a single file, and the resulting document file needs to be named like this:
./Cavalry of the Steppes/Cavalry of the Steppes - 3rd 4th Horn.pdf
./Cavalry of the Steppes/Cavalry of the Steppes - Euphonium.pdf
etc...
The page numbers at the end of the file name disappear, and if there is no page number, then there is only one page and the file is already as it should be. For the remaining pages that need to be collated, we use page "01" as the starting page of each document and then use the pdftk utility to collate the pages into a new PDF file without a page number in its name. The pdftk utility can perform all sorts of PDF file and content manipulations, although we are only interested in using the tool to concatenate multiple PDF files into one file, for example: pdftk doc01.pdf doc02.pdf doc03.pdf doc04.pdf cat output doc.pdf
By globbing the files with ??
, we can reduce it to: pdftk doc??.pdf cat output doc.pdf
Or, even more elegantly, by setting the glob in a variable, which is the form we will use: glob='doc??.pdf'; pdftk $glob cat output doc.pdf
Let's find each file that is a page "01" by setting the search parameter -name "* 01.pdf"
and create the globbing expression by replacing the "01" with "??". Then we escape all spaces, ampersands (hex code 0x26), apostrophes (hex code 0x27) and parentheses with a backslash "\
". From this we can create the final document name in variable $doc, by stripping the "??" bit from the $glob
variable. Finally we concatenate the pages into a single PDF document and delete the old page-numbered files.
~/step6$ find . -name "* 01.pdf" -exec bash -c ' glob=${0/ 01\./ \?\?.}
         └ search criteria ┘                     └ files to concatenate ┘
glob=$(echo $glob | sed -e "s/\([ \x26\x27)(]\)/\\\\\1/g")
└ escape troublesome characters with backslashes ────────┘
doc=${glob/ \?\?/}
└ remove " ??" ──┘
cmd="pdftk $glob cat output $doc 2>> errors.txt"; eval $cmd
└ concatenate pages into a document ─────────────────────┘
cmd="rm $glob"; eval $cmd' {} \;
└ remove loose pages ───┘
We append any errors that pdftk generates to an error file, conveniently called errors.txt. Since this process can take quite a while, we can amuse ourselves by watching the errors on a separate terminal as they occur, which in tech-speak is called "tail-ing":
~/step6$ tail -f errors.txt
We can put this into one line that seasoned Hogwarts alumni would applaud:
~/step6$ find . -name "* 01.pdf" -exec bash -c 'glob="${0/ 01\./ \?\?.}"; glob=$(echo "$glob" | sed -e "s/\([ \x26\x27)(]\)/\\\\\1/g"); doc="${glob/ \?\?/}"; cmd="pdftk $glob cat output $doc 2>>errors.txt"; echo $cmd; eval $cmd; cmd="rm $glob"; echo $cmd; eval $cmd' {} \;
We end up with these concatenated files:
./Cavalry of the Steppes
./Cavalry of the Steppes/Cavalry of the Steppes - Solo Cornet.pdf
./Cavalry of the Steppes/Cavalry of the Steppes - 1st Trombone.pdf
./Cavalry of the Steppes/Cavalry of the Steppes - 2nd Trombone.pdf
etc...
Perfect, and ready to publish for our users to browse!