Burst a Content-indexed PDF doc into separate chapter PDF files

You can burst a PDF doc that has a table of contents into separate chapter PDF files, with each file named after its chapter. This post describes how to go about it, and at the end there is a full "Burst By Bookmark" shell script that does exactly this. Strictly speaking, we are not dealing with PDF bookmarks but with PDF table-of-contents chapters. If the PDF doc doesn't have a table of contents, you can add one using a bespoke and expensive PDF tool, or use a different method to chunk the doc up into chapters.

Here is a quick demonstration of what happens. We start with a single PDF file, BigBook.pdf, that has a table of contents. On opening this file in a PDF viewer such as okular, the table of contents on the left displays the chapters.

Running the "Burst By Bookmark" shell script in a directory where BigBook.pdf is the one and only PDF file, we get:

$ ./burstbybookmark
Chunking BigBook.pdf into 20 files:
Assembling 'BigBook - Chapter 1.pdf' from pages 1 to 17...
Assembling 'BigBook - Chapter 2.pdf' from pages 18 to 20...
...
Assembling 'BigBook - Chapter 21.pdf' from pages 69 to 71...
Assembling 'BigBook - Chapter 22.pdf' from pages 72 to 74...
Done

How does the "Burst By Bookmark" shell script work?

We have the pdftk tool at our disposal, which is a bit of a Swiss Army knife for dealing with those sometimes annoying PDF files. Since the file has a table of contents, we can list the page on which each chapter starts (pdftk refers to these entries as "bookmarks") and use this as a basis for automating the split into one file per chapter:

$ pdftk BigBook.pdf dump_data | grep -2 "BookmarkTitle:"
NumberOfPages: 74
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 18
...

This lists the page that each chapter starts on. Assuming, as is common, that each chapter starts on a new page, it is reasonable to infer that a chapter ends one page before the next chapter starts. For example, from this listing we can see that "Chapter 1" starts on page 1 and "Chapter 2" starts on page 18, which means that "Chapter 1" ends on page 17. The last chapter has no successor, so it ends on the last page of the document, which is given by NumberOfPages.
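The page arithmetic described above can be sketched as a small Bash function. This is only an illustration: the function name chapter_ranges is made up for this example, and the three start pages describe a hypothetical three-chapter, 74-page document rather than the full BigBook.pdf.

```shell
#!/bin/bash
# chapter_ranges LASTPAGE START1 START2 ...   (hypothetical helper)
# Prints one "start-end" range per chapter. Each chapter ends one page
# before the next one starts; the last chapter ends on LASTPAGE.
chapter_ranges() {
    local lastpage=$1
    shift
    local starts=("$@")
    local i end
    for ((i = 0; i < ${#starts[@]}; i++)); do
        if ((i < ${#starts[@]} - 1)); then
            end=$(( ${starts[i+1]} - 1 ))
        else
            end=$lastpage
        fi
        echo "${starts[i]}-${end}"
    done
}

# Hypothetical document: 74 pages, chapters starting on pages 1, 18 and 21
chapter_ranges 74 1 18 21
```

The ranges are printed in the same "start-end" form that pdftk's cat operation accepts.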

The pdftk utility can also selectively reassemble pages from one PDF file into another PDF file, so the PDF file for Chapter 1 would be created like this:

$ pdftk BigBook.pdf cat 1-17 output "BigBook - Chapter 1.pdf"

And so on for each chapter.
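Before letting a loop loose on a real file, it can help to print the commands first. The following is a dry-run sketch only: the function name print_chapter_command is invented for this example, and it prints the pdftk command line for one chapter without running it. The page ranges used are the ones shown in the demonstration output earlier.

```shell
#!/bin/bash
# print_chapter_command SOURCE START END TITLE   (hypothetical helper)
# Prints, but does not run, the pdftk command that would extract one
# chapter into "<source without .pdf> - <title>.pdf".
print_chapter_command() {
    local src=$1 start=$2 end=$3 title=$4
    local base=${src%.pdf}
    printf 'pdftk "%s" cat %s-%s output "%s - %s.pdf"\n' \
        "$src" "$start" "$end" "$base" "$title"
}

print_chapter_command BigBook.pdf 1 17 "Chapter 1"
print_chapter_command BigBook.pdf 18 20 "Chapter 2"
```

Once the printed commands look right, they can be run as they are.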

The final script looks like this and does some Bash array iteration magic:

#!/bin/bash
# The main tool needs to be installed!
command -v pdftk >/dev/null || { printf "pdftk is not installed\nExiting...\n"; exit 1; }
# Use the file given on the command line, else the first PDF that ls -S lists
# (the largest) in the CWD
[[ -f $1 ]] && f=$1 || f=$(ls -S *.pdf 2>/dev/null | head -n 1)
# Make up title
t=${f%.pdf}
# Show the usage text if the argument is not a file, or if the chosen PDF
# has no table of contents
if [[ -n "$1" && "$1" != "$f" ]] || ! pdftk "$f" dump_data | grep -q "BookmarkTitle: "; then
    cat <<!
${0##*/}

Chunks a PDF file with a table of contents into separate PDF files by chapter.

Usage:
${0##*/} [filename.pdf]
or
${0##*/}
use the first listed PDF file in the current directory
!
    exit 1
fi

declare -a bookmarks
declare -a startpages
declare -a endpages
readarray -t bookmarks < <(pdftk "$f" dump_data | grep "BookmarkTitle: " | sed -e "s/.*: //" )
readarray -t startpages < <(pdftk "$f" dump_data | grep -2 "BookmarkTitle: " | grep "BookmarkPageNumber:" | sed -e "s/.*: //" )
lastpage=$(pdftk "$f" dump_data | grep "NumberOfPages: " | sed -e "s/.*: //" )

echo "Chunking '$f' into ${#startpages[@]} files:"

for ((i=0; i<${#startpages[@]}; i++)); do
    if [[ $i -lt $((${#startpages[@]} - 1 )) ]]; then
        endpages[$i]=$(( ${startpages[$(($i+1))]} - 1 ))
    else
        endpages[$i]=$lastpage
    fi
    echo "Assembling '$t - ${bookmarks[$i]}.pdf' from pages ${startpages[$i]} to ${endpages[$i]}..."
    pdftk "$f" cat ${startpages[$i]}-${endpages[$i]} output "$t - ${bookmarks[$i]}.pdf"
done

echo "Done"
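One thing to watch for, since each output file is named after its chapter: a chapter title that contains a "/" would be treated as a directory separator in the output file name. A possible workaround, not part of the script above, is to replace that character before building the file name. The function name safe_name is made up for this sketch.

```shell
#!/bin/bash
# safe_name TITLE   (hypothetical helper, not part of the script above)
# Replaces every "/" in a chapter title with "-" so that the title can
# be used as a single file name.
safe_name() {
    local title=$1
    printf '%s\n' "${title//\//-}"
}

safe_name "Input/Output"
```

In the script, the title could then be passed through this function wherever "${bookmarks[$i]}" is used to build the output file name.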

