PDF Batch Ingest Guide
Eben English edited this page Sep 12, 2019 · 5 revisions
NewspaperWorks provides functionality for batch ingest of issue-level PDF files via a command-line rake task.
To invoke the rake task, run the following command from the root directory of your application:
$ rake newspaper_works:ingest_pdf_issues -- --path=/path/to/your/pdf/batch
In addition to path, the rake task also accepts arguments for admin_set, depositor, and visibility, as in:
$ rake newspaper_works:ingest_pdf_issues -- --path=/path/to/your/pdf/batch --admin_set=admin_set/default --depositor=admin_user@example.com --visibility=open
When run, the rake task will:
- Create a NewspaperTitle object for the publication represented in the batch
- Iterate over the directories in the batch, creating a NewspaperIssue object for each PDF file
- Split the PDF into constituent pages, creating a NewspaperPage object for each
- Perform page-level OCR and word-coordinate analysis of the page text
- Attach existing page-level derivatives (ALTO, PDF, JSON, etc.) to the NewspaperPage objects
- Index OCR text to Solr for full-text searching
- Create a word-coordinate JSON derivative file to facilitate page-image search hit highlighting
- Add metadata to the created objects based on the directory and file names:
  - LCCN
  - publication title
  - place of publication
  - publication date
  - edition number
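As a rough illustration of how a batch directory maps onto these steps, the sketch below reads the LCCN from the directory name and collects the PDF files found beneath it. It is not the rake task's own code: summarize_batch and BatchSummary are hypothetical names, and the recursive search is an assumption made so the sketch works whether or not the batch contains subdirectories.

```ruby
# Hypothetical helper, not part of NewspaperWorks.
# Holds what can be read from a batch directory before any ingest happens.
BatchSummary = Struct.new(:lccn, :pdf_paths)

def summarize_batch(batch_path)
  raise ArgumentError, "not a directory: #{batch_path}" unless File.directory?(batch_path)

  # The batch directory name is taken to be the LCCN of the publication.
  lccn = File.basename(File.expand_path(batch_path))

  # Assumption: search recursively, so PDFs in the top level or in
  # subdirectories are both found. Sorting gives a stable, date-ordered list
  # when files follow the YYYYMMDDEE.pdf naming convention.
  pdf_paths = Dir.glob(File.join(batch_path, '**', '*.pdf')).sort

  BatchSummary.new(lccn, pdf_paths)
end
```

Running a check like this before the ingest can be a quick way to confirm that the directory name and file list look as expected.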
Notes:
- If a NewspaperTitle object with the LCCN in the batch already exists, objects will be associated with the existing NewspaperTitle.
- If no admin_set is specified, the default AdminSet (admin_set/default) will be used.
- If no depositor is specified, objects will have a depositor value of User.batch_user.user_key by default.
- If visibility is not specified, objects will have a visibility value of open by default.
- A log file of the batch process will be output to your application's log/ingest.log.
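The defaults described in these notes can be pictured with a small option-parsing sketch. This is an illustration only, not the task's actual implementation: parse_ingest_options is a hypothetical name, and because User.batch_user.user_key is only available inside the application, the depositor default is passed in by the caller here.

```ruby
require 'optparse'

# Defaults as documented in the notes above.
DEFAULT_ADMIN_SET = 'admin_set/default'
DEFAULT_VISIBILITY = 'open'

# Hypothetical helper, not part of NewspaperWorks.
# argv is the list of arguments that follow the "--" separator.
# default_depositor stands in for User.batch_user.user_key.
def parse_ingest_options(argv, default_depositor:)
  options = {
    admin_set: DEFAULT_ADMIN_SET,
    depositor: default_depositor,
    visibility: DEFAULT_VISIBILITY
  }

  parser = OptionParser.new do |opts|
    opts.on('--path=PATH') { |value| options[:path] = value }
    opts.on('--admin_set=ID') { |value| options[:admin_set] = value }
    opts.on('--depositor=USER_KEY') { |value| options[:depositor] = value }
    opts.on('--visibility=VALUE') { |value| options[:visibility] = value }
  end
  parser.parse(argv)

  options
end
```

For example, calling parse_ingest_options(['--path=/path/to/your/pdf/batch'], default_depositor: 'batch_user@example.com') leaves admin_set and visibility at their defaults, while any argument supplied explicitly replaces the corresponding default.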
The ingest script makes the following assumptions:
- There MUST be a main/home parent directory that contains all PDF files.
- The name of this main/home directory MUST correspond to the LCCN for the publication.
- Each PDF file MUST correspond to a single edition of a single issue.
- Each PDF file MUST be named according to the following convention: YYYYMMDDEE.pdf, where:
  - YYYY represents the 4-digit year
  - MM represents the 2-digit month
  - DD represents the 2-digit day
  - EE represents the 2-digit edition number (default is 01)
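To make the naming convention concrete, here is a minimal sketch of a check you could run on file names before starting an ingest. The helper name parse_issue_filename and the returned hash keys are hypothetical and are not part of NewspaperWorks; the sketch only encodes the YYYYMMDDEE.pdf pattern described above and additionally rejects impossible calendar dates.

```ruby
require 'date'

# Four digits for the year, then two each for month, day, and edition.
ISSUE_PDF_PATTERN = /\A(\d{4})(\d{2})(\d{2})(\d{2})\.pdf\z/

# Hypothetical helper, not part of NewspaperWorks.
# Returns a hash with the publication date and edition number,
# or nil if the file name does not follow the convention.
def parse_issue_filename(filename)
  match = ISSUE_PDF_PATTERN.match(File.basename(filename))
  return nil unless match

  year, month, day, edition = match.captures.map(&:to_i)

  # Reject dates that cannot exist, such as February 30.
  return nil unless Date.valid_date?(year, month, day)

  { publication_date: Date.new(year, month, day), edition_number: edition }
end
```

With this sketch, a file named 1853060401.pdf would be read as the first edition of the issue dated June 4, 1853, while a name such as 18530604.pdf (missing the edition digits) would return nil.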
For an example of a PDF batch, see newspaper_works_fixtures.