TIFF or JP2 Batch Ingest Guide
Eben English edited this page Sep 25, 2019 · 3 revisions
NewspaperWorks provides functionality for batch ingest of page-level TIFF or JP2 files (without corresponding ALTO or OCR files) via a command-line rake task.
To invoke the rake task, run the following command from the home directory of your application:
$ rake newspaper_works:ingest_issues -- --path=/path/to/your/pages/batch
In addition to path, the rake task also accepts arguments for admin_set, depositor, and visibility, as in:
$ rake newspaper_works:ingest_issues -- --path=/path/to/your/pages/batch --admin_set=admin_set/default --depositor=admin_user@example.com --visibility=open
When run, the rake task will:
- Create a NewspaperTitle object for the publication represented in the batch
- Iterate over the directories in the batch, creating a NewspaperIssue object for each directory
- Iterate over the files in each directory, creating a NewspaperPage object for each file
- Create a TIFF primary file for each page, if the files are in JP2 format
- Perform page-level OCR and word-coordinate analysis of the page text
- Attach existing page-level derivatives (ALTO, PDF, JSON, etc.) to the NewspaperPage objects
- Index OCR text to Solr for full-text searching
- Create a word-coordinate JSON derivative file to facilitate page-image search hit highlighting
- Compile an issue-level PDF file from page files and attach it as the primary file to each NewspaperIssue object
- Add metadata to the created objects based on the directory names:
  - LCCN
  - publication title
  - place of publication
  - publication date
  - edition number
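The directory traversal described above can be sketched in plain Ruby. This is only an illustration, not the NewspaperWorks implementation: the method name scan_batch, the list of accepted file extensions, and the use of sorted file names as page order are all assumptions made for the example.

```ruby
# Hypothetical sketch: list the issues and page files in a batch directory.
# The accepted extensions below are an assumption, not taken from NewspaperWorks.
PAGE_EXTENSIONS = %w[.tif .tiff .jp2].freeze

def scan_batch(batch_path)
  # Every subdirectory of the batch directory is treated as one issue.
  issue_dirs = Dir.children(batch_path).sort.select do |entry|
    File.directory?(File.join(batch_path, entry))
  end

  issues = issue_dirs.map do |dir|
    # Page files are taken in sorted file-name order (an assumption about
    # how "order of the files" is determined).
    pages = Dir.children(File.join(batch_path, dir)).sort.select do |file|
      PAGE_EXTENSIONS.include?(File.extname(file).downcase)
    end
    { issue: dir, pages: pages }
  end

  # The batch directory name stands in for the LCCN.
  { lccn: File.basename(batch_path), issues: issues }
end
```

Calling scan_batch on a batch directory before running the rake task gives a quick preview of which issues and pages would be picked up under these assumptions.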
Notes:
- If a NewspaperTitle object with the LCCN in the batch already exists, objects will be associated with the existing NewspaperTitle.
- If no admin_set is specified, the default AdminSet (admin_set/default) will be used.
- If no depositor is specified, objects will have a depositor value of User.batch_user.user_key by default.
- If visibility is not specified, objects will have a visibility value of open by default.
- A log file of the batch process will be output to your application's log/ingest.log.
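The defaulting rules in these notes amount to merging the supplied arguments over a set of defaults. A minimal sketch, assuming a hypothetical effective_options helper that is not part of NewspaperWorks; the depositor default is shown as a placeholder symbol because User.batch_user.user_key can only be resolved inside a running application.

```ruby
# Hypothetical illustration of the documented defaults.
DEFAULT_OPTIONS = {
  admin_set: "admin_set/default",
  depositor: :batch_user_key, # placeholder for User.batch_user.user_key
  visibility: "open"
}.freeze

def effective_options(given)
  # Arguments that were passed override the defaults; nil values are ignored.
  DEFAULT_OPTIONS.merge(given.compact)
end
```

For example, passing only a path leaves admin_set, depositor, and visibility at the default values listed above.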
The ingest script makes the following assumptions:
- There MUST be a main/home parent directory that contains all TIFF or JP2 page files.
- The name of this main/home directory MUST correspond to the LCCN for the publication.
- Each subdirectory of the main/home directory MUST correspond to a single edition of a single issue.
- An issue subdirectory MUST be named according to the following convention: YYYYMMDDEE, where:
  - YYYY represents the 4-digit year
  - MM represents the 2-digit month
  - DD represents the 2-digit day
  - EE represents the 2-digit edition number (default is 01)
- Each file in an issue subdirectory MUST correspond to a single page of a single issue.
- The order of the files MUST correspond to the order of the pages.
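To check a batch against the YYYYMMDDEE naming convention before ingest, a small helper along these lines may be useful. It is a sketch only: parse_issue_dirname and ISSUE_DIR_PATTERN are hypothetical names, and the ingest task's own parsing may differ.

```ruby
require "date"

# Hypothetical helper: parse an issue directory name of the form YYYYMMDDEE.
ISSUE_DIR_PATTERN = /\A(\d{4})(\d{2})(\d{2})(\d{2})\z/

def parse_issue_dirname(name)
  match = ISSUE_DIR_PATTERN.match(name)
  return nil unless match # wrong length or non-digit characters

  year, month, day, edition = match.captures.map(&:to_i)
  return nil unless Date.valid_date?(year, month, day) # e.g. February 30

  { date: Date.new(year, month, day), edition: edition }
end
```

A directory named 2019092501 would parse as the first edition of the issue dated September 25, 2019, while a name that is too short or contains an impossible date returns nil.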
For an example of a TIFF or JP2 batch, see newspaper_works_fixtures.