I got involved in a metadata cleanup project following the RenouVaud project, in which some former RERO member libraries migrated from VTLS Virtua to Ex Libris Alma. Some local classification data was lost during the migration. This project aims to extract selected classification authority records from the former system and map them to the model used by Alma.
The main script is called `riponne-main.py` and can be run from the command line as follows:

```
python riponne-main.py -i <SOURCE FILE(S)> -o <OUTPUT FILE> -m <MAPPING>
```
Where:

- `<SOURCE FILE(S)>` (required) is the path to one or more files in MARCXML format on which to run the script. A wildcard character can be used to select multiple files.
- `<OUTPUT FILE>` (required) is the path to the file that will be written as output.
- `<MAPPING>` (required) is one of `musi`, `musg`, `laf`, `vddoc`, `BCURmu`, `BCURpt` or `BCURcg` (the vocabularies to be mapped).
Log information will be output to the console. If you wish to store it in a log file, simply redirect the output to a file (see the example below).
Example:

```
python riponne-main.py -i source-files/*.xml -o output/laf.xml -m laf >> log/20210105-laf.log
```
Python version 3.6 or higher is required. This was tested with version 3.8.5.
The following libraries are used. Of these, only pymarc is not part of the Python standard library, so you may need to install it (e.g. with `pip install pymarc`) before running the script:
- pymarc (for handling MARC records)
- argparse (for parsing arguments)
- re (for pattern-matching using regular expressions)
- datetime (for tracking run time)
These are my personal notes while working on this code. Feel free to disregard.
Modified the script to map `019__$a` fields to `680__$i` at the request of the client. Also corrected two small cataloguing inconsistencies in the source files that were discovered by analyzing the logs.
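For reference, a minimal pymarc sketch of that field mapping might look like the following; the indicators and the choice to keep the original `019` field are assumptions, and the flat subfield list is the pymarc 4.x style:

```python
from pymarc import Field

def map_019_to_680(record):
    """Copy each 019 $a value into a new 680 $i (illustrative sketch only)."""
    for source in record.get_fields("019"):
        for value in source.get_subfields("a"):
            # pymarc 4.x flat subfield list: [code, value, code, value, ...]
            record.add_field(Field(tag="680", indicators=[" ", " "],
                                   subfields=["i", value]))
```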
Did a fresh run on all vocabularies and delivered new mapped files.
Modified the logging format to be more readable; fields that are skipped by design are no longer logged. Also, the log is now in French.
Added another failsafe for records without a `172__$a` field and ran the vddoc mapping successfully.
The vddoc routine is run twice for each file, once looking for `vddoc` and once for `vddoc-la`, so it takes twice as long to process (about 140 seconds instead of roughly 70). Not super efficient and it could be done better, but it works. Memory usage might become an issue when running the function on a larger number of records.
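One possible improvement, sketched below under the assumption that the vocabulary code is carried in `172__$2`, would be to read each file once and split the records into the two vocabularies in a single pass:

```python
from pymarc import parse_xml_to_array

def split_vddoc(path):
    """Separate vddoc and vddoc-la records in one pass over a MARCXML file."""
    vddoc, vddoc_la = [], []
    for record in parse_xml_to_array(path):
        codes = [c for f in record.get_fields("172") for c in f.get_subfields("2")]
        if "vddoc-la" in codes:
            vddoc_la.append(record)
        elif "vddoc" in codes:
            vddoc.append(record)
    return vddoc, vddoc_la
```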
Added more inline documentation and cleaned up code a little.
Corrected the CLASBCUR routine that wasn't storing `153__$a` properly when classification strings were missing.
Also found out, when trying to merge a few output files, that the XSLT merge isn't doing what we want (merging records). But since the tree is quite simple, it's easy to merge them with a few quick bash lines:
```
$ sed '$d' firstfile.xml > merged.xml                 # everything except the closing tag
$ sed '$d' middlefile.xml | tail -n +3 >> merged.xml  # also drop the first two lines (declaration and opening tag)
$ tail -n +3 lastfile.xml >> merged.xml               # the last file keeps its closing tag
```
This does the trick, but it would be more elegant to write a little script to do it in one fell swoop.
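A rough Python sketch of what such a script could look like, assuming pymarc is used for reading and writing (file names are placeholders):

```python
from pymarc import XMLWriter, parse_xml_to_array

def merge_marcxml(paths, out_path):
    """Merge several MARCXML files into a single collection file."""
    writer = XMLWriter(open(out_path, "wb"))
    for path in paths:
        for record in parse_xml_to_array(path):
            writer.write(record)
    writer.close()  # writes the closing collection tag and closes the file

# e.g. merge_marcxml(["output/musg0.xml", "output/musg1.xml"], "output/musg-merged.xml")
```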
TO DO:
- Write a better way to merge XMLs
Rewrote the main function using argparse in order to accept multiple input files and loop through them. This way, output from multiple input files is stored in the same output file.
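A minimal sketch of what that argument handling might look like (the long option names and help texts are assumptions; the short flags mirror the usage shown above):

```python
import argparse

parser = argparse.ArgumentParser(description="Map classification authority records for Alma")
parser.add_argument("-i", "--input", nargs="+", required=True,
                    help="one or more MARCXML source files (shell wildcards expand to several files)")
parser.add_argument("-o", "--output", required=True,
                    help="file to write the mapped records to")
parser.add_argument("-m", "--mapping", required=True,
                    choices=["musi", "musg", "laf", "vddoc", "BCURmu", "BCURpt", "BCURcg"],
                    help="vocabulary to be mapped")
args = parser.parse_args()

for source_file in args.input:  # loop through every input file
    ...  # process the file and collect output records
```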
Fixed some faulty logic that wasn't mapping temporary (short) records well enough. Added some processing of the LDR (position 17 is set to `w` for full records and `o` for temporary/short records).
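The LDR tweak might look something like this sketch, assuming the leader behaves like a 24-character string (as in pymarc 4.x); `is_temporary` is a hypothetical flag:

```python
def set_completeness_code(record, is_temporary):
    """Set LDR position 17 to 'o' for temporary/short records and 'w' otherwise."""
    code = "o" if is_temporary else "w"
    record.leader = record.leader[:17] + code + record.leader[18:]
```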
Enhanced logging functionality to track all skipped fields and subfields.
Did some stress testing and ran all mappings for the client to check.
TO DO:
- Write usage documentation
- Add failsafe for records without a 172__$a
- Run vddoc mapping
Completed the mapping function, including BCUR regex logic. Trial run on BCURmu successful.
More stress tests required.
TO DO:
- Log unmapped subfields
- Complete inline documentation
- Investigate if merging resulting files can happen in Python (accept multiple input files)
Completed most of the mapping function.
TO DO:
- Map remaining 572s - find a way to copy field with all subfields
- Check if all possible subfields are taken care of in 153__$j concatenation
- Add failsafe for 153__$a in case there are no 572s
- Find out what needs to happen to LDR
- Log all unmapped fields
- Less permissive error catching - log all errors
Need to do some stress-testing too.
Added logic for records without `172__$a`.
Successfully ran the processing routine for `musg` on all `temp*` source files.
To merge the resulting XML files, I used this XSL Transform from Oliver Becker (source).
Usage (I need to find a better way to call Java 8, which is needed by the latest version of Saxon):

```
/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -jar ../SaxonHE10-3J/saxon-he-10.3.jar output/musg0.xml XSLT/merge.xslt with=musg1.xml > musg01.xml
```
Come to think of it, this could be done in Python directly as well, I suppose.
Started playing with a sample file. Able to read MARCXML using pymarc, check against `172__$a` and write the file.
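A minimal sketch of that read/check/write cycle (file names and the exact check are assumptions for illustration):

```python
from pymarc import XMLWriter, parse_xml_to_array

records = parse_xml_to_array("sample.xml")      # read MARCXML into pymarc records
writer = XMLWriter(open("checked.xml", "wb"))
for record in records:
    # keep only records that have at least one 172 field with a $a subfield
    if any(f.get_subfields("a") for f in record.get_fields("172")):
        writer.write(record)
writer.close()
```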
Running small function tests in the walkthrough.ipynb IPython notebook, then moving on to riponne-main.py for processing.
XML output files can be cleaned up (i.e. nicely indented etc.) by using `xmllint -format -recover outfile.xml > outfile-clean.xml`.
TO DO:
- Add logic for records without `172__$a`
- Add BCUR switching logic
- Start work on mapping function
After trying to write my own XSL transform, I realized an easy way to get a list of all MARC tags in the source file is to use MarcEdit:
- Convert MARCXML to MARC (edit: this is not actually necessary, MarcEdit can open MARCXML natively)
- Run the Field Count report in MarcEdit
Sample:
| Field | Subfield | In Records | Total |
|---|---|---|---|
| 000 | | 5 | 5 |
| 001 | | 5 | 5 |
| 005 | | 5 | 5 |
| 008 | | 5 | 5 |
| 035 | | 5 | 5 |
| | $a | | 5 |
| 039 | | 5 | 5 |
| | $d | | 2 |
| | $a | | 3 |
| | $c | | 2 |
| | $b | | 3 |
| | $y | | 5 |
| | $z | | 5 |
| 040 | | 5 | 5 |
| | $a | | 5 |
| | $b | | 1 |
| 072 | | 1 | 1 |
| | $a | | 1 |
| 172 | | 5 | 5 |
| | $e | | 1 |
| | $a | | 5 |
| | $d | | 1 |
| | $2 | | 5 |
| 572 | | 3 | 8 |
| | $a | | 8 |
| | $x | | 3 |
| | $2 | | 8 |
| | $v | | 7 |