RDA Conversion #5

Open
afrozenpeach opened this issue May 9, 2016 · 13 comments

Comments

@afrozenpeach
Owner

Add feature to convert records from AACR2 to RDA

@afrozenpeach
Owner Author

Looking for users with knowledge of proper RDA to test this feature. If you have files you are interested in converting to RDA and can verify the conversion afterwards I would love to hear from you!

@zemkat

zemkat commented May 12, 2016

I have a few suggestions, based on sending some AACR2 records through your conversion process:

  1. A lot of "AACR2" records these days are actually hybrids: formerly pure AACR2 records that have had some RDA elements added to them (this is also happening in OCLC). You might check if elements already exist before adding them to the records, or check if a record is already marked RDA in the 040 before making any changes to it, since a lot of collections are mixed.
  2. Subfield order is weird for 040: ‡a, ‡b, ‡e, ‡c, ‡d, so when you're adding ‡e rda, it needs to go before ‡c and ‡d
  3. Some ISBD/catalog card silliness: If the 300 field ends in "cm." (or "mm." or some symbol that doesn't have a period inherently part of it like "in." does), that period should be removed unless there is a 490 in the record.
  4. Quoted notes in 500 fields used to list their source like --T.p. or --T.p. verso or --Half t.p.. I don't know if it's specifically an RDA rule, but now these are more frequently spelled out like --Title page or --Title page verso or --Half title page.
  5. A copyright date should go into its own 264 with second indicator 4, like: 264 _4 ‡c ©2007 (the ‡c will be its only subfield, and there's no period after it). If that's the only date in the 264, you can replace it there with a supplied date (in brackets): 264 _1 ‡a Washington, D.C. : ‡b National Academies Press, ‡c [2007]
  6. When publication data was unknown in AACR2, we used Latin abbreviations to indicate this. If you see "S.l." in 264 ‡a, that should be [Place of publication not identified]. Similarly, in 264 ‡b "s.n." should be [Publisher not identified]. There are other abbreviations used like [et al.], [sic], etc., but what they should be replaced with may be hard to recover from the record content. Another big barrier to a perfect conversion: so much that we used to record (like always "Calif." for California in place of publication) we now transcribe ("Calif.", "Cal.", "CA", "California", use whatever's on the piece) but I think it's nice to automate what you can, like you're doing!
  7. Also you may notice in a lot of old records, you have one big set of brackets that span the whole field (or set of fields even), like: 260 __ ‡a [Place : ‡b Publisher, ‡c Year] But these fields should each have their own set of brackets: 260 __ ‡a [Place] : ‡b [Publisher], ‡c [Year] This is actually an ISBD change, not an RDA change, but it hit at about the same time. It made more sense on catalog cards when people would be looking at the whole thing at once, but makes less sense when all fields are indexed separately.
  8. The GMD (245 ‡h) should probably be removed, as it is part of AACR2 but not RDA. (Some libraries are still keeping this for the moment until their OPACs figure out what to do with 336/337/338, but it's also going to start disappearing from OCLC records too, so its days are numbered.) Removing it is a little tricky, as you'll want to keep any ISBD punctuation in the ‡h that precedes the next subfield and keep it in ‡a, so:
    245 00 ‡a Cancer in elderly people ‡h [electronic resource] : ‡b workshop proceedings
    should become
    245 00 ‡a Cancer in elderly people : ‡b workshop proceedings
    It may help you determine 336/337/338 as well.
  9. 336 (Content Type) is a required field, but did not end up in any records I tried. You may be able to determine this from other parts of the record. For example, I'm thinking that you can conclude that you've got content type "text" (txt) if the 300 only says how many "pages" or "volumes" you have, but I'd have to check the rules more carefully to be sure. Other types tend to qualify that, like "1 score (40 pages)" or "1 atlas (80 pages)". GMD should also help, but it's weird because it's really a mix of 336/337/338, so it might give you a clue to only some of those.
  10. RDA carrier types (338) are grouped by media type (337) and the codes that appear in their ‡b reflect this. So if you're using 337 unmediated (n), the 338 should be something from the unmediated carriers list (code should start with n) (reference here) You might also have good luck determining this from the 300 or the fixed fields.
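
For item 8, here is a minimal sketch of removing the GMD while keeping the ISBD punctuation. It is written in Python for brevity, not in the project's C#, and the field representation (a list of `(code, value)` pairs) and the name `remove_gmd` are assumptions made for illustration, not part of the tool's API.

```python
import re

# Matches a bracketed GMD such as "[electronic resource] :" and captures
# whatever punctuation follows the closing bracket.
GMD = re.compile(r"^\s*\[[^\]]*\]\s*(.*)$")

def remove_gmd(subfields):
    """Drop 245 ‡h, moving its trailing ISBD punctuation to the previous subfield."""
    result = []
    for code, value in subfields:
        if code != "h":
            result.append((code, value))
            continue
        match = GMD.match(value)
        trailing = match.group(1).strip() if match else ""
        if trailing and result:
            prev_code, prev_value = result[-1]
            # "." and "," attach directly; " :", " /", " =" keep a leading space.
            sep = "" if trailing[0] in ".," else " "
            result[-1] = (prev_code, prev_value.rstrip() + sep + trailing)
    return result

field_245 = [
    ("a", "Cancer in elderly people"),
    ("h", "[electronic resource] :"),
    ("b", "workshop proceedings"),
]
print(remove_gmd(field_245))
```

On the 245 from the example above, this yields ‡a "Cancer in elderly people :" followed by ‡b "workshop proceedings". The removed ‡h value could be kept aside as a hint for 336/337/338.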

Please let me know if you have any questions about any of these!

@afrozenpeach
Owner Author

This is all really great stuff! Thanks! Lots of my issues come from having a limited set of records to do testing on. For much of the abbreviation fixing, I'm hoping to depend on users to help find places that need more work. And of course, as with most RDA conversions, there's only so much an automated process can do. I'm not trying to be a total and complete solution, but to take away as much of the hard work as possible.

For the 040 subfield order (item 2), I need to do some extra work. Apparently it's the only place (that I'm aware of at least - are there more places this matters?) that doesn't follow an exact alphabetic -> numeric sorting order. Looks like manual sorting is going to be a feature I need to implement sooner rather than later. Honest question: on a scale of 1-10, how serious would you consider subfields being "out of order" relative to the standard, with 1 being "don't care" and 10 being "refuse to use unless this is fixed"?
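
One way to handle the 040 without a general sorting feature is to insert ‡e at a fixed position and leave every existing subfield where it is. The sketch below is Python rather than the project's C#, and the `(code, value)` list representation and the name `add_rda_to_040` are assumptions for illustration only.

```python
def add_rda_to_040(subfields):
    """Return 040 subfields with ‡e rda added before the first ‡c or ‡d."""
    subfields = list(subfields)
    # Records already marked RDA (including hybrids) are left untouched.
    if any(code == "e" and value.strip().lower() == "rda" for code, value in subfields):
        return subfields
    # Existing subfields are never re-sorted; only the insertion point is chosen.
    for index, (code, _) in enumerate(subfields):
        if code in ("c", "d"):
            return subfields[:index] + [("e", "rda")] + subfields[index:]
    return subfields + [("e", "rda")]

field_040 = [("a", "DLC"), ("b", "eng"), ("c", "DLC"), ("d", "OCoLC")]
print(add_rda_to_040(field_040))
```

For the sample field this gives the order ‡a, ‡b, ‡e, ‡c, ‡d described in item 2.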

The 264 stuff really confused me, and this build is my 2nd attempt at it. If there are multiple copyright years, should each one go in its own 264 _4?

The rest is all excellent tips. I think it makes enough sense, and will work on implementing those additions. I'll post again when I've made some progress for more testing! Thanks a million for your help!

@afrozenpeach
Owner Author

afrozenpeach commented May 13, 2016

I just pushed out another update with fixes for all but 1, 2, and 7, assuming I understood your other issues correctly.

@afrozenpeach
Owner Author

afrozenpeach commented May 13, 2016

In theory 1 should be fixed now as well.

@zemkat

zemkat commented May 13, 2016

Great! I'm out of the office today but will test again next week.

There are many fields where the subfields are not in alphabetical order, like subject headings:

650 _0 ‡a Dogs ‡z Kentucky ‡x History.

or subfields that are repeated:

505 00 ‡t Cars / ‡r Ella Smith -- ‡t Hats / ‡r Jane Jones -- ‡t Trees / ‡r Mary Brown.

It's super important that the subfield order currently in a field is preserved (10). As for the order of subfields you're adding to records (like ‡e rda), that's less important (maybe a 3-4 to me), and some other editors don't support that either, like normalization rules in Alma. MARC tag order is also not strictly numerical.

For copyright dates, typically only the latest one is recorded so there should only be one. (I'll check and see if there are exceptions in AACR2 or RDA)

@afrozenpeach
Owner Author

Part 2 is fixed now too, which only leaves 7... which is more difficult and I'm still thinking of ways to fix.
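
One possible approach to item 7 is to track whether a bracket opened in an earlier subfield is still open, and wrap each subfield inside that span on its own. This is only a sketch in Python (not the project's C#); the `(code, value)` representation and the function names are hypothetical, and it assumes at most one spanning bracket pair per field.

```python
import re

# Trailing ISBD punctuation (" :", ",", " ;", " /", " =") stays outside the brackets.
TRAILING_PUNCT = re.compile(r"^(.*?)(\s*[:;,/=])?$")

def _wrap(value, closing):
    text = value.strip()
    if closing:
        # Whatever follows the closing bracket (e.g. a final period) stays outside.
        body, _, tail = text.rpartition("]")
    else:
        body, tail = TRAILING_PUNCT.match(text).groups()
        tail = tail or ""
    body = body.strip().lstrip("[").strip()
    return f"[{body}]{tail}"

def split_spanning_brackets(subfields):
    """Give each subfield its own brackets when one pair spans several subfields."""
    result = []
    inside = False  # True while a bracket opened in an earlier subfield is open
    for code, value in subfields:
        balance = value.count("[") - value.count("]")
        if not inside and balance > 0:
            inside = True
            value = _wrap(value, closing=False)
        elif inside:
            closing = balance < 0
            value = _wrap(value, closing)
            inside = not closing
        result.append((code, value))
    return result

field_260 = [("a", "[Place :"), ("b", "Publisher,"), ("c", "Year]")]
print(split_spanning_brackets(field_260))
```

Fields whose subfields are already bracketed individually pass through unchanged, since each one is balanced on its own.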

@afrozenpeach
Owner Author

Is there any chance that you've been able to look at this again since I did more fixes?

@zemkat

zemkat commented May 23, 2016

Hello! I'm looking for the new version in GitHub -- is there a zip or exe somewhere that I'm not seeing?

@afrozenpeach
Owner Author

When on the main GitHub page, click on the releases link.


Which leads to: https://github.com/frozen-solid/CSharp_MARC/releases

It's definitely one of the things I dislike about GitHub... I wish the releases link was more obvious/apparent. I'll probably put a link on my readme.md to make it easier soon.

Sorry about that!


@zemkat

zemkat commented May 25, 2016

I really should have hung on to the MARC record examples where I spotted the issues for testing!

1-2. Looks fixed!

  3. I'm still seeing the period show up. If I started with:

300 __ ‡a xx, 237 p. ; ‡c 23 cm.
490 _1 ‡a Signale : modern German letters, cultures, and thought

You should end up with:

300 __ ‡a xx, 237 p. ; ‡c 23 cm
490 _1 ‡a Signale : modern German letters, cultures, and thought

This one really is a picky difference, and a lot of people don't care, but it would be nice!

  5. Hmm, I tried a record that started with:
    260 __ ‡a Stanford, California : ‡b Stanford University Press, ‡c 2012, ©2012.

This should have ended up with

264 _1 ‡a Stanford, California : ‡b Stanford University Press, ‡c 2012.
264 _4 ‡c ©2012

That first one is the publication year, the second one is the copyright, so now they're broken up into two separate fields. 264 _1 is "publication information" and 264 _4 is "copyright date".
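
The split described here could be sketched as follows. This is Python, not the project's C#; the `(code, value)` representation and the name `split_260` are hypothetical. Treating an AACR2-style "c2012" the same as "©2012" is an assumption on top of what the thread states, as is dropping the final period after a bracketed date.

```python
import re

# Copyright date written either as "©2012" or, AACR2-style, as "c2012" (assumed).
COPYRIGHT = re.compile(r"(?:©|\bc)\s*(\d{4})")

def split_260(subfields):
    """Return (264 _1 subfields, 264 _4 subfields or None) from 260 subfields."""
    publication = []
    copyright_field = None
    for code, value in subfields:
        match = COPYRIGHT.search(value) if code == "c" else None
        if not match:
            publication.append((code, value))
            continue
        year = match.group(1)
        copyright_field = [("c", f"©{year}")]  # only subfield, no final period
        remaining = (value[:match.start()] + value[match.end():]).strip(" ,.")
        # If the copyright date was the only date, supply it in brackets (item 5).
        date = remaining if remaining else f"[{year}]"
        publication.append(("c", date if date.endswith("]") else date + "."))
    return publication, copyright_field

field_260 = [
    ("a", "Stanford, California :"),
    ("b", "Stanford University Press,"),
    ("c", "2012, ©2012."),
]
pub, cop = split_260(field_260)
print(pub)
print(cop)
```

For the Stanford record, the first list carries ‡c "2012." for the 264 _1 and the second carries ‡c "©2012" for the 264 _4.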

I've got to run but I'll check the rest when I can, and I'm happy to answer any questions.

In the meantime, I'm giving a talk at a regional tech services conference on Thursday about catalogers and developers working together! Do you mind if I mention your software as an example?

@afrozenpeach
Owner Author

I'd love to have my software talked about! I don't mind in the least.

I'll see what I can do about improving the last couple of issues you found, but I too am at a conference this week. Development has slowed a bit because of that.

Thanks again for all of your help! I really really appreciate it.


@afrozenpeach
Owner Author

The 300c should be fixed now.

I'm not quite sure about how to handle the 264 fix, mainly because the copyright symbol isn't often included unless it's RDA. I'll have to do some thinking.
