How We Convert Plays from TXT or HTML to XML TEI

Frank Fischer edited this page May 20, 2019 · 6 revisions

Elevator Pitch

We take a play from one of our sources and convert it to TEI. For example: http://az.lib.ru/o/ostrowskij_a_n/text_0130.shtml was converted to https://raw.githubusercontent.com/dracor-org/rusdracor/master/tei/ostrovsky-les.xml.

Overview of our corpus: https://dracor.org/rus

We try to use parsing libraries such as Beautiful Soup, but since almost none of the source documents are well-formed, parsing alone can't get us all the way. Regular expressions or manual work help where parsing isn't applicable.
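As a rough illustration of the parsing step, here is a minimal sketch that pulls text lines out of HTML that is not well-formed. It swaps in the standard library's html.parser instead of Beautiful Soup so that it runs without extra dependencies. The class name, the function name and the choice of tags treated as line breaks are assumptions made for this example, not part of our actual scripts.

```python
from html.parser import HTMLParser


class TextLines(HTMLParser):
    """Collect text content, treating some tags as line breaks (assumed set)."""

    BREAK_TAGS = ("br", "p", "dd", "div")  # assumption: adjust per source

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Unclosed or stray tags are fine: we only react to start tags.
        if tag in self.BREAK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_lines(html_text):
    """Return the non-empty text lines of an HTML fragment."""
    parser = TextLines()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered at the end
    text = "".join(parser.chunks)
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Whatever such a pass cannot separate cleanly (speaker names glued to speech text, for instance) is left for regular expressions or manual work.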

The general TEI guidelines can be found here.

Sources

Our sources are:

  • Wikisource (ru.wikisource.org)
  • Библиотека Максима Мошкова (lib.ru)
  • Русская виртуальная библиотека (rvb.ru)
  • Интернет-библиотека Алексея Комарова (ilibrary.ru)
  • СовЛит (ruthenia.ru)

The plays we convert must be in modern orthography and must identify the print source that the scan is based on. In the TEI document, the bibliographic record goes into:

<bibl type="originalSource">
  <title>…</title>
</bibl>

Header

You can take the <teiHeader> from one of the other files (e.g., this one) and adjust it. Just add the title, subtitle and author; don't worry about the rest, which will be handled by our philologist task force. You can also delete the <particDesc> part, as it will be filled with character names in order of appearance once all the other work is finished.

Plays in Verse, Plays in Prose, Mixed Plays

The main distinction in the running text of a play is whether it is in verse or in prose. For verse, wrap each verse line in <l>…</l>; for prose, wrap each paragraph in <p>…</p>.
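A minimal sketch of this wrapping step, assuming the speech text has already been split into a list of strings (one verse line or one prose paragraph per item). The function name and the verse flag are made up for this example.

```python
from xml.sax.saxutils import escape


def wrap_speech_lines(lines, verse):
    """Wrap each item in <l> (verse) or <p> (prose), escaping &, < and >."""
    tag = "l" if verse else "p"
    return [f"<{tag}>{escape(line)}</{tag}>" for line in lines]
```

For example, wrap_speech_lines(["Что ж?"], verse=True) returns ["<l>Что ж?</l>"].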

Verses Distributed over Several Lines

Some verses are distributed over several lines. If (and only if) the source holds implicit information about such multi-line verses, make sure it is carried over into the TEI version. Use the attribute part="{I,M,F}" to mark the initial, middle or final segment of a verse. Example:

<sp who="#vorotynskij">
  <speaker>Воротынский</speaker>
  <l>А слушай, князь, ведь мы б имели право</l>
  <l part="I">Наследовать Феодору.</l>
</sp>
<sp who="#shujskij">
  <speaker>Шуйский</speaker>
  <l part="F">Да, боле,</l>
  <l part="I">Чем Годунов.</l>
</sp>
<sp who="#vorotynskij">
  <speaker>Воротынский</speaker>
  <l part="M">Ведь в самом деле!</l>
</sp>
<sp who="#shujskij">
  <speaker>Шуйский</speaker>
  <l part="F">Что ж?</l>
  <l>Когда Борис хитрить не перестанет,</l>
  <l>Давай народ искусно волновать;</l>
  <l>Пускай они оставят Годунова,</l>
  <l>Своих князей у них довольно; пусть</l>
  <l>Себе в цари любого изберут.</l>
</sp>
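The I/M/F assignment shown above can be sketched as follows, assuming you have already collected the fragments of one verse in reading order. The function name is hypothetical, and deciding which fragments belong together still depends on what the source tells you.

```python
from xml.sax.saxutils import escape


def split_verse_parts(fragments):
    """Return <l> elements for one verse given as a list of fragments.

    A single fragment is a normal verse line and gets no part attribute.
    Otherwise the first fragment is "I", the last is "F", the rest are "M".
    """
    if len(fragments) == 1:
        return [f"<l>{escape(fragments[0])}</l>"]
    last = len(fragments) - 1
    result = []
    for index, fragment in enumerate(fragments):
        part = "I" if index == 0 else "F" if index == last else "M"
        result.append(f'<l part="{part}">{escape(fragment)}</l>')
    return result
```

Each returned element then goes into the <sp> of the character who speaks that fragment.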

Speakers and IDs

Names of speakers are wrapped in the <speaker> element. The ID of the speaker is a lower-case transliteration of the speaker name; it goes into the who attribute of <sp> (sp = speech), preceded by "#". Example:

<sp who="#shujskij">
  <speaker>Шуйский</speaker>
  […]
</sp>

If the name of a character or group contains spaces, replace them with the underscore "_".
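A possible way to generate such IDs is sketched below. The transliteration table is an assumption: it reproduces the IDs used in the examples on this page (shujskij, vorotynskij, korion, lafler), but the mappings for letters that do not occur in those names are guesses and should be checked against existing files. The function name is hypothetical as well.

```python
import re

# Assumed table; only the letters seen in this page's examples are confirmed.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "ju", "я": "ja",
}


def speaker_id(name):
    """Lower-case, transliterate, and join words with underscores."""
    lowered = name.strip().lower()
    latin = "".join(TRANSLIT.get(char, char) for char in lowered)
    latin = re.sub(r"\s+", "_", latin)
    # Assumption: drop punctuation and anything else outside a-z, 0-9, "_".
    return re.sub(r"[^a-z0-9_]", "", latin)
```

The result is the bare ID; the who attribute is then written as "#" followed by that ID.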

Stage Directions

If a stage direction occurs right after the speaker name and before the speaker's first words, it stands alone:

<sp who="#korion">
  <speaker>Корион</speaker>
  <stage>(сидя)</stage>
  <l>С тех пор как я к тому все мысли обращаю,</l>
  <l>В лютейшей горести отраду ощущаю.</l>
  […]
</sp>

As a general rule, always wrap stage directions in <stage>…</stage> where they occur. If they are on separate lines in the source file, put them outside the verse lines or paragraphs of the speech text:

<l>Письмо, которое забыл отдать я вам.</l>
<stage>(В сторону.)</stage>
<l>Однако он теперь писать изволит сам.</l>

If they are located within a paragraph or a verse line, keep them inside it (and keep the parentheses if the source has them):

<p>Садись, батюшка, за стол-то... <stage>(Дочери.)</stage> Поди там, вынимай из печи-то, что есетко...</p>
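The inline case can be approximated with a regular expression. The sketch below assumes that every parenthesised span inside a speech is a stage direction, which will not hold for every play, so the output needs a manual check. The function name and the tag parameter are made up for this example.

```python
import re
from xml.sax.saxutils import escape

# Assumption: a parenthesised span without nested parentheses is a stage direction.
STAGE_RE = re.compile(r"\([^()]+\)")


def wrap_inline_stage(text, tag="p"):
    """Wrap text in <p> or <l> and mark parenthesised spans as <stage>."""
    escaped = escape(text)  # escape &, <, > before adding markup
    marked = STAGE_RE.sub(lambda m: f"<stage>{m.group(0)}</stage>", escaped)
    return f"<{tag}>{marked}</{tag}>"
```

The parentheses themselves are kept inside <stage>, as in the example above.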

Page Beginnings <pb>

If the source provides information on page beginnings, as rvb.ru does, convert it too and keep the page numbers:

<l>Теперь стал весь не тот: печален и уныл;</l>
<pb n="5"/>
<l>Жестокая тоска его тревожить стала;</l>

Building the <particDesc>

The <particDesc> (= participation description) is contained in the <teiHeader>. It should be empty before you start to work on a file. After you have finished all the encoding, you can build the <particDesc>, which is a list of characters in order of appearance. You can add gender information as in the other files, but it will be cross-checked by the philologists in our team, so don't spend too much effort on it.
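Collecting the characters in order of appearance can be automated once the speeches are encoded. The following sketch reads the who attributes with the standard library's ElementTree. The <listPerson>, <person> and <persName> elements are standard TEI, but the exact layout expected in our files is an assumption here, so compare the result with an existing file. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def speakers_in_order(tei_xml):
    """Return (id, speaker label) pairs in order of first appearance."""
    root = ET.fromstring(tei_xml)
    seen = {}  # dicts keep insertion order
    for sp in root.findall(".//{*}sp"):  # {*} matches any or no namespace
        label = sp.find("{*}speaker")
        name = (label.text or "").strip() if label is not None else ""
        # who may hold several pointers; each one gets the same label (crude).
        for ref in sp.get("who", "").split():
            seen.setdefault(ref.lstrip("#"), name)
    return list(seen.items())


def build_partic_desc(speakers):
    """Render a <particDesc> skeleton (layout is an assumption)."""
    lines = ["<particDesc>", "  <listPerson>"]
    for ident, name in speakers:
        lines.append(f"    <person xml:id={quoteattr(ident)}>")
        lines.append(f"      <persName>{escape(name)}</persName>")
        lines.append("    </person>")
    lines += ["  </listPerson>", "</particDesc>"]
    return "\n".join(lines)
```

Gender information is left out on purpose, since it is checked by the philologists anyway.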

Footnotes, etc.

If a source file contains footnotes, put them in a <note place="foot"> element at the position where they occur. Here is an example:

Лафлер (показавшись в ближних дверях). Il ne faut pas s'y frotter. {Не нужно здесь толочься (фр.).} (Прячется.)

… will be transformed into …

<sp who="#lafler">
  <speaker>Лафлер</speaker>
  <stage>(показавшись в ближних дверях).</stage>
  <p>Il ne faut pas s'y frotter.<note place="foot">Не нужно здесь толочься (фр.).</note></p>
  <stage>(Прячется.)</stage>
</sp>
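For sources that mark footnotes with curly braces, as in the example above, the conversion can be sketched with a regular expression. That other sources use the same convention is an assumption, and the function name is made up for this example.

```python
import re
from xml.sax.saxutils import escape

# Assumption: footnotes appear inline as {…} without nested braces.
FOOTNOTE_RE = re.compile(r"\s*\{([^{}]*)\}")


def convert_footnotes(text):
    """Replace {…} with <note place="foot">…</note>, dropping the space before it."""
    escaped = escape(text)  # escape &, <, > before adding markup
    return FOOTNOTE_RE.sub(r'<note place="foot">\1</note>', escaped)
```

Splitting off the speaker name and the stage directions is a separate step and is not covered by this sketch.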

Tools