TEI - Text Encoding Initiative

Using manuscripts and inscriptions as objects of research

How can pages be turned into usable and reusable research materials?

  • Digitization
  • Conversion to text
  • Encoding text
  • Providing text and images together

Digitization

  • Images of Manuscript
  • TIFF or PNG (for high-resolution image quality)
  • Conversion to text (Getting Text)
  • Is it handwritten?
      • Transcribe (use tools like [From the Page](https://fromthepage.com/) or [Transkribus](https://readcoop.eu/transkribus/))
  • Is it typed?
      • OCR (Optical Character Recognition)
      • OCR for typescripts
          • Online systems
          • Phone apps
          • Commercial systems
              • ABBYY
              • OmniPage
          • OCR for handwriting (Handwritten Text Recognition)
              • [Amazon Textract](aws.amazon.com/textract)
              • [Online OCR](www.onlineocr.net)
              • [Transkribus NN](www.transkribus.eu/Transkribus)
          • Text processors
              • MS Word (no structure, proprietary)
              • Notepad++ (for Windows)
              • BBEdit (for Apple)
          • [Getting started with Transcription](https://tinker.edu.au/resources/recipe/getting-started-with-transcription-from-the-page/)
  • Why text and page image together?
      • Presenting the original image together with the transcript is a more robust research method.
      • It allows readers to verify the transcript, and perhaps improve it.
      • It may reveal other aspects of the document you didn't notice.
      • It makes the document available for future research.
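
One simple way to keep each page image next to its transcript is to match files by name. The sketch below is only an illustration: it assumes a hypothetical layout in which every scan (for example `page_001.tif`) has a transcript with the same file stem (`page_001.txt`). The function name `pair_pages` and all file names are made up for this example.

```python
from pathlib import PurePath

# Image formats mentioned above (TIFF and PNG).
IMAGE_SUFFIXES = {".tif", ".tiff", ".png"}


def pair_pages(filenames):
    """Pair each page image with the transcript that shares its file stem.

    Returns a list of (image, transcript) tuples sorted by stem; the
    transcript is None when no matching .txt file exists yet.
    """
    images = {}
    transcripts = {}
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            images[path.stem] = name
        elif suffix == ".txt":
            transcripts[path.stem] = name
    return [(images[stem], transcripts.get(stem)) for stem in sorted(images)]


# Hypothetical file names, for illustration only.
files = ["page_002.png", "page_001.tif", "page_001.txt", "notes.txt"]
pairs = pair_pages(files)
for image, transcript in pairs:
    print(image, "->", transcript or "not yet transcribed")
```

Pages without a transcript stay in the list, so it is easy to see what still needs transcribing, while text files that match no image (such as `notes.txt`) are ignored.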

TEI (Text Encoding Initiative)

  • TEI is a standard XML vocabulary for encoding texts
  • XML is a longstanding and widely used technology
  • XML marks up the structure of a document and not the appearance (unlike HTML)
  • XML provides a way to validate documents against a defined schema
  • [TEI Doc](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html)
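
To make these points concrete, the sketch below embeds a minimal TEI-style document in Python and reads it with the standard library. The sample text, the image name `page_001.tif`, and the helper `check_structure` are assumptions for illustration; the element names and the namespace are taken from TEI P5. Parsing only confirms that the XML is well formed, and the small structural check is a simplified stand-in for real validation against the TEI schema, which needs a schema-aware tool.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative sample only: the markup says what each part IS
# (a page break, a paragraph, a place name), not how it should look.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Transcribed from a page image</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <facsimile>
    <surface xml:id="page1">
      <graphic url="page_001.tif"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <pb n="1" facs="#page1"/>
      <p>Dear friend, I write to you from <placeName>Sydney</placeName>.</p>
    </body>
  </text>
</TEI>
"""


def check_structure(root):
    """Return the required parts missing from a parsed TEI document.

    A simplified stand-in for schema validation, not a replacement for it.
    """
    required = ["tei:teiHeader", "tei:text/tei:body"]
    return [path for path in required if root.find(path, TEI_NS) is None]


root = ET.fromstring(SAMPLE)  # raises ET.ParseError if the XML is not well formed
missing = check_structure(root)

# Because structure is marked up explicitly, it can be queried directly.
places = [el.text for el in root.iterfind(".//tei:placeName", TEI_NS)]
page_break = root.find(".//tei:pb", TEI_NS)
image = root.find(".//tei:graphic", TEI_NS)

print("missing parts:", missing)
print("places named:", places)
print("page", page_break.get("n"), "image:", image.get("url"))
```

The `facs` attribute on `pb` points at the `surface` that holds the page image, which is one way to provide the transcript and the image together in a single file.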