Simple OCR extractor

This extractor uses Tesseract. Information on Tesseract and how to install it can be found on the Tesseract project page.

This extractor uses pyClowder. Information on pyClowder can be found on the pyClowder project page.

This extractor depends on RabbitMQ. You will need to set the RabbitMQ URL so that it points to your RabbitMQ server; the URL format is described in the RabbitMQ URI specification (https://www.rabbitmq.com/uri-spec.html). For example, to connect to RabbitMQ running on localhost with the default parameters, you can use the following URL: amqp://guest:guest@localhost:5672/%2f
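The parts of the example URL can be easier to see when it is taken apart. The following is a minimal sketch using only the Python standard library; the helper name parse_amqp_url is hypothetical and is not part of this extractor or of pyClowder. It shows that the trailing %2f is the percent-encoded form of "/", the name of the virtual host.

```python
from urllib.parse import urlsplit, unquote


def parse_amqp_url(url):
    """Split an AMQP URL into its connection parameters (illustrative helper)."""
    parts = urlsplit(url)
    if parts.scheme not in ("amqp", "amqps"):
        raise ValueError(f"not an AMQP URL: {url!r}")
    # Everything after the first "/" of the path is the percent-encoded
    # virtual host name; None means the URL did not specify one.
    vhost = unquote(parts.path[1:]) if parts.path else None
    return {
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname,
        "port": parts.port,
        "vhost": vhost,
    }


params = parse_amqp_url("amqp://guest:guest@localhost:5672/%2f")
print(params)
```

Here the user and password are both guest, the host is localhost, the port is 5672, and the virtual host decodes to "/".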

Overview

Performs simple OCR on an image and associates the resulting text with it. The text is not meant to be a perfect transcription, but a way to associate words with an image so as to make images more searchable.

Input

An image file in a format supported by Tesseract.

Output

OCR text extracted from the input associated with the original file.
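The input-to-output flow can be sketched roughly as follows. This is not the code in ocr.py; it is an illustration under stated assumptions: the tesseract command-line tool is on the PATH, the function names are hypothetical, and the metadata key ocr_text is taken from the test instructions in this README. Passing stdout as the output base makes tesseract print the recognized text instead of writing a file.

```python
import subprocess


def build_tesseract_command(image_path):
    """Return the tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout"]


def run_ocr(image_path):
    """Run tesseract on one image and return the raw recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


def to_metadata(text):
    """Collapse whitespace and wrap the text under the ocr_text key."""
    return {"ocr_text": " ".join(text.split())}


if __name__ == "__main__":
    # Assumes tesseract is installed and browndog.png is in the current directory.
    print(to_metadata(run_ocr("browndog.png")))
```

Collapsing whitespace is a design choice of this sketch: the goal stated above is searchable words, not a faithful layout of the page.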

Sample input and output files

A sample input file "browndog.png" and a sample output file "browndog.png.sample-output" are available in this directory.

Test locally with Clowder

1. In the extractor-tesseract/ folder, run:

   docker build -t clowder/ocr:test .

2. In the tests subdirectory, run:

   docker-compose -f docker-compose.yml -f docker-compose.extractors.yml up -d

3. Initialize Clowder:

   docker run -ti --rm --network tests_clowder clowder/mongo-init

4. Enter email, first name, last name, password, and admin: true when prompted.
5. Navigate to localhost:9001 and log in with the credentials you created in step 4.
6. Create a test space and dataset. Then click 'Select Files' and upload tests/browndog.png.
7. Click on the file and choose to submit it for extraction.
8. It may take a few minutes before the available extractors are visible within Clowder.
9. Eventually you should see ocr in the list; click submit.
10. Navigate back to the file and click on metadata.
11. You should see the ocr_text metadata present.
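Because the containers need some time to start, it can be convenient to wait for the Clowder web interface before opening the browser. The following is a minimal sketch, assuming Clowder answers HTTP requests at http://localhost:9001 as in the steps above; the function names are hypothetical. It only checks that the web interface responds, not that the ocr extractor has already registered.

```python
import time
import urllib.error
import urllib.request


def clowder_is_up(url="http://localhost:9001", timeout=5):
    """Return True once something answers HTTP requests at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure (URLError is an OSError).
        return False


def wait_until(probe, attempts=30, delay=10.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


if __name__ == "__main__":
    if wait_until(clowder_is_up) is None:
        print("Clowder did not respond in time")
    else:
        print("Clowder is reachable at http://localhost:9001")
```

The probe is passed in as a function so that the retry loop can be reused for other checks or exercised without a running server.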

Setting the timezone variable (TZ) is optional. It can make the timestamps shown in the log file easier to interpret. By default, a container uses UTC.
