| Jurisdiction | Source | Source code | Dataset |
|---|---|---|---|
| Australia | Family, domestic and sexual violence... | parser, spider, tests | json |
| Australia | IP Glossary | parser, spider, tests | json |
| Canada | Dept. of Justice Legal Glossaries | parser, spider, tests | json |
| Canada | Glossary of Parliamentary Terms for... | parser, spider, tests | json |
| Intergovernmental | Rome Statute | parser, spider, tests | json |
| Ireland | Glossary of Legal Terms | parser, spider, tests | json |
| New Zealand | Glossary | parser, spider, tests | json |
| USA | US Courts Glossary | parser, spider, tests | json |
| USA | USCIS Glossary | parser, spider, tests | json |
| USA / Georgia | Attorney General Opinions | parser, spider, tests | |
| USA / Oregon | Oregon Administrative Rules | parser, spider, tests | |
The Ireland glossary parser is the best example of our coding style. See the wiki for a technical explanation of our parsing strategy.
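To give a rough flavor of that style (the class and function names here are hypothetical, not the project's actual modules), a parser is kept as a small pure function that turns a fetched page into typed entries:

```python
# A minimal sketch of the parser style, with hypothetical names;
# see the Ireland glossary parser for the real thing.
from dataclasses import dataclass

from scrapy.http import HtmlResponse


@dataclass(frozen=True)
class GlossaryEntry:
    """One term/definition pair scraped from a glossary page."""
    phrase: str
    definition: str


def parse_glossary(response: HtmlResponse) -> list[GlossaryEntry]:
    """Pure function: HTML response in, structured entries out."""
    return [
        GlossaryEntry(
            phrase=dt.css("::text").get("").strip(),
            definition=dd.css("::text").get("").strip(),
        )
        for dt, dd in zip(response.css("dt"), response.css("dd"))
    ]
```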
The spiders retrieve HTML pages and output well-formed JSON that mirrors the source's structure. First, we can list the available spiders:
$ scrapy list
aus_ip_glossary
can_doj_glossaries
int_rome_statute
...
Then we can run one of the spiders:
$ scrapy crawl --overwrite-output tmp/output.json usa_or_regs
This produces:
{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. & cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. & cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00"
            }
          ]
        }
      ]
    }
  ]
}
(etc.)
The Wiki explains the JSON strategy.
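As a quick sketch of working with that output, the nested chapters → divisions → rules hierarchy from the Oregon crawl above can be walked directly once it has been written to tmp/output.json (assuming the shape shown above; Scrapy's feed export may also wrap the item in a list):

```python
# Walk the chapters -> divisions -> rules hierarchy from the crawl above.
import json

with open("tmp/output.json") as f:
    data = json.load(f)

# Scrapy's JSON feed export wraps yielded items in a list.
if isinstance(data, list):
    data = data[0]

print("Accessed:", data["date_accessed"])
for chapter in data["chapters"]:
    print(f'Chapter {chapter["number"]}: {chapter["name"]}')
    for division in chapter["divisions"]:
        for rule in division["rules"]:
            print(f'  {rule["number"]}  {rule["name"]}')
```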
I'm using asdf to manage Python versions, because its Homebrew distribution is more up-to-date than pyenv's, and Poetry for dependency management.
Before I start working, I enter the virtual environment:
poetry shell
It's always a good idea to make sure the current dependencies are installed:
poetry install
The tests run easily with pytest:
pytest
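A typical test in this style loads a saved fixture page and makes assertions about the parsed output. This is only a sketch: the fixture path, module path, and parse_glossary() function are hypothetical, not the project's actual names:

```python
# Sketch of a parser test; fixture and module names are hypothetical.
from pathlib import Path

from scrapy.http import HtmlResponse

from my_project.parsers import parse_glossary  # hypothetical module path

FIXTURE = Path("tests/fixtures/glossary.html")  # hypothetical fixture file


def make_response() -> HtmlResponse:
    """Build an offline response from the saved fixture page."""
    return HtmlResponse(
        url="https://example.com/glossary",
        body=FIXTURE.read_bytes(),
        encoding="utf-8",
    )


def test_parses_at_least_one_entry():
    assert len(parse_glossary(make_response())) > 0


def test_every_entry_has_a_definition():
    assert all(entry.definition for entry in parse_glossary(make_response()))
```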
- Java is required by the Python Tika package (see the usage sketch at the end of this section).
- Pylance/Pyright for type-checking
- Black for formatting
One small glitch, though: the tests usually run twice when I save a file in VS Code.
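As a footnote to the Java requirement above: the Python Tika package launches a Java-based Tika server in the background, which is why Java has to be installed. A minimal usage sketch (the file path is just a placeholder):

```python
# Minimal Tika usage sketch; the file path is a placeholder.
# tika-python launches a Java-based Tika server in the background.
from tika import parser

parsed = parser.from_file("some-document.pdf")
print(parsed["metadata"])  # document metadata
print(parsed["content"])   # extracted plain text
```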