| Jurisdiction | Source | Source code | Dataset |
|---|---|---|---|
| Australia | Family, domestic and sexual violence... | parser, spider, tests | json |
| Australia | IP Glossary | parser, spider, tests | json |
| Canada | Dept. of Justice Legal Glossaries | parser, spider, tests | json |
| Canada | Glossary of Parliamentary Terms for... | parser, spider, tests | json |
| Intergovernmental | Rome Statute | parser, spider, tests | json |
| Ireland | Glossary of Legal Terms | parser, spider, tests | json |
| New Zealand | Glossary | parser, spider, tests | json |
| USA | US Courts Glossary | parser, spider, tests | json |
| USA | USCIS Glossary | parser, spider, tests | json |
| USA / Georgia | Attorney General Opinions | parser, spider, tests | |
| USA / Oregon | Oregon Administrative Rules | parser, spider, tests | |
The Ireland glossary parser is the best example of our coding style. See the wiki for a technical explanation of our parsing strategy.
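To give a rough flavor of that style (the class and function names here are hypothetical, not the project's actual modules), a parser is kept as a small pure function that turns a fetched page into typed entries:

```python
# A minimal sketch of the parser style, with hypothetical names;
# see the Ireland glossary parser for the real thing.
from dataclasses import dataclass

from scrapy.http import HtmlResponse


@dataclass(frozen=True)
class GlossaryEntry:
    """One term/definition pair scraped from a glossary page."""
    phrase: str
    definition: str


def parse_glossary(response: HtmlResponse) -> list[GlossaryEntry]:
    """Pure function: HTML response in, structured entries out."""
    return [
        GlossaryEntry(
            phrase=dt.css("::text").get("").strip(),
            definition=dd.css("::text").get("").strip(),
        )
        for dt, dd in zip(response.css("dt"), response.css("dd"))
    ]
```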
The spiders retrieve HTML pages and output well-formed JSON that mirrors the source's structure. First, we can list the available spiders:
$ scrapy list
aus_ip_glossary
can_doj_glossaries
int_rome_statute
...
Then we can run one of the spiders:
$ scrapy crawl --overwrite-output tmp/output.json usa_or_regs
This produces:
{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. & cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. & cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00"
            }
          ]
        }
      ]
    }
  ]
}
(etc.)
The Wiki explains the JSON strategy.
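As a quick sketch of working with that output, the nested chapters → divisions → rules hierarchy from the Oregon crawl above can be walked directly once it has been written to tmp/output.json (assuming the shape shown above; Scrapy's feed export may also wrap the item in a list):

```python
# Walk the chapters -> divisions -> rules hierarchy from the crawl above.
import json

with open("tmp/output.json") as f:
    data = json.load(f)

# Scrapy's JSON feed export wraps yielded items in a list.
if isinstance(data, list):
    data = data[0]

print("Accessed:", data["date_accessed"])
for chapter in data["chapters"]:
    print(f'Chapter {chapter["number"]}: {chapter["name"]}')
    for division in chapter["divisions"]:
        for rule in division["rules"]:
            print(f'  {rule["number"]}  {rule["name"]}')
```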
I'm using asdf to manage Python versions, because its Homebrew distribution is more up-to-date than pyenv's, and Poetry for dependency management.
Before I start working, I enter the virtual environment:
poetry shell
It's always a good idea to make sure the current dependencies are installed:
poetry install
The tests run easily with pytest:
pytest
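A typical test in this style loads a saved fixture page and makes assertions about the parsed output. This is only a sketch: the fixture path, module path, and parse_glossary() function are hypothetical, not the project's actual names:

```python
# Sketch of a parser test; fixture and module names are hypothetical.
from pathlib import Path

from scrapy.http import HtmlResponse

from my_project.parsers import parse_glossary  # hypothetical module path

FIXTURE = Path("tests/fixtures/glossary.html")  # hypothetical fixture file


def make_response() -> HtmlResponse:
    """Build an offline response from the saved fixture page."""
    return HtmlResponse(
        url="https://example.com/glossary",
        body=FIXTURE.read_bytes(),
        encoding="utf-8",
    )


def test_parses_at_least_one_entry():
    assert len(parse_glossary(make_response())) > 0


def test_every_entry_has_a_definition():
    assert all(entry.definition for entry in parse_glossary(make_response()))
```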
- Java is required by the Python Tika package (see the usage sketch at the end of this section).
- Pylance/Pyright for type-checking
- Black for formatting
One small glitch, though: the tests usually run twice when I save a file in VS Code.
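As a footnote to the Java requirement above: the Python Tika package launches a Java-based Tika server in the background, which is why Java has to be installed. A minimal usage sketch (the file path is just a placeholder):

```python
# Minimal Tika usage sketch; the file path is a placeholder.
# tika-python launches a Java-based Tika server in the background.
from tika import parser

parsed = parser.from_file("some-document.pdf")
print(parsed["metadata"])  # document metadata
print(parsed["content"])   # extracted plain text
```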