async-provider-etl is a Python package designed to perform ETL operations on hospital CMS data using asynchronous and parallel processing. The package downloads datasets concurrently, processes the resulting dataframes in parallel, and stores the hospital data and associated metadata in SQLite databases.
- Asynchronous Data Extraction: Utilizes aiohttp for efficient, non-blocking HTTP requests to download datasets.
- Parallel Data Processing: Leverages asyncio, ProcessPoolExecutor, and ThreadPoolExecutor for concurrent and parallel processing.
- SQLite Integration: Stores metadata and processed data in SQLite databases using aiosqlite, ensuring efficient, non-blocking queries to the embedded database.
- Command-Line Interface: Configurable via CLI arguments for verbose logging.
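The extract, transform, and load stages described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function names (`fetch`, `transform`, `store`, `run_etl`), the `providers` table, and the sample rows are all hypothetical. To keep the sketch runnable with the standard library alone, the download is simulated with `asyncio.sleep` instead of an aiohttp request, and the standard `sqlite3` module running in a worker thread stands in for aiosqlite.

```python
import asyncio
import sqlite3
from concurrent.futures import Executor, ThreadPoolExecutor


async def fetch(name: str) -> list[dict]:
    """Hypothetical stand-in for an aiohttp download of one dataset."""
    await asyncio.sleep(0.01)  # simulates non-blocking network I/O
    return [{"Dataset": name, "Provider Name": f"Hospital {i}"} for i in range(3)]


def transform(rows: list[dict]) -> list[dict]:
    """CPU-side step: normalise column names to snake_case."""
    return [
        {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        for row in rows
    ]


def store(db_path: str, rows: list[dict]) -> int:
    """Blocking SQLite write (assumed schema); returns the number of rows written."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS providers (dataset TEXT, provider_name TEXT)"
            )
            conn.executemany(
                "INSERT INTO providers VALUES (:dataset, :provider_name)", rows
            )
    finally:
        conn.close()
    return len(rows)


async def run_etl(names: list[str], db_path: str, executor: Executor) -> int:
    loop = asyncio.get_running_loop()
    # Extract: all downloads are awaited concurrently on the event loop.
    raw = await asyncio.gather(*(fetch(name) for name in names))
    # Transform: each dataset is handed to the executor's worker pool.
    cleaned = await asyncio.gather(
        *(loop.run_in_executor(executor, transform, rows) for rows in raw)
    )
    # Load: writes run one at a time in a worker thread, off the event loop.
    total = 0
    for rows in cleaned:
        total += await asyncio.to_thread(store, db_path, rows)
    return total
```

Under these assumptions, a call such as `asyncio.run(run_etl(["general_info"], "providers.db", executor))` drives the whole pipeline. Passing a `ProcessPoolExecutor` as `executor` would spread CPU-heavy transforms across cores, while a `ThreadPoolExecutor` is usually enough for lighter or I/O-bound steps.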
If pipx is not already installed, you will need to install it first.
To install the .whl file using pipx, run the following command:
$ pipx install "dist/async_provider_etl-0.1.0-py3-none-any.whl"
Then to trigger the ETL job, simply call the installed package:
$ async-provider-etl