The HCA DCP Query Service provides an interface for scientists and developers to query metadata associated with experimental and analysis data stored in the Human Cell Atlas Data Coordination Platform (DCP). Metadata from the DCP Data Store are indexed and stored in an AWS Aurora PostgreSQL database.
Queries to the database can be sent over HTTP through the Query Service API, which is available together with its documentation at https://query.staging.data.humancellatlas.org/.
For long-running queries (runtime over 20 seconds), the Query Service supports asynchronous tracking of query results. When a long-running query triggers this mode, the caller will receive a `301 Moved Permanently` response status code with a `Retry-After` header. The caller is expected to wait the specified amount of time before checking the redirect destination, or use the query job ID returned in the response JSON body to check the status of the query job. The caller may turn off this functionality (and cause the API to time out and return an error when a long-running query is encountered) by setting the `async=False` flag when calling `/query`.
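As an illustration, a Python client using the `requests` library might handle this flow as follows; the request payload shape is an assumption (not taken from the API spec), so consult the hosted documentation for the exact schema:

```python
import time

import requests

API = "https://query.staging.data.humancellatlas.org"

# The payload shape here is an assumption; consult the hosted API documentation for the exact schema.
resp = requests.post(f"{API}/query",
                     json={"query": "SELECT count(*) FROM files"},
                     allow_redirects=False)

if resp.status_code == 301:
    # Long-running query: wait as instructed, then check the redirect destination.
    # (Alternatively, use the query job ID from the JSON body to poll the job status.)
    time.sleep(int(resp.headers["Retry-After"]))
    resp = requests.get(resp.headers["Location"], allow_redirects=False)

print(resp.status_code)
```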
For large query results, the Query Service may deposit results in S3 instead of returning them verbatim in the response body. In this case, the client will receive a `302 Found` response status code redirecting them to the response data location. In this mode, response data are confidential to the caller, and remain accessible for 7 days. The caller may turn off this functionality by setting the `async=False` flag when calling `/query`.
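A similar sketch, under the same assumptions, shows how a client might handle results that have been deposited in S3:

```python
import requests

API = "https://query.staging.data.humancellatlas.org"

# As above, the payload shape is an assumption.
resp = requests.post(f"{API}/query",
                     json={"query": "SELECT * FROM files"},
                     allow_redirects=False)

if resp.status_code == 302:
    # Large result set: fetch the data from the temporary, caller-private S3 location.
    results = requests.get(resp.headers["Location"]).json()
else:
    results = resp.json()
```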
- For more user documentation, example queries, design goals, and other help, see
- For issue and task tracking, see our ZenHub board.
The REST API is a Chalice app that adopts OpenAPI's approach to specification-driven development and leverages Connexion for input parameter validation. The Chalice app is deployed with Amazon API Gateway and AWS Lambda. The full API documentation can be found here.
The Query Service is subscribed to all updates to the DCP Data Store (DSS). When data is added to the DSS, a webhook containing the bundle ID is sent to the Query API. The `bundle_id` is added to the `dcp-query-data-input-queue-[deployment-stage]` SQS queue, which eventually triggers the `query-load-data-[deployment-stage]` Lambda, which retrieves the bundle metadata and loads it into the database.
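For orientation only, the load step might look roughly like the sketch below; the handler and helper names are hypothetical and do not correspond to the actual dcpquery code:

```python
import json

# Illustrative sketch only: the handler and helper names here are hypothetical and do not
# reflect the actual dcpquery implementation.
def load_data_handler(event, context):
    """Triggered by messages on the dcp-query-data-input-queue-[deployment-stage] SQS queue."""
    for record in event["Records"]:            # one SQS record per DSS notification
        message = json.loads(record["body"])
        bundle_id = message["bundle_id"]       # field name is an assumption
        metadata = fetch_bundle_metadata(bundle_id)
        load_into_database(bundle_id, metadata)


def fetch_bundle_metadata(bundle_id):
    """Hypothetical helper: retrieve the bundle's metadata files from the DSS API."""
    return {"bundle_id": bundle_id}


def load_into_database(bundle_id, metadata):
    """Hypothetical helper: write the metadata into the Aurora PostgreSQL database."""
    print("loading", bundle_id, metadata)
```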
On Ubuntu or Debian, run `apt-get install jq moreutils gettext make virtualenv postgresql zip unzip`.
On macOS, run:

```
brew install jq moreutils gettext postgresql@10
ln -s /usr/local/opt/gettext/bin/envsubst /usr/local/bin
ln -s /usr/local/Cellar/postgresql@10/*/bin/psql /usr/local/bin
```
Most components of the Query Service are written in Python (Python version 3.6 or higher is required). After cloning this repository, you can run the Query Service in local mode to experiment with it. In this mode, a PostgreSQL server is expected to be running locally on the standard port (5432), and the current user is expected to have admin access to the database referenced by `APP_NAME`.
It is recommended that you set up a dedicated Python virtualenv to work with the Query Service, then install the development requirements and run the Chalice server in local mode:
```
mkdir venv
virtualenv --python python3.6 venv/36
source venv/36/bin/activate
pip install -r requirements-dev.txt
source environment
chalice local
```
Errors while running `source environment` can be ignored if you are just experimenting with the Query Service in local mode. After starting `chalice local`, you can open the Query Service in your browser at http://127.0.0.1:8000 to experiment with it.
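For example, the same local endpoint can be exercised programmatically:

```python
import requests

# Query Service started with `chalice local`, listening on the default port
resp = requests.get("http://127.0.0.1:8000/")
print(resp.status_code, resp.text[:200])
```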
Global app configuration variables are stored in this file in the root of this repository. The file is a Bash script, intended to be sourced in the shell by running `source environment`. Some of these environment variables are made available to the deployed AWS Lambda functions and to the Terraform scripts via the `EXPORT_*` lists at the bottom of the file.
Python runtime app configuration is centralized in this file. Some of the config values are imported from the environment. In Python, the instance of the `dcpquery._config.DCPQueryConfig` class is available as `dcpquery.config`.
Python logging configuration is centralized in `dcpquery.config.configure_logging()`. Call this method from any entry point that loads `dcpquery`.
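For example, a standalone script or notebook that uses `dcpquery` might begin like this (a minimal sketch, assuming the `environment` file has already been sourced):

```python
import logging

import dcpquery

# Configure logging once at the entry point, as described above; the log level follows DCPQUERY_DEBUG.
dcpquery.config.configure_logging()
logging.getLogger(__name__).info("dcpquery logging configured")
```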
Logging verbosity is controlled through the `DCPQUERY_DEBUG` environment variable, set in the `environment` file:

- Set `DCPQUERY_DEBUG=0` to disable debugging and set the app log level to ERROR.
- Set `DCPQUERY_DEBUG=1` to change the log level to INFO and cause stack traces to appear in error responses.
- Set `DCPQUERY_DEBUG=2` to change the log level to DEBUG and cause stack traces to appear in error responses.
Several configuration items, including database credentials, are saved by Terraform in the AWS Secrets Manager, and accessed by the Python app from there.
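For illustration, reading such a secret from Python with boto3 might look like the sketch below; the secret name shown is hypothetical:

```python
import json

import boto3

# Illustrative only: the secret name here is hypothetical, not the one the deployed app uses.
secret = boto3.client("secretsmanager").get_secret_value(SecretId="dcpquery/dev/postgresql")
db_credentials = json.loads(secret["SecretString"])
print(sorted(db_credentials))
```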
Runtime Python dependencies for the Query Service app are listed in `requirements.txt.in`. Test/development Python dependencies are listed in `requirements-dev.txt.in`. These files are compiled into `pip freeze` output files, which are stored in `requirements.txt` and `requirements-dev.txt`, respectively.

To update the requirements files for the application, edit the `*.in` files and run `make refresh-all-requirements`, then commit all the `requirements*` file changes to version control.
To load a test dataset into the Query Service for experimenting, run `make load-test-data`. This script fetches and loads a small (<10 MB) set of HCA production metadata.
The Query Service requires AWS to deploy. It uses several AWS services, including Lambda, S3, API Gateway, SQS, ACM, and Route 53. To set up the deployment, first ensure that your AWS credentials are configured by installing the AWS CLI and running `aws configure` or setting the AWS environment variables.
The Query Service uses Terraform to manage the deployment process. Before deploying, edit the `environment` file to set essential variables to control the deployment:

- `STAGE` - Application deployment stage (such as `dev`, `staging`, or `production`)
- `APP_NAME` - The name that the Query Service will use to refer to itself
- `API_DOMAIN_NAME` - The domain name that the Query Service will register for itself using Route 53 and API Gateway. Before continuing, ensure that a certificate for this domain name is available (in VERIFIED state) in the AWS Certificate Manager.
- `API_DNS_ZONE` - The name of a Route 53 zone that must be present in your AWS account, with a registered domain pointing to it. You can set up both at https://console.aws.amazon.com/route53/.
Run `source environment` to set the results of the above edits in your shell. (You can also source environment-specific convenience files such as `environment.staging` that set the `STAGE` variable and then source `environment` for you.)
In the same shell, run `make install-secrets` to initialize and upload credentials that Query Service components need to communicate with each other.

Finally, to deploy the Query Service, run `make deploy` in the same shell.
After deploying, you can update just the Lambda function codebase by running `make update-lambda` (this is faster, but no dependencies, routes, or IAM policies will be updated).
Run `python -m dcpquery.db --help` for a list of commands available to manage the database. By default, this script connects to the PostgreSQL database running on localhost. To connect to the remote database listed in the service configuration, add `--db remote` to the command.
- Running `python -m dcpquery.db connect` will connect to the database using psql.
- Running `python -m dcpquery.db load` (or `make load`) will load all data from the DSS. This requires substantial time and resources. To load a test dataset for experimenting, see "Loading Test Data".

For example, to open a psql session against the remote database, run `python -m dcpquery.db connect --db remote`.
Run `make test` to run unit tests. Run `make integration-test` to run integration tests, which require the Query Service to be deployed.
To get logs for the last 5 minutes from the app, type `make get_logs` in this directory. Lambda is automatically set up to emit logs to CloudWatch Logs, which you can browse in the AWS console by selecting the log group for your Lambda. You can see built-in CloudWatch metrics (invocations, errors, etc.) by selecting your Lambda in the Lambda AWS console and going to the Monitoring tab. To tail and filter logs on the command line, you can use the `logs` command in the Aegea package, for example: `aegea logs /aws/lambda/dcpquery-api-dev --start-time=-15m` (this is what `make get_logs` runs).
For more help troubleshooting failures in the system, run `scripts/trace.py` and follow the instructions.
You can view metric dashboards for each deployment stage at the links below:
Please report bugs, issues, feature requests, etc. on GitHub. Contributions are welcome; please read CONTRIBUTING.md.
- Anytime you make changes to the database schema (adding a table, changing a field name, creating or updating an enum, etc.), you will need to create and apply a migration.
- Autogenerate a migration file based on changes made to the ORM. On the command line, run `make create-migration`.
  - This will create a migration in `dcpquery/alembic/versions`. Take a look at the generated SQL to ensure it represents the changes you wish to make to the database. Potential issues with migration autogeneration are listed below.
  - If you get the error `ERROR [alembic.util.messaging] Target database is not up to date. FAILED: Target database is not up to date.`, you need to apply the migrations you've already created to the db (or delete them) before you can create a new migration.
  - Note that this will create a migration even if you have not made any changes to the db (in that case it will just be an empty migration file, which you should delete).
- To create a blank migration file, run `alembic revision -m "description of changes"`.
  - The description of changes will be appended to the migration file's name, so you'll want to keep it short (less than 40 chars); spaces will be replaced with underscores.
  - You can then edit the newly created migration file (in `dcpquery/alembic/versions`); see the example migration sketch at the end of this section.
- Ensure you are connected to the correct database (run `python -m dcpquery.db connect` to see the database URL; use the `--db remote` flag if necessary).
- From the command line, run `python -m dcpquery.db migrate`; use the `--db remote` flag if necessary.
- To unapply a migration locally, run `alembic downgrade migration_id` (the migration_id is the string in front of the underscore in the migration name; for the file 000000000000_init_db.py, the migration id is 000000000000).
- To unapply a migration in a remote db, it is simplest to hardcode the `db_url` in `alembic/env.py` and then run `alembic downgrade migration_id`.
Migration autogeneration can't detect:

- Changes of table name. These will come out as an add/drop of two different tables, and should be hand-edited into a name change instead.
- Changes of column name. Like table name changes, these are detected as a column add/drop pair, which is not at all the same as a name change.
- Anonymously named constraints. Give your constraints a name, e.g. `UniqueConstraint('col1', 'col2', name="my_name")`.
- Special SQLAlchemy types such as `Enum` when generated on a backend which doesn't support ENUM directly. This is because the representation of such a type in the non-supporting database, i.e. a CHAR + CHECK constraint, could be any kind of CHAR + CHECK. For SQLAlchemy to determine that this is actually an ENUM would only be a guess, something that's generally a bad idea.
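When hand-editing a migration (for example, turning an autogenerated add/drop pair into a rename), the file in `dcpquery/alembic/versions` follows the standard Alembic layout. A sketch, with made-up revision identifiers and a made-up rename:

```python
"""rename example column

Revision ID: 0123456789ab
Revises: 000000000000
"""
from alembic import op

# Revision identifiers used by Alembic; the values here are illustrative.
revision = "0123456789ab"
down_revision = "000000000000"
branch_labels = None
depends_on = None


def upgrade():
    # Hand-edited from an autogenerated add/drop pair into a proper rename.
    op.alter_column("example_table", "old_name", new_column_name="new_name")


def downgrade():
    # Reverse the rename when running `alembic downgrade`.
    op.alter_column("example_table", "new_name", new_column_name="old_name")
```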