The Gen3 User Data Library service allows management of many user selections of data. It creates a "library" containing all of a user's data selections.
Data selections are internally referred to as lists. A user can have 0 to many lists forming their library. A list has unique items that represent data in different forms. Lists can be stored, retrieved, modified, and deleted per user.
At the moment the lists support the following items:
- Global Alliance for Genomics and Health (GA4GH) Data Repository Service (DRS) Uniform Resource Identifiers (URIs)
- Gen3 GraphQL queries
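As a rough illustration of the list shape described above, the sketch below models a list as a name plus a dictionary of items keyed by their identifier, so adding the same item twice cannot create a duplicate. The helper name `add_items` and its `max_items` parameter are hypothetical and are not taken from the service's code; the item payload mirrors the DRS example used elsewhere in this README.

```python
from typing import Any, Dict


def add_items(
    user_list: Dict[str, Any],
    new_items: Dict[str, Dict[str, Any]],
    max_items: int = 1000,
) -> Dict[str, Any]:
    """Return a copy of user_list with new_items merged in.

    Items are keyed by their identifier (for example a DRS URI), so adding
    an identifier that already exists replaces it instead of duplicating it.
    NOTE: hypothetical helper for illustration, not the service's own code.
    """
    merged = {**user_list.get("items", {}), **new_items}
    if len(merged) > max_items:
        raise ValueError(f"list would hold {len(merged)} items, limit is {max_items}")
    return {**user_list, "items": merged}


drs_item = {
    "drs://dg.4503:943201c3-271d-4a04-a2b6-040272239a64": {
        "dataset_guid": "phs000001.v1.p1.c1",
        "type": "GA4GH_DRS",
    }
}
my_list = add_items({"name": "My Saved List 1", "items": {}}, drs_item)
# Adding the same item again leaves the list with a single entry.
my_list = add_items(my_list, drs_item)
```

Keying items by identifier is one simple way to get the uniqueness property; the actual storage model may enforce it differently.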
This repo is a standard CRUD REST API. The service is built on the FastAPI framework and uses Postgres as its storage mechanism. Our ORM interface is the UserList object as defined in the user_list.py file, and all behavior captured reflects modifications to the underlying table represented by this object. In our top-level directory, you can use several different .sh files to perform common tasks.
- Use run.sh to spin up a localhost instance of the API
- Use test.sh to set up the database as well as run all the tests
- Use clean.sh to run several formatting and linting commands
We use .env files to hold the configuration for each environment. More information about accepted configurations can be found under the docs folder in the example env file. We use Alembic to handle our database setup as well as migrations.
For Alembic, our system uses a generic single-database configuration with an async db API.
The API should nearly work out of the box. You will
need to install poetry dependencies, as well as set
up a .env file at the top level. The configuration
for this is described directly below. To generate the tables
you can run poetry run alembic upgrade head. Once you have
a .env set up, running run.sh should boot up
an API you can access in your browser by going to
localhost:8000 assuming you use the default ports.
The configuration is done via a .env which allows environment variable overrides if you don't want to use the actual
file.
Here's an example .env file you can copy and modify:
########## Secrets ##########
# make sure you have `postgresql+asyncpg` or you'll get errors about the default psycopg not supporting async
DB_CONNECTION_STRING="postgresql+asyncpg://postgres:postgres@localhost:5432/gen3userdatalibrary"
########## Configuration ##########
########## Debugging and Logging Configurations ##########
# DEBUG makes the logging go from INFO to DEBUG
DEBUG=False
# DEBUG_SKIP_AUTH will COMPLETELY SKIP AUTHORIZATION for debugging purposes
DEBUG_SKIP_AUTH=False
MAX_LISTS = 100
MAX_LIST_ITEMS = 1000
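To make the override behavior concrete, here is a minimal sketch of reading these settings from environment variables and falling back to the example values shown above. The `load_settings` and `_as_bool` helpers are hypothetical; the service may read its configuration through a different mechanism.

```python
import os
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'True', 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented settings from env, falling back to the example values.

    NOTE: hypothetical helper; the service may read its configuration differently.
    """
    env = os.environ if env is None else env
    return {
        "DB_CONNECTION_STRING": env.get(
            "DB_CONNECTION_STRING",
            "postgresql+asyncpg://postgres:postgres@localhost:5432/gen3userdatalibrary",
        ),
        "DEBUG": _as_bool(env.get("DEBUG", "False")),
        "DEBUG_SKIP_AUTH": _as_bool(env.get("DEBUG_SKIP_AUTH", "False")),
        "MAX_LISTS": int(env.get("MAX_LISTS", "100")),
        "MAX_LIST_ITEMS": int(env.get("MAX_LIST_ITEMS", "1000")),
    }


# Passing a plain dict stands in for real environment variables here.
settings = load_settings({"DEBUG": "True", "MAX_LISTS": "5"})
```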
You need Postgres databases set up and you need to migrate them to the latest schema using Alembic.
The test db config by default is:
DB_CONNECTION_STRING="postgresql+asyncpg://postgres:postgres@localhost:5432/testgen3datalibrary"
So it expects a postgres user with access to a testgen3datalibrary database; you will need to ensure both are
created and set up correctly.
The general app (by default) expects the same postgres user with access to gen3datalibrary.
NOTE: The run.sh (and test.sh) scripts will attempt to create the database using the configured DB_CONNECTION_STRING if it doesn't exist.
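Before running the scripts, it can help to sanity-check the connection string. The sketch below uses only the standard library to confirm the `postgresql+asyncpg` driver prefix and to extract the user and database that need to exist. The function name `describe_connection_string` is hypothetical and not part of this repo.

```python
from urllib.parse import urlsplit


def describe_connection_string(connection_string: str) -> dict:
    """Split a connection string and check that it uses the asyncpg driver.

    NOTE: hypothetical helper for illustration only.
    """
    parts = urlsplit(connection_string)
    if parts.scheme != "postgresql+asyncpg":
        raise ValueError(
            f"expected the 'postgresql+asyncpg' driver, got '{parts.scheme}'"
        )
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        # The database name is the URL path without its leading slash.
        "database": parts.path.lstrip("/"),
    }


info = describe_connection_string(
    "postgresql+asyncpg://postgres:postgres@localhost:5432/testgen3datalibrary"
)
```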
The following script will migrate, set up the env, and run the service locally:

./run.sh

Example request body for a list:

{
  "name": "My Saved List 1",
  "items": {
    "drs://dg.4503:943201c3-271d-4a04-a2b6-040272239a64": {
      "dataset_guid": "phs000001.v1.p1.c1",
      "type": "GA4GH_DRS"
    }
  }
}

Example curl request:

curl --request GET \
  --url http://localhost:8000/library/lists/44580043-1b42-4015-bfa3-923e3db98114 \
  --header 'ID: f5407e8d-8cc8-46c2-a6a4-5b6f136b7281' \
  --data '{"lists": [
  {
    "name": "My Saved List 1",
    "items": {
      "drs://dg.4503:943200c3-271d-4a04-a2b6-040272239a64": {
        "dataset_guid": "phs000001.v1.p1.c1",
        "type": "GA4GH_DRS"}}}]}'

To ensure that users only interact with lists they have access to, we use an authz mechanism to authorize users. We use Arborist for this. Currently, there are three specific ways we use Arborist.
First, we ensure a policy exists for the user or create one if not. You can see this in the dependencies file.
Second, we create or update a resource for new lists that are created. This is done in the upsert function in the lists route file.
Third, with the prior two steps established, we authorize incoming requests to ensure that a user who is defined in our system has access to the list they're requesting to view.
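The three steps above can be sketched with an in-memory stand-in for Arborist. Everything in this block is hypothetical: the `InMemoryAuthz` class, its method names, and the resource path layout are illustrative assumptions and do not reflect Arborist's real API or the paths this service uses.

```python
from typing import Dict, Set


class InMemoryAuthz:
    """Hypothetical stand-in for Arborist that keeps everything in a dict."""

    def __init__(self) -> None:
        # Maps a user ID to the set of resource paths that user may access.
        self._policies: Dict[str, Set[str]] = {}

    @staticmethod
    def list_resource(user_id: str, list_id: str) -> str:
        # ASSUMPTION: this path layout is illustrative only.
        return f"/users/{user_id}/lists/{list_id}"

    def ensure_policy(self, user_id: str) -> None:
        """Step 1: create an empty policy for the user if none exists."""
        self._policies.setdefault(user_id, set())

    def upsert_list_resource(self, user_id: str, list_id: str) -> str:
        """Step 2: register a resource for a newly created list."""
        self.ensure_policy(user_id)
        resource = self.list_resource(user_id, list_id)
        self._policies[user_id].add(resource)
        return resource

    def is_authorized(self, user_id: str, resource: str) -> bool:
        """Step 3: check that the user's policy covers the requested resource."""
        return resource in self._policies.get(user_id, set())


authz = InMemoryAuthz()
alice_list = authz.upsert_list_resource("alice", "list-1")
```

With this setup, `authz.is_authorized("alice", alice_list)` succeeds, while the same check for a different user fails because that user has no policy covering the resource.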
The library_owner role should be created in arborist:
roles:
  - id: 'library_owner'
    description: ''
    permissions:
      - id: 'library_reader'
        action:
          method: read
          service: 'gen3-user-data-library'
      - id: 'library_creator'
        action:
          method: create
          service: 'gen3-user-data-library'
      - id: 'library_updater'
        action:
          method: update
          service: 'gen3-user-data-library'
      - id: 'library_deleter'
        action:
          method: delete
          service: 'gen3-user-data-library'

If you add a new endpoint, please refer to the context configuration for information regarding expectations on what to add for an endpoint, such as authz parameters.
After installing, you can run bash ./run.sh to run the app locally.
For testing, you can run bash ./test.sh.
The default pytest options specified in pyproject.toml additionally:
- run coverage and error if it falls below the threshold
TODO: Setup profiling. cProfile actually doesn't play well with async, so pytest-profiling won't work. Perhaps use: https://github.com/joerick/pyinstrument ?
The bash ./clean.sh script runs isort and black over everything, in case you don't integrate those with your editor/IDE.
NOTE: This requires the beginning of the setup for using Super Linter locally. You must have the global linter configs in
~/.gen3/.github/.github/linters. See Gen3's linter setup docs.
clean.sh also runs just pylint to check Python code for lint.
Here's how you can run it:

./clean.sh

NOTE: GitHub's Super Linter runs more than just pylint, so it's worth setting that up locally to run before pushing large changes. See Gen3's linter setup docs for full instructions. Then you can run pylint more frequently as you develop.
To build:

docker build -t gen3userdatalibrary:latest .

To run:

docker run --name gen3userdatalibrary \
  --env-file "./.env" \
  -v "$SOME_OTHER_CONFIG":"$SOME_OTHER_CONFIG" \
  -p 8089:8000 \
  gen3userdatalibrary:latest

To exec into a bash shell in the running container:

docker exec -it gen3userdatalibrary bash

To kill and remove the running container:

docker kill gen3userdatalibrary
docker rm gen3userdatalibrary

If you want to debug the running app in an IDE and the bash scripts are not an easy option (I'm looking at you, PyCharm), then you can use debug_run.py in the root folder as an entrypoint.
NOTE: There are some setup steps that the bash scripts do that you'll need to ensure are done. A key one is ensuring that the PROMETHEUS_MULTIPROC_DIR env var is set (default is /var/tmp/prometheus_metrics). Also make sure the database exists and is migrated.
Metrics can be exposed at a /metrics endpoint compatible with Prometheus scraping and visualized in Prometheus, Grafana, etc.
The metrics are defined in gen3userdatalibrary/metrics.py and in 1.0.0 are as follows:
- gen3_user_data_library_user_lists: Gen3 User Data Library User Lists. Does not count the items WITHIN the list, just the lists themselves.
- gen3_user_data_library_user_items: Gen3 User Data Library User Items (within Lists). This counts the number of items within lists, rather than the lists themselves.
- gen3_user_data_library_api_requests_total: API requests for modifying Gen3 User Data Library User Lists. This includes all CRUD actions.
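To illustrate the difference between the first two metrics, the sketch below counts lists and the items within them for a small in-memory library. It is plain Python and does not use the Prometheus client or the definitions in gen3userdatalibrary/metrics.py; the function name `count_lists_and_items` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def count_lists_and_items(lists: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (number of lists, total number of items across those lists).

    NOTE: hypothetical helper that only illustrates what each metric counts.
    """
    total_lists = len(lists)
    total_items = sum(len(user_list.get("items", {})) for user_list in lists)
    return total_lists, total_items


# Made-up library: two lists holding three items in total.
library = [
    {"name": "List A", "items": {"drs://example:1": {}, "drs://example:2": {}}},
    {"name": "List B", "items": {"drs://example:3": {}}},
]
list_count, item_count = count_lists_and_items(library)
```

Here the lists metric would correspond to `list_count` and the items metric to `item_count`.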
You can run Prometheus locally if you want to test or visualize these.
Run the service locally using poetry run bash run.sh.
Create a prometheus.yml config file, such
as: ~/Documents/prometheus/conf/prometheus.yml.
Put this in:

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's the locally running Gen3 User Data Library service.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'gen3_user_data_library'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      # NOTE: The `host.docker.internal` below is so docker on MacOS can properly find the locally running service
      - targets: [ 'host.docker.internal:8000' ]

Note: The above config was tested on MacOS. On Linux, you may need to adjust these commands to expose the local network to the running Prometheus container.
Then run this:
docker run --name prometheus -v ~/Documents/prometheus/conf/prometheus.yml:/etc/prometheus/prometheus.yml -d -p 127.0.0.1:9090:9090 prom/prometheus
Then go to http://127.0.0.1:9090.
And some recommended PromQL queries:
sum by (user_id) (gen3_user_data_library_user_lists)
sum by (user_id) (gen3_user_data_library_user_items)
sum by (status_code) (gen3_user_data_library_api_requests_total)
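For readers new to PromQL, the following sketch mimics what `sum by (<label>)` does, applied to text in the Prometheus exposition format. The sample values and the `list_id` label are made up for illustration; only the metric name and the `user_id` label come from the queries above. The `sum_by` helper is hypothetical and handles only simple labeled samples.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches lines such as: metric_name{label="value",other="x"} 3
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_by(exposition: str, metric: str, label: str) -> Dict[str, float]:
    """Sum the samples of one metric, grouped by the value of one label."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


# Made-up scrape output; the list_id label is an assumption for illustration.
SAMPLE_TEXT = """
# TYPE gen3_user_data_library_user_items gauge
gen3_user_data_library_user_items{user_id="1",list_id="a"} 2
gen3_user_data_library_user_items{user_id="1",list_id="b"} 3
gen3_user_data_library_user_items{user_id="2",list_id="c"} 4
"""
items_per_user = sum_by(SAMPLE_TEXT, "gen3_user_data_library_user_items", "user_id")
```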