A database of scientific publications related to a mission.
kpub is a generic tool that enables an institution to keep track of its scientific publications in an easy way. It leverages MongoDB and the ADS API to create and curate a database that contains the metadata of mission-related articles.
This project has been expanded to run as a Python library. Data is stored in a MongoDB database. A Flask server handles HTTP requests to serve data, update affiliation information, and update the database. It is designed to run at the W. M. Keck Observatory and may no longer function as originally intended elsewhere.
This tool is made possible thanks to the efforts of Geert Barentsen who wrote the original version of kpub for Kepler/K2. The major changes here are:
- Code is a library installed with `pip install .`
- Code is now config-file driven so it can be used by any facility or institution.
- Added optional tracking of instrument associations and associated new plots.
- Added optional tracking of archive references and associated new plots.
- Estimates whether new articles are Keck-related and gives the reason why.
- Added affiliations mapping and plotting.
- Added automated PDF download, view, and search for highlight snippets.
- Removed reliance on 'andycasey/ads' third-party module (due to some limitations).
- Replaced installation script and Makefile with run script (due to some limitations).
- Download the source code (assuming $HOME for examples below):

```
cd $HOME
git clone https://github.com/KeckObservatory/kpub.git
```
- Create an account at https://ui.adsabs.harvard.edu/, generate an ADS API key (https://ui.adsabs.harvard.edu/user/settings/token), and copy it for use in step 3 below.
- Edit the `config.live.yaml` file. Read the config file and edit sections as needed. At a minimum, you will need to add the `ADS_API_KEY` value.
- Install dependencies:
Create a conda environment using the provided environment.yaml file:

```
cd $HOME/kpub
conda env create -f environment.yaml
```

Or, create a venv with Python 3.13+ yourself.
- Install kpub as a module (this installs all dependencies from `pyproject.toml`, including the classifier's ML deps):

```
pip install .
```

Optionally, install the machine learning 'classifier' module capabilities with:

```
pip install .[classifier]
```

This is separate from the Flask application and should not be installed unless you want to use the transformer classification model when adding papers.
Add `--help` to any command below to get full usage instructions.
- `kpub update` adds new publications by searching ADS (interactive);
- `kpub add` adds a publication using its ADS bibcode;
- `kpub delete` deletes a publication using its ADS bibcode;
- `kpub import` imports publications from a JSON file;
- `kpub export` exports publications to a JSON file (or CSV with the `-csv` flag) and saves to the data/ dir;
- `kpub plot` creates a visualization of the database and saves to the data/plots/ dir;
- `kpub plot_data` creates the data needed to generate plots (used by the frontend);
- `kpub stats` creates publication stats in markdown format and saves to the data/output/ dir;
- `kpub spreadsheet` exports the publications to an Excel spreadsheet;
- `kpub update_citations` updates the cite_read_boost, citation_count, and citation fields for a given year.
The main benefit is that you can call these functions within a Python script:

```
import kpub
kpub.kpub_update('2025-07')
```

Otherwise, you can call the functions from the command line.
Search ADS by pubdate month or year for new articles and add them without user input:

```
python -m kpub update 2015-07
python -m kpub update 2015
```

Update plots and stats files:

```
python -m kpub plot
python -m kpub stats
```

Add a new article to the database interactively using its bibcode:

```
python -m kpub add 2015arXiv150204715F -interactive
```

Remove an article using its bibcode:

```
python -m kpub delete 2015ApJ...800...46B
```
For example output, see the data/output/ sub-directory in this repository.
This tool visualizes the data in a table and interactive plots. Data is served by a Flask server: https://github.com/KeckObservatory/OperationAPIs/
- Navigate to kpub-viewer and build the project with npm:

```
cd kpub-viewer
npm run build
```

- Make install and kdeploy to www3:

```
cd -
make install
kdeploy -a /www/public/kpub
```
This new Python library version was created by Tyler Coda (tcoda at keck.hawaii.edu). The configurable version was created by Josh Riley (jriley at keck.hawaii.edu). The original Kepler/K2-specific version was created by Geert Barentsen (geert.barentsen at nasa.gov).
Automatic publication classification lives under src/classifier/ (importable as kpub.classifier). It trains and runs transformer and LLM classifiers over articles in MongoDB, merging full text from data/pubs/full_text/.
Articles live in MongoDB. Full text stays on the filesystem (data/pubs/full_text/). Predictions write back to MongoDB as flat fields (ilabel, keck_score, idrp, drp_reason, ikoa, koa_reason).
Two collections are in play: articles is the production collection, and test_articles is a copy used for development and classifier experiments. The examples below use --collection test_articles; swap in articles when writing back to production.
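As a concrete illustration of the layout above, the sketch below shows the full-text path convention and the shape of a document after prediction. This is not kpub's actual code: the field names come from the text above, but the `full_text_path` helper and the document values are hypothetical.

```python
# Illustrative sketch of the data layout described above (not kpub's actual code).
from pathlib import Path

def full_text_path(root: str, bibcode: str) -> Path:
    # ADS bibcodes start with the four-digit publication year, matching the
    # data/pubs/full_text/{year}/{bibcode}.txt layout.
    return Path(root) / bibcode[:4] / f"{bibcode}.txt"

# Shape of a MongoDB article document after prediction (flat fields named in
# the text above; the values here are made up).
doc = {
    "bibcode": "2015ApJ...800...46B",
    "ilabel": 1,                # keck classification (transformer)
    "keck_score": 0.97,
    "idrp": 0,
    "drp_reason": "No DRP usage found.",
    "ikoa": 1,
    "koa_reason": "KOA acknowledged in text.",
}

print(full_text_path("data/pubs/full_text", doc["bibcode"]))
# -> data/pubs/full_text/2015/2015ApJ...800...46B.txt
```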
The classifier is installed as part of pip install . (see Installation above). Its dependencies (torch, transformers, sentence-transformers, PyMuPDF, ollama, etc.) come in with that command.
Mongo connection details are read from src/config/config.live.yaml (the same file the autokpub app uses). Make sure the "kpub" block (server/port/user/pwd) is filled in.
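A hypothetical sketch of that block (only the field names server/port/user/pwd come from the text above; the values and exact nesting are placeholders, so check your own config.live.yaml for the real structure):

```yaml
# Hypothetical values -- verify against your actual config.live.yaml
kpub:
  server: localhost
  port: 27017
  user: kpub_user
  pwd: CHANGEME
```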
Copy the tracked config templates to their local (gitignored) counterparts and edit as needed. These live inside the classifier package:
```
cp src/classifier/config/models.default.yaml src/classifier/config/models.yaml
cp src/classifier/config/article_subset.default.yaml src/classifier/config/article_subset.yaml
```

Runtime outputs (experiment logs) go to classifier_runtime/ (gitignored). Trained model checkpoints are written to data/models/trained/.
Reads bibcodes and links_data from MongoDB, downloads PDFs, extracts text to data/pubs/full_text/{year}/{bibcode}.txt.
```
python -m kpub.classifier.data.fetch_full_text --collection test_articles --year 2024
python -m kpub.classifier.data.fetch_full_text --collection test_articles --start-year 2020 --end-year 2025
```

Loads articles from MongoDB, derives labels from the affiliation field ("keck" → positive, everything else → negative), merges full text from the filesystem, and runs the standard train/test split. Docs without an affiliation set are skipped and reported in the run summary.
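The label derivation described above can be sketched as follows. This is an illustrative stand-in, not kpub's actual code, and the document shape is an assumption:

```python
# Illustrative stand-in for the label derivation described above (not kpub's code).
def derive_label(doc: dict):
    """Return 1 for a "keck" affiliation, 0 for anything else, None if unset."""
    affiliation = doc.get("affiliation")
    if affiliation is None:
        return None  # doc is skipped and reported in the run summary
    return 1 if affiliation == "keck" else 0

docs = [
    {"bibcode": "a", "affiliation": "keck"},   # positive
    {"bibcode": "b", "affiliation": "other"},  # negative
    {"bibcode": "c"},                          # no affiliation set: skipped
]
print([derive_label(d) for d in docs])  # [1, 0, None]
```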
```
python -m kpub.classifier.scripts.train transformer --year 2000-2023 --save
python -m kpub.classifier.scripts.train transformer --no-test --save   # train on all labeled data
```

Available models: transformer and llm. Hyperparameters live in src/classifier/config/models.yaml.
Train a base model once on the full reviewed history, then fine-tune on newly reviewed years as they arrive. The reviewed-subset filter lives in src/classifier/config/article_subset.yaml (copied from its .default.yaml sibling; currently: 2020–2024 excluding from_broad_query=true; 2025+ completely included).
```
# 1. Base model — full reviewed history
python -m kpub.classifier.scripts.train transformer --year 2000-2025 --collection articles --save

# 2. Fine-tune once 2025 data is reviewed
python -m kpub.classifier.scripts.train transformer --year 2020-2025 --collection articles \
    --save --finetune [BASE MODEL] --subset-articles
```

When 2026 data is reviewed, extend the year range (e.g. `--year 2020-2026`) and update src/classifier/config/article_subset.yaml to match, then fine-tune again.
Loads articles from MongoDB, merges full text from the filesystem, runs classifiers, and writes predictions back to MongoDB. Three tasks are supported:
- `keck` — Transformer classifier. Writes `ilabel`, `keck_score`.
- `drp` — LLM classifier on keck-positive papers. Writes `idrp`, `drp_reason`.
- `koa` — LLM classifier. Writes `ikoa`, `koa_reason`.
```
# Keck classification (transformer)
python -m kpub.classifier.scripts.predict 2024 --collection test_articles --task keck

# DRP classification (LLM, runs on keck-positive papers only)
python -m kpub.classifier.scripts.predict 2024 --collection test_articles --task drp

# KOA classification (LLM)
python -m kpub.classifier.scripts.predict 2024 --collection test_articles --task koa
```

Year ranges (e.g. 2020-2024) are supported. Use `--limit N` to cap the number of docs classified for quick iteration. The drp and koa tasks require a running Ollama host (defaults to http://localhost:11434); override with `--llm-host` and `--llm-model`.
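For reference, a year-range argument like "2020-2024" expands to a simple inclusive range of years. This helper is illustrative only, not kpub's actual parser:

```python
# Illustrative parser for the year arguments above (not kpub's actual code).
def parse_years(arg: str) -> list[int]:
    if "-" in arg:
        start, end = arg.split("-", 1)
        return list(range(int(start), int(end) + 1))  # inclusive on both ends
    return [int(arg)]

print(parse_years("2020-2024"))  # [2020, 2021, 2022, 2023, 2024]
print(parse_years("2024"))       # [2024]
```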
Before running `--task keck`: the transformer checkpoint is date-stamped and hard-coded as `DEFAULT_TRANSFORMER` at the top of `src/classifier/scripts/predict.py`. Update that constant to point at the checkpoint you want to use (under `data/models/trained/`) before running, or pass `--model-path` explicitly on the command line.
This tool is made possible thanks to the efforts of Geert Barentsen who wrote the original version of kpub for Kepler/K2. Thanks also to NASA ADS for providing a web API to their database.