LlamBERT implements a hybrid approach to text classification that leverages LLMs to annotate a small subset of a large, unlabeled corpus and uses the results to fine-tune transformer encoders such as BERT and RoBERTa. The strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Meta-Thesaurus, where it is used to efficiently extract subontologies from the UMLS graph using natural language queries. This repository implements the method described in the research paper LlamBERT: Large-scale low-cost data annotation in NLP.
Given a large corpus of unlabeled natural language data, LlamBERT follows these steps (a code sketch of steps 2-4 appears after the list):
- Annotate a reasonably sized, randomly selected subset of the corpus utilizing an LLM and a prompt reflecting the labeling criteria;
- Parse the LLM responses into the desired categories;
- Discard any data that fails to classify into any of the specified categories;
- Employ the resulting labels to perform supervised fine-tuning on a BERT classifier;
- Apply the fine-tuned BERT classifier to annotate the original unlabeled corpus.
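For concreteness, here is a minimal sketch of steps 2-4 with the Hugging Face `datasets` and `transformers` libraries, assuming binary sentiment labels as in the IMDb experiments; the function names and hyperparameters are illustrative, not the repository's actual code.

```python
import re
from typing import Optional

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def parse_response(response: str) -> Optional[int]:
    """Step 2: map a raw LLM answer onto {0, 1}; None if it matches neither class."""
    answer = response.strip().lower()
    if re.search(r"\bpositive\b", answer):
        return 1
    if re.search(r"\bnegative\b", answer):
        return 0
    return None

def build_dataset(texts, responses) -> Dataset:
    """Steps 2-3: parse the LLM responses and discard unparseable examples."""
    pairs = [(t, parse_response(r)) for t, r in zip(texts, responses)]
    pairs = [(t, y) for t, y in pairs if y is not None]
    return Dataset.from_dict(
        {"text": [t for t, _ in pairs], "label": [y for _, y in pairs]}
    )

def fine_tune(dataset: Dataset, model_name: str = "roberta-large"):
    """Step 4: supervised fine-tuning of an encoder on the LLM-derived labels."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
    args = TrainingArguments(output_dir="bert_classifier", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=tokenized, tokenizer=tokenizer).train()
    return model, tokenizer
```

The fine-tuned classifier can then be applied to the full unlabeled corpus (step 5), which is orders of magnitude cheaper than annotating every example with the LLM.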
Test accuracy (%) on the IMDb review dataset:
| BERT model | Baseline train | LlamBERT train | LlamBERT train&extra | Combined extra+train |
|---|---|---|---|---|
| distilbert-base | 91.23 | 90.77 | 92.12 | 92.53 |
| bert-base | 92.35 | 91.58 | 92.76 | 93.47 |
| bert-large | 94.29 | 93.31 | 94.07 | 95.03 |
| roberta-base | 94.74 | 93.53 | 94.28 | 95.23 |
| roberta-large | 96.54 | 94.83 | 94.98 | 96.68 |
Test accuracy (%) on the UMLS data, with 95% confidence intervals measured over 5 different random seeds:
| Model | Baseline | LlamBERT | Combined |
|---|---|---|---|
| bert-large | 94.84 (±0.25) | 95.70 (±0.21) | 96.14 (±0.42) |
| roberta-large | 95.00 (±0.18) | 96.02 (±0.12) | 96.64 (±0.14) |
| BiomedBERT-large | 96.72 (±0.17) | 96.66 (±0.13) | 96.92 (±0.10) |
- Llama-2-7b-chat: Requires a single A100 40GB GPU.
- Llama-2-70b-chat: Requires four A100 80GB GPUs.
- gpt-4-0613: Requires OpenAI API access.
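As an illustration of step 1 with the last option, here is a minimal annotation sketch against the OpenAI chat completions API; the prompt wording and the `annotate` helper are hypothetical, not the repository's actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(review: str) -> str:
    """Ask the LLM to label one example; the raw answer is parsed downstream."""
    completion = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Decide whether the movie review below is 'positive' "
                        "or 'negative'. Answer with a single word."},
            {"role": "user", "content": review},
        ],
    )
    return completion.choices[0].message.content
```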
Create the conda environment:
```bash
conda env create --file=environment.yml
```
Run a script on a single GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python <script-name>.py <options>
```
Pass `-h` to any script for help on its options (press `q` to exit the help screen).
If you use this code in your research, please cite the corresponding paper:
```bibtex
@article{csanady2024llambert,
  title={LlamBERT: Large-scale low-cost data annotation in NLP},
  author={Csan{\'a}dy, B{\'a}lint and Muzsai, Lajos and Vedres, P{\'e}ter and N{\'a}dasdy, Zolt{\'a}n and Luk{\'a}cs, Andr{\'a}s},
  journal={arXiv preprint arXiv:2403.15938},
  year={2024}
}
```
Authors:
- Bálint Csanády (csbalint@protonmail.ch)
- Lajos Muzsai (muzsailajos@protonmail.com)
- Péter Vedres (vedrespeter0000@gmail.com)
- Zoltán Nádasdy (zoltan@utexas.edu)
- András Lukács (andras.lukacs@ttk.elte.hu)
This project is licensed under the MIT License - see the LICENSE file for details.

