LlamBERT implements a hybrid approach to text classification that leverages LLMs to annotate a small subset of a large, unlabeled corpus and uses the results to fine-tune transformer encoders such as BERT and RoBERTa. The strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Meta-Thesaurus, where it is used to efficiently extract subontologies from the UMLS graph using natural language queries. This repository implements the method described in the research paper LlamBERT: Large-scale low-cost data annotation in NLP.
Given a large corpus of unlabeled natural language data, LlamBERT follows these steps (a code sketch of steps 2-4 appears after the list):
- Annotate a reasonably sized, randomly selected subset of the corpus utilizing an LLM and a prompt reflecting the labeling criteria;
- Parse the LLM responses into the desired categories;
- Discard any data that fails to classify into any of the specified categories;
- Employ the resulting labels to perform supervised fine-tuning on a BERT classifier;
- Apply the fine-tuned BERT classifier to annotate the original unlabeled corpus.
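For concreteness, here is a minimal sketch of steps 2-4 with the Hugging Face `datasets` and `transformers` libraries, assuming binary sentiment labels as in the IMDb experiments; the function names and hyperparameters are illustrative, not the repository's actual code.

```python
import re
from typing import Optional

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def parse_response(response: str) -> Optional[int]:
    """Step 2: map a raw LLM answer onto {0, 1}; None if it matches neither class."""
    answer = response.strip().lower()
    if re.search(r"\bpositive\b", answer):
        return 1
    if re.search(r"\bnegative\b", answer):
        return 0
    return None

def build_dataset(texts, responses) -> Dataset:
    """Steps 2-3: parse the LLM responses and discard unparseable examples."""
    pairs = [(t, parse_response(r)) for t, r in zip(texts, responses)]
    pairs = [(t, y) for t, y in pairs if y is not None]
    return Dataset.from_dict(
        {"text": [t for t, _ in pairs], "label": [y for _, y in pairs]}
    )

def fine_tune(dataset: Dataset, model_name: str = "roberta-large"):
    """Step 4: supervised fine-tuning of an encoder on the LLM-derived labels."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
    args = TrainingArguments(output_dir="bert_classifier", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=tokenized, tokenizer=tokenizer).train()
    return model, tokenizer
```

The fine-tuned classifier can then be applied to the full unlabeled corpus (step 5), which is orders of magnitude cheaper than annotating every example with the LLM.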
Test accuracy (%) on the IMDb review dataset:
| BERT model | Baseline train | LlamBERT train | LlamBERT train&extra | Combined extra+train |
|---|---|---|---|---|
| distilbert-base | 91.23 | 90.77 | 92.12 | 92.53 |
| bert-base | 92.35 | 91.58 | 92.76 | 93.47 |
| bert-large | 94.29 | 93.31 | 94.07 | 95.03 |
| roberta-base | 94.74 | 93.53 | 94.28 | 95.23 |
| roberta-large | 96.54 | 94.83 | 94.98 | 96.68 |
Test accuracy (%) on the UMLS data, with 95% confidence intervals measured over 5 different random seeds:
| Model | Baseline | LlamBERT | Combined |
|---|---|---|---|
| bert-large | 94.84 (±0.25) | 95.70 (±0.21) | 96.14 (±0.42) |
| roberta-large | 95.00 (±0.18) | 96.02 (±0.12) | 96.64 (±0.14) |
| BiomedBERT-large | 96.72 (±0.17) | 96.66 (±0.13) | 96.92 (±0.10) |
- Llama-2-7b-chat: Requires a single A100 40GB GPU.
- Llama-2-70b-chat: Requires four A100 80GB GPUs.
- gpt-4-0613: Requires OpenAI API access.
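As an illustration of step 1 with the last option, here is a minimal annotation sketch against the OpenAI chat completions API; the prompt wording and the `annotate` helper are hypothetical, not the repository's actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(review: str) -> str:
    """Ask the LLM to label one example; the raw answer is parsed downstream."""
    completion = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Decide whether the movie review below is 'positive' "
                        "or 'negative'. Answer with a single word."},
            {"role": "user", "content": review},
        ],
    )
    return completion.choices[0].message.content
```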
Create the conda environment:
```bash
conda env create --file=environment.yml
```
Run a script on a single GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python <script-name>.py <options>
```
Pass `-h` to any script for help on its options (press `q` to exit the help screen).
If you use this code in your research, please cite the corresponding paper:
```bibtex
@article{csanady2024llambert,
  title={LlamBERT: Large-scale low-cost data annotation in NLP},
  author={Csan{\'a}dy, B{\'a}lint and Muzsai, Lajos and Vedres, P{\'e}ter and N{\'a}dasdy, Zolt{\'a}n and Luk{\'a}cs, Andr{\'a}s},
  journal={arXiv preprint arXiv:2403.15938},
  year={2024}
}
```
Authors:
- Bálint Csanády (csbalint@protonmail.ch)
- Lajos Muzsai (muzsailajos@protonmail.com)
- Péter Vedres (vedrespeter0000@gmail.com)
- Zoltán Nádasdy (zoltan@utexas.edu)
- András Lukács (andras.lukacs@ttk.elte.hu)
This project is licensed under the MIT License - see the LICENSE file for details.

