This repository contains code to train and use a BERT model to assign CAH3 codes to degrees.
Start by setting up a virtual environment and installing dependencies:

    python -m venv .venv
    . ./.venv/bin/activate
    pip install -r requirements.txt

To train the model, run the cells in the `train.ipynb` file via `jupyter notebook`. Training took about an hour on my MacBook Pro. You may need to adjust the `model.to()` call if your system doesn't support Metal Performance Shaders.
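
As a rough sketch of how that adjustment could look, the helper below picks a device string based on what PyTorch reports as available. The function names `pick_device` and `detect_device` are hypothetical and not part of this repository, and the sketch assumes the notebook currently passes a hard-coded device to `model.to()`.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred PyTorch device string given what the system supports."""
    if mps_available:
        return "mps"   # Apple Silicon GPU via Metal Performance Shaders
    if cuda_available:
        return "cuda"  # NVIDIA GPU
    return "cpu"       # always available, but slower


def detect_device() -> str:
    """Query PyTorch for backend support (requires torch to be installed)."""
    import torch  # imported lazily so pick_device stays usable without it

    return pick_device(
        torch.backends.mps.is_available(),
        torch.cuda.is_available(),
    )
```

In the notebook, something like `model.to(detect_device())` could then replace the hard-coded device. Expect training on the CPU fallback to take considerably longer than the hour quoted above.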
Run `tensorboard` while training to see loss graphs.
Once the model is trained you can save it and load it; see the last cell in the notebook, and `app.py`, for examples.
To run inference, invoke `python app.py infer --input {input_file} --model {model_path}`.
The output will be a CSV with columns for degree name (the input), CAH3 code (the output) and the human-readable CAH3 category.
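
For a quick look at the results, a small script along these lines could tally how many degrees landed in each category. This is only a sketch: the header names (`degree`, `cah3_code`, `cah3_category`) are assumptions, so check the first line of your own output file, and the sample rows are illustrative rather than real model output.

```python
import csv
import io
from collections import Counter

# Hypothetical header name; adjust to match the actual output file.
CATEGORY_COLUMN = "cah3_category"


def count_by_category(csv_file) -> Counter:
    """Count rows per human-readable CAH3 category in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[CATEGORY_COLUMN] for row in reader)


# Illustrative rows only, standing in for a real inference output file.
SAMPLE = """degree,cah3_code,cah3_category
BSc Computer Science,CAH11-01-01,computer science
MSc Computing,CAH11-01-01,computer science
BSc Mathematics,CAH09-01-01,mathematics
"""

counts = count_by_category(io.StringIO(SAMPLE))
for category, n in counts.most_common():
    print(f"{n:>4}  {category}")
```

With a real output file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")` inside a `with` block.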
The `/data` folder contains the training data:
- manually mapped codes from ILR, which require some cleaning (see `CAHData`)
- the original HECoS > CAH mapping, which exhaustively lists "official" degrees
- 4000 rows of mappings from "unofficial" degrees, produced by GPT-4
To generate more GPT data, run `python app.py gpt --count 4000`.