This code provides a way to index and search the text files for the Diagnosis London data set. A version of TextRank (via the `summa` package) is used to create the index.

- Create a `text` folder in this directory and place the OCR text files in it
- Run `python index_text.py`

This will result in an `index_normalized.json` file being created.
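
The layout of `index_normalized.json` is not described here. As a rough sketch only, assuming the index maps each keyword to per-file scores scaled into the 0 to 1 range, it could be assembled along these lines. `build_index`, `extract`, and `toy_extract` are hypothetical names introduced for illustration; `extract` stands in for any keyword extractor that returns `(keyword, score)` pairs, which is the shape `summa`'s `keywords.keywords(text, scores=True)` returns.

```python
import json
from collections import defaultdict


def build_index(texts, extract):
    """Map each keyword to {file_name: score}, scaled so the top score is 1.

    texts:   dict of file name -> OCR text
    extract: callable returning (keyword, score) pairs for a text
             (hypothetical stand-in for a TextRank keyword extractor)
    """
    index = defaultdict(dict)
    for file_name, text in texts.items():
        for keyword, score in extract(text):
            index[keyword][file_name] = score

    # Assumed normalization: divide by the largest score in the whole index.
    top = max((s for files in index.values() for s in files.values()), default=0)
    if top > 0:
        for files in index.values():
            for file_name in files:
                files[file_name] /= top
    return dict(index)


def toy_extract(text):
    # Toy extractor for illustration only: score = raw word count.
    words = text.lower().split()
    return [(w, float(words.count(w))) for w in sorted(set(words))]


index = build_index({"a.txt": "cholera cholera water", "b.txt": "water"}, toy_extract)
print(json.dumps(index, indent=2, sort_keys=True))
```

Passing the extractor in as an argument keeps the sketch independent of any one keyword library; the real `index_text.py` may structure and normalize its index differently.
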

- Create a CSV file containing one keyword/phrase per line
- Edit line 9 of `build_subject_set.py` to be a list of input files
- Edit line 4 to reflect the maximum number of subjects the search should produce
- Download `diagnosis-london-subjects.csv` from the Project Builder (so we can get the subject ID for any existing subjects on the project)
- Run `python build_subject_set.py`

This will result in one `subject_file_names_<original input file name>.csv` file for each input CSV, each with three columns:

- `file_name`: The name of the text file
- `keyword_score`: The relevance score (between 0 and 1)
- `subject_id`: The Panoptes subject ID, if the text file has already been uploaded to the system
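
To make the search step and the output columns concrete, here is a minimal sketch of how one keyword might be looked up, capped at a maximum number of subjects, and written out. It is not the code in `build_subject_set.py`: the `search` function, the index layout (keyword to per-file scores), and the sample data are all assumptions made for illustration.

```python
import csv
import io


def search(index, keyword, subject_ids, max_subjects):
    """Return rows (file_name, keyword_score, subject_id) for one keyword."""
    scores = index.get(keyword, {})
    # Highest score first; ties broken by file name for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [
        (file_name, score, subject_ids.get(file_name, ""))
        for file_name, score in ranked[:max_subjects]
    ]


# Made-up data for illustration only.
index = {"cholera": {"a.txt": 0.9, "b.txt": 0.4, "c.txt": 0.7}}
subject_ids = {"a.txt": "1001"}  # files already uploaded to the project

rows = search(index, "cholera", subject_ids, max_subjects=2)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["file_name", "keyword_score", "subject_id"])
writer.writerows(rows)
print(out.getvalue())
```

In this sketch, files with no existing subject get an empty `subject_id`; how the actual script represents missing IDs is not stated above.
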