Diagnosis London text search

This code indexes and searches the text files of the Diagnosis London data set. The index is built with a version of TextRank (via the Summa package).

Make the index

Create a text folder in this directory and place the OCR text files in it
Run python index_text.py

This creates an index_normalized.json file.
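
The indexing step can be pictured with a minimal sketch. This is not the code in index_text.py: the function names, the .txt file extension, the per-file scaling of scores to the 0 to 1 range, and the keyword-to-files layout of the index are all assumptions made for illustration. The keyword extractor is passed in as a function so the sketch runs without Summa installed.

```python
import json
from pathlib import Path


def build_index(text_dir, extract_keywords):
    """Map each keyword to {file_name: score}, scaling scores to 0..1 per file.

    extract_keywords is any callable returning [(keyword, score), ...].
    The index layout is an assumption, not the format used by index_text.py.
    """
    index = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        scored = extract_keywords(text)
        if not scored:
            continue  # nothing extracted from this file
        top = max(score for _, score in scored)
        for keyword, score in scored:
            normalized = score / top if top > 0 else 0.0
            index.setdefault(keyword.lower(), {})[path.name] = normalized
    return index


def save_index(index, out_path="index_normalized.json"):
    """Write the index to disk as JSON."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle, indent=2, sort_keys=True)


# With Summa installed, the extractor could be wired in like this (assumed usage):
#   from summa import keywords
#   index = build_index("text", lambda text: keywords.keywords(text, scores=True))
#   save_index(index)
```

Passing the extractor as an argument keeps the file handling separate from the choice of keyword algorithm, which makes the sketch easy to try with a stand-in extractor.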

Search the text

Create a csv file containing one keyword/phrase per line
Edit line 9 of build_subject_set.py so that it lists the input csv files
Edit line 4 to reflect the maximum number of subjects the search should produce
Download the diagnosis-london-subjects.csv from the Project Builder (so we can get the subject ID for any existing subjects on the project)
Run python build_subject_set.py
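
The lookup these steps drive can be sketched as follows. This is an illustration only, not the code in build_subject_set.py: it assumes the index maps each keyword to a dictionary of file names and scores, that a file's score is the best score among the matching keywords, and that the helper names load_keywords and rank_files are hypothetical.

```python
import csv


def load_keywords(csv_path):
    """Read one keyword or phrase per line, skipping blank lines."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            row[0].strip().lower()
            for row in csv.reader(handle)
            if row and row[0].strip()
        ]


def rank_files(index, keywords, max_subjects):
    """Return up to max_subjects (file_name, score) pairs, best score first.

    Assumes index looks like {keyword: {file_name: score}}.
    """
    best = {}
    for keyword in keywords:
        for file_name, score in index.get(keyword, {}).items():
            best[file_name] = max(score, best.get(file_name, 0.0))
    # Sort by descending score, then by file name for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_subjects]
```

Here max_subjects plays the role of the limit set on line 4 of build_subject_set.py.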

Running build_subject_set.py produces one subject_file_names_<original input file name>.csv file for each input csv, each with three columns:

file_name: The name of the text file
keyword_score: The relevance score (between 0 and 1)
subject_id: The Panoptes subject ID if the text file has already been uploaded to the system.
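
A minimal sketch of how such an output file could be assembled is shown below. It is not the repository's implementation: it assumes the subjects export has subject_id and metadata columns with the metadata stored as JSON, and the metadata key "Filename" and both function names are hypothetical placeholders.

```python
import csv
import json


def load_subject_ids(subjects_csv, file_name_key="Filename"):
    """Map file names to subject IDs using the export's metadata column.

    file_name_key is a hypothetical metadata key; adjust it to the project.
    """
    lookup = {}
    with open(subjects_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            metadata = json.loads(row.get("metadata") or "{}")
            file_name = metadata.get(file_name_key)
            if file_name:
                lookup[file_name] = row["subject_id"]
    return lookup


def write_subject_file(ranked, subject_ids, out_path):
    """Write file_name, keyword_score, subject_id rows.

    subject_id is left blank for files not yet uploaded.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file_name", "keyword_score", "subject_id"])
        for file_name, score in ranked:
            writer.writerow([file_name, score, subject_ids.get(file_name, "")])
```

Leaving subject_id empty for unmatched files makes it easy to see which text files would still need to be uploaded.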

CKrawczyk/Diagnosis-London-keyword-search
