ocr-correction

Post-processing OCR errors with seq2seq models

Running OpenNMT

Preprocessing:

    python preprocess.py -train_src ../data/open_nmt_train_input.txt -train_tgt ../data/open_nmt_train_output.txt -valid_src ../data/open_nmt_devel_input.txt -valid_tgt ../data/open_nmt_devel_output.txt -save_data ../data/open/open -src_seq_length 10000 -tgt_seq_length 10000 -src_seq_length_trunc 500 -tgt_seq_length_trunc 500

Training:

    python train.py -data ../data/open/open -save_model ../models/open/open -gpuid 0
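The source and target files given to preprocess.py are plain text with one sequence per line, and OpenNMT splits each line into tokens on spaces. Since the length limit here is stated in characters, the data is presumably character-level. The sketch below shows one way such lines could be produced. It is an illustration only: the `<space>` placeholder and both function names are assumptions, not necessarily what this repository's scripts use.

```python
SPACE_TOKEN = "<space>"  # hypothetical placeholder for a literal space


def to_char_tokens(line: str) -> str:
    """Turn a text line into space-separated character tokens."""
    text = line.rstrip("\n")
    return " ".join(SPACE_TOKEN if ch == " " else ch for ch in text)


def from_char_tokens(tokens: str) -> str:
    """Invert to_char_tokens: join the tokens back into plain text."""
    return "".join(" " if tok == SPACE_TOKEN else tok for tok in tokens.split(" "))


print(to_char_tokens("tbe cat"))                    # t b e <space> c a t
print(from_char_tokens("t b e <space> c a t"))      # tbe cat
```

With this format, one token corresponds to one character, so a truncation length of 500 tokens is a limit of 500 characters.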

Sequences longer than 500 characters are truncated (-src_seq_length_trunc and -tgt_seq_length_trunc); 99% of the data is within this limit.
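A coverage figure like this can be checked on other data before choosing a truncation length. The following is a minimal sketch, assuming the raw (untokenized) text has one sequence per line; the function name `coverage_within_limit` is hypothetical and not part of this repository.

```python
def coverage_within_limit(lines, limit=500):
    """Return the fraction of lines whose length in characters is <= limit."""
    lengths = [len(line.rstrip("\n")) for line in lines]
    if not lengths:
        return 0.0
    return sum(length <= limit for length in lengths) / len(lengths)


# Toy data: three of the four lines fit within 500 characters.
sample = ["a" * 10, "b" * 500, "c" * 501, "d" * 20]
print(coverage_within_limit(sample))  # 0.75
```

For a real file, pass the open file object instead of the toy list. If the file is already split into space-separated characters, count tokens with `len(line.split())` rather than raw string length.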

Evaluate

The evaluation script takes two files, predictions and gold. Each file has one sentence per line.

    python3 evaluate.py --pred pred_file.txt --gold gold_file.txt
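This README does not say which metric evaluate.py reports. As an illustration of how line-aligned prediction and gold files can be compared, the sketch below computes a character error rate (total Levenshtein edit distance divided by the number of gold characters), a common measure for OCR correction. It is an assumption for illustration, not a description of evaluate.py, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(preds, golds):
    """Total edit distance divided by the total number of gold characters."""
    if len(preds) != len(golds):
        raise ValueError("prediction and gold files must have the same number of lines")
    total_edits = sum(edit_distance(p, g) for p, g in zip(preds, golds))
    total_chars = sum(len(g) for g in golds)
    return total_edits / total_chars if total_chars else 0.0


preds = ["the cat sat", "on tbe mat"]
golds = ["the cat sat", "on the mat"]
print(character_error_rate(preds, golds))  # 1 edit over 21 gold characters
```

To apply it to files, read both with one sentence per line and strip the trailing newlines before passing the two lists in.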

TurkuNLP/ocr-correction
