Generate simulated handwritten texts to use as supplementary data for machine learning tasks.
- OpenCV
- Python 3
- Java JRE
In order to generate synthetic images, a few prerequisites are necessary. These scripts depend on access to a set of files:
- Background images (i.e. the pages the text will be placed on)
- Handwriting samples
- Stains to place on the pages
By default, these need to be placed in the background_images/, handwriting_images/, and stain_imges/ folders relative to the root of this repository.
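Before running the generator, it can help to confirm that the input folders are populated. The following is a minimal sketch and not part of this repository: the helper name `missing_inputs` is hypothetical, and the folder names are copied from the defaults listed above.

```python
from pathlib import Path

# Default folder names as listed in this README; adjust if your options differ.
REQUIRED_DIRS = ("background_images", "handwriting_images", "stain_imges")


def missing_inputs(repo_root, required=REQUIRED_DIRS):
    """Return the required folders that are absent or contain no files.

    Hypothetical helper for illustration; the generator itself may
    validate its inputs differently.
    """
    root = Path(repo_root)
    missing = []
    for name in required:
        folder = root / name
        # A folder counts as usable only if it exists and holds at least one file.
        if not folder.is_dir() or not any(p.is_file() for p in folder.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_inputs(".")` from the repository root returns an empty list when all three folders contain at least one file.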
To get started, the following are good sources of representative data for each type.
A good dataset of handwriting samples is the IAM Handwriting Database, which is freely available for non-commercial research use. Note that downloading the database requires registration.
N.B. All handwriting samples should be black text on a white background.
A good collection of stains is provided by DIVADid itself. The small stains can be downloaded here and the larger stains can be downloaded here.
Once the required data files are in place, a simple demonstration of running the script is
./generate_images.py 10
which will generate 10 synthetic images in a folder (by default, tmp/).
Options can be specified by editing the options.ini file or passed in on the command line. For example,
./generate_images.py --output_dir=~/synthetic_images 10
will generate 10 images and save them in the ~/synthetic_images directory.
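One common way to combine a configuration file with command-line overrides is to load the file first and use its values as argument defaults. The sketch below illustrates that pattern only; it is not the code in generate_images.py. The section name `general` and the function `load_options` are assumptions, and the real layout of options.ini may differ.

```python
import argparse
import configparser


def load_options(ini_text, argv):
    """Merge INI defaults with command-line arguments (the command line wins).

    Illustrative only: the [general] section is a hypothetical layout,
    not necessarily the one used by this repository's options.ini.
    """
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    defaults = dict(config["general"]) if config.has_section("general") else {}

    parser = argparse.ArgumentParser(description="Generate synthetic images")
    parser.add_argument("count", type=int, help="number of images to generate")
    # The value from the file becomes the default, so a flag overrides it.
    parser.add_argument("--output_dir", default=defaults.get("output_dir", "tmp/"))
    return parser.parse_args(argv)
```

With this arrangement, `load_options("[general]\noutput_dir = tmp/\n", ["10"])` keeps the value from the file, while adding `--output_dir=/data/out` to the argument list replaces it.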
There are three high-level steps to the process of generating these synthetic images.
- Add degradations to paper images
- Alpha-blend text onto the paper images
- Further degrade images
See DIVADid for information on how the degradations are performed, as DIVADid is used for all degradations.
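The alpha-blending step can be illustrated independently of DIVADid. The following is a minimal sketch, assuming grayscale 8-bit images of equal size and using the darkness of the handwriting sample as the alpha channel, which is why black text on a white background matters. The function `alpha_blend_text` and its `ink` parameter are hypothetical and do not necessarily match how this repository performs the blend.

```python
import numpy as np


def alpha_blend_text(paper, text, ink=0.0):
    """Blend a black-on-white grayscale text image onto a paper image.

    Both inputs are uint8 arrays of the same shape. White text pixels are
    treated as fully transparent and black pixels as fully opaque.
    This is an assumed formulation, not the repository's implementation.
    """
    paper_f = paper.astype(np.float32)
    # Darkness of the text image acts as alpha: 0.0 for white, 1.0 for black.
    alpha = 1.0 - text.astype(np.float32) / 255.0
    # Standard alpha compositing of the ink colour over the paper.
    blended = alpha * ink + (1.0 - alpha) * paper_f
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

Images loaded with OpenCV via `cv2.imread(path, cv2.IMREAD_GRAYSCALE)` are already uint8 NumPy arrays and can be passed in directly, provided the text image has been resized or cropped to the shape of the paper image.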