This repository contains the reference code and dataset for the paper HWD: A Novel Evaluation Score for Styled Handwritten Text Generation. If you find it useful, please cite it as:
@inproceedings{pippi2023hwd,
title={{HWD: A Novel Evaluation Score for Styled Handwritten Text Generation}},
author={Pippi, Vittorio and Quattrini, Fabio and Cascianelli, Silvia and Cucchiara, Rita},
booktitle={Proceedings of the British Machine Vision Conference},
year={2023}
}
git clone https://github.com/aimagelab/HWD.git
cd HWD
python setup.py sdist bdist_wheel
pip install .
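To confirm the installation succeeded, you can try importing one of the score classes (a quick check using only the import shown later in this README):
python -c "from hwd.scores import HWDScore"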
Detailed instructions for generating styled handwritten text will be added in a future update.
This section describes how to evaluate the quality of styled handwritten text generation using various scores.
Organize your data in the following folder structure:
dataset/
├── author1/
│ ├── sample1.png
│ ├── sample2.png
│ └── ...
├── author2/
│ ├── sample1.png
│ ├── sample2.png
│ └── ...
└── ...
Each author's folder should contain .png images of their handwriting. To evaluate the Character Error Rate (CER), you must include a transcriptions.json file in the dataset. This file should be a dictionary where:
- The key is the relative path to the image.
- The value is the ground-truth text contained in the image.
Example structure of transcriptions.json:
{
"author1/sample1.png": "Hello world",
"author1/sample2.png": "Handwritten text generation",
"author2/sample1.png": "British Machine Vision Conference",
"author2/sample2.png": "Artificial intelligence"
}
The transcriptions.json file is only required for CER evaluation and will be automatically parsed by the FolderDataset class if present in the dataset directory. Ensure all images referenced in the transcriptions.json file are in the corresponding folders.
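As a sanity check before running the CER, a short script along these lines (a minimal sketch using only the standard library; the dataset path is a placeholder) can verify that every entry in transcriptions.json points to an existing image:
import json
from pathlib import Path

dataset_root = Path('/path/to/images/fake')  # placeholder dataset location
transcriptions = json.loads((dataset_root / 'transcriptions.json').read_text())

# Report every referenced image that is missing from the author folders
missing = [rel for rel in transcriptions if not (dataset_root / rel).exists()]
print(f'{len(missing)} missing images' if missing else 'All referenced images found')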
Once your dataset is prepared, you can use the FolderDataset class to load images for evaluation:
from hwd.datasets import FolderDataset
fakes = FolderDataset('/path/to/images/fake')
reals = FolderDataset('/path/to/images/real')
Some evaluation metrics depend on whether the dataset is kept as full lines or unfolded. The unfold operation divides each image into square segments, preserving the original height: an image of height H and width W is split into approximately W/H crops of size H×H.
fakes = fakes.unfold()
reals = reals.unfold()
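For intuition, the arithmetic behind unfolding can be sketched as follows (an illustrative approximation with a hypothetical sample path, not the library's internal implementation):
from PIL import Image

# Illustrative only: estimate how many square crops an unfolded line yields
img = Image.open('dataset/author1/sample1.png')  # hypothetical sample path
w, h = img.size
num_crops = max(1, w // h)  # each crop is h x h, preserving the original height
print(f'{w}x{h} line -> about {num_crops} square crops of {h}x{h}')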
For FID and KID, images are cropped by default, as described in the paper. If you wish to evaluate using the entire line instead of cropping, you can unfold the dataset using the above method.
The HWD is the primary evaluation score introduced in the paper. It compares two datasets (reference and generated) by resizing images to a height of 32 pixels and using the Euclidean distance between their features.
from hwd.scores import HWDScore
hwd = HWDScore(height=32)
score = hwd(fakes, reals)
print(f"HWD Score: {score}")
The FID compares the distributions of two datasets in the feature space of an InceptionNet pretrained on ImageNet. By default, images are cropped before evaluation.
from hwd.scores import FIDScore
fid = FIDScore(height=32)
score = fid(fakes, reals)
print(f"FID Score: {score}")
The BFID is a variant of the FID that operates on binarized images. This score is computed by applying Otsu's thresholding before evaluation.
from hwd.scores import BFIDScore
bfid = BFIDScore(height=32)
score = bfid(fakes, reals)
print(f"BFID Score: {score}")
The KID measures differences between sets of images by using the maximum mean discrepancy (MMD). By default, images are cropped before evaluation.
from hwd.scores import KIDScore
kid = KIDScore(height=32)
score = kid(fakes, reals)
print(f"KID Score: {score}")
The BKID is a variant of the KID that operates on binarized images. This score is computed by applying Otsu's thresholding before evaluation.
from hwd.scores import BKIDScore
bkid = BKIDScore(height=32)
score = bkid(fakes, reals)
print(f"BKID Score: {score}")
The CER evaluates the character-level accuracy of generated handwritten text images by comparing the recognized text against the ground-truth transcriptions. By default, the microsoft/trocr-base-handwritten model is used.
from hwd.scores import CERScore
# Load datasets
fakes = FolderDataset('/path/to/images/fake') # Ensure this folder contains transcriptions.json
# Initialize CER score
cer = CERScore(height=64)
# Compute CER
score = cer(fakes)
print(f"CER Score: {score}")
The LPIPS measures perceptual differences between images by using feature activations from a deep network. The LPIPS score in this repo uses a custom implementation with the same backbone as HWD.
from hwd.scores import LPIPSScore
lpips = LPIPSScore(height=32)
score = lpips(fakes, reals)
print(f"LPIPS Score: {score}")
The I-LPIPS evaluates intra-image consistency by comparing the style coherence of crops taken from the same sample. This is also a custom implementation using the same backbone as HWD.
from hwd.scores import IntraLPIPSScore
ilpips = IntraLPIPSScore(height=32)
score = ilpips(fakes)
print(f"I-LPIPS Score: {score}")