[Paper] [Doc] [Model] [Datasets] [Config, Training and Inference] [Benchmark]
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which contains only a visual model and a CTC-aligned linear classifier, and their fast inference. However, they generally exhibit lower accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, endowing it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize text images while maintaining their readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM), which integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and therefore adds no inference cost. We evaluate SVTRv2 on both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all EDTRs across these scenarios in terms of both accuracy and speed.
Recent research shows that the performance of STR models can be significantly improved by leveraging large-scale real-world datasets. However, many previous methods were trained on synthetic datasets, which fails to reflect their performance in real-world scenarios. Additionally, recent approaches have used different real-world training datasets and inconsistent evaluation protocols, making it difficult to compare their performance.
To this end, we established a unified benchmark to re-train and evaluate mainstream STR methods.
First, to evaluate the performance of STR methods across diverse scenarios, we selected Union14M-Benchmarks as the test set, which covers a variety of complex scenarios. Additionally, we report results on the six test sets (Common Benchmarks) used in previous studies. For the training set, we used the large-scale real-world training dataset Union14M-L. To avoid data leakage, we filtered out the overlapping samples between Union14M-L (training set) and Union14M-L-Benchmark (test set), resulting in the Union14M-L-Filter training dataset.
Furthermore, previous methods used inconsistent hyperparameter settings during training, which contributed to variations in their performance. To ensure reliable evaluation, we standardized key settings that significantly affect accuracy, such as the number of training epochs, data augmentation strategies, input size, and evaluation strategies. This ensures the reliability of our results.
Subsequently, we trained 24 reproduced STR methods and SVTRv2 using the Union14M-L-Filter dataset and evaluated their performance on both Common Benchmarks and Union14M-L-Benchmark.
- Unified Training Set: All models are trained from scratch on a unified dataset named Union14M-L-Filter, derived from Union14M-L by filtering out samples that overlap with the evaluation benchmarks.
- Composition of Datasets: The training dataset includes different categories of samples, categorized as Easy, Medium, Hard, Norm, and Challenging.
- Union14M-L: Contains 3,230,742 images in total.
- Union14M-L-Filter: The filtered version contains 3,224,143 images.
- Overlap with Union14M-Benchmarks: Only 6,599 images overlap between Union14M-L and the benchmark datasets used for evaluation, and these are removed (an illustrative filtering sketch follows the table below).
| | Easy | Medium | Hard | Norm | Challenging | Total |
|---|---|---|---|---|---|---|
| Union14M-L | 2,076,161 | 145,525 | 308,025 | 218,154 | 482,877 | 3,230,742 |
| Union14M-L-Filter | 2,073,822 | 144,677 | 306,771 | 217,070 | 481,803 | 3,224,143 |
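For intuition, overlap filtering of this kind can be sketched by fingerprinting image bytes. The MD5-based matching and directory names below are illustrative assumptions, not necessarily the procedure actually used to build Union14M-L-Filter:

```python
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Fingerprint an image by hashing its raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()

# Hypothetical local directories for the benchmark and training images.
bench_hashes = {md5_of(p) for p in Path("u14m_benchmark").rglob("*.jpg")}

# Keep only training images that do not byte-match any benchmark image.
kept = [p for p in Path("union14m_l").rglob("*.jpg") if md5_of(p) not in bench_hashes]
print(f"{len(kept)} training images remain after overlap filtering")
```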
- Common Test Set:
Previous methods have primarily been evaluated on this set. While it includes some irregular text, the datasets are not highly challenging (e.g., little curvature or rotation). Models trained on synthetic datasets often perform well here.
- Includes 6 subsets with regular and irregular samples.
- Example datasets: IIIT5K, SVT, IC13, IC15, SVTP, and CUTE80.
Dataset | Image Count | Characteristics |
---|---|---|
IIIT5K | 3,000 | Regular |
SVT | 647 | Regular |
IC13 | 857 | Regular |
IC15 | 1,811 | Irregular (low resolution, blurring) |
SVTP | 645 | Irregular (affine, perspective) |
CUTE80 | 288 | Irregular (curved, clear) |
- Challenging Test Set (Union14M-L-Benchmark):
Introduced to test the full capabilities of STR models. This set includes significantly more difficult text samples, featuring extreme curvature, rotation, artistic styles, overlapping text, and other challenges.
- Includes datasets such as Curve, Multi-Oriented, Artistic, Contextless, Salient, Multi-word, and General.
Dataset | Image Count | Characteristics |
---|---|---|
Curve | 2,426 | Severe curvature |
Multi-Oriented | 1,369 | Severe rotation, multi-angle, vertical directions |
Artistic | 900 | Artistic styles, often seen in logos |
Contextless | 779 | No semantic meaning, out-of-dictionary words |
Salient | 1,585 | Adjacent or overlapping text |
Multi-word | 829 | Contains multiple words |
General | 400,000 | Includes challenges like blurring, distortion |
- Special Test Sets:
Designed to evaluate specific model capabilities beyond the above datasets.
- LTB (Long Text Benchmark): Evaluates performance on long texts (25–36 characters), as models are typically trained on short texts (≤25 characters).
- OST (Occluded Scene Text): Tests the model’s ability to infer text from damaged or partially erased samples.
Dataset | Image Count | Characteristics |
---|---|---|
LTB | 3,376 | Long text (25 < length < 36) |
OST | 4,832 | Partially erased/destroyed characters |
Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Additionally, Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively.
The optimal training hyperparameters for all models are usually not fixed. However, key settings that significantly impact accuracy, such as the number of training epochs, data augmentation strategies, input size, data type, and evaluation protocols, must be strictly standardized to ensure fair and unbiased comparisons of model performance. By following these standardizations, the results can accurately reflect the true capabilities of the models, unaffected by experimental inconsistencies. The specific settings include:
Setting | Detail |
---|---|
Training Set | During training, if a sampled text image has a label longer than 25 characters, it is replaced by a randomly selected sample with text length ≤ 25, ensuring models are only exposed to short texts (length ≤ 25). |
Test Sets | For all test sets except the long-text test set (LTB), text images with text length > 25 are filtered out. Text length is calculated after removing spaces and characters outside the 94-character set. |
Input Size | Unless a method explicitly requires a dynamic size, models use a fixed input size of 32×128. If a model fails to train properly with 32×128, its original input size is used instead. The test input size matches the training size. |
Data Augmentation | All models use the data augmentation strategy employed by PARSeq. |
Training Epochs | Unless pre-training is required, all models are trained for 20 epochs. |
Optimizer | AdamW is the default optimizer. If training fails to converge with AdamW, Adam or other optimizers are used. |
Batch Size | The maximum batch size for all models is 1024. If single-GPU training is not feasible, 2 GPUs (512 per GPU) or 4 GPUs (256 per GPU) are used. If 4-GPU training runs out of memory, the batch size is halved and the learning rate is adjusted accordingly. |
Learning Rate | The default learning rate for batch size 1024 is 0.00065. The learning rate is tuned per model to achieve the best results. |
Learning Rate Scheduler | A linear warm-up for 1.5 epochs is followed by a OneCycle scheduler. |
Weight Decay | Default weight decay is 0.05. NormLayer and Bias parameters have a weight decay of 0. |
Data Type | All models are trained with mixed precision. |
EMA or Similar Tricks | No EMA or similar tricks are used for any model. |
Evaluation Protocols | Word accuracy is evaluated after filtering special characters and converting all text to lowercase. |
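To make the evaluation protocol concrete, the sketch below computes word accuracy after the standardized normalization (filter special characters, then lowercase). The helper names and the exact character set kept are illustrative assumptions, not the repository's implementation:

```python
import string

# Characters kept during evaluation: letters and digits (lowercased afterwards).
KEPT = set(string.ascii_letters + string.digits)

def normalize(text: str) -> str:
    """Filter special characters and convert to lowercase."""
    return "".join(c for c in text if c in KEPT).lower()

def word_accuracy(preds, labels) -> float:
    """Fraction of samples whose normalized prediction matches the normalized label."""
    hits = sum(normalize(p) == normalize(y) for p, y in zip(preds, labels))
    return hits / max(len(labels), 1)

# Case and punctuation differences do not count as errors.
print(word_accuracy(["Hello!", "W0RLD"], ["hello", "world"]))  # 0.5
```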
- PyTorch version >= 1.13.0
- Python version >= 3.7
```bash
git clone https://github.com/Topdu/OpenOCR.git
cd OpenOCR
# Ubuntu 20.04, CUDA 11.8
conda create -n openocr python==3.8
conda activate openocr
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
All data can be downloaded from Google Drive.
Structure of Datasets and OpenOCR code
```text
benchmark_bctr # Chinese text datasets, optional
├── benchmark_bctr_test
│ ├── document_test
│ ├── handwriting_test
│ ├── scene_test
│ └── web_test
└── benchmark_bctr_train
├── document_train
├── handwriting_train
├── scene_train
└── web_train
evaluation
├── CUTE80
├── IC13_857
├── IC15_1811
├── IIIT5k
├── SVT
└── SVTP
iiit5k_test_images # for Latency Measurement, optional
ltb # Long Text Benchmark
OpenOCR
OST
synth # optional
├── MJ
│ ├── test
│ ├── train
│ └── val
└── ST
test # Common Benchmarks from PARSeq
├── ArT
├── COCOv1.4
├── CUTE80
├── IC13_1015
├── IC13_1095
├── IC13_857
├── IC15_1811
├── IC15_2077
├── IIIT5k
├── SVT
├── SVTP
└── Uber
u14m # lmdb format Union14M-Benchmark
├── artistic
├── contextless
├── curve
├── general
├── multi_oriented
├── multi_words
└── salient
Union14M-L-LMDB-Filtered # lmdb format Union14M-L-Filtered
├── train_challenging
├── train_easy
├── train_hard
├── train_medium
└── train_normal
```
Datasets | Google Drive | Baidu Yun |
---|---|---|
Union14M-L-Filter | LMDB archives | |
Evaluation | LMDB archives | |
If you have downloaded Union14M-L, you can use the filtered list of images to create an LMDB of the training set Union14M-L-Filter.
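A minimal sketch of such an LMDB build is shown below, assuming the key convention commonly used by STR LMDB datasets (`image-%09d`, `label-%09d`, `num-samples`) and a filter list of `image_path\tlabel` lines; the file names are placeholders and this is not the repository's official conversion script:

```python
import lmdb  # pip install lmdb

def create_lmdb(filter_list: str, out_dir: str) -> None:
    """Write the image/label pairs listed in `filter_list` into an LMDB dataset."""
    env = lmdb.open(out_dir, map_size=1 << 40)  # generous virtual map size
    with env.begin(write=True) as txn, open(filter_list, encoding="utf-8") as f:
        n = 0
        for line in f:
            img_path, label = line.rstrip("\n").split("\t", 1)
            with open(img_path, "rb") as img:
                txn.put(f"image-{n + 1:09d}".encode(), img.read())
            txn.put(f"label-{n + 1:09d}".encode(), label.encode("utf-8"))
            n += 1
        txn.put(b"num-samples", str(n).encode())

create_lmdb("u14m_filter_list.txt", "Union14M-L-LMDB-Filtered/train_easy")
```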
Datasets | Google Drive | Baidu Yun |
---|---|---|
Union14M-L-Benchmark | LMDB archives | |
Common-Benchmarks | LMDB archives | |
Long Text Benchmark (LTB) | LMDB archives | |
Occluded Scene Text (OST) | LMDB archives | |
Note: SVTRv2 is used as the example here. The execution commands for each model are listed in detail on their respective readme pages.
```bash
# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
# For multiple RTX 4090s
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
# 20 epochs run for about 6 hours
```
```bash
# Short-text benchmarks: Common Benchmarks, Union14M-Benchmark, OST
python tools/eval_rec_all_ratio.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
```
After a successful run, the results are saved to a CSV file under the `output_dir` specified in the config file.
```bash
python tools/infer_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml --o Global.infer_img=/path/img_fold or /path/img_file
```
First, download the IIIT5K images from Google Drive. Then run the following command:
```bash
python tools/infer_rec.py --c configs/rec/svtrv2/svtrv2_rctc.yml --o Global.infer_img=../iiit5k_test_image
```
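The latency figures reported in the tables below are average per-image inference times. A minimal sketch of how such a measurement could be taken is shown here; the model and image tensors are placeholders, and the repository's script may time things differently:

```python
import time
import torch

@torch.no_grad()
def avg_latency_ms(model, images, warmup: int = 10) -> float:
    """Average per-image inference time in milliseconds at batch size 1."""
    model.eval()
    for img in images[:warmup]:      # warm-up runs are excluded from timing
        model(img.unsqueeze(0))
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # wait for queued GPU kernels to finish
    start = time.perf_counter()
    for img in images:
        model(img.unsqueeze(0))
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / len(images) * 1000.0
```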
(TODO) Download all model checkpoints from Google Drive and Baidu Yun.
| Type | Method | Venue | Encoder | Config | Model | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Common Avg | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | General | U14M Avg | LTB | OST | Param (M) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EDTRs | ASTER | TPAMI19 | ResNet31+LSTM | Config | ckpt | 96.1 | 93.0 | 94.9 | 86.1 | 87.9 | 92.0 | 91.68 | 70.9 | 82.2 | 56.7 | 62.9 | 73.9 | 58.5 | 76.3 | 68.75 | 0.02 | 61.9 | 19.04 | 14.9 |
| | NRTR | ICDAR19 | Stem+TF6 | Config | ckpt | 98.1 | 96.8 | 97.8 | 88.9 | 93.3 | 94.4 | 94.89 | 67.9 | 42.4 | 66.5 | 73.6 | 66.4 | 77.2 | 78.3 | 67.46 | 2.00 | 74.8 | 44.26 | 57.8 |
| | MORAN | PR19 | ResNet31+LSTM | Config | ckpt | 96.7 | 91.7 | 94.6 | 84.6 | 85.7 | 90.3 | 90.61 | 51.2 | 15.5 | 51.3 | 61.2 | 43.2 | 64.1 | 69.3 | 50.82 | 0.06 | 57.9 | 17.35 | 16.8 |
| | SAR | AAAI19 | ResNet31+LSTM | Config | ckpt | 98.1 | 93.8 | 96.7 | 86.0 | 87.9 | 95.5 | 93.01 | 70.5 | 51.8 | 63.7 | 73.9 | 64.0 | 79.1 | 75.5 | 68.36 | 0.00 | 60.6 | 57.47 | 63.1 |
| | DAN | AAAI20 | ResNet45+FPN | Config | ckpt | 97.5 | 94.7 | 96.5 | 87.1 | 89.1 | 94.4 | 93.24 | 74.9 | 63.3 | 63.4 | 70.6 | 70.2 | 71.1 | 76.8 | 70.05 | 0.00 | 61.8 | 27.71 | 10.1 |
| | SRN | CVPR20 | ResNet50+FPN | Config | ckpt | 97.2 | 96.3 | 97.5 | 87.9 | 90.9 | 96.9 | 94.45 | 78.1 | 63.2 | 66.3 | 65.3 | 71.4 | 58.3 | 76.5 | 68.43 | 0.00 | 64.6 | 51.70 | 14.9 |
| | SEED | CVPR20 | ResNet31+LSTM | Config | ckpt | 96.5 | 93.2 | 94.2 | 87.5 | 88.7 | 93.4 | 92.24 | 69.1 | 80.9 | 56.9 | 63.9 | 73.4 | 61.3 | 76.5 | 68.87 | 0.10 | 62.6 | 23.95 | 15.3 |
| | AutoSTR | ECCV20 | SearchCNN+LSTM | Config | ckpt | 96.8 | 92.4 | 95.7 | 86.6 | 88.2 | 93.4 | 92.19 | 72.1 | 81.7 | 56.7 | 64.8 | 75.4 | 64.0 | 75.9 | 70.09 | 0.10 | 61.5 | 6.04 | 12.1 |
| | RoScanner | ECCV20 | ResNet31 | Config | ckpt | 98.5 | 95.8 | 97.7 | 88.2 | 90.1 | 97.6 | 94.65 | 79.4 | 68.1 | 70.5 | 79.6 | 71.6 | 82.5 | 80.8 | 76.08 | 0.00 | 68.6 | 47.98 | 15.6 |
| | ABINet | CVPR21 | ResNet45+TF3 | Config | ckpt | 98.5 | 98.1 | 97.7 | 90.1 | 94.1 | 96.5 | 95.83 | 80.4 | 69.0 | 71.7 | 74.7 | 77.6 | 76.8 | 79.8 | 75.72 | 0.00 | 75.0 | 36.86 | 13.7 |
| | VisionLAN | ICCV21 | ResNet45+TF3 | Config | ckpt | 98.2 | 95.8 | 97.1 | 88.6 | 91.2 | 96.2 | 94.50 | 79.6 | 71.4 | 67.9 | 73.7 | 76.1 | 73.9 | 79.1 | 74.53 | 0.00 | 66.4 | 32.88 | 10.7 |
| | PARSeq | ECCV22 | ViT-S | Config | ckpt | 98.9 | 98.1 | 98.4 | 90.1 | 94.3 | 98.6 | 96.40 | 87.6 | 88.8 | 76.5 | 83.4 | 84.4 | 84.3 | 84.9 | 84.26 | 0.00 | 79.9 | 23.83 | 19.0 |
| | MATRN | ECCV22 | ResNet45+TF3 | Config | ckpt | 98.8 | 98.3 | 97.9 | 90.3 | 95.2 | 97.2 | 96.29 | 82.2 | 73.0 | 73.4 | 76.9 | 79.4 | 77.4 | 81.0 | 77.62 | 0.00 | 77.8 | 44.34 | 21.3 |
| | MGP-STR | ECCV22 | ViT-B | Config | ckpt | 97.9 | 97.8 | 97.1 | 89.6 | 95.2 | 96.9 | 95.74 | 85.2 | 83.7 | 72.6 | 75.1 | 79.8 | 71.1 | 83.1 | 78.65 | 0.00 | 78.6 | 148.00 | 8.2 |
| | CPPD-B | Preprint | SVTR-B | Config | ckpt | 99.0 | 97.8 | 98.2 | 90.4 | 94.0 | 99.0 | 96.40 | 86.2 | 78.7 | 76.5 | 82.9 | 83.5 | 81.9 | 83.5 | 81.91 | 0.00 | 79.6 | 27.00 | 8.0 |
| | LPV-B | IJCAI23 | SVTR-B | Config | ckpt | 98.6 | 97.8 | 98.1 | 89.8 | 93.6 | 97.6 | 95.93 | 86.2 | 78.7 | 75.8 | 80.2 | 82.9 | 81.6 | 82.9 | 81.20 | 0.00 | 77.7 | 30.54 | 12.1 |
| | MAERec | ICCV23 | ViT-S | Config | ckpt | 99.2 | 97.8 | 98.2 | 90.4 | 94.3 | 98.3 | 96.36 | 89.1 | 87.1 | 79.0 | 84.2 | 86.3 | 85.9 | 84.6 | 85.17 | 9.80 | 76.4 | 35.69 | 58.4 |
| | LISTER | ICCV23 | FocalNet-B | Config | ckpt | 98.8 | 97.5 | 98.6 | 90.0 | 94.4 | 96.9 | 95.48 | 78.7 | 68.8 | 73.7 | 81.6 | 74.8 | 82.4 | 83.5 | 77.64 | 36.3 | 77.1 | 51.11 | 20.4 |
| | CDistNet | IJCV24 | ResNet45+TF3 | Config | ckpt | 98.7 | 97.1 | 97.8 | 89.6 | 93.5 | 96.9 | 95.59 | 81.7 | 77.1 | 72.6 | 78.2 | 79.9 | 79.7 | 81.1 | 78.62 | 0.00 | 71.8 | 43.32 | 62.9 |
| | CAM | PR24 | ConvNeXtV2-T | Config | ckpt | 98.2 | 96.1 | 96.6 | 89.0 | 93.5 | 96.2 | 94.94 | 85.4 | 89.0 | 72.0 | 75.4 | 84.0 | 74.8 | 83.1 | 80.52 | 0.52 | 74.2 | 58.66 | 35.0 |
| | BUSNet | AAAI24 | ViT-S | Config | ckpt | 98.3 | 98.1 | 97.8 | 90.2 | 95.3 | 96.5 | 96.06 | 83.0 | 82.3 | 70.8 | 77.9 | 78.8 | 71.2 | 82.6 | 78.10 | 0.00 | 78.7 | 32.10 | 12.0 |
| | OTE | CVPR24 | SVTR-B | Config | ckpt | 98.6 | 96.6 | 98.0 | 90.1 | 94.0 | 97.2 | 95.74 | 86.0 | 75.8 | 74.6 | 74.7 | 81.0 | 65.3 | 82.3 | 77.09 | 0.00 | 77.8 | 20.28 | 18.1 |
| CTCs | CRNN | TPAMI16 | ResNet31+LSTM | Config | ckpt | 95.8 | 91.8 | 94.6 | 84.9 | 83.1 | 91.0 | 90.21 | 48.1 | 13.0 | 51.2 | 62.3 | 41.4 | 60.4 | 68.2 | 49.24 | 47.21 | 58.0 | 16.20 | 5.8 |
| | SVTR | IJCAI22 | SVTR-B | Config | ckpt | 98.0 | 97.1 | 97.3 | 88.6 | 90.7 | 95.8 | 94.58 | 76.2 | 44.5 | 67.8 | 78.7 | 75.2 | 77.9 | 77.8 | 71.17 | 45.08 | 69.6 | 18.09 | 6.2 |
| | SVTRv2 | Preprint | SVTRv2-T | Config | ckpt | 98.6 | 96.6 | 98.0 | 88.4 | 90.5 | 96.5 | 94.78 | 83.6 | 76.0 | 71.2 | 82.4 | 77.2 | 82.3 | 80.7 | 79.05 | 47.83 | 71.4 | 5.13 | 5.0 |
| | | | SVTRv2-S | Config | ckpt | 99.0 | 98.3 | 98.5 | 89.5 | 92.9 | 98.6 | 96.13 | 88.3 | 84.6 | 76.5 | 84.3 | 83.3 | 85.4 | 83.5 | 83.70 | 47.57 | 78.0 | 11.25 | 5.3 |
| | | | SVTRv2-B | Config | ckpt | 99.2 | 98.0 | 98.7 | 91.1 | 93.5 | 99.0 | 96.57 | 90.6 | 89.0 | 79.3 | 86.1 | 86.2 | 86.7 | 85.1 | 86.14 | 50.23 | 80.0 | 19.76 | 7.0 |
Note: TF$_n$ denotes an $n$-layer Transformer. "Common Avg" and "U14M Avg" denote the average accuracy on the six Common Benchmarks and the seven Union14M-Benchmark subsets, respectively.
| Type | Method | Venue | Encoder | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | Common Avg | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | General | U14M Avg | Param (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EDTRs | ASTER | TPAMI19 | ResNet+LSTM | 93.3 | 90.0 | 90.8 | 74.7 | 80.2 | 80.9 | 84.98 | 34.0 | 10.2 | 27.7 | 33.0 | 48.2 | 27.6 | 39.8 | 31.50 | 27.2 |
| | NRTR | ICDAR19 | Stem+TF6 | 90.1 | 91.5 | 95.8 | 79.4 | 86.6 | 80.9 | 87.38 | 31.7 | 4.4 | 36.6 | 37.3 | 30.6 | 54.9 | 48.0 | 34.79 | 31.7 |
| | MORAN | PR19 | ResNet+LSTM | 91.0 | 83.9 | 91.3 | 68.4 | 73.3 | 75.7 | 80.60 | 8.9 | 0.7 | 29.4 | 20.7 | 17.9 | 23.8 | 35.2 | 19.51 | 17.4 |
| | SAR | AAAI19 | ResNet+LSTM | 91.5 | 84.5 | 91.0 | 69.2 | 76.4 | 83.5 | 82.68 | 44.3 | 7.7 | 42.6 | 44.2 | 44.0 | 51.2 | 50.5 | 40.64 | 57.7 |
| | DAN | AAAI20 | ResNet+FPN | 93.4 | 87.5 | 92.1 | 71.6 | 78.0 | 81.3 | 83.98 | 26.7 | 1.5 | 35.0 | 40.3 | 36.5 | 42.2 | 42.1 | 32.04 | 27.7 |
| | SRN | CVPR20 | ResNet+FPN | 94.8 | 91.5 | 95.5 | 82.7 | 85.1 | 87.8 | 89.57 | 63.4 | 25.3 | 34.1 | 28.7 | 56.5 | 26.7 | 46.3 | 40.14 | 54.7 |
| | SEED* | CVPR20 | ResNet+LSTM | 93.8 | 89.6 | 92.8 | 80.0 | 81.4 | 83.6 | 86.87 | 40.4 | 15.5 | 32.1 | 32.5 | 54.8 | 35.6 | 39.0 | 35.70 | 24.0 |
| | AutoSTR* | ECCV20 | NAS+LSTM | 94.7 | 90.9 | 94.2 | 81.8 | 81.7 | - | - | 47.7 | 17.9 | 30.8 | 36.2 | 64.2 | 38.7 | 41.3 | 39.54 | 6.0 |
| | RoScanner | ECCV20 | ResNet | 95.3 | 88.1 | 94.8 | 77.1 | 79.5 | 90.3 | 87.52 | 43.6 | 7.9 | 41.2 | 42.6 | 44.9 | 46.9 | 39.5 | 38.09 | 48.0 |
| | ABINet | CVPR21 | ResNet+TF3 | 96.2 | 93.5 | 97.4 | 86.0 | 89.3 | 89.2 | 91.93 | 59.5 | 12.7 | 43.3 | 38.3 | 62.0 | 50.8 | 55.6 | 46.03 | 36.7 |
| | VisionLAN | ICCV21 | ResNet+TF3 | 95.8 | 91.7 | 95.7 | 83.7 | 86.0 | 88.5 | 90.23 | 57.7 | 14.2 | 47.8 | 48.0 | 64.0 | 47.9 | 52.1 | 47.39 | 32.8 |
| | PARSeq* | ECCV22 | ViT-S | 97.0 | 93.6 | 97.0 | 86.5 | 88.9 | 92.2 | 92.53 | 63.9 | 16.7 | 52.5 | 54.3 | 68.2 | 55.9 | 56.9 | 52.62 | 23.8 |
| | MATRN | ECCV22 | ResNet+TF3 | 96.6 | 95.0 | 97.9 | 86.6 | 90.6 | 93.5 | 93.37 | 63.1 | 13.4 | 43.8 | 41.9 | 66.4 | 53.2 | 57.0 | 48.40 | 44.2 |
| | MGP-STR* | ECCV22 | ViT-B | 96.4 | 94.7 | 97.3 | 87.2 | 91.0 | 90.3 | 92.82 | 55.2 | 14.0 | 52.8 | 48.5 | 65.2 | 48.8 | 59.1 | 49.09 | 148.0 |
| | LevOCR* | ECCV22 | ResNet+TF3 | 96.6 | 94.4 | 96.7 | 86.5 | 88.8 | 90.6 | 92.27 | 52.8 | 10.7 | 44.8 | 51.9 | 61.3 | 54.0 | 58.1 | 47.66 | 109.0 |
| | CornerTF* | ECCV22 | CornerEncoder | 95.9 | 94.6 | 97.8 | 86.5 | 91.5 | 92.0 | 93.05 | 62.9 | 18.6 | 56.1 | 58.5 | 68.6 | 59.7 | 61.0 | 55.07 | 86.0 |
| | CPPD | Preprint | SVTR-B | 97.6 | 95.5 | 98.2 | 87.9 | 90.9 | 92.7 | 93.80 | 65.5 | 18.6 | 56.0 | 61.9 | 71.0 | 57.5 | 65.8 | 56.63 | 26.8 |
| | SIGA* | CVPR23 | ViT-B | 96.6 | 95.1 | 97.8 | 86.6 | 90.5 | 93.1 | 93.28 | 59.9 | 22.3 | 49.0 | 50.8 | 66.4 | 58.4 | 56.2 | 51.85 | 113.0 |
| | CCD* | ICCV23 | ViT-B | 97.2 | 94.4 | 97.0 | 87.6 | 91.8 | 93.3 | 93.55 | 66.6 | 24.2 | 63.9 | 64.8 | 74.8 | 62.4 | 64.0 | 60.10 | 52.0 |
| | LISTER* | ICCV23 | FocalNet-B | 96.9 | 93.8 | 97.9 | 87.5 | 89.6 | 90.6 | 92.72 | 56.5 | 17.2 | 52.8 | 63.5 | 63.2 | 59.6 | 65.4 | 54.05 | 49.9 |
| | LPV-B* | IJCAI23 | SVTR-B | 97.3 | 94.6 | 97.6 | 87.5 | 90.9 | 94.8 | 93.78 | 68.3 | 21.0 | 59.6 | 65.1 | 76.2 | 63.6 | 62.0 | 59.40 | 35.1 |
| | CDistNet* | IJCV24 | ResNet+TF3 | 96.4 | 93.5 | 97.4 | 86.0 | 88.7 | 93.4 | 92.57 | 69.3 | 24.4 | 49.8 | 55.6 | 72.8 | 64.3 | 58.5 | 56.38 | 65.5 |
| | CAM* | PR24 | ConvNeXtV2-B | 97.4 | 96.1 | 97.2 | 87.8 | 90.6 | 92.4 | 93.58 | 63.1 | 19.4 | 55.4 | 58.5 | 72.7 | 51.4 | 57.4 | 53.99 | 135.0 |
| | BUSNet | AAAI24 | ViT-S | 96.2 | 95.5 | 98.3 | 87.2 | 91.8 | 91.3 | 93.38 | - | - | - | - | - | - | - | - | 56.8 |
| | DCTC | AAAI24 | SVTR-L | 96.9 | 93.7 | 97.4 | 87.3 | 88.5 | 92.3 | 92.68 | - | - | - | - | - | - | - | - | 40.8 |
| | OTE | CVPR24 | SVTR-B | 96.4 | 95.5 | 97.4 | 87.2 | 89.6 | 92.4 | 93.08 | - | - | - | - | - | - | - | - | 25.2 |
| | CFF | IJCAI24 | CEFE | 97.6 | 94.3 | 97.9 | 86.9 | 91.8 | 95.5 | 94.00 | 70.0 | 20.8 | 62.4 | 72.0 | 75.2 | 65.7 | 65.1 | 61.60 | 23.9 |
| CTCs | CRNN | TPAMI16 | ResNet+LSTM | 82.9 | 81.6 | 91.1 | 69.4 | 70.0 | 65.5 | 76.75 | 7.5 | 0.9 | 20.7 | 25.6 | 13.9 | 25.6 | 32.0 | 18.03 | 8.3 |
| | SVTR* | IJCAI22 | SVTR-B | 96.0 | 91.5 | 97.1 | 85.2 | 89.9 | 91.7 | 91.90 | 69.8 | 37.7 | 47.9 | 61.4 | 66.8 | 44.8 | 61.0 | 55.63 | 24.6 |
| | SVTRv2 | Preprint | SVTRv2-B | 97.7 | 94.0 | 97.3 | 88.1 | 91.2 | 95.8 | 94.02 | 74.6 | 25.2 | 57.6 | 69.7 | 77.9 | 68.0 | 66.9 | 62.83 | 19.8 |
Note: * indicates that the results on Union14M-Benchmarks are evaluated using the officially released model.
If you find our method useful for your research, please cite:
```bibtex
@article{Du2024SVTRv2,
  title={SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition},
  author={Yongkun Du and Zhineng Chen and Hongtao Xie and Caiyan Jia and Yu-Gang Jiang},
  journal={CoRR},
  volume={abs/2411.15858},
  eprinttype={arXiv},
  year={2024},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.15858}
}
```