[Paper] [Doc] [Model] [Datasets] [Config, Training and Inference] [Benchmark]
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which contains only a visual model and a CTC-aligned linear classifier, and their fast inference. However, they generally exhibit lower accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, endowing it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize text images while maintaining their readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM), which integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and therefore adds no inference cost. We evaluate SVTRv2 on both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all EDTRs across these scenarios in terms of both accuracy and speed.
Recent research shows that the performance of STR models can be significantly improved by leveraging large-scale real-world datasets. However, many previous methods were trained on synthetic datasets, which fails to reflect their performance in real-world scenarios. Additionally, recent approaches have used different real-world training datasets and inconsistent evaluation protocols, making it difficult to compare their performance.
To this end, we established a unified benchmark to re-train and evaluate mainstream STR methods.
First, to evaluate the performance of STR methods across diverse scenarios, we selected Union14M-Benchmarks as the test set, which covers a variety of complex scenarios. Additionally, we report results on the six test sets (Common Benchmarks) used in previous studies. For the training set, we used the large-scale real-world training dataset Union14M-L. To avoid data leakage, we filtered out the overlapping samples between Union14M-L (training set) and Union14M-L-Benchmark (test set), resulting in the Union14M-L-Filter training dataset.
Furthermore, previous methods used inconsistent hyperparameter settings during training, which contributed to variations in their performance. To ensure reliable evaluation, we standardized key settings that significantly affect accuracy, such as the number of training epochs, data augmentation strategies, input size, and evaluation strategies. This ensures the reliability of our results.
Subsequently, we trained 24 reproduced STR methods and SVTRv2 using the Union14M-L-Filter dataset and evaluated their performance on both Common Benchmarks and Union14M-L-Benchmark.
- Unified Training Set: All models are trained from scratch on a unified dataset named Union14M-L-Filter, derived from Union14M-L by filtering out samples that overlap with the evaluation benchmarks.
- Composition of Datasets: The training dataset includes different categories of samples, categorized as Easy, Medium, Hard, Norm, and Challenging.
- Union14M-L: Contains 3,230,742 images in total.
- Union14M-L-Filter: The filtered version contains 3,224,143 images.
- Overlap with Union14M-Benchmarks: Only 6,599 images overlap between Union14M-L and the benchmark datasets used for evaluation, and these are removed (an illustrative filtering sketch follows the table below).
| | Easy | Medium | Hard | Norm | Challenging | Total |
|---|---|---|---|---|---|---|
| Union14M-L | 2,076,161 | 145,525 | 308,025 | 218,154 | 482,877 | 3,230,742 |
| Union14M-L-Filter | 2,073,822 | 144,677 | 306,771 | 217,070 | 481,803 | 3,224,143 |
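For intuition, overlap filtering of this kind can be sketched by fingerprinting image bytes. The MD5-based matching and directory names below are illustrative assumptions, not necessarily the procedure actually used to build Union14M-L-Filter:

```python
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Fingerprint an image by hashing its raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()

# Hypothetical local directories for the benchmark and training images.
bench_hashes = {md5_of(p) for p in Path("u14m_benchmark").rglob("*.jpg")}

# Keep only training images that do not byte-match any benchmark image.
kept = [p for p in Path("union14m_l").rglob("*.jpg") if md5_of(p) not in bench_hashes]
print(f"{len(kept)} training images remain after overlap filtering")
```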
- Common Test Set:
Previous methods have primarily been evaluated on this set. While it includes some irregular text, the datasets are not highly challenging (e.g., little curvature or rotation). Models trained on synthetic datasets often perform well here.
- Includes 6 subsets with regular and irregular samples.
- Example datasets: IIIT5K, SVT, IC13, IC15, SVTP, and CUTE80.
Dataset | Image Count | Characteristics |
---|---|---|
IIIT5K | 3,000 | Regular |
SVT | 647 | Regular |
IC13 | 857 | Regular |
IC15 | 1,811 | Irregular (low resolution, blurring) |
SVTP | 645 | Irregular (affine, perspective) |
CUTE80 | 288 | Irregular (curved, clear) |
- Challenging Test Set (Union14M-L-Benchmark):
Introduced to test the full capabilities of STR models. This set includes significantly more difficult text samples, featuring extreme curvature, rotation, artistic styles, overlapping text, and other challenges.
- Includes datasets such as Curve, Multi-Oriented, Artistic, Contextless, Salient, Multi-word, and General.
Dataset | Image Count | Characteristics |
---|---|---|
Curve | 2,426 | Severe curvature |
Multi-Oriented | 1,369 | Severe rotation, multi-angle, vertical directions |
Artistic | 900 | Artistic styles, often seen in logos |
Contextless | 779 | No semantic meaning, out-of-dictionary words |
Salient | 1,585 | Adjacent or overlapping text |
Multi-word | 829 | Contains multiple words |
General | 400,000 | Includes challenges like blurring, distortion |
- Special Test Sets:
Designed to evaluate specific model capabilities beyond the above datasets.
- LTB (Long Text Benchmark): Evaluates performance on long texts (25–36 characters), as models are typically trained on short texts (≤25 characters).
- OST (Occluded Scene Text): Tests the model’s ability to infer text from damaged or partially erased samples.
Dataset | Image Count | Characteristics |
---|---|---|
LTB | 3,376 | Long text (25 < length < 36) |
OST | 4,832 | Partially erased/destroyed characters |
Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Additionally, Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively.
The optimal training hyperparameters for all models are usually not fixed. However, key settings that significantly impact accuracy, such as the number of training epochs, data augmentation strategies, input size, data type, and evaluation protocols, must be strictly standardized to ensure fair and unbiased comparisons of model performance. By following these standardizations, the results can accurately reflect the true capabilities of the models, unaffected by experimental inconsistencies. The specific settings include:
Setting | Detail |
---|---|
Training Set | During training, if a sampled text image has a label longer than 25 characters, it is replaced by a randomly selected sample with text length ≤ 25, ensuring models are only exposed to short texts (length ≤ 25). |
Test Sets | For all test sets except the long-text test set (LTB), text images with text length > 25 are filtered out. Text length is calculated after removing spaces and characters outside the 94-character set. |
Input Size | Unless a method explicitly requires a dynamic size, models use a fixed input size of 32×128. If a model fails to train properly with 32×128, its original input size is used instead. The test input size matches the training size. |
Data Augmentation | All models use the data augmentation strategy employed by PARSeq. |
Training Epochs | Unless pre-training is required, all models are trained for 20 epochs. |
Optimizer | AdamW is the default optimizer. If training fails to converge with AdamW, Adam or other optimizers are used. |
Batch Size | The maximum batch size for all models is 1024. If single-GPU training is not feasible, 2 GPUs (512 per GPU) or 4 GPUs (256 per GPU) are used. If 4-GPU training runs out of memory, the batch size is halved and the learning rate is adjusted accordingly. |
Learning Rate | The default learning rate for batch size 1024 is 0.00065. The learning rate is tuned per model to achieve the best results. |
Learning Rate Scheduler | A linear warm-up for 1.5 epochs is followed by a OneCycle scheduler. |
Weight Decay | Default weight decay is 0.05. NormLayer and Bias parameters have a weight decay of 0. |
Data Type | All models are trained with mixed precision. |
EMA or Similar Tricks | No EMA or similar tricks are used for any model. |
Evaluation Protocols | Word accuracy is evaluated after filtering special characters and converting all text to lowercase. |
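To make the evaluation protocol concrete, the sketch below computes word accuracy after the standardized normalization (filter special characters, then lowercase). The helper names and the exact character set kept are illustrative assumptions, not the repository's implementation:

```python
import string

# Characters kept during evaluation: letters and digits (lowercased afterwards).
KEPT = set(string.ascii_letters + string.digits)

def normalize(text: str) -> str:
    """Filter special characters and convert to lowercase."""
    return "".join(c for c in text if c in KEPT).lower()

def word_accuracy(preds, labels) -> float:
    """Fraction of samples whose normalized prediction matches the normalized label."""
    hits = sum(normalize(p) == normalize(y) for p, y in zip(preds, labels))
    return hits / max(len(labels), 1)

# Case and punctuation differences do not count as errors.
print(word_accuracy(["Hello!", "W0RLD"], ["hello", "world"]))  # 0.5
```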
- PyTorch version >= 1.13.0
- Python version >= 3.7
```bash
git clone https://github.com/Topdu/OpenOCR.git
cd OpenOCR
# Ubuntu 20.04, CUDA 11.8
conda create -n openocr python==3.8
conda activate openocr
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
All data can be downloaded from Google Drive.
Structure of Datasets and OpenOCR code
```text
benchmark_bctr # Chinese text datasets, optional
├── benchmark_bctr_test
│ ├── document_test
│ ├── handwriting_test
│ ├── scene_test
│ └── web_test
└── benchmark_bctr_train
├── document_train
├── handwriting_train
├── scene_train
└── web_train
evaluation
├── CUTE80
├── IC13_857
├── IC15_1811
├── IIIT5k
├── SVT
└── SVTP
iiit5k_test_images # for Latency Measurement, optional
ltb # Long Text Benchmark
OpenOCR
OST
synth # optional
├── MJ
│ ├── test
│ ├── train
│ └── val
└── ST
test # Common Benchmarks from PARSeq
├── ArT
├── COCOv1.4
├── CUTE80
├── IC13_1015
├── IC13_1095
├── IC13_857
├── IC15_1811
├── IC15_2077
├── IIIT5k
├── SVT
├── SVTP
└── Uber
u14m # lmdb format Union14M-Benchmark
├── artistic
├── contextless
├── curve
├── general
├── multi_oriented
├── multi_words
└── salient
Union14M-L-LMDB-Filtered # lmdb format Union14M-L-Filtered
├── train_challenging
├── train_easy
├── train_hard
├── train_medium
└── train_normal
```
Datasets | Google Drive | Baidu Yun |
---|---|---|
Union14M-L-Filter | LMDB archives | |
Evaluation | LMDB archives | |
If you have downloaded Union14M-L, you can use the filtered list of images to create an LMDB of the training set Union14M-L-Filter.
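A minimal sketch of such an LMDB build is shown below, assuming the key convention commonly used by STR LMDB datasets (`image-%09d`, `label-%09d`, `num-samples`) and a filter list of `image_path\tlabel` lines; the file names are placeholders and this is not the repository's official conversion script:

```python
import lmdb  # pip install lmdb

def create_lmdb(filter_list: str, out_dir: str) -> None:
    """Write the image/label pairs listed in `filter_list` into an LMDB dataset."""
    env = lmdb.open(out_dir, map_size=1 << 40)  # generous virtual map size
    with env.begin(write=True) as txn, open(filter_list, encoding="utf-8") as f:
        n = 0
        for line in f:
            img_path, label = line.rstrip("\n").split("\t", 1)
            with open(img_path, "rb") as img:
                txn.put(f"image-{n + 1:09d}".encode(), img.read())
            txn.put(f"label-{n + 1:09d}".encode(), label.encode("utf-8"))
            n += 1
        txn.put(b"num-samples", str(n).encode())

create_lmdb("u14m_filter_list.txt", "Union14M-L-LMDB-Filtered/train_easy")
```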
Datasets | Google Drive | Baidu Yun |
---|---|---|
Union14M-L-Benchmark | LMDB archives | |
Common-Benchmarks | LMDB archives | |
Long Text Benchmark (LTB) | LMDB archives | |
Occluded Scene Text (OST) | LMDB archives | |
Note: SVTRv2 is used as the example here. The execution commands for each model are listed in detail on their respective readme pages.
```bash
# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
# For multiple RTX 4090s
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
# 20 epochs run for about 6 hours
```
```bash
# Short-text benchmarks: Common Benchmarks, Union14M-Benchmark, OST
python tools/eval_rec_all_ratio.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
```
After a successful run, the results are saved to a CSV file under the `output_dir` specified in the config file.
```bash
python tools/infer_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml --o Global.infer_img=/path/img_fold or /path/img_file
```
First, download the IIIT5K images from Google Drive. Then run the following command:
```bash
python tools/infer_rec.py --c configs/rec/svtrv2/svtrv2_rctc.yml --o Global.infer_img=../iiit5k_test_image
```
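The latency figures reported in the tables below are average per-image inference times. A minimal sketch of how such a measurement could be taken is shown here; the model and image tensors are placeholders, and the repository's script may time things differently:

```python
import time
import torch

@torch.no_grad()
def avg_latency_ms(model, images, warmup: int = 10) -> float:
    """Average per-image inference time in milliseconds at batch size 1."""
    model.eval()
    for img in images[:warmup]:      # warm-up runs are excluded from timing
        model(img.unsqueeze(0))
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # wait for queued GPU kernels to finish
    start = time.perf_counter()
    for img in images:
        model(img.unsqueeze(0))
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / len(images) * 1000.0
```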
(TODO) Download all model checkpoints from Google Drive and Baidu Yun.
| Type | Method | Venue | Encoder | Config | Model | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Common Avg | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | General | U14M Avg | LTB | OST | Param (M) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EDTRs | ASTER | TPAMI19 | ResNet31+LSTM | Config | ckpt | 96.1 | 93.0 | 94.9 | 86.1 | 87.9 | 92.0 | 91.68 | 70.9 | 82.2 | 56.7 | 62.9 | 73.9 | 58.5 | 76.3 | 68.75 | 0.02 | 61.9 | 19.04 | 14.9 |
| | NRTR | ICDAR19 | Stem+TF6 | Config | ckpt | 98.1 | 96.8 | 97.8 | 88.9 | 93.3 | 94.4 | 94.89 | 67.9 | 42.4 | 66.5 | 73.6 | 66.4 | 77.2 | 78.3 | 67.46 | 2.00 | 74.8 | 44.26 | 57.8 |
| | MORAN | PR19 | ResNet31+LSTM | Config | ckpt | 96.7 | 91.7 | 94.6 | 84.6 | 85.7 | 90.3 | 90.61 | 51.2 | 15.5 | 51.3 | 61.2 | 43.2 | 64.1 | 69.3 | 50.82 | 0.06 | 57.9 | 17.35 | 16.8 |
| | SAR | AAAI19 | ResNet31+LSTM | Config | ckpt | 98.1 | 93.8 | 96.7 | 86.0 | 87.9 | 95.5 | 93.01 | 70.5 | 51.8 | 63.7 | 73.9 | 64.0 | 79.1 | 75.5 | 68.36 | 0.00 | 60.6 | 57.47 | 63.1 |
| | DAN | AAAI20 | ResNet45+FPN | Config | ckpt | 97.5 | 94.7 | 96.5 | 87.1 | 89.1 | 94.4 | 93.24 | 74.9 | 63.3 | 63.4 | 70.6 | 70.2 | 71.1 | 76.8 | 70.05 | 0.00 | 61.8 | 27.71 | 10.1 |
| | SRN | CVPR20 | ResNet50+FPN | Config | ckpt | 97.2 | 96.3 | 97.5 | 87.9 | 90.9 | 96.9 | 94.45 | 78.1 | 63.2 | 66.3 | 65.3 | 71.4 | 58.3 | 76.5 | 68.43 | 0.00 | 64.6 | 51.70 | 14.9 |
| | SEED | CVPR20 | ResNet31+LSTM | Config | ckpt | 96.5 | 93.2 | 94.2 | 87.5 | 88.7 | 93.4 | 92.24 | 69.1 | 80.9 | 56.9 | 63.9 | 73.4 | 61.3 | 76.5 | 68.87 | 0.10 | 62.6 | 23.95 | 15.3 |
| | AutoSTR | ECCV20 | SearchCNN+LSTM | Config | ckpt | 96.8 | 92.4 | 95.7 | 86.6 | 88.2 | 93.4 | 92.19 | 72.1 | 81.7 | 56.7 | 64.8 | 75.4 | 64.0 | 75.9 | 70.09 | 0.10 | 61.5 | 6.04 | 12.1 |
| | RoScanner | ECCV20 | ResNet31 | Config | ckpt | 98.5 | 95.8 | 97.7 | 88.2 | 90.1 | 97.6 | 94.65 | 79.4 | 68.1 | 70.5 | 79.6 | 71.6 | 82.5 | 80.8 | 76.08 | 0.00 | 68.6 | 47.98 | 15.6 |
| | ABINet | CVPR21 | ResNet45+TF3 | Config | ckpt | 98.5 | 98.1 | 97.7 | 90.1 | 94.1 | 96.5 | 95.83 | 80.4 | 69.0 | 71.7 | 74.7 | 77.6 | 76.8 | 79.8 | 75.72 | 0.00 | 75.0 | 36.86 | 13.7 |
| | VisionLAN | ICCV21 | ResNet45+TF3 | Config | ckpt | 98.2 | 95.8 | 97.1 | 88.6 | 91.2 | 96.2 | 94.50 | 79.6 | 71.4 | 67.9 | 73.7 | 76.1 | 73.9 | 79.1 | 74.53 | 0.00 | 66.4 | 32.88 | 10.7 |
| | PARSeq | ECCV22 | ViT-S | Config | ckpt | 98.9 | 98.1 | 98.4 | 90.1 | 94.3 | 98.6 | 96.40 | 87.6 | 88.8 | 76.5 | 83.4 | 84.4 | 84.3 | 84.9 | 84.26 | 0.00 | 79.9 | 23.83 | 19.0 |
| | MATRN | ECCV22 | ResNet45+TF3 | Config | ckpt | 98.8 | 98.3 | 97.9 | 90.3 | 95.2 | 97.2 | 96.29 | 82.2 | 73.0 | 73.4 | 76.9 | 79.4 | 77.4 | 81.0 | 77.62 | 0.00 | 77.8 | 44.34 | 21.3 |
| | MGP-STR | ECCV22 | ViT-B | Config | ckpt | 97.9 | 97.8 | 97.1 | 89.6 | 95.2 | 96.9 | 95.74 | 85.2 | 83.7 | 72.6 | 75.1 | 79.8 | 71.1 | 83.1 | 78.65 | 0.00 | 78.6 | 148.00 | 8.2 |
| | CPPD-B | Preprint | SVTR-B | Config | ckpt | 99.0 | 97.8 | 98.2 | 90.4 | 94.0 | 99.0 | 96.40 | 86.2 | 78.7 | 76.5 | 82.9 | 83.5 | 81.9 | 83.5 | 81.91 | 0.00 | 79.6 | 27.00 | 8.0 |
| | LPV-B | IJCAI23 | SVTR-B | Config | ckpt | 98.6 | 97.8 | 98.1 | 89.8 | 93.6 | 97.6 | 95.93 | 86.2 | 78.7 | 75.8 | 80.2 | 82.9 | 81.6 | 82.9 | 81.20 | 0.00 | 77.7 | 30.54 | 12.1 |
| | MAERec | ICCV23 | ViT-S | Config | ckpt | 99.2 | 97.8 | 98.2 | 90.4 | 94.3 | 98.3 | 96.36 | 89.1 | 87.1 | 79.0 | 84.2 | 86.3 | 85.9 | 84.6 | 85.17 | 9.80 | 76.4 | 35.69 | 58.4 |
| | LISTER | ICCV23 | FocalNet-B | Config | ckpt | 98.8 | 97.5 | 98.6 | 90.0 | 94.4 | 96.9 | 95.48 | 78.7 | 68.8 | 73.7 | 81.6 | 74.8 | 82.4 | 83.5 | 77.64 | 36.3 | 77.1 | 51.11 | 20.4 |
| | CDistNet | IJCV24 | ResNet45+TF3 | Config | ckpt | 98.7 | 97.1 | 97.8 | 89.6 | 93.5 | 96.9 | 95.59 | 81.7 | 77.1 | 72.6 | 78.2 | 79.9 | 79.7 | 81.1 | 78.62 | 0.00 | 71.8 | 43.32 | 62.9 |
| | CAM | PR24 | ConvNeXtV2-T | Config | ckpt | 98.2 | 96.1 | 96.6 | 89.0 | 93.5 | 96.2 | 94.94 | 85.4 | 89.0 | 72.0 | 75.4 | 84.0 | 74.8 | 83.1 | 80.52 | 0.52 | 74.2 | 58.66 | 35.0 |
| | BUSNet | AAAI24 | ViT-S | Config | ckpt | 98.3 | 98.1 | 97.8 | 90.2 | 95.3 | 96.5 | 96.06 | 83.0 | 82.3 | 70.8 | 77.9 | 78.8 | 71.2 | 82.6 | 78.10 | 0.00 | 78.7 | 32.10 | 12.0 |
| | OTE | CVPR24 | SVTR-B | Config | ckpt | 98.6 | 96.6 | 98.0 | 90.1 | 94.0 | 97.2 | 95.74 | 86.0 | 75.8 | 74.6 | 74.7 | 81.0 | 65.3 | 82.3 | 77.09 | 0.00 | 77.8 | 20.28 | 18.1 |
| CTCs | CRNN | TPAMI16 | ResNet31+LSTM | Config | ckpt | 95.8 | 91.8 | 94.6 | 84.9 | 83.1 | 91.0 | 90.21 | 48.1 | 13.0 | 51.2 | 62.3 | 41.4 | 60.4 | 68.2 | 49.24 | 47.21 | 58.0 | 16.20 | 5.8 |
| | SVTR | IJCAI22 | SVTR-B | Config | ckpt | 98.0 | 97.1 | 97.3 | 88.6 | 90.7 | 95.8 | 94.58 | 76.2 | 44.5 | 67.8 | 78.7 | 75.2 | 77.9 | 77.8 | 71.17 | 45.08 | 69.6 | 18.09 | 6.2 |
| | SVTRv2 | Preprint | SVTRv2-T | Config | ckpt | 98.6 | 96.6 | 98.0 | 88.4 | 90.5 | 96.5 | 94.78 | 83.6 | 76.0 | 71.2 | 82.4 | 77.2 | 82.3 | 80.7 | 79.05 | 47.83 | 71.4 | 5.13 | 5.0 |
| | | | SVTRv2-S | Config | ckpt | 99.0 | 98.3 | 98.5 | 89.5 | 92.9 | 98.6 | 96.13 | 88.3 | 84.6 | 76.5 | 84.3 | 83.3 | 85.4 | 83.5 | 83.70 | 47.57 | 78.0 | 11.25 | 5.3 |
| | | | SVTRv2-B | Config | ckpt | 99.2 | 98.0 | 98.7 | 91.1 | 93.5 | 99.0 | 96.57 | 90.6 | 89.0 | 79.3 | 86.1 | 86.2 | 86.7 | 85.1 | 86.14 | 50.23 | 80.0 | 19.76 | 7.0 |
Note: TF$_n$ denotes an $n$-layer Transformer. "Common Avg" and "U14M Avg" denote the average accuracy on the six Common Benchmarks and the seven Union14M-Benchmark subsets, respectively.
| Type | Method | Venue | Encoder | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | Common Avg | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | General | U14M Avg | Param (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EDTRs | ASTER | TPAMI19 | ResNet+LSTM | 93.3 | 90.0 | 90.8 | 74.7 | 80.2 | 80.9 | 84.98 | 34.0 | 10.2 | 27.7 | 33.0 | 48.2 | 27.6 | 39.8 | 31.50 | 27.2 |
| | NRTR | ICDAR19 | Stem+TF6 | 90.1 | 91.5 | 95.8 | 79.4 | 86.6 | 80.9 | 87.38 | 31.7 | 4.4 | 36.6 | 37.3 | 30.6 | 54.9 | 48.0 | 34.79 | 31.7 |
| | MORAN | PR19 | ResNet+LSTM | 91.0 | 83.9 | 91.3 | 68.4 | 73.3 | 75.7 | 80.60 | 8.9 | 0.7 | 29.4 | 20.7 | 17.9 | 23.8 | 35.2 | 19.51 | 17.4 |
| | SAR | AAAI19 | ResNet+LSTM | 91.5 | 84.5 | 91.0 | 69.2 | 76.4 | 83.5 | 82.68 | 44.3 | 7.7 | 42.6 | 44.2 | 44.0 | 51.2 | 50.5 | 40.64 | 57.7 |
| | DAN | AAAI20 | ResNet+FPN | 93.4 | 87.5 | 92.1 | 71.6 | 78.0 | 81.3 | 83.98 | 26.7 | 1.5 | 35.0 | 40.3 | 36.5 | 42.2 | 42.1 | 32.04 | 27.7 |
| | SRN | CVPR20 | ResNet+FPN | 94.8 | 91.5 | 95.5 | 82.7 | 85.1 | 87.8 | 89.57 | 63.4 | 25.3 | 34.1 | 28.7 | 56.5 | 26.7 | 46.3 | 40.14 | 54.7 |
| | SEED* | CVPR20 | ResNet+LSTM | 93.8 | 89.6 | 92.8 | 80.0 | 81.4 | 83.6 | 86.87 | 40.4 | 15.5 | 32.1 | 32.5 | 54.8 | 35.6 | 39.0 | 35.70 | 24.0 |
| | AutoSTR* | ECCV20 | NAS+LSTM | 94.7 | 90.9 | 94.2 | 81.8 | 81.7 | - | - | 47.7 | 17.9 | 30.8 | 36.2 | 64.2 | 38.7 | 41.3 | 39.54 | 6.0 |
| | RoScanner | ECCV20 | ResNet | 95.3 | 88.1 | 94.8 | 77.1 | 79.5 | 90.3 | 87.52 | 43.6 | 7.9 | 41.2 | 42.6 | 44.9 | 46.9 | 39.5 | 38.09 | 48.0 |
| | ABINet | CVPR21 | ResNet+TF3 | 96.2 | 93.5 | 97.4 | 86.0 | 89.3 | 89.2 | 91.93 | 59.5 | 12.7 | 43.3 | 38.3 | 62.0 | 50.8 | 55.6 | 46.03 | 36.7 |
| | VisionLAN | ICCV21 | ResNet+TF3 | 95.8 | 91.7 | 95.7 | 83.7 | 86.0 | 88.5 | 90.23 | 57.7 | 14.2 | 47.8 | 48.0 | 64.0 | 47.9 | 52.1 | 47.39 | 32.8 |
| | PARSeq* | ECCV22 | ViT-S | 97.0 | 93.6 | 97.0 | 86.5 | 88.9 | 92.2 | 92.53 | 63.9 | 16.7 | 52.5 | 54.3 | 68.2 | 55.9 | 56.9 | 52.62 | 23.8 |
| | MATRN | ECCV22 | ResNet+TF3 | 96.6 | 95.0 | 97.9 | 86.6 | 90.6 | 93.5 | 93.37 | 63.1 | 13.4 | 43.8 | 41.9 | 66.4 | 53.2 | 57.0 | 48.40 | 44.2 |
| | MGP-STR* | ECCV22 | ViT-B | 96.4 | 94.7 | 97.3 | 87.2 | 91.0 | 90.3 | 92.82 | 55.2 | 14.0 | 52.8 | 48.5 | 65.2 | 48.8 | 59.1 | 49.09 | 148.0 |
| | LevOCR* | ECCV22 | ResNet+TF3 | 96.6 | 94.4 | 96.7 | 86.5 | 88.8 | 90.6 | 92.27 | 52.8 | 10.7 | 44.8 | 51.9 | 61.3 | 54.0 | 58.1 | 47.66 | 109.0 |
| | CornerTF* | ECCV22 | CornerEncoder | 95.9 | 94.6 | 97.8 | 86.5 | 91.5 | 92.0 | 93.05 | 62.9 | 18.6 | 56.1 | 58.5 | 68.6 | 59.7 | 61.0 | 55.07 | 86.0 |
| | CPPD | Preprint | SVTR-B | 97.6 | 95.5 | 98.2 | 87.9 | 90.9 | 92.7 | 93.80 | 65.5 | 18.6 | 56.0 | 61.9 | 71.0 | 57.5 | 65.8 | 56.63 | 26.8 |
| | SIGA* | CVPR23 | ViT-B | 96.6 | 95.1 | 97.8 | 86.6 | 90.5 | 93.1 | 93.28 | 59.9 | 22.3 | 49.0 | 50.8 | 66.4 | 58.4 | 56.2 | 51.85 | 113.0 |
| | CCD* | ICCV23 | ViT-B | 97.2 | 94.4 | 97.0 | 87.6 | 91.8 | 93.3 | 93.55 | 66.6 | 24.2 | 63.9 | 64.8 | 74.8 | 62.4 | 64.0 | 60.10 | 52.0 |
| | LISTER* | ICCV23 | FocalNet-B | 96.9 | 93.8 | 97.9 | 87.5 | 89.6 | 90.6 | 92.72 | 56.5 | 17.2 | 52.8 | 63.5 | 63.2 | 59.6 | 65.4 | 54.05 | 49.9 |
| | LPV-B* | IJCAI23 | SVTR-B | 97.3 | 94.6 | 97.6 | 87.5 | 90.9 | 94.8 | 93.78 | 68.3 | 21.0 | 59.6 | 65.1 | 76.2 | 63.6 | 62.0 | 59.40 | 35.1 |
| | CDistNet* | IJCV24 | ResNet+TF3 | 96.4 | 93.5 | 97.4 | 86.0 | 88.7 | 93.4 | 92.57 | 69.3 | 24.4 | 49.8 | 55.6 | 72.8 | 64.3 | 58.5 | 56.38 | 65.5 |
| | CAM* | PR24 | ConvNeXtV2-B | 97.4 | 96.1 | 97.2 | 87.8 | 90.6 | 92.4 | 93.58 | 63.1 | 19.4 | 55.4 | 58.5 | 72.7 | 51.4 | 57.4 | 53.99 | 135.0 |
| | BUSNet | AAAI24 | ViT-S | 96.2 | 95.5 | 98.3 | 87.2 | 91.8 | 91.3 | 93.38 | - | - | - | - | - | - | - | - | 56.8 |
| | DCTC | AAAI24 | SVTR-L | 96.9 | 93.7 | 97.4 | 87.3 | 88.5 | 92.3 | 92.68 | - | - | - | - | - | - | - | - | 40.8 |
| | OTE | CVPR24 | SVTR-B | 96.4 | 95.5 | 97.4 | 87.2 | 89.6 | 92.4 | 93.08 | - | - | - | - | - | - | - | - | 25.2 |
| | CFF | IJCAI24 | CEFE | 97.6 | 94.3 | 97.9 | 86.9 | 91.8 | 95.5 | 94.00 | 70.0 | 20.8 | 62.4 | 72.0 | 75.2 | 65.7 | 65.1 | 61.60 | 23.9 |
| CTCs | CRNN | TPAMI16 | ResNet+LSTM | 82.9 | 81.6 | 91.1 | 69.4 | 70.0 | 65.5 | 76.75 | 7.5 | 0.9 | 20.7 | 25.6 | 13.9 | 25.6 | 32.0 | 18.03 | 8.3 |
| | SVTR* | IJCAI22 | SVTR-B | 96.0 | 91.5 | 97.1 | 85.2 | 89.9 | 91.7 | 91.90 | 69.8 | 37.7 | 47.9 | 61.4 | 66.8 | 44.8 | 61.0 | 55.63 | 24.6 |
| | SVTRv2 | Preprint | SVTRv2-B | 97.7 | 94.0 | 97.3 | 88.1 | 91.2 | 95.8 | 94.02 | 74.6 | 25.2 | 57.6 | 69.7 | 77.9 | 68.0 | 66.9 | 62.83 | 19.8 |
Note: * indicates that the results on Union14M-Benchmarks are evaluated using the officially released model.
If you find our method useful for your research, please cite:
```bibtex
@article{Du2024SVTRv2,
  title={SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition},
  author={Yongkun Du and Zhineng Chen and Hongtao Xie and Caiyan Jia and Yu-Gang Jiang},
  journal={CoRR},
  volume={abs/2411.15858},
  eprinttype={arXiv},
  year={2024},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.15858}
}
```