
SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

[Paper] [Doc] [Model] [Datasets] [Config, Training and Inference] [Benchmark]

Introduction

Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly owing to their simple architecture, which contains only a visual model and a CTC-aligned linear classifier, and to their resulting fast inference. However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endow it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage, so it does not increase the inference cost. We evaluate SVTRv2 on both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across these scenarios in terms of both accuracy and speed.
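For intuition, here is a minimal sketch of the multi-size resizing idea, assuming illustrative aspect-ratio thresholds and target shapes; the actual buckets used by SVTRv2 are defined in the paper and the training configs, not here.

```python
# Illustrative sketch of aspect-ratio-based multi-size resizing (MSR).
# The bucket thresholds and target shapes below are placeholders chosen for
# explanation only; they are NOT the values used by SVTRv2.
from PIL import Image

# (max_aspect_ratio, (target_height, target_width)) buckets, checked in order.
MSR_BUCKETS = [
    (1.5, (64, 64)),            # near-square text, e.g. short or vertical words
    (2.5, (48, 96)),
    (3.5, (40, 112)),
    (float("inf"), (32, 128)),  # long horizontal text lines
]


def msr_resize(img: Image.Image) -> Image.Image:
    """Resize a text image to the target shape of its aspect-ratio bucket."""
    ratio = img.width / img.height
    for max_ratio, (h, w) in MSR_BUCKETS:
        if ratio < max_ratio:
            return img.resize((w, h))
    return img  # unreachable: the last bucket catches every ratio
```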

A Unified Training and Evaluation Benchmark for Scene Text Recognition

Recent research shows that the performance of STR models can be significantly improved by leveraging large-scale real-world datasets. However, many previous methods were trained on synthetic datasets, which fails to reflect their performance in real-world scenarios. Additionally, recent approaches have used different real-world datasets and inconsistent evaluation protocols, making it difficult to compare their performance.

To this end, we established a unified benchmark to re-train and evaluate mainstream STR methods.

First, to evaluate the performance of STR methods across diverse scenarios, we selected Union14M-Benchmarks as the test set. This benchmark includes a variety of complex scenarios. Additionally, we reported results on six test sets (Common Benchmarks) used in previous studies. For the training set, we used the large-scale real-world training dataset Union14M-L. To avoid data leakage, we filtered out the overlapping samples between Union14M-L (training set) and Union14M-L-Benchmark (Test set), resulting in the Union14M-L-Filter training dataset.

Furthermore, previous methods used inconsistent hyperparameter settings during training, which contributed to variations in their performance. To ensure reliable evaluation, we standardized key settings that significantly affect accuracy, such as the number of training epochs, data augmentation strategies, input size, and evaluation strategies. This ensures the reliability of our results.

Subsequently, we trained 24 reproduced STR methods and SVTRv2 using the Union14M-L-Filter dataset and evaluated their performance on both Common Benchmarks and Union14M-L-Benchmark.

Dataset Details

Training Dataset

  • Unified Training Set: All models are trained from scratch on a unified dataset named Union14M-L-Filter. This dataset was derived from Union14M-L, with certain adjustments for filtering.
  • Composition of Datasets: The training samples are categorized as Easy, Medium, Hard, Norm, and Challenging (see the table below).
    • Union14M-L: Contains 3,230,742 images in total.
    • Union14M-L-Filter: The filtered version contains 3,224,143 images.
    • Overlap with Union14M-Benchmarks: Only 6,599 images overlap between Union14M-L and the benchmark datasets used for evaluation.

| Dataset | Easy | Medium | Hard | Norm | Challenging | Total |
|---|---|---|---|---|---|---|
| Union14M-L | 2,076,161 | 145,525 | 308,025 | 218,154 | 482,877 | 3,230,742 |
| Union14M-L-Filter | 2,073,822 | 144,677 | 306,771 | 217,070 | 481,803 | 3,224,143 |

Test Datasets

  1. Common Test Set: Previous methods have primarily been evaluated on this set. While it includes some irregular text, the datasets are not highly challenging (e.g., little curvature or rotation). Models trained on synthetic datasets often perform well here.
    • Includes 6 subsets with regular and irregular samples.
    • Example datasets: IIIT5K, SVT, IC13, IC15, SVTP, and CUTE80.

| Dataset | Image Count | Characteristics |
|---|---|---|
| IIIT5K | 3,000 | Regular |
| SVT | 647 | Regular |
| IC13 | 857 | Regular |
| IC15 | 1,811 | Irregular (low resolution, blurring) |
| SVTP | 645 | Irregular (affine, perspective) |
| CUTE80 | 288 | Irregular (curved, clear) |
  2. Challenging Test Set (Union14M-L-Benchmark): Introduced to test the full capabilities of STR (Scene Text Recognition) models. This set includes significantly more difficult text samples, featuring extreme curvature, rotation, artistic styles, overlapping text, and other challenges.
    • Includes datasets such as Curve, Multi-Oriented, Artistic, Contextless, Salient, Multi-word, and General.

| Dataset | Image Count | Characteristics |
|---|---|---|
| Curve | 2,426 | Severe curvature |
| Multi-Oriented | 1,369 | Severe rotation, multi-angle, vertical directions |
| Artistic | 900 | Artistic styles, often seen in logos |
| Contextless | 779 | No semantic meaning, out-of-dictionary words |
| Salient | 1,585 | Adjacent or overlapping text |
| Multi-word | 829 | Contains multiple words |
| General | 400,000 | Includes challenges like blurring, distortion |
  3. Special Test Sets: Designed to evaluate specific model capabilities beyond the above datasets.
    • LTB (Long Text Benchmark): Evaluates performance on long texts (25–36 characters), as models are typically trained on short texts (≤25 characters).
    • OST (Occluded Scene Text): Tests the model’s ability to infer text from damaged or partially erased samples.

| Dataset | Image Count | Characteristics |
|---|---|---|
| LTB | 3,376 | Long text (25 < length < 36) |
| OST | 4,832 | Partially erased/destroyed characters |

Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Additionally, Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively.

Implementation Details

The optimal training hyperparameters are not fixed across models. However, key settings that significantly impact accuracy, such as Training Epochs, Data Augmentation, Input Size, Data Type, and Evaluation Protocols, must be strictly standardized to ensure fair and unbiased comparisons of model performance. With these standardizations in place, the results reflect the true capabilities of the models rather than experimental inconsistencies. The specific settings are listed below; a code sketch of the length filtering, evaluation protocol, and optimizer setup follows the table.

| Setting | Detail |
|---|---|
| Training Set | When a text image's label length exceeds 25, a sample with text length ≤ 25 is randomly selected from the training set instead, so that models are only exposed to short texts (length ≤ 25). |
| Test Sets | For all test sets except the long-text test set (LTB), text images with text length > 25 are filtered out. Text length is calculated after removing spaces and special characters outside the 94-character set. |
| Input Size | Unless a method explicitly requires a dynamic size, models use a fixed input size of 32×128. If a model fails to train correctly with 32×128, its original input size is used instead. The test input size matches the training size. |
| Data Augmentation | All models use the data augmentation strategy employed by PARSeq. |
| Training Epochs | Unless pre-training is required, all models are trained for 20 epochs. |
| Optimizer | AdamW is the default optimizer. If training fails to converge with AdamW, Adam or another optimizer is used. |
| Batch Size | The maximum batch size for all models is 1024. If single-GPU training is not feasible, 2 GPUs (512 per GPU) or 4 GPUs (256 per GPU) are used. If 4-GPU training runs out of memory, the batch size is halved and the learning rate is adjusted accordingly. |
| Learning Rate | The default learning rate for batch size 1024 is 0.00065. The learning rate is tuned per model to achieve the best results. |
| Learning Rate Scheduler | A linear warm-up for 1.5 epochs is followed by a OneCycle scheduler. |
| Weight Decay | The default weight decay is 0.05. NormLayer and bias parameters have a weight decay of 0. |
| Data Type | All models are trained with mixed precision. |
| EMA or Similar Tricks | No EMA or similar tricks are used for any model. |
| Evaluation Protocols | Word accuracy is evaluated after filtering out special characters and converting all text to lowercase. |
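To make these settings concrete, here is a minimal sketch, in plain Python/PyTorch, of how the length filtering, the evaluation normalization, and the optimizer/scheduler described above could be reproduced. The exact 94-character set, the linear rescaling of the learning rate with batch size, and all helper names are assumptions made for illustration; the authoritative settings live in the repository's configs.

```python
# Sketch of the standardized protocol above (illustrative, not the repo code).
import re
import string

import torch

# An assumed reading of the "94-character set": digits, letters, and ASCII
# punctuation (10 + 52 + 32 = 94 symbols).
CHARSET_94 = set(string.digits + string.ascii_letters + string.punctuation)


def text_length(label: str) -> int:
    """Length used for filtering: spaces and out-of-charset symbols are ignored."""
    return sum(1 for c in label if c in CHARSET_94)


def keep_sample(label: str, max_len: int = 25) -> bool:
    """Training/short-text-test filter: keep only texts of length <= 25."""
    return text_length(label) <= max_len


def normalize_for_eval(text: str) -> str:
    """Word-accuracy normalization: drop special characters, lowercase the rest."""
    return re.sub(r"[^0-9a-zA-Z]", "", text).lower()


def word_accuracy(preds, gts) -> float:
    hits = sum(normalize_for_eval(p) == normalize_for_eval(g)
               for p, g in zip(preds, gts))
    return hits / max(len(gts), 1)


def build_optimizer(model, batch_size, epochs=20, steps_per_epoch=1000):
    """AdamW + 1.5-epoch warm-up + OneCycle decay, per the table above.

    The linear rescaling of the 0.00065 base learning rate with batch size is
    an assumption; per-model learning rates were tuned in the benchmark.
    """
    lr = 0.00065 * batch_size / 1024
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Norm-layer weights and biases are 1-D tensors: no weight decay.
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.05},
         {"params": no_decay, "weight_decay": 0.0}], lr=lr)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=steps_per_epoch,
        pct_start=1.5 / epochs)  # warm-up fraction of the schedule
    return optimizer, scheduler
```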

Get Started with training a SoTA Scene Text Recognition model from scratch.

Installation

  • PyTorch version >= 1.13.0
  • Python version >= 3.7
```bash
git clone https://github.com/Topdu/OpenOCR.git
cd OpenOCR
# Ubuntu 20.04 Cuda 11.8
conda create -n openocr python==3.8
conda activate openocr
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
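As a quick, optional sanity check (not part of the official instructions), the snippet below verifies that PyTorch, torchvision, and CUDA are visible from the new environment:

```python
# Optional sanity check: run inside the activated "openocr" environment.
import torch
import torchvision

print("torch:", torch.__version__)             # 2.2.0 with the commands above
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```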

Downloading Datasets

All data can be downloaded from Google Drive.

The structure of Datasets and OpenOCR code will be organized as follows:

```text
benchmark_bctr # Chinese text datasets, optional
├── benchmark_bctr_test
│   ├── document_test
│   ├── handwriting_test
│   ├── scene_test
│   └── web_test
└── benchmark_bctr_train
    ├── document_train
    ├── handwriting_train
    ├── scene_train
    └── web_train
evaluation
├── CUTE80
├── IC13_857
├── IC15_1811
├── IIIT5k
├── SVT
└── SVTP
iiit5k_test_images # for Latency Measurement, optional
ltb # Long Text Benchmark
OpenOCR
OST
synth # optional
├── MJ
│   ├── test
│   ├── train
│   └── val
└── ST
test # Common Benchmarks from PARSeq
├── ArT
├── COCOv1.4
├── CUTE80
├── IC13_1015
├── IC13_1095
├── IC13_857
├── IC15_1811
├── IC15_2077
├── IIIT5k
├── SVT
├── SVTP
└── Uber
u14m # lmdb format Union14M-Benchmark
├── artistic
├── contextless
├── curve
├── general
├── multi_oriented
├── multi_words
└── salient
Union14M-L-LMDB-Filtered # lmdb format Union14M-L-Filtered
├── train_challenging
├── train_easy
├── train_hard
├── train_medium
└── train_normal
```

Datasets used during Training

| Datasets | Download (Google Drive / Baidu Yun) |
|---|---|
| Union14M-L-Filter | LMDB archives |
| Evaluation | LMDB archives |

If you have downloaded Union14M-L, you can use the filtered list of images to create an LMDB of the training set Union14M-L-Filter.
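If you go this route, the sketch below shows one way to build such an LMDB from a list of kept image paths and labels. The list format, the file names, and the `image-%09d` / `label-%09d` key convention (commonly used for STR LMDB datasets) are assumptions; adapt them to the actual filter list you downloaded and to what OpenOCR's LMDB reader expects.

```python
# Illustrative LMDB builder for a filtered image list. Assumed list format:
# one "relative/path/to/img.jpg\tlabel" pair per line.
import os

import lmdb


def build_lmdb(filter_list_path, image_root, output_dir, map_size=1 << 40):
    os.makedirs(output_dir, exist_ok=True)
    env = lmdb.open(output_dir, map_size=map_size)
    n = 0
    with env.begin(write=True) as txn, open(filter_list_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            rel_path, label = line.split("\t", 1)
            with open(os.path.join(image_root, rel_path), "rb") as img_f:
                img_bytes = img_f.read()
            n += 1
            # Keys follow the image-%09d / label-%09d / num-samples convention.
            txn.put(f"image-{n:09d}".encode(), img_bytes)
            txn.put(f"label-{n:09d}".encode(), label.encode("utf-8"))
        txn.put(b"num-samples", str(n).encode())
    env.close()


# Example call (all paths are placeholders):
# build_lmdb("union14m_l_filter_list.txt", "Union14M-L/",
#            "Union14M-L-LMDB-Filtered/train_easy")
```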

Test Set

| Datasets | Download (Google Drive / Baidu Yun) |
|---|---|
| Union14M-L-Benchmark | LMDB archives |
| Common-Benchmarks | LMDB archives |
| Long Text Benchmark (LTB) | LMDB archives |
| Occluded Scene Text (OST) | LMDB archives |

Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively.

Training & Evaluation & Inference & Latency Measurement

Note: Take SVTRv2 as an example here. The execution commands for each model are listed in detail on their readme pages.

Training

```bash
# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
# For multiple RTX 4090s
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 tools/train_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
# 20 epochs take about 6 hours
```

Evaluation

```bash
# short text: Common, Union14M-Benchmark, OST
python tools/eval_rec_all_ratio.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml
```

After a successful run, the results are saved as a CSV file under the output_dir specified in the config file.

Inference

```bash
python tools/infer_rec.py --c configs/rec/svtrv2/svtrv2_smtr_gtc_rctc.yml --o Global.infer_img=/path/img_fold or /path/img_file
```

Latency Measurement

First, download the IIIT5K images from Google Drive. Then run the following command:

```bash
python tools/infer_rec.py --c configs/rec/svtrv2/svtrv2_rctc.yml --o Global.infer_img=../iiit5k_test_image
```
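If you want to sanity-check latency outside the repository's tooling, a minimal hand-rolled timing loop could look like the sketch below. Here `model` and `inputs` are placeholders for a recognizer built from your config and a list of preprocessed image tensors; the latency numbers in the benchmark table were produced with the repository's own scripts on a 1080Ti.

```python
# Minimal per-image latency sketch (batch size 1, GPU timing with warm-up).
import time

import torch


@torch.no_grad()
def measure_latency_ms(model, inputs, warmup=20):
    model.eval()
    device = next(model.parameters()).device
    inputs = [x.to(device) for x in inputs]
    for x in inputs[:warmup]:            # warm-up: exclude CUDA init/autotuning
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for x in inputs[warmup:]:
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(len(inputs) - warmup, 1)
```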

Results (Benchmark) & Configs & Checkpoints:

(TODO) Download all model checkpoints from Google Drive and Baidu Yun.

| Type | Method | Venue | Encoder | Config | Model | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Common Avg | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | General | U14M Avg | LTB | OST | Param (M) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EDTRs | ASTER | TPAMI19 | ResNet31+LSTM | Config | ckpt | 96.1 | 93.0 | 94.9 | 86.1 | 87.9 | 92.0 | 91.68 | 70.9 | 82.2 | 56.7 | 62.9 | 73.9 | 58.5 | 76.3 | 68.75 | 0.02 | 61.9 | 19.04 | 14.9 |
| | NRTR | ICDAR19 | Stem+TF6 | Config | ckpt | 98.1 | 96.8 | 97.8 | 88.9 | 93.3 | 94.4 | 94.89 | 67.9 | 42.4 | 66.5 | 73.6 | 66.4 | 77.2 | 78.3 | 67.46 | 2.00 | 74.8 | 44.26 | 57.8 |
| | MORAN | PR19 | ResNet31+LSTM | Config | ckpt | 96.7 | 91.7 | 94.6 | 84.6 | 85.7 | 90.3 | 90.61 | 51.2 | 15.5 | 51.3 | 61.2 | 43.2 | 64.1 | 69.3 | 50.82 | 0.06 | 57.9 | 17.35 | 16.8 |
| | SAR | AAAI19 | ResNet31+LSTM | Config | ckpt | 98.1 | 93.8 | 96.7 | 86.0 | 87.9 | 95.5 | 93.01 | 70.5 | 51.8 | 63.7 | 73.9 | 64.0 | 79.1 | 75.5 | 68.36 | 0.00 | 60.6 | 57.47 | 63.1 |
| | DAN | AAAI20 | ResNet45+FPN | Config | ckpt | 97.5 | 94.7 | 96.5 | 87.1 | 89.1 | 94.4 | 93.24 | 74.9 | 63.3 | 63.4 | 70.6 | 70.2 | 71.1 | 76.8 | 70.05 | 0.00 | 61.8 | 27.71 | 10.1 |
| | SRN | CVPR20 | ResNet50+FPN | Config | ckpt | 97.2 | 96.3 | 97.5 | 87.9 | 90.9 | 96.9 | 94.45 | 78.1 | 63.2 | 66.3 | 65.3 | 71.4 | 58.3 | 76.5 | 68.43 | 0.00 | 64.6 | 51.70 | 14.9 |
| | SEED | CVPR20 | ResNet31+LSTM | Config | ckpt | 96.5 | 93.2 | 94.2 | 87.5 | 88.7 | 93.4 | 92.24 | 69.1 | 80.9 | 56.9 | 63.9 | 73.4 | 61.3 | 76.5 | 68.87 | 0.10 | 62.6 | 23.95 | 15.3 |
| | AutoSTR | ECCV20 | SearchCNN+LSTM | Config | ckpt | 96.8 | 92.4 | 95.7 | 86.6 | 88.2 | 93.4 | 92.19 | 72.1 | 81.7 | 56.7 | 64.8 | 75.4 | 64.0 | 75.9 | 70.09 | 0.10 | 61.5 | 6.04 | 12.1 |
| | RoScanner | ECCV20 | ResNet31 | Config | ckpt | 98.5 | 95.8 | 97.7 | 88.2 | 90.1 | 97.6 | 94.65 | 79.4 | 68.1 | 70.5 | 79.6 | 71.6 | 82.5 | 80.8 | 76.08 | 0.00 | 68.6 | 47.98 | 15.6 |
| | ABINet | CVPR21 | ResNet45+TF3 | Config | ckpt | 98.5 | 98.1 | 97.7 | 90.1 | 94.1 | 96.5 | 95.83 | 80.4 | 69.0 | 71.7 | 74.7 | 77.6 | 76.8 | 79.8 | 75.72 | 0.00 | 75.0 | 36.86 | 13.7 |
| | VisionLAN | ICCV21 | ResNet45+TF3 | Config | ckpt | 98.2 | 95.8 | 97.1 | 88.6 | 91.2 | 96.2 | 94.50 | 79.6 | 71.4 | 67.9 | 73.7 | 76.1 | 73.9 | 79.1 | 74.53 | 0.00 | 66.4 | 32.88 | 10.7 |
| | PARSeq | ECCV22 | ViT-S | Config | ckpt | 98.9 | 98.1 | 98.4 | 90.1 | 94.3 | 98.6 | 96.40 | 87.6 | 88.8 | 76.5 | 83.4 | 84.4 | 84.3 | 84.9 | 84.26 | 0.00 | 79.9 | 23.83 | 19.0 |
| | MATRN | ECCV22 | ResNet45+TF3 | Config | ckpt | 98.8 | 98.3 | 97.9 | 90.3 | 95.2 | 97.2 | 96.29 | 82.2 | 73.0 | 73.4 | 76.9 | 79.4 | 77.4 | 81.0 | 77.62 | 0.00 | 77.8 | 44.34 | 21.3 |
| | MGP-STR | ECCV22 | ViT-B | Config | ckpt | 97.9 | 97.8 | 97.1 | 89.6 | 95.2 | 96.9 | 95.74 | 85.2 | 83.7 | 72.6 | 75.1 | 79.8 | 71.1 | 83.1 | 78.65 | 0.00 | 78.6 | 148.00 | 8.2 |
| | CPPD-B | Preprint | SVTR-B | Config | ckpt | 99.0 | 97.8 | 98.2 | 90.4 | 94.0 | 99.0 | 96.40 | 86.2 | 78.7 | 76.5 | 82.9 | 83.5 | 81.9 | 83.5 | 81.91 | 0.00 | 79.6 | 27.00 | 8.0 |
| | LPV-B | IJCAI23 | SVTR-B | Config | ckpt | 98.6 | 97.8 | 98.1 | 89.8 | 93.6 | 97.6 | 95.93 | 86.2 | 78.7 | 75.8 | 80.2 | 82.9 | 81.6 | 82.9 | 81.20 | 0.00 | 77.7 | 30.54 | 12.1 |
| | MAERec | ICCV23 | ViT-S | Config | ckpt | 99.2 | 97.8 | 98.2 | 90.4 | 94.3 | 98.3 | 96.36 | 89.1 | 87.1 | 79.0 | 84.2 | 86.3 | 85.9 | 84.6 | 85.17 | 9.80 | 76.4 | 35.69 | 58.4 |
| | LISTER | ICCV23 | FocalNet-B | Config | ckpt | 98.8 | 97.5 | 98.6 | 90.0 | 94.4 | 96.9 | 95.48 | 78.7 | 68.8 | 73.7 | 81.6 | 74.8 | 82.4 | 83.5 | 77.64 | 36.3 | 77.1 | 51.11 | 20.4 |
| | CDistNet | IJCV24 | ResNet45+TF3 | Config | ckpt | 98.7 | 97.1 | 97.8 | 89.6 | 93.5 | 96.9 | 95.59 | 81.7 | 77.1 | 72.6 | 78.2 | 79.9 | 79.7 | 81.1 | 78.62 | 0.00 | 71.8 | 43.32 | 62.9 |
| | CAM | PR24 | ConvNeXtV2-T | Config | ckpt | 98.2 | 96.1 | 96.6 | 89.0 | 93.5 | 96.2 | 94.94 | 85.4 | 89.0 | 72.0 | 75.4 | 84.0 | 74.8 | 83.1 | 80.52 | 0.52 | 74.2 | 58.66 | 35.0 |
| | BUSNet | AAAI24 | ViT-S | Config | ckpt | 98.3 | 98.1 | 97.8 | 90.2 | 95.3 | 96.5 | 96.06 | 83.0 | 82.3 | 70.8 | 77.9 | 78.8 | 71.2 | 82.6 | 78.10 | 0.00 | 78.7 | 32.10 | 12.0 |
| | OTE | CVPR24 | SVTR-B | Config | ckpt | 98.6 | 96.6 | 98.0 | 90.1 | 94.0 | 97.2 | 95.74 | 86.0 | 75.8 | 74.6 | 74.7 | 81.0 | 65.3 | 82.3 | 77.09 | 0.00 | 77.8 | 20.28 | 18.1 |
| CTCs | CRNN | TPAMI16 | ResNet31+LSTM | Config | ckpt | 95.8 | 91.8 | 94.6 | 84.9 | 83.1 | 91.0 | 90.21 | 48.1 | 13.0 | 51.2 | 62.3 | 41.4 | 60.4 | 68.2 | 49.24 | 47.21 | 58.0 | 16.20 | 5.8 |
| | SVTR | IJCAI22 | SVTR-B | Config | ckpt | 98.0 | 97.1 | 97.3 | 88.6 | 90.7 | 95.8 | 94.58 | 76.2 | 44.5 | 67.8 | 78.7 | 75.2 | 77.9 | 77.8 | 71.17 | 45.08 | 69.6 | 18.09 | 6.2 |
| | SVTRv2 | Preprint | SVTRv2-T | Config | ckpt | 98.6 | 96.6 | 98.0 | 88.4 | 90.5 | 96.5 | 94.78 | 83.6 | 76.0 | 71.2 | 82.4 | 77.2 | 82.3 | 80.7 | 79.05 | 47.83 | 71.4 | 5.13 | 5.0 |
| | | | SVTRv2-S | Config | ckpt | 99.0 | 98.3 | 98.5 | 89.5 | 92.9 | 98.6 | 96.13 | 88.3 | 84.6 | 76.5 | 84.3 | 83.3 | 85.4 | 83.5 | 83.70 | 47.57 | 78.0 | 11.25 | 5.3 |
| | | | SVTRv2-B | Config | ckpt | 99.2 | 98.0 | 98.7 | 91.1 | 93.5 | 99.0 | 96.57 | 90.6 | 89.0 | 79.3 | 86.1 | 86.2 | 86.7 | 85.1 | 86.14 | 50.23 | 80.0 | 19.76 | 7.0 |

Note: TF$_n$ denotes an $n$-layer Transformer block. Param (M) denotes the number of parameters in millions. Latency (ms) is measured on one NVIDIA 1080Ti GPU in PyTorch dynamic graph mode.

Results when trained on synthetic datasets ($ST$ + $MJ$).

| Type | Method | Venue | Encoder | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | Common Avg | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | General | U14M Avg | Param (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EDTRs | ASTER | TPAMI19 | ResNet+LSTM | 93.3 | 90.0 | 90.8 | 74.7 | 80.2 | 80.9 | 84.98 | 34.0 | 10.2 | 27.7 | 33.0 | 48.2 | 27.6 | 39.8 | 31.50 | 27.2 |
| | NRTR | ICDAR19 | Stem+TF6 | 90.1 | 91.5 | 95.8 | 79.4 | 86.6 | 80.9 | 87.38 | 31.7 | 4.4 | 36.6 | 37.3 | 30.6 | 54.9 | 48.0 | 34.79 | 31.7 |
| | MORAN | PR19 | ResNet+LSTM | 91.0 | 83.9 | 91.3 | 68.4 | 73.3 | 75.7 | 80.60 | 8.9 | 0.7 | 29.4 | 20.7 | 17.9 | 23.8 | 35.2 | 19.51 | 17.4 |
| | SAR | AAAI19 | ResNet+LSTM | 91.5 | 84.5 | 91.0 | 69.2 | 76.4 | 83.5 | 82.68 | 44.3 | 7.7 | 42.6 | 44.2 | 44.0 | 51.2 | 50.5 | 40.64 | 57.7 |
| | DAN | AAAI20 | ResNet+FPN | 93.4 | 87.5 | 92.1 | 71.6 | 78.0 | 81.3 | 83.98 | 26.7 | 1.5 | 35.0 | 40.3 | 36.5 | 42.2 | 42.1 | 32.04 | 27.7 |
| | SRN | CVPR20 | ResNet+FPN | 94.8 | 91.5 | 95.5 | 82.7 | 85.1 | 87.8 | 89.57 | 63.4 | 25.3 | 34.1 | 28.7 | 56.5 | 26.7 | 46.3 | 40.14 | 54.7 |
| | SEED* | CVPR20 | ResNet+LSTM | 93.8 | 89.6 | 92.8 | 80.0 | 81.4 | 83.6 | 86.87 | 40.4 | 15.5 | 32.1 | 32.5 | 54.8 | 35.6 | 39.0 | 35.70 | 24.0 |
| | AutoSTR* | ECCV20 | NAS+LSTM | 94.7 | 90.9 | 94.2 | 81.8 | 81.7 | - | - | 47.7 | 17.9 | 30.8 | 36.2 | 64.2 | 38.7 | 41.3 | 39.54 | 6.0 |
| | RoScanner | ECCV20 | ResNet | 95.3 | 88.1 | 94.8 | 77.1 | 79.5 | 90.3 | 87.52 | 43.6 | 7.9 | 41.2 | 42.6 | 44.9 | 46.9 | 39.5 | 38.09 | 48.0 |
| | ABINet | CVPR21 | ResNet+TF3 | 96.2 | 93.5 | 97.4 | 86.0 | 89.3 | 89.2 | 91.93 | 59.5 | 12.7 | 43.3 | 38.3 | 62.0 | 50.8 | 55.6 | 46.03 | 36.7 |
| | VisionLAN | ICCV21 | ResNet+TF3 | 95.8 | 91.7 | 95.7 | 83.7 | 86.0 | 88.5 | 90.23 | 57.7 | 14.2 | 47.8 | 48.0 | 64.0 | 47.9 | 52.1 | 47.39 | 32.8 |
| | PARSeq* | ECCV22 | ViT-S | 97.0 | 93.6 | 97.0 | 86.5 | 88.9 | 92.2 | 92.53 | 63.9 | 16.7 | 52.5 | 54.3 | 68.2 | 55.9 | 56.9 | 52.62 | 23.8 |
| | MATRN | ECCV22 | ResNet+TF3 | 96.6 | 95.0 | 97.9 | 86.6 | 90.6 | 93.5 | 93.37 | 63.1 | 13.4 | 43.8 | 41.9 | 66.4 | 53.2 | 57.0 | 48.40 | 44.2 |
| | MGP-STR* | ECCV22 | ViT-B | 96.4 | 94.7 | 97.3 | 87.2 | 91.0 | 90.3 | 92.82 | 55.2 | 14.0 | 52.8 | 48.5 | 65.2 | 48.8 | 59.1 | 49.09 | 148.0 |
| | LevOCR* | ECCV22 | ResNet+TF3 | 96.6 | 94.4 | 96.7 | 86.5 | 88.8 | 90.6 | 92.27 | 52.8 | 10.7 | 44.8 | 51.9 | 61.3 | 54.0 | 58.1 | 47.66 | 109.0 |
| | CornerTF* | ECCV22 | CornerEncoder | 95.9 | 94.6 | 97.8 | 86.5 | 91.5 | 92.0 | 93.05 | 62.9 | 18.6 | 56.1 | 58.5 | 68.6 | 59.7 | 61.0 | 55.07 | 86.0 |
| | CPPD | Preprint | SVTR-B | 97.6 | 95.5 | 98.2 | 87.9 | 90.9 | 92.7 | 93.80 | 65.5 | 18.6 | 56.0 | 61.9 | 71.0 | 57.5 | 65.8 | 56.63 | 26.8 |
| | SIGA* | CVPR23 | ViT-B | 96.6 | 95.1 | 97.8 | 86.6 | 90.5 | 93.1 | 93.28 | 59.9 | 22.3 | 49.0 | 50.8 | 66.4 | 58.4 | 56.2 | 51.85 | 113.0 |
| | CCD* | ICCV23 | ViT-B | 97.2 | 94.4 | 97.0 | 87.6 | 91.8 | 93.3 | 93.55 | 66.6 | 24.2 | 63.9 | 64.8 | 74.8 | 62.4 | 64.0 | 60.10 | 52.0 |
| | LISTER* | ICCV23 | FocalNet-B | 96.9 | 93.8 | 97.9 | 87.5 | 89.6 | 90.6 | 92.72 | 56.5 | 17.2 | 52.8 | 63.5 | 63.2 | 59.6 | 65.4 | 54.05 | 49.9 |
| | LPV-B* | IJCAI23 | SVTR-B | 97.3 | 94.6 | 97.6 | 87.5 | 90.9 | 94.8 | 93.78 | 68.3 | 21.0 | 59.6 | 65.1 | 76.2 | 63.6 | 62.0 | 59.40 | 35.1 |
| | CDistNet* | IJCV24 | ResNet+TF3 | 96.4 | 93.5 | 97.4 | 86.0 | 88.7 | 93.4 | 92.57 | 69.3 | 24.4 | 49.8 | 55.6 | 72.8 | 64.3 | 58.5 | 56.38 | 65.5 |
| | CAM* | PR24 | ConvNeXtV2-B | 97.4 | 96.1 | 97.2 | 87.8 | 90.6 | 92.4 | 93.58 | 63.1 | 19.4 | 55.4 | 58.5 | 72.7 | 51.4 | 57.4 | 53.99 | 135.0 |
| | BUSNet | AAAI24 | ViT-S | 96.2 | 95.5 | 98.3 | 87.2 | 91.8 | 91.3 | 93.38 | - | - | - | - | - | - | - | - | 56.8 |
| | DCTC | AAAI24 | SVTR-L | 96.9 | 93.7 | 97.4 | 87.3 | 88.5 | 92.3 | 92.68 | - | - | - | - | - | - | - | - | 40.8 |
| | OTE | CVPR24 | SVTR-B | 96.4 | 95.5 | 97.4 | 87.2 | 89.6 | 92.4 | 93.08 | - | - | - | - | - | - | - | - | 25.2 |
| | CFF | IJCAI24 | CEFE | 97.6 | 94.3 | 97.9 | 86.9 | 91.8 | 95.5 | 94.00 | 70.0 | 20.8 | 62.4 | 72.0 | 75.2 | 65.7 | 65.1 | 61.60 | 23.9 |
| CTCs | CRNN | TPAMI16 | ResNet+LSTM | 82.9 | 81.6 | 91.1 | 69.4 | 70.0 | 65.5 | 76.75 | 7.5 | 0.9 | 20.7 | 25.6 | 13.9 | 25.6 | 32.0 | 18.03 | 8.3 |
| | SVTR* | IJCAI22 | SVTR-B | 96.0 | 91.5 | 97.1 | 85.2 | 89.9 | 91.7 | 91.90 | 69.8 | 37.7 | 47.9 | 61.4 | 66.8 | 44.8 | 61.0 | 55.63 | 24.6 |
| | SVTRv2 | Preprint | SVTRv2-B | 97.7 | 94.0 | 97.3 | 88.1 | 91.2 | 95.8 | 94.02 | 74.6 | 25.2 | 57.6 | 69.7 | 77.9 | 68.0 | 66.9 | 62.83 | 19.8 |

Note: * indicates that the results on Union14M-Benchmarks were obtained by evaluating the model released by the original authors.

Citation

If you find our method useful for your research, please cite:

```bibtex
@article{Du2024SVTRv2,
      title={SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition},
      author={Yongkun Du and Zhineng Chen and Hongtao Xie and Caiyan Jia and Yu-Gang Jiang},
      journal={CoRR},
      volume={abs/2411.15858},
      eprinttype={arXiv},
      year={2024},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.15858}
}
```