As an important and challenging problem in computer vision, scene text detection has long drawn researchers' interest. Text detection performance has been pushed forward largely by the rise of deep learning. However, although many models have been proposed to improve text detection in single images, less attention has been paid to text detection in video, which is more challenging because of effects such as motion blur and extreme rotation of text lines.
Given a video as input, we want to build a new model on top of an existing single-image text detector and improve its accuracy without adding too much overhead to system efficiency.
**Figure 1** : **Flow Estimation** From left to right: image1, image2, dense flow map, warped image1. The warped image1 should be close to image2.
**Figure 2** : **Good Examples** The new model detects boxes with challenging rotation angles or small sizes that the original EAST does not detect well, and the geometry of the detected boxes is more precise.
**Figure 3** : **Failure Cases** For boxes near the boundary, feature aggregation sometimes fails because of imprecise flow estimation; precise box prediction depends on robust flow estimation.
**Figure 4** : **Results Comparison** The preliminary test results show that detection performance improves on some videos when flow-based feature aggregation is applied to a single-image text detector, with a significant improvement in recall. However, dense feature aggregation is not robust across all videos, especially in regions where flow estimation is inaccurate. Ways to improve flow estimation or reduce the dependency on flow still need to be proposed.
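To make the idea of flow-based feature aggregation concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It rests on several assumptions that the text does not specify: the flow is stored as per-pixel `(dx, dy)` offsets that tell each location in the reference frame where to sample in the neighboring frame (backward warping), sampling is bilinear with border clamping, and the aggregation weights are a softmax over per-pixel cosine similarity to the reference features. The names `warp_with_flow` and `aggregate_features` are hypothetical.

```python
import numpy as np


def warp_with_flow(feat, flow):
    """Backward-warp a feature map (H, W, C) with a dense flow field (H, W, 2).

    Assumed convention: flow[y, x] = (dx, dy), so output location (x, y)
    samples feat at (x + dx, y + dy) using bilinear interpolation.
    Sampling positions are clamped to the map border.
    """
    h, w = feat.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]

    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def aggregate_features(ref_feat, neighbor_feats, flows, eps=1e-8):
    """Fuse the reference features with flow-warped neighbor features.

    Weighting scheme (an assumption, not taken from the source): softmax over
    the per-pixel cosine similarity between each warped map and the reference.
    """
    warped = [ref_feat] + [
        warp_with_flow(f, fl) for f, fl in zip(neighbor_feats, flows)
    ]
    stack = np.stack(warped)  # (N, H, W, C)

    ref_unit = ref_feat / (np.linalg.norm(ref_feat, axis=-1, keepdims=True) + eps)
    stack_unit = stack / (np.linalg.norm(stack, axis=-1, keepdims=True) + eps)
    similarity = (stack_unit * ref_unit).sum(axis=-1)  # (N, H, W)

    # Numerically stable softmax across the frame axis.
    weights = np.exp(similarity - similarity.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * stack).sum(axis=0)
```

Under this weighting, a neighbor whose warped features disagree with the reference at some location (for example where the flow is inaccurate) contributes less there, which is one possible way to soften the dependency on flow noted above.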