TSINGHUA UNIVERSITY - TENCENT JOINT LABORATORY

A Large Chinese Text Dataset in the Wild

Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li, Tai-Jiang Mu and Shi-Min Hu

In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for more complicated character sets such as Chinese. Lack of training data has always been a problem, especially for deep learning methods, which require massive training data. We provide details of a newly created dataset of Chinese text with about 1 million character instances from 3,850 unique characters, annotated by experts in over 30,000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.

  • 32,285 high resolution images
  • 1,018,402 character instances
  • 3,850 character categories
  • 6 kinds of attributes
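
Two of the metrics quoted in the abstract can be illustrated with a small sketch. It assumes that AED stands for average edit distance (this page does not expand the abbreviation) and uses plain Levenshtein distance between a predicted and a ground-truth string. The function names and the sample strings are hypothetical; this is not the evaluation API from the git repository, and the official protocol may match predictions to ground truth differently.

```python
from typing import Sequence, Tuple


def top1_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of samples whose top-ranked prediction equals the label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def average_edit_distance(pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean edit distance over (predicted, truth) string pairs.
    Assumption: one pair per image; the real protocol may differ."""
    if not pairs:
        return 0.0
    return sum(edit_distance(p, t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical predictions, for illustration only.
    print(top1_accuracy(["中", "国", "人"], ["中", "国", "大"]))
    print(average_edit_distance([("北京", "北京"), ("上诲市", "上海")]))
```

Under this reading, higher is better for top-1 accuracy and lower is better for AED. AP for character detection is omitted here because it depends on the box-matching rules defined by the evaluation code.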

Tutorial

For the latest tutorial, please check out our git repository.

Part-1: basics
  • dataset split
  • annotation format
  • annotation examples

Learn more >>

Part-2: classification
  • train baseline models
  • submission format
  • evaluation API

(available in the git repository)

Part-3: detection
  • train baseline models
  • submission format
  • evaluation API

(available in the git repository)

Files

Evaluation Server

Contact

If you have any questions about the dataset or code, please contact Tai-Ling Yuan (yuantailing[at]gmail.com).

BibTeX:

@article{yuan2019ctw,
  author  = {Tai{-}Ling Yuan and Zhe Zhu and Kun Xu and Cheng{-}Jun Li and Tai{-}Jiang Mu and Shi{-}Min Hu},
  title   = {A Large Chinese Text Dataset in the Wild},
  journal = {Journal of Computer Science and Technology},
  volume  = {34},
  number  = {3},
  pages   = {509--521},
  year    = {2019},
}

Change Log

Terms of Use