TSINGHUA UNIVERSITY - TENCENT JOINT LABORATORY

Chinese Text in the Wild

Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li, Shi-Min Hu

In this paper we provide details of a newly created dataset of Chinese text, with about 1 million Chinese characters annotated by experts in over 30 thousand street view images. The dataset is challenging and diverse: it contains planar text, raised text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes. The attributes indicate whether the character has a complex background, whether it is raised, whether it is handwritten or printed, etc.

  • 32,285 high-resolution images
  • 1,018,402 character instances
  • 3,850 character categories
  • 6 kinds of attributes
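
As a rough illustration of the per-character annotation described above (underlying character, bounding box, attributes), the record could be modelled as below. This is only a sketch: the class name, field names, attribute labels, and the (x_min, y_min, width, height) box layout are assumptions made for the example, not the dataset's actual annotation format, which is documented in the tutorial. Only the three attributes named in the text are included.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical labels for the three attributes named in the text above.
# The dataset defines 6 attributes; the others are not listed here.
KNOWN_ATTRIBUTES = frozenset({"complex_background", "raised", "handwritten"})


@dataclass(frozen=True)
class CharAnnotation:
    """One annotated character instance (illustrative structure only)."""

    text: str                                # the underlying character
    bbox: Tuple[float, float, float, float]  # assumed (x_min, y_min, width, height)
    attributes: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        # Each annotation describes exactly one character.
        if len(self.text) != 1:
            raise ValueError("text must be a single character")
        # A bounding box needs a positive width and height.
        if self.bbox[2] <= 0 or self.bbox[3] <= 0:
            raise ValueError("bbox width and height must be positive")
        unknown = self.attributes - KNOWN_ATTRIBUTES
        if unknown:
            raise ValueError(f"unknown attributes: {sorted(unknown)}")

    @property
    def is_printed(self) -> bool:
        # "Handwritten or printed" is treated as a single boolean attribute.
        return "handwritten" not in self.attributes

    def area(self) -> float:
        # Bounding-box area in square pixels.
        return self.bbox[2] * self.bbox[3]


# Example: a raised, printed character in a 32 x 40 pixel box.
example = CharAnnotation("店", (10.0, 20.0, 32.0, 40.0), frozenset({"raised"}))
```

Storing the attributes as a set of labels, rather than six separate boolean fields, keeps the sketch valid even though only some of the attributes are named here.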


Tutorial

For the latest tutorial, please check out our Git repository.

Part-1: basics
  • dataset split
  • annotation format
  • annotation examples

Learn more >>

Part-2: classification
  • train baseline models
  • submission format
  • evaluation API

(you can find it in the Git repository)

Part-3: detection
  • train baseline models
  • submission format
  • evaluation API

(you can find it in the Git repository)

Files

Evaluation Server

Contact

If you have any questions about the dataset or code, please contact Tai-Ling Yuan (yuantailing[at]gmail.com).

Change Log

Terms of Use