CTW dataset tutorial (Part 1: basics)

Hello, and welcome to the tutorial for the Chinese Text in the Wild (CTW) dataset. In this tutorial, we will show you:

  1. Basics

  2. Classification baseline

    • Train classification model
    • Results format and evaluation API
    • Evaluate your classification model
  3. Detection baseline

    • Train detection model
    • Results format and evaluation API
    • Evaluate your detection model

Our homepage is https://ctwdataset.github.io/, where you may find more useful information.

If you don't want to run the baseline code, please jump to the Dataset split and Annotation format sections.

Notes:

This notebook MUST be run under $CTW_ROOT/tutorial.

All the code SHOULD be run with Python>=3.4. We keep it compatible with Python>=2.7 on a best-effort basis.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The structure of this repository

Our git repository is git@github.com:yuantailing/ctw-baseline.git, which you can browse from GitHub.

There are several directories under $CTW_ROOT.

  • tutorial/: this tutorial
  • data/: download and place images and annotations
  • prepare/: prepare dataset splits
  • classification/: classification baselines using TensorFlow
  • detection/: a detection baseline using YOLOv2
  • judge/: evaluate testing results and draw results and statistics
  • pythonapi/: APIs to traverse annotations, to evaluate results, and for common use
  • cppapi/: a faster implementation of detection AP evaluation
  • codalab/: code that we run on CodaLab (our evaluation server)
  • ssd/: a detection method using SSD

Most of the above directories share a similar structure.

  • */settings.py: configure directory of images, file path to annotations, and dedicated configurations for each step
  • */products/: store temporary files, logs, intermediate products, and final products
  • */pythonapi: a symbolic link to pythonapi/, so that the Python API can be used more conveniently

Most of the code is written in Python, while some code is written in C++, Shell, etc.

All the code is intended to be run from within its own subdirectory. For example, it is correct to execute cd $CTW_ROOT/detection && python3 train.py, and incorrect to execute cd $CTW_ROOT && python3 detection/train.py.

Our code won't create or modify any files outside $CTW_ROOT (except in /tmp/), and doesn't need privilege elevation (except for running docker workers on the evaluation server). You SHOULD install the following requirements before you run our code.

  • git>=1
  • Python>=3.4
  • Jupyter notebook>=5.0
  • gcc>=5
  • g++>=5
  • CUDA driver
  • CUDA toolkit>=8.0
  • CUDNN>=6.0
  • OpenCV>=3.0
  • requirements listed in $CTW_ROOT/requirements.txt

Recommended hardware requirements:

  • RAM >= 32GB
  • GPU memory >= 12 GB
  • Hard Disk free space >= 200 GB
  • CPU logical cores >= 8
  • Network connection

Dataset Split

We split the dataset into 4 parts:

  1. Training set (~75%)

    For each image in the training set, the annotation contains a number of lines, and each line contains some character instances.

    Each character instance contains:

    • its underlying character,
    • its bounding box (polygon),
    • and 6 attributes.

    Only Chinese character instances are completely annotated; non-Chinese characters (e.g., ASCII characters) are only partially annotated.

    Some ignore regions are annotated. They contain character instances that cannot be recognized by humans (e.g., too small or too blurry).

    We will show the annotation format in the following sections.

  2. Validation set (~5%)

    Annotations in the validation set have the same format as those in the training set.

    The split between the training set and the validation set is only a recommendation. We place no restriction on how you split them. To enlarge the training data, you MAY use TRAIN+VAL to train your models.

  3. Testing set for classification (~10%)

    For this testing set, we make images and annotated bounding boxes publicly available. Underlying characters, attributes, and ignore regions are not available.

    To evaluate your results on the testing set, please visit our evaluation server.

  4. Testing set for detection (~10%)

    For this testing set, we make only the images public.

    To evaluate your results on the testing set, please visit our evaluation server.

Notes:

You MUST NOT use annotations of the testing sets to fine-tune your models or hyper-parameters (e.g., using annotations of the classification testing set to fine-tune your detection models).

You MUST NOT use the evaluation server to fine-tune your models or hyper-parameters.

Download images and annotations

Visit our homepage (https://ctwdataset.github.io/) and gain access to the dataset.

  1. Clone our git repository.

    $ git clone git@github.com:yuantailing/ctw-baseline.git
    
  2. Download images, and unzip all the images to $CTW_ROOT/data/all_images/.

    For the image file path, both $CTW_ROOT/data/all_images/0000001.jpg and $CTW_ROOT/data/all_images/any/path/0000001.jpg are OK, but do not modify the file names.

  3. Download annotations, and extract them to $CTW_ROOT/data/annotations/downloads/.

    $ mkdir -p ../data/annotations/downloads && tar -xzf /path/to/ctw-annotations.tar.gz -C ../data/annotations/downloads
    
  4. In order to run the evaluation and analysis code locally, we will use the validation set as the testing sets in this tutorial.

    $ cd ../prepare && python3 fake_testing_set.py
    

    If you intend to train your model on TRAIN+VAL, you can execute cp ../data/annotations/downloads/* ../data/annotations/ instead of running the above code. In that case you will not be able to run the evaluation and analysis code locally; submit your results to our evaluation server instead.

  5. Create symbolic links for the TRAIN+VAL ($CTW_ROOT/data/images/trainval/) and TEST ($CTW_ROOT/data/images/test/) sets, respectively.

    $ cd ../prepare && python3 symlink_images.py
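
Step 2 allows images to sit either directly in all_images/ or in nested subdirectories, as long as file names are unchanged. The sketch below illustrates why the file name matters: a recursive scan can locate an image by name regardless of its directory. The `index_images` helper is hypothetical and is not taken from symlink_images.py; the demo uses a temporary directory with empty placeholder files rather than real images.

```python
import os
import tempfile


def index_images(root):
    """Map file name -> full path for every .jpg found under root (recursively)."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.jpg'):
                index[name] = os.path.join(dirpath, name)
    return index


# Demo on a throwaway directory that mimics both accepted layouts.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, 'any', 'path'))
for rel in ('0000001.jpg', os.path.join('any', 'path', '0000002.jpg')):
    open(os.path.join(demo_root, rel), 'wb').close()  # empty placeholder file

image_index = index_images(demo_root)
print(sorted(image_index))
```

Because the lookup key is the bare file name, renaming an image would break the link between the image and its annotation, whereas moving it to another subdirectory would not.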
    

Annotation format

In this section, we will show you:

  • Overall information format
  • Training set annotation format
  • Classification testing set format

We will display some examples in the next section.

Overall information format

The overall information file (../data/annotations/info.json) is UTF-8 (no BOM) encoded JSON.

The data structure of this information file is described below.

information:
{
    train: [image_meta_0, image_meta_1, image_meta_2, ...],
    val: [image_meta_0, image_meta_1, image_meta_2, ...],
    test_cls: [image_meta_0, image_meta_1, image_meta_2, ...],
    test_det: [image_meta_0, image_meta_1, image_meta_2, ...],
}

image_meta:
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
}

The train, val, test_cls, and test_det keys denote the training set, the validation set, the testing set for classification, and the testing set for detection, respectively.

The resolution of each image is always $2048 \times 2048$. The image ID is a 7-digit string whose first digit indicates the camera orientation, according to the following rule.
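
As a minimal sketch of how this structure can be consumed, the snippet below counts the images in each split. The `summarize_splits` helper is hypothetical, and `sample_info` is a tiny hand-written stand-in for info.json (only the two image IDs are borrowed from the example outputs in this tutorial); in practice you would obtain the dict with `json.load` from ../data/annotations/info.json.

```python
import json


def summarize_splits(info):
    """Return {split_name: number_of_images} for an info.json-style dict."""
    return {split: len(metas) for split, metas in info.items()}


# Hand-written stand-in for info.json; real files contain many more entries.
sample_info = json.loads('''
{
    "train": [{"image_id": "0000172", "file_name": "0000172.jpg",
               "width": 2048, "height": 2048}],
    "val": [],
    "test_cls": [{"image_id": "0000486", "file_name": "0000486.jpg",
                  "width": 2048, "height": 2048}],
    "test_det": []
}
''')

split_sizes = summarize_splits(sample_info)
for split in ('train', 'val', 'test_cls', 'test_det'):
    print(split, split_sizes[split])
```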

  • '0': back
  • '1': left
  • '2': front
  • '3': right

The file_name field doesn't contain a directory name, and is always image_id + '.jpg'.
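
The two rules above (orientation from the first digit, file name from the image ID) can be sketched as a small helper. `describe_image_id` and `ORIENTATIONS` are hypothetical names used for illustration only; they are not part of pythonapi.

```python
# Orientation codes as listed above.
ORIENTATIONS = {'0': 'back', '1': 'left', '2': 'front', '3': 'right'}


def describe_image_id(image_id):
    """Derive the camera orientation and file name from a 7-digit image ID."""
    if len(image_id) != 7 or not all(c in '0123456789' for c in image_id):
        raise ValueError('image_id must be a 7-digit string: {!r}'.format(image_id))
    if image_id[0] not in ORIENTATIONS:
        raise ValueError('unknown orientation digit: {!r}'.format(image_id[0]))
    return {
        'orientation': ORIENTATIONS[image_id[0]],
        'file_name': image_id + '.jpg',
    }


print(describe_image_id('2040368'))
```

For instance, the ID '2040368' starts with '2', so the image was taken by the front-facing camera and is stored as 2040368.jpg.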

Training set annotation format

All .jsonl annotation files (e.g., ../data/annotations/train.jsonl) are UTF-8 encoded JSON Lines; each line corresponds to the annotation of one image.

The data structure of each annotation in the training set (and the validation set) is described below.

annotation (corresponding to one line in .jsonl):
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
    annotations: [sentence_0, sentence_1, sentence_2, ...],    # MUST NOT be empty
    ignore: [ignore_0, ignore_1, ignore_2, ...],               # MAY be an empty list
}

sentence:
[instance_0, instance_1, instance_2, ...]                 # MUST NOT be empty

instance:
{
    polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],    # x, y are floating-point numbers
    text: str,                                            # the length of the text MUST be exactly 1
    is_chinese: bool,
    attributes: [attr_0, attr_1, attr_2, ...],            # MAY be an empty list
    adjusted_bbox: [xmin, ymin, w, h],                    # x, y, w, h are floating-point numbers
}

attr:
"occluded" | "bgcomplex" | "distorted" | "raised" | "wordart" | "handwritten"

ignore:
{
    polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],
    bbox: [xmin, ymin, w, h],
}

The original bounding box annotations are polygons. We will describe how a polygon is converted to adjusted_bbox in the appendix.
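
The nesting described above (an annotation holds sentences, and each sentence holds character instances) can be walked with two loops. The following is a plain-Python sketch under that reading of the schema; `iter_instances` and `count_attributes` are hypothetical helpers, and the `sample` record is hand-written with rounded, illustrative numbers rather than copied from the dataset.

```python
from collections import Counter


def iter_instances(anno):
    """Yield every character instance of one annotation, sentence by sentence."""
    for sentence in anno['annotations']:
        for instance in sentence:
            yield instance


def count_attributes(anno):
    """Count how often each attribute occurs among the character instances."""
    counts = Counter()
    for instance in iter_instances(anno):
        counts.update(instance['attributes'])
    return counts


# Illustrative record following the schema above (values are made up/rounded).
sample = {
    'image_id': '0000172', 'file_name': '0000172.jpg',
    'width': 2048, 'height': 2048,
    'annotations': [[
        {'text': '明', 'is_chinese': True, 'attributes': ['distorted', 'raised'],
         'polygon': [[140.3, 896.8], [162.4, 898.1], [162.4, 935.8], [140.3, 935.1]],
         'adjusted_bbox': [140.3, 897.2, 22.2, 38.4]},
        {'text': '海', 'is_chinese': True, 'attributes': ['raised'],
         'polygon': [[170.0, 897.0], [192.0, 898.0], [192.0, 936.0], [170.0, 935.0]],
         'adjusted_bbox': [170.0, 897.5, 22.0, 38.0]},
    ]],
    'ignore': [],
}

print(''.join(instance['text'] for instance in iter_instances(sample)))
print(dict(count_attributes(sample)))
```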

Notes:

The order of lines is not guaranteed to be consistent with info.json.

A polygon MUST be a quadrangle.

All characters in CJK Unified Ideographs are considered to be Chinese, while characters in ASCII and CJK Unified Ideographs Extension(s) are not.

Adjusted bboxes of character instances MUST intersect the image, while bboxes of ignore regions may not.

Some logos on the camera car (e.g., "腾讯街景地图" in 2040368.jpg) and licence plates are ignored to avoid bias.
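
The note on which characters count as Chinese can be approximated with a code-point range check. This sketch assumes that "CJK Unified Ideographs" refers to the Unicode block U+4E00 to U+9FFF; whether the dataset's is_chinese flag was produced with exactly this range is an assumption, and `in_cjk_unified_block` is a hypothetical helper, not part of pythonapi.

```python
def in_cjk_unified_block(ch):
    """Return True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF).

    Characters in the CJK extension blocks (e.g., Extension A, U+3400..U+4DBF)
    and ASCII characters fall outside this range and return False.
    """
    if len(ch) != 1:
        raise ValueError('expected a single character, got {!r}'.format(ch))
    return 0x4E00 <= ord(ch) <= 0x9FFF


for ch in ['明', 'A', '\u3400']:
    print(repr(ch), in_cjk_unified_block(ch))
```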

Classification testing set format

The data structure of each annotation in the classification testing set is described below.

annotation:
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
    proposals: [proposal_0, proposal_1, proposal_2, ...],
}

proposal:
{
    polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],
    adjusted_bbox: [xmin, ymin, w, h],
}

Notes:

The order of image_id across lines is not guaranteed to be consistent with info.json.

Non-Chinese characters (e.g., ASCII characters) MUST NOT appear in proposals.
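
Since adjusted_bbox is given as floating-point [xmin, ymin, w, h], cropping a proposal from an image array requires integer pixel bounds. The sketch below shows one possible conversion; the rounding policy (floor the minimum, ceil the maximum, then clip to the image) is an assumption made for illustration and is not necessarily what the baseline code does. `bbox_to_crop_bounds` is a hypothetical helper.

```python
import math


def bbox_to_crop_bounds(adjusted_bbox, width, height):
    """Convert [xmin, ymin, w, h] floats to integer (x0, y0, x1, y1) inside the image.

    The result is suitable for slicing an image array as img[y0:y1, x0:x1].
    """
    xmin, ymin, w, h = adjusted_bbox
    x0 = max(0, int(math.floor(xmin)))
    y0 = max(0, int(math.floor(ymin)))
    x1 = min(width, int(math.ceil(xmin + w)))
    y1 = min(height, int(math.ceil(ymin + h)))
    return x0, y0, x1, y1


# adjusted_bbox of a proposal in a 2048 x 2048 image.
bounds = bbox_to_crop_bounds(
    [398.7268146435821, 1211.6231527508403, 14.957597548258718, 29.099630908325935],
    width=2048, height=2048)
print(bounds)
```

With an image loaded as a NumPy array `img`, the crop would then be `img[y0:y1, x0:x1]`.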

In [1]:
from __future__ import print_function
from __future__ import unicode_literals

import json
import pprint
import settings

from pythonapi import anno_tools

print('Image meta info format:')
with open(settings.DATA_LIST) as f:
    data_list = json.load(f)
pprint.pprint(data_list['train'][0])
Image meta info format:
{'file_name': '0000172.jpg',
 'height': 2048,
 'image_id': '0000172',
 'width': 2048}
In [2]:
print('Training set annotation format:')
with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())
pprint.pprint(anno, depth=3)
Training set annotation format:
{'annotations': [[{...}, {...}, {...}, {...}],
                 [{...}, {...}, {...}, {...}, {...}, {...}],
                 [{...}, {...}, {...}],
                 [{...}, {...}, {...}, {...}, {...}],
                 [{...}, {...}],
                 [{...}, {...}],
                 [{...}, {...}, {...}, {...}, {...}, {...}, {...}],
                 [{...}, {...}, {...}, {...}],
                 [{...}, {...}, {...}, {...}, {...}, {...}],
                 [{...}, {...}, {...}, {...}, {...}]],
 'file_name': '0000172.jpg',
 'height': 2048,
 'ignore': [{'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]},
            {'bbox': [...], 'polygon': [...]}],
 'image_id': '0000172',
 'width': 2048}
In [3]:
print('Character instance format:')
pprint.pprint(anno['annotations'][0][0])
Character instance format:
{'adjusted_bbox': [140.26028096262758,
                   897.1957001682758,
                   22.167573140645146,
                   38.36424196832945],
 'attributes': ['distorted', 'raised'],
 'is_chinese': True,
 'polygon': [[140.26028096262758, 896.7550603352049],
             [162.42785410327272, 898.0769798344178],
             [162.42785410327272, 935.7929346470926],
             [140.26028096262758, 935.0939571156308]],
 'text': '明'}
In [4]:
print('Traverse character instances in an image')
for instance in anno_tools.each_char(anno):
    print(instance['text'], end=' ')
print()
Traverse character instances in an image
明 海 地 产 易 生 · 印 帝 安 成 和 王 断 桥 铝 五 金 修 电 青 缘 平 房 四 合 院 租 售 过 户 咨 询 按 揭 抵 押 贷 款 内 拆 迁 咨 询 
In [5]:
print('Classification testing set format')
with open(settings.TEST_CLASSIFICATION) as f:
    anno = json.loads(f.readline())
pprint.pprint(anno, depth=2)
Classification testing set format
{'file_name': '0000486.jpg',
 'height': 2048,
 'image_id': '0000486',
 'proposals': [{...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...},
               {...}],
 'width': 2048}
In [6]:
print('Classification testing set proposal format')
pprint.pprint(anno['proposals'][0])
Classification testing set proposal format
{'adjusted_bbox': [398.7268146435821,
                   1211.6231527508403,
                   14.957597548258718,
                   29.099630908325935],
 'polygon': [[398.7268146435821, 1242.4864913065799],
             [413.6844121918408, 1237.1953683643387],
             [413.6844121918408, 1210.0061151780556],
             [398.7268146435821, 1214.8572278964098]]}

Draw annotations on images

In this section, we will draw annotations on images. This should help you understand the annotation format.

We show polygon bounding boxes of Chinese character instances in green, non-Chinese character instances in red, and ignore regions in yellow.

In [7]:
import cv2
import json
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import os
import settings

from pythonapi import anno_tools

%matplotlib inline

with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())
path = os.path.join(settings.TRAINVAL_IMAGE_DIR, anno['file_name'])
assert os.path.exists(path), 'file does not exist: {}'.format(path)
img = cv2.imread(path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(16, 16))
ax = plt.gca()
plt.imshow(img)
for instance in anno_tools.each_char(anno):
    color = (0, 1, 0) if instance['is_chinese'] else (1, 0, 0)
    ax.add_patch(patches.Polygon(instance['polygon'], fill=False, color=color))
for ignore in anno['ignore']:
    color = (1, 1, 0)
    ax.add_patch(patches.Polygon(ignore['polygon'], fill=False, color=color))
plt.show()