Hello, welcome to the tutorial for the Chinese Text in the Wild (CTW) dataset. In this tutorial, we will show you:
Classification baseline
Detection baseline
Our homepage is https://ctwdataset.github.io/, where you may find more useful information.
If you don't want to run the baseline code, please jump to the Dataset split and Annotation format sections.
Notes:
This notebook MUST be run under $CTW_ROOT/tutorial.
All the code SHOULD be run with Python>=3.4. We make it compatible with Python>=2.7 with best effort.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Our git repository is git@github.com:yuantailing/ctw-baseline.git, which you can browse on GitHub.
There are several directories under $CTW_ROOT.
Most of the above directories have some similar structures.
pythonapi/, in order to use the Python API more conveniently.
Most of the code is written in Python, while some code is written in C++, Shell, etc.
All the code is intended to be run from its corresponding subdirectory, e.g., it's correct to execute cd $CTW_ROOT/detection && python3 train.py, and it's incorrect to execute cd $CTW_ROOT && python3 detection/train.py.
All our code won't create or modify any files outside $CTW_ROOT (except /tmp/), and doesn't need privilege elevation (except for running docker workers on the evaluation server). You SHOULD install the requirements before you run our code.
$CTW_ROOT/requirements.txt
Recommended hardware requirements:
We split the dataset into 4 parts:
Training set (~75%)
For each image in the training set, the annotation contains a number of sentences (text lines), and each sentence contains some character instances.
Each character instance contains:
Only Chinese character instances are completely annotated; non-Chinese characters (e.g., ASCII characters) are only partially annotated.
Some ignore regions are annotated, which contain character instances that cannot be recognized by humans (e.g., too small, too fuzzy).
We will show the annotation format in the next sections.
Validation set (~5%)
Annotations in the validation set are in the same format as those in the training set.
The split between training set and validation set is only a recommendation. We make no restriction on how you split them. To enlarge training data, you MAY use TRAIN+VAL to train your models.
Testing set for classification (~10%)
For this testing set, we make the images and annotated bounding boxes publicly available. The underlying characters, attributes, and ignore regions are not available.
To evaluate your results on testing set, please visit our evaluation server.
Testing set for detection (~10%)
For this testing set, only the images are made publicly available.
To evaluate your results on testing set, please visit our evaluation server.
Notes:
You MUST NOT use the annotations of the testing sets to fine-tune your models or hyper-parameters (e.g., using annotations of the classification testing set to fine-tune your detection models).
You MUST NOT use the evaluation server to fine-tune your models or hyper-parameters.
Visit our homepage (https://ctwdataset.github.io/) and gain access to the dataset.
Clone our git repository.
$ git clone git@github.com:yuantailing/ctw-baseline.git
Download the images, and unzip all of them to $CTW_ROOT/data/all_images/.
For the image file path, both $CTW_ROOT/data/all_images/0000001.jpg and $CTW_ROOT/data/all_images/any/path/0000001.jpg are OK, but do not modify the file names.
Download the annotations, and unzip them to $CTW_ROOT/data/annotations/downloads/.
$ mkdir -p ../data/annotations/downloads && tar -xzf /path/to/ctw-annotations.tar.gz -C ../data/annotations/downloads
In order to run the evaluation and analysis code locally, we will use the validation set as the testing sets in this tutorial.
$ cd ../prepare && python3 fake_testing_set.py
If you intend to train your model on TRAIN+VAL, you can execute cp ../data/annotations/downloads/* ../data/annotations/ instead of running the above code. But then you will not be able to run the evaluation and analysis code locally; just submit your results to our evaluation server.
Create symbolic links for the TRAIN+VAL ($CTW_ROOT/data/images/trainval/) and TEST ($CTW_ROOT/data/images/test/) sets, respectively.
$ cd ../prepare && python3 symlink_images.py
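Before moving on, you MAY run a quick sanity check that the data layout described above is in place. The following is only a minimal sketch (not part of the baseline code); the paths follow the conventions used in this tutorial, so adjust them if your layout differs.

import os

# Hypothetical sanity check: these paths follow the layout described above.
expected = [
    '../data/annotations/info.json',
    '../data/annotations/train.jsonl',
    '../data/images/trainval',
    '../data/images/test',
]
for path in expected:
    status = 'OK' if os.path.exists(path) else 'MISSING'
    print('{:8s} {}'.format(status, path))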
In this section, we will show you:
We will display some examples in the next section.
The overall information file (../data/annotations/info.json) is UTF-8 (no BOM) encoded JSON.
The data structure of this information file is described below.
information:
{
train: [image_meta_0, image_meta_1, image_meta_2, ...],
val: [image_meta_0, image_meta_1, image_meta_2, ...],
test_cls: [image_meta_0, image_meta_1, image_meta_2, ...],
test_det: [image_meta_0, image_meta_1, image_meta_2, ...],
}
image_meta:
{
image_id: str,
file_name: str,
width: int,
height: int,
}
The train, val, test_cls, and test_det keys denote the training set, validation set, testing set for classification, and testing set for detection, respectively.
The resolution of each image is always $2048 \times 2048$. The image ID is a 7-digit string, and its first digit indicates the camera orientation according to the following rule.
The file_name field doesn't contain a directory name, and is always image_id + '.jpg'.
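As a quick illustration of the structure above, the sketch below loads info.json (via settings.DATA_LIST, as in the code cell later in this tutorial), prints the number of images in each subset, and spot-checks the conventions just described on one image_meta. It is only a sketch, assuming the annotations have been prepared as described in the previous section.

import json
import settings

# settings.DATA_LIST points to ../data/annotations/info.json
with open(settings.DATA_LIST) as f:
    info = json.load(f)

# number of images in each subset
for split in ('train', 'val', 'test_cls', 'test_det'):
    print(split, len(info[split]), 'images')

# spot-check the conventions described above on one image_meta
meta = info['train'][0]
assert meta['width'] == 2048 and meta['height'] == 2048
assert len(meta['image_id']) == 7
assert meta['file_name'] == meta['image_id'] + '.jpg'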
All .jsonl annotation files (e.g., ../data/annotations/train.jsonl) are UTF-8 encoded JSON Lines; each line corresponds to the annotation of one image.
The data structure for each of the annotations in the training set (and validation set) is described below.
annotation (corresponding to one line in .jsonl):
{
image_id: str,
file_name: str,
width: int,
height: int,
annotations: [sentence_0, sentence_1, sentence_2, ...], # MUST NOT be empty
ignore: [ignore_0, ignore_1, ignore_2, ...], # MAY be an empty list
}
sentence:
[instance_0, instance_1, instance_2, ...] # MUST NOT be empty
instance:
{
polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]], # x, y are floating-point numbers
text: str, # the length of the text MUST be exactly 1
is_chinese: bool,
attributes: [attr_0, attr_1, attr_2, ...], # MAY be an empty list
adjusted_bbox: [xmin, ymin, w, h], # x, y, w, h are floating-point numbers
}
attr:
"occluded" | "bgcomplex" | "distorted" | "raised" | "wordart" | "handwritten"
ignore:
{
polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],
bbox: [xmin, ymin, w, h],
}
The original bounding box annotations are polygons; we will describe how polygon is converted to adjusted_bbox in the appendix.
Notes:
The order of lines is not guaranteed to be consistent with info.json.
A polygon MUST be a quadrangle.
All characters in CJK Unified Ideographs are considered to be Chinese, while characters in ASCII and CJK Unified Ideographs Extension(s) are not.
Adjusted bboxes of character instances MUST intersect the image, while bboxes of ignore regions may not.
Some logos on the camera car (e.g., "腾讯街景地图" in 2040368.jpg) and licence plates are ignored to avoid bias.
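The constraints listed above can be checked programmatically. The sketch below traverses one training annotation with pythonapi.anno_tools.each_char (also used later in this tutorial) and asserts a few of them; it is only an illustrative sketch, not part of the baseline code.

import json
import settings
from pythonapi import anno_tools

with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())

for instance in anno_tools.each_char(anno):
    assert len(instance['text']) == 1      # exactly one character per instance
    assert len(instance['polygon']) == 4   # a polygon MUST be a quadrangle
    xmin, ymin, w, h = instance['adjusted_bbox']
    # adjusted bboxes of character instances MUST intersect the image
    assert xmin < anno['width'] and ymin < anno['height'] and xmin + w > 0 and ymin + h > 0
for ignore in anno['ignore']:
    assert len(ignore['polygon']) == 4
print('checks passed for', anno['file_name'])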
The data structure for each of the annotations in the classification testing set is described below.
annotation:
{
image_id: str,
file_name: str,
width: int,
height: int,
proposals: [proposal_0, proposal_1, proposal_2, ...],
}
proposal:
{
polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],
adjusted_bbox: [xmin, ymin, w, h],
}
Notes:
The order of image_id in each line is not guaranteed to be consistent with info.json.
Non-Chinese characters (e.g., ASCII characters) MUST NOT appear in the proposals.
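A classifier would typically be fed with crops of these proposals. The sketch below reads the first image of the classification testing set and crops a few proposals from it; the test image directory path is an assumption based on the symlink layout created earlier ($CTW_ROOT/data/images/test/), so adjust it if your images are elsewhere.

import cv2
import json
import os
import settings

with open(settings.TEST_CLASSIFICATION) as f:
    anno = json.loads(f.readline())

# assumed path: the TEST symlink directory created in the preparation step
img = cv2.imread(os.path.join('../data/images/test', anno['file_name']))
assert img is not None
for proposal in anno['proposals'][:5]:
    x, y, w, h = (int(round(v)) for v in proposal['adjusted_bbox'])
    crop = img[max(y, 0):y + h, max(x, 0):x + w]
    print(proposal['adjusted_bbox'], 'crop shape:', crop.shape)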
from __future__ import print_function
from __future__ import unicode_literals
import json
import pprint
import settings
from pythonapi import anno_tools
print('Image meta info format:')
with open(settings.DATA_LIST) as f:
    data_list = json.load(f)
pprint.pprint(data_list['train'][0])
print('Training set annotation format:')
with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())
pprint.pprint(anno, depth=3)
print('Character instance format:')
pprint.pprint(anno['annotations'][0][0])
print('Traverse character instances in an image')
for instance in anno_tools.each_char(anno):
    print(instance['text'], end=' ')
print()
print('Classification testing set format')
with open(settings.TEST_CLASSIFICATION) as f:
    anno = json.loads(f.readline())
pprint.pprint(anno, depth=2)
print('Classification testing set proposal format')
pprint.pprint(anno['proposals'][0])
In this section, we will draw annotations on the images. This should help you understand the annotation format.
We show polygon bounding boxes of Chinese character instances in green, non-Chinese character instances in red, and ignore regions in yellow.
import cv2
import json
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import os
import settings
from pythonapi import anno_tools
%matplotlib inline
with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())
path = os.path.join(settings.TRAINVAL_IMAGE_DIR, anno['file_name'])
assert os.path.exists(path), 'file not exists: {}'.format(path)
img = cv2.imread(path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(16, 16))
ax = plt.gca()
plt.imshow(img)
for instance in anno_tools.each_char(anno):
    color = (0, 1, 0) if instance['is_chinese'] else (1, 0, 0)
    ax.add_patch(patches.Polygon(instance['polygon'], fill=False, color=color))
for ignore in anno['ignore']:
    color = (1, 1, 0)
    ax.add_patch(patches.Polygon(ignore['polygon'], fill=False, color=color))
plt.show()
In order to create a tighter bounding box for character instances, we compute adjusted_bbox in the following steps, instead of using the real bounding box: for each edge of the polygon, take the two points at 1/3 and 2/3 of the edge as key points, then take the axis-aligned bounding box of all key points.
The adjusted bounding box is better than the real bounding box, especially for sharp polygons.
from __future__ import division
import collections
import matplotlib.patches as patches
import matplotlib.pyplot as plt
%matplotlib inline
def poly2bbox(poly):
    key_points = list()
    rotated = collections.deque(poly)
    rotated.rotate(1)
    for (x0, y0), (x1, y1) in zip(poly, rotated):
        for ratio in (1/3, 2/3):
            key_points.append((x0 * ratio + x1 * (1 - ratio), y0 * ratio + y1 * (1 - ratio)))
    x, y = zip(*key_points)
    adjusted_bbox = (min(x), min(y), max(x) - min(x), max(y) - min(y))
    return key_points, adjusted_bbox
polygons = [
[[2, 1], [11, 2], [12, 18], [3, 16]],
[[21, 1], [30, 5], [31, 19], [22, 14]],
]
plt.figure(figsize=(10, 6))
plt.xlim(0, 35)
plt.ylim(0, 20)
ax = plt.gca()
for polygon in polygons:
    color = (0, 1, 0)
    ax.add_patch(patches.Polygon(polygon, fill=False, color=color))
    key_points, adjusted_bbox = poly2bbox(polygon)
    ax.add_patch(patches.Rectangle(adjusted_bbox[:2], *adjusted_bbox[2:], fill=False, color=(0, 0, 1)))
    for kp in key_points:
        ax.add_patch(patches.Circle(kp, radius=0.1, fill=True, color=(1, 0, 0)))
plt.show()