Table 1. Network configuration summary. The first row is the top layer. 'k', 's' and 'p' stand for kernel size, stride and padding size, respectively.
IC13 [24] test dataset inherits most of its data from IC03. It contains 1,015 cropped word images with ground truths.
IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. Each image is associated with a 50-word lexicon and a 1k-word lexicon.
SVT [34] test dataset consists of 249 street view images collected from Google Street View, from which 647 word images are cropped. Each word image has a 50-word lexicon defined by Wang et al. [34].
3.2. Implementation Details
The network configuration we use in our experiments is summarized in Table 1. The architecture of the convolutional layers is based on the VGG-VeryDeep architecture [32]. A tweak is made in order to make it suitable for recognizing English texts: in the 3rd and 4th max-pooling layers, we adopt 1 × 2 rectangular pooling windows instead of the conventional square ones. This tweak yields feature maps with larger width, and hence longer feature sequences. For example, an image containing 10 characters is typically of size 100 × 32, from which a feature sequence of 25 frames can be generated. This length exceeds the lengths of most English words. On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing characters with narrow shapes, such as 'i' and 'l'.
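To make the effect of the rectangular pooling windows concrete, the following minimal sketch (written in PyTorch purely for illustration; our implementation is in Torch7, and the layer sizes are taken loosely from Table 1) shows that a 100 × 32 input keeps a width of 25 columns after the third, rectangular pooling layer; each column of the final feature maps later becomes one frame of the feature sequence.

```python
# Illustrative PyTorch sketch, not the authors' Torch7 code.
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(1, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),                                # 32x100 -> 16x50
    nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),                                # 16x50 -> 8x25
    nn.Conv2d(128, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    # rectangular 1x2 window: halve the height, keep the width
    nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)),   # 8x25 -> 4x25
)

x = torch.randn(1, 1, 32, 100)   # a 100x32 grayscale word image (N, C, H, W)
print(features(x).shape)         # torch.Size([1, 256, 4, 25]) -> 25 columns wide
```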
The network not only has deep convolutional layers, but also has recurrent layers. Both are known to be hard to train. We find that the batch normalization [19] technique is extremely useful for training networks of such depth. Two batch normalization layers are inserted after the 5th and 6th convolutional layers respectively. With the batch normalization layers, the training process is greatly accelerated.
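Continuing the sketch above, the placement of the two batch normalization layers can be written as follows (again a PyTorch illustration rather than the Torch7 implementation; channel counts follow Table 1):

```python
import torch.nn as nn

# Batch normalization is inserted after the 5th and 6th convolutional layers.
deeper = nn.Sequential(
    nn.Conv2d(256, 512, 3, stride=1, padding=1),      # conv5
    nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, 3, stride=1, padding=1),      # conv6
    nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)),  # the 4th (1x2) pooling layer
)
```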
We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained with ADADELTA, setting the parameter ρ to 0.9. During training, all images are scaled to 100 × 32 in order to accelerate the training process. The training process takes about 50 hours to reach convergence. Testing images are scaled to have height 32, while widths are scaled proportionally with heights, but to at least 100 pixels. The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. Testing each sample takes 0.53s on average.
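The reported optimizer and image-scaling settings can be summarized in a short sketch; the PyTorch names below (and the dummy `crnn` module) are illustrative stand-ins, not the actual Torch7 code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

crnn = nn.Conv2d(1, 64, 3, padding=1)                          # placeholder for the full CRNN
optimizer = torch.optim.Adadelta(crnn.parameters(), rho=0.9)   # ADADELTA with rho = 0.9

def scale_for_training(batch):
    # all training images are scaled to 100 x 32 (W x H) to speed up training
    return F.interpolate(batch, size=(32, 100), mode='bilinear', align_corners=False)

def test_size(w, h):
    # testing: height fixed to 32, width scaled proportionally but at least 100 px
    return max(100, round(w * 32 / h)), 32

print(test_size(64, 32))   # (100, 32): narrow crops are widened to the 100 px minimum
```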
3.3. Comparative Evaluation
All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and the recent state-of-the-art techniques, including the approaches based on deep models [23, 22, 21], are shown in Table 2.
In the constrained lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22]. Specifically, we obtain superior performance on IIIT5k and SVT compared to [22], and only achieve lower performance on IC03 with the "Full" lexicon. Note that the model in [22] is trained on a specific dictionary, namely that each word is associated with a class label. Unlike [22], CRNN is not limited to recognizing a word in a known dictionary, and is able to handle random strings (e.g. telephone numbers), sentences or other scripts such as Chinese words. Therefore, the results of CRNN are competitive on all the testing datasets.
In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. Note that the blanks in the "none" columns of Table 2 denote that such approaches cannot be applied to recognition without a lexicon, or did not report the recognition accuracies in the unconstrained cases. Our method uses only synthetic text with word-level labels as the training data, very different from PhotoOCR [8], which used 7.9 million real word images with character-level annotations for training. The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary; however, it is not a model strictly unconstrained by a lexicon, as mentioned before. In this sense, our results in the unconstrained lexicon case are still promising.
To further understand the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties, namely E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.
Table 3. Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).
E2E Train: This column shows whether a certain text reading model is end-to-end trainable, without any preprocessing or several separated steps, which indicates that such approaches are elegant and clean to train. As can be observed from Table 3, only the models based on deep neural networks, including [22, 21] as well as CRNN, have this property.
Conv Ftrs: This column indicates whether an approach uses convolutional features learned directly from training images or hand-crafted features as the basic representations.
CharGT-Free: This column indicates whether character-level annotations are essential for training the model. As the input and output labels of CRNN can be sequences, character-level annotations are not necessary.
Unconstrained: This column indicates whether the trained model is constrained to a specific dictionary, unable to handle out-of-dictionary words or random sequences. Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.
Table 2. Recognition accuracies (%) on four datasets. In the second row, "50", "1k", "50k" and "Full" denote the lexicon used, and "None" denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)
Model Size: This column reports the storage space of the learned model. In CRNN, all layers have weight-sharing connections, and fully-connected layers are not needed. Consequently, the number of parameters of CRNN is much smaller than that of the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21]. Our model has 8.3 million parameters, taking only 33MB RAM (using a 4-byte single-precision float for each parameter), thus it can be easily ported to mobile devices.
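The reported footprint follows directly from the parameter count; a quick back-of-the-envelope check (generic code, not part of our pipeline):

```python
# 8.3 M parameters stored as 4-byte single-precision floats.
n_params = 8.3e6
print(n_params * 4 / 1e6)   # 33.2 -> roughly the reported 33 MB

# For an arbitrary PyTorch module `m`, the same count would be
#   sum(p.numel() for p in m.parameters())
```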
Table 3 clearly shows the differences among the approaches in detail, and fully demonstrates the advantages of CRNN over other competing methods. In addition, to test the impact of the parameter δ, we experiment with different values of δ in Eq. 2. In Fig. 4 we plot the recognition accuracy as a function of δ. Larger δ results in more candidates, and thus more accurate lexicon-based transcription. On the other hand, the computational cost grows with larger δ, due to longer BK-tree search time as well as a larger number of candidate sequences to test. In practice, we choose δ = 3 as a tradeoff between accuracy and speed.
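A minimal sketch of such an approximate lexicon search is given below; it assumes a plain BK-tree over the lexicon and standard edit distance, and the names are illustrative rather than taken from our C++ implementation:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})           # (word, children keyed by distance)
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, delta):
        # return all lexicon words within edit distance delta of `word`
        out, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = edit_distance(word, w)
            if d <= delta:
                out.append(w)
            # triangle inequality: only children keyed in [d - delta, d + delta]
            stack.extend(c for k, c in children.items() if d - delta <= k <= d + delta)
        return out

tree = BKTree(["house", "horse", "mouse", "hose", "forse"])
print(tree.query("hourse", delta=3))   # candidate words within distance 3
```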
Figure 4. Blue line graph: recognition accuracy as a function of the parameter δ. Red bars: lexicon search time per sample. Tested on the IC03 dataset with the 50k lexicon.
3.4. Musical Score Recognition
A musical score typically consists of sequences of musical notes arranged on staff lines. Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. Previous methods often require image preprocessing (mostly binarization), staff line detection and individual note recognition [29]. We cast OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. For simplicity, we recognize pitches only, ignore all chords, and assume the same major scale (C major) for all scores.
To the best of our knowledge, there exists no public dataset for evaluating algorithms on pitch recognition. To prepare the training data needed by CRNN, we collect 2650 images from [2]. Each image contains a score fragment with 3 to 20 notes. We manually label the ground truth label sequences (sequences of note pitches) for all the images. The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) "Clean", which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) "Synthesized", which is created from "Clean", using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) "Real-World", which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.
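A possible augmentation pipeline matching the description above is sketched below; the rotation, scale and noise parameter ranges are assumptions for illustration, not the exact values used:

```python
import random
import numpy as np
from PIL import Image

def augment(score: Image.Image, background: Image.Image) -> Image.Image:
    """Rotate, scale, add noise, and replace the background of a score fragment."""
    img = score.convert("L")
    w, h = img.size
    # small random rotation and scaling (ranges are illustrative assumptions)
    img = img.rotate(random.uniform(-3, 3), expand=True, fillcolor=255)
    s = random.uniform(0.9, 1.1)
    img = img.resize((max(1, int(w * s)), max(1, int(h * s))))
    # corrupt with additive Gaussian noise
    arr = np.asarray(img, dtype=np.float32)
    arr = np.clip(arr + np.random.normal(0, 10, arr.shape), 0, 255).astype(np.uint8)
    img = Image.fromarray(arr)
    # composite onto a natural-image background where the page is near-white
    bg = background.convert("L").resize(img.size)
    mask = Image.fromarray(((np.asarray(img) > 200) * 255).astype(np.uint8))
    return Image.composite(bg, img, mask)
```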