Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1610–1618.
- Sign language is regarded as the most grammatically structured category of gestural communication.
- This nature of sign language makes it an ideal test bed for developing methods to solve problems such as motion analysis and human-computer interaction.
- Continuous sign language recognition with deep neural networks remains challenging and non-trivial.
- In this problem, the recognition system is required to achieve representation and sequence learning from the weakly supervised unsegmented video stream.
- RNNs have shown superior performance to hidden Markov models (HMMs) in handling the complex dynamic variations of sign recognition [21, 26]; however, with a limited amount of training data, RNNs are more prone to overfitting.
- The main contributions of our work can be summarized as follows: (1) We develop our architecture with recurrent convolutional neural networks to achieve performance comparable to the state of the art in this weakly supervised problem, without importing extra information; (2) We fully exploit the representation capability of the deep convolutional neural network by segmenting the sentence-level labels into vast amounts of temporal segments with gloss labels, which directly guides the training of the deep architecture for feature representation and effectively avoids overfitting; (3) We design a three-stage optimization process for training our deep neural network architecture, which proves notably effective on the limited training set; (4) To the best of our knowledge, we are the first to propose a real-world continuous sign language recognition system fully based on deep neural networks in this scope, and we demonstrate its applicability on challenging continuous sign video streams.
- We compare our proposed convolutional neural networks with stacked temporal convolution and pooling against 3D convolutional neural networks [21, 29], and we assess the utility of pre-training the convolutional neural networks on video frames with the loss employed in PN-Net.
- Continuous sign language recognition results of these experiments are listed in Table 2
- We have proposed a deep architecture with recurrent convolutional neural network for continuous sign language recognition
- The effectiveness of our approach is demonstrated on a challenging benchmark, where we have achieved performance comparable to the state of the art.
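A common objective for this kind of weakly supervised sequence labelling is Connectionist Temporal Classification (CTC; Graves et al., cited in the references below). Whether the authors use exactly this decoder is not stated in the summary, but a minimal greedy CTC decode (collapse repeated frame labels, then drop blanks) illustrates how an unsegmented frame-wise prediction stream becomes an ordered gloss sequence:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks.

    E.g. the frame stream (a, a, -, a, b, b) decodes to (a, a, b),
    where '-' is the blank symbol.
    """
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev:          # collapse consecutive repeats
            if label != blank:     # remove the blank symbol
                decoded.append(label)
        prev = label
    return decoded

# frame-wise argmax labels over an unsegmented stream (0 = blank)
stream = [0, 3, 3, 0, 3, 5, 5, 0, 0, 7]
print(ctc_greedy_decode(stream))   # [3, 3, 5, 7]
```

Note that the blank between the two runs of gloss 3 is what lets the decoder emit the same gloss twice in a row.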
- Vision-based continuous sign language recognition systems usually take the image sequences of signers’ performance as input, and learn to automatically output the gloss labels in the right order.
- The authors' proposed approach employs a CNN with temporal convolution and pooling for spatio-temporal representation learning from video clips, and an RNN with a long short-term memory (LSTM) module to learn the mapping from feature sequences to sequences of glosses.
- The remainder of this section discusses the approach in detail.
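As a rough sketch of the shape bookkeeping in such a pipeline, per-frame CNN features can be reduced to clip-level features for the BLSTM by stacked temporal convolution and pooling. The sizes here (1024-dimensional frame features, receptive field 5, pooling stride 2, two stages) are illustrative assumptions, not the paper's exact configuration, which is given in Table 1:

```python
import numpy as np

def temporal_conv(x, weight):
    """Valid 1D convolution over time: x is (T, C_in), weight is (k, C_in, C_out)."""
    k, _, c_out = weight.shape
    T = x.shape[0]
    out = np.empty((T - k + 1, c_out))
    for t in range(T - k + 1):
        # contract the window's time and channel axes against the kernel
        out[t] = np.tensordot(x[t:t + k], weight, axes=([0, 1], [0, 1]))
    return out

def temporal_max_pool(x, stride=2):
    """Non-overlapping max pooling over time with the given stride."""
    T = (x.shape[0] // stride) * stride          # drop the ragged tail
    return x[:T].reshape(-1, stride, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 1024))        # 100 per-frame CNN features
h = temporal_max_pool(temporal_conv(frames, rng.standard_normal((5, 1024, 512))))
h = temporal_max_pool(temporal_conv(h, rng.standard_normal((5, 512, 512))))
print(h.shape)   # clip-level feature sequence fed to the BLSTM
```

Each conv/pool stage roughly halves the temporal resolution, so the BLSTM operates on a much shorter sequence of clip-level features than the raw frame stream.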
- The amounts of training samples for the 9 signers are unbalanced in this dataset: the three most sampled signers account for 26.0%, 22.8% and 14.7% of the data, and the three least sampled for 0.5%, 0.8% and 2.9%, while the WERs for these signers on the validation set are 36.0, 38.6, 43.8 and 45.8, 43.3, 38.7 respectively.
- This indicates that the system can learn the shared representations among different signers and to some extent handle the inter-signer variations.
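The WER figures quoted above are the standard word error rate: the edit distance between hypothesis and reference gloss sequences, normalized by reference length. A minimal sketch (the German weather glosses are invented for illustration, not taken from the dataset):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed via Levenshtein distance between gloss sequences."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of reference
    for j in range(n + 1):
        d[0][j] = j                               # insert all of hypothesis
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n] / m

ref = ["REGEN", "MORGEN", "NORD"]   # hypothetical reference gloss sequence
hyp = ["REGEN", "NORD"]             # one gloss deleted
print(round(100 * word_error_rate(ref, hyp), 1))  # 33.3
```

Because WER is normalized by reference length, it can exceed 100% when the hypothesis contains many insertions.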
- The authors have proposed a deep architecture with recurrent convolutional neural network for continuous sign language recognition.
- The authors have designed a staged optimization process for training the deep neural network architecture.
- The authors fully exploit the representation capability of CNN with tuning on vast amounts of gloss-level segments and effectively avoid overfitting with the deep architecture.
- The authors have proposed a novel detection net for regularization on the consistency between sequential predictions and detection results.
- The effectiveness of the approach is demonstrated on a challenging benchmark, where the authors have achieved performance comparable to the state of the art.
- Table 1: Configuration of our architecture. The parameters for temporal convolution are denoted as “conv1D-[receptive field]-[number of channels]”. Temporal pooling layers are annotated with their stride, and bidirectional LSTM (denoted by “BLSTM”) with the dimension of its hidden variables. The output dimensions of the fully connected layers are equal to the size of the gloss vocabulary in our architecture.
- Table 2: Recognition results for the end-to-end training stage on the RWTH-PHOENIX-Weather 2014 multi-signer dataset, in [%]. “C3D” stands for the 3D-CNN structure employed in [21, 29], “ConvTC” for our proposed feature extraction architecture with the VGG-S net pretrained on ILSVRC 2012, and “+pretrain” for our model further pretrained with the PN-Net [1] loss on right-hand patches from the training set.
- Table 3: Recognition results for the sequence learning stage on the RWTH-PHOENIX-Weather 2014 multi-signer dataset, in [%]. We assess the performance of different recurrent models and our proposed detection net. “BLSTM+det net” stands for the employed model with bidirectional LSTM and detection net, and “Our-end2end” for the full model with the best performance in the end-to-end training stage.
- Table 4: Performance comparison of different continuous sign language recognition approaches on the RWTH-PHOENIX-Weather 2014 multi-signer dataset, in [%]. “r-hand” stands for right hand and “traj” for trajectory motion. The “extra supervision” imported in [18] contains a sign language lexicon mapping signs to hand shape sequences, and the best result of [19] uses [18]+[16] as the initial alignment.
- Most systems for sign language recognition consist of a feature extractor to represent the spatial and temporal variations in sign language, and a sequence learning model to learn the correspondence between feature sequences and sequences of glosses. Moreover, continuous sign language recognition [16, 18, 19] is also closely related to weakly supervised learning problem, where precise temporal locations for the glosses are not available. Here we introduce the works related to sign language analysis from these aspects.
Spatio-temporal representations. Many previous works in the area of sign analysis [8, 16, 22, 25] use handcrafted features for spatio-temporal representation. In recent years, there has been growing interest in feature extraction with deep neural networks due to their superior representation capability. The neural network methods adopted in gesture analysis include CNNs [17, 18, 26], 3D CNNs [21, 23, 30] and temporal convolutions. However, due to the data insufficiency in continuous sign language learning, the training of deep neural networks is prone to overfitting. To alleviate this problem, Koller et al. integrate a CNN into a weakly supervised learning scheme: they use weakly labelled hand shape sequences as an initialization to iteratively tune the CNN and refine the hand shape labels with the Expectation-Maximization (EM) algorithm. Koller et al. also adopt finger and palm orientations as weak supervision for tuning a CNN. Different from [17, 18], our approach does not require extra annotations, and we directly use gloss-level alignment proposals instead of sub-unit labels to guide the network training.
- This work is supported by the 973 Program (2013CB329503), the Natural Science Foundation of China (Grants No. 61473167 and No. 61621136008), and the German Research Foundation (DFG) in the project Crossmodal Learning, DFG TRR-169.
- V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv, 2016.
- H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proc. CVPR, 2016.
- P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). In Proc. CVPR, 2009.
- K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC, 2014.
- D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
- H. Cooper and R. Bowden. Learning signs from subtitles: A weakly supervised approach to sign language recognition. In Proc. CVPR, 2009.
- H. Cooper, E. J. Ong, N. Pugeault, and R. Bowden. Sign language recognition using sub-units. Journal of Machine Learning Research, 13:2205–2231, 2012.
- G. D. Evangelidis, G. Singh, and R. Horaud. Continuous gesture recognition from articulated poses. In ECCV Workshops, 2014.
- J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. H. Piater, and H. Ney. RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus. In Language Resources and Evaluation Conference, 2012.
- J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney. Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather. In Language Resources and Evaluation Conference, 2014.
- A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, 2006.
- A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li. Large-scale video classification with convolutional neural networks. In Proc. CVPR, 2014.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
- O. Koller, J. Forster, and H. Ney. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141:108–125, 2015.
- O. Koller, H. Ney, and R. Bowden. Automatic alignment of HamNoSys subunits for continuous sign language. In Language Resources and Evaluation Conference Workshops, 2016.
- O. Koller, H. Ney, and R. Bowden. Deep hand: how to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In Proc. CVPR, 2016.
- O. Koller, S. Zargaran, H. Ney, and R. Bowden. Deep sign: hybrid CNN-HMM for continuous sign language recognition. In Proc. BMVC, 2016.
- P. Molchanov, S. Gupta, K. Kim, and J. Kautz. Hand gesture recognition with 3D convolutional neural networks. In CVPR Workshops, 2015.
- P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proc. CVPR, 2016.
- C. Monnier, S. German, and A. Ost. A multi-scale boosted detector for efficient and robust gesture recognition. In ECCV Workshops, 2014.
- N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Multi-scale deep learning for gesture detection and localization. In ECCV Workshops, 2014.
- S. Ong and S. Ranganath. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):873–891, 2005.
- T. Pfister, J. Charles, and A. Zisserman. Large-scale learning of sign language by watching TV (using co-occurrences). In Proc. BMVC, 2013.
- L. Pigou, A. van den Oord, S. Dieleman, M. M. Van Herreweghe, and J. Dambre. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video. arXiv, 2015.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proc. ICCV, 2015.
- D. Wu, L. Pigou, P.-J. Kindermans, N. Le, L. Shao, J. Dambre, and J.-M. Odobez. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1583–1597, 2016.