Recurrent Convolutional Neural Networks For Continuous Sign Language Recognition By Staged Optimization

30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1610–1618

TL;DR: We develop an end-to-end sequence learning scheme and employ connectionist temporal classification as the objective function for alignment proposals.

Abstract

This work presents a weakly supervised framework with deep neural networks for vision-based continuous sign language recognition, where the ordered gloss labels but no exact temporal locations are available with the video of a sign sentence, and the amount of labeled sentences for training is limited. Our approach addresses the mapping of v…
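The alignment-free objective named here is connectionist temporal classification (CTC) [12], which marginalizes over all monotonic frame-to-gloss alignments, so only the ordered gloss labels are needed as supervision. Below is a minimal sketch (not the authors' code) of scoring per-frame gloss distributions against ordered labels with PyTorch's built-in CTC loss; the tensor sizes, vocabulary size, and blank index are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes: T frames, N sentences per batch, V glosses plus one CTC blank.
T, N, V, S = 100, 4, 1232, 20
logits = torch.randn(T, N, V + 1, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)            # per-frame gloss log-probabilities

# Ordered gloss labels per sentence, with no temporal locations (weak supervision).
targets = torch.randint(1, V + 1, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)    # frames per video
target_lengths = torch.full((N,), S, dtype=torch.long)   # glosses per sentence

# CTC sums over every monotonic alignment between the T frames and the S glosses.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow into whatever network produced the logits
```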

Introduction
  • Sign language is regarded as the most grammatically structured category of gestural communications
  • This nature of sign language makes it an ideal test bed for developing methods to solve problems such as motion analysis and human-computer interaction.
  • Continuous sign language recognition with deep neural networks remains challenging and non-trivial.
  • In this problem, the recognition system is required to achieve representation and sequence learning from the weakly supervised unsegmented video stream.
  • RNNs have shown superior performance to hidden Markov models (HMMs) in handling the complex dynamic variations of sign recognition [21, 26]; however, with a limited amount of training data, RNNs are prone to overfitting
Highlights
  • Sign language is regarded as the most grammatically structured category of gestural communications
  • The main contributions of our work can be summarized as follows:
    1. We develop our architecture with recurrent convolutional neural networks to achieve performance comparable to the state of the art in this weakly supervised problem, without importing extra information.
    2. We fully exploit the representation capability of the deep convolutional neural network by segmenting the sentence-level labels into vast amounts of temporal segments with gloss labels (a decoding sketch follows this list), which directly guides the training of the deep architecture for feature representation and efficiently avoids overfitting.
    3. We design a three-stage optimization process for training our deep neural network architecture, which proves notably effective on the limited training set.
    4. To the best of our knowledge, we are the first to propose a real-world continuous sign language recognition system fully based on deep neural networks in this scope, and we demonstrate its applicability on challenging continuous sign video streams.
  • We substitute the 3D CNNs of [21, 29] for our proposed CNN with stacked temporal convolution and pooling, and we assess the utility of pre-training the CNN on video frames with the loss employed in PN-Net [1]
  • Continuous sign language recognition results of these experiments are listed in Table 2
  • We have proposed a deep architecture with recurrent convolutional neural network for continuous sign language recognition
  • The effectiveness of our approach is demonstrated on a challenging benchmark, where we achieve performance comparable to the state of the art
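Contribution (2) relies on turning the network's predictions into gloss-level temporal segments. One simple way to obtain such alignment proposals, sketched below as an illustration rather than the authors' exact procedure, is best-path (greedy) CTC decoding that records the frame span of each collapsed gloss:

```python
import torch

def best_path_segments(log_probs: torch.Tensor, blank: int = 0):
    """Greedy (best-path) CTC decoding that also returns frame spans.

    log_probs: (T, V+1) per-frame log-probabilities for one video.
    Returns (gloss_id, start_frame, end_frame) proposals that can be used
    to cut the video into gloss-level training segments.
    """
    path = log_probs.argmax(dim=-1).tolist()   # most likely label per frame
    segments, prev = [], blank
    for t, label in enumerate(path):
        if label != blank and label != prev:
            segments.append([label, t, t])     # open a new gloss segment
        elif label != blank and label == prev:
            segments[-1][2] = t                # extend the current segment
        prev = label
    return [tuple(s) for s in segments]

# Toy usage: 6 frames, blank plus 3 glosses.
lp = torch.log_softmax(torch.randn(6, 4), dim=-1)
print(best_path_segments(lp))
```

Each (gloss, start, end) triple can then be cut from the video and used as a labeled clip for tuning the feature extractor.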
Methods
  • Vision-based continuous sign language recognition systems usually take the image sequences of signers’ performance as input, and learn to automatically output the gloss labels in the right order.
  • The authors' proposed approach employs a CNN with temporal convolution and pooling for spatio-temporal representation learning from video clips, and an RNN with long short-term memory (LSTM) modules to learn the mapping of feature sequences to sequences of glosses.
  • The remainder of this section discusses the approach in detail
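As a concrete reading of this pipeline, the sketch below stacks temporal convolution and pooling on per-frame CNN features and feeds the result to a bidirectional LSTM. It assumes the per-frame spatial features are precomputed; all layer sizes (feat_dim, hidden, vocab) are illustrative assumptions, not the exact configuration from Table 1.

```python
import torch
import torch.nn as nn

class RecurrentConvSLR(nn.Module):
    """Minimal sketch of the described architecture (sizes are assumptions)."""

    def __init__(self, feat_dim=1024, hidden=512, vocab=1232):
        super().__init__()
        # Per-frame spatial features are assumed precomputed by a 2D CNN
        # (e.g. a VGG-S-style network); we model only what follows.
        self.temporal = nn.Sequential(         # temporal convolution + pooling
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab + 1)    # glosses + CTC blank

    def forward(self, frame_feats):                   # (N, T, feat_dim)
        x = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.blstm(x)                          # (N, T/4, 2*hidden)
        return self.fc(x).log_softmax(dim=-1)         # per-step gloss log-probs

model = RecurrentConvSLR()
out = model(torch.randn(2, 64, 1024))   # 2 videos, 64 frames of features each
print(out.shape)                        # torch.Size([2, 16, 1233])
```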
Results
  • The amounts of training samples for the 9 signers are unbalanced in this dataset: the three most sampled signers account for 26.0%, 22.8%, and 14.7% of the data and the three least sampled for 0.5%, 0.8%, and 2.9%, while the WERs for these signers on the validation set are 36.0, 38.6, 43.8 and 45.8, 43.3, 38.7 respectively
  • This indicates that the system can learn the shared representations among different signers and to some extent handle the inter-signer variations.
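The WER figures quoted above count substitutions, deletions, and insertions against the reference gloss sequence. A minimal reference implementation follows (ours, not from the paper; the example glosses are made up):

```python
def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed with standard Levenshtein dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Toy example: one substitution over four reference glosses -> 25% WER.
print(word_error_rate(["REGEN", "WIND", "NORD", "STARK"],
                      ["REGEN", "WIND", "SUED", "STARK"]))  # 0.25
```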
Conclusion
  • The authors have proposed a deep architecture with recurrent convolutional neural network for continuous sign language recognition.
  • The authors have designed a staged optimization process for training the deep neural network architecture.
  • The authors fully exploit the representation capability of CNN with tuning on vast amounts of gloss-level segments and effectively avoid overfitting with the deep architecture.
  • The authors have proposed a novel detection net for regularization on the consistency between sequential predictions and detection results.
  • The effectiveness of the approach is demonstrated on a challenging benchmark, where the authors achieve performance comparable to the state of the art
Tables
  • Table 1: Configuration of our architecture. The parameters for temporal convolution are denoted as “conv1D-[receptive field]-[number of channels]”. Temporal pooling layers are annotated with their stride, and bidirectional LSTM (denoted “BLSTM”) with the dimension of its hidden state. The output dimensions of the fully connected layers equal the size of the gloss vocabulary in our architecture
  • Table 2: Recognition results for the end-to-end training stage on the RWTH-PHOENIX-Weather 2014 multi-signer dataset in [%]. “C3d” stands for the 3D-CNN structure employed in [21, 29], “ConvTC” for our proposed feature extraction architecture with the VGG-S net pretrained on ILSVRC 2012, and “+pretrain” for our model further pretrained with the PN-Net [1] loss on right-hand patches from the training set
  • Table 3: Recognition results for the sequence learning stage on the RWTH-PHOENIX-Weather 2014 multi-signer dataset in [%]. We assess the performance of different recurrent models and our proposed detection net. “BLSTM+det net” stands for the employed model with bidirectional LSTM and detection net, and “Our-end2end” for the full model with the best performance in the end-to-end training stage
  • Table 4: Performance comparison of different continuous sign language recognition approaches on the RWTH-PHOENIX-Weather 2014 multi-signer dataset in [%]. “r-hand” stands for right hand and “traj” for trajectory motion. The “extra supervision” imported in [18] contains a sign language lexicon mapping signs to hand shape sequences, and the best result of [19] uses [18]+[16] as the initial alignment
Related work
  • Most systems for sign language recognition consist of a feature extractor to represent the spatial and temporal variations in sign language, and a sequence learning model to learn the correspondence between feature sequences and sequences of glosses. Moreover, continuous sign language recognition [16, 18, 19] is also closely related to the weakly supervised learning problem, where precise temporal locations for the glosses are not available. Here we introduce work related to sign language analysis from these aspects.

    Spatio-temporal representations. Many previous works in the area of sign analysis [8, 16, 22, 25] use handcrafted features for spatio-temporal representation. In recent years there has been growing interest in feature extraction with deep neural networks due to their superior representation capability. The neural network methods adopted in gesture analysis include CNNs [17, 18, 26], 3D CNNs [21, 23, 30], and temporal convolutions [26]. However, due to the data insufficiency in continuous sign language learning, training deep neural networks is prone to overfitting. To alleviate the problem, Koller et al. [18] integrate a CNN into a weakly supervised learning scheme: they use weakly labelled hand shape sequences as an initialization to iteratively tune the CNN and refine the hand shape labels with the Expectation Maximization (EM) algorithm. Koller et al. [17] also adopt finger and palm orientations as weak supervision for tuning the CNN. Different from [17, 18], our approach requires no extra annotations, and we directly use gloss-level alignment proposals instead of sub-unit labels to help the network training.
Funding
  • This work is supported by the 973 Program (2013CB329503), the Natural Science Foundation of China (Grants No. 61473167 and No. 61621136008), and the German Research Foundation (DFG) in project Crossmodal Learning, TRR-169
Reference
  • [1] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv, 2016.
  • [2] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proc. CVPR, 2016.
  • [3] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). In Proc. CVPR, 2009.
  • [4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC, 2014.
  • [5] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
  • [6] H. Cooper and R. Bowden. Learning signs from subtitles: A weakly supervised approach to sign language recognition. In Proc. CVPR, 2009.
  • [7] H. Cooper, E. J. Ong, N. Pugeault, and R. Bowden. Sign language recognition using sub-units. Journal of Machine Learning Research, 13:2205–2231, 2012.
  • [8] G. D. Evangelidis, G. Singh, and R. Horaud. Continuous gesture recognition from articulated poses. In ECCV Workshops, 2014.
  • [9] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. H. Piater, and H. Ney. RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus. In Language Resources and Evaluation Conference, 2012.
  • [10] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney. Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather. In Language Resources and Evaluation Conference, 2014.
  • [11] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
  • [12] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, 2006.
  • [13] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
  • [14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li. Large-scale video classification with convolutional neural networks. In Proc. CVPR, 2014.
  • [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
  • [16] O. Koller, J. Forster, and H. Ney. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141:108–125, 2015.
  • [17] O. Koller, H. Ney, and R. Bowden. Automatic alignment of HamNoSys subunits for continuous sign language. In Language Resources and Evaluation Conference Workshops, 2016.
  • [18] O. Koller, H. Ney, and R. Bowden. Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In Proc. CVPR, 2016.
  • [19] O. Koller, S. Zargaran, H. Ney, and R. Bowden. Deep sign: Hybrid CNN-HMM for continuous sign language recognition. In Proc. BMVC, 2016.
  • [20] P. Molchanov, S. Gupta, K. Kim, and J. Kautz. Hand gesture recognition with 3D convolutional neural networks. In CVPR Workshops, 2015.
  • [21] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proc. CVPR, 2016.
  • [22] C. Monnier, S. German, and A. Ost. A multi-scale boosted detector for efficient and robust gesture recognition. In ECCV Workshops, 2014.
  • [23] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Multi-scale deep learning for gesture detection and localization. In ECCV Workshops, 2014.
  • [24] S. Ong and S. Ranganath. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):873–891, 2005.
  • [25] T. Pfister, J. Charles, and A. Zisserman. Large-scale learning of sign language by watching TV (using co-occurrences). In Proc. BMVC, 2013.
  • [26] L. Pigou, A. v. d. Oord, S. Dieleman, M. M. Van Herreweghe, and J. Dambre. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video. arXiv, 2015.
  • [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015.
  • [29] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proc. ICCV, 2015.
  • [30] D. Wu, L. Pigou, P.-J. Kindermans, N. Le, L. Shao, J. Dambre, and J.-M. Odobez. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1583–1597, 2016.