Sketch-BERT: Learning Sketch Bidirectional Encoder Representation From Transformers by Self-Supervised Learning of Sketch Gestalt

CVPR, pp. 6757-6766, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00679

Abstract:

Previous research on sketches has often considered them in pixel format and leveraged CNN-based models for sketch understanding. Fundamentally, however, a sketch is stored as a sequence of data points, a vector-format representation, rather than a photo-realistic image of pixels. SketchRNN studied a generative neural representation for sket...

Introduction
  • With the prevalence of touch-screen devices, e.g., the iPad, everyone can draw simple sketches.
  • This fuels the demand for automatically understanding sketches, which have been extensively studied in [28, 22, 17] as a type of 2D pixel image.
  • Free-hand sketches reflect the abstraction and iconic representation of objects and scenes in the world around us, composed of their patterns, structure, form, and even simple logic.
Highlights
  • With the prevalence of touch-screen devices, e.g., the iPad, everyone can draw simple sketches
  • This paper proposes a novel model of learning Sketch Bidirectional Encoder Representation from Transformer (Sketch-BERT), which is inspired by the recent BERT model [3] from Natural Language Processing (NLP)
  • To address these tasks, we further present a novel Sketch Gestalt Model (SGM), which is inspired by the masked language model in Natural Language Processing (a minimal masking sketch follows this list)
  • A novel sketch gestalt model is proposed for the self-supervised learning of sketches
  • We conduct experiments on the sketch gestalt task to show the ability of Sketch-BERT in generative representation learning
  • The Sketch-BERT model can be extended to more sketch tasks, such as sketch-based image retrieval and sketch generation, which can be studied in future work
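The masked-modeling idea behind the SGM can be made concrete with a small example. The following is a minimal sketch, not the authors' implementation: it assumes SketchRNN-style stroke-5 input (Δx, Δy and three pen-state bits) and a hypothetical mask_ratio hyperparameter, and simply zeroes out randomly chosen points that the model is then trained to reconstruct.

```python
import numpy as np

def mask_sketch_points(points, mask_ratio=0.15, rng=None):
    """Randomly mask points of a (T, 5) stroke-5 sketch sequence.

    Each row is (dx, dy, p_down, p_up, p_end). Returns the corrupted
    sequence, the boolean mask, and the ground-truth targets that a
    gestalt-style model would learn to reconstruct.
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(points.shape[0]) < mask_ratio  # points to hide
    corrupted = points.copy()
    corrupted[mask] = 0.0  # replace masked points with zeros
    return corrupted, mask, points[mask]
```

In the paper's setting, the model is trained to recover the positions and pen states of the masked points; the masking ratio and the zero replacement used here are illustrative assumptions.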
Methods
  • Table 1 (excerpt): Top-1 (T-1) and Top-5 (T-5) classification accuracy (%) on QuickDraw and TU-Berlin, comparing HOG-SVM [4], Ensemble [19], Bi-LSTM [9], Sketch-a-Net∗/Sketch-a-Net [27], DSSA [22], ResNet18/ResNet50 [8], and TC-Net [20]; Sketch-BERT (w.) achieves 88.30 / 97.82 on QuickDraw and 76.30 / 91.40 on TU-Berlin (the full per-baseline numbers are given in Table 1). A sketch of the Top-k metric follows this list.
  • (3) Bi-LSTM [9]: The authors employ a three-layer bidirectional LSTM model to test the recognition and retrieval tasks on sequential sketch data.
  • (7) TC-Net [20]: A network based on DenseNet [12] for the sketch-based image retrieval task; the authors leverage the pre-trained model for the classification and retrieval tasks.
  • (8) SketchRNN [7]: SketchRNN employs a variational autoencoder with LSTM networks as the encoder and decoder backbones to solve the sketch generation task; in the experiments, the authors use this approach to test the sketch gestalt task.
  • For a fair comparison of structures, the authors retrain all models on the QuickDraw and TU-Berlin datasets for the different tasks
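For reference, the Top-1/Top-5 numbers reported in Table 1 are instances of Top-k accuracy. A minimal sketch of the metric, assuming a matrix of class logits and integer labels (names are illustrative, not from the paper's code):

```python
import numpy as np

def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest logits.

    logits: (N, num_classes) scores, labels: (N,) integer class ids.
    """
    topk = np.argsort(logits, axis=1)[:, -k:]      # top-k class ids per sample
    hits = (topk == labels[:, None]).any(axis=1)   # true label among them?
    return float(hits.mean())

# Top-1 and Top-5 as in Table 1:
# t1 = top_k_accuracy(logits, labels, k=1)
# t5 = top_k_accuracy(logits, labels, k=5)
```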
Results
  • Results on Sketch Recognition Task

    Recognition or classification is a typical task for understanding or modeling data in terms of semantic information, so the authors first compare the classification results of the model with those of the other baselines.
  • Interestingly, Sketch-BERT without self-supervised training on the sketch gestalt task achieves much worse results than the other baselines on this retrieval task
  • This further suggests that the SGM proposed for the self-supervised learning step can efficiently improve the generalization ability of Sketch-BERT (a sketch of the mAP metric used in Table 2 follows this list).
  • The authors' model is compared against SketchRNN [7], which, to the best of their knowledge, is the only generative model able to predict masked sketch sequences
  • This task is conducted on the QuickDraw dataset: both models are trained on the training data and evaluated on the test data.
  • The authors show more examples of sketch gestalt results for different classes (onion, flashlight, floor lamp, hammer, guitar, stethoscope, basketball, tiger, helmet)
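The mAP reported for the retrieval task (Table 2) follows the standard definition: each query ranks the gallery by similarity, and precision is averaged over the ranks of the relevant items. A minimal NumPy sketch under that assumption, with illustrative names:

```python
import numpy as np

def mean_average_precision(scores, gallery_labels, query_labels):
    """mAP for retrieval.

    scores: (Q, G) query-to-gallery similarity matrix,
    gallery_labels: (G,) class ids, query_labels: (Q,) class ids.
    """
    aps = []
    for q in range(scores.shape[0]):
        order = np.argsort(-scores[q])                  # best match first
        rel = gallery_labels[order] == query_labels[q]  # relevance per rank
        if not rel.any():
            continue
        hits = np.cumsum(rel)                           # relevant seen so far
        precision_at_hits = hits[rel] / (np.flatnonzero(rel) + 1)
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))
```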
Conclusion
  • The Sketch-BERT model has L = 8 weight-sharing Transformer layers with a hidden size of H = 768 and 12 self-attention heads (a minimal weight-sharing sketch follows this list).
  • In self-supervised learning, the authors leverage the whole training set of QuickDraw to train the sketch gestalt model.
  • A novel sketch gestalt model is proposed for the self-supervised learning of sketches.
  • The results on the QuickDraw and TU-Berlin datasets show the superiority of Sketch-BERT on the classification and retrieval tasks.
  • The Sketch-BERT model can be extended to more sketch tasks, such as sketch-based image retrieval and sketch generation, which can be studied in future work
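The configuration in the first bullet (L = 8 weight-sharing Transformer layers, H = 768, 12 heads) mirrors ALBERT-style cross-layer parameter sharing [16]. A minimal PyTorch sketch of that sharing pattern, assuming a standard encoder layer rather than the authors' exact embedding and refinement networks:

```python
import torch.nn as nn

class WeightSharedEncoder(nn.Module):
    """One Transformer layer applied L times (cross-layer weight sharing)."""

    def __init__(self, hidden=768, heads=12, depth=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.depth = depth

    def forward(self, x):
        # x: (batch, seq_len, hidden) embedded sketch points
        for _ in range(self.depth):
            x = self.layer(x)  # the same parameters at every depth
        return x
```

Sharing a single set of layer weights keeps the parameter count of the 8-layer encoder close to that of one layer, which is the usual motivation for this design.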
Tables
  • Table 1: The Top-1 (T-1) and Top-5 (T-5) accuracy of our model and other baselines on the classification task; w./o. and w. indicate the results without and with the self-supervised learning by sketch gestalt, respectively. ∗ marks the results reported in the original paper [27]
  • Table 2: The Top-1, Top-5 accuracy and mean Average Precision (mAP) of our model and other baselines on the sketch retrieval task; w./o. and w. indicate the results without and with the self-supervised learning by sketch gestalt
  • Table 3: The performance of classification and retrieval tasks on ...
  • Table 4: The performance of classification and retrieval tasks of Sketch-BERT with different volumes of pre-training data
  • Table 5: The performance of classification and retrieval tasks for different structures of Sketch-BERT (L − A − H)
Related work
  • Representation of Sketches. Research on the representation of sketches has a long history. As with images and text, learning discriminative features for sketches is a popular topic in sketch representation learning. The majority of such works [11, 19, 28, 27, 20, 17] achieved this goal through classification or retrieval tasks. Traditional methods focused on hand-crafted features, such as BoW [11], HOG [10] and ensemble structured features [19]. More recently, several works have tried to learn neural representations of sketches. Due to the huge visual gap between sketches and images, Sketch-a-Net [28] designed a CNN structure specific to sketches, which achieved the state-of-the-art performance at that time, with several follow-up works [27, 22]. On the other hand, TC-Net [20] utilized an auxiliary classification task to directly solve sketch recognition with an off-the-shelf backbone, e.g., DenseNet [12]. Different from the above methods, which directly utilized pixel-level information from sketch images, the works in [17, 30] made use of the vector-form representation of sketches, illustrated below.
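To make the vector-form representation concrete: SketchRNN [7] encodes each point as a 5-tuple (Δx, Δy, p1, p2, p3), where the three pen-state bits mark pen-on-paper, end-of-stroke, and end-of-sketch. A minimal conversion sketch under that convention (the helper name and input layout are illustrative assumptions):

```python
import numpy as np

def strokes_to_stroke5(strokes):
    """Convert strokes (lists of absolute (x, y) points) to stroke-5 rows.

    Each row is (dx, dy, p1, p2, p3): p1 = pen stays on paper,
    p2 = pen lifts after this point, p3 = end of the whole sketch.
    """
    rows, prev = [], (0.0, 0.0)
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            last = i == len(stroke) - 1
            rows.append([x - prev[0], y - prev[1],
                         0.0 if last else 1.0,   # p1: stroke continues
                         1.0 if last else 0.0,   # p2: stroke ends here
                         0.0])                   # p3: set on final point below
            prev = (x, y)
    if rows:
        rows[-1][2:] = [0.0, 0.0, 1.0]  # mark end of sketch
    return np.asarray(rows, dtype=np.float32)
```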
Funding
  • This work was supported in part by NSFC Projects (U1611461, 61702108), Science and Technology Commission of Shanghai Municipality Projects (19511120700), Shanghai Municipal Science and Technology Major Project (2018SHZDZX01), and Shanghai Research and Innovation Functional Program (17DZ2260900)
Reference
  • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Agnès Desolneux, Lionel Moisan, and Jean-Michel Morel. Gestalt theory and computer vision. In Theory and Decision Library A, 2004.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? In SIGGRAPH, 2012.
  • Mathias Eitz, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. Sketch-based image retrieval: Benchmark and bag-of-features descriptors. TVCG, 2010.
  • S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • David Ha and Douglas Eck. A neural representation of sketch drawings. In ICLR, 2018.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Rui Hu and John Collomosse. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. CVIU, 2013.
  • Rui Hu, Tinghuai Wang, and John Collomosse. A bag-of-regions approach to sketch-based image retrieval. In ICIP, 2011.
  • Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
  • Zhewei Huang, Wen Heng, and Shuchang Zhou. Learning to paint with model-based deep reinforcement learning. In ICCV, 2019.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • Lei Li, Changqing Zou, Youyi Zheng, Qingkun Su, Hongbo Fu, and Chiew-Lan Tai. Sketch-R2CNN: An attentive network for vector sketch recognition. arXiv preprint arXiv:1811.08170, 2018.
  • Yijun Li, Chen Fang, Aaron Hertzmann, Eli Shechtman, and Ming-Hsuan Yang. Im2Pencil: Controllable pencil illustration from photographs. In CVPR, 2019.
  • Yi Li, Yi-Zhe Song, and Shaogang Gong. Sketch recognition by ensemble matching of structured features. In BMVC, 2013.
  • Hangyu Lin, Peng Lu, Yanwei Fu, Shaogang Gong, Xiangyang Xue, and Yu-Gang Jiang. TC-Net for iSBIR: Triplet classification network for instance-level sketch based image retrieval. In ACM Multimedia, 2019.
  • M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
  • Jifei Song, Qian Yu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, 2017.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In CVPR, pages 5505–5514, 2018.
  • Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-form image inpainting with gated convolution. In ICCV, pages 4471–4480, 2019.
  • Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, and Chen-Change Loy. Sketch me that shoe. In CVPR, 2016.
  • Qian Yu, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. Sketch-a-Net: A deep neural network that beats humans. IJCV, 122(3):411–425, 2017.
  • R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
  • Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and recognizing Chinese characters with recurrent neural network. TPAMI, 40(4):849–862, 2017.
  • Tao Zhou, Chen Fang, Zhaowen Wang, Jimei Yang, Byungmoon Kim, Zhili Chen, Jonathan Brandt, and Demetri Terzopoulos. Learning to sketch with deep Q networks and demonstrated strokes. arXiv preprint arXiv:1810.05977, 2018.
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.