Sketchformer: Transformer-based Representation for Sketched Structure

CVPR, pp. 14141-14150, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01416

Abstract:

Sketchformer is a novel transformer-based representation for encoding free-hand sketch input in vector form, i.e. as a sequence of strokes. Sketchformer effectively addresses multiple tasks: sketch classification, sketch-based image retrieval (SBIR), and the reconstruction and interpolation of sketches. We report several variants ex…
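To make the "sequence of strokes" input concrete, here is a minimal sketch of the common QuickDraw stroke format. This assumes the stroke-3/stroke-5 convention popularized by SketchRNN (pen offsets plus pen-state flags), not necessarily the paper's exact pre-processing:

```python
# Hypothetical illustration: "stroke-3" points are (dx, dy, pen_lifted);
# sequence models often consume the "stroke-5" variant
# (dx, dy, p_down, p_up, p_end) with an explicit end-of-sketch state.

def stroke3_to_stroke5(points):
    """Convert a stroke-3 sequence to stroke-5, appending an end-of-sketch row."""
    out = []
    for dx, dy, lifted in points:
        # p_down = pen on paper after this point, p_up = pen lifted after it
        out.append((dx, dy, 1 - lifted, lifted, 0))
    out.append((0, 0, 0, 0, 1))  # end-of-sketch marker
    return out

# A square drawn in four pen movements, lifting the pen after the last one:
square = [(10, 0, 0), (0, 10, 0), (-10, 0, 0), (0, -10, 1)]
print(stroke3_to_stroke5(square))
```

Representing sketches this way (rather than as raster images) is what lets sequence models such as Sketchformer exploit stroke order and structure.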

Introduction
  • Sketch representation and interpretation remains an open challenge for complex and casually constructed drawings.
  • Long short-term memory (LSTM) networks have shown significant promise in learning search embeddings [32, 5], owing to their ability to model higher-level structure and temporal order, versus convolutional neural networks (CNNs) applied to rasterized sketches [3, 18, 6, 22].
  • However, the limited temporal extent of LSTMs restricts the structural complexity of the sketches that a sequence embedding can accommodate.
Highlights
  • Sketch representation and interpretation remains an open challenge for complex and casually constructed drawings
  • The limited temporal extent of long short-term memory (LSTM) networks restricts the structural complexity of sketches that may be accommodated in sequence embeddings
  • Sketch-based image retrieval (SBIR): we show that Sketchformer can be unified with a raster embedding to produce a search embedding for SBIR, delivering improved precision over a large photo corpus (Stock10M)
  • We evaluate SBIR over the Stock10M dataset of diverse photos and artworks, as such data is commonly indexed for large-scale SBIR evaluation [6, 5]
  • We showed that interpolation within the embedding yields plausible blending of sketches within and between classes, and that reconstruction is improved for complex sketches
  • We have demonstrated the potential for Transformer networks to learn a multi-purpose representation for sketch, but believe many further applications of Sketchformer exist beyond the three tasks studied here
Methods
  • Sketch classification over QuickDraw! (Table 1): LiveSketch [5] 72.93, SketchRNN [12] 67.69, TForm-Tok-Dict 78.34 (with shuffled-stroke TForm-Cont. variants and short/mid/long splits also reported).
  • Sketch2Sketch retrieval (Table 4): LiveSketch [5] 62.1, SketchRNN [12] 4.05.
Results
  • The authors show that Sketchformer, driven by a dictionary-learning tokenization scheme, outperforms state-of-the-art sequence embeddings for sketched object recognition over QuickDraw! [19], the largest and most diverse public corpus of sketched objects.
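The dictionary-learning tokenization idea can be sketched as vector quantization: learn a codebook of common (dx, dy) pen movements, then replace each continuous offset with the index of its nearest codeword, yielding a discrete token sequence a Transformer can consume. The codebook below is hypothetical for illustration; the paper learns its dictionary from training data:

```python
# Illustrative only: a fixed 5-word codebook of (dx, dy) pen movements.
import numpy as np

codebook = np.array([[0., 0.], [5., 0.], [0., 5.], [-5., 0.], [0., -5.]])

def tokenize(offsets, codebook):
    """Map each continuous (dx, dy) offset to its nearest-codeword index."""
    # Pairwise Euclidean distances: (n_offsets, n_codewords)
    d = np.linalg.norm(offsets[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

strokes = np.array([[4.2, 0.3], [0.1, 5.9], [-4.8, -0.2]])
print(tokenize(strokes, codebook).tolist())  # -> [1, 2, 3]
```

Discretizing the input this way lets the model use a standard token embedding plus softmax output, at the cost of quantization error controlled by the codebook size.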
Conclusion
  • The authors evaluate the proposed transformer embeddings on three common tasks: sketch classification, sketch reconstruction and interpolation, and sketch-based image retrieval (SBIR).
  • For the SBIR and interpolation experiments, the authors sort QD-862k by sequence length and sample three datasets (QD345-S, QD345-M, QD345-L) at the 10th, 50th and 90th centiles respectively, giving sets of short, medium and long stroke sequences.
  • Each of these three datasets samples one sketch per class at random from its centile, yielding three evaluation sets of 345 sketches.
  • Fusion with additional modalities might enable sketch-driven photo generation [16] from complex sketches, or, combined with a language embedding, novel sketch synthesis applications.
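The centile-based evaluation split above can be sketched as follows. This is a simplified analogue (sketches reduced to `(class_id, stroke_length)` records; the real QD-862k split samples whole sketches across 345 classes):

```python
# Minimal sketch of centile-based sampling per class, assuming each
# sketch is summarized by its class id and stroke-sequence length.
from collections import defaultdict

def sample_by_centile(records, centile):
    """Per class, pick the stroke length sitting at the given centile."""
    by_class = defaultdict(list)
    for cls, length in records:
        by_class[cls].append(length)
    picks = {}
    for cls, lengths in by_class.items():
        lengths.sort()
        idx = min(int(centile / 100 * len(lengths)), len(lengths) - 1)
        picks[cls] = lengths[idx]
    return picks

records = [("cat", n) for n in range(1, 11)] + [("dog", n) for n in range(5, 15)]
short = sample_by_centile(records, 10)  # QD345-S analogue: short sequences
long_ = sample_by_centile(records, 90)  # QD345-L analogue: long sequences
print(short, long_)
```

Stratifying by sequence length like this is what allows the paper to report how embedding quality degrades (or does not) as sketches grow more complex.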
Tables
  • Table1: Sketch classification results over QuickDraw! [19] for three variants of the proposed transformer embedding, contrasting each to models learned from randomly permuted stroke order. Comparing to two recent LSTM-based approaches for sketch sequence encoding [5, 12]
  • Table2: User study quantifying accuracy of sketch reconstruction
  • Table3: User study quantifying interpolation quality for a pair of sketches of the same (intra-) or different (inter-) classes. Preference is expressed by 5 independent workers, and results with > 50%
  • Table4: Quantifying the performance of Sketch2Sketch retrieval for two RNN baselines and three proposed variants. We report category- and instance-level retrieval (mAP%)
  • Table5: Quantifying accuracy of Sketchformer for Sketch2Image search (SBIR). Mean average precision (mAP) computed to rank 15 over Stock10M for the QD345-Q query set
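The "mAP computed to rank 15" metric of Table 5 can be illustrated with a short sketch. Assumptions: binary relevance per ranked item, and AP normalized by the number of relevant items retrieved within the cutoff (conventions vary between IR toolkits):

```python
# Hedged sketch of mean average precision truncated at rank k.
def average_precision_at_k(relevant_flags, k=15):
    """AP over the top-k ranked items; relevant_flags[i] is 1 if item i is relevant."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevant_flags[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this relevant hit
    return score / max(hits, 1)

def mean_ap(all_rankings, k=15):
    """Average AP@k across all queries."""
    return sum(average_precision_at_k(r, k) for r in all_rankings) / len(all_rankings)

print(average_precision_at_k([1, 0, 1, 0], k=15))  # (1/1 + 2/3) / 2 ≈ 0.833
```

Truncating at rank 15 reflects what a user actually sees on the first page of results, which is why large-scale SBIR evaluations report it.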
Related work
  • Representation learning for sketch has received extensive attention within the domain of visual search. Classical sketch-based image retrieval (SBIR) techniques explored spectral, edge-let based, and sparse gradient features, the latter building upon the success of dictionary-learning based models (e.g. bag of words) [26, 1, 23]. With the advent of deep learning, convolutional neural networks (CNNs) were rapidly adopted to learn search embeddings [35]. Triplet-loss models are commonly used for visual search in the photographic domain [29, 20, 11], and have been extended to SBIR. Sangkloy et al. [22] used a three-branch CNN with triplet loss to learn a general cross-domain embedding for SBIR. Fine-grained (within-class) SBIR was similarly explored by Yu et al. [34]. Qi et al. [18] instead use a contrastive loss to learn correspondence between sketches and pre-extracted edge maps. Bui et al. [2, 3] perform cross-category retrieval using a triplet model, and combined their technique with a learned model of visual aesthetics [31] to constrain SBIR using aesthetic cues in [6]. A quadruplet loss was proposed by [25] for fine-grained SBIR. The generalization of sketch embeddings beyond training classes has also been studied [4, 15], and parameterized for zero-shot learning [9]. Such concepts were later applied to sketch-based shape retrieval tasks [33]. Variants of CycleGAN [36] have also been shown to be useful as generative models for sketch [13]. Sketch-A-Net was a seminal work for sketch classification that employed a CNN with large convolutional kernels to accommodate the sparsity of stroke pixels [34]. Recognition of partial sketches has also been explored [24]. Wang et al. [30] proposed sketch classification by sampling unordered points of a sketch image to learn a canonical order.
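The triplet loss recurring through this related work [22, 34, 2, 3] can be sketched in a few lines. Names and the margin value below are illustrative, not taken from any one cited paper:

```python
# Minimal triplet (hinge) loss on embedding vectors: pull the anchor
# toward a same-class positive, push it from a different-class negative.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # same class: already close to the anchor
n = np.array([1.0, 0.0])  # different class: already far from the anchor
print(triplet_loss(a, p, n))  # -> 0.0, the margin is already satisfied
```

In the SBIR setting the anchor is typically a sketch embedding and the positive/negative are photo embeddings, so minimizing this loss aligns the two modalities in one search space.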
Funding
  • Reports several variants exploring continuous and tokenized input representations, and contrasts their performance
  • Shows that sketch reconstruction and interpolation are improved significantly by the Sketchformer embedding for complex sketches with longer stroke sequences
  • Proposes Sketchformer, the first Transformer-based network for learning a deep representation of freehand sketches
  • Evaluates the efficacy of each learned sketch embedding on common sketch interpretation tasks
References
  • Tu Bui and John Collomosse. Scalable sketch-based image retrieval using color gradient features. In Proc. ICCV Workshops, pages 1–8, 2015. 2
  • T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Generalisation and sharing in triplet convnets for sketch based visual search. arXiv preprint arXiv:1611.05301, 2016. 2
  • T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Compact descriptors for sketch-based image retrieval using a triplet loss convolutional neural network. Computer Vision and Image Understanding (CVIU), 2017. 1, 2, 8
  • T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression. Computers & Graphics, 2018. 2, 4, 5, 7
  • J. Collomosse, T. Bui, and H. Jin. LiveSketch: Query perturbations for guided sketch-based visual search. In Proc. CVPR, pages 1–9, 2019. 1, 2, 4, 5, 6, 7, 8
  • J. Collomosse, T. Bui, M. Wilber, C. Fang, and H. Jin. Sketching with style: Visual search with sketches and aesthetic context. In Proc. ICCV, 2017. 1, 2, 7
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019. 1, 2
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT, pages 4171–4186, 2019. 1, 2
  • S. Dey, P. Riba, A. Dutta, J. Llados, and Y. Song. Doodle to search: Practical zero-shot sketch-based image retrieval. In Proc. CVPR, 2019. 2
  • David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica, 10(2):112–122, 1973. 2, 5
  • Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In Proc. ECCV, pages 241–257, 2016. 2
  • D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018. 1, 2, 4, 5, 6, 8
  • J. Song, K. Pang, Y. Song, T. Xiang, and T. Hospedales. Learning to sketch with shortcut cycle consistency. In Proc. CVPR, 2018. 2
  • Lei Li, Changqing Zou, Youyi Zheng, Qingkun Su, Hongbo Fu, and Chiew-Lan Tai. Sketch-R2CNN: An attentive network for vector sketch recognition. arXiv preprint arXiv:1811.08170, 2018. 2
  • K. Pang, K. Li, Y. Yang, H. Zhang, T. Hospedales, T. Xiang, and Y. Song. Generalising fine-grained sketch-based image retrieval. In Proc. CVPR, 2019. 2
  • Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. GauGAN: Semantic image synthesis with spatially adaptive normalization. In ACM SIGGRAPH 2019 Real-Time Live!, page 2, 2019. 8
  • N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image Transformer. In Proc. ICML, 2018. 2
  • Yonggang Qi, Yi-Zhe Song, Honggang Zhang, and Jun Liu. Sketch-based image retrieval via siamese convolutional neural network. In Proc. ICIP, pages 2460–2464, 2016. 1, 2
  • The Quick, Draw! Dataset. https://github.com/googlecreativelab/quickdraw-dataset. Accessed: 2018-10-11. 1, 2, 5
  • Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In Proc. ECCV, pages 3–20, 2016. 2
  • Umar Riaz Muhammad, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. Learning deep sketch abstraction. In Proc. CVPR, pages 8014–8023, 2018. 2
  • Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy database: Learning to retrieve badly drawn bunnies. In Proc. ACM SIGGRAPH, 2016. 1, 2
  • Rosalia G. Schneider and Tinne Tuytelaars. Sketch classification and classification-driven analysis using Fisher vectors. ACM Transactions on Graphics (TOG), 33(6):174, 2014. 2
  • Omar Seddati, Stephane Dupont, and Saïd Mahmoudi. DeepSketch 2: Deep convolutional neural networks for partial sketch recognition. In Proc. CBMI, pages 1–6, 2016. 2
  • O. Seddati, S. Dupont, and S. Mahmoudi. Quadruplet networks for sketch-based image retrieval. In Proc. ICMR, 2017. 2
  • J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003. 2
  • Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Proc. NeurIPS, pages 2440–2448, 2015. 4
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. NeurIPS, 2017. 1, 2, 3, 8
  • Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proc. CVPR, pages 1386–1393, 2014. 2
  • Xiangxiang Wang, Xuejin Chen, and Zhengjun Zha. SketchPointNet: A compact network for robust sketch recognition. In Proc. ICIP, pages 2994–2998, 2018. 2
  • M. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie. BAM! The Behance artistic media dataset for recognition beyond photography. In Proc. ICCV, 2017. 2
  • P. Xu, Y. Huang, T. Yuan, K. Pang, Y-Z. Song, T. Xiang, and T. Hospedales. SketchMate: Deep hashing for million-scale human sketch retrieval. In Proc. CVPR, 2018. 1, 2, 5
  • Yongzhe Xu, Jiangchuan Hu, Kun Zeng, and Yongyi Gong. Sketch-based shape retrieval via multi-view attention and generalized similarity. In Proc. ICDH, pages 311–317, 2018. 2
  • Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, and Chen-Change Loy. Sketch me that shoe. In Proc. CVPR, pages 799–807, 2016. 2
  • Hua Zhang, Si Liu, Changqing Zhang, Wenqi Ren, Rui Wang, and Xiaochun Cao. SketchNet: Sketch classification with web images. In Proc. CVPR, pages 1105–1113, 2016. 2
  • J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, 2017. 2