Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space

KDD 2019, pp. 714–722.

DOI: https://doi.org/10.1145/3292500.3330997

Abstract:

Hierarchical clustering is typically performed using algorithmic-based optimization searching over the discrete space of trees. While these optimization methods are often effective, their discreteness restricts them from many of the benefits of their continuous counterparts, such as scalable stochastic optimization and the joint optimization [...]

Introduction
  • Hierarchical clustering is a ubiquitous and often-used tool for data analysis [44, 57], visualization [23, 43] and mining of meaningful representations of data [8].
  • Moseley and Wang [34] give a cost, which is akin to Dasgupta’s [16] and is well approximated by hierarchical agglomerative clustering with average linkage, and Adams et al. [1] give an MCMC-based inference procedure for a nested stick-breaking objective.
Highlights
  • Hierarchical clustering is a ubiquitous and often-used tool for data analysis [44, 57], visualization [23, 43] and mining of meaningful representations of data [8]
  • We present an objective function that is differentiable with respect to our tree node embeddings and perform hierarchical clustering by optimizing this objective using stochastic gradient descent (a minimal illustrative sketch follows this list).
  • We consider two settings: one in which the input representation is fixed (the ImageNet dataset, Figure 4) and one in which the input representation is jointly learned with the tree structure, as described in Section 5, using GloVe [36] (Figure 3).
  • We presented a novel hierarchical clustering algorithm that uses gradient-based optimization and a continuous representation of trees in the Poincaré ball
  • We showed its ability to perform hierarchical clustering on large scale data
  • We hypothesize that our increased performance is due to the ability of gradient-based hyperbolic hierarchical clustering (gHHC) to update internal nodes in mini-batch fashion without making the incremental hard decisions that are difficult when only a small number of internal nodes is maintained.
  • We showed how our model can be jointly optimized with multi-task regression
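The highlights above describe tree-node embeddings that live in the Poincaré ball and are trained with stochastic gradient descent. The snippet below is a minimal illustrative sketch of that setup, not the authors' implementation: it uses the standard Poincaré-ball distance, a placeholder loss, toy sizes, and a simple norm-clipping projection in place of true Riemannian SGD [7].

```python
import torch

def poincare_dist(u, v, eps=1e-6):
    # Geodesic distance in the Poincare ball:
    # d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    sq_diff = ((u - v) ** 2).sum(dim=-1)
    denom = ((1 - (u ** 2).sum(dim=-1)) * (1 - (v ** 2).sum(dim=-1))).clamp_min(eps)
    return torch.acosh((1 + 2 * sq_diff / denom).clamp_min(1 + eps))

def project_to_ball(z, eps=1e-3):
    # Rescale any embedding that drifted outside the unit ball back inside it.
    norms = z.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.where(norms >= 1 - eps, z * (1 - eps) / norms, z)

# Toy setup (hypothetical sizes): 8 internal-node embeddings in a 2-D Poincare ball.
nodes = torch.nn.Parameter(0.01 * torch.randn(8, 2))
opt = torch.optim.SGD([nodes], lr=0.05)

x = 0.1 * torch.randn(5, 2)  # a mini-batch of data points (stand-in data)
# Placeholder differentiable loss: pull each point toward its nearest internal node.
loss = poincare_dist(x.unsqueeze(1), nodes.unsqueeze(0)).min(dim=1).values.mean()
opt.zero_grad(); loss.backward(); opt.step()
with torch.no_grad():
    nodes.copy_(project_to_ball(nodes))  # keep node embeddings inside the ball
```

The point of the sketch is only the pattern: because the node embeddings are continuous and the distance is differentiable, ordinary mini-batch gradient steps (plus a projection back into the ball) replace discrete search over trees.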
Methods
  • The experiments evaluate the quality of the tree structures learned by the method.
  • The authors consider two settings: one in which the input representation is fixed (the ImageNet dataset, Figure 4) and one in which the input representation is jointly learned with the tree structure, as described in Section 5, using GloVe [36] (Figure 3).
  • In both cases, the authors see that the method produces meaningful structure.
Results
  • The authors show that the method outperforms state-of-the-art approaches on a clustering task of ImageNet ILSVRC images [28] by 15 points of dendrogram purity.
  • The authors find that a good initialization of the model improves performance, as is the case with all clustering methods.
  • The authors hypothesize that the increased performance is due to gHHC’s ability to update internal nodes in mini-batch fashion without making the incremental hard decisions that are difficult with a small number of internal nodes.
Conclusion
  • The authors presented a novel hierarchical clustering algorithm that uses gradient-based optimization and a continuous representation of trees in the Poincaré ball.
  • The authors showed its ability to perform hierarchical clustering on large scale data.
  • The authors showed how the model can be jointly optimized with multi-task regression (a rough sketch of such a joint objective follows this list).
  • The authors hope to explore gradient-based optimization of tree structures in deep latent-variable models.
  • The authors thank Ari Kobren for his helpful discussions and work on earlier versions of gradient-based hierarchical clustering.
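The paper reports joint optimization with multi-task regression (Table 2, the school dataset), but the exact coupling is not spelled out in this summary. The sketch below is therefore only an assumed illustration of what "jointly optimized" could look like: a weighted sum of a regression loss and a stand-in clustering term over a shared representation, trained end-to-end. All module names, sizes, and the trade-off weight alpha are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical data: 100 examples, 10 features, 3 regression tasks.
X = torch.randn(100, 10)
Y = torch.randn(100, 3)

encoder = nn.Linear(10, 2)   # shared representation used by both objectives
heads = nn.Linear(2, 3)      # one regression output per task
opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-2)

alpha = 0.5                  # placeholder trade-off between the two losses
for _ in range(200):
    z = encoder(X)                                   # shared embedding
    reg_loss = nn.functional.mse_loss(heads(z), Y)   # multi-task regression term
    # Stand-in clustering term: pull embeddings toward their mean; the paper's
    # tree-structured hierarchical clustering objective would replace this.
    clust_loss = ((z - z.mean(dim=0)) ** 2).sum(dim=-1).mean()
    loss = reg_loss + alpha * clust_loss
    opt.zero_grad(); loss.backward(); opt.step()
```

The design point is simply that a gradient-based clustering objective can be summed with other differentiable losses and trained in one loop, which is not possible with discrete tree-search procedures.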
Summary
  • Introduction:

    Hierarchical clustering is a ubiquitous and often-used tool for data analysis [44, 57], visualization [23, 43] and mining of meaningful representations of data [8].
  • Moseley and Wang [34] give a cost, which is akin to Dasgupta’s [16] and is well approximated by hierarchical agglomerative clustering with average linkage, and Adams et al. [1] give an MCMC-based inference procedure for a nested stick-breaking objective.
  • Objectives:

    The authors' objective is inspired by recent work on cost functions for hierarchical clustering [9–11, 13, 14, 16, 34, 49].
  • The authors' objective is to minimize the expected distance between x_i and the embeddings of the nodes that are likely to be a least common ancestor of x_i and x_j, and to increase the distance between x_i and the nodes that are likely to be a least common ancestor of x_i, x_j, and x_k.
  • The authors' objective is amenable to stochastic gradient descent with respect to Z by sampling triples of points (x_i, x_j, x_k); a toy sketch of such a triplet update appears after this summary.
  • Methods:

    The experiments evaluate the quality of the tree structures learned by the method.
  • The authors consider two settings: one in which the input representation is fixed (the ImageNet dataset, Figure 4) and one in which the input representation is jointly learned with the tree structure, as described in Section 5, using GloVe [36] (Figure 3).
  • In both cases, the authors see that the method produces meaningful structure.
  • Results:

    The authors show that the method outperforms state-of-the-art approaches on a clustering task of ImageNet ILSVRC images [28] by 15 points of dendrogram purity.
  • The authors find that a good initialization of the model improves performance, as is the case with all clustering methods.
  • The authors hypothesize that the increased performance is due to gHHC’s ability to update internal nodes in mini-batch fashion without making the incremental hard decisions that are difficult with a small number of internal nodes.
  • Conclusion:

    The authors presented a novel hierarchical clustering algorithm that uses gradient-based optimization and a continuous representation of trees in the Poincaré ball.
  • The authors showed its ability to perform hierarchical clustering on large scale data.
  • The authors showed how the model can be jointly optimized with multi-task regression.
  • The authors hope to explore gradient-based optimization of tree structures in deep latent-variable models.
  • The authors thank Ari Kobren for his helpful discussions and work on earlier versions of gradient-based hierarchical clustering.
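To illustrate the triplet-based objective sketched under "Objectives", here is a toy version of the sampling-and-update loop. It is not the paper's formulation: the "likely to be a least common ancestor" weights are approximated below by a softmax over node distances (the paper derives them from the tree geometry in the Poincaré ball), and all sizes and step sizes are placeholders.

```python
import torch

def poincare_dist(u, v, eps=1e-6):
    # Poincare-ball distance (same formula as in the earlier sketch).
    sq = ((u - v) ** 2).sum(-1)
    denom = ((1 - (u ** 2).sum(-1)) * (1 - (v ** 2).sum(-1))).clamp_min(eps)
    return torch.acosh((1 + 2 * sq / denom).clamp_min(1 + eps))

# Z: embeddings of internal tree nodes; x: data embeddings (toy, hypothetical sizes).
Z = torch.nn.Parameter(0.01 * torch.randn(16, 2))
x = 0.1 * torch.randn(200, 2)
opt = torch.optim.Adam([Z], lr=0.01)

def lca_weights(points):
    # Stand-in for "likely to be the least common ancestor" of the given points:
    # a softmax over internal nodes by total distance to the points. The paper
    # derives these probabilities from the tree geometry instead.
    d = torch.stack([poincare_dist(p.unsqueeze(0), Z) for p in points]).sum(0)
    return torch.softmax(-d, dim=0)

for _ in range(100):
    i, j, k = torch.randint(0, 200, (3,))      # (x_i, x_j) assumed more similar than x_k
    d_i = poincare_dist(x[i].unsqueeze(0), Z)  # distance from x_i to every internal node
    w_pair = lca_weights([x[i], x[j]])         # nodes plausible as lca(x_i, x_j)
    w_trip = lca_weights([x[i], x[j], x[k]])   # nodes plausible as lca(x_i, x_j, x_k)
    # Pull the likely pairwise LCA toward x_i, push the likely triple LCA away.
    loss = (w_pair * d_i).sum() - (w_trip * d_i).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        norms = Z.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        Z.mul_(torch.where(norms >= 0.999, 0.999 / norms, torch.ones_like(norms)))
```

Only the overall pattern matches the summary: sample a triple, move internal-node embeddings by a stochastic gradient step on the soft-LCA objective, and project back into the ball.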
Tables
  • Table 1: Dendrogram Purity. Results for competing approaches are from [28], using each algorithm’s optimal setting. Bold indicates the best-performing method. On small-scale problems (the first three datasets) HAC performs very well. As the number of ground-truth clusters, the dimensionality, and the number of data points increase, our algorithm outperforms state-of-the-art methods (a sketch of the dendrogram-purity computation follows the table list).
  • Table 2: Results over the school dataset.
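Dendrogram purity, the metric reported in Table 1, is the average, over pairs of points that share a ground-truth class, of the fraction of leaves under the pair's least common ancestor that belong to that class. A small reference-style sketch follows, assuming the tree is given as parent pointers; all names are hypothetical.

```python
from itertools import combinations

def dendrogram_purity(parent, labels):
    """parent: dict mapping each node to its parent (root maps to None).
    labels: dict mapping each leaf node to its ground-truth class.
    Returns the average, over same-class leaf pairs, of the purity of the
    leaves under that pair's least common ancestor."""
    # Collect the leaves under every node (a leaf sits under each of its ancestors).
    leaves_under = {n: set() for n in parent}
    for leaf in labels:
        node = leaf
        while node is not None:
            leaves_under[node].add(leaf)
            node = parent[node]

    def ancestors(n):
        out = []
        while n is not None:
            out.append(n)
            n = parent[n]
        return out

    def lca(a, b):
        anc_a = set(ancestors(a))
        # First ancestor of b (walking leaf-to-root) that is also an ancestor of a.
        return next(n for n in ancestors(b) if n in anc_a)

    total, count = 0.0, 0
    for a, b in combinations(labels, 2):
        if labels[a] != labels[b]:
            continue
        under = leaves_under[lca(a, b)]
        total += sum(labels[x] == labels[a] for x in under) / len(under)
        count += 1
    return total / count if count else 0.0

# Toy tree:    root
#             /    \
#           u1      u2
#          /  \    /  \
#         a    b  c    d
parent = {"root": None, "u1": "root", "u2": "root",
          "a": "u1", "b": "u1", "c": "u2", "d": "u2"}
labels = {"a": 0, "b": 0, "c": 0, "d": 1}
print(dendrogram_purity(parent, labels))  # (a,b): 1.0, (a,c): 3/4, (b,c): 3/4 -> 5/6
```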
Related work
  • Hierarchical clustering is a widely studied problem theoretically, in machine learning, and in applications. Apart from the work on Dasgupta’s cost and related costs [49], there has been much work on probabilistic models that have been used to describe the quality of hierarchical clusterings. Much of this work uses Bayesian nonparametric models to describe tree structures [1]. There has also been some work using discriminative graphical models to measure the quality of a clustering [50]. These cost functions come with their own inductive biases, and optimizing them with techniques similar to this paper could be interesting future work. Gradient-based methods are prevalent in flat clustering, such as stochastic and mini-batch k-means [42].
Funding
  • This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, and in part by the National Science Foundation under Grant No. NSF-1763618.
References
  • [1] R. P. Adams, Z. Ghahramani, and M. I. Jordan. 2010. Tree-Structured Stick Breaking for Hierarchical Data.
  • [2] O. Bachem, M. Lucic, H. Hassani, and A. Krause. 2016. Fast and provably good seedings for k-means. NeurIPS.
  • [3] M. F. Balcan, A. Blum, and S. Vempala. 2008. A discriminative framework for clustering via similarity functions. STOC.
  • [4] J. Bingham and S. Sudarsanam. 2000. Visualizing large hierarchical clusters in hyperbolic space. Bioinformatics.
  • [5] C. Blundell, Y. W. Teh, and K. A. Heller. 2010. Bayesian rose trees. UAI.
  • [6] M. Boguñá, F. Papadopoulos, and D. Krioukov. 2010. Sustaining the internet with hyperbolic mapping. Nature Communications.
  • [7] S. Bonnabel. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE.
  • [8] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.
  • [9] M. Charikar and V. Chatziafratis. 2017. Approximate hierarchical clustering via sparsest cut and spreading metrics. SODA.
  • [10] M. Charikar, V. Chatziafratis, and R. Niazadeh. 2019. Hierarchical Clustering better than Average-Linkage. SODA.
  • [11] M. Charikar, V. Chatziafratis, R. Niazadeh, and G. Yaroslavtsev. 2019. Hierarchical
  • [12] K. Clark and C. D. Manning. 2016. Improving coreference resolution by learning entity-level distributed representations. ACL.
  • [13] V. Cohen-Addad, V. Kanade, and F. Mallmann-Trenn. 2017. Hierarchical clustering beyond the worst-case. NeurIPS.
  • [14] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. 2018. Hierarchical clustering: Objective functions and algorithms. SODA.
  • [15] A. Culotta, P. Kanani, R. Hall, M. Wick, and A. McCallum. 2007. Author disambiguation using error-driven machine learning with a ranking loss function. IIWeb.
  • [16] S. Dasgupta. 2016. A cost function for similarity-based hierarchical clustering. STOC.
  • [17] E. Emamjomeh-Zadeh and D. Kempe. 2018. Adaptive hierarchical clustering using ordinal queries. SODA.
  • [18] H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, and C. Sohler. 2013. BICO: BIRCH meets coresets for k-means clustering. ESA.
  • [19] O.-E. Ganea, G. Bécigneul, and T. Hofmann. 2018. Hyperbolic Entailment Cones for Learning Hierarchical Embeddings. ICML.
  • [20] P. Goyal, Z. Hu, X. Liang, C. Wang, and E. P. Xing. 2017. Nonparametric variational auto-encoders for hierarchical representation learning. ICCV.
  • [21] V. Guillemin and A. Pollack. 2010. Differential Topology.
  • [22] K. A. Heller and Z. Ghahramani. 2005. Bayesian hierarchical clustering. ICML.
  • [23] J. Himberg, A. Hyvärinen, and F. Esposito. 2004. Validating the independent components of neuroimaging time series via clustering and visualization. NeuroImage.
  • [24] Y. Jernite, A. Choromanska, and D. Sontag. 2017. Simultaneous learning of trees and representations for extreme classification and density estimation. ICML.
  • [25] D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. ICLR.
  • [26] R. Kleinberg. 2007. Geographic routing using hyperbolic space. INFOCOM.
  • [27] D. A. Knowles and Z. Ghahramani. 2011. Pitman-Yor diffusion trees. UAI.
  • [28] A. Kobren, N. Monath, A. Krishnamurthy, and A. McCallum. 2017. A hierarchical algorithm for extreme clustering. KDD.
  • [29] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá. 2010. Hyperbolic geometry of complex networks. Physical Review E.
  • [30] A. Krishnamurthy, S. Balakrishnan, M. Xu, and A. Singh. 2012. Efficient active algorithms for hierarchical clustering. ICML.
  • [31] J. Lamping and R. Rao. 1994. Laying out and visualizing large trees using a hyperbolic space. UIST.
  • [32] H. Lee, M. Recasens, A. Chang, M. Surdeanu, and D. Jurafsky. 2012. Joint entity and event coreference resolution across documents. EMNLP.
  • [33] K. Lee, L. He, M. Lewis, and L. Zettlemoyer. 2017. End-to-end neural coreference resolution. EMNLP.
  • [34] B. Moseley and J. Wang. 2017. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search. NeurIPS.
  • [35] M. Nickel and D. Kiela. 2017. Poincaré embeddings for learning hierarchical representations. NeurIPS.
  • [36] J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global Vectors for Word Representation. EMNLP.
  • [37] L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. ACL.
  • [38] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. UAI.
  • [39] S. Roy and S. Pokutta. 2016. Hierarchical clustering via spreading metrics. NeurIPS.
  • [40] F. Sala, C. De Sa, A. Gu, and C. Ré. 2018. Representation tradeoffs for hyperbolic embeddings. ICML.
  • [41] R. Sarkar. 2011. Low distortion Delaunay embedding of trees in hyperbolic plane. GD.
  • [42] D. Sculley. 2010. Web-scale k-means clustering. WWW.
  • [43] J. Seo and B. Shneiderman. 2002. Interactively exploring hierarchical clustering results [gene identification]. Computer.
  • [44] T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. Van De Rijn, S. S. Jeffrey, et al. 2001. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. PNAS.
  • [45] M. Spivak. 1979. A Comprehensive Introduction to Differential Geometry. Publish or Perish.
  • [46] E. Strubell, P. Verga, D. Andor, D. Weiss, and A. McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. EMNLP.
  • [47] A. Tifrea, G. Bécigneul, and O.-E. Ganea. 2019. Poincaré GloVe: Hyperbolic Word Embeddings. ICLR.
  • [48] T. D. Q. Vinh, Y. Tay, S. Zhang, G. Cong, and X.-L. Li. 2018. Hyperbolic Recommender Systems. arXiv.
  • [49] D. Wang and Y. Wang. 2018. An Improved Cost Function for Hierarchical Cluster Trees. arXiv.
  • [50] M. Wick, S. Singh, and A. McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. ACL.
  • [51] D. H. Widyantoro, T. R. Ioerger, and J. Yen. 2002. An incremental approach to building a cluster hierarchy. ICDM.
  • [52] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese. 2018. Taskonomy: Disentangling task transfer learning. CVPR.
  • [53] H. Zhang, S. J. Reddi, and S. Sra. 2016. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. NeurIPS.
  • [54] T. Zhang, R. Ramakrishnan, and M. Livny. 1996. BIRCH: A new data clustering algorithm and its applications. SIGMOD.
  • [55] Y. Zhang, A. Ahmed, V. Josifovski, and A. Smola. 2014. Taxonomy discovery for personalized recommendation. ICDM.
  • [56] Y. Zhang and D.-Y. Yeung. 2010. A Convex Formulation for Learning Task Relationships in Multi-Task Learning. UAI.
  • [57] Y. Zhao and G. Karypis. 2002. Evaluation of hierarchical clustering algorithms for document datasets. CIKM.