# Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space

pp. 714-722, 2019.

Abstract:

Hierarchical clustering is typically performed using algorithmic-based optimization searching over the discrete space of trees. While these optimization methods are often effective, their discreteness restricts them from many of the benefits of their continuous counterparts, such as scalable stochastic optimization and the joint optimizat...

Introduction

- Hierarchical clustering is a ubiquitous and often-used tool for data analysis [44, 57], visualization [23, 43] and mining of meaningful representations of data [8].
- Moseley and Wang [34] give a cost that is akin to Dasgupta’s [16] and is well approximated by hierarchical agglomerative clustering with average linkage, and Adams et al. [1] give an MCMC-based inference procedure for a nested stick-breaking objective.

Highlights

- We present an objective function that is differentiable with respect to our tree node embeddings and perform hierarchical clustering by optimizing this objective using stochastic gradient descent
- We examine the quality of the learned tree structures in two settings: one in which the input representation is fixed, over the ImageNet dataset (Figure 4), and one in which the input representation is jointly learned with the tree structure, as described in Section 5, using GloVe [36] (Figure 3)
- We presented a novel hierarchical clustering algorithm that uses gradient-based optimization and a continuous representation of trees in the Poincaré ball
- We showed its ability to perform hierarchical clustering on large scale data
- We hypothesize that our increased performance is due to the ability of gradient-based hyperbolic hierarchical clustering (gHHC) to update internal nodes in mini-batch fashion without making the incremental hard decisions that are difficult with a small number of internal nodes
- We showed how our model can be jointly optimized with multi-task regression
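
For readers unfamiliar with the Poincaré ball named in these highlights, the standard distance between two points u and v strictly inside the unit ball is arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The sketch below evaluates that formula in plain Python. The function name `poincare_distance` and the example points are illustrative assumptions and are not taken from the authors' code.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical example points: equal Euclidean steps cost more near the boundary.
origin = [0.0, 0.0]
print(poincare_distance(origin, [0.1, 0.0]))
print(poincare_distance(origin, [0.9, 0.0]))
```

The distance from the origin to a point at Euclidean radius r equals ln((1 + r) / (1 − r)), so it grows without bound as r approaches 1.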

Methods

- The authors examine the quality of the tree structures learned by the method in two settings: one in which the input representation is fixed, over the ImageNet dataset (Figure 4), and one in which the input representation is jointly learned with the tree structure, as described in Section 5, using GloVe [36] (Figure 3).
- In both cases, the authors see that the method produces meaningful structure.

Results

- The authors show that the method outperforms state-of-the-art approaches on a clustering task of ImageNet ILSVRC images [28] by 15 points of dendrogram purity.
- The authors find that a good initialization of the model improves performance, as is the case with all clustering methods.
- The authors hypothesize that the increased performance is due to gHHC’s ability to update internal nodes in mini-batch fashion without making the incremental hard decisions that are difficult with a small number of internal nodes.
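
Dendrogram purity, the metric quoted in these results, is commonly defined as follows: for every pair of points that share a ground-truth class, find their least common ancestor in the tree and measure what fraction of the leaves under that ancestor belong to the same class, then average over all such pairs. The sketch below computes it exactly on a toy tree. The nested-tuple tree encoding and the function name `dendrogram_purity` are assumptions made for illustration, and the quadratic pair enumeration is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations


def dendrogram_purity(tree, labels):
    """Average, over same-class leaf pairs, of the class purity at their LCA.

    `tree` is a nested tuple whose leaves are point ids (any non-tuple value);
    `labels` maps each point id to its ground-truth class.
    """
    total = 0.0
    n_pairs = 0

    def visit(node):
        nonlocal total, n_pairs
        if not isinstance(node, tuple):
            return [node]
        child_leaves = [visit(child) for child in node]
        leaves = [leaf for group in child_leaves for leaf in group]
        class_counts = Counter(labels[leaf] for leaf in leaves)
        # Pairs drawn from two different children have this node as their LCA.
        for left, right in combinations(child_leaves, 2):
            for a in left:
                for b in right:
                    if labels[a] == labels[b]:
                        total += class_counts[labels[a]] / len(leaves)
                        n_pairs += 1
        return leaves

    visit(tree)
    return total / n_pairs if n_pairs else 0.0


labels = {0: "a", 1: "a", 2: "b", 3: "b"}
print(dendrogram_purity(((0, 1), (2, 3)), labels))  # 1.0: each class has its own subtree
print(dendrogram_purity(((0, 2), (1, 3)), labels))  # 0.5: same-class pairs only meet at the root
```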

Conclusion

- The authors presented a novel hierarchical clustering algorithm that uses gradient-based optimization and a continuous representation of trees in the Poincaré ball.
- The authors showed its ability to perform hierarchical clustering on large scale data.
- The authors showed how the model can be jointly optimized with multi-task regression.
- The authors hope to explore gradient-based optimization of tree structures in deep latent-variable models.
- The authors thank Ari Kobren for his helpful discussions and work on earlier versions of gradient-based hierarchical clustering as well as.

Summary

## Objectives:

- The authors' objective is inspired by recent work on cost functions for hierarchical clustering [9–11, 13, 14, 16, 34, 49].
- The objective minimizes the expected distance between x_i and the embeddings of the nodes that are likely to be the least common ancestor of x_i and x_j, and increases the distance between x_i and the nodes that are likely to be the least common ancestor of x_i, x_j, and x_k.
- The objective is amenable to stochastic gradient descent with respect to the node embeddings Z by sampling triples of points (x_i, x_j, x_k).
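
The triple sampling that drives the stochastic updates can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes triples are drawn uniformly at random and then reordered so that (i, j) is the most similar pair and k is the odd one out, which matches the roles of x_i, x_j, and x_k described in the objective. The names `sample_ordered_triples` and `neg_sq_dist` are hypothetical, and the loss itself is not reproduced here.

```python
import random


def sample_ordered_triples(points, similarity, n_triples, seed=0):
    """Sample index triples (i, j, k) where (i, j) is the most similar pair."""
    rng = random.Random(seed)
    triples = []
    for _ in range(n_triples):
        a, b, c = rng.sample(range(len(points)), 3)
        # Try each choice of "odd one out" and keep the best-scoring pair.
        candidates = [(a, b, c), (a, c, b), (b, c, a)]
        best = max(candidates, key=lambda t: similarity(points[t[0]], points[t[1]]))
        triples.append(best)
    return triples


def neg_sq_dist(x, y):
    """Assumed similarity for the example: negative squared Euclidean distance."""
    return -sum((p - q) ** 2 for p, q in zip(x, y))


# Hypothetical toy data: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
batch = sample_ordered_triples(points, neg_sq_dist, n_triples=4)
print(batch)
```

Each triple in `batch` would then contribute one term to a mini-batch gradient step on the node embeddings.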

- Table 1: Dendrogram purity. Results for competing approaches are from [28], using each algorithm’s optimal setting. Bold indicates the best performing method. On small-scale problems (first three datasets) HAC performs very well. As the number of ground truth clusters, dimensionality, and data points increases, our algorithm outperforms state-of-the-art methods.
- Table 2: Results over the school dataset.

Related work

- Hierarchical clustering is a widely studied problem in theory, in machine learning, and in applications. Apart from the work on Dasgupta’s cost and related costs [49], there has been much work on probabilistic models that have been used to describe the quality of hierarchical clusterings. Much of this work uses Bayesian nonparametric models to describe tree structures [1]. There has also been some work using discriminative graphical models to measure the quality of a clustering [50]. These cost functions come with their own inductive biases, and optimizing them with techniques similar to those in this paper could be interesting future work. Gradient-based methods are prevalent in flat clustering, such as stochastic and mini-batch k-means [42].
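
To make the flat-clustering comparison concrete, here is a minimal sketch of a mini-batch k-means update in the spirit of [42]: each sampled point pulls its nearest center toward itself with a per-center step size of 1/count. The function name, the initialization from randomly chosen data points, and the toy data are assumptions made for illustration.

```python
import random


def minibatch_kmeans(points, k, batch_size, n_iters, seed=0):
    """Flat clustering with per-center step sizes that decay as 1/count."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k

    def nearest(x):
        return min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
        )

    for _ in range(n_iters):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Cache assignments for the whole batch before moving any center.
        assignments = [nearest(x) for x in batch]
        for x, c in zip(batch, assignments):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center learning rate
            centers[c] = [(1.0 - eta) * m + eta * a for m, a in zip(centers[c], x)]
    return centers


# Hypothetical toy data: two small groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (4.0, 4.1), (4.2, 4.0), (4.1, 4.2)]
centers = minibatch_kmeans(points, k=2, batch_size=3, n_iters=50)
print(centers)
```

Every update is a convex combination of the old center and a data point, so the centers always stay within the range of the data.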

Funding

- This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by the National Science Foundation under Grant No. NSF-1763618.

Reference

- [1] R. P. Adams, Z. Ghahramani, and M. I. Jordan. 2010. Tree-Structured Stick Breaking for Hierarchical Data.
- [2] O. Bachem, M. Lucic, H. Hassani, and A. Krause. 2016. Fast and provably good seedings for k-means. NeurIPS.
- [3] M.-F. Balcan, A. Blum, and S. Vempala. 2008. A discriminative framework for clustering via similarity functions. STOC.
- [4] J. Bingham and S. Sudarsanam. 2000. Visualizing large hierarchical clusters in hyperbolic space. Bioinformatics.
- [5] C. Blundell, Y. W. Teh, and K. A. Heller. 2010. Bayesian rose trees. UAI.
- [6] M. Boguná, F. Papadopoulos, and D. Krioukov. 2010. Sustaining the internet with hyperbolic mapping. Nature Communications.
- [7] S. Bonnabel. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE
- [8] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.
- [9] M. Charikar and V. Chatziafratis. 2017. Approximate hierarchical clustering via sparsest cut and spreading metrics. SODA.
- [10] M. Charikar, V. Chatziafratis, and R. Niazadeh. 2019. Hierarchical Clustering better than Average-Linkage. SODA.
- [11] M. Charikar, V. Chatziafratis, R. Niazadeh, and G. Yaroslavtsev. 2019. Hierarchical
- [12] K. Clark and C. D. Manning. 2016. Improving coreference resolution by learning entity-level distributed representations. ACL.
- [13] V. Cohen-Addad, V. Kanade, and F. Mallmann-Trenn. 2017. Hierarchical clustering beyond the worst-case. NeurIPS.
- [14] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. 2018. Hierarchical clustering: Objective functions and algorithms. SODA.
- [15] A. Culotta, P. Kanani, R. Hall, M. Wick, and A. McCallum. 2007. Author disambiguation using error-driven machine learning with a ranking loss function. IIWeb.
- [16] S. Dasgupta. 2016. A cost function for similarity-based hierarchical clustering. STOC.
- [17] E. Emamjomeh-Zadeh and D. Kempe. 2018. Adaptive hierarchical clustering using ordinal queries. SODA.
- [18] H. Fichtenberger, M. Gillé, M. Schmidt, V. Schwiegelshohn, and C. Sohler. 2013. BICO: BIRCH meets coresets for k-means clustering. ESA.
- [19] O.-E. Ganea, G. Bécigneul, and T. Hofmann. 2018. Hyperbolic Entailment Cones for Learning Hierarchical Embeddings. ICML.
- [20] P. Goyal, Z. Hu, X. Liang, C. Wang, and E. P. Xing. 2017. Nonparametric variational auto-encoders for hierarchical representation learning. ICCV.
- [21] V. Guillemin and A. Pollack. 2010. Differential topology.
- [22] K. A. Heller and Z. Ghahramani. 2005. Bayesian hierarchical clustering. ICML.
- [23] J. Himberg, A. Hyvärinen, and F. Esposito. 2004. Validating the independent components of neuroimaging time series via clustering and visualization. Neuroimage.
- [24] Y. Jernite, A. Choromanska, and D. Sontag. 2017. Simultaneous learning of trees and representations for extreme classification and density estimation. ICML.
- [25] D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. ICLR.
- [26] R. Kleinberg. 2007. Geographic routing using hyperbolic space. INFOCOM.
- [27] D. A. Knowles and Z. Ghahramani. 2011. Pitman-Yor diffusion trees. UAI.
- [28] A. Kobren, N. Monath, A. Krishnamurthy, and A. McCallum. 2017. A hierarchical algorithm for extreme clustering. KDD.
- [29] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná. 2010. Hyperbolic geometry of complex networks. Physical Review E.
- [30] A. Krishnamurthy, S. Balakrishnan, M. Xu, and A. Singh. 2012. Efficient active algorithms for hierarchical clustering. ICML.
- [31] J. Lamping and R. Rao. 1994. Laying out and visualizing large trees using a hyperbolic space. UIST.
- [32] H. Lee, M. Recasens, A. Chang, M. Surdeanu, and D. Jurafsky. 2012. Joint entity and event coreference resolution across documents. EMNLP.
- [33] K. Lee, L. He, M. Lewis, and L. Zettlemoyer. 2017. End-to-end neural coreference resolution. EMNLP.
- [34] B. Moseley and J. Wang. 2017. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search. NeurIPS.
- [35] M. Nickel and D. Kiela. 2017. Poincaré embeddings for learning hierarchical representations. NeurIPS.
- [36] J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global Vectors for Word Representation. EMNLP.
- [37] L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. ACL.
- [38] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. UAI.
- [39] S. Roy and S. Pokutta. 2016. Hierarchical clustering via spreading metrics. NeurIPS.
- [40] F. Sala, C. De Sa, A. Gu, and C. Ré. 2018. Representation tradeoffs for hyperbolic embeddings. ICML.
- [41] R. Sarkar. 2011. Low distortion delaunay embedding of trees in hyperbolic plane. GD.
- [42] D. Sculley. 2010. Web-scale k-means clustering. WWW.
- [43] J. Seo and B. Shneiderman. 2002. Interactively exploring hierarchical clustering results [gene identification]. Computer.
- [44] T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. Van De Rijn, S. S. Jeffrey, et al. 2001. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. PNAS.
- [45] M. Spivak. 1979. A comprehensive introduction to differential geometry, Publish or Perish.
- [46] E. Strubell, P. Verga, D. Andor, D. Weiss, and A. McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. EMNLP.
- [47] A. Tifrea, G. Bécigneul, and O.-E. Ganea. 2019. Poincaré GloVe: Hyperbolic Word Embeddings. ICLR.
- [48] T. D. Q. Vinh, Y. Tay, S. Zhang, G. Cong, and X.-L. Li. 2018. Hyperbolic Recommender Systems. arXiv.
- [49] D. Wang and Y. Wang. 2018. An Improved Cost Function for Hierarchical Cluster Trees. arXiv.
- [50] M. Wick, S. Singh, and A. McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. ACL.
- [51] D. H Widyantoro, T. R Ioerger, and J. Yen. 2002. An incremental approach to building a cluster hierarchy. ICDM.
- [52] A. R Zamir, A. Sax, W. Shen, L. J Guibas, J. Malik, and S. Savarese. 2018. Taskonomy: Disentangling task transfer learning. CVPR.
- [53] H. Zhang, S. J Reddi, and S. Sra. 2016. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. NeurIPS.
- [54] T. Zhang, R. Ramakrishnan, and M. Livny. 1996. BIRCH: A new data clustering algorithm and its applications. SIGMOD.
- [55] Y. Zhang, A. Ahmed, V. Josifovski, and A. Smola. 2014. Taxonomy discovery for personalized recommendation. ICDM.
- [56] Y. Zhang and D.-Y. Yeung. 2010. A Convex Formulation for Learning Task Relationships in Multi-Task Learning. UAI.
- [57] Y. Zhao and G. Karypis. 2002. Evaluation of hierarchical clustering algorithms for document datasets. CIKM.
