Debiasing Sample Loadings and Scores in Exponential Family PCA for Sparse Count Data
arXiv (2023)
Abstract
Multivariate count data with many zeros frequently occur in a variety of
application areas such as text mining with a document-term matrix and cluster
analysis with microbiome abundance data. Exponential family PCA (Collins et
al., 2001) is a widely used dimension reduction tool to understand and capture
the underlying low-rank structure of count data. It produces principal
component scores by fitting Poisson regression models with estimated loadings
as covariates. For sparse count data, this tends to produce extreme scores that
deviate substantially from the true scores. We consider two major sources of bias
in this estimation procedure and propose ways to reduce their effects. First,
the discrepancy between true loadings and their estimates under a limited
sample size largely degrades the quality of score estimates. By treating
estimated loadings as covariates with bias and measurement errors, we debias
score estimates, using the iterative bootstrap method for loadings and
considering classical measurement error models. Second, the existence of MLE
bias is often ignored in score estimation, but this bias could be removed
through well-known MLE bias reduction methods. We demonstrate the effectiveness
of the proposed bias correction procedure through experiments on both simulated
data and real data.
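
As an illustration of the score-estimation step described in the abstract (Poisson regression with loadings as covariates), the following is a minimal sketch and not the authors' implementation. It assumes a log link, a known per-feature offset, and a Newton solver with step halving. The function name `poisson_scores`, the offset, and the simulated data are all hypothetical. No bias correction is applied here; the sketch only shows the uncorrected MLE that the proposed procedure starts from.

```python
import numpy as np

def poisson_scores(x, V, offset=None, n_iter=100, tol=1e-10):
    """Uncorrected MLE of scores z in x_j ~ Poisson(exp(offset_j + V[j] @ z)).

    x: (p,) counts for one sample; V: (p, k) loadings used as covariates.
    Solved by Newton's method with step halving (an assumed solver choice).
    """
    p, k = V.shape
    offset = np.zeros(p) if offset is None else offset

    def loglik(z):
        eta = offset + V @ z
        with np.errstate(over="ignore"):
            return x @ eta - np.exp(eta).sum()

    z = np.zeros(k)
    for _ in range(n_iter):
        mu = np.exp(offset + V @ z)
        grad = V.T @ (x - mu)              # score function
        hess = V.T @ (V * mu[:, None])     # Fisher information
        step = np.linalg.solve(hess, grad)
        # Halve the step until the log-likelihood does not decrease.
        t, ll = 1.0, loglik(z)
        while loglik(z + t * step) < ll and t > 1e-8:
            t /= 2.0
        z = z + t * step
        if np.max(np.abs(step)) < tol:
            break
    return z

# Hypothetical simulated example: 200 features, rank-2 structure.
rng = np.random.default_rng(0)
p, k = 200, 2
V_demo = rng.normal(scale=0.3, size=(p, k))   # stands in for estimated loadings
z_demo = np.array([1.0, -0.5])                # scores used to simulate counts
offset_demo = np.full(p, np.log(2.0))
x_demo = rng.poisson(np.exp(offset_demo + V_demo @ z_demo))
z_hat = poisson_scores(x_demo, V_demo, offset_demo)
```

In practice the loadings passed in would be estimates rather than the true values. The abstract identifies that discrepancy as one of the two sources of bias in the resulting scores, the other being the bias of the MLE itself.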