Feature: Big Data


Today’s applications often involve datasets that are too big to fit in a single computer’s main memory. Analyzing these massive datasets requires scalable and sophisticated machine-learning methods. Two commonly used approaches are stochastic optimization and inference algorithms,[1] which process one data point at a time, and distributed computing based on the MapReduce framework,[2] where the computation proceeds in iterations, with a master processor distributing work to slaves at each iteration. Although stochastic optimization and inference algorithms are effective for large-scale machine learning, they are inherently sequential. MapReduce-based algorithms, on the other hand, suffer from the curse of the last reducer: the slaves must wait for the slowest processor to finish before moving on to the next computational iteration.

In this article, we describe NOMAD, a novel nomadic framework that combines the advantages of stochastic optimization and distributed computing without incurring their drawbacks. NOMAD is an acronym for Nonlocking, stOchastic Multimachine framework for Asynchronous and Decentralized computation. We show that many modern machine-learning problems have a double separability property, meaning the objective function decomposes into a sum over two different sets of variables. We use two concrete problems to illustrate our framework: matrix completion for recommender systems and latent Dirichlet allocation for topic modeling.
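To make the double separability property concrete, the following is a minimal sketch of matrix completion with squared loss: the objective is a sum over observed (user, item) entries, and each term touches only one user-factor row and one item-factor row. All names here (`W`, `H`, `ratings`, `sgd_step`) and the toy data are illustrative assumptions, not code from the article; the single-entry update simply shows why such objectives suit the one-point-at-a-time stochastic processing described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2

# Observed entries of a partially known ratings matrix: (user i, item j, value).
ratings = [(0, 1, 3.0), (1, 4, 5.0), (2, 0, 1.0), (3, 2, 4.0), (0, 3, 2.0)]

W = rng.normal(scale=0.1, size=(n_users, k))   # user factors w_i
H = rng.normal(scale=0.1, size=(n_items, k))   # item factors h_j

def objective(W, H):
    # Double separability: the loss is a sum over two index sets
    # (users i and items j); each term depends only on w_i and h_j.
    return sum((a - W[i] @ H[j]) ** 2 for i, j, a in ratings)

def sgd_step(W, H, i, j, a, lr=0.05):
    # Stochastic update on a single observed entry (i, j): only rows
    # w_i and h_j change, so workers updating disjoint rows need not
    # coordinate -- the structure NOMAD-style asynchrony exploits.
    err = a - W[i] @ H[j]
    w_old = W[i].copy()
    W[i] += lr * err * H[j]
    H[j] += lr * err * w_old

before = objective(W, H)
for _ in range(200):                 # a few passes over the observed entries
    for i, j, a in ratings:
        sgd_step(W, H, i, j, a)
after = objective(W, H)
```

Because each stochastic step writes only two factor rows, entries whose rows are disjoint can be processed concurrently; the locality of the update, not the toy data, is the point of the sketch.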