ParalOS: A Scheduling & Memory Management Framework for Heterogeneous VPUs.

Digital Systems Design (DSD), 2021

National Technical University of Athens

Abstract
Embedded systems today face the challenge of rapidly evolving application diversity, accompanied by increased programming and computational complexity. Customised heterogeneous System-on-Chip (SoC) processors emerge as an attractive HW solution in various application domains; however, they still require sophisticated SW development to provide efficient implementations, at the expense of slower adaptation to algorithmic changes. In this context, the current paper proposes a framework for accelerating the SW development of computationally intensive applications on Vision Processing Units (VPUs), while still enabling the exploitation of their full HW potential via low-level kernel optimisations. Our framework is tailored for heterogeneous architectures and integrates a dynamic task scheduler, a novel scratchpad memory management scheme, I/O & inter-process communication techniques, as well as a visual profiler. We evaluate our work on the Intel Movidius Myriad VPUs using synthetic benchmarks and real-world applications, ranging from Convolutional Neural Networks (CNNs) to computer vision algorithms. In terms of execution time, our results range from a limited ~8% performance overhead vs optimised CNN programs to a 4.2× performance gain in content-dependent applications. We achieve up to a 33% decrease in scratchpad memory usage vs well-established memory allocators and up to 6× smaller inter-process communication time.
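To make the architecture described in the abstract concrete, the sketch below shows how a dependency-driven task scheduler of this kind could be exposed to application code. It is a minimal illustration only: ParalOS's real interface is not reproduced here, every name (Task, Scheduler, submit, run) is a hypothetical stand-in, and the dispatch loop runs ready tasks inline instead of feeding physical VPU vector cores.

```cpp
// Hypothetical sketch of a dependency-driven task scheduler in the spirit of
// the framework described above. All names (Task, Scheduler, submit, run) are
// illustrative assumptions, not ParalOS's published API, and the dispatch loop
// runs tasks inline instead of feeding them to physical VPU vector cores.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

struct Task {
    int id;                        // assumed equal to submission order
    std::vector<int> deps;         // ids of tasks that must complete first
    std::function<void()> kernel;  // work that would be offloaded to a core
};

class Scheduler {
public:
    void submit(Task t) { tasks_.push_back(std::move(t)); }

    // Repeatedly execute any task whose dependencies are already satisfied.
    // No cycle detection: this is a sketch of dynamic, readiness-based
    // dispatch, not a production scheduler.
    void run() {
        std::vector<bool> done(tasks_.size(), false);
        std::size_t remaining = tasks_.size();
        while (remaining > 0) {
            for (std::size_t i = 0; i < tasks_.size(); ++i) {
                if (done[i]) continue;
                bool ready = true;
                for (int d : tasks_[i].deps)
                    if (!done[d]) { ready = false; break; }
                if (ready) {
                    tasks_[i].kernel();  // on real HW: dispatch to an idle core
                    done[i] = true;
                    --remaining;
                }
            }
        }
    }

private:
    std::vector<Task> tasks_;
};

int main() {
    Scheduler s;
    s.submit({0, {},  [] { std::puts("load tile"); }});
    s.submit({1, {0}, [] { std::puts("convolve tile"); }});
    s.submit({2, {1}, [] { std::puts("store tile"); }});
    s.run();
}
```

On the actual hardware, the kernel callbacks would be replaced by work dispatched to idle vector cores as they become free, which is where readiness-based dynamic scheduling pays off over a static assignment.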
Keywords
Vision Processing Unit, System on Chip, Myriad, Heterogeneous Computing, Framework, Scheduling, Scratchpad Memory Management

Key points: This paper proposes a framework named ParalOS to accelerate the software development of computationally intensive applications on heterogeneous Vision Processing Units (VPUs), while still allowing the hardware's full potential to be exploited through low-level kernel optimisations.

Methods: ParalOS targets heterogeneous architectures and integrates a dynamic task scheduler, a novel scratchpad memory management scheme, I/O and inter-process communication techniques, and a visual profiler.
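As a rough illustration of the scratchpad side of the methods above, the following sketch manages a fixed-size on-chip region with a simple bump allocator that is reset between pipeline stages. This is only a baseline for the problem the paper addresses, not its novel scheme; ScratchpadArena, alloc, reset and the 128 KiB CMX-like region are all assumed names and sizes.

```cpp
// A minimal sketch of scratchpad buffer management over a fixed-size on-chip
// region: a bump allocator with a whole-arena reset. Shown only to illustrate
// the problem space; ParalOS's actual memory management scheme is more
// elaborate, and every identifier here is hypothetical.
#include <cstddef>
#include <cstdint>
#include <cstdio>

class ScratchpadArena {
public:
    ScratchpadArena(uint8_t* base, std::size_t size)
        : base_(base), size_(size), used_(0) {}

    // Hand out aligned slices of the scratchpad; return nullptr when it is
    // full, which would force the caller to tile its working set.
    void* alloc(std::size_t bytes, std::size_t align = 16) {
        std::size_t offset = (used_ + align - 1) & ~(align - 1);
        if (offset + bytes > size_) return nullptr;
        used_ = offset + bytes;
        return base_ + offset;
    }

    // Release everything at once between pipeline stages.
    void reset() { used_ = 0; }

    std::size_t bytes_used() const { return used_; }

private:
    uint8_t*    base_;
    std::size_t size_;
    std::size_t used_;
};

int main() {
    static uint8_t cmx_mock[128 * 1024];      // stand-in for a 128 KiB on-chip slice
    ScratchpadArena arena(cmx_mock, sizeof(cmx_mock));

    void* in_tile  = arena.alloc(32 * 1024);  // input tile
    void* out_tile = arena.alloc(32 * 1024);  // output tile
    std::printf("used %zu bytes, in=%p out=%p\n",
                arena.bytes_used(), in_tile, out_tile);

    arena.reset();                            // reuse the region for the next stage
}
```

The paper's contribution lies precisely in doing better than such simple baselines, reporting up to 33% lower scratchpad usage than well-established allocators.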

Experiments: The authors evaluated ParalOS on Intel Movidius Myriad VPUs using synthetic benchmarks and real-world applications, including Convolutional Neural Networks (CNNs) and computer vision algorithms. In terms of execution time, the results show at most a limited ~8% overhead compared with optimised CNN programs and up to a 4.2× performance gain in content-dependent applications; scratchpad memory usage drops by up to 33% compared with well-established memory allocators, and inter-process communication time is up to 6× shorter.