2016: DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning

Tags: pr

  • This was published in
  • Communications of the ACM
  • https://cacm.acm.org/

  • Reference:
Chen Y., Chen T., Xu Z., et al.
DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning[J].
Communications of the ACM, 2016, 59(11): 105-112
  • https://cacm.acm.org/magazines/2016/11/209123-diannao-family/fulltext
  • And sure enough, there it is

  • Damn right I downloaded it. How cool is that?
The original version of this paper is entitled "DianNao: A Small-Footprint, High-Throughput Accelerator for Ubiquitous Machine Learning" and was published in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 49, 4 (March 2014), ACM, New York, NY, 269-284.

Abstract

  • ML is pervasive
    • broad range of applications
    • broad range of systems (from embedded devices to data centers)

  • computing
    • moving toward heterogeneous multi-cores
    • a mix of cores and hardware accelerators
  • designing hardware accelerators for ML
    • can achieve both high efficiency and broad application scope

Paragraph 2

  • efficient computational primitives
    • important for a hardware accelerator,
  • inefficient memory transfers can
    • potentially void the throughput, energy, or cost advantages of accelerators,
  • an Amdahl's law effect
  • memory should become a first-order concern,

  • as in processors,
    • rather than an element factored into accelerator design as a second step

  • a series of hardware accelerators
    • designed for ML (NNs),
    • studying the impact of memory on accelerator design, performance, and energy.

  • on representative neural network layers:
  • speedup of 450.65x over a GPU
  • energy reduced by 150.31x on average
    • for a 64-chip DaDianNao system (a member of the DianNao family)

1 INTRODUCTION

  • designing hardware accelerators which realize the best possible tradeoff between flexibility and efficiency is becoming a prominent
    issue.

  • The first question is: for which category of applications should one primarily design accelerators?
  • Together with the architecture trend towards accelerators, a second simultaneous and significant trend in high-performance and embedded applications is developing: many of the emerging high-performance and embedded applications, from image/video/audio recognition to automatic translation, business analytics, and robotics rely on machine learning
    techniques.
  • This trend in applications comes together with a third trend in machine learning (ML) where a small number of techniques, based on neural networks (especially deep learning techniques [16, 26]), have been proved in the past few years to be state-of-the-art across a broad range of applications.
  • As a result, there is a unique opportunity to design accelerators having significant application scope as well as high performance and efficiency [4].

Paragraph 2

  • Currently, ML workloads are
  • mostly executed on
    • multicores using SIMD [44]
    • on GPUs [7]
    • or on FPGAs [2]

  • the aforementioned trends
    • have already been identified
    • by researchers who have proposed accelerators implementing:
  • CNNs [2]
  • Multi-Layer Perceptrons [43]

  • accelerators focusing on other domains,
    • such as image processing,
    • also propose efficient implementations of some of the computational primitives used
    • by machine-learning techniques, such as convolutions [37]

  • There are also ASIC implementations of ML techniques
    • such as Support Vector Machines and CNNs.

  • these works focused on
    • efficiently implementing the computational primitives, either
      • ignoring memory transfers for the sake of simplicity [37, 43]
      • or plugging their computational accelerator into memory via a more or less sophisticated DMA [2, 12, 19]

Paragraph 3

  • While efficient implementation of computational primitives is a first and important step with promising results, inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators, that is, an Amdahl's law effect, and thus, they should become a first-order concern, just like in processors, rather than an element factored into accelerator design as a second step.

  • Unlike in processors though, one can factor in the specific nature of
    memory transfers in target algorithms, just like it is done for accelerating computations.

  • This is especially important in the domain of ML where there is a clear trend towards scaling up the size of learning models in order to achieve better accuracy and more functionality [16, 24].

Paragraph 4

  • In this article, we introduce a series of hardware accelerators designed for ML (especially neural networks), including
    DianNao, DaDianNao, ShiDianNao, and PuDianNao as listed in Table 1.
  • We focus our study on memory usage, and we investigate the accelerator architecture to minimize memory
    transfers and to perform them as efficiently as possible.

2 DIANNAO: A NN ACCELERATOR

  • DianNao
    • the first of the DianNao accelerator family,
  • accommodates state-of-the-art NN techniques (deep learning),
  • and inherits the broad application scope of NNs.

2.1 Architecture

  • DianNao
    • an input buffer for input neurons (NBin)
    • an output buffer for output neurons (NBout)
    • a buffer for synaptic weights (SB)
    • connected to a computational block (performing both synapse and neuron computations)
    • the NFU (Neural Functional Unit), plus a control processor (CP); see Figure 1

NBin holds the input neurons,
SB holds the synaptic weights,
and NBout holds the output neurons.

My reading of the figure: 2 input neurons and 2 synapses; multiply the corresponding pairs and sum, and the output is 1 neuron. But this NFU is impressive: it can compute two output neurons in one go.

NFU

  • a functional block of $T_i$ inputs/synapses
    • and $T_n$ output neurons,
  • time-shared by different algorithmic blocks of neurons.

So the NFU operates on $T_i$ inputs and synapses and yields $T_n$ output neurons. But shouldn't there be $T_i \times T_n$ synapses?? (Presumably yes: SB supplies one row of $T_i$ weights per output neuron, i.e., $T_n \times T_i$ weights per cycle.)

  • Depending on the layer type,
    • computations at the NFU can be decomposed into either two or three stages

  • For classifier and convolutional layers:
    • multiplication of synapses $\times$ inputs: NFU-1
    • additions of all multiplications: NFU-2
    • sigmoid: NFU-3

If it is a classifier or convolutional layer, it is simply synapses $\times$ inputs, summed up, then a sigmoid. That I can understand: this case is just a convolution.
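As a rough software analogy of the three stages (my sketch, not the paper's code: the real NFU uses 16-bit fixed-point arithmetic and stages partial sums in NBout, applying NFU-3 only after all input tiles have been accumulated; float and an immediate sigmoid are simplifications here):

#include <math.h>
#include <stdio.h>

#define Ti 16   /* inputs/synapses consumed per cycle  */
#define Tn 16   /* output neurons produced in parallel */

/* NFU-3: the nonlinearity (the hardware interpolates it; see below) */
static float nfu3_sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

/* One NFU "cycle" for a classifier/convolutional layer:
 * Ti inputs are broadcast to Tn output neurons, each of which has
 * its own row of Ti weights (hence Tn x Ti synapses per cycle). */
static void nfu_cycle(const float in[Ti], const float w[Tn][Ti], float out[Tn]) {
    for (int n = 0; n < Tn; n++) {
        float prod[Ti];
        for (int i = 0; i < Ti; i++)      /* NFU-1: Ti parallel multipliers */
            prod[i] = w[n][i] * in[i];
        float sum = 0.0f;
        for (int i = 0; i < Ti; i++)      /* NFU-2: adder tree reduction */
            sum += prod[i];
        out[n] = nfu3_sigmoid(sum);       /* NFU-3: nonlinearity */
    }
}

int main(void) {
    float in[Ti], w[Tn][Ti], out[Tn];
    for (int i = 0; i < Ti; i++) in[i] = 0.1f;
    for (int n = 0; n < Tn; n++)
        for (int i = 0; i < Ti; i++) w[n][i] = (n == 0) ? 1.0f : -1.0f;
    nfu_cycle(in, w, out);
    printf("out[0]=%.4f out[1]=%.4f\n", out[0], out[1]);
    return 0;
}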

If it is a classifier layer, then the input is ...

  • the last stage (sigmoid or another nonlinear function) can vary.

  • For pooling layers there is no multiplication (no synapses),
    • and pooling can be average or max.

  • the adders have multiple inputs,
    • so they are in fact adder trees,

  • the second stage also contains
    • shifters and max operators for pooling.

Why would shifters be needed?? (See the sketch below.)
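My guess at the answer (the excerpt does not say): average pooling divides the window sum by the window count, and when that count is a power of two the division reduces to an arithmetic right shift, which is far cheaper in hardware than a divider. A minimal fixed-point sketch; the 2x2 window is my example choice:

#include <stdint.h>
#include <stdio.h>

/* Average pooling of a 2x2 window in 16-bit fixed point:
 * the divide-by-4 becomes a right shift by 2. */
static int16_t avg_pool_2x2(int16_t a, int16_t b, int16_t c, int16_t d) {
    int32_t sum = (int32_t)a + b + c + d;  /* widen so the sum cannot overflow */
    return (int16_t)(sum >> 2);            /* shifter instead of a divider */
}

/* Max pooling needs only comparators (the "max operators"). */
static int16_t max_pool_2x2(int16_t a, int16_t b, int16_t c, int16_t d) {
    int16_t m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

int main(void) {
    printf("avg=%d max=%d\n", avg_pool_2x2(1, 2, 3, 6), max_pool_2x2(1, 2, 3, 6));
    return 0;
}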

  • the sigmoid function (for classifier and convolutional layers) can be efficiently implemented with piecewise linear interpolation: $f(x) = a_i x + b_i$ for $x \in [x_i, x_{i+1}]$ (16 segments are sufficient)
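A minimal sketch of that piecewise-linear scheme. The 16 segments come from the paper; the interpolation range [-8, 8], uniform segment boundaries, and endpoint-matching fit are my assumptions, since the excerpt does not give them:

#include <math.h>
#include <stdio.h>

#define NSEG 16                                /* 16 segments, as in the paper */
static const double XMIN = -8.0, XMAX = 8.0;   /* my assumed range             */
static double a[NSEG], b[NSEG];                /* f(x) = a[i]*x + b[i] on segment i */

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* Precompute slope/intercept so each segment interpolates the exact
 * sigmoid at its two endpoints (one reasonable fitting rule). */
static void build_table(void) {
    double step = (XMAX - XMIN) / NSEG;
    for (int i = 0; i < NSEG; i++) {
        double x0 = XMIN + i * step, x1 = x0 + step;
        a[i] = (sigmoid(x1) - sigmoid(x0)) / step;
        b[i] = sigmoid(x0) - a[i] * x0;
    }
}

static double sigmoid_pwl(double x) {
    if (x <= XMIN) return 0.0;                 /* saturate outside the range */
    if (x >= XMAX) return 1.0;
    int i = (int)((x - XMIN) * NSEG / (XMAX - XMIN));
    return a[i] * x + b[i];
}

int main(void) {
    build_table();
    for (double x = -6.0; x <= 6.0; x += 3.0)
        printf("x=%5.1f exact=%.4f pwl=%.4f\n", x, sigmoid(x), sigmoid_pwl(x));
    return 0;
}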

On-chip Storage

  • the on-chip storage structures of DianNao
    • can be construed as modified buffers or scratchpads.

  • While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative read, etc.) and cache conflicts.
  • The efficient alternative, scratchpad, is used in VLIW processors but it is known to be very difficult to compile for.
  • However, a scratchpad in a dedicated accelerator realizes the best of both worlds: efficient storage, and both efficient and easy exploitation of locality, because only a few algorithms have to be manually adapted.

Paragraph 2

  • on-chip storage is split into three structures (NBin, NBout, and SB), because there are three types of data (input neurons, output neurons, and synapses) with different characteristics (read width and reuse distance).

  • The first benefit of splitting storage structures is to tailor each SRAM to the appropriate read/write width,
  • and the second benefit is to avoid conflicts, as would occur in a cache.
  • Moreover, we implement three DMAs to exploit spatial locality of data, one for each buffer (two load DMAs for inputs, one store DMA for outputs).
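To make the split concrete, below is a toy model of the three structures with the parameters I recall from the ASPLOS'14 design (Ti = Tn = 16, 16-bit fixed point, 64 entries per buffer); treat the numbers as illustrative rather than authoritative:

#include <stdint.h>
#include <stdio.h>

#define Ti 16        /* inputs consumed per NFU cycle  */
#define Tn 16        /* outputs produced per NFU cycle */
#define ENTRIES 64

/* Each structure gets exactly the row width its consumer needs,
 * and each has its own DMA (two load DMAs, one store DMA). */
typedef struct { int16_t row[ENTRIES][Ti]; }     NBin;   /* input neurons  */
typedef struct { int16_t row[ENTRIES][Tn]; }     NBout;  /* output neurons */
typedef struct { int16_t row[ENTRIES][Tn][Ti]; } SB;     /* synaptic weights:
                                                            Tn x Ti per cycle */

int main(void) {
    printf("NBin  = %5zu B\n", sizeof(NBin));   /* 2 KB  */
    printf("NBout = %5zu B\n", sizeof(NBout));  /* 2 KB  */
    printf("SB    = %5zu B\n", sizeof(SB));     /* 32 KB */
    return 0;
}

Because SB must feed Tn x Ti multipliers every cycle, it is by far the widest and largest of the three, which is exactly why a single unified cache would serve all three access patterns poorly.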

2.2 Loop tiling

  • DianNao uses loop tiling to reduce memory accesses
    • so it can accommodate large neural networks
  • Example:
    • a classifier layer
      • $N_n$ output neurons
      • fully connected to $N_i$ inputs
      • as shown in the figure

With $N_n$ outputs and $N_i$ inputs, the synapse matrix is of size $N_n \times N_i$; multiplying this matrix by the $N_i$-element input vector gives the result.

  • First fetch one tile
    • I was a bit puzzled at first:
    • what if the first output element depends on all of the input elements?
    • how would you compute it then?
    • Actually, when I compute the first output,
    • I only need a single row of the synapse matrix!
    • but then what do you do with that huge synapse matrix?
  • Below are the original code and
    • the tiled code,
    • which map the classifier layer onto DianNao


for (int n = 0; n < Nn; n++)
    sum[n] = 0;
for (int n = 0; n < Nn; n++)        // output neurons
    for (int i = 0; i < Ni; i++)    // input neurons
        sum[n] += synapse[n][i] * neuron[i];
for (int n = 0; n < Nn; n++)
    neuron[n] = Sigmoid(sum[n]);
  • My idea:
    • fetch Tnn outputs at a time
    • and Tii inputs
    • but that is still too big for the hardware
    • so split again:
    • fetch Tn and Ti at a time
    • and that's it
for (int nnn = 0; nnn < Nn; nnn += Tnn) {
    // tiling for output neurons: this loop hands out Tnn outputs at a time
    for (int iii = 0; iii < Ni; iii += Tii) {
        // tiling for input neurons: this loop brings in Tii inputs at a time
        // now work on these two tiles
        for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
            // Tnn is still too large, so split it again into blocks of
            // size Tn; each block starts at nn
            for (int n = nn; n < nn + Tn; n++)
                sum[n] = 0;   // step 1: zero the partial sums
                              // (BUG: re-zeroed for every iii tile)
            // to get sum[n], multiply row n of synapse by all of neuron
            for (int ii = iii; ii < iii + Tii; ii += Ti)   // split Tii into Ti
                for (int n = nn; n < nn + Tn; n++)
                    for (int i = ii; i < ii + Ti; i++)
                        sum[n] += synapse[n][i] * neuron[i];
            for (int nn = nnn; nn < nnn + Tnn; nn += Tn)
                neuron[n] = sigmoid(sum[n]);  // BUG: n is out of scope here, and
                                              // the sigmoid fires before all input
                                              // tiles have been accumulated
        }
    }
}
  • In the tiled code, loops ii and nn
    • reflect that the NFU has $T_i$ inputs and synapses
      • and $T_n$ output neurons
  • input neurons are reused by every output neuron,
    • but the input vector is far too large
    • to fit into NBin,
    • so loop ii is also tiled, with factor $T_{ii}$

The code above is definitely buggy; the corrected version follows:

for (int nnn = 0; nnn < Nn; nnn += Tnn) {
    for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
        for (int n = nn; n < nn + Tn; n++)
            sum[n] = 0;
        // accumulate over ALL input tiles before applying the output stage
        for (int iii = 0; iii < Ni; iii += Tii)
            for (int ii = iii; ii < iii + Tii; ii += Ti)
                for (int n = nn; n < nn + Tn; n++)
                    for (int i = ii; i < ii + Ti; i++)
                        sum[n] += synapse[n][i] * neuron[i];
        for (int n = nn; n < nn + Tn; n++)
            printf("s%ds ", sum[n]);   // debug print in place of the sigmoid
    }
}
for (int index = 0; index < Nn; index++)
    printf("%d ", sum[index]);
