A general constraint-centric scheduling framework for spatial architectures

PLDI, no. 6 (2013): 495-506

Scheduling is a fundamental problem for spatial architectures, which are increasingly used to address energy efficiency.

Abstract:

Specialized execution using spatial architectures provides energy-efficient computation, but requires effective algorithms for spatially scheduling the computation. Generally, this has been solved with architecture-specific heuristics, an approach which suffers from poor compiler/architect productivity, lack of insight on optimality, and ...

Introduction
  • The fundamental insight of many specialization techniques is to “map” large regions of computation to the hardware, breaking away from instruction-by-instruction pipelined execution and instead adopting a spatial architecture paradigm.
Highlights
  • Hardware specialization has emerged as an important way to sustain microprocessor performance improvements and to address transistor energy efficiency challenges and general purpose processing’s inefficiencies [6, 8, 19, 28].
  • We describe the Integer Linear Programming (ILP) constraints that pertain to each scheduler responsibility, and show a diagram capturing each responsibility pictorially for our running example in Figure 1.
  • Considering the ILP constraint formulation for the general framework, our GAMS implementation is around lines of code
  • Scheduling is a fundamental problem for spatial architectures, which are increasingly used to address energy efficiency.
  • Compared to the architecture-specific schedulers, which are the current state-of-the-art, this paper provides a general formulation of spatial scheduling as a constraint-solving problem.
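
To make the idea of scheduling as a constraint-solving problem concrete, the following is a minimal sketch and not the paper's formulation. It assumes a hypothetical three-operation dataflow graph and a 2x2 grid of functional units (FUs), uses 0/1 variables `x[(op, fu)]` for computation placement, enforces a one-operation-per-FU utilization limit, and minimizes a Manhattan-distance stand-in for data routing cost. Event timing is left out. The paper uses GAMS with a standard ILP solver; this sketch swaps in exhaustive enumeration so that it needs only the Python standard library. All names and numbers are illustrative assumptions.

```python
from itertools import product

# Hypothetical toy instance (not taken from the paper).
OPS = ["load", "mul", "add"]
EDGES = [("load", "mul"), ("mul", "add")]  # producer -> consumer
FUS = [(0, 0), (0, 1), (1, 0), (1, 1)]     # grid coordinates of the FUs


def hops(a, b):
    """Manhattan distance between two FUs, an assumed proxy for routing cost."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def is_feasible(x):
    """Check the 0/1 constraints on the placement variables x[(op, fu)]."""
    # Placement: each operation is mapped to exactly one FU.
    if any(sum(x[op, fu] for fu in FUS) != 1 for op in OPS):
        return False
    # Utilization: each FU hosts at most one operation.
    return all(sum(x[op, fu] for op in OPS) <= 1 for fu in FUS)


def routing_cost(x):
    """Objective: total hop count over all dataflow edges."""
    placed = {op: fu for (op, fu), v in x.items() if v == 1}
    return sum(hops(placed[u], placed[v]) for u, v in EDGES)


def solve():
    """Enumerate every 0/1 assignment and keep the cheapest feasible one."""
    keys = [(op, fu) for op in OPS for fu in FUS]
    best = None
    for bits in product((0, 1), repeat=len(keys)):
        x = dict(zip(keys, bits))
        if not is_feasible(x):
            continue
        cost = routing_cost(x)
        if best is None or cost < best[0]:
            placement = {op: fu for (op, fu), v in x.items() if v == 1}
            best = (cost, placement)
    return best


if __name__ == "__main__":
    cost, placement = solve()
    print("routing cost:", cost)
    for op in OPS:
        print(op, "->", placement[op])
```

Enumeration is only workable here because the instance has 12 binary variables. The same constraints and objective are linear in `x`, which is what would let an ILP solver handle instances of realistic size.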
Results
  • Is this ILP-based approach implementable? Yes, it is possible to express the scheduling problem as an ILP problem and implement it for real architectures.
  • Considering the ILP constraint formulation for the general framework, the GAMS implementation is around lines of code
Conclusion
  • Compared to the architecture-specific schedulers, which are the current state-of-the-art, this paper provides a general formulation of spatial scheduling as a constraint-solving problem.
  • The authors applied this formulation to three diverse architectures, ran them on a standard ILP solver, and demonstrated such a general scheduler outperforms or matches the respective specialized schedulers.
  • The authors discuss the possibility of improving the scheduling time through algorithmic specialization, and how the scheduler delivers on its promises of compiler-developer productivity/extensibility, cross-architecture applicability, and insights on optimality.
Tables
  • Table 1: Related work. Legend: i) computation placement, ii) data routing, iii) event timing, iv) utilization, v) optimization objective
  • Table 2: Relationship between architectural primitives and scheduler responsibilities
  • Table 3: Summary of formal notation used
  • Table 4: Description of ILP model implementation for PLUG
  • Table 5: Tools and methodology for quantitative evaluation
  • Table 6: Benchmark characteristics and ILP scheduler behavior
  • Table 7: Feasibility concerns
  • Table 8: Applicability to other Spatial Architectures
Reference
  • [1] Trips toolchain, http://www.cs.utexas.edu/trips/dist/.
  • [2] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools.
  • [3] S. Amarasinghe, D. R. Karger, W. Lee, and V. S. Mirrokni. A theoretical and practical approach to instruction scheduling on spatial architectures. Technical report, MIT, 2002.
  • [4] S. Amellal and B. Kaminska. Functional synthesis of digital systems with tass. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 13(5):537–552, 1994.
  • [5] C. Ancourt and F. Irigoin. Scanning polyhedra with do loops. In
  • [6] O. Azizi, A. Mahesri, B. C. Lee, S. J. Patel, and M. Horowitz. Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. In ISCA 2010.
  • [7] S. S. Battacharyya, E. A. Lee, and P. K. Murthy. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
  • [8] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 54(5):67–77, 2011.
  • [9] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the IEEE Computer, 37(7):44–55, 2004.
  • [10] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In MICRO 2004.
  • [11] J. Cong, K. Gururaj, G. Han, and W. Jiang. Synthesis algorithm for application-specific homogeneous processor networks. IEEE Trans. Very Large Scale Integr. Syst., 17(9), Sept. 2009.
  • [12] K. Coons, X. Chen, S. Kushwaha, K. S. McKinley, and D. Burger.
  • [13] L. De Carli, Y. Pan, A. Kumar, C. Estan, and K. Sankaralingam. Plug: Flexible lookup modules for rapid deployment of new protocols in high-speed routers. In SIGCOMM 2009.
  • [14] L. de Moura and N. Bjørner. Z3: An efficient SMT solver. In TACAS, 2008.
  • [15] A. Deb, J. M. Codina, and A. Gonzales. Softhv: A hw/sw co-designed processor with horizontal and vertical fusion. In International Conference on Computing Frontiers 2011.
  • [16] A. E. Eichenberger and E. S. Davidson. Efficient formulation for optimal modulo schedulers. In PLDI 1997.
  • [17] J. R. Ellis. Bulldog: a compiler for vliw architectures. PhD thesis, 1985.
  • [18] D. W. Engels, J. Feldman, D. R. Karger, and M. Ruhl. Parallel processor scheduling with delay constraints. In SODA 2001.
  • [19] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and
  • [20] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for general-purpose approximate programs. In MICRO 2012.
  • [21] K. Fan, H. h. Park, M. Kudlur, and S. o. Mahlke. Modulo scheduling for highly customized datapaths to increase hardware reusability. In International Journal of Parallel Programming, 21:313–347, 1992.
  • [23] M. Gebhart, B. A. Maher, K. E. Coons, J. Diamond, P. Gratz, M. Marino, N. Ranganathan, B. Robatmili, A. Smith, J. Burrill, S. W. Keckler, D. Burger, and K. S. McKinley. An evaluation of the trips computer system. In ASPLOS 2009.
  • [24] G. J. Gordon, S. A. Hong, and M. Dudík. First-order mixed integer linear programming. In UAI 2009.
  • [25] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim. Dyser: Unifying functionality and parallelism specialization for energy efficient computing. IEEE Micro, 33(5), 2012.
  • [26] V. Govindaraju, C.-H. Ho, and K. Sankaralingam. Dynamically specialized datapaths for energy efficient computing. In HPCA 2011.
  • [27] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO 2011.
  • [28] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, 2011.
  • [29] J. N. Hooker. Logic, optimization and constraint programming. INFORMS Journal on Computing, 14:295–321, 2002.
  • [30] J. N. Hooker and M. A. Osorio. Mixed logical-linear programming. Discrete Appl. Math., 96-97(1), Oct. 1999.
  • [31] Z. Huang, S. Malik, N. Moreano, and G. Araujo. The design of dynamically reconfigurable datapath coprocessors. ACM Trans. Embed. Comput. Syst., 3(2):361–384, May 2004.
  • [32] R. Joshi, G. Nelson, and K. Randall. Denali: a goal-directed superoptimizer. In PLDI 2002.
  • [33] K. Kailas and A. Agrawala. Cars: A new code generation framework for clustered ilp processors. In HPCA 2001.
  • [34] M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. In PLDI 2008.
  • [35] A. Kumar, L. De Carli, S. J. Kim, M. de Kruijf, K. Sankaralingam, C. Estan, and S. Jha. Design and implementation of the plug architecture for programmable and efficient network lookups. In PACT 2010.
  • [36] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-time scheduling of instruction-level parallelism on a raw machine. In ASPLOS 1998.
  • [37] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, M. Oskin, and S. J. Eggers. Instruction scheduling for a tiled dataflow architecture. In ASPLOS 2006.
  • [38] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, M. Oskin, and S. J. Eggers. Modeling instruction placement on a spatial architecture. In SPAA 2006.
  • [39] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, M. Budiu, and S. C. Goldstein. Tartan: Evaluating spatial computation for whole program execution. In ASPLOS 2006.
  • [40] R. Nagarajan, S. K. Kushwaha, D. Burger, K. S. McKinley, C. Lin, and S. W. Keckler. Static placement, dynamic issue (spdi) scheduling for edge architectures. In PACT 2004.
  • [41] E. Özer, S. Banerjia, and T. M. Conte. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures. In MICRO 31.
  • [42] J. Palsberg and M. Naik. Ilp-based resource-aware compilation, 2004.
  • [43] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-s. Kim. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In PACT 2008.
  • [44] W. Pugh. The omega test: a fast and practical integer programming algorithm for dependence analysis. In Supercomputing 1991.
  • [45] N. Satish, K. Ravindran, and K. Keutzer. A decomposition-based constraint optimization approach for statically scheduling task graphs with communication delays to multiprocessors. In DATE 2007.
  • [46] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. Wavescalar. In MICRO 2003.
  • [47] M. Thuresson, M. Sjalander, M. Bjork, L. Svensson, P. Larsson-Edefors, and P. Stenstrom. Flexcore: Utilizing exposed datapath control for efficient computing. In IC-SAMOS 2007.
  • [48] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. Conservation cores: reducing the energy of mature computations. In ASPLOS 2010.
  • [49] H. M. Wagner. An integer linear-programming model for machine scheduling. Naval Research Logistics Quarterly, 6(2):131–140, 1959.
  • [50] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. Computer, 30(9):86–93, 1997.
  • [51] M. Watkins, M. Cianchetti, and D. Albonesi. Shared reconfigurable architectures for cmps. In FPGA 2008.
  • [52] L. A. Wolsey and G. L. Nemhauser. Integer and Combinatorial
Best Paper
Best Paper of PLDI, 2013