# A general constraint-centric scheduling framework for spatial architectures

PLDI, no. 6 (2013): 495-506

Abstract:

Specialized execution using spatial architectures provides energy-efficient computation, but requires effective algorithms for spatially scheduling the computation. Generally, this has been solved with architecture-specific heuristics, an approach which suffers from poor compiler/architect productivity, lack of insight on optimality, and ...

Introduction

- The fundamental insight of many specialization techniques is to “map” large regions of computation to the hardware, breaking away from instruction-by-instruction pipelined execution and instead adopting a spatial architecture paradigm.

Highlights

- Hardware specialization has emerged as an important way to sustain microprocessor performance improvements by addressing transistor energy-efficiency challenges and the inefficiencies of general-purpose processing [6, 8, 19, 28].
- We describe the Integer Linear Programming (ILP) constraints that pertain to each scheduler responsibility, and show a diagram capturing each responsibility pictorially for our running example in Figure 1.
- Considering the ILP constraint formulation for the general framework, our GAMS implementation is around lines of code
- Scheduling is a fundamental problem for spatial architectures, which are increasingly used to address energy efficiency
- Compared to the architecture-specific schedulers, which are the current state-of-the-art, this paper provides a general formulation of spatial scheduling as a constraint-solving problem
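
To make the idea of "ILP constraints per scheduler responsibility" concrete, here is a minimal sketch of how a placement and a utilization constraint could be written. The notation is illustrative only and is not taken from the paper: it assumes a set of computation vertices $V$, a set of hardware nodes $N$, a binary variable $M_{vn}$ that is 1 when vertex $v$ is placed on node $n$, a hypothetical per-node capacity $\mathrm{cap}(n)$, and a hypothetical compatibility set $C(v) \subseteq N$ of nodes able to execute $v$.

$$
\begin{aligned}
& \sum_{n \in N} M_{vn} = 1 && \forall v \in V && \text{(each computation is placed exactly once)} \\
& M_{vn} = 0 && \forall v \in V,\ n \notin C(v) && \text{(only compatible hardware may be used)} \\
& \sum_{v \in V} M_{vn} \le \mathrm{cap}(n) && \forall n \in N && \text{(utilization stays within capacity)} \\
& M_{vn} \in \{0, 1\} && \forall v \in V,\ n \in N
\end{aligned}
$$

Routing, timing, and the optimization objective would add further variables and constraints on top of these; they are omitted here because their exact form depends on the target architecture.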

Results

- Is this ILP-based approach implementable? Yes, it is possible to express the scheduling problem as an ILP problem and implement it for real architectures.
- Considering the ILP constraint formulation for the general framework, the GAMS implementation is around lines of code
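
As a rough illustration of what "expressing scheduling as a constraint problem" means, the sketch below places a tiny dataflow graph onto a 2x2 grid of functional units. It is not the paper's GAMS model and it does not call an ILP solver: exhaustive enumeration stands in for the solver, and the graph, the grid, the capacity of one operation per node, and the hop-count objective are all assumptions chosen for the example.

```python
from itertools import product

# Hypothetical computation graph: vertices and producer -> consumer edges.
VERTICES = ["a", "b", "c", "d"]
EDGES = [("a", "c"), ("b", "c"), ("c", "d")]

# Hypothetical hardware: a 2x2 grid of functional units.
NODES = [(0, 0), (0, 1), (1, 0), (1, 1)]
CAPACITY = 1  # assumed: each node hosts at most one computation


def hops(src, dst):
    """Manhattan distance between two grid nodes, used as the routing cost."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def is_feasible(placement):
    """Utilization constraint: no node hosts more than CAPACITY vertices."""
    used = list(placement.values())
    return all(used.count(node) <= CAPACITY for node in NODES)


def cost(placement):
    """Objective: total hop count summed over all dataflow edges."""
    return sum(hops(placement[u], placement[v]) for u, v in EDGES)


def schedule():
    """Enumerate every assignment and keep the cheapest feasible one.

    Building a dict with one entry per vertex enforces the placement
    constraint (each vertex is mapped to exactly one node).
    """
    best = None
    for choice in product(NODES, repeat=len(VERTICES)):
        placement = dict(zip(VERTICES, choice))
        if not is_feasible(placement):
            continue
        if best is None or cost(placement) < cost(best):
            best = placement
    return best


best = schedule()
print(best, cost(best))
```

Enumeration is only workable at this toy size; the point of the sketch is that the constraints and the objective are stated separately from the search, which is what lets a generic solver replace architecture-specific heuristics.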

Conclusion

- Compared to the architecture-specific schedulers, which are the current state-of-the-art, this paper provides a general formulation of spatial scheduling as a constraint-solving problem.
- The authors applied this formulation to three diverse architectures, solved the resulting models with a standard ILP solver, and demonstrated that such a general scheduler outperforms or matches the respective specialized schedulers.
- The authors discuss the possibility of improving the scheduling time through algorithmic specialization, and how the scheduler delivers on its promises of compiler-developer productivity/extensibility, cross-architecture applicability, and insights on optimality.

- Table 1: Related work. Legend: i) computation placement; ii) data routing; iii) event timing; iv) utilization; v) optimization objective
- Table 2: Relationship between architectural primitives and scheduler responsibilities
- Table 3: Summary of formal notation used
- Table 4: Description of ILP model implementation for PLUG
- Table 5: Tools and methodology for quantitative evaluation
- Table 6: Benchmark characteristics and ILP scheduler behavior
- Table 7: Feasibility concerns
- Table 8: Applicability to other Spatial Architectures

Reference

- [1] Trips toolchain, http://www.cs.utexas.edu/trips/dist/.
- [2] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools.
- [3] S. Amarasinghe, D. R. Karger, W. Lee, and V. S. Mirrokni. A theoretical and practical approach to instruction scheduling on spatial architectures. Technical report, MIT, 2002.
- [4] S. Amellal and B. Kaminska. Functional synthesis of digital systems with tass. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 13(5):537–552, 1994.
- [5] C. Ancourt and F. Irigoin. Scanning polyhedra with do loops. In
- [6] O. Azizi, A. Mahesri, B. C. Lee, S. J. Patel, and M. Horowitz. Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. In ISCA 2010.
- [7] S. S. Battacharyya, E. A. Lee, and P. K. Murthy. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
- [8] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 54(5):67–77, 2011.
- [9] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the IEEE Computer, 37(7):44–55, 2004.
- [10] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In MICRO 2004.
- [11] J. Cong, K. Gururaj, G. Han, and W. Jiang. Synthesis algorithm for application-specific homogeneous processor networks. IEEE Trans. Very Large Scale Integr. Syst., 17(9), Sept. 2009.
- [12] K. Coons, X. Chen, S. Kushwaha, K. S. McKinley, and D. Burger.
- [13] L. De Carli, Y. Pan, A. Kumar, C. Estan, and K. Sankaralingam. Plug: Flexible lookup modules for rapid deployment of new protocols in high-speed routers. In SIGCOMM 2009.
- [14] L. de Moura and N. Bjørner. Z3: An efficient SMT solver. In TACAS, 2008.
- [15] A. Deb, J. M. Codina, and A. Gonzales. Softhv: A hw/sw co-designed processor with horizontal and vertical fusion. In International Conference on Computing Frontiers 2011.
- [16] A. E. Eichenberger and E. S. Davidson. Efficient formulation for optimal modulo schedulers. In PLDI 1997.
- [17] J. R. Ellis. Bulldog: a compiler for vliw architectures. PhD thesis, 1985.
- [18] D. W. Engels, J. Feldman, D. R. Karger, and M. Ruhl. Parallel processor scheduling with delay constraints. In SODA 2001.
- [19] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and
- [20] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for general-purpose approximate programs. In MICRO 2012.
- [21] K. Fan, H. h. Park, M. Kudlur, and S. o. Mahlke. Modulo scheduling for highly customized datapaths to increase hardware reusability. In International Journal of Parallel Programming, 21:313–347, 1992.
- [23] M. Gebhart, B. A. Maher, K. E. Coons, J. Diamond, P. Gratz, M. Marino, N. Ranganathan, B. Robatmili, A. Smith, J. Burrill, S. W. Keckler, D. Burger, and K. S. McKinley. An evaluation of the trips computer system. In ASPLOS 2009.
- [24] G. J. Gordon, S. A. Hong, and M. Dudík. First-order mixed integer linear programming. In UAI 2009.
- [25] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim. Dyser: Unifying functionality and parallelism specialization for energy efficient computing. IEEE Micro, 33(5), 2012.
- [26] V. Govindaraju, C.-H. Ho, and K. Sankaralingam. Dynamically specialized datapaths for energy efficient computing. In HPCA 2011.
- [27] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO 2011.
- [28] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, 2011.
- [29] J. N. Hooker. Logic, optimization and constraint programming. INFORMS Journal on Computing, 14:295–321, 2002.
- [30] J. N. Hooker and M. A. Osorio. Mixed logical-linear programming. Discrete Appl. Math., 96-97(1), Oct. 1999.
- [31] Z. Huang, S. Malik, N. Moreano, and G. Araujo. The design of dynamically reconfigurable datapath coprocessors. ACM Trans. Embed. Comput. Syst., 3(2):361–384, May 2004.
- [32] R. Joshi, G. Nelson, and K. Randall. Denali: a goal-directed superoptimizer. In PLDI 2002.
- [33] K. Kailas and A. Agrawala. Cars: A new code generation framework for clustered ilp processors. In HPCA 2001.
- [34] M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. In PLDI 2008.
- [35] A. Kumar, L. De Carli, S. J. Kim, M. de Kruijf, K. Sankaralingam, C. Estan, and S. Jha. Design and implementation of the plug architecture for programmable and efficient network lookups. In PACT 2010.
- [36] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-time scheduling of instruction-level parallelism on a raw machine. In ASPLOS 1998.
- [37] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, M. Oskin, and S. J. Eggers. Instruction scheduling for a tiled dataflow architecture. In ASPLOS 2006.
- [38] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, M. Oskin, and S. J. Eggers. Modeling instruction placement on a spatial architecture. In SPAA 2006.
- [39] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, M. Budiu, and S. C. Goldstein. Tartan: Evaluating spatial computation for whole program execution. In ASPLOS 2006.
- [40] R. Nagarajan, S. K. Kushwaha, D. Burger, K. S. McKinley, C. Lin, and S. W. Keckler. Static placement, dynamic issue (spdi) scheduling for edge architectures. In PACT 2004.
- [41] E. Özer, S. Banerjia, and T. M. Conte. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures. In MICRO 31.
- [42] J. Palsberg and M. Naik. Ilp-based resource-aware compilation, 2004.
- [43] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-s. Kim. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In PACT 2008.
- [44] W. Pugh. The omega test: a fast and practical integer programming algorithm for dependence analysis. In Supercomputing 1991.
- [45] N. Satish, K. Ravindran, and K. Keutzer. A decomposition-based constraint optimization approach for statically scheduling task graphs with communication delays to multiprocessors. In DATE 2007.
- [46] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. Wavescalar. In MICRO 2003.
- [47] M. Thuresson, M. Sjalander, M. Bjork, L. Svensson, P. Larsson-Edefors, and P. Stenstrom. Flexcore: Utilizing exposed datapath control for efficient computing. In IC-SAMOS 2007.
- [48] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. Conservation cores: reducing the energy of mature computations. In ASPLOS 2010.
- [49] H. M. Wagner. An integer linear-programming model for machine scheduling. Naval Research Logistics Quarterly, 6(2):131–140, 1959.
- [50] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. Computer, 30(9):86–93, 1997.
- [51] M. Watkins, M. Cianchetti, and D. Albonesi. Shared reconfigurable architectures for cmps. In FPGA 2008.
- [52] L. A. Wolsey and G. L. Nemhauser. Integer and Combinatorial

Best Paper

Best Paper of PLDI, 2013
