Tuning Compiler Optimizations for Simultaneous Multithreading

Published in: International Journal of Parallel Programming

Abstract

Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Optimizations that are appropriate for these conventional machines may therefore be inappropriate for SMT, which can benefit from fine-grained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently on SMT architectures: loops should be parallelized with a cyclic, rather than a blocked, algorithm; software speculation should not be applied to non-loop code; and compilers no longer need to be concerned with precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
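To make the first guideline concrete, the sketch below (illustrative only, not code from the paper) shows how N loop iterations are partitioned among T threads under the two assignment schemes the abstract contrasts. Blocked scheduling gives each thread one contiguous chunk; cyclic scheduling interleaves iterations across threads, so neighboring iterations, which often touch nearby data, execute on threads that share the SMT processor's caches rather than on separate chunks that compete for them.

```python
def blocked(n_iters, n_threads):
    """Blocked scheduling: thread t gets one contiguous chunk of iterations."""
    chunk = (n_iters + n_threads - 1) // n_threads  # ceiling division
    return [list(range(t * chunk, min((t + 1) * chunk, n_iters)))
            for t in range(n_threads)]

def cyclic(n_iters, n_threads):
    """Cyclic scheduling: thread t gets iterations t, t + T, t + 2T, ..."""
    return [list(range(t, n_iters, n_threads))
            for t in range(n_threads)]

print(blocked(8, 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(cyclic(8, 2))   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

In OpenMP terms, these roughly correspond to `schedule(static)` and `schedule(static, 1)`; on a conventional multiprocessor the blocked form is the usual default precisely because it minimizes inter-processor sharing, which is the assumption the paper revisits for SMT.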




Cite this article

Lo, J.L., Eggers, S.J., Levy, H.M. et al. Tuning Compiler Optimizations for Simultaneous Multithreading. International Journal of Parallel Programming 27, 477–503 (1999). https://doi.org/10.1023/A:1018780200739
