Bibliography

[1] W. A. Abu-Sufah and A. D. Malony. Vector processing on the Alliant FX/8 multiprocessor. In Proceedings of the International Conference on Parallel Processing, pages 559–566, 1986. Cited on page(s) 58

[2] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996. DOI: 10.1109/2.546611 Cited on page(s) 126

[3] A. V. Aho, M. Lam, R. Sethi, and J. D. Ullman. Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. Cited on page(s) 11, 19, 127

[4] A. Aiken and A. Nicolau. Perfect pipelining: A new loop parallelization technique. Technical report, Cornell University, 1987. Cited on page(s) 127

[5] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Comput. Surv., 27:367–432, September 1995. DOI: 10.1145/212094.212131 Cited on page(s) 127

[6] F. E. Allen, M. G. Burke, P. Charles, R. Cytron, and J. Ferrante. An overview of the PTRAN analysis system for multiprocessing. J. of Parallel and Distributed Computing, 5(5):617–640, 1988. DOI: 10.1016/0743-7315(88)90015-9 Cited on page(s) 1

[7] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of programming languages, pages 177–189, New York, NY, USA, 1983. ACM. DOI: 10.1145/567067.567085 Cited on page(s) 39

[8] R. Allen, D. Baumgartner, K. Kennedy, and A. Porterfield. Ptool: A semi-automatic parallel programming assistant. In 1986 International Conference on Parallel Programming, pages 164–170, 1986. Cited on page(s) 1

[9] R. Allen and S. Johnson. Compiling C for vectorization, parallelization, and inline expansion. In Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation, pages 241–249, New York, NY, USA, 1988. ACM. DOI: 10.1145/53990.54014 Cited on page(s) 126

[10] R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Trans. Program. Lang. Syst., 9(4):491–542, 1987. DOI: 10.1145/29873.29875 Cited on page(s) 126

[11] AltiVec technologies. http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCALTVC. Last accessed January 5, 2012. Cited on page(s) 7

[12] Unrolling AltiVec, Part 1: Introducing the PowerPC SIMD unit. http://www.ibm.com/developerworks/power/library/pa-unrollav1/. Last accessed January 5, 2012. Cited on page(s) 7

[13] Graphics cards from AMD. http://sites.amd.com/us/game/products/graphics/Pages/graphics.aspx?lid=Gaming_Graphics&lpos=HP_bottom_bucket. Last accessed January 5, 2012. Cited on page(s) 7

[14] Multi-core processing with AMD. http://www.amd.com/us/products/technologies/multi-core-processing/Pages/multi-core-processing.aspx. Last accessed January 5, 2012. Cited on page(s) 9

[15] Z. Ammarguellat and W. L. Harrison, III. Automatic recognition of induction variables and recurrence relations by abstract interpretation. In Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, volume 25, pages 283–295, New York, NY, USA, June 1990. ACM. DOI: 10.1145/93542.93583 Cited on page(s) 127

[16] Antlr v3. http://antlr.org/. Last accessed January 5, 2012. Cited on page(s) 11

[17] A. W. Appel and J. Palsberg. Modern Compiler Implementation in Java. Cambridge University Press, New York, NY, USA, 2nd edition, 2003. Cited on page(s) 11, 19

[18] Automatically Tuned Linear Algebra Software (ATLAS). Downloaded from http://mathatlas.sourceforge.net/. Last accessed January 5, 2012. Cited on page(s) 128

[19] M. Bach, M. Charney, R. Cohn, T. Devor, E. Demikhovsky, K. Hazelwood, A. Jaleel, C.-K. Luk, G. Lyons, H. Patil, and A. Tal. Analyzing parallel programs with Pin. IEEE Computer, 43(3):34–41, March 2010. DOI: 10.1109/MC.2010.60 Cited on page(s) 10

[20] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Comput. Surv., 26:345–420, December 1994. DOI: 10.1145/197405.197406 Cited on page(s) 127

[21] U. Banerjee. Unimodular transformations of double loops. In Third Workshop on Languages and Compilers for Parallel Computing, pages 192–219. The MIT Press, 1990. Cited on page(s) 93, 94, 127

[22] U. Banerjee. Dependence analysis (loop transformation for restructuring compilers). Springer, 1996. Cited on page(s) 37, 125

[23] U. Banerjee, R. Eigenmann, A. Nicolau, and D. A. Padua. Automatic program parallelization. Proceedings of the IEEE, 81:211–243, February 1993. DOI: 10.1109/5.214548 Cited on page(s) 125, 126

[24] U. Banerjee. Dependence Analysis for Supercomputing. Springer, 1988. Cited on page(s) 121, 125

[25] K. E. Batcher. Design of a massively parallel processor. IEEE Transactions on Computers, C-29:836–840, September 1980. DOI: 10.1109/TC.1980.1675684 Cited on page(s) 7

[26] M.-W. Benabderrahmane, L.-N. Pouchet, A. Cohen, and C. Bastoul. The polyhedral model is more widely applicable than you think. In Compiler Construction, pages 283–303, 2010. DOI: 10.1007/978-3-642-11970-5_16 Cited on page(s) 126

[27] Beowulf.org. Cited on page(s) 9

[28] A. Bhowmik and M. Franklin. A general compiler framework for speculative multithreading. In Proceedings of the fourteenth annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’02, pages 99–108, New York, NY, USA, 2002. ACM. DOI: 10.1145/564870.564885 Cited on page(s) 130

[29] IBM Research Blue Gene project page. http://www.research.ibm.com/bluegene/press_release.html. Last accessed January 5, 2012. Cited on page(s) 9, 95

[30] W. Blume and R. Eigenmann. Performance analysis of parallelizing compilers on the Perfect Benchmarks programs. IEEE Transactions on Parallel and Distributed Systems, 3:643–656, 1992. DOI: 10.1109/71.180621 Cited on page(s) 127

[31] W. Blume and R. Eigenmann. The Range Test: a dependence test for symbolic, non-linear expressions. In Supercomputing ’94, pages 528–537, 1994. DOI: 10.1145/602770.602858 Cited on page(s) 63, 126

[32] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, pages 68–78, 2008. DOI: 10.1145/1375581.1375591 Cited on page(s) 126

[33] R. Bordawekar, U. Bondhugula, and R. Rao. Believe it or not!: multi-core CPUs can match GPU performance for a FLOP-intensive application! In International Conference on Parallel Architectures and Compilation Techniques, pages 537–538, 2010. DOI: 10.1145/1854273.1854340 Cited on page(s) 126

[34] W. Bouknight, S. Denenberg, D. McIntyre, J. Randall, A. Sameh, and D. Slotnick. The Illiac IV system. Proceedings of the IEEE, 60(4):369–388, April 1972. DOI: 10.1109/PROC.1972.8647 Cited on page(s) 7

[35] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. Fortran 90D/HPF compiler for distributed memory MIMD computers: design, implementation, and performance results. In Proceedings of the 1993 ACM/IEEE conference on Supercomputing, Supercomputing ’93, pages 351–360, New York, NY, USA, 1993. ACM. DOI: 10.1145/169627.169750 Cited on page(s) 110, 128

[36] Z. Bozkus, L. Meadows, S. Nakamoto, V. Schuster, and M. Young. PGHPF - an optimizing High Performance Fortran compiler for distributed memory machines. Scientific Programming, 6(1):29–40, 1997. Cited on page(s) 114, 128

[37] M. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. August. Revisiting the sequential programming model for multi-core. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 69–84, Washington, DC, USA, 2007. IEEE Computer Society. DOI: 10.1109/MICRO.2007.20 Cited on page(s) 131

[38] I. Buck, T. Foley, D. R. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, 2004. DOI: 10.1145/1015706.1015800 Cited on page(s) 7, 126

[39] M. Burke and R. Cytron. Interprocedural dependence analysis and parallelization. In Proceedings of the ACM SIGPLAN ’86 Symposium on Compiler Construction, volume 21(6) of SIGPLAN Notices, pages 162–175, June 1986. DOI: 10.1145/12276.13328 Cited on page(s) 33

[40] D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: a test suite and results. In Proceedings of the 1988 ACM/IEEE conference on Supercomputing, Supercomputing ’88, pages 98–105, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press. DOI: 10.1109/SUPERC.1988.44642 Cited on page(s) 126

[41] D. Callahan and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, 5:334–358, 1988. DOI: 10.1016/0743-7315(88)90002-0 Cited on page(s) 127

[42] D. Callahan, K. Kennedy, and J. Subhlok. Analysis of event synchronization in a parallel programming tool. In PPOPP, pages 21–30, 1990. DOI: 10.1145/99163.99167 Cited on page(s) 126

[43] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. Technical report, IDA Center for Computing Sciences, 1999. Technical Report CCS-TR-99-157. Cited on page(s) 128

[44] S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In 28th Hawaii International Conference on System Sciences, 1996. DOI: 10.1109/HICSS.1996.495462 Cited on page(s) 127

[45] S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768–1810, 1994. DOI: 10.1145/197320.197366 Cited on page(s) 127

[46] B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. Scientific Programming, 1(1):31–50, 1992. Cited on page(s) 110, 128

[47] M. Chastain, G. Gostin, J. Mankovich, and S. Wallach. The Convex C240 architecture. In Proceedings of the 1988 ACM/IEEE conference on Supercomputing, Supercomputing ’88, pages 321–329, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press. DOI: 10.1109/SUPERC.1988.44669 Cited on page(s) 1

[48] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222–231, 1999. DOI: 10.1145/305619.305645 Cited on page(s) 128

[49] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the 17th annual International Symposium on Computer Architecture, ISCA ’90, pages 239–248. ACM, 1990. DOI: 10.1109/ISCA.1990.134531 Cited on page(s) 126

[50] J.-H. Chow, L. E. Lyon, and V. Sarkar. Automatic parallelization for symmetric shared-memory multiprocessors. In Proceedings of the 1996 conference of the Centre for Advanced Studies on Collaborative Research, page 5. IBM, 1996. Cited on page(s) 4

[51] The Cilk project. http://supertech.csail.mit.edu/cilk/. Last accessed January 5, 2012. Cited on page(s) 63

[52] M. Cintra and D. R. Llanos. Toward efficient and robust software speculative parallelization on multiprocessors. In Proceedings of the ninth ACM SIGPLAN Symposium on the Principles and Practice of Parallel Programming, PPoPP ’03, pages 13–24, New York, NY, USA, 2003. ACM. DOI: 10.1145/781498.781501 Cited on page(s) 129

[53] M. Cintra, J. F. Martínez, and J. Torrellas. Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In Proceedings of the 27th annual International Symposium on Computer Architecture, pages 13–24, New York, NY, USA, 2000. ACM. DOI: 10.1145/342001.363382 Cited on page(s) 130

[54] Co-Array Fortran. http://www.co-array.org/. Last accessed August 26, 2011. Cited on page(s) 128

[55] C. Coarfa, Y. Dotsenko, J. Mellor-Crummey, F. Cantonnet, T. El-Ghazawi, A. Mohanti, Y. Yao, and D. Chavarría-Miranda. An evaluation of global address space languages: Co-array Fortran and Unified Parallel C. In Proceedings of the tenth ACM SIGPLAN Symposium on the Principles and Practice of Parallel Programming, PPoPP ’05, pages 36–47, New York, NY, USA, 2005. ACM. DOI: 10.1145/1065944.1065950 Cited on page(s) 128

[56] J.-F. Collard. Automatic parallelization of while-loops using speculative execution. International Journal of Parallel Programming, 23:191–219, 1995. DOI: 10.1007/BF02577789 Cited on page(s) 127, 129

[57] K. Cooper and L. Torczon. Engineering a Compiler. Morgan Kaufmann, 2011. Cited on page(s) 125

[58] P. Cousot. Abstract interpretation. ACM Comput. Surv., 28:324–328, June 1996. DOI: 10.1145/234528.234740 Cited on page(s) 28

[59] Abstract interpretation, 2008. http://www.di.ens.fr/~cousot/AI/. Last accessed January 5, 2012. Cited on page(s) 28

[60] CUDA: Parallel programming made easy. http://www.nvidia.com/object/cuda_home_new.html. Last accessed January 5, 2012. Cited on page(s) 7, 126

[61] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In International Conference on Parallel Processing, pages 836–844, 1986. Cited on page(s) 58

[62] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An efficient method of computing static single assignment form. In ACM Conference on the Principles of Programming Languages, pages 25–35, 1989. DOI: 10.1145/75277.75280 Cited on page(s) 13, 14, 38, 39, 125

[63] F. H. Dang and L. Rauchwerger. Speculative parallelization of partially parallel loops. In Languages, Compilers, and Run-Time Systems for Scalable Computers, volume 1915 of Lecture Notes in Computer Science, pages 285–299. Springer, 2000. Cited on page(s) 129

[64] F. H. Dang, H. Yu, and L. Rauchwerger. The R-LRPD test: Speculative parallelization of partially parallel loops. In IPDPS, 2002. DOI: 10.1109/IPDPS.2002.1015493 Cited on page(s) 129

[65] C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. Software behavior oriented parallelization. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’07, pages 223–234, New York, NY, USA, 2007. ACM. DOI: 10.1145/1250734.1250760 Cited on page(s) 131

[66] Y. Dotsenko, C. Coarfa, and J. Mellor-Crummey. A multi-platform Co-Array Fortran compiler. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pages 29–40, Washington, DC, USA, 2004. IEEE Computer Society. DOI: 10.1109/PACT.2004.1342539 Cited on page(s) 128

[67] Z.-H. Du, C.-C. Lim, X.-F. Li, C. Yang, Q. Zhao, and T.-F. Ngai. A cost-driven compilation framework for speculative parallelization of sequential programs. In Proceedings of the ACM Conference on Programming Language Design and Implementation, 2004. DOI: 10.1145/996841.996852 Cited on page(s) 130

[68] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In 20th Annual Workshop on microprogramming, 1987. DOI: 10.1145/255305.255317 Cited on page(s) 127

[69] K. Ebcioglu, R. D. Groves, K.-C. Kim, G. M. Silberman, and I. Ziv. VLIW compilation techniques in a superscalar environment. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 36–48, 1994. DOI: 10.1145/178243.178247 Cited on page(s) 67

[70] A. E. Eichenberger, P. Wu, and K. O’Brien. Vectorization for SIMD architectures with alignment constraints. In Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, pages 82–93, New York, NY, USA, 2004. ACM. DOI: 10.1145/996841.996853 Cited on page(s) 126

[71] E. Elmroth and F. G. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM Journal of Research and Development, 44(4):605–624, 2000. DOI: 10.1147/rd.444.0605 Cited on page(s) 128

[72] P. Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22, 1988. Cited on page(s) 126

[73] P. Feautrier. Automatic parallelization in the polytope model. In G.-R. Perrin and A. Darte, editors, The Data Parallel Programming Model, volume 1132 of Lecture Notes in Computer Science, pages 79–103. Springer Berlin / Heidelberg, 1996. Cited on page(s) 126

[74] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. In 6th International Symposium on Programming, volume 167, pages 125–132, 1984. DOI: 10.1145/24039.24041 Cited on page(s) 125

[75] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, 1987. DOI: 10.1145/24039.24041 Cited on page(s) 13, 38, 39

[76] FFTW. The project web page is at http://fftw.org/. Last accessed on January 5, 2012. Cited on page(s) 128

[77] C. N. Fischer, R. K. Cytron, and R. J. LeBlanc. Crafting A Compiler. Addison-Wesley Publishing Company, USA, 1st edition, 2009. Cited on page(s) 11, 19, 125

[78] A. Fisher and A. Ghuloum. Parallelizing complex scans and reductions. In Conference on Programming Language Design and Implementation, 1994. DOI: 10.1145/773473.178255 Cited on page(s) 127

[79] A. Fisher and A. Ghuloum. Parallelizing complex scans and reductions. SIGPLAN Not., 29:135–146, June 1994. DOI: 10.1145/773473.178255 Cited on page(s) 127

[80] J. Fisher. Trace scheduling: a technique for global microcode compaction. IEEE Trans. on Computers, C-30(7), July 1981. DOI: 10.1109/TC.1981.1675827 Cited on page(s) 127

[81] J. A. Fisher, J. R. Ellis, J. C. Ruttenberg, and A. Nicolau. Parallel processing: a smart compiler and a dumb machine (with retrospective). In 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999, A Selection, pages 112–124, 1984. DOI: 10.1145/989393.989408 Cited on page(s) 67, 127

[82] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on “Program Generation, Optimization, and Platform Adaptation”. DOI: 10.1109/JPROC.2004.840301 Cited on page(s) 128

[83] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. J. Parallel Distrib. Comput., 5(5):587–616, 1988. DOI: 10.1016/0743-7315(88)90014-7 Cited on page(s) 127

[84] M. J. Garzarán, M. Prvulovic, V. Viñals, J. M. Llabería, L. Rauchwerger, and J. Torrellas. Using software logging to support multi-version buffering in thread-level speculation. In IEEE PACT, pages 170–179, 2003. DOI: 10.1109/PACT.2003.1238013 Cited on page(s) 129

[85] G. Goff, K. Kennedy, and C.-W. Tseng. Practical dependence testing. In PLDI, pages 15–29, 1991. DOI: 10.1145/113446.113448 Cited on page(s) 29

[86] M. Grigni and F. Manne. On the complexity of the generalized block distribution. In Third International Workshop on Parallel Algorithms for Irregularly Structured Problems, pages 319–326, 1996. Proceedings available in volume 1117 of Springer LNCS. DOI: 10.1007/BFb0030123 Cited on page(s) 128

[87] D. Grune, H. E. Bal, C. J. Jacobs, and K. G. Langendoen. Modern Compiler Design. Wiley, 2000. Cited on page(s) 11

[88] M. Gupta and P. Banerjee. Paradigm: a compiler for automatic data distribution on multicomputers. In Proceedings of the 7th International Conference on Supercomputing, ICS ’93, pages 87–96, New York, NY, USA, 1993. ACM. DOI: 10.1145/165939.165959 Cited on page(s) 128

[89] M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’95, New York, NY, USA, 1995. ACM. DOI: 10.1145/224170.224422 Cited on page(s) 114, 128

[90] M. Gupta, S. Mukhopadhyay, and N. Sinha. Automatic parallelization of recursive procedures. In Proceedings of the IEEE Conference on Parallel Architecture and Compiler Techniques (PACT), pages 139–148, 1999. DOI: 10.1023/A:1007560600904 Cited on page(s) 63

[91] M. Gupta, S. Mukhopadhyay, and N. Sinha. Automatic parallelization of recursive procedures. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, PACT ’99, pages 139–, Washington, DC, USA, 1999. IEEE Computer Society. DOI: 10.1109/PACT.1999.807504 Cited on page(s) 127

[92] M. Gupta, S. Mukhopadhyay, and N. Sinha. Automatic parallelization of recursive procedures. International Journal of Parallel Programming, 28(6):537–562, 2000. DOI: 10.1023/A:1007560600904 Cited on page(s) 63

[93] F. G. Gustavson, I. Jonsson, B. Kågström, and P. Ling. Towards peak performance on hierarchical SMP memory architectures - new recursive blocked data formats and BLAS. In PPSC, 1999. Cited on page(s) 128

[94] M. Haghighat and C. Polychronopoulos. Symbolic program analysis and optimization for parallelizing compilers. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, volume 757 of Lecture Notes in Computer Science, pages 538–562. Springer Berlin / Heidelberg, 1993. Cited on page(s) 127

[95] M. Haghighat and C. Polychronopoulos. Symbolic analysis: A basis for parallelization, optimization, and scheduling of programs. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, volume 768 of Lecture Notes in Computer Science, pages 567–585. Springer Berlin / Heidelberg, 1994. Cited on page(s) 127

[96] M. R. Haghighat and C. D. Polychronopoulos. Symbolic program analysis and optimization for parallelizing compilers. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers, pages 538–562, 1992. Cited on page(s) 127

[97] R. A. Haring. IBM Blue Gene/Q compute chip. In Hot Chips 23, 2011. Available at http://www.hotchips.org/conference-archives/hot-chips-23. Last accessed January 5, 2012. Cited on page(s) 130

[98] J. Harris, J. A. Bircsak, M. R. Bolduc, J. A. Diewald, I. Gale, N. W. Johnson, S. Lee, C. A. Nelson, and C. D. Offner. Compiling High Performance Fortran for distributed-memory systems. Digital Technical Journal, 7(3):5–23, 1995. Cited on page(s) 114, 128

[99] T. Harris, J. R. Larus, and R. Rajwar. Transactional Memory, 2nd edition. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2010. Cited on page(s) 130

[100] M. S. Hecht. Global data-flow analysis of computer programs. PhD thesis, Princeton University, Princeton, NJ, USA, 1973. tech report no. AAI7409690. Cited on page(s) 19

[101] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the ACM International Symposium on Computer Architecture, pages 289–300, 1993. DOI: 10.1109/ISCA.1993.698569 Cited on page(s) 66

[102] D. Hillis. The Connection Machine. MIT Press, 1989. Cited on page(s) 7

[103] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Commun. ACM, 35:66–80, August 1992. DOI: 10.1145/135226.135230 Cited on page(s) 110, 128

[104] J. Hoeflinger and Y. Paek. A comparative analysis of dependence testing mechanisms. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers, pages 289–303, 2000. DOI: 10.1007/3-540-45574-4_19 Cited on page(s) 126

[105] M. Hopkins. A perspective on the 801/Reduced Instruction Set Computer. IBM Systems Journal, 26(1):107–121, 1987. DOI: 10.1147/sj.261.0107 Cited on page(s) 58

[106] Power4 design. http://www.research.ibm.com/power4/. Last accessed January 5, 2012. Cited on page(s) 1

[107] E. Brooks III and K. Warren. The 1991 MPCI yearly report: The attack of the killer micros. Technical report, Lawrence Livermore National Laboratory, 1991. Technical Report UCRL-ID-107022. Cited on page(s) 1, 58

[108] W. L. Harrison III. The interprocedural analysis and automatic parallelization of Scheme programs. Lisp and Symbolic Computation, 2(2):179–396, 1989. DOI: 10.1007/BF01808954 Cited on page(s) 62, 127

[109] W. L. Harrison III and Z. Ammarguellat. A program’s eye view of Miprac. In 5th International Workshop on Languages and Compilers for Parallel Computing, volume 757 of Lecture Notes in Computer Science. Springer, August 3-5 1993. Cited on page(s) 62, 127

[110] Intel Pentium D Processor 820. http://ark.intel.com/Product.aspx?id=27512. Last accessed January 5, 2012. Cited on page(s) 1

[111] Intel graphics media accelerator 950. http://www.intel.com/products/chipsets/gma950/index.htm. Last accessed January 5, 2012. Cited on page(s) 7

[112] G. Jin, J. Mellor-Crummey, and R. Fowler. Increasing temporal locality with skewing and recursive blocking. In Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’01, pages 43–43, New York, NY, USA, 2001. ACM. DOI: 10.1145/582034.582077 Cited on page(s) 127

[113] R. B. Jones and V. H. Allan. Software pipelining: a comparison and improvement. In Proceedings of the 23rd annual Workshop and Symposium on Microprogramming and Microarchitecture, MICRO 23, pages 46–56, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press. DOI: 10.1109/MICRO.1990.151426 Cited on page(s) 127

[114] P. Jouvelot and B. Dehbonei. A unified semantic approach for the vectorization and parallelization of generalized reductions. In 1989 International Conference on Supercomputing, 1989. DOI: 10.1145/318789.318810 Cited on page(s) 127

[115] A. Kejariwal, H. Saito, X. Tian, M. Girkar, W. Li, U. Banerjee, A. Nicolau, and C. D. Polychronopoulos. Lightweight lock-free synchronization methods for multithreading. In Proceedings of the 20th annual International Conference on Supercomputing, ICS ’06, pages 361–371, New York, NY, USA, 2006. ACM. DOI: 10.1145/1183401.1183452 Cited on page(s) 126

[116] K. Kennedy and R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, 2001. Cited on page(s) 126

[117] K. Kennedy, C. Koelbel, and H. Zima. The rise and fall of High Performance Fortran: an historical object lesson. In Proceedings of the third ACM SIGPLAN conference on History of programming languages, HOPL III, pages 712–722, New York, NY, USA, 2007. ACM. DOI: 10.1145/1238844.1238851 Cited on page(s) 128

[118] K. Kennedy and U. Kremer. Automatic data layout for distributed-memory machines. ACM Trans. Program. Lang. Syst., 20:869–916, July 1998. DOI: 10.1145/291891.291901 Cited on page(s) 128

[119] D. Klappholz, K. Psarris, and X. Kong. On the perfect accuracy of an approximate subscript analysis test. In International Conference on Supercomputing, pages 201–212, 1990. DOI: 10.1145/255129.255158 Cited on page(s) 37

[120] K. Knobe and V. Sarkar. Array SSA form and its use in parallelization. In ACM Conference on the Principles of Programming Languages, pages 107–120, 1998. DOI: 10.1145/268946.268956 Cited on page(s) 13

[121] I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 346–357, 1997. DOI: 10.1145/258915.258946 Cited on page(s) 127

[122] C. Koelbel. Compile-time generation of regular communications patterns. In Proceedings of the 1991 ACM/IEEE conference on Supercomputing, Supercomputing ’91, pages 101–110, New York, NY, USA, 1991. ACM. DOI: 10.1145/125826.125890 Cited on page(s) 128

[123] C. Koelbel. An overview of High Performance Fortran. SIGPLAN Fortran Forum, 11:9–16, December 1992. DOI: 10.1145/140734.140736 Cited on page(s) 128

[124] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. E. Zosel. The High Performance Fortran Handbook. MIT Press, 1993. Cited on page(s) 101, 128

[125] X. Kong, D. Klappholz, and K. Psarris. The I test: an improved dependence test for automatic parallelization and vectorization. IEEE Transactions on Parallel and Distributed Systems, 2(3):342–349, July 1991. DOI: 10.1109/71.86109 Cited on page(s) 126

[126] J. S. Kowalik. Parallel MIMD computation: The HEP supercomputer and its applications. MIT Press, 1985. Available at http://hdl.handle.net/1721.1/1745. Last accessed January 5, 2012. Cited on page(s) 56

[127] A. Krishnamurthy and K. A. Yelick. Optimizing parallel programs with explicit synchronization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 196–204, 1995. DOI: 10.1145/223428.207142 Cited on page(s) 126

[128] A. Krishnamurthy and K. A. Yelick. Analyses and optimizations for shared address space programs. J. Parallel Distrib. Comput., 38(2):130–144, 1996. DOI: 10.1006/jpdc.1996.0136 Cited on page(s) 46, 50, 126

[129] D. Kuck and A. Goldstein. http://www.ieeeghn.org/wiki/index.php/Oral-History:David_Kuck. Last accessed January 5, 2012. Cited on page(s) 1

[130] D. J. Kuck. The Structure of Computers and Computations. John Wiley & Sons, Inc., 1978. Cited on page(s) 127

[131] D. J. Kuck, E. S. Davidson, D. H. Lawrie, A. H. Sameh, and C.-Q. Zhu. The Cedar system and an initial performance study. In 25 Years of ISCA: Retrospectives and Reprints, pages 462–472, 1998. DOI: 10.1145/285930.286005 Cited on page(s) 126

[132] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In ACM Conference on the Principles of Programming Languages, pages 207–218, 1981. DOI: 10.1145/567532.567555 Cited on page(s) 1, 126

[133] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the ACM Conference on Principles of Programming Languages, pages 207–218, 1981. DOI: 10.1145/567532.567555 Cited on page(s) 127

[134] R. H. Kuhn. Optimization and interconnection complexity for: Parallel processors, single stage networks, and decision trees. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, February 1980. Cited on page(s) 37

[135] M. Kulkarni, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. P. Chew. Optimistic parallelism benefits from data partitioning. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’08), ASPLOS XIII, pages 233–243, New York, NY, USA, 2008. ACM. DOI: 10.1145/1346281.1346311 Cited on page(s) 131

[136] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’07, pages 211–222, New York, NY, USA, 2007. ACM. DOI: 10.1145/1250734.1250759 Cited on page(s) 66, 131

[137] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. Communications of the ACM, 52:89–97, Sept. 2009. DOI: 10.1145/1562164.1562188 Cited on page(s) 66, 131

[138] M. S. Lam. Software pipelining: an effective scheduling technique for VLIW machines (with retrospective). In 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999, A Selection, pages 244–256. ACM, 2004. DOI: 10.1145/989393.989420 Cited on page(s) 127

[139] M. S. Lam and M. E. Wolf. A data locality optimizing algorithm (with retrospective). In 20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999, A Selection, pages 442–459, 1991. DOI: 10.1145/989393.989437 Cited on page(s) 29

[140] M. S. Lam and M. E. Wolf. A data locality optimizing algorithm. SIGPLAN Not., 39:442–459, April 2004. DOI: 10.1145/989393.989437 Cited on page(s) 127

[141] J. Lee, D. A. Padua, and S. P. Midkiff. Basic compiler algorithms for parallel programs. In Proceedings of the seventh ACM SIGPLAN Symposium on the Principles and Practice of Parallel Programming, PPoPP ’99, pages 1–12, 1999. DOI: 10.1145/301104.301105 Cited on page(s) 11, 126

[142] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP ’09), pages 101–110, 2009. DOI: 10.1145/1594835.1504194 Cited on page(s) 7, 126

[143] The lex & yacc page. http://dinosaur.compilertools.net/. Last accessed January 5, 2012. Cited on page(s) 11

[144] Z. Li and W. A. Abu-Sufah. A technique for reducing synchronization overhead in large scale multiprocessors. In Proceedings of the International Symposia on Computer Architecture, pages 284–291, 1985. DOI: 10.1145/327070.327266 Cited on page(s) 62, 126

[145] H. Lin, S. P. Midkiff, and R. Eigenmann. A study of the usefulness of producer/consumer synchronization. In Proceedings of the 24th International Workshop on Languages and Compilers for Parallel Computing, 2011. To appear. Cited on page(s) 60

[146] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas. POSH: a TLS compiler that exploits program structure. In Proceedings of the eleventh ACM SIGPLAN Symposium on the Principles and Practice of Parallel Programming, PPoPP ’06, pages 158–167, New York, NY, USA, 2006. ACM. DOI: 10.1145/1122971.1122997 Cited on page(s) 130

[147] J. Manson, W. Pugh, and S. V. Adve. The Java memory model. Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on the Principles of Programming Languages, 40:378–391, January 2005. Cited on page(s) 126

[148] P. Marcuello and A. Gonzalez. Thread-spawning schemes for speculative multithreading. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on, pages 55–64, Feb. 2002. DOI: 10.1109/HPCA.2002.995698 Cited on page(s) 130

[149] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. ACM Trans. Graph., 22(3):896–907, 2003. DOI: 10.1145/882262.882362 Cited on page(s) 7, 126

[150] K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Trans. Program. Lang. Syst., 18:424–453, July 1996. DOI: 10.1145/233561.233564 Cited on page(s) 127

[151] M. Mehrara, J. Hao, P.-C. Hsu, and S. Mahlke. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 166–176, New York, NY, USA, 2009. ACM. DOI: 10.1145/1542476.1542495 Cited on page(s) 130

[152] S. P. Midkiff. Dependence analysis in parallel loops with i±k subscripts. In Proceedings of the Eighth International Workshop on Languages and Compilers for Parallel Computing, volume 1033 of Lecture Notes in Computer Science, pages 331–345. Springer, 1995. DOI: 10.1007/BFb0014209 Cited on page(s) 46

[153] S. P. Midkiff. Local iteration set computation for block-cyclic distributions. In Proceedings of the International Conference on Parallel Processing, Volume 2, pages 77–84, 1995. Cited on page(s) 101

[154] S. P. Midkiff. Optimizing the representation of local iteration sets and access sequences for block-cyclic distributions. In Ninth International Workshop on Languages and Compilers for Parallel Computing, pages 420–434, 1996. DOI: 10.1007/BFb0017267 Cited on page(s) 101

[155] S. P. Midkiff, J. E. Moreira, and M. Snir. Optimizing array reference checking in Java programs. IBM Systems Journal, 37(3):409–453, 1998. DOI: 10.1147/sj.373.0409 Cited on page(s) 38

[156] S. P. Midkiff and D. A. Padua. Compiler generated synchronization for do loops. In Proceedings of the International Conference on Parallel Processing (ICPP), pages 544–551, 1986. Cited on page(s) 61, 62, 126

[157] S. P. Midkiff and D. A. Padua. Compiler algorithms for synchronization. IEEE Trans. Computers, 36(12):1485–1495, 1987. DOI: 10.1109/TC.1987.5009499 Cited on page(s) 62, 126

[158] M. J. Moravan, J. Bobba, K. E. Moore, L. Yen, M. D. Hill, B. Liblit, M. M. Swift, and D. A. Wood. Supporting nested transactional memory in LogTM. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, pages 359–370, 2006. DOI: 10.1145/1168918.1168902 Cited on page(s) 66

[159] J. E. Moreira, S. P. Midkiff, and M. Gupta. From flop to megaflops: Java for technical computing. ACM Trans. Program. Lang. Syst., 22(2):265–295, 2000. DOI: 10.1145/349214.349222 Cited on page(s) 38

[160] Message passing interface forum. http://mpi-forum.org/. Last accessed January 5, 2012. Cited on page(s) 128

[161] S. S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann/Elsevier Science India, 2003. Cited on page(s) 125

[162] N. R. Adiga et al. An overview of the BlueGene/L supercomputer. In ACM Supercomputing (SC ’02), pages 1–22, 2002. DOI: 10.1109/SC.2002.10017 Cited on page(s) 9, 95

[163] R. W. Numrich and J. Reid. Co-array Fortran for parallel programming. SIGPLAN Fortran Forum, 17:1–31, August 1998. DOI: 10.1145/289918.289920 Cited on page(s) 128

[164] Nvidia technologies. http://www.nvidia.com/page/technologies.html. Last accessed January 5, 2012. Cited on page(s) 7

[165] J. T. Oplinger, D. L. Heine, and M. S. Lam. In search of speculative thread-level parallelism. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, PACT ’99, pages 303–, Washington, DC, USA, 1999. IEEE Computer Society. DOI: 10.1109/PACT.1999.807576 Cited on page(s) 130

[166] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, pages 105–118, Washington, DC, USA, 2005. IEEE Computer Society. DOI: 10.1109/MICRO.2005.13 Cited on page(s) 131

[167] D. Padua. The Fortran I compiler. Computing in Science and Engg., 2:70–75, January 2000. DOI: 10.1109/5992.814661 Cited on page(s) 125

[168] D. A. Padua. Multiprocessors: Discussion of Some Theoretical and Practical Problems. PhD thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, October 1979. Cited on page(s) 58

[169] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Commun. ACM, 29:1184–1201, December 1986. DOI: 10.1145/7902.7904 Cited on page(s) 125, 126, 127

[170] D. Palermo, E. Su, E. W. Hodges IV, and P. Banerjee. Compiler support for privatization on distributed-memory machines. Parallel Processing, International Conference on, 3:0017, 1996. DOI: 10.1109/ICPP.1996.538555 Cited on page(s) 127

[171] J. Patel and E. Davidson. Improving the throughput of a pipeline by insertion of delays. In Third Annual Symposium on Computer Architecture, pages 159–164, 1976. DOI: 10.1145/800110.803575 Cited on page(s) 127

[172] D. A. Patterson and C. H. Sequin. RISC I: a reduced instruction set VLSI computer. In 25 years of the International Symposia on Computer Architecture (selected papers), pages 216–230, New York, NY, USA, 1998. ACM. Originally appeared in ISCA ’81. DOI: 10.1145/285930.285981 Cited on page(s) 58

[173] G. Pfister. In Search of Clusters, Second Edition. Prentice Hall, 1997. Cited on page(s) 9, 95

[174] Pin: A dynamic binary instrumentation tool. http://www.pintool.org/. Last accessed January 5, 2012. Cited on page(s) 10

[175] W. Pottenger. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. MS Thesis. Cited on page(s) 127

[176] W. M. Pottenger and R. Eigenmann. Idiom recognition in the Polaris parallelizing compiler. In International Conference on Supercomputing, pages 444–448, 1995. DOI: 10.1145/224538.224655 Cited on page(s) 127

[177] L.-N. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 90–100, 2008. DOI: 10.1145/1379022.1375594 Cited on page(s) 126

[178] L.-N. Pouchet, C. Bastoul, A. Cohen, and N. Vasilache. Iterative optimization in the polyhedral model: Part I, one-dimensional time. In CGO, pages 144–156, 2007. DOI: 10.1109/CGO.2007.21 Cited on page(s) 126

[179] Power.org. http://www-03.ibm.com/systems/power/. Last accessed January 5, 2012. Cited on page(s) 9

[180] IBM Power 770 and 780 (9117-MMC, 9179-MHC) technical overview and introduction. http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/redp4798.html?Open. Last accessed January 5, 2012. Cited on page(s) 9

[181] W. Pugh. The Omega Test: a fast and practical integer programming algorithm for dependence analysis. In SC, pages 4–13, 1991. DOI: 10.1145/125826.125848 Cited on page(s) 37, 126

[182] W. Pugh. The Omega Test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, (8):102–114, August 1992. DOI: 10.1145/125826.125848 Cited on page(s) 37

[183] W. Pugh and D. Wonnacott. Eliminating false data dependences using the Omega Test. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 140–151, 1992. DOI: 10.1145/143103.143129 Cited on page(s) 126

[184] W. Pugh and D. Wonnacott. Going beyond integer programming with the Omega Test to eliminate false data dependences. IEEE Trans. Parallel Distrib. Syst., 6(2):204–211, 1995. DOI: 10.1109/71.342135 Cited on page(s) 37, 126

[185] The Java memory model. A detailed discussion of problems and issues with the original Java memory model can be found at http://www.cs.umd.edu/~pugh/java/memoryModel/. Last accessed on January 5, 2012. Cited on page(s) 126

[186] C. G. Quiñones, C. Madriles, J. Sánchez, P. Marcuello, A. González, and D. M. Tullsen. Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’05, pages 269–279, New York, NY, USA, 2005. ACM. DOI: 10.1145/1064978.1065043 Cited on page(s) 130

[187] M. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill Science/Engineering/Math, 2003. Cited on page(s) 100

[188] A. Raman, H. Kim, T. R. Mason, T. B. Jablin, and D. I. August. Speculative parallelization using software multi-threaded transactions. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS ’10, pages 65–76, New York, NY, USA, 2010. ACM. DOI: 10.1145/1736020.1736030 Cited on page(s) 131

[189] E. Raman, G. Ottoni, A. Raman, M. J. Bridges, and D. I. August. Parallel-stage decoupled software pipelining. In Proceedings of the 6th annual IEEE/ACM International Symposium on Code generation and optimization, CGO ’08, pages 114–123, New York, NY, USA, 2008. ACM. DOI: 10.1145/1356058.1356074 Cited on page(s) 55, 131

[190] R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004), pages 177–188, 2004. DOI: 10.1109/PACT.2004.1342552 Cited on page(s) 55, 131

[191] B. Rau and C. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In 14th Annual Workshop on Microprogramming, 1981. Cited on page(s) 127

[192] L. Rauchwerger, F. Arzu, and K. Ouchi. Standard templates adaptive parallel library (STAPL). In 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 402–409, 1998. Cited on page(s) 125

[193] L. Rauchwerger and D. A. Padua. Parallelizing while loops for multiprocessor systems. In Proceedings of the 9th International Parallel Processing Symposium, April 25-28, 1995, Santa Barbara, California, USA (IPPS), pages 347–356, 1995. DOI: 10.1109/IPPS.1995.395955 Cited on page(s) 65, 127, 129

[194] L. Rauchwerger and D. A. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Trans. Parallel Distrib. Syst., 10(2):160–180, 1999. DOI: 10.1109/71.752782 Cited on page(s) 129

[195] J. Reid. Coarrays in the next Fortran standard. SIGPLAN Fortran Forum, 29:10–27, July 2010. DOI: 10.1145/1837137.1837138 Cited on page(s) 114, 128

[196] G. Ren, P. Wu, and D. Padua. A preliminary study on the vectorization of multimedia applications for multimedia extensions. In L. Rauchwerger, editor, Languages and Compilers for Parallel Computing, volume 2958 of Lecture Notes in Computer Science, pages 420–435. Springer Berlin / Heidelberg, 2004. DOI: 10.1007/978-3-540-24644-2_27 Cited on page(s) 126

[197] M. C. Rinard and P. C. Diniz. Commutativity analysis: A new analysis technique for parallelizing compilers. ACM Trans. Program. Lang. Syst., 19(6):942–991, 1997. DOI: 10.1145/267959.269969 Cited on page(s) 63

[198] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the 1989 ACM Conference on Programming Language Design and Implementation, pages 69–80, 1989. DOI: 10.1145/73141.74824 Cited on page(s) 97, 107

[199] H. Rong, Z. Tang, R. Govindarajan, A. Douillet, and G. R. Gao. Single-dimension software pipelining for multi-dimensional loops. In Symposium on Code Generation and Optimization, pages 163–174, 2004. DOI: 10.1145/1216544.1216550 Cited on page(s) 127

[200] B. K. Rosen. High-level data flow analysis. Commun. ACM, 20:712–724, October 1977. DOI: 10.1145/359842.359849 Cited on page(s) 19

[201] R. Rugina and M. Rinard. Automatic parallelization of divide and conquer algorithms. In Proceedings of the seventh ACM SIGPLAN Symposium on the Principles and Practice of Parallel Programming, PPoPP ’99, pages 72–83, New York, NY, USA, 1999. ACM. DOI: 10.1145/301104.301111 Cited on page(s) 127

[202] S. Rus, L. Rauchwerger, and J. Hoeflinger. Hybrid analysis: static & dynamic memory reference analysis. In ICS, pages 274–284, 2002. DOI: 10.1023/A:1024597010150 Cited on page(s) 37, 126

[203] S. Rus, L. Rauchwerger, and J. Hoeflinger. Hybrid analysis: Static & dynamic memory reference analysis. International Journal of Parallel Programming, 31(4):251–283, 2003. Cited on page(s) 126

[204] R. Russell. The CRAY-1 computer system. Communications of the ACM, 21(1):63–72, January 1978. DOI: 10.1145/359327.359336 Cited on page(s) 1

[205] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP ’08), pages 73–82, 2008. DOI: 10.1145/1345206.1345220 Cited on page(s) 7, 126

[206] J. Sampson, R. Gonzalez, J.-F. Collard, N. P. Jouppi, M. Schlansker, and B. Calder. Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 235–246, Washington, DC, USA, 2006. IEEE Computer Society. DOI: 10.1109/MICRO.2006.23 Cited on page(s) 126

[207] Intel Core i5-2540M processor. http://ark.intel.com/products/50072/Intel-Core-i5-2540M-Processor-(3M-Cache-2_60-GHz). Cited on page(s) 9

[208] D. A. Schmidt. Data flow analysis is model checking of abstract interpretations. In 1998 ACM Conference on the Principles of Programming Languages (POPL ’98), pages 38–48, 1998. DOI: 10.1145/268946.268950 Cited on page(s) 29

[209] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988. DOI: 10.1145/42190.42277 Cited on page(s) 43, 44, 45, 46, 126

[210] Z. Shen, Z. Li, and P. Yew. An empirical study of Fortran programs for parallelizing compilers. IEEE Transactions on Parallel and Distributed Systems, 1:356–364, 1990. DOI: 10.1109/71.80162 Cited on page(s) 127

[211] B. J. Smith. A pipelined, shared resource computer. In Proceedings of the International Conference on Parallel Processing, pages 6–8, 1978. Cited on page(s) 1

[212] G. Snelting, T. Robschink, and J. Krinke. Efficient path conditions in dependence graphs for software safety analysis. ACM Trans. Softw. Eng. Methodol., 15:410–457, October 2006. DOI: 10.1145/1178625.1178628 Cited on page(s) 34

[213] M. Snir, J. Dongarra, J. S. Kowalik, S. Huss-Lederman, S. W. Otto, and D. W. Walker. MPI: The Complete Reference. MIT Press, 2nd edition, 1998. Cited on page(s) 128

[214] N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. International Journal of Parallel Programming, 28:363–400, 2000. DOI: 10.1023/A:1007559022013 Cited on page(s) 126

[215] SSE performance programming. http://developer.apple.com/hardwaredrivers/ve/sse.html. Last accessed January 5, 2012. Cited on page(s) 7

[216] Intel compiler options for intel SSE and intel AVX generation (SSE2, SSE3, SSE3_ATOM, SSSE3, SSE4.1, SSE4.2, AVX, AVX2) and processor-specific optimizations. http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/. Last accessed January 5, 2012. Cited on page(s) 7

[217] STAPL: Standard Template Adaptive Parallel Library. https://parasol.tamu.edu/groups/rwergergroup/research/stapl/. Last accessed January 5, 2012. DOI: 10.1145/1815695.1815713 Cited on page(s) 125

[218] J. Steffan and T. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In Proceedings of the Symposium on High-Performance Computer Architecture, pages 2–13, Feb. 1998. DOI: 10.1109/HPCA.1998.650541 Cited on page(s) 129

[219] J. G. Steffan, C. Colohan, A. Zhai, and T. C. Mowry. The STAMPede approach to thread-level speculation. ACM Trans. Comput. Syst., 23:253–300, August 2005. DOI: 10.1145/1082469.1082471 Cited on page(s) 131

[220] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. Improving value communication for thread-level speculation. In Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA), page 65, 2002. DOI: 10.1109/HPCA.2002.995699 Cited on page(s) 131

[221] B. Steffen. Data flow analysis as model checking. In Proceedings of 1991 Conference on the Theoretical Aspects of Computer Science (TACS ’91). Springer-Verlag, 1991. Available as volume 526 of Lecture Notes in Computer Science. DOI: 10.1007/3-540-54415-1_54 Cited on page(s) 29

[222] B. Steffen. Generating data flow analysis algorithms from modal specifications. Science of Computer Programming, 21:115–139, 1993. DOI: 10.1016/0167-6423(93)90003-8 Cited on page(s) 29

[223] H.-M. Su and P.-C. Yew. On data synchronization for multiprocessors. In International Symposium on Computer Architecture, volume 17, pages 416–423, 1989. DOI: 10.1145/74926.74972 Cited on page(s) 126

[224] J. Su and K. A. Yelick. Automatic communication performance debugging in PGAS languages. In Languages and Compilers for Parallel Computing, pages 232–245, 2007. DOI: 10.1007/978-3-540-85261-2_16 Cited on page(s) 126

[225] Z. Sura, X. Fang, C.-L. Wong, S. P. Midkiff, J. Lee, and D. A. Padua. Compiler techniques for high performance sequentially consistent Java programs. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2005), pages 2–13, 2005. DOI: 10.1145/1065944.1065947 Cited on page(s) 49, 126

[226] G. Tanase, A. A. Buss, A. Fidel, Harshvardhan, I. Papadopoulos, O. Pearce, T. G. Smith, N. Thomas, X. Xu, N. Mourad, J. Vu, M. Bianco, N. M. Amato, and L. Rauchwerger. The STAPL parallel container framework. In Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 235–246, 2011. DOI: 10.1145/2038037.1941586 Cited on page(s) 125

[227] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: using data parallelism to program GPUs for general-purpose uses. In Proceedings of the 12th International Conference on Architectural support for Programming Languages and Operating Systems, ASPLOS-XII, pages 325–335, New York, NY, USA, 2006. ACM. DOI: 10.1145/1168919.1168898 Cited on page(s) 7, 126

[228] J. Test, M. Myszewski, and R. Swift. The Alliant FX/series: A language driven architecture for parallel processing of dusty deck Fortran. In J. de Bakker, A. Nijman, and P. Treleaven, editors, PARLE Parallel Architectures and Languages Europe, volume 258 of Lecture Notes in Computer Science, pages 345–356. Springer Berlin / Heidelberg, 1987. DOI: 10.1007/3-540-17943-7 Cited on page(s) 1, 58

[229] Intel threading building blocks 3.0 for open source. http://threadingbuildingblocks.org/. Last accessed January 5, 2012. Cited on page(s) 125

[230] X. Tian, A. Bik, M. Girkar, P. Grey, H. Saito, and E. Su. Intel OpenMP C++/Fortran compiler for hyper-threading technology: Implementation and performance. Intel Technology Journal, 6(01):1–11, 2002. Available at http://download.intel.com/technology/itj/2002/volume06issue01/art04_fortrancompiler/vol6iss1_art04.pdf, last checked January 5, 2012. Cited on page(s) 4

[231] G. Tournavitis, Z. Wang, B. Franke, and M. F. O’Boyle. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 177–187, New York, NY, USA, 2009. ACM. DOI: 10.1145/1542476.1542496 Cited on page(s) 130

[232] R. Touzeau. A Fortran compiler for the FPS-164 Scientific Computer. In SIGPLAN 1984 Symposium on Compiler Construction, pages 48–57, 1984. DOI: 10.1145/502874.502879 Cited on page(s) 127

[233] P. Tu and D. A. Padua. Automatic array privatization. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers, pages 500–521, 1993. DOI: 10.1007/3-540-45403-9_8 Cited on page(s) 127

[234] P. Tu and D. A. Padua. Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. In International Conference on Supercomputing, pages 414–423, 1995. DOI: 10.1145/224538.224648 Cited on page(s) 127

[235] M. Ujaldon, E. L. Zapata, B. M. Chapman, and H. P. Zima. Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation. IEEE Trans. Parallel Distrib. Syst., 8(10):1068–1083, 1997. DOI: 10.1109/71.629489 Cited on page(s) 128

[236] G. Upadhyaya, S. P. Midkiff, and V. S. Pai. Using data structure knowledge for efficient lock generation and strong atomicity. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2010, pages 281–292, 2010. DOI: 10.1145/1837853.1693490 Cited on page(s) 66

[237] Berkeley unified parallel C. http://upc.lbl.gov/. Last accessed January 5, 2012. Cited on page(s) 50, 128

[238] N. Vachharajani, R. Rangan, E. Raman, M.J. Bridges, G. Ottoni, and D.I. August. Speculative decoupled software pipelining. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, PACT ’07, pages 49–59, Washington, DC, USA, 2007. IEEE Computer Society. DOI: 10.1109/PACT.2007.4336199 Cited on page(s) 131

[239] L. Wang, J. M. Stichnoth, and S. Chatterjee. Runtime performance of parallel array assignment: an empirical study. In Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’96, Washington, DC, USA, 1996. IEEE Computer Society. DOI: 10.1145/369028.369036 Cited on page(s) 101, 128

[240] M. N. Wegman and F. K. Zadeck. Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst., 13(2):181–210, 1991. DOI: 10.1145/103135.103136 Cited on page(s) 14, 125

[241] M. Weiser. Program slicing. In Proceedings of the 5th International Conference on Software engineering, pages 439–449, Piscataway, NJ, USA, 1981. IEEE Press. DOI: 10.1109/FOSM.2008.4659249 Cited on page(s) 34, 130

[242] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In ACM Conference on Programming Language Design and Implementation, pages 30–44, 1991. DOI: 10.1145/113446.113449 Cited on page(s) 93, 94, 127

[243] M. E. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452–471, October 1991. DOI: 10.1109/71.97902 Cited on page(s) 93, 94, 127

[244] M. Wolfe. Loop skewing: the wavefront method revisited. Int. J. Parallel Program., 15:279–293, October 1986. DOI: 10.1007/BF01407876 Cited on page(s) 71, 126, 127

[245] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, Los Angeles, California, USA, December 1-4, 1987, pages 357–361. SIAM, 1987. Cited on page(s) 127

[246] M. Wolfe. Multiprocessor synchronization for concurrent loops. IEEE Software, 5(1):34–42, 1988. DOI: 10.1109/52.1992 Cited on page(s) 126

[247] M. Wolfe. Vector optimization vs vectorization. J. Parallel Distrib. Comput., 5(5):551–567, 1988. DOI: 10.1016/0743-7315(88)90012-3 Cited on page(s) 126

[248] M. Wolfe. Beyond induction variables. In Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, pages 162–174, New York, NY, USA, 1992. ACM. DOI: 10.1145/143095.143131 Cited on page(s) 69, 127

[249] M. Wolfe. The definition of dependence distance. ACM Trans. Program. Lang. Syst., 16(4):1114–1116, 1994. DOI: 10.1145/183432.183440 Cited on page(s) 125

[250] M. Wolfe. High performance compilers for parallel computing. Addison-Wesley Publishing Company, 1996. Cited on page(s) 71, 84, 126, 127

[251] M. Wolfe. Parallelizing compilers. ACM Comput. Surv., 28:261–262, March 1996. DOI: 10.1145/234313.234417 Cited on page(s) 125, 126

[252] M. Wolfe and C. W. Tseng. The Power Test for data dependence. IEEE Trans. Parallel Distrib. Syst., 3:591–601, September 1992. DOI: 10.1109/71.159042 Cited on page(s) 37, 126

[253] P. Wu, A. Cohen, and D. Padua. Induction variable analysis without idiom recognition: beyond monotonicity. In Proceedings of the 14th International Conference on Languages and compilers for parallel computing, pages 427–441, Berlin, Heidelberg, 2003. Springer-Verlag. DOI: 10.1007/3-540-35767-X_28 Cited on page(s) 127

[254] K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong, M. J. Garzarán, D. A. Padua, K. Pingali, P. Stodghill, and P. Wu. A comparison of empirical and model-driven optimization. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 63–76, 2003. DOI: 10.1145/781131.781140 Cited on page(s) 128

[255] A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry. Compiler optimization of memory-resident value communication between speculative threads. In Proceedings of the International Symposium on Code Generation and Optimization: feedback-directed and runtime optimization, CGO ’04, pages 39–, Washington, DC, USA, 2004. IEEE Computer Society. DOI: 10.1109/CGO.2004.1281662 Cited on page(s) 131

[256] H. Zhong, M. Mehrara, S. A. Lieberman, and S. A. Mahlke. Uncovering hidden loop level parallelism in sequential applications. In 14th International Symposium on High Performance Computer Architecture (HPCA), pages 290–301, 2008. DOI: 10.1109/HPCA.2008.4658647 Cited on page(s) 131

[257] W. Zhu, V. C. Sreedhar, Z. Hu, and G. R. Gao. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In Proceedings of the 34th annual International Symposium on Computer Architecture, ISCA ’07, pages 35–45, New York, NY, USA, 2007. ACM. DOI: 10.1145/1273440.1250668 Cited on page(s) 126

[258] C. Zilles and G. Sohi. Execution-based prediction using speculative slices. In Proceedings of the 28th annual International Symposium on Computer Architecture, ISCA ’01, pages 2–13, New York, NY, USA, 2001. ACM. DOI: 10.1145/379240.379246 Cited on page(s) 130

[259] C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th annual ACM/IEEE International Symposium on Microarchitecture, MICRO 35, pages 85–96, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press. DOI: 10.1109/MICRO.2002.1176241 Cited on page(s) 130
