References

[1] M. Wehner, L. Oliker, and J. Shalf. A real cloud computer. IEEE Spectrum, 46(10):24–29, 2009.

[2] B. Wilkinson and M. Allen. Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed. Toronto, Canada: Pearson, 2004.

[3] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, 2nd ed. Reading, MA: Addison Wesley, 2003.

[4] Standards Coordinating Committee 10, Terms and Definitions. The IEEE Standard Dictionary of Electrical and Electronics Terms, J. Radatz, Ed. IEEE, 1996.

[5] F. Elguibaly (Gebali). α-CORDIC: An adaptive CORDIC algorithm. Canadian Journal on Electrical and Computer Engineering, 23:133–138, 1998.

[6] F. Elguibaly (Gebali). HCORDIC: A high-radix adaptive CORDIC algorithm. Canadian Journal on Electrical and Computer Engineering, 25(4):149–154, 2000.

[7] J.S. Walther. A unified algorithm for elementary functions. In Proceedings of the 1971 Spring Joint Computer Conference, N. Macon, Ed. American Federation of Information Processing Societies, Montvale, NJ, May 18–20, 1971, pp. 379–385.

[8] J.E. Volder. The CORDIC Trigonometric Computing Technique. IRE Transactions on Electronic Computers, EC-8(3):330–334, 1959.

[9] R.M. Karp, R.E. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. Journal of the Association for Computing Machinery, 14:563–590, 1967.

[10] V.P. Roychowdhury and T. Kailath. Study of parallelism in regular iterative algorithms. In Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, Crete, Greece, F. T. Leighton, Ed. Association for Computing Machinery, 1990, pp. 367–376.

[11] H.V. Jagadish, S.K. Rao, and T. Kailath. Multiprocessor architectures for iterative algorithms. Proceedings of the IEEE, 75(9):1304–1321, 1987.

[12] D.I. Moldovan. On the design of algorithms for VLSI systolic arrays. Proceedings of the IEEE, 81:113–120, 1983.

[13] F. Gebali, H. Elmiligi, and M.W. El-Kharashi. Networks on Chips: Theory and Practice. Boca Raton, FL: CRC Press, 2008.

[14] B. Prince. Speeding up system memory. IEEE Spectrum, 2:38–41, 1994.

[15] J.L. Gustafson. Reevaluating Amdahl’s law. Communications of the ACM, pp. 532–533, 1988.

[16] W.H. Press. Discrete radon transform has an exact, fast inverse and generalizes to operations other than sums along lines. Proceedings of the National Academy of Sciences, 103(51):19249–19254, 2006.

[17] F. Pappetti and S. Succi. Introduction to Parallel Computational Fluid Dynamics. New York: Nova Science Publishers, 1996.

[18] W. Stallings. Computer Organization and Architecture. Upper Saddle River, NJ: Pearson/Prentice Hall, 2007.

[19] C. Hamacher, Z. Vranesic, and S. Zaky. Computer Organization, 5th ed. New York: McGraw-Hill, 2002.

[20] D.A. Patterson and J.L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. San Francisco, CA: Morgan Kaufmann, 2008.

[21] F. Elguibaly (Gebali). A fast parallel multiplier-accumulator using the modified Booth algorithm. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47:902–908, 2000.

[22] F. Elguibaly (Gebali). Merged inner-product processor using the modified Booth algorithm. Canadian Journal on Electrical and Computer Engineering, 25(4):133–139, 2000.

[23] S. Sunder, F. Elguibaly (Gebali), and A. Antoniou. Systolic implementation of digital filters. Multidimensional Systems and Signal Processing, 3:63–78, 1992.

[24] T. Ungerer, B. Robič, and J. Šilc. Multithreaded processors. Computer Journal, 45(3):320–348, 2002.

[25] M. Johnson. Superscalar Microprocessor Design. Englewood Cliffs, NJ: Prentice Hall, 1990.

[26] M.J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, 54(12):1901–1909, 1966.

[27] M. Tomasevic and V. Milutinovic. Hardware approaches to cache coherence in shared-memory multiprocessors: Part 1. IEEE Micro, 14(5):52–59, 1994.

[28] F. Gebali. Analysis of Computer and Communication Networks. New York: Springer, 2008.

[29] T.G. Lewis and H. El-Rewini. Introduction to Parallel Computing. Englewood Cliffs, NJ: Prentice Hall, 1992.

[30] J. Zhang, T. Ke, and M. Sun. The parallel computing based on cluster computer in the processing of mass aerial digital images. In International Symposium on Information Processing, F. Yu and Q. Lou, Eds. IEEE Computer Society, Moscow, May 23–25, 2008, pp. 398–393.

[31] AMD. Computing: The road ahead. http://hpcrd.lbl.gov/SciDAC08/files/presentations/SciDAC_Reed.pdf, 2008.

[32] B.K. Khailany, T. Williams, J. Lin, E.P. Long, M. Rygh, D.W. Tovey, and W.J. Dally. A programmable 512 GOPS stream processor for signal, image, and video processing. IEEE Journal of Solid-State Circuits, 43(1):202–213, 2008.

[33] B. Burke. NVIDIA CUDA technology dramatically advances the pace of scientific research. http://www.nvidia.com/object/io_1229516081227.html?_templated=320, 2009.

[34] S. Rixner, W.J. Dally, U.J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J.D. Owens. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st Annual International Symposium on Microarchitecture. Los Alamitos, CA: IEEE Computer Society Press, 1998, pp. 3–13.

[35] H. El-Rewini and T.G. Lewis. Distributed and Parallel Computing. Greenwich, CT: Manning Publications, 1998.

[36] E.W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, 1965.

[37] D.E. Culler, J.P. Singh, and A. Gupta. Parallel Computer Architecture. San Francisco, CA: Morgan Kaufmann, 1999.

[38] A.S. Tanenbaum and A.S. Woodhull. Operating Systems: Design and Implementation. Englewood Cliffs, NJ: Prentice Hall, 1997.

[39] W. Stallings. Operating Systems: Internals and Design Principles. Upper Saddle River, NJ: Prentice Hall, 2005.

[40] A. Silberschatz, P.B. Galvin, and G. Gagne. Operating System Concepts. New York: John Wiley, 2009.

[41] M.J. Young. Recent developments in mutual exclusion for multiprocessor systems. http://www.mjyonline.com/MutualExclusion.htm, 2010.

[42] Sun Microsystems. Multithreading Programming Guide. Santa Clara, CA: Sun Microsystems, 2008.

[43] F. Gebali. Design and analysis of arbitration protocols. IEEE Transactions on Computers, 38(2):161–171, 1989.

[44] S.W. Fuhrmann. Performance of a packet switch with crossbar architecture. IEEE Transactions on Communications, 41:486–491, 1993.

[45] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32:406–424, 1953.

[46] R.J. Simcoe and T.-B. Pei. Perspectives on ATM switch architecture and the influence of traffic pattern assumptions on switch design. Computer Communication Review, 25:93–105, 1995.

[47] K. Wang, J. Huang, Z. Li, X. Wang, F. Yang, and J. Bi. Scaling behavior of internet packet delay dynamics based on small-interval measurements. In The IEEE Conference on Local Computer Networks, H. Hassanein and M. Waldvogel, Eds. IEEE Computer Society, Sydney, Australia, November 15–17, 2005, pp. 140–147.

[48] M.J. Quinn. Parallel Programming. New York: McGraw-Hill, 2004.

[49] C.E. Leiserson and I.B. Mirman. How to Survive the Multicore Software Revolution. Lexington, MA: Cilk Arts, 2009.

[50] Cilk Arts. Smooth path to multicores. http://www.cilk.com/, 2009.

[51] OpenMP. OpenMP: The OpenMP API specification for parallel programming. http://openmp.org/wp/, 2009.

[52] G. Ippolito. YoLinux tutorial index. http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html, 2004.

[53] M. Soltys. Operating systems concepts. http://www.cas.mcmaster.ca/~soltys/cs3sh3-w03/, 2003.

[54] G. Hillar. Visualizing parallelism and concurrency in Visual Studio 2010 Beta 2. http://www.drdobbs.com/windows/220900288, 2009.

[55] C.E. Leiserson. The Cilk++ concurrency platform. Journal of Supercomputing, 51(3):244–257, 2009.

[56] C. Carmona. Programming the thread pool in the .net framework. http://msdn.microsoft.com/en-us/library/ms973903.aspx, 2002.

[57] MPI Forum. Message passing interface forum. http://www.mpi-forum.org/, 2008.

[58] G.E. Blelloch. NESL: A parallel programming language. http://www.cs.cmu.edu/~scandal/nesl.html, 2009.

[59] S. Amanda. Intel’s Ct Technology Code Samples, April 6, 2010, http://software.intel.com/en-us/articles/intels-ct-technology-code-samples/.

[60] Threading Building Blocks. Intel Threading Building Blocks 2.2 for open source. http://www.threadingbuildingblocks.org/, 2009.

[61] S. Patuel. Design: Task parallel library explored. http://blogs.msdn.com/salvapatuel/archive/2007/11/11/task-parallel-library-explored.aspx, 2007.

[62] N. Furmento, Y. Roudier, and G. Siegel. Survey on C++ parallel extensions. http://www-sop.inria.fr/sloop/SCP/, 2009.

[63] D. McCrady. Avoiding contention using combinable objects. http://blogs.msdn.com/b/nativeconcurrency/archive/2008/09/25/avoiding-contention-usingcombinable-objects.aspx, 2008.

[64] Intel. Intel Cilk++ SDK programmer’s guide. http://software.intel.com/en-us/articles/intel-cilk/, 2009.

[65] R.D. Blumofe and C.E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46(5), 1999.

[66] J. Mellor-Crummey. Comp 422 parallel computing lecture notes and handouts. http://www.clear.rice.edu/comp422/lecture-notes/, 2009.

[67] M. Frigo, P. Halpern, C.E. Leiserson and S. Lewin-Berlin. Reducers and other Cilk++ hyperobjects, ACM Symposium on Parallel Algorithms and Architectures, Calgary, Alberta, Canada, pp. 79–90, August 11–13, 2009.

[68] B.C. Kuszmaul. Rabin–Karp string matching using Cilk++, 2009. http://software.intel.com/file/21631.

[69] B. Barney. OpenMP. http://computing.llnl.gov/tutorials/openMP/, 2009.

[70] OpenMP. Summary of OpenMP 3.0 c/c++ syntax. http://openmp.org/mp-documents/OpenMP3.0-SummarySpec.pdf, 2009.

[71] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53, 2008.

[72] P.N. Glaskowsky. NVIDIA’s Fermi: The first complete GPU computing architecture, 2009. http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA’s_Fermi-The_First_Complete_GPU_Architecture.pdf.

[73] NVIDIA. NVIDIA’s next generation CUDA computer architecture: Fermi, 2009. http://www.nvidia.com/object/fermi_architecture.html.

[74] X. Li. CUDA programming. http://dynopt.ece.udel.edu/cpeg455655/lec8_cudaprogramming.pdf.

[75] D. Kirk and W.-M. Hwu. ECE 498 AL: Programming massively parallel processors. http://courses.ece.illinois.edu/ece498/al/, 2009.

[76] NVIDIA. NVIDIA CUDA Library Documentation 2.3. http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/index.html, 2010.

[77] Y. Wu. Parallel decomposed simplex algorithms and loop spreading. PhD thesis, Oregon State University, 1988.

[78] J.H. McClellan, R.W. Schafer, and M.A. Yoder. Signal Processing First. Upper Saddle River, NJ: Pearson/Prentice Hall, 2003.

[79] V.K. Ingle and J.G. Proakis. Digital Signal Processing Using MATLAB. Pacific Grove, CA: Brooks/Cole Thomson Learning, 2000.

[80] H.T. Kung. Why systolic architectures. IEEE Computer Magazine, 15:37–46, 1982.

[81] S.Y. Kung. VLSI Array Processors. Englewood Cliffs, NJ: Prentice Hall, 1988.

[82] G.L. Nemhauser and L.A. Wolsey. Integer and Combinatorial Optimization. New York: John Wiley, 1988.

[83] F.P. Preparata and M.I. Shamos. Computational Geometry. New York: Springer-Verlag, 1985.

[84] A. Schrijver. Theory of Linear and Integer Programming. New York: John Wiley, 1986.

[85] D.S. Watkins. Fundamentals of Matrix Computations. New York: John Wiley, 1991.

[86] F. El-Guibaly (Gebali) and A. Tawfik. Mapping 3D IIR digital filter onto systolic arrays. Multidimensional Systems and Signal Processing, 7(1):7–26, 1996.

[87] S. Sunder, F. Elguibaly (Gebali), and A. Antoniou. Systolic implementation of two-dimensional recursive digital filters. In Proceedings of the IEEE Symposium on Circuits and Systems, New Orleans, LA, May 1–3, 1990, H. Gharavi, Ed. IEEE Circuits and Systems Society, pp. 1034–1037.

[88] M.A. Sid-Ahmed. A systolic realization of 2-D filters. In IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-37, IEEE Acoustics, Speech and Signal Processing Society, 1989, pp. 560–565.

[89] D. Esteban and C. Galland. Application of quadrature mirror filters to split band voice coding systems. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Hartford, CT, May 9–11, 1977, F. F. Tsui, Ed. IEEE Acoustics, Speech and Signal Processing Society, pp. 191–195.

[90] J.W. Woods and S.D. O’Neil. Sub-band coding of images. In IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, IEEE Acoustics, Speech and Signal Processing Society, 1986, pp. 1278–1288.

[91] H. Gharavi and A. Tabatabai. Sub-band coding of monochrome and color images. In IEEE Transactions on Circuits and Systems, Vol. CAS-35, IEEE Circuits and Systems Society, 1988, pp. 207–214.

[92] R. Ramachandran and P. Kabal. Bandwidth efficient transmultiplexers. Part 1: Synthesis. IEEE Transactions on Signal Processing, 40:70–84, 1992.

[93] G. Jovanovic-Dolecek. Multirate Systems: Design and Applications. Hershey, PA: Idea Group Publishing, 2002.

[94] R.E. Crochiere and L.R. Rabiner. Multirate Signal Processing. Englewood Cliffs, NJ: Prentice Hall, 1983.

[95] R.E. Crochiere and L.R. Rabiner. Interpolation and decimation of digital signals—A tutorial review. Proceedings of the IEEE, 69(3):300–331, 1981.

[96] E. Abdel-Raheem. Design and VLSI implementation of multirate filter banks. PhD thesis, University of Victoria, 1995.

[97] E. Abdel-Raheem, F. Elguibaly (Gebali), and A. Antoniou. Design of low-delay FIR QMF banks using the Lagrange-multiplier approach. In IEEE 37th Midwest Symposium on Circuits and Systems, Lafayette, LA, August 3–5, M. A. Bayoumi and W. K. Jenkins, Eds. IEEE Circuits and Systems Society, 1994, pp. 1057–1060.

[98] E. Abdel-Raheem, F. Elguibaly (Gebali), and A. Antoniou. Systolic implementations of polyphase decimators and interpolators. In IEEE 37th Midwest Symposium on Circuits and Systems, Lafayette, LA, August 3–5, M. A. Bayoumi and W. K. Jenkins, Eds. IEEE Circuits and Systems Society, 1994, pp. 749–752.

[99] A. Rafiq, M.W. El-Kharashi, and F. Gebali. A fast string search algorithm for deep packet classification. Computer Communications, 27(15):1524–1538, 2004.

[100] F. Gebali and A. Rafiq. Processor array architectures for deep packet classification. IEEE Transactions on Parallel and Distributed Systems, 17(3):241–252, 2006.

[101] A. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. Boca Raton, FL: CRC Press, 1997.

[102] B. Schneier. Applied Cryptography. New York: John Wiley, 1996.

[103] W. Stallings. Cryptography and Network Security: Principles and Practice. Englewood Cliffs, NJ: Prentice Hall, 2005.

[104] A. Reyhani-Masoleh and M.A. Hasan. Low complexity bit parallel architectures for polynomial basis multiplication over GF(2^m). IEEE Transactions on Computers, 53(8):945–959, 2004.

[105] T. Zhang and K.K. Parhi. Systematic design of original and modified Mastrovito multipliers for general irreducible polynomials. IEEE Transactions on Computers, 50(7):734–749, 2001.

[106] C.-L. Wang and J.-H. Guo. New systolic arrays for c + ab^2, inversion, and division in GF(2^m). IEEE Transactions on Computers, 49(10):1120–1125, 2000.

[107] C.-Y. Lee, C.W. Chiou, A.-W. Deng, and J.-M. Lin. Low-complexity bit-parallel systolic architectures for computing a(x)b^2(x) over GF(2^m). IEE Proceedings on Circuits, Devices & Systems, 153(4):399–406, 2006.

[108] N.-Y. Kim, H.-S. Kim, and K.-Y. Yoo. Computation of a(x)b^2(x) multiplication in GF(2^m) using low-complexity systolic architecture. IEE Proceedings on Circuits, Devices & Systems, 150(2):119–123, 2003.

[109] C. Yeh, I.S. Reed, and T.K. Truong. Systolic multipliers for finite fields GF(2^m). IEEE Transactions on Computers, C-33(4):357–360, 1984.

[110] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. New York: Springer-Verlag, 2004.

[111] M. Fayed. A security coprocessor for next generation IP telephony architecture, abstraction, and strategies. PhD thesis, ECE Department, University of Victoria, Victoria, BC, 2007.

[112] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases. Information and Computation, 78(3):171–177, 1988.

[113] A. Goldsmith. Wireless Communications. New York: Cambridge University Press, 2005.

[114] M. Abramovici, M.A. Breuer, and A.D. Friedman. Digital Systems Testing and Testable Design. New York: Computer Science Press, 1990.

[115] M.J.S. Smith. Application-Specific Integrated Circuits. New York: Addison Wesley, 1997.

[116] M. Fayed, M.W. El-Kharashi, and F. Gebali. A high-speed, low-area processor array architecture for multiplication and squaring over GF(2^m). In Proceedings of the Second IEEE International Design and Test Workshop (IDT 2007), 2007, Y. Zorian, H. ElTahawy, A. Ivanov, and A. Salem, Eds. Cairo, Egypt: IEEE, pp. 226–231.

[117] M. Fayed, M.W. El-Kharashi, and F. Gebali. A high-speed, high-radix, processor array architecture for real-time elliptic curve cryptography over GF(2^m). In Proceedings of the 7th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2007), Cairo, Egypt, E. Abdel-Raheem and A. El-Desouky, Eds. December 15–18, IEEE Signal Processing Society and IEEE Computer Society, 2007, pp. 57–62.

[118] F. Gebali, M. Rehan, and M.W. El-Kharashi. A hierarchical design methodology for full-search block matching motion estimation. Multidimensional Systems and Signal Processing, 17:327–341, 2006.

[119] M. Serra, T. Slater, J.C. Muzio, and D.M. Miller. The analysis of one-dimensional linear cellular automata and their aliasing properties. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 9(7):767–778, 1990.

[120] L.R. Rabiner and B. Gold. Theory and Application of Digital Signal Processing. Upper Saddle River, NJ: Prentice Hall, 1975.

[121] E.H. Wold and A.M. Despain. Pipeline and parallel-pipeline FFT processors for VLSI implementation. IEEE Transactions on Computers, 33(5):414–426, 1984.

[122] G.L. Stuber, J.R. Barry, S.W. McLaughlin, Y. Li, M.A. Ingram, and T.H. Pratt. Broadband MIMO-OFDM wireless systems. Proceedings of the IEEE, 92(2):271–294, 2004.

[123] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.

[124] B. McKinney. The VLSI design of a general purpose FFT processing node, MASc thesis, University of Victoria, 1986.

[125] A.M. Despain. Fourier transform computer using CORDIC iterations. IEEE Transactions on Computers, C-23(10):993–1001, 1974.

[126] J.G. Nash. An FFT for wireless protocols. In 40th Annual Hawaii International Conference on System Sciences: Mobile Computing Hardware Architectures, R. H. Sprague, Ed. January 3–6, 2007.

[127] C.-P. Fan, M.-S. Lee, and G.-A. Su. A low multiplier and multiplication costs 256-point FFT implementation with simplified radix-2^4 SDF architecture. In IEEE Asia Pacific Conference on Circuits and Systems APCCAS, December 4–7, Singapore: IEEE, 2006, pp. 1935–1938.

[128] S. He and M. Torkelson. A new approach to pipeline FFT processor. In Proceedings of IPPS ’96: The 10th International Parallel Processing Symposium, Honolulu, Hawaii, April 15–19, IEEE Computer Society, K. Hwang, Ed., 1996, pp. 766–770.

[129] G.H. Golub and C.F. Van Loan. Matrix Computations, 2nd ed. Baltimore, MD: The Johns Hopkins University Press, 1989.

[130] I. Jacques and C. Judd. Numerical Analysis. New York: Chapman and Hall, 1987.

[131] R.L. Burden, J.D. Faires, and A.C. Reynolds. Numerical Analysis. Boston: Prindle, Weber & Schmidt, 1978.

[132] D.E. Knuth. The Art of Computer Programming, vol. 3: Sorting and Searching. New York: Addison-Wesley, 1973.
