References


Abnous and Bagherzadeh, 1994: Abnous, A. and Bagherzadeh, N. (1994). Pipelining and Bypassing in a VLIW Processor. IEEE Transactions on Parallel and Distributed Systems, 5(6):658-664.
Abraham et al., 1996: Abraham, S., Kathail, V., and Deitrich, B. (1996). Meld Scheduling: Relaxing Scheduling Constraints across Region Boundaries. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 308-321, Paris, France.
Aho et al., 1985: Aho, A. V., Sethi, R., and Ullman, J. D. (1985). Compilers: Principles, Techniques and Tools. Addison-Wesley Series in Computer Science. Addison-Wesley Publishing Company, Reading, Massachusetts.
Aiken and Nicolau, 1988: Aiken, A. and Nicolau, A. (1988). Optimal Loop Parallelization. In Proceedings of the SIGPLAN'88 conference on Programming Language Design and Implementation, pages 308-317, Atlanta, Georgia.
Arnold and Corporaal, 1997: Arnold, M. and Corporaal, H. (1997). Data Transport Reduction in Move Processors. In Third Annual Conference of ASCI, The Netherlands.
Austin et al., 1995: Austin, T. M., Pnevmatikatos, D. N., and Sohi, G. S. (1995). Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 82-92, Michigan.
Ball and Larus, 1993: Ball, T. and Larus, J. (1993). Branch Prediction for Free. In Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, pages 300-313.
Ball and Larus, 1996: Ball, T. and Larus, J. (1996). Efficient Path Profiling. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 46-57, Paris, France.
Beaty, 1991: Beaty, S. J. (1991). Genetic Algorithms and Instruction Scheduling. In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 206-211, Albuquerque.
Beckmann, 1994: Beckmann, C. J. (1994). Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development.
Bernstein and Rodeh, 1991: Bernstein, D. and Rodeh, M. (1991). Global instruction scheduling for superscalar machines. In Proceedings of the ACM SIGPLAN 1991 conference on Programming Language Design and Implementation, pages 241-255.
Briggs, 1992: Briggs, P. (1992). Register Allocation via Graph Coloring. PhD thesis, Rice University.
Brownhill et al., 1997: Brownhill, C., Nicolau, A., Novack, S., and Polychronopoulos, C. (1997). The PROMIS Compiler Prototype. In Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques, San Francisco, USA.
Burger et al., 1996a: Burger, D., Goodman, J. R., and Kägi, A. (1996a). Memory Bandwidth Limitations of Future Microprocessors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 78-89, Philadelphia, Pennsylvania. ACM SIGARCH and IEEE Computer Society TCCA.
Burger et al., 1996b: Burger, D., Kaxiras, S., and Goodman, J. R. (1996b). DataScalar Architectures. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, Pennsylvania. ACM SIGARCH and IEEE Computer Society TCCA.
Burger et al., 1996c: Burger, D., Kaxiras, S., and Goodman, J. R. (1996c). DataScalar Architectures and the SPSD Execution Model. Technical Report TR 1317, University of Wisconsin-Madison Computer Sciences Department.
Calder et al., 1997: Calder, B., Feller, P., and Eustace, A. (1997). Value Profiling. In [MICRO30, 1997], pages 259-269.
Calder and Grunwald, 1995: Calder, B. and Grunwald, D. (1995). Next Cache Line and Set Prediction. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 287-296, Santa Margherita Ligure, Italy.
Capitanio et al., 1992: Capitanio, A., Dutt, N., and Nicolau, A. (1992). Partitioned Register Files for VLIWs: A Preliminary Analysis of Tradeoffs. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 292-300, Portland.
Chang et al., 1995: Chang, P.-Y., Hao, E., and Patt, Y. (1995). Alternative Implementations of Hybrid Branch Predictors. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 252-257, Michigan.
Chang et al., 1997: Chang, P.-Y., Hao, E., and Patt, Y. N. (1997). Target Prediction for Indirect Jumps. In [ISCA24, 1997].
Chang et al., 1996: Chang, P.-Y., Hao, E., Patt, Y. N., and Chang, P. P. (1996). Using Predicated Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution. International Journal of Parallel Programming, 24(3).
Chekuri et al., 1996: Chekuri, C. et al. (1996). Profile-Driven Instruction Level Parallel Scheduling with Application to Super Blocks. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 58-67, Paris, France.
Chen et al., 1996: Chen, I.-C. K., Coffey, J. T., and Mudge, T. N. (1996). Analysis of Branch Prediction Via Data Compression. In ASPLOS VII, pages 128-137, Cambridge, Massachusetts.
Chen et al., 1994: Chen, W., Mahlke, S., Warter, N., Anik, S., and Hwu, W. (1994). Profile assisted instruction scheduling. Int. J. of Parallel Programming, 22(2):151-181.
Clark, 1987: Clark, D. W. (1987). Pipelining and Performance in the VAX 8800 Processor. In Proceedings of ASPLOS-II, pages 173-177, Palo Alto, California.
Cogswell, 1995: Cogswell, B. H. (1995). Timing insensitive binary-to-binary translation. PhD thesis, Carnegie Mellon University.
Colwell et al., 1987: Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K. (1987). A VLIW Architecture for a Trace Scheduling Compiler. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 180-192. ACM. SIGPLAN Notices Vol. 22, No. 10.
Conte et al., 1996: Conte, T. M. et al. (1996). Instruction Fetch Mechanisms for VLIW Architectures with Compressed Encodings. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 201-211, Paris, France.
Conte and Sathaye, 1995: Conte, T. M. and Sathaye, S. W. (1995). Dynamic Rescheduling: A Technique for Object Code Compatibility in VLIW Architectures. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 208-218, Ann Arbor, Michigan.
Corporaal, 1997: Corporaal, H. (1997). Microprocessor Architectures; from VLIW to TTA. John Wiley. ISBN 0-471-97157-X.
Dehnert and Towle, 1993: Dehnert, J. C. and Towle, R. A. (1993). Compiling for the Cydra 5. The Journal of Supercomputing, 7(1/2):181-228.
Diep et al., 1995: Diep, T. A., Nelson, C., and Shen, J. P. (1995). Performance Evaluation of the PowerPC 620 Microarchitecture. In The 22nd Annual International Symposium on Computer Architecture, pages 163-174.
Dubey, 1997: Dubey, P. K. (1997). Architectural and Design Implications of Mediaprocessing. Tutorial, Hot Chips IX, Stanford.
Dubey et al., 1995: Dubey, P. K., O'Brien, K., O'Brien, K., and Barton, C. (1995). Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-Assisted Fine-Grained Multithreading. In International Conference on Parallel Architectures and Compilation Techniques, pages 109-121.
Dunn and Hsu, 1996: Dunn, D. A. and Hsu, W.-C. (1996). Instruction Scheduling for the HP PA-8000. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 298-307, Paris, France.
Dwyer and Torng, 1992: Dwyer, H. and Torng, H. (1992). An Out-of-Order Superscalar Processor with Speculative Execution and Fast, Precise Interrupts. In Proceedings of the 25th Annual International Workshop on Microprogramming, pages 272-281, Portland, Oregon.
Ebcioğlu, 1987: Ebcioğlu, K. (1987). A Compilation Technique for Software Pipelining of Loops with Conditional Jumps. In Proceedings of the 20th Annual Workshop on Microprogramming.
Ebcioğlu and Altman, 1997: Ebcioğlu, K. and Altman, E. (1997). DAISY: Dynamic Compilation for 100% Architectural Compatibility. In Proceedings of the 24th Annual International Symposium on Computer Architecture, Denver, Colorado.
Ebcioğlu et al., 1997: Ebcioğlu, K., Altman, E., and Hokenek, E. (1997). A JAVA ILP Machine Based on Fast Dynamic Compilation. In IEEE MASCOTS International Workshop on Security and Efficiency Aspects of Java, Eilat, Israel.
Ebcioğlu et al., 1994: Ebcioğlu, K., Groves, R., Kim, K., Silberman, G., and Ziv, I. (1994). VLIW Compilation Techniques in a Superscalar Environment. ACM SIGPLAN Notices, (PLDI'94), 29(6):36-48.
Ebcioğlu and Nakatani, 1989: Ebcioğlu, K. and Nakatani, T. (1989). A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture. In Proceedings of the Second Workshop on Programming Languages and Compilers for Parallel Computing, University of Illinois at Urbana-Champaign.
Eickemeyer and Vassiliadis, 1993: Eickemeyer, R. J. and Vassiliadis, S. (1993). A load-instruction unit for pipelined processors. IBM Journal of Research and Development, 37(4):547-564.
Ellis, 1986: Ellis, J. R. (1986). Bulldog: A Compiler for VLIW Architectures. ACM Doctoral Dissertation Awards. MIT Press, Cambridge, Massachusetts.
Emer and Gloy, 1997: Emer, J. and Gloy, N. (1997). A language for describing predictors and its application to automatic synthesis. In [ISCA24, 1997].
Ertl and Krall, 1994: Ertl, M. A. and Krall, A. (1994). Delayed Exceptions - Speculative Execution of Trapping Instructions. In Lecture Notes in Computer Science 786, Compiler Construction, pages 158-171. Springer-Verlag.
Chaitin et al., 1981: Chaitin, G. J. et al. (1981). Register Allocation via Coloring. Computer Languages, 6:47-57.
Farkas et al., 1997a: Farkas, K. I., Chow, P., Jouppi, N. P., and Vranesic, Z. (1997a). The Multicluster Architecture: Reducing Cycle Time Through Partitioning. In [MICRO30, 1997], pages 149-159.
Farkas et al., 1997b: Farkas, K. I., Jouppi, N. P., and Chow, P. (1997b). Register File Design Considerations in Dynamically Scheduled Processors. Technical Report Research Report 95/10, Digital Western Research Laboratory.
Fisher et al., 1996: Fisher, J., Faraboschi, P., and Desoli, G. (1996). Custom-Fit Processors: Letting Applications Define Architectures. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 324-335, Paris, France.
Fisher, 1981: Fisher, J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, C-30(7):478-490.
Fisher and Freudenberger, 1992: Fisher, J. A. and Freudenberger, S. M. (1992). Predicting conditional branch directions from previous runs of a program. In Proceedings of ASPLOS-V, pages 85-97, Boston.
Foley, 1996: Foley, P. (1996). The Mpact Media Processor Redefines the Multimedia PC. In CompCon '96 conference proceedings, Santa Clara.
Gabbay and Mendelson, 1997: Gabbay, F. and Mendelson, A. (1997). Can Program Profiling Support Value Prediction? In [MICRO30, 1997], pages 270-280.
Gibbons and Muchnick, 1986: Gibbons, P. B. and Muchnick, S. S. (1986). Efficient Instruction Scheduling for a Pipelined Architecture. In Proceedings of the SIGPLAN Symposium on Compiler Construction, pages 11-16.
Gieseke, 1997: Gieseke, B. (1997). A 600MHz Superscalar RISC Microprocessor with Out-of-Order Execution. In IEEE International Solid-State Circuits Conference.
Gillies et al., 1996: Gillies, D. M. et al. (1996). Global Predicate Analysis and its Application to Register Allocation. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 114-125, Paris, France.
Girkar and Polychronopoulos, 1994: Girkar, M. and Polychronopoulos, C. D. (1994). The Hierarchical Task Graph as a Universal Intermediate Representation. International Journal of Parallel Programming, 22(5):519-551.
Glossner and Vassiliadis, 1997: Glossner, C. J. and Vassiliadis, S. (1997). The DELFT-JAVA Engine: An Introduction. In Third Int. Euro-Par Conference, pages 766-770, Passau, Germany.
Gloy et al., 1996: Gloy, N., Young, C., Chen, J. B., and Smith, M. D. (1996). An Analysis of Dynamic Branch Prediction Schemes on System Workloads. In ISCA-23.
González et al., 1997: González, A., Valero, M., Topham, N., and Parcerisa, J. M. (1997). Eliminating Cache Conflict Misses Through XOR-Based Placement Functions. In Proc. of the ACM Int. Conf. on Supercomputing, pages 76-83, Vienna, Austria.
Granlund and Kenner, 1992: Granlund, T. and Kenner, R. (1992). Eliminating Branches using a Superoptimizer and the GNU C Compiler. In ACM SIGPLAN, pages 341-352.
Grunwald et al., 1995: Grunwald, D. et al. (1995). Corpus-Based Static Branch Prediction. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation.
Hank et al., 1995: Hank, R. E., Hwu, W. W., and Rau, B. R. (1995). Region-based compilation: an introduction and motivation. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 158-168, Michigan.
Hennessy and Patterson, 1996: Hennessy, J. L. and Patterson, D. A. (1996). Computer Architecture, a Quantitative Approach, Second Edition. Morgan Kaufmann publishers.
Holler, 1996: Holler, A. M. (1996). Optimization for a Superscalar Out-of-Order Machine. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 336-348, Paris, France.
Hoogerbrugge, 1996: Hoogerbrugge, J. (1996). Code generation for Transport Triggered Architectures. PhD thesis, Delft Univ. of Technology. ISBN 90-9009002-9.
Hordijk and Corporaal, 1997a: Hordijk, J. and Corporaal, H. (1997a). A Comparison of Different Multithreading Architectures. Technical Report 1-68340-44(1997)11, Department of Electrical Engineering, Delft University of Technology.
Hordijk and Corporaal, 1997b: Hordijk, J. and Corporaal, H. (1997b). The Potential of Exploiting Coarse-Grain Task Parallelism from Sequential Programs. In HPCN Europe '97, The International Conference and Exhibition on High-Performance Computing and Networking, Vienna, Austria.
Hsieh et al., 1996: Hsieh, C.-H. A., Gyllenhaal, J. C., and Hwu, W. W. (1996). Java Bytecode to Native Code Translation: The Caffeine Prototype and Preliminary Results. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 90-97, Paris, France.
Hsu and Davidson, 1986: Hsu, P. Y. T. and Davidson, E. S. (1986). Highly Concurrent Scalar Processing. In Proceedings of ISCA-13, pages 386-395.
Huff, 1993: Huff, R. A. (1993). Lifetime-Sensitive Modulo Scheduling. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258-267.
Hunt, 1995: Hunt, D. (1995). Advanced Performance Features of the 64-bit PA-8000. In COMPCON 1995 Digest of Papers, pages 123-128.
Hwu et al., 1993: Hwu, W. W. et al. (1993). The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7(1/2):229-248.
Hwu and Patt, 1987: Hwu, W. W. and Patt, Y. N. (1987). Checkpoint Repair for High-Performance Out-of-Order Execution Machines. IEEE Transactions on Computers, C-36(12).
ISCA24, 1997: ISCA24 (1997). Proceedings of the 24th Annual International Symposium on Computer Architecture, Denver, Colorado. ACM SIGARCH and IEEE Computer Society TCCA.
Jacobson et al., 1997: Jacobson, Q., Rotenberg, E., and Smith, J. E. (1997). Path-Based Next Trace Prediction. In [MICRO30, 1997], pages 14-23.
Janssen and Corporaal, 1995: Janssen, J. and Corporaal, H. (1995). Partitioned Register Files for TTAs. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 303-312, Michigan.
Janssen and Corporaal, 1997: Janssen, J. and Corporaal, H. (1997). Registers On Demand, an integrated region scheduler and register allocator. In Submitted paper.
Lo et al., 1997: Lo, J. L., Eggers, S. J., Emer, J., Levy, H., Stamm, R., and Tullsen, D. (1997). Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems.
Johnson and Schlansker, 1996: Johnson, R. and Schlansker, M. (1996). Analysis Techniques for Predicated Code. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 100-113, Paris, France.
Johnson, 1991: Johnson, W. M. (1991). Superscalar Microprocessor Design. Prentice Hall.
Jouppi and Wall, 1989: Jouppi, N. P. and Wall, D. W. (1989). Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pages 272-282.
Karkowski and Corporaal, 1997: Karkowski, I. and Corporaal, H. (1997). Overcoming the Limitations of the Traditional Loop Parallelization. In HPCN Europe '97, The International Conference and Exhibition on High-Performance Computing and Networking, Vienna, Austria.
Kathail et al., 1994: Kathail, V., Schlansker, M., and Rau, B. (1994). HPL PlayDoh Architecture Specification: Version 1.0. Technical Report HPL-93-80, Hewlett Packard Computer Systems Laboratory, Palo Alto, CA.
Kunkel and Smith, 1986: Kunkel, S. R. and Smith, J. E. (1986). Optimal Pipelining in Supercomputers. In ISCA-13, pages 404-414, Tokyo, Japan.
Lam, 1988: Lam, M. (1988). Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328.
Lam and Wilson, 1992: Lam, M. S. and Wilson, R. P. (1992). Limits of control flow on parallelism. In ISCA-19, pages 46-57, Australia.
Larus, 1990: Larus, J. R. (1990). Parallelism in Numeric and Symbolic Programs. In Proceedings of International Workshop on Compilers for Parallel Computers, pages 157-170.
Lavery and Hwu, 1996: Lavery, D. M. and Hwu, W. W. (1996). Modulo Scheduling of Loops in Control-Intensive Non-Numeric Programs. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 126-137, Paris, France.
Lee et al., 1997a: Lee, C.-C., Chen, I.-C. K., and Mudge, T. N. (1997a). The Bi-Mode Branch Predictor. In [MICRO30, 1997], pages 4-13.
Lee et al., 1995: Lee, D., Baer, J.-L., Calder, B., and Grunwald, D. (1995). Instruction Cache Fetch Policies for Speculative Execution. In ISCA-22 proceedings, pages 357-367, Santa Margherita Ligure, Italy.
Lee and Smith, 1984: Lee, J. and Smith, A. (1984). Branch Prediction Strategies and Branch Target Buffer Design. In IEEE Computer, pages 6-22.
Lee et al., 1997b: Lee, R. B. et al. (1997b). MAX-2 Multimedia Extensions for PA-RISC 2.0 Processors. Presentation, Hot Chips IX, Stanford.
Liao, 1996: Liao, S. Y.-H. (1996). Code Generation and Optimization for Embedded Digital Signal Processors. PhD thesis, MIT.
Lilja and Bird, 1994: Lilja, D. J. and Bird, P. L. (1994). The interaction of compilation technology and computer architecture. Kluwer Academic Publishers.
Lipasti and Shen, 1996: Lipasti, M. H. and Shen, J. P. (1996). Exceeding the Dataflow Limit via Value Prediction. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 226-237, Paris, France.
Lipasti et al., 1996: Lipasti, M. H., Wilkerson, C. B., and Shen, J. P. (1996). Value Locality and Load Value Prediction. In Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts.
Lowney et al., 1993: Lowney, P. G. et al. (1993). The Multiflow Trace Scheduling Compiler. The Journal of Supercomputing, 7(1/2):51-142.
Luk and Mowry, 1996: Luk, C.-K. and Mowry, T. C. (1996). Compiler-Based Prefetching for Recursive Data Structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222-233, Cambridge, Massachusetts.
Mahadevan and Ramakrishnan, 1994: Mahadevan, U. and Ramakrishnan, S. (1994). Instruction Scheduling over Regions: A Framework for Scheduling Across Basic Blocks. In Proceedings of the International Conference on Compiler Construction, pages 419-434, Edinburgh, Scotland.
Mahlke et al., 1992a: Mahlke, S. A. et al. (1992a). Sentinel scheduling for VLIW and Superscalar Processors. In Proceedings of ASPLOS-V, pages 238-247, Boston.
Mahlke et al., 1995: Mahlke, S. A. et al. (1995). A Comparison of Full and Partial Predicated Execution Support for ILP Processors. In The 22nd Annual International Symposium on Computer Architecture, pages 138-149.
Mahlke et al., 1992b: Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., and Bringmann, R. A. (1992b). Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture.
Mahlke and Natarajan, 1996: Mahlke, S. A. and Natarajan, B. (1996). Compiler Synthesized Dynamic Branch Prediction. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 153-164.
Martin et al., 1997: Martin, M. M., Roth, A., and Fischer, C. N. (1997). Exploiting Dead Value Information. In [MICRO30, 1997], pages 125-135.
Maydan et al., 1995: Maydan, D. E., Hennessy, J. L., and Lam, M. S. (1995). Effectiveness of Data Dependence Analysis. International Journal of Parallel Programming, 23(1).
McFarling, 1993: McFarling, S. (1993). Combining Branch Predictors. WRL Technical Note TN-36.
McFarling and Hennessy, 1986: McFarling, S. and Hennessy, J. (1986). Reducing the Cost of Branches. In Proc. 13th Ann. Int'l Symp. on Computer Architecture, pages 396-403.
MICRO30, 1997: MICRO30 (1997). Proceedings of the 30th Annual International Symposium on Microarchitecture, Research Triangle Park, North Carolina. IEEE Computer Society TC-MICRO and ACM SIGMICRO.
Moon and Ebcioğlu, 1992: Moon, S. and Ebcioğlu, K. (1992). An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, Portland.
Moshovos et al., 1997: Moshovos, A., Breach, S. E., Vijaykumar, T., and Sohi, G. S. (1997). Dynamic Speculation and Synchronization of Data Dependences. In [ISCA24, 1997].
Moshovos and Sohi, 1997: Moshovos, A. and Sohi, G. S. (1997). Streamlining Inter-Operation Memory Communication via Data Dependence Prediction. In [MICRO30, 1997], pages 235-245.
Moudgill, 1994: Moudgill, M. (1994). Implementing and exploiting static speculation on multiple instruction issue processors. PhD thesis, Cornell University.
Moudgill and Vassiliadis, 1996: Moudgill, M. and Vassiliadis, S. (1996). On Precise Interrupts. IEEE Micro, pages 58-67.
Muchnick, 1997: Muchnick, S. (1997). Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers. ISBN 1-55860-320-4.
Nair, 1995: Nair, R. (1995). Dynamic Path-Based Branch Correlation. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 15-23, Michigan.
Nicolau, 1985: Nicolau, A. (1985). Percolation Scheduling: A Parallel Compilation Technique. Technical Report TR 85-678, Cornell University, Department of Computer Science, Cornell University, Ithaca, NY 14853, USA.
Nicolau, 1989: Nicolau, A. (1989). Run-Time disambiguation: Coping with statically unpredictable dependencies. IEEE Transactions on Computers, 38(5).
Nicolau and Fisher, 1984: Nicolau, A. and Fisher, J. (1984). Measuring the Parallelism available for very long instruction word architectures. IEEE Transactions on Computers, C-33(11):968-976.
Normyle and Csoppenszky, 1997: Normyle, K. and Csoppenszky, M. (1997). UltraSPARC IIi - A highly integrated 300 MHz 64-bit SPARC V9 CPU. Presentation, Hot Chips IX, Stanford.
Novack et al., 1995a: Novack, S., Hummel, J., and Nicolau, A. (1995a). A simple mechanism for improving the accuracy and efficiency of instruction level disambiguation. Lecture Notes in Computer Science 1033, pages 289-303.
Novack et al., 1995b: Novack, S., Nicolau, A., and Dutt, N. (1995b). A Unified Code Generation Approach Using Mutation Scheduling. In Code Generation for Embedded Processors, chapter 12. Kluwer Academic Publishers.
Palacharla et al., 1997: Palacharla, S., Jouppi, N. P., and Smith, J. E. (1997). Complexity-Effective Superscalar Processors. In [ISCA24, 1997].
Pan et al., 1992: Pan, S.-T. et al. (1992). Improving the accuracy of dynamic branch prediction using branch correlation. In Proceedings of ASPLOS-V, pages 76-84, Boston.
Park and Schlansker, 1991: Park, J. C. H. and Schlansker, M. (1991). On predicated execution. Technical Report HPL-91-58, HP Laboratories, Palo Alto.
Patterson et al., 1997: Patterson, D. et al. (1997). A case for intelligent RAM. IEEE micro, pages 34-44.
Patterson, 1995: Patterson, J. (1995). Accurate Static Branch Prediction by Value Range Propagation. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation.
Paulin and Knight, 1989: Paulin, P. G. and Knight, J. P. (1989). Force-Directed Scheduling for the Behavioral Synthesis of ASIC's. IEEE Transactions on Computer-Aided Design, 8(6):661-679.
Pearl, 1984: Pearl, J. (1984). Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley.
PicoJava, 1996: PicoJava (1996). Picojava I Microprocessor Core Architecture. SUN white paper on Web-site: http://www.sun.com/sparc/whitepapers/wpr-0014-01.
Pinter, 1993: Pinter, S. S. (1993). Register Allocation with Instruction Scheduling: a New Approach. In SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 248-257.
Ramakrishnan, 1992: Ramakrishnan, S. (1992). Software Pipelining in PA-RISC Compilers. Hewlett-Packard Journal, pages 39-45.
Rathnam and Slavenburg, 1996: Rathnam, S. and Slavenburg, G. (1996). An Architectural Overview of the Programmable Multimedia Processor, TM-1. In CompCon '96 conference proceedings, Santa Clara.
Rau, 1993: Rau, B. R. (1993). Dynamically Scheduled VLIW Processors. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 80-92, Austin, Texas.
Rau, 1994: Rau, B. R. (1994). Iterative Modulo Scheduling: An Algorithm For Software Pipelining Loops. In Proceedings of the 27th Annual International Workshop on Microprogramming, San Jose, California.
Rau, 1996: Rau, B. R. (1996). Iterative Modulo Scheduling. Int. J. of Parallel Programming, 24(1):3-64.
Rau and Fisher, 1993: Rau, B. R. and Fisher, J. A. (1993). Instruction-Level Parallel Processing: History, Overview and Perspective. The Journal of Supercomputing, 7(1/2):9-50.
Rau and Glaeser, 1981: Rau, B. R. and Glaeser, C. D. (1981). Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183-198.
Rivers et al., 1997: Rivers, J. A., Tyson, G. S., Davidson, E. S., and Austin, T. M. (1997). On High-Bandwidth Data Cache Design for Multi-Issue Processors. In [MICRO30, 1997], pages 46-56.
Rompaey et al., 1992: Rompaey, K. V., Bolsens, I., and Man, H. D. (1992). Just in time scheduling. In ICCD-92, pages 295-300, Boston.
Rotenberg et al., 1996: Rotenberg, E., Bennett, S., and Smith, J. E. (1996). Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 24-34, Paris, France.
Rotenberg et al., 1997: Rotenberg, E., Jacobson, Q., Sazeides, Y., and Smith, J. (1997). Trace Processors. In [MICRO30, 1997], pages 138-148.
Sánchez and González, 1997: Sánchez, F. J. and González, A. (1997). Cache Sensitive Modulo Scheduling. In [MICRO30, 1997], pages 338-348.
Saulsbury et al., 1996: Saulsbury, A., Pong, F., and Nowatzyk, A. (1996). Missing the Memory Wall: The Case for Processor/Memory Integration. In ISCA-23.
Sazeides and Smith, 1997: Sazeides, Y. and Smith, J. E. (1997). The Predictability of Data Values. In [MICRO30, 1997], pages 248-258.
Sazeides et al., 1996: Sazeides, Y., Vassiliadis, S., and Smith, J. E. (1996). The Performance Potential of Data Dependence Speculation & Collapsing. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 238-247, Paris, France.
Schlansker and Kathail, 1995: Schlansker, M. and Kathail, V. (1995). Critical Path Reduction for Scalar Programs. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 57-69, Michigan.
Schuette and Shen, 1993: Schuette, M. and Shen, J. (1993). Instruction-Level Experimental Evaluation of the Multiflow TRACE 14/300 VLIW Computer. The Journal of Supercomputing, 7(1/2):249.
Seznec et al., 1996: Seznec, A., Jourdan, S., Sainrat, P., and Michaud, P. (1996). Multiple-Block Ahead Branch Predictors. In ASPLOS VII, Cambridge, Massachusetts.
Silberman and Ebcioğlu, 1993: Silberman, G. M. and Ebcioğlu, K. (1993). An architecture framework for supporting heterogeneous instruction-set architectures. IEEE Computer, 26(6):39-56.
Simone et al., 1995: Simone, M. et al. (1995). Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor. In The 22nd Annual International Symposium on Computer Architecture, pages 151-162.
Smith and Sohi, 1995: Smith, J. E. and Sohi, G. S. (1995). The Microarchitecture of Superscalar Processors. Technical report, University of Wisconsin-Madison.
Smith et al., 1992: Smith, M. D., Horowitz, M., and Lam, M. S. (1992). Efficient superscalar performance through boosting. In Proceedings of ASPLOS-V, pages 248-261, Boston.
Smotherman et al., 1991: Smotherman, M. et al. (1991). Efficient DAG Construction and Heuristic Calculation for Instruction Scheduling. In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 93-102, Albuquerque.
Sánchez et al., 1997: Sánchez, F. J., González, A., and Valero, M. (1997). Static Locality Analysis for Cache Management. In Proc. of Int. Conf. on Parallel Architectures and Compilation Techniques, San Francisco, USA.
Sodani and Sohi, 1997: Sodani, A. and Sohi, G. S. (1997). Dynamic Instruction Reuse. In [ISCA24, 1997].
Sohi et al., 1995: Sohi, G. S., Breach, S. E., and Vijaykumar, T. (1995). Multiscalar Processors. In ISCA'22 proceedings, pages 414-425, Santa Margherita Ligure, Italy.
Stark et al., 1997: Stark, J., Racunas, P., and Patt, Y. N. (1997). Reducing the Performance Impact of Instruction Cache Misses by Writing Instructions into the Reservation Stations Out-of-Order. In [MICRO30, 1997], pages 34-43.
Stok, 1994: Stok, L. (1994). Data path synthesis. INTEGRATION, the VLSI journal, 18(1):1-71.
Su and Wang, 1991: Su, B. and Wang, J. (1991). Loop-Carried Dependence and the General URPR Software Pipelining Approach. In Proceedings of HICSS-24, Vol. 2, pages 366-372.
Sweany and Beaty, 1990: Sweany, P. and Beaty, S. (1990). Post-Compaction Register Assignment in a Retargetable Compiler. In Proceedings of the 23rd Annual Workshop on Microprogramming and Microarchitectures, pages 107-116, Orlando.
Truong, 1997: Truong, L. (1997). The VelociTI Architecture of the TMS320C6x. Presentation, Hot Chips IX, Stanford.
Tsai and Yew, 1996: Tsai, J.-Y. and Yew, P.-C. (1996). The Superthreaded Architecture: Thread Pipelining with Run-time Data Dependence Checking and Control Speculation. Technical Report TR 96-037, Univ. of Minnesota, Department of Computer Science.
Tullsen et al., 1995: Tullsen, D. M., Eggers, S. J., and Levy, H. M. (1995). Simultaneous Multithreading: maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392-403, Santa Margherita Ligure, Italy.
Tullsen et al., 1996: Tullsen, D. M. et al. (1996). Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA.
Uht and Sindagi, 1995: Uht, A. K. and Sindagi, V. (1995). Disjoint Eager Execution: An Optimal Form of Speculative Execution. In Proc. of the 28th Annual International Symposium on Microarchitecture.
Vajapeyam and Mitra, 1997: Vajapeyam, S. and Mitra, T. (1997). Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences. In ISCA-97.
Vassiliadis et al., 1993: Vassiliadis, S., Phillips, J., and Blaner, B. (1993). Interlock Collapsing ALUs. IEEE Transactions on Computers, 42(7):825-839.
Wall, 1991: Wall, D. W. (1991). Limits of Instruction-Level Parallelism. In Proceedings of ASPLOS-IV, pages 176-188, Santa Clara, California. ACM.
Wang and Eisenbeis, 1993: Wang, J. and Eisenbeis, C. (1993). Decomposed Software Pipelining: A New Approach to Exploit Instruction Level Parallelism for Loop Programs. In IFIP WG 10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, Orlando, Florida.
Wang and Franklin, 1997: Wang, K. and Franklin, M. (1997). Highly Accurate Data Value Prediction using Hybrid Predictors. In [MICRO30, 1997], pages 281-290.
Warter et al., 1993: Warter, N. J. et al. (1993). Reverse If-Conversion. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation.
Weiss and Smith, 1994: Weiss, S. and Smith, J. E. (1994). POWER and PowerPC. Morgan Kaufmann Publishers.
Wilhelm and Maurer, 1995: Wilhelm, R. and Maurer, D. (1995). Compiler Design. Addison-Wesley.
Wilson et al., 1995: Wilson, R., French, R., Wilson, C., Amarasinghe, S., Anderson, J., Tjiang, S., Liao, S.-W., Tseng, C.-W., Hall, M., Lam, M., and Hennessy, J. (1995). An Overview of the SUIF Compiler System. Web-site: http://suif.stanford.edu/suif/suif.html.
Wolf, 1992: Wolf, M. E. (1992). Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, Computer Systems Laboratory, Stanford, CA 94305. Also available as technical report CSL-TR-92-538.
Wolfe and Chanin, 1992: Wolfe, A. and Chanin, A. (1992). Executing Compressed Programs on An Embedded RISC Architecture. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 81-91, Portland, Oregon.
Wu and Larus, 1994: Wu, Y. and Larus, J. (1994). Static Branch Frequency and Program Profile Analysis. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 1-11.
Yeager, 1996: Yeager, K. C. (1996). MIPS R10000. IEEE micro.
Yeh and Patt, 1991: Yeh, T. and Patt, Y. (1991). Two-level adaptive training branch prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 51-61.
Yeh and Patt, 1992: Yeh, T. and Patt, Y. (1992). Alternative Implementations of Two-Level Adaptive Branch Prediction. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 124-134.
Yeh and Patt, 1993: Yeh, T. and Patt, Y. (1993). A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 257-266.
Young et al., 1995: Young, C., Gloy, N., and Smith, M. D. (1995). A Comparative Analysis of Schemes for Correlated Branch Prediction. In The 22nd Annual International Symposium on Computer Architecture, pages 276-286.
Zima and Chapman, 1991: Zima, H. and Chapman, B. (1991). Supercompilers for Parallel and Vector Computers. Addison-Wesley.

Henk Corporaal
Tue Mar 10 11:20:49 CET 1998