By Monica S. Lam

This book is a revision of my Ph.D. thesis dissertation submitted to Carnegie Mellon University in 1987. It documents the research and results of the compiler technology developed for the Warp machine. Warp is a systolic array built out of custom, high-performance processors, each of which can execute up to 10 million floating-point operations per second (10 MFLOPS). Under the direction of H. T. Kung, the Warp machine matured from an academic, experimental prototype to a commercial product of General Electric. The Warp machine demonstrated that the scalable architecture of high-performance, programmable systolic arrays represents a practical, cost-effective solution to present and future computation-intensive applications. The success of Warp led to the follow-on iWarp project, a joint project with Intel, to develop a single-chip 20 MFLOPS processor. The availability of the highly integrated iWarp processor will have a significant impact on parallel computing.

One of the major challenges in the development of Warp was to build an optimizing compiler for the machine. First, the processors in the array cooperate at a fine granularity of parallelism; interaction between processors must be considered in the generation of code for individual processors. Second, the individual processors themselves derive their performance from a VLIW (Very Long Instruction Word) instruction set and a high degree of internal pipelining and parallelism. The compiler contains optimizations pertaining to the array level of parallelism, as well as optimizations for the individual VLIW processors.

Automatic synthesis techniques have been proposed only for simple application domains and simple machine models. How to use the array-level concurrency of a high-performance array effectively remains an open issue. The approach adopted in this work is to expose this level of concurrency to the users: the user specifies the high-level problem decomposition method, and the compiler handles the low-level synchronization of the cells. The justification of the approach and the exact computation model are presented in the next chapter.

The problem

Communication operations cannot be arbitrarily reordered, because reordering can alter the semantics of a program as well as introduce deadlock into it. To illustrate the former, consider the following:

(a) First cell          Second cell
    Send(R,X,1);        Receive(L,X,c);
    Send(R,X,2);        Receive(L,X,d);

(b) First cell          Second cell
    Send(R,X,1);        Receive(L,X,d);
    Send(R,X,2);        Receive(L,X,c);

Programs (a) and (b) are not equivalent, because the values of the variables c and d in the second cell are interchanged.
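The difference between programs (a) and (b) can be reproduced with a small simulation. The sketch below is not Warp code: it assumes the X channel behaves as a simple first-in, first-out queue between two neighbouring cells, and the helper name run_pair is hypothetical, introduced only for illustration.

```python
from collections import deque


def run_pair(sent_values, receive_order):
    """Simulate one FIFO channel X between two neighbouring cells.

    sent_values:   values the first cell sends, in program order.
    receive_order: names of the variables the second cell receives
                   into, in program order.
    Assumption: the channel is a plain FIFO queue with enough capacity.
    """
    channel_x = deque()
    for value in sent_values:          # first cell: Send(R, X, value)
        channel_x.append(value)

    variables = {}
    for name in receive_order:         # second cell: Receive(L, X, name)
        variables[name] = channel_x.popleft()
    return variables


program_a = run_pair([1, 2], ["c", "d"])   # receives into c, then d
program_b = run_pair([1, 2], ["d", "c"])   # receives into d, then c

print(program_a)  # {'c': 1, 'd': 2}
print(program_b)  # {'d': 1, 'c': 2}
```

Because the queue delivers values in the order they were sent, swapping the two receive operations swaps which variable gets which value, so the two programs compute different results.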

A queue for each inter-cell communication path (X, Y, and Adr Queue), and a register file to buffer data for each floating-point unit (AReg and MReg). All these components are interconnected through a crossbar switch. The instructions are executed at the rate of one instruction every major clock cycle of 200 ns. The details of each component of the cell, and the differences between the prototype and the production machine, are given below.

Floating-point units. The floating-point multiplier and adder are implemented with commercial floating-point chips [59].
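The 200 ns instruction cycle can be related to the 10 MFLOPS peak rate quoted for each processor with a short calculation. This is a rough sketch that assumes each instruction can start one operation on the adder and one on the multiplier; that assumption and the constant names are illustrative, not taken from the text.

```python
CYCLE_TIME_NS = 200            # one instruction per major clock cycle (from the text)
FP_OPS_PER_INSTRUCTION = 2     # assumption: one add plus one multiply per instruction

# 1 second = 1e9 ns, so the instruction rate is 1e9 / cycle time.
instructions_per_second = 1e9 / CYCLE_TIME_NS

# Peak floating-point rate in millions of operations per second.
peak_mflops = instructions_per_second * FP_OPS_PER_INSTRUCTION / 1e6

print(peak_mflops)  # 10.0
```

Under that assumption the cell issues 5 million instructions per second, which matches the 10 MFLOPS peak figure.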

