This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed-memory machine. An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs, and good performance scalability up to hundreds of thousands of processors.

The Message Passing Interface (MPI) is a widely used standard for interprocessor communications in parallel computers and PC clusters. Its functions are normally implemented in software due to their enormity and complexity, thus resulting in large communication latencies. Limited hardware support for MPI is sometimes available in expensive systems. Reconfigurable computing has recently reached rewarding levels that enable the embedding of programmable parallel systems of respectable size inside one or more Field-Programmable Gate Arrays (FPGAs). Nevertheless, specialized components must be built to support interprocessor communications in these FPGA-based designs, and the resulting code may be difficult to port to other reconfigurable platforms. In addition, performance comparison with conventional parallel computers and PC clusters is very cumbersome or impossible, since the latter often employ MPI or similar communication libraries. The introduction of a hardware design that directly implements MPI primitives in configurable multiprocessor computing creates a framework for efficient parallel code development involving data exchanges independently of the underlying hardware implementation. This process also supports the portability of MPI-based code developed for more conventional platforms. This paper takes advantage of the effectiveness and efficiency of one-sided Remote Memory Access (RMA) communications, and presents the design and evaluation of a coprocessor that implements a set of MPI primitives for RMA. These primitives form a universal and orthogonal set that can be used to implement any other MPI function. To evaluate the coprocessor, a low-latency router was designed as well to enable the direct interconnection of several coprocessors in cluster-on-a-chip systems. Experimental results justify the implementation of the MPI primitives in hardware to support parallel programming in reconfigurable computing. Under continuous traffic, results for a Xilinx XC2V6000 FPGA show that the average transmission time per 32-bit word is about 1.35 clock cycles. Although other computing platforms, such as PC clusters, could benefit from our design methodology as well, our focus is exclusively reconfigurable multiprocessing, which has recently received tremendous attention in academia and industry.

This paper describes the system architecture of the Cray BlackWidow scalable vector multiprocessor. The BlackWidow system is a distributed shared memory (DSM) architecture that is scalable to 32K processors, each with a 4-way dispatch scalar execution unit and an 8-pipe vector unit capable of 20.8 Gflops for 64-bit operations and 41.6 Gflops for 32-bit operations at the prototype operating frequency of 1.3 GHz. Each BlackWidow node is implemented as a 4-way SMP with up to 128 Gbytes of DDR2 main memory capacity. Global memory is directly accessible with processor loads and stores and is globally coherent. The system supports thousands of outstanding references to hide remote memory latencies, and provides a rich suite of built-in synchronization primitives. The system supports common programming models such as MPI and OpenMP, as well as global address space languages such as UPC and CAF. We describe the system architecture and microarchitecture of the processor, memory controller, and router chips. We give preliminary performance results and discuss design tradeoffs. General terms: design, architecture, performance. Keywords: multiprocessor, shared memory, vector, MPP.