Software Development and Hardware Testing

Project Coordinator: CP3-Origins

Partners: Edinburgh University, Swansea University

We aim to develop a common software framework, based on emerging standards such as OpenCL, that provides a simple-to-use environment for the efficient exploitation of these new machines; such a framework will have applications stretching well beyond the physics community (milestones 1-3). We will develop and maintain a set of benchmarks that quantify the performance of a parallel platform with an index that is meaningful for real-world applications (milestones 4-8). Finally, we will develop strong links with industrial partners to widen the impact of our training on the job market.

Task 1: Development of new software and testing of new GPU hardware.

Recent progress in hardware development will soon deliver machines capable of PFlop performance. This is achieved through the effective use of multi-core architectures, which require a new programming paradigm based on highly parallelized algorithms running on heterogeneous architectures. To exploit the full potential of heterogeneous computing, the strategy is to deploy multiple types of processing elements within a single workflow, allowing each to perform the tasks to which it is best suited. The major challenge to overcome is the programming complexity required to distribute workloads across multiple processors, and the additional effort required when those processors are of different types.
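
As an illustration of the heterogeneity such a framework must hide from the user, the following minimal sketch (standard OpenCL C API; error handling omitted for brevity) enumerates the processing elements available on a node and flags the role each is typically best suited to. It is not part of the proposed framework, only an indication of the kind of device discovery on which it would build.

    // Minimal sketch: enumerate the heterogeneous processing elements of a node
    // with the standard OpenCL C API, so that work can later be dispatched to
    // the device type best suited to each task.  Error handling is omitted.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint num_platforms = 0;
        clGetPlatformIDs(0, nullptr, &num_platforms);
        std::vector<cl_platform_id> platforms(num_platforms);
        clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

        for (cl_platform_id platform : platforms) {
            cl_uint num_devices = 0;
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
            std::vector<cl_device_id> devices(num_devices);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);

            for (cl_device_id device : devices) {
                char name[256] = {0};
                cl_device_type type = 0;
                clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
                clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
                std::printf("%-40s %s\n", name,
                            (type & CL_DEVICE_TYPE_GPU) ? "GPU: wide data-parallel kernels"
                                                        : "CPU: control-heavy, serial tasks");
            }
        }
        return 0;
    }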
In the last few years we have witnessed a phenomenal increase in activity around heterogeneous architectures, both in the scientific community and in fields such as finance, medical imaging, and electronic design automation.
The increased complexity of both hardware and software requires the development of new software to implement the physically relevant models on different supercomputer architectures, and to test new hardware such as GPUs. Despite the theoretical nature of our physics objectives, the numerical side of our investigations creates an ideal interface with industries interested in the development, or intensive use, of numerical techniques.
Lattice gauge theory computer programs provide an ideal tool for assessing the scalability of massively parallel supercomputers. The balance between local and distributed computations can be finely tuned by choosing the ratio between the local and global lattice sizes and the number of directions distributed across different processors. In addition to this well-known property of lattice gauge theory codes in the context of benchmarking, our programs allow the number of colours and the fermion representation to be changed arbitrarily, permitting an even more fine-grained evaluation of parallel platforms. Besides helping us assess the performance of a platform for our specific scientific goals, a benchmarking suite based on our code can provide a useful measurement of the scalability of parallel platforms for computation- and communication-intensive codes, which can serve as a reliable index for real-world applications in fields such as finance and weather forecasting.
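
The sketch below makes this tunability concrete: for a hypothetical global lattice extent and process grid, it computes the fraction of sites on the boundary of the local sub-lattice (data that must be communicated to neighbouring processes) as a function of how many directions are distributed, which is precisely the communication-to-computation balance the benchmark parameters control. The numerical values are illustrative only.

    // Illustrative sketch: how the local lattice size and the number of
    // parallelised directions control the communication-to-computation balance.
    // Boundary sites of the local sub-lattice must be exchanged with the
    // neighbouring processes; interior sites involve purely local computation.
    #include <cstdio>

    int main() {
        const int L_global      = 64;   // global lattice extent per direction (hypothetical)
        const int procs_per_dir = 4;    // processes along each distributed direction (hypothetical)
        const int dims          = 4;    // space-time dimensions

        for (int parallel_dirs = 1; parallel_dirs <= dims; ++parallel_dirs) {
            double volume = 1.0, interior = 1.0;
            for (int d = 0; d < dims; ++d) {
                // the lattice is divided only along the distributed directions
                const int local = (d < parallel_dirs) ? L_global / procs_per_dir : L_global;
                volume   *= local;
                // interior excludes the two boundary slices of each distributed direction
                interior *= (d < parallel_dirs) ? local - 2 : local;
            }
            const double surface = volume - interior;   // sites taking part in the halo exchange
            std::printf("distributed directions = %d : surface/volume = %.3f\n",
                        parallel_dirs, surface / volume);
        }
        return 0;
    }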

Milestones:

  1. Implement a common set of efficient parallel data structures for use on different computing hardware. These data structures provide a user-friendly interface to manage the common tasks on which more complex algorithms are built, such as moving data from one computing component to another within a single node, handling data ordering for maximum efficiency, data compression, and simple operations on the data itself (e.g. summing arrays of numbers); an illustrative sketch of such an interface is given after this list. Timeframe: 3 m.
  2. Develop an efficient framework to manage the large number of concurrent tasks running on a heterogeneous node. This framework should be able to dispatch tasks to the different computing hardware according to the available resources; a schematic dispatcher is sketched after this list. Timeframe: 4 m.
  3. Develop a library to simplify computing device management and (semi)automatic generation of computing kernels. Timeframe: 4 m.
  4. Assessment of the performance of the communication-intensive and computation-intensive parts of our code as a function of the global and local lattice sizes, the number of parallelised directions, the number of colours, and the representation of the fermions. Timeframe: 2 m.
  5. Branching of the code into a dedicated test suite aimed at providing verbose feedback on performance and efficiency. Timeframe: 4 m.
  6. Study of various simulation scenarios on a simple parallel platform (a Beowulf cluster) and identification of a performance index. Timeframe: 6 m.
  7. Test of the performance index on various parallel platforms (IBM Blue Gene, IBM Power 5 supercomputers) and compilation of a table of results. Timeframe: 12 m.
  8. Performance studies using the benchmark suite and further development of it in collaboration with industrial partners. Timeframe: 12 m. At the end of the project, the test suite will be made publicly available.
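
The following sketch illustrates, under purely hypothetical naming, the kind of interface targeted by milestone 1: a field container that hides where the data lives and exposes a few common whole-field operations. Device transfers are indicated only by comments; in an OpenCL backend they would map onto calls such as clEnqueueWriteBuffer and clEnqueueReadBuffer.

    // Hypothetical sketch of the interface aimed at in milestone 1: a field
    // container that hides where the data lives and exposes a few common
    // operations on which more complex algorithms can build.  Device transfers
    // are placeholders; an OpenCL backend would use clEnqueueWriteBuffer /
    // clEnqueueReadBuffer and also handle data reordering and compression.
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    template <typename T>
    class ParallelField {
    public:
        explicit ParallelField(std::size_t n) : host_(n, T{}) {}

        T&       operator[](std::size_t i)       { return host_[i]; }
        const T& operator[](std::size_t i) const { return host_[i]; }

        // Move the data from one computing component to another within the node.
        void to_device()   { /* e.g. clEnqueueWriteBuffer(...) */ }
        void from_device() { /* e.g. clEnqueueReadBuffer(...)  */ }

        // Simple whole-field operation (sum of an array of numbers).
        T sum() const { return std::accumulate(host_.begin(), host_.end(), T{}); }

    private:
        std::vector<T> host_;   // host copy; a device buffer would sit alongside it
    };

    int main() {
        ParallelField<double> phi(1024);
        for (std::size_t i = 0; i < 1024; ++i) phi[i] = 1.0;
        phi.to_device();
        std::printf("sum = %f\n", phi.sum());   // prints 1024.000000
        return 0;
    }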
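
In the same spirit, the next sketch indicates the dispatching idea behind milestone 2: each task is handed to whichever processing element currently has the most spare capacity. Real device queues (e.g. OpenCL command queues) are replaced here by simple counters, and the task names are hypothetical.

    // Hypothetical sketch of the dispatching idea in milestone 2: each task is
    // handed to the processing element with the most spare capacity.  Real
    // device queues (e.g. OpenCL command queues) are replaced by counters and
    // the task names are purely illustrative.
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    struct ComputeUnit {
        std::string name;    // e.g. "CPU cores", "GPU 0"
        int pending = 0;     // crude measure of the resources currently in use
    };

    class Dispatcher {
    public:
        explicit Dispatcher(std::vector<ComputeUnit> units) : units_(std::move(units)) {}

        // Send the task to the unit with the fewest pending tasks.
        void submit(const std::string& task) {
            ComputeUnit* best = &units_.front();
            for (auto& u : units_)
                if (u.pending < best->pending) best = &u;
            ++best->pending;
            std::printf("task '%s' -> %s\n", task.c_str(), best->name.c_str());
        }

    private:
        std::vector<ComputeUnit> units_;
    };

    int main() {
        Dispatcher d({{"CPU cores"}, {"GPU 0"}, {"GPU 1"}});
        const char* tasks[] = {"dirac-apply", "gauge-update", "io-checkpoint", "dirac-apply"};
        for (const char* t : tasks) d.submit(t);
        return 0;
    }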