Quasi fat trees for HPC clouds and their fault-resilient closed-form routing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

19 Scopus citations

Abstract

High-Performance Computing (HPC) Clusters and Data Center Networks often rely on fat-tree topologies. However, fat trees and their known variants are not designed for concurrent small jobs. As a result, in recent years, HPC designers have introduced ad-hoc topologies to offer better performance for these concurrent small jobs. In this paper, we present and formally define these topologies, which we call Quasi Fat Trees (QFTs). Specifically, we formulate the graph structure of these new topologies, and show that they perform better for concurrent small jobs. Furthermore, we derive a closed-form and fault-resilient contention-free routing algorithm for all global shift permutations. This routing optimizes the run-time of large computing jobs that utilize MPI collectives. Finally, we verify the algorithm by running its implementation as an OpenSM routing engine on various sizes of QFT topologies, and show that it exhibits good performance.
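
As a rough illustration of what "contention-free for all global shift permutations" means, the sketch below counts link loads for every shift on a plain two-level fat tree. It is not the QFT topology and not the closed-form algorithm from the paper: the topology (each leaf switch has as many hosts as uplinks), the destination-modulo uplink choice, and the names `shift_permutation` and `max_link_load` are all assumptions chosen for illustration.

```python
from collections import Counter


def shift_permutation(n, s):
    """Global shift permutation: end node i sends to (i + s) mod n."""
    return [(i + s) % n for i in range(n)]


def max_link_load(num_leaves, hosts_per_leaf, s):
    """Highest number of flows sharing one link for shift s.

    Assumed topology (hypothetical, not a QFT): a two-level fat tree in
    which every leaf switch has hosts_per_leaf hosts and one uplink to
    each of hosts_per_leaf spine switches.
    Assumed routing: the spine is picked as dst mod hosts_per_leaf.
    """
    n = num_leaves * hosts_per_leaf
    load = Counter()
    for src, dst in enumerate(shift_permutation(n, s)):
        src_leaf = src // hosts_per_leaf
        dst_leaf = dst // hosts_per_leaf
        if src_leaf == dst_leaf:
            continue  # traffic stays inside the leaf switch
        spine = dst % hosts_per_leaf
        load[("up", src_leaf, spine)] += 1
        load[("down", spine, dst_leaf)] += 1
    return max(load.values(), default=0)


if __name__ == "__main__":
    leaves, hosts = 4, 4
    worst = max(max_link_load(leaves, hosts, s) for s in range(leaves * hosts))
    print("worst link load over all shifts:", worst)
```

A maximum load of at most one flow per link for every shift is what contention-free means in this toy setting; the paper's contribution is achieving this on QFTs, including in the presence of faults, which this sketch does not model.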

Original language: English
Title of host publication: Proceedings - 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects, HOTI 2014
Pages: 41-48
Number of pages: 8
ISBN (Electronic): 9781479958603
State: Published - 15 Oct 2014
Event: 22nd IEEE Annual Symposium on High-Performance Interconnects, HOTI 2014 - Mountain View, United States
Duration: 26 Aug 2014 - 28 Aug 2014

Publication series

Name: Proceedings - 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects, HOTI 2014

Conference

Conference: 22nd IEEE Annual Symposium on High-Performance Interconnects, HOTI 2014
Country/Territory: United States
City: Mountain View
Period: 26/08/14 - 28/08/14

Keywords

  • Fat Tree
  • HPC
  • Routing
  • Topology

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Hardware and Architecture
