STRETCH was the most complex electronic system yet designed, and the first whose design was based on an earlier computer (the IBM 704). Unfortunately, it missed its primary goal of being 100 or even 200 times faster than the competition; it proved only about 25 to 50 times faster. Only seven other Stretch machines were built after the one that went to Los Alamos, all for government agencies (such as the Weather Service, for charting the paths of storms) or government contractors (such as MITRE).
In April 1955, IBM lost a major bid to build a computer for the U.S. Atomic Energy Commission’s Livermore Laboratory to the UNIVAC division of Remington Rand. UNIVAC had promised up to five times the processing power called for in the government’s bid request, so IBM decided it should play that game too the next time it had the opportunity.
Supercomputers – the pioneers
When the Los Alamos Scientific Laboratory next published a bid request, IBM promised a system operating at 100 times present speeds, ready for delivery at the turn of the decade. Here is where the categorical split between “conventional computers” and supercomputers happened: IBM committed itself to producing a whole new kind of computing mechanism, entirely transistorized for the first time. There had always been a race to build the fastest and most capable machine, but the market did not begin its path to maturity until that first cell split, when it became clear that atomic physics research represented a different customer profile from business accounting, and needed a different class of machine.
Stephen W. Dunwell was Stretch’s lead engineer and project manager. In a 1989 oral history interview for the University of Minnesota’s Charles Babbage Institute, he recalled the all-hands meeting he attended, along with legendary IBM engineer Gene Amdahl and several others. There, the engineers and their managers came to the collective realization that there needed to be a class of computers above and beyond the common computing machine, if IBM was to regain a competitive edge against competitors such as Sperry Rand.
Gordon Bell, the brilliant engineer who developed the VAX series for DEC, would later recall that engineers of his ilk began using the term “supercomputer” when referring to machines in this upper class, as early as 1957, while the 7030 project was underway.
The architectural gap between the earlier IBM 704 design and that of the new IBM 7030 was so great that engineers dubbed the new system “Stretch”. It introduced the notions of instruction “look-ahead” and index registers, both of which remain principal components of modern x86 processor design. Internally, Stretch worked with 64-bit “words”, pairing them with one of the first disk-based random-access storage mechanisms and breaking those words down into 8-bit alphanumeric segments that engineers dubbed “bytes”.
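Carving a fixed-width word into byte-sized fields is still done essentially the same way today. A minimal Python sketch (the fixed 8-bit byte shown here is the modern convention; Stretch’s bytes could actually vary in width, so this is illustrative rather than a model of the 7030 itself):

```python
# Split a 64-bit word into eight 8-bit bytes using shifting and masking.

def word_to_bytes(word):
    """Return the eight 8-bit bytes of a 64-bit word, most significant first."""
    return [(word >> shift) & 0xFF for shift in range(56, -8, -8)]

word = 0x0123456789ABCDEF
print([hex(b) for b in word_to_bytes(word)])
# -> ['0x1', '0x23', '0x45', '0x67', '0x89', '0xab', '0xcd', '0xef']
```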
Though IBM successfully built and delivered eight 7030 models between 1961 and 1963, keeping a ninth for itself, Dunwell’s superiors declared it a failure for being only 30 times faster than 1955 benchmarks, instead of 100. Declaring something you built yourself a failure typically prompts others to agree with you, often for no other viable reason. When competitor Control Data set out to build a system a mere three times faster than the IBM 7030, and then in 1964 met that goal with the CDC 6600 (principally designed by Seymour Cray), the “supercomputer” moniker stuck to it like glue. (Even before Control Data ceased to exist, the term attached itself to Cray.) Indeed, the CDC 6600 introduced parallel functional units, letting a single instruction stream keep several arithmetic units busy at once; it foreshadowed the vector processing of later Cray machines and marked the beginning of parallelism. And no computer today, not even your smartphone, is without parallel processing, index registers, look-ahead instruction pre-fetching, or bytes.
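The data parallelism just described, one operation applied across many operands at once, can be sketched as a toy “vector instruction” in Python (the 64-element register width is purely illustrative, not taken from the article):

```python
# Toy model of a vector instruction: one opcode (here, "VADD") operates on
# whole fixed-length "vector registers" instead of one scalar at a time.

VLEN = 64  # elements per vector register (illustrative choice)

def vadd(v1, v2):
    """One 'instruction' that adds two vector registers element-wise."""
    assert len(v1) == len(v2) == VLEN
    return [a + b for a, b in zip(v1, v2)]

r1 = list(range(VLEN))
r2 = [10] * VLEN
r3 = vadd(r1, r2)
print(r3[:4])  # -> [10, 11, 12, 13]
```

A real vector machine executes the whole element-wise loop in hardware; the sketch only expresses the programming model of one instruction covering many operands.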
The giants of supercomputing
According to TOP500.org, IBM nowadays sits in the second spot of the supercomputer race.
The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. Started in 1993, the project publishes an updated list twice a year: the first update always coincides with the International Supercomputing Conference in June, and the second is presented at the ACM/IEEE Supercomputing Conference in November. The project aims to provide a reliable basis for tracking and detecting trends in high-performance computing, and it bases its rankings on HPL, a portable implementation of the High-Performance LINPACK benchmark for distributed-memory computers.
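HPL scores come from a fixed bookkeeping rule: solving a dense n×n linear system by LU factorization is credited with roughly 2n³/3 + 2n² floating-point operations, and the score is that count divided by wall time. A minimal sketch of the arithmetic (the 10,000 × 10,000 example and its 5-second solve time are hypothetical):

```python
def hpl_flops(n, seconds):
    """Convert a dense-solve wall time into an HPL-style flop/s figure.

    HPL credits the standard LU factor-and-solve operation count,
    2/3*n^3 + 2*n^2, regardless of the algorithm actually used.
    """
    operations = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return operations / seconds

# A hypothetical machine solving a 10,000 x 10,000 system in 5 seconds
# would score about 133 gigaflop/s.
rate = hpl_flops(10_000, 5.0)
print(f"{rate / 1e9:.1f} Gflop/s")  # -> 133.4 Gflop/s
```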
The 55th edition of the TOP500 saw some significant additions to the list, spearheaded by a new number one system from Japan. The latest rankings also reflect a steady growth in aggregate performance and power efficiency.
The new top system, Fugaku, turned in a High-Performance Linpack (HPL) result of 415.5 petaflops, besting the now second-place Summit system by a factor of 2.8. Fugaku is powered by Fujitsu’s 48-core A64FX SoC, making it the first number-one system on the list to be powered by Arm processors. In single or further reduced precision, often used in machine learning and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1 exaflops). The new system is installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan.
Number two on the list is Summit, an IBM-built supercomputer that delivers 148.8 petaflops on HPL. The system has 4,356 nodes, each equipped with two 22-core Power9 CPUs, and six NVIDIA Tesla V100 GPUs. The nodes are connected with a Mellanox dual-rail EDR InfiniBand network. Summit is running at Oak Ridge National Laboratory (ORNL) in Tennessee and remains the fastest supercomputer in the US.
At number three is Sierra, a system at the Lawrence Livermore National Laboratory (LLNL) in California achieving 94.6 petaflops on HPL. Its architecture is very similar to Summit, equipped with two Power9 CPUs and four NVIDIA Tesla V100 GPUs in each of its 4,320 nodes. Sierra employs the same Mellanox EDR InfiniBand as the system interconnect.
Sunway TaihuLight, a system developed by China’s National Research Center of Parallel Computer Engineering & Technology (NRCPC) drops to number four on the list. The system is powered entirely by Sunway 260-core SW26010 processors. Its HPL mark of 93 petaflops has remained unchanged since it was installed at the National Supercomputing Center in Wuxi, China in June 2016.
At number five is Tianhe-2A (Milky Way-2A), a system developed by China’s National University of Defense Technology (NUDT). Its HPL performance of 61.4 petaflops is the result of a hybrid architecture employing Intel Xeon CPUs and custom-built Matrix-2000 coprocessors. It is deployed at the National Supercomputer Center in Guangzhou, China.
A new system on the list, HPC5, captured the number six spot, turning in an HPL performance of 35.5 petaflops. HPC5 is a PowerEdge system built by Dell and installed by the Italian energy firm Eni S.p.A, making it the fastest supercomputer in Europe. It is powered by Intel Xeon Gold processors and NVIDIA Tesla V100 GPUs and uses Mellanox HDR InfiniBand as the system network.
Another new system, Selene, is in the number seven spot with an HPL mark of 27.58 petaflops. It is a DGX SuperPOD, powered by NVIDIA’s new “Ampere” A100 GPUs and AMD’s EPYC “Rome” CPUs. Selene is installed at NVIDIA in the US. It too uses Mellanox HDR InfiniBand as the system network.
Frontera, a Dell C6420 system installed at the Texas Advanced Computing Center (TACC) in the US, is ranked eighth on the list. Its HPL result of 23.5 petaflops is achieved with 448,448 Intel Xeon cores.
The second Italian system in the top 10 is Marconi-100, which is installed at the CINECA research center. It is powered by IBM Power9 processors and NVIDIA V100 GPUs, employing dual-rail Mellanox EDR InfiniBand as the system network. Marconi-100’s 21.6 petaflops earned it the number nine spot on the list.
Rounding out the top 10 is Piz Daint at 21.2 petaflops, a Cray XC50 system installed at the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland. It is equipped with Intel Xeon processors and NVIDIA P100 GPUs.
Interesting facts revealed by the TOP500:
China continues to dominate the TOP500 when it comes to system count, claiming 226 supercomputers on the list. The US is number two with 114 systems; Japan is third with 30; France has 18; and Germany claims 16. Despite coming in second on system count, the US continues to edge out China in aggregate list performance with 644 petaflops to China’s 565 petaflops. Japan, with its significantly smaller system count, delivers 530 petaflops.
Also, Chinese manufacturers dominate the list in the number of installations with Lenovo (180), Sugon (68) and Inspur (64) accounting for 312 of the 500 systems. HPE claims 37 systems, while Cray/HPE has 35 systems. Fujitsu is represented by just 13 systems, but thanks to its number one Fugaku supercomputer, the company leads the list in aggregate performance with 478 petaflops. Lenovo, with 180 systems, comes in second in performance with 355 petaflops.
Regardless of the manufacturer, as a technology trend, a total of 144 systems on the list are using accelerators or coprocessors, which is nearly the same as the 145 reported six months ago. As has been the case in the past, the majority of the systems equipped with accelerator/coprocessors (135) are using NVIDIA GPUs.
The x86 continues to be the dominant processor architecture, present in 481 of the 500 systems. Intel claims 469 of these, with AMD installed in 11 and Hygon in the remaining one. Arm processors are present in just four TOP500 systems, three of which employ the new Fujitsu A64FX processor, while the fourth is powered by Marvell’s ThunderX2 processor.
The breakdown of system interconnect share is largely unchanged from six months ago. Ethernet is used in 263 systems, InfiniBand is used in 150, and the remainder employ custom or proprietary networks. Despite Ethernet’s dominance in sheer numbers, those systems account for 471 petaflops, while InfiniBand-based systems provide 803 petaflops. Due to their use in some of the list’s most powerful supercomputers, systems with custom and proprietary interconnects together represent 790 petaflops.
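Dividing aggregate performance by system count makes the gap concrete; a quick sketch using the figures quoted above:

```python
# Average HPL performance per system, by interconnect family, from the
# counts and aggregate petaflops quoted above for this edition of the list.
interconnects = {
    # name: (system count, aggregate petaflops)
    "Ethernet": (263, 471),
    "InfiniBand": (150, 803),
    "custom/proprietary": (500 - 263 - 150, 790),
}

for name, (count, pflops) in interconnects.items():
    print(f"{name}: {pflops / count:.1f} petaflops per system on average")
```

The averages (roughly 1.8, 5.4, and 9.1 petaflops per system, respectively) show why Ethernet’s numerical dominance does not translate into a performance lead.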
The most energy-efficient system on the Green500 is the MN-3, based on a new server from Preferred Networks. It achieved a record 21.1 gigaflops/watt during its 1.62 petaflops performance run. The system derives its superior power efficiency from the MN-Core chip, an accelerator optimized for matrix arithmetic. It is ranked number 395 in the TOP500 list.
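Efficiency and performance together fix a system’s power draw; a quick check using MN-3’s figures from above:

```python
def power_draw_kw(petaflops, gflops_per_watt):
    """Power (in kW) implied by an HPL result and a Green500 efficiency figure."""
    gigaflops = petaflops * 1e6  # 1 petaflop/s = 1e6 gigaflop/s
    watts = gigaflops / gflops_per_watt
    return watts / 1e3

# MN-3: 1.62 petaflops at 21.1 gigaflops/watt implies roughly 77 kW.
print(f"{power_draw_kw(1.62, 21.1):.0f} kW")  # -> 77 kW
```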
In second position is the new NVIDIA Selene supercomputer, a DGX A100 SuperPOD powered by the new A100 GPUs. It occupies position seven on the TOP500.
In third position is the NA-1 system, a PEZY Computing/Exascaler system installed at NA Simulation in Japan. It achieved 18.4 gigaflops/watt and is at position 470 on the TOP500.
The number nine system on the Green500 is the top-performing Fugaku supercomputer, which delivered 14.67 gigaflops per watt, just behind Summit’s 14.72 gigaflops/watt.
The TOP500 list has incorporated results from the High-Performance Conjugate Gradient (HPCG) benchmark, which provides an alternative metric for assessing supercomputer performance and is meant to complement the HPL measurement.
The number one TOP500 supercomputer, Fugaku, is still the leader on the HPCG benchmark, with a record 13.4 HPCG-petaflops. The two US Department of Energy systems, Summit at ORNL and Sierra at LLNL, are now second and third, respectively, on the HPCG benchmark. Summit achieved 2.93 HPCG-petaflops and Sierra 1.80 HPCG-petaflops. All the remaining systems achieved less than one HPCG-petaflops.