Showing posts with label High Performance Computing. Show all posts

Monday, April 12, 2010

Product Review: Cray CX1000™ High(brid) Performance Computers


The Cray CX1000 series is a dense, power-efficient and supremely powerful rack-mounted supercomputer featuring best-of-class technologies that can be mixed and matched in a single rack, creating a customized hybrid computing platform for a variety of scientific workloads.

Cray has announced the Cray CX1000 system, a dense, power-efficient and supremely powerful rack-mounted supercomputer that lets you leverage the latest Intel® Xeon® processors for:
  • Scale-out cluster computing using dual-socket Intel Xeon 5600s (Cray CX1000-C)
  • Scale-through (GPU) computing leveraging NVIDIA Tesla® (Cray CX1000-G)
  • Scale-up computing with SMP nodes built on Intel’s QuickPath Interconnect (QPI) technology offering "fat memory" nodes (Cray CX1000-S)
High(brid) Performance Computing – The Cray CX1000 redefines HPC by delivering hybrid capabilities through a choice of chassis, each delivering one of the most important architectures of the next decade.


Cray CX1000-C Chassis
The compute-based Cray CX1000-C chassis includes 18 dual-socket Intel Xeon 5600 blades with an integrated 36-port QDR InfiniBand switch and a 24-port Gigabit Ethernet switch – all in 7U. With support for Windows® HPC Server 2008 or Red Hat Linux via the Cray Cluster Manager, the Cray CX1000-C system provides outstanding support for ISV applications as well as dual-boot capability for ultimate application flexibility. The Cray CX1000-C system maintains Cray's "Ease of Everything" approach by incorporating blades, switches and cabling all within a single chassis. The result is an easy-to-install system with compelling capabilities for scale-out high performance computing.
  • Two high-frequency Intel® Xeon® 5600 series processors (up to 2.93 GHz)
  • Large memory capacity (up to 48GB memory per blade with 4GB DDR3 DIMMs)
  • One SATA HDD, one SSD drive, or diskless operation
Cray CX1000-G Chassis

The GPU-based Cray CX1000-G chassis delivers nine double-width, dual-socket Intel Xeon 5600 blades, each incorporating two NVIDIA Tesla GPUs. The Cray CX1000-G system's unique architecture lets users maximize GPU performance by eliminating I/O bottlenecks – an industry first. These 7U systems include an integrated 36-port QDR InfiniBand switch and a 24-port Gigabit Ethernet switch. With 18 NVIDIA Tesla GPUs in a 7U form factor, the Cray CX1000-G system is the best answer to density limitations. Combining Intel Xeon 5600 performance with NVIDIA Tesla-based acceleration offers true hybrid computing options.
  • Double-width blade
  • Two Intel® Xeon® 5600 series processors
  • Two NVIDIA® Tesla® M1060 GPUs
  • Up to 48GB of memory per blade with 8GB DDR3 DIMMs
  • Two ConnectX adapters providing a single QDR InfiniBand channel
Cray CX1000-S Chassis
The SMP-based Cray CX1000-S server is offered in two configurations, offering up to 128 Intel® Xeon® 7500 series processors and 1 TB of memory in a 6U system. The Cray CX1000-SC compute node is made up of uniquely designed 1.5U "Building Blocks", each housing 32 cores interconnected using Intel QPI. The Cray CX1000-SM management node is a 3U server with four Intel Xeon 7500 series processors (32 cores) and up to 256 GB of memory.
  • Coherency switch – a proprietary feature based on Intel QPI technology allowing scalability from a single "building block" of 32 cores up to a maximum of 4 "building blocks" with 128 cores in 6U
  • Up to 1TB of memory (with 8GB DIMMS)
  • Support for applications requiring extensive I/O capacity

(More information about this product can be found on Cray's product pages.)

Wednesday, April 7, 2010

Host-Based Processing Eliminates Scaling Issues for InfiniBand Fabrics

Scientific, engineering, and research facilities rely on InfiniBand fabrics because they offer the highest available bandwidth and the lowest available latency. But depending on the design of the InfiniBand HCAs, this advantage can be squandered as the number of compute nodes scales up into the hundreds or thousands. One of the main challenges in efficient scaling is how and where the InfiniBand protocol is processed.

Adapter-based vs. host-based processing
There are two basic ways to handle protocol processing, and the choice can make a huge difference in overall fabric performance, particularly as a cluster scales. Some vendors rely heavily on adapter-based (offload) processing techniques, in which each InfiniBand host channel adapter (HCA) includes an embedded microprocessor that processes the communications protocols. Other vendors primarily use host-based (onload) processing, in which the server processes the communications protocols. In the early days of InfiniBand clusters, a typical server may have had just one or two single- or dual-core processors. With the ability to issue only one instruction per cycle at a relatively low clock rate, these servers benefited from having communications processing offloaded to the host channel adapter.
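The scaling argument can be sketched with a toy capacity model. All the numbers below are invented purely for illustration (they are not vendor figures): the point is that per-node protocol work grows with the number of communicating peers, so a fixed embedded adapter processor saturates long before a pool of modern host cores does.

```python
# Toy capacity model of the offload-vs-onload tradeoff discussed above.
# All constants are invented for illustration only, not vendor data.

EMBEDDED_CAPACITY = 1.0         # fixed capacity of an HCA's embedded
                                # microprocessor (arbitrary units)
HOST_CORE_CAPACITY = 2.0        # assumed capacity of one modern host core
PROTOCOL_LOAD_PER_PEER = 0.004  # protocol work per communicating peer

def saturated(nodes, capacity):
    """True if per-node protocol load exceeds the processing capacity."""
    load = PROTOCOL_LOAD_PER_PEER * (nodes - 1)  # talk to every other node
    return load > capacity

for nodes in (100, 500, 1000):
    print(f"{nodes:5d} nodes: adapter-based saturated: "
          f"{saturated(nodes, EMBEDDED_CAPACITY)}, "
          f"host-based (2 cores) saturated: "
          f"{saturated(nodes, 2 * HOST_CORE_CAPACITY)}")
```

Under these assumed rates, the adapter's embedded processor saturates somewhere between 100 and 500 nodes, while two dedicated host cores still have headroom at 1000 nodes.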


(The full version of this article can be obtained from HPCwire's web pages.)

Tuesday, March 16, 2010

Intel Ups Performance Ante with Westmere Server Chips

Right on schedule, Intel has launched its Xeon 5600 processors, codenamed "Westmere EP." The 5600 represents the 32nm sequel to the Xeon 5500 (Nehalem EP) for dual-socket servers. Intel is touting better performance and energy efficiency, along with new security features, as the big selling points of the new Xeons.

For the HPC crowd, the performance improvements are the big story. Thanks in large part to the 32nm transistor size, Intel was able to incorporate six cores and 12 MB of L3 cache on a single die -- a 50 percent increase compared to the Xeon 5500 parts. According to Intel, that translated into a 20 to 60 percent boost in application performance and 40 percent better performance per watt.

Using the high performance Linpack benchmark, Intel is reporting a 61 percent improvement for a 6-core 5600 compared to its 4-core Xeon 5500 predecessor (146 gigaflops versus 91 gigaflops). You might be wondering how this was accomplished, given that the 5600 comes with only 50 percent more cores and cache. It turns out that Intel's comparison was based on its two top-of-the-line Xeon chips from each processor family. The 146 gigaflops result was delivered by an X5680 processor, which runs at 3.33 GHz and has a TDP of 130 watts, while the 91 gigaflops mark was turned in by the X5570 processor, which runs at 2.93 GHz and has a TDP of 95 watts. Correcting for clock speed, the 5600 Linpack would be something closer to 128 gigaflops, representing a still-respectable 41 percent boost.
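The clock-speed correction above is straightforward arithmetic and can be reproduced directly:

```python
# Reproduce the article's clock-speed normalization of the Linpack numbers.
x5680_gflops, x5680_ghz = 146.0, 3.33  # six-core Xeon 5600 (Westmere EP)
x5570_gflops, x5570_ghz = 91.0, 2.93   # four-core Xeon 5500 (Nehalem EP)

# Scale the X5680 result down to the X5570's clock frequency.
normalized = x5680_gflops * (x5570_ghz / x5680_ghz)
print(f"clock-normalized Linpack: {normalized:.0f} gigaflops")  # 128

speedup = normalized / x5570_gflops - 1
print(f"clock-corrected improvement: {speedup:.0%}")            # 41%
```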

Intel also reported performance improvements across a range of technical HPC workloads. These include a 20 percent boost on memory bandwidth (using Stream-MP), a 21 percent average improvement with a number of CAE codes, a 44 percent average improvement for life science codes, and a 63 percent improvement using a Black Scholes financial benchmark. These results also reflect the same 3.33/2.93 GHz clock speed bias discussed in the Linpack test, so your mileage may vary.

Looking at the performance-per-watt metric, the new 5600 chips also have a clear edge. An apples-to-apples comparison of the X5570 (2.93 GHz, 95 watts) and X5670 (2.93 GHz, 95 watts) has the latter chip delivering 40 percent more performance per watt. That's to be expected, since two extra cores are available on the X5670 to do extra work.

Intel is also offering low-power 40 and 60 watt versions of the 5600 alongside the mainstream 80, 95, and 130 watt offerings. These low-power versions would be especially useful where energy consumption, rather than performance, is the driving factor. For example, a 60 watt L5640 matches the raw performance of a 95 watt X5570, potentially saving 30 percent in power consumption. Intel is even offering a 30 watt L3406, aimed at the single-processor microserver segment. Other power-saving goodies that come with the 5600 include a more efficient Turbo Boost and memory power management facility, automated low power states for six cores, and support for lower power DDR3 memory.

The Xeon 5600 parts are socket-compatible with the 5500 processors and can use the same chipsets, making a smooth upgrade path for system OEMs. Like their 5500 predecessors, the 5600s support DDR3 memory to the tune of three memory channels per socket. Practically speaking, that means two cores share a memory channel when all six cores are running full blast.

The enterprise market will be pleased by the new on-chip security features in the 5600 architecture. First, there are new AES instructions for accelerating database encryption, whole-disk encryption and secure internet transactions. The 5600 also offers what Intel calls Trusted Execution Technology (TXT), which can be used to prevent the insertion of malicious VM software at bootup in a virtualized cloud computing environment.

Although the 5600 family will bring Intel into the mainstream six-core server market, the company is offering new four-core parts as well. In fact, the fastest clock is owned by the X5677, a quad-core processor that tops out at 3.46 GHz. These top-of-the-line four-core versions might find a happy home with many HPC users, in particular where single-threaded application performance is paramount. This would be especially true for workloads that tend to be memory-bound, since in this case more cores might actually drag down performance by incurring processing overhead while waiting for a memory channel to open up.

Intel's marketing strategy for the Xeon 5600 is not that different from its 5500 sales pitch: improved processor efficiencies generate quick payback on the investment. For the 5600, the claim is that you can replace 15 racks of single-core Xeons with a single rack of the new chips, that is, as long as you don't need any more performance. Intel is touting a five-month payback for this performance-neutral upgrade.


On the other hand, if you need 15 times the performance, you can do a 1:1 replacement of your single-core servers and still realize about eight percent in energy savings. But since software support and server warranty costs dominate maintenance expenses, any energy savings might get swallowed up by these other costs.
Intel says it is keeping the prices on the 5600 processors in line with the Xeon 5500s, although the new processor series spans a wider range of offerings. At the low end is the L3406, a 30 watt, 2.26 GHz part with four cores and just 4 MB of L3 cache; it goes for just $189. At the top end are the six-core X5680 and the four-core X5677, both of which are offered at $1,663. Prices quoted are in quantities of 1,000.

In conjunction with Intel's launch, a number of HPC OEMs are also announcing new systems based on the Xeon 5600 series. For example, Cray announced its CX1 line of deskside machines will now come with the new chips. SGI is also incorporating the new Xeons into its portfolio, including the Altix ICE clusters, the InfiniteStorage servers, and the Octane III personal super. SGI will also use the new chips in its just-announced Origin 400 workgroup blade solution. IBM, HP and Dell are likewise rolling out new x86 servers based on the 5600 processors.

AMD is looking to trump Intel's latest Xeon offerings with its new Magny-Cours 6100 series Opteron processors, which are set to launch at the end of the month. The new Opterons come in 8- and 12-core flavors and are debuting alongside AMD's new G34 chipset. Although the Opterons lack the Hyper-Threading technology of the Xeons, the additional physical cores and a fourth memory channel should make up for this. Also unlike the 5600 architecture, the Opteron 6100 supports both 2-socket and 4-socket systems, giving server makers additional design flexibility. In any case, the x86 rivalry is quickly heating up as the two chipmakers battle for market share in 2010.

(This article is sourced from HPCwire; the original text can be found on their web pages.)

Wednesday, March 10, 2010

HPC Training Course: 5 - 6 May 2010, University College, London, UK

  • Introduction to High Performance Computing.
  • Introduction to the DEISA Infrastructure.
DEISA is running two training courses at University College, London, in early May 2010. Both will be based around a number of practical programming exercises. No prior knowledge is assumed for either of the courses.

The first course on Wednesday 5th May is an "Introduction to High Performance Computing". It will cover the fundamentals of modern HPC architectures and the two major parallel programming models: shared variables and message passing. Practical sessions will involve running existing parallel programs to investigate issues such as performance and scalability.

The second course on Thursday 6th May is an "Introduction to the DEISA Infrastructure". This will cover the basic aspects of the DEISA distributed supercomputer environment and the software tools that are used to access it, including the Application Hosting Environment (AHE). Practical sessions will involve installing software on the desktop and using it to access the DEISA systems.

The courses are free for academic attendees. If the courses become over-subscribed, preference will be given to members of the Virtual Physiological Human Network of Excellence.

Those attending are encouraged to use their own laptops for both courses.

(To register, please fill in the form on their web pages.)

Monday, February 15, 2010

Solving The Protein Folding Problem with HPC

The University of Florida uses high-performance computing to simulate protein folding and help in the fight against disease.

Challenge: The Protein Folding Problem
Like a road map, a protein molecule can be folded in many ways, but only one is right. Misfold a map and the only penalty is inconvenience; misfold a protein and the penalty can be a serious disease. How does a protein know the shape into which it is supposed to fold? High-performance computing can help answer this question.

Low free energy is good: laboratory experiments can probe only the unfolded and folded regions of the energy curve, while computer experiments can probe the whole thing. Professor Adrian Roitberg and Seonah Kim are doing just that on the UF High Performance Computing (HPC) Cluster at the University of Florida. The cluster depends on the high performance and reliability of the Cisco® InfiniBand fabric that connects the AMD Opteron based Rackable servers and storage subsystem. Kim's simulations have run for more than 45 days on 100 processors and aren't done yet.

The simulation uses the highly parallelized Assisted Model Building with Energy Refinement (AMBER) package of molecular simulation programs. Why so long to study just two proteins? For one thing, biology involves a lot of water: in the simulation, 7,000 water molecules (21,000 atoms) surround a 14-residue peptide molecule. AMBER works by calculating the motions of all these molecules. They bend, rotate, and move through space, avoiding or bouncing off one another. The simulation divides time into little steps and uses Newton's laws of physics to calculate the motion of the thousands of atoms at each step.
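The time-stepping idea described above can be sketched with a velocity-Verlet integrator, a standard technique in molecular dynamics codes (this is a generic one-particle sketch, not AMBER's actual implementation):

```python
import math

# Generic sketch of MD time stepping: divide time into small steps and
# apply Newton's second law (F = ma) at each step via velocity Verlet.

def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one particle's motion under a force law F(x)."""
    a = force(x) / mass
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt  # update position
        a_new = force(x) / mass          # force at the new position
        v += 0.5 * (a + a_new) * dt      # update velocity
        a = a_new
    return x, v

# Toy system: one particle in a harmonic well (a crude stand-in for a
# bonded interaction). The analytic period is 2*pi*sqrt(m/k).
k, m = 1.0, 1.0
period = 2 * math.pi * math.sqrt(m / k)
x, v = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x, mass=m,
                       dt=0.001, steps=int(period / 0.001))
print(f"after one period: x = {x:.3f}")  # returns to x ≈ 1.0
```

A real AMBER run does the same thing for tens of thousands of interacting atoms, with far more elaborate force fields, which is why the simulations take so long.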

The High-Performance Computing Initiative at UF is an innovative approach to such needs. The design is a computing grid, linking specialized research computing clusters to a central parallel cluster over a dedicated high-speed network. Funding from the National Science Foundation and a cooperative agreement with Cisco provided the routers and switches for that grid.


(This news is summarized from Cisco; the original text can be reached on their website.)

SC10 Conference

The SC Conference is the premier international conference for high performance computing (HPC), networking, storage and analysis. This year's conference will be held in New Orleans, LA, USA, on November 15th - 18th, 2010.

For more info visit sc10.supercomputing.org

Saturday, February 13, 2010

NBCR Releases APBS Roll for Rocks 5.3

The NBCR (National Biomedical Computation Resource) at the University of California, San Diego is pleased to announce the availability of the APBS (Adaptive Poisson-Boltzmann Solver) Roll package for Rocks clusters version 5.3 for i386 and x86_64 architectures.

APBS is a scalable Poisson-Boltzmann equation solver used to study electrostatic properties of small to nanoscale biomolecular systems. The APBS Roll simplifies APBS deployment and integration on Rocks clusters. More information about APBS can be found at SourceForge or at the NBCR web site.
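As a toy illustration of the kind of boundary-value problem APBS handles, here is a 1D Poisson equation solved by Jacobi iteration on a finite-difference grid. (This is only for intuition: APBS itself solves the full 3D Poisson-Boltzmann equation with far more capable numerics.)

```python
# Toy version of the problem class APBS addresses: solve u'' = -rho(x)
# on [0, 1] with u = 0 at both boundaries, via Jacobi iteration on a
# finite-difference grid.

def solve_poisson_1d(rho, h, iterations=20000):
    """Jacobi iteration for u'' = -rho on a uniform grid with spacing h."""
    n = len(rho)
    u = [0.0] * n  # boundary values stay fixed at 0
    for _ in range(iterations):
        u = [0.0] + [0.5 * (u[i - 1] + u[i + 1] + h * h * rho[i])
                     for i in range(1, n - 1)] + [0.0]
    return u

# Point-charge-like source in the middle of the domain.
n, h = 51, 1.0 / 50
rho = [0.0] * n
rho[n // 2] = 1.0 / h  # approximate a unit point charge
u = solve_poisson_1d(rho, h)
print(f"peak potential at the charge: {u[n // 2]:.3f}")  # ≈ 0.25
```

The computed peak matches the analytic Green's function value of 0.25 for a unit charge at the center of the domain, a quick sanity check that the discretization is right.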

This APBS Roll contains the latest APBS version 1.2.1b and PDB2PQR package version 1.5. The APBS Roll can be downloaded from the APBS download site and the Roll documentation including installation and usage information is available here.

Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters, grid endpoints and visualization tiled-display walls. Hundreds of researchers from around the world have used Rocks to deploy their own cluster (see the Rocks Cluster Site).

Sunday, July 19, 2009

ScaleMP announces vSMP Foundation for Cluster Structures


The vSMP Foundation for Cluster™ solution provides a simplified compute architecture for high-performance clusters: it hides the InfiniBand fabric, offers built-in high-performance storage as a cluster-filesystem replacement, and reduces the number of operating systems to one, making the cluster much easier to administer. This solution is ideally suited for smaller compute implementations in which management tools and skills may not be readily available.

The target customers for the Cluster product are those with initial high performance cluster implementations who are concerned with the complexity of creating and managing a cluster environment.

Key Advantages:
  • Simplified install and management of high performance clusters;
    • Consolidates multiple nodes and operating systems into one;
    • Eliminates the need for a separate cluster filesystem;
  • Stronger entry-level value proposition – scale-up growth opportunities with no additional overhead.
Detailed product information is available on their web pages.
