NCHPC: April 2010

Friday, April 23, 2010

Ceph: The Distributed File System Creature from the Object Lagoon

The last two years have seen a large number of file systems added to the kernel with many of them maturing to the point where they are useful, reliable, and in production in some cases. In the run up to the 2.6.34 kernel, Linus recently added the Ceph client. What is unique about Ceph is that it is a distributed parallel file system promising scalability and performance, something that NFS lacks.

High-level view of Ceph

One might ask about the origin of Ceph since it is somewhat unusual. Ceph is really short for Cephalopod which is the class of moulluscs to which the octopus belongs. So it’s really short for octopus, sort of. If you want more detail, talk a look at the Wikipedia article about Ceph. Now that name has been partially explained, let’s look at the file system.

Ceph was started by Sage Weil for his PhD dissertation at the University of California, Santa Cruz in the Storage Systems Research Center in the Jack Baskin School of Engineering. The lab is funded by the DOE/NNSA involving LLNL (Lawrence Livermore National Labs), LANL (Los Alamos National Labs), and Sandia National Laboratories. He graduated in the fall of 2007 and has kept developing Ceph. As mentioned previously, his efforts have been rewarded with the integration of the Ceph client into the upcoming 2.6.34 kernel.

The design goals of Ceph are to create a POSIX file system (or close to POSIX) that is scalable, reliable, and has very good performance. To reach these goals Ceph has the following major features:

It is object-based
It decouples metadata and data (many parallel file systems do this as well)
It uses a dynamic distributed metadata approach

These three features and how they are implemented are at the core of Ceph (more on that in the next section).

However, probably the most fundamental core assumption in the design of Ceph is that large-scale storage systems are dynamic and there are guaranteed to be failures. The first part of the assumption, assuming storage systems are dynamic, means that storage hardware is added and removed and the workloads on the system are changing. Included in this assumption is that it is presumed there will be hardware failures and the file system needs to adaptable and resilient.

(Full version of this article can be obtained from Linux Magazine's web pages)

Friday, April 16, 2010

New Cray OS Brings ISVs in for a Soft Landing

Cray has never made a big deal about the custom Linux operating system it packages with its XT supercomputing line. In general, companies don't like to tout proprietary OS environments since they tend to lock custom codes in and third-party ISV applications out. But the third generation Cray Linux Environment (CLE3) that the company announced is designed to make elite supercomputing an ISV-friendly experience.

Besides adding compatibility to off-the-shelf ISV codes, which we'll get to in a moment, the newly-minted Cray OS contains a number of other enhancements. In the performance realm, CLE3 increases overall scalability to greater than 500,000 cores (up from 200,000 in CLE2), adds Lustre 1.8 support, and includes some advanced scheduler features. Cray also added a feature called "core specialization," which allows the user to pin a single core on the node to the OS and devote the remainder to application code. According to Cray, on some types of codes, this can bump performance 10 to 20 percent. CLE3 also brings with it some additional reliability features, including NodeKARE, a diagnostic capability that makes sure jobs are running on healthy nodes.

But the biggest new feature added to CLE3 is compatibility with standard HPC codes from independent software vendors (ISVs). This new capability has the potential to open up a much broader market for Cray's flagship XT product line, and further blur the line between proprietary supercomputers and traditional HPC clusters.

Cray has had an on-again off-again relationship with HPC software vendors. Many of the established ISVs in this space grew up alongside Cray Research, and software from companies like CEI, LSTC, SIMULIA, and CD-adapco actually ran on the original Cray Research machines. Over time, these vendors migrated to standard x86 Linux and Windows systems, which became their prime platforms, and dropped products that required customized solutions for supercomputers. Cray left most of the commercial ISVs behind as it focused on high-end HPC and custom applications.

Programming Environment of CLE
The CLE programming environment includes tools designed to complement and enhance each other, resulting in a rich, easy-to-use programming environment that facilitates the development of scalable applications.

Parallel programming models: MPI, SHMEM, UPC, OpenMP, and Co-Array Fortran within the node
MPI 2.0 standard, optimized to take advantage of the scalable interconnect in the Cray XT system
Various MPI libraries supported under Cluster Compatibility Mode
Optimized C, C++, UPC, Fortran90, and Fortran 2003 compilers
High-performance optimized math libraries of BLAS, FFTs, LAPACK, ScaLAPACK, SuperLU, and Cray Scientiific Libraries
Cray Apprentice2 performance analysis tools

(Full version of this article can be obtained from HPCwire's web pages)

Monday, April 12, 2010

Product Review: Cray CX1000™ High(brid) Performance Computers

The Cray CX1000 series is a dense, power efficient and supremely powerful rack-mounted supercomputer featuring best-of-class technologies that can be mixed-and-matched in a single rack creating a customized hybrid computing platform to meet a variety of scientific workloads.

Cray is announced the Cray CX1000 system; a dense, power efficient and supremely powerful rack-mounted supercomputer that allows you to leverage the latest Intel® Xeon® processors for:

Scale-out cluster computing using dual-socket Intel Xeon 5600s (Cray CX1000-C)
Scale-through (GPU) computing leveraging NVIDIA Tesla® (Cray CX1000-G)
Scale-up computing with SMP nodes built on Intel’s QuickPath Interconnect (QPI) technology offering "fat memory" nodes (Cray CX1000-S)

High(brid) Performance Computing – The Cray CX1000 redefines HPC by delivering hybrid capabilities through a choice of chassis, each delivering one of the most important architectures of the next decade.

Cray CX1000-C Chassis
The compute-based Cray CX1000-C chassis includes 18 dual-socket Intel Xeon 5600 blades with an integrated 36-port QDR InfiniBand switch and a 24-port Gigabit Ethernet switch – all in 7U. With support for Windows® HPC Server 2008 or Red Hat Linux via the Cray Cluster Manager, the Cray CX1000-C system provides outstanding support for ISV applications as well as dual-boot capability for ultimate application flexibility. The Cray CX1000-C system maintains Cray's "Ease of Everything" approach by incorporating blades, switches and cabling all within a single chassis. The result is an easy-to-install system with compelling capabilities for scale-out high performance computing.

Two high-frequency Intel® Xeon® 5600 series processors (up to 2.93 GHz)
Large memory capacity (up to 48GB memory per blade with 4GB DDR3 DIMMs)
One SATA HDD or one SSD drive or diskless

Cray CX1000-G Chassis

The GPU-based Cray CX1000-G chassis delivers nine double-width, dual-socket Intel Xeon 5600 blades, each incorporating two NVIDIA Tesla GPUs. Cray CX1000-G systems allow users to maximize GPU performance with its unique architecture by eliminating I/O bottlenecks – an industry first. These 7U systems include an integrated 36-port QDR InfiniBand switch and a 24-port Gigabit Ethernet switch. The Cray CX1000-G system is the best solution to your density limitations by offering 18 NVIDIA Tesla GPUs in a 7U form factor. Combining Intel Xeon 5600 performance with NVIDIA Tesla-based acceleration offers true hybrid computing options.

Double-width blade
Two Intel® Xeon® 5600 series processors
Two NVIDIA® Tesla® M1060 GPUs
Up to 48GB or memory per blade with 8GB DDR3 DIMMs
Two ConnectX adapters providing single QDR IB channel

Cray CX1000-S Chassis

The SMP-based Cray CX1000-S server is offered in two configurations, offering up to 128 Intel® Xeon® 7500 series processors and 1 TB of memory in a 6U system. The Cray CX1000-SC compute node is made up of uniquely designed 1.5U "Building Blocks", each housing 32 cores interconnected using Intel QPI. The Cray CX1000-SM management node is a 3U server with four Intel Xeon 7500 series processors (32 cores) and up to 256 GB of memory.

Coherency switch – a proprietary feature based on Intel QPI technology allowing scalability from a single "building block" of 32 cores up to a maximum of 4 "building blocks" with 128 cores in 6U
Up to 1TB of memory (with 8GB DIMMS)
Support for applications requiring extensive I/O capacity

(The more information about this product can be obtained from Cray's product pages)

IO Profiling of Applications: strace_analyzer

Strace is a very useful tool for examining the IO profile of applications, as it comes standard on every Linux distro. However, as we’ll see in this article, strace can produce hundreds of thousands of lines of output. Trying to develop statistics and trends from a files of this size is virtually impossible to do by hand.

In this article, we will take a look at a tool to do a statistical analysis on strace output: strace_analyzer. This tool can take an individual strace file that has been created with the “-T -ttt” options and produce a statistical analysis of the IO portion of the strace. It also produces data files and .csv (comma delimited files for spreadsheets) files that can be used for plotting.

(Full version of this article can be obtained from Linux Magazine's web pages)

Book Review: The OpenCL Programming Book

Fixstars Corporation announces a book which starts with the basics of parallelization, covers the main concepts, grammar, and setting up a development environment for OpenCL, concluding with source-code walkthroughs of the FFT and Mersenne Twister algorithms written in OpenCL. It is highly recommended for those wishing to get started on programming in OpenCL.

(The pricing and more information can be obtained Fixstars web pages.)

Wednesday, April 7, 2010

Host-Based Processing Eliminates Scaling Issues for InfiniBand Fabrics

Scientific, engineering, and research facilities rely on InfiniBand fabrics because they offer the highest available bandwidth and the lowest available latency. But depending on the design of the InfiniBand HCAs, this advantage can be squandered as the number of compute nodes scales up into the hundreds or thousands. One of the main challenges in efficient scaling is how and where InfiniBand protocol is processed.

Adapter-based vs. host-based processing
There are two basic ways to handle protocol processing, and the choice can make a huge difference in overall fabric performance, particularly as a cluster scales. Some vendors rely heavily on adapter-based ('on-load) processing techniques, in which each InfiniBand host channel adapter (HCA) includes an embedded microprocessor that processes the communications protocols. Other vendors primarily use host-based processing, in which the server processes the communications protocols. In the early days of InfiniBand clusters, a typical server may have had just one or two single- or dual-core processors. With the ability to issue one instruction per second at a relatively low clock rate, these servers benefitted from having communications processing offloaded to the host channel adapter.

(Full version of this article can be obtained from HPCwire's web pages)

Friday, April 2, 2010

Red Hat Focuses New RHEL 5.5 on Multicore

Open-source enterprise software company Red Hat has updated its flagship operating system, Red Hat Enterprise Linux (RHEL), to take full advantage of the latest spoils from the heated microprocessor battle between Advanced Micro Devices and Intel.

RHEL version 5.5, released Wednesday, has been reconfigured for Intel's just-released eight-core Nehalem-EX and AMD's almost-as-recently released 12-core "Magny-Cours" Opteron 6100 Series processors. The software also supports the IBM eight-core Power7 processors, released in February.

RHEL 5.5 also now supports Single Root I/O Virtualization (SR-IOV), a specification that allows multiple virtual guests to better share PCI hardware resources and I/O devices. While some I/O-intensive applications, such as database servers, can experience as much as a 30 percent reduction in performance when virtualized, these new technologies could reduce that latency to as little as 5 percent.

Beyond support for the new round of multicore releases, RHEL 5.5 has a number of other new features as well. It has been updated to extend Active Directory integration, through the use of the latest version of Samba file- and print-sharing software. Also, for the first time, RHEL's version of SystemTap can trace the run-time performance of C++ applications (much like Oracle's DTrace does for Solaris' applications).
RHEL 5.5 also aggregates all the bug fixes and maintenance patches since the release of RHEL 5.4, released last September.

RHEL 5.5 is available for download for subscribers.

(This news sourced from the pcworld.com and full version can be reached their web pages)

Thursday, April 1, 2010

AMD Launches Intel Counter-Assault with New Opteron Chips

AMD has officially launched its Opteron 6100 series processors, code-named "Magny-Cours." Available in 8-core and 12-core flavors, the new 6100 parts are targeted for 2P and 4P server duty and are being pitched against Intel's latest high-end Xeon silicon: the 6-core Westmere EP processor for 2P servers and the upcoming 8-core Nehalem EX processor for 4P-and-above servers.

With the 6100 launch, AMD's battle with Intel for the high-end x86 server market enters a new era. In the two-socket server space, Intel's Westmere EP retains the speed title, clock-frequency-wise. At the same time, Nehalem EX, due to be announced tomorrow, will give Intel exclusive ownership of the 8P-and-above x86 server market. Meanwhile, AMD will use Magny-Cours to try to outmaneuver Intel with better price-performance and performance-per-watt on two-socket and four-socket machines.

While Intel can still deliver faster cores on its Westmere EP, thanks in part to its 32nm process technology, AMD, with its 45nm technology, has opted to go for more cores that run proportionally slower. The fastest Westmere EP CPUs top out at 3.33 GHz for the 6-core version and 3.46 GHz for the 4-core version. In contrast, the speediest 12-core and 8-core Magny-Cours come in at 2.3 GHz and 2.4 GHz respectively.

The $266 to $1,386 price spread for Magny-Cours will look especially attractive for large-scale 4P setups compared to the more expensive Nehalem EX. (As of Monday, prices on the EX series have not been announced, but are expected to range between $800 to $3,600.) For HPC deployments in particular, where hundreds or thousands of nodes are involved, the up-front cost savings are likely to be significant.

(This article summarized from the HPCwire and full version can be reached their web pages)

NCHPC