Several Event Announcements

These are some announcements of upcoming events in which I am involved to varying degrees. The first two will take place at RWTH Aachen University and attendance is free of charge; the third is part of the SC12 conference in Salt Lake City, UT, in the US.

Tuning for bigSMP HPC Workshop – aixcelerate (October 8th – 10th, 2012). The number of cores per processor chip is increasing. Today’s “fat” compute nodes are equipped with up to 16 eight-core Intel Xeon processors, resulting in 128 physical cores, with up to 2 TB of main memory. Furthermore, special solutions like a ScaleMP vSMP system may consist of 16 nodes with 4 eight-core Intel Xeon processors each and 4 TB of accumulated main memory, scaling the number of cores even further up to 1024 per machine. While message-passing with MPI is the dominating paradigm for parallel programming in the domain of high performance computing (HPC), with the growing number of cores per cluster node the combination of MPI with shared memory programming is gaining importance. The efficient use of these systems also requires NUMA-aware data management. Because different levels of parallelism have to be exploited, namely shared memory programming within a node and message-passing across the nodes, obtaining good performance becomes increasingly difficult. This tuning workshop will cover in detail the tools and methods to program big SMP systems. The first day will focus on OpenMP programming on big NUMA systems, the second day on Intel Performance Tools as well as the ScaleMP machine, and the third day on Hybrid Parallelization. Attendees are kindly requested to prepare and bring their own code, if applicable. If you do not have your own code but are interested in the presented topics, you may work on prepared exercises during the lab time (hands-on). Good knowledge of MPI and/or OpenMP is recommended. More details and the registration link can be found at the event website.
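To give an idea of what NUMA-aware data management means in practice, here is a minimal sketch of the common "first touch" idiom with OpenMP. It is not taken from the workshop material; the function name first_touch_dot and the array contents are made up for illustration, and the sketch assumes an operating system that places a memory page on the NUMA node of the thread that first writes to it.

```cpp
#include <cassert>
#include <memory>

// Dot product of two arrays that are initialized in parallel ("first touch").
// With a[i] = 1.0 and b[i] = 2.0 the result is 2.0 * n.
double first_touch_dot(long n) {
    assert(n >= 0);

    // new double[n] without an initializer does not write to the memory,
    // so the pages are typically not yet placed on any NUMA node.
    std::unique_ptr<double[]> a(new double[n]);
    std::unique_ptr<double[]> b(new double[n]);

    // First touch: with a static schedule every thread initializes the
    // chunk of the arrays that it will work on in the compute loop below.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    // Same static schedule, so each thread mostly reads the data it touched.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The loop index is a signed type on purpose, since older OpenMP implementations (such as the one in Visual Studio) require it. Without OpenMP support enabled the pragmas are ignored and the code runs serially with the same result.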

OpenACC Tutorial Workshop (October 11th to 12th, 2012). OpenACC is a directive-based programming model for accelerators which delegates the responsibility for low-level (e.g. CUDA or OpenCL) programming tasks to the compiler. Using the OpenACC API, the programmer can offload compute-intensive loops to an attached accelerator. The open industry standard OpenACC was introduced in November 2011 and supports accelerating regions of code in standard C, C++ and Fortran. It provides portability across operating systems, host CPUs and accelerators. Up to now, OpenACC compilers exist from Cray, PGI and CAPS. During this workshop, you will work with PGI’s OpenACC implementation on Nvidia Quadro 6000 GPUs. This OpenACC workshop is divided into two parts (with separate registrations!). In the first part, we will give an introduction to the OpenACC API while focusing on GPUs. It is open to everyone who is interested in the topic. In contrast to the first part, the second part will not contain any presentations or hands-on sessions. To the second day, we invite all programmers who have their own code and want to try accelerating it on a GPU using OpenACC, with the help of our team members and Nvidia staff. More details and the registration link can be found at the event website.
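For readers who have not seen OpenACC yet, the following is a small sketch of what offloading a loop with a directive can look like. It is my own illustration and not part of the workshop exercises; the function name saxpy and the choice of data clauses are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Computes y = a * x + y element-wise.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // Ask the compiler to run the loop on the accelerator:
    // copyin transfers x to the device, copy transfers y in and back out.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

A compiler without OpenACC support ignores the directive and the loop simply runs on the host, which is a large part of the appeal of the directive-based approach.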

Advanced OpenMP Tutorial at SC12 (November 12th, 2012). With the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance. While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We discuss how OpenMP features are implemented and then focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and private versus shared data. We discuss language features in depth, with emphasis on features recently added to OpenMP such as tasking. We close with debugging, compare various tools, and illustrate how to avoid correctness pitfalls. More details can be found on the event website.
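As a taste of the performance aspects mentioned above, here is a minimal sketch of how false sharing can be avoided by padding per-thread data. It is not taken from the tutorial slides; the names PaddedCounter and count_even are hypothetical, and the cache line size of 64 bytes is an assumption that holds for many current x86 processors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// One counter per thread, padded to an assumed cache line size of 64 bytes
// so that two threads never update the same cache line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Counts the even numbers in data using one private slot per thread.
long count_even(const std::vector<int>& data) {
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<PaddedCounter> counters(static_cast<std::size_t>(max_threads));
    const long n = static_cast<long>(data.size());

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        assert(tid < max_threads);
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            if (data[i] % 2 == 0) {
                ++counters[tid].value;  // touches only this thread's cache line
            }
        }
    }

    // Serial combination of the per-thread results.
    long total = 0;
    for (const PaddedCounter& c : counters) {
        total += c.value;
    }
    return total;
}
```

In real code a reduction clause would be the simpler choice for a plain counter; the explicit per-thread slots are used here only to make the padding visible. If PaddedCounter is replaced by a plain long, neighbouring counters share a cache line and the threads may slow each other down although they never access the same variable.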

Dan Reed on Technical (Cloud) Computing with Microsoft: Vision

During ISC 2011 in Hamburg I got the opportunity to talk to Microsoft’s Dan Reed, Corporate Vice President, Technology Policy and Extreme Computing Group. It was a very nice discussion that soon turned towards HPC in the Cloud, touching the topics of Microsoft’s Vision, Standards, and Education. Karsten Reineck from the Fraunhofer SCAI was also present; he already put an excerpt of the interview on his blog (in German). The following is my recapitulation of the discussion, pointing out his most important statements (part 1 of 2).

Being the person I am, I started the talk with a nasty question on the pricing scheme of Azure (and similar commercial offerings), claiming that it is pretty expensive both per CPU hour and per byte of I/O. Just recently we did a full cost accounting to calculate the price per CPU hour of our HPC service, and we found ourselves to be cheaper by a notable factor.

Dan Reed: Academic sites of reasonable size, such as yours, can do HPC cheaper because they are utilizing the hardware on a 24×7 basis. Traditionally, they do not offer service-level agreements on how fast any job starts, they just queue the jobs. Azure is different, and it has to be: one can get the resources within a guaranteed time frame. As of today, HPC in the Cloud is interesting for burst scenarios where the on-premise resources are not sufficient, or for people for whom traditional HPC is too complex (regardless of Windows vs. Linux, just maintaining an on-premise cluster versus buying HPC time when it is needed).

I am completely in line with that. I expressed my belief that we will need (and have!) academic HPC centers for the foreseeable future. Basically, we are just a (local) HPC cloud service provider for our users – which of course we call customers, internally. To conclude this topic, he said something very interesting:

Dan Reed: In industry, the cost is not the main constraint, the skill is.

Ok, since we are offering HPC services on Linux and Windows, and since there was quite some buzz around the future of the Windows HPC Server product during ISC, I asked where the Windows HPC Server product is heading.

Dan Reed: The foremost goal is to better integrate and support cloud issues. For example, currently, there are two schedulers, the Azure scheduler and the traditional Windows HPC Server scheduler. Basically, that is one scheduler too many. Regarding improvements in Azure, we will see support for high-speed interconnects soon.

Azure support for MPI programs has just been introduced with Windows HPC Server 2008 R2 SP2 (a long product name, hm?). By the way, he assumes that future x GigaBit Ethernet will be favoured over InfiniBand.

For us it is clearly interesting to see where Azure and other similar offerings are heading, and we can learn something from that for our own HPC service. For example, we already offer service-level agreements for some customers under some circumstances. However, on-premise resources will play the dominating role for academic HPC in the foreseeable future. Thus I am interested in the future of the product and asked specifically about the future of the Windows HPC Server.

Dan Reed: Microsoft, as a company, is strongly committed to a service-based business model. This has to be understood in order to realize what is driving some of the shifts we are seeing right now, both in the products and the organization itself. The focus on Cloud Computing elevated the HPC Server team, the Technical Computing division is now part of the Azure organization. The emphasis of the future product development thus is clearly shifting towards cloud computing, that is true, although the product remains to be improved and features will be added for a few releases (already in planning).

Well, as an MVP for Windows HPC Server and a member of the Customer Advisory Board, I know something about the planning of upcoming product releases, so I believe Microsoft is still committed to the product (as opposed to some statements made by other people during ISC). However, I do not see Windows Server itself moving in the right direction for HPC. Obviously HPC is just a niche market for Microsoft, but better support for multi- and many-core processors and hierarchical memory architectures (NUMA!) would be desirable. Asking (again) about that, I got the following answer:

Dan Reed: Windows HPC Server is derived from Windows Server, which itself is derived from Windows. So, if you want to know where Windows HPC Server is going with regard to its base technologies, you have to see (and understand) where Windows itself is going.

Uhm, ok, so we better take a close look at Windows 8 :-). Regarding Microsoft’s way towards Cloud Computing, I will write a second blog post later to cover more of our discussion on the topics of Standards and Education. Since this blog post is on the Vision, I just want to share a brief discussion we had when heading back to the ISC show floor. I asked him for his personal (!) opinion on the race towards Exascale. Will we get an Exascale system by (the end of) 2019?

Dan Reed: Given the political will and money, we will overcome the technical issues we are facing today.

Ok. Given that someone has that will and the money, would such a system be usable? Do you see any single application for such a system?

Dan Reed: Big question mark. I would rather see money being invested in solving the software issues. If we get such powerful systems, we have to be able to make use of them for more than just a single project.

Again, I am pretty much in line with that. By no means am I claiming to fully understand all challenges and opportunities of Exascale systems, but what I do see are the challenges of making use of today’s Petaflop systems with applications other than LINPACK, especially from the domain of Computational Engineering. Taking the opportunity, my last question was: Who do you guess would have the political will and the money to build an Exascale system first, the US, or Europe, or rather Asia?

Dan Reed: Uhm. If I had to bet, I would bet on Asia. And if such a system comes from Asia, all critical system components will be designed and manufactured in Asia.

Interesting. And clearly a challenge.

An Update on Building and Using BOOST.MPI on Windows HPC Server 2008 R2

My 2008 blog post on Building and Using BOOST.MPI on Windows HPC Server 2008 still generates quite some traffic. Since some things have changed since then, I thought an updated howto could help those visitors. Again, this post puts the focus on building boost.mpi with various versions of MS-MPI, and does not cover all aspects of building boost on Windows (go to Getting Started on Windows for that).

The problem that still remains is that the MPI auto-configuration only looks for MS-MPI v1, which came with the Compute Cluster Pack and was typically installed to the directory C:\Program Files\Microsoft Compute Cluster Pack. MS-MPI v2, which comes with the Microsoft HPC Pack 2008 [R2], is typically installed to the directory C:\Program Files\Microsoft HPC Pack 2008 [R2] SDK, but the auto-configuration does not examine these directories. In the old post I explained where to change the path the auto-configurator is looking at. Of course, this is not what one expects from an “auto”-configuration tool. Extending the mpi.jam file to search all standard directories where MS-MPI might be installed turned out to be pretty simple. You can download my modified mpi.jam for boost 1.46.1 supporting MS-MPI v1 and v2 and replace the mpi.jam file that comes with the boost package. As a summary, below are the basic steps to build boost with boost.mpi on Windows (HPC) Server 2008 using Visual Studio and MS-MPI.

  1. Download boost 1.46.1 (82 MB), which is the most current version at the time of this writing (May 13th, 2011).
  2. Extract the archive. For the rest of the instructions I will assume X:\src.boost_1_46_1 as the directory the archive has been extracted into.
  3. Open a Visual Studio command prompt from the Visual Studio Tools submenu. Depending on what you intend to build, you have to use the 32-bit or 64-bit compiler environment. Execute all commands listed in the rest of the instructions from within this command prompt.
  4. Run bootstrap.bat. This will build bjam.exe.
  5. Modify the mpi.jam file located in the tools\build\v2\tools subdirectory to search for MS-MPI in the right place, or use my modified mpi.jam for boost 1.46.1 supporting MS-MPI v1 and v2 instead.
  6. Edit the user-config.jam file located in the tools\build\v2 subdirectory to contain the following line: using mpi ;.
  7. Execute the following command to start the build and installation process: bjam.exe --build-dir=x:\src.boost_1_46_1\build\vs90-64 --prefix=x:\boost_1_46_1\vs90-64 install. Please note that I use different directories in the --build-dir and --prefix options, since I intend to remove the X:\src.boost_1_46_1 directory once boost is installed. A debug build in particular may use a significant amount of disk storage.
  8. Wait…
  9. There are several other options that you might want to explore, but in many cases the default does just fine. Using the command line from above, on Windows you will get static multi-threaded libraries in debug and release mode using the shared runtime. On Windows, the default toolset is msvc, which is the Visual Studio compiler. You can change that via the toolset=xxx option, for example insert toolset=intel into the command line above just before install if you want to build using the Intel compilers.

Since it is uncomfortable to change mpi.jam whenever you are going to build a new version of boost, I filed a bug report on this and proposed to extend the search path to include MS-MPI v2 locations as well.

In order to use this build of boost, in your projects you have to add X:\boost_1_46_1\vs90-32\include\boost-1_46_1 to the list of include directories, and X:\boost_1_46_1\vs90-32\lib to the list of library directories (all according to the directory scheme I used above; use vs90-64 instead for the 64-bit build from step 7). In your code you do #include <boost/mpi.hpp>. The boost header files contain directives to link the correct boost libraries automatically, but of course you have to link with the MS-MPI library you used to build boost with.

Upcoming Events in March 2011

Let me point you to some HPC events in March 2011.

3rd Parallel Programming in Computational Engineering and Science (PPCES) Workshop. This event will continue the tradition of previous annual week-long events taking place in Aachen every spring since 2001, this year from March 21st to March 25th. This year, the agenda is, as always, a little different from the previous one. Beginning with a series of overview presentations on Monday afternoon, we are very happy to announce the upcoming RWTH Compute Cluster to be delivered by Bull. Throughout the week, we will cover serial and parallel programming using OpenMP and MPI in Fortran and C / C++ as well as performance tuning, addressing both Linux and Windows platforms. Due to the positive experience of last year, we are happy to present a renowned speaker to give an introduction into GPGPU architectures and programming on Friday: Michael Wolfe from PGI. All further information can be found at the event website.

4th Meeting of the German Windows-HPC User Group. The fourth meeting of the German Windows HPC User Group will take place in Karlsruhe on March 31st and April 1st, kindly hosted by the KIT. As in the previous years, we will learn about and discuss Microsoft’s current and future products, and users will present their (good and not so good) experiences in doing HPC on Windows. This year, we will have an Expert Discussion Panel for which the audience is invited to ask (tough) questions to fire up the discussion.

RWTH Aachen gets a new 300 Teraflops HPC system from Bull

While I usually do not repeat press releases in my blog, this one I do since we all are a little proud of the achievement: RWTH Aachen University orders Bull supercomputer to support its scientific, industrial and environmental research. Getting this system was a lot of work, and preparing for it still is. The compute power of that machine totals 300 Teraflops. The focus of our center is not just to run this machine, but to provide HPC-specific support and to ensure efficient operation. We are confident that in Bull we found a competent partner to investigate these and other topics in close collaboration.

Windows HPC Server 2008 R2 is ready

All members of the Microsoft Technology Adoption Program (TAP) for Windows HPC Server 2008 R2 just got mail that build number 2369 is ready for release. It is available via MS Connect already and will be made available via the usual channels in the coming days and weeks. We have been trying various builds throughout our participation in the TAP program – with varying success – and got a good overview of the new features in this product. As usual there are some features I really like and have been waiting for, and some features of questionable value.

The new product, both the HPC Pack 2008 R2 and the Windows Server 2008 R2 HPC Edition, will be available in two editions: Express (for traditional HPC usage including MS-MPI, the Job Scheduler and the Admin features you already know) and Enterprise (for SOA and Excel-based workload including everything from Express as well as new Excel and Workstation Cycle Stealing functionality). The HPC Pack 2008 R2 will also be available as a Workstation-only edition (giving you the Cycle Stealing functionality). I still have no clue into which edition our licenses with Software Assurance will be converted, let’s hope for Enterprise :-).

What is new for traditional HPC users (such as our center)?

  • The MPI stack (MS-MPI) has been improved and, for example, has been equipped with several environment variables to allow for more fine-grained control of the inner workings, e.g. which protocol scheme to use depending on the message size. Together with general performance improvements this offers some options for further performance tuning as well as analysis of the MPI behaviour.
  • The option to boot compute nodes via iSCSI from the network has been introduced. What you need is a suitable iSCSI provider (ask your storage vendor, MS will offer an iSCSI provider development kit) and a suitable volume; Windows HPC Server 2008 R2 is intended to do the management for you. This is the feature I (personally) was most interested in. It took us until the appearance of the release candidate to get it working well with our NetApp installation, so our experience with this is still limited, but I am very keen on seeing how this behaves with heavy job loads.
  • Improved Diagnostics have been made available. Especially on the network side the options to (automatically) check the health of your cluster have been significantly improved, along with possibilities to test whether compute nodes are ok to run ISV codes. For the latter, we have written a lot of tests of our own, and it took us a lot of time to get them right in detecting the most prominent issues with ISV codes. Providing well-integrated and extensive diagnostics is a great opportunity for ISVs to save their users from a lot of pain!
  • In addition there are several other things, like new Scheduling Policies and an improved Admin Console. The new Windows HPC Server 2008 R2 supports 256 threads (I think they mean cores) instead of 64. It became significantly easier to run pre- and post-job scripts, or to enable email notifications when the job status changes, and things like that. Once the R2 cluster is in production I intend to share our experiences with this…

A special focus of this release lies on support for “emerging workloads” (this is how Microsoft names it) based on Enterprise SOA, Excel and Desktop Cycle Stealing. I did not look into the SOA improvements so far, therefore no comment on that. A better integration of Excel with the HPC server is very welcome, although we do not (yet) have real users for this in our center. You will be able to run distributed instances of Excel 2010 on a cluster where every instance is computing an individual workbook (with a different dataset), or you can source out the computation of user-defined functions of Excel 2010 to the cluster. In the past I (and a few others) experimented with using Excel to steer computing, for example optimizing a kernel with various parameters, and I am curious whether there will be more use of that in the future by directly attaching a computation to Excel.

Well, and then there is Desktop Cycle Stealing. The idea (as far as I got it) is to use Windows 7-based workstations to run jobs, without the tight integration into a cluster that regular compute nodes have. Admittedly my view is shaped by what we do in our center, but I do not think using desktops makes a lot of sense for what most people name HPC. We design our cluster in a way that applications run efficiently on it, e.g. by using special networks. The network connection to a workstation, even if it is Gigabit Ethernet, is comparably weak. Compute nodes are centrally managed, equipped with efficient cooling, etc., whereas workstations are distributed and often not reliable. There may be some applications that can profit from getting some cycles here and there. But promising desktop cycle stealing to save some money for HPC-type ISV codes will not result in satisfied users, since these codes just do not run efficiently on a weakly coupled network of inhomogeneous machines. Just my two cents; as always I am happy to learn about counterexamples.

Parallel Programming with Visual Studio 2010: F5 to the Cluster

You have probably noticed: Visual Studio 2010 (VS2010) is about to be released. As of today, the Microsoft website states that VS2010 will be launched on April 12th. I have been playing with various builds for more than a year and I am really looking forward to taking this new version into production, since it comes loaded with plenty of new features for parallel programmers. After the launch you probably have to pay money for it, so grabbing the release candidate (RC) and taking a look at it right now may be worth it!

The feature I am talking about right now is the improved MPI Cluster Debugger that lets you execute MPI debugging jobs either locally or on a cluster, with only a minor configuration task involved. A few days ago at the German Windows-HPC User Group event Keith Yedlin from Microsoft Corp was talking about it and I demoed it live. Daniel Moth has a blog post providing an overview of that feature, MSDN has a walk-through on how to set it up, so I am not going to repeat all that content but instead explain how I am using it (and what I am still missing).

Examining variable values across MPI processes. This is a core requirement for a parallel debugger, as I stated in previous posts already. Visual Studio 2008 did allow for this already, but Visual Studio 2010 improved the way in which you inspect variables, especially if you are switching between threads and / or processes. I am not sure about the official name of the feature, but let me just call it laminate: when you put the mouse pointer over a variable, the tooltip that appears does not only show you the variable value; in VS2010 it also contains a sticker that you can click to keep this window persistent in front of the editor.

Screenshot: Visual Studio 2010 Debugging Session, Laminated Variable

In my debugging workflow I got used to laminate exactly those variables that have different values on the threads and / or processes involved in my program. Whenever I switch to a different thread and / or process that in fact has a different value for that particular variable, the view will become red. This turned out to be very handy!

Screenshot: Visual Studio 2010 Debugging Session, Laminated Variable after Thread Switch

Debugging MPI applications on a Cluster. This became usable only with Visual Studio 2010; before, it was possible but involved many configuration steps. In Visual Studio 2008 I complained about the task of setting up the right paths to mpiexec and mpishim. That is gone in Visual Studio 2010, thanks a lot. If you intend to use Microsoft MPI v2 (either on a Cluster or on a local Workstation) there is no need to configure anything at all, just switch to the MPI Cluster Debugger in the Debugging pane of your project settings. It gets even better: The field Run Environment allows you to select between the execution on your local machine, or on a Windows HPC Server 2008 Cluster:

Screenshot: Visual Studio 2010 Debugging Configuration

Debugging on a cluster is particularly useful if your program is not capable of being executed with a small number of processes, or if you have a large dataset to do the debugging with and not enough memory on the local machine. The debugging session is submitted to the cluster just like a regular compute job. Setting up your debugging project for the cluster is pretty simple: just select the head node of your cluster and then the necessary resources, that is all. Hint: After selecting the head node, immediately select the node group (if any) to reduce the number of compute nodes for which status information is queried, since this may take a while and make the Node Selector dialog unresponsive for some moments.

Screenshot: Visual Studio 2010 Debugging Configuration, Node Selector

After the configuration step you are all set up to F5 to the cluster – once your job has been started you will see no difference to a local debugging session. If you open the Debug -> Processes window from the VS2010 menu you can take a look at the Transport Qualifier column to see which node the debug processes are running on:

Screenshot: Visual Studio 2010 Debugging Session, MPI Processes on a Cluster

If you are interested, you can start the HPC Job Manager program to examine your debugging session. Unless you explicitly object, Visual Studio does the whole job of deploying the runtime for your application and afterwards cleaning everything up again:

Screenshot: HPC Job Manager displaying a Visual Studio 2010 Debugging Job (1/2)
Screenshot: HPC Job Manager displaying a Visual Studio 2010 Debugging Job (2/2)

What I am still missing. The new features are all really nice, but there are still two very important things that I am missing in an MPI debugger: (i) better management of the MPI processes during debugging, and (ii) a better way to investigate variable values over multiple processes (this may include arrays). Microsoft is probably working on Dev11 already, so I hope these two points will make it into the next product version, maybe even more…