Examining the NUMA architecture of an 8-socket Nehalem-EX

I have been rather quiet on this blog for a while now, which is contrary to my intent – I plan to write more regularly again! And I will just continue with one of the topics I like most: NUMA architectures. A while ago I talked about how differently two systems equipped with exactly the same processors may look and how this can influence application performance. This blog post explores the NUMA architecture of a very recent system in more detail.

Some days ago we got remote access to a very recent eight-way (meaning 8-socket) system equipped with Nehalem-EX processors. This makes 64 physical or 128 logical (hyper-threaded) cores per system! The system was kindly provided by Fujitsu. Since we will soon get plenty of those (not necessarily from Fujitsu, we really do not know yet), we took a close look at how it behaves; in particular, my colleague Dirk Schmidl performed many of the benchmarks with the help of some student workers.

In the aforementioned previous blog post I pointed to the so-called System Locality Information Table (SLIT) provided by the BIOS. Does it help to understand how the eight sockets in this server are related to each other? Taking a look at it, the answer is simple: No. It just knows about two levels: same socket (the diagonal: 10) and other socket (the rest: 12).

8-socket Nehalem-EX: SLIT value matrix
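
If you want to check this on your own machine, the matrix can also be queried programmatically. Below is a minimal sketch using libnuma's numa_distance() call (the build command and output formatting are my own choices, not taken from our benchmark code); on this system it would simply print 10 on the diagonal and 12 everywhere else:

    /* Sketch: dump the SLIT distance matrix as seen by the OS via libnuma.
     * Build with: gcc slit.c -o slit -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA not available\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;        /* highest NUMA node id + 1 */
        for (int i = 0; i < nodes; i++) {
            for (int j = 0; j < nodes; j++)
                printf("%4d", numa_distance(i, j));  /* 10 = local, 12 = any remote socket here */
            printf("\n");
        }
        return 0;
    }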

Our goal was to examine how the eight sockets are related to each other and how “deep” the NUMA architecture of that machine really is. Of course you can get that information from the system specification documentation, but in order to get a feeling for the performance characteristics of a machine it is good practice to examine it first on your own and then check whether your conclusions match what is described.

We used a simple benchmark: we placed eight threads (each processor has eight physical cores) on one selected socket and had all of them access memory attached to another socket (in one of the measurements, the local socket). We then measured the achieved memory bandwidth in MB/s. This resulted in the following performance matrix:
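
The core of such a measurement can be sketched in a few lines of C with OpenMP and libnuma. This is only an illustrative reconstruction, not our actual benchmark code: the array size, the pure streaming-read kernel and the external thread pinning (e.g. via numactl --cpunodebind or OMP_PROC_BIND) are assumptions on my part:

    /* Sketch: 8 threads pinned to one socket read an array allocated on a
     * chosen NUMA node; reports the achieved bandwidth in MB/s.
     * Pinning the threads to one socket is assumed to be done from outside.
     * Build with: gcc -fopenmp bw.c -o bw -lnuma
     * Run e.g.:   OMP_NUM_THREADS=8 numactl --cpunodebind=0 ./bw 3 */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <omp.h>
    #include <numa.h>

    #define N (256L * 1024 * 1024)              /* 2 GB of doubles, far larger than the caches */

    int main(int argc, char **argv)
    {
        int mem_node = (argc > 1) ? atoi(argv[1]) : 0;  /* socket whose memory we read */
        double *a = numa_alloc_onnode(N * sizeof(double), mem_node);
        memset(a, 1, N * sizeof(double));               /* touch the pages on mem_node */

        double sum = 0.0;
        double t = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)       /* the threads sit on one socket */
        for (long i = 0; i < N; i++)
            sum += a[i];                                /* streaming read from mem_node */
        t = omp_get_wtime() - t;

        printf("%.0f MB/s (checksum %g)\n", N * sizeof(double) / t / 1e6, sum);
        numa_free(a, N * sizeof(double));
        return 0;
    }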

8-socket Nehalem-EX: memory bandwidth matrix

By measuring the memory bandwidth in this particular way we do not get the optimal aggregate memory bandwidth the system could deliver, since all sockets are busy and there is also some cache coherency traffic. Instead, our benchmark results are closer to what the system delivers when it is fully loaded with a rather bad memory access pattern.

Our measurements revealed three significantly different performance levels, one of which can be further split into two separate ones. The different levels are colored accordingly in the figure below. Depending on which socket you label as “0”, you can come up with the following architectural plot (my colleague Dieter an Mey did this particular one):

8-socket Nehalem-EX: architecture

One can see that we have two groups of four sockets each that are connected by apparently slightly slower links. I do not yet know what is causing this. Looking at the number of hops, you get this matrix:

8-socket Nehalem-EX: number of hops matrix

The maximum number of hops to get from one socket to another is two. Since the Intel QuickPath Interconnect allows (up to) three links to be used to build a multi-socket system, each socket has three neighbors that can be reached with just one hop.
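
Just to illustrate how such a hop matrix follows from the link topology: given an adjacency list with three QPI neighbors per socket, a breadth-first search yields the hop counts. The adjacency below is purely illustrative (a 3-regular graph on eight sockets with diameter two), not necessarily the actual wiring shown in the plot above:

    /* Sketch: derive a hop-count matrix from a socket link topology by BFS.
     * The adjacency list is an illustrative 3-regular, diameter-2 graph,
     * NOT necessarily the real wiring of this 8-socket system. */
    #include <stdio.h>

    #define S 8

    /* each socket linked to its two ring neighbors and the opposite socket */
    static const int adj[S][3] = {
        {1,7,4}, {2,0,5}, {3,1,6}, {4,2,7},
        {5,3,0}, {6,4,1}, {7,5,2}, {0,6,3}
    };

    int main(void)
    {
        int hops[S][S];
        for (int s = 0; s < S; s++) {           /* BFS starting from every socket */
            int queue[S], head = 0, tail = 0;
            for (int i = 0; i < S; i++) hops[s][i] = -1;
            hops[s][s] = 0;
            queue[tail++] = s;
            while (head < tail) {
                int u = queue[head++];
                for (int k = 0; k < 3; k++) {
                    int v = adj[u][k];
                    if (hops[s][v] < 0) {
                        hops[s][v] = hops[s][u] + 1;
                        queue[tail++] = v;
                    }
                }
            }
        }
        for (int i = 0; i < S; i++) {           /* print the hop matrix */
            for (int j = 0; j < S; j++) printf("%2d", hops[i][j]);
            printf("\n");
        }
        return 0;
    }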

Well, an aggregate memory bandwidth of nearly 90 GB/s with this bad memory access pattern is pretty ok. But it is not a factor of two over a four-socket system. The machine is well-suited for shared memory parallel programs that can make use of that many cores (and a large total memory), but of course it does not offer a price-performance sweet spot (the price of adding sockets clearly grows over-linearly). And last but not least, although memory bandwidth is really important for most HPC applications, there are also other factors that play an important role in an application’s performance on a given architecture. We did many more benchmarks to evaluate this system, which I do not want to discuss here and now, but with a few memory bandwidth benchmarks we figured out what the system architecture looks like and how the eight sockets are related to each other.


2 Responses to Examining the NUMA architecture of an 8-socket Nehalem-EX

  1. Kent says:

    Hello,

    You mentioned two 8-socket systems…
    What is the other one besides the Fujitsu?

    Thank You,
    Kent

  2. terboven says:

    Hi Kent. I just had access to one 8-socket system so far. As part of our current procurement we will probably get quite a few of these, but the winner of that call is not yet decided. Kind regards, Christian.
