The interconnect used between nodes on a cluster can be the limiting factor in performance for bandwidth-limited models like ROMS. This was certainly impressed upon me during some ROMS benchmarking I did recently, which I will post shortly.
I wasn't aware, however, of just how big the differences between these interconnects are. For 100-1000 KB messages (see, for example, http://vmi.ncsa.uiuc.edu/performance/vmi_bw.php) the bandwidth difference is dramatic:
Gigabit: ~100 MB/sec
Myrinet: ~250 MB/sec
InfiniBand: ~700 MB/sec
Therefore, depending on the degree of bandwidth limitation, it appears you might be able to obtain a factor of 7 increase in performance on your cluster by using InfiniBand instead of Gigabit. Any comments from people who know a lot more about this than me? From some pricing I've seen, it seems that InfiniBand costs about the same or less than Myrinet. Would there be any reason to use Myrinet over InfiniBand?
Thanks,
Rich Signell
Infiniband/Myrinet
Rich,
The disparity between InfiniBand and Myrinet is clear, and with the introduction of PCI-Express I/O buses in the new Intel EM64T Xeon platforms (such as the Dell 1850 servers), the bandwidth is no longer PCI-limited for 4X InfiniBand, which gives potential for an even greater disparity. Myrinet hasn't made any advances in their products, as far as I know, since their Myrinet-2000 line. There was some promise of a low-latency software patch for their GM layer which was going to cut small-message latency in half; I don't know if that is considered stable at this time.
The per-slot cost of the two options for larger clusters (32+ processors) is about the same, and when you speak with the cluster makers such as Dell or RLX you will find that they can swap between the two without significant changes in cost. The InfiniBand hardware makers are competing heavily at this time. Both Topspin and Voltaire make an array of products, although I believe that only Voltaire has a full Clos switch above 128 ports (this may have changed). With the open-standard nature of InfiniBand and bandwidth-doubling standards in the pipeline, it seems likely that most everybody will be moving in that direction (for large clusters) in the near future.
-G. Cowles
I would not expect anywhere near a factor of 7 (or even of 2) performance difference. For a program spending 50% of its time communicating exclusively very large messages (a rather unlikely configuration), 7 times higher bandwidth would at best buy you 43% less wallclock time. ROMS's parallel performance (as well as that of other finite-difference-based ocean models) depends on a combination of latency and bandwidth, and therefore the massive InfiniBand bandwidth advantage is not necessarily buying you as much as one would think.
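As a rough illustration of that arithmetic, here is a minimal sketch; the 50% communication fraction and the 7x bandwidth ratio are just the illustrative numbers above, not measurements:

```python
# Rough upper bound on wallclock savings from a faster interconnect, assuming
# only the communication portion of the runtime scales with bandwidth.
def wallclock_saving(comm_fraction, bandwidth_ratio):
    """Fractional reduction in wallclock time when communication time shrinks
    by bandwidth_ratio and computation time stays the same."""
    new_time = (1.0 - comm_fraction) + comm_fraction / bandwidth_ratio
    return 1.0 - new_time

# 50% of time communicating, 7x the bandwidth -> about 43% less wallclock time
print(wallclock_saving(0.5, 7.0))   # ~0.43
```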
What you want to know is which interconnect performs best for the message sizes characteristic of your important problem sizes. For example, for the BENCHMARK 1-3 problems, message sizes in one direction range from 60 KB to 480 KB (medium to large) and in the other direction from 1920 bytes to 15 KB (small to medium). Zero-message latency along with asymptotic bandwidth determines which bandwidth curve rises the fastest and beats out the rest in the range of interest... And keep in mind that the particular MPI implementation you choose to use also makes a difference to the bandwidth curve.
Right now, for generic cluster solutions, latency-wise:
1 us < Quadrics < SCI < Myrinet XP ~ InfiniBand ~< 10 us << Gigabit Ethernet ~< 100 us.
Using MX (apparently available in beta form), Myrinet latency is supposed to go down to 2.6-3.5 us.
and bandwidth-wise (unidirectional, and assuming a PCI-Express interface that can handle all possible traffic):
60 MB/s < Gigabit Ethernet ~< 125 MB/s < SCI ~ 340 MB/s < Myrinet MX E-card ~< 490 MB/s < Quadrics QSNet II (Elan4) ~900 MB/s ~< InfiniBand 4x ~900-1500 MB/s
The new PathScale InfiniPath adapter claims to provide 1.5 us MPI latency. Gigabit Ethernet with special drivers (MPICH-based GAMMA or SCore) can also give latencies in the ~12-15 us range for much lower cost, but also lower asymptotic bandwidth (so check your message range of interest).
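To see which bandwidth curve wins in that range, a simple latency-plus-streaming model (time = latency + size / asymptotic bandwidth) is a reasonable first cut. A minimal sketch, with the latency and bandwidth numbers treated as rough assumptions drawn from the ranges above rather than measurements of any particular hardware:

```python
# Effective bandwidth under the simple model T(s) = latency + s / asymptotic_bw.
# The figures below are rough assumptions taken from the ranges quoted above.
interconnects = {
    # name: (latency in seconds, asymptotic bandwidth in bytes/s)
    "Gigabit Ethernet": (50e-6, 110e6),
    "Myrinet (MX)":     (3e-6,  450e6),
    "InfiniBand 4x":    (6e-6,  900e6),
}

def effective_bandwidth(size_bytes, latency_s, asymptotic_bw):
    """Achieved bandwidth for a single message of the given size."""
    return size_bytes / (latency_s + size_bytes / asymptotic_bw)

for size in (1920, 15_000, 60_000, 480_000):   # ROMS BENCHMARK message sizes
    results = ", ".join(
        f"{name}: {effective_bandwidth(size, lat, bw) / 1e6:.0f} MB/s"
        for name, (lat, bw) in interconnects.items()
    )
    print(f"{size:>7} bytes -> {results}")
```

For the small messages the latency term dominates (so the low-latency interconnects look similar), while for the larger ROMS messages the asymptotic bandwidth gap opens up.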
As ROMS does not attempt to overlap communication with computation, it makes no difference whether or not the interconnect is capable of making message progress on its own (as in the case of Quadrics). Also keep in mind that the speed of the MPI collectives (the allreduce, allgather and bcast cases) also changes depending on the interconnect and library combination.
C. Evangelinos
Gigabit vs. Infiniband on the BENCHMARK1 test
Constantinos,
When I get some time I will try to understand fully the meaning of your post -- you clearly know a lot more about this stuff than I do! However, I thought it would be of interest to point out that the BENCHMARK1 ROMS test case has been run on dual-Opteron 250 clusters with both InfiniBand and Gigabit. For an 8-processor run, the system with InfiniBand was 2.73 times faster than the system with Gigabit.
For details, look at lines 10 and 17 in the benchmark spreadsheet at:
http://cove.whoi.edu/~rsignell/roms/bench/
The slower run on the Penguin cluster used the PGI compiler, and the faster run on the Microway Navion cluster used the Intel compiler. Although the Intel compiler is generally faster than PGI on AMD machines (at least according to the Polyhedron 2004 Fortran 90 benchmark suite), it's generally only about 10% faster -- certainly nowhere near enough to explain this large difference in performance.
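Plugging that 2.73x into the same simple model as above is a useful sanity check (only a rough one, since the two runs also differ in compiler and MPI stack): it would require the Gigabit run to be spending roughly three quarters of its time in communication.

```python
# Back-of-the-envelope: what communication fraction f would explain an overall
# speedup S if communication alone got k times faster?
# Solve S = 1 / ((1 - f) + f / k) for f.  Purely illustrative.
def implied_comm_fraction(overall_speedup, comm_speedup):
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / comm_speedup)

print(implied_comm_fraction(2.73, 7.0))  # ~0.74
```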
Do you think there is another explanation?
-Rich Signell
Re: Gigabit vs. Infiniband on the BENCHMARK1 test
rsignell wrote: "However, I thought it would be of interest to point out that the BENCHMARK1 ROMS test case has been run on dual-Opteron 250 clusters with both InfiniBand and Gigabit. For an 8-processor run, the system with InfiniBand was 2.73 times faster than the system with Gigabit."
Was the Gigabit run set up with jumbo frames (MTU=9000)? Although it doesn't change the latency, it would potentially decrease the number of frames that need to be sent by up to a factor of six.
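As a rough illustration of the frame-count arithmetic (assuming standard 1500-byte vs. 9000-byte MTUs and ignoring header overhead):

```python
# Approximate Ethernet frame counts per message at two MTU sizes, ignoring
# protocol header overhead; purely illustrative arithmetic.
import math

for msg_bytes in (60_000, 480_000):           # ROMS BENCHMARK message sizes
    frames_1500 = math.ceil(msg_bytes / 1500)
    frames_9000 = math.ceil(msg_bytes / 9000)
    print(f"{msg_bytes} bytes: {frames_1500} frames at MTU 1500, "
          f"{frames_9000} at MTU 9000 (ratio ~{frames_1500 / frames_9000:.1f})")
```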
Has anyone compared ROMS on Gigabit with and without jumbo frames? We might need to run on Gigabit for a while (they've taken the Myrinet cards to test in the Mac cluster!) and I was wondering whether it's worth trying to get them to reconfigure the cluster for jumbo frames.
Thanks,
Steve
Dual core benchmarks
Has anyone used the standard benchmark cases (for example, those in Rich Signell's benchmarks above) to examine the speed of the dual-core Opteron processors that AMD has recently released?
For the problem I have in mind, I plan to be running on a 5- or 6-node Linux cluster, with each node having two processors -- essentially the Microway dual-processor Opteron cluster in Rich's benchmarks. My grid will be about 250x300x20 or so.
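Since the whole thread turns on message sizes, here is a minimal sketch of the halo-exchange message sizes such a grid might generate; the 2x5 tiling, 2-point halo width, double precision, and one 3-D field per exchange are all assumptions for illustration, not values taken from ROMS:

```python
# Very rough per-edge halo-exchange message sizes for a 250 x 300 x 20 grid
# split across 10 MPI ranks as a hypothetical 2 x 5 tiling.
# Assumptions (not from the ROMS source): 8-byte reals, a 2-point halo,
# and one 3-D field per exchange.
nx, ny, nz = 250, 300, 20
ntile_i, ntile_j = 2, 5            # hypothetical tiling for 10 ranks
halo, bytes_per_value = 2, 8

tile_nx = nx // ntile_i            # ~125 interior points per tile in x
tile_ny = ny // ntile_j            # ~60 interior points per tile in y

msg_x_edge = halo * tile_ny * nz * bytes_per_value   # ~19 KB per field
msg_y_edge = halo * tile_nx * nz * bytes_per_value   # ~40 KB per field

print(f"x-direction edge message: ~{msg_x_edge / 1e3:.0f} KB per 3-D field")
print(f"y-direction edge message: ~{msg_y_edge / 1e3:.0f} KB per 3-D field")
```

Those sizes sit in the small-to-medium range discussed above, where latency matters about as much as asymptotic bandwidth.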
I was wondering if anyone has benchmarked such a cluster with dual processors, each with dual cores? If not, I shall probably do so, if I can, in a fashion as similar as possible to Rich's benchmarks. However, the results might be sub-optimal, since it can take me a while to learn how to configure a model such as ROMS optimally for a new architecture.
The dual-core chips are very pricey -- if they are going to be entirely memory-bound in their packages, they might well not be worth purchasing.
Thanks for any help y'all can share,
Jamie Pringle