Best Practices For Building A Multiple GPU System

smicha
Licensed Customer
Posts: 3151
Joined: Wed Sep 21, 2011 4:13 pm
Location: Warsaw, Poland

Tutor,

Question for you. I'm finishing the design of my custom case for 12 water-cooled GPUs (graphics cards) with 2x 2000W PSUs and 2x MORA rads. Total dimensions shall not exceed 40x50x60cm. I'm about to order the X10DRX:
http://www.supermicro.com/products/moth ... x10drx.cfm

What solution do you think is best for maintaining the board's x8 PCIe speed: shall I go with this http://www.ebay.com/itm/1PC-PCI-Express ... 1006566764 plus x16-to-x16 risers,
or use USB risers?
3090, Titan, Quadro, Xeon Scalable Supermicro, 768GB RAM; Sketchup Pro, Classical Architecture.
Custom alloy powder coated laser cut cases, Autodesk metal-sheet 3D modelling.
build-log http://render.otoy.com/forum/viewtopic.php?f=9&t=42540
itou31
Licensed Customer
Posts: 377
Joined: Tue Jan 22, 2013 8:43 am

A USB riser is only x1, I think. I suggest the first solution, an x8 to x16 riser, to maintain the x8 speed (also useful for the out-of-core feature if needed!)
I7-3930K 64Go RAM Win8.1pro , main 3 titans + 780Ti
Xeon 2696V3 64Go RAM Win8.1/win10/win7, 2x 1080Ti + 3x 980Ti + 2x Titan Black
glimpse
Licensed Customer
Posts: 3740
Joined: Wed Jan 26, 2011 2:17 pm

Just my two cents, Smicha, if you don't mind =) The best option would be shielded x8 or x16 risers, since unshielded cables risk both signal-quality loss and interference. As for USB risers, I haven't seen a good one that pushes more than x4, so those might be a waste =) But that's only my opinion, S.
smicha
Licensed Customer
Posts: 3151
Joined: Wed Sep 21, 2011 4:13 pm
Location: Warsaw, Poland

Thanks itou and Tom.

Any links to high-quality shielded risers? x16->x16 or x8->x16?
3090, Titan, Quadro, Xeon Scalable Supermicro, 768GB RAM; Sketchup Pro, Classical Architecture.
Custom alloy powder coated laser cut cases, Autodesk metal-sheet 3D modelling.
build-log http://render.otoy.com/forum/viewtopic.php?f=9&t=42540
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

smicha wrote:Tutor,

Question for you. I'm finishing the design of my custom case for 12 water-cooled GPUs (graphics cards) with 2x 2000W PSUs and 2x MORA rads. Total dimensions shall not exceed 40x50x60cm. I'm about to order the X10DRX:
http://www.supermicro.com/products/moth ... x10drx.cfm

What solution do you think is best for maintaining the board's x8 PCIe speed: shall I go with this http://www.ebay.com/itm/1PC-PCI-Express ... 1006566764 plus x16-to-x16 risers,
or use USB risers?
glimpse wrote:Just my two cents, Smicha, if you don't mind =) The best option would be shielded x8 or x16 risers, since unshielded cables risk both signal-quality loss and interference. As for USB risers, I haven't seen a good one that pushes more than x4, so those might be a waste =) But that's only my opinion, S.
CURRENTLY, MAXIMIZING GPU COUNT PER SYSTEM USUALLY INVOLVES MAKING TRADEOFFS

Some don't like snake dens of cabling; some don't like improvisation; some don't like compromise, etc., etc. And they won't currently have GPU rendering monsters, unless cost is no barrier to them.

The eleven PCIe slots of the MBD-X10DRX are constituted as 10x PCI-E 3.0 x8 (FHFL) slots and 1x PCI-E 2.0 x4 (in x8) slot. PCIe 3.0 x8 is twice the speed of PCIe 2.0 x8, i.e., the speed of PCIe 2.0 x16. Thus, in terms of speed, you get the equivalent of 10x PCIe 2.0 x16 and 1x PCIe 2.0 x4.

The 1PC PCI Express Riser Card x8 to x16 Left Slot Adapters For 1U Servers can't convert the MBD-X10DRX's 10x PCI-E 3.0 x8 (FHFL) slots to x16 slots electrically/signal-wise; those slots will always be, at max, PCIe 3.0 x8 electrically. The only function of the Left Slot Adapters is to provide a female x16 interface for inserting a male x16 connector, whether from a riser cable or from a GPU directly. "At max" is the key: although there's no way for us to convert those x8 slots to x16 v3 speeds, we can do our best to maintain their PCIe 2.0 x16 equivalency by not using any extenders in them rated below x8, and preferably rated x16, since the cost difference between x8 and x16 extenders is small change and the x16s are more prevalent.

As to that eleventh slot, which is PCIe 2.0 x4: most importantly, it's presented to us in an x8 opening, so we can use the same Left Slot Adapters For 1U Servers in that slot also. So to maximize PCIe throughput on a DRX motherboard for eleven or more GPUs, one needs 11x of the Left Slot Adapters.

If Nvidia were to introduce single-wide-slot GPUs (maybe Pascals or Voltas will come this way stock), or the brave among us were to convert GPUs to single-wide versions, that would be the end of the task if one were satisfied with JUST ELEVEN GPUs: all we'd need to do is insert up to 11 single-wide GPUs directly into up to 11 Left Slot Adapters. Ten of those GPUs would then operate at PCIe 3.0 x8 (the equivalent of PCIe 2.0 x16) and the eleventh at PCIe 2.0 x4. However, if one wanted to tap all of the IO space possessed by the motherboard/CPU to run more than 11 GPUs, one would need PCIe port multipliers such as the Amfeltec GPU Oriented Splitters. For the GPUs connected to the Splitter, that drags PCIe throughput down from PCIe 3.0 x8 speeds to, at best, x4 speeds in the case of the MBD-X10DRX [and in cases like mine, where I chose to compromise and use the MBD-X9DRX because I had an excess of older CPUs from the SandyBridge family*/].
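
To put rough numbers on those equivalences, here is a quick back-of-the-envelope sketch in Python (the per-lane figures are the commonly published approximations for one direction of the link, not measurements of any particular board):

```python
# Back-of-the-envelope PCIe throughput per direction, in MB/s per lane.
# v1.x: 2.5 GT/s with 8b/10b encoding  -> ~250 MB/s per lane
# v2.0: 5.0 GT/s with 8b/10b encoding  -> ~500 MB/s per lane
# v3.0: 8.0 GT/s with 128b/130b coding -> ~985 MB/s per lane
MB_PER_LANE = {1: 250, 2: 500, 3: 985}

def throughput_mb_s(gen, lanes):
    """Approximate one-direction throughput of a PCIe link in MB/s."""
    return MB_PER_LANE[gen] * lanes

# The X10DRX's ten PCIe 3.0 x8 slots vs. a PCIe 2.0 x16 slot:
print(throughput_mb_s(3, 8))    # 7880 -- effectively the same
print(throughput_mb_s(2, 16))   # 8000 -- class of link
# The eleventh slot (PCIe 2.0 x4), or a splitter-fed x4 connection:
print(throughput_mb_s(2, 4))    # 2000
```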

As to the best shielded riser cables, StarTech probably makes the best commercial ones [see, e.g., http://www.newegg.com/Product/Product.a ... 6815158290 ] {though they usually come too short for my liking, or are too expensive and not powered [ http://www.digikey.com/product-detail/e ... ND/3641403 ]}, or you can roll your own shields as Sveetsnelda did (and as I did, following his advice) [ https://bitcointalk.org/index.php?topic=38331.20 ] to fit less expensive molex-powered x16-to-x16 PCIe extender cables like these: http://www.ebay.com/itm/like/1617592038 ... ps&lpid=82 . Get the best aluminum foil and electrical tape money can buy or that's available to you; it'll go a long way, and you'll have extra tape and foil for future tasks. But going beyond 12-15 inches in total length is particularly risky in terms of maintaining signal integrity. You'll also need to shield SATA or USB riser cables, and the trade-off is that although they're usually a bit longer, their data transfer speeds are slower: x4 at best, and for the vast majority usually x1. In some cases, though, they're the only game in town, because of their somewhat greater lengths or when working with physically smaller PCIe slots like x1 and x2.

But here are words of caution: some even recommend against powering risers by any means except from the motherboard [ https://bitcointalk.org/index.php?topic=365181.0 ] and recommend getting a motherboard that has extra PCIe power ports. The MBD X9DRX and the MBD X10DRX do have extra PCIe power ports on them. So I got the molex-powered risers, but do not intend to power them except where absolutely necessary, particularly on my two MBD X9DRXs and my two EVGA SR-2s, because all four of them have two extra PCIe power ports on their motherboards.

*/ When it comes down to GPU rendering, there's not much difference between the MBD X9DRX and the MBD X10DRX; so let your CPU, memory and SATA needs, relative costs, and availability break the tie. See, in particular, the differences (items 1-3, 6 and 8) and the key similarity (item 4, the expansion slots):


MBD X9DRX [ http://www.supermicro.com/products/moth ... DRX_-F.cfm ]
Key Features

1. Dual socket R (LGA 2011) supports
    Intel® Xeon® processor E5-2600
    and E5-2600 v2 family†

2. Intel® C602 chipset; QPI up to 8.0GT/s

3. Up to 1TB ECC DDR3, up to 1866MHz;
    16x DIMM slots

4. Expansion slots: 10 PCI-E 3.0 x8 and
    1 PCI-E 2.0 x4 (in x8) slot


5. Intel® i350 Dual port GbE LAN

6. 8x SATA2 and 2x SATA3 ports

7. Integrated IPMI 2.0 and KVM with
    Dedicated LAN

8. 10x USB 2.0 ports
    (4 rear, 4 via header + 2 Type A)

MBD X10DRX
Key Features 
 
1. Dual socket R3 (LGA 2011) supports
    Intel® Xeon® processor E5-2600 v3
    family; QPI up to 9.6GT/s

2. Intel® C612 chipset

3. Up to 1TB ECC DDR4 2133MHz;
    16x DIMM slots

4. Expansion slots: 10 PCI-E 3.0 x8 and
    1 PCI-E 2.0 x4 (in x8) slot

5. Intel® i350 Dual port GbE LAN

6. 10x SATA3 (6Gbps); RAID 0, 1, 5, 10

7. Integrated IPMI 2.0 and KVM with
    Dedicated LAN

8. 5x USB 3.0 ports, 4x USB 2.0 ports
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
Notiusweb
Licensed Customer
Posts: 1285
Joined: Mon Nov 10, 2014 4:51 am

To anyone who may know: I am curious about the architectural impact of running cards at differing speeds. Let's say, for argument's sake, that you have 2 identical boards:

Board A
4 PCIe slots (1 used for primary display)

Board B
4 PCIe slots (1 used for primary display)

You connect three PCIe USB 3.0 risers at x1 to Board A, and three PCIe risers at x8 to Board B.

Does this mean that, in the case of Board B, you are using more of the motherboard's available PCIe lanes (not slots) to furnish the x8 speed?
Win 10 Pro 64, Xeon E5-2687W v2 (8x 3.40GHz), G.Skill 64 GB DDR3-2400, ASRock X79 Extreme 11
Mobo: 1 Titan RTX, 1 Titan Xp
External: 6 Titan X Pascal, 2 GTX Titan X
Plugs: Enterprise
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

Notiusweb wrote:To anyone who may know: I am curious about the architectural impact of running cards at differing speeds. Let's say, for argument's sake, that you have 2 identical boards:

Board A
4 PCIe slots (1 used for primary display)

Board B
4 PCIe slots (1 used for primary display)

You connect three PCIe USB 3.0 risers at x1 to Board A, and three PCIe risers at x8 to Board B.

Does this mean that, in the case of Board B, you are using more of the motherboard's available PCIe lanes (not slots) to furnish the x8 speed?


Notiusweb,
The answer to your question is "Yes."
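
As a rough illustration of the arithmetic in Python (assuming, for the example only, that the primary display card negotiates x16 on both boards):

```python
# Rough lane arithmetic for the two hypothetical boards above.
# Assumption for the example: the primary display card negotiates x16.
def lanes_used(display_lanes, riser_links):
    """CPU PCIe lanes consumed by the display card plus the risers."""
    return display_lanes + sum(riser_links)

board_a = lanes_used(16, [1, 1, 1])  # USB 3.0 risers negotiate x1 links
board_b = lanes_used(16, [8, 8, 8])  # proper risers hold the full x8 links
print(board_a, board_b)              # 19 vs. 40 lanes consumed
```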
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

I'm penning this post on one of my three 2007 MacPro2,1s. In the phrase "PCIe v3.0 x16," the "x" denotes the number of lanes and the "v" denotes the generation. My 2007 MacPros have PCIe v1. PCIe v3.0 x16 is twice as fast as PCIe v2.0 x16, which is twice as fast as PCIe v1.0 x16. This Mac has two CUDA GPUs fed through x8 lanes, one CUDA GPU fed through x4 lanes, and a fourth CUDA GPU connected by an x1 riser in an x4 slot. Physically, all four slots are x16 size. I also have systems with PCIe v3 and GPUs in x16 slots/lanes. I have a low quantum of patience. Guess which machines I prefer to do most of my preliminary 3d work on? My 2007 Macs. Question: "Tutor, doesn't working on v1 x8, x4 and x1 frustrate you?" Answer: "Not in the least." Only when someone questions me about it, or I am otherwise forced to focus on it, do I notice a slight speed difference when feeding the GPUs for 3d rendering. Once the data has been transferred to the GPUs, the "v" and the "x" aren't an issue at all; then only the properties of the GPU matter. However, current video production work forces me onto one of my newer systems for all but the simplest of chores.

Lanes derive from the CPU and thus aren't GPU-specific; they are also used for non-GPU purposes. The transfer of data from source to destination, particularly Octane-related data to be computed by the GPU, is rapid and best conceived of as a momentary and sporadic occurrence. Being fed by more lanes is why a true x16 slot is physically wider than a true x1 slot. Normally, all current GPU cards are x16 size, but they can also operate as video display cards when fed from x8 (eight lanes) and even from x4 (four lanes). This has been the case since PCIe v1, although the slowness of x1 v1 is more noticeable and frustrating for modern video production. However, for what we use an excess of video cards to do (compute-only chores), the cards can perform parallel compute functions {which CPUs can also do, but not on so massive a scale} even when fed through x1 (one lane). To be sure, it does take the data a little more time to reach its destination over x1, but that's hardly noticeable now, particularly in the case of x1 v3, which is four times faster than x1 v1. The trend: as 4k, 8k and larger productions become the norm for 3d rendering, the wait time will likely be felt first at the low end, namely x1 v1, and gradually move up from there.
[ http://www.enthusiastpc.net/articles/00003/ ]
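
To see why the link barely matters once a scene is loaded, here is a small Python sketch (the 2 GB scene size is a hypothetical chosen for illustration; the per-lane figures are the usual published approximations):

```python
# One-time scene upload times over different PCIe links.
MB_PER_LANE = {1: 250, 2: 500, 3: 985}  # approx. MB/s per lane, one direction

def upload_seconds(scene_mb, gen, lanes):
    """Seconds to push a scene of scene_mb megabytes to the GPU."""
    return scene_mb / (MB_PER_LANE[gen] * lanes)

scene_mb = 2048  # assumed scene size, for illustration only
for gen, lanes in [(1, 1), (3, 1), (1, 8), (3, 16)]:
    print(f"v{gen} x{lanes}: {upload_seconds(scene_mb, gen, lanes):5.1f} s")
# v1 x1: ~8.2 s; v3 x1: ~2.1 s; v1 x8: ~1.0 s; v3 x16: ~0.1 s
# Once the scene is on the card, render speed is set by the GPU itself.
```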
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
Seekerfinder
Licensed Customer
Posts: 1600
Joined: Tue Jan 04, 2011 11:34 am

Tutor wrote:I'm penning this post on one of my three 2007 MacPro2,1s. In the phrase "PCIe v3.0 x16," the "x" denotes the number of lanes and the "v" denotes the generation. My 2007 MacPros have PCIe v1. PCIe v3.0 x16 is twice as fast as PCIe v2.0 x16, which is twice as fast as PCIe v1.0 x16. This Mac has two CUDA GPUs fed through x8 lanes, one CUDA GPU fed through x4 lanes, and a fourth CUDA GPU connected by an x1 riser in an x4 slot. Physically, all four slots are x16 size. I also have systems with PCIe v3 and GPUs in x16 slots/lanes. I have a low quantum of patience. Guess which machines I prefer to do most of my preliminary 3d work on? My 2007 Macs. Question: "Tutor, doesn't working on v1 x8, x4 and x1 frustrate you?" Answer: "Not in the least." Only when someone questions me about it, or I am otherwise forced to focus on it, do I notice a slight speed difference when feeding the GPUs for 3d rendering. Once the data has been transferred to the GPUs, the "v" and the "x" aren't an issue at all; then only the properties of the GPU matter. However, current video production work forces me onto one of my newer systems for all but the simplest of chores.

Lanes derive from the CPU and thus aren't GPU-specific; they are also used for non-GPU purposes. The transfer of data from source to destination, particularly Octane-related data to be computed by the GPU, is rapid and best conceived of as a momentary and sporadic occurrence. Being fed by more lanes is why a true x16 slot is physically wider than a true x1 slot. Normally, all current GPU cards are x16 size, but they can also operate as video display cards when fed from x8 (eight lanes) and even from x4 (four lanes). This has been the case since PCIe v1, although the slowness of x1 v1 is more noticeable and frustrating for modern video production. However, for what we use an excess of video cards to do (compute-only chores), the cards can perform parallel compute functions {which CPUs can also do, but not on so massive a scale} even when fed through x1 (one lane). To be sure, it does take the data a little more time to reach its destination over x1, but that's hardly noticeable now, particularly in the case of x1 v3, which is four times faster than x1 v1. The trend: as 4k, 8k and larger productions become the norm for 3d rendering, the wait time will likely be felt first at the low end, namely x1 v1, and gradually move up from there.
[ http://www.enthusiastpc.net/articles/00003/ ]
I agree with Tutor on this. The lane speed impact is minimal and (in theory) only affects the time the scene takes to load. The newest question on the block, though, is the impact of lane speed on Octane's out-of-core memory usage. It seems the pattern of I/O activity changes when shared memory is used, since textures now have to straddle motherboard RAM and the GPU. I have not seen or done any extensive testing on this as yet, but it seems to affect things when memory is shared. Here is an extract from the help file: "Out-of-core textures allow users to use more textures than would fit in the graphic memory, by keeping them in the host memory. The data for rendering the scene needs to be sent to the GPU while rendering so some tradeoff in the rendering speed is expected."

There are other risks with out-of-core memory, such as an unstable system, whether or not one has all 16 lanes operating to a GPU. For the best and smoothest Octane experience, my advice would be to either ensure your graphics memory is adequate for the kinds of scenes you use or, if you need the out-of-core function, ensure you have plenty of RAM in the main system. Then lane speed impacts are generally negligible and, if noticeable, usually worth the small sacrifice for the benefit of access to more cards than your main board or case can handle.
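
To get a feel for that tradeoff, here is a crude Python model (the card size, texture set size, and one-pass streaming behavior are all assumptions for illustration, not measurements of Octane's actual out-of-core machinery):

```python
# Crude model of the out-of-core penalty: texture data that doesn't fit
# in VRAM must be streamed over PCIe during rendering. All numbers here
# are illustrative assumptions, not measurements.
MB_PER_LANE = {1: 250, 2: 500, 3: 985}  # approx. MB/s per lane, one direction

def spill_stream_seconds(vram_mb, textures_mb, gen, lanes):
    """Seconds to stream the spilled texture data across the link once."""
    spill_mb = max(0, textures_mb - vram_mb)
    return spill_mb / (MB_PER_LANE[gen] * lanes)

# A hypothetical 6 GB card holding 10 GB of textures spills ~4 GB to host RAM:
print(spill_stream_seconds(6144, 10240, 3, 16))  # ~0.26 s per pass on v3 x16
print(spill_stream_seconds(6144, 10240, 1, 1))   # ~16.4 s per pass on v1 x1
```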

Seeker
Win 8(64) | P9X79-E WS | i7-3930K | 32GB | GTX Titan & GTX 780Ti | SketchUP | Revit | Beta tester for Revit & Sketchup plugins for Octane
itou31
Licensed Customer
Posts: 377
Joined: Tue Jan 22, 2013 8:43 am

Yes, be careful with the OOC feature when using an x1 lane: performance can drop by a factor of six, and even more!
I7-3930K 64Go RAM Win8.1pro , main 3 titans + 780Ti
Xeon 2696V3 64Go RAM Win8.1/win10/win7, 2x 1080Ti + 3x 980Ti + 2x Titan Black