Best Practices For Building A Multiple GPU System

Notiusweb
Licensed Customer
Posts: 1285
Joined: Mon Nov 10, 2014 4:51 am

Tutor, of the two I like the MBD X10DRX myself. For me, any time something has USB 3.0 it is a big deal, as I use external drives.
I have seen pics of boards in these forums (and Bitcoin forums), and they are typically open frames with the GPUs mounted right next to the board; they look like true mining or rendering configs. Mine looks like a standard tower on a desk, with wires running out the back to the GPUs down below. I am wondering if I could still run 13 GPUs without them dropping to 1x speed.

I could be completely wrong here, but just drawing from my own experience: if the boards you listed use PCIe lanes (not slots) to handle storage, then one thing that would come into play is:

MBD X9DRX's
6. 8x SATA2 and 2x SATA3 ports

vs

MBD X10DRX's
6. 10x SATA3 (6Gbps); RAID 0, 1, 5, 10

In my case the board had something similar, LSI and SATA, where I am able to disable a storage controller to free up lanes. But if it were all LSI, for example, I would have no choice but to use those lanes to handle my storage. I think in a case where a board handles storage natively, without PCIe lanes, all is good. But if any storage does run over PCIe lanes, you'd want the liberty of sacrificing it to reclaim those lanes. So it jumped out at me that, should any storage ride on PCIe lanes, you could disable SATA2 or SATA3 independently of one another in the case of the MBD X9DRX, whereas the MBD X10DRX would be all or none (unless the 10 ports can be deactivated in some split way). In fact, I am now more inclined to look at the overall allocatability of a board's lanes rather than just its PCIe slots, although usually I think more slots signal higher yield. Just thoughts, nothing definitive as far as insight. Certainly has nothing to do with Best Practices... more like, Best Hacking-ces.
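
To make that lane accounting concrete, here is a rough tally of the kind of budgeting I mean. All the counts are purely illustrative placeholders (dual 40-lane Xeons, a hypothetical x8 storage controller and x8 NIC), not the actual allocations of either board:

[code]
# Rough PCIe lane-budget tally. All counts are illustrative placeholders,
# not the real allocations of the X9DRX or X10DRX.
CPU_LANES = 80  # e.g. two E5 Xeons at 40 PCIe 3.0 lanes each

# Hypothetical lane consumers besides the GPUs.
consumers = {
    "storage controller (if it hangs off PCIe lanes)": 8,
    "add-in NIC": 8,
}

def lanes_left_for_gpus(total, used):
    return total - sum(used.values())

free = lanes_left_for_gpus(CPU_LANES, consumers)
print(f"Lanes left for GPUs: {free}")
print(f"GPUs possible at x8: {free // 8}, at x4: {free // 4}, at x1: {free}")

# If the board lets you disable the storage controller, its lanes come back:
consumers.pop("storage controller (if it hangs off PCIe lanes)")
print(f"With storage disabled: {lanes_left_for_gpus(CPU_LANES, consumers)} lanes free")
[/code]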
Win 10 Pro 64, Xeon E5-2687W v2 (8x 3.40GHz), G.Skill 64 GB DDR3-2400, ASRock X79 Extreme 11
Mobo: 1 Titan RTX, 1 Titan Xp
External: 6 Titan X Pascal, 2 GTX Titan X
Plugs: Enterprise
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

itou31 wrote: Yes, be careful with the OOC feature when using a 1x lane. The performance could drop to x6, and even more!

Thanks for this important piece of information, but what did the performance drop to x6 from? Was it x8 or x16?
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
itou31
Licensed Customer
Posts: 377
Joined: Tue Jan 22, 2013 8:43 am

In fact, I made a comparison on this topic:
viewtopic.php?f=25&t=45279&p=248519#p248274
I have tried OOC several times with the 780 Ti plugged into a PCIe 1x slot via a USB3 riser: it fails and freezes the OS.
i7-3930K, 64 GB RAM, Win 8.1 Pro; main rig: 3 Titans + 780 Ti
Xeon 2696 v3, 64 GB RAM, Win 8.1/Win 10/Win 7; 2x 1080 Ti + 3x 980 Ti + 2x Titan Black
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

Seekerfinder wrote:
Tutor wrote: I'm penning this post on one of my three 2007 MacPro2,1s. In the phrase "PCIe v3.0 x16," "x" denotes the number of lanes and "v" denotes the generation. My 2007 MacPros have PCIe v1. PCIe v3.0 x16 is twice as fast as PCIe v2.0 x16, which is twice as fast as PCIe v1.0(1) x16. This Mac has two CUDA GPUs fed through x8 lanes, one CUDA GPU fed through x4 lanes, and a fourth CUDA GPU connected by an x1 riser in an x4 slot. Physically, all four slots are x16 size. I also have systems with PCIe v3 and GPUs in x16 slots/lanes. I have a low quantum of patience. Guess which machines I prefer to do most of my preliminary 3D work on? My 2007 Macs. Question: "Tutor, but doesn't it frustrate you working on v1 x8, x4 and x1?" Answer: "Not in the least." Only when someone questions me about it, or I am otherwise forced to focus on it, do I notice a slight speed difference when feeding the GPUs for 3D rendering. Once the data is transferred to the GPUs, the "v" and "x" aren't an issue at all; then only the properties of the GPU matter. However, current video production work forces me to use one of my newer systems sooner for all but the simplest of chores.

Lanes derive from the CPU and thus aren't GPU-specific; they are also used for non-GPU purposes. The speed at which data is transmitted from source to destination, particularly as it pertains to Octane-related data to be computed by the GPU, is rapid and best thought of as a momentary and sporadic occurrence. Being fed from more lanes is why a true x16 slot is physically wider than a true x1 slot. Normally, all current GPU cards are x16 size, but they can also operate as video display cards when fed from x8 (eight lanes) and even from x4 (four lanes). This has been the case since PCIe v1, although the slowness of x1 v1 is more noticeable and frustrating for modern video production. However, for what we use an excess of video cards to do - compute-only chores - the cards can perform parallel compute functions {that CPUs can do, but not on so massive a scale} when fed through an x1 (one lane). To be sure, it does take the data a little more time to reach its destination when using x1, but it's hardly noticeable now, particularly in the case of x1 v3, which is four times faster than x1 v1. The trend: as 4K, 8K and larger productions become the norm for 3D rendering, the wait time will likely be felt first at the low end, namely x1 v1, and gradually move up from there.
[ http://www.enthusiastpc.net/articles/00003/ ]
I agree with Tutor on this. The lane speed impact is minimal and (in theory) only affects the time the scene takes to load. The newest question on the block, though, is what the impact of lane speed is on Octane's out-of-core memory usage. It seems as though the pattern of I/O activity changes when shared memory is used, since textures now have to straddle motherboard RAM and the GPU. I have not seen or done any extensive testing on this as yet, but it seems to affect usage when memory is shared. Here is an extract from the help file: Out-of-core textures allow users to use more textures than would fit in the graphics memory, by keeping them in the host memory. The data for rendering the scene needs to be sent to the GPU while rendering, so some tradeoff in rendering speed is expected.

There are other risks regarding out-of-core memory, such as an unstable system, whether one has all 16 lanes operating to a GPU or not. For the best and smoothest Octane experience, my advice would be either to ensure your graphics memory is adequate for the kinds of scenes you use, or, if you need to use the out-of-core memory function, to ensure you have plenty of RAM in the main system. Then lane speed impacts are generally negligible and, if noticeable, usually worth the small sacrifice for the benefit of access to more cards than your main board or case can handle.

Seeker
How can one call the speed at which something now functions "slower" when it previously didn't function at all?

Good points. Regarding what you reference as the "newest question on the block ... what the impact of lane speed is on Octane's out of core memory usage," I suspect that if the GPUs are being fed frame allotments serially, at or close to their maximum capacity, so that they perform about as they would if the allotment were a whole frame sent at once, then [just for this moment] excluding the delay occasioned by the CPU(s)' enlistment in this feat, the impact of lane speed shouldn't be any different. However, the more the CPU's parceling of allotments drops below what the GPU could ordinarily handle optimally in a non-out-of-core setting, the more apparent the delays may become to the user. But since we're now breaking up the frame data to get it to fit in a GPU's memory, which arguably couldn't have been done in the first instance, I don't know how you can measure or otherwise adjudge a render as slower when it otherwise wouldn't have happened at all. Of course this assumes that one isn't using Octane's out-of-core feature just to be using it, in a situation where the GPU could have rendered the scene without it. So, if the out-of-core feature is being used appropriately, i.e., when it's truly needed, and delays become more noticeable, the delays might be just as attributable to the introduction of the CPU/system RAM combo as division agent, working in conjunction with the GPU processor and GPU RAM. In sum, what one may be witnessing might not be occasioned by lane speed. This is just my silly reasoning.

P.S. I'm not, however, advocating that one skimp on lane speed selection.
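
To put rough numbers behind the lane-speed point above, here is a back-of-the-envelope sketch. The per-lane throughput figures are rule-of-thumb approximations (real rates depend on encoding and protocol overhead), and the 4 GB scene payload is just a made-up example:

[code]
# Back-of-the-envelope PCIe transfer-time estimate.
# Rule-of-thumb usable throughput per lane, in GB/s, by PCIe generation
# (approximate; real figures vary with encoding and protocol overhead).
PER_LANE_GBPS = {1: 0.25, 2: 0.5, 3: 1.0}

def transfer_seconds(payload_gb, gen, lanes):
    """Time to push a payload of scene/texture data to one GPU."""
    return payload_gb / (PER_LANE_GBPS[gen] * lanes)

scene_gb = 4.0  # hypothetical per-GPU scene payload
for gen, lanes in [(1, 1), (1, 8), (2, 16), (3, 1), (3, 16)]:
    t = transfer_seconds(scene_gb, gen, lanes)
    print(f"PCIe v{gen} x{lanes:<2}: ~{t:5.2f} s to load {scene_gb} GB")
# Once the data is on the card, rendering speed is governed by the GPU
# itself, not by the link it was loaded over.
[/code]

The one-time load difference is real, but it is small next to a render that runs for minutes, which is essentially the point made above.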
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

itou31 wrote: In fact, I made a comparison on this topic:
viewtopic.php?f=25&t=45279&p=248519#p248274
I have tried OOC several times with the 780 Ti plugged into a PCIe 1x slot via a USB3 riser: it fails and freezes the OS.

Thanks for the reference to your thread. Try what Notiusweb suggests in your thread, then deselect (or remove) only the 780 Ti connected to the USB riser and run the render again, and let me know the outcomes for each case. Thanks again.
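
As an aside, if it helps to confirm which card is actually negotiating the x1 link on the riser, each GPU's current versus maximum PCIe generation and width can be read through NVIDIA's NVML bindings. A minimal sketch, assuming the pynvml package is installed:

[code]
# Minimal sketch: report each GPU's current vs. maximum PCIe link.
# Assumes the pynvml bindings for NVML are installed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        if isinstance(name, bytes):
            name = name.decode()
        cur = (pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h),
               pynvml.nvmlDeviceGetCurrPcieLinkWidth(h))
        top = (pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h),
               pynvml.nvmlDeviceGetMaxPcieLinkWidth(h))
        print(f"GPU {i} {name}: running v{cur[0]} x{cur[1]} "
              f"(card supports v{top[0]} x{top[1]})")
finally:
    pynvml.nvmlShutdown()
[/code]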
Last edited by Tutor on Thu Sep 03, 2015 7:11 pm, edited 1 time in total.
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

Notiusweb wrote: Tutor, of the two I like the MBD X10DRX myself. For me, any time something has USB 3.0 it is a big deal, as I use external drives.
I have seen pics of boards in these forums (and Bitcoin forums), and they are typically open frames with the GPUs mounted right next to the board; they look like true mining or rendering configs. Mine looks like a standard tower on a desk, with wires running out the back to the GPUs down below. I am wondering if I could still run 13 GPUs without them dropping to 1x speed.

I could be completely wrong here, but just drawing from my own experience: if the boards you listed use PCIe lanes (not slots) to handle storage, then one thing that would come into play is:

MBD X9DRX's
6. 8x SATA2 and 2x SATA3 ports

vs

MBD X10DRX's
6. 10x SATA3 (6Gbps); RAID 0, 1, 5, 10

In my case the board had something similar, LSI and SATA, where I am able to disable a storage controller to free up lanes. But if it were all LSI, for example, I would have no choice but to use those lanes to handle my storage. I think in a case where a board handles storage natively, without PCIe lanes, all is good. But if any storage does run over PCIe lanes, you'd want the liberty of sacrificing it to reclaim those lanes. So it jumped out at me that, should any storage ride on PCIe lanes, you could disable SATA2 or SATA3 independently of one another in the case of the MBD X9DRX, whereas the MBD X10DRX would be all or none (unless the 10 ports can be deactivated in some split way). In fact, I am now more inclined to look at the overall allocatability of a board's lanes rather than just its PCIe slots, although usually I think more slots signal higher yield. Just thoughts, nothing definitive as far as insight. Certainly has nothing to do with Best Practices... more like, Best Hacking-ces.
I would have had no hesitation in getting two X10DRXs if I already had CPUs for them, or if I didn't already have E5 v1/v2 CPUs, as I did in the case of the X9DRX (and I forgot to mention this earlier, but I also had 96 GB of DDR3 RAM sitting unused for each X9DRX). So outfitting two X10DRXs would have cost me many thousands of dollars for equivalent and suitable RAM and CPUs. Moreover, working with 24 other systems makes my money dearer to me. At no point did I mean to imply that anyone else shouldn't prefer or purchase the X10DRX. That's why I stated: "When it comes down to GPU rendering there's not much difference between the MBD X9DRX and the MBD X10DRX; so let your CPU, memory and SATA needs and relative costs and availability break the tie. See, in particular, the bolded differences and underlined similarity:" I admit that by mentioning only SATA storage, rather than just saying "storage," my language was under-inclusive and left out other forms of storage such as USB.

I'm not afraid of snakes and have had many of them as pets throughout my life. So, knowing that background, it shouldn't come as a surprise to anyone that your and my snake dens (resulting from our being really just mere beginners on the path to true GPU monsterdom) don't bother me in the least. Our builds truly are still works in development/progress. Also, remember that when you start with something - in your case {literally} it was, and is, just a standard tower now sitting on a desk - it's hard to just trash it; that reminds me of the extra CPUs and RAM that I wasn't using [and still wouldn't be using if I had purchased two X10DRXs]. If your case looked like anything else, you'd be a magician or wasteful. My late mother's favorite saying when I was a child was "waste not - want not." Wasn't she an early ecological fanatic? However, by all means tackle the aesthetics when you feel the need. But always remember, and take pride in, the fact that YOU BUILT IT, and that in the course of doing it you learned a lot more than you knew at the start.

One of the things that I always do is read the manual for a system/motherboard/other component that piques my interest before I decide to purchase it. In this post [ viewtopic.php?f=40&t=43597&start=200#p241271 ] regarding tackling IO space issues, I suggested that the first thing one should do is study his or her system's block diagram showing the layout of the system's CPU(s), PCIe slots, DMI points, and other features/resources/peripherals and their connections … . I've found it best to download manuals as PDFs so that they can be easily searched. The X10DRX and X9DRX manuals show that all of the persistent storage travels over the DMI link of each motherboard. A file search for "LSI" in the manuals for both of these motherboards didn't turn up anything. The diagrams for each of these two motherboards show that nothing is connected to their PCIe lanes other than what you connect yourself. PCIe slot 11 is, however, connected to the CPU by a DMI link [ https://en.wikipedia.org/wiki/Direct_Media_Interface ], which is similar to, but not the same as, a PCIe lane. PCIe slot 11 is the one I intend to populate with one of my 4-port eSATA cards to connect one of my 20-terabyte external quad 5TB hard drive arrays. Among the many other things the manuals cover, they show how to disable the SATA controllers completely or per port, and how to disable the USB controllers.

Resources:
X10DRX - http://www.supermicro.com/products/moth ... x10drx.cfm
X9DRX - http://www.supermicro.com/products/moth ... drx_-f.cfm
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
Tutor
Licensed Customer
Posts: 531
Joined: Tue Nov 20, 2012 2:57 pm
Location: Suburb of Birmingham, AL - Home of the Birmingham Civil Rights Institute

The following thread may be of assistance to those for whom a system with up to about six GPUs may be completely satisfactory - viewtopic.php?f=40&t=50181#p248935 . This may become the sweeter spot for many as much higher-performance GPUs like Pascal and Volta drop [just Google "nvidia pascal and volta" and read the wccftech.com posts].
Because I have 180+ GPU processors in 16 tweaked/multiOS systems - Character limit prevents detailed stats.
thunderbolt1978
Licensed Customer
Posts: 11
Joined: Wed Mar 11, 2015 9:20 am

Hi Tutor,

I'm running a rig with 7 GTX 780 6GB cards; it works like a charm. I used the same board with 7 Titan Xs, but I got stuck at the BAR problem.

One GTX 780 6GB uses 178 MB of IOMEM; a Titan X uses 306 MB of IOMEM. As far as I can see, and from what Gordon told us, the 32-bit BIOS is the problem. Did you use UEFI-based BIOSes, or did you change the BAR strap?
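
For what it's worth, a rough tally shows why the Titan X count matters here. The size of the 32-bit MMIO window is a placeholder guess (the real figure depends on the board, BIOS and other devices); the per-card IOMEM figures are the ones quoted above:

[code]
# Rough 32-bit BAR/IOMEM tally. The MMIO window size is a placeholder
# guess; the real figure depends on the board, BIOS and other devices.
MMIO_WINDOW_MB = 2048  # hypothetical space a 32-bit BIOS can map below 4 GB

cards = {"GTX 780 6GB": 178, "Titan X": 306}  # MB of IOMEM per card (from above)

for name, iomem_mb in cards.items():
    for count in (5, 7):
        total = count * iomem_mb
        verdict = "fits" if total <= MMIO_WINDOW_MB else "does NOT fit"
        print(f"{count} x {name}: {total} MB IOMEM, {verdict} in a "
              f"{MMIO_WINDOW_MB} MB window")
[/code]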

The hardest part was modding the Titan Xs to single slot.

kind regards
thunderbolt
smicha
Licensed Customer
Posts: 3151
Joined: Wed Sep 21, 2011 4:13 pm
Location: Warsaw, Poland

Is the single-slot conversion your own custom one? What are the GPU temps? What about the water temp?

This is impressive!
3090, Titan, Quadro, Xeon Scalable Supermicro, 768GB RAM; Sketchup Pro, Classical Architecture.
Custom alloy powder coated laser cut cases, Autodesk metal-sheet 3D modelling.
build-log http://render.otoy.com/forum/viewtopic.php?f=9&t=42540
thunderbolt1978
Licensed Customer
Posts: 11
Joined: Wed Mar 11, 2015 9:20 am

Hi,

Yes, it was a modification; I removed the second DVI port. The first card was the hardest one, but the other 6 were easy.... I experimented with the waterflow connection; the best way is to send the water to GPU 1, after that to GPU 2, and so on. The last card gets the hottest water, but not by much.

Temps are around 54°C-61°C at full load; I can't say exactly because I can use only 5 of the 7 Titan cards so far. But the water goes from my 4x 7-GPU rigs directly to the floor heating in my office. The incoming water temp is around 20°C-22°C....

regards
thunderbolt