6 GPU system, some GPUs fail (software) mid-render...
Posted: Sun Dec 20, 2015 5:51 pm
So I've got a 6 GPU system. It has a 980, an original Titan, two 680s and two 460s.
I originally had it set up in my office where all my other regular render boxes are, but after rendering for long periods of time it would heat up the room too much, even with the window fully open to let cold air in AND a box fan blowing that cold air onto it. The main problem with it being set up in the office was that it would trip my power breaker... other than these two issues it ran fine and there were no GPU failures.
I had a render that had been going for about 23 hours when the breaker tripped the other night, so I opted to move the machine downstairs and run a 50 ft network cable to it.
Now it is all set back up but in a much better configuration for cooling.
Here's how it's set up:
I have the 980 and the Titan running directly off the motherboard and powered by the main power supply. The two 680s are also fed power from the main power supply, but they are connected through an Amfeltec GPU splitter. The 460s and the Amfeltec splitter's PCIe cards are all powered by a second power supply. I trigger this second power supply on via a relay that closes when it sees a 12 V source from the main power supply. The ONLY differences between how it's wired up now and how it was wired up before are that it's now on a 50 ft network cable, and one of the 680s WAS running off a motherboard PCIe slot but has been moved onto the Amfeltec splitter. When I try to power the Amfeltec splitter from the main power supply, the whole system shuts down randomly, even when idling.
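For reference, here's the rough power budget I'm working from. These are just the nominal TDP/board-power specs plus a guessed ~100 W for the board, drives and fans, so it's a ballpark rather than measured draw:

# Ballpark power budget from nominal TDP/board-power specs (not measured).
tdp_watts = {
    "GTX 980": 165,
    "GTX Titan": 250,
    "GTX 680 (x2)": 2 * 195,
    "GTX 460 (x2)": 2 * 160,
    "FX-9590 CPU": 220,
    "Board/drives/fans (guess)": 100,
}
total = sum(tdp_watts.values())
print("Estimated peak draw: {} W".format(total))           # ~1445 W
print("Current at 120 V: {:.1f} A".format(total / 120.0))  # ~12 A

That's around 12 A at the wall for this one box alone, so with the other render machines sharing the circuit it's no surprise the breaker tripped, and it's also part of why the cards are split across two power supplies.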
Now, after moving it downstairs, whenever I run a render (I can confirm this happening in both Direct Lighting and PMC), messages pop up on the host machine saying something like ".3 samples were lost", and the server machine tells me "CUDA device (#) failed". If I restart the Octane server, the number of GPUs is reduced by the number of "CUDA device (#) failed" messages, and I can't get them all back online until I reboot...
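In case it helps with diagnosing, here's a minimal watcher script I can leave running on the render box to log which device drops out and at what temperature. This is just a sketch: it assumes nvidia-smi is on the PATH under Windows 7, and on the older Fermi/Kepler cards some query fields may come back as "N/A" or "Not Supported":

import datetime
import subprocess
import time

# Poll nvidia-smi periodically and append each GPU's index, name and temperature
# to a log file, so the timestamp of a card dropping off the list is recorded.
QUERY = ["nvidia-smi", "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader"]

while True:
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    try:
        report = subprocess.check_output(QUERY, stderr=subprocess.STDOUT,
                                         universal_newlines=True).strip()
    except Exception as exc:  # nvidia-smi missing, driver hung, etc.
        report = "nvidia-smi call failed: {}".format(exc)
    with open("gpu_watch.log", "a") as log:
        log.write("{}\n{}\n\n".format(stamp, report))
    time.sleep(30)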
System Specs:
ASUS SABERTOOTH 990FX R2.0
AMD FX-9590
32 GB of RAM
PSU 1: EVGA 120-G2-1300-XR 80 PLUS
PSU 2: Corsair HX850
GTX 980
Titan
Amfeltec GPU-Oriented x4 PCIe 4-Way Splitter
GTX 680 x 2
GTX 460 x 2
Windows 7