Cuda error 702

Posted: Fri Oct 06, 2017 9:40 am
by orb101
Hi there

A machine I use as an Octane slave with 4x 1080 Tis is repeatedly giving me the following error in the slave console.
error.png
Everything works fine at first. It renders, sometimes for a few frames, sometimes for a while, and then I get a 702 and the console freezes.
The scene type, complexity, memory usage etc. do not seem to have any effect on the crash. The scene can be simple or very complex and the issue is the same.
The error is always a 702, but the description varies. It can be "failed to copy memory from device", "failed to allocate/deallocate memory" or "could not get memory info".
At this point it can also freeze the master (presumably it's waiting for the slave and getting nothing).
Sometimes the slave machine's OS becomes unresponsive and throws a BSOD with a DPC_WATCHDOG_VIOLATION.

I've tried to isolate the cards by rendering with each of them separately. I can't get any single card to fail, but any combination of cards will fail.
I have reseated every card and tried again to narrow it down to a failing card or cards, without success.

Tests like FurMark and VRAM tests show no errors on the cards, and during stress tests none of the cards has exceeded a heat limit or crashed. They're all hybrid water-cooled and the temps never go above 50 degrees.

The driver the cards are using is a stable Nvidia driver that I use on another octane machine with no issues.
BIOS, firmware and HDD on the slave machine all have the latest updates.

Aside from Cinema 4D and the Octane slave install there's nothing else on the machine except the OS, which is Windows 10 Pro.

Machine specs:
X99-E WS/USB 3.1 mobo
4x EVGA 1080 Ti Hybrid
Samsung 850 Evo M.2 500GB
64GB Corsair DDR4 RAM
Intel i7 6850K

Any help greatly appreciated.

Re: Cuda error 702

Posted: Fri Oct 06, 2017 9:53 am
by bepeg4d
Hi orb101,
CUDA error 702 is generally related to hardware or driver issues.
Have you tried rendering with only 3x GPUs, leaving one GPU for system/monitor only?
ciao beppe

Re: Cuda error 702

Posted: Fri Oct 06, 2017 9:57 am
by orb101
Hi Beppe

I tried:

4x GPUs: fail
3x GPUs: fail
2x GPUs: fail
1x GPU: success

I then tried to isolate the failing card, but found that any combination of two cards or more always failed.
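
A test matrix like the one described here can be enumerated and checked systematically. The following is only a sketch in Python: the function names are hypothetical, and using the standard CUDA_VISIBLE_DEVICES environment variable to restrict which cards a process sees is an assumption (in this thread the cards were presumably toggled in Octane's own device settings). It lists every subset of four cards and then asks whether any one card appears in every failing subset.

```python
from itertools import combinations


def gpu_subsets(num_gpus):
    """Yield every non-empty subset of GPU indices, smallest subsets first."""
    for size in range(1, num_gpus + 1):
        yield from combinations(range(num_gpus), size)


def visible_devices(subset):
    """Format a subset as a CUDA_VISIBLE_DEVICES value, e.g. (0, 2) -> '0,2'."""
    return ",".join(str(i) for i in subset)


def common_suspects(results):
    """Return the GPU indices that are present in every failing subset.

    `results` maps a subset (tuple of indices) to True if the render passed.
    An empty set means no single card explains all of the failures.
    """
    failing = [set(s) for s, passed in results.items() if not passed]
    if not failing:
        return set()
    return set.intersection(*failing)


if __name__ == "__main__":
    # Outcome reported in this thread: single cards pass, everything else fails.
    results = {s: len(s) == 1 for s in gpu_subsets(4)}
    for subset, passed in results.items():
        status = "ok" if passed else "FAIL"
        print(f"CUDA_VISIBLE_DEVICES={visible_devices(subset)}: {status}")
    print("cards common to every failure:", sorted(common_suspects(results)))
```

With the results reported above, no card is common to every failing combination, which is consistent with the problem lying in something the cards share (slot, board, driver, OS) rather than in one bad GPU.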

I thought it might be a bad PCIe slot, but rotating the cards showed that it could fail no matter which slots I use.

SLI options are turned off in Nvidia control panel.

Any chance this is to do with system RAM or HDD?

Re: Cuda error 702

Posted: Fri Oct 06, 2017 10:04 am
by bepeg4d
Hi orb101,
I'd suspect the motherboard BIOS and Win 10 more.
It may be worth trying a different HDD and Win 7.
ciao beppe

Re: Cuda error 702

Posted: Mon Oct 09, 2017 3:46 pm
by orb101
For those interested, I did a lot more testing with various GPUs, all of which failed when attached to one of the PCIe slots.

It looks like a fault on the motherboard, and it's on its way back to the manufacturer for further tests.

Re: Cuda error 702

Posted: Thu Oct 12, 2017 4:08 pm
by tobydalsgaard
I know this is probably too late for you since you have already shipped off the mobo:

I had the exact same issues with a second PC that we had purchased; the fix was to downgrade the Nvidia drivers. Haven't had a single problem since. Just FYI, the version I downgraded to was 382.33, which is what our other PC was running.

Re: Cuda error 702

Posted: Sat Oct 14, 2017 5:28 am
by coilbook
Same here:
-> kernel execution failed(report)
CUDA error 702 on device 3: the launch timed out and was terminated
-> failed to launch kernel(ptBrdf2)
device 3: direct light kernel failed
CUDA error 702 on device 0: the launch timed out and was terminated
-> could not get memory info
CUDA error 702 on device 0: the launch timed out and was terminated
-> could not get memory info
CUDA error 702 on device 1: the launch timed out and was terminated
-> could not get memory info
CUDA error 702 on device 1: the launch timed out and was terminated
-> could not get memory info
CUDA error 702 on device 3: the launch timed out
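
To see at a glance which devices are reporting the timeout, a console dump like the one above can be tallied per device. This is a minimal sketch in Python, assuming the lines keep the "CUDA error <code> on device <n>: <message>" shape shown in this post. The function name and the shortened sample are illustrative and not part of Octane. Code 702 is the CUDA driver API's launch-timeout error, which matches the message text in the log.

```python
import re
from collections import Counter

# Matches lines such as:
#   CUDA error 702 on device 3: the launch timed out and was terminated
LINE_RE = re.compile(r"CUDA error (\d+) on device (\d+): (.+)")


def tally_cuda_errors(log_text):
    """Count occurrences of each (device, error code) pair in a console log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            code = int(match.group(1))
            device = int(match.group(2))
            counts[(device, code)] += 1
    return counts


# Shortened excerpt of the log posted above.
SAMPLE = """\
CUDA error 702 on device 3: the launch timed out and was terminated
-> failed to launch kernel(ptBrdf2)
device 3: direct light kernel failed
CUDA error 702 on device 0: the launch timed out and was terminated
-> could not get memory info
CUDA error 702 on device 1: the launch timed out and was terminated
-> could not get memory info
"""

if __name__ == "__main__":
    for (device, code), n in sorted(tally_cuda_errors(SAMPLE).items()):
        print(f"device {device}: error {code} x{n}")
```

If several devices show up in the tally, as they do here, that again suggests a shared cause rather than a single faulty card.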

Re: Cuda error 702

Posted: Sun Oct 15, 2017 11:14 pm
by joeycamacho
tobydalsgaard wrote: I know this is probably too late for you since you have already shipped off the mobo:

I had the exact same issues with a second PC that we had purchased; the fix was to downgrade the Nvidia drivers. Haven't had a single problem since. Just FYI, the version I downgraded to was 382.33, which is what our other PC was running.

Thanks - this was helpful, especially after the Windows Creators Update. It seems to have done some strange things to my rig, but this helped nonetheless.
Also, for other viewers: CUDA 702 can sometimes be caused by the Windows TDR (Timeout Detection and Recovery) setting, which can lock up the computer. Read more about it here: https://www.pugetsystems.com/labs/hpc/W ... ience-777/
Hopefully this will help others.
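
For anyone who wants to try the TDR route: Windows reads the timeout from the TdrDelay value (a DWORD, in seconds, default 2) under the GraphicsDrivers registry key. The sketch below, in Python, only builds the text of a .reg file and does not touch the registry. The function name and the 60-second figure are illustrative assumptions and not a recommendation from this thread or from the linked article. Importing the file needs admin rights, and the change takes effect after a reboot.

```python
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def make_tdr_reg(delay_seconds: int) -> str:
    """Return the contents of a .reg file that sets TdrDelay to `delay_seconds`."""
    if not 0 < delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay_seconds must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"TdrDelay"=dword:{delay_seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 60 seconds is an arbitrary example value.
    print(make_tdr_reg(60))
```

A longer delay only stops Windows from resetting the driver during long kernels. It will not help if the timeouts come from a hardware or driver fault.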

Re: Cuda error 702

Posted: Wed Oct 18, 2017 1:18 pm
by FAZ
tobydalsgaard wrote: I know this is probably too late for you since you have already shipped off the mobo:

I had the exact same issues with a second PC that we had purchased; the fix was to downgrade the Nvidia drivers. Haven't had a single problem since. Just FYI, the version I downgraded to was 382.33, which is what our other PC was running.

Just to confirm: your post saved my life! :)

Going back to that specific driver solved a very similar problem I was having as well. Thank you!

Re: Cuda error 702

Posted: Fri Dec 15, 2017 8:18 pm
by rustyippolito
I have been having the exact same problems with my 4x 1080 Ti rig. I have tried swapping out cards one at a time and switching them around, and whenever I try to render with more than one card enabled I get intermittent system freezes and even a few BSODs. I tried downgrading from 388.59 to 382.33 and now none of my GPUs show up at all in the Octane plugin or in the standalone. Has anyone cracked this problem yet? I have had BOXX (it is a maxed-out BOXX APEX4 Workstation) send me multiple replacement Nvidia cards and RAM, and had NO luck. I am about to send the whole damned machine back to them to see if they can find a solution, but even they have said that they are not sure where to look, other than the motherboard. If anyone has any info that could point us in the right direction, we would appreciate it.