Octane 3 Plugin / Handling of Hardware Glitches/Fallouts/Err

Fri May 27, 2016 8:11 am

Jim and Octane team

My current working computer is an open-structure rig with 1 dedicated and 6 GPUs for rendering. 5 of the GPUs are raised with PCI riser cables. The cards have plenty of space to run cool.
This setup has mostly worked very well in the past 3 years.
In the Octane 2 version of the plugin I sometimes got a "render failed" message when rendering at high resolution and/or with high pass counts. I can't really say what the cause was: probably GPU overheating or PCI delays.
The thing is: I didn't really care, since the rendering neither stopped nor was visually affected by it. I assumed the failing GPU just dropped out while the others continued working.

Now, while testing the Octane 3 plugin, 3ds max crashed a couple of times, and I suspect the all-new Octane 3 handling with its increased PCI traffic could be the cause.
My question is: How do the plugin and Octane react to a single GPU failure?
It would be great if it could handle the fallout without crashing, possibly retry a couple of times and, if that does not succeed, just drop the failing GPU (somehow this seemed to be the case with the oct2 versions and my particular system). This way one need not fear losing the render after an overnight job even if one GPU failed.
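
The behaviour asked for above (retry a few times, then drop the failing GPU and carry on with the rest) could be sketched roughly like this. It is only an illustration in Python, not how Octane or the plugin actually works: `render_with_fallback`, `render_pass` and `max_retries` are made-up names, and a GPU failure is modelled simply as a `RuntimeError`.

```python
from typing import Callable, List, Sequence, Tuple


def render_with_fallback(
    devices: Sequence[int],
    render_pass: Callable[[int], None],  # hypothetical: renders one pass on one GPU
    max_retries: int = 2,
) -> Tuple[List[int], List[int]]:
    """Try each device; retry on failure, then drop it instead of aborting."""
    active: List[int] = []
    dropped: List[int] = []
    for dev in devices:
        for _attempt in range(max_retries + 1):
            try:
                render_pass(dev)
            except RuntimeError:
                continue  # assumed transient glitch: try this device again
            active.append(dev)
            break
        else:
            # every attempt failed: forget this GPU, keep the others rendering
            dropped.append(dev)
    return active, dropped
```

With something like this, a single bad card would only cost its share of the samples, while the job itself keeps running on whatever is left in `active`.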

I'm very interested in what you think/know.

Cheers
Boris

edit: for those interested what this looks like:

Fri May 27, 2016 8:50 am

boris wrote:5 of the GPUs are raised with PCI riser cables.

Long cables may not be as robust as having the card directly in the slot... Perhaps some data loss may occur from time to time...

boris wrote:The cards have plenty of space to run cool.

You know, even with enough space for air-cooled cards... I always disassembled my cards right after buying them, and the stock thermal connection between the chips and the cooler was always just a joke. I always removed the stock compound, cleaned the surfaces, and put a thin layer of good silver compound there...
The GPU may fail even if the temperature of the GPU chip shows up as OK. E.g. the MOSFETs in the power module, if they have a bad thermal connection to the cooler, may get so hot that you could burn yourself on them... That affects the quality of the power supply on the GPU board. Or the memory chips may overheat too...

The drivers may sometimes be a cause too. And there is still a possibility of some bugs in the overhauled Octane rendering core...

Anyway, when you get some GPU issues, always look at the Octane log window afterwards. Its button is in the kernel settings dialog.

Fri May 27, 2016 9:16 am

I do not have a riser card system.
But it's true that octane v3 has a new frame buffer thing going on, where new features like Parallel Samples and Max. Tile Samples must increase the traffic on the PCI bus.

They say that with increased parallel samples it nearly matches the v2.0 speed.

That may be the reason if you didn't have these issues before.

BTW You have a nice rig there mate.
Could you please let me know what is the framing you used there. Where did you buy it?
Which brand / make risers are you using right now?

thanks

Fri May 27, 2016 9:42 am

oguzbir wrote:I do not have a riser card system.
BTW You have a nice rig there mate.
Could you please let me know what is the framing you used there. Where did you buy it?
Which brand / make risers are you using right now?
thanks

tnx.. these are standard 20mm aluminium rods, there are different types of them, mine is bosch if I remember correctly. I bought them from a shop in germany, since I live in switzerland and in switzerland those things are just double the price..
I planned the thing in autocad, placed an order and the hardware shop in germany cut it to the right length. price was about 350$ with all the nuts, screws and L-type parts.
assembly took me one day or so.
the good thing is it's very flexible.
jim:
the four gtx680 are the gainward type and this is a very good cooler IMO (at the cost of 2.5 pci slots).
the 2 cards on the mainboard are water cooled but this is a relic from my previous system..

the riser cables I found on ebay from a firm liquidation. they are 12cm long. will give you details when I find time.. 2 of them are stacked...

have to go, give you some more details later if you wish.

cheers,
boris

edit: tracked down the riser cables: type is CLKF731 / CLKF731S1
the cable itself is covered with a kind of silver pad which should ensure no speed loss.
http://www.sintech.cn/riser%20card/ST-C ... cable.html

I can't really track down which GPU failed in the past. Of course I had a look into the log; it was not always the same GPU if I remember correctly, so I assumed there could be a latency problem, but unfortunately on my current mobo I can't alter or fine tune the PCI latency in BIOS.

Anyway, the funny thing is that my dedicated card is raised and I have never had glitches on my monitor, at least.

when i first tried gpu rendering I bought some second hand teslas with passive cooling (to get enough memory, no instancing in arion back then), ripped off the cooler and mounted a water cooling block. One of the cards caused problems (ebay..) and I thought the memory was getting too hot. I ended up bending copper pipes to include the memory modules of the GPU in the water circuit..
http://www.meyerdudesek.com/tesla-cooling.html
the card did not behave better but I learned how to bend copper pipes

Tue May 31, 2016 3:07 pm

ok jim, while playing with volumetrics I got several crashes. I have to restart max to get octane working again:

OctaneRender 3.00 (3000020)

no interface found with subnet address 10.3.0.0, not changing the daemon scan subnet
CUDA error 700 on device 1: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> kernel execution failed (report)
CUDA error 700 on device 1: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to launch kernel (ptBrdf2)
device 1: path tracing kernel failed
CUDA error 700 on device 2: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> kernel execution failed (report)
CUDA error 700 on device 2: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to launch kernel (ptBrdf2)
device 2: path tracing kernel failed
CUDA error 700 on device 4: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> kernel execution failed (report)
CUDA error 700 on device 4: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to launch kernel (ptBrdf2)
device 4: path tracing kernel failed
CUDA error 700 on device 3: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> kernel execution failed (report)
CUDA error 700 on device 3: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to launch kernel (ptBrdf2)
device 3: path tracing kernel failed
CUDA error 700 on device 5: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> kernel execution failed (report)
CUDA error 700 on device 5: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to launch kernel (ptBrdf2)
device 5: path tracing kernel failed
CUDA error 700 on device 4: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
CUDA error 700 on device 3: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
CUDA error 700 on device 2: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to load symbol data to the device (deep_data)
-> failed to load symbol data to the device (deep_data)
device 3: failed to upload the deep params
-> failed to load symbol data to the device (deep_data)
device 4: failed to upload the deep params
CUDA error 700 on device 5: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
device 2: failed to upload the deep params
CUDA error 700 on device 1: Kernel tried to load or store to an invalid memory address. The context cannot be used anymore and must be destroyed. All existing device memory allocations from this context are invalid and must be reconstructed.
-> failed to load symbol data to the device (deep_data)
-> failed to load symbol data to the device (deep_data)
device 5: failed to upload the deep params
device 1: failed to upload the deep params
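
For logs like the one above, a small script can at least show which device reported an error first and how often each device did. This is a minimal sketch in Python, assuming the log has been saved as plain text and that the lines keep the `CUDA error N on device M:` form seen here; the function name `summarize_cuda_errors` is hypothetical. Keep in mind that the order of lines in the log may not reflect which GPU really failed first.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches lines such as "CUDA error 700 on device 1: ..."
CUDA_ERR = re.compile(r"CUDA error (\d+) on device (\d+):")


def summarize_cuda_errors(log_text: str) -> Tuple[List[int], Counter]:
    """Return (devices in order of first reported error, count per (device, code))."""
    first_seen: List[int] = []
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = CUDA_ERR.search(line)
        if not match:
            continue  # skip "-> kernel execution failed" and other lines
        code, dev = int(match.group(1)), int(match.group(2))
        counts[(dev, code)] += 1
        if dev not in first_seen:
            first_seen.append(dev)
    return first_seen, counts
```

If the first device in the list changes from crash to crash, that would point away from one bad card and more towards something shared (driver, PCI, power, or the render core itself).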

Tue Jun 07, 2016 9:12 am

jim:
trying to get my rig error free.

just encountered a "render failed" with the 2.24.2 plugin. but this time i wasn't able to port the result to the framebuffer

(8000 passes 3900x2400 pathtracing)

anyway, what monitoring programs do you guys use to identify the failing gpu in multi gpu setups? for me it's quite hard to troubleshoot since it looks like a chain reaction.
and it's hard to trigger the error, so trying one gpu after the other is a very time consuming process...
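
One possible way to watch all cards during a long render is to poll `nvidia-smi` and keep a record per GPU. The following is a rough Python sketch, assuming `nvidia-smi` (installed with the NVIDIA driver) is on the PATH; `parse_smi_csv` and `sample_gpus` are hypothetical helper names. On GeForce cards some fields may come back as not supported, which is stored as `None` here, and the index reported by `nvidia-smi` does not necessarily match the device numbers in the Octane log.

```python
import csv
import io
import subprocess
from typing import Dict, List, Optional

QUERY = "index,name,temperature.gpu,utilization.gpu"


def _to_int(field: str) -> Optional[int]:
    # Values like "[Not Supported]" or "[N/A]" become None
    return int(field) if field.isdigit() else None


def parse_smi_csv(text: str) -> List[Dict[str, object]]:
    """Parse the CSV output of nvidia-smi (noheader, nounits) into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # ignore blank or malformed lines
        idx, name, temp, util = (field.strip() for field in rec)
        rows.append({"index": int(idx), "name": name,
                     "temp_c": _to_int(temp), "util_pct": _to_int(util)})
    return rows


def sample_gpus() -> List[Dict[str, object]]:
    """Take one sample of all GPUs; requires nvidia-smi to be installed."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

Calling `sample_gpus()` every few seconds and writing the result with a timestamp to a file would give something to compare against the time of the "render failed" message.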

here are some logs:

OctaneRender 2.24.2 (2240001)

CUDA error 715 on device 1: Error code unknown.
-> kernel execution failed (pt)
device 1: path tracing failed
CUDA error 715 on device 5: Error code unknown.
-> kernel execution failed (pt)
device 5: path tracing failed
CUDA error 715 on device 4: Error code unknown.
-> kernel execution failed (pt)
device 4: path tracing failed
CUDA error 715 on device 2: Error code unknown.
-> kernel execution failed (pt)
device 2: path tracing failed
CUDA error 715 on device 3: Error code unknown.
-> kernel execution failed (pt)
device 3: path tracing failed
failed to tonemap render pass "Beauty"
failed to tonemap render pass "Beauty"
failed to tonemap render pass "Beauty"
failed to tonemap render pass "Beauty"

windows system log:

The description for Event ID 13 from source nvlddmkm cannot be found. BLABLABLA
The following information was included with the event:

\Device\UVMLiteProcess2
Graphics SM Warp Exception on (GPC 2, TPC 1): Illegal Instruction Encoding

AND milliseconds before:

\Device\UVMLiteProcess2
Graphics Exception: ESR 0x514e48=0x170009 0x514e50=0x20 0x514e44=0x13eff2 0x514e4c=0x7f

nvidia driver 365.19
6 x gtx680