Node crash halts render
Moderator: juanjgon
- BorisGoreta
- Posts: 1413
- Joined: Fri Dec 07, 2012 6:45 pm
- Contact:
With the latest version, 3.04.3.0, if one of the nodes drops, the frame never finishes.
19 x NVIDIA GTX http://www.borisgoreta.com
- BorisGoreta
- Posts: 1413
- Joined: Fri Dec 07, 2012 6:45 pm
- Contact:
Yes, with the Octane native network rendering, I have a dodgy node which can't survive the whole night of rendering so every time it stops in the middle.
19 x NVIDIA GTX http://www.borisgoreta.com
BorisGoreta wrote: Yes, with the Octane native network rendering, I have a dodgy node which can't survive the whole night of rendering so every time it stops in the middle.
There have been no relevant changes here, so I don't think it's related to only specific versions. But let's investigate the issue anyway. Does the dodgy slave crash or just hang? What is the output on the terminal?
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
- BorisGoreta
- Posts: 1413
- Joined: Fri Dec 07, 2012 6:45 pm
- Contact:
The slave doesn't crash, it hangs, it reports the same error with some code for all devices. I have identified the dodgy GPU and removed it from the system and now it works fine.
19 x NVIDIA GTX http://www.borisgoreta.com
BorisGoreta wrote: The slave doesn't crash, it hangs, it reports the same error with some code for all devices. I have identified the dodgy GPU and removed it from the system and now it works fine.
Hmm, OK. In theory (and as far as I have tested it), the slave should inform the master when all devices have failed, and the net render master should then return all unfinished assignments to the render target, which will then redistribute them. In case you or someone else wants this issue investigated further, it's probably best if you enable logging on the master and the slaves and then send me the log files the next time the problem occurs. To enable logging, you have to copy the following file into the directory with the Octane slave binary (on the slaves) and the Standalone / Octane DLL binary (on the master) and then make sure that the applications are restarted:
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
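The hand-back described above (failed slave reports in, unfinished assignments return to the render target, remaining slaves pick them up) can be sketched roughly as follows. This is only an illustration of the idea, not Octane's actual implementation: the `RenderTarget` class, its method names, and the slave names are all hypothetical.

```python
from collections import deque


class RenderTarget:
    """Hypothetical work queue: hands out tiles and takes back unfinished ones."""

    def __init__(self, tiles):
        self.pending = deque(tiles)  # tiles nobody is working on yet
        self.assigned = {}           # tile -> name of the slave rendering it
        self.done = set()            # tiles that were rendered successfully

    def assign(self, slave):
        """Give the next pending tile to a slave, or None if nothing is pending."""
        if not self.pending:
            return None
        tile = self.pending.popleft()
        self.assigned[tile] = slave
        return tile

    def complete(self, tile):
        """Mark a tile as finished."""
        self.assigned.pop(tile, None)
        self.done.add(tile)

    def slave_failed(self, slave):
        """Return every unfinished tile of a failed slave to the pending queue."""
        returned = [t for t, s in self.assigned.items() if s == slave]
        for tile in returned:
            del self.assigned[tile]
            self.pending.append(tile)
        return returned


# Usage: slave "B" fails while holding tile 1; slave "A" ends up rendering it.
target = RenderTarget(range(3))
target.complete(target.assign("A"))   # A renders tile 0
target.assign("B")                    # B takes tile 1, then all its GPUs fail
target.slave_failed("B")              # tile 1 goes back to the queue
while (tile := target.assign("A")) is not None:
    target.complete(tile)             # A renders tiles 2 and 1
```

The sketch only works if the failure is actually reported. A slave that hangs without telling the master never triggers `slave_failed`, which matches the symptom described in this thread.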
- BorisGoreta
- Posts: 1413
- Joined: Fri Dec 07, 2012 6:45 pm
- Contact:
OK, I will do that. In my experience, every time the node halted or restarted, rendering would just stop. I would have to press abort and restart the render. This usually happens overnight at some point while I am not monitoring rendering progress. It's a huge issue for me, since I can't go to sleep trusting that the sequence will be finished in the morning.
19 x NVIDIA GTX http://www.borisgoreta.com
Yes, that's a big issue, and as Abstrax said, that shouldn't happen. I also thought that problem was solved and that no matter how many GPUs fail (for whatever reason overnight), the render should NOT stop as long as at least one GPU in the network is still alive and rendering. Otherwise, for every deadline that depends on rendering overnight or while we are not in front of the computer babysitting it, it's a big mystery whether the job will finish on time or not finish at all.
--
Lewis
http://www.ram-studio.hr
Skype - lewis3d
ICQ - 7128177
WS AMD TRPro 3955WX, 256GB RAM, Win10, 2 * RTX 4090, 1 * RTX 3090
RS1 i7 9800X, 64GB RAM, Win10, 3 * RTX 3090
RS2 i7 6850K, 64GB RAM, Win10, 2 * RTX 4090
- BorisGoreta
- Posts: 1413
- Joined: Fri Dec 07, 2012 6:45 pm
- Contact:
This is happening a lot again with one of the nodes. What happens is that the node's command window halts the render completely. If I kill this node window by pressing the X at the top right of the window, the render continues normally.
This is very easy to test: just cut power to one of the GPUs in the node and it will halt the render.
Why isn't there some heartbeat test for the nodes? If a node doesn't reply in a reasonable amount of time, just disregard it for subsequent frames and continue rendering with what you've got left.
19 x NVIDIA GTX http://www.borisgoreta.com
BorisGoreta wrote: This is happening a lot again with one of the nodes. What happens is that the node's command window halts the render completely. If I kill this node window by pressing the X at the top right of the window, the render continues normally.
This is very easy to test: just cut power to one of the GPUs in the node and it will halt the render.
Why isn't there some heartbeat test for the nodes? If a node doesn't reply in a reasonable amount of time, just disregard it for subsequent frames and continue rendering with what you've got left.
There is a heartbeat to detect deadlocks. Regarding generic timeouts: What is a reasonable amount of time between responses? There are so many components in play here that can delay communication, and some scenes really take a long time to render a tile...
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
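One possible answer to the "what is a reasonable amount of time" question is to avoid a fixed timeout and instead scale it to how long each node has been observed to take between responses. The sketch below is purely illustrative and is not how Octane's heartbeat works: `HeartbeatMonitor`, `min_timeout`, and `factor` are hypothetical names and values, and times are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical monitor that flags nodes whose silence exceeds a per-node timeout."""

    def __init__(self, min_timeout=30.0, factor=3.0):
        self.min_timeout = min_timeout  # never time out faster than this (seconds)
        self.factor = factor            # tolerated multiple of the longest gap seen
        self.last_seen = {}             # node -> time of its last response
        self.longest_gap = {}           # node -> longest gap between two responses

    def beat(self, node, now):
        """Record a response from a node at time `now`."""
        prev = self.last_seen.get(node)
        if prev is not None:
            gap = now - prev
            self.longest_gap[node] = max(self.longest_gap.get(node, 0.0), gap)
        self.last_seen[node] = now

    def timeout_for(self, node):
        """Slow scenes earn a longer timeout; fast ones fall back to the minimum."""
        return max(self.min_timeout, self.factor * self.longest_gap.get(node, 0.0))

    def stalled(self, now):
        """Nodes that have been silent for longer than their own timeout."""
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout_for(n)]


# Usage: "fast" normally answers every 10 s, "slow" needs 60 s per tile.
monitor = HeartbeatMonitor(min_timeout=30.0, factor=3.0)
monitor.beat("fast", 0.0)
monitor.beat("slow", 0.0)
monitor.beat("fast", 10.0)
monitor.beat("slow", 60.0)
print(monitor.stalled(100.0))  # only "fast" has exceeded its own timeout
```

A node flagged this way could then have its unfinished work handed back to the remaining nodes. The trade-off raised above still applies: a scene whose tiles suddenly take much longer than before would be flagged incorrectly, so the factor and minimum would need generous values.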