Automatically close failed slave?

Fri Nov 10, 2017 10:09 am

Hi, I'm not sure if this is specific to Cinema or to Octane in general, but we are finding that quite often a slave will crash at some point while rendering animations in the Picture Viewer, and the frame will then never complete. As soon as the crashed slave is removed, the frame completes and the next one starts.
This creates problems when running overnight, as you can have a frame freeze at 2am and only find out in the morning that the time was wasted.

The slaves are a mix of 1080ti and 980ti in pairs (8 slaves, 16 cards total). We have tested as much as we can and cannot find a specific trigger. It's not memory: sometimes a 1080ti will fail when the 980ti are fine.
It seems to happen most often when a slave joins a job part way through, but sometimes a slave will fail that has been working since the beginning of the job and cause the same problem.
Removing and re-adding the slave works, so there's no consistency there either. All slaves are running the same drivers: 382.33 and Octane 3.06 stable.

[Screenshot: Job paused on 5th frame for 14 hours (2017-11-10 08_46_52-Picture Viewer.png)]

As you can see, a slave crashed on the 5th frame, and the frame paused at 99.97% for 14 hours. When the slave was restarted everything ran fine again.

[Screenshot: Slave failed 5 frames in. When restarted it worked fine and completed the job.]

In this case, the error was CUDA 719, but this is not consistent. Sometimes it is an error about "not receiving all information for a frame", and sometimes it simply says the "slave crashed or was stopped (CTRL-C)".

I understand there are all sorts of variables that could affect rendering so I'm not looking for a specific fix, but rather:

Is there any way to automatically force a slave daemon to quit if it fails - so that it is removed from the pool and the frame will finish?

I'd rather that the job continued a little slower overnight than froze for hours at a time.

Thanks in advance,
James

Mon Nov 13, 2017 12:17 pm

Hi,

I also have this problem from time to time and would like to know why
it's not possible for the Master to continue rendering without the crashed slave.

regards
Mike

Beppe OctaneRender™ Italia Blue Sky · Mon Nov 13, 2017 2:50 pm

Hi guys,
unfortunately, I have tried several times to reproduce this issue without success.
If the Slave crashes, the Master continues to render here.
If the issue is not clearly reproducible, it is very difficult for the developers to find the culprit.
If you can find a scene that always behaves in this way, please share it with us.
ciao beppe

Mon Nov 13, 2017 10:11 pm

Hi,

It's definitely not Cinema4D related. I have the same issue in LightWave network rendering through the Octane controller.

My topic is here, but sadly there has been no answer or news from OTOY:

viewtopic.php?f=23&t=63777&p=325122#p325122

Wed Nov 15, 2017 9:43 am

I've managed to make a scene that uses just a bit too much memory for the 6gb cards and this does reproduce the error - but not consistently.
Most of the time the slaves fail and the render continues, but sometimes - like in this screenshot - the render gets stuck.

You can see that:
The slaves don't show as failed (three of them did because of lack of memory)
Nothing shows in the log
The render is stuck - it's been 5 seconds from finishing the frame for over 15 minutes.

On the three slaves that failed, this is the error:

While I have managed to force this to happen, as mentioned previously it's not the same error every time. It doesn't seem to be related to any single issue, but occasionally the slave that crashes doesn't seem to talk to the machine in charge of rendering, which waits indefinitely for a result that's not coming.

What would be great is if there could be a time-out set on the master, so if it receives no result in a set amount of time - 2 minutes for example - it excludes the slave and carries on, or even a command-line instruction on the slave daemon to quit if it encounters an error. Is that possible?
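
To make the time-out idea concrete, here is a minimal Python sketch of the bookkeeping a master could do. This is not Octane code and does not use any Octane API; the names SlavePool, add, record_result and drop_stalled are hypothetical and exist only for illustration. It assumes the master knows when each slave last delivered a result, and it excludes any slave that has been silent for longer than the time-out so the frame can finish on the remaining machines.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SlavePool:
    """Hypothetical master-side bookkeeping; not part of Octane."""

    timeout_s: float = 120.0  # the "2 minutes" suggested above
    # Maps slave name -> time (in seconds) of its last delivered result.
    last_result: Dict[str, float] = field(default_factory=dict)

    def add(self, name: str, now: float) -> None:
        """Register a slave; joining counts as its first sign of life."""
        self.last_result[name] = now

    def record_result(self, name: str, now: float) -> None:
        """Note that a slave delivered a result at time `now`."""
        if name in self.last_result:  # ignore slaves already excluded
            self.last_result[name] = now

    def drop_stalled(self, now: float) -> List[str]:
        """Remove and return every slave silent for longer than timeout_s."""
        stalled = sorted(
            name for name, seen in self.last_result.items()
            if now - seen > self.timeout_s
        )
        for name in stalled:
            del self.last_result[name]
        return stalled


# Usage: two slaves join at t=0; only slave-1 reports back at t=100.
pool = SlavePool(timeout_s=120.0)
pool.add("slave-1", now=0.0)
pool.add("slave-2", now=0.0)
pool.record_result("slave-1", now=100.0)
excluded = pool.drop_stalled(now=130.0)  # slave-2 has been silent for 130 s
```

In a real loop the `now` values would come from something like time.monotonic(), and the work assigned to an excluded slave would have to be handed to another machine. How that would fit into Octane's network rendering is something only the developers can say.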

Fri Nov 17, 2017 3:02 pm

We have the same issue at our studio from time to time: the render stops at 99% of one frame and then gets stuck, halting the rest of the night render. Some sort of time-out function would be highly appreciated.

Fri Nov 24, 2017 12:17 am

Just to let you know: I didn't have time yet, but will have an in-depth look into the reported problem next week. If there is any more information that might be relevant, feel free to add it to this thread. Thanks a lot.

Fri Nov 24, 2017 4:12 am

James, I just noticed that you are using version 3.06. Could you (when you've got time) update to version 3.07 to see if the problem is gone? I don't think that the update will solve your problem, but it's worth a try. Thanks.

Fri Nov 24, 2017 6:38 am

abstrax wrote: James, I just noticed that you are using version 3.06. Could you (when you've got time) update to version 3.07 to see if the problem is gone? I don't think that the update will solve your problem, but it's worth a try. Thanks.

You are right, 3.07 will not fix this.
I was using 3.07.1 (LW version) but had the same issue: if one slave dies (for whatever reason), the rest just stop and wait until I hit continue.

Thanks

Fri Nov 24, 2017 10:16 am

Same issue here with a slave crash. Awaiting a response. Thanks.