regarding CUDA error 2 issues in v4 (driver issues solved)

Fri Nov 23, 2018 5:15 am

EDIT (5/12/2018): Yesterday, NVIDIA released a new GeForce driver that fixes issue #2 (see below) on Windows 7 / 8. It has version 417.22. They are planning to release the corresponding Quadro driver on 13th December.

Thank you all for helping to track down this problem. With this driver fix, we assume that any additional out-of-memory errors are caused either by insufficient resources or by non-optimal memory management in Octane. We are working on improving the latter.

Hi all,

We are still working on getting the CUDA error 2 (out of memory) failures resolved. Since a lot more people will now start using version 4, I thought it would be a good idea to give an update on where we are with this issue:

Fundamentally there are two very different scenarios we have to distinguish:

1) In the first case it's a legitimate error message: we really have run out of memory. This can happen either when we try to allocate memory on the device or when we try to pin memory on the host. Both resources (i.e. VRAM and pinnable memory) are limited and we can run out of them. In this case you should see an error message saying that some memory could not be allocated or pinned.
In the past things were fairly simple: when you ran out of memory, that was that, and there was nothing you could do anyway. But now, with out-of-core support, Octane could at least try to move more data back to system memory. Since version 4 RC 7 we have made some changes to reduce this problem, and we are trying to make the memory management more robust, but we aren't finished yet since it's not trivial. What makes it even harder to solve is that the free-memory numbers reported by CUDA are not very consistent. We hope to have at least the re-allocation implemented in the next release.
Another thing we are trying to do is reduce the amount of pinned memory we use in Octane.
In a nutshell, this scenario should be solvable and I hope we have it sorted soon.
2) The second situation is unfortunately a lot trickier, and it seems to be the more common problem: a CUDA operation fails with an out-of-memory error although a) there is more than enough memory and b) the operation doesn't allocate any memory anyway. It also always corrupts the CUDA context and is not recoverable, i.e. it requires the application to be restarted.
At the moment I believe that this is a CUDA driver problem, which I was able to reproduce under very specific circumstances. The reproducible cases have always been observed first in the C4D plugin, but I was always able to reproduce them in the Standalone as long as the following criteria applied:
- CINEMA 4D is running in the background (it doesn't have to have a scene loaded and it doesn't have to have the Octane plugin installed).
- The operating system is either Windows 7 or 8, i.e. the graphics driver is a WDDMv1 driver. (We haven't been able to reproduce the issue on Windows 10 yet.)
- The graphics card in use is the first CUDA GPU that is not using the TCC driver mode.
- The GPU in use is a Pascal GPU (1070, 1080, Titan Pascal).
Most of the time it is the upload of the material preview module that fails, but I have also seen other operations fail, like uploading some global symbols (which are usually <1KB in size). None of these CUDA API calls should produce an out-of-memory error at all, which is why I think that it is a driver problem. I have filed a couple of bug reports with NVIDIA, but there has been no progress on that matter yet.
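To make the distinction between the two scenarios easier to apply when triaging a report, here is a minimal Python sketch. It is not part of Octane and does not talk to CUDA; the `FailureReport` fields and both function names are hypothetical and simply mirror the criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """Hypothetical summary of one CUDA error 2 report (fields are assumptions)."""
    windows_version: str       # e.g. "7", "8" or "10"
    c4d_running: bool          # CINEMA 4D running in the background
    first_non_tcc_gpu: bool    # failing GPU is the first CUDA GPU not in TCC mode
    pascal_gpu: bool           # failing GPU is a Pascal card (1070, 1080, Titan Pascal)
    op_allocates_memory: bool  # the failing call allocates or pins memory


def matches_known_repro(report: FailureReport) -> bool:
    """True if the report fits all four reproduction criteria from the post."""
    return (
        report.c4d_running
        and report.windows_version in ("7", "8")  # WDDMv1 drivers
        and report.first_non_tcc_gpu
        and report.pascal_gpu
    )


def classify(report: FailureReport) -> str:
    """Rough triage: a failing allocation or pin points to scenario 1."""
    if report.op_allocates_memory:
        return "scenario 1"
    if matches_known_repro(report):
        return "scenario 2 (known reproduction pattern)"
    # Non-allocating failures outside the known pattern are the reports
    # that would hint at a more generic problem.
    return "scenario 2 candidate (new pattern)"
```

For example, `classify(FailureReport("10", False, True, True, False))` returns `"scenario 2 candidate (new pattern)"`, which is the kind of case that is most useful to report.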

If you think your case is reproducible and fits scenario 2), please send me the scene and a report. I currently have 4 scenes where I can reproduce the issue here, and with more scenes I can maybe start seeing a pattern. A reproducible CUDA failure that doesn't involve C4D, or doesn't involve Windows 7 or 8, would be especially helpful, since this would indicate that it is a more generic problem.

Although I currently believe that it is a CUDA driver issue, it is quite possible that there is a bug somewhere in Octane that only surfaces as this error at some later time. So more scenes might help us find some commonality, too, and might point us to any issues we have in Octane.

-> Please be patient with us. We are definitely working on it and I have been banging my head against it for almost 2 months now. Any help / information is more than welcome.

Thank you,
Marcus

Tue Nov 27, 2018 12:11 am

Update: NVIDIA was able to reproduce the issue. It is indeed an issue in the WDDM v1 drivers (Windows 7 and 8) and they are working on a fix. If everything goes well, there will be a driver release in December that fixes the problem.

Tue Nov 27, 2018 5:22 pm

I think I have a scenario #2 issue.
I have two Win10 machines running v4:
-Laptop with 1 x 1080
-Render slave with 6 x1070s
Using C4D R16 with v4 plugin.

I seem to be able to work with the slave in the LV but as soon as I send the render to run, I get a raft of errors - see attached - and the slave crashes.

Wed Nov 28, 2018 12:00 am

volumeboy wrote:I think I have a scenario #2 issue.
I have two Win10 machines running v4:
-Laptop with 1 x 1080
-Render slave with 6 x1070s
Using C4D R16 with v4 plugin.

I seem to be able to work with the slave in the LV but as soon as I send the render to run, I get a raft of errors - see attached - and the slave crashes.

We couldn't reproduce the issue here, but are also not able to recreate your exact setup. Could you please check if the error also occurs with 1, 2 or 4 GPUs?

And does the slave immediately fail or only after some time?

Wed Nov 28, 2018 3:03 pm

I can confirm that we are encountering both issues #1 and #2 on Windows 10, with either 1 or 2 GPUs. The render fails right off the bat and doesn't render anything. It requires a full restart. Also, that is with the 3ds Max plugin. Not sure that it matters.

Wed Nov 28, 2018 4:31 pm

abstrax wrote:
We couldn't reproduce the issue here, but are also not able to recreate your exact setup. Could you please check if the error also occurs with 1, 2 or 4 GPUs?

And does the slave immediately fail or only after some time?

It fails for all GPUs on the slave (6 x 1070s) - immediately upon sending.
Using Nvidia driver 417.01 - Win10 64bit.

Wed Nov 28, 2018 7:29 pm

volumeboy wrote:
It fails for all GPUs on the slave (6 x 1070s) - immediately upon sending.
Using Nvidia driver 417.01 - Win10 64bit.

Yes, but could you please try using only 1 GPU on the slave. If that works, try using 2 GPUs on the slave. And then 4. I'm wondering if memory allocations become more iffy when more GPUs are used in a system or if there are some more issues hiding in Octane. According to the log, a memory allocation fails, which indicates that it is issue #1 and not #2.

Thank you.

Wed Nov 28, 2018 7:30 pm

DartFrog wrote:I can confirm that we encountering both issues #1 and #2 in windows 10. With either 1 or 2 GPU's. The render fails right off the bat, and fails to render anything. Requires a full restart. Also, that is with the 3ds max plugin. Not sure that it matters.

Yes, it matters. Could you send me the scene, your configuration and the steps to reproduce? We would like to investigate this issue further.

Thu Nov 29, 2018 10:58 pm

abstrax wrote: Yes, but could you please try using only 1 GPU on the slave. If that works, try using 2 GPUs on the slave. And then 4. I'm wondering if memory allocations become more iffy when more GPUs are used in a system or if there are some more issues hiding in Octane. According to the log, a memory allocation fails, which indicates that it is issue #1 and not #2.

Thank you.

I can run 1-4 of my 6 GPUs in the slave without errors. Adding the 5th triggers the cascade of CUDA error 2 failures.
I also ran the same tests with 3.08.3 - with that I could run 5 GPUs before experiencing the errors. Adding the 6th triggers them.
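The "enable one more GPU and re-test" procedure above can be written down as a small Python sketch. The helper name `first_failing_gpu_count` and the `render_ok` callback are hypothetical; in practice `render_ok(n)` would mean enabling `n` GPUs on the slave by hand, sending a render and noting whether it completes without CUDA errors.

```python
from typing import Callable, Iterable, Optional


def first_failing_gpu_count(
    render_ok: Callable[[int], bool],
    counts: Iterable[int],
) -> Optional[int]:
    """Return the first GPU count in `counts` for which the render fails.

    `render_ok(n)` is a stand-in for a manual test with n GPUs enabled.
    Returns None if every tested count renders without errors.
    """
    for n in counts:
        if not render_ok(n):
            return n
    return None


# Simulated outcomes matching the results reported above (not measured here):
v4_ok = lambda n: n <= 4       # v4: 1-4 GPUs work, the 5th triggers errors
v3_08_3_ok = lambda n: n <= 5  # 3.08.3: 5 GPUs work, the 6th triggers errors

v4_limit = first_failing_gpu_count(v4_ok, range(1, 7))
v3_limit = first_failing_gpu_count(v3_08_3_ok, range(1, 7))
```

With these simulated callbacks, `v4_limit` is 5 and `v3_limit` is 6. A linear walk is used rather than a binary search because each failed test requires restarting the slave, so stopping at the first failure keeps the number of crashes to one.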

Tue Dec 04, 2018 7:50 pm

Today I installed the new driver and the problem was gone. I also needed to deactivate one of my GPUs.
417.22 on Windows 7 (3x 1080 Ti, 1x 1080)
@volumeboy did you try the new driver?