It's a Mac Pro 5,1 running Win7, 12GB RAM, GTX 980 and a Titan X. I'm currently running R20/V4 for evaluation; usually it's R19/3.08.
nVidia Driver: 397.64
CUDA version: 9.1.85
Everything is stock, no overclocking.
This rig has been solid for a number of years, and I've never had a problem with CUDA errors before. I'm also seeing corruption on the connected display, and other weirdness. I did get it working again by uninstalling and reinstalling CUDA and the drivers (I tried more current versions first, which also failed, so I rolled back to nVidia driver 397.64 and CUDA 9.1.85). This is suddenly what I get now:
CUDA error 719 on device 1: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 1: unspecified launch failure
-> failed to destroy CUDA event
CUDA error 719 on device 1: unspecified launch failure
-> failed to deallocate pinned memory
CUDA error 719 on device 1: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 1: unspecified launch failure
-> failed to destroy CUDA event
Started logging on 05.12.18 15:25:12
OctaneRender 4.00 (4000021)
CUDA error 700 on device 0: an illegal memory access was encountered
CUDA error 700 on device 1: an illegal memory access was encountered
-> failed to download symbol(stats data)
-> failed to wait for event
device 0: path tracing kernel failed
device 1: path tracing kernel failed
CUDA error 700 on device 1: an illegal memory access was encountered
CUDA error 700 on device 0: an illegal memory access was encountered
-> failed to load symbol data to the device(deep_data)
-> failed to load symbol data to the device(deep_data)
device 1: failed to upload the deep params
device 0: failed to upload the deep params
detected that all GPUs of the slave have failed -> restarting slave
CUDA error 700 on device 0: an illegal memory access was encountered
-> failed to destroy CUDA event
I ran a stress-test scene today and it went for an hour with no problem.
BUT then I tried the scene where the errors first appeared, and again it toasted the rig. Can a scene cause damage on a slave machine???
This scene was set up with Redshift, but I had removed all the Redshift-specific tags and settings and uninstalled Redshift. I don't know if that has anything to do with it, but the stress-test scene ran fine immediately prior to that.
Really struggling to fix this and, more importantly, to understand the cause. I hope it's not a dying card...
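In case it helps anyone triage a similar log, here's a quick Python sketch I used (my own throwaway script, nothing from Octane) to tally the CUDA errors per device from a log paste like the one above. On the excerpt here, error 719 only ever hits device 1, while error 700 hits both cards:

```python
# Tally "CUDA error NNN on device N" lines from an Octane log paste,
# so you can see at a glance which card is failing and with which code.
import re
from collections import Counter

# Sample lines taken from the log above; paste your own log here.
LOG = """\
CUDA error 719 on device 1: unspecified launch failure
CUDA error 719 on device 1: unspecified launch failure
CUDA error 700 on device 0: an illegal memory access was encountered
CUDA error 700 on device 1: an illegal memory access was encountered
"""

pattern = re.compile(r"CUDA error (\d+) on device (\d+)")
# Count (device, error_code) pairs across the whole log.
counts = Counter((m.group(2), m.group(1)) for m in pattern.finditer(LOG))

for (device, code), n in sorted(counts.items()):
    print(f"device {device}: error {code} x{n}")
```

If one device dominates the 719s while the other is clean, that at least points the finger at a specific card rather than the scene.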