CUDA 4.0

reberts2
Licensed Customer
Posts: 105
Joined: Tue Feb 23, 2010 2:48 pm

How will CUDA 4 affect Octane's software architecture?
Sincerely
Rick
Win 7 64 | GTX260 x 2 | Core i7 860 | 4GB & OSX 10.6 | 8600M GT | Core 2 Duo | 4GB
Skeletor
Licensed Customer
Posts: 83
Joined: Fri Aug 06, 2010 7:43 pm

There's a small article about it here. I was wondering as well whether this will give the Octane developers a headache like 3.2 did.
http://www.anandtech.com/show/4198/nvid ... es-cuda-40
Win 7 64bit, GF 460 2GB, Intel quad, 4GB memory
abstrax
OctaneRender Team
Posts: 5506
Joined: Tue May 18, 2010 11:01 am
Location: Auckland, New Zealand

I don't know yet, but we will see when it's out. I'm not so interested in the UMA (too slow for our purposes), but I'm really looking forward to a hopefully new/improved compiler tool chain, since the old one gave us quite a few headaches in the past.

But let's wait and see :)

Cheers,
Marcus
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
Jaberwocky
Licensed Customer
Posts: 976
Joined: Tue Sep 07, 2010 3:03 pm

Looking at it in more detail, it looks like you will be able to pool the cards' memory, e.g. fit the scene across multiple cards.

The memory across multiple cards in a system would become additive, if I am reading it right.

E.g. 2 x 2GB cards would give you 4GB to play with.

Now that would be a bit of a game changer.

:o
CPU:-AMD 1055T 6 core, Motherboard:-Gigabyte 990FXA-UD3 AM3+, Gigabyte GTX 460-1GB, RAM:-8GB Kingston hyper X Genesis DDR3 1600Mhz D/Ch, Hard Disk:-500GB samsung F3 , OS:-Win7 64bit
Qtoken
Licensed Customer
Posts: 40
Joined: Fri Oct 08, 2010 2:43 pm
Location: Sault Ste Marie, ON, Canada

A CUDA 4.0 release candidate is available now.
http://developer.nvidia.com/object/cuda ... loads.html

Looks like it's got plenty of low-level changes. I would think it's going to be a bit of a wait before a future release of Octane can be fully migrated. The last CUDA version, 3.2, seemed to delay updates a bit, and 4.0 is still only a release candidate.
Win7 x64 - i7 920 - 6GB RAM - GTX470 - Blender - 3DCoat - Octane.
abstrax
OctaneRender Team
Posts: 5506
Joined: Tue May 18, 2010 11:01 am
Location: Auckland, New Zealand

Jaberwocky wrote:Looking at it in more detail, it looks like you will be able to pool the cards' memory, e.g. fit the scene across multiple cards.

The memory across multiple cards in a system would become additive, if I am reading it right.

E.g. 2 x 2GB cards would give you 4GB to play with.

Now that would be a bit of a game changer.

:o
Unfortunately, the devil lies in the details ;) Each GPU needs access to everything. If you distributed the scene data over several GPUs or even the CPU, each GPU would then have to fetch that data from the other GPUs or from the CPU. And everything via PCIe ... That's far too slow and not practical for our purposes.
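A rough back-of-envelope sketch of this point. The latency figures below are order-of-magnitude assumptions for illustration, not measurements:

```python
# Cost of fetching a few bytes of scene data from local VRAM vs. from a
# peer GPU over PCIe. Both latency figures are rough, assumed values.

LOCAL_VRAM_LATENCY_NS = 400       # assumed: a few hundred cycles of GDDR access
PCIE_ROUNDTRIP_LATENCY_NS = 2000  # assumed: small peer-to-peer read over PCIe

def stall_ratio(remote_ns: float, local_ns: float) -> float:
    """How many times longer a core waits on a remote fetch vs. a local one."""
    return remote_ns / local_ns

ratio = stall_ratio(PCIE_ROUNDTRIP_LATENCY_NS, LOCAL_VRAM_LATENCY_NS)
print(f"A remote fetch stalls the core ~{ratio:.0f}x longer than a local one")
```

Since a path tracer does many small, random reads per ray rather than a few big transfers, every one of those reads would pay the remote penalty, which is why pooling memory across cards doesn't help here.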

Cheers,
Marcus
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
abstrax
OctaneRender Team
Posts: 5506
Joined: Tue May 18, 2010 11:01 am
Location: Auckland, New Zealand

Qtoken wrote:A CUDA 4.0 release candidate is available now.
http://developer.nvidia.com/object/cuda ... loads.html

Looks like it's got plenty of low-level changes. I would think it's going to be a bit of a wait before a future release of Octane can be fully migrated. The last CUDA version, 3.2, seemed to delay updates a bit, and 4.0 is still only a release candidate.
Actually, the changes were a lot smaller than you would expect from the PowerPoints NVIDIA floated around beforehand: Octane builds and runs fine with CUDA 4.0, and there were no big surprises regarding speed. Unfortunately, the multi-GPU changes are more trivial than what I was hoping for after reading the PowerPoints, which probably means that the multi-GPU rewrite will go on as originally planned.

Cheers,
Marcus
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
Jaberwocky
Licensed Customer
Posts: 976
Joined: Tue Sep 07, 2010 3:03 pm

abstrax wrote:
Jaberwocky wrote:Looking at it in more detail, it looks like you will be able to pool the cards' memory, e.g. fit the scene across multiple cards.

The memory across multiple cards in a system would become additive, if I am reading it right.

E.g. 2 x 2GB cards would give you 4GB to play with.

Now that would be a bit of a game changer.

:o
Unfortunately, the devil lies in the details ;) Each GPU needs access to everything. If you distributed the scene data over several GPUs or even the CPU, each GPU would then have to fetch that data from the other GPUs or from the CPU. And everything via PCIe ... That's far too slow and not practical for our purposes.

Cheers,
Marcus

You mean even over PCIe x16 v2.0 slots :o

Perhaps we need to wait for PCIe v3.0 slots.

http://www.eetimes.com/electronics-news ... cification.

I suppose then, of course, there would be a backward compatibility issue. :geek:
CPU:-AMD 1055T 6 core, Motherboard:-Gigabyte 990FXA-UD3 AM3+, Gigabyte GTX 460-1GB, RAM:-8GB Kingston hyper X Genesis DDR3 1600Mhz D/Ch, Hard Disk:-500GB samsung F3 , OS:-Win7 64bit
abstrax
OctaneRender Team
Posts: 5506
Joined: Tue May 18, 2010 11:01 am
Location: Auckland, New Zealand

Jaberwocky wrote: You mean even over PCIe x16 v2.0 slots :o

Perhaps we need to wait for PCIe v3.0 slots.

http://www.eetimes.com/electronics-news ... cification.

I suppose then, of course, there would be a backward compatibility issue. :geek:
No external bus will be able to help here. Bandwidth is not the problem, but latency. Any memory that is accessed randomly and used in the inner loops of your algorithms needs to be fetched as quickly as possible. Usually you don't load heaps of data - only a few bytes - but you have to wait for them, i.e. your core is basically twiddling its thumbs during that time. Caches reduce the problem, but in the end light can travel only so far during one clock cycle (a few centimeters), which means you want your memory physically as close as possible. And you achieve that only with on-board memory (which is already slow compared to caches).
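To put numbers on the light-travel argument (the clock speed here is an assumed ~1.5 GHz, roughly a GTX 580-era shader clock):

```python
# Distance light covers in one GPU clock cycle - the physical limit on
# how far away memory can usefully sit. The clock speed is an assumption.

SPEED_OF_LIGHT_M_PER_S = 299_792_458
CLOCK_HZ = 1.5e9  # assumed ~1.5 GHz shader clock

cycle_time_s = 1.0 / CLOCK_HZ
distance_cm = SPEED_OF_LIGHT_M_PER_S * cycle_time_s * 100

print(f"One cycle lasts {cycle_time_s * 1e9:.2f} ns")
print(f"Light in vacuum covers ~{distance_cm:.0f} cm per cycle")
```

On-chip electrical signals propagate several times slower than light in vacuum, so the practical reach per cycle really does shrink to a few centimeters, well short of a round trip across a PCIe bus.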

Fortunately there is help coming from another direction: It looks like the amount of VRAM is increasing continuously. The GTX 580 can already be bought with 3GB ;)

Cheers,
Marcus
In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
Jaberwocky
Licensed Customer
Posts: 976
Joined: Tue Sep 07, 2010 3:03 pm

Ok thanks for the insight Abstrax.
CPU:-AMD 1055T 6 core, Motherboard:-Gigabyte 990FXA-UD3 AM3+, Gigabyte GTX 460-1GB, RAM:-8GB Kingston hyper X Genesis DDR3 1600Mhz D/Ch, Hard Disk:-500GB samsung F3 , OS:-Win7 64bit