CUDA 4.0
Posted: Mon Feb 28, 2011 4:40 pm
How will CUDA 4 affect Octane's software architecture?
Sincerely
Rick
Qtoken wrote:
A CUDA 4.0 Release Candidate is available now.
http://developer.nvidia.com/object/cuda ... loads.html
Looks like it's got plenty of low-level changes. I would think that it's going to be a bit of a wait before a future release of Octane can be fully migrated. The last CUDA version, 3.2, seemed to delay updates a bit, and 4.0 is still only a release candidate.

Actually, the changes were a lot smaller than you would expect from the PowerPoints NVIDIA has floated around before. Octane builds and runs fine with CUDA 4.0, and there are no big surprises regarding speed. Unfortunately, the multi-GPU changes are more trivial than what I was hoping for after reading the PowerPoints, which probably means that the multi-GPU rewrite will go on as originally planned.
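For anyone curious what those multi-GPU changes actually look like in code: the headline additions in CUDA 4.0 are unified virtual addressing and peer-to-peer access/copies between GPUs. A minimal sketch (it assumes at least two Fermi-class cards on a 64-bit system; device numbers and buffer sizes are made up for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // New in 4.0: ask whether device 0 can map device 1's memory
    // (requires 64-bit UVA, Fermi GPUs, same PCIe root complex).
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        // Kernels on device 0 may now dereference pointers that were
        // allocated on device 1 -- but every such access travels over PCIe.
    }
    float *a = NULL, *b = NULL;
    cudaSetDevice(0); cudaMalloc((void**)&a, 1 << 20);
    cudaSetDevice(1); cudaMalloc((void**)&b, 1 << 20);
    // Also new in 4.0: direct GPU-to-GPU copies without a host bounce.
    cudaMemcpyPeer(b, 1, a, 0, 1 << 20);
    printf("peer access: %s\n", canAccess ? "yes" : "no");
    return 0;
}

So the cards can read each other's memory now, but that doesn't magically turn 2 x 2 GB into 4 GB of fast memory, which is exactly the catch discussed below.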
Jaberwocky wrote:
Looking at it in more detail, it looks like you will be able to pool the cards' memory, e.g. fit the scene across multiple cards. The memory across multiple cards in a system will become additive, if I am reading it right, e.g. 2 x 2 GB cards would give you 4 GB to play with. Now that would be a bit of a game changer.

Unfortunately, the devil lies in the detail. Each GPU needs access to everything. If you distributed the scene data over several GPUs, or even the CPU, you would then have to fetch the data from the other GPUs or the CPU, and everything via PCIe ... That's super slow and not practical for our uses.
Cheers,
Marcus
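To make concrete what "each GPU needs access to everything" means in code, this is roughly the shape of multi-GPU uploading today. A minimal sketch only; the function and variable names are hypothetical, not Octane's actual code:

#include <cuda_runtime.h>

// Upload the whole scene to every device, because the render kernels
// need low-latency random access to all of it.
void uploadSceneToAllGpus(const void* scene, size_t bytes, int numDevices) {
    for (int dev = 0; dev < numDevices; ++dev) {
        void* dScene = NULL;
        cudaSetDevice(dev);
        cudaMalloc(&dScene, bytes);              // full copy per GPU
        cudaMemcpy(dScene, scene, bytes,
                   cudaMemcpyHostToDevice);
        // ... launch the render kernel on this device ...
        // Net effect: usable scene memory is min(card), not sum(cards).
    }
}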
Jaberwocky wrote:
You mean even over PCIe x16 v2.0 slots?! Perhaps we need to wait for PCIe v3.0 slots.
http://www.eetimes.com/electronics-news ... cification
I suppose then of course there would be a backward compatibility issue.

No external bus will be able to help here. Bandwidth is not the problem, but latency. Any memory that is accessed randomly and used in the inner loops of your algorithms needs to be fetched as quickly as possible. Usually you don't load heaps of data, only a few bytes, but you have to wait for them, i.e. your core is basically twiddling its thumbs during that time. Caches reduce the problem, but in the end light can travel only so far during one clock cycle (a few centimeters), which means you want to have your memory physically as close as possible. And you achieve that only with on-board memory (which is already slow compared to the caches).
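A toy kernel (mine, not Octane's) that shows why a fatter bus doesn't save you: each load depends on the previous one, so the core stalls for the full round-trip latency at every step. Point "next" at local device memory and it runs fine; point it at a peer GPU's memory over PCIe and the same kernel crawls, no matter how much bandwidth the bus has.

__global__ void chase(const int* next, int start, int steps, int* out) {
    int i = start;
    for (int s = 0; s < steps; ++s)
        i = next[i];   // each load must wait for the one before it
    *out = i;          // write the result so the loop isn't optimized away
}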