Memory management

Hi,
I have a simple question: is there anything on the roadmap to remove the VRAM limitation? Say that in a production environment I have a scene requiring 5GB to render and only 3GB of VRAM available. I have three options in front of me:
- I buy an expensive Tesla card, which isn't really an option.
- I optimise the scene: decrease mesh subdivisions and texture resolutions, and the customer may say it looks low poly. There is always room for non-destructive optimisation, but going from 5GB to 3GB does not seem easy at all, and even then, 5GB would still be quite a small scene for archviz, where you usually have a lot of geometry from furniture and vegetation.
- I go for another render engine and lose all the time I spent setting things up in Octane.
I know we can access system RAM from CUDA, at the expense of slowdowns, since those accesses are much slower than direct VRAM access. But rather than not rendering the scene at all, I still think it is preferable to render it, even if it is slower. In the end it should still be faster than a CPU solution, shouldn't it? I don't know how it could work with the current version of Octane, but the idea would be to operate in buckets: subdivide meshes and textures, load one part into VRAM, work on it for a while, then switch to the next, a kind of bucket-style approach. I know that the tree organisation of the data may not map onto the pixels on screen and that this kind of feature would need a certain investment. But I don't see affordable GPUs getting more than 16GB of VRAM anytime soon, while I see Octane reaching its real potential much sooner. It already fits my needs in product viz very well, but it is far from there for archviz, mainly because of memory limitations.
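To make the bucket idea a bit more concrete, here is a very rough CUDA sketch of what I have in mind; the Chunk struct, the shade_chunk kernel and all the sizes are made up, and I have no idea how Octane organises its data internally:
[code]
// Out-of-core, bucket-style sketch: split the scene into chunks that fit in
// VRAM, upload one chunk at a time, accumulate its contribution, swap in the
// next. Purely illustrative - not Octane's actual architecture.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

struct Chunk { const float* host_data; size_t bytes; };

__global__ void shade_chunk(const float* chunk, size_t n, float* film, size_t film_px)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&film[i % film_px], chunk[i] * 0.001f);   // placeholder "shading"
}

void render_out_of_core(const std::vector<Chunk>& chunks, float* d_film, size_t film_px)
{
    // One device buffer sized for the largest chunk, reused for every swap.
    size_t max_bytes = 0;
    for (const Chunk& c : chunks) max_bytes = std::max(max_bytes, c.bytes);

    float* d_chunk = nullptr;
    cudaMalloc(&d_chunk, max_bytes);

    for (const Chunk& c : chunks) {
        // The PCIe upload here is exactly the slow part everybody worries about.
        cudaMemcpy(d_chunk, c.host_data, c.bytes, cudaMemcpyHostToDevice);
        size_t n = c.bytes / sizeof(float);
        shade_chunk<<<(unsigned)((n + 255) / 256), 256>>>(d_chunk, n, d_film, film_px);
        cudaDeviceSynchronize();                     // finish this bucket before swapping
    }
    cudaFree(d_chunk);
}

int main()
{
    std::vector<float> part_a(1 << 20, 1.0f), part_b(1 << 20, 2.0f);   // two "scene parts"
    std::vector<Chunk> chunks = { { part_a.data(), part_a.size() * sizeof(float) },
                                  { part_b.data(), part_b.size() * sizeof(float) } };
    const size_t film_px = 1920 * 1080;
    float* d_film = nullptr;
    cudaMalloc(&d_film, film_px * sizeof(float));
    cudaMemset(d_film, 0, film_px * sizeof(float));
    render_out_of_core(chunks, d_film, film_px);
    cudaFree(d_film);
    return 0;
}
[/code]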
Another idea would be, as soon as instances and multi-meshes are implemented, to add a simple subdivision algorithm that makes it easy to control mesh smoothness and memory consumption. That would work well with the idea above: when I exceed the 3GB of memory and things slow down, I could reduce mesh subdivisions for preview purposes and switch them back for production quality later on.
And a last idea: with CUDA 4.0 and UVA, perhaps there is a way to use additional GPUs as memory cards. Let's say I have two GTX 580 3GB and a scene requiring 5GB. It obviously doesn't fit on either card, so the idea would be to spread the scene across both cards and use only one of them to compute, fetching data from the other, which should be much faster than accessing system RAM.
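Again purely as an illustration of the mechanism, not of how Octane would do it, the CUDA 4.0 plumbing I'm thinking of would look roughly like this (two GPUs assumed, buffer sizes invented):
[code]
// Peer-to-peer / UVA sketch: GPU 1 just holds overflow data for GPU 0.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    int can_access = 0;
    if (count >= 2)
        cudaDeviceCanAccessPeer(&can_access, /*device*/0, /*peerDevice*/1);
    if (!can_access) { printf("no P2P between GPU 0 and GPU 1\n"); return 1; }

    // GPU 1 stores the part of the scene that does not fit on GPU 0.
    float* d_overflow = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&d_overflow, 512ull << 20);              // 512MB of "overflow" scene data

    // GPU 0 is the compute device and gets direct access to GPU 1's memory.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    // With UVA, d_overflow is also a valid address on GPU 0, so a kernel on
    // GPU 0 could dereference it directly (at PCIe speed), or we can stage
    // pieces of it explicitly into GPU 0's own VRAM:
    float* d_staging = nullptr;
    cudaMalloc(&d_staging, 64ull << 20);                // 64MB working set on GPU 0
    cudaMemcpyPeer(d_staging, 0, d_overflow, 1, 64ull << 20);

    cudaFree(d_staging);
    cudaSetDevice(1);
    cudaFree(d_overflow);
    return 0;
}
[/code]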
Maybe my ideas are not the most realistic, but I am sure there is a way around this memory limitation. It will be a trade-off against speed, of course, but it would allow users to finish a job with the tool they chose.
a.boeglin wrote: In the end it should still be faster than a CPU solution, shouldn't it?

That is the question. From my benchmark tests I know that VRAM speed is very important: if I clock the VRAM down by 20%, render speed drops by nearly 20% too. Current DDR3-1600 has a bandwidth of 12.8GB/s, and from what I've already seen, CUDA apps can't access system RAM at that speed (it might be only half that bandwidth). My GTX 580, on the other hand, has a memory bandwidth of 192GB/s... I don't think Octane would still be competitive if rendering were 15-30 times slower.
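Just so it's clear where the 15-30x comes from, it's simply the ratio of those bandwidth figures, assuming rendering stays roughly bandwidth-bound (which my downclocking test suggests):
[code]
// Back-of-the-envelope bandwidth ratio check, nothing more.
#include <cstdio>

int main()
{
    const double vram_gbps      = 192.0;  // GTX 580 memory bandwidth
    const double ddr3_gbps      = 12.8;   // DDR3-1600, best case
    const double cuda_host_gbps = 6.4;    // roughly half of that, as observed for CUDA host access

    printf("best case : about %.0fx slower\n", vram_gbps / ddr3_gbps);      // ~15x
    printf("worst case: about %.0fx slower\n", vram_gbps / cuda_host_gbps); // ~30x
    return 0;
}
[/code]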
„The obvious is that which is never seen until someone expresses it simply ‟
1x i7 2600K @5.0 (Asrock Z77), 16GB, 2x Asus GTX Titan 6GB @1200/3100/6200
2x i7 2600K @4.5 (P8Z68 -V P), 12GB, 1x EVGA GTX 580 3GB @0900/2200/4400
Sure, I understand that the loss is huge. But the idea is to optimise those accesses by working on only one part of the scene at a time and then switching to the next, maybe a bucket-style system, or to work with other GPUs' memory using UVA. I'm not deep enough into CUDA to say exactly how it could be applied, but from the information I've gathered it lets GPUs share memory without going through the CPU, which would mean much less speed loss. All the ideas I've described would only kick in when the available VRAM is exceeded, and I think a longer render time is still better than no render at all.
To come back to the UVA solution, let's imagine a system of four GTX 580 3GB, which is currently 4x3GB available. If the scene requires less than 3GB, everything is fine and we use all 4 GPUs to compute. If the scene requires 3-6GB, we use only 2 GPUs to compute and the other 2 to store scene data. Finally, if the scene needs more than 6GB, one GPU computes while the other three store data for it. Of course there would still be a lot of swapping between computing and storing GPUs, but it should be much faster than the system-RAM option because we keep high memory speeds. In the end, the bigger the scene, the slower the render, but buying four GTX 580s is still much cheaper than buying one 12GB Tesla, and it would render faster on small scenes.
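To put the same idea in pseudo-policy form (the plan() function is just my invention and slightly more general than the exact 4/2/1 tiers above, and it ignores all the actual swapping logic):
[code]
// Hypothetical split policy for the 4x GTX 580 idea: given the scene size,
// decide how many GPUs render and how many just hold data.
#include <cstdio>

struct Split { int compute_gpus; int storage_gpus; };

// Pick the largest number of rendering GPUs such that each one can still
// "see" the whole scene through its own VRAM plus its share of storage GPUs.
Split plan(double scene_gb, int gpu_count, double vram_per_gpu_gb)
{
    for (int compute = gpu_count; compute >= 1; --compute) {
        double reachable_gb = vram_per_gpu_gb * gpu_count / compute;
        if (scene_gb <= reachable_gb)
            return { compute, gpu_count - compute };
    }
    return { 0, 0 };  // the scene exceeds even the pooled VRAM
}

int main()
{
    double scenes[] = { 2.5, 5.0, 9.0 };
    for (double s : scenes) {
        Split sp = plan(s, 4, 3.0);
        printf("%.1f GB scene -> %d compute / %d storage GPUs\n",
               s, sp.compute_gpus, sp.storage_gpus);
    }
    return 0;
}
[/code]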
a.boeglin wrote: Sure, I understand that the loss is huge. But the idea is to optimise those accesses by working on only one part of the scene at a time and then switching to the next, maybe a bucket-style system, or to work with other GPUs' memory using UVA.

Very good.
a.boeglin wrote: Sure, I understand that the loss is huge. But the idea is to optimise those accesses by working on only one part of the scene at a time and then switching to the next, maybe a bucket-style system, or to work with other GPUs' memory using UVA.

AFAIK the bandwidth problem still exists with UVA. UVA transfers data over the PCIe bus without involving the CPU, but it is still limited by PCIe and IOH bandwidth (about 8GB/s concurrent between two ports), so this won't help much IMO.
I'm in no way an expert in ray tracing algorithms, but I don't see how to compute ray tracing easily without having access to all participating elements of a scene. Thinking of bucket rendering, though, I can imagine that some sort of auto-generated LOD might work: keep every part of the scene resident, but only the parts currently visible in the bucket at full quality, and the rest more and more simplified (in terms of geometry and materials/textures). Still, this doesn't sound like an easy approach to implement.
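Purely to illustrate that auto-LOD idea (all the names, sizes and the "each LOD step costs a quarter of the previous one" rule are invented):
[code]
// Everything stays in the scene, but only what the current bucket sees keeps
// full resolution; the rest is demoted until the set fits a VRAM budget.
#include <cstdio>
#include <vector>

struct Object {
    bool   visible_in_bucket;
    size_t full_res_bytes;
};

// Assume each LOD step costs a quarter of the previous one.
size_t bytes_at_lod(const Object& o, int lod) { return o.full_res_bytes >> (2 * lod); }

int pick_global_lod(const std::vector<Object>& scene, size_t vram_budget)
{
    for (int lod = 0; lod < 8; ++lod) {             // demote non-visible objects step by step
        size_t total = 0;
        for (const Object& o : scene)
            total += o.visible_in_bucket ? o.full_res_bytes : bytes_at_lod(o, lod);
        if (total <= vram_budget) return lod;
    }
    return -1;                                       // even the lowest LOD doesn't fit
}

int main()
{
    std::vector<Object> scene = {
        {true,  400u << 20},   // hero furniture in the current bucket
        {false, 900u << 20},   // off-bucket vegetation
        {false, 700u << 20},   // off-bucket building shell
    };
    int lod = pick_global_lod(scene, 1200u << 20);   // pretend ~1.2GB is left to spend
    printf("non-visible objects rendered at LOD %d\n", lod);
    return 0;
}
[/code]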

Another way to save at least some VRAM would be to store the film buffer not in VRAM but in system RAM (this time using UVA), assuming that writes to the film buffer happen at much lower rates than geometry and texture accesses. This could save several hundred megabytes on large renders.
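For reference, the CUDA mechanism I mean is mapped pinned memory (zero-copy); a minimal sketch, assuming the film is a simple float4 buffer, which it certainly isn't in reality:
[code]
// Film buffer kept in pinned system RAM and written over PCIe.
#include <cuda_runtime.h>

__global__ void accumulate(float4* film, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        film[i].x += 0.01f;                          // placeholder: add a sample's contribution
}

int main()
{
    const int pixels = 1920 * 1080;

    cudaSetDeviceFlags(cudaDeviceMapHost);           // must happen before the context is created

    float4* h_film = nullptr;                        // lives in pinned host RAM, not VRAM
    cudaHostAlloc((void**)&h_film, pixels * sizeof(float4), cudaHostAllocMapped);

    float4* d_film = nullptr;                        // device-visible alias of the same memory
    cudaHostGetDevicePointer((void**)&d_film, h_film, 0);

    accumulate<<<(pixels + 255) / 256, 256>>>(d_film, pixels);
    cudaDeviceSynchronize();                         // after this, h_film holds the result

    cudaFreeHost(h_film);
    return 0;
}
[/code]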
Would be cool if a dev told us how things really stand...

„The obvious is that which is never seen until someone expresses it simply ‟
1x i7 2600K @5.0 (Asrock Z77), 16GB, 2x Asus GTX Titan 6GB @1200/3100/6200
2x i7 2600K @4.5 (P8Z68 -V P), 12GB, 1x EVGA GTX 580 3GB @0900/2200/4400
Yes, that's totally true. Computing reflection/refraction rays is hard when you don't have the surrounding geometry: put a mirror ball in the scene and you're stuck, because the sphere sees almost everything around it via the reflection. But here is another idea. Let's say we load a part of the scene onto the GPU and start computing rays, BUT if a ray reaches a part of the scene that isn't in VRAM, we suspend that ray until the missing data has been loaded. Coming back to the previous example: I have 3GB of VRAM and want to render a scene needing 10GB. We absolutely need to load a coarse "cage" of the whole scene, which tells us whether a ray hits something in the scene at all, whether or not that part is currently in VRAM. Then we trace rays; when a ray reaches a point that is in the scene but not in VRAM, we stop the calculation for that point, put it on standby, move on to another camera ray, and meanwhile load the data needed to finish the suspended one. This way we load missing data while working on another area. I know this may be hard to handle given how parallel the calculations are, but it would be one way of doing it. Of course in the end it would still be much slower, but it would be a first step towards optimising the process, and it would let users render larger scenes where that wasn't possible before.
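To show the control flow I have in mind, here is a toy model; everything in it (rays, chunks, the paging rule) is invented, it runs on the CPU just to illustrate the scheduling, and a real renderer would obviously be far more complex:
[code]
// Park rays that hit non-resident geometry, keep tracing others, page in
// the missing chunk when nothing else is left to do, then wake the rays.
#include <cstdio>
#include <queue>
#include <set>
#include <vector>

using ChunkId = int;
struct Ray { int pixel; };

const int kChunks = 8;             // the whole scene, split into 8 chunks; 3 fit in VRAM at once

// Pretend-trace: returns -1 if the ray finished, otherwise the chunk it
// needs next (derived from the pixel index just to have something to do).
ChunkId trace_until_missing(const Ray& r, const std::set<ChunkId>& resident)
{
    ChunkId needed = r.pixel % kChunks;
    return resident.count(needed) ? -1 : needed;
}

int main()
{
    std::set<ChunkId> resident = {0, 1, 2};          // initial working set
    std::queue<Ray> active;
    for (int p = 0; p < 16; ++p) active.push({p});   // camera rays

    std::vector<std::queue<Ray>> waiting(kChunks);   // rays parked per missing chunk
    auto any_waiting = [&]() {
        for (const auto& q : waiting) if (!q.empty()) return true;
        return false;
    };

    while (!active.empty() || any_waiting()) {
        if (active.empty()) {
            // Nothing left to trace: page in a chunk that has parked rays,
            // evict an arbitrary resident one, and wake the parked rays.
            for (ChunkId c = 0; c < kChunks; ++c) {
                if (waiting[c].empty()) continue;
                resident.erase(resident.begin());
                resident.insert(c);
                printf("paging in chunk %d\n", c);
                while (!waiting[c].empty()) { active.push(waiting[c].front()); waiting[c].pop(); }
                break;
            }
            continue;
        }
        Ray r = active.front(); active.pop();
        ChunkId missing = trace_until_missing(r, resident);
        if (missing >= 0)
            waiting[missing].push(r);                // park it and keep tracing others
        // missing < 0: the ray finished and would contribute to the film here
    }
    return 0;
}
[/code]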
Finally, I think we all understand that getting past this memory limitation, in whatever way, has a cost, and that cost is efficiency and speed, but it still lets us use the tool. When you are able to hide objects (as you apparently can in the current 3ds Max plugin, which I haven't tried yet) and modify subdivision levels on the fly to optimise the scene for preview purposes, I don't think the final render time is a big issue. From my own experience and what I've heard around me over the last months, the main reason people don't move to GPU systems yet is the memory limit, which really is blocking. In the end, we don't mind something being as slow as a CPU option, as long as it is still 20x faster for previews and small projects. For animation there are of course tons of other optimisation and compositing options, but for still images, especially when the target is print, a faceted sphere or a pixelated texture isn't really forgivable. Let's come back to what the 3D designer's job is: ten days of pressure to get the job done, and one or two nights of rendering. Sure, it depends on what area of 3D you are in, but an animator/rigger/simulation artist who doesn't render won't have much interest in Octane Render anyway. With a single GTX 590 I can render most scenes within two hours, even the hardest of them, with very decent results, and the shading/lighting process is the most pleasant I have ever used. If those two hours turned into 10 or 12, I wouldn't care much; I could sleep in the next day, if you know what I'm talking about. The node workflow is genius (every piece of software should work around nodes), the capabilities are there, but this memory limit is a real pain.
And I hear people in the back saying that a few years ago we didn't have that much memory in our systems anyway. I agree, but requirements have moved on: today's graphics limits keep getting pushed, and as tools become more accessible and more efficient, quality requirements get harder too. Of course in three years we'll have 16GB of memory on GPUs (maybe; it doesn't seem to grow as fast as compute capability), and where a scene uses 16GB of memory today it will simply ask for 64GB tomorrow. We don't know what the next step is: 32-bit textures, higher-resolution outputs, holographic images, real-time ray tracing; there is always a new technology asking for bigger and better quality, which means more compute and more memory. But in all of this, what matters most is workflow timing. When you are designing, editing materials, lighting and textures, you need fast feedback so you can move from one iteration to the next. The final render time isn't such an issue when the tool lets you work about 10 times faster than another one, with a final render time in the same ballpark as similar tools.
And finally, I may not have all the knowledge required to answer this question, but it is good to discuss this point, which is essential in my eyes. I have absolutely no idea whether the developers have already tried to find a solution to this limit or not. But even if a naive algorithm that simply swaps data through system memory were the best we could do, it would still be a better option than no option at all, if you know what I mean.
I hope to hear from a developer soon; their opinion should be interesting. Wishing you all a good weekend!
- gabrielefx
Wait for instances and we will instance everything.
quad Titan Kepler 6GB + quad Titan X Pascal 12GB + quad GTX1080 8GB + dual GTX1080Ti 11GB
Instances are one optimisation, but they don't solve the problem of a few heavy meshes. It's nice to instance one tree 20 times, but that lacks diversity for a landscape or the surroundings of a house. Seeing how much memory a single tree takes, I doubt instances will allow 3-6 different trees in a scene. And without a proper pass system it isn't even possible to think about compositing it in post-production.
I really hope the Octane devs add the ability to page out-of-core memory in the future. It may not be easy; NVIDIA has done it on Quadro and Tesla, and even better, the people working on CentiLeo have pulled it off on a GTX 480. Obviously paging to system RAM will kill your performance, but it is still fast compared to a CPU solution, and it's better than finding yourself out of options partway through rendering a project.
yamanash wrote: it is still fast compared to a CPU solution

I really doubt that. Octane is, for example, about 5 times faster than LuxRender on an i7 quad (roughly, depending on the scene, etc.).
Now make it 10 times slower because of UMA...
„The obvious is that which is never seen until someone expresses it simply ‟
1x i7 2600K @5.0 (Asrock Z77), 16GB, 2x Asus GTX Titan 6GB @1200/3100/6200
2x i7 2600K @4.5 (P8Z68 -V P), 12GB, 1x EVGA GTX 580 3GB @0900/2200/4400