Gtx 470/480 - 2.4 pre beta speed
Forum rules
NOTE: The software in this forum is not 100% reliable; these are development builds and are meant for testing by experienced Octane users. If you are a new Octane user, we recommend using the current stable release from the 'Commercial Product News & Releases' forum.
The GTX 470 gets a speed improvement of about 20% with the same number of CUDA cores,
the GTX 460 gets a speed improvement of about 20% with 112 added cores.
It does not make sense to me...
|DualXeon5130|10GB|2xGTX460 2GB|
We found out why. It is because the GTX 460 (GF104) has a different architecture. Look here:
http://www.anandtech.com/show/3809/nvid ... 200-king/2
One thing we haven’t discussed up until now is how an SM is internally divided up for the purposes of executing instructions. Since the introduction of G80 in 2006, the size of a warp has stayed constant at 32 threads wide. For Fermi, a warp is executed over 2 (or more) clocks of the CUDA cores – 16 threads are processed and then the other 16 threads in that warp are processed. For full SM utilization, all threads must be running the same instruction at the same time. For these reasons a SM is internally divided up in to a number of execution units that a single dispatch unit can dispatch work to:
16 CUDA cores (#1)
16 CUDA cores (#2)
16 Load/Store Units
16 Interpolation SFUs (not on NVIDIA's diagrams)
4 Special Function SFUs
4 Texture Units
With 2 warp scheduler/dispatch unit pairs in each SM, GF100 can utilize at most 2 of 6 execution units at any given time. It’s also because of the SM being divided up like this that it was possible for NVIDIA to add to it. GF104 in comparison has the following:
16 CUDA cores (#1)
16 CUDA cores (#2)
16 CUDA cores (#3)
16 Load/Store Units
16 Interpolation SFUs (not on NVIDIA's diagrams)
8 Special Function SFUs
8 Texture Units
This gives GF104 a total of 7 execution units, the core of which are the 3 blocks of 16 CUDA cores.
With 2 warp schedulers, GF100 could put all 32 CUDA cores to use if it had 2 warps where both required the use of CUDA cores. With GF104 this gets more complex since there are now 3 blocks of CUDA cores but still only 2 warp schedulers. So how does NVIDIA feed 3 blocks of CUDA cores with only 2 warp schedulers? They go superscalar.
In a nutshell, superscalar execution is a method of extracting Instruction Level Parallelism from a thread. If the next instruction in a thread is not dependent on the previous instruction, it can be issued to an execution unit for completion at the same time as the instruction preceding it. There are several ways to extract ILP from a workload, with superscalar operation being something that modern CPUs have used as far back as the original Pentium to improve performance. For NVIDIA however this is new – they were previously unable to use ILP and instead focused on Thread Level Parallelism (TLP) to ensure that there were enough warps to keep a GPU occupied.
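To make the ILP point concrete, here is a minimal sketch of my own (just an illustration, not Octane code): in the first kernel every FMA depends on the previous one, so the extra block of cores on GF104 has nothing to dual-issue; in the second kernel the three accumulators are independent, which is exactly the kind of instruction-level parallelism the superscalar dispatcher can exploit.

// Minimal ILP illustration for CUDA (my own sketch, not from Octane).
__global__ void dependent_chain(float *out, int iters)
{
    float a = 1.0f;
    for (int i = 0; i < iters; ++i)
        a = a * 1.000001f + 0.5f;   // each FMA waits on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

__global__ void independent_chains(float *out, int iters)
{
    float a = 1.0f, b = 2.0f, c = 3.0f;
    for (int i = 0; i < iters; ++i) {
        // three independent FMAs per iteration: instruction-level parallelism
        // that a superscalar scheduler (GF104) can issue side by side
        a = a * 1.000001f + 0.5f;
        b = b * 1.000002f + 0.25f;
        c = c * 1.000003f + 0.125f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c;
}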
So the GTX 460 has an extra block of 16 CUDA cores per SM, but how efficiently those cores are used depends on the character of the application (much like Intel Hyper-Threading, which is almost useless in games).
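For what it's worth, my own back-of-the-envelope numbers (the per-SM figures are from the article above, the SM count is the published GTX 460 spec):

GF100 SM: 2 x 16 = 32 CUDA cores
GF104 SM: 3 x 16 = 48 CUDA cores
GTX 460: 7 SMs x 48 = 336 cores in total
Counted the old GF100 way: 7 x 32 = 224 cores, so 336 - 224 = 112 "added" cores

That would be where the extra 112 cores reported under CUDA 3.2 come from.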

RESULT: CUDA 3.2 is only showing another 112 cores, but it is only showing them; this version of Octane cannot use all 3 blocks of this architecture. Maybe Radiance will find out how to make it work.
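A minimal sketch of how an application typically arrives at such a core count (my own code, not Octane's; the cores-per-SM helper is a hypothetical table based on NVIDIA's published Fermi specs, since the CUDA runtime itself only reports the SM count and compute capability):

// deviceQuery-style core counting sketch (assumption: Fermi only).
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: cores per SM by compute capability.
static int coresPerSM(int major, int minor)
{
    if (major == 2 && minor == 0) return 32;  // GF100/GF110: 2 blocks of 16
    if (major == 2 && minor == 1) return 48;  // GF104/GF106/GF108: 3 blocks of 16
    return -1;                                // unknown architecture
}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        int cps = coresPerSM(prop.major, prop.minor);
        // e.g. GTX 460: 7 SMs, CC 2.1 -> 7 x 48 = 336 cores;
        // a renderer assuming 32 cores/SM would report only 224.
        printf("%s: %d SMs, CC %d.%d -> %d CUDA cores\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor,
               cps > 0 ? cps * prop.multiProcessorCount : -1);
    }
    return 0;
}

Whether the third block of cores in each SM actually gets used is then entirely up to the ILP in the kernels, as described above; the driver reporting more cores does not change that.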
1 x GTX 460 2GB, Core i3 @ 3,7Ghz, 4GB Ram, Win 7 64-bit, 260.99 WHQL
Wow thanks for the info.
On the topic of cards. Tigerdirect has GTX 460 on sale for $109 USD.
I recently bought a system with a GTX 480 and am waiting for it to arrive. Can I use the 460 with a 480 in SLI mode?
Win 11 64GB | NVIDIA RTX3060 12GB