Sunday Nov. 15 2009
The idea of CUDA is to run very many lightweight threads (not traditional OS threads), each doing a very small task.
- Block: a group of up to 512 threads.
- Block barrier: all threads in a block synchronize at the same point (`__syncthreads()`).
- Warp: a group of 32 threads; one instruction is executed by all threads of a warp in lockstep.
Use of conditionals (if-then-else) within a warp reduces throughput if some threads take one path and others take the other: the two paths are serialized (warp divergence).
- Threads in a block have high-speed shared memory available.
- All threads from any block can access global memory.
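The pieces above can be tied together in a minimal kernel sketch (hypothetical names; assumes 512-thread blocks): each thread loads one element from global memory into the block's shared memory, the block barrier makes the loads visible, and a tree reduction produces one partial sum per block.

```cuda
// Sketch of a per-block sum reduction (hypothetical example).
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[512];               // fast per-block shared memory

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid; // global index into `in`

    buf[tid] = (i < n) ? in[i] : 0.0f;       // load from (slower) global memory
    __syncthreads();                         // block barrier: wait for all loads

    // Tree reduction within the block; `if (tid < s)` is a conditional,
    // but whole warps take the same path for most steps, limiting divergence.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];            // one partial sum per block
}
```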
- Mix of serial and parallel code.
- Parallel sections of code can be offloaded to CUDA.
- CUDA itself is C/C++; bindings exist for Java (JCuda), Python (PyCUDA), and possibly other languages.
- Debugging: cuda-gdb (question: what is a breakpoint when you have 10K threads?)
- Thrust provides parallel algorithms behind an STL-like interface.
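A rough sketch of what Thrust code looks like (values chosen for illustration; `thrust::device_vector`, `thrust::sort`, and `thrust::reduce` are part of Thrust's documented interface):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main()
{
    thrust::host_vector<int> h(4);
    h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 1;

    thrust::device_vector<int> d = h;    // copy host -> device global memory

    thrust::sort(d.begin(), d.end());    // parallel sort, runs on the GPU
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel sum

    return 0;
}
```

The STL-style iterator interface means the parallel sections are offloaded to CUDA without writing any kernels by hand.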