







#### PURDUE

How it works

- 3) An assigned block is partitioned into *warps*. Their execution is interleaved
- 4) Warps are assigned to SM (one thread to one SP)
- 5) Warps can be delayed if idle for some reason (waiting for memory)

© Bedrich Benes

#### PURDUE.

### **Basic Considerations**



- the size of a block is limited to 512 threads blockDim(512,1,1) blockDim(8,16,2) blockDim(16,16,2)
- kernel can handle up to 65,536x65,536 blocks

© Bedrich Benes

# PURDUE

#### **G80 Architecture**



each can process

≤ 8 blocks

or

≤ 768 threads

max: 8x16=128 CUDA Cores (SPs)

max: 16x768=12,288 threads

Bedrich Bene

#### PURDUE

## **GT200 Architecture**



has 30 SMs

each can process

≤ 8 blocks

or

≤ 1024 threads

max: 8x30=240 CUDA Cores (SPs)

max: 30x1,024= 30,720 threads

Redrich Renes

















#### PURDUE.

#### OVIDIA CLUBA TELECONIE CENTER CENTER

# Warp Assignment

- one thread is assigned to one SP
- SM has 8 SPs
- warp has 32 threads
- so a warp is executed in four steps

© Bedrich Benes

#### PURDUE

# Warps - latency hiding



 Why do we need so many warps if there are just 8 CUDA cores in SM (GT200)?

#### **Latency hiding:**

- a warp executes a global memory read instruction that delays it for 400 cycles
- any other warp can be executed in the meantime
- if more than one is available priorities

© Bedrich Bene

#### PURDUE

# Warps – processing



- A warp is SIMT (single instruction multiple thread) all run in parallel and the same instruction
- Two warps are MIMD can do branching, loops, etc.
- Threads within one warp do not need synchronization – they run the same time instruction

Bedrich Bene

#### **PURDUE**

# Warps - zero-overhead



#### Zero-overhead thread scheduling

- having many warps available, the selection of warps that are ready to go keeps the SM busy (no idle time)
- that is why, caches are not usually necessary

© Redrich Ren

#### PURDUE

# CUSA TRAINING CONTROL

# Example - granularity

Having GT200 and matrix multiplication.
 Which tiles are the best 4x4, 8x8, 16x16, or 32x32?

© Bedrich Benes

# Example - granularity



 4x4 will need 16 threads per block SM can take up to 1024 threads We can take 1024/16=64 blocks BUT! The SM is limited to 8 blocks There will be 8\*16=128 threads in each SM 128/32=4 -> 8 warps, but each half full

heavily underutilized! (fewer warps to schedule)

© Bedrich Bene

**PURDUE** 

#### PURDUE



# Example - granularity

 8x8 will need 64 threads per block SM can take up to 1024 threads We can take 1024/64=16 blocks BUT! The SM is limited to 8 blocks There will be 8\*64=512 threads in each SM 512/32=16 warps

still underutilized!
(fewer warps to schedule)

Bedrich Bene

#### **PURDUE**



## Example - granularity

 16x16 will need 256 threads per block SM can take up to 1024 threads We can take 1024/256=4 blocks The SM can take it 2x There will be 8\*64=512 threads in each SM 512/32=16

full capacity and a lot of warps to schedule

© Bedrich Bene

# Example - granularity • 32x32 will need 1024 threads per block a block (GT200) can take max 512 Not even one will fit in the SM (not true in GT400)











© Bedrich Benes





#### PURDUE

# NVIDIA CUEA TRACESSE TRACES TRACESSE TR

# Thread Divergence

- · What causes thread divergence?
- 1) If statements with functions of threadIdx
- 2) Loops with functions of threadIdx

ifs are expensive anyway...

© Bedrich Benes

PURDUE

# PURDUE



# Thread Divergence

#### Example:

for (int i=0;i<threadIdx.x;i++)
 a[i]=i;</pre>

All loops that should finished will finish, but

the GPU will iterate for the others till the end

© Bedrich Benes

# Reading



- NVIDIA CUDA Programming Guide
- Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, NVIDIA, Morgan Kaufmann 2010

Programming Massively
Parallel Processors
A Hands-on Approach

© Bedrich Benes