cuda - Strong scaling on GPUs


I'd like to investigate the strong scaling of parallel GPU code (written in OpenACC). The concept of strong scaling with GPUs is - at least as far as I know - murkier than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs. However, I believe there is some amount of strong scaling within a single GPU, for example scaling over streaming multiprocessors (in the Nvidia Kepler architecture).

The intent of OpenACC and CUDA is to explicitly abstract away the hardware from the parallel programmer, constraining them to a three-level programming model of gangs (thread blocks), workers (warps) and vectors (SIMT groups of threads). It is my understanding that the CUDA model aims at offering scalability with respect to thread blocks, which are independent and get mapped to SMXs. I therefore see two ways to investigate strong scaling on a GPU:

  1. Fix the problem size, and set the thread block size and the number of threads per block to an arbitrary constant number. Scale the number of thread blocks (grid size).
  2. Given additional knowledge of the underlying hardware (e.g. CUDA compute capability, max warps per multiprocessor, max thread blocks per multiprocessor, etc.), set the thread block size and the number of threads per block such that one block occupies an entire, single SMX. Scaling over thread blocks is then equivalent to scaling over SMXs. (A sketch of the device query this relies on follows right after this list.)
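For reference, a minimal sketch of the kind of device query option 2 relies on. This is my own illustration, not part of the question, and it uses the plain CUDA runtime API (the fields are those of the standard cudaDeviceProp struct) rather than OpenACC:

/* Minimal device query; build with e.g. nvcc query.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            /* query device 0 */

    printf("Compute capability    : %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMX) : %d\n", prop.multiProcessorCount);
    printf("Max threads per SMX   : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("Warp size             : %d\n", prop.warpSize);
    return 0;
}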

My questions are: is this train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC?

GPUs do strong scale, but not necessarily in the way you're thinking, which is probably why you've only been able to find information about strong scaling across multiple GPUs. With a multi-core CPU you can trivially decide exactly how many CPU cores you want to run on, so you can fix the work and adjust the degree of threading across the cores. With a GPU, the allocation across SMs is handled automatically and is completely out of your control. This is by design, because it means that well-written GPU code will strong scale to fill whatever GPU (or GPUs) you throw at it without any programmer or user intervention.

You could run on some small number of OpenACC gangs/CUDA threadblocks and assume that 14 gangs will run on 14 different SMs, but there's a couple of problems with this. First, 1 gang/threadblock will not saturate a single Kepler SMX. No matter how many threads, no matter the occupancy, you need more blocks per SM in order to fully utilize the hardware. Second, you're not really guaranteed that the hardware will choose to schedule the blocks that way. Finally, even if you find the optimal number of blocks or gangs per SM on the device you have, it won't scale to other devices. The trick with GPUs is to expose as much parallelism as possible so that your code can scale from devices with 1 SM up to devices with 100, if they ever exist, or to multiple devices.
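In OpenACC terms, "exposing as much parallelism as possible" mostly amounts to leaving the decomposition to the compiler instead of pinning gang counts yourself; a minimal sketch of what that looks like (my own illustration):

#pragma acc parallel loop
for(i=0; i<n; i++)
  ...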

If you do want to experiment with how varying the number of OpenACC gangs for a fixed amount of work affects performance, you'd do it with either the num_gangs clause, if you're using a parallel region, or the gang clause, if you're using kernels. Since you're trying to force a particular mapping of the loops, you're better off using parallel, since that's the more prescriptive directive. What you'd want is something like the following:

#pragma acc parallel loop gang vector num_gangs(vary this number) vector_length(fix this number)
for(i=0; i<n; i++)

This tells the compiler to vectorize the loop using the provided vector length and then partition the loop across the OpenACC gangs. I'd expect that as you add gangs you'll see better performance until some multiple of the number of SMs, at which point performance will become roughly flat (with outliers of course). As I said above, fixing the number of gangs at the point where you see optimal performance is not necessarily the best idea, unless this is the only device you're interested in. Instead, by either letting the compiler decide how to decompose the loop, which allows the compiler to make smart decisions based on the architecture you tell it to build for, or by exposing as many gangs as possible, which gives you additional parallelism that will strong scale to larger GPUs or multiple GPUs, you'd have more portable code.
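If you want to run that sweep end to end, here is a minimal sketch of a self-contained program that fixes the problem size and varies num_gangs at runtime. This is my own illustration, not part of the original answer; the vector-update kernel and the list of gang counts are arbitrary placeholders, and timing from the host works because an OpenACC parallel region is synchronous unless you add async:

/* Hypothetical strong-scaling sweep: fixed problem size, varying gang count.
   Build with an OpenACC compiler, e.g. nvc -acc sweep.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const int n = 1 << 24;                  /* fixed problem size */
    float *x = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    /* Gang counts to test; pick values that bracket the SM count of your device.
       The first measurement also pays the one-time device initialization cost. */
    int gangs[] = {1, 2, 4, 8, 14, 28, 56, 112, 224};
    int ntests = sizeof(gangs) / sizeof(gangs[0]);

    #pragma acc data copy(x[0:n])
    for (int t = 0; t < ntests; t++) {
        int ng = gangs[t];
        double t0 = wtime();

        /* Fixed vector length, varying number of gangs. */
        #pragma acc parallel loop gang vector num_gangs(ng) vector_length(128)
        for (int i = 0; i < n; i++)
            x[i] = 2.0f * x[i] + 1.0f;

        printf("num_gangs = %4d : %.6f s\n", ng, wtime() - t0);
    }

    free(x);
    return 0;
}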

