Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code
I'm new to OpenCL and I want to compare the performance gain between C code and OpenCL kernels. Can someone please elaborate which of these two methods is better/correct for profiling OpenCL code when comparing performance against the C reference code:
1. Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a GetTime function):
ret |= clFinish(command_queue); // empty the queue
GetTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); // profiling disabled
ret |= clFinish(command_queue);
GetTime(&end);
g_NDRangePureExecTimeSec = Elapsed_Time(&begin, &end); // performs (end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
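GetTime/Elapsed_Time are not shown above; a minimal sketch of what such Windows helpers might look like, assuming QueryPerformanceCounter() as the time source (the helper names come from the snippet, their bodies are my assumption):

#include <windows.h>

// Hypothetical timing helpers matching the GetTime()/Elapsed_Time() calls above.
static void GetTime(LARGE_INTEGER *t)
{
    QueryPerformanceCounter(t); // raw tick count
}

static double Elapsed_Time(const LARGE_INTEGER *begin, const LARGE_INTEGER *end)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq); // ticks per second
    return (double)(end->QuadPart - begin->QuadPart) / (double)freq.QuadPart; // seconds
}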
2. Using event profiling:
ret  = clFinish(command_queue); // empty the queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); // nsec -> sec
ret |= clReleaseEvent(evt1);
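The profiling API also exposes CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT, which can show how much queueing/submission overhead method 1 captures on top of the pure device execution measured by method 2; a small sketch, reusing evt1 from above before it is released:

// Breakdown of the non-execution overhead included in method 1 but not method 2.
cl_ulong queued = 0, submitted = 0;
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitted, NULL);
printf("queued->submit: %llu ns, submit->start: %llu ns\n",
       (unsigned long long)(submitted - queued),
       (unsigned long long)(begin - submitted));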
Furthermore, I'm not using a dedicated graphics card; I'm utilizing Intel HD 4600 integrated graphics with the following piece of OpenCL code:
__kernel void filter_rows(__global float *ip_img,
                          __global float *op_img,
                          int width, int height,
                          int pitch, int n,
                          __constant float *w)
{
    __private int i = get_global_id(0);
    __private int j = get_global_id(1);
    __private float a;
    __private int image_offset = n*pitch + n;
    __private int curr_pix = j*pitch + i + image_offset;

    // apply filter
    a  = ip_img[curr_pix-8] * w[0 ];
    a += ip_img[curr_pix-7] * w[1 ];
    a += ip_img[curr_pix-6] * w[2 ];
    a += ip_img[curr_pix-5] * w[3 ];
    a += ip_img[curr_pix-4] * w[4 ];
    a += ip_img[curr_pix-3] * w[5 ];
    a += ip_img[curr_pix-2] * w[6 ];
    a += ip_img[curr_pix-1] * w[7 ];
    a += ip_img[curr_pix-0] * w[8 ];
    a += ip_img[curr_pix+1] * w[9 ];
    a += ip_img[curr_pix+2] * w[10];
    a += ip_img[curr_pix+3] * w[11];
    a += ip_img[curr_pix+4] * w[12];
    a += ip_img[curr_pix+5] * w[13];
    a += ip_img[curr_pix+6] * w[14];
    a += ip_img[curr_pix+7] * w[15];
    a += ip_img[curr_pix+8] * w[16];

    // write output
    op_img[curr_pix] = a;
}
And similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized vectorized C reference) of around 11x using method 1 and around 16x using method 2. I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
- What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will a dedicated graphics card outperform Intel HD Graphics?
- Can I map the warp and thread concepts from CUDA onto Intel HD Graphics (i.e. the number of threads executing in parallel)?
You can't directly compare the performance of devices from different vendors, but a basic comparison and expectation can be made using the number of parallel threads running multiplied by the frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EUs), each EU runs 7 hardware threads, and each thread is capable of executing SIMD8, SIMD16 or SIMD32 instructions, with each SIMD lane corresponding to one work item (WI) in OpenCL speak.
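You can sanity-check the EU count and clock at runtime with the standard clGetDeviceInfo queries (on Intel GPUs, CL_DEVICE_MAX_COMPUTE_UNITS reports EUs); a minimal sketch, assuming `device` is a cl_device_id you obtained earlier:

#include <stdio.h>
#include <CL/cl.h>

void print_device_caps(cl_device_id device)
{
    cl_uint cus = 0, mhz = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);   // 20 EUs expected on HD 4600
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz), &mhz, NULL); // in MHz
    printf("Compute units: %u, max clock: %u MHz\n", cus, mhz);
}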
SIMD16 is typical for simple kernels like the one you are trying to optimize, so we are talking about 20*7*16 = 2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should try rewriting your kernel to take advantage of them. I hope this also helps you compare against NVIDIA's offerings.
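To make the float4 suggestion concrete, here is a minimal sketch (my assumption, not a tested implementation) of the row filter where each work item computes four adjacent output pixels via vload4/vstore4; it assumes the row width is a multiple of 4 and that global_ws[0] is shrunk by a factor of 4:

__kernel void filter_rows_vec4(__global const float *ip_img,
                               __global float *op_img,
                               int width, int height,
                               int pitch, int n,
                               __constant float *w)
{
    int i = get_global_id(0) * 4;   // first of the 4 output pixels for this work item
    int j = get_global_id(1);
    int image_offset = n*pitch + n;
    int curr_pix = j*pitch + i + image_offset;

    float4 a = (float4)(0.0f);
    for (int k = -8; k <= 8; k++)   // same 17 taps as the scalar kernel
        a += vload4(0, &ip_img[curr_pix + k]) * w[k + 8]; // scalar weight broadcasts over the float4

    vstore4(a, 0, &op_img[curr_pix]);
}

The rolled loop replaces the 17 unrolled lines; compilers typically unroll such a fixed-bound loop automatically.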