c++ - How does OpenMP implement access to critical sections?
I want to read an input file (in C/C++) and process each line independently, as fast as possible. The processing itself takes a few ticks, so I decided to use OpenMP threads. I have this code:
```cpp
#pragma omp parallel num_threads(num_threads)
{
    string line;
    while (true) {
        #pragma omp critical(input)
        {
            getline(f, line);
        }
        if (f.eof())
            break;
        process_line(line);
    }
}
```
My question is: how do I determine the optimal number of threads to use? Ideally, this would be detected dynamically at runtime. I don't understand the dynamic schedule option of parallel for, so I can't tell whether it would help. Any insights?
Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought that the CPU usage reported by top would help, but it doesn't(!): in my case the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading the input, I noted that the optimal number is around 2-5; above that, the input speed becomes smaller. So my question is: why do I see a CPU usage of 850 when using 10 threads? Can this be due to an inefficiency in how OpenMP handles threads waiting to enter a critical section?
Edit: here are some timings. I obtained them with:
```shell
for ncpu in $(seq 1 20) ; do echo "ncpu=$ncpu" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; pid=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; usage=$(ps h -o "%cpu" $pid) ; kill -9 $pid ; sleep 1 ; echo "usage: $usage" >&2 ; } & cat ; } | ./my_prog -n $ncpu >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v killed ; done
```
```
ncpu=1   [8.27MB/s]  usage: 98.4
ncpu=2   [12.5MB/s]  usage: 196
ncpu=3   [18.4MB/s]  usage: 294
ncpu=4   [23.6MB/s]  usage: 393
ncpu=5   [28.9MB/s]  usage: 491
ncpu=6   [33.7MB/s]  usage: 589
ncpu=7   [37.4MB/s]  usage: 688
ncpu=8   [40.3MB/s]  usage: 785
ncpu=9   [41.9MB/s]  usage: 884
ncpu=10  [41.3MB/s]  usage: 979
ncpu=11  [41.5MB/s]  usage: 1077
ncpu=12  [42.5MB/s]  usage: 1176
ncpu=13  [41.6MB/s]  usage: 1272
ncpu=14  [42.6MB/s]  usage: 1370
ncpu=15  [41.8MB/s]  usage: 1493
ncpu=16  [40.7MB/s]  usage: 1593
ncpu=17  [40.8MB/s]  usage: 1662
ncpu=18  [39.3MB/s]  usage: 1763
ncpu=19  [38.9MB/s]  usage: 1857
ncpu=20  [37.7MB/s]  usage: 1957
```
My problem is that I can achieve 40MB/s with 785 CPU usage, but also with 1662 CPU usage. Where do those extra cycles go?
Edit2: thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather with the way OpenMP implements critical sections. My intuition is that if you have 1 thread inside the CS and 10 threads waiting to get in, then the moment the 1st thread exits the CS, the kernel should wake up exactly one other thread and let it in. The timings suggest otherwise: can it be that the threads wake up many times on their own, only to find the CS occupied? Is this an issue with the threading library or with the kernel?
"i want read input file (in c/c++) , process each line independently fast possible."
Reading the file makes your application I/O bound, so the maximum performance you can achieve for the reading portion alone is to read at the maximum disk speed (on my machine that's less than 10% of CPU time). This means that even if you could completely free the reading thread from any processing, the processing would still have to take less than the remaining CPU time (90% on my computer). If the line-processing threads take up more than the remaining CPU time, then you will not be able to keep up with the hard drive.
There are several options in that case:
- Queue up the input and let the processing threads dequeue "work" until they've caught up with the input presented so far (given that you have enough RAM to do so).
- Open enough threads and just max out your processor until all the data is read, which is your best-effort scenario.
- Throttle the reading/processing so that it doesn't take up all of the system resources (in case you're worried about UI responsiveness and/or user experience).
"...the processing takes few ticks itself, decided use openmp threads"
This is a good sign: it means that your CPU utilization is not very high. This is the part where you can optimize your performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best if you queue up each line and let the processing threads pull processing requests from the queue until there is nothing more to process. The latter is also known as the Producer/Consumer design pattern, a very common pattern in concurrent computing.
update
Why is there a difference between

- (i) each process: lock, pull data, release lock, process data; and
- (ii) one process: pull data, lock, enqueue chunk, release lock,
- all others: lock, dequeue chunk, release lock, process data?
There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be the producer (queue) and the consumer to be the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust, more reusable and provides a better abstraction for the producer. If you ever decide to use more than one "input channel," the latter case is better.
Finally (and most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) much faster than (i) in terms of getting you I/O bound. With a lock-free queue you can pull data, enqueue a chunk and dequeue a chunk without locking.