c++ - How does OpenMP implement access to critical sections?
I want to read an input file (in C/C++) and process each line independently, as fast as possible. The processing itself takes a few ticks, so I decided to use OpenMP threads. I have this code:
```cpp
#pragma omp parallel num_threads(num_threads)
{
    string line;
    while (true) {
        #pragma omp critical(input)
        {
            getline(f, line);
        }
        if (f.eof())
            break;
        process_line(line);
    }
}
```
My question is: how do I determine the optimal number of threads to use? Ideally, this would be detected dynamically at runtime. I don't understand the dynamic schedule option of parallel for, so I can't tell whether it would help. Any insights?
Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought that the CPU usage reported by top would help, but it doesn't(!): in my case the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading the input, I noted that the optimal number is around 2-5; above that, the input speed becomes smaller. So my question is: why do I see a CPU usage of 850 when using 10 threads? Can this be due to an inefficiency in how OpenMP handles threads waiting to enter a critical section?
Edit: here are some timings. I obtained them with:
```shell
for ncpu in $(seq 1 20) ; do echo "ncpu=$ncpu" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; pid=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; usage=$(ps h -o "%cpu" $pid) ; kill -9 $pid ; sleep 1 ; echo "usage: $usage" >&2 ; } & cat ; } | ./my_prog -n $ncpu >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v killed ; done
```
```
ncpu=1   [8.27MB/s]  usage: 98.4
ncpu=2   [12.5MB/s]  usage: 196
ncpu=3   [18.4MB/s]  usage: 294
ncpu=4   [23.6MB/s]  usage: 393
ncpu=5   [28.9MB/s]  usage: 491
ncpu=6   [33.7MB/s]  usage: 589
ncpu=7   [37.4MB/s]  usage: 688
ncpu=8   [40.3MB/s]  usage: 785
ncpu=9   [41.9MB/s]  usage: 884
ncpu=10  [41.3MB/s]  usage: 979
ncpu=11  [41.5MB/s]  usage: 1077
ncpu=12  [42.5MB/s]  usage: 1176
ncpu=13  [41.6MB/s]  usage: 1272
ncpu=14  [42.6MB/s]  usage: 1370
ncpu=15  [41.8MB/s]  usage: 1493
ncpu=16  [40.7MB/s]  usage: 1593
ncpu=17  [40.8MB/s]  usage: 1662
ncpu=18  [39.3MB/s]  usage: 1763
ncpu=19  [38.9MB/s]  usage: 1857
ncpu=20  [37.7MB/s]  usage: 1957
```
My problem is that I can achieve 40MB/s with 785 CPU usage, but also with 1662 CPU usage. Where do those extra cycles go?
Edit2: thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather with the way OpenMP implements critical sections. My intuition is that if you have 1 thread inside the CS and 10 threads waiting to get in, then the moment the 1st thread exits the CS, the kernel should wake up exactly one other thread and let it in. The timings suggest otherwise: can it be that the threads wake up many times on their own, only to find the CS occupied? Is this an issue with the threading library or with the kernel?
"i want read input file (in c/c++) , process each line independently fast possible."
Reading the file makes your application I/O bound, so the maximum performance you can achieve for the reading portion alone is to read at the maximum disk speed (on my machine that's less than 10% of CPU time). This means that even if you could completely free the reading thread from any processing, the processing would still have to take less than the remaining CPU time (90% on my computer). If the line-processing threads take up more than the remaining CPU time, then you will not be able to keep up with the hard drive.
There are several options in that case:
- Queue up the input and let the processing threads dequeue "work" until they've caught up with the input presented so far (given that you have enough RAM to do so).
- Open enough threads and just max out your processor until all the data is read, which is your best-effort scenario.
- Throttle the reading/processing so that it doesn't take up all of the system resources (in case you're worried about UI responsiveness and/or user experience).
"...the processing takes few ticks itself, decided use openmp threads"
This is a good sign: it means that your CPU utilization is not very high. This is the part where you can optimize your performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best if you queue up each line and let the processing threads pull processing requests from the queue until there is nothing more to process. The latter is also known as the Producer/Consumer design pattern, a very common pattern in concurrent computing.
update
Why is there a difference between

- (i) each process: lock, pull data, release lock, process data; and
- (ii) one process: pull data, lock, enqueue chunk, release lock,
- all others: lock, dequeue chunk, release lock, process data?
There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be the producer (queue) and the consumer to be the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust, more reusable and provides a better abstraction for the producer. If you ever decide to use more than one "input channel," the latter case is better.
Finally (and most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) much faster than (i) in terms of getting you I/O bound. With a lock-free queue you can pull data, enqueue a chunk and dequeue a chunk without locking.