C++ - OpenMP performance


Firstly, I know this type of question gets asked a lot, so let me preface by saying I've read as much as I can, and I still don't know what the deal is.

I've parallelized a massive outer loop. The number of loop iterations varies, typically between 20 and 150, but the loop body does a huge amount of work, calling on a lot of local, compute-intensive linear algebra routines (as in, the code is part of the source and not an external dependency). Within the loop body there are 1000+ calls to these routines, but they're totally independent of one another, so I figured it would be a prime candidate for parallelism. The loop code is C++, but it calls a lot of subroutines written in C.

The code looks like this:

    <declare and initialize shared variables here>

    #ifdef _OPENMP
    #pragma omp parallel for        \
      private(....)                 \
      shared(....)                  \
      firstprivate(....) schedule(runtime)
    #endif
    for(tst = 0; tst < ntest; tst++)
    {
        // Lots of functionality (science!)
        // Calls to other deep functions which manipulate private variables
        // Call to a function which has 1000 loop iterations doing matrix manipulation
        // No exaggeration, there are millions
        // of for-loop iterations in the body, in the various functions called.
        // Lots of mallocing and freeing
        // Generate calculated_values

        shared_array1[tst] = calculated_value1;
        shared_array2[tst] = calculated_value2;
        shared_array3[tst] = calculated_value3;

    } // end of parallel for

    // final tidy up

There shouldn't, I believe, be any synchronization at all - the only time the threads access a shared variable is via the shared_arrays, and they access unique points in those arrays, indexed by tst.

The thing is, when I increase the number of threads (on a multicore cluster!) the speeds we're seeing (where we invoke the loop 5 times) are as follows:

                  Elapsed time   System time
    Serial:        188.149          1.031
    2 thrds:       148.542          6.788
    4 thrds:       309.586        424.037       # what?
    8 thrds:       230.290        568.166
    16 thrds:      219.133        799.780

Things that may be noticeable are the massive jump in system time between 2 and 4 threads, and the fact that the elapsed time doubles as we move from 2 to 4, and then slowly decreases.

I've tried a huge range of OMP_SCHEDULE parameters with no luck. Could this be related to the fact that each thread is using malloc/new and free/delete a lot? This has consistently been run with 8 GB of memory - but I'm guessing that's not the problem. Frankly, the huge rise in system time makes it look like the threads might be blocking, but I have no idea why that would happen.

Update 1: I thought false sharing was going to be the problem, so I re-wrote the code so that the loops store their calculated values in thread-local arrays, and then copy these arrays across to the shared arrays at the end. Sadly this didn't have any impact, though I almost don't believe it myself.

Following @cmeerw's advice, I ran strace -f, and after the initialization there are just millions of lines of

    [pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 58065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 57684] <... futex resumed> )       = 0
    [pid 58067] <... futex resumed> )       = 0
    [pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58066] <... futex resumed> )       = 0
    [pid 57684] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 58065] <... futex resumed> )       = 0
    [pid 58067] <... futex resumed> )       = 0
    [pid 57684] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58065] <... futex resumed> )       = 0
    [pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 57684] <... futex resumed> )       = 0
    [pid 58067] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 58066] <... futex resumed> )       = 0
    [pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58067] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58065] <... futex resumed> )       = 0
    [pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 57684] <... futex resumed> )       = 0
    [pid 58067] <... futex resumed> )       = 0
    [pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58065] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 58066] <... futex resumed> )       = 0
    [pid 58065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 58067] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
    [pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 57684] <... futex resumed> )       = 0

Does anyone have any ideas what this means? It looks like the threads are context switching way too often, or just blocking and unblocking? When I strace the same implementation with OMP_NUM_THREADS set to 0, I get none of this at all. For comparison, the logfile generated when 1 thread is used is 486 KB, and the log file generated when 4 threads are used is 266 MB.

In other words, the parallel version invokes an extra 4170104 lines of log file...

Update 2

As suggested by Tom, I tried binding threads to specific processors, to no avail. We're on OpenMP 3.1, so I set the environment variable using export OMP_PROC_BIND=true. Same size logfile and same timeframe.

Update 3

The plot thickens. Having only profiled on the cluster so far, I installed GNU GCC 4.7 via Macports and compiled (with OpenMP) on my Macbook for the first time (Apple's GCC 4.2.1 throws a compiler bug when OpenMP is enabled, which is why I hadn't compiled and run it in parallel locally until now). On the Macbook, you see basically the trend you'd expect:

                    C-code time
    Serial:         ~34 seconds
    2 thrds:        ~21 seconds
    4 thrds:        ~14 seconds
    8 thrds:        ~12 seconds
    16 thrds:        ~9 seconds

We see diminishing returns towards the ends, though this is hardly surprising, as a couple of the data sets we're iterating over on this test data have fewer than 16 members (so we're generating 16 threads for, say, a for-loop with 7 iterations).

So, the question remains - why does the cluster's performance degrade so badly? I'm going to try it on a different quadcore Linux box tonight. The cluster compiles with GNU GCC 4.6.3, but I can't believe that in itself is going to make such a difference?

Neither ltrace nor gdb are installed on the cluster (and I can't get them on there, for various reasons). If my Linux box gives cluster-like performance, I'll run the corresponding ltrace analysis there.

Update 4

Oh my. I dual booted my Macbook Pro into Ubuntu (12.04) and re-ran the code. It all runs (which is somewhat reassuring), but I see the same, weird bad-performance behavior I see on the clusters, and the same run of millions of futex calls. Given that the only difference between my local machine in Ubuntu and in OS X is software (and I'm using the same compiler and libraries - presumably there aren't different glibc implementations for OS X and Ubuntu!), I'm now wondering if this is something to do with how Linux schedules/distributes threads. In any case, being on my local machine makes everything a million times easier, so I'm going to go ahead and ltrace -f it and see what I can find. I wrote a workaround for the clusters which forks() off a separate process, and that gives a perfect 1/2 in runtime, so it's definitely possible to get the parallelism going...

So after some extensive profiling (thanks to this great post for the info on gprof and time sampling with gdb), which involved writing a big wrapper function to generate production-level code for profiling, it became obvious that the vast majority of the time, when I aborted the running code with gdb and ran backtrace, the stack was in an STL <vector> call, manipulating a vector in some way.

The code passes a few vectors into the parallel section as private variables, which seemed to work fine. However, after pulling out all the vectors and replacing them with arrays (and some other jiggery-pokery to make that work) I saw a significant speed-up. With the small, artificial data sets the speed-up is near perfect (i.e. as you double the number of threads you halve the time), while with the real data sets the speed-up isn't quite as good, but this makes complete sense in the context of how the code works.

It seems that, for whatever reason (maybe some static or global variables deep in the STL <vector> implementation?), when there are loops moving through hundreds of thousands of iterations in parallel there is some deep-level locking, which occurs in Linux (Ubuntu 12.04 and CentOS 6.2) but not in OS X.

I'm really intrigued as to why I see this difference. Could it be a difference in how the STL is implemented (the OS X version was compiled under GNU GCC 4.7, as were the Linux ones), or is this to do with context switching (as suggested by Arne Babenhauserheide)?

In summary, the debugging process I followed:

  • Initial profiling from within R to identify the issue

  • Ensured there were no static variables acting as shared variables

  • Profiled with strace -f and ltrace -f, which was really helpful in identifying locking as the culprit

  • Profiled with valgrind to look for errors

  • Tried a variety of combinations of schedule types (auto, guided, static, dynamic) and chunk sizes

  • Tried binding threads to specific processors

  • Avoided false sharing by creating thread-local buffers for values, and implementing a single synchronization event at the end of the for-loop

  • Removed all the mallocing and freeing from within the parallel region - didn't help with the issue but did provide a small general speedup

  • Tried on various architectures and OSes - didn't really help in the end, but did show that this was a Linux vs. OS X issue and not a supercomputer vs. desktop one

  • Built a version which implements concurrency using a fork() call - having the workload split between two processes. This halved the time on both OS X and Linux, which was good

  • Built a data simulator to replicate production data loads

  • gprof profiling

  • gdb time sampling profiling (abort and backtrace)

  • Commented out vector operations

  • Had that not worked, Arne Babenhauserheide's link looks like it may well have crucial stuff on memory fragmentation issues with OpenMP

