performance - Why is std::pair faster than std::tuple?


Here is the code I used for testing.

Tuple test:

```cpp
#include <tuple>
#include <vector>

using namespace std;

int main() {
    vector<tuple<int,int>> v;
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}
```

Pair test:

```cpp
#include <utility>
#include <vector>

using namespace std;

int main() {
    vector<pair<int,int>> v;
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}
```

I did the time measurement via the Linux `time` command. The results are:

|       |  -O0   |  -O2   |
|:------|:------:|:------:|
| pair  |  8.9 s | 1.60 s |
| tuple | 19.8 s | 1.96 s |

I am wondering why there is such a big difference between the two data structures at -O0, since they should be similar. There is only a small difference at -O2.

Why is the difference at -O0 so big, and why is there a difference at all?

Edit:

The same code with v.resize():

Pair:

```cpp
#include <utility>
#include <vector>

using namespace std;

int main() {
    vector<pair<int,int>> v;
    v.resize(100000000);
    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_pair(var, var);
    }
}
```

Tuple:

```cpp
#include <tuple>
#include <vector>

using namespace std;

int main() {
    vector<tuple<int,int>> v;
    v.resize(100000000);
    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_tuple(var, var);
    }
}
```

Results:

|       |  -O0   |  -O2   |
|:------|:------:|:------:|
| pair  | 5.01 s | 0.77 s |
| tuple | 10.6 s | 0.87 s |

Edit:

My system:

```
g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
GLIBCXX_3.4.19
```

You are missing crucial information: which compiler do you use? What do you use to measure the performance of the microbenchmark? Which standard library implementation do you use?

My system:

```
g++ (GCC) 4.9.1 20140903 (prerelease)
GLIBCXX_3.4.20
```

Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory allocation overhead. With that, funnily enough, I observe something interesting - the reverse of what you see.
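A minimal sketch of that modification (my own reconstruction; the answer does not show the exact code, so assume it is simply a reserve() call before the push_back loop):

```cpp
#include <utility>
#include <vector>

using namespace std;

int main() {
    vector<pair<int,int>> v;
    v.reserve(100000000);  // pre-allocate once, so the loop causes no reallocations
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}
```

The tuple variant is identical with tuple/make_tuple. With that change, the numbers for the pair version are: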

```
g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair

 Performance counter stats for './pair' (10 runs):

       1647.045151      task-clock:HG (msec)       #    0.993 CPUs utilized            ( +-  1.94% )
               346      context-switches:HG        #    0.210 K/sec                    ( +- 40.13% )
                 7      cpu-migrations:HG          #    0.004 K/sec                    ( +- 22.01% )
           182,978      page-faults:HG             #    0.111 M/sec                    ( +-  0.04% )
     3,394,685,602      cycles:HG                  #    2.061 GHz                      ( +-  2.24% ) [44.38%]
     2,478,474,676      stalled-cycles-frontend:HG #   73.01% frontend cycles idle     ( +-  1.24% ) [44.55%]
     1,550,747,174      stalled-cycles-backend:HG  #   45.68% backend  cycles idle     ( +-  1.60% ) [44.66%]
     2,837,484,461      instructions:HG            #    0.84  insns per cycle
                                                   #    0.87  stalled cycles per insn  ( +-  4.86% ) [55.78%]
       526,077,681      branches:HG                #  319.407 M/sec                    ( +-  4.52% ) [55.82%]
           829,623      branch-misses:HG           #    0.16% of all branches          ( +-  4.42% ) [55.74%]
       594,396,822      L1-dcache-loads:HG         #  360.887 M/sec                    ( +-  4.74% ) [55.59%]
        20,842,113      L1-dcache-load-misses:HG   #    3.51% of all L1-dcache hits    ( +-  0.68% ) [55.46%]
         5,474,166      LLC-loads:HG               #    3.324 M/sec                    ( +-  1.81% ) [44.23%]
   <not supported>      LLC-load-misses:HG

       1.658671368 seconds time elapsed                                          ( +-  1.82% )
```

Versus:

```
g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple

 Performance counter stats for './tuple' (10 runs):

        996.090514      task-clock:HG (msec)       #    0.996 CPUs utilized            ( +-  2.41% )
               102      context-switches:HG        #    0.102 K/sec                    ( +- 64.61% )
                 4      cpu-migrations:HG          #    0.004 K/sec                    ( +- 32.24% )
           181,701      page-faults:HG             #    0.182 M/sec                    ( +-  0.06% )
     2,052,505,223      cycles:HG                  #    2.061 GHz                      ( +-  2.22% ) [44.45%]
     1,212,930,513      stalled-cycles-frontend:HG #   59.10% frontend cycles idle     ( +-  2.94% ) [44.56%]
       621,104,447      stalled-cycles-backend:HG  #   30.26% backend  cycles idle     ( +-  3.48% ) [44.69%]
     2,700,410,991      instructions:HG            #    1.32  insns per cycle
                                                   #    0.45  stalled cycles per insn  ( +-  1.66% ) [55.94%]
       486,476,408      branches:HG                #  488.386 M/sec                    ( +-  1.70% ) [55.96%]
           959,651      branch-misses:HG           #    0.20% of all branches          ( +-  4.78% ) [55.82%]
       547,000,119      L1-dcache-loads:HG         #  549.147 M/sec                    ( +-  2.19% ) [55.67%]
        21,540,926      L1-dcache-load-misses:HG   #    3.94% of all L1-dcache hits    ( +-  2.73% ) [55.43%]
         5,751,650      LLC-loads:HG               #    5.774 M/sec                    ( +-  3.60% ) [44.21%]
   <not supported>      LLC-load-misses:HG

       1.000126894 seconds time elapsed                                          ( +-  2.47% )
```

As you can see, in this case the reason is the higher number of stalled cycles, both in the frontend and in the backend.

Now where does this come from? I bet it comes down to failed inlining, similar to what is explained here: std::vector performance regression when enabling c++11
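To illustrate why there is more for the inliner to chew on with tuple than with pair, here is a simplified sketch (my own illustration, not the actual libstdc++ source): a recursively defined tuple puts one nested layer, and therefore one more call to inline, behind every element, while pair is just two flat members.

```cpp
// Simplified stand-ins, not the real std::pair / std::tuple.
struct my_pair {
    int first, second;                  // two flat members; assignment is two plain stores
};

template <typename... Ts>
struct my_tuple;                        // recursive layout: element i sits i layers deep

template <>
struct my_tuple<> {};

template <typename Head, typename... Tail>
struct my_tuple<Head, Tail...> {
    Head head;
    my_tuple<Tail...> tail;

    my_tuple& operator=(const my_tuple& other) {
        head = other.head;              // one assignment...
        tail = other.tail;              // ...plus a recursive operator= call per remaining element
        return *this;
    }
};

int main() {
    my_pair p{1, 2}, q{};
    q = p;                              // two stores, even at -O0

    my_tuple<int, int> t{1, {2, {}}}, u{};
    u = t;                              // at -O0 each nested operator= stays a real call;
                                        // at -O2 (and across TUs with -flto) they can inline away
}
```

At -O0 nothing gets inlined at all, which would be consistent with the tuple version being roughly twice as slow in the original measurements.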

Indeed, enabling -flto equalizes the results for me:

```
 Performance counter stats for './pair' (10 runs):

       1021.922944      task-clock:HG (msec)       #    0.997 CPUs utilized            ( +-  1.15% )
                63      context-switches:HG        #    0.062 K/sec                    ( +- 77.23% )
                 5      cpu-migrations:HG          #    0.005 K/sec                    ( +- 34.21% )
           195,396      page-faults:HG             #    0.191 M/sec                    ( +-  0.00% )
     2,109,877,147      cycles:HG                  #    2.065 GHz                      ( +-  0.92% ) [44.33%]
     1,098,031,078      stalled-cycles-frontend:HG #   52.04% frontend cycles idle     ( +-  0.93% ) [44.46%]
       701,553,535      stalled-cycles-backend:HG  #   33.25% backend  cycles idle     ( +-  1.09% ) [44.68%]
     3,288,420,630      instructions:HG            #    1.56  insns per cycle
                                                   #    0.33  stalled cycles per insn  ( +-  0.88% ) [55.89%]
       672,941,736      branches:HG                #  658.505 M/sec                    ( +-  0.80% ) [56.00%]
           660,278      branch-misses:HG           #    0.10% of all branches          ( +-  2.05% ) [55.93%]
       474,314,267      L1-dcache-loads:HG         #  464.139 M/sec                    ( +-  1.32% ) [55.73%]
        19,481,787      L1-dcache-load-misses:HG   #    4.11% of all L1-dcache hits    ( +-  0.80% ) [55.51%]
         5,155,678      LLC-loads:HG               #    5.045 M/sec                    ( +-  1.69% ) [44.21%]
   <not supported>      LLC-load-misses:HG

       1.025083895 seconds time elapsed                                          ( +-  1.03% )
```

And for tuple:

```
 Performance counter stats for './tuple' (10 runs):

       1018.980969      task-clock:HG (msec)       #    0.999 CPUs utilized            ( +-  0.47% )
                 8      context-switches:HG        #    0.008 K/sec                    ( +- 29.74% )
                 3      cpu-migrations:HG          #    0.003 K/sec                    ( +- 42.64% )
           195,396      page-faults:HG             #    0.192 M/sec                    ( +-  0.00% )
     2,103,574,740      cycles:HG                  #    2.064 GHz                      ( +-  0.30% ) [44.28%]
     1,088,827,212      stalled-cycles-frontend:HG #   51.76% frontend cycles idle     ( +-  0.47% ) [44.56%]
       697,438,071      stalled-cycles-backend:HG  #   33.15% backend  cycles idle     ( +-  0.41% ) [44.76%]
     3,305,631,646      instructions:HG            #    1.57  insns per cycle
                                                   #    0.33  stalled cycles per insn  ( +-  0.21% ) [55.94%]
       675,175,757      branches:HG                #  662.599 M/sec                    ( +-  0.16% ) [56.02%]
           656,205      branch-misses:HG           #    0.10% of all branches          ( +-  0.98% ) [55.93%]
       475,532,976      L1-dcache-loads:HG         #  466.675 M/sec                    ( +-  0.13% ) [55.69%]
        19,430,992      L1-dcache-load-misses:HG   #    4.09% of all L1-dcache hits    ( +-  0.20% ) [55.49%]
         5,161,624      LLC-loads:HG               #    5.065 M/sec                    ( +-  0.47% ) [44.14%]
   <not supported>      LLC-load-misses:HG

       1.020225388 seconds time elapsed                                          ( +-  0.48% )
```

So remember: -flto is your friend, and failed inlining can have extreme results on heavily templated code. Use perf stat to find out what is happening.

