Why is std::pair faster than std::tuple?
Here is the code I used for testing.
Tuple test:

```cpp
#include <tuple>
#include <vector>
using namespace std;

int main() {
    vector<tuple<int, int>> v;
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}
```
Pair test:

```cpp
#include <utility>
#include <vector>
using namespace std;

int main() {
    vector<pair<int, int>> v;
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}
```
I measured the time via the Linux time command. The results are:
|       | -O0    | -O2    |
|:------|:------:|:------:|
| pair  | 8.9 s  | 1.60 s |
| tuple | 19.8 s | 1.96 s |
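The exact invocation is not shown above, only that the Linux time command was used; a plausible sketch (the optimization flags and file names are assumptions) would be:

```
g++ -std=c++11 -O0 tuple.cpp -o tuple   # assumed flags; likewise with -O2, and with pair.cpp
time ./tuple
```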
I am wondering why there is such a big difference between these two data structures at -O0; I would expect them to be similar. There is only a small difference at -O2. Why is the difference at -O0 so big, and why is there any difference at all?
Edit: here is the code using v.resize():
Pair:

```cpp
#include <utility>
#include <vector>
using namespace std;

int main() {
    vector<pair<int, int>> v;
    v.resize(100000000);
    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_pair(var, var);
    }
}
```
Tuple:

```cpp
#include <tuple>
#include <vector>
using namespace std;

int main() {
    vector<tuple<int, int>> v;
    v.resize(100000000);
    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_tuple(var, var);
    }
}
```
Results:

|       | -O0    | -O2    |
|:------|:------:|:------:|
| pair  | 5.01 s | 0.77 s |
| tuple | 10.6 s | 0.87 s |
Edit: my system:

g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7), GLIBCXX_3.4.19
Answer:

You are missing crucial information: which compiler do you use? What do you use to measure the performance of the microbenchmark? Which standard library implementation do you use?
My system:

g++ (GCC) 4.9.1 20140903 (prerelease), GLIBCXX_3.4.20
Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory allocation overhead. With that, I funnily observe something interesting - the reverse of what you see (a sketch of the harness is shown below).
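The exact harness is not shown above, so the following is only a minimal sketch of how the pair version could look with the allocation pulled out of the measured loop via reserve(); the tuple version would be analogous with <tuple> and make_tuple, and the iteration count matches the question.

```cpp
#include <utility>
#include <vector>
using namespace std;

int main() {
    vector<pair<int, int>> v;
    // Reserve the final size up front so no reallocation happens inside the loop.
    v.reserve(100000000);
    for (int var = 0; var < 100000000; ++var) {
        // The measured work is now just constructing and appending the element.
        v.push_back(make_pair(var, var));
    }
}
```

With that, perf stat reports the following for the pair version: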
```
g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair

Performance counter stats for './pair' (10 runs):

     1647.045151 task-clock:hg (msec)         #    0.993 cpus utilized            ( +-  1.94% )
             346 context-switches:hg          #    0.210 k/sec                    ( +- 40.13% )
               7 cpu-migrations:hg            #    0.004 k/sec                    ( +- 22.01% )
         182,978 page-faults:hg               #    0.111 m/sec                    ( +-  0.04% )
   3,394,685,602 cycles:hg                    #    2.061 ghz                      ( +-  2.24% ) [44.38%]
   2,478,474,676 stalled-cycles-frontend:hg   #   73.01% frontend cycles idle     ( +-  1.24% ) [44.55%]
   1,550,747,174 stalled-cycles-backend:hg    #   45.68% backend cycles idle      ( +-  1.60% ) [44.66%]
   2,837,484,461 instructions:hg              #    0.84 insns per cycle
                                              #    0.87 stalled cycles per insn   ( +-  4.86% ) [55.78%]
     526,077,681 branches:hg                  #  319.407 m/sec                    ( +-  4.52% ) [55.82%]
         829,623 branch-misses:hg             #    0.16% of branches              ( +-  4.42% ) [55.74%]
     594,396,822 l1-dcache-loads:hg           #  360.887 m/sec                    ( +-  4.74% ) [55.59%]
      20,842,113 l1-dcache-load-misses:hg     #    3.51% of l1-dcache hits        ( +-  0.68% ) [55.46%]
       5,474,166 llc-loads:hg                 #    3.324 m/sec                    ( +-  1.81% ) [44.23%]
 <not supported> llc-load-misses:hg

     1.658671368 seconds time elapsed                                             ( +-  1.82% )
```
versus:
```
g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple

Performance counter stats for './tuple' (10 runs):

      996.090514 task-clock:hg (msec)         #    0.996 cpus utilized            ( +-  2.41% )
             102 context-switches:hg          #    0.102 k/sec                    ( +- 64.61% )
               4 cpu-migrations:hg            #    0.004 k/sec                    ( +- 32.24% )
         181,701 page-faults:hg               #    0.182 m/sec                    ( +-  0.06% )
   2,052,505,223 cycles:hg                    #    2.061 ghz                      ( +-  2.22% ) [44.45%]
   1,212,930,513 stalled-cycles-frontend:hg   #   59.10% frontend cycles idle     ( +-  2.94% ) [44.56%]
     621,104,447 stalled-cycles-backend:hg    #   30.26% backend cycles idle      ( +-  3.48% ) [44.69%]
   2,700,410,991 instructions:hg              #    1.32 insns per cycle
                                              #    0.45 stalled cycles per insn   ( +-  1.66% ) [55.94%]
     486,476,408 branches:hg                  #  488.386 m/sec                    ( +-  1.70% ) [55.96%]
         959,651 branch-misses:hg             #    0.20% of branches              ( +-  4.78% ) [55.82%]
     547,000,119 l1-dcache-loads:hg           #  549.147 m/sec                    ( +-  2.19% ) [55.67%]
      21,540,926 l1-dcache-load-misses:hg     #    3.94% of l1-dcache hits        ( +-  2.73% ) [55.43%]
       5,751,650 llc-loads:hg                 #    5.774 m/sec                    ( +-  3.60% ) [44.21%]
 <not supported> llc-load-misses:hg

     1.000126894 seconds time elapsed                                             ( +-  2.47% )
```
As you can see, in this case the reason is the higher number of stalled cycles, both in the frontend and in the backend.
Now where does this come from? I bet it comes down to failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11
Indeed, enabling -flto equalizes the results for me (the compile lines are sketched below).
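Only the -flto flag itself is mentioned above, so the exact invocations below are an assumption, mirroring the earlier commands with link-time optimization added:

```
g++ -std=c++11 -O2 -flto pair.cpp -o pair     # assumed: same flags as before, plus -flto
g++ -std=c++11 -O2 -flto tuple.cpp -o tuple
perf stat -r 10 -d ./pair                     # measured the same way as above
```

With that, the pair version gives: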
```
Performance counter stats for './pair' (10 runs):

     1021.922944 task-clock:hg (msec)         #    0.997 cpus utilized            ( +-  1.15% )
              63 context-switches:hg          #    0.062 k/sec                    ( +- 77.23% )
               5 cpu-migrations:hg            #    0.005 k/sec                    ( +- 34.21% )
         195,396 page-faults:hg               #    0.191 m/sec                    ( +-  0.00% )
   2,109,877,147 cycles:hg                    #    2.065 ghz                      ( +-  0.92% ) [44.33%]
   1,098,031,078 stalled-cycles-frontend:hg   #   52.04% frontend cycles idle     ( +-  0.93% ) [44.46%]
     701,553,535 stalled-cycles-backend:hg    #   33.25% backend cycles idle      ( +-  1.09% ) [44.68%]
   3,288,420,630 instructions:hg              #    1.56 insns per cycle
                                              #    0.33 stalled cycles per insn   ( +-  0.88% ) [55.89%]
     672,941,736 branches:hg                  #  658.505 m/sec                    ( +-  0.80% ) [56.00%]
         660,278 branch-misses:hg             #    0.10% of branches              ( +-  2.05% ) [55.93%]
     474,314,267 l1-dcache-loads:hg           #  464.139 m/sec                    ( +-  1.32% ) [55.73%]
      19,481,787 l1-dcache-load-misses:hg     #    4.11% of l1-dcache hits        ( +-  0.80% ) [55.51%]
       5,155,678 llc-loads:hg                 #    5.045 m/sec                    ( +-  1.69% ) [44.21%]
 <not supported> llc-load-misses:hg

     1.025083895 seconds time elapsed                                             ( +-  1.03% )
```
And tuple:
```
Performance counter stats for './tuple' (10 runs):

     1018.980969 task-clock:hg (msec)         #    0.999 cpus utilized            ( +-  0.47% )
               8 context-switches:hg          #    0.008 k/sec                    ( +- 29.74% )
               3 cpu-migrations:hg            #    0.003 k/sec                    ( +- 42.64% )
         195,396 page-faults:hg               #    0.192 m/sec                    ( +-  0.00% )
   2,103,574,740 cycles:hg                    #    2.064 ghz                      ( +-  0.30% ) [44.28%]
   1,088,827,212 stalled-cycles-frontend:hg   #   51.76% frontend cycles idle     ( +-  0.47% ) [44.56%]
     697,438,071 stalled-cycles-backend:hg    #   33.15% backend cycles idle      ( +-  0.41% ) [44.76%]
   3,305,631,646 instructions:hg              #    1.57 insns per cycle
                                              #    0.33 stalled cycles per insn   ( +-  0.21% ) [55.94%]
     675,175,757 branches:hg                  #  662.599 m/sec                    ( +-  0.16% ) [56.02%]
         656,205 branch-misses:hg             #    0.10% of branches              ( +-  0.98% ) [55.93%]
     475,532,976 l1-dcache-loads:hg           #  466.675 m/sec                    ( +-  0.13% ) [55.69%]
      19,430,992 l1-dcache-load-misses:hg     #    4.09% of l1-dcache hits        ( +-  0.20% ) [55.49%]
       5,161,624 llc-loads:hg                 #    5.065 m/sec                    ( +-  0.47% ) [44.14%]
 <not supported> llc-load-misses:hg

     1.020225388 seconds time elapsed                                             ( +-  0.48% )
```
So remember: -flto is your friend, and failed inlining can have extreme results on heavily templated code. Use perf stat to find out what is happening.