Monday, February 21, 2011

Choosing a Compiler

 

 http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html

If you visit this site, you'll find the following article:

Benchmarking Intel C++ to GNU g++

by Scott Robert Ladd
17 December 2002
This article is obsolete and has been superseded. The page you're viewing exists as a historical record only, and does not reflect the current state of the art in Linux compilers. The new article is part of an ongoing effort to track the quality of programming tools for Linux.

22 August 2004: In the last 18 months, this article has become somewhat obsolete, given new releases of both Intel C++ and the GCC compilers. While the conclusions largely hold true, the gap between the compilers is narrowing. I am working on an update to this article, and hope to publish it "soon."

17 December 2002: I added a sentence to the conclusions, replacing an accidentally-deleted comment in which I praise gcc's cross-platform capabilities.

15 December 2002: Preliminary Pentium 4 numbers are here! I've extended the tables, keeping the Pentium III numbers for comparison purposes. You'll find specifications for both test systems below. I'm likely to update this article after the first of the year, based on feedback from readers and two new "real world" benchmarks that I'm getting ready.

The article has been expanded, and I've fixed some inconsistencies in the text. In particular, my conclusions changed in light of the Pentium 4 tests. To get the whole story, read the entire article!

Acknowledgements: Thank you (in no particular order) to Claudius Link, Jan Hubicka, Kelley R. Cook, Mark Mitchell, Richard Henderson, Tim Price, Christian Volger, and several anonymous respondents for your input on this article. If I've missed anyone, please forgive me. Mark Rutherford was very helpful in getting Tycho working.

Once more, unto the breach, dear friends, once more...
Henry V (as voiced by Shakespeare) may have been talking about war, but the sentiment seems apropos when producing a set of benchmarks. In theory, benchmarks should provide clear, unequivocal information that guides people in making choices about software and hardware. Reality falls short of that ideal: benchmarks are subjective, prone to interpretation, and rarely show a clear picture. Benchmarking is always a tricky business, especially when it comes to compilers. A reviewer selects a limited suite of benchmarks that demonstrate specific aspects of code generation, then predicts general compiler performance from that limited data set. Not terribly scientific, to be sure, and open to second-guessing by the author's audience.
So why do benchmarks at all? Because we can still learn something about the relative performance of different tools, by comparing their performance in a controlled environment. Benchmarks are guidelines, not absolute answers. And to be valid, benchmark source code must be available, and the testing conditions clearly stated.
With each new version of gcc, the GNU team adds many features, ranging from improved compliance with Standards to new code generation options. Intel's latest compiler supports several options for parallel programming that deserve further investigation. A compiler is more than the sum of its benchmarks, and I want to create an accurate and complete comparison of the two products.
As mentioned in my earlier articles, the Intel C++ compiler is available with a non-commercial license, meaning that anyone can download and use the full compiler for non-profit work. You can find out more about the license in this article. The Intel non-commercial license is not the same thing as the GNU Public License (GPL); Intel C++ is not "free" software -- however, it is a tool that can be used for developing free software and working on non-commercial projects.
Choosing the Benchmarks
In the case of a compiler, code generation benchmarks give us an empirical comparison of products that serves to guide our choice of tools. If I'm developing a number-crunching application, I appreciate knowing that compiler "A" produces faster code than compiler "B". In my experience (which goes back some dozen years), benchmarking serves as a guide, a filter that shows trends and identifies areas of concern.
A web-based set of benchmarks is superior to what can be accomplished in a magazine article. I can correct any egregious errors I might make, updating articles to reflect the views of vendors and users.
You can download the complete benchmark source code here. The archive contains C++ source code and bash scripts for duplicating my tests.
Benchmark Systems

                   Copernicus                Tycho
    Motherboard    IBM 6889                  Intel D850EMVRL
    Processors     dual Pentium III 600MHz   single Pentium 4 2.8GHz (HyperThreading enabled)
    RAM            384MB                     256MB (PC800)
    FSB            100MHz                    533MHz
    Hard drive     9GB SCSI                  90GB ATA/100
    Linux distro   Slackware 8.0, modified   Debian "sid" (unstable)
    Linux kernel   2.2.19 SMP                2.5.51 SMP
    glibc version  2.2.5                     2.3.1
    binutils       2.13.1                    2.13.1

Test Systems
I used two Linux systems for these tests, as shown in the table above. While Tycho has hyperthreading enabled (and yes, it is working with the 2.5.51 kernel), I did not use the Intel C++ -parallel option to auto-parallelize loops. In a future article, I want to thoroughly address parallel code, hyperthreading, and related issues.
Astute readers will note that Tycho is running a "development" kernel and the unstable version of Debian GNU/Linux. It's a long story that involves hyperthreading, nVidia drivers, and other bizarre circumstances beyond the scope of this review. Once the scars have healed, I might write an article about how Tycho came into being. Maybe. ;)
I ran all tests from the bash prompt, in text mode, no X server, with a minimum of daemons present. If you see any obvious errors or problems in my selections of options, please let me know.
Compiler Options
At the suggestion of RedHat's Richard Henderson, I tried gcc with both the -O2 and -O3 options; -O3 turns on automatic function inlining, which Richard suggested could bloat code and actually slow it down. In general, I wasn't interested in code size, and I found that -O3 produced faster code in every instance. Richard didn't suggest using any other options; therefore, I took his recommendation and the gcc documentation at face value, using -O3 to optimize gcc's output. I tried several other options suggested by various e-mail and mailing list correspondents, including -malign-double, -fprefetch-loop-arrays, and -fstrict-aliasing; none of those improved performance on the benchmark code.
I chose code generation switches that served similar purposes in both compilers; for example, Intel uses non-IEEE 754 "fast math" by default, while gcc only does so when given the -ffast-math option. The options used for compiling were (a sample driver script follows the list):
  • for gcc:
    gcc -O3 -funroll-all-loops -fomit-frame-pointer -ffast-math -march=pentium3 -mfpmath=sse
    gcc -O3 -funroll-all-loops -finline-limit=800 -fomit-frame-pointer -ffast-math -march=pentium4 -mfpmath=sse
  • for Intel C++ 6.0:
    icc -O3 -axK -ipo
  • for Intel C++ 7.0:
    icc -O3 -i_dynamic -xK -ipo -march=pentiumiii -mcpu=pentiumpro
    icc -O3 -i_dynamic -xW -ipo -march=pentium4 -mcpu=pentium4
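To make the setup concrete, here is a minimal sketch of a driver along the lines of the bash scripts mentioned above (my own illustration, not Ladd's actual scripts; the source and binary names are placeholders):

    #!/bin/bash
    # Hypothetical driver: build one benchmark with each compiler,
    # using the Pentium III option sets listed above, then time both.
    gcc -O3 -funroll-all-loops -fomit-frame-pointer -ffast-math \
        -march=pentium3 -mfpmath=sse scimark.c -o scimark_gcc -lm
    icc -O3 -i_dynamic -xK -ipo -march=pentiumiii -mcpu=pentiumpro \
        scimark.c -o scimark_icc
    for bin in scimark_gcc scimark_icc; do
        echo "== $bin =="
        time ./$bin
    done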
And so, with those disclaimers, on to the benchmarks!
SciMark 2.0

                   Pentium III                                   Pentium 4
                   g++      g++      g++      Intel     Intel    g++      Intel
                   3.0.4    3.1      3.2.1    C++ 6.0   C++ 7.0  3.2.1    C++ 7.0
Composite Score    108.8    104.7    104.7    122.3     117.7    584.5    717.2
FFT                97.8     97.8     96.9     92.8      92.0     329.1    348.8
SOR                202.4    177.8    177.1    213.2     195.0    428.2    653.2
MonteCarlo         35.1     37.7     37.6     91.6      92.0     110.9    513.8
Sparse Matrix      103.0    102.7    104.3    104.7     101.7    856.7    835.8
LU                 105.7    107.7    107.7    109.0     108.2    1197.7   1235.6

SciMark 2.0
SciMark 2.0 is a C benchmark created by Roldan Pozo and Bruce Miller at the U.S. National Institute of Standards and Technology. Originally written in Java for the purpose of comparing Java virtual machine performance, the suite was translated into ANSI C for use as a performance benchmark. Bigger numbers mean faster code, as this benchmark reports results in MFLOPS (millions of floating-point operations per second).
SciMark measures the performance of number-crunching code used in "typical" scientific and engineering applications. It consists of five computational kernels: a Fast Fourier Transform, a Gauss-Seidel relaxation, a sparse matrix-multiply, a Monte Carlo integration, and a dense LU factorization. The code is straight ANSI C, without any abstractions or the use of C++ features. I've found this benchmark reflects the performance I can expect in my own numerical applications.
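To give a feel for the kind of code being timed, here is a minimal sketch of a Gauss-Seidel relaxation loop in the spirit of the SOR kernel (my own illustration, not the NIST source; the grid size and iteration count are arbitrary):

    #include <cstdio>
    #include <vector>

    int main()
    {
        const int N = 100, iterations = 100;
        const double omega = 1.25;               // over-relaxation factor
        std::vector<double> g(N * N, 1.0);       // N x N grid, row-major

        for (int it = 0; it < iterations; ++it)
            for (int i = 1; i < N - 1; ++i)
                for (int j = 1; j < N - 1; ++j)
                    g[i * N + j] =
                        omega * 0.25 * (g[(i - 1) * N + j] + g[(i + 1) * N + j] +
                                        g[i * N + j - 1]   + g[i * N + j + 1])
                        + (1.0 - omega) * g[i * N + j];

        std::printf("g[center] = %f\n", g[(N / 2) * N + N / 2]);
        return 0;
    }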
On the Pentium III, gcc and Intel run very close together. The Pentium 4 tests, however, show a trend that will continue throughout the rest of these tests: Intel produces faster code on almost every test, about 20% faster overall. Only on the Sparse Matrix test did gcc generate the fastest code.
MazeBench

                     Pentium III                                   Pentium 4
                     g++      g++      g++      Intel     Intel    g++      Intel
                     3.0.4    3.1      3.2.1    C++ 6.0   C++ 7.0  3.2.1    C++ 7.0
run time (seconds)   6.2      2.8      2.8      3.0       3.0      0.8      0.8
MazeBench
This benchmark is based on my Algorithmic Conjurings column on maze generation. I wrote a very simple test that generates a 1000-by-1000 (one million) cell maze. The test exercises integer loop optimization and, to some extent, pointer manipulation and the rand() function. Just to make sure the compilers were generating the maze correctly, I ran another version of MazeBench, which saves the generated maze to an image; every compiler produced identical output. Both compilers produced similar-quality code on both processors. A sketch of the kind of workload involved follows.
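This is a minimal sketch of that workload type (not Ladd's MazeBench): integer loops, pointer-heavy traversal, and rand() over a 1000-by-1000 grid, carved with a randomized depth-first walk.

    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main()
    {
        const int N = 1000;                  // grid is N x N cells
        std::vector<unsigned char> visited(N * N, 0);
        std::vector<int> stack;
        stack.push_back(0);                  // start in the corner cell
        visited[0] = 1;

        long carved = 1;
        while (!stack.empty())
        {
            int cell = stack.back();
            int x = cell % N, y = cell / N;

            // Gather unvisited neighbours of the current cell.
            int next[4], count = 0;
            if (x > 0     && !visited[cell - 1]) next[count++] = cell - 1;
            if (x < N - 1 && !visited[cell + 1]) next[count++] = cell + 1;
            if (y > 0     && !visited[cell - N]) next[count++] = cell - N;
            if (y < N - 1 && !visited[cell + N]) next[count++] = cell + N;

            if (count == 0) { stack.pop_back(); continue; }

            int chosen = next[std::rand() % count];  // rand(), as in the benchmark
            visited[chosen] = 1;
            ++carved;
            stack.push_back(chosen);
        }
        std::printf("carved %ld cells\n", carved);
        return 0;
    }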
Stepanov

                      Pentium III                                   Pentium 4
                      g++      g++      g++      Intel     Intel    g++      Intel
                      3.0.4    3.1      3.2.1    C++ 6.0   C++ 7.0  3.2.1    C++ 7.0
run time (seconds)    16.4     3.4      3.5      3.3       3.4      1.1      1.0
abstraction penalty   3.6      1.0      1.1      1.0       1.0      1.2      1.0

Stepanov
The Stepanov benchmark was created by Alex Stepanov, one of the inventors of the Standard Template Library (STL). Stepanov measures what he calls an "abstraction penalty", by performing a simple task (adding 2000 doubles in an array 25,000 times) using 13 different levels of abstraction. For a full explanation of this benchmark, I direct you to the source code, which contains long and useful comments about Stepanov's goals and algorithms.
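The flavor of the test, in a minimal sketch of my own (not Stepanov's 13 levels), is the same summation expressed at a low and a higher level of abstraction; the abstraction penalty is the ratio of their running times.

    #include <cstdio>
    #include <numeric>

    // Level 0: C-style loop over a raw array.
    double sum_c_style(const double* p, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += p[i];
        return s;
    }

    // Higher level: the same sum through iterators and templates.
    double sum_stl(const double* first, const double* last)
    {
        return std::accumulate(first, last, 0.0);
    }

    int main()
    {
        double data[2000];                       // 2000 doubles, as in the text
        for (int i = 0; i < 2000; ++i) data[i] = 1.0;
        std::printf("%f %f\n", sum_c_style(data, 2000),
                    sum_stl(data, data + 2000));
        return 0;
    }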
With 3.1, the GNU compiler drew into a dead heat with Intel's product. This may be due to improved function inlining; Claudius Link, of Freiburg University, noted that gcc 3.0.4's performance on this benchmark could be vastly improved by setting the -finline-limit option to 700 or more. This is why I added -finline-limit=800 to the command line when compiling with gcc 3.2 and later.
OOPack (MFlops, C / C++)

           Pentium III                                                             Pentium 4
           g++ 3.0.4     g++ 3.1       g++ 3.2.1     Intel C++ 6.0  Intel C++ 7.0  g++ 3.2.1      Intel C++ 7.0
Max        149.3/149.3   149.3/149.3   149.3/149.3   107.5/137.0    44.4/44.8      666.7/131.6    1428.6/666.7
Matrix     235.8/215.5   242.7/231.5   242.7/231.5   238.1/240.4    240.4/238.1    1315.8/1136.4  1250.0/1250.0
Complex    140.1/69.3    140.1/69.3    140.4/70.9    173.9/145.5    145.2/161.0    1951.2/115.3   1818.2/1818.2
Iterator   339.9/274.3   327.9/281.7   327.9/285.7   322.6/266.7    322.6/227.3    1333.3/1333.3  2500.0/1333.3
OOPack
OOPack, developed by Arch D. Robison of Kuck & Associates, purports to "measure the relative performance of object-oriented-programming (OOP) in C++ versus just writing plain C-style code in C++." Max measures how well a C++ compiler inlines a function that returns the result of a comparison. Complex compares complex numbers in C++ (based on the complex<> template) relative to using explicit real and imaginary parts in C. Matrix measures how well a C++ compiler performs constant propagation and strength reduction on classes. The last test, Iterator, computes a matrix dot-product using C-style code and OOP-style code. Bigger numbers mean faster code, as this benchmark reports results in MFLOPS (millions of floating-point operations per second).
You can find more details on these tests in the source code. In most cases, the C++ code was slower than the equivalent C implementation.
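As an illustration of what the Complex test contrasts (my own sketch, not the Kuck & Associates source), the same multiply can be written with explicit parts in C style or through the complex<> template:

    #include <complex>
    #include <cstdio>

    struct CComplex { double re, im; };          // C-style representation

    CComplex mul_c(CComplex a, CComplex b)       // explicit real/imag parts
    {
        CComplex r;
        r.re = a.re * b.re - a.im * b.im;
        r.im = a.re * b.im + a.im * b.re;
        return r;
    }

    std::complex<double> mul_cpp(std::complex<double> a,
                                 std::complex<double> b)
    {
        return a * b;                            // the OO abstraction
    }

    int main()
    {
        CComplex a = {1.0, 2.0}, b = {3.0, 4.0};
        CComplex r = mul_c(a, b);
        std::complex<double> s = mul_cpp(std::complex<double>(1.0, 2.0),
                                         std::complex<double>(3.0, 4.0));
        std::printf("%f+%fi  %f+%fi\n", r.re, r.im, s.real(), s.imag());
        return 0;
    }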
Both OOPack and Stepanov show how abstraction imposes a measurable performance hit on software. We engineers need to employ abstraction judiciously, knowing when "people time" is more important than running time -- and vice versa. You can't always assume that you'll be using a compiler (like Intel's) that imposes little or no OO overhead.
One of my recent tasks was to develop a 24/7/365 data mining engine suitable for inclusion in a daemon or service. At the same time, I developed a GUI-based application for generating metadata used by the processing engine. The GUI code, being bound to the speed of the user, is far more "object oriented" than is the processor-bound data mining code. The penalty for abstraction mattered in the engine, where maximum throughput was the goal. For the metadata GUI, the goal was ease of development and maintenance. Don't trust any pundit who gives blanket advice like "Performance doesn't matter; just buy faster hardware" or "Don't use abstraction because it makes programs too slow" -- because, like most generalizations, both statements are demonstrably false.
Pentium III results: For the Matrix and Complex tests, Intel produced faster code than did gcc; the GNU compiler, however, wins the Max and Iterator tests, doing particularly well on the C version of Matrix. Something odd happened with Intel C++ 7.0, too -- its performance on the Max benchmark has fallen considerably. I'll be asking Intel about this.
Pentium 4 results: For straight C on Max and Iterator, Intel C++ produces code that is about twice as fast as the code emitted from gcc; on Matrix and Complex, Intel's code is 5-7% slower. The real shocker, though, is the superiority of Intel at generating C++ code: With the exception of Iterator (where the two compilers are equal), Intel's code is 10-1500% (no, that's not a typo) faster. The Complex benchmark, in particular, is an eye-opener, to be sure. I ran the tests many times, just to be sure I was seeing these numbers. Given that Intel is only slightly faster than gcc on the C version of Complex, I suspect that something about C++ turns off certain important optimizations in gcc. Also, as a couple of correspondents have pointed out, the numbers are a bit too round and "neat"; there may be a problem with timer resolution or the size of test I'm running. I plan to investigate further; the numbers here are preliminary.
Whetstone

                     Pentium III                                   Pentium 4
                     g++      g++      g++      Intel     Intel    g++      Intel
                     3.0.4    3.1      3.2.1    C++ 6.0   C++ 7.0  3.2.1    C++ 7.0
run time (seconds)   91       92       92       55        66       ?        22
estimated MIPS       549.5    543.5    543.5    909.1     757.6    ?        2272.7

Whetstone
Okay, so I included this hoary old benchmark because everyone's heard about it. Whetstone is another artificial benchmark, deriving code performance from a set of numerical loops. Originally written in Fortran, the version I used was converted to C by Rich Painter. This is a double-precision version of the benchmark, which I've run for 500,000 iterations.
Intel's optimizer is smart enough to know that Whetstone doesn't actually do anything; if you use the -ip option to enable interprocedural optimizations, Whetstone runs really fast (7500 MIPS!) because Intel C++ eliminates all of the "useless" code! I've had this happen before, so I was prepared to see it happen again. The problem with artificial benchmarks is that they're, well, artificial. As time goes by, I'm hoping to develop a "real world" benchmark suite for practical performance comparisons.
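One common defense against this kind of dead-code elimination (my own sketch, not something the article's harness does) is to make the benchmark's result escape, for example through a volatile sink, so the optimizer cannot prove the work useless:

    #include <cstdio>

    volatile double sink;                 // the result must land somewhere
                                          // the optimizer cannot reason away

    void whetstone_like_loop(long iterations)
    {
        double x = 1.0;
        for (long i = 0; i < iterations; ++i)
            x = (x + 1.0) * 0.999999;     // numeric busywork
        sink = x;                         // result escapes; loop stays live
    }

    int main()
    {
        whetstone_like_loop(500000L);     // iteration count from the text
        std::printf("%f\n", sink);
        return 0;
    }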
On the Pentium 4, the gcc-compiled Whetstone seemed to get caught in an infinite loop. I need to investigate more; while Whetstone is a tricky benchmark given its simplistic nature, gcc should generate a working executable.
One new wrinkle: On Copernicus, an Intel-compiled version of my xants program (an ALife research project) runs half as fast as the gcc-generated version, a discrepancy related to Intel's implementation of the std::vector template. I was surprised to see that the Pentium 4 brought Intel's performance in line with gcc's. This is another of those "I'll let you know what's going on when I figure it out" mysteries.
Compile Time / Code Size (seconds / bytes)

            Pentium III                          Pentium 4
            g++ 3.2.1        Intel C++ 7.0       g++ 3.2.1        Intel C++ 7.0
SciMark     3.9s / 28,782    2.2s / 32,547       1.3s / 19,917    0.6s / 25,941
MazeBench   10.4s / 37,113   7.3s / 132,548      3.5s / 28,014    2.0s / 123,912
Stepanov    1.9s / 21,267    1.2s / 27,896       0.5s / 12,470    0.4s / 24,850
OOPack      2.1s / 27,561    0.9s / 34,645       0.5s / 18,830    0.3s / 26,399
Whetstone   1.2s / 17,962    0.5s / 17,942       0.3s / 9,549     0.2s / 8,121

Compile Time and Code Size
This table shows the size of code generated by the compilers, and the amount of time required to compile. I've only included numbers for the latest compiler versions, for optimized compiles.
By default, Intel C++ uses static linking; the -i_dynamic switch (which I used in these tests) tells the compiler to use dynamic linking, which is what gcc does by default.
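A quick way to confirm the linkage difference (the file names here are placeholders; ldd is the standard tool for listing a binary's shared libraries):

    icc -O3 -i_dynamic scimark.c -o scimark_icc   # dynamic linking requested
    ldd ./scimark_icc                             # shows the shared libraries used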
Intel generally produces larger executable files, but always compiles faster than gcc. In my experience (and that of many correspondents), gcc 3.x is considerably slower than gcc 2.x. When compiling for the Pentium 4, both compilers produced significantly smaller executable images.
Conclusions
So which compiler is better?
Like Einstein, I have to say the answer is relative. If you use systems based on the Pentium 4 architecture, Intel C++ is an excellent choice. If you need OpenMP or automatic vectorization, Intel is your only choice given gcc's lack of these features. With version 7, Intel has also added full support for hyperthreading; I haven't had the time to experiment with HT yet.
Intel does not support all gcc language extensions; while it has been used to compile the Linux kernel and other free software projects, it is not a drop-in replacement for gcc. I do have some reports of code that compiles incorrectly with Intel C++; then again, I have such reports and experiences with every compiler I've ever used, including gcc. Compiling code is a complicated business, and it isn't humanly possible to write a perfect compiler that digests everything programmers throw at it.
Don't think I'm counting gcc out. If anything, these tests prove that free software can produce products that rival -- and sometimes exceed -- the qualities of their commercial counterparts. Perhaps gcc's greatest strength is its cross-platform portability; it is arguably the most ubiquitous piece of software in the world, running on everything from mainframes to embedded systems. For obvious reasons, Intel's compiler is specific and limited to their processors.
As for the religious war over free and proprietary software: I've written a few million lines of code over the decades, and only under GNU/Linux have I had source code for my compiler. Perusing the gcc source can be very educational -- but in the end, as a developer, I haven't had a compelling need for my compiler's source code. I don't have the time to do compiler hacking when I'm trying to write code for my customers. So long as gcc exists and is free, I don't see any problem with companies like Intel (and Borland) producing closed-source tools we can use to develop free software projects.
Your mileage and religious fervor on this issue may, of course, vary. I'm just glad we have a choice when it comes to development tools.
"Choice" is the key word here -- choice is good, be it in democracy or software. Intel provides a useful alternative to gcc for development on ia32 systems. One compiler might have a great environment for developing GUI code; another compiler might generate fast code. GPL-like freedom may be important -- or not -- as individual circumstances dictate.
Smart people choose the right tool (or tools) for the job; fools make snap decisions based on religious issues.
As always, I look forward to considered comments.
-- Scott


So ends the article.

To cut to the conclusion: on Intel chips, the Intel compiler is better. The data is quite old, though, and so are the gcc versions tested. And raw speed and full support for a chip's features aren't the only answer.
What about stability, Intel versus gcc? A compiler with more users is likely to be more stable. And I find a compiler whose source is open -- where you can see how bugs were reported and transparently fixed -- more trustworthy. Anyone who has worked inside a company knows how often problems simply get buried.
Within the OBLcpu benchmark suite, the performance of 30 out of 34 kernels improved using the Intel C++ compiler. The level of improvement varied dramatically, with many kernels clocking in 100% faster. Overall, the geometric mean improvement was 47%. Only 4 kernels exhibited any signs of performance degradation, and of those only one, gamsim, showed a significant decline in performance.

What I find most convincing is that most mobile phones ship with software built by gcc -- and I work in the mobile phone industry, after all. ^^;

So I chose gcc.

Next, a version has to be chosen as well. One thing I've learned doing software: if the versions differ, it's correct to treat them as different things.

Let's browse through the versions.

P.S. I deliberately left out Borland and the Visual series. For building applications, the Visual series really does seem better. Borland is no longer mainstream (my apologies to the Delmadang folks), and Microsoft's Visual series -- backed by its grip on the operating system and its steady acquisition of smaller companies -- has become less the product of any one development group than a trend that defines the era. You can neither champion nor condemn them; they're no longer the kind of operation a single star player can carry. Would you blame Anders Hejlsberg for a handful of minor bugs?


GPL-licensed components:

U-Boot, BusyBox, Freeware Advanced Audio Coder, LZO real-time data compression library, Openswan, BlueZ utils

LGPL-licensed components:

GTK+, GStreamer, FFmpeg, DirectFB, Pango, WebKit, glibc
