I've recently been working a little bit on a compiler, which has made me ask
myself the following question: just how slow is it compared to mainstream
languages? Not only does my compiler perform no optimizations yet (no inlining,
loop unrolling, constant folding, strength reduction, dead code elimination,
nothing!), it generates very naïve assembly. Because it has no register
allocation strategy, it stores everything on the stack and only uses `rax` (and `rbx` when using binary instructions). This leads to comically bad assembly, such as:
```
movq $1, %rax
movq %rax, %rbx
```
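To give an idea of where that comes from, code generation for a binary expression boils down to something like the sketch below. This is not the compiler's actual code (`Expr` and `emit_expr` are just stand-ins for its AST and expression codegen), but it illustrates the scheme: evaluate the left operand, spill it to the stack, evaluate the right operand, then reload and combine using only `rax` and `rbx`.

```c
#include <stdio.h>

typedef struct Expr Expr;               /* stand-in for the compiler's AST node type        */
void emit_expr(FILE *out, Expr *expr);  /* assumed to emit code leaving its result in %rax  */

/* Sketch of the stack-based scheme for a binary operation. */
void emit_binary_op(FILE *out, Expr *lhs, Expr *rhs, const char *op)
{
    emit_expr(out, lhs);                        /* left operand: result in %rax      */
    fprintf(out, "    push %%rax\n");           /* spill it to the stack             */
    emit_expr(out, rhs);                        /* right operand: result in %rax     */
    fprintf(out, "    movq %%rax, %%rbx\n");    /* move it into %rbx                 */
    fprintf(out, "    pop %%rax\n");            /* reload the left operand into %rax */
    fprintf(out, "    %s %%rbx, %%rax\n", op);  /* e.g. addq or subq                 */
}
```

Every intermediate value takes a round trip through memory, which is exactly why the output looks the way it does.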
Anyways, I was curious to know just how slow the code my compiler outputs is compared to mainstream languages. Obviously, it would be slower than the equivalent C code, and presumably even slower than Python, but by how much? I wanted to get a feel for this, but also for how close I could realistically get with basic optimizations.
It's common knowledge that you can't compete with LLVM/GCC in terms of code generation (at least not as a one-man effort), but also that all mainstream languages benefit from tons of performance work done by lots of experts. My intuition was therefore that if I could get within an order of magnitude of their speed after implementing optimizations, I should be quite satisfied. I estimate that, on the scale of a whole program, not generating ridiculously bad assembly plus some basic optimizations should be able to buy me an order of magnitude in performance (maybe too optimistic, but it's really bad right now). Being an order of magnitude away from mainstream languages wouldn't be too bad: I don't intend to use my language for performance-critical work, and I can drop down to assembly if needed.
So I've decided to go ahead and compare the same program, written in my language, in Python and in C. I've gone for a naïve Fibonacci function (the recursive one, which doesn't use dynamic programming) and asked it for the 40th number. For reference, here's the (x64) assembly that my compiler generated for it:
```
fibo:
    push %rbp
    movq %rsp, %rbp
    movq 16(%rbp), %rax
    push %rax
    movq $0, %rax
    pop %rbx
    cmpq %rax, %rbx
    movq $0, %rax
    sete %al
    push %rax
    movq 16(%rbp), %rax
    push %rax
    movq $1, %rax
    pop %rbx
    cmpq %rax, %rbx
    movq $0, %rax
    sete %al
    pop %rbx
    or %rbx, %rax
    cmpq $0, %rax
    je else_0
    movq 16(%rbp), %rax
    jmp end_0
else_0:
    movq 16(%rbp), %rax
    push %rax
    movq $1, %rax
    movq %rax, %rbx
    pop %rax
    subq %rbx, %rax
    push %rax
    call fibo
    addq $8, %rsp
    push %rax
    movq 16(%rbp), %rax
    push %rax
    movq $2, %rax
    movq %rax, %rbx
    pop %rax
    subq %rbx, %rax
    push %rax
    call fibo
    addq $8, %rsp
    pop %rbx
    addq %rbx, %rax
end_0:
    movq %rbp, %rsp
    pop %rbp
    ret
```
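The C and Python programs aren't reproduced here, but they are the direct equivalent of that function; f.c is essentially the following (give or take how the result is printed):

```c
#include <stdio.h>

/* Naïve recursive Fibonacci: no memoization, exponential number of calls. */
long fibo(long n)
{
    if (n == 0 || n == 1)
        return n;
    return fibo(n - 1) + fibo(n - 2);
}

int main(void)
{
    printf("%ld\n", fibo(40));
    return 0;
}
```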
Without further ado, here are the results:
```
~ > compiler > out.s && gcc out.s && time ./a.out

real    0m0.925s
user    0m0.925s
sys     0m0.000s

~ > gcc -O0 f.c && time ./a.out

real    0m0.575s
user    0m0.575s
sys     0m0.000s

~ > gcc -O1 f.c && time ./a.out

real    0m0.394s
user    0m0.394s
sys     0m0.000s

~ > gcc -O2 f.c && time ./a.out

real    0m0.149s
user    0m0.149s
sys     0m0.000s

~ > gcc -O3 f.c && time ./a.out

real    0m0.137s
user    0m0.137s
sys     0m0.000s

~ > time python3 f.py

real    0m19.200s
user    0m19.191s
sys     0m0.008s
```

My binary is within a factor of two of GCC's `-O0` output (which is very slow), within an order of magnitude of `-O3`, and... almost 20 times faster than Python?! This shocked me quite a bit when I first ran the benchmarks, since I expected to score lower than even Python.
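To put numbers on that: 0.925 s versus 0.575 s for `-O0` is a factor of about 1.6, versus 0.137 s for `-O3` a factor of about 6.8, and 19.2 s versus 0.925 s a factor of about 20.8.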
My conclusion is twofold, but before getting to it, let me address a few objections I can already anticipate:
**This is a cherry-picked benchmark!** No, I've just written the first simple, long-running function that came to my mind.
**This is not a realistic benchmark, and it disproportionately disadvantages Python!** Sure, but my assembly is also really terrible. My point is that I still expected to be slower than Python. I didn't know by how much, and with hindsight it makes sense that I would be faster, but I still would not have expected a 20x gap.
**Even if your assembly is naïve, it's low-level, and all the data will likely fit in cache!** Well, first, the second point also applies to Python. As for the first point, I know that CPython doesn't JIT, but even then, I cannot imagine what causes the overhead to be that big. Either way, I'm now convinced that it is relatively easy to outperform the decades of performance work that went into CPython.
I will be happy to perform some more realistic benchmarks later, but my compiler is not ready for that yet :^)