Writing a Linux 4k Intro

RCL_SPD

5.00/5 (1 vote)

Jan 27, 2014

CPOL

4 min read

7516

How to write a Linux 4K introduction

This summer, my old friend motivated me to restart demoscene activity and I started coding a 4k introduction for x86 Linux (which was supposed to be released at WeCan'2012, but we missed the deadline).

This is an ongoing work, but I decided to share some experience that I consider interesting.

1. Even C is Probably Too Heavy for the Task

Although a friend of mine has previously written a 4k intro in C, I am having problems getting the compiler to generate the compact code I want. One of the reasons for that is that even when optimizing the whole program at once, compilers (here and later on by "compilers" I mean ones I tried for the intro, namely clang & gcc) don't relax the ABI restrictions for internal (externally invisible) functions. Sure, you can inline a function altogether but sometimes it's not what you want.

To provide an example: let's say that ABI mandates that you pass arguments via stack. However, according to your particular optimization criteria (be it size or speed), you could generate better code for calling some (not all) functions if arguments were passed via registers. Currently, you can fine-tune that manually (by using 'regparm' function attribute on select functions), but ideally compiler would analyze all the callers of your function and decide that automatically.

Generally, it seems that even whole-program optimizing compilers are having problems crossing function borders, even if all the possible places where a function could be called from are visible to them.

E.g. I would like two (or more) functions to share a common stack frame - NOT inline one into another, as this would produce sub-optimal (for me), larger and worse to pack code, but essentially turn some of the callee functions into local ones. This can be done manually to some extent, but this looks like a task for compiler.

BTW, it turns out that Clang generates smaller code than GCC, although this is mostly (only?) true if you use SSE intrinsics for math: it looks like Clang cannot and will not generate any x87 code even if you pass -mfpmath=387 (at least that's the experience with current SVN trunk and also with version 3.0 shipped with Ubuntu 12.04) and scalar SSE opcodes are large.

2. Compilers are Buggy ;-)

Nothing new here, of course. Just sharing a particular example of peephole optimizer silliness in Clang, which turns this code...

    size_t iCnt = 6;
    float * pBuf = s_BufStart;
    do
    {
        SomeFunc( pBuf );
        SomeFunc( pBuf + 4 );
        SomeFunc( pBuf + 8 );
        SomeFunc( pBuf + 12 );
        pBuf += 16;
    }
    while( --iCnt > 0 );

...into this (pay attention to instructions in bold):

804988e:       31 db                   xor    %ebx,%ebx
8049890:       83 ec 10                sub    $0x10,%esp
8049893:       8d 83 30 bf 04 08       lea    0x804bf30(%ebx),%eax
8049899:       89 04 24                mov    %eax,(%esp)
804989c:       ff 15 14 be 04 08       call   *0x804be14
80498a2:       83 c4 10                add    $0x10,%esp
80498a5:       83 ec 10                sub    $0x10,%esp
80498a8:       8d 83 40 bf 04 08       lea    0x804bf40(%ebx),%eax
80498ae:       89 04 24                mov    %eax,(%esp)
80498b1:       ff 15 14 be 04 08       call   *0x804be14
80498b7:       83 c4 10                add    $0x10,%esp
80498ba:       83 ec 10                sub    $0x10,%esp
80498bd:       8d 83 50 bf 04 08       lea    0x804bf50(%ebx),%eax
80498c3:       89 04 24                mov    %eax,(%esp)
80498c6:       ff 15 14 be 04 08       call   *0x804be14
80498cc:       83 c4 10                add    $0x10,%esp
80498cf:       83 ec 10                sub    $0x10,%esp
80498d2:       8d 83 60 bf 04 08       lea    0x804bf60(%ebx),%eax
80498d8:       89 04 24                mov    %eax,(%esp)
80498db:       ff 15 14 be 04 08       call   *0x804be14
80498e1:       83 c4 10                add    $0x10,%esp
80498e4:       83 c3 40                add    $0x40,%ebx
80498e7:       81 fb 80 01 00 00       cmp    $0x180,%ebx
80498ed:       75 a1                   jne    8049890

Apparently, clang is trying to set up and tear down an aligned stack parameter, but does not notice for some reason that the same set up and tear down happens four times in a row, thus missing the opportunity to optimize the three of them out.

I tried to come up with a smaller fragment of code that reproes that, but could not so far, that is why instead of filing a bug report I'm just mentioning this on the blog. Notably, gcc does not have any problems with this particular piece of code, but as I mentioned above, gcc's code is ultimately larger and less packer-friendly.

Also, it's possible to get an ICE with clang when doing weird stuff with SIMD intrinsics/registers (especially when aliased unions with floats are involved), but again, a simple isolated repro is hard to come by.

3. x86 Code is Damn Tight (SSE One, Anyway)

I experimented with some simple transforms on the resulting binary to improve compression ratio. Using tools from ELF kickers and an excellent distorm disassembler library, I wrote a code rotator to transpose x86 opcodes, i.e., instead of having them placed in memory like...

    164a:       0f 28 ea                movaps %xmm2,%xmm5
    164d:       0f 59 ed                mulps  %xmm5,%xmm5
    1650:       0f 59 e9                mulps  %xmm1,%xmm5
    [...]
    165b:       0f 52 ed                rsqrtps %xmm5,%xmm5
    165e:       0f 59 d5                mulps  %xmm5,%xmm2
    1661:       0f 58 e2                addps  %xmm2,%xmm4

...that is,

    0f 28 ea 0f 59 ed 0f 59 e9 0f 52 ed 0f 59 d5 0f 58 e2

...the rotator places them roughly like:

    0f 0f 0f 0f 0f 0f 28 59 59 52 59 58 ea ed e9 ed d5 e2

(it is actually more tricky as x86 opcodes have different sizes).

Guess what? Despite this layout looking more regular, it actually packs worse, at least with gzip and lzma (which are used as 4k code droppers). Trying to add a simple RLE pre-pass didn't help, so this requires a closer (and more thorough) look.

4. Linux Needs Better 4k Tools

Particularly, lack of crinkler is a major disadvantage. Specialized (and better) loaders than ld do exist, but nothing can make up for the lack of crinkler's excellent packer.

However, I would like to highlight GC masher, a great tool (with awful source code, but still) to conduct a brute-force search of compiler parameters or code layout (e.g., provide alternative paths in code switched by #ifs and use the tool to find the combination with the best compression ratio). With some creative use, the tool can be used to find not the smallest, but the fastest version of resulting binary (within the limits of a brute-force search of course).

5. Demomaking is Fun and I Missed It

I have previously coded a 4k intro (MS-DOS, with GUS music), and a couple of demos, too, but I have been inactive for quite a while (crunch doesn't help creativity ;-) and then there were other reasons, too). Now that I returned to the coding scene stuff, I feel myself a teenager again :-)