Click here to Skip to main content
15,880,608 members
Articles / Internet of Things / Arduino

Dynamic Loop Unrolling

Rate me:
Please Sign up or sign in to vote.
5.00/5 (1 vote)
28 Jan 2014CPOL4 min read 7.5K  
Dynamic loop unrolling

During good old days on ZX Spectrum, coders used to construct various sprite blitting routines 'manually', that is, during run time. Sometimes, this was done to save code size, but often times the primary reason was efficiency - on that architecture, it was faster to embed your data in the code than to fetch it from some other location. E.g. instead of (pseudoasm):

ASM
    mov ptr, sprite_data    ; a memory address
draw_sprite:
    mov reg, [ptr]           ; read the first pixel
    mov some_screen_addr, reg

You could just create a custom routine for that sprite that would do:

ASM
draw_specific_sprite:
    mov reg, sprite_pixel  ; sprite pixel as an immediate value
    mov some_screen_addr, reg 

Of course, actual Z80 assembly looked quite differently and we used PUSH (having stack pointer set to point to video buffer) instead of regular register loads/stores, but you get the idea.

With compiled languages, this is, generally speaking, unfeasible. One can achieve similar effect during compile time by using templates, but to have the same flexibility code needs to be generated and compiled on the fly.

However, having recently learned about GCC-specific extension of labels as values, I decided to create a program that would generate itself on the fly without invoking the compiler. So, here is the result.

Useless Dynamic Loop Unroller

The idea is simple: since labels (with GCC) are just addresses in code, allocate some executable memory (this is slightly different from normal malloc() because of W^X), copy parts of the function as many times as needed, and then execute. This will work because these days, compilers usually generate relocatable (i.e. PC-relative) code which rarely, if ever, needs patching (linker authors may correct me here, but at least so was my experience). For additional safety, -fPIC could be used.

The example below shows only a single increment being repeated, but of course the 'pieces' could be as different as one wishes. I could have even patched the code to replace dummy immediate values with another ones (which would work on x86/x64 platform).

Could... if compiler didn't get in the way.

My first proof-of-concept code was much less complicated, but I wanted this to work with -O3 optimization and this proved to be much harder than I thought. Compiler rearranged the code, removed unused labels, or tried to predict the execution flow and duplicated certain code paths instead of doing jumps.

Finally, I worked around this by placing label addresses in a table and using an artificial ExternalRandom variable which value compiler cannot guess when compiling this unit. All that made this code unpleasant to read and follow.

As useless and fragile as it is, this example is portable (!) and works on x86, x64, PowerPC (PS3) and ARM (N900) devices. Code is GCC-specific, and also - as given here - Linux specific because of mmap() usage. If I were less lazy, I could have used VirtualAlloc() on Windows to make it even more portable.

C++
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <memory.h>

void * MallocExec( size_t Size )
{
    // align size to page size
    Size = ( Size & ~4095 ) + 4096;

    void * pAddr = mmap(
        NULL,
        Size,
        PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_ANONYMOUS | MAP_PRIVATE,
        0, 0 );

    return pAddr;
}

void FreeExec( void * pPtr, size_t Size )
{
    // align size to page size
    Size = ( Size & ~4095 ) + 4096;

    munmap( pPtr, Size );
}

void TellPC( const char * pStr )
{
    printf( "%s: program counter is 0x%ZX\n", pStr, 
            ( size_t )__builtin_return_address( 0 ) );
}

// ExternalRandom should be always 1, but compiler shouldn't know about this
void SelfConstruct( int * pCounter, int nTimes, int ExternalRandom )
{
    static void * Jumps[] = { &&dyn_gen, &&end, &&end2, &&loop };
    void * pLoop = Jumps[ ExternalRandom + 2 ];
    void * pEnd = &&end;
    size_t LoopSize = ( size_t )&&end - ( size_t )&&loop;
    size_t EndSize = ( size_t )&&end2 - ( size_t )&&end;
    void * pExecMem = NULL;
    void * pEnd2 = Jumps[ ExternalRandom + 1 ];
    int i;
 
    printf( "loop: 0x%ZX, sizeof( loop ) = %Zu\n", ( size_t )&&loop, LoopSize );
    printf( "end: 0x%ZX, sizeof( end ) = %Zu\n", ( size_t )&&end, EndSize );
    goto *Jumps[ ExternalRandom - 1 ];

dyn_gen:
    pExecMem = MallocExec( nTimes * LoopSize + EndSize );

    for ( i = 0; i < nTimes; ++i )
    {
        memcpy( pExecMem + i * LoopSize, pLoop, LoopSize );
        printf( "Copied %Zu bytes from 0x%ZX to 0x%ZX\n", LoopSize, 
                ( size_t )pLoop, ( size_t )pExecMem + i * LoopSize );
    }

    memcpy( pExecMem + nTimes * LoopSize, pEnd, EndSize );
    printf( "Copied %Zu bytes from 0x%ZX to 0x%ZX\n", EndSize, 
            ( size_t )pEnd, ( size_t )pExecMem + nTimes * LoopSize );

    if ( nTimes > -1 )
    {
        printf( "Jumping to 0x%ZX\n", ( size_t )pExecMem );
        goto *pExecMem;
    }
    goto *Jumps[ ExternalRandom ]; // should never reach this line, but 
                                   // don't tell the compiler about this

loop:
    ++( *pCounter );               // the line which will be actually repeated

end:
    if ( ExternalRandom >= 1 )     // should be always true, but psst...
    {
        goto *pEnd2;
    }
    // will never get here
    *pCounter = -1;

end2:
    TellPC( "end2" );
    FreeExec( pExecMem, nTimes * LoopSize + EndSize );
};

int main( int argc, const char *argv[] )
{
    int nCounter = 0;
    if ( argc < 2 )
    {
        SelfConstruct( &nCounter, 1, argc );
    }
    else
    {
        SelfConstruct( &nCounter, atoi( argv[ 1 ] ), 1 );
    }
    printf( "nCounter = %d\n", nCounter );

    return 0;
}

This cleanly compiles with -O3 -Wall and works (with GCC up to 4.5.2 at least).

And, just for reference, here are the results of running this program on different devices (with 12 passed as parameter). I'm giving those since it was curious for me to compare code size and virtual address space layout.

x86 Ubuntu

loop: 0x8048610, sizeof( loop ) = 6
end: 0x8048616, sizeof( end ) = 26
Copied 6 bytes from 0x8048610 to 0x634000
Copied 6 bytes from 0x8048610 to 0x634006
Copied 6 bytes from 0x8048610 to 0x63400C
Copied 6 bytes from 0x8048610 to 0x634012
Copied 6 bytes from 0x8048610 to 0x634018
Copied 6 bytes from 0x8048610 to 0x63401E
Copied 6 bytes from 0x8048610 to 0x634024
Copied 6 bytes from 0x8048610 to 0x63402A
Copied 6 bytes from 0x8048610 to 0x634030
Copied 6 bytes from 0x8048610 to 0x634036
Copied 6 bytes from 0x8048610 to 0x63403C
Copied 6 bytes from 0x8048610 to 0x634042
Copied 26 bytes from 0x8048616 to 0x634048
Jumping to 0x634000
end2: program counter is 0x804880F
nCounter = 12

amd64 Ubuntu

loop: 0x400798, sizeof( loop ) = 8
end: 0x4007A0, sizeof( end ) = 32
Copied 8 bytes from 0x400798 to 0x7FA5168B3000
Copied 8 bytes from 0x400798 to 0x7FA5168B3008
Copied 8 bytes from 0x400798 to 0x7FA5168B3010
Copied 8 bytes from 0x400798 to 0x7FA5168B3018
Copied 8 bytes from 0x400798 to 0x7FA5168B3020
Copied 8 bytes from 0x400798 to 0x7FA5168B3028
Copied 8 bytes from 0x400798 to 0x7FA5168B3030
Copied 8 bytes from 0x400798 to 0x7FA5168B3038
Copied 8 bytes from 0x400798 to 0x7FA5168B3040
Copied 8 bytes from 0x400798 to 0x7FA5168B3048
Copied 8 bytes from 0x400798 to 0x7FA5168B3050
Copied 8 bytes from 0x400798 to 0x7FA5168B3058
Copied 32 bytes from 0x4007A0 to 0x7FA5168B3060
Jumping to 0x7FA5168B3000
end2: program counter is 0x40093F
nCounter = 12

PowerPC64 (PS3) Yellow Dog Linux 6.0

loop: 0x100006A0, sizeof( loop ) = 12
end: 0x100006AC, sizeof( end ) = 36
Copied 12 bytes from 0x100006A0 to 0xF7FFC000
Copied 12 bytes from 0x100006A0 to 0xF7FFC00C
Copied 12 bytes from 0x100006A0 to 0xF7FFC018
Copied 12 bytes from 0x100006A0 to 0xF7FFC024
Copied 12 bytes from 0x100006A0 to 0xF7FFC030
Copied 12 bytes from 0x100006A0 to 0xF7FFC03C
Copied 12 bytes from 0x100006A0 to 0xF7FFC048
Copied 12 bytes from 0x100006A0 to 0xF7FFC054
Copied 12 bytes from 0x100006A0 to 0xF7FFC060
Copied 12 bytes from 0x100006A0 to 0xF7FFC06C
Copied 12 bytes from 0x100006A0 to 0xF7FFC078
Copied 12 bytes from 0x100006A0 to 0xF7FFC084
Copied 36 bytes from 0x100006AC to 0xF7FFC090
Jumping to 0xF7FFC000
end2: program counter is 0x100008BC
nCounter = 12

ARM (N900) Maemo

loop: 0x8564, sizeof( loop ) = 16
end: 0x8574, sizeof( end ) = 24
Copied 16 bytes from 0x8564 to 0x4001F000
Copied 16 bytes from 0x8564 to 0x4001F010
Copied 16 bytes from 0x8564 to 0x4001F020
Copied 16 bytes from 0x8564 to 0x4001F030
Copied 16 bytes from 0x8564 to 0x4001F040
Copied 16 bytes from 0x8564 to 0x4001F050
Copied 16 bytes from 0x8564 to 0x4001F060
Copied 16 bytes from 0x8564 to 0x4001F070
Copied 16 bytes from 0x8564 to 0x4001F080
Copied 16 bytes from 0x8564 to 0x4001F090
Copied 16 bytes from 0x8564 to 0x4001F0A0
Copied 16 bytes from 0x8564 to 0x4001F0B0
Copied 24 bytes from 0x8574 to 0x4001F0C0
Jumping to 0x4001F000
end2: program counter is 0x86FC
nCounter = 12

Closing Notes

This is of course fragile and not guaranteed to work in more complicated cases. Something as simple as calling a non-inlined function from within relocated block of code may break things (this is because, as I said above, code is position-independent and will call the relative (wrong) address - unless you also copy the function in question).

Perhaps more limited subset of code (e.g. naked functions containing inline asm only) is better suited for this kind of tricks. Actually, I did exactly that about 5 years ago (when 32-bit x86 was still king) - I generated complex SSE code during run time from naked functions containing assembly pieces (I didn't want to write my own assembler just for that purpose). This code wasn't uberoptimal though.

...

On an unrelated note, gdb sucks. That beast cannot step into memory for which it can't find source mapping (at least a function that contains that address). Yes, you can force it to disassemble raw memory, e.g.:

x/40i $pc

but without the ability to do nexti or at least stepi, this sucks anyway. Also, I haven't figured out why my breakpoints cease to work after switching layouts (?!)... Generally, while it's better than nothing, gdb probably cannot be used for debugging a dynamically generated code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
-- There are no messages in this forum --