Compiler's Code Optimization - The Dark Side

Marcelo A. B. Slomp

4.92/5 (23 votes)

Jun 26, 2003

7 min read

82002

634

How compiler's code optimization models works and mixing its to create an hybrid optimization model.

Download source - 3.3 Kb

Brief considerations

First of all, I apologize for my imperfect English, since my native language is Portuguese.

The source code for this article is in VC++6 project format, but the article applies also for other versions, like .NET and even VC++5.

I'm assuming that you have a previous knowledge on generating, using and interpreting listing files, as well some minimal knowledge in Assembly language. I will abstain me here on explaining these details, but if you are not familiar with listing files, you can get these information by reading this very good article.

Introduction

We all know that many compilers give us several options for code optimization, including here the Visual C++ and even the Visual Basic compiler. We know also that there are two kinds of optimization more commonly used:

Minimize Size
Maximize Speed

We use it with a relative frequency, but hardly stop for thinking in how all this work or when we should prefer one over other. In a first view, we ask: how the compiler can build different code-objects if our sorce code is the same?

To answer this question, we need to examine what in fact is generated by the compiler. This can be done through the listing files.

Inside optimizations

To start, let's examine the following code:

//
// Counting generated machine code size in a simple "Hello World" example
//

#include <stdio.h>

char* hello = "Hello World!";  // our famous message :)

int main(void)
{
    long seip,eeip; // variables to store 'start'
                    // and 'end' EIP addresses
    __asm {
        xor eax,eax // clear contents of eax
        call jump1  // store instruction pointer value on
                    // the stack, then jump to label
                    // (see below)
      jump1:
        mov eax,[esp] // #1 (3 bytes)
        mov seip,eax  // #2 (3 bytes)
    }

    // Here, the code to be analyzed
    printf("%s\n",hello);

    __asm {
        xor eax,eax // #3 (2 bytes)
        call jump2  // #4 (5 bytes)
                    // 'call' instruction is equivalent to:
                    // push EIP
                    // jmp jump2
      jump2:
        xor eax,eax
        mov eax,[esp]
        sub eax,0Dh // we need subtract from the last
                    // instruction address the size of all
                    // performed instructions between call
                    // instructions that are not part of
                    // the code being analyzed.
                    // (13 bytes: lines #1, #2, #3 and #4)
        mov eeip,eax
    }
    // print code size results and exit
    printf("Code size: %d bytes",eeip-seip);
    return 0;
}

The purpose of this code is simple: we will get the size of the generated machine code, in bytes, for the source code being analyzed. (in this case, just a printf), through the difference between the EIP address before and after our code. Notice that MASM and, consequently, inline assembly code in Visual C++ does not allow direct access to EIP, then this procedure was emulated through a call instruction (that is equivalent to: push EIP + jmp LABEL).

Compiling the code using size optimization (select OptSize configuration in the example project) and executing the program, we will get an output:

$ Hello World!
$ Code size: 18 bytes

Compiling with speed optimization (select OptSpeed configuration in the example project) and executing the program, we will get an output:

$ Hello World!
$ Code size: 19 bytes

We have now 1 more byte in machine code, without have added anything in the source code!

Let's examine the listing files for both compilations to see how this "magic" happened. (Using the option Assembly, Machine Code, and Source - the listing files will have a .cod extension)

Case A: Minimize Size

00012 ff 35 00 00 00
    00 push DWORD PTR ?hello@@3PBDB
00018 68 00 00 00 00  push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@
0001d   e8 00 00 00 00  call _printf
00022   59 pop ecx
00023   59 pop ecx

Case B: Maximize Speed

0013 a1 00 00 00 00 mov eax, DWORD PTR ?hello@@3PBDB
00018 50 push eax
00019 68 00 00 00 00 push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@
0001e e8 00 00 00 00 call printf
00023 83 c4 08 add esp, 8

In Case A our variable hello is allocated in the stack directly by it pointer, consuming 6 bytes of code, while in Case B the address of our variable is firstly copied in eax and then allocated through push eax, consuming 6 bytes too, but with 2 instructions (mov + push). We ask now: what's the difference? Take a look in the following tables:

Case A: Minimize Size

INSTRUCTION	CLOCKS*	SIZE
`push DWORD PTR ?hello@@3PBDB`	4	6
`push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@`	1	5
`call _printf`	3	5
`pop ecx`	1	1
`pop ecx`	1	1
TOTALS	10	18

Case B: Maximize Speed

INSTRUCTION	CLOCKS*	SIZE
`mov eax, DWORD PTR ?hello@@3PBDB`	1	5
`push eax`	1	1
`push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@`	1	5
`call _printf`	3	5
`add esp, 8`	1	3
TOTALS	7	19

* Clock counts for the program running in an I486 machine

Then, for the same code we have, in Case A, 18 bytes of code, consuming 10 machine cycles and in Case B, 19 bytes of code but consuming 7 machine cycles.

Pay attention in the first instruction in Case A and the 2 first instructions in Case B: In Case A, only one instruction is used (push), but it involves the use of a memory address operand (DWORD PTR), and this consumes more cycles than if using a register as operand. This is done in Case B, by storing the memory address in a register (eax, in this case) and then applying push to this register. Even having two instructions in despite of only one, the operation will be executed more fast. This is, basically, the "magic" associated to the optimization: when the objective is speed, prioritize at maximum operations involving registers directly and avoiding (when possible) operations involving memory operands.

But we can get more from this code. Take a look in the stack frame creation for both, A and B cases:

Case A: Minimize Size

INSTRUCTION	CLOCKS*	SIZE
`push ebp`	1	1
`mov ebp, esp`	1	2
`push ecx`	1	1
`push ecx`	1	1
TOTALS	4	5

Case B: Maximize Speed

INSTRUCTION	CLOCKS*	SIZE
`push ebp`	1	1
`mov ebp, esp`	1	2
`sub esp, 8`	1	3
TOTALS	3	6

* Clock counts for the program running in an I486 machine

Notice that that the last 2 push instructions in Case A does the same thing that the sub instruction in Case B, but only a little slow and with a little less size. Since ecx is a 32-bit register (4 bytes), at the end of the 2 push instructions, the stack pointer will have a displacement of 8 bytes (we have two dword local variables), that is the same that sub esp,8 in Case B. This, by itself, explain why our code, in Case A have a pop sequence, while in Case B have an add instruction to restore the stack (remember: we are using C calling convention).

Mixing models: Hybrid optimization

Extending the previous analysis, notice that, for the same operation, we have a code size of 6 bytes, that in Case A consumes 4 cycles and in Case B consumes only 2 cycles. Why not to substitute this instruction in Case A by the instructions of Case B? This will result in a hybrid optimization model, where we have simultaneously a small and fast code. Applying this substitution (manually), we have:

Hybrid Optimization Model

INSTRUCTION	CLOCKS*	SIZE
`mov eax, DWORD PTR ?hello@@3PBDB`	1	5
`push eax`	1	1
`push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@`	1	5
`call _printf`	3	5
`pop ecx`	1	1
`pop ecx`	1	1
TOTALS	8	18

* Clock counts for the program running in an I486 machine

We continue with the same 18 bytes of code as in Case A, but now consumming 8 cycles, that is almost the value for the Case B! You can now use MASM to assemble this new code (a .bat file is included in the example for this operation).

However, consider this hybrid model just as an exercise for understanding of this study, because in a real and extensive project, this manual process becomes unviable and susceptible to errors.

Optimization models: Choosing the best solution

Now, the questions that "stays resident" are:

What's the difference if my code execute with only 3 cycles more or less?
It's not more convenient to always choose the minimal size model?

It's a frequent doubt, but that can be answered taking in mind factors like:

This is just a small example, where we have only a simple function being executed. Now imagine a real program, where we have countless operations. You can now imagine, proportionaly, how many machine cycles can be saved by maximizing the speed (or how many bytes of code can be saved by minimizing the size).
Imagine our example in a loop with 1 million iteractions. For each cycle required by the code itself we have not only one, but 1 million cycles required until the end of this loop! In a modern computer, maybe this is imperceptible, but when running the program in an old 80386, for example, this difference is relevant.
When operating in protected mode, many concurrent tasks are being executed and, in critical and complex tasks, such extensive mathematical operations in engineering applications, perhaps a fast code be more adequated.

Many other factors can be enumerated to justify the use of one or other model, and this demonstrates that the choice of a correct optimization model must be done considering factors like these. It's an objective and at same time subjective task, that many times is forgotten, or not considered, but that can determine the success of our projects.