## Introduction

How often do we encounter numbers that exeed MAXUINT32 or even MAXUINT64 ? Indeed, 2^32 - 1 = 4294967295 and 2^64 - 1 = 18446744073709551615 are extremelly huge numbers. But what if you need to handle numbers like 2^3232 or greater without loss of precision ? That can never happen but in case you will encounter this it may to surprise you. I was suprised with the problem of computing such extra huge values while solving one exotic problem. So I thing my experience may someone come in handy.

## Background

Lets look at arbitrary integer written on paper: 123456 for instance. We, human beings, use special numeral symbols from set [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and don't think a lot how much of such symbols to use to write a number on a paper or desk. Actually we are limited only with available empty space to write next digit. Even in case the line is over, we boldly keep note using next line below. When we need to add/substract/multiply/divide - we use our smart electronic helpers. Nowadays we are usually look at so called scientific notation of numbers because of length of number devices can handle is limited by system architecture. Scientific notation of numbers lets us to evaluate results of operations, exponent of number for instance. But what if results of huge and extra huge numbers arithmetics must be exact and precision is not a tradeoff ? In this case we need to build some lines of code to solve the problem.

## Model

First of all, recall the ancient knowledge. CPU (hereinafter x86) uses binary code in it's work. Number 123456 (in decimal numeral system) looks like 1 1110 0010 0100 0000 in binary numeral system. When this number is stored in process memory at address N, it will look as follow:

N: [0100 0000] N + 1: [1110 0010] N+2: [0000 0001]
We have 3 bytes. As you know, byte is minimum memory storage unit (yes, it consists of 8 bits, but you can't add bit #5 to bit#7 atomically by only instruction). So arbitrary number being stored in memory splits into bytes and more significant bytes are stored at higher memory addresses. CPU can operate numbers in memory, but atomicity of this operations is limited. 32 - bit processor can handle 32 - bit operand (64/128 bits for some cases) by one instruction. As a finite state machine CPU changes it's state accordingly to result of perfomed operation.

Now it's time to refresh in mind binary arithmetics knowledges.

Let try to add two numbers, 17 (10001) and 18 (10010). I use small numbers to be brief and to the point.

10001 (17)
+ 10010 (18)
-------
100011 (35)

As you can see, when adding bits #4 there was an overflow and bit #5 is come in sight. Result width was expanded. If we would limit with the 5 bits the result was 00011 with overflow. In substraction there is situation when we have to "borrow" 1 from more significant bit and it is known as underflow. Thereby in case of splitting number in memory we can't add (substract) next parts without taking as another operand state of overflow (underflow). Processor gives us such opporunity - *adc* and *sbb* instructions from "classic" instructions set are come in handy (snippet is for demonstration purposes only):

mov ecx, [Size]
xor esi, esi
clc
lahf
_AddLoop:
mov ebx, [Addend]
mov dl, [ebx][esi]
mov ebx, [AddResult]
sahf
adc [ebx][esi], dl
lahf
inc esi
loop __AddLoop

Substraction operation is implemented similarly.

Now lets talk about the multiplication and try to multiply 17 (10001) and 18 (10010):

10010 (18)
* 10001 (17)
--------
10010
00000
00000
00000
010010
-----------------
0100110010 (306)

As you see I just decompose multiply operation into sequence of shifts and additions. We shift left first operand everytime next bit of second operand is being checked. If second operand's bit is non-zero addition is involved.

It's pretty straighforward. Result width is sum of the operands widths. The implementation of multiply operation can be found at Source code.

And the last operation is division. As usually, do it by hands, pen and on the paper:

100110001 (305) 1110 (14)
01110 10101 (21) - Quotient
-------
0010100
0001110
---------
000011001
000001110
-----------
000001011 - Remainder (11)

Indeed, 21 * 14 + 11 = 305.

So as you can see, divide operation can be decomposed into sequence of shifts, comparings and substractions and this algorithm is not too complex. Dividend and quotient are shifting left until it's most significant part is greater than divider (I call it "to alignment heads"). When divider is less than that shifted chunk - substraction is performed and 1 is set into quotient. Remainder is the bits are remained after all shifts and substractions. Divident is copied and handled into the separate double sized buffer.

## Mechanics

All this functions are in Source code implemented with common assembler instructions and are subject to improve. My approach is straighforward and there are some confines:

- all numbers are unsigned. It lets to avoid some difficulties with negative numbers handling.
- all numbers are equal sized.
- all numbers are DWORDs arrays.

Ways to increase perfomance.

First and simpliest - declare functions as naked and __fastcall. It reduces overheads for prologue/epilogue code execution and passing arguments via stack. Then - use instructions that handle max sized operands. (I used instructions from ancient "classic" command set). But before do that make sure they are supported by processor. CPUID is a good approach for this. Second and more sophisticated - review algorithms control flow. In case of add and sub routines it seems there are nothing to enchance in such way, but mul and div are subjects for such analysis and improvements. For mul algorithms first of all we can reduce number of first operand shifts. We can calculate the "distance" to next non-zero bit in second operand and shift first operand by that number of bits. For instance, following code snippet is consequetly checks second operands bits for set ones and calculates such "distance":

mov ebx, [lpUIntR]
xor eax, eax
mov esi, eax
mov edi, eax
__L_BitTest:
bt [ebx][esi * 4], eax
jc __L_BitIsNonZero
inc edi
__L_BitTestLoop:
inc eax
cmp eax, 32
jne __L_BitTest
xor eax, eax
inc esi
cmp esi, [cnSize]
jne __L_BitTest
jmp __L_MulIsDone

But such manipulation can impact perfomance in case of "dense" second operand.

The div algorithms even more prone to improvements. As you remember, before iterations starts dividend and divider must be "aligned" to each other. This startup "align" operation can be made by one "pass" (before iterations). This can improve perfomance in case of dividing numbers with near-valued exponents but not in case of dividing numbers such as in source. Another enchancement is to calculate distance to shift buffer left to align it to the divider inside the iterations loop. But the advantages of such improvement are subtle: buffer is changed by every iteration.

Anyway you will find appropriate solution experimentally in case of interest. So in Source code you will find coarse algoritms implementations.

And now let me try to impress you. Look at results of test from source:

Dividend: 3121748550315992231381597229793166305748598142664971150859156959625371738819765620120306103063491971159826931121406622895447975679288285306290175

Divider: 100000

Result: 31217485503159922313815972297931663057485981426649711508591569596253717388197656201203061030634919711598269311214066228954479756792882853062

Remainder: 90175

Exponents of handled integers are impressive as for me.

## Conclusions

There is nothing impossible with assembler. It gives us all the power of system, but we must to use it with extremelly care. As in case of integer operations you can let yourself to forget about loss of precision in integer calculations. But don't forget: we are always have to pay for everything. In this case you will pay with your application perfomance.

P.s. Sorry for my poor English.