|I'd prefer not to do buddy recombination on the fly, i.e. when allocating/deallocating a block. I want neither allocation nor deallocation to be delayed by block recombination.
If you run out of space for bigger blocks, the freelists for smaller blocks may be long. To find buddies fit for combination is similar to sorting the list. It would be inefficient to go to the first smaller list (below the requested size) for the search; it may have viable candidates, but one block's buddy may live as two not-yet-combined halves in the second smaller list. So recombination should start at the smallest unit, and proceed upwards.
This you can leave to a very low priority task. The task unhooks a freelist for a given size, possibly leaving a couple entries for other threads' requests while the combination is being done. The task sorts the unhooked freelist to make combinable buddies list neighbors. If the heap start address is a multiple of the largest block that can be allocated, an XOR between the neighbors' addresses leaves the single bit that is the block size being processed; that is super fast. The two are unlinked, and the lower put into a separate list. Forget the upper one from now on - it is merged into the lower. When reaching the end of the list, the task is left with two lists: The reduced one (of the size being processed) and a list of recombined next size blocks, to be added to that freelist.
At the start, the task must reserve one freelist head while unhooking; that is done in a couple of instructions. At the cleanup, it must again reserve the freelist head to hook the reduced list back in, and then add its entries to the next bigger freelist. Both operations are done in a couple of instructions. There is no hurry: Noone is hurt if the task must wait in line before getting to add its entry to the two list heads. The order doesn't matter; if one head is busy, it can do the other one while it waits.
The task processes one size at a time. Each size usually has a moderate size freelist, so the time spent doing the job is moderate. Doing a real list sort might be a good idea: With allocation done from the list head, after the combination, the lowest available block is allocated for the following requests. If a bigger buddy must be split, it will be the lowermost of those. This strategy tends to gather active heap blocks at lower addresses, leaving the higher addresses for larger blocks (without the risk of a few single small block allocations in the upper range 'polluting' a large heap area). A block release may be a block at a higher address, but at the next compaction, the sorting will move it further out it the freelist, reducing the chances of it being allocated anew.
For the sorting part: When the compaction task unlinks a freelist, ready to start the sorting, chances are that the last part of the list are the same entries as on the previous compaction. There are suitable list sorting algorithms of O(n) on a sorted list, so the cost is dominated by the number of new freelist entries - which is probably not very high, since those freed are the first to go out of the list again. If the list has no already sorted tail, it indicates that the list has been all empty in the meantime, and probably hasn't grown that big yet.
If you let this low priority recombination task iterate over available block sizes, from bottom up, you will maintain a distinct locality of the heap use, towards lower addresses. The iteration frequency may be self-tuning: If the task found no more than, say, two combinable buddy pairs in any list, it may increase its delay before the next iteration by 50%. If it found sixteen pairs or more in any list, it may half its delay.
Given a background task for recombining buddies, allocation can be done in a couple instruction from a non-empty freelist, adding a small handful of instructions for each level it must move upwards if the list is empty. Release is (always) done in a couple of instructions.
Obviously: If you really run out of heap space - even the freelist for the largest block size is empty - then the recombination task must be force activated to combine buddies from the smallest up to the biggest block size. If even that fails, then you are really out of heap. If it succeeds, take it as a hint that the task should be run at a higher frequency!