How to Make Software Deterministic

ocoudert

Rate me:

5.00/5 (1 vote)

31 May 2011CPOL9 min read

19.4K

How to make software deterministic

A program is deterministic, or repeatable, if it produces the very same output when given the same input no matter how many times it is run.

Refining this definition, we should consider whether a program produces the same result on any platform (32 and 64 bits machines, running Windows, Mac OS, Linux, Solaris, etc.). Or whether the program is insensitive to the form of its inputs. For example, the problem of generating the shortest route to visit all the capitals of Europe should not depend on how the map of Europe is entered, nor it should depend on which language is used to name the capitals.

Determinism is obviously very desirable. For the user, a non-deterministic program can be confusing and frustrating. For the developer, a non-deterministic program is extremely hard to test and debug, since bugs and specific configuration cannot be easily reproduced.

Repeatability looks like a given for most applications. For instance, if we add two numbers in a spreadsheet, we expect the same result no matter how many times we perform this operation and regardless of the platform we run on (PC, Mac, etc.). Or if we run a spell checker several times, we expect it to flag the very same errors.

But it is not that obvious for more complex applications. This is especially true when there are multiple solutions to a problem, or when heuristics are used to produce a result –because an exact solution is too computationally expensive. For example, it is not uncommon to see slightly different outcomes when running the same EDA synthesis or P&R tool on the same input several times.

Even more, a user would like to see the same result when only minor changes are applied to the input. For instances, running a P&R tool on two netlists that differ only by the names of their cells should produce exactly the same result –a P&R tools should produce a result that only depend on the netlist structure. But experience shows that industrial synthesis and P&R tool does not meet that requirement. Closest to software, it is not uncommon to generate slightly different object codes with gcc by changing the names of a few variables.

Among the causes of non-deterministic response, we can distinguish the following types:

A random number generator
Reading an uninitialized data
A race condition on concurrent threads
An unordered iteration that is assumed ordered
A total order that depends on memory address
A total order that depends on time stamps
A total order that depends on a non-canonical labeling

1. Random Number Generator

There are a lot of applications that use stochastic processes (e.g., simulated annealing, genetic algorithms, Monte-Carlo simulations), but that we would like to be repeatable. Using a pseudorandom number generator with a known seed makes it possible to reproduce the same long sequence of seemingly random numbers over and over again.

Note that some applications (e.g., gaming, cryptography, statistical sampling) require a non-deterministic behavior. In that case, the seed of the random number generator must be an always-changing value, for example, the host’s current time.

There are more deliberate efforts to produce true random values by relying on natural, chaotic events. For instance, Lavarand produces random numbers by hashing the frames of a video stream of lava lamps. HotBits generates random bits by timing successive pairs of radioactive decays detected by a Geiger-Müller tube interfaced to a computer. Random.org uses variations in the amplitude of atmospheric noise recorded with a normal radio.

2. Uninitialized or Random Data Read

Initialized data may not exist in languages that have systematic default values and no memory management control, as opposed to high performance languages like C/C++.

Finding and fixing these kind of issues is relatively simple. For instance, tools like Purify and Valgrind can report when a C/C++ code reads arbitrary values in memory. To use Purify’s terminology, such errors are UMR (Uninitialized Memory Read), ABR (Array Bound Read, i.e., dereferencing an array outside of its bounds), and FMR (Free Memory Read). These defects all consist in reading some random value in memory. The code below illustrates some of these errors.

C++

#include
#include 

int main() {
  bool large;
  int size = 100;

  if (large) {           // UMR
    size *= 10;
  }

  char* const str   = new char(size);
  char* const middle = str + size/2;
  // Set the end-of-string.
  str[size - 1] = '\0';
  // The loop intends to fill up the string str with 'a'.
  // But this loop is faulty, because *++c is used instead of *c++.
  char* c = str;
  for (int i = 1; i < size; ++i, *++c = 'a');

  printf("%c\n", str[0]);      // UMR
  printf("%c\n", str[1]);      // Ok, will print 'a'.
  printf("%c\n", str[2]);      // Ok, will print 'a'.
  printf("%c\n", str[size - 1]);  // Ok, but will print 'a'
                                  // instead of the expected '\0'.
  printf("%c\n", str[size]);    // ABR
  printf("%lu\n", strlen(str));   // UMR, because we overwrote
                                  // the final '\0' in the loop.
  delete [] str;
  printf("%c\n", *middle);     // FMR

  return 0;
}

Note that an ABW (Array Bound Write, i.e., writing outside of an array’s bounds), FMW (Free Memory Write), FNH (Freeing Non-Heap memory) and FUM (Freeing Unallocated Memory), although severe bugs also reported by dynamic analysis tools, are not an original source of non-determinism: they consistently reproduce the same bug.

3. Thread Races

Thread races are difficult to detect, and fixing them can be very costly. A typical example is when one thread writes a value at some address, and another thread reads the value at that address. Depending on which thread access the address first, the outcome of the program will be different. Two threads performing a non-atomic write at the same address simultaneously results in some unpredictable value.

One can use a mutex to prevent conflicting read/write for non-atomic operations. But racing threads (e.g., who reads/writes first) must be resolved with synchronization, which can be quite complicated. Moreover, it can hurt performances.

4. Iteration on Unordered Data

Iterating data with some random order can make a program non repeatable. This pattern is often encountered, and is easy to fix.

For example, an algorithm produces a result via a visitor that assumes a total order. The developer uses an incorrect visitor, which enumerates data in a random order, usually depending on the memory allocation of the data container. E.g., instead of using a std::set as a container, the developer uses a std::hash_set (or a tr1::unordered_set instead of a tr1::ordered_set). Forcing a total order on the data fixes the problem.

Note that the fix may be incomplete if it simply transforms a type (4) non-determinism into a type (5) or (6) non-determinism, which we discuss below.

5. Ordering by Pointer Value

This type of non-determinism is extremely common. For instance, a developer uses a tr1::ordered_set as a container of pointers, and feels that the visitor is deterministic. It is indeed deterministic, but only with respect to the memory addresses allocated to the data, which depend on factors out of the application’s control.

One way of addressing the problem is to force a specific memory addressing scheme, but that requires a very fine control of the memory allocator and is therefore complicated. A more common way consists in assigning a unique ID to an object at the time of its creation. The ID can be a 32 bit unsigned integer that is incremented for every new object. A total ordering, independent from the objects’ actual memory addresses, is then obtained from the IDs. It is a simple solution, as long as one can afford the extra 4 bytes for every object. ID-based sorting with no memory penalty can be obtained using custom memory allocators.

6. Ordering by Time Stamps

Note though that the total ID-based ordering described above is exactly the order of creation of the objects. It is no different from an order that depends on time stamps. Therefore, two equal sets of inputs that only differ in the order will be visited in a different order, which can lead to different results. This leads us to the type (7) of non-determinism.

7. Ordering Induced by a Non-canonical Labeling

Type (7) non-determinism often goes unrecognized, or is simply ignored. The idea is that as long as the same input is given to a program (but possibly in a different order or form), the output should be the same. If the input can somehow be normalized to a form that captures the notion of “same input”, then the program can be made insensitive to the format of the input. That is, of course, assuming that the normalization process run time penalty is not too high.

This normalization process is better defined as canonization. Formally, let O be a set of objects, and let EQ be an equivalence relation that captures the notion of “same” on these objects. A function Canon maps an object onto its canonical form, and is such that for any two objects o1 and o2, o1 and o2 are the same (i.e., o1 EQ o2) if and only if Canon(o1) = Canon(o2).

For instance, a set of integers can be represented by a number of containers (a list, an array, a hash set, a binary tree, etc). A canonical form can simply consist in sorting the integers. Two sets that are equal because they contain the very same integers, but that are initially given in different orders and forms, will end up in the same canonical form. Since sorting is an O(n log n) algorithm, this is an efficient canonization.

Canonization can be much more costly. A Boolean function can be represented in many ways, e.g., with a truth table, a Conjunctive Normal Form (CNF), a Disjunctive normal form (DNF), a decision diagram, etc. (see below). Boolean function canonization is at least NP-hard, since it solves the satisfiability problem (SAT). In practice, this means that Boolean function canonization algorithms have an exponential complexity.

Canonization can also be elusive. Let us consider the problem of drawing a graph in some aesthetic way (e.g., such that the nodes are evenly distributed and such that there is a minimum number of crossing edges). One would like the graph to be drawn the very same way, regardless of its representation (adjacency list or adjacency matrix), and regardless of the order the adjacency information is given. For instance, the three graphs below, although looking different, are exactly the same, and can be drawn without any edge crossing as shown on the right side.

Graph canonization is also known as graph labeling. It is at least as hard as graph isomorphism, one of these rare problems that are in NP but that are not known to be NP-complete or in P. Although all existing graph canonization algorithms have an exponential worst-case complexity, it is believed that graph canonization can be done in polynomial time.

Conclusion

The most common cause for non-determinism is related to some unreliable data order. The ultimate solution to make a program insensitive to the form of its input is to canonize its input as a pre-processing step. This proves to be a challenging and costly task in some cases. Whenever possible, canonization (or some imperfect normalization) goes a long way to make the application consistently repeatable.

CodeProject

This article was originally posted at http://www.ocoudert.com/blog/2011/05/30/how-to-make-software-deterministic

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

ocoudert

Architect OC Consulting

United States

I have 20 years experience in software architecture and product development, including 10 years experience in research. I worked at eBay, Synopsys, Mentor Graphics, Magma, and I am an independent consultant in software design and development. I have published 50+ research papers or book chapters, and invented several algorithms for which I hold a few patents.

I am interested in technology as a whole, in particular software, hardware, and web-based applications. Check me out on LinkedIn or twitter (@ocoudert).

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.