Writing a MySQL Storage Engine from Scratch

cruppstahl

4.91/5 (17 votes)

Jun 20, 2016

CPOL

12 min read

73080

Describes how to write a MySQL storage engine - a plugin which is responsible for persisting MySQL table data.

Download source code - 1.6 MB

Introduction

This article summarizes the things that I learned when I wrote a new MySQL storage engine (“SE”). MySQL’s storage engines are plugins which are responsible for actually storing the data on disk, and providing access to the data. The SEs usually are implemented as key/value databases, but not necessarily have to. Oracle’s MySQL comes with a variety of SEs, and the whole SE landscape can be a bit overwhelming. The default is InnoDB, a key/value database library. Other SEs can read and write CSV files (“Tina”) or were written specifically for archiving the data (“archive”). MariaDB, a fork, is more progressive and includes SEs for accessing Cassandra NoSQL databases, Percona’s TokuDB key/value library and xtradb, an improved version of InnoDB.

In total, MySQL comes with 13 SEs, MariaDB has about 20. That’s clearly not enough. Let’s add another one!

Downloading Code and First Impressions

You can download the MySQL code from GitHub. I am working with version 5.7.12.

    git clone https://github.com/mysql/mysql-server.git
    cd mysql-server
    git checkout mysql-5.7-12

Let’s have a brief look at the source code structure. The amount of files and directories is a bit intimidating, but most of them are not relevant for us. There are just a few directories which are worth a closer look:

    include           Stores global include files. Here you will find mysql.h, 
                      which has many important macros and declarations.
    sql               Contains the actual sql related code: parser, query engine etc. 
    sql/field.h       The Field class describes a MySQL column
    sql/item.h        The Item class is a node in the abstract syntax tree 
                      which is generated by the parser
    sql/sql_*.cc      These files implement the actual SQL operations
    sql/handler.h     The base class for a storage engine
    storage           This directory has all the storage engines
    storage/innobase  The InnoDB storage engine
    storage/example   A storage engine example project

After browsing through the code, it becomes clear that parts of MySQL’s code base are relatively old. While most code is implemented as C++, it’s actually a “C with classes”. More advanced C++ features like templates, exceptions or the standard library are not used. Linked lists are implemented as structures wrapping a “void *” object (my_list.h). MySQL also has its own string implementation (string.c) instead of using std::string. RAII is also not used. Instead, you will find many “goto” instructions which are responsible for cleaning up allocated resources.

Functions can be extremely long (more than 1000 lines of code, with many levels of nested “if” clauses - mysql_prepare_create_table), and some of the classes are also huge. Most methods are stored in the base class instead of derived classes, and in addition, the classes store a lot of state.

MySQL uses several memory allocators (memroot_allocator.h). They are not STL compatible. Many classes overwrite operator new and operator delete to use a custom allocator.

The code has no consistent coding style. Indention varies, structure names are sometimes capitalized (“struct THD”), lower case with underscores (“st_mem_root”) or start with a capital letter and continue with underscores (“Field_long”). The use of spaces varies as well (foo= 3; foo = 3; foo=3).

I have never spoken to someone working for Oracle or MySQL AB, but we can try to make a few guesses about their corporate culture based on the code we have seen. The lack of a coding style means that each engineer can pick the one he’s most familiar with. Coding styles vary a lot, even in the same file, which can be a sign that there is no “code ownership” and everyone can work on every part of the code. I have worked with similar code in the past, in a similar environment, and it was a great place to work.

The use of “C with classes”, the refusal to use the STL and the archaic long and complex functions will require lots of refactoring if MySQL wants to get rid of its technical debt and attract new developers.

Compile, Install, Run, Debug

Let’s continue with the installation. If you followed the steps above, then you already checked out the code into the “mysql-server” directory. In here, we will run cmake, which will generate the Makefiles. We will create a Debug build which is slower but can be used for debugging.

    cd mysql-server
    cmake -DCMAKE_BUILD_TYPE=Debug
    make -j 5
    sudo make install

Your compiled MySQL files are now installed in /usr/local/mysql. You can run cmake in a subdirectory to have separate builds for Debug and Release, and you can choose different installation directories. Run “cmake --help” for a list of options. For this article, I’ll just keep everything as simple as possible.

Now there are just a few more things to do till we can successfully start the MySQL daemon. We need to create a new data directory (which stores the tables) and initialize it (note that you will have to change the path names in the following commands):

    mkdir /home/chris/tmp/mysql-data # must be empty, or else...
    cd /usr/local/mysql/bin
    ./mysqld --initialize --datadir=/home/chris/tmp/mysql-data

The last command will generate a root password. Write it down, you will need it in a moment. I’ve set up my development environment with an empty root password - it’s just more convenient while testing. You can reset the password with the following commands (replace <PASSWORD> with the generated root password from above):

    # start the server
    ./mysqld --datadir=/home/chris/tmp/mysql-data

    # in a separate terminal we can now change the password 
    ./mysqladmin  password --user=root --password=<password>

Now we can create a new database:

    ./mysqladmin create test --user=root --password

The client can be started as well, and then you can finally create tables, insert data, etc.

    ./mysql --user=root test

Bootstrapping a New Storage Engine

A storage engine is implemented as a dynamic library (a “.so” file), and the sources are stored in the “mysql-server/storage” directory. If you just want to toy around then you can modify the existing “example” SE. I chose to create my own called “upscaledb” with the following steps:

    cd mysql-server/storage
    # copy the “ha_example” directory to “ha_upscaledb”
    cp -r ha_example ha_upscaledb
    # rename the files
    cd ha_upscaledb
    mv ha_example.h ha_upscaledb.h
    mv ha_example.cc ha_upscaledb.cc

As a last step, use your favourite IDE (I use vim) to replace “example” with “upscaledb” and “EXAMPLE” with “UPSCALEDB”. Don’t forget to change the CMakeLists.txt file as well. Then go to the root directory of mysql-server and once more run “cmake” and “make”. The new upscaledb storage engine is now built, the file name is mysql-server/storage/upscaledb/ha_upscaledb.so.

Now we have to inform MySQL about the new storage engine. First, we create a symbolic link from the installation directory to the new .so file. With this link, our server will always use the newest version of the .so file. Whenever we make changes, we simply compile the storage engine and restart the MySQL server.

    cd /usr/local/mysql/lib/plugin
    sudo ln -s ~/prj/mysql-server/storage/upscaledb/ha_upscaledb.so ha_upscaledb.so

The final step is to update MySQL’s internal system table. We can do this with the MySQL client. Make sure that the MySQL server is still running!

    cd /usr/local/mysql/bin
    ./mysql --user=root test

    mysql> INSTALL  PLUGIN upscaledb SONAME ‘ha_upscaledb.so’;

You can now start using the new SE and try to create tables (CREATE TABLE test (value INTEGER) ENGINE=upscaledb;). Since our SE is just a skeleton without any implementation, we will get an error. Before we start adding logic, I’ll show you how to debug the MySQL server. The following command starts the server in gdb, and lets gdb catch all signals (i.e., CTRL-C breaks into the debugger):

    cd /usr/local/mysql/bin
    gdb --args ./mysqld --gdb --datadir=/home/chris/tmp/mysql-data

Try setting a breakpoint in the “create()” method of your SE, and run the CREATE TABLE command from above once more!

Adding Functionality

It’s now time to fill in the various methods of our Handler class. It would need too much time to describe every method in detail. Besides, some of them are still unknown to me. But I will give an overview, and provide links to actual implementations.

It is important to be aware that there can be multiple instances of a Handler working on the same table. The actual table data therefore needs to be stored in a separate object (a “Share”), and the Handlers, then acquire the Share. If your data is stored in a plain file, then the file handle would be a member of the Share, not of your Handler.

You can look at the ExampleShare here and my UpscaledbShare here. You see that the UpscaledbShare stores the actual database handles, and additional metadata about the databases.

Creating Tables

ExampleHandler::create and UpscaledbHandler::create are called whenever a table is created. Immediately afterwards, MySQL will call the open() method on that table. Your create() method therefore can prepare the table, but does not actually have to open it.

The implementation of UpscaledbHandler::create() is straight forward. It creates an upscaledb Environment, then creates a database for each index. If the user created a table without a primary key, then an index is generated. The database configuration depends on the actual type of the column and a few other parameters (i.e., whether it is unique or not).

Opening Tables

ExampleHandler::open and UpscaledbHandler::open are called whenever a table is opened. As mentioned above, this can (and will) happen many times for the same table. You therefore have to store the actual table data in your “Share”.

After retrieving a pointer to the Share object, the UpscaledbHandler::open() method checks if the upscaledb environment was already opened. If yes, it returns immediately. If not, then it proceeds with opening the file, and stores the environment’s handle in the “Share”.

Closing Tables

The ExampleHandler::close and UpscaledbHandler::close methods are supposed to close the table. If your table data is stored in the Share, then you might use reference-counting to figure out when to destroy the Share object. In my UpscaledbHandler, I never destroy the share; after all, the Share will be required again sooner or later.

INSERTing Rows

Whenever you invoke the INSERT SQL statement, the write_row() method of your handler is invoked. Its only parameter is the new row, serialized in a byte array. This array stores the actual columns in different order than the one you specified when calling the CREATE TABLE statement. The primary key is always at the beginning, followed by all other indexed columns, and finally followed by the non-indexed column.

This byte array usually starts with an (optional) bitmap describing the null values in the current column. It is followed by either fixed length columns, or columns with variable length (a TINYTEXT, MEDIUMTEXT, TEXT, LONGTEXT or one of the corresponding BLOB columns). The variable length columns start with one or two bytes storing the size of the column, followed by the data (which can be stored in a separate memory chunk; in this case the byte array contains an encoded pointer to the actual data). If you persist a row in a file, it makes sense to “condense” variable length rows in a more compact format, in order to reduce space.

My UpscaledbHandler caches the indexed Field objects to quickly extract the indexed columns. The following code can then be used to extract the key for an indexed column (“index” is the numerical ID of the index; the primary index is always 0).

static inline ups_key_t
key_from_row(TABLE *table, const uchar *buf, int index)
{
  KEY_PART_INFO *key_part = table->key_info[index].key_part;
  uint16_t key_size = (uint16_t)key_part->length;
  uint32_t offset = key_part->offset;

  if (key_part->type == HA_KEYTYPE_VARTEXT1
          || key_part->type == HA_KEYTYPE_VARBINARY1) {
    key_size = buf[offset];
    offset += 1;
  }
  else if (key_part->type == HA_KEYTYPE_VARTEXT2
          || key_part->type == HA_KEYTYPE_VARBINARY2) {
    key_size = *(uint16_t *)&buf[offset];
    offset += 2;
  }

  ups_key_t key = ups_make_key((uchar *)buf + offset, key_size);
  return key;
}

Having extracted the index keys, you can then store the row’s data in your file. You will have to deal with three cases:

The user did not specify ANY index or primary key
The user specified a primary key but no other indexes
The user specified a primary key and additional indexes

If that’s not complex enough, then also keep in mind that an index can be “virtual”, i.e., it combines multiple columns (as in the following statement).

    CREATE TABLE test (
        id         INT NOT NULL,
        last_name  CHAR(30) NOT NULL,
        first_name CHAR(30) NOT NULL,
        PRIMARY KEY (id),
        INDEX name (last_name, first_name)    -- creates a virtual index!
    );

Here is the implementation of the UpscaledbHandler::write_row method.

DELETEing Rows

The DELETE SQL command ends up calling the delete_row() method of your handler. You have to make sure that ALL indices are deleted, not just the primary one. However, deleting the primary key is actually quite simple because MySQL will use a database cursor to locate it. Here is the UpscaledbHandler::delete_row implementation.

UPDATEing Rows

This is the most complex one - at least if you want to make it fast. The update_row() method receives two parameters; the old row’s value and the new row’s value. The naive solution is to call delete_row() for the old row, then write_row() for the new row. This works, but is really slow because you end up updating all columns, even if just one was updated. It is much faster to only update those columns that were modified.

Cursors

For a lot of tasks, the MySQL core will just create a database cursor, locate a key (either on the primary or on a secondary index) and then move forward while working with the actual data. There are a few methods that you have to implement to support the cursors.

index_init(): creates a cursor for a secondary index
index_end(): can close the cursor
index_read_map(): positions the cursor on a row
index_next(): moves cursor to the next key, retrieves the row
index_prev(): moves cursor to the previous key, retrieves the row
index_last(): moves cursor to the last key, retrieves the row
index_first(): moves cursor to the first key, retrieves the row
index_next_same(): moves cursor to the next duplicate of the current key
rnd_init(): creates a cursor for the primary index
rnd_end(): closes the cursor
rnd_next(): moves cursor to the next key, retrieves the row
rnd_pos(): moves cursor to a specified row

Other Methods

A few other methods which are important or interesting enough to be mentioned.

rename_table(): renames a table (and all its files). This is used whenever your schema is changed, i.e., because you add a column. MySQL copies all the data to a temporary table, then uses the rename_table() method to rename the temporary table into your original table.
delete_table(): deletes a table (and all its files).
table_flags(): returns a set of flags which describe the capabilities of your handler. A lot of these are not well documented, and many are geared towards InnoDB. My guess is that only InnoDB implements all capabilities.
index_flags(): returns a set of flags which describe the capabilities of your handler - i.e., whether your index_prev() and index_next() methods are provided, etc.

Conclusion

Writing your own MySQL Storage Engine sounds like a complex endeavor, but actually it is not. You do not have to implement all the methods that I described above. For some methods, MySQL will deal with a missing implementation and provide its own. For others, it will simply abort some SQL commands with an error. The ExampleHandler is basically empty and does not provide any functionality. But nevertheless you can load it, place breakpoints in a debugger and start adding functionality piece by piece. If you get stuck, then you can have a look at existing SEs of MySQL and MariaDB. And the developers on the mysql-internals mailing list are also very helpful.

Ideas for interesting SEs coming to my mind:

An append-only database which persists its data in HDFS; users can then use Spark or Hadoop’s Map/Reduce jobs to further process the data
In-memory tables based on std::map or std::multi_map
SEs which are backed by XML files

What other ideas do you have?

History

20^th June, 2016: Initial version