Optimising multiple MySQL INSERTs

March 28, 2009

Starting last weekend with a post on handling timezone conversion with PHP DateTime, we’re aiming to publish a technical blog post each weekend to give an insight into some of the things we’re doing in our code. Today I’m looking at how we’re making our MySQL queries, specifically inserts of many rows, as efficient as possible.

This week has been quite quiet in terms of blog posts. We’ve been talking to some of our beta users and making tweaks to our server monitoring product, Server Density, but the main goal has been to complete the new Server Snapshot feature – viewing the state of the server at a given point in time, including load, memory usage and the running processes. This is a great example of how feedback drives development: it was suggested by a user, but it also solves a problem we’ve had ourselves in the past – troubleshooting load spikes after the event has been resolved. The feature was finished today, so next week I’ll be writing about it in more detail and providing a video demo, as well as pushing all the recent updates out to beta users.

As part of the feature, sd-agent needed some changes: instead of sending a simple HTTP POST value for each piece of data posted back to our servers, it now sends a single JSON payload containing all the data. This was necessary because we now record all the processes that are running at the time the agent does its sampling. Since Linux servers are often running 100 or more processes, we needed this data in a structured form, and JSON provides that. It also makes it easy to add extra metrics at a later date. The update to the agent will be released next week with the new feature, although we will be maintaining backwards compatibility with the old postback method until the end of the beta.
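To illustrate why JSON helps here (the field names below are hypothetical, not the actual agent format), a JSON payload can carry any number of processes in a single POST field and be decoded back into a structured array on the PHP side:

$payload = '{"processes": [
    {"user": "root",  "cpu": 0.0, "memory": 0.1, "command": "init"},
    {"user": "mysql", "cpu": 2.3, "memory": 5.6, "command": "mysqld"}
]}';

// Decode into nested associative arrays so each process can be handled
// individually, for example inserted as a row into checksProcesses.
$data = json_decode($payload, true);

foreach ($data['processes'] as $process) {
    // ... store $process['user'], $process['cpu'], etc.
}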

The biggest problem we had when working on this feature was the large amount of data that will inevitably be generated by storing the full list of running processes and associated statistics (CPU %, memory %, running time, etc.). Even a relatively low number of processes, 100 for example, sampled once a minute would generate 6,000 rows per hour per server, but that is a topic for another blog post on server data capacity.

Since we’re going to be inserting at least 100 new rows every 60 seconds from each monitored server, we had to look at making those inserts as quick as possible. In PHP you could insert each row individually. We use MySQLi and so the call would be:

$core->db->query('INSERT INTO checksProcesses SET serverId = x, user = x...');

Of course, that means you’re going to be executing at least 100 queries every minute.
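As a rough sketch of what that looks like in full (using a plain mysqli connection rather than our database wrapper, with made-up values, and with column names beyond serverId and user assumed for the example):

// Hypothetical data; in reality this comes from the decoded agent payload.
$serverId  = 1;
$processes = array(
    array('user' => 'root',  'cpu' => 0.0, 'memory' => 0.1),
    array('user' => 'mysql', 'cpu' => 2.3, 'memory' => 5.6),
);

$mysqli = new mysqli('localhost', 'user', 'pass', 'serverdensity');

foreach ($processes as $process) {
    // One full send/parse/insert/close cycle for every single row.
    $mysqli->query(sprintf(
        "INSERT INTO checksProcesses SET serverId = %d, user = '%s', cpu = %f, memory = %f",
        $serverId,
        $mysqli->real_escape_string($process['user']),
        $process['cpu'],
        $process['memory']
    ));
}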

According to the MySQL 5.0 documentation on the speed of MySQL INSERT statements:

The time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:

* Connecting: (3)
* Sending query to server: (2)
* Parsing query: (2)
* Inserting row: (1 × size of row)
* Inserting indexes: (1 × number of indexes)
* Closing: (1)

This does not take into consideration the initial overhead to open tables, which is done once for each concurrently running query.

Every time you run the query() method, the above process is going to execute. That’s bad. Luckily, the MySQL documentation provides a solution that means we only need to do the connecting, sending the query to the server, parsing the query and closing once. The row and index inserts still happen for each row we insert, but they carry the lowest proportions compared to the others. Obviously, the proportion increases based on the size of the data and the number of indexes (we have 2 on the relevant table), but by cutting out the other parts we can increase performance.

How is this done?

If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements.

This means converting the query we have above into this:

INSERT INTO checksProcesses (serverId, user...) VALUES (x, x...), (y, y...), (z, z...)...
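In PHP, that single query can be built up from the same data. The sketch below again uses a plain mysqli connection rather than our wrapper, and the columns beyond serverId and user are made up, but it shows the idea:

$serverId  = 1;
$processes = array(
    array('user' => 'root',  'cpu' => 0.0, 'memory' => 0.1),
    array('user' => 'mysql', 'cpu' => 2.3, 'memory' => 5.6),
);

$mysqli = new mysqli('localhost', 'user', 'pass', 'serverdensity');

// Build one VALUES group per process, escaping string values as we go.
$values = array();
foreach ($processes as $process) {
    $values[] = sprintf(
        "(%d, '%s', %f, %f)",
        $serverId,
        $mysqli->real_escape_string($process['user']),
        $process['cpu'],
        $process['memory']
    );
}

// A single query: connecting, sending, parsing and closing happen once
// for the whole batch rather than once per row.
$mysqli->query(
    'INSERT INTO checksProcesses (serverId, user, cpu, memory) VALUES '
    . implode(', ', $values)
);

The only thing to watch with very large batches is the combined query size, which needs to stay within the MySQL server’s max_allowed_packet setting.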

Therefore we only have one query to execute. Another advantage of using a single INSERT is that the index buffer is only flushed to disk once, rather than after every INSERT, which adds overhead. If you do need to run multiple INSERT queries then you can lock the table before and unlock it after the set of queries, which has the same effect. The problem with this is that the lock prevents any further reads (and writes, depending on the lock type), which would not be desirable for our uses.
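For reference, the lock-based alternative would look something like this (again just a sketch, assuming $mysqli is an open connection and $queries is a list of single-row INSERT statements):

// Take a write lock so the index buffer is only flushed to disk once,
// after all of the INSERTs have completed.
$mysqli->query('LOCK TABLES checksProcesses WRITE');

foreach ($queries as $query) {
    $mysqli->query($query);
}

// Release the lock so other clients can read from the table again.
$mysqli->query('UNLOCK TABLES');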

A different option is to disable the indexes whilst the inserts are occurring. This removes another step from the process above, but you do need to enable the indexes again afterwards, which takes some time whilst the unindexed rows are processed, although it is still faster than doing individual INSERTs:

ALTER TABLE … DISABLE KEYS tells MySQL to stop updating non-unique indexes. ALTER TABLE … ENABLE KEYS then should be used to re-create missing indexes. MySQL does this with a special algorithm that is much faster than inserting keys one by one, so disabling keys before performing bulk insert operations should give a considerable speedup.

from the MySQL manual.
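Sketched in the same style, and assuming the table’s storage engine supports it, that approach looks like this:

// Stop MySQL updating the non-unique indexes during the bulk insert.
$mysqli->query('ALTER TABLE checksProcesses DISABLE KEYS');

// Run the multiple-row INSERT (or a batch of INSERTs) built earlier.
$mysqli->query(
    'INSERT INTO checksProcesses (serverId, user, cpu, memory) VALUES '
    . implode(', ', $values)
);

// Rebuild the missing non-unique indexes in one pass.
$mysqli->query('ALTER TABLE checksProcesses ENABLE KEYS');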

The aim of all this is to make the INSERT as efficient as possible. We are able to make a lot of optimisations at the server level to match our operating environment, but since we’re going to be allowing customers to download and install the main application on their own servers, we want to ensure we get as much performance as we can from the raw SQL.
