It’s been a while since I was in charge of managing a MySQL server with some load on it. So things went the way they always do - I fucked up.

What was the problem?

I was quickly setting up a database server to provide a testing/staging environment for two team members. I just installed the packages and relied on the default MySQL settings. I know, you shouldn’t, but I thought “it’s not a production env, there won’t be much traffic, we are currently only testing things”. Yeah, fuck-up number one.

My second failure was skipping the monitoring setup - you know, “no production traffic, not mission critical, will do that later”. Well, later took too long.

As you may have guessed from the post’s title: the disk of the database server ran out of space. We were testing a huge batch workload in that environment, so there were many inserts on that database. This resulted in very large MySQL binary logs. To clean up the binary logs, MySQL provides PURGE BINARY LOGS..., which can easily be used on an online database to clean up the data and free up disk space. Unfortunately, the database server was down and could not be restarted because no space was left.

The solution

Finding a solution took longer than expected - most answers were not helpful since they pointed to PURGE BINARY LOGS..., which needs a running server. Most answers also advised against deleting the binlog* files directly.

Fortunately I came across a post on StackExchange hinting at the solution:

there is a file that records all the binlogs, named binlog.index. You can edit this file, remove the first N lines of it, and then remove the corresponding MySQL binlog files.

then you can safely start the server

And it worked like a charm. Removing the first few binlogs freed up enough space to get the MySQL server back up and running. Then I was able to remove all additional binary logs that were not required anymore.
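
For reference, here is a minimal Python sketch of that manual procedure. It is not what I ran at the time (I did it by hand), and the function name drop_oldest_binlogs as well as the default index file name are just assumptions for illustration. It deletes the binlog files first and rewrites the index afterwards, because on a completely full disk there may be no room to write anything until some files are gone. Only run something like this while the server is stopped.

```python
from pathlib import Path


def drop_oldest_binlogs(data_dir: Path, count: int,
                        index_name: str = "binlog.index") -> list[Path]:
    """Delete the `count` oldest binlog files and drop them from the index.

    Hypothetical helper: `index_name` is assumed to match the index file
    in your data directory. The MySQL server must be stopped.
    """
    index_path = data_dir / index_name
    entries = [line for line in index_path.read_text().splitlines() if line.strip()]

    # Never remove the newest binlog, no matter how large `count` is.
    count = max(0, min(count, len(entries) - 1))
    doomed, kept = entries[:count], entries[count:]

    removed = []
    for entry in doomed:
        # Index entries are usually relative, e.g. "./binlog.000001".
        path = Path(entry)
        if not path.is_absolute():
            path = data_dir / path
        if path.exists():
            path.unlink()
        removed.append(path)

    # Rewrite the index so it only lists the files that still exist.
    index_path.write_text("".join(f"{entry}\n" for entry in kept))
    return removed
```

A call could look like drop_oldest_binlogs(Path("/var/lib/mysql"), 3), where /var/lib/mysql is only an assumed data directory; check the datadir setting of your own installation first.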

Avoid that problem in the future

To prevent the binlogs from filling up your disk, you can set the option binlog_expire_logs_seconds (MySQL 8.0 and later) or the older expire_logs_days (deprecated in MySQL 8.0) to let MySQL delete old binlog files for you. You will still have to monitor your disk space usage and maybe adjust the variable (or disk space 😉)
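
Until proper monitoring is in place, even a tiny script run from cron is better than nothing. The following is a rough Python sketch of such a disk check; the function names and the 80% default threshold are my own assumptions, not anything MySQL ships.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path: str, threshold_percent: float = 80.0) -> bool:
    """Print a warning and return True when usage reaches the threshold.

    The 80% default is an arbitrary example value; pick one that leaves
    enough headroom for your own workload.
    """
    percent = disk_usage_percent(path)
    if percent >= threshold_percent:
        print(f"WARNING: {path} is {percent:.1f}% full")
        return True
    return False
```

Pointing check_disk at the MySQL data directory and sending the warning somewhere you will actually see it would have saved me the whole incident.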

Learnings

The main takeaway is to set up monitoring right away - also for non-production environments. Looking back, that’s kinda obvious!

But trying to support other people in doing their job or trying to meet deadlines often leads to a quick and dirty approach. I don’t think that’s necessarily a bad way of working - I think fast and pragmatic solutions are often undervalued. But you must force yourself to do another iteration after the quick fix or POC to improve the things you have built. That may be code or infrastructure!