Outage of gitlab.common-lisp.net

Sun Apr 10 19:27:09 UTC 2022

Hi,

The common-lisp.net GitLab instance has been in maintenance mode since last
night due to an unexpected outage: around 19:20 CEST, phoe approached me on
the common-lisp.net:matrix.org chat channel (a.k.a.
#common-lisp.net:libera.chat) about the unavailability of the with-contexts
project. After logging into the system, I quickly saw that one of the discs
had filled up, causing this failure. A bit of research showed that some
23GB (less than 10% of space on the volume) of disc space was taken by
Prometheus, a tool we're not using, but which comes out of the box with
GitLab. Disabling Prometheus and removing its files quickly freed up enough
space for basic web requests to work again.

To leave some room for the various processes to operate, I aim for a
maximum fill-rate of 80% on the volumes in the VM. So, I went looking for
more possibilities to clean out storage. At one point, I ended up in
GitLab's PostgreSQL directory, where there was a little more than a GB of
storage to be won. Not a lot, but since I was cleaning anyway, it seemed
like a good thing to look at. There were clearly old Pg clusters (various
Pg10 and Pg11 clusters, while we were running on 12). There was also a
script called "delete_old_clusters.sh". It seemed better to use a script
from the vendor than to meddle with the database data myself, so I used it.
HOWEVER: it immediately and without warning *removed the production
database* (contrary to the expectation that it would remove the *old*
clusters lying around)!

Although this was a rather unfortunate series of events, I quickly
recovered from the heart attack that followed, turned off as many services
on the machine as possible and searched (a) for older database copies and
(b) for dumped backups. Unfortunately, misfortune never comes alone: as
soon as I found the backup, I realized it was from February 27th. The
backup system that had been running without problems for *years* had
stopped running after March 1st and none of the current maintainers
noticed: since the backup procedure didn't generate an error, but simply
was not executed, there were no mails about backups failing. On top of
that, it turns out that the system I have in place to report disk usage
problems wasn't delivering messages about the common-lisp.net disk overage
either!

I've restored the database backup from the 27th and we have the backup
procedure running again. This means that anything stored in the database
(MRs, issues, etc.) is back to its state of February 27th. The
*repositories* are fine and never were in danger!
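
As an illustration of the 80% fill-rate limit mentioned above, here is a
minimal Python sketch of a volume check. It is not the monitoring that runs
on common-lisp.net; the function names and the "/" mount point are
assumptions made for the example, and the used/total ratio it computes can
differ slightly from what df reports, because df accounts for reserved
blocks.

```python
import shutil

FILL_LIMIT = 0.80  # the 80% maximum fill-rate mentioned above


def fill_rate(path):
    """Return the used fraction (0.0-1.0) of the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def volumes_over_limit(paths, limit=FILL_LIMIT):
    """Return (path, fill_rate) pairs for volumes filled beyond `limit`."""
    rates = [(path, fill_rate(path)) for path in paths]
    return [(path, rate) for path, rate in rates if rate > limit]


if __name__ == "__main__":
    # Hypothetical mount point; adjust to the volumes on the actual VM.
    for path, rate in volumes_over_limit(["/"]):
        print(f"WARNING: {path} is {rate:.0%} full")
```

Run from cron, a check like this only helps if its warnings actually reach
someone, which is the part that failed here.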

So far, I've held off on re-enabling the service, because I've contacted
the #gitlab:libera.chat channel to ask if anything can be done to ensure
consistency between the repositories and the database. Until now, the
channel has remained silent (not just to my question, but to any questions
posed). I plan to restore access on Monday night CEST if no answer appears
on the gitlab channel, or earlier if a usable answer is provided before
then.

Let me close off this mail by offering my sincere apologies for failing the
trust you have put in me and for any inconvenience this may have caused.
Please report any inconsistencies you run into to admin at common-lisp.net
so we can work on fixes. Additional controls are being worked on to prevent
a similar situation in the future: "ping" messages from the monitoring
infrastructure and checks on the off-site backup system to verify that the
weekly full backup (and the daily incrementals) have been delivered.
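
To make the backup check above concrete, here is a minimal Python sketch
of a freshness test: it flags a backup directory as stale when the newest
file in it is older than expected, which also catches a backup job that
silently stopped running. The directory names and age limits are
hypothetical, and this is not the check actually being deployed.

```python
import time
from pathlib import Path


def newest_mtime(directory, pattern="*"):
    """Return the mtime of the newest matching file, or None if there is none."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in directory.glob(pattern) if p.is_file()]
    return max(mtimes, default=None)


def backup_is_stale(directory, max_age_hours, pattern="*", now=None):
    """True if no backup exists or the newest one is older than max_age_hours."""
    now = time.time() if now is None else now
    newest = newest_mtime(directory, pattern)
    if newest is None:
        return True
    return (now - newest) > max_age_hours * 3600


if __name__ == "__main__":
    # Hypothetical locations and limits: a weekly full backup (with a day of
    # slack) and daily incrementals (with two hours of slack).
    checks = [("/backups/full", 8 * 24), ("/backups/incremental", 26)]
    for directory, max_age_hours in checks:
        status = "STALE" if backup_is_stale(directory, max_age_hours) else "ok"
        # Always print a line, so the report doubles as a "ping": if the
        # report itself stops arriving, that silence is a signal too.
        print(f"{directory}: {status}")
```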

-- 
Bye,

Erik.

http://efficito.com -- Hosted accounting and ERP.
Robust and Flexible. No vendor lock-in.