Outage of gitlab.common-lisp.net

Raymond Toy toy.raymond at gmail.com
Sat Apr 16 22:56:11 UTC 2022


Sounds like a series of unfortunate events.

Thanks for your hard work in getting this all back up again.

On Fri, Apr 15, 2022 at 8:18 AM Erik Huelsmann <ehuels at gmail.com> wrote:

> Hi,
>
>
> The common-lisp.net GitLab instance has been in maintenance mode since last
> night due to an unexpected outage: around 19:20 CEST, phoe approached me on
> the common-lisp.net:matrix.org chat channel (a.k.a. #common-lisp.net:libera.chat)
> about the unavailability of the with-contexts project. Logging into the
> system, I quickly saw that one of the discs had filled up, causing this
> failure. A bit of research showed that some 23GB of disc space (less than
> 10% of the space on the volume) was taken by Prometheus, a tool we're not
> using but which comes out of the box with GitLab. Disabling Prometheus and
> removing its files quickly freed up enough space for basic web requests to
> work again.
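>
> For what it's worth, most of that research boils down to totalling up the
> size of each top-level data directory to see what is eating the space. A
> minimal Python sketch of such a check (the base path below is only an
> example, not necessarily the layout on this machine):
>
>     #!/usr/bin/env python3
>     """Report which top-level directories under a base path use the most space."""
>     import os
>     import sys
>
>     def dir_size(path):
>         """Sum the sizes of all regular files below 'path', skipping unreadable ones."""
>         total = 0
>         for root, _dirs, files in os.walk(path, onerror=lambda err: None):
>             for name in files:
>                 try:
>                     total += os.path.getsize(os.path.join(root, name))
>                 except OSError:
>                     pass  # file vanished or is unreadable; ignore it
>         return total
>
>     if __name__ == "__main__":
>         # Example base path; pass the volume that filled up as the first argument.
>         base = sys.argv[1] if len(sys.argv) > 1 else "/var/opt/gitlab"
>         dirs = [os.path.join(base, e) for e in os.listdir(base)]
>         sizes = [(dir_size(d), d) for d in dirs if os.path.isdir(d)]
>         for size, path in sorted(sizes, reverse=True):
>             print(f"{size / 1024**3:6.1f} GiB  {path}")
>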
> To leave some room for various processes to operate with, I aim for a maximum
> of 80% fill-rate on the volumes in the VM. So, I went looking for more
> opportunities to clean out storage. At one point, I ended up in GitLab's
> PostgreSQL directory, where there was a little more than a GB of storage to
> be won. Not a lot, but since I was cleaning anyway, it seemed like a good
> thing to look at. There clearly were old Pg clusters lying around (various
> Pg10 and Pg11 clusters, while we were running on 12), and there also was a
> script called "delete_old_clusters.sh". It seemed better to use a script from
> the vendor than to meddle with the database data myself, so I used it.
> HOWEVER: it immediately and without warning *removed the production database*
> (contrary to the expectation that it would remove only the *old* clusters)!
> Although this is a rather unfortunate series of events, I quickly recovered
> from the heart attack that followed, turned off as many services on the
> machine as possible and searched (a) for older database copies and (b) for
> dumped backups. Unfortunately, misfortune never comes alone: as soon as I
> found the backup, I realized it was from February 27th. The backup system
> that had been running without problems for *years* had stopped running
> after March 1st and none of the current maintainers noticed: since the
> backup procedure didn't generate an error but was simply never executed,
> there were no mails about backups failing. On top of that, it turns out
> that the system I have in place to report disk usage problems wasn't
> delivering messages about the common-lisp.net disk overage either!
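>
> For the curious: the fill-rate check itself is simple enough that a small
> script run from cron can cover it. A rough Python sketch (the mount points
> are examples only; the 80% limit matches the policy mentioned above, but the
> real monitoring setup differs in its details):
>
>     #!/usr/bin/env python3
>     """Warn when any of the listed volumes exceeds the maximum fill-rate."""
>     import shutil
>     import sys
>
>     VOLUMES = ["/", "/var/opt/gitlab"]  # example mount points, not the real list
>     MAX_FILL = 0.80                     # alert above 80% usage
>
>     def check(volumes=VOLUMES, max_fill=MAX_FILL):
>         ok = True
>         for mount in volumes:
>             usage = shutil.disk_usage(mount)
>             fill = usage.used / usage.total
>             status = "OK" if fill <= max_fill else "OVER LIMIT"
>             print(f"{mount}: {fill:.0%} used ({status})")
>             ok = ok and fill <= max_fill
>         return ok
>
>     if __name__ == "__main__":
>         # A non-zero exit status lets cron or the monitoring system send a mail.
>         sys.exit(0 if check() else 1)
>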
> I've restored the database backup from the 27th and we have the backup
> procedure running again. This means that anything stored in the database
> (MRs, issues, etc.) is back to its state of the 27th. The *repositories*
> are fine and were never in danger!
>
> So far, I've waited to re-enable the service, because I've contacted the
> #gitlab:libera.chat channel to ask if anything can be done to ensure
> consistency between the repositories and the database. Up to now, the channel
> has remained silent (not just to my question, but to any questions posed).
> I am planning to restore access on Monday night CEST if no answer appears
> on the GitLab channel, or earlier if a usable answer is provided.
>
>
> Let me close off this mail by offering my sincere apologies for failing
> the trust you have put in me and for any inconvenience this may have
> caused. Please report any inconsistencies you run into to admin at
> common-lisp.net so we can work on fixes. Additional controls are being put
> in place to prevent a similar situation in the future: "ping" messages from
> the monitoring infrastructure, and checks on the off-site backup system to
> verify that the weekly full backup (and the daily incrementals) have
> actually been delivered.
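>
> As an illustration of what such a delivery check could look like (the backup
> directory, file name patterns and age limits below are assumptions for the
> sketch, not the actual setup), a small Python script on the off-site system
> might do:
>
>     #!/usr/bin/env python3
>     """Fail when the newest full or incremental backup is older than its limit."""
>     import sys
>     import time
>     from pathlib import Path
>
>     BACKUP_DIR = Path("/srv/offsite-backups/common-lisp.net")  # example path
>     DAY = 24 * 3600
>
>     def newest_age(pattern):
>         """Age in seconds of the newest file matching 'pattern', or None if absent."""
>         files = list(BACKUP_DIR.glob(pattern))
>         if not files:
>             return None
>         return time.time() - max(f.stat().st_mtime for f in files)
>
>     def main():
>         # Weekly fulls get about a day of slack; daily incrementals get two days.
>         checks = [("full-*.tar.gz", 8 * DAY), ("incr-*.tar.gz", 2 * DAY)]
>         ok = True
>         for pattern, limit in checks:
>             age = newest_age(pattern)
>             if age is None or age > limit:
>                 print(f"MISSING OR STALE: {pattern}")
>                 ok = False
>             else:
>                 print(f"OK: {pattern} is {age / DAY:.1f} days old")
>         return 0 if ok else 1
>
>     if __name__ == "__main__":
>         sys.exit(main())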
>
>
> --
> Bye,
>
> Erik.
>
> http://efficito.com -- Hosted accounting and ERP.
> Robust and Flexible. No vendor lock-in.
>


-- 
Ray