Outage of gitlab.common-lisp.net

Sun Apr 17 16:36:33 UTC 2022

Dear Erik,

As we say in Italy (pardon the translation): Fortune is blind, but Jinx has
20/20 vision, and usually aims straight.

Thank you for all the work you all are doing.

All the best

Marco

On Sun, Apr 17, 2022 at 12:56 AM Raymond Toy <toy.raymond at gmail.com> wrote:

> Sounds like a series of unfortunate events.
>
> Thanks for your hard work in getting this all back up again.
>
> On Fri, Apr 15, 2022 at 8:18 AM Erik Huelsmann <ehuels at gmail.com> wrote:
>
>> Hi,
>>
>>
>> The common-lisp.net GitLab instance has been in maintenance mode since last
>> night due to an unexpected outage: around 19.20 CEST, phoe approached me on
>> the common-lisp.net:matrix.org chat channel (a.k.a. #common-lisp.net:libera.chat)
>> about the unavailability of the with-contexts project. When I logged into
>> the system, it quickly became apparent that one of the discs had filled up,
>> causing this failure. After a bit of research, it turned out that some
>> 23GB (less than 10% of space on the volume) of disc space was taken by
>> Prometheus, a tool we're not using but which comes out of the box with
>> GitLab. Disabling Prometheus and removing its files quickly freed up enough
>> space for basic web requests to work again.
>> To have some more room for various processes to operate with, I'm using a
>> maximum of 80% fill-rate for the volumes in the VM. So, I went looking for
>> more possibilities to clean out storage. At one point, I ended up in GitLab's
>> PostgreSQL directory, where there was a little more than a GB of storage to
>> be won. Not a lot, but since I was cleaning anyway, it seemed like a good
>> thing to look at. There were clearly old Pg clusters (various Pg10 and Pg11
>> clusters while we were running on 12). There also was a script called
>> "delete_old_clusters.sh". It seemed better to use a script from the vendor
>> than meddling with the database data myself, so I used it. HOWEVER: it
>> immediately and without warning *removed the production database* (contrary
>> to the expectation that it would remove the *old* clusters lying around)!
>> Although this is a rather unfortunate series of events, I quickly
>> recovered from the heart attack that followed, turned off as many services
>> on the machine as possible and searched (a) for older database copies and
>> (b) for dumped backups. Unfortunately, misfortune never comes alone: as
>> soon as I found the backup, I realized it was from February 27th. The backup
>> system that had been running without problems for *years* had stopped
>> running after March 1st and none of the current maintainers noticed: since
>> the backup procedure didn't generate an error, but was plainly not
>> executed, there were no mails about backups failing. On top of that, it
>> turns out that the system I have in place to report disk usage problems
>> wasn't delivering messages about the common-lisp.net disk overage either!
>> I've restored the database backup from the 27th and we have the backup
>> procedure running again. This means that anything stored in the database
>> (MRs, issues, etc.) is back to its state of February 27th. The
>> *repositories* are fine and never were in danger!
>>
>> So far, I've held off on re-enabling the service, because I've contacted the
>> #gitlab:libera.chat channel to ask if anything can be done to ensure
>> consistency between the repositories and the database. So far, the channel
>> has remained silent (not just to my question, but to any questions posed).
>> I am planning to restore access on Monday night CEST if no answer appears
>> on the GitLab channel, or earlier if a usable answer is provided.
>>
>>
>> Let me close off this mail by offering my sincere apologies for failing
>> the trust you have put in me and for any inconvenience this may have
>> caused. Please report any inconsistencies you run into to admin at
>> common-lisp.net so we can work on fixes. Additional controls are being
>> worked on to prevent a similar situation in future: "ping" messages from
>> the monitoring infrastructure and checks on the off-site backup system to
>> verify that the weekly full backup (and the daily incrementals) have been
>> delivered.
>>
>>
>> --
>> Bye,
>>
>> Erik.
>>
>> http://efficito.com -- Hosted accounting and ERP.
>> Robust and Flexible. No vendor lock-in.
>>
>
>
> --
> Ray
>

-- 
Marco Antoniotti, Professor                           tel. +39 - 02 64 48 79 01
DISCo, Università Milano Bicocca U14 2043   http://dcb.disco.unimib.it
Viale Sarca 336
I-20126 Milan (MI) ITALY