GitLab.common-lisp.net unavailability: post-mortem

Fri Sep 29 22:03:43 UTC 2017

Hi,

From last weekend through Wednesday, gitlab.common-lisp.net had issues,
returning 500 Internal Server Errors on clone and pull; in addition, the
gitlab subdomain was completely down on Sunday.
This mail provides an analysis of what happened.

Some context to start with: common-lisp.net uses the so-called "omnibus"
package to run its GitLab install; it's a batteries-included package
provided by GitLab, meaning that everything down to OpenSSL, nginx and Ruby
is included in the package and installed in a separate location that
doesn't interfere with the system. This omnibus package also comes with its
own configuration (script) in the form of a Chef recipe.

While the package provides a default configuration which uses an Nginx
reverse proxy and default ports for daemons to be accessed over TCP
sockets, this default configuration doesn't quite work on common-lisp.net
because we use Apache 2.4 as our web-visible reverse proxy.
Apache 2.4 also serves a truckload of other services, such as lisppaste,
trac, abcl.org, cliki.net, darcsweb.cgi, etc.
Because so much depends on Apache, we can't just replace it with nginx.
Also, due to the large number of reverse-proxied services, not all standard
ports for GitLab's configuration are open.
This isn't a problem, because GitLab offers the ability to configure
site-local deviations from the default configuration as input for the Chef
recipe.

We have successfully been running with a configuration like this since
GitLab 7.(something). The current GitLab version is 10.0.

In its evolution from version 7 to version 10, GitLab started out with
"Unicorn"-based Rails workers (a standard Rails setup). As demand grew, a
custom webserver was developed (gitlab-git-http-server) which addressed
Unicorn time-outs with long-running "git" processes (clones).
In order to keep supporting the "simple" setup with just Unicorn, Unicorn
and gitlab-git-http-server were each configured to run on their own port.
Around GitLab 8.2, gitlab-git-http-server was renamed to gitlab-workhorse
and the configuration keys were renamed with it, although the old config
keys were still respected. Up to last Sunday, our local override still
contained these gitlab-git-http-server config keys, because the default
ports were already taken by other services.

As of version 10, the 'gitlab-git-http-server' configuration keys are no
longer supported: the configuration *must* now be specified in terms of
'gitlab-workhorse' keys. Last Sunday, when I upgraded the system to the
current version (10.0.2) in the morning, I missed this fact, which caused
the system to remain unconfigured (and thus unavailable) until I received
notification on #common-lisp.net of problems.
The cause at that time was quickly determined: the
'gitlab-git-http-server' configuration keys were removed, the reverse proxy
rules were changed to point to GitLab's remaining open ports, and the
system was redeployed. After that, all seemed to work again.

On Monday I received more signals of problems. As I was at a conference
with little to no Net access, Mark pitched in, but he was unable to
determine the cause. When I *did* have access, everything looked fine, so I
didn't check any further.
Then on Tuesday, I received more signals of problems, but as I was still at
the same conference without Net access, I wasn't able to do much.
On Wednesday morning, with yet more reports of problems, it became apparent
that I was checking the web frontend for availability, but that the people
reporting issues were actually experiencing problems with clones/pulls/etc.
So, the git-over-http component wasn't working.
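An availability check that exercises the git-over-http path, rather than
only the web frontend, could look roughly like the sketch below. It requests
the same "info/refs" URL a git client fetches first when cloning over HTTP
and checks for the smart-HTTP content type. The function names and the
User-Agent string are made up for illustration, and the check assumes a
publicly readable project (a private one would answer 401 and be reported
as failing).

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Content type a git smart-HTTP server returns for a fetch advertisement.
UPLOAD_PACK_TYPE = "application/x-git-upload-pack-advertisement"


def info_refs_url(base_url, project_path):
    """Build the URL a git client requests first when cloning over HTTP."""
    path = quote(project_path.strip("/"))
    if not path.endswith(".git"):
        path += ".git"
    return f"{base_url.rstrip('/')}/{path}/info/refs?service=git-upload-pack"


def is_git_advertisement(status, content_type):
    """True when the response looks like a smart-HTTP ref advertisement."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    return status == 200 and media_type == UPLOAD_PACK_TYPE


def check_git_http(base_url, project_path, timeout=10):
    """Return True if cloning over HTTP is likely to work (does network I/O)."""
    request = Request(
        info_refs_url(base_url, project_path),
        # Illustrative User-Agent; any value identifying the check will do.
        headers={"User-Agent": "git/2.0 (availability-check)"},
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return is_git_advertisement(
                response.status, response.headers.get("Content-Type")
            )
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

Calling check_git_http("https://gitlab.common-lisp.net", "group/project")
with a real public project in place of the placeholder path would have
failed during the outage, because the 500 response raises an HTTPError,
even while the web frontend itself looked fine.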

With the actual problem identified and reproduced, it was quickly apparent
that due to the removal of the gitlab-git-http-server config keys,
'gitlab-workhorse' was no longer being configured and started. With a bit
of trial-and-error, it also turned out that gitlab-workhorse by default is
configured to listen on a Unix domain socket, a configuration supported by
Nginx, but not by Apache. With the configuration corrected and the system
reconfigured, problems were solved by Wednesday noon.

In retrospect, the removal of these config keys was in the release notes,
so I could have known. It was 22 <PgDown> clicks down, by which time I
wasn't alert enough to realise the importance of the deprecation
announcement.
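A small pre-upgrade check could catch this kind of leftover key without
relying on reading the release notes to the end. The sketch below scans the
text of the local override for active (not commented-out) lines that still
use a retired key family. The underscore spelling "gitlab_git_http_server"
and the function name are assumptions for illustration; the list of
prefixes would have to be maintained by hand from each release's
deprecation notes.

```python
import re

# Assumed spelling of the retired key family in the omnibus config file.
DEPRECATED_PREFIXES = ("gitlab_git_http_server",)


def find_deprecated_keys(config_text, prefixes=DEPRECATED_PREFIXES):
    """Return (line_number, line) pairs for active lines using a retired key."""
    pattern = re.compile(r"^\s*(%s)\s*\[" % "|".join(map(re.escape, prefixes)))
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        # Commented-out lines start with '#', so they do not match.
        if pattern.match(line):
            hits.append((number, line.strip()))
    return hits
```

Run against the contents of the override file (on an omnibus install this
is normally /etc/gitlab/gitlab.rb) before upgrading, a non-empty result
means there is still configuration to migrate to the 'gitlab-workhorse'
keys.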

Regards,

-- 
Bye,

Erik.

http://efficito.com -- Hosted accounting and ERP.
Robust and Flexible. No vendor lock-in.