GitLab.common-lisp.net unavailability: post-mortem

Dave Cooper david.cooper at genworks.com
Sun Oct 1 19:15:45 UTC 2017


Dear All,

Thanks to Erik for this post-mortem, which will also serve as some
historical documentation on the system setup.  I should note that Mark E
was abroad and also out of pocket (at an extended wedding celebration)
during this incident, not able to devote uninterrupted attention to
troubleshooting.  So it was a bit of a "perfect storm" situation, what with
not just one, but both our volunteer administrators coincidentally absorbed
elsewhere, concurrent with a relatively unusual back-compatibility-breaking
deprecation in gitlab.

With that said,  I'll speak for common-lisp.net's sponsoring organization,
the CLF (as a Board member), and say that we consider any amount of
downtime to be too much. After all, our policy for the past several years
has been to invite and encourage CL-based projects to come to
common-lisp.net for their main repository hosting, as an alternative to
huge impersonal repository hosts à la github (which arguably has its own
whole set of issues and risks).  We wish to continue encouraging in this
way in good faith.  To that end, at our next monthly teleconference, the
CLF will discuss actions we can take to reduce the likelihood of an outage
like this recurring.  If anyone on this list has ideas they would like to
contribute to the discussion, please feel free to describe them on this
list.

You may also consider joining the upcoming CLF meeting on October 4 (on
Google Hangout). Please write to me directly if you would like to receive
the meeting link/invitation.


Best Regards,

 Dave Cooper



On Fri, Sep 29, 2017 at 6:03 PM, Erik Huelsmann <ehuels at gmail.com> wrote:

> Hi,
>
> Last weekend and up to Wednesday, gitlab.common-lisp.net had issues,
> returning 500 Internal Server Errors while cloning or pulling; additionally
> the gitlab subdomain was down completely on Sunday.
> This mail provides an analysis of what happened.
>
> Some context to start with: common-lisp.net
> uses the so-called "omnibus" package to run its GitLab install; it's a
> batteries-included package provided by GitLab, meaning that everything down
> to OpenSSL, nginx and Ruby is included in the package and installed in a
> separate location that does not interfere with the system. This omnibus
> package also comes with its own configuration (script) in the form of a
> Chef recipe.
>
> While the package provides a default configuration which uses an Nginx
> reverse proxy and default ports for daemons to be accessed over TCP
> sockets, this default configuration doesn't quite work on common-lisp.net
> due to the fact that we use Apache 2.4 as our web-visible reverse proxy.
> Apache 2.4 also serves a truckload of other services, such as lisppaste,
> trac, abcl.org, cliki.net, darcsweb.cgi, etc.
> Due to this entanglement of Apache, we can't just replace it with nginx.
> Also, due to the large number of reverse-proxied services, not all standard
> ports for GitLab's configuration are open.
> This isn't a problem, because GitLab offers the ability to configure
> site-local deviations from the default configuration as input for the Chef
> recipe.
>
> We have successfully been running with a configuration like this since
> GitLab 7.(something). The current GitLab version is 10.0.
>
> In its evolution from version 7 to version 10, gitlab started out with
> "Unicorn" based rails workers (a standard Rails setup). As demand grew, a
> custom webserver was developed (gitlab-git-http-server) which addressed
> Unicorn time-outs with long running "git" processes (clones).
> In order to support the "simple" setup with just Unicorn, the unicorn and
> gitlab-git-http-servers were configured to run each on their own port.
> Around GitLab 8.2, gitlab-git-http-server was renamed to gitlab-workhorse
> and the configuration keys were renamed with it, although the old config
> keys were still respected. Our local override contained these
> gitlab-git-http-server config keys last Sunday due to ports already being
> taken by other services.
>
> As of version 10, the 'gitlab-git-http-server' configuration keys are no
> longer supported: the configuration *must* now be specified in terms of
> 'gitlab-workhorse' keys. Last Sunday, when I upgraded the system to the
> current version (10.0.2) in the morning, I missed this fact, which caused
> the system to remain unconfigured (and thus unavailable) until I received
> notification on #common-lisp.net of problems.
> The cause at that time was quickly determined and the
> 'gitlab-git-http-server' configuration keys were quickly removed and the
> system was redeployed and all seemed to work again, after changing the
> reverse proxy rules to point to gitlab's remaining open ports.
>
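A pre-upgrade check for removed configuration keys could be sketched as follows. This is only an illustration: the function name `find_deprecated_keys` is hypothetical, and it assumes the local override file writes settings as `prefix['key'] = value` lines, with the removed prefix spelled `gitlab_git_http_server`.

```python
import re

# Assumption: removed settings appear as `prefix['key'] = value` lines.
# The prefix list is illustrative and would be maintained by hand.
DEPRECATED_PREFIXES = ("gitlab_git_http_server",)


def find_deprecated_keys(config_text, prefixes=DEPRECATED_PREFIXES):
    """Return (line_number, line) pairs for active lines using a deprecated prefix."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and commented-out settings
        for prefix in prefixes:
            # Match the prefix only when it is followed by an opening bracket.
            if re.match(rf"{re.escape(prefix)}\s*\[", stripped):
                hits.append((number, stripped))
                break
    return hits
```

Run against the contents of the override file before reconfiguring, a non-empty result would flag settings that need to be renamed rather than simply deleted.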
> On Monday I received more signals of problems; since I was at a conference
> with little to no Net access, Mark pitched in, but was unable to determine
> the cause. When I *did* have access, everything looked fine, so I didn't
> check any further.
> Then on Tuesday, I received more signals of problems, but still being at
> the same conference without Net access, I wasn't able to do much.
> On Wednesday morning, with yet more reports of problems, it became
> apparent that I was checking the web frontend for availability, but that
> the people reporting issues were actually experiencing problems with
> clones/pulls/etc. So, the git-over-http component wasn't working.
>
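The gap described above (web frontend up, git-over-http down) can be closed by probing the same URL a git client requests first. The following is a minimal sketch, not the check actually used on common-lisp.net: the function names are hypothetical, and it assumes a publicly readable repository served over git's smart HTTP protocol.

```python
import urllib.error
import urllib.request

# Content type a smart-HTTP git server returns for a fetch advertisement.
EXPECTED_TYPE = "application/x-git-upload-pack-advertisement"


def info_refs_url(repo_url):
    """Build the URL that `git clone` / `git pull` request first over HTTP."""
    return repo_url.rstrip("/") + "/info/refs?service=git-upload-pack"


def looks_healthy(status, content_type):
    """True if the response looks like a valid git ref advertisement."""
    media_type = (content_type or "").split(";")[0].strip()
    return status == 200 and media_type == EXPECTED_TYPE


def check_git_http(repo_url, timeout=10):
    """Probe the repository the way a git client would; False on any failure."""
    try:
        with urllib.request.urlopen(info_refs_url(repo_url), timeout=timeout) as resp:
            return looks_healthy(resp.status, resp.headers.get("Content-Type"))
    except (urllib.error.URLError, OSError):
        # HTTP errors such as 500 surface here, as do connection failures.
        return False
```

Calling `check_git_http` with the HTTP URL of any public repository on the server would have returned False during the incident, even while the web pages loaded normally.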
> With the actual problem identified and reproduced, it was quickly apparent
> that due to the removal of the gitlab-git-http-server config keys,
> 'gitlab-workhorse' was no longer being configured and started. With a bit
> of trial-and-error, it also turned out that gitlab-workhorse has a default
> configuration to run over Unix domain sockets; a configuration supported by
> Nginx, but not by Apache. With the configuration corrected and the system
> reconfigured, problems were solved by Wednesday noon.
>
> In retrospect, the removal of these config keys was in the release notes,
> so I could have known. It was 22 <PgDown> clicks down, by which time I
> wasn't alert enough to realise the importance of the deprecation
> announcement.
>
>
> Regards,
>
> --
> Bye,
>
> Erik.
>
> http://efficito.com -- Hosted accounting and ERP.
> Robust and Flexible. No vendor lock-in.
>



-- 
My Best,

Dave Cooper, david.cooper at gen.works
genworks.com, gendl.org
+1 248-330-2979