written on Wednesday, September 14, 2011
TL;DR Version: We screwed up by not really understanding how the packaging system we use works internally.
I currently work on a project where we have to go full stack: from planning, coding and testing to packaging, shipping and doing some legal paperwork. One of the areas that gives us the most headaches (besides the legal hassle, obviously) is packaging (we're only dealing with RPM for now). Here I'll describe a default RPM behavior that bit us recently, and how we later used it to our advantage.
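RPM packages support scripting at four points in a package's life cycle, before and after installation and before and after removal, namely:
- %pre: runs before the package is installed.
- %post: runs after the package is installed.
- %preun: runs before the package is removed.
- %postun: runs after the package is removed.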
I won't go too deep into the workings of RPM, as that is beyond the scope of this post, but if you're interested check out Maximum RPM: it's the most complete RPM guide you'll ever get. As a bare minimum, take a quick look here and here.
So, what happens behind the scenes during a package upgrade? It's natural (at least for me and my team) to think that the old version of the package is removed and then the new version is installed, meaning that the sections described previously would be called in the following order: %preun, %postun, %pre, %post. Well, it turns out this is not what happens, and this is why we were bitten. Twice.
Following the "natural" chain of thought, we had included in our package's %preun section instructions to stop the services we set up, so that they wouldn't be running during the upgrade and risk breaking everything.
This didn't break anything outright, and since we provide a front-end for upgrading our packages which triggers a restart after everything is done, we didn't spot the issue right away. Only when testing a command line upgrade did we notice that our services were left completely stopped.
The fix was simple: we made the %preun section skip the commands that stop our services during an upgrade, and everything went better than expected.
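A minimal sketch of what such a guard can look like, not necessarily the exact change we shipped. It relies on the fact that rpm passes each scriptlet a count in `$1`: for %preun that is the number of instances of the package that will remain afterwards, so 0 means a real removal and 1 or more means an upgrade. The scriptlet body is wrapped in a function so it can be tried outside of rpm; `myservice` and `stop_service` are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch of a %preun scriptlet body, wrapped in a function so it can be
# exercised without rpm. In a real spec file the body would sit directly
# under the %preun line and read "$1" the same way.

# Hypothetical service name and stop command; substitute the real ones.
SERVICE_NAME="myservice"
stop_service() {
    echo "stopping $SERVICE_NAME"
}

preun_body() {
    remaining="$1"   # instances left after this operation: 0 = removal
    if [ "$remaining" -eq 0 ]; then
        # Real uninstall: nothing replaces the package, so stop the service.
        stop_service
    else
        # Upgrade: the new version is already installed, leave it alone.
        echo "upgrade detected, leaving $SERVICE_NAME alone"
    fi
}

preun_body 0   # what rpm -e would trigger
preun_body 1   # what rpm -U would trigger
```

With a guard like this the service is only stopped when the package is really going away, which is what the "natural" mental model wanted in the first place.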
For internal reasons, one of the packages we ship switched from including some files statically in the RPM package (listed in the %files section) to generating them automatically during the install process.
With this change, after an upgrade from our old, more declarative package to the new dynamic one, the generated files ended up being removed (and we had confirmed they were definitely being created during the setup).
The rather ugly but simple fix was to create an intermediary package with static files whose names differ from the ones being auto-generated.
So, how come those fixes worked and, more importantly, why the heck were these issues happening in the first place?
Quoting from this developerWorks article, what happens during a package upgrade is the following:
- Run the %pre section of the RPM being installed.
- Install the files that the RPM provides.
- Run the %post section of the RPM.
- Run the %preun of the old package.
- Delete any old files not overwritten by the newer version. (This step deletes files that the new package does not require.)
- Run the %postun hook of the old package.
What this means is that the old package removal happens after the installation of the new one. If you think about it, it kind of makes sense since we don't want to touch the configuration files the user has changed (or rather, we don't want to force the user to reconfigure her service after simple updates). If it were up to me I wouldn't do it like this, but anyway.
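To make the sequence easier to keep in mind, here is a toy listing of the upgrade steps together with the value rpm passes as `$1` to each scriptlet (2 for the incoming package's %pre and %post, 1 for the outgoing package's %preun and %postun). It does not call rpm at all; `upgrade_order` is just a hypothetical helper name for this post.

```shell
#!/bin/sh
# Toy listing of what runs during "rpm -U", in order. No rpm involved.
# The trailing number is the value each scriptlet receives as $1.

upgrade_order() {
    echo "new %pre 2"
    echo "new files installed"
    echo "new %post 2"
    echo "old %preun 1"
    echo "old files not in new package removed"
    echo "old %postun 1"
}

upgrade_order
```

The thing to notice is that both scriptlets of the old package run after the new package is fully in place.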
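Let's understand each issue and why the fix worked:
- Stopped services: the old package's %preun runs after the new package's %pre, files and %post, so our stop commands ran at the very end of the upgrade and left the freshly upgraded services down. Once they are skipped on upgrade, nothing stops the services anymore.
- Removed files: the generated files had the same paths as files that the old package owned and the new package no longer listed in %files, so the "delete any old files" step removed them right after they had been generated. With an intermediary package owning differently named static files, that step no longer touches the generated paths.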
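The disappearing files can be reproduced with a small simulation of the file handling during an upgrade. This is a toy model under simplifying assumptions (plain file lists instead of real packages, a scratch directory instead of the real filesystem), and every file name in it is hypothetical.

```shell
#!/bin/sh
# Toy simulation of how an upgrade handles files; rpm itself is not involved.

simulate_upgrade() {
    root="$1"        # scratch directory standing in for the filesystem
    old_owned="$2"   # space-separated files listed in the old package
    new_owned="$3"   # space-separated files listed in the new package
    generated="$4"   # space-separated files the new package generates

    # The new package's own files are installed.
    for f in $new_owned; do echo "new" > "$root/$f"; done

    # The new package generates its dynamic files during installation.
    for f in $generated; do echo "generated" > "$root/$f"; done

    # Old package cleanup: delete whatever the old package owned and the
    # new package does not list, no matter who wrote the file last.
    for f in $old_owned; do
        case " $new_owned " in
            *" $f "*) ;;             # still owned by the new package: keep
            *) rm -f "$root/$f" ;;   # no longer owned: removed
        esac
    done
}

root="$(mktemp -d)"
for f in app.bin settings.dat; do echo "old" > "$root/$f"; done
simulate_upgrade "$root" "app.bin settings.dat" "app.bin" "settings.dat"
ls "$root"
rm -rf "$root"
```

After the run only `app.bin` is left: `settings.dat` was generated and then deleted by the cleanup step, because the old package still owned that path. If the old package owns a differently named file instead, the cleanup removes that one and the generated file survives.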
Enough about issues. Let's take a look at a situation where this upgrade behavior actually helped us solve a production bug:
In our first release, some of our code depended on filesystem data to work, and we didn't cache that data for use throughout the service's lifetime. This caused a bug during an upgrade, because the newer package version would replace some files which the old service depended on. So, after an upgrade through our UI, the application would crash since the filesystem data was different from what it expected (yeah, I'm talking about templates - shame on us).
Such an issue could be solved by simply releasing a new minor version of the product. However, the bug would still show up if the user upgraded straight from the original version without going through the minor update, so this wasn't perfectly safe.
By the time we caught this problem we were already aware of the real RPM upgrade flow, so we used it to work around the issue for any version. In the %pre section of the new package we backed up the old files that triggered the crash when changed, and kept them at their original location after the upgrade finished. When the application restarts, it detects the presence of the old files and finally replaces them with the new ones. Now the service is running updated code that is aware of the filesystem changes, and nothing really breaks.
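The backup half of that workaround can be sketched roughly as follows. This is an illustration under assumptions rather than our actual scriptlet: the directory names are hypothetical, the body is wrapped in a function so it can run outside of rpm, and it only covers the copy made in %pre (in %pre, `$1` is 1 on a first install and 2 or more on an upgrade).

```shell
#!/bin/sh
# Sketch of a %pre scriptlet body that preserves files the still running
# old service depends on. All paths are hypothetical.

pre_body() {
    instances="$1"      # value rpm passes as $1: 1 = fresh install, 2+ = upgrade
    template_dir="$2"   # hypothetical directory holding the old templates
    backup_dir="$3"     # hypothetical location for the copies

    # Nothing to preserve on a fresh install or if the directory is missing.
    [ "$instances" -ge 2 ] || return 0
    [ -d "$template_dir" ] || return 0

    mkdir -p "$backup_dir"
    # Copy instead of move, so the old service keeps working until restart.
    cp -pR "$template_dir"/. "$backup_dir"/
}

# Small demonstration in a scratch directory.
demo="$(mktemp -d)"
mkdir "$demo/templates"
echo "v1 template" > "$demo/templates/index.tpl"
pre_body 2 "$demo/templates" "$demo/backup"
ls "$demo/backup"
rm -rf "$demo"
```

Because %pre runs before the new files are laid down, the copy is guaranteed to contain the old versions.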
This is what one would call a really ugly hack in the wild, and I'd have to agree. However, the code was already in production, and even worse than doing dirty hacks is letting the application crash in the user's face.