I talk about topics related mostly to the Node.js world, but they can be applied to different languages.
I love upgrading software. I know some may think that's a weird statement, but oh well.
So, I'll say it again: I love upgrading software. Reading the upgrade docs and seeing all those new features and performance improvements is a real joy. All I want to do is take advantage of them and test them out. I mean, who wouldn't?!
In the past, I used to update the package.json (or whatever the upgrade mechanism was), throw it on the QA environment for a quick test, and then push it out. I thought nothing more of it.
Unfortunately, this is no longer the case, for a few reasons:
- I've grown
- I don't want to cause people unnecessary headaches
- I don't want to explain why a service was down due to things which could've been mitigated
To help you avoid making the same mistakes, I wanted to share some of the things I go through before doing major updates. They will make your upgrade take a bit more time, but they will also help mitigate risk, which is ultimately better for you and your company, customers, etc.
Throughout the rest of this post, I'll be using a few examples from projects I've worked on recently, including major version upgrades of Next.js, Node.js, and CentOS.
Testing Locally
While this seems like a no-brainer, it's sometimes forgotten. When upgrading Node.js, you can easily use a tool like n or nvm to switch between versions. There's even a package that will run your tests against multiple versions of Node.js.
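As a quick sketch of the kind of local guard I mean (the version numbers and the package.json "engines" convention are just illustrative here), you can fail fast before the test suite even runs against the wrong version:

```javascript
// A tiny helper to fail fast when the active Node.js version is older
// than the major version a project expects (e.g. from package.json
// "engines"). The parsing is deliberately naive -- a real project would
// probably reach for the semver package instead.
function majorOf(version) {
  // "v18.19.0" -> 18
  return Number(version.replace(/^v/, '').split('.')[0]);
}

function satisfiesMajor(version, requiredMajor) {
  return majorOf(version) >= requiredMajor;
}

console.log(majorOf('v18.19.0'));            // 18
console.log(satisfiesMajor('v18.19.0', 20)); // false
```

You could wire a check like this into an npm pretest script, so running on the wrong Node.js version fails loudly instead of surfacing as confusing package errors later.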
Some of the things I find when testing locally include:
- npm packages that are not compatible with the new Node.js version and require updates
- npm packages that are not compatible with upgraded dependencies (this can happen when updating things like Next.js or Express)
- Debian distros that target certain major versions (this happened to me when updating our Docker image to use "stretch" (from "jessie"): the Nginx apt source was still pointing at the jessie distro, which was incorrect)
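To illustrate that last point, here's a hypothetical Dockerfile fragment (the base image and apt source line are illustrative, not copied from my actual setup) showing how a distro name pinned in a third-party source has to move along with the base image:

```dockerfile
# Hypothetical fragment: the distro name in third-party apt sources must
# match the base image's distro, or installs will break after the upgrade.
FROM debian:stretch

# Wrong after moving off jessie -- still points at the old distro:
#   RUN echo "deb http://nginx.org/packages/debian/ jessie nginx" \
#         > /etc/apt/sources.list.d/nginx.list
# Corrected to match the new base image:
RUN echo "deb http://nginx.org/packages/debian/ stretch nginx" \
      > /etc/apt/sources.list.d/nginx.list \
 && apt-get update && apt-get install -y nginx
```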
Testing with Your Teammates
You may work on an app with a couple of other people, or there may be a bunch of teams working on one app. No matter your situation, I've found it very useful to enlist them to help test your upgrade.
I work on an app with many different parts, all owned by different teams. When doing an upgrade it's hard for me to justify that I've tested all the different areas. This is because I don't know all these areas in-depth.
So, I'll ask everyone to take a couple minutes out of their day to pull down my branch and test out their section with the changes.
When doing this, I've found issues such as incorrect CSS link ordering, webpack loaders failing to function, server-side rendering issues, "it works on my machine" problems, etc.
Testing in Different Environments
Before we start this section, I should mention that it's important to have metrics and logs. They will enable you to diagnose issues much more quickly.
There are many tools out there that enable you to do this. They include, but are not limited to (nor do I have a special affinity for any of these): Splunk, Prometheus, Grafana, Datadog, Papertrail, etc.
They do involve a bit of setup, maybe a bit less if you use Heroku, but they are invaluable.
Once you have these set up, you will be able to see metrics and act on them (auto-rollback, auto-scale, alert people, etc.). Logs will also provide valuable insight into what is going on with your app; however, what information to log is a whole different topic. I know a bunch of people better suited to talk about that, so we won't be covering it here.
When testing updates, no matter how major they are, it's my opinion that you should test them in a testing/staging/QA environment before releasing them to your users.
For small upgrades, like minor version upgrades, the usual, "deploy, test for regressions, and deploy to prod" will likely work.
When it comes to larger upgrades, it may be beneficial to do a bit more. By this I mean let the upgrade simmer on the testing environment for a while. You'll likely have others that use this testing environment - little do they know, they've just become unwitting testers for you!
It will be helpful to watch at least the following metrics when releasing to your test environment:
- CPU utilisation
- Memory utilisation
- Error rate
This is where logs and metrics come in handy. You can compare the metrics of the upgraded app to the metrics of the app before the upgrade. This will help you determine if there's something "shifty" going on.
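As a rough sketch of that before/after comparison (the metric names, numbers, and threshold here are all made up), you could pull both sets of numbers from your metrics store and flag anything that drifted too far:

```javascript
// Compare a candidate deployment's metrics against the pre-upgrade
// baseline and flag anything that regressed past a threshold.
// All numbers below are invented for illustration.
function relativeChange(before, after) {
  return (after - before) / before;
}

function flagRegressions(baseline, candidate, threshold = 0.1) {
  // Keep only the metrics that got more than `threshold` worse.
  return Object.keys(baseline).filter(
    (metric) => relativeChange(baseline[metric], candidate[metric]) > threshold
  );
}

const baseline = { cpuPercent: 40, memoryMb: 512, errorRate: 0.01 };
const candidate = { cpuPercent: 43, memoryMb: 900, errorRate: 0.012 };

console.log(flagRegressions(baseline, candidate));
// memoryMb and errorRate exceed the 10% threshold; cpuPercent does not
```

In practice you'd feed this from whichever tool you chose above (Prometheus, Datadog, etc.) rather than hard-coded objects, but the comparison itself stays this simple.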
I would also recommend load testing in the test environment, if you can, before rolling out to production. Artillery is a useful tool for this.
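For example, a minimal Artillery scenario might look like this (the target URL and rates are placeholders - tune them to your own traffic patterns):

```yaml
# artillery-qa.yml -- minimal load test sketch against a QA environment
config:
  target: "https://qa.example.com"   # placeholder URL
  phases:
    - duration: 60        # run for 60 seconds
      arrivalRate: 10     # 10 new virtual users per second
scenarios:
  - flow:
      - get:
          url: "/"
```

Run it with `artillery run artillery-qa.yml` and compare the resulting latency and error numbers against your baseline metrics.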
Progressive Production Rollout
If you don't have the ability to progressively roll out apps, then don't worry about this section.
We all know that a test environment is almost never like the actual production environment (if yours is, then congratulations!). The production environment usually has more dedicated CPU, more memory allocation, more instances, better load balancing, etc.
In an effort to make sure the changes will function as planned, you could progressively roll your service/app out to the production environment. This may look something like replacing 10% of your instances with the upgraded version and monitoring it for a bit to make sure everything is tip-top. Next, increase the replaced instances to 25% and monitor it again. Repeat until you've replaced 100% of your instances.
This will allow you to make sure everything is functioning properly over a longer period of time.
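A sketch of that gating logic might look like the following - the deploy and health-check hooks are placeholders for whatever your platform's API actually offers:

```javascript
// Staged rollout gate: advance through traffic percentages only while
// the health check keeps passing. deploy() and looksHealthy() are
// hypothetical hooks -- wire them to your own platform.
const STAGES = [10, 25, 50, 100]; // percent of instances on the new version

function progressiveRollout(deploy, looksHealthy) {
  const completed = [];
  for (const percent of STAGES) {
    deploy(percent); // replace `percent`% of instances with the new version
    if (!looksHealthy()) {
      // Halt early, leaving the remaining instances on the old version.
      return { ok: false, haltedAt: percent, completed };
    }
    completed.push(percent);
  }
  return { ok: true, completed };
}

// Demo with stubs: metrics look fine until 50% of traffic hits the new code.
let trafficPercent = 0;
const result = progressiveRollout(
  (p) => { trafficPercent = p; },
  () => trafficPercent < 50
);
console.log(result); // halts at 50, with stages 10 and 25 completed
```

Between each stage you'd monitor the same CPU, memory, and error-rate metrics discussed earlier before letting the loop continue.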
Remember: Being overly cautious when it comes to these things is okay (and I would say expected).
Making a Plan
When doing a major upgrade, I usually put together a document that details everything I will be doing, step by step. This includes all the things I've talked about above, but it also includes possible remediations if something goes wrong.
Remediations are the steps that one can take in order to either diagnose a problem or revert the upgrade. If something goes wrong, it's (usually) much easier to revert the change, get your service/app in a functioning state, and then diagnose what went wrong.
I hope this is helpful to you! If you have any other tips, hit me up in the comments or on Twitter (@vernacchia).
Until next time...