Production readiness checklist for backend applications

May 22, 2020

Intro

While releasing my own apps and helping teams release theirs, I realized that rolling out to production is usually chaos. It doesn’t matter how big your app is, whether it was built by 35 different teams or by one mature team. It’s just a matter of entropy :-).

There are a lot of books that describe best practices for going live and for the production support that follows. But you are not going to read all those books right before going to production, and even if you have read them, during the rollout you may forget something you read. Of course, I’ve read many of them and have my own experience. Before go-live I usually start recollecting everything I know and try to apply it to the current case. I was thinking about how to organize this information better and came up with the idea of using a good old checklist.

Here is my checklist.

Go-live checklist for a backend application

Network

The network is isolated from the internet (nothing is directly reachable from the internet)
APIs are exposed through an API gateway
Databases are accessible only from the production network
[Nice to have] Application is deployed to at least 2 different locations (VPC peering can be configured between the locations)

Monitoring

Need to monitor each service of your application
Need to monitor databases
Need to monitor third-party services
Need to monitor Kubernetes (if you use it)
Configure request tracing (if you have multiple services)

Log aggregation

Logs are shipped to a centralized log aggregation service
Add filters at least for ERROR and WARN
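
To illustrate the ERROR and WARN filter, here is a minimal sketch using Python’s standard logging module. The ListHandler class and the "orders-service" logger name are hypothetical; in a real setup the handler would ship records to your log aggregation service instead of keeping them in memory.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted records in memory; a stand-in for a real log shipper."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def build_logger(name="orders-service"):
    """Return a logger whose handler only keeps WARNING and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)   # the logger itself accepts everything
    logger.propagate = False         # do not duplicate records in the root logger
    logger.handlers.clear()          # keep the example idempotent
    handler = ListHandler(level=logging.WARNING)  # the filter lives on the handler
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler
```

Putting the level on the handler rather than on the logger means you can attach a second handler later (for example, a local debug file) without changing what reaches the aggregation service.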

Alerts

Alerts are configured at least for basic metrics such as CPU, memory, IOPS, and disk space
[Nice to have] Configure alerts for errors in logs
[Nice to have] Use error reporting services
Have a mechanism to turn off alerts during deployment
All alerts should be tracked
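
As a rough sketch of what “basic metric alerts with a deployment mute” could look like, the snippet below evaluates threshold rules against a metrics snapshot. The metric names, the threshold values, and the muted flag are all assumptions made for illustration; a real monitoring system would provide its own rule format and silencing mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float  # the alert fires when the value is strictly above this


# Hypothetical thresholds; tune them for your own system.
RULES = [
    Rule("cpu_percent", 85.0),
    Rule("memory_percent", 90.0),
    Rule("disk_used_percent", 80.0),
    Rule("iops", 3000.0),
]


def evaluate(metrics, rules=RULES, muted=False):
    """Return alert messages for every rule whose threshold is exceeded.

    When `muted` is True (for example, during a deployment window),
    no alerts are returned at all.
    """
    if muted:
        return []
    return [
        f"{rule.metric}={metrics[rule.metric]} exceeds {rule.threshold}"
        for rule in rules
        if rule.metric in metrics and metrics[rule.metric] > rule.threshold
    ]
```

Whatever tool you use, remember to unmute after the deployment; a forgotten silence is as bad as having no alerts.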

Backups

The backup strategy should be applied to each database
Backups should have a validation process in place
Rollback process should be automated and tested
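
One simple piece of a backup validation process is an integrity check. The sketch below assumes that a SHA-256 checksum was recorded when the backup was taken; the function names and the minimum-size check are my own illustration. A checksum only proves that the file is intact, so it complements a periodic test restore rather than replacing it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 of a file without loading it fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_valid(path, expected_sha256, min_size_bytes=1):
    """Check that the backup file exists, is not empty, and matches its checksum."""
    backup = Path(path)
    if not backup.is_file() or backup.stat().st_size < min_size_bytes:
        return False
    return sha256_of(backup) == expected_sha256
```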

Deployment

Deployment process should be automated
[Nice to have] Blue/Green deployment
[Nice to have] Support canary deployment

Data replication

Databases should be deployed with at least the minimum required number of nodes and with a replication factor that lets you recover as quickly as possible

Access

Only the production team should have access to production environments
Developers may have access to logs and monitoring information

Secrets

Secrets should be stored in a dedicated secret management service
No passwords or keys should be stored anywhere else (for example, in source code or configuration files)
A secret management strategy should be defined, including who holds the keys
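
On the application side, this usually means the code never contains a secret and fails fast when one is missing. The sketch below assumes that your secret manager injects secrets into the process as environment variables, which is only one of several possible delivery mechanisms; get_secret and MissingSecretError are hypothetical names.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not provided to the process."""


def get_secret(name, environ=os.environ):
    """Read a secret injected by the secret manager; never fall back to a default."""
    value = environ.get(name)
    if not value:
        # Deliberately do not log or echo any secret value here.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

Failing at startup is intentional: a service that quietly starts with an empty or default password is much harder to diagnose in production.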

Smoke tests

Need to have a list of smoke tests (ideally automated) that show that production is up and running
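
An automated smoke test suite can be as small as a named list of checks and a runner that reports what failed. Here is a minimal sketch; the run_smoke_tests function is hypothetical, and each check is assumed to be a zero-argument callable that returns a truthy value on success.

```python
def run_smoke_tests(checks):
    """Run every check and return a dict of {name: reason} for the failures.

    An empty dict means production passed all smoke tests.
    """
    failures = {}
    for name, check in checks.items():
        try:
            if not check():
                failures[name] = "check returned a falsy value"
        except Exception as exc:  # one broken check must not stop the others
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

In practice the callables would do things like request a known page through the API gateway or run a trivial query against the database, and the deployment pipeline would stop if the returned dict is not empty.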

Health checks

Application services should provide health check endpoints
The health of third-party services should be covered in one of three ways: the application service checks it itself, the third-party service exposes its own health check endpoint, or you write scripts that check it
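
Here is a minimal sketch of a health check endpoint that also reports on its dependencies, written with Python’s standard http.server module. The /health path, the response format, and the "payment-provider" probe are assumptions for illustration; most web frameworks have their own way to register such an endpoint.

```python
import json
from http.server import BaseHTTPRequestHandler


def health_report(dependencies):
    """Run each dependency probe and build an (HTTP status, body) pair."""
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = "up" if probe() else "down"
        except Exception:
            results[name] = "down"  # a crashing probe counts as an unhealthy dependency
    status = 200 if all(state == "up" for state in results.values()) else 503
    body = {"status": "ok" if status == 200 else "degraded", "dependencies": results}
    return status, body


# Hypothetical probe; replace with a real call to the third-party service.
DEPENDENCIES = {"payment-provider": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_report(DEPENDENCIES)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve it locally:
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Returning 503 when a dependency is down lets load balancers and orchestrators react without parsing the response body.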

Versioning

All components should be versioned

Git branches and tags

Following the Git workflow that you use (git-flow, for example), assign release tags and prepare branches for quick fixes, etc.

Capacity planning

Make sure that you understand how many resources each component in the system needs
Make sure that you configure limits for all services, so that a memory leak in one service does not take down everything around it
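
A quick sanity check for the two points above is to list every service with its memory limit and compare the total with what a node actually has. The sketch below is only an illustration: the check_limits function, the MiB unit, and the use of None for “no limit configured” are my assumptions, and real schedulers do more sophisticated accounting.

```python
def check_limits(node_memory_mib, service_limits_mib):
    """Report services without a memory limit and whether the limits fit on the node.

    `service_limits_mib` maps a service name to its limit in MiB,
    or to None when no limit is configured.
    """
    missing = sorted(
        name for name, limit in service_limits_mib.items() if limit is None
    )
    total = sum(limit for limit in service_limits_mib.values() if limit is not None)
    return {
        "missing_limits": missing,
        "total_mib": total,
        # Without a limit on every service the total is meaningless, so it cannot "fit".
        "fits": not missing and total <= node_memory_mib,
    }
```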

Time configuration [optional]

Make sure that time is synchronized across all nodes of your cluster

Release strategy process [if an old version of the product exists]

Need to describe the procedure that you are going to use, for example:
Big bang
Rolling upgrade
Canary

Production sign off process

Need to have a person who is responsible for the production deployment and who signs off that production is ready to go

Support documents

Contains common issues and resolutions
Contains instructions for production management (example deployment commands, purge scripts, other automation)

Disaster recovery

Need to have a disaster recovery plan with different levels of detail and ETAs, for example: how to recover one service, how to recover a database, how to recover the whole cluster, how to recover the whole region

P.S.

Please don’t hesitate to send me the things that you usually check before go-live, and I will update this list. And of course, if you need help with your application architecture, DevOps processes, or big data processing, reach out to me.