Production readiness checklist for backend applications

Aleksei Kornev
May 22, 2020

Intro

While releasing my own apps and helping teams release theirs, I realized that rolling out to production is usually chaos. It doesn’t matter how big your app is: whether it was built by 35 different teams or by one mature team, it is just a matter of entropy :-).

Many books describe best practices for going live and for supporting production afterwards. But you are not going to read all of those books right before going to production, and even if you have read them, you may forget something during the rollout. Of course, I’ve read many of them and have my own experience. Before go-live I usually start recollecting everything I know and try to apply it to the current case. I was thinking about how to organize this information better and came up with the idea of using a good old checklist.

Here is my checklist.

Go-live checklist for a backend application

Network

  • The network is isolated from the internet (nothing is reachable from the internet)
  • APIs are fronted by an API gateway
  • Databases are available only from the production network
  • [Nice to have] Application is deployed to at least 2 different locations (VPC peering could be configured between locations)

Monitoring

  • Need to monitor each service of your application
  • Need to monitor databases
  • Need to monitor 3rd-party services
  • Need to monitor Kubernetes (in case you have it)
  • Configure request tracing (in case you have multiple services)
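
For the last item, one common approach is to pass a correlation ID along with every request, so that log lines from different services can be joined later. Here is a minimal sketch of that idea. The header name `X-Request-ID` and the function name `with_request_id` are my assumptions, not part of any particular tracing product.

```python
import uuid

# Assumed header name; use whatever your gateway or tracing tool expects.
REQUEST_ID_HEADER = "X-Request-ID"


def with_request_id(incoming_headers: dict) -> dict:
    """Build the header for a downstream call, reusing the caller's ID if present."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {key.lower(): value for key, value in incoming_headers.items()}
    request_id = normalised.get(REQUEST_ID_HEADER.lower()) or uuid.uuid4().hex
    return {REQUEST_ID_HEADER: request_id}
```

Each service would add the returned header to its outgoing calls and include the same ID in its log lines.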

Log aggregation

  • Logs are centralized in a log aggregation service
  • Add filters at least for ERROR and WARN
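
Most log aggregation services let you save such filters in their own query language. To show what the filter does, here is a rough sketch in plain Python. It assumes that the level appears as a separate word in each line (for example `2020-05-22 10:00:01 WARN slow query`); the function name `important_lines` is hypothetical.

```python
import re

# Matches ERROR, WARN or WARNING as a whole word anywhere in the line.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?)\b")


def important_lines(lines):
    """Keep only the log lines that mention an ERROR or WARN level."""
    return [line for line in lines if LEVEL_PATTERN.search(line)]
```

Note that this naive version also matches an INFO line whose message happens to contain the word ERROR. With structured logs you would filter on the level field instead.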

Alerts

  • Alerts are configured at least for basic metrics such as CPU, memory, IOPS, and disk space
  • [Nice to have] Configure alerts for errors in logs
  • [Nice to have] Use error reporting services
  • Have a mechanism to turn off alerts during deployment
  • All alerts should be tracked
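
To make the basic-metrics and mute-during-deployment items more concrete, here is a small sketch of a threshold check. The metric names and limits are made-up examples, not recommendations, and in practice this logic usually lives in your monitoring tool rather than in your own code.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float  # alert when the observed value is above this


# Example limits only; these numbers are assumptions.
THRESHOLDS = [
    Threshold("cpu_percent", 90.0),
    Threshold("memory_percent", 85.0),
    Threshold("disk_used_percent", 80.0),
]


def evaluate(metrics: dict, deploying: bool = False) -> list:
    """Return the names of metrics above their limit, or nothing while muted."""
    if deploying:
        # Alerts are switched off for the duration of a deployment.
        return []
    return [t.metric for t in THRESHOLDS if metrics.get(t.metric, 0.0) > t.limit]
```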

Backups

  • The backup strategy should be applied to each database
  • Backups should have a validation process in place
  • Rollback process should be automated and tested
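
As one possible first step of backup validation, you can compare the checksum of the stored backup file with the checksum recorded when the backup was taken. A sketch, with hypothetical function names:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as backup_file:
        for chunk in iter(lambda: backup_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path, expected_sha256):
    """True if the file on disk still matches the checksum taken at backup time."""
    return sha256_of(path) == expected_sha256
```

A matching checksum only shows that the file was not corrupted in storage. A fuller validation would also restore the backup into a scratch database and run a few queries against it.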

Deployment

  • Deployment process should be automated
  • [Nice to have] Blue/Green deployment
  • [Nice to have] Support canary deployment
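
Canary deployment is normally handled by your load balancer or service mesh, but the routing decision behind it is simple. Here is a sketch of one way to do it, assuming that you split traffic by user ID; the function name and the bucketing scheme are my assumptions.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Decide deterministically whether this user is served by the canary version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent
```

Because the bucket depends only on the user ID, the same user keeps seeing the same version while you raise the percentage step by step.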

Data replication

  • Databases should be deployed with at least the minimum required number of nodes and with a replication factor that lets you recover as quickly as possible
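
What the minimum number of nodes is depends on the database. For stores that need a majority of nodes to agree (an assumption that holds for many consensus-based systems, but not for every database), the arithmetic can be sketched like this:

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still available."""
    return nodes - majority(nodes)
```

With this rule a 2-node cluster tolerates no failures at all, which is why such systems are usually deployed with 3 or 5 nodes.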

Access

  • Only the production team should have access to production environments
  • Developers may have access to logs and monitoring information

Secrets

  • Secrets should be stored in a dedicated secret management service
  • No passwords or keys should be stored anywhere else
  • A secret management strategy should be defined, including who holds the keys
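
On the application side, this usually means that the service receives its secrets at startup (for example as environment variables injected by the secret manager) and refuses to start without them. A minimal sketch under that assumption; `require_secret` and `MissingSecretError` are hypothetical names.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not provided to the process."""


def require_secret(name, environ=None):
    """Read a secret from the environment and fail fast if it is absent or empty."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Never log the value itself, only the name of the missing secret.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```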

Smoke tests

  • Need to have a list of smoke tests (it’s good if they are automated) that could show that production is up and running
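
If you want to automate the list, a very small runner may be enough to start with. In this sketch each smoke test is a function without arguments that returns True when the check passes; the names are hypothetical, and the real checks would call your own endpoints.

```python
def run_smoke_tests(checks: dict) -> list:
    """Run every check and return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            # A crashing check counts as a failed check.
            passed = False
        if not passed:
            failures.append(name)
    return failures
```

An empty result means that production looks up and running; anything else is the list of things to investigate.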

Health checks

  • Application services should provide health check endpoints
  • The health of 3rd-party services should be checked in one of three ways: by the application service itself, through the 3rd-party service’s own health check endpoint, or by scripts written for that purpose
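
Here is a rough sketch of a health check endpoint that also reports on its dependencies, using only the Python standard library. The path `/healthz`, the class name, and the shape of the JSON body are my assumptions; your framework probably has its own way to do this.

```python
import json
from http.server import BaseHTTPRequestHandler


def health_status(dependencies: dict):
    """Run each dependency check and return (HTTP status code, per-check results)."""
    results = {}
    for name, check in dependencies.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is reported as unhealthy.
            results[name] = False
    code = 200 if all(results.values()) else 503
    return code, results


class HealthHandler(BaseHTTPRequestHandler):
    # Name -> zero-argument callable; the application fills this in at startup.
    dependencies = {}

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, results = health_status(self.dependencies)
        body = json.dumps(results).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally you could set `HealthHandler.dependencies` and run `HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()`, with `HTTPServer` imported from `http.server`.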

Versioning

  • All components should be versioned

Git branches and tags

  • Following the git flow that you use, assign tags and prepare branches for quick fixes, etc.

Capacity planning

  • Make sure that you understand the amount of resources that each component in the system needs
  • Make sure that you configure limits for all services, so that a memory leak in one service does not kill everything around it
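
A quick sanity check for both items is to add up the configured limits and compare the sum with what a node can actually provide. A sketch with made-up numbers; the 10% reserve for the operating system is an assumption, not a rule.

```python
def fits_on_node(memory_limits_mib: dict, node_memory_mib: float,
                 reserve_fraction: float = 0.1) -> bool:
    """True if the sum of all memory limits fits into the node, minus a reserve."""
    usable = node_memory_mib * (1 - reserve_fraction)
    return sum(memory_limits_mib.values()) <= usable
```

For example, `fits_on_node({"api": 512, "worker": 1024}, 2048)` is true, while two services with a 1024 MiB limit each no longer fit on the same 2048 MiB node once the reserve is taken into account.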

Time configuration [optional]

  • Make sure that time is synchronized across all nodes of your cluster

Release strategy process [if there is an older version of the product]

  • Need to describe the procedure that you are going to use, such as:
      • Big bang
      • Rolling upgrade
      • Canary

Production sign off process

  • Need to have a person who is responsible for the production deployment and who signs off that production is ready to go

Support documents

  • Contains common issues and resolutions
  • Contains instructions about production management (deployment command examples, purge scripts, other automation)

Disaster recovery

  • Need to have a disaster recovery plan with different levels of detail and ETAs, for example: how to recover one service, how to recover a database, how to recover the whole cluster, how to recover the whole region

P.S.

Please don’t hesitate to send me the things that you usually check before go-live, and I will update this list. And of course, if you need help with your application architecture, DevOps processes, or big data processing, reach out to me.

Aleksei Kornev

Solution Architect Consultant DevOps/Microservices/Backend