3 Stage Deployment Flow
This is a description of a deployment flow that has worked well for me for many years in multiple contexts. If it's useful for you too, great, but you know your context best. If you have questions about how to adjust this for your context though, by all means reach out to me.
3 Accounts
I'll be working in AWS throughout this post. You can have 3 individual accounts or you can manage them via AWS Organizations. I prefer AWS Organizations together with AWS IAM Identity Center (formerly AWS SSO) to manage this association and to allow for even more granular accounts for service compartmentalization. I'll leave that topic for a separate post.
The 3 distinct accounts that I will have are Dev, Staging and Prod. They can of course be named anything, but I'll use those 3 names throughout for consistency. The most important aspect is that these 3 accounts are distinct. I want to be able to deploy to each environment with the same deployment logic, but without any interaction among the environments. Keeping them separate contains the blast radius, both of permissions and of changes that would otherwise accidentally cause outages in unintended environments. Perhaps even more importantly, it helps manage our console access. Even if you decide to allow Admin permissions in Prod, having to switch accounts is an intentional act that acts as a trigger to be a bit more deliberate with your actions. Finally, it just makes all analysis simpler, from X-Ray traces to logs to costs.
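As a rough sketch of "same deployment logic, distinct accounts", the environments can be described as plain configuration. Everything named here is an assumption for illustration: the `EnvConfig` shape, the placeholder account IDs, the `us-east-1` region, and the `stackName` and `assertDistinctAccounts` helpers are hypothetical and not part of any AWS library.

```typescript
// Hypothetical per-environment settings. Account IDs are placeholders.
type EnvName = "dev" | "staging" | "prod";

interface EnvConfig {
  readonly name: EnvName;
  readonly accountId: string; // placeholder 12-digit AWS account ID
  readonly region: string; // assumed region, for illustration only
  readonly deployedByPipeline: boolean; // dev is updated by hand, not by CI/CD
}

const ENVIRONMENTS: Record<EnvName, EnvConfig> = {
  dev: { name: "dev", accountId: "111111111111", region: "us-east-1", deployedByPipeline: false },
  staging: { name: "staging", accountId: "222222222222", region: "us-east-1", deployedByPipeline: true },
  prod: { name: "prod", accountId: "333333333333", region: "us-east-1", deployedByPipeline: true },
};

// The same naming logic is used for every environment; only the config differs.
function stackName(service: string, env: EnvConfig): string {
  return `${service}-${env.name}`;
}

// Fails fast if two environments point at the same AWS account.
function assertDistinctAccounts(envs: EnvConfig[]): void {
  const seen: string[] = [];
  for (const env of envs) {
    if (seen.indexOf(env.accountId) !== -1) {
      throw new Error(`Account ${env.accountId} is shared by more than one environment`);
    }
    seen.push(env.accountId);
  }
}

assertDistinctAccounts([ENVIRONMENTS.dev, ENVIRONMENTS.staging, ENVIRONMENTS.prod]);
```

A guard like `assertDistinctAccounts` is cheap to run at the start of a pipeline, and it turns the "accounts must be distinct" rule into something that fails loudly instead of relying on convention.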
Dev can be a bit of the Wild West. I do not subscribe to the notion that you need to be able to work 100% disconnected. When it comes to the cloud, I use dev to experiment and test changes from my localhost. In fact, I don't even have the deployment pipeline update dev. This applies whether I'm using Terraform (via Terraform Cloud) or CloudFormation (via CDK). If you have multiple pairs (or people) working on the same workspace/stack, then you may want to consider separate "sandbox" accounts for each. I also don't think every other service you depend on needs to be working in dev for you to get your job done. This is an excellent opportunity to employ mocks for the contracts you are counting on.
Staging is a lot more controlled. I only want automated deploys to happen in this account. Everything should be automated when it comes to deploying to AWS. We think our next increment should work, so we merge to main and let our CI/CD pipeline get to work. After it runs any builds and initial tests, it executes the automated deployment to the environment. This can have multiple components to it, but for the purposes here, I'm only going to focus on deploying artifacts to staging and then running automated acceptance tests against staging. I do not want any manual testing to be required before we move on to prod, but if you're not that far along in your CD journey, you may need a gate at this point. The point is that staging is a safe environment to verify that your next deployed increment is working as expected. I do expect other services to be available in staging as well, though their versions are not guaranteed to match prod: they may be ahead of prod, but I do at least expect them to be kept current.
Prod is the account connected to users. My goal is generally to continuously deploy to production. Since the deployment and subsequent acceptance tests passed in the staging environment, we have every reason to believe it will work in production. Note that in order to make this safer and more resilient against defects, I would emphasize you want smaller changes to be deployed more frequently. I also like to have at least some amount of "smoke" tests that can be run automatically against prod after every deployment. These should also be run in the acceptance suite against staging to validate they are still expected to pass. The other means of validation done in prod is monitoring via whichever tools support your use case.
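One way to keep the smoke tests honest is to treat them as a tagged subset of the acceptance suite, so staging runs everything and prod runs only the tagged tests. This is a minimal sketch under that assumption; the `TestCase` shape, the `smoke` flag, the test names, and `testsFor` are hypothetical names, not features of a particular test framework.

```typescript
// Hypothetical description of a test in the acceptance suite.
interface TestCase {
  readonly name: string;
  readonly smoke: boolean; // true if the test is safe to run against prod
}

// Staging runs the full suite (smoke tests included); prod runs only smoke tests.
function testsFor(env: "staging" | "prod", suite: TestCase[]): TestCase[] {
  if (env === "prod") {
    return suite.filter((test) => test.smoke);
  }
  return suite;
}

// Illustrative suite; the test names are made up.
const suite: TestCase[] = [
  { name: "health endpoint responds", smoke: true },
  { name: "user can place an order", smoke: false },
  { name: "home page renders", smoke: true },
];

const prodTests = testsFor("prod", suite).map((test) => test.name);
const stagingTests = testsFor("staging", suite).map((test) => test.name);
```

Because the prod selection is derived from the same suite, a smoke test cannot exist in prod without also having been exercised in staging first.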
The Anatomy of the Build Pipeline
The Commit Stage needs to take at most 5 minutes. If it starts creeping past 4 minutes, you should start looking into where the bottlenecks are and how to bring the time back down. Every app or service may have its own nuances, but these are the primary components I tend to see:
- Perform static checks (e.g. ESLint, Prettier).
- Compile & build (if using Docker, build your image here).
- Run unit tests.
- CDK tests: I don't advocate a lot of them, as I prefer to run acceptance tests against a deployed environment instead. I have typically written a single CDK test per stack though, and I run that task in a parallel job, since it can often take a little over a minute to run. This test at least acts as a verification that any changes to CDK logic can successfully build a new stack.
- Note: I do not deploy to dev here.
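The time budget above can be checked mechanically. The following is a minimal sketch, assuming jobs run in parallel and the steps inside a job run sequentially; the `Job` shape, the function names, and the example durations are illustrative assumptions, not measurements from a real pipeline.

```typescript
// A job is a list of sequential step durations, in seconds.
interface Job {
  readonly name: string;
  readonly stepSeconds: number[];
}

const WARN_SECONDS = 4 * 60; // start investigating bottlenecks past this
const MAX_SECONDS = 5 * 60; // hard budget for the Commit Stage

// Jobs run in parallel, so the stage takes as long as its slowest job.
function wallClockSeconds(parallelJobs: Job[]): number {
  let longest = 0;
  for (const job of parallelJobs) {
    let total = 0;
    for (const seconds of job.stepSeconds) {
      total += seconds;
    }
    if (total > longest) {
      longest = total;
    }
  }
  return longest;
}

type BudgetStatus = "ok" | "investigate" | "over-budget";

function budgetStatus(seconds: number): BudgetStatus {
  if (seconds > MAX_SECONDS) return "over-budget";
  if (seconds > WARN_SECONDS) return "investigate";
  return "ok";
}

// Illustrative durations: static checks, build, and unit tests in one job,
// with the CDK test in its own parallel job.
const commitStage: Job[] = [
  { name: "checks-build-unit", stepSeconds: [30, 90, 100] },
  { name: "cdk-test", stepSeconds: [70] },
];

const commitStageSeconds = wallClockSeconds(commitStage);
const commitStageStatus = budgetStatus(commitStageSeconds);
```

With these numbers the CDK test adds nothing to the wall-clock time, because the slower job dominates; that is the reason to move it into a parallel job.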
The Acceptance Stage can sometimes take some time to complete. This is ok, but I still recommend looking for opportunities to parallelize jobs or make other speed improvements. The primary components I tend to include are:
- Deploy to staging (e.g. cdk deploy or push to Terraform Cloud).
- Run acceptance tests against the deployed staging environment.
- Run any contract tests in staging. This is a separate topic of its own, but typically I am executing consumer-level contract tests via my unit test framework of choice. My contract test here will call the "adapter" class or function (as in ports and adapters) that only has logic to execute remote calls to an external dependency.
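To make the adapter idea from the contract-test bullet concrete, here is a minimal sketch of a port, an adapter, and the response check a consumer-level contract test would exercise. The exchange-rate domain, the `/rates/` path, the `{ rate: number }` response shape, and all type and function names are hypothetical. The transport is injected as a `GetJson` function so the sketch stays self-contained; in a real contract test it would be an HTTP client pointed at the dependency in staging.

```typescript
// Port: what the application core needs from the external dependency.
interface ExchangeRatePort {
  getRate(currency: string): Promise<number>;
}

// Injected transport (assumption): performs the remote call and returns parsed JSON.
type GetJson = (url: string) => Promise<unknown>;

// The consumer's expectations of the response, i.e. the contract being checked.
function parseRateResponse(body: unknown): number {
  if (typeof body !== "object" || body === null) {
    throw new Error("contract violation: expected a JSON object");
  }
  const rate = (body as { rate?: unknown }).rate;
  if (typeof rate !== "number") {
    throw new Error("contract violation: 'rate' must be a number");
  }
  return rate;
}

// Adapter: only contains the logic needed to call the external dependency.
class HttpExchangeRateAdapter implements ExchangeRatePort {
  private readonly baseUrl: string;
  private readonly getJson: GetJson;

  constructor(baseUrl: string, getJson: GetJson) {
    this.baseUrl = baseUrl;
    this.getJson = getJson;
  }

  getRate(currency: string): Promise<number> {
    const url = `${this.baseUrl}/rates/${encodeURIComponent(currency)}`;
    return this.getJson(url).then(parseRateResponse);
  }
}
```

A contract test would construct the adapter with a real transport, call `getRate`, and fail if `parseRateResponse` throws. Keeping the adapter this thin means a failure points at a changed contract rather than at business logic.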
The Production Stage is now ready to execute.
- Automatically repeat the same deployment logic from the acceptance stage, but against the production environment instead. Only configuration details should change here.
- Run smoke tests against the production account.
- Run a subset of contract tests that can be safely exercised in a production environment.
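The relationship between the Acceptance and Production stages can be sketched as one loop over configurations: the same deploy and verify steps run for each environment, and a failure in staging stops the pipeline before prod is touched. The `StageConfig` and `StageSteps` shapes and the `runStages` function are hypothetical; the steps are injected so the sketch runs without any AWS tooling, and in practice `deploy` would wrap something like `cdk deploy` or a push to Terraform Cloud.

```typescript
interface StageConfig {
  readonly envName: "staging" | "prod";
  readonly accountId: string; // placeholder AWS account ID
}

// Injected steps (assumption): each returns true on success.
interface StageSteps {
  deploy(config: StageConfig): boolean;
  verify(config: StageConfig): boolean; // acceptance tests in staging, smoke tests in prod
}

// Runs the same logic per environment and stops at the first failure.
function runStages(configs: StageConfig[], steps: StageSteps): string[] {
  const log: string[] = [];
  for (const config of configs) {
    if (!steps.deploy(config)) {
      log.push(`${config.envName}: deploy failed`);
      return log;
    }
    log.push(`${config.envName}: deployed`);
    if (!steps.verify(config)) {
      log.push(`${config.envName}: verification failed`);
      return log;
    }
    log.push(`${config.envName}: verified`);
  }
  return log;
}

const stages: StageConfig[] = [
  { envName: "staging", accountId: "222222222222" },
  { envName: "prod", accountId: "333333333333" },
];
```

Since only the `StageConfig` differs between iterations, a deployment that reaches prod has already been exercised by identical logic in staging.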
I consider having an end-to-end pipeline the first thing to implement with any new app or service. Of course, if you are working in an existing brownfield project, you will have to adapt and incorporate this model into your existing process. As an advocate of trunk-based development (TBD), especially with pairing or mobbing, I'm typically not focused on implementing any sort of pull-request (PR) workflow. For me, the PR workflow is typically just a mirror of the Commit Stage with no deployments happening.
This is a relatively high-level description of the pattern that has worked best for me over the years. Each application or service has its differences, but the pattern has been relatively easy to adapt from project to project.
- v1.0.1 - Jan 29, 2024 - Add versioning
- v1.0.0 - Dec 30, 2023