[Terragrunt GitOps - Part 1] Introduction and design

Introduction

This article presents the architecture of a Terragrunt GitOps pipeline on GCP (Google Cloud Platform). It aims to give you some inspiration when designing your own GitOps pipeline. The design is explained in detail (including the quirks I've added as well as the limitations I haven't addressed), and the implementation code is shared so that anyone can bootstrap it themselves to test or develop further.

Features of the solution

To give a taste of the features I've included, here's an appetizer:

  • separate sets of triggers per customer

  • infrastructure separation (GCP projects) per customer

  • GitHub fine-grained personal access tokens

  • trunk-based branching

  • Terraform module separation (repositories) with tagging

  • Service Account impersonation using short-lived tokens

  • dynamically created build configs stored in a central repository

  • customized Docker image running the jobs

  • pre-commit for Terraform & Terragrunt

Prerequisites to understand this article

To follow this article, you'll need some working knowledge of the following:

  • Terraform

  • Terragrunt

  • Cloud Build

  • GCP (basics)

  • Git & GitHub

  • Bash (basics)

  • Docker (basics)

  • fundamental concepts of CI/CD and GitOps

High-Level Design

GitOps in 1 minute

The very high-level design is a standard GitOps approach: a developer pushes code (configuration) to the remote repository (GitHub in our case), and that event triggers a job on your automation service (Cloud Build in our case), which applies the configuration across customers and environments.
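To make this concrete, here's a minimal sketch of what such a push trigger could look like in Terraform. The project, organization and filter values are made up; the actual triggers in this series are created by the terragrunt-runner-module and are more elaborate.

resource "google_cloudbuild_trigger" "terragrunt_apply" {
  project  = "sp-customer1-project"            # hypothetical customer-dedicated project
  name     = "customer1-prod-terragrunt-apply"
  filename = "cloudbuild.yaml"                 # build config taken from the repository

  github {
    owner = "example-org"                      # hypothetical GitHub organization
    name  = "terragrunt-example-envs"
    push {
      branch = "^main$"                        # trunk-based: react to pushes to main
    }
  }

  # Fire only when this customer's/environment's configuration changes.
  included_files = ["envs/customers/customer1/prod/**"]
}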

Now, you may have 2 questions already:

  • what does the asterisk next to the cloudbuild.yaml file mean?

  • should we really run terragrunt run-all (many folks say it goes against best practices)?

I'll address the first point when explaining how triggers are constructed, and the second in the last article of the series.

The overall structure of resources in GCP

All of the resources belong either to the "service provider" (SP) perimeter (that is, to you - the engineer responsible for deploying the solution in some company) OR to one of your customers' environments. Inside the SP perimeter, we have one common project plus as many customer-dedicated projects as we have customers. The number of projects in a single customer perimeter varies, of course.

The "common project" in the SP perimeter contains (a set of) Cloud Build triggers whose purpose is to create (a set of) triggers per each customer. Those sets of triggers will be attached to a customer-dedicated project. Apart from the "trigger creator" CB trigger (yep, that's right...), this common project contains resources such as Docker image which runs the steps of the builds, fine-grained personal access token, and a few other things.

Each customer-dedicated project contains the aforementioned set of triggers responsible for running Terragrunt jobs for a given customer. Those projects also contain Storage buckets for state files and artifacts, service accounts that are attached to those triggers, and... a few other things.
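As a rough illustration (the names are hypothetical and the real runner module is more elaborate), the per-customer plumbing boils down to resources like these:

resource "google_storage_bucket" "terragrunt_state" {
  name                        = "customer1-prod-terragrunt-state"
  project                     = "sp-customer1-project"
  location                    = "EU"
  uniform_bucket_level_access = true

  versioning {
    enabled = true   # keep a history of Terraform state files
  }
}

resource "google_service_account" "terragrunt_runner" {
  project      = "sp-customer1-project"
  account_id   = "terragrunt-runner-prod"
  display_name = "Runs the customer1 prod Terragrunt builds"
}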

Finally, customers' projects contain whatever a given customer has deployed there. What's important for us is that we deploy the resources that are part of our solution there (described a bit below). Those resources are deployed using customer-provided service accounts that we impersonate in a Cloud Build job.

But what do we deploy actually?

Alright, so I've described the projects, the "location" of the Cloud Build triggers (location as in associated projects, not regions), impersonation, etc. But what do the red arrows in the diagram above actually deploy?

The correct answer is: whatever you need. However, let me give you an example to wrap your head around it.

Example

Imagine that one (or more) of your customers asked you to provide them with some report every day. This report contains data that you (the SP) own. You want to automate that process, so you write a Cloud Function and deploy a Cloud Scheduler job to trigger it once a day. You also want to save some data in the customer's project, in a GCS bucket. The function and the scheduler are GCP resources that you deploy in your (SP) perimeter, in a customer-dedicated project, using your own service account with the proper IAM bindings.

Then, you deploy a GCS bucket in the customer's project (customer perimeter). For that, you use a customer-provided service account. In order for the Cloud Build job (running Terragrunt) to use that account (without 'attaching' it to the Cloud Build resource), you use impersonation.
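In Terraform terms, impersonation boils down to a provider setting. A minimal sketch (the account names here are hypothetical):

provider "google" {
  project = "customer1-prod-project"   # the customer's own project

  # The build's own service account never holds permissions in the customer's
  # project; it only mints short-lived tokens for the customer-provided account.
  impersonate_service_account = "sp-deployer@customer1-prod-project.iam.gserviceaccount.com"
}

For this to work, the service account running the build needs roles/iam.serviceAccountTokenCreator on the customer-provided account - that's the "OVER the service account" grant you'll see again in the LLD section.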

This tutorial's example

In this tutorial, I didn't bother to write any function (maybe in some future article/series I'll do that, because then I'd like to add a "CI" part to this pipeline - code testing and artifact creation). So I use the simplest example of all - I create one GCS bucket in a customer-dedicated project, and another GCS bucket in a customer project (using impersonation). Boring, I know, but hey - that's not the point of this series; we focus on Terragrunt here!

Low-level design

How many environments?

Let's start with this simple question. Most customers come (as they should!) with multiple environments: dev, stage, prod, etc. We definitely want "copies" of the relevant resources for each environment, both in the customer perimeter and in SP (so, in our boring case of GCS buckets: one bucket per environment in the customer-dedicated project, and one bucket in each of the customer's environments - which will usually mean separate projects).

In this article you will come across something rather peculiar - the check "environment". That's a bit of redundancy I came up with. The idea sprang up when I noticed customers with a single (prod-only) environment. If I deploy a complex solution (with resources both in the customer-dedicated project in SP and in the customer's perimeter), I would like to first check that everything is right before creating new resources in their production set-up. But if they don't have a non-prod environment, that might be tricky. One way is to create a copy of the resources in SP (the part of the solution that's deployed in SP) without creating anything in the customer's perimeter - and that's exactly what check does.

Repositories

In this article, I'll be using the following repositories:

  • terraform-random-sample-module (link) -> This contains the Terraform code for the (very simple) module that generates random strings. I need it because I want to demonstrate dependency handling in Terragrunt.

  • terraform-storage-sample-module (link) -> This contains the Terraform code for the GCS bucket-creation module. It is the main resource (albeit a very simplistic one) that I'll use to demonstrate resource creation in the SP and customer perimeters.

  • terragrunt-runner-module (link) -> This contains the Terraform code for the Cloud Build triggers (and related resources - e.g. service accounts, GCS buckets). This module is complex and I will explain some sections in one of the next articles.

  • terragrunt-example-envs (link) -> This contains the Terragrunt code for the environments. It's the heart of this solution (or one of its 2 hearts, along with the runner module - I hope some creature has 2 hearts to make this metaphor work). I explain its contents in the section below.

The diagram above shows how the repos are utilized. terragrunt-example-envs is central here - the triggers react to changes in its codebase. The other repos contain Terraform modules that are invoked in the Terragrunt configuration kept in terragrunt-example-envs.
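Jumping slightly ahead to the repo layout described below, a leaf configuration in terragrunt-example-envs could pin a module repository to a tag and consume the random module as a dependency roughly like this (paths, tags, inputs and outputs are illustrative, not copied from the repo):

# envs/customers/customer1/prod/storage/terragrunt.hcl (illustrative sketch)
include "root" {
  path = find_in_parent_folders()
}

include "provider" {
  path = find_in_parent_folders("provider.hcl")
}

terraform {
  # Tagged module repository - each invocation can pin its own version.
  source = "git::https://github.com/example-org/terraform-storage-sample-module.git?ref=v0.1.0"
}

# Dependency handling: consume an output of the random module.
dependency "random" {
  config_path = "../random"

  mock_outputs = {
    suffix = "mock"   # lets plan/validate run before the dependency is applied
  }
}

inputs = {
  bucket_name = "customer1-prod-data-${dependency.random.outputs.suffix}"
}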

Terragrunt repo structure

If it wasn't clear from the description above, let me reiterate - terragrunt commands are executed in the steps of the builds that run in customer-dedicated projects.

When running terragrunt run-all, it's imperative that we choose the proper directory in which to execute the command. Consequently, we have to plan the folder structure of the repo containing the configuration (and the modules' invocations).

I chose this:

# tree
.
├── README.md
└── envs
    ├── common_all_customers.hcl
    ├── common_general.hcl
    ├── customers
    │   ├── customer1
    │   │   ├── check
    │   │   │   ├── environment.hcl
    │   │   │   ├── random
    │   │   │   │   └── terragrunt.hcl
    │   │   │   └── storage
    │   │   │       └── terragrunt.hcl
    │   │   ├── customer.hcl
    │   │   ├── dev
    │   │   └── prod
    │   │       ├── environment.hcl
    │   │       ├── random
    │   │       │   └── terragrunt.hcl
    │   │       ├── storage
    │   │       │   └── terragrunt.hcl
    │   │       └── storage_impersonate
    │   │           └── terragrunt.hcl
    │   ├── customer2
    │   │   ├── check
    │   │   └── prod
    │   └── internal-testing
    ├── onboard
    │   ├── customer1
    │   │   ├── check
    │   │   │   └── terragrunt.hcl
    │   │   ├── dev
    │   │   └── prod
    │   │       └── terragrunt.hcl
    │   └── onboard.hcl
    ├── provider.hcl
    └── terragrunt.hcl

Note. The files you see here are present only after you follow the steps of the subsequent articles of this series. At the very beginning, the number of files and directories will be much smaller.

Woah, that's a lot. Again, I assume you know a bit of Terragrunt, because I won't be explaining the simple stuff.

Let's break it down starting from the current directory and then going deeper into the folder structure:

terragrunt.hcl is the core file, although it's quite a short one. It has 2 main functions: firstly, it dynamically creates the backend config; secondly, it is responsible for merging the local values placed at various levels of the structure. This merging is quite standard, and you can see an example of it (apart from in the article you're reading right now) in this Gruntwork repo.
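A stripped-down sketch of what such a root file can look like (the file layout matches the tree above, but the naming scheme and exact keys are my assumptions, not a copy of the repo):

# envs/terragrunt.hcl (illustrative sketch)
locals {
  # Merge the locals defined at the different levels of the tree.
  general  = read_terragrunt_config(find_in_parent_folders("common_general.hcl"))
  customer = read_terragrunt_config(find_in_parent_folders("customer.hcl"))
  env      = read_terragrunt_config(find_in_parent_folders("environment.hcl"))

  merged = merge(
    local.general.locals,
    local.customer.locals,
    local.env.locals,
  )
}

# Dynamically created backend config - one state prefix per module invocation.
remote_state {
  backend = "gcs"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket = "${local.merged.customer_id}-${local.merged.environment}-terragrunt-state"
    prefix = path_relative_to_include()
  }
}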

provider.hcl contains the GCP provider config. common_all_customers.hcl and common_general.hcl contain only local-type variables.
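A typical way to implement provider.hcl is a generate block, so that every module invocation gets an identical provider definition. Here's an illustrative version that assumes environment.hcl defines project_id and region locals (the actual file in the repo may look different):

# envs/provider.hcl (illustrative sketch)
locals {
  env = read_terragrunt_config(find_in_parent_folders("environment.hcl"))
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"

  contents = <<EOF
provider "google" {
  project = "${local.env.locals.project_id}"
  region  = "${local.env.locals.region}"
}
EOF
}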

Now, we have 2 directories to explain: customers and onboard. Let's start with onboard.

Onboard directory.

In the HLD section, I mentioned "trigger creating other triggers" (trigger creator). From now on, I'll call it a "meta" trigger, because that's the convention I've been using in the code.

Think about what needs to be done to "onboard" a new customer into the solution. You would of course need a new customer-dedicated project, and in that project you'd need a set of Cloud Build triggers and all the auxiliary resources. The onboard directory contains the code that creates what's needed for that onboarding phase.

onboard.hcl is a bit similar to the terragrunt.hcl in the working directory. Then, each directory (like prod or dev) pertains to a customer environment. check is special, as described earlier - it does NOT affect the customer and creates resources only in the SP perimeter.
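A leaf file under onboard can then be little more than an invocation of the runner module for one customer/environment pair; the inputs below are made up for illustration:

# envs/onboard/customer1/prod/terragrunt.hcl (illustrative sketch)
include "onboard" {
  path = find_in_parent_folders("onboard.hcl")
}

terraform {
  source = "git::https://github.com/example-org/terragrunt-runner-module.git?ref=v0.1.0"
}

inputs = {
  customer    = "customer1"
  environment = "prod"
  # ...plus whatever else the runner module needs (repo names, trigger filters, etc.)
}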

Customers directory.

Here we have a separate directory per customer (plus internal-testing, which I reserved for the future but haven't used in this tutorial). Then we have a directory per environment (the same way as described in the onboard section). Finally, we have a directory per module invocation (as is usual in Terragrunt). The terragrunt.hcl files at the lowest level of the tree contain mainly include and terraform blocks and the inputs attribute.

Along the way in the tree, you'll also notice files like environment.hcl or customer.hcl. They contain local variables that are merged in the top-level terragrunt.hcl.
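These files are deliberately tiny; for example (all values hypothetical):

# envs/customers/customer1/customer.hcl (illustrative)
locals {
  customer_id = "customer1"
}

# envs/customers/customer1/prod/environment.hcl (illustrative)
locals {
  environment = "prod"
  project_id  = "customer1-prod-project"   # simplification - the real set-up distinguishes SP and customer projects
  region      = "europe-west1"
}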

Resources on an LLD diagram

That's a busy diagram, but it is supposed to be a low-level one. Please notice the colors of the captions - they reflect who creates the resources and how.

Orange ones are created as part of the "bootstrap" of my solution. That doesn't mean they're created via "click-ops" (in fact, most are created by Terraform); it means that those resources will not be created or changed in reaction to code changes in any repository. In your case, some of these resources may already exist in your environment (for example, the GCP project may exist, the APIs may be enabled, and you may already have a Docker image that runs Terragrunt). They may also be managed in a GitOps way themselves (for example, via another repository and another "general" Terraform/Terragrunt runner). Customize as you please - for this article, it didn't make sense to automate further.

Green ones are created by the customer. The customer will already have GCP projects, of course. They will also be asked to create a Service Account and grant the relevant roles: TO the service account (so that it can do stuff in their environment) and OVER the service account (so that the SP can impersonate it).
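The "OVER the service account" grant is what makes impersonation possible. On the customer's side it could look roughly like this (shown in Terraform for concreteness, with hypothetical account names; gcloud or the console work just as well):

# Granted by the customer, in the customer's project (illustrative)
resource "google_service_account_iam_member" "allow_sp_impersonation" {
  service_account_id = "projects/customer1-prod-project/serviceAccounts/sp-deployer@customer1-prod-project.iam.gserviceaccount.com"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:terragrunt-runner-prod@sp-customer1-project.iam.gserviceaccount.com"
}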

Grey ones, finally, are created by our GitOps solution.

Why, you may ask, are so few resources created by GitOps here? Firstly, the orange/green resources are a one-time set-up: you do it once (or at least change it infrequently) and you're set to go. Secondly, the example in this article (GCS buckets) is the simplest of all, so in real-world scenarios the number and complexity of the grey resources will grow. Thirdly, it is possible to automate the creation of the orange resources, as I've explained above - it just wasn't needed for the purposes of this article.

Summary

In this article, I've described the idea and design of the solution. The crux of it is the deployment of resources into customers' perimeters by Terragrunt workers, with separation per customer and per environment. The module repositories are versioned and tagged, which allows flexibility when choosing a version. The resources in the customers' perimeters are deployed using customer-provided service accounts that the service provider (SP) impersonates.

Now, let's continue to the next article of the series - Prerequisites & bootstrap.