This was forked from runatlantis/atlantis at v0.17.3
Since then this version has diverged significantly from upstream and was therefore detached.
⚠️ This repo is still still contains a lot of code from upstream which is slowly being phased out as we test our implementation in production. It is not ready for general consumption.
Functional Differences
Non Functional Differences
Most of the new code can be found in server/neptune
. Neptune is the codename for our rebuild of Atlantis. Outside of that package is mostly old code which we are removing as we deprecate upstream/legacy workers.
Currently, Atlantis can operate in 3 modes based on the configuration passed in:
In order for Neptune to work correctly, all of these must exist.
Receives webhook events from github and acts on them accordingly. Gateway is stateless however each request does spin up a go routine that clones a repository to disk. This is the primary bottleneck here.
unlocked
and plan_review
. When we receive both these events, we signal our deploy workflow and terraform workflow respectively.Re-run failed checkruns
button selection and signals our deploy workflow that we are attempting to add the previously failing revision back into the deploy queue for a rerun attempt. This is similar to the check suite event described below, but the key difference here is this request only reruns attempts where a checkrun has failed.Re-run all checkruns
button selection events and signals our deploy workflows that we are attempting add all off the modified roots within a revision back into their respective deploy queues for a rerun attempt, regardless of success or failure status. Note that we only support rerun attempts if a revision meets the following criteria:
Responsible for speculative planning and policy checking within the PR. This code is relatively untouched from upstream atlantis and should eventually be nuked in favor of Temporal workflows.
Responsible for running 3 primary processes:
Deploy workflows are run on the granularity of a single repository root. It follows the ID pattern below:
<OWNER/REPO>||<ROOT_NAME>
The following is a high level diagram of how this workflow is structured:
┌─────────────┐
│ │
│ deployment │
│ store │
┌─────────────────┐ │ │
│ select │ └────▲───┬────┘
│ │ │ │
│ │ │ │
│ │ │ │
├┬───────────────┬┤ ┌────┴───▼──────┐
││revision signal││ ┌──────────────┐ │ │
││ channel │┼────►priority queue├────► queue worker │
│┼───────────────┼│ └──────────────┘ │ │
│┼───────────────┼│ └──────┬────────┘
││ timeout timer ││ │
│┼───────────────┼│ │
└─────────────────┘ ┌───────▼─────────┐
│ select │
│ │
│ │
│ │
├┬───────────────┬┤
││ queue::CanPop │┼───────────────┐
││ ││ │
│┼───────────────┼│ ┌────────▼─────────┐
│┼───────────────┼│ │ select │
││ unlock signal ││ │ │ ┌─────────┐
│┼───────────────┼│ │ │ │ Github │
└─────────────────┘ │ │ └▲────────┘
├┬────────────────┬┤ │
││ state change ││ │
││ signal channel │┼──────────┴───┐
│┼────────────────┼│ │
│┼────────────────┼│ │
││ child workflow ││ ┌──▼──┐
│┼────────────────┼│ │ SNS │
└──────────────────┘ └─────┘
The deploy workflow is responsible for a few things:
In order to receive revisions our main workflow thread listens to a dedicated channel. If we haven't received a new revision in 60 minutes and our queue is empty, we instigate a shutdown of the workflow in its entirety.
The queue is modeled as a priority queue where manually triggered deployments always happen before merge triggered deployments. This queue can be in a locked state if the last deployment that happened was triggered manually. The queue lock applies only to deployments that have been triggered via merging and can be unlocked through the corresponding check run of an item that is blocked.
Items can only be popped of the queue if the queue is unlocked OR if the queue is locked but contains a manually triggered deployment.
By default, a new workflow starts up in an unlocked state.
Upon workflow startup, we start a queue worker go routine which is responsible for popping off items from our queue when it can, and listening for unlock signals from gateway.
The worker also maintains local state on the latest deployment that has happened. This is used for validating that new revisions intended for deploy are ahead of the latest deployed revisions. Once each deployment is complete, the worker persists this information in the configured blob store. The worker only fetches from this blob store on workflow startup and maintains the information locally for lifetime of its execution.
A deploy consists of executing a terraform workflow. The worker blocks on execution of this "child" workflow and listens for state changes via a dedicated signal channel. Once the child is complete, we stop listening to this signal channel and move on.
State changes are reflected in the github checks UI (ie. plan in progress, plan failed, apply in progress etc.). A single check run is used to represent the deployment state. The check run state is indicative of the completion state of the deployment and the details of the deployment itself are rendered in the check run details section.
State changes for apply jobs specifically are sent to SNS for internal auditing purposes.
The Terraform workflow runs on the granularity of a single deployment. It's identifier is the deployment's identifier which is randomly generated in the Deploy Workflow. Note: this means a single revision can be tied to multiple deployments.
The terraform workflow is stateful due to the fact that it keeps data on disk and references it throughout the workflow. Cleanup of that data only happens when that workflow is complete. In order to ensure this statefulness, the terraform workflow is aware of the worker it's running on and fetches this information as part of the first activity. Each successive activity task takes place on the same task queue.
Following this:
Before and after this job, the workflow signals it's parent execution with a state object. At this point, the workflow either blocks on a dedicated plan review channel, or proceeds to the apply under some criteria. Atm this is only if there are no changes in the plan.
Plan review signals are received directly by this workflow from gateway which pulls the workflow ID from the check run's External ID
field.
If the plan is approved, the workflow proceeds with the apply, all the while updating the parent execution with the status, before exiting the workflow. If the plan is rejected or times out (1 week timeout on plan reviews), the parent is notified and the workflow exits.
The workflow itself has no retries configured. All activities use the default retry policy except for Terraform Activities. Terraform Activities throw up a TerraformClientError
if there is an error from the binary itself. This error is configured to be non-retryable since most of the time this is a user error.
For Terraform Applies, timeouts are not retried. Timeouts can happen from exceeding the ScheduleToClose threshold or from lack of heartbeat for over a minute. Instead of retrying the apply, which can have unpredictable results, we signal our parent that there has been a timeout and this is surfaced to the user.
Since Terraform activities can run long, we send hearbeats at 5 second intervals. If 1 minute goes by without receiving a hearbeat, temporal will assume the worker node is down and the configured retry policy will be run.
Terraform operation logs are streamed to the local server process using go channels. Once the operation is complete, the channel is closed and the receiving process persists the logs to the configured job store.
atlantis server --gh-user <your username> --gh-token <your token> --repo-allowlist <your repo> --gh-webhook-secret <your webhook secret> --log-level debug
If you get an error like command not found: atlantis
, ensure that $GOPATH/bin
is in your $PATH
.The atlantis worker can't technically be run yet locally given it's dependency on sqs. However, Docker compose is set up to run a gateway, a temporal worker, temporalite and ngrok all in the same network. Ngrok allows us to expose localhost to the public internet in order to test github app integrations.
There is some setup that is required in order to have your containers running and receiving webhooks.
Setup your own personal github organization and test github app.
Install this app to a test repo within your organization with the following configuration.
Download the key file and save it to ~/.ssh
directory. Note: ~/.ssh
is mounted to allow for referencing any local ssh keys.
Create the following files: ~/atlantis-gateway.env
ATLANTIS_PORT=4143 ATLANTIS_GH_APP_ID=<FILL THIS IN> ATLANTIS_GH_APP_KEY_FILE=/.ssh/your-key-file.pem ATLANTIS_GH_WEBHOOK_SECRET=<FILL THIS IN> ATLANTIS_GH_APP_SLUG=<FILL THIS IN> # The github organization the feature flag repo resides ATLANTIS_FF_OWNER=<FILL THIS IN> # Name of the feature flag repo ATLANTIS_FF_REPO=<FILL THIS IN> # The path to the flags.yaml file ATLANTIS_FF_PATH=<FILL THIS IN> ATLANTIS_DATA_DIR=/tmp/ ATLANTIS_LYFT_MODE=gateway ATLANTIS_REPO_CONFIG=/generated/repo-config.yaml ATLANTIS_WRITE_GIT_CREDS=true ATLANTIS_ENABLE_POLICY_CHECKS=true ATLANTIS_ENABLE_DIFF_MARKDOWN_FORMAT=true ATLANTIS_REPO_ALLOWLIST=<FILL THIS IN> ALLOWED_REPOS=<FILL THIS IN>
~/atlantis-temporalworker.env
ATLANTIS_PORT=4142 ATLANTIS_GH_APP_ID=<FILL THIS IN> ATLANTIS_GH_APP_KEY_FILE=/.ssh/your-key-file.pem ATLANTIS_GH_WEBHOOK_SECRET=<FILL THIS IN> ATLANTIS_GH_APP_SLUG=<FILL THIS IN> ATLANTIS_FF_OWNER=<FILL THIS IN> ATLANTIS_FF_REPO=<FILL THIS IN> ATLANTIS_FF_PATH=<FILL THIS IN> ATLANTIS_DATA_DIR=/tmp/ ATLANTIS_LYFT_MODE=temporalworker ATLANTIS_REPO_CONFIG=/generated/repo-config.yaml ATLANTIS_WRITE_GIT_CREDS=true ATLANTIS_ENABLE_POLICY_CHECKS=true ATLANTIS_ENABLE_DIFF_MARKDOWN_FORMAT=true ATLANTIS_REPO_ALLOWLIST=<FILL THIS IN> ALLOWED_REPOS=<FILL THIS IN> ATLANTIS_DEFAULT_TF_VERSION=1.2.5
Once these steps are complete, everything should startup as normal. You just need to run:
make build-service
docker-compose build
docker-compose up
In order to have events routed to gateway, you'll need to visit http://localhost:4040/
and copy the https url into your GitHub app.
In order to see the temporal ui visit http://localhost:8233/
.
If the ngrok container is restarted, the url will change which is a hassle. Fortunately, when we make a code change, we can rebuild and restart the atlantis container easily without disrupting ngrok.
e.g.
make build-service
docker-compose up --detach --build
make test
. If you want to run the integration tests that actually run real terraform
commands, run make test-all
.
docker run --rm -v $(pwd):/go/src/github.com/runatlantis/atlantis -w /go/src/github.com/runatlantis/atlantis ghcr.io/runatlantis/testing-env:latest make test
Or to run the integration tests
docker run --rm -v $(pwd):/go/src/github.com/runatlantis/atlantis -w /go/src/github.com/runatlantis/atlantis ghcr.io/runatlantis/testing-env:latest make test-all
Calling Your Local Atlantis From GitHub
- Metadata: Read-Only
- Pull Requests: Read and Write
- Commit Statuses: Read and Write
Similarly, subscribe for the following events:
- Pull Request
- Issue Comment
atlantis server --gh-user <your username> --gh-token <your token> --repo-allowlist <your repo> --gh-webhook-secret <your webhook secret> --log-level debug
4141
:https
url that ngrok
printed out after running ngrok http 4141
. Be sure to append /events
so your webhook url looks something like https://efce3bcd.ngrok.io/events
. See Add GitHub Webhook.atlantis help
. You should see the request in the ngrok
and Atlantis logs and you should also see Atlantis comment back.ctx.Log
should be available in most methods. If not, pass it down.ctx.Log.Infof("cleaning clone dir %q", dir)
=> Cleaning clone directory "/tmp/atlantis/lkysow/atlantis-terraform-test/3"
:
" in a log since that's used to separate error descriptions and causes
-
or ,
ex. failed to clean directory, continuing regardless
errors.Wrap(err, "additional context...")"
instead of fmt.Errorf("additional context: %s", err)
because it is less likely to result in mistakes and gives us the ability to trace call stacksError setting up workspace: failed to run git clone: could find git
and will instead look like
Error: setting up workspace: running git clone: no executable "git"
This is easier to read and more consistent
{package under test}_test
to enforce testing the external interfaces{file under test}_internal_test.go
import . "github.com/runatlantis/atlantis/testing"
and then use Assert()
, Equals()
and Ok()
We write our own mocks to reduce dependencies. Most of the old code uses pegomock which is unmaintained. We shouldn't use this for any new changes going forward.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4