Many business verticals require business continuity management (BCM) for production services. A reliable backup of your Terraform Enterprise deployment is crucial to ensuring business continuity. The backup should include data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's Recovery Time Objective (RTO) and to its Recovery Point Objective (RPO).
This guide extends the Backup & Restore documentation, which contains more technical detail about the backup and restore process. This guide discusses the best practices, options, and considerations to back up Terraform Enterprise and increase its resiliency. It also recommends redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the chances of requiring backups.
Most of this guide is only relevant to single-region, multi-availability zone External Services mode deployments, except where otherwise stated. Refer to the Backup a Mounted Disk Deployment section below for specific details if you are running a Mounted Disk deployment. This guide does not cover Demo mode backups.
For region redundancy, repeat the recommendations in this guide for each region and consider the recommendations in the Multi-Region Considerations section at the end of this page.
For recommended patterns for recovery and restoration of TFE, refer to the Terraform Enterprise Recovery & Restoration Recommended Pattern.
Business continuity (BC) is a corporate capability. This capability exists whenever organizations can continue to deliver their products and services at acceptable, predefined levels whenever disruptive incidents occur.
Note
The ISO 22301 document uses business continuity rather than disaster recovery (DR). As a result, this guide refers to business continuity instead of disaster recovery.
Two factors heavily determine your organization's ability to achieve BC:
Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
Recovery Point Objective (RPO) is the maximum tolerable period of data loss from an incident. For example, if an organization has an RPO of one hour, it can tolerate losing at most the hour of data recorded immediately before the service disruption.
Based on these definitions, you should assess the valid RTO/RPO for your business and approach BC accordingly. These factors will determine your backup frequency and other considerations discussed later in this guide.
When you deploy Terraform Enterprise, keep the following in mind:

- For fully automated deployments, you must manage several common sensitive values. The methods below do not back up this data, so you should secure it another way. Do not store any of these sensitive values in version control or allow them to leak into shell histories.
- Active/Active deployments must be automated and have additional sensitive values you must manage.
Process audit logs

Audit log processing helps you identify the root cause during a data recovery incident.
Follow the guidance on Terraform Enterprise logs to aggregate and index logs from the Terraform Enterprise node(s) using a central logging platform such as Splunk, ELK, or a cloud-native solution. Use these logs as a diagnostic tool in the event of an outage, scanning them for `ERROR` and `FATAL` messages as part of root cause analysis.
The backup API facilitates backups and migrations from one operational mode or deployment method (Standalone or Active/Active) to another.
Only use the backup API to migrate between low-volume implementations, especially in non-production environments. Use cloud-native tooling instead for day-to-day backup and recovery on public cloud, and standard approaches for on-premise deployments as detailed below.
The following recommendations will improve your security posture, reduce the effort required to maintain an optimal Terraform Enterprise instance, and speed up deployment time during a restoration.
Pin the Terraform Enterprise version that the `install.sh` script deploys to avoid accidental version upgrades. Use the flag `release-sequence=${tfe_release_sequence}`, where `${tfe_release_sequence}` is the Replicated release sequence. Look up the release sequence on this page. For example, for release `v202103-3`, use `523` as the `${tfe_release_sequence}`.

Note
The Automated Recovery function only backs up installation data and not application data. If you have an automated deployment, you don't need to use the Automated Recovery function.
Reference the tab(s) below for specific recommendations relevant to your installation method.
If you are using the online installation method, configure the boot script to run the Replicated `install.sh` script explicitly, without the airgap argument, when the new VM starts up. The VM will download the installation media from the Internet and install the service.
Based on the Replicated configuration, the application will connect to the configured object store and database resources automatically.
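For example, here is a minimal sketch of such a boot script delivered through an AWS launch template. The variable, AMI reference, and installer invocation are illustrative assumptions following the version-pinning guidance above, not a prescribed configuration:

```hcl
# Hypothetical pin: Replicated release sequence 523 corresponds to
# Terraform Enterprise v202103-3.
variable "tfe_release_sequence" {
  type    = number
  default = 523
}

resource "aws_launch_template" "tfe" {
  name_prefix   = "tfe-"
  image_id      = var.ami_id # hypothetical pre-hardened base image
  instance_type = "m5.xlarge"

  # Online install on first boot: fetch install.sh and pin the release so
  # replacement nodes always install the same Terraform Enterprise version.
  user_data = base64encode(<<-EOT
    #!/usr/bin/env bash
    set -euo pipefail
    curl -fsSL https://install.terraform.io/ptfe/stable -o /tmp/install.sh
    bash /tmp/install.sh no-proxy release-sequence=${var.tfe_release_sequence}
  EOT
  )
}
```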
If you are using the air-gapped installation method, use one of the following ways to ensure the installation media is available to the install configuration.
We recommend you automatically replace application server nodes when a node or availability zone fails. Replacing the node provides redundancy at the server and availability zone level. Public clouds and VMware have specific services for this.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
Use an Auto Scaling group (ASG) to automatically replace nodes on AWS. Select your deployment for more details.

- Standalone: Set `min_size` and `max_size` to 1. When a node or availability zone fails, the ASG will automatically replace the node. The time the ASG takes to replace the node depends on how long the node takes to become ready; for example, if the node must download the installation media from a network, it is not ready until it has downloaded and installed the media.
- Active/Active: Set `min_size` and `max_size` to the desired number of nodes. These two values must be the same. If a node fails, the service will remain up while the ASG replaces it. Active/Active deployments require a fully automated deployment.

In either case, configure the `vpc_zone_identifier` list with at least two subnets. If the region supports additional subnets, we recommend a minimum of three subnets, since this provides n-2 AZ redundancy.
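A minimal Standalone-mode sketch of these ASG settings follows; the subnet list, launch template reference, and health check values are illustrative assumptions:

```hcl
resource "aws_autoscaling_group" "tfe" {
  name     = "tfe"
  min_size = 1 # Standalone: exactly one node; Active/Active: the node count
  max_size = 1 # must equal min_size

  # One subnet per availability zone; three or more provides n-2 AZ redundancy.
  vpc_zone_identifier = var.subnet_ids # hypothetical list of subnet IDs

  health_check_type         = "ELB"
  health_check_grace_period = 900 # allow the boot-time install to finish

  launch_template {
    id      = aws_launch_template.tfe.id
    version = "$Latest"
  }
}
```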
Use a zone-balanced Linux virtual machine scale set (VMSS) to automatically replace nodes on Azure. Select your deployment for more details.
- Standalone: Set `instances` to 1. When a node or availability zone fails, the VMSS will automatically replace the node in the same region. The time the VMSS takes to replace the node depends on how long the node takes to become ready; for example, if the node must download the installation media from a network, it is not ready until it has downloaded and installed the media.
- Active/Active: Set `instances` to the desired number of nodes. Two or more instances meet the Azure 99.95% SLA for VM availability. If a node fails, the service will remain up while the VMSS replaces it. Active/Active deployments require a fully automated deployment.

In either case, use the `azurerm_linux_virtual_machine_scale_set` resource. Set `zones` to a minimum of two (preferably three) zones in the region, and set `zone_balance` to `true`, which provides zone redundancy.
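A minimal Standalone-mode sketch of these VMSS settings follows; the arguments elided with `## ...` and the SKU are assumptions:

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "tfe" {
  ## ... (name, resource group, image, networking, credentials)
  sku       = "Standard_D4s_v4" # hypothetical size
  instances = 1                 # Standalone: one node; Active/Active: the node count

  # Spread instances evenly across three zones so a zone failure is
  # recovered in a surviving zone.
  zones        = ["1", "2", "3"]
  zone_balance = true
}
```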
Use a regional managed instance group (MIG) to automatically replace nodes on GCP. Select your deployment for more details.
- Standalone: Set `target_size` to 1. When a node or availability zone fails, the MIG will automatically replace the node in the same region. Enable the Google compute health check and configure the auto-healing policy of the instance group manager. The time the MIG takes to replace the node depends on how long the node takes to become ready; for example, if the node must download the installation media from a network, it is not ready until it has downloaded and installed the media.
- Active/Active: Set `target_size` to the desired number of nodes. If a node fails, the service will remain up while the MIG replaces it. Active/Active deployments require a fully automated deployment.

In either case, use the `google_compute_region_instance_group_manager` resource, as this deploys a regional MIG and thus ensures that the application server layer can automatically recover from a zone failure.
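A minimal Standalone-mode sketch of these MIG settings follows; the zone names, instance template, and health check resources are illustrative assumptions:

```hcl
resource "google_compute_region_instance_group_manager" "tfe" {
  name               = "tfe"
  base_instance_name = "tfe"
  region             = var.region # hypothetical
  target_size        = 1          # Standalone: one node; Active/Active: the node count

  # Spread replacement instances across three zones in the region.
  distribution_policy_zones = ["europe-west4-a", "europe-west4-b", "europe-west4-c"]

  version {
    instance_template = google_compute_instance_template.tfe.id # hypothetical
  }

  # Auto-healing recreates any node that fails its health check.
  auto_healing_policies {
    health_check      = google_compute_health_check.tfe.id # hypothetical
    initial_delay_sec = 900 # let the boot-time install finish before checks count
  }
}
```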
In an External Services mode scenario, the application server is running as a stateless node.
We recommend the following to support the object store's business continuity:
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
The most likely problem with the object store is service inaccessibility or corruption through human error rather than loss of durability, due to AWS's claim of eleven 9s of durability.
As a result, S3 Same-Region Replication is not explicitly required for the Terraform Enterprise object store because it does not add sufficient value: corruption on the primary S3 bucket will be replicated to the secondary automatically.
We recommend the following to ensure you back up your application data appropriately.
The most likely problem with the object store is service inaccessibility or corruption through human error rather than loss of durability, due to Azure's claim of eleven 9s of durability.
We recommend the following to ensure you back up your application data appropriately.
- Use the `Microsoft.Storage` endpoint.

The most likely problem with the object store is service inaccessibility or corruption through human error rather than loss of durability, due to GCP's claim of eleven 9s of durability.
We recommend the following to ensure you back up your application data appropriately.
For on-premise External Services deployments, the architectural requirements include an S3-compatible storage facility, such as MinIO or Dell ECS:
You should configure the database to be in line with Terraform Enterprise's PostgreSQL requirements.
For high availability in a single public cloud region, we recommend deploying the database in a multi-availability zone configuration to add resilience against recoverable outages. For coverage against non-recoverable issues (such as data corruption), take regular snapshots of the database.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
In addition to the general recommendations above, consider the following AWS-specific recommendations:
For the `aws_db_instance` resource:

- Set `backup_window` to a suitable period in line with company policy and regulations. The default backup window is 30 minutes.
- Set `multi_az` to `true`.

For the `aws_rds_cluster` resource:

- Set `availability_zones` to a list of at least three EC2 availability zones. AWS will increase the availability zones to at least three if you specify fewer; however, we recommend using at least three to maximize the database layer's recoverability.
- Set `preferred_backup_window` and `preferred_maintenance_window` to times convenient to your business model.
- Set `backup_retention_period` to a suitable period according to company policy and regulations. The recommended retention is the current maximum of 35 days, since this maximizes the recoverability of the data; however, be aware of the costs associated with this level of data retention. Use snapshots to retain DB copies for longer than this.
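A minimal `aws_rds_cluster` sketch reflecting these settings follows; the AZ variable and window times are illustrative assumptions:

```hcl
resource "aws_rds_cluster" "tfe" {
  ## ... (engine, credentials, networking)
  availability_zones = var.availability_zones # hypothetical list of >= 3 AZs

  preferred_backup_window      = "02:00-03:00" # outside business hours
  preferred_maintenance_window = "sun:04:00-sun:05:00"

  backup_retention_period = 35 # current maximum; weigh against storage costs
}
```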
In addition to the general recommendations above, consider the following Azure-specific recommendations:
Set the `azurerm_postgresql_server` resource's `backup_retention_days` to a suitable period in line with company policy and regulations. The recommended retention is the current maximum of 35 days, since this maximizes the recoverability of the data; however, be aware of the costs associated with this level of data retention. Use snapshots to retain DB copies for longer than this.
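A minimal sketch of this setting follows, with the other required arguments elided; the geo-redundant option is shown as an assumption tied to the Multi-Region Considerations section:

```hcl
resource "azurerm_postgresql_server" "tfe" {
  ## ... (name, resource group, SKU, credentials)
  backup_retention_days = 35 # current maximum; weigh against storage costs

  # Optional: replicate backups to the paired region; see the
  # Multi-Region Considerations section.
  geo_redundant_backup_enabled = true
}
```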
In addition to the general recommendations above, consider the following GCP-specific recommendations:
For the `google_sql_database_instance` resource:

- Set `availability_type` to `REGIONAL` to enable high availability.
- In the `backup_configuration` subblock, set `enabled` and `point_in_time_recovery_enabled` to `true`, and set an appropriate `start_time` for backups to run.
In addition to the general recommendations above, consider the following VMware-specific recommendations:
We understand that customers with private clouds are likely to have an established backup policy for databases already, possibly including a software partnership with a recognized backup vendor. In this case, for External Services mode deployments, we recommend you use these existing practices and tooling.
We make these additional recommendations for database backups:
This section is only relevant if you are running an Active/Active deployment.
Because the Redis instance serves as an active memory cache for Terraform Enterprise, you don't need to maintain backups. However, we recommend you ensure regional availability to protect against zone failure.
Note
Enabling Redis RDB backups may be unnecessary due to the ephemeral nature of the data in the cache at any given time.
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
AWS has a significant number of business continuity configuration options for Redis.
If you use Terraform to deploy Terraform Enterprise, refer to the AWS ElastiCache section of the Active/Active deployment guide for an example Redis configuration.

Your `aws_elasticache_replication_group.tfe` resource should look similar to the one below. This configuration is for a Redis (cluster mode disabled) cluster of three nodes, one in each availability zone, to confer n-2 zone redundancy.
resource "aws_elasticache_replication_group" "tfe" {
## ...
num_cache_clusters = 3
preferred_cache_cluster_azs = [var.availability_zones]
multi_az_enabled = true
automatic_failover_enabled = true
}
Note
You should set the `preferred_cache_cluster_azs` argument to a list of availability zones equal to the number of cluster nodes. The first availability zone in the list will be the primary zone for the cluster. Duplicates are allowed.
Note
The setup will increase cost, so you should be mindful when setting up your Redis clusters. Setting a minimum of two cache clusters with the above configuration will ensure failover capability.
Azure Cache for Redis has built-in high availability.
If you use Terraform to deploy Terraform Enterprise, refer to the Azure Cache for Redis section of the Active/Active deployment guide.
Your `azurerm_redis_cache.tfe` resource should look similar to the one below. This configuration is for a Redis (cluster mode disabled) cluster of three nodes, one in each availability zone, to confer n-2 zone redundancy.
resource "azurerm_redis_cache" "tfe" {
## ...
capacity = 3
family = "P"
sku_name = "Premium"
}
Note
The Azure Premium tier is currently available in preview.
Note
The setup will increase cost, so you should be mindful when setting up your Redis clusters. Setting a minimum of two cache clusters with the above configuration will ensure failover capability.
The Standard Tier of the GCP Memorystore for Redis service provides high availability through replication and automatic failover. However, this tier only adds a second node, which provides n-1 zone redundancy. The Standard Tier is currently the highest tier available.
If you use Terraform to deploy Terraform Enterprise, refer to the GCP Memorystore for Redis section of the Active/Active deployment guide for an example configuration.
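If you manage Memorystore with Terraform, a minimal Standard Tier sketch might look like the following; the sizing, region, and security settings are illustrative assumptions rather than prescriptions from this guide:

```hcl
resource "google_redis_instance" "tfe" {
  name           = "tfe"
  tier           = "STANDARD_HA" # replica plus automatic failover (n-1 zones)
  memory_size_gb = 6             # hypothetical sizing
  region         = var.region    # hypothetical

  auth_enabled            = true                    # require AUTH tokens
  transit_encryption_mode = "SERVER_AUTHENTICATION" # TLS in transit
}
```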
Active/Active deployment is unavailable for VMware.
Terraform Enterprise's application architecture is currently single-region. Any additional configuration should be for business continuity purposes only, not for cross-region Active/Active capability, and we would support it on a best-endeavors basis only. In addition, cross-region functionality is not supported on every application tier in every region; check support as part of architectural planning.
Generally, we recommend you repeat the recommendations in this guide for each region to achieve region redundancy in a Terraform Enterprise deployment.
Note
Cross-region deployments incur additional hosting costs.
Recommendations common to the most-used cloud vendors include:
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
The following additional considerations provide an n-1 region redundancy on AWS. Since both cross-region S3 replication and Aurora read replicas can provide replicas in multiple Secondary regions, it is possible to offer greater than n-1 region redundancy if required.
- Enable cross-region replication on the `bootstrap` buckets that store the air-gapped installation media. Doing this locates critical data local to the ASG in the respective region.

The following additional considerations will provide an n-1 region redundancy on Azure:
- Set `geo_redundant_backup_enabled = true` in the `azurerm_postgresql_server` resource.
The following additional considerations will provide an n-1 region redundancy on GCP:
- Given a dual-region GCS bucket location such as `EUR4` (being `europe-north1` and `europe-west4`) and the requirement to colocate the MIG in the same location as the GCS bucket, you must deploy Terraform Enterprise to one of these two regions to ensure a working instance with successful cross-region replication.
- Set `location` to the Secondary region in the `backup_configuration` subblock of the `settings` stanza of the `google_sql_database_instance` resource.
Since this guide refers to multiple availability zones and maps these zones to separate VMware datacenters, multi-region deployments require connected datacenters in different countries or continents.
Repeat the recommendations in this guide for each region and use the strategic connections between regions to migrate Terraform Enterprise workloads during outages. The key concepts are to ensure:
The backup approach for a Mounted Disk operational mode is simpler than for External Services mode because it involves a single machine and possibly its business continuity instance. Also, a Mounted Disk deployment backup ensures the integrity of the machine and its attached data disk.
We recommend using Mounted Disk mode when provisioning on private cloud if your environment does not readily support the added complexity of managing an on-premise database and S3-compatible storage. If you eventually move to the Active/Active deployment mode, you will need to support these external services plus Redis.
We do not recommend using Mounted Disk deployments on public cloud since External Services mode provides better scalability and Mounted Disk mode does not support Active/Active deployments. For Twelve Factor compliance, use the same operational mode for both production and non-production.
Ensure you quiesce the database on Mounted Disk instances; your backup software may or may not do this automatically.
Mounted Disk mode uses a separate mountable volume (data disk) that can come in many flavors. To ensure data integrity, ensure the mountable volume has the following capabilities (in this order):
Click on the tab(s) below relevant to your cloud deployment for additional cloud-specific recommendations.
AWS has recommended backup/snapshot options to back up a Mounted Disk deployment.
Azure has recommended backup/snapshot options to back up a Mounted Disk deployment.
GCP has recommended backup/snapshot options to back up a Mounted Disk deployment.
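As one illustration on AWS (an assumption about tooling, not the only option), Amazon Data Lifecycle Manager can snapshot a tagged data volume on a schedule; the role, tag, cadence, and retention below are hypothetical:

```hcl
resource "aws_dlm_lifecycle_policy" "tfe_data_disk" {
  description        = "Daily snapshots of the TFE mounted data disk"
  execution_role_arn = aws_iam_role.dlm.arn # hypothetical IAM role for DLM
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    # Match the EBS volume attached as the Mounted Disk data disk.
    target_tags = {
      Name = "tfe-data-disk" # hypothetical tag
    }

    schedule {
      name      = "daily-tfe-snapshots"
      copy_tags = true

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"] # outside business hours
      }

      retain_rule {
        count = 14 # keep two weeks of snapshots
      }
    }
  }
}
```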
For on-premise Mounted Disk mode deployments, refer to the Application Server VMware tab above for recommendations for server backup.
In addition:
- If you replicate the data disk to a warm standby with a tool such as `lsyncd`, corruption on the primary volume will be replicated to the disk attached to the passive node. Maintain regular additional snapshots/backups of the data disk.

Note
Do not start more than one Mounted Disk mode instance against the same database simultaneously. If you are using a load balancer and a warm server with the data disk visible in the other datacenter, ensure Terraform Enterprise is not running on it while the primary is.
In this guide, you learned best practices for preparing and backing up Terraform Enterprise's main components.