Configure disaster recovery

To configure disaster recovery, you must provision a replica to serve as backup during failovers. If your primary server is permanently disabled, you can then promote a replica.

Before you begin

Apply disaster recovery system and software requirements.

Tip: The puppet infrastructure commands, which are used to configure and manage disaster recovery, require a valid admin RBAC token and must be run from a root session. Running with elevated privileges via sudo puppet infrastructure is not sufficient. Instead, start a root session by running sudo su -, and then run the puppet infrastructure command. For details about these commands, run puppet infrastructure help <ACTION>, for example, puppet infrastructure help provision.
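
The two preconditions in this tip can be checked before you run a command. The following is a minimal sketch, not part of the product: the function names are hypothetical, and it assumes your token was saved by puppet access login at its default location, ~/.puppetlabs/token. It confirms only that a token file exists, not that the token is still valid or has admin permissions.

```shell
#!/bin/sh
# Hypothetical helpers: check for a root session and a saved RBAC token
# before running puppet infrastructure commands.

# Succeeds if the given numeric UID belongs to root.
is_root_uid() {
  [ "$1" -eq 0 ]
}

# Succeeds if a non-empty token file exists at the given path.
# Assumption: the token was written by `puppet access login`.
has_token_file() {
  [ -s "$1" ]
}

# preflight UID TOKEN_PATH
# Prints "ok" when both checks pass; otherwise prints a hint and fails.
preflight() {
  uid=$1
  token=$2
  if ! is_root_uid "$uid"; then
    echo "not root: start a root session with 'sudo su -'"
    return 1
  fi
  if ! has_token_file "$token"; then
    echo "no RBAC token found: run 'puppet access login'"
    return 1
  fi
  echo "ok"
}
```

For example, in a root session you might run preflight "$(id -u)" "$HOME/.puppetlabs/token" && puppet infrastructure help provision.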

Provision and enable a replica

Provisioning a replica duplicates specific components and services from the primary server to the replica. Enabling a replica activates most of its duplicated services and components, and instructs agents and infrastructure nodes how to communicate in a failover scenario.

Before you begin

Apply disaster recovery system and software requirements.
Ensure you have a valid admin RBAC token.
Ensure Code Manager is enabled and configured on your primary server.
Move any tuning parameters that you set for your primary server using the console to Hiera. Using Hiera ensures configuration is applied to both your primary server and replica.
Back up your classifier hierarchy, because enabling a replica alters classification.

Note: While you complete this task, the primary server is unavailable to serve catalog requests. Schedule the task accordingly.

Configure infrastructure agents to connect orchestration agents to the primary server.

In the console, click Node groups, and in the PE Infrastructure group, select the PE Agent > PE Infrastructure Agent group.
If you manage your load balancers with agents, on the Rules tab, pin load balancers to the group.
Pinning load balancers to the PE Infrastructure Agent group ensures that they communicate directly with the primary server.

On the Classes tab, find the puppet_enterprise::profile::agent class and specify these parameters:


Parameter	Value
manage_puppet_conf	Specify `true` to ensure that your setting for `server_list` is configured in the expected location and persists through Puppet runs.
pcp_broker_list	Hostname for your primary server. Hostnames must include port 8142, for example `["PRIMARY.EXAMPLE.COM:8142"]`.
primary_uris, server_list	Hostname for your primary server, for example `["PRIMARY.EXAMPLE.COM"]`. These settings assume port 8140 unless you specify otherwise with `host:port`.

Remove any values set for pcp_broker_ws_uris.
Commit changes.
Run Puppet on all agents classified into the PE Infrastructure Agent group.

On the primary server, as the root user, run puppet infrastructure provision replica <REPLICA NODE NAME> --enable
Note: In installations with compilers, use the --skip-agent-config flag with the --enable option if you want to:
- Upgrade a replica without needing to run Puppet on all agents.
- Add disaster recovery to an installation without modifying the configuration of existing load balancers.
- Manually configure which load balancer agents communicate with in multi-region installations. See Managing agent communication in multi-region installations.
Copy your secret key file from the primary server to the replica. The path to the secret key file is /etc/puppetlabs/orchestration-services/conf.d/secrets/keys.json.

Important: If you do not copy your secret key file onto your replica, the replica generates a new secret key when you promote it. The new key prevents you from accessing credentials for your agentless nodes or running tasks and plans on agentless nodes.
Optional: Verify that all services running on the primary server are also running on the replica:
1. From the primary server, run puppet infrastructure status --verbose to verify that the replica is available.
2. From any managed node, run puppet agent -t --noop --server_list=<REPLICA HOSTNAME>. If the replica is correctly configured, the Puppet run succeeds and shows no changed resources.
Optional: Deploy updated configuration to agents by running Puppet, or wait for the next scheduled Puppet run.

If you used the --skip-agent-config option, you can skip this step.

Note: If you use the direct Puppet workflow, where agents use cached catalogs, you must manually deploy the new configuration by running puppet job run --no-enforce-environment --query 'nodes {deactivated is null and expired is null}'
Optional: Perform any tests you feel are necessary to verify that Puppet runs continue to work during failover. For example, to simulate an outage on the primary server:
1. Prevent the replica and a test node from contacting the primary server. For example, you might temporarily shut down the primary server or use iptables with drop mode.
2. Run puppet agent -t on the test node. If the replica is correctly configured, the Puppet run succeeds and shows no changed resources. Runs might take longer than normal when in failover mode.
3. Reconnect the replica and test node.
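
The primary-side steps above (provision and enable, copy the secret key file, check status) can be sketched as one script. This is an illustration under stated assumptions, not an official tool: the replica name replica.example.com, the run helper, and the use of scp with root SSH access to the replica are all assumptions. DRY_RUN defaults to 1, so the script only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Hypothetical replica node name; override by setting REPLICA.
REPLICA="${REPLICA:-replica.example.com}"
# Secret key path given in the procedure above.
KEYS=/etc/puppetlabs/orchestration-services/conf.d/secrets/keys.json

# Print the command in dry-run mode (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run on the primary server in a root session.
provision_replica() {
  # Provision and enable the replica in one step.
  run puppet infrastructure provision replica "$REPLICA" --enable &&
  # Assumption: root SSH access from the primary server to the replica.
  run scp -p "$KEYS" "root@${REPLICA}:${KEYS}" &&
  # Confirm that the replica's services are reported as available.
  run puppet infrastructure status --verbose
}
```

Calling provision_replica with the default DRY_RUN=1 lists the three commands in order, which lets you review them before running them for real.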

Managing agent communication in multi-region installations

Typically, when you enable a replica using puppet infrastructure enable replica, the configuration tool automatically sets the same communication parameters for all agents. In multi-region installations, with load balancers or compilers in multiple locations, you must manually configure agent communication settings so that agents fail over to the appropriate load balancer or compiler.

To skip automatically configuring which Puppet servers and PCP brokers agents communicate with, use the --skip-agent-config flag when you provision and enable a replica, for example:

puppet infrastructure provision replica example.puppet.com --enable --skip-agent-config

To manually configure which load balancer or compiler agents communicate with, use one of these options:

CSR attributes
1. For each node, include a CSR attribute that identifies the location of the node, for example pp_region or pp_datacenter.
2. Create a child group of the PE Agent node group for each location.
3. In each child node group, include the puppet_enterprise::profile::agent class and set the server_list parameter to the appropriate load balancer or compiler hostname.
4. In each child node group, add a rule that uses the trusted fact created from the CSR attribute.
Hiera
For each node or group of nodes, create a key/value pair that sets the puppet_enterprise::profile::agent::server_list parameter to be used by the PE Agent node group.
Custom method that sets the server_list parameter in puppet.conf.
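
As a rough illustration of the first two options, the following sketch prints the YAML you might place in a node's csr_attributes.yaml file (typically /etc/puppetlabs/puppet/csr_attributes.yaml, written before the node requests its certificate) and in a Hiera data file. The function names, the region value, and the hostnames are hypothetical; which Hiera level the key belongs in depends on your own hierarchy.

```shell
#!/bin/sh
# Print a csr_attributes.yaml body that requests the pp_region
# certificate extension with the given value.
csr_attributes_yaml() {
  printf 'extension_requests:\n  pp_region: "%s"\n' "$1"
}

# Print Hiera data that sets server_list to the given hosts, in order.
# Each host can be HOSTNAME or HOSTNAME:PORT.
hiera_server_list_yaml() {
  printf 'puppet_enterprise::profile::agent::server_list:\n'
  for host in "$@"; do
    printf '  - "%s"\n' "$host"
  done
}
```

For example, hiera_server_list_yaml lb-west.example.com:8140 lb-east.example.com:8140 prints a two-entry list in which the first host is tried first.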

Promote a replica

If your primary server can’t be restored, you can promote the replica to establish it as the new, permanent primary server.

Verify that the primary server is permanently offline.

If the primary server comes back online during promotion, your agents can get confused trying to connect to two active primary servers.
On the replica, as the root user, run puppet infrastructure promote replica
Promotion can take as long as your initial PE installation took. Don’t make code or classification changes during or after promotion.
When promotion is complete, update any systems or settings that refer to the old primary server, such as PE client tool configurations, Code Manager hooks, and CNAME records.
Deploy updated configuration to nodes by running Puppet or waiting for the next scheduled run.

Note: In case of a failover, scheduled Puppet and task runs are rescheduled based on the last execution time.
Optional: Provision a new replica to maintain disaster recovery.
Note: Agent configuration must be updated before provisioning a new replica. If you re-use your old primary server’s node name for the new replica, agents with outdated configuration might use the new replica as a primary server before it’s fully provisioned.
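
Because promoting while the old primary server is still reachable causes problems, you might wrap the promote command in a guard. The sketch below is one hypothetical approach, not a product feature: it refuses to promote unless the operator retypes the old primary server's hostname as an acknowledgement that the server is permanently offline. It does not test reachability. DRY_RUN defaults to 1, so the command is printed rather than run.

```shell
#!/bin/sh
# Print the command in dry-run mode (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# promote_replica OLD_PRIMARY CONFIRMATION
# Run on the replica in a root session. Promotes only when CONFIRMATION
# matches OLD_PRIMARY exactly.
promote_replica() {
  old_primary=$1
  confirmation=$2
  if [ "$confirmation" != "$old_primary" ]; then
    echo "aborted: confirm that $old_primary is permanently offline" >&2
    return 1
  fi
  run puppet infrastructure promote replica
}
```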

Enable a new replica using a failed primary server

After promoting a replica, you can use your old primary server as a new replica, effectively swapping the roles of your failed primary server and promoted replica.

Before you begin

The puppet infrastructure run command leverages built-in Bolt plans to automate certain management tasks. To use this command, you must be able to connect using SSH from your primary server to any nodes that the command modifies. You can establish an SSH connection using key forwarding, a local key file, or by specifying keys in .ssh/config on your primary server. For more information, see Bolt OpenSSH configuration options.

To view all available parameters, use the --help flag. The logs for all puppet infrastructure run Bolt plans are located at /var/log/puppetlabs/installer/bolt_info.log.

You must be able to reach the failed primary server via SSH from the current primary server.

On your promoted replica, as the root user, run puppet infrastructure run enable_ha_failover, specifying these parameters:

host — Hostname of the failed primary server. This node becomes your new replica.
topology — Architecture used in your environment, either mono (standard) or mono-with-compile (large).
replication_timeout_secs — Optional. The number of seconds allowed to complete provisioning and enabling of the new replica before the command fails.
tmpdir — Optional. Path to a directory to use for uploading and executing temporary files.

For example:

puppet infrastructure run enable_ha_failover host=<FAILED_PRIMARY_HOSTNAME> topology=mono
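
To include the optional parameters, you can build the command from its parts. The helper below is a hypothetical sketch that only assembles and prints the command line; it validates the topology value against the two options listed above. The hostname, timeout, and directory in the usage example are placeholder values.

```shell
#!/bin/sh
# build_failover_cmd HOST TOPOLOGY [REPLICATION_TIMEOUT_SECS] [TMPDIR]
# Prints the enable_ha_failover command line, or fails on a bad topology.
build_failover_cmd() {
  host=$1
  topology=$2
  timeout=${3:-}
  tmpdir=${4:-}

  case "$topology" in
    mono|mono-with-compile) ;;
    *)
      echo "topology must be mono or mono-with-compile" >&2
      return 1
      ;;
  esac

  cmd="puppet infrastructure run enable_ha_failover host=$host topology=$topology"
  # Append optional parameters only when they were supplied.
  if [ -n "$timeout" ]; then
    cmd="$cmd replication_timeout_secs=$timeout"
  fi
  if [ -n "$tmpdir" ]; then
    cmd="$cmd tmpdir=$tmpdir"
  fi
  echo "$cmd"
}
```

For example, build_failover_cmd old-primary.example.com mono-with-compile 1800 /var/tmp prints a command that sets all four parameters.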

Results

The failed primary server is repurposed as a new replica.

Forget a replica

Forgetting a replica cleans up classification and database state, preventing degraded performance over time.

Before you begin

Ensure you have a valid admin RBAC token and the replica you want to remove is permanently offline.

Run the forget command whenever a replica node is destroyed, even if you plan to replace it with a replica with the same name.

On the primary server, as the root user, run puppet infrastructure forget <REPLICA NODE NAME>
Delete your secret key file from the replica because leaving sensitive information on a replica poses a security risk. The path to the secret key file is /etc/puppetlabs/orchestration-services/conf.d/secrets/keys.json
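
The two steps above run on different hosts, so a sketch might keep them as separate functions. The function names and the run helper are hypothetical. DRY_RUN defaults to 1, so each function only prints its command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Secret key path given in the procedure above.
KEYS=/etc/puppetlabs/orchestration-services/conf.d/secrets/keys.json

# Print the command in dry-run mode (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run on the primary server in a root session.
# Usage: forget_replica REPLICA_NODE_NAME
forget_replica() {
  run puppet infrastructure forget "$1"
}

# Run on the replica itself, if you can still access it.
remove_secret_key() {
  run rm -f "$KEYS"
}
```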

Results

The replica is decommissioned, the node is purged as an agent, secret key information is deleted, and a Puppet run is completed on the primary server.

Reinitialize a replica

If you encounter certain errors on your replica after provisioning, you can reinitialize the replica. Reinitializing destroys and re-creates replica databases, as specified.

Before you begin

Your primary server must be fully functional and the replica must be able to communicate with the primary server.

CAUTION: If you reinitialize a functional replica that you already enabled, the replica is unavailable to serve as backup in a failover during reinitialization.

Reinitialization is not intended to fix slow queries or intermittent failures. Reinitialize your replica only if it’s inoperable or you see replication errors.

On the replica, as the root user, reinitialize databases as needed:
- All databases: puppet infrastructure reinitialize replica
- Specific databases: puppet infrastructure reinitialize replica --db <DATABASE> where <DATABASE> is pe-activity, pe-classifier, pe-orchestrator, or pe-rbac.
Follow prompts to complete the reinitialization.
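
Because the --db option accepts only the four database names listed above, a small wrapper can catch typos before the command runs. This is a hypothetical sketch that prints the command line rather than running it.

```shell
#!/bin/sh
# reinit_cmd [DATABASE]
# Prints the reinitialize command for all databases (no argument) or for
# one named database. Fails if the name is not one of the four listed.
reinit_cmd() {
  db=${1:-}
  if [ -z "$db" ]; then
    echo "puppet infrastructure reinitialize replica"
    return 0
  fi
  case "$db" in
    pe-activity|pe-classifier|pe-orchestrator|pe-rbac)
      echo "puppet infrastructure reinitialize replica --db $db"
      ;;
    *)
      echo "unknown database: $db" >&2
      return 1
      ;;
  esac
}
```

For example, reinit_cmd pe-rbac prints the command for a single database, and reinit_cmd with no argument prints the command for all databases.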