Below are the service-specific steps required to restore data to a Postgres cluster.
Restore Postgres procedures by service:
- Spire
- Keycloak
- VCS
- HSM
- SLS
In the event that the Keycloak Postgres cluster must be rebuilt and the data restored, the following procedure is recommended.

Prerequisite:

- A dump of the database exists.
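For reference, a minimal sketch of how such a manual dump might have been taken while the cluster was still healthy, using `pg_dumpall` through `kubectl exec`; the pod name and output file name are illustrative:

```bash
# Illustrative only: dump all databases in the keycloak-postgres cluster to a file
# stored off the cluster. Assumes keycloak-postgres-0 is a healthy member.
kubectl exec keycloak-postgres-0 -c postgres -n services -- \
    pg_dumpall -U postgres > keycloak-postgres-dump.psql
```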
- (`ncn-mw#`) Copy the database dump to an accessible location.

  - If a manual dump of the database was taken, then check that the dump file exists in a location off the Postgres cluster. It will be needed in the steps below.
  - If the database is being automatically backed up, then the most recent version of the dump and the secrets should exist in the `postgres-backup` S3 bucket. These will be needed in the steps below. List the files in the `postgres-backup` S3 bucket and, if the files exist, download the dump and secrets out of the S3 bucket. The `cray artifacts` CLI can be used to list and download the files, as sketched immediately below and shown in the following sub-steps. Note that the `.psql` file contains the database dump and the `.manifest` file contains the secrets.
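    A sketch of locating and downloading the secrets manifest with the same CLI; the exact object keys are system-specific, and the key in the download command is a hypothetical placeholder:

    ```bash
    # List manifest (secrets) files for the Keycloak cluster; keys vary by system.
    cray artifacts list postgres-backup --format json |
        jq -r '.artifacts[] | select(.Key | contains("keycloak")) | select(.Key | endswith(".manifest")) | .Key'

    # Download a chosen manifest file (replace <manifest-key> with a key from the listing).
    cray artifacts get postgres-backup "<manifest-key>" "./keycloak-postgres.manifest"
    ```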
  - Set and export the `CRAY_CREDENTIALS` environment variable.

    This will permit simple CLI operations that are needed to obtain the Keycloak backup file. See Authenticate an Account with the Command Line.
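    A minimal sketch of one way to set this, assuming the `admin-client-auth` Kubernetes secret and the Keycloak token endpoint used later in this procedure are available. Follow the linked procedure for the authoritative steps; it may also write a temporary token file such as `/tmp/setup-token.json`, which is removed in a later sub-step.

    ```bash
    # Sketch only; see "Authenticate an Account with the Command Line" for the supported method.
    CLIENT_SECRET=$(kubectl get secret admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d)
    export CRAY_CREDENTIALS=$(curl -s -k -d grant_type=client_credentials \
        -d client_id=admin-client -d client_secret="${CLIENT_SECRET}" \
        https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
    ```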
  - List the available Postgres logical backups by date.

    ```bash
    cray artifacts list postgres-backup --format json | jq -r '.artifacts[] | select(.Key | contains("spilo/keycloak")) | "\(.LastModified) \(.Key)"'
    ```

    Example output:

    ```text
    2023-03-23T02:10:11.158000+00:00 spilo/keycloak-postgres/ed8f6691-9da7-4662-aa67-9c786fa961ee/logical_backups/1679537409.sql.gz
    2023-03-24T02:10:12.689000+00:00 spilo/keycloak-postgres/ed8f6691-9da7-4662-aa67-9c786fa961ee/logical_backups/1679623811.sql.gz
    ```

  - Set the environment variables to the name of the backup file.

    ```bash
    BACKUP=spilo/keycloak-postgres/ed8f6691-9da7-4662-aa67-9c786fa961ee/logical_backups/1679623811.sql.gz
    DUMPFILE=$(basename ${BACKUP})
    ```

  - Download the backup files.

    ```bash
    cray artifacts get postgres-backup "${BACKUP}" "./${DUMPFILE}"
    ```
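    An optional sanity check on the downloaded file before proceeding; purely illustrative:

    ```bash
    # Confirm the file is present and non-empty, and that it decompresses to SQL.
    ls -lh "./${DUMPFILE}"
    zcat -f "./${DUMPFILE}" | head -5
    ```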
  - Unset the `CRAY_CREDENTIALS` environment variable and remove the temporary token file.

    ```bash
    unset CRAY_CREDENTIALS
    rm -v /tmp/setup-token.json
    ```

- (`ncn-mw#`) Set helper variables.

  ```bash
  CLIENT=cray-keycloak
  NAMESPACE=services
  POSTGRESQL=keycloak-postgres
  ```

- (`ncn-mw#`) Scale the Keycloak service to 0.

  ```bash
  kubectl scale statefulset "${CLIENT}" -n "${NAMESPACE}" --replicas=0
  ```

- (`ncn-mw#`) Wait for the pods to terminate.

  ```bash
  while kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/instance="${CLIENT}" | grep -qv NAME ; do
      echo " waiting for pods to terminate"; sleep 2
  done
  ```

- (`ncn-mw#`) Delete the Keycloak Postgres cluster.

  ```bash
  kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json |
      jq 'del(.spec.selector)' |
      jq 'del(.spec.template.metadata.labels."controller-uid")' |
      jq 'del(.status)' > postgres-cr.json
  kubectl delete -f postgres-cr.json
  ```
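  The saved `postgres-cr.json` (with the selector, `controller-uid` label, and status fields stripped) is reused in a later step to recreate the cluster. An optional check that the old cluster resource is gone before recreating it, for illustration:

  ```bash
  # keycloak-postgres should no longer be listed once the deletion has completed.
  kubectl get postgresql -n "${NAMESPACE}"
  ```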
- (`ncn-mw#`) Wait for the pods to terminate.

  ```bash
  while kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
      echo " waiting for pods to terminate"; sleep 2
  done
  ```

- (`ncn-mw#`) Create a new single instance Keycloak Postgres cluster.

  ```bash
  cp -v postgres-cr.json postgres-orig-cr.json
  jq '.spec.numberOfInstances = 1' postgres-orig-cr.json > postgres-cr.json
  kubectl create -f postgres-cr.json
  ```

- (`ncn-mw#`) Wait for the pod to start.

  ```bash
  while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
      echo " waiting for pod to start"; sleep 2
  done
  ```

- (`ncn-mw#`) Wait for the Postgres cluster to start running.

  ```bash
  while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
  do
      echo " waiting for postgresql to start running"; sleep 2
  done
  ```

- (`ncn-mw#`) Copy the database dump file to the Postgres member.

  ```bash
  kubectl cp "./${DUMPFILE}" "${POSTGRESQL}-0:/home/postgres/${DUMPFILE}" -c postgres -n "${NAMESPACE}"
  ```

- (`ncn-mw#`) Restore the data.

  ```bash
  kubectl exec "${POSTGRESQL}-0" -c postgres -n "${NAMESPACE}" -it -- bash -c "zcat -f ${DUMPFILE} | psql -U postgres"
  ```

  Errors such as `... already exists` can be ignored; the restore can be considered successful when it completes.
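  An optional post-restore check; the databases and roles listed depend on the contents of the dump:

  ```bash
  # List databases and roles inside the restored instance.
  kubectl exec "${POSTGRESQL}-0" -c postgres -n "${NAMESPACE}" -- psql -U postgres -c '\l'
  kubectl exec "${POSTGRESQL}-0" -c postgres -n "${NAMESPACE}" -- psql -U postgres -c '\du'
  ```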
- Update the `keycloak-postgres` secrets in Postgres.

  - (`ncn-mw#`) From the three `keycloak-postgres` secrets, collect the password for each Postgres username: `postgres`, `service_account`, and `standby`.

    ```bash
    for secret in postgres.keycloak-postgres.credentials service-account.keycloak-postgres.credentials \
                  standby.keycloak-postgres.credentials
    do
        echo -n "secret ${secret} username & password: "
        echo -n "`kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.username}' | base64 -d` "
        echo `kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.password}' | base64 -d`
    done
    ```

    Example output:

    ```text
    secret postgres.keycloak-postgres.credentials username & password: postgres ABCXYZ
    secret service-account.keycloak-postgres.credentials username & password: service_account ABC123
    secret standby.keycloak-postgres.credentials username & password: standby 123456
    ```
  - (`ncn-mw#`) `kubectl exec` into the Postgres pod.

    ```bash
    kubectl exec "${POSTGRESQL}-0" -n "${NAMESPACE}" -c postgres -it -- bash
    ```

  - (`pod#`) Open a Postgres console.

    ```bash
    /usr/bin/psql postgres postgres
    ```
  - (`postgres#`) Update the password for each user to match the values found in the secrets.

    - Update the password for the `postgres` user.

      ```sql
      ALTER USER postgres WITH PASSWORD 'ABCXYZ';
      ```

      Example of successful output:

      ```text
      ALTER ROLE
      ```

    - Update the password for the `service_account` user.

      ```sql
      ALTER USER service_account WITH PASSWORD 'ABC123';
      ```

      Example of successful output:

      ```text
      ALTER ROLE
      ```

    - Update the password for the `standby` user.

      ```sql
      ALTER USER standby WITH PASSWORD '123456';
      ```

      Example of successful output:

      ```text
      ALTER ROLE
      ```
  - (`postgres#`) Exit the Postgres console with the `\q` command.

  - (`pod#`) Exit the Postgres pod with the `exit` command.
- (`ncn-mw#`) Restart the Postgres cluster.

  ```bash
  kubectl delete pod -n "${NAMESPACE}" "${POSTGRESQL}-0"
  ```

- (`ncn-mw#`) Wait for the `postgresql` pod to start.

  ```bash
  while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
      echo " waiting for pods to start"; sleep 2
  done
  ```

- (`ncn-mw#`) Scale the Postgres cluster back to 3 instances.

  ```bash
  kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \
      -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
  ```
- (`ncn-mw#`) Wait for the `postgresql` cluster to start running.

  This may take a few minutes to complete.

  ```bash
  while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
  do
      echo " waiting for postgresql to start running"; sleep 2
  done
  ```

- (`ncn-mw#`) Scale the Keycloak service back to 3 replicas.

  ```bash
  kubectl scale statefulset "${CLIENT}" -n "${NAMESPACE}" --replicas=3
  ```

- (`ncn-mw#`) Wait for the Keycloak pods to start.

  ```bash
  while [ $(kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/instance="${CLIENT}" | grep -cv NAME) != 3 ]
  do
      echo " waiting for pods to start"; sleep 2
  done
  ```

- (`ncn-mw#`) Wait for all Keycloak pods to be ready.

  If there are pods that do not show that both containers are ready (`READY` is `2/2`), then wait a few seconds and re-run the command until all containers are ready.

  ```bash
  kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/instance="${CLIENT}"
  ```

  Example output:

  ```text
  NAME              READY   STATUS    RESTARTS   AGE
  cray-keycloak-0   2/2     Running   0          35s
  cray-keycloak-1   2/2     Running   0          35s
  cray-keycloak-2   2/2     Running   0          35s
  ```
- (`ncn-mw#`) Run the `keycloak-setup` job to restore the Kubernetes client secrets.

  - Run the job.

    ```bash
    kubectl get job -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak -o json > keycloak-setup.json
    cat keycloak-setup.json | jq '.items[0]' |
        jq 'del(.metadata.creationTimestamp)' | jq 'del(.metadata.managedFields)' |
        jq 'del(.metadata.resourceVersion)' | jq 'del(.metadata.selfLink)' |
        jq 'del(.metadata.uid)' | jq 'del(.spec.selector)' |
        jq 'del(.spec.template.metadata.labels)' | jq 'del(.status)' |
        kubectl replace --force -f -
    ```

  - Wait for job to complete.

    Check the status of the `keycloak-setup` job. If the `COMPLETIONS` value is not `1/1`, then wait a few seconds and run the command again until the `COMPLETIONS` value is `1/1`.

    ```bash
    kubectl get jobs -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak
    ```

    Example output:

    ```text
    NAME               COMPLETIONS   DURATION   AGE
    keycloak-setup-2   1/1           59s        91s
    ```
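  Optionally, confirm that the client secrets managed by the job are present again. This check assumes the client secrets follow the `*-client-auth` naming used elsewhere in CSM (for example, `admin-client-auth`):

  ```bash
  # Hypothetical check: list client auth secrets recreated by keycloak-setup.
  kubectl get secrets -A | grep -- '-client-auth'
  ```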
- (`ncn-mw#`) Run the `keycloak-users-localize` job to restore the users and groups in S3 and the Kubernetes ConfigMap.

  - Run the job.

    ```bash
    kubectl get job -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak-users-localize \
        -o json > cray-keycloak-users-localize.json
    cat cray-keycloak-users-localize.json | jq '.items[0]' |
        jq 'del(.metadata.creationTimestamp)' | jq 'del(.metadata.managedFields)' |
        jq 'del(.metadata.resourceVersion)' | jq 'del(.metadata.selfLink)' |
        jq 'del(.metadata.uid)' | jq 'del(.spec.selector)' |
        jq 'del(.spec.template.metadata.labels)' | jq 'del(.status)' |
        kubectl replace --force -f -
    ```

  - Wait for the job to complete.

    Check the status of the `cray-keycloak-users-localize` job. If the `COMPLETIONS` value is not `1/1`, then wait a few minutes and run the command again until the `COMPLETIONS` value is `1/1`.

    ```bash
    kubectl get jobs -n "${NAMESPACE}" -l app.kubernetes.io/instance=cray-keycloak-users-localize
    ```

    Example output:

    ```text
    NAME                        COMPLETIONS   DURATION   AGE
    keycloak-users-localize-2   1/1           45s        49s
    ```
- (`ncn-mw#`) Restart the ingress `oauth2-proxies`.

  - Issue the restarts.

    ```bash
    kubectl rollout restart -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-access-ingress &&
    kubectl rollout restart -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-high-speed-ingress &&
    kubectl rollout restart -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-management-ingress
    ```

  - Wait for the restarts to complete.

    ```bash
    kubectl rollout status -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-access-ingress &&
    kubectl rollout status -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-high-speed-ingress &&
    kubectl rollout status -n "${NAMESPACE}" deployment/cray-oauth2-proxies-customer-management-ingress
    ```
- (`ncn-mw#`) Verify that the service is working.

  The following should return an `access_token` for an existing user. Replace the `<username>` and `<password>` as appropriate.

  ```bash
  curl -s -k -d grant_type=password -d client_id=shasta -d username=<username> -d password=<password> \
      https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token
  ```
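  To check just the token field, the same request can be piped through `jq`; a small convenience sketch:

  ```bash
  # Prints only the access_token value, or "null" if authentication failed.
  curl -s -k -d grant_type=password -d client_id=shasta -d username=<username> -d password=<password> \
      https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token'
  ```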
In the event that the VCS Postgres cluster must be rebuilt and the data restored, the following procedure is recommended.

Prerequisites:

- A dump of the database exists.
- A backup of the VCS PVC exists.
  - See Restore PVC data.
- The Cray command line interface (CLI) tool is initialized and configured on the system.
- (`ncn-mw#`) Copy the database dump to an accessible location.

  - If a manual dump of the database was taken, then check that the dump file exists in a location off the Postgres cluster. It will be needed in the steps below.
  - If the database is being automatically backed up, then the most recent version of the dump and the secrets should exist in the `postgres-backup` S3 bucket. These will be needed in the steps below. List the files in the `postgres-backup` S3 bucket and, if the files exist, download the dump and secrets out of the S3 bucket. The `cray artifacts` CLI can be used to list and download the files, as shown in the following sub-steps. Note that the `.psql` file contains the database dump and the `.manifest` file contains the secrets.
  - List the available backups.

    ```bash
    cray artifacts list postgres-backup --format json | jq -r '.artifacts[] | select(.Key | contains("spilo/gitea")) | "\(.LastModified) \(.Key)"'
    ```

    Example output:

    ```text
    2023-03-22T01:10:26.475500+00:00 spilo/gitea-vcs-postgres/9b8df946-ef39-4880-86a7-f8c21b71c542/logical_backups/1679447424.sql.gz
    2023-03-23T01:10:26.395000+00:00 spilo/gitea-vcs-postgres/9b8df946-ef39-4880-86a7-f8c21b71c542/logical_backups/1688548424.sql.gz
    ```

  - Set the environment variables to the name of the backup file.

    ```bash
    BACKUP=spilo/gitea-vcs-postgres/9b8df946-ef39-4880-86a7-f8c21b71c542/logical_backups/1688548424.sql.gz
    DUMPFILE=$(basename ${BACKUP})
    ```

  - Download the backup files.

    ```bash
    cray artifacts get postgres-backup "${BACKUP}" "./${DUMPFILE}"
    ```
- (`ncn-mw#`) Set helper variables.

  ```bash
  SERVICE=gitea-vcs
  SERVICELABEL=vcs
  NAMESPACE=services
  POSTGRESQL=gitea-vcs-postgres
  ```

- (`ncn-mw#`) Scale the VCS service to 0.

  ```bash
  kubectl scale deployment ${SERVICE} -n "${NAMESPACE}" --replicas=0
  ```

- (`ncn-mw#`) Wait for the pods to terminate.

  ```bash
  while kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name="${SERVICELABEL}" | grep -qv NAME ; do
      echo " waiting for pods to terminate"; sleep 2
  done
  ```

- (`ncn-mw#`) Delete the VCS Postgres cluster.

  ```bash
  kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json |
      jq 'del(.spec.selector)' |
      jq 'del(.spec.template.metadata.labels."controller-uid")' |
      jq 'del(.status)' > postgres-cr.json
  kubectl delete -f postgres-cr.json
  ```

- (`ncn-mw#`) Wait for the pods to terminate.

  ```bash
  while kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
      echo " waiting for pods to terminate"; sleep 2
  done
  ```
- (`ncn-mw#`) Create a new single instance VCS Postgres cluster.

  ```bash
  cp -v postgres-cr.json postgres-orig-cr.json
  jq '.spec.numberOfInstances = 1' postgres-orig-cr.json > postgres-cr.json
  kubectl create -f postgres-cr.json
  ```

- (`ncn-mw#`) Wait for the pod to start.

  ```bash
  while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
      echo " waiting for pod to start"; sleep 2
  done
  ```

- (`ncn-mw#`) Wait for the Postgres cluster to start running.

  ```bash
  while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
  do
      echo " waiting for postgresql to start running"; sleep 2
  done
  ```

- (`ncn-mw#`) Copy the database dump file to the Postgres member.

  ```bash
  kubectl cp "./${DUMPFILE}" "${POSTGRESQL}-0:/home/postgres/${DUMPFILE}" -c postgres -n "${NAMESPACE}"
  ```

- (`ncn-mw#`) Restore the data.

  ```bash
  kubectl exec "${POSTGRESQL}-0" -c postgres -n "${NAMESPACE}" -it -- bash -c "zcat -f ${DUMPFILE} | psql -U postgres"
  ```

  Errors such as `... already exists` can be ignored; the restore can be considered successful when it completes.
- Update the `gitea-vcs-postgres` secrets in Postgres.

  - (`ncn-mw#`) From the three `gitea-vcs-postgres` secrets, collect the password for each Postgres username: `postgres`, `service_account`, and `standby`.

    ```bash
    for secret in postgres.gitea-vcs-postgres.credentials service-account.gitea-vcs-postgres.credentials \
                  standby.gitea-vcs-postgres.credentials
    do
        echo -n "secret ${secret} username & password: "
        echo -n "`kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.username}' | base64 -d` "
        echo `kubectl get secret "${secret}" -n "${NAMESPACE}" -ojsonpath='{.data.password}' | base64 -d`
    done
    ```

    Example output:

    ```text
    secret postgres.gitea-vcs-postgres.credentials username & password: postgres ABCXYZ
    secret service-account.gitea-vcs-postgres.credentials username & password: service_account ABC123
    secret standby.gitea-vcs-postgres.credentials username & password: standby 123456
    ```
  - (`ncn-mw#`) `kubectl exec` into the Postgres pod.

    ```bash
    kubectl exec "${POSTGRESQL}-0" -n "${NAMESPACE}" -c postgres -it -- bash
    ```

  - (`pod#`) Open a Postgres console.

    ```bash
    /usr/bin/psql postgres postgres
    ```
  - (`postgres#`) Update the password for each user to match the values found in the secrets.

    - Update the password for the `postgres` user.

      ```sql
      ALTER USER postgres WITH PASSWORD 'ABCXYZ';
      ```

      Example of successful output:

      ```text
      ALTER ROLE
      ```

    - Update the password for the `service_account` user.

      ```sql
      ALTER USER service_account WITH PASSWORD 'ABC123';
      ```

      Example of successful output:

      ```text
      ALTER ROLE
      ```

    - Update the password for the `standby` user.

      ```sql
      ALTER USER standby WITH PASSWORD '123456';
      ```

      Example of successful output:

      ```text
      ALTER ROLE
      ```
  - (`postgres#`) Exit the Postgres console with the `\q` command.

  - (`pod#`) Exit the Postgres pod with the `exit` command.

- (`ncn-mw#`) Restart the Postgres cluster.

  ```bash
  kubectl delete pod -n "${NAMESPACE}" "${POSTGRESQL}-0"
  ```
- (`ncn-mw#`) Wait for the `postgresql` pod to start.

  ```bash
  while ! kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -qv NAME ; do
      echo " waiting for pods to start"; sleep 2
  done
  ```

- (`ncn-mw#`) Scale the Postgres cluster back to 3 instances.

  ```bash
  kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \
      -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
  ```

- (`ncn-mw#`) Wait for the `postgresql` cluster to start running.

  ```bash
  while [ $(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]
  do
      echo " waiting for postgresql to start running"; sleep 2
  done
  ```
- (`ncn-mw#`) Scale the Gitea service back up.

  ```bash
  kubectl scale deployment ${SERVICE} -n "${NAMESPACE}" --replicas=1
  ```

- (`ncn-mw#`) Wait for the Gitea pods to start.

  ```bash
  while ! kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name="${SERVICELABEL}" | grep -qv NAME ; do
      echo " waiting for pods to start"; sleep 2
  done
  ```
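  As a final check (analogous to the Keycloak verification above), confirm that the Gitea pod reports its containers ready; this simply reuses the labels from the wait loop. If deeper verification is needed, cloning a repository from VCS confirms that Gitea can reach the restored database and PVC data.

  ```bash
  kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name="${SERVICELABEL}"
  ```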