AETHER-1715 Add instructions to restore PVC/PV using Velero
Change-Id: I66f6ec9cff5e2f181b269640ace3848b9641ab27
diff --git a/operations/images/rancher-fleet-cluster-label-edit1.png b/operations/images/rancher-fleet-cluster-label-edit1.png
new file mode 100644
index 0000000..8db4856
--- /dev/null
+++ b/operations/images/rancher-fleet-cluster-label-edit1.png
Binary files differ
diff --git a/operations/images/rancher-fleet-cluster-label-edit2.png b/operations/images/rancher-fleet-cluster-label-edit2.png
new file mode 100644
index 0000000..378f812
--- /dev/null
+++ b/operations/images/rancher-fleet-cluster-label-edit2.png
Binary files differ
diff --git a/operations/procedures.rst b/operations/procedures.rst
index 114b6bf..e7d52ff 100644
--- a/operations/procedures.rst
+++ b/operations/procedures.rst
@@ -41,3 +41,145 @@
2. Turn on the management server using the front panel power button
3. Turn on the compute servers using the front panel power buttons
+
+Restore stateful application procedure
+--------------------------------------
+
+.. note::
+
+ PersistentVolumeClaim/PersistentVolume backup and restore is currently only available for ACC and AMP clusters.
+
+1. Download and install the Velero CLI following the `official guide <https://velero.io/docs/v1.7/basic-install/#install-the-cli>`_.
+ You'll also need the ``kubectl`` and ``helm`` command-line tools.
+
+2. Download the Kubernetes config (kubeconfig) of the target cluster from Rancher to your workstation.
+
+3. Open the Rancher **Continuous Delivery** > **Clusters** dashboard,
+ find the cluster the target application is running on,
+ and temporarily change the cluster label used as the target application's cluster selector.
+ This uninstalls the application and prevents it from being reinstalled during the restore process.
+ Refer to the table below for the cluster selector labels of the Aether applications.
+ It may take several minutes for the application to be uninstalled.
+
+ +-------------+-----------------+------------------+
+ | Application | Original Label | Temporary Label |
+ +-------------+-----------------+------------------+
+ | cassandra | core4g=enabled | core4g=disabled |
+ +-------------+-----------------+------------------+
+ | mongodb | core5g=enabled | core5g=disabled |
+ +-------------+-----------------+------------------+
+ | roc | roc=enabled | roc=disabled |
+ +-------------+-----------------+------------------+
+
+.. image:: images/rancher-fleet-cluster-label-edit1.png
+ :width: 753
+
+.. image:: images/rancher-fleet-cluster-label-edit2.png
+ :width: 753
+
+4. Clean up the existing PVCs and PVs for the application. This guide uses Cassandra as an example.
+
+.. code-block:: shell
+
+ # Assume that we lost all HSSDB data
+ $ kubectl exec cassandra-0 -n aether-sdcore -- cqlsh $cassandra_ip -e 'select * from vhss.users_imsi'
+ <stdin>:1:InvalidRequest: code=2200 [Invalid query] message="Keyspace vhss does not exist"
+
+ # Confirm the application is uninstalled after updating the cluster label
+ $ helm list -n aether-sdcore
+ (no result)
+
+ # Clean up any remaining resources including PVC
+ $ kubectl delete ns aether-sdcore
+
+ # Clean up released PVs, if any exist
+ $ kubectl delete pv $(kubectl get pv | grep cassandra | grep Released | awk '{print $1}')
+
+5. Find a backup to restore.
+
+.. code-block:: shell
+
+ # Find the relevant backup schedule name
+ $ velero schedule get
+ NAME STATUS CREATED SCHEDULE BACKUP TTL LAST BACKUP SELECTOR
+ velero-daily-logging Enabled 2021-09-25 01:35:24 -0700 PDT 0 0 * * * 720h0m0s 19h ago <none>
+ velero-daily-monitoring Enabled 2021-09-25 01:35:25 -0700 PDT 0 0 * * * 720h0m0s 19h ago <none>
+ velero-daily-roc Enabled 2021-09-25 01:35:25 -0700 PDT 0 0 * * * 720h0m0s 19h ago <none>
+ velero-daily-sdcore Enabled 2021-09-25 01:35:25 -0700 PDT 0 0 * * * 720h0m0s 19h ago <none>
+
+ # List the backups
+ $ velero backup get --selector velero.io/schedule-name=velero-daily-sdcore
+ NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
+ velero-daily-sdcore-20211001000013 Completed 0 0 2021-09-30 17:00:19 -0700 PDT 29d default <none>
+ velero-daily-sdcore-20210930000013 Completed 0 0 2021-09-29 17:00:28 -0700 PDT 28d default <none>
+ ...
+
+ # Confirm the backup includes all the necessary resources
+ $ velero backup describe velero-daily-sdcore-20211001000013 --details
+ ...
+ Resource List:
+ v1/PersistentVolume:
+ - pvc-67f82bc9-14f3-4faf-bf24-a2a3d6ccc411
+ - pvc-b19d996b-cc83-4c10-9888-a55ba0eedc93
+ - pvc-d2473b2e-8e6c-42d2-9d13-8fdb842d8cb1
+ v1/PersistentVolumeClaim:
+ - aether-sdcore/data-cassandra-0
+ - aether-sdcore/data-cassandra-1
+ - aether-sdcore/data-cassandra-2
+
+6. Set the backup storage location to read-only mode to prevent backup objects from being created or
+ deleted in the backup location during the restore process.
+
+.. code-block:: shell
+
+ $ kubectl patch backupstoragelocations default \
+ --namespace velero \
+ --type merge \
+ --patch '{"spec":{"accessMode":"ReadOnly"}}'
+
+7. Create a restore with the most recent backup.
+
+.. code-block:: shell
+
+ # Create restore
+ $ velero restore create --from-backup velero-daily-sdcore-20211001000013
+
+ # Wait until STATUS becomes Completed
+ $ velero restore get
+ NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
+ velero-daily-sdcore-20211001000013-20211001141850 velero-daily-sdcore-20211001000013 Completed 2021-10-01 13:11:20 -0700 PDT <nil> 0 0 2021-10-01 13:11:20 -0700 PDT <none>
+
+8. Confirm that the PVCs are restored and "Bound" to the restored PVs.
+
+.. code-block:: shell
+
+ $ kubectl get pvc -n aether-sdcore
+ NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
+ data-cassandra-0 Bound pvc-67f82bc9-14f3-4faf-bf24-a2a3d6ccc411 10Gi RWO standard 45s
+ data-cassandra-1 Bound pvc-b19d996b-cc83-4c10-9888-a55ba0eedc93 10Gi RWO standard 45s
+ data-cassandra-2 Bound pvc-d2473b2e-8e6c-42d2-9d13-8fdb842d8cb1 10Gi RWO standard 45s
+
+9. Revert the backup storage location to read-write mode.
+
+.. code-block:: shell
+
+ $ kubectl patch backupstoragelocations default \
+ --namespace velero \
+ --type merge \
+ --patch '{"spec":{"accessMode":"ReadWrite"}}'
+
+10. Revert the cluster label to the original value and wait for Fleet to reinstall the application.
+ It may take several minutes.
+
+.. code-block:: shell
+
+ # Confirm the application is installed
+ $ helm list -n aether-sdcore
+ NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
+ cassandra aether-sdcore 8 2021-10-01 22:27:18.739617668 +0000 UTC deployed cassandra-0.15.1 3.11.6
+ sd-core-4g aether-sdcore 26 2021-10-02 00:55:25.317693605 +0000 UTC deployed sd-core-0.7.3
+
+ # Confirm the data is restored
+ $ kubectl exec cassandra-0 -n aether-sdcore -- cqlsh $cassandra_ip -e 'select * from vhss.users_imsi'
+ ...
+ (10227 rows)