Backup & Restore

This guide describes how to back up and restore a Kamea instance running on Azure. Kamea is a Terraform-deployed platform: most resources are stateless (App Services, Function Apps, Service Bus, IoT Hub configuration, etc.) and are fully reproducible from the Git repository plus the GitLab CI/CD environment variables. Only a handful of resources actually carry data and require active backup procedures.

Scope of this guide

The procedures and commands below assume the default Kamea stack: Azure as the cloud provider and GitLab for source hosting and CI/CD (including the GitLab-managed Terraform HTTP backend). The reasoning behind each backup item (what must be preserved and why) is the same regardless of the underlying tooling. When running Kamea on a different stack, the same logic must be translated to the equivalent services. For example:

  • Terraform state: if the state is stored in S3, an Azure Storage Account, a Terraform Cloud workspace, a self-hosted backend, or anywhere other than GitLab, the backup target changes (S3 versioning, blob soft delete, Terraform Cloud snapshots, etc.) but the rule "keep an offline encrypted copy of the latest state" still applies.
  • CI/CD variables: GitHub Actions / Azure DevOps / Bitbucket / Jenkins all have their own secret stores. Use their respective export APIs in place of the GitLab variables export.
  • Telemetry storage: Kamea does not impose a single telemetries database. InfluxDB Cloud, PostgreSQL and Redis are all first-class options, controlled by the USE_PGSQL_FOR_TELEMETRIES / USE_REDIS_FOR_TELEMETRIES flags and the InfluxDB connection variables. Several backends can be enabled at the same time (the message-bus router fans messages out to each enabled storage). Apply the backup procedure of every backend that is actually deployed in your environment — the corresponding sections below are independent of one another.
  • Cloud provider: most concepts (managed PostgreSQL backups, blob versioning, file share snapshots, secret vaults) have a direct equivalent on AWS / GCP / on-prem. The list of what to back up does not change; how to back it up does.

The goal of this document is to:

  • Distinguish rebuildable resources (Terraform + CI/CD pipelines) from stateful resources (databases, storage accounts, IoT Hub registry, IDP).
  • Provide concrete commands and Azure procedures to back up each stateful resource.
  • Provide a recovery runbook ordered by dependency.

Overview of what holds data

Kamea is composed of three layers:

| Layer | Examples | Backup strategy |
| --- | --- | --- |
| Stateless | App Services, Function Apps, Service Bus, NSG/VNET, AKS | Re-run Terraform + redeploy from CI/CD |
| Configuration | Terraform state, GitLab CI variables, IDP registrations | Versioned state + secret manager + offline copy |
| Stateful (data) | PostgreSQL, InfluxDB Cloud, Storage Accounts, IoT Hub, Redis | Dedicated backups described in this document |

This diagram shows the overall topology.

A note on RPO and RTO

Two acronyms recur throughout this guide. They come from the standard disaster-recovery vocabulary and frame every backup decision:

  • RPO — Recovery Point Objective: how much data the platform can afford to lose. It is measured backwards from the failure: an RPO of 1 hour means the last accepted state is at most 1 hour older than the moment things broke. RPO is set by the backup frequency — daily dumps give an RPO of up to 24 h, continuous WAL shipping gives an RPO of a few minutes, real-time replication gives an RPO close to zero.
  • RTO — Recovery Time Objective: how long the platform can afford to be down. It is measured forwards from the failure: an RTO of 4 hours means the platform must be back online within 4 hours, including diagnosis, restore execution and validation. RTO is set by the restore procedure and the operator's familiarity with it — a PITR cutover is minutes to an hour, a full pg_restore from a pg_dump archive plus full re-deployment is hours.

The two are independent: you can have a tight RPO with a loose RTO (you lose almost no data, but it takes a day to bring it back) or the opposite (the platform is back online in 15 minutes, but the last few hours of data are gone). Each section below states which side it influences. The customer's contractual RPO/RTO targets dictate which of the recommended frequencies and retention windows are mandatory and which are nice-to-have.
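As a rough illustration of how RPO is measured backwards from the failure, the sketch below checks whether the newest backup still falls inside the RPO window. The function name and its epoch-second arguments are assumptions made for this example; they are not part of Kamea.

```bash
# backup_within_rpo LAST_BACKUP_EPOCH NOW_EPOCH RPO_HOURS
# Succeeds (exit 0) when the newest backup is no older than the RPO,
# fails (exit 1) when a failure right now would lose more data than allowed.
backup_within_rpo() {
  local last_backup_epoch=$1 now_epoch=$2 rpo_hours=$3
  local age_seconds=$(( now_epoch - last_backup_epoch ))
  [ "$age_seconds" -le $(( rpo_hours * 3600 )) ]
}
```

A scheduled job could call it with the timestamp of the latest dump and raise an alert on failure, for example `backup_within_rpo "$last_dump_epoch" "$(date +%s)" 24 || echo "RPO breached"`.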

Resources to back up

1. Terraform state

Stored in the HTTP backend (GitLab managed Terraform state). It is the source of truth for every resource ID and for every Terraform-generated random secret (PostgreSQL admin login/password, RabbitMQ login/password/auth secret, Redis writer password, Azure Function x-api-key keys, etc., declared as random_password resources in the PostgreSQL, RabbitMQ and device-connectivity modules).

Losing the state is recoverable but expensive: every random password must be regenerated and re-injected manually, and Terraform may try to recreate resources that already exist.

Backup: GitLab provides versioned state out of the box. Periodically download the raw state to offline cold storage:

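```bash
# --fail makes curl exit non-zero on an HTTP error instead of saving the error body as a state file
curl --fail --header "Authorization: Bearer $GITLAB_TOKEN" \
  "https://<gitlab>/api/v4/projects/<project_id>/terraform/state/<state_name>" \
  -o "tfstate-$(date +%Y%m%d).json"
```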

Encrypt the result (it contains secrets) and store it outside the GitLab project (e.g. on an Azure Storage Account with versioning enabled, or any secrets manager).
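One possible way to encrypt the downloaded state before archiving it is symmetric encryption with openssl. This is a minimal sketch, not a mandated procedure: the function names and the TFSTATE_PASSPHRASE environment variable are hypothetical, and a gpg- or KMS-based approach would serve equally well.

```bash
# encrypt_state PLAINTEXT_FILE ENCRYPTED_FILE
# AES-256-CBC with a PBKDF2-derived key. The passphrase is read from the
# TFSTATE_PASSPHRASE environment variable (hypothetical name) so that it
# never appears on the command line or in the shell history.
encrypt_state() {
  local plaintext=$1 encrypted=$2
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in "$plaintext" -out "$encrypted" \
    -pass env:TFSTATE_PASSPHRASE
}

# decrypt_state ENCRYPTED_FILE PLAINTEXT_FILE
# Reverse operation, using the same passphrase variable.
decrypt_state() {
  local encrypted=$1 plaintext=$2
  openssl enc -d -aes-256-cbc -pbkdf2 \
    -in "$encrypted" -out "$plaintext" \
    -pass env:TFSTATE_PASSPHRASE
}
```

After a successful `encrypt_state`, delete the plaintext file and keep the passphrase in a different store than the encrypted archive.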

2. GitLab CI/CD environment variables

All deployment inputs (IDP client secret, InfluxDB token, GitLab registry token, ARM client ID/secret, custom domains, SKU sizing, etc.) live as GitLab CI variables — see platform-setup.md. They are not in Terraform state.

Backup: export the project's CI variables periodically:

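```bash
# --fail makes curl exit non-zero on an HTTP error instead of saving the error body as a dump
curl --fail --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://<gitlab>/api/v4/projects/<project_id>/variables?per_page=100" \
  -o "ci-variables-$(date +%Y%m%d).json"
```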

Store the dump in a secrets manager — and treat the file as a secret. The GitLab API distinguishes three confidentiality settings, with very different backup implications:

| Variable type | UI label | Returned by GET .../variables? |
| --- | --- | --- |
| Plain | (none) | Yes: value in clear text. |
| Masked | "Masked" | Yes: value in clear text. Masking only hides the value in CI job logs, not from the API. |
| Masked and hidden | "Masked and hidden" | No. Since GitLab 17.4, the value is write-once: it cannot be retrieved through the API or the UI after creation. The export only contains the key, scope and flags. |

Masked-and-hidden variables are not in the dump

A bare curl export is therefore not a complete backup if the project uses masked-and-hidden variables. Maintain a parallel record of those values at creation/rotation time (in a password manager, a Key Vault, or wherever the upstream secret already lives — e.g. the IDP admin console for the IDP client secret, the cloud provider for the ARM credentials). After a recovery, you cannot "restore" hidden variables from the GitLab dump; you must re-set them through the UI or API using the values held in that out-of-band record.

Restore: GitLab does not provide a single "import variables" endpoint, so the dump is replayed entry by entry against POST /projects/:id/variables. The export already preserves every flag the API expects (variable_type, protected, masked, hidden, environment_scope, raw, description), so the values can be POSTed verbatim.

3. PostgreSQL Flexible Server (Management API DB + optional telemetries)

Hosts:

  • the Management API database which stores all platform metadata: tenants, users, groups, roles, permissions, devices, channels, codecs, device types, firmwares metadata, campaigns, jobs, supervisions, metadata templates, etc.
  • optionally a second database used for telemetries when use_pgsql_for_telemetries=true (pgsql-telemetries-module) — note the prevent_destroy = true lifecycle on this DB.

Admin credentials are randomly generated and exposed only as Terraform outputs (module.pgsql.db_user, module.pgsql.db_password).

Built-in automated backups

Azure Database for PostgreSQL Flexible Server takes automatic full + differential + WAL backups. Defaults: 7 days retention, locally-redundant. Verify and adjust per environment:

```bash
az postgres flexible-server show \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --name <pg_server_name> \
  --query "backup"
```

Recommended for production:

  • backupRetentionDays: 30 (max 35).
  • geoRedundantBackup: Enabled (requires a Zone-Redundant SKU and must be set at server creation; see Azure docs).

These can be added to the Terraform definition by setting backup_retention_days and geo_redundant_backup_enabled on azurerm_postgresql_flexible_server.
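For an existing server, the retention can also be changed from the Azure CLI. The wrapper below is a sketch under assumptions: the function name is hypothetical, and the 7 to 35 day bounds reflect the range Azure accepts for this setting. Geo-redundancy is not covered because it can only be chosen at server creation.

```bash
# set_pg_backup_retention RESOURCE_GROUP SERVER_NAME DAYS
# Validates the requested retention locally before calling Azure, so that
# an out-of-range value fails fast with a readable message.
set_pg_backup_retention() {
  local resource_group=$1 server=$2 days=$3
  if [ "$days" -lt 7 ] || [ "$days" -gt 35 ]; then
    echo "retention must be between 7 and 35 days, got $days" >&2
    return 1
  fi
  az postgres flexible-server update \
    --resource-group "$resource_group" \
    --name "$server" \
    --backup-retention "$days"
}
```

If the server is managed by Terraform, set backup_retention_days there as well; otherwise the next apply reverts the change.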

Logical dumps

A logical dump is a database export produced by pg_dump. Unlike the built-in Azure backup, which is a physical, block-level snapshot tightly coupled to the source server, a logical dump is a self-contained file that describes the database content (schemas, tables, data, constraints, indexes) in PostgreSQL's own portable format. It can be moved freely, inspected, and restored anywhere that runs a compatible PostgreSQL version.

The procedure below is advised for the Management API database, which holds unique business state (tenants, users, groups, roles, devices, channels, codecs, firmwares metadata, campaigns, etc.) that cannot be reconstructed from external sources. For the telemetries database, the same procedure is optional: telemetry data is high-volume — making logical dumps expensive in time and storage — has a retention policy that drops old points anyway, and is reproducible from the devices once they reconnect. The managed Azure Point-in-Time Restore (PITR) is generally enough to cover the telemetries DB; apply pg_dump on top of it only if a compliance requirement or a specific operational need calls for it.

Azure's PITR covers accidental data corruption within an otherwise healthy server. Several of its limitations make an independent logical dump a necessary complement:

  • Restore target is constrained. PITR only produces a new Azure Database for PostgreSQL Flexible Server in the same Azure subscription as the source. It cannot restore to another subscription, another cloud, an on-prem PostgreSQL, or a significantly different PostgreSQL version. A pg_dump file can be restored anywhere pg_restore runs.
  • Retention caps at 35 days. Azure caps the managed backup retention at 35 days. Any compliance, audit or regulatory requirement that mandates longer retention (90 days, 1 year, 7 years, etc.) must be served by an independent logical dump archived to long-term storage.
  • Storage co-location with the server. Managed backups live in the same Azure subscription/region as the server, secured with the same RBAC. A subscription compromise, a billing/account suspension, or an accidental subscription deletion removes both the server and its backups simultaneously. Storing logical dumps in a separate subscription (ideally a separate Azure tenant or even an off-Azure archive) is what gives Kamea a true off-site recovery option.
  • Granular restore. pg_restore accepts --schema, --table and --data-only flags, which lets you restore a single table (e.g. roll back only the device table after a bad migration) without touching the rest of the database. PITR always restores the full server.
  • Cross-environment seeding. A pg_dump file from prod is the standard way to refresh a staging or QA environment with realistic data — possibly after scrubbing personally identifiable information (PII) via pg_dump --exclude-table=... or post-restore SQL. Managed PITR cannot serve this use case: it always provisions a brand-new server in the same subscription as the source, so it can neither write into the pre-existing staging server nor reach a staging subscription/tenant. It also has no mechanism to filter or transform the data on the way out — a PITR-restored server is a bit-for-bit copy of the source, including every PII column, which is unacceptable for a non-production environment.

In practice: rely on PITR for fast in-place recovery (RPO of a few minutes, RTO of a few minutes to an hour) and on logical dumps for anything that PITR cannot do — long-term retention, off-Azure DR, partial restores, environment refresh, schema audit. The two strategies are complementary, not redundant.

Use pg_dump from a host that can reach the PostgreSQL server. Add a CI/CD scheduled pipeline or run from a jump host:

```bash
PGSSLMODE=verify-full \
PGSSLROOTCERT=/path/to/DigiCertGlobalRootCA.pem \
pg_dump \
  --host=<pg_server>.postgres.database.azure.com \
  --username=<admin_login> \
  --format=custom \
  --no-owner \
  --file="kamea-db-$(date +%Y%m%dT%H%M).dump" \
  <database_name>
```

Upload the dump to a dedicated Storage Account (different RG / region for a real DR scenario) with soft delete + versioning + immutability enabled.
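The upload step could look like the sketch below. The function name is hypothetical, and the account and container are whatever dedicated backup storage you created; `--auth-mode login` assumes the caller holds a blob data role on that account rather than an account key.

```bash
# upload_dump STORAGE_ACCOUNT CONTAINER DUMP_FILE
# Uploads the dump under its own file name, which already carries the
# timestamp, so successive backups do not collide.
upload_dump() {
  local account=$1 container=$2 dump_file=$3
  az storage blob upload \
    --account-name "$account" \
    --container-name "$container" \
    --name "$(basename "$dump_file")" \
    --file "$dump_file" \
    --auth-mode login
}
```

Soft delete, versioning and immutability are properties of the target account and container; they are configured once and are not part of the upload call.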

Restore

  • Point-in-Time Restore (PITR) through Azure CLI — preferred for accidental data corruption:

```bash
az postgres flexible-server restore \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --name <new_pg_server_name> \
  --source-server <existing_pg_server_name> \
  --restore-time "2026-05-06T10:00:00Z"
```

Why PITR creates a new server

Azure Database for PostgreSQL Flexible Server does not support in-place restore. The --name parameter is mandatory and must be different from the source server. This is by design:

  • PITR works by provisioning fresh storage and replaying transaction logs onto it up to the chosen point in time. The source storage is never touched, which is what makes the operation safe and reversible if you mistarget the timestamp.
  • The source server keeps running and accepting writes during the restore. Overwriting it would require an outage, an atomic swap of storage, and a way to roll back if the restored data turns out to be wrong — none of which the managed service offers.
  • Keeping both servers side by side lets you compare, validate, or even export only specific tables from the restored copy before committing to the cutover.

The same restriction exists on Azure SQL, AWS RDS, and most managed PostgreSQL offerings, so the workflow below is portable.

After the restore completes, you have two servers and one of two workflows is appropriate depending on whether the corrupted server is salvageable:

  1. Cutover to the restored server (fastest, recommended for full corruption). Update the Management API and Function Apps to point to the new server (DB_HOST, DB_USER, DB_PASSWORD, DB_CA_B64). The FQDN changes with the new server name; the restored server keeps the admin login and password that were valid on the source server at the chosen restore time, so check them against the values the consumers currently use. Update the Terraform state to reflect the new server: either rename the resource and terraform import the restored one, or update the pgsql_server_name variable so the next plan adopts it. Decommission the old server once you are confident the cutover is good.
  2. Selective restore back into the original server (recommended for partial corruption). Use the new server only as a read-only source: pg_dump the specific schemas or tables you need, then pg_restore them into the original server. Delete the temporary restored server. This avoids changing any connection string and keeps Terraform state untouched, but requires that the original server is still healthy enough to receive writes.
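The selective restore in workflow 2 can be sketched as a dump of one table from the PITR copy followed by a data-only restore into the original server. This is an illustration under assumptions: the function and the dump file name are hypothetical, both servers are assumed to accept the same admin login (with the password supplied through PGPASSWORD or ~/.pgpass), and the damaged rows are assumed to have been removed from the target table beforehand.

```bash
# copy_table_from_restored RESTORED_HOST ORIGINAL_HOST USER DATABASE TABLE
# Step 1 reads a single table from the temporary PITR server.
# Step 2 writes only its rows (no DDL) into the original server.
# A data-only restore appends rows, so empty or clean up the target table
# first, otherwise primary-key conflicts will abort the load.
copy_table_from_restored() {
  local restored_host=$1 original_host=$2 user=$3 database=$4 table=$5
  local dump_file="${table}-from-pitr.dump"

  pg_dump --host="$restored_host" --username="$user" \
    --format=custom --no-owner --table="$table" \
    --file="$dump_file" "$database" || return 1

  pg_restore --host="$original_host" --username="$user" \
    --dbname="$database" --no-owner --data-only --table="$table" \
    "$dump_file"
}
```

Tables linked by foreign keys generally need to be restored together and in dependency order, which this single-table sketch does not handle.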

Choosing between the two workflows. The decision rests on three independent questions. Answer them in order; the first one that points to "cutover" wins.

| Question | Answer that points to cutover (#1) | Answer that points to selective restore (#2) |
| --- | --- | --- |
| Is the original server still healthy enough to receive writes? | No: the server is unreachable, locked, ran out of storage, or its file system is corrupted. | Yes: the server is responding, only the data is wrong. |
| What is the scope of the corruption? | Server-wide: most schemas/tables affected, or unknown blast radius. | Localized: a known set of tables (e.g. a bad migration on device), the rest verified intact. |
| What is the operational pressure to be back online? | High RTO pressure: every minute of downtime costs. Cutover is one DNS/config change away from "done". | RTO is relaxed enough to run pg_dump/pg_restore on a subset and validate it before pushing back. |

Rules of thumb:

  • Full corruption, server unreachable, or unknown blast radius → cutover (#1). You cannot make a partial restore safer than the source, and trying to surgically patch a broken server wastes the time you'd spend cutting over cleanly. The cost is a one-time reconfiguration of DB_HOST/credentials in every consumer and a short Terraform state update.
  • Known, localized corruption on an otherwise healthy server → selective restore (#2). The blast radius is small (one or a few tables), the original server still accepts writes, and you want to avoid touching connection strings, Terraform state, RBAC role assignments and IP allowlists — all of which point at the existing server today. Restoring a few tables in place is the lower-risk operation in this case.
  • You are not sure which one applies. Default to cutover (#1). A "partial" restore that misses a corrupted table is worse than a full cutover that rebuilds connection strings. Cutover has a predictable execution plan; selective restore depends on diagnosis quality.
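The "first answer pointing to cutover wins" rule can be written down as a small decision helper. This is only a sketch of the table and rules of thumb above: the function name and the accepted answer strings (yes, localized, relaxed) are assumptions of this example, and any other answer, including "unknown", falls through to cutover.

```bash
# choose_restore_workflow SERVER_HEALTHY CORRUPTION_SCOPE RTO_PRESSURE
# Prints "cutover" or "selective". The three questions are evaluated in
# order; the first answer that is not clearly in favour of a selective
# restore decides for cutover, which is also the default when unsure.
choose_restore_workflow() {
  local server_healthy=$1 corruption_scope=$2 rto_pressure=$3
  [ "$server_healthy" = "yes" ]         || { echo cutover; return; }
  [ "$corruption_scope" = "localized" ] || { echo cutover; return; }
  [ "$rto_pressure" = "relaxed" ]       || { echo cutover; return; }
  echo selective
}
```

For example, `choose_restore_workflow yes localized relaxed` prints `selective`, while `choose_restore_workflow yes unknown relaxed` prints `cutover`. The helper deliberately does not model judgment calls such as how much the diagnosis can be trusted.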

Two combinations deserve a specific note:

  • High RTO pressure + small known corruption. Cutover is still faster than a selective restore in absolute terms, but you may not want to push staging-like config changes under pressure. If the affected tables are clearly identified, running a targeted pg_restore against the original server is a smaller change-window operation. Pick selective restore only if you trust the diagnosis; otherwise cut over.
  • Original server healthy + corruption discovered late. PITR retains backups up to backupRetentionDays (35 max). If the corruption is older than the retention window, neither workflow above will help — you fall back to a pg_restore from a pg_dump archive (see the "Logical dumps" section), which is independent of PITR and not bound by the 35-day cap.

In both cases, update the DB_HOST/DB_USER/DB_PASSWORD app settings of every App Service that ends up connecting to the restored server.
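Repointing a consumer at the restored server could be done with the Azure CLI as sketched below. The function name is hypothetical; only DB_HOST is shown, and DB_USER, DB_PASSWORD and DB_CA_B64 would be passed the same way as additional KEY=VALUE pairs.

```bash
# set_db_host KIND RESOURCE_GROUP APP_NAME NEW_SERVER_NAME
# KIND is "webapp" for App Services or "functionapp" for Function Apps;
# both CLI groups expose the same "config appsettings set" sub-command.
set_db_host() {
  local kind=$1 resource_group=$2 app_name=$3 new_server=$4
  case "$kind" in
    webapp|functionapp) ;;
    *) echo "kind must be webapp or functionapp, got $kind" >&2; return 1 ;;
  esac
  az "$kind" config appsettings set \
    --resource-group "$resource_group" \
    --name "$app_name" \
    --settings "DB_HOST=${new_server}.postgres.database.azure.com"
}
```

Because these settings are normally written by Terraform, align the Terraform variables afterwards so that the next pipeline run does not revert them.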

  • From pg_dump — full disaster recovery:

```bash
pg_restore --host=<new_pg_server>.postgres.database.azure.com \
  --username=<admin_login> \
  --dbname=<database_name> \
  --no-owner \
  kamea-db-<timestamp>.dump
```

After restore, run the telemetry table init script only if telemetry partitions were not part of the dump.

4. Telemetries databases

Kamea supports three telemetry backends, which can be enabled independently or in combination through the USE_PGSQL_FOR_TELEMETRIES, USE_REDIS_FOR_TELEMETRIES and InfluxDB connection variables — see the platform setup guide and the PGSQL telemetries setup guide. Apply the backup procedure of every backend that is actually deployed.

4.a InfluxDB Cloud

External SaaS, not deployed by Terraform — only the URL/org/token are passed to the Influx Functions defined in the influxdb-module. The data is hosted on InfluxData's infrastructure, with their durability SLA.

This guide does not include a customer-side backup procedure for InfluxDB Cloud

For the default Kamea stack, a periodic data backup of InfluxDB Cloud is not recommended and is therefore not documented here. The reasoning:

  • InfluxData owns the durability. The service runs with internal replication and is contractually responsible for the data it stores. A customer-side dump does not protect against the failure modes the SLA already covers (disk loss, region failure, node failure).
  • The failure modes a customer dump would protect against are narrow: InfluxData account compromise, billing suspension, accidental org deletion by a holder of an all-access token, regulatory mandate that the data be held by the customer, or migration to another provider. None of these apply to a typical Kamea deployment.
  • The cost is real. influx backup/influx restore are OSS-only and reject Cloud hosts (Error: InfluxDB OSS-only command used with InfluxDB Cloud host), so the only available path is influx query → annotated CSV → influx write. Every query and every write is metered (query units, egress, write units), and the restore throughput is capped by Cloud rate limits. On a real fleet, a daily full dump quickly becomes a serious line item, and a restore takes hours to days.
  • Telemetry is reproducible. Devices keep producing data, retention policies already drop old points on a schedule, and the business state — tenants, users, devices, channels, codecs, firmwares, campaigns — lives in the Management API PostgreSQL DB, which is backed up (section 3).
  • The InfluxDB connection variables (URL, org, token) are already covered by the GitLab CI/CD variables backup (section 2). That is enough to point a fresh org back at Kamea without any action on the platform.

If a specific compliance or contractual requirement obliges you to hold a customer-side copy of the telemetry data, that is a project-level decision — refer to the InfluxData documentation for the export/restore commands available on your Cloud edition. This guide deliberately leaves it out to avoid pushing every customer into a costly procedure that, in the common case, provides no additional protection over what InfluxData already guarantees.

4.b PostgreSQL telemetries

If USE_PGSQL_FOR_TELEMETRIES=true, telemetries live in a dedicated database on the PostgreSQL flexible server (or on a separate PG server, depending on the deployment choice — see this page). They are covered by the PostgreSQL backup procedure described in section 3. Two specifics apply:

  • The pgsql-telemetries-module declares the database with prevent_destroy = true, so a Terraform run cannot accidentally drop it.
  • The schema initialization script shipped with the platform creates monthly partitions up to 2030. Remember to re-apply or extend it when restoring into an empty DB; logical dumps include the partitions, so this is only relevant for a fresh DB.

5. Redis (Container App) — required, stores device provisioning data

Critical, always-on store

Redis is not optional in Kamea. It is deployed by the device-connectivity-module on every environment and is consumed by the Management API, the WebSocket Server and the ingestion Azure Functions (see the flow matrix in security/flux-matrix-azure.md). It holds two classes of data on the same instance:

  1. Device provisioning data — written by the Management API when a device is provisioned and read by the ingestion Azure Functions to authorize and decode incoming traffic (which channel the device uses, its codec, its hashed secret, its tenant, etc.). This data has no expiration and is business-critical: losing it means every device is rejected by the ingestion chain until each one is re-provisioned through the API. The same information is also kept in the Management API PostgreSQL database, so a full restore is possible from there, but it takes time and the platform is degraded in the meantime.
  2. Last value per key for telemetries — only when USE_REDIS_FOR_TELEMETRIES=true. This is derived data that any device push will refresh.

The same Redis instance hosts both concerns, on the same Azure File share (redis-persistence), and the same backup/restore procedure applies. The retention profile is dictated by the more critical of the two roles, i.e. the provisioning data.

Redis is deployed as an Azure Container App with persistence on an Azure File share, and saves snapshots through Redis' RDB mechanism in /usr/local/etc/redis/backup.

Backup: enable Azure Backup (Recovery Services Vault) on the redis-persistence storage share, with daily snapshots and a retention window of at least 30 days. Because provisioning data does not expire, treat this share with the same retention as the Management API PostgreSQL dumps — it is part of the platform's permanent state, not a cache. Increase retention further if regulatory requirements apply.
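Enabling protection on the share from the CLI might look like this. It is a sketch under assumptions: the function name is hypothetical, and the Recovery Services Vault and a backup policy with a daily schedule and at least 30 days of retention are assumed to exist already.

```bash
# protect_redis_share RESOURCE_GROUP VAULT_NAME POLICY_NAME STORAGE_ACCOUNT
# Registers the redis-persistence file share with an existing vault and
# attaches it to an existing backup policy.
protect_redis_share() {
  local resource_group=$1 vault=$2 policy=$3 storage_account=$4
  az backup protection enable-for-azurefileshare \
    --resource-group "$resource_group" \
    --vault-name "$vault" \
    --policy-name "$policy" \
    --storage-account "$storage_account" \
    --azure-file-share redis-persistence
}
```

The schedule and retention live in the policy, not in this call, so review the policy itself when retention requirements change.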

Restore: stop the Redis container app, restore the file share snapshot, and restart the container app — it will reload the latest RDB file on boot. Verify that devices can authenticate and that telemetries are accepted by the ingestion functions before declaring the restore successful.

6. IoT Hub (device registry)

Deployed by the iothub-module. The hub stores: device identities, their authentication material (symmetric SAS keys, self-signed X.509 thumbprints, or certificateAuthority references), device twins (reported + desired state), and module identities. Kamea supports all three IoT Hub authentication modes — symmetric key, self-signed certificate and CA-signed certificate — see this page. The Kamea Management API also keeps a copy of the device list in PostgreSQL, but the IoT Hub credentials/twins are the source of truth for the connectivity layer.

Backup: IoT Hub does not have a managed backup; export the registry and twins periodically.

```bash
# Identities + auth (devices.json)
az iot hub device-identity export \
  --hub-name <iothub_name> \
  --blob-container-uri "<sas_url_to_export_container>" \
  --identity <managed_identity> \
  --include-keys

# Twin export
az iot hub device-twin list \
  --hub-name <iothub_name> \
  > "iothub-twins-$(date +%Y%m%d).json"
```

What the export captures, per authentication mode:

| Auth mode | Captured by device-identity export | Notes |
| --- | --- | --- |
| Symmetric key (SAS) | Yes: full device record + keys | The --include-keys flag is what makes the primary/secondary symmetric keys appear in the dump. Without it, those fields are blanked out and the export is unusable for SAS devices. |
| Self-signed X.509 | Yes: full device record + thumbprints | The primaryThumbprint / secondaryThumbprint fields are part of the device record itself and are emitted regardless of --include-keys. |
| CA-signed X.509 | Partial: device record only | Each device record only references type=certificateAuthority; there is no per-device material. The trust anchor lives outside the registry, as described below. |

For CA-signed devices, the root and intermediate CA certificates uploaded to the IoT Hub / DPS are not part of device-identity export. Without them the imported devices cannot connect even if the registry is fully restored. The CA material is what the prove_certificate_possession.sh script in the iothub-module proves ownership of at deploy time, and it must be backed up independently:

  • The PEM cert and private key used as the trust root (the customer's PKI artefact — typically held off-platform in a key vault or HSM, never committed to Git outside of the test fixtures shipped with Kamea).
  • The list of certificates uploaded to the hub and DPS:

```bash
az iot hub certificate list --hub-name <iothub_name> > "iothub-certs-$(date +%Y%m%d).json"
az iot dps certificate list --dps-name <dps_name> > "dps-certs-$(date +%Y%m%d).json"
```

The registry export can be written to a new container next to the dedicated device-uploaded-files one, or to a separate vault account. Generate a write-only SAS URL with a limited TTL for it.

For DPS (use_dps=true), also export enrollment groups and individual enrollments:

```bash
az iot dps enrollment-group list --dps-name <dps_name> > "dps-enrollment-groups-$(date +%Y%m%d).json"
az iot dps enrollment list --dps-name <dps_name> > "dps-enrollments-$(date +%Y%m%d).json"
```

Restore:

```bash
az iot hub device-identity import \
  --hub-name <new_iothub_name> \
  --input-blob-container-uri "<sas_url_to_input>" \
  --output-blob-container-uri "<sas_url_to_output>"
```

Then:

  • Re-apply twins with az iot hub device-twin update --hub-name ... --device-id ... --set properties.desired=....
  • For CA-signed devices, re-upload each CA certificate with az iot hub certificate create / az iot dps certificate create, and replay the proof-of-possession (the same workflow as prove_certificate_possession.sh in the iothub-module: read the verification code with az iot dps certificate generate-verification-code, sign it with the CA private key, and submit it via az iot dps certificate verify). Until the CA is verified, devices in certificateAuthority mode will be rejected at the TLS handshake.
  • Re-import DPS enrollments via az iot dps enrollment[-group] create.
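For the proof-of-possession step, the signing part is usually done with openssl by issuing a short-lived certificate whose common name is the verification code. The sketch below shows that commonly documented flow; it is not taken from prove_certificate_possession.sh, whose contents are not reproduced in this guide, and the function name and output file names are hypothetical.

```bash
# make_verification_cert VERIFICATION_CODE CA_CERT CA_KEY OUT_PREFIX
# Produces OUT_PREFIX.pem, a certificate with CN=<verification code>
# signed by the CA, which is what the verify command expects to receive.
make_verification_cert() {
  local verification_code=$1 ca_cert=$2 ca_key=$3 out_prefix=$4
  # 1. Throwaway key pair for the verification certificate.
  openssl genrsa -out "${out_prefix}.key" 2048 &&
  # 2. CSR whose subject carries the verification code.
  openssl req -new -key "${out_prefix}.key" \
    -subj "/CN=${verification_code}" -out "${out_prefix}.csr" &&
  # 3. Sign the CSR with the CA private key; one day of validity is enough.
  openssl x509 -req -in "${out_prefix}.csr" \
    -CA "$ca_cert" -CAkey "$ca_key" -CAcreateserial \
    -days 1 -sha256 -out "${out_prefix}.pem"
}
```

The resulting `.pem` file is the one submitted to az iot dps certificate verify (or its IoT Hub counterpart); the throwaway key and CSR can be deleted afterwards.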

Warning

The Management API stores the IoT Hub device name (system metadata iotHubDeviceName) when explicitly provided, or relies on the Kamea device UUID otherwise. After a restore, ensure that the device IDs in IoT Hub match what PostgreSQL expects, otherwise telemetries will be ignored.

7. Storage Accounts

Several storage accounts are created. Their backup needs differ:

| Storage account | Purpose | Contains data? | Action |
| --- | --- | --- | --- |
| <project>apistorage | API blob storage: tenant logos, firmwares, device-uploaded files | Yes (critical) | Enable blob soft delete, versioning, point-in-time restore, Azure Backup |
| <project>functionstorage, <project>influxfnstorage, <project>redisfnstorage, <project>pgsqlfnstorage | Azure Functions runtime state and code packages (ingestion, InfluxDB, Redis and PostgreSQL functions). functionstorage used to hold device provisioning records, but those have been moved to Redis (see section 5). | No business data | Re-deploy from CI/CD; soft delete is sufficient |
| <project>caddyfrontend | Front-end SPA artifacts mounted in Caddy | No | Re-deploy from CI/CD |
| <project>storacfrontend/<project>settingsfrontend | Static-website-hosted SPAs | No | Re-deploy from CI/CD |
| <project>redispersistence | Redis RDB snapshot: holds device provisioning data (see section 5), optionally last-value telemetries | Yes (critical) | Azure Backup (Files), daily snapshots, 30+ days retention |
| <project>rmqcertificates | RabbitMQ TLS cert, trust store, persistence | Yes (critical) | Azure Backup (Files), 30-day retention |

Configuring blob/file backup

For blob containers on <project>apistorage (the only storage account in the table that holds business data in blobs):

```bash
# The restore policy (point-in-time restore) requires versioning, soft delete
# and the change feed to be enabled on the account.
az storage account blob-service-properties update \
  --account-name <name> \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --enable-delete-retention true \
  --delete-retention-days 30 \
  --enable-container-delete-retention true \
  --container-delete-retention-days 30 \
  --enable-versioning true \
  --enable-change-feed true \
  --enable-restore-policy true \
  --restore-days 29
```

These options can also be added directly to the azurerm_storage_account blocks in Terraform:

```hcl
blob_properties {
  delete_retention_policy {
    days = 30
  }
  container_delete_retention_policy {
    days = 30
  }
  versioning_enabled  = true
  change_feed_enabled = true
  restore_policy {
    days = 29
  }
}
```

For full-scale backup, register the storage accounts with a Recovery Services Vault / Backup Vault (operational + vaulted tiers depending on your RPO/RTO targets).

8. RabbitMQ persistence (only when use_mqtt=true)

The rabbitmq-module creates the storage account <project>rmqcertificates with three Azure File shares: TLS certificates, trust store and (in the AKS module) persistence. The TLS cert is critical — a lost cert means re-issuing one and updating every device. Enable Azure Files backup as for Redis persistence.

The RabbitMQ login/password/auth secret are random_password resources — rotation requires updating both the Kubernetes Key Vault secret and the API/AKS App Service settings; see the aks-module.

9. Application Insights / Log Analytics workspace

Defined in the app-insights-module. Retention is set via workspace_retention_in_days. Logs are not business data but are required for troubleshooting historical incidents. To preserve logs beyond the retention window, configure a continuous export to a Storage Account ("Diagnostic settings -> Send to storage account") on the workspace.

10. Key Vault (AKS path only)

Declared in the aks-module. Key Vault is created with purge_protection_enabled = true and soft_delete_retention_days = 30, so deletion is reversible for 30 days: a soft-deleted vault or secret can be recovered via az keyvault recover / az keyvault secret recover within that window. This is the recovery path Kamea relies on.

This guide deliberately omits an active Key Vault backup procedure

Key Vault is the platform's secret store: it holds the RabbitMQ login/password, the Service Bus connection string, the Redis writer password, the RabbitMQ TLS private key, and any other secret the AKS workloads consume through the CSI driver (see the SecretProviderClass declared in the aks-module). The value of Key Vault as a security boundary comes entirely from the fact that those secrets live in exactly one Microsoft-managed store, with RBAC, private-endpoint access, audit logging, and HSM-backed protection on premium SKUs.

Producing periodic copies of every secret with az keyvault secret backup — however convenient for cross-region DR — runs against that property:

  • The output of az keyvault secret backup is a wrapped blob, but once written to a regular Storage Account it is exposed to a wider blast radius: anyone with read access to the backup container (a CI runner token, a misconfigured RBAC role, a developer with debug rights) becomes a candidate path to the secret material. The Storage Account access controls are typically less strict than the vault's, which defeats the purpose of keeping the secrets in the vault.
  • The wrapped backup can only be restored into a vault in the same Azure geography and subscription, so the cross-region DR benefit is limited to begin with.
  • Every secret in this vault is either (a) generated by Terraform as a random_password and re-derivable from the Terraform state (already backed up in section 1), or (b) the RabbitMQ TLS certificate, which is also stored on the <project>rmqcertificates storage account file share (section 8) and whose root authority is held off-platform.

The recovery path therefore is:

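  1. Within 30 days of a deletion, recover the vault and its secrets in place with az keyvault recover / az keyvault secret recover. Purge protection prevents a soft-deleted vault or secret from being purged during that window; once the window has elapsed and the object is purged, it can no longer be recovered.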
  2. Outside that window — i.e. true disaster — re-run the deployment pipeline. Terraform regenerates the random secrets, pushes them back into the (recreated) vault, and updates the consumers. The TLS material is reissued from the customer's CA as part of the same pipeline.
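The choice between these two paths depends only on how long ago the deletion happened. The following is a minimal Python sketch of that decision, assuming the deletion timestamp is known (for example from the output of az keyvault list-deleted). The function names and the vault name used in the usage note are hypothetical and not part of Kamea; the commands are only built as argument lists, not executed.

```python
from datetime import datetime, timedelta, timezone

# Mirrors soft_delete_retention_days = 30 from the aks-module.
SOFT_DELETE_RETENTION_DAYS = 30


def recovery_path(deleted_at: datetime, now: datetime,
                  retention_days: int = SOFT_DELETE_RETENTION_DAYS) -> str:
    """Return 'recover' while the soft-delete window is open, else 'redeploy'."""
    scheduled_purge = deleted_at + timedelta(days=retention_days)
    return "recover" if now < scheduled_purge else "redeploy"


def recover_vault_command(vault_name: str) -> list[str]:
    """Build (but do not run) the az CLI call for a soft-deleted vault."""
    return ["az", "keyvault", "recover", "--name", vault_name]


def recover_secret_command(vault_name: str, secret_name: str) -> list[str]:
    """Build (but do not run) the az CLI call for a soft-deleted secret."""
    return ["az", "keyvault", "secret", "recover",
            "--vault-name", vault_name, "--name", secret_name]
```

For instance, `recovery_path(deleted_at, datetime.now(timezone.utc))` returning "redeploy" means the only remaining option is to re-run the deployment pipeline.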

If a specific compliance requirement obliges you to hold an out-of-vault copy of the secrets, treat the secondary store with the same controls as the vault itself: a separate, restricted, audited Key Vault in another subscription, with purge_protection_enabled = true, no inheritable RBAC, and no service principal that the rest of the platform can use. Do not drop the wrapped blobs on a Storage Account.

| Asset | Frequency | Retention | Tooling |
| --- | --- | --- | --- |
| Terraform state | After each run | 30 versions | GitLab + offline copy |
| GitLab CI variables | Weekly | 12 weeks | GitLab API + secrets manager |
| PostgreSQL automated backup | Built-in | 30 days | Azure Database for PostgreSQL |
| PostgreSQL logical dump | Daily | 90 days | pg_dump from CI/CD |
| PG telemetries logical dump (if PG telemetries enabled) | Daily | 90 days | pg_dump from CI/CD |
| IoT Hub registry export | Daily | 30 days | az iot hub device-identity export |
| IoT Hub twins export | Weekly | 30 days | az iot hub device-twin list |
| DPS enrollment(-group) export | After change | 12 versions | az iot dps enrollment list |
| Storage apistorage (firmwares, logos, files) | Continuous (versioning + soft delete) | 30 days | Azure Backup / Storage policies |
| Redis persistence file share (provisioning data, critical) | Daily | 30+ days | Azure Backup (Files) |
| RabbitMQ certificates / persistence | Daily | 30 days | Azure Backup (Files) |
| Log Analytics export to Storage | Continuous | 1+ year | Workspace diagnostic settings |

Adjust frequencies and retention to match the customer's RPO/RTO and regulatory constraints.
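A schedule like this is only useful if someone notices when a backup stops running. The following is a minimal Python sketch of a staleness check built from the Frequency column above. The asset keys, the `stale_assets` function and the 6-hour grace period are illustrative assumptions; how the "last backup" timestamps are collected (pipeline artifacts, Azure Backup job history, etc.) is left out.

```python
from datetime import datetime, timedelta

# Maximum acceptable age of the most recent backup, taken from the
# "Frequency" column of the schedule. Keys are illustrative labels only.
MAX_AGE = {
    "gitlab_ci_variables": timedelta(days=7),
    "postgresql_logical_dump": timedelta(days=1),
    "iot_hub_registry_export": timedelta(days=1),
    "iot_hub_twins_export": timedelta(days=7),
    "redis_persistence_share": timedelta(days=1),
}


def stale_assets(last_backup: dict[str, datetime], now: datetime,
                 grace: timedelta = timedelta(hours=6)) -> list[str]:
    """Return the assets whose latest backup is missing or too old."""
    stale = []
    for asset, max_age in MAX_AGE.items():
        taken_at = last_backup.get(asset)
        # A missing timestamp counts as stale: the backup never ran.
        if taken_at is None or now - taken_at > max_age + grace:
            stale.append(asset)
    return sorted(stale)
```

Running such a check from a scheduled pipeline and failing the job when the returned list is non-empty is one way to surface a silently broken backup before it is needed.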

Disaster recovery: full restore runbook

Order matters because of resource dependencies (cf. security/flux-matrix-azure.md).

  1. Re-create the Azure Resource Group (or use the existing one if only data was lost).
  2. Restore the Terraform state from the offline copy into the GitLab HTTP backend.
  3. Restore GitLab CI variables from the latest export.
  4. Re-deploy the infrastructure with the existing pipeline. The two-step Terraform sequence described in platform-setup.md (targeted apply, then full apply) still applies.
  5. Restore the PostgreSQL server:
    • Either PITR a new flexible server (and update Terraform/state to import it), or re-deploy a new server and pg_restore the latest dump.
    • Update App Service settings if credentials changed (DB_HOST, DB_USER, DB_PASSWORD, DB_CA_B64).
  6. Restore PostgreSQL telemetries if USE_PGSQL_FOR_TELEMETRIES=true: this is covered by step 5 if the database lives on the same server, otherwise PITR/restore the dedicated server. InfluxDB Cloud needs no action here: data is held by InfluxData and the org/token live in the CI/CD variables restored in step 3 (see section 4.a for the rationale).
  7. Restore the Redis persistence file share. This step is required because the share holds device provisioning data. Without it, the ingestion functions will reject every device until each one is re-provisioned.
  8. Restore the API storage account (apistorage):
    • Use blob point-in-time restore, or restore from Recovery Services Vault.
    • Verify the logos container, firmware blobs and device-uploaded-files.
  9. Restore IoT Hub identities + twins, then the DPS enrollments.
  10. Restore RabbitMQ certificates and persistence (only if MQTT is used).
  11. Re-deploy applications by re-running the GitLab pipeline (it pushes the docker images into the App Services and the function packages into the Function Apps).
  12. Smoke test:
    • GET /health on Caddy, the management API, the WSS, and each Azure Function.
    • POST /init is not required again, because the database already contains the root tenant and admin account.
    • Provision a test device, send a telemetry message, check that it appears in each enabled telemetry store and on the WebSocket dashboard.
  13. Rotate and re-issue any secrets that were exposed during the recovery (DB admin, RabbitMQ login, Function x-api-key, GitLab tokens, IDP client secret).
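Because some steps only apply to certain deployments, it can help to generate the ordered checklist for a given environment before starting. The following is a minimal Python sketch of that idea. The function `build_restore_plan` and its two parameters are hypothetical: `dedicated_pg_telemetries_server` stands for "USE_PGSQL_FOR_TELEMETRIES=true and the telemetries live on their own server", and `use_mqtt` mirrors the use_mqtt flag.

```python
def build_restore_plan(dedicated_pg_telemetries_server: bool,
                       use_mqtt: bool) -> list[str]:
    """Return the runbook steps in order, skipping those that do not apply."""
    steps = [
        ("Re-create or reuse the Azure Resource Group", True),
        ("Restore the Terraform state into the GitLab HTTP backend", True),
        ("Restore GitLab CI variables", True),
        ("Re-deploy the infrastructure (targeted apply, then full apply)", True),
        ("Restore the PostgreSQL server", True),
        # Only a separate step when telemetries are on their own server.
        ("Restore the dedicated PostgreSQL telemetries server",
         dedicated_pg_telemetries_server),
        ("Restore the Redis persistence file share", True),
        ("Restore the API storage account", True),
        ("Restore IoT Hub identities and twins, then DPS enrollments", True),
        ("Restore RabbitMQ certificates and persistence", use_mqtt),
        ("Re-deploy applications via the GitLab pipeline", True),
        ("Run the smoke test", True),
        ("Rotate secrets exposed during the recovery", True),
    ]
    return [name for name, applies in steps if applies]
```

Printing the result of `build_restore_plan(False, True)` at the start of an incident gives the on-call engineer a checklist that already reflects the environment's configuration.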

Things you do not need to back up

The following are recreated from scratch on every Terraform run or on every Docker deploy and therefore do not require backup procedures:

  • App Service Plans, App Services (management API, WSS, Caddy, onboarding API).
  • Function Apps and their code packages (rebuilt from CI/CD).
  • Service Bus namespace, topics, subscriptions, rules — recreated by Terraform; in-flight messages are accepted as transient loss.
  • Azure Container App for Redis (the runtime; persistence is covered separately).
  • AKS cluster, NSGs, VNet, subnets, private DNS zones.
  • Application Insights / Log Analytics workspace (configuration only — see export note above for the historical logs).
  • Azure Maps account.
  • Front-end storage accounts when serve_frontend_from_storage_account=true (artifacts are pushed by CI/CD on every release).

If a stateless resource is destroyed, run the deployment pipeline; everything is rebuilt within a few minutes.

Validating backups

Backups that have never been restored cannot be trusted. Schedule a quarterly DR drill:

  1. Pick a non-production environment.
  2. Restore the latest PostgreSQL dump, the latest IoT Hub export, and one full export of each enabled telemetry backend into a fresh resource group.
  3. Re-deploy Kamea on top of it from the same Git revision.
  4. Run the smoke test above.
  5. Document the elapsed time — that is your real RTO.
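The last two steps of the drill can be partly automated. The following is a minimal Python sketch that checks a list of /health endpoints and reports the elapsed time since the drill started. It assumes that a healthy endpoint answers with HTTP 200, which the guide does not state explicitly; the function names, the report keys and the example URLs are hypothetical. The `fetch` and `clock` parameters exist so the logic can be exercised without network access.

```python
import time
from typing import Callable
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """GET url and return the HTTP status code, or 0 if it is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:   # 4xx/5xx responses still carry a status code
        return err.code
    except OSError:            # DNS failure, refused connection, timeout
        return 0


def drill_report(drill_started: float, health_urls: list[str],
                 fetch: Callable[[str], int] = http_status,
                 clock: Callable[[], float] = time.time) -> dict:
    """Check every health endpoint and report failures plus elapsed seconds.

    drill_started is the clock reading taken when the restore began.
    """
    failures = [url for url in health_urls if fetch(url) != 200]
    return {
        "passed": not failures,
        "failures": failures,
        "rto_seconds": clock() - drill_started,
    }
```

A typical use is to record `started = time.time()` before the first restore step and call `drill_report(started, urls)` once the re-deployment has finished. The `rto_seconds` value is only meaningful as an RTO when `passed` is true.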