Backup & Restore
This guide describes how to back up and restore a Kamea instance running on Azure. Kamea is a Terraform-deployed platform: most resources are stateless (App Services, Function Apps, Service Bus, IoT Hub configuration, etc.) and are fully reproducible from the Git repository plus the GitLab CI/CD environment variables. Only a handful of resources actually carry data and require active backup procedures.
Scope of this guide
The procedures and commands below assume the default Kamea stack: Azure as the cloud provider, GitLab for source hosting and CI/CD (including the GitLab-managed Terraform HTTP backend). The reasoning behind each backup item — what must be preserved and why — is the same regardless of the underlying tooling. When running Kamea on a different stack, the same logic must be translated to the equivalent services. For example:
- Terraform state: if the state is stored in S3, an Azure Storage Account, a Terraform Cloud workspace, a self-hosted backend, or anywhere other than GitLab, the backup target changes (S3 versioning, blob soft delete, Terraform Cloud snapshots, etc.) but the rule "keep an offline encrypted copy of the latest state" still applies.
- CI/CD variables: GitHub Actions / Azure DevOps / Bitbucket / Jenkins all have their own secret stores. Use their respective export APIs in place of the GitLab variables export.
- Telemetry storage: Kamea does not impose a single telemetries database. InfluxDB Cloud, PostgreSQL and Redis are all first-class options, controlled by the USE_PGSQL_FOR_TELEMETRIES / USE_REDIS_FOR_TELEMETRIES flags and the InfluxDB connection variables. Several backends can be enabled at the same time (the message-bus router fans messages out to each enabled storage). Apply the backup procedure of every backend that is actually deployed in your environment; the corresponding sections below are independent of one another.
- Cloud provider: most concepts (managed PostgreSQL backups, blob versioning, file share snapshots, secret vaults) have a direct equivalent on AWS / GCP / on-prem. The list of what to back up does not change; how to back it up does.
The goal of this document is to:
- Make the distinction between rebuildable resources (Terraform + CI/CD pipelines) and stateful resources (databases, storage accounts, IoT Hub registry, IDP).
- Provide concrete commands and Azure procedures to back up each stateful resource.
- Provide a recovery runbook ordered by dependency.
Overview of what holds data
Kamea is composed of three layers:
| Layer | Examples | Backup strategy |
|---|---|---|
| Stateless | App Services, Function Apps, Service Bus, NSG/VNET, AKS | Re-run Terraform + redeploy from CI/CD |
| Configuration | Terraform state, GitLab CI variables, IDP registrations | Versioned state + secret manager + offline copy |
| Stateful (data) | PostgreSQL, InfluxDB Cloud, Storage Accounts, IoT Hub, Redis | Dedicated backups described in this document |
This diagram shows the overall topology.
A note on RPO and RTO
Two acronyms recur throughout this guide. They come from the standard disaster-recovery vocabulary and frame every backup decision:
- RPO — Recovery Point Objective: how much data the platform can afford to lose. It is measured backwards from the failure: an RPO of 1 hour means the last accepted state is at most 1 hour older than the moment things broke. RPO is set by the backup frequency — daily dumps give an RPO of up to 24 h, continuous WAL shipping gives an RPO of a few minutes, real-time replication gives an RPO close to zero.
- RTO — Recovery Time Objective: how long the platform can afford to be down.
It is measured forwards from the failure: an RTO of 4 hours means the platform must
be back online within 4 hours, including diagnosis, restore execution and validation.
RTO is set by the restore procedure and the operator's familiarity with it: a
PITR cutover takes minutes to an hour, while a full pg_restore from a pg_dump
archive plus full re-deployment takes hours.
The two are independent: you can have a tight RPO with a loose RTO (you lose almost no data, but it takes a day to bring it back) or the opposite (the platform is back online in 15 minutes, but the last few hours of data are gone). Each section below states which side it influences. The customer's contractual RPO/RTO targets dictate which of the recommended frequencies and retention windows are mandatory and which are nice-to-have.
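As a rough illustration of how an RPO target turns into a check, the sketch below compares the age of the newest backup with the target. The function name backup_within_rpo and its arguments are hypothetical and not part of Kamea.

```shell
# Hypothetical helper, not part of Kamea: succeed when the newest backup is
# recent enough to honour an RPO target.
#   $1 = epoch seconds of the newest backup
#   $2 = epoch seconds of "now" (or of the simulated failure)
#   $3 = RPO target in seconds
backup_within_rpo() (
  age=$(( $2 - $1 ))
  # A negative age means the timestamps are inconsistent: treat it as a miss.
  [ "$age" -ge 0 ] && [ "$age" -le "$3" ]
)

# Example: a daily dump taken 20 h before the failure.
#   backup_within_rpo "$last" "$now" $((24 * 3600))   # RPO of 24 h: met
#   backup_within_rpo "$last" "$now" 3600             # RPO of 1 h: missed
```

A check of this kind could run in a scheduled pipeline so that a silently failing backup job is noticed before the RPO is breached.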
Resources to back up
1. Terraform state
Stored in the HTTP backend (GitLab managed Terraform state). It is the source of truth for
every resource ID and for every Terraform-generated random secret (PostgreSQL admin
login/password, RabbitMQ login/password/auth secret, Redis writer password, Azure
Function x-api-key keys, etc., declared as random_password resources in the PostgreSQL,
RabbitMQ and device-connectivity modules).
Losing the state is recoverable but expensive: every random password must be regenerated and re-injected manually, and Terraform may try to recreate resources that already exist.
Backup: GitLab provides versioned state out of the box. Periodically download the raw state to offline cold storage:
bash
curl --header "Authorization: Bearer $GITLAB_TOKEN" \
"https://<gitlab>/api/v4/projects/<project_id>/terraform/state/<state_name>" \
-o "tfstate-$(date +%Y%m%d).json"
Encrypt the result (it contains secrets) and store it outside the GitLab project (e.g. on an Azure Storage Account with versioning enabled, or any secrets manager).
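A minimal sketch of the download-then-encrypt step, assuming openssl (1.1.1 or later) is available on the host. The TFSTATE_PASSPHRASE variable and both function names are illustrative assumptions; any other encryption tool (gpg, age, a Key Vault wrapped key) would serve the same purpose.

```shell
# Sketch only. TFSTATE_PASSPHRASE and the function names are assumptions.

# Build a dated file name for the encrypted copy.
#   $1 = state name, $2 = optional date stamp (defaults to today, UTC)
tfstate_backup_name() {
  printf 'tfstate-%s-%s.json.enc\n' "$1" "${2:-$(date -u +%Y%m%d)}"
}

# Download the state and encrypt it. The clear-text copy only exists in a
# private temporary file that is removed when the function exits.
#   $1 = GitLab base URL, $2 = project id, $3 = state name
backup_tfstate() (
  set -eu
  umask 077
  tmp="$(mktemp)"
  trap 'rm -f "$tmp"' EXIT
  curl --fail --silent --show-error \
    --header "Authorization: Bearer $GITLAB_TOKEN" \
    --output "$tmp" \
    "$1/api/v4/projects/$2/terraform/state/$3"
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -pass env:TFSTATE_PASSPHRASE \
    -in "$tmp" -out "$(tfstate_backup_name "$3")"
)
```

Decryption uses the same openssl options with -d added. Keep the passphrase in a different place than the encrypted file, otherwise the encryption adds little.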
2. GitLab CI/CD environment variables
All deployment inputs (IDP client secret, InfluxDB token, GitLab registry token,
ARM client ID/secret, custom domains, SKU sizing, etc.) live as GitLab CI variables — see
platform-setup.md. They are not in Terraform state.
Backup: export the project's CI variables periodically:
bash
curl --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
"https://<gitlab>/api/v4/projects/<project_id>/variables?per_page=100" \
-o "ci-variables-$(date +%Y%m%d).json"
Store the dump in a secrets manager — and treat the file as a secret. The GitLab API distinguishes three confidentiality settings, with very different backup implications:
| Variable type | UI label | Returned by GET .../variables? |
|---|---|---|
| Plain | (none) | Yes — value in clear text. |
| Masked | "Masked" | Yes — value in clear text. Masking only hides the value in CI job logs, not from the API. |
| Masked and hidden | "Masked and hidden" | No. Since GitLab 17.4, the value is write-once: it cannot be retrieved through the API or the UI after creation. The export only contains the key, scope and flags. |
Masked-and-hidden variables are not in the dump
A bare curl export is therefore not a complete backup if the project uses
masked-and-hidden variables. Maintain a parallel record of those values at
creation/rotation time (in a password manager, a Key Vault, or wherever the
upstream secret already lives — e.g. the IDP admin console for the IDP client
secret, the cloud provider for the ARM credentials). After a recovery, you cannot
"restore" hidden variables from the GitLab dump; you must re-set them through the
UI or API using the values held in that out-of-band record.
Restore: GitLab does not provide a single "import variables" endpoint, so the dump
is replayed entry by entry against POST /projects/:id/variables. The export already
preserves every flag the API expects (variable_type, protected, masked, hidden,
environment_scope, raw, description), so the values can be POSTed verbatim.
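The replay could look like the sketch below, assuming curl and jq are installed. The function names are hypothetical. Dropping the hidden field and any null field before posting is a precaution taken here, on the assumption that the create endpoint is stricter than the export; check it against the API documentation of your GitLab version.

```shell
# Sketch only: function names are hypothetical; requires curl and jq.

# URL of the project-level variables endpoint.
#   $1 = GitLab base URL, $2 = project id
variables_url() {
  printf '%s/api/v4/projects/%s/variables\n' "${1%/}" "$2"
}

# Replay a dump produced by the export above, one POST per variable.
#   $1 = GitLab base URL, $2 = project id, $3 = dump file
restore_ci_variables() (
  url="$(variables_url "$1" "$2")"
  # Hidden entries carry no value in the dump, so they are skipped here and
  # must be re-created from the out-of-band record.
  jq -c '.[]
         | select(.hidden != true)
         | del(.hidden)
         | with_entries(select(.value != null))' "$3" |
  while IFS= read -r entry; do
    # The entry is never echoed on failure: it contains the secret value.
    curl --fail --silent --show-error --request POST \
      --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
      --header "Content-Type: application/json" \
      --data "$entry" "$url" >/dev/null ||
      printf 'failed to create a variable, check key and scope\n' >&2
  done
)
```

A variable that already exists with the same key and scope is rejected by GitLab, so the replay is meant for an empty or freshly created project.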
3. PostgreSQL Flexible Server (Management API DB + optional telemetries)
Hosts:
- the Management API database which stores all platform metadata: tenants, users, groups, roles, permissions, devices, channels, codecs, device types, firmwares metadata, campaigns, jobs, supervisions, metadata templates, etc.
- optionally a second database used for telemetries when use_pgsql_for_telemetries=true (pgsql-telemetries-module); note the prevent_destroy = true lifecycle on this DB.
Admin credentials are randomly generated and exposed only as Terraform outputs
(module.pgsql.db_user, module.pgsql.db_password).
Built-in automated backups
Azure Database for PostgreSQL Flexible Server takes automatic full + differential + WAL backups. Defaults: 7 days retention, locally-redundant. Verify and adjust per environment:
bash
az postgres flexible-server show \
--resource-group $AZURE_RESOURCE_GROUP \
--name <pg_server_name> \
--query "backup"
Recommended for production:
- backupRetentionDays: 30 (max 35).
- geoRedundantBackup: Enabled (requires a Zone-Redundant SKU and must be set at server creation; see Azure docs).
These can be added to the Terraform definition by setting backup_retention_days and
geo_redundant_backup_enabled on azurerm_postgresql_flexible_server.
Logical dumps (recommended in addition)
A logical dump is a database export produced by pg_dump. Unlike the built-in Azure
backup — which is a physical, block-level snapshot tightly coupled to the source server —
a logical dump is a self-contained file that describes the database content (schemas,
tables, data, constraints, indexes) in PostgreSQL's own portable format. It can be moved
freely, inspected, and restored anywhere that runs a compatible PostgreSQL version.
The procedure below is advised for the Management API database, which holds
unique business state (tenants, users, groups, roles, devices, channels, codecs,
firmwares metadata, campaigns, etc.) that cannot be reconstructed from external sources.
For the telemetries database, the same procedure is optional: telemetry data is
high-volume — making logical dumps expensive in time and storage — has a retention
policy that drops old points anyway, and is reproducible from the devices once they
reconnect. The managed Azure Point-in-Time Restore (PITR) is generally enough to cover the telemetries DB; apply
pg_dump on top of it only if a compliance requirement or a specific operational need
calls for it.
Azure's PITR covers accidental data corruption within an otherwise healthy server. Several of its limitations make an independent logical dump a necessary complement:
- Restore target is constrained. PITR only produces a new Azure Database for PostgreSQL Flexible Server in the same Azure subscription as the source. It cannot restore to another subscription, another cloud, an on-prem PostgreSQL, or a significantly different PostgreSQL version. A pg_dump file can be restored anywhere pg_restore runs.
- Retention caps at 35 days. Azure caps the managed backup retention at 35 days. Any compliance, audit or regulatory requirement that mandates longer retention (90 days, 1 year, 7 years, etc.) must be served by an independent logical dump archived to long-term storage.
- Storage co-location with the server. Managed backups live in the same Azure subscription/region as the server, secured with the same RBAC. A subscription compromise, a billing/account suspension, or an accidental subscription deletion removes both the server and its backups simultaneously. Storing logical dumps in a separate subscription (ideally a separate Azure tenant or even an off-Azure archive) is what gives Kamea a true off-site recovery option.
- Granular restore. pg_restore accepts --schema, --table and --data-only flags, which lets you restore a single table (e.g. roll back only the device table after a bad migration) without touching the rest of the database. PITR always restores the full server.
- Cross-environment seeding. A pg_dump file from prod is the standard way to refresh a staging or QA environment with realistic data, possibly after scrubbing personally identifiable information (PII) via pg_dump --exclude-table=... or post-restore SQL. Managed PITR cannot serve this use case: it always provisions a brand-new server in the same subscription as the source, so it can neither write into the pre-existing staging server nor reach a staging subscription/tenant. It also has no mechanism to filter or transform the data on the way out: a PITR-restored server is a bit-for-bit copy of the source, including every PII column, which is unacceptable for a non-production environment.
In practice: rely on PITR for fast in-place recovery (RPO of a few minutes, RTO of a few minutes to an hour) and on logical dumps for anything that PITR cannot do — long-term retention, off-Azure DR, partial restores, environment refresh, schema audit. The two strategies are complementary, not redundant.
Use pg_dump from a host that can reach the PostgreSQL server. Add a CI/CD scheduled
pipeline or run from a jump host:
bash
PGSSLMODE=verify-full \
PGSSLROOTCERT=/path/to/DigiCertGlobalRootCA.pem \
pg_dump \
--host=<pg_server>.postgres.database.azure.com \
--username=<admin_login> \
--format=custom \
--no-owner \
--file=kamea-db-$(date +%Y%m%dT%H%M).dump \
<database_name>
Upload the dump to a dedicated Storage Account (different RG / region for a real DR scenario) with soft delete + versioning + immutability enabled.
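The dump, a basic integrity check and the upload can be chained in one scheduled job. The sketch below is one possible shape, assuming the connection environment (PGPASSWORD or ~/.pgpass, PGSSLMODE, PGSSLROOTCERT) is already set as in the command above and that the runner is logged in to Azure. The function names, storage account and container are placeholders.

```shell
# Sketch only: names are hypothetical; the PostgreSQL connection environment
# is expected to be set as in the pg_dump command above.

# Dated dump file name.
#   $1 = database name, $2 = optional timestamp (defaults to now, UTC)
dump_name() {
  printf 'kamea-%s-%s.dump\n' "$1" "${2:-$(date -u +%Y%m%dT%H%M)}"
}

# Dump, verify and upload.
#   $1 = server FQDN, $2 = admin login, $3 = database,
#   $4 = backup storage account, $5 = container
dump_and_upload() (
  set -eu
  file="$(dump_name "$3")"
  pg_dump --host="$1" --username="$2" --format=custom --no-owner \
    --file="$file" "$3"
  # Listing the table of contents fails on a truncated or corrupt archive.
  pg_restore --list "$file" >/dev/null
  az storage blob upload --auth-mode login \
    --account-name "$4" --container-name "$5" \
    --name "$file" --file "$file"
)
```

The pg_restore --list step only proves that the archive is readable. A periodic test restore into a scratch server remains the only real proof that the dump is usable.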
Restore
- Point-in-Time Restore (PITR) through Azure CLI — preferred for accidental data corruption:
bash
az postgres flexible-server restore \
--resource-group $AZURE_RESOURCE_GROUP \
--name <new_pg_server_name> \
--source-server <existing_pg_server_name> \
--restore-time "2026-05-06T10:00:00Z"
Why PITR creates a new server
Azure Database for PostgreSQL Flexible Server does not support in-place
restore. The --name parameter is mandatory and must be different from the
source server. This is by design:
- PITR works by provisioning fresh storage and replaying transaction logs onto it up to the chosen point in time. The source storage is never touched, which is what makes the operation safe and reversible if you mistarget the timestamp.
- The source server keeps running and accepting writes during the restore. Overwriting it would require an outage, an atomic swap of storage, and a way to roll back if the restored data turns out to be wrong — none of which the managed service offers.
- Keeping both servers side by side lets you compare, validate, or even export only specific tables from the restored copy before committing to the cutover.
The same restriction exists on Azure SQL, AWS RDS, and most managed PostgreSQL offerings, so the workflow below is portable.
After the restore completes, you have two servers and one of two workflows is appropriate depending on whether the corrupted server is salvageable:
1. Cutover to the restored server (fastest, recommended for full corruption). Update the Management API and Function Apps to point to the new server (DB_HOST, DB_USER, DB_PASSWORD, DB_CA_B64: the admin credentials are regenerated by Azure on restore and the FQDN changes with the new server name). Update the Terraform state to reflect the new server: either rename the resource and terraform import the restored one, or update the pgsql_server_name variable so the next plan adopts it. Decommission the old server once you are confident the cutover is good.
2. Selective restore back into the original server (recommended for partial corruption). Use the new server only as a read-only source: pg_dump the specific schemas or tables you need, then pg_restore them into the original server. Delete the temporary restored server. This avoids changing any connection string and keeps Terraform state untouched, but requires that the original server is still healthy enough to receive writes.
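The selective restore can be sketched as follows. This is an illustration under assumptions, not a tested Kamea procedure: the function name is hypothetical, the same admin login is assumed to work on both servers, and credentials are expected in the environment (PGPASSWORD or ~/.pgpass).

```shell
# Sketch only: copy a few tables from the PITR-restored server back into the
# original one.
#   $1 = restored server FQDN (source), $2 = original server FQDN (target),
#   $3 = admin login, $4 = database, remaining arguments = table names
selective_restore() (
  set -eu
  src="$1"; dst="$2"; user="$3"; db="$4"; shift 4
  file="selective-$(date -u +%Y%m%dT%H%M).dump"
  # Turn "device channel" into "--table=device --table=channel".
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" "--table=$1"; shift; n=$((n - 1))
  done
  pg_dump --host="$src" --username="$user" --format=custom --no-owner \
    "$@" --file="$file" "$db"
  # --clean drops and recreates the listed tables; --single-transaction
  # rolls everything back if any step fails (for instance a drop blocked by
  # a foreign key), leaving the original tables untouched.
  pg_restore --host="$dst" --username="$user" --dbname="$db" --no-owner \
    --clean --if-exists --single-transaction "$file"
)
```

Tables referenced by foreign keys from other tables usually cannot be dropped and recreated this way. In that case a data-only restore after a manual cleanup, or a cutover, is the more realistic option.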
Choosing between the two workflows: the decision rests on three independent questions. Answer them in order; the first one that points to "cutover" wins.
| Question | Answer that points to cutover (#1) | Answer that points to selective restore (#2) |
|---|---|---|
| Is the original server still healthy enough to receive writes? | No — the server is unreachable, locked, ran out of storage, or its file system is corrupted. | Yes — the server is responding, only the data is wrong. |
| What is the scope of the corruption? | Server-wide: most schemas/tables affected, or unknown blast radius. | Localized: a known set of tables (e.g. a bad migration on device), the rest verified intact. |
| What is the operational pressure to be back online? | High RTO pressure — every minute of downtime costs. Cutover is one DNS/config change away from "done". | RTO is relaxed enough to run pg_dump/pg_restore on a subset and validate it before pushing back. |
Rules of thumb:
- Full corruption, server unreachable, or unknown blast radius → cutover (#1). You cannot make a partial restore safer than the source, and trying to surgically patch a broken server wastes the time you'd spend cutting over cleanly. The cost is a one-time reconfiguration of DB_HOST/credentials in every consumer and a short Terraform state update.
- Known, localized corruption on an otherwise healthy server → selective restore (#2). The blast radius is small (one or a few tables), the original server still accepts writes, and you want to avoid touching connection strings, Terraform state, RBAC role assignments and IP allowlists, all of which point at the existing server today. Restoring a few tables in place is the lower-risk operation in this case.
- You are not sure which one applies. Default to cutover (#1). A "partial" restore that misses a corrupted table is worse than a full cutover that rebuilds connection strings. Cutover has a predictable execution plan; selective restore depends on diagnosis quality.
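The decision table and the rules of thumb above can be condensed into a small helper, for instance to embed in a runbook script. The function name and the accepted answers are hypothetical; any answer that is not an explicit vote for selective restore falls back to cutover, mirroring the "default to cutover" rule.

```shell
# Hypothetical helper encoding the decision table above.
#   $1 = original server healthy?  yes | no
#   $2 = corruption scope          localized | server-wide | unknown
#   $3 = RTO pressure              relaxed | high
choose_restore_workflow() {
  # Questions are evaluated in order; the first "cutover" answer wins.
  if [ "$1" != yes ]; then echo cutover; return; fi
  if [ "$2" != localized ]; then echo cutover; return; fi
  if [ "$3" != relaxed ]; then echo cutover; return; fi
  echo selective
}

# Example:
#   choose_restore_workflow yes localized relaxed   # prints "selective"
#   choose_restore_workflow yes unknown relaxed     # prints "cutover"
```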
Two combinations deserve a specific note:
- High RTO pressure + small known corruption. Cutover is still faster than a selective restore in absolute terms, but you may not want to push staging-like config changes under pressure. If the affected tables are clearly identified, running a targeted pg_restore against the original server is a smaller change-window operation. Pick selective restore only if you trust the diagnosis; otherwise cut over.
- Original server healthy + corruption discovered late. PITR retains backups up to backupRetentionDays (35 max). If the corruption is older than the retention window, neither workflow above will help: you fall back to a pg_restore from a pg_dump archive (see the "Logical dumps" section), which is independent of PITR and not bound by the 35-day cap.
In both cases, regenerate the App Service DB_HOST/DB_USER/DB_PASSWORD app
settings if the restored server's credentials are used.
- From pg_dump (full disaster recovery):
bash
pg_restore --host=<new_pg_server>.postgres.database.azure.com \
--username=<admin_login> \
--dbname=<database_name> \
--no-owner \
kamea-db-<timestamp>.dump
After restore, run the telemetry table init script only if telemetry partitions were not part of the dump.
4. Telemetries databases
Kamea supports three telemetry backends, which can be enabled independently or in
combination through the USE_PGSQL_FOR_TELEMETRIES, USE_REDIS_FOR_TELEMETRIES and
InfluxDB connection variables — see the platform setup guide and
the PGSQL telemetries setup guide. Apply
the backup procedure of every backend that is actually deployed.
4.a InfluxDB Cloud
External SaaS, not deployed by Terraform — only the URL/org/token are passed to the
Influx Functions defined in the influxdb-module. The data is hosted on InfluxData's
infrastructure, with their durability SLA.
This guide does not include a customer-side backup procedure for InfluxDB Cloud
For the default Kamea stack, a periodic data backup of InfluxDB Cloud is not recommended and is therefore not documented here. The reasoning:
- InfluxData owns the durability. The service runs with internal replication and is contractually responsible for the data it stores. A customer-side dump does not protect against the failure modes the SLA already covers (disk loss, region failure, node failure).
- The failure modes a customer dump would protect against are narrow: InfluxData account compromise, billing suspension, accidental org deletion by a holder of an all-access token, regulatory mandate that the data be held by the customer, or migration to another provider. None of these apply to a typical Kamea deployment.
- The cost is real. influx backup / influx restore are OSS-only and reject Cloud hosts (Error: InfluxDB OSS-only command used with InfluxDB Cloud host), so the only available path is influx query → annotated CSV → influx write. Every query and every write is metered (query units, egress, write units), and the restore throughput is capped by Cloud rate limits. On a real fleet, a daily full dump quickly becomes a serious line item, and a restore takes hours to days.
- Telemetry is reproducible. Devices keep producing data, retention policies already drop old points on a schedule, and the business state (tenants, users, devices, channels, codecs, firmwares, campaigns) lives in the Management API PostgreSQL DB, which is backed up (section 3).
- The InfluxDB connection variables (URL, org, token) are already covered by the GitLab CI/CD variables backup (section 2). That is enough to point a fresh org back at Kamea without any action on the platform.
If a specific compliance or contractual requirement obliges you to hold a customer-side copy of the telemetry data, that is a project-level decision — refer to the InfluxData documentation for the export/restore commands available on your Cloud edition. This guide deliberately leaves it out to avoid pushing every customer into a costly procedure that, in the common case, provides no additional protection over what InfluxData already guarantees.
4.b PostgreSQL telemetries
If USE_PGSQL_FOR_TELEMETRIES=true, telemetries live in a dedicated database on the
PostgreSQL flexible server (or on a separate PG server, depending on the deployment
choice — see this page).
They are covered by the PostgreSQL backup procedure described in section 3. Two specifics
apply:
- The pgsql-telemetries-module declares the database with prevent_destroy = true, so a Terraform run cannot accidentally drop it.
- The schema initialization script shipped with the platform creates monthly partitions up to 2030. Remember to re-apply or extend it when restoring into an empty DB; logical dumps include the partitions, so this is only relevant for a fresh DB.
5. Redis (Container App) — required, stores device provisioning data
Critical, always-on store
Redis is not optional in Kamea. It is deployed by the device-connectivity-module
on every environment and is consumed by the Management API, the WebSocket Server and
the ingestion Azure Functions (see the flux matrix in
security/flux-matrix-azure.md). It holds two
classes of data on the same instance:
- Device provisioning data — written by the Management API when a device is provisioned and read by the ingestion Azure Functions to authorize and decode incoming traffic (which channel the device uses, its codec, its hashed secret, its tenant, etc.). This data has no expiration and is business-critical: losing it means every device is rejected by the ingestion chain until each one is re-provisioned through the API. The same information is also kept in the Management API PostgreSQL database, so a full restore is possible from there, but it takes time and the platform is degraded in the meantime.
- Last value per key for telemetries — only when
USE_REDIS_FOR_TELEMETRIES=true. This is derived data that any device push will refresh.
The same Redis instance hosts both concerns, on the same Azure File share
(redis-persistence), and the same backup/restore procedure applies. The retention
profile is dictated by the more critical of the two roles, i.e. the provisioning
data.
Redis is deployed as an Azure Container App with persistence on an Azure File share, and
saves snapshots through Redis' RDB mechanism in /usr/local/etc/redis/backup.
Backup: enable Azure Backup (Recovery Services Vault) on the redis-persistence
storage share, with daily snapshots and a retention window of at least 30 days.
Because provisioning data does not expire, treat this share with the same retention as
the Management API PostgreSQL dumps — it is part of the platform's permanent state, not
a cache. Increase retention further if regulatory requirements apply.
Restore: stop the Redis container app, restore the file share snapshot, and restart the container app — it will reload the latest RDB file on boot. Verify that devices can authenticate and that telemetries are accepted by the ingestion functions before declaring the restore successful.
6. IoT Hub (device registry)
Deployed by the iothub-module. The hub stores: device identities, their authentication
material (symmetric SAS keys, self-signed X.509 thumbprints, or certificateAuthority
references), device twins (reported + desired state), and module identities. Kamea
supports all three IoT Hub authentication modes — symmetric key, self-signed certificate
and CA-signed certificate — see
this page.
The Kamea Management API also keeps a copy of the device list in PostgreSQL, but the
IoT Hub credentials/twins are the source of truth for the connectivity layer.
Backup: IoT Hub does not have a managed backup; export the registry and twins periodically.
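```bash
# Identities + auth (devices.json)
# --include-keys is what makes the SAS keys appear in the export (see the table below)
az iot hub device-identity export \
--hub-name <iothub_name> \
--blob-container-uri "<sas_url_to_container>" \
--include-keys

# Twin export
az iot hub device-twin list \
--hub-name <iothub_name> \
> iothub-twins-$(date +%Y%m%d).json
```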
What the export captures, per authentication mode:
| Auth mode | Captured by device-identity export | Notes |
|---|---|---|
| Symmetric key (SAS) | Yes — full device record + keys | The --include-keys flag is what makes the primary/secondary symmetric keys appear in the dump. Without it, those fields are blanked out and the export is unusable for SAS devices. |
| Self-signed X.509 | Yes — full device record + thumbprints | The primaryThumbprint / secondaryThumbprint fields are part of the device record itself and are emitted regardless of --include-keys. |
| CA-signed X.509 | Partial — device record only | Each device record only references type=certificateAuthority; there is no per-device material. The trust anchor lives outside the registry — see below. |
For CA-signed devices, the root and intermediate CA certificates uploaded to the IoT
Hub / DPS are not part of device-identity export. Without them the imported devices
cannot connect even if the registry is fully restored. The CA material is what the
prove_certificate_possession.sh script in the iothub-module proves ownership of at
deploy time, and it must be backed up independently:
- The PEM cert and private key used as the trust root (the customer's PKI artefact — typically held off-platform in a key vault or HSM, never committed to Git outside of the test fixtures shipped with Kamea).
- The list of certificates uploaded to the hub and DPS:
bash
az iot hub certificate list --hub-name <iothub_name> > iothub-certs-$(date +%Y%m%d).json
az iot dps certificate list --dps-name <dps_name> > dps-certs-$(date +%Y%m%d).json
The registry export blob can go in a new container next to the dedicated device-uploaded-files one, or in a separate vault account. Generate a write-only SAS URL with a limited TTL for it.
For DPS (use_dps=true), also export enrollment groups and individual enrollments:
bash
az iot dps enrollment-group list --dps-name <dps_name> > dps-enrollment-groups-$(date +%Y%m%d).json
az iot dps enrollment list --dps-name <dps_name> > dps-enrollments-$(date +%Y%m%d).json
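Because these exports rely on shell redirections, a failed or unauthenticated az call can leave an empty file behind without stopping the job. A small guard such as the one below can catch that; the function name is hypothetical and the check is only a sketch of the idea.

```shell
# Hypothetical check: fail when any export file is missing or empty.
#   arguments = export files to verify
require_nonempty() (
  rc=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      printf 'missing or empty export: %s\n' "$f" >&2
      rc=1
    fi
  done
  [ "$rc" -eq 0 ]
)

# Example, after the export commands above:
#   require_nonempty iothub-twins-*.json iothub-certs-*.json dps-*.json
```

A non-empty file is a weak guarantee. Where jq is available, parsing each file as JSON gives a stronger check.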
Restore:
bash
az iot hub device-identity import \
--hub-name <new_iothub_name> \
--input-blob-container-uri "<sas_url_to_input>" \
--output-blob-container-uri "<sas_url_to_output>"
Then:
- Re-apply twins with az iot hub device-twin update --hub-name ... --device-id ... --set properties.desired=....
- For CA-signed devices, re-upload each CA certificate with az iot hub certificate create / az iot dps certificate create, and replay the proof-of-possession (the same workflow as prove_certificate_possession.sh in the iothub-module: read the verification code with az iot dps certificate generate-verification-code, sign it with the CA private key, and submit it via az iot dps certificate verify). Until the CA is verified, devices in certificateAuthority mode will be rejected at the TLS handshake.
- Re-import DPS enrollments via az iot dps enrollment[-group] create.
Warning
The Management API stores the IoT Hub device name (system metadata
iotHubDeviceName) when explicitly provided, or relies on the Kamea device UUID
otherwise. After a restore, ensure that the device IDs in IoT Hub match what
PostgreSQL expects, otherwise telemetries will be ignored.
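One way to run that check is to compare two plain lists of device IDs, one per line. The sketch below only does the comparison; the function name is hypothetical, and how the PostgreSQL list is produced depends on the Management API schema, which this guide does not describe.

```shell
# Print the IDs present in the first list but absent from the second.
#   $1 = file with the device IDs PostgreSQL expects, one per line
#   $2 = file with the device IDs found in IoT Hub, one per line
missing_in_hub() (
  a="$(mktemp)"; b="$(mktemp)"
  trap 'rm -f "$a" "$b"' EXIT
  # comm needs both inputs sorted the same way.
  sort -u "$1" > "$a"
  sort -u "$2" > "$b"
  comm -23 "$a" "$b"
)

# The hub list could come from, for example:
#   az iot hub device-identity list --hub-name <iothub_name> \
#     --query "[].deviceId" --output tsv > hub-ids.txt
# (the CLI caps the number of results by default; see its --top option)
```

An empty output means every device PostgreSQL expects exists in the hub. Swapping the two arguments lists devices that exist in the hub but are unknown to the Management API.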
7. Storage Accounts
Several storage accounts are created. Their backup needs differ:
| Storage account | Purpose | Contains data? | Action |
|---|---|---|---|
| <project>apistorage | API blob storage: tenant logos, firmwares, device-uploaded files | Yes, critical | Enable blob soft delete, versioning, point-in-time restore, Azure Backup |
| <project>functionstorage, <project>influxfnstorage, <project>redisfnstorage, <project>pgsqlfnstorage | Azure Functions runtime state and code packages (ingestion, InfluxDB, Redis and PostgreSQL functions). No business data: functionstorage used to hold device provisioning records, but those have been moved to Redis (see section 5). | No business data | Re-deploy from CI/CD; soft delete is sufficient |
| <project>caddyfrontend | Front-end SPA artifacts mounted in Caddy | No | Re-deploy from CI/CD |
| <project>storacfrontend / <project>settingsfrontend | Static-website-hosted SPAs | No | Re-deploy from CI/CD |
| <project>redispersistence | Redis RDB snapshot: holds device provisioning data (see section 5), optionally last-value telemetries | Yes, critical | Azure Backup (Files), daily snapshots, 30+ days retention |
| <project>rmqcertificates | RabbitMQ TLS cert, trust store, persistence | Yes, critical | Azure Backup (Files), 30-day retention |
Configuring blob/file backup
For blob containers on <project>apistorage (the only storage account in the table
that holds business data in blobs):
bash
az storage account blob-service-properties update \
--account-name <name> \
--resource-group $AZURE_RESOURCE_GROUP \
--enable-delete-retention true \
--delete-retention-days 30 \
--enable-container-delete-retention true \
--container-delete-retention-days 30 \
--enable-versioning true \
--enable-change-feed true \
--enable-restore-policy true \
--restore-days 29
These options can also be added directly to the azurerm_storage_account blocks in
Terraform:
hcl
blob_properties {
delete_retention_policy { days = 30 }
container_delete_retention_policy { days = 30 }
versioning_enabled = true
change_feed_enabled = true
restore_policy { days = 29 }
}
For full-scale backup, register the storage accounts with a Recovery Services Vault / Backup Vault (operational + vaulted tiers depending on your RPO/RTO targets).
8. RabbitMQ persistence (only when use_mqtt=true)
The rabbitmq-module creates the storage account <project>rmqcertificates with three
Azure File shares: TLS certificates, trust store and (in the AKS module) persistence. The
TLS cert is critical — a lost cert means re-issuing one and updating every device. Enable
Azure Files backup as for Redis persistence.
The RabbitMQ login/password/auth secret are random_password resources — rotation
requires updating both the Kubernetes Key Vault secret and the API/AKS App Service
settings; see the aks-module.
9. Application Insights / Log Analytics workspace
Defined in the app-insights-module. Retention is set via workspace_retention_in_days.
Logs are not business data but are required for troubleshooting historical incidents. To
preserve logs beyond the retention window, configure a continuous export to a Storage
Account ("Diagnostic settings -> Send to storage account") on the workspace.
10. Key Vault (AKS path only)
Declared in the aks-module. Key Vault is created with purge_protection_enabled = true
and soft_delete_retention_days = 30, so deletion is reversible for 30 days: a
soft-deleted vault or secret can be recovered via az keyvault recover /
az keyvault secret recover within that window. This is the recovery path Kamea relies
on.
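The in-window recovery can be scripted roughly as below. This is a sketch under assumptions: the function name is hypothetical, and the name field used in the queries should be checked against the output of your Azure CLI version before relying on it.

```shell
# Sketch only: recover a soft-deleted vault, then any soft-deleted secrets.
#   $1 = vault name
recover_vault() (
  set -eu
  # Recover the vault itself only if it shows up among the soft-deleted ones.
  if az keyvault list-deleted --resource-type vault \
       --query "[?name=='$1'].name" --output tsv | grep -qx -- "$1"; then
    az keyvault recover --name "$1"
  fi
  # Then recover each secret that was deleted individually.
  az keyvault secret list-deleted --vault-name "$1" \
    --query "[].name" --output tsv |
  while IFS= read -r secret; do
    az keyvault secret recover --vault-name "$1" --name "$secret"
  done
)
```

Running it requires a role that allows recovering deleted vaults and secrets, which is often narrower than the roles the CI/CD service principal holds.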
This guide deliberately omits an active Key Vault backup procedure
Key Vault is the platform's secret store: it holds the RabbitMQ login/password,
the Service Bus connection string, the Redis writer password, the RabbitMQ TLS
private key, and any other secret the AKS workloads consume through the CSI
driver (see the SecretProviderClass declared in the aks-module). The value of
Key Vault as a security boundary comes entirely from the fact that those secrets
live in exactly one Microsoft-managed store, with RBAC, private-endpoint
access, audit logging, and HSM-backed protection on premium SKUs.
Producing periodic copies of every secret with az keyvault secret backup —
however convenient for cross-region DR — runs against that property:
- The output of az keyvault secret backup is a wrapped blob, but once written to a regular Storage Account it is exposed to a wider blast radius: anyone with read access to the backup container (a CI runner token, a misconfigured RBAC role, a developer with debug rights) becomes a candidate path to the secret material. The Storage Account access controls are typically less strict than the vault's, defeating the purpose.
- The wrapped backup can only be restored into a vault in the same Azure geography and subscription, so the cross-region DR benefit is limited to begin with.
- Every secret in this vault is either (a) generated by Terraform as a random_password and re-derivable from the Terraform state (already backed up in section 1), or (b) the RabbitMQ TLS certificate, which is also stored on the <project>rmqcertificates storage account file share (section 8) and whose root authority is held off-platform.
The recovery path is therefore:
- Within 30 days of a deletion, recover the vault and its secrets in place with az keyvault recover / az keyvault secret recover.
- Outside that window (i.e. true disaster), re-run the deployment pipeline. Terraform regenerates the random secrets, pushes them back into the (recreated) vault, and updates the consumers. The TLS material is reissued from the customer's CA as part of the same pipeline.
If a specific compliance requirement obliges you to hold an out-of-vault copy of
the secrets, treat the secondary store with the same controls as the vault
itself: a separate, restricted, audited Key Vault in another subscription, with
purge_protection_enabled = true, no inheritable RBAC, and no service principal
that the rest of the platform can use. Do not drop the wrapped blobs on a
Storage Account.
Recommended schedule
| Asset | Frequency | Retention | Tooling |
|---|---|---|---|
| Terraform state | After each run | 30 versions | GitLab + offline copy |
| GitLab CI variables | Weekly | 12 weeks | GitLab API + secrets manager |
| PostgreSQL automated backup | Built-in | 30 days | Azure Database for PostgreSQL |
| PostgreSQL logical dump | Daily | 90 days | pg_dump from CI/CD |
| PG telemetries logical dump (if PG telemetries enabled) | Daily | 90 days | pg_dump from CI/CD |
| IoT Hub registry export | Daily | 30 days | az iot hub device-identity export |
| IoT Hub twins export | Weekly | 30 days | az iot hub device-twin list |
| DPS enrollment(-group) export | After change | 12 versions | az iot dps enrollment list |
| Storage apistorage (firmwares, logos, files) | Continuous (versioning + soft delete) | 30 days | Azure Backup / Storage policies |
| Redis persistence file share (provisioning data — critical) | Daily | 30+ days | Azure Backup (Files) |
| RabbitMQ certificates / persistence | Daily | 30 days | Azure Backup (Files) |
| Log Analytics export to Storage | Continuous | 1+ year | Workspace diagnostic settings |
Adjust frequencies and retention to match the customer's RPO/RTO and regulatory constraints.
Disaster recovery: full restore runbook
Order matters because of resource dependencies (cf.
security/flux-matrix-azure.md).
1. Re-create the Azure Resource Group (or use the existing one if only data was lost).
2. Restore the Terraform state from the offline copy into the GitLab HTTP backend.
3. Restore GitLab CI variables from the latest export.
4. Re-deploy the infrastructure with the existing pipeline. The two-step Terraform sequence described in platform-setup.md (targeted apply, then full apply) still applies.
5. Restore the PostgreSQL server:
   - Either PITR a new flexible server (and update Terraform/state to import it), or re-deploy a new server and pg_restore the latest dump.
   - Update App Service settings if credentials changed (DB_HOST, DB_USER, DB_PASSWORD, DB_CA_B64).
6. Restore PostgreSQL telemetries if USE_PGSQL_FOR_TELEMETRIES=true: covered by step 5 if it lives on the same server, otherwise PITR/restore the dedicated server. InfluxDB Cloud needs no action here: data is held by InfluxData and the org/token live in the CI/CD variables restored in step 3 (see section 4.a for the rationale).
7. Restore the Redis persistence file share. This step is required: the share holds device provisioning data. Without it, the ingestion functions will reject every device until each one is re-provisioned.
8. Restore the API storage account (apistorage):
   - Use blob point-in-time restore, or restore from Recovery Services Vault.
   - Verify the logos container, firmware blobs and device-uploaded-files.
9. Restore IoT Hub identities + twins, then the DPS enrollments.
10. Restore RabbitMQ certificates and persistence (only if MQTT is used).
11. Re-deploy applications by re-running the GitLab pipeline (it pushes the docker images into the App Services and the function packages into the Function Apps).
12. Smoke test:
   - GET /health on Caddy, the management API, the WSS, and each Azure Function.
   - POST /init is not required again: the database already contains the root tenant and admin account.
   - Provision a test device, send a telemetry, check that it appears in each enabled telemetry store and on the WebSocket dashboard.
13. Re-issue rotated secrets if any were exposed during the recovery (DB admin, RabbitMQ login, Function x-api-key, GitLab tokens, IDP client secret).
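The GET /health part of the smoke test lends itself to a small script. The sketch below is illustrative only: check_health is a hypothetical helper, the base URLs are supplied by the caller, and it assumes that a healthy service answers /health with HTTP 200, which should be checked against the actual services.

```shell
# Probe GET /health on each base URL passed as an argument.
# Returns non-zero if any endpoint does not answer with HTTP 200.
check_health() {
  failed=0
  for base in "$@"; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base/health") || code=000
    if [ "$code" = "200" ]; then
      echo "OK   $base"
    else
      echo "FAIL $base (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `check_health "https://api.example.com" "https://wss.example.com"` (placeholder hostnames) prints one line per endpoint and can gate the rest of the runbook on its exit status.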
Things you do not need to back up
The following are recreated from scratch on every Terraform run or on every Docker deploy and therefore do not require backup procedures:
- App Service Plans, App Services (management API, WSS, Caddy, onboarding API).
- Function Apps and their code packages (rebuilt from CI/CD).
- Service Bus namespace, topics, subscriptions, rules — recreated by Terraform; in-flight messages are accepted as transient loss.
- Azure Container App for Redis (the runtime; persistence is covered separately).
- AKS cluster, NSGs, VNet, subnets, private DNS zones.
- Application Insights / Log Analytics workspace (configuration only — see export note above for the historical logs).
- Azure Maps account.
- Front-end storage accounts when serve_frontend_from_storage_account=true (artifacts are pushed by CI/CD on every release).
If a stateless resource is destroyed, run the deployment pipeline; everything is rebuilt within a few minutes.
Validating backups
Backups that have never been restored cannot be trusted. Schedule a quarterly DR drill:
- Pick a non-production environment.
- Restore the latest PostgreSQL dump, the latest IoT Hub export, and one full export of each enabled telemetry backend into a fresh RG.
- Re-deploy Kamea on top of it from the same Git revision.
- Run the smoke test above.
- Document the elapsed time — that is your real RTO.
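To make the elapsed time easy to record, each drill step can be wrapped in a small timer. This is a minimal sketch: time_step is a hypothetical helper and the commands it wraps (restore scripts, pipeline triggers) are whatever the drill actually runs.

```shell
# Run a command and report how long it took, in whole seconds.
# $1 is a label for the report; the remaining arguments are the command.
time_step() {
  label="$1"; shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$label: $((end - start))s (exit $status)"
  # Propagate the wrapped command's exit status to the caller.
  return "$status"
}
```

For example, `time_step "postgres restore" ./restore-postgres.sh` (a placeholder script name) prints the duration of that step; summing the reported durations across the drill gives the measured RTO.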