# Three-Layer Deployment Architecture
## Context and Problem Statement
The CIMPL deployment includes Azure infrastructure (AKS cluster, networking, RBAC), platform middleware (Elasticsearch, PostgreSQL, RabbitMQ, Istio config), and OSDU application services (Partition, Entitlements, etc.). These have fundamentally different change frequencies and blast radii. A Helm chart upgrade should not risk the AKS cluster, and an application service rollout should not trigger re-evaluation of platform middleware. We need a deployment architecture that isolates these concerns while maintaining a single `azd up` experience.
The reference implementation (cimpl-azd ADR-0006) uses a two-layer split (infra and software/stack). This project extends that to three explicit layers to provide finer-grained lifecycle control as the platform matures.
## Decision Drivers
- Cluster infrastructure changes are infrequent and high-risk
- Platform middleware changes are moderate frequency and medium-risk
- Application service changes are frequent and lower-risk
- `azd up` must orchestrate all layers in the correct order via lifecycle hooks
- Cross-layer values (cluster name, OIDC issuer URL, resource group) must flow between layers via environment variables
- Blast radius of a `terraform destroy` must be containable per layer
- Teardown must proceed in reverse order (software → platform → infra)
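The cross-layer value flow named in the drivers above can be sketched as a post-provision step: the infra layer's Terraform outputs are exported into the azd environment, and the next layer reads them as `TF_VAR_` inputs. Output and variable names below are illustrative assumptions, not the project's actual identifiers:

```shell
# Post-provision sketch (hypothetical output names): export infra-layer
# Terraform outputs into the azd environment so later layers can consume
# them as TF_VAR_* inputs. Run from the repo root.
set -euo pipefail

# Read outputs from the infra layer's state
AKS_NAME=$(terraform -chdir=infra output -raw aks_cluster_name)
OIDC_URL=$(terraform -chdir=infra output -raw oidc_issuer_url)
RG_NAME=$(terraform -chdir=infra output -raw resource_group_name)

# Persist them in the azd environment for subsequent hooks
azd env set AKS_CLUSTER_NAME "$AKS_NAME"
azd env set OIDC_ISSUER_URL "$OIDC_URL"
azd env set RESOURCE_GROUP "$RG_NAME"

# A later layer consumes them as Terraform input variables
export TF_VAR_aks_cluster_name="$AKS_NAME"
terraform -chdir=software/foundation apply -auto-approve
```

This is the explicit hand-off the drivers call for: each layer's inputs are plain environment variables, so no layer needs read access to another layer's state file.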
## Considered Options
- Three-layer architecture (infra / platform / software)
- Two-layer architecture (infra / stack)
- Single monolithic Terraform state
- Terraform workspaces
## Decision Outcome
Chosen option: "Three-layer architecture", because it provides the finest-grained lifecycle isolation, aligns with the natural boundaries between Azure resources, Kubernetes middleware, and application services, and enables independent deploy/rollback per layer.
| Layer | Path | Manages | Triggered by |
|---|---|---|---|
| 1. Infrastructure | `infra/` | Resource group, AKS cluster, networking, managed identities | `azd provision` |
| 1a. Access | `infra-access/` | Privileged RBAC, policy exemptions (see ADR-0020) | `scripts/bootstrap-access.ps1` |
| 2. Foundation | `software/foundation/` | cert-manager, CloudNativePG, Elasticsearch, ExternalDNS, Gateway | post-provision hook |
| 3. Stack | `software/stack/` | OSDU services, Airflow, Keycloak, RabbitMQ, Redis, MinIO, Istio routing | `azd deploy` (pre-deploy hook) |
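The "Triggered by" column maps onto azd lifecycle hooks. A minimal `azure.yaml` sketch of that wiring follows; the hook script paths are hypothetical, not the project's actual files:

```yaml
# Illustrative azure.yaml hook wiring (script paths are assumptions).
# `azd provision` applies infra/ directly; hooks chain the other layers.
name: cimpl
infra:
  provider: terraform
  path: infra
hooks:
  postprovision:          # Layer 2: foundation, after infra is up
    shell: sh
    run: ./scripts/apply-foundation.sh
  predeploy:              # Layer 3: stack, before service deployment
    shell: sh
    run: ./scripts/apply-stack.sh
  predown:                # Teardown in reverse: stack, then foundation
    shell: sh
    run: ./scripts/destroy-software.sh
```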
## Consequences
- Good, because each layer can be independently planned, applied, and destroyed
- Good, because `terraform apply` in `software/` cannot accidentally modify platform middleware or the AKS cluster
- Good, because platform middleware changes (e.g., an Elasticsearch upgrade) don't trigger re-evaluation of all OSDU services
- Good, because teardown hooks can cleanly reverse the order: software → platform → infra
- Good, because it aligns with the azd lifecycle: `azd provision` (infra), post-provision hook (foundation), `azd deploy` via pre-deploy hook (stack)
- Bad, because cross-layer values must be explicitly passed via environment variables; there are no direct Terraform state references between layers
- Bad, because multiple state files to manage and reason about
- Bad, because debugging requires understanding which layer owns each resource
- Bad, because adds orchestration complexity in lifecycle hook scripts compared to a two-layer model
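The reverse-order teardown listed in the consequences could be implemented as a pre-down hook script along these lines (a sketch, assuming the layer paths from the decision table):

```shell
# Pre-down sketch (paths assumed from the decision table): destroy layers
# in reverse order so workloads release cloud resources before the
# cluster they run on disappears.
set -euo pipefail

# Layer 3 first: OSDU services and supporting stack
terraform -chdir=software/stack destroy -auto-approve

# Then layer 2: platform middleware
terraform -chdir=software/foundation destroy -auto-approve

# Layer 1 (infra/) is torn down by `azd down` itself after this hook.
```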
## Pros and Cons of the Options
### Three-Layer Architecture (infra / platform / software)
Separate Terraform roots for Azure infrastructure, Kubernetes middleware, and application services.
- Good, because finest-grained blast radius isolation
- Good, because application deploys are fast — only evaluating OSDU service Helm releases
- Good, because platform middleware can be upgraded independently of both infra and applications
- Good, because maps cleanly to team responsibilities (infra team, platform team, app team)
- Neutral, because adds one more layer than the reference implementation
- Bad, because three state files and three sets of cross-layer variable passing
- Bad, because lifecycle hook scripts must coordinate three layers
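The cross-layer variable passing counted against this option might look like the following in a consuming layer; the file path and variable names are assumptions for illustration:

```hcl
# software/foundation/variables.tf (illustrative): cross-layer inputs are
# declared explicitly instead of being read from another layer's state.
variable "aks_cluster_name" {
  type        = string
  description = "Set via TF_VAR_aks_cluster_name from an infra-layer output"
}

variable "oidc_issuer_url" {
  type        = string
  description = "Workload identity federation issuer from the AKS cluster"
}
```

The cost is boilerplate per shared value; the benefit is that each layer's external surface is visible in one file rather than implicit in remote-state lookups.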
### Two-Layer Architecture (infra / stack)
Separate Terraform roots for Azure infrastructure and all Kubernetes resources (middleware + services combined), as used by the reference implementation (cimpl-azd ADR-0006).
- Good, because simpler orchestration — only two layers to coordinate
- Good, because proven in reference implementation
- Good, because middleware and services share `depends_on` relationships naturally
- Bad, because a Helm chart upgrade to Elasticsearch triggers re-evaluation of all OSDU services
- Bad, because blast radius includes both middleware and application services
- Bad, because conflates different change frequencies into one state
### Single Monolithic State
One Terraform root managing everything from resource group to application services.
- Good, because simplest to understand — everything in one place
- Good, because direct resource references (no cross-layer variable passing)
- Bad, because `terraform plan` evaluates every resource on every change (slow)
- Bad, because a single `terraform destroy` removes everything with no granularity
- Bad, because it has the highest blast radius: any change can affect any resource
### Terraform Workspaces
Single Terraform configuration with workspace-based isolation.
- Good, because workspace switching is simpler than directory-based separation
- Bad, because workspaces are designed for environment isolation (dev/staging/prod), not layer isolation
- Bad, because all resources still share one configuration — no blast radius reduction
- Bad, because not supported by azd's Terraform integration
## More Information
- Reference implementation: cimpl-azd ADR-0006 (two-layer variant)
- Azure Developer CLI lifecycle hooks
- Related: ADR-0001 (Terraform + azd choice enables lifecycle hook orchestration)