# Architecture Decision Record: AI-Resilient DevSecOps & Infrastructure

**Date:** April 16, 2026

**Status:** Approved & Finalized

**Author:** [Your Name]

**Environment:** SUSE Harvester (RKE2) Homelab, Argo Workflows/CD, AI-Assisted Development

## 1. Context & Threat Model

This architecture governs a centralized template strategy deployed across multiple personal repositories. The infrastructure is hosted on a personal SUSE Harvester cluster and exposed to the public internet.

The introduction of **AI coding agents** fundamentally alters the standard solo-developer threat model. AI agents act as hyper-productive "junior developers" that write code, manage dependencies, and push directly to branches, but they lack contextual awareness. They are prone to hallucinating insecure application logic, pulling in typosquatted dependencies, and generating dummy credentials.

Furthermore, for a solo developer managing a public-facing homelab, **alert fatigue is a fatal risk**. Security controls must prioritize high-signal, low-noise alerting and automated enforcement.

Our goals are to:

* Prevent supply chain attacks (malicious packages).
* Catch misconfigurations before they deploy to the homelab.
* Eliminate alert fatigue and false positives so development velocity remains high.
* Maintain a robust security posture without spending hours managing DevSecOps infrastructure.
* Isolate the AI agent's blast radius without hindering its autonomous productivity.

To achieve this, the architecture uses "Defense in Depth," split across four distinct boundaries: Local Tooling, Pipeline Gatekeepers, Infrastructure Delivery, and Runtime Security.

---

## 2. Part 1: Local Development & Repository Tooling

### 2.1 Secret Scanning: Gitleaks (Local)

* **What it does:** Fast, static regex matching for secrets.
* **Where it runs:** Local developer machine (via a pre-commit hook).
* **Detailed Rationale:** Developers make human errors. Gitleaks runs in milliseconds and acts as a "spell-check for secrets." It stops an accidentally staged `.env` file or hardcoded token before it ever enters the local Git history.
* **Trade-offs:** It relies on the pre-commit hook actually running. If the hook is skipped (`git commit --no-verify`), the local check is bypassed.

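
The "spell-check for secrets" idea can be sketched in a few lines. This is only an illustration of static regex matching, not Gitleaks' implementation: the two patterns and the `find_secrets` helper are assumptions chosen for the example, and the real tool ships a far larger, tuned rule set.

```python
import re

# Illustrative patterns only; these are NOT Gitleaks' actual rules.
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, rule))
    return hits


# AWS's documented example key ID, used here as harmless sample input.
staged = "DEBUG=true\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n"
print(find_secrets(staged))
```

A pre-commit hook would run a check like this against staged content and exit non-zero on any hit.
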
### 2.2 Supply Chain Defense: Socket CLI (Local Wrapper)

* **What it does:** Intercepts package installation to check for malicious code, typosquatting, and hijacked packages.
* **Where it runs:** Local machine (aliased: `alias pnpm="socket pnpm"`).
* **Detailed Rationale:** Standard CVE scanners only look for known, accidental vulnerabilities. Socket looks for malice. Because the alias is set locally, both the human developer and local AI coding agents must verify a package via Socket's API before its install scripts execute on the host machine.
* **Trade-offs:** Adds slight latency to local `pnpm install` commands, as each one must query the external Socket API.

## 3. Part 2: Source Control & Pipeline Gatekeepers (Shift-Left)

### 3.1 Pipeline Tamper Protection: `CODEOWNERS` Branch Protection

* **What it does:** Locks specific file paths (e.g., `.github/workflows/*`, `.argo/*`) behind mandatory human PR review, while still allowing autonomous AI merges on application code.
* **Detailed Rationale:** The pipeline is the absolute gatekeeper. If an AI agent has branch-push access, the most critical vulnerability is the agent modifying or deleting the vulnerability scanners themselves to force a failing build to pass. Implementing `CODEOWNERS` strictly enforces the boundary between the "auditor" (the pipeline) and the "developer" (the AI).
* **Trade-offs:** Introduces slight friction. If an AI agent legitimately needs to update a pipeline configuration to support a new app feature, a human must review and approve the PR, which slows end-to-end autonomous deployment.

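
The auditor/developer boundary comes down to one decision per PR: does the change touch a protected path? Below is a minimal sketch of that decision, assuming the two example globs named above. `PROTECTED_GLOBS` and `needs_human_review` are hypothetical names, and `fnmatchcase` is a stand-in whose `*` also crosses directory separators, so it is slightly stricter than real `CODEOWNERS` matching.

```python
from fnmatch import fnmatchcase

# Hypothetical list mirroring the example paths in this ADR.
PROTECTED_GLOBS = [".github/workflows/*", ".argo/*"]


def needs_human_review(changed_files: list[str]) -> bool:
    """True if any changed path falls under a protected glob.

    Note: fnmatchcase lets '*' match '/' too, so nested files under a
    protected directory are also caught. CODEOWNERS itself has its own
    pattern rules; this only approximates them.
    """
    return any(
        fnmatchcase(path, glob)
        for path in changed_files
        for glob in PROTECTED_GLOBS
    )


print(needs_human_review(["src/app.ts"]))
print(needs_human_review(["src/app.ts", ".github/workflows/ci.yaml"]))
```

In practice the hosting platform enforces this through branch protection; the sketch only makes the rule explicit.
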
### 3.2 Application Security (SAST): Semgrep

* **What it does:** Scans first-party application code for security flaws such as SQL injection, XSS, and IDOR.
* **Detailed Rationale:** The other tools check infrastructure and dependencies, but nothing was checking the *actual logic* written by the AI agent. AI models frequently output insecure, dated boilerplate. Semgrep was selected because it is fast, rule-based, and integrates into CI without requiring a persistent database or heavy runner compute.
* **Trade-offs:** Semgrep focuses on pattern matching and lacks the deep, cross-file data-flow analysis of heavier enterprise SAST tools. It may miss complex, multi-stage logic vulnerabilities.

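
To make "insecure, dated boilerplate" concrete, here is a small illustration of the kind of pattern a SAST rule is typically written to flag, next to the fix. The table, data, and function names are hypothetical, and the example uses Python's built-in `sqlite3` purely so that it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_user_unsafe(name: str) -> list[tuple]:
    # String-built SQL: user input is parsed as part of the query.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list[tuple]:
    # Parameterized query: input is bound as data, never parsed as SQL.
    query = "SELECT name, role FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))  # the injected OR clause matches every row
print(len(find_user_safe(payload)))    # the payload is just an unknown name
```

Both functions look equally plausible in a code review skim, which is why a pattern-based gate is useful for agent-written code.
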
### 3.3 Secret Scanning: Gitleaks (Local) & TruffleHog (Pipeline)

* **What it does:** Gitleaks provides near-instant local pre-commit regex matching. TruffleHog runs in the CI pipeline and attempts to authenticate discovered secrets against external APIs.
* **Detailed Rationale:** AI agents frequently hallucinate realistic-looking API keys (`sk_live_...`). Gitleaks stops obvious human errors instantly. In the pipeline, however, TruffleHog is run with the `--only-verified` flag. If it finds a regex match, it queries the provider (e.g., Stripe, AWS) and fails the build only if the key is active. This eliminates the false-positive noise caused by AI-generated dummy keys.
* **Trade-offs:** TruffleHog's verification requires outbound internet access from the CI runner and relies on third-party APIs being online. If the external provider's API is down, the verification step could fail or time out.

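
The "fail only on active keys" rule can be sketched as a filter over scanner output. This assumes newline-delimited JSON records with `DetectorName` and `Verified` fields, which is how TruffleHog's JSON output is commonly described; treat both field names, and the `verified_findings` helper, as assumptions to check against the version in use. The `--only-verified` flag already does this filtering itself, so the sketch only shows the gating logic.

```python
import json


def verified_findings(json_lines: list[str]) -> list[dict]:
    """Keep only records the scanner marked as verified (live) secrets."""
    findings = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field name; unverified matches are treated as noise.
        if record.get("Verified") is True:
            findings.append(record)
    return findings


# Hypothetical scanner output: one dummy key, one live key.
sample_output = [
    '{"DetectorName": "Stripe", "Verified": false}',
    '{"DetectorName": "AWS", "Verified": true}',
]

blocking = verified_findings(sample_output)
exit_code = 1 if blocking else 0  # a CI step would exit with this code
print([f["DetectorName"] for f in blocking], exit_code)
```
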
### 3.4 Supply Chain Defense: Socket

* **What it does:** Wraps local package managers (`alias pnpm="socket pnpm"`) and acts as a pipeline gatekeeper to analyze new dependencies for malicious behavior.
* **Detailed Rationale:** Traditional CVE scanners check for accidental developer mistakes. Socket checks for active malice (install scripts that steal SSH keys, typosquatting, hijacked maintainer accounts). Because AI agents regularly pull in new dependencies to solve coding problems, Socket ensures neither the local machine nor the pipeline executes malicious code during dependency resolution.
* **Trade-offs:** API-dependent. To conserve free-tier API quotas, the pipeline step must be strictly configured to trigger *only* when lockfiles (`pnpm-lock.yaml`) change, requiring careful CI optimization.

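
The lockfile-only trigger is a simple predicate over the changed-file list. A minimal sketch, assuming the CI system can supply that list (for example from a `git diff --name-only` between the base and head commits); `LOCKFILES` and `should_run_socket_scan` are hypothetical names.

```python
from pathlib import PurePosixPath

# Only the lockfile named in this ADR; extend if other package managers appear.
LOCKFILES = {"pnpm-lock.yaml"}


def should_run_socket_scan(changed_files: list[str]) -> bool:
    """True if any changed path is a lockfile, at any depth in the repo."""
    return any(PurePosixPath(path).name in LOCKFILES for path in changed_files)


print(should_run_socket_scan(["src/index.ts", "README.md"]))
print(should_run_socket_scan(["apps/web/pnpm-lock.yaml"]))
```

Matching on the file name rather than the full path keeps the gate working for monorepos with nested workspaces.
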
### 3.5 Infrastructure Validation (IaC): Checkov

**Note:** Outdated; now using Pulumi CrossGuard.

* **What it does:** Parses Kubernetes manifests, Terraform, and Dockerfiles to ensure they adhere to security best practices.
* **Detailed Rationale:** A homelab exposed to the internet cannot afford basic infrastructure misconfigurations, such as running containers as `root` or mapping sensitive host volumes. Checkov acts as an automated senior cloud architect, validating the AI's generated Kubernetes manifests before Argo CD syncs them.
* **Trade-offs:** Custom internal infrastructure patterns might be flagged as insecure by default, requiring the maintainer to write and maintain skip comments (`# checkov:skip=...`) in the code to silence false positives.

### 3.6 Container Vulnerability Management: Syft + Grype

* **What it does:** Syft generates a Software Bill of Materials (SBOM), and Grype scans that SBOM for CVEs, prioritized by Exploit Prediction Scoring System (EPSS) score.
* **Detailed Rationale:** This enforces an "SBOM-first" architecture, creating a durable record of exactly what is in each image. By prioritizing on EPSS instead of standard CVSS severity (High/Critical), the gate filters the noise down to the vulnerabilities *most likely to be exploited in the wild*.
* **Trade-offs:** Generating an SBOM adds several seconds to every CI pipeline run. Furthermore, Grype lacks native reachability analysis (it knows a vulnerable package exists, but doesn't know if your code actually calls the vulnerable function).

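
EPSS-based gating amounts to a threshold on a probability rather than on a severity label. The sketch below assumes findings have already been reduced to a simple record; the `Finding` shape, the `0.10` threshold, and the placeholder CVE identifiers are all hypothetical and do not reflect Grype's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve: str      # placeholder identifier, not a real CVE
    package: str
    epss: float   # estimated probability of exploitation, 0.0 to 1.0


EPSS_THRESHOLD = 0.10  # hypothetical cut-off; tune to the noise you can tolerate


def blocking_findings(
    findings: list[Finding], threshold: float = EPSS_THRESHOLD
) -> list[Finding]:
    """Keep findings at or above the threshold, most likely exploited first."""
    kept = [f for f in findings if f.epss >= threshold]
    return sorted(kept, key=lambda f: f.epss, reverse=True)


scan = [
    Finding("CVE-0000-0001", "libexample", 0.002),
    Finding("CVE-0000-0002", "example-http", 0.870),
    Finding("CVE-0000-0003", "example-yaml", 0.150),
]
gate = blocking_findings(scan)
print([f.cve for f in gate])
```

Findings below the threshold are not discarded in this design; they simply do not fail the build.
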
### 3.7 Dependency Management: Renovate Bot

* **What it does:** Automatically opens pull requests to update outdated libraries.
* **Detailed Rationale:** The best way to fix CVEs found by Grype is to never let them linger. Renovate batches minor updates into a single weekly PR, automating security patching without overwhelming the repository.
* **Trade-offs:** Major version updates can still break the application, requiring high-confidence automated testing.

### 3.8 Vulnerability Management Dashboard: DefectDojo

* **What it does:** A centralized ASPM (Application Security Posture Management) platform.
* **Detailed Rationale:** Reading raw SARIF logs is a terrible developer experience. Checkov and Grype push findings into DefectDojo, which deduplicates alerts and provides a single, clean homelab dashboard.
* **Trade-offs:** Requires hosting and maintaining a separate stateful web application and database.

---

## 4. Part 3: Cluster Infrastructure & Delivery

### 4.1 Harvester Native Components

* **What it does:** Uses the components bundled with Harvester/RKE2: Longhorn (Storage), Prometheus/Grafana (Monitoring), and Canal (Networking CNI).
* **Detailed Rationale:** Hyperconverged Infrastructure (HCI) is designed to be "batteries included." Using native components gives us guaranteed compatibility, out-of-the-box dashboards, and upgrade paths managed by the SUSE ecosystem. Canal natively supports Kubernetes Network Policies without the need for complex service meshes.
* **Trade-offs:** Vendor lock-in. If the architecture ever migrates away from SUSE Harvester/RKE2 (e.g., to Proxmox + Talos), the storage and monitoring stacks will need to be entirely rebuilt.

### 4.2 Ingress Security: Cloudflare Tunnels (Zero Trust)

* **What it does:** Creates an outbound encrypted tunnel to Cloudflare's edge, proxying all external traffic without opening inbound firewall ports.
* **Detailed Rationale:** Essential for public-facing homelabs. It hides the home router's public IP from internet scanners such as Shodan and provides an enterprise-grade WAF and DDoS shield at the edge.
* **Trade-offs:** Introduces a strict dependency on a third-party SaaS provider. All inbound traffic is subject to Cloudflare's routing latency and Terms of Service. If the Cloudflare daemon (`cloudflared`) crashes inside the cluster, all web applications become unreachable from the internet.

### 4.3 Secret Delivery: Infisical Kubernetes Operator

* **What it does:** A centralized vault that uses an in-cluster operator to authenticate, pull, and inject secrets dynamically into Pods at runtime.
* **Detailed Rationale:** Scanning repositories for secrets only ensures they don't leak; it doesn't solve secure delivery. Infisical removes secrets from the GitOps flow entirely. AI agents use the Infisical CLI to load local variables, and Argo CD deploys standard manifests. The operator hydrates the sensitive data into memory right as the container starts.
* **Trade-offs:** Adds a critical stateful dependency to the cluster. If the Infisical operator fails or the connection to the Infisical vault is lost during pod startup, the applications will fail to boot due to missing environment variables.

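
Because a lost vault connection shows up as missing environment variables, an application can make that failure obvious by checking for them at startup. This is a sketch on the application side only, not part of the Infisical operator; `REQUIRED_SECRETS`, `missing_secrets`, and `startup_check` are hypothetical names.

```python
import os
from collections.abc import Mapping

# Hypothetical variable names; each app would list its own.
REQUIRED_SECRETS = ("DATABASE_URL", "API_TOKEN")


def missing_secrets(
    env: Mapping[str, str], required: tuple[str, ...] = REQUIRED_SECRETS
) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def startup_check(env: Mapping[str, str] = os.environ) -> None:
    """Abort with one clear message instead of failing later on first use."""
    missing = missing_secrets(env)
    if missing:
        raise SystemExit(f"missing required secrets: {', '.join(missing)}")


print(missing_secrets({"DATABASE_URL": "postgres://example"}))
```

Calling `startup_check()` first thing in the entrypoint turns a confusing mid-request crash into a single readable line in the pod log.
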
### 4.4 Local Container Registry: `zot` (or `registry:2`)

* **What it does:** A self-hosted, lightweight OCI registry living completely inside the Harvester cluster.
* **Detailed Rationale:** Required for applications that spawn ephemeral containers dynamically. Pulling images from an external registry (like GitHub Container Registry) incurs significant internet latency. Keeping the registry on the local LAN/cluster removes that round trip from every image pull, which keeps pod startup times fast.
* **Trade-offs:** Minimalist tooling. `zot` and `registry:2` do not have the robust GUI, user management, or built-in garbage collection interfaces of enterprise registries like Harbor, requiring manual CLI maintenance to prune old images and reclaim Longhorn disk space.

---

## 5. Part 4: Runtime Security & Admission Control (Shield-Right)

### 5.1 All-in-One Runtime Platform: SUSE Security (NeuVector)

* **What it does:** Native platform handling Admission Control, Layer 7 network visualization, and eBPF-based Runtime Threat Detection.
* **Detailed Rationale:** Serves as the ultimate fallback layer (Defense in Depth). If an attacker (or AI agent) bypasses the pipeline, NeuVector intercepts actions at the cluster level.
    * **Admission Control:** Enforces the policy that the cluster may *only* pull images from the local `zot` registry, rejecting arbitrary or malicious images.
    * **Runtime:** Uses eBPF to monitor kernel activity, actively blocking or alerting if an exploited container attempts to spawn a bash shell, modify its file system, or read sensitive host files.
* **Trade-offs:** NeuVector is a heavy, enterprise-grade tool. It consumes significantly more CPU and memory per node than specialized, single-purpose CLI tools. Its learning curve and initial configuration are also substantial compared to raw YAML network policies.

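
The "local registry only" admission rule reduces to extracting the registry host from an image reference and comparing it to an allowlist. NeuVector rules are configured in the product itself, not in Python; this sketch only makes the logic explicit. The registry address in `ALLOWED_REGISTRIES` is a hypothetical placeholder, and `registry_of` follows the common container convention that a first path segment containing `.` or `:` (or equal to `localhost`) is a registry host.

```python
# Hypothetical in-cluster address for the local zot registry.
ALLOWED_REGISTRIES = {"zot.registry.svc:5000"}


def registry_of(image: str) -> str:
    """Return the registry host of an image reference.

    References without an explicit host (e.g. 'nginx:1.27' or
    'library/nginx') default to Docker Hub.
    """
    first, sep, _rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_admitted(image: str) -> bool:
    """True only for images served by an allowlisted registry."""
    return registry_of(image) in ALLOWED_REGISTRIES


for ref in ("zot.registry.svc:5000/web:1.4", "ghcr.io/org/tool:latest", "nginx:1.27"):
    print(ref, is_admitted(ref))
```
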
---

## 6. Part 5: Future Upgrades

### 6.1 Dynamic Application Security Testing (DAST)

* **Plan:** Integrate OWASP ZAP or Nuclei to perform black-box testing against the running applications.
* **Rationale:** Simulates external attacker traffic to identify exploitable behavior that static analysis (Semgrep) cannot see (e.g., auth bypass, business logic flaws).
* **Trade-offs for Delaying:** DAST is slow, and running it inline with CI/CD would destroy developer velocity. It will be implemented later as an asynchronous weekly cron job, meaning we accept the risk of exploitable web logic vulnerabilities persisting for up to a week before discovery.

---

## 7. Tools Explicitly Evaluated and Rejected (The "Why Not?" List)

### ❌ Cosign (Image Signing)

* **Why it was rejected:** Cosign guards against image tampering or substitution between the CI runner, the registry, and the cluster. In a single-user private homelab where the runner, registry, and cluster share the same localized trust boundary, that risk is negligible. Managing Sigstore admission controllers bloats the architecture for near-zero practical ROI.

### ❌ OPA Gatekeeper / Kyverno (Admission Control)

* **Why it was rejected:** OPA requires learning an entirely new, complex logic language (Rego). While Kyverno uses simpler Kubernetes YAML, both were ultimately rejected in favor of NeuVector, which handles admission control natively within its existing UI as part of the unified SUSE ecosystem.

### ❌ Snyk Open Source

* **Why it was rejected:** Redundant when utilizing Socket for malware detection and Grype for CVE scanning.

### ❌ Falco / Tetragon (Runtime Security)

* **Why it was rejected:** Both are exceptional eBPF runtime security monitors. However, implementing them would require setting up custom alerting pipelines and separate visualizers. Because Harvester is a SUSE product, SUSE Security (NeuVector) was selected to provide equivalent runtime protection natively, keeping the tech stack consolidated.

### ❌ SonarQube (SAST)

* **Why it was rejected:** SonarQube requires a persistent database (PostgreSQL), consumes significant memory, and needs a complex runner setup. For an AI-centric threat model, Semgrep provides most of the value at a fraction of the infrastructure footprint and integrates directly into the lightweight Argo Workflows pipeline.

### ❌ Trivy (All-in-One Scanner)

* **Why it was rejected:** Trivy scans containers, IaC, and secrets. However, we prioritized a "best of breed" approach. Checkov handles complex Terraform/Kubernetes graphs better, TruffleHog eliminates secret false-positives via API verification, and Grype offers EPSS scoring. Using Trivy would duplicate efforts or generate conflicting, noisy alerts.

### ❌ Clair (Container Scanning)

* **Why it was rejected:** Clair requires a stateful PostgreSQL database to continuously index vulnerability feeds. This architecture specifically demands a stateless tool (Grype) that acts as an ephemeral gate *inside* the CI/CD pipeline before images are ever pushed to a registry.

### ❌ CrowdSec / Local WAFs

* **Why it was rejected:** Because the cluster's sole ingress is routed through Cloudflare Tunnels, malicious traffic and automated DDoS attempts are filtered at Cloudflare's edge network. Running a secondary WAF inside the cluster wastes compute resources to solve a problem that was already mitigated before the traffic reached the home network.

### ❌ Ovvoc (Automated Dependency Updates & Code Migration)

* **What it does:** An advanced dependency updater that goes beyond version bumping by using AST transforms and AI to rewrite application code to fix breaking changes (e.g., migrating Express 4 to 5).
* **Why it was rejected:**
    * **Cost-Prohibitive:** At $49/month for a single repository (and $249/month for up to 6), the pricing is not sustainable for a solo homelab environment.
    * **Redundant AI Capabilities:** Because this architecture already relies heavily on local AI-assisted development (e.g., Cursor, Copilot, or Aider), local AI agents can be prompted to fix the occasional breaking change in seconds at no additional cost.
    * **Diminishing Returns:** The vast majority of security vulnerabilities are patched in non-breaking minor or patch updates. **Renovate Bot** handles these for free. Ovvoc solves a problem (major version breaking changes) that is too infrequent in a homelab to justify the price tag.