# Architecture Decision Record: AI-Resilient DevSecOps & Infrastructure

**Date:** April 16, 2026

**Status:** Approved & Finalized

**Author:** [Your Name]

**Environment:** SUSE Harvester (RKE2) Homelab, Argo Workflows/CD, AI-Assisted Development

## 1. Context & Threat Model

This architecture governs a centralized template strategy deployed across multiple personal repositories. The infrastructure is hosted on a personal SUSE Harvester cluster and exposed to the public internet.

The introduction of **AI coding agents** fundamentally alters the standard solo-developer threat model. AI agents act as hyper-productive "junior developers" that write code, manage dependencies, and push directly to branches, but they lack contextual awareness. They are highly susceptible to hallucinating insecure application logic, pulling in typosquatted dependencies, or generating dummy credentials.

Furthermore, for a solo developer managing a public-facing homelab, **alert fatigue is a fatal risk**. Security controls must prioritize high-signal, zero-noise alerting and automated enforcement.

Our goals are to:

1. Prevent supply chain attacks (malicious packages).
2. Catch misconfigurations before they deploy to the homelab.
3. Eliminate alert fatigue and false positives so development velocity remains high.
4. Maintain a robust security posture without spending hours managing DevSecOps infrastructure.
5. Isolate the AI agent's blast radius without hindering its autonomous productivity.

To achieve this, the architecture applies "Defense in Depth" across several distinct boundaries: Local Tooling, Pipeline Gatekeepers, Infrastructure Delivery, and Runtime Security.

---
## 2. Part 1: Local Development & Repository Tooling

### 2.1 Secret Scanning: Gitleaks (Local)

* **What it does:** Fast, static regex matching for secrets.

* **Where it runs:** Local developer machine (via pre-commit hook).

* **Detailed Rationale:** Developers make human errors. Gitleaks runs in milliseconds and acts as a "spell-check for secrets." It prevents accidentally committing a `.env` file or hardcoded token before it ever enters the local Git history.

* **Trade-offs:** It relies on the developer actively using the pre-commit hook. If a commit is forced (`--no-verify`), the local check is bypassed.
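As a sketch, the hook can be wired through the `pre-commit` framework (the pinned `rev` is illustrative; check the Gitleaks releases page for the current tag):

```yaml
# .pre-commit-config.yaml — run Gitleaks on every local commit
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4   # illustrative pin; update to the latest release
    hooks:
      - id: gitleaks
```

After `pre-commit install`, every `git commit` triggers the scan unless the developer explicitly bypasses it with `--no-verify`.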
### 2.2 Supply Chain Defense: Socket CLI (Local Wrapper)

* **What it does:** Intercepts package installation to check for malicious code, typosquatting, and hijacked packages.

* **Where it runs:** Local machine (aliased: `alias pnpm="socket pnpm"`).

* **Detailed Rationale:** Standard CVE scanners only look for accidental bugs. Socket looks for malice. By aliasing this locally, both the human developer and local AI coding agents are forced to verify a package via Socket's API before it executes install scripts on the host machine.

* **Trade-offs:** Adds slight latency to local package installs, since each one must query the external Socket API.
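A shell-profile sketch of the wrapper: the `pnpm` alias is the one this ADR specifies; the `npm`/`npx` variants are assumptions extrapolated from the same pattern and should be checked against the Socket CLI docs.

```shell
# ~/.zshrc (or ~/.bashrc) — route package-manager invocations through Socket
alias pnpm="socket pnpm"
alias npm="socket npm"    # assumed variant, same wrapper pattern
alias npx="socket npx"    # assumed variant, same wrapper pattern
```

Because AI agents inherit the interactive shell environment, they pick up the same wrappers without any agent-specific configuration.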
---

## 3. Part 2: Source Control & Pipeline Gatekeepers (Shift-Left)
### 3.1 Pipeline Tamper Protection: `CODEOWNERS` Branch Protection

* **What it does:** Locks specific file paths (e.g., `.github/workflows/*`, `.argo/*`) behind mandatory human PR review, while allowing autonomous AI merges on application code.

* **Detailed Rationale:** The pipeline is the absolute gatekeeper. If an AI agent has branch-push access, the most critical vulnerability is the agent modifying or deleting the vulnerability scanners themselves to force a failing build to pass. `CODEOWNERS` strictly enforces the boundary between the "auditor" (the pipeline) and the "developer" (the AI).

* **Trade-offs:** Introduces slight friction. If an AI agent legitimately needs to update a pipeline configuration to support a new app feature, a human must manually intervene and approve the PR, slightly slowing end-to-end autonomous deployment.
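A minimal `CODEOWNERS` sketch (the reviewer handle is hypothetical). Note that the file protects itself, so the AI agent cannot simply edit the ownership rules away:

```
# CODEOWNERS — pipeline and deployment config require human review
/.github/workflows/  @your-handle
/.argo/              @your-handle
/CODEOWNERS          @your-handle
```

Combined with a branch-protection rule requiring code-owner review, any PR touching these paths blocks until the human owner approves.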
### 3.2 Application Security (SAST): Semgrep

* **What it does:** Scans first-party application code for logic flaws (SQL injection, XSS, IDOR).

* **Detailed Rationale:** While other tools check the infrastructure and the dependencies, nothing was checking the *actual logic* written by the AI agent. AI models frequently output insecure, dated boilerplate. Semgrep was selected because it is fast, rule-based, and integrates seamlessly into CI without requiring a persistent database or heavy runner compute.

* **Trade-offs:** Semgrep is focused on pattern matching and lacks the deep, cross-file data-flow analysis of heavier enterprise SAST tools. It may miss complex, multi-stage logic vulnerabilities.
### 3.3 Secret Scanning: Gitleaks (Local) & TruffleHog (Pipeline)

* **What it does:** Gitleaks provides instantaneous local pre-commit regex matching. TruffleHog scans in the CI pipeline and dynamically attempts to authenticate discovered secrets against external APIs.

* **Detailed Rationale:** AI agents frequently hallucinate realistic-looking API keys (`sk_live_...`). Gitleaks stops obvious human errors instantly. In the pipeline, however, TruffleHog runs with the `--only-verified` flag: if it finds a regex match, it queries the provider (e.g., Stripe, AWS) and only fails the build if the key is active. This eliminates the false-positive noise caused by AI-generated dummy keys.

* **Trade-offs:** TruffleHog's verification requires outbound internet access from the CI runner and relies on third-party APIs being online. If the external provider's API is down, the verification step can fail or time out.
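The pipeline step reduces to a single invocation (flag names as of TruffleHog v3 at time of writing; verify against the installed version, since newer releases rename some flags):

```shell
# Scan the full git history; exit non-zero only when a secret
# actually authenticates against its provider
trufflehog git file://. --only-verified --fail
```

A clean repository passes in seconds because the slow provider-verification call only happens when a regex match is found first.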
### 3.4 Supply Chain Defense: Socket

* **What it does:** Wraps local package managers (`alias pnpm="socket pnpm"`) and acts as a pipeline gatekeeper that analyzes new dependencies for malicious behavior.

* **Detailed Rationale:** Traditional CVE scanners check for accidental developer mistakes. Socket checks for active malice (install scripts that steal SSH keys, typosquatting, hijacked maintainer accounts). Because AI agents regularly pull in new dependencies to solve coding problems, Socket ensures neither the local machine nor the pipeline executes malicious code during dependency resolution.

* **Trade-offs:** API-dependent. To conserve free-tier API quotas, the pipeline step must be strictly configured to trigger *only* when lockfiles (`pnpm-lock.yaml`) change, requiring careful CI optimization.
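One cheap way to enforce the lockfile-only trigger is a guard inside the workflow step itself (a sketch: the diff range assumes single-commit pushes, and the exact `socket` subcommand should be checked against the Socket CLI docs):

```shell
# Only spend Socket API quota when the lockfile actually changed
if git diff --name-only HEAD~1..HEAD | grep -q "pnpm-lock.yaml"; then
  socket ci
else
  echo "lockfile unchanged — skipping Socket scan"
fi
```

Alternatively, the Argo Workflows template can express the same condition declaratively so the pod never starts at all.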
### 3.5 Infrastructure Validation (IaC): Checkov

* **What it does:** Parses Kubernetes manifests, Terraform, and Dockerfiles to ensure they adhere to security best practices.

* **Detailed Rationale:** A homelab exposed to the internet cannot afford basic infrastructure misconfigurations, such as running containers as `root` or mapping sensitive host volumes. Checkov acts as an automated senior cloud architect, validating the AI's generated Kubernetes manifests before Argo CD syncs them.

* **Trade-offs:** Custom internal infrastructure patterns may be flagged as insecure by default, requiring the maintainer to actively manage and write skip flags (`# checkov:skip=...`) in the code to silence false positives.
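For Kubernetes manifests, skips live as annotations next to the resource they silence (the check ID and reason here are illustrative):

```yaml
# deployment.yaml — suppress one Checkov policy with a documented reason
metadata:
  annotations:
    checkov.io/skip1: CKV_K8S_20=Intentional; pod needs host networking for LAN discovery
```

Keeping the justification inline means every suppression is reviewable in the same PR diff the AI agent produced.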
### 3.6 Container Vulnerability Management: Syft + Grype

* **What it does:** Syft generates a Software Bill of Materials (SBOM), and Grype scans that SBOM for CVEs, prioritized by the Exploit Prediction Scoring System (EPSS).

* **Detailed Rationale:** This enforces an "SBOM-first" architecture, creating a permanent, auditable receipt of exactly what is running. By prioritizing on EPSS scores instead of raw CVSS severity (High/Critical), Grype filters the noise down to vulnerabilities that attackers are *actively exploiting in the wild*.

* **Trade-offs:** Generating an SBOM adds several seconds to every CI pipeline run. Furthermore, Grype lacks native reachability analysis: it knows a vulnerable package exists, but not whether your code actually calls the vulnerable function.
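The SBOM-first flow is two commands (the image reference is hypothetical; flags per current Syft/Grype releases, so verify locally):

```shell
# 1. Generate the SBOM once and archive it as the build receipt
syft registry.local/my-app:latest -o cyclonedx-json > sbom.json

# 2. Scan the SBOM instead of re-scanning the whole image
grype sbom:./sbom.json --fail-on high
```

Because Grype consumes the SBOM artifact rather than the image, the same receipt can be re-scanned later as new CVEs are published, without rebuilding anything.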
### 3.7 Dependency Management: Renovate Bot

* **What it does:** Automatically opens pull requests to update outdated libraries.

* **Detailed Rationale:** The best way to fix CVEs found by Grype is to never let them linger. Renovate batches minor updates into a single weekly PR, automating security patching without overwhelming the repository.

* **Trade-offs:** Major version updates can still break the application, requiring high-confidence automated testing.
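A sketch of the weekly-batching behavior in `renovate.json` (preset and field names per the Renovate docs; the schedule is an example):

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "schedule": ["before 6am on monday"],
  "packageRules": [
    {
      "matchUpdateTypes": ["minor", "patch"],
      "groupName": "weekly minor and patch bumps"
    }
  ]
}
```

Major updates fall outside the `packageRules` group, so they still arrive as individual PRs that get human (and CI) scrutiny.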
### 3.8 Vulnerability Management Dashboard: DefectDojo

* **What it does:** A centralized ASPM (Application Security Posture Management) platform.

* **Detailed Rationale:** Reading raw SARIF logs is a terrible developer experience. Checkov and Grype push findings into DefectDojo, which deduplicates alerts and provides a single, clean homelab dashboard.

* **Trade-offs:** Requires hosting and maintaining a separate stateful web application and database.

---
## 4. Part 3: Cluster Infrastructure & Delivery

### 4.1 Harvester Native Components

* **What it does:** Uses built-in RKE2 components: Longhorn (storage), Prometheus/Grafana (monitoring), and Canal (networking CNI).

* **Detailed Rationale:** Hyperconverged Infrastructure (HCI) is designed to be "batteries included." Using native components guarantees compatibility, out-of-the-box dashboards, and seamless upgrade paths managed by the SUSE ecosystem. Canal natively supports Kubernetes Network Policies without the need for a complex service mesh.

* **Trade-offs:** Vendor lock-in. If the architecture ever migrates away from SUSE Harvester/RKE2 (e.g., to Proxmox + Talos), the storage and monitoring stacks will need to be rebuilt entirely.
### 4.2 Ingress Security: Cloudflare Tunnels (Zero Trust)

* **What it does:** Creates an outbound encrypted tunnel to Cloudflare's edge, proxying all external traffic without opening inbound firewall ports.

* **Detailed Rationale:** Essential for public-facing homelabs. It completely hides the home router's public IP from Shodan-style scanners and provides an enterprise-grade WAF and DDoS shield at the edge.

* **Trade-offs:** Introduces a strict dependency on a third-party SaaS provider. All inbound traffic is subject to Cloudflare's routing latency and Terms of Service. If the Cloudflare daemon (`cloudflared`) crashes inside the cluster, all web applications go offline.
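The tunnel's routing is a small config file mounted into the `cloudflared` deployment (hostname and service names here are hypothetical):

```yaml
# cloudflared config.yaml — outbound-only; no inbound router ports
tunnel: homelab
credentials-file: /etc/cloudflared/creds.json
ingress:
  - hostname: app.example.com
    service: http://my-app.default.svc.cluster.local:80
  # required catch-all: unmatched hostnames get a 404 at the edge
  - service: http_status:404
```

Because the daemon dials out to Cloudflare, the home firewall stays fully closed to inbound traffic.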
### 4.3 Secret Delivery: Infisical Kubernetes Operator

* **What it does:** A centralized vault whose in-cluster operator authenticates, pulls, and injects secrets dynamically into Pods at runtime.

* **Detailed Rationale:** Scanning repositories for secrets only ensures they don't leak; it doesn't solve secure delivery. Infisical removes secrets from the GitOps flow entirely. AI agents use the Infisical CLI to load local variables, and Argo CD deploys standard manifests. The operator hydrates the sensitive data into memory as the container starts.

* **Trade-offs:** Adds a critical stateful dependency to the cluster. If the Infisical operator fails or the connection to the vault is lost during pod startup, applications fail to boot due to missing environment variables.
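Delivery is declared as a CRD that the operator reconciles into an ordinary Kubernetes Secret. This is a sketch only: field names vary across operator versions, and the project/secret names are hypothetical.

```yaml
apiVersion: secrets.infisical.com/v1alpha1
kind: InfisicalSecret
metadata:
  name: my-app-secrets
spec:
  authentication:
    universalAuth:
      credentialsRef:
        secretName: universal-auth-credentials
        secretNamespace: default
      secretsScope:
        projectSlug: my-project      # hypothetical project
        envSlug: prod
        secretsPath: "/"
  managedSecretReference:
    secretName: my-app-managed-secret   # the plain Secret that pods mount
    secretNamespace: default
```

The Git repository only ever contains this pointer, never the secret values themselves.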
### 4.4 Local Container Registry: `zot` (or `registry:2`)

* **What it does:** A self-hosted, lightweight OCI registry living entirely inside the Harvester cluster.

* **Detailed Rationale:** Required for applications that dynamically spawn ephemeral containers. Pulling images from an external registry (such as GitHub Container Registry) incurs significant internet latency. Keeping the registry on the local LAN/cluster means image pulls complete in milliseconds, giving very fast pod startup times.

* **Trade-offs:** Minimalist tooling. `zot` and `registry:2` lack the GUI, user management, and built-in garbage-collection interfaces of enterprise registries like Harbor, requiring manual CLI maintenance to prune old images and reclaim Longhorn disk space.

---
## 5. Part 4: Runtime Security & Admission Control (Shield-Right)

### 5.1 All-in-One Runtime Platform: SUSE Security (NeuVector)

* **What it does:** A native platform handling admission control, Layer 7 network visualization, and eBPF-based runtime threat detection.

* **Detailed Rationale:** Serves as the ultimate fallback layer (Defense in Depth). If an attacker (or AI agent) bypasses the pipeline, NeuVector intercepts actions at the cluster level.

  * **Admission Control:** Enforces the policy that the cluster may *only* pull images from the local `zot` registry, rejecting arbitrary or malicious images.

  * **Runtime:** Uses eBPF to monitor kernel activity, actively blocking or alerting if an exploited container attempts to spawn a shell, modify its file system, or read sensitive host files.

* **Trade-offs:** NeuVector is a heavy, enterprise-grade tool. It consumes significantly more CPU and memory per node than specialized single-purpose CLI tools, and its learning curve and initial configuration are substantial compared to raw YAML network policies.
---

## 6. Part 5: Future Upgrades

### 6.1 Dynamic Application Security Testing (DAST)

* **Plan:** Integrate OWASP ZAP or Nuclei to perform black-box testing against the running applications.

* **Rationale:** Simulates external attacker traffic to identify exploitable behavior that static analysis (Semgrep) cannot see (e.g., auth bypass, business logic flaws).

* **Trade-offs for Delaying:** DAST is slow; running it inline with CI/CD destroys developer velocity. It will instead be implemented later as an asynchronous weekly cron job, meaning we accept the temporary risk of web logic vulnerabilities persisting for up to a week before discovery.
---

## 7. Tools Explicitly Evaluated and Rejected (The "Why Not?" List)

### ❌ Cosign (Image Signing)

* **Why it was rejected:** Cosign protects against man-in-the-middle attacks between the CI runner and the registry. In a single-user private homelab where the runner, registry, and cluster share the same localized trust boundary, the risk is negligible. Managing Sigstore admission controllers bloats the architecture for near-zero practical ROI.

### ❌ OPA Gatekeeper / Kyverno (Admission Control)

* **Why it was rejected:** OPA requires learning an entirely new, complex policy language (Rego). While Kyverno uses simpler Kubernetes YAML, both were ultimately rejected in favor of NeuVector, which handles admission control natively within its existing UI as part of the unified SUSE ecosystem.

### ❌ Falco / Tetragon (Runtime Security)

* **Why it was rejected:** Both are excellent eBPF runtime security monitors. However, implementing them would require setting up custom alerting pipelines and separate visualizers. Because Harvester is a SUSE product, SUSE Security (NeuVector) was selected to provide equivalent runtime protection natively, keeping the tech stack consolidated.

### ❌ SonarQube (SAST)

* **Why it was rejected:** SonarQube requires a persistent database (PostgreSQL), high memory consumption, and a complex runner setup. Semgrep provides 90% of the value for an AI threat model with 10% of the infrastructure footprint, and integrates directly into the lightweight Argo Workflows pipeline.

### ❌ Trivy (All-in-One Scanner)

* **Why it was rejected:** Trivy scans containers, IaC, and secrets, but we prioritized a "best of breed" approach. Checkov handles complex Terraform/Kubernetes graphs better, TruffleHog eliminates secret false positives via API verification, and Grype offers EPSS scoring. Using Trivy would duplicate effort or generate conflicting, noisy alerts.

### ❌ Clair (Container Scanning)

* **Why it was rejected:** Clair requires a stateful PostgreSQL database to continuously index vulnerability feeds. This architecture specifically demands a stateless tool (Grype) that acts as an ephemeral gate *inside* the CI/CD pipeline before images are ever pushed to a registry.

### ❌ CrowdSec / Local WAFs

* **Why it was rejected:** Because the cluster's sole ingress is routed through Cloudflare Tunnels, malicious traffic and automated DDoS attempts are filtered at Cloudflare's edge network. Running a secondary WAF inside the cluster wastes compute to solve a problem already mitigated before traffic reaches the home network.

---
# Security & Pipeline Architecture Decision Record (ADR)

## 1. Context and Constraints

The architecture is designed for a self-hosted homelab environment with internet-exposed applications. The development workflow relies heavily on AI coding agents.

The system must satisfy the following constraints:

* **Language & Style:** Exclusively TypeScript, with a highly functional, Domain-Driven Design architecture (based on Scott Wlaschin's *Domain Modeling Made Functional*: pipeline methods, monads, and deep function chaining).

* **Infrastructure:** Self-hosted Gitea/GitLab with Argo Workflows. Repositories are private.

* **Low Friction / Low Noise:** Developer fatigue must be minimized. False positives are unacceptable, as they lead to ignoring security tools altogether.

* **AI Agent Defense:** Must specifically catch AI-introduced flaws (hallucinated packages, insecure API usage, bypasses).

* **Low Maintenance:** Avoid managing heavy databases or infrastructure just to run security scans.

* **Budget:** Willing to pay for premium tools (e.g., ~$50/month) *only* if they significantly reduce developer friction and pipeline execution time.
## 2. Pipeline Execution Strategy: Argo Workflows

To maintain developer velocity (the "friction" principle), pipeline feedback must be fast.

* **Decision:** Use **Argo Workflows** configured with Directed Acyclic Graphs (DAGs) and step-level memoization.

* **Trade-off Analyzed:** Paid remote-caching CI runners (such as RWX Mint, Dagger, or Earthly) vs. Argo.

* **Why Argo won:** Argo natively supports parallel pod execution: five distinct one-minute security steps run simultaneously, for a total pipeline time of roughly one minute. Argo's step-level memoization skips tasks whose inputs haven't changed, providing sufficient caching without the cost of enterprise CI tools.
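The fan-out plus memoization looks roughly like this in a Workflow spec (template names, images, and the cache key are hypothetical; see Argo's memoization docs for exact cache semantics):

```yaml
# Five scanners fan out in parallel from one DAG entrypoint
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: security-gate-
spec:
  entrypoint: gate
  templates:
    - name: gate
      dag:
        tasks:
          - {name: semgrep,    template: semgrep}
          - {name: trufflehog, template: trufflehog}
          - {name: checkov,    template: checkov}
          - {name: socket,     template: socket}
          - {name: grype,      template: grype}
    - name: semgrep
      memoize:
        key: "semgrep-{{workflow.parameters.revision}}"
        maxAge: "24h"
        cache:
          configMap:
            name: security-gate-cache
      container:
        image: semgrep/semgrep
        command: [semgrep, ci]
    # trufflehog / checkov / socket / grype templates follow the same shape
```

Each scanner runs in its own pod, so total wall-clock time tracks the slowest single step rather than the sum of all of them.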
## 3. The Security Stack (Defense in Depth)

### Layer 1: Instant IDE Feedback (0-second delay)

* **Tool:** `eslint` with `eslint-plugin-security` and `@typescript-eslint`.

* **Reasoning:** Linters are "dumb" but instantaneous. They catch AI agents generating immediately dangerous syntax (such as `eval()` or unsafe regular expressions) before a commit is even made.
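A sketch of the lint layer as a legacy-style `.eslintrc.json` (the plugin names are the ones named above; the exact `extends` entries depend on the installed plugin versions, so verify against each plugin's README):

```json
{
  "parser": "@typescript-eslint/parser",
  "plugins": ["@typescript-eslint", "security"],
  "extends": [
    "eslint:recommended",
    "plugin:@typescript-eslint/recommended",
    "plugin:security/recommended-legacy"
  ]
}
```

Run in the editor and as a pre-commit step, this gives the AI agent instant feedback on the most obviously dangerous constructs.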
### Layer 2: Infrastructure as Code (IaC) Scanning

* **Tool:** Checkov (open source).

* **Reasoning:** A lightweight CLI tool that ensures the AI agents do not accidentally expose internal homelab ports to the internet or misconfigure container permissions.
### Layer 3: Supply Chain Security

* **Tool:** Socket.dev (Team tier, $25/month).

* **Trade-off Analyzed:** Standard `npm audit` / Dependabot vs. Socket.dev.

* **Why Socket won:** AI agents are notorious for hallucinating package names, which leads to downloading typosquatted malware. Furthermore, standard tools alert on *every* CVE, causing massive alert fatigue. Socket provides **reachability analysis**: it maps the data flow and only alerts if the code *actually calls the vulnerable function* within a library. This drastically reduces false positives and justifies the $25/mo cost.
### Layer 4: Static Application Security Testing (SAST)

* **Tool:** Semgrep Code (Teams tier, $30/month).

* **Trade-off Analyzed:** Semgrep OSS / free tier vs. Semgrep Teams.

* **Why Semgrep Teams won:** Standard SAST tools (including the Semgrep OSS/Pro engine) rely on strict AST taint analysis. Because this project uses a highly functional architecture (custom pipelines, generic monads like `Result<T, E>`, and `.bind()` chaining), traditional data-flow trackers easily lose the trace, leading to dangerous **false negatives**.

* **The Deciding Feature:** The Teams tier unlocks **AI-powered detection** (not just the AI assistant for triage). Semgrep's AI reads the semantic intent of the code, tracing data through complex functional wrappers where traditional scanners fail. This acts as an automated senior security reviewer reading the AI agent's PRs.
|
||||||
|
|
||||||
|
## 4. Alternatives Dismissed
|
||||||
|
|
||||||
|
| Tool | Reason for Rejection |
|
||||||
|
| :--- | :--- |
|
||||||
|
| **CodeQL** | Best-in-class taint analysis, but requires an exorbitant GitHub Advanced Security license for private repositories. Scans are also notoriously slow (compile-to-database architecture), violating the "low friction / fast pipeline" constraint. |
|
||||||
|
| **SonarQube** | Excellent for tech debt, but violates the "Low Maintenance" constraint. Requires spinning up and maintaining a dedicated Postgres database and Java server in the homelab. Generates too much noise out-of-the-box. |
| **Snyk Code** | Great UX, but lacks the ability to write custom rules. If the AI agent develops a specific bad habit unique to this codebase, Snyk cannot be easily tuned to block it. |
| **Checkmarx / Veracode** | Built for massive legacy enterprise compliance. Far too expensive, slow, and noisy for a modern, agile homelab setup. |
## 5. Future Considerations / Phase 2
* **Build Caching:** If actual container build steps (`docker build`, `npm install`) become the bottleneck in Argo Workflows, evaluate adding open-source caching layers like **Kaniko** or **BuildKit** inside Argo pods before purchasing paid caching solutions.
* **Custom Semgrep Rules:** If the AI agent repeatedly makes domain-specific logic errors (e.g., misusing a specific custom Monad), write lightweight custom Semgrep YAML rules to permanently block those specific anti-patterns.
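Such a rule might look like the following sketch (the rule id, message, and targeted anti-pattern are hypothetical; a real rule would encode whatever habit the agent actually develops):

```yaml
rules:
  - id: no-raw-result-mutation        # hypothetical rule id
    languages: [typescript]
    severity: ERROR
    message: Do not mutate Result values in place; chain through bind() instead.
    # Flags direct writes to the wrapped value, bypassing the monad's API.
    pattern: $RESULT.value = $X
```

A rule like this lives in the repository and runs as part of the normal Semgrep scan, so the correction is enforced on every PR rather than re-litigated in review.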
## 6. Detailed Analysis of Rejected Alternatives
When designing this pipeline, several industry-standard tools were evaluated. They were ultimately rejected because they violated one or more core constraints of this specific environment: **Low Homelab Maintenance**, **Fast Pipeline Execution (Low Friction)**, **Support for Functional Architectures**, or **Affordability for Private Repos**.
### Rejected: CodeQL (by GitHub)
* **What it is:** A highly advanced, semantic code-scanning engine that treats code like a database to perform deep data-flow and taint analysis.
* **Why it was rejected (Dealbreakers):**
* **The "Private Repo Tax":** CodeQL is only free for public open-source projects. Because this homelab uses self-hosted private Gitea/GitLab repositories, using CodeQL would require purchasing GitHub Advanced Security (GHAS) enterprise licenses, which are prohibitively expensive for an individual.
* **Pipeline Friction:** CodeQL requires an extraction phase: it must first translate the TypeScript codebase into a relational database before any queries can run. This adds significant time (often minutes) to the pipeline, violating the strict fast-feedback loop required for high-velocity AI agent development.
### Rejected: SonarQube / SonarCloud
* **What it is:** A holistic code quality and SAST platform that acts as a central dashboard for code smells, technical debt, and security vulnerabilities.
* **Why it was rejected (Dealbreakers):**
* **High Infrastructure Maintenance:** To run SonarQube for free on private code, it must be self-hosted. This requires standing up a dedicated Java-based application server and a PostgreSQL database in the homelab cluster. Managing, updating, and tuning this infrastructure directly contradicts the "Low Maintenance" constraint.
* **Noise and Focus:** SonarQube heavily flags general "code smells" and style issues. In an AI-driven workflow, this generates massive alert fatigue. The goal is to catch critical security flaws and complex logic errors, not to argue with the scanner about whether an AI agent wrote a slightly redundant `if` statement.
### Rejected: Snyk Code (SAST capabilities)
* **What it is:** A developer-first, high-speed SAST tool powered by machine learning, famous for instant IDE feedback. *(Note: Snyk was evaluated as a SAST alternative to Semgrep, whereas Socket was chosen for dependencies).*
* **Why it was rejected (Dealbreakers):**
* **The "Black Box" Limitation:** Snyk does not allow users to write custom security rules. If the AI agent develops a bad habit specific to this project's unique functional domain-modeling architecture, there is no way to write a quick rule to block it. Semgrep’s YAML rules allow for immediate, custom course-correction.
* **Scan Limits:** The free tier heavily restricts the number of scans per month. Because AI agents often generate dozens of micro-commits and rapid iterations, the pipeline would frequently hit rate limits, blocking deployment.
### Rejected: Legacy Enterprise Scanners (Checkmarx, Veracode, Fortify)
* **What they are:** Heavyweight commercial application security testing platforms used by large enterprises, banks, and governments for compliance auditing.
* **Why it was rejected (Dealbreakers):**
* **Execution Speed:** Historically known for extremely slow scan times (sometimes taking hours), completely destroying developer velocity and CI/CD parallelism.
* **Extreme Cost:** Pricing starts in the tens of thousands of dollars.
* **High False Positives:** Out-of-the-box, these tools are incredibly noisy and require dedicated AppSec teams to tune them to the specific application architecture.
### Rejected: Paid CI/CD Execution Engines (RWX Mint, Dagger Cloud, Buildkite)
* **What they are:** Next-generation CI platforms that offer advanced DAG execution, deep remote layer caching, and highly optimized parallel builds.
* **Why it was rejected (Dealbreakers):**
* **Redundancy with Argo:** While powerful, paying a premium for these platforms is unnecessary. Because the homelab already utilizes Kubernetes, **Argo Workflows** natively provides DAG parallel execution and step-level memoization for free. The compute is already paid for by the homelab hardware, making commercial CI orchestrators an unnecessary expense for this phase of the architecture.
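As a sketch of the memoization feature referenced above (the template name, parameter, and cache ConfigMap are illustrative, not the actual pipeline definition), an Argo Workflows step can skip re-running a scan when its inputs have not changed:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sast-scan-          # illustrative name
spec:
  entrypoint: semgrep-scan
  arguments:
    parameters:
      - name: commit-sha
  templates:
    - name: semgrep-scan
      memoize:
        # Re-use the cached step result if this commit was already scanned.
        key: "semgrep-{{workflow.parameters.commit-sha}}"
        maxAge: "24h"
        cache:
          configMap:
            name: sast-scan-cache   # illustrative ConfigMap backing the cache
      container:
        image: semgrep/semgrep
        command: [semgrep, scan, --config=auto]
```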