From 706cf2033aab05493f1e53b4a536491fbd51ad41 Mon Sep 17 00:00:00 2001
From: Elizabeth W
Date: Wed, 15 Apr 2026 23:11:04 -0600
Subject: [PATCH] added rationale doc

---
 docs/rationale.md | 93 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 docs/rationale.md

diff --git a/docs/rationale.md b/docs/rationale.md
new file mode 100644
index 0000000..42ea2e4
--- /dev/null
+++ b/docs/rationale.md
@@ -0,0 +1,93 @@
+# Architecture Decision Record: DevSecOps & Infrastructure
+
+**Date:** April 16, 2026
+**Status:** Approved & Implemented
+**Author:** [Your Name]
+**Environment:** SUSE Harvester (Kubernetes) Homelab, Argo Workflows/CD, AI-Assisted Development
+
+## 1. Context & Threat Model
+This project uses a centralized template strategy across multiple personal repositories. The infrastructure is hosted on a personal SUSE Harvester cluster and exposed to the public internet. Furthermore, the introduction of AI coding agents (which write code, manage dependencies, and occasionally push directly to branches) fundamentally alters the standard solo-developer threat model.
+
+Our goals are to:
+1. Prevent supply chain attacks (malicious packages).
+2. Catch misconfigurations before they deploy to the homelab.
+3. Minimize alert fatigue and false positives so development velocity remains high.
+4. Maintain a robust security posture without spending hours managing DevSecOps infrastructure.
+
+To achieve this, the tooling is split into two layers: **Local/Repo Tooling** (to protect the developer and catch errors early) and the **CI/CD Pipeline** (the server-side gate that cannot be bypassed from a local machine). Ingress hardening is covered separately in Part 3.
+
+---
+
+## 2. Part 1: Local Development & Repository Tooling
+
+### 2.1 Secret Scanning: Gitleaks (Local)
+* **What it does:** Fast, static regex matching for secrets.
+* **Where it runs:** Local developer machine (via Pre-commit hook).
+* **The Rationale:** Developers make mistakes. Gitleaks runs in milliseconds and acts as a "spell-check for secrets." It blocks an accidentally staged `.env` file or hardcoded token before it ever enters the local Git history.
+
+### 2.2 Supply Chain Defense: Socket CLI (Local Wrapper)
+* **What it does:** Intercepts package installation to check for malicious code, typosquatting, and hijacked packages.
+* **Where it runs:** Local machine (aliased: `alias pnpm="socket pnpm"`).
+* **The Rationale:** Standard CVE scanners only look for known vulnerabilities, which are usually accidental bugs. Socket looks for *malice* (e.g., an attacker stealing a maintainer's credentials and publishing a release that contains an SSH-key-stealing script). By aliasing this locally, both the human developer and local AI coding agents verify a package via Socket's API *before* it downloads to the machine, provided the agent's shell actually loads the alias (non-interactive shells often do not).
+
+---
+
+## 3. Part 2: CI/CD Pipeline (Argo Workflows / Server-Side)
+
+### 3.1 Supply Chain Defense: Socket (Pipeline Gatekeeper)
+* **What it does:** Analyzes pull requests and commits for newly introduced dependencies, scanning them for malware, typosquatting, and high-risk network/filesystem behavior.
+* **The Rationale (The AI Blindspot):** If an AI agent pushes a commit directly via API or a remote environment, the local Socket CLI wrapper is bypassed. Socket in the CI/CD pipeline acts as the final gatekeeper. To conserve API quota/free-tier limits, this step is configured to trigger *only* when dependency files (e.g., `package.json`, `pnpm-lock.yaml`) are modified.
+
+### 3.2 Secret Verification: TruffleHog
+* **What it does:** Scans Git history for secrets and actively calls out to external APIs (AWS, Stripe, etc.) to verify if the key is live.
+* **The Rationale:** Similar to the Socket blindspot, AI agents can pull real, live API keys from their working context (for example, an `.env` file or the shell environment) and push them in a commit without ever passing through the local Gitleaks hook.
+* **Trade-off / Why Not Just Gitleaks?:** By running TruffleHog with the `--only-verified` flag, it only fails the pipeline if it proves a key is actively working. This keeps **noise and false positives near zero**; the trade-off is that secrets TruffleHog cannot verify do not fail the pipeline. Setup takes about a minute, and because it only executes the "slow" API verification if a regex match is found, clean pipelines complete this step in ~10 seconds.
+
+### 3.3 Infrastructure as Code (IaC) Validation: Checkov
+* **What it does:** Scans Terraform, Kubernetes manifests, and Dockerfiles for misconfigurations.
+* **The Rationale:** Because these projects are exposed to the public internet on a personal Harvester cluster, infrastructure misconfigurations (like running containers as `root` or overly permissive ingress rules) can be fatal. Checkov acts as an automated "senior cloud architect," ensuring community best practices are enforced automatically.
+
+### 3.4 Container Security: Syft + Grype
+* **What it does:** Syft generates a Software Bill of Materials (SBOM), and Grype scans that SBOM for known CVEs.
+* **The Rationale:** This enforces an "SBOM-first" architecture. By generating the SBOM artifact first, we have a durable, auditable record of exactly what was deployed. Grype consumes this SBOM and is configured to prioritize vulnerabilities based on EPSS (Exploit Prediction Scoring System), drastically reducing the noise of traditional "High/Critical" CVSS alerts.
+
+### 3.5 Dependency Management: Renovate Bot
+* **What it does:** Automatically opens pull requests to update outdated libraries.
+* **The Rationale:** The best way to fix CVEs found by Grype is to never let them linger. Renovate is configured to group minor updates into a single weekly PR, automating security patching without overwhelming the repository with individual pull requests.
+
+### 3.6 Vulnerability Management Dashboard: DefectDojo
+* **What it does:** A centralized, open-source ASPM (Application Security Posture Management) platform.
+* **The Rationale:** Reading raw CI/CD JSON/SARIF logs to triage security issues is a terrible developer experience. Checkov and Grype both push their findings into DefectDojo, which deduplicates the alerts and provides a single, clean dashboard for the entire homelab.
+
+---
+
+## 4. Part 3: Infrastructure & Ingress
+
+### 4.1 Ingress Security: Cloudflare Tunnels (Zero Trust)
+* **What it does:** Creates an outbound encrypted tunnel from the Harvester cluster to Cloudflare's edge network.
+* **The Rationale:** This puts the applications behind Cloudflare's Web Application Firewall (WAF) and DDoS protection without requiring any open inbound ports on the home router. Attackers scanning the public IP of the home network will see zero exposed services, yet the web applications remain fully accessible to the public via Cloudflare's routing.
+
+---
+
+## 5. Tools Explicitly Evaluated and Rejected (The "Why Not?" List)
+To avoid architectural bloat, several industry-standard tools were evaluated and explicitly excluded from this stack.
+
+### Cosign (Image Signing)
+* **What it does:** Cryptographically signs container images.
+* **Why it was rejected:** Cosign guards against image tampering (e.g., a "Man-in-the-Middle" attack) between a CI/CD runner and a container registry. Because the CI/CD platform, the registry, and the Kubernetes cluster are all strictly controlled by a single user on a private homelab, the risk of a MitM attack is negligible. The complexity of managing Sigstore admission controllers in Harvester outweighs the benefit for a solo developer.
+
+### Clair (Container Scanning)
+* **What it does:** Vulnerability scanning (similar to Grype/Trivy).
+* **Why it was rejected:** Clair requires a stateful PostgreSQL database to continuously scan container registries. We require a stateless tool that acts as a gate *inside* the CI/CD pipeline before an image is pushed. Grype achieves this statelessly.
+
+### Snyk Open Source
+* **What it does:** SCA and SAST scanning.
+* **Why it was rejected:** While Snyk's reachability analysis is excellent, its functionality is largely redundant when using Socket for malware detection and Grype for CVE scanning. (Note: Socket's paid "Team" tier provides the missing reachability analysis if required in the future).
+
+### CrowdSec (Homelab WAF)
+* **What it does:** Crowdsourced IP blocking and local WAF.
+* **Why it was rejected:** Because all public ingress is routed exclusively through Cloudflare Tunnels (which proxy and filter malicious traffic at the edge), running a local WAF at the Nginx Ingress level is redundant and wastes cluster compute resources.
+
+### Trivy
+* **What it does:** All-in-one scanner for IaC, secrets, and CVEs.
+* **Why it was rejected:** Trivy is excellent for zero-config setups. However, because we wanted "best of breed" specialized tools (Checkov for superior IaC graph-scanning, TruffleHog for API-verified secrets, and Grype for EPSS risk-scoring), Trivy would either be redundant or create duplicate/conflicting alerts if run alongside them.
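
The EPSS-based prioritization that section 3.4 and the Trivy entry attribute to Grype can be sketched as a small gate over a Grype JSON report. This is a minimal illustration, not the pipeline's actual configuration: the `EPSS_THRESHOLD` value, the helper names, and the sample data are all hypothetical, and the code assumes each match exposes an `epss` list under `vulnerability`, which should be checked against the JSON emitted by the Grype version in use.

```python
from typing import Any

# Hypothetical policy: block only when the exploitation probability is >= 10%.
EPSS_THRESHOLD = 0.10


def max_epss(match: dict[str, Any]) -> float:
    """Return the highest EPSS probability attached to one match (0.0 if none)."""
    entries = match.get("vulnerability", {}).get("epss") or []
    return max((float(e.get("epss", 0.0)) for e in entries), default=0.0)


def findings_to_block(
    report: dict[str, Any], threshold: float = EPSS_THRESHOLD
) -> list[tuple[str, str, float]]:
    """List (vulnerability id, package, EPSS) at or above the threshold, highest first."""
    blocked = []
    for match in report.get("matches", []):
        score = max_epss(match)
        if score >= threshold:
            vuln_id = match.get("vulnerability", {}).get("id", "unknown")
            package = match.get("artifact", {}).get("name", "unknown")
            blocked.append((vuln_id, package, score))
    return sorted(blocked, key=lambda item: item[2], reverse=True)


# Illustrative report with made-up identifiers and scores, not real scan output.
sample_report = {
    "matches": [
        {
            "vulnerability": {"id": "CVE-0000-0001", "severity": "Critical", "epss": [{"epss": 0.002}]},
            "artifact": {"name": "libexample"},
        },
        {
            "vulnerability": {"id": "CVE-0000-0002", "severity": "Medium", "epss": [{"epss": 0.61}]},
            "artifact": {"name": "webframework"},
        },
        {
            # A match with no EPSS data is treated as score 0.0 in this sketch.
            "vulnerability": {"id": "CVE-0000-0003", "severity": "High"},
            "artifact": {"name": "noscore"},
        },
    ]
}

blocked = findings_to_block(sample_report)
print(blocked)
```

With the sample data, only the medium-severity finding with a high EPSS score is returned, while the critical finding with a negligible score is not. That is the noise reduction the rationale describes. Treating a missing score as 0.0 is a choice made for this sketch; a stricter policy might fall back to severity when no EPSS data is present.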