
AI can now generate code faster than humans can review it. This creates a dangerous bottleneck where security risks and architectural flaws can hide inside large volumes of machine-generated software.
To address this problem, I built the Automaton Auditor (Emerald Suite v2.0) — a multi-agent LangGraph forensic swarm designed to govern code rather than generate it.
The system introduces a Digital Courtroom architecture where specialized AI agents analyze repositories, argue opposing interpretations, and synthesize a final verdict based on verifiable evidence.
By combining AST-based code analysis, strict Pydantic state contracts, and adversarial multi-agent reasoning, the Emerald Suite transforms manual code review into a scalable forensic service.
Modern development has entered the era of “vibe coding.”
Developers can describe a system and AI will generate thousands of lines of code instantly. The problem is that human review cannot scale at the same speed. This leads to what I call Orchestration Fraud: a system claims to do one thing while the underlying code tells another story. The Emerald Suite addresses this by shifting the engineer's role from writing code to governing code.
My mission was to shift from being a "bricklayer" who writes code to an "architect" who governs it.
1. I built the Emerald Suite as a production-ready, "Glass Box" system: every decision in the courtroom is explicit, traceable, and anchored in hard evidence rather than guesses.
2. Evaluation: Why a Digital Courtroom?
Traditional AI grading is broken. If you ask an LLM to rate code on a scale of 1 to 10, you get inconsistent "vibe" scores.
Instead of a single AI judge, the system creates structured disagreement between agents.
The Courtroom Model works better because LLMs are excellent at arguing specific positions. We use this to bridge the "Judicial Gap"—the space between seeing a file exists and judging its actual quality.
The Courtroom Roles:
The Prosecutor (Pessimistic): Philosophy is "Trust No One." Its mission is to find gaps, security flaws, and "Hallucination Liabilities."
The Defense (Optimistic): Focuses on the "Spirit of the Law." It identifies engineering intent and creative workarounds visible in the Git history.
The Tech Lead (Realistic): The pragmatic anchor. It evaluates code maintainability and whether the system is built to scale.
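To make these opposing lenses machine-checkable, each judge can be bound to a typed opinion schema. Below is a minimal sketch assuming Pydantic and a LangChain-style chat model; the `JudicialOpinion` fields and prompt wording are illustrative, not the project's exact code:

```python
from pydantic import BaseModel, Field

class JudicialOpinion(BaseModel):
    """Typed verdict returned by each judge (illustrative schema)."""
    judge: str
    score: int = Field(ge=1, le=5, description="Rubric score, 1 to 5")
    argument: str

# Conflicting system prompts give each judge a distinct lens.
PROSECUTOR_PROMPT = (
    "You are the Prosecutor. Trust no one. Assume every claim is "
    "unsupported until the evidence proves it. Cite exact gaps."
)
DEFENSE_PROMPT = (
    "You are the Defense. Argue the spirit of the law: reward intent, "
    "effort, and creative workarounds visible in the Git history."
)

# With a LangChain chat model, the schema would be enforced like this:
# judge_llm = llm.with_structured_output(JudicialOpinion)
# opinion = judge_llm.invoke([("system", PROSECUTOR_PROMPT), ("user", evidence)])

# Offline demonstration: the schema rejects out-of-range "vibe" scores.
opinion = JudicialOpinion(judge="Prosecutor", score=1,
                          argument="No Pydantic state found.")
```

The structured output is what turns an argument into data: a judge cannot return a 7/10 "vibe" because the schema simply refuses to validate it.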
Architecture: Hierarchical multi-agent system (“Digital Courtroom”) with Detectives (RepoInvestigator, DocAnalyst, VisionInspector), Judges (Prosecutor, Defense, Tech Lead), and Chief Justice synthesizing final verdicts. Detectives collect structured evidence, Judges analyze it independently, Chief Justice resolves conflicts and produces the audit report.
Tools & Frameworks: Python, LangGraph for multi-agent orchestration, Pydantic for typed state, AST parsing for code analysis, RAG-lite PDF parsing for reports, sandboxed git clone using tempfile.
Installation:
```
git clone https://github.com/hydropython/forensic-swarm-auditor.git
cd forensic-swarm-auditor
uv install
cp .env.example .env
```
Running the Auditor: Provide a GitHub repo URL and a PDF report. The system outputs a structured audit report (Markdown/PDF) with scores, judge opinions, and remediation steps.
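As a rough sketch of what an invocation might look like (the state keys and entry point here are assumptions, not the project's documented interface):

```python
# Hypothetical initial state passed into the compiled LangGraph app.
# Key names are illustrative assumptions.
initial_state = {
    "repo_url": "https://github.com/example/target-repo",
    "report_pdf_path": "submission_report.pdf",
    "evidence": [],   # filled by the detective layer
    "opinions": [],   # filled by the judicial layer
}

# With the compiled graph (here called `forensic_app`), the audit
# would run end to end with a single invoke:
# final_state = forensic_app.invoke(initial_state)
# print(final_state["final_report"])
```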

Start & Sandbox: The process begins by initializing a forensic sandbox where the repository and related files are isolated for safe analysis. The AgentState Contract maintains typed, structured state throughout the workflow.
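A minimal sketch of what such a typed state contract can look like in LangGraph, using `TypedDict` with `Annotated` reducers; the field names are illustrative:

```python
import operator
from typing import Annotated, List, TypedDict

class AgentState(TypedDict):
    """Shared courtroom state; field names here are illustrative."""
    repo_url: str
    # Annotated reducers tell LangGraph how to merge updates from
    # parallel branches: operator.add appends rather than overwrites,
    # so fan-out detectives never clobber each other's evidence.
    evidence: Annotated[List[dict], operator.add]
    opinions: Annotated[List[dict], operator.add]

# The reducer LangGraph applies when two branches both return evidence:
merged = operator.add([{"source": "repo_detective"}],
                      [{"source": "doc_analyst"}])
```

Without the reducer, the last detective to finish would silently overwrite everyone else's evidence; with it, parallel updates accumulate.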
Dispatcher & Detectives: A dispatcher assigns tasks to multiple forensic detectives in parallel:
Repo Detective: Examines the repository structure and code.
Doc Analyst: Reviews associated documentation.
Vision Inspector: Processes visual evidence or images.
Aggregation: All detective outputs feed into the Clerk Aggregator, which applies min-max logic to evaluate whether evidence is complete or needs further review.
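The min-max gate can be sketched as follows; the threshold values and the `confidence` field are assumptions for illustration, not the project's exact logic:

```python
def clerk_aggregate(evidence: list,
                    min_conf: float = 0.3,
                    max_conf: float = 0.9) -> str:
    """Decide whether collected evidence is complete enough for trial."""
    if not evidence:
        return "RETRY"
    confidences = [e.get("confidence", 0.0) for e in evidence]
    # Min-max gate: the weakest item must clear the floor AND at least
    # one item must be strong; a single confident finding alone is not
    # enough if other evidence is near-worthless.
    if min(confidences) >= min_conf and max(confidences) >= max_conf:
        return "PROCEED_TO_COURT"
    return "RETRY"

verdict = clerk_aggregate([
    {"source": "repo_detective", "confidence": 0.95},
    {"source": "doc_analyst", "confidence": 0.40},
])
```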
Judicial Review: When evidence meets thresholds, the case moves to the Judicial Courtroom for parallel evaluation by the Prosecutor, Defense, and Tech Lead, each forming independent judgments.
Chief Justice Synthesis: The Chief Justice deterministically synthesizes all inputs, resolves conflicts, and decides the final audit outcome.
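A hedged sketch of that deterministic synthesis; the median rule and the 2/5 security cap below are illustrative stand-ins for the project's actual constitution:

```python
def synthesize(prosecutor: int, defense: int, tech_lead: int,
               security_flaw_confirmed: bool = False) -> int:
    """Deterministic conflict resolution: no LLM call, just rules."""
    # Median rule: one outlier judge cannot dominate the verdict.
    scores = sorted([prosecutor, defense, tech_lead])
    verdict = scores[1]
    # Security Override: a confirmed flaw caps the final score,
    # no matter how generous the Defense was.
    if security_flaw_confirmed:
        verdict = min(verdict, 2)
    return verdict

final = synthesize(prosecutor=1, defense=3, tech_lead=2)
```

Keeping this step in plain Python rather than a fourth LLM call is what makes the verdict reproducible: the same opinions always yield the same outcome.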
Report Generation: The workflow ends with the Report Generator, producing a structured audit report summarizing findings, scores, and recommendations.
This table summarizes the specialized roles, responsibilities, and tools for each agent within the Digital Courtroom (LangGraph State Machine).
| Layer | Agent | Role & Responsibility | Primary Tools & Protocols |
|---|---|---|---|
| Detective | Repo Investigator | Code Forensic: Verifies Pydantic state, AST-based graph wiring, and Git history. | ast module, git clone, tempfile (sandboxing), pathlib |
| Detective | Doc Analyst | Document Forensic: Cross-references PDF report claims against repository facts. | Docling / PyMuPDF, RAG-lite vector search |
| Detective | Vision Inspector | Visual Forensic: Validates that architectural diagrams match implemented logic. | Gemini Pro Vision / GPT-4o, pdf_image_extractor |
| Judicial | The Prosecutor | Critical Lens: Scrutinizes evidence for "Vibe Coding" and security flaws. | .with_structured_output(), "Trust No One" prompt |
| Judicial | The Defense | Optimistic Lens: Highlights effort, engineering process, and conceptual depth. | .with_structured_output(), "Spirit of the Law" prompt |
| Judicial | The Tech Lead | Pragmatic Lens: Evaluates technical debt and architectural soundness. | .with_structured_output(), "Production-Ready" prompt |
| Supreme | Chief Justice | Synthesis Engine: Resolves judicial conflict using deterministic rules. | Hardcoded Python logic (Security & Fact Overrides) |
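As an example of the Repo Investigator's AST-based verification, the sketch below lists classes that inherit from `BaseModel` without ever executing the audited code; the helper name is hypothetical:

```python
import ast

def find_pydantic_models(source: str) -> list:
    """Statically list classes inheriting from BaseModel.

    Parsing the AST lets the detective verify state contracts in an
    untrusted repository without importing or running its code.
    """
    tree = ast.parse(source)
    models = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            bases = [b.id for b in node.bases if isinstance(b, ast.Name)]
            if "BaseModel" in bases:
                models.append(node.name)
    return models

sample = (
    "from pydantic import BaseModel\n"
    "class AgentState(BaseModel):\n"
    "    repo_url: str\n"
)
models = find_pydantic_models(sample)
```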
::youtube[Title]{#https://www.youtube.com/watch?v=3MjdDmIcL3M&t=1s}
The beauty of the courtroom is that it turns disagreement into data. For the Engineering Chronology and Swarm Resilience criteria, I saw a fascinating clash between the judges.
This friction is what makes the Remediation Plan so clear. Because the Prosecutor focused so heavily on the "fail" of my path strings, the system produced a specific instruction: “Replace all raw string path concatenations with pathlib.Path objects.” This isn't just a generic suggestion; it’s a fix born from a trial.
Generated: 2026-02-27 14:11
The evaluation was performed using a hierarchical swarm of specialized agents operating
in a digital courtroom paradigm.
DEVELOPING ENGINEER - Basic implementation with significant gaps.
- Score: 2/5. Dissent: All judges largely agreed.
- Score: 2/5. Dissent: All judges largely agreed.
- Score: 2/5. Dissent: Defense gave a high score, but the evidence shows gaps.
- Score: 2/5. Dissent: All judges largely agreed.
Defense (3/5): While the Forensic Accuracy (Codebase) criterion is not fully met, I argue that the effort and intent behind the work should be rewarded. The codebase lacks production-grade engineering, but it's clea...

Prosecutor (1/5): Fundamental failure to meet the rubric criterion. The evidence does not demonstrate production-grade engineering or Pydantic State models in 'src/graph.py' or 'src/state.py'. Additionally, there is no...

Tech Lead (2/5): The codebase lacks production-grade engineering and Pydantic State models in 'src/graph.py' or 'src/state.py'. The absence of these models makes it difficult to verify the accuracy of the forensic ana...
Defense (3/5): While the documentation for Forensic Accuracy (Documentation) may not be exhaustive, it demonstrates a good understanding of theoretical concepts such as Dialectical Synthesis and Metacognition. The m...

Prosecutor (2/5): The Forensic Accuracy (Documentation) criterion is not met due to significant gaps and omissions. The evidence provided does not demonstrate a thorough scan of the PDF for theoretical depth, nor does ...

Tech Lead (2/5): The code does not appear to be functional in terms of forensic accuracy. The error message 'Unable to get page count' suggests that the image extraction functionality is broken. Additionally, the lack...
Defense (4/5): While the submission does not explicitly demonstrate distinct, conflicting system prompts for the Prosecutor, Defense, and Tech Lead personas, it shows a deep understanding of key concepts in the theo...

Prosecutor (1/5): The rubric criterion requires distinct, conflicting system prompts for Prosecutor, Defense, and Tech Lead personas. However, the provided evidence does not meet this requirement as there is no mention...

Tech Lead (2/5): The code does not meet the requirements for Judicial Nuance & Dialectics. The error in image extraction and lack of poppler installation prevent the system from functioning as intended. Additionally, ...
Defense (3/5): Although the LangGraph StateGraph definition is incomplete, I find merit in the effort to explore parallel branches and conditional edges. The absence of fan-out for Judges and Detectives can be mitig...

Prosecutor (1/5): Fundamental failure to define StateGraph. No parallel branches (fan-out) for Judges and Detectives. No conditional edges handling 'Evidence Missing' or 'Node Failure' scenarios. This is a fundamental ...

Tech Lead (2/5): The LangGraph StateGraph definition is incomplete and does not demonstrate the use of parallel branches (fan-out) for Judges and Detectives. Additionally, there are no conditional edges that handle 'E...
This audit was conducted using a three-layer agent swarm:
Detective Layer: Specialized forensic agents (RepoInvestigator, DocAnalyst, VisionInspector)
collected objective evidence through AST parsing, git history analysis, and document verification.
Judicial Layer: Three distinct judges analyzed the evidence through different lenses (critical, optimistic, and pragmatic).
Supreme Court: The Chief Justice resolved conflicts using deterministic rules such as the security and fact overrides.
Report generated by Automaton Auditor v2.0
Forensic Swarm Auditor: Professional-Grade Maintenance & Support
This document outlines the architectural hardening, maintenance protocols, and operational visibility implemented to ensure the long-term success of the Automaton Auditor within an intranet workstation environment.
Current Version: 1.1.0 (Production-Ready)
Support Level: Active development for FDE Challenge Week 2.
Support Status: Maintained by Addisu Taye Dadi. The system is designed for high-frequency auditing of internal repositories and project documentation.
Update Protocol: The system utilizes a decoupled logic layer. To update audit standards or grading weights, users modify config/rubric.json without requiring a full code redeployment.
```
// config/rubric.json - Maintenance Layer
{
  "version": "1.1.0",
  "criteria": {
    "C1_STATE_RIGOR": {
      "weight": 0.4,
      "description": "Verification of Pydantic models and operator.add reducers."
    },
    "C2_SECURITY": {
      "weight": 0.6,
      "description": "Validation of JWT implementation in intranet endpoints."
    }
  }
}
```
To ensure "trust and usability," the system implements a persistence layer using a SqliteSaver checkpointer. This allows the graph to "remember" its state, ensuring that if the workstation reboots or the intranet connection drops, the audit can be resumed from the last successful node.
```python
from langgraph.checkpoint.sqlite import SqliteSaver
import sqlite3

# Physical data persistence for long-running forensic audits
conn = sqlite3.connect("audit_checkpoints.db", check_same_thread=False)
memory = SqliteSaver(conn)

# Compiling with checkpointer satisfies 'Reliability' feedback
forensic_app = builder.compile(checkpointer=memory)
```
Every decision made by the Detective Swarm and the Judicial Layer is recorded in a transparent forensic trail.
Audit Logging: All agent reasoning (Prosecutor, Defense, Tech Lead) is captured in logs/audit_trace.log.
Pre-Flight Health Checks: An automated node validates local LLM connectivity (GPT-Mini/Ollama) and workstation resources before execution.
```python
import requests

def health_check_node(state: AgentState):
    # Validates GPT-Mini connectivity and workspace permissions
    response = requests.get(f"{LLM_BASE_URL}/api/tags")
    if response.status_code == 200:
        return {"global_verdict": "Environment Healthy"}
    raise RuntimeError("Infrastructure Check Failed: Check local LLM endpoint.")
```
The dispatcher fans out to all three detectives in parallel, and each reports back to the aggregator:

```python
# Fan-out: the dispatcher launches all three detectives in parallel.
builder.add_edge("dispatcher", "repo_detective")
builder.add_edge("dispatcher", "docs_detective")
builder.add_edge("dispatcher", "vision_detective")

# Fan-in: every detective reports back to the Clerk Aggregator.
builder.add_edge("repo_detective", "aggregator")
builder.add_edge("docs_detective", "aggregator")
builder.add_edge("vision_detective", "aggregator")
```
Dockerfile: Packages Python 3.11, Git, and all dependencies for one-click deployment.
Volumes: Maps logs/ and audit_checkpoints.db to the host machine to ensure audit history is preserved outside the container lifecycle.
This project proves that reproducibility is the price of scale. We have transformed a manual bottleneck into an automated service. By forcing the AI to argue against itself, we eliminated the "agreement trap" and surfaced a critical security flaw I had missed. The Emerald Suite now "thinks about thinking" to ensure code is structurally verified and secure.
Market leadership in the AI era will be defined by the integrity of the systems that govern code, not the volume of code produced.
I asked myself:
What happens if this system has to work 10x faster and 10x larger by tomorrow?
Three parts of the foundation will crack first. Our future work focuses on these limits.
From File Dumps to Intelligent Sharding
At 10x scale, sending a full repository dump to an agent causes Context Saturation. Precision drops. I will implement Smarter Sharding. This means using AST pre-scanners to send only "relevant code clusters" to specific agents, keeping context windows lean and accuracy high.
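A toy sketch of such a pre-scanner; real relevance scoring would be richer than the keyword match used here:

```python
import ast

def shard_for_agent(source: str, keywords: set) -> str:
    """Keep only top-level functions/classes relevant to one agent.

    Instead of shipping the whole file into the context window, the
    pre-scanner parses the AST and retains definitions whose names
    match the agent's focus keywords. The keyword match is a
    simplification of the planned relevance scoring.
    """
    tree = ast.parse(source)
    kept = [
        node for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef))
        and any(k in node.name.lower() for k in keywords)
    ]
    return "\n\n".join(ast.unparse(node) for node in kept)

code = "def verify_jwt(token): ...\n\ndef render_homepage(): ...\n"
# A security-focused judge receives only the auth-related cluster.
security_shard = shard_for_agent(code, {"jwt", "auth", "token"})
```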
From Local Tempfiles to MicroVM Orchestration
Cloning 10x more repos concurrently on a local machine creates resource contention and security risks. Our current tempfile strategy is fast but has no resource limits. We will migrate to MicroVM isolation (e.g., Firecracker or gVisor). This provides a "Dedicated Clean Room" for every audit with hard caps on CPU and RAM.
From Multimodal Vision to 2M-Token Gating
Complex system diagrams for large architectures can exceed 1M-token windows when combined with full repository maps. I plan to upgrade the VisionInspector to Gemini 2.0 Pro, enabling a Deep-Gating mechanism that lets the auditor analyze 1,000-page technical manuals alongside full AST dumps in a single inference pass, with different, optimized LLM models assigned to different nodes.
Automated Constitution Updates
Currently, the human lead is the bottleneck for updating the audit rubric. We will explore Recursive Governance. Agents will monitor real-world attack patterns and peer-audit results to propose new "Statutes" for the courtroom, ensuring the auditor evolves as fast as the code it governs.