GitHub Security Lab's AI Framework Found 80+ Vulnerabilities
GitHub Security Lab's open-source AI framework found 80+ real vulnerabilities by teaching LLMs to understand threat models first. Here's how it works and what it found.
TL;DR
- GitHub Security Lab's open-source taskflow agent found 80+ real vulnerabilities in production codebases using LLM-powered auditing
- The framework breaks auditing into threat modeling, issue suggestion, and rigorous verification stages to minimize hallucinations
- LLMs excel at finding logic bugs like authorization bypasses — 25% hit rate on business logic issues vs 4% on RCE
- You can run it yourself: requires GitHub Copilot license, takes 1-2 hours on medium repos, outputs SQLite results
The Big Picture
Static analysis tools are good at finding technical vulnerabilities. They're terrible at understanding context. A reverse proxy flagged for SSRF? That's the entire point of a reverse proxy. A command injection in a CI sandbox? That's by design.
GitHub Security Lab spent months building an AI-powered auditing framework that actually understands threat models. The results are striking: 80+ disclosed vulnerabilities across 40+ repositories, including a password bypass in Rocket.Chat that let anyone log in with any password, and systematic cart logic bugs exposing PII across multiple ecommerce platforms.
The framework is open source. It runs on your private repos. And unlike traditional SAST tools that drown you in false positives, this approach starts with threat modeling — teaching the LLM what your code is supposed to do before asking what could go wrong.
The architecture is clever: break auditing into stages, use a database to pass context between tasks, and apply strict verification criteria at each step. No single massive prompt. No hallucinated vulnerabilities. Just a systematic workflow that mimics how experienced security researchers actually work.
How It Works
The seclab-taskflow-agent framework runs YAML-defined taskflows — sequences of LLM tasks that pass results through a SQLite database. Think of it as a state machine for security auditing.
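A minimal Python sketch of that pattern, with each stage reading its predecessors' output from SQLite rather than sharing in-memory context. The table schema and stage functions here are illustrative stand-ins, not the framework's actual tables or prompts:

```python
import sqlite3

def run_taskflow(stages, db_path=":memory:"):
    """Run stages in order; each stage sees only what prior stages persisted."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_results (stage TEXT, output TEXT)"
    )
    for name, task in stages:
        # Fresh context per stage: input is whatever is in the database.
        prior = conn.execute("SELECT stage, output FROM task_results").fetchall()
        conn.execute("INSERT INTO task_results VALUES (?, ?)", (name, task(prior)))
        conn.commit()
    return conn.execute("SELECT stage, output FROM task_results").fetchall()

# Toy stages standing in for LLM calls.
stages = [
    ("threat_model", lambda prior: "components: api, auth, admin"),
    ("suggest",      lambda prior: f"suggestions based on {len(prior)} prior rows"),
    ("verify",       lambda prior: f"verified against {len(prior)} prior rows"),
]

results = run_taskflow(stages)
```

The database is what makes the "fresh context" trick at the verification stage possible: a later task can be handed only the rows it should see.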
Stage one is threat modeling. The LLM divides your repository into components by functionality, identifies entry points exposed to untrusted input, maps web endpoints with HTTP methods and paths, and catalogs what normal users can do. This isn't busywork — it establishes the security boundary. A command injection in a CLI tool designed to execute user scripts isn't a vulnerability. The same bug in an authentication handler absolutely is.
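The output of stage one might be modeled roughly like this. The field names are assumptions for illustration, not the framework's actual output format:

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """One functional unit of the repository, as carved out by stage one."""
    name: str
    entry_points: list = field(default_factory=list)  # code reachable from untrusted input
    endpoints: list = field(default_factory=list)     # (HTTP method, path) pairs
    user_capabilities: list = field(default_factory=list)  # what normal users can do

auth = Component(
    name="auth",
    entry_points=["login_handler"],
    endpoints=[("POST", "/api/login")],
    user_capabilities=["log in", "reset password"],
)
```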
Stage two is issue suggestion. The LLM reviews each component and suggests vulnerability types based on exposure to untrusted input and intended functionality. This stage is deliberately loose — the goal is breadth, not accuracy. The LLM is explicitly told not to audit yet, just brainstorm focus areas.
Stage three is rigorous verification. Fresh context, strict criteria. The suggestions from stage two are treated as unvalidated alerts from an external tool. The LLM must provide concrete file paths, line numbers, and realistic attack scenarios. Vague findings like "IDOR in user endpoints" get rejected. The prompt emphasizes: "It is ok to conclude that there is no security issue."
This three-stage design solves the hallucination problem. You can't ask an LLM "find any vulnerability anywhere" and expect useful results. But you can ask it to suggest areas of concern, then apply forensic-level scrutiny to each suggestion. The verification stage doesn't rubber-stamp the suggestions: it challenges them with fresh context and different criteria.
The framework also handles iteration. Security audits involve repeating the same analysis across dozens of components. The agent templates prompts and runs tasks asynchronously across all components, substituting component-specific details as it goes. You write the taskflow once, it scales to the entire codebase.
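The fan-out pattern can be sketched with stdlib asyncio: one prompt template, rendered once per component and executed concurrently. Everything here, including the template text and the stand-in for the model call, is illustrative:

```python
import asyncio

PROMPT_TEMPLATE = "Audit the {component} component. Entry points: {entry_points}."

async def audit_component(component, entry_points):
    """Render the template for one component; stands in for an LLM round-trip."""
    prompt = PROMPT_TEMPLATE.format(component=component, entry_points=entry_points)
    await asyncio.sleep(0)  # placeholder for the model call
    return component, prompt

async def audit_all(components):
    # Same task, templated per component, run concurrently.
    tasks = [audit_component(name, eps) for name, eps in components.items()]
    return dict(await asyncio.gather(*tasks))

components = {"auth": "login_handler", "cart": "checkout_handler"}
audits = asyncio.run(audit_all(components))
```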
What This Changes For Developers
The disclosed vulnerabilities tell the story. In Outline, a collaborative document platform, the agent found that document group membership endpoints authorized with "update" permission instead of "manageUsers" permission. A ReadWrite collaborator could grant themselves Admin access by adding a group they belonged to with Admin permissions. The bug had been there for years.
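Stripped to its essence, the bug class looks like this: gating a privileged action on a weaker permission than it requires. All names below are hypothetical sketches, not Outline's actual code:

```python
# Which permissions each role holds (illustrative roles and permission names).
PERMISSIONS = {
    "read_write_collaborator": {"read", "update"},
    "admin": {"read", "update", "manageUsers"},
}

def can(role, permission):
    return permission in PERMISSIONS.get(role, set())

def add_group_to_document_buggy(role):
    # BUG: membership changes gated on "update", which any ReadWrite
    # collaborator holds, instead of the stronger "manageUsers".
    return can(role, "update")

def add_group_to_document_fixed(role):
    return can(role, "manageUsers")
```

A scanner matching taint patterns has nothing to flag here; spotting it requires knowing which permission the action is supposed to require.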
In WooCommerce, signed-in users could view all guest orders including names, addresses, and phone numbers. In Spree Commerce, unauthenticated users could enumerate addresses of all guest orders by incrementing a sequential ID. These weren't obscure edge cases — they were systematic authorization logic failures that traditional SAST tools missed because they don't understand user roles and intended access patterns.
The Rocket.Chat finding is particularly instructive. The agent traced a missing await keyword through multiple TypeScript files. The validatePassword function returned Promise<boolean>, but the caller didn't await it. Since a Promise is always truthy in JavaScript, the validation always passed when a bcrypt hash existed. Anyone could log in with any password in the microservices deployment.
This is what LLMs are good at: following control flow across files, understanding async behavior, and reasoning about what the code actually does versus what it's supposed to do. The agent's notes included the exact JSON payload to exploit the bug via Meteor's DDP protocol.
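The Rocket.Chat bug is in TypeScript, but the same class of bug exists in Python's asyncio, which makes for a compact demonstration: calling an async function without awaiting it returns a coroutine object, and objects are truthy by default, so the check always passes. A minimal sketch (not Rocket.Chat's code):

```python
import asyncio

async def validate_password(supplied, stored_hash):
    """Stand-in for the real hash comparison; always rejects here."""
    return False

def login_buggy(supplied, stored_hash):
    # BUG: missing await. validate_password(...) returns a coroutine
    # object, which is always truthy, so validation always "passes".
    result = validate_password(supplied, stored_hash)
    passed = bool(result)
    result.close()  # suppress the "never awaited" warning in this demo
    return passed

def login_fixed(supplied, stored_hash):
    return asyncio.run(validate_password(supplied, stored_hash))
```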
The data backs this up. Across 40 repositories, the agent suggested 1,003 potential issues. After the audit stage, 139 were marked as having vulnerabilities. After manual review and deduplication: 19 were high-impact vulnerabilities worth reporting, 52 were low-severity issues, and 20 were false positives. That's a 21% true positive rate for serious vulnerabilities — far better than typical SAST tool output.
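The headline rate follows directly from those counts:

```python
# Deduplicated findings from the audit stage, per the reported numbers.
high, low, false_pos = 19, 52, 20
total = high + low + false_pos  # 91
rate = high / total             # ~0.209, i.e. the quoted 21%
```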
More interesting: the breakdown by vulnerability type. Business logic issues had a 25% vulnerability rate. IDOR and access control issues: 15.8%. Authentication issues: 16.5%. Compare that to remote code execution at 4.2% or SQL injection at 0%. The LLM excels at understanding intended behavior and spotting deviations. It struggles with memory safety (only 3 suggestions total, mostly because the tested repos used memory-safe languages).
Try It Yourself
The framework is open source and runs in a GitHub Codespace. You need a GitHub Copilot license — the prompts use premium model requests.
```shell
# Start a codespace from the seclab-taskflows repo
# Wait for initialization (a few minutes)
./scripts/audit/run_audit.sh myorg/myrepo
```
Expect 1-2 hours for a medium-sized repository. Results land in SQLite. Open the audit_results table and filter for rows with has_vulnerability checked.
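Pulling the flagged rows is a one-liner with stdlib sqlite3. The audit_results table and has_vulnerability column come from the description above; the other column names, and the throwaway in-memory database standing in for the real results file, are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # point at the real results .db instead

# Demo schema and rows so the query below has something to run against.
conn.execute(
    "CREATE TABLE audit_results (component TEXT, finding TEXT, has_vulnerability INTEGER)"
)
conn.executemany(
    "INSERT INTO audit_results VALUES (?, ?, ?)",
    [
        ("auth", "missing await on password check", 1),
        ("cart", "guest order enumeration", 1),
        ("docs", "no issue found", 0),
    ],
)

# Keep only the rows the audit stage marked as real findings.
flagged = conn.execute(
    "SELECT component, finding FROM audit_results WHERE has_vulnerability = 1"
).fetchall()
```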
Important: LLMs are non-deterministic. Run the taskflows multiple times. Try different models (GPT-4o, Claude Opus). A second run can surface entirely different issues.
For private repos, you'll need to modify the codespace configuration to grant access. The default setup only works on public repositories.
If you want to write your own taskflows, the seclab-taskflows repo includes examples. The YAML structure is straightforward: define tasks, specify dependencies, template prompts with component-specific variables. The agent handles execution and database management.
The Bottom Line
Use this if you maintain a multi-user web application with complex authorization logic. The framework found systematic bugs in ecommerce carts, document collaboration tools, and chat platforms — all places where "who can do what" matters more than "is this input sanitized."
Skip it if you're hunting memory safety bugs or need exhaustive coverage of technical vulnerabilities like XSS and SQLi. Traditional fuzzers and SAST tools still win there. The agent found zero SQL injection vulnerabilities across 40 repos, not because they don't exist, but because LLMs aren't optimized for that pattern matching.
The real opportunity is in logic bugs that require understanding the application's security model. GitHub Security Lab has proven the approach works at scale — 80+ disclosed vulnerabilities, many critical, with a true positive rate that makes manual review feasible. That's the bar: not "finds everything," but "finds enough high-impact issues that a human can actually triage the results."
The risk is cost and quota consumption. Running these taskflows burns through GitHub Copilot premium model requests fast. Budget accordingly. And remember: this is a research tool, not a production security scanner. Manual verification is mandatory.
If you're building agentic workflows for security, study this architecture. The three-stage design — threat modeling, suggestion, verification — is a template for any LLM task where accuracy matters more than speed.
Source: GitHub Blog