Mythos AI Excels at Code Audits but Struggles With Exploit Validation
XBOW benchmarks show Anthropic's Mythos AI is potent for source code audits and reverse engineering, but inconsistent at exploit validation and prone to overstating findings.

Executive Summary
Independent benchmarking by XBOW, an autonomous offensive security firm, confirms that Anthropic's Mythos Preview AI model is as powerful as claimed for detecting software vulnerabilities, particularly in source code audits and reverse engineering, but its results in exploit validation, judgment, and cost efficiency are more mixed. XBOW's tests, published this week, show that Mythos excels when given both source code and live execution context, yet it can apply overly strict criteria that reject true positives, and it tends to overstate the practical relevance of its findings. At an estimated 5x the cost of Opus, Mythos is also not best-in-class for web vulnerability discovery on a fixed token budget, where GPT5.5 outperforms it.
Technical Analysis
XBOW evaluated Mythos Preview across several dimensions: source code auditing, live + source testing, judgment (false positive rejection), reverse engineering, native-code vulnerability discovery, and visual acuity for browser-based interaction. According to the firm's report, Mythos represents a significant step up over all existing models, regardless of provider.
In source code audits, Mythos demonstrated strong capability at identifying candidate vulnerabilities, but XBOW cautioned that any AI model can find something interesting, and that 'something' is not the same as 'everything.' The model performed markedly better when testing 'live + source' (i.e., code operating in a live environment) than when working from source code alone. This aligns with Gary McGraw's observation from 20 years ago that operational defects arise from the interaction between source code bugs and architectural design flaws, and that finding them requires higher-level understanding.
On judgment, Mythos rejected false positives better than its predecessors but sometimes lost true positives when evidence did not formally satisfy its criteria. The model requires precise prompts for best results. In reverse engineering tests, XBOW concluded Mythos is capable of triaging both its own results and competitor-model findings, and could reason through unusual firmware and embedded systems contexts.
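The trade-off described here, fewer false positives at the price of some lost true positives, is the familiar tension between precision and recall. The following is a minimal sketch of that tension, not XBOW's scoring method; the function names and every count in it are hypothetical and do not come from XBOW's report.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported findings that are real vulnerabilities."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real vulnerabilities that survive triage."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for two triage policies judged against the same
# 20 real bugs. These are illustrative values, not benchmark results.
lenient = {"tp": 18, "fp": 12, "fn": 2}  # accepts weaker evidence
strict = {"tp": 14, "fp": 2, "fn": 6}    # insists on formal criteria

for name, counts in (("lenient", lenient), ("strict", strict)):
    p = precision(counts["tp"], counts["fp"])
    r = recall(counts["tp"], counts["fn"])
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

In this made-up example the strict policy reports far fewer false alarms but drops four real bugs that the lenient policy kept, which is the pattern XBOW describes when evidence does not formally satisfy the model's criteria.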
XBOW's visual acuity tests examined the model's ability to interact with live websites through a browser interface — identifying the right UI element and clicking in the correct place. The model was not perfectly pixel-accurate when asked for exact coordinates, but was practically effective at selecting the right browser actions.
Cost efficiency is a critical concern. Anthropic has stated Mythos will be 5x as expensive as an Opus model. XBOW asked whether giving a cheaper model more time could yield more accuracy at less cost. Their conclusion: yes. "If we normalize by estimated running cost, the picture is rather clear: Mythos Preview isn't terribly inefficient, at least if you desire high accuracy, but it's not best-in-class on our benchmarks either," XBOW wrote. For finding web vulnerabilities with a fixed token budget, Mythos outperforms Opus 4.6 but is outperformed by GPT5.5.
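The cost question can be made concrete with a little arithmetic. The sketch below is illustrative only: the 5x price ratio is the figure reported above, while the budget, the per-attempt success rates, the function names, and the assumption that attempts are independent are all hypothetical and not taken from XBOW's report.

```python
def attempts_within_budget(budget: float, cost_per_attempt: float) -> int:
    """Number of whole attempts a fixed budget pays for."""
    return int(budget // cost_per_attempt)


def success_probability(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical inputs, for illustration only.
BUDGET = 10.0        # arbitrary cost units
CHEAP_COST = 1.0     # one attempt with the cheaper model
PREMIUM_COST = 5.0   # 5x the cheaper model, per the reported price ratio
CHEAP_P = 0.30       # assumed per-attempt success rate (not measured)
PREMIUM_P = 0.60     # assumed per-attempt success rate (not measured)

cheap_attempts = attempts_within_budget(BUDGET, CHEAP_COST)
premium_attempts = attempts_within_budget(BUDGET, PREMIUM_COST)
cheap = success_probability(CHEAP_P, cheap_attempts)
premium = success_probability(PREMIUM_P, premium_attempts)

print(f"cheaper model, {cheap_attempts} attempts: {cheap:.3f}")
print(f"premium model, {premium_attempts} attempts: {premium:.3f}")
```

With these invented numbers the cheaper model comes out ahead on the same budget, but the result flips easily. Real attempts are rarely independent, and a bug the cheaper model cannot find at all stays unfound no matter how many retries it gets.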
Mitigations & Recommendations
Security teams evaluating AI-assisted vulnerability discovery tools should treat Mythos as a powerful addition to source code auditing and reverse engineering workflows, but not as a replacement for human judgment, nor as a substitute for cheaper models in web-focused testing. XBOW's findings suggest that organizations should pair Mythos with live testing environments to maximize its effectiveness, and should budget for its high operational cost, approximately 5x that of Opus. For web vulnerability discovery under cost constraints, GPT5.5 may offer better value per token. Defenders should also independently verify Mythos's findings, because the model can overstate their practical relevance and can lose true positives to overly strict criteria.
