Haozhe Zhang | BenchCAD

BenchCAD is a comprehensive benchmark for evaluating how well large language models generate programmatic CAD code. It contains over 17,000 execution-verified CadQuery programs across 106 industrial part families — gears, springs, drills, brackets and more — and probes models across four tasks: visual question answering, code analysis, image-to-code conversion, and code editing.

Findings: current models often recover the coarse outer geometry of a part but fail to produce faithful parametric CAD programs. Common failure modes include missing structural details and oversimplified operations, and generalization to unseen part families remains limited even after fine-tuning.

Project page · arXiv:2605.10865

Cited in Anthropic’s System Card

Anthropic used BenchCAD in their official Claude Fable 5 & Claude Mythos 5 System Card (§8.16.4), evaluating their frontier models on the Vision2Code (image-to-code) task over all 17,874 published files. They also ran a Python-tools ablation, in which Vision2Code performance rose from 0.379 to 0.650 voxel IoU once the model could render and visually verify its output.
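
For readers unfamiliar with the metric, voxel IoU compares two solids by the fraction of occupied grid cells they share: the size of the intersection divided by the size of the union. The sketch below shows only this generic definition on toy data. The `voxel_iou` and `box` helpers are hypothetical names, and the grid resolution, alignment, and normalization used by BenchCAD or the System Card are not stated here, so nothing in it should be read as the benchmark's actual scoring code.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def voxel_iou(pred: Iterable[Voxel], target: Iterable[Voxel]) -> float:
    """Intersection-over-union of two sets of occupied voxel indices."""
    pred_set: Set[Voxel] = set(pred)
    target_set: Set[Voxel] = set(target)
    union = pred_set | target_set
    if not union:
        # Both shapes are empty. Scoring this as a perfect match is an
        # assumption of this sketch, not a documented BenchCAD rule.
        return 1.0
    return len(pred_set & target_set) / len(union)


def box(x0: int, x1: int, y0: int, y1: int, z0: int, z1: int) -> Set[Voxel]:
    """Hypothetical helper: occupied voxels of an axis-aligned box (half-open ranges)."""
    return {
        (x, y, z)
        for x in range(x0, x1)
        for y in range(y0, y1)
        for z in range(z0, z1)
    }


reference = box(0, 4, 0, 4, 0, 4)  # 4 x 4 x 4 cube, 64 voxels
candidate = box(0, 4, 0, 4, 0, 2)  # bottom half of the cube, 32 voxels
score = voxel_iou(candidate, reference)  # 32 shared / 64 in the union = 0.5
```

In this toy case a candidate that reproduces only half of the reference volume scores 0.5, which gives a rough sense of scale for the reported 0.379 and 0.650 figures.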

BenchCAD Vision2Code scores, full benchmark

BenchCAD Vision2Code scores with and without Python tools

Figures: Anthropic, Claude Fable 5 & Claude Mythos 5 System Card (June 2026), §8.16.4.