BenchCAD — evaluating LLMs on the part of code where the output is physical
Most LLM code evals live in two regimes: tasks scored like text, and “does this compile.” The compile bar is low: you can write a Python script that runs perfectly and accomplishes nothing useful. CAD is one of the very few code domains where a wrong program produces a wrong physical object that someone has to machine and then throw away. That’s the gap BenchCAD tries to fill.
Concretely, the dataset is 17k+ CadQuery programs across 106 industrial part families. Every program executes. Every part has been rendered. These aren’t synthetic shapes; they’re the kind of parts that show up on a real bill of materials: gears, springs, drills, brackets, impellers, fan shrouds. For each one we set up four tasks (a representative program follows the list):
- show a rendered part, ask a question about it (VQA);
- show a program, ask the model to describe the part (code QA);
- show a part, ask the model to write the program (vision-to-code);
- show a program plus a new constraint, ask the model to edit it.
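To make the dataset concrete, here is a minimal sketch of the kind of program involved. This is an invented example in the canonical CadQuery style, not an actual dataset entry:

```python
import cadquery as cq

# Invented, representative part: a rectangular bracket with four
# bolt holes and rounded vertical edges.
length, width, thickness = 80.0, 50.0, 8.0
hole_dia, hole_inset = 5.0, 8.0

result = (
    cq.Workplane("XY")
    .box(length, width, thickness)
    .faces(">Z")                        # select the top face
    .workplane()
    .rect(length - 2 * hole_inset,      # construction rect marks hole centers
          width - 2 * hole_inset,
          forConstruction=True)
    .vertices()
    .hole(hole_dia)                     # through-hole at each corner
    .edges("|Z")
    .fillet(3.0)                        # round the vertical edges
)
```

All four tasks are views of the same object: VQA sees the render, code QA sees the program text, vision-to-code has to recover the program from the render, and the edit task perturbs the program under a new constraint.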
No reward model, no human juries. The reward is whether the generated code executes and the resulting render matches the target.
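As a sketch of what that reward might look like in code: the convention that programs bind their part to a variable named `result`, and the use of a volumetric IoU as the match score, are both assumptions here; the paper defines the actual render-matching procedure.

```python
import cadquery as cq

def reward(program_text: str, target: cq.Shape) -> float:
    """Execute-and-match reward (sketch). The `result` binding convention
    and the IoU metric are assumptions, not BenchCAD's actual protocol."""
    ns: dict = {"cq": cq}
    try:
        exec(program_text, ns)            # gate 1: the program must run
    except Exception:
        return 0.0
    wp = ns.get("result")
    if not isinstance(wp, cq.Workplane):  # gate 2: it must build a solid
        return 0.0
    pred = wp.val()
    inter = pred.intersect(target).Volume()
    union = pred.fuse(target).Volume()
    return inter / union if union else 0.0  # gate 3: geometry must match
```

The gate structure is the point: a program that fails to execute or builds the wrong geometry scores zero, so there is no partial credit for plausible-looking text.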
Three things came out of running frontier models on this that I didn’t expect:
Models can fake outer geometry but not structure. Silhouettes come out roughly right almost across the board. Open up the part and the engineering falls apart — threads vanish, fillets disappear, draft angles round to zero. The cartoon is right; the part isn’t.
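An illustrative pair (invented, not actual model output) of what this looks like in code: both programs read as “a block with a hole” from across the room, but only the first would survive contact with a machinist.

```python
import cadquery as cq

# What the part actually needs (illustrative):
faithful = (
    cq.Workplane("XY")
    .box(40, 40, 20)
    .faces(">Z").workplane()
    .cboreHole(6.0, 11.0, 4.0)   # counterbore so the bolt head sits flush
    .edges("|Z").fillet(2.0)     # fillets relieve stress concentrations
)

# What a silhouette-right generation tends to produce: the counterbore
# degrades to a plain hole and the fillets are gone.
cartoon = (
    cq.Workplane("XY")
    .box(40, 40, 20)
    .faces(">Z").workplane()
    .hole(6.0)
)
```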
CAD work is mostly editing, and that’s where models are weakest. Real engineers rarely author a part from scratch: they start from an existing program and change three lines. The vision-to-code task is hard, but edit-to-spec is harder, and it’s the task that matters most in practice.
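A hypothetical instance of the edit task, to show its shape. Given a working program and the constraint “re-spec the fasteners from M5 to M6”, the correct edit is two numbers; everything else must be left alone:

```python
import cadquery as cq

# Hypothetical edit-to-spec instance. Constraint: fasteners go M5 -> M6.
hole_dia = 6.6    # was 5.5: M5 clearance becomes M6 clearance
cbore_dia = 11.0  # was 10.0: now clears an M6 socket-head cap screw

result = (
    cq.Workplane("XY")
    .box(60, 60, 15)
    .faces(">Z").workplane()
    .cboreHole(hole_dia, cbore_dia, 6.0)
)
```

Scoring doesn’t change: execute the edited program and match it against the new target. What makes editing harder than authoring is the precision it demands, finding which parameters encode the constraint without disturbing the rest of a working program.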
Part families don’t transfer. Fine-tuning on gears helps with gears. It helps surprisingly little with springs, far less than intuition would predict. The structure that engineers use to cluster parts isn’t the structure the model has learned.
If you do mechanical design and want to run a model on your own pipeline, the leaderboard, submission form, and full task setup are at the project page. The paper has all the gory details: arXiv:2605.10865.