I think voice is about to become a primary software interface, and most people are still pricing it as a novelty. The reason that claim is hard to take seriously is that voice has been a graveyard — Siri, Alexa, Google Assistant, all built on rules and intent classifiers, all painful. What’s different now is that you can put a real reasoner behind the microphone. The product shape changes entirely once that’s possible.

We’ve been building krono on that bet — a voice agent for the markets, where you talk to it about positions, news, and trade ideas. Most of what we learned has very little to do with finance and a lot to do with what it actually takes to make voice feel alive.

The stack

A year of trying things landed us roughly here:

  • WebRTC for transport. No real alternative if you want browser-native low latency. We tried polling, we tried WebSockets — anything slower than WebRTC shows up as audible jitter the moment the user is on a flaky connection.
  • Silero VAD for voice activity. Small, MIT-licensed, runs in the browser. The naive “is there energy in the audio frame” approach falls apart on background music, traffic, breathing — Silero just works.
  • A semantic turn-detection model on top of VAD. This is the part most demos skip. VAD says “the audio is silent right now”; it does not say “the user finished their sentence.” If you respond on VAD alone, you cut people off mid-thought. A small classifier reads the partial transcript and decides “is this an end of turn.” Adds ~100 ms; kills the most painful failure mode.
  • Streaming STT → LLM → streaming TTS, all overlapped. User-perceived latency to first audio is the sum of each stage’s first-chunk latency (final STT partial, LLM first token, TTS first chunk), not the sum of the stages’ full runtimes. Build it as sequential awaits instead (transcribe everything, then complete, then synthesize) and you ship a slow product no matter how good your model is. Sketches of the VAD gate and the overlapped pipeline follow this list.
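
Two sketches to make that concrete, both illustrative rather than production code. First, the VAD gate, using the torch.hub API from the Silero repo on a recorded file (in the browser this would be the ONNX build of the same model; the filename is a placeholder):

```python
import torch

# Offline sketch of Silero VAD via the repo's torch.hub entry point.
model, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad")
(get_speech_timestamps, save_audio, read_audio,
 VADIterator, collect_chunks) = utils

vad = VADIterator(model)
wav = read_audio("mic_capture.wav", sampling_rate=16000)  # placeholder file

WINDOW = 512  # Silero expects 512-sample frames at 16 kHz
for i in range(0, len(wav) - WINDOW + 1, WINDOW):
    event = vad(wav[i:i + WINDOW], return_seconds=True)
    if event:  # {'start': 1.3} or {'end': 2.9} at speech boundaries
        print(event)
vad.reset_states()
```

Second, the overlapped hot path. Every function below is a toy stand-in (canned partials, canned tokens) so the control flow runs end to end; swap in real streaming STT/LLM/TTS clients and the shape stays the same. The point is that nothing downstream waits for anything upstream to finish:

```python
import asyncio
from typing import AsyncIterator

async def stt_stream() -> AsyncIterator[str]:
    """Toy STT: yields growing partial transcripts."""
    for partial in ["how's", "how's NVDA", "how's NVDA today"]:
        await asyncio.sleep(0.05)  # pretend decode latency
        yield partial

def is_end_of_turn(partial: str) -> bool:
    """Stub for the semantic turn classifier: 'finished their thought',
    a different question from VAD's 'the audio is silent right now'."""
    return partial.endswith("today")

async def llm_stream(prompt: str) -> AsyncIterator[str]:
    """Toy LLM: yields completion tokens one at a time."""
    for tok in ["NVDA ", "is ", "up ", "about ", "two ", "percent."]:
        await asyncio.sleep(0.05)  # pretend per-token latency
        yield tok

async def tts_stream(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Toy TTS: consumes tokens as they arrive and yields audio chunks
    immediately; it never waits for the full reply."""
    async for tok in tokens:
        yield tok.encode()  # stand-in for synthesized audio

async def handle_turn() -> None:
    # 1. Accumulate partials until the turn classifier fires.
    transcript = ""
    async for partial in stt_stream():
        transcript = partial
        if is_end_of_turn(transcript):
            break
    # 2. LLM tokens flow straight into TTS. First audio lands after
    #    first-token + first-chunk latency, not after the whole reply.
    async for chunk in tts_stream(llm_stream(transcript)):
        print("play:", chunk)  # stand-in for the outbound WebRTC track

asyncio.run(handle_turn())
```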

Why we trained a small model for news filtering

This is the design choice I get asked about the most. Standard answer: throughput and cost. Real answer: that, plus a determinism thing that’s easy to underrate.

The news firehose for US equities is thousands of headlines per minute. For each one we want a single yes/no: “is this material for any ticker the user cares about.” We are not going to pay a frontier model to read every wire. Even with batching, the latency budget doesn’t work — by the time the LLM finishes scoring, the headline is stale and the price has moved.

So we trained our own. ~100M-parameter encoder, fine-tuned on a few hundred thousand labeled headlines and the corresponding intraday price reaction. On this specific task it beats GPT-class models by a wide margin. It also has properties the big model doesn’t: it always returns a calibrated score, never refuses, never adds an editorial paragraph. At scale, the variance of a frontier model is its own failure mode.
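
For a feel of the hot path, here’s a minimal inference sketch with a Hugging Face-style sequence classifier. The checkpoint name is a placeholder, not our actual model, and it assumes a single-logit binary head; a two-logit head would take a softmax instead:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "your-org/materiality-classifier"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

@torch.inference_mode()
def materiality_scores(headlines: list[str]) -> list[float]:
    """Batch-score headlines. Same output shape every time: one
    probability per input, no refusals, no editorial paragraph."""
    batch = tokenizer(headlines, padding=True, truncation=True,
                      return_tensors="pt")
    logits = model(**batch).logits
    return torch.sigmoid(logits[:, -1]).tolist()  # p(material), assuming 1-logit head

print(materiality_scores([
    "ACME Corp announces CEO resignation effective immediately",
    "ACME Corp to present at industry conference next month",
]))
```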

The bigger LLM still has a job. It writes the summary the user actually hears once the small model has decided what’s worth surfacing. That’s the right division of labor — cheap deterministic classifiers in the hot path, expensive reasoner at the synthesis end.

What surprised me

Dead air is fatal. Two seconds of silence in a voice product feels like the system crashed. Users hang up. So you end up engineering filler — “let me check that for you” — before you do any real work. Nobody writes about this, and it accounts for most of the perceived quality.
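
One way to engineer it, as an asyncio sketch (speak and the slow answer function are illustrative stand-ins): kick off the real work, and if the answer hasn’t landed within a budget, speak a canned line while the work keeps running.

```python
import asyncio

FILLER_BUDGET_S = 1.0  # beyond ~1-2 s of dead air, users assume a crash

async def respond(user_query: str, speak, fetch_answer) -> None:
    task = asyncio.ensure_future(fetch_answer(user_query))
    try:
        # Fast path: the answer beats the budget, no filler needed.
        answer = await asyncio.wait_for(asyncio.shield(task), FILLER_BUDGET_S)
    except asyncio.TimeoutError:
        await speak("Let me check that for you.")  # buys a few seconds
        answer = await task  # shield kept the real work running
    await speak(answer)

async def _demo() -> None:
    async def speak(text: str) -> None:
        print("tts:", text)
    async def slow_answer(query: str) -> str:
        await asyncio.sleep(2.5)  # pretend tool calls + LLM
        return "NVDA is up about two percent."
    await respond("how's NVDA today", speak, slow_answer)

asyncio.run(_demo())
```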

Session memory matters more than per-turn answer quality. A user opens with “how’s NVDA today” and ends forty seconds later on a competitor’s earnings call. If you don’t carry that thread, every individual answer is technically correct and the conversation is useless.
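
The mechanical version is unglamorous: thread the whole session into every call. A minimal sketch, assuming a chat-style message list (names and budgets here are illustrative):

```python
SYSTEM_PROMPT = "You are a voice assistant for the markets."  # illustrative
MAX_TURNS = 20  # voice sessions are short; a real budget would count tokens

def build_messages(session: list[dict], new_user_turn: str) -> list[dict]:
    """Without the earlier turns in context, 'their last earnings call'
    forty seconds in has no referent and the model guesses."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *session[-MAX_TURNS:],  # longer sessions: summarize, don't drop
        {"role": "user", "content": new_user_turn},
    ]
```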

Tool use is where the reliability lives. Pure LLMs hallucinate prices, full stop. Every numeric claim about a security in krono is backed by a tool call within the same turn. The interesting design question was how aggressively to force the tool — too eager and the model sounds robotic, too lazy and it’ll cheerfully invent a P/E ratio.
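
Here’s roughly what that policy looks like with an OpenAI-style tools schema. get_quote and the regex router are hypothetical names for illustration; the real routing decision is fuzzier than a regex.

```python
import re

# Tool schema in the OpenAI-compatible format; get_quote is an
# illustrative name for whatever market-data backend answers it.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_quote",
        "description": "Latest price and basic stats for a ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

NUMERIC_ASK = re.compile(r"\b(price|p/e|market cap|volume|how much)\b", re.I)

def tool_choice_for(user_turn: str):
    """Crude router: force the quote tool when the user clearly wants
    a number, let the model decide otherwise. Too eager and every
    reply sounds robotic; too lazy and it invents a P/E."""
    if NUMERIC_ASK.search(user_turn):
        return {"type": "function", "function": {"name": "get_quote"}}
    return "auto"

# Passed as tools=TOOLS, tool_choice=tool_choice_for(turn) in an
# OpenAI-style chat.completions.create call.
```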

If you want to try it: laurion.com/krono. Honest feedback welcome.