For about a decade the standard recommender system (the thing that decides “what you’ll probably want next”) has been a two-stage pipeline: retrieval (pull a few hundred candidates out of hundreds of millions), then ranking (score and sort those few hundred) — each stage fed a big pile of hand-crafted features. Meta’s 2024 paper Actions Speak Louder than Words threw that framing out: treat a user’s whole history as a stream of tokens, and make both retrieval and ranking just “predict the next token.” That’s a generative recommender — the LLM autoregressive recipe, pointed at recommendation. The new block holding it together is HSTU (Hierarchical Sequential Transduction Unit), and what it’s replacing is the attention in a standard transformer (the part that lets each position in a sequence “look at” the others).

Two things in this post. First I’ll pull HSTU apart and show where it actually differs from ordinary attention. Then a small result we got on short-video data: adding a RoPE-style sin/cos encoding of each interaction’s timestamp to the input-side embedding (RoPE is rotary position embedding, position-dependent sin/cos rotations that tell the model “what came before what, and how far apart”), which nudged our offline metric up by about 1.5%.

From “ranking” to “generating”: sequential transduction

Classic ranking is really a static problem: hand it a user and a batch of candidates, get scores back. HSTU rewrites it as sequential transduction (sequence in, sequence out, aligned position by position). You interleave the items a user touched (videos watched, things bought) with the actions taken on them (completion, like, rating) into one sequence:

\[\Phi_0,\, a_0,\, \Phi_1,\, a_1,\, \dots,\, \Phi_{n-1},\, a_{n-1}\]

where \(\Phi_i\) is the embedding of item \(i\) (a discrete id squeezed into a dense vector) and \(a_i\) is its action. Laid out this way, ranking is just guessing the action after each item, \(p(a_{i+1}\mid \Phi_0, a_0, \dots, \Phi_{i+1})\), and retrieval is guessing the next item, \(p(\Phi_{i+1}\mid u_i)\). Two jobs that used to need two models and two feature pipelines collapse into one autoregressive objective — and that’s all “generative” means here.
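A minimal sketch of the interleaving, assuming integer item ids and string action labels. The helper names `interleave` and `ranking_examples` are hypothetical and do not come from the paper or Meta's code; the point is only to show what the token sequence and the ranking targets look like.

```python
from typing import List, Tuple

Token = Tuple[str, object]  # ("item", id) or ("action", label)


def interleave(items: List[int], actions: List[str]) -> List[Token]:
    """Build [item_0, action_0, item_1, action_1, ...] from parallel lists."""
    if len(items) != len(actions):
        raise ValueError("items and actions must have the same length")
    seq: List[Token] = []
    for item, action in zip(items, actions):
        seq.append(("item", item))
        seq.append(("action", action))
    return seq


def ranking_examples(seq: List[Token]) -> List[Tuple[List[Token], Token]]:
    """For each item token, the context is everything up to and including
    that item, and the target is the action token that follows it."""
    examples = []
    for pos in range(0, len(seq), 2):
        context = seq[: pos + 1]
        target = seq[pos + 1]
        examples.append((context, target))
    return examples


seq = interleave([101, 205, 307], ["like", "skip", "complete"])
examples = ranking_examples(seq)
```

Retrieval uses the same sequence with the roles flipped: the context ends at an action token and the target is the item that comes next.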

Traditional sequential recommender (left) vs generative recommender (right) (Fig. 8, Zhai et al. 2024): items and actions interleaved into one token sequence, predicting the next token.

The big win is scaling: longer sequences and more parameters keep paying off, same temperament as LLMs. The paper scales HSTU-based generative recommenders to 1.5 trillion parameters and reports up to 65.8% higher NDCG (a ranking-quality metric, higher is better) than baselines on synthetic and public datasets, +12.4% in online A/B tests, and a 5.3x to 15.2x speedup over FlashAttention2-based Transformers on length-8192 sequences.

What an HSTU layer actually looks like

An HSTU layer is just three steps. Call the input representation \(X\).

Step 1, pointwise projection. One linear layer \(f_1(X)=W_1 X + b_1\) spits out four vectors at once; run a SiLU (\(\phi_1\), a smooth activation) and split into four:

\[U(X),\, V(X),\, Q(X),\, K(X) = \mathrm{Split}\big(\phi_1(f_1(X))\big)\]

\(Q, K, V\) are the usual query/key/value; the extra \(U\) is a gate, saved for step 3.

Step 2, pointwise aggregated attention:

\[A(X)V(X) = \phi_2\big(Q(X)K(X)^\top + \mathrm{rab}^{p,t}\big)\,V(X)\]

Note that \(\phi_2\) is also SiLU, not softmax. This is where HSTU parts ways with a normal transformer — the row-wise normalizing softmax in attention gets swapped for a pointwise SiLU. \(\mathrm{rab}^{p,t}\) is a relative attention bias that drops positional (\(p\)) and temporal (\(t\)) info onto \(QK^\top\) as an additive term (next section).

Step 3, gated output:

\[Y(X) = f_2\big(\mathrm{Norm}(A(X)V(X)) \odot U(X)\big)\]

Normalize the aggregate, multiply it element-wise (\(\odot\)) by the gate \(U\) from step 1, run a final linear layer. The gating is GLU/SwiGLU-flavored — it lets the model decide for itself which channels to turn up and which to clamp.
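The three steps fit in a few lines of PyTorch. This is a single-head sketch under stated assumptions, not Meta's implementation: `HSTULayerSketch` is a hypothetical name, `Norm` is taken to be a LayerNorm, a causal mask is added so position \(i\) only aggregates positions \(\le i\), and residual connections, multiple heads and any length scaling are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HSTULayerSketch(nn.Module):
    """Single-head sketch of the three steps; not Meta's implementation."""

    def __init__(self, d_model: int, d_qk: int, d_v: int):
        super().__init__()
        self.d_qk, self.d_v = d_qk, d_v
        # f1: one linear layer producing U, V, Q, K together
        self.f1 = nn.Linear(d_model, 2 * d_v + 2 * d_qk)
        self.norm = nn.LayerNorm(d_v)
        # f2: output projection back to the model width
        self.f2 = nn.Linear(d_v, d_model)

    def forward(self, x, rab=None):
        # x: (batch, seq_len, d_model); rab: optional (seq_len, seq_len) bias
        n = x.size(1)
        # Step 1: pointwise projection, SiLU, then split into U, V, Q, K
        u, v, q, k = torch.split(
            F.silu(self.f1(x)),
            [self.d_v, self.d_v, self.d_qk, self.d_qk],
            dim=-1,
        )
        # Step 2: pointwise SiLU over the scores instead of a row-wise softmax
        scores = q @ k.transpose(-1, -2)  # (batch, seq_len, seq_len)
        if rab is not None:
            scores = scores + rab
        attn = F.silu(scores)
        # Assumption: causal mask, zeroing weights on future positions
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=x.device))
        attn = attn * causal
        av = attn @ v  # (batch, seq_len, d_v)
        # Step 3: normalize, gate element-wise with U, project out
        return self.f2(self.norm(av) * u)


torch.manual_seed(0)
layer = HSTULayerSketch(d_model=16, d_qk=8, d_v=12)
x = torch.randn(2, 5, 16)
y = layer(x)  # (2, 5, 16)
```

Note that the mask multiplies the weights by zero after the SiLU rather than adding a large negative number before it, since there is no softmax to turn that into a zero weight.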

HSTU block vs a DLRM stack (Fig. 3, Zhai et al. 2024). Each HSTU layer: \(U, V, Q, K = \mathrm{Split}(\phi_1(f_1(X)))\), then \(A(X)V(X) = \phi_2(QK^\top + \mathrm{rab}^{p,t})\,V(X)\), then \(Y(X) = f_2(\mathrm{Norm}(A(X)V(X)) \odot U(X))\).

Why drop the softmax

This is the least intuitive part and the most important. Softmax squashes a row of attention scores into a probability distribution that sums to 1 — it keeps the relative sizes and flattens the absolute magnitude. But in recommendation, how many prior actions point at the same target is itself a strong signal: someone who just watched 20 basketball clips in a row is a completely different bet for “recommend basketball next” than someone who watched one. Normalize with softmax and that intensity is gone. HSTU swaps in a pointwise SiLU precisely to keep it. The paper’s own words, roughly: the number of prior data points related to the target is a strong feature for the intensity of user preference, and it’s hard to preserve after softmax normalization.

The cost: attention is no longer a probability distribution — rows don’t sum to 1 — which is exactly why step 3 needs the Norm and gating to pull the numbers back into range.
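A toy calculation makes the magnitude argument concrete. It assumes the simplest possible case, where every history item matches the target equally (score 2.0, value 1.0); the function names are illustrative only.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def softmax_aggregate(scores, values):
    """Row-wise softmax over the scores, then a weighted sum of the values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def silu_aggregate(scores, values):
    """Pointwise SiLU on each score, then a weighted sum of the values."""
    return sum(silu(s) * v for s, v in zip(scores, values))


def history(n_matches: int):
    # every history item matches the target equally: score 2.0, value 1.0
    return [2.0] * n_matches, [1.0] * n_matches


soft_1 = softmax_aggregate(*history(1))
soft_20 = softmax_aggregate(*history(20))
silu_1 = silu_aggregate(*history(1))
silu_20 = silu_aggregate(*history(20))
```

With softmax, one matching clip and twenty matching clips aggregate to the same value, because the weights always sum to 1. With SiLU, the aggregate grows in proportion to the count, which is also why the output needs the Norm afterwards.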

How HSTU encodes position and time

A sequence model doesn’t know “what came first” on its own; you have to feed that in. HSTU puts it in \(\mathrm{rab}^{p,t}\) — a relative bias carrying both a positional term (which slot) and a temporal term (how much real time elapsed between timestamps), added onto \(QK^\top\). The temporal term matters because recommendation data is a non-stationary stream: two clicks five minutes apart are a different animal from two clicks three months apart, and the temporal term is what lets the model see that gap.
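To make the shape of such a bias concrete, here is a small pure-Python sketch. The details are assumptions for illustration, not the paper's exact formulation: time gaps are bucketed on a log2 scale, the table sizes are arbitrary, random numbers stand in for learnable parameters, and timestamps are assumed sorted in ascending order.

```python
import math
import random

NUM_TIME_BUCKETS = 32  # hypothetical table size


def time_bucket(delta_seconds: float) -> int:
    """Map a non-negative time gap to a log-scale bucket index."""
    bucket = int(math.log2(max(delta_seconds, 1.0)))
    return min(bucket, NUM_TIME_BUCKETS - 1)


def relative_bias(timestamps, pos_table, time_table):
    """rab[i][j] = pos_table[i - j] + time_table[bucket(t_i - t_j)] for j <= i."""
    n = len(timestamps)
    rab = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            gap = timestamps[i] - timestamps[j]
            rab[i][j] = pos_table[i - j] + time_table[time_bucket(gap)]
    return rab


random.seed(0)
# stand-ins for learnable parameters
pos_table = [random.gauss(0.0, 0.1) for _ in range(8)]
time_table = [random.gauss(0.0, 0.1) for _ in range(NUM_TIME_BUCKETS)]

five_minutes = time_bucket(5 * 60)
three_months = time_bucket(90 * 24 * 3600)
rab = relative_bias([0, 300, 600, 7_776_600], pos_table, time_table)
```

The five-minute gap and the three-month gap fall into different buckets, so they pick up different bias values even when the two events are adjacent in position.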

In Meta’s open-source generative-recommenders codebase, though, the input-side preprocessor (the module that turns an id sequence into the embeddings fed to the layer) actually defaults to a learnable absolute positional embedding: interleave items and actions into a length-\(2N\) sequence, then look up a learnable position table by index and add it. Time mostly rides on the \(\mathrm{rab}\) term inside attention. What we changed is this input-side encoding.

Our experiment: swapping in a rotary timestamp encoding

The setting is a small project of ours, HummingbirdRec, which ports HSTU onto KuaiRec (Kuaishou’s public short-video dataset, with fields like video duration and watch ratio) and pits it against SASRec (a classic self-attention recommender baseline). In that setup we tried one thing: replacing the plain “learnable absolute positional embedding” at the input with a version that adds a rotary timestamp encoding.

Concretely: pull the hour-of-day (0–23) out of each interaction’s timestamp, encode it with a RoPE-style set of sin/cos, and add it to the input embedding:

\[\theta_i = 10000^{-2i/d},\qquad \mathrm{enc}(h) = \big[\,\sin(h\,\theta_i)\,\Vert\,\cos(h\,\theta_i)\,\big]\]

where \(h\) is the hour the interaction happened, \(d\) is the embedding dimension, and \(i\) runs over \(0, \dots, d/2 - 1\). The code:

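import torch

def apply_rotary_embedding(self, hours, dim):
    # hours: tensor of hour-of-day values in 0..23, shape (batch, seq_len)
    half_dim = dim // 2  # dim must be even so the output has exactly dim channels
    idx = torch.arange(half_dim, dtype=torch.float32, device=hours.device)
    theta = 10000.0 ** (-2.0 * idx / dim)  # one frequency per sin/cos pair
    angles = hours.unsqueeze(-1).float() * theta  # (batch, seq_len, half_dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# in forward(): pull hour-of-day from each interaction's timestamp, add it in
# timestamps: tensor of Unix seconds, shape (batch, seq_len); the hour is in UTC
hour = torch.div(timestamps, 3600, rounding_mode="floor") % 24
user_embeddings = user_embeddings + self.apply_rotary_embedding(hour, dim)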

With this in, the offline ranking metric on KuaiRec went up by about 1.5%. Not earth-shattering, but the gain is basically free: no extra parameters, negligible extra compute.

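There’s a plain intuition for why: short-video consumption has a strong daily rhythm. What the same person scrolls on a commute, at lunch, and in bed at midnight just isn’t the same. Encoding hour-of-day explicitly hands the model a free feature lined up with when the scrolling happens. The reason for sin/cos over a raw integer hour is that it gives a bounded, smooth, multi-frequency representation instead of one scalar running from 0 to 23. One thing it does not give for free is wraparound: with \(\theta_i = 10000^{-2i/d}\) the encoding is not periodic in 24 hours (the fastest channel has period \(2\pi \approx 6.3\) hours), so 11pm and midnight land close together only if the frequencies are integer multiples of \(2\pi/24\).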
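Whether hour 23 actually lands next to hour 0 depends on the frequencies, and that is easy to check numerically. The sketch below assumes \(d = 8\) and compares the encoding from the formula with a hypothetical alternative, `enc_period24`, whose frequencies are harmonics of the 24-hour cycle. The alternative is not what the experiment used.

```python
import math


def enc_base10000(h: float, d: int = 8):
    """The encoding from the formula: theta_i = 10000^(-2i/d), i = 0..d/2-1."""
    thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]
    return [math.sin(h * t) for t in thetas] + [math.cos(h * t) for t in thetas]


def enc_period24(h: float, d: int = 8):
    """Assumed alternative: harmonics of the 24-hour cycle, theta_k = 2*pi*k/24."""
    thetas = [2 * math.pi * k / 24 for k in range(1, d // 2 + 1)]
    return [math.sin(h * t) for t in thetas] + [math.cos(h * t) for t in thetas]


def dist(a, b):
    """Euclidean distance between two encodings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# distance between 23:00 and 00:00 versus between 00:00 and 01:00
base_wrap = dist(enc_base10000(23), enc_base10000(0))
base_step = dist(enc_base10000(0), enc_base10000(1))
cyc_wrap = dist(enc_period24(23), enc_period24(0))
cyc_step = dist(enc_period24(0), enc_period24(1))
```

With the 10000-base frequencies, 23:00 sits much farther from 00:00 than 01:00 does. With the 24-hour harmonics the two distances are equal, which is the wraparound behavior one would want from an hour-of-day feature.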

A few caveats worth stating up front

  • Strictly, this version isn’t classic RoPE. Classic RoPE rotates the Q/K vectors by position so relative position shows up in the dot product; what we did is add a sin/cos encoding of the timestamp to the input embedding, which is closer to the original Transformer’s sinusoidal positional encoding — just fed timestamps instead of position indices. “Rotary timestamp encoding” is the honest name.
  • It encodes only hour-of-day and throws away day-of-week and absolute inter-event gaps, which are useful scales too. Multi-scale time features (hour / weekday / time-since-last-interaction) would probably squeeze out a bit more.
  • The 1.5% is a single offline result — one dataset (KuaiRec), one hyperparameter setting, no online test, no multi-seed significance check. Treat it as a signal worth chasing, not a verdict.
  • It stacks on top of HSTU’s internal \(\mathrm{rab}\) temporal term rather than replacing it — we touched the input-side preprocessor; the attention term is still there. How the two split the work, and whether they’re redundant, is untested.
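The distinction in the first caveat can be written out for a single pair of channels and one frequency \(\theta\). This is a sketch; \(R(\cdot)\), \(t_m\), \(t_n\), \(h_m\) and \(x_m\) are notation introduced here, not the paper's. The additive version we ran changes the input:

\[\tilde{x}_m = x_m + \mathrm{enc}(h_m)\]

so time reaches \(QK^\top\) only through the learned projections. A rotary version would instead rotate queries and keys by their timestamps, and because \(R(a)^\top R(b) = R(b-a)\), the score depends only on the time difference:

\[\big(R(t_m\theta)\,q_m\big)^\top \big(R(t_n\theta)\,k_n\big) = q_m^\top R\big((t_n - t_m)\theta\big)\,k_n,\qquad R(\alpha)=\begin{pmatrix}\cos\alpha & -\sin\alpha\\ \sin\alpha & \cos\alpha\end{pmatrix}\]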

What I’d do next

Grow hour-of-day into a multi-scale time encoding; push time properly into the attention (rotate Q/K by timestamp instead of only adding at the input); and reproduce the 1.5% on datasets beyond KuaiRec to make sure it isn’t single-dataset luck.
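For the second item, a minimal pure-Python sketch of what rotating Q/K by timestamp could look like. It is an untested direction, not something we ran: `rotate_by_time` is a hypothetical helper, time is measured in hours, and the frequencies reuse the 10000-base schedule purely for illustration.

```python
import math


def rotate_by_time(vec, t: float, base: float = 10000.0):
    """Rotate consecutive channel pairs of vec by angle t * theta_i (RoPE-style)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(t * theta), math.sin(t * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# same 3-hour gap at two different absolute times
score_early = dot(rotate_by_time(q, 5.0), rotate_by_time(k, 2.0))
score_late = dot(rotate_by_time(q, 20.0), rotate_by_time(k, 17.0))
# a different gap gives a different score
score_other_gap = dot(rotate_by_time(q, 20.0), rotate_by_time(k, 10.0))
```

The first two scores agree because only the gap between the two timestamps enters the dot product, which is the property the additive input-side encoding does not have.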