HSTU explained, and swapping its time encoding for RoPE
For the last decade the standard recommender system (a model that orders “what you’re most likely to want next”) has been a two-stage pipeline: retrieval (pull a few hundred candidates from hundreds of millions) followed by ranking (score and sort those few hundred), each stage fed a pile of hand-crafted features. Meta’s 2024 paper Actions Speak Louder than Words reframed the whole thing — treat a user’s entire history as a sequence of tokens, and turn both retrieval and ranking into “predict the next token.” That’s a generative recommender (borrowing the autoregressive recipe from LLMs). The new layer holding it up is HSTU (Hierarchical Sequential Transduction Unit), built to replace standard transformer attention (the module that lets each position in a sequence “look at” the others).
Two parts here: first I unpack the HSTU layer and where it diverges from ordinary attention, then I describe a small experiment we ran on short-video data: replacing the input-side positional encoding with a RoPE-style timestamp encoding (RoPE, rotary position embedding, uses position-dependent sin/cos rotations to inject “order / distance” into the model), which raised our offline ranking metric by about 1.5%.
From “ranking” to “generating”: sequential transduction
Classic ranking is a static problem: given a user and a batch of candidates, output scores. HSTU recasts it as sequential transduction (sequence in, sequence out, aligned position by position). Concretely, you interleave the items a user touched (videos watched, things bought) with the actions taken on them (completion, like, rating) into one sequence:
\[\Phi_0,\, a_0,\, \Phi_1,\, a_1,\, \dots,\, \Phi_{n-1},\, a_{n-1}\]
where \(\Phi_i\) is the embedding of item \(i\) (a discrete id mapped to a dense vector) and \(a_i\) is the action taken on it. Ranking becomes predicting the action that follows each item, \(p(a_{i+1}\mid \Phi_0, a_0, \dots, \Phi_{i+1})\); retrieval becomes predicting the next item, \(p(\Phi_{i+1}\mid u_i)\), where \(u_i\) is the user’s representation at step \(i\). Two things that used to need two models and two feature pipelines collapse into a single autoregressive objective, which is what “generative” means here.
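To make the interleaving concrete, here is a minimal sketch in plain Python. The function names and the string tokens are hypothetical, chosen only for illustration; a real model works on embeddings rather than strings, and this is not how the paper's code is organized.

```python
from typing import List, Tuple


def interleave(items: List[str], actions: List[str]) -> List[str]:
    """Build [item_0, action_0, item_1, action_1, ...] from two aligned lists."""
    if len(items) != len(actions):
        raise ValueError("each item needs exactly one action")
    seq: List[str] = []
    for item, action in zip(items, actions):
        seq.extend([item, action])
    return seq


def ranking_examples(seq: List[str]) -> List[Tuple[List[str], str]]:
    """Ranking: the context ends at an item, the target is that item's action."""
    return [(seq[: i + 1], seq[i + 1]) for i in range(0, len(seq), 2)]


def retrieval_examples(seq: List[str]) -> List[Tuple[List[str], str]]:
    """Retrieval: the context ends at an action, the target is the next item."""
    return [(seq[: i + 1], seq[i + 1]) for i in range(1, len(seq) - 1, 2)]


seq = interleave(["video_a", "video_b", "video_c"], ["like", "skip", "finish"])
print(seq)
print(ranking_examples(seq)[1])    # context ends at video_b, target is its action
print(retrieval_examples(seq)[0])  # context ends at the first action, target is video_b
```

Both tasks read the same sequence; only the position at which the prediction is made differs.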

The payoff is scaling: longer sequences and more parameters keep helping, and the paper reports model quality improving as a power law of training compute, as with LLMs. It pushes HSTU-based models to 1.5 trillion parameters and reports NDCG (a ranking-quality metric, higher is better) gains of up to 65.8% over baselines on synthetic and public datasets, a 12.4% metric improvement in online A/B tests, and a 5.3× to 15.2× speedup over FlashAttention2-based Transformers on length-8192 sequences.
What an HSTU layer actually looks like
An HSTU layer is three steps. Let the input representation be \(X\).
Step 1, pointwise projection. A single linear layer \(f_1(X)=W_1 X + b_1\) computes all four projections at once; its output passes through a SiLU nonlinearity (\(\phi_1\), a smooth activation) and is split into four parts:
\[U(X),\, V(X),\, Q(X),\, K(X) = \mathrm{Split}\big(\phi_1(f_1(X))\big)\]
\(Q, K, V\) are the usual query/key/value; the extra \(U\) is a gate, used in step 3.
Step 2, pointwise aggregated attention:
\[A(X)V(X) = \phi_2\big(Q(X)K(X)^\top + \mathrm{rab}^{p,t}\big)\,V(X)\]
Crucially, \(\phi_2\) is also SiLU, not softmax. This is HSTU’s core departure from a normal transformer: it replaces the row-wise normalizing softmax in attention with a pointwise SiLU. \(\mathrm{rab}^{p,t}\) is a relative attention bias that adds positional (\(p\)) and temporal (\(t\)) information as an additive bias onto \(QK^\top\) (next section).
Step 3, gated output:
\[Y(X) = f_2\big(\mathrm{Norm}(A(X)V(X)) \odot U(X)\big)\]
Normalize the aggregate, multiply element-wise (\(\odot\)) by the gate \(U\) from step 1, and pass through a final linear layer. The gating is GLU/SwiGLU-flavored: it lets the model adaptively amplify or suppress each channel.
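The three steps map almost line for line onto PyTorch. The following is a single-head sketch under stated assumptions, not the reference implementation: the class name `HSTULayerSketch` is made up, `Norm` is taken to be layer normalization, the causal mask is applied multiplicatively after the SiLU, and multi-head splitting, residual connections, dropout, and any rescaling by sequence length are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HSTULayerSketch(nn.Module):
    """Single-head sketch: projection -> pointwise SiLU attention -> gated output."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.d_head = d_head
        self.f1 = nn.Linear(d_model, 4 * d_head)  # one projection yields U, V, Q, K
        self.norm = nn.LayerNorm(d_head)          # assumption: Norm = LayerNorm
        self.f2 = nn.Linear(d_head, d_model)

    def forward(self, x: torch.Tensor, rab: torch.Tensor = None) -> torch.Tensor:
        # x: (batch, n, d_model); rab: optional (n, n) additive bias on Q K^T
        n = x.size(1)
        # Step 1: project, apply SiLU, split into gate, value, query, key.
        u, v, q, k = F.silu(self.f1(x)).split(self.d_head, dim=-1)
        # Step 2: pointwise SiLU on the scores instead of a row-wise softmax.
        scores = q @ k.transpose(-1, -2)
        if rab is not None:
            scores = scores + rab
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=x.device))
        attn = F.silu(scores) * causal  # zero out attention to future positions
        av = attn @ v
        # Step 3: normalize, gate with U, project back to d_model.
        return self.f2(self.norm(av) * u)


torch.manual_seed(0)
layer = HSTULayerSketch(d_model=16, d_head=8)
x = torch.randn(2, 5, 16)
y = layer(x)
print(y.shape)
```

The mask is multiplied in after the activation because SiLU has no natural "minus infinity means zero weight" behaviour the way softmax does.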

Why drop the softmax
This is the least intuitive and most important choice. Softmax normalizes a row of attention scores into a probability distribution summing to 1 — it keeps the relative sizes and erases the absolute magnitude. But in recommendation, how many prior actions point toward a target is itself a strong signal: a user who just watched 20 basketball clips in a row is a very different bet for “recommend basketball next” than one who watched a single clip. Softmax normalization throws that intensity away. HSTU’s pointwise SiLU keeps it. The paper’s own framing: the number of prior data points related to the target is a strong feature for the intensity of user preference, and it’s hard to preserve after softmax normalization.
The cost is that attention is no longer a probability distribution — rows don’t sum to 1 — which is exactly why step 3’s Norm + gating is there to keep the numerics in range.
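A toy calculation shows the effect. This sketch assumes every matching history item has the same score and a value of 1.0, which is far simpler than a real model, but it isolates what normalization does to the count signal.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softmax_aggregate(scores: List[float], values: List[float]) -> float:
    """Weighted sum of values with softmax-normalized weights (rows sum to 1)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w / z * v for w, v in zip(weights, values))


def silu_aggregate(scores: List[float], values: List[float]) -> float:
    """Weighted sum of values with pointwise SiLU weights (no normalization)."""
    return sum(silu(s) * v for s, v in zip(scores, values))


for n in (1, 20):
    scores = [2.0] * n  # n history items that all match the target equally well
    values = [1.0] * n
    print(n, softmax_aggregate(scores, values), silu_aggregate(scores, values))
```

With softmax the aggregate is the same for 1 matching item and for 20; with SiLU it grows in proportion to the count, which is the intensity signal the paragraph above describes.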
How HSTU encodes position and time
A sequence model has no built-in notion of order; you inject it. HSTU does this inside \(\mathrm{rab}^{p,t}\) — a relative bias carrying both a positional term (which slot) and a temporal term (how much real time elapsed between timestamps), added onto \(QK^\top\). The temporal term matters because recommendation data is a non-stationary stream: two clicks five minutes apart mean something very different from two clicks three months apart, and the temporal term is what lets the model see that gap.
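As a rough illustration of what such a bias can look like, the sketch below builds a causal bias matrix from a positional table indexed by slot distance and a temporal table indexed by a log-scale bucket of the time gap. The bucketing rule, the table sizes, and the function names are assumptions made for this example; in a model both tables would be learned parameters, and the exact scheme in the paper or the released code may differ.

```python
import math
from typing import List


def time_bucket(dt_seconds: float, num_buckets: int) -> int:
    """Map a time gap to a log2-scale bucket index (hypothetical scheme)."""
    bucket = int(math.log2(max(1.0, abs(dt_seconds))))
    return min(bucket, num_buckets - 1)


def relative_bias(
    timestamps: List[float], pos_table: List[float], time_table: List[float]
) -> List[List[float]]:
    """rab[i][j] = pos_table[i - j] + time_table[bucket(t_i - t_j)] for j <= i."""
    n = len(timestamps)
    rab = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            gap = timestamps[i] - timestamps[j]
            rab[i][j] = pos_table[i - j] + time_table[time_bucket(gap, len(time_table))]
    return rab


# Three interactions: the second 5 minutes after the first, the third 90 days later.
timestamps = [0.0, 300.0, 300.0 + 90 * 86400.0]
pos_table = [0.0, -0.1, -0.2]                  # stand-in for learned values
time_table = [-0.05 * b for b in range(32)]    # stand-in for learned values
rab = relative_bias(timestamps, pos_table, time_table)
print(time_bucket(300.0, 32), time_bucket(90 * 86400.0, 32))
print(rab[1][0], rab[2][1])
```

Items 0 and 1 and items 1 and 2 are both one slot apart, so they share the same positional term, but they fall into different time buckets, so the bias can treat a five-minute gap and a three-month gap differently.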
In Meta’s open-source generative-recommenders codebase, though, the input-side preprocessor (the module turning an id sequence into the embeddings fed to the layer) defaults to a learnable absolute positional embedding: after interleaving items and actions into a length-\(2N\) sequence, it looks up a learnable position table by index and adds it. Time information is carried mostly by the \(\mathrm{rab}\) term inside attention. Our experiment touches this input-side encoding.
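The default behaviour described here can be sketched as follows. This is a simplified stand-in rather than the repository's actual module: the class name, constructor arguments, and the use of a separate embedding table for actions are assumptions.

```python
import torch
import torch.nn as nn


class LearnedPositionPreprocessor(nn.Module):
    """Interleave item/action embeddings and add a learnable absolute position table."""

    def __init__(self, num_items: int, num_actions: int, max_len: int, d_model: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)
        self.action_emb = nn.Embedding(num_actions, d_model)
        self.pos_emb = nn.Embedding(2 * max_len, d_model)  # one row per slot

    def forward(self, item_ids: torch.Tensor, action_ids: torch.Tensor) -> torch.Tensor:
        # item_ids, action_ids: (batch, N) integer tensors, aligned by position
        b, n = item_ids.shape
        items = self.item_emb(item_ids)        # (batch, N, d_model)
        actions = self.action_emb(action_ids)  # (batch, N, d_model)
        # Stack to (batch, N, 2, d) and flatten to item_0, action_0, item_1, ...
        seq = torch.stack([items, actions], dim=2).reshape(b, 2 * n, -1)
        pos = torch.arange(2 * n, device=item_ids.device)
        return seq + self.pos_emb(pos)         # position table broadcasts over batch


torch.manual_seed(0)
pre = LearnedPositionPreprocessor(num_items=10, num_actions=3, max_len=4, d_model=6)
item_ids = torch.tensor([[1, 2, 3]])
action_ids = torch.tensor([[0, 1, 2]])
out = pre(item_ids, action_ids)
print(out.shape)
```

Note that nothing in this module sees a timestamp: the only ordering signal it adds is the slot index.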
Our experiment: swapping in a rotary timestamp encoding
The context is a small project of ours, HummingbirdRec, which ports HSTU onto KuaiRec (Kuaishou’s public short-video dataset, with fields like video duration and watch ratio) and compares against SASRec (a classic self-attention recommender baseline). In that setting we tried one thing: replacing the plain “learnable absolute positional embedding” at the input with a version that adds a rotary timestamp encoding.
Concretely: take the hour-of-day (0–23) from each interaction’s timestamp, encode it with a RoPE-style set of sin/cos, and add it to the input embedding:
\[\theta_i = 10000^{-2i/d},\qquad \mathrm{enc}(h) = \big[\,\sin(h\,\theta_i)\,\Vert\,\cos(h\,\theta_i)\,\big],\qquad i = 0, \dots, d/2 - 1\]
where \(h\) is the hour of the interaction, \(d\) the embedding dimension, and \(\Vert\) denotes concatenation. The code:
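import torch
from datetime import datetime, timezone

def apply_rotary_embedding(self, hours, dim):
    # hours: float tensor (batch, seq_len) of hour-of-day values; dim is assumed even
    half_dim = dim // 2
    theta = 10000 ** (-2 * torch.arange(half_dim, device=hours.device) / dim)
    angles = hours.unsqueeze(-1) * theta  # (batch, seq_len, half_dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# in forward(): timestamps is a nested list of Unix seconds (rows padded to equal length);
# pull the UTC hour-of-day from each one and add its encoding to the input
hour = torch.tensor(
    [[datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in row] for row in timestamps],
    dtype=torch.float32, device=user_embeddings.device)
user_embeddings = user_embeddings + self.apply_rotary_embedding(hour, dim)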
With this in place, the offline ranking metric on KuaiRec went up by about 1.5%. Not earth-shattering, and measured on a single configuration (see the caveats below), but it comes at essentially zero cost: no added parameters, negligible extra compute.
There’s a clean intuition for why it helps: short-video consumption has a strong daily rhythm. What the same person scrolls during a commute, at lunch, and in bed at midnight differs; encoding hour-of-day explicitly hands the model a free feature aligned with when the scrolling happens. Compared with stuffing in a raw integer hour, the sin/cos encoding gives the model a bounded, multi-frequency representation of the hour. One thing it does not do as written: with \(\theta_0 = 1\) radian per hour, the encoding is not periodic over 24 hours, so 11pm and midnight are not placed close together. Getting that wrap-around would require angles that are multiples of \(2\pi h/24\).
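A quick way to see how an hour encoding treats the day boundary is to compare distances between encoded hours. The sketch below does this for the formula above and for a hypothetical variant, `cyclic_hour`, whose angles are harmonics of the 24-hour cycle; that variant is an illustration, not something we ran in the experiment.

```python
import math
from typing import List


def sinusoid_hour(h: float, dim: int) -> List[float]:
    """Encoding from the formula above: angles h * 10000^(-2i/dim)."""
    half = dim // 2
    angles = [h * 10000 ** (-2 * i / dim) for i in range(half)]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def cyclic_hour(h: float, dim: int) -> List[float]:
    """Hypothetical variant: angles 2*pi*k*h/24, periodic over one day."""
    half = dim // 2
    angles = [2 * math.pi * (k + 1) * h / 24 for k in range(half)]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


dim = 8
for name, enc in (("sinusoid", sinusoid_hour), ("cyclic", cyclic_hour)):
    d_23_0 = math.dist(enc(23, dim), enc(0, dim))  # 11pm vs midnight
    d_1_0 = math.dist(enc(1, dim), enc(0, dim))    # 1am vs midnight
    print(name, round(d_23_0, 3), round(d_1_0, 3))
```

Under the 24-hour variant, 11pm and 1am sit at the same distance from midnight; under the base-10000 formula, 11pm is farther from midnight than 1am is.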
A few honest caveats
- Strictly, this version isn’t classic RoPE. Classic RoPE rotates the Q/K vectors by position so that relative position shows up in the dot product; here we add a sin/cos encoding of the timestamp to the input embedding, which is closer to the original Transformer’s sinusoidal positional encoding, just fed hour-of-day values instead of position indices. “Rotary timestamp encoding” is a more accurate name.
- It encodes only hour-of-day, dropping day-of-week and absolute inter-event gaps, which are also useful scales. Multi-scale time features (hour / weekday / time-since-last-interaction) would likely squeeze out a bit more.
- The 1.5% is a single offline result — one dataset (KuaiRec), one hyperparameter setting, no online test, no multi-seed significance check. Treat it as a signal worth chasing, not a verdict.
- It stacks on top of HSTU’s internal \(\mathrm{rab}\) temporal term rather than replacing it — we changed the input-side preprocessor; the attention term is still there. How the two split the work, and whether they’re redundant, is untested.
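For the multi-scale point in the caveats above, one possible shape for such features is sketched here. The helper names, the choice of UTC, and the `log1p` squashing of the gap are assumptions for illustration; none of this was part of the experiment.

```python
import math
from datetime import datetime, timezone
from typing import List, Optional


def cyclic(value: float, period: float) -> List[float]:
    """Encode a periodic quantity as a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]


def time_features(ts: float, prev_ts: Optional[float]) -> List[float]:
    """Hour-of-day, day-of-week, and log time since the previous interaction."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)  # assumption: UTC, not local time
    gap = 0.0 if prev_ts is None else max(0.0, ts - prev_ts)
    return (
        cyclic(dt.hour + dt.minute / 60, 24)  # daily rhythm
        + cyclic(dt.weekday(), 7)             # weekly rhythm
        + [math.log1p(gap)]                   # seconds since last interaction, squashed
    )


first = time_features(0.0, None)       # 1970-01-01 00:00 UTC, a Thursday
second = time_features(3600.0, 0.0)    # one hour later
print(first)
print(second)
```

For a daily-rhythm feature, the user's local hour is probably more meaningful than the UTC hour; converting would need a time zone that the sketch does not assume is available.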
What I’d do next
Extend hour-of-day into a multi-scale time encoding; push time properly into the attention (rotate Q/K by timestamp rather than only adding at the input); and reproduce the 1.5% on datasets beyond KuaiRec to confirm it isn’t single-dataset luck.
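The second item, rotating Q/K by timestamp, would look roughly like the sketch below. It applies the standard RoPE pairwise rotation but uses a real-valued time instead of the position index; the function name, the choice of time unit, and the base of 10000 are assumptions, and this has not been tested inside HSTU.

```python
import torch


def rotate_by_time(x: torch.Tensor, t: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive feature pairs of x by angles proportional to time t.

    x: (..., n, d) with even d; t: (..., n) times in any consistent unit.
    """
    d = x.size(-1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = t.unsqueeze(-1) * freqs                 # (..., n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack(
        [x_even * cos - x_odd * sin, x_even * sin + x_odd * cos], dim=-1
    )
    return rotated.flatten(-2)                       # back to (..., n, d)


torch.manual_seed(0)
q = torch.randn(1, 1, 8, dtype=torch.float64)
k = torch.randn(1, 1, 8, dtype=torch.float64)


def score(t_q: float, t_k: float) -> torch.Tensor:
    """Dot product of q rotated to time t_q with k rotated to time t_k."""
    tq = torch.tensor([[t_q]], dtype=torch.float64)
    tk = torch.tensor([[t_k]], dtype=torch.float64)
    return (rotate_by_time(q, tq) * rotate_by_time(k, tk)).sum()


print(score(3.0, 1.0), score(10.0, 8.0))  # same gap of 2, so the scores match
```

Because the rotation enters both Q and K, the score depends only on the time difference between the two interactions, which is the property the input-side additive encoding lacks.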