Pyra Eval Working Group

Eval protocols · benchmark design · LMArena-Δ · public · 47 members

JY

🎉 Pyra-XL hit ELO 1387 on our internal LMArena-Δ. I'll walk through the eval setup at tomorrow's WG.

PT

Hey all, I noticed our LMArena-Δ eval setup diverges from the public LMArena setup in 11 places. I put together a diff table and uploaded it to Files.

📎 LMArena_diff_v3.xlsx

??

This post has been removed by an admin.

PT
Priya Gupta
I screenshotted the ELO-comparison chart from OP's original post. DM me if you need it.
??

Leaked LMArena diff (from last week's leadership preview deck):

· The Pyra-XL preference-optimized variant landed on internal Bench #2
· The Pyra-XL production checkpoint placed #32 on the external eval
Which one did we actually submit to LMArena? Whoever owned this, please explain. We can't fall back on "used different models for different benchmarks"; that line blew up in our faces last time.

418 reactions · 96 comments
PT
The official account denied any wrongdoing on X, but the style diff between the two checkpoints' outputs is undeniable. I ran a paired blind eval on 200 prompts, and the two checkpoints could be told apart in 91% of pairs. The reproducibility statement has to spell this out clearly; otherwise shipping it at the end of Q2 will just pour gasoline on the fire.
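For anyone who wants to sanity-check the 91% figure before it goes into the reproducibility statement, here is a minimal sketch. It assumes that "91% distinguishable" means 182 of the 200 pairs were attributed to the correct checkpoint, and that a rater who could not tell them apart would be right 50% of the time. Neither assumption is confirmed in the thread, and the function names are illustrative only.

```python
from math import comb, sqrt


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion k/n (z = 1.96 gives ~95%)."""
    phat = k / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 91% of 200 pairs correctly attributed (not stated in the thread).
n_pairs = 200
n_correct = 182

# One-sided p-value against the "rater is guessing" null of 50%.
p_value = binomial_tail(n_correct, n_pairs)
low, high = wilson_interval(n_correct, n_pairs)

print(f"distinguishable rate: {n_correct / n_pairs:.2%}")
print(f"one-sided p-value vs. chance: {p_value:.3g}")
print(f"approx. 95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

If the blind test was scored differently (for example, several raters per pair, or a three-way choice), the chance level and the counts above would need to change accordingly.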