Pyra Eval Working Group · Tema Workplace

Jamie Wynn

8 hours ago · 🌐

🎉 Pyra-XL hit ELO 1387 on internal LMArena-Δ. I'll walk through the eval setup at tomorrow's WG.

Priya Gupta

3 days ago · 🌐

Hey all, I noticed our LMArena-Δ eval setup diverges from the public LMArena setup in 11 places. I put together a diff table and uploaded it to Files.

📎 LMArena_diff_v3.xlsx

[message deleted by admin]

5 days ago · deleted

This post has been removed by an admin.

Priya Gupta

I screenshotted the ELO-comparison chart from OP's original post. DM me if you need it.

Like · 134 · 4d

[Anonymous] lmarena-leak

6 days ago · 🌐

Leaked LMArena diff (from last week's leadership preview deck):

· The Pyra-XL preference-optimized variant landed on internal Bench #2
· The Pyra-XL production checkpoint placed #32 on the external eval

Which one did we actually submit to LMArena? Whoever owned this, please explain. We can't fall back on "used different models for different benchmarks"; that line blew up in our faces last time.

👍418

See all 418 reactions · 96 comments

Priya Gupta

Official PR denied any wrongdoing on X, but the style diff between the two checkpoints' outputs is undeniable. I ran a paired blind eval on 200 prompts, and the two checkpoints could be told apart in 91% of pairs. The reproducibility statement has to spell this out clearly; otherwise shipping it at the end of Q2 will just pour gasoline on the fire.
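
For anyone who wants to sanity-check the 91% figure, here is a minimal sketch, not the actual eval script. It assumes 91% of 200 means 182 correctly attributed pairs and that a rater who cannot tell the checkpoints apart would be right 50% of the time; both numbers, and the function names, are illustrative assumptions. It computes an exact one-sided binomial tail probability and a Wilson 95% interval for the distinguishability rate.

```python
from math import comb, sqrt


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion k/n (z=1.96 gives roughly 95%)."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed counts: 91% of 200 paired prompts identified correctly.
N_PAIRS = 200
N_CORRECT = 182

p_value = binomial_tail(N_CORRECT, N_PAIRS)   # chance of >= 182 hits by guessing
low, high = wilson_interval(N_CORRECT, N_PAIRS)

print(f"one-sided p-value vs. 50% chance: {p_value:.3g}")
print(f"Wilson 95% interval for the rate: [{low:.3f}, {high:.3f}]")
```

Under those assumptions, a result this far above chance is very hard to attribute to guessing, which is the kind of statement the reproducibility note could quote alongside the raw counts.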

Like · 207 · 5d