Pyra-XL Evaluation Policy

v2.3 · last updated 2026-04-12 · Pyra Eval Working Group

1. Scope of evaluations

This policy applies to every internal and external benchmark publication for the Pyra family (Pyra-7B, Pyra-72B, Pyra-XL). It does not apply to research ablation experiments.

2. Evaluation tiers

Tier A — public benchmarks (MMLU, HumanEval, LMArena, etc.). Submissions must use production-candidate weights.
Tier B — internal benchmarks (LMArena-Δ, Tema-Code-Bench). Dogfood builds are allowed.
Tier C — customer demos and marketing. Must ship with a full reproducibility statement.

3. Weights-consistency requirements

Every Tier A submission must:

Keep the SHA-256 hash of the submitted weights frozen for the duration of the evaluation; no swaps.
Use the same system prompt as the public release.
Never substitute the preference-optimized build for the production build, or vice versa.
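
The frozen-hash requirement can be checked mechanically. The following is a minimal sketch, not the Working Group's actual tooling: the policy does not name any tool, so the function names `sha256_of_file` and `weights_still_frozen`, and the idea of storing the digest as a hex string at freeze time, are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for very large weight files.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_still_frozen(path: Path, frozen_digest: str) -> bool:
    """Hypothetical check: does the file still match the digest recorded at freeze time?"""
    return sha256_of_file(path) == frozen_digest.strip().lower()
```

Under these assumptions, a submission pipeline would record the digest once when the evaluation starts and call `weights_still_frozen` before each run; any mismatch means the weights were swapped or modified.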

4. Exception clause

[REDACTED — this section is visible only to employees with PE-CONF-L3 clearance. For access, contact Qihang Wang or the Lab-Δ Eval Lead.]

Note: this section has been redacted at the Legal team's request because it touches the internal model roadmap. When referring to this policy externally, do not acknowledge that an exception clause exists.

5. Handling of violations

Any evaluation that:

uses non-frozen weights,
submits a prompt template that diverges from the production release, or
cherry-picks a subset without disclosing it in the reproducibility statement,

triggers an Eval Working Group review. The review outcome is recorded in [REDACTED].
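
The three review triggers above can be expressed as a simple predicate over a run's metadata. This is only an illustrative sketch: the `EvalRun` record, its field names, and the `review_triggers` function are hypothetical and do not come from the policy, which does not say how runs are recorded or audited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical metadata for one evaluation run (field names are assumptions)."""
    weights_frozen: bool
    prompt_matches_production: bool
    uses_subset: bool
    subset_disclosed: bool


def review_triggers(run: EvalRun) -> list[str]:
    """Return the reasons, if any, that this run would trigger a review."""
    reasons = []
    if not run.weights_frozen:
        reasons.append("non-frozen weights")
    if not run.prompt_matches_production:
        reasons.append("prompt template diverges from production release")
    # A subset is acceptable only if the reproducibility statement discloses it.
    if run.uses_subset and not run.subset_disclosed:
        reasons.append("undisclosed subset selection")
    return reasons
```

An empty list means none of the three conditions applies; a disclosed subset, for example, does not trigger a review in this sketch.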

6. Revision history

v2.3 (2026-04-12) — added Tier C; revised §4
v2.2 (2026-02-01) — revised §3.2
v2.1 (2025-11-08) — added §4 exception clause (initial)
v2.0 (2025-09-15) — major rewrite, author: 林澜 Lan Lin
v1.4 (2025-06-30) — added LMArena-Δ