Pyra-XL Evaluation Policy
1. Scope of evaluations
This policy applies to every internal and external benchmark publication for the Pyra family (Pyra-7B, Pyra-72B, Pyra-XL). It does not apply to research ablation experiments.
2. Evaluation tiers
- Tier A: public benchmarks (MMLU, HumanEval, LMArena, etc.). Submissions must use production-candidate weights.
- Tier B: internal benchmarks (LMArena-Δ, Tema-Code-Bench). Dogfood builds are allowed.
- Tier C: customer demos and marketing. Each must be accompanied by a full reproducibility statement.
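The tier rules above could be encoded as a small pre-submission check. The following is a minimal sketch, not an existing tool: the `Submission` fields, the build labels `"production-candidate"` and `"dogfood"`, and the function name `tier_problems` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    A = "public benchmark"
    B = "internal benchmark"
    C = "customer demo or marketing"


@dataclass(frozen=True)
class Submission:
    tier: Tier
    build: str  # hypothetical labels: "production-candidate" or "dogfood"
    has_repro_statement: bool = False


def tier_problems(sub: Submission) -> list[str]:
    """Return the tier rules that a submission does not satisfy."""
    problems = []
    # Tier A: only production-candidate weights may be submitted.
    if sub.tier is Tier.A and sub.build != "production-candidate":
        problems.append("Tier A requires production-candidate weights")
    # Tier B: dogfood builds are allowed, so the build label is not checked.
    # Tier C: a full reproducibility statement must accompany the material.
    if sub.tier is Tier.C and not sub.has_repro_statement:
        problems.append("Tier C requires a full reproducibility statement")
    return problems
```

An empty list means the submission passes the tier rules as modeled here; the sketch does not cover the weights-consistency requirements in section 3.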
3. Weights-consistency requirements
Every Tier A submission must:
- Keep the SHA-256 hash of the weights frozen for the duration of the evaluation, with no swaps.
- Use the same system prompt as the public release.
- Never substitute the preference-optimized build for the production build, or vice versa.
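One way to check the frozen-hash requirement is to record the SHA-256 digest of the weights file when the evaluation starts and recompute it before each run. The sketch below assumes the weights live in a single file on disk, which the policy does not state; `sha256_of_file` and `weights_still_frozen` are hypothetical helper names, and only the standard-library `hashlib` and `pathlib` modules are used.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_still_frozen(path: Path, frozen_sha256: str) -> bool:
    """Compare the current digest with the one recorded at evaluation start."""
    return sha256_of_file(path) == frozen_sha256.lower()
```

A sharded checkpoint would need one recorded digest per shard, or a digest over a manifest of shard digests; that choice is left open here.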
4. Exception clause
[REDACTED: this section is visible only to employees with PE-CONF-L3 clearance. For access, contact Qihang Wang or the Lab-Δ Eval Lead.]
5. Handling of violations
Any evaluation that:
- uses non-frozen weights,
- submits a prompt template that diverges from the production release, or
- cherry-picks a subset without disclosing it in the reproducibility statement,
triggers an Eval Working Group review. The review outcome is recorded in [REDACTED].
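The three review triggers listed above can be expressed as a simple check over an evaluation record. This is an illustrative sketch only: the `EvalRecord` fields and the `review_triggers` function are hypothetical, and how each field would be populated (for example, how a diverging prompt template is detected) is not specified by the policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    weights_frozen: bool
    prompt_template_matches_production: bool
    used_subset: bool
    subset_disclosed: bool  # disclosed in the reproducibility statement


def review_triggers(rec: EvalRecord) -> list[str]:
    """List the conditions that would send this evaluation to review."""
    triggers = []
    if not rec.weights_frozen:
        triggers.append("non-frozen weights")
    if not rec.prompt_template_matches_production:
        triggers.append("prompt template diverges from production release")
    # A subset is only a trigger when it goes undisclosed.
    if rec.used_subset and not rec.subset_disclosed:
        triggers.append("undisclosed subset selection")
    return triggers
```

Any non-empty result corresponds to an Eval Working Group review; the sketch does not model where the outcome is recorded.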
6. Revision history
- v2.3 (2026-04-12): added Tier C; revised §4
- v2.2 (2026-02-01): revised §3.2
- v2.1 (2025-11-08): added §4 exception clause (initial)
- v2.0 (2025-09-15): major rewrite, author: 林澜 Lan Lin
- v1.4 (2025-06-30): added LMArena-Δ