Model Accuracy Evaluation

About model accuracy evaluation

Definition

  • What it does: Model accuracy evaluation verifies the prediction accuracy of your custom inference implementation (written per this document) on a given dataset.
  • Requirement: The model inference service must be reachable via the streaming API at /v1/chat/completions.

How to run an accuracy evaluation

  1. Write your script following the “Model accuracy evaluation run script” section below. Note: Only the model adapter repository admin can start an accuracy evaluation, and the repository hardware type must be NPU.
  2. Open the Model evaluation tab and go to the model evaluation page.
  3. On the model accuracy page, open the Accuracy tab, select the model weights repository, then click Run test to start the evaluation.
  4. Wait until the accuracy run finishes.
  5. (Optional) While status is Running, you can click Stop test to cancel.
  6. When the run completes, the Accuracy section shows status and provides logs and report downloads.

Model accuracy evaluation run script

Accuracy evaluation uses deploy.sh as the entry point. Follow this document strictly when writing scripts.

The adapter repository for accuracy evaluation must include:

  • requirements.txt: Python packages needed to run the script. Omit this file if there are no extra dependencies (optional).
  • deploy.sh: The evaluation service uses this script to install dependencies and start the adapter project (required).

File locations

requirements.txt and deploy.sh must be at the repository root.

requirements.txt (optional)

List NPU-side dependencies required to run the script. Torch NPU, CANN, and Python are provided by the environment for the selected framework version; do not duplicate them in requirements.txt, since doing so can cause install conflicts. Example:

transformers==4.37.0
accelerate==0.27.2

If you need no extra packages, you may omit requirements.txt; the job then skips dependency installation.

deploy.sh

This shell script starts model adapter inference; how you run inference inside it is up to you, subject to the conventions below.

Example: install dependencies (optional)

python3 -m pip install --upgrade pip setuptools wheel

Passing parameters into the script

Weights are downloaded at test time from the repository you selected when starting the evaluation. To pass the weight path in the shell script, use the environment variable $MODEL_PATH (path to the weight files). Example for launching with vLLM:

vllm serve "$MODEL_PATH" --trust-remote-code --tensor-parallel-size 1 --dtype float16 --max-num-seqs 4 --gpu-memory-utilization 0.95

Adapter code requirements

Inference must expose a standard OpenAI-compatible inference API, and the HTTP server must listen on port 8000. The evaluation service calls the matching inference API for the selected weights’ task type. Supported task types:

Task type       | Task code          | Inference API path
Text generation | text-generation    | /v1/chat/completions
Image to text   | image-text-to-text | /v1/chat/completions
Multimodal      | any-to-any         | /v1/chat/completions

Note: Accuracy evaluation relies on /v1/chat/completions. If that endpoint is missing, the accuracy job fails.
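Before submitting a run, the endpoint contract above can be smoke-tested locally. The sketch below (standard library only, and not part of the evaluation flow) builds a streaming chat-completions request against port 8000; the weight path used as the model name is a placeholder, and the actual call is left commented out so the sketch does not require a live server:

```python
import json
import urllib.request

# Hypothetical smoke test for the adapter service. The weight path below is a
# placeholder; per this document, the weight path serves as the model name.
payload = {
    "model": "/path/to/weights",                      # placeholder
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": True,                                   # accuracy evaluation uses the streaming API
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",      # adapter must listen on port 8000
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment against a running adapter:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     for line in resp:
#         print(line.decode("utf-8"), end="")
```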

Model weight size limit

  • Maximum size: 100 GB
  • Rule: Adapter weight storage must not exceed this limit.
  • If exceeded: Weight download fails and the accuracy job fails.

End-to-end example

deploy.sh

  • vLLM adapter example:
#!/bin/sh
set -e
echo "=== MODEL_PATH set to: $MODEL_PATH ==="
vllm serve "$MODEL_PATH" --trust-remote-code --tensor-parallel-size 1 --dtype float16 --max-num-seqs 4 --gpu-memory-utilization 0.95

Note: In this vLLM example you do not need --served-model-name; the evaluation service uses the weight path as the served model name automatically.

Accuracy report

After a successful run, download and unzip the accuracy report. It contains the folders configs, logs, predictions, results, and summary.

Directory layout:

ee9480acbbac4d4aa190a124d5ddf39c/
├── configs # Combined config from model task, dataset task, and presentation task
│   └── 20260326_151326_29317.py
├── logs # Runtime logs; with --debug in the command, step logs go to stdout only
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out # Accuracy-evaluation log from predictions/
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out # Inference log
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json # Model outputs (full responses from the service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json # Raw scores from accuracy metrics
└── summary
    ├── summary_20260326_151326.csv # Final scores (CSV)
    ├── summary_20260326_151326.md # Final scores (Markdown)
    └── summary_20260326_151326.txt # Final scores (plain text)

Example content of summary/summary_20260326_151326.md:

| dataset | version | metric | mode | vllm-api-stream-chat |
|---|---|---|---|---|
| demo_gsm8k | 0ba9da | accuracy (5 runs average) | gen | 5.60 |
| demo_gsm8k | 0ba9da | avg@5 | gen | 5.60 |
| demo_gsm8k | 0ba9da | pass@5 | gen | 24.00 |
| demo_gsm8k | 0ba9da | cons@5 | gen | 0.00 |

Interpreting accuracy results

I. Relationship among n, k, and num_return_sequences in the API config

1. How pass@k is computed

This section only sketches pass@k; for the other metrics, see section II, “Definitions and relationships: pass@k, cons@k, avg@n”.

pass@k is a core metric for code generation: the probability that at least one of k sampled candidates passes all tests. It uses an unbiased estimator to avoid high variance from naive sampling. Steps:

  1. Samples and correctness

    • For each problem, generate n candidates (n ≥ k); c of them pass tests.
    • Example: n = 100 samples, c = 20 correct, giving a per-sample pass rate of c/n = 0.2.
  2. Combinatorial form

    • Probability that a random k-subset of the n samples is all wrong:
    $$ \frac{\binom{n-c}{k}}{\binom{n}{k}} $$
    • pass@k is one minus that (at least one success):

      $$ pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} $$
    • Stable implementation: to avoid factorial overflow, code uses:

    pass@k = 1 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
    
  3. Why unbiased estimation

    • In the current implementation, k and n are tied to num_return_sequences only; this section highlights why the unbiased form helps.

    • Sampling k times directly has high variance (especially for large k); generating n with n >> k and using the combinatorial formula improves stability. Example: n = 5, c = 3, k = 2:

    • P(all wrong in a random pair) = C(2,2)/C(5,2) = 1/10 = 0.1

    • pass@2 = 1 - 0.1 = 0.9 (90% chance at least one success in two tries).
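The product form above can be checked directly. Below is a minimal sketch using the compute_pass_at_k(n, c, k) signature referenced later in this document, applied to the worked example (n = 5, c = 3, k = 2):

```python
import numpy as np

def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), evaluated in product form."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a success.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Worked example from the text: n = 5, c = 3, k = 2.
# P(all wrong in a random pair) = C(2,2)/C(5,2) = 0.1, so pass@2 = 0.9.
print(compute_pass_at_k(5, 3, 2))  # ≈ 0.9
```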

2. How n, k, and num_return_sequences relate

All three must be positive integers. The accuracy service defaults num_return_sequences to 5.

Configurable? | Parameter | Meaning | Where defined | Constraints
No | n | Number of rollouts per problem (total generated samples) | Not separately configurable; equals num_return_sequences | Requires n ≥ k; no separate n today
No | k | Sample size for evaluation; scales pass@k | Not separately configurable; equals num_return_sequences | Same as above
Yes | num_return_sequences | Independent repeated completions per request | API model config; default 5 | Positive integer

3. Summary

  • pass@k: Unbiased combinatorial estimate; reduces variance of naive sampling.
  • Current constraints:
    • n and k are not independently configurable—only num_return_sequences in the API config matters.
    • n = k = num_return_sequences
    • The k or n in names like pass@k, cons@k, avg@n all refer to num_return_sequences

n and k are used only at evaluation time, while num_return_sequences drives inference; in all cases the value comes from the API config. When running evaluation (--mode eval), ensure reused inference outputs were produced with the same num_return_sequences as the current setting.


II. Definitions and relationships: pass@k, cons@k, avg@n

1. Background

In LLM and multimodal RL evaluation, pass@k, cons@k, and avg@n summarize multi-sample behavior from different angles. They suit tasks that need multiple independent attempts (code, math, RL, etc.) and give statistically meaningful views of performance.

2. Definitions and computation

2.1 Definitions

Formal definitions:

pass@k:

$$ pass@k = 1 - \prod_{j=n-c+1}^{n} \left(1 - \frac{k}{j}\right) $$

cons@k:

$$ cons@k = \frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2) $$

avg@n:

$$ avg@n = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n} $$

Metric | What it measures | Goal | Range
pass@k | P(at least one correct) (unbiased) | Reliability of solving | [0, 1]
cons@k | Majority-correct rate | Stability of outputs | [0, 1]
avg@n | Mean per-sample accuracy | Overall correctness | [0, 1]

Where:

  • N: Total number of problems in the dataset.
  • n: Rollouts per problem (total samples), matching code parameter n.
  • k: Evaluation sample size for pass@k and cons@k, matching code parameter k.
  • cᵢ: Number of correct samples for problem i.
  • I(·): Indicator (1 if condition holds, else 0).
  • Product index j runs from n - c + 1 to n for numerical stability.

2.2 Details

  • pass@k: Unbiased estimator; implemented as compute_pass_at_k(n, c, k) where n is total samples, c correct count, k sample size—algebraically equivalent to the binomial form but implemented in product form.
  • cons@k: “Consistency” / stability—share of problems where a strict majority of samples are correct. Per problem, count 1 if correct count > k/2, else 0; average over problems (majority-vote style).
  • avg@n: Mean over problems of c / n (per-problem accuracy); reflects overall sample-level correctness.

2.3 Example (num_return_sequences = 3, so n = k = 3)

Problem 1: preds [A, A, X] → c = 2
Problem 2: preds [B, C, B] → c = 2
Problem 3: preds [X, X, C] → c = 1
Problem 4: preds [X, X, X] → c = 0

pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75 (at least one correct on 1–3; none on 4)
avg@3 = (2/3 + 2/3 + 1/3 + 0/3)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5 (majority correct on 1–2; not on 3–4)
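The three numbers above can be reproduced in a few lines. A minimal sketch follows; the helper reimplements the pass@k estimator in the same product form, and the variable names are illustrative:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k in product form (1.0 when fewer than k samples are wrong)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Correct-sample counts c_i for the four problems above, with n = k = 3
counts = [2, 2, 1, 0]
n = k = 3
N = len(counts)

pass_3 = sum(pass_at_k(n, c, k) for c in counts) / N   # 0.75
avg_3  = sum(c / n for c in counts) / N                # ≈ 0.4167
cons_3 = sum(c > k / 2 for c in counts) / N            # 0.5
print(pass_3, avg_3, cons_3)
```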

3. cons@k vs avg@n

3.1 Ordering

By construction, pass@k is always ≥ both avg@n and cons@k (any problem that contributes to avg@n or cons@k has at least one correct sample), so we do not compare pass@k to the other two here.

The ordering of cons@k and avg@n depends on prediction patterns.

Case 1: cons@k > avg@n
  • Pattern: Highly consistent but not perfect (many problems have a strict majority correct, yet not 100% per-sample accuracy).
  • Example: k = 3, two problems:
    • P1: [A, A, B], gold A → c = 2, rate ≈ 0.667; majority A (2 > 1.5) → cons contributes 1.
    • P2: [B, B, C], gold B → same; cons contributes 1.
    • avg@n = (0.667 + 0.667) / 2 = 0.667
    • cons@k = (1 + 1) / 2 = 1.0
    • So cons@k > avg@n.
Case 2: cons@k < avg@n
  • Pattern: Predictions spread out—no majority—but average correctness can still be moderate (correctness spread thin).
  • Example: k = 3, two problems:
    • P1: [A, B, C], gold A → c = 1, rate ≈ 0.333; no strict majority → cons 0.
    • P2: [A, B, C], gold B → same → cons 0.
    • avg@n = 0.333
    • cons@k = 0
    • So cons@k < avg@n.
Case 3: cons@k ≈ avg@n
  • Pattern: Near-perfect or all-wrong, or distributions where majority rate tracks mean rate.
  • Example: k = 3, two problems:
    • P1: [A, A, A], gold A → rate 1.0; majority correct → cons 1.
    • P2: [B, B, B], gold C → rate 0.0; majority wrong → cons 0.
    • avg@n = 0.5
    • cons@k = 0.5
    • So cons@k = avg@n.
  • High agreement (strict majorities often correct) can push cons@k above avg@n because cons only needs a majority while mean is dragged by errors.
  • Spread predictions without majorities can make avg@n exceed cons@k (partial credit vs majority rule).
  • At extremes (all right or all wrong), the two often align.
  • In practice (e.g. multi-turn RL), use cons@k for stability and avg@n for overall accuracy—they complement each other; no fixed ordering.
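The three cases above can be verified numerically. A minimal sketch (the helper name is illustrative) that computes cons@k and avg@n from per-problem correct-sample counts:

```python
def cons_and_avg(counts, n, k):
    """cons@k and avg@n from per-problem correct-sample counts."""
    N = len(counts)
    cons = sum(c > k / 2 for c in counts) / N
    avg = sum(c / n for c in counts) / N
    return cons, avg

# Case 1: consistent but imperfect ([A, A, B]-style), c = 2 on both problems
print(cons_and_avg([2, 2], n=3, k=3))  # (1.0, ≈0.667): cons@k > avg@n
# Case 2: spread predictions ([A, B, C]-style), c = 1 on both problems
print(cons_and_avg([1, 1], n=3, k=3))  # (0.0, ≈0.333): cons@k < avg@n
# Case 3: one all-right and one all-wrong problem
print(cons_and_avg([3, 0], n=3, k=3))  # (0.5, 0.5): cons@k = avg@n
```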

4. Takeaways

  1. Which metric when

    • pass@k: Model potential / best-effort success
    • cons@k: Stability
    • avg@n: Aggregate sample accuracy
  2. Common pitfalls

    • Only pass@1: Ignores multi-try capability
    • Ignoring cons@k: May hide instability in production
    • avg@n alone: Cannot separate consistency vs fault tolerance
  3. Heuristic thresholds and decisions

    Thresholds below are illustrative: high (>0.8), mid (0.5–0.8), low (<0.5). Use as guidance only.

    • Scenario guide

      Scenario | Primary | Secondary | Targets
      Reliability-first (medical, finance) | cons@k | pass@k | cons@k > 0.8, pass@k > 0.9
      Fault-tolerance-first (code, exploration) | pass@k | avg@n | pass@k > 0.8, avg@n > 0.7
      Balanced (general assistant) | avg@n | cons@k + pass@k | avg@n > 0.75
    • Decision matrix

      Profile | Situation | Next steps
      High pass@k, mid avg@n, low cons@k | Strong but unstable | Improve consistency (temperature, voting)
      Mid across all | Even but mediocre | Broad improvements (data, prompts)
      Low pass@k & avg@n, high cons@k | Systematic bias | Audit data, prompts, model bias
      All low | Failing | Retrain or change architecture

Using all three together gives a fuller picture for tuning and deployment.

5. Caveats

Not every dataset config’s Evaluator implements all three metrics. If eval_cfg points to an evaluator that does not return the needed quantities, the UI falls back to the original accuracy-style metrics only.

Source for the above: AISBench documentation.