Model Accuracy Evaluation

About model accuracy evaluation

Definition

  • What it does: Model accuracy evaluation verifies the prediction accuracy of your custom inference implementation (written per this document) on a given dataset.
  • Requirement: The model inference service must be reachable via the streaming API at /v1/chat/completions.

How to run an accuracy evaluation

  1. Write your script following the “Model accuracy evaluation run script” section below. Note: Only the model adapter repository admin can start an accuracy evaluation, and the repository hardware type must be NPU.
  2. Open the Model evaluation tab and go to the model evaluation page.
  3. On the model accuracy page, open the Accuracy tab, select the model weights repository, then click Run test to start the evaluation.
  4. Wait until the accuracy run finishes.
  5. (Optional) While status is Running, you can click Stop test to cancel.
  6. When the run completes, the Accuracy section shows status and provides logs and report downloads.

Model accuracy evaluation run script

Accuracy evaluation uses deploy.sh as the entry point. Follow this document strictly when writing scripts.

The adapter repository for accuracy evaluation must include:

  • requirements.txt: Python packages needed to run the script. Omit this file if there are no extra dependencies (optional).
  • deploy.sh: The evaluation service uses this script to install dependencies and start the adapter project (required).

File locations

requirements.txt and deploy.sh must be at the repository root.

requirements.txt (optional)

List NPU-side dependencies required to run the script. Torch NPU, CANN, and Python are provided by the environment for the selected framework version; do not duplicate them in requirements.txt, since doing so can cause install conflicts. Example:

transformers==4.37.0
accelerate==0.27.2

If you need no extra packages, you may omit requirements.txt; the job then skips dependency installation.

deploy.sh

This shell script starts model adapter inference; how you run inference inside it is up to you, subject to the conventions below.

Example: install dependencies (optional)

python3 -m pip install --upgrade pip setuptools wheel

Passing parameters into the script

Weights are downloaded at test time from the repository you selected when starting the evaluation. To pass the weight path in the shell script, use the environment variable $MODEL_PATH (path to the weight files). Example for launching with vLLM:

vllm serve "$MODEL_PATH" --trust-remote-code --tensor-parallel-size 1 --dtype float16 --max-num-seqs 4 --gpu-memory-utilization 0.95

Adapter code requirements

Inference must expose a standard OpenAI-compatible inference API, and the HTTP server must listen on port 8000. The evaluation service calls the matching inference API for the selected weights’ task type. Supported task types:

Task type       | Task code          | Inference API path
Text generation | text-generation    | /v1/chat/completions
Image to text   | image-text-to-text | /v1/chat/completions
Multimodal      | any-to-any         | /v1/chat/completions

Note: Accuracy evaluation relies on /v1/chat/completions. If that endpoint is missing, the accuracy job fails.
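Before submitting a run, the endpoint contract above can be smoke-tested locally. The sketch below (standard library only, and not part of the evaluation flow) builds a streaming chat-completions request against port 8000; the weight path used as the model name is a placeholder, and the actual call is left commented out so the sketch does not require a live server:

```python
import json
import urllib.request

# Hypothetical smoke test for the adapter service. The weight path below is a
# placeholder; per this document, the weight path serves as the model name.
payload = {
    "model": "/path/to/weights",                      # placeholder
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": True,                                   # accuracy evaluation uses the streaming API
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",      # adapter must listen on port 8000
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment against a running adapter:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     for line in resp:
#         print(line.decode("utf-8"), end="")
```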

Model weight size limit

  • Maximum size: 100 GB
  • Rule: Adapter weight storage must not exceed this limit.
  • If exceeded: Weight download fails and the accuracy job fails.

End-to-end example

deploy.sh

  • vLLM adapter example:
#!/bin/sh
set -e
echo "=== MODEL_PATH set to: $MODEL_PATH ==="
vllm serve "$MODEL_PATH" --trust-remote-code --tensor-parallel-size 1 --dtype float16 --max-num-seqs 4 --gpu-memory-utilization 0.95

Note: In this vLLM example you do not need --served-model-name; the evaluation service uses the weight path as the served model name automatically.

Accuracy report

After a successful run, download and unzip the accuracy report. It contains the folders configs, logs, predictions, results, and summary.

Directory layout:

ee9480acbbac4d4aa190a124d5ddf39c/
├── configs # Combined config from model task, dataset task, and presentation task
│   └── 20260326_151326_29317.py
├── logs # Runtime logs; with --debug in the command, step logs go to stdout only
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out # Accuracy-evaluation log from predictions/
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out # Inference log
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json # Model outputs (full responses from the service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json # Raw scores from accuracy metrics
└── summary
    ├── summary_20260326_151326.csv # Final scores (CSV)
    ├── summary_20260326_151326.md # Final scores (Markdown)
    └── summary_20260326_151326.txt # Final scores (plain text)

Example content of summary/summary_20260326_151326.md:

| dataset | version | metric | mode | vllm-api-stream-chat |
|---|---|---|---|---|
| demo_gsm8k | 0ba9da | accuracy (5 runs average) | gen | 5.60 |
| demo_gsm8k | 0ba9da | avg@5 | gen | 5.60 |
| demo_gsm8k | 0ba9da | pass@5 | gen | 24.00 |
| demo_gsm8k | 0ba9da | cons@5 | gen | 0.00 |

Interpreting accuracy results

I. Relationship among n, k, and num_return_sequences in the API config

1. How pass@k is computed

This section only sketches pass@k; for the other metrics, see section II, “Definitions and relationships: pass@k, cons@k, avg@n”.

pass@k is a core metric for code generation: the probability that at least one of k sampled candidates passes all tests. It uses an unbiased estimator to avoid high variance from naive sampling. Steps:

  1. Samples and correctness

    • For each problem, generate n candidates (n ≥ k); c of them pass tests.
    • Example: n = 100 samples, c = 20 correct, giving a per-sample pass rate of c/n = 0.2.
  2. Combinatorial form

    • Probability that a random k-subset of the n samples is all wrong:
    $$ \frac{\binom{n-c}{k}}{\binom{n}{k}} $$
    • pass@k is one minus that (at least one success):

      $$ pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} $$
    • Stable implementation: to avoid factorial overflow, code uses:

    pass@k = 1 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
    
  3. Why unbiased estimation

    • In the current implementation, k and n are tied to num_return_sequences only; this section highlights why the unbiased form helps.

    • Sampling k times directly has high variance (especially for large k); generating n with n >> k and using the combinatorial formula improves stability. Example: n = 5, c = 3, k = 2:

    • P(all wrong in a random pair) = C(2,2)/C(5,2) = 1/10 = 0.1

    • pass@2 = 1 - 0.1 = 0.9 (90% chance at least one success in two tries).
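The product form above can be checked directly. Below is a minimal sketch using the compute_pass_at_k(n, c, k) signature referenced later in this document, applied to the worked example (n = 5, c = 3, k = 2):

```python
import numpy as np

def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), evaluated in product form."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a success.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Worked example from the text: n = 5, c = 3, k = 2.
# P(all wrong in a random pair) = C(2,2)/C(5,2) = 0.1, so pass@2 = 0.9.
print(compute_pass_at_k(5, 3, 2))  # ≈ 0.9
```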

2. How n, k, and num_return_sequences relate

All three must be positive integers. The accuracy service defaults num_return_sequences to 5.

Configurable? | Parameter | Meaning | Where defined | Constraints
No | n | Number of rollouts per problem (total generated samples) | Not separately configurable; equals num_return_sequences | Requires n ≥ k; no separate n today
No | k | Sample size for evaluation; scales pass@k | Not separately configurable; equals num_return_sequences | Same as above
Yes | num_return_sequences | Independent repeated completions per request | API model config; default 5 | Positive integer

3. Summary

  • pass@k: Unbiased combinatorial estimate; reduces variance of naive sampling.
  • Current constraints:
    • n and k are not independently configurable—only num_return_sequences in the API config matters.
    • n = k = num_return_sequences
    • The k or n in names like pass@k, cons@k, avg@n all refer to num_return_sequences

n and k are used only at evaluation time, while num_return_sequences drives inference; in all cases the value comes from the API config. When running evaluation (--mode eval), ensure reused inference outputs were produced with the same num_return_sequences as the current setting.


II. Definitions and relationships: pass@k, cons@k, avg@n

1. Background

In LLM and multimodal RL evaluation, pass@k, cons@k, and avg@n summarize multi-sample behavior from different angles. They suit tasks that need multiple independent attempts (code, math, RL, etc.) and give statistically meaningful views of performance.

2. Definitions and computation

2.1 Definitions

Formal definitions:

pass@k:

$$ pass@k = 1 - \prod_{j=n-c+1}^{n} \left(1 - \frac{k}{j}\right) $$

cons@k:

$$ cons@k = \frac{1}{N} \sum_{i=1}^{N} I(c_i > k/2) $$

avg@n:

$$ avg@n = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{n} $$

Metric | What it measures | Goal | Range
pass@k | P(at least one correct) (unbiased) | Reliability of solving | [0, 1]
cons@k | Majority-correct rate | Stability of outputs | [0, 1]
avg@n | Mean per-sample accuracy | Overall correctness | [0, 1]

Where:

  • N: Total number of problems in the dataset.
  • n: Rollouts per problem (total samples), matching code parameter n.
  • k: Evaluation sample size for pass@k and cons@k, matching code parameter k.
  • cᵢ: Number of correct samples for problem i.
  • I(·): Indicator (1 if condition holds, else 0).
  • Product index j runs from n - c + 1 to n for numerical stability.

2.2 Details

  • pass@k: Unbiased estimator; implemented as compute_pass_at_k(n, c, k) where n is total samples, c correct count, k sample size—algebraically equivalent to the binomial form but implemented in product form.
  • cons@k: “Consistency” / stability—share of problems where a strict majority of samples are correct. Per problem, count 1 if correct count > k/2, else 0; average over problems (majority-vote style).
  • avg@n: Mean over problems of c / n (per-problem accuracy); reflects overall sample-level correctness.

2.3 Example (num_return_sequences = 3, so n = k = 3)

Problem 1: preds [A, A, X] → c = 2
Problem 2: preds [B, C, B] → c = 2
Problem 3: preds [X, X, C] → c = 1
Problem 4: preds [X, X, X] → c = 0

pass@3 = (1.0 + 1.0 + 1.0 + 0.0)/4 = 0.75 (at least one correct on 1–3; none on 4)
avg@3 = (2/3 + 2/3 + 1/3 + 0/3)/4 ≈ 0.4167
cons@3 = (1 + 1 + 0 + 0)/4 = 0.5 (majority correct on 1–2; not on 3–4)
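The three numbers above can be reproduced in a few lines. A minimal sketch follows; the helper reimplements the pass@k estimator in the same product form, and the variable names are illustrative:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k in product form (1.0 when fewer than k samples are wrong)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Correct-sample counts c_i for the four problems above, with n = k = 3
counts = [2, 2, 1, 0]
n = k = 3
N = len(counts)

pass_3 = sum(pass_at_k(n, c, k) for c in counts) / N   # 0.75
avg_3  = sum(c / n for c in counts) / N                # ≈ 0.4167
cons_3 = sum(c > k / 2 for c in counts) / N            # 0.5
print(pass_3, avg_3, cons_3)
```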

3. cons@k vs avg@n

3.1 Ordering

By construction, pass@k is always ≥ both avg@n and cons@k (any problem that contributes to avg@n or cons@k has at least one correct sample), so we do not compare pass@k to the other two here.

The ordering of cons@k and avg@n depends on prediction patterns.

Case 1: cons@k > avg@n
  • Pattern: Highly consistent but not perfect (many problems have a strict majority correct, yet not 100% per-sample accuracy).
  • Example: k = 3, two problems:
    • P1: [A, A, B], gold A → c = 2, rate ≈ 0.667; majority A (2 > 1.5) → cons contributes 1.
    • P2: [B, B, C], gold B → same; cons contributes 1.
    • avg@n = (0.667 + 0.667) / 2 = 0.667
    • cons@k = (1 + 1) / 2 = 1.0
    • So cons@k > avg@n.
Case 2: cons@k < avg@n
  • Pattern: Predictions spread out—no majority—but average correctness can still be moderate (correctness spread thin).
  • Example: k = 3, two problems:
    • P1: [A, B, C], gold A → c = 1, rate ≈ 0.333; no strict majority → cons 0.
    • P2: [A, B, C], gold B → same → cons 0.
    • avg@n = 0.333
    • cons@k = 0
    • So cons@k < avg@n.
Case 3: cons@k ≈ avg@n
  • Pattern: Near-perfect or all-wrong, or distributions where majority rate tracks mean rate.
  • Example: k = 3, two problems:
    • P1: [A, A, A], gold A → rate 1.0; majority correct → cons 1.
    • P2: [B, B, B], gold C → rate 0.0; majority wrong → cons 0.
    • avg@n = 0.5
    • cons@k = 0.5
    • So cons@k = avg@n.
  • High agreement (strict majorities often correct) can push cons@k above avg@n because cons only needs a majority while mean is dragged by errors.
  • Spread predictions without majorities can make avg@n exceed cons@k (partial credit vs majority rule).
  • At extremes (all right or all wrong), the two often align.
  • In practice (e.g. multi-turn RL), use cons@k for stability and avg@n for overall accuracy—they complement each other; no fixed ordering.
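The three cases above can be verified numerically. A minimal sketch (the helper name is illustrative) that computes cons@k and avg@n from per-problem correct-sample counts:

```python
def cons_and_avg(counts, n, k):
    """cons@k and avg@n from per-problem correct-sample counts."""
    N = len(counts)
    cons = sum(c > k / 2 for c in counts) / N
    avg = sum(c / n for c in counts) / N
    return cons, avg

# Case 1: consistent but imperfect ([A, A, B]-style), c = 2 on both problems
print(cons_and_avg([2, 2], n=3, k=3))  # (1.0, ≈0.667): cons@k > avg@n
# Case 2: spread predictions ([A, B, C]-style), c = 1 on both problems
print(cons_and_avg([1, 1], n=3, k=3))  # (0.0, ≈0.333): cons@k < avg@n
# Case 3: one all-right and one all-wrong problem
print(cons_and_avg([3, 0], n=3, k=3))  # (0.5, 0.5): cons@k = avg@n
```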

4. Takeaways

  1. Which metric when

    • pass@k: Model potential / best-effort success
    • cons@k: Stability
    • avg@n: Aggregate sample accuracy
  2. Common pitfalls

    • Only pass@1: Ignores multi-try capability
    • Ignoring cons@k: May hide instability in production
    • avg@n alone: Cannot separate consistency vs fault tolerance
  3. Heuristic thresholds and decisions

    Thresholds below are illustrative: high (>0.8), mid (0.5–0.8), low (<0.5). Use as guidance only.

    • Scenario guide

      Scenario | Primary | Secondary | Targets
      Reliability-first (medical, finance) | cons@k | pass@k | cons@k > 0.8, pass@k > 0.9
      Fault-tolerance-first (code, exploration) | pass@k | avg@n | pass@k > 0.8, avg@n > 0.7
      Balanced (general assistant) | avg@n | cons@k + pass@k | avg@n > 0.75
    • Decision matrix

      Profile | Situation | Next steps
      High pass@k, mid avg@n, low cons@k | Strong but unstable | Improve consistency (temperature, voting)
      Mid across all | Even but mediocre | Broad improvements (data, prompts)
      Low pass@k & avg@n, high cons@k | Systematic bias | Audit data, prompts, model bias
      All low | Failing | Retrain or change architecture

Using all three together gives a fuller picture for tuning and deployment.

5. Caveats

Not every dataset config’s Evaluator implements all three metrics. If eval_cfg points to an evaluator that does not return the needed quantities, the UI falls back to the original accuracy-style metrics only.

Source for the above: AISBench documentation.