Verification & Integrity Tests
SigRank ranks operators on token telemetry. The obvious question: how do you know the numbers aren't fabricated, gamed, or bot-generated? Every result here comes from a real run on real data. Where a test failed in its first form, we show that too, because a test that can't fail isn't a test.
Why these tests exist
The cascade thesis says operator token usage is a multiplicative process — each stage compounds on the last. Multiplicative processes leave statistical fingerprints that fabricated or mechanical data don't reproduce. To fake a high rank, a forger would have to simultaneously fake the right first-digit distribution, the right internal arithmetic, the right concentration, and the right human activity schedule — in one self-consistent file. Each test closes one of those escape routes.
Test 1 — Benford's Law (first-digit conformity)
If session totals come from a genuine multiplicative work process, their leading digits should follow Benford's Law: P(first digit = d) = log₁₀(1 + 1/d). The theory was never fitted to digits; it predicts this as a side effect. Pre-registered kill condition (declared before seeing data): a Nigrini mean absolute deviation (MAD) between observed and expected first-digit proportions above 0.015 counts as nonconformity.
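The MAD statistic is simple enough to sketch. The version below is a minimal illustration, not the SigRank scoring code: the function names are hypothetical, and the intermediate band edges (0.006 and 0.012) are assumed from Nigrini's commonly cited first-digit table, since the text above states only the 0.015 cutoff.

```python
import math
from collections import Counter

# Expected Benford proportion for each leading digit 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def benford_mad(totals) -> float:
    """Mean absolute deviation between observed and Benford first-digit shares."""
    digits = [leading_digit(t) for t in totals if t > 0]
    if not digits:
        raise ValueError("need at least one positive total")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts.get(d, 0) / n - BENFORD[d]) for d in range(1, 10)) / 9


def verdict(mad: float) -> str:
    """Map a MAD value to a conformity band (band edges are an assumption)."""
    if mad > 0.015:
        return "NONCONFORM"  # the pre-registered kill condition
    if mad > 0.012:
        return "MARGINAL"
    if mad > 0.006:
        return "ACCEPTABLE"
    return "CLOSE"


if __name__ == "__main__":
    # Powers of two are a classic Benford-conforming sequence.
    sample = [2 ** k for k in range(1000)]
    print(verdict(benford_mad(sample)))
```

Only integer token totals are needed as input, which is consistent with the content-free design described later in this section.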
First result — the registered prediction FAILED:
| Set | n | MAD | Verdict |
|---|---|---|---|
| All agents | 544 | 0.01604 | NONCONFORM |
| Claude only | 487 | 0.01896 | NONCONFORM |
| Codex only | 51 | 0.01793 | NONCONFORM |
Raw session totals did not conform. We report this plainly — the first prediction was falsified. But the failure was diagnostic: digit 1 was under-represented, 5 and 9 over-represented — the textbook signature of lower-bound truncation. The cause is mechanical: every coding session begins with ~20–23k tokens of cached system prompt — an additive constant on top of the multiplicative process — which starves the leading-1 bucket and breaks Benford.
The fix confirmed the mechanism:
| Approach | n | MAD | Verdict |
|---|---|---|---|
| All sessions (raw) | 544 | 0.01604 | NONCONFORM |
| Sessions > 10× floor | 269 | 0.03193 | NONCONFORM |
| Floor-subtracted (value − 22k) | 532 | 0.01109 | ACCEPTABLE |
Subtracting the measured floor removes the additive constant and leaves the multiplicative remainder, and that recovers conformity. Subsetting does not fix it; subtraction does. A synthetic simulation reproduced the whole story: a pure multiplicative process conforms (MAD 0.00974); adding a 22k floor breaks it (0.03253), the same failure mode seen in the real data; subtracting the floor recovers it (0.00787). The mechanism reproduces in synthesis, so it is not a story told after the fact.
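The floor mechanism can be seen in a deterministic toy version. This is a sketch under stated assumptions, not the simulation reported above: it uses a geometric sequence spread evenly over four decades as a stand-in for a multiplicative process, so its MAD values will differ from the published ones. Only the direction of the effect is the point.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
FLOOR = 22_000  # additive system-prompt floor, the 22k figure from the text


def mad(values) -> float:
    """MAD of first-digit proportions against Benford, for positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    n = len(digits)
    return sum(abs(digits.count(d) / n - BENFORD[d - 1]) for d in range(1, 10)) / 9


def digit_one_share(values) -> float:
    """Fraction of positive values whose leading digit is 1."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    return digits.count(1) / len(digits)


# Stand-in multiplicative process: evenly spaced in log space from 1e3 to 1e7.
# Illustrative synthetic values only, not SigRank telemetry.
pure = [int(1000 * 10 ** (k / 1000)) for k in range(4000)]
with_floor = [v + FLOOR for v in pure]       # additive constant on top
recovered = [v - FLOOR for v in with_floor]  # floor-subtracted remainder

for name, series in [("pure", pure), ("with floor", with_floor), ("recovered", recovered)]:
    print(f"{name:>10}: MAD={mad(series):.5f}  digit-1 share={digit_one_share(series):.3f}")
```

In this toy setup the floor pushes every small value into the 23k to 32k range, which empties the leading-1 bucket in the lowest decade; subtracting the same constant restores the original sequence exactly.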
The defensible claim: the multiplicative cascade is Benford-conforming once the measured additive system-prompt floor is removed. The raw version is falsified and we say so; the floor-corrected version holds and is mechanistically motivated — a stronger result than naive conformity. The test had teeth, fired, and revealed a real artifact (the floor) that is now itself a tracked quantity.
Test 2 — the bot control (Hermes)
A natural-conformity claim is only meaningful if something fails it. Among the sessions was a set of 5 automated probe runs (“hermes”): totals 4208, 4152, 4115, 4222, 4258. Every first digit was 4. Zero digit diversity — a fixed-size mechanical probe, exactly the non-Benford signature a bot produces. This is the control that gives Test 1 meaning: the method distinguishes a multiplicative human process from a constant-size machine process.
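A quick check on the five quoted totals makes the control concrete. This is a minimal sketch: the `looks_constant_size` helper and its 5% spread threshold are hypothetical, chosen for illustration rather than taken from SigRank. With only five values a full Benford test is not meaningful, so the sketch looks at digit diversity and relative spread instead.

```python
hermes_totals = [4208, 4152, 4115, 4222, 4258]  # probe totals quoted above


def leading_digits(totals) -> set:
    """Set of distinct first digits among positive integer totals."""
    return {int(str(t)[0]) for t in totals if t > 0}


def relative_spread(totals) -> float:
    """(max - min) / mean: near zero for a fixed-size mechanical process."""
    mean = sum(totals) / len(totals)
    return (max(totals) - min(totals)) / mean


def looks_constant_size(totals, max_spread: float = 0.05) -> bool:
    """Hypothetical flag: one leading digit and a spread under max_spread."""
    return len(leading_digits(totals)) == 1 and relative_spread(totals) < max_spread


print(leading_digits(hermes_totals))       # a single digit
print(looks_constant_size(hermes_totals))  # flagged as constant-size
```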
Test 3 — the telescoping identity (internal-consistency lock)
The cascade has three stages: transmission (O/I), commitment (Create/O), and reuse (Read/Create). Their product must equal Read/Input (that is, cache_read/input) exactly, because the intermediate terms cancel:
(O/I) × (Create/O) × (Read/Create) = Read/Input
Writing Leverage for Read/Input, 10^(10xDEV) = Leverage holds by identity, not by fit. An operator cannot inflate their amplification exponent (10xDEV) independently of their leverage; the two are bound by algebra. A fabricated row with a high 10xDEV but the wrong Read/Input ratio fails the identity and is detectable. We recompute this for every operator from the raw four pillars; it holds for every legitimate row.
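The identity check can be sketched as follows. Assumptions are labeled: the row field names (`input`, `output`, `create`, `read`, `claimed_10xdev`) are hypothetical, and 10xDEV is taken to be log₁₀ of Leverage, which is what the identity above implies. Exact rational arithmetic is used so the cancellation is visible without floating-point noise.

```python
import math
from fractions import Fraction


def stage_ratios(inp: int, out: int, create: int, read: int):
    """Exact stage ratios (O/I, Create/O, Read/Create) from the four raw counts."""
    if min(inp, out, create, read) <= 0:
        raise ValueError("all four pillars must be positive")
    return Fraction(out, inp), Fraction(create, out), Fraction(read, create)


def leverage(inp: int, out: int, create: int, read: int) -> Fraction:
    """Product of the three stages; telescopes to Read/Input."""
    transmission, commitment, reuse = stage_ratios(inp, out, create, read)
    return transmission * commitment * reuse


def ten_x_dev(inp: int, out: int, create: int, read: int) -> float:
    """Assumed definition: 10xDEV = log10(Leverage)."""
    return math.log10(float(leverage(inp, out, create, read)))


def row_is_consistent(row: dict, tol: float = 1e-9) -> bool:
    """Recompute 10xDEV from raw counts and compare with the claimed value."""
    recomputed = ten_x_dev(row["input"], row["output"], row["create"], row["read"])
    return math.isclose(row["claimed_10xdev"], recomputed, abs_tol=tol)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    honest = {"input": 1000, "output": 4000, "create": 20000, "read": 500000}
    honest["claimed_10xdev"] = ten_x_dev(1000, 4000, 20000, 500000)
    forged = dict(honest, claimed_10xdev=3.5)  # inflated exponent, same raw counts
    print(row_is_consistent(honest), row_is_consistent(forged))
```

Because the server-side value is derived only from the four raw integers, a claimed exponent that disagrees with them has nowhere to hide.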
Test 4 — content-free verification (the privacy license)
A separate experiment (EXP-007) established that conserved structure is detectable without reading content: across negation-paraphrase pairs, surface overlap was zero (Jaccard 0.00) while semantic equivalence was complete (NLI 1.00) — “You must not smoke” and “No smoking” converge to one kernel. The consequence: a statistical witness (token counts) is a legitimate instrument for a conservation-driven process. The no-content-access design is not a privacy compromise we tolerate — it is the architecture this result predicts. We rank the four integers; we never see what you typed.
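The zero surface overlap in the quoted pair is easy to verify by hand. The sketch below assumes word-level, lowercased tokens; the tokenization actually used in EXP-007 is not stated here, so treat this as one plausible reading. The NLI half of the result needs a trained model and is not reproduced.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word tokens (an assumed tokenization, not EXP-007's)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


print(jaccard("You must not smoke", "No smoking"))
```

The two sentences share no word tokens, so the similarity is 0.0 even though they state the same rule.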
Test 5 — the threat model (failure taxonomy → countermeasures)
| Gaming attempt | Countermeasure |
|---|---|
| Score inflation / single-metric overclaim | Composite scoring; no single metric escalates rank |
| Fake convergence on pre-processed numbers | Server recomputes everything from the RAW payload |
| High leverage with inverted meaning (idle re-read) | Convergence + concentration-band check |
| Merging metrics to blur a weak one | Components stay separately binding |
What's still being hardened (stated honestly)
- Cadence (Test 6, in development): human activity is bursty with heavy tails (Barabási, Nature 435, 207, 2005) and carries 1/f timing noise (Gilden et al., Science 267, 1837, 1995); machine activity is typically periodic or Poisson. Session timestamps already carry the data needed for a timing-domain humanity test. Not yet deployed.
- Data provenance note: the Benford figures above were computed on a 544-session sample transcribed by hand from session JSON. They are real and reproducible from that sample, but canonical published numbers should be regenerated from source telemetry. We flag this rather than hide it.
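To illustrate the cadence idea in the first bullet, one common summary of timing regularity is the burstiness coefficient B = (σ − μ)/(σ + μ) of inter-event gaps, due to Goh and Barabási. It is swapped in here purely as an example: nothing above says the planned Test 6 will use it, and the function name is hypothetical. B is −1 for a perfectly periodic schedule, near 0 for Poisson arrivals, and positive for bursty activity.

```python
import statistics


def burstiness(timestamps) -> float:
    """Burstiness coefficient of the gaps between sorted event times."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    mu = statistics.fmean(gaps)
    sigma = statistics.pstdev(gaps)  # population standard deviation
    if mu + sigma == 0:
        raise ValueError("all timestamps are identical")
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    periodic = [0, 60, 120, 180, 240]  # a probe firing every 60 seconds
    bursty = [0, 1, 2, 3, 4, 104]      # a tight cluster, then a long pause
    print(burstiness(periodic), burstiness(bursty))
```

Like the other tests, this uses only timestamps and never touches prompt content.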
Sources
- Benford's Law: Nigrini, M. (2012), Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection.
- Human burst dynamics: Barabási, A.-L. (2005), “The origin of bursts and heavy tails in human dynamics,” Nature 435, 207.
- 1/f cognitive noise: Gilden, D. et al. (1995), “1/f noise in human cognition,” Science 267, 1837.
- AA pricing baseline (7:2:1): Artificial Analysis, Language Model Benchmarking Methodology.
- All token-telemetry results: computed from canonical four-pillar session data. Methods and scripts are reproducible; raw transcripts are not published (privacy).
Token counts only — never prompt content. Tests are run, not asserted.