What’s tested
The suite contains 308 benchmark specifications across 18 categories: creation, arithmetic, math, trig, gradient, linalg, reductions, manipulation, io, indexing, bitwise, sorting, logic, statistics, sets, random, polynomials, and fft. Each specification is tested across multiple dtypes (float64, float32, float16, int8-int64, uint8-uint64, complex64, complex128, bool) where applicable, producing ~2,400 individual benchmarks in a full run.
Array sizes are configurable:
| Scale | Array size | Matrix size |
|---|---|---|
| Small | 100 elements | 32×32 |
| Medium (default) | 1,000 elements | 100×100 |
| Large | 10,000 elements | 1,000×1,000 |
The sizes are defined in benchmarks/src/specs.ts.
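As an illustration, a spec file like this might shape its entries as follows. These type and field names are hypothetical, not the actual definitions in specs.ts:

```typescript
// Hypothetical shape of the scale table and a benchmark spec;
// illustrative only, not the real specs.ts definitions.
type Scale = "small" | "medium" | "large";

interface ScaleConfig {
  arraySize: number; // 1-D element count
  matrixDim: number; // square matrix side length
}

const SCALES: Record<Scale, ScaleConfig> = {
  small: { arraySize: 100, matrixDim: 32 },
  medium: { arraySize: 1_000, matrixDim: 100 },
  large: { arraySize: 10_000, matrixDim: 1_000 },
};

interface BenchmarkSpec {
  name: string;     // e.g. "add", "matmul"
  category: string; // one of the 18 categories
  dtypes: string[]; // dtypes the operation supports
  run: (scale: ScaleConfig) => void;
}
```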
How timing works
Both sides use high-resolution timers: performance.now() in the JS runner and time.perf_counter() in the Python runner. The benchmark measures computation time only, from the JS side for numpy-ts and from the Python side for NumPy. This gives an apples-to-apples comparison of the numerical computation itself, without being skewed by JS↔Python interop overhead.
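On the JS side, computation-only timing reduces to wrapping just the operation loop with performance.now(). A minimal sketch (the batching and any surrounding bookkeeping in the real runner are simplified away):

```typescript
// Time a batch of operations and return the mean ms per operation.
// performance.now() is the timer named above; everything else here
// is an illustrative simplification of the runner.
function timeSample(op: () => void, opsPerSample: number): number {
  const start = performance.now();
  for (let i = 0; i < opsPerSample; i++) op();
  const elapsed = performance.now() - start;
  return elapsed / opsPerSample; // ms per operation
}
```

The Python runner does the equivalent with time.perf_counter(), so neither measurement includes cross-language overhead.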
Auto-calibration
Each benchmark automatically calibrates how many operations to run per sample, targeting a minimum sample time of 100ms. This eliminates timer resolution noise: if an operation takes 0.001ms, the runner batches 100,000 of them into a single sample rather than measuring one at a time. The calibration uses exponential scaling (×10 → ×2 → exact) to converge quickly, with a cap of 10 calibration rounds.
Warmup
Before measurement, each benchmark runs a configurable number of warmup iterations to stabilize JIT compilation and ensure WASM modules are compiled:
| Mode | Warmup iterations | Min sample time | Samples |
|---|---|---|---|
| Quick | 3 | 50ms | 1 |
| Standard | 10 | 100ms | 5 |
| Full | 20 | 100ms | 5 |
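The calibration loop described above can be sketched as follows. The thresholds for switching between ×10, ×2, and exact extrapolation are illustrative assumptions; the real runner's cutoffs may differ:

```typescript
// Auto-calibration sketch: grow the batch size until one sample
// takes at least minSampleMs, scaling by x10 while far from the
// target, x2 when close, then extrapolating the exact count.
function calibrate(
  op: () => void,
  minSampleMs = 100,
  maxRounds = 10
): number {
  let ops = 1;
  for (let round = 0; round < maxRounds; round++) {
    const start = performance.now();
    for (let i = 0; i < ops; i++) op();
    const elapsed = performance.now() - start;
    if (elapsed >= minSampleMs) return ops; // sample long enough
    if (elapsed < minSampleMs / 20) {
      ops *= 10; // far from target: jump by x10
    } else if (elapsed < minSampleMs / 2) {
      ops *= 2; // getting close: double
    } else {
      // nearly there: extrapolate the exact count from the timing
      ops = Math.ceil((ops * minSampleMs) / Math.max(elapsed, 1e-6));
    }
  }
  return ops; // cap of maxRounds calibration rounds reached
}
```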
Measurement
After warmup and calibration, the runner collects 5 independent samples. Each sample runs the calibrated number of operations and records the per-operation time. The suite reports:
- Mean and median time per operation
- Min and max across samples
- Standard deviation
- Ops/second (derived from mean time)
- Performance ratio: numpy-ts ops/s ÷ NumPy ops/s. A ratio above 1.0x means numpy-ts was faster.
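The reported statistics are standard; a sketch of how they could be computed over the per-operation sample times (field and function names are illustrative, not the runner's actual API):

```typescript
// Summarize per-operation sample times (in ms) into the statistics
// the suite reports. Illustrative; not the runner's actual code.
interface Stats {
  mean: number;
  median: number;
  min: number;
  max: number;
  stdDev: number;
  opsPerSec: number;
}

function summarize(samplesMs: number[]): Stats {
  const n = samplesMs.length;
  const mean = samplesMs.reduce((a, b) => a + b, 0) / n;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const median =
    n % 2 === 1
      ? sorted[(n - 1) / 2]
      : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
  const variance =
    samplesMs.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return {
    mean,
    median,
    min: sorted[0],
    max: sorted[n - 1],
    stdDev: Math.sqrt(variance),
    opsPerSec: 1000 / mean, // samples are ms per operation
  };
}
```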
Fairness
A few design decisions to keep the comparison honest:
- Same operations, same data. Both sides run the same algorithm on the same array shapes and dtypes. The Python runner (numpy_benchmark.py) mirrors the JS specifications exactly.
- Computation only. Timing happens on each side of the boundary: numpy-ts is timed from JS; NumPy is timed from Python. Neither side pays for cross-language overhead.
- No cherry-picking. Every benchmark in the spec file runs. Categories where NumPy is faster (trig, math, indexing) are reported alongside categories where numpy-ts wins.
- Geometric mean for ratios. Category and overall averages use the geometric mean, which is the correct method for averaging ratios.
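The geometric mean matters here because ratios multiply: averaging in log space keeps "2x faster" and "2x slower" symmetric, where an arithmetic mean would bias the result upward. A minimal sketch:

```typescript
// Geometric mean of performance ratios: the arithmetic mean of the
// logs, exponentiated. geometricMean([2, 0.5]) is exactly 1, while
// the arithmetic mean of the same ratios would be 1.25.
function geometricMean(ratios: number[]): number {
  const logSum = ratios.reduce((acc, r) => acc + Math.log(r), 0);
  return Math.exp(logSum / ratios.length);
}
```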
Running benchmarks yourself
Results are written to benchmarks/results/ as JSON files.
Caching
Benchmark results are cached for 24 hours, keyed by machine fingerprint. This prevents stale cross-comparisons when hardware or environment changes. Use --fresh to skip the cache and re-run Python benchmarks.
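The cache policy above amounts to a two-part freshness check. The fingerprint fields and function names below are assumptions for illustration, not the runner's actual implementation:

```typescript
// Sketch of the cache-validity rule: an entry is reusable only if
// the machine fingerprint matches, it is under 24 hours old, and
// --fresh was not passed. Field names are hypothetical.
interface CachedResult {
  fingerprint: string; // e.g. hash of CPU model + OS + runtime versions
  timestamp: number;   // ms since epoch when the run finished
  results: unknown;
}

const TTL_MS = 24 * 60 * 60 * 1000; // 24-hour cache lifetime

function isCacheValid(
  entry: CachedResult,
  currentFingerprint: string,
  now: number,
  fresh = false // --fresh bypasses the cache entirely
): boolean {
  if (fresh) return false;
  if (entry.fingerprint !== currentFingerprint) return false;
  return now - entry.timestamp < TTL_MS;
}
```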