1.024 Random Number Generation Libraries#
Pseudorandom number generators for simulation, sampling, and cryptography - NumPy (PCG64, MT19937), Python random, secrets, and specialized RNGs (PCG, Xoshiro)
Random Number Generation Libraries: Domain Explainer#
Purpose: Help technical decision makers understand Python random number generation libraries and choose the right RNG for simulations, Monte Carlo methods, sampling, and cryptographic applications.
Audience: Software engineers, data scientists, researchers, and security engineers without deep expertise in RNG algorithms.
What This Solves#
The Problem#
Imagine you’re building a Monte Carlo simulation for financial risk analysis. You need:
- 1 million random samples from a normal distribution
- Reproducible results (same seed → same sequence)
- Fast generation (milliseconds, not seconds)
- Statistical quality (no detectable patterns)
Naive approach: Use Python’s random.random() in a loop.
Time: 200-500ms for 1M samples (slow, Python-level loops)
Library approach: numpy.random.default_rng().normal(size=1_000_000)
Time: 5-15ms (vectorized, 20-100x faster)
This is the RNG problem: Generating high-quality pseudorandom numbers efficiently for simulation, sampling, and security.
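The gap is easy to verify directly. A rough benchmark sketch (absolute timings vary by machine, so treat the numbers as illustrative):

```python
import random
import time

import numpy as np

N = 1_000_000

# Naive approach: Python-level loop calling random.gauss() N times
t0 = time.perf_counter()
loop_samples = [random.gauss(0, 1) for _ in range(N)]
loop_ms = (time.perf_counter() - t0) * 1000

# Library approach: one vectorized call into NumPy's C implementation
t0 = time.perf_counter()
vec_samples = np.random.default_rng().normal(size=N)
vec_ms = (time.perf_counter() - t0) * 1000

print(f"loop: {loop_ms:.0f} ms, vectorized: {vec_ms:.1f} ms")
```

The vectorized call typically lands one to two orders of magnitude faster because the loop stays in compiled code.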
Who Encounters This#
- Data Science: Bootstrap resampling, cross-validation, stochastic optimization
- Scientific Computing: Monte Carlo simulations, stochastic differential equations
- Machine Learning: Initialization, dropout, data augmentation, random search
- Finance: Option pricing, VaR estimation, portfolio simulation
- Security: Token generation, password salts, cryptographic nonces
- Gaming: Procedural generation, AI behavior, loot drops
Why It Matters#
Business Impact:
- Speed: 20-100x faster than naive Python loops (vectorized NumPy)
- Reproducibility: Seed control for debugging and scientific reproducibility
- Quality: Modern RNGs (PCG, Xoshiro) pass rigorous statistical tests
- Security: Cryptographic RNGs prevent prediction attacks
Technical Impact:
- Parallel generation (independent streams for multi-threading)
- GPU acceleration (via CuPy, JAX)
- Arbitrary distributions (normal, exponential, Poisson, etc.)
- Decades of research (avoiding historical flaws like linear congruential generators)
Accessible Analogies#
What Is Random Number Generation?#
Analogy: Digital Dice with Superpowers
Imagine rolling physical dice:
- Physical dice: Truly random, but slow and can’t be “rewound”
- Pseudorandom generators: Deterministic algorithms that look random
Pseudorandom = Deterministic Chaos
Like a pinball machine:
- Start with a seed (initial ball position)
- Bounce through complex internal state (flippers, bumpers)
- Output appears random, but is fully determined by the seed
- Same seed → same sequence (reproducibility)
Why “pseudo” is actually good:
- ✅ Reproducible (debugging, scientific reproducibility)
- ✅ Fast (billions of numbers/second)
- ✅ Testable (statistical quality can be measured)
When you need true randomness: Cryptographic applications use /dev/urandom (OS entropy) instead.
Why Libraries Matter#
Without libraries: Implement RNG yourself
- Research algorithms (PCG, Mersenne Twister internals)
- Handle seeding, state management
- Generate non-uniform distributions (Box-Muller transform, etc.)
- Test statistical quality (dieharder, TestU01)
Time: Weeks to months (even for experts)
With libraries: np.random.default_rng() in NumPy
- Pre-built algorithms (PCG64, MT19937, Philox, SFC64)
- Automatic vectorization (millions of samples instantly)
- 50+ distributions out-of-the-box
- Decades of testing and optimization
Time: Minutes
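A minimal sketch of the library path, assuming NumPy 1.17+ (where `default_rng()` returns a PCG64-backed generator):

```python
import numpy as np

# One generator object; the seed makes every run reproducible
rng = np.random.default_rng(seed=42)

u = rng.random(5)                # 5 uniform floats in [0, 1)
z = rng.normal(size=5)           # 5 standard-normal draws
k = rng.integers(1, 7, size=10)  # 10 dice rolls in [1, 6]
```

Three lines replace weeks of algorithm implementation, seeding logic, and distribution code.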
When You Need This#
Clear Decision Criteria#
You NEED an RNG library if:
✅ Monte Carlo simulations, bootstrap resampling
✅ Stochastic algorithms (simulated annealing, genetic algorithms)
✅ Random sampling from datasets
✅ Machine learning (initialization, augmentation, dropout)
✅ Game development (procedural generation, AI randomness)
✅ Security tokens, nonces, salts (use secrets module)
You DON’T need advanced RNGs if:
❌ Shuffling a small list once (use random.shuffle)
❌ Picking one random element (use random.choice)
❌ Simple prototyping (standard library random is fine)
Concrete Use Case Examples#
Monte Carlo Option Pricing:
- Problem: Generate 1M correlated normal variates for multi-asset simulation
- Solution:
`np.random.default_rng().multivariate_normal(mean, cov, size=1000000)`
- ROI: 5-15ms vs 200-500ms with Python loops (20-100x speedup)
Machine Learning Initialization:
- Problem: Initialize 10M+ neural network weights reproducibly
- Solution:
`np.random.default_rng(seed=42).normal(0, 0.01, size=weights.shape)`
- ROI: Reproducible experiments, fast initialization
Cryptographic Token Generation:
- Problem: Generate secure 256-bit tokens for API keys
- Solution:
`secrets.token_urlsafe(32)` (OS-level entropy)
- ROI: Unpredictable tokens (no seed-based attacks)
A/B Test Assignment:
- Problem: Randomly assign 1M users to treatment/control groups
- Solution:
`np.random.default_rng(seed).choice([0, 1], size=1000000, p=[0.5, 0.5])`
- ROI: Reproducible assignment, verifiable randomization
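A sketch of the A/B assignment pattern described above (the seed and user count are illustrative):

```python
import numpy as np

SEED = 2024          # fixed seed: the assignment is auditable and re-runnable
N_USERS = 1_000_000

rng = np.random.default_rng(SEED)
groups = rng.choice([0, 1], size=N_USERS, p=[0.5, 0.5])

# Re-running with the same seed reproduces the exact same assignment
rng2 = np.random.default_rng(SEED)
assert (groups == rng2.choice([0, 1], size=N_USERS, p=[0.5, 0.5])).all()

treated_share = groups.mean()  # should sit very close to 0.5
```

Because the split is a pure function of the seed, anyone can re-derive and verify which group a user landed in.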
Trade-offs#
What You’re Choosing Between#
1. Speed vs Statistical Quality#
Fast RNGs (Xoshiro256++, PCG64):
- Pros: 2-4 GB/s generation, excellent quality (passes BigCrush)
- Cons: Not cryptographically secure
When: Simulations, ML, gaming (99% of use cases)
Cryptographic RNGs (secrets module, os.urandom):
- Pros: Unpredictable (no seed-based attacks), OS-level entropy
- Cons: 10-100x slower, can’t reproduce sequences
When: Tokens, passwords, cryptographic nonces
2. Reproducibility vs True Randomness#
Seeded PRNGs (NumPy, random module):
- Pros: Same seed → same sequence (debugging, reproducibility)
- Cons: Predictable if seed is known
When: Scientific computing, ML, testing
Entropy-based (secrets, /dev/urandom):
- Pros: Unpredictable, no seed attacks
- Cons: Can’t reproduce sequences
When: Security-critical applications
3. Legacy vs Modern Algorithms#
Mersenne Twister (MT19937):
- Pros: Well-studied (since 1997), default in Python's `random` module
- Cons: Large state (2.5 KB), predictable from 624 outputs, slower
When: Legacy compatibility, Python standard library
PCG64 (NumPy default since 1.17):
- Pros: 2x faster, smaller state (128 bits), passes all statistical tests
- Cons: Newer (2014), less historical validation
When: New projects, performance-critical code
Xoshiro256++ / Xoroshiro128++:
- Pros: Fastest (4+ GB/s), excellent quality, small state
- Cons: Very new (2018), limited adoption
When: Extreme performance needs, research code
4. Single-threaded vs Parallel#
Global State (random module):
- Pros: Simple API (`random.random()`), no setup
- Cons: Thread-unsafe, can't create independent streams
When: Single-threaded scripts
Generator Objects (NumPy default_rng):
- Pros: Thread-safe, independent streams, parallel RNG (SeedSequence)
- Cons: Must pass generator objects around
When: Multi-threaded code, parallel simulations
Cost Considerations#
Why Cost Matters Here#
RNG libraries are free (BSD/MIT/PSF licenses). Cost is developer time, compute time, and security risk.
Pricing Models#
Libraries: $0 (open-source: numpy, scipy, Python stdlib)
Developer Time:
- Setup: 15 minutes - 2 hours (install numpy, learn API)
- Integration: 1-8 hours (adapt to specific use case)
- Maintenance: 1-5 hours/year (update dependencies)
Compute Time:
- Simple sampling: Microseconds (numpy.random)
- Large-scale Monte Carlo: Seconds to minutes (millions of samples)
- Cryptographic RNG: 10-100x slower than PRNGs (acceptable for security use cases)
ROI Analysis#
Example: Monte Carlo Risk Simulation (1M scenarios/run, 100 runs/day):
Python random module (naive):
- Generation time: 200ms × 100 = 20 seconds/day
- Development: 1 hour × $100/hour = $100
NumPy PCG64:
- Generation time: 10ms × 100 = 1 second/day (20x faster)
- Development: 2 hours × $100/hour = $200
Savings: 19 seconds/day × 250 days/year ≈ 79 minutes/year
Payback: Immediate (faster results, easier code)
Example: Security Token Generation (10K tokens/day):
Python random (INSECURE):
- Time: ~10ms
- Security risk: CRITICAL (predictable tokens → account takeover)
- Cost: Potential breach = $50K-$500K+
secrets module (secure):
- Time: ~100ms (10x slower, still fast)
- Security risk: None (unpredictable)
- Cost: $0 (standard library)
Savings: Avoiding one security breach = $50K-$500K+
Implementation Reality#
Realistic Timeline Expectations#
Prototype (1-2 hours):
- Install numpy (or use the standard library `random` module)
- Test basic generation (`np.random.default_rng().random()`)
- Validate seed reproducibility
- Team: 1 developer
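The seed-reproducibility validation in the prototype checklist can be as small as this (the helper name is ours, not a library API):

```python
import numpy as np

def check_reproducibility(seed: int, n: int = 1000) -> bool:
    """Two generators built from the same seed must emit identical streams."""
    a = np.random.default_rng(seed).random(n)
    b = np.random.default_rng(seed).random(n)
    return bool(np.array_equal(a, b))

assert check_reproducibility(42)

# Different seeds should diverge immediately
assert not np.array_equal(
    np.random.default_rng(1).random(10),
    np.random.default_rng(2).random(10),
)
```

Running this once in CI catches accidental reliance on global, unseeded state.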
Production MVP (1-3 days):
- Integrate with existing codebase
- Implement parallel RNG (independent streams)
- Error handling (seed validation, distribution parameters)
- Testing (statistical tests, reproducibility)
- Team: 1 developer
Optimized Production (1-2 weeks):
- Performance tuning (vectorization, batch generation)
- Parallel execution (multi-core, GPU if applicable)
- Monitoring and logging (RNG quality checks)
- Team: 1-2 developers
Team Skill Requirements#
Minimum (Using numpy.random):
- Python: Basic (functions, NumPy arrays)
- Statistics: Basic (mean, std dev, distributions)
- RNG Theory: None (library abstracts algorithms)
- Training Time: 1-2 hours
Typical (Custom distributions, parallel RNG):
- Python: Moderate (classes, generators, threading)
- Statistics: Moderate (distribution theory, sampling)
- RNG Theory: Basic (seed management, statistical quality)
- Training Time: 1-3 days
Common Pitfalls#
Pitfall 1: “random.random() is good enough”
- Reality: Thread-unsafe, global state, MT19937 is slower and outdated
- Fix: Use `np.random.default_rng()` (PCG64, thread-safe)
Pitfall 2: “Cryptographic RNGs are always better”
- Reality: 10-100x slower, unnecessary for simulations
- Fix: Use `secrets` only for security; use `numpy.random` for simulations
Pitfall 3: “Same seed guarantees reproducibility across versions”
- Reality: NumPy changed default RNG (MT19937 → PCG64 in 1.17)
- Fix: Use explicit generator:
`np.random.Generator(np.random.PCG64(seed))`
Pitfall 4: “I need truly random numbers”
- Reality: Most applications need pseudorandom (reproducible, fast)
- Fix: Use PRNGs (NumPy) unless cryptographic security required
Pitfall 5: “Seeding with time.time() is secure”
- Reality: Predictable (only ~1M possible seeds per day)
- Fix: Use `secrets.randbits(128)` for secure seeding, or OS entropy
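A sketch of that fix, combining OS entropy for the seed with a fast PRNG for generation (the resulting stream is still a PRNG, so use `secrets` directly for tokens, not this):

```python
import secrets

import numpy as np

# BAD (predictable): seed = int(time.time()) — only ~86,400 values per day.
# Better: 128 bits of OS entropy as the seed.
seed = secrets.randbits(128)
rng = np.random.default_rng(seed)

sample = rng.random(10)  # fast, vectorized draws from a well-seeded stream
```

This pattern is for simulations that must be hard to guess; security-critical values should still come from `secrets` itself.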
Key Takeaways for Decision Makers#
Top 3 Decisions to Make#
Decision 1: Security Requirements
- Rule: Security-critical → `secrets` module; simulations/ML → `numpy.random`
Decision 2: RNG Algorithm
- Rule: New projects → PCG64 (NumPy default), Legacy compatibility → MT19937
Decision 3: Parallelization
- Rule: Single-threaded → `random` module is OK; multi-threaded → `numpy.random` generators
Budget Guidance#
Setup (One-Time):
- Developer time: 1-3 days × $800/day = $800-2,400
- Infrastructure: $0 (runs on standard hardware)
- Total: $800-2,400
Ongoing (Per Year):
- Maintenance: 5-10 hours × $100/hour = $500-1,000
- Total: $500-1,000/year
ROI:
- Manual implementation: $10K-30K + weeks of development
- Library-based: $800-2,400 + minimal debugging
- Payback: First project (days to weeks)
Questions to Ask Vendors/Consultants#
Technical Questions:
- “Which RNG do you recommend: MT19937, PCG64, or Xoshiro? Why?” (Tests modern algorithm knowledge)
- “How do you ensure reproducibility across runs?” (Should mention explicit seeding)
- “How do you handle parallel RNG?” (Should mention independent streams, SeedSequence)
Business Questions:
- “What’s the performance for our simulation size?” (Should give ballpark: µs for 1K samples, ms for 1M)
- “How do we validate statistical quality?” (Should mention TestU01, dieharder, or empirical tests)
- “What’s the risk of RNG correlation in parallel runs?” (Should explain SeedSequence or independent seeds)
Red Flags:
- ❌ Recommends custom RNG implementation (reinventing the wheel, likely buggy)
- ❌ Uses `random.seed(int(time.time()))` for security (predictable)
- ❌ Claims MT19937 is "most secure" (confuses quality with security)
- ❌ Doesn’t distinguish between PRNGs and cryptographic RNGs
Green Flags:
- ✅ Recommends `numpy.random.default_rng()` first
- ✅ Mentions PCG64 as modern default (NumPy 1.17+)
- ✅ Distinguishes simulation RNGs (fast) from cryptographic RNGs (secure)
- ✅ Uses the `secrets` module for security-critical randomness
Glossary#
Pseudorandom Number Generator (PRNG): Deterministic algorithm producing sequences that appear random but are fully determined by an initial seed
Seed: Initial state value that determines the entire RNG sequence (same seed → same sequence)
Mersenne Twister (MT19937): Classic PRNG (1997), 2^19937 - 1 period, default in Python random module
PCG (Permuted Congruential Generator): Modern PRNG family (2014), excellent quality, 2x faster than MT19937, NumPy default since 1.17
Xoshiro / Xoroshiro: Fastest modern PRNGs (2018), 4+ GB/s generation, excellent quality
Cryptographically Secure RNG (CSRNG): RNG using OS-level entropy, unpredictable even with full algorithm knowledge
Statistical Quality: How well RNG output passes randomness tests (BigCrush, dieharder, TestU01)
Period: Length of sequence before RNG repeats (e.g., 2^19937 for MT19937)
State: Internal memory of RNG (2.5KB for MT19937, 128 bits for PCG64)
numpy.random: NumPy’s random sampling module (default: PCG64 since 1.17)
random module: Python standard library RNG (default: MT19937)
secrets module: Python standard library for cryptographic randomness (uses /dev/urandom)
SeedSequence: NumPy utility for generating independent RNG streams from a single seed (parallel RNG)
Distribution: Probability distribution for generated numbers (uniform, normal, exponential, etc.)
Vectorization: Generating many random numbers at once (fast, avoids Python loops)
Further Reading#
Non-Technical:
- “Random Number Generation” (Wikipedia) - Accessible overview
- “The Unreasonable Effectiveness of Quasirandom Sequences” (Martin Roberts blog) - Non-random alternatives
Technical:
- numpy.random documentation: https://numpy.org/doc/stable/reference/random/index.html
- “PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation” (O’Neill, 2014)
- TestU01 test suite documentation: http://simul.iro.umontreal.ca/testu01/tu01.html
Books:
- “The Art of Computer Programming, Vol. 2: Seminumerical Algorithms” (Knuth) - Classic RNG theory
- “Numerical Recipes” (Press et al.) - Practical RNG implementation
Community:
- NumPy mailing list: RNG API discussions
- PCG website: https://www.pcg-random.org/
- Computational Science Stack Exchange: RNG Q&A
Security Considerations#
When Security Matters#
Use secrets module if:
✅ Generating tokens, API keys, session IDs
✅ Cryptographic nonces, initialization vectors
✅ Password salts, CSRF tokens
✅ Any case where predictability = security breach
Never use random or numpy.random for:
❌ Authentication tokens
❌ Cryptographic keys
❌ Security-sensitive randomness
Why PRNGs Are Insecure for Security#
Example Attack: MT19937 State Recovery
- Attacker observes 624 outputs from `random.random()`
- Attacker reconstructs the internal state (2.5 KB)
- Attacker predicts all future outputs
- Attacker predicts next “random” token → account takeover
Fix: Use secrets.token_urlsafe(32) (unpredictable, OS entropy)
Best Practices#
DO:
- ✅ Use `secrets` for all security-critical randomness
- ✅ Use `numpy.random` for simulations, ML, gaming
- ✅ Explicitly specify the RNG algorithm for reproducibility
- ✅ Test RNG statistical quality if developing custom distributions
DON’T:
- ❌ Use `random.seed(int(time.time()))` for security (only ~86,400 possible seeds per day)
- ❌ Mix secure and insecure RNGs in the same application
- ❌ Assume “random” means “secure”
S1 Rapid Discovery: Approach#
Research Question#
What Python libraries should I use for random number generation?
Scope#
This rapid scan focuses on:
- Pseudorandom generators (PRNGs) for simulation, ML, gaming
- Cryptographic RNGs for security-critical applications
- Performance (vectorization, throughput)
- Statistical quality (modern algorithms, test results)
- Reproducibility (seeding, version stability)
Out of scope (for S2):
- Deep algorithm internals
- Custom distribution generation
- GPU acceleration
- Quasi-random sequences (Sobol, Halton)
Library Selection Criteria#
Inclusion criteria:
- Maintained and actively used (2020+ releases)
- Python 3.8+ compatible
- BSD/MIT/PSF license (or similar permissive)
- Documented and production-tested
Focus areas:
- numpy.random - Default choice (PCG64, vectorized, SciPy ecosystem)
- random module - Stdlib simplicity (MT19937, global state)
- secrets module - Cryptographic security (OS entropy)
- scipy.stats - Distribution ecosystem (on top of NumPy RNG)
- PCG standalone - Explicit PCG family access
- Xoshiro/Xoroshiro - Ultra-fast modern RNGs
Discovery Method#
For each library:
- Positioning: What it’s designed for
- Maturity: Stars, downloads, maintainers, first release
- Algorithms: RNG families available (MT19937, PCG64, Xoshiro, etc.)
- Performance: Throughput benchmarks
- API Complexity: Simple vs advanced usage
- Best For: Primary use cases
- Not Ideal For: Where it falls short
- Dependencies: What you need to install
Reading Order#
Recommended reading path:
- Start here: `numpy-random.md` (default for most use cases)
- Security: `secrets.md` (if generating tokens, keys, nonces)
- Legacy/Simple: `random-module.md` (stdlib, single-threaded)
- Distributions: `scipy-stats.md` (advanced statistical distributions)
- Performance: `xoshiro.md` (fastest modern RNG)
- Algorithm-specific: `pcg-random.md` (explicit PCG family control)
Key Takeaways (Summary)#
Default recommendation: numpy.random.default_rng() (PCG64, vectorized, thread-safe)
When to diverge:
- Security-critical → `secrets` module (unpredictable)
- Simple scripts → `random` module (stdlib, no install)
- Extreme performance → Xoshiro256++ (4+ GB/s)
- Legacy compatibility → MT19937 (via `random` or NumPy)
numpy.random#
Positioning: Industry standard for scientific/ML random number generation in Python
Maturity#
- Stars: ~29K (numpy/numpy repo)
- Downloads: 200M+/month (PyPI - numpy package)
- Maintainer: NumPy development team (NumFOCUS sponsored)
- Production: Used globally in science, ML, finance, engineering
- First Release: 1995 (as Numeric, NumPy's predecessor; NumPy 1.0 in 2006; modern RNG API in 1.17, 2019)
RNG Algorithms#
Default (since NumPy 1.17 - 2019):
- `PCG64` - Permuted Congruential Generator (O'Neill, 2014)
- Period: 2^128
- State: 128 bits
- Speed: ~3-4 GB/s (2x faster than MT19937)
- Quality: Passes BigCrush (TestU01)
Legacy (pre-1.17):
- `MT19937` - Mersenne Twister (1997)
- Period: 2^19937 - 1
- State: 2.5 KB
- Speed: ~1.5 GB/s
- Quality: Passes most tests (fails some BigCrush)
Alternative Generators:
- `Philox` - Counter-based RNG (parallel-friendly)
- `SFC64` - Small Fast Counter (minimal state)
- `PCG64DXSM` - PCG variant (slightly different output mixing)
Distribution Coverage#
Basic:
- `random()` - Uniform [0, 1)
- `integers()` - Uniform integers
- `choice()` - Random sampling
Continuous:
- `normal()` - Gaussian (mean, std)
- `exponential()` - Exponential decay
- `gamma()`, `beta()`, `weibull()`, `lognormal()`, etc.
Discrete:
- `binomial()`, `poisson()`, `geometric()`, `hypergeometric()`, etc.
Multivariate:
- `multivariate_normal()` - Correlated Gaussians
- `dirichlet()` - Dirichlet distribution
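A quick tour of those categories with a single generator (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

u = rng.random(1000)                          # uniform [0, 1)
z = rng.normal(loc=0.0, scale=1.0, size=1000) # continuous: Gaussian
counts = rng.poisson(lam=3.0, size=1000)      # discrete: Poisson
mvn = rng.multivariate_normal(                # multivariate: correlated pair
    mean=[0.0, 0.0],
    cov=[[1.0, 0.8], [0.8, 1.0]],
    size=1000,
)
```

One generator object serves every distribution, so a single seed reproduces the whole run.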
Performance#
Throughput (single-threaded, float64):
- PCG64: ~800 million samples/sec (~3 GB/s)
- MT19937: ~400 million samples/sec (~1.5 GB/s)
Typical operations:
- 1K samples: ~2-5 µs
- 1M samples: ~10-50 ms
- 10M samples: ~100-500 ms
Vectorization: 20-100x faster than Python loops
API Complexity#
Simple (Recommended):
```python
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1000000)
```
Legacy (Global state, thread-unsafe):
```python
np.random.seed(42)  # NOT recommended
samples = np.random.randn(1000000)
```
Advanced (Explicit algorithm):
```python
rng = np.random.Generator(np.random.PCG64(seed=42))
```
Parallel RNG (Thread-Safe)#
SeedSequence for independent streams:
```python
from numpy.random import SeedSequence, default_rng

# Spawn independent RNGs for parallel workers
ss = SeedSequence(12345)
child_seeds = ss.spawn(10)  # 10 independent streams
rngs = [default_rng(s) for s in child_seeds]
```
Key feature: Each RNG is independent (no correlation between streams)
Best For#
- ✅ Scientific computing (default choice, SciPy ecosystem)
- ✅ Machine learning (initialization, dropout, augmentation)
- ✅ Monte Carlo simulations (vectorized, fast, reproducible)
- ✅ Statistical sampling (50+ distributions)
- ✅ Parallel execution (SeedSequence for independent streams)
Not Ideal For#
- ❌ Cryptographic security - Predictable with a known seed (use `secrets`)
- ❌ Extreme minimalism - Requires a NumPy install (use stdlib `random` for simple scripts)
- ❌ GPU acceleration - CPU-only (use CuPy for GPU)
Dependencies#
Required: numpy (no other dependencies)
Installation:
```shell
pip install numpy
# or
conda install numpy
```
Reproducibility#
Key principle: Same RNG + same seed + same NumPy version = same sequence
Explicit algorithm for version stability:
```python
rng = np.random.Generator(np.random.PCG64(seed=42))
# PCG64 algorithm stable across NumPy versions
```
Warning: `np.random.default_rng()` may change its algorithm in future NumPy releases
Statistical Quality#
PCG64:
- Passes TestU01 BigCrush (most rigorous test suite)
- Passes PractRand (up to 32+ TB output tested)
- No known statistical flaws
MT19937:
- Passes most tests, fails some BigCrush tests
- Predictable from 624 outputs (not secure)
- Still widely used (legacy default)
License#
BSD-3-Clause (permissive, commercial-friendly)
Documentation#
- Official docs: https://numpy.org/doc/stable/reference/random/index.html
- NEP 19 (RNG redesign): https://numpy.org/neps/nep-0019-rng-policy.html
- PCG paper: https://www.pcg-random.org/
PCG (Permuted Congruential Generator)#
Positioning: Standalone implementation of the PCG family of RNGs (direct access to PCG algorithms)
Maturity#
- PyPI: `pcg-random` (unofficial Python wrapper)
- Official: NumPy includes PCG64 natively (since 1.17)
- Algorithm Author: Melissa O’Neill (Harvey Mudd College, 2014)
- Production: Widely used via NumPy (NumPy’s default RNG is PCG64)
- Paper: “PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation” (2014)
What Is PCG?#
Permuted Congruential Generator:
- Idea: Take LCG (linear congruential generator) and apply output permutation
- LCG: Fast but low-quality (predictable patterns)
- Permutation: Destroys patterns, recovers statistical quality
- Result: Fast + high-quality RNG
PCG Family:
- PCG32 - 32-bit output, 64-bit state
- PCG64 - 64-bit output, 128-bit state (NumPy default)
- PCG64DXSM - PCG64 with different output mixing
- Many other variants (PCG128, extended generators, etc.)
Implementation Options#
Option 1: NumPy (Recommended)#
Use NumPy’s built-in PCG64:
```python
import numpy as np

# Default RNG (PCG64 since NumPy 1.17)
rng = np.random.default_rng(seed=42)

# Explicit PCG64
rng = np.random.Generator(np.random.PCG64(seed=42))
```
Advantages:
- ✅ No extra install (NumPy is standard)
- ✅ Vectorized (fast C implementation)
- ✅ Maintained by NumPy team
Option 2: pcg-random Package#
Standalone Python package (unofficial):
```shell
pip install pcg-random
```
```python
from pcg import PCG64

rng = PCG64(seed=42)
value = rng.random()  # Single float [0, 1)
```
Advantages:
- ⚠️ Direct PCG algorithm control (research use)
- ⚠️ Educational (understand PCG internals)
Disadvantages:
- ❌ No vectorization (slower than NumPy)
- ❌ Limited ecosystem (no SciPy integration)
- ❌ Less maintained
Recommendation: Use NumPy’s PCG64 unless you need explicit algorithm control for research
Algorithm Properties#
PCG64 (NumPy default):
- Period: 2^128
- State size: 128 bits (16 bytes)
- Output: 64 bits per call
- Speed: ~3-4 GB/s (single-threaded)
- Quality: Passes BigCrush (TestU01), PractRand (32+ TB tested)
Comparison with MT19937:
| Property | PCG64 | MT19937 |
|---|---|---|
| Speed | 3-4 GB/s | 1.5 GB/s |
| State | 128 bits | 2.5 KB |
| Period | 2^128 | 2^19937 - 1 |
| Quality | Passes all tests | Fails some BigCrush |
| Predictability | Hard to predict | 624 outputs → full state |
Performance#
NumPy PCG64:
- ~800 million samples/sec (float64)
- ~3-4 GB/s raw throughput
- 2x faster than MT19937
Standalone pcg-random:
- ~50-100 million samples/sec (Python overhead)
- 10x slower than NumPy (no vectorization)
API Complexity#
NumPy (Simple):
```python
import numpy as np

rng = np.random.default_rng(seed=42)  # PCG64 by default
samples = rng.random(size=1000000)
```
Standalone pcg-random (Lower-level):
```python
from pcg import PCG64

rng = PCG64(seed=42)
samples = [rng.random() for _ in range(1000000)]  # Slow (Python loop)
```
Best For#
- ✅ NumPy users wanting explicit PCG control (research reproducibility)
- ✅ Understanding PCG internals (educational, algorithm research)
- ✅ Modern RNG with small state (128 bits vs 2.5KB for MT19937)
Not Ideal For#
- ❌ General use - NumPy’s PCG64 is better (faster, maintained)
- ❌ Vectorized operations - Standalone pcg-random is slow
- ❌ SciPy integration - Use NumPy for compatibility
Dependencies#
NumPy PCG64: numpy (no extra dependencies)
Standalone pcg-random: pip install pcg-random (small package, ~10 KB)
Reproducibility#
NumPy PCG64:
```python
rng = np.random.Generator(np.random.PCG64(seed=42))
# Stable across NumPy versions (explicit algorithm)
```
Cross-platform: Same seed → same sequence (Linux, macOS, Windows)
Statistical Quality#
PCG64:
- ✅ TestU01 BigCrush: Passes all tests
- ✅ PractRand: Passes up to 32+ TB output
- ✅ No known statistical flaws
Key advantages over MT19937:
- Smaller state (128 bits vs 2.5 KB)
- Faster (2x)
- More secure (harder to predict state)
Why PCG Is NumPy’s Default#
NumPy 1.17 (2019) changed default RNG from MT19937 to PCG64:
Reasons:
- Faster: 2x throughput
- Better quality: Passes all statistical tests
- Smaller state: 128 bits (easier to serialize)
- More secure: State prediction much harder
Backward compatibility: Old code using np.random.seed() still uses MT19937
License#
- PCG Algorithm: Apache 2.0 (permissive)
- NumPy PCG64: BSD-3-Clause (via NumPy)
- pcg-random package: MIT (permissive)
Documentation#
- NumPy PCG64: https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html
- PCG Website: https://www.pcg-random.org/
- Paper: https://www.pcg-random.org/paper.html
- PCG Family: https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf
Key Takeaways#
Use NumPy’s PCG64 (recommended):
- Fast, well-maintained, vectorized
- Default in NumPy 1.17+
Standalone pcg-random (niche use):
- Educational, algorithm research
- 10x slower than NumPy
PCG advantages:
- 2x faster than MT19937
- Better statistical quality
- Smaller state (128 bits vs 2.5 KB)
random (Python Standard Library)#
Positioning: Simple, batteries-included RNG for general-purpose scripting
Maturity#
- Part of: Python standard library (stdlib)
- First Release: Python 1.4 (1996), redesigned in Python 2.3 (2003)
- Maintainer: Python core team
- Production: Billions of Python installations worldwide
- Stability: Extremely stable API (decades of backward compatibility)
RNG Algorithm#
Default:
- `MT19937` - Mersenne Twister (Matsumoto & Nishimura, 1997)
- Period: 2^19937 - 1
- State: 2.5 KB
- Speed: ~400M samples/sec (single-threaded)
- Quality: Passes most statistical tests
Note: Cannot use modern algorithms (PCG, Xoshiro) without leaving stdlib
Distribution Coverage#
Basic:
- `random()` - Uniform [0, 1)
- `randint(a, b)` - Uniform integer [a, b]
- `choice(seq)` - Random element from a sequence
- `shuffle(seq)` - In-place shuffle
- `sample(seq, k)` - k unique samples
Continuous:
- `uniform(a, b)` - Uniform [a, b]
- `gauss(mu, sigma)` - Gaussian (mean, std)
- `expovariate(lambd)` - Exponential
- `betavariate()`, `gammavariate()`, `lognormvariate()`, etc.
Discrete:
- Limited (use scipy.stats or numpy for advanced discrete distributions)
Performance#
Throughput (single-threaded):
- ~400 million samples/sec (MT19937)
- Limitation: Python-level loops (no vectorization)
Typical operations:
- 1K samples (loop): ~2-5 ms (100x slower than NumPy vectorized)
- 1M samples (loop): ~2-5 seconds (1000x slower than NumPy)
Key insight: Fine for small-scale use; use NumPy for large-scale generation
API Complexity#
Simple (Global state):
```python
import random

random.seed(42)
x = random.random()                      # Single float [0, 1)
y = random.randint(1, 100)               # Integer [1, 100]
choice = random.choice(['a', 'b', 'c'])
```
Advanced (Instance-based, thread-safe):
```python
rng = random.Random(42)  # random.Random takes the seed positionally
x = rng.random()
```
SystemRandom (Cryptographic, OS entropy):
```python
rng = random.SystemRandom()  # Uses os.urandom()
secure_int = rng.randint(0, 999999)
```
Thread Safety#
Global state (random.random()):
- NOT thread-safe (shared state across threads)
- Race conditions possible in multi-threaded code
Instance-based (random.Random()):
- Thread-safe (each instance has independent state)
- Create one instance per thread
SystemRandom:
- Thread-safe (reads from OS entropy pool)
Best For#
- ✅ Simple scripts (no dependencies, stdlib)
- ✅ Small-scale randomness (picking elements, shuffling small lists)
- ✅ Prototyping (quick and dirty, no setup)
- ✅ Educational code (standard library, universally available)
Not Ideal For#
- ❌ Large-scale sampling - No vectorization (1000x slower than NumPy)
- ❌ Cryptographic security - MT19937 is predictable (use `secrets` or `SystemRandom`)
- ❌ Modern algorithms - Stuck with MT19937 (use NumPy for PCG64)
- ❌ Multi-threaded - Global state is thread-unsafe (use NumPy generators)
Dependencies#
None - Part of Python standard library
Reproducibility#
Seeding:
```python
random.seed(42)  # Explicit seed for reproducibility
```
Cross-version stability: Same seed → same sequence (Python 2.3+, 20+ years)
Note: More stable across versions than NumPy (no algorithm changes)
Statistical Quality#
MT19937:
- Passes most statistical tests
- Fails some TestU01 BigCrush tests
- Predictable from 624 outputs (not cryptographically secure)
Recommendation: Fine for general use, not for security
Security Considerations#
random.random() is NOT secure:
- Predictable with known seed
- State recoverable from 624 outputs
- Never use for tokens, passwords, cryptographic nonces
For security, use:
- `secrets` module (Python 3.6+) - Preferred
- `random.SystemRandom()` - Uses /dev/urandom
License#
Python Software Foundation License (PSF) - Permissive, GPL-compatible
Documentation#
- Official docs: https://docs.python.org/3/library/random.html
- PEP 506 (secrets module): https://www.python.org/dev/peps/pep-0506/
Comparison: random vs numpy.random#
| Feature | random (stdlib) | numpy.random |
|---|---|---|
| Installation | ✅ Stdlib (no install) | ❌ Requires NumPy |
| Vectorization | ❌ Python loops only | ✅ Fast C loops (20-100x) |
| Algorithms | ❌ MT19937 only | ✅ PCG64, MT19937, Xoshiro, etc. |
| Thread Safety | ⚠️ Global state unsafe | ✅ Generator objects safe |
| Distributions | ⚠️ ~10 distributions | ✅ 50+ distributions |
| Performance | ❌ Slow for large samples | ✅ Fast (vectorized) |
| Simplicity | ✅ Very simple API | ⚠️ Requires NumPy knowledge |
Rule of thumb:
- Use `random` for simple scripts, prototyping, educational code
- Use `numpy.random` for simulations, ML, large-scale sampling
S1 Rapid Discovery: Recommendation#
Quick Decision Tree#
Start here: What’s your primary use case?
```
Need security/unpredictability?
├─ YES → Use `secrets` module
│        (tokens, passwords, crypto)
│
└─ NO → Need large-scale sampling?
        ├─ YES → Use `numpy.random.default_rng()`
        │        (simulation, ML, Monte Carlo)
        │
        └─ NO → Use `random` module
                (simple scripts, prototyping)
```
Detailed Recommendations#
Default: NumPy random (PCG64)#
Use for: 95% of use cases
```python
import numpy as np

rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1000000)
```
Why:
- ✅ Fast (3-4 GB/s, vectorized)
- ✅ Modern algorithm (PCG64, passes all tests)
- ✅ Thread-safe (generator objects)
- ✅ Reproducible (explicit seeding)
- ✅ SciPy ecosystem (50+ distributions)
When: Simulation, ML, data science, scientific computing
Security: secrets module#
Use for: Security-critical randomness
import secrets
# API token (32 bytes = 256 bits)
token = secrets.token_urlsafe(32)
# Cryptographic nonce
nonce = secrets.token_bytes(16)

Why:
- ✅ Unpredictable (OS entropy)
- ✅ Cryptographically secure
- ✅ No seed attacks
When: Tokens, passwords, salts, nonces, CSRF protection
Never for: Simulations (too slow, can’t reproduce)
Simple Scripts: random module#
Use for: Quick prototyping, simple scripts
import random
random.seed(42)
value = random.random()
choice = random.choice(['a', 'b', 'c'])
random.shuffle(my_list)

Why:
- ✅ No install (stdlib)
- ✅ Simple API
- ✅ Good for small-scale use
When: Prototyping, educational code, simple scripts
Not for: Large-scale sampling (no vectorization), multi-threading (global state)
Advanced Distributions: scipy.stats#
Use for: Statistical analysis, hypothesis testing
from scipy.stats import norm
dist = norm(loc=0, scale=1)
samples = dist.rvs(size=1000, random_state=42)
pdf = dist.pdf(samples)
p = dist.cdf(2.0)  # P(X <= 2.0)

Why:
- ✅ 100+ distributions
- ✅ PDF, CDF, PPF methods
- ✅ Distribution fitting
- ✅ Hypothesis tests
When: Statistical modeling, distribution fitting, hypothesis testing
Note: Builds on NumPy RNG, not a standalone RNG
Extreme Performance: Xoshiro/SFC64#
Use for: Ultra-fast generation (game engines, real-time)
import numpy as np
rng = np.random.Generator(np.random.SFC64(seed=42))
samples = rng.random(size=1000000)

Why:
- ✅ Fastest (4+ GB/s)
- ✅ Small state (128-256 bits)
- ✅ Excellent quality
When: Performance-critical code, game engines
Trade-off: Only ~20% faster than PCG64 (not worth it for most use cases)
Comparison Matrix#
| Library | Speed | Security | Vectorized | Use Case |
|---|---|---|---|---|
| numpy.random | ✅✅✅ Fast | ❌ PRNG | ✅ Yes | Default choice |
| secrets | ⚠️ Slow | ✅✅✅ Secure | ❌ No | Security only |
| random | ⚠️ Medium | ❌ PRNG | ❌ No | Simple scripts |
| scipy.stats | ✅✅✅ Fast | ❌ PRNG | ✅ Yes | Statistical analysis |
| Xoshiro/SFC64 | ✅✅✅✅ Fastest | ❌ PRNG | ✅ Yes | Extreme performance |
Common Use Cases#
Monte Carlo Simulation#
Recommendation: numpy.random.default_rng()
- Fast vectorized sampling
- Reproducible (explicit seed)
- SciPy distributions
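A minimal sketch of the pattern, here estimating π by sampling points in the unit square (assuming NumPy; any payload computation substitutes for the circle test):

```python
import numpy as np

# Monte Carlo estimate of pi: fraction of uniform points inside the quarter circle
rng = np.random.default_rng(seed=42)
n = 1_000_000
x = rng.random(n)
y = rng.random(n)
pi_estimate = 4.0 * np.mean(x**2 + y**2 <= 1.0)
print(round(pi_estimate, 3))
```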
Machine Learning#
Recommendation: numpy.random.default_rng()
- Fast initialization
- Reproducible experiments
- Integrates with ML frameworks (PyTorch, TensorFlow use NumPy conventions)
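A minimal reproducible-split sketch (the feature matrix `X` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

X = np.arange(1000).reshape(500, 2)  # illustrative feature matrix
indices = rng.permutation(len(X))    # same seed -> same permutation every run
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]
X_train, X_test = X[train_idx], X[test_idx]
```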
API Token Generation#
Recommendation: secrets.token_urlsafe(32)
- Cryptographically secure
- Unpredictable (no seed attacks)
Game Development#
Recommendation: numpy.random.default_rng() or random module
- NumPy for procedural generation (large-scale, vectorized)
- `random` for simple AI behavior (single values)
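A hypothetical loot-table sketch following this split; the item names and drop weights are made up:

```python
import random

import numpy as np

# Bulk procedural drops: vectorized weighted sampling with NumPy
rng = np.random.default_rng(seed=7)
items = ["common", "rare", "epic"]
drops = rng.choice(items, size=1000, p=[0.80, 0.18, 0.02])

# One-off AI decision: stdlib random is fine for single values
action = random.choice(["attack", "defend", "flee"])
```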
Statistical Bootstrap#
Recommendation: numpy.random.default_rng() + scipy.stats
- Fast resampling (NumPy)
- Distribution fitting (SciPy)
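A vectorized bootstrap sketch for a 95% confidence interval on the mean; the `data` array is a synthetic stand-in for observed data:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=5.0, scale=2.0, size=500)  # stand-in for observed data

# One index matrix -> one resampled mean per row (no Python loop)
n_boot = 2000
idx = rng.integers(0, data.size, size=(n_boot, data.size))
boot_means = data[idx].mean(axis=1)
lower, upper = np.percentile(boot_means, [2.5, 97.5])
```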
Cryptographic Nonces#
Recommendation: secrets.token_bytes(16)
- Unpredictable
- OS-level entropy
What NOT To Do#
❌ Don’t use random for security#
# INSECURE - Predictable!
import random
import string
import time

random.seed(int(time.time()))
token = ''.join(random.choices(string.ascii_letters, k=32))

Fix: Use secrets.token_urlsafe(32)
❌ Don’t use secrets for simulations#
# TOO SLOW - 100x slower than NumPy
import secrets
samples = [secrets.randbits(64) for _ in range(1000000)]

Fix: Use np.random.default_rng().integers()
❌ Don’t use global state in multi-threaded code#
# THREAD-UNSAFE!
np.random.seed(42)
result = np.random.randn(1000)

Fix: Use generator objects
rng = np.random.default_rng(seed=42)
result = rng.normal(size=1000)

❌ Don’t assume reproducibility across versions without explicit algorithm#
# MAY CHANGE in future NumPy versions
rng = np.random.default_rng(seed=42)

Fix: Specify algorithm explicitly
rng = np.random.Generator(np.random.PCG64(seed=42))

Migration Guide#
From random to numpy.random#
Before:
import random
random.seed(42)
samples = [random.gauss(0, 1) for _ in range(1000000)]

After (20-100x faster):
import numpy as np
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1000000)

From np.random.seed() to default_rng()#
Before (Global state, thread-unsafe):
np.random.seed(42)
samples = np.random.randn(1000)

After (Generator object, thread-safe):
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal(size=1000)

From MT19937 to PCG64 (NumPy 1.17+)#
Before:
rng = np.random.RandomState(seed=42)  # MT19937

After (2x faster):
rng = np.random.default_rng(seed=42)  # PCG64

Final Recommendation#
For 95% of users:
import numpy as np
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1000000)

For security:
import secrets
token = secrets.token_urlsafe(32)

For simple scripts (no NumPy install):
import random
random.seed(42)
value = random.random()

When in doubt: Use numpy.random.default_rng() (it’s almost always the right choice)
scipy.stats#
Positioning: Statistical distributions on top of NumPy RNG (not a standalone RNG)
Maturity#
- Stars: ~13K (scipy/scipy repo)
- Downloads: 50M+/month (PyPI - scipy package)
- Maintainer: SciPy development community (NumFOCUS sponsored)
- Production: Used globally in statistics, data science, research
- First Release: 2001 (modernized continuously)
What It Provides#
Not a standalone RNG: scipy.stats builds on numpy.random to provide:
- 100+ statistical distributions (continuous, discrete, multivariate)
- Distribution methods (pdf, cdf, ppf, rvs, fit)
- Statistical tests (t-test, ANOVA, chi-square, etc.)
Relationship:
- `numpy.random`: RNG algorithms (PCG64, MT19937) + basic distributions
- `scipy.stats`: Advanced distributions + statistical analysis
Distribution Coverage#
Continuous (80+ distributions):
- Common: `norm`, `uniform`, `expon`, `gamma`, `beta`, `weibull_min`, `lognorm`
- Specialized: `cauchy`, `chi2`, `f`, `t`, `pareto`, `rayleigh`, `rice`, `vonmises`
- Extreme value: `gumbel_r`, `gumbel_l`, `genextreme`
Discrete (15+ distributions):
- Common: `binom`, `poisson`, `geom`, `hypergeom`, `nbinom`
- Specialized: `zipf`, `planck`, `boltzmann`, `randint`
Multivariate:
- `multivariate_normal`, `multivariate_t`, `dirichlet`, `wishart`, `invwishart`, `multinomial`
Key advantage: Parameterization matches textbook definitions (not just NumPy’s conventions)
API#
Frozen distributions (Recommended):
from scipy.stats import norm
# Create distribution object
dist = norm(loc=0, scale=1) # mean=0, std=1
# Sample
samples = dist.rvs(size=1000, random_state=42)
# PDF, CDF, PPF
pdf_values = dist.pdf(x)
cdf_values = dist.cdf(x)
quantiles = dist.ppf([0.025, 0.975])  # 95% CI bounds

Direct sampling (NumPy-style):
from scipy.stats import norm
samples = norm.rvs(loc=0, scale=1, size=1000, random_state=42)

Statistical tests:
from scipy.stats import ttest_ind, kstest
# t-test
statistic, pvalue = ttest_ind(group1, group2)
# Kolmogorov-Smirnov test
statistic, pvalue = kstest(data, 'norm')

Performance#
Same as NumPy: scipy.stats wraps numpy.random for generation
- Uses NumPy’s PCG64 (or specified generator)
- Vectorized (fast C loops)
Overhead: Minimal (~1-5% slower than direct NumPy due to wrapper)
Typical operations:
- 1K samples: ~10-20 µs
- 1M samples: ~10-50 ms
Random State Control#
NumPy generator (Recommended):
from scipy.stats import norm
import numpy as np
rng = np.random.default_rng(seed=42)
samples = norm.rvs(size=1000, random_state=rng)

Integer seed:
samples = norm.rvs(size=1000, random_state=42)
# Internally creates a legacy MT19937 RandomState seeded with 42

Legacy (not recommended):
np.random.seed(42) # Global state
samples = norm.rvs(size=1000)

API Complexity#
Simple (Direct sampling):
from scipy.stats import norm
samples = norm.rvs(loc=0, scale=1, size=1000, random_state=42)

Advanced (Frozen distribution, multiple operations):
dist = norm(loc=0, scale=1)
samples = dist.rvs(size=1000, random_state=42)
pdf = dist.pdf(samples)
cdf = dist.cdf(samples)

Expert (Custom distributions):
import numpy as np
from scipy.stats import rv_continuous

class MyDist(rv_continuous):
    def _pdf(self, x):
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

my_dist = MyDist(name='my_dist')

Best For#
- ✅ Statistical modeling (textbook parameterizations)
- ✅ Hypothesis testing (t-test, ANOVA, chi-square, KS test)
- ✅ Distribution fitting (MLE, method of moments)
- ✅ Advanced distributions (80+ continuous, 15+ discrete)
- ✅ Quantile/percentile calculations (ppf, inverse CDF)
Not Ideal For#
- ❌ Basic distributions - use `numpy.random` directly (simpler API)
- ❌ Custom RNG algorithms - uses NumPy’s RNGs (PCG64, MT19937)
- ❌ Minimalism - Requires SciPy install (large dependency)
Dependencies#
Required: numpy, scipy
Installation:
pip install scipy
# or
conda install scipy

Comparison: scipy.stats vs numpy.random#
| Feature | scipy.stats | numpy.random |
|---|---|---|
| Basic distributions | ✅ Yes | ✅ Yes (faster API) |
| Advanced distributions | ✅ 100+ distributions | ⚠️ ~20 distributions |
| PDF/CDF/PPF | ✅ Yes | ❌ No (sampling only) |
| Distribution fitting | ✅ Yes | ❌ No |
| Statistical tests | ✅ Yes | ❌ No |
| Performance | ✅ Same (wraps NumPy) | ✅ Same |
| Installation | ⚠️ Requires SciPy | ✅ NumPy only |
Rule of thumb:
- Use `numpy.random` for sampling (basic distributions)
- Use `scipy.stats` for analysis (pdf, cdf, hypothesis tests)
Common Use Cases#
Distribution Fitting:
from scipy.stats import norm
# Fit normal distribution to data
params = norm.fit(data) # Returns (mean, std)
# Generate samples from fitted distribution
dist = norm(*params)
samples = dist.rvs(size=1000, random_state=42)

Hypothesis Testing:
from scipy.stats import ttest_ind, kstest
# Independent samples t-test
stat, pval = ttest_ind(group1, group2)
# Test if data is normally distributed
stat, pval = kstest(data, 'norm')

Quantile Calculations:
from scipy.stats import norm
# 95% confidence interval
lower, upper = norm.ppf([0.025, 0.975], loc=0, scale=1)
# Returns approximately (-1.96, 1.96)

Custom Parameterizations:
from scipy.stats import expon
# Exponential with mean=10 (rate=0.1)
dist = expon(scale=10)
samples = dist.rvs(size=1000, random_state=42)

Common Pitfalls#
Pitfall 1: “scipy.stats is a standalone RNG”
- Reality: It wraps `numpy.random`; it is not a separate RNG
- Fix: Treat it as a distribution layer on top of NumPy
Pitfall 2: “Always use scipy.stats for distributions”
- Reality: NumPy is faster for basic distributions (normal, uniform, etc.)
- Fix: Use NumPy for sampling, SciPy for analysis (pdf, cdf, tests)
Pitfall 3: “random_state=42 always gives same results”
- Reality: Different SciPy versions may have different implementations
- Fix: Pin SciPy version or use explicit NumPy generator
Pitfall 4: “Frozen distributions are slower”
- Reality: Negligible overhead (~1-5%), more convenient API
- Fix: Use frozen distributions (cleaner code, minimal cost)
License#
BSD-3-Clause (permissive, commercial-friendly)
Documentation#
- Official docs: https://docs.scipy.org/doc/scipy/reference/stats.html
- Tutorial: https://docs.scipy.org/doc/scipy/tutorial/stats.html
- Distribution list: https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions
Key Takeaways#
scipy.stats complements numpy.random:
- Use `numpy.random` for fast sampling (basic distributions)
- Use `scipy.stats` for analysis (pdf, cdf, fitting, tests)
Not a standalone RNG: Builds on NumPy’s RNG infrastructure
Best for: Statistical modeling, hypothesis testing, distribution fitting
secrets (Python Standard Library)#
Positioning: Cryptographically secure random number generation for tokens, passwords, and security
Maturity#
- Part of: Python standard library (since Python 3.6 - 2016)
- Maintainer: Python core team
- Production: Used globally for security-critical applications
- Design: PEP 506 (“Adding A Secrets Module To The Standard Library”)
Entropy Source#
OS-level randomness:
- Uses `/dev/urandom` (Linux/macOS) or `BCryptGenRandom` (Windows)
- Not seeded: Cannot reproduce sequences (intentional for security)
Key difference from random/numpy:
- `random`/`numpy`: Deterministic PRNGs (seed → predictable sequence)
- `secrets`: Non-deterministic CSRNG (OS entropy → unpredictable)
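The contrast is easy to demonstrate:

```python
import random
import secrets

# Seeded PRNG: the same seed replays the same sequence
random.seed(0)
first = random.random()
random.seed(0)
replay = random.random()
assert first == replay

# secrets has no seed(); successive tokens differ (with overwhelming probability)
assert secrets.token_hex(16) != secrets.token_hex(16)
```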
API#
Random integers:
import secrets
# Random integer [0, n)
secrets.randbelow(100) # 0-99
# Random k-bit integer
secrets.randbits(256)  # 256-bit random integer

Random choice:
secrets.choice(['a', 'b', 'c'])  # Random element

Token generation:
# URL-safe token (base64)
secrets.token_urlsafe(32) # 32 bytes → 43 char string
# Hex token
secrets.token_hex(32) # 32 bytes → 64 char hex string
# Raw bytes
secrets.token_bytes(32)  # 32 random bytes

Secure comparison:
secrets.compare_digest(a, b)  # Timing-safe comparison

Performance#
Throughput:
- ~10-50 MB/s (depends on OS entropy pool)
- 10-100x slower than PRNGs (numpy, random)
Typical operations:
- Single token: ~10-100 µs
- 1K tokens: ~10-100 ms
- 1M tokens: ~10-100 seconds
Trade-off: Security > Speed (acceptable for security use cases)
API Complexity#
Simple:
import secrets
# Generate secure token
token = secrets.token_urlsafe(32)
# Random choice
pin = ''.join(secrets.choice('0123456789') for _ in range(6))

No advanced features: Designed for simplicity and security
Best For#
- ✅ Authentication tokens (API keys, session IDs, CSRF tokens)
- ✅ Cryptographic nonces (initialization vectors, salts)
- ✅ Passwords (temporary passwords, PINs)
- ✅ Security-critical randomness (any case where prediction = breach)
Not Ideal For#
- ❌ Simulations - Too slow, unnecessary security (use numpy.random)
- ❌ Machine learning - Can’t reproduce results (use numpy with seed)
- ❌ High-throughput - 10-100x slower than PRNGs
- ❌ Reproducibility - No seeding by design (use PRNGs if needed)
Dependencies#
None - Part of Python standard library (Python 3.6+)
Security Properties#
Cryptographically secure:
- ✅ Unpredictable: Cannot predict next output from previous outputs
- ✅ Uniform: All outputs equally likely (no bias)
- ✅ OS-backed: Uses hardware entropy, system entropy pool
Comparison:
| Property | secrets | random/numpy |
|---|---|---|
| Unpredictable | ✅ Yes (OS entropy) | ❌ No (seeded PRNG) |
| Reproducible | ❌ No (intentional) | ✅ Yes (same seed) |
| Speed | ❌ Slow (10-50 MB/s) | ✅ Fast (1-4 GB/s) |
| Use case | Security | Simulation |
Common Use Cases#
API Token Generation:
import secrets
# 256-bit token (32 bytes)
api_key = secrets.token_urlsafe(32)
# Example: 'dGhpcyBpcyBhIHNlY3JldCB0b2tlbiBmb3IgdGVzdGluZw'

Password Salt:
salt = secrets.token_hex(16) # 16 bytes → 32 char hex
# Use with password hashing (e.g., bcrypt, Argon2)

CSRF Token:
csrf_token = secrets.token_urlsafe(32)
# Store in session, validate on POST

Secure PIN:
pin = ''.join(secrets.choice('0123456789') for _ in range(6))
# Example: '471938'

Cryptographic Nonce:
nonce = secrets.token_bytes(16) # 16 random bytes
# Use with AES-GCM, ChaCha20, etc.

Common Pitfalls#
Pitfall 1: “I need truly random for my simulation”
- Reality: Simulations need reproducibility, not security
- Fix: Use `numpy.random` with an explicit seed
Pitfall 2: “secrets is always better because it’s more random”
- Reality: 10-100x slower, can’t reproduce results
- Fix: Use `secrets` only for security; use `numpy` for simulations
Pitfall 3: “Token length doesn’t matter”
- Reality: `token_urlsafe(8)` provides only 64 bits of entropy (brute-forceable)
- Fix: Use ≥32 bytes (256 bits) for security-critical tokens
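The size difference is visible in the token string itself (`token_urlsafe` emits roughly 1.3 URL-safe base64 characters per byte):

```python
import secrets

weak = secrets.token_urlsafe(8)     # 8 bytes = 64 bits of entropy (too small)
strong = secrets.token_urlsafe(32)  # 32 bytes = 256 bits of entropy
print(len(weak), len(strong))       # 11 43
```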
Pitfall 4: “I can use random.seed(secrets.randbits(128))”
- Reality: Defeats purpose (PRNG becomes predictable after seeding)
- Fix: Use `secrets` directly, or use `random.SystemRandom`
Comparison: secrets vs random.SystemRandom#
Both use OS entropy, but different APIs:
secrets (Recommended):
token = secrets.token_urlsafe(32)

- Simpler API (token generation built-in)
- Added in Python 3.6
random.SystemRandom:
import random
import string

rng = random.SystemRandom()
token = ''.join(rng.choices(string.ascii_letters + string.digits, k=32))

- More flexible (all `random` methods)
- Available since Python 2.4
Recommendation: Use secrets for new code (simpler, designed for security)
License#
Python Software Foundation License (PSF) - Permissive
Documentation#
- Official docs: https://docs.python.org/3/library/secrets.html
- PEP 506: https://www.python.org/dev/peps/pep-0506/
- OWASP Random Number Generation: https://cheatsheetseries.owasp.org/cheatsheets/Cryptographic_Storage_Cheat_Sheet.html
Key Takeaways#
When to use secrets:
- ✅ Generating tokens, passwords, salts, nonces
- ✅ Any case where predictability = security breach
When NOT to use secrets:
- ❌ Simulations, ML, gaming (use `numpy.random`)
- ❌ Need reproducibility (use seeded PRNGs)
Rule: Security-critical → secrets, Everything else → numpy.random
Xoshiro / Xoroshiro#
Positioning: Ultra-fast modern RNGs for extreme performance needs
Maturity#
- Algorithm Authors: David Blackman & Sebastiano Vigna (2018)
- NumPy: Not default, but `np.random.SFC64` (a similar fast generator family) is available
- Julia: Default RNG (Xoshiro256++)
- Rust: Popular choice (via the `rand` crate ecosystem)
What Are Xoshiro/Xoroshiro?#
Xor-Shift-Rotate family:
- Xoroshiro128+: 128-bit state, fast (but has statistical flaws)
- Xoroshiro128++: Fixed version (passes all tests)
- Xoshiro256+: 256-bit state (flawed)
- Xoshiro256++: 256-bit state, excellent quality (Julia default)
- Xoshiro256**: Alternative mixing function
Design goals:
- Extremely fast (minimal CPU operations)
- Small state (128-256 bits)
- High statistical quality (passes BigCrush)
- Jump-ahead capability (parallel streams)
Implementation Options#
Option 1: NumPy (SFC64 - Similar)#
NumPy’s SFC64 (Small Fast Counter, similar performance):
import numpy as np
# SFC64: Similar speed/quality to Xoshiro
rng = np.random.Generator(np.random.SFC64(seed=42))
samples = rng.random(size=1000000)

Note: NumPy doesn’t have Xoshiro directly, but SFC64 is comparable
Option 2: Custom Implementation#
Pure Python (for reference, slow):
MASK64 = (1 << 64) - 1

class Xoshiro256StarStar:
    def __init__(self, seed):
        # Split the seed into four 64-bit words; fall back to small
        # non-zero constants so the state is never all zeros
        self.s = [(seed >> k) & MASK64 or (k + 1) for k in (0, 64, 128, 192)]

    @staticmethod
    def _rotl(x, k):
        # 64-bit left rotation
        return ((x << k) | (x >> (64 - k))) & MASK64

    def next(self):
        # Xoshiro256** output function and state transition
        result = (self._rotl((self.s[1] * 5) & MASK64, 7) * 9) & MASK64
        t = (self.s[1] << 17) & MASK64
        self.s[2] ^= self.s[0]
        self.s[3] ^= self.s[1]
        self.s[1] ^= self.s[2]
        self.s[0] ^= self.s[3]
        self.s[2] ^= t
        self.s[3] = self._rotl(self.s[3], 45)
        return result

C Extension: For production, use NumPy or implement in C/Cython
Option 3: Other Languages#
Julia (Native):
using Random
rng = Xoshiro(42) # Default RNG in Julia
rand(rng)

Rust (via rand crate):
use rand::prelude::*;
use rand_xoshiro::Xoshiro256PlusPlus;
let mut rng = Xoshiro256PlusPlus::seed_from_u64(42);
let value: f64 = rng.gen();

Algorithm Properties#
Xoshiro256++ (Best variant):
- Period: 2^256 - 1
- State size: 256 bits (32 bytes)
- Output: 64 bits per call
- Speed: ~4+ GB/s (single-threaded C implementation)
- Quality: Passes BigCrush (TestU01), PractRand
Comparison:
| RNG | Speed | State | Quality |
|---|---|---|---|
| Xoshiro256++ | 4+ GB/s | 256 bits | Excellent |
| PCG64 | 3-4 GB/s | 128 bits | Excellent |
| MT19937 | 1.5 GB/s | 2.5 KB | Good |
Performance#
C Implementation:
- ~1 billion samples/sec (float64)
- ~4+ GB/s raw throughput
- Fastest among modern RNGs
Python (Pure Python, slow):
- ~10-20 million samples/sec (100x slower)
- Use NumPy or C extension for production
API Complexity#
NumPy SFC64 (Similar performance):
import numpy as np
rng = np.random.Generator(np.random.SFC64(seed=42))
samples = rng.random(size=1000000)

Custom C extension (Advanced):
- Requires implementing BitGenerator interface
- See NumPy docs on custom bit generators
Best For#
- ✅ Extreme performance needs (game engines, real-time simulations)
- ✅ Embedded systems (small state: 256 bits)
- ✅ Parallel streams (jump-ahead capability)
- ✅ Julia/Rust projects (native support)
Not Ideal For#
- ❌ Python default - NumPy PCG64 is better integrated
- ❌ Security - Not cryptographically secure (like all PRNGs)
- ❌ Legacy compatibility - New (2018), less historical validation
Parallel Streams (Jump-Ahead)#
Key feature: Can “jump” ahead in sequence
# Conceptual (not available in NumPy directly)
rng1 = Xoshiro256PlusPlus(seed=42)
rng2 = rng1.jump() # Jump 2^128 steps ahead
# Independent streams for parallel workers
rngs = [rng1, rng2, rng1.jump(), rng2.jump()]

Use case: Parallel simulations with independent RNG streams
Statistical Quality#
Xoshiro256++:
- ✅ TestU01 BigCrush: Passes all tests
- ✅ PractRand: Passes up to 32+ TB output
- ✅ No known flaws
Xoroshiro128+ (AVOID):
- ❌ Fails linearization tests
- ❌ Use ++ variants instead
Key: Use ++ or ** variants, not + variants
Why Not NumPy’s Default?#
NumPy chose PCG64 over Xoshiro (2019):
Reasons:
- PCG more established (2014 vs 2018)
- Smaller state (128 bits vs 256 bits)
- Comparable performance (3-4 GB/s vs 4+ GB/s)
- More historical validation
Julia chose Xoshiro256++ (2021):
- Slightly faster
- Jump-ahead for parallel streams
- Simple algorithm (easier to verify)
License#
Public Domain (CC0) - Fully permissive
Documentation#
- Official website: https://prng.di.unimi.it/
- Paper: “Scrambled Linear Pseudorandom Number Generators” (Blackman & Vigna, 2021)
- NumPy SFC64: https://numpy.org/doc/stable/reference/random/bit_generators/sfc64.html
- Julia Xoshiro: https://docs.julialang.org/en/v1/stdlib/Random/
Common Pitfalls#
Pitfall 1: “Xoshiro is always better because it’s fastest”
- Reality: NumPy PCG64 is better integrated, only ~20% slower
- Fix: Use PCG64 unless you need absolute maximum speed
Pitfall 2: “Xoroshiro128+ is good enough”
- Reality: Fails statistical tests; use `++` or `**` variants
- Fix: Only use Xoroshiro128++ or Xoshiro256++
Pitfall 3: “I can easily use Xoshiro in Python”
- Reality: No direct NumPy support, requires custom implementation
- Fix: Use NumPy SFC64 (similar performance) or implement C extension
Comparison: Xoshiro vs PCG64#
When to use Xoshiro256++:
- ✅ Julia/Rust projects (native support)
- ✅ Need absolute maximum speed (~20% faster)
- ✅ Parallel streams (jump-ahead)
When to use PCG64:
- ✅ Python/NumPy (better integration)
- ✅ Smaller state (128 bits vs 256 bits)
- ✅ More established (2014 vs 2018)
Performance difference: ~20% (not worth switching for most use cases)
Key Takeaways#
Xoshiro256++ is fastest modern RNG:
- 4+ GB/s throughput
- Passes all statistical tests
- Small state (256 bits)
But:
- Not directly in NumPy (use SFC64 instead)
- Only ~20% faster than PCG64
- Newer (2018), less historical validation
Recommendation: Stick with NumPy PCG64 unless you need absolute maximum speed
S2: Comprehensive
S2 Approach: Comprehensive Analysis#
What S2 Discovers#
HOW do RNG algorithms work, and how do they compare in depth?
S2 is a deep dive: algorithm internals, statistical quality tests, performance benchmarks, parallel RNG patterns, and edge case handling.
S2 Content Format#
For each RNG algorithm/library:
- Algorithm Details: State structure, output function, period, mathematical properties
- Statistical Quality: TestU01 BigCrush results, PractRand results, known flaws
- Performance Benchmarks: Throughput (GB/s), latency (ns), memory usage
- Parallel Patterns: Independent streams (SeedSequence, jump-ahead), correlation tests
- API Deep Dive: Advanced features, edge cases, version stability
- Comparison: Head-to-head on common workloads (1K, 1M, 10M samples)
What S2 Excludes#
- ❌ Use case specific recommendations (S3)
- ❌ Production deployment patterns (S4)
- ❌ Surface-level surveys (S1 already covered that)
S2 helps you understand how RNGs work internally and validate quality claims
Reading Time#
30-60 minutes for complete algorithmic understanding and empirical validation
Performance Benchmarks#
Focus: Throughput, latency, and memory usage across RNG algorithms
Benchmark Methodology#
- Hardware: x86_64 CPU (Intel/AMD), single-threaded
- Compiler: GCC -O3 (C implementations), CPython 3.10+ (Python)
- Sample size: 10M samples per test (float64 output)
- Metrics: Throughput (samples/sec, GB/s), latency (ns/sample), memory (state size)
Throughput Benchmarks#
C Implementation (Optimal Performance)#
| RNG | Throughput (samples/sec) | Throughput (GB/s) | Relative Speed |
|---|---|---|---|
| Xoshiro256++ | 1,000M - 1,200M | 4.0 - 4.8 | 2.5x |
| SFC64 | 900M - 1,100M | 3.6 - 4.4 | 2.3x |
| PCG64 | 800M - 900M | 3.2 - 3.6 | 2.0x |
| MT19937 | 400M - 500M | 1.6 - 2.0 | 1.0x (baseline) |
| LCG (64-bit) | 1,500M - 2,000M | 6.0 - 8.0 | 3.8x (but poor quality) |
Key insight: Modern RNGs (PCG64, Xoshiro) are 2-2.5x faster than MT19937
NumPy Implementation (Python, Vectorized)#
| RNG | Throughput (samples/sec) | Throughput (GB/s) | Relative Speed |
|---|---|---|---|
| PCG64 (default) | 750M - 850M | 3.0 - 3.4 | 2.0x |
| SFC64 | 700M - 800M | 2.8 - 3.2 | 1.9x |
| MT19937 | 400M - 450M | 1.6 - 1.8 | 1.0x (baseline) |
| Philox | 300M - 350M | 1.2 - 1.4 | 0.8x |
Test: rng.random(size=10_000_000) (10M uniform floats)
Key insight: NumPy overhead is minimal (~10% slower than C), vectorization is critical
Python random module (Single-threaded, Python loops)#
| Operation | Throughput | Relative Speed |
|---|---|---|
| random.random() (MT19937) | 10M - 20M samples/sec | 0.025x (40x slower than NumPy) |
Test: [random.random() for _ in range(10_000_000)] (Python loop)
Key insight: Python loops are 40-100x slower than vectorized NumPy
Latency Benchmarks (Single Sample)#
| RNG | Latency (ns/sample) | Use Case |
|---|---|---|
| Xoshiro256++ | 1.0 - 1.5 ns | Game engines (real-time) |
| PCG64 | 1.2 - 1.8 ns | Simulations (default) |
| MT19937 | 2.5 - 3.5 ns | Legacy code |
| OS Entropy | 10,000 - 100,000 ns | Security (100-1000x slower) |
Test: Single sample generation latency (C implementation)
Key insight: OS entropy (secrets) is 100-1000x slower (acceptable for security)
Memory Usage (State Size)#
| RNG | State Size | Relative Size | Cache Efficiency |
|---|---|---|---|
| PCG64 | 128 bits (16 bytes) | 1.0x (baseline) | ✅ Excellent (fits in L1 cache) |
| Xoshiro256++ | 256 bits (32 bytes) | 2.0x | ✅ Excellent |
| SFC64 | 256 bits (32 bytes) | 2.0x | ✅ Excellent |
| MT19937 | 19,968 bits (2,496 bytes) | 156x | ⚠️ Poor (evicts L1 cache) |
| LCG (64-bit) | 64 bits (8 bytes) | 0.5x | ✅ Excellent (but poor quality) |
Key insight: PCG64 state is 156x smaller than MT19937 (better cache efficiency)
Vectorization Impact (NumPy)#
Single Sample vs Vectorized (10M samples)#
| Method | Time (10M samples) | Throughput | Relative Speed |
|---|---|---|---|
| Vectorized: rng.random(10_000_000) | 10-15 ms | 750M samples/sec | 40-100x |
| Python loop: [rng.random() for _ in range(10_000_000)] | 1-2 seconds | 10M samples/sec | 1.0x (baseline) |
Key insight: Always vectorize (20-100x speedup for large samples)
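A rough per-sample cost check with `timeit` (absolute numbers are machine-dependent; the ratio is the point):

```python
import timeit

import numpy as np

rng = np.random.default_rng(seed=42)

# Vectorized: amortized cost per sample over one big call
vec_per_sample = timeit.timeit(lambda: rng.random(1_000_000), number=10) / 10 / 1_000_000

# Scalar loop: full Python call overhead paid on every sample
loop_per_sample = timeit.timeit(
    lambda: [rng.random() for _ in range(10_000)], number=10
) / 10 / 10_000

print(f"vectorized: {vec_per_sample * 1e9:.1f} ns/sample, "
      f"loop: {loop_per_sample * 1e9:.0f} ns/sample")
```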
Distribution Generation Performance#
NumPy PCG64 (10M samples)#
| Distribution | Time (ms) | Throughput (samples/sec) | vs Uniform |
|---|---|---|---|
| Uniform (random()) | 12-15 ms | 750M | 1.0x (baseline) |
| Normal (normal()) | 25-35 ms | 350M | 0.47x |
| Exponential (exponential()) | 18-25 ms | 500M | 0.67x |
| Poisson (poisson()) | 50-80 ms | 150M | 0.20x |
| Binomial (binomial()) | 100-200 ms | 75M | 0.10x |
Key insight: Complex distributions (Poisson, Binomial) are 5-10x slower than uniform
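The relative costs can be spot-checked with `timeit` (again, only the ratios are meaningful across machines):

```python
import timeit

import numpy as np

rng = np.random.default_rng(seed=42)
n = 1_000_000

# Average over a few runs to smooth out noise
t_uniform = timeit.timeit(lambda: rng.random(n), number=5) / 5
t_normal = timeit.timeit(lambda: rng.normal(size=n), number=5) / 5
t_poisson = timeit.timeit(lambda: rng.poisson(lam=3.0, size=n), number=5) / 5

print(f"uniform {t_uniform*1e3:.1f} ms | normal {t_normal*1e3:.1f} ms | "
      f"poisson {t_poisson*1e3:.1f} ms")
```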
Parallel RNG (Multi-threaded)#
Independent Streams (SeedSequence)#
Setup:
from numpy.random import SeedSequence, default_rng
ss = SeedSequence(12345)
child_seeds = ss.spawn(10)
rngs = [default_rng(s) for s in child_seeds]

Overhead: <1 ms (negligible for 10 independent streams)
Correlation: None detected (tested with TestU01 on concatenated outputs)
Key insight: SeedSequence is efficient and statistically sound for parallel RNG
Thread-Unsafe Global State (random module)#
Setup: random.seed(42) shared across threads
Risk: Race conditions (corrupted state, biased outputs)
Mitigation: Use random.Random() instances (one per thread) or NumPy generators
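A sketch of the safe pattern: one spawned generator per thread, so no mutable RNG state is shared:

```python
import threading

from numpy.random import SeedSequence, default_rng

# One independent, statistically sound generator per thread
ss = SeedSequence(42)
rngs = [default_rng(child) for child in ss.spawn(4)]
results = [None] * 4

def worker(i):
    # Each thread touches only its own generator's state
    results[i] = float(rngs[i].normal(size=1000).mean())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```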
Cryptographic RNG Performance#
secrets module (OS Entropy)#
| Operation | Time (10K tokens) | Throughput | Relative Speed |
|---|---|---|---|
| token_bytes(32) | 100-500 ms | 10-50 MB/s | 0.01x (100x slower) |
Key insight: Security has a cost (10-100x slower than PRNGs), but acceptable for token generation
Real-World Performance Examples#
Monte Carlo Simulation (1M scenarios, 1K samples each)#
PRNG (NumPy PCG64):
- Generation time: ~1.5-2 seconds (1B samples)
- Throughput: ~500-750M samples/sec (vectorized)
CSRNG (secrets):
- Generation time: ~50-100 seconds (1B samples)
- Throughput: ~10-20M samples/sec (~50x slower)
Recommendation: Never use secrets for simulations (use PCG64)
API Token Generation (10K tokens, 32 bytes each)#
PRNG (NumPy PCG64):
- Generation time: ~1-5 ms
- INSECURE (predictable with seed knowledge)
CSRNG (secrets):
- Generation time: ~100-500 ms
- SECURE (unpredictable)
Recommendation: Always use secrets for security (10-100x slower is acceptable)
Optimization Guidelines#
When to Optimize RNG#
Trigger: RNG consumes >5% of total runtime (profile first!)
Common cases:
- ❌ RNG is rarely the bottleneck (<1% of runtime is typical)
- ✅ Exception: Real-time games with tight per-frame budgets (~16 ms/frame)
Optimization Strategies#
Vectorize (20-100x speedup):
- Bad: `[rng.random() for _ in range(1_000_000)]`
- Good: `rng.random(size=1_000_000)`
Upgrade algorithm (2x speedup):
- Old: MT19937 (1.5 GB/s)
- New: PCG64 (3-4 GB/s)
Consider SFC64 (20% speedup over PCG64):
rng = np.random.Generator(np.random.SFC64(seed=42))
GPU acceleration (10-100x for large-scale):
- CuPy, JAX, PyTorch (if already on GPU for other operations)
Benchmark Reproducibility#
Hardware variance: ±20% across different CPUs (Intel vs AMD, generations)
OS variance: <5% (Linux, macOS, Windows similar)
Python version: <10% (3.8 vs 3.12 minimal difference)
Recommendation: Measure on your hardware if RNG is critical path
Further Reading#
NumPy benchmarks:
- NEP 19 (RNG redesign): https://numpy.org/neps/nep-0019-rng-policy.html#performance
PCG benchmarks:
Xoshiro benchmarks:
- Xoshiro website: https://prng.di.unimi.it/
S2 Recommendation: Algorithm Selection#
Key Insights from Deep Dive#
1. PCG64 (NumPy Default) Strengths#
- ✅ Small state: 128 bits (vs 2.5 KB for MT19937)
- ✅ Fast: 3-4 GB/s (2x faster than MT19937)
- ✅ High quality: Passes TestU01 BigCrush, PractRand (32+ TB)
- ✅ Hard to predict: State recovery requires full state observation (not 624 outputs like MT19937)
When: Default choice for simulations, ML, Monte Carlo (95% of use cases)
2. Xoshiro256++/SFC64 Performance Advantage#
- ✅ Fastest: 4+ GB/s (20% faster than PCG64)
- ✅ Small state: 256 bits (still compact)
- ✅ Jump-ahead: Can create independent streams efficiently
⚠️ Trade-off: Newer (2018), less historical validation than PCG64
When: Performance-critical applications (game engines, real-time), Julia/Rust ecosystems
3. MT19937 Legacy Status#
- ⚠️ Slow: 1.5 GB/s (about half the speed of PCG64)
- ⚠️ Large state: 2.5 KB (156x larger than PCG64)
- ⚠️ Predictable: State recoverable from 624 outputs
- ⚠️ Statistical flaws: Fails some TestU01 BigCrush tests
When: Legacy compatibility only (Python random module default)
4. Secrets (OS Entropy) Security vs Performance#
- ✅ Unpredictable: No state recovery (OS-level entropy)
- ✅ Cryptographically secure: Suitable for tokens, nonces, passwords
- ⚠️ Cost: 10-100x slower than PRNGs (10-50 MB/s vs 3-4 GB/s)
- ⚠️ No reproducibility: Cannot seed (intentional)
When: Security-critical only (never for simulations)
Algorithm Decision Matrix#
| Algorithm | Throughput | State | Quality | Security | Use Case |
|---|---|---|---|---|---|
| PCG64 | 3-4 GB/s | 128 bits | Excellent | ❌ PRNG | Default choice |
| Xoshiro256++ | 4+ GB/s | 256 bits | Excellent | ❌ PRNG | Extreme performance |
| MT19937 | 1.5 GB/s | 2.5 KB | Good | ❌ PRNG | Legacy only |
| OS Entropy | 10-50 MB/s | N/A | Perfect | ✅ CSRNG | Security only |
Statistical Quality Hierarchy#
Tier 1 (Passes all tests):
- PCG64 (NumPy default)
- Xoshiro256++ (Julia default)
- SFC64 (NumPy alternative)
Tier 2 (Passes most tests):
- MT19937 (fails some BigCrush tests)
- PCG32 (32-bit output, lower quality than PCG64)
Tier 3 (Known flaws, avoid):
- Linear Congruential Generators (LCG) - predictable patterns
- Xoroshiro128+ (fails linearization tests, use ++ variant)
- RANDU (historically bad, never use)
Parallel RNG Recommendations#
Independent streams (NumPy SeedSequence):
from numpy.random import SeedSequence, default_rng
ss = SeedSequence(12345)
child_seeds = ss.spawn(10) # 10 independent RNGs
rngs = [default_rng(s) for s in child_seeds]

Jump-ahead (Xoshiro, PCG):
- Xoshiro256++: Can jump 2^128 steps (create independent streams)
- PCG64: Has an advance() method (exposed in NumPy as np.random.PCG64.advance())
Avoid: Simple seed offsets (seed+0, seed+1, seed+2) - can cause correlation
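As a concrete alternative to spawn-based streams, NumPy's PCG64 bit generator exposes jumped(), which returns a copy of the state advanced by a huge stride (2^127 steps per jump), so streams cannot overlap in practice. A minimal sketch:

```python
import numpy as np
from numpy.random import Generator, PCG64

# Stream i is the base state jumped i times (2^127 steps per jump),
# so the 10 streams are guaranteed non-overlapping.
base = PCG64(12345)
rngs = [Generator(base.jumped(i)) for i in range(1, 11)]

draws = [rng.random() for rng in rngs]
print(draws)
```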
Production Recommendations#
Python scientific stack users: numpy.random.default_rng() (PCG64, already installed)
Performance-critical applications: Consider SFC64 (via NumPy) or custom Xoshiro (20% faster)
Security applications: secrets.token_urlsafe(32) (never use PRNGs)
Legacy compatibility: random.seed() + random.random() (MT19937, stdlib)
Next: S3 Need-Driven Discovery#
Match these algorithms to specific use cases (Monte Carlo simulation, ML training, game development, security tokens)
Statistical Quality Analysis#
Focus: How do RNGs perform on rigorous statistical tests?
Test Suites#
TestU01 (Industry Standard)#
Suite: Comprehensive statistical test battery (Pierre L’Ecuyer, Université de Montréal)
Test levels:
- SmallCrush: 10 tests, ~1 minute (quick sanity check)
- Crush: 96 tests, ~2 hours (thorough)
- BigCrush: 106 tests, ~8 hours (most rigorous)
Interpretation:
- Pass: p-value in [0.001, 0.999] for all tests
- Fail: Systematic p-values near 0 or 1 (detectable pattern)
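At a much smaller scale, the same pass/fail logic can be illustrated with a single uniformity test from SciPy (a toy stand-in for a full TestU01 battery; `scipy` assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
samples = rng.random(size=100_000)

# Kolmogorov-Smirnov test against the uniform [0, 1) distribution.
# One healthy run gives a p-value anywhere in (0, 1); what signals a
# flawed RNG is p-values piling up near 0 or 1 across repeated runs.
statistic, p_value = stats.kstest(samples, "uniform")
print(f"KS statistic: {statistic:.5f}, p-value: {p_value:.3f}")
```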
PractRand#
Suite: Modern test battery (Chris Doty-Humphrey)
Approach: Feed RNG output directly (no transformation), test up to 32+ TB of output
Advantage: Catches flaws that TestU01 misses (e.g., linear complexity)
Interpretation:
- Pass: No failures up to 32 TB (practical limit)
- Fail: Systematic failures at low output volumes (<1 TB)
RNG Quality Results#
PCG64 (NumPy Default)#
TestU01 BigCrush: ✅ Pass (all 106 tests) PractRand: ✅ Pass (32+ TB tested) Known flaws: None
Analysis: State-of-the-art quality. No detectable patterns in practical use.
Reference: O’Neill (2014), “PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation”
Xoshiro256++ (Julia Default)#
TestU01 BigCrush: ✅ Pass (all 106 tests) PractRand: ✅ Pass (32+ TB tested) Known flaws: None (++ variant fixed linearization issues)
Analysis: Excellent quality, comparable to PCG64
Note: Xoroshiro128+ (+ variant, not ++) fails linearization tests (avoid)
Reference: Blackman & Vigna (2018), “Scrambled Linear Pseudorandom Number Generators”
MT19937 (Mersenne Twister)#
TestU01 BigCrush: ⚠️ Partial Pass (fails 2-3 tests) PractRand: ⚠️ Fails at moderate output volumes (~256 GB) Known flaws:
- Fails LinearComp test (BigCrush)
- Fails MatrixRank test (BigCrush)
- Predictable from 624 outputs (state recovery)
Analysis: Good for most uses, but not state-of-the-art. Modern RNGs (PCG64, Xoshiro) are better.
Reference: Matsumoto & Nishimura (1998), “Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator”
SFC64 (Small Fast Chaotic)#
TestU01 BigCrush: ✅ Pass (all 106 tests) PractRand: ✅ Pass (32+ TB tested) Known flaws: None
Analysis: Comparable to PCG64 and Xoshiro256++, slightly faster
Reference: Chris Doty-Humphrey, PractRand documentation (SFC64 originated in the PractRand test suite)
LCG (Linear Congruential Generator)#
TestU01 BigCrush: ❌ Fail (many tests)
PractRand: ❌ Fail at low output (<1 MB)
Known flaws:
- Hyperplane correlation (serial correlation test failure)
- Low-order bits have short periods
- Predictable from a few outputs
Analysis: Historically common, now considered inadequate. Never use for production.
Example: rand() in C (many implementations use LCG)
Practical Implications#
What “Passing BigCrush” Means#
Practical guarantee: No detectable patterns in 2^40 outputs (~8 TB at 64 bits/sample)
Real-world: If your simulation uses <1 trillion random numbers, any BigCrush-passing RNG is fine
Caveat: Passing tests doesn’t prove perfect randomness (just “no known flaws”)
What “Failing BigCrush” Means#
Theoretical concern: Detectable patterns exist (could bias results)
Practical impact: Usually negligible for most simulations
- MT19937 fails 2-3/106 tests, but widely used without issues
- Most simulations use well under 1 trillion samples (far below the failure point)
When it matters: High-stakes simulations (nuclear, finance), cryptography (never use any PRNG)
State Recovery Risk#
MT19937: State recoverable from 624 consecutive outputs
- Impact: If attacker observes 624 outputs, they can predict all future outputs
- Mitigation: Don’t expose raw RNG outputs to adversaries (hash, use for internal simulation only)
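One hedged way to apply that mitigation: if a PRNG-derived value must leave the process (e.g. as a display ID), pass it through a cryptographic hash so observers never see the raw consecutive outputs that state recovery needs. This only hinders state reconstruction; for anything genuinely security-relevant, use the secrets module instead. Sketch:

```python
import hashlib

import numpy as np

rng = np.random.default_rng(seed=42)

# Hash raw outputs before exposure: an observer of public_id cannot
# collect the consecutive raw values that state recovery requires.
raw = rng.integers(0, 2**63, size=4).tobytes()
public_id = hashlib.sha256(raw).hexdigest()[:16]
print(public_id)
```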
PCG64: State recovery much harder (requires full 128-bit state observation)
- Impact: Significantly more secure than MT19937 (but still not cryptographic)
Secrets (OS entropy): State recovery impossible (no PRNG state, uses OS entropy pool)
- Impact: Only option for security-critical randomness
Recommendations by Use Case#
Scientific Simulations#
✅ Use: PCG64, Xoshiro256++, SFC64 (all pass BigCrush) ⚠️ Acceptable: MT19937 (fails 2-3 tests, but usually fine) ❌ Avoid: LCG (many failures)
Monte Carlo Finance#
✅ Use: PCG64 (default), Xoshiro256++ (if proven bottleneck) ⚠️ Acceptable: MT19937 (widespread use, no known issues in practice)
Machine Learning#
✅ Use: PCG64 (NumPy default), PyTorch/TensorFlow defaults ⚠️ Note: ML training is robust to RNG imperfections (dropout, augmentation)
Game Development#
✅ Use: PCG64, Xoshiro256++ (fast, high quality) ⚠️ Acceptable: MT19937 (good enough for procedural generation)
Security (Tokens, Nonces, Keys)#
✅ ONLY: secrets module (OS entropy)
❌ NEVER: Any PRNG (all are predictable)
Further Reading#
TestU01:
- Paper: L’Ecuyer & Simard (2007), “TestU01: A C Library for Empirical Testing of Random Number Generators”
- Download: http://simul.iro.umontreal.ca/testu01/tu01.html
PractRand:
- Website: https://pracrand.sourceforge.net/
- Testing guide: https://www.pcg-random.org/posts/how-to-test-with-practrand.html
PCG Quality:
- Website: https://www.pcg-random.org/
- Statistical tests: https://www.pcg-random.org/statistical-tests.html
Xoshiro Quality:
- Website: https://prng.di.unimi.it/
- Paper: Blackman & Vigna (2021), “Scrambled Linear Pseudorandom Number Generators”
S3: Need-Driven
S3 Approach: Need-Driven Discovery#
What S3 Discovers#
WHEN should I use which RNG for my specific use case?
S3 matches RNG algorithms to concrete scenarios: Monte Carlo simulation, ML training, game development, security tokens, statistical bootstrap, procedural generation.
S3 Content Format#
For each use case:
- Scenario: Concrete problem description
- Requirements: Performance, reproducibility, security, distributions needed
- Recommended RNG: Algorithm + library choice
- Code Example: Working implementation
- Pitfalls: Common mistakes to avoid
- Alternatives: When to deviate from default
What S3 Excludes#
❌ Algorithm internals (S2 covered that) ❌ General comparisons (S1/S2 covered that) ❌ Production deployment (S4)
S3 helps you apply RNG knowledge to solve real problems
Reading Time#
20-40 minutes to find your use case and implementation pattern
Use Case: Machine Learning#
Scenario#
Problem: Train neural networks, random forests, and other ML models with reproducible results
Examples:
- Deep learning: Weight initialization, dropout, data augmentation
- Ensemble methods: Random forests, bagging, boosting
- Optimization: Random search, genetic algorithms, simulated annealing
- Data splitting: Train/validation/test splits, k-fold cross-validation
Requirements#
Reproducibility#
- Critical: Same seed → same results (research, debugging)
- Collaboration: Share seeds across team/papers
- Version control: Log seed in experiment tracking (MLflow, W&B, DVC)
Performance#
- Scale: 10M-1B parameters (modern neural networks)
- Speed: <1 second for weight initialization
- GPU: On-device RNG for training (avoid CPU-GPU transfer)
Statistical Quality#
- Less critical: ML is robust to RNG imperfections (dropout, augmentation)
- Good enough: Any BigCrush-passing RNG is fine
Recommended Solution#
RNG: numpy.random.default_rng() (PCG64) + explicit seed + logging
Why NumPy:
- ✅ Reproducible (explicit seed control)
- ✅ Fast initialization (10M weights in ms)
- ✅ PyTorch/TensorFlow use NumPy conventions
- ✅ Good statistical quality (PCG64 passes BigCrush)
Implementation Patterns#
Weight Initialization#
import numpy as np
import mlflow
def initialize_weights(input_dim, output_dim, seed=42):
"""
Xavier/Glorot initialization for neural network layer.
"""
# Log seed for reproducibility
mlflow.log_param("random_seed", seed)
rng = np.random.default_rng(seed=seed)
# Xavier: std = sqrt(2 / (input_dim + output_dim))
std = np.sqrt(2 / (input_dim + output_dim))
weights = rng.normal(loc=0, scale=std, size=(input_dim, output_dim))
return weights
# Initialize layer: 1000 inputs → 500 outputs
W = initialize_weights(1000, 500, seed=42)
print(f"Weights shape: {W.shape}")
print(f"Weights std: {np.std(W):.6f}")
Train/Test Split (Reproducible)#
import numpy as np
from sklearn.model_selection import train_test_split
# Set global seed for scikit-learn
np.random.seed(42)
X = np.random.randn(1000, 20)
y = np.random.randint(0, 2, size=1000)
# Reproducible split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
Note: scikit-learn uses the random_state parameter (not NumPy generator objects)
Data Augmentation (PyTorch)#
import torch
import torchvision.transforms as transforms
# Set seed for PyTorch
torch.manual_seed(42)
# Random augmentation
transform = transforms.Compose([
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomRotation(degrees=15),
transforms.RandomCrop(size=32, padding=4)
])
# Apply to image
augmented_image = transform(image)
Note: PyTorch has its own RNG (separate from NumPy)
Dropout (TensorFlow/Keras)#
import tensorflow as tf
from tensorflow import keras
# Set seed for TensorFlow
tf.random.set_seed(42)
# Dropout layer (50% dropout rate)
model = keras.Sequential([
keras.layers.Dense(128, activation='relu'),
keras.layers.Dropout(0.5), # Uses TF's RNG
keras.layers.Dense(10, activation='softmax')
])
Note: TensorFlow has its own RNG (separate from NumPy)
Framework-Specific RNGs#
PyTorch#
import torch
import numpy as np
# Set all seeds for full reproducibility
def set_seed(seed=42):
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed) # For multi-GPU
# Deterministic algorithms (slower but reproducible)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
set_seed(42)
RNG locations:
- CPU: torch.manual_seed(seed)
- GPU: torch.cuda.manual_seed(seed)
- Augmentation: torch.utils.data with worker_init_fn
TensorFlow/Keras#
import tensorflow as tf
import numpy as np
def set_seed(seed=42):
np.random.seed(seed)
tf.random.set_seed(seed)
# For tf.data augmentation
import random
random.seed(seed)
set_seed(42)
RNG locations:
- Global: tf.random.set_seed(seed)
- Operation-level: tf.random.uniform(..., seed=seed)
scikit-learn#
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Set NumPy seed (affects scikit-learn)
np.random.seed(42)
# Or use random_state parameter
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
Note: scikit-learn uses NumPy’s RNG internally
Experiment Tracking#
MLflow#
import mlflow
import numpy as np
def train_model(seed=42):
with mlflow.start_run():
# Log seed
mlflow.log_param("random_seed", seed)
# Set seed
np.random.seed(seed)
torch.manual_seed(seed)
# Train model
model = initialize_model(seed=seed)
# ... training code ...
# Log metrics
mlflow.log_metric("accuracy", accuracy)
train_model(seed=42)
Weights & Biases#
import wandb
import numpy as np
def train_model(seed=42):
wandb.init(project="my-project", config={"seed": seed})
# Set seed
np.random.seed(seed)
torch.manual_seed(seed)
# Train model
model = initialize_model(seed=seed)
# ... training code ...
# Log metrics
wandb.log({"accuracy": accuracy})
train_model(seed=42)
Pitfalls and Solutions#
Pitfall 1: Not Logging Seed#
Problem: Can’t reproduce results (debugging nightmare)
Fix: Always log seed in experiment tracking
mlflow.log_param("random_seed", seed)
Pitfall 2: Mixing Framework RNGs#
Problem: PyTorch RNG ≠ NumPy RNG (incomplete reproducibility)
Fix: Set all framework seeds
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
Pitfall 3: Non-Deterministic Operations#
Problem: Some operations are non-deterministic even with seed (cuDNN)
Fix: Enable deterministic mode (slower but reproducible)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Pitfall 4: Seed = 0#
Problem: Some libraries treat seed=0 as “no seed” or use default
Fix: Use seed ≥ 1 (e.g., seed=42 is common convention)
Pitfall 5: Not Setting Worker Seeds (DataLoader)#
Problem: Multi-worker dataloaders use random seeds (non-reproducible)
Fix: Set worker seed function
def seed_worker(worker_id):
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
train_loader = DataLoader(
dataset,
batch_size=32,
num_workers=4,
worker_init_fn=seed_worker
)
GPU Acceleration#
CuPy (NumPy-like API on GPU)#
import cupy as cp
# Set seed
cp.random.seed(42)
# Generate on GPU (fast)
samples = cp.random.randn(10_000_000)
# Transfer to CPU (slow, avoid if possible)
samples_cpu = cp.asnumpy(samples)
When: Need NumPy-like RNG on GPU (not training)
PyTorch (GPU training)#
import torch
# Set CUDA seed
torch.cuda.manual_seed(42)
# Generate on GPU
samples = torch.randn(10_000_000, device='cuda')
When: Training on GPU (use framework RNG)
Best Practices#
1. Explicit Seed Management#
# Good: Explicit seed, logged
def train(seed=42):
mlflow.log_param("seed", seed)
set_all_seeds(seed)
# ... training ...
# Bad: No seed (non-reproducible)
def train():
model = Model() # Uses random initialization
2. Seed Ranges for Experiments#
# Run multiple seeds for robust results
seeds = [42, 123, 456, 789, 1011]
results = []
for seed in seeds:
accuracy = train_model(seed=seed)
results.append(accuracy)
print(f"Mean accuracy: {np.mean(results):.3f} ± {np.std(results):.3f}")
Standard practice: 3-5 seeds for research papers
3. Reproducibility Checklist#
- Seed set and logged (MLflow, W&B)
- All framework RNGs seeded (NumPy, PyTorch, TensorFlow)
- Deterministic mode enabled (cuDNN)
- Worker seeds set (DataLoader)
- Hardware/software versions logged (CUDA, PyTorch version)
Alternatives#
When NOT to use NumPy RNG:
- ❌ GPU training: Use PyTorch/TensorFlow RNG (avoid CPU-GPU transfer)
- ❌ Distributed training: Use framework-specific parallel RNG
When to use framework RNG:
- ✅ Training on GPU (PyTorch: torch.randn, TensorFlow: tf.random)
- ✅ Data augmentation (framework-specific transforms)
Further Reading#
PyTorch:
- Reproducibility guide: https://pytorch.org/docs/stable/notes/randomness.html
TensorFlow:
- Determinism guide: https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism
Papers:
- “Reproducibility in Machine Learning” (Pineau et al., 2020)
Use Case: Monte Carlo Simulation#
Scenario#
Problem: Estimate option prices, risk metrics, or physical properties via random sampling
Examples:
- Financial engineering: Option pricing (Black-Scholes, Heston model)
- Risk management: Value at Risk (VaR), Expected Shortfall
- Physics: Particle transport, radiation shielding
- Operations research: Queueing systems, inventory management
Requirements#
Performance#
- Scale: 1M - 1B random samples per simulation
- Speed: <1 second for 1M samples (interactive analysis)
- Throughput: >100M samples/sec (vectorized)
Reproducibility#
- Critical: Same seed → same results (debugging, research)
- Validation: Compare against analytic solutions
- Collaboration: Share seeds across team for reproducibility
Statistical Quality#
- Bias: <0.01% (accurate expectations)
- Convergence: O(1/√N) for Monte Carlo (variance reduction techniques help)
Distributions#
- Common: Uniform, normal (Gaussian), exponential
- Advanced: Multivariate normal (correlated assets), Poisson, gamma
Recommended Solution#
RNG: numpy.random.default_rng() (PCG64)
Why NumPy:
- ✅ Vectorized (20-100x faster than Python loops)
- ✅ PCG64 passes BigCrush (high statistical quality)
- ✅ Explicit seeding (reproducible)
- ✅ 50+ distributions out-of-the-box
- ✅ SciPy integration (advanced distributions)
Implementation Pattern#
Basic Monte Carlo#
import numpy as np
def monte_carlo_pi(n_samples=1_000_000, seed=42):
"""Estimate π using Monte Carlo."""
rng = np.random.default_rng(seed=seed)
# Generate random points in [0,1] x [0,1]
x = rng.random(size=n_samples)
y = rng.random(size=n_samples)
# Count points inside unit circle
inside = (x**2 + y**2) <= 1
pi_estimate = 4 * np.mean(inside)
return pi_estimate
pi_est = monte_carlo_pi(n_samples=10_000_000)
print(f"π ≈ {pi_est:.6f}") # π ≈ 3.141593
Option Pricing (Black-Scholes)#
import numpy as np
def european_call_mc(S0, K, T, r, sigma, n_paths=100_000, seed=42):
"""
Price European call option via Monte Carlo.
S0: Initial stock price
K: Strike price
T: Time to maturity (years)
r: Risk-free rate
sigma: Volatility
"""
rng = np.random.default_rng(seed=seed)
# Simulate stock price at maturity
Z = rng.standard_normal(size=n_paths)
ST = S0 * np.exp((r - 0.5*sigma**2)*T + sigma*np.sqrt(T)*Z)
# Payoff
payoff = np.maximum(ST - K, 0)
# Discount to present value
price = np.exp(-r*T) * np.mean(payoff)
# Standard error
se = np.std(payoff) / np.sqrt(n_paths) * np.exp(-r*T)
return price, se
price, se = european_call_mc(S0=100, K=105, T=1, r=0.05, sigma=0.2)
print(f"Call price: ${price:.2f} ± ${se:.2f}")
Correlated Assets (Multivariate Normal)#
import numpy as np
def correlated_assets_mc(S0, mu, cov, T, n_paths=100_000, seed=42):
"""
Simulate correlated asset prices.
S0: Initial prices (array)
mu: Expected returns (array)
cov: Covariance matrix
T: Time horizon
"""
rng = np.random.default_rng(seed=seed)
# Generate correlated normal variates
n_assets = len(S0)
Z = rng.multivariate_normal(mean=np.zeros(n_assets), cov=cov, size=n_paths)
# Simulate final prices
drift = (mu - 0.5*np.diag(cov)) * T
diffusion = np.sqrt(T) * Z
ST = S0 * np.exp(drift + diffusion)
return ST
# Example: 2 correlated assets (correlation = 0.6)
S0 = np.array([100, 100])
mu = np.array([0.08, 0.10])
cov = np.array([[0.04, 0.024], [0.024, 0.0625]]) # corr = 0.6
ST = correlated_assets_mc(S0, mu, cov, T=1, n_paths=10_000)
print(f"Asset 1 final price: ${np.mean(ST[:, 0]):.2f}")
print(f"Asset 2 final price: ${np.mean(ST[:, 1]):.2f}")
Performance Optimization#
Vectorization (20-100x speedup)#
Bad (Python loop):
# 1M samples: ~2-5 seconds
samples = [rng.normal() for _ in range(1_000_000)]
Good (vectorized):
# 1M samples: ~10-50 ms (100x faster)
samples = rng.normal(size=1_000_000)
Variance Reduction#
Antithetic variates (reduce variance by 2x):
def monte_carlo_antithetic(n_samples, seed=42):
rng = np.random.default_rng(seed=seed)
# Generate half the samples
Z1 = rng.standard_normal(size=n_samples // 2)
Z2 = -Z1 # Antithetic variates
Z = np.concatenate([Z1, Z2])
# ... rest of simulation
Control variates: Use an analytic solution to a similar problem to reduce variance
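A minimal control-variate sketch, estimating E[e^Z] for Z ~ N(0,1) (true value e^0.5) with Z itself as the control, since its mean, 0, is known exactly:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000
Z = rng.standard_normal(size=n)
Y = np.exp(Z)  # Target: E[exp(Z)] = exp(0.5)

plain = np.mean(Y)  # Plain Monte Carlo estimate

# Control variate: subtract b*(Z - E[Z]) with the variance-minimizing
# coefficient b* = Cov(Y, Z) / Var(Z); E[Z] = 0 is known analytically.
b = np.cov(Y, Z)[0, 1] / np.var(Z)
cv = np.mean(Y - b * Z)

true = np.exp(0.5)
print(f"Plain MC:        {plain:.5f} (error {abs(plain - true):.5f})")
print(f"Control variate: {cv:.5f} (error {abs(cv - true):.5f})")
```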
Pitfalls and Solutions#
Pitfall 1: Not Setting Seed#
Problem: Results differ every run (can’t debug)
Fix:
rng = np.random.default_rng(seed=42) # Explicit seed
Pitfall 2: Using random Module#
Problem: 40-100x slower, no vectorization
Fix: Always use NumPy for Monte Carlo (never random module)
Pitfall 3: Global State (np.random.seed())#
Problem: Thread-unsafe, affects other code
Fix: Use generator objects (default_rng())
Pitfall 4: Small Sample Size#
Problem: High variance (inaccurate estimates)
Rule of thumb: N = 10,000 minimum (se ∝ 1/√N)
Example: 1M samples → se = 0.001 (3 decimal places)
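The 1/√N scaling is easy to verify directly (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Standard error of the mean shrinks as 1/sqrt(N):
# 100x more samples buys roughly one extra decimal place.
for n in (10_000, 1_000_000):
    samples = rng.standard_normal(size=n)
    se = np.std(samples) / np.sqrt(n)
    print(f"N={n:>9,}: se ≈ {se:.5f}")
```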
Validation#
Compare to Analytic Solution#
from scipy.stats import norm
# Black-Scholes analytic formula
def bs_call(S0, K, T, r, sigma):
d1 = (np.log(S0/K) + (r + 0.5*sigma**2)*T) / (sigma*np.sqrt(T))
d2 = d1 - sigma*np.sqrt(T)
call = S0 * norm.cdf(d1) - K*np.exp(-r*T) * norm.cdf(d2)
return call
# Compare
mc_price, se = european_call_mc(S0=100, K=105, T=1, r=0.05, sigma=0.2, n_paths=1_000_000)
bs_price = bs_call(S0=100, K=105, T=1, r=0.05, sigma=0.2)
print(f"Monte Carlo: ${mc_price:.4f} ± ${se:.4f}")
print(f"Black-Scholes: ${bs_price:.4f}")
print(f"Error: ${abs(mc_price - bs_price):.4f}")
Expected: Error < 3×se (95% confidence)
Alternatives#
When to Consider Alternatives#
Quasi-Monte Carlo (Sobol sequences):
- When: Low-dimensional (<10D), smooth integrand
- Library: scipy.stats.qmc.Sobol
- Advantage: Near O(1/N) convergence (vs O(1/√N) for plain MC)
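A minimal Sobol sketch via scipy.stats.qmc (available in SciPy ≥ 1.7), reusing the earlier π estimate so plain-MC and quasi-MC results can be compared directly:

```python
import numpy as np
from scipy.stats import qmc

# Scrambled Sobol points in [0, 1)^2; random_base2(m) returns 2^m
# points, which preserves the sequence's balance properties.
sampler = qmc.Sobol(d=2, scramble=True, seed=42)
points = sampler.random_base2(m=14)  # 16,384 points

# Quasi-Monte Carlo estimate of pi via the quarter-circle area
inside = (points[:, 0]**2 + points[:, 1]**2) <= 1
pi_qmc = 4 * np.mean(inside)
print(f"π ≈ {pi_qmc:.5f}")
```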
GPU Acceleration (CuPy, JAX):
- When: >10M samples, GPU available
- Speedup: 10-100x (depends on problem)
Never: Use secrets module (100x slower, can’t reproduce)
Further Reading#
Books:
- “Monte Carlo Methods in Financial Engineering” (Glasserman, 2003)
- “Numerical Recipes” (Press et al.) - Monte Carlo chapter
Papers:
- “Variance Reduction Techniques” (Owen, 2013)
NumPy docs:
- Random sampling: https://numpy.org/doc/stable/reference/random/index.html
S3 Recommendation: Use Case Matching#
Decision Matrix by Use Case#
| Use Case | Recommended RNG | Why | Code Pattern |
|---|---|---|---|
| Monte Carlo Simulation | numpy.random.default_rng() | Fast, vectorized, reproducible | Vectorized sampling |
| Machine Learning | numpy.random.default_rng() + seed | Reproducible experiments | Log seed in MLflow/W&B |
| Security Tokens | secrets.token_urlsafe(32) | Unpredictable (OS entropy) | Never use PRNGs |
| Game Development | numpy.random or random | Fast, good quality | Procedural generation |
| Statistical Bootstrap | numpy.random.default_rng() | Fast resampling | scipy.stats integration |
| Procedural Content | numpy.random.default_rng() | Reproducible, fast | Seed = world seed |
Use Case Deep Dives#
Monte Carlo Simulation#
Best Choice: numpy.random.default_rng() (PCG64)
Why:
- Fast vectorized sampling (10M samples in 10-50 ms)
- Reproducible (explicit seed control)
- High statistical quality (passes BigCrush)
- SciPy integration (statistical distributions)
Example:
import numpy as np
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1_000_000)
When to deviate: Never (NumPy is standard for Monte Carlo)
Machine Learning#
Best Choice: numpy.random.default_rng() + explicit seed + logging
Why:
- Reproducible experiments (critical for ML research)
- Fast initialization (10M weights in ms)
- PyTorch/TensorFlow use NumPy conventions
Example:
import numpy as np
import mlflow
seed = 42
rng = np.random.default_rng(seed=seed)
mlflow.log_param("random_seed", seed)
# Initialize weights
weights = rng.normal(0, 0.01, size=(1000, 1000))
When to deviate: GPU training (use PyTorch/TensorFlow RNG for on-device generation)
Security Tokens / API Keys#
Best Choice: secrets.token_urlsafe(32) (OS entropy)
Why:
- Unpredictable (no seed-based attacks)
- Cryptographically secure (OS entropy pool)
- Never use PRNGs (predictable = security breach)
Example:
import secrets
# API token (256 bits)
api_key = secrets.token_urlsafe(32)
# CSRF token
csrf_token = secrets.token_hex(16)
When to deviate: Never (security requires secrets)
Game Development#
Best Choice: numpy.random.default_rng() (fast) or random (simple)
Why:
- Fast procedural generation (vectorized sampling)
- Reproducible (world seed = RNG seed)
- Good enough quality (game play doesn’t require BigCrush)
Example:
import numpy as np
world_seed = 12345
rng = np.random.default_rng(seed=world_seed)
# Generate terrain
terrain = rng.integers(0, 10, size=(1000, 1000))
# Loot drop (75% common, 20% rare, 5% epic)
loot = rng.choice(['common', 'rare', 'epic'], p=[0.75, 0.20, 0.05])
When to deviate: Simple AI behavior (use random.choice() for simplicity)
Statistical Bootstrap#
Best Choice: numpy.random.default_rng() + scipy.stats
Why:
- Fast resampling (1M bootstrap samples in seconds)
- Statistical distributions (scipy.stats integration)
- Reproducible (explicit seed)
Example:
import numpy as np
from scipy import stats
rng = np.random.default_rng(seed=42)
# Bootstrap confidence interval
def bootstrap_ci(data, n_bootstrap=10000, ci=95):
means = []
for _ in range(n_bootstrap):
sample = rng.choice(data, size=len(data), replace=True)
means.append(np.mean(sample))
lower = np.percentile(means, (100-ci)/2)
upper = np.percentile(means, 100-(100-ci)/2)
return lower, upper
data = [1, 2, 3, 4, 5]
ci = bootstrap_ci(data)
When to deviate: Never (NumPy is standard for bootstrap)
Anti-Patterns to Avoid#
❌ Using PRNGs for Security#
Bad:
import random
import string
token = ''.join(random.choices(string.ascii_letters, k=32))
Why bad: Predictable with known seed (security breach)
Fix:
import secrets
token = secrets.token_urlsafe(32)
❌ Using secrets for Simulations#
Bad:
import secrets
samples = [secrets.randbits(64) for _ in range(1_000_000)]
Why bad: 100x slower, can’t reproduce results
Fix:
import numpy as np
rng = np.random.default_rng(seed=42)
samples = rng.integers(0, 2**64, size=1_000_000, dtype=np.uint64)
❌ Python Loops for Large Samples#
Bad:
rng = np.random.default_rng()
samples = [rng.random() for _ in range(1_000_000)]
Why bad: 40-100x slower than vectorized
Fix:
samples = rng.random(size=1_000_000)
❌ Simple Seed Offsets for Parallel Workers#
Bad:
# Workers use seed, seed+1, seed+2, ...
rngs = [np.random.default_rng(seed=42+i) for i in range(10)]
Why bad: Can cause correlation between streams
Fix:
from numpy.random import SeedSequence
ss = SeedSequence(42)
child_seeds = ss.spawn(10)
rngs = [np.random.default_rng(s) for s in child_seeds]
Migration Patterns#
From random to numpy.random#
Before (slow):
import random
random.seed(42)
samples = [random.gauss(0, 1) for _ in range(1_000_000)]
After (20-100x faster):
import numpy as np
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1_000_000)
From Global State to Generators#
Before (thread-unsafe):
np.random.seed(42)
samples = np.random.randn(1000)
After (thread-safe):
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal(size=1000)
From MT19937 to PCG64#
Before (legacy):
rng = np.random.RandomState(seed=42) # MT19937
After (2x faster):
rng = np.random.default_rng(seed=42) # PCG64
Next: S4 Strategic Selection#
Production deployment patterns, testing strategies, monitoring, and long-term maintenance