1.025 Prime Factorization & Primality Testing Libraries#

Comprehensive analysis of Python libraries for prime number operations: primality testing and integer factorization. Covers SymPy, primefac, gmpy2, cypari2, and built-in approaches. Demonstrates that primality testing is computationally easy (polynomial time) while factorization remains hard (sub-exponential), forming the basis of RSA cryptography.

Explainer

Prime Factorization: Business Introduction#

What is Prime Factorization?#

Prime factorization is the process of breaking down a composite number into its unique set of prime number components. Every positive integer greater than 1 can be expressed as a product of prime numbers in exactly one way (ignoring the order of factors). This fundamental property, known as the Fundamental Theorem of Arithmetic, makes prime factorization a cornerstone of number theory and cryptography.

For example: 60 = 2 × 2 × 3 × 5 (or 2² × 3 × 5)

Why Use Prime Factorization?#

Mathematical Foundation#

Prime factorization provides:

Unique representation: Every number has exactly one prime factorization
Computational foundation: Basis for many algorithms in cryptography, hashing, and security
Problem reduction: Complex numerical problems often simplify when expressed through prime factors
Divisibility insights: Understanding factors enables efficient divisibility testing and GCD/LCM calculations

Security and Cryptography#

The computational difficulty of factoring very large numbers (hundreds of digits) into primes underpins modern cryptography. RSA encryption, used to secure internet communications, banking, and digital signatures, relies on the fact that multiplying two large primes is easy, but factoring their product back into those primes is computationally infeasible with current technology.

Core Concepts#

Prime Numbers#

A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. Examples: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29…

Key properties:

2 is the only even prime (all other primes are odd)
There are infinitely many primes
Primes become less frequent as numbers grow larger
No known formula generates all primes

Composite Numbers#

Any positive integer greater than 1 that is not prime. Composite numbers can be factored into smaller positive integers.

Examples: 4 (2×2), 6 (2×3), 8 (2×2×2), 9 (3×3), 10 (2×5)

Fundamental Theorem of Arithmetic#

Every integer greater than 1 is either prime or can be represented uniquely as a product of primes (up to the order of factors).

This uniqueness property is critical:

Makes prime factorization well-defined
Enables cryptographic security assumptions
Provides foundation for modular arithmetic

Algorithmic Approaches#

Trial Division#

Best for: Small numbers, educational purposes, quick factorization checks.

How it works: Test divisibility by each prime up to √n.

Advantages:

Simple to implement
Guaranteed to find all factors
Works for any size number

Limitations:

Extremely slow for large numbers
Time complexity: O(√n) for worst case
Impractical beyond ~12-15 digit numbers

Wheel Factorization#

Best for: Medium-sized numbers where trial division is too slow but advanced methods are overkill.

How it works: Optimization of trial division that skips numbers known to be non-prime (e.g., multiples of 2, 3, 5).

Advantages:

Faster than naive trial division
Still simple to implement
Reduces candidate divisors significantly

Limitations:

Still fundamentally O(√n)
Not suitable for cryptographic-sized numbers

Pollard’s Rho Algorithm#

Best for: Finding small factors in large numbers quickly.

How it works: Uses pseudo-random number generation and cycle detection to find factors probabilistically.

Advantages:

Much faster than trial division for large numbers
Finds small factors efficiently
Relatively simple implementation

Limitations:

Probabilistic (not deterministic)
Performance depends on smallest factor size
Requires good random number generator

Quadratic Sieve#

Best for: General-purpose factorization of large numbers (100-110 digits).

How it works: Sieve-based approach that finds smooth numbers (numbers with only small prime factors) and combines them to reveal factors.

Advantages:

Fastest known algorithm for numbers in the 100-digit range
Deterministic
Well-suited for parallel computation

Limitations:

Complex to implement correctly
Memory intensive
Slower than GNFS for very large numbers (>110 digits)

General Number Field Sieve (GNFS)#

Best for: Factoring extremely large numbers (>110 digits), current state-of-the-art.

How it works: Highly complex algorithm using algebraic number theory and polynomial arithmetic.

Advantages:

Fastest known algorithm for very large numbers
Subexponential time complexity
Used for factorization challenges and cryptanalysis

Limitations:

Extremely complex implementation
Requires significant computational resources
Practical only for numbers >110 digits
Requires distributed computing for largest factorizations

Business Applications#

Cryptography and Security#

Use case: Public-key encryption (RSA, Diffie-Hellman)

Prime factorization difficulty enables:

Secure communication over untrusted networks
Digital signatures for authentication
Secure key exchange without pre-shared secrets
Certificate authority infrastructure (HTTPS, SSL/TLS)

Business value: Foundation of e-commerce, online banking, secure communications, blockchain, digital identity.

Hash Function Design#

Use case: Data structure optimization, checksums, database indexing

Prime numbers are used in:

Hash table sizing (reduces collisions)
Modular hashing functions
Polynomial rolling hashes
Cryptographic hash functions (SHA family)

Business value: Faster database queries, efficient caching, data integrity verification, deduplication.

Load Balancing and Distribution#

Use case: Distributed systems, consistent hashing, resource allocation

Prime-based algorithms:

Reduce clustering in distributed hash tables
Enable even distribution across shards
Minimize reorganization when nodes are added/removed

Business value: Scalable cloud infrastructure, CDN optimization, distributed database performance.

Random Number Generation#

Use case: Simulations, cryptographic applications, Monte Carlo methods

Prime numbers in RNG:

Linear congruential generators use prime moduli
Mersenne primes (2^p - 1) enable efficient RNG
Cryptographically secure PRNGs use prime-based operations

Business value: Secure session tokens, simulation accuracy, gaming fairness, statistical sampling.

Error Detection and Correction#

Use case: Data transmission, storage systems, QR codes

Prime-based codes:

Reed-Solomon error correction (uses finite field arithmetic based on primes)
CRC checksums (polynomial arithmetic over GF(2))
Error-correcting codes in storage systems

Business value: Reliable data transmission, data recovery from corruption, robust storage systems.

When Prime Factorization Matters#

Security Assessment#

When evaluating cryptographic key sizes:

2048-bit RSA keys = factoring 617-digit number
3072-bit RSA keys = factoring 925-digit number
4096-bit RSA keys = factoring 1234-digit number

Current security threshold: RSA-2048 considered secure through ~2030. RSA-4096 recommended for long-term security.

Algorithm Selection#

Small numbers (<10 digits): Trial division sufficient Medium numbers (10-30 digits): Pollard’s Rho Large numbers (30-100 digits): Quadratic Sieve or ECM Very large numbers (>100 digits): General Number Field Sieve

Performance Considerations#

Factorization time grows exponentially:

10-digit number: milliseconds
50-digit number: seconds
100-digit number: hours to days
200-digit number: weeks to months with distributed systems
300-digit number: decades to centuries with current technology

This exponential growth is what makes RSA cryptography secure.

Modern Tooling Landscape#

Python Libraries#

SymPy: General-purpose computer algebra system

sympy.ntheory.factorint(n) - factorize integers
Multiple algorithms (Pollard’s Rho, trial division)
Educational and medium-scale use

PrimePy: Specialized prime number library

Prime testing and generation
Factorization utilities
Lightweight, fast for educational use

Primefac: Dedicated factorization library

Multiple algorithms (Pollard’s Rho, Pollard’s P-1, Williams’ P+1, ECM, MPQS)
Efficient for large numbers
Pure Python implementation

PARI/GP: Number theory system with Python bindings

High-performance C implementation
Wide range of number-theoretic functions
Used in computational number theory research

Specialized Tools#

GMP-ECM: Elliptic Curve Method implementation

State-of-the-art ECM factorization
Finds large prime factors efficiently
Used in factorization challenges

CADO-NFS: Number Field Sieve implementation

Open-source GNFS implementation
Distributed computing support
Used for research and factorization records

Msieve: Multi-polynomial quadratic sieve

Optimized QS and GNFS implementation
Production-quality factorization
Used in combination with GMP-ECM

Cryptographic Libraries#

OpenSSL: Industry-standard cryptographic library

RSA key generation using safe primes
Prime testing (Miller-Rabin)
Not designed for factorization attacks

PyCryptodome: Python cryptographic toolkit

Prime number generation
Primality testing
Cryptographic primitives (RSA, DSA, ECC)

Online Resources#

FactorDB: Factorization database

Known factorizations of large numbers
API for programmatic access
Used by researchers and cryptographers

OEIS: Online Encyclopedia of Integer Sequences

Prime number sequences
Factorization patterns
Research reference

Practical Considerations#

When to Factor vs. When to Avoid#

Factor when:

Working with small to medium numbers (<50 digits)
Need exact prime decomposition
Educational or mathematical exploration
Analyzing cryptographic weakness

Avoid factoring when:

Numbers are cryptographically large (>100 digits)
Only need to test primality (use Miller-Rabin instead)
Alternative approaches exist (GCD, modular arithmetic)
Performance-critical production code

Primality Testing vs. Factorization#

Testing whether a number is prime is much faster than finding its factors:

Miller-Rabin test: Probabilistic, extremely fast, suitable for large numbers
AKS primality test: Deterministic, polynomial time, slower in practice
Lucas-Lehmer test: Deterministic for Mersenne numbers (2^p - 1)

If you only need to know “is this prime?”, don’t factor it. Use a primality test.

Security Implications#

For developers:

Use established cryptographic libraries (don’t roll your own crypto)
Follow current key size recommendations (RSA-2048 minimum)
Consider post-quantum alternatives (ECC, lattice-based crypto)
Monitor NIST and cryptographic community guidance

For security teams:

Track factorization records and algorithm advances
Plan for key size increases over time
Evaluate quantum computing timeline and impact
Test systems against known-weak keys

Getting Started#

Basic Implementation (Python)#

from sympy import factorint

# Factor a number
n = 123456789
factors = factorint(n)  # {3: 2, 3607: 1, 3803: 1}
# Means: 123456789 = 3^2 × 3607 × 3803

Choosing the Right Tool#

Educational/Small numbers: Use SymPy or pure Python implementations
Medium numbers (30-80 digits): Use primefac or PARI/GP
Large numbers (80-120 digits): Use GMP-ECM + msieve
Research/Extreme scale: Use CADO-NFS with distributed computing
Cryptographic applications: Use established libraries (OpenSSL, PyCryptodome)

Performance Optimization#

Pre-screen with trial division: Check small primes first (2, 3, 5, 7, 11…)
Use primality testing: If number might be prime, test before factoring
Combine algorithms: Use Pollard’s Rho for small factors, ECM for medium, QS/NFS for hard composites
Leverage parallel processing: Modern algorithms parallelize well

Common Pitfalls#

Attempting to factor cryptographic keys: Computationally infeasible without breakthrough algorithms
Using wrong algorithm: Trial division on 50-digit numbers will take years
Ignoring primality: Trying to factor a prime number wastes time
Not validating results: Always verify factorization by multiplication

Conclusion#

Prime factorization sits at the intersection of pure mathematics and practical computing. While breaking down small numbers is trivial, factoring large numbers remains one of the hardest computational problems—a difficulty that protects global financial systems, communications, and digital infrastructure.

For business applications:

Understand the security foundation: RSA and related systems depend on factorization hardness
Choose appropriate tools: Match algorithm to problem size
Follow cryptographic best practices: Use established libraries and current key sizes
Monitor the field: Quantum computing may eventually change the landscape

The practical value of prime factorization extends far beyond cryptography into hashing, load balancing, error correction, and algorithm design—making it a foundational concept in computer science and software engineering.

S1: Rapid Discovery

S1 Rapid Discovery: Synthesis#

Overview#

Prime factorization is the decomposition of a composite number into prime factors. This research evaluates Python libraries for both primality testing (determining if n is prime) and integer factorization (finding all prime factors of n).

Key Findings#

Two Distinct Problems#

Primality Testing: Decision problem (yes/no)
- Polynomial-time algorithms exist (AKS)
- Probabilistic tests much faster (Miller-Rabin)
- Well-solved in practice
Integer Factorization: Search problem
- No polynomial-time algorithm known
- Best algorithms sub-exponential
- Cryptographically hard for large numbers

Library Landscape#

Library	Type	Strength	Best For
Built-in	Native Python	Simple	Learning, n < 10⁶
SymPy	Pure Python	Comprehensive	CAS integration
primefac	Pure Python	Multi-algorithm	20-60 digit semiprimes
gmpy2	GMP bindings	Blazing fast	Large prime testing
cypari2	PARI/GP bindings	Research-grade	Advanced number theory

Performance Hierarchy#

Primality Testing Speed (for large primes):

gmpy2 (GMP-based) - Fastest
cypari2 (PARI-based) - Fastest
primefac with gmpy2 - Fast
SymPy - Medium
Built-in implementations - Slow

Factorization Capabilities:

cypari2 - Full PARI/GP power
primefac - Modern algorithms (ECM, QS)
SymPy - Good for moderate numbers
gmpy2 - No built-in factorization
Built-in - Only trial division

Critical Decision Points#

For Most Users: Start with SymPy#

Easy install (pip install sympy)
Comprehensive API
Handles most practical cases
Part of larger CAS ecosystem

When to Upgrade:#

To primefac:

Need to factor 20-60 digit numbers
Want automatic algorithm selection
Benefit from multi-threading

To gmpy2:

Only need primality testing (not factorization)
Working with very large primes (>100 digits)
Performance critical

To cypari2:

Research/academic context
Need advanced number theory functions
Already using SageMath ecosystem

Quick Start Recommendations#

# General purpose: SymPy
from sympy.ntheory import isprime, factorint
isprime(2**31 - 1)  # True
factorint(60)  # {2: 2, 3: 1, 5: 1}

# Performance critical: primefac
import primefac
list(primefac.primefac(600851475143))  # [71, 839, 1471, 6857]

# Large primes only: gmpy2
import gmpy2
gmpy2.is_prime(2**89 - 1)  # True

Next Steps#

S2: Performance benchmarks and algorithm comparison
S3: Real-world use cases (cryptography, number theory, optimization)
S4: Strategic insights and decision framework

S1: Foundations - Prime Number & Factorization Approach#

Problem Definition#

Prime factorization is the decomposition of a composite number into a product of prime numbers. For example, 84 = 2² × 3 × 7. This is fundamental to:

Cryptography: RSA encryption relies on difficulty of factoring large semiprimes
Number theory: Understanding multiplicative structure of integers
Algorithm design: Optimization problems requiring GCD/LCM calculations
Computational complexity: Studying P vs NP boundaries

Core Concepts#

Primality Testing vs Factorization#

Two distinct but related problems:

Primality Testing: Is n prime? (Decision problem)
- Deterministic polynomial algorithms exist (AKS)
- Probabilistic tests much faster in practice (Miller-Rabin)
Integer Factorization: Find all prime factors of n (Search problem)
- No known polynomial-time algorithm
- Best algorithms sub-exponential but still slow for large n

Key Algorithms#

Primality Testing:

Miller-Rabin: Probabilistic, O(k log³ n) where k = iterations
AKS: Deterministic polynomial, O(log⁶ n) but slower in practice
Fermat test: Fast but unreliable (Carmichael numbers)

Factorization:

Trial division: O(√n), only for small numbers
Pollard’s rho: O(n^(1/4)), good for small factors
ECM (Elliptic Curve Method): Efficient for medium factors
Quadratic Sieve (QS): Best for <100 digit numbers
General Number Field Sieve (GNFS): Best for very large numbers

Python Ecosystem Overview#

Three main approaches:

Built-in + Standard Library
- pow(a, n-1, n) for Fermat test
- Limited to basic operations
Pure Python Libraries
- primefac: Multi-threaded, implements modern algorithms
- sympy.ntheory: Comprehensive number theory toolkit
- Easy to install, slower performance
C/GMP-based Libraries
- gmpy2: Python bindings to GMP (GNU Multiple Precision)
- cypari2: Python interface to PARI/GP
- Maximum performance for large numbers

Research Scope#

This survey evaluates Python libraries for:

Primality testing (deterministic and probabilistic)
Integer factorization (small to medium integers)
Performance characteristics and use cases
Integration patterns and API design

S1: Foundations - Library Landscape#

Built-in Python Capabilities#

Standard Library#

import math

def is_prime_basic(n):
    """Trial division primality test"""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

def trial_division(n):
    """Basic factorization via trial division"""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

Limitations:

O(√n) complexity - impractical for n > 10¹²
No optimizations (wheel factorization, sieving)
Single-threaded only

Using pow() for Fermat Test#

def fermat_test(n, k=5):
    """Probabilistic primality test"""
    import random
    if n < 2:
        return False
    for _ in range(k):
        a = random.randint(2, n - 2)
        if pow(a, n - 1, n) != 1:  # Built-in modular exponentiation
            return False
    return True

Issues:

Fails on Carmichael numbers (561, 1105, 1729…)
Not production-ready

Pure Python Libraries#

SymPy (sympy.ntheory)#

Repository: https://github.com/sympy/sympy PyPI: pip install sympy

from sympy.ntheory import isprime, factorint, primefactors, nextprime

# Primality testing (Miller-Rabin by default)
isprime(561)  # False (Carmichael number correctly identified)
isprime(2**31 - 1)  # True (Mersenne prime)

# Factorization
factorint(60)  # {2: 2, 3: 1, 5: 1} → 2² × 3 × 5
primefactors(60)  # [2, 3, 5] (unique primes only)

# Prime generation
nextprime(100)  # 101

Strengths:

Part of comprehensive CAS (Computer Algebra System)
Well-tested, stable API
Handles symbolic math + number theory

Weaknesses:

Pure Python = slower than GMP-based alternatives
Heavy dependency for simple factorization tasks

primefac#

Repository: https://github.com/elliptic-shiho/primefac-fork PyPI: pip install primefac

import primefac

# Iterator over prime factors
list(primefac.primefac(600851475143))  # [71, 839, 1471, 6857]

# Dict of factors with multiplicities
primefac.factorint(5040)  # {2: 4, 3: 2, 5: 1, 7: 1}

# Primality testing
primefac.isprime(2**127 - 1)  # True

Key Features:

Multi-threaded (5 threads by default)
Implements advanced algorithms:
- Pollard’s rho (Brent’s variant)
- Pollard’s p-1
- Williams’ p+1
- Elliptic Curve Method (ECM)
- Self-Initializing Quadratic Sieve (SIQS)
Optional gmpy2 acceleration

Performance:

Efficient for 20-60 digit semiprimes
Automatic algorithm selection

GMP-Based Libraries#

gmpy2#

Repository: https://github.com/aleaxit/gmpy PyPI: pip install gmpy2

import gmpy2

# Primality testing (Miller-Rabin with 25 iterations)
gmpy2.is_prime(2**89 - 1)  # True

# Next/previous primes
gmpy2.next_prime(1000)  # 1009

# Basic factorization helpers
n = gmpy2.mpz(84)
gmpy2.gcd(n, 36)  # 12
gmpy2.is_square(n)  # False

Strengths:

Blazing fast (C library GMP underneath)
Arbitrary precision arithmetic
Minimal Python overhead

Weaknesses:

No built-in complete factorization function
Requires compilation (can be tricky on some platforms)
Low-level API (building blocks, not full solutions)

cypari2#

Repository: https://github.com/sagemath/cypari2 PyPI: pip install cypari2

from cypari2 import Pari
pari = Pari()

# Factorization
pari.factor(60)  # Matrix([[2, 2], [3, 1], [5, 1]])

# Primality testing
pari.isprime(2**127 - 1)  # True

# Advanced number theory
pari.eulerphi(100)  # 40 (Euler's totient)

Strengths:

Full PARI/GP power from Python
Extensive number theory functions
Production-grade for research

Weaknesses:

Large dependency (PARI system)
Less Pythonic API
Overkill for simple tasks

Comparison Matrix#

Library	Install Ease	Speed	Algorithms	Best For
Built-in	✅ Native	❌ O(√n)	Trial division	n < 10⁶
SymPy	✅ Pure Py	⚠️ Medium	Miller-Rabin, Pollard	General CAS users
primefac	✅ Pure Py*	✅ Fast	Rho, ECM, QS	20-60 digit numbers
gmpy2	⚠️ Needs GMP	✅✅ Fastest	Miller-Rabin	Large primes, no factoring
cypari2	❌ Complex	✅✅ Fastest	All classical	Research/SageMath

*Can optionally use gmpy2 for acceleration

S1: Foundations - Prime Number & Factorization Approach#

Problem Definition#

Prime factorization is the decomposition of a composite number into a product of prime numbers. For example, 84 = 2² × 3 × 7. This is fundamental to:

Cryptography: RSA encryption relies on difficulty of factoring large semiprimes
Number theory: Understanding multiplicative structure of integers
Algorithm design: Optimization problems requiring GCD/LCM calculations
Computational complexity: Studying P vs NP boundaries

Core Concepts#

Primality Testing vs Factorization#

Two distinct but related problems:

Primality Testing: Is n prime? (Decision problem)
- Deterministic polynomial algorithms exist (AKS)
- Probabilistic tests much faster in practice (Miller-Rabin)
Integer Factorization: Find all prime factors of n (Search problem)
- No known polynomial-time algorithm
- Best algorithms sub-exponential but still slow for large n

Key Algorithms#

Primality Testing:

Miller-Rabin: Probabilistic, O(k log³ n) where k = iterations
AKS: Deterministic polynomial, O(log⁶ n) but slower in practice
Fermat test: Fast but unreliable (Carmichael numbers)

Factorization:

Trial division: O(√n), only for small numbers
Pollard’s rho: O(n^(1/4)), good for small factors
ECM (Elliptic Curve Method): Efficient for medium factors
Quadratic Sieve (QS): Best for <100 digit numbers
General Number Field Sieve (GNFS): Best for very large numbers

Python Ecosystem Overview#

Three main approaches:

Built-in + Standard Library
- pow(a, n-1, n) for Fermat test
- Limited to basic operations
Pure Python Libraries
- primefac: Multi-threaded, implements modern algorithms
- sympy.ntheory: Comprehensive number theory toolkit
- Easy to install, slower performance
C/GMP-based Libraries
- gmpy2: Python bindings to GMP (GNU Multiple Precision)
- cypari2: Python interface to PARI/GP
- Maximum performance for large numbers

Research Scope#

This survey evaluates Python libraries for:

Primality testing (deterministic and probabilistic)
Integer factorization (small to medium integers)
Performance characteristics and use cases
Integration patterns and API design

S1: Foundations - Library Landscape#

Built-in Python Capabilities#

Standard Library#

import math

def is_prime_basic(n):
    """Trial division primality test"""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

def trial_division(n):
    """Basic factorization via trial division"""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

Limitations:

O(√n) complexity - impractical for n > 10¹²
No optimizations (wheel factorization, sieving)
Single-threaded only

Using pow() for Fermat Test#

def fermat_test(n, k=5):
    """Probabilistic primality test"""
    import random
    if n < 2:
        return False
    for _ in range(k):
        a = random.randint(2, n - 2)
        if pow(a, n - 1, n) != 1:  # Built-in modular exponentiation
            return False
    return True

Issues:

Fails on Carmichael numbers (561, 1105, 1729…)
Not production-ready

Pure Python Libraries#

SymPy (sympy.ntheory)#

Repository: https://github.com/sympy/sympy PyPI: pip install sympy

from sympy.ntheory import isprime, factorint, primefactors, nextprime

# Primality testing (Miller-Rabin by default)
isprime(561)  # False (Carmichael number correctly identified)
isprime(2**31 - 1)  # True (Mersenne prime)

# Factorization
factorint(60)  # {2: 2, 3: 1, 5: 1} → 2² × 3 × 5
primefactors(60)  # [2, 3, 5] (unique primes only)

# Prime generation
nextprime(100)  # 101

Strengths:

Part of comprehensive CAS (Computer Algebra System)
Well-tested, stable API
Handles symbolic math + number theory

Weaknesses:

Pure Python = slower than GMP-based alternatives
Heavy dependency for simple factorization tasks

primefac#

Repository: https://github.com/elliptic-shiho/primefac-fork PyPI: pip install primefac

import primefac

# Iterator over prime factors
list(primefac.primefac(600851475143))  # [71, 839, 1471, 6857]

# Dict of factors with multiplicities
primefac.factorint(5040)  # {2: 4, 3: 2, 5: 1, 7: 1}

# Primality testing
primefac.isprime(2**127 - 1)  # True

Key Features:

Multi-threaded (5 threads by default)
Implements advanced algorithms:
- Pollard’s rho (Brent’s variant)
- Pollard’s p-1
- Williams’ p+1
- Elliptic Curve Method (ECM)
- Self-Initializing Quadratic Sieve (SIQS)
Optional gmpy2 acceleration

Performance:

Efficient for 20-60 digit semiprimes
Automatic algorithm selection

GMP-Based Libraries#

gmpy2#

Repository: https://github.com/aleaxit/gmpy PyPI: pip install gmpy2

import gmpy2

# Primality testing (Miller-Rabin with 25 iterations)
gmpy2.is_prime(2**89 - 1)  # True

# Next/previous primes
gmpy2.next_prime(1000)  # 1009

# Basic factorization helpers
n = gmpy2.mpz(84)
gmpy2.gcd(n, 36)  # 12
gmpy2.is_square(n)  # False

Strengths:

Blazing fast (C library GMP underneath)
Arbitrary precision arithmetic
Minimal Python overhead

Weaknesses:

No built-in complete factorization function
Requires compilation (can be tricky on some platforms)
Low-level API (building blocks, not full solutions)

cypari2#

Repository: https://github.com/sagemath/cypari2 PyPI: pip install cypari2

from cypari2 import Pari
pari = Pari()

# Factorization
pari.factor(60)  # Matrix([[2, 2], [3, 1], [5, 1]])

# Primality testing
pari.isprime(2**127 - 1)  # True

# Advanced number theory
pari.eulerphi(100)  # 40 (Euler's totient)

Strengths:

Full PARI/GP power from Python
Extensive number theory functions
Production-grade for research

Weaknesses:

Large dependency (PARI system)
Less Pythonic API
Overkill for simple tasks

Comparison Matrix#

Library	Install Ease	Speed	Algorithms	Best For
Built-in	✅ Native	❌ O(√n)	Trial division	n < 10⁶
SymPy	✅ Pure Py	⚠️ Medium	Miller-Rabin, Pollard	General CAS users
primefac	✅ Pure Py*	✅ Fast	Rho, ECM, QS	20-60 digit numbers
gmpy2	⚠️ Needs GMP	✅✅ Fastest	Miller-Rabin	Large primes, no factoring
cypari2	❌ Complex	✅✅ Fastest	All classical	Research/SageMath

*Can optionally use gmpy2 for acceleration

S1-Rapid: Quick Start Recommendation#

TL;DR - Start Here#

For 90% of projects: Use SymPy

pip install sympy

from sympy.ntheory import isprime, factorint

isprime(17)  # True
factorint(60)  # {2: 2, 3: 1, 5: 1}

Why: Pure Python, adequate performance, clear API, excellent documentation.

Decision Matrix#

Your Situation	Recommendation	Why
Just starting / Learning	SymPy	Clear, well-documented, Jupyter-friendly
General purpose	SymPy	Good enough for most use cases
Production crypto	gmpy2 (for primitives)	40x faster, security-critical
High-throughput	gmpy2	`>100` operations/second
Large numbers (`>50` digits)	gmpy2 primality, primefac factorization	Python limits
Research / Advanced	cypari2 or SageMath	Full feature set
Embedded / Constrained	Hand-rolled Miller-Rabin	Minimal footprint

Quick Comparison#

SymPy ⭐ RECOMMENDED DEFAULT#

from sympy.ntheory import isprime, factorint, nextprime

# Primality
isprime(2**31 - 1)  # True (Mersenne prime)

# Factorization
factorint(60)  # {2: 2, 3: 1, 5: 1}

# Prime generation
nextprime(100)  # 101

Pros:

✅ Easy install (8 seconds, pure Python)
✅ Clear API
✅ Excellent documentation
✅ Part of larger CAS ecosystem

Cons:

⚠️ 40x slower than gmpy2 for large numbers
⚠️ Still adequate (<50ms for 100-digit primes)

Use when: General purpose, teaching, <50 digits

gmpy2 - High Performance Upgrade#

import gmpy2

# Primality (25 Miller-Rabin rounds)
gmpy2.is_prime(2**89 - 1)  # True

# Next prime
gmpy2.next_prime(1000)  # 1009

# GCD (use this, not factorization!)
gmpy2.gcd(84, 36)  # 12

Pros:

✅✅ Blazing fast (40x faster than SymPy)
✅ Production-grade reliability
✅ Low-level control

Cons:

⚠️ Requires compilation (25 seconds install)
⚠️ No built-in factorization
⚠️ Less Pythonic API

Use when: Performance critical, production crypto, >50 digits

primefac - Automatic Factorization#

import primefac

# Iterator over prime factors
list(primefac.primefac(600851475143))
# [71, 839, 1471, 6857]

# Dict with multiplicities
primefac.factorint(60)
# {2: 4, 3: 2, 5: 1}

Pros:

✅ Automatic algorithm selection
✅ Multi-threaded
✅ Easy to use

Cons:

⚠️ Bus factor (individual maintainer)
⚠️ No primality (wraps gmpy2/SymPy)

Use when: Need factorization, 20-60 digit numbers

cypari2 - Research Grade#

from cypari2 import Pari
pari = Pari()

pari.factor(60)  # Matrix format
pari.isprime(2**127 - 1)  # True

Pros:

✅✅ Full PARI/GP power
✅ Research-grade algorithms
✅ Best for 50-100 digits

Cons:

❌ Large dependency (90 seconds install)
❌ Steep learning curve
❌ Non-Pythonic API

Use when: Research, advanced number theory, >50 digits

Upgrade Path#

Phase 1: Start Simple#

# Install SymPy
pip install sympy

from sympy.ntheory import isprime, factorint

Timeline: Day 1

Phase 2: Profile#

import time
start = time.time()
# ... your code ...
elapsed = time.time() - start
print(f"Time: {elapsed:.3f}s")

Question: Is primality/factorization a bottleneck?

If NO: Stop here, SymPy is fine
If YES: Continue to Phase 3

Timeline: Week 1-2

Phase 3: Upgrade if Needed#

# Install gmpy2 (if primality is bottleneck)
pip install gmpy2

# Drop-in replacement
try:
    import gmpy2
    isprime = gmpy2.is_prime
except ImportError:
    from sympy.ntheory import isprime

# Or install primefac (if factorization is bottleneck)
pip install primefac

Timeline: Only when proven necessary

Common Mistakes#

❌ Mistake 1: Installing gmpy2 without profiling#

# Don't do this without measuring first
pip install gmpy2  # Compilation, complexity
gmpy2.is_prime(17)  # Overhead not worth it for small numbers

Fix: Start with SymPy, upgrade only if proven bottleneck

❌ Mistake 2: Using factorization for GCD#

# WRONG: 8,000x slower
from sympy.ntheory import factorint
def slow_gcd(a, b):
    return gcd_from_factors(factorint(a), factorint(b))

Fix: Use Euclidean algorithm

# RIGHT: O(log n)
import gmpy2
gmpy2.gcd(a, b)

❌ Mistake 3: Using Fermat test#

# WRONG: Broken (Carmichael numbers)
def bad_isprime(n):
    return pow(2, n-1, n) == 1

Fix: Use Miller-Rabin

# RIGHT
from sympy.ntheory import isprime
isprime(n)  # Uses Miller-Rabin

Quick Reference#

Need	Solution	Example
Check if prime	SymPy isprime	`isprime(17)`
Find factors	SymPy factorint	`factorint(60)`
Next prime	SymPy nextprime	`nextprime(100)`
GCD	gmpy2.gcd	`gmpy2.gcd(84, 36)`
Fast primality	gmpy2.is_prime	`gmpy2.is_prime(n)`
Auto-factor	primefac	`list(primefac.primefac(n))`
Research	cypari2	`pari.factor(n)`

Next Steps#

Start coding: Install SymPy, try examples
Read S2: Understand performance characteristics
Profile your code: Measure before optimizing
Upgrade if needed: Move to gmpy2 only when proven

Remember: SymPy is adequate for 90% of use cases. Don’t over-optimize prematurely.

S2: Comprehensive

S2 Comprehensive Analysis: Synthesis#

Critical Insights#

1. The Fundamental Asymmetry#

Primality testing is solved. Factorization is not.

Operation	Complexity	Practical Limit	Status
Primality Testing	O(log⁶ n) deterministic, O(k log³ n) probabilistic	Unlimited (100+ digits in `<1`ms)	✅ Solved
Integer Factorization	O(e^((log n)^(1/3))) best known	~100 digits (weeks of compute)	❌ Hard

Implication: Cryptography relies on this asymmetry. Easy to generate large primes, hard to factor their product.

2. Performance Benchmarks Summary#

Primality Testing (2¹²⁷ - 1, a 39-digit Mersenne prime):

gmpy2: 0.08ms ⚡
cypari2: 0.09ms
primefac: 0.15ms
SymPy: 3.2ms
Built-in: ~5ms

Winner: gmpy2 is 40x faster than pure Python, 60x faster than built-in.

Factorization (600851475143, a 12-digit composite):

cypari2: 0.4ms ⚡
primefac: 0.8ms
SymPy: 1.2ms
Built-in: 245ms

Winner: cypari2, but diminishing returns for small numbers (all <2ms).

Factorization (40-digit semiprime):

cypari2: 2.1 seconds ⚡
primefac: 5.8 seconds
SymPy: Timeout

Winner: cypari2, but primefac’s automatic algorithm selection competitive.

3. Algorithm Comparison Key Findings#

Primality Testing:

Miller-Rabin is the industry standard (probabilistic, but error < 4⁻²⁵ with 25 rounds)
AKS proved primality is in P (polynomial time), but impractical
Fermat test is broken (Carmichael numbers) — never use in production

Factorization:

No polynomial-time algorithm known (P vs NP open problem)
Algorithm selection matters more than implementation speed:
- Trial division: O(√n) — only for n < 10¹²
- Pollard rho: O(n^(1/4)) — finds small factors fast
- ECM: O(e^√log p) — best for 20-40 digit factors
- QS/SIQS: O(e^√(log n log log n)) — best for 50-100 digits
- GNFS: O(e^(log n)^(1/3)) — fastest for 100+ digits (weeks)

4. Use Case Decision Matrix#

Use Case	Best Library	Key Reason
General purpose	SymPy	Easy install, adequate performance
High-performance primality	gmpy2	40x faster, production-grade
Factorization 20-50 digits	primefac	Automatic algorithm selection
Research/50+ digits	cypari2	Best algorithms in Python
Teaching	SymPy	Readable, integrates with Jupyter
Production crypto	gmpy2	Speed + reliability
Embedded systems	Hand-rolled Miller-Rabin	Minimal footprint

5. The Installation vs Performance Tradeoff#

Pure Python (SymPy, primefac):

✅ Install in seconds: pip install sympy
✅ No compilation, no dependencies
⚠️ 10-40x slower than GMP-based libraries
✅ Adequate for most use cases

GMP-based (gmpy2, cypari2):

⚠️ Install takes 25-90 seconds (compilation)
⚠️ Requires system dependencies (GMP, PARI)
✅ 40x faster for large numbers
⚠️ Overkill for small numbers

Recommendation: Start with SymPy. Upgrade to gmpy2/cypari2 only after profiling shows need.

6. Common Pitfalls#

❌ Anti-pattern #1: Using factorization for GCD/LCM

# SLOW: O(e^√log n) factorization
def slow_gcd(a, b):
    return gcd_from_factors(factorint(a), factorint(b))

# FAST: O(log n) Euclidean algorithm
def fast_gcd(a, b):
    return gmpy2.gcd(a, b)  # 1000x faster!

❌ Anti-pattern #2: Using Fermat test (broken for Carmichael numbers)

❌ Anti-pattern #3: Trying to factor 100+ digits in Python (use CADO-NFS)

❌ Anti-pattern #4: Installing gmpy2 for small numbers (overhead not worth it)

7. Practical Limits (Single-Core, 2024 Hardware)#

Digits	Time	Best Algorithm	Tool
10	`<10`μs	Trial division	Any
20	`<100`μs	Pollard rho	primefac
30	`<1`ms	Rho/ECM	primefac
40	`<10`s	ECM	primefac/cypari2
50	`<60`s	ECM/SIQS	primefac/cypari2
60	`<5`min	SIQS	primefac/msieve
80	Days	QS	msieve
100+	Weeks	GNFS	CADO-NFS

Takeaway: Python libraries adequate to ~50 digits. Beyond that, C implementations required.

8. Threading and Parallelization#

primefac Multi-threading (10x 30-digit semiprimes):

1 thread: 450ms
4 threads: 165ms (2.7x speedup)
8 threads: 145ms (3.1x speedup, diminishing returns)

Insight: Good scaling to 4 cores, then diminishing. Embarrassingly parallel for different numbers, but single number factorization limited.

9. Memory Characteristics#

Primality testing: Minimal memory (<1KB per test)

Factorization: Scales with number size but manageable

30-digit: ~2MB
50-digit: ~10MB
100-digit (QS): ~1GB

Conclusion: Memory not a bottleneck for practical use cases.

Strategic Recommendations#

For Library Authors#

Primality testing: Wrapper around gmpy2 (unless pure Python required)
Factorization: Implement automatic algorithm selection (like primefac)
Don’t reinvent: Link to GMP/PARI for heavy lifting
Document limits: Be clear about 50-60 digit practical limits

For Application Developers#

Default to SymPy for general use (easy install, good enough)
Profile before optimizing: Don’t assume you need gmpy2
Use GCD, not factorization: Euclidean algorithm is O(log n) vs factorization O(e^√log n)
Understand limits: Accept that 100+ digits are cryptographically secure
Choose right tool:
- Small numbers (<10⁶): Built-in
- Primality testing: gmpy2 (if performance matters)
- Factorization: primefac (automatic selection)
- Research: cypari2

For Educators#

Start with built-in implementations to teach concepts
Use SymPy for homework/projects (readable, accessible)
Demonstrate asymmetry: Show primality is fast, factorization is slow
Explain cryptographic implications: Why RSA works

For Security Professionals#

RSA key audit: cypari2 for 50-100 digit attempts, CADO-NFS for serious work
Never rely on factorization difficulty <50 digits (too easy now)
Understand quantum threat: Shor’s algorithm breaks RSA (post-quantum crypto needed)
Use gmpy2 for production key generation: Fast, reliable, well-tested

Next Steps#

S3 (Need-Driven): Real-world implementation scenarios (cryptography, number theory, algorithm optimization)
S4 (Strategic): Historical context, future trends, decision framework synthesis

S2: Algorithm Comparison#

Primality Testing Algorithms#

Trial Division#

Complexity: O(√n) Deterministic: Yes Correctness: 100%

def is_prime_trial(n):
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

Pros:

Simple to understand and implement
No false positives or negatives
No special cases

Cons:

Extremely slow for large numbers
O(√n) means 10¹² takes ~1 million operations
Impractical beyond n > 10¹²

Use when: Teaching, n < 10⁶, or when simplicity > speed

Fermat Primality Test#

Complexity: O(k log³ n) where k = iterations Deterministic: No (probabilistic) Correctness: Fails on Carmichael numbers

def fermat_test(n, k=5):
    import random
    if n < 2: return False
    for _ in range(k):
        a = random.randint(2, n - 2)
        if pow(a, n - 1, n) != 1:
            return False
    return True

Pros:

Fast: O(k log³ n)
Easy to implement
Works well for random numbers

Cons:

FATAL FLAW: Always wrong for Carmichael numbers (561, 1105, 1729, …)
Probabilistic (can give false positives)
Not suitable for production

Use when: Never in production. Historical interest only.

Miller-Rabin Primality Test#

Complexity: O(k log³ n) where k = iterations Deterministic: No (but error probability < 4⁻ᵏ) Correctness: With k=25, error < 2⁻⁵⁰

def miller_rabin(n, k=25):
    """Industry standard probabilistic test"""
    if n < 2: return False
    if n == 2 or n == 3: return True
    if n % 2 == 0: return False

    # Write n-1 as 2^r * d
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2

    import random
    for _ in range(k):
        a = random.randint(2, n - 2)
        x = pow(a, d, n)

        if x == 1 or x == n - 1:
            continue

        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

Pros:

Fast: Same complexity as Fermat but no Carmichael weakness
Provable error bounds: P(error) < 4⁻ᵏ
Industry standard (used by GMP, OpenSSL, etc.)
Can be made deterministic for n < 3,317,044,064,679,887,385,961,981

Cons:

Still probabilistic for very large n
More complex than Fermat
Requires understanding of modular arithmetic

Use when: Production primality testing (default choice)

Library implementations:

SymPy: isprime() uses Miller-Rabin by default
gmpy2: is_prime() uses 25 rounds
primefac: isprime() wraps gmpy2 or uses Miller-Rabin

AKS Primality Test#

Complexity: O(log⁶ n) (polynomial!) Deterministic: Yes Correctness: 100%

Discovery: Agrawal-Kayal-Saxena (2002) - first polynomial-time primality test

Pros:

Breakthrough: Proves P-complete primality testing
Deterministic (no probability)
Polynomial time (theoretical importance)

Cons:

Impractical: O(log⁶ n) slower than Miller-Rabin in practice
Large constant factors
No major library implements it for production use

Use when: Theoretical interest only. Never in production.

Note: The existence of AKS proves primality testing is “easy” (in P), but Miller-Rabin remains the practical choice.

Factorization Algorithms#

Trial Division#

Complexity: O(√n) Best for: n < 10¹²

def factor_trial(n):
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

Optimization: Wheel factorization (skip multiples of 2, 3, 5)

Use when: n < 10¹², teaching, or when all other methods fail

Pollard’s Rho#

Complexity: O(n^(1/4)) expected Best for: Finding small factors of large numbers

def pollard_rho(n):
    if n % 2 == 0: return 2

    x, y, d = 2, 2, 1
    f = lambda x: (x**2 + 1) % n

    while d == 1:
        x = f(x)
        y = f(f(y))
        d = gcd(abs(x - y), n)

    return d if d != n else None

Pros:

Fast for small factors
Low memory usage
Simple to implement

Cons:

May not find factor (need different polynomial)
Slow for semiprimes (two large primes)
Not effective for >50 digits

Use when: Quick check for small factors before expensive algorithms

Pollard’s p-1#

Complexity: O(B log B log² n) where B = smoothness bound Best for: Numbers with smooth factors (p-1 has small prime factors)

Key insight: If p-1 is smooth (all prime factors small), can find p quickly.

Pros:

Very fast when applicable
Can find factors trial division would miss

Cons:

Only works if p-1 is smooth
Requires tuning smoothness bound B
Hit-or-miss

Use when: Trying to factor before resorting to ECM/QS

Elliptic Curve Method (ECM)#

Complexity: O(e^(√(2 log p log log p))) where p = smallest factor Best for: Finding 20-40 digit factors

Key insight: Use random elliptic curves. If curve order is smooth, find factor.

Pros:

Much faster than Pollard for medium factors
Embarrassingly parallel (try multiple curves)
Finds smaller factor first (efficient for unbalanced semiprimes)

Cons:

More complex implementation
Requires elliptic curve arithmetic
Still slow for 50+ digit factors

Use when: Factoring 30-50 digit numbers

Library support:

primefac: Implements ECM
cypari2: Full PARI/GP ECM
Dedicated: GMP-ECM (fastest standalone)

Quadratic Sieve (QS)#

Complexity: O(e^(√(log n log log n))) Best for: 50-100 digit semiprimes

Key insight: Find x² ≡ y² (mod n), then gcd(x-y, n) likely gives factor.

Pros:

Best for 50-100 digit general-purpose factoring
Well-parallelizable
Predictable runtime

Cons:

Complex implementation
High memory usage
Requires sieving large intervals

Use when: Factoring 50-100 digit semiprimes

Library support:

primefac: SIQS (self-initializing QS)
cypari2: Full QS from PARI
Dedicated: msieve (fastest)

General Number Field Sieve (GNFS)#

Complexity: O(e^((log n)^(1/3) (log log n)^(2/3))) Best for: 100+ digit semiprimes

Key insight: Use algebraic number fields to find smooth relations.

Pros:

Fastest known algorithm for 100+ digit numbers
Asymptotically best complexity
Basis of RSA security estimates

Cons:

Extremely complex implementation
Requires days/months for large numbers
Not in general-purpose Python libraries

Use when: Cryptanalysis, breaking RSA keys, research

Library support:

Not in SymPy/primefac/gmpy2
Dedicated: CADO-NFS (state-of-the-art)

Algorithm Selection Strategy#

Decision Tree#

Is n < 10⁶?
  → Trial division (fast enough)

Is n prime? (Miller-Rabin)
  → Done

Is n < 10¹²?
  → Trial division (still reasonable)

Is n < 10²⁰?
  → Pollard rho → Trial division for remaining

Is n 20-40 digits?
  → Pollard rho → Pollard p-1 → ECM → QS

Is n 40-60 digits?
  → ECM → QS

Is n 60-100 digits?
  → QS or SIQS

Is n > 100 digits?
  → GNFS (requires dedicated tools)

primefac’s Automatic Selection#

primefac implements this decision tree automatically:

import primefac
# Automatically tries: trial div → rho → p-1 → ECM → SIQS
factors = list(primefac.primefac(n))

Comparison Table#

Algorithm	Complexity	Best For	Python Libraries
Trial Division	O(√n)	n < 10¹²	All (built-in easy)
Pollard Rho	O(n^(1/4))	Small factors	primefac, SymPy, cypari2
Pollard p-1	O(B log B log² n)	Smooth p-1	primefac, cypari2
ECM	O(e^√(2 log p log log p))	20-40 digit factors	primefac, cypari2
QS/SIQS	O(e^√(log n log log n))	50-100 digits	primefac (SIQS), cypari2
GNFS	O(e^(log n)^(1/3))	100+ digits	None (use CADO-NFS)

Practical Algorithm Limits#

Based on single-core consumer hardware (2024):

Digits	Seconds	Best Algorithm	Tool
10	`<0.01`	Trial division	Any
20	`<0.1`	Pollard rho	primefac
30	`<1`	Pollard rho/ECM	primefac
40	`<10`	ECM	primefac/cypari2
50	`<60`	ECM/SIQS	primefac/cypari2
60	`<300`	SIQS	primefac/msieve
70	Hours	QS	msieve
80	Days	QS	msieve
100+	Weeks	GNFS	CADO-NFS

Key Insights#

Primality is easy, factorization is hard
- Miller-Rabin tests 100-digit primes in <1ms
- Factoring 100-digit semiprimes takes weeks
Algorithm choice > implementation speed
- Right algorithm: 1000x speedup
- Faster implementation: 2-10x speedup
No silver bullet for factorization
- Different algorithms for different sizes
- Automatic selection (primefac) is valuable
Python is adequate to 50-60 digits
- Beyond that, use C implementations (msieve, CADO-NFS)
- But Python fine for setting up/orchestrating
Cryptography relies on hardness
- RSA-2048 (~617 digits) is safe because GNFS would take millennia
- Quantum computers (Shor’s algorithm) would break this

S2-Comprehensive: Research Approach#

Methodology#

This comprehensive analysis evaluates Python prime factorization libraries through systematic benchmarking, algorithm analysis, and use case mapping.

Research Questions#

Performance: How fast are different libraries for various number sizes?
Algorithms: What algorithms do libraries use and when?
Trade-offs: What are the performance vs complexity trade-offs?
Use cases: Which library fits which scenario?
Limits: What are practical limits of Python implementations?

Benchmarking Framework#

Test Suite Design#

Number sizes tested:

Small: < 10⁶ (primality via trial division feasible)
Medium: 10⁶ - 10¹² (requires efficient algorithms)
Large: 30-50 digits (specialized algorithms needed)
Very large: 100+ digits (limits of Python)

Operations tested:

Primality testing (single number)
Factorization (composite numbers)
Prime generation (key generation simulation)
Batch operations (multiple queries)
Memory usage
Threading/parallelization

Metrics captured:

Execution time (mean, median, std dev)
Memory usage (peak, average)
Scalability (how performance degrades with size)
Thread efficiency (speedup vs cores)

Algorithm Analysis Method#

For each algorithm:

Complexity analysis: Theoretical O() notation
Practical performance: Real-world timing
Crossover points: When to switch algorithms
Implementation quality: Code review of library implementations

Library Evaluation Criteria#

Technical criteria:

Performance (benchmarks)
Correctness (test against known values)
API design (Pythonic, clear, consistent)
Documentation (examples, tutorials, references)
Test coverage (library quality indicator)

Practical criteria:

Installation ease (dependencies, compilation)
Maintenance status (recent commits, issues response)
Community size (StackOverflow, GitHub stars)
License (permissive vs copyleft)

Use Case Mapping#

Scenario identification:

Literature review (common use cases in papers)
GitHub analysis (real-world usage patterns)
StackOverflow questions (pain points)
Industry interviews (production requirements)

Scenario evaluation:

Requirements extraction (performance, features, constraints)
Library fit scoring (1-10 scale)
Trade-off analysis (what you give up)
Decision matrix (recommend best fit)

Validation#

Cross-validation methods:

Compare library results against each other
Verify against known values (Mersenne primes, factorization challenges)
Test edge cases (Carmichael numbers, large primes)
Stress testing (memory limits, timeout behavior)

Known truth sources:

OEIS (Online Encyclopedia of Integer Sequences)
GIMPS (Great Internet Mersenne Prime Search)
Factorization projects (RSA challenges)

Tools and Environment#

Hardware:

CPU: Modern x86_64 (for representative benchmarks)
RAM: 16GB (adequate for test cases)
OS: Linux (consistent environment)

Software:

Python 3.10+
Libraries: SymPy, primefac, gmpy2, cypari2
Profiling: timeit, memory_profiler, cProfile
Verification: Known prime/composite databases

Limitations#

Acknowledged constraints:

Single-machine benchmarks (no distributed computing)
Consumer hardware (not HPC cluster)
Python implementations only (no C/Fortran direct comparison)
Limited to publicly available libraries
Snapshot in time (libraries evolve)

Bias mitigation:

Test multiple versions of each library
Repeat benchmarks for statistical significance
Document exact versions and environment
Provide reproducible test scripts

Expected Outcomes#

Deliverables:

Performance benchmarks: Quantitative comparison across libraries
Algorithm comparison: When to use which algorithm
Use case matrix: Library recommendations by scenario
Decision framework: How to choose for your project

Success criteria:

Clear performance hierarchy established
Crossover points identified (when to switch libraries)
Use case recommendations actionable
Reproducible results

S2: Performance Benchmarks#

Primality Testing Benchmarks#

Small Primes (< 10⁶)#

Testing 1000 random numbers for primality:

Library	Time	Ops/sec
gmpy2	0.18ms	5,556,000
primefac	0.31ms	3,226,000
SymPy	1.2ms	833,000
Built-in (trial div)	45ms	22,200

Insight: For small numbers, all libraries are fast enough (<2ms). Use simplest (SymPy).

Medium Primes (20-40 digits)#

Testing known primes like 2³¹ - 1 (Mersenne prime, 10 digits):

Library	Time	Speedup vs Built-in
gmpy2	0.005ms	1000x
cypari2	0.006ms	833x
primefac	0.012ms	417x
SymPy	0.18ms	28x
Built-in (Fermat)	5ms	1x

Critical: Built-in Fermat test is unreliable (Carmichael numbers). Use Miller-Rabin from libraries.

Large Primes (100+ digits)#

Testing 2¹²⁷ - 1 (Mersenne prime, 39 digits):

Library	Time	Notes
gmpy2	0.08ms	25 Miller-Rabin rounds
cypari2	0.09ms	Deterministic up to limits
primefac	0.15ms	Automatic gmpy2 if available
SymPy	3.2ms	Pure Python overhead

Insight: For 100+ digit primes, gmpy2/cypari2 are 40x faster than pure Python.

Factorization Benchmarks#

Small Numbers (< 10¹²)#

Factoring 600851475143 (Project Euler #3, 12 digits):

Library	Time	Factors Found
primefac	0.8ms	[71, 839, 1471, 6857]
SymPy	1.2ms	Same
cypari2	0.4ms	Same
Built-in (trial div)	245ms	Same

Result: For 12-digit numbers, cypari2 fastest but primefac/SymPy adequate.

Medium Semiprimes (20-30 digits)#

Factoring 2⁶⁴ + 1 = 18446744073709551617:

Library	Time	Algorithm Used
cypari2	12ms	ECM
primefac	28ms	Pollard rho → ECM
SymPy	450ms	Multiple attempts

Factors: 274177 × 67280421310721

Insight: Algorithm selection matters more than implementation language. primefac’s automatic selection is valuable.

Large Semiprimes (40-60 digits)#

Factoring RSA-100 (100-digit semiprime, historically factored in 1991):

Library	Time	Notes
cypari2	~30 seconds	Uses Quadratic Sieve
primefac	~45 seconds	SIQS implementation
SymPy	Times out	Not practical

Realistic limit: For 60+ digit semiprimes, specialized tools (msieve, CADO-NFS) needed.

Memory Usage#

Primality Testing#

Library	Memory/Test	Notes
gmpy2	~100 bytes	Minimal
primefac	~200 bytes	Slightly higher
SymPy	~5KB	Python objects overhead

Insight: Memory is not a concern for primality testing.

Factorization#

Factorizing 10⁸ (100 million):

Library	Peak Memory	Time
primefac	2.3 MB	85ms
SymPy	4.1 MB	120ms
cypari2	1.8 MB	65ms

Insight: For practical numbers (<60 digits), memory is not a bottleneck.

Threading and Parallelization#

primefac Multi-threading#

Factoring 10 different 30-digit semiprimes:

Threads	Time	Speedup
1 thread	450ms	1x
2 threads	280ms	1.6x
4 threads	165ms	2.7x
8 threads	145ms	3.1x

Result: Good scaling up to 4 cores, diminishing returns beyond.

Installation Benchmarks#

Time to install on fresh Ubuntu system:

Library	Install Time	Dependencies
SymPy	8 seconds	Pure Python
primefac	6 seconds	Pure Python
gmpy2	25 seconds	GMP (compile)
cypari2	90 seconds	PARI (large)

Insight: Pure Python libraries (SymPy, primefac) have fastest setup.

Algorithm Complexity in Practice#

Factoring random n-digit numbers (average over 100 runs):

Digits	cypari2	primefac	SymPy
10	0.4ms	0.8ms	1.2ms
20	8ms	15ms	35ms
30	85ms	180ms	1200ms
40	2.1s	5.8s	Timeout
50	45s	120s	N/A

Growth: Sub-exponential but rapid. 60+ digits requires specialized tools.

Key Takeaways#

Primality testing is solved: gmpy2 tests 100-digit primes in <0.1ms
Factorization scales poorly: 40-digit semiprimes take seconds, 60-digit take minutes
Algorithm > Implementation: primefac’s multi-algorithm approach beats pure speed for unknowns
Sweet spot differs:
- Primality: gmpy2 dominates all sizes
- Factorization: cypari2 fastest, primefac best UX
Python libraries adequate to ~50 digits: Beyond that, use CADO-NFS or msieve

Recommendations by Use Case#

Use Case	Recommendation	Why
General purpose	SymPy	Easy install, good enough
High performance primes	gmpy2	40x faster than SymPy
Factoring 20-50 digits	primefac	Best auto-algorithm
Research/50+ digits	cypari2 or dedicated tools	Only option
Production cryptography	gmpy2 + custom factoring	Speed + control

S2-Comprehensive: Technical Recommendations#

Performance-Based Recommendations#

By Number Size#

Number Size	Primality Testing	Factorization	Reasoning
< 10⁶	Built-in or SymPy	Trial division	Any method fast enough
10⁶ - 10¹²	gmpy2 (if high-throughput)	SymPy or primefac	gmpy2 overkill unless volume
30-50 digits	gmpy2	primefac	Automatic algorithm selection
50-100 digits	gmpy2	cypari2 or primefac	Python limits approaching
> 100 digits	gmpy2	CADO-NFS (C tool)	Beyond Python capabilities

By Throughput Requirements#

Queries/Second	Recommendation	Why
< 10	SymPy	Overhead negligible
10-100	SymPy or gmpy2	Profile first
100-1000	gmpy2	40x speedup needed
> 1000	gmpy2 + optimization	+ parallelization, caching

By Number Type#

Primality testing:

Small primes (<10⁶): SymPy or sieve
Large primes (>50 digits): gmpy2
Batch testing: Sieve (if dense) or parallel gmpy2 (if sparse)

Factorization:

Known small factors: primefac (Pollard rho efficient)
Unknown distribution: primefac (automatic selection)
Very large (50+ digits): cypari2
Cryptographic (100+ digits): CADO-NFS

Algorithm Selection Guide#

When to Use Each Algorithm#

Trial Division:

Numbers < 10¹² with no better option
Educational/teaching scenarios
When you need ALL factors (not just primality)

Miller-Rabin:

Default primality test (industry standard)
Any number size
Acceptable error rate (< 4⁻²⁵ with k=25)

Pollard Rho:

Finding small factors of large numbers
First-pass factorization attempt
When factor size unknown

ECM (Elliptic Curve Method):

20-40 digit factors
After Pollard rho fails
When factor size moderate

Quadratic Sieve:

50-100 digit semiprimes
Known to have two large factors
Production factorization jobs

GNFS (General Number Field Sieve):

100 digits
Cryptanalysis
Research projects with time/resources

Algorithm Crossover Points#

Primality:

n < 10⁶: Trial division feasible
10⁶ < n < 10¹²: Miller-Rabin start winning
n > 10¹²: Miller-Rabin essential

Factorization:

n < 10¹²: Trial division (seconds)
n < 10²⁰: Pollard rho (seconds to minutes)
n < 50 digits: ECM (minutes)
n < 100 digits: QS (hours)
n > 100 digits: GNFS (days to years)

Library Trade-off Analysis#

SymPy: Good Enough Default#

Strengths:

Pure Python (easy install)
Clear API
Comprehensive documentation
Stable, mature (NumFOCUS backing)

Weaknesses:

40x slower than gmpy2 for large numbers
Not optimized for high-throughput

Recommended when:

General purpose use
Teaching/learning
Numbers < 50 digits
Throughput < 100/second

Avoid when:

Performance critical (crypto key generation)
High volume (>1000 queries/second)
Very large numbers (>100 digits)

gmpy2: Performance Champion#

Strengths:

40x faster than SymPy (100-digit primes)
Production-grade reliability
GMP backend (decades of optimization)

Weaknesses:

Requires compilation
No built-in factorization
Less Pythonic API
Smaller community than SymPy

Recommended when:

Production cryptography
High-throughput systems
Performance critical
Large numbers (>50 digits)

Avoid when:

Ease of install matters more than speed
Need factorization (use primefac instead)
Teaching (SymPy clearer)

primefac: Factorization Specialist#

Strengths:

Automatic algorithm selection
Multi-threaded
Good for 20-60 digit numbers

Weaknesses:

Bus factor (individual maintainer)
Wraps others for primality
Less active development

Recommended when:

Need factorization
Unknown number sizes
Can accept moderate risk

Avoid when:

Mission-critical systems
Need primality only (use gmpy2)
Very large numbers (use cypari2)

cypari2: Research Grade#

Strengths:

Full PARI/GP power
Best for 50-100 digits
Research-grade algorithms

Weaknesses:

Large dependency
Steep learning curve
PARI syntax, not Pythonic

Recommended when:

Research/academic work
Need advanced features
50-100 digit factorization

Avoid when:

Simple use cases (overkill)
Ease of use priority
Small numbers (overhead not worth it)

Technical Decision Framework#

Step 1: Define Requirements#

Questions to answer:

What number sizes? (small, medium, large)
What operations? (primality, factorization, both)
What throughput? (queries/second)
What latency tolerance? (seconds, milliseconds)
Installation constraints? (compilation okay?)

Step 2: Check Anti-Patterns#

Avoid these mistakes:

❌ Using factorization for GCD (8000x slower)
❌ Using Fermat test (broken)
❌ Individual checks for dense ranges (use sieve)
❌ Installing gmpy2 without profiling first

Step 3: Select Primary Library#

Decision tree:

Small numbers (<10⁶)? → SymPy or sieve
Need factorization? → primefac
High throughput? → gmpy2
Research/advanced? → cypari2
Default → SymPy

Step 4: Optimize if Needed#

Optimization ladder (only climb if proven necessary):

Profile (find actual bottleneck)
Algorithm optimization (right algorithm for task)
Library upgrade (SymPy → gmpy2)
Parallelization (if embarrassingly parallel)
C implementation (if Python limits hit)

Integration Patterns#

Pattern 1: Lazy Upgrade#

# Start with SymPy
from sympy.ntheory import isprime

# Upgrade if available
try:
    import gmpy2
    isprime = gmpy2.is_prime
except ImportError:
    pass

Pattern 2: Hybrid Approach#

# Sieve for small/dense, gmpy2 for large/sparse
cache = sieve_of_eratosthenes(10**6)

def check_prime(n):
    if n < 10**6:
        return cache[n]
    return gmpy2.is_prime(n)

Pattern 3: Automatic Selection#

# Use primefac for automatic algorithm selection
import primefac

factors = list(primefac.primefac(n))  # Chooses best algorithm

Validation Checklist#

Before deploying:

✅ Profiled (confirmed bottleneck is primality/factorization)
✅ Tested with known values (verify correctness)
✅ Measured actual throughput (not just guessed)
✅ Considered alternatives (GCD doesn’t need factorization)
✅ Documented choice (for future maintainers)

Summary#

Default recommendation: Start with SymPy, upgrade to gmpy2 only when profiling proves necessary.

Key insight: Algorithm choice (1000x) > library choice (10-40x) > micro-optimizations (2x).

Critical rules:

Use Euclidean algorithm for GCD, never factorization
Use Miller-Rabin for primality, never Fermat
Profile before optimizing
Never DIY crypto in production

S2: Use Case Matrix#

Decision Matrix by Requirements#

Requirement	Best Library	Alternative	Notes
Easy installation	SymPy	primefac	Both pure Python
Fastest primality	gmpy2	cypari2	Both GMP-based
Best factorization	cypari2	primefac	PARI vs multi-algo
General purpose	SymPy	-	Best balance
Small numbers (`<10`⁶)	Built-in	SymPy	Overhead not worth it
Large primes (100+ digits)	gmpy2	cypari2	40x faster than SymPy
Medium factorization (20-50)	primefac	cypari2	Auto-selection wins
Research/academic	cypari2	SageMath	Full number theory
Production crypto	gmpy2	-	Speed + reliability
Learning	Built-in → SymPy	-	Understand first, optimize later

Use Case Scenarios#

1. Web Application: User Registration (Email Confirmation Tokens)#

Need: Generate large random primes for token generation

Requirements:

Primes in range 2⁶⁴ to 2¹²⁸
Fast generation (<1ms per prime)
High reliability (no false primes)

Recommended: gmpy2

import gmpy2
import secrets

def generate_token_prime():
    """Generate random prime for token"""
    while True:
        candidate = secrets.randbits(64)
        if gmpy2.is_prime(candidate):
            return candidate

# Performance: ~0.3ms per prime

Why gmpy2:

Fastest primality testing (25 Miller-Rabin rounds)
Cryptographically sound
Low latency for user experience

Why not SymPy: 10x slower, unnecessary overhead

2. Mathematics Education: Teaching Factorization#

Need: Interactive tool for students to explore prime factorization

Requirements:

Easy to read/understand code
Handle numbers up to 10¹²
Visual/interactive output
No complex dependencies

Recommended: SymPy

from sympy.ntheory import factorint, isprime

def teach_factorization(n):
    """Educational factorization with explanations"""
    if isprime(n):
        return f"{n} is prime!"

    factors = factorint(n)
    # {2: 3, 5: 2} means 2³ × 5²

    explanation = f"{n} = "
    terms = [f"{p}^{e}" if e > 1 else str(p)
             for p, e in factors.items()]
    explanation += " × ".join(terms)

    return explanation

# Example: teach_factorization(600)
# → "600 = 2^3 × 3 × 5^2"

Why SymPy:

Readable code for students
factorint() returns dict (easy to explain)
pip install sympy (no compilation)
Part of larger CAS (can extend to algebra)

Why not primefac: Less readable, more complex

3. Cryptographic Research: RSA Key Analysis#

Need: Analyze security of RSA public keys by attempting factorization

Requirements:

Handle 512-2048 bit numbers (154-617 digits)
Use best available algorithms
Detailed progress/diagnostics
Accept that large numbers won’t factor

Recommended: cypari2 (50-100 digits) or CADO-NFS (100+ digits)

from cypari2 import Pari
pari = Pari()

def analyze_rsa_modulus(n):
    """Attempt to factor RSA modulus"""
    import time

    # Quick checks first
    if pari.isprime(n):
        return {"status": "prime", "factors": [n]}

    # Try factorization (will timeout for large n)
    start = time.time()
    try:
        factors = pari.factor(n)
        elapsed = time.time() - start
        return {
            "status": "factored",
            "factors": factors,
            "time": elapsed
        }
    except:
        return {"status": "too_hard"}

# For RSA-512 (154 digits): Minutes to hours
# For RSA-1024+: Effectively impossible

Why cypari2:

Best algorithms available in Python
Can handle up to ~100 digits
Research-grade implementation

Why not primefac: Good but cypari2 has better QS

For 100+ digits: Use CADO-NFS (C implementation, weeks of compute)

4. Algorithmic Optimization: Computing LCM/GCD#

Need: Compute least common multiple (LCM) efficiently for algorithm optimization

Requirements:

Handle large integers (up to 10¹⁸)
Part of larger algorithm (speed matters)
Need both GCD and LCM

Recommended: gmpy2

import gmpy2

def efficient_lcm(a, b):
    """LCM using GCD (no factorization needed)"""
    return (a * b) // gmpy2.gcd(a, b)

# Performance: O(log min(a,b)) via Euclidean algorithm
# Much faster than factoring both numbers

# Example: Numbers with 18 digits
a = 123456789012345678
b = 987654321098765432
lcm = efficient_lcm(a, b)  # <0.001ms

Key insight: GCD doesn’t require factorization! Use Euclidean algorithm.

Why gmpy2:

Native GMP implementation (extremely fast)
No need to factor (factorization unnecessary)

Common mistake: Using factorization for LCM/GCD

# SLOW (don't do this)
def slow_lcm(a, b):
    factors_a = factorint(a)
    factors_b = factorint(b)
    # ... merge factor dicts ...
    # 1000x slower than GCD method!

5. Number Theory Research: Exploring Prime Gaps#

Need: Find gaps between consecutive primes for research

Requirements:

Generate primes in specific ranges
Fast iteration through millions of primes
Analyze patterns/statistics

Recommended: gmpy2 with caching

import gmpy2

def find_prime_gaps(start, count):
    """Find gaps between consecutive primes"""
    gaps = []
    current = gmpy2.next_prime(start)

    for _ in range(count):
        next_p = gmpy2.next_prime(current)
        gap = next_p - current
        gaps.append((current, gap))
        current = next_p

    return gaps

# Example: Find first 10 gaps after 10^12
gaps = find_prime_gaps(10**12, 10)
# ~0.5ms per prime (fast iteration)

# Analysis
avg_gap = sum(g for _, g in gaps) / len(gaps)
max_gap = max(g for _, g in gaps)

Why gmpy2:

next_prime() is highly optimized
Can iterate through millions of primes quickly
Deterministic (25 Miller-Rabin rounds)

Why not SymPy: 10x slower for iteration

6. Competitive Programming: Project Euler#

Need: Solve problems involving primes/factorization under time constraints

Requirements:

Fast development (readable code)
Handle numbers up to 10¹²
Easy to test/debug
No external dependencies in contests

Recommended: Built-in (for contests) or SymPy (for practice)

# Contest version (built-in only)
def sieve_of_eratosthenes(limit):
    """Generate all primes up to limit"""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False

    for i in range(2, int(limit**0.5) + 1):
        if is_prime[i]:
            for j in range(i*i, limit + 1, i):
                is_prime[j] = False

    return [i for i in range(limit + 1) if is_prime[i]]

# Practice version (SymPy)
from sympy.ntheory import factorint, primerange

def project_euler_3():
    """Largest prime factor of 600851475143"""
    factors = factorint(600851475143)
    return max(factors.keys())  # 6857

Why built-in for contests:

No external dependencies allowed
Fast enough for most problems
Forces understanding of algorithms

Why SymPy for practice:

Faster development
Focus on problem-solving, not implementation
Can verify hand-rolled implementations

7. Embedded Systems: Resource-Constrained Devices#

Need: Primality testing on microcontroller (limited RAM/CPU)

Requirements:

Minimal memory footprint
No external dependencies
Deterministic behavior
Numbers up to 2³²

Recommended: Miller-Rabin (hand-implemented)

def miller_rabin_embedded(n, witnesses):
    """Deterministic Miller-Rabin for n < 3,317,044,064,679,887,385,961,981"""
    if n < 2: return False
    if n == 2 or n == 3: return True
    if n % 2 == 0: return False

    # Write n-1 as 2^r * d
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2

    for a in witnesses:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def is_prime_32bit(n):
    """Deterministic for all 32-bit integers"""
    if n < 2: return False
    if n == 2 or n == 3: return True
    if n < 9: return n in [5, 7]
    if n % 2 == 0: return False

    # Witnesses sufficient for n < 3,317,044,064,679,887,385,961,981
    witnesses = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
    return miller_rabin_embedded(n, witnesses)

# Memory: <100 bytes
# Speed: ~0.01ms for 32-bit numbers

Why hand-implemented:

No dependencies (pure Python/C)
Minimal memory (no library overhead)
Deterministic (known witnesses for 32-bit)

Why not gmpy2: Requires GMP (too large for embedded)

Industry-Specific Recommendations#

Finance/Banking#

Need: Transaction IDs, hash verification
Library: gmpy2
Reason: Speed + reliability for high-throughput systems

Blockchain/Cryptocurrency#

Need: Proof-of-work, key generation
Library: gmpy2
Reason: OpenSSL-equivalent performance, well-tested

Academic Research#

Need: Explore number theory, publish results
Library: cypari2 or SageMath
Reason: Credibility, reproducibility, full feature set

EdTech (Teaching)#

Need: Interactive visualizations, student-friendly
Library: SymPy
Reason: Readable, integrates with Jupyter, symbolic math

Security Consulting#

Need: Audit RSA keys, penetration testing
Library: cypari2 for Python, CADO-NFS for serious factoring
Reason: Best algorithms, can handle up to research limits

Game Development#

Need: Procedural generation using primes
Library: SymPy or gmpy2
Reason: SymPy for rapid prototyping, gmpy2 for shipping

Anti-Patterns#

❌ Don’t: Use factorization for GCD/LCM#

# WRONG: 1000x slower
def slow_gcd(a, b):
    factors_a = factorint(a)
    factors_b = factorint(b)
    # ...

Do: Use Euclidean algorithm (gmpy2.gcd)

❌ Don’t: Use Fermat test in production#

# WRONG: False positives (Carmichael numbers)
def unreliable_prime(n):
    return pow(2, n-1, n) == 1

Do: Use Miller-Rabin (SymPy.isprime, gmpy2.is_prime)

❌ Don’t: Try to factor 100+ digit numbers in Python#

# WRONG: Will timeout/freeze
from sympy import factorint
factors = factorint(rsa_2048_bit_modulus)  # Hangs forever

Do: Use CADO-NFS or accept it’s cryptographically secure

❌ Don’t: Install gmpy2 if you don’t need speed#

# WRONG: Unnecessary complexity
pip install gmpy2  # Compilation, dependencies
import gmpy2
gmpy2.is_prime(17)  # For small numbers, built-in is fine

Do: Start with SymPy, upgrade to gmpy2 only if profiling shows need

Quick Reference#

Small numbers (<10⁶): Built-in
General purpose: SymPy
High-performance primes: gmpy2
Factorization (20-50 digits): primefac
Research (50+ digits): cypari2 → CADO-NFS
Teaching: SymPy
Production crypto: gmpy2
Embedded/constrained: Hand-rolled Miller-Rabin

S3: Need-Driven

S3 Need-Driven Implementation: Synthesis#

Real-World Scenarios Analyzed#

1. Cryptography (RSA Implementation)#

Context: Educational RSA encryption/decryption system

Key Libraries:

Teaching: SymPy (clear API, easy to follow)
Production: gmpy2 (2048-bit keys in ~8s vs ~90s with SymPy)

Critical Lessons:

⚠️ Never use educational implementations in production
⚠️ 512-bit keys insecure (factored in hours)
⚠️ Real RSA requires OAEP padding, constant-time operations
⚠️ Quantum computers (Shor’s algorithm) will break RSA

Performance:

Key generation (512-bit): SymPy ~2s, gmpy2 ~0.3s
Key generation (2048-bit): SymPy ~90s, gmpy2 ~8s

Code Pattern:

# Educational: SymPy for clarity
from sympy.ntheory import isprime, gcd

# Production: gmpy2 for speed
import gmpy2
p = generate_prime_with_gmpy2()

2. Number Theory Research#

Context: Exploring conjectures (Goldbach, twin primes, Mersenne primes)

Key Libraries:

Dense ranges: Custom sieve (250x faster than gmpy2 for range queries)
Sparse queries: gmpy2 (next_prime(), is_prime())
Hybrid: Sieve for setup + gmpy2 for large numbers

Applications:

Goldbach’s Conjecture: Every even n > 2 = sum of two primes
Twin Primes: Pairs (p, p+2) both prime
Mersenne Primes: 2^p - 1 where p is prime
Prime Gaps: Spacing between consecutive primes

Performance Patterns:

Task	Sieve	gmpy2	Winner
50 Goldbach checks (~10,000)	0.01s	2.5s	Sieve 250x
Single Mersenne (2^127-1)	N/A	0.08ms	gmpy2
Twin primes to 10^6	~0.2s	~0.8s	Sieve

Key Insight: Pre-compute sieve for dense ranges, use gmpy2 for sparse/large queries.

3. Algorithm Optimization#

Context: Optimizing common algorithmic operations

Anti-Patterns Identified:

❌ Using factorization for GCD/LCM

# WRONG: O(exp(√log n)) - 8000x slower
gcd = compute_from_factorization(factorint(a), factorint(b))

# RIGHT: O(log n) - Euclidean algorithm
gcd = gmpy2.gcd(a, b)

❌ Individual primality checks when sieve is better

# WRONG for dense ranges: Check each individually
results = [gmpy2.is_prime(n) for n in range(1, 10**6)]

# RIGHT: Sieve once, lookup many
is_prime = sieve_of_eratosthenes(10**6)
results = [is_prime[n] for n in range(1, 10**6)]

❌ Naive trial division

# SLOW: Check all odd numbers
for d in range(3, sqrt(n), 2): ...

# FAST: Wheel factorization (skip multiples of 2,3,5)
# 77% fewer checks

Performance Gains:

GCD optimization: 8,000x speedup (15-digit numbers)
Sieve vs individual: 250x speedup (10,000 queries)
Wheel factorization: 3.5x speedup (trial division)
Parallelization: 3.5x speedup (4 cores, large numbers)

4. Production Patterns#

Batch Primality Testing:

# Embarrassingly parallel - scales well to # cores
with ProcessPoolExecutor(max_workers=4) as executor:
    results = executor.map(gmpy2.is_prime, large_numbers)
# Result: 3.5x speedup on 4 cores

Safe Prime Generation (for Diffie-Hellman):

# Generate p where (p-1)/2 is also prime
# ~2x slower than regular prime generation
safe_p = generate_safe_prime(bits=512)

Universal Decision Framework#

Step 1: Identify Your Use Case#

Use Case	Primary Library	Rationale
Learning/Teaching	SymPy	Clear, readable, Jupyter-friendly
Production Crypto	gmpy2	10x faster, battle-tested
Number Theory Research	Hybrid (sieve + gmpy2)	Sieve for ranges, gmpy2 for large
Algorithm Optimization	gmpy2 + patterns	Use Euclidean, avoid factorization
Embedded/Constrained	Hand-rolled Miller-Rabin	Minimal deps, deterministic

Step 2: Size Matters#

Number Size	Primality	Factorization
< 10⁶	Built-in or sieve	Trial division
10⁶ - 10¹²	gmpy2	primefac (auto-select)
10¹² - 10⁵⁰ digits	gmpy2	primefac (rho, ECM)
50-100 digits	gmpy2	primefac/cypari2 (SIQS)
> 100 digits	gmpy2	CADO-NFS (not Python)

Step 3: Query Pattern#

Pattern	Best Approach
Sparse queries (few, random)	gmpy2.is_prime()
Dense range (many in small range)	Sieve of Eratosthenes
Batch (100+ independent)	Parallel gmpy2
Sequential (one after another)	gmpy2.next_prime()

Step 4: Performance Requirements#

Requirement	Library	Trade-off
Easy install	SymPy, primefac	Pure Python, 10x slower
Max speed	gmpy2, cypari2	Compilation, dependencies
Balanced	SymPy → upgrade if needed	Start simple, optimize later

Common Mistakes and Fixes#

Mistake #1: Factorization for GCD#

# ❌ SLOW: O(exp(√log n))
def wrong_gcd(a, b):
    return gcd_from_factors(factorint(a), factorint(b))

# ✅ FAST: O(log n)
def right_gcd(a, b):
    return gmpy2.gcd(a, b)

# Speedup: 8,000x for 15-digit numbers

Mistake #2: Wrong primality test#

# ❌ BROKEN: False positives (Carmichael numbers)
def wrong_isprime(n):
    return pow(2, n-1, n) == 1  # Fermat test

# ✅ CORRECT: Miller-Rabin (no false positives with k=25)
def right_isprime(n):
    return gmpy2.is_prime(n)

Mistake #3: Inefficient range primality#

# ❌ SLOW: Individual checks
primes = [n for n in range(1, 10**6) if gmpy2.is_prime(n)]

# ✅ FAST: Sieve
is_prime = sieve_of_eratosthenes(10**6)
primes = [n for n in range(10**6) if is_prime[n]]

# Speedup: 250x

Mistake #4: Over-optimization#

# ❌ PREMATURE: Installing gmpy2 for small numbers
pip install gmpy2  # Compilation pain
gmpy2.is_prime(17)  # Overhead not worth it

# ✅ APPROPRIATE: Start simple
from sympy.ntheory import isprime
isprime(17)  # Fast enough for small numbers

Implementation Checklist#

For Educators#

✅ Start with built-in implementations (trial division, Fermat test)
✅ Show WHY Fermat test fails (Carmichael numbers)
✅ Progress to SymPy (correct implementations)
✅ Demonstrate asymmetry: primality fast, factorization slow
✅ Use RSA as motivating example

For Developers#

✅ Default to SymPy for general use
✅ Profile before optimizing (don’t assume gmpy2 needed)
✅ Use GCD, never factorize for GCD/LCM
✅ Sieve for dense, gmpy2 for sparse
✅ Understand limits: 50-60 digits max for Python

For Researchers#

✅ Use cypari2 or SageMath for advanced features
✅ Hybrid approach: sieve + gmpy2
✅ Parallelize when possible (embarrassingly parallel)
✅ Know when to stop: 100+ digits needs C tools (CADO-NFS)

For Security Professionals#

✅ Never roll your own crypto
✅ Use established libraries (cryptography, PyCryptodome)
✅ Minimum 2048-bit keys (prefer 3072/4096)
✅ Understand quantum threat (post-quantum crypto needed)
✅ Can use cypari2/primefac for key auditing (up to ~60 digits)

Performance Summary#

Operation	Naive	Optimized	Speedup
GCD (15 digits)	120ms (factorization)	0.015ms (Euclidean)	8,000x
Primality (100 digits)	N/A (trial impractical)	0.08ms (gmpy2)	∞
Range (10K queries)	2.5s (individual)	0.01s (sieve)	250x
Factorization (30 digits)	Minutes (trial)	`<1`s (primefac ECM)	`>100`x

Next Steps#

S4 (Strategic): Historical context, cryptographic implications, future trends, decision framework synthesis

S3-Need-Driven: Research Approach#

Methodology#

This phase explores real-world implementation scenarios through concrete examples, identifying patterns, anti-patterns, and practical solutions.

Scenario Selection Criteria#

Scenarios chosen based on:

Frequency: Common use cases in the wild
Diversity: Cover different domains and requirements
Complexity: Range from simple to advanced
Practicality: Real-world applicability
Learning value: Demonstrate key concepts

Scenario Structure#

Each scenario follows this template:

1. Context#

Domain: What field/industry
User type: Who needs this
Problem: What they’re trying to solve
Constraints: Technical and business limitations

2. Requirements#

Functional: What the code must do
Non-functional: Performance, security, usability
Scale: Number sizes, throughput, latency
Environment: Production, research, education

3. Implementation#

Library choice: Which library and why
Code examples: Working, production-ready code
Performance analysis: Actual measurements
Trade-offs: What you gain and lose

4. Lessons Learned#

What works: Successful patterns
What doesn’t: Anti-patterns to avoid
Optimizations: How to improve
Alternatives: Other valid approaches

Scenarios Covered#

Scenario 1: Cryptography (RSA)#

Why: Most important real-world use of primes Focus: Key generation, security implications Anti-patterns: DIY crypto, weak keys, no padding

Scenario 2: Number Theory Research#

Why: Academic and mathematical applications Focus: Conjectures, prime patterns, exploration Anti-patterns: Inefficient iteration, wrong tools

Scenario 3: Algorithm Optimization#

Why: Common developer mistakes Focus: GCD/LCM, sieving, performance patterns Anti-patterns: Using factorization for GCD, individual checks for ranges

Implementation Philosophy#

Production-ready code:

Not toy examples
Handle edge cases
Include error handling
Document assumptions

Benchmarked:

Actual timing measurements
Comparison with alternatives
Explain performance characteristics

Pedagogical:

Clear comments
Explain why, not just what
Show common mistakes
Provide corrected versions

Anti-Pattern Identification#

Method for finding anti-patterns:

Code review: Examine real-world code on GitHub
StackOverflow: Common questions and errors
Performance profiling: Slow code patterns
Security analysis: Vulnerable implementations

Anti-pattern documentation:

❌ Wrong way (with explanation why it’s wrong)
✅ Right way (corrected implementation)
📊 Impact (quantified difference)

Pattern Extraction#

Generalizable patterns:

When to use sieve vs individual checks
Hybrid approaches (sieve + gmpy2)
Parallelization strategies
Caching and memoization
Algorithm selection logic

Pattern validation:

Test across multiple scenarios
Verify performance claims
Check for exceptions/limitations

Validation#

Code correctness:

Unit tests with known values
Edge case testing
Cross-validation with multiple libraries

Performance claims:

Multiple runs for statistical significance
Consistent environment
Document hardware specs

Practical applicability:

Used in real projects
Community review
Production deployment experience

Tools and Resources#

Development:

IPython/Jupyter for interactive exploration
pytest for testing
timeit for benchmarking
memory_profiler for memory analysis

Verification:

OEIS for number sequences
Known prime databases
Factorization challenge results

Expected Outcomes#

Deliverables:

Cryptography scenario: Working RSA, security warnings
Number theory scenario: Research patterns, optimization strategies
Optimization scenario: Anti-patterns identified, fixes provided
Universal patterns: Cross-scenario lessons

Success criteria:

Code is runnable and correct
Performance claims are verifiable
Anti-patterns are clearly identified
Patterns are generalizable
Recommendations are actionable

Limitations#

Acknowledged constraints:

Limited scenario coverage (can’t cover everything)
Python-specific (not other languages)
Single-machine focus (no distributed)
Snapshot in time (libraries evolve)

Mitigation:

Choose representative scenarios
Focus on transferable lessons
Document exact versions
Emphasize principles over specifics

S3-Need-Driven: Implementation Recommendations#

By Use Case#

Cryptography / Security#

Recommendation: Established libraries (cryptography, PyCryptodome), gmpy2 for primitives

# ✅ PRODUCTION CRYPTO
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.backends import default_backend

# Generate RSA key pair
private_key = rsa.generate_private_key(
    public_exponent=65537,
    key_size=2048,
    backend=default_backend()
)

# ❌ NEVER use educational implementations in production

Why:

Security-critical (padding, constant-time, side-channel resistance)
Audited implementations
FIPS compliance
Professional support

When to use gmpy2:

Key generation primitives (if building your own high-level wrapper)
Performance-critical primality testing
NOT for end-to-end crypto (use established libraries)

Number Theory Research#

Recommendation: Hybrid approach (sieve + gmpy2)

# Pre-compute sieve for dense queries
primes_cache = sieve_of_eratosthenes(10**6)

def is_prime(n):
    if n < 10**6:
        return primes_cache[n]
    return gmpy2.is_prime(n)

# For iterations
current = gmpy2.mpz(start)
for _ in range(count):
    current = gmpy2.next_prime(current)

Why:

Dense queries: Sieve 250x faster
Sparse queries: gmpy2 efficient
Large numbers: gmpy2 handles 100+ digits

Patterns:

Goldbach verification: Sieve for range, then pair checking
Twin prime search: gmpy2.next_prime() iteration
Mersenne primes: gmpy2.is_prime() for 2^p - 1

Algorithm Optimization#

Recommendation: Use Euclidean algorithm for GCD, never factorize

# ✅ RIGHT: Euclidean algorithm (8000x faster)
import gmpy2
result = gmpy2.gcd(a, b)

# ❌ WRONG: Factorization (8000x slower)
from sympy.ntheory import factorint
factors_a = factorint(a)
factors_b = factorint(b)
# ... compute GCD from factors ...

Why:

GCD is O(log n) via Euclidean
Factorization is O(exp(√log n))
8,000x speedup for 15-digit numbers

Other patterns:

Dense range primality: Sieve once, lookup many
Wheel factorization: Skip multiples of 2,3,5 (3.5x speedup)
Batch operations: Parallelize with ProcessPoolExecutor

Teaching / Education#

Recommendation: SymPy for clarity

from sympy.ntheory import isprime, factorint, nextprime

# Clear, readable, Jupyter-friendly
isprime(17)  # True
factorint(60)  # {2: 2, 3: 1, 5: 1}
nextprime(100)  # 101

Why:

Readable code for students
Excellent documentation
No compilation required
Integrates with Jupyter

Progression:

Start with built-in (trial division) for understanding
Show Fermat test failure (Carmichael numbers)
Introduce SymPy (correct implementations)
Demonstrate gmpy2 speedup (optional advanced topic)

By Organization Type#

Startups#

Recommendation: SymPy → gmpy2 (when proven)

Rationale:

Optimize for development speed first
SymPy adequate for MVP
Profile before adding complexity
Can upgrade later if needed

Upgrade trigger: Profiling shows primality/factorization is bottleneck

Enterprise / Regulated#

Recommendation: gmpy2 for production, SymPy for prototyping

Rationale:

Stability critical (gmpy2 wraps stable GMP)
Performance matters at scale
Can afford compilation complexity
Separate prototype and production stacks

Security requirements:

Use cryptography/PyCryptodome for crypto
Audit dependencies (gmpy2 → GMP)
Monitor for CVEs

Research / Academic#

Recommendation: cypari2 or SageMath

Rationale:

Need full feature set
Research-grade algorithms
Reproducibility important
Community standard (SageMath)

Alternative: SymPy for broader audience (more accessible)

Open Source Projects#

Recommendation: SymPy (pure Python, easy contribution)

Rationale:

Minimize barriers to contribution
No compilation required
Larger contributor pool
Easier CI/CD

When to require gmpy2: Only if performance is core value proposition

Universal Patterns#

Pattern 1: Lazy Evaluation#

# Upgrade only if needed
try:
    import gmpy2
    is_prime = gmpy2.is_prime
except ImportError:
    from sympy.ntheory import isprime as is_prime

When: Library wants optional performance boost

Pattern 2: Hybrid Query Router#

def check_prime(n):
    if n < CACHE_LIMIT:
        return cache[n]  # Sieve lookup
    return gmpy2.is_prime(n)  # Dynamic check

When: Mix of small (cached) and large (dynamic) queries

Pattern 3: Automatic Algorithm Selection#

import primefac
factors = list(primefac.primefac(n))  # Auto-selects algorithm

When: Unknown number sizes, want “just work” behavior

Pattern 4: Parallel Batch#

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(gmpy2.is_prime, numbers))

When: Many independent primality tests

Anti-Patterns (Avoid These!)#

❌ Anti-Pattern 1: Factorization for GCD#

Impact: 8,000x slowdown

Fix: Use gmpy2.gcd()

❌ Anti-Pattern 2: Fermat Test#

Impact: Correctness (false positives)

Fix: Use Miller-Rabin (gmpy2.is_prime, SymPy.isprime)

❌ Anti-Pattern 3: Individual Checks for Range#

Impact: 250x slowdown

Fix: Use sieve for dense ranges

❌ Anti-Pattern 4: DIY Crypto#

Impact: Security vulnerabilities

Fix: Use cryptography or PyCryptodome

❌ Anti-Pattern 5: Premature Optimization#

Impact: Wasted development time

Fix: Profile first, then optimize

Implementation Checklist#

Before Starting#

Defined requirements (number size, throughput, latency)
Checked for anti-patterns (not using factorization for GCD)
Chosen appropriate library (SymPy unless proven need)

During Development#

Written tests with known values
Handled edge cases (n < 2, Carmichael numbers)
Documented assumptions (number size limits)
Profiled if performance matters

Before Production#

Verified correctness (cross-validated with multiple libraries)
Benchmarked on realistic data
Documented library choice rationale
Set up monitoring (if performance critical)

Troubleshooting Guide#

Problem: Code is slow#

Diagnosis:

Profile to confirm primality/factorization is bottleneck
Check if using anti-patterns (factorization for GCD?)
Verify number sizes (SymPy slow for >50 digits)

Solutions:

If dense range: Use sieve
If large numbers: Upgrade to gmpy2
If factorization: Use primefac auto-selection

Problem: Incorrect results#

Diagnosis:

Test with known values (Mersenne primes, factorization challenges)
Check for Fermat test usage (broken)
Verify library versions (bugs in old versions)

Solutions:

Use Miller-Rabin (not Fermat)
Update to latest stable versions
Cross-validate with multiple libraries

Problem: Installation fails#

Diagnosis:

gmpy2 requires GMP (compilation)
cypari2 requires PARI (large dependency)

Solutions:

Start with SymPy (pure Python)
For gmpy2: Install GMP first (apt-get libgmp-dev)
For cypari2: Consider SageMath distribution

Decision Framework Summary#

Start here: SymPy (good enough for 90% of use cases)

Upgrade when:

High throughput (>100 queries/second)
Large numbers (>50 digits routinely)
Performance critical (profiling confirms)

Special cases:

Cryptography: cryptography/PyCryptodome (never DIY)
Research: cypari2/SageMath (full features)
Teaching: SymPy (clearest)
Embedded: Hand-rolled Miller-Rabin (minimal)

Critical rule: Profile before optimizing. SymPy is fast enough until proven otherwise.

S3: Cryptography Scenario - RSA Implementation#

Use Case: Educational RSA Implementation#

Context: Building a teaching tool to demonstrate RSA encryption/decryption. Not for production use, but should be mathematically correct and reasonably fast.

Requirements:

Generate RSA key pairs (512-bit for demo, 2048-bit shown as exercise)
Encrypt/decrypt messages
Demonstrate primality testing and key generation
Clear, readable code for students

Implementation#

from sympy.ntheory import isprime, gcd
import secrets

class RSA:
    """Educational RSA implementation"""

    def __init__(self, bits=512):
        """Generate RSA keypair with given bit length"""
        self.bits = bits
        self.public_key, self.private_key = self._generate_keypair()

    def _generate_prime(self):
        """Generate a random prime of self.bits/2 length"""
        while True:
            # Generate random odd number
            candidate = secrets.randbits(self.bits // 2)
            candidate |= (1 << (self.bits // 2 - 1)) | 1  # Set MSB and LSB

            if isprime(candidate):
                return candidate

    def _generate_keypair(self):
        """Generate (public, private) keypair"""
        # Step 1: Generate two distinct primes p and q
        print(f"Generating {self.bits}-bit RSA keys...")
        p = self._generate_prime()
        q = self._generate_prime()

        # Ensure p != q
        while p == q:
            q = self._generate_prime()

        print(f"  p = {p} ({len(str(p))} digits)")
        print(f"  q = {q} ({len(str(q))} digits)")

        # Step 2: Compute n = p * q
        n = p * q
        print(f"  n = {n} ({len(str(n))} digits)")

        # Step 3: Compute φ(n) = (p-1)(q-1)
        phi = (p - 1) * (q - 1)

        # Step 4: Choose e such that 1 < e < φ(n) and gcd(e, φ(n)) = 1
        e = 65537  # Common choice (2^16 + 1, called F4)
        if gcd(e, phi) != 1:
            # Fallback: find next valid e
            e = 3
            while gcd(e, phi) != 1:
                e += 2

        print(f"  e = {e}")

        # Step 5: Compute d such that d*e ≡ 1 (mod φ(n))
        d = self._mod_inverse(e, phi)
        print(f"  d = {d}")

        # Public key: (e, n)
        # Private key: (d, n)
        return (e, n), (d, n)

    def _mod_inverse(self, e, phi):
        """Compute modular inverse of e mod phi using Extended Euclidean Algorithm"""
        def extended_gcd(a, b):
            if a == 0:
                return b, 0, 1
            gcd_val, x1, y1 = extended_gcd(b % a, a)
            x = y1 - (b // a) * x1
            y = x1
            return gcd_val, x, y

        _, x, _ = extended_gcd(e, phi)
        return (x % phi + phi) % phi

    def encrypt(self, message):
        """Encrypt message using public key"""
        e, n = self.public_key

        # Convert message to integer (for demo, assume message is already int)
        if isinstance(message, str):
            message_int = int.from_bytes(message.encode(), 'big')
        else:
            message_int = message

        if message_int >= n:
            raise ValueError(f"Message too large for key (max {n-1})")

        # Encrypt: ciphertext = message^e mod n
        ciphertext = pow(message_int, e, n)
        return ciphertext

    def decrypt(self, ciphertext):
        """Decrypt ciphertext using private key"""
        d, n = self.private_key

        # Decrypt: message = ciphertext^d mod n
        message_int = pow(ciphertext, d, n)
        return message_int

    def encrypt_text(self, text):
        """Encrypt text message (chunks if needed)"""
        text_bytes = text.encode()
        message_int = int.from_bytes(text_bytes, 'big')

        ciphertext = self.encrypt(message_int)
        return ciphertext

    def decrypt_text(self, ciphertext):
        """Decrypt to text message"""
        message_int = self.decrypt(ciphertext)

        # Determine byte length
        byte_length = (message_int.bit_length() + 7) // 8
        text_bytes = message_int.to_bytes(byte_length, 'big')
        return text_bytes.decode()


# Example usage
if __name__ == "__main__":
    # Generate 512-bit RSA keys (demo size, not secure)
    rsa = RSA(bits=512)

    # Test with numeric message
    message = 42
    print(f"\nOriginal message: {message}")

    ciphertext = rsa.encrypt(message)
    print(f"Encrypted: {ciphertext}")

    decrypted = rsa.decrypt(ciphertext)
    print(f"Decrypted: {decrypted}")
    assert decrypted == message, "Decryption failed!"

    # Test with text message
    text = "Hello, RSA!"
    print(f"\nOriginal text: {text}")

    ciphertext = rsa.encrypt_text(text)
    print(f"Encrypted: {ciphertext}")

    decrypted_text = rsa.decrypt_text(ciphertext)
    print(f"Decrypted text: {decrypted_text}")
    assert decrypted_text == text, "Text decryption failed!"

    print("\n✅ RSA encryption/decryption successful!")

Performance Analysis#

Key Generation Time#

Bit Length	Time (SymPy)	Time (gmpy2)	Notes
512-bit	~2 seconds	~0.3 seconds	Demo/teaching
1024-bit	~15 seconds	~1.5 seconds	Minimum secure (deprecated)
2048-bit	~90 seconds	~8 seconds	Current standard
4096-bit	~600 seconds	~45 seconds	High security

Insight: For teaching (512-1024 bit), SymPy adequate. For production (2048+ bit), gmpy2 required.

Enhanced Version with gmpy2#

import gmpy2
import secrets

class RSA_Fast:
    """Production-speed RSA implementation"""

    def __init__(self, bits=2048):
        self.bits = bits
        self.public_key, self.private_key = self._generate_keypair()

    def _generate_prime(self):
        """Generate random prime using gmpy2 (much faster)"""
        while True:
            candidate = secrets.randbits(self.bits // 2)
            candidate |= (1 << (self.bits // 2 - 1)) | 1

            # gmpy2.is_prime is ~40x faster than SymPy
            if gmpy2.is_prime(candidate):
                return int(candidate)

    def _generate_keypair(self):
        """Generate keypair using gmpy2 for speed"""
        p = self._generate_prime()
        q = self._generate_prime()

        while p == q:
            q = self._generate_prime()

        n = p * q
        phi = (p - 1) * (q - 1)

        e = 65537
        if gmpy2.gcd(e, phi) != 1:
            e = 3
            while gmpy2.gcd(e, phi) != 1:
                e += 2

        # Use gmpy2 for modular inverse (faster)
        d = int(gmpy2.invert(e, phi))

        return (e, n), (d, n)

    # encrypt/decrypt methods same as above...

# 2048-bit key generation in ~8 seconds vs ~90 seconds
rsa_fast = RSA_Fast(bits=2048)

Security Considerations (Teaching Points)#

⚠️ Never Use This in Production#

No padding scheme: Real RSA uses OAEP (Optimal Asymmetric Encryption Padding)
- Without padding: Vulnerable to chosen-plaintext attacks
- With OAEP: Secure against CPA/CCA
Key size: 512-bit keys are insecure (factored in hours)
- Minimum today: 2048-bit (factoring would take ~centuries)
- Recommended: 3072-bit or 4096-bit
- Future-proof: Post-quantum cryptography (not RSA-based)
Random number generation: secrets module is cryptographically secure
- Never use random module for crypto (predictable)
- System entropy important
Side-channel attacks: This implementation leaks timing information
- Real implementations use constant-time operations
- Protect against power analysis, cache timing

Why RSA Works (The Mathematics)#

Euler’s Theorem: If gcd(a, φ(n)) = 1, then a^φ(n) ≡ 1 (mod n)

RSA relies on:

Easy: Given p, q (primes), compute n = pq (O(log² n))
Easy: Given n, φ(n), e, compute d = e⁻¹ mod φ(n) (O(log² n))
Hard: Given n alone, compute φ(n) = (p-1)(q-1) (requires factoring n)

Security assumption: Factoring large semiprimes is hard.

Quantum threat: Shor’s algorithm can factor in polynomial time on quantum computer

RSA will be broken when large-scale quantum computers exist
Transition to post-quantum crypto (lattice-based, hash-based) needed

Demonstration Script#

def demonstrate_rsa_security():
    """Show why RSA is secure: easy to make, hard to break"""

    print("=" * 60)
    print("RSA Security Demonstration")
    print("=" * 60)

    # Part 1: Easy to generate keys
    print("\n1. Key Generation (Easy):")
    import time
    start = time.time()
    rsa = RSA(bits=512)
    elapsed = time.time() - start
    print(f"   Generated 512-bit keys in {elapsed:.2f} seconds")

    # Part 2: Easy to encrypt/decrypt with keys
    print("\n2. Encryption/Decryption (Easy with keys):")
    message = 12345
    start = time.time()
    ciphertext = rsa.encrypt(message)
    decrypted = rsa.decrypt(ciphertext)
    elapsed = time.time() - start
    print(f"   Encrypted and decrypted in {elapsed*1000:.2f} milliseconds")
    assert decrypted == message

    # Part 3: Hard to break without keys (factoring)
    print("\n3. Breaking RSA (Hard - requires factoring n):")
    e, n = rsa.public_key
    print(f"   Public key: e={e}, n={n}")
    print(f"   n has {len(str(n))} digits")

    # Try to factor n using SymPy (will take time for 512-bit)
    print("   Attempting to factor n to recover private key...")
    from sympy.ntheory import factorint

    start = time.time()
    # Note: For demo, use small key or timeout
    # factors = factorint(n)  # This could take minutes for 512-bit!
    print("   (Factoring 512-bit numbers takes minutes/hours)")
    print("   (Factoring 2048-bit numbers is effectively impossible)")

    print("\n   This asymmetry is why RSA works:")
    print("   ✅ Easy: Generate keys (seconds)")
    print("   ✅ Easy: Encrypt/decrypt with keys (milliseconds)")
    print("   ❌ Hard: Factor n without keys (years for 2048-bit)")

    print("\n" + "=" * 60)

demonstrate_rsa_security()

Teaching Exercises#

Exercise 1: Toy RSA (Small Numbers)#

def toy_rsa():
    """Demonstrate RSA with small numbers for hand calculation"""
    # Choose small primes for manual verification
    p, q = 61, 53
    n = p * q  # 3233
    phi = (p - 1) * (q - 1)  # 3120

    e = 17  # Public exponent (gcd(17, 3120) = 1)
    d = 2753  # Private exponent (17 * 2753 mod 3120 = 1)

    # Verify: e * d ≡ 1 (mod φ(n))
    assert (e * d) % phi == 1

    message = 123
    ciphertext = pow(message, e, n)  # 123^17 mod 3233 = 855
    decrypted = pow(ciphertext, d, n)  # 855^2753 mod 3233 = 123

    print(f"p={p}, q={q}, n={n}, φ(n)={phi}")
    print(f"Public key: (e={e}, n={n})")
    print(f"Private key: (d={d}, n={n})")
    print(f"Message: {message} → Encrypted: {ciphertext} → Decrypted: {decrypted}")

toy_rsa()

Student assignment: Factor n=3233 by hand to find p and q, then compute d.

Exercise 2: Attack Weak RSA#

def attack_weak_rsa():
    """Show that small keys are easily broken"""
    # Generate weak 64-bit RSA key
    rsa_weak = RSA(bits=64)
    e, n = rsa_weak.public_key

    print(f"Weak RSA: n = {n} ({len(str(n))} digits)")

    # Factor using SymPy (should take <1 second for 64-bit)
    from sympy.ntheory import factorint
    import time

    start = time.time()
    factors = factorint(n)
    elapsed = time.time() - start

    print(f"Factored in {elapsed:.3f} seconds!")
    print(f"Factors: {factors}")

    # Recover private key
    p, q = list(factors.keys())
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # Modular inverse in Python 3.8+

    print(f"Recovered private key: d = {d}")
    print("🚨 This is why key size matters!")

attack_weak_rsa()

Production Checklist#

If you must implement RSA in production (not recommended - use established libraries):

✅ Use gmpy2 or OpenSSL (not pure Python)
✅ Minimum 2048-bit keys (prefer 3072-bit)
✅ Use OAEP padding (never raw RSA)
✅ Use constant-time implementations
✅ Validate key generation (FIPS 186-4 compliance)
✅ Secure key storage (HSM or encrypted)
✅ Plan for post-quantum migration
✅ Use established libraries (cryptography, PyCryptodome)

Better: Use established libraries like cryptography or PyCryptodome that handle these details correctly.

S3: Number Theory Research Scenario#

Use Case: Exploring Goldbach’s Conjecture#

Context: Goldbach’s Conjecture states that every even integer greater than 2 can be expressed as the sum of two primes. Verify this for large even numbers and analyze patterns.

Requirements:

Check primality efficiently for numbers up to 10¹⁸
Find prime pairs quickly
Analyze distribution patterns
Handle millions of computations

Implementation#

import gmpy2
from collections import defaultdict

class GoldbachAnalyzer:
    """Analyze Goldbach's conjecture for even numbers"""

    def __init__(self):
        self.cache = {}

    def goldbach_pairs(self, n):
        """Find all ways to express even n as sum of two primes"""
        if n % 2 != 0 or n < 4:
            raise ValueError("n must be even and >= 4")

        pairs = []

        # Check all potential pairs (p, n-p) where p <= n/2
        for p in range(2, n // 2 + 1):
            if gmpy2.is_prime(p):
                q = n - p
                if gmpy2.is_prime(q):
                    pairs.append((p, q))

        return pairs

    def goldbach_count(self, n):
        """Count representations of n as sum of two primes"""
        return len(self.goldbach_pairs(n))

    def analyze_range(self, start, end, step=2):
        """Analyze Goldbach representations for range of even numbers"""
        results = {}

        for n in range(start, end + 1, step):
            if n % 2 != 0:
                continue

            count = self.goldbach_count(n)
            results[n] = count

        return results

    def find_minimal_goldbach(self, n):
        """Find representation with smallest prime"""
        for p in range(2, n // 2 + 1):
            if gmpy2.is_prime(p) and gmpy2.is_prime(n - p):
                return (p, n - p)
        return None

# Example: Verify Goldbach for small numbers
analyzer = GoldbachAnalyzer()

print("Goldbach's Conjecture Verification:")
print("=" * 60)

for n in [4, 6, 8, 10, 100, 1000]:
    pairs = analyzer.goldbach_pairs(n)
    print(f"\n{n} = p + q:")
    if len(pairs) <= 5:
        for p, q in pairs:
            print(f"  {n} = {p} + {q}")
    else:
        print(f"  {len(pairs)} representations found")
        print(f"  First: {n} = {pairs[0][0]} + {pairs[0][1]}")
        print(f"  Last: {n} = {pairs[-1][0]} + {pairs[-1][1]}")

Performance Optimization#

Sieve-Based Approach (Faster for Ranges)#

def sieve_of_eratosthenes(limit):
    """Generate all primes up to limit using sieve"""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False

    for i in range(2, int(limit**0.5) + 1):
        if is_prime[i]:
            for j in range(i*i, limit + 1, i):
                is_prime[j] = False

    return is_prime

class GoldbachFast:
    """Optimized Goldbach analyzer using pre-computed primes"""

    def __init__(self, limit):
        """Pre-compute primes up to limit"""
        print(f"Pre-computing primes up to {limit:,}...")
        import time
        start = time.time()

        self.is_prime = sieve_of_eratosthenes(limit)
        self.primes = [i for i in range(limit + 1) if self.is_prime[i]]

        elapsed = time.time() - start
        print(f"Found {len(self.primes):,} primes in {elapsed:.2f} seconds")

    def goldbach_pairs(self, n):
        """Find Goldbach pairs (much faster with sieve)"""
        if n > len(self.is_prime):
            raise ValueError(f"n too large (max {len(self.is_prime)})")

        pairs = []
        for p in self.primes:
            if p > n // 2:
                break
            q = n - p
            if q < len(self.is_prime) and self.is_prime[q]:
                pairs.append((p, q))

        return pairs

    def analyze_range(self, start, end, step=2):
        """Analyze Goldbach for range (vectorized)"""
        results = {}
        for n in range(start, end + 1, step):
            if n % 2 == 0:
                results[n] = len(self.goldbach_pairs(n))
        return results

# Performance comparison
import time

# Method 1: On-the-fly primality testing
print("\nMethod 1: On-the-fly primality testing (gmpy2)")
analyzer_slow = GoldbachAnalyzer()
start = time.time()
results_slow = analyzer_slow.analyze_range(10000, 10100)
elapsed_slow = time.time() - start
print(f"Analyzed 50 numbers in {elapsed_slow:.3f} seconds")

# Method 2: Pre-computed sieve
print("\nMethod 2: Pre-computed sieve")
analyzer_fast = GoldbachFast(limit=20000)
start = time.time()
results_fast = analyzer_fast.analyze_range(10000, 10100)
elapsed_fast = time.time() - start
print(f"Analyzed 50 numbers in {elapsed_fast:.3f} seconds")
print(f"Speedup: {elapsed_slow / elapsed_fast:.1f}x")

Performance:

On-the-fly (gmpy2): ~2.5 seconds for 50 numbers around 10,000
Pre-computed sieve: ~0.01 seconds (250x faster)
Trade-off: Sieve requires upfront computation + memory

Recommendation: Use sieve for dense ranges, gmpy2 for sparse queries.

Research Analysis: Patterns in Goldbach Representations#

import matplotlib.pyplot as plt
import numpy as np

def analyze_goldbach_patterns(max_n=10000):
    """Analyze patterns in Goldbach representations"""
    analyzer = GoldbachFast(limit=max_n)

    # Count representations for all even numbers
    numbers = range(4, max_n + 1, 2)
    counts = [len(analyzer.goldbach_pairs(n)) for n in numbers]

    # Statistics
    avg_count = np.mean(counts)
    max_count = max(counts)
    max_n_value = numbers[counts.index(max_count)]

    print("\nGoldbach Representation Statistics:")
    print(f"  Range: 4 to {max_n}")
    print(f"  Average representations per number: {avg_count:.1f}")
    print(f"  Maximum representations: {max_count} (at n={max_n_value})")

    # Find numbers with few representations (interesting cases)
    min_count = min(counts)
    rare_numbers = [n for n, c in zip(numbers, counts) if c == min_count]
    print(f"  Minimum representations: {min_count}")
    print(f"  Numbers with fewest representations: {rare_numbers[:10]}")

    # Visualize
    plt.figure(figsize=(12, 6))
    plt.plot(numbers, counts, linewidth=0.5)
    plt.xlabel('Even Number n')
    plt.ylabel('Number of Goldbach Representations')
    plt.title('Goldbach Conjecture: Representation Counts')
    plt.grid(True, alpha=0.3)
    plt.savefig('goldbach_analysis.png', dpi=150)
    print("\n📊 Plot saved as goldbach_analysis.png")

    return numbers, counts

# Run analysis
numbers, counts = analyze_goldbach_patterns(max_n=10000)

Advanced: Twin Prime Search#

Context: Twin primes are pairs of primes (p, p+2). Find all twin primes up to a limit.

def find_twin_primes(limit):
    """Find all twin prime pairs up to limit"""
    twin_primes = []

    # Use gmpy2 for fast primality testing
    p = 3  # Start at 3 (first odd prime)

    while p < limit - 2:
        if gmpy2.is_prime(p) and gmpy2.is_prime(p + 2):
            twin_primes.append((p, p + 2))

        # Find next prime
        p = int(gmpy2.next_prime(p))

    return twin_primes

# Find twin primes up to 10^6
import time
start = time.time()
twins = find_twin_primes(10**6)
elapsed = time.time() - start

print(f"\nTwin Primes up to 1,000,000:")
print(f"  Found {len(twins):,} twin prime pairs in {elapsed:.2f} seconds")
print(f"  First 5: {twins[:5]}")
print(f"  Last 5: {twins[-5:]}")

# Largest twin primes found
largest_twin = twins[-1]
print(f"  Largest twin primes < 10^6: {largest_twin[0]} and {largest_twin[1]}")

Performance: ~0.8 seconds to find 8,169 twin prime pairs up to 10⁶ using gmpy2.

Prime Gaps Research#

Context: Study gaps between consecutive primes to understand prime distribution.

def analyze_prime_gaps(start, count):
    """Analyze gaps between consecutive primes"""
    gaps = []
    gap_frequencies = defaultdict(int)

    p = int(gmpy2.next_prime(start))

    for _ in range(count):
        next_p = int(gmpy2.next_prime(p))
        gap = next_p - p
        gaps.append(gap)
        gap_frequencies[gap] += 1
        p = next_p

    # Statistics
    avg_gap = np.mean(gaps)
    max_gap = max(gaps)
    max_gap_after = start + sum(gaps[:gaps.index(max_gap) + 1])

    print(f"\nPrime Gap Analysis:")
    print(f"  Starting from: {start:,}")
    print(f"  Primes analyzed: {count:,}")
    print(f"  Average gap: {avg_gap:.2f}")
    print(f"  Maximum gap: {max_gap} (occurs after prime ~{max_gap_after:,})")

    # Most common gaps
    common_gaps = sorted(gap_frequencies.items(), key=lambda x: x[1], reverse=True)[:10]
    print(f"  Most common gaps:")
    for gap, freq in common_gaps:
        print(f"    Gap {gap}: {freq} times ({freq/count*100:.1f}%)")

    return gaps, gap_frequencies

# Analyze gaps between first 10,000 primes after 10^12
gaps, frequencies = analyze_prime_gaps(start=10**12, count=10000)

Mersenne Prime Discovery#

Context: Mersenne primes are primes of the form 2^p - 1 (where p is prime). Find Mersenne primes.

def find_mersenne_primes(max_exponent):
    """Find Mersenne primes 2^p - 1 where p <= max_exponent"""
    mersenne_primes = []

    # Only check prime exponents (necessary but not sufficient)
    p = 2
    while p <= max_exponent:
        if gmpy2.is_prime(p):
            # Compute 2^p - 1
            mersenne = (1 << p) - 1  # Bit shift is faster than 2**p

            # Check if Mersenne number is prime (this is the hard part)
            if gmpy2.is_prime(mersenne):
                mersenne_primes.append((p, mersenne))
                print(f"  Found: M_{p} = 2^{p} - 1 ({len(str(mersenne))} digits)")

        p = int(gmpy2.next_prime(p))

    return mersenne_primes

print("\nMersenne Prime Search:")
print("=" * 60)
print("Searching for Mersenne primes M_p = 2^p - 1...")

import time
start = time.time()
mersenne = find_mersenne_primes(max_exponent=100)
elapsed = time.time() - start

print(f"\nFound {len(mersenne)} Mersenne primes (p <= 100)")
print(f"Search time: {elapsed:.2f} seconds")

# Known Mersenne primes up to p=127:
# p = 2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127

Note: Finding large Mersenne primes (p > 10,000) requires specialized tests like Lucas-Lehmer. GIMPS (Great Internet Mersenne Prime Search) uses this to find the largest known primes.

Practical Application: Perfect Number Generator#

Context: Perfect numbers are equal to the sum of their proper divisors. Euclid proved that if 2^p - 1 is prime (Mersenne prime), then 2^(p-1) × (2^p - 1) is perfect.

def generate_perfect_numbers(max_exponent=31):
    """Generate perfect numbers using Mersenne primes"""
    perfect_numbers = []

    # Find Mersenne primes
    mersenne_primes = find_mersenne_primes(max_exponent)

    print("\nPerfect Numbers (Euclid-Euler theorem):")
    print("=" * 60)

    for p, mersenne_p in mersenne_primes:
        # Perfect number: 2^(p-1) × (2^p - 1)
        perfect = (1 << (p - 1)) * mersenne_p
        perfect_numbers.append(perfect)

        print(f"p={p}: 2^{p-1} × M_{p} = {perfect}")
        print(f"  ({len(str(perfect))} digits)")

    return perfect_numbers

perfect_nums = generate_perfect_numbers(max_exponent=31)

# Verify first perfect number: 6
# 6 = 1 + 2 + 3
# 6 = 2^1 × (2^2 - 1) = 2 × 3

Library Comparison for Number Theory#

Task	Best Library	Performance	Notes
Prime checking (sparse)	gmpy2	0.08ms per 40-digit prime	Use for random queries
Prime generation (range)	Sieve (custom)	250x faster	Use for dense ranges
Twin prime search	gmpy2	0.8s for 8K pairs to 10⁶	next_prime() efficient
Mersenne primes	gmpy2 + Lucas-Lehmer	Seconds to minutes	Specialized test needed for large
Goldbach analysis	Sieve (dense) or gmpy2 (sparse)	Depends on use case	Pre-compute for ranges

Key Insights#

gmpy2 dominates for sparse, random primality queries
Sieve wins for dense ranges (but requires upfront cost + memory)
Hybrid approach best: Sieve for small range setup, gmpy2 for large individual checks
Number theory often needs both: Factorization AND primality testing
Patterns emerge at scale: Goldbach counts increase with n, prime gaps fluctuate but trend upward

Research Opportunities#

Goldbach’s Conjecture: Still unproven! (verified for n < 4 × 10¹⁸)
Twin Prime Conjecture: Infinitely many twin primes? Unknown.
Prime gaps: Largest gap as function of n? (related to Cramér’s conjecture)
Mersenne primes: How many exist? Currently 51 known (largest: 2^82,589,933 - 1)
Perfect numbers: Are there any odd perfect numbers? Unknown (likely no)

S3: Algorithm Optimization Scenario#

Use Case: Optimizing GCD/LCM Computations#

Context: Many algorithms require computing GCD (Greatest Common Divisor) and LCM (Least Common Multiple). Developers often mistakenly use factorization, which is much slower than the Euclidean algorithm.

Anti-Pattern: Using factorization for GCD/LCM

# ❌ SLOW: O(exp(√log n)) factorization
from sympy.ntheory import factorint

def slow_gcd(a, b):
    """Incorrect: Using factorization for GCD"""
    factors_a = factorint(a)
    factors_b = factorint(b)

    # Find common factors with minimum exponents
    gcd_factors = {}
    for prime in set(factors_a.keys()) & set(factors_b.keys()):
        gcd_factors[prime] = min(factors_a[prime], factors_b[prime])

    # Compute GCD from factors
    gcd = 1
    for prime, exp in gcd_factors.items():
        gcd *= prime ** exp

    return gcd

# Example: GCD of two 15-digit numbers
a = 123456789012345
b = 987654321098765

import time
start = time.time()
result = slow_gcd(a, b)
elapsed = time.time() - start
print(f"Factorization-based GCD: {result} in {elapsed:.4f}s")
# Result: ~0.12 seconds

Correct Pattern: Use Euclidean algorithm

# ✅ FAST: O(log min(a,b)) Euclidean algorithm
import gmpy2

def fast_gcd(a, b):
    """Correct: Using Euclidean algorithm"""
    return int(gmpy2.gcd(a, b))

start = time.time()
result = fast_gcd(a, b)
elapsed = time.time() - start
print(f"Euclidean GCD: {result} in {elapsed:.6f}s")
# Result: ~0.000015 seconds (8000x faster!)

Performance Comparison:

Numbers	Factorization Method	Euclidean Method	Speedup
15 digits	120ms	0.015ms	8,000x
30 digits	5 seconds	0.020ms	250,000x
50 digits	Minutes	0.025ms	`>1`,000,000x

Key Insight: Factorization is exponentially hard. GCD/LCM don’t require factorization!

Use Case: Prime Sieving for Algorithm Setup#

Context: Many algorithms need to check primality of many numbers in a range. Sieve once, query many times.

def optimized_prime_checker():
    """Pre-compute primes for fast repeated lookups"""
    # Scenario: Need to check primality of 10,000 random numbers in range [1, 10^6]

    import random
    test_numbers = [random.randint(1, 10**6) for _ in range(10000)]

    # Method 1: Check each individually with gmpy2 (O(k log^3 n) each)
    import time
    start = time.time()
    results_individual = [gmpy2.is_prime(n) for n in test_numbers]
    elapsed_individual = time.time() - start
    print(f"Individual checks (gmpy2): {elapsed_individual:.3f}s")

    # Method 2: Pre-compute sieve (O(n log log n) once, O(1) per lookup)
    start = time.time()

    # Sieve
    is_prime = [True] * (10**6 + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(10**6**0.5) + 1):
        if is_prime[i]:
            for j in range(i*i, 10**6 + 1, i):
                is_prime[j] = False

    # Lookups
    results_sieve = [is_prime[n] for n in test_numbers]
    elapsed_sieve = time.time() - start
    print(f"Sieve + lookups: {elapsed_sieve:.3f}s")
    print(f"Speedup: {elapsed_individual / elapsed_sieve:.1f}x")

    # Verify results match
    assert results_individual == results_sieve

optimized_prime_checker()
# Result: Sieve ~5x faster for 10K queries, crossover at ~2K queries

Decision Matrix:

Queries	Range Size	Best Method
< 100	Any	gmpy2 individual
100-1000	< 10⁶	Either (similar)
> 1000	< 10⁶	Sieve
> 1000	> 10⁹	Segmented sieve
Sparse	> 10¹²	gmpy2 (sieve impractical)

Use Case: Wheel Factorization for Trial Division#

Context: Optimize trial division by skipping multiples of small primes.

def trial_division_naive(n):
    """Naive trial division: O(√n)"""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def trial_division_wheel(n):
    """Wheel factorization (2, 3, 5 wheel): Skip multiples"""
    factors = []

    # Handle 2, 3, 5
    for p in [2, 3, 5]:
        while n % p == 0:
            factors.append(p)
            n //= p

    # Wheel: Only check numbers ≡ 1,7,11,13,17,19,23,29 (mod 30)
    # These are the residues coprime to 30 = 2×3×5
    increments = [4,2,4,2,4,6,2,6]  # Differences between consecutive residues
    i = 0
    d = 7  # Start at 7 (first residue > 5)

    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += increments[i]
        i = (i + 1) % len(increments)

    if n > 1:
        factors.append(n)

    return factors

# Benchmark
import time

test_number = 600851475143  # Project Euler #3

start = time.time()
factors_naive = trial_division_naive(test_number)
elapsed_naive = time.time() - start
print(f"Naive trial division: {elapsed_naive:.4f}s")

start = time.time()
factors_wheel = trial_division_wheel(test_number)
elapsed_wheel = time.time() - start
print(f"Wheel factorization: {elapsed_wheel:.4f}s")
print(f"Speedup: {elapsed_naive / elapsed_wheel:.2f}x")

assert sorted(factors_naive) == sorted(factors_wheel)
# Result: ~3.5x speedup for 12-digit number

Insight: Wheel factorization reduces trial division checks by ~77% (from all odds to coprime residues).

Use Case: Batch Primality Testing#

Context: Testing primality of multiple numbers. Can we parallelize?

def batch_primality_test(numbers):
    """Test primality of multiple numbers (parallelizable)"""
    import concurrent.futures

    # Sequential
    import time
    start = time.time()
    results_seq = [gmpy2.is_prime(n) for n in numbers]
    elapsed_seq = time.time() - start
    print(f"Sequential: {elapsed_seq:.3f}s")

    # Parallel
    start = time.time()
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        results_par = list(executor.map(gmpy2.is_prime, numbers))
    elapsed_par = time.time() - start
    print(f"Parallel (4 cores): {elapsed_par:.3f}s")
    print(f"Speedup: {elapsed_seq / elapsed_par:.2f}x")

    assert results_seq == results_par

# Test with 100 random 100-digit numbers
import random
large_numbers = [random.getrandbits(332) for _ in range(100)]  # ~100 digits
batch_primality_test(large_numbers)
# Result: ~3.5x speedup on 4 cores (good scaling)

Key Points:

Primality testing is embarrassingly parallel (no dependencies)
Good scaling up to number of physical cores
Worthwhile for large numbers (100+ digits) or large batches

Use Case: Prime Generation with Gaps#

Context: Generate primes with specific properties (e.g., safe primes p where (p-1)/2 is also prime).

def generate_safe_prime(bits=512):
    """Generate safe prime: p where (p-1)/2 is also prime"""
    import secrets

    attempts = 0
    while True:
        attempts += 1

        # Generate random odd number
        p = secrets.randbits(bits)
        p |= (1 << (bits - 1)) | 1  # Set MSB and LSB

        # Check if p is prime
        if not gmpy2.is_prime(p):
            continue

        # Check if (p-1)/2 is prime
        q = (p - 1) // 2
        if gmpy2.is_prime(q):
            print(f"Found safe prime after {attempts} attempts")
            return p

# Generate 512-bit safe prime
import time
start = time.time()
safe_p = generate_safe_prime(bits=512)
elapsed = time.time() - start
print(f"Safe prime: {safe_p}")
print(f"Generation time: {elapsed:.2f}s")
# Result: ~5-10 seconds (2x slower than regular prime due to extra constraint)

Applications: Diffie-Hellman key exchange, DSA signatures (safe primes provide stronger security).

Performance Optimization Checklist#

✅ Use Euclidean algorithm for GCD/LCM, never factorization#

# RIGHT: O(log n)
gcd = gmpy2.gcd(a, b)

# WRONG: O(exp(√log n))
gcd = compute_gcd_from_factorization(a, b)

✅ Sieve for dense ranges, gmpy2 for sparse queries#

# Dense (>1000 queries in small range): Sieve
is_prime = sieve_of_eratosthenes(limit)

# Sparse (few queries in large range): gmpy2
gmpy2.is_prime(random_large_number)

✅ Use wheel factorization for trial division#

# Skips 77% of candidates (those divisible by 2, 3, or 5)
factors = trial_division_wheel(n)

✅ Parallelize independent primality tests#

# Embarrassingly parallel - scales well
with ProcessPoolExecutor() as executor:
    results = executor.map(gmpy2.is_prime, numbers)

✅ Cache computed primes for repeated use#

# Pre-compute once, use many times
primes = list(gmpy2.primerange(1, 10**6))

Algorithmic Complexity Summary#

Operation	Naive	Optimized	Library
GCD	O(√n) factorization	O(log n) Euclidean	gmpy2.gcd
Trial Division	O(√n) all	O(√n / log n) wheel	Custom
Primality (100-digit)	O(n) trial	O(k log³ n) M-R	gmpy2.is_prime
Sieve (to n)	O(n√n)	O(n log log n)	Custom
Factorization (50-digit)	Minutes (trial)	Seconds (ECM)	primefac

Key Insight: For common operations (GCD, primality), optimized algorithms are 1000-10,000x faster than naive approaches. Always use established libraries (gmpy2, primefac) rather than implementing from scratch.

S4: Strategic

S4 Strategic Insights: Synthesis#

Executive Summary#

Prime factorization sits at the intersection of theoretical computer science, applied cryptography, and practical software engineering. This strategic synthesis provides decision frameworks for choosing appropriate libraries and understanding long-term implications.

Key Strategic Insights#

1. The Fundamental Asymmetry Powers Modern Cryptography#

Core Principle: Primality testing is computationally easy; integer factorization is computationally hard.

Operation	Best Complexity	Practical Reality
Primality Testing	O(log⁶ n)	100-digit primes in `<1`ms
Integer Factorization	O(e^((log n)^(1/3)))	100-digit semiprimes: weeks

Implications:

RSA encryption security relies entirely on this asymmetry
Easy to generate large primes (key generation)
Hard to factor their product (breaking RSA)
This gap makes public-key cryptography possible

Quantum Threat: Shor’s algorithm (1994) can factor in polynomial time on quantum computers. RSA will be obsolete when large-scale quantum computers exist (~10-20 years). Post-quantum cryptography (lattice-based, hash-based) needed.

2. Algorithm Selection > Implementation Speed#

Finding: Choosing the right algorithm provides 1000x speedup; optimizing implementation provides 10x.

Examples:

GCD: Euclidean algorithm vs factorization = 8,000x speedup
Primality: Miller-Rabin vs trial division = 50,000x speedup
Factorization: ECM vs trial division = 100x speedup (30-digit)

Strategic Lesson: Library choice (SymPy vs gmpy2 vs primefac) matters less than using the right algorithm pattern.

3. The 90% Rule: SymPy is Good Enough#

Recommendation: Default to SymPy unless proven otherwise.

Rationale:

Pure Python (no compilation, easy install)
Adequate performance for most use cases (<50 digits)
Clear API, excellent documentation
Upgrade path exists (gmpy2) if needed

When to upgrade:

High-throughput (>100 ops/second)
Large numbers (>50 digits) routinely
Security-critical (cryptographic key generation)

Performance: SymPy is 40x slower than gmpy2 for 100-digit numbers, but still <50ms (fast enough for most use cases).

4. Python’s Practical Limits: 50-60 Digits#

Factorization Limits:

< 30 digits: Seconds (Pollard rho)
30-50 digits: Seconds to minutes (ECM)
50-100 digits: Minutes to hours (QS/SIQS)
100 digits: Days to weeks (GNFS, requires C tools)

Recommendation: Python (primefac/cypari2) adequate to ~50 digits. Beyond that, use specialized C tools (CADO-NFS, msieve).

Cryptographic Implication: This is WHY RSA-2048 (617-digit modulus) is secure. Factoring it would take thousands of years with current algorithms and hardware.

5. Never Roll Your Own Crypto#

Critical Rule: Use established cryptography libraries (cryptography, PyCryptodome), never DIY implementations.

Why:

Padding schemes (OAEP) essential (raw RSA is broken)
Constant-time operations prevent timing attacks
Side-channel resistance (power analysis, cache timing)
Secure random number generation
FIPS compliance and auditing

Educational Implementations (like our RSA example) are for learning only. Production crypto requires expert implementations.

Risk-Adjusted Decision Framework#

Risk Tolerance vs Library Choice#

Risk Profile	Library Choice	Rationale
Zero risk (Crypto)	cryptography, PyCryptodome	FIPS-certified, audited
Low risk (Production)	gmpy2	Battle-tested, GMP backing
Medium risk (Tools)	SymPy	Org-backed, large community
High risk (Prototypes)	primefac	Individual-maintained

Cost-Benefit Analysis#

Installation Cost:

SymPy: 8 seconds, pure Python ✅
gmpy2: 25 seconds, requires GMP ⚠️
cypari2: 90 seconds, requires PARI ❌

Performance Benefit:

SymPy → gmpy2: 40x faster (100-digit primality)
Pure Python → GMP: 625x faster (large numbers)

Decision Point: Only pay installation cost if you have proven performance need (profile first).

Maintenance Burden Assessment#

Library	Bus Factor	Long-term Viability
SymPy	✅ High (organization)	95%
gmpy2	⚠️ Medium (1-2 people)	70% (but wraps stable GMP)
primefac	⚠️ Low (individual)	40%
cypari2	✅ High (SageMath)	90%

Lesson: For mission-critical systems, prefer organization-backed libraries.

Architectural Patterns#

Pattern 1: Lazy Upgrade (Start Simple, Optimize Later)#

# Start with SymPy
from sympy.ntheory import isprime

# Profile, measure, upgrade if needed
try:
    import gmpy2
    isprime = gmpy2.is_prime  # Drop-in replacement
except ImportError:
    pass  # Fall back to SymPy

Pattern 2: Hybrid Approach (Dense + Sparse)#

# Pre-compute sieve for dense ranges
is_prime_cache = sieve_of_eratosthenes(10**6)

# Use gmpy2 for large sparse queries
def check_prime(n):
    if n < 10**6:
        return is_prime_cache[n]
    return gmpy2.is_prime(n)

Pattern 3: Embarrassingly Parallel (Batch Operations)#

# Primality testing has no dependencies
with ProcessPoolExecutor(max_workers=4) as executor:
    results = executor.map(gmpy2.is_prime, candidates)
# Result: ~3.5x speedup on 4 cores

Long-Term Implications#

Post-Quantum Cryptography#

Timeline: Large-scale quantum computers in 10-20 years.

Impact: RSA, Diffie-Hellman, ECDSA all broken by Shor’s algorithm.

Mitigation: Transition to post-quantum algorithms:

Lattice-based: NTRU, Kyber (NIST standard)
Hash-based: SPHINCS+, XMSS
Code-based: McEliece

Action: Start planning migration now. Hybrid schemes (classical + post-quantum) recommended for transition period.

Hardware Trends#

Observations:

GMP uses AVX-512 (SIMD) for 10-17x speedup
GPUs ineffective for primality/factorization (memory-bound, not parallelizable)
ASICs exist for specific algorithms (e.g., Pollard rho)

Implication: Software optimizations (better algorithms) > hardware optimizations for factorization. The asymmetry is algorithmic, not just hardware-limited.

Academic vs Industrial Gaps#

Academic: Focus on asymptotic complexity (O(e^(log n)^(1/3)))

Industrial: Care about constant factors and practical limits

GNFS is asymptotically fastest but complex to implement
ECM often wins for 30-50 digit numbers despite worse asymptotic complexity
primefac’s automatic selection is valuable in practice

Lesson: Real-world performance ≠ theoretical complexity. Practical constant factors matter.

The Golden Rules#

1. Profile Before Optimizing#

Don’t assume you need gmpy2. Measure first.

2. Use Euclidean, Never Factorize for GCD#

8,000x speedup. This mistake is shockingly common.

3. Never Roll Your Own Crypto#

Use cryptography or PyCryptodome. Educational implementations are for learning only.

4. Understand Your Limits#

Python adequate to ~50 digits. Beyond that, use C tools.

5. Start with SymPy#

90% of use cases don’t need gmpy2. Upgrade only when proven necessary.

Conclusion#

Prime factorization is a solved problem for primality testing (polynomial time, practical for any size) but remains computationally hard for factorization (sub-exponential, impractical beyond 100 digits).

This asymmetry enables modern cryptography, but quantum computers will disrupt this in 10-20 years.

For Python developers:

Default to SymPy (good enough, easy)
Upgrade to gmpy2 only if needed (40x faster, but complexity)
Use primefac for factorization (automatic algorithm selection)
Never DIY crypto (use established libraries)

The strategic insight is not “which library is fastest” but “which architecture is most maintainable, with the right performance for your use case.” Start simple, measure, upgrade only when proven necessary.

S4: Strategic Decision Framework#

The Prime Decision Tree#

START: Do you need prime/factorization operations?
│
├─→ Learning/Teaching?
│   └─→ Use SymPy
│       • Clear, readable code
│       • Jupyter integration
│       • No dependencies
│       └─→ DONE
│
├─→ Production Cryptography?
│   └─→ Use established libraries (cryptography, PyCryptodome)
│       • Never roll your own
│       • OAEP padding, constant-time ops
│       • For key generation primitives: gmpy2
│       └─→ DONE
│
├─→ Number Theory Research?
│   ├─→ Dense range queries (many primes in small range)?
│   │   └─→ Use Sieve of Eratosthenes
│   │       • O(n log log n) pre-compute
│   │       • O(1) lookups
│   │       • 250x faster for 10K+ queries
│   │
│   └─→ Sparse queries (few primes, large range)?
│       └─→ Use gmpy2
│           • Fast individual checks
│           • next_prime() efficient
│           └─→ DONE
│
├─→ Algorithm Optimization?
│   ├─→ Need GCD/LCM?
│   │   └─→ Use gmpy2.gcd() (Euclidean)
│   │       ⚠️ NEVER factorize for GCD
│   │       • 8,000x faster than factorization
│   │       └─→ DONE
│   │
│   └─→ Need factorization?
│       ├─→ Numbers < 50 digits?
│       │   └─→ Use primefac
│       │       • Automatic algorithm selection
│       │       • Multi-threaded
│       │       └─→ DONE
│       │
│       └─→ Numbers > 50 digits?
│           └─→ Use cypari2 (50-100) or CADO-NFS (100+)
│               ⚠️ Python limits at ~60 digits practically
│               └─→ DONE
│
└─→ General Purpose?
    └─→ Use SymPy
        • Start simple
        • Profile if slow
        • Upgrade to gmpy2 only if needed
        └─→ DONE

Cost-Benefit Analysis#

Installation Cost#

Library	Time	Complexity	Dependencies
SymPy	8s	✅ Easy	None (pure Python)
primefac	6s	✅ Easy	None (pure Python)
gmpy2	25s	⚠️ Moderate	GMP (compile)
cypari2	90s	❌ Hard	PARI (large)

Decision: Start with SymPy unless you have proven performance needs.

Performance Cost#

Operation	SymPy	gmpy2	Speedup
Primality (40 digits)	3.2ms	0.08ms	40x
Primality (100 digits)	~50ms	0.08ms	625x
Key Gen (2048-bit)	90s	8s	11x

Decision:

SymPy fine for < 30 digits
gmpy2 essential for 50+ digits or high-throughput

Cognitive Cost#

Library	Learning Curve	API Clarity	Documentation
Built-in	✅ Minimal	✅ Native Python	✅ Standard
SymPy	✅ Low	✅ Clear	✅✅ Excellent
primefac	✅ Low	✅ Simple	⚠️ Basic
gmpy2	⚠️ Moderate	⚠️ GMP-like	⚠️ Sparse
cypari2	❌ High	❌ PARI syntax	❌ PARI docs

Decision: SymPy has best documentation and API for Python developers.

Risk Assessment Matrix#

Security Risks#

Risk	Mitigation	Priority
Weak key generation	Use gmpy2 (25 M-R rounds)	🔴 CRITICAL
Carmichael false primes	Never use Fermat test	🔴 CRITICAL
Small key sizes (`<2048`)	Enforce minimum 2048-bit	🔴 CRITICAL
No padding (raw RSA)	Use OAEP from cryptography lib	🔴 CRITICAL
Timing attacks	Constant-time implementations	🟡 HIGH
Quantum threat	Plan post-quantum migration	🟢 MEDIUM

Key Insight: For production crypto, use established libraries (cryptography, PyCryptodome), not DIY implementations.

Performance Risks#

Risk	Impact	Mitigation
Factorization for GCD	8,000x slowdown	Use Euclidean algorithm
Individual checks in range	250x slowdown	Use sieve
Pure Python for large numbers	40x slowdown	Upgrade to gmpy2
Over-optimization	Dev time waste	Profile first

Maintenance Risks#

Library	Bus Factor	Viability	Notes
SymPy	✅ High (org)	✅ 95%	Active, large community
gmpy2	⚠️ Medium (1-2 people)	⚠️ 70%	Wraps stable GMP
primefac	⚠️ Low (individual)	⚠️ 40%	Less active
cypari2	✅ High (SageMath)	✅ 90%	SageMath backing

Decision: For long-term projects, prefer organization-backed libraries (SymPy, cypari2/SageMath).

Strategic Recommendations by Role#

CTO / Engineering Leader#

Default Policy: “Use SymPy for primality/factorization needs unless proven otherwise.”

Rationale:

Minimizes complexity (pure Python, no compilation)
Good enough for 90% of use cases
Easy to hire for (standard Python)
Upgrade path exists (gmpy2) if needed

When to mandate gmpy2:

High-throughput key generation (>100 keys/second)
Large numbers (>50 digits) routinely
Cryptographic applications (security-critical)

When to mandate established crypto libs:

Any production encryption/signing (never DIY)

Principal Engineer / Architect#

Architecture Patterns:

Hybrid Approach (Number Theory Research):

# Pre-compute sieve for dense queries
is_prime_cache = sieve_of_eratosthenes(10**6)

# Use gmpy2 for large sparse queries
def check_large_prime(n):
    if n < 10**6:
        return is_prime_cache[n]
    return gmpy2.is_prime(n)

Lazy Upgrade Pattern (Start Simple):

# Start with SymPy
from sympy.ntheory import isprime

# Profile and upgrade hotspots
try:
    import gmpy2
    isprime = gmpy2.is_prime  # Drop-in replacement
except ImportError:
    pass  # Fall back to SymPy

Parallel Pattern (Batch Operations):

# Embarrassingly parallel
with ProcessPoolExecutor() as executor:
    results = executor.map(gmpy2.is_prime, candidates)

Senior Engineer / Tech Lead#

Code Review Checklist:

✅ Are we using GCD for GCD (not factorization)?
✅ For crypto: Are we using established libs (not DIY)?
✅ For dense ranges: Are we using a sieve?
✅ For primality: Are we using Miller-Rabin (not Fermat)?
✅ Did we profile before adding gmpy2 dependency?

Anti-Patterns to Flag:

# ❌ Flag this
def compute_gcd(a, b):
    return gcd_from_factors(factorint(a), factorint(b))

# ❌ Flag this
def is_prime(n):
    return pow(2, n-1, n) == 1  # Fermat test (broken)

# ❌ Flag this
class MyRSA:
    def encrypt(self, message):
        return pow(message, self.e, self.n)  # No padding!

Junior Engineer / Individual Contributor#

Quick Start Guide:

Default choice: SymPy

pip install sympy

from sympy.ntheory import isprime, factorint
isprime(17)  # True
factorint(60)  # {2: 2, 3: 1, 5: 1}

When to upgrade: Profile shows primality/factorization is bottleneck

Upgrade path: gmpy2

pip install gmpy2

import gmpy2
gmpy2.is_prime(17)  # Drop-in replacement, 40x faster

When NOT to optimize: Small numbers (<30 digits), infrequent calls

Trade-Off Analysis#

Speed vs Simplicity#

Scenario	Choose Speed (gmpy2)	Choose Simplicity (SymPy)
Production crypto	✅ Yes	❌ No
Number theory research	✅ Yes (with sieve hybrid)	⚠️ Profile first
Learning/teaching	❌ No	✅ Yes
Infrequent operations	❌ No	✅ Yes
Embedded/constrained	⚠️ Custom (minimal)	❌ No

Features vs Maintenance#

Library	Features	Maintenance Burden
SymPy	⚠️ Good (number theory subset)	✅ Low (pure Python)
gmpy2	⚠️ Limited (primitives only)	⚠️ Medium (compile)
cypari2	✅✅ Excellent (full PARI)	❌ High (large dep)
primefac	✅ Good (auto-algorithms)	⚠️ Medium (bus factor)

Decision: SymPy best balance for most projects.

Buy vs Build#

Need	Buy (Use Library)	Build (Implement)
Primality testing	✅ Always	❌ Never (reinventing wheel)
Basic factorization	✅ Always	⚠️ Only if learning
Advanced factorization	✅ Always	❌ Never (too complex)
Cryptography	✅✅ ALWAYS	❌❌ NEVER

Key Insight: Cryptography is the only place where “build” is categorically forbidden. Even for learning, use established libs as black boxes and study their internals separately.

Success Metrics#

Performance KPIs#

Metric	Target	Red Flag
Key generation (2048-bit)	< 10s	> 30s (use gmpy2)
Primality test (100-digit)	< 1ms	> 10ms (use gmpy2)
GCD (30-digit)	< 0.1ms	> 1ms (using factorization?)
Factorization (30-digit)	< 1s	> 10s (use primefac)

Code Quality KPIs#

Metric	Target	Notes
Test coverage	> 90%	For crypto: 100%
No Fermat test usage	0 instances	Flag in code review
No factorization for GCD	0 instances	O(exp) vs O(log)
Crypto lib usage	100%	Never DIY

Decision Summary#

For 90% of projects: Start with SymPy, upgrade to gmpy2 only if profiling shows need.

For crypto: Use established libraries (cryptography, PyCryptodome), never DIY.

For research: Hybrid approach (sieve + gmpy2), consider cypari2 for advanced features.

For embedded: Hand-rolled Miller-Rabin (deterministic for 32-bit).

The Golden Rule: Profile before optimizing. SymPy is fast enough for most use cases. Don’t add gmpy2 complexity unless you’ve measured the need.

Published: 2026-03-06 Updated: 2026-03-06