1.152.1 CJK Readability Analysis#

Comprehensive analysis of Chinese text readability assessment tools and methodologies for matching content to learner proficiency levels. Covers character-based and word-based approaches, HSK/TOCFL standards, frequency analysis, and machine learning methods across web tools, Python libraries, and academic research systems.


Explainer

CJK Readability Analysis#

What This Solves#

Imagine giving someone a book in a language they’re learning. In alphabetic languages like English or Spanish, they can at least attempt every word—sound it out, guess from context, look it up. In Chinese, if they don’t know a character, they’re completely stuck. They can’t sound it out phonetically. They can’t even look it up easily without knowing how to write it or the pronunciation.

CJK readability analysis solves this fundamental problem: determining whether a Chinese learner can actually read a piece of text before they waste time trying. It takes any Chinese text and answers: “What proficiency level (HSK 1-6, TOCFL levels, etc.) does someone need to understand this?”

This matters to three groups:

  1. Language learning platforms need to match content to learner levels automatically (you can’t have humans reading every article)
  2. Content creators need to know if they’re writing at the right difficulty (textbook authors, simplified news sites)
  3. Learners themselves want to find materials they can actually read (not too easy, not impossibly hard)

Without automated analysis, these groups resort to manual assessment (expensive, slow) or guess-and-check (frustrating for learners).

Accessible Analogies#

Think of Chinese characters like a massive LEGO collection with 3,000+ unique pieces. Learning Chinese means gradually acquiring these pieces:

  • HSK 1 learner: 300 pieces (can build simple structures)
  • HSK 6 learner: 2,500 pieces (can build complex models)

Now imagine you want to give someone assembly instructions. Before handing them over, you scan the instruction manual: “This design requires 847 different LEGO pieces. Do you have them all?”

That’s readability analysis. It looks at the text (the instruction manual), counts which unique characters (LEGO pieces) are needed, and compares against standardized learner levels.

The challenge: Unlike LEGO where you can see which pieces you need, Chinese text doesn’t have spaces between words. It’s like a bag of attached LEGO pieces you need to separate first. This separation step (called “segmentation”) is why Chinese readability analysis is more complex than counting words in English.

Another angle: Character frequency works like cooking skills. Common characters (的, 是, 我) are like salt and pepper—used in almost every dish, learned first. Rare vocabulary (辩证法 “dialectics”) is like saffron—specialized, learned later. Readability analysis counts how much “saffron” vs “salt” is in a text to determine if a beginner cook can handle the recipe.

When You Need This#

You definitely need this if:

  • You run a language learning app and want to recommend content automatically (“here are 10 articles at your level”)
  • You’re building a digital library with graded readers and need to categorize thousands of texts
  • You create educational materials and want real-time feedback on whether you’re writing at the target level
  • You manage a news site offering “Easy Chinese” versions and need to validate simplifications

You probably need this if:

  • You’re designing accessibility features for Chinese content (simplified government documents, healthcare info)
  • You’re researching second-language acquisition and need to control for text difficulty
  • You’re building translation tools that should simplify for learners, not just translate accurately

You DON’T need this if:

  • Your users are native speakers (they already know all common characters)
  • You’re doing general NLP (sentiment analysis, classification) where readability is irrelevant
  • You’re working with non-Chinese languages (completely different problem—this research doesn’t transfer)

The decision hinges on: Are you matching content difficulty to learner proficiency at scale? If yes, automate. If it’s a one-time task, use free web tools.

Trade-offs#

Simple vs Sophisticated#

Coverage formula approach (count characters known at HSK level X):

  • ✅ Fast (milliseconds per text)
  • ✅ Easy to explain to users (“You know 94% of characters = HSK 3”)
  • ✅ Works well for learner apps (good enough accuracy)
  • ❌ Ignores sentence complexity, discourse structure
  • ❌ Assumes all HSK 3 characters are equally difficult (not true)

Machine learning approach (82+ linguistic features):

  • ✅ More accurate (accounts for sentence structure, vocabulary patterns)
  • ✅ Can provide diagnostics (“too many compound sentences for this level”)
  • ❌ Slower (requires full NLP pipeline: segmentation, parsing, POS tagging)
  • ❌ Harder to explain (“the SVM says it’s level 5” doesn’t help users understand why)
  • ❌ Requires training data (labeled textbooks, expert assessments)

Most language learning apps use the simple approach. Publishers and educators use ML when they need fine-grained assessment and can afford the complexity.
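The simple coverage approach is small enough to sketch directly. A minimal illustration in Python, assuming tiny stand-in character sets (the real HSK lists have hundreds of characters per level):

```python
# Coverage-formula sketch: find the lowest HSK level whose cumulative
# character inventory covers >= 95% of the unique characters in a text.
# The character sets below are tiny illustrative stand-ins, NOT the
# official HSK lists.
HSK_CHARS = {
    1: set("我你他是的不了人在有"),
    2: set("学生中国爱说话再见谢"),
    3: set("需要时间工作朋友因为"),
}

def estimate_level(text, threshold=0.95):
    # Keep only CJK ideographs; punctuation and Latin letters are ignored.
    chars = {c for c in text if "\u4e00" <= c <= "\u9fff"}
    if not chars:
        return None
    known = set()
    for level in sorted(HSK_CHARS):
        known |= HSK_CHARS[level]  # levels are cumulative
        if len(chars & known) / len(chars) >= threshold:
            return level
    return "above listed levels"

print(estimate_level("我是学生"))  # resolves at level 2 with these toy sets
```

Because the loop unions each level's characters into `known`, the function naturally returns the lowest level that clears the threshold.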

Character-Based vs Word-Based#

Character-based (HSK lists are characters):

  • ✅ Aligns with how learners study (character lists, flashcards)
  • ✅ Simpler implementation (no word segmentation needed)
  • ❌ Misses vocabulary nuance (knowing 研 + 究 individually doesn’t mean you know 研究 “research” as a word)

Word-based (TOCFL focuses on vocabulary):

  • ✅ Better reflects actual reading comprehension
  • ✅ More accurate for intermediate/advanced learners
  • ❌ Requires accurate word segmentation (adds complexity, potential errors)
  • ❌ Harder to align with learning materials (most resources teach characters, not word lists)

The trend is starting with characters (MVP) and adding word-based analysis for advanced learners.

Build vs Buy#

Self-hosted (use open-source libraries, HSK word lists):

  • ✅ No recurring costs (just server hosting)
  • ✅ Full control over algorithm, customization
  • ❌ Initial development time (1-2 weeks for basic version)
  • ❌ Maintenance burden (keep HSK lists updated, handle edge cases)

Commercial API (Google Cloud NLP, LTP-Cloud):

  • ✅ Fast integration (1-2 days)
  • ✅ Fully managed (no infrastructure worries)
  • ❌ Recurring costs (~$1 per million characters analyzed)
  • ❌ Less control (can’t customize for your domain)
  • ❌ APIs don’t specifically support HSK levels (you’d still build that layer)

Break-even point: If you’re analyzing more than ~5 million texts per year, self-hosting becomes cheaper.

Cost Considerations#

Free tier (web tools like HSK Analyzer):

  • Good for manual testing, one-off analysis
  • Not for production use (rate limits, no API)

DIY approach:

  • Development: ~$5K-$10K (1-2 weeks)
  • Hosting: $20-50/month for 1M texts/month
  • Year 1 total: ~$7K-$12K

Commercial APIs:

  • Google Cloud NLP: $1 per 1M characters after 5M free tier
  • At 50M characters/year: ~$45/year in API costs
  • But APIs don’t give you HSK levels—you still build that logic yourself
  • Better suited if you need multi-language NLP beyond just Chinese readability

Open-source libraries (HSK Character Profiler, etc.):

  • Integration: $500-$1.5K (1-3 days)
  • Hosting: $0 (runs in your app)
  • Year 1 total: ~$1K-$2K
  • Sweet spot for most language learning apps

Enterprise/Academic (CRIE-style ML system):

  • Development: $50K-$100K (3-6 months, requires NLP expertise)
  • Only makes sense if you need research-grade accuracy and diagnostic features
  • For publishers, large educational institutions

The pattern: Start cheap with open-source libraries. Upgrade to custom build if you hit scale or need specific features. Only go commercial API if you’re already using those platforms for other NLP tasks.

Implementation Reality#

First 30 Days#

Week 1: Integrate an open-source library (HSK Character Profiler) or build a simple coverage formula. You’ll have a working prototype that can say “this text is HSK 3” with ~80% accuracy.

Weeks 2-4: Add caching (texts get analyzed repeatedly), error handling, and tests against sample texts at known levels. Deploy with basic API endpoint.

What Actually Takes Time#

  1. Segmentation edge cases: Medical terms, internet slang, names—Jieba will mess these up. You’ll need custom dictionaries.
  2. HSK 3.0 migration: New standard takes effect July 2026. You’ll maintain two versions during transition (2026-2027).
  3. Threshold tuning: Is 95% character coverage “readable”? Depends on your users. Expect A/B testing to find the right balance.
  4. Traditional vs Simplified: Not a 1:1 character mapping. Need separate frequency lists and proper conversion libraries.

Common Pitfalls#

  • Assuming perfect segmentation: Jieba is ~95% accurate. That 5% error rate cascades into readability errors.
  • Treating all HSK 3 characters as equally difficult: Frequency and context matter. 的 (most common character) vs 辩证法 (rare academic term).
  • Ignoring idioms: 成语 (4-character idioms) must be learned as units, not as individual characters.
  • Over-engineering for MVP: Start with character coverage. Don’t build the ML system until you know you need it.

Team Skills Required#

  • Basic version: One Python developer familiar with pip install and REST APIs (junior level is fine)
  • Production version: Mid-level developer who can handle caching, error handling, deployment
  • ML version: NLP engineer or data scientist with experience in text classification, training ML models

Most teams can ship a working readability analyzer in 1-2 weeks without NLP expertise. The sophisticated stuff (CRIE-level) is a multi-month project requiring specialists.

Realistic Expectations#

You’ll get 80-90% agreement with human assessors on exact level match, 95%+ within ±1 level. That’s good enough for learner apps. If you need research-grade precision, budget for the ML approach and several months of development.

The technology is mature. The tools exist. The main challenge is deciding how much accuracy you need versus how much complexity you can handle.

S1: Rapid Discovery

S1: Surface Scan - CJK Readability Analysis#

What This Is#

CJK readability analysis evaluates Chinese text difficulty based on character frequency and proficiency level standards (HSK for mainland China, TOCFL for Taiwan). It answers: “What proficiency level does a learner need to read this text?”

The Core Problem#

Unlike alphabetic languages where ~26 letters form all words, Chinese uses thousands of characters. A learner with 500 characters can’t read text requiring 2,000 characters. Readability analysis maps text to standardized proficiency levels so learners know if material matches their skill.

Key Standards#

HSK (Hanyu Shuiping Kaoshi)#

  • Old system: 6 levels (HSK 1-6)
  • New system (2026): 9 levels, effective July 2026
  • Character requirements:
    • HSK 1: ~300 characters
    • HSK 6: ~2,500 characters
    • HSK 9: 3,000+ characters (academic/professional)
  • Most common 1,000 characters cover ~90% of everyday written Chinese
  • 2,500 characters cover ~98% of common texts
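The coverage percentages above come from cumulative frequency counts, and the calculation itself is trivial. A toy sketch with a made-up ten-character corpus (the real ~90%/~98% figures require corpus-scale data such as Jun Da's lists):

```python
from collections import Counter

# Cumulative-coverage sketch: what share of a corpus do the top-N most
# frequent characters account for? The corpus here is a toy stand-in.
corpus = "的的的的是是是我我了"
counts = Counter(corpus).most_common()  # sorted by descending frequency
total = sum(n for _, n in counts)

def coverage_of_top(n):
    return sum(c for _, c in counts[:n]) / total

print(coverage_of_top(2))  # share covered by the two most frequent characters
```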

TOCFL (Test of Chinese as a Foreign Language - Taiwan)#

  • 8 levels: Novice 1-2, Levels 1-6
  • Organized in 4 bands (Novice, A, B, C)
  • Band A: 500-1,000 characters (240-720 learning hours)
  • Uses TBCL (Taiwan Benchmarks for the Chinese Language): 3,100 characters, 14,425 words
  • Focuses on vocabulary words (ci) rather than character counts

Existing Tools#

Web-Based#

  • Chinese Text Analyser (chinesetextanalyser.com): Fast segmentation/analysis
  • HSK Analyzer (hskhsk.com/analyse): HSK level breakdown per text
  • Chinese Text Analyzer (chine-culture.com): Junda frequency list analysis

Academic/Research#

  • CRIE (Chinese Readability Index Explorer): 82 multilevel linguistic features
  • CkipTagger (Sinica-Taiwan): POS tagging and tokenization

Libraries#

  • cntext (Python): Word frequency, readability, sentiment analysis
  • Jieba: Word segmentation (used by many tools)
  • chinese-text-analyzer (GitHub): HSK breakdown with Jieba

Key Insights#

  1. Character vs word: Some systems use character counts, others word counts (ci)
  2. Frequency lists: Junda, HSK official lists, TOCFL/TBCL lists
  3. Coverage metrics: “90% coverage at HSK 3” = learner knows 90% of characters
  4. Segmentation required: Chinese text has no spaces; must tokenize before analysis


S2: Comprehensive

S2: Structure - How CJK Readability Analysis Works#

Core Algorithm Pipeline#

Raw Chinese Text
    ↓
1. Text Segmentation (word boundary detection)
    ↓
2. Character/Word Frequency Analysis
    ↓
3. Linguistic Feature Extraction
    ↓
4. Readability Classification (HSK/TOCFL level)

1. Text Segmentation (Jieba Algorithm)#

The Problem#

Chinese text has no spaces - “研究生命” could be “研究/生命” (study / life) or “研究生/命” (graduate student / fate). Must determine word boundaries before analysis.

How Jieba Works#

  1. Prefix dictionary structure: Fast word graph scanning
  2. DAG construction: Builds directed acyclic graph of all possible word combinations
  3. Dynamic programming: Identifies most probable combination based on word frequency
  4. HMM for unknowns: Uses Hidden Markov Model (Viterbi algorithm) for new words not in dictionary
  5. Character-based tagging: Recognizes new words through statistical character patterns
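Steps 2-3 above (DAG construction plus dynamic programming) can be illustrated with a stripped-down max-probability segmenter. The dictionary and frequencies below are toy stand-ins for jieba's shipped dictionary, and the HMM fallback for unknown words is omitted:

```python
import math

# Toy word dictionary with made-up frequencies (jieba ships a real one).
FREQ = {"我": 500, "爱": 300, "你": 400, "我爱": 5, "北京": 200, "天安门": 50}
TOTAL = sum(FREQ.values())

def segment(text):
    """Max-probability segmentation via dynamic programming over a word DAG."""
    n = len(text)
    # best[i] = (log-prob of the best segmentation of text[:i], split point)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # consider words up to 4 chars
            word = text[j:i]
            if len(word) > 1 and word not in FREQ:
                continue  # unknown multi-char spans are skipped
            freq = FREQ.get(word, 0.5)  # tiny floor for unknown single chars
            score = best[j][0] + math.log(freq / TOTAL)
            if score > best[i][0]:
                best[i] = (score, j)
    # Walk back through the recorded split points.
    out, i = [], n
    while i > 0:
        j = best[i][1]
        out.append(text[j:i])
        i = j
    return out[::-1]

print(segment("我爱你"))
```

With these frequencies, "我/爱/你" outscores "我爱/你" because the individual words are far more frequent than the rare dictionary entry "我爱".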

Alternatives#

  • BERT-based segmentation: Deep learning approach for specialized domains (geoscience, legal, etc.)
  • CkipTagger (Sinica-Taiwan): POS tagging + tokenization
  • HanLP: More sophisticated NLP pipeline

2. Character/Word Frequency Analysis#

Frequency Datasets#

SUBTLEX-CH

  • 46.8 million characters from film/TV subtitles
  • 33.5 million words
  • Psychologically/cognitively relevant (reflects real usage)

Jun Da Corpus

  • 9,933 characters
  • Most common character (的) appears 7,922,684 times
  • 1,000 most common = 89% coverage

FineFreq

  • Web-scale multilingual dataset
  • Covers Mandarin + other high-resource languages

Key Metrics#

  • Coverage: % of text a learner can read at their level
  • Shannon entropy: Chinese “alphabet” = 9.56 bits/character (much higher than alphabetic languages)
  • Zipf distribution: Frequency follows power law (few characters = most usage)
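Shannon entropy per character is straightforward to compute from a character distribution; a stdlib sketch (the 9.56 bits/character figure comes from corpus-scale counts, not toy strings like this):

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy in bits/character of a text's character distribution."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A uniform four-character string carries exactly 2 bits per character.
print(char_entropy("我你他她"))  # → 2.0
```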

3. Linguistic Feature Extraction#

Traditional Features (Easy to Count)#

  • Character count
  • Word count
  • Average sentence length
  • Vocabulary difficulty (based on HSK/TOCFL lists)
  • Vocabulary frequency (from frequency corpora)
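These traditional features need nothing beyond counting. A sketch assuming tokens already come from a segmenter such as jieba, with sentence boundaries approximated by CJK end punctuation:

```python
import re

def traditional_features(words):
    """Cheap 'traditional' features from a pre-segmented token list.

    `words` would come from a segmenter such as jieba; sentence boundaries
    are taken from CJK end punctuation (。！？).
    """
    text = "".join(words)
    sentences = [s for s in re.split(r"[。！？]", text) if s]
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    content = [w for w in words
               if any("\u4e00" <= c <= "\u9fff" for c in w)]  # drop punctuation tokens
    return {
        "char_count": len(chars),
        "unique_chars": len(set(chars)),
        "word_count": len(content),
        "avg_sentence_len": len(chars) / max(len(sentences), 1),
        "type_token_ratio": len(set(content)) / max(len(content), 1),
    }

feats = traditional_features(["我", "是", "学生", "。", "我", "爱", "北京", "。"])
```

Vocabulary difficulty and frequency features would then look each token up in HSK lists and a frequency corpus, respectively.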

Advanced Features (CRIE: 82 total)#

Character Level

  • Total characters
  • Unique characters
  • Character frequency distribution

Word Level

  • Word length
  • Word frequency
  • Vocabulary diversity (type-token ratio)

Sentence Level

  • Sentence length
  • Clause complexity
  • Dependency tree depth

Discourse Level

  • Cohesion metrics
  • Semantic relations
  • Topic consistency

CTAP (196 Features)#

4 levels: character, word, sentence, discourse

  • More comprehensive than CRIE
  • Includes syntactic complexity, lexical sophistication

4. Readability Classification#

Simple Formula Approach#

Readability Score = f(characters, words, avg_sentence_length)
  • Linear regression on 3 variables
  • Fast but less accurate
  • Good for quick estimates
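A linear formula of this shape might look as follows; the coefficients are illustrative placeholders, since real ones are fitted by regression against texts with known grade labels:

```python
# Linear readability-formula sketch. The coefficients are made-up
# placeholders, NOT fitted values; a real system regresses them against
# a corpus of texts labeled by grade level.
def readability_score(char_count, word_count, avg_sentence_length):
    return (0.002 * char_count
            + 0.003 * word_count
            + 0.2 * avg_sentence_length)

score = readability_score(char_count=500, word_count=300, avg_sentence_length=18)
```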

Machine Learning Approach (CRIE)#

Training Data

  • Taiwanese primary/secondary school textbooks
  • Pre-labeled by grade level (1-9)

Model: Support Vector Machine (SVM)

  • Learns nonlinear relationships between 70-82 features
  • Maps data to high-dimensionality space
  • More accurate than linear formulas
  • Can provide diagnostic information (which features make text hard?)

Output

  • Grade level classification (1-9)
  • Diagnostic report (which linguistic features contribute to difficulty)

HSK/TOCFL Level Mapping#

Character Coverage Method

def get_hsk_level(text):
    # Character-level analysis needs no word segmentation: collect the
    # unique Han characters (U+4E00 to U+9FFF) directly.
    chars_in_text = {c for c in text if "\u4e00" <= c <= "\u9fff"}
    if not chars_in_text:
        return None

    for level in [1, 2, 3, 4, 5, 6]:
        # get_hsk_chars(level) must return the CUMULATIVE character set
        # for levels 1 through `level` (HSK inventories nest).
        known_chars = get_hsk_chars(level)
        coverage = len(chars_in_text & known_chars) / len(chars_in_text)

        if coverage >= 0.95:  # 95% coverage threshold
            return level

    return "above HSK 6"

Vocabulary Coverage Method

  • Same approach but uses words (ci) instead of characters
  • More accurate for TOCFL (word-focused)
  • Requires segmentation first
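The word-based variant changes only the unit of coverage. A sketch with toy word lists standing in for official HSK vocabulary, assuming tokens arrive pre-segmented (e.g. from jieba):

```python
# Word-coverage sketch. `tokens` would come from a segmenter (e.g. jieba);
# the word lists here are tiny illustrative stand-ins for official lists.
HSK_WORDS = {
    1: {"我", "你", "是", "爱"},
    2: {"学生", "朋友", "北京"},
}

def word_coverage(tokens, max_level):
    # Union the word lists for levels 1..max_level (lists nest).
    known = set().union(*(HSK_WORDS[l] for l in range(1, max_level + 1)))
    content = [t for t in tokens if t not in "，。！？"]  # drop punctuation
    return sum(t in known for t in content) / max(len(content), 1)

cov = word_coverage(["我", "是", "学生", "。"], max_level=2)
```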

Technical Challenges#

  1. Segmentation ambiguity: “研究生命” can split as 研究/生命 (“study life”) or 研究生/命 (“a graduate student’s fate”)
  2. New words: Internet slang, neologisms not in dictionaries
  3. Domain-specific vocabulary: Medical/legal text needs specialized dictionaries
  4. Simplified vs Traditional: Two character sets with different frequency patterns
  5. Context dependence: Character difficulty varies by context (行 is easy in 不行 “no way” but harder in 银行 “bank” or 行业 “industry”)

Performance Considerations#

  • Jieba segmentation: ~200K chars/second (fast enough for most use cases)
  • Feature extraction: Depends on depth (CRIE 82 features slower than simple 3-feature formula)
  • SVM prediction: Fast once trained (~milliseconds per text)
  • Bottleneck: Usually segmentation + NLP parsing for syntactic features


S3: Need-Driven

S3-Need-Driven Approach#

Objective#

Analyze real-world use cases for CJK readability assessment. Move from “what can these libraries do” to “what problems do they solve” and “who needs them.”

Methodology#

  1. Identify 3-5 concrete application categories
  2. For each use case, define:
    • The user persona and their goal
    • The specific technical requirements
    • Which library features are essential vs nice-to-have
    • Trade-offs specific to that use case
    • Library recommendation with rationale
  3. Avoid abstract capabilities; focus on actual workflows

Use Cases Selected#

1. Language Learning Applications#

User: App developers matching content difficulty to learner proficiency
Example: Duolingo, HelloChinese, ChinesePod grading lesson content

2. Graded Reader Publishers#

User: Publishers categorizing books/articles by reading difficulty
Example: Mandarin Companion, The Chairman’s Bao content leveling

3. Educational Content Creators#

User: Textbook authors and educators validating material difficulty
Example: Teachers ensuring vocabulary matches curriculum standards (HSK, TOCFL)

4. Reading Assistant Tools#

User: Developers building browser extensions and e-reader features
Example: Zhongwen popup dictionary showing character frequencies, difficulty warnings

5. Curriculum Designers#

User: Language program coordinators planning lesson progression
Example: University programs sequencing materials from HSK 1 → HSK 6

Analysis Framework#

For each use case, address:

Requirements#

  • What readability metrics are needed? (character frequency, word frequency, HSK/TOCFL levels)
  • Is batch analysis or real-time assessment required?
  • How important is accuracy vs speed?
  • What granularity is needed? (character-level, word-level, document-level)
  • Are there performance constraints? (mobile apps vs server processing)

Library Fit#

  • Which library’s strengths align with this use case?
  • What features are critical vs optional?
  • Are there missing capabilities?

Implementation Considerations#

  • Typical code patterns for this use case
  • Integration challenges
  • Performance implications
  • Edge cases to handle (proper names, specialized vocabulary, mixed scripts)

Decision Factors#

  • Why one library over another?
  • When would you need multiple libraries?
  • When is a custom solution required?

CJK Readability Dimensions#

Character Frequency#

  • Jun Da frequency lists: 8000+ characters ranked by corpus frequency
  • Modern Chinese Character Frequency List: BCC corpus (10 billion characters)
  • Use: Identify rare characters that signal difficulty

Word-Level Proficiency Standards#

  • HSK (Hanyu Shuiping Kaoshi): 6 levels, ~5000 words for HSK 6
  • TOCFL (Test of Chinese as a Foreign Language): Taiwan standard, 8000+ words
  • Use: Map vocabulary to standardized learning curricula

Readability Metrics#

  • Character coverage: % of text composed of high-frequency characters
  • Unknown word ratio: % of words outside learner’s level
  • Average word frequency: Lower rank = harder text
  • Unique character count: More unique characters = more cognitive load
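Two of these metrics (unknown word ratio and average word frequency) reduce to a few lines once tokens and frequency data are available; the rank table and learner vocabulary below are illustrative stand-ins:

```python
# Metric sketch: unknown-word ratio and mean frequency rank for a token
# list. FREQ_RANK and KNOWN are illustrative stand-ins for real data.
FREQ_RANK = {"我": 3, "喜欢": 90, "茶": 300, "饕餮": 9500}
KNOWN = {"我", "喜欢", "茶"}  # the learner's vocabulary

def metrics(tokens):
    unknown = [t for t in tokens if t not in KNOWN]
    # Words absent from the rank table get a large out-of-list rank.
    ranks = [FREQ_RANK.get(t, 10**5) for t in tokens]
    return {
        "unknown_ratio": len(unknown) / len(tokens),
        "mean_freq_rank": sum(ranks) / len(ranks),
    }

m = metrics(["我", "喜欢", "茶", "饕餮"])
```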

Success Criteria#

  • Clear guidance for developers choosing readability tools
  • Realistic assessment of what each library enables
  • Identification of gaps no library fills
  • Practical code patterns for common scenarios
  • Honest trade-off discussions (not just feature promotion)

S3 Need-Driven: Recommendations#

Overview#

CJK readability assessment has no single library solution. Every use case requires integrating multiple data sources with custom logic. This analysis examined 5 distinct use cases to understand real-world requirements.

Key Finding: Multi-Library Integration is Universal#

None of the 5 use cases can be solved with a single library. All require:

  1. Word segmentation (jieba, jieba.js)
  2. Character/word frequency data (Jun Da, BCC Corpus, SUBTLEX-CH)
  3. Proficiency mapping (HSK/TOCFL word lists)
  4. Custom business logic (coverage calculation, difficulty scoring, filtering)
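Those four layers typically compose into one short pipeline. A skeleton with stand-in data, where a real build would substitute jieba for segment(), a frequency file for FREQ_RANK, and official lists for HSK_WORDS:

```python
# Pipeline skeleton wiring the four layers together. All data below is
# stand-in: a real system plugs in jieba for segment(), a frequency file
# for FREQ_RANK, and official HSK lists for HSK_WORDS.
FREQ_RANK = {"我": 3, "是": 2, "学生": 150, "辩证法": 9000}
HSK_WORDS = {"我": 1, "是": 1, "学生": 2}

def segment(text):
    # Placeholder segmenter: jieba.lcut(text) in a real pipeline.
    return ["我", "是", "学生"]

def analyze(text, learner_level):
    tokens = segment(text)                              # layer 1: segmentation
    rare = [t for t in tokens
            if FREQ_RANK.get(t, 10**6) > 5000]          # layer 2: frequency data
    known = [t for t in tokens
             if HSK_WORDS.get(t, 99) <= learner_level]  # layer 3: HSK mapping
    coverage = len(known) / max(len(tokens), 1)         # layer 4: custom logic
    return {"coverage": coverage, "rare_words": rare}

report = analyze("我是学生", learner_level=2)
```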

Recommendations by Use Case#

1. Language Learning Applications#

Who: Duolingo, HelloChinese, ChinesePod
Need: Match content to learner proficiency dynamically

Recommended Stack:

  • jieba: Word segmentation
  • BCC Character Frequency or Jun Da: Character-level difficulty
  • CC-CEDICT + HSK tags: Word-level proficiency mapping
  • Custom scoring: Coverage calculation, adaptive recommendations

Implementation Priority:

  1. HIGH: HSK tagging for vocabulary (curriculum alignment)
  2. HIGH: Coverage calculation (90-95% = optimal challenge)
  3. MEDIUM: Character frequency indicators (rare character warnings)
  4. LOW: Perfect segmentation (learner-facing apps tolerate minor errors)

Complexity Justified When:

  • Adaptive content selection is core feature
  • Target serious learners (not casual tourists)
  • Large content library needs automated grading

Simpler Alternative: Manual tiering (easy/medium/hard) for small, curated content libraries


2. Graded Reader Publishers#

Who: Mandarin Companion, The Chairman’s Bao, Chinese Breeze
Need: Categorize books/articles by reading difficulty

Recommended Stack:

  • jieba: Word segmentation
  • CC-CEDICT + HSK: Vocabulary compliance checking
  • BCC Character Frequency: Identify rare characters
  • Custom proper name dictionary: Filter false positives
  • Editorial rules engine: Validate vocabulary constraints

Implementation Priority:

  1. HIGH: Vocabulary compliance validation (catch violations before publication)
  2. HIGH: Catalog consistency (books at same level should match)
  3. MEDIUM: Chapter progression analysis (smooth difficulty curve)
  4. MEDIUM: Comparative difficulty ranking (within-level ordering)

Complexity Justified When:

  • Large catalog (50+ books) requiring consistent leveling
  • Series with strict vocabulary control
  • Multiple authors need alignment

Simpler Alternative: Editorial judgment for small catalogs (<20 books), literary quality prioritized over strict leveling


3. Educational Content Creators#

Who: Textbook authors, educators, course creators
Need: Validate material difficulty while writing

Recommended Stack:

  • jieba or jieba.js: Real-time segmentation
  • CC-CEDICT + HSK lists: Vocabulary level lookup
  • Custom synonym engine: Suggest simpler alternatives
  • Browser extension / Google Docs add-on: Inline feedback

Implementation Priority:

  1. HIGH: Real-time violation highlighting (fix while writing)
  2. HIGH: Synonym suggestions (pedagogically appropriate replacements)
  3. MEDIUM: Batch validation (pre-publication QA)
  4. LOW: Grammar complexity (vocabulary harder to assess)

Complexity Justified When:

  • High-volume production (multiple textbooks/year)
  • Strict curriculum alignment (HSK/TOCFL requirements)
  • Multi-author coordination needed

Simpler Alternative: Post-writing manual validation with reference materials (Pleco, HSK word lists) for one-off worksheets


4. Reading Assistant Tools#

Who: Zhongwen extension, Du Chinese, LingQ, Pleco Reader
Need: Real-time difficulty indicators and vocabulary popups

Recommended Stack:

  • jieba.js: Browser-native word segmentation
  • Pruned CC-CEDICT: Top 10k words only (3 MB vs 30 MB)
  • Jun Da Character Frequency: Fast character-level difficulty
  • IndexedDB: Track user’s known vocabulary
  • Optional cloud sync: Cross-device vocabulary tracking

Implementation Priority:

  1. HIGH: Fast performance (sub-second page analysis)
  2. HIGH: Small bundle size (browser extension limits)
  3. MEDIUM: User vocabulary personalization (known word highlighting)
  4. LOW: Perfect accuracy (speed > completeness for real-time tools)

Complexity Justified When:

  • Privacy important (user data stays on device)
  • Real-time performance critical
  • Offline usage required

Simpler Alternative: Server-side API for complex processing (trade privacy/offline for simplicity), but most users prefer client-side


5. Curriculum Designers#

Who: University coordinators, K-12 developers, corporate trainers
Need: Design multi-year learning progressions

Recommended Stack:

  • jieba: Batch word segmentation
  • CC-CEDICT + HSK/TOCFL: Standards alignment validation
  • BCC/SUBTLEX-CH: Vocabulary frequency reference
  • Custom gap detection: Find missing difficulty levels
  • Visualization tools: Progression curves, coverage heatmaps

Implementation Priority:

  1. HIGH: Standards coverage validation (HSK/TOCFL compliance)
  2. HIGH: Gap detection (missing proficiency levels)
  3. MEDIUM: Competitive benchmarking (compare to peers)
  4. LOW: Cost optimization (budget secondary to quality)

Complexity Justified When:

  • Large program (>100 students/year)
  • Standards-driven (accreditation requirements)
  • Textbook adoption decisions (objective comparison)

Simpler Alternative: Faculty judgment for small/specialized programs, experimental curricula, heritage learner programs
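Gap detection (priority 2 in the list above) reduces to a set difference over the catalog's assigned levels; a sketch over a hypothetical catalog of (title, level) pairs:

```python
# Gap-detection sketch: find proficiency levels with no assigned material.
# The catalog below is hypothetical.
def find_gaps(catalog, target_levels=range(1, 7)):
    covered = {level for _, level in catalog}
    return sorted(set(target_levels) - covered)

catalog = [("Intro Reader", 1), ("City Stories", 2), ("News Digest", 5)]
print(find_gaps(catalog))  # levels with no material
```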


Cross-Use-Case Insights#

Universal Requirements#

All 5 use cases need:

  • Word segmentation (jieba is de facto standard)
  • HSK/TOCFL mapping (curriculum alignment universal)
  • Character frequency data (difficulty signals)
  • Custom logic (no library provides complete solution)

Use-Case-Specific Requirements#

| Feature | Learning Apps | Publishers | Creators | Reading Tools | Curriculum |
|---|---|---|---|---|---|
| Real-time feedback | Medium | Low | HIGH | HIGH | Low |
| Batch processing | Medium | HIGH | Medium | Low | HIGH |
| User personalization | HIGH | Low | Low | HIGH | Low |
| Standards alignment | HIGH | HIGH | HIGH | Medium | HIGH |
| Offline capability | HIGH | Medium | HIGH | HIGH | Medium |

Common Pain Points#

All use cases struggle with:

  • Proper name handling (character/place names flagged as rare)
  • Grammar complexity (sentence structure difficulty not captured)
  • Context-aware definitions (word meaning varies)
  • Synonym quality (not all alternatives pedagogically equivalent)
  • Standards evolution (HSK 2012 → 2021 migration)

Library Selection Matrix#

Core Components (Choose One from Each)#

Word Segmentation:

  • jieba (Python): Server-side, batch processing, highest accuracy
  • jieba.js (JavaScript): Browser, real-time, lightweight

Character Frequency:

  • Jun Da: Simple, fast, small file (~100 KB)
  • BCC Corpus: Authoritative, contemporary, large (requires preprocessing)
  • SUBTLEX-CH: Spoken language focus (conversational content)

Word-Level Proficiency:

  • CC-CEDICT + HSK tags: Comprehensive but incomplete coverage
  • BLCU HSK Lists: Official standard, requires integration
  • Custom HSK database: Combine multiple sources for completeness

Integration Patterns#

Pattern A: Lightweight Client-Side (Reading Assistants)

jieba.js + Jun Da + Pruned CC-CEDICT + IndexedDB
= Fast, offline, privacy-friendly, limited features

Pattern B: Server-Side Comprehensive (Publishers, Curriculum)

jieba + BCC Corpus + Full CC-CEDICT + Custom DB
= Slow, accurate, feature-rich, server required

Pattern C: Hybrid Real-Time (Learning Apps, Content Creators)

jieba.js (client) + HSK API (server) + User vocab cache (IndexedDB)
= Fast feedback, comprehensive data, complex architecture

Missing Capabilities (Build Custom)#

No existing library provides:

1. Proper Name Filtering#

Problem: Character/place names flagged as rare vocabulary
Solution: Custom dictionary of names + NER (Named Entity Recognition)
Affected Use Cases: Publishers (50% false positives), Content Creators (40%)
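A minimal version of this name-filtering step, with a hypothetical NAME_LIST standing in for the custom dictionary plus NER output:

```python
# Proper-name filtering sketch: exclude whitelisted names before flagging
# rare vocabulary. NAME_LIST is a hypothetical custom dictionary; a real
# system would combine one with NER output. FREQ_RANK is a stand-in too.
NAME_LIST = {"王小明", "北京", "上海"}
FREQ_RANK = {"的": 1, "去": 120, "王小明": 10**6, "觐见": 8000}

def flag_rare(tokens, rank_cutoff=5000):
    return [t for t in tokens
            if t not in NAME_LIST                       # skip known names first
            and FREQ_RANK.get(t, 10**6) > rank_cutoff]  # then flag by rank

print(flag_rare(["王小明", "去", "北京", "觐见"]))
```

Without the NAME_LIST check, both 王小明 (a personal name) and 北京 would be flagged alongside the genuinely rare 觐见.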

2. Grammar Complexity Metrics#

Problem: Sentence structure difficulty not captured (only vocabulary)
Solution: Dependency parsing + sentence length + clause complexity
Affected Use Cases: All (but especially Curriculum Designers)

3. Pedagogical Synonym Ranking#

Problem: Not all synonyms equally teachable
Solution: ML model trained on textbook corpora + teacher feedback
Affected Use Cases: Content Creators (critical), Learning Apps (important)

4. Context-Aware Proficiency#

Problem: Word difficulty varies by context (多 harder in 多少 vs 很多)
Solution: Phrase-level HSK tagging + context windows
Affected Use Cases: Learning Apps (important), Reading Assistants (nice-to-have)

5. Standards Migration Tools#

Problem: HSK 2012 → HSK 2021 vocabulary changes
Solution: Mapping tables + automated curriculum updates
Affected Use Cases: Publishers (critical), Curriculum Designers (critical)


Decision Framework#

When to Build Custom Solution#

Build custom readability tools when:

  • Volume justifies cost: 100+ books, 1000+ lessons, or 10k+ users
  • Competitive advantage: Readability assessment is core differentiation
  • Domain-specific needs: Business Chinese, medical Chinese, classical Chinese
  • Standards compliance critical: Accreditation, certification, testing

When to Use Manual Methods#

Rely on human judgment when:

  • Small scale: <20 books, <100 lessons, boutique programs
  • Experimental: Pioneering new approaches, no benchmarks
  • Quality > consistency: Literary content, cultural nuance
  • Budget constraints: Custom tooling expensive, ROI unclear

When to License Third-Party#

Consider SaaS tools when:

  • A suitable product exists (currently, no comprehensive SaaS for CJK readability does)
  • Cost-effective: Subscription < custom development
  • Good enough: 80% solution acceptable, not 100% perfection

Reality check: As of 2025, no off-the-shelf SaaS provides comprehensive CJK readability assessment. All serious use cases require custom integration.


Implementation Roadmap (Generic)#

Phase 1: Proof of Concept (1-2 months)#

  1. Choose core libraries (jieba + frequency data + HSK lists)
  2. Build basic difficulty estimator
  3. Validate on sample content (10-20 texts)
  4. Measure accuracy vs manual classification

Phase 2: MVP (3-6 months)#

  1. Integrate libraries into workflow (API, UI, or batch tool)
  2. Add custom logic (coverage, scoring, filtering)
  3. Pilot with small user group (5-10 people)
  4. Iterate based on feedback

Phase 3: Production (6-12 months)#

  1. Optimize performance (caching, parallel processing)
  2. Build proper name dictionary (reduce false positives)
  3. Add synonym suggestions (for content creators)
  4. Implement user vocabulary tracking (for learners)

Phase 4: Advanced Features (12+ months)#

  1. Grammar complexity metrics
  2. Context-aware proficiency
  3. ML-based improvements
  4. Cross-device sync, collaboration features

Cost-Benefit Analysis#

Automation Benefits (Quantified)#

  • Textbook publisher: Save 20-40 hours/book in manual leveling
  • Learning app: 2x content library size (more materials graded)
  • Curriculum designer: 80% reduction in textbook evaluation time
  • Content creator: 50% faster writing (real-time feedback)

Automation Costs (Estimated)#

  • Development: $50k-$200k (depending on features)
  • Maintenance: $10k-$30k/year (data updates, bug fixes)
  • Accuracy trade-offs: 10-20% false positive rate on proper names

ROI Thresholds#

  • High-volume publishers: ROI positive at 10+ books/year
  • Large learning apps: ROI positive at 1000+ active users
  • Universities: ROI positive at 100+ students/year
  • Individual creators: ROI negative (manual methods better)

Final Recommendation#

For most use cases, build custom integration of:

  1. jieba (segmentation)
  2. Jun Da or BCC (character frequency)
  3. CC-CEDICT + HSK (word proficiency)
  4. Custom logic (business-specific rules)

Start simple (manual validation) and add automation incrementally as volume grows. Perfect accuracy is unattainable; optimize for “good enough” given use-case constraints.

No silver bullet exists. CJK readability assessment requires domain expertise + software engineering + continuous iteration. Tools enable but don’t replace human judgment.


Use Case: Educational Content Creators#

Scenario Description#

Textbook authors, curriculum developers, and educators create teaching materials and must ensure that vocabulary and grammar match target proficiency levels and curriculum standards.

User Persona#

  • Primary: Textbook authors, curriculum developers, language teachers
  • Secondary: Online course creators, YouTube educators, educational bloggers
  • Output: Textbooks, worksheets, lesson plans, video scripts, blog posts
  • Constraints: Must align with standardized curricula (HSK, TOCFL, school syllabi)

Examples of Real Applications#

  • University textbook authors: Creating HSK-aligned coursebooks
  • K-12 curriculum developers: Designing age-appropriate Chinese lessons
  • YouTube educators: Scripting comprehensible input videos
  • Blog writers: Writing learner-friendly explanations of Chinese culture/grammar
  • Worksheet creators: Generating practice materials at specific levels

Technical Requirements#

Core Capabilities#

  1. Real-time difficulty feedback: Authors see difficulty as they write
  2. Vocabulary constraint validation: Alert when using words above target level
  3. Suggested replacements: Recommend simpler synonyms for complex words
  4. Coverage visualization: Show what % of text learners can understand
  5. Curriculum alignment check: Verify content matches HSK/TOCFL standards
  6. Export word lists: Generate vocabulary lists for lesson appendices

Performance Constraints#

  • Real-time responsiveness: Sub-second feedback while typing (Google Docs add-on)
  • Batch validation: Analyze complete chapters (10k-50k words)
  • Lightweight: Work in browser or lightweight desktop app
  • Offline capability: Authors work without internet connection

Accuracy Requirements#

  • Critical: Catch vocabulary above target level (breaks learner flow)
  • Important: Suggest pedagogically appropriate alternatives
  • Nice-to-have: Grammar complexity warnings (sentence structure)

Library Analysis#

CC-CEDICT + HSK Vocabulary Lists#

Strengths for Content Creators:

  • HSK 1-6 tagging (align content with standards)
  • Comprehensive coverage (100k+ words)
  • Synonym lookup (find simpler alternatives)
  • Offline-capable (author workflow often offline)

Weaknesses for Content Creators:

  • ⚠️ HSK coverage gaps (common words lack tags)
  • ⚠️ No real-time integration (need custom tooling)
  • ⚠️ Synonym quality (not all alternatives pedagogically sound)

Verdict: Essential reference, needs UI layer.
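
CC-CEDICT ships as a plain-text file with one entry per line in the format `Traditional Simplified [pin1 yin1] /gloss/gloss/`. A minimal parser sketch (stdlib only; real use would stream the whole file and index entries by simplified form):

```python
import re

# One CC-CEDICT entry: traditional, simplified, [pinyin], /slash-delimited glosses/
CEDICT_LINE = re.compile(r'^(\S+) (\S+) \[([^\]]+)\] /(.+)/$')

def parse_cedict_line(line):
    """Parse one CC-CEDICT entry into a dict; return None for comments."""
    if line.startswith('#'):
        return None
    m = CEDICT_LINE.match(line.strip())
    if not m:
        return None
    trad, simp, pinyin, glosses = m.groups()
    return {
        'traditional': trad,
        'simplified': simp,
        'pinyin': pinyin,
        'glosses': glosses.split('/'),
    }

entry = parse_cedict_line('中國 中国 [Zhong1 guo2] /China/')
print(entry['simplified'], entry['glosses'])  # 中国 ['China']
```

The UI layer the verdict calls for would sit on top of a dict built from these entries, joined against HSK level tags.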

jieba + Custom Dictionaries#

Strengths for Content Creators:

  • Fast word segmentation (real-time feedback possible)
  • Custom dictionaries (add HSK tags, proper names)
  • Lightweight (runs in browser via WebAssembly)

Weaknesses for Content Creators:

  • ⚠️ Segmentation errors (need manual correction)
  • ⚠️ No built-in simplification suggestions

Verdict: Critical for real-time analysis.
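
jieba's user dictionary is a plain-text file, one entry per line in the format `word [freq] [pos_tag]` (frequency and tag are optional), loaded with `jieba.load_userdict(path)`. A sketch that writes such a file — for example to register proper names that should segment as single words:

```python
def write_jieba_userdict(entries, path):
    """Write a jieba user-dictionary file.

    `entries` is a list of (word, freq, tag) tuples. freq and tag are
    optional in jieba's format, so None values are simply omitted;
    here a tag is only emitted when a freq is also present.
    Load later with jieba.load_userdict(path).
    """
    with open(path, 'w', encoding='utf-8') as f:
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
                if tag is not None:
                    parts.append(tag)
            f.write(' '.join(parts) + '\n')

# e.g. proper names that should never be flagged as unknown vocabulary
write_jieba_userdict([('张伟', 10000, 'nr'), ('北京大学', None, None)],
                     'userdict.txt')
```

This is also the mechanism behind the "custom dictionaries" strength above: segmentation fixes and proper-name handling both go through the same file.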

Pleco Dictionary (User Tooling Inspiration)#

Strengths for Content Creators:

  • HSK tagging in dictionary (shows word levels)
  • Synonym explorer (find alternatives)
  • Example sentences (pedagogical context)

Weaknesses for Content Creators:

  • ❌ Manual lookup (not integrated into writing workflow)
  • ❌ Mobile-only (authors work on desktop)

Verdict: Excellent reference but not authoring tool.

BLCU HSK Vocabulary Graded Lists#

Strengths for Content Creators:

  • Official HSK standard (authoritative)
  • Both 2012 and 2021 versions (cover transitions)
  • Part-of-speech tags (syntactic guidance)

Weaknesses for Content Creators:

  • ⚠️ Static lists (need lookup tool)
  • ⚠️ No synonym suggestions

Verdict: Authoritative reference for validation.

Detailed Feature Comparison#

| Feature | CC-CEDICT+HSK | jieba | Pleco | BLCU Lists | Creator Value |
|---|---|---|---|---|---|
| Real-time feedback | ❌ | ✅ | ⚠️ Manual | ❌ | Critical (authoring flow) |
| HSK tagging | ✅ | ⚠️ Custom | ✅ | ✅ | Critical (curriculum align) |
| Synonym suggestions | ✅ | ❌ | ⚠️ Manual | ❌ | High (vocabulary control) |
| Offline-capable | ✅ | ✅ | ⚠️ Mobile | ✅ | Important (authoring workflow) |
| Part-of-speech tags | ❌ | ⚠️ Custom | ❌ | ✅ | Medium (grammar guidance) |
| Integration-ready | ✅ | ✅ | ⚠️ API | ⚠️ CSV | Critical (tooling) |

Recommendation#

Custom Authoring Tool Required#

No off-the-shelf tool exists for real-time content creation. Authors need custom integration:

  1. jieba: Real-time word segmentation as author types
  2. CC-CEDICT + HSK lists: Vocabulary level lookup
  3. Custom synonym engine: Suggest simpler alternatives
  4. Browser extension or Google Docs add-on: Inline feedback

Authoring Tool Architecture#

# Pseudocode for real-time authoring assistant
import jieba

UNKNOWN_LEVEL = 99  # sentinel for words missing from the HSK lists

class ContentCreatorAssistant:
    def __init__(self, target_level=3):  # numeric HSK level (1-6)
        self.target_level = target_level
        self.hsk_vocab = load_hsk_vocabulary()   # {word: int HSK level}
        self.synonyms = load_synonym_database()

    def analyze_as_you_type(self, text):
        """Real-time feedback while author writes"""
        words = list(jieba.cut(text))  # materialize: iterated twice below
        issues = []

        for word in words:
            word_level = self.hsk_vocab.get(word, UNKNOWN_LEVEL)

            if word_level > self.target_level:
                suggestions = self.find_simpler_alternatives(word)
                issues.append({
                    'word': word,
                    'level': word_level,
                    'severity': 'high' if word_level > self.target_level + 1 else 'medium',
                    'suggestions': suggestions,
                })

        return {
            'issues': issues,
            'difficulty_estimate': self.estimate_difficulty(words),
            'target_level_compliance': len(issues) == 0,
        }

    def find_simpler_alternatives(self, word):
        """Suggest simpler synonyms"""
        candidates = self.synonyms.get(word, [])
        return [
            syn for syn in candidates
            if self.hsk_vocab.get(syn, UNKNOWN_LEVEL) <= self.target_level
        ]

Implementation Patterns#

Pattern 1: Google Docs Add-On (Real-Time Highlighting)#

Highlight vocabulary violations as author writes:

// Google Apps Script for Docs add-on
function analyzeDocument() {
  var doc = DocumentApp.getActiveDocument();
  var body = doc.getBody();
  var text = body.getText();

  // Call backend API (jieba + HSK lookup)
  var issues = analyzeText(text, 'HSK3');  // target level passed positionally

  // Highlight violations in yellow
  issues.forEach(function(issue) {
    var range = body.findText(issue.word);
    if (range) {
      range.getElement().asText().setBackgroundColor(
        issue.severity === 'high' ? '#FFFF00' : '#FFFFE0'
      );
    }
  });

  // Show sidebar with suggestions
  showSuggestionsSidebar(issues);
}

Pattern 2: Browser Extension (Webpage Content Validation)#

Validate content on educational blogs/websites:

// Browser extension for content creators
chrome.action.onClicked.addListener(async (tab) => {
  // Extract text from current page
  const text = await extractTextFromPage(tab.id);

  // Analyze difficulty
  const analysis = await analyzeDifficulty(text, 'HSK4');

  // Show popup with results
  chrome.notifications.create({
    type: 'basic',
    iconUrl: 'icon.png',  // required when type is 'basic'
    title: 'Content Difficulty Analysis',
    message: `Level: ${analysis.level}\nCompliance: ${analysis.compliant ? 'Yes' : 'No'}\nViolations: ${analysis.violations.length}`,
  });

  // Highlight violations on page
  highlightViolations(tab.id, analysis.violations);
});

Pattern 3: Vocabulary Constraint Checker#

Validate entire manuscript before publication:

def validate_manuscript_constraints(manuscript, target_level, allowed_exceptions):
    """Check if content meets vocabulary constraints"""
    words = segment(manuscript)
    violations = []

    # Check each word
    for word in set(words):
        # Skip allowed exceptions (proper names, technical terms)
        if word in allowed_exceptions:
            continue

        word_level = get_hsk_level(word)

        if word_level == 'unknown' or word_level > target_level:  # test 'unknown' first to avoid a str/int comparison
            occurrences = words.count(word)
            simpler_alternatives = find_simpler_synonyms(word, target_level)

            violations.append({
                'word': word,
                'level': word_level,
                'occurrences': occurrences,
                'suggested_replacements': simpler_alternatives,
                'example_sentences': find_sentences_with_word(manuscript, word)[:3],
            })

    # Generate report
    return {
        'compliant': len(violations) == 0,
        'total_violations': len(violations),
        'violations_by_severity': categorize_violations(violations),
        'detailed_violations': sorted(violations, key=lambda x: x['occurrences'], reverse=True),
        'recommended_action': 'fix' if len(violations) > 10 else 'review',
    }

Pattern 4: Synonym Suggestion Engine#

Help authors replace complex words with simpler alternatives:

def suggest_replacements(word, target_level, context_sentence):
    """Find pedagogically appropriate synonyms"""
    # Look up synonyms from dictionary
    raw_synonyms = get_synonyms(word)

    # Filter to target level
    level_appropriate = [
        syn for syn in raw_synonyms
        if get_hsk_level(syn) <= target_level
    ]

    # Rank by pedagogical value
    ranked = []
    for syn in level_appropriate:
        score = {
            'synonym': syn,
            'level': get_hsk_level(syn),
            'frequency': get_word_frequency(syn),
            'context_fit': check_context_fit(syn, context_sentence),
            'pedagogical_value': calculate_pedagogical_value(syn, target_level),
        }
        ranked.append(score)

    # Sort by best fit
    ranked.sort(key=lambda x: x['pedagogical_value'], reverse=True)

    return ranked[:5]  # Top 5 suggestions

Pattern 5: Lesson Vocabulary Planner#

Plan vocabulary introduction across lesson sequence:

def plan_lesson_vocabulary(lessons, starting_level='HSK1'):
    """Ensure gradual vocabulary progression across lessons"""
    cumulative_vocab = set()
    current_level = starting_level

    lesson_plan = []

    for lesson in lessons:
        words = segment(lesson.text)
        new_words = set(words) - cumulative_vocab

        # Calculate difficulty
        new_word_count = len(new_words)
        difficulty_jump = estimate_difficulty_increase(new_words, cumulative_vocab)

        # Flag if too many new words
        if new_word_count > 20:  # More than 20 new words = too much
            lesson_plan.append({
                'lesson': lesson.title,
                'status': 'warning',
                'issue': f'{new_word_count} new words (max 20 recommended)',
                'suggestions': ['Split into 2 lessons', 'Remove low-frequency words'],
            })
        elif difficulty_jump > 1.0:  # Difficulty spike
            lesson_plan.append({
                'lesson': lesson.title,
                'status': 'warning',
                'issue': f'Difficulty spike (+{difficulty_jump:.2f})',
                'suggestions': ['Add bridge lesson', 'Pre-teach difficult vocabulary'],
            })
        else:
            lesson_plan.append({
                'lesson': lesson.title,
                'status': 'ok',
                'new_words': list(new_words),
                'cumulative_vocab_size': len(cumulative_vocab) + len(new_words),
            })

        cumulative_vocab.update(new_words)

    return lesson_plan

Trade-offs#

Real-Time Authoring Tools Benefits#

  • Immediate feedback: Authors fix issues while writing (not after)
  • Pedagogical guidance: Suggests appropriate vocabulary choices
  • Quality assurance: Prevents vocabulary violations before publication
  • Productivity: Faster than manual dictionary lookups

Real-Time Authoring Tools Costs#

  • Development effort: Custom tooling required (no off-the-shelf solutions)
  • False positives: Proper names, technical terms flagged incorrectly
  • Synonym quality: Not all replacements pedagogically equivalent
  • Author training: Learning curve for new tool adoption

When Custom Tooling is Worth It#

Build authoring assistant when:

  • High-volume content production (multiple textbooks/year)
  • Strict curriculum alignment required (HSK, TOCFL)
  • Multiple authors need consistency (editorial coordination)
  • Competitive advantage in curriculum quality

When Manual Validation Suffices#

Rely on post-writing review when:

  • Small-scale production (one-off worksheets)
  • Expert authors with deep pedagogical knowledge
  • Flexible curriculum (not tied to standardized tests)
  • Budget constraints (custom tooling expensive)

Missing Capabilities#

No existing tool provides:

  • Integrated authoring environment (Google Docs add-on, VS Code extension)
  • Grammar complexity metrics (sentence structure difficulty)
  • Pedagogical synonym ranking (not all synonyms equally teachable)
  • Lesson vocabulary planning (progression across lesson series)
  • Age-appropriate vocabulary (K-12 vs adult learners)
  • Cultural appropriateness (Taiwan vs Mainland vocabulary)

Content creators must build custom solutions or rely on manual judgment.

Real-World Integration Examples#

Textbook Author Workflow#

class TextbookAuthorAssistant:
    def __init__(self, book_series_name, target_level):
        self.series = book_series_name
        self.target_level = target_level
        self.cumulative_vocab = load_series_vocabulary(book_series_name)

    def validate_chapter_draft(self, chapter_text):
        """Provide feedback on chapter draft"""
        words = segment(chapter_text)

        # 1. Vocabulary compliance
        violations = find_vocabulary_violations(words, self.target_level)

        # 2. New vocabulary load
        new_words = set(words) - self.cumulative_vocab
        new_word_count = len(new_words)

        # 3. Coverage analysis
        coverage = calculate_coverage(words, self.target_level)

        # 4. Generate feedback
        feedback = {
            'ready_for_review': len(violations) == 0 and new_word_count <= 20,
            'issues': violations,
            'new_vocabulary': {
                'count': new_word_count,
                'words': list(new_words),
                'recommendation': 'ok' if new_word_count <= 20 else 'reduce',
            },
            'coverage': coverage,
            'suggested_edits': generate_edit_suggestions(violations),
        }

        return feedback

    def update_series_vocabulary(self, chapter_text):
        """Track cumulative vocabulary as series progresses"""
        words = set(segment(chapter_text))
        self.cumulative_vocab.update(words)
        save_series_vocabulary(self.series, self.cumulative_vocab)

YouTube Educator Script Validator#

def validate_video_script(script, target_audience=3):  # numeric HSK level
    """Check if video script matches target audience level"""
    # Segment into sentences
    sentences = split_into_sentences(script)

    sentence_analysis = []
    for sent in sentences:
        words = segment(sent)
        difficulty = estimate_difficulty(words)

        # Check if sentence appropriate
        if difficulty > target_audience + 1:  # More than 1 level above
            sentence_analysis.append({
                'sentence': sent,
                'difficulty': difficulty,
                'issue': 'too_hard',
                'suggestion': 'Simplify vocabulary or add explanation',
            })
        else:
            sentence_analysis.append({
                'sentence': sent,
                'difficulty': difficulty,
                'status': 'ok',
            })

    # Overall script assessment
    hard_sentences = [s for s in sentence_analysis if s.get('issue') == 'too_hard']

    return {
        'script_appropriate': len(hard_sentences) < len(sentences) * 0.1,  # < 10% hard
        'hard_sentence_count': len(hard_sentences),
        'recommendations': 'Ready for recording' if len(hard_sentences) == 0 else 'Simplify before recording',
        'detailed_analysis': sentence_analysis,
    }

Worksheet Generator with Difficulty Control#

def generate_worksheet(vocabulary_list, target_level, exercise_type='fill_blank'):
    """Create practice worksheet at specific difficulty"""
    # Filter vocabulary to target level
    level_vocab = [
        word for word in vocabulary_list
        if get_hsk_level(word) == target_level
    ]

    # Generate exercises
    exercises = []
    for word in level_vocab[:20]:  # 20 questions
        if exercise_type == 'fill_blank':
            sentence = generate_example_sentence(word, target_level)
            blank_sentence = sentence.replace(word, '____')
            exercises.append({
                'question': blank_sentence,
                'answer': word,
                'pinyin_hint': get_pinyin(word),
            })

    # Validate worksheet difficulty
    validation = validate_worksheet_difficulty(exercises, target_level)

    return {
        'exercises': exercises,
        'difficulty_validated': validation['compliant'],
        'answer_key': [ex['answer'] for ex in exercises],
    }

Performance Considerations#

Typical Workload#

Content creators work with:

  • Real-time typing (100-500 words/hour)
  • Chapter drafts (3k-10k words)
  • Full textbook manuscripts (50k-100k words)

Optimization Strategies#

# Incremental analysis (only new text, not whole document)
class IncrementalAnalyzer:
    def __init__(self, target_level):
        self.target_level = target_level
        self.previous_text = ""
        self.cached_issues = []

    def analyze_changes(self, current_text):
        """Only analyze what changed since last call"""
        # Naive diff: assumes text only grows by appending; edits earlier
        # in the document would require a real diff (e.g. difflib)
        new_text = current_text[len(self.previous_text):]

        # Analyze only new portion
        new_issues = analyze_text(new_text, self.target_level)

        # Merge with cached issues
        all_issues = self.cached_issues + new_issues

        # Update cache
        self.previous_text = current_text
        self.cached_issues = all_issues

        return all_issues

# Debounce real-time analysis (don't analyze every keystroke)
import time

def debounced_analysis(text, target_level, delay=1.0):
    """Naive debounce: wait, then analyze. A real implementation would
    cancel the pending run when a new keystroke arrives."""
    time.sleep(delay)
    return analyze_text(text, target_level)
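
A sleep-based debounce blocks and cannot drop a run once started. A cancelable debounce — the behavior editors actually need — can be sketched with `threading.Timer` (names here are illustrative; `results.append` stands in for the real analysis callback):

```python
import threading
import time

class Debouncer:
    """Run `callback(arg)` only after `delay` seconds with no new calls."""

    def __init__(self, callback, delay=1.0):
        self.callback = callback
        self.delay = delay
        self._timer = None
        self._lock = threading.Lock()

    def call(self, arg):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # drop the still-pending run
            self._timer = threading.Timer(self.delay, self.callback, args=(arg,))
            self._timer.start()

results = []
d = Debouncer(results.append, delay=0.05)
for text in ['你', '你好', '你好吗']:
    d.call(text)   # rapid keystrokes: each call cancels the pending analysis

time.sleep(0.2)    # let the final timer fire
print(results)     # ['你好吗'] — only the final text was analyzed
```

The same pattern maps directly onto a browser-side `setTimeout`/`clearTimeout` pair in the Google Docs or extension front ends.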

Conclusion#

Content creators need custom authoring tools with real-time feedback. No off-the-shelf solution exists. Best approach:

  1. Build custom integration (Google Docs add-on, browser extension)
  2. Combine libraries: jieba (segmentation) + CC-CEDICT/HSK (tagging)
  3. Add synonym engine: Help authors find simpler alternatives
  4. Validate post-writing: Batch analysis before publication

Investment in tooling pays off for high-volume content production and multi-author coordination. Small-scale creators can rely on manual validation with reference materials (Pleco, HSK word lists).


Use Case: Curriculum Designers#

Scenario Description#

Language program coordinators, university professors, and school administrators design multi-year learning progressions, ensuring materials sequence from beginner to advanced with appropriate difficulty increases.

User Persona#

  • Primary: University Chinese program coordinators, K-12 curriculum developers
  • Secondary: Corporate training program designers, online course architects
  • Output: Multi-year curricula, semester lesson plans, course sequences
  • Constraints: Must align with standards (HSK, TOCFL, ACTFL), budget limits

Examples of Real Applications#

  • University Chinese programs: 4-year BA programs (Year 1 → Year 4 progression)
  • K-12 school districts: Chinese immersion programs (Grade 1 → Grade 12)
  • Corporate training: Business Chinese programs (3-month intensive courses)
  • MOOCs: Coursera/edX Chinese courses (beginner → advanced tracks)
  • Language schools: Summer intensive programs (8-week progressions)

Technical Requirements#

Core Capabilities#

  1. Material sequencing: Order materials from easy → hard with smooth progression
  2. Vocabulary distribution analysis: Ensure even vocabulary load across semesters
  3. Curriculum gap detection: Find missing proficiency levels in material collection
  4. Textbook comparison: Evaluate competing textbooks for level appropriateness
  5. Standards alignment: Verify coverage of HSK/TOCFL/ACTFL requirements
  6. Resource allocation: Plan budgets based on material difficulty needs

Performance Constraints#

  • Batch analysis: Process hundreds of textbooks/materials (curriculum library)
  • Latency: Hours acceptable (strategic planning, not real-time)
  • Reporting: Detailed visualizations (gap analysis, progression curves)
  • Accuracy: High priority (curriculum decisions are multi-year commitments)
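
Batch analysis of a large library parallelizes naturally, since each document is independent. A sketch using `concurrent.futures` — the `estimate_difficulty` stand-in is a placeholder for real segmentation plus scoring, and for CPU-bound pure-Python scoring a `ProcessPoolExecutor` would be the better choice:

```python
from concurrent.futures import ThreadPoolExecutor

def estimate_difficulty(text):
    # Placeholder: real scoring would segment and look up HSK levels
    return len(set(text)) / max(len(text), 1)

def analyze_library(texts, max_workers=8):
    """Analyze many documents in parallel; results stay in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(estimate_difficulty, texts))
    return [
        {'index': i, 'difficulty': score}
        for i, score in enumerate(scores)
    ]

reports = analyze_library(['你好', '经济学原理', '你好你好'])
print(len(reports))  # 3
```

Since hours of latency are acceptable here, even a single-machine pool comfortably covers a library of hundreds of textbooks.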

Accuracy Requirements#

  • Critical: Accurate level classification (wrong textbook = semester wasted)
  • Critical: Vocabulary coverage validation (gaps leave students unprepared)
  • Important: Competitive benchmarking (how do materials compare to other programs?)
  • Nice-to-have: Cost-per-level optimization (budget allocation)

Library Analysis#

CC-CEDICT + HSK Vocabulary Lists#

Strengths for Curriculum Design:

  • HSK 1-6 tagging (align materials with standardized progression)
  • Comprehensive coverage (validate textbook completeness)
  • Batch processing (analyze entire curriculum library)
  • Standards-based (matches learner expectations)

Weaknesses for Curriculum Design:

  • ⚠️ HSK 2012 vs 2021 differences (curriculum transitions needed)
  • ⚠️ Incomplete coverage (materials with unique vocabulary hard to assess)
  • ⚠️ No curriculum-specific metrics (materials may use same words but different teaching approach)

Verdict: Essential foundation for standards alignment.

ACTFL Proficiency Guidelines + Can-Do Statements#

Strengths for Curriculum Design:

  • Competency-based framework (focus on what learners can do)
  • K-12 adoption (US schools use ACTFL standards)
  • Skills-based (speaking, reading, writing, listening)

Weaknesses for Curriculum Design:

  • ❌ Not automated (no word lists, manual assessment)
  • ❌ Qualitative not quantitative (hard to measure with tools)
  • ⚠️ No direct HSK mapping (curriculum spans multiple frameworks)

Verdict: Important for K-12, needs manual integration.

Corpus-Based Frequency Data (BCC, SUBTLEX-CH)#

Strengths for Curriculum Design:

  • Empirical evidence (real-world usage patterns)
  • Genre differentiation (academic vs conversational vocabulary)
  • Contemporary relevance (recent corpus data)

Weaknesses for Curriculum Design:

  • ❌ No pedagogical sequencing (frequency ≠ learnability)
  • ❌ Corpus bias (news ≠ learner needs)
  • ⚠️ Requires interpretation (how to map frequency to curriculum levels?)

Verdict: Useful reference but not curriculum design tool alone.
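
One common (if rough) way to do that interpretation is to bucket words by corpus frequency rank into bands that loosely parallel proficiency levels. The rank cutoffs below are illustrative, not an established mapping:

```python
# Illustrative rank cutoffs: top 500 words ≈ band 1, top 1200 ≈ band 2, ...
BAND_CUTOFFS = [500, 1200, 2500, 5000, 10000]

def frequency_band(word, rank_by_word):
    """Map a word's corpus frequency rank to a coarse difficulty band (1-6)."""
    rank = rank_by_word.get(word)
    if rank is None:
        return 6  # unseen in corpus: treat as hardest band
    for band, cutoff in enumerate(BAND_CUTOFFS, start=1):
        if rank <= cutoff:
            return band
    return 6

ranks = {'的': 1, '经济': 800, '量子': 20000}
print([frequency_band(w, ranks) for w in ['的', '经济', '量子']])
# [1, 2, 6]
```

This is exactly where the frequency-vs-learnability caveat bites: a rank-based band ignores teachability, so the output should feed into, not replace, pedagogical sequencing.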

Textbook Metadata (Publishers’ Level Claims)#

Strengths for Curriculum Design:

  • Publisher specifications (stated target levels)
  • Market positioning (competitive benchmarking)
  • Pedagogical intent (authors’ design goals)

Weaknesses for Curriculum Design:

  • ⚠️ Publisher claims unvalidated (marketing vs reality)
  • ⚠️ Inconsistent leveling (different publishers use different scales)
  • ⚠️ Lack of granularity (same level can vary widely)

Verdict: Starting point but requires independent validation.
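
Independent validation reduces to comparing the claimed level against a measured one. A sketch, with the measurement function injected so the scoring method stays pluggable (all names here are illustrative):

```python
def validate_level_claim(book_text, claimed_level, measure_level, tolerance=0.5):
    """Compare a publisher's claimed level with an independent estimate."""
    measured = measure_level(book_text)
    gap = measured - claimed_level
    return {
        'claimed': claimed_level,
        'measured': measured,
        'gap': gap,
        'verdict': ('consistent' if abs(gap) <= tolerance
                    else 'harder than claimed' if gap > 0
                    else 'easier than claimed'),
    }

# Stand-in measurement: pretend vocabulary analysis scored this text HSK 4.2
report = validate_level_claim('...', claimed_level=3, measure_level=lambda t: 4.2)
print(report['verdict'])  # harder than claimed
```

Run over a candidate list, the `gap` column alone exposes which publishers' scales run hard or easy relative to your program's.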

Detailed Feature Comparison#

| Feature | CC-CEDICT+HSK | ACTFL | BCC Corpus | Textbook Metadata | Curriculum Value |
|---|---|---|---|---|---|
| Standards alignment | ✅ HSK | ✅ ACTFL | ❌ | ⚠️ Varies | Critical (accreditation) |
| Batch analysis | ✅ | ❌ Manual | ✅ | ⚠️ Manual | Critical (library evaluation) |
| Vocabulary sequencing | ⚠️ Levels only | ❌ | ⚠️ Frequency | ❌ | High (progression planning) |
| Gap detection | ✅ | ❌ | ❌ | ⚠️ Manual | High (curriculum holes) |
| Competitive benchmarking | ⚠️ Manual | ❌ | ⚠️ Manual | ✅ | Medium (market positioning) |
| Cost optimization | ❌ | ❌ | ❌ | ⚠️ Price data | Medium (budget planning) |

Recommendation#

Multi-Phase Curriculum Design Process#

Curriculum design requires combining automated analysis with expert judgment:

Phase 1: Standards Mapping

  1. Use CC-CEDICT + HSK to map materials to standardized levels
  2. Validate publisher claims through independent vocabulary analysis
  3. Identify gaps in curriculum progression (missing HSK levels)

Phase 2: Sequencing Analysis

  1. Use corpus frequency data to validate vocabulary introduction order
  2. Ensure vocabulary load distributed evenly across semesters
  3. Check for difficulty spikes between consecutive courses

Phase 3: Competitive Benchmarking

  1. Compare program materials to peer institutions
  2. Validate that progression matches industry standards
  3. Identify unique strengths/weaknesses

Phase 4: Expert Review

  1. Faculty validate automated recommendations
  2. Adjust for pedagogical factors (teachability, cultural relevance)
  3. Pilot materials with small cohorts before full adoption

Curriculum Analysis Workflow#

# Pseudocode for curriculum analysis
class CurriculumAnalyzer:
    def __init__(self, materials_library):
        self.materials = materials_library
        self.hsk_vocab = load_hsk_vocabulary()
        self.corpus_freq = load_corpus_frequency()

    def analyze_curriculum_progression(self, course_sequence):
        """Validate multi-year course progression"""
        progression = []

        for i, course in enumerate(course_sequence):
            # Analyze course difficulty
            difficulty = self.estimate_difficulty(course.materials)

            # Check vocabulary coverage
            vocab_coverage = self.check_hsk_coverage(course.materials)

            current_vocab = set(get_vocabulary(course.materials))

            # Compare against previous course (first course has no baseline)
            if i == 0:
                new_vocab, overlap_rate, difficulty_jump = current_vocab, 1.0, 0.0
            else:
                prev_vocab = set(get_vocabulary(course_sequence[i-1].materials))
                new_vocab = current_vocab - prev_vocab
                overlap_rate = len(current_vocab & prev_vocab) / len(prev_vocab)
                difficulty_jump = difficulty - progression[-1]['difficulty']

            progression.append({
                'course': course.name,
                'difficulty': difficulty,
                'hsk_level': vocab_coverage['estimated_level'],
                'new_vocabulary': len(new_vocab),
                'vocabulary_overlap': overlap_rate,
                'difficulty_jump': difficulty_jump,
            })

        return self.validate_progression(progression)

    def validate_progression(self, progression):
        """Check for curriculum issues"""
        issues = []

        for course in progression:
            # Check for difficulty spikes
            if course['difficulty_jump'] > 1.5:
                issues.append({
                    'type': 'difficulty_spike',
                    'course': course['course'],
                    'severity': 'high',
                    'message': f"Large difficulty jump (+{course['difficulty_jump']:.2f}) from previous course",
                })

            # Check for vocabulary overload
            if course['new_vocabulary'] > 500:
                issues.append({
                    'type': 'vocabulary_overload',
                    'course': course['course'],
                    'severity': 'medium',
                    'message': f"{course['new_vocabulary']} new words (recommend <500 per semester)",
                })

            # Check for insufficient review
            if course['vocabulary_overlap'] < 0.4:  # Less than 40% overlap
                issues.append({
                    'type': 'insufficient_review',
                    'course': course['course'],
                    'severity': 'medium',
                    'message': f"Only {course['vocabulary_overlap']:.1%} vocabulary overlap with previous course",
                })

        return {
            'progression': progression,
            'issues': issues,
            'overall_assessment': 'needs_revision' if len(issues) > 0 else 'acceptable',
        }

Implementation Patterns#

Pattern 1: Standards Alignment Validator#

Ensure curriculum covers required HSK/TOCFL vocabulary:

def validate_standards_coverage(curriculum_materials, target_standard='HSK'):
    """Check if curriculum covers all required vocabulary"""
    # Load standard vocabulary requirements
    if target_standard == 'HSK':
        required_vocab = {level: load_hsk_level(level) for level in range(1, 7)}
    else:
        raise ValueError(f'Unsupported standard: {target_standard}')

    # Extract curriculum vocabulary
    curriculum_vocab = set()
    for material in curriculum_materials:
        words = extract_vocabulary(material.text)
        curriculum_vocab.update(words)

    # Check coverage for each level
    coverage = {}
    for level, vocab_set in required_vocab.items():
        covered = curriculum_vocab & vocab_set
        coverage[level] = {
            'required': len(vocab_set),
            'covered': len(covered),
            'coverage_rate': len(covered) / len(vocab_set),
            'missing': list(vocab_set - covered)[:20],  # Show first 20 missing
        }

    return coverage

Pattern 2: Gap Detection in Curriculum#

Find missing proficiency levels:

def detect_curriculum_gaps(materials_library):
    """Find missing difficulty levels in material collection"""
    # Classify all materials by difficulty
    classified = []
    for material in materials_library:
        difficulty = estimate_difficulty(material.text)
        classified.append({
            'material': material,
            'difficulty': difficulty,
            'estimated_hsk': map_difficulty_to_hsk(difficulty),
        })

    # Group by HSK level
    by_level = {}
    for item in classified:
        level = item['estimated_hsk']
        if level not in by_level:
            by_level[level] = []
        by_level[level].append(item['material'])

    # Detect gaps
    gaps = []
    for level in range(1, 7):  # HSK 1-6
        if level not in by_level or len(by_level[level]) < 3:
            gaps.append({
                'level': level,
                'available_materials': len(by_level.get(level, [])),
                'recommended_materials': 3,  # At least 3 per level
                'priority': 'high' if level <= 3 else 'medium',  # Lower levels more critical
            })

    return {
        'gaps': gaps,
        'distribution': {level: len(materials) for level, materials in by_level.items()},
        'recommendations': generate_acquisition_recommendations(gaps),
    }

Pattern 3: Textbook Comparison for Adoption#

Evaluate competing textbooks:

def compare_textbooks(candidates, target_level):
    """Rank textbooks for curriculum adoption"""
    scored_books = []

    for book in candidates:
        analysis = analyze_textbook(book)

        # Score on multiple dimensions
        score = {
            'book': book,
            'level_accuracy': abs(analysis['estimated_level'] - target_level),  # Lower = better
            'vocabulary_coverage': check_hsk_coverage(book, target_level),
            'progression_quality': analyze_chapter_progression(book),
            'price_per_page': book.price / book.page_count,
            'publisher_reputation': get_publisher_score(book.publisher),
        }

        # Weighted composite score
        composite = (
            (1 - score['level_accuracy'] / 6) * 0.3 +  # 30% level match
            score['vocabulary_coverage'] * 0.3 +        # 30% vocab coverage
            score['progression_quality'] * 0.2 +        # 20% internal progression
            (1 - normalize(score['price_per_page'])) * 0.1 +  # 10% cost
            score['publisher_reputation'] * 0.1         # 10% reputation
        )

        score['composite_score'] = composite
        scored_books.append(score)

    # Rank by composite score
    return sorted(scored_books, key=lambda x: x['composite_score'], reverse=True)

Pattern 4: Vocabulary Distribution Planner#

Ensure even vocabulary load across semesters:

def plan_vocabulary_distribution(years=4, semesters_per_year=2):
    """Plan vocabulary introduction across multi-year program"""
    total_semesters = years * semesters_per_year
    hsk_6_vocab = 5000  # HSK 6 target

    # Calculate vocabulary per semester (accounting for forgetting)
    vocab_per_semester = calculate_optimal_load(
        total_vocab=hsk_6_vocab,
        semesters=total_semesters,
        retention_rate=0.85,  # Assume 15% forgetting per semester
    )

    # Build progression plan
    plan = []
    cumulative_vocab = 0

    for year in range(1, years + 1):
        for semester in range(1, semesters_per_year + 1):
            # Vocabulary load increases gradually
            load_multiplier = 1 + (year - 1) * 0.2  # Year 4 = 60% more vocab than Year 1
            semester_load = int(vocab_per_semester * load_multiplier)

            cumulative_vocab += semester_load

            plan.append({
                'year': year,
                'semester': semester,
                'new_vocabulary': semester_load,
                'cumulative': cumulative_vocab,
                'target_hsk_level': map_vocab_to_hsk(cumulative_vocab),
                'weekly_load': semester_load / 15,  # 15-week semester
            })

    return plan
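The calculate_optimal_load helper is assumed above but never defined. One minimal interpretation, under the same geometric-forgetting assumption (each batch retains retention_rate of itself per elapsed semester), solves for a constant per-semester load whose retained sum reaches the target:

```python
def calculate_optimal_load(total_vocab, semesters, retention_rate):
    """Per-semester new-vocabulary load such that the *retained* total
    reaches total_vocab by the final semester, assuming vocabulary taught
    k semesters ago retains retention_rate**k of itself."""
    # A constant load v retains v * sum(r**k for k in 0..semesters-1),
    # so divide the target by that geometric sum.
    retained_fraction = sum(retention_rate ** k for k in range(semesters))
    return total_vocab / retained_fraction
```

With total_vocab=5000, semesters=8, retention_rate=0.85 this yields roughly 1031 new words per semester, noticeably above the naive 5000 / 8 = 625, because forgetting must be overcompensated.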

Pattern 5: Competitive Benchmark Report#

Compare program to peer institutions:

def generate_benchmark_report(own_program, peer_programs):
    """Compare program materials to competitors"""
    # Analyze own program
    own_analysis = {
        'total_materials': len(own_program.materials),
        'hsk_coverage': analyze_hsk_coverage(own_program),
        'progression_quality': analyze_progression(own_program.course_sequence),
        'cost_per_student': calculate_program_cost(own_program),
    }

    # Analyze peers
    peer_analyses = [
        analyze_program(peer) for peer in peer_programs
    ]

    # Calculate percentiles
    benchmarks = {
        'materials_count': {
            'own': own_analysis['total_materials'],
            'peer_avg': statistics.mean([p['total_materials'] for p in peer_analyses]),
            'percentile': calculate_percentile(own_analysis['total_materials'], [p['total_materials'] for p in peer_analyses]),
        },
        'hsk_coverage': {
            'own': own_analysis['hsk_coverage'],
            'peer_avg': statistics.mean([p['hsk_coverage'] for p in peer_analyses]),
            'percentile': calculate_percentile(own_analysis['hsk_coverage'], [p['hsk_coverage'] for p in peer_analyses]),
        },
        'cost_per_student': {
            'own': own_analysis['cost_per_student'],
            'peer_avg': statistics.mean([p['cost_per_student'] for p in peer_analyses]),
            'percentile': calculate_percentile(own_analysis['cost_per_student'], [p['cost_per_student'] for p in peer_analyses], reverse=True),  # Lower cost = better
        },
    }

    return {
        'benchmarks': benchmarks,
        'competitive_position': assess_competitive_position(benchmarks),
        'recommendations': generate_recommendations(benchmarks),
    }
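A sketch of the calculate_percentile helper used above, defined as the share of peers the program beats; the reverse flag handles metrics like cost where lower values should rank higher:

```python
def calculate_percentile(own_value, peer_values, reverse=False):
    """Percentile rank of own_value among peer programs.
    reverse=True treats lower values as better (e.g. cost per student)."""
    if not peer_values:
        return 100.0
    if reverse:
        beaten = sum(1 for v in peer_values if v > own_value)
    else:
        beaten = sum(1 for v in peer_values if v < own_value)
    return 100.0 * beaten / len(peer_values)
```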

Trade-offs#

Automated Curriculum Analysis Benefits#

  • Objectivity: Data-driven decisions reduce bias
  • Scale: Analyze hundreds of materials efficiently
  • Standards alignment: Validate compliance with HSK/TOCFL
  • Gap detection: Identify missing levels before students suffer

Automated Curriculum Analysis Costs#

  • Context blindness: Tools miss pedagogical quality, cultural relevance
  • Over-reliance on metrics: Vocabulary coverage ≠ teaching effectiveness
  • Standards evolution: HSK 2012 → 2021 requires recalibration
  • Faculty resistance: Perception of automation replacing expert judgment

When Automation is Worth It#

Use automated analysis when:

  • Large program (>100 students/year, multi-year curriculum)
  • Standards-driven (HSK/TOCFL alignment required for accreditation)
  • Textbook adoption decisions (objective comparison needed)
  • Program review cycles (periodic validation of curriculum quality)

When Manual Analysis is Better#

Rely on faculty judgment when:

  • Small programs (20-50 students, boutique courses)
  • Experimental curricula (pioneering new approaches)
  • Heritage learner programs (different needs than L2 learners)
  • Highly specialized content (business Chinese, medical Chinese)

Missing Capabilities#

No existing tool provides:

  • ACTFL integration (automated proficiency level mapping)
  • Pedagogical quality metrics (teachability, engagement potential)
  • Cultural content analysis (cultural relevance, diversity)
  • Skills-based assessment (speaking/listening difficulty, not just reading)
  • Retention modeling (predict vocabulary forgetting over semesters)
  • Cost optimization (budget allocation for maximum curriculum quality)

Curriculum designers must combine automated tools with expert judgment.

Real-World Integration Examples#

University Program Review Dashboard#

class ProgramReviewDashboard:
    def __init__(self, program_name, years=4):
        self.program = load_program(program_name)
        self.years = years

    def generate_annual_report(self):
        """Comprehensive program review"""
        return {
            'enrollment': self.get_enrollment_stats(),
            'standards_coverage': self.validate_hsk_coverage(),
            'progression_quality': self.analyze_course_progression(),
            'material_gaps': self.detect_curriculum_gaps(),
            'competitive_position': self.benchmark_against_peers(),
            'budget_analysis': self.analyze_material_costs(),
            'recommendations': self.generate_action_items(),
        }

    def generate_action_items(self):
        """Prioritized recommendations"""
        gaps = self.detect_curriculum_gaps()
        progression = self.analyze_course_progression()

        actions = []

        # Critical gaps
        for gap in gaps['gaps']:
            if gap['priority'] == 'high':
                actions.append({
                    'priority': 1,
                    'action': f"Acquire {gap['recommended_materials']} materials for HSK {gap['level']}",
                    'deadline': 'Next semester',
                })

        # Progression issues
        if progression['issues']:
            actions.append({
                'priority': 2,
                'action': 'Revise course sequence to fix difficulty spikes',
                'deadline': 'Next academic year',
            })

        return actions

Textbook Adoption Committee Tool#

def textbook_adoption_analysis(candidates, committee_criteria):
    """Support faculty adoption decision"""
    # Analyze each candidate
    analyses = []
    for book in candidates:
        analysis = {
            'book': book.title,
            'publisher': book.publisher,
            'price': book.price,

            # Automated metrics
            'estimated_level': estimate_difficulty(book.text),
            'hsk_coverage': check_hsk_coverage(book),
            'progression_quality': analyze_chapter_progression(book),

            # Manual review scores (faculty input)
            'pedagogical_quality': None,  # Faculty scores 1-10
            'cultural_content': None,     # Faculty scores 1-10
            'exercises_quality': None,    # Faculty scores 1-10
        }

        analyses.append(analysis)

    # Generate committee report
    return {
        'candidates': analyses,
        'automated_ranking': rank_by_automated_metrics(analyses),
        'faculty_review_form': generate_review_form(analyses),
        'recommendation_template': generate_committee_recommendation(),
    }

Multi-Year Curriculum Builder#

def build_curriculum(target_proficiency='HSK6', years=4):
    """Design multi-year curriculum from scratch"""
    # Calculate vocabulary targets per year
    vocab_plan = plan_vocabulary_distribution(years)

    # Find materials matching each year's target
    curriculum = {}
    for year in range(1, years + 1):
        year_plan = [p for p in vocab_plan if p['year'] == year]
        target_level = year_plan[0]['target_hsk_level']

        # Search material library
        suitable_materials = find_materials_for_level(target_level)

        curriculum[f'Year {year}'] = {
            'target_hsk_level': target_level,
            'vocabulary_goal': year_plan[-1]['cumulative'],
            'recommended_materials': suitable_materials[:3],  # Top 3
            'supplementary_resources': suggest_supplementary(target_level),
        }

    return {
        'curriculum': curriculum,
        'vocabulary_progression': vocab_plan,
        'total_cost': calculate_total_cost(curriculum),
        'implementation_timeline': generate_timeline(curriculum),
    }

Performance Considerations#

Typical Workload#

Curriculum designers analyze:

  • Entire program libraries (100-500 textbooks/materials)
  • Multi-year course sequences (4-10 courses)
  • Competitor programs (5-20 peer institutions)

Optimization Strategies#

# Cache material analyses (don't re-analyze every time)
material_cache = {}

def analyze_with_cache(material):
    if material.id in material_cache:
        return material_cache[material.id]

    analysis = analyze_material(material)
    material_cache[material.id] = analysis
    return analysis

# Parallel processing for large libraries
from multiprocessing import Pool

def analyze_library_parallel(materials):
    with Pool() as pool:
        results = pool.map(analyze_material, materials)
    return results
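When analyses are keyed by a hashable material id, functools.lru_cache provides the same caching without the hand-rolled dictionary. The body below is a stand-in for the real load-and-analyze work (the call counter only demonstrates that repeat lookups skip recomputation):

```python
from functools import lru_cache

calls = {'n': 0}  # instrumentation only, to show the cache working

@lru_cache(maxsize=1024)
def analyze_material_cached(material_id):
    """Cached per-material analysis keyed on a hashable id; repeated calls
    for the same id return the stored result without re-analyzing."""
    calls['n'] += 1
    return {'id': material_id}  # stand-in for analyze_material(load_material(id))

analyze_material_cached('hsk3-reader-01')
analyze_material_cached('hsk3-reader-01')  # second call served from cache
```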

Conclusion#

Curriculum design requires automated analysis plus expert judgment. Tools provide:

  • Objective standards alignment validation (HSK/TOCFL coverage)
  • Gap detection (missing proficiency levels)
  • Competitive benchmarking (compare to peer programs)
  • Progression analysis (smooth difficulty increases)

However, automation cannot replace faculty expertise on:

  • Pedagogical quality (teachability, engagement)
  • Cultural relevance (appropriate content for learners)
  • Program-specific needs (heritage learners, specialized domains)
  • Budget vs quality trade-offs (institutional constraints)

Best practice: Use automation for QA and evidence gathering, rely on faculty for final curriculum decisions. Automated metrics inform but don’t dictate.


Use Case: Graded Reader Publishers#

Scenario Description#

Publishers create and categorize books, articles, and reading materials by difficulty level, ensuring learners can find content matching their proficiency without frustration.

User Persona#

  • Primary: Educational publishers (Mandarin Companion, Chinese Breeze, Sinolingua)
  • Secondary: Content platforms (The Chairman’s Bao, Du Chinese, Decipher Chinese)
  • Output: Graded readers, leveled articles, children’s books
  • Scale: Catalogs of 100-1000+ books/articles needing consistent leveling

Examples of Real Applications#

  • Mandarin Companion: Graded readers leveled 1-2 (breakthrough → elementary)
  • Chinese Breeze: 8 levels (300 → 3000 character vocabulary)
  • The Chairman’s Bao: Daily news articles graded by HSK level
  • Du Chinese: Stories leveled from HSK 1 → HSK 6
  • Decipher Chinese: Chapters marked by character frequency coverage

Technical Requirements#

Core Capabilities#

  1. Automated level assignment: Classify texts into difficulty tiers (HSK 1-6, CEFR A1-C2)
  2. Vocabulary coverage analysis: Ensure text uses only characters/words at target level
  3. Difficulty validation: Verify author hasn’t accidentally used advanced vocabulary
  4. Comparative analysis: Rank books within same level (easier HSK 3 vs harder HSK 3)
  5. Batch processing: Analyze entire book catalog for consistency
  6. Vocabulary extraction: Generate word lists for each book (appendix material)

Performance Constraints#

  • Batch processing: Analyze 50k-100k word manuscripts
  • Latency: Minutes acceptable (editorial workflow, not real-time)
  • Accuracy: High priority (mislabeled books hurt learner trust)
  • Reporting: Detailed breakdowns for editors (which chapters are too hard?)

Accuracy Requirements#

  • Critical: No advanced vocabulary in beginner texts (breaks learner flow)
  • Critical: Consistent leveling across book series
  • Important: Character frequency accuracy (rare characters stand out)
  • Nice-to-have: Sentence complexity metrics (long sentences = harder)

Library Analysis#

CC-CEDICT + HSK Word Lists#

Strengths for Publishers:

  • HSK 1-6 tagging (~5000 words with level assignments)
  • Standardized levels (aligns with learner expectations)
  • Comprehensive coverage (100k+ dictionary entries)
  • Batch-friendly (process entire manuscripts)

Weaknesses for Publishers:

  • ⚠️ HSK 2012 vs 2021 standard differences (vocabulary lists changed)
  • ⚠️ Incomplete coverage (many common words lack HSK tags)
  • ⚠️ No proper name filtering (character names counted as rare)

Verdict: Essential foundation but needs editorial oversight.

BCC Character Frequency List#

Strengths for Publishers:

  • 10 billion character corpus (authoritative frequency data)
  • Fine-grained rankings (differentiate top 500 vs top 1500)
  • Contemporary relevance (2000-2020 text sources)

Weaknesses for Publishers:

  • ❌ Character-only (books need word-level analysis)
  • ❌ No proficiency mapping (frequency ≠ HSK level)

Verdict: Excellent for character difficulty, insufficient alone.

SUBTLEX-CH (Word Frequency from Subtitles)#

Strengths for Publishers:

  • Word frequency data (from a film-subtitle corpus of roughly 47 million characters / 33 million words)
  • Spoken language focus (matches conversational content)
  • Psycholinguistic validity (frequency correlates with familiarity)

Weaknesses for Publishers:

  • ⚠️ Subtitle corpus bias (informal language overrepresented)
  • ⚠️ No HSK mapping
  • ⚠️ Not ideal for literary/formal texts

Verdict: Useful for conversational readers, less so for classical texts.

jieba + Custom Dictionaries#

Strengths for Publishers:

  • Word segmentation (essential for word-level analysis)
  • Custom dictionaries (add HSK tags, proper names)
  • Fast batch processing (analyze full books in seconds)

Weaknesses for Publishers:

  • ⚠️ Segmentation errors on literary text (classical Chinese, idioms)
  • ⚠️ No built-in leveling (requires custom logic)

Verdict: Critical infrastructure, needs integration layer.
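jieba's custom-dictionary support is the usual route to the proper-name handling mentioned above: a user dictionary is a plain-text file of `word freq tag` lines (frequency and POS tag are optional) loaded with jieba.load_userdict. A sketch that generates one from a reader's name list (the file name and example names are hypothetical):

```python
def build_userdict(entries, path):
    """Write a jieba user-dictionary file: one 'word freq tag' line per
    entry. A high frequency keeps multi-character proper names from
    being split during segmentation."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, freq, tag in entries:
            f.write(f"{word} {freq} {tag}\n")

# Proper names from a graded reader, tagged nr (person) / ns (place) so
# downstream analysis can exclude them from rare-word counts:
build_userdict(
    [("高大伟", 10000, "nr"), ("王府井", 10000, "ns")],
    "reader_names.dict",
)
# jieba.load_userdict("reader_names.dict") would then pick these up.
```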

Detailed Feature Comparison#

Feature              | CC-CEDICT+HSK | BCC Freq  | SUBTLEX-CH | jieba     | Publisher Value
---------------------|---------------|-----------|------------|-----------|----------------------------------
HSK levels           | ✅            | ❌        | ❌         | ⚠️ Custom | Critical (reader expectations)
Character frequency  | ❌            | ✅        | ⚠️ Partial | ❌        | High (difficulty signals)
Word frequency       | ❌            | ❌        | ✅         | ⚠️ Custom | High (vocabulary difficulty)
Batch processing     | ✅            | ✅        | ✅         | ✅        | Critical (catalog analysis)
Proper name handling | ❌            | ❌        | ❌         | ⚠️ Custom | Important (avoid false positives)
Coverage reports     | ⚠️ Manual     | ⚠️ Manual | ⚠️ Manual  | ⚠️ Manual | Critical (editorial feedback)

Recommendation#

Multi-Source Integration for Publishers#

Publishing workflow requires combining multiple data sources:

  1. jieba: Segment manuscripts into words
  2. CC-CEDICT + HSK: Map words to proficiency levels
  3. BCC Character Frequency: Identify rare characters
  4. Custom proper name dictionary: Filter character/place names
  5. Editorial rules engine: Custom vocabulary limits per level

Publishing Workflow Integration#

# Pseudocode for manuscript grading
import jieba
from collections import Counter

def grade_manuscript(text, target_level='HSK3'):
    """Analyze manuscript for difficulty and vocabulary compliance"""

    # 1. Segment into words
    words = list(jieba.cut(text))

    # 2. Look up HSK levels
    word_levels = {word: hsk_dict.get(word, 'unknown') for word in set(words)}

    # 3. Find vocabulary violations (words above target level)
    violations = [
        word for word, level in word_levels.items()
        if level == 'unknown' or level > target_level  # 'HSK4' > 'HSK3' compares correctly as strings
    ]

    # 4. Calculate character coverage
    chars = [c for c in text if is_cjk(c)]
    rare_chars = [c for c in chars if char_frequency_rank(c) > 3000]

    # 5. Generate editorial report
    return {
        'recommended_level': estimate_level(word_levels, rare_chars),
        'target_level': target_level,
        'compliant': len(violations) == 0,
        'violations': violations[:20],  # Show first 20
        'vocabulary_distribution': Counter(word_levels.values()),
        'character_coverage_at_target': calculate_coverage(chars, target_level),
        'editorial_notes': generate_suggestions(violations),
    }
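grade_manuscript above leans on two helpers, is_cjk and char_frequency_rank, that are not defined in the sketch. Minimal versions might look like the following (the 10000 fallback rank for unseen characters is an arbitrary choice, and extension blocks beyond the BMP are omitted for brevity):

```python
def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs blocks."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF)  # Extension A

def char_frequency_rank(ch, rank_table, unseen_rank=10000):
    """Frequency rank from a char -> rank mapping; characters missing
    from the table are treated as very rare rather than raising KeyError."""
    return rank_table.get(ch, unseen_rank)
```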

Implementation Patterns#

Pattern 1: Manuscript Compliance Check#

Validate that book uses only target-level vocabulary:

def validate_vocabulary_compliance(manuscript, target_hsk_level):
    """Check if manuscript stays within vocabulary constraints"""
    words = segment(manuscript)
    word_counts = Counter(words)  # count once instead of calling words.count() per word
    violations = []

    for word in word_counts:
        word_level = get_hsk_level(word)

        if word_level > target_hsk_level:
            violations.append({
                'word': word,
                'level': word_level,
                'occurrences': word_counts[word],
                'suggested_alternatives': find_simpler_synonyms(word, target_hsk_level),
            })

    return {
        'compliant': len(violations) == 0,
        'violations': sorted(violations, key=lambda x: x['occurrences'], reverse=True),
        'compliance_rate': (len(set(words)) - len(violations)) / len(set(words)),
    }

Pattern 2: Catalog Consistency Validation#

Ensure books labeled same level have similar difficulty:

def validate_catalog_consistency(books_by_level):
    """Check that books within same level have comparable difficulty"""

    for level, books in books_by_level.items():
        if len(books) < 2:
            continue  # statistics.stdev needs at least two samples
        difficulties = [calculate_difficulty_score(book.text) for book in books]
        mean_diff = statistics.mean(difficulties)
        std_dev = statistics.stdev(difficulties)

        # Flag outliers
        for book, difficulty in zip(books, difficulties):
            if abs(difficulty - mean_diff) > 2 * std_dev:
                print(f"WARNING: {book.title} at level {level} is outlier")
                print(f"  Difficulty: {difficulty:.2f} (mean: {mean_diff:.2f})")
                suggest_relevel(book, difficulty)

Pattern 3: Chapter-by-Chapter Difficulty Curve#

Ensure book difficulty increases gradually:

def analyze_chapter_progression(chapters):
    """Validate that book maintains consistent difficulty"""
    chapter_difficulties = []

    for i, chapter in enumerate(chapters):
        difficulty = calculate_difficulty_score(chapter.text)
        new_vocabulary = count_new_words_since_chapter(chapter, chapters[:i])

        chapter_difficulties.append({
            'chapter': i + 1,
            'difficulty_score': difficulty,
            'new_vocabulary_count': new_vocabulary,
        })

        # Flag sudden jumps
        if i > 0:
            prev_difficulty = chapter_difficulties[i-1]['difficulty_score']
            jump = difficulty - prev_difficulty

            if jump > 1.0:  # Difficulty spike
                print(f"WARNING: Chapter {i+1} has difficulty spike (+{jump:.2f})")

    return chapter_difficulties

Pattern 4: Vocabulary Appendix Generation#

Auto-generate word lists for book appendices:

def generate_vocabulary_appendix(manuscript, target_level):
    """Create word list for back-of-book appendix"""
    words = segment(manuscript)
    word_freq = Counter(words)

    # Categorize vocabulary
    appendix = {
        'target_level_words': [],
        'review_words': [],  # Below target level
        'challenge_words': [],  # Above target level
    }

    for word, freq in word_freq.items():
        level = get_hsk_level(word)
        entry = {
            'word': word,
            'pinyin': get_pinyin(word),
            'definition': get_definition(word),
            'frequency_in_text': freq,
        }

        if level == target_level:
            appendix['target_level_words'].append(entry)
        elif level < target_level:
            appendix['review_words'].append(entry)
        else:
            appendix['challenge_words'].append(entry)

    # Sort by frequency
    for category in appendix.values():
        category.sort(key=lambda x: x['frequency_in_text'], reverse=True)

    return appendix

Pattern 5: Comparative Difficulty Ranking#

Rank books within same level from easiest to hardest:

def rank_books_within_level(books, level):
    """Order books by difficulty for reader recommendations"""
    scored_books = []

    for book in books:
        if book.target_level != level:
            continue

        # Multiple difficulty signals
        score = {
            'character_coverage': character_coverage_at_level(book.text, level),
            'unique_char_count': count_unique_characters(book.text),
            'avg_sentence_length': average_sentence_length(book.text),
            'rare_word_ratio': count_rare_words(book.text, level) / count_words(book.text),
        }

        # Weighted composite score
        composite = (
            score['character_coverage'] * 0.4 +
            (1 - score['rare_word_ratio']) * 0.3 +
            (3000 - score['unique_char_count']) / 3000 * 0.2 +
            (30 - score['avg_sentence_length']) / 30 * 0.1
        )

        scored_books.append({
            'book': book,
            'composite_score': composite,
            'breakdown': score,
        })

    # Return from easiest to hardest
    return sorted(scored_books, key=lambda x: x['composite_score'], reverse=True)

Trade-offs#

Automated Grading Benefits#

  • Consistency: Objective metrics reduce subjective leveling errors
  • Scale: Analyze hundreds of books quickly
  • Quality assurance: Catch vocabulary violations before publication
  • Competitive analysis: Benchmark against other publishers

Automated Grading Costs#

  • False positives: Proper names flagged as rare (need manual filtering)
  • Context blindness: Algorithms miss stylistic difficulty (prose quality)
  • HSK evolution: Vocabulary standards change (2012 → 2021 revision)
  • Genre differences: News articles ≠ fiction ≠ textbooks (different vocab)

When Automation is Worth It#

Use automated grading when:

  • Large catalog requiring consistent leveling (50+ books)
  • Series with strict vocabulary control (graded readers)
  • Multiple authors need alignment (editorial coordination)
  • Competitive positioning requires precise differentiation

When Manual is Better#

Rely on editorial judgment when:

  • Small catalog (under 20 books)
  • Literary quality more important than strict leveling
  • Genre-specific vocabulary (business, medicine) not in HSK
  • Pioneering new content types (no benchmarks)

Missing Capabilities#

No existing tool provides:

  • Proper name dictionaries (character/place names for filtering)
  • Genre-specific vocabulary (literary vs conversational vs academic)
  • Sentence complexity metrics (grammar difficulty, not just vocab)
  • Readability formulas (CJK equivalent of Flesch-Kincaid)
  • Comparative benchmarking (how does this compare to competitors?)
  • HSK 2021 migration tools (map old levels to new standard)

Publishers must build custom solutions for these needs.

Real-World Integration Examples#

Editorial Dashboard#

class ManuscriptGrader:
    def __init__(self, target_level):
        self.target_level = target_level
        self.hsk_vocab = load_hsk_vocabulary(target_level)
        self.char_freq = load_character_frequency()

    def grade_and_report(self, manuscript):
        """Generate comprehensive grading report for editors"""
        words = segment(manuscript)
        chars = extract_cjk_characters(manuscript)

        report = {
            'summary': {
                'target_level': self.target_level,
                'recommended_level': self.estimate_level(words, chars),
                'compliant': self.check_compliance(words),
                'word_count': len(words),
                'unique_words': len(set(words)),
            },
            'vocabulary_analysis': self.analyze_vocabulary(words),
            'character_analysis': self.analyze_characters(chars),
            'violations': self.find_violations(words),
            'suggestions': self.generate_suggestions(words),
            'appendix_preview': self.generate_vocabulary_list(words)[:50],
        }

        return report

Catalog Management System#

def update_catalog_leveling(catalog):
    """Re-grade entire catalog for consistency"""
    graded_catalog = []

    for book in catalog:
        auto_level = estimate_level_from_text(book.text)
        manual_level = book.assigned_level

        if auto_level != manual_level:
            print(f"MISMATCH: {book.title}")
            print(f"  Manual: {manual_level}, Auto: {auto_level}")
            review_needed = True
        else:
            review_needed = False

        graded_catalog.append({
            'book': book,
            'auto_level': auto_level,
            'manual_level': manual_level,
            'review_needed': review_needed,
            'difficulty_score': calculate_difficulty_score(book.text),
        })

    return graded_catalog

Quality Assurance Pipeline#

def pre_publication_qa(manuscript, target_level, series_books=None):
    """Final check before printing"""
    issues = []

    # Check 1: Vocabulary compliance
    vocab_check = validate_vocabulary_compliance(manuscript, target_level)
    if not vocab_check['compliant']:
        issues.append({
            'type': 'vocabulary_violation',
            'severity': 'high',
            'details': vocab_check['violations'],
        })

    # Check 2: Rare character check
    rare_chars = find_rare_characters(manuscript, max_rank=3000)
    if len(rare_chars) > 10:
        issues.append({
            'type': 'rare_characters',
            'severity': 'medium',
            'details': rare_chars,
        })

    # Check 3: Consistency with series
    if series_books:
        consistency = check_series_consistency(manuscript, series_books)
        if consistency['outlier']:
            issues.append({
                'type': 'series_inconsistency',
                'severity': 'medium',
                'details': consistency,
            })

    return {
        'ready_for_publication': len(issues) == 0,
        'issues': issues,
    }

Performance Considerations#

Typical Workload#

Publishers process:

  • 50k-100k word manuscripts (full books)
  • Batch analysis of 100-500 books (catalog updates)
  • Chapter-by-chapter review (editorial workflow)

Optimization Strategies#

# Cache character frequency lookups
char_freq_cache = load_character_frequency()  # Load once

# Parallel processing for catalog (on spawn-based platforms, run under
# an `if __name__ == '__main__':` guard)
from multiprocessing import Pool
with Pool() as pool:
    graded_books = pool.map(grade_book, book_catalog)

# Incremental chapter analysis (don't reprocess whole book)
def analyze_chapter_incremental(new_chapter, previous_chapters_vocab):
    new_words = set(segment(new_chapter)) - previous_chapters_vocab
    return analyze_vocabulary(new_words)

Conclusion#

Publishers need multi-library integration with editorial oversight. Automated grading provides:

  • Consistency across large catalogs
  • Objective vocabulary compliance checking
  • Quality assurance before publication

However, algorithms cannot replace editorial judgment on:

  • Literary quality and prose style
  • Genre-appropriate vocabulary
  • Proper name handling
  • Reader engagement factors

Best practice: Use automation for QA and consistency, rely on editors for final leveling decisions.


Use Case: Language Learning Applications#

Scenario Description#

Applications that teach Chinese/Japanese/Korean dynamically adjust content difficulty to match learner proficiency, ensuring materials are challenging but not overwhelming.

User Persona#

  • Primary: Language learning app developers (Duolingo, HelloChinese, ChinesePod)
  • Secondary: Adaptive learning platform builders
  • Platforms: Mobile apps, web apps, spaced repetition systems
  • Scale: Millions of users across proficiency levels (A1 → C2, HSK 1 → 6)

Examples of Real Applications#

  • Duolingo: Lessons scaled to learner progress, introducing new characters gradually
  • HelloChinese: Content graded by HSK level with difficulty indicators
  • ChinesePod: Podcast lessons tagged by proficiency level (Newbie → Advanced)
  • LingQ/Readlang: Reading materials with unknown word highlighting
  • Clozemaster: Sentence difficulty based on vocabulary frequency

Technical Requirements#

Core Capabilities#

  1. Character frequency analysis: Identify rare characters that exceed learner level
  2. HSK/TOCFL mapping: Classify words into standardized proficiency levels
  3. Coverage calculation: % of text the learner can understand
  4. Unknown word detection: Flag vocabulary outside learner’s current level
  5. Difficulty scoring: Assign numeric readability scores (1-10, beginner-advanced)
  6. Batch processing: Analyze lesson libraries (thousands of texts)

Performance Constraints#

  • Latency: Sub-second for individual lessons (real-time preview)
  • Throughput: Process thousands of lessons during content ingestion
  • Memory: Efficient on mobile devices (vocabulary databases can be large)
  • Offline capability: Prefer local processing for mobile apps

Accuracy Requirements#

  • Critical: Correct HSK/TOCFL level assignment (misleveling frustrates learners)
  • Important: Character frequency accuracy (identifies difficulty spikes)
  • Nice-to-have: Context-aware proper name filtering (names shouldn’t count as “rare”)

Library Analysis#

hanziDB / CC-CEDICT with HSK Tags#

Strengths for Learning Apps:

  • HSK level tagging (1-6 classification for ~5000 words)
  • Comprehensive vocabulary (100k+ entries)
  • Offline-capable (bundled dictionary data)
  • Open data (no licensing restrictions)

Weaknesses for Learning Apps:

  • ⚠️ HSK coverage incomplete (many common words lack tags)
  • ⚠️ No character-level frequency data
  • ⚠️ Static data (doesn’t learn from user interactions)

Verdict: Good foundation but needs supplementation.
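A hedged sketch of loading the CC-CEDICT data itself: each entry is one line of the form `Traditional Simplified [pinyin] /gloss/gloss/`, so a small parser covers most of the file. Note that HSK tags are not part of CC-CEDICT and must be joined in from a separate word list:

```python
import re

CEDICT_LINE = re.compile(r'^(\S+) (\S+) \[([^\]]+)\] /(.+)/$')

def parse_cedict_line(line):
    """Parse one CC-CEDICT entry; returns None for comments and
    malformed lines."""
    if line.startswith('#'):
        return None
    m = CEDICT_LINE.match(line.strip())
    if not m:
        return None
    trad, simp, pinyin, glosses = m.groups()
    return {
        'traditional': trad,
        'simplified': simp,
        'pinyin': pinyin,
        'definitions': glosses.split('/'),
    }
```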

Jun Da Character Frequency Lists#

Strengths for Learning Apps:

  • Character frequency rankings (8000+ characters from corpus analysis)
  • Research-backed (based on real text corpora)
  • Granular differentiation (top 500 vs top 3000 vs rare)

Weaknesses for Learning Apps:

  • ❌ Character-only (no word-level frequency)
  • ❌ No HSK mapping
  • ❌ Requires separate word segmentation

Verdict: Essential for character-level analysis but incomplete alone.
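Turning a published frequency table into the char-to-rank lookup the snippets in this section assume takes only a few lines. Column layout differs between published lists, so treat char_col below as an assumption to verify against the actual file:

```python
def load_rank_table(path, char_col=1, sep='\t'):
    """Build a char -> frequency-rank dict from a delimited frequency
    list, assigning rank by line order (rank 1 = most frequent)."""
    ranks = {}
    with open(path, encoding='utf-8') as f:
        for rank, line in enumerate(f, start=1):
            fields = line.rstrip('\n').split(sep)
            if len(fields) > char_col:
                # keep the first (best) rank if a character repeats
                ranks.setdefault(fields[char_col], rank)
    return ranks
```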

Modern Chinese Character Frequency List (BCC Corpus)#

Strengths for Learning Apps:

  • Massive corpus (10 billion characters from 2000-2020 texts)
  • Contemporary relevance (includes internet language)
  • High accuracy (large sample size reduces noise)

Weaknesses for Learning Apps:

  • ❌ Character-only (no word data)
  • ❌ No proficiency level mapping
  • ❌ Requires preprocessing

Verdict: Best character frequency source, needs integration layer.

jieba with Custom Dictionaries#

Strengths for Learning Apps:

  • Word segmentation (essential for word-level analysis)
  • Custom dictionaries (add HSK tags, frequency data)
  • Fast processing (C++ implementation)
  • Widely used (battle-tested in production)

Weaknesses for Learning Apps:

  • ⚠️ Segmentation errors on domain-specific text
  • ⚠️ No built-in readability features (need custom logic)

Verdict: Critical infrastructure component, not a complete solution.

Detailed Feature Comparison#

| Feature | hanziDB+HSK | Jun Da | BCC Corpus | jieba | Learning Value |
| --- | --- | --- | --- | --- | --- |
| HSK levels | ✅ | ❌ | ❌ | ⚠️ Custom | Critical (curriculum alignment) |
| Character frequency | ❌ | ✅ | ✅ | ❌ | High (difficulty signals) |
| Word frequency | ❌ | ❌ | ❌ | ⚠️ Custom | High (readability core metric) |
| Word segmentation | ❌ | ❌ | ❌ | ✅ | Critical (prerequisite) |
| Coverage calculation | ⚠️ Manual | ⚠️ Manual | ⚠️ Manual | ⚠️ Manual | High (comprehension predictor) |
| Offline-capable | ✅ | ✅ | ✅ | ✅ | High (mobile apps) |

Recommendation#

Multi-Library Integration Required#

No single library provides complete readability assessment. Best practice combines:

  1. jieba: Word segmentation (prerequisite for all analysis)
  2. BCC Character Frequency or Jun Da: Character-level difficulty
  3. CC-CEDICT + HSK tags: Word-level proficiency mapping
  4. Custom logic: Coverage calculation, scoring algorithms

Implementation Stack#

# Pseudocode showing integration pattern
import jieba

# 1. Segment text into words (materialize the generator so it can be reused)
words = list(jieba.cut(text))

# 2. Look up each word's HSK level (7 = beyond HSK, so untagged words count as hard)
hsk_levels = [hsk_dict.get(word, 7) for word in words]

# 3. Calculate coverage at learner's level
learner_level = 3  # HSK 3
known_words = sum(1 for level in hsk_levels if level <= learner_level)
coverage = known_words / len(words)

# 4. Character frequency analysis (characters missing from the table count as rare)
chars = list(text)
rare_chars = [c for c in chars if char_frequency.get(c, 99999) > 3000]  # Beyond top 3000

# 5. Difficulty score
difficulty = calculate_difficulty(coverage, rare_chars, set(chars))

Implementation Patterns#

Pattern 1: Adaptive Content Selection#

Select lesson content matching learner’s proficiency:

def select_lesson_for_learner(lessons, learner_hsk_level):
    """Find lessons with 90-95% vocabulary coverage"""
    suitable = []

    for lesson in lessons:
        words = segment(lesson.text)
        known_count = sum(1 for w in words if word_hsk_level(w) <= learner_hsk_level)
        coverage = known_count / len(words)

        if 0.90 <= coverage <= 0.95:  # Sweet spot for learning
            suitable.append(lesson)

    return suitable

Pattern 2: Difficulty Preview#

Show learners what % of content they’ll understand:

def preview_difficulty(text, learner_vocab):
    """Calculate comprehension metrics before learner starts"""
    words = segment(text)

    known_words = set(words) & learner_vocab
    unknown_words = set(words) - learner_vocab
    coverage = len(known_words) / len(set(words))

    return {
        'coverage': coverage,
        'total_words': len(words),
        'unique_words': len(set(words)),
        'new_words_to_learn': list(unknown_words)[:10],  # Show first 10
        'difficulty_rating': classify_difficulty(coverage),
    }

Pattern 3: Progressive Difficulty Curve#

Ensure lesson sequence increases difficulty gradually:

def validate_curriculum_progression(lessons):
    """Check that difficulty increases smoothly"""
    difficulties = [calculate_difficulty(lesson.text) for lesson in lessons]

    # Flag large jumps
    for i in range(1, len(difficulties)):
        jump = difficulties[i] - difficulties[i-1]
        if jump > 1.5:  # More than 1.5 levels jump
            print(f"Warning: Large difficulty spike at lesson {i}")

    # Ensure progression
    if difficulties != sorted(difficulties):
        print("Warning: Lessons not in difficulty order")

Pattern 4: Unknown Word Highlighting#

Visual feedback on which vocabulary is new:

def annotate_unknown_words(text, learner_vocab):
    """Mark unknown words for learner attention"""
    words = segment(text)

    annotated = []
    for word in words:
        if word in learner_vocab:
            annotated.append({'word': word, 'status': 'known'})
        else:
            level = word_hsk_level(word)
            annotated.append({
                'word': word,
                'status': 'unknown',
                'hsk_level': level,
                'frequency_rank': word_frequency_rank(word),
            })

    return annotated

Pattern 5: Spaced Repetition Integration#

Track which difficult words learner has mastered:

class VocabularyTracker:
    def __init__(self, learner_id):
        self.learner_vocab = load_known_words(learner_id)
        self.learned_words = load_learning_history(learner_id)

    def update_after_lesson(self, lesson_text):
        """Update learner's vocabulary after completing lesson"""
        words = segment(lesson_text)
        new_words = set(words) - self.learner_vocab

        # Add to learning queue for spaced repetition
        for word in new_words:
            self.learned_words.add(word, difficulty=word_hsk_level(word))

        # Promote words to known after N successful reviews
        mastered = self.learned_words.get_mastered()
        self.learner_vocab.update(mastered)

Trade-offs#

Multi-Library Integration Benefits#

  • Comprehensive analysis: Character + word + proficiency levels
  • Curriculum alignment: HSK/TOCFL mapping for standardized programs
  • Accurate difficulty scoring: Multiple signals reduce false positives

Multi-Library Integration Costs#

  • Complexity: Must maintain multiple data sources
  • Data sync: HSK standards update periodically (2012 → 2021 revision)
  • Segmentation errors: jieba mistakes propagate through pipeline
  • Memory footprint: Large dictionaries for offline mobile apps

When Complexity is Worth It#

Use full integration for learning apps when:

  • Adaptive content selection is core feature
  • Target audience: serious learners (not casual tourists)
  • Large content library needs automated grading
  • Curriculum must align with standardized tests (HSK, TOCFL)

When Simpler is Better#

Consider lighter approaches if:

  • Only need rough difficulty tiers (easy/medium/hard)
  • Small, manually curated content library
  • Target is single proficiency level (no adaptation needed)
  • Memory/size severely constrained (offline mobile)
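In the rough-tier case, a single coverage check against one known-character set may be all that is needed. A sketch with illustrative (uncalibrated) thresholds:

```python
def rough_tier(text, common_chars):
    """Bucket text into easy/medium/hard by share of 'common' characters.
    common_chars: characters the target audience is assumed to know.
    The 0.95/0.80 thresholds are illustrative, not calibrated."""
    hanzi = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    if not hanzi:
        return "easy"  # no Chinese characters at all
    coverage = sum(c in common_chars for c in hanzi) / len(hanzi)
    if coverage >= 0.95:
        return "easy"
    if coverage >= 0.80:
        return "medium"
    return "hard"

rough_tier("我是学生", set("我是学生"))  # -> 'easy'
```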

Missing Capabilities#

No existing library provides:

  • Context-aware proper name filtering (names shouldn’t count as rare)
  • Domain-specific vocabulary (business Chinese, medical Chinese)
  • Grammar complexity metrics (sentence structure, not just vocab)
  • Learner corpus (actual learner comprehension data for validation)
  • CEFR mapping (HSK ≠ CEFR, no standard conversion)

These require custom development or additional data sources.
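Proper-name filtering, for instance, has to be hand-built. A deliberately naive sketch using a common-surname heuristic; a production system would instead use a POS tagger (jieba.posseg tags person names `nr`):

```python
# A few of the most frequent Chinese surnames
COMMON_SURNAMES = set("王李张刘陈杨黄赵周吴")

def looks_like_person_name(word):
    """Naive heuristic: a 2-3 character word starting with a common surname.
    Misses many names and flags some non-names; a POS tagger does better."""
    return 2 <= len(word) <= 3 and word[0] in COMMON_SURNAMES

def filter_proper_names(words):
    """Drop likely person names so they are not scored as rare vocabulary."""
    return [w for w in words if not looks_like_person_name(w)]

filter_proper_names(["王小明", "喜欢", "研究"])  # -> ['喜欢', '研究']
```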

Real-World Integration Examples#

Duolingo-Style Adaptive Lessons#

def generate_lesson_variants(base_text, learner_level):
    """Create easier/harder versions of same lesson"""
    words = segment(base_text)

    # Easier: Replace rare words with common synonyms
    easy_text = replace_rare_words(words, max_hsk=learner_level - 1)

    # Harder: Keep original text
    hard_text = base_text

    return {
        'review': easy_text,      # Below learner level (reinforcement)
        'lesson': base_text,       # At learner level (learning)
        'challenge': hard_text,    # Above learner level (stretch)
    }

Content Recommendation Engine#

def recommend_next_content(learner_id, content_library):
    """Find content at learner's current level + 1"""
    learner_vocab = get_learner_vocabulary(learner_id)
    current_level = estimate_learner_level(learner_vocab)

    # Find content with 85-90% coverage (optimal challenge)
    recommendations = []
    for content in content_library:
        coverage = calculate_coverage(content.text, learner_vocab)
        if 0.85 <= coverage <= 0.90:
            recommendations.append({
                'content': content,
                'coverage': coverage,
                'new_words': count_unknown_words(content.text, learner_vocab),
            })

    # Prioritize content with high-value new vocabulary (highest score first)
    return sorted(recommendations, key=vocabulary_value, reverse=True)

Progress Tracking Dashboard#

def learner_progress_report(learner_id):
    """Show learner's vocabulary growth over time"""
    history = get_learning_history(learner_id)

    return {
        'current_hsk_level': estimate_hsk_level(learner_id),
        'vocabulary_size': len(get_learner_vocabulary(learner_id)),
        'character_coverage': character_coverage_percentage(learner_id),
        'words_learned_this_week': count_new_words(history, days=7),
        'recommended_next_level': calculate_next_milestone(learner_id),
    }

Performance Considerations#

Typical Workload#

A learning app might process:

  • 100-500 words per lesson
  • 10-50 lessons in content library preview
  • Real-time difficulty calculation on content creation

Optimization Strategies#

# Pre-compute difficulty for entire library
content_library_cache = {
    lesson.id: {
        'difficulty_score': calculate_difficulty(lesson.text),
        'hsk_level': estimate_hsk_level_of_text(lesson.text),
        'character_coverage_by_level': {1: 0.3, 2: 0.5, 3: 0.7, ...},
    }
    for lesson in content_library
}

# Fast lookup for learner recommendations (reads the precomputed cache)
def recommend_fast(learner_hsk_level):
    return [
        lesson for lesson in content_library
        if content_library_cache[lesson.id]['hsk_level'] == learner_hsk_level
    ]

Conclusion#

Language learning apps require multi-library integration. No single library provides complete readability assessment for CJK text. Best practice combines:

  1. jieba for segmentation
  2. BCC/Jun Da for character frequency
  3. CC-CEDICT + HSK tags for word-level proficiency
  4. Custom scoring logic for coverage and difficulty calculation

The complexity is justified when adaptive content selection and curriculum alignment are core features. For simpler apps with static content tiers, manual classification may suffice.


Use Case: Reading Assistant Tools#

Scenario Description#

Browser extensions, e-reader apps, and mobile applications that help learners read authentic CJK content by providing real-time difficulty indicators, vocabulary popups, and comprehensible input filtering.

User Persona#

  • Primary: Tool developers building reading assistance features
  • Secondary: Learners consuming authentic content (news, novels, social media)
  • Platforms: Browser extensions (Chrome, Firefox), mobile apps, e-readers
  • Scale: Analyze web pages, articles, ebooks in real-time

Examples of Real Applications#

  • Zhongwen (browser extension): Popup dictionary with character frequency indicators
  • Du Chinese Reader: Articles with difficulty ratings and popup definitions
  • Readibu: E-reader with HSK level filtering
  • LingQ: Web reader highlighting known vs unknown words
  • Pleco Reader: Mobile document reader with tap-to-translate

Technical Requirements#

Core Capabilities#

  1. Real-time difficulty estimation: Analyze web page difficulty instantly
  2. Character/word frequency lookup: Show how common each word is
  3. Unknown word highlighting: Visual distinction between known/unknown vocabulary
  4. Content filtering: Find articles matching learner’s level
  5. Popup dictionary integration: Quick definitions without context switch
  6. Progress tracking: Monitor vocabulary growth over time

Performance Constraints#

  • Real-time responsiveness: Sub-second page analysis (user waiting)
  • Lightweight: Browser extension memory limits (50-100 MB)
  • Offline capability: Core features work without internet
  • Battery efficiency: Mobile apps shouldn’t drain battery

Accuracy Requirements#

  • Critical: Fast performance (slow tools frustrate users)
  • Important: Character frequency accuracy (difficulty indicators)
  • Nice-to-have: Perfect HSK tagging (gaps acceptable if core is fast)

Library Analysis#

Jun Da Character Frequency Lists#

Strengths for Reading Assistants:

  • Fast lookup (8000 character frequency rankings)
  • Small data file (~100 KB, browser-friendly)
  • Research-backed (corpus-derived)
  • Offline-capable (bundle with extension)

Weaknesses for Reading Assistants:

  • ⚠️ Character-only (no word frequency)
  • ⚠️ No HSK mapping
  • ⚠️ Requires word segmentation separately

Verdict: Excellent for character-level indicators.

CC-CEDICT + HSK Tags#

Strengths for Reading Assistants:

  • HSK tagging (show word difficulty levels)
  • Comprehensive (100k+ entries)
  • Popup dictionary data (definitions included)
  • Offline-capable (bundle with app)

Weaknesses for Reading Assistants:

  • ⚠️ Large file size (~30 MB, impacts extension load time)
  • ⚠️ Incomplete HSK coverage
  • ⚠️ Static data (doesn’t learn user’s vocabulary)

Verdict: Essential for vocabulary lookup, optimize file size.

jieba.js (JavaScript port)#

Strengths for Reading Assistants:

  • Browser-native (runs in JavaScript, no server needed)
  • Fast segmentation (real-time page analysis)
  • Offline-capable (no API calls)
  • Lightweight (small bundle size)

Weaknesses for Reading Assistants:

  • ⚠️ JavaScript slower than native (acceptable trade-off)
  • ⚠️ No built-in readability features

Verdict: Critical for browser extensions.

IndexedDB / LocalStorage (Browser Storage)#

Strengths for Reading Assistants:

  • Persist user vocabulary (track known words)
  • Fast local queries (no network latency)
  • Privacy-friendly (data stays on device)

Weaknesses for Reading Assistants:

  • ⚠️ Storage limits (5-50 MB depending on browser)
  • ⚠️ No built-in sync (need custom cloud sync)

Verdict: Essential for personalized features.

Detailed Feature Comparison#

| Feature | Jun Da | CC-CEDICT+HSK | jieba.js | IndexedDB | Reading Assistant Value |
| --- | --- | --- | --- | --- | --- |
| Fast lookup | ✅ | ⚠️ Large | ✅ | ✅ | Critical (real-time) |
| Character frequency | ✅ | ❌ | ❌ | ⚠️ Cache | Critical (difficulty) |
| HSK levels | ❌ | ✅ | ❌ | ⚠️ Cache | High (learner guidance) |
| Offline-capable | ✅ | ✅ | ✅ | ✅ | Critical (mobile/privacy) |
| Small bundle size | ✅ | ❌ | ✅ | N/A | High (extension limits) |
| User vocabulary tracking | ❌ | ❌ | ❌ | ✅ | High (personalization) |

Recommendation#

Hybrid Approach for Reading Assistants#

Combine lightweight data with user personalization:

  1. Jun Da frequency: Character-level difficulty indicators (fast, small)
  2. Pruned CC-CEDICT: Top 10k words only (reduce size from 30 MB → 3 MB)
  3. jieba.js: Word segmentation in browser
  4. IndexedDB: Track user’s known vocabulary for highlighting
  5. Optional cloud sync: Sync vocabulary across devices
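The pruning in step 2 can happen offline at build time. A sketch assuming a hypothetical `word -> corpus rank` table alongside the dictionary entries:

```python
import json

def prune_dictionary(entries, freq_rank, keep=10_000):
    """Keep only the entries for the `keep` most frequent words.
    entries: word -> definition data; freq_rank: word -> corpus rank.
    Words with no known rank are treated as rare and dropped."""
    return {w: d for w, d in entries.items()
            if freq_rank.get(w, float("inf")) <= keep}

# Hypothetical inputs: two dictionary entries and their corpus ranks
entries = {"的": {"def": "of"}, "饕餮": {"def": "taotie (mythical beast)"}}
ranks = {"的": 1, "饕餮": 48210}

small = prune_dictionary(entries, ranks)      # drops 饕餮
blob = json.dumps(small, ensure_ascii=False)  # bundle this with the extension
```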

Browser Extension Architecture#

// Pseudocode for browser extension
class ReadingAssistant {
  constructor() {
    this.charFreq = loadCharacterFrequency();  // Jun Da data
    this.wordDict = loadPrunedDictionary();    // Top 10k words only
    this.userVocab = loadUserVocabulary();     // IndexedDB
  }

  async analyzePage() {
    // 1. Extract text from page
    const text = document.body.innerText;

    // 2. Segment into words
    const words = jieba.cut(text);

    // 3. Calculate difficulty
    const difficulty = this.estimateDifficulty(words);

    // 4. Highlight unknown words
    this.highlightUnknownWords(words);

    // 5. Show difficulty badge
    this.showDifficultyBadge(difficulty);
  }

  highlightUnknownWords(words) {
    words.forEach(word => {
      if (!this.userVocab.has(word)) {
        const hskLevel = this.wordDict.getHSKLevel(word);
        const color = this.getLevelColor(hskLevel);
        // Highlight word in page with color
        highlightWord(word, color);
      }
    });
  }
}

Implementation Patterns#

Pattern 1: Real-Time Difficulty Badge#

Show page difficulty instantly:

function showDifficultyBadge(pageText) {
  // Segment text
  const words = jieba.cut(pageText);

  // Calculate character coverage
  const chars = [...pageText].filter(isChineseChar);
  const commonChars = chars.filter(c => charFreq[c] <= 1500);  // Top 1500
  const coverage = commonChars.length / chars.length;

  // Estimate difficulty
  let difficulty, color;
  if (coverage >= 0.95) {
    difficulty = 'Beginner';
    color = 'green';
  } else if (coverage >= 0.85) {
    difficulty = 'Intermediate';
    color = 'orange';
  } else {
    difficulty = 'Advanced';
    color = 'red';
  }

  // Show badge in corner
  showBadge(difficulty, color);
}

Pattern 2: Unknown Word Highlighting#

Visual feedback on comprehension:

function highlightUnknownWords(text, userVocabulary) {
  const words = jieba.cut(text);

  words.forEach(word => {
    if (!userVocabulary.has(word)) {
      // Find word in DOM and highlight
      const nodes = findTextNodes(word);
      nodes.forEach(node => {
        const span = document.createElement('span');
        span.className = 'unknown-word';
        span.style.backgroundColor = '#FFEB3B';  // Yellow
        span.textContent = word;

        // Add click handler for popup
        span.onclick = () => showPopupDictionary(word);

        node.replaceWith(span);
      });
    }
  });
}

Pattern 3: Content Filtering by Difficulty#

Find readable articles:

function filterContentByLevel(articles, targetLevel) {
  const readable = [];

  articles.forEach(article => {
    const words = jieba.cut(article.text);
    const difficulty = estimateDifficulty(words);

    // Check if within learner's range
    if (difficulty >= targetLevel - 0.5 && difficulty <= targetLevel + 0.5) {
      readable.push({
        article: article,
        difficulty: difficulty,
        newWords: countUnknownWords(words, userVocabulary),
      });
    }
  });

  // Sort by fewest new words (easiest to hardest within level)
  return readable.sort((a, b) => a.newWords - b.newWords);
}

Pattern 4: Popup Dictionary with Frequency Info#

Show comprehensive word information:

function showPopupDictionary(word) {
  // Look up word data
  const definition = dictionary.lookup(word);
  const hskLevel = dictionary.getHSKLevel(word);
  const frequency = calculateWordFrequency(word);

  // Character frequency ranks (a higher rank means a rarer character)
  const charFrequencies = [...word].map(c => charFreq[c]);
  const rareChar = Math.max(...charFrequencies);  // rank of the word's rarest character

  // Build popup content
  const popup = createPopup({
    word: word,
    pinyin: definition.pinyin,
    definition: definition.english,
    hskLevel: hskLevel || 'Not in HSK',
    frequency: frequency ? `Top ${frequency}` : 'Rare',
    characterDifficulty: rareChar > 3000 ? 'Contains rare characters' : 'Common characters',
    example: definition.example || null,
  });

  showPopup(popup);
}

Pattern 5: Vocabulary Progress Tracking#

Monitor learning over time:

class VocabularyTracker {
  constructor() {
    this.knownWords = new Set(loadFromIndexedDB('knownWords'));
    this.learningWords = new Map(loadFromIndexedDB('learningWords'));
  }

  markWordAsKnown(word) {
    this.knownWords.add(word);
    this.learningWords.delete(word);
    saveToIndexedDB('knownWords', [...this.knownWords]);

    // Update statistics
    this.updateStats();
  }

  getProgressReport() {
    // Calculate HSK level coverage
    const hskCoverage = {};
    for (let level = 1; level <= 6; level++) {
      const hskWords = getHSKWords(level);
      const known = hskWords.filter(w => this.knownWords.has(w));
      hskCoverage[level] = known.length / hskWords.length;
    }

    return {
      totalKnownWords: this.knownWords.size,
      wordsLearningNow: this.learningWords.size,
      hskCoverage: hskCoverage,
      estimatedLevel: this.estimateLevel(),
      progressThisWeek: this.getWeeklyProgress(),
    };
  }

  estimateLevel() {
    // User is at the highest level where they know 80%+ of the vocabulary
    for (let level = 6; level >= 1; level--) {
      const hskWords = getHSKWords(level);
      const known = hskWords.filter(w => this.knownWords.has(w));
      if (known.length / hskWords.length >= 0.8) {
        return level;
      }
    }
    return 1;
  }
}

Pattern 6: Optimized Data Loading#

Minimize bundle size and load time:

// Lazy-load dictionary data in chunks
class LazyDictionary {
  constructor() {
    this.cache = new Map();
    this.chunkSize = 1000;  // Load 1000 words at a time
  }

  async lookup(word) {
    // Check cache first
    if (this.cache.has(word)) {
      return this.cache.get(word);
    }

    // Load chunk containing this word
    const chunk = await this.loadChunk(word);
    chunk.forEach(entry => this.cache.set(entry.word, entry));

    return this.cache.get(word);
  }

  async loadChunk(word) {
    // Determine which chunk (alphabetical)
    const chunkIndex = Math.floor(pinyinSort(word) / this.chunkSize);

    // Fetch from bundled data or API
    const response = await fetch(`/data/dict_chunk_${chunkIndex}.json`);
    return await response.json();
  }
}

Trade-offs#

Browser-Based Processing Benefits#

  • Privacy: User data stays on device
  • Speed: No network latency
  • Offline: Works without internet
  • Cost: No server infrastructure needed

Browser-Based Processing Costs#

  • Bundle size: Extension size limits (Chrome: 5 MB packaged)
  • Memory: Browser memory constraints
  • Processing power: Slower than server-side
  • Data sync: Complex cross-device synchronization

When Client-Side is Worth It#

Use browser-based processing when:

  • Privacy is important (user vocabulary data sensitive)
  • Real-time performance critical (popup dictionary)
  • Offline usage required (mobile apps, privacy-conscious users)
  • Simple features (lookup, highlighting) not complex NLP

When Server-Side is Better#

Use server API when:

  • Complex processing needed (advanced NLP, ML models)
  • Large data requirements (full dictionary, corpus analysis)
  • Multi-user features (social vocabulary sharing)
  • Limited device resources (older mobile devices)

Missing Capabilities#

No existing tool provides:

  • Adaptive difficulty estimation (learns from user’s reading history)
  • Context-aware definitions (word meaning varies by context)
  • Grammar complexity indicators (sentence structure difficulty)
  • Cross-device vocabulary sync (seamless sync across browser/mobile)
  • Reading speed estimation (how long will this article take?)
  • Optimal word learning order (which unknown words to learn first?)

Reading assistant developers must build these features custom.

Real-World Integration Examples#

Browser Extension: Zhongwen-Style#

// Content script injected into web pages
class ZhongwenReader {
  constructor() {
    this.dictionary = new LazyDictionary();
    this.charFreq = loadCharacterFrequency();
    this.userLevel = getUserLevel();  // HSK 1-6

    // Add hover listeners
    this.setupHoverPopups();
  }

  setupHoverPopups() {
    document.addEventListener('mouseover', async (e) => {
      const word = getWordUnderCursor(e);
      if (word && isChineseWord(word)) {
        const data = await this.dictionary.lookup(word);
        this.showPopup(word, data, e.pageX, e.pageY);
      }
    });
  }

  showPopup(word, data, x, y) {
    // Build popup with difficulty indicators
    const popup = document.createElement('div');
    popup.className = 'zhongwen-popup';
    popup.style.left = `${x}px`;
    popup.style.top = `${y}px`;

    // Difficulty badge
    const level = data.hskLevel || 'Unknown';
    const badge = level <= this.userLevel ? '✅' : '⚠️';

    popup.innerHTML = `
      <div class="word">${word} ${badge}</div>
      <div class="pinyin">${data.pinyin}</div>
      <div class="definition">${data.definition}</div>
      <div class="meta">HSK ${level} | Freq: ${getFrequency(word)}</div>
    `;

    document.body.appendChild(popup);
  }
}

Mobile E-Reader App#

class MobileEReader {
  constructor() {
    this.userVocab = loadUserVocabulary();
    this.touchHandler = this.setupTouchHandler();
  }

  setupTouchHandler() {
    // Tap word to see definition
    document.addEventListener('touchstart', async (e) => {
      const word = getWordAtPoint(e.touches[0]);

      if (word) {
        const definition = await lookup(word);
        this.showDefinitionModal(word, definition);
      }
    });
  }

  showDefinitionModal(word, definition) {
    // Full-screen modal with action buttons
    const modal = createModal({
      word: word,
      pinyin: definition.pinyin,
      definition: definition.english,
      actions: [
        {
          label: 'Mark as Known',
          action: () => this.markAsKnown(word),
        },
        {
          label: 'Add to Study List',
          action: () => this.addToStudyList(word),
        },
      ],
    });

    showModal(modal);
  }

  markAsKnown(word) {
    this.userVocab.add(word);
    saveUserVocabulary(this.userVocab);

    // Update highlighting on page
    this.refreshWordHighlighting();
  }
}

Content Recommendation Engine#

class ContentRecommender {
  constructor(userProfile) {
    this.userLevel = userProfile.estimatedHSKLevel;
    this.knownWords = userProfile.knownWords;
    this.interests = userProfile.interests;
  }

  async recommendArticles(articlePool) {
    const scored = [];

    for (const article of articlePool) {
      const words = jieba.cut(article.text);

      // Calculate comprehension
      const knownCount = words.filter(w => this.knownWords.has(w)).length;
      const comprehension = knownCount / words.length;

      // Calculate new vocabulary load
      const unknownWords = words.filter(w => !this.knownWords.has(w));
      const newWordLoad = unknownWords.length;

      // Interest match (topic modeling)
      const topicMatch = this.scoreTopicMatch(article, this.interests);

      // Composite score
      const score = {
        article: article,
        comprehension: comprehension,
        newWordLoad: newWordLoad,
        topicMatch: topicMatch,
        recommendScore: (
          comprehension * 0.5 +          // 50% comprehension weight
          (1 - newWordLoad / 100) * 0.3 + // 30% vocabulary load
          topicMatch * 0.2                // 20% interest match
        ),
      };

      // Only recommend if 80-95% comprehension (i+1 zone)
      if (comprehension >= 0.80 && comprehension <= 0.95) {
        scored.push(score);
      }
    }

    // Return top recommendations
    return scored.sort((a, b) => b.recommendScore - a.recommendScore).slice(0, 10);
  }
}

Performance Considerations#

Typical Workload#

Reading assistants process:

  • Web pages: 500-5000 characters
  • Articles: 1000-10000 characters
  • Books: 50k-200k characters (chapter at a time)

Optimization Strategies#

// 1. Incremental processing (don't re-analyze whole page)
let lastAnalyzedLength = 0;

function analyzePageIncremental() {
  const currentText = document.body.innerText;

  if (currentText.length > lastAnalyzedLength) {
    const newText = currentText.slice(lastAnalyzedLength);
    analyzeText(newText);
    lastAnalyzedLength = currentText.length;
  }
}

// 2. Virtualization for long documents
function renderVisiblePortion(document, viewport) {
  // Only process text visible in viewport
  const visibleText = getTextInViewport(viewport);
  return analyzeAndHighlight(visibleText);
}

// 3. Web Workers for background processing
const worker = new Worker('analyzer.js');
worker.postMessage({ text: pageText });
worker.onmessage = (e) => {
  const { difficulty, unknownWords } = e.data;
  updateUI(difficulty, unknownWords);
};

Conclusion#

Reading assistants need lightweight, browser-optimized solutions. Best approach:

  1. Pruned data: Top 10k words only (not full 100k dictionary)
  2. Client-side processing: jieba.js for real-time segmentation
  3. User vocabulary tracking: IndexedDB for personalization
  4. Hybrid architecture: Client-side for speed, optional cloud sync for features

Trade off bundle size (faster loading) against features (comprehensive data). Prioritize real-time responsiveness over perfect accuracy: fast feedback is more valuable to learners reading authentic content than 100% coverage.


S4: Synthesis - Strategic Insights and Recommendations#

The Core Value Proposition#

CJK readability analysis solves a fundamental problem in Chinese language learning: knowing whether you can read something before you start. Unlike English where a learner can attempt any text and struggle through unknown words, Chinese text with too many unknown characters is literally unreadable—you can’t even sound words out phonetically.

When This Technology Matters#

High-Value Use Cases#

  1. Educational Content Curation

    • Language learning platforms (Duolingo, HelloChinese, etc.)
    • Digital libraries for learners (graded readers)
    • Textbook publishers (automatic leveling)
  2. Content Accessibility

    • News sites with “Easy Chinese” versions
    • Government services (simplified language requirements)
    • Healthcare information (patient education materials)
  3. Language Learning Apps

    • Automatic text difficulty assessment
    • Personalized content recommendations
    • Progress tracking (reading level advancement)
  4. Content Creation Tools

    • Writing assistants that flag difficult characters
    • Automatic simplification suggestions
    • Target-level validation for authors

Low-Value Use Cases#

  • General-purpose NLP (sentiment analysis, classification) - readability features add noise
  • Native speaker applications - they already know all the characters
  • Machine translation - different problem space entirely

Architecture Decision Framework#

Choice 1: Character vs Word-Based Analysis#

Character-based (HSK approach):

  • ✅ Simpler algorithm (just count unique characters)
  • ✅ Aligns with how learners actually learn (character lists)
  • ✅ Works without perfect segmentation
  • ❌ Misses vocabulary complexity (knowing 研 + 究 ≠ knowing 研究 “research”)

Word-based (TOCFL approach):

  • ✅ More accurate for actual reading comprehension
  • ✅ Captures vocabulary knowledge, not just characters
  • ❌ Requires segmentation (Jieba, adds complexity/errors)
  • ❌ Harder to align with learning materials (HSK lists are character-focused)

Recommendation: Start character-based for MVP (simpler, faster). Add word-based analysis if you need higher accuracy for advanced learners (HSK 4+).
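The gap between the two approaches shows up directly in the 研究 example above; a sketch with stand-in vocabulary tables:

```python
def char_coverage(text, known_chars):
    """Character-based: fraction of Chinese characters the learner knows."""
    hanzi = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    return sum(c in known_chars for c in hanzi) / len(hanzi)

def word_coverage(words, known_words):
    """Word-based: fraction of segmented words the learner knows."""
    return sum(w in known_words for w in words) / len(words)

# Learner knows the characters 研, 究, 生 but not the word 研究生 (grad student)
char_coverage("研究生", set("研究生"))       # 1.0 -- looks fully readable
word_coverage(["研究生"], {"研究", "学生"})  # 0.0 -- the word itself is unknown
```

Character-based analysis overstates readability whenever known characters combine into unknown words, which is why word-based analysis pays off for advanced learners.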

Choice 2: Simple vs ML-Based Classification#

Simple coverage formula (character/word coverage at HSK level):

coverage = known_chars / total_chars_in_text
if coverage >= 0.95: return current_level

  • ✅ Fast (~milliseconds per text)
  • ✅ Easy to debug and explain to users
  • ✅ Good enough for most use cases (learner apps, content filters)
  • ❌ Ignores linguistic complexity (sentence structure, discourse)
  • ❌ Fixed threshold (95% might not be right for everyone)

ML-based (CRIE-style SVM with 82 features):

  • ✅ More accurate grade level prediction
  • ✅ Can provide diagnostic feedback (“too many complex sentences”)
  • ✅ Learns from real educational materials
  • ❌ Slower (requires full NLP pipeline: segmentation, POS, parsing)
  • ❌ Black box (harder to explain to users why text is level X)
  • ❌ Requires training data (textbooks, labeled corpus)

Recommendation:

  • B2C apps (language learners): Simple coverage formula + Jieba. Users want “HSK 3” or “HSK 4”, not detailed diagnostics.
  • B2B tools (publishers, educators): ML-based if you can afford the complexity. They need fine-grained assessment and diagnostic reports.

Choice 3: Build vs Buy vs Use Free Tools#

Build your own (Jieba + HSK lists + coverage formula):

  • ✅ Full control over algorithm
  • ✅ No API costs (self-hosted)
  • ✅ Can customize for your domain (medical, legal, etc.)
  • ❌ Maintenance burden (keep HSK lists updated, new 2026 standards)
  • ❌ Need NLP expertise (if going beyond simple coverage)
  • Effort: 1-2 weeks for MVP, ongoing maintenance

Use free OSS library (HSK Character Profiler, etc.):

  • ✅ Fast time-to-market (days, not weeks)
  • ✅ Community-maintained (bug fixes, updates)
  • ❌ Less control (tied to library’s roadmap)
  • ❌ May not match your exact requirements
  • Effort: 1-3 days integration

Commercial API (Google Cloud NLP, LTP-Cloud):

  • ✅ Fully managed (no infrastructure)
  • ✅ Production-grade (high availability, scaling)
  • ❌ Recurring costs (pay-per-request)
  • ❌ Lock-in (hard to switch later)
  • ❌ Chinese-specific features limited (Google doesn’t do HSK levels)
  • Effort: 1-2 days integration
  • Cost: Google ~$1/million characters; LTP-Cloud pricing varies

Free web tools (HSK HSK Analyzer, etc.):

  • ✅ Zero cost, zero effort
  • ❌ Not for production use (rate limits, no SLA)
  • ❌ Can’t integrate into your app
  • Best for: Manual testing, one-off analysis

Recommendation:

  • MVP/prototype: Free OSS library (HSK Character Profiler) or build simple coverage formula (1 day)
  • Production app (< 1M texts/month): Build your own (Jieba + HSK lists)
  • Production app (> 1M texts/month): Commercial API if you need multi-language NLP; otherwise still self-host for cost savings
  • Enterprise/publishers: CRIE-style ML system (hire NLP consultants or build in-house)

Choice 4: HSK vs TOCFL vs Both#

HSK 3.0 (2026 standard, 9 levels):

  • ✅ Larger user base (mainland China market)
  • ✅ More resources (apps, textbooks, word lists)
  • ✅ New 2026 standard more comprehensive
  • ❌ Simplified Chinese focus

TOCFL (Taiwan, 8 levels):

  • ✅ Traditional Chinese focus
  • ✅ Word-based (better for comprehension)
  • ❌ Smaller ecosystem (fewer learning resources)
  • ❌ Less data available (character/word lists harder to find)

Both:

  • ✅ Cover entire Chinese-speaking market
  • ❌ More complexity (maintain two systems)
  • ❌ User confusion (which standard to show?)

Recommendation:

  • Mainland China market: HSK only
  • Taiwan/Hong Kong market: TOCFL preferred, HSK as fallback
  • Global market: HSK primary, add TOCFL if you have Traditional Chinese users (> 20% of base)

Hidden Complexity and Gotchas#

1. Segmentation Errors Cascade#

Jieba accuracy: ~95% for general text. But errors in segmentation cause errors in readability analysis:

  • “研究生” segmented as “研究” + “生” instead of “研究生” → wrong HSK level
  • Domain-specific terms (medical, legal) often mis-segmented
  • Mitigation: Use domain-specific dictionaries (Jieba supports custom dicts)
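
The cascade is easy to reproduce with a toy greedy segmenter. Jieba's actual algorithm (prefix dictionary + HMM) is more sophisticated, but the failure mode and the custom-dictionary fix look the same; in real Jieba code the fix is `jieba.add_word("研究生")` or `jieba.load_userdict(path)`:

```python
# Toy forward-maximum-match segmenter, illustrating how a missing
# dictionary entry causes "研究生" to split into "研究" + "生".

def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: take the longest dictionary word
    starting at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

base_dict = {"我", "是", "研究", "生"}
print(fmm_segment("我是研究生", base_dict))               # ['我', '是', '研究', '生']
print(fmm_segment("我是研究生", base_dict | {"研究生"}))  # ['我', '是', '研究生']
```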

2. HSK 3.0 Migration (2026)#

New standard effective July 2026. 9 levels instead of 6. Character/word requirements changed.

  • Old HSK 6 ≠ new HSK 6 (different word counts)
  • Need to update word lists, retrain models
  • Mitigation: Maintain both HSK 2.0 and HSK 3.0 mappings during transition (2026-2027)
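
During the transition, a thin lookup layer can report both standards side by side. Both word lists below are illustrative placeholders; the real HSK 2.0 and 3.0 lists differ substantially:

```python
# Report a word's level under both HSK 2.0 and HSK 3.0 during the
# 2026-2027 transition window. Placeholder data, not real lists.
HSK2 = {"朋友": 1, "经济": 4}
HSK3 = {"朋友": 1, "经济": 3}  # assume the word moved levels in the new standard

def dual_level(word):
    """None means the word is outside that standard's lists."""
    return {"hsk2": HSK2.get(word), "hsk3": HSK3.get(word)}

print(dual_level("经济"))  # {'hsk2': 4, 'hsk3': 3}
```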

3. Context-Dependent Difficulty#

Character frequency ≠ character difficulty in context:

  • 的 (de, particle): most common character, learned in HSK 1
  • 辩证法 (dialectics): the word is rare, yet its individual characters look like HSK 3-4, so character-based analysis understates the word’s difficulty
  • Idioms (成语): 4 characters that must be learned as unit, not individually
  • Mitigation: Use word-based analysis for HSK 4+; flag idioms separately
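
Flagging idioms separately can be as simple as sliding a four-character window over the text against an idiom list. The two chengyu below are placeholders for a real 成语 dictionary:

```python
# Flag chengyu (成语) as single units so they are not mis-scored as
# four individually easy characters. IDIOMS is a placeholder list.
IDIOMS = {"画蛇添足", "对牛弹琴"}

def flag_idioms(text):
    """Return every 4-character window that matches a known idiom."""
    return [text[i:i + 4] for i in range(len(text) - 3) if text[i:i + 4] in IDIOMS]

print(flag_idioms("他这是画蛇添足"))  # ['画蛇添足']
```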

4. Traditional vs Simplified Mapping#

Not 1:1 correspondence:

  • 台 (simplified) → 臺/檯/颱/台 (traditional) - one simplified character maps to several traditional characters depending on the word
  • 后 (simplified) → 後/后 (traditional) - two distinct traditional words (後 “after”, 后 “empress”) merged into one simplified character
  • Mitigation: Use proper conversion libraries (OpenCC), maintain separate frequency lists

5. Coverage Threshold is Arbitrary#

95% coverage = readable? Depends on:

  • Text type (narrative easier than academic)
  • Learner background (heritage speakers vs beginners)
  • Glossary availability (can look up 5% unknown words?)
  • Mitigation: Make threshold configurable (90-98%), A/B test optimal value for your users

Cost Modeling#

DIY Approach (Jieba + HSK lists)#

  • Setup: 1-2 weeks dev time (~$5K-$10K if outsourced)
  • Hosting: ~$20-50/month (1M texts/month on a modest server)
  • Maintenance: 4-8 hours/quarter (update word lists, bug fixes)
  • Total Year 1: ~$7K-$12K (mostly upfront dev)

Commercial API (Google Cloud NLP)#

  • Setup: 1-2 days integration (~$1K)
  • Usage: $1 per 1M characters analyzed (after 5M free tier)
  • Maintenance: ~0 (fully managed)
  • Total Year 1 at 10M texts (~10B characters at ~1,000 characters/text): ~$1K setup + ~$10K usage = ~$11K

Break-even: roughly 6M-11M texts/year (~500K-900K texts/month at ~1,000 characters/text) against the $7K-$12K DIY cost - beyond this, self-hosting is cheaper

OSS Library (HSK Character Profiler)#

  • Setup: 1-3 days integration (~$500-$1.5K)
  • Hosting: $0 (runs in your app)
  • Maintenance: ~2 hours/quarter (library updates)
  • Total Year 1: ~$1K-$2K

ROI sweet spot: 100K-1M texts/month. Below that, use web tools. Above that, consider custom build for more control.

Implementation Roadmap#

Phase 1: MVP (Week 1)#

  • Integrate HSK Character Profiler or build simple coverage formula
  • Use HSK 3.0 character lists (GitHub: krmanik/HSK-3.0)
  • Simple API: POST /analyze with {text: "...", standard: "hsk"} → {level: 3, coverage: 0.94}
  • Test with sample texts at known levels

Phase 2: Production (Weeks 2-4)#

  • Add Jieba for word segmentation (if word-based analysis needed)
  • Implement caching (Redis) for frequently analyzed texts
  • Add metrics (latency, accuracy vs human labels)
  • Deploy with proper error handling
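
The caching step can be sketched with an in-memory dict keyed by a hash of the text; in production the dict would be Redis (e.g. `SETEX` with a TTL), but the shape is the same:

```python
# Cache analyses by a hash of the text so repeated requests skip
# re-analysis. The module-level dict stands in for Redis here.
import hashlib

_cache = {}

def cached_analyze(text, analyze_fn):
    """Return a cached result, computing it on first sight of this text."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = analyze_fn(text)
    return _cache[key]

calls = []
def fake_analyze(text):
    calls.append(text)
    return {"level": 2}

cached_analyze("你好", fake_analyze)
cached_analyze("你好", fake_analyze)
print(len(calls))  # the analyzer ran only once
```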

Phase 3: Enhancement (Months 2-3)#

  • Custom dictionaries for your domain
  • Support Traditional Chinese + TOCFL
  • Diagnostic reports (which characters/words are too hard?)
  • A/B test coverage thresholds

Phase 4: ML (Months 4-6, optional)#

  • Collect labeled training data (texts + human-assessed levels)
  • Train SVM with CTAP features or simpler model
  • Compare accuracy vs coverage formula
  • Deploy if significant improvement (> 10% accuracy gain)

Key Success Metrics#

  1. Accuracy: % agreement with human assessors on text level
    • Target: 80-90% exact match, 95%+ within ±1 level
  2. Coverage: % of texts that get a confident level prediction
    • Target: > 95% (few “unknown level” results)
  3. Latency: Time to analyze typical text
    • Target: < 100ms for 1000 characters (simple), < 500ms (ML-based)
  4. User satisfaction: Do learners find texts at recommended level readable?
    • Target: > 80% report “just right” difficulty (not too easy/hard)
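
The two accuracy targets (exact match and within ±1 level) can be computed together from predicted levels and human labels; the numbers in the example are hypothetical:

```python
# Compute the accuracy metrics from Key Success Metrics: exact-match
# rate and within-±1-level rate against human-assessed levels.
def level_accuracy(predicted, human):
    pairs = list(zip(predicted, human))
    exact = sum(p == h for p, h in pairs) / len(pairs)
    within_one = sum(abs(p - h) <= 1 for p, h in pairs) / len(pairs)
    return exact, within_one

# Hypothetical labels: two exact hits, one off by 1, one off by 2.
print(level_accuracy([3, 4, 2, 6], [3, 5, 2, 4]))  # (0.5, 0.75)
```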

The Bigger Picture#

  • Chinese learner population growing (300M+ worldwide)
  • Digital learning platforms expanding (COVID-19 accelerated shift)
  • HSK 3.0 (2026) creating demand for updated tools
  • AI/LLM integration opportunity (auto-generate level-appropriate content)

Adjacent Technologies#

  • Content generation: LLMs that write at target HSK level
  • Personalization: Adaptive learning paths based on reading level
  • Translation: Simplify-for-learners translation (not just English-Chinese)
  • Speech: Readability analysis for spoken content (podcast transcripts)

Future Research Directions#

  • Multimodal: Images + text (children’s books, comics)
  • Dialogue: Conversational difficulty vs written text
  • Cultural load: Idioms, cultural references independent of language level
  • Error prediction: Which characters will THIS learner struggle with? (personalized beyond HSK)

Bottom Line Recommendations#

If you’re building a language learning app:
  → Start with HSK Character Profiler (OSS, free, 1-day integration)
  → Upgrade to custom Jieba + HSK 3.0 lists when you hit 100K texts/month
  → Stick with the simple coverage formula unless you need fine-grained diagnostics

If you’re a publisher/educator:
  → Invest in a CRIE-style ML system (hire consultants, 3-6 months)
  → Use CTAP features for comprehensive analysis
  → Build internal tools for authors (real-time difficulty feedback as they write)

If you’re a researcher:
  → Use CTAP (196 features, most comprehensive)
  → Compare ML models (SVM vs neural networks vs LLM-based)
  → Publish open datasets (labeled texts + human assessments)

If you’re just exploring:
  → Use HSK Analyzer (web, free) for one-off analysis
  → Read the CRIE paper for a methodology deep-dive
  → Experiment with Jieba to understand segmentation challenges

The technology is mature, tools exist, and the use cases are clear. The main decision is build-vs-buy and simple-vs-ML, which depends entirely on your scale and accuracy requirements.

Published: 2026-03-06 Updated: 2026-03-06