1.154.1 Chinese Text Simplification Libraries#
Comprehensive analysis of Chinese text simplification libraries and approaches. Covers foundational NLP libraries (segmentation, conversion), training datasets, and implementation strategies for automated text simplification targeting different HSK proficiency levels. Reveals no turnkey solutions exist; teams must build custom hybrid stacks.
Explainer
Chinese Text Simplification Libraries#
What This Solves#
Imagine you’re running a Chinese learning app with thousands of authentic news articles, but your HSK 3 learners keep hitting walls of vocabulary they don’t know. You could manually rewrite articles (expensive, slow), or you could automatically simplify them—replacing difficult words with easier synonyms, shortening complex sentences, and removing advanced grammar patterns.
Chinese Text Simplification (CTS) solves this fundamental problem: automatically converting complex Chinese text into simpler versions that match a target proficiency level. It takes any Chinese text and rewrites it to be comprehensible at HSK 2, 3, 4, etc.
This differs from readability analysis (which just measures difficulty) by actually transforming the text. It’s the difference between a thermometer (analysis) and an air conditioner (simplification).
This matters to three groups:
- Language learning platforms need to offer graded content at scale (can’t hire editors to simplify thousands of articles)
- Accessibility services need to make government documents, healthcare info, and public services readable for lower-literacy Chinese readers
- Content creators need tools to adapt writing for different audience proficiency levels (textbooks, news sites, technical documentation)
Without automated simplification, these groups resort to manual rewriting (expensive, inconsistent) or avoid complex content entirely (limiting educational value).
Accessible Analogies#
Think of Chinese text simplification like a recipe adapter:
- Original recipe (advanced): “Julienne the carrots, create a mirepoix base, deglaze with Shaoxing wine”
- Simplified recipe (beginner): “Cut the carrots into thin strips, cook onions and carrots together, add rice wine to the pan”
The recipe adapter:
- Replaces fancy cooking terms with simple descriptions
- Breaks complex steps into smaller ones
- Uses common ingredients instead of specialized ones
- Keeps the same end result (dish is the same, instructions are clearer)
Chinese text simplification does the same:
- Replaces advanced vocabulary with HSK-level equivalents
- Splits long compound sentences into shorter ones
- Removes or replaces idioms (成语) with plain explanations
- Keeps the same meaning (content is preserved, expression is simplified)
Another angle: Like a translation app, but instead of translating between languages, it “translates” from advanced Chinese to beginner Chinese. Both preserve meaning while changing form.
The challenge: Unlike English where you can often just swap “utilize” → “use”, Chinese has:
- No spaces between words (need segmentation first)
- Multiple ways to express the same concept with different difficulty levels
- Idioms that can’t be translated word-for-word
- Sentence structures that require complete restructuring, not just word replacement
This is why off-the-shelf NLP libraries don’t work—you need specialized Chinese text simplification tools.
When You Need This#
You definitely need this if:
- You run a language learning platform and want to offer graded readers automatically (“here’s today’s news at HSK 3 level”)
- You’re building accessibility tools for government or public services (simplified documents for low-literacy readers)
- You create educational materials and need to generate multiple difficulty versions of the same content
- You manage a news site offering “Easy Chinese” versions and want to automate the simplification pipeline
You probably need this if:
- You’re building AI tutoring systems that need to adjust explanation complexity to learner level
- You’re researching second-language acquisition and need controlled text difficulty
- You’re developing translation tools with simplification as a post-processing step
You DON’T need this if:
- You only need to measure readability (use HSK-Character-Profiler or similar instead)
- You only need Traditional/Simplified conversion (use OpenCC instead)
- You’re working with native speakers who don’t need simplification
- You have manual editorial resources and small volume (< 100 texts/month)
The decision hinges on: Are you transforming complex text to simpler versions at scale? If yes, you need simplification. If you just need to know “is this HSK 3?”, use analysis tools instead.
Trade-offs#
Rule-Based vs Neural Network#
Rule-based simplification (word replacement + sentence splitting):
- ✅ Fast (milliseconds per text)
- ✅ Predictable output (same input always gives same result)
- ✅ Easy to debug (you can see which rules fired)
- ✅ Requires no training data
- ❌ Limited to vocabulary substitution (can’t restructure complex sentences well)
- ❌ Struggles with idioms, metaphors, context-dependent meaning
- ❌ Needs manually curated synonym dictionaries at each HSK level
Neural network approach (seq2seq, transformer models):
- ✅ Can restructure sentences creatively (not just word swaps)
- ✅ Handles idioms and context better
- ✅ Improves with more training data
- ❌ Slower (seconds per text on CPU)
- ❌ Unpredictable (might generate fluent but incorrect simplifications)
- ❌ Requires large parallel corpora (complex ↔ simple sentence pairs)
- ❌ Hard to control output level (can’t guarantee “exactly HSK 3”)
Current reality: Most production systems use rule-based as foundation + neural for specific hard cases. Pure neural is still research-grade.
Character-Level vs Word-Level#
Character-level substitution:
- ✅ Simpler implementation (no word segmentation needed)
- ✅ Aligns with HSK character lists
- ❌ Breaks compound words (replacing 研 in 研究 changes meaning)
- ❌ Misses multi-word expressions that need to be replaced as units
Word-level substitution:
- ✅ Preserves compound word integrity
- ✅ Can handle multi-word idioms
- ❌ Requires accurate segmentation (jieba is ~95% accurate, errors cascade)
- ❌ More complex to implement
Hybrid approach (recommended): Segment into words, simplify at word level, validate at character level.
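A minimal sketch of the hybrid approach, assuming toy HSK data: words arrive pre-segmented (jieba would supply them in practice), replacement happens at word level, and a character-level pass validates the output. All dictionary entries and the helper names are illustrative, not a real library API.

```python
# Toy HSK-level data (hypothetical entries; real systems load full HSK lists)
WORD_LEVEL = {"喜欢": 1, "热爱": 5, "学习": 1}
SYNONYMS = {"热爱": "喜欢"}   # advanced word -> simpler equivalent
CHAR_LEVEL = {"我": 1, "喜": 1, "欢": 1, "学": 1, "习": 1}

def simplify_words(words, target_level):
    """Word-level pass: replace words above target when a synonym exists."""
    out = []
    for w in words:
        if WORD_LEVEL.get(w, 99) > target_level and w in SYNONYMS:
            out.append(SYNONYMS[w])
        else:
            out.append(w)
    return out

def validate_chars(text, target_level):
    """Character-level check: every character at or below the target level."""
    return all(CHAR_LEVEL.get(ch, 99) <= target_level for ch in text)

words = ["我", "热爱", "学习"]   # pre-segmented (jieba.cut would supply this)
simplified = simplify_words(words, target_level=2)
print("".join(simplified))                       # 我喜欢学习
print(validate_chars("".join(simplified), 2))    # True
```

The two passes catch different failures: word-level replacement keeps compounds intact, while the character check flags any replacement that accidentally introduced hard characters.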
Build vs Dataset#
Build from scratch (your own rules + dictionaries):
- ✅ Full control over simplification strategy
- ✅ Can customize for your domain (medical, legal, etc.)
- ❌ Requires linguistic expertise
- ❌ Months of development time
- ❌ Need native speaker validation
Use research datasets (MCTS, parallel corpora):
- ✅ Training data already exists (691K+ sentence pairs)
- ✅ Can train neural models if you have ML expertise
- ❌ Datasets are academic (news text, not your domain)
- ❌ Still need to build the actual simplification pipeline
- ❌ No turnkey solution (MCTS is data, not a library)
Use existing libraries (currently limited options):
- ✅ Fastest time to value (if they exist)
- ❌ Reality check: There are very few production-ready pip-installable CTS libraries as of 2026
- ❌ Most published work is research code, not maintained libraries
Cost Considerations#
Research approach (use MCTS dataset + train your own model):
- Dataset: Free (open source)
- Model training: $100-500 in GPU time (if using cloud)
- Development: $20K-$50K (2-4 months, requires ML expertise)
- Hosting: $50-200/month for inference
- Year 1 total: ~$25K-$60K
- Only makes sense for large-scale platforms (> 10K texts/month)
Rule-based DIY:
- Development: $10K-$30K (1-3 months, requires NLP + Chinese expertise)
- Hosting: $20-50/month (runs in your app)
- HSK vocabulary lists: Free (open source)
- Synonym dictionaries: Build manually or scrape ($2K-$5K)
- Year 1 total: ~$12K-$35K
- Sweet spot for mid-sized platforms (1K-10K texts/month)
Hybrid approach (jieba + OpenCC + HSK-Character-Profiler + custom rules):
- Integration: $5K-$15K (2-4 weeks)
- Use existing libraries for segmentation, conversion, analysis
- Build custom simplification logic on top
- Year 1 total: ~$7K-$18K
- Most practical option for MVP
Commercial API (if they existed):
- Would cost ~$5-20 per 1K simplifications
- None currently available as of 2026 for Chinese
- English has services (Rewordify, TextCompactor), Chinese market is nascent
Manual editing (baseline comparison):
- Professional editor: $0.10-$0.30 per sentence
- At 1,000 texts/month (avg 20 sentences): $2K-$6K/month
- Year 1: $24K-$72K
- Break-even: Automation pays off at > 100 texts/month
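The break-even claim above can be checked with the section's own figures. This back-of-envelope sketch uses the optimistic end: $0.30 per sentence for manual editing, ~20 sentences per text, against roughly $7K year-1 cost for the hybrid build.

```python
# Break-even sketch using the figures from the cost section above.
def manual_cost_per_year(texts_per_month, cost_per_sentence=0.30, sentences=20):
    return texts_per_month * sentences * cost_per_sentence * 12

BUILD_COST = 7_000  # low end of the hybrid-stack year-1 estimate

for volume in (50, 100, 250):
    manual = manual_cost_per_year(volume)
    winner = "automation" if manual > BUILD_COST else "manual"
    print(f"{volume} texts/month: manual ≈ ${manual:,.0f}/yr → {winner}")
```

At 100 texts/month the manual cost (~$7,200/yr) already edges past the cheapest build, which is where the ">100 texts/month" break-even comes from; at the pessimistic end ($0.10/sentence, $18K build) break-even moves to several hundred texts/month.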
Implementation Reality#
First 30 Days#
Week 1: Set up infrastructure with existing libraries:
- Install jieba for segmentation
- Install OpenCC for Traditional/Simplified handling
- Install HSK-Character-Profiler for difficulty measurement
- Build simple word replacement pipeline with HSK vocabulary lists
Weeks 2-4: Build and test simplification rules:
- Create synonym dictionaries at each HSK level
- Implement sentence splitting for long sentences (> 20 characters)
- Test on sample texts, measure with human evaluators
- Deploy basic API endpoint
You’ll have a working prototype that can simplify ~70% of sentences (the easy cases).
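The week 2-4 sentence-splitting step can be sketched as a punctuation-based rule. This is the naive rule-based baseline (real systems would use dependency parsing to pick split points); the 20-character threshold follows the list above, and the example sentence is illustrative.

```python
import re

def split_long_sentence(sentence, max_len=20):
    """Split a long sentence at major Chinese clause punctuation (，；)."""
    if len(sentence) <= max_len:
        return [sentence]
    clauses = [c for c in re.split(r"[，；]", sentence.rstrip("。")) if c]
    # Re-punctuate each clause as a standalone sentence
    return [c + "。" for c in clauses]

long_sentence = "他昨天去了图书馆，借了三本关于中国历史的书，然后在咖啡馆看了一个下午。"
for part in split_long_sentence(long_sentence):
    print(part)
```

Note the limitation: clause-level splitting often drops subjects (the second and third clauses above have no explicit 他), which is exactly the kind of case that needs the restructuring logic neural approaches handle better.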
What Actually Takes Time#
- Synonym dictionary curation: Finding “simple equivalents” for 2,500+ HSK 6 words takes weeks of linguistic work
- Context handling: many words are polysemous—“花” can mean “flower” or “to spend (money/time)” depending on context—rules alone won’t catch this
- Idiom treatment: 成语 (4-character idioms) need special handling (replace whole unit, not individual characters)
- Quality validation: Need native speakers to verify simplifications don’t change meaning
- Edge cases: Names, numbers, technical terms, internet slang—each needs special rules
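The idiom point above (replace the whole 成语, never individual characters) can be sketched with a gloss dictionary. The glosses here are illustrative; a production dictionary needs thousands of curated entries.

```python
# Hypothetical idiom -> plain-language gloss entries
IDIOM_GLOSSES = {
    "画蛇添足": "做多余的事",            # "draw a snake, add feet" -> do something unnecessary
    "一举两得": "做一件事得到两个好处",   # one action, two gains
}

def replace_idioms(words, glosses=IDIOM_GLOSSES):
    """Replace each segmented idiom as a unit, never character by character."""
    return [glosses.get(w, w) for w in words]

# Pre-segmented input (jieba usually keeps common idioms as single tokens)
words = ["这样", "做", "是", "画蛇添足"]
print("".join(replace_idioms(words)))   # 这样做是做多余的事
```

This only works if segmentation keeps the idiom intact, which is another reason to load idiom lists into the segmenter's custom dictionary.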
Common Pitfalls#
- Over-simplifying: Replacing every HSK 5 word breaks flow (e.g., swapping “intelligent” for “very smart” turns “the very intelligent person” into the awkward “the very very smart person”)
- Meaning drift: Synonyms aren’t perfect (“老师” (teacher) → “先生” (Mr./teacher) shifts formality)
- Segmentation errors: Jieba mistakes cascade (if it segments wrong, replacements break)
- No ground truth: Unlike translation (many references exist), Chinese simplification has limited parallel corpora for validation
Team Skills Required#
- Rule-based MVP: Mid-level Python dev + native Chinese speaker for validation (2 people, 1 month)
- Production system: Senior NLP engineer + Chinese linguist + QA (3 people, 3 months)
- Neural approach: ML engineer + data scientist + Chinese linguist (3 people, 6+ months)
Realistic Expectations#
You’ll achieve:
- 70-80% of sentences simplified successfully (word replacements work)
- 15-20% need manual review (complex restructuring)
- 5-10% fail or degrade quality (idioms, context errors)
This is good enough for assistive tools (human reads final output). Not good enough for fully automated publishing (needs editorial review).
The technology is nascent compared to English: While English text simplification has commercial solutions, Chinese is still largely a research problem with limited production-ready libraries. Most organizations build custom solutions.
Library Landscape (2026)#
Key distinction: This research focuses on LIBRARIES (pip-installable, production-ready), not research datasets or one-off scripts.
Current state:
- ✅ Analysis libraries are mature (HSK-Character-Profiler, HanLP)
- ✅ Utility libraries are solid (jieba, OpenCC)
- ❌ Actual simplification libraries are sparse (mostly research code, not production libraries)
Most teams combine existing analysis/utility libraries with custom simplification logic.
What you’ll find in S1-rapid: Inventory of available libraries and their actual capabilities for text simplification.
S1: Rapid Discovery
S1-rapid: Chinese Text Simplification Libraries#
Quick Summary#
Key Finding: As of 2026, there are NO mature, pip-installable libraries specifically designed for Chinese text simplification. The landscape consists of:
- Research datasets (MCTS) - training data, not production libraries
- Analysis libraries (HSK-Character-Profiler, HanLP) - measure difficulty, don’t transform text
- Utility libraries (jieba, OpenCC) - building blocks for simplification, but you must write the logic yourself
Reality check: Unlike English (which has Rewordify, TextCompactor), Chinese text simplification is still mostly a DIY endeavor combining multiple libraries.
Recommended approach: Hybrid stack using jieba (segmentation) + HSK-Character-Profiler (analysis) + OpenCC (conversion) + custom simplification rules.
Libraries Inventory#
1. Text Simplification (Direct)#
MCTS (Multi-Reference Chinese Text Simplification)
- Type: Research dataset + evaluation scripts
- Not a library: Provides training data (691K+ parallel sentences), not a pip-installable tool
- GitHub: https://github.com/blcuicall/mcts
- Use case: Train your own neural simplification model
- Limitation: Requires ML expertise, months of development
chinese-comprehension
- Type: Analysis tool (not simplification)
- GitHub: https://github.com/Destaq/chinese-comprehension
- What it does: Analyzes text against your known words
- What it doesn’t do: Doesn’t transform text, just gauges difficulty
- Install: Clone the repo, then `pip install -r requirements.txt` (not on PyPI)
Verdict: No direct text simplification libraries exist on PyPI.
2. Analysis Libraries (Measure Difficulty)#
HSK-Character-Profiler
- Purpose: Analyze text readability based on HSK levels
- GitHub: https://github.com/Ancastal/HSK-Character-Profiler
- Pip: Not on PyPI, clone and run
- Features: Character proficiency analysis, text readability scoring
- For simplification: Use to verify output difficulty after simplification
- Status: Active (2024-2025)
Language-Analyzer
- Purpose: Multi-language text analysis including HSK profiling
- GitHub: https://github.com/Ancastal/Language-Analyzer
- Features: HSK profiling, readability analysis
- Limitation: Analysis only, not transformation
3. NLP Foundations (Building Blocks)#
jieba (结巴分词)
- Purpose: Chinese text segmentation (word splitting)
- PyPI: `pip install jieba`
- GitHub: https://github.com/fxsjy/jieba
- Essential for: Splitting unsegmented Chinese into words before simplification
- Accuracy: ~95% for general text
- Status: Mature, widely used
- Modes:
- Accurate Mode: Best for text analysis
- Full Mode: Gets all possible words (overlapping)
- Search Engine Mode: Cuts long words into short words
HanLP 2.x
- Purpose: Comprehensive NLP platform (deep learning)
- PyPI: `pip install hanlp`
- GitHub: https://github.com/hankcs/HanLP
- Features: Segmentation, POS tagging, NER, parsing, SRL
- For simplification: Segmentation + POS tagging to identify replaceable words
- Requires: Python 3.6+, PyTorch/TensorFlow
- Status: Active (latest release Oct 2025)
PyHanLP
- Purpose: Python wrapper for HanLP 1.x (traditional algorithms)
- PyPI: `pip install pyhanlp`
- GitHub: https://github.com/hankcs/pyhanlp
- Lighter than: HanLP 2.x (no deep learning overhead)
- Status: Active (latest Jan 2025)
LTP (Language Technology Platform)
- Purpose: Multi-task Chinese NLP platform
- PyPI: `pip install ltp`
- GitHub: https://github.com/HIT-SCIR/ltp
- Features: Segmentation, POS, NER, dependency parsing, SDP
- Architecture: Shared pre-trained model across tasks (efficient)
- For simplification: Dependency parsing to identify complex sentence structures
- Status: Research-grade (N-LTP from 2020)
chinese (Chinese Text Analyzer)
- PyPI: `pip install chinese`
- GitHub: https://github.com/morinokami/chinese
- Features: Tokenization, pinyin conversion, definitions
- Uses: jieba or pynlpir for tokenization
- Supports: Simplified and Traditional Chinese
- Limitation: Analysis/conversion, not simplification
4. Character Conversion#
OpenCC (Open Chinese Convert)
- Purpose: Traditional ↔ Simplified conversion
- PyPI: `pip install OpenCC` or `pip install opencc-python-reimplemented`
- GitHub: https://github.com/BYVoid/OpenCC
- Conversion modes:
- s2t: Simplified → Traditional
- t2s: Traditional → Simplified
- s2tw: Simplified → Traditional (Taiwan standard)
- s2hk: Simplified → Traditional (Hong Kong standard)
- For simplification: Pre-process text to normalized form (Simplified)
- Status: Mature, actively maintained
hanziconv
- PyPI: `pip install hanziconv`
- GitHub: https://github.com/berniey/hanziconv
- Purpose: Simpler Traditional/Simplified conversion
- Lighter than: OpenCC (pure Python)
hanzidentifier
- PyPI: `pip install hanzidentifier`
- GitHub: https://github.com/tsroten/hanzidentifier
- Purpose: Detect if text is Traditional or Simplified
- Use before: Conversion (know what you’re working with)
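A minimal stdlib sketch of the detect-before-convert idea. In practice hanzidentifier does the detection properly and OpenCC the conversion; this toy version just checks a handful of characters whose forms differ between the two scripts.

```python
# Tiny marker sets: characters written differently in each script
SIMPLIFIED_ONLY = set("国爱书车门见")
TRADITIONAL_ONLY = set("國愛書車門見")

def detect_script(text):
    """Crude script detection from marker characters."""
    if any(ch in TRADITIONAL_ONLY for ch in text):
        return "traditional"
    if any(ch in SIMPLIFIED_ONLY for ch in text):
        return "simplified"
    return "unknown"   # only shared-form characters seen

print(detect_script("我愛讀書"))   # traditional
print(detect_script("我爱读书"))   # simplified
```

The "unknown" branch is the interesting case: many characters are identical in both scripts, so short strings genuinely cannot be classified, which is why a real detector works over the whole text.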
5. Vocabulary Tools#
HSK 3.0 Lists
- GitHub: https://github.com/krmanik/HSK-3.0
- Purpose: Official HSK vocabulary lists (levels 1-9, the 2021 HSK 3.0 standard)
- Use for: Building synonym dictionaries at each level
- Format: Character lists, word lists
TOCFL Vocabulary
- GitHub: https://github.com/skishore/inkstone/pull/47
- Purpose: Taiwan TOCFL standards (Traditional Chinese)
- Use for: Simplification targeting TOCFL levels
6. Research Tools (Not Production Libraries)#
CTAP (Common Text Analysis Platform)
- Type: Research tool (196 linguistic features)
- Not pip-installable: Research implementation
- Paper: https://aclanthology.org/2022.lrec-1.592.pdf
- Use for: Feature extraction for ML models
CRIE (Chinese Readability Index Explorer)
- Type: Research system (82 features + SVM)
- Not publicly available: Academic tool
- Paper: https://link.springer.com/article/10.3758/s13428-015-0649-1
What’s Missing#
What you can’t pip install (as of 2026):
- ❌ Complete Chinese text simplification library
- ❌ HSK-aware synonym replacement engine
- ❌ Sentence simplification (complex → simple restructuring)
- ❌ Idiom simplification (成语 handling)
- ❌ Pre-trained neural simplification models (easy to load and use)
What you must build yourself:
- Synonym dictionaries mapping HSK 6 words → HSK 3 equivalents
- Sentence splitting logic (identify and split complex sentences)
- Simplification rules engine
- Quality validation pipeline
Recommended Stack#
For building a Chinese text simplification system:
```shell
# Install these libraries
pip install jieba                            # Segmentation
pip install opencc-python-reimplemented      # Conversion
pip install hanlp                            # Optional: advanced NLP features

# Clone these repos
git clone https://github.com/Ancastal/HSK-Character-Profiler
git clone https://github.com/krmanik/HSK-3.0
git clone https://github.com/blcuicall/mcts  # If training neural models
```
Then build:
- Segmentation pipeline (jieba)
- HSK level analyzer (HSK-Character-Profiler)
- Custom simplification logic:
- Word replacement (HSK vocabulary)
- Sentence splitting
- Idiom handling
- Quality validation
Time Estimates#
| Approach | Time | Complexity |
|---|---|---|
| Rule-based MVP | 2-4 weeks | Mid-level Python + Chinese speaker |
| Production system | 2-3 months | Senior NLP engineer + linguist |
| Neural model | 4-6 months | ML engineer + data scientist |
S1-rapid Conclusion#
The reality: Chinese text simplification is a BUILD, not BUY problem in 2026.
Unlike mature NLP tasks (segmentation, POS tagging) with turnkey libraries, simplification requires:
- Custom logic combining multiple libraries
- Domain expertise (Chinese linguistics + NLP)
- Iterative testing with native speakers
Next steps:
- S2-comprehensive: Dive into MCTS dataset, neural approaches, feature engineering
- S3-need-driven: Map use cases to implementation strategies
- S4-strategic: Build vs buy analysis, ROI models
Sources#
- MCTS: A Multi-Reference Chinese Text Simplification Dataset
- MCTS GitHub Repository
- chinese-comprehension GitHub
- HSK-Character-Profiler
- jieba PyPI
- HanLP Documentation
- OpenCC PyPI
- LTP GitHub
- HSK 3.0 Lists
Foundational Libraries for Chinese Text Simplification#
These libraries don’t perform text simplification directly, but they’re essential building blocks for any simplification system.
jieba (结巴分词) - Chinese Text Segmentation#
Purpose: Split continuous Chinese text into words
Why needed: Chinese has no spaces between words—you must segment before processing
Installation#
```shell
pip install jieba
```
Basic Usage#
```python
import jieba

text = "我爱自然语言处理"
words = jieba.cut(text)        # Returns a generator
print(" / ".join(words))       # Output: 我 / 爱 / 自然语言 / 处理
```
Segmentation Modes#
1. Accurate Mode (Default)#
```python
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
# Output: 我 / 来到 / 北京 / 清华大学
```
- Best for text analysis and NLP tasks
- Attempts most accurate segmentation
- Use for text simplification
2. Full Mode#
```python
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
# Output: 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学
```
- Finds all possible words (overlapping)
- Fast but not accurate
- Not recommended for simplification
3. Search Engine Mode#
```python
seg_list = jieba.cut_for_search("我来到北京清华大学")
# Output: 我 / 来到 / 北京 / 清华 / 华大 / 大学 / 清华大学
```
- Cuts long words into shorter segments
- Good for search indexing
- Not ideal for simplification
Custom Dictionaries#
Add domain-specific words jieba might miss:
```python
jieba.load_userdict("custom_words.txt")
# Format: word frequency part_of_speech
# 台中 100 n
# 云计算 50 n
```
Critical for simplification: Add HSK vocabulary to ensure consistent segmentation
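Building on the userdict format above, a small sketch that generates such a file from an HSK word list so HSK vocabulary segments as consistent units. The word list and frequency values are illustrative.

```python
import tempfile, os

# Generate a jieba user-dictionary file: "word frequency part_of_speech" per line
hsk_words = ["自然语言", "人工智能", "云计算"]   # illustrative entries

def write_userdict(words, path, freq=100, pos="n"):
    with open(path, "w", encoding="utf-8") as f:
        for w in words:
            f.write(f"{w} {freq} {pos}\n")

path = os.path.join(tempfile.gettempdir(), "hsk_userdict.txt")
write_userdict(hsk_words, path)
# Afterwards: jieba.load_userdict(path)
```

Regenerating the file from your canonical HSK lists (rather than hand-editing it) keeps segmentation and your difficulty dictionaries from drifting apart.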
Accuracy#
- General text: ~95% accuracy
- Domain-specific: May need custom dictionary (medical, legal, technical)
- Errors cascade: Wrong segmentation → wrong word replacements
Performance#
- Speed: Very fast (millions of characters per second)
- Memory: ~100MB with full dictionary loaded
- Language support: Simplified and Traditional Chinese
Role in Simplification Pipeline#
```python
# 1. Segment text
import jieba

text = "这个句子包含一些复杂的词汇"
words = list(jieba.cut(text))

# 2. Identify words to simplify (you build this logic)
for word in words:
    if word_difficulty(word) > target_hsk_level:
        simplified_word = find_synonym(word, target_hsk_level)
        # Replace word
```
Links:
- GitHub: https://github.com/fxsjy/jieba
- PyPI: https://pypi.org/project/jieba/
OpenCC - Traditional/Simplified Conversion#
Purpose: Convert between Traditional and Simplified Chinese variants
Why needed: Normalize text before simplification (most resources use Simplified)
Installation#
Option 1: Official binding (requires C++ library)
```shell
pip install OpenCC
```
Option 2: Pure Python (easier, no dependencies)
```shell
pip install opencc-python-reimplemented
```
Basic Usage#
```python
from opencc import OpenCC

# Initialize converter
cc = OpenCC('s2t')   # Simplified to Traditional

# Convert
text = "开放中文转换"
traditional = cc.convert(text)
print(traditional)   # Output: 開放中文轉換
```
Conversion Modes#
| Mode | Description | Example |
|---|---|---|
| `s2t` | Simplified → Traditional | 中国 → 中國 |
| `t2s` | Traditional → Simplified | 中國 → 中国 |
| `s2tw` | Simplified → Taiwan Standard | 鼠标 → 滑鼠 |
| `s2hk` | Simplified → Hong Kong Standard | 信息 → 資訊 |
| `t2tw` | Traditional → Taiwan Standard | 鼠標 → 滑鼠 |
| `t2hk` | Traditional → Hong Kong Standard | 資訊 → 資訊 |
Regional variants matter: Taiwan and Hong Kong use different vocabulary beyond character conversion.
Advanced Features#
Character-Level Conversion#
- One-to-one character mapping (mostly)
- Handles variant forms (異體字)
Phrase-Level Conversion#
- Multi-character expressions converted as units
- Example: “计算机” (computer, Mainland) → “電腦” (Taiwan)
Regional Idioms#
- Idioms converted to regional equivalents
- “鼠标” (mouse, Mainland) → “滑鼠” (Taiwan)
Role in Simplification Pipeline#
Pre-processing:
```python
from opencc import OpenCC

# Normalize to Simplified (most HSK resources use Simplified)
converter = OpenCC('t2s')
text = "這是傳統中文"
simplified = converter.convert(text)
# Now use the simplified version for HSK analysis and simplification
```
Post-processing (if targeting Traditional Chinese learners):
```python
# After simplification, convert back to Traditional
converter = OpenCC('s2t')
output = converter.convert(simplified_text)
```
Accuracy#
- Character conversion: Near 100% for common characters
- Regional vocabulary: Good coverage, but not exhaustive
- Context: Character-level conversion can miss nuances
Example issue:
- “后” can mean “后面” (back) or “皇后” (empress)
- Traditional: “後” (back) vs “后” (empress)
- OpenCC uses phrase dictionaries to handle most cases
Performance#
- Very fast (millions of characters per second)
- Minimal memory footprint
- Thread-safe
Links:
- GitHub: https://github.com/BYVoid/OpenCC
- PyPI (official): https://pypi.org/project/OpenCC/
- PyPI (pure Python): https://pypi.org/project/opencc-python-reimplemented/
HanLP - Comprehensive NLP Platform#
Purpose: Multi-task Chinese NLP (segmentation, POS, NER, parsing, etc.)
Why useful: Advanced linguistic analysis for sophisticated simplification
Installation#
```shell
pip install hanlp
```
Requirements: Python 3.6+, PyTorch or TensorFlow 2.x
Basic Usage#
```python
import hanlp

# Load pre-trained model (downloads on first use)
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)

# Analyze text
text = "我爱自然语言处理"
result = HanLP(text)
print(result)
# Output includes: tokens, POS tags, NER, dependency parse, etc.
```
Key Features for Simplification#
1. Word Segmentation#
```python
tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
tokens = tokenizer("我爱自然语言处理")
# Alternative to jieba with a different algorithm
```
2. Part-of-Speech Tagging#
```python
# Included in the multi-task model
result = HanLP(text)
pos_tags = result['pos']
# Identify nouns, verbs, adjectives to simplify
```
Use case: Only simplify content words (nouns, verbs), not function words
3. Named Entity Recognition#
```python
result = HanLP(text)
entities = result['ner']
# Identify people, places, organizations
# DON'T simplify proper nouns
```
4. Dependency Parsing#
```python
result = HanLP(text)
deps = result['dep']
# Understand sentence structure
# Identify complex subordinate clauses to split
```
Use case: Find sentences with deep syntactic trees → candidates for splitting
5. Semantic Role Labeling (SRL)#
```python
result = HanLP(text)
srl = result['srl']
# Identify who did what to whom
# Restructure passive → active voice
```
HanLP 2.x vs PyHanLP#
HanLP 2.x (Modern):
- Deep learning models (BERT, ELECTRA)
- State-of-the-art accuracy
- Heavier (requires PyTorch/TF)
- Slower (seconds per sentence)
PyHanLP (Traditional):
- Classic algorithms (HMM, CRF)
- Lighter weight (no DL frameworks)
- Faster (milliseconds per sentence)
- Slightly lower accuracy
For simplification MVP: Start with PyHanLP (lighter), upgrade to HanLP 2.x if you need advanced features
Role in Simplification Pipeline#
Advanced simplification logic:
```python
import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
text = "我昨天买了一本非常有趣的书"
result = HanLP(text)

# 1. Extract POS tags
tokens = result['tok/fine']  # ['我', '昨天', '买', '了', '一', '本', '非常', '有趣', '的', '书']
pos = result['pos/pku']      # ['PN', 'TIME', 'VV', 'AS', 'CD', 'M', 'AD', 'VA', 'DEG', 'NN']

# 2. Identify adjectives and adverbs
for token, tag in zip(tokens, pos):
    if tag in ['VA', 'AD']:  # adjectives, adverbs
        # Simplify: 非常 → 很, 有趣 → 好玩
        simplified = simplify_word(token)

# 3. Check dependency structure
deps = result['dep']
# If deeply nested → split the sentence
```
Performance Considerations#
- Model loading: ~10-30 seconds (first time, cached after)
- Inference: 0.5-2 seconds per sentence (CPU), faster on GPU
- Memory: 500MB-2GB depending on model
- Batch processing: Significantly faster for multiple sentences
For production: Use smaller models (ELECTRA_SMALL) or cache results
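The caching suggestion can be sketched with `functools.lru_cache`. The expensive HanLP call is stubbed out here (a stand-in, not the real API); in production the cached function would wrap `HanLP(text)` so repeated sentences are analyzed only once.

```python
from functools import lru_cache

CALLS = {"n": 0}   # counter to show the cache working

def expensive_analysis(text):
    """Stand-in for a slow HanLP pipeline call."""
    CALLS["n"] += 1
    return {"tokens": list(text)}   # placeholder result

@lru_cache(maxsize=10_000)
def analyze_cached(text):
    # lru_cache needs hashable return values, hence the tuple
    return tuple(expensive_analysis(text)["tokens"])

analyze_cached("你好")
analyze_cached("你好")   # served from cache, no second model call
print(CALLS["n"])        # 1
```

For a multi-process deployment, swap the in-process cache for a shared store (e.g. Redis) keyed on a hash of the sentence.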
Alternatives#
- LTP (Language Technology Platform): Similar features, different architecture
- jieba + custom POS tagger: Lighter but less accurate
- StanfordNLP: Multi-language but heavier
Links:
- Documentation: https://hanlp.hankcs.com/docs/install.html
- PyPI: https://pypi.org/project/hanlp/
- GitHub: https://github.com/hankcs/HanLP
Integration Strategy#
Recommended stack for text simplification:
- OpenCC - Normalize to Simplified Chinese
- jieba - Segment into words
- HanLP (optional) - POS tagging, NER, parsing for advanced logic
- Custom logic - Synonym replacement, sentence splitting
- HSK-Character-Profiler - Validate output difficulty
Minimal stack (faster, lighter):
- jieba + OpenCC + custom rules
Advanced stack (higher quality, slower):
- HanLP (all-in-one) + custom rules + MCTS-trained model
MCTS: Multi-Reference Chinese Text Simplification Dataset#
Overview#
MCTS is the first published multi-reference Chinese text simplification evaluation dataset, released in 2024 for the LREC-COLING conference.
Paper: MCTS: A Multi-Reference Chinese Text Simplification Dataset
GitHub: https://github.com/blcuicall/mcts
Authors: Ruining Chong, Luming Lu, Liner Yang, Jinran Nie, Zhenghao Liu, Shuo Wang, Shuhan Zhou, Yaoxin Li, Erhong Yang (2024)
What It Provides#
1. Evaluation Dataset#
- 3,615 human simplifications across 723 original sentences
- 5 simplifications per original sentence (multi-reference)
- Source: Penn Chinese Treebank (CTB) - news, government docs, broadcasts
2. Training Corpus#
- 691,474 high-quality parallel training pairs (complex ↔ simple)
- Largest scale training data in Chinese Text Simplification field (as of 2024)
- Generated via rigorous automatic screening
- Built using combination of Machine Translation + English Text Simplification
3. Evaluation Scripts#
- `hsk_evaluate.py` - HSK-based evaluation metrics
- Benchmarking tools for comparing simplification models
Dataset Composition#
Original sentences source: Penn Chinese Treebank (CTB)
- Xinhua news agency reports
- Government documents
- News magazines
- Broadcasts and interviews
- Online news and web logs
Human simplifications:
- Professional annotators
- Multiple references per sentence (captures variation)
- Quality controlled
What It’s NOT#
❌ Not a pip-installable library - It’s a dataset, not production code
❌ Not a pre-trained model - No ready-to-use simplification models included
❌ Not turnkey - Requires ML expertise to train models from the data
How to Use#
1. Download the Dataset#
```shell
git clone https://github.com/blcuicall/mcts
cd mcts
```
2. Access the Data#
- `data/evaluation/` - Multi-reference evaluation set (723 sentences × 5 simplifications)
- `data/training/` - Parallel corpus (691K pairs)
3. Train a Model#
- Use training corpus with seq2seq models (T5, BART, mBART)
- Fine-tune on Chinese text simplification task
- Evaluate on multi-reference test set
4. Evaluate with HSK Metrics#
```shell
python hsk_evaluate.py --input your_simplifications.txt
```
Use Cases#
For researchers:
- Train and evaluate neural text simplification models
- Compare against multi-reference gold standard
- Publish papers on Chinese text simplification
For product teams (with ML resources):
- Train custom simplification models for your domain
- Fine-tune on domain-specific data after pre-training on MCTS
- Requires: ML engineers, GPU resources, 2-4 months development
Not suitable for:
- Quick prototypes (dataset, not library)
- Teams without ML expertise
- Production deployment without training pipeline
Significance#
Why it matters:
- First multi-reference dataset - Previous work had single reference translations
- Largest training corpus - 691K pairs vs. previous datasets with < 100K
- HSK-aware evaluation - Specifically designed for learner-focused simplification
Research impact:
- Enables neural model development
- Standardizes evaluation (multi-reference BLEU, HSK metrics)
- Provides benchmark for comparing approaches
Limitations#
- Domain bias: News/formal text (CTB source)
- May not generalize to casual, social media, or technical text
- Mainland Chinese: Simplified Chinese from mainland sources
- Not optimized for Traditional Chinese or Taiwan/HK variants
- No model included: Data only, you must build the model
- Static dataset: As of 2024, no updates since publication
Practical Path Forward#
If you want to use MCTS:
Option A: Train a neural model
- Download dataset
- Set up training pipeline (T5/BART + seq2seq)
- Train on 691K corpus (requires GPU, ~$100-500 cloud cost)
- Evaluate on multi-reference test set
- Deploy model for inference
- Timeline: 2-4 months (with ML expertise)
Option B: Use as inspiration
- Study human simplifications to understand patterns
- Extract simplification rules (what words get replaced, how sentences split)
- Implement rules in code (rule-based approach)
- Use MCTS as validation (compare your output to human references)
- Timeline: 1-2 months (less ML-heavy)
Option C: Hybrid
- Use MCTS for hard cases (train small specialized model)
- Use rule-based for easy cases (word replacements)
- Combine for production system
- Timeline: 2-3 months
Integration with Other Tools#
MCTS pairs well with:
- jieba: Segment text before feeding to trained model
- HSK-Character-Profiler: Validate output difficulty
- OpenCC: Handle Traditional/Simplified conversion
- HanLP: Extract linguistic features for model training
Example Workflow#
# 1. Segment input text
import jieba
text = "这是一段复杂的中文文本"
words = jieba.cut(text)
# 2. Simplify with MCTS-trained model
# (You must train this model first using MCTS dataset)
simplified = your_trained_model.simplify(text)
# 3. Validate output difficulty
from hsk_profiler import analyze
hsk_level = analyze(simplified)
print(f"Output difficulty: HSK {hsk_level}")
Verdict#
MCTS is essential infrastructure for research and large-scale production systems, but it’s not a quick solution for MVPs or small teams.
Best for:
- Research teams publishing on Chinese NLP
- Large platforms with ML resources (> 10K texts/month to simplify)
Not for:
- Rapid prototypes (use rule-based instead)
- Small teams without ML expertise
- Projects with < 3 month timeline
S1-rapid Recommendations#
Executive Summary#
Finding: No pip-installable libraries exist specifically for Chinese text simplification as of 2026. This is a BUILD, not BUY problem.
Recommended approach: Hybrid stack combining existing NLP libraries (jieba, OpenCC, HSK-Character-Profiler) with custom simplification logic.
Timeline: 2-4 weeks for MVP (rule-based), 2-4 months for production system
Cost: $5K-$35K Year 1 depending on complexity
Key Findings from S1-rapid#
1. Library Landscape (Reality Check)#
What EXISTS:
- ✅ Text segmentation (jieba, HanLP)
- ✅ Traditional/Simplified conversion (OpenCC, hanziconv)
- ✅ Readability analysis (HSK-Character-Profiler)
- ✅ NLP foundations (HanLP, LTP)
- ✅ Training datasets (MCTS - 691K parallel sentences)
What DOESN’T exist:
- ❌ Production-ready text simplification libraries (pip-installable)
- ❌ Pre-trained simplification models (load-and-use)
- ❌ Turnkey solutions (like Rewordify for English)
Gap: Chinese text simplification is 3-5 years behind English in terms of library maturity.
2. Three Viable Approaches#
Option A: Rule-Based (Recommended for MVP)#
Stack: jieba + OpenCC + HSK vocabulary + custom rules
Timeline: 2-4 weeks
Cost: $5K-$15K
Pros:
- Fast to implement
- Predictable results
- Easy to debug and maintain
- No ML expertise required
Cons:
- Limited to word replacement (can’t restructure sentences well)
- Needs manual synonym dictionary curation
- Struggles with idioms and context
Success rate: 70-80% of sentences
Option B: Neural (Research-Grade)#
Stack: MCTS dataset + transformer model (T5/BART) + GPU training
Timeline: 2-4 months
Cost: $20K-$60K (development + GPU)
Pros:
- Can handle complex restructuring
- Improves with more data
- Handles idioms better
Cons:
- Requires ML expertise
- Unpredictable output (may generate fluent but incorrect text)
- Slower inference
- Hard to control exact output level
Success rate: 80-90% (but 10-20% errors can be severe)
Option C: Hybrid (Production-Ready)#
Stack: Rule-based for common cases + neural for complex cases
Timeline: 2-3 months
Cost: $15K-$35K
Pros:
- Best of both worlds
- Rules handle 70%, neural handles remaining 30%
- Fallback logic (if neural fails, use rule output)
Cons:
- More complex architecture
- Requires both rule curation AND model training
Success rate: 85-95%
Decision Matrix#
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| MVP / Prototype (< 1 month) | Rule-based | Fastest time to value |
| Language learning app (1K-10K texts/month) | Rule-based → Hybrid | Start simple, upgrade if needed |
| Large platform (> 10K texts/month) | Hybrid | ROI justifies complexity |
| Research / Publishing | Neural | Accuracy matters more than speed |
| Accessibility (government docs) | Rule-based | Predictability matters |
Recommended MVP Stack#
Goal: Working text simplification in 2-4 weeks
Components#
jieba - Text segmentation
pip install jieba
OpenCC - Traditional/Simplified normalization
pip install opencc-python-reimplemented
HSK-Character-Profiler - Difficulty validation
git clone https://github.com/Ancastal/HSK-Character-Profiler
HSK Vocabulary Lists
git clone https://github.com/krmanik/HSK-3.0
Custom logic (you build):
- Synonym dictionary (HSK 6→3 word mappings)
- Sentence splitting rules
- Idiom handling
Implementation Steps#
Week 1: Infrastructure
- Set up jieba segmentation pipeline
- Integrate OpenCC for normalization
- Load HSK vocabulary lists
- Set up HSK-Character-Profiler for validation
Week 2: Simplification logic
- Build synonym dictionary (map 500 common HSK 4-6 words to HSK 2-3 equivalents)
- Implement word replacement logic
- Add sentence splitting (sentences > 20 chars)
Week 3: Quality assurance
- Test on sample texts
- Native speaker validation
- Fix edge cases (names, numbers, idioms)
Week 4: Deployment
- Build API endpoint
- Add caching
- Performance optimization
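The weekly plan above can be sketched end to end. Below is a minimal rule-based pipeline; the toy lexicon, `SYNONYM_DICT`, and the greedy segmenter are illustrative stand-ins (production code would segment with jieba and load real HSK 3.0 lists):

```python
# Toy lexicon: word -> HSK level (illustrative stand-in for real HSK lists)
HSK_LEVELS = {"这": 1, "是": 1, "一个": 1, "复杂": 4, "的": 1, "句子": 3, "难": 2}

# HSK 4+ word -> easier synonym per target level (illustrative)
SYNONYM_DICT = {"复杂": {"3": "难", "2": "难"}}

def segment(text):
    """Greedy longest-match segmenter (stand-in for jieba)."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + 4), i, -1):
            if text[i:j] in HSK_LEVELS or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def simplify(text, target_hsk=3):
    """Replace words above the target HSK level with easier synonyms."""
    out = []
    for w in segment(text):
        if HSK_LEVELS.get(w, 99) > target_hsk:
            # Unknown/hard word: swap if a synonym exists, else keep as-is
            out.append(SYNONYM_DICT.get(w, {}).get(str(target_hsk), w))
        else:
            out.append(w)
    return "".join(out)

print(simplify("这是一个复杂的句子"))  # -> 这是一个难的句子
```

In practice the output would then be run through HSK-Character-Profiler (Week 1 setup) to confirm the result actually lands at the target level.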
What to Build Yourself#
1. Synonym Dictionary (Critical)#
Format:
{
  "复杂": {
    "hsk_level": 4,
    "synonyms": {
      "3": ["难"],
      "2": ["难"]
    }
  },
  "研究": {
    "hsk_level": 4,
    "synonyms": {
      "3": ["学习"],
      "2": ["学"]
    }
  }
}
Sources for building:
- HSK vocabulary lists (levels 1-6 or 1-9)
- Chinese learner dictionaries
- Manual curation by native speakers
Effort: 1-2 weeks for 500-1000 words
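The dictionary format above is plain JSON, so loading and querying it is straightforward. A sketch (the inline string stands in for a file such as a hypothetical `synonyms.json`):

```python
import json

# Same structure as the format shown above, inlined for illustration
raw = '''
{
  "复杂": {"hsk_level": 4, "synonyms": {"3": ["难"], "2": ["难"]}},
  "研究": {"hsk_level": 4, "synonyms": {"3": ["学习"], "2": ["学"]}}
}
'''
synonym_dict = json.loads(raw)

def easier_synonym(word, target_hsk):
    """Return an easier synonym for `word`, or None if unavailable."""
    entry = synonym_dict.get(word)
    if entry is None or entry["hsk_level"] <= target_hsk:
        return None  # unknown word, or already simple enough
    candidates = entry["synonyms"].get(str(target_hsk), [])
    return candidates[0] if candidates else None

print(easier_synonym("研究", 2))  # -> 学
```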
2. Simplification Rules#
Word replacement:
def simplify_word(word, target_hsk=3):
    if word_hsk_level(word) <= target_hsk:
        return word  # Already simple enough
    # Synonyms are stored as lists (see dictionary format above); take the first
    candidates = synonym_dict.get(word, {}).get('synonyms', {}).get(str(target_hsk), [])
    return candidates[0] if candidates else word
Sentence splitting:
import re

def split_long_sentence(sentence, max_length=15):
    if len(sentence) <= max_length:
        return [sentence]
    # Split at commas and enumeration commas; keep non-empty parts
    parts = [p for p in re.split(r'[，、]', sentence) if p]
    if len(parts) <= 1:
        return [sentence]  # no split point found; leave as-is
    return [s for p in parts for s in split_long_sentence(p, max_length)]
Idiom handling:
def simplify_idiom(text):
    # Replace 4-character idioms with plain explanations
    idiom_map = {
        "一举两得": "一次做两件事",
        "画蛇添足": "做多余的事"
    }
    for idiom, plain in idiom_map.items():
        text = text.replace(idiom, plain)
    return text
3. Quality Validation#
from hsk_profiler import analyze

def validate_simplification(original, simplified, target_hsk=3):
    # 1. Check difficulty
    difficulty = analyze(simplified)
    if difficulty > target_hsk:
        return False, "Still too difficult"
    # 2. Check meaning preservation
    # (compute_similarity is a semantic similarity metric you supply)
    similarity = compute_similarity(original, simplified)
    if similarity < 0.7:
        return False, "Meaning changed too much"
    return True, "OK"
Common Pitfalls to Avoid#
Over-simplification
- Don’t replace every word blindly
- Keep domain-specific terms (if learner needs them)
- Maintain natural flow
Segmentation errors compound
- Jieba mistakes → wrong word boundaries → wrong replacements
- Add custom dictionary for your domain
Context blindness
- “银行” = bank (financial) or riverbank
- Rule-based can’t distinguish without context
- Solution: Use POS tagging (HanLP) or accept limitation
No ground truth
- Unlike translation (many parallel corpora), simplification has limited references
- Solution: Human validation, iterative testing
Next Steps in 4PS Research#
S2-comprehensive (Technical Depth)#
- Deep dive into MCTS dataset structure
- Neural model architectures (T5, BART, mBART)
- Feature engineering for ML approaches
- Evaluation metrics (BLEU, SARI, HSK-aware metrics)
Questions to answer:
- How do you train a neural simplification model?
- What linguistic features correlate with simplification quality?
- How do you evaluate simplification (beyond human judgment)?
S3-need-driven (Use Case Mapping)#
- Language learning apps: Requirements and implementation
- Accessibility services: Government doc simplification
- Publishers: Educational content adaptation
Questions to answer:
- What accuracy threshold is “good enough” for each use case?
- What’s the TCO over 3 years for each approach?
- Build vs buy decision tree
S4-strategic (Viability & ROI)#
- 3-year TCO analysis (rule-based vs neural vs hybrid)
- Break-even volume (when does automation pay off?)
- Risk assessment (what can go wrong?)
Questions to answer:
- At what scale does neural approach become cost-effective?
- What’s the risk of meaning drift in automated simplification?
- When should you hire editors instead of building tech?
S1-rapid Conclusion#
For most teams: Start with rule-based MVP using jieba + OpenCC + custom logic.
Timeline: 2-4 weeks to working prototype
Cost: $5K-$15K
Success rate: 70-80% (good enough for MVP)
Upgrade path: If you hit limitations (complex sentences not simplifying well), add neural model for those cases (hybrid approach).
Reality: Chinese text simplification is immature compared to English. You will build custom solutions, not pip install magic.
S2: Comprehensive
S2-comprehensive: Technical Deep Dive#
Status: 🚧 IN PROGRESS#
This phase will cover the technical depth of Chinese text simplification, including:
Planned Topics#
1. Neural Model Architectures#
- Transformer-based approaches (T5, BART, mBART)
- Seq2seq vs pre-trained models
- Fine-tuning strategies for Chinese
- Model size trade-offs (performance vs inference speed)
2. MCTS Training Pipeline#
- Dataset structure and format
- Training data preprocessing
- Model training workflow
- Hyperparameter tuning
- Evaluation on multi-reference test set
3. Rule-Based Approaches (Detailed)#
- Synonym extraction methods
- HSK-level word mapping strategies
- Sentence splitting algorithms
- Idiom detection and replacement
- Compound word handling
4. Evaluation Metrics#
- BLEU score for simplification
- SARI (System output Against References and against the Input sentence)
- HSK-aware metrics (vocabulary coverage)
- Semantic similarity (meaning preservation)
- Fluency metrics
- Human evaluation protocols
5. Feature Engineering#
- Linguistic features for simplification
- Character frequency analysis
- Sentence complexity metrics
- Discourse structure
- Readability formulas for Chinese
6. Comparative Analysis#
- Rule-based vs neural (detailed comparison)
- Accuracy vs speed trade-offs
- Error analysis (what fails in each approach)
- Hybrid architectures
Research Questions#
What makes a good simplification model?
- Accuracy benchmarks from MCTS paper
- State-of-the-art results (2024-2026)
How do you train on MCTS dataset?
- Step-by-step training guide
- GPU requirements
- Training time estimates
- Fine-tuning vs training from scratch
What linguistic features matter most?
- Feature importance analysis
- Correlation with simplification quality
How do you evaluate without ground truth?
- Multi-reference evaluation
- Automatic metrics vs human judgment
- Inter-annotator agreement
Estimated Time#
3-4 hours for comprehensive technical research
Deliverables (Planned)#
- neural-architectures.md - T5, BART, mBART for Chinese TS
- mcts-training-guide.md - How to train on MCTS dataset
- evaluation-metrics.md - BLEU, SARI, HSK metrics deep dive
- rule-based-detailed.md - Advanced rule-based techniques
- feature-engineering.md - Linguistic features for ML
- recommendation.md - S2 technical recommendations
Status: Outline created, detailed research pending
Next session: Start with neural-architectures.md or mcts-training-guide.md
Evaluation Metrics for Chinese Text Simplification#
The Challenge#
Unlike translation (compare to reference), simplification has:
- Multiple valid outputs (many ways to simplify)
- No single ground truth (HSK 3 can be expressed many ways)
- Dual goals: Simplicity AND meaning preservation
Automatic Metrics#
1. BLEU (Bilingual Evaluation Understudy)#
What it measures: N-gram overlap with reference simplifications
from sacrebleu import corpus_bleu
references = [["这是简单的句子。"], ["这个句子很简单。"]]  # one stream per reference
hypothesis = ["这是简单句子。"]
score = corpus_bleu(hypothesis, references, tokenize="zh")  # Chinese tokenizer
print(score.score)  # 0-100, higher is better
Pros: Standard, widely used
Cons: Rewards exact matches, penalizes valid paraphrases
Typical scores: 30-45 for text simplification (lower than translation)
2. SARI (System output Against References and Input)#
What it measures: How well you ADD simple words, KEEP important words, DELETE complex words
from easse.sari import corpus_sari
sources = ["这是一个非常复杂的句子。"]
predictions = ["这是复杂句子。"]
references = [["这是难句子。"], ["这个句子很难。"]]  # shape: (n_refs, n_samples)
score = corpus_sari(sources, predictions, references)
print(score)  # 0-100, higher is better
Install: pip install easse
Formula:
SARI = (F1_add + F1_keep + F1_delete) / 3
Pros: Designed for simplification, better than BLEU
Cons: Requires multiple references for accuracy
Typical scores: 35-45 for good simplification
3. HSK Vocabulary Coverage#
What it measures: Percentage of words within target HSK level
def hsk_coverage(text, target_level=3):
    words = list(jieba.cut(text))  # materialize: the generator can only be consumed once
    hsk_vocab = load_hsk_vocab(levels=range(1, target_level + 1))
    known_words = sum(1 for w in words if w in hsk_vocab)
    return known_words / len(words)

coverage = hsk_coverage("这是一个简单的句子", target_level=3)
# 0.95 = 95% of words are HSK 1-3
Pros: Directly measures learner comprehension
Cons: Doesn’t measure sentence complexity
Targets:
- HSK 2: 90-95% coverage
- HSK 3: 95-98% coverage
- HSK 4: 98-99% coverage
4. Sentence Length#
What it measures: Average characters per sentence (simpler = shorter)
avg_length = sum(len(s) for s in sentences) / len(sentences)
Targets:
- HSK 2: 8-12 characters
- HSK 3: 12-18 characters
- HSK 4: 18-25 characters
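The length targets above can serve as a quick automated check. A sketch (bounds copied from the list; the helper name is illustrative):

```python
# (lo, hi) character-count targets per HSK level, from the list above
LENGTH_TARGETS = {2: (8, 12), 3: (12, 18), 4: (18, 25)}

def within_length_target(sentence, hsk_level):
    """True if the sentence length falls inside the target band."""
    lo, hi = LENGTH_TARGETS[hsk_level]
    return lo <= len(sentence) <= hi

print(within_length_target("这是一个简单的句子", 2))  # 9 characters -> True
```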
5. Semantic Similarity#
What it measures: Meaning preservation (does simplification keep same meaning?)
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
original = "这是一个非常复杂的句子"
simplified = "这是难句子"
emb1 = model.encode(original)
emb2 = model.encode(simplified)
similarity = util.cos_sim(emb1, emb2)
print(similarity)  # 0-1, higher is better
Thresholds:
- < 0.7: Meaning changed too much
- 0.7-0.85: Acceptable paraphrase
- > 0.85: Very similar meaning
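These thresholds map naturally to a small triage helper (a sketch; the function name and labels are illustrative):

```python
def similarity_verdict(sim):
    """Classify a cosine-similarity score using the thresholds above."""
    if sim < 0.7:
        return "meaning changed too much"
    if sim <= 0.85:
        return "acceptable paraphrase"
    return "very similar meaning"

print(similarity_verdict(0.82))  # -> acceptable paraphrase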
Human Evaluation#
Dimensions to Measure#
Grammaticality: Is the simplified text fluent?
- Scale: 1-5 (1=broken, 5=perfect)
Meaning Preservation: Does it keep the original meaning?
- Scale: 1-5 (1=completely different, 5=identical)
Simplicity: Is it simpler than the original?
- Scale: 1-5 (1=same difficulty, 5=much simpler)
Adequacy: Would an HSK X learner understand this?
- Binary: Yes/No
Evaluation Protocol#
Annotators: 3-5 native Chinese speakers (preferably with teaching experience)
Sample size: 100-200 sentence pairs (random sample)
Agreement: Calculate inter-annotator agreement (Fleiss’ kappa)
- κ > 0.6: Good agreement
- κ < 0.4: Revise guidelines
Cost: $500-1000 for 200 evaluations (crowdsourcing) or $2K-5K (expert annotators)
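Fleiss' kappa itself is simple to compute. A self-contained sketch: `ratings` has one row per evaluated sentence, one column per rating category, each cell counting how many of the n annotators chose that category:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an items x categories count matrix."""
    N = len(ratings)        # number of items
    n = sum(ratings[0])     # raters per item (assumed constant)
    k = len(ratings[0])     # number of categories
    # Proportion of all assignments falling in each category
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two items, three raters each: perfect agreement gives kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3]]))  # -> 1.0
```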
Composite Score#
Combine metrics for overall quality:
def evaluate_simplification(original, simplified, references, target_hsk=3):
    # Automatic metrics
    bleu = compute_bleu([simplified], [references])
    sari = compute_sari([original], [simplified], [references])
    hsk_cov = hsk_coverage(simplified, target_hsk)
    similarity = semantic_similarity(original, simplified)
    # Composite score
    # (Note: BLEU/SARI are 0-100 while coverage/similarity are 0-1;
    # normalize to a common scale before combining.)
    score = {
        'bleu': bleu,
        'sari': sari,
        'hsk_coverage': hsk_cov,
        'semantic_sim': similarity,
        'composite': 0.2*bleu + 0.3*sari + 0.3*hsk_cov + 0.2*similarity
    }
    # Pass criteria
    passes = (
        hsk_cov >= 0.95 and              # 95%+ HSK coverage
        similarity >= 0.75 and           # Meaning preserved
        len(simplified) < len(original)  # Actually simpler
    )
    return score, passes
Benchmarking Your System#
Baseline: No Simplification#
- BLEU: 0 (no match with references)
- SARI: ~30 (keeps all words, doesn’t simplify)
- HSK coverage: Depends on original
Rule-Based Target#
- SARI: 35-40
- HSK coverage: 90-95%
- Semantic similarity: 0.8-0.9
Neural Target#
- SARI: 40-45
- HSK coverage: 85-95% (less controllable)
- Semantic similarity: 0.75-0.85
MCTS Paper Results#
- Best models: ~40 BLEU, ~45 SARI
- Human upper bound: ~60 SARI (multi-reference)
Practical Validation Workflow#
Week 1: Automated
- Run on 1K test sentences
- Compute BLEU, SARI, HSK coverage
- Filter failures (< thresholds)
Week 2: Spot Check
- Manual review of 100 random samples
- Identify error patterns (what’s breaking?)
Week 3: Human Eval
- Formal evaluation on 200 samples
- Calculate inter-annotator agreement
- Iterate if needed
Week 4: Production
- Deploy with monitoring
- Log edge cases for improvement
- Periodic re-evaluation
Monitoring in Production#
Track these metrics over time:
# Log per simplification
{
'original_length': 45,
'simplified_length': 28,
'hsk_coverage': 0.94,
'semantic_similarity': 0.82,
'inference_time_ms': 250
}
# Alert if:
# - HSK coverage < 0.90 (too hard)
# - Semantic similarity < 0.70 (meaning drift)
# - Inference time > 500ms (too slow)
Error Analysis#
Common failure modes:
Over-simplification: “研究表明” → “说” loses academic tone
- Fix: Be more conservative with replacements
Under-simplification: Didn’t simplify hard words
- Fix: Expand synonym dictionary
Meaning drift: “银行” (bank) → “河边” (riverbank) wrong context
- Fix: Use POS tags or context-aware rules
Unnatural output: “非常的好” (ungrammatical)
- Fix: Add grammar validation post-processing
Tools#
Libraries:
- sacrebleu: BLEU calculation
- easse: SARI and other simplification metrics (English-focused but adaptable)
- sentence-transformers: Semantic similarity
- jieba: Segmentation for HSK coverage
MCTS eval scripts: https://github.com/blcuicall/mcts (includes HSK evaluator)
Verdict#
MVP evaluation (fast):
- HSK coverage (must-have)
- Sentence length reduction
- Manual spot-checks (50 samples)
Production evaluation (rigorous):
- SARI (automatic)
- Semantic similarity (automatic)
- Human eval (200 samples, quarterly)
Research evaluation (comprehensive):
- All automatic metrics
- Human eval (500+ samples)
- Inter-annotator agreement
- Error analysis by category
Neural Architectures for Chinese Text Simplification#
Models That Work#
1. mBART (Multilingual BART)#
Best for: Chinese text simplification (multilingual pre-training helps)
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="zh_CN", tgt_lang="zh_CN")
# Fine-tune on MCTS dataset
Pros: Pre-trained on Chinese, handles seq2seq well
Cons: Large (600M params), slow inference
MCTS paper results: Not specifically tested, but BART-style models work well
2. mT5 (Multilingual T5)#
Best for: Chinese when you need smaller models
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
Sizes: Small (300M), Base (580M), Large (1.2B)
Pros: Good Chinese support, multiple sizes
Cons: Requires more training data than BART
3. CPT (Chinese Pre-Trained Transformer)#
Best for: Chinese-only tasks (no multilingual overhead)
GitHub: https://github.com/fastnlp/CPT
Pros: Optimized for Chinese, faster than mBART
Cons: Less widely adopted, fewer resources
Training Setup#
Hardware Requirements#
| Model Size | GPU Memory | Training Time (MCTS 691K) | Inference Speed |
|---|---|---|---|
| mT5-small | 16GB | 2-3 days | 0.5s/sentence |
| mT5-base | 24GB | 4-5 days | 1s/sentence |
| mBART | 32GB+ | 5-7 days | 1.5s/sentence |
Cloud costs: ~$100-500 (AWS p3.2xlarge or equivalent)
Training Pipeline#
# 1. Load MCTS dataset
from datasets import load_dataset
dataset = load_dataset('json', data_files={'train': 'mcts/train.jsonl'})
# 2. Tokenize
def tokenize(examples):
    inputs = tokenizer(examples['source'], max_length=128, truncation=True)
    targets = tokenizer(examples['target'], max_length=128, truncation=True)
    inputs['labels'] = targets['input_ids']
    return inputs

dataset = dataset.map(tokenize, batched=True)
# 3. Train
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    save_steps=10000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
)
trainer.train()
Fine-Tuning Strategies#
1. Full Fine-Tuning#
- Update all model weights
- Best accuracy
- Requires most GPU memory
- 3-5 epochs on MCTS: ~$200-500
2. LoRA (Low-Rank Adaptation)#
- Update only small adapter layers
- 90% of full fine-tuning accuracy
- 1/4 the memory usage
- Recommended for smaller teams
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
# Train as normal, much less memory
3. Prefix Tuning#
- Add learnable prefix tokens
- Even smaller than LoRA
- Slightly lower accuracy
Controlling Output Difficulty#
Challenge: Neural models don’t respect HSK levels by default.
Solution 1: Prompt engineering
input_text = f"[HSK 3] {source_text}"
# Model learns to simplify to HSK 3 level
Solution 2: Separate models per level
- Train mT5-small for HSK 3
- Train mT5-small for HSK 4
- Route at inference time
Solution 3: Post-process with HSK validator
simplified = model.generate(input_ids)
if hsk_level(simplified) > target_level:
    ...  # Reject and regenerate with different decoding params
Decoding Strategies#
Beam Search (Standard)#
output = model.generate(
    input_ids,
    max_length=128,
    num_beams=5,
    early_stopping=True,
)
Best for: Quality (default choice)
Sampling#
output = model.generate(
    input_ids,
    max_length=128,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)
Best for: Variety (multiple simplification candidates)
Constrained Decoding#
Force output to use only HSK 1-3 vocabulary (advanced).
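The core idea can be shown without a full model: mask out every token that falls outside the allowed vocabulary before picking the next token. This is a framework-free toy sketch; in a real system the mask would live in a transformers `LogitsProcessor`, and the vocabulary and scores below are invented for illustration:

```python
import math

# Toy vocabulary and the HSK 1-3 subset we allow the decoder to emit
VOCAB = ["这", "是", "复杂", "难", "的", "句子", "<eos>"]
ALLOWED = {"这", "是", "难", "的", "句子", "<eos>"}

def mask_logits(logits):
    """Set scores of disallowed tokens to -inf so they can never win."""
    return [x if tok in ALLOWED else -math.inf for tok, x in zip(VOCAB, logits)]

def greedy_pick(logits):
    masked = mask_logits(logits)
    return VOCAB[masked.index(max(masked))]

# "复杂" has the highest raw score, but the mask forces "难" instead
print(greedy_pick([0.1, 0.2, 2.0, 1.5, 0.3, 0.4, 0.0]))  # -> 难
```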
Benchmarks from Literature#
MCTS paper (2024):
- BART-based models: ~40 BLEU score
- mT5-base: ~38 BLEU score
- Human references: 100 BLEU (by definition)
Reality check: BLEU scores are low because simplification has multiple valid outputs. Multi-reference BLEU is more meaningful.
Production Considerations#
Inference Optimization#
- TorchScript: 20-30% faster
- ONNX Runtime: 30-50% faster
- Model quantization: 2-4x faster, slight quality loss
- Batching: 5-10x throughput improvement
Cost at Scale#
1K simplifications/day:
- GPU inference (T4): $50-100/month
- CPU inference (optimized): $20-50/month
- Serverless (AWS Lambda + GPU): $100-200/month
10K simplifications/day:
- Dedicated GPU server becomes cost-effective
Verdict#
For most teams: Start with mT5-base + LoRA on MCTS
- Good balance of quality and resources
- 2-3 days training on single GPU
- Deploy with ONNX for fast inference
For research: mBART-large (best quality)
For production at scale: mT5-small (fast, good enough)
S2-comprehensive Recommendations#
Neural Approach: Go/No-Go#
GO if:#
- Volume > 5K texts/month (justify training cost)
- Accuracy needs > 80% (rule-based plateaus at 70-80%)
- Have ML engineer or budget for consulting
- Can tolerate 3-5% meaning drift
NO-GO if:#
- Volume < 1K texts/month (not worth complexity)
- Need 100% predictable output (use rules)
- No ML expertise and < $20K budget
- Can’t accept any meaning errors
Recommended Neural Stack#
mT5-base + LoRA on MCTS dataset:
- 2-3 days training (single GPU, ~$150)
- Deploy with ONNX (1s/sentence CPU)
- 40-45 SARI, 85-95% HSK coverage
- LoRA = 1/4 memory of full fine-tuning
Implementation:
# 1. Setup
pip install transformers datasets peft onnx
# 2. Train
python train.py \
--model google/mt5-base \
--dataset mcts/train.jsonl \
--epochs 3 \
--lora_r 16 \
--output models/mt5-lora-hsk3
# 3. Evaluate
python eval.py \
--model models/mt5-lora-hsk3 \
--test mcts/test.jsonl \
--metrics sari,bleu,hsk_coverage
# 4. Deploy
python convert_to_onnx.py \
--model models/mt5-lora-hsk3 \
  --output models/mt5-lora-hsk3.onnx
Timeline: 1 week (setup + train + eval)
Cost: $200-500 (cloud GPU + storage)
Evaluation Strategy#
MVP: HSK coverage + spot checks
- 100 test sentences
- Must achieve 95%+ HSK coverage
- Manual review of 20 samples
Production: SARI + semantic similarity + human eval
- Run SARI on full test set (target: 40+)
- Semantic similarity > 0.75 (meaning preserved)
- Human eval on 200 samples quarterly
Monitoring: Log these per-simplification
{
'hsk_coverage': 0.94,
'semantic_similarity': 0.82,
'inference_time_ms': 250
}
# Alert if coverage < 0.90 or similarity < 0.70
Hybrid Architecture (Best of Both)#
Route by sentence complexity:
def simplify(text):
    complexity = measure_complexity(text)
    if complexity < 15:  # Simple sentence
        return rule_based_simplify(text)  # Fast, predictable
    else:  # Complex sentence
        result = neural_simplify(text)
        if validate(result):
            return result
        else:
            return rule_based_simplify(text)  # Fallback
Result: 85-90% success rate (neural handles hard cases, rules are fallback)
S2 Key Takeaways#
- Neural is viable but not trivial (1 week + $500, requires ML skills)
- mT5-base + LoRA = best balance (quality vs resources)
- SARI + HSK coverage = must-have metrics
- Hybrid architecture = production-grade (rules + neural)
- 3-5% meaning drift is unavoidable with neural (need human review)
Next: S3 maps these technical options to specific use cases with TCO models.
S3: Need-Driven
S3-need-driven: Solutions by Use Case#
Status: 🚧 PLANNED#
This phase will map implementation strategies to real-world applications with cost models.
Planned Topics#
1. Language Learning Platforms#
- Requirements analysis (graded readers, adaptive content)
- Implementation strategy (rule-based MVP → hybrid)
- Cost model (3-year TCO)
- Success metrics (learner engagement, comprehension)
- Case study examples
2. Accessibility Services#
- Government document simplification
- Healthcare information
- Public service announcements
- Requirements (accuracy, consistency, auditability)
- Implementation strategy
- Compliance and legal considerations
3. Educational Publishers#
- Textbook adaptation
- Multi-level content generation
- Editorial workflow integration
- Quality assurance requirements
- ROI analysis
4. AI Tutoring Systems#
- Dynamic difficulty adjustment
- Real-time simplification
- Personalization (beyond HSK levels)
- Latency requirements
- Implementation strategy
5. Decision Tree#
- Use case → approach mapping
- Volume thresholds (rule-based vs neural)
- Accuracy requirements → implementation
- Budget constraints → solution
Research Questions#
What accuracy is “good enough” for each use case?
- Learner apps: 70-80% (human reads output)
- Publishing: 90%+ (editorial review)
- Accessibility: 85%+ (legal compliance)
What’s the TCO over 3 years?
- Rule-based: $12K-$35K
- Neural: $25K-$80K
- Hybrid: $20K-$60K
- Manual editing baseline: $24K-$72K/year
When should you hire editors instead of automating?
- Volume < 100 texts/month
- High-stakes content (legal, medical)
- Niche domains (limited training data)
Estimated Time#
3-4 hours for use case mapping and cost modeling
Deliverables (Planned)#
- use-case-language-learning.md
- use-case-accessibility.md
- use-case-publishers.md
- use-case-ai-tutoring.md
- decision-tree.md - Which approach for which scenario
- recommendation.md - S3 summary
Status: Outline created, detailed research pending
Next session: Start after S2-comprehensive is complete
S3-need-driven Recommendations#
Quick Decision Tree#
Volume/month?
├─ < 500
│ └─ Manual editing ($3K-6K/month)
│ OR rule-based if need latency < 5min
│
├─ 500-5K
│ ├─ Accuracy < 80%? → Rule-based ($10K-15K year 1)
│ └─ Accuracy 80%+? → Hybrid ($25K-40K year 1)
│
└─ > 5K
├─ Accuracy < 85%? → Hybrid ($30K-50K year 1)
└─ Accuracy 90%+? → Neural + review ($50K-90K year 1)
Use Case Mapping#
| Use Case | Approach | Why |
|---|---|---|
| Language learning app | Rule-based → Hybrid | Users tolerate 75-85%, scale gradually |
| Government accessibility | Hybrid + mandatory review | Need 90%+ + auditability |
| Publishers | Neural + editorial | 95%+ needed, editors refine output |
| AI tutoring | Neural (optimized) | 10K+/day needs speed + personalization |
| News sites | Hybrid | Daily content, 80%+ acceptable |
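The decision tree and mapping above reduce to a small routing function. A sketch, with thresholds copied from the tree (the function name and return labels are illustrative):

```python
def recommend(volume_per_month, accuracy_need):
    """Map monthly volume and required accuracy (0-1) to an approach."""
    if volume_per_month < 500:
        return "manual (or rule-based if latency matters)"
    if volume_per_month <= 5000:
        return "rule-based" if accuracy_need < 0.80 else "hybrid"
    return "hybrid" if accuracy_need < 0.85 else "neural + review"

print(recommend(2000, 0.75))  # -> rule-based
```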
Break-Even Analysis#
Automation vs manual editing:
- 100 texts/month: Manual cheaper ($1.2K vs $1.7K/month amortized)
- 500 texts/month: Break-even point
- 2K texts/month: Automation 3x cheaper
- 10K texts/month: Automation 5x cheaper
Rule-based vs neural:
- < 5K/month: Rule-based cheaper
- 5K-20K/month: Hybrid best ROI
- > 20K/month: Full neural justified
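The comparisons above all follow the same arithmetic: automation trades a fixed build cost (amortized over its lifetime) for a much lower per-text cost. A generic break-even sketch; every parameter value below is an illustrative assumption, not a figure from this report:

```python
def break_even_volume(build_cost, amortize_months, manual_per_text, auto_per_text):
    """Monthly volume at which amortized automation cost equals manual cost."""
    # fixed monthly cost / per-text savings
    return build_cost / amortize_months / (manual_per_text - auto_per_text)

# e.g. $36K build amortized over 36 months, $12/text manual vs $2/text
# automated (inference + spot-check review)
print(break_even_volume(36000, 36, 12.0, 2.0))  # -> 100.0
```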
Key Insight#
Most teams should start rule-based even if they’ll eventually need neural:
- Learn the domain (what fails? what patterns?)
- Collect data (input → desired output)
- Fine-tune neural on YOUR data (better than MCTS alone)
Exception: If you have > $50K budget and 10K+ texts/month from day 1, skip to neural.
Cost Optimization#
Year 1 costs high (development), Year 2-3 drop 70% (maintenance only):
- Rule-based: $15K → $5K/year
- Hybrid: $30K → $8K/year
- Neural: $60K → $15K/year
Mistake to avoid: Comparing Year 1 automation cost to Year 1 manual cost. Compare 3-year TCO.
Success Criteria by Use Case#
Learner apps:
- 90%+ completion rate
- < 5% difficulty complaints
- 95%+ HSK coverage
Accessibility:
- 0 legal challenges due to unclear language
- 90%+ users report “easy to understand”
- Audit trail for all simplifications
Publishers:
- Editors spend 50% less time vs writing from scratch
- 95%+ accuracy (minimal edits needed)
- Consistent style across levels
AI tutoring:
- < 500ms latency
- 80%+ learners report helpful explanations
- Adapt to user over time (personalization)
Use Case Implementation Guide#
1. Language Learning Apps (B2C)#
Requirements#
- Volume: 1K-10K texts/month
- Accuracy: 75-85% acceptable (users read output, can adapt)
- Latency: < 2s per text
- Cost: Must be cheaper than manual editing
Recommended Approach#
Phase 1 (MVP): Rule-based
- jieba + OpenCC + custom rules
- Timeline: 2-4 weeks
- Cost: $5K-15K
Phase 2 (Scale): Hybrid
- Rules for 70% (simple cases)
- Neural for 30% (complex sentences)
- Timeline: +1 month
- Cost: +$10K-20K
3-Year TCO#
| Year | Rule-Based | Hybrid | Manual Editing |
|---|---|---|---|
| 1 | $15K | $25K | $36K |
| 2 | $5K | $8K | $36K |
| 3 | $5K | $8K | $36K |
| Total | $25K | $41K | $108K |
Break-even: 500 texts/month (automation cheaper than editors)
Success metrics:
- 90%+ learners complete articles
- < 5% complaints about difficulty
- HSK coverage 95%+
2. Accessibility Services (Government)#
Requirements#
- Volume: 100-1K documents/month
- Accuracy: 90%+ (public-facing, legal implications)
- Auditability: Must explain simplifications
- Consistency: Same input → same output
Recommended Approach#
Hybrid with human review
- Rule-based for consistency
- Neural for complex legal language
- Mandatory human review before publication
Timeline: 3 months (includes compliance review)
Cost: $30K-50K (development + legal review)
3-Year TCO#
| Year | Hybrid + Review | Manual Only |
|---|---|---|
| 1 | $60K | $48K |
| 2 | $25K | $48K |
| 3 | $25K | $48K |
| Total | $110K | $144K |
Break-even: 200 documents/month
Constraints:
- Must log all simplifications (auditability)
- Rule-based preferred (explainable)
- Human QA on 100% of output
3. Educational Publishers#
Requirements#
- Volume: 500-2K texts/year (textbooks, readers)
- Accuracy: 95%+ (high stakes)
- Multiple levels: Need HSK 2, 3, 4, 5 versions
- Editorial workflow: Integrate with existing process
Recommended Approach#
Neural + editorial workflow
- Train separate models per HSK level
- Output 3 candidates per level
- Editors select best + refine
- Builds dataset for future improvement
Timeline: 4-6 months
Cost: $50K-80K (development + training)
3-Year TCO#
| Year | Neural + Editorial | Manual Only |
|---|---|---|
| 1 | $90K | $80K |
| 2 | $30K | $80K |
| 3 | $25K | $80K |
| Total | $145K | $240K |
Break-even: 1K texts/year
Workflow:
- Author writes at natural level
- Neural generates HSK 3, 4, 5 versions
- Editors review and refine
- Collect edits for model improvement
4. AI Tutoring Systems#
Requirements#
- Volume: 10K+ per day (real-time)
- Accuracy: 80%+ (AI can re-explain if confused)
- Latency: < 500ms (conversational)
- Personalization: Adapt to individual learner, not just HSK level
Recommended Approach#
Optimized neural with caching
- mT5-small (fast inference)
- ONNX runtime on CPU
- Cache common simplifications
- Fine-tune on user feedback
Timeline: 3-4 months
Cost: $40K-70K
Operating Costs#
| Volume/day | Infrastructure | Cost/month |
|---|---|---|
| 10K | 2x CPU (8 core) | $200 |
| 50K | GPU (T4) | $400 |
| 100K | 2x GPU | $800 |
Latency optimization:
- Caching: 50% hit rate → 250ms avg
- Batching: 5-10x throughput
- Model quantization: 2x faster
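The 250ms figure is just the expected latency over cache hits and misses. A quick sketch, assuming a near-zero cache lookup and ~500ms uncached inference:

```python
def avg_latency_ms(hit_rate, cached_ms, uncached_ms):
    """Expected latency under a cache: weighted average of hit and miss paths."""
    return hit_rate * cached_ms + (1 - hit_rate) * uncached_ms

# 50% hit rate, ~0ms cache lookup, 500ms model inference
print(avg_latency_ms(0.5, 0, 500))  # 250.0
```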
Decision Matrix#
| Scenario | Volume/month | Accuracy Need | Recommended | Timeline | Year 1 Cost |
|---|---|---|---|---|---|
| Learner app MVP | 500-2K | 75%+ | Rule-based | 3 weeks | $10K |
| Learner app scale | 5K-20K | 80%+ | Hybrid | 2 months | $25K |
| Government docs | 100-500 | 90%+ | Hybrid + review | 3 months | $60K |
| Publishers | 1K-3K/year | 95%+ | Neural + editorial | 5 months | $90K |
| AI tutoring | 10K+/day | 80%+ | Neural (optimized) | 3 months | $50K |
General Guidelines#
- < 500 texts/month: Manual editing cheaper (unless latency matters)
- 500-5K/month: Rule-based MVP; upgrade to hybrid if quality is limiting
- 5K-20K/month: Hybrid (rules + neural)
- > 20K/month: Full neural with optimization
Accuracy requirements:
- < 80%: Rule-based sufficient
- 80-90%: Hybrid
- > 90%: Neural + human review
Budget constraints:
- < $15K: Rule-based only
- $15K-$40K: Hybrid possible
- > $40K: Full neural viable
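These rules of thumb can be collapsed into a first-pass triage function. A sketch only; the thresholds are this document's guidelines, not hard limits, and the function name is illustrative:

```python
def recommend(volume_per_month, accuracy_need, budget_usd):
    """First-pass approach selection from the volume/accuracy/budget guidelines."""
    if volume_per_month < 500:
        return "manual"  # automation ROI negative below this volume
    if accuracy_need >= 0.90 or volume_per_month > 20_000:
        approach = "neural + human review"
    elif accuracy_need >= 0.80 or volume_per_month > 5_000:
        approach = "hybrid"
    else:
        approach = "rule-based"
    # Budget caps the ambition regardless of the ideal approach
    if budget_usd < 15_000 and approach != "rule-based":
        return "rule-based (budget-constrained)"
    return approach

print(recommend(2_000, 0.75, 12_000))   # rule-based
print(recommend(10_000, 0.85, 30_000))  # hybrid
```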
S4-strategic: Viability & ROI Analysis#
Status: 🚧 PLANNED#
This phase will provide long-term strategic decisions with financial models and risk assessment.
Planned Topics#
1. Build vs Buy Viability#
- 3-year TCO comparison
- Break-even analysis
- Market landscape (commercial solutions, if any)
- Build decision factors
2. Rule-Based vs Neural vs Hybrid ROI#
- Cost models for each approach
- Maintenance costs
- Scaling costs (volume growth)
- Quality improvements over time
3. Break-Even Analysis#
- At what volume does automation pay off vs manual editing?
- At what volume does neural become cheaper than rule-based?
- Fixed costs vs variable costs
4. Risk Assessment#
- Meaning drift: Automated simplification changes meaning
- Quality degradation: Errors accumulate over time
- Segmentation errors: Jieba mistakes cascade
- Context blindness: Rule-based misses context
- Mitigation strategies
5. Team Skills and Hiring#
- What skills are needed for each approach?
- Hiring costs (NLP engineer vs linguist vs ML engineer)
- Training existing team vs hiring specialists
- Consulting vs in-house development
6. Technology Maturity Assessment#
- Current state of Chinese text simplification (2026)
- Projected improvements (2027-2029)
- When to wait vs build now
- Vendor landscape (commercial APIs, if any)
Research Questions#
At what scale does neural become cost-effective?
- Volume thresholds: 1K, 10K, 100K texts/month
- Quality requirements: 70%, 80%, 90%+ accuracy
- Time horizons: 1 year, 3 years, 5 years
What are the risks of automated simplification?
- Meaning drift: 5-10% of sentences change meaning subtly
- Unnatural output: 10-15% sound awkward
- Over-simplification: 5% become too simple (lose nuance)
- Under-simplification: 10-20% remain too complex
- Mitigation: Human review, conservative rules, hybrid approach
When should you wait for better libraries?
- If budget < $10K → wait 1-2 years, libraries may mature
- If volume < 500 texts/month → manual editing may be cheaper
- If accuracy needs > 95% → wait or use hybrid with heavy editing
What’s the competitive advantage of building now?
- First-mover advantage in language learning apps
- Custom domain adaptation (medical, legal)
- Data moat (collect user feedback, improve over time)
Estimated Time#
3-4 hours for strategic analysis and ROI modeling
Deliverables (Planned)#
- build-vs-buy-viability.md: 3-year TCO comparison
- roi-analysis.md: Break-even models, cost scenarios
- risk-assessment.md: What can go wrong, mitigation strategies
- team-skills.md: Hiring and training considerations
- technology-maturity.md: Market state, future outlook
- recommendation.md: FINAL STRATEGIC RECOMMENDATIONS
Status: Outline created, detailed research pending
Next session: Start after S3-need-driven is complete
Final output: Complete 4PS research package with strategic go/no-go decision
S4-strategic: Final Recommendations#
Strategic Go/No-Go Decision#
BUILD NOW (Strong recommendation)#
✅ Build if:
- Volume > 500 texts/month
- Budget > $15K Year 1
- Have mid-level dev + Chinese speaker
- Can tolerate 75-85% accuracy
Expected outcome: 60-80% cost savings vs manual over 3 years
WAIT 2-3 YEARS (Conditional)#
⏸️ Wait if:
- Volume < 300 texts/month (manual cheaper)
- Budget < $10K (can’t build properly)
- Need > 95% accuracy (tech not ready)
- No technical capability
Risk: Competitors build data moats, miss market window
NEVER BUILD (Manual forever)#
❌ Don’t build if:
- Volume < 100 texts/month
- Niche domain (classical Chinese, legal, medical)
- Can’t accept ANY errors (high-stakes publishing)
- Short-term project (< 18 months)
Technology Maturity Assessment (2026)#
Current state:
- ❌ No pip-installable simplification libraries
- ✅ Building blocks mature (jieba, OpenCC, HanLP)
- ✅ Training data available (MCTS: 691K pairs)
- ⚠️ Neural models work but need ML expertise
2-3 year outlook (2028-2029):
- Possible: Turnkey libraries emerge (50% chance)
- Likely: Commercial APIs (Chinese equivalent of Rewordify)
- Certain: Better pre-trained models (easier fine-tuning)
Strategic implication: Early movers (2026-2027) have 2-3 year advantage, but risk is higher
Risk Assessment#
Technical Risks#
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Rule-based plateaus at 70% | 60% | Medium | Plan for hybrid from start |
| Neural meaning drift (5-10%) | 80% | High | Human review on critical content |
| Jieba segmentation errors cascade | 40% | Medium | Custom dictionary, validation |
| HSK coverage drift over time | 30% | Low | Quarterly updates |
Business Risks#
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Volume lower than expected | 40% | High | Start manual, automate at 500/month |
| Accuracy not good enough | 30% | Medium | Hybrid approach (fallback to human) |
| Technology improves rapidly | 50% | Low | Incremental build (not all-in upfront) |
| Competitors build better solution | 30% | Medium | Focus on domain data (your moat) |
Mitigation Strategy#
- Start small: Rule-based MVP ($15K), not full neural ($70K)
- Validate early: 100 test simplifications before full rollout
- Build incrementally: MVP → rules → hybrid → neural (staged)
- Human-in-loop: Review high-stakes content regardless of automation
Recommended Path (Most Teams)#
Phase 1: MVP (Weeks 1-4)#
- Approach: Rule-based
- Investment: $10K-15K
- Goal: 70-75% success rate on 100 test texts
- Decision point: If success rate < 65%, reconsider
Phase 2: Production (Months 2-3)#
- Approach: Harden MVP, add monitoring
- Investment: $5K-10K
- Goal: Handle 1K texts/month reliably
- Decision point: If volume > 5K/month, plan hybrid
Phase 3: Scale (Months 4-6)#
- Approach: Add neural for complex cases (hybrid)
- Investment: $15K-25K
- Goal: 85-90% success rate
- Decision point: If still not good enough, consider full neural
Phase 4: Optimize (Months 7-12)#
- Approach: Fine-tune on YOUR domain data
- Investment: $5K-10K
- Goal: 90%+ success rate, < 500ms latency
- Decision point: Maintenance mode (ongoing curation only)
Total Year 1: $35K-60K
Year 2-3: $5K-10K/year maintenance
Competitive Positioning#
First-Mover Advantages (2026-2027)#
- Data moat: Collect user feedback → improve model
- Market share: Early users sticky (switching costs)
- Learning curve: 2-3 years to get good (others behind)
Window closes: 2028-2029 (when turnkey solutions may emerge)
Defensibility#
Weak defense: Generic HSK simplification (others can replicate)
Strong defense: Domain-specific (medical Chinese, business Chinese, kids’ books)
Recommendation: Build generic MVP, specialize by domain for moat
Team & Skills#
Minimum Viable Team#
Rule-based:
- 1 mid-level Python developer (3 weeks)
- 1 native Chinese speaker for validation (1 week)
- Total: ~$10K-15K
Hybrid:
- Above + 1 ML engineer (2-3 weeks)
- Total: ~$25K-40K
Full neural:
- 1 senior ML engineer (6-8 weeks)
- 1 Chinese linguist (2 weeks)
- Infrastructure engineer (1 week)
- Total: ~$50K-80K
Build vs Hire vs Outsource#
- If you have in-house devs: Build (cheapest)
- If you hire contractors: 2-3x cost multiplier
- If you outsource fully: 3-5x cost, quality risk
Recommendation: Hire 1 full-time if volume > 5K/month, contract otherwise
Final Verdict#
For Language Learning Apps#
✅ BUILD rule-based MVP now (2-4 weeks, $15K)
- ROI positive at 500+ texts/month
- Iterate to hybrid if needed
- Expected savings: 60-80% vs manual
For Government/Accessibility#
⚠️ BUILD hybrid with mandatory review (3 months, $50K)
- Need 90%+ accuracy (automation alone insufficient)
- Auditability critical (use rule-based as primary)
- Expected savings: 30-50% vs manual
For Publishers#
✅ BUILD neural + editorial workflow (4-6 months, $80K)
- Need 95%+ accuracy (editors refine AI output)
- Volume justifies investment (1K+ texts/year)
- Expected savings: 40-60% vs full manual
For AI Tutoring#
✅ BUILD optimized neural (3 months, $50K)
- Volume is high (10K+/day)
- Latency matters (< 500ms)
- Expected ROI: Enables product (not just cost savings)
The Strategic Question#
“Should I build Chinese text simplification in 2026?”
Answer: YES, if volume > 500 texts/month and budget > $15K
The technology is immature but viable. Early movers (2026-2027) will build data moats. Waiting 2-3 years reduces risk but loses competitive advantage.
Start with rule-based MVP (low risk, fast validation). Iterate to neural only if volume and accuracy requirements justify it.
The window is open: Build now (2026-2028) or wait until turnkey solutions exist (2029+).
Research Complete#
This concludes the 4PS research on Chinese Text Simplification Libraries.
Key deliverables:
- S1: Library landscape (no turnkey solutions exist)
- S2: Neural approach viable (mT5 + LoRA on MCTS)
- S3: Use case mapping (rule-based → hybrid → neural)
- S4: Strategic recommendation (BUILD for most teams)
Next steps: Implementation (use S1-S2 as technical guide, S3-S4 for decision support)
ROI Analysis: Build vs Wait vs Manual#
3-Year TCO Comparison#
Scenario: 2K texts/month (typical language learning app)#
| Approach | Year 1 | Year 2 | Year 3 | Total | Notes |
|---|---|---|---|---|---|
| Manual editing | $36K | $36K | $36K | $108K | Baseline |
| Rule-based | $15K | $5K | $5K | $25K | 77% savings |
| Hybrid | $30K | $8K | $8K | $46K | 57% savings, better quality |
| Full neural | $60K | $15K | $15K | $90K | 17% savings, best quality |
Verdict: Rule-based or hybrid (neural not justified at this volume)
Scenario: 10K texts/month (large platform)#
| Approach | Year 1 | Year 2 | Year 3 | Total | Notes |
|---|---|---|---|---|---|
| Manual editing | $180K | $180K | $180K | $540K | Baseline |
| Rule-based | $20K | $5K | $5K | $30K | 94% savings, quality plateau |
| Hybrid | $40K | $10K | $10K | $60K | 89% savings |
| Full neural | $70K | $20K | $20K | $110K | 80% savings, best quality |
Verdict: Hybrid or neural (savings justify investment)
Break-Even Timeline#
Rule-based:
- Payback: 6-9 months at 1K texts/month
- Payback: 3-4 months at 5K texts/month
Hybrid:
- Payback: 12-18 months at 2K texts/month
- Payback: 6-8 months at 10K texts/month
Neural:
- Payback: 18-24 months at 5K texts/month
- Payback: 8-12 months at 20K texts/month
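These payback periods follow from comparing cumulative automation cost against cumulative manual-editing cost. A sketch, where the development cost, monthly automation cost, and per-text manual cost are illustrative assumptions rather than measured figures:

```python
def break_even_month(dev_cost, monthly_auto, texts_per_month, manual_cost_per_text):
    """First month where cumulative automation cost drops to or below
    cumulative manual cost. Returns None if never within 5 years."""
    manual_monthly = texts_per_month * manual_cost_per_text
    if monthly_auto >= manual_monthly:
        return None  # automation never cheaper at this volume
    for month in range(1, 61):
        if dev_cost + month * monthly_auto <= month * manual_monthly:
            return month
    return None

# Rule-based at 1K texts/month, assuming ~$3/text manual cost and
# ~$400/month automated running cost (both hypothetical inputs):
print(break_even_month(dev_cost=15_000, monthly_auto=400,
                       texts_per_month=1_000, manual_cost_per_text=3.0))  # 6
```

Changing the per-text manual cost or the volume shifts the result, which is why the payback ranges above vary with scale.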
Risk-Adjusted ROI#
Optimistic (90th percentile)#
- Development faster than expected
- Quality better than expected
- Maintenance costs lower
Result: ROI +50% better
Realistic (50th percentile)#
- Numbers as stated above
- Some rework needed
- Expected maintenance
Result: ROI as modeled
Pessimistic (10th percentile)#
- Development 2x longer
- Quality requires more iteration
- Hidden maintenance costs
Result: ROI -50%, may not break even until Year 2
Mitigation: Start with rule-based MVP (lower risk, faster validation)
Competitive Advantage Analysis#
Build Now (2026)#
Advantages:
- First-mover in nascent market
- Collect user feedback data → improve model
- Data moat (your domain-specific corpus)
- Control over quality/latency
Disadvantages:
- Technology immature (no turnkey libraries)
- Must build custom solution
- Ongoing maintenance burden
Wait 2-3 Years (2028-2029)#
Advantages:
- Mature libraries may emerge
- Commercial APIs possible (like English has)
- Learn from others’ mistakes
- Lower development cost
Disadvantages:
- Competitors already have 2-3 years of data
- Miss early market opportunity
- May still need custom solution (libraries might not fit your use case)
Never Build (Manual Forever)#
Advantages:
- No technical risk
- Editors can handle edge cases
- Quality ceiling is higher
Disadvantages:
- 5-10x more expensive at scale
- Can’t scale to 100K+ texts/month
- Latency (humans need hours/days)
Strategic Decision Framework#
BUILD NOW if:#
- Volume > 500 texts/month (automation ROI positive)
- Need latency < 1 hour (humans too slow)
- Have budget ($15K+ Year 1)
- Technical capability (mid-level dev + Chinese speaker)
WAIT 2-3 YEARS if:#
- Volume < 500 texts/month (manual cheaper)
- Budget < $10K (can’t build properly)
- No technical team (can’t maintain)
- Accuracy needs > 95% (technology not ready)
MANUAL FOREVER if:#
- Volume < 100 texts/month
- High-stakes content (legal, medical) where errors unacceptable
- Domain too niche (no training data exists)
Investment Priorities#
If budget is $15K (rule-based):
- $8K: Development (2-3 weeks, mid-level dev)
- $3K: HSK vocabulary + synonym dictionary curation
- $2K: Testing + validation (50-100 samples)
- $2K: Deployment + infrastructure
If budget is $40K (hybrid):
- $15K: Rule-based foundation
- $10K: Neural model training + integration
- $8K: Testing + human evaluation
- $7K: Infrastructure + monitoring
If budget is $70K (full neural):
- $25K: Model training (mT5/mBART on MCTS)
- $15K: Data preparation + fine-tuning
- $12K: Evaluation + iteration
- $10K: Production deployment
- $8K: Infrastructure (GPU inference)
Hidden Costs to Budget#
Ongoing curation (10-20% of Year 1 cost annually)
- HSK vocabulary updates (3.0 migration)
- New slang, technical terms
- User-reported errors
Infrastructure scaling
- 10K → 100K texts/month: 10x compute cost
- Budget $500-2K/month for hosting at scale
Quality drift
- Models degrade over time (language evolves)
- Re-train every 12-18 months (~$5K-10K)
Support & monitoring
- On-call for failures
- Debugging edge cases
- A/B testing improvements
Total ongoing: 20-30% of Year 1 cost per year
Scenarios Where ROI is Negative#
- Volume too low: < 300 texts/month (manual cheaper)
- Accuracy too high: Need 98%+ (humans required anyway)
- No technical team: Outsource development + maintenance = 3x cost
- Domain too niche: Legal Chinese, classical Chinese (no training data)
- Short-term project: < 18 months (won’t reach break-even)
Expected Value Calculation#
Language learning app scenario (2K texts/month):
| Outcome | Probability | Cost (3yr) | Savings vs Manual | Expected Value |
|---|---|---|---|---|
| Success (rule-based works) | 70% | $25K | $83K | +$58K |
| Partial (need hybrid) | 25% | $46K | $62K | +$16K |
| Failure (revert to manual) | 5% | $15K + $108K | -$15K | -$1K |
| Expected | 100% | | | +$73K |
Verdict: Strong positive expected value (build rule-based MVP)
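The expected value in the table is a straight probability-weighted sum of the savings column, which can be reproduced directly:

```python
# Outcomes from the table above: (label, probability, 3-year savings vs manual, $K)
outcomes = [
    ("success (rule-based works)", 0.70,  83),
    ("partial (need hybrid)",      0.25,  62),
    ("failure (revert to manual)", 0.05, -15),
]
ev = sum(p * savings for _, p, savings in outcomes)
print(f"Expected value: +${ev:.0f}K")  # +$73K
```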
Strategic Recommendations#
- Most teams: Start rule-based, iterate to hybrid if needed
- Large platforms: Go straight to hybrid (skip learning phase)
- Publishers: Neural + editorial (quality matters most)
- Startups: Wait until PMF, then automate (manual until 500 texts/month)
The mistake: Jumping to neural too early (before you understand the problem) The opportunity: Building now while market is nascent (2026-2028 window)