1.154.1 Chinese Text Simplification Libraries#
Comprehensive analysis of Chinese text simplification libraries and approaches. Covers foundational NLP libraries (segmentation, conversion), training datasets, and implementation strategies for automated text simplification targeting different HSK proficiency levels. Reveals no turnkey solutions exist; teams must build custom hybrid stacks.
Explainer
Chinese Text Simplification Libraries#
What This Solves#
Imagine you’re running a Chinese learning app with thousands of authentic news articles, but your HSK 3 learners keep hitting walls of vocabulary they don’t know. You could manually rewrite articles (expensive, slow), or you could automatically simplify them—replacing difficult words with easier synonyms, shortening complex sentences, and removing advanced grammar patterns.
Chinese Text Simplification (CTS) solves this fundamental problem: automatically converting complex Chinese text into simpler versions that match a target proficiency level. It takes any Chinese text and rewrites it to be comprehensible at HSK 2, 3, 4, etc.
This differs from readability analysis (which just measures difficulty) by actually transforming the text. It’s the difference between a thermometer (analysis) and an air conditioner (simplification).
This matters to three groups:
- Language learning platforms need to offer graded content at scale (can’t hire editors to simplify thousands of articles)
- Accessibility services need to make government documents, healthcare info, and public services readable for lower-literacy Chinese readers
- Content creators need tools to adapt writing for different audience proficiency levels (textbooks, news sites, technical documentation)
Without automated simplification, these groups resort to manual rewriting (expensive, inconsistent) or avoid complex content entirely (limiting educational value).
Accessible Analogies#
Think of Chinese text simplification like a recipe adapter:
- Original recipe (advanced): “Julienne the carrots, create a mirepoix base, deglaze with Shaoxing wine”
- Simplified recipe (beginner): “Cut the carrots into thin strips, cook onions and carrots together, add rice wine to the pan”
The recipe adapter:
- Replaces fancy cooking terms with simple descriptions
- Breaks complex steps into smaller ones
- Uses common ingredients instead of specialized ones
- Keeps the same end result (dish is the same, instructions are clearer)
Chinese text simplification does the same:
- Replaces advanced vocabulary with HSK-level equivalents
- Splits long compound sentences into shorter ones
- Removes or replaces idioms (成语) with plain explanations
- Keeps the same meaning (content is preserved, expression is simplified)
Another angle: Like a translation app, but instead of translating between languages, it “translates” from advanced Chinese to beginner Chinese. Both preserve meaning while changing form.
The challenge: Unlike English where you can often just swap “utilize” → “use”, Chinese has:
- No spaces between words (need segmentation first)
- Multiple ways to express the same concept with different difficulty levels
- Idioms that can’t be translated word-for-word
- Sentence structures that require complete restructuring, not just word replacement
This is why off-the-shelf NLP libraries don’t work—you need specialized Chinese text simplification tools.
When You Need This#
You definitely need this if:
- You run a language learning platform and want to offer graded readers automatically (“here’s today’s news at HSK 3 level”)
- You’re building accessibility tools for government or public services (simplified documents for low-literacy readers)
- You create educational materials and need to generate multiple difficulty versions of the same content
- You manage a news site offering “Easy Chinese” versions and want to automate the simplification pipeline
You probably need this if:
- You’re building AI tutoring systems that need to adjust explanation complexity to learner level
- You’re researching second-language acquisition and need controlled text difficulty
- You’re developing translation tools with simplification as a post-processing step
You DON’T need this if:
- You only need to measure readability (use HSK-Character-Profiler or similar instead)
- You only need Traditional/Simplified conversion (use OpenCC instead)
- You’re working with native speakers who don’t need simplification
- You have manual editorial resources and small volume (< 100 texts/month)
The decision hinges on: Are you transforming complex text to simpler versions at scale? If yes, you need simplification. If you just need to know “is this HSK 3?”, use analysis tools instead.
Trade-offs#
Rule-Based vs Neural Network#
Rule-based simplification (word replacement + sentence splitting):
- ✅ Fast (milliseconds per text)
- ✅ Predictable output (same input always gives same result)
- ✅ Easy to debug (you can see which rules fired)
- ✅ Requires no training data
- ❌ Limited to vocabulary substitution (can’t restructure complex sentences well)
- ❌ Struggles with idioms, metaphors, context-dependent meaning
- ❌ Needs manually curated synonym dictionaries at each HSK level
Neural network approach (seq2seq, transformer models):
- ✅ Can restructure sentences creatively (not just word swaps)
- ✅ Handles idioms and context better
- ✅ Improves with more training data
- ❌ Slower (seconds per text on CPU)
- ❌ Unpredictable (might generate fluent but incorrect simplifications)
- ❌ Requires large parallel corpora (complex ↔ simple sentence pairs)
- ❌ Hard to control output level (can’t guarantee “exactly HSK 3”)
Current reality: Most production systems use rule-based as foundation + neural for specific hard cases. Pure neural is still research-grade.
Character-Level vs Word-Level#
Character-level substitution:
- ✅ Simpler implementation (no word segmentation needed)
- ✅ Aligns with HSK character lists
- ❌ Breaks compound words (replacing 研 in 研究 changes meaning)
- ❌ Misses multi-word expressions that need to be replaced as units
Word-level substitution:
- ✅ Preserves compound word integrity
- ✅ Can handle multi-word idioms
- ❌ Requires accurate segmentation (jieba is ~95% accurate, errors cascade)
- ❌ More complex to implement
Hybrid approach (recommended): Segment into words, simplify at word level, validate at character level.
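A minimal sketch of the hybrid approach, assuming toy HSK data: words arrive pre-segmented (jieba would supply them in practice), replacement happens at word level, and a character-level pass validates the output. All dictionary entries and the helper names are illustrative, not a real library API.

```python
# Toy HSK-level data (hypothetical entries; real systems load full HSK lists)
WORD_LEVEL = {"喜欢": 1, "热爱": 5, "学习": 1}
SYNONYMS = {"热爱": "喜欢"}   # advanced word -> simpler equivalent
CHAR_LEVEL = {"我": 1, "喜": 1, "欢": 1, "学": 1, "习": 1}

def simplify_words(words, target_level):
    """Word-level pass: replace words above target when a synonym exists."""
    out = []
    for w in words:
        if WORD_LEVEL.get(w, 99) > target_level and w in SYNONYMS:
            out.append(SYNONYMS[w])
        else:
            out.append(w)
    return out

def validate_chars(text, target_level):
    """Character-level check: every character at or below the target level."""
    return all(CHAR_LEVEL.get(ch, 99) <= target_level for ch in text)

words = ["我", "热爱", "学习"]   # pre-segmented (jieba.cut would supply this)
simplified = simplify_words(words, target_level=2)
print("".join(simplified))                       # 我喜欢学习
print(validate_chars("".join(simplified), 2))    # True
```

The two passes catch different failures: word-level replacement keeps compounds intact, while the character check flags any replacement that accidentally introduced hard characters.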
Build vs Dataset#
Build from scratch (your own rules + dictionaries):
- ✅ Full control over simplification strategy
- ✅ Can customize for your domain (medical, legal, etc.)
- ❌ Requires linguistic expertise
- ❌ Months of development time
- ❌ Need native speaker validation
Use research datasets (MCTS, parallel corpora):
- ✅ Training data already exists (691K+ sentence pairs)
- ✅ Can train neural models if you have ML expertise
- ❌ Datasets are academic (news text, not your domain)
- ❌ Still need to build the actual simplification pipeline
- ❌ No turnkey solution (MCTS is data, not a library)
Use existing libraries (currently limited options):
- ✅ Fastest time to value (if they exist)
- ❌ Reality check: There are very few production-ready pip-installable CTS libraries as of 2026
- ❌ Most published work is research code, not maintained libraries
Cost Considerations#
Research approach (use MCTS dataset + train your own model):
- Dataset: Free (open source)
- Model training: $100-500 in GPU time (if using cloud)
- Development: $20K-$50K (2-4 months, requires ML expertise)
- Hosting: $50-200/month for inference
- Year 1 total: ~$25K-$60K
- Only makes sense for large-scale platforms (> 10K texts/month)
Rule-based DIY:
- Development: $10K-$30K (1-3 months, requires NLP + Chinese expertise)
- Hosting: $20-50/month (runs in your app)
- HSK vocabulary lists: Free (open source)
- Synonym dictionaries: Build manually or scrape ($2K-$5K)
- Year 1 total: ~$12K-$35K
- Sweet spot for mid-sized platforms (1K-10K texts/month)
Hybrid approach (jieba + OpenCC + HSK-Character-Profiler + custom rules):
- Integration: $5K-$15K (2-4 weeks)
- Use existing libraries for segmentation, conversion, analysis
- Build custom simplification logic on top
- Year 1 total: ~$7K-$18K
- Most practical option for MVP
Commercial API (if they existed):
- Would cost ~$5-20 per 1K simplifications
- None currently available as of 2026 for Chinese
- English has services (Rewordify, TextCompactor), Chinese market is nascent
Manual editing (baseline comparison):
- Professional editor: $0.10-$0.30 per sentence
- At 1,000 texts/month (avg 20 sentences): $2K-$6K/month
- Year 1: $24K-$72K
- Break-even: Automation pays off at > 100 texts/month
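The break-even claim above can be checked with the section's own figures. This back-of-envelope sketch uses the optimistic end: $0.30 per sentence for manual editing, ~20 sentences per text, against roughly $7K year-1 cost for the hybrid build.

```python
# Break-even sketch using the figures from the cost section above.
def manual_cost_per_year(texts_per_month, cost_per_sentence=0.30, sentences=20):
    return texts_per_month * sentences * cost_per_sentence * 12

BUILD_COST = 7_000  # low end of the hybrid-stack year-1 estimate

for volume in (50, 100, 250):
    manual = manual_cost_per_year(volume)
    winner = "automation" if manual > BUILD_COST else "manual"
    print(f"{volume} texts/month: manual ≈ ${manual:,.0f}/yr → {winner}")
```

At 100 texts/month the manual cost (~$7,200/yr) already edges past the cheapest build, which is where the ">100 texts/month" break-even comes from; at the pessimistic end ($0.10/sentence, $18K build) break-even moves to several hundred texts/month.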
Implementation Reality#
First 30 Days#
Week 1: Set up infrastructure with existing libraries:
- Install jieba for segmentation
- Install OpenCC for Traditional/Simplified handling
- Install HSK-Character-Profiler for difficulty measurement
- Build simple word replacement pipeline with HSK vocabulary lists
Weeks 2-4: Build and test simplification rules:
- Create synonym dictionaries at each HSK level
- Implement sentence splitting for long sentences (> 20 characters)
- Test on sample texts, measure with human evaluators
- Deploy basic API endpoint
You’ll have a working prototype that can simplify ~70% of sentences (the easy cases).
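The week 2-4 sentence-splitting step can be sketched as a punctuation-based rule. This is the naive rule-based baseline (real systems would use dependency parsing to pick split points); the 20-character threshold follows the list above, and the example sentence is illustrative.

```python
import re

def split_long_sentence(sentence, max_len=20):
    """Split a long sentence at major Chinese clause punctuation (，；)."""
    if len(sentence) <= max_len:
        return [sentence]
    clauses = [c for c in re.split(r"[，；]", sentence.rstrip("。")) if c]
    # Re-punctuate each clause as a standalone sentence
    return [c + "。" for c in clauses]

long_sentence = "他昨天去了图书馆，借了三本关于中国历史的书，然后在咖啡馆看了一个下午。"
for part in split_long_sentence(long_sentence):
    print(part)
```

Note the limitation: clause-level splitting often drops subjects (the second and third clauses above have no explicit 他), which is exactly the kind of case that needs the restructuring logic neural approaches handle better.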
What Actually Takes Time#
- Synonym dictionary curation: Finding “simple equivalents” for 2,500+ HSK 6 words takes weeks of linguistic work
- Context handling: many words are polysemous—“花” can mean “flower” or “to spend (money/time)” depending on context—rules alone won’t catch this
- Idiom treatment: 成语 (4-character idioms) need special handling (replace whole unit, not individual characters)
- Quality validation: Need native speakers to verify simplifications don’t change meaning
- Edge cases: Names, numbers, technical terms, internet slang—each needs special rules
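The idiom point above (replace the whole 成语, never individual characters) can be sketched with a gloss dictionary. The glosses here are illustrative; a production dictionary needs thousands of curated entries.

```python
# Hypothetical idiom -> plain-language gloss entries
IDIOM_GLOSSES = {
    "画蛇添足": "做多余的事",            # "draw a snake, add feet" -> do something unnecessary
    "一举两得": "做一件事得到两个好处",   # one action, two gains
}

def replace_idioms(words, glosses=IDIOM_GLOSSES):
    """Replace each segmented idiom as a unit, never character by character."""
    return [glosses.get(w, w) for w in words]

# Pre-segmented input (jieba usually keeps common idioms as single tokens)
words = ["这样", "做", "是", "画蛇添足"]
print("".join(replace_idioms(words)))   # 这样做是做多余的事
```

This only works if segmentation keeps the idiom intact, which is another reason to load idiom lists into the segmenter's custom dictionary.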
Common Pitfalls#
- Over-simplifying: Replacing every HSK 5 word breaks flow (e.g., swapping “intelligent” for “very smart” turns “the very intelligent person” into the awkward “the very very smart person”)
- Meaning drift: Synonyms aren’t perfect (“老师” (teacher) → “先生” (Mr./teacher) shifts formality)
- Segmentation errors: Jieba mistakes cascade (if it segments wrong, replacements break)
- No ground truth: Unlike translation (many references exist), Chinese simplification has limited parallel corpora for validation
Team Skills Required#
- Rule-based MVP: Mid-level Python dev + native Chinese speaker for validation (2 people, 1 month)
- Production system: Senior NLP engineer + Chinese linguist + QA (3 people, 3 months)
- Neural approach: ML engineer + data scientist + Chinese linguist (3 people, 6+ months)
Realistic Expectations#
You’ll achieve:
- 70-80% of sentences simplified successfully (word replacements work)
- 15-20% need manual review (complex restructuring)
- 5-10% fail or degrade quality (idioms, context errors)
This is good enough for assistive tools (human reads final output). Not good enough for fully automated publishing (needs editorial review).
The technology is nascent compared to English: While English text simplification has commercial solutions, Chinese is still largely a research problem with limited production-ready libraries. Most organizations build custom solutions.
Library Landscape (2026)#
Key distinction: This research focuses on LIBRARIES (pip-installable, production-ready), not research datasets or one-off scripts.
Current state:
- ✅ Analysis libraries are mature (HSK-Character-Profiler, HanLP)
- ✅ Utility libraries are solid (jieba, OpenCC)
- ❌ Actual simplification libraries are sparse (mostly research code, not production libraries)
Most teams combine existing analysis/utility libraries with custom simplification logic.
What you’ll find in S1-rapid: Inventory of available libraries and their actual capabilities for text simplification.
S1: Rapid Discovery
S1-rapid: Chinese Text Simplification Libraries#
Quick Summary#
Key Finding: As of 2026, there are NO mature, pip-installable libraries specifically designed for Chinese text simplification. The landscape consists of:
- Research datasets (MCTS) - training data, not production libraries
- Analysis libraries (HSK-Character-Profiler, HanLP) - measure difficulty, don’t transform text
- Utility libraries (jieba, OpenCC) - building blocks for simplification, but you must write the logic yourself
Reality check: Unlike English (which has Rewordify, TextCompactor), Chinese text simplification is still mostly a DIY endeavor combining multiple libraries.
Recommended approach: Hybrid stack using jieba (segmentation) + HSK-Character-Profiler (analysis) + OpenCC (conversion) + custom simplification rules.
Libraries Inventory#
1. Text Simplification (Direct)#
MCTS (Multi-Reference Chinese Text Simplification)
- Type: Research dataset + evaluation scripts
- Not a library: Provides training data (691K+ parallel sentences), not a pip-installable tool
- GitHub: https://github.com/blcuicall/mcts
- Use case: Train your own neural simplification model
- Limitation: Requires ML expertise, months of development
chinese-comprehension
- Type: Analysis tool (not simplification)
- GitHub: https://github.com/Destaq/chinese-comprehension
- What it does: Analyzes text against your known words
- What it doesn’t do: Doesn’t transform text, just gauges difficulty
- Install: Clone the repo, then `pip install -r requirements.txt` (not on PyPI)
Verdict: No direct text simplification libraries exist on PyPI.
2. Analysis Libraries (Measure Difficulty)#
HSK-Character-Profiler
- Purpose: Analyze text readability based on HSK levels
- GitHub: https://github.com/Ancastal/HSK-Character-Profiler
- Pip: Not on PyPI, clone and run
- Features: Character proficiency analysis, text readability scoring
- For simplification: Use to verify output difficulty after simplification
- Status: Active (2024-2025)
Language-Analyzer
- Purpose: Multi-language text analysis including HSK profiling
- GitHub: https://github.com/Ancastal/Language-Analyzer
- Features: HSK profiling, readability analysis
- Limitation: Analysis only, not transformation
3. NLP Foundations (Building Blocks)#
jieba (结巴分词)
- Purpose: Chinese text segmentation (word splitting)
- PyPI: `pip install jieba`
- GitHub: https://github.com/fxsjy/jieba
- Essential for: Splitting unsegmented Chinese into words before simplification
- Accuracy: ~95% for general text
- Status: Mature, widely used
- Modes:
- Accurate Mode: Best for text analysis
- Full Mode: Gets all possible words (overlapping)
- Search Engine Mode: Cuts long words into short words
HanLP 2.x
- Purpose: Comprehensive NLP platform (deep learning)
- PyPI: `pip install hanlp`
- GitHub: https://github.com/hankcs/HanLP
- Features: Segmentation, POS tagging, NER, parsing, SRL
- For simplification: Segmentation + POS tagging to identify replaceable words
- Requires: Python 3.6+, PyTorch/TensorFlow
- Status: Active (latest release Oct 2025)
PyHanLP
- Purpose: Python wrapper for HanLP 1.x (traditional algorithms)
- PyPI: `pip install pyhanlp`
- GitHub: https://github.com/hankcs/pyhanlp
- Lighter than: HanLP 2.x (no deep learning overhead)
- Status: Active (latest Jan 2025)
LTP (Language Technology Platform)
- Purpose: Multi-task Chinese NLP platform
- PyPI: `pip install ltp`
- GitHub: https://github.com/HIT-SCIR/ltp
- Features: Segmentation, POS, NER, dependency parsing, SDP
- Architecture: Shared pre-trained model across tasks (efficient)
- For simplification: Dependency parsing to identify complex sentence structures
- Status: Research-grade (N-LTP from 2020)
chinese (Chinese Text Analyzer)
- PyPI: `pip install chinese`
- GitHub: https://github.com/morinokami/chinese
- Features: Tokenization, pinyin conversion, definitions
- Uses: jieba or pynlpir for tokenization
- Supports: Simplified and Traditional Chinese
- Limitation: Analysis/conversion, not simplification
4. Character Conversion#
OpenCC (Open Chinese Convert)
- Purpose: Traditional ↔ Simplified conversion
- PyPI: `pip install OpenCC` or `pip install opencc-python-reimplemented`
- GitHub: https://github.com/BYVoid/OpenCC
- Conversion modes:
- s2t: Simplified → Traditional
- t2s: Traditional → Simplified
- s2tw: Simplified → Traditional (Taiwan standard)
- s2hk: Simplified → Traditional (Hong Kong standard)
- For simplification: Pre-process text to normalized form (Simplified)
- Status: Mature, actively maintained
hanziconv
- PyPI: `pip install hanziconv`
- GitHub: https://github.com/berniey/hanziconv
- Purpose: Simpler Traditional/Simplified conversion
- Lighter than: OpenCC (pure Python)
hanzidentifier
- PyPI: `pip install hanzidentifier`
- GitHub: https://github.com/tsroten/hanzidentifier
- Purpose: Detect if text is Traditional or Simplified
- Use before: Conversion (know what you’re working with)
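A minimal stdlib sketch of the detect-before-convert idea. In practice hanzidentifier does the detection properly and OpenCC the conversion; this toy version just checks a handful of characters whose forms differ between the two scripts.

```python
# Tiny marker sets: characters written differently in each script
SIMPLIFIED_ONLY = set("国爱书车门见")
TRADITIONAL_ONLY = set("國愛書車門見")

def detect_script(text):
    """Crude script detection from marker characters."""
    if any(ch in TRADITIONAL_ONLY for ch in text):
        return "traditional"
    if any(ch in SIMPLIFIED_ONLY for ch in text):
        return "simplified"
    return "unknown"   # only shared-form characters seen

print(detect_script("我愛讀書"))   # traditional
print(detect_script("我爱读书"))   # simplified
```

The "unknown" branch is the interesting case: many characters are identical in both scripts, so short strings genuinely cannot be classified, which is why a real detector works over the whole text.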
5. Vocabulary Tools#
HSK 3.0 Lists
- GitHub: https://github.com/krmanik/HSK-3.0
- Purpose: Official HSK vocabulary lists (levels 1-9, the 2021 HSK 3.0 standard)
- Use for: Building synonym dictionaries at each level
- Format: Character lists, word lists
TOCFL Vocabulary
- GitHub: https://github.com/skishore/inkstone/pull/47
- Purpose: Taiwan TOCFL standards (Traditional Chinese)
- Use for: Simplification targeting TOCFL levels
6. Research Tools (Not Production Libraries)#
CTAP (Common Text Analysis Platform)
- Type: Research tool (196 linguistic features)
- Not pip-installable: Research implementation
- Paper: https://aclanthology.org/2022.lrec-1.592.pdf
- Use for: Feature extraction for ML models
CRIE (Chinese Readability Index Explorer)
- Type: Research system (82 features + SVM)
- Not publicly available: Academic tool
- Paper: https://link.springer.com/article/10.3758/s13428-015-0649-1
What’s Missing#
What you can’t pip install (as of 2026):
- ❌ Complete Chinese text simplification library
- ❌ HSK-aware synonym replacement engine
- ❌ Sentence simplification (complex → simple restructuring)
- ❌ Idiom simplification (成语 handling)
- ❌ Pre-trained neural simplification models (easy to load and use)
What you must build yourself:
- Synonym dictionaries mapping HSK 6 words → HSK 3 equivalents
- Sentence splitting logic (identify and split complex sentences)
- Simplification rules engine
- Quality validation pipeline
Recommended Stack#
For building a Chinese text simplification system:
```shell
# Install these libraries
pip install jieba                            # Segmentation
pip install opencc-python-reimplemented      # Conversion
pip install hanlp                            # Optional: advanced NLP features

# Clone these repos
git clone https://github.com/Ancastal/HSK-Character-Profiler
git clone https://github.com/krmanik/HSK-3.0
git clone https://github.com/blcuicall/mcts  # If training neural models
```
Then build:
- Segmentation pipeline (jieba)
- HSK level analyzer (HSK-Character-Profiler)
- Custom simplification logic:
- Word replacement (HSK vocabulary)
- Sentence splitting
- Idiom handling
- Quality validation
Time Estimates#
| Approach | Time | Complexity |
|---|---|---|
| Rule-based MVP | 2-4 weeks | Mid-level Python + Chinese speaker |
| Production system | 2-3 months | Senior NLP engineer + linguist |
| Neural model | 4-6 months | ML engineer + data scientist |
S1-rapid Conclusion#
The reality: Chinese text simplification is a BUILD, not BUY problem in 2026.
Unlike mature NLP tasks (segmentation, POS tagging) with turnkey libraries, simplification requires:
- Custom logic combining multiple libraries
- Domain expertise (Chinese linguistics + NLP)
- Iterative testing with native speakers
Next steps:
- S2-comprehensive: Dive into MCTS dataset, neural approaches, feature engineering
- S3-need-driven: Map use cases to implementation strategies
- S4-strategic: Build vs buy analysis, ROI models
Sources#
- MCTS: A Multi-Reference Chinese Text Simplification Dataset
- MCTS GitHub Repository
- chinese-comprehension GitHub
- HSK-Character-Profiler
- jieba PyPI
- HanLP Documentation
- OpenCC PyPI
- LTP GitHub
- HSK 3.0 Lists
Foundational Libraries for Chinese Text Simplification#
These libraries don’t perform text simplification directly, but they’re essential building blocks for any simplification system.
jieba (结巴分词) - Chinese Text Segmentation#
Purpose: Split continuous Chinese text into words
Why needed: Chinese has no spaces between words—you must segment before processing
Installation#
```shell
pip install jieba
```
Basic Usage#
```python
import jieba

text = "我爱自然语言处理"
words = jieba.cut(text)        # Returns a generator
print(" / ".join(words))       # Output: 我 / 爱 / 自然语言 / 处理
```
Segmentation Modes#
1. Accurate Mode (Default)#
```python
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
# Output: 我 / 来到 / 北京 / 清华大学
```
- Best for text analysis and NLP tasks
- Attempts most accurate segmentation
- Use for text simplification
2. Full Mode#
```python
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
# Output: 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学
```
- Finds all possible words (overlapping)
- Fast but not accurate
- Not recommended for simplification
3. Search Engine Mode#
```python
seg_list = jieba.cut_for_search("我来到北京清华大学")
# Output: 我 / 来到 / 北京 / 清华 / 华大 / 大学 / 清华大学
```
- Cuts long words into shorter segments
- Good for search indexing
- Not ideal for simplification
Custom Dictionaries#
Add domain-specific words jieba might miss:
```python
jieba.load_userdict("custom_words.txt")
# Format: word frequency part_of_speech
# 台中 100 n
# 云计算 50 n
```
Critical for simplification: Add HSK vocabulary to ensure consistent segmentation
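Building on the userdict format above, a small sketch that generates such a file from an HSK word list so HSK vocabulary segments as consistent units. The word list and frequency values are illustrative.

```python
import tempfile, os

# Generate a jieba user-dictionary file: "word frequency part_of_speech" per line
hsk_words = ["自然语言", "人工智能", "云计算"]   # illustrative entries

def write_userdict(words, path, freq=100, pos="n"):
    with open(path, "w", encoding="utf-8") as f:
        for w in words:
            f.write(f"{w} {freq} {pos}\n")

path = os.path.join(tempfile.gettempdir(), "hsk_userdict.txt")
write_userdict(hsk_words, path)
# Afterwards: jieba.load_userdict(path)
```

Regenerating the file from your canonical HSK lists (rather than hand-editing it) keeps segmentation and your difficulty dictionaries from drifting apart.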
Accuracy#
- General text: ~95% accuracy
- Domain-specific: May need custom dictionary (medical, legal, technical)
- Errors cascade: Wrong segmentation → wrong word replacements
Performance#
- Speed: Very fast (millions of characters per second)
- Memory: ~100MB with full dictionary loaded
- Language support: Simplified and Traditional Chinese
Role in Simplification Pipeline#
```python
# 1. Segment text
import jieba

text = "这个句子包含一些复杂的词汇"
words = list(jieba.cut(text))

# 2. Identify words to simplify (you build this logic)
for word in words:
    if word_difficulty(word) > target_hsk_level:
        simplified_word = find_synonym(word, target_hsk_level)
        # Replace word
```
Links:
- GitHub: https://github.com/fxsjy/jieba
- PyPI: https://pypi.org/project/jieba/
OpenCC - Traditional/Simplified Conversion#
Purpose: Convert between Traditional and Simplified Chinese variants
Why needed: Normalize text before simplification (most resources use Simplified)
Installation#
Option 1: Official binding (requires C++ library)
```shell
pip install OpenCC
```
Option 2: Pure Python (easier, no dependencies)
```shell
pip install opencc-python-reimplemented
```
Basic Usage#
```python
from opencc import OpenCC

# Initialize converter
cc = OpenCC('s2t')   # Simplified to Traditional

# Convert
text = "开放中文转换"
traditional = cc.convert(text)
print(traditional)   # Output: 開放中文轉換
```
Conversion Modes#
| Mode | Description | Example |
|---|---|---|
| `s2t` | Simplified → Traditional | 中国 → 中國 |
| `t2s` | Traditional → Simplified | 中國 → 中国 |
| `s2tw` | Simplified → Taiwan Standard | 鼠标 → 滑鼠 |
| `s2hk` | Simplified → Hong Kong Standard | 信息 → 資訊 |
| `t2tw` | Traditional → Taiwan Standard | 鼠標 → 滑鼠 |
| `t2hk` | Traditional → Hong Kong Standard | 資訊 → 資訊 |
Regional variants matter: Taiwan and Hong Kong use different vocabulary beyond character conversion.
Advanced Features#
Character-Level Conversion#
- One-to-one character mapping (mostly)
- Handles variant forms (異體字)
Phrase-Level Conversion#
- Multi-character expressions converted as units
- Example: “计算机” (computer, Mainland) → “電腦” (Taiwan)
Regional Idioms#
- Idioms converted to regional equivalents
- “鼠标” (mouse, Mainland) → “滑鼠” (Taiwan)
Role in Simplification Pipeline#
Pre-processing:
```python
from opencc import OpenCC

# Normalize to Simplified (most HSK resources use Simplified)
converter = OpenCC('t2s')
text = "這是傳統中文"
simplified = converter.convert(text)
# Now use the simplified version for HSK analysis and simplification
```
Post-processing (if targeting Traditional Chinese learners):
```python
# After simplification, convert back to Traditional
converter = OpenCC('s2t')
output = converter.convert(simplified_text)
```
Accuracy#
- Character conversion: Near 100% for common characters
- Regional vocabulary: Good coverage, but not exhaustive
- Context: Character-level conversion can miss nuances
Example issue:
- “后” can mean “后面” (back) or “皇后” (empress)
- Traditional: “後” (back) vs “后” (empress)
- OpenCC uses phrase dictionaries to handle most cases
Performance#
- Very fast (millions of characters per second)
- Minimal memory footprint
- Thread-safe
Links:
- GitHub: https://github.com/BYVoid/OpenCC
- PyPI (official): https://pypi.org/project/OpenCC/
- PyPI (pure Python): https://pypi.org/project/opencc-python-reimplemented/
HanLP - Comprehensive NLP Platform#
Purpose: Multi-task Chinese NLP (segmentation, POS, NER, parsing, etc.)
Why useful: Advanced linguistic analysis for sophisticated simplification
Installation#
```shell
pip install hanlp
```
Requirements: Python 3.6+, PyTorch or TensorFlow 2.x
Basic Usage#
```python
import hanlp

# Load pre-trained model (downloads on first use)
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)

# Analyze text
text = "我爱自然语言处理"
result = HanLP(text)
print(result)
# Output includes: tokens, POS tags, NER, dependency parse, etc.
```
Key Features for Simplification#
1. Word Segmentation#
```python
tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
tokens = tokenizer("我爱自然语言处理")
# Alternative to jieba with a different algorithm
```
2. Part-of-Speech Tagging#
```python
# Included in the multi-task model
result = HanLP(text)
pos_tags = result['pos']
# Identify nouns, verbs, adjectives to simplify
```
Use case: Only simplify content words (nouns, verbs), not function words
3. Named Entity Recognition#
```python
result = HanLP(text)
entities = result['ner']
# Identify people, places, organizations
# DON'T simplify proper nouns
```
4. Dependency Parsing#
```python
result = HanLP(text)
deps = result['dep']
# Understand sentence structure
# Identify complex subordinate clauses to split
```
Use case: Find sentences with deep syntactic trees → candidates for splitting
5. Semantic Role Labeling (SRL)#
```python
result = HanLP(text)
srl = result['srl']
# Identify who did what to whom
# Restructure passive → active voice
```
HanLP 2.x vs PyHanLP#
HanLP 2.x (Modern):
- Deep learning models (BERT, ELECTRA)
- State-of-the-art accuracy
- Heavier (requires PyTorch/TF)
- Slower (seconds per sentence)
PyHanLP (Traditional):
- Classic algorithms (HMM, CRF)
- Lighter weight (no DL frameworks)
- Faster (milliseconds per sentence)
- Slightly lower accuracy
For simplification MVP: Start with PyHanLP (lighter), upgrade to HanLP 2.x if you need advanced features
Role in Simplification Pipeline#
Advanced simplification logic:
```python
import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
text = "我昨天买了一本非常有趣的书"
result = HanLP(text)

# 1. Extract POS tags
tokens = result['tok/fine']  # ['我', '昨天', '买', '了', '一', '本', '非常', '有趣', '的', '书']
pos = result['pos/pku']      # ['PN', 'TIME', 'VV', 'AS', 'CD', 'M', 'AD', 'VA', 'DEG', 'NN']

# 2. Identify adjectives and adverbs
for token, tag in zip(tokens, pos):
    if tag in ['VA', 'AD']:  # adjectives, adverbs
        # Simplify: 非常 → 很, 有趣 → 好玩
        simplified = simplify_word(token)

# 3. Check dependency structure
deps = result['dep']
# If deeply nested → split the sentence
```
Performance Considerations#
- Model loading: ~10-30 seconds (first time, cached after)
- Inference: 0.5-2 seconds per sentence (CPU), faster on GPU
- Memory: 500MB-2GB depending on model
- Batch processing: Significantly faster for multiple sentences
For production: Use smaller models (ELECTRA_SMALL) or cache results
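The caching suggestion can be sketched with `functools.lru_cache`. The expensive HanLP call is stubbed out here (a stand-in, not the real API); in production the cached function would wrap `HanLP(text)` so repeated sentences are analyzed only once.

```python
from functools import lru_cache

CALLS = {"n": 0}   # counter to show the cache working

def expensive_analysis(text):
    """Stand-in for a slow HanLP pipeline call."""
    CALLS["n"] += 1
    return {"tokens": list(text)}   # placeholder result

@lru_cache(maxsize=10_000)
def analyze_cached(text):
    # lru_cache needs hashable return values, hence the tuple
    return tuple(expensive_analysis(text)["tokens"])

analyze_cached("你好")
analyze_cached("你好")   # served from cache, no second model call
print(CALLS["n"])        # 1
```

For a multi-process deployment, swap the in-process cache for a shared store (e.g. Redis) keyed on a hash of the sentence.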
Alternatives#
- LTP (Language Technology Platform): Similar features, different architecture
- jieba + custom POS tagger: Lighter but less accurate
- StanfordNLP: Multi-language but heavier
Links:
- Documentation: https://hanlp.hankcs.com/docs/install.html
- PyPI: https://pypi.org/project/hanlp/
- GitHub: https://github.com/hankcs/HanLP
Integration Strategy#
Recommended stack for text simplification:
- OpenCC - Normalize to Simplified Chinese
- jieba - Segment into words
- HanLP (optional) - POS tagging, NER, parsing for advanced logic
- Custom logic - Synonym replacement, sentence splitting
- HSK-Character-Profiler - Validate output difficulty
Minimal stack (faster, lighter):
- jieba + OpenCC + custom rules
Advanced stack (higher quality, slower):
- HanLP (all-in-one) + custom rules + MCTS-trained model
MCTS: Multi-Reference Chinese Text Simplification Dataset#
Overview#
MCTS is the first published multi-reference Chinese text simplification evaluation dataset, released in 2024 for the LREC-COLING conference.
Paper: MCTS: A Multi-Reference Chinese Text Simplification Dataset
GitHub: https://github.com/blcuicall/mcts
Authors: Ruining Chong, Luming Lu, Liner Yang, Jinran Nie, Zhenghao Liu, Shuo Wang, Shuhan Zhou, Yaoxin Li, Erhong Yang (2024)
What It Provides#
1. Evaluation Dataset#
- 3,615 human simplifications across 723 original sentences
- 5 simplifications per original sentence (multi-reference)
- Source: Penn Chinese Treebank (CTB) - news, government docs, broadcasts
2. Training Corpus#
- 691,474 high-quality parallel training pairs (complex ↔ simple)
- Largest scale training data in Chinese Text Simplification field (as of 2024)
- Generated via rigorous automatic screening
- Built using combination of Machine Translation + English Text Simplification
3. Evaluation Scripts#
- `hsk_evaluate.py` - HSK-based evaluation metrics
- Benchmarking tools for comparing simplification models
Dataset Composition#
Original sentences source: Penn Chinese Treebank (CTB)
- Xinhua news agency reports
- Government documents
- News magazines
- Broadcasts and interviews
- Online news and web logs
Human simplifications:
- Professional annotators
- Multiple references per sentence (captures variation)
- Quality controlled
What It’s NOT#
❌ Not a pip-installable library - It’s a dataset, not production code
❌ Not a pre-trained model - No ready-to-use simplification models included
❌ Not turnkey - Requires ML expertise to train models from the data
How to Use#
1. Download the Dataset#
```shell
git clone https://github.com/blcuicall/mcts
cd mcts
```
2. Access the Data#
- `data/evaluation/` - Multi-reference evaluation set (723 sentences × 5 simplifications)
- `data/training/` - Parallel corpus (691K pairs)
3. Train a Model#
- Use training corpus with seq2seq models (T5, BART, mBART)
- Fine-tune on Chinese text simplification task
- Evaluate on multi-reference test set
4. Evaluate with HSK Metrics#
```shell
python hsk_evaluate.py --input your_simplifications.txt
```
Use Cases#
For researchers:
- Train and evaluate neural text simplification models
- Compare against multi-reference gold standard
- Publish papers on Chinese text simplification
For product teams (with ML resources):
- Train custom simplification models for your domain
- Fine-tune on domain-specific data after pre-training on MCTS
- Requires: ML engineers, GPU resources, 2-4 months development
Not suitable for:
- Quick prototypes (dataset, not library)
- Teams without ML expertise
- Production deployment without training pipeline
Significance#
Why it matters:
- First multi-reference dataset - Previous work had single reference translations
- Largest training corpus - 691K pairs vs. previous datasets with < 100K
- HSK-aware evaluation - Specifically designed for learner-focused simplification
Research impact:
- Enables neural model development
- Standardizes evaluation (multi-reference BLEU, HSK metrics)
- Provides benchmark for comparing approaches
Limitations#
- Domain bias: News/formal text (CTB source)
- May not generalize to casual, social media, or technical text
- Mainland Chinese: Simplified Chinese from mainland sources
- Not optimized for Traditional Chinese or Taiwan/HK variants
- No model included: Data only, you must build the model
- Static dataset: As of 2024, no updates since publication
Practical Path Forward#
If you want to use MCTS:
Option A: Train a neural model
- Download dataset
- Set up training pipeline (T5/BART + seq2seq)
- Train on 691K corpus (requires GPU, ~$100-500 cloud cost)
- Evaluate on multi-reference test set
- Deploy model for inference
- Timeline: 2-4 months (with ML expertise)
Option B: Use as inspiration
- Study human simplifications to understand patterns
- Extract simplification rules (what words get replaced, how sentences split)
- Implement rules in code (rule-based approach)
- Use MCTS as validation (compare your output to human references)
- Timeline: 1-2 months (less ML-heavy)
Option C: Hybrid
- Use MCTS for hard cases (train small specialized model)
- Use rule-based for easy cases (word replacements)
- Combine for production system
- Timeline: 2-3 months
Integration with Other Tools#
MCTS pairs well with:
- jieba: Segment text before feeding to trained model
- HSK-Character-Profiler: Validate output difficulty
- OpenCC: Handle Traditional/Simplified conversion
- HanLP: Extract linguistic features for model training
Example Workflow#
# 1. Segment input text
import jieba
text = "这是一段复杂的中文文本"
words = jieba.cut(text)
# 2. Simplify with MCTS-trained model
# (You must train this model first using MCTS dataset)
simplified = your_trained_model.simplify(text)
# 3. Validate output difficulty
from hsk_profiler import analyze
hsk_level = analyze(simplified)
print(f"Output difficulty: HSK {hsk_level}")
Verdict#
MCTS is essential infrastructure for research and large-scale production systems, but it’s not a quick solution for MVPs or small teams.
Best for:
- Research teams publishing on Chinese NLP
- Large platforms with ML resources (> 10K texts/month to simplify)
Not for:
- Rapid prototypes (use rule-based instead)
- Small teams without ML expertise
- Projects with < 3 month timeline
S1-rapid Recommendations#
Executive Summary#
Finding: No pip-installable libraries exist specifically for Chinese text simplification as of 2026. This is a BUILD, not BUY problem.
Recommended approach: Hybrid stack combining existing NLP libraries (jieba, OpenCC, HSK-Character-Profiler) with custom simplification logic.
Timeline: 2-4 weeks for MVP (rule-based), 2-4 months for production system
Cost: $5K-$35K Year 1 depending on complexity
Key Findings from S1-rapid#
1. Library Landscape (Reality Check)#
What EXISTS:
- ✅ Text segmentation (jieba, HanLP)
- ✅ Traditional/Simplified conversion (OpenCC, hanziconv)
- ✅ Readability analysis (HSK-Character-Profiler)
- ✅ NLP foundations (HanLP, LTP)
- ✅ Training datasets (MCTS - 691K parallel sentences)
What DOESN’T exist:
- ❌ Production-ready text simplification libraries (pip-installable)
- ❌ Pre-trained simplification models (load-and-use)
- ❌ Turnkey solutions (like Rewordify for English)
Gap: Chinese text simplification is 3-5 years behind English in terms of library maturity.
2. Three Viable Approaches#
Option A: Rule-Based (Recommended for MVP)#
Stack: jieba + OpenCC + HSK vocabulary + custom rules
Timeline: 2-4 weeks
Cost: $5K-$15K
Pros:
- Fast to implement
- Predictable results
- Easy to debug and maintain
- No ML expertise required
Cons:
- Limited to word replacement (can’t restructure sentences well)
- Needs manual synonym dictionary curation
- Struggles with idioms and context
Success rate: 70-80% of sentences
Option B: Neural (Research-Grade)#
Stack: MCTS dataset + transformer model (T5/BART) + GPU training
Timeline: 2-4 months
Cost: $20K-$60K (development + GPU)
Pros:
- Can handle complex restructuring
- Improves with more data
- Handles idioms better
Cons:
- Requires ML expertise
- Unpredictable output (may generate fluent but incorrect text)
- Slower inference
- Hard to control exact output level
Success rate: 80-90% (but 10-20% errors can be severe)
Option C: Hybrid (Production-Ready)#
Stack: Rule-based for common cases + neural for complex cases
Timeline: 2-3 months
Cost: $15K-$35K
Pros:
- Best of both worlds
- Rules handle 70%, neural handles remaining 30%
- Fallback logic (if neural fails, use rule output)
Cons:
- More complex architecture
- Requires both rule curation AND model training
Success rate: 85-95%
Decision Matrix#
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| MVP / Prototype (< 1 month) | Rule-based | Fastest time to value |
| Language learning app (1K-10K texts/month) | Rule-based → Hybrid | Start simple, upgrade if needed |
| Large platform (> 10K texts/month) | Hybrid | ROI justifies complexity |
| Research / Publishing | Neural | Accuracy matters more than speed |
| Accessibility (government docs) | Rule-based | Predictability matters |
Recommended MVP Stack#
Goal: Working text simplification in 2-4 weeks
Components#
jieba - Text segmentation
pip install jieba
OpenCC - Traditional/Simplified normalization
pip install opencc-python-reimplemented
HSK-Character-Profiler - Difficulty validation
git clone https://github.com/Ancastal/HSK-Character-Profiler
HSK Vocabulary Lists
git clone https://github.com/krmanik/HSK-3.0
Custom logic (you build):
- Synonym dictionary (HSK 6→3 word mappings)
- Sentence splitting rules
- Idiom handling
Implementation Steps#
Week 1: Infrastructure
- Set up jieba segmentation pipeline
- Integrate OpenCC for normalization
- Load HSK vocabulary lists
- Set up HSK-Character-Profiler for validation
Week 2: Simplification logic
- Build synonym dictionary (map 500 common HSK 4-6 words to HSK 2-3 equivalents)
- Implement word replacement logic
- Add sentence splitting (sentences > 20 chars)
Week 3: Quality assurance
- Test on sample texts
- Native speaker validation
- Fix edge cases (names, numbers, idioms)
Week 4: Deployment
- Build API endpoint
- Add caching
- Performance optimization
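The weekly plan above can be sketched end to end. Below is a minimal rule-based pipeline; the toy lexicon, `SYNONYM_DICT`, and the greedy segmenter are illustrative stand-ins (production code would segment with jieba and load real HSK 3.0 lists):

```python
# Toy lexicon: word -> HSK level (illustrative stand-in for real HSK lists)
HSK_LEVELS = {"这": 1, "是": 1, "一个": 1, "复杂": 4, "的": 1, "句子": 3, "难": 2}

# HSK 4+ word -> easier synonym per target level (illustrative)
SYNONYM_DICT = {"复杂": {"3": "难", "2": "难"}}

def segment(text):
    """Greedy longest-match segmenter (stand-in for jieba)."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + 4), i, -1):
            if text[i:j] in HSK_LEVELS or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def simplify(text, target_hsk=3):
    """Replace words above the target HSK level with easier synonyms."""
    out = []
    for w in segment(text):
        if HSK_LEVELS.get(w, 99) > target_hsk:
            # Unknown/hard word: swap if a synonym exists, else keep as-is
            out.append(SYNONYM_DICT.get(w, {}).get(str(target_hsk), w))
        else:
            out.append(w)
    return "".join(out)

print(simplify("这是一个复杂的句子"))  # -> 这是一个难的句子
```

In practice the output would then be run through HSK-Character-Profiler (Week 1 setup) to confirm the result actually lands at the target level.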
What to Build Yourself#
1. Synonym Dictionary (Critical)#
Format:
{
  "复杂": {
    "hsk_level": 4,
    "synonyms": {
      "3": ["难"],
      "2": ["难"]
    }
  },
  "研究": {
    "hsk_level": 4,
    "synonyms": {
      "3": ["学习"],
      "2": ["学"]
    }
  }
}
Sources for building:
- HSK vocabulary lists (levels 1-6 or 1-9)
- Chinese learner dictionaries
- Manual curation by native speakers
Effort: 1-2 weeks for 500-1000 words
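The dictionary format above is plain JSON, so loading and querying it is straightforward. A sketch (the inline string stands in for a file such as a hypothetical `synonyms.json`):

```python
import json

# Same structure as the format shown above, inlined for illustration
raw = '''
{
  "复杂": {"hsk_level": 4, "synonyms": {"3": ["难"], "2": ["难"]}},
  "研究": {"hsk_level": 4, "synonyms": {"3": ["学习"], "2": ["学"]}}
}
'''
synonym_dict = json.loads(raw)

def easier_synonym(word, target_hsk):
    """Return an easier synonym for `word`, or None if unavailable."""
    entry = synonym_dict.get(word)
    if entry is None or entry["hsk_level"] <= target_hsk:
        return None  # unknown word, or already simple enough
    candidates = entry["synonyms"].get(str(target_hsk), [])
    return candidates[0] if candidates else None

print(easier_synonym("研究", 2))  # -> 学
```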
2. Simplification Rules#
Word replacement:
def simplify_word(word, target_hsk=3):
    if word_hsk_level(word) <= target_hsk:
        return word  # Already simple enough
    # Synonyms are stored as lists (see dictionary format above); take the first
    candidates = synonym_dict.get(word, {}).get('synonyms', {}).get(str(target_hsk), [])
    return candidates[0] if candidates else word
Sentence splitting:
import re

def split_long_sentence(sentence, max_length=15):
    if len(sentence) <= max_length:
        return [sentence]
    # Split at commas and enumeration commas; keep non-empty parts
    parts = [p for p in re.split(r'[，、]', sentence) if p]
    if len(parts) <= 1:
        return [sentence]  # no split point found; leave as-is
    return [s for p in parts for s in split_long_sentence(p, max_length)]
Idiom handling:
def simplify_idiom(text):
    # Replace 4-character idioms with plain explanations
    idiom_map = {
        "一举两得": "一次做两件事",
        "画蛇添足": "做多余的事"
    }
    for idiom, plain in idiom_map.items():
        text = text.replace(idiom, plain)
    return text
3. Quality Validation#
from hsk_profiler import analyze

def validate_simplification(original, simplified, target_hsk=3):
    # 1. Check difficulty
    difficulty = analyze(simplified)
    if difficulty > target_hsk:
        return False, "Still too difficult"
    # 2. Check meaning preservation
    # (compute_similarity is a semantic similarity metric you supply)
    similarity = compute_similarity(original, simplified)
    if similarity < 0.7:
        return False, "Meaning changed too much"
    return True, "OK"
Common Pitfalls to Avoid#
Over-simplification
- Don’t replace every word blindly
- Keep domain-specific terms (if learner needs them)
- Maintain natural flow
Segmentation errors compound
- Jieba mistakes → wrong word boundaries → wrong replacements
- Add custom dictionary for your domain
Context blindness
- “银行” = bank (financial) or riverbank
- Rule-based can’t distinguish without context
- Solution: Use POS tagging (HanLP) or accept limitation
No ground truth
- Unlike translation (many parallel corpora), simplification has limited references
- Solution: Human validation, iterative testing
Next Steps in 4PS Research#
S2-comprehensive (Technical Depth)#
- Deep dive into MCTS dataset structure
- Neural model architectures (T5, BART, mBART)
- Feature engineering for ML approaches
- Evaluation metrics (BLEU, SARI, HSK-aware metrics)
Questions to answer:
- How do you train a neural simplification model?
- What linguistic features correlate with simplification quality?
- How do you evaluate simplification (beyond human judgment)?
S3-need-driven (Use Case Mapping)#
- Language learning apps: Requirements and implementation
- Accessibility services: Government doc simplification
- Publishers: Educational content adaptation
Questions to answer:
- What accuracy threshold is “good enough” for each use case?
- What’s the TCO over 3 years for each approach?
- Build vs buy decision tree
S4-strategic (Viability & ROI)#
- 3-year TCO analysis (rule-based vs neural vs hybrid)
- Break-even volume (when does automation pay off?)
- Risk assessment (what can go wrong?)
Questions to answer:
- At what scale does neural approach become cost-effective?
- What’s the risk of meaning drift in automated simplification?
- When should you hire editors instead of building tech?
S1-rapid Conclusion#
For most teams: Start with rule-based MVP using jieba + OpenCC + custom logic.
Timeline: 2-4 weeks to working prototype
Cost: $5K-$15K
Success rate: 70-80% (good enough for MVP)
Upgrade path: If you hit limitations (complex sentences not simplifying well), add neural model for those cases (hybrid approach).
Reality: Chinese text simplification is immature compared to English. You will build custom solutions, not pip install magic.
S2: Comprehensive
S2-comprehensive: Technical Deep Dive#
Status: 🚧 IN PROGRESS#
This phase will cover the technical depth of Chinese text simplification, including:
Planned Topics#
1. Neural Model Architectures#
- Transformer-based approaches (T5, BART, mBART)
- Seq2seq vs pre-trained models
- Fine-tuning strategies for Chinese
- Model size trade-offs (performance vs inference speed)
2. MCTS Training Pipeline#
- Dataset structure and format
- Training data preprocessing
- Model training workflow
- Hyperparameter tuning
- Evaluation on multi-reference test set
3. Rule-Based Approaches (Detailed)#
- Synonym extraction methods
- HSK-level word mapping strategies
- Sentence splitting algorithms
- Idiom detection and replacement
- Compound word handling
4. Evaluation Metrics#
- BLEU score for simplification
- SARI (System output Against References and against the Input sentence)
- HSK-aware metrics (vocabulary coverage)
- Semantic similarity (meaning preservation)
- Fluency metrics
- Human evaluation protocols
5. Feature Engineering#
- Linguistic features for simplification
- Character frequency analysis
- Sentence complexity metrics
- Discourse structure
- Readability formulas for Chinese
6. Comparative Analysis#
- Rule-based vs neural (detailed comparison)
- Accuracy vs speed trade-offs
- Error analysis (what fails in each approach)
- Hybrid architectures
Research Questions#
What makes a good simplification model?
- Accuracy benchmarks from MCTS paper
- State-of-the-art results (2024-2026)
How do you train on MCTS dataset?
- Step-by-step training guide
- GPU requirements
- Training time estimates
- Fine-tuning vs training from scratch
What linguistic features matter most?
- Feature importance analysis
- Correlation with simplification quality
How do you evaluate without ground truth?
- Multi-reference evaluation
- Automatic metrics vs human judgment
- Inter-annotator agreement
Estimated Time#
3-4 hours for comprehensive technical research
Deliverables (Planned)#
- neural-architectures.md - T5, BART, mBART for Chinese TS
- mcts-training-guide.md - How to train on MCTS dataset
- evaluation-metrics.md - BLEU, SARI, HSK metrics deep dive
- rule-based-detailed.md - Advanced rule-based techniques
- feature-engineering.md - Linguistic features for ML
- recommendation.md - S2 technical recommendations
Status: Outline created, detailed research pending
Next session: Start with neural-architectures.md or mcts-training-guide.md
Evaluation Metrics for Chinese Text Simplification#
The Challenge#
Unlike translation (compare to reference), simplification has:
- Multiple valid outputs (many ways to simplify)
- No single ground truth (HSK 3 can be expressed many ways)
- Dual goals: Simplicity AND meaning preservation
Automatic Metrics#
1. BLEU (Bilingual Evaluation Understudy)#
What it measures: N-gram overlap with reference simplifications
from sacrebleu import corpus_bleu
references = [["这是简单的句子。"], ["这个句子很简单。"]]  # one stream per reference
hypothesis = ["这是简单句子。"]
score = corpus_bleu(hypothesis, references, tokenize="zh")  # Chinese tokenizer
print(score.score)  # 0-100, higher is better
Pros: Standard, widely used
Cons: Rewards exact matches, penalizes valid paraphrases
Typical scores: 30-45 for text simplification (lower than translation)
2. SARI (System output Against References and Input)#
What it measures: How well you ADD simple words, KEEP important words, DELETE complex words
from easse.sari import corpus_sari
sources = ["这是一个非常复杂的句子。"]
predictions = ["这是复杂句子。"]
references = [["这是难句子。"], ["这个句子很难。"]]  # shape: (n_refs, n_samples)
score = corpus_sari(sources, predictions, references)
print(score)  # 0-100, higher is better
Install: pip install easse
Formula:
SARI = (F1_add + F1_keep + F1_delete) / 3
Pros: Designed for simplification, better than BLEU
Cons: Requires multiple references for accuracy
Typical scores: 35-45 for good simplification
3. HSK Vocabulary Coverage#
What it measures: Percentage of words within target HSK level
def hsk_coverage(text, target_level=3):
    words = list(jieba.cut(text))  # materialize: the generator can only be consumed once
    hsk_vocab = load_hsk_vocab(levels=range(1, target_level + 1))
    known_words = sum(1 for w in words if w in hsk_vocab)
    return known_words / len(words)

coverage = hsk_coverage("这是一个简单的句子", target_level=3)
# 0.95 = 95% of words are HSK 1-3
Pros: Directly measures learner comprehension
Cons: Doesn’t measure sentence complexity
Targets:
- HSK 2: 90-95% coverage
- HSK 3: 95-98% coverage
- HSK 4: 98-99% coverage
4. Sentence Length#
What it measures: Average characters per sentence (simpler = shorter)
avg_length = sum(len(s) for s in sentences) / len(sentences)
Targets:
- HSK 2: 8-12 characters
- HSK 3: 12-18 characters
- HSK 4: 18-25 characters
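The length targets above can serve as a quick automated check. A sketch (bounds copied from the list; the helper name is illustrative):

```python
# (lo, hi) character-count targets per HSK level, from the list above
LENGTH_TARGETS = {2: (8, 12), 3: (12, 18), 4: (18, 25)}

def within_length_target(sentence, hsk_level):
    """True if the sentence length falls inside the target band."""
    lo, hi = LENGTH_TARGETS[hsk_level]
    return lo <= len(sentence) <= hi

print(within_length_target("这是一个简单的句子", 2))  # 9 characters -> True
```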
5. Semantic Similarity#
What it measures: Meaning preservation (does simplification keep same meaning?)
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
original = "这是一个非常复杂的句子"
simplified = "这是难句子"
emb1 = model.encode(original)
emb2 = model.encode(simplified)
similarity = util.cos_sim(emb1, emb2)
print(similarity)  # 0-1, higher is better
Thresholds:
- < 0.7: Meaning changed too much
- 0.7-0.85: Acceptable paraphrase
- > 0.85: Very similar meaning
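These thresholds map naturally to a small triage helper (a sketch; the function name and labels are illustrative):

```python
def similarity_verdict(sim):
    """Classify a cosine-similarity score using the thresholds above."""
    if sim < 0.7:
        return "meaning changed too much"
    if sim <= 0.85:
        return "acceptable paraphrase"
    return "very similar meaning"

print(similarity_verdict(0.82))  # -> acceptable paraphrase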
Human Evaluation#
Dimensions to Measure#
Grammaticality: Is the simplified text fluent?
- Scale: 1-5 (1=broken, 5=perfect)
Meaning Preservation: Does it keep the original meaning?
- Scale: 1-5 (1=completely different, 5=identical)
Simplicity: Is it simpler than the original?
- Scale: 1-5 (1=same difficulty, 5=much simpler)
Adequacy: Would an HSK X learner understand this?
- Binary: Yes/No
Evaluation Protocol#
Annotators: 3-5 native Chinese speakers (preferably with teaching experience)
Sample size: 100-200 sentence pairs (random sample)
Agreement: Calculate inter-annotator agreement (Fleiss’ kappa)
- κ > 0.6: Good agreement
- κ < 0.4: Revise guidelines
Cost: $500-1000 for 200 evaluations (crowdsourcing) or $2K-5K (expert annotators)
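Fleiss' kappa itself is simple to compute. A self-contained sketch: `ratings` has one row per evaluated sentence, one column per rating category, each cell counting how many of the n annotators chose that category:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an items x categories count matrix."""
    N = len(ratings)        # number of items
    n = sum(ratings[0])     # raters per item (assumed constant)
    k = len(ratings[0])     # number of categories
    # Proportion of all assignments falling in each category
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two items, three raters each: perfect agreement gives kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3]]))  # -> 1.0
```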
Composite Score#
Combine metrics for overall quality:
def evaluate_simplification(original, simplified, references, target_hsk=3):
    # Automatic metrics
    bleu = compute_bleu([simplified], [references])
    sari = compute_sari([original], [simplified], [references])
    hsk_cov = hsk_coverage(simplified, target_hsk)
    similarity = semantic_similarity(original, simplified)
    # Composite score
    # (Note: BLEU/SARI are 0-100 while coverage/similarity are 0-1;
    # normalize to a common scale before combining.)
    score = {
        'bleu': bleu,
        'sari': sari,
        'hsk_coverage': hsk_cov,
        'semantic_sim': similarity,
        'composite': 0.2*bleu + 0.3*sari + 0.3*hsk_cov + 0.2*similarity
    }
    # Pass criteria
    passes = (
        hsk_cov >= 0.95 and              # 95%+ HSK coverage
        similarity >= 0.75 and           # Meaning preserved
        len(simplified) < len(original)  # Actually simpler
    )
    return score, passes
Benchmarking Your System#
Baseline: No Simplification#
- BLEU: 0 (no match with references)
- SARI: ~30 (keeps all words, doesn’t simplify)
- HSK coverage: Depends on original
Rule-Based Target#
- SARI: 35-40
- HSK coverage: 90-95%
- Semantic similarity: 0.8-0.9
Neural Target#
- SARI: 40-45
- HSK coverage: 85-95% (less controllable)
- Semantic similarity: 0.75-0.85
MCTS Paper Results#
- Best models: ~40 BLEU, ~45 SARI
- Human upper bound: ~60 SARI (multi-reference)
Practical Validation Workflow#
Week 1: Automated
- Run on 1K test sentences
- Compute BLEU, SARI, HSK coverage
- Filter failures (< thresholds)
Week 2: Spot Check
- Manual review of 100 random samples
- Identify error patterns (what’s breaking?)
Week 3: Human Eval
- Formal evaluation on 200 samples
- Calculate inter-annotator agreement
- Iterate if needed
Week 4: Production
- Deploy with monitoring
- Log edge cases for improvement
- Periodic re-evaluation
Monitoring in Production#
Track these metrics over time:
# Log per simplification
{
'original_length': 45,
'simplified_length': 28,
'hsk_coverage': 0.94,
'semantic_similarity': 0.82,
'inference_time_ms': 250
}
# Alert if:
# - HSK coverage < 0.90 (too hard)
# - Semantic similarity < 0.70 (meaning drift)
# - Inference time > 500ms (too slow)
Error Analysis#
Common failure modes:
Over-simplification: “研究表明” → “说” loses academic tone
- Fix: Be more conservative with replacements
Under-simplification: Didn’t simplify hard words
- Fix: Expand synonym dictionary
Meaning drift: “银行” (bank) → “河边” (riverbank) wrong context
- Fix: Use POS tags or context-aware rules
Unnatural output: “非常的好” (ungrammatical)
- Fix: Add grammar validation post-processing
Tools#
Libraries:
- sacrebleu: BLEU calculation
- easse: SARI and other simplification metrics (English-focused but adaptable)
- sentence-transformers: Semantic similarity
- jieba: Segmentation for HSK coverage
MCTS eval scripts: https://github.com/blcuicall/mcts (includes HSK evaluator)
Verdict#
MVP evaluation (fast):
- HSK coverage (must-have)
- Sentence length reduction
- Manual spot-checks (50 samples)
Production evaluation (rigorous):
- SARI (automatic)
- Semantic similarity (automatic)
- Human eval (200 samples, quarterly)
Research evaluation (comprehensive):
- All automatic metrics
- Human eval (500+ samples)
- Inter-annotator agreement
- Error analysis by category
Neural Architectures for Chinese Text Simplification#
Models That Work#
1. mBART (Multilingual BART)#
Best for: Chinese text simplification (multilingual pre-training helps)
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="zh_CN", tgt_lang="zh_CN")
# Fine-tune on MCTS dataset
Pros: Pre-trained on Chinese, handles seq2seq well
Cons: Large (600M params), slow inference
MCTS paper results: Not specifically tested, but BART-style models work well
2. mT5 (Multilingual T5)#
Best for: Chinese when you need smaller models
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
Sizes: Small (300M), Base (580M), Large (1.2B)
Pros: Good Chinese support, multiple sizes
Cons: Requires more training data than BART
3. CPT (Chinese Pre-Trained Transformer)#
Best for: Chinese-only tasks (no multilingual overhead)
GitHub: https://github.com/fastnlp/CPT
Pros: Optimized for Chinese, faster than mBART
Cons: Less widely adopted, fewer resources
Training Setup#
Hardware Requirements#
| Model Size | GPU Memory | Training Time (MCTS 691K) | Inference Speed |
|---|---|---|---|
| mT5-small | 16GB | 2-3 days | 0.5s/sentence |
| mT5-base | 24GB | 4-5 days | 1s/sentence |
| mBART | 32GB+ | 5-7 days | 1.5s/sentence |
Cloud costs: ~$100-500 (AWS p3.2xlarge or equivalent)
Training Pipeline#
# 1. Load MCTS dataset
from datasets import load_dataset
dataset = load_dataset('json', data_files={'train': 'mcts/train.jsonl'})
# 2. Tokenize
def tokenize(examples):
    inputs = tokenizer(examples['source'], max_length=128, truncation=True)
    targets = tokenizer(examples['target'], max_length=128, truncation=True)
    inputs['labels'] = targets['input_ids']
    return inputs

dataset = dataset.map(tokenize, batched=True)
# 3. Train
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    save_steps=10000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
)
trainer.train()
Fine-Tuning Strategies#
1. Full Fine-Tuning#
- Update all model weights
- Best accuracy
- Requires most GPU memory
- 3-5 epochs on MCTS: ~$200-500
2. LoRA (Low-Rank Adaptation)#
- Update only small adapter layers
- 90% of full fine-tuning accuracy
- 1/4 the memory usage
- Recommended for smaller teams
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
# Train as normal, much less memory
3. Prefix Tuning#
- Add learnable prefix tokens
- Even smaller than LoRA
- Slightly lower accuracy
Controlling Output Difficulty#
Challenge: Neural models don’t respect HSK levels by default.
Solution 1: Prompt engineering
input_text = f"[HSK 3] {source_text}"
# Model learns to simplify to HSK 3 level
Solution 2: Separate models per level
- Train mT5-small for HSK 3
- Train mT5-small for HSK 4
- Route at inference time
Solution 3: Post-process with HSK validator
simplified = model.generate(input_ids)
if hsk_level(simplified) > target_level:
    ...  # Reject and regenerate with different decoding params
Decoding Strategies#
Beam Search (Standard)#
output = model.generate(
    input_ids,
    max_length=128,
    num_beams=5,
    early_stopping=True,
)
Best for: Quality (default choice)
Sampling#
output = model.generate(
    input_ids,
    max_length=128,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)
Best for: Variety (multiple simplification candidates)
Constrained Decoding#
Force output to use only HSK 1-3 vocabulary (advanced).
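The core idea can be shown without a full model: mask out every token that falls outside the allowed vocabulary before picking the next token. This is a framework-free toy sketch; in a real system the mask would live in a transformers `LogitsProcessor`, and the vocabulary and scores below are invented for illustration:

```python
import math

# Toy vocabulary and the HSK 1-3 subset we allow the decoder to emit
VOCAB = ["这", "是", "复杂", "难", "的", "句子", "<eos>"]
ALLOWED = {"这", "是", "难", "的", "句子", "<eos>"}

def mask_logits(logits):
    """Set scores of disallowed tokens to -inf so they can never win."""
    return [x if tok in ALLOWED else -math.inf for tok, x in zip(VOCAB, logits)]

def greedy_pick(logits):
    masked = mask_logits(logits)
    return VOCAB[masked.index(max(masked))]

# "复杂" has the highest raw score, but the mask forces "难" instead
print(greedy_pick([0.1, 0.2, 2.0, 1.5, 0.3, 0.4, 0.0]))  # -> 难
```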
Benchmarks from Literature#
MCTS paper (2024):
- BART-based models: ~40 BLEU score
- mT5-base: ~38 BLEU score
- Human references: 100 BLEU (by definition)
Reality check: BLEU scores are low because simplification has multiple valid outputs. Multi-reference BLEU is more meaningful.
Production Considerations#
Inference Optimization#
- TorchScript: 20-30% faster
- ONNX Runtime: 30-50% faster
- Model quantization: 2-4x faster, slight quality loss
- Batching: 5-10x throughput improvement
Cost at Scale#
1K simplifications/day:
- GPU inference (T4): $50-100/month
- CPU inference (optimized): $20-50/month
- Serverless (AWS Lambda + GPU): $100-200/month
10K simplifications/day:
- Dedicated GPU server becomes cost-effective
Verdict#
For most teams: Start with mT5-base + LoRA on MCTS
- Good balance of quality and resources
- 2-3 days training on single GPU
- Deploy with ONNX for fast inference
For research: mBART-large (best quality)
For production at scale: mT5-small (fast, good enough)
S2-comprehensive Recommendations#
Neural Approach: Go/No-Go#
GO if:#
- Volume > 5K texts/month (justify training cost)
- Accuracy needs > 80% (rule-based plateaus at 70-80%)
- Have ML engineer or budget for consulting
- Can tolerate 3-5% meaning drift
NO-GO if:#
- Volume < 1K texts/month (not worth complexity)
- Need 100% predictable output (use rules)
- No ML expertise and < $20K budget
- Can’t accept any meaning errors
Recommended Neural Stack#
mT5-base + LoRA on MCTS dataset:
- 2-3 days training (single GPU, ~$150)
- Deploy with ONNX (1s/sentence CPU)
- 40-45 SARI, 85-95% HSK coverage
- LoRA = 1/4 memory of full fine-tuning
Implementation:
# 1. Setup
pip install transformers datasets peft onnx
# 2. Train
python train.py \
--model google/mt5-base \
--dataset mcts/train.jsonl \
--epochs 3 \
--lora_r 16 \
--output models/mt5-lora-hsk3
# 3. Evaluate
python eval.py \
--model models/mt5-lora-hsk3 \
--test mcts/test.jsonl \
--metrics sari,bleu,hsk_coverage
# 4. Deploy
python convert_to_onnx.py \
--model models/mt5-lora-hsk3 \
  --output models/mt5-lora-hsk3.onnx
Timeline: 1 week (setup + train + eval)
Cost: $200-500 (cloud GPU + storage)
Evaluation Strategy#
MVP: HSK coverage + spot checks
- 100 test sentences
- Must achieve 95%+ HSK coverage
- Manual review of 20 samples
Production: SARI + semantic similarity + human eval
- Run SARI on full test set (target: 40+)
- Semantic similarity > 0.75 (meaning preserved)
- Human eval on 200 samples quarterly
Monitoring: Log these per-simplification
{
'hsk_coverage': 0.94,
'semantic_similarity': 0.82,
'inference_time_ms': 250
}
# Alert if coverage < 0.90 or similarity < 0.70
Hybrid Architecture (Best of Both)#
Route by sentence complexity:
def simplify(text):
    complexity = measure_complexity(text)
    if complexity < 15:  # Simple sentence
        return rule_based_simplify(text)  # Fast, predictable
    else:  # Complex sentence
        result = neural_simplify(text)
        if validate(result):
            return result
        else:
            return rule_based_simplify(text)  # Fallback
Result: 85-90% success rate (neural handles hard cases, rules are fallback)
S2 Key Takeaways#
- Neural is viable but not trivial (1 week + $500, requires ML skills)
- mT5-base + LoRA = best balance (quality vs resources)
- SARI + HSK coverage = must-have metrics
- Hybrid architecture = production-grade (rules + neural)
- 3-5% meaning drift is unavoidable with neural (need human review)
Next: S3 maps these technical options to specific use cases with TCO models.
S3: Need-Driven
S3-need-driven: Solutions by Use Case#
Status: 🚧 PLANNED#
This phase will map implementation strategies to real-world applications with cost models.
Planned Topics#
1. Language Learning Platforms#
- Requirements analysis (graded readers, adaptive content)
- Implementation strategy (rule-based MVP → hybrid)
- Cost model (3-year TCO)
- Success metrics (learner engagement, comprehension)
- Case study examples
2. Accessibility Services#
- Government document simplification
- Healthcare information
- Public service announcements
- Requirements (accuracy, consistency, auditability)
- Implementation strategy
- Compliance and legal considerations
3. Educational Publishers#
- Textbook adaptation
- Multi-level content generation
- Editorial workflow integration
- Quality assurance requirements
- ROI analysis
4. AI Tutoring Systems#
- Dynamic difficulty adjustment
- Real-time simplification
- Personalization (beyond HSK levels)
- Latency requirements
- Implementation strategy
5. Decision Tree#
- Use case → approach mapping
- Volume thresholds (rule-based vs neural)
- Accuracy requirements → implementation
- Budget constraints → solution
Research Questions#
What accuracy is “good enough” for each use case?
- Learner apps: 70-80% (human reads output)
- Publishing: 90%+ (editorial review)
- Accessibility: 85%+ (legal compliance)
What’s the TCO over 3 years?
- Rule-based: $12K-$35K
- Neural: $25K-$80K
- Hybrid: $20K-$60K
- Manual editing baseline: $24K-$72K/year
When should you hire editors instead of automating?
- Volume < 100 texts/month
- High-stakes content (legal, medical)
- Niche domains (limited training data)
Estimated Time#
3-4 hours for use case mapping and cost modeling
Deliverables (Planned)#
- use-case-language-learning.md
- use-case-accessibility.md
- use-case-publishers.md
- use-case-ai-tutoring.md
- decision-tree.md - Which approach for which scenario
- recommendation.md - S3 summary
Status: Outline created, detailed research pending
Next session: Start after S2-comprehensive is complete
S3-need-driven Recommendations#
Quick Decision Tree#
Volume/month?
├─ < 500
│ └─ Manual editing ($3K-6K/month)
│ OR rule-based if need latency < 5min
│
├─ 500-5K
│ ├─ Accuracy < 80%? → Rule-based ($10K-15K year 1)
│ └─ Accuracy 80%+? → Hybrid ($25K-40K year 1)
│
└─ > 5K
├─ Accuracy < 85%? → Hybrid ($30K-50K year 1)
└─ Accuracy 90%+? → Neural + review ($50K-90K year 1)
Use Case Mapping#
| Use Case | Approach | Why |
|---|---|---|
| Language learning app | Rule-based → Hybrid | Users tolerate 75-85%, scale gradually |
| Government accessibility | Hybrid + mandatory review | Need 90%+ + auditability |
| Publishers | Neural + editorial | 95%+ needed, editors refine output |
| AI tutoring | Neural (optimized) | 10K+/day needs speed + personalization |
| News sites | Hybrid | Daily content, 80%+ acceptable |
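The decision tree and mapping above reduce to a small routing function. A sketch, with thresholds copied from the tree (the function name and return labels are illustrative):

```python
def recommend(volume_per_month, accuracy_need):
    """Map monthly volume and required accuracy (0-1) to an approach."""
    if volume_per_month < 500:
        return "manual (or rule-based if latency matters)"
    if volume_per_month <= 5000:
        return "rule-based" if accuracy_need < 0.80 else "hybrid"
    return "hybrid" if accuracy_need < 0.85 else "neural + review"

print(recommend(2000, 0.75))  # -> rule-based
```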
Break-Even Analysis#
Automation vs manual editing:
- 100 texts/month: Manual cheaper ($1.2K vs $1.7K/month amortized)
- 500 texts/month: Break-even point
- 2K texts/month: Automation 3x cheaper
- 10K texts/month: Automation 5x cheaper
Rule-based vs neural:
- < 5K/month: Rule-based cheaper
- 5K-20K/month: Hybrid best ROI
- > 20K/month: Full neural justified
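The comparisons above all follow the same arithmetic: automation trades a fixed build cost (amortized over its lifetime) for a much lower per-text cost. A generic break-even sketch; every parameter value below is an illustrative assumption, not a figure from this report:

```python
def break_even_volume(build_cost, amortize_months, manual_per_text, auto_per_text):
    """Monthly volume at which amortized automation cost equals manual cost."""
    # fixed monthly cost / per-text savings
    return build_cost / amortize_months / (manual_per_text - auto_per_text)

# e.g. $36K build amortized over 36 months, $12/text manual vs $2/text
# automated (inference + spot-check review)
print(break_even_volume(36000, 36, 12.0, 2.0))  # -> 100.0
```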
Key Insight#
Most teams should start rule-based even if they’ll eventually need neural:
- Learn the domain (what fails? what patterns?)
- Collect data (input → desired output)
- Fine-tune neural on YOUR data (better than MCTS alone)
Exception: If you have > $50K budget and 10K+ texts/month from day 1, skip to neural.
Cost Optimization#
Year 1 costs high (development), Year 2-3 drop 70% (maintenance only):
- Rule-based: $15K → $5K/year
- Hybrid: $30K → $8K/year
- Neural: $60K → $15K/year
Mistake to avoid: Comparing Year 1 automation cost to Year 1 manual cost. Compare 3-year TCO.
Success Criteria by Use Case#
Learner apps:
- 90%+ completion rate
- < 5% difficulty complaints
- 95%+ HSK coverage
Accessibility:
- 0 legal challenges due to unclear language
- 90%+ users report “easy to understand”
- Audit trail for all simplifications
Publishers:
- Editors spend 50% less time vs writing from scratch
- 95%+ accuracy (minimal edits needed)
- Consistent style across levels
AI tutoring:
- < 500ms latency
- 80%+ learners report helpful explanations
- Adapt to user over time (personalization)
Use Case Implementation Guide#
1. Language Learning Apps (B2C)#
Requirements#
- Volume: 1K-10K texts/month
- Accuracy: 75-85% acceptable (users read output, can adapt)
- Latency: < 2s per text
- Cost: Must be cheaper than manual editing
Recommended Approach#
Phase 1 (MVP): Rule-based
- jieba + OpenCC + custom rules
- Timeline: 2-4 weeks
- Cost: $5K-15K
Phase 2 (Scale): Hybrid
- Rules for 70% (simple cases)
- Neural for 30% (complex sentences)
- Timeline: +1 month
- Cost: +$10K-20K
3-Year TCO#
| Year | Rule-Based | Hybrid | Manual Editing |
|---|---|---|---|
| 1 | $15K | $25K | $36K |
| 2 | $5K | $8K | $36K |
| 3 | $5K | $8K | $36K |
| Total | $25K | $41K | $108K |
Break-even: 500 texts/month (automation cheaper than editors)
Success metrics:
- 90%+ learners complete articles
- < 5% complaints about difficulty
- HSK coverage 95%+
2. Accessibility Services (Government)#
Requirements#
- Volume: 100-1K documents/month
- Accuracy: 90%+ (public-facing, legal implications)
- Auditability: Must explain simplifications
- Consistency: Same input → same output
Recommended Approach#
Hybrid with human review
- Rule-based for consistency
- Neural for complex legal language
- Mandatory human review before publication
Timeline: 3 months (includes compliance review)
Cost: $30K-50K (development + legal review)
3-Year TCO#
| Year | Hybrid + Review | Manual Only |
|---|---|---|
| 1 | $60K | $48K |
| 2 | $25K | $48K |
| 3 | $25K | $48K |
| Total | $110K | $144K |
Break-even: 200 documents/month
Constraints:
- Must log all simplifications (auditability)
- Rule-based preferred (explainable)
- Human QA on 100% of output
3. Educational Publishers#
Requirements#
- Volume: 500-2K texts/year (textbooks, readers)
- Accuracy: 95%+ (high stakes)
- Multiple levels: Need HSK 2, 3, 4, 5 versions
- Editorial workflow: Integrate with existing process
Recommended Approach#
Neural + editorial workflow
- Train separate models per HSK level
- Output 3 candidates per level
- Editors select best + refine
- Builds dataset for future improvement
Timeline: 4-6 months
Cost: $50K-80K (development + training)
3-Year TCO#
| Year | Neural + Editorial | Manual Only |
|---|---|---|
| 1 | $90K | $80K |
| 2 | $30K | $80K |
| 3 | $25K | $80K |
| Total | $145K | $240K |
Break-even: 1K texts/year
Workflow:
- Author writes at natural level
- Neural generates HSK 3, 4, 5 versions
- Editors review and refine
- Collect edits for model improvement
4. AI Tutoring Systems#
Requirements#
- Volume: 10K+ per day (real-time)
- Accuracy: 80%+ (AI can re-explain if confused)
- Latency: < 500ms (conversational)
- Personalization: Adapt to individual learner, not just HSK level
Recommended Approach#
Optimized neural with caching
- mT5-small (fast inference)
- ONNX runtime on CPU
- Cache common simplifications
- Fine-tune on user feedback
Timeline: 3-4 months
Cost: $40K-70K
Operating Costs#
| Volume/day | Infrastructure | Cost/month |
|---|---|---|
| 10K | 2x CPU (8 core) | $200 |
| 50K | GPU (T4) | $400 |
| 100K | 2x GPU | $800 |
Latency optimization:
- Caching: 50% hit rate → 250ms avg
- Batching: 5-10x throughput
- Model quantization: 2x faster
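The 250ms figure is just the expected latency over cache hits and misses. A quick sketch, assuming a near-zero cache lookup and ~500ms uncached inference:

```python
def avg_latency_ms(hit_rate, cached_ms, uncached_ms):
    """Expected latency under a cache: weighted average of hit and miss paths."""
    return hit_rate * cached_ms + (1 - hit_rate) * uncached_ms

# 50% hit rate, ~0ms cache lookup, 500ms model inference
print(avg_latency_ms(0.5, 0, 500))  # 250.0
```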
Decision Matrix#
| Scenario | Volume/month | Accuracy Need | Recommended | Timeline | Year 1 Cost |
|---|---|---|---|---|---|
| Learner app MVP | 500-2K | 75%+ | Rule-based | 3 weeks | $10K |
| Learner app scale | 5K-20K | 80%+ | Hybrid | 2 months | $25K |
| Government docs | 100-500 | 90%+ | Hybrid + review | 3 months | $60K |
| Publishers | 1K-3K/year | 95%+ | Neural + editorial | 5 months | $90K |
| AI tutoring | 10K+/day | 80%+ | Neural (optimized) | 3 months | $50K |
General Guidelines#
- < 500 texts/month: Manual editing cheaper (unless latency matters)
- 500-5K/month: Rule-based MVP; upgrade to hybrid if quality is limiting
- 5K-20K/month: Hybrid (rules + neural)
- > 20K/month: Full neural with optimization
Accuracy requirements:
- < 80%: Rule-based sufficient
- 80-90%: Hybrid
- > 90%: Neural + human review
Budget constraints:
- < $15K: Rule-based only
- $15K-$40K: Hybrid possible
- > $40K: Full neural viable
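These rules of thumb can be collapsed into a first-pass triage function. A sketch only; the thresholds are this document's guidelines, not hard limits, and the function name is illustrative:

```python
def recommend(volume_per_month, accuracy_need, budget_usd):
    """First-pass approach selection from the volume/accuracy/budget guidelines."""
    if volume_per_month < 500:
        return "manual"  # automation ROI negative below this volume
    if accuracy_need >= 0.90 or volume_per_month > 20_000:
        approach = "neural + human review"
    elif accuracy_need >= 0.80 or volume_per_month > 5_000:
        approach = "hybrid"
    else:
        approach = "rule-based"
    # Budget caps the ambition regardless of the ideal approach
    if budget_usd < 15_000 and approach != "rule-based":
        return "rule-based (budget-constrained)"
    return approach

print(recommend(2_000, 0.75, 12_000))   # rule-based
print(recommend(10_000, 0.85, 30_000))  # hybrid
```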
S4-strategic: Viability & ROI Analysis#
Status: 🚧 PLANNED#
This phase will provide long-term strategic decisions with financial models and risk assessment.
Planned Topics#
1. Build vs Buy Viability#
- 3-year TCO comparison
- Break-even analysis
- Market landscape (commercial solutions, if any)
- Build decision factors
2. Rule-Based vs Neural vs Hybrid ROI#
- Cost models for each approach
- Maintenance costs
- Scaling costs (volume growth)
- Quality improvements over time
3. Break-Even Analysis#
- At what volume does automation pay off vs manual editing?
- At what volume does neural become cheaper than rule-based?
- Fixed costs vs variable costs
4. Risk Assessment#
- Meaning drift: Automated simplification changes meaning
- Quality degradation: Errors accumulate over time
- Segmentation errors: Jieba mistakes cascade
- Context blindness: Rule-based misses context
- Mitigation strategies
5. Team Skills and Hiring#
- What skills are needed for each approach?
- Hiring costs (NLP engineer vs linguist vs ML engineer)
- Training existing team vs hiring specialists
- Consulting vs in-house development
6. Technology Maturity Assessment#
- Current state of Chinese text simplification (2026)
- Projected improvements (2027-2029)
- When to wait vs build now
- Vendor landscape (commercial APIs, if any)
Research Questions#
At what scale does neural become cost-effective?
- Volume thresholds: 1K, 10K, 100K texts/month
- Quality requirements: 70%, 80%, 90%+ accuracy
- Time horizons: 1 year, 3 years, 5 years
What are the risks of automated simplification?
- Meaning drift: 5-10% of sentences change meaning subtly
- Unnatural output: 10-15% sound awkward
- Over-simplification: 5% become too simple (lose nuance)
- Under-simplification: 10-20% remain too complex
- Mitigation: Human review, conservative rules, hybrid approach
When should you wait for better libraries?
- If budget < $10K → wait 1-2 years, libraries may mature
- If volume < 500 texts/month → manual editing may be cheaper
- If accuracy needs > 95% → wait or use hybrid with heavy editing
What’s the competitive advantage of building now?
- First-mover advantage in language learning apps
- Custom domain adaptation (medical, legal)
- Data moat (collect user feedback, improve over time)
Estimated Time#
3-4 hours for strategic analysis and ROI modeling
Deliverables (Planned)#
- build-vs-buy-viability.md: 3-year TCO comparison
- roi-analysis.md: Break-even models, cost scenarios
- risk-assessment.md: What can go wrong, mitigation strategies
- team-skills.md: Hiring and training considerations
- technology-maturity.md: Market state, future outlook
- recommendation.md: FINAL STRATEGIC RECOMMENDATIONS
Status: Outline created, detailed research pending
Next session: Start after S3-need-driven is complete
Final output: Complete 4PS research package with strategic go/no-go decision
S4-strategic: Final Recommendations#
Strategic Go/No-Go Decision#
BUILD NOW (Strong recommendation)#
✅ Build if:
- Volume > 500 texts/month
- Budget > $15K Year 1
- Have mid-level dev + Chinese speaker
- Can tolerate 75-85% accuracy
Expected outcome: 60-80% cost savings vs manual over 3 years
WAIT 2-3 YEARS (Conditional)#
⏸️ Wait if:
- Volume < 300 texts/month (manual cheaper)
- Budget < $10K (can’t build properly)
- Need > 95% accuracy (tech not ready)
- No technical capability
Risk: Competitors build data moats, miss market window
NEVER BUILD (Manual forever)#
❌ Don’t build if:
- Volume < 100 texts/month
- Niche domain (classical Chinese, legal, medical)
- Can’t accept ANY errors (high-stakes publishing)
- Short-term project (< 18 months)
Technology Maturity Assessment (2026)#
Current state:
- ❌ No pip-installable simplification libraries
- ✅ Building blocks mature (jieba, OpenCC, HanLP)
- ✅ Training data available (MCTS: 691K pairs)
- ⚠️ Neural models work but need ML expertise
2-3 year outlook (2028-2029):
- Possible: Turnkey libraries emerge (50% chance)
- Likely: Commercial APIs (Chinese equivalent of Rewordify)
- Certain: Better pre-trained models (easier fine-tuning)
Strategic implication: Early movers (2026-2027) have 2-3 year advantage, but risk is higher
Risk Assessment#
Technical Risks#
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Rule-based plateaus at 70% | 60% | Medium | Plan for hybrid from start |
| Neural meaning drift (5-10%) | 80% | High | Human review on critical content |
| Jieba segmentation errors cascade | 40% | Medium | Custom dictionary, validation |
| HSK coverage drift over time | 30% | Low | Quarterly updates |
Business Risks#
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Volume lower than expected | 40% | High | Start manual, automate at 500/month |
| Accuracy not good enough | 30% | Medium | Hybrid approach (fallback to human) |
| Technology improves rapidly | 50% | Low | Incremental build (not all-in upfront) |
| Competitors build better solution | 30% | Medium | Focus on domain data (your moat) |
Mitigation Strategy#
- Start small: Rule-based MVP ($15K), not full neural ($70K)
- Validate early: 100 test simplifications before full rollout
- Build incrementally: MVP → rules → hybrid → neural (staged)
- Human-in-loop: Review high-stakes content regardless of automation
Recommended Path (Most Teams)#
Phase 1: MVP (Weeks 1-4)#
- Approach: Rule-based
- Investment: $10K-15K
- Goal: 70-75% success rate on 100 test texts
- Decision point: If success rate < 65%, reconsider
Phase 2: Production (Months 2-3)#
- Approach: Harden MVP, add monitoring
- Investment: $5K-10K
- Goal: Handle 1K texts/month reliably
- Decision point: If volume > 5K/month, plan hybrid
Phase 3: Scale (Months 4-6)#
- Approach: Add neural for complex cases (hybrid)
- Investment: $15K-25K
- Goal: 85-90% success rate
- Decision point: If still not good enough, consider full neural
Phase 4: Optimize (Months 7-12)#
- Approach: Fine-tune on YOUR domain data
- Investment: $5K-10K
- Goal: 90%+ success rate, < 500ms latency
- Decision point: Maintenance mode (ongoing curation only)
Total Year 1: $35K-60K
Year 2-3: $5K-10K/year maintenance
Competitive Positioning#
First-Mover Advantages (2026-2027)#
- Data moat: Collect user feedback → improve model
- Market share: Early users sticky (switching costs)
- Learning curve: 2-3 years to get good (others behind)
Window closes: 2028-2029 (when turnkey solutions may emerge)
Defensibility#
Weak defense: Generic HSK simplification (others can replicate)
Strong defense: Domain-specific (medical Chinese, business Chinese, kids’ books)
Recommendation: Build generic MVP, specialize by domain for moat
Team & Skills#
Minimum Viable Team#
Rule-based:
- 1 mid-level Python developer (3 weeks)
- 1 native Chinese speaker for validation (1 week)
- Total: ~$10K-15K
Hybrid:
- Above + 1 ML engineer (2-3 weeks)
- Total: ~$25K-40K
Full neural:
- 1 senior ML engineer (6-8 weeks)
- 1 Chinese linguist (2 weeks)
- Infrastructure engineer (1 week)
- Total: ~$50K-80K
Build vs Hire vs Outsource#
- If you have in-house devs: Build (cheapest)
- If you hire contractors: 2-3x cost multiplier
- If you outsource fully: 3-5x cost, quality risk
Recommendation: Hire 1 full-time if volume > 5K/month, contract otherwise
Final Verdict#
For Language Learning Apps#
✅ BUILD rule-based MVP now (2-4 weeks, $15K)
- ROI positive at 500+ texts/month
- Iterate to hybrid if needed
- Expected savings: 60-80% vs manual
For Government/Accessibility#
⚠️ BUILD hybrid with mandatory review (3 months, $50K)
- Need 90%+ accuracy (automation alone insufficient)
- Auditability critical (use rule-based as primary)
- Expected savings: 30-50% vs manual
For Publishers#
✅ BUILD neural + editorial workflow (4-6 months, $80K)
- Need 95%+ accuracy (editors refine AI output)
- Volume justifies investment (1K+ texts/year)
- Expected savings: 40-60% vs full manual
For AI Tutoring#
✅ BUILD optimized neural (3 months, $50K)
- Volume is high (10K+/day)
- Latency matters (< 500ms)
- Expected ROI: Enables product (not just cost savings)
The Strategic Question#
“Should I build Chinese text simplification in 2026?”
Answer: YES, if volume > 500 texts/month and budget > $15K
The technology is immature but viable. Early movers (2026-2027) will build data moats. Waiting 2-3 years reduces risk but loses competitive advantage.
Start with rule-based MVP (low risk, fast validation). Iterate to neural only if volume and accuracy requirements justify it.
The window is open: Build now (2026-2028) or wait until turnkey solutions exist (2029+).
Research Complete#
This concludes the 4PS research on Chinese Text Simplification Libraries.
Key deliverables:
- S1: Library landscape (no turnkey solutions exist)
- S2: Neural approach viable (mT5 + LoRA on MCTS)
- S3: Use case mapping (rule-based → hybrid → neural)
- S4: Strategic recommendation (BUILD for most teams)
Next steps: Implementation (use S1-S2 as technical guide, S3-S4 for decision support)
ROI Analysis: Build vs Wait vs Manual#
3-Year TCO Comparison#
Scenario: 2K texts/month (typical language learning app)#
| Approach | Year 1 | Year 2 | Year 3 | Total | Notes |
|---|---|---|---|---|---|
| Manual editing | $36K | $36K | $36K | $108K | Baseline |
| Rule-based | $15K | $5K | $5K | $25K | 77% savings |
| Hybrid | $30K | $8K | $8K | $46K | 57% savings, better quality |
| Full neural | $60K | $15K | $15K | $90K | 17% savings, best quality |
Verdict: Rule-based or hybrid (neural not justified at this volume)
Scenario: 10K texts/month (large platform)#
| Approach | Year 1 | Year 2 | Year 3 | Total | Notes |
|---|---|---|---|---|---|
| Manual editing | $180K | $180K | $180K | $540K | Baseline |
| Rule-based | $20K | $5K | $5K | $30K | 94% savings, quality plateau |
| Hybrid | $40K | $10K | $10K | $60K | 89% savings |
| Full neural | $70K | $20K | $20K | $110K | 80% savings, best quality |
Verdict: Hybrid or neural (savings justify investment)
Break-Even Timeline#
Rule-based:
- Payback: 6-9 months at 1K texts/month
- Payback: 3-4 months at 5K texts/month
Hybrid:
- Payback: 12-18 months at 2K texts/month
- Payback: 6-8 months at 10K texts/month
Neural:
- Payback: 18-24 months at 5K texts/month
- Payback: 8-12 months at 20K texts/month
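These payback periods follow from comparing cumulative automation cost against cumulative manual-editing cost. A sketch, where the development cost, monthly automation cost, and per-text manual cost are illustrative assumptions rather than measured figures:

```python
def break_even_month(dev_cost, monthly_auto, texts_per_month, manual_cost_per_text):
    """First month where cumulative automation cost drops to or below
    cumulative manual cost. Returns None if never within 5 years."""
    manual_monthly = texts_per_month * manual_cost_per_text
    if monthly_auto >= manual_monthly:
        return None  # automation never cheaper at this volume
    for month in range(1, 61):
        if dev_cost + month * monthly_auto <= month * manual_monthly:
            return month
    return None

# Rule-based at 1K texts/month, assuming ~$3/text manual cost and
# ~$400/month automated running cost (both hypothetical inputs):
print(break_even_month(dev_cost=15_000, monthly_auto=400,
                       texts_per_month=1_000, manual_cost_per_text=3.0))  # 6
```

Changing the per-text manual cost or the volume shifts the result, which is why the payback ranges above vary with scale.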
Risk-Adjusted ROI#
Optimistic (90th percentile)#
- Development faster than expected
- Quality better than expected
- Maintenance costs lower
Result: ROI +50% better
Realistic (50th percentile)#
- Numbers as stated above
- Some rework needed
- Expected maintenance
Result: ROI as modeled
Pessimistic (10th percentile)#
- Development 2x longer
- Quality requires more iteration
- Hidden maintenance costs
Result: ROI -50%, may not break even until Year 2
Mitigation: Start with rule-based MVP (lower risk, faster validation)
Competitive Advantage Analysis#
Build Now (2026)#
Advantages:
- First-mover in nascent market
- Collect user feedback data → improve model
- Data moat (your domain-specific corpus)
- Control over quality/latency
Disadvantages:
- Technology immature (no turnkey libraries)
- Must build custom solution
- Ongoing maintenance burden
Wait 2-3 Years (2028-2029)#
Advantages:
- Mature libraries may emerge
- Commercial APIs possible (like English has)
- Learn from others’ mistakes
- Lower development cost
Disadvantages:
- Competitors already have 2-3 years of data
- Miss early market opportunity
- May still need custom solution (libraries might not fit your use case)
Never Build (Manual Forever)#
Advantages:
- No technical risk
- Editors can handle edge cases
- Quality ceiling is higher
Disadvantages:
- 5-10x more expensive at scale
- Can’t scale to 100K+ texts/month
- Latency (humans need hours/days)
Strategic Decision Framework#
BUILD NOW if:#
- Volume > 500 texts/month (automation ROI positive)
- Need latency < 1 hour (humans too slow)
- Have budget ($15K+ Year 1)
- Technical capability (mid-level dev + Chinese speaker)
WAIT 2-3 YEARS if:#
- Volume < 500 texts/month (manual cheaper)
- Budget < $10K (can’t build properly)
- No technical team (can’t maintain)
- Accuracy needs > 95% (technology not ready)
MANUAL FOREVER if:#
- Volume < 100 texts/month
- High-stakes content (legal, medical) where errors unacceptable
- Domain too niche (no training data exists)
Investment Priorities#
If budget is $15K (rule-based):
- $8K: Development (2-3 weeks, mid-level dev)
- $3K: HSK vocabulary + synonym dictionary curation
- $2K: Testing + validation (50-100 samples)
- $2K: Deployment + infrastructure
If budget is $40K (hybrid):
- $15K: Rule-based foundation
- $10K: Neural model training + integration
- $8K: Testing + human evaluation
- $7K: Infrastructure + monitoring
If budget is $70K (full neural):
- $25K: Model training (mT5/mBART on MCTS)
- $15K: Data preparation + fine-tuning
- $12K: Evaluation + iteration
- $10K: Production deployment
- $8K: Infrastructure (GPU inference)
Hidden Costs to Budget#
Ongoing curation (10-20% of Year 1 cost annually)
- HSK vocabulary updates (3.0 migration)
- New slang, technical terms
- User-reported errors
Infrastructure scaling
- 10K → 100K texts/month: 10x compute cost
- Budget $500-2K/month for hosting at scale
Quality drift
- Models degrade over time (language evolves)
- Re-train every 12-18 months (~$5K-10K)
Support & monitoring
- On-call for failures
- Debugging edge cases
- A/B testing improvements
Total ongoing: 20-30% of Year 1 cost per year
Scenarios Where ROI is Negative#
- Volume too low: < 300 texts/month (manual cheaper)
- Accuracy too high: Need 98%+ (humans required anyway)
- No technical team: Outsource development + maintenance = 3x cost
- Domain too niche: Legal Chinese, classical Chinese (no training data)
- Short-term project: < 18 months (won’t reach break-even)
Expected Value Calculation#
Language learning app scenario (2K texts/month):
| Outcome | Probability | Cost (3yr) | Savings vs Manual | Expected Value |
|---|---|---|---|---|
| Success (rule-based works) | 70% | $25K | $83K | +$58K |
| Partial (need hybrid) | 25% | $46K | $62K | +$16K |
| Failure (revert to manual) | 5% | $15K + $108K | -$15K | -$1K |
| Expected | 100% | | | +$73K |
Verdict: Strong positive expected value (build rule-based MVP)
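The expected value in the table is a straight probability-weighted sum of the savings column, which can be reproduced directly:

```python
# Outcomes from the table above: (label, probability, 3-year savings vs manual, $K)
outcomes = [
    ("success (rule-based works)", 0.70,  83),
    ("partial (need hybrid)",      0.25,  62),
    ("failure (revert to manual)", 0.05, -15),
]
ev = sum(p * savings for _, p, savings in outcomes)
print(f"Expected value: +${ev:.0f}K")  # +$73K
```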
Strategic Recommendations#
- Most teams: Start rule-based, iterate to hybrid if needed
- Large platforms: Go straight to hybrid (skip learning phase)
- Publishers: Neural + editorial (quality matters most)
- Startups: Wait until PMF, then automate (manual until 500 texts/month)
The mistake: Jumping to neural too early (before you understand the problem) The opportunity: Building now while market is nascent (2026-2028 window)