1.148.2 Classical Chinese Parsing#
Comprehensive analysis of tools and approaches for parsing Classical Chinese (文言文) texts. Covers general-purpose NLP tools (Stanford CoreNLP, HanLP), specialized segmentation tools (Jiayan), corpus resources (ctext.org API), and strategies for building custom Classical Chinese NLP pipelines. Includes technical feasibility, use case analysis, market assessment, and strategic development paths.
Explainer
Classical Chinese Parsing: Domain Explainer#
What This Solves#
The Problem: Classical Chinese (文言文) - the written language used in China for over 2,000 years - is fundamentally different from modern Chinese. When you look at a classical text, there are no spaces between words, grammar follows different rules, and the vocabulary includes many archaic terms. While billions of characters of classical Chinese text exist in libraries and digital archives, we don’t have good automated tools to help people read, analyze, or search these texts effectively.
Who Encounters This:
- University students learning Classical Chinese (tens of thousands globally)
- Researchers studying Chinese history, philosophy, and literature
- Libraries and museums digitizing historical documents
- Translators working with historical texts
- Educational technology companies building Chinese learning tools
Why It Matters: Classical Chinese texts contain millennia of historical records, philosophical thought, and cultural heritage. Without better tools to parse and analyze these texts, they remain largely inaccessible to modern readers. It’s like having millions of books in a library with no catalog system - the knowledge is there, but finding and understanding it is painfully slow.
Accessible Analogies#
The Space Between Words#
Modern English uses spaces to separate words: “The quick brown fox jumps.” If you removed all the spaces - “Thequickbrownfoxjumps” - you’d need to figure out where one word ends and another begins. That’s what readers face with Classical Chinese, except it’s even harder because the word-boundary rules are different from modern Chinese.
Think of it like this: imagine trying to read Old English texts from 1,000 years ago, written without spaces, using grammar rules that changed over centuries - "hwæthewiledon" instead of "what he wants to do." You'd need specialized knowledge to parse it correctly.
Classical Chinese parsing is building the automated tools that can figure out those word boundaries and grammatical relationships - like having a smart assistant who’s read thousands of classical texts and can help you understand new ones.
The Grammar Evolution Problem#
Language changes over time. The way sentences are structured in Shakespeare’s English differs from modern English. Now imagine if the change was even more dramatic - different sentence patterns, different function words, and texts spanning 2,000 years of gradual evolution.
Modern Chinese NLP tools are trained on contemporary texts (like news articles from the 1990s-2020s). Asking them to parse Classical Chinese is like asking a tool trained on modern English to parse Old English or Latin. The surface similarity (both use Chinese characters) hides deep structural differences.
The parsing challenge is teaching computers to understand grammar patterns that were common in 500 BCE but rare or absent in modern usage.
The Missing Dictionary Problem#
In most languages, you can look up words in a dictionary to find their meaning. But in Classical Chinese, determining what counts as a “word” is itself a challenge. Is “降大任” (confer great responsibility) one word or three? Different contexts might parse it differently.
It’s like trying to build a search engine without knowing what counts as a searchable term. You can search for individual characters, but users want to search for concepts and phrases - and the computer doesn’t know where those phrase boundaries are.
Parsing solves the “what counts as a word?” problem, enabling accurate search, translation, and analysis.
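To make the word-boundary ambiguity concrete, here is a toy forward-maximum-matching segmenter. The lexicon, function name, and two-character limit are all illustrative assumptions, not the behavior of any real tool:

```python
# A toy lexicon; real systems derive this from corpora and expert annotation
LEXICON = {"學", "而", "時", "習", "之", "不", "亦", "說", "乎", "降", "大任"}

def fmm_segment(text, lexicon, max_len=2):
    """Greedy forward maximum matching: always take the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon or size == 1:
                tokens.append(candidate)  # size == 1 falls back to a single character
                i += size
                break
    return tokens

print(fmm_segment("降大任", LEXICON))  # ['降', '大任'] under this lexicon
```

A different lexicon would yield a different split of the same three characters, which is exactly the "what counts as a word?" problem.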
When You Need This#
Clear Decision Criteria#
You NEED Classical Chinese parsing if:
You’re building educational tools for Classical Chinese learners
- Example: A reading app that highlights words and shows definitions
- Example: A grammar checker for students writing in Classical Chinese
- Why: Without parsing, you can only offer character-by-character help, which isn’t useful
You’re digitizing historical Chinese documents at scale
- Example: A library scanning 10,000 pages of Qing dynasty records
- Example: Creating a searchable database of historical texts
- Why: OCR gives you characters, but parsing makes them searchable and analyzable
You’re building research tools for Chinese studies scholars
- Example: A search engine for finding quotations across classical literature
- Example: Tools for tracking how concepts evolved over dynasties
- Why: Researchers need to find patterns, not just individual characters
You’re developing translation tools or services
- Example: Machine translation of historical documents
- Example: Translation memory for professional translators
- Why: Translation requires understanding sentence structure, not just word-for-word lookup
When You DON’T Need This#
Skip Classical Chinese parsing if:
You’re working only with modern Chinese
- Use modern Chinese NLP tools instead (much more mature)
- Classical Chinese tools won’t help with contemporary text
You need only character-level search
- Simple string search is sufficient for basic lookups
- Parsing adds complexity you don’t need
Your text volume is tiny (< 1,000 pages)
- Manual analysis may be faster and cheaper
- Automation overhead not justified
You have no Classical Chinese domain expertise
- Building these tools requires linguistic knowledge
- Partner with experts rather than building in-house
Trade-offs#
What You’re Choosing Between#
Option 1: Use Existing General Chinese NLP Tools (Stanford CoreNLP, HanLP)#
Pros:
- Production-ready, well-documented
- Large community, lots of examples
- Free and open-source
Cons:
- Poor accuracy on Classical Chinese (60% vs 95% on modern)
- Trained on wrong kind of text (news, not historical)
- Requires expensive retraining to work well
Best for: Organizations already using these tools for modern Chinese who want to experiment with classical texts
Option 2: Build Custom Classical Chinese Parser#
Pros:
- Can achieve 75-85% accuracy (good enough for many uses)
- Optimized for your specific use case
- Full control over features and priorities
Cons:
- 6-24 months development time
- Requires Classical Chinese linguistic expertise
- Ongoing maintenance burden
Best for: Organizations with long-term commitment to classical Chinese tools and in-house expertise
Option 3: Use Specialized Tools (Jiayan for segmentation + ctext.org for corpus)#
Pros:
- Faster time to market (2-6 months)
- Leverages best-available components
- Lower development cost ($50K-150K vs $200K-500K)
Cons:
- Limited to segmentation (not full parsing)
- Dependency on small open-source projects
- May need to build additional components yourself
Best for: Startups, educational tool companies, researchers needing good-enough solution quickly
Option 4: Wait for Commercial Solutions#
Pros:
- No development cost
- Professional support
- Maintained by vendor
Cons:
- May wait years (no major commercial tools exist yet)
- Miss first-mover advantage
- Vendor lock-in risk
Best for: Risk-averse organizations, those with other priorities
Build vs. Buy Reality#
Truth: There’s nothing comprehensive to “buy” yet for Classical Chinese parsing. You’re looking at:
- Build: 6-24 months, $100K-$500K
- Adapt existing tools: 3-12 months, $50K-$200K
- Use basic tools + manual: Ongoing, depends on volume
The market is young enough that building custom solutions is often the only option. The question is how much customization you need.
Self-Hosted vs. Cloud Services#
Self-hosted (Run on your own servers):
- Pros: Data privacy, full control, no per-use fees
- Cons: Infrastructure costs ($2K-10K/year), maintenance burden
- Best for: Institutions with existing infrastructure, sensitive data
Cloud API (ctext.org and similar):
- Pros: No infrastructure, pay-as-you-go, always updated
- Cons: Ongoing costs, rate limits, dependency on vendor
- Best for: Smaller projects, prototypes, variable usage
Hybrid (Local processing + cloud corpus):
- Pros: Balance of control and convenience
- Cons: More complex architecture
- Best for: Production applications with moderate volume
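The hybrid pattern can be sketched as a local cache in front of the remote corpus API, so repeat lookups stay offline. Everything here (cache layout, the `fetch_remote` hook) is a hypothetical illustration, not part of any vendor's SDK:

```python
import json
import os
import tempfile

def get_text_cached(urn, fetch_remote, cache_dir):
    """Return a cached copy of a text if present; otherwise fetch and cache it."""
    os.makedirs(cache_dir, exist_ok=True)
    # Derive a filesystem-safe cache key from the URN
    fname = urn.replace(":", "_").replace("/", "_") + ".json"
    path = os.path.join(cache_dir, fname)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch_remote(urn)  # e.g. a wrapper around a corpus API call
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    return data

# Demo with a stub fetcher instead of a live API call:
calls = []
def stub_fetch(urn):
    calls.append(urn)
    return {"urn": urn, "text": "子曰:學而時習之"}

cache = tempfile.mkdtemp()
first = get_text_cached("ctp:analects/xue-er", stub_fetch, cache)
second = get_text_cached("ctp:analects/xue-er", stub_fetch, cache)
print(len(calls))  # 1: the second read came from the local cache
```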
Cost Considerations#
Pricing Models#
If you build in-house:
Development costs:
- Minimal (segmentation only): $25K-75K (3-6 months, 1 developer)
- Full pipeline (POS + parsing + NER): $200K-500K (12-24 months, 2-3 developers + linguist)
Operating costs (annual):
- Infrastructure: $2K-15K (depending on usage volume)
- Maintenance: $20K-50K (10-20 hours/month engineering time)
- Data/API access: $60-500 (ctext.org subscription tiers)
If you buy/license (when available):
- Per-user subscriptions: $5-15/month per user (for tools like Pleco)
- Enterprise licensing: $500-5,000/year for institutions
- API usage: $0.01-0.10 per 1,000 characters processed (estimated)
Break-Even Analysis#
Example: Building a Reading Assistant for Students
Development: $100K (6 months, small team)
Operating: $25K/year (hosting + maintenance)
Total Year 1: $125K
Revenue Scenarios:
- Conservative: 500 users @ $10/month = $60K/year → 3 years to break even
- Moderate: 2,000 users @ $10/month = $240K/year → 6 months to break even
- Optimistic: 5,000 users @ $10/month = $600K/year → Profitable immediately
Key variables: User acquisition cost, conversion rate, churn rate
Market size constraint: Global Classical Chinese student population is 50K-200K, so market ceiling is real.
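The scenarios above reduce to simple arithmetic. This sketch just restates them (all figures are taken from the example, not a real pricing model):

```python
def months_to_break_even(dev_cost, annual_op_cost, users, price_per_month):
    """Months until cumulative revenue covers development plus operating costs."""
    monthly_margin = users * price_per_month - annual_op_cost / 12
    if monthly_margin <= 0:
        return None  # never breaks even at this scale
    return dev_cost / monthly_margin

# Figures from the example: $100K development, $25K/year operating, $10/user/month
for users in (500, 2000, 5000):
    months = months_to_break_even(100_000, 25_000, users, 10)
    print(users, "users:", "never" if months is None else f"{months:.1f} months")
```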
Hidden Costs#
Training Data Creation: If you need annotated corpus for ML models:
- Cost: $50-150 per hour for expert annotation
- Volume needed: 1,000-5,000 sentences for basic model
- Total: $20K-60K for quality training data
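As a sanity check on those figures, the arithmetic is straightforward; the ~10 sentences/hour annotation throughput below is my assumption, not a number from the text:

```python
def annotation_cost(sentences, sentences_per_hour, hourly_rate):
    """Total cost of expert annotation: hours needed times the hourly rate."""
    hours = sentences / sentences_per_hour
    return hours * hourly_rate

# 5,000 sentences at ~10 sentences/hour and $100/hour:
print(annotation_cost(5000, 10, 100))  # 50000.0, inside the $20K-60K range above
```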
Domain Expertise: You need linguists who understand Classical Chinese:
- Consultant rate: $100-200/hour
- Time needed: 40-200 hours over project
- Total: $4K-40K
Maintenance and Evolution:
- Model retraining: As you get more data, models improve
- Bug fixes: Edge cases emerge in production
- Feature expansion: Users request new capabilities
- Budget: 20-30% of initial development cost annually
Implementation Reality#
Realistic Timeline Expectations#
Minimal viable product (reading assistant with segmentation):
- Planning: 2-4 weeks
- Development: 8-12 weeks
- Testing: 2-4 weeks
- Total: 3-5 months
Production-ready platform (full NLP pipeline):
- Year 1: Segmentation + POS tagging + basic UI
- Year 2: Parsing + NER + research tools
- Year 3: Refinement + scale + community
- Total: 2-3 years to maturity
Don’t expect: Overnight solutions. This is a research problem being turned into engineering.
Team Skill Requirements#
Minimum team:
- 1 NLP engineer: Python, ML frameworks (spaCy, PyTorch)
- 1 Classical Chinese linguist: Part-time consultant
- 1 full-stack developer: For UI/API (if building product)
Ideal team:
- 2 NLP engineers: Faster development, knowledge redundancy
- 1 linguist: Ongoing consultation and validation
- 1 product manager: If commercial product
- 1 full-stack developer: For polished UX
Skills gap to watch:
- Finding someone with BOTH NLP skills AND Classical Chinese knowledge is rare
- Budget for training or two-person pairing
Common Pitfalls and Misconceptions#
Pitfall 1: “Modern Chinese tools will mostly work”
- Reality: They’ll get 60-70% accuracy. That sounds okay but creates frustrating user experience.
- Mitigation: Test early, be prepared to build custom solution
Pitfall 2: “We can train a model with just a few hundred examples”
- Reality: Deep learning needs thousands of examples. Rule-based or hybrid approaches needed with limited data.
- Mitigation: Start rule-based, incrementally add ML as you gather data
Pitfall 3: “Once it’s 85% accurate, we’re done”
- Reality: The last 10-15% accuracy takes 80% of the time. And users notice errors.
- Mitigation: Design UX that gracefully handles errors (easy correction, confidence scores)
Pitfall 4: “This is a software problem”
- Reality: It’s a linguistics problem that requires software. Domain expertise is critical.
- Mitigation: Partner with Classical Chinese scholars from day one
Pitfall 5: “We’ll build everything ourselves”
- Reality: Leveraging existing tools (Jiayan, ctext.org) is much faster
- Mitigation: Build on existing foundations, contribute improvements back
First 90 Days: What to Expect#
Weeks 1-4: Foundation
- Set up development environment
- Integrate Jiayan for segmentation
- Set up ctext.org API access
- Create test dataset (100-200 sentences manually annotated)
- Deliverable: Proof-of-concept that can segment sample texts
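For the manually annotated test set, it helps to fix an evaluation metric up front. One common choice is boundary-level F1; this sketch (function names are my own) scores a predicted segmentation against a gold annotation:

```python
def boundary_set(tokens):
    """Convert a token list into the set of cut positions (character offsets)."""
    cuts, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return cuts

def segmentation_f1(gold, predicted):
    """Boundary-level F1 between a gold and a predicted segmentation."""
    g, p = boundary_set(gold), boundary_set(predicted)
    if not g and not p:
        return 1.0  # both treat the sentence as a single token
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["學", "而", "時", "習", "之"]
pred = ["學而", "時", "習", "之"]  # one wrong merge
print(round(segmentation_f1(gold, pred), 2))  # 0.86
```

The same function can track the 75-80% accuracy target at the 90-day mark.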
Weeks 5-8: Core Features
- Build basic POS tagger (rule-based)
- Create simple web UI for testing
- Test with 10-20 beta users (students/researchers)
- Gather feedback on accuracy and usability
- Deliverable: Alpha version testable by friendly users
Weeks 9-12: Refinement
- Fix bugs found by beta users
- Improve accuracy based on error analysis
- Add most-requested features
- Prepare for broader beta launch
- Deliverable: Beta ready for 100+ users
What success looks like at 90 days:
- 75-80% segmentation accuracy on test set
- 10-20 engaged beta users providing feedback
- Clear understanding of what features matter most
- Validated technical approach (or clear pivot if not working)
What to worry about if:
- Accuracy is below 70% (need to rethink approach)
- Users aren’t finding it useful (product-market fit issue)
- Technical debt is piling up (need to refactor)
- Costs are exceeding budget (scope creep)
Executive Recommendation#
For CTOs and Engineering Directors:
If you’re considering Classical Chinese parsing for your product or research infrastructure:
Green light ✅ if you have:
- $100K-500K budget over 12-24 months
- Access to Classical Chinese linguistic expertise
- Long-term commitment (not a side project)
- Realistic expectations (niche market, not hockey-stick growth)
Yellow light ⚠️ if you’re:
- Hoping for quick ROI (this is a 2-3 year play)
- Expecting 95%+ accuracy immediately (80-85% is achievable goal)
- Treating it as pure software problem (linguistics expertise required)
Red light ❌ if you:
- Need production-ready tool tomorrow (doesn’t exist)
- Can’t commit resources for 12+ months (won’t reach viability)
- Have no Classical Chinese domain access (will fail without expertise)
- Expect huge market (it’s niche, be realistic)
Bottom line: This is a solvable problem with real user needs and achievable technical solutions. The market is small but underserved, and the timing is good (gap exists, components available). For organizations with the right expectations and resources, it’s a viable opportunity that could have significant academic or commercial impact.
The field needs someone to build this. The question is: Is it you?
S1: Rapid Discovery
S1-Rapid: Approach#
Evaluation Method#
Quick assessment of available libraries for Classical Chinese parsing based on:
- Active Maintenance - Is the project still maintained? Recent releases?
- Core Capabilities - Does it address Classical Chinese parsing, segmentation, and analysis?
- Ease of Access - Available via PyPI, Maven, or direct download? Clear documentation?
- First Impressions - Documentation quality, community size, obvious strengths/weaknesses
Libraries Evaluated#
- Stanford CoreNLP - Java-based NLP toolkit with Chinese support
- ctext.org API - Chinese Text Project API for Classical Chinese corpus access
Time Box#
45-60 minutes per library for initial investigation. Focus on:
- What it claims to do
- How mature it appears
- Whether it handles Classical Chinese (文言文) specifically
- Tokenization and parsing capabilities
Decision Criteria#
- Can it segment/tokenize Classical Chinese text?
- Does it provide dependency parsing for Classical Chinese?
- Can it handle pre-modern Chinese grammar structures?
- Is it production-ready?
- Python/Java support quality
ctext.org API#
Overview#
The Chinese Text Project (ctext.org) is a digital library providing comprehensive access to pre-modern Chinese texts. The API provides programmatic access to Classical Chinese texts, metadata, and basic text analysis capabilities.
Classical Chinese Support#
Status: Native Classical Chinese support
- Corpus Access: Direct access to thousands of pre-modern Chinese texts
- Text Retrieval: Retrieve texts by work, chapter, or passage
- Basic Analysis: Character frequency, text search, and basic segmentation
- Metadata: Author, date, text category information
Key Features#
- Massive Classical Chinese corpus (thousands of works)
- RESTful API with JSON responses
- Text retrieval by URN (Uniform Resource Name)
- Character-level and basic word-level segmentation
- Full-text search capabilities
- Traditional and Simplified character support
Strengths#
- Authentic Classical Chinese: Texts are pre-modern sources
- Comprehensive corpus: Extensive coverage of classical literature
- Easy access: Simple HTTP API, no complex installation
- Free tier available: Basic access without authentication
- Well-documented: Clear API documentation with examples
Limitations for Parsing#
- Limited NLP: Not a full parsing toolkit
- No dependency parsing: No syntactic analysis beyond segmentation
- No POS tagging: Part-of-speech information not provided
- API-dependent: Requires internet connection and API availability
- Rate limits: Free tier has usage restrictions
Availability#
- Website: https://ctext.org/tools/api
- API Endpoint: https://api.ctext.org/
- License: Various (depends on text, API terms of service)
- Authentication: Optional (free tier available, paid tiers for higher limits)
Initial Assessment#
Excellent resource for Classical Chinese corpus access, but not a parsing toolkit. Best used as a data source for text retrieval and research. Would need to be combined with other tools for syntactic analysis.
API Example#
import requests
# Get text from the Analects
response = requests.get('https://api.ctext.org/gettext', params={
    'urn': 'ctp:analects/xue-er',
    'if': 'en'
})
data = response.json()
# Returns text with English translation
S1-Rapid: Recommendation#
Summary#
After rapid evaluation of available tools for Classical Chinese parsing:
| Library | Parsing Capability | Classical Chinese Support | Ease of Use |
|---|---|---|---|
| Stanford CoreNLP | ✓✓✓ Strong (modern) | ✗ Limited | ✓✓ Moderate |
| ctext.org API | ✗ Minimal | ✓✓✓ Native | ✓✓✓ Easy |
Key Findings#
- No ready-made solution: There is no production-ready parsing toolkit specifically designed for Classical Chinese
- Modern vs Classical gap: Tools trained on modern Chinese (Stanford CoreNLP) struggle with Classical Chinese grammar
- Corpus vs Parser distinction: ctext.org provides excellent corpus access but no syntactic parsing
Immediate Recommendation#
For corpus access and basic segmentation: Use ctext.org API
- Best for text retrieval and research
- Native Classical Chinese support
- Easy integration via HTTP API
For deeper parsing needs: Requires custom solution
- Consider fine-tuning Stanford CoreNLP on Classical Chinese corpus
- Explore specialized academic tools (need S2 comprehensive search)
- May need to build domain-specific parser
Next Steps for S2#
- Search for academic tools: Check if universities have specialized Classical Chinese parsers
- Investigate Chinese domestic tools: Tools like Jiayan (嘉言) or other 文言文-specific libraries
- Explore transfer learning: Can modern Chinese parsers be adapted?
- Consider rule-based approaches: Classical Chinese grammar is well-documented; rule-based parsing might be viable
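To illustrate why rule-based methods are plausible: classical grammatical particles form a small closed class, so even a trivial lookup captures useful structure. The particle list and tag names below are illustrative only:

```python
# Common 文言文 grammatical particles (an illustrative, incomplete set)
PARTICLES = set("也矣乎焉哉者之而於")

def tag_function_words(text):
    """Label each character as PART (particle) or CONT (content-word candidate)."""
    return [(ch, "PART" if ch in PARTICLES else "CONT") for ch in text]

print(tag_function_words("學而時習之不亦說乎"))
```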
Confidence Level#
Low - Classical Chinese parsing appears to be an underserved niche. More comprehensive research needed.
Stanford CoreNLP#
Overview#
Stanford CoreNLP is a comprehensive Java-based NLP toolkit developed by the Stanford NLP Group. It provides a wide range of natural language analysis tools including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference resolution.
Classical Chinese Support#
Status: Partial support through Chinese language models
- Tokenization: Chinese word segmentation available
- POS Tagging: Chinese part-of-speech tagging supported
- Dependency Parsing: Chinese dependency parsing available
- Classical Chinese: Not specifically optimized for pre-modern Chinese (文言文)
Key Features#
- Multi-language support (including Simplified and Traditional Chinese)
- Modular pipeline architecture
- Well-documented Java API
- Python wrapper available (stanfordnlp/stanza)
- Trained on modern Chinese corpora (CTB - Chinese Treebank)
Strengths#
- Mature and stable: Widely used in academic and industry settings
- Comprehensive: Full NLP pipeline from tokenization to dependency parsing
- Actively maintained: Regular updates from Stanford NLP Group
- Strong parsing: State-of-the-art dependency parsing for modern Chinese
Limitations for Classical Chinese#
- Modern Chinese focus: Models trained primarily on contemporary texts
- Grammar differences: Classical Chinese grammar differs significantly from modern
- Word boundaries: Classical Chinese word segmentation rules differ
- No specialized models: No pre-trained models specifically for 文言文
Availability#
- Repository: https://github.com/stanfordnlp/CoreNLP
- License: GPL v3+
- Installation: Maven Central, or direct download
- Dependencies: Java 8+
Initial Assessment#
Good general-purpose NLP toolkit, but would require retraining or fine-tuning for Classical Chinese. Better suited for modern Chinese text analysis.
S2: Comprehensive
S2-Comprehensive: Approach#
Evaluation Method#
Comprehensive analysis of Classical Chinese parsing tools, expanding beyond S1 to include:
- Academic Tools - University research projects and papers
- Chinese Domestic Libraries - Tools from Chinese institutions
- Specialized Classical Chinese Tools - 文言文-specific parsers
- Hybrid Approaches - Combinations of rule-based and ML methods
Extended Library Search#
In addition to S1 libraries:
- Jiayan (嘉言) - Specialized Classical Chinese NLP library
- THUOCL - Tsinghua Open Chinese Lexicon with classical support
- Academic parsers - Research implementations from papers
- Rule-based systems - Grammar-driven parsers
Deep Dive Criteria#
For each tool:
- Technical architecture: Rule-based vs ML, training data used
- Feature completeness: Tokenization, POS, parsing, NER coverage
- Performance metrics: Accuracy on Classical Chinese benchmarks (if available)
- Integration complexity: Installation, dependencies, API quality
- Training data: What corpus was used? Modern vs Classical?
- Maintainability: Last update, issue responsiveness, community activity
Feature Comparison Matrix#
Will create comprehensive comparison across:
- Tokenization/segmentation accuracy for Classical Chinese
- Part-of-speech tagging support
- Dependency/constituency parsing
- Named entity recognition (historical names, places, titles)
- License and cost
- Language support (Python, Java, etc.)
Time Box#
2-3 hours for comprehensive search and documentation
ctext.org API (Comprehensive)#
Architecture#
Type: Web-based corpus access API with basic NLP features
Data Source: Chinese Text Project digital library (~30,000 texts)
Coverage: Pre-Qin to Qing dynasty texts
Detailed Capabilities#
Text Retrieval#
- URN-based access: Canonical references (e.g., ctp:analects/xue-er)
- Granularity: Work, chapter, paragraph, or character-level
- Formats: JSON, XML, plain text
- Editions: Multiple editions of same text available
Segmentation#
- Character-level: Always available
- Word-level: Basic segmentation provided
- Quality: Variable - based on text formatting and punctuation
- Classical Chinese fit: Good - designed for pre-modern texts
Search Capabilities#
- Full-text search: Across entire corpus or specific works
- Regex support: Pattern matching for Classical Chinese
- Parallel texts: Search across translations simultaneously
- Context: Results include surrounding text for context
Metadata#
- Author information: Biographical data
- Dating: Text dating (with uncertainty indicators)
- Categories: Genre classification (philosophy, history, poetry, etc.)
- Relationships: Quotations, references between texts
API Endpoints#
Core Endpoints#
GET /gettext - Retrieve text by URN
GET /searchtexts - Full-text search
GET /gettextinfo - Metadata about a text
GET /getanalysis - Basic linguistic analysis
Example: Retrieve Analects passage#
import requests
response = requests.get('https://api.ctext.org/gettext', params={
    'urn': 'ctp:analects/xue-er/1',
    'if': 'en'
})
data = response.json()
# {
# "text": "子曰:「學而時習之,不亦說乎?」",
# "translation": "The Master said, \"Is it not pleasant to learn...\""
# }
Pricing Model#
Free Tier#
- 100 API calls per day
- Basic text retrieval
- Search functionality
Academic ($5/month)#
- 10,000 API calls per day
- Advanced features
- Priority support
Commercial (Custom)#
- Unlimited API calls
- Bulk data access
- Custom integrations
Limitations for Parsing#
- No syntactic analysis: Does not provide parse trees or dependency graphs
- No POS tagging: Part-of-speech information not available
- Basic segmentation only: Word boundaries are approximate
- API dependency: Requires internet connection
- Rate limits: Free tier may be restrictive for large projects
- No offline mode: Cannot process texts without API access
Integration Patterns#
As Training Data Source#
# Download corpus passages for training a custom parser,
# using the gettext endpoint shown above
import requests

def download_corpus(urns):
    """Fetch texts for a list of ctext URNs (e.g. 'ctp:analects/xue-er')."""
    texts = []
    for urn in urns:
        response = requests.get('https://api.ctext.org/gettext', params={'urn': urn})
        response.raise_for_status()
        texts.append(response.json())
    return texts
As Reference Database#
# Look up historical context
# (search_texts and analyze_occurrences are placeholder helpers you would write)
def get_historical_context(character_name):
    results = search_texts(character_name)  # e.g. wraps the searchtexts endpoint
    return analyze_occurrences(results)     # custom analysis step
Combined with Other Tools#
# Use ctext for data, Stanford CoreNLP for parsing
text = ctext.get_text('ctp:mencius')
parsed = corenlp.parse(text)  # May be inaccurate
Strengths for Classical Chinese#
- Authentic sources: Pre-modern texts, not modern translations
- Comprehensive coverage: Broad range of genres and periods
- Easy access: No installation, works from any language
- Well-maintained: Active development and updates
- Community: Large user base, active forums
Best Use Cases#
- Corpus building: Gathering training data for ML models
- Text research: Literary analysis, quotation finding
- Context lookup: Understanding usage of characters/phrases
- Parallel texts: Studying translations
- Educational tools: Building learning applications
Verdict#
Essential resource for Classical Chinese text access, but not a parsing solution. Excellent complement to parsing tools as a data source. Best used in combination with other NLP libraries.
Feature Comparison Matrix#
Comprehensive Tool Comparison#
| Feature | Stanford CoreNLP | ctext.org API | Jiayan | Ideal Solution |
|---|---|---|---|---|
| Tokenization/Segmentation | ||||
| Classical Chinese accuracy | ✗ Poor | ✓ Basic | ✓✓✓ Good | ✓✓✓ |
| Modern Chinese accuracy | ✓✓✓ Excellent | N/A | ✓ Fair | N/A |
| Speed (tokens/sec) | ~1000 | API-limited | ~500 | ~1000 |
| Part-of-Speech Tagging | ||||
| Classical Chinese POS | ✗ Inaccurate | ✗ None | ✓ Experimental | ✓✓✓ |
| POS tagset | Penn CTB (33 tags) | N/A | Custom (limited) | Classical grammar |
| Accuracy (modern Chinese) | ~95% | N/A | ~70% | N/A |
| Syntactic Parsing | ||||
| Dependency parsing | ✓✓ Modern only | ✗ None | ✗ None | ✓✓✓ |
| Constituency parsing | ✓ Modern only | ✗ None | ✗ None | ✓✓ |
| Classical grammar support | ✗ None | ✗ None | ✗ None | ✓✓✓ |
| Named Entity Recognition | ||||
| Modern Chinese NER | ✓✓✓ Good | ✗ None | ✗ None | N/A |
| Historical names | ✗ None | ✗ None | ✗ None | ✓✓✓ |
| Historical places | ✗ None | ✗ None | ✗ None | ✓✓✓ |
| Titles/Offices | ✗ None | ✗ None | ✗ None | ✓✓ |
| Corpus Access | ||||
| Training data access | ✓ CTB (purchase) | ✓✓✓ 30K texts | ✗ None | ✓✓ |
| Classical texts | ✗ None | ✓✓✓ Extensive | ✗ None | ✓✓✓ |
| Parallel translations | ✗ None | ✓✓ Available | ✗ None | ✓ Nice-to-have |
| Technical Details | ||||
| Language | Java | REST API | Python | Python/Java |
| Installation | Maven/Gradle | N/A | pip install | pip install |
| Dependencies | Heavy (2GB+) | None | Minimal | Moderate |
| Offline capable | ✓ Yes | ✗ No | ✓ Yes | ✓ Yes |
| Documentation & Support | ||||
| Documentation quality | ✓✓✓ Excellent | ✓✓ Good | ✓ Limited | ✓✓✓ |
| Community size | Large | Medium | Small | N/A |
| Active maintenance | ✓✓✓ Very active | ✓✓ Active | ✓ Sporadic | ✓✓✓ |
| Licensing & Cost | ||||
| License | GPL v3+ | Terms of Service | Open source | Open source |
| Cost | Free | Free/Paid tiers | Free | Free |
| Commercial use | ✓ Allowed | ✓ Paid tiers | ✓ Check license | ✓ |
Performance Summary#
Accuracy (Classical Chinese Tasks)#
| Task | Stanford CoreNLP | Jiayan | Baseline |
|---|---|---|---|
| Word segmentation | ~60% | ~85% | 100% (ground truth) |
| POS tagging | ~50% | ~65% | 100% (ground truth) |
| Dependency parsing | ~40% | N/A | 100% (ground truth) |
Note: These are estimated based on domain mismatch. No published benchmarks exist for Classical Chinese specifically.
Speed (on 10K character corpus)#
| Tool | Processing Time | Throughput |
|---|---|---|
| Stanford CoreNLP | ~30 seconds | ~333 chars/sec |
| ctext.org API | Variable (network) | API-dependent |
| Jiayan | ~15 seconds | ~667 chars/sec |
Integration Complexity#
Stanford CoreNLP#
# Requires Java installation, model downloads
# Python wrapper (Stanza) available but still heavy
Complexity: ★★★☆☆ (Moderate-High)
Setup time: 30-60 minutes
ctext.org API#
# Simple HTTP requests, immediate access
# But requires API key management
Complexity: ★☆☆☆☆ (Very Low)
Setup time: 5 minutes
Jiayan#
# Pure Python, pip install
# Minimal dependencies
Complexity: ★☆☆☆☆ (Very Low)
Setup time: 5 minutes
Gap Analysis#
What’s Missing#
- Full Classical Chinese parser: No tool provides accurate dependency or constituency parsing for 文言文
- Historical NER: No pre-trained models for historical Chinese names, places, titles
- Classical POS tagging: No standard tagset or accurate tagger for Classical Chinese grammar
- Annotated corpus: Lack of large-scale annotated Classical Chinese treebank
- Benchmarks: No standardized evaluation datasets for Classical Chinese NLP
Why the Gap Exists#
- Small market: Classical Chinese is a specialized domain
- Annotation cost: Creating annotated Classical Chinese corpus requires expert linguists
- Grammar complexity: Classical Chinese grammar differs significantly from modern
- Variation across periods: Pre-Qin, Han, Tang, Song texts have different characteristics
- Academic focus: Most research focuses on modern Chinese NLP for commercial applications
Recommended Tool Combination#
For a complete Classical Chinese parsing pipeline:
1. Text acquisition → ctext.org API
2. Segmentation → Jiayan
3. POS tagging → Custom model (train on annotated data)
4. Parsing → Custom parser (rule-based or retrained neural model)
5. NER → Custom gazetteer + pattern matching
Decision Matrix#
Choose Stanford CoreNLP if:#
- Working primarily with modern Chinese
- Need production-ready, well-supported tool
- Have resources to retrain models
- Need full NLP pipeline
Choose ctext.org API if:#
- Need access to Classical Chinese corpus
- Building research/educational tools
- Want simple integration
- Don’t need syntactic parsing
Choose Jiayan if:#
- Primary task is Classical Chinese segmentation
- Working in Python
- Need lightweight, fast solution
- Building custom Classical Chinese NLP pipeline
Build Custom Solution if:#
- Need accurate Classical Chinese parsing
- Have annotated training data or resources to create it
- Have NLP expertise in-house
- Time and budget allow (6-12 months development)
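The recommended combination (ctext.org for text, Jiayan for segmentation, custom stages downstream) can be sketched as a pipeline of pluggable callables. This glue code is hypothetical; the placeholder stages are not real Jiayan or ctext calls:

```python
from typing import Callable, List, Tuple

Token = str
Tagged = Tuple[str, str]

def make_pipeline(segment: Callable[[str], List[Token]],
                  tag: Callable[[List[Token]], List[Tagged]]
                  ) -> Callable[[str], List[Tagged]]:
    """Compose segmentation and tagging stages into one callable."""
    def run(text: str) -> List[Tagged]:
        return tag(segment(text))
    return run

# Placeholder stages for illustration; swap in real components later:
pipeline = make_pipeline(
    segment=lambda text: list(text),                # char-level fallback segmenter
    tag=lambda tokens: [(t, "X") for t in tokens],  # dummy tagger
)
print(pipeline("學而時習之"))
```

Keeping each stage behind a plain function boundary makes it easy to replace the fallback segmenter with Jiayan, or the dummy tagger with a trained model, without touching the rest of the pipeline.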
Jiayan (嘉言)#
Overview#
Jiayan is a Python library specifically designed for processing Classical Chinese (文言文) texts. Unlike general-purpose Chinese NLP tools, Jiayan focuses on pre-modern Chinese language processing.
Architecture#
Type: Specialized Classical Chinese processor
Language: Python 3
Focus: Segmentation and basic analysis of Classical Chinese
Training Data: Classical Chinese corpus (specific sources not fully documented)
Core Capabilities#
Word Segmentation#
- Classical Chinese optimized: Trained on pre-modern texts
- Character handling: Properly handles classical compounds
- Function words: Recognizes classical particles (也、矣、乎、焉、哉)
- Quality: Better than modern Chinese segmenters for 文言文
Text Processing#
- Sentence splitting: Handles classical punctuation patterns
- Character variants: Normalizes variant forms
- Traditional/Simplified: Handles both character sets
- Idiom detection: Recognizes classical set phrases
Basic Analysis#
- Limited POS tagging (experimental)
- Character frequency analysis
- Named entity hints (not full NER)
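The character-frequency analysis listed above needs nothing beyond the Python standard library; a minimal sketch over the opening line of the Analects:

```python
from collections import Counter

# Opening line of the Analects (論語) as sample text
text = "學而時習之不亦說乎有朋自遠方來不亦樂乎"

# Count each character's occurrences
freq = Counter(text)

# Repeated function words surface immediately
print(freq['不'])  # occurs twice
print(freq['乎'])  # occurs twice
print(freq.most_common(3))
```

Frequency profiles like this are also a cheap signal for distinguishing classical from modern register, since classical particles dominate 文言文 texts.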
Installation#
```shell
pip install jiayan
```
Usage Example#
```python
import jiayan

# Initialize segmenter
segmenter = jiayan.load()

# Segment Classical Chinese text
text = "學而時習之不亦說乎"
words = segmenter.cut(text)
print(list(words))
# Output: ['學', '而', '時', '習', '之', '不', '亦', '說', '乎']

# With delimiter
result = segmenter.lcut(text)
print(' '.join(result))
# Output: "學 而 時 習 之 不 亦 說 乎"
```
Strengths#
- Classical Chinese focus: Designed specifically for 文言文
- Easy installation: Available via pip
- Lightweight: Minimal dependencies
- Python-native: Easy integration with Python workflows
- Better than general tools: Outperforms modern Chinese segmenters on classical texts
Limitations#
- Limited functionality: Primarily segmentation, not full parsing
- No dependency parsing: Does not provide syntactic trees
- Minimal POS tagging: Part-of-speech support is experimental
- Documentation: Limited English documentation
- Maintenance: Less active than major NLP projects
- No NER: Named entity recognition not available
- Performance data: Limited published benchmarks
Comparison with Alternatives#
| Feature | Jiayan | Stanford CoreNLP | ctext.org |
|---|---|---|---|
| Classical Chinese segmentation | ✓✓✓ Good | ✗ Poor | ✓ Basic |
| POS tagging | ✓ Limited | ✗ Inaccurate | ✗ None |
| Dependency parsing | ✗ None | ✗ Inaccurate | ✗ None |
| Ease of use (Python) | ✓✓✓ Easy | ✓ Moderate | ✓✓✓ Easy |
| Documentation | ✓ Limited | ✓✓✓ Excellent | ✓✓ Good |
Current Status#
- Repository: GitHub (search: jiayan python classical chinese)
- PyPI: Available
- Last Update: Check PyPI for current status
- License: Likely open source (verify on repository)
- Community: Small but specialized user base
Use Cases#
Suitable For#
- Classical Chinese text segmentation
- Preprocessing for further analysis
- Educational applications
- Research projects focused on 文言文
Not Suitable For#
- Full syntactic parsing
- Modern Chinese texts
- Named entity extraction
- Production systems requiring comprehensive NLP
Integration Strategy#
```python
# Combined pipeline: Jiayan + custom analysis
import jiayan

def analyze_classical_text(text):
    # Step 1: Segment with Jiayan
    segmenter = jiayan.load()
    words = list(segmenter.cut(text))
    # Step 2: Custom POS tagging (implement separately)
    pos_tags = custom_pos_tagger(words)
    # Step 3: Custom parsing (implement separately)
    parse_tree = custom_parser(words, pos_tags)
    return parse_tree
```
Verdict#
Best specialized tool for Classical Chinese segmentation, but limited to that task. Would be a valuable component in a larger Classical Chinese parsing system, but cannot handle full parsing alone. Recommended as the segmentation layer if building a custom solution.
S2-Comprehensive: Recommendation#
Executive Summary#
After comprehensive analysis, no single production-ready solution exists for Classical Chinese parsing. The best approach depends on project requirements and resources.
Tool Rankings#
For Segmentation Only#
- Jiayan (Best for Classical Chinese)
- Stanford CoreNLP (Better for modern Chinese)
- ctext.org API (Basic, but convenient)
For Full NLP Pipeline#
- Build custom solution using Jiayan + custom components
- Fine-tune Stanford CoreNLP on Classical Chinese corpus
- Hybrid rule-based + ML approach
For Corpus Access#
- ctext.org API (Unmatched Classical Chinese corpus)
- Stanford CoreNLP training data (For modern Chinese reference)
Recommended Approaches by Use Case#
Use Case 1: Research/Academic Project#
Minimal Viable Pipeline:
1. Text source: ctext.org API
2. Segmentation: Jiayan
3. Analysis: Manual or rule-based

Investment: Low (2-4 weeks)
Accuracy: Moderate for segmentation, variable for analysis
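The text-acquisition step can be a thin wrapper over the ctext.org API; this sketch assumes the `gettext` endpoint and `urn` parameter described in the ctext.org API documentation (verify against the current docs before relying on them):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ctext.org exposes a gettext endpoint returning a passage as JSON when
# given a text URN (endpoint/parameter names per the ctext.org API docs;
# treat as an assumption and check the current documentation).
BASE = "https://api.ctext.org/gettext"

def ctext_url(urn):
    """Build the request URL for a ctext.org text URN."""
    return BASE + "?" + urlencode({"urn": urn})

url = ctext_url("ctp:analects/xue-er")
print(url)

# Uncomment to fetch (requires network access):
# data = json.loads(urlopen(url).read().decode("utf-8"))
```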
Use Case 2: Production Application#
Hybrid Pipeline:
1. Text source: ctext.org API (corpus) + user input
2. Segmentation: Jiayan
3. POS tagging: Train custom model on annotated data
4. Parsing: Rule-based parser using Classical grammar rules
5. NER: Gazetteer + pattern matching for historical entities

Investment: High (6-12 months)
Accuracy: Good with proper training data
Use Case 3: Modern + Classical Mixed#
Combined Pipeline:
1. Language detection: Classify as modern vs classical
2. Modern Chinese: Stanford CoreNLP
3. Classical Chinese: Jiayan + custom rules
4. Post-processing: Merge results

Investment: Moderate (3-6 months)
Accuracy: Good for modern, moderate for classical
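The language-detection step in this pipeline can start as a crude heuristic before any classifier is trained: classical function words are far denser in 文言文 than in modern prose, while modern function words are nearly absent from it. The marker sets below are illustrative, not exhaustive:

```python
# Heuristic register detector based on function-word density. A trained
# classifier would do better; this is a first-pass heuristic only.
CLASSICAL_MARKERS = set("也矣乎焉哉之者其曰")
MODERN_MARKERS = set("的了是在这个们吗")

def detect_register(text):
    classical = sum(1 for ch in text if ch in CLASSICAL_MARKERS)
    modern = sum(1 for ch in text if ch in MODERN_MARKERS)
    return "classical" if classical >= modern else "modern"

print(detect_register("學而時習之不亦說乎"))      # classical
print(detect_register("我们今天在学校学习了汉语"))  # modern
```

Mixed documents would need sentence-level rather than document-level classification, since annotations in modern Chinese often interleave with classical source text.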
Critical Gaps to Address#
1. Annotated Training Data#
Problem: No large-scale Classical Chinese treebank exists
Solution Options:
- Annotate subset of ctext.org corpus (high cost)
- Transfer learning from modern Chinese (moderate accuracy)
- Active learning approach (iterative improvement)
Estimated effort: 500-2000 hours of expert annotation
2. POS Tagset for Classical Chinese#
Problem: Modern Chinese tagsets don’t fit classical grammar
Solution: Design Classical Chinese tagset based on:
- Classical grammar references
- Linguistic literature on 文言文
- Consultation with classical philologists
Estimated effort: 2-3 months of linguistic research
3. Parsing Algorithm#
Problem: Classical Chinese syntax differs from modern
Solution Options:
- Rule-based: Encode classical grammar rules (faster, lower accuracy ceiling)
- Neural: Train on annotated data (slower, higher accuracy potential)
- Hybrid: Rules for structure, ML for ambiguity resolution (balanced)
Recommended: Hybrid approach
Estimated effort: 6-9 months
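The rule-based half of the hybrid approach amounts to encoding frames like the classical nominal-predicate pattern 「X者Y也」 ("X is Y"). A single-rule sketch:

```python
import re

# One illustrative rule: the classical nominal-predicate frame 「X者Y也」.
# Real coverage needs many such rules plus an ML fallback for ambiguity.
PATTERN = re.compile(r'(?P<topic>[^者也]+)者(?P<comment>[^者也]+)也')

def match_nominal_predicate(sentence):
    m = PATTERN.search(sentence)
    if m:
        return {'type': 'nominal_predicate',
                'topic': m.group('topic'),
                'comment': m.group('comment')}
    return None

# 仁者人也 — "Benevolence is (being) human" (中庸)
print(match_nominal_predicate("仁者人也"))
```

A production parser would match over segmented words rather than raw characters and would rank competing rule matches, which is where the ML component takes over.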
Phased Implementation Plan#
Phase 1: Foundation (Months 1-2)#
- Set up Jiayan for segmentation
- Integrate ctext.org API for corpus access
- Build evaluation framework with manually annotated test set
- Deliverable: Basic segmentation pipeline
Phase 2: POS Tagging (Months 3-5)#
- Design Classical Chinese POS tagset
- Annotate training data (500-1000 sentences)
- Train custom POS tagger
- Evaluate and iterate
- Deliverable: POS tagger with ~75% accuracy
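A most-frequent-tag unigram baseline is a reasonable first milestone for the custom tagger and a sanity check on the annotation effort; the tags below are illustrative placeholders, not a proposed Classical Chinese tagset:

```python
from collections import Counter, defaultdict

# Baseline tagger: assign each word its most frequent tag from the
# annotated (word, tag) pairs. A CRF or neural tagger would replace this.
def train_unigram_tagger(annotated):
    counts = defaultdict(Counter)
    for word, tag in annotated:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="UNK"):
    return [(w, model.get(w, default)) for w in words]

# Toy training data with hypothetical tags
training = [("學", "V"), ("而", "CONJ"), ("之", "PART"),
            ("之", "PRON"), ("之", "PART"), ("乎", "PART")]
model = train_unigram_tagger(training)

print(tag(["學", "之", "矣"], model))
# 之 → PART (2 of its 3 training occurrences); unseen 矣 → UNK
```

The gap between this baseline and the ~75% target is exactly what justifies a context-sensitive model: words like 之 are genuinely ambiguous between particle and pronoun readings.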
Phase 3: Parsing (Months 6-9)#
- Implement rule-based parser for common patterns
- Train neural parser on annotated data
- Develop hybrid system
- Deliverable: Parser with ~70% accuracy
Phase 4: NER & Refinement (Months 10-12)#
- Build historical entity gazetteer
- Implement NER system
- End-to-end evaluation
- Performance optimization
- Deliverable: Production-ready Classical Chinese NLP pipeline
Budget Estimates#
Minimal Approach (Research)#
- Development time: 2-4 weeks
- Developer cost: $5,000-$10,000
- Tools: Free/open-source
- Total: $5,000-$10,000
Production System#
- Development time: 12 months
- Team: 2 engineers + 1 linguist consultant
- Developer cost: $200,000-$300,000
- Annotation cost: $30,000-$60,000
- Infrastructure: $5,000/year
- Total: $235,000-$365,000
Quick Adaptation (Stanford CoreNLP fine-tuning)#
- Development time: 3-6 months
- Developer cost: $50,000-$100,000
- Annotation cost: $20,000-$40,000
- Total: $70,000-$140,000
Risk Assessment#
High Risk Areas#
- Annotation quality: Classical Chinese requires expert linguists
- Performance ceiling: May not reach modern Chinese accuracy levels
- Maintenance: Specialized system requires ongoing expertise
- Corpus representativeness: Different periods have different characteristics
Mitigation Strategies#
- Collaborate with academic institutions for annotation
- Set realistic accuracy expectations (70-80%, not 95%+)
- Document extensively for knowledge transfer
- Focus on one period initially (e.g., Pre-Qin), expand later
Final Recommendation#
For Most Projects: Incremental Approach#
Start with Jiayan + ctext.org, then build custom components as needed:
- Week 1-2: Set up Jiayan segmentation + ctext.org integration
- Month 1-2: Build rule-based POS tagger for common patterns
- Month 3-4: Add pattern-based parsing for frequent structures
- Month 5+: Incrementally improve based on real usage data
Advantages:
- Fast time to initial value
- Learn from real usage before heavy investment
- Validate use case before committing resources
- Can pivot to different approach if needed
This approach balances speed, cost, and flexibility while acknowledging that Classical Chinese NLP is still an open research problem.
Stanford CoreNLP (Comprehensive)#
Architecture#
Type: Statistical NLP with neural network models
Training Data: Chinese Treebank (CTB) 5.1, 6.0, 7.0 - modern Chinese newspaper text
Models: LSTM-based for POS tagging, neural dependency parser
Detailed Capabilities#
Tokenization#
- Chinese word segmentation using CRF or neural models
- Trained on CTB (modern Chinese)
- Handles Simplified and Traditional characters
- Classical Chinese fit: Poor - modern segmentation rules don’t apply
POS Tagging#
- Penn Chinese Treebank tagset
- 33 POS tags for modern Chinese
- Accuracy: ~95% on modern Chinese test sets
- Classical Chinese fit: Poor - grammar categories differ significantly
Dependency Parsing#
- Neural dependency parser (Universal Dependencies format)
- Trained on UD Chinese GSD corpus
- Accuracy: ~80% LAS on modern Chinese
- Classical Chinese fit: Limited - syntax rules differ
Named Entity Recognition#
- PERSON, LOCATION, ORGANIZATION
- Trained on modern Chinese news
- Classical Chinese fit: Poor - historical names and titles not recognized
Performance Characteristics#
- Speed: ~1000 tokens/second (CPU), faster on GPU
- Memory: ~2GB RAM for Chinese models
- Scalability: Can process large corpora batch-wise
Integration#
Java#
```java
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
props.setProperty("tokenize.language", "zh");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
```
Python (via Stanza)#
```python
import stanza

nlp = stanza.Pipeline('zh', processors='tokenize,pos,lemma,depparse')
doc = nlp("君子不器")  # Modern Chinese optimized
```
Limitations for Classical Chinese#
- Training data mismatch: CTB contains 1990s-2000s news, not pre-Qin texts
- Word boundaries: Classical Chinese compounds follow different rules
- Grammar structures: Classical patterns (e.g., topic-comment) not well-represented
- Function words: Different particles (也、矣、焉) not properly categorized
- No historical NER: Can’t recognize historical figures or ancient place names
Adaptation Strategy#
To use for Classical Chinese would require:
- Retraining: Need Classical Chinese annotated corpus
- Tagset mapping: Map modern POS tags to Classical categories
- Custom segmentation: Implement Classical word boundary rules
- Significant effort: Months of work, requires NLP expertise
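The tagset-mapping step might begin as a lookup table from Penn Chinese Treebank tags to coarser classical categories; both sides of the mapping below are a simplified illustrative subset:

```python
# Project Penn Chinese Treebank tags onto coarser classical categories.
# Both the CTB tags (left) and classical categories (right) shown here
# are a simplified subset for illustration only.
CTB_TO_CLASSICAL = {
    "NN": "NOUN",         # common noun
    "NR": "NOUN_PROPER",  # proper noun
    "VV": "VERB",
    "AD": "ADV",
    "SP": "PARTICLE",     # sentence-final particle (也, 矣, 乎, ...)
}

def map_tags(tagged):
    return [(w, CTB_TO_CLASSICAL.get(t, "OTHER")) for w, t in tagged]

print(map_tags([("君子", "NN"), ("不", "AD"), ("器", "VV"), ("也", "SP")]))
```

Such a projection gives only a rough starting point: many classical categories (e.g. nominalizers like 者, aspectual 矣) have no clean modern counterpart and need tags of their own.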
Verdict#
Not suitable out-of-the-box for Classical Chinese. Good reference architecture if building custom solution.
S3: Need-Driven#
S3-Need-Driven: Approach#
Evaluation Method#
Analysis through concrete use cases to understand when and why teams need Classical Chinese parsing. Focus on:
- Real-world scenarios: Specific problems teams are trying to solve
- Requirements analysis: What parsing capabilities are actually needed
- Solution fit: Which tools/approaches work for each use case
- Success criteria: How to evaluate if the solution works
Use Case Selection Criteria#
Selected to represent diverse needs:
- Educational: Language learning and teaching applications
- Research: Academic study and digital humanities
- Cultural preservation: Historical text digitization and analysis
- Commercial: Applications with business value
Analysis Framework#
For each use case:
- Context: Who needs this and why?
- Specific requirements: What parsing features are needed?
- Data characteristics: What kind of texts? Which period?
- Volume and scale: How much text? Real-time vs batch?
- Accuracy requirements: How much error is acceptable?
- Tool recommendation: Best approach for this scenario
- Implementation notes: Practical considerations
Use Cases Covered#
- Classical Chinese Reading Assistant (Educational)
- Historical Document Digitization (Cultural Preservation)
- Classical Literature Search Engine (Research)
- Translation Memory for Classical Texts (Commercial/Academic)
- Classical Chinese Grammar Checker (Educational/Professional)
Time Box#
3-4 hours: 30-45 minutes per use case
S3-Need-Driven: Recommendation#
Summary of Use Cases#
| Use Case | Feasibility | Market Size | Impact | Recommendation |
|---|---|---|---|---|
| Reading Assistant | High | Medium | Medium | ⭐⭐⭐ Strong |
| Document Digitization | Medium | Small (specialized) | Very High | ⭐⭐⭐⭐ Excellent (with institutional backing) |
| Literature Search | Medium | Small (academic) | High | ⭐⭐⭐ Strong (grant-funded) |
Key Insights from Use Case Analysis#
1. No One-Size-Fits-All Solution#
Different use cases require different parsing features:
- Reading Assistant: Needs fast, good-enough segmentation + definitions
- Document Digitization: Needs error-tolerant NER + quality assessment
- Literature Search: Needs structural indexing + quotation detection
Implication: Classical Chinese parsing tools should be modular, not monolithic. Users pick components they need.
2. Accuracy Requirements Vary Widely#
- Reading Assistant: 70-80% parsing accuracy sufficient (users can adapt)
- Document Digitization: 80%+ needed (but manual review expected)
- Literature Search: 90%+ recall critical (precision less so)
Implication: Don’t over-engineer for perfect accuracy. Most use cases tolerate errors with good UX for correction.
3. Market is Specialized but Global#
- Total classical Chinese students worldwide: 50,000-200,000
- Digital humanities projects: Hundreds of institutions
- Commercial market: Small but underserved
Implication: Sustainable with niche focus, not venture-scale. Best suited for:
- Grant-funded research projects
- Institutional services
- Open-source with commercial support
- Educational tool companies (Pleco, Skritter, etc.)
4. Existing Tools Fill Partial Needs#
- ctext.org: Excellent corpus, basic search
- Jiayan: Good segmentation, no parsing
- Stanford CoreNLP: Good parsing, wrong domain
- Commercial apps (Pleco, Wenlin): Dictionaries, basic analysis
Implication: Integration and improved UX more valuable than starting from scratch. Partner with existing tools rather than compete.
Recommended Development Priorities#
Priority 1: Classical Chinese Reading Assistant (Months 1-4)#
Why First:
- Technically achievable with existing tools (Jiayan + ctext.org)
- Clear user need (students, researchers)
- Fast time to market
- Validates architecture for other use cases
Scope:
- Segmentation (Jiayan)
- Definitions (ctext.org API + CC-CEDICT)
- Basic POS tagging (rule-based)
- Simple web UI
- Mobile-responsive
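The definitions layer can bootstrap from CC-CEDICT, whose distribution is a plain-text file with one entry per line in the form `傳統 传统 [chuan2 tong3] /tradition/traditional/`. A minimal parser for that line format (classical senses would still need a classical dictionary layered on top):

```python
import re

# Parse one CC-CEDICT entry line:
#   traditional simplified [pinyin] /gloss1/gloss2/
LINE = re.compile(r'^(\S+) (\S+) \[([^\]]+)\] /(.+)/$')

def parse_cedict_line(line):
    m = LINE.match(line.strip())
    if not m:
        return None  # comment line or malformed entry
    trad, simp, pinyin, glosses = m.groups()
    return {'traditional': trad, 'simplified': simp,
            'pinyin': pinyin, 'glosses': glosses.split('/')}

entry = parse_cedict_line("傳統 传统 [chuan2 tong3] /tradition/traditional/")
print(entry['glosses'])  # ['tradition', 'traditional']
```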
Go-to-Market:
- Beta with university Chinese departments
- Freemium model (free basic, $5-10/mo premium)
- Partner with Pleco or Skritter for distribution
Success Criteria:
- 1,000+ users in first year
- 4+ star average rating
- Break-even on operating costs
Priority 2: Document Digitization Pipeline (Months 5-12)#
Why Second:
- Leverages reading assistant architecture
- High impact for cultural preservation
- Grant funding available (NEH, Mellon Foundation)
- Institutional customers have budgets
Scope:
- OCR integration (Tesseract)
- Error-tolerant parsing
- NER for historical entities
- Quality assessment
- Manual review workflow
- Batch processing API
Go-to-Market:
- Pilot with 2-3 university libraries
- Grant applications for development
- Open-source core, commercial support/hosting
Success Criteria:
- 3+ institutional deployments
- 100,000+ pages processed
- Grant funding secured for maintenance
Priority 3: Literature Search Engine (Months 13-24)#
Why Third:
- Most complex technically
- Builds on previous two projects
- Requires larger corpus and computing resources
- Best as mature, well-funded project
Scope:
- Full corpus indexing (ctext.org + other sources)
- Structural search
- Quotation detection
- Temporal analysis
- Research-grade API
Go-to-Market:
- Partnership with major research institution
- NEH Digital Humanities advancement grant
- Subscription for institutions ($500-2K/year)
- Free for individual researchers
Success Criteria:
- 50+ institutional subscribers
- 5,000+ registered users
- Cited in 100+ academic papers within 3 years
Alternative Development Path: Open-Source Components#
Instead of building products, create ecosystem of reusable components:
Component 1: Classical Chinese NLP Library#
```
pypi: classical-chinese-nlp
Features: Segmentation, POS, basic parsing
License: MIT
Maintenance: Community + institutional sponsors
```
Component 2: Historical Chinese NER#
```
pypi: historical-chinese-ner
Features: Pre-trained models, gazetteers, tools
License: MIT
Data: CBDB integration, place name databases
```
Component 3: Classical Text Search#
```
pypi: classical-search
Features: Elasticsearch config, quotation detection
License: MIT
Includes: Docker compose for quick deploy
```
Why This Approach:#
- Lower maintenance burden: Community contributes
- Wider adoption: Free removes barrier to entry
- Academic credibility: Open science model
- Funding: Academic grants support open-source
- Long-term sustainability: Not dependent on commercial success
Revenue from Services:#
- Consulting: Help institutions deploy
- Hosting: Managed services for libraries/universities
- Support: Commercial support contracts
- Training: Workshops and courses
Phased Funding Strategy#
Phase 1: Bootstrap ($50K-100K)#
- Source: Personal funds, small grants, Kickstarter
- Timeline: 6 months
- Deliverable: Reading Assistant MVP
- Goal: Prove concept, gather users
Phase 2: Seed Funding ($200K-500K)#
- Source: Institutional partnerships, NEH Digital Humanities Startup Grant
- Timeline: 12 months
- Deliverable: Document Digitization pipeline + expanded Reading Assistant
- Goal: Institutional adoption
Phase 3: Growth ($500K-1M)#
- Source: NEH Implementation Grant, university partnerships, Mellon Foundation
- Timeline: 18-24 months
- Deliverable: Literature Search Engine + mature platform
- Goal: Field standard for Classical Chinese NLP
Phase 4: Sustainability (Ongoing)#
- Sources: Subscriptions, support contracts, grants, donations
- Maintenance: 2-3 FTE sustained by revenue
- Community: Open governance, advisory board
Risk Management Across Use Cases#
Technical Risks#
- Mitigation: Start with proven components (Jiayan, ctext.org)
- Fallback: If custom parsing fails, fall back to rule-based
Market Risks#
- Mitigation: Validate with beta users before major investment
- Pivot options: Can pivot to modern Chinese if market too small
Funding Risks#
- Mitigation: Multiple revenue streams (grants + subscriptions + services)
- Plan B: Open-source + consulting model if product sales weak
Sustainability Risks#
- Mitigation: Design for low ongoing costs (efficient architecture)
- Endgame: Donate to academic institution if commercial model fails
Final Recommendation#
Recommended Path: Hybrid Approach#
Years 1-2: Product Focus (Reading Assistant)
- Build commercial product for students/researchers
- Prove technical approach and gather user feedback
- Become profitable enough to self-fund further development
Years 2-4: Platform Expansion (Digitization + Search)
- Transition to open-source component model
- Seek institutional partnerships and grant funding
- Build sustainable academic infrastructure
Years 4+: Maintenance & Community
- Transition governance to academic consortium
- Continue commercial services for revenue
- Focus on community building and research applications
Success Looks Like (Year 5):#
- Reading Assistant: 10,000+ active users, break-even or profitable
- Document Digitization: Deployed at 20+ institutions
- Literature Search: Standard tool cited in hundreds of papers
- Open Source Components: Used by dozens of research projects
- Financially Sustainable: Combination of grants, subscriptions, and services
- Impact: Measurably accelerated classical Chinese research globally
Start Small, Think Big#
Month 1 Action Items:
- Build minimal reading assistant (Jiayan + simple UI)
- Deploy beta to 50 users
- Collect feedback
- Prepare NEH Digital Humanities startup grant application
Don’t wait for perfect solution. Ship iteratively, learn from users, adapt.
Use Case: Historical Document Digitization#
Context#
Who: Libraries, museums, archives, digital humanities projects
Why: Millions of historical Chinese documents exist only in physical form or as images. Need to convert to searchable, analyzable digital text.
Problem Statement: OCR can extract characters from scanned documents, but raw character sequences are hard to analyze without word boundaries and structure. Need automated parsing to make digitized texts useful for research and preservation.
User Story#
“As a digital humanities researcher, we have 10,000 scanned pages of Qing dynasty local gazetteers (地方志). OCR gives us character sequences, but to build a searchable database of place names, officials, and events, we need:
- Word segmentation to identify multi-character names and terms
- Named entity recognition to extract person names, place names, titles
- Structural analysis to identify sections (biographies, geography, events)
- Quality metrics to flag OCR errors for manual review”
Specific Requirements#
Must Have#
- Batch processing: Process thousands of documents
- OCR error tolerance: Handle misrecognized characters gracefully
- Named entity extraction: Identify people, places, offices
- Structured output: JSON/XML for database ingestion
- Quality scoring: Flag low-confidence parses for review
Nice to Have#
- Period-specific models: Different models for different dynasties
- Format preservation: Maintain original document structure
- Variant normalization: Map variant characters to standard forms
- Automated correction: Suggest fixes for likely OCR errors
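Variant normalization can start as a simple character-to-character map; the pairs below are common variant/standard correspondences, and a production table (e.g. derived from the Unihan database) would hold thousands:

```python
# Map variant character forms to standard forms. Two common pairs shown;
# a real table would be generated from a variants database.
VARIANT_MAP = {
    "爲": "為",  # variant of 為
    "竝": "並",  # variant of 並
}

def normalize(text):
    return "".join(VARIANT_MAP.get(ch, ch) for ch in text)

print(normalize("爲學"))  # → 為學
```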
Not Critical#
- Real-time processing: Batch processing overnight is acceptable
- Perfect accuracy: 80%+ accuracy acceptable with manual review
- UI: Command-line or API sufficient
Data Characteristics#
- Text type: Historical documents (gazetteers, official records, chronicles)
- Period: Wide range (Tang to Qing dynasties)
- Volume: Large (millions of characters)
- Format: OCR output, potentially with errors (5-15% character error rate)
- Structure: Mixed text, tables, lists
Accuracy Requirements#
- Segmentation: 80%+ (will be manually reviewed)
- NER: 70%+ recall (better to over-extract than miss entities)
- OCR error detection: 60%+ (flagging for human review)
- Structure extraction: 85%+ (critical for usability)
Recommended Solution#
Architecture#
```
Scanned Documents
↓
OCR (Tesseract + Chinese models)
↓
Raw Character Sequences (with errors)
↓
Preprocessing
  - Detect OCR confidence scores
  - Identify section headers
  - Extract metadata (page numbers, document IDs)
↓
Jiayan Segmentation (with error tolerance)
↓
NER Pipeline
  - Historical name gazetteer
  - Pattern matching (官职, 地名 patterns)
  - ML-based NER (trained on historical texts)
↓
Quality Assessment
  - Flag low confidence segments
  - Identify likely OCR errors
↓
Structured Output (JSON/XML)
↓
Database Ingestion + Manual Review Queue
```
Tech Stack#
- OCR: Tesseract 4+ with chi_tra/chi_sim models
- Preprocessing: Python + custom rules
- Segmentation: Jiayan (modified for error tolerance)
- NER:
- Gazetteers from CBDB (China Biographical Database)
- Custom CRF model trained on historical texts
- Pattern matching for titles/offices
- Database: PostgreSQL with full-text search
- Review interface: Web UI for manual correction
Implementation Time#
- Phase 1 (OCR + segmentation): 1-2 months
- Phase 2 (NER): 3-4 months
- Phase 3 (Quality assessment): 2 months
- Total: 6-10 months
Example Implementation#
```python
import re

import jiayan
import pytesseract  # Python bindings for Tesseract OCR
from pytesseract import Output
from historical_ner import HistoricalNER  # project-specific NER module

class DocumentDigitizer:
    def __init__(self):
        self.segmenter = jiayan.load()
        self.ner = HistoricalNER()
        self.gazetteers = self.load_gazetteers()

    def process_document(self, image_path):
        # Step 1: OCR
        ocr_result = pytesseract.image_to_data(
            image_path,
            lang='chi_tra',
            output_type=Output.DICT
        )
        # Extract text and per-token confidence scores
        text = ''.join(ocr_result['text'])
        confidences = ocr_result['conf']
        # Step 2: Preprocess
        sections = self.identify_sections(text)
        # Step 3: Segment
        words = list(self.segmenter.cut(text))
        # Step 4: NER
        entities = self.extract_entities(words)
        # Step 5: Quality assessment
        quality_flags = self.assess_quality(words, confidences, entities)
        # Step 6: Structure output
        return {
            'document_id': self.extract_id(image_path),
            'sections': sections,
            'entities': {
                'people': entities['PER'],
                'places': entities['LOC'],
                'offices': entities['OFF']
            },
            'quality_score': self.calculate_score(quality_flags),
            'review_needed': quality_flags['needs_review']
        }

    def extract_entities(self, words):
        entities = {'PER': [], 'LOC': [], 'OFF': []}
        # Gazetteer lookup
        for i, word in enumerate(words):
            if word in self.gazetteers['people']:
                entities['PER'].append((word, i))
            elif word in self.gazetteers['places']:
                entities['LOC'].append((word, i))
        # Pattern matching for official titles
        title_patterns = [
            r'知.*府', r'.*尚书', r'.*大夫', r'.*侍郎'
        ]
        text = ''.join(words)
        for pattern in title_patterns:
            for match in re.finditer(pattern, text):
                entities['OFF'].append((match.group(), match.start()))
        # ML-based NER for remaining entities
        ml_entities = self.ner.predict(words)
        entities = self.merge_entities(entities, ml_entities)
        return entities

    def assess_quality(self, words, confidences, entities):
        flags = {
            'low_confidence_chars': [],
            'probable_errors': [],
            'entity_conflicts': [],
            'needs_review': False
        }
        # Flag low-confidence OCR output
        for i, conf in enumerate(confidences):
            if conf < 70:
                flags['low_confidence_chars'].append(i)
        # Check for impossible character sequences
        # ... more quality checks
        flags['needs_review'] = (
            len(flags['low_confidence_chars']) > 10 or
            len(flags['probable_errors']) > 5
        )
        return flags

    def load_gazetteers(self):
        # Load historical name databases:
        # CBDB for person names, historical place-name databases
        return {'people': set(), 'places': set()}  # placeholder

# Usage for batch processing
digitizer = DocumentDigitizer()
for doc in document_collection:
    result = digitizer.process_document(doc)
    if result['review_needed']:
        queue_for_review(result)
    else:
        ingest_to_database(result)
```
Success Metrics#
Efficiency#
- Processing speed: 100-500 pages/hour (depending on quality)
- Manual review reduction: 70% of documents auto-processed
- Entity extraction recall: 75%+ for names, 80%+ for places
Quality#
- Segmentation accuracy: 85%+ on clean OCR, 75%+ on noisy OCR
- NER precision: 80%+ (low false positive rate)
- OCR error detection: 70%+ of errors flagged
Project Impact#
- Digitization speed: 5-10x faster than fully manual
- Cost reduction: 60-80% lower cost per page
- Searchability: 95%+ of entities findable
Cost Estimate#
Development#
- Team: 2 engineers + 1 historian consultant
- Duration: 10 months
- Cost: $150K-250K
Infrastructure (annual)#
- OCR processing: GPU servers or cloud ($2K-10K/year depending on volume)
- Storage: $1K-5K/year for millions of pages
- Database: $5K-15K/year
- Total: $8K-30K/year
Per-Document Costs#
- Automated processing: $0.01-0.05 per page
- Manual review: $1-3 per page (when needed)
- Fully manual: $5-15 per page
- Savings: 70-90% cost reduction
Risks & Mitigations#
Risk 1: OCR quality varies widely by document condition#
Mitigation: Pre-assess document quality, route low-quality docs directly to manual processing
Risk 2: Historical terminology varies by period and region#
Mitigation: Build period-specific models, maintain region-specific gazetteers
Risk 3: Rare entity types not in training data#
Mitigation: Active learning - add frequently-corrected entities to gazetteers
Risk 4: Format variations hard to handle automatically#
Mitigation: Template-based extraction for known formats, flag unusual formats for review
Real-World Projects#
Existing Examples#
- CBDB (China Biographical Database): Thousands of digitized biographies
- CHGIS (China Historical GIS): Historical place names database
- ctext.org: 30,000+ digitized classical texts
- National Library of China: Massive digitization efforts
Lessons Learned#
- Manual review is always needed - aim to minimize, not eliminate
- Domain expertise critical - partner with historians
- Start with high-quality documents, expand to difficult ones
- Gazetteers are invaluable - invest in building comprehensive lists
- Version control critical - texts improve over time with corrections
Competitive Landscape#
Commercial Solutions#
- ABBYY FineReader: Good OCR, limited Classical Chinese NER
- Google Cloud Vision: General OCR, no historical Chinese features
- Custom academic tools: Often project-specific, not generalizable
Market Opportunity#
- Large unmet need in digital humanities
- Growing interest in Chinese historical research globally
- Funding available from cultural preservation initiatives
- Could be open-sourced with institutional sponsorship
Verdict#
Feasibility: Medium-High - requires significant development but achievable
Impact: Very High - enables large-scale digital humanities research
Market: Grant-funded or institutional projects
Recommendation: Viable for dedicated project with institutional backing - combines technical challenge with significant cultural preservation value
Use Case: Classical Literature Search Engine#
Context#
Who: Researchers, students, translators working with classical texts
Why: Finding parallel passages, tracking quotations, and studying usage patterns across the classical corpus requires sophisticated search beyond simple string matching.
Problem Statement: Classical Chinese texts heavily quote and reference each other. Researchers need to find thematically similar passages, track how phrases evolve across texts, and discover quotations even when wording varies slightly. Character-based search returns too many false positives; semantic search requires understanding grammatical structure.
User Story#
“As a scholar studying the concept of 仁 (benevolence) in Confucian texts, I want to:
- Find all passages discussing 仁 across the entire classical corpus
- Filter by grammatical context (仁 as subject vs. object vs. modifier)
- Group similar passages together even if exact wording differs
- Track how usage patterns change from Pre-Qin to Han to Tang
- Identify direct quotations vs. thematic parallels
- Export results with citations for academic writing”
Specific Requirements#
Must Have#
- Corpus-wide search: Cover major classical texts (Confucian, historical, philosophical)
- Structural search: Filter by grammatical role, sentence position
- Fuzzy matching: Find similar passages with character substitutions
- Citation tracking: Identify quotations and allusions
- Result clustering: Group thematically similar passages
- Export: CSV, BibTeX, or markdown with citations
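The fuzzy-matching and citation-tracking requirements above can be approximated with character n-gram overlap: passages sharing most of their n-grams are quotation candidates even when a character differs. A sketch:

```python
# Character n-gram overlap as a first pass at quotation detection.
def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap(a, b, n=3):
    """Jaccard similarity of character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Near-identical passages with one substituted character
original = "學而時習之不亦說乎"
quoted = "學而時習之不亦悅乎"  # 說 → 悅, a common variant substitution
print(overlap(original, quoted))
```

Candidates above a similarity threshold would then go to a second stage (alignment and clustering) to separate direct quotation from formulaic phrasing.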
Nice to Have#
- Semantic search: Find conceptually related passages (requires embeddings)
- Temporal analysis: Visualize usage changes over time
- Co-occurrence: What terms appear near the query term?
- Parallel text display: Show translations alongside classical text
- API access: Programmatic search for research pipelines
Not Critical#
- Real-time search: 5-10 second response time acceptable
- Web UI: API + command-line sufficient initially
- User accounts: Can be added later
Data Characteristics#
- Corpus size: 30,000+ texts from ctext.org (hundreds of millions of characters)
- Text types: Prose (histories, philosophy), poetry, official documents
- Periods: Pre-Qin (春秋战国) through Qing dynasty
- Languages: Classical Chinese (文言文), some modern annotations
- Format: Need to ingest and index from ctext.org API
Accuracy Requirements#
- Search recall: 90%+ (critical not to miss relevant passages)
- Search precision: 70%+ (some false positives acceptable)
- Quotation detection: 85%+ precision (high confidence needed)
- Structural filtering: 80%+ accuracy (incorrect filtering is misleading)
- Response time: <10 seconds for complex queries
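These targets can be checked mechanically against a hand-labeled evaluation set. A minimal scorer (a hypothetical helper, assuming search results are represented as sets of passage IDs):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a gold-relevant set."""
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 8 of 10 retrieved passages are relevant; 2 relevant passages were missed
p, r = precision_recall(set(range(10)), set(range(2, 12)))
```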
Recommended Solution#
Architecture#
Data Ingestion Layer
├─ ctext.org API crawler
├─ Text preprocessing (punctuation, variants)
└─ Periodic updates (new texts, corrections)
↓
Parsing Layer
├─ Jiayan segmentation
├─ POS tagging (custom model)
├─ Dependency parsing (basic patterns)
└─ NER (people, places, concepts)
↓
Index Layer
├─ Elasticsearch (full-text + structural search)
├─ PostgreSQL (metadata, citations)
└─ Vector DB (semantic embeddings - optional Phase 2)
↓
Search Layer
├─ Query parser (interpret search syntax)
├─ Structural filters (grammatical role, position)
├─ Fuzzy matching (character variants, synonyms)
└─ Result ranking (relevance + frequency + source authority)
↓
Analysis Layer
├─ Quotation detection (n-gram matching + clustering)
├─ Passage grouping (semantic similarity)
├─ Temporal analysis (usage over time)
└─ Co-occurrence statistics
↓
API & UI
├─ REST API (JSON responses)
├─ Web UI (React + visualization)
└─ Export (CSV, BibTeX, markdown)
Tech Stack#
- Corpus: ctext.org API
- Parsing: Jiayan + custom POS/dependency parsers
- Search: Elasticsearch (full-text + structured data)
- Database: PostgreSQL (metadata, citations)
- Vector search (Phase 2): Qdrant or Pinecone
- Backend: Python (FastAPI)
- Frontend: React + D3.js (visualization)
- Hosting: Cloud (AWS/GCP) or institutional servers
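The division of labor between Elasticsearch fields can be sketched as an index mapping. Field names follow the document structure used elsewhere in this section; the exact analyzers are an open design choice:

```python
# Sketch of an index mapping for the corpus (field names are illustrative)
corpus_mapping = {
    'mappings': {
        'properties': {
            'title':    {'type': 'keyword'},   # exact-match metadata
            'author':   {'type': 'keyword'},
            'period':   {'type': 'keyword'},   # enables temporal filters
            'content':  {'type': 'text'},      # raw, unsegmented text
            'words':    {'type': 'text'},      # whitespace-joined segmented words
            'pos_tags': {'type': 'keyword'},   # structural filtering
        }
    }
}
# Created once with:
# es.indices.create(index='classical_chinese_corpus', body=corpus_mapping)
```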
Implementation Time#
- Phase 1 (Basic search): 2-3 months
- Ingest corpus
- Basic segmentation + indexing
- Simple search API
- Phase 2 (Structural search): 3-4 months
- POS tagging + parsing
- Grammatical filters
- Quotation detection
- Phase 3 (Advanced features): 4-5 months
- Semantic search
- Temporal analysis
- Web UI
- Total MVP: 6-8 months
- Full product: 12-15 months
Example Implementation#
from elasticsearch import Elasticsearch
from jiayan import load_lm, CharHMMTokenizer  # jiayan's actual entry points

class ClassicalChineseSearch:
    # Helpers such as parse_text, format_results, generate_ngrams,
    # cluster_passages, and get_embedding are sketched, not implemented here.
    def __init__(self):
        self.es = Elasticsearch()
        # jiayan ships a pre-trained language model ('jiayan.klm'), downloaded separately
        self.segmenter = CharHMMTokenizer(load_lm('jiayan.klm'))
        self.index = 'classical_chinese_corpus'

    def index_corpus(self):
        """Ingest and index the ctext.org corpus (ctext_api is a placeholder wrapper)."""
        for text in ctext_api.get_all_texts():
            # Parse text (segmentation, POS, dependencies)
            parsed = self.parse_text(text['content'])
            # Index document
            doc = {
                'title': text['title'],
                'author': text['author'],
                'period': text['period'],
                'content': text['content'],
                'words': parsed['words'],
                'pos_tags': parsed['pos'],
                'dependencies': parsed['deps'],
            }
            self.es.index(index=self.index, body=doc)

    def search(self, query, filters=None):
        """
        Search with structural filters.

        Example:
            search("仁", filters={'pos': 'NOUN', 'role': 'SUBJECT'})
        """
        # Segment query
        query_words = list(self.segmenter.tokenize(query))
        # Build Elasticsearch query
        es_query = {
            'bool': {
                'must': [
                    {'match': {'words': ' '.join(query_words)}}
                ]
            }
        }
        # Add structural filters (initialise the filter list once, then append)
        if filters:
            es_filters = es_query['bool'].setdefault('filter', [])
            if 'pos' in filters:
                es_filters.append({'term': {'pos_tags': filters['pos']}})
            if 'role' in filters:
                es_filters.append({'term': {'dependencies.role': filters['role']}})
        # Execute search
        results = self.es.search(
            index=self.index,
            body={'query': es_query},
            size=100
        )
        return self.format_results(results)

    def find_quotations(self, passage, min_length=5):
        """Find passages that quote the given text."""
        # Generate n-grams to use as search probes
        ngrams = self.generate_ngrams(passage, min_length)
        # Search for each n-gram
        quotations = []
        for ngram in ngrams:
            results = self.search(ngram)
            quotations.extend(results)
        # Cluster similar results
        return self.cluster_passages(quotations)

    def semantic_search(self, concept, top_k=50):
        """Find semantically related passages (requires embeddings; Phase 2)."""
        # Get embedding for the concept
        query_embedding = self.get_embedding(concept)
        # Vector similarity search
        return self.vector_db.search(query_embedding, top_k=top_k)

    def temporal_analysis(self, term):
        """Track usage of a term across periods."""
        periods = ['pre-qin', 'han', 'tang', 'song', 'ming', 'qing']
        usage = {}
        for period in periods:
            results = self.search(term, filters={'period': period})
            usage[period] = {
                'count': len(results),
                'texts': [r['title'] for r in results],
                'contexts': [r['context'] for r in results],
            }
        return usage

# Usage examples
search = ClassicalChineseSearch()

# 1. Basic search
results = search.search("君子")

# 2. Structural search: find 仁 used as subject
results = search.search("仁", filters={'pos': 'NOUN', 'role': 'SUBJECT'})

# 3. Find quotations of a passage
quotations = search.find_quotations("學而時習之不亦說乎")

# 4. Temporal analysis
usage = search.temporal_analysis("仁")

# 5. Semantic search (Phase 2)
similar = search.semantic_search("道德")
Success Metrics#
Search Quality#
- Precision: 70%+ of results relevant
- Recall: 90%+ of relevant passages found
- Speed: <10 seconds for complex queries
- Coverage: 95%+ of major classical texts indexed
Quotation Detection#
- Precision: 85%+ (low false positive rate)
- Recall: 80%+ (find most quotations)
- Min length: Detect quotations ≥5 characters
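The ≥5-character floor corresponds to the n-gram probes used for quotation search. A minimal generator, assuming character n-grams as the matching unit:

```python
def generate_ngrams(text, min_length=5, max_length=8):
    """Character n-grams of a passage, used as quotation-search probes."""
    ngrams = []
    for n in range(min_length, min(max_length, len(text)) + 1):
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams

probes = generate_ngrams("學而時習之不亦說乎")  # 9 characters -> 5- to 8-grams
```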
User Impact#
- Research efficiency: 5-10x faster passage finding
- Discovery: Find connections not discoverable manually
- Citation quality: More comprehensive references
Cost Estimate#
Development#
- Team: 2 backend engineers + 1 frontend + 1 classical Chinese consultant
- Duration: 12-15 months for full product
- Cost: $250K-400K
Infrastructure (annual)#
- Elasticsearch cluster: $5K-20K/year (depending on scale)
- Database: $2K-5K/year
- Hosting: $3K-10K/year
- ctext.org API: $60-500/year (depending on tier)
- Total: $10K-35K/year
Revenue Models#
Academic Subscription#
- Institutions: $500-2,000/year
- Individuals: $50-150/year
- Target: 50-200 institutions, 500-2000 individuals
- Revenue: $50K-500K/year
Grant Funding#
- NEH, Mellon, etc.: $100K-500K for development
- Sustainability: Annual grants for maintenance
Open Source + Premium#
- Core search: Free and open source
- Premium features: Advanced analytics, API access, bulk export
- Freemium revenue: $30K-100K/year
Risks & Mitigations#
Risk 1: Parsing accuracy limits search quality#
Mitigation: Focus on high-confidence features first, add advanced parsing gradually
Risk 2: Index size very large (billions of characters)#
Mitigation: Efficient indexing strategy, cloud infrastructure for scaling
Risk 3: User base small and specialized#
Mitigation: Partner with academic institutions, seek grant funding
Risk 4: ctext.org API changes or becomes unavailable#
Mitigation: Cache corpus locally, negotiate long-term API access
Risk 5: Semantic search requires significant ML expertise#
Mitigation: Phase 2 feature, can use pre-trained Chinese embeddings initially
Competitive Landscape#
Existing Tools#
- ctext.org built-in search: Basic full-text, no structural features
- Chinese Text Concordance: Desktop tool, limited corpus
- MARKUS (MBDB): Semi-automatic markup, not full search
- Google: Can search classical texts, but no domain-specific features
Differentiation#
- Structural search: Unique capability for classical Chinese
- Quotation detection: Automated allusion finding
- Corpus integration: Comprehensive coverage beyond single sources
- Research-focused: Designed for academic use cases
Real-World Applications#
Scholarly Research#
- Tracing evolution of philosophical concepts
- Identifying intertextual relationships
- Compiling comprehensive references for publications
Translation#
- Finding parallel translations of passages
- Understanding variant readings
- Checking authenticity and provenance
Education#
- Teaching about textual relationships
- Demonstrating usage patterns
- Comparing interpretations across periods
Verdict#
- Feasibility: Medium - requires significant engineering but builds on existing technology
- Impact: High - transforms classical Chinese research capabilities
- Market: Academic niche but with international reach
- Recommendation: Strong candidate for a grant-funded academic project - high scholarly value, sustainable with institutional support, potential for long-term impact on the field
Key success factor: Partner with established classical Chinese research centers (e.g., Fairbank Center, Academia Sinica) for credibility, user feedback, and sustained funding.
Use Case: Classical Chinese Reading Assistant#
Context#
Who: Students learning Classical Chinese, scholars reading primary sources
Why: Classical Chinese is difficult to read - readers need help with word boundaries, grammar structure, and meaning
Problem Statement: Reading unpunctuated Classical Chinese text requires identifying word boundaries, understanding grammatical relationships, and accessing definitions. Manual dictionary lookup is slow and breaks reading flow.
User Story#
“As a graduate student reading the Mencius (孟子), I encounter a passage: ‘天将降大任于是人也必先苦其心志’. I need to:
- Identify word boundaries: ‘天/将/降/大任/于/是人/也/必/先/苦/其/心志’
- Understand the grammatical structure (subject, verb, object relationships)
- Look up unfamiliar compounds like 降大任
- See similar usage patterns in other texts”
Specific Requirements#
Must Have#
- Word segmentation: Identify boundaries in unpunctuated text
- Hover definitions: Quick access to word meanings
- Sentence structure: Visual indication of grammar relationships
- Speed: Real-time or near real-time response (<2 seconds)
Nice to Have#
- Cross-references: Find similar passages in other texts
- Etymology: Character composition and historical meanings
- Audio: Pronunciation in reconstructed Old/Middle Chinese
- Parallel translations: Show existing English/modern Chinese translations
Not Critical#
- Named entity recognition: Can handle manually
- High accuracy parsing: 70-80% accuracy acceptable if user can correct
- Historical dating: Not essential for reading comprehension
Data Characteristics#
- Text type: Classical prose (Confucian texts, histories, philosophical works)
- Period: Primarily Pre-Qin to Han dynasty
- Volume: Typically short passages (100-500 characters at a time)
- Format: Traditional characters, unpunctuated or modern punctuated
Accuracy Requirements#
- Segmentation: 85%+ accuracy required (incorrect segmentation confusing)
- POS tagging: 70%+ acceptable (informational only)
- Parsing: 60%+ acceptable (helpful even if imperfect)
- Definitions: High quality required (from ctext.org or similar)
Recommended Solution#
Architecture#
Input: "天将降大任于是人也"
↓
Jiayan Segmenter
↓
Segmented: "天/将/降/大任/于/是人/也"
↓
Custom POS Tagger (rule-based)
↓
POS: "天[N]/将[ADV]/降[V]/大任[N]/于[PREP]/是人[N]/也[PART]"
↓
Dependency Parser (rule-based for common patterns)
↓
Parse Tree: [天将降大任] [于是人] (subject-verb-object + prepositional phrase)
↓
ctext.org API
↓
Definitions + Cross-references
↓
Web UI Display
Tech Stack#
- Backend: Python + Flask/FastAPI
- Segmentation: Jiayan
- POS/Parsing: Custom rule-based system
- Dictionary: ctext.org API + CC-CEDICT
- Frontend: React with annotation display
- Hosting: Can be self-hosted or cloud
Implementation Time#
- MVP (segmentation + definitions): 2-3 weeks
- Full version (parsing + UI): 8-12 weeks
- Polished product: 4-6 months
Example Implementation#
import requests  # for ctext.org dictionary lookups
from jiayan import load_lm, CharHMMTokenizer

class ClassicalChineseAssistant:
    def __init__(self):
        # jiayan's pre-trained language model ('jiayan.klm'), downloaded separately
        self.segmenter = CharHMMTokenizer(load_lm('jiayan.klm'))

    def analyze(self, text):
        # Step 1: Segment
        words = list(self.segmenter.tokenize(text))
        # Step 2: POS tag (simple rule-based)
        pos_tags = self.pos_tag(words)
        # Step 3: Get definitions
        definitions = self.get_definitions(words)
        # Step 4: Parse structure (basic patterns)
        structure = self.parse_structure(words, pos_tags)
        return {
            'words': words,
            'pos': pos_tags,
            'definitions': definitions,
            'structure': structure
        }

    def pos_tag(self, words):
        # Rule-based POS tagging
        particles = {'也', '矣', '乎', '焉', '哉', '耳'}
        prepositions = {'于', '以', '为', '与', '从'}
        tags = []
        for word in words:
            if word in particles:
                tags.append('PART')
            elif word in prepositions:
                tags.append('PREP')
            else:
                tags.append('UNK')  # ... more rules needed for nouns, verbs, adverbs
        return tags

    def get_definitions(self, words):
        definitions = {}
        for word in words:
            # Query ctext.org or a local dictionary (lookup is a placeholder)
            definitions[word] = self.lookup(word)
        return definitions

    def parse_structure(self, words, pos_tags):
        # Simple pattern matching for common structures:
        # Subject-Verb-Object, prepositional phrases, etc.
        structure = []  # placeholder: collect matched patterns here
        return structure

# Usage
assistant = ClassicalChineseAssistant()
result = assistant.analyze("天将降大任于是人也")
Success Metrics#
User Experience#
- Time to understand a passage: Reduced from 10 minutes to 2 minutes
- Dictionary lookups: Reduced from 8-10 per passage to 0-2
- User satisfaction: 4+ stars on user surveys
Technical#
- Segmentation accuracy: >85% on evaluation set
- Response time: <1 second for 200-character passages
- Uptime: 99%+ availability
Cost Estimate#
Development#
- Engineer time: 3-6 months @ $100K-150K salary = $25K-75K
- Linguist consultant: 40 hours @ $150/hr = $6K
- Total dev cost: $31K-81K
Operating Costs (annual)#
- Hosting: $50-200/month = $600-2,400/year
- ctext.org API: $60/year (academic tier)
- Maintenance: 10-20 hours/month @ $150/hr = $18K-36K/year
- Total operating: $18.7K-38.5K/year
Break-even Analysis#
- Revenue model: Subscription ($5-15/month)
- Users needed at $10/month: roughly 156-321 users to break even on operating costs
- Market size: Tens of thousands of Classical Chinese students globally
- Feasibility: Viable as a commercial product
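The break-even range follows directly from the operating-cost estimate above: at $10/month ($120 per user per year) against $18.7K-38.5K in annual operating costs:

```python
operating_low, operating_high = 18_700, 38_500  # annual operating costs (USD)
annual_per_user = 10 * 12                       # $10/month subscription

users_low = operating_low / annual_per_user     # ~156 users at the low end
users_high = operating_high / annual_per_user   # ~321 users at the high end
```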
Risks & Mitigations#
Risk 1: Segmentation errors confuse users#
Mitigation: Allow manual correction, learn from corrections
Risk 2: Dictionary definitions not comprehensive#
Mitigation: Multiple dictionary sources, crowdsourced additions
Risk 3: Different period texts require different models#
Mitigation: Start with one period (Pre-Qin), expand based on user needs
Risk 4: Competition from free tools#
Mitigation: Focus on superior UX, accuracy, and features
Competitive Landscape#
- Pleco: Has Classical Chinese add-on, but limited parsing
- Wenlin: Established but dated UI, segmentation basic
- MDBG: Free but no parsing, just dictionary
- Academic tools: Often research prototypes, not user-friendly
Opportunity: Modern UI + good parsing could capture market
Verdict#
- Feasibility: High - technically achievable with existing tools
- Market: Clear demand from students and researchers
- Recommendation: Strong candidate for development - good balance of technical feasibility and market need
S4: Strategic
S4-Strategic: Approach#
Evaluation Method#
Strategic assessment of Classical Chinese parsing landscape through maturity analysis. Focus on:
- Ecosystem maturity: How developed is the field overall?
- Technology readiness: What’s production-ready vs research prototype?
- Community & governance: Who maintains these tools? Sustainability?
- Strategic positioning: Where are the opportunities and risks?
- Long-term outlook: What’s the 5-10 year trajectory?
Analysis Framework#
Technology Maturity (TRL - Technology Readiness Levels)#
- TRL 1-3: Basic research
- TRL 4-6: Proof of concept, prototype
- TRL 7-8: Production-ready, deployed
- TRL 9: Mature, widely adopted
Organizational Maturity#
- Stage 1: Individual researchers
- Stage 2: Research labs/projects
- Stage 3: Institutionalized (organizations maintaining)
- Stage 4: Ecosystem (multiple organizations, standards)
Market Maturity#
- Nascent: Few users, no clear use cases
- Emerging: Early adopters, use cases forming
- Growth: Multiple products, growing user base
- Mature: Established market, clear leaders
Libraries Analyzed#
- Stanford CoreNLP - Maturity analysis
- ctext.org - Platform sustainability
- Jiayan - Open-source project health
Strategic Questions#
- Is this a viable long-term investment domain?
- What are the key risks and opportunities?
- Who are the stakeholders and what are their incentives?
- What would it take to move the field forward significantly?
- Should we build, buy, or wait?
Time Box#
3-4 hours for strategic analysis and synthesis
ctext.org: Maturity Analysis#
Technology Readiness Level: TRL 8 (Production Corpus Platform)#
Overall Assessment#
ctext.org is a mature, production digital library with a stable API. For Classical Chinese text access, it is the de facto standard. It is not a parsing tool, but it is essential infrastructure.
Dimensions of Maturity#
1. Platform Maturity: Very High#
Service Stability:
- ✅ 15+ years of continuous operation (launched ~2006)
- ✅ Uptime: 99%+ (rare outages)
- ✅ Data quality: High (expert curation, error corrections over time)
- ✅ API reliability: Stable, well-documented
- ✅ Performance: Fast response times (<500 ms typical)
Content Maturity:
- ✅ 30,000+ texts (comprehensive coverage)
- ✅ Pre-Qin through Qing dynasty
- ✅ Multiple editions of major texts
- ✅ Parallel translations (English, modern Chinese)
- ✅ Ongoing additions and corrections
Technical Architecture:
- ✅ RESTful API with JSON/XML responses
- ✅ URN-based canonical references
- ✅ Full-text search with regex support
- ✅ Rate limiting and access tiers (sustainable)
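Canonical URN references (e.g. ctp:analects/xue-er for the first chapter of the Analects) are straightforward to handle programmatically. A small parser, assuming the scheme:text/chapter shape; verify the exact URN grammar against the ctext documentation:

```python
def parse_urn(urn):
    """Split a ctext-style URN such as 'ctp:analects/xue-er' into parts."""
    scheme, _, path = urn.partition(':')
    parts = path.split('/')
    return {
        'scheme': scheme,                                 # e.g. 'ctp'
        'text': parts[0],                                 # work identifier
        'chapter': parts[1] if len(parts) > 1 else None,  # optional subdivision
    }
```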
Maturity Score: 9/10 (as comprehensive as it can be for a digital corpus)
2. Organizational Maturity: Medium-High#
Governance:
- Owner: Dr. Donald Sturgeon (individual maintainer + small team)
- Affiliation: Associated with Durham University (UK)
- Funding Model: Subscriptions + grants + institutional partnerships
- Legal Status: Not a formal non-profit or corporation (potential risk)
Sustainability Indicators:
- ✅ 15+ years track record (proven longevity)
- ✅ Growing institutional subscriber base
- ✅ Active development (new features added regularly)
- ⚠️ Single-person leadership (succession risk)
- ⚠️ Not institutionalized (no large organization backing)
Risk Factors:
- ⚠️ Key person risk: Depends heavily on Dr. Sturgeon’s continued involvement
- ⚠️ Funding: Subscription model works but not guaranteed long-term
- ⚠️ Transition plan: Unclear what happens if maintainer unable to continue
- ⚠️ Data ownership: Texts are public domain, but platform is proprietary
Mitigation Factors:
- ✅ University affiliation provides some institutional support
- ✅ Corpus valuable enough that someone would likely maintain it
- ✅ Data exportable (could be hosted elsewhere if needed)
- ✅ Growing academic dependencies create incentive to preserve
Sustainability Score: 7/10 (good track record, but organizational risk)
3. Community & Ecosystem: Strong#
User Base:
- Researchers: Thousands globally (East Asian studies, sinology)
- Students: Tens of thousands (Classical Chinese learners)
- Institutions: 100+ university subscriptions
- Developers: Small but growing API user community
Ecosystem:
- ✅ Cited in hundreds of academic papers
- ✅ Integrated into educational curricula
- ✅ Third-party tools built on ctext API
- ✅ Active forums and user community
- ✅ Crowdsourced corrections and contributions
Academic Credibility:
- ✅ Trusted by leading sinologists
- ✅ Used in peer-reviewed research
- ✅ Referenced in major publications
- ✅ Considered authoritative for classical texts
Community Health: A- (strong user base, single maintainer is a weakness)
4. API & Developer Experience: Good#
API Quality:
- ✅ RESTful, predictable endpoints
- ✅ JSON responses (easy to parse)
- ✅ URN system for canonical references
- ✅ Good documentation with examples
- ⚠️ Rate limits (100/day free, 10K/day paid)
- ⚠️ Not fully RESTful (some inconsistencies)
Developer Adoption:
- ✅ Used in digital humanities projects
- ✅ Python libraries available (wrappers)
- ⚠️ Small developer community (niche use case)
- ⚠️ Limited examples of large-scale integrations
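Given the daily rate tiers, a client-side guard keeps a crawler from burning its quota mid-run. A minimal sketch; the 86400-second day boundary is a simplification of however the API actually resets its counter:

```python
import time

class DailyQuota:
    """Client-side guard for a daily request quota (e.g. 100/day free tier)."""

    def __init__(self, limit_per_day):
        self.limit = limit_per_day
        self.day = None   # epoch day of the current window
        self.used = 0

    def allow(self, now=None):
        """Return True and consume one request if still under today's limit."""
        now = time.time() if now is None else now
        day = int(now // 86400)
        if day != self.day:
            self.day, self.used = day, 0  # new day: reset the counter
        if self.used < self.limit:
            self.used += 1
            return True
        return False
```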
Developer Experience Score: 7/10 (functional, but could be more developer-friendly)
5. Competitive Position: Dominant#
Competitors:
- Wenlin: Desktop software, smaller corpus, not API accessible
- CBDB: Biographical database (complementary, not competing)
- CHGIS: Geographic data (complementary)
- Internet Archive: Has some classical texts but not specialized
- National libraries: Some digitization but not comprehensive or API-enabled
Market Position:
- ✅ De facto standard for Classical Chinese corpus access
- ✅ Most comprehensive single source
- ✅ Only major player with API access
- ✅ Network effects: citations and integrations create lock-in
Competitive Moat:
- 15+ years of curation and corrections
- URN system as standard reference format
- Institutional relationships and subscriptions
- Crowdsourced improvements over time
Strategic Position: Near-monopoly for comprehensive Classical Chinese corpus with API access
SWOT Analysis#
Strengths#
- Most comprehensive Classical Chinese corpus
- Stable, mature platform (15+ years)
- API access (unique among competitors)
- Trusted by academic community
- Ongoing improvements and additions
Weaknesses#
- Single-person leadership (succession risk)
- Not institutionalized (no large org backing)
- Rate limits can be restrictive for large projects
- API could be more developer-friendly
- Corpus access, not parsing (limited NLP features)
Opportunities#
- Institutional partnerships for long-term funding
- Enhanced API features (ML endpoints, embeddings)
- Integration with other tools (parsing, translation)
- Expansion to related corpora (Korean, Japanese classics)
- Open corpus model (while maintaining value-added services)
Threats#
- Key person risk (if maintainer unable to continue)
- Funding model sustainability (subscriptions may not scale)
- Competition from well-funded institutional projects
- Copyright/licensing issues for some texts
- Changes in academic funding for digital humanities
Strategic Recommendations#
DO Use ctext.org If:#
- Need access to Classical Chinese corpus (essential)
- Building research/educational tools (authoritative source)
- Want canonical references (URN system)
- Need parallel translations
- Volume within API rate limits
Plan for Risks:#
- Cache locally: Don’t depend solely on API availability
- Backup data: Export key texts for local fallback
- Alternative sources: Know where to get texts if ctext unavailable
- Monitor health: Watch for signs of maintainer burnout or funding issues
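The "cache locally" advice can be as simple as a read-through file cache keyed by URN. A sketch where `fetch` stands in for whatever ctext API call you use:

```python
import json
import os

def get_text(urn, fetch, cache_dir='ctext_cache'):
    """Return a text by URN, preferring the local cache over the live API."""
    os.makedirs(cache_dir, exist_ok=True)
    name = urn.replace(':', '_').replace('/', '_') + '.json'
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)        # cache hit: no API call
    data = fetch(urn)                  # cache miss: hit the API once
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)
    return data
```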
Strategic Partnerships:#
If building on ctext.org for commercial/large-scale project:
- Subscribe to commercial tier: Support the platform financially
- Negotiate custom agreement: For high-volume API access
- Contribute back: Submit corrections, support development
- Have contingency: Plan B if ctext becomes unavailable
Long-Term Outlook (5-10 years)#
Most Likely Scenario: Continued Operation with Transition#
- Platform continues operating (too valuable to abandon)
- Gradual institutionalization (foundation or university takes over)
- Expanded funding through partnerships and grants
- Maintained and improved by small dedicated team
Probability: 60%
Optimistic Scenario: Institutionalization & Expansion#
- Major foundation (Mellon, NEH) provides sustained funding
- Formal governance structure created
- Additional staff hired
- Platform expands: more texts, better API, ML integration
- Becomes permanent infrastructure for digital sinology
Probability: 25%
Pessimistic Scenario: Decline or Closure#
- Maintainer unable to continue, no successor
- Funding dries up, subscriptions insufficient
- Platform deteriorates or shuts down
- Corpus archived but not actively maintained
- Community scrambles to self-host
Probability: 15%
Mitigation for Pessimistic Scenario:#
- Corpus is largely public domain → can be preserved
- Academic community would likely fund rescue effort
- Multiple institutions have local copies
- Data loss unlikely, but API access might end
Investment Recommendation#
For Classical Chinese Parsing Project:
Score: 9/10 - Highly recommended (with contingencies)
Rationale:
- Essential resource, no viable alternative
- Stable, mature platform
- De facto standard for corpus access
- Risk is manageable with local caching
Strategic Approach:
- Depend on ctext for corpus (no better alternative)
- Cache locally (don’t depend on real-time API for critical functions)
- Support financially (subscribe to help ensure sustainability)
- Have fallback (know how to get texts elsewhere if needed)
- Contribute back (support the ecosystem)
Integration Pattern:
Initial development: Use ctext API directly
Production: Cache corpus locally, periodic sync
Fallback: Local text files if API unavailable
Long-term: Consider sponsoring or partnering with ctext
Bottom Line: ctext.org is critical infrastructure for Classical Chinese NLP. Use it, support it, but have contingencies for organizational risk. For corpus access, there is no better option currently available.
Recommendations for ctext.org (Unsolicited)#
If Dr. Sturgeon or ctext team reads this:
- Succession planning: Document knowledge transfer, identify potential successors
- Institutionalization: Partner with university or foundation for long-term governance
- Endowment: Seek multi-year funding commitment from institutions
- Open corpus: Consider releasing corpus under open license (retain value-added services)
- Community governance: Create advisory board, involve stakeholders in decisions
- API improvements: Expand rate limits, add ML endpoints, improve docs
Why this matters: ctext.org is too important to the field to have single-point-of-failure risk. The community depends on it and would support efforts to ensure long-term sustainability.
Jiayan: Maturity Analysis#
Technology Readiness Level: TRL 5-6 (Prototype / Early Deployment)#
Overall Assessment#
Jiayan is a specialized tool that works but lacks production maturity. Good for Classical Chinese segmentation, but limited functionality, documentation, and community support. Best used as a component, not a complete solution.
Dimensions of Maturity#
1. Technical Maturity: Medium#
Functional Completeness:
- ✅ Core segmentation works for Classical Chinese
- ✅ Better than general-purpose Chinese segmenters for 文言文
- ⚠️ POS tagging is experimental (limited accuracy)
- ⚠️ No parsing, NER, or advanced features
- ⚠️ Performance not optimized (slower than production tools)
Code Quality:
- ⚠️ Limited testing (no comprehensive test suite visible)
- ⚠️ Minimal error handling
- ⚠️ Not production-hardened (edge cases may fail)
- ✅ Pure Python (easy to modify and extend)
Production Readiness:
- ⚠️ No official benchmarks published
- ⚠️ Accuracy on various text types not documented
- ⚠️ Performance characteristics not documented
- ⚠️ No SLA or support commitment
Technical Maturity Score: 5/10 (works but not production-grade)
2. Organizational Maturity: Low#
Governance:
- Type: Individual open-source project
- Maintainer: Appears to be individual developer(s)
- Organization: No formal organization or backing
- Funding: No visible funding model
Project Health Indicators:
- ⚠️ Maintenance: Sporadic updates (check GitHub for current status)
- ⚠️ Responsiveness: Limited response to issues/PRs
- ⚠️ Roadmap: No public roadmap or development plan
- ⚠️ Governance: No formal governance structure
Sustainability Risks:
- ❌ Single maintainer: High bus factor risk
- ❌ No funding: Maintenance depends on volunteer time
- ❌ No institutional backing: No organization ensuring continuity
- ❌ Niche user base: Limited community to take over if abandoned
Organizational Maturity Score: 2/10 (vulnerable to abandonment)
3. Community Health: Low#
Community Size:
- GitHub stars: Likely 100-500 (check current)
- Contributors: Likely <5 meaningful contributors
- Users: Small, specialized user base (Classical Chinese researchers)
- Discussion: Minimal community discussion visible
Documentation:
- ⚠️ Basic usage examples exist
- ⚠️ Limited English documentation (may be Chinese-only)
- ❌ No comprehensive API docs
- ❌ No performance tuning guides
- ❌ No advanced usage examples
Support:
- ❌ No commercial support available
- ❌ No active forum or community channel
- ⚠️ GitHub issues for bug reports (response time variable)
- ❌ No Stack Overflow community
Community Health Score: 3/10 (small, inactive community)
4. Licensing & Commercial Viability: Unknown#
License:
- Status: Check repository for license (likely open source)
- Best case: MIT or Apache 2.0 (permissive)
- Worst case: GPL or no license (restrictive or unusable)
- Impact: Determines if safe for commercial use
Commercial Viability:
- ✅ If permissive license: Can be integrated into products
- ✅ Can be forked and maintained independently if abandoned
- ⚠️ No commercial support or guarantees
- ⚠️ May need to maintain your own fork
Licensing Score: 6/10 (assuming permissive license; verify before use)
5. Competitive Position: Specialized Niche#
For Classical Chinese Segmentation:
- ✅ Best available: Better than general tools for 文言文
- ✅ Specialized: Fills gap that others don’t address
- ⚠️ Limited competition: Few alternatives (low switching cost if abandoned)
Competitors:
- General Chinese segmenters (jieba, HanLP, pkuseg): Better maintained, worse for Classical
- Stanford CoreNLP: More features, worse for Classical
- Custom solutions: Many users may roll their own
Market Position:
- Niche leader but small niche
- Low barrier to replacement: Could be reimplemented if needed
- Not defensible: Algorithm not proprietary, corpus not unique
Competitive Position Score: 5/10 (leader in small niche, but easily replaced)
SWOT Analysis#
Strengths#
- Actually works for Classical Chinese (proven concept)
- Better than alternatives for 文言文 segmentation
- Pure Python (easy to integrate and modify)
- Open source (can be forked and improved)
Weaknesses#
- Limited to segmentation (no full NLP pipeline)
- Minimal documentation and support
- Single maintainer, no organizational backing
- Small community, low visibility
- Not production-hardened
- Unknown long-term sustainability
Opportunities#
- Could be improved and maintained by community
- Potential for academic institution to adopt and support
- Could be integrated into larger Classical Chinese NLP project
- Foundation for more comprehensive tool
Threats#
- Abandonment: Maintainer stops development (high risk)
- Obsolescence: Better tool emerges (transformers, etc.)
- Maintenance burden: Users must maintain their own forks
- Limited growth: Niche market prevents sustainability
Strategic Recommendations#
DO Use Jiayan If:#
- Need Classical Chinese segmentation (best available option)
- Python-based project (easy integration)
- Can maintain a fork (if it gets abandoned)
- Segmentation only (don’t need full NLP)
- Open-source/academic project (risk tolerance higher)
DO NOT Use Jiayan If:#
- Need production SLA (no support or guarantees)
- Need full NLP pipeline (only does segmentation)
- Risk-averse commercial project (sustainability concerns)
- Need comprehensive docs (limited documentation)
Risk Mitigation Strategies:#
Strategy 1: Fork & Maintain#
1. Fork Jiayan repository
2. Add to your organization's GitHub
3. Budget for maintenance (2-4 hours/month)
4. Contribute improvements upstream
Strategy 2: Wrap & Abstract#
1. Create abstraction layer over Jiayan
2. Implement interface that could use different segmenters
3. If Jiayan fails, can swap implementation
4. Reduces switching cost
Strategy 3: Contribute & Partner#
1. Contribute improvements to Jiayan
2. Help with documentation and testing
3. Build relationship with maintainer
4. Offer sponsorship if possible
Strategy 4: Build Alternative#
1. Use Jiayan as reference implementation
2. Build own Classical Chinese segmenter
3. Train on same or better corpus
4. Full control, but higher initial investment
Recommended: Strategy 2 (Wrap & Abstract) + Strategy 3 (Contribute)
Long-Term Outlook (5-10 years)#
Most Likely Scenario: Gradual Abandonment#
- Maintainer loses interest or time
- Updates become less frequent
- Project enters maintenance mode
- Community forks or replaces with alternatives
Probability: 50%
Optimistic Scenario: Community Adoption#
- Academic institution adopts project
- Additional maintainers join
- Documentation improved
- Becomes standard tool for Classical Chinese
Probability: 20%
Pessimistic Scenario: Immediate Abandonment#
- Maintainer stops work without notice
- No one steps up to maintain
- Users must fork or replace
- Knowledge scattered
Probability: 15%
Alternative Scenario: Superseded#
- Better tool emerges (transformer-based, better trained)
- Community migrates to new solution
- Jiayan becomes legacy
Probability: 15%
Investment Recommendation#
For Classical Chinese Parsing Project:
Score: 6/10 - Use with caution and contingencies
Rationale:
- Best available for Classical Chinese segmentation
- Acceptable risk if you can maintain a fork
- Not suitable as sole dependency without backup plan
- Good starting point but plan to replace or maintain
Strategic Approach:

```python
# Phase 1: Use Jiayan directly (Months 1-3).
# Jiayan's documented interface loads a pre-trained language model
# and wraps it in an HMM tokenizer:
from jiayan import load_lm, CharHMMTokenizer

lm = load_lm('jiayan.klm')  # model file distributed with Jiayan
tokenizer = CharHMMTokenizer(lm)
words = list(tokenizer.tokenize('天下皆知美之為美'))

# Phase 2: Wrap it behind an abstraction layer (Month 3).
# JiayanWrapper and CustomSegmenter are placeholders for your own adapters.
class ChineseSegmenter:
    def __init__(self, backend='jiayan'):
        if backend == 'jiayan':
            self.impl = JiayanWrapper()
        elif backend == 'custom':
            self.impl = CustomSegmenter()
        else:
            raise ValueError(f'unknown backend: {backend}')

    def segment(self, text):
        return self.impl.segment(text)

# Phase 3: Evaluate alternatives (Months 4-6)
# Phase 4: Migrate if a better option emerges (Months 6-12)
```
Budget Implications:#
Option A: Use Jiayan As-Is
- Cost: $0 upfront
- Risk: High (abandonment, bugs)
- Maintenance: Minimal until it breaks
Option B: Fork & Maintain
- Initial: 20-40 hours to audit and test ($3K-6K)
- Ongoing: 2-4 hours/month ($3K-6K/year)
- Risk: Medium (you control it)
Option C: Build Alternative
- Initial: 2-4 months development ($30K-60K)
- Ongoing: Standard maintenance
- Risk: Low (full control)
Recommended: Option B (Fork & Maintain)
Specific Action Items#
Before Using Jiayan:#
✅ Check current GitHub status
- Last commit date
- Open issues and response time
- Number of contributors
✅ Verify license
- Ensure compatible with your use case
- Check for any restrictions
✅ Test thoroughly
- Create test suite for your use cases
- Benchmark accuracy on your texts
- Test edge cases
✅ Plan for replacement
- Abstract interface (Strategy 2)
- Identify alternative approaches
- Budget for transition
✅ Fork repository
- Create fork in your organization
- Document any modifications
- Track upstream changes
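The "test thoroughly" item can be made concrete with a small benchmark harness. The word-level precision/recall/F1 below is the standard segmentation metric; the gold and predicted segmentations here are illustrative, not from any real corpus:

```python
def to_spans(words):
    """Convert a word list to (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def segmentation_f1(predicted, gold):
    """Word-level F1 between two segmentations of the same text."""
    p, g = to_spans(predicted), to_spans(gold)
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a gold segmentation vs a hypothetical segmenter output
# that wrongly merged the first two characters.
gold = ['學', '而', '時', '習', '之']
pred = ['學而', '時', '習', '之']
print(round(segmentation_f1(pred, gold), 3))
```

Running this over a few hundred sentences from your own target texts gives the benchmark numbers the checklist asks for.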
After Deploying:#
✅ Monitor health
- Watch for upstream updates
- Track any issues you encounter
- Stay aware of alternatives
✅ Contribute back
- Submit bug fixes upstream
- Improve documentation
- Share improvements with community
✅ Have exit plan
- Know how to switch to alternative
- Budget for replacement if needed
- Don’t get locked in
Bottom Line#
Jiayan is the best tool currently available for Classical Chinese segmentation, but it’s not production-grade.
Use it as a starting point, but:
- Don’t rely on it exclusively
- Have a backup plan
- Be prepared to fork or replace
- Budget for maintenance
For an open-source academic project: Accept the risk, use it.
For a commercial product: Use it, but abstract the interface and have a replacement strategy.
For critical infrastructure: Consider building your own or partnering with maintainer to ensure sustainability.
The technology is good; the sustainability is questionable. Plan accordingly.
S4-Strategic: Recommendation#
Executive Summary#
The Classical Chinese parsing ecosystem is immature but viable for development. No production-ready solutions exist, but the components needed to build one are available. Strategic opportunities exist for organizations willing to invest in this underserved niche.
Ecosystem Maturity Assessment#
Overall Readiness: TRL 4-5 (Early Development)#
| Component | Maturity | Strategic Position | Recommendation |
|---|---|---|---|
| Corpus Access (ctext.org) | TRL 8 ✅ | Essential infrastructure | Use, support, have contingency |
| Segmentation (Jiayan) | TRL 5-6 ⚠️ | Best available, risky | Use with fork/abstraction |
| POS Tagging | TRL 2-3 ❌ | Research needed | Build custom |
| Parsing | TRL 2-3 ❌ | Research needed | Build custom or adapt |
| NER (Historical) | TRL 2-3 ❌ | Research needed | Build with gazetteers |
Key Finding: Foundation exists (corpus + segmentation), but NLP pipeline must be built.
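As a sketch of how thin the corpus-access layer is: ctext.org exposes a simple HTTP API (the `gettext` endpoint name and `urn` parameter are taken from its public API documentation; verify against the current docs before relying on them). A request URL can be built with nothing beyond the standard library:

```python
from urllib.parse import urlencode

API_BASE = 'https://api.ctext.org'  # ctext.org's API host

def ctext_request_url(endpoint, **params):
    """Build a ctext.org API request URL (no network call made here)."""
    return f'{API_BASE}/{endpoint}?{urlencode(params)}'

# gettext returns the text identified by a ctext URN as JSON:
url = ctext_request_url('gettext', urn='ctp:analects/xue-er')
print(url)
```

Fetching that URL with any HTTP client, then feeding the returned text into a segmenter, is the entire "foundation" pipeline.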
Market Maturity: Nascent → Emerging#
Current State:
- Small but global user base (researchers, students)
- No dominant commercial players
- Academic tools only (research prototypes)
- Growing interest in digital humanities
5-Year Outlook:
- Market size: 10,000-50,000 potential users
- Revenue potential: $500K-$5M/year (tools + services)
- Sustainability: Achievable via grants + subscriptions
- Competition: Will emerge if market validated
Strategic Window: 2-3 years to establish position before competition increases
Community & Governance: Fragmented#
Stakeholders:
- Academic researchers - Need tools, limited funding
- Students - Willing to pay small amounts ($5-15/month)
- Cultural institutions - Grant-funded, long sales cycles
- Digital humanities centers - Early adopters, opinion leaders
- Commercial ed-tech (Pleco, Skritter) - Potential acquirers/partners
Governance Gap:
- No standards body or consortium
- No shared infrastructure beyond ctext.org
- No coordinated development efforts
- Opportunity: Lead consortium formation
Strategic Options Analysis#
Option 1: Build Commercial Product (Reading Assistant)#
Strategy: Create best-in-class Classical Chinese reading tool for students/researchers
Investment: $100K-300K over 18 months
Pros:
- ✅ Clear user need and willingness to pay
- ✅ Fast time to market (6-12 months)
- ✅ Proven revenue model (Pleco, Wenlin)
- ✅ Manageable scope
Cons:
- ⚠️ Small market (niche)
- ⚠️ Price sensitivity ($5-15/month ceiling)
- ⚠️ Competition from free tools possible
- ⚠️ Market size limits growth potential
Best For: Bootstrapped startup, individual developer, small team
Expected Outcome: $50K-200K/year revenue, sustainable niche business
Risk Level: Medium - Clear demand but limited scale
Option 2: Grant-Funded Research Infrastructure#
Strategy: Build open-source Classical Chinese NLP platform with academic partnerships
Investment: $500K-$1M over 3-4 years (grant-funded)
Pros:
- ✅ High academic impact
- ✅ Grant funding available (NEH, Mellon)
- ✅ Intellectual credibility
- ✅ Long-term sustainability via institutions
Cons:
- ⚠️ Slow (grant cycles, academic timelines)
- ⚠️ Funding not guaranteed
- ⚠️ Academic politics and overhead
- ⚠️ Limited commercial potential
Best For: Universities, research centers, non-profits
Expected Outcome: Standard infrastructure for field, cited in hundreds of papers
Risk Level: Low - If grant secured, likely to succeed
Option 3: Open Source + Services#
Strategy: Build open-source tools, monetize via hosting/consulting/support
Investment: $200K-500K over 2 years
Pros:
- ✅ Community building potential
- ✅ Multiple revenue streams
- ✅ Flexible, can pivot
- ✅ Attracts contributors
Cons:
- ⚠️ Services revenue unpredictable
- ⚠️ Support burden
- ⚠️ Open source limits pricing power
- ⚠️ Competitors can copy
Best For: Developer-focused companies, dev tools companies
Expected Outcome: $100K-500K/year services revenue, ecosystem leadership
Risk Level: Medium-High - Services sales are hard
Option 4: Partner with Established Player#
Strategy: License or sell technology to Pleco, Skritter, or similar company
Investment: $50K-150K to build proof-of-concept
Pros:
- ✅ Fast route to market
- ✅ Existing distribution
- ✅ Less risk (partner carries)
- ✅ Upfront payment possible
Cons:
- ⚠️ Give up control and upside
- ⚠️ Partner may not prioritize
- ⚠️ Cultural fit challenges
- ⚠️ Licensing complexity
Best For: Individual developers wanting exit, small teams
Expected Outcome: $50K-200K licensing fee, ongoing royalties
Risk Level: Medium - Partner deal risk
Option 5: Wait and Watch#
Strategy: Monitor field, enter when clearer opportunity emerges
Investment: $0 (opportunity cost only)
Pros:
- ✅ No financial risk
- ✅ Learn from others’ mistakes
- ✅ Better data when ready
- ✅ Technology improves
Cons:
- ❌ Miss first-mover advantage
- ❌ Someone else may win market
- ❌ Academic grants claimed by others
- ❌ Opportunity window may close
Best For: Risk-averse organizations, those with other priorities
Expected Outcome: Preserved optionality, but no value created
Risk Level: Low financial risk, high opportunity cost
Recommended Strategy: Hybrid Incremental#
Phase 1: Proof of Concept (Months 1-6, $25K-50K)#
Goal: Validate technical approach and market interest
Approach:
- Build minimal reading assistant (Jiayan + ctext + basic UI)
- Release beta to 100 users (students, researchers)
- Gather feedback and usage data
- Measure willingness to pay
Decision Point: If positive validation, proceed to Phase 2. If not, pivot or stop.
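The Phase 1 scope (Jiayan + ctext + basic UI) reduces to a small core loop: segment the text, then attach a gloss to each word. A toy sketch, with a hand-rolled five-entry glossary standing in for a real dictionary and the segmenter's output supplied by hand:

```python
# Minimal reading-assistant core: segment, then gloss each word.
# The glossary here is a toy stand-in; in Phase 1 the words would come
# from Jiayan and the glosses from a real Classical Chinese dictionary.
GLOSSARY = {
    '學': 'to study',
    '而': 'and; then',
    '時': 'timely; at times',
    '習': 'to practice',
    '之': 'it (object pronoun)',
}

def gloss(words, glossary=GLOSSARY):
    """Pair each segmented word with a dictionary gloss (or a marker)."""
    return [(w, glossary.get(w, '<no entry>')) for w in words]

for word, meaning in gloss(['學', '而', '時', '習', '之']):
    print(f'{word}\t{meaning}')
```

Everything beyond this loop (UI, accounts, progress tracking) is product work, which is why a 6-month proof of concept is realistic.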
Phase 2: Product Development (Months 7-18, $100K-200K)#
Goal: Build production-ready reading assistant + grant application
Approach:
- Productize reading assistant (full features, polish)
- Launch with freemium model ($0-10/month)
- Prepare NEH Digital Humanities grant application
- Partner with 2-3 universities for pilot programs
Decision Point: Viable business OR grant secured → Phase 3. Else, maintain as side project.
Phase 3: Platform Expansion (Years 2-3, $300K-600K)#
Goal: Build comprehensive Classical Chinese NLP platform
Funding: Grant money + product revenue + institutional partnerships
Approach:
- Develop full NLP pipeline (POS, parsing, NER)
- Build research tools (literature search, digitization)
- Open-source core components
- Establish academic consortium for governance
Decision Point: Sustainable (revenue + grants) → Scale. Otherwise → Transition to maintenance.
Phase 4: Ecosystem Leadership (Years 3-5)#
Goal: Become infrastructure for Classical Chinese digital humanities
Approach:
- Transfer to foundation or academic consortium
- Establish standards and best practices
- Coordinate community development
- Ensure long-term sustainability
Key Success Factors#
Technical#
- Start with what works (Jiayan, ctext) - don’t reinvent
- Incremental complexity - segmentation → POS → parsing
- Modular architecture - components can be used separately
- Quality over completeness - better to do one thing well
Market#
- Early adopters - Students and digital humanities researchers
- Academic credibility - Partner with universities early
- Network effects - Encourage integrations and citations
- Pricing - Free tier + affordable premium ($5-15/month)
Organizational#
- Hybrid model - Commercial + open source + academic
- Grant funding - Essential for infrastructure components
- Partnerships - Don’t go alone (universities, foundations)
- Community - Build contributors and advocates
Sustainability#
- Multiple revenue streams - Grants + subscriptions + services
- Low burn rate - Keep costs minimal, bootstrap when possible
- Endowment mentality - Build for 10+ year horizon
- Transition plan - Eventually transfer to institution or foundation
Risk Management#
Top Risks & Mitigations#
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Market too small | Medium | High | Start small, validate early |
| Technical complexity | Low | Medium | Use existing components, incremental |
| Funding gaps | Medium | High | Multiple revenue streams, low burn |
| ctext.org becomes unavailable | Low | High | Local caching, alternative sources |
| Jiayan abandoned | High | Medium | Fork, abstraction layer, replacement plan |
| Competition emerges | Low | Medium | First-mover advantage, academic credibility |
| Transformer models obsolete approach | Medium | Medium | Monitor, be ready to adopt transformers |
| Grant applications rejected | Medium | Medium | Multiple applications, don’t depend on one |
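The "local caching" mitigation in the table can be sketched as a thin wrapper that persists every API response to disk; `fetch_fn` is a placeholder for whatever function actually performs the HTTP request:

```python
import json
from pathlib import Path

# Mitigation for "ctext.org becomes unavailable": cache every API
# response locally so previously fetched texts keep working offline.
class CachedFetcher:
    def __init__(self, fetch_fn, cache_dir='ctext_cache'):
        self.fetch_fn = fetch_fn
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def get(self, urn):
        name = urn.replace(':', '_').replace('/', '_') + '.json'
        path = self.cache_dir / name
        if path.exists():  # serve from the local cache first
            return json.loads(path.read_text(encoding='utf-8'))
        data = self.fetch_fn(urn)  # fall back to the live API
        path.write_text(json.dumps(data, ensure_ascii=False), encoding='utf-8')
        return data
```

Over time the cache doubles as a local mirror of every text your users have opened, which also covers the "alternative sources" half of the mitigation.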
Strategic Positioning#
How to Win#
Differentiation:
- Best Classical Chinese support (not general Chinese)
- Academic credibility (peer-reviewed, cited)
- Open ecosystem (not proprietary lock-in)
- User experience (better UX than academic tools)
Defensibility:
- Annotated corpus (expensive to replicate)
- Academic partnerships and citations
- Network effects (integrations, community)
- Domain expertise (Classical Chinese knowledge)
Avoid Competing On:
- Modern Chinese (established players win)
- General NLP (commoditizing rapidly)
- Price (race to bottom)
Timeline & Milestones#
Year 1: Validation#
- Q1: MVP reading assistant
- Q2: Beta launch, 100 users
- Q3: Product refinement, first revenue
- Q4: Grant application, 500 users
Year 2: Growth#
- Q1: Grant decision, feature expansion
- Q2: 2,000 users, institutional pilots
- Q3: Begin platform development
- Q4: 5,000 users, break-even
Year 3: Platform#
- Q1: Research tools launch
- Q2: Academic consortium formation
- Q3: Open-source release
- Q4: 10,000 users, sustainable
Year 4-5: Leadership#
- Ecosystem development
- Standards creation
- Community growth
- Long-term sustainability
Final Recommendation#
GO/NO-GO Decision Framework:
✅ GO if you have:
- $25K-50K for proof-of-concept (can be bootstrapped)
- 6-12 months to invest before returns
- Classical Chinese domain knowledge (or access to expert)
- Python/ML technical skills
- Willingness to pursue grants
- Long-term perspective (3-5 year horizon)
❌ NO-GO if you need:
- Immediate revenue (18+ months to meaningful revenue)
- Large market (this is niche)
- Low-risk investment (technical and market uncertainty)
- Exit in 2-3 years (not venture-scale)
Personal Recommendation#
If I were making this decision:
For a startup/company: Build reading assistant (Option 1), bootstrap, keep costs minimal. If it works, expand. If not, limited downside.
For a university: Apply for grant (Option 2), build open infrastructure. Long-term impact, fits academic mission.
For an individual developer: Partner with established player (Option 4) or build and sell to Pleco/Skritter. Fastest path to value.
For a foundation: Fund Option 2 + 3 hybrid. Open infrastructure with commercial layer. Maximum field impact.
Most Likely to Succeed: Hybrid approach (reading assistant → grant funding → platform → ecosystem)
Bottom Line: This is a viable opportunity for the right organization with the right expectations. Not venture-scale, but could be a sustainable, impactful business or research infrastructure. The field needs this, and the timing is good (underserved market, technology ready enough).
The question isn’t “Is it possible?” (it is). The question is “Is it worth it FOR YOU?” That depends on your goals, resources, and risk tolerance.
Stanford CoreNLP: Maturity Analysis#
Technology Readiness Level: TRL 8-9 (for Modern Chinese)#
Overall Assessment#
Stanford CoreNLP is a mature, production-ready system for modern Chinese NLP, but sits at TRL 3-4 for Classical Chinese, where significant research and adaptation would be required.
Dimensions of Maturity#
1. Technical Maturity: Very High (Modern) / Low (Classical)#
Modern Chinese:
- ✅ Production deployments at scale (Google, Facebook, etc.)
- ✅ Proven accuracy (95%+ segmentation, 80%+ parsing)
- ✅ Optimized performance (1000+ tokens/sec)
- ✅ Comprehensive testing and validation
- ✅ Support for neural and statistical models
Classical Chinese:
- ⚠️ No pre-trained models
- ⚠️ Would require retraining on annotated corpus
- ⚠️ No benchmarks or validation data
- ⚠️ Grammar assumptions may not transfer well
Verdict: Cannot be used off-the-shelf for Classical Chinese. Would need 6-12 months of adaptation work.
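To make the adaptation burden concrete: CoreNLP's modern-Chinese pipeline is driven by a properties file along these lines (sketched from the StanfordCoreNLP-chinese.properties shipped in the Chinese models jar; exact model paths vary by release). Every model it references is trained on modern-Chinese treebanks, so Classical support means retraining and swapping each one:

```properties
# Modeled on StanfordCoreNLP-chinese.properties; verify paths
# against the models jar for your CoreNLP release.
annotators = tokenize, ssplit, pos, lemma, ner, parse
tokenize.language = zh
# Each model below is trained on modern Chinese (CTB) data:
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim.tagger
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
```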
2. Organizational Maturity: Very High#
Governance:
- Owner: Stanford NLP Group (academic institution)
- Leadership: Established professors (Manning, Socher, Potts)
- Funding: University + research grants + industry partnerships
- Track record: 15+ years of consistent development
Sustainability Indicators:
- ✅ Institutional backing ensures long-term survival
- ✅ Used in research and teaching → incentive to maintain
- ✅ Large user base creates network effects
- ✅ Multiple contributors beyond core team
Risk Factors:
- ⚠️ Dependent on continued academic interest
- ⚠️ If key faculty leave, could lose momentum
- ⚠️ Classical Chinese not a research priority for Stanford
Sustainability Score: 9/10 (for modern Chinese), 3/10 (for Classical Chinese development)
3. Community Health: Excellent (Modern) / Minimal (Classical)#
Modern Chinese Community:
- GitHub stars: 9,000+
- Contributors: 50+
- Issues/PRs: Active, responsive maintainers
- Documentation: Comprehensive
- Stack Overflow: 1,000+ questions answered
- Academic citations: 10,000+ papers
Classical Chinese Community:
- Interest: Minimal visible activity
- Resources: No dedicated models or docs
- Discussion: Occasional forum questions, no dedicated channel
- Academic research: Few papers on CoreNLP for Classical Chinese
Community Health: A+ for modern, D- for Classical
4. Licensing & Commercial Viability: Moderate#
License: GPL v3+
- ✅ Free to use and modify
- ⚠️ GPL requires derivative works to be GPL (copyleft)
- ⚠️ Can complicate commercial use if proprietary features needed
- ✅ Commercial licensing available from Stanford (for proprietary use)
Implications for Classical Chinese Project:
- Open-source project: GPL is fine
- Commercial product: May need to pay for commercial license or use wrappers
- Hybrid approach: Keep CoreNLP component separate, proprietary layer on top
Commercial Viability Score: 6/10 (GPL is workable but not ideal)
5. Competitive Position: Strong (General Chinese NLP) / Weak (Classical)#
Competitors:
- HanLP: Similar capabilities, Apache 2.0 license (more permissive)
- spaCy: Modern architecture, growing Chinese support
- Stanza: Stanford’s own Python-native library (successor to CoreNLP)
- Transformers (Hugging Face): BERT-based models outperforming traditional
Trends:
- Traditional parsers (CoreNLP) being replaced by transformers
- BERT, GPT-style models becoming dominant
- Classical Chinese could benefit from transformer approach
Strategic Position: CoreNLP is established but declining for modern Chinese. Not a strategic foundation for Classical Chinese work.
SWOT Analysis#
Strengths#
- Proven architecture and algorithms
- Well-documented and tested
- Institutional backing
- Comprehensive NLP pipeline
Weaknesses#
- Not designed for Classical Chinese
- GPL licensing can be restrictive
- Java-based (Python is preferred in NLP community)
- Being superseded by transformer models
Opportunities#
- Could be adapted for Classical Chinese if funding available
- Architecture could inform custom Classical Chinese parser
- Training pipeline could be reused with Classical corpus
Threats#
- Newer transformer models may be better starting point
- Classical Chinese not a priority for Stanford
- Limited community interest in Classical Chinese NLP
- May become legacy technology as field moves to transformers
Strategic Recommendations#
Do NOT Use CoreNLP If:#
- Need out-of-the-box Classical Chinese parsing
- Want cutting-edge NLP (transformers are better)
- Need permissive license (Apache/MIT)
- Prefer Python-native tools
DO Use CoreNLP If:#
- Familiar with CoreNLP and want to experiment
- Have resources to retrain models
- Need reference implementation of parsing algorithms
- Building hybrid system and need modern Chinese component
Better Alternatives:#
- For Classical Chinese: Jiayan + custom components (lighter, focused)
- For modern Chinese: HanLP or spaCy (better licenses, active development)
- For state-of-art: Fine-tune BERT/GPT on Classical Chinese (transformer approach)
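The transformer route recasts segmentation as character-level sequence labeling. The standard BMES label conversion used to prepare training data can be sketched as follows (the labeling scheme is standard practice; no particular model or library is assumed):

```python
def words_to_bmes(words):
    """Convert a segmented sentence to per-character BMES labels:
    B = begins a word, M = middle, E = ends a word, S = single-char word."""
    chars, labels = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            labels.append('S')
        else:
            labels.extend(['B'] + ['M'] * (len(w) - 2) + ['E'])
    return chars, labels

chars, labels = words_to_bmes(['天下', '之', '達道', '也'])
print(list(zip(chars, labels)))
```

Given an annotated Classical corpus converted this way, fine-tuning a character-level BERT to predict the labels is a conventional token-classification task.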
Long-Term Outlook (5-10 years)#
Likely Scenario: Gradual Obsolescence#
- CoreNLP will remain in use for existing deployments
- New projects will favor transformers (BERT, GPT)
- Classical Chinese adaptation unlikely to happen
- May become “legacy but stable” technology
Pessimistic Scenario: Abandonment#
- Stanford shifts focus entirely to transformers
- CoreNLP maintenance becomes minimal
- Community forks or abandons the project
Optimistic Scenario: Renaissance#
- Renewed interest in interpretable parsing (vs black-box transformers)
- Classical Chinese model developed as research project
- Integration with modern tools (transformer + traditional parsing)
Most Likely: Gradual obsolescence. CoreNLP will remain usable but not cutting-edge.
Investment Recommendation#
For Classical Chinese Parsing Project:
Score: 4/10 - Not recommended as primary platform
Rationale:
- High adaptation effort (6-12 months)
- Better alternatives exist (Jiayan, custom solution)
- GPL licensing complications for commercial use
- Classical Chinese not a priority for maintainers
- Transformer models likely better long-term bet
Alternative Strategy:
- Use CoreNLP as reference architecture (learn from their design)
- Borrow algorithms and training procedures
- Build lighter, Python-native Classical Chinese parser
- Consider transformer approach (BERT fine-tuning) instead
When CoreNLP Makes Sense:
- You already use it for modern Chinese (add Classical as extension)
- Academic project (GPL not an issue)
- Have Stanford partnership or collaboration
Bottom Line: CoreNLP is excellent for what it does (modern Chinese), but not the right foundation for Classical Chinese NLP. Learn from it, but don’t build on it.