1.148.1 Chinese Morphological Analysis#

Tools for Chinese text structure analysis (character decomposition, compound words): HanLP, Stanza, LTP, and related libraries


Explainer

Chinese Morphological Analysis: Domain Explainer#

What This Solves#

Chinese text processing faces a fundamental challenge that doesn’t exist in space-delimited languages: identifying where one meaningful unit ends and another begins. This affects everything from search engines to translation tools to educational software.

The problem splits into two distinct layers:

Character-level: Chinese characters themselves are composite structures. The character 好 (good) consists of 女 (woman) and 子 (child). Understanding these components helps with learning, search, and etymology research. But there’s no standard programming library that exposes this structure in a modern, maintainable way.

Word-level: Chinese text doesn’t use spaces between words. “我爱学习汉字” could be segmented multiple ways. Getting this wrong breaks translation, search, and text analysis. While mature tools exist for word segmentation, they operate independently from character structure analysis.

This research addresses: Which tools provide character decomposition? Which handle word segmentation? Can we combine them effectively? And critically: which approaches avoid technical debt that will burden your project for years?

Accessible Analogies#

Character Decomposition is Like Molecular Chemistry

Think of Chinese characters as molecules and components as atoms. Water (H₂O) is composed of hydrogen and oxygen. Similarly, 江 (river) is composed of 氵 (water radical) + 工 (phonetic component). Just as knowing molecular structure helps chemists, knowing character structure helps learners, search systems, and etymology researchers.

But here’s the catch: While periodic tables and molecular databases are universally standardized, Chinese character decomposition data is scattered across legacy codebases, aging libraries, and various incompatible formats. Imagine if every chemistry lab had to maintain its own periodic table with slightly different data.

Word Segmentation is Like Assembly Line Organization

Picture a factory where products flow down a conveyor belt without any dividers. Your job: figure out which items belong together as a unit. A box of screws and a wrench might be one product, or three separate items, depending on context.

Chinese text is that conveyor belt. The characters 中国人 could be “China” + “person” (Chinese person) or “middle” + “country” + “person” (person from the middle kingdom). Context determines the correct grouping. Modern NLP tools handle this well—they’re like experienced factory workers who’ve seen enough to know the patterns. But they can’t tell you why 中国 means “China” (that requires character-level analysis).

The Technical Debt Trap

Some powerful tools exist but run on outdated infrastructure—like discovering a fully-equipped machine shop from the 1950s. Everything works, but spare parts are scarce, maintenance manuals reference obsolete standards, and integrating with modern systems requires adapters and workarounds. You can use it, but every year makes it harder to maintain. That’s the Python 2 library dilemma this research confronted.

When You Need This#

You Definitely Need Character Decomposition If:#

  • Building educational apps that teach character structure
  • Creating search systems where users look up characters by components (“show me all characters with the water radical”)
  • Researching etymology or historical character evolution
  • Developing font design or glyph analysis tools
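Component-based lookup ("show me all characters with the water radical") reduces to a reverse index from components to characters. A minimal sketch, using a hypothetical toy decomposition table in place of a real data source:

```python
from collections import defaultdict

# Hypothetical toy table: character -> flat list of its components.
# In practice this would be built from a decomposition data source.
DECOMP = {
    "好": ["女", "子"],
    "江": ["氵", "工"],
    "河": ["氵", "可"],
}

def build_component_index(decomp):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for char, components in decomp.items():
        for component in components:
            index[component].add(char)
    return index

index = build_component_index(DECOMP)
print(sorted(index["氵"]))  # all characters with the water radical
```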

You Definitely Need Word Segmentation If:#

  • Building machine translation systems
  • Creating search engines for Chinese text
  • Performing sentiment analysis or text classification
  • Developing chatbots or question-answering systems
  • Any NLP pipeline processing Chinese documents

You Need BOTH If:#

  • Building comprehensive dictionary or reference tools
  • Creating advanced learning platforms that connect word meaning to character structure
  • Developing linguistic research tools
  • Building systems that analyze both grammatical structure (words) and semantic structure (character components)

You DON’T Need This If:#

  • Your text processing stays at sentence or document level (sentiment, topics, classification without word boundaries)
  • You’re working with speech (audio), not text
  • Your use case already has text pre-segmented with spaces

Trade-offs#

Character Decomposition: Library vs. Data#

Legacy Library (cjklib):

  • Comprehensive search capabilities built-in
  • Python 2 codebase requires maintenance workarounds
  • Rich functionality but uncertain long-term viability
  • Think: Powerful but aging machine shop

Open Data (makemeahanzi, CJKVI-IDS):

  • JSON/text files you parse yourself
  • Zero technical debt, future-proof
  • Requires building your own query layer
  • Think: Raw materials, you build the tools

Trade-off: Immediate functionality vs. long-term maintainability. This research recommends the data-first approach unless you need comprehensive character search immediately and can accept a 1-2 year migration timeline.

Word Segmentation: Multilingual vs. Specialized#

Multilingual Platforms (HanLP):

  • Handles 130 languages
  • Future-proof if you expand to other languages
  • Larger models, more memory/compute
  • Think: Swiss Army knife

Chinese-Optimized (LTP):

  • Specialized for Chinese only
  • Faster, smaller models
  • Less flexibility for other languages
  • Think: Specialist tool

Academic Standard (Stanza):

  • Universal Dependencies framework
  • Cross-lingual research reproducibility
  • Stanford backing
  • Think: Laboratory standard

Trade-off: Breadth vs. depth. Choose multilingual if you might expand; choose specialized if you’re Chinese-only and want maximum performance.

Build vs. Adopt#

Adopt existing tools:

  • Fast time-to-market (1-3 weeks)
  • Proven algorithms
  • Ongoing maintenance someone else’s problem
  • Limited to what tools provide

Build on open data:

  • Full control and customization
  • Modern, debt-free codebase
  • Initial investment (3-6 weeks)
  • You own maintenance

Trade-off: Most projects should adopt for word segmentation (mature tools), build custom for character analysis (data-first future-proofing).

Implementation Reality#

First 90 Days: Expect Two Separate Systems#

Don’t expect a unified “Chinese analysis platform.” You’ll integrate two independent components:

Weeks 1-2: Set up word segmentation (HanLP or Stanza)

  • pip install, download models
  • Test with sample text
  • Integrate into your pipeline
  • This part is straightforward—tools are mature

Weeks 2-3: Set up character decomposition

  • Download makemeahanzi JSON data
  • Parse into your preferred format (database, in-memory, etc.)
  • Build query API for your needs
  • This requires more custom work but avoids technical debt
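A sketch of the parsing step, assuming makemeahanzi's dictionary.txt layout of one JSON object per line with "character", "decomposition", and "radical" fields (verify the field names against the file you actually download):

```python
import json

# Abridged sample lines in the one-JSON-object-per-line style described above.
SAMPLE_LINES = [
    '{"character": "好", "decomposition": "⿰女子", "radical": "女"}',
    '{"character": "江", "decomposition": "⿰氵工", "radical": "氵"}',
]

def load_decompositions(lines):
    """Index entries by character for O(1) lookup."""
    return {entry["character"]: entry
            for entry in (json.loads(line) for line in lines)}

# In production: load_decompositions(open("dictionary.txt", encoding="utf-8"))
chars = load_decompositions(SAMPLE_LINES)
print(chars["好"]["decomposition"])  # ⿰女子
```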

Week 4: Connect the pieces

  • Design API layer that combines both
  • Handle edge cases (unknown characters, segmentation errors)
  • Performance optimization (caching, indexing)
  • Documentation and testing
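The combined layer can be sketched as below. The segmenter is a stand-in stub (in practice a HanLP, Stanza, or LTP call goes there), the decomposition table is a toy dict, and `lru_cache` stands in for the caching bullet:

```python
from functools import lru_cache

# Toy decomposition table standing in for parsed makemeahanzi/CJKVI-IDS data.
DECOMP = {"汉": "⿰氵又", "字": "⿱宀子"}

def segment(text):
    """Stub segmenter: replace with a real HanLP/Stanza/LTP call."""
    return ["汉字", "学习"] if text == "汉字学习" else [text]

@lru_cache(maxsize=4096)
def decompose(char):
    # Unknown characters are an expected edge case: return None, don't raise.
    return DECOMP.get(char)

def analyze(text):
    """Word boundaries plus per-character structure in one pass."""
    return [{"word": w, "chars": {c: decompose(c) for c in w}}
            for w in segment(text)]

print(analyze("汉字"))
```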

Team Requirements#

Essential:

  • Python proficiency (or your language of choice)
  • Basic NLP understanding (tokenization, POS tagging concepts)
  • JSON parsing and data structure design

Helpful but not required:

  • Chinese language knowledge (helpful for testing, not for implementation)
  • Unicode standards familiarity
  • Previous NLP library experience

Realistic effort: 1 senior developer can implement in 3-4 weeks. Ongoing maintenance: minimal (data doesn’t “break,” models update quarterly at most).

Common Pitfalls#

“Let’s build our own word segmenter”: Don’t. Use HanLP/Stanza/LTP. This is solved, mature technology. Focus your effort on problems unique to your domain.

“We’ll use cjklib because it has everything”: The Python 2 technical debt will cost you more over 2 years than building modern tools costs upfront. Exception: if you need comprehensive character search now and have a <1 year migration timeline.

“We need morpheme analysis of compound words”: Be specific about what this means. If you mean word segmentation, tools exist. If you mean decomposing 电脑 (computer) into 电 (electricity) + 脑 (brain) with semantic roles, no library does this automatically—you’ll need linguistic rules or accept manual annotation.

Performance Expectations#

  • Word segmentation: 100-1000 characters/second (CPU), 10x faster with GPU
  • Character decomposition: O(1) lookups (instant for 9,000 characters)
  • Memory: ~1-2GB for NLP models, <100MB for character data
  • Scalability: Both components scale horizontally (stateless processing)

Maintenance Reality#

Word segmentation: Update models every 3-6 months as platforms release improvements. Breaking changes rare (1-2 years between major versions).

Character decomposition: Data is stable. Unicode IDS standard evolves slowly. Expect updates yearly at most, mostly for coverage expansion.

Combined system: After initial 3-4 week build, expect <1 day/month maintenance. Most time spent on application-specific enhancements, not core functionality upkeep.


Bottom Line: Chinese text processing requires two independent capabilities: word boundaries and character structure. Modern approaches favor proven NLP platforms for words and data-first architectures for characters. Expect 3-4 weeks to production with minimal ongoing maintenance. Avoid legacy libraries with technical debt unless you have explicit migration plans.


S1: Rapid Discovery

S1-Rapid: Approach#

Evaluation Method#

Quick assessment of available libraries for Chinese morphological analysis and character decomposition based on:

  1. Active Maintenance - Is the project still maintained? Recent releases?
  2. Core Capabilities - Does it address character decomposition and/or morphological analysis?
  3. Ease of Access - Available via PyPI? Clear documentation?
  4. First Impressions - Documentation quality, community size, obvious strengths/weaknesses

Libraries Evaluated#

  1. HanLP - Multilingual NLP toolkit
  2. Stanza - Stanford NLP with Universal Dependencies support
  3. LTP - Language Technology Platform (HIT)
  4. cjklib - Han character library for CJKV languages

Time Box#

45-60 minutes per library for initial investigation. Focus on:

  • What it claims to do
  • How mature it appears
  • Whether it handles our specific needs (character decomposition, compound analysis)

Decision Criteria#

  • Can it decompose characters into components/radicals?
  • Can it analyze compound words?
  • Is it production-ready?
  • Python support quality

cjklib - Rapid Assessment#

Overview#

Han character library for CJKV languages (Chinese, Japanese, Korean, Vietnamese). Provides character-level routines including pronunciations, radicals, glyph components, stroke decomposition, and variant information.

Repository: GitHub - cburgmer/cjklib

Documentation: cjklib 0.3.2 docs

Character Decomposition Capabilities#

Strengths:

  • Explicit character decomposition using Ideographic Description Sequences (IDS)
  • Binary operators (e.g., ⿰ for left-right: 好 = ⿰女子)
  • Ternary operators (⿲, ⿳) for three-component decomposition
  • Radical analysis (Kangxi radicals)
  • Stroke information derivable from component data
  • Character variant mapping

Example Use Cases:

  • Character lookup by components
  • Font design pattern analysis
  • Character study and etymology
  • Stroke order/count deduction

Morphological Analysis (Word Level)#

None - cjklib operates at character level, not word level. No word segmentation or compound word analysis.

Maturity#

Moderate concerns:

  • Last documented version: 0.3.2
  • Documentation references Python 2.x
  • No 2026 updates found in search
  • Available on PyPI: cjklib package

Maintenance status unclear - GitHub repository exists but update frequency unknown from initial search.

Quick Verdict#

Good for: Character decomposition, radical analysis, IDS manipulation, character structure study

Limitations: Python 2 legacy, maintenance unclear, no word-level morphology

cjklib is the only library reviewed that explicitly handles character decomposition. It directly addresses our character structure needs but may require Python 3 compatibility verification.

Complementary Option#

makemeahanzi - Free, open-source Chinese character data with decomposition information (mentioned in search results as related project).




HanLP - Rapid Assessment#

Overview#

Multilingual NLP toolkit supporting 130 languages with 10 joint tasks including tokenization, POS tagging, dependency parsing, and semantic role labeling. Built on PyTorch and TensorFlow 2.x.

Repository: GitHub - hankcs/HanLP

Installation: Available via PyPI

Morphological Analysis Capabilities#

Strengths:

  • Comprehensive Chinese NLP pipeline (word segmentation, POS tagging, NER, parsing)
  • Active development and maintained
  • Research shows morpheme and character features address unknown word problems (Springer Study)
  • Integration with Haystack framework (Haystack Integration)

Limitations:

  • Character decomposition not a primary feature
  • Focused on higher-level NLP tasks rather than character-level analysis
  • More oriented toward tokenization and parsing than radical/component analysis

Character Decomposition#

  • No explicit character decomposition API visible in initial review
  • Character features used internally for morphological analysis
  • Not designed as a character structure analysis tool

Maturity#

High - Active project with comprehensive documentation, regular releases, and production use

Quick Verdict#

Good for: Word segmentation, morphological tagging, NLP pipelines

Not ideal for: Character decomposition into radicals/components

HanLP excels at document-level morphological analysis but doesn’t directly expose character decomposition functionality needed for studying character structure.




LTP - Rapid Assessment#

Overview#

Language Technology Platform developed by HIT (Harbin Institute of Technology). Integrated Chinese NLP platform with lexical analysis, syntactic parsing, and semantic parsing.

Repository: GitHub - HIT-SCIR/ltp

Recent Version: N-LTP (Neural LTP) - multi-task framework with shared pretrained models

Morphological Analysis Capabilities#

Strengths:

  • Six core Chinese NLP tasks: word segmentation, POS tagging, NER, dependency parsing, semantic dependency, semantic role labeling
  • N-LTP uses multi-task learning with shared pretrained models
  • Cloud service available: LTP-Cloud
  • Specifically designed for Chinese (not multilingual adaptation)

Terminology Note: LTP uses “lexical analysis” rather than “morphological analysis”. This includes:

  • Chinese word segmentation
  • Part-of-speech tagging
  • Named entity recognition

Limitations:

  • Focuses on word-level lexical analysis, not character-level decomposition
  • No explicit character decomposition features documented
  • Designed for document processing pipeline, not character structure study

Character Decomposition#

None identified - LTP operates at word/token level, not character component level.

Maturity#

High - Established platform with academic backing, production cloud service, and neural architecture updates (N-LTP).

Quick Verdict#

Good for: Chinese word segmentation, lexical analysis, NLP pipelines

Not suitable for: Character decomposition, radical analysis

LTP is a comprehensive Chinese NLP platform but operates at word level. Like HanLP, it’s designed for document processing rather than character structure analysis.




S1-Rapid: Recommendation#

Clear Winner for Character Decomposition#

cjklib is the only library reviewed that explicitly handles character decomposition into components and radicals.

The Landscape Split#

The evaluation revealed a fundamental divide:

Word-Level Tools (HanLP, Stanza, LTP):

  • Focus on tokenization, POS tagging, parsing
  • Operate on words/documents, not character structure
  • Production-ready, actively maintained
  • Not designed for character decomposition

Character-Level Tool (cjklib):

  • Explicitly handles IDS (Ideographic Description Sequences)
  • Provides radical analysis, component decomposition
  • Only library addressing our core need
  • Maintenance status unclear (Python 2 references)

Quick Recommendation#

For character decomposition: Use cjklib - it’s the only option that directly addresses the requirement.

Caveat: Verify Python 3 compatibility and check GitHub for maintenance status.

Alternative: Consider makemeahanzi as data source if cjklib proves unmaintained.

For Compound Word Analysis#

None of these libraries clearly excel at Chinese compound word morphological analysis. This requires deeper investigation in S2-comprehensive pass:

  • How do compounds differ from simple word segmentation?
  • Is this a linguistic analysis task or a lexical lookup problem?
  • May require combining tools or custom logic

Next Steps for S2#

  1. Deep dive into cjklib: Python 3 support, API depth, data quality
  2. Investigate compound word analysis: What’s actually needed?
  3. Explore makemeahanzi and other character decomposition data sources
  4. Test actual code examples with real Chinese text

Stanza - Rapid Assessment#

Overview#

Stanford NLP toolkit providing pretrained models for 80 languages based on Universal Dependencies (UD) treebanks. Supports tokenization, lemmatization, POS tagging, morphological features, and dependency parsing.

Models: Available Models & Languages

Paper: Stanza: A Python NLP Toolkit

Morphological Analysis Capabilities#

Strengths:

  • Universal Dependencies framework provides consistent morphological annotation
  • Pretrained models trained on UD v2.12
  • Strong academic backing (Stanford NLP)
  • Standardized approach across languages

Limitations:

  • Chinese has “weak morphology” - lacks formal devices like tense/number markers
  • UD tagging focuses on grammatical features, not character structure
  • Research notes “very little research devoted to Chinese word segmentation based on morphemes” (Computational Linguistics)

Character Decomposition#

None - Stanza focuses on morphological features (grammatical properties) not character decomposition. The UD framework doesn’t model character-level structure.

Maturity#

High - Stable, well-documented, production-ready. Part of Stanford’s NLP infrastructure.

Quick Verdict#

Good for: UD-style morphological tagging, cross-lingual NLP, grammatical analysis

Not suitable for: Character decomposition, radical analysis

Stanza provides morphological features in the UD sense (grammatical properties), not character structure analysis. Chinese morphology in UD focuses on word-level features, not sub-character components.



S2: Comprehensive

S2-Comprehensive: Approach#

Evaluation Method#

Deep dive into each library’s actual capabilities, focusing on:

  1. Python 3 Compatibility - Can it run in modern Python environments?
  2. API Depth - What functionality is actually exposed?
  3. Data Sources - Where does the decomposition/morphological data come from?
  4. Compound Word Analysis - What does this actually mean in Chinese context?
  5. Installation & Usage - Is it production-ready or requires significant setup?

Key Research Questions#

Character Decomposition#

  • What IDS databases are available?
  • How complete is the coverage?
  • Can we access stroke information?
  • Etymology data availability?

Compound Word Analysis#

  • Is this word segmentation or morphological decomposition?
  • Do tools analyze internal structure of compounds?
  • Or just identify word boundaries?

Practical Concerns#

  • Library maintenance status
  • Documentation quality
  • Community support
  • Integration complexity

Extended Library Set#

Beyond the original four, also investigating:

  • makemeahanzi - Character data source
  • Jieba, pkuseg - Word segmentation tools
  • CJKVI-IDS - IDS database
  • spaCy Chinese models - NLP pipeline

Feature Matrix#

Will create detailed comparison across:

  • Character decomposition (IDS support)
  • Radical extraction
  • Stroke information
  • Component analysis
  • Word segmentation
  • Compound word analysis
  • Python 3 support
  • Maintenance status
  • Data quality

cjklib - Comprehensive Assessment#

Python 3 Compatibility Status#

Original cjklib: Python 2 only. No Python 3 support in the main cburgmer/cjklib repository.

cjklib3 Fork: Python 3.7+ compatible fork available at free-utils-python/cjklib3. Requires 2to3 conversion during installation:

```shell
# Create an isolated environment (cjklib3 targets Python 3.7+)
conda create -n py37 python=3.7
conda activate py37
git clone https://github.com/free-utils-python/cjklib3
cd cjklib3
pip install 2to3
2to3 -w .   # rewrite remaining Python 2 syntax in place
# Then build the database and install dictionaries per the fork's instructions
```

Verdict: Python 3 is possible but requires fork and manual setup. Not pip-installable for Python 3.

Character Decomposition Capabilities#

IDS Support#

  • Full IDS (Ideographic Description Sequences) implementation
  • Stores decompositions using Unicode IDS operators (⿰, ⿱, ⿲, etc.)
  • Example: 好 = ⿰女子 (left-right: woman + child)

API Features#

From characterlookup module:

  • getDecompositionEntries() - Get all decomposition trees
  • getRadicalForms() - Get radical variants
  • getStrokeCount() - Character stroke counts
  • getCharacterVariants() - Traditional/simplified variants

Data Quality#

  • Comprehensive coverage of CJK Unified Ideographs
  • Multiple decomposition paths for characters with variant structures
  • Kangxi radical mappings
  • Component-to-stroke mappings for analysis

Limitations#

Maintenance Concerns#

  • Original project: Last PyPI release unclear
  • Documentation references Python 2
  • Open issue #11 requesting Python 3 support from 2017
  • Snyk analysis shows no recent releases

Installation Complexity#

  • Not simple pip install for Python 3
  • Requires building database from source
  • Dictionary installation separate step
  • Fork-based solution not ideal for production

Alternative: Data Sources Only#

Instead of using cjklib as a library, consider using its data sources:

  • CJKVI-IDS database: cjkvi/cjkvi-ids
  • Parse IDS strings directly
  • Simpler integration, modern Python code
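Parsing is straightforward because IDS strings are prefix expressions: each Ideographic Description Character (U+2FF0..U+2FFB) takes a fixed number of operands, three for ⿲ and ⿳ and two for the rest. A minimal recursive parser (a sketch, not cjklib's implementation):

```python
# Ideographic Description Characters: ⿲ and ⿳ take three operands,
# the remaining ten take two.
TERNARY = set("⿲⿳")
BINARY = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")

def parse_ids(s):
    """Parse an IDS string into a nested (operator, operands...) tuple."""
    def parse(i):
        ch = s[i]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            node, j = [ch], i + 1
            for _ in range(arity):
                operand, j = parse(j)
                node.append(operand)
            return tuple(node), j
        return ch, i + 1  # leaf: an atomic component

    tree, _ = parse(0)
    return tree

print(parse_ids("⿰女子"))      # ('⿰', '女', '子')
print(parse_ids("⿱宀⿰女子"))  # nested: top-bottom, then left-right
```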

Compound Word Analysis#

None - cjklib operates at character level only. No word segmentation or compound analysis features.

Production Readiness#

Moderate Risk:

  • ✅ Proven character decomposition algorithms
  • ✅ Comprehensive data coverage
  • ❌ Python 2 legacy code
  • ❌ Requires fork for Python 3
  • ❌ Setup complexity
  • ❌ Unclear maintenance status

Recommendation: Use cjklib3 fork for prototyping, but investigate migrating to IDS database parsing with modern Python for production.




Feature Comparison Matrix#

Character-Level Analysis#

| Feature | cjklib | cjklib3 | makemeahanzi | HanLP | Stanza | LTP |
| --- | --- | --- | --- | --- | --- | --- |
| Character Decomposition (IDS) | ✅ Full | ✅ Full | ✅ Full | ❌ | ❌ | ❌ |
| Radical Extraction | ✅ Kangxi | ✅ Kangxi | ✅ | ❌ | ❌ | ❌ |
| Stroke Information | ✅ Derived | ✅ Derived | ✅ SVG | ❌ | ❌ | ❌ |
| Etymology Data | ⚠️ Limited | ⚠️ Limited | ✅ Rich | ❌ | ❌ | ❌ |
| Component Trees | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |

Word-Level Analysis#

| Feature | cjklib | HanLP | Stanza | LTP | Jieba | pkuseg |
| --- | --- | --- | --- | --- | --- | --- |
| Word Segmentation | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| POS Tagging | ❌ | ✅ | ✅ | ✅ | ⚠️ Basic | ✅ |
| Morphological Features (UD) | ❌ | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| Morpheme Decomposition | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Compound Analysis | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

Technical Characteristics#

| Aspect | cjklib | cjklib3 | makemeahanzi | HanLP | Stanza | LTP |
| --- | --- | --- | --- | --- | --- | --- |
| Python 3 Support | ❌ | ✅ | N/A (data) | ✅ | ✅ | ✅ |
| pip Installable | ⚠️ Py2 | ❌ Fork | N/A | ✅ | ✅ | ✅ |
| Active Maintenance | ❌ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| Documentation Quality | ⚠️ Py2-era | ⚠️ Py2-era | ✅ | ✅ Good | ✅ Good | ✅ |
| Setup Complexity | Medium | High | Low | Low | Low | Low |

Data Sources#

| Source | Format | Coverage | Character Decomposition | Etymology | Strokes |
| --- | --- | --- | --- | --- | --- |
| CJKVI-IDS | IDS files | CJK Unified | ✅ | ❌ | ❌ |
| makemeahanzi | JSON | 9,000+ chars | ✅ | ✅ Rich | ✅ SVG |
| cjklib DB | SQLite | Comprehensive | ✅ | ⚠️ Limited | ✅ Derived |
| Unicode IDS | Standard | All CJK | ✅ | ❌ | ❌ |

Alternative Approaches#

Character Decomposition Options#

  1. Use cjklib3 fork (library approach)

    • Pro: Complete API, rich functionality
    • Con: Setup complexity, Python 3 via fork
  2. Use makemeahanzi data (data-first approach)

    • Pro: JSON format, modern, rich etymology
    • Con: Need to build parser, 9K characters (not all CJK)
  3. Parse CJKVI-IDS directly (minimal approach)

    • Pro: Complete CJK coverage, simple format
    • Con: Need IDS parser, no etymology, no stroke data
  4. Hybrid: makemeahanzi + CJKVI-IDS

    • Pro: Best of both worlds
    • Con: Integration complexity

Word Segmentation Options#

| Tool | Strength | Use Case |
| --- | --- | --- |
| HanLP | Multilingual pipeline | Production, comprehensive NLP |
| Stanza | UD framework, Stanford | Research, cross-lingual |
| LTP | Chinese-specific | Chinese-only production |
| Jieba | Lightweight, popular | Simple segmentation |
| pkuseg | Domain-specific models | Specialized domains |

Key Findings#

No Morpheme Decomposition Tools Found#

Significant gap: None of the libraries analyzed actually decompose Chinese compound words into morphemes.

Example of what’s missing:

  • Input: “电脑” (computer)
  • What tools do: Identify as single word [电脑/NOUN]
  • What’s missing: Decompose to “电” (electricity) + “脑” (brain)

Why this matters:

  • Might need custom implementation
  • Or: redefine requirements to focus on word segmentation + character decomposition
  • Morpheme decomposition may require linguistic rules, not library

Two Separate Problem Domains#

Character Structure:

  • Tools: cjklib, makemeahanzi, IDS databases
  • Well-supported with existing libraries/data
  • Mature solutions available

Word/Morpheme Analysis:

  • Tools: HanLP, Stanza, LTP, Jieba, pkuseg
  • All do word segmentation
  • None do morpheme decomposition of compounds
  • May need custom solution or clarified requirements

Sources: See individual library assessments in this directory.


HanLP - Comprehensive Assessment#

Character-Level Capabilities#

Internal Character Features#

HanLP uses character features internally for morphological analysis. Research (Springer 2012) shows character features help with unknown words, but this is not exposed as a public API for character decomposition.

Not a Character Decomposition Tool#

  • No IDS extraction methods
  • No radical decomposition API
  • Character features used internally for ML models, not user-facing
  • Focus on word-level and sentence-level analysis

Word Segmentation & Morphological Tagging#

Core Strengths#

From HanLP documentation:

  • Word segmentation (分词)
  • Part-of-speech tagging (词性标注)
  • Named entity recognition (命名实体识别)
  • Dependency parsing (依存句法分析)
  • Semantic role labeling (语义角色标注)

Multi-Task Architecture#

  • Supports 130 languages
  • Joint training across tasks
  • PyTorch and TensorFlow 2.x backends
  • RESTful API available

Compound Word Analysis#

Word-level segmentation, not morphological decomposition:

  • Identifies word boundaries in text
  • Tags grammatical roles
  • Does NOT analyze internal morpheme structure
  • Example: Segments “中国人” (Chinese person) as one word, doesn’t decompose into “中国” (China) + “人” (person)

Research Context#

Chinese word segmentation research (Computational Linguistics) notes:

  • Chinese has “weak morphology”
  • Limited formal devices (no tense/number markers like English)
  • Word boundaries often ambiguous
  • Segmentation ≠ morphological analysis

Production Readiness#

High:

  • ✅ Active development
  • ✅ Python 3 support
  • ✅ pip installable: pip install hanlp
  • ✅ Comprehensive documentation
  • ✅ Large community
  • ✅ Production deployments
  • ✅ RESTful API option

Integration with Morphological Analysis#

Complementary, not overlapping:

  • HanLP provides word segmentation
  • Separate tool needed for character decomposition
  • Could use HanLP → segment text into words → then analyze character structure of words

Use Case Example#

  1. Input: “我喜欢汉字学习”
  2. HanLP segments: [“我”, “喜欢”, “汉字”, “学习”]
  3. Character decomposition tool analyzes: “汉” = ⿰氵又, “字” = ⿱宀子

Verdict for Our Needs#

Word-level processing: Yes
Character decomposition: No

HanLP is an excellent NLP pipeline for Chinese text processing at word/sentence level, but does not provide character structure analysis. For morphological analysis project, HanLP handles the “compound word” aspect (if interpreted as word segmentation) but not character decomposition.




LTP - Comprehensive Assessment#

Language Technology Platform Overview#

Core Architecture#

From LTP documentation:

  • Integrated Chinese NLP platform
  • Developed by HIT (Harbin Institute of Technology)
  • Research Center for Social Computing and Information Retrieval

Neural Architecture: N-LTP#

ArXiv paper describes modern version:

  • Multi-task learning framework
  • Shared pretrained models
  • Unified approach across tasks
  • More efficient than independent models

Six Core Tasks#

Lexical Analysis:

  1. Chinese word segmentation (分词)
  2. Part-of-speech tagging (词性标注)
  3. Named entity recognition (命名实体识别)

Syntactic Parsing:

  4. Dependency parsing (依存句法分析)

Semantic Parsing:

  5. Semantic dependency parsing (语义依存分析)
  6. Semantic role labeling (语义角色标注)

Terminology: “Lexical” not “Morphological”#

Important distinction:

  • LTP documentation uses “lexical analysis” (词法分析)
  • NOT “morphological analysis” in linguistic sense
  • Refers to word-level token analysis
  • Includes segmentation, POS, NER

Why “Lexical”?#

  • Chinese word segmentation is primary challenge
  • Identifying word boundaries from character stream
  • POS tagging and NER follow segmentation
  • Focus on lexicon (words) not morphemes (sub-word units)

Character Decomposition#

None - LTP operates at word level. No character structure analysis, IDS support, or radical extraction.

Compound Word Analysis#

Word segmentation, not morpheme decomposition:

  • Segments text into words
  • Tags each word’s part of speech
  • Identifies named entities
  • Does NOT decompose compounds into constituent morphemes

Example Processing#

Input: “中国人民大学”

  • LTP might segment as: [“中国”, “人民”, “大学”] or [“中国人民大学”]
  • Tags: [NOUN, NOUN, NOUN] or [ORG]
  • Does NOT decompose: “人民” = “人” (person) + “民” (people)

Production Readiness#

High:

  • ✅ Python 3 support: GitHub
  • ✅ pip installable
  • ✅ Cloud service: LTP-Cloud
  • ✅ Academic backing (HIT)
  • ✅ N-LTP modernization (2020+)
  • ✅ Active Chinese NLP community

Comparison with HanLP#

Similar scope, different implementations:

  • Both provide Chinese NLP pipelines
  • Both include segmentation, POS, NER, parsing
  • LTP: Chinese-specific, HIT research
  • HanLP: Multilingual, broader language support
  • Both operate at word level, not character level

Integration with Character Analysis#

Preprocessing pipeline:

  1. LTP segments and tags text
  2. Provides linguistic context
  3. Separate tool needed for character decomposition
  4. LTP output useful for identifying important words to analyze

Workflow#

  1. LTP segments: “学习汉字需要耐心”
  2. Output: [“学习”/VERB, “汉字”/NOUN, “需要”/VERB, “耐心”/NOUN]
  3. Focus on nouns/verbs for character analysis
  4. Use cjklib/makemeahanzi for “汉字” decomposition

Verdict for Our Needs#

Word segmentation: Yes (excellent)
Lexical tagging: Yes (comprehensive)
Character decomposition: No
Morpheme analysis: No

LTP is a production-ready Chinese NLP platform for word-level processing. Like HanLP and Stanza, it doesn’t provide character structure analysis. The term “morphological analysis” in our requirements doesn’t align with what LTP offers unless we interpret it as “word segmentation.”

Clarification Needed#

The project requirements mention:

  • Character decomposition ✓ (clearly sub-character components)
  • Compound word analysis ? (ambiguous)

If “compound word analysis” means:

  • Identifying compounds from simple words: LTP does this (word segmentation)
  • Decomposing compounds into morphemes: LTP does NOT do this
  • Analyzing character structure: LTP does NOT do this



S2-Comprehensive: Recommendation#

Two-Tier Approach#

Based on comprehensive analysis, recommend splitting the solution:

Tier 1: Character Decomposition#

Tool: makemeahanzi + CJKVI-IDS hybrid

Rationale:

  • makemeahanzi provides rich data for common characters (9K+)
    • IDS decomposition
    • Etymology information
    • Stroke order SVG data
    • JSON format (modern, easy to parse)
  • CJKVI-IDS fills gaps for comprehensive CJK coverage
  • Both are data sources, not libraries - gives full control
  • No Python 2 legacy issues
  • Active maintenance

Implementation:

  1. Primary: Parse makemeahanzi JSON for covered characters
  2. Fallback: Parse CJKVI-IDS for remaining CJK characters
  3. Build custom API layer for application needs
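A sketch of that two-source lookup, assuming makemeahanzi's dictionary.txt format (one JSON object per line with `character` and `decomposition` fields) and CJKVI-IDS's tab-separated lines (`U+XXXX`, character, IDS); the data passed in below is inlined toy content rather than the real files:

```python
import json

def load_makemeahanzi(lines):
    """Parse makemeahanzi dictionary.txt content: one JSON object per line."""
    table = {}
    for line in lines:
        entry = json.loads(line)
        table[entry["character"]] = entry.get("decomposition")
    return table

def load_cjkvi_ids(lines):
    """Parse CJKVI-IDS content: 'U+XXXX<TAB>char<TAB>IDS' per line."""
    table = {}
    for line in lines:
        if line.startswith((";;", "#")):
            continue  # skip comment lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            table[parts[1]] = parts[2]
    return table

def decompose(char, primary, fallback):
    """Steps 1-2: try makemeahanzi first, then fall back to CJKVI-IDS."""
    return primary.get(char) or fallback.get(char)

# Toy data in each format (real data comes from the project files):
primary = load_makemeahanzi(['{"character": "好", "decomposition": "⿰女子"}'])
fallback = load_cjkvi_ids(["U+6C49\t汉\t⿰氵又"])
```

Step 3's custom API layer then wraps `decompose` with whatever query patterns the application needs.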

Tier 2: Word Segmentation#

Tool: HanLP

Rationale:

  • Production-ready Python 3 library
  • Comprehensive NLP pipeline
  • Active development
  • pip installable
  • Handles word boundary identification

Caveat: HanLP provides segmentation, not morpheme decomposition.

The Morpheme Decomposition Gap#

Critical finding: No library found that decomposes Chinese compounds into morphemes.

Example of gap:

  • Character decomposition: 电 = IDS ⿻日乚 (components)
  • Word segmentation: “我用电脑” → [“我”, “用”, “电脑”]
  • Missing: “电脑” → “电” + “脑” (morpheme analysis)
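A toy sketch makes the gap concrete: splitting a segmented compound into characters and attaching glosses is mechanical, but deciding whether the split is semantically meaningful is exactly the part no library provides (the `GLOSS` table below is illustrative, not a real resource):

```python
# Hypothetical sketch of the missing step: after segmentation yields
# "电脑" as one token, split it into characters and attach glosses.
# GLOSS is a toy table; real morpheme analysis would also need rules
# to decide when such a split is linguistically meaningful.
GLOSS = {"电": "electricity", "脑": "brain"}

def naive_morphemes(word):
    """Character-level split with glosses — NOT true morpheme analysis."""
    return [(char, GLOSS.get(char, "?")) for char in word]

pairs = naive_morphemes("电脑")
```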

Options:

  1. Redefine requirements: Focus on character decomposition + word segmentation
  2. Custom implementation: Build morpheme analyzer using:
    • Word segmentation from HanLP
    • Character decomposition from makemeahanzi
    • Linguistic rules for identifying productive compounds
  3. Accept limitation: Most compounds are lexicalized (treated as single words)

Recommended Architecture#

Application Layer
    ↓
Word Segmentation: HanLP
    ↓
Character Decomposition: makemeahanzi + CJKVI-IDS
    ↓
Data Layer: JSON + IDS text files

Why Not cjklib?#

cjklib is functionally superior for character analysis, but:

  • Python 2 codebase
  • Python 3 requires fork + manual setup
  • Not pip installable for Python 3
  • Uncertain maintenance

Alternative: Extract cjklib’s algorithms and port to modern Python, using makemeahanzi/CJKVI-IDS as data sources. This gets the best of both:

  • Proven algorithms (cjklib)
  • Modern data formats (makemeahanzi)
  • Clean Python 3 code
  • Full control over implementation

Production Deployment#

Low-risk path:

  1. Start with makemeahanzi data (9K characters covers most use cases)
  2. Parse JSON in Python 3
  3. Build simple API: decompose(char) → components
  4. Add CJKVI-IDS for complete coverage as needed
  5. Use HanLP for word segmentation
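Step 3's `decompose(char) → components` API can be as small as this sketch, where `DECOMP` is a toy stand-in for the parsed makemeahanzi data and the operator set is the Unicode Ideographic Description Characters block:

```python
# Toy stand-in for parsed makemeahanzi data (steps 1-2).
DECOMP = {"好": "⿰女子", "江": "⿰氵工"}

# Unicode Ideographic Description Characters (U+2FF0–U+2FFB).
IDS_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

def decompose(char):
    """Step 3: return top-level components, dropping IDS operators."""
    ids = DECOMP.get(char)
    if ids is None:
        return []
    return [c for c in ids if c not in IDS_OPERATORS]
```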

Higher-effort, more capable path:

  1. Port cjklib algorithms to Python 3
  2. Use makemeahanzi + CJKVI-IDS as data
  3. Build comprehensive character analysis library
  4. Integrate with HanLP for word processing

Next Steps for S3: Need-Driven#

Must clarify actual use cases:

  • What is “compound word analysis” supposed to do?
  • Are we analyzing character structure or word structure?
  • What’s the end goal: teaching, search, analysis?

Different use cases may lead to different recommendations.




Stanza - Comprehensive Assessment#

Universal Dependencies Framework#

What Stanza Provides#

From Stanza documentation:

  • Tokenization & sentence segmentation
  • Lemmatization
  • POS tagging
  • Morphological features (UD framework)
  • Dependency parsing

Chinese Models#

  • Trained on Universal Dependencies v2.12
  • Models available for simplified and traditional Chinese
  • Performance metrics show 94%+ token accuracy

Morphological Features in UD#

What “Morphological” Means in UD Context#

UD morphological features capture grammatical properties:

  • Tense, aspect, mood (for languages that have them)
  • Number, gender, case
  • Transitivity, politeness markers

Chinese in UD Framework#

Research findings (Computational Linguistics):

  • Chinese is an analytic language
  • Lacks formal morphological devices (no tense inflections, number markers)
  • “Weak morphology” compared to agglutinative or fusional languages
  • UD features limited to aspect markers, classifiers, etc.

Example UD tags for Chinese:

  • Aspect markers: 了 (le), 着 (zhe), 过 (guo)
  • Classifiers: 个, 只, 本
  • NOT character decomposition or internal word structure

Character Decomposition#

None - Stanza operates at token/word level, not character component level. UD framework doesn’t model sub-character structure.

Compound Word Analysis#

Word-level tokenization only:

  • Segments text into tokens
  • Assigns grammatical tags
  • Does NOT analyze morpheme composition of compounds
  • Example: “电脑” (computer) = single token “电脑”, not analyzed as “电” (electricity) + “脑” (brain)

Research Context#

ACL 2020 paper notes:

  • Stanza focuses on cross-lingual consistency
  • Same annotation framework for all languages
  • Chinese processing adapted to UD conventions
  • Morphological analysis limited to what UD framework supports

Production Readiness#

High:

  • ✅ Python 3 support
  • ✅ pip installable: pip install stanza
  • ✅ Stanford NLP backing
  • ✅ Excellent documentation
  • ✅ 80 languages supported
  • ✅ Regular updates with new UD versions

Integration with Character Analysis#

Preprocessing role only:

  • Stanza segments text into tokens
  • Provides grammatical structure
  • Separate tool needed for character decomposition
  • UD parse trees useful for understanding word relationships

Workflow Example#

  1. Input: “学习汉字很有趣”
  2. Stanza tokenizes: [“学习”, “汉字”, “很”, “有趣”]
  3. POS tags: [VERB, NOUN, ADV, ADJ]
  4. Dependency parse: shows “学习” as root, “汉字” as object
  5. Separate tool decomposes characters: “汉” = ⿰氵又

Verdict for Our Needs#

  • Grammatical morphology (UD sense): Yes
  • Character decomposition: No
  • Compound analysis (morphemes): No

Stanza provides morphological tagging in the UD sense (grammatical features), not character decomposition or morpheme analysis. Useful for NLP pipelines but doesn’t address core requirement of analyzing character structure.

Clarification Needed#

The term “morphological analysis” is ambiguous:

  • UD morphology: Grammatical features (what Stanza does)
  • Character morphology: Component structure (what we need)
  • Word morphology: Compound word decomposition (unclear if needed)

Stanza addresses the first definition only.



S3: Need-Driven

S3-Need-Driven: Approach#

Methodology#

Evaluate libraries based on concrete use cases rather than abstract features. For each use case:

  1. Define the goal - What does the user want to accomplish?
  2. Map to requirements - What technical capabilities are needed?
  3. Test fit - Which libraries can deliver this?
  4. Identify gaps - What’s missing?

Use Case Categories#

Based on common needs for Chinese morphological analysis:

1. Educational/Learning#

  • Character etymology and learning aids
  • Radical-based character lookup
  • Understanding character construction
  • Vocabulary building through component analysis

2. Search & Information Retrieval#

  • Component-based search
  • Fuzzy matching by structural similarity
  • Radical-based indexing
  • Character variant handling

3. Text Analysis & Processing#

  • Word segmentation for NLP pipelines
  • POS tagging and parsing
  • Named entity recognition
  • Document processing

4. Linguistic Research#

  • Character structure analysis
  • Etymology research
  • Historical character evolution
  • Compound word formation patterns

5. Font & Typography#

  • Glyph component analysis
  • Stroke order validation
  • Font design pattern extraction
  • Character rendering optimization

Evaluation Criteria Per Use Case#

  • Capability match: Does the tool provide needed functionality?
  • Data quality: Is the underlying data accurate and comprehensive?
  • Ease of use: How much custom code is needed?
  • Performance: Can it handle expected data volumes?
  • Maintenance: Is it sustainable long-term?

Use Cases Analyzed#

Will create separate markdown files for:

  1. use-case-educational.md - Learning and teaching tools
  2. use-case-search.md - Search and retrieval systems
  3. use-case-nlp-pipeline.md - Text processing pipelines
  4. use-case-linguistic-research.md - Academic research
  5. use-case-etymology.md - Character origin and evolution study

Each file will map the use case to library recommendations with specific implementation guidance.


S3-Need-Driven: Recommendation#

Use Case → Library Mapping#

Based on concrete use case analysis:

| Use Case | Primary Tool | Secondary Tool | Rationale |
| --- | --- | --- | --- |
| Educational/Learning | makemeahanzi | CJKVI-IDS | Rich etymology + mnemonics |
| Search/Retrieval | cjklib/cjklib3 | CJKVI-IDS | Comprehensive search APIs |
| NLP Pipeline | HanLP/Stanza/LTP | makemeahanzi | Production NLP + optional char features |
| Etymology Research | makemeahanzi | Sears DB scrape | Etymology classification + historical forms |

Key Insight: No One-Size-Fits-All#

Different needs require different tools:

Character Structure Analysis#

→ cjklib (most comprehensive) or makemeahanzi (modern, rich data)

Word Processing#

→ HanLP/Stanza/LTP (depending on multilingual vs. Chinese-only needs)

Etymology & Pedagogy#

→ makemeahanzi (only tool with explicit etymology data)

The Morpheme Decomposition Gap (Revisited)#

Critical finding from use case analysis:

None of the analyzed use cases actually require morpheme decomposition of compound words in the linguistic sense.

What users actually need:

  1. Character decomposition → cjklib/makemeahanzi/CJKVI-IDS ✅
  2. Word segmentation → HanLP/Stanza/LTP ✅
  3. Etymology understanding → makemeahanzi ✅
  4. Linguistic morpheme analysis → No library exists ❌

Example of missing capability:

  • Input: “电脑” (computer = electricity + brain)
  • What we have: Word segmentation identifies “电脑” as one word
  • What’s missing: Automatic morpheme decomposition to “电” + “脑” with semantic roles
  • Reality: This requires linguistic analysis, not just library lookup

Implication: If true morpheme decomposition is needed, it requires:

  • Custom implementation
  • Linguistic rules database
  • Or: Manual annotation

Alternative interpretation: If “compound word analysis” means “word segmentation,” then HanLP/Stanza/LTP provide this.

Decision Framework#

Question 1: What level are you analyzing?#

Characters (sub-word structure)? → Use makemeahanzi or cjklib

Words (text segmentation)? → Use HanLP/Stanza/LTP

Both? → Use NLP tool for words + character tool for components

Question 2: What’s your domain?#

Education/Learning? → makemeahanzi (best etymology)

Production NLP? → HanLP/Stanza/LTP (best pipelines)

Linguistic Research? → makemeahanzi + cjklib + Sears DB

Search/IR? → cjklib (best search APIs)

Question 3: What’s your Python environment?#

Python 3 only? → makemeahanzi (data) or HanLP/Stanza/LTP (NLP)

Can handle Python 2/3 split? → cjklib via fork or subprocess

Want minimal dependencies? → CJKVI-IDS + custom parser

Stack A: Modern Python, Educational Focus#

makemeahanzi (character data)
+ HanLP (word processing)
+ Custom integration layer

Pros: All Python 3, rich data, production-ready
Cons: Need to build integration

Stack B: Research-Grade, Comprehensive#

cjklib3 fork (character analysis)
+ Stanza (UD-compliant word processing)
+ Sears DB scrape (historical forms)

Pros: Most comprehensive, research-quality
Cons: Setup complexity, Python 3 fork

Stack C: Data-First, Maximum Control#

CJKVI-IDS + makemeahanzi (raw data)
+ pkuseg/Jieba (lightweight segmentation)
+ Custom Python 3 parser

Pros: Full control, modern Python, no legacy code
Cons: Need to build everything, significant effort

Stack D: Production NLP, Minimal Character Analysis#

HanLP or LTP (complete NLP pipeline)
+ CJKVI-IDS (fallback for char decomposition)

Pros: Production-ready, minimal complexity
Cons: Limited character analysis depth

Final Recommendation#

For most projects: Stack A (makemeahanzi + HanLP)

Rationale:

  1. 90% of character analysis needs covered by makemeahanzi (9K chars)
  2. HanLP provides production NLP pipeline
  3. All Python 3, pip installable
  4. Easy integration
  5. Can enhance later if needed

When to choose alternatives:

  • Stack B: Academic research requiring comprehensive character coverage
  • Stack C: Maximum customization, willing to invest development time
  • Stack D: NLP-focused, character analysis is secondary

Implementation Priority#

Phase 1: Proof of Concept (1-2 days)#

  • Download makemeahanzi JSON
  • Parse into Python dict/SQLite
  • Test character decomposition queries
  • Install HanLP, test word segmentation

Phase 2: Integration (3-5 days)#

  • Build unified API layer
  • Combine word segmentation + character analysis
  • Create sample use cases
  • Performance testing

Phase 3: Enhancement (1-2 weeks)#

  • Add CJKVI-IDS for extended coverage
  • Optimize performance (caching, indexing)
  • Build higher-level functions (search, etymology lookup)
  • Documentation

Phase 4: Production (ongoing)#

  • Deploy as service or library
  • Monitor usage patterns
  • Refine based on real needs
  • Continuous data quality improvements

Bottom line: The right tool depends on whether you’re analyzing characters (structure) or words (segmentation). Most projects need both, so a hybrid approach (makemeahanzi + HanLP) provides the best balance of capability and maintainability.


Use Case: Educational & Learning Tools#

Goal#

Help learners understand Chinese characters through component analysis and etymology, making character acquisition more systematic and memorable.

User Stories#

  1. Student learning character 認 (recognize):

    • Wants to know: What are the components?
    • System shows: ⿰言忍 (speech + endure)
    • Etymology: pictophonetic; 言 (speech) carries the meaning, 忍 (rěn) the sound (忍 itself is 刃 blade over 心 heart)
  2. Teacher creating flashcards:

    • Needs: Systematic grouping by radicals/components
    • System provides: Characters sharing components (e.g., all with 氵water radical)
    • Use for: Progressive learning sequences
  3. Self-study app developer:

    • Requirement: Show character decomposition in mobile app
    • Data needs: IDS, stroke order, semantic hints
    • Performance: Fast lookup, offline capability

Required Capabilities#

Character-Level#

  • ✅ IDS decomposition (show structure)
  • ✅ Radical identification (search by radical)
  • ✅ Etymology data (learning mnemonics)
  • ✅ Stroke order (writing practice)
  • ⚠️ Semantic/phonetic classification (which component carries meaning vs. sound)

Word-Level#

  • Optional: Word segmentation for context
  • Not needed: Complex morphological tagging

Library Fit Analysis#

makemeahanzi#

Excellent fit (9/10):

  • ✅ Rich etymology data with type (ideographic, pictophonetic, etc.)
  • ✅ Semantic and phonetic hints explicitly marked
  • ✅ SVG stroke data for animations
  • ✅ IDS decomposition
  • ✅ JSON format (easy integration with web/mobile)
  • ⚠️ Coverage: 9,000+ characters (sufficient for learners, not comprehensive)

Sample data:

{
  "character": "認",
  "decomposition": "⿰言忍",
  "etymology": {
    "type": "pictophonetic",
    "phonetic": "忍",
    "semantic": "言"
  }
}

cjklib/cjklib3#

Good fit (7/10):

  • ✅ Comprehensive character coverage
  • ✅ IDS decomposition
  • ✅ Radical lookups
  • ❌ Limited etymology data
  • ⚠️ Python 3 via fork (deployment complexity)
  • ❌ No stroke order SVG

CJKVI-IDS#

Moderate fit (5/10):

  • ✅ Complete CJK coverage
  • ✅ IDS decomposition
  • ❌ No etymology
  • ❌ No stroke order
  • ❌ No semantic/phonetic distinction
  • ⚠️ Raw IDS text (requires parser)

HanLP/Stanza/LTP#

Poor fit (2/10):

  • ❌ No character decomposition
  • ✅ Word segmentation (useful for context)
  • Not designed for educational character analysis

Primary: makemeahanzi

Implementation:#

  1. Download makemeahanzi JSON files
  2. Parse into SQLite or key-value store
  3. Build API:
    def get_character_info(char):
        return {
            'decomposition': '⿰言忍',
            'components': ['言', '忍'],
            'etymology': {...},
            'strokes': [svg_paths...]
        }
  4. For characters not in makemeahanzi, fall back to CJKVI-IDS (IDS only, no etymology)
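Steps 2-3 above, sketched with an in-memory SQLite store. The single inlined entry stands in for the full dictionary.txt, and the etymology fields are illustrative toy values in makemeahanzi's general shape:

```python
import json
import sqlite3

# One toy entry standing in for makemeahanzi's dictionary.txt (step 1).
entries = [
    {"character": "好", "decomposition": "⿰女子",
     "etymology": {"type": "ideographic",
                   "hint": "a woman 女 with a child 子"}},
]

# Step 2: load into SQLite keyed by character.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chars (ch TEXT PRIMARY KEY, data TEXT)")
db.executemany("INSERT INTO chars VALUES (?, ?)",
               [(e["character"], json.dumps(e)) for e in entries])

# Step 3: the lookup API.
def get_character_info(char):
    row = db.execute("SELECT data FROM chars WHERE ch = ?", (char,)).fetchone()
    return json.loads(row[0]) if row else None
```

SQLite keeps lookups fast and offline-capable, which matches the mobile-app requirement above; a plain dict works just as well for small deployments.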

Why this works:#

  • 9K characters cover HSK 1-6 + common characters (sufficient for learners)
  • Rich etymology enables mnemonic generation
  • Stroke order supports writing practice
  • JSON format = easy web/mobile integration
  • No heavy NLP dependencies

What’s missing:#

  • Component genealogy (how components evolved)
  • Cross-references to related characters
  • Difficulty ratings
  • Solution: Build on top of makemeahanzi data

Alternative Approach: Custom Dataset#

If makemeahanzi is insufficient, consider:

  1. Start with makemeahanzi core data
  2. Enhance with linguistic annotations
  3. Add pedagogy-specific fields:
    • Frequency rank
    • HSK level
    • Related characters
    • Common mnemonics
  4. Curate for learner needs

Example Integration: Flashcard App#

User sees: 認

App displays:
┌─────────────────────────────┐
│ 認 (rèn) - recognize        │
├─────────────────────────────┤
│ Components:                  │
│ 言 (speech) + 忍 (endure)    │
│                              │
│ Etymology: Pictophonetic     │
│ Phonetic: 忍 (rěn)          │
│ Semantic: 言 (speech)       │
│                              │
│ Mnemonic: enduring another's │
│ words until you recognize    │
│ their meaning                │
└─────────────────────────────┘

[Show stroke order animation]

Data source: makemeahanzi JSON + custom mnemonic database


Verdict: makemeahanzi is ideal for educational use cases. Combine with custom learning scaffolding for complete solution.


Use Case: Etymology & Character Origin Research#

Goal#

Study the historical development, original meanings, and structural evolution of Chinese characters for linguistic research or educational content creation.

User Stories#

  1. Linguist researching pictophonetic characters:

    • Query: Which characters use 工 as phonetic component?
    • System: Returns 江 (river), 紅 (red), 功 (achievement), etc.
    • Analysis: Common phonetic ‘gong’ sound
  2. Content creator making etymology videos:

    • Need: Visual evolution of character forms
    • System provides: Oracle bone → bronze → seal → modern
    • Plus: Semantic/phonetic classification
  3. Dictionary editor adding etymology notes:

    • Input: Character 信
    • System returns: Pictophonetic (人 person + 言 words)
    • Context: “信” = person standing by their words = trust
  4. Historical text researcher:

    • Analyzing archaic character forms
    • Needs variant mappings (ancient → modern)
    • Tracks character evolution through dynasties

Required Capabilities#

Etymology-Specific#

  • ✅ Character origin classification (pictographic, ideographic, pictophonetic, etc.)
  • ✅ Semantic/phonetic component identification
  • ✅ Historical forms (oracle bone, bronze, seal script)
  • ✅ Etymology explanations (why this structure?)
  • Optional: Cross-reference to variants

Structure Analysis#

  • ✅ IDS decomposition (see construction)
  • ✅ Component relationships (semantic vs. phonetic)
  • Optional: Component genealogy (how components evolved)

Library Fit Analysis#

makemeahanzi#

Excellent fit (9/10):

  • Etymology field explicitly included
  • ✅ Classification: pictographic, ideographic, pictophonetic
  • ✅ Semantic and phonetic components marked
  • ✅ IDS decomposition
  • ⚠️ Coverage: 9,000+ characters (common ones well-covered)
  • ❌ No historical forms (oracle bone, bronze, seal)

Etymology data structure:

{
  "character": "江",
  "decomposition": "⿰氵工",
  "etymology": {
    "type": "pictophonetic",
    "phonetic": "工",
    "semantic": "氵",
    "hint": "river; water (氵) + phonetic gong (工)"
  },
  "pinyin": ["jiāng"]
}

Strengths:

  • Direct etymology classification
  • Phonetic/semantic explicitly distinguished
  • Hint field provides explanation
  • Easy to build etymology database
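The phonetic and semantic component indexes mentioned above can be built as simple inverted maps; the records below are toy entries in makemeahanzi's etymology shape:

```python
from collections import defaultdict

# Toy records in makemeahanzi's etymology shape.
records = {
    "江": {"type": "pictophonetic", "phonetic": "工", "semantic": "氵"},
    "紅": {"type": "pictophonetic", "phonetic": "工", "semantic": "糹"},
    "好": {"type": "ideographic"},
}

# Inverted indexes: component -> set of characters using it.
by_phonetic = defaultdict(set)
by_semantic = defaultdict(set)
for char, ety in records.items():
    if "phonetic" in ety:
        by_phonetic[ety["phonetic"]].add(char)
    if "semantic" in ety:
        by_semantic[ety["semantic"]].add(char)
```

Built once at load time, these indexes answer queries like "all characters with phonetic 工" in constant time.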

cjklib/cjklib3#

Moderate fit (6/10):

  • ✅ IDS decomposition (structure visible)
  • ✅ Radical identification
  • ⚠️ Limited etymology information
  • ❌ No semantic/phonetic classification
  • ❌ No historical forms
  • ✅ Comprehensive character coverage

Use case: Structural analysis, but requires external etymology data

CJKVI-IDS#

Poor fit (3/10):

  • ✅ IDS structure
  • ❌ No etymology information
  • ❌ No component classification
  • ❌ No historical forms
  • Use case: Structural data only, need external etymology source

Specialized Etymology Resources (Not Libraries)#

Zhongwen.com#

  • Online etymology dictionary
  • Not programmatic access
  • Rich explanations

Chinese Etymology by Richard Sears#

  • Historical character forms database
  • Shows evolution: oracle bone → modern
  • Not easily parsable API

Wiktionary Chinese Etymology#

  • Community-contributed
  • Variable quality
  • Can be scraped but maintenance burden

For Etymology Research: makemeahanzi + External Sources#

Architecture:

Primary: makemeahanzi (9K characters with etymology)
    ↓
Fallback: Manual curation for specialized characters
    ↓
Enhancement: Scrape Sears database for historical forms
    ↓
Output: Comprehensive etymology database

Implementation:

  1. Base layer: makemeahanzi JSON

    • Parse etymology field
    • Index by type (pictographic, ideographic, pictophonetic)
    • Build phonetic component index
    • Build semantic component index
  2. Enhancement layer: Add historical forms

    • Scrape Richard Sears database
    • Map ancient forms to modern characters
    • Store evolution sequences
  3. Query API:

    def get_etymology(char):
        base = makemeahanzi_db.get(char)
        historical = sears_db.get(char)
    
        return {
            'modern': char,
            'type': base['etymology']['type'],
            'semantic': base['etymology']['semantic'],
            'phonetic': base['etymology']['phonetic'],
            'hint': base['etymology']['hint'],
            'historical_forms': historical['forms'],
            'evolution': historical['timeline']
        }

Use Case Examples#

Research Query: Find All Pictophonetic Characters with Phonetic ‘马’#

results = [
    char for char, data in etymology_db.items()
    if data['etymology']['type'] == 'pictophonetic'
    and '马' in data['etymology']['phonetic']
]
# → ['吗', '妈', '码', '玛', etc.]
# All pronounced 'ma'

Educational Content: Explain Character 好#

info = get_etymology('好')

# Output:
{
  'modern': '好',
  'structure': '⿰女子',
  'type': 'ideographic',
  'meaning': 'woman + child = good (mother with child)',
  'components': {
    '女': 'woman (semantic)',
    '子': 'child (semantic)'
  }
}

Linguistic Research: Analyze Phonetic Components#

# Find all characters using same phonetic
phonetic_family = find_by_phonetic_component('工')

# Analysis:
# 工 (gōng) - work
# 江 (jiāng) - river [older pronunciation gōng]
# 紅 (hóng) - red
# 功 (gōng) - achievement
# Common phonetic base indicating ancient sound

Gaps & Workarounds#

Gap 1: Historical Forms Not in makemeahanzi#

Workaround:

  • Scrape Richard Sears etymology database
  • Map to makemeahanzi characters
  • Store separately, join on character key

Effort: ~3-5 days scraping + integration

Gap 2: Only 9K Characters Covered#

Workaround:

  • Most research focuses on common characters (covered)
  • For rare characters: Manual annotation
  • Or: Use CJKVI-IDS for structure, manual etymology

Gap 3: No Component Evolution Tracking#

Workaround:

  • Requires linguistic research
  • Not available in any library
  • Build manually for specific research questions

Alternative Approach: Build Comprehensive Etymology DB#

If makemeahanzi is insufficient:

  1. Foundation: makemeahanzi (9K with etymology)
  2. Historical forms: Scrape Sears database
  3. Extended coverage: Manual annotation for specialized characters
  4. Component evolution: Linguistic research + manual entry
  5. Cross-references: Link related characters

Effort: Significant (~weeks/months)
Benefit: Research-grade etymology resource

Production Considerations#

For Educational Content#

  • makemeahanzi sufficient for most cases
  • Focus on common 3,000-5,000 characters
  • Supplement with curated explanations

For Linguistic Research#

  • Start with makemeahanzi
  • Enhance with historical data (Sears)
  • Accept manual work for specialized queries
  • Build custom database over time

For Dictionary/Reference#

  • makemeahanzi + manual curation
  • Quality control on etymology explanations
  • Peer review by linguists
  • Continuous refinement

Verdict: makemeahanzi provides best programmatic access to etymology data for common characters. For comprehensive research, combine with historical databases (Sears) and expect manual curation for specialized needs.


Use Case: NLP Pipeline & Text Processing#

Goal#

Process Chinese text for downstream NLP tasks: sentiment analysis, machine translation, information extraction, question answering, etc.

User Stories#

  1. Machine translation preprocessing:

    • Input: Raw Chinese text (no spaces)
    • System: Segment into words, tag POS
    • Output: Structured tokens for MT system
  2. Sentiment analysis:

    • Input: Product reviews in Chinese
    • System: Tokenize, identify entities, extract features
    • Output: Sentiment scores per aspect
  3. Named entity recognition:

    • Input: News articles
    • System: Segment text, identify person/org/location names
    • Output: Annotated entities
  4. Chatbot processing:

    • Input: User query in Chinese
    • System: Parse intent, extract entities
    • Output: Structured query for dialog system

Required Capabilities#

Word-Level (Primary)#

  • ✅ Word segmentation (critical for Chinese)
  • ✅ POS tagging (grammatical analysis)
  • ✅ NER (entity extraction)
  • ✅ Dependency parsing (sentence structure)
  • Optional: Semantic role labeling

Character-Level (Secondary)#

  • Optional: Character decomposition for OOV handling
  • Optional: Radical features for ML models

Library Fit Analysis#

HanLP#

Excellent fit (9/10):

  • ✅ Complete NLP pipeline
  • ✅ Python 3 support
  • ✅ pip installable
  • ✅ RESTful API option
  • ✅ Multilingual (130 languages)
  • ✅ Joint task training
  • ✅ Production-ready

Pipeline:

import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)

result = HanLP("我爱学习汉字")
# {
#   'tok': ['我', '爱', '学习', '汉字'],
#   'pos': ['PN', 'VV', 'VV', 'NN'],
#   'ner': [],
#   'dep': [...],
#   ...
# }

Stanza#

Excellent fit (9/10):

  • ✅ Complete pipeline
  • ✅ UD framework (cross-lingual consistency)
  • ✅ Python 3, pip installable
  • ✅ Stanford backing
  • ✅ 80 languages
  • ✅ Proven performance

Pipeline:

import stanza

nlp = stanza.Pipeline('zh')
doc = nlp("我爱学习汉字")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.pos, word.deprel)

LTP#

Excellent fit (9/10):

  • ✅ Chinese-specific optimization
  • ✅ Complete NLP pipeline
  • ✅ Python 3 (N-LTP)
  • ✅ Cloud service available
  • ✅ Academic backing (HIT)

Use case: Chinese-only applications benefit from Chinese-specific training

Comparison#

| Feature | HanLP | Stanza | LTP |
| --- | --- | --- | --- |
| Segmentation | ✅ | ✅ | ✅ |
| POS Tagging | ✅ | ✅ (UD) | ✅ |
| NER | ✅ | ✅ | ✅ |
| Dependency Parsing | ✅ | ✅ (UD) | ✅ |
| Semantic Role Labeling | ✅ | ⚠️ | ✅ |
| Languages | 130 | 80 | Chinese only |
| Framework | MTL | UD | Multi-task |
| Deployment | Lib + API | Lib | Lib + Cloud |

Character Analysis Libraries#

Poor fit (1/10):

  • cjklib, makemeahanzi: Not designed for NLP pipelines
  • Use case: Supplementary features, not primary processing

For General NLP: HanLP#

Rationale:

  • Broadest language support (future-proof)
  • RESTful API option (microservices)
  • Active development
  • Comprehensive task coverage

Architecture:

Raw Text → HanLP Pipeline → Structured Data → Downstream Tasks
                                              ↓
                                         ML Models
                                         Rule Systems
                                         Search Engines

For UD-Compliant Projects: Stanza#

Rationale:

  • Cross-lingual research
  • UD framework consistency
  • Integration with UD tools
  • Academic reproducibility

For Chinese-Only, High-Performance: LTP#

Rationale:

  • Optimized for Chinese
  • Chinese-specific models
  • Lower latency
  • Cloud service option (scalability)

Character Decomposition Integration#

Use Case: OOV Handling#

Problem: Unknown words break segmentation

Solution: Use character features

# Segment with HanLP
tokens = hanlp_segment(text)

# For OOV tokens, analyze character structure
for token in tokens:
    if is_oov(token):
        # Get character components
        components = decompose_chars(token)
        # Use components as features for classification
        features = extract_features(components)

Benefit: Character decomposition provides fallback features for ML models when encountering unknown words.

Implementation#

  1. Primary: HanLP for word segmentation
  2. Secondary: makemeahanzi/CJKVI-IDS for character features
  3. Use character features in ML pipeline:
    • Word embeddings + character embeddings
    • Radical embeddings
    • Structural features from IDS

Production Pipeline Example#

import hanlp
from character_analyzer import decompose  # Custom

# Initialize
nlp = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)

def process_document(text):
    # NLP pipeline
    result = nlp(text)

    # Enhance with character features
    for token in result['tok']:
        if is_rare(token):
            # Decompose characters in token
            char_features = [decompose(c) for c in token]
            result['char_features'] = char_features

    return result

Performance Considerations#

Throughput#

  • HanLP: ~100-1000 chars/sec (depends on model)
  • Stanza: Similar performance
  • LTP: Optimized for speed in Chinese

Optimization#

  • Batch processing (process multiple docs together)
  • Model selection (smaller models for speed vs. accuracy)
  • Caching (segment common phrases once)
  • GPU acceleration (for large-scale processing)
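The caching point can be sketched with a memoized wrapper; `raw_segment` below is a placeholder for an expensive HanLP/Stanza/LTP call, not a real segmenter:

```python
from functools import lru_cache

# `raw_segment` stands in for an expensive call into a loaded NLP
# model; the split-on-"|" behavior is purely for illustration.
def raw_segment(text):
    return text.split("|")

@lru_cache(maxsize=10_000)
def segment_cached(text):
    # lru_cache requires hashable return values, hence the tuple.
    return tuple(raw_segment(text))
```

Repeated phrases (headlines, boilerplate, product names) then hit the cache instead of the model.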

Deployment Options#

Option A: Local Library#

pip install hanlp

Pro: No network, full control
Con: Model size (~GB), memory usage

Option B: RESTful API (HanLP)#

# HanLP provides hosted API
curl -X POST https://hanlp.hankcs.com/api/...

Pro: No local setup, scalability
Con: Network dependency, cost

Option C: Self-Hosted Service#

# FastAPI + HanLP
from fastapi import FastAPI
app = FastAPI()

@app.post("/segment")
def segment(text: str):
    return nlp(text)

Pro: Control + scalability
Con: Infrastructure management


Verdict: HanLP, Stanza, or LTP are all excellent for NLP pipelines. Choose based on: multilingual needs (HanLP), UD framework (Stanza), or Chinese-only optimization (LTP). Character decomposition is supplementary, not primary.


Use Case: Search & Information Retrieval#

Goal#

Enable search and retrieval of Chinese text based on character structure, components, and morphological properties.

User Stories#

  1. Dictionary user searching by radical:

    • “Show me all characters with 氵 water radical”
    • System returns: 江, 河, 湖, 海, 洋, 汉, etc.
    • Sorted by stroke count or frequency
  2. Unicode researcher finding similar glyphs:

    • “Find characters structurally similar to 好”
    • System finds: All ⿰ left-right compositions
    • Filter by specific components
  3. Translation tool handling variants:

    • Input: Traditional 學 or Simplified 学
    • System recognizes as variants
    • Returns unified results
  4. Digital library indexing text:

    • Index historical documents by characters
    • Handle variant forms
    • Enable component-based search

Required Capabilities#

Character-Level#

  • ✅ Component extraction (index by radicals/components)
  • ✅ IDS parsing (structural similarity)
  • ✅ Variant mapping (traditional/simplified/regional)
  • ✅ Radical classification (standard lookup tables)
  • ⚠️ Fuzzy matching (partial component match)

Word-Level#

  • ✅ Word segmentation (identify searchable units)
  • Optional: Morphological normalization

Library Fit Analysis#

cjklib/cjklib3#

Excellent fit (9/10):

  • getCharactersForComponents() - exact match search
  • getRadicalForms() - handle radical variants
  • getCharacterVariants() - traditional/simplified mapping
  • ✅ Comprehensive CJK coverage
  • ✅ SQLite backend (efficient lookup)
  • ⚠️ Python 3 via fork

Key APIs for search:

from cjklib import characterlookup
cjk = characterlookup.CharacterLookup('T')  # Traditional

# Find by component
chars = cjk.getCharactersForComponents(['氵'])

# Get variants
variants = cjk.getCharacterVariants('學')  # → ['学', ...]

# Radical-based search
radical_chars = cjk.getCharactersForRadical('水')

CJKVI-IDS#

Good fit (7/10):

  • ✅ Complete IDS data (structural search)
  • ✅ Can build component index
  • ⚠️ Requires custom search implementation
  • ❌ No built-in variant mapping
  • ❌ No radical classification tables

Use case:

  • Parse IDS for all characters
  • Build inverted index: component → [characters]
  • Implement fuzzy structure matching
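The inverted-index idea above can be sketched in a few lines of Python. The tab-separated line shape (codepoint, character, IDS) follows the published CJKVI-IDS files, but verify it against the data you download; `build_component_index` is an illustrative name, not a real API:

```python
from collections import defaultdict

# Unicode Ideographic Description Characters (structure operators, not components)
IDC = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

def build_component_index(ids_lines):
    """Build an inverted index: component -> set of characters.

    Each line is assumed tab-separated as in the CJKVI-IDS files:
    codepoint, character, IDS sequence (e.g. "U+6C5F\t江\t⿰氵工").
    """
    index = defaultdict(set)
    for line in ids_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue
        char, ids = parts[1], parts[2]
        # Every code point in the IDS that is not a structure operator
        # (and not the character itself) is treated as a component.
        for comp in ids:
            if comp not in IDC and comp != char:
                index[comp].add(char)
    return index

sample = ["U+6C5F\t江\t⿰氵工", "U+6CB3\t河\t⿰氵可"]
idx = build_component_index(sample)
print(sorted(idx["氵"]))  # characters containing the water radical
```

A real implementation would also recurse into nested IDS expressions so that sub-components are indexed too.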

makemeahanzi#

Moderate fit (6/10):

  • ✅ IDS decomposition
  • ⚠️ Limited coverage (9K chars)
  • ❌ No variant tables
  • ❌ No radical classification
  • ✅ Easy to index (JSON)

Best for: Simple component search in constrained character set

HanLP#

Moderate fit for word-level (6/10):

  • ✅ Word segmentation (identify search terms)
  • ✅ Can index at word level
  • ❌ No character structure search
  • Use case: “Find documents containing ‘学习’” (word search, not component search)

For Character Search: cjklib/cjklib3#

Architecture:

User Query → cjklib API → SQLite DB → Results

Implementation:

  1. Use cjklib3 fork for Python 3
  2. Build search index on startup
  3. Implement query patterns:
    • By radical: cjk.getCharactersForRadical()
    • By component: cjk.getCharactersForComponents()
    • By structure: Parse IDS, filter by pattern
    • Variants: cjk.getCharacterVariants()

Advantages:

  • Proven search algorithms
  • Comprehensive data
  • SQL backend = efficient queries
  • Handles complex cases (variant forms, radical positioning)

For Document Search: HanLP + cjklib#

Hybrid approach:

  1. Preprocessing: HanLP segments documents into words
  2. Indexing:
    • Word-level index (standard search)
    • Character-level index (cjklib for component search)
  3. Query:
    • Regular search: Use word index
    • Component search: Use character index

Example:

Document: "我喜欢学习汉字"
Word index: ["我", "喜欢", "学习", "汉字"]
Character index: {
  '子': ['字'],
  '宀': ['字'],
  '氵': ['汉'],
  // etc.
}

Query "characters with 氵" → returns positions of: 汉
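A minimal sketch of this hybrid indexing, with toy stand-ins for the segmenter and the decomposition lookup (the real versions would come from HanLP and cjklib/makemeahanzi; `build_indexes` is an illustrative name):

```python
def build_indexes(documents, segment, decompose):
    """Build word- and component-level indexes over documents.

    `segment` is any word segmenter and `decompose` maps a character
    to its components; both are assumed interfaces here.
    """
    word_index, char_index = {}, {}
    for doc_id, text in enumerate(documents):
        for word in segment(text):
            word_index.setdefault(word, set()).add(doc_id)
        for ch in text:
            for comp in decompose(ch):
                char_index.setdefault(comp, set()).add(doc_id)
    return word_index, char_index

# Toy stand-ins so the sketch runs without either dependency:
segment = lambda t: [t[i:i + 2] for i in range(0, len(t), 2)]
decomp_table = {"汉": ["氵", "又"], "字": ["宀", "子"]}
decompose = lambda ch: decomp_table.get(ch, [])

words, chars = build_indexes(["我爱汉字"], segment, decompose)
```

Querying "documents containing a 氵 character" is then a single dictionary lookup in `chars`.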

Production Considerations#

Performance#

  • cjklib SQLite: Fast lookups for single-character queries
  • Inverted index: Fast component search
  • Caching: Cache common queries (radical lists)

Scalability#

  • Character search: O(1) with proper indexing
  • Document search: O(n) with inverted index
  • Structure matching: O(n) requires full scan, optimize with filters

Deployment#

  • Option A: Use cjklib3 fork (accept Python 3 complexity)
  • Option B: Port cjklib algorithms to modern Python, use CJKVI-IDS data
  • Option C: Use cjklib Python 2 via subprocess (isolation)

If cjklib complexity is prohibitive:

  1. Data: CJKVI-IDS (all CJK characters with IDS)
  2. Preprocessing:
    • Parse IDS into component trees
    • Build inverted index: component → characters
    • Create radical mapping table
  3. Query engine:
    • Component search: O(1) lookup in index
    • Structure search: Filter by IDS pattern
    • Fuzzy match: Levenshtein distance on IDS strings

Effort: ~1-2 weeks of development
Advantage: Full control, modern Python 3, no legacy dependencies
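The fuzzy-match step can be sketched with a plain Levenshtein distance over IDS strings; `fuzzy_structure_match` and the example table are illustrative, not a real API:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_structure_match(query_ids, ids_table, max_dist=1):
    """Return characters whose IDS is within max_dist edits of the query.

    `ids_table` maps character -> IDS string. This is a full scan;
    in practice you would prefilter by structure operator or component.
    """
    return [ch for ch, ids in ids_table.items()
            if levenshtein(query_ids, ids) <= max_dist]

table = {"江": "⿰氵工", "河": "⿰氵可", "好": "⿰女子"}
print(fuzzy_structure_match("⿰氵工", table))
```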


Verdict: For production character search, cjklib provides comprehensive functionality despite Python 2 legacy. For new projects, consider building custom search on CJKVI-IDS data with modern Python.


S4-Strategic: Approach#

Evaluation Method#

Assess long-term viability and strategic considerations for each approach beyond immediate technical capabilities.

Strategic Dimensions#

1. Maintenance & Longevity#

  • Is the project actively maintained?
  • Risk of abandonment?
  • Community health and contribution patterns
  • Corporate vs. academic backing

2. Evolution & Adaptation#

  • Can the tool adapt to new requirements?
  • Extensibility and customization options
  • Migration paths if needs change
  • Future-proofing considerations

3. Ecosystem Integration#

  • Does it play well with other tools?
  • Standard formats and protocols?
  • Community libraries and extensions
  • Lock-in risks

4. Resource Requirements#

  • Development effort for integration
  • Ongoing maintenance burden
  • Infrastructure costs
  • Team skill requirements

5. Legal & Licensing#

  • License compatibility
  • Data licensing and attribution
  • Export/usage restrictions
  • IP considerations

6. Build vs. Buy vs. Data#

  • Library adoption
  • Build custom on open data
  • Use existing data sources directly
  • Hybrid approaches

Assessment Framework#

For each major approach:

  1. Current State - Where is it today?
  2. Trajectory - Where is it heading?
  3. Risks - What could go wrong?
  4. Mitigation - How to manage risks?
  5. Exit Strategy - How to pivot if needed?

Approaches to Assess#

Approach A: cjklib (Legacy Library)#

  • Python 2 codebase
  • Comprehensive functionality
  • Fork for Python 3 (cjklib3)

Approach B: makemeahanzi (Data-First)#

  • Open JSON data
  • No library dependency
  • Build custom tooling

Approach C: NLP Platforms (HanLP/Stanza/LTP)#

  • Production NLP pipelines
  • Character analysis secondary

Approach D: Hybrid (Data + Tools)#

  • Combine data sources with libraries
  • Selective integration

Will create separate viability assessments for each approach with strategic recommendations.


Strategic Viability: cjklib#

Current State#

Library: cjklib - Han character library for CJKV languages
Maintainer: cburgmer (original), free-utils-python (Python 3 fork)
Status: Python 2 EOL, Python 3 via fork
License: MIT (permissive)

Maintenance & Longevity Assessment#

Original cjklib (cburgmer/cjklib)#

Risk Level: HIGH

  • ❌ Python 2 only (EOL since 2020)
  • ❌ No commits addressing Python 3 in main repo
  • ⚠️ Open issue #11 (Python 3 support) since 2017, unresolved
  • ⚠️ Unclear maintenance status
  • ✅ MIT license allows forking

cjklib3 Fork (free-utils-python/cjklib3)#

Risk Level: MODERATE-HIGH

  • ✅ Python 3 compatibility via 2to3
  • ⚠️ Fork maintenance unclear
  • ⚠️ Not on PyPI (manual installation)
  • ⚠️ Depends on 2to3 conversion (not native Python 3)
  • ⚠️ Small community around fork

Verdict#

High technical debt, uncertain future

Evolution & Adaptation#

Extensibility#

  • ✅ SQLite backend = easy to extend data
  • ✅ Modular architecture
  • ❌ Python 2 codebase limits modern Python features
  • ⚠️ Would require significant refactoring for modernization

Migration Paths#

Option A: Continue with fork

  • Keep using cjklib3 as-is
  • Accept technical debt
  • Risk: Fork abandonment

Option B: Port to native Python 3

  • Rewrite cjklib in modern Python
  • Use same algorithms, fresh codebase
  • Effort: ~2-4 weeks
  • Benefit: Full control, modern code

Option C: Extract data, abandon library

  • Parse cjklib’s SQLite database
  • Build minimal custom API
  • Effort: ~1 week
  • Benefit: No legacy code

Ecosystem Integration#

Standards Compliance#

  • ✅ Unicode IDS standard
  • ✅ SQLite (universal format)
  • ✅ Standard radical classifications
  • ✅ Well-documented file formats

Integration Points#

  • ⚠️ Python 2/3 split complicates integration
  • ✅ Can extract data for use elsewhere
  • ✅ SQL interface standard

Resource Requirements#

Initial Integration#

  • Time: 1-2 weeks (setup fork, database build)
  • Skills: Python, SQL, CJK character knowledge
  • Infrastructure: Minimal (local DB)

Ongoing Maintenance#

  • Monitoring: Watch for fork updates (low frequency expected)
  • Updates: Manual process (rebuild from fork)
  • Risk Management: Plan for fork abandonment

Exit Costs#

  • Low: Data is accessible via SQLite
  • Migration: Can extract data, build new tool
  • No vendor lock-in: Open source, standard formats

License: MIT#

  • ✅ Permissive use
  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ No copyleft requirements

Attribution#

  • Required: MIT license notice
  • Low burden: Include license file

Data Sources#

  • ✅ Multiple open character databases
  • ✅ Unicode standard (public)
  • No licensing concerns for data

Strategic Risks#

Risk 1: Fork Abandonment#

Probability: MODERATE-HIGH
Impact: MODERATE (data remains accessible)

Mitigation:

  • Extract SQLite database
  • Document data schema
  • Prepare migration to custom code

Risk 2: Python 3 Incompatibility Issues#

Probability: LOW-MODERATE
Impact: LOW-MODERATE (can fix or work around)

Mitigation:

  • Test thoroughly before production
  • Isolate in separate service if needed
  • Have extraction plan ready

Risk 3: Limited Community Support#

Probability: HIGH
Impact: LOW (reduces feature additions, not core functionality)

Mitigation:

  • Don’t depend on new features
  • Budget for custom fixes
  • Consider eventual migration

Build vs. Adopt Decision#

Adopt cjklib3 If:#

  • ✅ Need comprehensive character search immediately
  • ✅ Can accept Python 3 fork complexity
  • ✅ Short-term solution acceptable (~1-2 years)
  • ✅ Plan to migrate eventually

Build Custom If:#

  • ✅ Long-term solution needed (5+ years)
  • ✅ Have development resources (2-4 weeks)
  • ✅ Want full control and modern code
  • ✅ Can leverage existing data (CJKVI-IDS, makemeahanzi)

Phase 1: Use Data, Not Library (LOW RISK)#

  1. Extract data from cjklib’s sources or use CJKVI-IDS directly
  2. Build minimal Python 3 API for your needs
  3. Avoid dependency on cjklib code
  4. Use cjklib algorithms as reference, but reimplement

Phase 2: Enhance as Needed#

  1. Start with basic character decomposition
  2. Add search features progressively
  3. Optimize based on actual usage
  4. No technical debt from legacy code

Rationale:#

  • Same data quality (using same sources)
  • Modern Python 3 (clean implementation)
  • Full control (no fork dependency)
  • Lower risk (no abandonment concerns)
  • Maintainable (code you understand)

Timeline & Effort#

Data Extraction: 2-3 days#

  • Parse CJKVI-IDS or cjklib DB
  • Load into modern format (JSON, SQLite, etc.)

Basic API: 3-5 days#

  • Character decomposition
  • Radical lookup
  • IDS parsing

Advanced Features: 1-2 weeks#

  • Component search
  • Variant mappings
  • Optimization

Total: 2-3 weeks for production-ready solution

vs. cjklib3 fork: 1 week setup + ongoing maintenance burden

Exit Strategy#

If cjklib approach fails:

  1. Data is preserved - SQLite DB, IDS files remain accessible
  2. Migrate to makemeahanzi - JSON data, easier integration
  3. Build on CJKVI-IDS - Authoritative IDS source
  4. No data loss - All approaches use same underlying data

Low switching cost due to standard formats and open data


Strategic Verdict: MODERATE RISK

cjklib provides excellent functionality but carries significant technical debt. Recommended approach: Extract data and algorithms, rebuild in modern Python for long-term viability. Short-term use acceptable if migration plan exists.

Timeline: Can use fork for <1 year; plan migration to custom solution
Effort: 2-3 weeks to build modern replacement
Risk: Manageable via data extraction and migration plan


Strategic Viability: makemeahanzi#

Current State#

Resource: makemeahanzi - Free, open-source Chinese character data
Maintainer: skishore (GitHub)
Status: Active, data-focused project
License: LGPL (with exceptions for data)
Format: JSON (dictionary.txt, graphics.txt)

Maintenance & Longevity Assessment#

Risk Level: LOW-MODERATE

Active Maintenance#

  • ✅ GitHub repository active
  • ✅ Community contributions
  • ✅ Clear data format
  • ✅ Multiple forks/derivatives (healthy ecosystem)

Sustainability#

  • ✅ Data project (not code) = lower maintenance burden
  • ✅ Unicode IDS standard unlikely to change
  • ✅ Can be maintained by community even if original author steps back
  • ✅ Multiple projects using it as data source

Longevity Indicators#

  • ✅ Used by other projects (HanziJS, character learning apps)
  • ✅ Standard format (JSON)
  • ✅ Well-documented schema
  • ⚠️ Coverage limited to ~9,000 characters

Verdict: Low maintenance risk, sustainable data resource

Evolution & Adaptation#

Extensibility#

  • ✅ JSON format = easy to extend with custom fields
  • ✅ Can merge with other data sources (CJKVI-IDS, etc.)
  • ✅ No library lock-in = full flexibility
  • ✅ Can build any API layer on top

Data Completeness Evolution#

  • Current: ~9,000 characters
  • Path to expand: Add more characters following same schema
  • Community can contribute additional character data
  • Can supplement with CJKVI-IDS for full CJK coverage

Migration Paths#

  • Minimal risk: Data format stable and standard
  • Easy export: Already JSON (universally parsable)
  • Can enhance: Add fields without breaking existing data
  • No vendor lock-in: Pure data, not proprietary format

Ecosystem Integration#

Standards Compliance#

  • ✅ Unicode characters
  • ✅ IDS standard for decomposition
  • ✅ Pinyin standard
  • ✅ SVG for stroke graphics
  • ✅ JSON (universal format)

Integration Points#

  • ✅ Easy to import into any language/platform
  • ✅ Can load into databases (SQL, NoSQL)
  • ✅ No runtime dependencies
  • ✅ Works with any tech stack

Ecosystem Health#

  • ✅ Multiple projects built on makemeahanzi
  • ✅ Active discussion and contributions
  • ✅ Clear documentation
  • ✅ Examples available

Verdict: Excellent ecosystem integration

Resource Requirements#

Initial Integration#

  • Time: 1-2 days (parse JSON, load into DB/structure)
  • Skills: Basic programming, JSON parsing
  • Infrastructure: Minimal (local files)
  • Dependencies: None (pure data)

Ongoing Maintenance#

  • Monitoring: Check GitHub for updates (low frequency)
  • Updates: Download new JSON files (simple)
  • No runtime dependencies: Just data files
  • Minimal burden: Data doesn’t need “maintenance”

Development Flexibility#

  • ✅ Build API in any language
  • ✅ Choose your own architecture
  • ✅ Optimize for your use case
  • ✅ No framework constraints

License: LGPL with Exceptions#

Character data: Special exception allows use without LGPL restrictions
Derivations: Can use data freely, even in proprietary systems
Attribution: Required (give credit to the makemeahanzi project)

License Risk: LOW#

  • Data usage is permissive
  • Commercial use allowed
  • No copyleft concerns for data
  • Attribution burden minimal

Data Provenance#

  • ✅ Clearly documented sources
  • ✅ Public domain Unicode standard
  • ✅ Community-contributed etymology
  • ✅ No IP concerns

Verdict: License-friendly for all use cases

Strategic Risks#

Risk 1: Coverage Limitations#

Probability: HIGH (intentional limitation)
Impact: MODERATE (9K chars sufficient for most cases)

Mitigation:

  • Covers HSK 1-6 + common characters
  • Supplement with CJKVI-IDS for rare characters
  • Prioritize common characters (80/20 rule)

Analysis:

  • 9,000 characters cover >99% of everyday text
  • Specialized needs can combine with other sources
  • Not a blocker for most applications

Risk 2: Maintainer Availability#

Probability: LOW-MODERATE (single maintainer)
Impact: LOW (data project, can be forked)

Mitigation:

  • Fork if needed (multiple forks exist)
  • Data is stable (not frequent updates needed)
  • Community can contribute
  • JSON format = future-proof

Risk 3: Data Quality Issues#

Probability: LOW (well-curated)
Impact: MODERATE (errors in etymology)

Mitigation:

  • Spot-check critical characters
  • Community review process
  • Can correct locally if needed
  • Submit fixes upstream

Build vs. Adopt Decision#

Adopt makemeahanzi Data If:#

  • ✅ Need character data (decomposition, etymology, strokes)
  • ✅ Want zero dependencies
  • ✅ Coverage of 9K characters sufficient
  • ✅ Prefer data-first architecture
  • ✅ Want full control over implementation

Supplement with Other Sources If:#

  • Need rare characters beyond 9K (use CJKVI-IDS)
  • Need historical forms (use Sears database)
  • Need comprehensive search (reference cjklib algorithms)

Phase 1: Foundation (2-3 days)#

  1. Download makemeahanzi JSON files
  2. Parse into your preferred format (SQLite, JSON, in-memory)
  3. Build basic query API
  4. Test with sample characters
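Phase 1 parsing can be this simple. The field names ("character", "decomposition", "radical") follow makemeahanzi's documented one-JSON-object-per-line schema for dictionary.txt, but verify them against the files you download; the function names are illustrative:

```python
import json

def parse_entries(lines):
    """Parse makemeahanzi dictionary lines (one JSON object per line)
    into a character -> entry mapping."""
    entries = {}
    for line in lines:
        if line.strip():
            entry = json.loads(line)
            entries[entry["character"]] = entry
    return entries

def load_dictionary(path):
    """Load dictionary.txt from disk."""
    with open(path, encoding="utf-8") as f:
        return parse_entries(f)

# Illustrative entry in the documented shape:
sample = ['{"character": "好", "decomposition": "⿰女子", "radical": "女"}']
entries = parse_entries(sample)
```

Loading the full file into SQLite or an in-memory dict from here is a few more lines, which is why the "1-2 days" estimate below is realistic.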

Phase 2: Enhancement (1 week)#

  1. Add CJKVI-IDS for extended coverage
  2. Build search indexes (by component, radical, etc.)
  3. Optimize for performance (caching, indexes)
  4. Create higher-level APIs for your use cases

Phase 3: Polish (ongoing)#

  1. Add custom annotations
  2. Improve etymology explanations
  3. Contribute fixes back to project
  4. Expand coverage as needed

Comparison with cjklib#

| Aspect      | makemeahanzi | cjklib            |
|-------------|--------------|-------------------|
| Format      | JSON data    | Python library    |
| Coverage    | 9K chars     | Comprehensive     |
| Etymology   | ✅ Rich      | ⚠️ Limited        |
| Maintenance | ✅ Active    | ❌ Python 2       |
| Setup       | Easy         | Complex (fork)    |
| Lock-in     | None         | Moderate          |
| Flexibility | ✅ Full      | ⚠️ API limits     |
| Long-term   | ✅ Low risk  | ⚠️ Technical debt |

Timeline & Effort#

Minimal Integration: 1-2 days#

  • Parse JSON
  • Basic lookup functions
  • Sufficient for simple use cases

Production-Ready: 1 week#

  • Database loading
  • Indexed search
  • Performance optimization
  • Error handling

Advanced Features: 2-3 weeks#

  • Complex search (component-based)
  • Integrated with CJKVI-IDS
  • Custom APIs for specific needs
  • Web service deployment

Total: 1 week to production, 3 weeks to comprehensive solution

Exit Strategy#

If makemeahanzi approach insufficient:

  1. Easy to migrate: JSON → any format
  2. Combine with CJKVI-IDS: Add coverage
  3. Supplement with cjklib data: Extract from SQLite
  4. No code to rewrite: Pure data, API is yours

Zero switching cost - data is universally accessible


Strategic Verdict: LOW RISK, HIGH FLEXIBILITY

makemeahanzi offers excellent data quality with minimal integration risk. Recommended for:

  • Educational applications
  • Etymology research
  • Character learning tools
  • Any project wanting full control

Timeline: 1 week to production-ready
Effort: Low (simple data parsing)
Risk: Minimal (standard formats, active project, no lock-in)
Flexibility: Maximum (build any architecture)

Recommendation: Default choice for character decomposition needs. Start here, enhance as needed.


Strategic Viability: NLP Platforms (HanLP/Stanza/LTP)#

Current State#

Platforms Assessed:

  • HanLP: Multilingual NLP toolkit (130 languages)
  • Stanza: Stanford NLP with UD framework (80 languages)
  • LTP: Language Technology Platform (Chinese-specific)

All are actively maintained, production-ready platforms.

Maintenance & Longevity Assessment#

HanLP#

Risk Level: LOW

  • ✅ Active development (regular releases)
  • ✅ Commercial backing + open source
  • ✅ Large community
  • ✅ Multilingual (not dependent on Chinese market alone)
  • ✅ Modern architecture (PyTorch/TensorFlow)

Longevity: HIGH - sustained by commercial interest and academic use

Stanza#

Risk Level: LOW

  • ✅ Stanford NLP backing (institutional support)
  • ✅ Academic research foundation
  • ✅ UD framework (cross-lingual standard)
  • ✅ Regular updates with UD releases
  • ✅ Large research community

Longevity: HIGH - academic foundation ensures continuity

LTP#

Risk Level: LOW-MODERATE

  • ✅ HIT (Harbin Institute of Technology) backing
  • ✅ Academic + industry use in China
  • ✅ N-LTP modernization (2020+)
  • ⚠️ More dependent on Chinese NLP market
  • ✅ Active GitHub repository

Longevity: MODERATE-HIGH - strong in Chinese NLP space

Overall Verdict: All three platforms have strong long-term prospects

Evolution & Adaptation#

HanLP#

  • ✅ Multi-task learning architecture (flexible)
  • ✅ RESTful API option (cloud-ready)
  • ✅ Continuous model improvements
  • ✅ Expanding language support
  • ✅ Integration with modern ML frameworks

Adaptation: EXCELLENT - designed for evolution

Stanza#

  • ✅ UD framework updates regularly
  • ✅ New language support added
  • ✅ Model improvements with each UD release
  • ✅ Research-driven enhancements
  • ⚠️ Tied to UD annotation conventions

Adaptation: GOOD - evolves with UD standard

LTP#

  • ✅ N-LTP modernization (neural architecture)
  • ✅ Cloud service option
  • ✅ Chinese language focus allows specialization
  • ⚠️ Less flexible for non-Chinese needs

Adaptation: GOOD - modernizing actively

Ecosystem Integration#

HanLP#

  • ✅ Haystack integration
  • ✅ spaCy pipeline compatibility
  • ✅ TensorFlow/PyTorch backends
  • ✅ RESTful API (microservices)
  • ✅ Python + Java support

Integration: EXCELLENT

Stanza#

  • ✅ UD framework (cross-tool compatibility)
  • ✅ spaCy integration available
  • ✅ Python native
  • ✅ Well-documented API
  • ✅ Research-standard outputs

Integration: EXCELLENT

LTP#

  • ✅ Python API
  • ✅ Cloud service option
  • ✅ Standard NLP outputs
  • ⚠️ Less third-party integration than HanLP/Stanza

Integration: GOOD

Resource Requirements#

Initial Integration#

| Platform | Setup Time | Skills Needed      | Infrastructure |
|----------|------------|--------------------|----------------|
| HanLP    | 1-2 hours  | Python, NLP basics | ~1-2GB model   |
| Stanza   | 1-2 hours  | Python, NLP basics | ~500MB model   |
| LTP      | 1-2 hours  | Python, NLP basics | ~1GB model     |

All are pip-installable, well-documented, production-ready.

Ongoing Maintenance#

All platforms: LOW maintenance burden

  • ✅ Simple pip install updates
  • ✅ Automatic model downloads
  • ✅ Stable APIs
  • ⚠️ Need to monitor breaking changes in major versions

Performance Optimization#

  • GPU acceleration available (optional)
  • Batch processing built-in
  • Model selection (speed vs. accuracy trade-offs)
  • Caching strategies

HanLP#

License: Apache 2.0 (permissive)

  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ Patent grant included
  • No licensing concerns

Stanza#

License: Apache 2.0 (permissive)

  • ✅ Commercial use allowed
  • ✅ Stanford institutional backing
  • ✅ No usage restrictions
  • No licensing concerns

LTP#

License: Apache 2.0 (original), check N-LTP

  • ✅ Generally permissive
  • ✅ Academic use clearly allowed
  • ✅ Commercial use typically allowed
  • Verify specific version

Overall: No license concerns for any platform

Strategic Risks#

Risk 1: Model Obsolescence#

Probability: MODERATE (ML evolves rapidly)
Impact: LOW (platforms update models)

Mitigation:

  • All platforms provide model updates
  • Can train custom models if needed
  • API abstracts model changes

Risk 2: API Breaking Changes#

Probability: LOW-MODERATE
Impact: MODERATE (code changes needed)

Mitigation:

  • Version pinning
  • Test updates before production
  • All platforms maintain compatibility

Risk 3: Platform Abandonment#

Probability: LOW (all have institutional backing)
Impact: HIGH (major migration needed)

Mitigation:

  • Choose platform with strongest backing (Stanza/HanLP)
  • Standard outputs (UD) ease migration
  • Multiple viable alternatives

Risk 4: Performance/Cost at Scale#

Probability: MODERATE (depends on usage)
Impact: MODERATE-HIGH (infrastructure costs)

Mitigation:

  • Benchmark early
  • Consider model size trade-offs
  • Cloud service options (HanLP, LTP)
  • GPU optimization

Build vs. Adopt Decision#

Adopt Platform (HanLP/Stanza/LTP) If:#

  • ✅ Need production NLP pipeline
  • ✅ Word segmentation is primary need
  • ✅ Want proven, maintained solution
  • ✅ Have standard NLP requirements
  • ✅ Want to leverage state-of-the-art models

Build Custom If:#

  • ❌ Have unique NLP requirements (rare)
  • ❌ Need extreme performance optimization
  • ❌ Have significant ML expertise and resources

Recommendation: ADOPT - building NLP pipelines from scratch is not cost-effective

Platform Selection Criteria#

Choose HanLP If:#

  • ✅ Need multilingual support (current or future)
  • ✅ Want comprehensive task coverage
  • ✅ Value RESTful API option
  • ✅ Want active commercial ecosystem

Choose Stanza If:#

  • ✅ Need UD-compliant outputs
  • ✅ Cross-lingual research
  • ✅ Academic reproducibility important
  • ✅ Stanford NLP ecosystem integration

Choose LTP If:#

  • ✅ Chinese-only application
  • ✅ Want Chinese-optimized performance
  • ✅ Need cloud service option
  • ✅ Familiar with HIT ecosystem

Character Analysis Integration#

Key Finding: All NLP platforms operate at word level, not character level.

For Character Decomposition Needs:#

Must combine with separate tool:

  • makemeahanzi (data)
  • cjklib (library)
  • CJKVI-IDS (data)

Integration Architecture:#

Text → NLP Platform (word segmentation) → Words
Words → Character Tool (decomposition) → Components

No lock-in: NLP and character analysis are separate concerns
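A sketch of the connector between the two layers; `segment` and `decompose` are assumed interfaces standing in for the real HanLP and makemeahanzi calls:

```python
def annotate(text, segment, decompose):
    """Two-layer pipeline: the NLP platform supplies words, the
    character tool supplies components. Both callables are assumed
    interfaces - plug in e.g. a HanLP tokenizer for `segment` and a
    makemeahanzi lookup for `decompose`."""
    return [{"word": w, "components": {ch: decompose(ch) for ch in w}}
            for w in segment(text)]

# Toy stand-ins so the sketch runs without either dependency:
segment = lambda t: [t[i:i + 2] for i in range(0, len(t), 2)]
decomp_table = {"汉": ["氵", "又"], "字": ["宀", "子"]}
decompose = lambda ch: decomp_table.get(ch, [])

result = annotate("汉字", segment, decompose)
```

Because neither layer knows about the other, either can be swapped without touching `annotate`.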

Phase 1: Adopt for Word Processing#

  1. Choose platform based on criteria above
  2. Integrate for word segmentation, POS, NER
  3. Deploy in production
  4. Monitor performance

Phase 2: Add Character Analysis#

  1. Separately integrate makemeahanzi or cjklib
  2. Build connector layer
  3. Combine outputs as needed
  4. No dependency between layers

Phase 3: Optimize#

  1. Cache common segmentations
  2. Batch processing
  3. GPU acceleration if needed
  4. Model tuning

Timeline & Effort#

Integration: 1-2 days#

  • Install platform
  • Basic API usage
  • Test with sample data

Production Deployment: 1 week#

  • Performance testing
  • Error handling
  • Logging and monitoring
  • Documentation

Optimization: 1-2 weeks#

  • Benchmarking
  • Caching strategy
  • Infrastructure setup
  • Load testing

Total: 1-2 weeks to production

Exit Strategy#

If platform switch needed:

Low switching cost because:#

  1. Standard outputs: UD annotations, token lists
  2. Multiple alternatives: HanLP, Stanza, LTP all viable
  3. Modular architecture: NLP layer separate from application logic
  4. No data lock-in: Text processing is stateless

Migration path:#

  1. Run old and new platform in parallel
  2. Compare outputs
  3. Gradually shift traffic
  4. Low risk, controlled process

Strategic Verdict: LOW RISK, HIGH VALUE

NLP platforms are mature, well-supported, and provide essential functionality. Recommended for all projects needing word segmentation.

Choose Based On:

  • Multilingual: HanLP
  • Academic/UD: Stanza
  • Chinese-only: LTP

Timeline: 1-2 weeks to production
Risk: Minimal (institutional backing, multiple alternatives)
Flexibility: High (standard outputs, modularity)

Recommendation: Adopt platform for word processing, combine with character analysis tool as needed. Do not build NLP pipeline from scratch.


S4-Strategic: Recommendation#

Strategic Risk Assessment Summary#

| Approach            | Maintenance Risk | Technical Debt | Flexibility | Long-term Viability |
|---------------------|------------------|----------------|-------------|---------------------|
| cjklib              | HIGH             | HIGH           | MODERATE    | ⚠️ MODERATE         |
| makemeahanzi        | LOW              | NONE           | HIGH        | ✅ HIGH             |
| NLP Platforms       | LOW              | LOW            | HIGH        | ✅ HIGH             |
| Hybrid (Data+Tools) | LOW              | LOW            | HIGH        | ✅ HIGH             |

Strategic Recommendation: Modular Architecture#

Core Principle: Separation of Concerns#

Build a modular system where character analysis and word processing are independent:

┌─────────────────────────────────────────┐
│          Application Layer              │
├─────────────────────────────────────────┤
│  Character Analysis  │  Word Processing │
│   (makemeahanzi)    │  (HanLP/Stanza)  │
├─────────────────────────────────────────┤
│          Data Layer                      │
│  JSON Files  │  Models  │  Databases    │
└─────────────────────────────────────────┘

Why This Works#

  1. Low coupling: Components can be upgraded independently
  2. Low risk: Failure in one doesn’t affect the other
  3. Flexibility: Swap components as needs evolve
  4. Standard formats: JSON, UD annotations, IDS sequences

Tier 1 Recommendation: Production Systems#

For Character Decomposition: makemeahanzi#

Why:

  • ✅ Zero technical debt (pure data)
  • ✅ Rich etymology (unique strength)
  • ✅ Standard formats (JSON, IDS)
  • ✅ Active maintenance
  • ✅ No lock-in
  • ✅ 9K character coverage sufficient for most uses

When to enhance:

  • Add CJKVI-IDS for rare characters (5% of use cases)
  • Add Sears database for historical forms (research only)

For Word Processing: HanLP or Stanza#

Why:

  • ✅ Production-ready
  • ✅ Institutional backing
  • ✅ Active development
  • ✅ No technical debt
  • ✅ Low maintenance

Choose HanLP if: Multilingual needs or want a RESTful API
Choose Stanza if: Academic UD compliance needed
Choose LTP if: Chinese-only optimization is the priority

Implementation Timeline#

Week 1: Foundation

  • Download makemeahanzi data
  • Parse into SQLite or JSON store
  • Build basic character decomposition API
  • Install HanLP/Stanza
  • Test word segmentation

Week 2: Integration

  • Combine character + word processing
  • Build unified API layer
  • Performance testing
  • Error handling

Week 3: Production

  • Deployment
  • Monitoring
  • Documentation
  • Initial optimization

Week 4+: Enhancement

  • Add CJKVI-IDS if needed
  • Custom features
  • Performance tuning
  • User feedback integration

Total: 3-4 weeks to production-ready system

Tier 2 Recommendation: Research/Academic#

For Comprehensive Character Analysis: Custom Tool on CJKVI-IDS + makemeahanzi#

Why:

  • Full CJK coverage (CJKVI-IDS)
  • Rich data for common characters (makemeahanzi)
  • Modern Python 3 implementation
  • Full control over algorithms
  • Research-grade quality

For Linguistic Research: Stanza + Custom Character Tool#

Why:

  • UD annotations (cross-lingual research)
  • Reproducible results
  • Academic standard
  • Can publish methodology

Timeline: 4-6 weeks#

  • 2 weeks: Build character analysis tool
  • 1 week: Integrate Stanza
  • 1 week: Research-specific features
  • 1-2 weeks: Validation and testing

Avoid cjklib3 Fork For New Projects#

Reasons:

  • Python 2 → 3 conversion technical debt
  • Uncertain fork maintenance
  • Manual setup complexity
  • Better alternatives available

Only use cjklib if:

  • Short-term project (<6 months)
  • Need comprehensive search immediately
  • Plan migration to modern solution
  • Can accept technical debt

Risk Mitigation Strategies#

Strategy 1: Data-First Architecture#

Protect against library obsolescence:

  • Store data in standard formats
  • Don’t tightly couple to library APIs
  • Build abstraction layer
  • Can swap libraries without data migration
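One way to build that abstraction layer is a small interface the application codes against; the class and method names here are illustrative:

```python
from typing import Protocol

class CharacterSource(Protocol):
    """Abstract interface the application depends on, so the backing
    data source (makemeahanzi, CJKVI-IDS, a cjklib export) can be
    swapped without touching callers."""
    def decomposition(self, char: str) -> str: ...
    def radical(self, char: str) -> str: ...

class JsonCharacterSource:
    """One concrete backend over a dict loaded from JSON data."""
    def __init__(self, entries: dict):
        self._entries = entries

    def decomposition(self, char: str) -> str:
        return self._entries[char].get("decomposition", "")

    def radical(self, char: str) -> str:
        return self._entries[char].get("radical", "")

source: CharacterSource = JsonCharacterSource(
    {"好": {"decomposition": "⿰女子", "radical": "女"}})
```

Replacing the data source then means writing one new class, not a data migration.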

Strategy 2: Modular Design#

Protect against component failure:

  • Character analysis independent of word processing
  • Can replace either component
  • Standard interfaces between modules
  • No cascading failures

Strategy 3: Progressive Enhancement#

Manage resource constraints:

  • Start with makemeahanzi (9K chars)
  • 99% of use cases covered
  • Add CJKVI-IDS only if needed
  • Avoid premature optimization

Strategy 4: Exit Plans#

Prepare for platform changes:

  • Standard output formats (UD, IDS, JSON)
  • Documentation of data schemas
  • Abstraction layers
  • Multiple viable alternatives

Long-Term Evolution Path#

Year 1: Foundation#

  • makemeahanzi + HanLP/Stanza
  • Basic features
  • Production deployment
  • User feedback

Year 2: Enhancement#

  • Add CJKVI-IDS for extended coverage
  • Custom algorithms for specialized needs
  • Performance optimization
  • Feature expansion based on usage

Year 3: Maturity#

  • Potentially migrate to fully custom character tool
  • Advanced features
  • Research-grade quality
  • Contribute improvements back to open source

Cost-Benefit Analysis#

Approach A: Adopt cjklib3#

  • Effort: 1 week setup + ongoing maintenance
  • Benefit: Comprehensive search immediately
  • Cost: Technical debt, uncertain future
  • Risk: HIGH

Approach B: makemeahanzi + Custom#

  • Effort: 2-3 weeks initial
  • Benefit: Modern codebase, full control, no debt
  • Cost: Initial development time
  • Risk: LOW

Approach C: Build Everything Custom#

  • Effort: 6-8 weeks
  • Benefit: Maximum control
  • Cost: Significant development resources
  • Risk: MODERATE (reinventing wheel)

Best ROI: Approach B (makemeahanzi + Custom)

Decision Matrix#

If Your Priority Is…#

Speed to market: → makemeahanzi + HanLP (1-2 weeks)

Long-term maintainability: → makemeahanzi + Custom + HanLP (3-4 weeks)

Research quality: → Custom tool + CJKVI-IDS + Stanza (4-6 weeks)

Minimal resources: → makemeahanzi data + minimal API (1 week)

Character search capabilities: → Study cjklib algorithms, implement in Python 3 (2-3 weeks)

Final Strategic Recommendation#

Build Modular System with:#

  1. Character Layer: makemeahanzi (+ CJKVI-IDS if needed)
  2. Word Layer: HanLP or Stanza
  3. Architecture: Separated, loosely coupled
  4. Timeline: 3-4 weeks to production
  5. Risk Level: LOW
  6. Future-proofing: HIGH

Why This Wins#

  • ✅ Low technical debt (modern Python, standard formats)
  • ✅ Low maintenance (stable data, proven tools)
  • ✅ High flexibility (swap components independently)
  • ✅ Low risk (institutional backing, multiple alternatives)
  • ✅ Fast time to market (3-4 weeks)
  • ✅ Room to grow (can enhance without rebuilding)

Avoid#

  • ❌ cjklib3 fork (technical debt)
  • ❌ Building NLP pipeline from scratch (reinventing wheel)
  • ❌ Monolithic architecture (high coupling)
  • ❌ Vendor lock-in (proprietary formats)

Strategic Verdict: MODULAR DATA-FIRST ARCHITECTURE

Combine proven tools (HanLP/Stanza for words) with open data (makemeahanzi for characters) in a loosely coupled architecture. This maximizes flexibility while minimizing risk and technical debt.

Timeline: 3-4 weeks to production-ready
Effort: Moderate (mostly integration, not invention)
Risk: Low (proven components, standard formats)
Flexibility: High (can evolve independently)

Next Steps:

  1. Download makemeahanzi data
  2. Choose NLP platform (HanLP recommended)
  3. Build connector layer
  4. Deploy and iterate
Published: 2026-03-06 Updated: 2026-03-06