1.148.1 Chinese Morphological Analysis#

Tools for Chinese text structure analysis (character decomposition, compound words): HanLP, Stanza, LTP, and related libraries


Explainer

Chinese Morphological Analysis: Domain Explainer#

What This Solves#

Chinese text processing faces a fundamental challenge that doesn’t exist in space-delimited languages: identifying where one meaningful unit ends and another begins. This affects everything from search engines to translation tools to educational software.

The problem splits into two distinct layers:

Character-level: Chinese characters themselves are composite structures. The character 好 (good) consists of 女 (woman) and 子 (child). Understanding these components helps with learning, search, and etymology research. But there’s no standard programming library that exposes this structure in a modern, maintainable way.

Word-level: Chinese text doesn’t use spaces between words. “我爱学习汉字” could be segmented multiple ways. Getting this wrong breaks translation, search, and text analysis. While mature tools exist for word segmentation, they operate independently from character structure analysis.

This research addresses: Which tools provide character decomposition? Which handle word segmentation? Can we combine them effectively? And critically: which approaches avoid technical debt that will burden your project for years?

Accessible Analogies#

Character Decomposition is Like Molecular Chemistry

Think of Chinese characters as molecules and components as atoms. Water (H₂O) is composed of hydrogen and oxygen. Similarly, 江 (river) is composed of 氵 (water radical) + 工 (phonetic component). Just as knowing molecular structure helps chemists, knowing character structure helps learners, search systems, and etymology researchers.

But here’s the catch: While periodic tables and molecular databases are universally standardized, Chinese character decomposition data is scattered across legacy codebases, aging libraries, and various incompatible formats. Imagine if every chemistry lab had to maintain its own periodic table with slightly different data.

Word Segmentation is Like Assembly Line Organization

Picture a factory where products flow down a conveyor belt without any dividers. Your job: figure out which items belong together as a unit. A box of screws and a wrench might be one product, or three separate items, depending on context.

Chinese text is that conveyor belt. The characters 中国人 could be “China” + “person” (Chinese person) or “middle” + “country” + “person” (person from the middle kingdom). Context determines the correct grouping. Modern NLP tools handle this well—they’re like experienced factory workers who’ve seen enough to know the patterns. But they can’t tell you why 中国 means “China” (that requires character-level analysis).

The Technical Debt Trap

Some powerful tools exist but run on outdated infrastructure—like discovering a fully-equipped machine shop from the 1950s. Everything works, but spare parts are scarce, maintenance manuals reference obsolete standards, and integrating with modern systems requires adapters and workarounds. You can use it, but every year makes it harder to maintain. That’s the Python 2 library dilemma this research confronted.

When You Need This#

You Definitely Need Character Decomposition If:#

  • Building educational apps that teach character structure
  • Creating search systems where users look up characters by components (“show me all characters with the water radical”)
  • Researching etymology or historical character evolution
  • Developing font design or glyph analysis tools
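Component-based lookup ("show me all characters with the water radical") reduces to a reverse index from components to characters. A minimal sketch, using a hypothetical toy decomposition table in place of a real data source:

```python
from collections import defaultdict

# Hypothetical toy table: character -> flat list of its components.
# In practice this would be built from a decomposition data source.
DECOMP = {
    "好": ["女", "子"],
    "江": ["氵", "工"],
    "河": ["氵", "可"],
}

def build_component_index(decomp):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for char, components in decomp.items():
        for component in components:
            index[component].add(char)
    return index

index = build_component_index(DECOMP)
print(sorted(index["氵"]))  # all characters with the water radical
```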

You Definitely Need Word Segmentation If:#

  • Building machine translation systems
  • Creating search engines for Chinese text
  • Performing sentiment analysis or text classification
  • Developing chatbots or question-answering systems
  • Any NLP pipeline processing Chinese documents

You Need BOTH If:#

  • Building comprehensive dictionary or reference tools
  • Creating advanced learning platforms that connect word meaning to character structure
  • Developing linguistic research tools
  • Building systems that analyze both grammatical structure (words) and semantic structure (character components)

You DON’T Need This If:#

  • Your text processing stays at sentence or document level (sentiment, topics, classification without word boundaries)
  • You’re working with speech (audio), not text
  • Your use case already has text pre-segmented with spaces

Trade-offs#

Character Decomposition: Library vs. Data#

Legacy Library (cjklib):

  • Comprehensive search capabilities built-in
  • Python 2 codebase requires maintenance workarounds
  • Rich functionality but uncertain long-term viability
  • Think: Powerful but aging machine shop

Open Data (makemeahanzi, CJKVI-IDS):

  • JSON/text files you parse yourself
  • Zero technical debt, future-proof
  • Requires building your own query layer
  • Think: Raw materials, you build the tools

Trade-off: Immediate functionality vs. long-term maintainability. This research recommends the data-first approach unless you need comprehensive character search immediately and can accept a 1-2 year migration timeline.

Word Segmentation: Multilingual vs. Specialized#

Multilingual Platforms (HanLP):

  • Handles 130 languages
  • Future-proof if you expand to other languages
  • Larger models, more memory/compute
  • Think: Swiss Army knife

Chinese-Optimized (LTP):

  • Specialized for Chinese only
  • Faster, smaller models
  • Less flexibility for other languages
  • Think: Specialist tool

Academic Standard (Stanza):

  • Universal Dependencies framework
  • Cross-lingual research reproducibility
  • Stanford backing
  • Think: Laboratory standard

Trade-off: Breadth vs. depth. Choose multilingual if you might expand; choose specialized if you’re Chinese-only and want maximum performance.

Build vs. Adopt#

Adopt existing tools:

  • Fast time-to-market (1-3 weeks)
  • Proven algorithms
  • Ongoing maintenance someone else’s problem
  • Limited to what tools provide

Build on open data:

  • Full control and customization
  • Modern, debt-free codebase
  • Initial investment (3-6 weeks)
  • You own maintenance

Trade-off: Most projects should adopt for word segmentation (mature tools), build custom for character analysis (data-first future-proofing).

Implementation Reality#

First 90 Days: Expect Two Separate Systems#

Don’t expect a unified “Chinese analysis platform.” You’ll integrate two independent components:

Weeks 1-2: Set up word segmentation (HanLP or Stanza)

  • pip install, download models
  • Test with sample text
  • Integrate into your pipeline
  • This part is straightforward—tools are mature

Weeks 2-3: Set up character decomposition

  • Download makemeahanzi JSON data
  • Parse into your preferred format (database, in-memory, etc.)
  • Build query API for your needs
  • This requires more custom work but avoids technical debt
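A sketch of the parsing step, assuming makemeahanzi's dictionary.txt layout of one JSON object per line with "character", "decomposition", and "radical" fields (verify the field names against the file you actually download):

```python
import json

# Abridged sample lines in the one-JSON-object-per-line style described above.
SAMPLE_LINES = [
    '{"character": "好", "decomposition": "⿰女子", "radical": "女"}',
    '{"character": "江", "decomposition": "⿰氵工", "radical": "氵"}',
]

def load_decompositions(lines):
    """Index entries by character for O(1) lookup."""
    return {entry["character"]: entry
            for entry in (json.loads(line) for line in lines)}

# In production: load_decompositions(open("dictionary.txt", encoding="utf-8"))
chars = load_decompositions(SAMPLE_LINES)
print(chars["好"]["decomposition"])  # ⿰女子
```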

Week 4: Connect the pieces

  • Design API layer that combines both
  • Handle edge cases (unknown characters, segmentation errors)
  • Performance optimization (caching, indexing)
  • Documentation and testing
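The combined layer can be sketched as below. The segmenter is a stand-in stub (in practice a HanLP, Stanza, or LTP call goes there), the decomposition table is a toy dict, and `lru_cache` stands in for the caching bullet:

```python
from functools import lru_cache

# Toy decomposition table standing in for parsed makemeahanzi/CJKVI-IDS data.
DECOMP = {"汉": "⿰氵又", "字": "⿱宀子"}

def segment(text):
    """Stub segmenter: replace with a real HanLP/Stanza/LTP call."""
    return ["汉字", "学习"] if text == "汉字学习" else [text]

@lru_cache(maxsize=4096)
def decompose(char):
    # Unknown characters are an expected edge case: return None, don't raise.
    return DECOMP.get(char)

def analyze(text):
    """Word boundaries plus per-character structure in one pass."""
    return [{"word": w, "chars": {c: decompose(c) for c in w}}
            for w in segment(text)]

print(analyze("汉字"))
```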

Team Requirements#

Essential:

  • Python proficiency (or your language of choice)
  • Basic NLP understanding (tokenization, POS tagging concepts)
  • JSON parsing and data structure design

Helpful but not required:

  • Chinese language knowledge (helpful for testing, not for implementation)
  • Unicode standards familiarity
  • Previous NLP library experience

Realistic effort: 1 senior developer can implement in 3-4 weeks. Ongoing maintenance: minimal (data doesn’t “break,” models update quarterly at most).

Common Pitfalls#

“Let’s build our own word segmenter”: Don’t. Use HanLP/Stanza/LTP. This is solved, mature technology. Focus your effort on problems unique to your domain.

“We’ll use cjklib because it has everything”: The Python 2 technical debt will cost you more over 2 years than building modern tools costs upfront. Exception: if you need comprehensive character search now and have a <1 year migration timeline.

“We need morpheme analysis of compound words”: Be specific about what this means. If you mean word segmentation, tools exist. If you mean decomposing 电脑 (computer) into 电 (electricity) + 脑 (brain) with semantic roles, no library does this automatically—you’ll need linguistic rules or accept manual annotation.

Performance Expectations#

  • Word segmentation: 100-1000 characters/second (CPU), 10x faster with GPU
  • Character decomposition: O(1) lookups (instant for 9,000 characters)
  • Memory: ~1-2GB for NLP models, <100MB for character data
  • Scalability: Both components scale horizontally (stateless processing)

Maintenance Reality#

Word segmentation: Update models every 3-6 months as platforms release improvements. Breaking changes rare (1-2 years between major versions).

Character decomposition: Data is stable. Unicode IDS standard evolves slowly. Expect updates yearly at most, mostly for coverage expansion.

Combined system: After initial 3-4 week build, expect <1 day/month maintenance. Most time spent on application-specific enhancements, not core functionality upkeep.


Bottom Line: Chinese text processing requires two independent capabilities: word boundaries and character structure. Modern approaches favor proven NLP platforms for words and data-first architectures for characters. Expect 3-4 weeks to production with minimal ongoing maintenance. Avoid legacy libraries with technical debt unless you have explicit migration plans.


S1: Rapid Discovery

S1-Rapid: Approach#

Evaluation Method#

Quick assessment of available libraries for Chinese morphological analysis and character decomposition based on:

  1. Active Maintenance - Is the project still maintained? Recent releases?
  2. Core Capabilities - Does it address character decomposition and/or morphological analysis?
  3. Ease of Access - Available via PyPI? Clear documentation?
  4. First Impressions - Documentation quality, community size, obvious strengths/weaknesses

Libraries Evaluated#

  1. HanLP - Multilingual NLP toolkit
  2. Stanza - Stanford NLP with Universal Dependencies support
  3. LTP - Language Technology Platform (HIT)
  4. cjklib - Han character library for CJKV languages

Time Box#

45-60 minutes per library for initial investigation. Focus on:

  • What it claims to do
  • How mature it appears
  • Whether it handles our specific needs (character decomposition, compound analysis)

Decision Criteria#

  • Can it decompose characters into components/radicals?
  • Can it analyze compound words?
  • Is it production-ready?
  • Python support quality

cjklib - Rapid Assessment#

Overview#

Han character library for CJKV languages (Chinese, Japanese, Korean, Vietnamese). Provides character-level routines including pronunciations, radicals, glyph components, stroke decomposition, and variant information.

Repository: GitHub - cburgmer/cjklib

Documentation: cjklib 0.3.2 docs

Character Decomposition Capabilities#

Strengths:

  • Explicit character decomposition using Ideographic Description Sequences (IDS)
  • Binary operators (e.g., ⿰ for left-right: 好 = ⿰女子)
  • Ternary operators (⿲, ⿳) for three-component decomposition
  • Radical analysis (Kangxi radicals)
  • Stroke information derivable from component data
  • Character variant mapping

Example Use Cases:

  • Character lookup by components
  • Font design pattern analysis
  • Character study and etymology
  • Stroke order/count deduction

Morphological Analysis (Word Level)#

None - cjklib operates at character level, not word level. No word segmentation or compound word analysis.

Maturity#

Moderate concerns:

  • Last documented version: 0.3.2
  • Documentation references Python 2.x
  • No 2026 updates found in search
  • Available on PyPI: cjklib package

Maintenance status unclear - GitHub repository exists but update frequency unknown from initial search.

Quick Verdict#

Good for: Character decomposition, radical analysis, IDS manipulation, character structure study

Limitations: Python 2 legacy, maintenance unclear, no word-level morphology

cjklib is the only library reviewed that explicitly handles character decomposition. It directly addresses our character structure needs but may require Python 3 compatibility verification.

Complementary Option#

makemeahanzi - Free, open-source Chinese character data with decomposition information (mentioned in search results as related project).




HanLP - Rapid Assessment#

Overview#

Multilingual NLP toolkit supporting 130 languages with 10 joint tasks including tokenization, POS tagging, dependency parsing, and semantic role labeling. Built on PyTorch and TensorFlow 2.x.

Repository: GitHub - hankcs/HanLP

Installation: Available via PyPI

Morphological Analysis Capabilities#

Strengths:

  • Comprehensive Chinese NLP pipeline (word segmentation, POS tagging, NER, parsing)
  • Active development and maintained
  • Research shows morpheme and character features address unknown word problems (Springer Study)
  • Integration with Haystack framework (Haystack Integration)

Limitations:

  • Character decomposition not a primary feature
  • Focused on higher-level NLP tasks rather than character-level analysis
  • More oriented toward tokenization and parsing than radical/component analysis

Character Decomposition#

  • No explicit character decomposition API visible in initial review
  • Character features used internally for morphological analysis
  • Not designed as a character structure analysis tool

Maturity#

High - Active project with comprehensive documentation, regular releases, and production use

Quick Verdict#

Good for: Word segmentation, morphological tagging, NLP pipelines

Not ideal for: Character decomposition into radicals/components

HanLP excels at document-level morphological analysis but doesn’t directly expose character decomposition functionality needed for studying character structure.




LTP - Rapid Assessment#

Overview#

Language Technology Platform developed by HIT (Harbin Institute of Technology). Integrated Chinese NLP platform with lexical analysis, syntactic parsing, and semantic parsing.

Repository: GitHub - HIT-SCIR/ltp

Recent Version: N-LTP (Neural LTP) - multi-task framework with shared pretrained models

Morphological Analysis Capabilities#

Strengths:

  • Six core Chinese NLP tasks: word segmentation, POS tagging, NER, dependency parsing, semantic dependency, semantic role labeling
  • N-LTP uses multi-task learning with shared pretrained models
  • Cloud service available: LTP-Cloud
  • Specifically designed for Chinese (not multilingual adaptation)

Terminology Note: LTP uses “lexical analysis” rather than “morphological analysis”. This includes:

  • Chinese word segmentation
  • Part-of-speech tagging
  • Named entity recognition

Limitations:

  • Focuses on word-level lexical analysis, not character-level decomposition
  • No explicit character decomposition features documented
  • Designed for document processing pipeline, not character structure study

Character Decomposition#

None identified - LTP operates at word/token level, not character component level.

Maturity#

High - Established platform with academic backing, production cloud service, and neural architecture updates (N-LTP).

Quick Verdict#

Good for: Chinese word segmentation, lexical analysis, NLP pipelines

Not suitable for: Character decomposition, radical analysis

LTP is a comprehensive Chinese NLP platform but operates at word level. Like HanLP, it’s designed for document processing rather than character structure analysis.




S1-Rapid: Recommendation#

Clear Winner for Character Decomposition#

cjklib is the only library reviewed that explicitly handles character decomposition into components and radicals.

The Landscape Split#

The evaluation revealed a fundamental divide:

Word-Level Tools (HanLP, Stanza, LTP):

  • Focus on tokenization, POS tagging, parsing
  • Operate on words/documents, not character structure
  • Production-ready, actively maintained
  • Not designed for character decomposition

Character-Level Tool (cjklib):

  • Explicitly handles IDS (Ideographic Description Sequences)
  • Provides radical analysis, component decomposition
  • Only library addressing our core need
  • Maintenance status unclear (Python 2 references)

Quick Recommendation#

For character decomposition: Use cjklib - it’s the only option that directly addresses the requirement.

Caveat: Verify Python 3 compatibility and check GitHub for maintenance status.

Alternative: Consider makemeahanzi as data source if cjklib proves unmaintained.

For Compound Word Analysis#

None of these libraries clearly excel at Chinese compound word morphological analysis. This requires deeper investigation in S2-comprehensive pass:

  • How do compounds differ from simple word segmentation?
  • Is this a linguistic analysis task or a lexical lookup problem?
  • May require combining tools or custom logic

Next Steps for S2#

  1. Deep dive into cjklib: Python 3 support, API depth, data quality
  2. Investigate compound word analysis: What’s actually needed?
  3. Explore makemeahanzi and other character decomposition data sources
  4. Test actual code examples with real Chinese text

Stanza - Rapid Assessment#

Overview#

Stanford NLP toolkit providing pretrained models for 80 languages based on Universal Dependencies (UD) treebanks. Supports tokenization, lemmatization, POS tagging, morphological features, and dependency parsing.

Models: Available Models & Languages

Paper: Stanza: A Python NLP Toolkit

Morphological Analysis Capabilities#

Strengths:

  • Universal Dependencies framework provides consistent morphological annotation
  • Pretrained models trained on UD v2.12
  • Strong academic backing (Stanford NLP)
  • Standardized approach across languages

Limitations:

  • Chinese has “weak morphology” - lacks formal devices like tense/number markers
  • UD tagging focuses on grammatical features, not character structure
  • Research notes “very little research devoted to Chinese word segmentation based on morphemes” (Computational Linguistics)

Character Decomposition#

None - Stanza focuses on morphological features (grammatical properties) not character decomposition. The UD framework doesn’t model character-level structure.

Maturity#

High - Stable, well-documented, production-ready. Part of Stanford’s NLP infrastructure.

Quick Verdict#

Good for: UD-style morphological tagging, cross-lingual NLP, grammatical analysis

Not suitable for: Character decomposition, radical analysis

Stanza provides morphological features in the UD sense (grammatical properties), not character structure analysis. Chinese morphology in UD focuses on word-level features, not sub-character components.



S2: Comprehensive

S2-Comprehensive: Approach#

Evaluation Method#

Deep dive into each library’s actual capabilities, focusing on:

  1. Python 3 Compatibility - Can it run in modern Python environments?
  2. API Depth - What functionality is actually exposed?
  3. Data Sources - Where does the decomposition/morphological data come from?
  4. Compound Word Analysis - What does this actually mean in Chinese context?
  5. Installation & Usage - Is it production-ready or requires significant setup?

Key Research Questions#

Character Decomposition#

  • What IDS databases are available?
  • How complete is the coverage?
  • Can we access stroke information?
  • Etymology data availability?

Compound Word Analysis#

  • Is this word segmentation or morphological decomposition?
  • Do tools analyze internal structure of compounds?
  • Or just identify word boundaries?

Practical Concerns#

  • Library maintenance status
  • Documentation quality
  • Community support
  • Integration complexity

Extended Library Set#

Beyond the original four, also investigating:

  • makemeahanzi - Character data source
  • Jieba, pkuseg - Word segmentation tools
  • CJKVI-IDS - IDS database
  • spaCy Chinese models - NLP pipeline

Feature Matrix#

Will create detailed comparison across:

  • Character decomposition (IDS support)
  • Radical extraction
  • Stroke information
  • Component analysis
  • Word segmentation
  • Compound word analysis
  • Python 3 support
  • Maintenance status
  • Data quality

cjklib - Comprehensive Assessment#

Python 3 Compatibility Status#

Original cjklib: Python 2 only. No Python 3 support in the main cburgmer/cjklib repository.

cjklib3 Fork: Python 3.7+ compatible fork available at free-utils-python/cjklib3. Requires 2to3 conversion during installation:

```shell
# Create an isolated environment (cjklib3 targets Python 3.7+)
conda create -n py37 python=3.7
conda activate py37
git clone https://github.com/free-utils-python/cjklib3
cd cjklib3
pip install 2to3
2to3 -w .   # rewrite remaining Python 2 syntax in place
# Then build the database and install dictionaries per the fork's instructions
```

Verdict: Python 3 is possible but requires fork and manual setup. Not pip-installable for Python 3.

Character Decomposition Capabilities#

IDS Support#

  • Full IDS (Ideographic Description Sequences) implementation
  • Stores decompositions using Unicode IDS operators (⿰, ⿱, ⿲, etc.)
  • Example: 好 = ⿰女子 (left-right: woman + child)

API Features#

From characterlookup module:

  • getDecompositionEntries() - Get all decomposition trees
  • getRadicalForms() - Get radical variants
  • getStrokeCount() - Character stroke counts
  • getCharacterVariants() - Traditional/simplified variants

Data Quality#

  • Comprehensive coverage of CJK Unified Ideographs
  • Multiple decomposition paths for characters with variant structures
  • Kangxi radical mappings
  • Component-to-stroke mappings for analysis

Limitations#

Maintenance Concerns#

  • Original project: Last PyPI release unclear
  • Documentation references Python 2
  • Open issue #11 requesting Python 3 support from 2017
  • Snyk analysis shows no recent releases

Installation Complexity#

  • Not simple pip install for Python 3
  • Requires building database from source
  • Dictionary installation separate step
  • Fork-based solution not ideal for production

Alternative: Data Sources Only#

Instead of using cjklib as a library, consider using its data sources:

  • CJKVI-IDS database: cjkvi/cjkvi-ids
  • Parse IDS strings directly
  • Simpler integration, modern Python code
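Parsing is straightforward because IDS strings are prefix expressions: each Ideographic Description Character (U+2FF0..U+2FFB) takes a fixed number of operands, three for ⿲ and ⿳ and two for the rest. A minimal recursive parser (a sketch, not cjklib's implementation):

```python
# Ideographic Description Characters: ⿲ and ⿳ take three operands,
# the remaining ten take two.
TERNARY = set("⿲⿳")
BINARY = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")

def parse_ids(s):
    """Parse an IDS string into a nested (operator, operands...) tuple."""
    def parse(i):
        ch = s[i]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            node, j = [ch], i + 1
            for _ in range(arity):
                operand, j = parse(j)
                node.append(operand)
            return tuple(node), j
        return ch, i + 1  # leaf: an atomic component

    tree, _ = parse(0)
    return tree

print(parse_ids("⿰女子"))      # ('⿰', '女', '子')
print(parse_ids("⿱宀⿰女子"))  # nested: top-bottom, then left-right
```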

Compound Word Analysis#

None - cjklib operates at character level only. No word segmentation or compound analysis features.

Production Readiness#

Moderate Risk:

  • ✅ Proven character decomposition algorithms
  • ✅ Comprehensive data coverage
  • ❌ Python 2 legacy code
  • ❌ Requires fork for Python 3
  • ❌ Setup complexity
  • ❌ Unclear maintenance status

Recommendation: Use cjklib3 fork for prototyping, but investigate migrating to IDS database parsing with modern Python for production.




Feature Comparison Matrix#

Character-Level Analysis#

| Feature | cjklib | cjklib3 | makemeahanzi | HanLP | Stanza | LTP |
| --- | --- | --- | --- | --- | --- | --- |
| Character Decomposition (IDS) | ✅ Full | ✅ Full | ✅ Full | ❌ | ❌ | ❌ |
| Radical Extraction | ✅ Kangxi | ✅ Kangxi | ✅ | ❌ | ❌ | ❌ |
| Stroke Information | ✅ Derived | ✅ Derived | ✅ SVG | ❌ | ❌ | ❌ |
| Etymology Data | ⚠️ Limited | ⚠️ Limited | ✅ Rich | ❌ | ❌ | ❌ |
| Component Trees | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |

Word-Level Analysis#

| Feature | cjklib | HanLP | Stanza | LTP | Jieba | pkuseg |
| --- | --- | --- | --- | --- | --- | --- |
| Word Segmentation | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| POS Tagging | ❌ | ✅ | ✅ | ✅ | ⚠️ Basic | ✅ |
| Morphological Features (UD) | ❌ | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| Morpheme Decomposition | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Compound Analysis | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

Technical Characteristics#

| Aspect | cjklib | cjklib3 | makemeahanzi | HanLP | Stanza | LTP |
| --- | --- | --- | --- | --- | --- | --- |
| Python 3 Support | ❌ | ✅ | N/A (data) | ✅ | ✅ | ✅ |
| pip Installable | ⚠️ Py2 | ❌ Fork | N/A | ✅ | ✅ | ✅ |
| Active Maintenance | ❌ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| Documentation Quality | ⚠️ Py2-era | ⚠️ Py2-era | ✅ | ✅ Good | ✅ Good | ✅ |
| Setup Complexity | Medium | High | Low | Low | Low | Low |

Data Sources#

| Source | Format | Coverage | Character Decomposition | Etymology | Strokes |
| --- | --- | --- | --- | --- | --- |
| CJKVI-IDS | IDS files | CJK Unified | ✅ | ❌ | ❌ |
| makemeahanzi | JSON | 9,000+ chars | ✅ | ✅ Rich | ✅ SVG |
| cjklib DB | SQLite | Comprehensive | ✅ | ⚠️ Limited | ✅ Derived |
| Unicode IDS | Standard | All CJK | ✅ | ❌ | ❌ |

Alternative Approaches#

Character Decomposition Options#

  1. Use cjklib3 fork (library approach)

    • Pro: Complete API, rich functionality
    • Con: Setup complexity, Python 3 via fork
  2. Use makemeahanzi data (data-first approach)

    • Pro: JSON format, modern, rich etymology
    • Con: Need to build parser, 9K characters (not all CJK)
  3. Parse CJKVI-IDS directly (minimal approach)

    • Pro: Complete CJK coverage, simple format
    • Con: Need IDS parser, no etymology, no stroke data
  4. Hybrid: makemeahanzi + CJKVI-IDS

    • Pro: Best of both worlds
    • Con: Integration complexity

Word Segmentation Options#

| Tool | Strength | Use Case |
| --- | --- | --- |
| HanLP | Multilingual pipeline | Production, comprehensive NLP |
| Stanza | UD framework, Stanford | Research, cross-lingual |
| LTP | Chinese-specific | Chinese-only production |
| Jieba | Lightweight, popular | Simple segmentation |
| pkuseg | Domain-specific models | Specialized domains |

Key Findings#

No Morpheme Decomposition Tools Found#

Significant gap: None of the libraries analyzed actually decompose Chinese compound words into morphemes.

Example of what’s missing:

  • Input: “电脑” (computer)
  • What tools do: Identify as single word [电脑/NOUN]
  • What’s missing: Decompose to “电” (electricity) + “脑” (brain)

Why this matters:

  • Might need custom implementation
  • Or: redefine requirements to focus on word segmentation + character decomposition
  • Morpheme decomposition may require linguistic rules, not library

Two Separate Problem Domains#

Character Structure:

  • Tools: cjklib, makemeahanzi, IDS databases
  • Well-supported with existing libraries/data
  • Mature solutions available

Word/Morpheme Analysis:

  • Tools: HanLP, Stanza, LTP, Jieba, pkuseg
  • All do word segmentation
  • None do morpheme decomposition of compounds
  • May need custom solution or clarified requirements

Sources: See individual library assessments in this directory.


HanLP - Comprehensive Assessment#

Character-Level Capabilities#

Internal Character Features#

HanLP uses character features internally for morphological analysis. Research (Springer 2012) shows character features help with unknown words, but this is not exposed as a public API for character decomposition.

Not a Character Decomposition Tool#

  • No IDS extraction methods
  • No radical decomposition API
  • Character features used internally for ML models, not user-facing
  • Focus on word-level and sentence-level analysis

Word Segmentation & Morphological Tagging#

Core Strengths#

From HanLP documentation:

  • Word segmentation (分词)
  • Part-of-speech tagging (词性标注)
  • Named entity recognition (命名实体识别)
  • Dependency parsing (依存句法分析)
  • Semantic role labeling (语义角色标注)

Multi-Task Architecture#

  • Supports 130 languages
  • Joint training across tasks
  • PyTorch and TensorFlow 2.x backends
  • RESTful API available

Compound Word Analysis#

Word-level segmentation, not morphological decomposition:

  • Identifies word boundaries in text
  • Tags grammatical roles
  • Does NOT analyze internal morpheme structure
  • Example: Segments “中国人” (Chinese person) as one word, doesn’t decompose into “中国” (China) + “人” (person)

Research Context#

Chinese word segmentation research (Computational Linguistics) notes:

  • Chinese has “weak morphology”
  • Limited formal devices (no tense/number markers like English)
  • Word boundaries often ambiguous
  • Segmentation ≠ morphological analysis

Production Readiness#

High:

  • ✅ Active development
  • ✅ Python 3 support
  • ✅ pip installable: pip install hanlp
  • ✅ Comprehensive documentation
  • ✅ Large community
  • ✅ Production deployments
  • ✅ RESTful API option

Integration with Morphological Analysis#

Complementary, not overlapping:

  • HanLP provides word segmentation
  • Separate tool needed for character decomposition
  • Could use HanLP → segment text into words → then analyze character structure of words

Use Case Example#

  1. Input: “我喜欢汉字学习”
  2. HanLP segments: [“我”, “喜欢”, “汉字”, “学习”]
  3. Character decomposition tool analyzes: “汉” = ⿰氵又, “字” = ⿱宀子

Verdict for Our Needs#

Word-level processing: Yes
Character decomposition: No

HanLP is an excellent NLP pipeline for Chinese text processing at word/sentence level, but does not provide character structure analysis. For morphological analysis project, HanLP handles the “compound word” aspect (if interpreted as word segmentation) but not character decomposition.




LTP - Comprehensive Assessment#

Language Technology Platform Overview#

Core Architecture#

From LTP documentation:

  • Integrated Chinese NLP platform
  • Developed by HIT (Harbin Institute of Technology)
  • Research Center for Social Computing and Information Retrieval

Neural Architecture: N-LTP#

ArXiv paper describes modern version:

  • Multi-task learning framework
  • Shared pretrained models
  • Unified approach across tasks
  • More efficient than independent models

Six Core Tasks#

Lexical Analysis:

  1. Chinese word segmentation (分词)
  2. Part-of-speech tagging (词性标注)
  3. Named entity recognition (命名实体识别)

Syntactic Parsing:

  4. Dependency parsing (依存句法分析)

Semantic Parsing:

  5. Semantic dependency parsing (语义依存分析)
  6. Semantic role labeling (语义角色标注)

Terminology: “Lexical” not “Morphological”#

Important distinction:

  • LTP documentation uses “lexical analysis” (词法分析)
  • NOT “morphological analysis” in linguistic sense
  • Refers to word-level token analysis
  • Includes segmentation, POS, NER

Why “Lexical”?#

  • Chinese word segmentation is primary challenge
  • Identifying word boundaries from character stream
  • POS tagging and NER follow segmentation
  • Focus on lexicon (words) not morphemes (sub-word units)

Character Decomposition#

None - LTP operates at word level. No character structure analysis, IDS support, or radical extraction.

Compound Word Analysis#

Word segmentation, not morpheme decomposition:

  • Segments text into words
  • Tags each word’s part of speech
  • Identifies named entities
  • Does NOT decompose compounds into constituent morphemes

Example Processing#

Input: “中国人民大学”

  • LTP might segment as: [“中国”, “人民”, “大学”] or [“中国人民大学”]
  • Tags: [NOUN, NOUN, NOUN] or [ORG]
  • Does NOT decompose: “人民” = “人” (person) + “民” (people)

Production Readiness#

High:

  • ✅ Python 3 support: GitHub
  • ✅ pip installable
  • ✅ Cloud service: LTP-Cloud
  • ✅ Academic backing (HIT)
  • ✅ N-LTP modernization (2020+)
  • ✅ Active Chinese NLP community

Comparison with HanLP#

Similar scope, different implementations:

  • Both provide Chinese NLP pipelines
  • Both include segmentation, POS, NER, parsing
  • LTP: Chinese-specific, HIT research
  • HanLP: Multilingual, broader language support
  • Both operate at word level, not character level

Integration with Character Analysis#

Preprocessing pipeline:

  1. LTP segments and tags text
  2. Provides linguistic context
  3. Separate tool needed for character decomposition
  4. LTP output useful for identifying important words to analyze

Workflow#

  1. LTP segments: “学习汉字需要耐心”
  2. Output: [“学习”/VERB, “汉字”/NOUN, “需要”/VERB, “耐心”/NOUN]
  3. Focus on nouns/verbs for character analysis
  4. Use cjklib/makemeahanzi for “汉字” decomposition

Verdict for Our Needs#

Word segmentation: Yes (excellent)
Lexical tagging: Yes (comprehensive)
Character decomposition: No
Morpheme analysis: No

LTP is a production-ready Chinese NLP platform for word-level processing. Like HanLP and Stanza, it doesn’t provide character structure analysis. The term “morphological analysis” in our requirements doesn’t align with what LTP offers unless we interpret it as “word segmentation.”

Clarification Needed#

The project requirements mention:

  • Character decomposition ✓ (clearly sub-character components)
  • Compound word analysis ? (ambiguous)

If “compound word analysis” means:

  • Identifying compounds from simple words: LTP does this (word segmentation)
  • Decomposing compounds into morphemes: LTP does NOT do this
  • Analyzing character structure: LTP does NOT do this



S2-Comprehensive: Recommendation#

Two-Tier Approach#

Based on comprehensive analysis, recommend splitting the solution:

Tier 1: Character Decomposition#

Tool: makemeahanzi + CJKVI-IDS hybrid

Rationale:

  • makemeahanzi provides rich data for common characters (9K+)
    • IDS decomposition
    • Etymology information
    • Stroke order SVG data
    • JSON format (modern, easy to parse)
  • CJKVI-IDS fills gaps for comprehensive CJK coverage
  • Both are data sources, not libraries - gives full control
  • No Python 2 legacy issues
  • Active maintenance

Implementation:

  1. Primary: Parse makemeahanzi JSON for covered characters
  2. Fallback: Parse CJKVI-IDS for remaining CJK characters
  3. Build custom API layer for application needs
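A sketch of that two-source lookup, assuming makemeahanzi's dictionary.txt format (one JSON object per line with `character` and `decomposition` fields) and CJKVI-IDS's tab-separated lines (`U+XXXX`, character, IDS); the data passed in below is inlined toy content rather than the real files:

```python
import json

def load_makemeahanzi(lines):
    """Parse makemeahanzi dictionary.txt content: one JSON object per line."""
    table = {}
    for line in lines:
        entry = json.loads(line)
        table[entry["character"]] = entry.get("decomposition")
    return table

def load_cjkvi_ids(lines):
    """Parse CJKVI-IDS content: 'U+XXXX<TAB>char<TAB>IDS' per line."""
    table = {}
    for line in lines:
        if line.startswith((";;", "#")):
            continue  # skip comment lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            table[parts[1]] = parts[2]
    return table

def decompose(char, primary, fallback):
    """Steps 1-2: try makemeahanzi first, then fall back to CJKVI-IDS."""
    return primary.get(char) or fallback.get(char)

# Toy data in each format (real data comes from the project files):
primary = load_makemeahanzi(['{"character": "好", "decomposition": "⿰女子"}'])
fallback = load_cjkvi_ids(["U+6C49\t汉\t⿰氵又"])
```

Step 3's custom API layer then wraps `decompose` with whatever query patterns the application needs.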

Tier 2: Word Segmentation#

Tool: HanLP

Rationale:

  • Production-ready Python 3 library
  • Comprehensive NLP pipeline
  • Active development
  • pip installable
  • Handles word boundary identification

Caveat: HanLP provides segmentation, not morpheme decomposition.

The Morpheme Decomposition Gap#

Critical finding: No library found that decomposes Chinese compounds into morphemes.

Example of gap:

  • Character decomposition: 电 = IDS ⿻日乚 (components)
  • Word segmentation: “我用电脑” → [“我”, “用”, “电脑”]
  • Missing: “电脑” → “电” + “脑” (morpheme analysis)
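A toy sketch makes the gap concrete: splitting a segmented compound into characters and attaching glosses is mechanical, but deciding whether the split is semantically meaningful is exactly the part no library provides (the `GLOSS` table below is illustrative, not a real resource):

```python
# Hypothetical sketch of the missing step: after segmentation yields
# "电脑" as one token, split it into characters and attach glosses.
# GLOSS is a toy table; real morpheme analysis would also need rules
# to decide when such a split is linguistically meaningful.
GLOSS = {"电": "electricity", "脑": "brain"}

def naive_morphemes(word):
    """Character-level split with glosses — NOT true morpheme analysis."""
    return [(char, GLOSS.get(char, "?")) for char in word]

pairs = naive_morphemes("电脑")
```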

Options:

  1. Redefine requirements: Focus on character decomposition + word segmentation
  2. Custom implementation: Build morpheme analyzer using:
    • Word segmentation from HanLP
    • Character decomposition from makemeahanzi
    • Linguistic rules for identifying productive compounds
  3. Accept limitation: Most compounds are lexicalized (treated as single words)

Recommended Architecture#

Application Layer
    ↓
Word Segmentation: HanLP
    ↓
Character Decomposition: makemeahanzi + CJKVI-IDS
    ↓
Data Layer: JSON + IDS text files

Why Not cjklib?#

cjklib is functionally superior for character analysis, but:

  • Python 2 codebase
  • Python 3 requires fork + manual setup
  • Not pip installable for Python 3
  • Uncertain maintenance

Alternative: Extract cjklib’s algorithms and port to modern Python, using makemeahanzi/CJKVI-IDS as data sources. This gets the best of both:

  • Proven algorithms (cjklib)
  • Modern data formats (makemeahanzi)
  • Clean Python 3 code
  • Full control over implementation

Production Deployment#

Low-risk path:

  1. Start with makemeahanzi data (9K characters covers most use cases)
  2. Parse JSON in Python 3
  3. Build simple API: decompose(char) → components
  4. Add CJKVI-IDS for complete coverage as needed
  5. Use HanLP for word segmentation
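Step 3's `decompose(char) → components` API can be as small as this sketch, where `DECOMP` is a toy stand-in for the parsed makemeahanzi data and the operator set is the Unicode Ideographic Description Characters block:

```python
# Toy stand-in for parsed makemeahanzi data (steps 1-2).
DECOMP = {"好": "⿰女子", "江": "⿰氵工"}

# Unicode Ideographic Description Characters (U+2FF0–U+2FFB).
IDS_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

def decompose(char):
    """Step 3: return top-level components, dropping IDS operators."""
    ids = DECOMP.get(char)
    if ids is None:
        return []
    return [c for c in ids if c not in IDS_OPERATORS]
```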

Higher-effort, more capable path:

  1. Port cjklib algorithms to Python 3
  2. Use makemeahanzi + CJKVI-IDS as data
  3. Build comprehensive character analysis library
  4. Integrate with HanLP for word processing

Next Steps for S3: Need-Driven#

Must clarify actual use cases:

  • What is “compound word analysis” supposed to do?
  • Are we analyzing character structure or word structure?
  • What’s the end goal: teaching, search, analysis?

Different use cases may lead to different recommendations.




Stanza - Comprehensive Assessment#

Universal Dependencies Framework#

What Stanza Provides#

From Stanza documentation:

  • Tokenization & sentence segmentation
  • Lemmatization
  • POS tagging
  • Morphological features (UD framework)
  • Dependency parsing

Chinese Models#

  • Trained on Universal Dependencies v2.12
  • Models available for simplified and traditional Chinese
  • Performance metrics show 94%+ token accuracy

Morphological Features in UD#

What “Morphological” Means in UD Context#

UD morphological features capture grammatical properties:

  • Tense, aspect, mood (for languages that have them)
  • Number, gender, case
  • Transitivity, politeness markers

Chinese in UD Framework#

Research findings (Computational Linguistics):

  • Chinese is an analytic language
  • Lacks formal morphological devices (no tense inflections, number markers)
  • “Weak morphology” compared to agglutinative or fusional languages
  • UD features limited to aspect markers, classifiers, etc.

Example UD tags for Chinese:

  • Aspect markers: 了 (le), 着 (zhe), 过 (guo)
  • Classifiers: 个, 只, 本
  • NOT character decomposition or internal word structure

Character Decomposition#

None - Stanza operates at token/word level, not character component level. UD framework doesn’t model sub-character structure.

Compound Word Analysis#

Word-level tokenization only:

  • Segments text into tokens
  • Assigns grammatical tags
  • Does NOT analyze morpheme composition of compounds
  • Example: “电脑” (computer) = single token “电脑”, not analyzed as “电” (electricity) + “脑” (brain)

Research Context#

ACL 2020 paper notes:

  • Stanza focuses on cross-lingual consistency
  • Same annotation framework for all languages
  • Chinese processing adapted to UD conventions
  • Morphological analysis limited to what UD framework supports

Production Readiness#

High:

  • ✅ Python 3 support
  • ✅ pip installable: pip install stanza
  • ✅ Stanford NLP backing
  • ✅ Excellent documentation
  • ✅ 80 languages supported
  • ✅ Regular updates with new UD versions

Integration with Character Analysis#

Preprocessing role only:

  • Stanza segments text into tokens
  • Provides grammatical structure
  • Separate tool needed for character decomposition
  • UD parse trees useful for understanding word relationships

Workflow Example#

  1. Input: “学习汉字很有趣”
  2. Stanza tokenizes: [“学习”, “汉字”, “很”, “有趣”]
  3. POS tags: [VERB, NOUN, ADV, ADJ]
  4. Dependency parse: shows “学习” as root, “汉字” as object
  5. Separate tool decomposes characters: “汉” = ⿰氵又

Verdict for Our Needs#

  • Grammatical morphology (UD sense): Yes
  • Character decomposition: No
  • Compound analysis (morphemes): No

Stanza provides morphological tagging in the UD sense (grammatical features), not character decomposition or morpheme analysis. Useful for NLP pipelines but doesn’t address core requirement of analyzing character structure.

Clarification Needed#

The term “morphological analysis” is ambiguous:

  • UD morphology: Grammatical features (what Stanza does)
  • Character morphology: Component structure (what we need)
  • Word morphology: Compound word decomposition (unclear if needed)

Stanza addresses the first definition only.



S3: Need-Driven

S3-Need-Driven: Approach#

Methodology#

Evaluate libraries based on concrete use cases rather than abstract features. For each use case:

  1. Define the goal - What does the user want to accomplish?
  2. Map to requirements - What technical capabilities are needed?
  3. Test fit - Which libraries can deliver this?
  4. Identify gaps - What’s missing?

Use Case Categories#

Based on common needs for Chinese morphological analysis:

1. Educational/Learning#

  • Character etymology and learning aids
  • Radical-based character lookup
  • Understanding character construction
  • Vocabulary building through component analysis

2. Search & Information Retrieval#

  • Component-based search
  • Fuzzy matching by structural similarity
  • Radical-based indexing
  • Character variant handling

3. Text Analysis & Processing#

  • Word segmentation for NLP pipelines
  • POS tagging and parsing
  • Named entity recognition
  • Document processing

4. Linguistic Research#

  • Character structure analysis
  • Etymology research
  • Historical character evolution
  • Compound word formation patterns

5. Font & Typography#

  • Glyph component analysis
  • Stroke order validation
  • Font design pattern extraction
  • Character rendering optimization

Evaluation Criteria Per Use Case#

  • Capability match: Does the tool provide needed functionality?
  • Data quality: Is the underlying data accurate and comprehensive?
  • Ease of use: How much custom code is needed?
  • Performance: Can it handle expected data volumes?
  • Maintenance: Is it sustainable long-term?

Use Cases Analyzed#

Will create separate markdown files for:

  1. use-case-educational.md - Learning and teaching tools
  2. use-case-search.md - Search and retrieval systems
  3. use-case-nlp-pipeline.md - Text processing pipelines
  4. use-case-linguistic-research.md - Academic research
  5. use-case-etymology.md - Character origin and evolution study

Each file will map the use case to library recommendations with specific implementation guidance.


S3-Need-Driven: Recommendation#

Use Case → Library Mapping#

Based on concrete use case analysis:

| Use Case | Primary Tool | Secondary Tool | Rationale |
| --- | --- | --- | --- |
| Educational/Learning | makemeahanzi | CJKVI-IDS | Rich etymology + mnemonics |
| Search/Retrieval | cjklib/cjklib3 | CJKVI-IDS | Comprehensive search APIs |
| NLP Pipeline | HanLP/Stanza/LTP | makemeahanzi | Production NLP + optional char features |
| Etymology Research | makemeahanzi | Sears DB scrape | Etymology classification + historical forms |

Key Insight: No One-Size-Fits-All#

Different needs require different tools:

Character Structure Analysis#

→ cjklib (most comprehensive) or makemeahanzi (modern, rich data)

Word Processing#

→ HanLP/Stanza/LTP (depending on multilingual vs. Chinese-only needs)

Etymology & Pedagogy#

→ makemeahanzi (only tool with explicit etymology data)

The Morpheme Decomposition Gap (Revisited)#

Critical finding from use case analysis:

None of the analyzed use cases actually require morpheme decomposition of compound words in the linguistic sense.

What users actually need:

  1. Character decomposition → cjklib/makemeahanzi/CJKVI-IDS ✅
  2. Word segmentation → HanLP/Stanza/LTP ✅
  3. Etymology understanding → makemeahanzi ✅
  4. Linguistic morpheme analysis → No library exists ❌

Example of missing capability:

  • Input: “电脑” (computer = electricity + brain)
  • What we have: Word segmentation identifies “电脑” as one word
  • What’s missing: Automatic morpheme decomposition to “电” + “脑” with semantic roles
  • Reality: This requires linguistic analysis, not just library lookup

Implication: If true morpheme decomposition is needed, it requires:

  • Custom implementation
  • Linguistic rules database
  • Or: Manual annotation

Alternative interpretation: If “compound word analysis” means “word segmentation,” then HanLP/Stanza/LTP provide this.

Decision Framework#

Question 1: What level are you analyzing?#

Characters (sub-word structure)? → Use makemeahanzi or cjklib

Words (text segmentation)? → Use HanLP/Stanza/LTP

Both? → Use NLP tool for words + character tool for components

Question 2: What’s your domain?#

Education/Learning? → makemeahanzi (best etymology)

Production NLP? → HanLP/Stanza/LTP (best pipelines)

Linguistic Research? → makemeahanzi + cjklib + Sears DB

Search/IR? → cjklib (best search APIs)

Question 3: What’s your Python environment?#

Python 3 only? → makemeahanzi (data) or HanLP/Stanza/LTP (NLP)

Can handle Python 2/3 split? → cjklib via fork or subprocess

Want minimal dependencies? → CJKVI-IDS + custom parser

Stack A: Modern Python, Educational Focus#

makemeahanzi (character data)
+ HanLP (word processing)
+ Custom integration layer

Pros: All Python 3, rich data, production-ready
Cons: Need to build integration

Stack B: Research-Grade, Comprehensive#

cjklib3 fork (character analysis)
+ Stanza (UD-compliant word processing)
+ Sears DB scrape (historical forms)

Pros: Most comprehensive, research-quality
Cons: Setup complexity, Python 3 fork

Stack C: Data-First, Maximum Control#

CJKVI-IDS + makemeahanzi (raw data)
+ pkuseg/Jieba (lightweight segmentation)
+ Custom Python 3 parser

Pros: Full control, modern Python, no legacy code
Cons: Need to build everything, significant effort

Stack D: Production NLP, Minimal Character Analysis#

HanLP or LTP (complete NLP pipeline)
+ CJKVI-IDS (fallback for char decomposition)

Pros: Production-ready, minimal complexity
Cons: Limited character analysis depth

Final Recommendation#

For most projects: Stack A (makemeahanzi + HanLP)

Rationale:

  1. 90% of character analysis needs covered by makemeahanzi (9K chars)
  2. HanLP provides production NLP pipeline
  3. All Python 3, pip installable
  4. Easy integration
  5. Can enhance later if needed

When to choose alternatives:

  • Stack B: Academic research requiring comprehensive character coverage
  • Stack C: Maximum customization, willing to invest development time
  • Stack D: NLP-focused, character analysis is secondary

Implementation Priority#

Phase 1: Proof of Concept (1-2 days)#

  • Download makemeahanzi JSON
  • Parse into Python dict/SQLite
  • Test character decomposition queries
  • Install HanLP, test word segmentation

Phase 2: Integration (3-5 days)#

  • Build unified API layer
  • Combine word segmentation + character analysis
  • Create sample use cases
  • Performance testing

Phase 3: Enhancement (1-2 weeks)#

  • Add CJKVI-IDS for extended coverage
  • Optimize performance (caching, indexing)
  • Build higher-level functions (search, etymology lookup)
  • Documentation

Phase 4: Production (ongoing)#

  • Deploy as service or library
  • Monitor usage patterns
  • Refine based on real needs
  • Continuous data quality improvements

Bottom line: The right tool depends on whether you’re analyzing characters (structure) or words (segmentation). Most projects need both, so a hybrid approach (makemeahanzi + HanLP) provides the best balance of capability and maintainability.


Use Case: Educational & Learning Tools#

Goal#

Help learners understand Chinese characters through component analysis and etymology, making character acquisition more systematic and memorable.

User Stories#

  1. Student learning character 認 (recognize):

    • Wants to know: What are the components?
    • System shows: ⿰言忍 (speech + endure)
    • Etymology: pictophonetic; 言 (speech) carries the meaning, 忍 (rěn) the sound (忍 itself is 刃 blade over 心 heart)
  2. Teacher creating flashcards:

    • Needs: Systematic grouping by radicals/components
    • System provides: Characters sharing components (e.g., all with 氵water radical)
    • Use for: Progressive learning sequences
  3. Self-study app developer:

    • Requirement: Show character decomposition in mobile app
    • Data needs: IDS, stroke order, semantic hints
    • Performance: Fast lookup, offline capability

Required Capabilities#

Character-Level#

  • ✅ IDS decomposition (show structure)
  • ✅ Radical identification (search by radical)
  • ✅ Etymology data (learning mnemonics)
  • ✅ Stroke order (writing practice)
  • ⚠️ Semantic/phonetic classification (which component carries meaning vs. sound)

Word-Level#

  • Optional: Word segmentation for context
  • Not needed: Complex morphological tagging

Library Fit Analysis#

makemeahanzi#

Excellent fit (9/10):

  • ✅ Rich etymology data with type (ideographic, pictophonetic, etc.)
  • ✅ Semantic and phonetic hints explicitly marked
  • ✅ SVG stroke data for animations
  • ✅ IDS decomposition
  • ✅ JSON format (easy integration with web/mobile)
  • ⚠️ Coverage: 9,000+ characters (sufficient for learners, not comprehensive)

Sample data:

{
  "character": "認",
  "decomposition": "⿰言忍",
  "etymology": {
    "type": "pictophonetic",
    "phonetic": "忍",
    "semantic": "言"
  }
}

cjklib/cjklib3#

Good fit (7/10):

  • ✅ Comprehensive character coverage
  • ✅ IDS decomposition
  • ✅ Radical lookups
  • ❌ Limited etymology data
  • ⚠️ Python 3 via fork (deployment complexity)
  • ❌ No stroke order SVG

CJKVI-IDS#

Moderate fit (5/10):

  • ✅ Complete CJK coverage
  • ✅ IDS decomposition
  • ❌ No etymology
  • ❌ No stroke order
  • ❌ No semantic/phonetic distinction
  • ⚠️ Raw IDS text (requires parser)

HanLP/Stanza/LTP#

Poor fit (2/10):

  • ❌ No character decomposition
  • ✅ Word segmentation (useful for context)
  • Not designed for educational character analysis

Primary: makemeahanzi

Implementation:#

  1. Download makemeahanzi JSON files
  2. Parse into SQLite or key-value store
  3. Build API:
    def get_character_info(char):
        return {
            'decomposition': '⿰言忍',
            'components': ['言', '忍'],
            'etymology': {...},
            'strokes': [svg_paths...]
        }
  4. For characters not in makemeahanzi, fall back to CJKVI-IDS (IDS only, no etymology)
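Steps 2-3 above, sketched with an in-memory SQLite store. The single inlined entry stands in for the full dictionary.txt, and the etymology fields are illustrative toy values in makemeahanzi's general shape:

```python
import json
import sqlite3

# One toy entry standing in for makemeahanzi's dictionary.txt (step 1).
entries = [
    {"character": "好", "decomposition": "⿰女子",
     "etymology": {"type": "ideographic",
                   "hint": "a woman 女 with a child 子"}},
]

# Step 2: load into SQLite keyed by character.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chars (ch TEXT PRIMARY KEY, data TEXT)")
db.executemany("INSERT INTO chars VALUES (?, ?)",
               [(e["character"], json.dumps(e)) for e in entries])

# Step 3: the lookup API.
def get_character_info(char):
    row = db.execute("SELECT data FROM chars WHERE ch = ?", (char,)).fetchone()
    return json.loads(row[0]) if row else None
```

SQLite keeps lookups fast and offline-capable, which matches the mobile-app requirement above; a plain dict works just as well for small deployments.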

Why this works:#

  • 9K characters cover HSK 1-6 + common characters (sufficient for learners)
  • Rich etymology enables mnemonic generation
  • Stroke order supports writing practice
  • JSON format = easy web/mobile integration
  • No heavy NLP dependencies

What’s missing:#

  • Component genealogy (how components evolved)
  • Cross-references to related characters
  • Difficulty ratings
  • Solution: Build on top of makemeahanzi data

Alternative Approach: Custom Dataset#

If makemeahanzi is insufficient, consider:

  1. Start with makemeahanzi core data
  2. Enhance with linguistic annotations
  3. Add pedagogy-specific fields:
    • Frequency rank
    • HSK level
    • Related characters
    • Common mnemonics
  4. Curate for learner needs

Example Integration: Flashcard App#

User sees: 認

App displays:
┌─────────────────────────────┐
│ 認 (rèn) - recognize        │
├─────────────────────────────┤
│ Components:                  │
│ 言 (speech) + 忍 (endure)    │
│                              │
│ Etymology: Pictophonetic     │
│ Phonetic: 忍 (rěn)          │
│ Semantic: 言 (speech)       │
│                              │
│ Mnemonic: enduring another's │
│ words until you recognize    │
│ their meaning                │
└─────────────────────────────┘

[Show stroke order animation]

Data source: makemeahanzi JSON + custom mnemonic database


Verdict: makemeahanzi is ideal for educational use cases. Combine with custom learning scaffolding for complete solution.


Use Case: Etymology & Character Origin Research#

Goal#

Study the historical development, original meanings, and structural evolution of Chinese characters for linguistic research or educational content creation.

User Stories#

  1. Linguist researching pictophonetic characters:

    • Query: Which characters use 工 as phonetic component?
    • System: Returns 江 (river), 紅 (red), 功 (achievement), etc.
    • Analysis: Common phonetic ‘gong’ sound
  2. Content creator making etymology videos:

    • Need: Visual evolution of character forms
    • System provides: Oracle bone → bronze → seal → modern
    • Plus: Semantic/phonetic classification
  3. Dictionary editor adding etymology notes:

    • Input: Character 信
    • System returns: Pictophonetic (人 person + 言 words)
    • Context: “信” = person standing by their words = trust
  4. Historical text researcher:

    • Analyzing archaic character forms
    • Needs variant mappings (ancient → modern)
    • Tracks character evolution through dynasties

Required Capabilities#

Etymology-Specific#

  • ✅ Character origin classification (pictographic, ideographic, pictophonetic, etc.)
  • ✅ Semantic/phonetic component identification
  • ✅ Historical forms (oracle bone, bronze, seal script)
  • ✅ Etymology explanations (why this structure?)
  • Optional: Cross-reference to variants

Structure Analysis#

  • ✅ IDS decomposition (see construction)
  • ✅ Component relationships (semantic vs. phonetic)
  • Optional: Component genealogy (how components evolved)

Library Fit Analysis#

makemeahanzi#

Excellent fit (9/10):

  • Etymology field explicitly included
  • ✅ Classification: pictographic, ideographic, pictophonetic
  • ✅ Semantic and phonetic components marked
  • ✅ IDS decomposition
  • ⚠️ Coverage: 9,000+ characters (common ones well-covered)
  • ❌ No historical forms (oracle bone, bronze, seal)

Etymology data structure:

{
  "character": "江",
  "decomposition": "⿰氵工",
  "etymology": {
    "type": "pictophonetic",
    "phonetic": "工",
    "semantic": "氵",
    "hint": "river; water (氵) + phonetic gong (工)"
  },
  "pinyin": ["jiāng"]
}

Strengths:

  • Direct etymology classification
  • Phonetic/semantic explicitly distinguished
  • Hint field provides explanation
  • Easy to build etymology database
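The phonetic and semantic component indexes mentioned above can be built as simple inverted maps; the records below are toy entries in makemeahanzi's etymology shape:

```python
from collections import defaultdict

# Toy records in makemeahanzi's etymology shape.
records = {
    "江": {"type": "pictophonetic", "phonetic": "工", "semantic": "氵"},
    "紅": {"type": "pictophonetic", "phonetic": "工", "semantic": "糹"},
    "好": {"type": "ideographic"},
}

# Inverted indexes: component -> set of characters using it.
by_phonetic = defaultdict(set)
by_semantic = defaultdict(set)
for char, ety in records.items():
    if "phonetic" in ety:
        by_phonetic[ety["phonetic"]].add(char)
    if "semantic" in ety:
        by_semantic[ety["semantic"]].add(char)
```

Built once at load time, these indexes answer queries like "all characters with phonetic 工" in constant time.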

cjklib/cjklib3#

Moderate fit (6/10):

  • ✅ IDS decomposition (structure visible)
  • ✅ Radical identification
  • ⚠️ Limited etymology information
  • ❌ No semantic/phonetic classification
  • ❌ No historical forms
  • ✅ Comprehensive character coverage

Use case: Structural analysis, but requires external etymology data

CJKVI-IDS#

Poor fit (3/10):

  • ✅ IDS structure
  • ❌ No etymology information
  • ❌ No component classification
  • ❌ No historical forms
  • Use case: Structural data only, need external etymology source

Specialized Etymology Resources (Not Libraries)#

Zhongwen.com#

  • Online etymology dictionary
  • Not programmatic access
  • Rich explanations

Chinese Etymology by Richard Sears#

  • Historical character forms database
  • Shows evolution: oracle bone → modern
  • Not easily parsable API

Wiktionary Chinese Etymology#

  • Community-contributed
  • Variable quality
  • Can be scraped but maintenance burden

For Etymology Research: makemeahanzi + External Sources#

Architecture:

Primary: makemeahanzi (9K characters with etymology)
    ↓
Fallback: Manual curation for specialized characters
    ↓
Enhancement: Scrape Sears database for historical forms
    ↓
Output: Comprehensive etymology database

Implementation:

  1. Base layer: makemeahanzi JSON

    • Parse etymology field
    • Index by type (pictographic, ideographic, pictophonetic)
    • Build phonetic component index
    • Build semantic component index
  2. Enhancement layer: Add historical forms

    • Scrape Richard Sears database
    • Map ancient forms to modern characters
    • Store evolution sequences
  3. Query API:

    def get_etymology(char):
        base = makemeahanzi_db.get(char)
        historical = sears_db.get(char)
    
        return {
            'modern': char,
            'type': base['etymology']['type'],
            'semantic': base['etymology']['semantic'],
            'phonetic': base['etymology']['phonetic'],
            'hint': base['etymology']['hint'],
            'historical_forms': historical['forms'],
            'evolution': historical['timeline']
        }

Use Case Examples#

Research Query: Find All Pictophonetic Characters with Phonetic ‘马’#

results = [
    char for char, data in etymology_db.items()
    if data['etymology']['type'] == 'pictophonetic'
    and '马' in data['etymology']['phonetic']
]
# → ['吗', '妈', '码', '玛', etc.]
# All pronounced 'ma'

Educational Content: Explain Character 好#

info = get_etymology('好')

# Output:
{
  'modern': '好',
  'structure': '⿰女子',
  'type': 'ideographic',
  'meaning': 'woman + child = good (mother with child)',
  'components': {
    '女': 'woman (semantic)',
    '子': 'child (semantic)'
  }
}

Linguistic Research: Analyze Phonetic Components#

# Find all characters using same phonetic
phonetic_family = find_by_phonetic_component('工')

# Analysis:
# 工 (gōng) - work
# 江 (jiāng) - river [older pronunciation gōng]
# 紅 (hóng) - red
# 功 (gōng) - achievement
# Common phonetic base indicating ancient sound

Gaps & Workarounds#

Gap 1: Historical Forms Not in makemeahanzi#

Workaround:

  • Scrape Richard Sears etymology database
  • Map to makemeahanzi characters
  • Store separately, join on character key

Effort: ~3-5 days scraping + integration

Gap 2: Only 9K Characters Covered#

Workaround:

  • Most research focuses on common characters (covered)
  • For rare characters: Manual annotation
  • Or: Use CJKVI-IDS for structure, manual etymology

Gap 3: No Component Evolution Tracking#

Workaround:

  • Requires linguistic research
  • Not available in any library
  • Build manually for specific research questions

Alternative Approach: Build Comprehensive Etymology DB#

If makemeahanzi is insufficient:

  1. Foundation: makemeahanzi (9K with etymology)
  2. Historical forms: Scrape Sears database
  3. Extended coverage: Manual annotation for specialized characters
  4. Component evolution: Linguistic research + manual entry
  5. Cross-references: Link related characters

Effort: Significant (~weeks/months)
Benefit: Research-grade etymology resource

Production Considerations#

For Educational Content#

  • makemeahanzi sufficient for most cases
  • Focus on common 3,000-5,000 characters
  • Supplement with curated explanations

For Linguistic Research#

  • Start with makemeahanzi
  • Enhance with historical data (Sears)
  • Accept manual work for specialized queries
  • Build custom database over time

For Dictionary/Reference#

  • makemeahanzi + manual curation
  • Quality control on etymology explanations
  • Peer review by linguists
  • Continuous refinement

Verdict: makemeahanzi provides best programmatic access to etymology data for common characters. For comprehensive research, combine with historical databases (Sears) and expect manual curation for specialized needs.


Use Case: NLP Pipeline & Text Processing#

Goal#

Process Chinese text for downstream NLP tasks: sentiment analysis, machine translation, information extraction, question answering, etc.

User Stories#

  1. Machine translation preprocessing:

    • Input: Raw Chinese text (no spaces)
    • System: Segment into words, tag POS
    • Output: Structured tokens for MT system
  2. Sentiment analysis:

    • Input: Product reviews in Chinese
    • System: Tokenize, identify entities, extract features
    • Output: Sentiment scores per aspect
  3. Named entity recognition:

    • Input: News articles
    • System: Segment text, identify person/org/location names
    • Output: Annotated entities
  4. Chatbot processing:

    • Input: User query in Chinese
    • System: Parse intent, extract entities
    • Output: Structured query for dialog system

Required Capabilities#

Word-Level (Primary)#

  • ✅ Word segmentation (critical for Chinese)
  • ✅ POS tagging (grammatical analysis)
  • ✅ NER (entity extraction)
  • ✅ Dependency parsing (sentence structure)
  • Optional: Semantic role labeling

Character-Level (Secondary)#

  • Optional: Character decomposition for OOV handling
  • Optional: Radical features for ML models

Library Fit Analysis#

HanLP#

Excellent fit (9/10):

  • ✅ Complete NLP pipeline
  • ✅ Python 3 support
  • ✅ pip installable
  • ✅ RESTful API option
  • ✅ Multilingual (130 languages)
  • ✅ Joint task training
  • ✅ Production-ready

Pipeline:

import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)

result = HanLP("我爱学习汉字")
# {
#   'tok': ['我', '爱', '学习', '汉字'],
#   'pos': ['PN', 'VV', 'VV', 'NN'],
#   'ner': [],
#   'dep': [...],
#   ...
# }

Stanza#

Excellent fit (9/10):

  • ✅ Complete pipeline
  • ✅ UD framework (cross-lingual consistency)
  • ✅ Python 3, pip installable
  • ✅ Stanford backing
  • ✅ 80 languages
  • ✅ Proven performance

Pipeline:

import stanza

nlp = stanza.Pipeline('zh')
doc = nlp("我爱学习汉字")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.pos, word.deprel)

LTP#

Excellent fit (9/10):

  • ✅ Chinese-specific optimization
  • ✅ Complete NLP pipeline
  • ✅ Python 3 (N-LTP)
  • ✅ Cloud service available
  • ✅ Academic backing (HIT)

Use case: Chinese-only applications benefit from Chinese-specific training

Comparison#

| Feature | HanLP | Stanza | LTP |
| --- | --- | --- | --- |
| Segmentation | ✅ | ✅ | ✅ |
| POS Tagging | ✅ | ✅ (UD) | ✅ |
| NER | ✅ | ✅ | ✅ |
| Dependency Parsing | ✅ | ✅ (UD) | ✅ |
| Semantic Role Labeling | ✅ | ⚠️ | ✅ |
| Languages | 130 | 80 | Chinese only |
| Framework | MTL | UD | Multi-task |
| Deployment | Lib + API | Lib | Lib + Cloud |

Character Analysis Libraries#

Poor fit (1/10):

  • cjklib, makemeahanzi: Not designed for NLP pipelines
  • Use case: Supplementary features, not primary processing

For General NLP: HanLP#

Rationale:

  • Broadest language support (future-proof)
  • RESTful API option (microservices)
  • Active development
  • Comprehensive task coverage

Architecture:

Raw Text → HanLP Pipeline → Structured Data → Downstream Tasks
                                              ↓
                                         ML Models
                                         Rule Systems
                                         Search Engines

For UD-Compliant Projects: Stanza#

Rationale:

  • Cross-lingual research
  • UD framework consistency
  • Integration with UD tools
  • Academic reproducibility

For Chinese-Only, High-Performance: LTP#

Rationale:

  • Optimized for Chinese
  • Chinese-specific models
  • Lower latency
  • Cloud service option (scalability)

Character Decomposition Integration#

Use Case: OOV Handling#

Problem: Unknown words break segmentation

Solution: Use character features

# Segment with HanLP
tokens = hanlp_segment(text)

# For OOV tokens, analyze character structure
for token in tokens:
    if is_oov(token):
        # Get character components
        components = decompose_chars(token)
        # Use components as features for classification
        features = extract_features(components)

Benefit: Character decomposition provides fallback features for ML models when encountering unknown words.

Implementation#

  1. Primary: HanLP for word segmentation
  2. Secondary: makemeahanzi/CJKVI-IDS for character features
  3. Use character features in ML pipeline:
    • Word embeddings + character embeddings
    • Radical embeddings
    • Structural features from IDS

Production Pipeline Example#

import hanlp
from character_analyzer import decompose  # Custom

# Initialize
nlp = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)

def process_document(text):
    # NLP pipeline
    result = nlp(text)

    # Enhance with character features
    for token in result['tok']:
        if is_rare(token):
            # Decompose characters in token
            char_features = [decompose(c) for c in token]
            result['char_features'] = char_features

    return result

Performance Considerations#

Throughput#

  • HanLP: ~100-1000 chars/sec (depends on model)
  • Stanza: Similar performance
  • LTP: Optimized for speed in Chinese

Optimization#

  • Batch processing (process multiple docs together)
  • Model selection (smaller models for speed vs. accuracy)
  • Caching (segment common phrases once)
  • GPU acceleration (for large-scale processing)
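The caching point can be sketched with a memoized wrapper; `raw_segment` below is a placeholder for an expensive HanLP/Stanza/LTP call, not a real segmenter:

```python
from functools import lru_cache

# `raw_segment` stands in for an expensive call into a loaded NLP
# model; the split-on-"|" behavior is purely for illustration.
def raw_segment(text):
    return text.split("|")

@lru_cache(maxsize=10_000)
def segment_cached(text):
    # lru_cache requires hashable return values, hence the tuple.
    return tuple(raw_segment(text))
```

Repeated phrases (headlines, boilerplate, product names) then hit the cache instead of the model.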

Deployment Options#

Option A: Local Library#

pip install hanlp

Pro: No network, full control
Con: Model size (~GB), memory usage

Option B: RESTful API (HanLP)#

# HanLP provides hosted API
curl -X POST https://hanlp.hankcs.com/api/...

Pro: No local setup, scalability
Con: Network dependency, cost

Option C: Self-Hosted Service#

# FastAPI + HanLP
from fastapi import FastAPI
app = FastAPI()

@app.post("/segment")
def segment(text: str):
    return nlp(text)

Pro: Control + scalability
Con: Infrastructure management


Verdict: HanLP, Stanza, or LTP are all excellent for NLP pipelines. Choose based on: multilingual needs (HanLP), UD framework (Stanza), or Chinese-only optimization (LTP). Character decomposition is supplementary, not primary.


Use Case: Search & Information Retrieval#

Goal#

Enable search and retrieval of Chinese text based on character structure, components, and morphological properties.

User Stories#

  1. Dictionary user searching by radical:

    • “Show me all characters with 氵 water radical”
    • System returns: 江, 河, 湖, 海, 洋, 汉, etc.
    • Sorted by stroke count or frequency
  2. Unicode researcher finding similar glyphs:

    • “Find characters structurally similar to 好”
    • System finds: All ⿰ left-right compositions
    • Filter by specific components
  3. Translation tool handling variants:

    • Input: Traditional 學 or Simplified 学
    • System recognizes as variants
    • Returns unified results
  4. Digital library indexing text:

    • Index historical documents by characters
    • Handle variant forms
    • Enable component-based search

Required Capabilities#

Character-Level#

  • ✅ Component extraction (index by radicals/components)
  • ✅ IDS parsing (structural similarity)
  • ✅ Variant mapping (traditional/simplified/regional)
  • ✅ Radical classification (standard lookup tables)
  • ⚠️ Fuzzy matching (partial component match)

Word-Level#

  • ✅ Word segmentation (identify searchable units)
  • Optional: Morphological normalization

Library Fit Analysis#

cjklib/cjklib3#

Excellent fit (9/10):

  • getCharactersForComponents() - exact match search
  • getRadicalForms() - handle radical variants
  • getCharacterVariants() - traditional/simplified mapping
  • ✅ Comprehensive CJK coverage
  • ✅ SQLite backend (efficient lookup)
  • ⚠️ Python 3 via fork

Key APIs for search:

from cjklib import characterlookup
cjk = characterlookup.CharacterLookup('T')  # Traditional

# Find by component
chars = cjk.getCharactersForComponents(['氵'])

# Get variants
variants = cjk.getCharacterVariants('學')  # → ['学', ...]

# Radical-based search
radical_chars = cjk.getCharactersForRadical('水')

CJKVI-IDS#

Good fit (7/10):

  • ✅ Complete IDS data (structural search)
  • ✅ Can build component index
  • ⚠️ Requires custom search implementation
  • ❌ No built-in variant mapping
  • ❌ No radical classification tables

Use case:

  • Parse IDS for all characters
  • Build inverted index: component → [characters]
  • Implement fuzzy structure matching
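The inverted-index idea above can be sketched in a few lines of Python. The tab-separated line shape (codepoint, character, IDS) follows the published CJKVI-IDS files, but verify it against the data you download; `build_component_index` is an illustrative name, not a real API:

```python
from collections import defaultdict

# Unicode Ideographic Description Characters (structure operators, not components)
IDC = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

def build_component_index(ids_lines):
    """Build an inverted index: component -> set of characters.

    Each line is assumed tab-separated as in the CJKVI-IDS files:
    codepoint, character, IDS sequence (e.g. "U+6C5F\t江\t⿰氵工").
    """
    index = defaultdict(set)
    for line in ids_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue
        char, ids = parts[1], parts[2]
        # Every code point in the IDS that is not a structure operator
        # (and not the character itself) is treated as a component.
        for comp in ids:
            if comp not in IDC and comp != char:
                index[comp].add(char)
    return index

sample = ["U+6C5F\t江\t⿰氵工", "U+6CB3\t河\t⿰氵可"]
idx = build_component_index(sample)
print(sorted(idx["氵"]))  # characters containing the water radical
```

A real implementation would also recurse into nested IDS expressions so that sub-components are indexed too.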

makemeahanzi#

Moderate fit (6/10):

  • ✅ IDS decomposition
  • ⚠️ Limited coverage (9K chars)
  • ❌ No variant tables
  • ❌ No radical classification
  • ✅ Easy to index (JSON)

Best for: Simple component search in constrained character set

HanLP#

Moderate fit for word-level (6/10):

  • ✅ Word segmentation (identify search terms)
  • ✅ Can index at word level
  • ❌ No character structure search
  • Use case: “Find documents containing ‘学习’” (word search, not component search)

For Character Search: cjklib/cjklib3#

Architecture:

User Query → cjklib API → SQLite DB → Results

Implementation:

  1. Use cjklib3 fork for Python 3
  2. Build search index on startup
  3. Implement query patterns:
    • By radical: cjk.getCharactersForRadical()
    • By component: cjk.getCharactersForComponents()
    • By structure: Parse IDS, filter by pattern
    • Variants: cjk.getCharacterVariants()

Advantages:

  • Proven search algorithms
  • Comprehensive data
  • SQL backend = efficient queries
  • Handles complex cases (variant forms, radical positioning)

For Document Search: HanLP + cjklib#

Hybrid approach:

  1. Preprocessing: HanLP segments documents into words
  2. Indexing:
    • Word-level index (standard search)
    • Character-level index (cjklib for component search)
  3. Query:
    • Regular search: Use word index
    • Component search: Use character index

Example:

Document: "我喜欢学习汉字"
Word index: ["我", "喜欢", "学习", "汉字"]
Character index: {
  '子': ['字'],
  '宀': ['字'],
  '氵': ['汉'],
  // etc.
}

Query "characters with 氵" → returns positions of: 汉
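A minimal sketch of this hybrid indexing, with toy stand-ins for the segmenter and the decomposition lookup (the real versions would come from HanLP and cjklib/makemeahanzi; `build_indexes` is an illustrative name):

```python
def build_indexes(documents, segment, decompose):
    """Build word- and component-level indexes over documents.

    `segment` is any word segmenter and `decompose` maps a character
    to its components; both are assumed interfaces here.
    """
    word_index, char_index = {}, {}
    for doc_id, text in enumerate(documents):
        for word in segment(text):
            word_index.setdefault(word, set()).add(doc_id)
        for ch in text:
            for comp in decompose(ch):
                char_index.setdefault(comp, set()).add(doc_id)
    return word_index, char_index

# Toy stand-ins so the sketch runs without either dependency:
segment = lambda t: [t[i:i + 2] for i in range(0, len(t), 2)]
decomp_table = {"汉": ["氵", "又"], "字": ["宀", "子"]}
decompose = lambda ch: decomp_table.get(ch, [])

words, chars = build_indexes(["我爱汉字"], segment, decompose)
```

Querying "documents containing a 氵 character" is then a single dictionary lookup in `chars`.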

Production Considerations#

Performance#

  • cjklib SQLite: Fast lookups for single-character queries
  • Inverted index: Fast component search
  • Caching: Cache common queries (radical lists)

Scalability#

  • Character search: O(1) with proper indexing
  • Document search: O(n) with inverted index
  • Structure matching: O(n) requires full scan, optimize with filters

Deployment#

  • Option A: Use cjklib3 fork (accept Python 3 complexity)
  • Option B: Port cjklib algorithms to modern Python, use CJKVI-IDS data
  • Option C: Use cjklib Python 2 via subprocess (isolation)

If cjklib complexity is prohibitive:

  1. Data: CJKVI-IDS (all CJK characters with IDS)
  2. Preprocessing:
    • Parse IDS into component trees
    • Build inverted index: component → characters
    • Create radical mapping table
  3. Query engine:
    • Component search: O(1) lookup in index
    • Structure search: Filter by IDS pattern
    • Fuzzy match: Levenshtein distance on IDS strings

Effort: ~1-2 weeks of development
Advantage: Full control, modern Python 3, no legacy dependencies
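The fuzzy-match step can be sketched with a plain Levenshtein distance over IDS strings; `fuzzy_structure_match` and the example table are illustrative, not a real API:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_structure_match(query_ids, ids_table, max_dist=1):
    """Return characters whose IDS is within max_dist edits of the query.

    `ids_table` maps character -> IDS string. This is a full scan;
    in practice you would prefilter by structure operator or component.
    """
    return [ch for ch, ids in ids_table.items()
            if levenshtein(query_ids, ids) <= max_dist]

table = {"江": "⿰氵工", "河": "⿰氵可", "好": "⿰女子"}
print(fuzzy_structure_match("⿰氵工", table))
```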


Verdict: For production character search, cjklib provides comprehensive functionality despite Python 2 legacy. For new projects, consider building custom search on CJKVI-IDS data with modern Python.


S4-Strategic: Approach#

Evaluation Method#

Assess long-term viability and strategic considerations for each approach beyond immediate technical capabilities.

Strategic Dimensions#

1. Maintenance & Longevity#

  • Is the project actively maintained?
  • Risk of abandonment?
  • Community health and contribution patterns
  • Corporate vs. academic backing

2. Evolution & Adaptation#

  • Can the tool adapt to new requirements?
  • Extensibility and customization options
  • Migration paths if needs change
  • Future-proofing considerations

3. Ecosystem Integration#

  • Does it play well with other tools?
  • Standard formats and protocols?
  • Community libraries and extensions
  • Lock-in risks

4. Resource Requirements#

  • Development effort for integration
  • Ongoing maintenance burden
  • Infrastructure costs
  • Team skill requirements

5. Legal & Licensing#

  • License compatibility
  • Data licensing and attribution
  • Export/usage restrictions
  • IP considerations

6. Build vs. Buy vs. Data#

  • Library adoption
  • Build custom on open data
  • Use existing data sources directly
  • Hybrid approaches

Assessment Framework#

For each major approach:

  1. Current State - Where is it today?
  2. Trajectory - Where is it heading?
  3. Risks - What could go wrong?
  4. Mitigation - How to manage risks?
  5. Exit Strategy - How to pivot if needed?

Approaches to Assess#

Approach A: cjklib (Legacy Library)#

  • Python 2 codebase
  • Comprehensive functionality
  • Fork for Python 3 (cjklib3)

Approach B: makemeahanzi (Data-First)#

  • Open JSON data
  • No library dependency
  • Build custom tooling

Approach C: NLP Platforms (HanLP/Stanza/LTP)#

  • Production NLP pipelines
  • Character analysis secondary

Approach D: Hybrid (Data + Tools)#

  • Combine data sources with libraries
  • Selective integration

Will create separate viability assessments for each approach with strategic recommendations.


Strategic Viability: cjklib#

Current State#

Library: cjklib - Han character library for CJKV languages
Maintainer: cburgmer (original), free-utils-python (Python 3 fork)
Status: Python 2 EOL, Python 3 via fork
License: MIT (permissive)

Maintenance & Longevity Assessment#

Original cjklib (cburgmer/cjklib)#

Risk Level: HIGH

  • ❌ Python 2 only (EOL since 2020)
  • ❌ No commits addressing Python 3 in main repo
  • ⚠️ Open issue #11 (Python 3 support) since 2017, unresolved
  • ⚠️ Unclear maintenance status
  • ✅ MIT license allows forking

cjklib3 Fork (free-utils-python/cjklib3)#

Risk Level: MODERATE-HIGH

  • ✅ Python 3 compatibility via 2to3
  • ⚠️ Fork maintenance unclear
  • ⚠️ Not on PyPI (manual installation)
  • ⚠️ Depends on 2to3 conversion (not native Python 3)
  • ⚠️ Small community around fork

Verdict#

High technical debt, uncertain future

Evolution & Adaptation#

Extensibility#

  • ✅ SQLite backend = easy to extend data
  • ✅ Modular architecture
  • ❌ Python 2 codebase limits modern Python features
  • ⚠️ Would require significant refactoring for modernization

Migration Paths#

Option A: Continue with fork

  • Keep using cjklib3 as-is
  • Accept technical debt
  • Risk: Fork abandonment

Option B: Port to native Python 3

  • Rewrite cjklib in modern Python
  • Use same algorithms, fresh codebase
  • Effort: ~2-4 weeks
  • Benefit: Full control, modern code

Option C: Extract data, abandon library

  • Parse cjklib’s SQLite database
  • Build minimal custom API
  • Effort: ~1 week
  • Benefit: No legacy code

Ecosystem Integration#

Standards Compliance#

  • ✅ Unicode IDS standard
  • ✅ SQLite (universal format)
  • ✅ Standard radical classifications
  • ✅ Well-documented file formats

Integration Points#

  • ⚠️ Python 2/3 split complicates integration
  • ✅ Can extract data for use elsewhere
  • ✅ SQL interface standard

Resource Requirements#

Initial Integration#

  • Time: 1-2 weeks (setup fork, database build)
  • Skills: Python, SQL, CJK character knowledge
  • Infrastructure: Minimal (local DB)

Ongoing Maintenance#

  • Monitoring: Watch for fork updates (low frequency expected)
  • Updates: Manual process (rebuild from fork)
  • Risk Management: Plan for fork abandonment

Exit Costs#

  • Low: Data is accessible via SQLite
  • Migration: Can extract data, build new tool
  • No vendor lock-in: Open source, standard formats

License: MIT#

  • ✅ Permissive use
  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ No copyleft requirements

Attribution#

  • Required: MIT license notice
  • Low burden: Include license file

Data Sources#

  • ✅ Multiple open character databases
  • ✅ Unicode standard (public)
  • No licensing concerns for data

Strategic Risks#

Risk 1: Fork Abandonment#

Probability: MODERATE-HIGH
Impact: MODERATE (data remains accessible)

Mitigation:

  • Extract SQLite database
  • Document data schema
  • Prepare migration to custom code

Risk 2: Python 3 Incompatibility Issues#

Probability: LOW-MODERATE
Impact: LOW-MODERATE (can fix or work around)

Mitigation:

  • Test thoroughly before production
  • Isolate in separate service if needed
  • Have extraction plan ready

Risk 3: Limited Community Support#

Probability: HIGH
Impact: LOW (reduces feature additions, not core functionality)

Mitigation:

  • Don’t depend on new features
  • Budget for custom fixes
  • Consider eventual migration

Build vs. Adopt Decision#

Adopt cjklib3 If:#

  • ✅ Need comprehensive character search immediately
  • ✅ Can accept Python 3 fork complexity
  • ✅ Short-term solution acceptable (~1-2 years)
  • ✅ Plan to migrate eventually

Build Custom If:#

  • ✅ Long-term solution needed (5+ years)
  • ✅ Have development resources (2-4 weeks)
  • ✅ Want full control and modern code
  • ✅ Can leverage existing data (CJKVI-IDS, makemeahanzi)

Phase 1: Use Data, Not Library (LOW RISK)#

  1. Extract data from cjklib’s sources or use CJKVI-IDS directly
  2. Build minimal Python 3 API for your needs
  3. Avoid dependency on cjklib code
  4. Use cjklib algorithms as reference, but reimplement

Phase 2: Enhance as Needed#

  1. Start with basic character decomposition
  2. Add search features progressively
  3. Optimize based on actual usage
  4. No technical debt from legacy code

Rationale:#

  • Same data quality (using same sources)
  • Modern Python 3 (clean implementation)
  • Full control (no fork dependency)
  • Lower risk (no abandonment concerns)
  • Maintainable (code you understand)

Timeline & Effort#

Data Extraction: 2-3 days#

  • Parse CJKVI-IDS or cjklib DB
  • Load into modern format (JSON, SQLite, etc.)

Basic API: 3-5 days#

  • Character decomposition
  • Radical lookup
  • IDS parsing

Advanced Features: 1-2 weeks#

  • Component search
  • Variant mappings
  • Optimization

Total: 2-3 weeks for production-ready solution

vs. cjklib3 fork: 1 week setup + ongoing maintenance burden

Exit Strategy#

If cjklib approach fails:

  1. Data is preserved - SQLite DB, IDS files remain accessible
  2. Migrate to makemeahanzi - JSON data, easier integration
  3. Build on CJKVI-IDS - Authoritative IDS source
  4. No data loss - All approaches use same underlying data

Low switching cost due to standard formats and open data


Strategic Verdict: MODERATE RISK

cjklib provides excellent functionality but carries significant technical debt. Recommended approach: Extract data and algorithms, rebuild in modern Python for long-term viability. Short-term use acceptable if migration plan exists.

Timeline: Can use fork for <1 year; plan migration to custom solution
Effort: 2-3 weeks to build modern replacement
Risk: Manageable via data extraction and migration plan


Strategic Viability: makemeahanzi#

Current State#

Resource: makemeahanzi - Free, open-source Chinese character data
Maintainer: skishore (GitHub)
Status: Active, data-focused project
License: LGPL (with exceptions for data)
Format: JSON (dictionary.txt, graphics.txt)

Maintenance & Longevity Assessment#

Risk Level: LOW-MODERATE

Active Maintenance#

  • ✅ GitHub repository active
  • ✅ Community contributions
  • ✅ Clear data format
  • ✅ Multiple forks/derivatives (healthy ecosystem)

Sustainability#

  • ✅ Data project (not code) = lower maintenance burden
  • ✅ Unicode IDS standard unlikely to change
  • ✅ Can be maintained by community even if original author steps back
  • ✅ Multiple projects using it as data source

Longevity Indicators#

  • ✅ Used by other projects (HanziJS, character learning apps)
  • ✅ Standard format (JSON)
  • ✅ Well-documented schema
  • ⚠️ Coverage limited to ~9,000 characters

Verdict: Low maintenance risk, sustainable data resource

Evolution & Adaptation#

Extensibility#

  • ✅ JSON format = easy to extend with custom fields
  • ✅ Can merge with other data sources (CJKVI-IDS, etc.)
  • ✅ No library lock-in = full flexibility
  • ✅ Can build any API layer on top

Data Completeness Evolution#

  • Current: ~9,000 characters
  • Path to expand: Add more characters following same schema
  • Community can contribute additional character data
  • Can supplement with CJKVI-IDS for full CJK coverage

Migration Paths#

  • Minimal risk: Data format stable and standard
  • Easy export: Already JSON (universally parsable)
  • Can enhance: Add fields without breaking existing data
  • No vendor lock-in: Pure data, not proprietary format

Ecosystem Integration#

Standards Compliance#

  • ✅ Unicode characters
  • ✅ IDS standard for decomposition
  • ✅ Pinyin standard
  • ✅ SVG for stroke graphics
  • ✅ JSON (universal format)

Integration Points#

  • ✅ Easy to import into any language/platform
  • ✅ Can load into databases (SQL, NoSQL)
  • ✅ No runtime dependencies
  • ✅ Works with any tech stack

Ecosystem Health#

  • ✅ Multiple projects built on makemeahanzi
  • ✅ Active discussion and contributions
  • ✅ Clear documentation
  • ✅ Examples available

Verdict: Excellent ecosystem integration

Resource Requirements#

Initial Integration#

  • Time: 1-2 days (parse JSON, load into DB/structure)
  • Skills: Basic programming, JSON parsing
  • Infrastructure: Minimal (local files)
  • Dependencies: None (pure data)

Ongoing Maintenance#

  • Monitoring: Check GitHub for updates (low frequency)
  • Updates: Download new JSON files (simple)
  • No runtime dependencies: Just data files
  • Minimal burden: Data doesn’t need “maintenance”

Development Flexibility#

  • ✅ Build API in any language
  • ✅ Choose your own architecture
  • ✅ Optimize for your use case
  • ✅ No framework constraints

License: LGPL with Exceptions#

Character data: Special exception allows use without LGPL restrictions
Derivations: Can use data freely, even in proprietary systems
Attribution: Required (give credit to the makemeahanzi project)

License Risk: LOW#

  • Data usage is permissive
  • Commercial use allowed
  • No copyleft concerns for data
  • Attribution burden minimal

Data Provenance#

  • ✅ Clearly documented sources
  • ✅ Public domain Unicode standard
  • ✅ Community-contributed etymology
  • ✅ No IP concerns

Verdict: License-friendly for all use cases

Strategic Risks#

Risk 1: Coverage Limitations#

Probability: HIGH (intentional limitation)
Impact: MODERATE (9K chars sufficient for most cases)

Mitigation:

  • Covers HSK 1-6 + common characters
  • Supplement with CJKVI-IDS for rare characters
  • Prioritize common characters (80/20 rule)

Analysis:

  • 9,000 characters cover >99% of everyday text
  • Specialized needs can combine with other sources
  • Not a blocker for most applications

Risk 2: Maintainer Availability#

Probability: LOW-MODERATE (single maintainer)
Impact: LOW (data project, can be forked)

Mitigation:

  • Fork if needed (multiple forks exist)
  • Data is stable (not frequent updates needed)
  • Community can contribute
  • JSON format = future-proof

Risk 3: Data Quality Issues#

Probability: LOW (well-curated)
Impact: MODERATE (errors in etymology)

Mitigation:

  • Spot-check critical characters
  • Community review process
  • Can correct locally if needed
  • Submit fixes upstream

Build vs. Adopt Decision#

Adopt makemeahanzi Data If:#

  • ✅ Need character data (decomposition, etymology, strokes)
  • ✅ Want zero dependencies
  • ✅ Coverage of 9K characters sufficient
  • ✅ Prefer data-first architecture
  • ✅ Want full control over implementation

Supplement with Other Sources If:#

  • Need rare characters beyond 9K (use CJKVI-IDS)
  • Need historical forms (use Sears database)
  • Need comprehensive search (reference cjklib algorithms)

Phase 1: Foundation (2-3 days)#

  1. Download makemeahanzi JSON files
  2. Parse into your preferred format (SQLite, JSON, in-memory)
  3. Build basic query API
  4. Test with sample characters
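Phase 1 parsing can be this simple. The field names ("character", "decomposition", "radical") follow makemeahanzi's documented one-JSON-object-per-line schema for dictionary.txt, but verify them against the files you download; the function names are illustrative:

```python
import json

def parse_entries(lines):
    """Parse makemeahanzi dictionary lines (one JSON object per line)
    into a character -> entry mapping."""
    entries = {}
    for line in lines:
        if line.strip():
            entry = json.loads(line)
            entries[entry["character"]] = entry
    return entries

def load_dictionary(path):
    """Load dictionary.txt from disk."""
    with open(path, encoding="utf-8") as f:
        return parse_entries(f)

# Illustrative entry in the documented shape:
sample = ['{"character": "好", "decomposition": "⿰女子", "radical": "女"}']
entries = parse_entries(sample)
```

Loading the full file into SQLite or an in-memory dict from here is a few more lines, which is why the "1-2 days" estimate below is realistic.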

Phase 2: Enhancement (1 week)#

  1. Add CJKVI-IDS for extended coverage
  2. Build search indexes (by component, radical, etc.)
  3. Optimize for performance (caching, indexes)
  4. Create higher-level APIs for your use cases

Phase 3: Polish (ongoing)#

  1. Add custom annotations
  2. Improve etymology explanations
  3. Contribute fixes back to project
  4. Expand coverage as needed

Comparison with cjklib#

| Aspect      | makemeahanzi | cjklib            |
|-------------|--------------|-------------------|
| Format      | JSON data    | Python library    |
| Coverage    | 9K chars     | Comprehensive     |
| Etymology   | ✅ Rich      | ⚠️ Limited        |
| Maintenance | ✅ Active    | ❌ Python 2       |
| Setup       | Easy         | Complex (fork)    |
| Lock-in     | None         | Moderate          |
| Flexibility | ✅ Full      | ⚠️ API limits     |
| Long-term   | ✅ Low risk  | ⚠️ Technical debt |

Timeline & Effort#

Minimal Integration: 1-2 days#

  • Parse JSON
  • Basic lookup functions
  • Sufficient for simple use cases

Production-Ready: 1 week#

  • Database loading
  • Indexed search
  • Performance optimization
  • Error handling

Advanced Features: 2-3 weeks#

  • Complex search (component-based)
  • Integrated with CJKVI-IDS
  • Custom APIs for specific needs
  • Web service deployment

Total: 1 week to production, 3 weeks to comprehensive solution

Exit Strategy#

If makemeahanzi approach insufficient:

  1. Easy to migrate: JSON → any format
  2. Combine with CJKVI-IDS: Add coverage
  3. Supplement with cjklib data: Extract from SQLite
  4. No code to rewrite: Pure data, API is yours

Zero switching cost - data is universally accessible


Strategic Verdict: LOW RISK, HIGH FLEXIBILITY

makemeahanzi offers excellent data quality with minimal integration risk. Recommended for:

  • Educational applications
  • Etymology research
  • Character learning tools
  • Any project wanting full control

Timeline: 1 week to production-ready
Effort: Low (simple data parsing)
Risk: Minimal (standard formats, active project, no lock-in)
Flexibility: Maximum (build any architecture)

Recommendation: Default choice for character decomposition needs. Start here, enhance as needed.


Strategic Viability: NLP Platforms (HanLP/Stanza/LTP)#

Current State#

Platforms Assessed:

  • HanLP: Multilingual NLP toolkit (130 languages)
  • Stanza: Stanford NLP with UD framework (80 languages)
  • LTP: Language Technology Platform (Chinese-specific)

All are actively maintained, production-ready platforms.

Maintenance & Longevity Assessment#

HanLP#

Risk Level: LOW

  • ✅ Active development (regular releases)
  • ✅ Commercial backing + open source
  • ✅ Large community
  • ✅ Multilingual (not dependent on Chinese market alone)
  • ✅ Modern architecture (PyTorch/TensorFlow)

Longevity: HIGH - sustained by commercial interest and academic use

Stanza#

Risk Level: LOW

  • ✅ Stanford NLP backing (institutional support)
  • ✅ Academic research foundation
  • ✅ UD framework (cross-lingual standard)
  • ✅ Regular updates with UD releases
  • ✅ Large research community

Longevity: HIGH - academic foundation ensures continuity

LTP#

Risk Level: LOW-MODERATE

  • ✅ HIT (Harbin Institute of Technology) backing
  • ✅ Academic + industry use in China
  • ✅ N-LTP modernization (2020+)
  • ⚠️ More dependent on Chinese NLP market
  • ✅ Active GitHub repository

Longevity: MODERATE-HIGH - strong in Chinese NLP space

Overall Verdict: All three platforms have strong long-term prospects

Evolution & Adaptation#

HanLP#

  • ✅ Multi-task learning architecture (flexible)
  • ✅ RESTful API option (cloud-ready)
  • ✅ Continuous model improvements
  • ✅ Expanding language support
  • ✅ Integration with modern ML frameworks

Adaptation: EXCELLENT - designed for evolution

Stanza#

  • ✅ UD framework updates regularly
  • ✅ New language support added
  • ✅ Model improvements with each UD release
  • ✅ Research-driven enhancements
  • ⚠️ Tied to UD annotation conventions

Adaptation: GOOD - evolves with UD standard

LTP#

  • ✅ N-LTP modernization (neural architecture)
  • ✅ Cloud service option
  • ✅ Chinese language focus allows specialization
  • ⚠️ Less flexible for non-Chinese needs

Adaptation: GOOD - modernizing actively

Ecosystem Integration#

HanLP#

  • ✅ Haystack integration
  • ✅ spaCy pipeline compatibility
  • ✅ TensorFlow/PyTorch backends
  • ✅ RESTful API (microservices)
  • ✅ Python + Java support

Integration: EXCELLENT

Stanza#

  • ✅ UD framework (cross-tool compatibility)
  • ✅ spaCy integration available
  • ✅ Python native
  • ✅ Well-documented API
  • ✅ Research-standard outputs

Integration: EXCELLENT

LTP#

  • ✅ Python API
  • ✅ Cloud service option
  • ✅ Standard NLP outputs
  • ⚠️ Less third-party integration than HanLP/Stanza

Integration: GOOD

Resource Requirements#

Initial Integration#

| Platform | Setup Time | Skills Needed      | Infrastructure |
|----------|------------|--------------------|----------------|
| HanLP    | 1-2 hours  | Python, NLP basics | ~1-2GB model   |
| Stanza   | 1-2 hours  | Python, NLP basics | ~500MB model   |
| LTP      | 1-2 hours  | Python, NLP basics | ~1GB model     |

All are pip-installable, well-documented, production-ready.

Ongoing Maintenance#

All platforms: LOW maintenance burden

  • ✅ Simple pip install updates
  • ✅ Automatic model downloads
  • ✅ Stable APIs
  • ⚠️ Need to monitor breaking changes in major versions

Performance Optimization#

  • GPU acceleration available (optional)
  • Batch processing built-in
  • Model selection (speed vs. accuracy trade-offs)
  • Caching strategies

HanLP#

License: Apache 2.0 (permissive)

  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ Patent grant included
  • No licensing concerns

Stanza#

License: Apache 2.0 (permissive)

  • ✅ Commercial use allowed
  • ✅ Stanford institutional backing
  • ✅ No usage restrictions
  • No licensing concerns

LTP#

License: Apache 2.0 (original), check N-LTP

  • ✅ Generally permissive
  • ✅ Academic use clearly allowed
  • ✅ Commercial use typically allowed
  • Verify specific version

Overall: No license concerns for any platform

Strategic Risks#

Risk 1: Model Obsolescence#

Probability: MODERATE (ML evolves rapidly)
Impact: LOW (platforms update models)

Mitigation:

  • All platforms provide model updates
  • Can train custom models if needed
  • API abstracts model changes

Risk 2: API Breaking Changes#

Probability: LOW-MODERATE
Impact: MODERATE (code changes needed)

Mitigation:

  • Version pinning
  • Test updates before production
  • All platforms maintain compatibility

Risk 3: Platform Abandonment#

Probability: LOW (all have institutional backing)
Impact: HIGH (major migration needed)

Mitigation:

  • Choose platform with strongest backing (Stanza/HanLP)
  • Standard outputs (UD) ease migration
  • Multiple viable alternatives

Risk 4: Performance/Cost at Scale#

Probability: MODERATE (depends on usage)
Impact: MODERATE-HIGH (infrastructure costs)

Mitigation:

  • Benchmark early
  • Consider model size trade-offs
  • Cloud service options (HanLP, LTP)
  • GPU optimization

Build vs. Adopt Decision#

Adopt Platform (HanLP/Stanza/LTP) If:#

  • ✅ Need production NLP pipeline
  • ✅ Word segmentation is primary need
  • ✅ Want proven, maintained solution
  • ✅ Have standard NLP requirements
  • ✅ Want to leverage state-of-the-art models

Build Custom If:#

  • ❌ Have unique NLP requirements (rare)
  • ❌ Need extreme performance optimization
  • ❌ Have significant ML expertise and resources

Recommendation: ADOPT - building NLP pipelines from scratch is not cost-effective

Platform Selection Criteria#

Choose HanLP If:#

  • ✅ Need multilingual support (current or future)
  • ✅ Want comprehensive task coverage
  • ✅ Value RESTful API option
  • ✅ Want active commercial ecosystem

Choose Stanza If:#

  • ✅ Need UD-compliant outputs
  • ✅ Cross-lingual research
  • ✅ Academic reproducibility important
  • ✅ Stanford NLP ecosystem integration

Choose LTP If:#

  • ✅ Chinese-only application
  • ✅ Want Chinese-optimized performance
  • ✅ Need cloud service option
  • ✅ Familiar with HIT ecosystem

Character Analysis Integration#

Key Finding: All NLP platforms operate at word level, not character level.

For Character Decomposition Needs:#

Must combine with separate tool:

  • makemeahanzi (data)
  • cjklib (library)
  • CJKVI-IDS (data)

Integration Architecture:#

Text → NLP Platform (word segmentation) → Words
Words → Character Tool (decomposition) → Components

No lock-in: NLP and character analysis are separate concerns
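A sketch of the connector between the two layers; `segment` and `decompose` are assumed interfaces standing in for the real HanLP and makemeahanzi calls:

```python
def annotate(text, segment, decompose):
    """Two-layer pipeline: the NLP platform supplies words, the
    character tool supplies components. Both callables are assumed
    interfaces - plug in e.g. a HanLP tokenizer for `segment` and a
    makemeahanzi lookup for `decompose`."""
    return [{"word": w, "components": {ch: decompose(ch) for ch in w}}
            for w in segment(text)]

# Toy stand-ins so the sketch runs without either dependency:
segment = lambda t: [t[i:i + 2] for i in range(0, len(t), 2)]
decomp_table = {"汉": ["氵", "又"], "字": ["宀", "子"]}
decompose = lambda ch: decomp_table.get(ch, [])

result = annotate("汉字", segment, decompose)
```

Because neither layer knows about the other, either can be swapped without touching `annotate`.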

Phase 1: Adopt for Word Processing#

  1. Choose platform based on criteria above
  2. Integrate for word segmentation, POS, NER
  3. Deploy in production
  4. Monitor performance

Phase 2: Add Character Analysis#

  1. Separately integrate makemeahanzi or cjklib
  2. Build connector layer
  3. Combine outputs as needed
  4. No dependency between layers

Phase 3: Optimize#

  1. Cache common segmentations
  2. Batch processing
  3. GPU acceleration if needed
  4. Model tuning

Timeline & Effort#

Integration: 1-2 days#

  • Install platform
  • Basic API usage
  • Test with sample data

Production Deployment: 1 week#

  • Performance testing
  • Error handling
  • Logging and monitoring
  • Documentation

Optimization: 1-2 weeks#

  • Benchmarking
  • Caching strategy
  • Infrastructure setup
  • Load testing

Total: 1-2 weeks to production

Exit Strategy#

If platform switch needed:

Low switching cost because:#

  1. Standard outputs: UD annotations, token lists
  2. Multiple alternatives: HanLP, Stanza, LTP all viable
  3. Modular architecture: NLP layer separate from application logic
  4. No data lock-in: Text processing is stateless

Migration path:#

  1. Run old and new platform in parallel
  2. Compare outputs
  3. Gradually shift traffic
  4. Low risk, controlled process

Strategic Verdict: LOW RISK, HIGH VALUE

NLP platforms are mature, well-supported, and provide essential functionality. Recommended for all projects needing word segmentation.

Choose Based On:

  • Multilingual: HanLP
  • Academic/UD: Stanza
  • Chinese-only: LTP

Timeline: 1-2 weeks to production
Risk: Minimal (institutional backing, multiple alternatives)
Flexibility: High (standard outputs, modularity)

Recommendation: Adopt platform for word processing, combine with character analysis tool as needed. Do not build NLP pipeline from scratch.


S4-Strategic: Recommendation#

Strategic Risk Assessment Summary#

| Approach            | Maintenance Risk | Technical Debt | Flexibility | Long-term Viability |
|---------------------|------------------|----------------|-------------|---------------------|
| cjklib              | HIGH             | HIGH           | MODERATE    | ⚠️ MODERATE         |
| makemeahanzi        | LOW              | NONE           | HIGH        | ✅ HIGH             |
| NLP Platforms       | LOW              | LOW            | HIGH        | ✅ HIGH             |
| Hybrid (Data+Tools) | LOW              | LOW            | HIGH        | ✅ HIGH             |

Strategic Recommendation: Modular Architecture#

Core Principle: Separation of Concerns#

Build a modular system where character analysis and word processing are independent:

┌─────────────────────────────────────────┐
│          Application Layer              │
├─────────────────────────────────────────┤
│  Character Analysis  │  Word Processing │
│   (makemeahanzi)    │  (HanLP/Stanza)  │
├─────────────────────────────────────────┤
│          Data Layer                      │
│  JSON Files  │  Models  │  Databases    │
└─────────────────────────────────────────┘

Why This Works#

  1. Low coupling: Components can be upgraded independently
  2. Low risk: Failure in one doesn’t affect the other
  3. Flexibility: Swap components as needs evolve
  4. Standard formats: JSON, UD annotations, IDS sequences

Tier 1 Recommendation: Production Systems#

For Character Decomposition: makemeahanzi#

Why:

  • ✅ Zero technical debt (pure data)
  • ✅ Rich etymology (unique strength)
  • ✅ Standard formats (JSON, IDS)
  • ✅ Active maintenance
  • ✅ No lock-in
  • ✅ 9K character coverage sufficient for most uses

When to enhance:

  • Add CJKVI-IDS for rare characters (5% of use cases)
  • Add Sears database for historical forms (research only)

For Word Processing: HanLP or Stanza#

Why:

  • ✅ Production-ready
  • ✅ Institutional backing
  • ✅ Active development
  • ✅ No technical debt
  • ✅ Low maintenance

Choose HanLP if: Multilingual needs or want a RESTful API
Choose Stanza if: Academic UD compliance needed
Choose LTP if: Chinese-only optimization is the priority

Implementation Timeline#

Week 1: Foundation

  • Download makemeahanzi data
  • Parse into SQLite or JSON store
  • Build basic character decomposition API
  • Install HanLP/Stanza
  • Test word segmentation

Week 2: Integration

  • Combine character + word processing
  • Build unified API layer
  • Performance testing
  • Error handling

Week 3: Production

  • Deployment
  • Monitoring
  • Documentation
  • Initial optimization

Week 4+: Enhancement

  • Add CJKVI-IDS if needed
  • Custom features
  • Performance tuning
  • User feedback integration

Total: 3-4 weeks to production-ready system

Tier 2 Recommendation: Research/Academic#

For Comprehensive Character Analysis: Custom Tool on CJKVI-IDS + makemeahanzi#

Why:

  • Full CJK coverage (CJKVI-IDS)
  • Rich data for common characters (makemeahanzi)
  • Modern Python 3 implementation
  • Full control over algorithms
  • Research-grade quality

For Linguistic Research: Stanza + Custom Character Tool#

Why:

  • UD annotations (cross-lingual research)
  • Reproducible results
  • Academic standard
  • Can publish methodology

Timeline: 4-6 weeks#

  • 2 weeks: Build character analysis tool
  • 1 week: Integrate Stanza
  • 1 week: Research-specific features
  • 1-2 weeks: Validation and testing

Avoid cjklib3 Fork For New Projects#

Reasons:

  • Python 2 → 3 conversion technical debt
  • Uncertain fork maintenance
  • Manual setup complexity
  • Better alternatives available

Only use cjklib if:

  • Short-term project (<6 months)
  • Need comprehensive search immediately
  • Plan migration to modern solution
  • Can accept technical debt

Risk Mitigation Strategies#

Strategy 1: Data-First Architecture#

Protect against library obsolescence:

  • Store data in standard formats
  • Don’t tightly couple to library APIs
  • Build abstraction layer
  • Can swap libraries without data migration
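One way to build that abstraction layer is a small interface the application codes against; the class and method names here are illustrative:

```python
from typing import Protocol

class CharacterSource(Protocol):
    """Abstract interface the application depends on, so the backing
    data source (makemeahanzi, CJKVI-IDS, a cjklib export) can be
    swapped without touching callers."""
    def decomposition(self, char: str) -> str: ...
    def radical(self, char: str) -> str: ...

class JsonCharacterSource:
    """One concrete backend over a dict loaded from JSON data."""
    def __init__(self, entries: dict):
        self._entries = entries

    def decomposition(self, char: str) -> str:
        return self._entries[char].get("decomposition", "")

    def radical(self, char: str) -> str:
        return self._entries[char].get("radical", "")

source: CharacterSource = JsonCharacterSource(
    {"好": {"decomposition": "⿰女子", "radical": "女"}})
```

Replacing the data source then means writing one new class, not a data migration.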

Strategy 2: Modular Design#

Protect against component failure:

  • Character analysis independent of word processing
  • Can replace either component
  • Standard interfaces between modules
  • No cascading failures

Strategy 3: Progressive Enhancement#

Manage resource constraints:

  • Start with makemeahanzi (9K chars)
  • 99% of use cases covered
  • Add CJKVI-IDS only if needed
  • Avoid premature optimization

Strategy 4: Exit Plans#

Prepare for platform changes:

  • Standard output formats (UD, IDS, JSON)
  • Documentation of data schemas
  • Abstraction layers
  • Multiple viable alternatives

Long-Term Evolution Path#

Year 1: Foundation#

  • makemeahanzi + HanLP/Stanza
  • Basic features
  • Production deployment
  • User feedback

Year 2: Enhancement#

  • Add CJKVI-IDS for extended coverage
  • Custom algorithms for specialized needs
  • Performance optimization
  • Feature expansion based on usage

Year 3: Maturity#

  • Potentially migrate to fully custom character tool
  • Advanced features
  • Research-grade quality
  • Contribute improvements back to open source

Cost-Benefit Analysis#

Approach A: Adopt cjklib3#

  • Effort: 1 week setup + ongoing maintenance
  • Benefit: Comprehensive search immediately
  • Cost: Technical debt, uncertain future
  • Risk: HIGH

Approach B: makemeahanzi + Custom#

  • Effort: 2-3 weeks initial
  • Benefit: Modern codebase, full control, no debt
  • Cost: Initial development time
  • Risk: LOW

Approach C: Build Everything Custom#

  • Effort: 6-8 weeks
  • Benefit: Maximum control
  • Cost: Significant development resources
  • Risk: MODERATE (reinventing wheel)

Best ROI: Approach B (makemeahanzi + Custom)

Decision Matrix#

If Your Priority Is…#

Speed to market: → makemeahanzi + HanLP (1-2 weeks)

Long-term maintainability: → makemeahanzi + Custom + HanLP (3-4 weeks)

Research quality: → Custom tool + CJKVI-IDS + Stanza (4-6 weeks)

Minimal resources: → makemeahanzi data + minimal API (1 week)

Character search capabilities: → Study cjklib algorithms, implement in Python 3 (2-3 weeks)

Final Strategic Recommendation#

Build Modular System with:#

  1. Character Layer: makemeahanzi (+ CJKVI-IDS if needed)
  2. Word Layer: HanLP or Stanza
  3. Architecture: Separated, loosely coupled
  4. Timeline: 3-4 weeks to production
  5. Risk Level: LOW
  6. Future-proofing: HIGH

Why This Wins#

  • ✅ Low technical debt (modern Python, standard formats)
  • ✅ Low maintenance (stable data, proven tools)
  • ✅ High flexibility (swap components independently)
  • ✅ Low risk (institutional backing, multiple alternatives)
  • ✅ Fast time to market (3-4 weeks)
  • ✅ Room to grow (can enhance without rebuilding)

Avoid#

  • ❌ cjklib3 fork (technical debt)
  • ❌ Building NLP pipeline from scratch (reinventing wheel)
  • ❌ Monolithic architecture (high coupling)
  • ❌ Vendor lock-in (proprietary formats)

Strategic Verdict: MODULAR DATA-FIRST ARCHITECTURE

Combine proven tools (HanLP/Stanza for words) with open data (makemeahanzi for characters) in a loosely coupled architecture. This maximizes flexibility while minimizing risk and technical debt.

Timeline: 3-4 weeks to production-ready
Effort: Moderate (mostly integration, not invention)
Risk: Low (proven components, standard formats)
Flexibility: High (can evolve independently)

Next Steps:

  1. Download makemeahanzi data
  2. Choose NLP platform (HanLP recommended)
  3. Build connector layer
  4. Deploy and iterate
Published: 2026-03-06 Updated: 2026-03-06