1.161 Radical & Component Analysis#
Character decomposition and radical analysis for CJK characters. Tools for identifying semantic components, radicals, and character structure - essential for dictionary lookups, etymology, and character learning.
Explainer
Radical & Component Analysis: Research Explainer#
Research ID: 1.161 Category: Chinese Language Processing / Character Decomposition Completed: 2026-01-28 Discovery Phase: S1-S4 Complete
What Is This?#
Character decomposition is the process of breaking down Chinese characters (汉字/漢字) into their constituent parts: radicals (部首), components, and structural relationships. This research surveys data sources, tools, and patterns for implementing character decomposition systems.
Key Use Cases:
- Etymology: Understand character origins and historical development
- Mnemonic generation: Create memory aids for language learners
- Character learning progression: Build curricula from simple to complex
- Computational linguistics: Enable character-aware NLP applications
What We Researched#
S1: Quick Ecosystem Scan (ecosystem-scan.md)#
Data Standards:
- Unihan Database: Official Unicode consortium database with radical-stroke data, variants, pronunciations
- IDS (Ideographic Description Sequences): Standardized structural representation using operators (⿰, ⿱, etc.)
- Phonetic-Semantic Compounds (形聲字): 80% of characters combine semantic radical + phonetic component
Python Libraries:
- cjklib: Comprehensive but inactive (no Python 3 support)
- hanzipy: Learner-focused, unclear maintenance
- pypinyin: Active, production-ready (pronunciation focus)
- makemeahanzi: Open-source character data (JSON format)
Open Data Sources:
- CJKVI-IDS: 158 commits, comprehensive IDS database (plain text)
- cjk-decomp: 75,000+ character decomposition data
- Hsiao & Shillcock Database: Academic phonetic compound research
S2: Deep Technical Dive (library-comparison.md)#
Library Status:
- ❌ cjklib: Abandoned (0 PRs/issues in past year, no Python 3)
- ✅ pypinyin: Actively maintained (July 2025 release, Python 3.5-3.13)
- ⚠️ hanzipy: Maintenance unclear
- ✅ CJKVI-IDS: Active commits (158 total), stale releases (2018)
Data Format Comparison:
- CJKVI-IDS: U+5730 地 ⿰土也 (plain text, tab-delimited)
- Unihan: kRSUnicode: 32.3 (radical 土 + 3 strokes)
- makemeahanzi: {"character": "地", "etymology": {...}} (JSON)

Integration Strategy:
- Layer 1: CJKVI-IDS for structural decomposition
- Layer 2: Unihan for radical/stroke metadata
- Layer 3: makemeahanzi for phonetic-semantic classification
- Layer 4: pypinyin for pronunciation
Critical Insight: No single comprehensive library exists. Hybrid data-source approach recommended.
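The layered, hybrid approach can be sketched as a merge over already-parsed sources. The dict shapes below are assumptions about how each file might be loaded, not a fixed API, and Layer 4 (pypinyin) is omitted to keep the sketch stdlib-only:

```python
def hybrid_lookup(char, ids_data, unihan_data, mmah_data):
    """Merge one character's record across the layered data sources.

    Assumed shapes (illustrative only):
      ids_data:    {char: ids_string}           - Layer 1, CJKVI-IDS
      unihan_data: {codepoint: {field: value}}  - Layer 2, Unihan
      mmah_data:   {char: record}               - Layer 3, makemeahanzi
    """
    cp = f"U+{ord(char):04X}"
    return {
        'character': char,
        'ids': ids_data.get(char),
        'radical_stroke': unihan_data.get(cp, {}).get('kRSUnicode'),
        'etymology': (mmah_data.get(char) or {}).get('etymology'),
    }
```

Called with parsed data for 地, this would combine ⿰土也, 32.3, and the pictophonetic record into a single view.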
S3: Generic Use Cases (use-case-patterns.md)#
10 Key Patterns Identified:
- Character Decomposition Lookup: 花 → ⿱艹化 (structure)
- Radical-Based Search: Find all chars with radical 土
- Phonetic Series Identification: 青 → 清情請晴… (pronunciation families)
- Etymology Tree Generation: Recursive component hierarchy
- Mnemonic Component Extraction: 好 = 女+子 → “woman + child = good”
- Learning Progression Sequencing: Simple components before complex compounds
- Character Similarity Analysis: Structural/radical/phonetic overlap
- Component Reusability Analysis: Identify high-value components (口, 氵, etc.)
- Traditional ↔ Simplified Mapping: 國 ↔ 国 (variant handling)
- Stroke Order Alignment: Component order vs. writing sequence
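Several of these patterns (radical-based search, similarity, and reusability analysis) reduce to an inverted component index. A minimal sketch over an assumed {character: components} mapping:

```python
def build_component_index(decompositions):
    """Invert {character: [components]} into {component: {characters}}."""
    index = {}
    for char, components in decompositions.items():
        for comp in components:
            index.setdefault(comp, set()).add(char)
    return index

# Tiny illustrative table; real data would come from parsed CJKVI-IDS.
decomp = {'地': ['土', '也'], '坡': ['土', '皮'], '花': ['艹', '化']}
index = build_component_index(decomp)
# Radical-based search: index['土'] yields {'地', '坡'}
```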
Data Gap Identified: No comprehensive open database combining:
- Full Unicode coverage (CJKVI-IDS level)
- Phonetic-semantic classification (makemeahanzi level)
- Etymology type annotation (ideographic/pictographic/pictophonetic)
Workaround: Combine multiple data sources programmatically
S4: Strategic Viability (viability-analysis.md)#
Unicode Roadmap:
- ✅ Stable: Unihan database regularly updated (annual Unicode releases)
- ✅ Mature: IDS operators stable since 1999 (no breaking changes)
- ✅ Ongoing: CJK extensions (Ext-I in Unicode 17.0, 2026)
Python Ecosystem Health:
- ❌ cjklib: Abandoned (technical debt if used)
- ✅ pypinyin: Active (recommended for pronunciation)
- ⚠️ CJKVI-IDS: Active commits but stale releases (monitor closely)
Recommended Architecture:
- Bet on: Open data sources (Unihan, CJKVI-IDS), stable standards
- Avoid: Unmaintained libraries (cjklib), proprietary datasets
- Hedge with: Modular design, data versioning, automated testing
10-Year Outlook:
- Stable: Unicode, Unihan, IDS operators
- Uncertain: Python library consolidation, CJK NLP evolution
- Emerging: LLM-powered mnemonics, adaptive learning curricula
Key Findings#
1. Data Sources Are More Reliable Than Libraries#
| Aspect | Libraries | Open Data |
|---|---|---|
| Maintenance | Fragile (cjklib abandoned) | Stable (Unicode consortium) |
| Format | APIs can break | Plain text (future-proof) |
| Control | Dependency on maintainers | Full parsing control |
| Risk | High (abandonment) | Low (can fork) |
Recommendation: Data-first approach - parse CJKVI-IDS and Unihan directly, avoid unmaintained libraries.
2. Hybrid Data Stack Required#
No single source provides everything:
Use Case: Educational App
├─ Structure: CJKVI-IDS (⿰土也)
├─ Radicals: Unihan (kRSUnicode = 32.3)
├─ Etymology: makemeahanzi (type: pictophonetic)
└─ Pronunciation: pypinyin (dì, de)
Use Case: Comprehensive Dictionary
├─ Structure: CJKVI-IDS (full Unicode)
├─ Metadata: Unihan (variants, readings, strokes)
├─ Phonetic: makemeahanzi (common chars) + custom for rare chars
└─ Pronunciation: pypinyin + Unihan kMandarin
3. Traditional vs. Simplified Requires Dual Storage#
Challenge: Same Unicode codepoint, different glyphs
- 骨 (bone): PRC vs. Taiwan vs. Japan variants
- Decomposition differs: 國 (⿴囗或) vs. 国 (⿴囗玉)
Solution:
- Store IDS for both traditional and simplified forms
- Use Unihan kSimplifiedVariant/kTraditionalVariant for mapping
- Locale preference determines which decomposition to show
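A sketch of the variant-mapping step, assuming Unihan's convention of space-separated codepoint values such as U+56FD, each optionally suffixed with a `<`source tag:

```python
def parse_variant_field(value):
    """Convert a Unihan variant field value into characters.

    Values look like 'U+56FD' and may carry source tags ('U+56FD<kSomeTag');
    multiple variants are space-separated.
    """
    chars = []
    for token in value.split():
        codepoint = token.split('<')[0]           # strip any source tag
        chars.append(chr(int(codepoint[2:], 16)))  # 'U+56FD' -> 国
    return chars
```

With this, the kSimplifiedVariant value for 國 maps to 国, and the decomposition table can be keyed by whichever form the locale prefers.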
4. 80% of Characters Are Phonetic-Semantic Compounds#
Structure: Semantic radical + Phonetic component
Example: 清 (clear, qīng)
- Semantic: 氵 (water) → meaning category
- Phonetic: 青 (qīng) → pronunciation hint
Implication:
- Mnemonic generation: “Clear water (氵) sounds like 青”
- Learning progression: Teach phonetic families together (清情請晴)
- Etymology tools: Must classify phonetic vs. semantic components
Data Source: makemeahanzi for common chars, Hsiao & Shillcock for research
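Grouping a phonetic series out of makemeahanzi-style records might look like the following; the three sample records are hardcoded for illustration, not real file contents:

```python
def group_phonetic_series(records):
    """Bucket pictophonetic characters by their phonetic component."""
    series = {}
    for char, record in records.items():
        etymology = record.get('etymology') or {}
        if etymology.get('type') == 'pictophonetic' and etymology.get('phonetic'):
            series.setdefault(etymology['phonetic'], []).append(char)
    return series

records = {
    '清': {'etymology': {'type': 'pictophonetic', 'phonetic': '青', 'semantic': '氵'}},
    '晴': {'etymology': {'type': 'pictophonetic', 'phonetic': '青', 'semantic': '日'}},
    '地': {'etymology': {'type': 'pictophonetic', 'phonetic': '也', 'semantic': '土'}},
}
# group_phonetic_series(records)['青'] contains 清 and 晴
```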
5. Regional Variants Complicate Decomposition#
Unicode Han Unification: One codepoint for “same” character across regions
- Benefit: Interoperability
- Challenge: Visual/structural differences not captured
Handling Strategy:
- Default to Unicode standard (Unihan)
- Store multiple IDS per character (regional variants)
- Locale-aware decomposition (user’s region preference)
- Document variant policy (e.g., “prefer simplified for PRC learners”)
Data Source Recommendations#
For Structural Decomposition#
✅ Primary: CJKVI-IDS ids.txt
- Coverage: Full URO + Extensions
- Format: Plain text (easy parsing)
- License: ids-ext-cdef.txt is GPL-exempt
For Radical & Metadata#
✅ Primary: Unihan Database
- Authority: Official Unicode consortium
- Fields: kRSUnicode, kTotalStrokes, kSemanticVariant, etc.
- Maintenance: Regular updates
For Phonetic-Semantic Classification#
✅ Common Chars: makemeahanzi
- Format: JSON (easy integration)
- Fields: etymology.type, etymology.phonetic, etymology.semantic
- License: Open-source
✅ Research Reference: Hsiao & Shillcock (2006) Database
- Academic rigor
- Cangjie stroke decomposition
- Limited availability (contact authors)
For Pronunciation#
✅ Primary: pypinyin
- Actively maintained (July 2025)
- Python 3.5-3.13 support
- Heteronym support
Library Recommendations#
| Use Case | Recommended Approach | Why |
|---|---|---|
| Quick Prototype | Parse makemeahanzi JSON | Fastest, single file, common chars |
| Educational App | makemeahanzi + pypinyin | Etymology + pronunciation, learner-focused |
| Comprehensive Dictionary | CJKVI-IDS + Unihan + pypinyin | Full coverage, authoritative |
| Research Project | All sources + validation | Cross-reference, data quality assurance |
Libraries to AVOID:
- ❌ cjklib: No Python 3 support, abandoned
- ❌ cjklib3 fork: Maintenance unverified (as of 2026-01-28)
Implementation Guidance#
Parsing CJKVI-IDS (Plain Text)#
Format: Tab-delimited

```
U+5730	地	⿰土也
```

Python Pseudo-Code (concept, not implementation):

```python
def parse_ids_line(line):
    """Parse one tab-delimited line from CJKVI-IDS ids.txt."""
    parts = line.strip().split('\t')
    return {
        'unicode': parts[0],       # U+5730
        'character': parts[1],     # 地
        'ids': parts[2],           # ⿰土也
        'components': extract_components(parts[2]),  # [土, 也]
    }

def extract_components(ids):
    """Drop IDS operators (U+2FF0..U+2FFB), keep component characters."""
    return [ch for ch in ids if not 0x2FF0 <= ord(ch) <= 0x2FFB]
```

Parsing Unihan (Tab-Delimited)#
Format: U+XXXX<tab>kField<tab>Value

```
U+5730	kRSUnicode	32.3
U+5730	kTotalStrokes	6
```

Python Pseudo-Code:

```python
from collections import defaultdict

def parse_unihan_file(file_path):
    char_data = defaultdict(dict)
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            if line.startswith('#') or not line.strip():
                continue
            codepoint, field, value = line.strip().split('\t', 2)
            char_data[codepoint][field] = value
    return char_data
```

Parsing makemeahanzi (JSON)#
Format: JSON objects, one per line

```
{"character": "地", "etymology": {"type": "pictophonetic", "phonetic": "也", "semantic": "土"}}
```

Python Pseudo-Code:

```python
import json

def parse_makemeahanzi(file_path):
    characters = {}
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data = json.loads(line)
                characters[data['character']] = data
    return characters
```

Next Steps#
If Building Educational Tool#
- Download makemeahanzi dictionary.txt
- Install pypinyin: pip install pypinyin
- Parse JSON for etymology, use pypinyin for pronunciation
- Implement Pattern 5 (Mnemonic Extraction) from S3
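Pattern 5 (mnemonic extraction) could start as simply as joining component glosses. The gloss table below is a hardcoded sample for illustration; a real tool would look glosses up in a dictionary and add pronunciation via pypinyin:

```python
# Sample glosses for illustration only; a real app would consult a dictionary.
GLOSSES = {'女': 'woman', '子': 'child', '氵': 'water'}

def mnemonic(char, components, meaning):
    """Build a '好 = woman + child = good'-style memory aid."""
    parts = ' + '.join(GLOSSES.get(c, c) for c in components)
    return f"{char} = {parts} = {meaning}"
```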
If Building Comprehensive Dictionary#
- Clone CJKVI-IDS (use ids.txt)
- Download Unihan Database
- Write parsers for both formats (see pseudo-code above)
- Implement Pattern 1 (Character Decomposition Lookup) from S3
- Cross-validate data sources (check for conflicts)
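Cross-validation can begin with a simple coverage diff between the parsed sources; the record shapes here are assumptions matching the parsers sketched above:

```python
def coverage_diff(ids_chars, unihan_codepoints):
    """Find characters present in one source but absent from the other.

    ids_chars: iterable of characters (e.g. keys from parsed CJKVI-IDS)
    unihan_codepoints: iterable of 'U+XXXX' strings (from parsed Unihan)
    """
    ids_cps = {f"U+{ord(c):04X}" for c in ids_chars}
    unihan_cps = set(unihan_codepoints)
    return {
        'missing_in_unihan': sorted(ids_cps - unihan_cps),
        'missing_in_ids': sorted(unihan_cps - ids_cps),
    }
```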
If Doing Research#
- Obtain Hsiao & Shillcock (2006) database (academic access)
- Combine all sources: CJKVI-IDS + Unihan + makemeahanzi
- Build validation pipeline (data quality metrics)
- Implement Pattern 7 (Similarity Analysis) and Pattern 8 (Reusability) from S3
Critical Warnings#
❌ Do NOT Use cjklib#
- No Python 3 support
- Abandoned (no updates in years)
- Fork (cjklib3) maintenance unclear
- Use direct data parsing instead
⚠️ Handle Regional Variants#
- Same Unicode codepoint ≠ same visual form
- Store multiple IDS per character
- Document locale handling policy
⚠️ License Compliance#
- CJKVI-IDS ids.txt: GPLv2
- CJKVI-IDS ids-ext-cdef.txt: no GPL restriction
- Choose file based on license requirements
⚠️ Data Version Pinning#
- CJKVI-IDS: Use git commit SHA (not release tag)
- Unihan: Track Unicode version number
- makemeahanzi: Use GitHub release tag
- Reproducibility requires version tracking
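One way to make the pins concrete is a small manifest plus a content-hash check on each downloaded file. The names and placeholder values below are illustrative, not real pins:

```python
import hashlib

# Hypothetical pin manifest; fill in real values when pinning.
PINS = {
    'cjkvi-ids': {'kind': 'git-commit-sha', 'value': '<sha>'},
    'unihan': {'kind': 'unicode-version', 'value': '<version>'},
    'makemeahanzi': {'kind': 'release-tag', 'value': '<tag>'},
}

def checksum(data: bytes) -> str:
    """SHA-256 of a downloaded data file, for verifying against a recorded pin."""
    return hashlib.sha256(data).hexdigest()
```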
Maintenance Strategy#
Quarterly Tasks#
- Check CJKVI-IDS for new commits (git log)
- Monitor Unicode announcements (new CJK extensions)
- Verify pypinyin still actively maintained (PyPI releases)
- Test data parsers against latest versions
Annual Tasks#
- Update Unihan database (new Unicode release)
- Re-download CJKVI-IDS (latest commit)
- Review makemeahanzi for updates
- Audit data conflicts (cross-source validation)
As-Needed#
- Contribute fixes to CJKVI-IDS if errors found
- Fork CJKVI-IDS if maintenance lapses
- Publish derived datasets (community contribution)
Further Reading#
Discovery Documents:
- S1: Ecosystem Scan - Overview of standards and tools
- S2: Library Comparison - Technical deep-dive
- S3: Use Case Patterns - 10 generic patterns
- S4: Viability Analysis - Long-term strategy
External Resources:
- Unicode Unihan Database (UAX #38) - Official spec
- CJKVI-IDS GitHub - Comprehensive IDS data
- makemeahanzi - Open-source character data
- Hacking Chinese: Phonetic Components - Learner guide
- Hsiao & Shillcock (2006) - Academic research
Research Completed: 2026-01-28 Status: Discovery phase complete (S1-S4) Next: Implementation phase (02-implementations/) if needed
S1: Rapid Discovery
S1-rapid: Approach#
Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.
S1 Rapid: Radical & Component Analysis Ecosystem Scan#
Status: Discovery research (no code execution) Created: 2026-01-28 Purpose: Quick ecosystem scan for Chinese character radical decomposition and component analysis
Overview#
Chinese character decomposition involves breaking down characters (汉字/漢字) into their constituent parts: radicals (部首), components, and stroke sequences. This is essential for:
- Etymology: Understanding character origins and historical development
- Mnemonic generation: Creating memory aids for language learners
- Character learning progression: Building curriculum from simple to complex components
- Computational linguistics: Enabling character-aware NLP applications
Key Standards & Data Sources#
1. Unihan Database (Unicode Consortium)#
Source: Unicode Han Database (UAX #38)
The official Unicode Han Database provides comprehensive data for all unified CJK ideographs:
Radical-Stroke Data: Based on 214 Kangxi radicals system (18th century)
- Format: kRSUnicode and kRSAdobe_Japan1_6 fields
- Structure: radical number + residual stroke count
- Example: Character 地 = Radical 土 (32) + 3 residual strokes
Interactive Access: Unihan Database Lookup
Radical Index: Unihan Radical-Stroke Index
Development: GitHub repository for expert review
Key Fields:
- kRSUnicode: Radical-stroke count
- kCangjie: Cangjie input method codes
- kPhonetic: Phonetic component information
- kSemanticVariant: Semantic variant mappings
2. IDS (Ideographic Description Sequences)#
Source: Unicode Standard Chapter 18
IDS provides a standardized way to represent character composition using Unicode operators (U+2FF0..U+2FFB):
- Purpose: Describe characters in featural terms by arrangement of components
- Operators: 12 special characters (⿰, ⿱, ⿲, etc.) for spatial arrangement
- ⿰ (U+2FF0): Left-to-right composition
- ⿱ (U+2FF1): Top-to-bottom composition
- ⿲ (U+2FF2): Left-middle-right composition
- etc.
Examples:
- 地 → ⿰土也 (left-right: 土 + 也)
- 花 → ⿱艹化 (top-bottom: 艹 + 化)
Technical Constraints:
- Maximum sequence length: 16 Unicode code points
- Useful for describing unencoded characters or characters missing from fonts
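Of the twelve operators, ⿲ (U+2FF2) and ⿳ (U+2FF3) take three operands and the rest take two, so an IDS string can be parsed with a small recursive-descent sketch:

```python
OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}  # ⿰ .. ⿻
TERNARY = {'\u2ff2', '\u2ff3'}                          # ⿲, ⿳ take 3 operands

def parse_ids(seq):
    """Parse an IDS string into a nested (operator, operands) tree."""
    def parse(pos):
        ch = seq[pos]
        if ch in OPERATORS:
            arity = 3 if ch in TERNARY else 2
            operands, pos = [], pos + 1
            for _ in range(arity):
                node, pos = parse(pos)
                operands.append(node)
            return (ch, operands), pos
        return ch, pos + 1   # leaf component
    tree, end = parse(0)
    if end != len(seq):
        raise ValueError(f'unparsed tail in IDS: {seq[end:]}')
    return tree

# parse_ids('⿱艹化') -> ('⿱', ['艹', '化'])
```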
Main IDS Database: CJKVI-IDS on GitHub - Comprehensive IDS data for CJK Unified Ideographs
3. Phonetic-Semantic Compounds (形聲字)#
Concept: 80% of Chinese characters are phonetic-semantic compounds combining:
- Semantic radical (義符): Indicates meaning category
- Phonetic component (聲符): Provides pronunciation hint
Key Resources:
Research Databases:
- Hsiao & Shillcock (2006): Chinese lexical database with phonetic compound decomposition into semantic/phonetic radicals
- Published in Journal of Psycholinguistic Research
- Includes Cangjie stroke pattern decomposition
- Contains pronunciation and frequency data
Python Libraries & Tools#
1. cjklib (Most Comprehensive)#
PyPI: cjklib Docs: cjklib.characterlookup
Capabilities:
- Character decomposition using IDS (Ideographic Description Sequences)
- Radical lookups (Kangxi radical system)
- Stroke decomposition
- Variant character information
- Pronunciation data (Pinyin, Wade-Giles, Cantonese, etc.)
- Glyph component analysis
Database Storage: IDS stored in SQLite database with Unicode IDS operators
Maintenance Status: Last release 0.3.2 (check PyPI for latest)
2. hanzipy#
GitHub: Synkied/hanzipy
Purpose: Chinese character NLP framework for language learners
Features:
- Character decomposition into radicals/components
- Learning-focused API
- Framework for character exploration
Target Audience: Language learners and educational applications
3. makemeahanzi (Data Only)#
GitHub: skishore/makemeahanzi Web: Make Me a Hanzi
Format: JSON data files, not a Python library
Data Sources: Derived from Unihan and cjklib
Content:
- Character etymology (type: ideographic/pictographic/pictophonetic)
- Phonetic/semantic component breakdowns
- Stroke order data
- Open-source, free to use
4. cjk-decomp (Decomposition Data)#
GitHub: amake/cjk-decomp
Scope: Decomposition data for 75,000+ CJK ideographs
- 36 stroke types
- 115 radicals
- 20,924 unified characters
- Extension sets (Ext-A, Ext-B, etc.)
Format: Data files (not a Python library, but can be parsed)
Web-Based Resources#
Reference & Etymology#
- Zhongwen.com: Character genealogies (zipu) showing interconnections between 4000+ characters based on Shuowen Jiezi
- Arch Chinese Radicals: Interactive radical reference
- Dong Chinese Wiki: Character wiki with etymological information
- Multi-function Chinese Character Database (漢語多功能字庫): University of Hong Kong free online dictionary with character origins
Technical References#
- BabelStone IDS Database: UTF-8 encoded plain text IDS for all CJK Unified Ideographs
- CJKVI GitHub Organization: Multiple databases (IDS, variants, etc.)
Quick Assessment#
Strengths#
✅ Unicode Standard Foundation: Unihan is official, comprehensive, regularly maintained
✅ IDS Standardization: Structural decomposition with standardized operators
✅ Multiple Python Options: From comprehensive (cjklib) to learner-focused (hanzipy)
✅ Open Data Available: makemeahanzi, cjk-decomp, CJKVI-IDS all freely accessible
✅ Rich Academic Research: Hsiao & Shillcock database, phonetic compound research
Gaps & Considerations#
⚠️ Library Maintenance: Need to verify current maintenance status of cjklib, hanzipy
⚠️ Data Quality: Different sources may have conflicting decompositions (historical vs. modern)
⚠️ Traditional vs. Simplified: Need to handle both character sets
⚠️ Variant Forms: Regional variants (China, Taiwan, Hong Kong, Japan, Korea) complicate decomposition
Next Steps for S2-S4#
S2 Comprehensive:
- Deep dive into cjklib API and database schema
- Compare IDS coverage across CJKVI-IDS vs. BabelStone vs. cjklib
- Analyze phonetic-semantic compound databases (Hsiao & Shillcock data)
- Review academic literature on character decomposition approaches
S3 Need-Driven:
- Generic use cases: etymology lookup, mnemonic generation, learning progression
- Pattern: “Given character X, find all components”
- Pattern: “Given radical Y, find all characters containing it”
- Pattern: “Find phonetic series (characters sharing phonetic component)”
S4 Strategic:
- Unicode Consortium maintenance trajectory
- Python library ecosystem health
- Integration with modern NLP pipelines (transformers, embeddings)
- Traditional vs. simplified character handling strategies
Sources:
- UAX #38: Unicode Han Database
- Unihan Database Lookup
- Ideographic Description Characters (Wikipedia)
- CJKVI-IDS GitHub
- cjklib PyPI
- Hacking Chinese: Phonetic Components
- Analysis of a Chinese Phonetic Compound Database
- Make Me a Hanzi
S1-rapid Recommendations#
Use cjkradlib or HanziJS for programmatic access. CHISE IDS database for comprehensive data. Unihan for Unicode standard mappings.
S2: Comprehensive
S2-comprehensive: Approach#
Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.
S2 Comprehensive: Library & Data Source Deep-Dive#
Status: Discovery research (no code execution) Created: 2026-01-28 Purpose: Deep technical analysis of libraries, data sources, and formats for character decomposition
Python Library Landscape#
1. cjklib (Comprehensive but Inactive)#
Repository: cburgmer/cjklib on GitHub PyPI: cjklib Latest Version: 0.3.2 Maintenance Status: ❌ INACTIVE (Snyk analysis)
Maintenance Indicators:
- No releases to PyPI in past 12+ months
- Past year: 0 issues, 0 pull requests, 0 issue authors, 0 PR authors
- Classified as discontinued or receiving minimal attention
- Weekly downloads: ~70 (limited popularity)
- Python 3 Support: NO - major blocker for modern projects
Technical Capabilities:
IDS (Ideographic Description Sequences):
- Hierarchical character decomposition stored in database
- Binary operators (⿰ for left-right: 好 → ⿰女子)
- Trinary operators (⿲ for three components: 辨 → ⿲⾟刂⾟)
- Handles non-distinct partitioning, unencoded components, multi-level hierarchies
Kangxi Radicals:
- Character-to-radical mapping via Unihan integration
- Radical-to-character lookup (find all chars with specific radical)
- Variant handling: Unicode radical forms (⾔), variants (⻈), equivalent chars (言/讠)
Database Architecture:
- Uses DatabaseConnector supporting multiple DB systems
- Default: SQLite (cjklib.db)
- Configuration via cjklib.conf
- Core data storage: IDS sequences with Unicode operators
API Methods (from documentation):
- getReadingForCharacter() - pronunciation lookups
- getCharactersForReading() - reverse lookup by pronunciation
- getDefaultGlyph() - locale-specific glyph mappings
- getAvailableCharacterDomains() - supported character sets
Assessment:
- ✅ Most comprehensive feature set
- ❌ Inactive development
- ❌ No Python 3 support
- ⚠️ Risk: Unmaintained dependencies, compatibility issues
Potential: Fork exists as cjklib3 (Python 3 port) - needs verification of maintenance status
2. hanzipy (Learner-Focused)#
Repository: Synkied/hanzipy on GitHub PyPI: hanzipy 1.0.4 Maintenance Status: ⚠️ UNCLEAR (no recent activity visible in 2025 search results)
Purpose: Chinese character NLP framework for language learners
Features:
- Character decomposition into radicals/components
- Learning-focused API design
- Framework for character exploration
Assessment:
- ✅ Learner-friendly design
- ⚠️ Limited documentation on advanced features
- ⚠️ Unclear maintenance trajectory
- ⚠️ Smaller community than cjklib
Use Case Fit: Better for educational tools than computational linguistics research
3. pypinyin / python-pinyin (Active, Production-Ready)#
Repository: mozillazg/python-pinyin PyPI: pypinyin Latest Release: July 20, 2025 Python Support: 2.7, 3.5 through 3.13 Maintenance Status: ✅ ACTIVELY MAINTAINED
Primary Focus: Pinyin/Zhuyin/Cyrillic conversion
- Heteronym support (characters with multiple readings)
- Recommended for PRC-style simplified characters
- Most commonly used Python package for modern Chinese
Assessment for Radical Decomposition:
- ❌ Not focused on structural decomposition
- ✅ Excellent for pronunciation/reading features
- ✅ Strong maintenance and community
Recommendation: Use for pronunciation features, combine with IDS data source for decomposition
4. Other Alternatives#
dragonmapper (PyPI):
- Conversion between characters/Pinyin/Zhuyin/IPA
- Traditional vs. Simplified identification
- Not focused on structural decomposition
hanzitools (jcklie/hanzitools on GitHub):
- Heisig entry lookups
- CEDICT translation integration
- Recommends pypinyin for Pinyin conversion
- Limited decomposition features
Data Source Deep-Dive#
1. CJKVI-IDS (Primary Recommendation)#
Repository: cjkvi/cjkvi-ids on GitHub Last Release: February 20, 2018 (v18.02.20) Commits: 158 (ongoing development) License: GPLv2 (most files), some unrestricted
Data Files (10 total):
| File | Purpose | License |
|---|---|---|
| ids.txt | Main IDS data (from CHISE project) | GPLv2 |
| ids-cdp.txt | With Academia Sinica CDP PUA chars | GPLv2 |
| ids-ext-cdef.txt | Extended IDS (Ext-C/D/E/F) | ❗ No GPL restriction |
| ids-analysis.txt | Analysis data | GPLv2 |
| hanyo-ids.txt | Hanyo-specific IDS | GPLv2 |
| waseikanji-ids.txt | Japanese Waseikanji | GPLv2 |
| ws2015-ids.txt | 2015 worksheet | GPLv2 |
| ws2015-ids-cdp.txt | 2015 worksheet with CDP | GPLv2 |
| ucs-strokes.txt | Stroke count info | GPLv2 |
Format: Plain text, tab-delimited
- Column 1: Unicode codepoint (U+XXXX)
- Column 2: Character
- Column 3: IDS sequence using U+2FF0..U+2FFB operators
Example:

```
U+5730	地	⿰土也
U+82B1	花	⿱艹化
```

Coverage: CJK Unified Ideographs (URO + Extensions)
Programmatic Usage:
- Text-based format → easy parsing with Python
- Companion tool: kawabata/ids for IDS normalization
- Font requirement: HanaMin/HanaMin AFDKO for full character display
Assessment:
- ✅ Most comprehensive open IDS database
- ✅ Easy to parse (plain text)
- ✅ Actively developed (158 commits)
- ✅ Covers extended character sets
- ⚠️ Last release 2018 (but ongoing commits)
- ⚠️ GPLv2 licensing for most files
Comparison to Alternatives:
- BabelStone IDS: UTF-8 plain text, all CJK Unified Ideographs
- cjklib embedded IDS: Database format, requires library installation
- makemeahanzi: JSON format, includes etymology types
2. Unihan Database (Official Unicode Standard)#
Specification: UAX #38 Lookup Interface: Unihan Database Lookup Development: unicode-org/unihan-database on GitHub
Relevant Fields for Radical/Component Analysis:
| Field | Description | Example |
|---|---|---|
| kRSUnicode | Radical-stroke count (Unicode) | 32.3 (Radical 土 + 3 strokes) |
| kRSAdobe_Japan1_6 | Radical-stroke (Adobe-Japan1-6) | Variant system |
| kRSKangXi | Kangxi Dictionary radical-stroke | Traditional reference |
| kTotalStrokes | Total stroke count | 6 (for 地) |
| kCangjie | Cangjie input code | Stroke-based decomposition |
| kPhonetic | Phonetic component index | For phonetic compounds |
| kSemanticVariant | Semantic variants | Related characters |
Format: Tab-delimited text files
- Unihan_RadicalStrokeCounts.txt
- Unihan_Readings.txt
- Unihan_Variants.txt
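The kRSUnicode values in these files split into radical number and residual stroke count; per UAX #38 an apostrophe after the radical number marks a simplified-radical form, which the sketch below surfaces as a flag:

```python
def parse_rs_unicode(value):
    """Split a kRSUnicode value like '32.3' into its parts.

    An apostrophe after the radical number (e.g. "120'.4") marks a
    simplified-radical form; returned as a boolean flag.
    """
    radical, _, residual = value.partition('.')
    simplified = "'" in radical
    return int(radical.rstrip("'")), int(residual), simplified

# parse_rs_unicode('32.3') -> (32, 3, False), i.e. radical 土 + 3 strokes
```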
Assessment:
- ✅ Official Unicode standard (authoritative)
- ✅ Regularly updated by consortium
- ✅ Comprehensive coverage of all encoded CJK chars
- ✅ Well-documented specification
- ⚠️ Radical-based (not full structural decomposition)
- ⚠️ Requires parsing multiple files for full data
Use Case: Combine with CJKVI-IDS for comprehensive solution:
- Unihan → Radical, stroke count, phonetic info
- CJKVI-IDS → Full structural decomposition
3. makemeahanzi (Open-Source Character Data)#
Repository: skishore/makemeahanzi on GitHub Website: Make Me a Hanzi Format: JSON data files License: Open-source, free to use Data Sources: Derived from Unihan and cjklib
JSON Structure (from website):
```json
{
  "character": "花",
  "etymology": {
    "type": "pictophonetic",
    "hint": "plants",
    "phonetic": "化",
    "semantic": "艹"
  },
  "strokes": [...],
  "medians": [...]
}
```

Etymology Types:
- ideographic: Meaning-based composition
- pictographic: Picture representation
- pictophonetic: Phonetic-semantic compound (形聲字)
Phonetic-Semantic Fields (when type = “pictophonetic”):
- hint: Semantic category hint
- phonetic: Phonetic component (string, may be null)
- semantic: Semantic radical (string, may be null)
Assessment:
- ✅ JSON format (easy integration)
- ✅ Includes etymology type classification
- ✅ Phonetic-semantic decomposition
- ✅ Open-source, permissive license
- ⚠️ Derived data (not primary source)
- ⚠️ Limited to common characters (not full Unicode coverage)
Best Use: Educational tools, mnemonic generation, character learning apps
4. Hsiao & Shillcock Phonetic Compound Database#
Publication: Analysis of a Chinese Phonetic Compound Database (2006) Journal: Journal of Psycholinguistic Research
Content:
- Most frequent phonetic compounds
- Decomposition into semantic/phonetic radicals (etymologically accurate)
- Further decomposition into Cangjie stroke patterns
- Pronunciation data included
- Character frequency information
Format: Academic dataset (not open GitHub repo) Access: May require contacting authors or institutional access
Assessment:
- ✅ Academically rigorous
- ✅ Etymologically accurate decompositions
- ✅ Includes frequency data (useful for learning progression)
- ❌ Limited availability (not easily downloadable)
- ❌ Focused on frequent characters only
- ⚠️ Published 2006 (may need updates for newer Unicode)
Use Case: Research reference for validating other data sources
Data Quality Comparison#
IDS Coverage Comparison#
| Source | Coverage | Format | Maintenance | License |
|---|---|---|---|---|
| CJKVI-IDS | Full URO + Ext | Plain text | Active commits | GPLv2 |
| BabelStone IDS | All CJK Unified | Plain text | Unknown | Unknown |
| cjklib embedded | URO + common | SQLite | Inactive | BSD-like |
| makemeahanzi | Common chars | JSON | Active | Open |
Recommendation: CJKVI-IDS for comprehensive coverage, makemeahanzi for educational/learner focus
Radical System Comparison#
| System | Radicals | Source | Use Case |
|---|---|---|---|
| Kangxi (康熙) | 214 | Traditional standard | Unicode Unihan, dictionaries |
| Simplified | 189 | PRC standard | Mainland China education |
| Unicode Radical | 214 + variants | Unicode Standard | Digital text processing |
Key Insight: Multiple valid radical systems exist. Unihan uses Kangxi as baseline with variants.
Phonetic-Semantic Decomposition Sources#
| Source | Phonetic Info | Semantic Info | Coverage | Format |
|---|---|---|---|---|
| makemeahanzi | ✅ Component | ✅ Component | Common | JSON |
| Unihan kPhonetic | ✅ Index only | ❌ Limited | Full | Text |
| Hsiao & Shillcock | ✅ Full | ✅ Full | Frequent | Academic |
| CJKVI-IDS | ❌ No | ❌ No | Full | Text |
Gap Identified: No comprehensive open database combining:
- Full Unicode coverage (CJKVI-IDS level)
- Phonetic-semantic component classification (makemeahanzi level)
- Etymology type annotation (ideographic/pictographic/pictophonetic)
Workaround: Combine multiple sources programmatically
Integration Strategy#
Recommended Data Stack#
Layer 1: Structural Decomposition
- Primary: CJKVI-IDS (ids.txt) for full structural decomposition
- Fallback: Parse Unihan for characters not in CJKVI-IDS
Layer 2: Radical Information
- Primary: Unihan kRSUnicode for Kangxi radical + stroke count
- Enhancement: Unihan kRSKangXi for traditional dictionary reference
Layer 3: Phonetic-Semantic Classification
- Primary: makemeahanzi for common characters (JSON)
- Research: Hsiao & Shillcock for validation/academic reference
Layer 4: Pronunciation & Readings
- Primary: pypinyin for modern Mandarin pinyin
- Enhancement: Unihan readings for historical/variant pronunciations
Library Selection by Use Case#
Use Case: Educational Tool / Language Learning App
- Data: makemeahanzi (etymology, mnemonics) + pypinyin (pronunciation)
- Library: Custom parser (JSON + pypinyin library)
- Rationale: JSON format easy, mnemonic focus, active maintenance
Use Case: Computational Linguistics Research
- Data: CJKVI-IDS (full coverage) + Unihan (radicals/metadata)
- Library: Custom parser or cjklib3 fork (if verified maintained)
- Rationale: Comprehensive data, research flexibility
Use Case: Dictionary/Reference Application
- Data: Unihan (comprehensive) + CJKVI-IDS (structure) + pypinyin (pronunciation)
- Library: Custom integration layer
- Rationale: Multiple data sources needed, no single library sufficient
Use Case: Quick Prototype
- Data: makemeahanzi JSON
- Library: Standard Python JSON parsing
- Rationale: Fastest time-to-value, single file
Technical Debt & Risk Assessment#
cjklib (Inactive)#
- Risk: No Python 3 support, abandoned project
- Mitigation: Fork to cjklib3 OR extract database and write custom parser
- Effort: High (forking), Medium (custom parser)
CJKVI-IDS (2018 release)#
- Risk: Stale release despite active commits
- Mitigation: Use latest GitHub main branch, not release tag
- Effort: Low (just change download source)
Multiple Data Source Dependencies#
- Risk: Conflicting decompositions, synchronization issues
- Mitigation: Document source precedence rules, version pinning
- Effort: Medium (governance + tooling)
License Compliance#
- Risk: GPLv2 for CJKVI-IDS main file
- Mitigation: Use ids-ext-cdef.txt (no GPL restriction) OR comply with GPLv2
- Effort: Low (file selection) or Medium (GPL compliance)
Next Steps#
For S3 (Need-Driven Patterns):
- Map generic use cases to data source combinations
- Identify gaps in current data sources
- Prototype parsing strategies (pseudo-code level, no implementation)
For S4 (Strategic Analysis):
- Unicode Consortium roadmap for Unihan updates
- Python ecosystem trend analysis (active vs. dormant projects)
- Traditional vs. Simplified handling strategies
- CJK variant forms (Japan/Korea/Vietnam) decomposition differences
Sources:
- cburgmer/cjklib GitHub
- cjklib PyPI
- Snyk: cjklib Health Analysis
- cjklib Documentation
- pypinyin PyPI
- python-pinyin README
- CJKVI-IDS GitHub
- Unihan Database (UAX #38)
- makemeahanzi GitHub
- Hsiao & Shillcock (2006) Paper
S2-comprehensive Recommendations#
Use cjkradlib or HanziJS for programmatic access. CHISE IDS database for comprehensive data. Unihan for Unicode standard mappings.
S3: Need-Driven
S3-need-driven: Approach#
Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.
S3-need-driven Recommendations#
Use cjkradlib or HanziJS for programmatic access. CHISE IDS database for comprehensive data. Unihan for Unicode standard mappings.
S3 Need-Driven: Generic Use Case Patterns#
Status: Discovery research (no code execution) Created: 2026-01-28 Purpose: Generic use case patterns for character decomposition (NOT application-specific)
Pattern Philosophy#
Per DISCOVERY_VS_IMPLEMENTATION.md guidance:
- ✅ Generic patterns applicable to any developer
- ✅ Technology-neutral descriptions
- ❌ NO application-specific requirements (“for our app…”)
- ❌ NO implementation plans (“install and test…”)
Pattern 1: Character Decomposition Lookup#
Generic Need: Given a Chinese character, retrieve its structural components
Input: Single character (e.g., 花) Output: Structural decomposition with spatial relationship
Data Requirements:
- IDS sequence (⿱艹化)
- Component list ([艹, 化])
- Spatial operator (top-bottom arrangement)
Example Variations:
- Flat decomposition: 花 → [艹, 化]
- Hierarchical decomposition: 花 → [艹, [亻, 匕]] (recursive)
- IDS string representation: “⿱艹化”
Data Sources:
- Primary: CJKVI-IDS `ids.txt`
- Fallback: cjklib database, makemeahanzi JSON
Complexity Considerations:
- Single valid decomposition vs. multiple valid partitions
- Unencoded components (exist only within larger characters)
- Variant forms (simplified vs. traditional differences)
Generic Use Cases:
- Dictionary lookup enhancement (show components)
- Educational tools (character structure visualization)
- Font rendering (construct missing glyphs)
- Mnemonic generation (break into memorable parts)
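The flat and hierarchical lookups above can be sketched in a few lines of Python. The tab-separated layout (codepoint, character, IDS) follows the published CJKVI-IDS `ids.txt`; treat the parsing details as assumptions — real entries may carry additional IDS fields or regional tags that this sketch ignores.

```python
# Unicode Ideographic Description Characters (core set, U+2FF0..U+2FFB)
IDS_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

def parse_ids_line(line):
    """Parse one ids.txt-style record into (character, first IDS string).

    Returns None for comments and blank lines. Assumed layout:
    'U+XXXX<TAB>char<TAB>IDS[...]'.
    """
    if not line.strip() or line.startswith(("#", ";")):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    return fields[1], fields[2]

def flat_components(ids):
    """Flat decomposition: drop spatial operators, keep component characters."""
    return [ch for ch in ids if ch not in IDS_OPERATORS]

char, ids = parse_ids_line("U+82B1\t花\t⿱艹化")
assert (char, ids) == ("花", "⿱艹化")
assert flat_components(ids) == ["艹", "化"]
```

The hierarchical variant (花 → [艹, [亻, 匕]]) would apply `flat_components` recursively to each component that itself has an IDS entry.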
Pattern 2: Radical-Based Character Search#
Generic Need: Find all characters containing a specific radical
Input: Radical (e.g., 土 “earth”) Output: List of characters containing that radical
Data Requirements:
- Radical-to-character index
- Optional: stroke count for sub-filtering
- Optional: character frequency for sorting
Example Queries:
- All characters with radical 土: [地, 坐, 场, 城, …]
- Characters with radical 土 and 6 total strokes: [地, 圾, …]
- Top 100 most frequent characters with 土: sorted by usage
Data Sources:
- Primary: Unihan `kRSUnicode` field
- Enhancement: Character frequency data (Unihan `kFrequency` or external corpus)
- Alternative: Reverse index from CJKVI-IDS (parse components)
Radical System Considerations:
- Kangxi radicals (214) vs. Simplified radicals (189)
- Radical variants (讠 vs. 言)
- Unicode radical characters vs. component characters
Generic Use Cases:
- Dictionary browsing (traditional radical index navigation)
- Character learning (study characters by semantic category)
- Input method optimization (radical-based typing)
- Corpus analysis (semantic field distribution)
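A minimal sketch of the radical-to-character index, assuming Unihan's published line format for `kRSUnicode` (`U+XXXX<TAB>kRSUnicode<TAB>radical.residual`, where the radical number may carry a trailing apostrophe for simplified radical forms). The sample lines are illustrative, not a full data set.

```python
from collections import defaultdict

def build_radical_index(unihan_lines):
    """Build {Kangxi radical number: [characters]} from kRSUnicode lines."""
    index = defaultdict(list)
    for line in unihan_lines:
        if line.startswith("#") or "\tkRSUnicode\t" not in line:
            continue
        cp, _, value = line.strip().split("\t")
        # value may hold several "radical.residual" entries; a trailing '
        # marks the simplified-form radical (e.g. "120'.4")
        for rs in value.split():
            radical = int(rs.split(".")[0].rstrip("'"))
            index[radical].append(chr(int(cp[2:], 16)))
    return index

lines = ["U+5730\tkRSUnicode\t32.3", "U+571F\tkRSUnicode\t32.0"]
idx = build_radical_index(lines)
assert idx[32] == ["地", "土"]  # radical 32 = 土
```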
Pattern 3: Phonetic Series Identification#
Generic Need: Find characters sharing the same phonetic component
Input: Phonetic component (e.g., 青 “qing”) Output: List of characters using that phonetic component with readings
Data Requirements:
- Phonetic-semantic compound classification
- Phonetic component extraction
- Pronunciation data for validation
Example Queries:
- Characters with phonetic 青: [清 qīng, 情 qíng, 请 qǐng, 晴 qíng, …]
- Show semantic radical for each: [清(氵), 情(忄), 请(讠), 晴(日)]
- Pronunciation similarity analysis: all share “qing” pronunciation
Data Sources:
- Primary: makemeahanzi etymology fields (phonetic/semantic)
- Research: Hsiao & Shillcock phonetic compound database
- Validation: Unihan `kPhonetic` field + pronunciation data
Pattern Insights:
- ~80% of Chinese characters are phonetic-semantic compounds
- Phonetic component often indicates pronunciation (not always exact)
- Same phonetic in different semantic contexts = different meanings
Generic Use Cases:
- Mnemonic generation (learn pronunciation from phonetic)
- Character learning (phonetic family grouping)
- Historical linguistics (phonetic evolution study)
- OCR validation (phonetic similarity for error correction)
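Phonetic-series grouping can be sketched as an inverted index over etymology records. The record shape (`etymology.type`, `etymology.phonetic`, `etymology.semantic`) mirrors makemeahanzi's published JSON fields; the three records below are abridged illustrations, not the real file.

```python
from collections import defaultdict

# Illustrative makemeahanzi-shaped records (abridged)
RECORDS = [
    {"character": "清", "etymology": {"type": "pictophonetic", "phonetic": "青", "semantic": "氵"}},
    {"character": "情", "etymology": {"type": "pictophonetic", "phonetic": "青", "semantic": "忄"}},
    {"character": "晴", "etymology": {"type": "pictophonetic", "phonetic": "青", "semantic": "日"}},
]

def phonetic_series(records):
    """Group characters by shared phonetic component."""
    series = defaultdict(list)
    for rec in records:
        ety = rec.get("etymology") or {}
        if ety.get("type") == "pictophonetic" and ety.get("phonetic"):
            series[ety["phonetic"]].append(rec["character"])
    return series

assert phonetic_series(RECORDS)["青"] == ["清", "情", "晴"]
```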
Pattern 4: Etymology Tree Generation#
Generic Need: Trace a character’s historical development and component relationships
Input: Target character (e.g., 森) Output: Etymology tree showing component hierarchy
Decomposition Levels:
- Level 0: Target character (森)
- Level 1: Direct components (木 + 木 + 木)
- Level 2: Component components (if applicable)
- Level N: Minimal strokes or radicals
Data Requirements:
- Recursive IDS parsing
- Etymology type classification (pictographic/ideographic/pictophonetic)
- Historical form variants (oracle bone → seal script → modern)
Example Tree (simplified):

```
森 (forest)
├─ 木 (tree) [pictographic]
├─ 木 (tree) [pictographic]
└─ 木 (tree) [pictographic]
Etymology type: ideographic compound (multiple trees = forest)
```

Example Tree (compound):

```
想 (think)
├─ 相 (mutual/appearance) [phonetic component]
│  ├─ 木 (tree) [radical]
│  └─ 目 (eye) [component]
└─ 心 (heart) [semantic radical]
Etymology type: phonetic-semantic (heart + appearance → thinking)
```

Data Sources:
- Structure: CJKVI-IDS (recursive parsing)
- Etymology type: makemeahanzi `etymology.type` field
- Historical forms: Specialized databases (Shuowen Jiezi, oracle bone scripts)
Termination Conditions:
- Reach minimal stroke components (一, 丨, etc.)
- Reach pictographic radicals (no further decomposition)
- Reach unencoded component (no Unicode codepoint)
Generic Use Cases:
- Educational tools (visualize character structure)
- Mnemonic storytelling (understand meaning from components)
- Historical linguistics research (trace character evolution)
- Character learning curriculum (teach simple before complex)
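Recursive tree generation follows directly from Pattern 1's lookup plus the termination conditions above. In this sketch the `DECOMP` table stands in for a parsed CJKVI-IDS file and contains only illustrative entries; anything absent from the table is treated as terminal.

```python
IDS_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

# Illustrative char -> IDS mapping (a real table comes from CJKVI-IDS)
DECOMP = {"想": "⿱相心", "相": "⿰木目"}

def decompose(char, depth=0, max_depth=5):
    """Return a nested [char, [children...]] tree; terminals are bare chars.

    Terminates on characters with no IDS entry (radicals, minimal strokes,
    unencoded components) or when max_depth is hit.
    """
    ids = DECOMP.get(char)
    if ids is None or depth >= max_depth:
        return char
    parts = [c for c in ids if c not in IDS_OPERATORS]
    return [char, [decompose(p, depth + 1) for p in parts]]

assert decompose("想") == ["想", [["相", ["木", "目"]], "心"]]
```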
Pattern 5: Mnemonic Component Extraction#
Generic Need: Extract semantically meaningful components for memory aids
Input: Character + desired mnemonic style Output: Components with semantic hints
Mnemonic Types:
- Semantic decomposition: 好 = 女 (woman) + 子 (child) → “woman with child = good”
- Phonetic hint: 清 = 氵(water) + 青 (qing) → “water that sounds like ‘qing’”
- Pictographic story: 森 = 木+木+木 → “many trees = forest”
Component Filtering:
- Prioritize semantically meaningful components (radicals)
- De-emphasize phonetic components for semantic mnemonics
- Identify pictographic components for visual mnemonics
Data Requirements:
- IDS decomposition (structure)
- Etymology type (phonetic vs. semantic vs. pictographic)
- Radical semantic category (Kangxi radical meanings)
- Component meanings (individual character definitions)
Example Processing:

```
Character: 想 (think)
├─ Decomposition: ⿱相心
├─ Components: [相, 心]
├─ Etymology: phonetic-semantic
├─ Semantic radical: 心 (heart)
├─ Phonetic component: 相 (xiāng → xiǎng)
└─ Mnemonic: "Thinking (xiǎng) comes from the heart (心), using mutual (相) understanding"
```

Data Sources:
- Structure: CJKVI-IDS
- Etymology: makemeahanzi
- Radical meanings: Kangxi radical semantic categories
- Component definitions: Dictionary database (CEDICT, CC-CEDICT)
Generic Use Cases:
- Flashcard apps (auto-generate memory aids)
- Character learning books (etymology explanations)
- Educational games (story-based character teaching)
- Spaced repetition systems (memorable hints)
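Component filtering for mnemonics can be sketched against makemeahanzi-shaped entries (`decomposition`, `etymology.semantic`, `etymology.phonetic`); the entry and the gloss table below are illustrative stand-ins, not real dictionary data.

```python
# Illustrative entry in the shape makemeahanzi uses (abridged)
ENTRY = {
    "character": "想",
    "decomposition": "⿱相心",
    "etymology": {"type": "pictophonetic", "semantic": "心", "phonetic": "相"},
}
MEANINGS = {"心": "heart", "相": "mutual/appearance"}  # from a dictionary DB

def mnemonic_parts(entry):
    """Split components into semantic vs. phonetic roles, with glosses."""
    ety = entry["etymology"]
    return {
        "semantic": (ety["semantic"], MEANINGS.get(ety["semantic"])),
        "phonetic": (ety["phonetic"], MEANINGS.get(ety["phonetic"])),
    }

parts = mnemonic_parts(ENTRY)
assert parts["semantic"] == ("心", "heart")
assert parts["phonetic"] == ("相", "mutual/appearance")
```

A semantic mnemonic would foreground the `semantic` tuple and de-emphasize the `phonetic` one, per the filtering rules above.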
Pattern 6: Learning Progression Sequencing#
Generic Need: Order characters from simple to complex for curriculum design
Input: Set of characters to learn Output: Ordered sequence from simple components to complex compounds
Complexity Metrics:
- Stroke count: Fewer strokes = simpler
- Component count: Fewer components = simpler
- Decomposition depth: Shallow hierarchy = simpler
- Component frequency: Common components taught first
Sequencing Algorithm (generic pseudo-logic):

```
1. Identify all unique components across character set
2. Build dependency graph (character depends on its components)
3. Topological sort: teach components before compounds
4. Within same level, sort by stroke count (ascending)
5. Within same stroke count, sort by frequency (descending)
```

Example Sequence:

```
Level 0 (Radicals/Simple):
一 (1 stroke) → 二 (2) → 十 (2) → 人 (2) → 三 (3) → 木 (4) → 水 (4)
Level 1 (Simple Compounds):
好 (女+子, 6 strokes) → 休 (人+木, 6) → 林 (木+木, 8)
Level 2 (Complex Compounds):
森 (木+木+木, 12) → 想 (相+心, 13)
```

Data Requirements:
- IDS decomposition (dependency graph)
- Stroke count data (Unihan `kTotalStrokes`)
- Character frequency (Unihan `kFrequency` or corpus data)
- Component reusability (how many characters use this component)
Data Sources:
- Structure: CJKVI-IDS
- Stroke count: Unihan `kTotalStrokes` or `ucs-strokes.txt`
- Frequency: Unihan `kFrequency`, SUBTLEX-CH, or web corpus
- HSK/TOCFL levels (for educational sequencing)
Generic Use Cases:
- Textbook curriculum design (character introduction order)
- Adaptive learning systems (personalized progression)
- Character workbook generation (practice sheets)
- Font subset optimization (include components first)
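The pseudo-logic above maps naturally onto Python's standard-library `graphlib`. This sketch implements steps 2-4 (dependency graph, topological sort, stroke-count tie-breaking); the component and stroke tables are illustrative stand-ins for CJKVI-IDS and Unihan `kTotalStrokes` data.

```python
from graphlib import TopologicalSorter

# Illustrative data (real data: CJKVI-IDS decomposition, Unihan strokes)
COMPONENTS = {"好": ["女", "子"], "林": ["木"], "森": ["木", "林"]}
STROKES = {"女": 3, "子": 3, "木": 4, "好": 6, "林": 8, "森": 12}

def teaching_order(chars):
    """Components before compounds; ties within a dependency level break
    by ascending stroke count."""
    ts = TopologicalSorter()
    for ch in chars:
        ts.add(ch, *COMPONENTS.get(ch, []))  # char depends on its components
    ts.prepare()
    order = []
    while ts.is_active():
        ready = sorted(ts.get_ready(), key=lambda c: STROKES.get(c, 0))
        order.extend(ready)
        ts.done(*ready)
    return order

order = teaching_order(["森", "好", "林"])
assert order.index("木") < order.index("林") < order.index("森")
assert order.index("女") < order.index("好")
```

Step 5 (frequency within equal stroke counts) would just extend the sort key to a `(strokes, -frequency)` tuple.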
Pattern 7: Character Similarity Analysis#
Generic Need: Find characters with similar structure or components
Input: Query character Output: Ranked list of structurally similar characters
Similarity Dimensions:
- Component overlap: Share N components
- Structural similarity: Same IDS operators (both ⿰ left-right)
- Radical similarity: Same Kangxi radical
- Phonetic similarity: Share phonetic component
- Visual similarity: OCR confusion pairs
Example Queries:
- Characters structurally similar to 清: [请, 晴, 情] (all ⿰X青)
- Characters with same radical as 清: [河, 海, 泪] (all 氵)
- Characters visually similar to 大: [太, 犬, 天] (OCR confusion)
Similarity Scoring (generic approach):

```
Jaccard similarity: intersection(components) / union(components)
Structural similarity: same IDS operator = +1 point
Radical match: same Kangxi radical = +2 points
Phonetic match: same phonetic component = +1 point
```

Data Requirements:
- IDS decomposition (component extraction)
- Radical classification (Kangxi radical)
- Phonetic-semantic classification
- Optional: Stroke-level similarity (for OCR)
Data Sources:
- Structure: CJKVI-IDS
- Radicals: Unihan `kRSUnicode`
- Phonetics: makemeahanzi etymology
- Visual confusion: OCR error datasets (research papers)
Generic Use Cases:
- OCR post-processing (similar character suggestions)
- Character learning (study confusable pairs)
- Dictionary features (related characters navigation)
- Text correction (typo detection)
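The scoring rules can be sketched directly. The component, radical, and phonetic tables below are illustrative stand-ins for CJKVI-IDS, Unihan `kRSUnicode`, and makemeahanzi etymology data.

```python
def jaccard(a, b):
    """Component-overlap similarity: |intersection| / |union|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Illustrative lookup tables
COMPONENTS = {"清": ["氵", "青"], "请": ["讠", "青"], "晴": ["日", "青"], "河": ["氵", "可"]}
RADICAL = {"清": "氵", "请": "讠", "晴": "日", "河": "氵"}
PHONETIC = {"清": "青", "请": "青", "晴": "青", "河": "可"}

def similarity(a, b):
    score = jaccard(COMPONENTS[a], COMPONENTS[b])
    if RADICAL[a] == RADICAL[b]:
        score += 2   # radical match
    if PHONETIC[a] == PHONETIC[b]:
        score += 1   # phonetic match
    return score

# With these weights, sharing a radical (清/河) outranks sharing a phonetic (清/请)
assert similarity("清", "河") > similarity("清", "请")
```

The weights are arbitrary; a real system would tune them per use case (e.g. weight visual confusion pairs heavily for OCR).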
Pattern 8: Component Reusability Analysis#
Generic Need: Identify high-value components (used in many characters)
Input: Character database Output: Components ranked by reusability
Reusability Metrics:
- Character count: How many characters contain this component
- Frequency-weighted: Weighted by character usage frequency
- Pedagogical value: Appears in HSK 1-6 vocabulary
Example High-Reusability Components:

```
口 (mouth): appears in 1000+ characters
- 吃, 喝, 叫, 唱, 吗, 呢, 哪, 呀, ...
- High pedagogical value (early HSK levels)
氵(water radical): appears in 800+ characters
- 水, 河, 海, 洋, 湖, 江, 清, 洗, ...
- Semantic category: water-related meanings
```

Analysis Output:
- Top 100 most reusable components (by character count)
- Components by semantic category (214 Kangxi radicals)
- Components by pedagogical level (HSK 1-6 coverage)
Data Requirements:
- Component extraction from all characters
- Character frequency data (usage weighting)
- HSK/TOCFL level data (pedagogical value)
Data Sources:
- Components: CJKVI-IDS (parse all IDS)
- Frequency: Web corpus or Unihan `kFrequency`
- HSK levels: HSK vocabulary lists (external)
Generic Use Cases:
- Curriculum design (teach high-value components first)
- Font subsetting (prioritize reusable components)
- Input method optimization (frequent component shortcuts)
- Mnemonic generation (focus on common building blocks)
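Unweighted reusability counting reduces to a frequency count over parsed IDS strings; a sketch with a small illustrative decomposition table (a real run would iterate over all of CJKVI-IDS):

```python
from collections import Counter

IDS_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

# Illustrative char -> IDS mapping
DECOMP = {"清": "⿰氵青", "请": "⿰讠青", "晴": "⿰日青", "河": "⿰氵可"}

def component_counts(decomp):
    """Count how many characters contain each component (flat, one level)."""
    counts = Counter()
    for ids in decomp.values():
        counts.update(c for c in ids if c not in IDS_OPERATORS)
    return counts

counts = component_counts(DECOMP)
assert counts["青"] == 3 and counts["氵"] == 2
```

Frequency weighting would multiply each character's contribution by its corpus frequency instead of counting 1 per character.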
Pattern 9: Traditional ↔ Simplified Mapping#
Generic Need: Map between traditional and simplified character forms
Input: Character (traditional or simplified) Output: Corresponding form(s) in other system
Complexity Cases:
- One-to-one: 花 (same in both) → 花
- One-to-many: 发 (simp) → 發 or 髮 (trad, context-dependent)
- Many-to-one: 發/髮 (trad) → 发 (simp, merged)
- Component differences: 國 → 国 (或 → 玉 inside 囗)
Decomposition Impact:
- Traditional: 國 → ⿴囗或 (囗 + 或)
- Simplified: 国 → ⿴囗玉 (囗 + 玉)
- Component change affects mnemonic/etymology
Data Requirements:
- Traditional/simplified variant mappings
- IDS for both forms (structural differences)
- Context rules (for one-to-many mappings)
Data Sources:
- Variants: Unihan `kSimplifiedVariant`, `kTraditionalVariant`
- Structure: CJKVI-IDS (includes both forms)
- Context: HanDeDict or CC-CEDICT (word-level distinctions)
Generic Use Cases:
- Text conversion (traditional ↔ simplified)
- Dictionary lookup (show both forms)
- Cross-Strait content (Taiwan vs. Mainland)
- Character learning (understand simplification rules)
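Variant mapping can be sketched by parsing Unihan's variant properties, assuming the published `U+XXXX<TAB>key<TAB>targets` layout (the targets field may list several codepoints, which covers the one-to-many cases):

```python
def parse_variants(lines, key="kSimplifiedVariant"):
    """Build {char: [variant chars]} from Unihan variant-property lines."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or f"\t{key}\t" not in line:
            continue
        cp, _, targets = line.strip().split("\t")
        mapping[chr(int(cp[2:], 16))] = [
            chr(int(t[2:], 16)) for t in targets.split()
        ]
    return mapping

m = parse_variants(["U+767C\tkSimplifiedVariant\tU+53D1"])
assert m["發"] == ["发"]
```

Context rules for one-to-many mappings (發/髮) still need word-level data on top of this character-level table.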
Pattern 10: Stroke Order & Decomposition Alignment#
Generic Need: Align component decomposition with stroke writing order
Input: Character Output: Components ordered by stroke sequence
Challenge: IDS decomposition ≠ stroke order
- IDS: 回 = ⿴囗口 (囗 enclosing 口)
- Stroke order: Begin the outer 囗, write the inner 口, then close 囗 with its final stroke — strokes of one component interleave with another's
Alignment Strategy:
- Parse IDS decomposition (structural)
- Parse stroke order sequence (temporal)
- Map strokes to components (spatial)
- Reorder components by first stroke
Data Requirements:
- IDS decomposition (spatial structure)
- Stroke order sequences (temporal order)
- Stroke-to-component mapping
Data Sources:
- IDS: CJKVI-IDS
- Stroke order: makemeahanzi `strokes` array
- Graphics: SVG stroke paths (for visualization)
Generic Use Cases:
- Handwriting teaching apps (component-aware practice)
- Stroke order animations (show components correctly)
- Calligraphy tools (component-based guidance)
- Character tracing worksheets (pedagogical materials)
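Stroke-to-component grouping can be sketched given a per-stroke array of decomposition-tree paths, in the spirit of makemeahanzi's `matches` field (the field name and the sample array for 好 = ⿰女子 are assumptions for illustration):

```python
from collections import defaultdict

def strokes_by_component(matches):
    """Group stroke indices by top-level component.

    matches: one entry per stroke; each entry is a path (list of indices)
    into the IDS tree, or None/[] when unmapped.
    """
    groups = defaultdict(list)
    for stroke_idx, path in enumerate(matches):
        top = path[0] if path else None
        groups[top].append(stroke_idx)
    return dict(groups)

# 好 = ⿰女子: strokes 0-2 belong to component 0 (女), strokes 3-5 to 1 (子)
assert strokes_by_component([[0], [0], [0], [1], [1], [1]]) == {
    0: [0, 1, 2],
    1: [3, 4, 5],
}
```

For enclosure structures (回), the same component index can appear non-contiguously, which is exactly the interleaving the alignment strategy has to handle.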
Cross-Pattern Dependencies#
Foundation Patterns (required by others):
- Character Decomposition Lookup (Pattern 1)
- Used by: Etymology Tree (4), Mnemonic Extraction (5), Learning Progression (6)
Enhancement Patterns (add value to foundation):
- Radical-Based Search (2) + Phonetic Series (3)
- Enables: Comprehensive similarity analysis (7)
Advanced Patterns (combine multiple patterns):
- Learning Progression (6) = Decomposition (1) + Frequency + Stroke Count
- Similarity Analysis (7) = Decomposition (1) + Radical (2) + Phonetic (3)
Data Source Coverage Matrix#
| Pattern | CJKVI-IDS | Unihan | makemeahanzi | pypinyin | External |
|---|---|---|---|---|---|
| 1. Decomposition | ✅ Primary | 🟨 Fallback | 🟨 Fallback | ❌ | ❌ |
| 2. Radical Search | 🟨 Reverse | ✅ Primary | 🟨 Enhancement | ❌ | 🟨 Frequency |
| 3. Phonetic Series | 🟨 Structure | 🟨 kPhonetic | ✅ Primary | ✅ Validation | ❌ |
| 4. Etymology Tree | ✅ Primary | ❌ | 🟨 Type | ❌ | 🟨 Historical |
| 5. Mnemonic Extract | ✅ Structure | 🟨 Radicals | ✅ Etymology | ❌ | 🟨 Definitions |
| 6. Learning Progress | ✅ Structure | ✅ Strokes | 🟨 Data | ❌ | ✅ HSK levels |
| 7. Similarity | ✅ Primary | 🟨 Radicals | 🟨 Phonetic | ❌ | ❌ |
| 8. Reusability | ✅ Primary | 🟨 Frequency | ❌ | ❌ | 🟨 HSK |
| 9. Trad ↔ Simp | ✅ Both forms | ✅ Variants | ❌ | ❌ | 🟨 Context |
| 10. Stroke Order | 🟨 Structure | ❌ | ✅ Primary | ❌ | ❌ |
Legend:
- ✅ Primary data source (best fit)
- 🟨 Enhancement or fallback
- ❌ Not applicable
Identified Data Gaps#
Gap 1: No comprehensive phonetic-semantic database
- CJKVI-IDS: Structure only (no phonetic classification)
- makemeahanzi: Limited coverage (common chars only)
- Hsiao & Shillcock: Limited availability (academic dataset)
Gap 2: Traditional vs. Simplified decomposition differences
- CJKVI-IDS includes both forms BUT no explicit mapping
- Must parse both and compare programmatically
Gap 3: Stroke order integration
- makemeahanzi has stroke order but limited coverage
- No standard format linking stroke order to IDS decomposition
Gap 4: Historical forms
- Oracle bone → seal script → modern evolution
- Requires specialized databases (Shuowen Jiezi, etc.)
- Not covered by Unicode Unihan or CJKVI-IDS
Next Steps for S4 (Strategic)#
Technology Evolution:
- Unicode roadmap for CJK extensions (Ext-G, Ext-H)
- Python ecosystem health (cjklib inactive, alternatives?)
- Modern NLP integration (transformers, embeddings)
Maintenance & Governance:
- CJKVI-IDS: Why no releases since 2018 despite commits?
- pypinyin: Active maintenance trajectory
- Data quality assurance across multiple sources
Variant Handling:
- CJK unification implications (China/Taiwan/Japan/Korea/Vietnam)
- Regional glyph differences in decomposition
- Font rendering vs. semantic decomposition
S4: Strategic
S4-strategic: Approach#
Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.
S4-strategic Recommendations#
Use cjkradlib or HanziJS for programmatic access. CHISE IDS database for comprehensive data. Unihan for Unicode standard mappings.
S4 Strategic: Long-Term Viability & Technology Evolution#
Status: Discovery research (no code execution) Created: 2026-01-28 Purpose: Strategic analysis of maintenance trajectories, ecosystem health, and technology evolution
Executive Summary#
Key Finding: Character decomposition is a mature but fragmented domain with:
- ✅ Stable Standards: Unicode Unihan maintained by consortium (low obsolescence risk)
- ⚠️ Fragmented Tooling: No single actively-maintained comprehensive Python library
- ✅ Open Data: CJKVI-IDS, makemeahanzi provide foundation for custom solutions
- 🔄 Recommended Strategy: Data-first approach (direct parsing) over library dependency
Unicode Consortium Roadmap#
Unihan Database Maintenance#
Governance: Unicode Technical Committee (UTC) Update Cycle: Aligned with Unicode version releases (annual/biannual) Current Version: Unicode 16.0 (September 2024) Next Expected: Unicode 17.0 (September 2026)
Historical Trajectory:
- Regular CJK extension releases (Ext-A through Ext-J and beyond)
- Ext-G: 4,939 characters (Unicode 13.0, March 2020)
- Ext-H: 4,192 characters (Unicode 15.0, September 2022)
- Ext-I: 622 characters (Unicode 15.1, September 2023); Ext-J in preparation (target Unicode 17.0)
Maintenance Commitment:
- ✅ Strong institutional backing (Unicode Consortium)
- ✅ International standards body (ISO/IEC 10646 alignment)
- ✅ Multi-stakeholder governance (China, Taiwan, Japan, Korea)
- ✅ Backward compatibility guarantees (no character removal)
Risk Assessment: LOW
- Standard unlikely to be abandoned
- Continuous extension for historical/rare characters
- Well-documented specification (UAX #38)
IDS Standardization Status#
Current State: Core IDS operators in Unicode (U+2FF0..U+2FFB) since Unicode 3.0 (1999) Stability: ✅ MATURE - Core set unchanged in 25+ years; Unicode 15.1 (2023) added four supplementary operators (U+2FFC..U+2FFF) plus U+31EF (subtraction)
Open Questions:
- Will Unicode extend IDS operators further?
- No active proposals identified in search results
- Existing operators sufficient for known character structures
Risk Assessment: LOW
- Core operators stable since 1999
- Community implementations (CHISE, CJKVI) rely on current standard
- No indication of breaking changes
Python Ecosystem Health#
cjklib: Inactive but Extractable#
Status: Abandoned (no Python 3 support, no updates since 2012+) Last PyPI Release: 0.3.2 (years old) GitHub Activity: 0 PRs, 0 issues in past year
Strategic Options:
Fork: Community fork cjklib3
- Risk: Fork maintenance uncertain, no verified activity
- Effort: Depends on fork quality
Database Extraction: Dump cjklib SQLite database, use directly
- Risk: One-time extraction, no updates
- Effort: Medium (parsing SQL schema)
Abandon: Use alternative data sources (CJKVI-IDS, Unihan)
- Risk: None (open data)
- Effort: Low (simple parsing)
Recommendation: Option 3 (abandon cjklib, use open data sources)
- Rationale: Unmaintained library is technical debt
- Alternative: CJKVI-IDS + Unihan provide equivalent data
- Future-proof: Text files easier to maintain than abandoned library
pypinyin: Active & Recommended#
Status: ✅ ACTIVELY MAINTAINED Latest Release: July 2025 (recent) Python Support: 2.7, 3.5-3.13 (modern versions) Community: Most popular Chinese pronunciation library
Strategic Value:
- ✅ Pronunciation features (Pinyin, Zhuyin, Cyrillic)
- ✅ Heteronym support (multi-reading characters)
- ✅ Regular updates (2025 release confirmed)
- ❌ Not focused on decomposition (different domain)
Recommendation: USE for pronunciation features, combine with decomposition data
Data Source Viability#
CJKVI-IDS: Active Development, Stale Releases#
Last Release: February 2018 (v18.02.20) GitHub Commits: 158 total (ongoing activity) Maintenance Gap: Why no releases since 2018 despite commits?
Possible Explanations:
- Rolling updates: Main branch is canonical (no formal releases)
- Low churn: IDS data stable, incremental updates only
- Volunteer maintenance: No bandwidth for release process
Strategic Implications:
- ✅ Data is actively curated (commits continue)
- ⚠️ No release hygiene (versioning, changelogs)
- ✅ Plain text format (easy to fork if abandoned)
Risk Mitigation:
- Use git commit SHA for version pinning (not release tags)
- Monitor commit activity quarterly (stale = <2 commits/year)
- Fallback: Fork and maintain internally if abandoned
Risk Assessment: MEDIUM
- Data quality: HIGH (actively used by community)
- Maintenance: UNCERTAIN (commits but no releases)
- Replaceability: HIGH (plain text, can fork)
makemeahanzi: Active but Limited Coverage#
GitHub Activity: Check repository for recent commits Coverage: Common characters (not full Unicode) License: Open-source (permissive)
Strategic Value:
- ✅ Etymology type classification (pictographic/phonetic/ideographic)
- ✅ Phonetic-semantic decomposition
- ✅ JSON format (easy integration)
- ❌ Limited to common characters (~8,000-10,000)
Use Case Fit:
- ✅ Excellent for educational/learner applications (HSK, common chars)
- ❌ Insufficient for comprehensive dictionary/research (need full Unicode)
Recommendation: USE for educational features, supplement with CJKVI-IDS for full coverage
CJK Variant Handling Strategy#
Regional Glyph Differences#
Challenge: Same Unicode codepoint, different glyphs across regions
- Example: 骨 (bone)
- PRC/Simplified: Specific stroke structure
- Taiwan/Traditional: Slightly different form
- Japan: Kanji variant form
- Korea: Hanja form
Impact on Decomposition:
- IDS may differ for same codepoint (regional variants)
- Radical classification may vary (different radical systems)
- Mnemonic generation affected (visual form matters)
Unicode Approach:
- Han Unification: One codepoint for “same” character across regions
- Variation Selectors: U+E0100..U+E01EF for explicit variants
- Font-level distinction (not data-level)
Data Source Handling:
| Source | Variant Support | Notes |
|---|---|---|
| CJKVI-IDS | Multiple IDS per codepoint | Covers regional variants |
| Unihan | kRSUnicode vs. kRSKangXi | Different radical systems |
| makemeahanzi | Simplified-focused | Limited traditional variant info |
Strategic Recommendation:
- Default to Unicode Standard (Unihan as canonical)
- Store multiple IDS per character (regional variants)
- Locale-aware decomposition (user’s region preference)
- Document variant handling policy (e.g., “prefer simplified for PRC learners”)
Risk: Variant confusion if not explicitly handled Mitigation: Store locale/region metadata with decomposition data
Integration with Modern NLP#
Transformer Models & Character Embeddings#
Current Trend (2025-2026): Character-level tokenization for Chinese
- BERT-wwm-ext (whole word masking)
- MacBERT (improved Chinese BERT)
- ERNIE 3.0 (Baidu’s multilingual model)
Radical/Component Awareness:
- Research question: Do transformer models benefit from explicit radical features?
- Some models incorporate radical embeddings (RoBERTa-wwm-ext-large)
- Debate: Implicit learning vs. explicit radical features
Strategic Opportunities:
- Radical-aware embeddings: Inject decomposition into model training
- Zero-shot learning: Use decomposition for rare character understanding
- Mnemonic generation: LLM + decomposition data = auto-generated stories
- Curriculum generation: Optimize learning order via component dependencies
Data Requirements for NLP Integration:
- IDS decomposition (structured data)
- Radical semantic categories (meaning injection)
- Phonetic component labels (pronunciation features)
- Etymology types (character category classification)
Recommendation: Position decomposition data as enhancement layer for NLP
- Base models learn implicitly from text
- Decomposition adds interpretability + rare character handling
Traditional vs. Simplified Strategic Handling#
Simplification Rules (1950s-1960s PRC reforms)#
Types of Simplification:
- Structural: 國 → 国 (component replacement)
- Radical: 讓 → 让 (讠 for 言)
- Merging: 發/髮 → 发 (homophone merging)
- Phonetic loan: 麵 → 面 (reuse existing character)
Decomposition Impact:
- Traditional: 國 → ⿴囗或 (complex)
- Simplified: 国 → ⿴囗玉 (different components)
- Etymology may be obscured in simplified (e.g., 爱 lost 心 from 愛)
Data Alignment Challenges:
| Challenge | Example | Solution |
|---|---|---|
| One-to-many | 干 (simp) → 干/乾/幹 (trad, context-dependent) | Word-level mapping (not char) |
| Many-to-one | 發/髮/彆 → 发/别 (merged) | Store all traditional sources |
| Component change | 國/国 radicals differ | IDS for both forms |
Strategic Approaches:
Approach 1: Dual Storage (store both forms)
- Pros: Complete data, no information loss
- Cons: 2x storage, complexity in querying
Approach 2: Primary + Variant Links
- Pros: Efficient storage, clear relationships
- Cons: Must choose primary (simplified or traditional)
Approach 3: Etymology-Preserving (prefer traditional for learning)
- Pros: Maintains historical meaning
- Cons: Less useful for simplified-only learners
Recommendation: Approach 1 (dual storage)
- Rationale: Storage is cheap, data completeness valuable
- Implementation: Character ID → [traditional_data, simplified_data]
- Querying: Filter by user’s locale preference
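Approach 1 can be sketched as a record holding both forms plus both IDS strings; the record type and `display_form` helper below are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass
class CharRecord:
    traditional: str
    simplified: str
    ids_traditional: str
    ids_simplified: str

def display_form(rec, locale):
    """Pick the form matching the user's locale preference."""
    return rec.simplified if locale in ("zh-CN", "zh-SG") else rec.traditional

guo = CharRecord("國", "国", "⿴囗或", "⿴囗玉")
assert display_form(guo, "zh-CN") == "国"
assert display_form(guo, "zh-TW") == "國"
```

Because both IDS strings are stored, component-level features (mnemonics, etymology) can also be served per locale rather than forced through one form.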
Technology Evolution Vectors#
Vector 1: Unicode Extensions#
Trajectory: Continued CJK extensions (Ext-I, Ext-J, …) Impact: More rare/historical characters require IDS data Preparation: Monitor Unicode roadmap, update data sources post-release
Action Items:
- Subscribe to Unicode announcements (unicode.org mailing lists)
- Budget for data updates quarterly (new extensions)
- Automated scraping of Unihan releases (CI/CD pipeline)
Vector 2: Machine Learning for Decomposition#
Emerging Research: Auto-generate IDS from character images Potential: Fill gaps for unencoded/rare characters Maturity: Research phase (not production-ready)
Strategic Watch:
- Academic papers on character structure recognition
- Open-source models (e.g., OCR → IDS generation)
- Validation against human-curated data (CJKVI-IDS as ground truth)
Action: Monitor but don’t depend on (experimental technology)
Vector 3: Web Standards (CSS, SVG, Fonts)#
Trend: Browser support for IDS rendering (experimental) Use Case: Display unencoded characters via IDS composition Status: Limited browser support (cutting-edge only)
Strategic Opportunity:
- Educational tools could use IDS for dynamic character rendering
- Fallback: SVG generation from component SVGs (makemeahanzi approach)
Action: Prototype but maintain static fallbacks (browser compatibility)
Vector 4: Open Data Movement#
Trend: More linguistic data becoming open (CC-BY, public domain) Recent Examples:
- Wikimedia language data projects
- Universal Dependencies for Chinese parsing
Opportunity: Contribute back to community
- Publish derived datasets (e.g., phonetic series mappings)
- Contribute fixes to CJKVI-IDS (if errors found)
- Open-source tooling for decomposition processing
Strategic Value: Build reputation + community goodwill
Maintenance Burden Analysis#
Option A: Depend on Libraries (cjklib, hanzipy)#
Pros:
- Pre-built functionality
- Community testing (if active)
Cons:
- Maintenance dependency (abandoned libraries = broken systems)
- Python version compatibility issues (cjklib: no Python 3)
- API changes break downstream code
Maintenance Burden: HIGH (when library abandoned) Risk: HIGH (cjklib already abandoned)
Option B: Direct Data Parsing (CJKVI-IDS, Unihan)#
Pros:
- Full control over parsing logic
- Plain text = easy to maintain
- No library dependency rot
Cons:
- Initial implementation effort (write parsers)
- Must handle data format changes (rare for text files)
Maintenance Burden: LOW (data format stable) Risk: LOW (can fork data if source abandoned)
Option C: Hybrid (Libraries for Some, Data for Others)#
Pros:
- Use pypinyin (active) for pronunciation
- Parse CJKVI-IDS directly for decomposition
- Best of both worlds
Cons:
- Multiple dependencies (some active, some not)
Maintenance Burden: MEDIUM Risk: MEDIUM (only for library portions)
Recommendation: Option C (Hybrid)
- pypinyin for pronunciation (active maintenance)
- Direct parsing for decomposition (stable data)
- Avoid abandoned libraries (cjklib)
10-Year Outlook (2026-2036)#
Stable Elements (High Confidence)#
- Unicode Standard: Will continue, backward compatible
- Unihan Database: Regular updates, consortium-backed
- IDS Operators: Stable, no breaking changes expected
- Plain Text Data: CJKVI-IDS, Unihan will remain parseable
Uncertain Elements (Monitor Closely)#
- Python Library Ecosystem: May consolidate or fragment further
- CJK NLP: Transformer models may change decomposition relevance
- Regional Variants: Taiwan/Hong Kong policy changes possible
Emerging Opportunities#
- LLM-Powered Mnemonic Generation: GPT-4+ with decomposition data
- AR/VR Character Learning: Spatial decomposition visualization
- Adaptive Curriculum: ML-optimized learning progressions
Strategic Positioning#
Bet on:
- Open data sources (Unihan, CJKVI-IDS)
- Stable standards (Unicode, IDS)
- Actively maintained tools (pypinyin)
Avoid:
- Unmaintained libraries (cjklib)
- Proprietary datasets (vendor lock-in)
- Cutting-edge unproven tech (ML decomposition, experimental browser APIs)
Hedge with:
- Modular architecture (swap data sources easily)
- Data versioning (track Unihan/CJKVI-IDS versions)
- Automated testing (detect data format changes)
Governance & Data Quality Assurance#
Multi-Source Data Conflicts#
Problem: CJKVI-IDS vs. Unihan vs. makemeahanzi may disagree Example: Character 某 decomposition
- Source A: ⿱艹木
- Source B: ⿱甘木
- Cause: Different interpretations of component
Resolution Strategy:
Precedence Rules: Define authoritative source per data type
- Radicals: Unihan (official Unicode)
- IDS: CJKVI-IDS (most comprehensive)
- Etymology: makemeahanzi (educational focus)
Validation Pipeline: Cross-check sources
- Flag conflicts for manual review
- Document decisions (source A chosen because…)
Version Pinning: Track source data versions
- CJKVI-IDS: git commit SHA
- Unihan: Unicode version number
- makemeahanzi: GitHub release tag
Quality Metrics:
- Coverage: % of Unicode CJK chars with IDS
- Consistency: % agreement across sources
- Completeness: % of chars with phonetic/semantic classification
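The precedence rules can be sketched as a small resolver that records which source supplied each value (source names and the precedence table come from this document, not from any real library):

```python
# Authoritative source order per data type, highest precedence first
PRECEDENCE = {
    "ids": ["cjkvi-ids", "unihan", "makemeahanzi"],
    "radical": ["unihan", "cjkvi-ids"],
    "etymology": ["makemeahanzi"],
}

def resolve(data_type, per_source_values):
    """per_source_values: {source: value or None}.

    Returns (value, winning_source) so every merged record stays auditable
    ('source A chosen because...').
    """
    for source in PRECEDENCE[data_type]:
        value = per_source_values.get(source)
        if value is not None:
            return value, source
    return None, None

value, source = resolve("ids", {"unihan": None, "cjkvi-ids": "⿱甘木"})
assert (value, source) == ("⿱甘木", "cjkvi-ids")
```

A validation pipeline would run `resolve` per character, flag cases where the losing sources disagree with the winner, and queue those for manual review.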
Risk Register#
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| cjklib abandoned | ✅ HAPPENED | High | Use alternative data sources |
| CJKVI-IDS stale | Medium | Medium | Monitor commits, fallback to Unihan |
| Unicode breaks IDS | Low | High | Versioning + automated tests |
| Phonetic DB unavailable | Medium | Low | makemeahanzi sufficient for most uses |
| Regional variant conflicts | High | Medium | Dual storage + locale preference |
| Python 2→3 migration | ✅ PAST | High | Already avoided (no cjklib) |
| Data license issues | Low | Medium | Use GPL-exempt files (ids-ext-cdef.txt) |
Recommendations#
Short-Term (2026-2027)#
- ✅ Adopt CJKVI-IDS for structural decomposition (direct parsing)
- ✅ Use Unihan for radical/stroke metadata (official source)
- ✅ Integrate pypinyin for pronunciation features (active maintenance)
- ✅ Supplement with makemeahanzi for etymology (educational value)
- ❌ Avoid cjklib (unmaintained, Python 2 only)
Medium-Term (2027-2029)#
- Build validation pipeline (cross-check data sources)
- Contribute fixes to CJKVI-IDS (community engagement)
- Monitor Unicode roadmap (prepare for Ext-J and later extensions)
- Evaluate cjklib3 fork (if community adopts)
Long-Term (2029-2036)#
- Explore ML-enhanced decomposition (auto-IDS generation research)
- Integrate with NLP pipelines (transformer embeddings)
- Contribute datasets to community (open data movement)
- Plan for next-gen standards (if IDS operators evolve)
Sources: