1.161 Radical & Component Analysis#

Character decomposition and radical analysis for CJK characters. Tools for identifying semantic components, radicals, and character structure - essential for dictionary lookups, etymology, and character learning.


Explainer

Radical & Component Analysis: Research Explainer#

Research ID: 1.161
Category: Chinese Language Processing / Character Decomposition
Completed: 2026-01-28
Discovery Phase: S1-S4 Complete


What Is This?#

Character decomposition is the process of breaking down Chinese characters (汉字/漢字) into their constituent parts: radicals (部首), components, and structural relationships. This research surveys data sources, tools, and patterns for implementing character decomposition systems.

Key Use Cases:

  • Etymology: Understand character origins and historical development
  • Mnemonic generation: Create memory aids for language learners
  • Character learning progression: Build curricula from simple to complex
  • Computational linguistics: Enable character-aware NLP applications

What We Researched#

S1: Quick Ecosystem Scan (ecosystem-scan.md)#

Data Standards:

  • Unihan Database: Official Unicode consortium database with radical-stroke data, variants, pronunciations
  • IDS (Ideographic Description Sequences): Standardized structural representation using operators (⿰, ⿱, etc.)
  • Phonetic-Semantic Compounds (形聲字): 80% of characters combine semantic radical + phonetic component

Python Libraries:

  • cjklib: Comprehensive but inactive (no Python 3 support)
  • hanzipy: Learner-focused, unclear maintenance
  • pypinyin: Active, production-ready (pronunciation focus)
  • makemeahanzi: Open-source character data (JSON format)

Open Data Sources:

  • CJKVI-IDS: 158 commits, comprehensive IDS database (plain text)
  • cjk-decomp: Decomposition data for 75,000+ characters
  • Hsiao & Shillcock Database: Academic phonetic compound research

S2: Deep Technical Dive (library-comparison.md)#

Library Status:

  • cjklib: Abandoned (0 PRs/issues in past year, no Python 3)
  • pypinyin: Actively maintained (July 2025 release, Python 3.5-3.13)
  • ⚠️ hanzipy: Maintenance unclear
  • CJKVI-IDS: Active commits (158 total), stale releases (2018)

Data Format Comparison:

CJKVI-IDS:    U+5730  地  ⿰土也   (plain text, tab-delimited)
Unihan:       kRSUnicode: 32.3      (radical 土 + 3 strokes)
makemeahanzi: {"character": "地", "etymology": {...}}  (JSON)

Integration Strategy:

  • Layer 1: CJKVI-IDS for structural decomposition
  • Layer 2: Unihan for radical/stroke metadata
  • Layer 3: makemeahanzi for phonetic-semantic classification
  • Layer 4: pypinyin for pronunciation

Critical Insight: No single comprehensive library exists. Hybrid data-source approach recommended.


S3: Generic Use Cases (use-case-patterns.md)#

10 Key Patterns Identified:

  1. Character Decomposition Lookup: 花 → ⿱艹化 (structure)
  2. Radical-Based Search: Find all chars with radical 土
  3. Phonetic Series Identification: 青 → 清情請晴… (pronunciation families)
  4. Etymology Tree Generation: Recursive component hierarchy
  5. Mnemonic Component Extraction: 好 = 女+子 → “woman + child = good”
  6. Learning Progression Sequencing: Simple components before complex compounds
  7. Character Similarity Analysis: Structural/radical/phonetic overlap
  8. Component Reusability Analysis: Identify high-value components (口, 氵, etc.)
  9. Traditional ↔ Simplified Mapping: 國 ↔ 国 (variant handling)
  10. Stroke Order Alignment: Component order vs. writing sequence

Data Gap Identified: No comprehensive open database combining:

  • Full Unicode coverage (CJKVI-IDS level)
  • Phonetic-semantic classification (makemeahanzi level)
  • Etymology type annotation (ideographic/pictographic/pictophonetic)

Workaround: Combine multiple data sources programmatically


S4: Strategic Viability (viability-analysis.md)#

Unicode Roadmap:

  • ✅ Stable: Unihan database regularly updated (annual Unicode releases)
  • ✅ Mature: IDS operators stable since 1999 (no breaking changes)
  • ✅ Ongoing: CJK extensions (Ext-I in Unicode 17.0, 2026)

Python Ecosystem Health:

  • ❌ cjklib: Abandoned (technical debt if used)
  • ✅ pypinyin: Active (recommended for pronunciation)
  • ⚠️ CJKVI-IDS: Active commits but stale releases (monitor closely)

Recommended Architecture:

  • Bet on: Open data sources (Unihan, CJKVI-IDS), stable standards
  • Avoid: Unmaintained libraries (cjklib), proprietary datasets
  • Hedge with: Modular design, data versioning, automated testing

10-Year Outlook:

  • Stable: Unicode, Unihan, IDS operators
  • Uncertain: Python library consolidation, CJK NLP evolution
  • Emerging: LLM-powered mnemonics, adaptive learning curricula

Key Findings#

1. Data Sources Are More Reliable Than Libraries#

| Aspect | Libraries | Open Data |
| --- | --- | --- |
| Maintenance | Fragile (cjklib abandoned) | Stable (Unicode consortium) |
| Format | APIs can break | Plain text (future-proof) |
| Control | Dependency on maintainers | Full parsing control |
| Risk | High (abandonment) | Low (can fork) |

Recommendation: Data-first approach - parse CJKVI-IDS and Unihan directly, avoid unmaintained libraries.


2. Hybrid Data Stack Required#

No single source provides everything:

Use Case: Educational App
├─ Structure: CJKVI-IDS (⿰土也)
├─ Radicals: Unihan (kRSUnicode = 32.3)
├─ Etymology: makemeahanzi (type: pictophonetic)
└─ Pronunciation: pypinyin (dì, de)

Use Case: Comprehensive Dictionary
├─ Structure: CJKVI-IDS (full Unicode)
├─ Metadata: Unihan (variants, readings, strokes)
├─ Phonetic: makemeahanzi (common chars) + custom for rare chars
└─ Pronunciation: pypinyin + Unihan kMandarin

3. Traditional vs. Simplified Requires Dual Storage#

Challenge: Same Unicode codepoint, different glyphs

  • 骨 (bone): PRC vs. Taiwan vs. Japan variants
  • Decomposition differs: 國 (⿴囗或) vs. 国 (⿴囗玉)

Solution:

  • Store IDS for both traditional and simplified forms
  • Use Unihan kSimplifiedVariant / kTraditionalVariant for mapping
  • Locale preference determines which decomposition to show
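A minimal sketch of the dual-storage approach. The mapping and IDS tables below are hand-filled stand-ins; in practice they would be populated from Unihan (kSimplifiedVariant / kTraditionalVariant) and CJKVI-IDS:

```python
# Illustrative data only; production values come from Unihan and CJKVI-IDS.
SIMPLIFIED_VARIANT = {'國': '国'}                 # traditional -> simplified
TRADITIONAL_VARIANT = {v: k for k, v in SIMPLIFIED_VARIANT.items()}
IDS = {'國': '⿴囗或', '国': '⿴囗玉'}             # store IDS for BOTH forms

def decomposition_for(char, prefer='simplified'):
    """Return (preferred form, its IDS) based on locale preference."""
    if prefer == 'simplified':
        char = SIMPLIFIED_VARIANT.get(char, char)
    else:
        char = TRADITIONAL_VARIANT.get(char, char)
    return char, IDS.get(char)

print(decomposition_for('國', prefer='simplified'))  # ('国', '⿴囗玉')
```

The key point is that each stored form keeps its own IDS; the variant mapping only selects which form to show.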

4. 80% of Characters Are Phonetic-Semantic Compounds#

Structure: Semantic radical + Phonetic component

Example: 清 (clear, qīng)

  • Semantic: 氵 (water) → meaning category
  • Phonetic: 青 (qīng) → pronunciation hint

Implication:

  • Mnemonic generation: “Clear water (氵) sounds like 青”
  • Learning progression: Teach phonetic families together (清情請晴)
  • Etymology tools: Must classify phonetic vs. semantic components

Data Source: makemeahanzi for common chars, Hsiao & Shillcock for research
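The phonetic-family idea can be sketched by inverting a component table. The `PHONETIC` dict below is a hand-filled stand-in for makemeahanzi's `etymology.semantic` / `etymology.phonetic` fields:

```python
from collections import defaultdict

# Stand-in for makemeahanzi etymology data: char -> (semantic, phonetic)
PHONETIC = {
    '清': ('氵', '青'),
    '情': ('忄', '青'),
    '請': ('言', '青'),
    '晴': ('日', '青'),
    '地': ('土', '也'),
}

def phonetic_series(data):
    """Group characters sharing a phonetic component (e.g. the 青 family)."""
    series = defaultdict(list)
    for char, (_semantic, phonetic) in data.items():
        series[phonetic].append(char)
    return dict(series)

print(phonetic_series(PHONETIC)['青'])  # ['清', '情', '請', '晴']
```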


5. Regional Variants Complicate Decomposition#

Unicode Han Unification: One codepoint for “same” character across regions

  • Benefit: Interoperability
  • Challenge: Visual/structural differences not captured

Handling Strategy:

  1. Default to Unicode standard (Unihan)
  2. Store multiple IDS per character (regional variants)
  3. Locale-aware decomposition (user’s region preference)
  4. Document variant policy (e.g., “prefer simplified for PRC learners”)
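Steps 2-3 above can be sketched as a locale-keyed lookup with a documented fallback. Everything here is hypothetical: the locale tags follow common IRG source usage (G = PRC, T = Taiwan, J = Japan), and the IDS values are placeholders, not real variant decompositions:

```python
# Hypothetical variant table: character -> {locale tag -> IDS}.
# The IDS values are placeholders, not actual decomposition data.
VARIANT_IDS = {
    '骨': {'G': '<prc-form-ids>', 'T': '<taiwan-form-ids>', 'J': '<japan-form-ids>'},
}

def decompose(char, locale='G'):
    """Prefer the requested region's IDS; fall back to the first stored form."""
    variants = VARIANT_IDS.get(char, {})
    if locale in variants:
        return variants[locale]
    return next(iter(variants.values()), None)  # documented fallback policy
```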

Data Source Recommendations#

For Structural Decomposition#

Primary: CJKVI-IDS ids.txt

  • Coverage: Full URO + Extensions
  • Format: Plain text (easy parsing)
  • License: ids-ext-cdef.txt is GPL-exempt

For Radical & Metadata#

Primary: Unihan Database

  • Authority: Official Unicode consortium
  • Fields: kRSUnicode, kTotalStrokes, kSemanticVariant, etc.
  • Maintenance: Regular updates

For Phonetic-Semantic Classification#

Common Chars: makemeahanzi

  • Format: JSON (easy integration)
  • Fields: etymology.type, etymology.phonetic, etymology.semantic
  • License: Open-source

Research Reference: Hsiao & Shillcock (2006) Database

  • Academic rigor
  • Cangjie stroke decomposition
  • Limited availability (contact authors)

For Pronunciation#

Primary: pypinyin

  • Actively maintained (July 2025)
  • Python 3.5-3.13 support
  • Heteronym support

Library Recommendations#

| Use Case | Recommended Approach | Why |
| --- | --- | --- |
| Quick Prototype | Parse makemeahanzi JSON | Fastest, single file, common chars |
| Educational App | makemeahanzi + pypinyin | Etymology + pronunciation, learner-focused |
| Comprehensive Dictionary | CJKVI-IDS + Unihan + pypinyin | Full coverage, authoritative |
| Research Project | All sources + validation | Cross-reference, data quality assurance |

Libraries to AVOID:

  • ❌ cjklib: No Python 3 support, abandoned
  • ❌ cjklib3 fork: Maintenance unverified (as of 2026-01-28)

Implementation Guidance#

Parsing CJKVI-IDS (Plain Text)#

Format: Tab-delimited

U+5730	地	⿰土也

Python Pseudo-Code (concept, not implementation):

IDS_OPERATORS = set('⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻')  # U+2FF0..U+2FFB

def extract_components(ids):
    """Strip IDS operators, leaving the component characters."""
    return [ch for ch in ids if ch not in IDS_OPERATORS]

def parse_ids_line(line):
    # Some ids.txt lines carry extra columns for regional variant forms;
    # this sketch keeps only the first IDS.
    parts = line.strip().split('\t')
    return {
        'unicode': parts[0],        # U+5730
        'character': parts[1],      # 地
        'ids': parts[2],            # ⿰土也
        'components': extract_components(parts[2]),  # ['土', '也']
    }

Parsing Unihan (Tab-Delimited)#

Format: U+XXXX<tab>kField<tab>Value

U+5730	kRSUnicode	32.3
U+5730	kTotalStrokes	6

Python Pseudo-Code:

from collections import defaultdict

def parse_unihan_file(file_path):
    char_data = defaultdict(dict)
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            if line.startswith('#') or not line.strip():
                continue
            codepoint, field, value = line.rstrip('\n').split('\t', 2)
            char_data[codepoint][field] = value
    return char_data

Parsing makemeahanzi (JSON)#

Format: JSON objects, one per line

{"character": "地", "etymology": {"type": "pictophonetic", "phonetic": "也", "semantic": "土"}}

Python Pseudo-Code:

import json

def parse_makemeahanzi(file_path):
    characters = {}
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data = json.loads(line)
                characters[data['character']] = data
    return characters
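Once parsed, the three sources can be layered into one record per character, following the recommended precedence (CJKVI-IDS for structure, Unihan for metadata, makemeahanzi for etymology). A minimal sketch; the sample rows are hand-filled illustrations of each source's format:

```python
def merge_sources(ids_rows, unihan, mmah):
    """Layer per-character data: structure, then metadata, then etymology."""
    merged = {}
    for rec in ids_rows:                          # dicts shaped like parse_ids_line() output
        char = rec['character']
        entry = {'ids': rec['ids']}
        entry.update(unihan.get(rec['unicode'], {}))   # kRSUnicode, kTotalStrokes, ...
        etymology = mmah.get(char, {}).get('etymology')
        if etymology:
            entry['etymology'] = etymology
        merged[char] = entry
    return merged

# Worked example for 地:
ids_rows = [{'unicode': 'U+5730', 'character': '地', 'ids': '⿰土也'}]
unihan = {'U+5730': {'kRSUnicode': '32.3', 'kTotalStrokes': '6'}}
mmah = {'地': {'character': '地',
               'etymology': {'type': 'pictophonetic',
                             'phonetic': '也', 'semantic': '土'}}}

record = merge_sources(ids_rows, unihan, mmah)['地']
print(record['kRSUnicode'], record['etymology']['type'])  # 32.3 pictophonetic
```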

Next Steps#

If Building Educational Tool#

  1. Download makemeahanzi dictionary.txt
  2. Install pypinyin: pip install pypinyin
  3. Parse JSON for etymology, use pypinyin for pronunciation
  4. Implement Pattern 5 (Mnemonic Extraction) from S3

If Building Comprehensive Dictionary#

  1. Clone CJKVI-IDS (use ids.txt)
  2. Download Unihan Database
  3. Write parsers for both formats (see pseudo-code above)
  4. Implement Pattern 1 (Character Decomposition Lookup) from S3
  5. Cross-validate data sources (check for conflicts)

If Doing Research#

  1. Obtain Hsiao & Shillcock (2006) database (academic access)
  2. Combine all sources: CJKVI-IDS + Unihan + makemeahanzi
  3. Build validation pipeline (data quality metrics)
  4. Implement Pattern 7 (Similarity Analysis) and Pattern 8 (Reusability) from S3

Critical Warnings#

❌ Do NOT Use cjklib#

  • No Python 3 support
  • Abandoned (no updates in years)
  • Fork (cjklib3) maintenance unclear
  • Use direct data parsing instead

⚠️ Handle Regional Variants#

  • Same Unicode codepoint ≠ same visual form
  • Store multiple IDS per character
  • Document locale handling policy

⚠️ License Compliance#

  • CJKVI-IDS ids.txt: GPLv2
  • CJKVI-IDS ids-ext-cdef.txt: No GPL restriction
  • Choose file based on license requirements

⚠️ Data Version Pinning#

  • CJKVI-IDS: Use git commit SHA (not release tag)
  • Unihan: Track Unicode version number
  • makemeahanzi: Use GitHub release tag
  • Reproducibility requires version tracking

Maintenance Strategy#

Quarterly Tasks#

  • Check CJKVI-IDS for new commits (git log)
  • Monitor Unicode announcements (new CJK extensions)
  • Verify pypinyin still actively maintained (PyPI releases)
  • Test data parsers against latest versions

Annual Tasks#

  • Update Unihan database (new Unicode release)
  • Re-download CJKVI-IDS (latest commit)
  • Review makemeahanzi for updates
  • Audit data conflicts (cross-source validation)

As-Needed#

  • Contribute fixes to CJKVI-IDS if errors found
  • Fork CJKVI-IDS if maintenance lapses
  • Publish derived datasets (community contribution)


Research Completed: 2026-01-28
Status: Discovery phase complete (S1-S4)
Next: Implementation phase (02-implementations/) if needed

S1: Rapid Discovery

S1-rapid: Approach#

Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.


S1 Rapid: Radical & Component Analysis Ecosystem Scan#

Status: Discovery research (no code execution)
Created: 2026-01-28
Purpose: Quick ecosystem scan for Chinese character radical decomposition and component analysis


Overview#

Chinese character decomposition involves breaking down characters (汉字/漢字) into their constituent parts: radicals (部首), components, and stroke sequences. This is essential for:

  • Etymology: Understanding character origins and historical development
  • Mnemonic generation: Creating memory aids for language learners
  • Character learning progression: Building curriculum from simple to complex components
  • Computational linguistics: Enabling character-aware NLP applications

Key Standards & Data Sources#

1. Unihan Database (Unicode Consortium)#

Source: Unicode Han Database (UAX #38)

The official Unicode Han Database provides comprehensive data for all unified CJK ideographs:

  • Radical-Stroke Data: Based on 214 Kangxi radicals system (18th century)

    • Format: kRSUnicode and kRSAdobe_Japan1_6 fields
    • Structure: radical number + residual stroke count
    • Example: Character 地 = Radical 土 (32) + 3 residual strokes
  • Interactive Access: Unihan Database Lookup

  • Radical Index: Unihan Radical-Stroke Index

  • Development: GitHub repository for expert review

Key Fields:

  • kRSUnicode: Radical-stroke count
  • kCangjie: Cangjie input method codes
  • kPhonetic: Phonetic component information
  • kSemanticVariant: Semantic variant mappings

2. IDS (Ideographic Description Sequences)#

Source: Unicode Standard Chapter 18

IDS provides a standardized way to represent character composition using Unicode operators (U+2FF0..U+2FFB):

  • Purpose: Describe characters in featural terms by arrangement of components
  • Operators: 12 special characters (⿰, ⿱, ⿲, etc.) for spatial arrangement
    • ⿰ (U+2FF0): Left-to-right composition
    • ⿱ (U+2FF1): Top-to-bottom composition
    • ⿲ (U+2FF2): Left-middle-right composition
    • etc.

Examples:

  • 地 → ⿰土也 (left-right: 土 + 也)
  • 花 → ⿱艹化 (top-bottom: 艹 + 化)

Technical Constraints:

  • Maximum sequence length: 16 Unicode code points
  • Useful for describing unencoded characters or characters missing from fonts

Main IDS Database: CJKVI-IDS on GitHub - Comprehensive IDS data for CJK Unified Ideographs

3. Phonetic-Semantic Compounds (形聲字)#

Concept: 80% of Chinese characters are phonetic-semantic compounds combining:

  • Semantic radical (義符): Indicates meaning category
  • Phonetic component (聲符): Provides pronunciation hint

Key Resources:

Research Databases:

  • Hsiao & Shillcock (2006): Chinese lexical database with phonetic compound decomposition into semantic/phonetic radicals

Python Libraries & Tools#

1. cjklib (Most Comprehensive)#

PyPI: cjklib
Docs: cjklib.characterlookup

Capabilities:

  • Character decomposition using IDS (Ideographic Description Sequences)
  • Radical lookups (Kangxi radical system)
  • Stroke decomposition
  • Variant character information
  • Pronunciation data (Pinyin, Wade-Giles, Cantonese, etc.)
  • Glyph component analysis

Database Storage: IDS stored in SQLite database with Unicode IDS operators

Maintenance Status: Last release 0.3.2 (check PyPI for latest)

2. hanzipy#

GitHub: Synkied/hanzipy

Purpose: Chinese character NLP framework for language learners

Features:

  • Character decomposition into radicals/components
  • Learning-focused API
  • Framework for character exploration

Target Audience: Language learners and educational applications

3. makemeahanzi (Data Only)#

GitHub: skishore/makemeahanzi
Web: Make Me a Hanzi

Format: JSON data files, not a Python library

Data Sources: Derived from Unihan and cjklib

Content:

  • Character etymology (type: ideographic/pictographic/pictophonetic)
  • Phonetic/semantic component breakdowns
  • Stroke order data
  • Open-source, free to use

4. cjk-decomp (Decomposition Data)#

GitHub: amake/cjk-decomp

Scope: Decomposition data for 75,000+ CJK ideographs

  • 36 stroke types
  • 115 radicals
  • 20,924 unified characters
  • Extension sets (Ext-A, Ext-B, etc.)

Format: Data files (not a Python library, but can be parsed)


Web-Based Resources#

Reference & Etymology#

  1. Zhongwen.com: Character genealogies (zipu) showing interconnections between 4000+ characters based on Shuowen Jiezi
  2. Arch Chinese Radicals: Interactive radical reference
  3. Dong Chinese Wiki: Character wiki with etymological information
  4. Multi-function Chinese Character Database (漢語多功能字庫): University of Hong Kong free online dictionary with character origins

Technical References#

  1. BabelStone IDS Database: UTF-8 encoded plain text IDS for all CJK Unified Ideographs
  2. CJKVI GitHub Organization: Multiple databases (IDS, variants, etc.)

Quick Assessment#

Strengths#

  • ✅ Unicode Standard Foundation: Unihan is official, comprehensive, regularly maintained
  • ✅ IDS Standardization: Structural decomposition with standardized operators
  • ✅ Multiple Python Options: From comprehensive (cjklib) to learner-focused (hanzipy)
  • ✅ Open Data Available: makemeahanzi, cjk-decomp, CJKVI-IDS all freely accessible
  • ✅ Rich Academic Research: Hsiao & Shillcock database, phonetic compound research

Gaps & Considerations#

  • ⚠️ Library Maintenance: Need to verify current maintenance status of cjklib, hanzipy
  • ⚠️ Data Quality: Different sources may have conflicting decompositions (historical vs. modern)
  • ⚠️ Traditional vs. Simplified: Need to handle both character sets
  • ⚠️ Variant Forms: Regional variants (China, Taiwan, Hong Kong, Japan, Korea) complicate decomposition


Next Steps for S2-S4#

S2 Comprehensive:

  • Deep dive into cjklib API and database schema
  • Compare IDS coverage across CJKVI-IDS vs. BabelStone vs. cjklib
  • Analyze phonetic-semantic compound databases (Hsiao & Shillcock data)
  • Review academic literature on character decomposition approaches

S3 Need-Driven:

  • Generic use cases: etymology lookup, mnemonic generation, learning progression
  • Pattern: “Given character X, find all components”
  • Pattern: “Given radical Y, find all characters containing it”
  • Pattern: “Find phonetic series (characters sharing phonetic component)”

S4 Strategic:

  • Unicode Consortium maintenance trajectory
  • Python library ecosystem health
  • Integration with modern NLP pipelines (transformers, embeddings)
  • Traditional vs. simplified character handling strategies



S1-rapid Recommendations#

Use cjkradlib or HanziJS for programmatic access. CHISE IDS database for comprehensive data. Unihan for Unicode standard mappings.

S2: Comprehensive

S2-comprehensive: Approach#

Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.


S2 Comprehensive: Library & Data Source Deep-Dive#

Status: Discovery research (no code execution)
Created: 2026-01-28
Purpose: Deep technical analysis of libraries, data sources, and formats for character decomposition


Python Library Landscape#

1. cjklib (Comprehensive but Inactive)#

Repository: cburgmer/cjklib on GitHub
PyPI: cjklib
Latest Version: 0.3.2
Maintenance Status: ❌ INACTIVE (Snyk analysis)

Maintenance Indicators:

  • No releases to PyPI in past 12+ months
  • Past year: 0 issues, 0 pull requests, 0 issue authors, 0 PR authors
  • Classified as discontinued or receiving minimal attention
  • Weekly downloads: ~70 (limited popularity)
  • Python 3 Support: NO - major blocker for modern projects

Technical Capabilities:

  1. IDS (Ideographic Description Sequences):

    • Hierarchical character decomposition stored in database
    • Binary operators (⿰ for left-right: 好 → ⿰女子)
    • Trinary operators (⿲ for three components: 辨 → ⿲⾟刂⾟)
    • Handles non-distinct partitioning, unencoded components, multi-level hierarchies
  2. Kangxi Radicals:

    • Character-to-radical mapping via Unihan integration
    • Radical-to-character lookup (find all chars with specific radical)
    • Variant handling: Unicode radical forms (⾔), variants (⻈), equivalent chars (言/讠)
  3. Database Architecture:

    • Uses DatabaseConnector supporting multiple DB systems
    • Default: SQLite (cjklib.db)
    • Configuration via cjklib.conf
    • Core data storage: IDS sequences with Unicode operators
  4. API Methods (from documentation):

    • getReadingForCharacter() - pronunciation lookups
    • getCharactersForReading() - reverse lookup by pronunciation
    • getDefaultGlyph() - locale-specific glyph mappings
    • getAvailableCharacterDomains() - supported character sets

Assessment:

  • ✅ Most comprehensive feature set
  • ❌ Inactive development
  • ❌ No Python 3 support
  • ⚠️ Risk: Unmaintained dependencies, compatibility issues

Potential: Fork exists as cjklib3 (Python 3 port) - needs verification of maintenance status


2. hanzipy (Learner-Focused)#

Repository: Synkied/hanzipy on GitHub
PyPI: hanzipy 1.0.4
Maintenance Status: ⚠️ UNCLEAR (no recent activity visible in 2025 search results)

Purpose: Chinese character NLP framework for language learners

Features:

  • Character decomposition into radicals/components
  • Learning-focused API design
  • Framework for character exploration

Assessment:

  • ✅ Learner-friendly design
  • ⚠️ Limited documentation on advanced features
  • ⚠️ Unclear maintenance trajectory
  • ⚠️ Smaller community than cjklib

Use Case Fit: Better for educational tools than computational linguistics research


3. pypinyin / python-pinyin (Active, Production-Ready)#

Repository: mozillazg/python-pinyin
PyPI: pypinyin
Latest Release: July 20, 2025
Python Support: 2.7, 3.5 through 3.13
Maintenance Status: ✅ ACTIVELY MAINTAINED

Primary Focus: Pinyin/Zhuyin/Cyrillic conversion

  • Heteronym support (characters with multiple readings)
  • Recommended for PRC-style simplified characters
  • Most commonly used Python package for modern Chinese

Assessment for Radical Decomposition:

  • ❌ Not focused on structural decomposition
  • ✅ Excellent for pronunciation/reading features
  • ✅ Strong maintenance and community

Recommendation: Use for pronunciation features, combine with IDS data source for decomposition


4. Other Alternatives#

dragonmapper (PyPI):

  • Conversion between characters/Pinyin/Zhuyin/IPA
  • Traditional vs. Simplified identification
  • Not focused on structural decomposition

hanzitools (jcklie/hanzitools on GitHub):

  • Heisig entry lookups
  • CEDICT translation integration
  • Recommends pypinyin for Pinyin conversion
  • Limited decomposition features

Data Source Deep-Dive#

1. CJKVI-IDS (Primary Recommendation)#

Repository: cjkvi/cjkvi-ids on GitHub
Last Release: February 20, 2018 (v18.02.20)
Commits: 158 (ongoing development)
License: GPLv2 (most files), some unrestricted

Data Files (10 total):

| File | Purpose | License |
| --- | --- | --- |
| ids.txt | Main IDS data (from CHISE project) | GPLv2 |
| ids-cdp.txt | With Academia Sinica CDP PUA chars | GPLv2 |
| ids-ext-cdef.txt | Extended IDS (Ext-C/D/E/F) | ❗ No GPL restriction |
| ids-analysis.txt | Analysis data | GPLv2 |
| hanyo-ids.txt | Hanyo-specific IDS | GPLv2 |
| waseikanji-ids.txt | Japanese Waseikanji | GPLv2 |
| ws2015-ids.txt | 2015 worksheet | GPLv2 |
| ws2015-ids-cdp.txt | 2015 worksheet with CDP | GPLv2 |
| ucs-strokes.txt | Stroke count info | GPLv2 |

Format: Plain text, tab-delimited

  • Column 1: Unicode codepoint (U+XXXX)
  • Column 2: Character
  • Column 3: IDS sequence using U+2FF0..U+2FFB operators

Example:

U+5730	地	⿰土也
U+82B1	花	⿱艹化

Coverage: CJK Unified Ideographs (URO + Extensions)

Programmatic Usage:

  • Text-based format → easy parsing with Python
  • Companion tool: kawabata/ids for IDS normalization
  • Font requirement: HanaMin/HanaMin AFDKO for full character display

Assessment:

  • ✅ Most comprehensive open IDS database
  • ✅ Easy to parse (plain text)
  • ✅ Actively developed (158 commits)
  • ✅ Covers extended character sets
  • ⚠️ Last release 2018 (but ongoing commits)
  • ⚠️ GPLv2 licensing for most files

Comparison to Alternatives:

  • BabelStone IDS: UTF-8 plain text, all CJK Unified Ideographs
  • cjklib embedded IDS: Database format, requires library installation
  • makemeahanzi: JSON format, includes etymology types

2. Unihan Database (Official Unicode Standard)#

Specification: UAX #38
Lookup Interface: Unihan Database Lookup
Development: unicode-org/unihan-database on GitHub

Relevant Fields for Radical/Component Analysis:

| Field | Description | Example |
| --- | --- | --- |
| kRSUnicode | Radical-stroke count (Unicode) | 32.3 (Radical 土 + 3 strokes) |
| kRSAdobe_Japan1_6 | Radical-stroke (Adobe-Japan1-6) | Variant system |
| kRSKangXi | Kangxi Dictionary radical-stroke | Traditional reference |
| kTotalStrokes | Total stroke count | 6 (for 地) |
| kCangjie | Cangjie input code | Stroke-based decomposition |
| kPhonetic | Phonetic component index | For phonetic compounds |
| kSemanticVariant | Semantic variants | Related characters |

Format: Tab-delimited text files

  • Unihan_RadicalStrokeCounts.txt
  • Unihan_Readings.txt
  • Unihan_Variants.txt

Assessment:

  • ✅ Official Unicode standard (authoritative)
  • ✅ Regularly updated by consortium
  • ✅ Comprehensive coverage of all encoded CJK chars
  • ✅ Well-documented specification
  • ⚠️ Radical-based (not full structural decomposition)
  • ⚠️ Requires parsing multiple files for full data

Use Case: Combine with CJKVI-IDS for comprehensive solution:

  • Unihan → Radical, stroke count, phonetic info
  • CJKVI-IDS → Full structural decomposition

3. makemeahanzi (Open-Source Character Data)#

Repository: skishore/makemeahanzi on GitHub
Website: Make Me a Hanzi
Format: JSON data files
License: Open-source, free to use
Data Sources: Derived from Unihan and cjklib

JSON Structure (from website):

{
  "character": "花",
  "etymology": {
    "type": "pictophonetic",
    "hint": "plants",
    "phonetic": "化",
    "semantic": "艹"
  },
  "strokes": [...],
  "medians": [...]
}

Etymology Types:

  • ideographic: Meaning-based composition
  • pictographic: Picture representation
  • pictophonetic: Phonetic-semantic compound (形聲字)

Phonetic-Semantic Fields (when type = “pictophonetic”):

  • hint: Semantic category hint
  • phonetic: Phonetic component (string, may be null)
  • semantic: Semantic radical (string, may be null)
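A short sketch of consuming these fields: filter parsed records down to 形聲字 entries, skipping records where either component is null (as the format allows). The sample `records` dict mirrors the JSON structure shown above:

```python
def pictophonetic_pairs(records):
    """Yield (char, semantic, phonetic) for pictophonetic entries,
    skipping records where either component is missing/null."""
    for char, rec in records.items():
        ety = rec.get('etymology') or {}
        if ety.get('type') != 'pictophonetic':
            continue
        semantic, phonetic = ety.get('semantic'), ety.get('phonetic')
        if semantic and phonetic:
            yield char, semantic, phonetic

records = {
    '花': {'etymology': {'type': 'pictophonetic',
                         'hint': 'plants', 'phonetic': '化', 'semantic': '艹'}},
    '山': {'etymology': {'type': 'pictographic'}},
}
print(list(pictophonetic_pairs(records)))  # [('花', '艹', '化')]
```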

Assessment:

  • ✅ JSON format (easy integration)
  • ✅ Includes etymology type classification
  • ✅ Phonetic-semantic decomposition
  • ✅ Open-source, permissive license
  • ⚠️ Derived data (not primary source)
  • ⚠️ Limited to common characters (not full Unicode coverage)

Best Use: Educational tools, mnemonic generation, character learning apps


4. Hsiao & Shillcock Phonetic Compound Database#

Publication: Analysis of a Chinese Phonetic Compound Database (2006)
Journal: Journal of Psycholinguistic Research

Content:

  • Most frequent phonetic compounds
  • Decomposition into semantic/phonetic radicals (etymologically accurate)
  • Further decomposition into Cangjie stroke patterns
  • Pronunciation data included
  • Character frequency information

Format: Academic dataset (not open GitHub repo)
Access: May require contacting authors or institutional access

Assessment:

  • ✅ Academically rigorous
  • ✅ Etymologically accurate decompositions
  • ✅ Includes frequency data (useful for learning progression)
  • ❌ Limited availability (not easily downloadable)
  • ❌ Focused on frequent characters only
  • ⚠️ Published 2006 (may need updates for newer Unicode)

Use Case: Research reference for validating other data sources


Data Quality Comparison#

IDS Coverage Comparison#

| Source | Coverage | Format | Maintenance | License |
| --- | --- | --- | --- | --- |
| CJKVI-IDS | Full URO + Ext | Plain text | Active commits | GPLv2 |
| BabelStone IDS | All CJK Unified | Plain text | Unknown | Unknown |
| cjklib embedded | URO + common | SQLite | Inactive | BSD-like |
| makemeahanzi | Common chars | JSON | Active | Open |
Recommendation: CJKVI-IDS for comprehensive coverage, makemeahanzi for educational/learner focus


Radical System Comparison#

| System | Radicals | Source | Use Case |
| --- | --- | --- | --- |
| Kangxi (康熙) | 214 | Traditional standard | Unicode Unihan, dictionaries |
| Simplified | 189 | PRC standard | Mainland China education |
| Unicode Radical | 214 + variants | Unicode Standard | Digital text processing |
Key Insight: Multiple valid radical systems exist. Unihan uses Kangxi as baseline with variants.


Phonetic-Semantic Decomposition Sources#

| Source | Phonetic Info | Semantic Info | Coverage | Format |
| --- | --- | --- | --- | --- |
| makemeahanzi | ✅ Component | ✅ Component | Common | JSON |
| Unihan kPhonetic | ✅ Index only | ❌ Limited | Full | Text |
| Hsiao & Shillcock | ✅ Full | ✅ Full | Frequent | Academic |
| CJKVI-IDS | ❌ No | ❌ No | Full | Text |

Gap Identified: No comprehensive open database combining:

  1. Full Unicode coverage (CJKVI-IDS level)
  2. Phonetic-semantic component classification (makemeahanzi level)
  3. Etymology type annotation (ideographic/pictographic/pictophonetic)

Workaround: Combine multiple sources programmatically


Integration Strategy#

Layer 1: Structural Decomposition

  • Primary: CJKVI-IDS (ids.txt) for full structural decomposition
  • Fallback: Parse Unihan for characters not in CJKVI-IDS

Layer 2: Radical Information

  • Primary: Unihan kRSUnicode for Kangxi radical + stroke count
  • Enhancement: Unihan kRSKangXi for traditional dictionary reference

Layer 3: Phonetic-Semantic Classification

  • Primary: makemeahanzi for common characters (JSON)
  • Research: Hsiao & Shillcock for validation/academic reference

Layer 4: Pronunciation & Readings

  • Primary: pypinyin for modern Mandarin pinyin
  • Enhancement: Unihan readings for historical/variant pronunciations

Library Selection by Use Case#

Use Case: Educational Tool / Language Learning App

  • Data: makemeahanzi (etymology, mnemonics) + pypinyin (pronunciation)
  • Library: Custom parser (JSON + pypinyin library)
  • Rationale: JSON format easy, mnemonic focus, active maintenance

Use Case: Computational Linguistics Research

  • Data: CJKVI-IDS (full coverage) + Unihan (radicals/metadata)
  • Library: Custom parser or cjklib3 fork (if verified maintained)
  • Rationale: Comprehensive data, research flexibility

Use Case: Dictionary/Reference Application

  • Data: Unihan (comprehensive) + CJKVI-IDS (structure) + pypinyin (pronunciation)
  • Library: Custom integration layer
  • Rationale: Multiple data sources needed, no single library sufficient

Use Case: Quick Prototype

  • Data: makemeahanzi JSON
  • Library: Standard Python JSON parsing
  • Rationale: Fastest time-to-value, single file

Technical Debt & Risk Assessment#

cjklib (Inactive)#

  • Risk: No Python 3 support, abandoned project
  • Mitigation: Fork to cjklib3 OR extract database and write custom parser
  • Effort: High (forking), Medium (custom parser)

CJKVI-IDS (2018 release)#

  • Risk: Stale release despite active commits
  • Mitigation: Use latest GitHub main branch, not release tag
  • Effort: Low (just change download source)

Multiple Data Source Dependencies#

  • Risk: Conflicting decompositions, synchronization issues
  • Mitigation: Document source precedence rules, version pinning
  • Effort: Medium (governance + tooling)

License Compliance#

  • Risk: GPLv2 for CJKVI-IDS main file
  • Mitigation: Use ids-ext-cdef.txt (no GPL) OR comply with GPLv2
  • Effort: Low (file selection) or Medium (GPL compliance)

Next Steps#

For S3 (Need-Driven Patterns):

  • Map generic use cases to data source combinations
  • Identify gaps in current data sources
  • Prototype parsing strategies (pseudo-code level, no implementation)

For S4 (Strategic Analysis):

  • Unicode Consortium roadmap for Unihan updates
  • Python ecosystem trend analysis (active vs. dormant projects)
  • Traditional vs. Simplified handling strategies
  • CJK variant forms (Japan/Korea/Vietnam) decomposition differences




S3: Need-Driven

S3-need-driven: Approach#

Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.



S3 Need-Driven: Generic Use Case Patterns#

Status: Discovery research (no code execution)
Created: 2026-01-28
Purpose: Generic use case patterns for character decomposition (NOT application-specific)


Pattern Philosophy#

Per DISCOVERY_VS_IMPLEMENTATION.md guidance:

  • ✅ Generic patterns applicable to any developer
  • ✅ Technology-neutral descriptions
  • ❌ NO application-specific requirements (“for our app…”)
  • ❌ NO implementation plans (“install and test…”)

Pattern 1: Character Decomposition Lookup#

Generic Need: Given a Chinese character, retrieve its structural components

Input: Single character (e.g., 花)
Output: Structural decomposition with spatial relationship

Data Requirements:

  • IDS sequence (⿱艹化)
  • Component list ([艹, 化])
  • Spatial operator (top-bottom arrangement)

Example Variations:

  • Flat decomposition: 花 → [艹, 化]
  • Hierarchical decomposition: 花 → [艹, [亻, 匕]] (recursive)
  • IDS string representation: “⿱艹化”

Data Sources:

  • Primary: CJKVI-IDS ids.txt
  • Fallback: cjklib database, makemeahanzi JSON

Complexity Considerations:

  • Single valid decomposition vs. multiple valid partitions
  • Unencoded components (exist only within larger characters)
  • Variant forms (simplified vs. traditional differences)

Generic Use Cases:

  • Dictionary lookup enhancement (show components)
  • Educational tools (character structure visualization)
  • Font rendering (construct missing glyphs)
  • Mnemonic generation (break into memorable parts)
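As a concrete sketch, CJKVI-IDS `ids.txt` lines are tab-separated (`U+82B1<TAB>花<TAB>⿱艹化`); a minimal parser producing the flat decomposition could look like the following (multi-IDS variants and region tags are handled only crudely here):

```python
# Minimal sketch of a CJKVI-IDS ids.txt parser producing flat decompositions.
# Assumes tab-separated lines "U+82B1<TAB>花<TAB>⿱艹化"; comments start with ";;".
IDC_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

def parse_ids_line(line):
    """Return (char, ids, flat components) or None for comments/blank lines."""
    if not line.strip() or line.startswith(";;"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    char = fields[1]
    ids = fields[2].split()[0]      # keep first IDS when several variants are listed
    ids = ids.split("[")[0]         # drop optional region tags like "[GTJ]"
    # Flat decomposition: everything that is not a structural operator.
    components = [c for c in ids if c not in IDC_OPERATORS and c != char]
    return char, ids, components

print(parse_ids_line("U+82B1\t花\t⿱艹化"))  # ('花', '⿱艹化', ['艹', '化'])
```

Atomic characters (e.g. `一` decomposing to itself) yield an empty component list, which is a convenient leaf marker for recursive patterns below.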

Pattern 2: Radical-Based Character Search#

Generic Need: Find all characters containing a specific radical

Input: Radical (e.g., 土 “earth”)
Output: List of characters containing that radical

Data Requirements:

  • Radical-to-character index
  • Optional: stroke count for sub-filtering
  • Optional: character frequency for sorting

Example Queries:

  • All characters with radical 土: [地, 坐, 场, 城, …]
  • Characters with 土 + 6 total strokes: [地, 圾, …]
  • Top 100 most frequent characters with 土: sorted by usage

Data Sources:

  • Primary: Unihan kRSUnicode field
  • Enhancement: Character frequency data (Unihan kFrequency or external corpus)
  • Alternative: Reverse index from CJKVI-IDS (parse components)

Radical System Considerations:

  • Kangxi radicals (214) vs. Simplified radicals (189)
  • Radical variants (讠 vs. 言)
  • Unicode radical characters vs. component characters

Generic Use Cases:

  • Dictionary browsing (traditional radical index navigation)
  • Character learning (study characters by semantic category)
  • Input method optimization (radical-based typing)
  • Corpus analysis (semantic field distribution)
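A radical-to-character index can be sketched directly from Unihan `kRSUnicode` lines (tab-separated `U+5730<TAB>kRSUnicode<TAB>32.3`, where 32 is radical 土 and 3 the additional strokes); the sample lines below are illustrative:

```python
from collections import defaultdict

def build_radical_index(unihan_lines):
    """Map radical number -> [(char, additional_strokes), ...]."""
    index = defaultdict(list)
    for line in unihan_lines:
        if line.startswith("#") or line.count("\t") < 2:
            continue
        cp, field, value = line.rstrip("\n").split("\t", 2)
        if field != "kRSUnicode":
            continue
        char = chr(int(cp[2:], 16))          # "U+5730" -> 地
        first = value.split()[0]             # value may list several radical.stroke pairs
        radical, _, extra = first.partition(".")
        index[radical.rstrip("'")].append((char, int(extra)))  # "'" marks simplified radical
    return index

idx = build_radical_index(["U+5730\tkRSUnicode\t32.3", "U+574E\tkRSUnicode\t32.4"])
print(idx["32"])  # [('地', 3), ('坎', 4)]
```

Sub-filtering by stroke count or sorting by frequency would then operate on the `(char, additional_strokes)` tuples.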

Pattern 3: Phonetic Series Identification#

Generic Need: Find characters sharing the same phonetic component

Input: Phonetic component (e.g., 青 “qing”)
Output: List of characters using that phonetic component with readings

Data Requirements:

  • Phonetic-semantic compound classification
  • Phonetic component extraction
  • Pronunciation data for validation

Example Queries:

  • Characters with phonetic 青: [清 qīng, 情 qíng, 请 qǐng, 晴 qíng, …]
  • Show semantic radical for each: [清(氵), 情(忄), 请(讠), 晴(日)]
  • Pronunciation similarity analysis: all share “qing” pronunciation

Data Sources:

  • Primary: makemeahanzi etymology fields (phonetic/semantic)
  • Research: Hsiao & Shillcock phonetic compound database
  • Validation: Unihan kPhonetic field + pronunciation data

Pattern Insights:

  • ~80% of Chinese characters are phonetic-semantic compounds
  • Phonetic component often indicates pronunciation (not always exact)
  • Same phonetic in different semantic contexts = different meanings

Generic Use Cases:

  • Mnemonic generation (learn pronunciation from phonetic)
  • Character learning (phonetic family grouping)
  • Historical linguistics (phonetic evolution study)
  • OCR validation (phonetic similarity for error correction)
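With phonetic/semantic component data in hand, grouping a phonetic series is a simple filter. The table below is a hypothetical, makemeahanzi-style extract, not real library output:

```python
# Hypothetical etymology table: char -> (semantic component, phonetic component).
etym = {"清": ("氵", "青"), "情": ("忄", "青"), "请": ("讠", "青"), "河": ("氵", "可")}

def phonetic_series(component):
    """All characters sharing the given phonetic component (sorted by codepoint)."""
    return sorted(c for c, (_sem, phon) in etym.items() if phon == component)

print(phonetic_series("青"))  # ['情', '清', '请']
```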

Pattern 4: Etymology Tree Generation#

Generic Need: Trace a character’s historical development and component relationships

Input: Target character (e.g., 森)
Output: Etymology tree showing component hierarchy

Decomposition Levels:

  1. Level 0: Target character (森)
  2. Level 1: Direct components (木 + 木 + 木)
  3. Level 2: Component components (if applicable)
  4. Level N: Minimal strokes or radicals

Data Requirements:

  • Recursive IDS parsing
  • Etymology type classification (pictographic/ideographic/pictophonetic)
  • Historical form variants (oracle bone → seal script → modern)

Example Tree (simplified):

森 (forest)
├─ 木 (tree) [pictographic]
├─ 木 (tree) [pictographic]
└─ 木 (tree) [pictographic]
└─ Etymology type: ideographic compound (multiple trees = forest)

Example Tree (compound):

想 (think)
├─ 相 (mutual/appearance) [phonetic component]
│   ├─ 木 (tree) [radical]
│   └─ 目 (eye) [component]
└─ 心 (heart) [semantic radical]
└─ Etymology type: phonetic-semantic (heart + appearance → thinking)

Data Sources:

  • Structure: CJKVI-IDS (recursive parsing)
  • Etymology type: makemeahanzi etymology.type field
  • Historical forms: Specialized databases (Shuowen Jiezi, oracle bone scripts)

Termination Conditions:

  • Reach minimal stroke components (一, 丨, etc.)
  • Reach pictographic radicals (no further decomposition)
  • Reach unencoded component (no Unicode codepoint)
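The recursive expansion can be sketched over a flat decomposition table (hypothetical sample data; a real table would come from parsed CJKVI-IDS). Recursion terminates when a component has no further entry:

```python
# Hypothetical flat decomposition table (real data: parsed CJKVI-IDS).
decomp = {"想": ["相", "心"], "相": ["木", "目"]}

def etymology_tree(char):
    """Nested (char, children) tuples; a char with no decomp entry is a leaf."""
    return (char, [etymology_tree(part) for part in decomp.get(char, [])])

print(etymology_tree("想"))
# ('想', [('相', [('木', []), ('目', [])]), ('心', [])])
```

Etymology-type labels and historical forms would be attached per node from the other data sources listed below.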

Generic Use Cases:

  • Educational tools (visualize character structure)
  • Mnemonic storytelling (understand meaning from components)
  • Historical linguistics research (trace character evolution)
  • Character learning curriculum (teach simple before complex)

Pattern 5: Mnemonic Component Extraction#

Generic Need: Extract semantically meaningful components for memory aids

Input: Character + desired mnemonic style
Output: Components with semantic hints

Mnemonic Types:

  1. Semantic decomposition: 好 = 女 (woman) + 子 (child) → “woman with child = good”
  2. Phonetic hint: 清 = 氵(water) + 青 (qing) → “water that sounds like ‘qing’”
  3. Pictographic story: 森 = 木+木+木 → “many trees = forest”

Component Filtering:

  • Prioritize semantically meaningful components (radicals)
  • De-emphasize phonetic components for semantic mnemonics
  • Identify pictographic components for visual mnemonics

Data Requirements:

  • IDS decomposition (structure)
  • Etymology type (phonetic vs. semantic vs. pictographic)
  • Radical semantic category (Kangxi radical meanings)
  • Component meanings (individual character definitions)

Example Processing:

Character: 想 (think)
├─ Decomposition: ⿱相心
├─ Components: [相, 心]
├─ Etymology: phonetic-semantic
├─ Semantic radical: 心 (heart)
├─ Phonetic component: 相 (xiāng → xiǎng)
└─ Mnemonic: "Thinking (xiǎng) comes from the heart (心), using mutual (相) understanding"

Data Sources:

  • Structure: CJKVI-IDS
  • Etymology: makemeahanzi
  • Radical meanings: Kangxi radical semantic categories
  • Component definitions: Dictionary database (CEDICT, CC-CEDICT)

Generic Use Cases:

  • Flashcard apps (auto-generate memory aids)
  • Character learning books (etymology explanations)
  • Educational games (story-based character teaching)
  • Spaced repetition systems (memorable hints)

Pattern 6: Learning Progression Sequencing#

Generic Need: Order characters from simple to complex for curriculum design

Input: Set of characters to learn
Output: Ordered sequence from simple components to complex compounds

Complexity Metrics:

  1. Stroke count: Fewer strokes = simpler
  2. Component count: Fewer components = simpler
  3. Decomposition depth: Shallow hierarchy = simpler
  4. Component frequency: Common components taught first

Sequencing Algorithm (generic pseudo-logic):

1. Identify all unique components across character set
2. Build dependency graph (character depends on its components)
3. Topological sort: teach components before compounds
4. Within same level, sort by stroke count (ascending)
5. Within same stroke count, sort by frequency (descending)
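Steps 2-4 map directly onto a topological sort. A sketch using Python's stdlib `graphlib` (toy dependency and stroke-count data; the frequency tie-break is replaced by codepoint order for determinism):

```python
from graphlib import TopologicalSorter

# Toy data: character -> component dependencies, plus stroke counts.
deps = {"好": ["女", "子"], "休": ["人", "木"], "林": ["木"],
        "女": [], "子": [], "人": [], "木": []}
strokes = {"女": 3, "子": 3, "人": 2, "木": 4, "好": 6, "休": 6, "林": 8}

ts = TopologicalSorter({c: set(parts) for c, parts in deps.items()})
ts.prepare()
order = []
while ts.is_active():
    # Within each dependency level, sort by stroke count (codepoint as tie-break).
    ready = sorted(ts.get_ready(), key=lambda c: (strokes[c], c))
    order.extend(ready)
    ts.done(*ready)

print(order)  # ['人', '女', '子', '木', '休', '好', '林']
```

Components always surface before the compounds that contain them, matching the dependency rule above.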

Example Sequence:

Level 0 (Radicals/Simple):
  一 (1 stroke) → 二 (2) → 十 (2) → 人 (2) → 三 (3) → 木 (4) → 水 (4)

Level 1 (Simple Compounds):
  好 (女+子, 6 strokes) → 休 (人+木, 6) → 林 (木+木, 8)

Level 2 (Complex Compounds):
  森 (木+木+木, 12) → 想 (相+心, 13)

Data Requirements:

  • IDS decomposition (dependency graph)
  • Stroke count data (Unihan kTotalStrokes)
  • Character frequency (Unihan kFrequency or corpus data)
  • Component reusability (how many characters use this component)

Data Sources:

  • Structure: CJKVI-IDS
  • Stroke count: Unihan kTotalStrokes or ucs-strokes.txt
  • Frequency: Unihan kFrequency, SUBTLEX-CH, or web corpus
  • HSK/TOCFL levels (for educational sequencing)

Generic Use Cases:

  • Textbook curriculum design (character introduction order)
  • Adaptive learning systems (personalized progression)
  • Character workbook generation (practice sheets)
  • Font subset optimization (include components first)

Pattern 7: Character Similarity Analysis#

Generic Need: Find characters with similar structure or components

Input: Query character
Output: Ranked list of structurally similar characters

Similarity Dimensions:

  1. Component overlap: Share N components
  2. Structural similarity: Same IDS operators (both ⿰ left-right)
  3. Radical similarity: Same Kangxi radical
  4. Phonetic similarity: Share phonetic component
  5. Visual similarity: OCR confusion pairs

Example Queries:

  • Characters structurally similar to 清: [请, 晴, 情] (all ⿰X青)
  • Characters with same radical as 清: [河, 海, 泪] (all 氵)
  • Characters visually similar to 大: [太, 犬, 天] (OCR confusion)

Similarity Scoring (generic approach):

Jaccard similarity: intersection(components) / union(components)
Structural similarity: same IDS operator = +1 point
Radical match: same Kangxi radical = +2 points
Phonetic match: same phonetic component = +1 point
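A toy implementation of this scoring, using the same weights; the decomposition, radical, and phonetic tables are hypothetical stand-ins for parsed data:

```python
def jaccard(a, b):
    """Component-set similarity in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical per-character tables (real data: IDS, Unihan, makemeahanzi).
decomp = {"清": ["氵", "青"], "请": ["讠", "青"], "晴": ["日", "青"], "河": ["氵", "可"]}
radical = {"清": "氵", "请": "讠", "晴": "日", "河": "氵"}
phonetic = {"清": "青", "请": "青", "晴": "青", "河": "可"}

def similarity(x, y):
    score = jaccard(decomp[x], decomp[y])
    score += 2 if radical[x] == radical[y] else 0    # same Kangxi radical
    score += 1 if phonetic[x] == phonetic[y] else 0  # same phonetic component
    return score

print(similarity("清", "请"), similarity("清", "河"))
```

Note the weighting choice: a shared radical (+2) outranks a shared phonetic (+1), so 河 scores higher against 清 than 请 does under this particular sketch.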

Data Requirements:

  • IDS decomposition (component extraction)
  • Radical classification (Kangxi radical)
  • Phonetic-semantic classification
  • Optional: Stroke-level similarity (for OCR)

Data Sources:

  • Structure: CJKVI-IDS
  • Radicals: Unihan kRSUnicode
  • Phonetics: makemeahanzi etymology
  • Visual confusion: OCR error datasets (research papers)

Generic Use Cases:

  • OCR post-processing (similar character suggestions)
  • Character learning (study confusable pairs)
  • Dictionary features (related characters navigation)
  • Text correction (typo detection)

Pattern 8: Component Reusability Analysis#

Generic Need: Identify high-value components (used in many characters)

Input: Character database
Output: Components ranked by reusability

Reusability Metrics:

  1. Character count: How many characters contain this component
  2. Frequency-weighted: Weighted by character usage frequency
  3. Pedagogical value: Appears in HSK 1-6 vocabulary

Example High-Reusability Components:

口 (mouth): appears in 1000+ characters
  - 吃, 喝, 叫, 唱, 吗, 呢, 哪, 呀, ...
  - High pedagogical value (early HSK levels)

氵(water radical): appears in 800+ characters
  - 水, 河, 海, 洋, 湖, 江, 清, 洗, ...
  - Semantic category: water-related meanings

Analysis Output:

  • Top 100 most reusable components (by character count)
  • Components by semantic category (214 Kangxi radicals)
  • Components by pedagogical level (HSK 1-6 coverage)

Data Requirements:

  • Component extraction from all characters
  • Character frequency data (usage weighting)
  • HSK/TOCFL level data (pedagogical value)

Data Sources:

  • Components: CJKVI-IDS (parse all IDS)
  • Frequency: Web corpus or Unihan kFrequency
  • HSK levels: HSK vocabulary lists (external)

Generic Use Cases:

  • Curriculum design (teach high-value components first)
  • Font subsetting (prioritize reusable components)
  • Input method optimization (frequent component shortcuts)
  • Mnemonic generation (focus on common building blocks)
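The character-count metric reduces to counting component occurrences across a decomposition table, a near one-liner with `collections.Counter` (toy table below; real input would be every parsed IDS):

```python
from collections import Counter

# Toy decomposition table; real input: flat decompositions from CJKVI-IDS.
decomp = {"吃": ["口", "乞"], "叫": ["口", "丩"], "唱": ["口", "昌"],
          "河": ["氵", "可"], "清": ["氵", "青"]}

# Count each component once per character (set() avoids double-counting, e.g. 林's 木+木).
counts = Counter(part for parts in decomp.values() for part in set(parts))
print(counts.most_common(2))  # [('口', 3), ('氵', 2)]
```

Frequency-weighted and pedagogical variants would multiply each character's contribution by its corpus frequency or filter by HSK membership before counting.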

Pattern 9: Traditional ↔ Simplified Mapping#

Generic Need: Map between traditional and simplified character forms

Input: Character (traditional or simplified)
Output: Corresponding form(s) in other system

Complexity Cases:

  1. One-to-one: 花 (same in both) → 花
  2. One-to-many: 發 (trad) → 发 or 髮 (simp, context-dependent)
  3. Many-to-one: 發/髮 (trad) → 发 (simp, merged)
  4. Component differences: 國 → 国 (或 → 玉 inside 囗)

Decomposition Impact:

  • Traditional: 國 → ⿴囗或 (囗 + 或)
  • Simplified: 国 → ⿴囗玉 (囗 + 玉)
  • Component change affects mnemonic/etymology

Data Requirements:

  • Traditional/simplified variant mappings
  • IDS for both forms (structural differences)
  • Context rules (for one-to-many mappings)

Data Sources:

  • Variants: Unihan kSimplifiedVariant, kTraditionalVariant
  • Structure: CJKVI-IDS (includes both forms)
  • Context: HanDeDict or CC-CEDICT (word-level distinctions)
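The Unihan variant fields are tab-separated like other Unihan data; a sketch that keeps one-to-many mappings intact (the sample lines show the 發/髮 → 发 merger):

```python
def parse_variants(lines, field="kSimplifiedVariant"):
    """Map char -> list of variant chars (one-to-many preserved)."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and parts[1] == field:
            char = chr(int(parts[0][2:], 16))
            mapping[char] = [chr(int(v[2:], 16)) for v in parts[2].split()]
    return mapping

m = parse_variants(["U+767C\tkSimplifiedVariant\tU+53D1",   # 發 -> 发
                    "U+9AEE\tkSimplifiedVariant\tU+53D1"])  # 髮 -> 发
print(m)  # {'發': ['发'], '髮': ['发']}
```

Running the same parser with `field="kTraditionalVariant"` inverts the direction; context-dependent one-to-many cases still require word-level data on top.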

Generic Use Cases:

  • Text conversion (traditional ↔ simplified)
  • Dictionary lookup (show both forms)
  • Cross-Strait content (Taiwan vs. Mainland)
  • Character learning (understand simplification rules)

Pattern 10: Stroke Order & Decomposition Alignment#

Generic Need: Align component decomposition with stroke writing order

Input: Character
Output: Components ordered by stroke sequence

Challenge: IDS decomposition ≠ stroke order

  • IDS: 道 = ⿺辶首 (the enclosing 辶 is listed first)
  • Stroke order: 首 is written first; enclosures like 辶 are written last

Alignment Strategy:

  1. Parse IDS decomposition (structural)
  2. Parse stroke order sequence (temporal)
  3. Map strokes to components (spatial)
  4. Reorder components by first stroke

Data Requirements:

  • IDS decomposition (spatial structure)
  • Stroke order sequences (temporal order)
  • Stroke-to-component mapping

Data Sources:

  • IDS: CJKVI-IDS
  • Stroke order: makemeahanzi strokes array
  • Graphics: SVG stroke paths (for visualization)

Generic Use Cases:

  • Handwriting teaching apps (component-aware practice)
  • Stroke order animations (show components correctly)
  • Calligraphy tools (component-based guidance)
  • Character tracing worksheets (pedagogical materials)

Cross-Pattern Dependencies#

Foundation Patterns (required by others):

  1. Character Decomposition Lookup (Pattern 1)
    • Used by: Etymology Tree (4), Mnemonic Extraction (5), Learning Progression (6)

Enhancement Patterns (add value to foundation):

  2. Radical-Based Search (Pattern 2) + Phonetic Series (Pattern 3)
    • Enables: Comprehensive similarity analysis (Pattern 7)

Advanced Patterns (combine multiple patterns):

  6. Learning Progression = Decomposition (1) + Frequency + Stroke Count
  7. Similarity Analysis = Decomposition (1) + Radical (2) + Phonetic (3)


Data Source Coverage Matrix#

| Pattern | CJKVI-IDS | Unihan | makemeahanzi | pypinyin | External |
|---|---|---|---|---|---|
| 1. Decomposition | ✅ Primary | 🟨 Fallback | 🟨 Fallback | ❌ | ❌ |
| 2. Radical Search | 🟨 Reverse | ✅ Primary | 🟨 Enhancement | ❌ | 🟨 Frequency |
| 3. Phonetic Series | 🟨 Structure | 🟨 kPhonetic | ✅ Primary | ✅ Validation | ❌ |
| 4. Etymology Tree | ✅ Primary | ❌ | 🟨 Type | ❌ | 🟨 Historical |
| 5. Mnemonic Extract | ✅ Structure | 🟨 Radicals | ✅ Etymology | ❌ | 🟨 Definitions |
| 6. Learning Progress | ✅ Structure | ✅ Strokes | 🟨 Data | ❌ | ✅ HSK levels |
| 7. Similarity | ✅ Primary | 🟨 Radicals | 🟨 Phonetic | ❌ | ❌ |
| 8. Reusability | ✅ Primary | 🟨 Frequency | ❌ | ❌ | 🟨 HSK |
| 9. Trad ↔ Simp | ✅ Both forms | ✅ Variants | ❌ | ❌ | 🟨 Context |
| 10. Stroke Order | 🟨 Structure | ❌ | ✅ Primary | ❌ | ❌ |

Legend:

  • ✅ Primary data source (best fit)
  • 🟨 Enhancement or fallback
  • ❌ Not applicable

Identified Data Gaps#

Gap 1: No comprehensive phonetic-semantic database

  • CJKVI-IDS: Structure only (no phonetic classification)
  • makemeahanzi: Limited coverage (common chars only)
  • Hsiao & Shillcock: Limited availability (academic dataset)

Gap 2: Traditional vs. Simplified decomposition differences

  • CJKVI-IDS includes both forms BUT no explicit mapping
  • Must parse both and compare programmatically

Gap 3: Stroke order integration

  • makemeahanzi has stroke order but limited coverage
  • No standard format linking stroke order to IDS decomposition

Gap 4: Historical forms

  • Oracle bone → seal script → modern evolution
  • Requires specialized databases (Shuowen Jiezi, etc.)
  • Not covered by Unicode Unihan or CJKVI-IDS

Next Steps for S4 (Strategic)#

Technology Evolution:

  • Unicode roadmap for CJK extensions (Ext-G, Ext-H)
  • Python ecosystem health (cjklib inactive, alternatives?)
  • Modern NLP integration (transformers, embeddings)

Maintenance & Governance:

  • CJKVI-IDS: Why no releases since 2018 despite commits?
  • pypinyin: Active maintenance trajectory
  • Data quality assurance across multiple sources

Variant Handling:

  • CJK unification implications (China/Taiwan/Japan/Korea/Vietnam)
  • Regional glyph differences in decomposition
  • Font rendering vs. semantic decomposition

S4: Strategic

S4-strategic: Approach#

Analysis of Chinese radical/component decomposition libraries and databases for character analysis, learning applications, and etymological research.


S4-strategic Recommendations#

Use cjkradlib or HanziJS for programmatic access. CHISE IDS database for comprehensive data. Unihan for Unicode standard mappings.


S4 Strategic: Long-Term Viability & Technology Evolution#

Status: Discovery research (no code execution)
Created: 2026-01-28
Purpose: Strategic analysis of maintenance trajectories, ecosystem health, and technology evolution


Executive Summary#

Key Finding: Character decomposition is a mature but fragmented domain with:

  • ✅ Stable Standards: Unicode Unihan maintained by consortium (low obsolescence risk)
  • ⚠️ Fragmented Tooling: No single actively-maintained comprehensive Python library
  • ✅ Open Data: CJKVI-IDS, makemeahanzi provide foundation for custom solutions
  • 🔄 Recommended Strategy: Data-first approach (direct parsing) over library dependency

Unicode Consortium Roadmap#

Unihan Database Maintenance#

Governance: Unicode Technical Committee (UTC)
Update Cycle: Aligned with Unicode version releases (annual/biannual)
Current Version: Unicode 16.0 (September 2024)
Next Expected: Unicode 17.0 (September 2026)

Historical Trajectory:

  • Regular CJK extension releases (Ext-A through Ext-I complete; Ext-J in preparation)
  • Ext-G: 4,939 characters (Unicode 13.0, March 2020)
  • Ext-H: 4,192 characters (Unicode 15.0, September 2022)
  • Ext-I: 622 characters (Unicode 15.1, September 2023)

Maintenance Commitment:

  • ✅ Strong institutional backing (Unicode Consortium)
  • ✅ International standards body (ISO/IEC 10646 alignment)
  • ✅ Multi-stakeholder governance (China, Taiwan, Japan, Korea)
  • ✅ Backward compatibility guarantees (no character removal)

Risk Assessment: LOW

  • Standard unlikely to be abandoned
  • Continuous extension for historical/rare characters
  • Well-documented specification (UAX #38)

IDS Standardization Status#

Current State: Core IDS operators (U+2FF0..U+2FFB) encoded since Unicode 3.0 (1999); Unicode 15.1 (2023) added further operators (U+2FFC..U+2FFF)
Stability: ✅ MATURE - Core operator set unchanged since 1999; later additions are backward compatible

Open Questions:

  • Will Unicode extend the IDS operator set further?
  • No active proposals identified in search results
  • Existing operators sufficient for known character structures

Risk Assessment: LOW

  • Core operators stable since 1999
  • Community implementations (CHISE, CJKVI) rely on current standard
  • No indication of breaking changes

Python Ecosystem Health#

cjklib: Inactive but Extractable#

Status: Abandoned (no Python 3 support, no updates since ~2012)
Last PyPI Release: 0.3.2 (years old)
GitHub Activity: 0 PRs, 0 issues in past year

Strategic Options:

  1. Fork: Community fork cjklib3

    • Risk: Fork maintenance uncertain, no verified activity
    • Effort: Depends on fork quality
  2. Database Extraction: Dump cjklib SQLite database, use directly

    • Risk: One-time extraction, no updates
    • Effort: Medium (parsing SQL schema)
  3. Abandon: Use alternative data sources (CJKVI-IDS, Unihan)

    • Risk: None (open data)
    • Effort: Low (simple parsing)

Recommendation: Option 3 (abandon cjklib, use open data sources)

  • Rationale: Unmaintained library is technical debt
  • Alternative: CJKVI-IDS + Unihan provide equivalent data
  • Future-proof: Text files easier to maintain than abandoned library

pypinyin: Actively Maintained#

Status: ✅ ACTIVELY MAINTAINED
Latest Release: July 2025 (recent)
Python Support: 2.7, 3.5-3.13 (modern versions)
Community: Most popular Chinese pronunciation library

Strategic Value:

  • ✅ Pronunciation features (Pinyin, Zhuyin, Cyrillic)
  • ✅ Heteronym support (multi-reading characters)
  • ✅ Regular updates (2025 release confirmed)
  • ❌ Not focused on decomposition (different domain)

Recommendation: USE for pronunciation features, combine with decomposition data


Data Source Viability#

CJKVI-IDS: Active Development, Stale Releases#

Last Release: February 2018 (v18.02.20)
GitHub Commits: 158 total (ongoing activity)
Maintenance Gap: Why no releases since 2018 despite commits?

Possible Explanations:

  1. Rolling updates: Main branch is canonical (no formal releases)
  2. Low churn: IDS data stable, incremental updates only
  3. Volunteer maintenance: No bandwidth for release process

Strategic Implications:

  • ✅ Data is actively curated (commits continue)
  • ⚠️ No release hygiene (versioning, changelogs)
  • ✅ Plain text format (easy to fork if abandoned)

Risk Mitigation:

  • Use git commit SHA for version pinning (not release tags)
  • Monitor commit activity quarterly (stale = <2 commits/year)
  • Fallback: Fork and maintain internally if abandoned

Risk Assessment: MEDIUM

  • Data quality: HIGH (actively used by community)
  • Maintenance: UNCERTAIN (commits but no releases)
  • Replaceability: HIGH (plain text, can fork)

makemeahanzi: Active but Limited Coverage#

GitHub Activity: Check repository for recent commits
Coverage: Common characters (not full Unicode)
License: Open-source (permissive)

Strategic Value:

  • ✅ Etymology type classification (pictographic/phonetic/ideographic)
  • ✅ Phonetic-semantic decomposition
  • ✅ JSON format (easy integration)
  • ❌ Limited to common characters (~8,000-10,000)

Use Case Fit:

  • ✅ Excellent for educational/learner applications (HSK, common chars)
  • ❌ Insufficient for comprehensive dictionary/research (need full Unicode)

Recommendation: USE for educational features, supplement with CJKVI-IDS for full coverage


CJK Variant Handling Strategy#

Regional Glyph Differences#

Challenge: Same Unicode codepoint, different glyphs across regions

  • Example: 骨 (bone)
    • PRC/Simplified: Specific stroke structure
    • Taiwan/Traditional: Slightly different form
    • Japan: Kanji variant form
    • Korea: Hanja form

Impact on Decomposition:

  • IDS may differ for same codepoint (regional variants)
  • Radical classification may vary (different radical systems)
  • Mnemonic generation affected (visual form matters)

Unicode Approach:

  • Han Unification: One codepoint for “same” character across regions
  • Variation Selectors: U+E0100..U+E01EF for explicit variants
  • Font-level distinction (not data-level)

Data Source Handling:

| Source | Variant Support | Notes |
|---|---|---|
| CJKVI-IDS | Multiple IDS per codepoint | Covers regional variants |
| Unihan | kRSUnicode vs. kRSKangXi | Different radical systems |
| makemeahanzi | Simplified-focused | Limited traditional variant info |

Strategic Recommendation:

  1. Default to Unicode Standard (Unihan as canonical)
  2. Store multiple IDS per character (regional variants)
  3. Locale-aware decomposition (user’s region preference)
  4. Document variant handling policy (e.g., “prefer simplified for PRC learners”)

Risk: Variant confusion if not explicitly handled
Mitigation: Store locale/region metadata with decomposition data


Integration with Modern NLP#

Transformer Models & Character Embeddings#

Current Trend (2025-2026): Character-level tokenization for Chinese

  • BERT-wwm-ext (whole word masking)
  • MacBERT (improved Chinese BERT)
  • ERNIE 3.0 (Baidu’s multilingual model)

Radical/Component Awareness:

  • Research question: Do transformer models benefit from explicit radical features?
  • Some models incorporate radical embeddings (RoBERTa-wwm-ext-large)
  • Debate: Implicit learning vs. explicit radical features

Strategic Opportunities:

  1. Radical-aware embeddings: Inject decomposition into model training
  2. Zero-shot learning: Use decomposition for rare character understanding
  3. Mnemonic generation: LLM + decomposition data = auto-generated stories
  4. Curriculum generation: Optimize learning order via component dependencies

Data Requirements for NLP Integration:

  • IDS decomposition (structured data)
  • Radical semantic categories (meaning injection)
  • Phonetic component labels (pronunciation features)
  • Etymology types (character category classification)

Recommendation: Position decomposition data as enhancement layer for NLP

  • Base models learn implicitly from text
  • Decomposition adds interpretability + rare character handling

Traditional vs. Simplified Strategic Handling#

Simplification Rules (1950s-1960s PRC reforms)#

Types of Simplification:

  1. Structural: 國 → 国 (component replacement)
  2. Radical: 讓 → 让 (讠 for 言)
  3. Merging: 發/髮 → 发 (homophone merging)
  4. Phonetic loan: 麵 → 面 (reuse existing character)

Decomposition Impact:

  • Traditional: 國 → ⿴囗或 (complex)
  • Simplified: 国 → ⿴囗玉 (different components)
  • Etymology may be obscured in simplified (e.g., 爱 lost 心 from 愛)

Data Alignment Challenges:

| Challenge | Example | Solution |
|---|---|---|
| One-to-many | 干 → 干/乾/幹 (context-dependent) | Word-level mapping (not char) |
| Many-to-one | 發/髮 → 发, 彆 → 别 (merged) | Store all traditional sources |
| Component change | 國/国 radicals differ | IDS for both forms |

Strategic Approaches:

Approach 1: Dual Storage (store both forms)

  • Pros: Complete data, no information loss
  • Cons: 2x storage, complexity in querying

Approach 2: Primary + Variant Links

  • Pros: Efficient storage, clear relationships
  • Cons: Must choose primary (simplified or traditional)

Approach 3: Etymology-Preserving (prefer traditional for learning)

  • Pros: Maintains historical meaning
  • Cons: Less useful for simplified-only learners

Recommendation: Approach 1 (dual storage)

  • Rationale: Storage is cheap, data completeness valuable
  • Implementation: Character ID → [traditional_data, simplified_data]
  • Querying: Filter by user’s locale preference
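A minimal sketch of the dual-storage record; the field names and the locale rule are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class CharacterEntry:
    """Dual storage: decomposition data for both script forms, queried by locale."""
    codepoint: str
    traditional: dict   # e.g. {"char": "國", "ids": "⿴囗或"}
    simplified: dict    # e.g. {"char": "国", "ids": "⿴囗玉"}

    def for_locale(self, locale):
        # Illustrative rule: traditional-script locales get the traditional record.
        return self.traditional if locale in ("zh-TW", "zh-HK", "zh-MO") else self.simplified

entry = CharacterEntry("U+570B",
                       {"char": "國", "ids": "⿴囗或"},
                       {"char": "国", "ids": "⿴囗玉"})
print(entry.for_locale("zh-TW")["char"])  # 國
```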

Technology Evolution Vectors#

Vector 1: Unicode Extensions#

Trajectory: Continued CJK extensions (Ext-I, Ext-J, …)
Impact: More rare/historical characters require IDS data
Preparation: Monitor Unicode roadmap, update data sources post-release

Action Items:

  • Subscribe to Unicode announcements (unicode.org mailing lists)
  • Budget for data updates quarterly (new extensions)
  • Automated scraping of Unihan releases (CI/CD pipeline)

Vector 2: Machine Learning for Decomposition#

Emerging Research: Auto-generate IDS from character images
Potential: Fill gaps for unencoded/rare characters
Maturity: Research phase (not production-ready)

Strategic Watch:

  • Academic papers on character structure recognition
  • Open-source models (e.g., OCR → IDS generation)
  • Validation against human-curated data (CJKVI-IDS as ground truth)

Action: Monitor but don’t depend on (experimental technology)

Vector 3: Web Standards (CSS, SVG, Fonts)#

Trend: Browser support for IDS rendering (experimental)
Use Case: Display unencoded characters via IDS composition
Status: Limited browser support (cutting-edge only)

Strategic Opportunity:

  • Educational tools could use IDS for dynamic character rendering
  • Fallback: SVG generation from component SVGs (makemeahanzi approach)

Action: Prototype but maintain static fallbacks (browser compatibility)

Vector 4: Open Data Movement#

Trend: More linguistic data becoming open (CC-BY, public domain)

Recent Examples:

  • Wikimedia language data projects
  • Universal Dependencies for Chinese parsing

Opportunity: Contribute back to community

  • Publish derived datasets (e.g., phonetic series mappings)
  • Contribute fixes to CJKVI-IDS (if errors found)
  • Open-source tooling for decomposition processing

Strategic Value: Build reputation + community goodwill


Maintenance Burden Analysis#

Option A: Depend on Libraries (cjklib, hanzipy)#

Pros:

  • Pre-built functionality
  • Community testing (if active)

Cons:

  • Maintenance dependency (abandoned libraries = broken systems)
  • Python version compatibility issues (cjklib: no Python 3)
  • API changes break downstream code

Maintenance Burden: HIGH (when library abandoned)
Risk: HIGH (cjklib already abandoned)

Option B: Direct Data Parsing (CJKVI-IDS, Unihan)#

Pros:

  • Full control over parsing logic
  • Plain text = easy to maintain
  • No library dependency rot

Cons:

  • Initial implementation effort (write parsers)
  • Must handle data format changes (rare for text files)

Maintenance Burden: LOW (data format stable)
Risk: LOW (can fork data if source abandoned)

Option C: Hybrid (Libraries for Some, Data for Others)#

Pros:

  • Use pypinyin (active) for pronunciation
  • Parse CJKVI-IDS directly for decomposition
  • Best of both worlds

Cons:

  • Multiple dependencies (some active, some not)

Maintenance Burden: MEDIUM
Risk: MEDIUM (only for library portions)

Recommendation: Option C (Hybrid)

  • pypinyin for pronunciation (active maintenance)
  • Direct parsing for decomposition (stable data)
  • Avoid abandoned libraries (cjklib)

10-Year Outlook (2026-2036)#

Stable Elements (High Confidence)#

  1. Unicode Standard: Will continue, backward compatible
  2. Unihan Database: Regular updates, consortium-backed
  3. IDS Operators: Stable, no breaking changes expected
  4. Plain Text Data: CJKVI-IDS, Unihan will remain parseable

Uncertain Elements (Monitor Closely)#

  1. Python Library Ecosystem: May consolidate or fragment further
  2. CJK NLP: Transformer models may change decomposition relevance
  3. Regional Variants: Taiwan/Hong Kong policy changes possible

Emerging Opportunities#

  1. LLM-Powered Mnemonic Generation: GPT-4+ with decomposition data
  2. AR/VR Character Learning: Spatial decomposition visualization
  3. Adaptive Curriculum: ML-optimized learning progressions

Strategic Positioning#

Bet on:

  • Open data sources (Unihan, CJKVI-IDS)
  • Stable standards (Unicode, IDS)
  • Actively maintained tools (pypinyin)

Avoid:

  • Unmaintained libraries (cjklib)
  • Proprietary datasets (vendor lock-in)
  • Cutting-edge unproven tech (ML decomposition, experimental browser APIs)

Hedge with:

  • Modular architecture (swap data sources easily)
  • Data versioning (track Unihan/CJKVI-IDS versions)
  • Automated testing (detect data format changes)

Governance & Data Quality Assurance#

Multi-Source Data Conflicts#

Problem: CJKVI-IDS vs. Unihan vs. makemeahanzi may disagree
Example: Character 某 decomposition

  • Source A: ⿱艹木
  • Source B: ⿱甘木
  • Cause: Different interpretations of component

Resolution Strategy:

  1. Precedence Rules: Define authoritative source per data type

    • Radicals: Unihan (official Unicode)
    • IDS: CJKVI-IDS (most comprehensive)
    • Etymology: makemeahanzi (educational focus)
  2. Validation Pipeline: Cross-check sources

    • Flag conflicts for manual review
    • Document decisions (source A chosen because…)
  3. Version Pinning: Track source data versions

    • CJKVI-IDS: git commit SHA
    • Unihan: Unicode version number
    • makemeahanzi: GitHub release tag
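The precedence rules can be sketched as a first-hit resolver; the source names, their ordering, and the 某 sample data below are illustrative:

```python
# Precedence per data type: the first source listed is authoritative.
PRECEDENCE = {"ids": ["cjkvi_ids", "makemeahanzi"], "radical": ["unihan"]}

def resolve(char, data_type, sources):
    """Return (winning_source, value) from the highest-precedence source with data."""
    for name in PRECEDENCE[data_type]:
        value = sources.get(name, {}).get(char)
        if value is not None:
            return name, value
    return None, None

sources = {"cjkvi_ids": {"某": "⿱甘木"},
           "makemeahanzi": {"某": "⿱艹木", "花": "⿱艹化"}}
print(resolve("某", "ids", sources))  # conflict settled by precedence: ('cjkvi_ids', '⿱甘木')
print(resolve("花", "ids", sources))  # fallback: ('makemeahanzi', '⿱艹化')
```

A validation pipeline would additionally log every case where a lower-precedence source disagrees with the winner, feeding the manual-review queue described above.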

Quality Metrics:

  • Coverage: % of Unicode CJK chars with IDS
  • Consistency: % agreement across sources
  • Completeness: % of chars with phonetic/semantic classification

Risk Register#

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| cjklib abandoned | HAPPENED | High | Use alternative data sources |
| CJKVI-IDS stale | Medium | Medium | Monitor commits, fallback to Unihan |
| Unicode breaks IDS | Low | High | Versioning + automated tests |
| Phonetic DB unavailable | Medium | Low | makemeahanzi sufficient for most uses |
| Regional variant conflicts | High | Medium | Dual storage + locale preference |
| Python 2→3 migration | PAST | High | Already avoided (no cjklib) |
| Data license issues | Low | Medium | Use GPL-exempt files (ids-ext-cdef.txt) |

Recommendations#

Short-Term (2026-2027)#

  1. Adopt CJKVI-IDS for structural decomposition (direct parsing)
  2. Use Unihan for radical/stroke metadata (official source)
  3. Integrate pypinyin for pronunciation features (active maintenance)
  4. Supplement with makemeahanzi for etymology (educational value)
  5. Avoid cjklib (unmaintained, Python 2 only)

Medium-Term (2027-2029)#

  1. Build validation pipeline (cross-check data sources)
  2. Contribute fixes to CJKVI-IDS (community engagement)
  3. Monitor Unicode roadmap (prepare for Ext-J and later extensions)
  4. Evaluate cjklib3 fork (if community adopts)

Long-Term (2029-2036)#

  1. Explore ML-enhanced decomposition (auto-IDS generation research)
  2. Integrate with NLP pipelines (transformer embeddings)
  3. Contribute datasets to community (open data movement)
  4. Plan for next-gen standards (if IDS operators evolve)

Published: 2026-03-06 Updated: 2026-03-06