1.148 Morphological Analysis#


Explainer

Morphological Analysis Libraries - Domain Explainer#

For: Business stakeholders, non-linguists, developers new to NLP

Purpose: Understand what morphological analysis is and why it matters for language learning applications


What is Morphological Analysis?#

Morphological analysis breaks down words into their component parts and identifies grammatical features.

Simple Example (English)#

Word: “running”

  • Lemma (base form): “run”
  • Part of Speech: VERB
  • Features: present participle, -ing form

Word: “cats”

  • Lemma: “cat”
  • Part of Speech: NOUN
  • Features: plural (English nouns carry no case marking)

Why It Matters for Language Learning#

Problem: You Can’t Learn Languages Word-by-Word#

When you see “Я читаю книгу” (Russian: “I’m reading a book”), you need to know:

  • читаю is the verb “читать” (to read) in 1st person singular
  • книгу is the noun “книга” (book) in accusative case (direct object)

A dictionary lists only the base forms “читать” and “книга”, so it won’t help unless you can map the inflected forms back to them.

Solution: Morphological Analysis#

Parse the sentence to identify:

  1. Lemma (dictionary form) - so you can look it up
  2. Part of speech - noun, verb, adjective, etc.
  3. Grammatical features - case, tense, number, gender, etc.

This lets the trainer ask: “What case is книгу?” (Answer: accusative)
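As a rough illustration of how an analyzed token could drive such a question, here is a minimal sketch. The `Token` record and the `case_question` helper are hypothetical names invented for this example, and the analysis is written by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One analyzed word: surface text, lemma, POS, and grammatical features."""
    text: str
    lemma: str
    pos: str
    feats: dict = field(default_factory=dict)


def case_question(token):
    """Return a (question, answer) pair, or None if the token has no case."""
    case = token.feats.get("case")
    if case is None:
        return None  # e.g. finite verbs carry no case to ask about
    return (f"What case is {token.text}?", case)


# Hand-written analysis of "Я читаю книгу" (illustrative, not parser output).
tokens = [
    Token("читаю", "читать", "VERB", {"person": "1", "number": "singular"}),
    Token("книгу", "книга", "NOUN", {"case": "accusative", "number": "singular"}),
]

questions = [q for t in tokens if (q := case_question(t)) is not None]
print(questions)
```

Only the noun yields a question here, because the verb has no `case` feature.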


The Three Languages#

Japanese: 3 Scripts + Agglutinative Morphology#

Challenge: Multiple scripts in one sentence

私は本を読んでいます
(watashi wa hon wo yonde imasu)
"I am reading a book"

What we need:

  • 私 (watashi) → pronoun, “I”
  • は (wa) → particle (topic marker)
  • 本 (hon) → noun, “book”
  • を (wo) → particle (direct object marker)
  • 読んでいます (yonde imasu) → verb “読む” (yomu, “to read”) + present progressive form

Key insight: Particles are critical for understanding Japanese grammar.


Russian: 6 Cases + Aspect System#

Challenge: Case inflection changes word endings

Я читаю книгу в библиотеке
"I'm reading a book in the library"

What we need:

  • книгу → “книга” (book), accusative case (direct object)
  • библиотеке → “библиотека” (library), prepositional case (location)
  • читаю → “читать” (to read), imperfective aspect, present tense

Key insight: A noun fills 12 case-number slots (6 cases × 2 numbers), and several slots can share the same ending. You must identify which slot the form you’re seeing belongs to.
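The 6 × 2 grid can be made concrete with the standard paradigm of “книга”. The sketch below is a hand-built lookup table, not library output; `build_reverse_index` is a hypothetical helper, and only the most common variant of each form is listed.

```python
from collections import defaultdict

CASES = ["nominative", "genitive", "dative",
         "accusative", "instrumental", "prepositional"]

# Paradigm of "книга" (book): one list per number, ordered like CASES.
PARADIGM = {
    "singular": ["книга", "книги", "книге", "книгу", "книгой", "книге"],
    "plural": ["книги", "книг", "книгам", "книги", "книгами", "книгах"],
}


def build_reverse_index(paradigm):
    """Map each surface form to every (case, number) slot it can fill."""
    index = defaultdict(list)
    for number, forms in paradigm.items():
        for case, form in zip(CASES, forms):
            index[form].append((case, number))
    return dict(index)


REVERSE = build_reverse_index(PARADIGM)

print(len(REVERSE))       # number of distinct surface forms
print(REVERSE["книгу"])   # unambiguous form
print(REVERSE["книги"])   # ambiguous form
```

The 12 slots collapse into 9 distinct surface forms: “книгу” can only be accusative singular, while “книги” fits three slots, which is why context is needed to pick the right analysis.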


Czech: 7 Cases + Complex Declension#

Challenge: Even more complex than Russian

Čtu knihu v knihovně
"I'm reading a book in the library"

What we need:

  • knihu → “kniha” (book), accusative case
  • knihovně → “knihovna” (library), locative case (the counterpart of Russian’s prepositional case)
  • Čtu → “číst” (to read), 1st person singular

Key insight: 7 cases (vs Russian’s 6; the extra one is the vocative) + complex declension patterns.


The Business Problem#

Without Morphological Analysis:#

  • ❌ Can’t parse real text (only memorize vocabulary lists)
  • ❌ Can’t identify grammar patterns in context
  • ❌ Limited to artificial examples (not authentic texts)
  • ❌ Slow learning (decontextualized grammar drills)

With Morphological Analysis:#

  • ✅ Parse Caesar, Pushkin, Japanese manga
  • ✅ Identify grammar in real sentences
  • ✅ Progressive difficulty (i+1 principle)
  • ✅ Context-based learning (proven more effective)


The Library Selection Problem#

Challenge: Each Language Has Different Ecosystems#

Japanese:

  • MeCab: Classic (2003), widely used, requires system dictionary
  • SudachiPy: Modern (2017), pip-installable (Rust-backed since 0.6, so not pure Python), dictionary installed as a separate package
  • spaCy: General NLP, moderate Japanese support

Russian:

  • pymorphy2: Morphology specialist, excellent accuracy
  • spaCy: General NLP, good Russian model
  • UDPipe: Universal Dependencies, academic

Czech:

  • UDPipe: Best Czech support (Universal Dependencies)
  • spaCy: Experimental Czech model
  • MorphoDiTa: Czech-specific, academic

Decision Framework:#

| Priority | Criterion | Why It Matters |
|---|---|---|
| 1 | Accuracy | Wrong analysis = wrong training |
| 2 | Installation | pip install vs binary dependencies |
| 3 | Performance | Real-time parsing (<100ms/sentence) |
| 4 | API consistency | Same pattern across languages |
| 5 | Maintenance | Active development, bug fixes |

Expected Output Format#

Goal: JSONL (like latin-parse)#

{
  "sentence": "私は本を読んでいます",
  "words": [
    {"text": "私", "lemma": "私", "pos": "PRON", "reading": "わたし"},
    {"text": "は", "lemma": "は", "pos": "ADP", "type": "particle"},
    {"text": "本", "lemma": "本", "pos": "NOUN", "reading": "ほん"},
    {"text": "を", "lemma": "を", "pos": "ADP", "type": "particle"},
    {"text": "読んでいます", "lemma": "読む", "pos": "VERB", "tense": "present", "reading": "よんでいます"}
  ]
}
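The record above is pretty-printed for readability; in an actual JSONL file each sentence would occupy a single line. A minimal sketch of writing and reading that layout with the standard library follows. The helper names `write_jsonl` and `read_jsonl` are made up for this example, and the sample records are hand-written.

```python
import io
import json


def write_jsonl(records, stream):
    """Write each record as one compact JSON object per line."""
    for record in records:
        # ensure_ascii=False keeps Japanese and Cyrillic text human-readable.
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one parsed record per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


records = [
    {"sentence": "本を読む", "words": [
        {"text": "本", "lemma": "本", "pos": "NOUN", "reading": "ほん"},
        {"text": "を", "lemma": "を", "pos": "ADP", "type": "particle"},
        {"text": "読む", "lemma": "読む", "pos": "VERB", "reading": "よむ"},
    ]},
    {"sentence": "Я читаю книгу", "words": [
        {"text": "Я", "lemma": "я", "pos": "PRON"},
        {"text": "читаю", "lemma": "читать", "pos": "VERB"},
        {"text": "книгу", "lemma": "книга", "pos": "NOUN", "case": "accusative"},
    ]},
]

buffer = io.StringIO()  # stands in for a file opened with encoding="utf-8"
write_jsonl(records, buffer)
lines = buffer.getvalue().splitlines()
round_tripped = list(read_jsonl(io.StringIO(buffer.getvalue())))
print(len(lines), round_tripped == records)
```

Because every line is a complete object, a trainer can stream sentences one at a time without loading the whole corpus.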

This feeds into:

  • japanese-train: Interactive trainer (latin-train pattern)
  • japanese-analyze: Progress analysis (latin-analyze pattern)

ROI: Why This Research Matters#

Time Investment:#

  • Research: 8-12 hours (library selection)
  • Implementation: 20-30 hours per language (parser + trainer)
  • Total: roughly 70-100 hours for 3-language system (8-12 research + 3 × 20-30 implementation)

Value Created:#

  • Reusable pattern across all future languages
  • Context-based learning (proven effective)
  • No ongoing costs (self-hosted, no API fees)
  • Multi-language support (Japanese, Russian, Czech immediately)
  • Extensible (add Korean, Arabic, etc. with same pattern)

Strategic Value:#

  • Validates Path 1 (self-operated) for language learning
  • Creates competitive moat (most language apps don’t parse real text)
  • Enables progressive corpus training (i+1 principle)
  • Foundation for polyglot learning system

Success Criteria#

For each language, we need:

  1. ✅ Accurate morphological analysis (>90% lemma + POS accuracy)
  2. ✅ pip installable (or minimal binary deps)
  3. ✅ <100ms per sentence (real-time performance)
  4. ✅ JSONL output format (consistent across languages)
  5. ✅ Active maintenance (not abandoned projects)
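The latency criterion can be checked with a small timing harness. The sketch below is one possible way to do it, using only the standard library; `measure_latency_ms` is a hypothetical helper, and `whitespace_analyzer` is a placeholder standing in for a real library call.

```python
import statistics
import time


def measure_latency_ms(analyze, sentences, repeats=5):
    """Return (median, worst) per-sentence latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        for sentence in sentences:
            start = time.perf_counter()
            analyze(sentence)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


def whitespace_analyzer(sentence):
    # Placeholder: swap in the real analyzer for the language under test.
    return [{"text": word} for word in sentence.split()]


sentences = ["Я читаю книгу в библиотеке", "Čtu knihu v knihovně"]
median_ms, worst_ms = measure_latency_ms(whitespace_analyzer, sentences)
meets_budget = worst_ms < 100.0  # the <100ms/sentence criterion
print(f"median={median_ms:.3f}ms worst={worst_ms:.3f}ms ok={meets_budget}")
```

Model or dictionary loading should happen before the timed loop, since a one-time startup cost is not what the per-sentence budget is meant to measure.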

Overall:

  • Choose 1 library per language (Japanese, Russian, Czech)
  • Prototype parser for each (japanese-parse, russian-parse, czech-parse)
  • Validate with 10-20 real sentences per language
  • Document trade-offs and limitations
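Validating against 10-20 hand-annotated sentences needs a way to score the >90% lemma + POS target. A minimal sketch, assuming predicted and gold tokens are already aligned one-to-one; `lemma_pos_accuracy` is a hypothetical helper, and the sample data is hand-written with one deliberate error.

```python
def lemma_pos_accuracy(predicted, gold):
    """Fraction of tokens whose lemma AND pos both match the gold annotation."""
    if len(predicted) != len(gold):
        raise ValueError("token counts differ; align tokenization first")
    if not gold:
        return 0.0
    correct = sum(
        1 for p, g in zip(predicted, gold)
        if p["lemma"] == g["lemma"] and p["pos"] == g["pos"]
    )
    return correct / len(gold)


# Gold annotation for "Čtu knihu v knihovně".
gold = [
    {"lemma": "číst", "pos": "VERB"},
    {"lemma": "kniha", "pos": "NOUN"},
    {"lemma": "v", "pos": "ADP"},
    {"lemma": "knihovna", "pos": "NOUN"},
]
# Simulated analyzer output with one wrong lemma on the last token.
predicted = [
    {"lemma": "číst", "pos": "VERB"},
    {"lemma": "kniha", "pos": "NOUN"},
    {"lemma": "v", "pos": "ADP"},
    {"lemma": "knihovně", "pos": "NOUN"},
]

accuracy = lemma_pos_accuracy(predicted, gold)
passes = accuracy > 0.90
print(accuracy, passes)
```

With only 10-20 sentences, a single error moves the score by several points, so the result is a sanity check rather than a benchmark.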

Questions This Research Answers#

  1. Japanese: MeCab vs SudachiPy vs spaCy?
  2. Russian: pymorphy2 vs spaCy vs UDPipe?
  3. Czech: UDPipe vs spaCy vs MorphoDiTa?
  4. Unified API: Can we use spaCy for all three? Or per-language specialists?
  5. Installation: Which libraries have minimal dependencies?
  6. Performance: Which are fast enough for interactive training?
  7. Accuracy: Which give correct morphological features?

What Happens After This Research#

Immediate (Research Complete):#

  • Know which library to use for each language
  • Understand installation + performance characteristics
  • Clear decision: spaCy everywhere vs per-language specialists

Next Steps (Implementation):#

  • Experiment 1.950: japanese-parse (MeCab/SudachiPy + JSONL)
  • Experiment 1.951: russian-parse (pymorphy2/spaCy + JSONL)
  • Experiment 1.952: czech-parse (UDPipe/spaCy + JSONL)

Long-term (Application):#

  • japanese-train: Interactive Japanese grammar trainer
  • russian-train: Interactive Russian grammar trainer
  • czech-train: Interactive Czech grammar trainer
  • Multi-language app: Unified language learning system

Glossary#

Morphological analysis: Breaking words into parts and identifying grammatical features

Lemma: Base form of word (dictionary entry). Example: “running” → “run”, “books” → “book”

Part of Speech (POS): Grammatical category (noun, verb, adjective, etc.)

Case: Grammatical role in sentence (nominative=subject, accusative=object, etc.). Russian has 6 cases and Czech has 7; English has remnants (he/him, who/whom)

Aspect: Russian/Czech verbs come in two aspects:

  • Imperfective: ongoing action (Я читаю - “I am reading”)
  • Perfective: completed action (Я прочитал - “I have read”)

Particle: Japanese grammatical markers (は, が, を, に, etc.) that indicate grammatical relationships without inflecting words

Agglutinative (Japanese): Grammar added by stacking suffixes. Example: 読む (read) → 読んでいます (reading now)

Fusional (Russian, Czech): Grammar encoded in word endings. Example: книга (book) → книгу (book-ACC), книге (book-PREP)

JSONL: JSON Lines format - one JSON object per line, easy to stream


S1: Rapid Discovery - Approach#

Methodology: Rapid Library Search (speed-focused)

Time Box: 90-120 minutes maximum

Goal: Identify viable libraries for Japanese, Russian, Czech morphological analysis

Core Philosophy#

Quickly map the solution space for each language:

  • What libraries exist?
  • Which are popular/actively maintained?
  • Can they produce JSONL output (latin-parse pattern)?
  • Is there a unified solution (spaCy) or per-language specialists?

Discovery Process#

1. Japanese Libraries (25 min)#

  • MeCab: Classic Japanese morphological analyzer (2003)
  • fugashi: Modern Python wrapper for MeCab
  • SudachiPy: Modern Japanese tokenizer (2017)
  • spaCy: General NLP with Japanese model

Check:

  • PyPI downloads, GitHub stars
  • Installation complexity (system dependencies?)
  • Output format (can we get lemma + POS + reading?)

2. Russian Libraries (25 min)#

  • pymorphy2: Morphology specialist for Russian
  • spaCy: General NLP with Russian model
  • UDPipe: Universal Dependencies parser

Check:

  • Accuracy for case + aspect identification
  • Performance (<100ms per sentence?)
  • API usability

3. Czech Libraries (25 min)#

  • UDPipe: Universal Dependencies (best Czech support)
  • spaCy: Experimental Czech model
  • MorphoDiTa: Czech-specific morphological analyzer

Check:

  • 7 cases correctly identified?
  • Installation ease
  • Active maintenance

4. Unified vs Specialist Decision (15 min)#

  • Option A: spaCy for all three (unified API)
  • Option B: Per-language specialists (MeCab, pymorphy2, UDPipe)
  • Option C: Hybrid (spaCy where good, specialists where needed)

Trade-offs:

  • API consistency vs accuracy
  • Installation complexity vs performance
  • Maintenance burden vs optimal results

5. Quick Recommendation (10 min)#

  • Default library per language
  • Confidence level
  • When to reconsider (S2/S3 signals)

Evaluation Criteria#

Primary:

  • Popularity (PyPI downloads, GitHub stars)
  • Active maintenance (last commit, open issues)
  • Installation ease (pip install vs binary deps)

Secondary:

  • Accuracy (if easily testable)
  • Performance (if documented)
  • Documentation quality

Tertiary:

  • API consistency across languages
  • Community size

Output Files#

  • approach.md (this file)
  • japanese-libraries.md - MeCab vs SudachiPy vs spaCy
  • russian-libraries.md - pymorphy2 vs spaCy vs UDPipe
  • czech-libraries.md - UDPipe vs spaCy vs MorphoDiTa
  • recommendation.md - Per-language recommendations + unified decision

Success Criteria#

  • Found 2-3 viable options per language
  • Clear popularity leader identified (or not)
  • Can answer: “What should I use for each language?”
  • Total time: <120 minutes

Note for S2/S3#

S1 identifies viable options. If no clear winner per language:

  • S2: Deep-dive accuracy testing, API comparison
  • S3: Validate against real Japanese/Russian/Czech texts

This research likely needs more than S1 (3 languages, complex trade-offs).


Czech Morphological Analysis Libraries#

Winner: UDPipe#

  • PyPI Package: ufal-udpipe
  • Downloads: 52,308/month
  • Latest Update: June 2024 (improved Czech support)
  • Maintenance: Active (Charles University)
  • Python Requirement: 3.x

Best Czech Support#

  • Academic backing (Charles University Prague)
  • June 2024 improvements: 50% error reduction in lemmatization, 58% in POS tagging
  • Universal Dependencies format
  • 7 cases correctly identified

Key Features#

  • Morphological dictionary-supplemented deep learning
  • Lemmatization, POS tagging, dependency parsing
  • Trained on PDT-C 1.0 (Prague Dependency Treebank)
  • Web service + Python client

Installation#

pip install ufal-udpipe
# The binding ships without models; a trained Czech model file must be downloaded separately.
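UDPipe can emit its analysis in the CoNLL-U format: ten tab-separated columns per token, with morphological features packed into the sixth column as `Name=Value` pairs. The sketch below converts CoNLL-U text into the word records used in this document. It does not call UDPipe itself; `parse_feats` and `conllu_to_words` are hypothetical helpers, and `SAMPLE` is hand-written for illustration rather than real tool output.

```python
def parse_feats(feats):
    """Turn 'Case=Loc|Number=Sing' into {'Case': 'Loc', 'Number': 'Sing'}."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def conllu_to_words(conllu):
    """Extract text, lemma, POS and features from CoNLL-U token lines."""
    words = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges ("1-2") and empty nodes
        words.append({
            "text": cols[1],
            "lemma": cols[2],
            "pos": cols[3],
            "feats": parse_feats(cols[5]),
        })
    return words


# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join([
    "# text = Čtu knihu v knihovně",
    "\t".join(["1", "Čtu", "číst", "VERB", "_", "Number=Sing|Person=1|Tense=Pres", "0", "root", "_", "_"]),
    "\t".join(["2", "knihu", "kniha", "NOUN", "_", "Case=Acc|Gender=Fem|Number=Sing", "1", "obj", "_", "_"]),
    "\t".join(["3", "v", "v", "ADP", "_", "Case=Loc", "4", "case", "_", "_"]),
    "\t".join(["4", "knihovně", "knihovna", "NOUN", "_", "Case=Loc|Gender=Fem|Number=Sing", "1", "obl", "_", "_"]),
])

words = conllu_to_words(SAMPLE)
print(words[3])
```

Keeping the conversion separate from the UDPipe call means the same function can be reused for any tool that outputs CoNLL-U.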

Alternative: spaCy Czech#

  • Models: Experimental Czech support
  • Maintenance: Active (spaCy ecosystem)

Why Consider#

  • ✅ Unified API (same as other languages)
  • ✅ spaCy ecosystem familiarity

Why Not Winner#

  • ⚠️ Experimental (not mature)
  • ⚠️ UDPipe has better Czech accuracy (50-58% error reduction)
  • ⚠️ UDPipe is Czech-specialist

Alternative: MorphoDiTa#

  • Source: Czech-specific morphological analyzer
  • Maintenance: Academic

Why Consider#

  • ✅ Czech-specific tool
  • ✅ Academic backing

Why Not Winner#

  • ⚠️ Harder to find Python bindings
  • ⚠️ UDPipe supersedes it (June 2024 paper shows improvements)
  • ⚠️ Less popular in Python ecosystem

Recommendation#

Use UDPipe for czech-parse implementation.

Rationale:

  1. Best Czech support (June 2024 improvements)
  2. 50% lemmatization error reduction vs alternatives
  3. 7 cases correctly handled
  4. Python client available
  5. Universal Dependencies format (standard)

Caveat:

  • Lower adoption (52K downloads/month vs 585K for Russian, 1.9M for Japanese)
  • But this reflects Czech being smaller language community
  • Still viable and actively maintained

Confidence: MEDIUM-HIGH (7/10)

  • High confidence in quality (academic backing, recent improvements)
  • Medium confidence in ecosystem (lower adoption than Japanese/Russian)



Japanese Morphological Analysis Libraries#

Clear Winner: SudachiPy#

  • PyPI Package: SudachiPy
  • Downloads: 1,936,812/month
  • Latest Version: 0.6.10 (Jan 2025)
  • Maintenance: Active (Nov 2024 release)
  • Python Requirement: 3.x

Popularity Winner#

  • 2.7× more downloads than fugashi (1.9M vs 720K/month)
  • Modern development (0.6+ uses Sudachi.rs)
  • Multi-granular tokenization (A/B/C split modes)

Key Features#

  • Morpheme information (dictionary forms, readings, POS)
  • Dictionaries in three sizes, distributed as separate pip packages (sudachidict_small/core/full)
  • Active development by Works Applications

Installation#

pip install SudachiPy sudachidict_core
# The dictionary is a separate package; sudachidict_core is the default one SudachiPy looks for
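A minimal sketch of turning SudachiPy morphemes into the word records used in this document, assuming both `sudachipy` and `sudachidict_core` are installed. The helper names are made up for this example. SudachiPy reports readings in katakana, so the sketch converts them to hiragana to match the target format. The library import is done lazily so that the conversion helper can be used without it.

```python
def katakana_to_hiragana(text):
    """Shift katakana letters (U+30A1..U+30F6) down by 0x60 to get hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def analyze_japanese(sentence):
    """Tokenize a sentence with SudachiPy and return word records."""
    # Imported here so the helper above works even without SudachiPy installed.
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary().create()
    words = []
    for m in tokenizer.tokenize(sentence, SplitMode.C):
        words.append({
            "text": m.surface(),
            "lemma": m.dictionary_form(),
            "pos": m.part_of_speech()[0],  # top-level POS, e.g. 名詞 or 助詞
            "reading": katakana_to_hiragana(m.reading_form()),
        })
    return words


print(katakana_to_hiragana("ホン"))
```

Two caveats for the parser experiment: SudachiPy returns Japanese POS labels (such as 助詞 for particles), so mapping them to tags like `ADP` is a separate step, and verb chains such as 読んでいます are typically split into several morphemes rather than returned as one token.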

Alternative: fugashi (MeCab wrapper)#

  • PyPI Package: fugashi
  • Downloads: 720,141/month
  • Latest Version: 1.5.2
  • Maintenance: Active

Why Consider#

  • ✅ Pythonic MeCab wrapper (Cython-based)
  • ✅ Classic MeCab reliability (2003+)
  • ✅ UniDic dictionary support

Why Not Winner#

  • ⚠️ 2.7× fewer downloads than SudachiPy
  • ⚠️ Requires dictionary installation (unidic-lite or unidic)
  • ⚠️ MeCab is older technology (SudachiPy more modern)

Alternative: spaCy Japanese (ja_core)#

  • Models: ja_core_news_sm/md/lg
  • Maintenance: Active (spaCy ecosystem)

Why Consider#

  • ✅ Unified API (same as other languages)
  • ✅ Full NLP pipeline (not just morphology)

Why Not Winner#

  • ⚠️ Uses SudachiPy internally for tokenization
  • ⚠️ Heavier (full NLP vs morphology specialist)
  • ⚠️ Morphology = subset of what SudachiPy provides

Recommendation#

Use SudachiPy for japanese-parse implementation.

Rationale:

  1. Clear popularity leader (1.9M downloads/month)
  2. Modern, actively maintained (2024-2025 releases)
  3. Multi-granular tokenization (flexible parsing)
  4. spaCy uses it internally anyway

Confidence: HIGH (9/10)



S1 Rapid Discovery - Recommendation#

Time Spent: ~100 minutes

Confidence Level: HIGH (Japanese, Russian), MEDIUM-HIGH (Czech)

Per-Language Winners#

| Language | Winner | Downloads/Month | Rationale |
|---|---|---|---|
| Japanese | SudachiPy | 1,936,812 | Clear popularity leader (2.7× vs fugashi), modern, multi-granular |
| Russian | pymorphy3 | 584,844 | Morphology specialist, active maintenance, excellent case/aspect analysis |
| Czech | UDPipe | 52,308 | Best Czech support, June 2024 improvements (50-58% error reduction) |

Installation#

# Japanese
pip install SudachiPy sudachidict_core

# Russian
pip install pymorphy3

# Czech
pip install ufal-udpipe

Key Findings#

Pattern 1: No Unified Solution#

Unlike 1.141 (FSRS) and 1.142 (genanki) where one library wins across the board, morphological analysis requires per-language specialists:

  • SudachiPy: Japanese-specific
  • pymorphy3: Russian-specific (+ Ukrainian)
  • UDPipe: Multi-language but strongest for Czech

spaCy could provide a unified API, but it uses these specialists internally anyway (e.g., spaCy ja_core uses SudachiPy).

Pattern 2: Popularity Varies by Language Community#

  • Japanese: 1.9M downloads/month (large community, active NLP)
  • Russian: 585K downloads/month (strong community)
  • Czech: 52K downloads/month (smaller language community)

Lower Czech adoption reflects language community size, not tool quality.

Pattern 3: Morphology Specialists Win#

All three winners are morphology-focused (not general NLP):

  • SudachiPy: Japanese morphology + tokenization
  • pymorphy3: Russian morphology + inflection
  • UDPipe: Multi-language morphology (Universal Dependencies)

Takeaway: Don’t use general NLP if you need deep morphological analysis.


Implementation Path#

Next Steps (Experiments)#

1.950-japanese-text-parser:

  • Use SudachiPy for tokenization + morphology
  • Output JSONL (latin-parse pattern)
  • Extract: lemma, POS, reading, particles

1.951-russian-text-parser:

  • Use pymorphy3 for morphology
  • Output JSONL (latin-parse pattern)
  • Extract: lemma, POS, case, aspect, gender

1.952-czech-text-parser:

  • Use UDPipe for morphology
  • Output JSONL (latin-parse pattern)
  • Extract: lemma, POS, case (7 cases), number, gender

Application Implementation#

applications/language-learning/:

  • src/japanese/japanese_train.py - Interactive trainer (latin-train pattern)
  • src/russian/russian_train.py - Interactive trainer
  • src/czech/czech_train.py - Interactive trainer

Trade-offs Accepted#

Unified API vs Accuracy#

  • Decision: Choose accuracy (per-language specialists)
  • Trade-off: 3 different APIs instead of unified spaCy API
  • Rationale: Morphology quality critical for language learning
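The cost of three different APIs can be contained by hiding each library behind a small adapter with a shared call signature. The sketch below shows one way to do that with a registry; `register`, `analyze`, and the `demo` analyzer are hypothetical names, and the demo adapter is a whitespace placeholder rather than a real library call.

```python
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[dict]]
_REGISTRY: Dict[str, Analyzer] = {}


def register(lang):
    """Decorator that registers an analyzer function under a language code."""
    def decorator(func):
        _REGISTRY[lang] = func
        return func
    return decorator


def analyze(lang, sentence):
    """Dispatch to the registered analyzer and wrap the result as one record."""
    try:
        analyzer = _REGISTRY[lang]
    except KeyError:
        raise ValueError(f"no analyzer registered for {lang!r}") from None
    return {"sentence": sentence, "words": analyzer(sentence)}


@register("demo")
def _whitespace_demo(sentence):
    # Placeholder: a real adapter would call SudachiPy, pymorphy3, or UDPipe.
    return [{"text": w, "lemma": w.lower(), "pos": "X"} for w in sentence.split()]


result = analyze("demo", "Čtu knihu")
print(result)
```

Each language-specific adapter would translate its library's output into the same record shape, so the trainer code never touches a library API directly.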

Installation Complexity#

  • Decision: 3 separate packages
  • Trade-off: pip install × 3 vs pip install spacy + models
  • Rationale: Simpler individual installs than managing spaCy models

Maintenance Burden#

  • Decision: Track 3 libraries instead of 1
  • Trade-off: Monitor 3 release cycles
  • Rationale: All actively maintained (2024-2025 releases)


Confidence Assessment#

Japanese: HIGH (9/10)#

  • ✅ Clear popularity leader (1.9M downloads/month)
  • ✅ Active development (Jan 2025 release)
  • ✅ Modern technology (Sudachi.rs)
  • ✅ Multi-granular tokenization

Russian: HIGH (8/10)#

  • ✅ Morphology specialist (not general NLP)
  • ✅ Strong adoption (585K downloads/month)
  • ✅ Actively maintained (successor to pymorphy2)
  • ✅ Excellent case + aspect analysis

Only risk: pymorphy3 is newer (2.0.4) vs pymorphy2 (0.9.1 unmaintained)

Czech: MEDIUM-HIGH (7/10)#

  • ✅ Best Czech accuracy (June 2024 improvements)
  • ✅ Academic backing (Charles University)
  • ✅ Universal Dependencies standard
  • ⚠️ Lower adoption (52K downloads/month)
  • ⚠️ Smaller language community

Risk: Smaller community = fewer Stack Overflow answers and worked examples


When to Revisit This Decision#

Reconsider Japanese (SudachiPy):

  • If SudachiPy development stalls (check GitHub activity)
  • If spaCy ja_core significantly improves (check benchmarks)

Reconsider Russian (pymorphy3):

  • If pymorphy3 becomes unmaintained (fork like pymorphy2?)
  • If spaCy ru_core morphology matches pymorphy3 quality

Reconsider Czech (UDPipe):

  • If spaCy Czech model matures (currently experimental)
  • If Czech-specific library emerges with better adoption

General reconsideration signal:

  • Unified spaCy API becomes compelling (if building 10+ languages)
  • Per-language specialists become unmaintained

S2/S3 Not Required Because…#

S1 answered key questions:

  1. ✅ Which libraries exist? (SudachiPy, pymorphy3, UDPipe)
  2. ✅ Which are popular? (Clear download numbers)
  3. ✅ Which are maintained? (All have 2024-2025 releases)
  4. ✅ Is there unified solution? (No - per-language specialists win)

S2 would add (not needed):

  • Detailed API comparison (all have morphology APIs)
  • Accuracy benchmarks (pymorphy3/UDPipe papers already show this)
  • Performance testing (<100ms/sentence likely for all)

S3 would add (not needed):

  • Real text validation (popularity suggests they work)
  • Integration prototypes (defer to experiments 1.950-1.952)

Decision: S1 sufficient - clear winners, high confidence


Hardware Store Philosophy#

“In Stock Now” (1.148 base):

  • Japanese: SudachiPy ✅
  • Russian: pymorphy3 ✅
  • Czech: UDPipe ✅

“Catalog Entries” (1.148.X - LANGUAGE_FAMILIES_ROADMAP.md):

  • Arabic, Chinese, Hebrew, ASL, Korean, Turkish, etc.
  • Mapped but not researched
  • Research when needed (user demand signal)

Pattern validated: Per-language specialists > unified general NLP for morphology-intensive tasks




Russian Morphological Analysis Libraries#

Clear Winner: pymorphy3#

  • PyPI Package: pymorphy3
  • Downloads: 584,844/month
  • Latest Version: 2.0.4
  • Maintenance: Active (successor to pymorphy2)
  • Python Requirement: 3.9-3.14

Morphology Specialist#

  • Dedicated Russian morphology (not general NLP)
  • Successor to pymorphy2 (original is unmaintained)
  • 585K downloads/month = strong adoption

Key Features#

  • Case identification (6 cases: nominative, genitive, dative, accusative, instrumental, prepositional)
  • Aspect analysis (perfective/imperfective)
  • Gender, number, tense, person
  • Inflection engine (generate forms from lemma)

Installation#

pip install pymorphy3
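A minimal sketch of reading case, number, and aspect from pymorphy3, assuming the package is installed. The analyzer tags words with OpenCorpora grammemes (for example `accs` for accusative and `loct` for prepositional), so the sketch maps them to the case names used in this document. `CASE_NAMES`, `parse_to_record`, and `analyze_russian_word` are hypothetical names for this example.

```python
# OpenCorpora case grammemes mapped to the six case names used in this document.
CASE_NAMES = {
    "nomn": "nominative",
    "gent": "genitive",
    "datv": "dative",
    "accs": "accusative",
    "ablt": "instrumental",
    "loct": "prepositional",
}


def parse_to_record(parse):
    """Convert one pymorphy parse candidate into a plain dict."""
    tag = parse.tag
    return {
        "lemma": parse.normal_form,
        "pos": tag.POS,
        # Unmapped or missing values (e.g. None for verbs) pass through as-is.
        "case": CASE_NAMES.get(tag.case, tag.case),
        "number": tag.number,
        "aspect": tag.aspect,
        "score": parse.score,
    }


def analyze_russian_word(word, morph=None):
    """Return every candidate analysis for a single word, best score first."""
    if morph is None:
        import pymorphy3  # imported lazily; building the analyzer is costly
        morph = pymorphy3.MorphAnalyzer()
    return [parse_to_record(p) for p in morph.parse(word)]
```

The analyzer works on isolated words and returns a ranked list of candidates, so an ambiguous form such as “книге” (dative or prepositional) still needs sentence context to choose between them. In a parser, the `MorphAnalyzer` instance should be created once and passed in through `morph`.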

Alternative: spaCy Russian (ru_core)#

  • Models: ru_core_news_sm/md/lg
  • Components: morphologizer, lemmatizer, parser
  • Maintenance: Active

Why Consider#

  • ✅ Unified API (same as other languages)
  • ✅ Full NLP pipeline (NER, dependency parsing)
  • ✅ Token.morph for morphological features

Why Not Winner#

  • ⚠️ General NLP (not morphology specialist)
  • ⚠️ pymorphy3 has deeper morphology analysis
  • ⚠️ Trained on Nerus dataset (good but not specialized)

Alternative: UDPipe#

  • PyPI Package: ufal-udpipe
  • Downloads: 52,308/month
  • Maintenance: Active (v1 and v2)

Why Consider#

  • ✅ Universal Dependencies format
  • ✅ Multi-language support
  • ✅ Academic backing (Charles University)

Why Not Winner#

  • ⚠️ 11× fewer downloads than pymorphy3
  • ⚠️ Czech is its strength, not Russian
  • ⚠️ More complex setup than pymorphy3

Recommendation#

Use pymorphy3 for russian-parse implementation.

Rationale:

  1. Morphology specialist (not general NLP)
  2. Strong adoption (585K downloads/month)
  3. Actively maintained (successor to pymorphy2)
  4. Excellent case + aspect analysis (critical for Russian)
  5. Simple API for morphological parsing

When to consider spaCy:

  • If building unified multi-language parser with same API
  • If need full NLP pipeline (NER, dependency parsing)

Confidence: HIGH (8/10)


Published: 2026-03-06

Updated: 2026-03-06