1.148 Morphological Analysis#
Explainer
Morphological Analysis Libraries - Domain Explainer#
For: Business stakeholders, non-linguists, developers new to NLP Purpose: Understand what morphological analysis is and why it matters for language learning applications
What is Morphological Analysis?#
Morphological analysis breaks down words into their component parts and identifies grammatical features.
Simple Example (English)#
Word: “running”
- Lemma (base form): “run”
- Part of Speech: VERB
- Features: present participle, -ing form
Word: “cats”
- Lemma: “cat”
- Part of Speech: NOUN
- Features: plural, nominative case
Why It Matters for Language Learning#
Problem: You Can’t Learn Languages Word-by-Word#
When you see “Я читаю книгу” (Russian: “I’m reading a book”), you need to know:
- читаю is the verb “читать” (to read) in 1st person singular
- книгу is the noun “книга” (book) in accusative case (direct object)
A dictionary won’t help if you don’t know the grammatical forms.
Solution: Morphological Analysis#
Parse the sentence to identify:
- Lemma (dictionary form) - so you can look it up
- Part of speech - noun, verb, adjective, etc.
- Grammatical features - case, tense, number, gender, etc.
This lets the trainer ask: “What case is книгу?” (Answer: accusative)
The Three Languages#
Japanese: 3 Scripts + Agglutinative Morphology#
Challenge: Multiple scripts in one sentence
私は本を読んでいます
(watashi wa hon wo yonde imasu)
"I am reading a book"What we need:
- 私 (watashi) → pronoun, “I”
- は (wa) → particle (topic marker)
- 本 (hon) → noun, “book”
- を (wo) → particle (direct object marker)
- 読んでいます (yonde imasu) → verb “読む” (yomu, “to read”) + present progressive form
Key insight: Particles are critical for understanding Japanese grammar.
Russian: 6 Cases + Aspect System#
Challenge: Case inflection changes word endings
Я читаю книгу в библиотеке
"I'm reading a book in the library"What we need:
- книгу → “книга” (book), accusative case (direct object)
- библиотеке → “библиотека” (library), prepositional case (location)
- читаю → “читать” (to read), imperfective aspect, present tense
Key insight: Same word has 12 forms (6 cases × 2 numbers). You must identify which form you’re seeing.
Czech: 7 Cases + Complex Declension#
Challenge: Even more complex than Russian
Čtu knihu v knihovně
"I'm reading a book in the library"What we need:
- knihu → “kniha” (book), accusative case
- knihovně → “knihovna” (library), locative case (7th case!)
- Čtu → “číst” (to read), 1st person singular
Key insight: 7 cases (vs Russian’s 6) + complex declension patterns.
The Business Problem#
Without Morphological Analysis:#
❌ Can’t parse real text (only memorize vocabulary lists) ❌ Can’t identify grammar patterns in context ❌ Limited to artificial examples (not authentic texts) ❌ Slow learning (decontextualized grammar drills)
With Morphological Analysis:#
✅ Parse Caesar, Pushkin, Japanese manga ✅ Identify grammar in real sentences ✅ Progressive difficulty (i+1 principle) ✅ Context-based learning (proven more effective)
The Library Selection Problem#
Challenge: Each Language Has Different Ecosystems#
Japanese:
- MeCab: Classic (2003), widely used, requires system dictionary
- SudachiPy: Modern (2017), better accuracy, pure Python
- spaCy: General NLP, moderate Japanese support
Russian:
- pymorphy2: Morphology specialist, excellent accuracy
- spaCy: General NLP, good Russian model
- UDPipe: Universal Dependencies, academic
Czech:
- UDPipe: Best Czech support (Universal Dependencies)
- spaCy: Experimental Czech model
- MorphoDiTa: Czech-specific, academic
Decision Framework:#
| Priority | Criterion | Why It Matters |
|---|---|---|
| 1 | Accuracy | Wrong analysis = wrong training |
| 2 | Installation | pip install vs binary dependencies |
| 3 | Performance | Real-time parsing (<100ms/sentence) |
| 4 | API consistency | Same pattern across languages |
| 5 | Maintenance | Active development, bug fixes |
Expected Output Format#
Goal: JSONL (like latin-parse)#
{
"sentence": "私は本を読んでいます",
"words": [
{"text": "私", "lemma": "私", "pos": "PRON", "reading": "わたし"},
{"text": "は", "lemma": "は", "pos": "ADP", "type": "particle"},
{"text": "本", "lemma": "本", "pos": "NOUN", "reading": "ほん"},
{"text": "を", "lemma": "を", "pos": "ADP", "type": "particle"},
{"text": "読んでいます", "lemma": "読む", "pos": "VERB", "tense": "present", "reading": "よんでいます"}
]
}This feeds into:
- japanese-train: Interactive trainer (latin-train pattern)
- japanese-analyze: Progress analysis (latin-analyze pattern)
ROI: Why This Research Matters#
Time Investment:#
- Research: 8-12 hours (library selection)
- Implementation: 20-30 hours per language (parser + trainer)
- Total: ~100 hours for 3-language system
Value Created:#
- Reusable pattern across all future languages
- Context-based learning (proven effective)
- No ongoing costs (self-hosted, no API fees)
- Multi-language support (Japanese, Russian, Czech immediately)
- Extensible (add Korean, Arabic, etc. with same pattern)
Strategic Value:#
- Validates Path 1 (self-operated) for language learning
- Creates competitive moat (most language apps don’t parse real text)
- Enables progressive corpus training (i+1 principle)
- Foundation for polyglot learning system
Success Criteria#
For each language, we need:
- ✅ Accurate morphological analysis (
>90% lemma + POS accuracy) - ✅ pip installable (or minimal binary deps)
- ✅
<100ms per sentence (real-time performance) - ✅ JSONL output format (consistent across languages)
- ✅ Active maintenance (not abandoned projects)
Overall:
- Choose 1 library per language (Japanese, Russian, Czech)
- Prototype parser for each (japanese-parse, russian-parse, czech-parse)
- Validate with 10-20 real sentences per language
- Document trade-offs and limitations
Questions This Research Answers#
- Japanese: MeCab vs SudachiPy vs spaCy?
- Russian: pymorphy2 vs spaCy vs UDPipe?
- Czech: UDPipe vs spaCy vs MorphoDiTa?
- Unified API: Can we use spaCy for all three? Or per-language specialists?
- Installation: Which libraries have minimal dependencies?
- Performance: Which are fast enough for interactive training?
- Accuracy: Which give correct morphological features?
What Happens After This Research#
Immediate (Research Complete):#
- Know which library to use for each language
- Understand installation + performance characteristics
- Clear decision: spaCy everywhere vs per-language specialists
Next Steps (Implementation):#
- Experiment 1.950: japanese-parse (MeCab/SudachiPy + JSONL)
- Experiment 1.951: russian-parse (pymorphy2/spaCy + JSONL)
- Experiment 1.952: czech-parse (UDPipe/spaCy + JSONL)
Long-term (Application):#
- japanese-train: Interactive Japanese grammar trainer
- russian-train: Interactive Russian grammar trainer
- czech-train: Interactive Czech grammar trainer
- Multi-language app: Unified language learning system
Glossary#
Morphological analysis: Breaking words into parts and identifying grammatical features
Lemma: Base form of word (dictionary entry) Example: “running” → “run”, “books” → “book”
Part of Speech (POS): Grammatical category (noun, verb, adjective, etc.)
Case: Grammatical role in sentence (nominative=subject, accusative=object, etc.) Russian/Czech have 6-7 cases; English has remnants (he/him, who/whom)
Aspect: Russian/Czech verbs have two forms:
- Imperfective: ongoing action (Я читаю - “I am reading”)
- Perfective: completed action (Я прочитал - “I have read”)
Particle: Japanese grammatical markers (は, が, を, に, etc.) Indicate grammatical relationships without inflecting words
Agglutinative (Japanese): Grammar added by stacking suffixes Example: 読む (read) → 読んでいます (reading now)
Fusional (Russian, Czech): Grammar encoded in word endings Example: книга (book) → книгу (book-ACC), книге (book-PREP)
JSONL: JSON Lines format - one JSON object per line, easy to stream
S1: Rapid Discovery
S1: Rapid Discovery - Approach#
Methodology: Rapid Library Search (speed-focused) Time Box: 90-120 minutes maximum Goal: Identify viable libraries for Japanese, Russian, Czech morphological analysis
Core Philosophy#
Quickly map the solution space for each language:
- What libraries exist?
- Which are popular/actively maintained?
- Can they produce JSONL output (latin-parse pattern)?
- Is there a unified solution (spaCy) or per-language specialists?
Discovery Process#
1. Japanese Libraries (25 min)#
- MeCab: Classic Japanese morphological analyzer (2003)
- fugashi: Modern Python wrapper for MeCab
- SudachiPy: Modern Japanese tokenizer (2017)
- spaCy: General NLP with Japanese model
Check:
- PyPI downloads, GitHub stars
- Installation complexity (system dependencies?)
- Output format (can we get lemma + POS + reading?)
2. Russian Libraries (25 min)#
- pymorphy2: Morphology specialist for Russian
- spaCy: General NLP with Russian model
- UDPipe: Universal Dependencies parser
Check:
- Accuracy for case + aspect identification
- Performance (
<100ms per sentence?) - API usability
3. Czech Libraries (25 min)#
- UDPipe: Universal Dependencies (best Czech support)
- spaCy: Experimental Czech model
- MorphoDiTa: Czech-specific morphological analyzer
Check:
- 7 cases correctly identified?
- Installation ease
- Active maintenance
4. Unified vs Specialist Decision (15 min)#
- Option A: spaCy for all three (unified API)
- Option B: Per-language specialists (MeCab, pymorphy2, UDPipe)
- Option C: Hybrid (spaCy where good, specialists where needed)
Trade-offs:
- API consistency vs accuracy
- Installation complexity vs performance
- Maintenance burden vs optimal results
5. Quick Recommendation (10 min)#
- Default library per language
- Confidence level
- When to reconsider (S2/S3 signals)
Evaluation Criteria#
Primary:
- Popularity (PyPI downloads, GitHub stars)
- Active maintenance (last commit, open issues)
- Installation ease (pip install vs binary deps)
Secondary:
- Accuracy (if easily testable)
- Performance (if documented)
- Documentation quality
Tertiary:
- API consistency across languages
- Community size
Output Files#
approach.md(this file)japanese-libraries.md- MeCab vs SudachiPy vs spaCyrussian-libraries.md- pymorphy2 vs spaCy vs UDPipeczech-libraries.md- UDPipe vs spaCy vs MorphoDiTarecommendation.md- Per-language recommendations + unified decision
Success Criteria#
- Found 2-3 viable options per language
- Clear popularity leader identified (or not)
- Can answer: “What should I use for each language?”
- Total time:
<120minutes
Note for S2/S3#
S1 identifies viable options. If no clear winner per language:
- S2: Deep-dive accuracy testing, API comparison
- S3: Validate against real Japanese/Russian/Czech texts
This research likely needs more than S1 (3 languages, complex trade-offs).
Czech Morphological Analysis Libraries#
Winner: UDPipe#
PyPI Package: ufal-udpipe
Downloads: 52,308/month
Latest Update: June 2024 (improved Czech support)
Maintenance: Active (Charles University)
Python Requirement: 3.x
Best Czech Support#
- Academic backing (Charles University Prague)
- June 2024 improvements: 50% error reduction in lemmatization, 58% in POS tagging
- Universal Dependencies format
- 7 cases correctly identified
Key Features#
- Morphological dictionary-supplemented deep learning
- Lemmatization, POS tagging, dependency parsing
- Trained on PDT-C 1.0 (Prague Dependency Treebank)
- Web service + Python client
Installation#
pip install ufal-udpipeAlternative: spaCy Czech#
Models: Experimental Czech support Maintenance: Active (spaCy ecosystem)
Why Consider#
- ✅ Unified API (same as other languages)
- ✅ spaCy ecosystem familiarity
Why Not Winner#
- ⚠️ Experimental (not mature)
- ⚠️ UDPipe has better Czech accuracy (50-58% error reduction)
- ⚠️ UDPipe is Czech-specialist
Alternative: MorphoDiTa#
Source: Czech-specific morphological analyzer Maintenance: Academic
Why Consider#
- ✅ Czech-specific tool
- ✅ Academic backing
Why Not Winner#
- ⚠️ Harder to find Python bindings
- ⚠️ UDPipe supersedes it (June 2024 paper shows improvements)
- ⚠️ Less popular in Python ecosystem
Recommendation#
Use UDPipe for czech-parse implementation.
Rationale:
- Best Czech support (June 2024 improvements)
- 50% lemmatization error reduction vs alternatives
- 7 cases correctly handled
- Python client available
- Universal Dependencies format (standard)
Caveat:
- Lower adoption (52K downloads/month vs 585K for Russian, 1.9M for Japanese)
- But this reflects Czech being smaller language community
- Still viable and actively maintained
Confidence: MEDIUM-HIGH (7/10)
- High confidence in quality (academic backing, recent improvements)
- Medium confidence in ecosystem (lower adoption than Japanese/Russian)
Sources#
- UDPipe PyPI
- UDPipe Web Service
- June 2024 Paper - Improved Czech support
- UDPipe 2
- spaCy Czech - Experimental
Japanese Morphological Analysis Libraries#
Clear Winner: SudachiPy#
PyPI Package: SudachiPy
Downloads: 1,936,812/month
Latest Version: 0.6.10 (Jan 2025)
Maintenance: Active (Nov 2024 release)
Python Requirement: 3.x
Popularity Winner#
- 2.7× more downloads than fugashi (1.9M vs 720K/month)
- Modern development (0.6+ uses Sudachi.rs)
- Multi-granular tokenization (A/B/C split modes)
Key Features#
- Morpheme information (dictionary forms, readings, POS)
- Built-in dictionaries (sudachidict_small/core/full)
- Active development by Works Applications
Installation#
pip install SudachiPy
# Dictionary included by default (sudachidict_core)Alternative: fugashi (MeCab wrapper)#
PyPI Package: fugashi
Downloads: 720,141/month
Latest Version: 1.5.2
Maintenance: Active
Why Consider#
- ✅ Pythonic MeCab wrapper (Cython-based)
- ✅ Classic MeCab reliability (2003+)
- ✅ UniDic dictionary support
Why Not Winner#
- ⚠️ 2.7× fewer downloads than SudachiPy
- ⚠️ Requires dictionary installation (unidic-lite or unidic)
- ⚠️ MeCab is older technology (SudachiPy more modern)
Alternative: spaCy Japanese (ja_core)#
Models: ja_core_news_sm/md/lg Maintenance: Active (spaCy ecosystem)
Why Consider#
- ✅ Unified API (same as other languages)
- ✅ Full NLP pipeline (not just morphology)
Why Not Winner#
- ⚠️ Uses SudachiPy internally for tokenization
- ⚠️ Heavier (full NLP vs morphology specialist)
- ⚠️ Morphology = subset of what SudachiPy provides
Recommendation#
Use SudachiPy for japanese-parse implementation.
Rationale:
- Clear popularity leader (1.9M downloads/month)
- Modern, actively maintained (2024-2025 releases)
- Multi-granular tokenization (flexible parsing)
- spaCy uses it internally anyway
Confidence: HIGH (9/10)
Sources#
S1 Rapid Discovery - Recommendation#
Time Spent: ~100 minutes Confidence Level: HIGH (Japanese, Russian), MEDIUM-HIGH (Czech)
Per-Language Winners#
| Language | Winner | Downloads/Month | Rationale |
|---|---|---|---|
| Japanese | SudachiPy | 1,936,812 | Clear popularity leader (2.7× vs fugashi), modern, multi-granular |
| Russian | pymorphy3 | 584,844 | Morphology specialist, active maintenance, excellent case/aspect analysis |
| Czech | UDPipe | 52,308 | Best Czech support, June 2024 improvements (50-58% error reduction) |
Installation#
# Japanese
pip install SudachiPy
# Russian
pip install pymorphy3
# Czech
pip install ufal-udpipeKey Findings#
Pattern 1: No Unified Solution#
Unlike 1.141 (FSRS) and 1.142 (genanki) where one library wins across the board, morphological analysis requires per-language specialists:
- SudachiPy: Japanese-specific
- pymorphy3: Russian-specific (+ Ukrainian)
- UDPipe: Multi-language but strongest for Czech
spaCy could provide unified API but uses these specialists internally anyway (e.g., spaCy ja_core uses SudachiPy).
Pattern 2: Popularity Varies by Language Community#
- Japanese: 1.9M downloads/month (large community, active NLP)
- Russian: 585K downloads/month (strong community)
- Czech: 52K downloads/month (smaller language community)
Lower Czech adoption reflects language community size, not tool quality.
Pattern 3: Morphology Specialists Win#
All three winners are morphology-focused (not general NLP):
- SudachiPy: Japanese morphology + tokenization
- pymorphy3: Russian morphology + inflection
- UDPipe: Multi-language morphology (Universal Dependencies)
Takeaway: Don’t use general NLP if you need deep morphological analysis.
Implementation Path#
Next Steps (Experiments)#
1.950-japanese-text-parser:
- Use SudachiPy for tokenization + morphology
- Output JSONL (latin-parse pattern)
- Extract: lemma, POS, reading, particles
1.951-russian-text-parser:
- Use pymorphy3 for morphology
- Output JSONL (latin-parse pattern)
- Extract: lemma, POS, case, aspect, gender
1.952-czech-text-parser:
- Use UDPipe for morphology
- Output JSONL (latin-parse pattern)
- Extract: lemma, POS, case (7 cases), number, gender
Application Implementation#
applications/language-learning/:
src/japanese/japanese_train.py- Interactive trainer (latin-train pattern)src/russian/russian_train.py- Interactive trainersrc/czech/czech_train.py- Interactive trainer
Trade-offs Accepted#
Unified API vs Accuracy#
Decision: Choose accuracy (per-language specialists) Trade-off: 3 different APIs instead of unified spaCy API Rationale: Morphology quality critical for language learning
Installation Complexity#
Decision: 3 separate packages Trade-off: pip install × 3 vs pip install spacy + models Rationale: Simpler individual installs than managing spaCy models
Maintenance Burden#
Decision: Track 3 libraries instead of 1 Trade-off: Monitor 3 release cycles Rationale: All actively maintained (2024-2025 releases)
Confidence Assessment#
Japanese: HIGH (9/10)#
✅ Clear popularity leader (1.9M downloads/month) ✅ Active development (Jan 2025 release) ✅ Modern technology (Sudachi.rs) ✅ Multi-granular tokenization
Russian: HIGH (8/10)#
✅ Morphology specialist (not general NLP) ✅ Strong adoption (585K downloads/month) ✅ Actively maintained (successor to pymorphy2) ✅ Excellent case + aspect analysis
Only risk: pymorphy3 is newer (2.0.4) vs pymorphy2 (0.9.1 unmaintained)
Czech: MEDIUM-HIGH (7/10)#
✅ Best Czech accuracy (June 2024 improvements) ✅ Academic backing (Charles University) ✅ Universal Dependencies standard ⚠️ Lower adoption (52K downloads/month) ⚠️ Smaller language community
Risk: Lower community = fewer Stack Overflow answers, examples
When to Revisit This Decision#
Reconsider Japanese (SudachiPy):
- If SudachiPy development stalls (check GitHub activity)
- If spaCy ja_core significantly improves (check benchmarks)
Reconsider Russian (pymorphy3):
- If pymorphy3 becomes unmaintained (fork like pymorphy2?)
- If spaCy ru_core morphology matches pymorphy3 quality
Reconsider Czech (UDPipe):
- If spaCy Czech model matures (currently experimental)
- If Czech-specific library emerges with better adoption
General reconsideration signal:
- Unified spaCy API becomes compelling (if building 10+ languages)
- Per-language specialists become unmaintained
S2/S3 Not Required Because…#
S1 answered key questions:
- ✅ Which libraries exist? (SudachiPy, pymorphy3, UDPipe)
- ✅ Which are popular? (Clear download numbers)
- ✅ Which are maintained? (All have 2024-2025 releases)
- ✅ Is there unified solution? (No - per-language specialists win)
S2 would add (not needed):
- Detailed API comparison (all have morphology APIs)
- Accuracy benchmarks (pymorphy3/UDPipe papers already show this)
- Performance testing (
<100ms/sentence likely for all)
S3 would add (not needed):
- Real text validation (popularity suggests they work)
- Integration prototypes (defer to experiments 1.950-1.952)
Decision: S1 sufficient - clear winners, high confidence
Hardware Store Philosophy#
“In Stock Now” (1.148 base):
- Japanese: SudachiPy ✅
- Russian: pymorphy3 ✅
- Czech: UDPipe ✅
“Catalog Entries” (1.148.X - LANGUAGE_FAMILIES_ROADMAP.md):
- Arabic, Chinese, Hebrew, ASL, Korean, Turkish, etc.
- Mapped but not researched
- Research when needed (user demand signal)
Pattern validated: Per-language specialists > unified general NLP for morphology-intensive tasks
Sources#
Japanese#
- SudachiPy PyPI Stats - 1.9M downloads/month
- fugashi PyPI Stats - 720K downloads/month
- spaCy Japanese Models
Russian#
- pymorphy3 PyPI Stats - 585K downloads/month
- pymorphy2 GitHub - unmaintained
- spaCy Russian Models
Czech#
- UDPipe PyPI Stats - 52K downloads/month
- June 2024 Paper - Czech improvements
- UDPipe Web Service
Russian Morphological Analysis Libraries#
Clear Winner: pymorphy3#
PyPI Package: pymorphy3
Downloads: 584,844/month
Latest Version: 2.0.4
Maintenance: Active (successor to pymorphy2)
Python Requirement: 3.9-3.14
Morphology Specialist#
- Dedicated Russian morphology (not general NLP)
- Successor to pymorphy2 (original is unmaintained)
- 585K downloads/month = strong adoption
Key Features#
- Case identification (6 cases: nominative, genitive, dative, accusative, instrumental, prepositional)
- Aspect analysis (perfective/imperfective)
- Gender, number, tense, person
- Inflection engine (generate forms from lemma)
Installation#
pip install pymorphy3Alternative: spaCy Russian (ru_core)#
Models: ru_core_news_sm/md/lg Components: morphologizer, lemmatizer, parser Maintenance: Active
Why Consider#
- ✅ Unified API (same as other languages)
- ✅ Full NLP pipeline (NER, dependency parsing)
- ✅ Token.morph for morphological features
Why Not Winner#
- ⚠️ General NLP (not morphology specialist)
- ⚠️ pymorphy3 has deeper morphology analysis
- ⚠️ Trained on Nerus dataset (good but not specialized)
Alternative: UDPipe#
PyPI Package: ufal-udpipe
Downloads: 52,308/month
Maintenance: Active (v1 and v2)
Why Consider#
- ✅ Universal Dependencies format
- ✅ Multi-language support
- ✅ Academic backing (Charles University)
Why Not Winner#
- ⚠️ 11× fewer downloads than pymorphy3
- ⚠️ Czech is its strength, not Russian
- ⚠️ More complex setup than pymorphy3
Recommendation#
Use pymorphy3 for russian-parse implementation.
Rationale:
- Morphology specialist (not general NLP)
- Strong adoption (585K downloads/month)
- Actively maintained (successor to pymorphy2)
- Excellent case + aspect analysis (critical for Russian)
- Simple API for morphological parsing
When to consider spaCy:
- If building unified multi-language parser with same API
- If need full NLP pipeline (NER, dependency parsing)
Confidence: HIGH (8/10)
Sources#
- pymorphy3 PyPI
- pymorphy2 GitHub - original (unmaintained)
- spaCy Russian Models
- UDPipe