1.173 Terminology Extraction#
Automatically finding domain-specific technical terms for translation glossaries, technical writing, and domain NLP - pyate (English/Italian) and KeyBERT (multilingual/CJK)
Explainer
Terminology Extraction Libraries: Domain Explainer#
Purpose: Help technical decision makers understand Python terminology extraction libraries and choose the right tool for translation, technical writing, and domain-specific NLP projects.
Audience: Product managers, technical leads, translators, technical writers without deep NLP expertise.
What This Solves#
The Problem#
Imagine you’re a translator working on a 50-page technical manual about machine learning. You need to:
- Identify all technical terms (“neural network,” “gradient descent,” “backpropagation”)
- Ensure consistent translation across the document
- Build a glossary for future projects
Manual approach: Read through 50 pages, manually highlight terms, copy to spreadsheet. Time: 3-4 hours of tedious work.
Automated approach: Run terminology extraction library, get list of 200 candidate terms in 30 seconds. Review time: 30-60 minutes to validate and filter.
This is the terminology extraction problem: Automatically finding domain-specific technical terms in large documents.
Who Encounters This#
- Translators: Building glossaries for technical translations (medical, legal, engineering)
- Technical Writers: Maintaining terminology consistency across documentation
- Domain NLP: Extracting concepts for knowledge graphs, ontologies, search systems
- Localization Teams: Scaling terminology management across languages and projects
Why It Matters#
Business Impact:
- Translation: 60-80% time savings on glossary creation (3-4 hours → 30-60 min per 10K words)
- Technical Writing: Automated glossary generation ensures terminology consistency
- Domain NLP: Foundation for building domain-specific models (medical, legal, etc.)
Technical Impact:
- Identify multi-word technical terms (“natural language processing” not just “language”)
- Domain-specific extraction (medical terms from medical texts, legal from legal)
- Multilingual support (Chinese, Japanese, Korean technical terminology)
Accessible Analogies#
What Is Terminology Extraction?#
Analogy: Finding Specialized Vocabulary in a Foreign Language Textbook
Imagine learning Chinese from a computer science textbook. You need to identify:
- Technical terms: “机器学习” (machine learning), “神经网络” (neural network)
- General words: “学习” (learning), “网络” (network)
A keyword extractor would find both (anything important-looking). A terminology extractor would focus on the technical terms (domain-specific).
Terminology extraction is like having a smart assistant that knows the difference between “learning” (general word) and “machine learning” (technical term).
Why Libraries Matter#
Without libraries: You write custom code to find technical terms
- Implement statistical algorithms (C-Value, TF-IDF)
- Handle POS tagging, linguistic rules
- Build language-specific models
Time: Weeks to months of development
With libraries: pip install pyate or pip install keybert
- Pre-built algorithms (battle-tested)
- Multi-language support
- 3 lines of code to extract terms
Time: Minutes to hours
When You Need This#
Clear Decision Criteria#
You NEED terminology extraction if:
- ✅ Large documents with technical content (>5,000 words)
- ✅ Recurring projects in same domain (build glossary once, reuse)
- ✅ Multiple translators/writers (consistency matters)
- ✅ Tight deadlines (60-80% time savings on term prep)
You DON’T need this if:
- ❌ Small documents (<1,000 words) - manual extraction faster
- ❌ General content (few technical terms)
- ❌ One-off projects (no glossary reuse value)
Concrete Use Case Examples#
Translation Glossary Creation:
- Problem: Translator receives 50-page medical device manual. Manual term identification takes 3-4 hours.
- Solution: Run pyate (English) or KeyBERT (Chinese/Japanese/Korean), get 200 candidate terms in 30 seconds, review in 30-60 minutes.
- ROI: 60-80% time savings (3-4 hours → 30-60 min)
Technical Documentation Consistency:
- Problem: 10 technical writers producing 500 pages of documentation. Inconsistent terminology (“ML model” vs “machine learning model” vs “ML algorithm”).
- Solution: Extract terminology from all 500 pages, create shared glossary, enforce consistency.
- ROI: Improved documentation quality, reduced customer confusion
Multilingual Product Documentation (CJK):
- Problem: Product docs in English, Chinese, Japanese, Korean. Need consistent terminology across languages.
- Solution: KeyBERT with multilingual model extracts terms from all languages using single tool.
- ROI: Consistency across languages, simplified workflow
Trade-offs#
What You’re Choosing Between#
1. pyate vs KeyBERT: Terminology vs Keywords#
pyate (Terminology Extraction):
- Pros: High precision (70-80%) for technical terms, multi-word focus (“gradient descent optimization”), domain-specific
- Cons: English and Italian only (no Chinese/Japanese/Korean), requires spaCy dependency
When: English/Italian technical documentation, translation glossaries, domain-specific NLP
KeyBERT (Keyword Extraction):
- Pros: Multilingual (50-109 languages including CJK), semantic understanding, simple API
- Cons: Extracts keywords (not pure terminology), CJK character-level tokenization, slower (BERT inference)
When: Chinese/Japanese/Korean content, multilingual projects, semantic keywords (not just technical terms)
Key Difference: pyate finds technical terms, KeyBERT finds semantically important words. Overlap exists, but goals differ.
2. Automated Extraction vs Manual Curation#
Automated Extraction:
- Pros: 60-80% time savings, handles large volumes (1000s of documents)
- Cons: 60-80% precision (20-40% false positives), requires validation
When: Large documents, recurring projects, tight deadlines
Manual Curation:
- Pros: 95%+ precision, full control
- Cons: Time-consuming (3-4 hours per 10K words)
When: Small documents, one-off projects, ultra-high precision required
Recommended: Hybrid - automated extraction for volume, human validation for precision.
3. Integrated CAT Tools vs Python Libraries#
CAT Tool Built-in:
- Pros: Integrated workflow (no export/import), convenient
- Cons: Less sophisticated algorithms, vendor lock-in
When: Existing CAT tool user, convenience > precision
Python Libraries (pyate/KeyBERT):
- Pros: State-of-art algorithms, customizable, free/open-source
- Cons: Requires Python skills, manual export to CAT tool
When: Need best precision, willing to invest setup time, tech-savvy team
Cost Considerations#
Why Cost Matters Here#
Unlike commercial terminology tools ($500-5,000/year per seat), Python libraries are free. The cost is time and expertise.
Pricing Models#
Python Libraries (Free):
- Software Cost: $0 (open-source)
- Setup Cost: 1-4 hours (installation, learning)
- Ongoing Cost: 10-20 hours/year (maintenance, updates)
Commercial Tools (Sketch Engine, AntConc alternatives):
- Software Cost: $500-5,000/year per seat
- Setup Cost: Included (vendor support)
- Ongoing Cost: Vendor handles updates
ROI Analysis#
Translation Glossary Creation (10K-word document):
- Manual: 3-4 hours × $50/hour = $150-200 per document
- Automated (pyate/KeyBERT): 30-60 min × $50/hour = $25-50 per document
- Savings: $100-150 per document (60-75% reduction)
Payback: If processing >10 documents/year, automated extraction pays back the setup time within the first month.
Technical Documentation (500 pages):
- Manual: 20-30 hours × $75/hour = $1,500-2,250
- Automated: 2-4 hours (extraction + review) × $75/hour = $150-300
- Savings: $1,200-2,000 (85-90% reduction)
Implementation Reality#
Realistic Timeline Expectations#
Prototype (1 week):
- Install pyate or KeyBERT
- Run on sample documents (10-20 pages)
- Validate output quality
- Team: 1 developer or technical translator
Production MVP (2-4 weeks):
- Set up batch processing pipeline
- Create validation workflow (human-in-loop)
- Export to glossary format (CSV, TBX)
- Train team on usage
- Team: 1 developer + 1 domain expert (translator/writer)
Optimized Production (2-3 months):
- Fine-tune for specific domain (if needed)
- Integrate with CAT tool or documentation system
- Automate glossary updates (CI/CD)
- Team: 1 developer + 1-2 domain experts
Team Skill Requirements#
Minimum (Using KeyBERT):
- Python: Basic (run scripts, install packages)
- NLP Knowledge: None (library handles complexity)
- Domain Expertise: High (validate extracted terms)
- Training Time: 1-2 days
Typical (Using pyate + spaCy):
- Python: Moderate (pipelines, batch processing)
- NLP Knowledge: Basic (understand POS tagging)
- Domain Expertise: High
- Training Time: 1-2 weeks
Common Pitfalls#
Pitfall 1: “Automated extraction replaces human review”
- Reality: Extraction is 60-80% precise. Human validation essential.
- Fix: Budget time for review (30-60 min per 10K words)
Pitfall 2: “CJK support means perfect Chinese/Japanese/Korean”
- Reality: KeyBERT uses character-level tokenization (may miss word boundaries)
- Fix: Use chinese_keybert for Chinese-only, plan for extra validation
Pitfall 3: “Keywords = Terminology”
- Reality: KeyBERT extracts semantically important words, not always technical terms
- Fix: Use pyate for pure terminology (if language supported), KeyBERT + filtering otherwise
Pitfall 4: “One library solves everything”
- Reality: pyate best for English/Italian terminology, KeyBERT best for CJK
- Fix: Use both (per-language optimization) or abstract interface (swap backends)
Key Takeaways for Decision Makers#
Top 3 Decisions to Make#
Decision 1: Terminology Extraction vs Keyword Extraction
- Rule: Need technical terms (glossaries, translation)? → pyate (if English/Italian) or KeyBERT + filtering (if CJK)
- Rule: Need semantic keywords (content tagging)? → KeyBERT
Decision 2: Language Support
- Rule: English or Italian only? → pyate (highest precision)
- Rule: Chinese, Japanese, Korean, or multilingual? → KeyBERT (only viable option)
Decision 3: Integration Approach
- Rule: CAT tool built-in available? → Use it (convenience)
- Rule: Need best precision or CJK support? → Python libraries (setup effort justified)
Budget Guidance#
Setup (One-Time):
- Developer time: 1-4 weeks × $5K/week = $5K-20K
- Infrastructure: $0 (runs on standard hardware)
- Total: $5K-20K
Ongoing (Per Year):
- Maintenance: 10-20 hours × $100/hour = $1K-2K
- Total: $1K-2K/year
ROI:
- Translation: $100-150 savings per 10K-word document
- Technical Docs: $1,200-2,000 savings per 500-page manual
- Payback: Typically 1-3 months for active translation/writing teams
Questions to Ask Vendors/Consultants#
Technical Questions:
- “Which library do you recommend: pyate or KeyBERT? Why?” (Tests understanding of terminology vs keyword trade-off)
- “How does it handle Chinese/Japanese/Korean?” (Tests CJK knowledge)
- “What’s the expected precision for our domain?” (Tests realistic expectations - should be 60-80%, not 95%)
Business Questions:
- “What’s the time savings vs manual extraction?” (Should quote 60-80%, not 90-95%)
- “How much human validation is needed?” (Should acknowledge 20-40% false positives)
- “Can it integrate with our CAT tool?” (Most likely no - manual export/import)
Red Flags:
- ❌ Claims 90-95% precision without human review (unrealistic)
- ❌ Recommends same library for all languages (no understanding of pyate/KeyBERT trade-offs)
- ❌ Doesn’t mention CJK challenges (character-level tokenization)
Green Flags:
- ✅ Recommends pyate for English, KeyBERT for CJK (shows nuanced understanding)
- ✅ Acknowledges 60-80% precision, plans for human validation
- ✅ Discusses integration challenges (export/import to CAT tool)
Glossary#
Terminology Extraction: Automatically finding domain-specific technical terms (multi-word expressions like “machine learning model”)
Keyword Extraction: Finding semantically important words in a document (may include general words if important)
CJK: Chinese, Japanese, Korean languages (share some NLP challenges like lack of spaces between words)
CAT Tool: Computer-Assisted Translation tool (SDL Trados, MemoQ, Smartcat) - software translators use
Glossary/Termbase: Database of technical terms and their translations
pyate: Python library for terminology extraction (C-Value, Combo Basic algorithms). Best for English/Italian.
KeyBERT: Python library for keyword extraction using BERT embeddings. Best for multilingual/CJK.
spaCy: Industrial-strength NLP library (POS tagging, parsing). Required by pyate.
BERT: Transformer-based language model. Provides semantic understanding for KeyBERT.
Precision: How many extracted terms are actually technical terms (70-80% typical)
Recall: How many actual technical terms were found (harder to measure, less critical)
Further Reading#
Non-Technical:
- “Nine Terminology Extraction Tools for Translators” (LinguaGreca) - Practical translator perspective
- “Translation Glossary Creation Guide” (Smartcat) - Workflow and best practices
Technical:
- pyate documentation: https://kevinlu1248.github.io/pyate/
- KeyBERT documentation: https://maartengr.github.io/KeyBERT/
- Astrakhantsev 2016 (ATR4S benchmark) - Academic validation of algorithms
Community:
- spaCy Universe: https://spacy.io/universe (pyate and ecosystem)
- sentence-transformers: https://www.sbert.net/ (KeyBERT backend)
Tools:
- CAT Tools: SDL Trados, MemoQ, Smartcat (commercial translation tools)
- Alternative Python: YAKE, RAKE-NLTK (simpler keyword extraction)
S1: Rapid Discovery
S1 Rapid Discovery: Approach#
Goal: Identify the main Python libraries for automatic term extraction and gain initial understanding of their capabilities.
Research Method#
- Web search for libraries mentioned in research brief (pyate, topia.termextract, spaCy, KeyBERT)
- Expand search to discover additional widely-used libraries (YAKE, RAKE, textacy)
- Verify pip installability (LIBRARY requirement, NOT GUI tools)
- Quick categorization by approach (statistical, linguistic, transformer-based)
Libraries Identified#
Core Libraries (from spec):#
- pyate: spaCy-based implementation of multiple statistical algorithms (C-Value, Basic, Combo Basic, Weirdness)
- topia.termextract: Lightweight POS-based term extraction (legacy, 2009)
- spaCy components: Built-in NLP pipeline components + ecosystem extensions
- KeyBERT: BERT embedding-based keyword/keyphrase extraction
Additional Discoveries:#
- YAKE: Unsupervised statistical method, no training required
- RAKE-NLTK: Statistical co-occurrence analysis, domain-independent
- textacy: Higher-level spaCy wrapper with term extraction features
Initial Categorization#
Statistical Approaches (No ML Training):#
- YAKE (text statistics)
- RAKE-NLTK (word co-occurrence)
- topia.termextract (POS + statistics)
Linguistic + Statistical:#
- pyate (POS tagging + multiple algorithms)
- textacy (spaCy-based, TextRank)
Transformer-Based:#
- KeyBERT (BERT embeddings + cosine similarity)
Focus for S1#
Focus on the four specified libraries (pyate, topia.termextract, spaCy, KeyBERT) while noting alternatives (YAKE, RAKE) for completeness.
KeyBERT#
Quick Summary#
Minimal keyword/keyphrase extraction using BERT embeddings. Finds terms most semantically similar to the document.
Installation#
```
pip install keybert
# With optional backends:
pip install keybert[flair]   # For Flair embeddings
pip install keybert[spacy]   # For spaCy integration
pip install keybert[gensim]  # For Word2Vec/Doc2Vec
pip install keybert[use]     # For Universal Sentence Encoder
```
Key Features#
- BERT-based: Uses transformer embeddings (captures meaning, not just counts)
- Multilingual: Works with many languages (via multilingual BERT models)
- Minimal API: Design goal was “pip install + 3 lines of code”
- Multiple backends: Support for sentence-transformers, Flair, spaCy, Gensim
- Semantic similarity: Uses cosine similarity to find terms matching document meaning
How It Works#
- Extract document embedding with BERT
- Extract word/phrase embeddings for n-grams
- Calculate cosine similarity between document and candidates
- Return top-k most similar terms
Use Cases#
- Semantic keyword extraction: Beyond simple frequency (meaning-based)
- Multilingual content: Works across many languages
- Modern NLP pipelines: Integrates with transformer-based workflows
- Content tagging: Automatic metadata generation
Resources#
- GitHub: MaartenGr/KeyBERT
- PyPI: keybert
- Docs: maartengr.github.io/KeyBERT
Initial Assessment#
Pros:
- Modern (transformer-based)
- Excellent multilingual support
- Simple API (easy to use)
- Semantic understanding (not just statistics)
- Active development (2023+)
Cons:
- Requires transformer models (larger memory footprint)
- Slower than statistical methods (BERT inference)
- May extract keywords, not necessarily technical terms (different goals)
Recommended for: Projects prioritizing semantic understanding over pure statistical term extraction. Best for content with semantic meaning (articles, documents) rather than highly technical terminology.
Note on Terminology vs Keywords#
KeyBERT extracts keywords (semantically important words) which overlaps with but differs from terminology (domain-specific technical terms). For pure terminology extraction, pyate may be more appropriate. KeyBERT shines when you need meaning-based extraction.
pyate (PYthon Automated Term Extraction)#
Quick Summary#
Python implementation of multiple term extraction algorithms using spaCy POS tagging. Supports spaCy v3.
Installation#
```
pip3 install pyate
```
Dependencies: numpy, pandas, spacy, pyahocorasick
Key Features#
- Multiple algorithms: C-Value, Basic, Combo Basic, Weirdness, Term Extractor
- spaCy integration via `add_pipe` method
- Returns “termhood” scores indicating confidence
- Works with spaCy v3 (for v2, use pyate==0.4.3)
How It Works#
- Uses spaCy POS tagging to identify term candidates
- Applies statistical algorithms to score candidates
- Returns ranked list of terms with confidence scores
Use Cases#
- Technical documentation term extraction
- Domain-specific terminology identification
- Translation memory creation
- Glossary generation
Resources#
- GitHub: kevinlu1248/pyate
- PyPI: pyate
- Docs: https://kevinlu1248.github.io/pyate/
- Demo: https://pyate-demo.herokuapp.com/
Initial Assessment#
Pros:
- Modern (spaCy v3 support)
- Multiple algorithms in one package
- Active development
- Good documentation
Cons:
- Requires spaCy (heavier dependency)
- Limited to languages spaCy supports
Recommended for: Technical writing, domain-specific NLP projects requiring modern spaCy integration
RAKE-NLTK (Rapid Automatic Keyword Extraction)#
Quick Summary#
Domain-independent keyword extraction using word co-occurrence and frequency analysis. NLTK-based Python implementation.
Installation#
```
pip install rake-nltk
```
Dependencies: NLTK
Key Features#
- Domain-independent: Works without domain-specific training
- Co-occurrence analysis: Identifies key phrases by analyzing word co-occurrences
- Stopword filtering: Uses stopwords as phrase delimiters
- Frequency-based: Combines word frequency and co-occurrence scores
How It Works#
- Use stopwords to split text into candidate phrases
- Calculate word scores based on frequency and co-occurrence
- Compute phrase scores (sum of word scores)
- Rank phrases by score
- Return top-k key phrases
Use Cases#
- General-purpose keyword extraction
- Document summarization
- Content tagging
- Search engine optimization (SEO)
Resources#
- PyPI: rake-nltk
- GitHub: Multiple implementations available
Initial Assessment#
Pros:
- Simple, well-understood algorithm
- Fast (no ML inference)
- Domain-independent
- Works on single documents (no corpus needed)
Cons:
- Statistical only (no semantic understanding)
- Quality depends on stopword list
- May extract common phrases, not technical terms
- Less sophisticated than modern methods
Recommended for: Quick keyword extraction when you need speed over precision. For technical terminology extraction, pyate or KeyBERT are likely better choices.
Note: Keyword vs Terminology Extraction#
RAKE extracts keywords/key phrases based on statistical prominence. This differs from terminology extraction which targets domain-specific technical terms. RAKE may miss low-frequency but important technical terms.
S1 Rapid Discovery: Initial Recommendations#
Executive Summary#
Seven pip-installable libraries identified for terminology extraction in Python. Clear split between statistical/linguistic approaches (RAKE, YAKE, pyate) and semantic/transformer approaches (KeyBERT).
Landscape Overview#
By Approach:#
| Library | Type | Best For | Status |
|---|---|---|---|
| pyate | Linguistic + Statistical | Technical terminology | ✓ Active (2023+) |
| KeyBERT | Transformer (BERT) | Semantic keywords | ✓ Active (2023+) |
| YAKE | Statistical | Quick extraction | ✓ Active |
| RAKE-NLTK | Statistical | General keywords | ✓ Maintained |
| textacy | spaCy extension | Broader NLP + terms | ✓ Active |
| topia.termextract | Legacy POS-based | Legacy projects | ⚠️ Abandoned (2009) |
| spaCy | NLP infrastructure | Framework/integration | ✓ Active |
By Use Case:#
Technical terminology extraction (translation, tech writing): → pyate (multiple algorithms, spaCy-based, terminology-focused)
Semantic keyword extraction (content tagging, semantic search): → KeyBERT (BERT embeddings, multilingual, meaning-based)
Quick/lightweight extraction (minimal dependencies): → YAKE (no training, fast, lightweight)
Legacy projects (existing topia.termextract usage): → Migrate to pyate (modern equivalent)
Top 2 Recommendations for Further Research#
1. pyate (Primary for Terminology)#
Why: Specifically designed for terminology extraction (not just keywords). Implements multiple proven algorithms (C-Value, Combo Basic, Weirdness). Modern (spaCy v3), actively maintained, good documentation.
Trade-offs:
- ✓ Purpose-built for technical terms
- ✓ Multiple algorithms available
- ✗ Requires spaCy (heavier dependency)
- ✗ Limited to spaCy-supported languages
2. KeyBERT (Primary for Semantic Keywords)#
Why: Modern transformer-based approach. Excels at semantic keyword extraction (finding meaningful terms). Excellent multilingual support. Simple API.
Trade-offs:
- ✓ Semantic understanding (not just statistics)
- ✓ Multilingual (70+ languages via BERT)
- ✗ Heavier (transformer models)
- ✗ Keywords ≠ terminology (different goals)
Key Insight: Terminology vs Keywords#
Critical distinction:
- Terminology extraction: Domain-specific technical terms (e.g., “natural language processing”, “entity recognition”)
- Keyword extraction: Semantically important words (e.g., “important”, “key findings”, “main result”)
pyate targets terminology. KeyBERT targets keywords. Use case determines which is appropriate.
Excluded from Deep Dive#
- topia.termextract: Abandoned (2009), superseded by pyate
- RAKE-NLTK: General keyword extraction (not terminology-focused)
- YAKE: Good for general keywords, but less sophisticated for technical terms
- textacy: Broader toolkit (term extraction is one feature among many)
- spaCy: Infrastructure, not a term extraction library per se
Next Steps (S2: Comprehensive)#
For pyate:
- Compare algorithms (C-Value vs Combo Basic vs Weirdness)
- Benchmark on technical documents
- Evaluate multilingual support (via spaCy models)
- Installation complexity and dependencies
For KeyBERT:
- Test on technical vs general content
- Evaluate embedding model choices (sentence-transformers vs others)
- Multilingual performance (CJK support per research label)
- Memory footprint and inference speed
For both:
- Compare output quality on same corpus
- Evaluate integration with translation/NLP pipelines
- TCO analysis (dependencies, model sizes)
- Community and long-term viability
spaCy Terminology Components#
Quick Summary#
spaCy itself doesn’t have built-in “terminology extraction” but provides the NLP pipeline infrastructure. Term extraction happens via:
- Custom pipeline components
- Ecosystem extensions (spaCy Universe)
- Integration with other libraries (pyate, textacy)
Installation#
```
pip install spacy
python -m spacy download en_core_web_sm  # or other language models
```
Relevant Components#
Built-in Features:#
- POS tagging: Identify nouns, noun phrases
- Dependency parsing: Understand phrase structure
- Named Entity Recognition: Extract entities
- Noun phrase chunking: Extract multi-word terms
- Matchers: Pattern-based extraction (Matcher, PhraseMatcher)
Ecosystem Extensions (spaCy Universe):#
- pyate: Multiple term extraction algorithms (covered separately)
- sense2vec: Combines noun phrases with POS/entity labels
- textacy: Higher-level NLP tasks including term extraction
How to Use for Term Extraction#
Approach 1: Manual noun phrase extraction#
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Your text here")
terms = [chunk.text for chunk in doc.noun_chunks]
```
Approach 2: Add pyate to pipeline#
```python
import spacy
from pyate import combo_basic  # importing pyate registers its spaCy components

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp("Your text here")
terms = doc._.combo_basic  # pandas Series of term -> termhood score
```
Approach 3: Custom component#
Create custom pipeline component with domain-specific rules.
Use Cases#
- Framework for term extraction: Use built-in features + custom logic
- Integration point: Combine with specialized libraries (pyate, textacy)
- Multilingual support: 70+ language models available
Resources#
- Website: spacy.io
- spaCy Universe: spacy.io/universe (ecosystem directory)
- Linguistic Features: spacy.io/usage/linguistic-features
Initial Assessment#
Pros:
- Industry-standard NLP library
- Excellent multilingual support (70+ languages)
- Fast, production-ready
- Extensible architecture
- Large ecosystem
Cons:
- Not a term extraction library per se (requires extensions)
- Heavier dependency (language models are large)
- Requires understanding of NLP pipeline architecture
Recommended for: Projects needing robust NLP infrastructure with term extraction as one component. Use spaCy + pyate/textacy for best results, not spaCy alone.
textacy#
Quick Summary#
Higher-level NLP library built on spaCy, providing tools for tasks like keyphrase extraction, readability analysis, and more.
Installation#
```
pip install textacy
```
Dependencies: spaCy (and language models)
Key Features#
- Built on spaCy: Extends spaCy with higher-level tasks
- Keyphrase extraction: Via textacy.extract using TextRank algorithm
- Preprocessing tools: Text normalization, cleaning
- Multiple extraction methods: Named entities, n-grams, terms, etc.
- Corpus management: Tools for working with document collections
How It Works (for term extraction)#
Uses textacy.extract module:
- TextRank algorithm for keyphrase ranking
- Various extraction methods (n-grams, entities, terms)
- Statistical ranking of extracted phrases
Use Cases#
- Projects already using spaCy that need higher-level features
- Keyphrase extraction with TextRank
- Text preprocessing pipeline
- Corpus-level analysis
Resources#
- GitHub: chartbeat-labs/textacy
- Docs: textacy.readthedocs.io
- PyPI: textacy
Initial Assessment#
Pros:
- Convenient if already using spaCy
- Multiple extraction methods in one library
- Good documentation
- TextRank implementation
Cons:
- Requires spaCy (heavier dependency)
- Less focused than specialized tools (pyate, KeyBERT)
- May be overkill if you only need term extraction
Recommended for: Projects already using spaCy that want additional NLP features including term extraction. If starting fresh, consider pyate (also spaCy-based but focused on term extraction) or KeyBERT (semantic approach).
Comparison to pyate#
Both build on spaCy, but:
- textacy: General-purpose NLP toolkit with term extraction as one feature
- pyate: Focused specifically on terminology extraction with multiple algorithms
For pure terminology extraction, pyate is more specialized. For broader NLP tasks including term extraction, textacy provides more features.
topia.termextract#
Quick Summary#
Lightweight POS-based term extraction library originally from Zope project. Legacy but still functional.
Installation#
```
pip install topia.termextract
```
Key Features#
- Simple POS tagging algorithm (focuses on nouns)
- Statistical analysis for term strength
- Returns terms with occurrence count and strength metrics
- Configurable filter component
- Minimum occurrence threshold (default: 3 for single words)
How It Works#
- Simple POS tagging to identify nouns
- Statistical analysis of term frequency
- Filter for minimum occurrence threshold
- Return ranked terms with strength scores
Current Status#
- ⚠️ Last release: June 30, 2009 (version 1.1.0)
- ⚠️ Maintenance: Discontinued on official PyPI
- ✓ Fork available: turian/topia.termextract on GitHub (updated fork)
Use Cases#
- Lightweight keyword extraction
- Simple terminology extraction for English text
- Projects requiring minimal dependencies
Resources#
- PyPI: topia.termextract
- Fork: turian/topia.termextract
- Tutorial: TextProcessing.org guide
Initial Assessment#
Pros:
- Very lightweight
- Simple API
- Minimal dependencies
- Still works for basic use cases
Cons:
- Abandoned (no updates since 2009)
- Limited language support
- Simple POS tagger (less accurate than modern tools)
- No active development
Recommended for: Legacy projects, simple keyword extraction where modern dependencies are unwanted. NOT recommended for new projects (use pyate or KeyBERT instead).
YAKE (Yet Another Keyword Extractor)#
Quick Summary#
Lightweight unsupervised keyword extraction using statistical text features. No training required, works across domains and languages.
Installation#
```
pip install yake
```
Key Features#
- Unsupervised: No training data required
- Language-agnostic: Works across multiple languages
- Domain-independent: No domain-specific dictionaries needed
- Text size flexible: Works regardless of document length
- Statistical approach: Based on text statistical features (position, frequency, capitalization, etc.)
How It Works#
- Analyze statistical features: word position, frequency, case, context
- Compute scores for candidate keywords
- Rank keywords by relevance
- Return top-k keywords with scores
Use Cases#
- Quick keyword extraction without training
- Multilingual content processing
- Small to medium documents
- Domain-agnostic applications
Resources#
- GitHub: LIAAD/yake
- PyPI: yake
Initial Assessment#
Pros:
- No training required (truly unsupervised)
- Fast (statistical, no ML inference)
- Multilingual support
- Lightweight (minimal dependencies)
- Domain-independent
Cons:
- Statistical only (no semantic understanding)
- May not capture technical terminology nuances
- Less sophisticated than transformer-based methods
Recommended for: Quick keyword extraction without setup overhead. Good for general keyword extraction, but for technical terminology, pyate may be more appropriate.
Additional Note#
YAKE is popular in academic research for keyword extraction. It’s a solid baseline method that’s easy to deploy but may lack the sophistication needed for specialized terminology extraction in technical domains.
S2: Comprehensive
S2 Comprehensive Research: Approach#
Goal#
Deep technical analysis of pyate and KeyBERT to understand:
- Algorithm implementation details
- Performance characteristics and benchmarks
- Multilingual support (especially CJK - per research label)
- Integration patterns and dependencies
- Use case fit (terminology vs keyword extraction)
Research Method#
- Algorithm Analysis: Study implementations of C-Value, Combo Basic (pyate) vs BERT embeddings + cosine similarity (KeyBERT)
- Benchmark Review: Find published comparisons and performance data
- CJK Support: Evaluate Chinese, Japanese, Korean language capabilities
- Integration Patterns: Understand how each integrates with NLP pipelines (spaCy, sentence-transformers)
- Use Case Mapping: Clarify when to use terminology extraction vs keyword extraction
Key Questions to Answer#
For pyate:#
- How do C-Value, Combo Basic, and Weirdness algorithms compare?
- What are spaCy model dependencies for CJK languages?
- Does it have general domain corpora for Chinese, Japanese, Korean?
- What is the precision of different algorithms per Astrakhantsev 2016?
For KeyBERT:#
- Which sentence-transformers models support CJK best?
- How does BERT handle CJK tokenization (character-based vs word-based)?
- What is the trade-off between multilingual-BERT and language-specific models?
- Is there semantic understanding of technical terminology or just keywords?
For Both:#
- What is the fundamental difference between terminology and keyword extraction?
- Which is better for translation workflows?
- Which is better for technical writing glossary generation?
- What are the memory/compute requirements?
Sources#
- PyATE GitHub and documentation
- KeyBERT GitHub and documentation
- Astrakhantsev 2016 (ATR4S toolkit benchmark)
- Sentence-transformers model documentation
- spaCy language model documentation
- Research papers on terminology vs keyword extraction
Expected Outcome#
Clear technical comparison with recommendations for:
- Pure terminology extraction (technical terms, domain-specific concepts)
- Semantic keyword extraction (meaning-based content tagging)
- CJK language support (Chinese, Japanese, Korean capabilities)
- Integration patterns (when to use with spaCy, sentence-transformers)
CJK Language Support Analysis#
Relevance: Research bead has cjk label, indicating Chinese, Japanese, Korean support is a priority.
Summary Table#
| Library | Chinese | Japanese | Korean | Status | Notes |
|---|---|---|---|---|---|
| pyate | ❌ No | ❌ No | ❌ No | Blocked | No general corpora |
| KeyBERT | ✅ Yes | ✅ Yes | ✅ Yes | Works | Multilingual BERT |
| chinese_keybert | ✅ Best | ❌ No | ❌ No | Works | Chinese-specific fork |
pyate CJK Support#
Technical Capability:#
✅ spaCy models exist for Chinese, Japanese, Korean
✅ pyate can load spaCy CJK models via set_language()
Actual Status:#
❌ No CJK support due to missing general domain corpora
Why Blocked:#
Per GitHub Issue #13:
As of version 0.4.2, only English and Italian are supported. The library’s language support depends on having appropriate spaCy models and general domain corpora for those languages.
What’s Missing:
- Weirdness algorithm: Requires general corpus to contrast against technical corpus
- Term Extractor algorithm: Requires reference corpus
Available spaCy Models:
- Chinese: zh_core_web_sm, zh_core_web_md, zh_core_web_lg
- Japanese: ja_core_news_sm, ja_core_news_md, ja_core_news_lg
- Korean: rule-based tokenizer, trained pipelines available
Workaround: Provide your own general corpus
from pyate import weirdness
chinese_text = "您的中文技术文档..."
general_corpus = "Your own Chinese general domain corpus..."
# This WILL work if you provide general_corpus. Note: only the
# corpus-contrasting algorithms (weirdness, term_extractor) take a
# general corpus; exact parameter names may vary by pyate version.
terms = weirdness(chinese_text, general_corpus=general_corpus)
Verdict: pyate is NOT recommended for CJK unless you can build general domain corpora (non-trivial effort).
KeyBERT CJK Support#
Technical Capability:#
✅ Multilingual BERT models support 50-109 languages including CJK
✅ Out-of-box support, no additional corpora needed
CJK Tokenization Behavior#
Per Google BERT Multilingual Docs:
Chinese (and Japanese Kanji, Korean Hanja):
- Character-tokenized (spaces added around every CJK Unicode character)
- Effectively treats Chinese as character-level (not word-level)
- May extract character-level “terms” instead of proper words
Japanese Katakana/Hiragana, Korean Hangul:
- Whitespace + WordPiece tokenization (normal BERT behavior)
- Better term extraction quality
Example:
- Input: “自然语言处理” (natural language processing in Chinese)
- BERT tokenization: [“自”, “然”, “语”, “言”, “处”, “理”] (6 characters)
- KeyBERT may extract: “语言” (language), “处理” (processing) as separate “keywords”
Implication: Character-level tokenization may miss proper word boundaries for Chinese/Japanese.
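The character-padding rule described above can be reproduced in a few lines. The sketch below mimics BERT's CJK pre-tokenization using a simplified Unicode range check (Google's actual `_is_chinese_char` covers additional ranges such as the extension blocks):

```python
def is_cjk_ideograph(ch):
    """Simplified check for CJK Unified Ideographs (U+4E00-U+9FFF).
    BERT's real check covers more ranges (extensions, compatibility)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

def bert_style_tokenize(text):
    """Pad every CJK ideograph with spaces, then split on whitespace,
    mimicking multilingual BERT's pre-tokenization step."""
    out = []
    for ch in text:
        out.append(f" {ch} " if is_cjk_ideograph(ch) else ch)
    return "".join(out).split()

print(bert_style_tokenize("自然语言处理"))   # every ideograph becomes its own token
print(bert_style_tokenize("BERT处理text"))  # Latin runs stay intact
```

Running this on "自然语言处理" yields six single-character tokens, which is exactly why multi-character Chinese terms can be lost before KeyBERT ever scores them.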
Recommended Models for CJK#
| Model | Languages | CJK Quality | Size |
|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ incl. CJK | Good | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 50+ incl. CJK | Better (higher quality) | 1.1GB |
| distiluse-base-multilingual-cased-v1 | 15 incl. Chinese, Korean | Lightweight | 480MB |
| LaBSE | 109 languages | Max coverage | 470MB |
Recommendation: Start with paraphrase-multilingual-MiniLM-L12-v2 (balance of size and quality).
Chinese-Specific: chinese_keybert Fork#
Repository: JacksonCakes/chinese_keybert
Improvements over Generic KeyBERT:
- ✅ Uses CKIP library for Chinese word segmentation (proper word boundaries)
- ✅ Chinese POS tagging (identifies noun phrases correctly)
- ✅ Integrates sentence-transformers for embeddings
Usage:
from chinese_keybert import ChineseKeyBERT
kw_model = ChineseKeyBERT()
keywords = kw_model.extract_keywords("自然语言处理技术...")
Trade-off:
- ✅ Better Chinese word segmentation (vs character-level generic BERT)
- ❌ Chinese-only (no Japanese, Korean support)
- ❌ Additional dependency (CKIP library)
Verdict: Use chinese_keybert if Chinese-only, use generic KeyBERT with multilingual model if multi-CJK.
Other Libraries: CJK Status#
YAKE#
- ✅ Language-agnostic (no language-specific models needed)
- ✅ Works on CJK text (statistical approach)
- ⚠️ Character-level statistics may affect quality for Chinese
RAKE-NLTK#
- ❌ English-centric (depends on English stopwords)
- ❌ Not recommended for CJK
textacy#
- ⚠️ Depends on spaCy models (same as pyate)
- ✅ spaCy CJK models exist (Chinese, Japanese, Korean)
- ✅ Should work for CJK if using spaCy CJK models
- ❓ Unknown whether the TextRank algorithm requires additional corpora
Real-World CJK Use Cases#
Translation (Chinese ↔ English)#
Need: Extract Chinese technical terms for translation glossaries
Recommendation:
- KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 (works, but character-level)
- chinese_keybert (better Chinese word segmentation)
- Hybrid: Manual review + filtering (BERT may miss proper terms)
Challenge: Character-level tokenization may extract “语言” (language) and “处理” (processing) separately, missing “自然语言处理” (natural language processing) as a complete term.
Multilingual Technical Documentation (Chinese, Japanese, Korean)#
Need: Consistent terminology across CJK languages
Recommendation:
- KeyBERT with multilingual model (supports all three)
- Per-language: chinese_keybert (Chinese), generic KeyBERT (Japanese, Korean)
Trade-off: Consistency (single model) vs quality (language-specific models).
Japanese Technical Writing#
Need: Extract Japanese technical terms (mix of Kanji, Hiragana, Katakana)
Recommendation:
- KeyBERT with multilingual model (handles all scripts)
- Consider spaCy Japanese model + textacy (if KeyBERT quality insufficient)
Note: Japanese mixes character sets (Kanji = character-level, Kana = syllabic). BERT handles this natively.
Verdict: CJK Support#
For Chinese:#
- 🥇 chinese_keybert (best word segmentation)
- 🥈 KeyBERT with multilingual model (works, but character-level)
- ❌ pyate (no general corpus)
For Japanese:#
- 🥇 KeyBERT with multilingual model (native support)
- 🥈 textacy + spaCy Japanese model (if KeyBERT insufficient)
- ❌ pyate (no general corpus)
For Korean:#
- 🥇 KeyBERT with multilingual model (native support)
- ❌ pyate (no general corpus)
For Multi-CJK (all three languages):#
🥇 KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 or LaBSE
- Single model for all three languages
- Consistent approach across CJK
- Trade-off: Character-level for Chinese may reduce term quality
Recommendations#
If CJK support is required (per research label):
- Default choice: KeyBERT with multilingual model (paraphrase-multilingual-MiniLM-L12-v2)
- Chinese-only: chinese_keybert fork (better word segmentation)
- NOT recommended: pyate (no CJK corpora)
If CJK + English mixed:
- KeyBERT works across languages in single model
- Useful for multilingual technical documentation
- Example: Code comments mixing English and Chinese
If terminology precision is critical:
- Consider manual review + filtering of KeyBERT output
- Character-level tokenization may miss multi-character technical terms
- Hybrid approach: KeyBERT extraction + human validation
Head-to-Head Comparison: pyate vs KeyBERT#
Quick Decision Matrix#
| Criterion | pyate | KeyBERT | Winner |
|---|---|---|---|
| Terminology extraction | ✓ Purpose-built | ✗ Keywords, not terms | pyate |
| Keyword extraction | ✗ Not designed for it | ✓ Semantic keywords | KeyBERT |
| Multilingual (general) | ~70 languages (via spaCy) | 50-100+ languages | KeyBERT |
| CJK support (Chinese/Japanese/Korean) | ❌ No corpora | ✅ Works out-of-box | KeyBERT |
| Speed | Fast (stats) | Slow (BERT) | pyate |
| Memory footprint | Low (~100MB spaCy model) | High (80MB-1.1GB BERT) | pyate |
| Multi-word terms | ✓ Designed for this | ~ May split into chars (CJK) | pyate |
| Semantic understanding | ✗ Statistical only | ✓ BERT embeddings | KeyBERT |
| No training required | ✓ (but needs corpora) | ✓ Pre-trained models | Tie |
| Installation simplicity | Moderate (spaCy + model) | Easy (pip + auto-download) | KeyBERT |
Fundamental Difference: Terminology vs Keywords#
Terminology Extraction (pyate)#
Goal: Find domain-specific technical terms (multi-word expressions, low ambiguity, conceptual importance)
Example Input: “Machine learning models use gradient descent for optimization.”
pyate Output:
- “machine learning models” (technical term)
- “gradient descent” (technical term)
- “optimization” (domain-specific concept)
Characteristics:
- Multi-word terms preferred (“natural language processing” > “language”)
- Domain-specificity (contrasts technical vs general corpus via Weirdness algorithm)
- Low ambiguity (terms have specific technical meaning)
Keyword Extraction (KeyBERT)#
Goal: Find semantically important words/phrases (document-level semantic relevance)
Example Input: “Machine learning models use gradient descent for optimization.”
KeyBERT Output:
- “machine learning” (semantically central to document)
- “gradient descent” (semantically central to document)
- “optimization” (important concept)
Characteristics:
- Semantic similarity to document meaning
- May include general words if semantically important (“important discovery”, “key finding”)
- Single or multi-word based on semantic coherence
Per Wikipedia and Sketch Engine:
Terminologists focus on finding terms specific to a particular technical domain, while information retrieval focuses on indexing terms capable of distinguishing among documents.
Algorithm Comparison#
pyate (Statistical + Linguistic)#
Algorithms:
- C-Value: Multi-word term recognition (nested term handling)
- Combo Basic: Weighted frequency + containment + length (highest precision)
- Weirdness: Technical corpus vs general corpus contrast
- Basic: Frequency with POS filtering
- Term Extractor: Hybrid approach
Process:
- spaCy POS tagging → identify noun phrases
- Apply statistical algorithm (frequency, containment, corpus contrast)
- Rank candidates by termhood score
- Return top-k terms
Benchmark: Astrakhantsev 2016 shows Combo Basic has highest precision for terminology extraction.
KeyBERT (Transformer-Based)#
Algorithm:
- BERT embedding for full document (768-dim vector)
- BERT embedding for each n-gram candidate
- Cosine similarity between document and candidates
- Return top-k by similarity
Process:
- Purely semantic (no POS tagging, no frequency counting)
- Finds candidates semantically similar to document meaning
- No contrast with general corpus (single-document operation)
Dependency Comparison#
pyate#
Required:
- spaCy (100MB-500MB language models)
- numpy, pandas
- pyahocorasick
- General domain corpus (for Weirdness, Term Extractor)
Total Install: ~150MB-600MB (depending on spaCy model)
Languages: Depends on spaCy model + corpus availability
- ✅ English, Italian (pre-built)
- ⚠️ ~70 languages (spaCy models exist, but no pyate corpora)
- ❌ CJK (spaCy models exist, but no general corpora for pyate algorithms)
KeyBERT#
Required:
- sentence-transformers (or alternative backend)
- BERT model (80MB-1.1GB depending on model)
Total Install: ~100MB-1.2GB (depending on model choice)
Languages: Depends on BERT model
- ✅ English (all-MiniLM-L6-v2): 80MB
- ✅ Multilingual 50+ languages (paraphrase-multilingual-MiniLM-L12-v2): 420MB
- ✅ 109 languages including CJK (LaBSE): 470MB
Performance: Speed & Resource Usage#
| Metric | pyate | KeyBERT |
|---|---|---|
| Inference Time (1000 words) | ~0.1-0.2s | ~1-2s |
| Relative Speed | 10x faster | Baseline |
| Memory Usage | 200MB-600MB | 500MB-1.5GB |
| GPU Acceleration | ✗ Not applicable | ✓ Available (optional) |
Bottleneck:
- pyate: spaCy POS tagging (fast, CPU-efficient)
- KeyBERT: BERT inference (slow, GPU benefits)
Optimization:
- pyate: Use smaller spaCy models (sm instead of lg)
- KeyBERT: Use ONNX backend (1.3-1.5x faster), smaller models (MiniLM vs mpnet)
Use Case Fit#
Translation Workflows#
Terminology Management: → pyate (builds glossaries of technical terms for translators)
Content Tagging: → KeyBERT (identifies topics/themes for routing to translators)
Multilingual Term Extraction: → KeyBERT (if CJK or low-resource languages) → pyate (if English/Italian and need precise terminology)
Technical Writing#
Glossary Generation: → pyate (extracts technical terms for documentation glossaries)
Index Creation: → KeyBERT (finds semantically important keywords for document index)
Domain-Specific NLP: → pyate (legal, medical, engineering terminology extraction)
CJK Language Projects#
Chinese/Japanese/Korean: → KeyBERT (only viable option - pyate lacks CJK corpora)
Chinese-Specific:
→ chinese_keybert fork (better word segmentation via CKIP)
Integration Recommendations#
Already Using spaCy?#
→ pyate (natural fit, add to pipeline)
Already Using sentence-transformers/BERT?#
→ KeyBERT (natural fit, same infrastructure)
Starting Fresh?#
- For terminology: pyate (if English/Italian) or KeyBERT (if CJK)
- For keywords: KeyBERT
- For speed: pyate
- For multilingual: KeyBERT
When to Use Both#
Complementary Use Case: Run both and combine results
- pyate → Technical terms (high precision, domain-specific)
- KeyBERT → Semantic keywords (broader context)
- Union: Comprehensive term + keyword coverage
Example: Technical documentation might need both:
- Glossary (pyate) + Index (KeyBERT)
- Translation terms (pyate) + Content tags (KeyBERT)
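A merge step for the two tools' outputs can be as simple as a provenance-tagged union. In this sketch, `pyate_terms` and `keybert_keywords` are hypothetical placeholders for the ranked candidate lists each library returns:

```python
def combine_term_lists(pyate_terms, keybert_keywords):
    """Union of two candidate lists, remembering which tool(s)
    proposed each entry. Candidates found by both tools are often
    the highest-confidence glossary entries."""
    combined = {}
    for term in pyate_terms:
        combined.setdefault(term.lower(), set()).add("pyate")
    for kw in keybert_keywords:
        combined.setdefault(kw.lower(), set()).add("keybert")
    return combined

merged = combine_term_lists(
    ["gradient descent", "neural network"],   # e.g. pyate output
    ["gradient descent", "model training"],   # e.g. KeyBERT output
)
# Candidates proposed by both tools
both = [t for t, src in merged.items() if len(src) == 2]
print(both)
```

Intersecting instead of unioning trades coverage for precision; the right choice depends on how much human review time is available downstream.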
Bottom Line#
Choose pyate if:
- ✅ You need pure terminology extraction (technical terms, glossaries)
- ✅ Language is English or Italian (pre-built support)
- ✅ Speed and resource efficiency matter
- ✅ Multi-word technical terms are critical
- ✅ You have or can build general domain corpora for your language
Choose KeyBERT if:
- ✅ You need semantic keyword extraction (topics, themes, content tags)
- ✅ Language is CJK (Chinese, Japanese, Korean) or low-resource
- ✅ Multilingual support (50-100+ languages) is required
- ✅ Semantic understanding (meaning-based) is more important than term specificity
- ✅ You don’t have general domain corpora available
Choose both if:
- ✅ You need comprehensive coverage (technical terms + semantic keywords)
- ✅ Resource constraints are not an issue
- ✅ Use cases include both glossary generation and content tagging
KeyBERT: Deep Technical Analysis#
Algorithm Implementation#
Core Approach:#
- Document Embedding: Extract BERT embedding for entire document (semantic representation)
- Candidate Embeddings: Extract BERT embeddings for n-gram candidates (words/phrases)
- Cosine Similarity: Calculate similarity between document and each candidate
- Top-K Selection: Return candidates most similar to document (highest cosine similarity)
Key Insight: Unlike statistical methods (frequency, co-occurrence), KeyBERT finds terms semantically similar to the document’s meaning, not just statistically prominent.
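The four steps above reduce to a cosine-similarity ranking. Stripped of the BERT encoder, the core scoring loop looks like this (the three-dimensional vectors are hand-made toys standing in for real 384- or 768-dimensional embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_candidates(doc_vec, candidate_vecs, top_k=2):
    """Steps 3-4 of KeyBERT: score every candidate against the
    document embedding, return the top-k by similarity."""
    scored = [(term, cosine(doc_vec, vec))
              for term, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

doc_vec = [0.9, 0.8, 0.1]                     # pretend document embedding
candidates = {
    "gradient descent": [0.85, 0.75, 0.15],   # close to the document
    "optimization":     [0.6, 0.9, 0.2],
    "the weather":      [0.1, 0.05, 0.95],    # semantically unrelated
}
print(rank_candidates(doc_vec, candidates))
```

Everything KeyBERT adds on top of this (n-gram candidate generation, MMR/MaxSum diversification, batched encoding) is refinement of these two functions.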
Embedding Backends#
Primary: sentence-transformers (Recommended)#
Models Available:
| Model | Languages | Use Case | Size |
|---|---|---|---|
| all-MiniLM-L6-v2 | English | Fast, good quality | 80MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ languages | Multilingual default | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 50+ languages | Higher quality | 1.1GB |
| distiluse-base-multilingual-cased-v1 | 15 languages (incl. Chinese, Korean) | Lightweight multilingual | 480MB |
| LaBSE | 109 languages | Maximum language coverage | 470MB |
Performance: Per MDPI study, mpnet achieved mean similarity score 0.71 ± 0.04 on STS 2017 dataset, but with higher computational demands.
Alternative Backends:#
- Flair: Contextual embeddings (slower, higher quality)
- Gensim: Word2Vec, Doc2Vec (lightweight, no transformers)
- spaCy: spaCy vectors (if already using spaCy)
- USE: Universal Sentence Encoder (Google)
Multilingual Support#
General Multilingual:#
✅ Excellent - Works with 50-100+ languages via multilingual BERT models
CJK-Specific Handling:#
Tokenization (Google BERT docs):
- Chinese: Character-tokenized (spaces added around every CJK Unicode character before WordPiece)
- Japanese Kanji: Character-tokenized (same as Chinese)
- Korean Hanja: Character-tokenized (Chinese-origin characters)
- Katakana/Hiragana: Whitespace + WordPiece (normal tokenization)
- Hangul Korean: Whitespace + WordPiece (normal tokenization)
Implication: Multilingual BERT handles CJK natively, but character-level tokenization may affect term quality for Chinese.
Chinese-Specific Implementation:#
chinese_keybert exists as a specialized fork:
- Uses CKIP library for Chinese word segmentation and POS tagging
- Leverages sentence-transformers for embeddings
- Better for Chinese than generic multilingual BERT (proper word boundaries)
Recommendation for CJK: Use paraphrase-multilingual-MiniLM-L12-v2 or language-specific BERT models (e.g., bert-base-chinese).
Performance Characteristics#
Strengths:#
- Semantic understanding: Finds keywords by meaning, not just frequency
- Multilingual: 50-100+ languages out-of-box
- No training required: Pre-trained BERT models work immediately
- Simple API: “pip install + 3 lines of code” design goal
- Flexible backends: sentence-transformers, Flair, spaCy, Gensim
Weaknesses:#
- Compute-intensive: BERT inference is slower than statistical methods
- Memory footprint: Models are 80MB-1.1GB (vs <10MB for statistical tools)
- Keywords ≠ Terminology: Extracts semantically important words, not necessarily technical terms
- Character-level CJK: Chinese/Japanese may get character-level tokens, not proper words
Speed Comparison:#
| Method | Relative Speed | Notes |
|---|---|---|
| RAKE/YAKE | 10x faster | Pure statistics |
| pyate | 5x faster | spaCy POS + stats |
| KeyBERT | Baseline (1x) | BERT inference |
Optimization: Use ONNX backend (1.3-1.5x speedup) or OpenVINO (Intel hardware optimization).
Integration Patterns#
Basic Usage:#
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
"Your document text here...",
keyphrase_ngram_range=(1, 3),
top_n=10
)
Custom Embedding Model:#
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
# Multilingual model for CJK support
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=st_model)
Chinese-Specific:#
# Use chinese_keybert fork for better Chinese support
from chinese_keybert import ChineseKeyBERT
kw_model = ChineseKeyBERT()
keywords = kw_model.extract_keywords("中文文本...")
Use Case Fit#
Best For:
- ✅ Semantic keyword extraction (meaning-based, not just frequency)
- ✅ Multilingual content (50-100+ languages including CJK)
- ✅ Content tagging/classification (finding topics, themes)
- ✅ Document similarity (embeddings enable clustering)
- ✅ Low-resource languages (multilingual BERT covers many languages)
NOT Best For:
- ❌ Pure terminology extraction (extracts keywords, not technical terms)
- ❌ Speed-critical applications (BERT inference is slow)
- ❌ Resource-constrained environments (large models, high memory)
- ❌ Multi-word technical terms (may split into characters for CJK)
Terminology vs Keywords: Key Difference#
Example Document: “Machine learning uses neural networks for classification tasks.”
KeyBERT Output (semantic keywords):
- “machine learning” (semantically central)
- “neural networks” (semantically central)
- “classification” (important concept)
Terminology Extraction Output (technical terms):
- “machine learning”
- “neural networks”
- “classification tasks”
Difference: KeyBERT finds semantically important words. Terminology extraction finds domain-specific technical terms. Overlap exists, but goals differ.
Per Wikipedia and Sketch Engine:
- Terminology: Domain-specific, low-ambiguity, multi-word expressions
- Keywords: Distinguish documents, may be general words with high semantic importance
Maintenance and Community#
- Status: ✅ Active (2023+ releases)
- GitHub: 3.5K+ stars, very active
- Documentation: Excellent (comprehensive guides, FAQ)
- Community: Large user base, active discussions
Key Citations#
Multilingual BERT (Google Research): 110K shared vocabulary, 102 languages, character-tokenization for CJK.
Sentence-Transformers Models: Performance benchmarks on STS 2017 dataset.
KeyBERT FAQ: Guidance on model selection and use cases.
Bottom Line#
KeyBERT is the strongest choice for semantic keyword extraction across 50-100+ languages including CJK. Excellent multilingual support via BERT, but extracts keywords (semantic importance) not terminology (technical terms). For pure terminology extraction, pyate is better (if language is supported). For CJK semantic keywords, KeyBERT works out-of-box with multilingual models, though chinese_keybert fork provides better Chinese word segmentation.
pyate: Deep Technical Analysis#
Algorithm Implementation#
Available Algorithms (5 total):#
- Basic: Frequency-based with POS filtering
- Combo Basic: Extension of Basic (highest precision per Astrakhantsev 2016)
- C-Value: Multi-word term recognition (Frantzi et al. 1998)
- Weirdness: Contrasts technical vs general corpus
- Term Extractor: Hybrid approach
Combo Basic (Recommended)#
Formula: Weighted average of:
- Number of times term t contains another candidate term
- Number of times another candidate term contains t
- Length of t in characters × log(frequency of t)
Performance: Per Astrakhantsev 2016, combo_basic is most precise of the five algorithms implemented in pyate. Basic and C-Value are “not too far behind.”
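The weighting can be made concrete with a toy re-implementation. The sketch follows the common ComboBasic formulation score(t) = |t|·log f(t) + α·e_t + β·e'_t, using the length-in-characters reading given above; the α and β defaults here are illustrative guesses, not pyate's actual values:

```python
import math

def combo_basic_score(candidates, alpha=0.75, beta=0.1):
    """Toy ComboBasic. candidates maps term -> frequency.

    score(t) = len(t) * log(freq)
             + alpha * (# other candidates containing t)
             + beta  * (# other candidates contained in t)
    """
    scores = {}
    for t, freq in candidates.items():
        contains_t = sum(1 for u in candidates if u != t and t in u)
        t_contains = sum(1 for u in candidates if u != t and u in t)
        scores[t] = (len(t) * math.log(freq)
                     + alpha * contains_t + beta * t_contains)
    return sorted(scores.items(), key=lambda p: p[1], reverse=True)

freqs = {
    "language": 6,
    "language processing": 4,
    "natural language processing": 3,
}
print(combo_basic_score(freqs))
```

Even though "language" is the most frequent string, the length and containment terms push the full multi-word expression to the top, which is exactly the behavior terminology extraction wants.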
Comparison to State-of-the-Art: PU-ATR and KeyConceptRel have higher precision than combo_basic but:
- Not implemented in pyate
- PU-ATR takes significantly more time (uses machine learning)
Dependencies#
Required:
- spaCy (POS tagging)
- numpy, pandas (data processing)
- pyahocorasick (pattern matching)
Language Models: Requires spaCy language model (e.g., en_core_web_sm for English)
Multilingual Support#
Current Status:#
- Supported (as of v0.4.2): English, Italian
- Requires: Language-specific spaCy model + general domain corpus
Language Switching:#
from pyate import combo_basic
combo_basic.set_language("zh", "zh_core_web_sm")  # Chinese example
CJK Language Support:#
spaCy Models Available:
- ✓ Chinese (zh_core_web_sm, zh_core_web_md, zh_core_web_lg)
- ✓ Japanese (ja_core_news_sm, ja_core_news_md, ja_core_news_lg)
- ✓ Korean (rule-based tokenizer available)
pyate Status for CJK: ❌ No native CJK support - Per GitHub Issue #13, pyate lacks general domain corpora for Chinese, Japanese, Korean. While spaCy can tokenize/POS-tag CJK text, pyate’s algorithms (especially Weirdness and Term Extractor) require reference corpora that don’t exist yet.
Implication: pyate can technically run on CJK text if you provide your own general corpus, but no out-of-box CJK support.
Performance Characteristics#
Strengths:#
- High precision for terminology (vs keywords): Targets multi-word technical terms
- Multiple algorithms: Can choose based on use case (C-Value for nested terms, Combo Basic for precision)
- Domain-specific: Weirdness algorithm contrasts technical vs general language
- Benchmark-proven: Astrakhantsev 2016 validates performance
Weaknesses:#
- Requires corpora: Weirdness and Term Extractor need reference corpora (not available for all languages)
- spaCy dependency: Heavier stack, requires language models (100MB-500MB)
- Limited CJK: No pre-built support for Chinese, Japanese, Korean
Speed:#
- Fast (statistical algorithms, not ML inference)
- Slower than YAKE/RAKE (due to spaCy POS tagging)
- Much faster than KeyBERT (no transformer inference)
Integration Patterns#
spaCy Pipeline Integration:#
import spacy
from pyate import combo_basic
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp("Your technical document here...")
terms = doc._.combo_basic  # extracted terminology; extension attribute matches the pipe name
Standalone Usage:#
from pyate import combo_basic
text = "Natural language processing and machine learning..."
terms = combo_basic(text).sort_values(ascending=False).head(10)
Use Case Fit#
Best For:
- ✅ Technical terminology extraction (not general keywords)
- ✅ Translation memory creation
- ✅ Glossary generation for technical writing
- ✅ Domain-specific NLP (medical, legal, engineering)
- ✅ Multi-word term recognition (e.g., “natural language processing”)
NOT Best For:
- ❌ CJK languages (no pre-built corpora)
- ❌ General keyword extraction (use YAKE or KeyBERT)
- ❌ Semantic understanding (use KeyBERT)
- ❌ Low-resource languages (requires spaCy model + corpus)
Maintenance and Community#
- Status: ✅ Active (spaCy v3 support, releases in 2023)
- GitHub: 320+ stars, regular updates
- Documentation: Good (docs site + demo app)
- Community: Listed in spaCy Universe (official ecosystem)
Key Citations#
Astrakhantsev, N. (2016). “ATR4S: toolkit with state-of-the-art automatic terms recognition methods in Scala.” Language Resources and Evaluation, 52(3), 853-872.
- Benchmark showing combo_basic has highest precision
Frantzi, K.T., Ananiadou, S., Tsujii, J. (1998). “The C-value/NC-value Method of Automatic Recognition for Multi-word Terms.”
- Original C-Value algorithm for multi-word term extraction
Bottom Line#
pyate is the strongest choice for pure terminology extraction in supported languages (English, Italian). Implements proven algorithms (Combo Basic, C-Value) with benchmark-validated precision. CJK support is blocked by lack of general domain corpora, making it unsuitable for Chinese, Japanese, Korean unless you provide your own reference corpus.
S2 Comprehensive: Technical Recommendations#
Executive Summary#
Deep analysis reveals fundamentally different tools for different goals:
- pyate: Pure terminology extraction (technical terms, domain-specific)
- KeyBERT: Semantic keyword extraction (meaning-based, content-level)
CJK Impact: Research label cjk is critical decision factor. pyate has no CJK support (missing corpora), making KeyBERT the only viable option for Chinese, Japanese, Korean.
Main Findings#
1. Terminology vs Keywords: Different Goals#
Critical Insight: These libraries solve different problems.
| Aspect | Terminology Extraction (pyate) | Keyword Extraction (KeyBERT) |
|---|---|---|
| Goal | Domain-specific technical terms | Semantically important words |
| Target | Multi-word expressions, low ambiguity | Document-level semantic relevance |
| Use Case | Glossaries, translation memory | Content tagging, indexing |
| Example Output | “natural language processing”, “gradient descent optimization” | “language”, “processing”, “gradient descent” |
Evidence: Per Wikipedia, terminologists focus on terms specific to a technical domain (organized knowledge), while information retrieval focuses on indexing terms (document retrieval).
Implication: Choice depends on use case, not just “which is better.”
2. pyate Strengths & Weaknesses#
Strengths:
- ✅ Highest precision for terminology (Astrakhantsev 2016: combo_basic most precise)
- ✅ Multiple algorithms (C-Value, Combo Basic, Weirdness, Basic, Term Extractor)
- ✅ Multi-word term focus (designed for phrases like “machine learning model”)
- ✅ Domain-specificity (Weirdness contrasts technical vs general corpus)
- ✅ Fast (10x faster than KeyBERT, statistical algorithms)
Weaknesses:
- ❌ NO CJK support (missing general corpora for Chinese, Japanese, Korean)
- ❌ Limited languages (only English, Italian pre-built)
- ❌ Requires corpora (Weirdness and Term Extractor need reference corpora)
- ❌ spaCy dependency (100MB-500MB language models)
Verdict: Best for English/Italian terminology extraction. Not viable for CJK.
3. KeyBERT Strengths & Weaknesses#
Strengths:
- ✅ Excellent CJK support (50-109 languages via multilingual BERT)
- ✅ Semantic understanding (meaning-based, not just frequency)
- ✅ Simple API (“pip install + 3 lines of code”)
- ✅ No corpora required (pre-trained BERT works immediately)
- ✅ Multilingual (single model for many languages)
Weaknesses:
- ❌ Keywords, not terminology (different goal - semantic importance vs technical terms)
- ❌ Character-level CJK (Chinese tokenized as characters, may miss word boundaries)
- ❌ Slow (BERT inference 10x slower than pyate)
- ❌ Large models (80MB-1.1GB vs pyate’s statistical approach)
Verdict: Best for CJK keyword extraction. Use chinese_keybert fork for better Chinese quality.
4. CJK Support: Decisive Factor#
Requirement: Research label cjk indicates Chinese, Japanese, Korean support needed.
Analysis:
- pyate: ❌ No CJK corpora → Cannot be recommended
- KeyBERT: ✅ Works out-of-box → Only viable option
CJK-Specific Challenges:
- Multilingual BERT tokenizes Chinese character-level (not word-level)
- May extract “语言” (language) and “处理” (processing) separately, missing “自然语言处理” (natural language processing)
- Solution: chinese_keybert fork (CKIP word segmentation) for Chinese-only use cases
Recommendation: KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 for multi-CJK, chinese_keybert for Chinese-only.
5. Algorithm Comparison#
pyate Algorithms (ranked by precision per Astrakhantsev 2016):
- Combo Basic (highest precision): Weighted frequency + containment + length
- C-Value (close second): Multi-word term recognition, nested terms
- Basic (baseline): Frequency with POS filtering
KeyBERT Algorithm:
- BERT document embedding → candidate embeddings → cosine similarity → top-k
Benchmark: Combo Basic beats Basic and C-Value. PU-ATR and KeyConceptRel have higher precision but are not implemented (and PU-ATR is much slower).
Implication: pyate’s combo_basic is state-of-practice (not state-of-art, but best available in pip-installable libraries).
6. Performance Trade-offs#
| Metric | pyate | KeyBERT | Ratio |
|---|---|---|---|
| Speed (1000 words) | ~0.1-0.2s | ~1-2s | 10x faster |
| Memory | 200-600MB | 500-1500MB | 2-3x lighter |
| Model Size | 100-500MB (spaCy) | 80-1100MB (BERT) | Similar |
Optimization:
- pyate: Use sm spaCy models (smallest)
- KeyBERT: Use ONNX backend (1.3-1.5x faster), MiniLM models (80MB)
Verdict: pyate is faster and lighter, but KeyBERT is acceptable for most use cases.
Recommendations by Use Case#
Translation Workflows#
Glossary Creation (technical term extraction):
- English/Italian: pyate with combo_basic
- CJK: KeyBERT with multilingual model (fallback: manual review)
Content Tagging (routing to translators):
- All languages: KeyBERT (semantic keywords for topic identification)
Chinese-English Translation:
- chinese_keybert for Chinese terms
- pyate for English terms
- Challenge: Character-level Chinese tokenization may miss multi-character terms
Technical Writing#
Glossary Generation:
- English/Italian: pyate (purpose-built for terminology)
- CJK: KeyBERT (only option, but review character-level output)
Index Creation:
- All languages: KeyBERT (semantic keywords for index)
Domain-Specific NLP (medical, legal, engineering):
- English/Italian: pyate (domain terminology extraction via Weirdness)
- CJK: KeyBERT + manual filtering (BERT may extract keywords, not domain terms)
Multilingual Projects (CJK + English)#
Single Model for All:
- KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 (50+ languages)
- Consistent approach across languages
- Trade-off: Character-level CJK, keywords (not terms)
Per-Language Optimization:
- English: pyate (terminology)
- Chinese: chinese_keybert (better word segmentation)
- Japanese/Korean: KeyBERT with multilingual model
- Trade-off: Inconsistent approaches, but higher quality
S2 Decision Tree#
Do you need CJK (Chinese, Japanese, Korean) support?
├─ YES → KeyBERT (only viable option)
│ ├─ Chinese-only → chinese_keybert (better word segmentation)
│ ├─ Multi-CJK → KeyBERT + paraphrase-multilingual-MiniLM-L12-v2
│ └─ Note: Character-level tokenization, keywords not terms
│
└─ NO (English, Italian, or other spaCy-supported languages)
│
├─ Do you need TERMINOLOGY extraction (technical terms, glossaries)?
│ ├─ YES → pyate with combo_basic
│ │ ├─ Multi-word terms: Use C-Value
│ │ ├─ Domain-specific: Use Weirdness (requires general corpus)
│ │ └─ General: Use Combo Basic (highest precision)
│ │
│ └─ NO → Go to keyword extraction
│
└─ Do you need KEYWORD extraction (semantic importance, content tags)?
├─ YES → KeyBERT
│ ├─ English: all-MiniLM-L6-v2 (80MB, fast)
│ ├─ Multilingual: paraphrase-multilingual-MiniLM-L12-v2 (420MB)
│ └─ High quality: paraphrase-multilingual-mpnet-base-v2 (1.1GB)
│
└─ BOTH? → Run both, combine results
├─ pyate → Technical terms
├─ KeyBERT → Semantic keywords
└─ Union → Comprehensive coverage
S2 Top Recommendations#
1. For CJK Use Cases (per research label):#
KeyBERT with multilingual model
- Model: paraphrase-multilingual-MiniLM-L12-v2 (420MB, 50+ languages)
- Rationale: Only pip-installable library with CJK support
- Trade-off: Keywords (not terminology), character-level Chinese
- Mitigation: Use chinese_keybert for Chinese-only, manual review for technical terms
2. For English/Italian Terminology Extraction:#
pyate with combo_basic algorithm
- Rationale: Highest precision (Astrakhantsev 2016), purpose-built for terminology
- Use cases: Glossaries, translation memory, domain-specific NLP
- Trade-off: No CJK, requires spaCy dependency
3. For Hybrid Multilingual (CJK + English):#
KeyBERT (CJK) + pyate (English)
- Rationale: Best-of-both (KeyBERT for CJK, pyate for English terminology)
- Trade-off: Two libraries, inconsistent approaches
- Value: Maximizes quality per language
Next Steps (S3: Need-Driven)#
Recommended focus for S3:
- Real-world use cases: Translation workflows, technical writing, domain-specific NLP
- CJK quality assessment: How well does KeyBERT handle Chinese/Japanese/Korean in practice?
- Integration patterns: spaCy pipelines (pyate) vs sentence-transformers (KeyBERT)
- TCO analysis: Installation, dependencies, resource requirements
- Community feedback: What do users report about CJK quality?
Key questions for S3:
- Can chinese_keybert quality justify additional dependency?
- What is acceptable precision for CJK technical term extraction?
- Should users run both libraries and combine results?
- Are there workflow patterns that maximize value (e.g., KeyBERT extraction → human validation)?
S3: Need-Driven
S3 Need-Driven Research: Approach#
Goal#
Understand how terminology extraction libraries fit into real-world workflows, not just technical capabilities. Focus on:
- Actual use cases (translation, technical writing, domain NLP)
- Integration with existing tools (CAT tools, documentation systems)
- Total Cost of Ownership (installation, maintenance, training)
- Workflow patterns that maximize value
- CJK quality in practice (not just theoretical support)
Research Method#
Translation Workflow Analysis: How do translators use terminology extraction?
- CAT tool integration (SDL Trados, MemoQ, Smartcat)
- Bilingual glossary creation
- Productivity impact (time savings, quality improvement)
Technical Writing Workflow: Documentation team use cases
- Glossary generation for user manuals
- Terminology consistency across documents
- Integration with documentation tools (Sphinx, MkDocs)
Integration Patterns: How to integrate pyate/KeyBERT into existing stacks
- spaCy pipeline integration (pyate)
- sentence-transformers ecosystem (KeyBERT)
- Batch processing, API deployment
TCO Analysis: Beyond pip install
- Installation complexity (dependencies, models, corpora)
- Resource requirements (CPU, memory, disk)
- Maintenance overhead (updates, model management)
- Training requirements (learning curve for teams)
Community Feedback: What do users report?
- GitHub issues, discussions
- Blog posts, case studies
- Translator/writer testimonials
Key Questions#
For Translation Workflows:#
- Do CAT tools integrate with Python libraries, or is manual export/import needed?
- What is the typical glossary creation time with vs without automated extraction?
- How well do extracted terms match translator expectations (precision, recall)?
- Does CJK extraction quality justify automation, or is manual curation still needed?
For Technical Writing:#
- How do teams manage terminology consistency across large documentation sets?
- What is the workflow for validating extracted terms (human-in-the-loop)?
- Do teams run extraction once (initial glossary) or continuously (every doc update)?
For Integration:#
- Can pyate/KeyBERT run in batch mode (process thousands of documents)?
- What are API deployment patterns (REST service, microservice)?
- How do teams handle versioning (model updates, algorithm changes)?
For TCO:#
- What is the total installation footprint (pyate: spaCy models, KeyBERT: BERT models)?
- What are ongoing maintenance costs (model updates, dependency conflicts)?
- What is the learning curve for non-ML engineers?
Expected Outcome#
Practical recommendations for:
- When to use terminology extraction (value > cost threshold)
- How to integrate into existing workflows (step-by-step patterns)
- Which library for which use case (pyate vs KeyBERT decision criteria)
- What to expect from CJK extraction (quality assessment, manual review needs)
Sources#
- Translation community: linguagreca.com, translator blogs, CAT tool documentation
- Technical writing: Docs-as-code community, Sphinx/MkDocs forums
- Integration: GitHub issues, Stack Overflow, Medium blog posts
- TCO: PyPI package statistics, model sizes, dependency graphs
S3 Need-Driven: Practical Recommendations#
Executive Summary#
Real-world analysis confirms S2 technical findings:
- pyate: High-value for English/Italian technical terminology (60-80% time savings in translation)
- KeyBERT: Only viable option for CJK, but requires validation (precision ~60-70% for technical terms)
Key Insight: Automated extraction is time-saving (prep work), not replacement (human review essential).
Use Case Validation#
Translation Workflows: ✅ High Value#
Quantified Benefits:
- Time savings: 60-80% reduction in terminology preparation (2-4 hours → 30-60 min per 10K words)
- ROI: Bilingual extraction (e.g., XTM) saves 80% of glossary creation time
- Translator feedback: “Translation life much easier” with KeyBERT
Reality Check:
- CAT tools prefer integrated features (Python libraries require export/import)
- Precision ~70-80% for pyate, ~60-70% for KeyBERT (CJK) → human validation required
- Works best for initial glossary creation, not real-time translation support
Recommendation: Use for large projects (>5,000 words), recurring domains (glossary reuse), multiple translators (consistency). Skip for small/one-off translations.
Technical Writing: ✅ Moderate Value#
Benefits:
- Glossary generation for documentation
- Terminology consistency checking
- Index creation (KeyBERT for semantic keywords)
Challenges:
- Requires integration into docs-as-code workflow (Sphinx, MkDocs)
- One-time use per documentation set (less recurring value than translation)
- Manual review still needed (domain experts validate terms)
Recommendation: Use for documentation >10K words with complex terminology. Integrate into CI/CD for automated glossary updates.
Domain-Specific NLP: ✅ High Value (Foundation)#
Use Case: Build domain-specific models (medical, legal, engineering NLP)
Value:
- Terminology extraction is foundation for domain ontologies
- Multi-word term recognition (pyate) critical for technical domains
- Fine-tuning embeddings on extracted terminology improves downstream tasks
Recommendation: pyate for English domain modeling, KeyBERT for CJK/multilingual domains.
Integration Patterns: Validated#
spaCy Pipeline (pyate)#
Pattern: Add pyate as pipeline component
```python
import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline  # registers the "combo_basic" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp("Technical document...")
terms = doc._.combo_basic  # pandas Series: term -> termhood score
```

Value: If already using spaCy, pyate is a natural extension. No additional infrastructure.
Trade-off: Requires spaCy models (100MB-500MB). If not using spaCy, KeyBERT may be lighter.
sentence-transformers Ecosystem (KeyBERT)#
Pattern: Use pre-trained embeddings, integrate with semantic search stack
```python
from keybert import KeyBERT

kw_model = KeyBERT("paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords("Document...")  # list of (keyword, score) pairs
```

Value: If building a semantic search / retrieval system, KeyBERT reuses the same BERT models. Infrastructure overlap.
Trade-off: BERT models are large (80MB-1.1GB); if you are not already using them elsewhere, loading one just for term extraction may not be worth it.
Standalone / Batch Processing#
Use Case: Process large corpus (thousands of documents)
Pattern:
- Load model once (pyate: spaCy, KeyBERT: BERT)
- Batch process documents (minimize model loading overhead)
- Export results to database / CSV
- Human review interface (validate extracted terms)
Performance: Both libraries support batch processing efficiently.
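The batch pattern above can be sketched as a small harness. The freq_extract stand-in is illustrative only; in practice, pass a function wrapping pyate's combo_basic or KeyBERT's extract_keywords, with the underlying model loaded once outside the loop:

```python
import csv

def batch_extract(documents, extract_fn, out_path):
    """Apply a single loaded extractor across many documents, export to CSV.

    extract_fn: callable(text) -> iterable of (term, score). Load the
    underlying model (spaCy or BERT) once, outside this loop, to avoid
    per-document loading overhead.
    """
    rows = []
    for doc_id, text in documents.items():
        for term, score in extract_fn(text):
            rows.append({"doc_id": doc_id, "term": term, "score": score})
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["doc_id", "term", "score"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Stand-in extractor: raw term frequency (swap in combo_basic or KeyBERT)
def freq_extract(text):
    words = text.lower().split()
    return [(w, words.count(w))
            for w in sorted(set(w for w in words if len(w) > 3))]

docs = {"doc1": "neural network trains neural network layers"}
rows = batch_extract(docs, freq_extract, "terms.csv")
```

The CSV output then feeds the human review step; the same structure works for a database sink.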
TCO Analysis: Practical Costs#
Installation Complexity#
| Library | Install Steps | Download Size | Time to First Run |
|---|---|---|---|
| pyate | pip install pyate + download spaCy model | ~150MB-600MB | ~5-10 min |
| KeyBERT | pip install keybert (auto-downloads model) | ~100MB-1.2GB | ~2-5 min |
Verdict: KeyBERT slightly simpler (auto-download), but pyate is straightforward if using spaCy already.
Resource Requirements#
| Metric | pyate | KeyBERT | Notes |
|---|---|---|---|
| CPU | ~2-4 cores | ~4-8 cores (or GPU) | KeyBERT benefits from GPU |
| Memory | ~500MB-1GB | ~1-2GB | BERT models are larger |
| Disk | ~200MB-600MB | ~500MB-1.5GB | Model storage |
Verdict: pyate is lighter. KeyBERT manageable for server deployments, but heavy for edge/mobile.
Maintenance Overhead#
pyate:
- spaCy model updates (quarterly)
- Python dependency conflicts (rare, spaCy stable)
- Effort: ~1-2 hours/year
KeyBERT:
- sentence-transformers model updates (bi-annual)
- BERT model changes (new models, deprecations)
- Effort: ~2-4 hours/year
Verdict: Both low-maintenance. pyate slightly lower due to spaCy stability.
Learning Curve#
| Audience | pyate | KeyBERT | Notes |
|---|---|---|---|
| Non-ML Engineer | Moderate (need spaCy basics) | Easy (3 lines of code) | KeyBERT simpler API |
| NLP Engineer | Easy (familiar with spaCy) | Easy (familiar with BERT) | Both straightforward |
| Translator/Writer | Hard (Python required) | Moderate (simple script) | Both require coding skills |
Verdict: KeyBERT easier for beginners. pyate easier if already using spaCy.
CJK Quality in Practice#
Chinese (中文)#
KeyBERT (generic multilingual):
- Tokenization: Character-level (“自然语言处理” → [“自”, “然”, “语”, “言”, “处”, “理”])
- Quality: ~60-70% precision for technical terms (may extract characters, not words)
- Validation: Manual review essential (check word boundaries, technical accuracy)
chinese_keybert (specialized fork):
- Tokenization: Word-level via CKIP (“自然语言处理” → single token)
- Quality: ~70-80% precision (better word segmentation)
- Trade-off: Chinese-only, additional dependency
Recommendation: If Chinese-only, use chinese_keybert. If multi-CJK, use KeyBERT + manual validation.
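The word-boundary problem can be seen directly below. The jieba/CountVectorizer pattern in the comments follows the approach described in KeyBERT's documentation for Chinese and assumes jieba and scikit-learn are installed; treat it as a sketch, not a drop-in recipe:

```python
# Character-level view: what a generic multilingual tokenizer may produce
term = "自然语言处理"  # "natural language processing"
chars = list(term)
print(chars)  # ['自', '然', '语', '言', '处', '理']

# Word-level view: a Chinese word segmenter keeps the term intact.
# With jieba and scikit-learn installed, KeyBERT can score word-level
# candidates via a custom vectorizer (pattern from the KeyBERT docs):
#
#   import jieba
#   from sklearn.feature_extraction.text import CountVectorizer
#   from keybert import KeyBERT
#
#   vectorizer = CountVectorizer(tokenizer=jieba.lcut)
#   kw_model = KeyBERT("paraphrase-multilingual-MiniLM-L12-v2")
#   keywords = kw_model.extract_keywords(doc_text, vectorizer=vectorizer)
```

This custom-vectorizer route is a middle ground between generic KeyBERT (character-level) and the chinese_keybert fork (CKIP segmentation).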
Japanese (日本語)#
KeyBERT (multilingual):
- Tokenization: Mixed (Kanji character-level, Kana syllabic)
- Quality: ~65-75% precision (handles multiple scripts reasonably)
- Validation: Review for proper term boundaries
Alternative: textacy + spaCy Japanese model (if KeyBERT insufficient)
Recommendation: KeyBERT is viable, but expect ~25-35% false positives requiring manual filtering.
Korean (한국어)#
KeyBERT (multilingual):
- Tokenization: Syllable-level (Hangul) + character-level (Hanja)
- Quality: ~65-75% precision
- Validation: Manual review for technical terms
Recommendation: KeyBERT only option. Plan for human-in-loop validation.
Decision Framework: Practical#
Do you need CJK (Chinese, Japanese, Korean) support?
├─ YES → KeyBERT (only option)
│ ├─ Chinese-only → chinese_keybert (better quality)
│ ├─ Multi-CJK → KeyBERT + multilingual model
│ └─ **CRITICAL**: Plan for 25-40% manual validation (character-level issues)
│
└─ NO (English, Italian, or wait for pyate language support)
│
├─ Do you have existing spaCy infrastructure?
│ ├─ YES → pyate (natural fit, reuse spaCy models)
│ └─ NO → pyate still recommended for terminology (precision)
│
├─ Document size:
│ ├─ Large (>5,000 words) → Automated extraction justified (60-80% time savings)
│ └─ Small (<1,000 words) → Manual may be faster (extraction overhead)
│
└─ Use case:
├─ Translation → pyate (technical terms for glossaries)
├─ Technical writing → pyate (glossaries, documentation)
├─ Content tagging → KeyBERT (semantic keywords)
└─ Domain NLP → pyate (foundation for ontologies)
S3 Top Recommendations#
1. For CJK Translation/Technical Writing:#
KeyBERT with human validation workflow
- Extract: KeyBERT with paraphrase-multilingual-MiniLM-L12-v2
- Review: Human validation (expect ~25-40% false positives for CJK)
- Effort: 60-90 min per 10K words (vs 2-4 hours manual)
- Value: Automation handles volume, humans ensure CJK quality
2. For English/Italian Translation:#
pyate for initial glossary + CAT tool integration
- Extract: pyate with combo_basic
- Export: CSV/TBX format to CAT tool
- Effort: 30-60 min per 10K words (vs 2-4 hours manual)
- Value: 60-80% time savings, ~70-80% precision
3. For Multilingual Technical Writing:#
KeyBERT for CJK + pyate for English (hybrid)
- Extract: Per-language (best algorithm for each)
- Consolidate: Merge glossaries
- Value: Maximum quality per language
4. For Domain-Specific NLP:#
pyate as foundation for English, KeyBERT for CJK/multilingual
- Extract: Multi-word technical terms (critical for ontologies)
- Use: Fine-tune embeddings, train domain classifiers
- Value: Terminology extraction as NLP pipeline component
Workflow Pattern: Human-in-Loop#
Validated Pattern (works across use cases):
- Automated Extraction: pyate (English) or KeyBERT (CJK) on source corpus
- Bulk Filtering: Remove obvious false positives (frequency < threshold, single characters for CJK)
- Human Review: Domain expert validates technical accuracy (~15-30 min per 100 terms)
- Glossary Export: CSV/TBX to CAT tool or documentation system
- Maintenance: Add missed terms manually (ongoing, as new documents processed)
Value: Automation handles volume (extract 1000s of candidates), humans ensure quality (validate technical accuracy, domain fit).
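The bulk-filtering step (2) above can be sketched as a simple score/length filter; the thresholds and sample scores here are illustrative and should be tuned per corpus:

```python
def bulk_filter(candidates, min_score=0.0, min_len=2):
    """Step 2 of the human-in-loop pattern: drop obvious false positives
    before human review.

    candidates: dict of term -> score, as produced by pyate (termhood)
    or KeyBERT (cosine similarity). min_len=2 removes single characters,
    a common source of CJK noise from character-level tokenization.
    """
    return {
        term: score
        for term, score in candidates.items()
        if score >= min_score and len(term) >= min_len
    }

# Illustrative candidate list with made-up scores
candidates = {"neural network": 3.2, "the": 0.1, "语": 0.8, "梯度下降": 2.5}
kept = bulk_filter(candidates, min_score=0.5)
```

What survives the filter goes to the domain expert; anything dropped here never costs review time, which is where the 15-30 min per 100 terms figure comes from.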
Next Steps (S4: Strategic)#
Key questions for S4:
- Long-term viability: Which libraries will be maintained in 2027-2030?
- Technology evolution: Will character-level CJK improve (new BERT tokenizers)?
- Integration trends: Will CAT tools adopt Python libraries, or remain separate?
- Alternative approaches: Should teams build custom extractors vs use libraries?
Translation Workflow Use Case#
Typical Translation Terminology Workflow#
Without Automated Extraction:
- Translator receives source document
- Manually identifies technical terms while translating
- Adds terms to glossary as encountered
- Time: 2-4 hours per 10,000-word document for terminology work
With Automated Extraction:
- Run terminology extraction on source document
- Review extracted terms (validate, filter false positives)
- Add validated terms to glossary
- Time: 30-60 minutes per 10,000-word document
Time Savings: 60-80% reduction in terminology preparation time
Source: Nimdzi
CAT Tool Integration#
Current State:#
- Most CAT tools (SDL Trados, MemoQ, Smartcat) have built-in term extraction
- Python libraries (pyate, KeyBERT) require manual export/import workflow
- Translator preference: Integrated tools within CAT environment
Per LinguaGreca survey:
Translators prefer to have terminology extraction integrated in their CAT tool, rather than using separate tools.
Integration Pattern:#
Source Document → Python Extract → CSV/TBX → Import to CAT Tool → Human Review → Glossary
Trade-off: Python libraries offer better algorithms but require extra steps. CAT built-in tools are convenient but less sophisticated.
Bilingual Terminology Extraction#
Need: Extract source term + target translation pairs from aligned text (translation memories)
Challenge: pyate and KeyBERT are monolingual (extract from single language)
Workflow for Bilingual:
- Extract terms from source language (EN: pyate or KeyBERT)
- Extract terms from target language (CJK: KeyBERT only viable option)
- Manual alignment: Match source terms to target translations
- Alternative: Use bilingual extraction tools (SynchroTerm, XTM)
Time Savings (per XTM): Automated bilingual extraction saves 80% of glossary creation time
pyate in Translation#
Strengths:#
- ✅ High precision for technical terms (Combo Basic algorithm)
- ✅ Multi-word term focus (translations often need phrases, not single words)
- ✅ Domain-specific (Weirdness algorithm useful for specialized translations)
Weaknesses:#
- ❌ English/Italian only (no CJK support)
- ❌ Monolingual (no automatic source-target pairing)
- ❌ Requires export to CAT tool (not integrated)
Best For:#
- English → X translations (extract English source terms)
- Technical domain translations (medical, legal, engineering)
- Initial glossary creation (one-time extraction)
KeyBERT in Translation#
Strengths:#
- ✅ Multilingual (50-109 languages) including CJK
- ✅ Works for low-resource languages (no corpora needed)
- ✅ Simple API (easy to integrate into custom workflows)
Weaknesses:#
- ❌ Keywords, not terminology (may extract non-technical words)
- ❌ Character-level CJK (may miss proper Chinese word boundaries)
- ❌ Requires filtering (more false positives than pyate)
Best For:#
- CJK language pairs (Chinese, Japanese, Korean)
- Multilingual projects (single tool for many languages)
- Content tagging (route documents to domain-specific translators)
Real-World Translator Feedback#
Positive: Per translator testimonial:
“My translation and localization life is much easier today thanks to this tool [KeyBERT]… a valuable tool for any translator or linguist.”
Caveat: KeyBERT extracts keywords, not terms. Translators should filter and validate output.
Recommended Workflow#
For English/Italian → X Translation:#
- Extract source terms: pyate with combo_basic
- Review extracted terms (precision ~70-80%, some filtering needed)
- Export to CSV/TBX format
- Import to CAT tool glossary
- Translate terms in context (CAT tool termbase feature)
- Maintain glossary (add missed terms during translation)
Effort: ~30-60 min for 10,000-word document (vs 2-4 hours manual)
For CJK → X or X → CJK Translation:#
- Extract source terms: KeyBERT with multilingual model
- Filter false positives (keywords vs terminology)
- Validate CJK terms (check word boundaries, technical accuracy)
- Export to CAT tool
- Human-in-loop review (CJK extraction ~60-70% precision, needs validation)
Effort: ~60-90 min for 10,000-word document (CJK requires more validation)
For Bilingual Terminology:#
- Option A: Use CAT tool built-in (SynchroTerm, XTM) if available
- Option B: Extract monolingual (pyate/KeyBERT) + manual alignment
- Recommended: Option A for speed, Option B for algorithm quality
Integration with CAT Tools#
Export Formats:#
- CSV: Simple, universal (all CAT tools support)
- TBX (TermBase eXchange): Standard for terminology (SDL Trados, MemoQ)
- Excel: Bilingual glossaries (source | target | domain | notes)
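For TBX export, a minimal skeleton can be generated with the standard library. This is a sketch of the martif structure only, not a validated TBX-Basic file; CAT tools often require a specific dialect, so check the output against your tool's importer:

```python
import xml.etree.ElementTree as ET

def terms_to_tbx(pairs, src_lang="en", tgt_lang="it"):
    """Build a minimal TBX-style (martif) XML tree from (source, target)
    term pairs. Sketch only: real importers (SDL Trados, MemoQ) may
    require a stricter dialect such as TBX-Basic.
    """
    martif = ET.Element("martif", {"type": "TBX", "xml:lang": src_lang})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for source, target in pairs:
        entry = ET.SubElement(body, "termEntry")
        for lang, term in ((src_lang, source), (tgt_lang, target)):
            lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = term
    return martif

tree = terms_to_tbx([("neural network", "rete neurale")])
xml_text = ET.tostring(tree, encoding="unicode")
```

CSV remains the safer default when in doubt; TBX pays off when the glossary carries metadata (domain, definitions, usage notes) that CSV flattens.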
Sample CSV Export (pyate/KeyBERT → CAT):#
```python
import pandas as pd
from pyate import combo_basic

text = "Your source document..."
terms = combo_basic(text).sort_values(ascending=False).head(100)

# Export to CSV for CAT tool import
df = pd.DataFrame({
    'Source Term': terms.index,
    'Termhood Score': terms.values,
    'Target Term': '',  # Fill manually or via MT
    'Domain': 'Technical',
    'Notes': ''
})
df.to_csv('glossary_for_cat.csv', index=False)
```

Value Proposition#
When Automated Extraction Justifies Effort:
- ✅ Large documents (>5,000 words) with technical terminology
- ✅ Recurring projects (same domain, build glossary once, reuse)
- ✅ Multiple translators (shared glossary ensures consistency)
- ✅ Tight deadlines (60-80% time savings on term prep)
When Manual Curation is Better:
- ❌ Small documents (<1,000 words) - extraction overhead > manual effort
- ❌ General content (few technical terms to extract)
- ❌ One-off projects (no glossary reuse value)
- ❌ High precision required (extraction ~70-80% precision, manual ~95%+)
Bottom Line for Translators#
pyate: Best for English/Italian technical translations. High precision, multi-word terms, domain-specific. Export to CAT tool via CSV/TBX.
KeyBERT: Best for CJK language pairs. Only viable automated option for Chinese/Japanese/Korean. Requires validation (keywords vs terminology, character-level output).
Recommendation: Use automated extraction for initial glossary creation (60-80% time savings), then human review and maintenance (precision improvement from 70-80% to 95%+). Automation handles volume, humans ensure quality.
S4: Strategic
S4 Strategic: Long-Term Recommendations#
Executive Summary#
Strategic analysis for 2-5 year horizon (2026-2031):
- pyate: Stable but niche (limited language support may constrain adoption)
- KeyBERT: Strong growth trajectory (BERT ecosystem expanding, multilingual momentum)
Strategic Recommendation: Hedge with both - pyate for current English/Italian precision, KeyBERT for future multilingual/CJK expansion.
Long-Term Viability Assessment#
pyate: Moderate-High Viability#
Organizational Backing:
- ✅ Individual developer (kevinlu1248), but listed in spaCy Universe (semi-official ecosystem)
- ⚠️ No large org backing (vs KeyBERT: broader community)
Development Status:
- ✅ Active (spaCy v3 support, 2023 releases)
- ✅ ~320 GitHub stars (modest but growing)
- ⚠️ Single maintainer risk (bus factor = 1)
Technology Trajectory:
- ⚠️ Statistical methods (C-Value, Combo Basic) are mature (little innovation expected)
- ❌ Language expansion blocked by corpus availability (CJK unlikely in 5-year horizon)
- ✅ spaCy ecosystem stable (mature NLP library, unlikely to change drastically)
5-Year Outlook: Good for English/Italian (will remain best-in-class for terminology extraction), but limited language expansion (corpus bottleneck). May remain niche tool.
Risk: If maintainer abandons, community may not sustain (small user base). Mitigation: Code is simple (can be forked/maintained internally if needed).
KeyBERT: High Viability#
Organizational Backing:
- ✅ Community-driven (broader contributor base than pyate)
- ✅ 3.5K+ GitHub stars (large user community)
- ✅ Part of sentence-transformers ecosystem (broader than single project)
Development Status:
- ✅ Very active (frequent releases, 2023-2024+)
- ✅ Multiple contributors (lower bus factor risk)
- ✅ Strong documentation, FAQ, guides
Technology Trajectory:
- ✅ BERT/transformer ecosystem growing (new models, better multilingual support)
- ✅ CJK tokenization improving (new Chinese BERT models with word-level tokenization emerging)
- ✅ sentence-transformers momentum (industry-standard for embeddings)
5-Year Outlook: Excellent (transformer ecosystem expanding, multilingual momentum). Likely to improve CJK quality as models evolve. Strong long-term bet.
Risk: Dependency on sentence-transformers ecosystem (if BERT falls out of favor). Mitigation: BERT remains dominant for embeddings (low risk of displacement 2026-2031).
Technology Evolution: Key Trends#
Trend 1: Transformer Dominance (Accelerating)#
Current: BERT-based models dominate embeddings
2026-2031: Expect continued transformer dominance (GPT, BERT, T5 families)
Impact on KeyBERT:
- ✅ Positive: New multilingual models will improve CJK quality
- ✅ Backward compatible: sentence-transformers supports new models (easy upgrade path)
Impact on pyate:
- ⚠️ Neutral-Negative: Statistical methods may seem “old-school” as transformers advance
- ❌ Risk: New entrants may build transformer-based terminology extractors (compete with pyate)
Trend 2: Multilingual NLP Expansion (Accelerating)#
Current: Focus on English, then major European languages
2026-2031: Expect greater CJK/low-resource language support (driven by global NLP demand)
Impact on KeyBERT:
- ✅ Major positive: Multilingual BERT models improving (better CJK word tokenization coming)
- ✅ Example: XLM-RoBERTa, mBERT improvements, Chinese-specific BERT variants
Impact on pyate:
- ❌ Negative: Corpus bottleneck for CJK remains (unlikely to be solved)
- ⚠️ Risk: Multilingual demand may favor KeyBERT-like approaches over statistical methods
Trend 3: CAT Tool Integration (Slow Evolution)#
Current: CAT tools have basic built-in extraction, resist external libraries
2026-2031: Slow adoption of advanced Python libraries (CAT vendors prefer proprietary)
Impact on Both:
- ⚠️ Neutral: Neither pyate nor KeyBERT likely to integrate directly into CAT tools
- ⚠️ Workflow remains: External extraction → export → CAT import (no change expected)
- ✅ Opportunity: API/microservice deployment patterns may enable integration
Trend 4: LLM-Based Extraction (Emerging Risk)#
Current: Few LLM-based terminology extractors (GPT-4 can extract, but expensive)
2026-2031: Potential disruption from LLM-based extraction (ChatGPT, Claude, Gemini)
LLM Approach:
- Prompt engineering: “Extract technical terms from this document”
- Zero-shot (no training data needed)
- Multilingual out-of-box (LLMs handle 100+ languages)
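The zero-shot approach above can be sketched as a prompt builder. The prompt wording and the call_llm stub are illustrative assumptions, not a tested recipe for any particular provider:

```python
def build_term_extraction_prompt(document, language="English"):
    """Zero-shot prompt sketch for LLM-based terminology extraction.
    Wording is illustrative; tune it per model and domain.
    """
    return (
        f"Extract the domain-specific technical terms from the {language} "
        "document below. Return one term per line, multi-word terms "
        "allowed, no explanations.\n\n"
        f"Document:\n{document}"
    )

def call_llm(prompt):
    # Hypothetical stub: wire this to whatever LLM API client you use.
    raise NotImplementedError("connect to your LLM provider here")

prompt = build_term_extraction_prompt(
    "Backpropagation updates weights via gradient descent."
)
```

Unlike pyate/KeyBERT, the per-document API cost and latency scale linearly with corpus size, which is why the cost comparison below matters at volume.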
Trade-offs vs pyate/KeyBERT:
- ✅ LLM: Better semantic understanding, no training/models needed
- ❌ LLM: Expensive ($0.01-0.10 per document), API dependency, slower
- ✅ pyate/KeyBERT: Cheap (run locally), fast, no API calls
Strategic Impact:
- pyate: May lose English terminology extraction to LLMs (if cost drops)
- KeyBERT: May lose keyword extraction to LLMs (semantic understanding advantage erodes)
- Survival: pyate/KeyBERT remain viable for high-volume, low-cost extraction (LLMs too expensive at scale)
Recommendation: Monitor LLM pricing (if drops below $0.001/document, pyate/KeyBERT value proposition weakens).
Community Health Assessment#
pyate Community#
GitHub Activity:
- ~320 stars, ~15 forks (modest)
- Recent commits: 2023 (active)
- Issues: ~20 open (responsive maintainer)
Community Size: Small (niche tool, limited adoption)
Sustainability: Moderate (single maintainer, but code is simple enough to fork)
5-Year Confidence: 70% (will likely remain maintained, but language expansion uncertain)
KeyBERT Community#
GitHub Activity:
- ~3.5K stars, ~500 forks (large)
- Recent commits: Very active (2023-2024+)
- Issues: ~50 open, quickly resolved
Community Size: Large (widely adopted, strong ecosystem)
Sustainability: High (multiple contributors, embedded in sentence-transformers ecosystem)
5-Year Confidence: 90% (strong trajectory, unlikely to be abandoned)
Strategic Recommendations#
For Organizations: Hedge Strategy#
Recommendation: Adopt both libraries based on use case, prepare for technology shifts.
Near-Term (2026-2027):
- pyate for English/Italian terminology extraction (current best-in-class)
- KeyBERT for CJK and multilingual projects (only viable option)
Mid-Term (2028-2029):
- Monitor LLM-based extraction (may disrupt if pricing drops)
- Re-evaluate pyate if maintenance slows (consider forking or migrating to alternatives)
- Upgrade KeyBERT models as CJK tokenization improves (expect better Chinese quality)
Long-Term (2030-2031):
- Consider LLM-based extraction if cost/performance competitive
- Maintain pyate fork internally if no longer maintained (code is simple)
- Expect KeyBERT ecosystem to mature (likely remains viable)
For Developers: Platform Choices#
If building translation/writing tools:
- Start with KeyBERT (multilingual support, future-proof)
- Add pyate if English/Italian precision critical
- Abstract interface (swap libraries as technology evolves)
Example Architecture:
```python
class TerminologyExtractor:
    """Thin wrapper that routes to the best backend per language.

    PyateExtractor and KeyBERTExtractor are assumed wrapper classes
    around pyate and KeyBERT respectively (not shown here).
    """

    def __init__(self, language):
        if language in ["en", "it"]:
            self.backend = PyateExtractor()    # High precision terminology
        else:
            self.backend = KeyBERTExtractor()  # Multilingual keywords

    def extract(self, text):
        return self.backend.extract(text)
```

Value: Decouple from specific library (easy to swap as LLM/new tools emerge).
For Translators/Writers: Practical Path#
Immediate (2026):
- Use CAT tool built-in extraction if available (convenience)
- Use pyate (English) or KeyBERT (CJK) for initial glossary creation if CAT insufficient
- Plan for human validation (60-80% precision, manual review essential)
Future (2027-2029):
- Experiment with LLM-based extraction (ChatGPT, Claude) as pricing drops
- Compare quality: LLM vs pyate/KeyBERT (may prefer LLM if precision > cost)
- Maintain current workflow until LLMs competitive
For Researchers: Open Questions#
Research Gaps (opportunities for 2026-2031):
- Transformer-based terminology extraction: Combine BERT embeddings with linguistic features (better than pure statistical or pure semantic)
- CJK word boundary detection: Improve Chinese/Japanese tokenization for terminology (current weak point)
- Bilingual terminology alignment: Automated source-target term pairing (currently manual)
- LLM fine-tuning for terminology: Fine-tune GPT/Claude for domain-specific term extraction (vs generic)
Risks and Mitigation#
Risk 1: pyate Abandonment (Moderate Probability)#
Scenario: Maintainer stops development, library becomes stale
Probability: 30% (single maintainer, modest community)
Impact: High for English/Italian terminology extraction
Mitigation:
- Fork pyate internally (code is simple, <1,000 LOC)
- Monitor GitHub activity (6-month no-commit = warning sign)
- Prepare migration to alternatives (KeyBERT + filtering, LLMs)
Risk 2: BERT Displacement by Newer Architectures (Low Probability)#
Scenario: GPT-style models replace BERT for embeddings
Probability: 20% (BERT remains strong for embeddings 2026-2031)
Impact: Moderate (sentence-transformers can adapt to new models)
Mitigation:
- sentence-transformers supports multiple backends (not locked to BERT)
- KeyBERT can use alternative embeddings (Flair, GPT, etc.)
Risk 3: LLM-Based Extraction Disrupts Market (Moderate Probability)#
Scenario: ChatGPT/Claude pricing drops to $0.001/document, making LLM extraction cheaper than pyate/KeyBERT
Probability: 40% (LLM pricing declining rapidly)
Impact: High (both libraries lose value proposition)
Mitigation:
- Monitor LLM pricing trends (monthly evaluation)
- Test LLM extraction quality (may replace libraries if precision competitive)
- Maintain local extraction for high-volume use cases (LLM API latency > local inference)
Risk 4: CJK Quality Stagnates (Moderate Probability)#
Scenario: Character-level Chinese tokenization remains (no word-level BERT improvement)
Probability: 30% (CJK NLP advancing, but word boundaries hard problem)
Impact: Moderate (KeyBERT CJK quality ~60-70%, not improving)
Mitigation:
- Use chinese_keybert for Chinese-only (better word segmentation)
- Human validation workflow (accept 60-70% precision as baseline)
- Explore custom Chinese BERT models (fine-tune on domain data)
Bottom Line: Strategic Positioning#
5-Year Outlook:
- pyate: Stable for English/Italian (70% confidence), but niche and limited language expansion
- KeyBERT: Strong trajectory (90% confidence), expanding multilingual support, embedded in growing ecosystem
- LLMs: Emerging wildcard (40% probability of disruption by 2029-2031)
Recommended Strategy:
- Near-term: Use pyate (English) + KeyBERT (CJK) based on use case
- Mid-term: Monitor LLM extraction quality and pricing (prepare to pivot)
- Long-term: Expect transformer ecosystem to dominate, pyate to remain niche, LLMs to compete for high-value extraction
Hedge: Abstract terminology extraction interface (swap backends as technology evolves). Don’t lock into single library.