1.173 Terminology Extraction#
Automatically finding domain-specific technical terms for translation glossaries, technical writing, and domain NLP - pyate (English/Italian) and KeyBERT (multilingual/CJK)
Explainer
Terminology Extraction Libraries: Domain Explainer#
Purpose: Help technical decision makers understand Python terminology extraction libraries and choose the right tool for translation, technical writing, and domain-specific NLP projects.
Audience: Product managers, technical leads, translators, technical writers without deep NLP expertise.
What This Solves#
The Problem#
Imagine you’re a translator working on a 50-page technical manual about machine learning. You need to:
- Identify all technical terms (“neural network,” “gradient descent,” “backpropagation”)
- Ensure consistent translation across the document
- Build a glossary for future projects
Manual approach: Read through 50 pages, manually highlight terms, copy to spreadsheet. Time: 3-4 hours of tedious work.
Automated approach: Run terminology extraction library, get list of 200 candidate terms in 30 seconds. Review time: 30-60 minutes to validate and filter.
This is the terminology extraction problem: Automatically finding domain-specific technical terms in large documents.
Who Encounters This#
- Translators: Building glossaries for technical translations (medical, legal, engineering)
- Technical Writers: Maintaining terminology consistency across documentation
- Domain NLP: Extracting concepts for knowledge graphs, ontologies, search systems
- Localization Teams: Scaling terminology management across languages and projects
Why It Matters#
Business Impact:
- Translation: 60-80% time savings on glossary creation (3-4 hours → 30-60 min per 10K words)
- Technical Writing: Automated glossary generation ensures terminology consistency
- Domain NLP: Foundation for building domain-specific models (medical, legal, etc.)
Technical Impact:
- Identify multi-word technical terms (“natural language processing” not just “language”)
- Domain-specific extraction (medical terms from medical texts, legal from legal)
- Multilingual support (Chinese, Japanese, Korean technical terminology)
Accessible Analogies#
What Is Terminology Extraction?#
Analogy: Finding Specialized Vocabulary in a Foreign Language Textbook
Imagine learning Chinese from a computer science textbook. You need to identify:
- Technical terms: “机器学习” (machine learning), “神经网络” (neural network)
- General words: “学习” (learning), “网络” (network)
A keyword extractor would find both (anything important-looking). A terminology extractor would focus on the technical terms (domain-specific).
Terminology extraction is like having a smart assistant that knows the difference between “learning” (general word) and “machine learning” (technical term).
Why Libraries Matter#
Without libraries: You write custom code to find technical terms
- Implement statistical algorithms (C-Value, TF-IDF)
- Handle POS tagging, linguistic rules
- Build language-specific models
Time: Weeks to months of development
With libraries: pip install pyate or pip install keybert
- Pre-built algorithms (battle-tested)
- Multi-language support
- 3 lines of code to extract terms
Time: Minutes to hours
When You Need This#
Clear Decision Criteria#
You NEED terminology extraction if:
- ✅ Large documents with technical content (>5,000 words)
- ✅ Recurring projects in same domain (build glossary once, reuse)
- ✅ Multiple translators/writers (consistency matters)
- ✅ Tight deadlines (60-80% time savings on term prep)
You DON’T need this if:
- ❌ Small documents (<1,000 words) - manual extraction faster
- ❌ General content (few technical terms)
- ❌ One-off projects (no glossary reuse value)
Concrete Use Case Examples#
Translation Glossary Creation:
- Problem: Translator receives 50-page medical device manual. Manual term identification takes 3-4 hours.
- Solution: Run pyate (English) or KeyBERT (Chinese/Japanese/Korean), get 200 candidate terms in 30 seconds, review in 30-60 minutes.
- ROI: 60-80% time savings (3-4 hours → 30-60 min)
Technical Documentation Consistency:
- Problem: 10 technical writers producing 500 pages of documentation. Inconsistent terminology (“ML model” vs “machine learning model” vs “ML algorithm”).
- Solution: Extract terminology from all 500 pages, create shared glossary, enforce consistency.
- ROI: Improved documentation quality, reduced customer confusion
Multilingual Product Documentation (CJK):
- Problem: Product docs in English, Chinese, Japanese, Korean. Need consistent terminology across languages.
- Solution: KeyBERT with multilingual model extracts terms from all languages using single tool.
- ROI: Consistency across languages, simplified workflow
Trade-offs#
What You’re Choosing Between#
1. pyate vs KeyBERT: Terminology vs Keywords#
pyate (Terminology Extraction):
- Pros: High precision (70-80%) for technical terms, multi-word focus (“gradient descent optimization”), domain-specific
- Cons: English and Italian only (no Chinese/Japanese/Korean), requires spaCy dependency
When: English/Italian technical documentation, translation glossaries, domain-specific NLP
KeyBERT (Keyword Extraction):
- Pros: Multilingual (50-109 languages including CJK), semantic understanding, simple API
- Cons: Extracts keywords (not pure terminology), CJK character-level tokenization, slower (BERT inference)
When: Chinese/Japanese/Korean content, multilingual projects, semantic keywords (not just technical terms)
Key Difference: pyate finds technical terms, KeyBERT finds semantically important words. Overlap exists, but goals differ.
2. Automated Extraction vs Manual Curation#
Automated Extraction:
- Pros: 60-80% time savings, handles large volumes (1000s of documents)
- Cons: 60-80% precision (20-40% false positives), requires validation
When: Large documents, recurring projects, tight deadlines
Manual Curation:
- Pros: 95%+ precision, full control
- Cons: Time-consuming (3-4 hours per 10K words)
When: Small documents, one-off projects, ultra-high precision required
Recommended: Hybrid - automated extraction for volume, human validation for precision.
3. Integrated CAT Tools vs Python Libraries#
CAT Tool Built-in:
- Pros: Integrated workflow (no export/import), convenient
- Cons: Less sophisticated algorithms, vendor lock-in
When: Existing CAT tool user, convenience > precision
Python Libraries (pyate/KeyBERT):
- Pros: State-of-art algorithms, customizable, free/open-source
- Cons: Requires Python skills, manual export to CAT tool
When: Need best precision, willing to invest setup time, tech-savvy team
Cost Considerations#
Why Cost Matters Here#
Unlike commercial terminology tools ($500-5,000/year per seat), Python libraries are free. The cost is time and expertise.
Pricing Models#
Python Libraries (Free):
- Software Cost: $0 (open-source)
- Setup Cost: 1-4 hours (installation, learning)
- Ongoing Cost: 10-20 hours/year (maintenance, updates)
Commercial Tools (Sketch Engine, AntConc alternatives):
- Software Cost: $500-5,000/year per seat
- Setup Cost: Included (vendor support)
- Ongoing Cost: Vendor handles updates
ROI Analysis#
Translation Glossary Creation (10K-word document):
- Manual: 3-4 hours × $50/hour = $150-200 per document
- Automated (pyate/KeyBERT): 30-60 min × $50/hour = $25-50 per document
- Savings: $100-150 per document (60-75% reduction)
Payback: If processing >10 documents/year, automated extraction pays back the setup time within the first month.
Technical Documentation (500 pages):
- Manual: 20-30 hours × $75/hour = $1,500-2,250
- Automated: 2-4 hours (extraction + review) × $75/hour = $150-300
- Savings: $1,200-2,000 (85-90% reduction)
Implementation Reality#
Realistic Timeline Expectations#
Prototype (1 week):
- Install pyate or KeyBERT
- Run on sample documents (10-20 pages)
- Validate output quality
- Team: 1 developer or technical translator
Production MVP (2-4 weeks):
- Set up batch processing pipeline
- Create validation workflow (human-in-loop)
- Export to glossary format (CSV, TBX)
- Train team on usage
- Team: 1 developer + 1 domain expert (translator/writer)
Optimized Production (2-3 months):
- Fine-tune for specific domain (if needed)
- Integrate with CAT tool or documentation system
- Automate glossary updates (CI/CD)
- Team: 1 developer + 1-2 domain experts
Team Skill Requirements#
Minimum (Using KeyBERT):
- Python: Basic (run scripts, install packages)
- NLP Knowledge: None (library handles complexity)
- Domain Expertise: High (validate extracted terms)
- Training Time: 1-2 days
Typical (Using pyate + spaCy):
- Python: Moderate (pipelines, batch processing)
- NLP Knowledge: Basic (understand POS tagging)
- Domain Expertise: High
- Training Time: 1-2 weeks
Common Pitfalls#
Pitfall 1: “Automated extraction replaces human review”
- Reality: Extraction is 60-80% precise. Human validation essential.
- Fix: Budget time for review (30-60 min per 10K words)
Pitfall 2: “CJK support means perfect Chinese/Japanese/Korean”
- Reality: KeyBERT uses character-level tokenization (may miss word boundaries)
- Fix: Use chinese_keybert for Chinese-only, plan for extra validation
Pitfall 3: “Keywords = Terminology”
- Reality: KeyBERT extracts semantically important words, not always technical terms
- Fix: Use pyate for pure terminology (if language supported), KeyBERT + filtering otherwise
Pitfall 4: “One library solves everything”
- Reality: pyate best for English/Italian terminology, KeyBERT best for CJK
- Fix: Use both (per-language optimization) or abstract interface (swap backends)
Key Takeaways for Decision Makers#
Top 3 Decisions to Make#
Decision 1: Terminology Extraction vs Keyword Extraction
- Rule: Need technical terms (glossaries, translation)? → pyate (if English/Italian) or KeyBERT + filtering (if CJK)
- Rule: Need semantic keywords (content tagging)? → KeyBERT
Decision 2: Language Support
- Rule: English or Italian only? → pyate (highest precision)
- Rule: Chinese, Japanese, Korean, or multilingual? → KeyBERT (only viable option)
Decision 3: Integration Approach
- Rule: CAT tool built-in available? → Use it (convenience)
- Rule: Need best precision or CJK support? → Python libraries (setup effort justified)
Budget Guidance#
Setup (One-Time):
- Developer time: 1-4 weeks × $5K/week = $5K-20K
- Infrastructure: $0 (runs on standard hardware)
- Total: $5K-20K
Ongoing (Per Year):
- Maintenance: 10-20 hours × $100/hour = $1K-2K
- Total: $1K-2K/year
ROI:
- Translation: $100-150 savings per 10K-word document
- Technical Docs: $1,200-2,000 savings per 500-page manual
- Payback: Typically 1-3 months for active translation/writing teams
Questions to Ask Vendors/Consultants#
Technical Questions:
- “Which library do you recommend: pyate or KeyBERT? Why?” (Tests understanding of terminology vs keyword trade-off)
- “How does it handle Chinese/Japanese/Korean?” (Tests CJK knowledge)
- “What’s the expected precision for our domain?” (Tests realistic expectations - should be 60-80%, not 95%)
Business Questions:
- “What’s the time savings vs manual extraction?” (Should quote 60-80%, not 90-95%)
- “How much human validation is needed?” (Should acknowledge 20-40% false positives)
- “Can it integrate with our CAT tool?” (Most likely no - manual export/import)
Red Flags:
- ❌ Claims 90-95% precision without human review (unrealistic)
- ❌ Recommends same library for all languages (no understanding of pyate/KeyBERT trade-offs)
- ❌ Doesn’t mention CJK challenges (character-level tokenization)
Green Flags:
- ✅ Recommends pyate for English, KeyBERT for CJK (shows nuanced understanding)
- ✅ Acknowledges 60-80% precision, plans for human validation
- ✅ Discusses integration challenges (export/import to CAT tool)
Glossary#
Terminology Extraction: Automatically finding domain-specific technical terms (multi-word expressions like “machine learning model”)
Keyword Extraction: Finding semantically important words in a document (may include general words if important)
CJK: Chinese, Japanese, Korean languages (share some NLP challenges like lack of spaces between words)
CAT Tool: Computer-Assisted Translation tool (SDL Trados, MemoQ, Smartcat) - software translators use
Glossary/Termbase: Database of technical terms and their translations
pyate: Python library for terminology extraction (C-Value, Combo Basic algorithms). Best for English/Italian.
KeyBERT: Python library for keyword extraction using BERT embeddings. Best for multilingual/CJK.
spaCy: Industrial-strength NLP library (POS tagging, parsing). Required by pyate.
BERT: Transformer-based language model. Provides semantic understanding for KeyBERT.
Precision: How many extracted terms are actually technical terms (70-80% typical)
Recall: How many actual technical terms were found (harder to measure, less critical)
Further Reading#
Non-Technical:
- “Nine Terminology Extraction Tools for Translators” (LinguaGreca) - Practical translator perspective
- “Translation Glossary Creation Guide” (Smartcat) - Workflow and best practices
Technical:
- pyate documentation: https://kevinlu1248.github.io/pyate/
- KeyBERT documentation: https://maartengr.github.io/KeyBERT/
- Astrakhantsev 2016 (ATR4S benchmark) - Academic validation of algorithms
Community:
- spaCy Universe: https://spacy.io/universe (pyate and ecosystem)
- sentence-transformers: https://www.sbert.net/ (KeyBERT backend)
Tools:
- CAT Tools: SDL Trados, MemoQ, Smartcat (commercial translation tools)
- Alternative Python: YAKE, RAKE-NLTK (simpler keyword extraction)
S1: Rapid Discovery
S1 Rapid Discovery: Approach#
Goal: Identify the main Python libraries for automatic term extraction and gain initial understanding of their capabilities.
Research Method#
- Web search for libraries mentioned in research brief (pyate, topia.termextract, spaCy, KeyBERT)
- Expand search to discover additional widely-used libraries (YAKE, RAKE, textacy)
- Verify pip installability (LIBRARY requirement, NOT GUI tools)
- Quick categorization by approach (statistical, linguistic, transformer-based)
Libraries Identified#
Core Libraries (from spec):#
- pyate: spaCy-based implementation of multiple statistical algorithms (C-Value, Basic, Combo Basic, Weirdness)
- topia.termextract: Lightweight POS-based term extraction (legacy, 2009)
- spaCy components: Built-in NLP pipeline components + ecosystem extensions
- KeyBERT: BERT embedding-based keyword/keyphrase extraction
Additional Discoveries:#
- YAKE: Unsupervised statistical method, no training required
- RAKE-NLTK: Statistical co-occurrence analysis, domain-independent
- textacy: Higher-level spaCy wrapper with term extraction features
Initial Categorization#
Statistical Approaches (No ML Training):#
- YAKE (text statistics)
- RAKE-NLTK (word co-occurrence)
- topia.termextract (POS + statistics)
Linguistic + Statistical:#
- pyate (POS tagging + multiple algorithms)
- textacy (spaCy-based, TextRank)
Transformer-Based:#
- KeyBERT (BERT embeddings + cosine similarity)
Focus for S1#
Focus on the four specified libraries (pyate, topia.termextract, spaCy, KeyBERT) while noting alternatives (YAKE, RAKE) for completeness.
KeyBERT#
Quick Summary#
Minimal keyword/keyphrase extraction using BERT embeddings. Finds terms most semantically similar to the document.
Installation#
```
pip install keybert
# With optional backends:
pip install keybert[flair]   # For Flair embeddings
pip install keybert[spacy]   # For spaCy integration
pip install keybert[gensim]  # For Word2Vec/Doc2Vec
pip install keybert[use]     # For Universal Sentence Encoder
```
Key Features#
- BERT-based: Uses transformer embeddings (captures meaning, not just counts)
- Multilingual: Works with many languages (via multilingual BERT models)
- Minimal API: Design goal was “pip install + 3 lines of code”
- Multiple backends: Support for sentence-transformers, Flair, spaCy, Gensim
- Semantic similarity: Uses cosine similarity to find terms matching document meaning
How It Works#
- Extract document embedding with BERT
- Extract word/phrase embeddings for n-grams
- Calculate cosine similarity between document and candidates
- Return top-k most similar terms
Use Cases#
- Semantic keyword extraction: Beyond simple frequency (meaning-based)
- Multilingual content: Works across many languages
- Modern NLP pipelines: Integrates with transformer-based workflows
- Content tagging: Automatic metadata generation
Resources#
- GitHub: MaartenGr/KeyBERT
- PyPI: keybert
- Docs: maartengr.github.io/KeyBERT
Initial Assessment#
Pros:
- Modern (transformer-based)
- Excellent multilingual support
- Simple API (easy to use)
- Semantic understanding (not just statistics)
- Active development (2023+)
Cons:
- Requires transformer models (larger memory footprint)
- Slower than statistical methods (BERT inference)
- May extract keywords, not necessarily technical terms (different goals)
Recommended for: Projects prioritizing semantic understanding over pure statistical term extraction. Best for content with semantic meaning (articles, documents) rather than highly technical terminology.
Note on Terminology vs Keywords#
KeyBERT extracts keywords (semantically important words) which overlaps with but differs from terminology (domain-specific technical terms). For pure terminology extraction, pyate may be more appropriate. KeyBERT shines when you need meaning-based extraction.
pyate (PYthon Automated Term Extraction)#
Quick Summary#
Python implementation of multiple term extraction algorithms using spaCy POS tagging. Supports spaCy v3.
Installation#
```
pip3 install pyate
```
Dependencies: numpy, pandas, spacy, pyahocorasick
Key Features#
- Multiple algorithms: C-Value, Basic, Combo Basic, Weirdness, Term Extractor
- spaCy integration via `add_pipe` method
- Returns “termhood” scores indicating confidence
- Works with spaCy v3 (for v2, use pyate==0.4.3)
How It Works#
- Uses spaCy POS tagging to identify term candidates
- Applies statistical algorithms to score candidates
- Returns ranked list of terms with confidence scores
Use Cases#
- Technical documentation term extraction
- Domain-specific terminology identification
- Translation memory creation
- Glossary generation
Resources#
- GitHub: kevinlu1248/pyate
- PyPI: pyate
- Docs: https://kevinlu1248.github.io/pyate/
- Demo: https://pyate-demo.herokuapp.com/
Initial Assessment#
Pros:
- Modern (spaCy v3 support)
- Multiple algorithms in one package
- Active development
- Good documentation
Cons:
- Requires spaCy (heavier dependency)
- Limited to languages spaCy supports
Recommended for: Technical writing, domain-specific NLP projects requiring modern spaCy integration
RAKE-NLTK (Rapid Automatic Keyword Extraction)#
Quick Summary#
Domain-independent keyword extraction using word co-occurrence and frequency analysis. NLTK-based Python implementation.
Installation#
```
pip install rake-nltk
```
Dependencies: NLTK
Key Features#
- Domain-independent: Works without domain-specific training
- Co-occurrence analysis: Identifies key phrases by analyzing word co-occurrences
- Stopword filtering: Uses stopwords as phrase delimiters
- Frequency-based: Combines word frequency and co-occurrence scores
How It Works#
- Use stopwords to split text into candidate phrases
- Calculate word scores based on frequency and co-occurrence
- Compute phrase scores (sum of word scores)
- Rank phrases by score
- Return top-k key phrases
Use Cases#
- General-purpose keyword extraction
- Document summarization
- Content tagging
- Search engine optimization (SEO)
Resources#
- PyPI: rake-nltk
- GitHub: Multiple implementations available
Initial Assessment#
Pros:
- Simple, well-understood algorithm
- Fast (no ML inference)
- Domain-independent
- Works on single documents (no corpus needed)
Cons:
- Statistical only (no semantic understanding)
- Quality depends on stopword list
- May extract common phrases, not technical terms
- Less sophisticated than modern methods
Recommended for: Quick keyword extraction when you need speed over precision. For technical terminology extraction, pyate or KeyBERT are likely better choices.
Note: Keyword vs Terminology Extraction#
RAKE extracts keywords/key phrases based on statistical prominence. This differs from terminology extraction which targets domain-specific technical terms. RAKE may miss low-frequency but important technical terms.
S1 Rapid Discovery: Initial Recommendations#
Executive Summary#
Seven pip-installable libraries identified for terminology extraction in Python. Clear split between statistical/linguistic approaches (RAKE, YAKE, pyate) and semantic/transformer approaches (KeyBERT).
Landscape Overview#
By Approach:#
| Library | Type | Best For | Status |
|---|---|---|---|
| pyate | Linguistic + Statistical | Technical terminology | ✓ Active (2023+) |
| KeyBERT | Transformer (BERT) | Semantic keywords | ✓ Active (2023+) |
| YAKE | Statistical | Quick extraction | ✓ Active |
| RAKE-NLTK | Statistical | General keywords | ✓ Maintained |
| textacy | spaCy extension | Broader NLP + terms | ✓ Active |
| topia.termextract | Legacy POS-based | Legacy projects | ⚠️ Abandoned (2009) |
| spaCy | NLP infrastructure | Framework/integration | ✓ Active |
By Use Case:#
Technical terminology extraction (translation, tech writing): → pyate (multiple algorithms, spaCy-based, terminology-focused)
Semantic keyword extraction (content tagging, semantic search): → KeyBERT (BERT embeddings, multilingual, meaning-based)
Quick/lightweight extraction (minimal dependencies): → YAKE (no training, fast, lightweight)
Legacy projects (existing topia.termextract usage): → Migrate to pyate (modern equivalent)
Top 2 Recommendations for Further Research#
1. pyate (Primary for Terminology)#
Why: Specifically designed for terminology extraction (not just keywords). Implements multiple proven algorithms (C-Value, Combo Basic, Weirdness). Modern (spaCy v3), actively maintained, good documentation.
Trade-offs:
- ✓ Purpose-built for technical terms
- ✓ Multiple algorithms available
- ✗ Requires spaCy (heavier dependency)
- ✗ Limited to spaCy-supported languages
2. KeyBERT (Primary for Semantic Keywords)#
Why: Modern transformer-based approach. Excels at semantic keyword extraction (finding meaningful terms). Excellent multilingual support. Simple API.
Trade-offs:
- ✓ Semantic understanding (not just statistics)
- ✓ Multilingual (70+ languages via BERT)
- ✗ Heavier (transformer models)
- ✗ Keywords ≠ terminology (different goals)
Key Insight: Terminology vs Keywords#
Critical distinction:
- Terminology extraction: Domain-specific technical terms (e.g., “natural language processing”, “entity recognition”)
- Keyword extraction: Semantically important words (e.g., “important”, “key findings”, “main result”)
pyate targets terminology. KeyBERT targets keywords. Use case determines which is appropriate.
Excluded from Deep Dive#
- topia.termextract: Abandoned (2009), superseded by pyate
- RAKE-NLTK: General keyword extraction (not terminology-focused)
- YAKE: Good for general keywords, but less sophisticated for technical terms
- textacy: Broader toolkit (term extraction is one feature among many)
- spaCy: Infrastructure, not a term extraction library per se
Next Steps (S2: Comprehensive)#
For pyate:
- Compare algorithms (C-Value vs Combo Basic vs Weirdness)
- Benchmark on technical documents
- Evaluate multilingual support (via spaCy models)
- Installation complexity and dependencies
For KeyBERT:
- Test on technical vs general content
- Evaluate embedding model choices (sentence-transformers vs others)
- Multilingual performance (CJK support per research label)
- Memory footprint and inference speed
For both:
- Compare output quality on same corpus
- Evaluate integration with translation/NLP pipelines
- TCO analysis (dependencies, model sizes)
- Community and long-term viability
spaCy Terminology Components#
Quick Summary#
spaCy itself doesn’t have built-in “terminology extraction” but provides the NLP pipeline infrastructure. Term extraction happens via:
- Custom pipeline components
- Ecosystem extensions (spaCy Universe)
- Integration with other libraries (pyate, textacy)
Installation#
```
pip install spacy
python -m spacy download en_core_web_sm  # or other language models
```
Relevant Components#
Built-in Features:#
- POS tagging: Identify nouns, noun phrases
- Dependency parsing: Understand phrase structure
- Named Entity Recognition: Extract entities
- Noun phrase chunking: Extract multi-word terms
- Matchers: Pattern-based extraction (Matcher, PhraseMatcher)
Ecosystem Extensions (spaCy Universe):#
- pyate: Multiple term extraction algorithms (covered separately)
- sense2vec: Combines noun phrases with POS/entity labels
- textacy: Higher-level NLP tasks including term extraction
How to Use for Term Extraction#
Approach 1: Manual noun phrase extraction#
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Your text here")
terms = [chunk.text for chunk in doc.noun_chunks]
```
Approach 2: Add pyate to pipeline#
```python
import spacy
from pyate import combo_basic  # importing pyate registers its spaCy components

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp("Your text here")
terms = doc._.combo_basic  # pandas Series of term -> termhood score
```
Approach 3: Custom component#
Create custom pipeline component with domain-specific rules.
Use Cases#
- Framework for term extraction: Use built-in features + custom logic
- Integration point: Combine with specialized libraries (pyate, textacy)
- Multilingual support: 70+ language models available
Resources#
- Website: spacy.io
- spaCy Universe: spacy.io/universe (ecosystem directory)
- Linguistic Features: spacy.io/usage/linguistic-features
Initial Assessment#
Pros:
- Industry-standard NLP library
- Excellent multilingual support (70+ languages)
- Fast, production-ready
- Extensible architecture
- Large ecosystem
Cons:
- Not a term extraction library per se (requires extensions)
- Heavier dependency (language models are large)
- Requires understanding of NLP pipeline architecture
Recommended for: Projects needing robust NLP infrastructure with term extraction as one component. Use spaCy + pyate/textacy for best results, not spaCy alone.
textacy#
Quick Summary#
Higher-level NLP library built on spaCy, providing tools for tasks like keyphrase extraction, readability analysis, and more.
Installation#
```
pip install textacy
```
Dependencies: spaCy (and language models)
Key Features#
- Built on spaCy: Extends spaCy with higher-level tasks
- Keyphrase extraction: Via textacy.extract using TextRank algorithm
- Preprocessing tools: Text normalization, cleaning
- Multiple extraction methods: Named entities, n-grams, terms, etc.
- Corpus management: Tools for working with document collections
How It Works (for term extraction)#
Uses textacy.extract module:
- TextRank algorithm for keyphrase ranking
- Various extraction methods (n-grams, entities, terms)
- Statistical ranking of extracted phrases
Use Cases#
- Projects already using spaCy that need higher-level features
- Keyphrase extraction with TextRank
- Text preprocessing pipeline
- Corpus-level analysis
Resources#
- GitHub: chartbeat-labs/textacy
- Docs: textacy.readthedocs.io
- PyPI: textacy
Initial Assessment#
Pros:
- Convenient if already using spaCy
- Multiple extraction methods in one library
- Good documentation
- TextRank implementation
Cons:
- Requires spaCy (heavier dependency)
- Less focused than specialized tools (pyate, KeyBERT)
- May be overkill if you only need term extraction
Recommended for: Projects already using spaCy that want additional NLP features including term extraction. If starting fresh, consider pyate (also spaCy-based but focused on term extraction) or KeyBERT (semantic approach).
Comparison to pyate#
Both build on spaCy, but:
- textacy: General-purpose NLP toolkit with term extraction as one feature
- pyate: Focused specifically on terminology extraction with multiple algorithms
For pure terminology extraction, pyate is more specialized. For broader NLP tasks including term extraction, textacy provides more features.
topia.termextract#
Quick Summary#
Lightweight POS-based term extraction library originally from Zope project. Legacy but still functional.
Installation#
```
pip install topia.termextract
```
Key Features#
- Simple POS tagging algorithm (focuses on nouns)
- Statistical analysis for term strength
- Returns terms with occurrence count and strength metrics
- Configurable filter component
- Minimum occurrence threshold (default: 3 for single words)
How It Works#
- Simple POS tagging to identify nouns
- Statistical analysis of term frequency
- Filter for minimum occurrence threshold
- Return ranked terms with strength scores
Current Status#
- ⚠️ Last release: June 30, 2009 (version 1.1.0)
- ⚠️ Maintenance: Discontinued on official PyPI
- ✓ Fork available: turian/topia.termextract on GitHub (updated fork)
Use Cases#
- Lightweight keyword extraction
- Simple terminology extraction for English text
- Projects requiring minimal dependencies
Resources#
- PyPI: topia.termextract
- Fork: turian/topia.termextract
- Tutorial: TextProcessing.org guide
Initial Assessment#
Pros:
- Very lightweight
- Simple API
- Minimal dependencies
- Still works for basic use cases
Cons:
- Abandoned (no updates since 2009)
- Limited language support
- Simple POS tagger (less accurate than modern tools)
- No active development
Recommended for: Legacy projects, simple keyword extraction where modern dependencies are unwanted. NOT recommended for new projects (use pyate or KeyBERT instead).
YAKE (Yet Another Keyword Extractor)#
Quick Summary#
Lightweight unsupervised keyword extraction using statistical text features. No training required, works across domains and languages.
Installation#
```
pip install yake
```
Key Features#
- Unsupervised: No training data required
- Language-agnostic: Works across multiple languages
- Domain-independent: No domain-specific dictionaries needed
- Text size flexible: Works regardless of document length
- Statistical approach: Based on text statistical features (position, frequency, capitalization, etc.)
How It Works#
- Analyze statistical features: word position, frequency, case, context
- Compute scores for candidate keywords
- Rank keywords by relevance
- Return top-k keywords with scores
Use Cases#
- Quick keyword extraction without training
- Multilingual content processing
- Small to medium documents
- Domain-agnostic applications
Resources#
- GitHub: LIAAD/yake
- PyPI: yake
Initial Assessment#
Pros:
- No training required (truly unsupervised)
- Fast (statistical, no ML inference)
- Multilingual support
- Lightweight (minimal dependencies)
- Domain-independent
Cons:
- Statistical only (no semantic understanding)
- May not capture technical terminology nuances
- Less sophisticated than transformer-based methods
Recommended for: Quick keyword extraction without setup overhead. Good for general keyword extraction, but for technical terminology, pyate may be more appropriate.
Additional Note#
YAKE is popular in academic research for keyword extraction. It’s a solid baseline method that’s easy to deploy but may lack the sophistication needed for specialized terminology extraction in technical domains.
S2: Comprehensive
S2 Comprehensive Research: Approach#
Goal#
Deep technical analysis of pyate and KeyBERT to understand:
- Algorithm implementation details
- Performance characteristics and benchmarks
- Multilingual support (especially CJK - per research label)
- Integration patterns and dependencies
- Use case fit (terminology vs keyword extraction)
Research Method#
- Algorithm Analysis: Study implementations of C-Value, Combo Basic (pyate) vs BERT embeddings + cosine similarity (KeyBERT)
- Benchmark Review: Find published comparisons and performance data
- CJK Support: Evaluate Chinese, Japanese, Korean language capabilities
- Integration Patterns: Understand how each integrates with NLP pipelines (spaCy, sentence-transformers)
- Use Case Mapping: Clarify when to use terminology extraction vs keyword extraction
Key Questions to Answer#
For pyate:#
- How do C-Value, Combo Basic, and Weirdness algorithms compare?
- What are spaCy model dependencies for CJK languages?
- Does it have general domain corpora for Chinese, Japanese, Korean?
- What is the precision of different algorithms per Astrakhantsev 2016?
For KeyBERT:#
- Which sentence-transformers models support CJK best?
- How does BERT handle CJK tokenization (character-based vs word-based)?
- What is the trade-off between multilingual-BERT and language-specific models?
- Is there semantic understanding of technical terminology or just keywords?
For Both:#
- What is the fundamental difference between terminology and keyword extraction?
- Which is better for translation workflows?
- Which is better for technical writing glossary generation?
- What are the memory/compute requirements?
Sources#
- PyATE GitHub and documentation
- KeyBERT GitHub and documentation
- Astrakhantsev 2016 (ATR4S toolkit benchmark)
- Sentence-transformers model documentation
- spaCy language model documentation
- Research papers on terminology vs keyword extraction
Expected Outcome#
Clear technical comparison with recommendations for:
- Pure terminology extraction (technical terms, domain-specific concepts)
- Semantic keyword extraction (meaning-based content tagging)
- CJK language support (Chinese, Japanese, Korean capabilities)
- Integration patterns (when to use with spaCy, sentence-transformers)
CJK Language Support Analysis#
Relevance: Research bead has cjk label, indicating Chinese, Japanese, Korean support is a priority.
Summary Table#
| Library | Chinese | Japanese | Korean | Status | Notes |
|---|---|---|---|---|---|
| pyate | ❌ No | ❌ No | ❌ No | Blocked | No general corpora |
| KeyBERT | ✅ Yes | ✅ Yes | ✅ Yes | Works | Multilingual BERT |
| chinese_keybert | ✅ Best | ❌ No | ❌ No | Works | Chinese-specific fork |
pyate CJK Support#
Technical Capability:#
✅ spaCy models exist for Chinese, Japanese, Korean
✅ pyate can load spaCy CJK models via set_language()
Actual Status:#
❌ No CJK support due to missing general domain corpora
Why Blocked:#
Per GitHub Issue #13:
As of version 0.4.2, only English and Italian are supported. The library’s language support depends on having appropriate spaCy models and general domain corpora for those languages.
What’s Missing:
- Weirdness algorithm: Requires general corpus to contrast against technical corpus
- Term Extractor algorithm: Requires reference corpus
Available spaCy Models:
- Chinese: zh_core_web_sm, zh_core_web_md, zh_core_web_lg
- Japanese: ja_core_news_sm, ja_core_news_md, ja_core_news_lg
- Korean: rule-based tokenizer, trained pipelines available
Workaround: Provide your own general corpus
from pyate import weirdness
chinese_text = "您的中文技术文档..."
general_corpus = "Your own Chinese general domain corpus..."
# This WILL work if you provide general_corpus. Note: only the
# corpus-contrasting algorithms (weirdness, term_extractor) take a
# general corpus; exact parameter names may vary by pyate version.
terms = weirdness(chinese_text, general_corpus=general_corpus)
Verdict: pyate is NOT recommended for CJK unless you can build general domain corpora (non-trivial effort).
KeyBERT CJK Support#
Technical Capability:#
✅ Multilingual BERT models support 50-109 languages including CJK
✅ Out-of-box support, no additional corpora needed
CJK Tokenization Behavior#
Per Google BERT Multilingual Docs:
Chinese (and Japanese Kanji, Korean Hanja):
- Character-tokenized (spaces added around every CJK Unicode character)
- Effectively treats Chinese as character-level (not word-level)
- May extract character-level “terms” instead of proper words
Japanese Katakana/Hiragana, Korean Hangul:
- Whitespace + WordPiece tokenization (normal BERT behavior)
- Better term extraction quality
Example:
- Input: “自然语言处理” (natural language processing in Chinese)
- BERT tokenization: [“自”, “然”, “语”, “言”, “处”, “理”] (6 characters)
- KeyBERT may extract: “语言” (language), “处理” (processing) as separate “keywords”
Implication: Character-level tokenization may miss proper word boundaries for Chinese/Japanese.
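The character-padding rule described above can be reproduced in a few lines. The sketch below mimics BERT's CJK pre-tokenization using a simplified Unicode range check (Google's actual `_is_chinese_char` covers additional ranges such as the extension blocks):

```python
def is_cjk_ideograph(ch):
    """Simplified check for CJK Unified Ideographs (U+4E00-U+9FFF).
    BERT's real check covers more ranges (extensions, compatibility)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

def bert_style_tokenize(text):
    """Pad every CJK ideograph with spaces, then split on whitespace,
    mimicking multilingual BERT's pre-tokenization step."""
    out = []
    for ch in text:
        out.append(f" {ch} " if is_cjk_ideograph(ch) else ch)
    return "".join(out).split()

print(bert_style_tokenize("自然语言处理"))   # every ideograph becomes its own token
print(bert_style_tokenize("BERT处理text"))  # Latin runs stay intact
```

Running this on "自然语言处理" yields six single-character tokens, which is exactly why multi-character Chinese terms can be lost before KeyBERT ever scores them.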
Recommended Models for CJK#
| Model | Languages | CJK Quality | Size |
|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ incl. CJK | Good | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 50+ incl. CJK | Better (higher quality) | 1.1GB |
| distiluse-base-multilingual-cased-v1 | 15 incl. Chinese, Korean | Lightweight | 480MB |
| LaBSE | 109 languages | Max coverage | 470MB |
Recommendation: Start with paraphrase-multilingual-MiniLM-L12-v2 (balance of size and quality).
Chinese-Specific: chinese_keybert Fork#
Repository: JacksonCakes/chinese_keybert
Improvements over Generic KeyBERT:
- ✅ Uses CKIP library for Chinese word segmentation (proper word boundaries)
- ✅ Chinese POS tagging (identifies noun phrases correctly)
- ✅ Integrates sentence-transformers for embeddings
Usage:
from chinese_keybert import ChineseKeyBERT
kw_model = ChineseKeyBERT()
keywords = kw_model.extract_keywords("自然语言处理技术...")
Trade-off:
- ✅ Better Chinese word segmentation (vs character-level generic BERT)
- ❌ Chinese-only (no Japanese, Korean support)
- ❌ Additional dependency (CKIP library)
Verdict: Use chinese_keybert if Chinese-only, use generic KeyBERT with multilingual model if multi-CJK.
Other Libraries: CJK Status#
YAKE#
- ✅ Language-agnostic (no language-specific models needed)
- ✅ Works on CJK text (statistical approach)
- ⚠️ Character-level statistics may affect quality for Chinese
RAKE-NLTK#
- ❌ English-centric (depends on English stopwords)
- ❌ Not recommended for CJK
textacy#
- ⚠️ Depends on spaCy models (same as pyate)
- ✅ spaCy CJK models exist (Chinese, Japanese, Korean)
- ✅ Should work for CJK if using spaCy CJK models
- ❓ Unknown whether the TextRank algorithm requires additional corpora
Real-World CJK Use Cases#
Translation (Chinese ↔ English)#
Need: Extract Chinese technical terms for translation glossaries
Recommendation:
- KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 (works, but character-level)
- chinese_keybert (better Chinese word segmentation)
- Hybrid: Manual review + filtering (BERT may miss proper terms)
Challenge: Character-level tokenization may extract “语言” (language) and “处理” (processing) separately, missing “自然语言处理” (natural language processing) as a complete term.
Multilingual Technical Documentation (Chinese, Japanese, Korean)#
Need: Consistent terminology across CJK languages
Recommendation:
- KeyBERT with multilingual model (supports all three)
- Per-language: chinese_keybert (Chinese), generic KeyBERT (Japanese, Korean)
Trade-off: Consistency (single model) vs quality (language-specific models).
Japanese Technical Writing#
Need: Extract Japanese technical terms (mix of Kanji, Hiragana, Katakana)
Recommendation:
- KeyBERT with multilingual model (handles all scripts)
- Consider spaCy Japanese model + textacy (if KeyBERT quality insufficient)
Note: Japanese mixes character sets (Kanji = character-level, Kana = syllabic). BERT handles this natively.
Verdict: CJK Support#
For Chinese:#
- 🥇 chinese_keybert (best word segmentation)
- 🥈 KeyBERT with multilingual model (works, but character-level)
- ❌ pyate (no general corpus)
For Japanese:#
- 🥇 KeyBERT with multilingual model (native support)
- 🥈 textacy + spaCy Japanese model (if KeyBERT insufficient)
- ❌ pyate (no general corpus)
For Korean:#
- 🥇 KeyBERT with multilingual model (native support)
- ❌ pyate (no general corpus)
For Multi-CJK (all three languages):#
🥇 KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 or LaBSE
- Single model for all three languages
- Consistent approach across CJK
- Trade-off: Character-level for Chinese may reduce term quality
Recommendations#
If CJK support is required (per research label):
- Default choice: KeyBERT with multilingual model (paraphrase-multilingual-MiniLM-L12-v2)
- Chinese-only: chinese_keybert fork (better word segmentation)
- NOT recommended: pyate (no CJK corpora)
If CJK + English mixed:
- KeyBERT works across languages in single model
- Useful for multilingual technical documentation
- Example: Code comments mixing English and Chinese
If terminology precision is critical:
- Consider manual review + filtering of KeyBERT output
- Character-level tokenization may miss multi-character technical terms
- Hybrid approach: KeyBERT extraction + human validation
Head-to-Head Comparison: pyate vs KeyBERT#
Quick Decision Matrix#
| Criterion | pyate | KeyBERT | Winner |
|---|---|---|---|
| Terminology extraction | ✓ Purpose-built | ✗ Keywords, not terms | pyate |
| Keyword extraction | ✗ Not designed for it | ✓ Semantic keywords | KeyBERT |
| Multilingual (general) | ~70 languages (via spaCy) | 50-100+ languages | KeyBERT |
| CJK support (Chinese/Japanese/Korean) | ❌ No corpora | ✅ Works out-of-box | KeyBERT |
| Speed | Fast (stats) | Slow (BERT) | pyate |
| Memory footprint | Low (~100MB spaCy model) | High (80MB-1.1GB BERT) | pyate |
| Multi-word terms | ✓ Designed for this | ~ May split into chars (CJK) | pyate |
| Semantic understanding | ✗ Statistical only | ✓ BERT embeddings | KeyBERT |
| No training required | ✓ (but needs corpora) | ✓ Pre-trained models | Tie |
| Installation simplicity | Moderate (spaCy + model) | Easy (pip + auto-download) | KeyBERT |
Fundamental Difference: Terminology vs Keywords#
Terminology Extraction (pyate)#
Goal: Find domain-specific technical terms (multi-word expressions, low ambiguity, conceptual importance)
Example Input: “Machine learning models use gradient descent for optimization.”
pyate Output:
- “machine learning models” (technical term)
- “gradient descent” (technical term)
- “optimization” (domain-specific concept)
Characteristics:
- Multi-word terms preferred (“natural language processing” > “language”)
- Domain-specificity (contrasts technical vs general corpus via Weirdness algorithm)
- Low ambiguity (terms have specific technical meaning)
Keyword Extraction (KeyBERT)#
Goal: Find semantically important words/phrases (document-level semantic relevance)
Example Input: “Machine learning models use gradient descent for optimization.”
KeyBERT Output:
- “machine learning” (semantically central to document)
- “gradient descent” (semantically central to document)
- “optimization” (important concept)
Characteristics:
- Semantic similarity to document meaning
- May include general words if semantically important (“important discovery”, “key finding”)
- Single or multi-word based on semantic coherence
Per Wikipedia and Sketch Engine:
Terminologists focus on finding terms specific to a particular technical domain, while information retrieval focuses on indexing terms capable of distinguishing among documents.
Algorithm Comparison#
pyate (Statistical + Linguistic)#
Algorithms:
- C-Value: Multi-word term recognition (nested term handling)
- Combo Basic: Weighted frequency + containment + length (highest precision)
- Weirdness: Technical corpus vs general corpus contrast
- Basic: Frequency with POS filtering
- Term Extractor: Hybrid approach
Process:
- spaCy POS tagging → identify noun phrases
- Apply statistical algorithm (frequency, containment, corpus contrast)
- Rank candidates by termhood score
- Return top-k terms
Benchmark: Astrakhantsev 2016 shows Combo Basic has highest precision for terminology extraction.
KeyBERT (Transformer-Based)#
Algorithm:
- BERT embedding for full document (768-dim vector)
- BERT embedding for each n-gram candidate
- Cosine similarity between document and candidates
- Return top-k by similarity
Process:
- Purely semantic (no POS tagging, no frequency counting)
- Finds candidates semantically similar to document meaning
- No contrast with general corpus (single-document operation)
Dependency Comparison#
pyate#
Required:
- spaCy (100MB-500MB language models)
- numpy, pandas
- pyahocorasick
- General domain corpus (for Weirdness, Term Extractor)
Total Install: ~150MB-600MB (depending on spaCy model)
Languages: Depends on spaCy model + corpus availability
- ✅ English, Italian (pre-built)
- ⚠️ ~70 languages (spaCy models exist, but no pyate corpora)
- ❌ CJK (spaCy models exist, but no general corpora for pyate algorithms)
KeyBERT#
Required:
- sentence-transformers (or alternative backend)
- BERT model (80MB-1.1GB depending on model)
Total Install: ~100MB-1.2GB (depending on model choice)
Languages: Depends on BERT model
- ✅ English (all-MiniLM-L6-v2): 80MB
- ✅ Multilingual 50+ languages (paraphrase-multilingual-MiniLM-L12-v2): 420MB
- ✅ 109 languages including CJK (LaBSE): 470MB
Performance: Speed & Resource Usage#
| Metric | pyate | KeyBERT |
|---|---|---|
| Inference Time (1000 words) | ~0.1-0.2s | ~1-2s |
| Relative Speed | 10x faster | Baseline |
| Memory Usage | 200MB-600MB | 500MB-1.5GB |
| GPU Acceleration | ✗ Not applicable | ✓ Available (optional) |
Bottleneck:
- pyate: spaCy POS tagging (fast, CPU-efficient)
- KeyBERT: BERT inference (slow, GPU benefits)
Optimization:
- pyate: Use smaller spaCy models (sm instead of lg)
- KeyBERT: Use ONNX backend (1.3-1.5x faster), smaller models (MiniLM vs mpnet)
Use Case Fit#
Translation Workflows#
Terminology Management: → pyate (builds glossaries of technical terms for translators)
Content Tagging: → KeyBERT (identifies topics/themes for routing to translators)
Multilingual Term Extraction: → KeyBERT (if CJK or low-resource languages) → pyate (if English/Italian and need precise terminology)
Technical Writing#
Glossary Generation: → pyate (extracts technical terms for documentation glossaries)
Index Creation: → KeyBERT (finds semantically important keywords for document index)
Domain-Specific NLP: → pyate (legal, medical, engineering terminology extraction)
CJK Language Projects#
Chinese/Japanese/Korean: → KeyBERT (only viable option - pyate lacks CJK corpora)
Chinese-Specific:
→ chinese_keybert fork (better word segmentation via CKIP)
Integration Recommendations#
Already Using spaCy?#
→ pyate (natural fit, add to pipeline)
Already Using sentence-transformers/BERT?#
→ KeyBERT (natural fit, same infrastructure)
Starting Fresh?#
- For terminology: pyate (if English/Italian) or KeyBERT (if CJK)
- For keywords: KeyBERT
- For speed: pyate
- For multilingual: KeyBERT
When to Use Both#
Complementary Use Case: Run both and combine results
- pyate → Technical terms (high precision, domain-specific)
- KeyBERT → Semantic keywords (broader context)
- Union: Comprehensive term + keyword coverage
Example: Technical documentation might need both:
- Glossary (pyate) + Index (KeyBERT)
- Translation terms (pyate) + Content tags (KeyBERT)
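A merge step for the two tools' outputs can be as simple as a provenance-tagged union. In this sketch, `pyate_terms` and `keybert_keywords` are hypothetical placeholders for the ranked candidate lists each library returns:

```python
def combine_term_lists(pyate_terms, keybert_keywords):
    """Union of two candidate lists, remembering which tool(s)
    proposed each entry. Candidates found by both tools are often
    the highest-confidence glossary entries."""
    combined = {}
    for term in pyate_terms:
        combined.setdefault(term.lower(), set()).add("pyate")
    for kw in keybert_keywords:
        combined.setdefault(kw.lower(), set()).add("keybert")
    return combined

merged = combine_term_lists(
    ["gradient descent", "neural network"],   # e.g. pyate output
    ["gradient descent", "model training"],   # e.g. KeyBERT output
)
# Candidates proposed by both tools
both = [t for t, src in merged.items() if len(src) == 2]
print(both)
```

Intersecting instead of unioning trades coverage for precision; the right choice depends on how much human review time is available downstream.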
Bottom Line#
Choose pyate if:
- ✅ You need pure terminology extraction (technical terms, glossaries)
- ✅ Language is English or Italian (pre-built support)
- ✅ Speed and resource efficiency matter
- ✅ Multi-word technical terms are critical
- ✅ You have or can build general domain corpora for your language
Choose KeyBERT if:
- ✅ You need semantic keyword extraction (topics, themes, content tags)
- ✅ Language is CJK (Chinese, Japanese, Korean) or low-resource
- ✅ Multilingual support (50-100+ languages) is required
- ✅ Semantic understanding (meaning-based) is more important than term specificity
- ✅ You don’t have general domain corpora available
Choose both if:
- ✅ You need comprehensive coverage (technical terms + semantic keywords)
- ✅ Resource constraints are not an issue
- ✅ Use cases include both glossary generation and content tagging
KeyBERT: Deep Technical Analysis#
Algorithm Implementation#
Core Approach:#
- Document Embedding: Extract BERT embedding for entire document (semantic representation)
- Candidate Embeddings: Extract BERT embeddings for n-gram candidates (words/phrases)
- Cosine Similarity: Calculate similarity between document and each candidate
- Top-K Selection: Return candidates most similar to document (highest cosine similarity)
Key Insight: Unlike statistical methods (frequency, co-occurrence), KeyBERT finds terms semantically similar to the document’s meaning, not just statistically prominent.
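The four steps above reduce to a cosine-similarity ranking. Stripped of the BERT encoder, the core scoring loop looks like this (the three-dimensional vectors are hand-made toys standing in for real 384- or 768-dimensional embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_candidates(doc_vec, candidate_vecs, top_k=2):
    """Steps 3-4 of KeyBERT: score every candidate against the
    document embedding, return the top-k by similarity."""
    scored = [(term, cosine(doc_vec, vec))
              for term, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

doc_vec = [0.9, 0.8, 0.1]                     # pretend document embedding
candidates = {
    "gradient descent": [0.85, 0.75, 0.15],   # close to the document
    "optimization":     [0.6, 0.9, 0.2],
    "the weather":      [0.1, 0.05, 0.95],    # semantically unrelated
}
print(rank_candidates(doc_vec, candidates))
```

Everything KeyBERT adds on top of this (n-gram candidate generation, MMR/MaxSum diversification, batched encoding) is refinement of these two functions.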
Embedding Backends#
Primary: sentence-transformers (Recommended)#
Models Available:
| Model | Languages | Use Case | Size |
|---|---|---|---|
| all-MiniLM-L6-v2 | English | Fast, good quality | 80MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ languages | Multilingual default | 420MB |
| paraphrase-multilingual-mpnet-base-v2 | 50+ languages | Higher quality | 1.1GB |
| distiluse-base-multilingual-cased-v1 | 15 languages (incl. Chinese, Korean) | Lightweight multilingual | 480MB |
| LaBSE | 109 languages | Maximum language coverage | 470MB |
Performance: Per MDPI study, mpnet achieved mean similarity score 0.71 ± 0.04 on STS 2017 dataset, but with higher computational demands.
Alternative Backends:#
- Flair: Contextual embeddings (slower, higher quality)
- Gensim: Word2Vec, Doc2Vec (lightweight, no transformers)
- spaCy: spaCy vectors (if already using spaCy)
- USE: Universal Sentence Encoder (Google)
Multilingual Support#
General Multilingual:#
✅ Excellent - Works with 50-100+ languages via multilingual BERT models
CJK-Specific Handling:#
Tokenization (Google BERT docs):
- Chinese: Character-tokenized (spaces added around every CJK Unicode character before WordPiece)
- Japanese Kanji: Character-tokenized (same as Chinese)
- Korean Hanja: Character-tokenized (Chinese-origin characters)
- Katakana/Hiragana: Whitespace + WordPiece (normal tokenization)
- Hangul Korean: Whitespace + WordPiece (normal tokenization)
Implication: Multilingual BERT handles CJK natively, but character-level tokenization may affect term quality for Chinese.
Chinese-Specific Implementation:#
chinese_keybert exists as a specialized fork:
- Uses CKIP library for Chinese word segmentation and POS tagging
- Leverages sentence-transformers for embeddings
- Better for Chinese than generic multilingual BERT (proper word boundaries)
Recommendation for CJK: Use paraphrase-multilingual-MiniLM-L12-v2 or language-specific BERT models (e.g., bert-base-chinese).
Performance Characteristics#
Strengths:#
- Semantic understanding: Finds keywords by meaning, not just frequency
- Multilingual: 50-100+ languages out-of-box
- No training required: Pre-trained BERT models work immediately
- Simple API: “pip install + 3 lines of code” design goal
- Flexible backends: sentence-transformers, Flair, spaCy, Gensim
Weaknesses:#
- Compute-intensive: BERT inference is slower than statistical methods
- Memory footprint: Models are 80MB-1.1GB (vs <10MB for statistical tools)
- Keywords ≠ Terminology: Extracts semantically important words, not necessarily technical terms
- Character-level CJK: Chinese/Japanese may get character-level tokens, not proper words
Speed Comparison:#
| Method | Relative Speed | Notes |
|---|---|---|
| RAKE/YAKE | 10x faster | Pure statistics |
| pyate | 5x faster | spaCy POS + stats |
| KeyBERT | Baseline (1x) | BERT inference |
Optimization: Use ONNX backend (1.3-1.5x speedup) or OpenVINO (Intel hardware optimization).
Integration Patterns#
Basic Usage:#
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
"Your document text here...",
keyphrase_ngram_range=(1, 3),
top_n=10
)
Custom Embedding Model:#
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
# Multilingual model for CJK support
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=st_model)
Chinese-Specific:#
# Use chinese_keybert fork for better Chinese support
from chinese_keybert import ChineseKeyBERT
kw_model = ChineseKeyBERT()
keywords = kw_model.extract_keywords("中文文本...")
Use Case Fit#
Best For:
- ✅ Semantic keyword extraction (meaning-based, not just frequency)
- ✅ Multilingual content (50-100+ languages including CJK)
- ✅ Content tagging/classification (finding topics, themes)
- ✅ Document similarity (embeddings enable clustering)
- ✅ Low-resource languages (multilingual BERT covers many languages)
NOT Best For:
- ❌ Pure terminology extraction (extracts keywords, not technical terms)
- ❌ Speed-critical applications (BERT inference is slow)
- ❌ Resource-constrained environments (large models, high memory)
- ❌ Multi-word technical terms (may split into characters for CJK)
Terminology vs Keywords: Key Difference#
Example Document: “Machine learning uses neural networks for classification tasks.”
KeyBERT Output (semantic keywords):
- “machine learning” (semantically central)
- “neural networks” (semantically central)
- “classification” (important concept)
Terminology Extraction Output (technical terms):
- “machine learning”
- “neural networks”
- “classification tasks”
Difference: KeyBERT finds semantically important words. Terminology extraction finds domain-specific technical terms. Overlap exists, but goals differ.
Per Wikipedia and Sketch Engine:
- Terminology: Domain-specific, low-ambiguity, multi-word expressions
- Keywords: Distinguish documents, may be general words with high semantic importance
Maintenance and Community#
- Status: ✅ Active (2023+ releases)
- GitHub: 3.5K+ stars, very active
- Documentation: Excellent (comprehensive guides, FAQ)
- Community: Large user base, active discussions
Key Citations#
Multilingual BERT (Google Research): 110K shared vocabulary, 102 languages, character-tokenization for CJK.
Sentence-Transformers Models: Performance benchmarks on STS 2017 dataset.
KeyBERT FAQ: Guidance on model selection and use cases.
Bottom Line#
KeyBERT is the strongest choice for semantic keyword extraction across 50-100+ languages including CJK. Excellent multilingual support via BERT, but extracts keywords (semantic importance) not terminology (technical terms). For pure terminology extraction, pyate is better (if language is supported). For CJK semantic keywords, KeyBERT works out-of-box with multilingual models, though chinese_keybert fork provides better Chinese word segmentation.
pyate: Deep Technical Analysis#
Algorithm Implementation#
Available Algorithms (5 total):#
- Basic: Frequency-based with POS filtering
- Combo Basic: Extension of Basic (highest precision per Astrakhantsev 2016)
- C-Value: Multi-word term recognition (Frantzi et al. 1998)
- Weirdness: Contrasts technical vs general corpus
- Term Extractor: Hybrid approach
Combo Basic (Recommended)#
Formula: Weighted average of:
- Number of times term t contains another candidate term
- Number of times another candidate term contains t
- Length of t in characters × log(frequency of t)
Performance: Per Astrakhantsev 2016, combo_basic is most precise of the five algorithms implemented in pyate. Basic and C-Value are “not too far behind.”
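The weighting can be made concrete with a toy re-implementation. The sketch follows the common ComboBasic formulation score(t) = |t|·log f(t) + α·e_t + β·e'_t, using the length-in-characters reading given above; the α and β defaults here are illustrative guesses, not pyate's actual values:

```python
import math

def combo_basic_score(candidates, alpha=0.75, beta=0.1):
    """Toy ComboBasic. candidates maps term -> frequency.

    score(t) = len(t) * log(freq)
             + alpha * (# other candidates containing t)
             + beta  * (# other candidates contained in t)
    """
    scores = {}
    for t, freq in candidates.items():
        contains_t = sum(1 for u in candidates if u != t and t in u)
        t_contains = sum(1 for u in candidates if u != t and u in t)
        scores[t] = (len(t) * math.log(freq)
                     + alpha * contains_t + beta * t_contains)
    return sorted(scores.items(), key=lambda p: p[1], reverse=True)

freqs = {
    "language": 6,
    "language processing": 4,
    "natural language processing": 3,
}
print(combo_basic_score(freqs))
```

Even though "language" is the most frequent string, the length and containment terms push the full multi-word expression to the top, which is exactly the behavior terminology extraction wants.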
Comparison to State-of-the-Art: PU-ATR and KeyConceptRel have higher precision than combo_basic but:
- Not implemented in pyate
- PU-ATR takes significantly more time (uses machine learning)
Dependencies#
Required:
- spaCy (POS tagging)
- numpy, pandas (data processing)
- pyahocorasick (pattern matching)
Language Models: Requires spaCy language model (e.g., en_core_web_sm for English)
Multilingual Support#
Current Status:#
- Supported (as of v0.4.2): English, Italian
- Requires: Language-specific spaCy model + general domain corpus
Language Switching:#
from pyate import combo_basic
combo_basic.set_language("zh", "zh_core_web_sm")  # Chinese example
CJK Language Support:#
spaCy Models Available:
- ✓ Chinese (zh_core_web_sm, zh_core_web_md, zh_core_web_lg)
- ✓ Japanese (ja_core_news_sm, ja_core_news_md, ja_core_news_lg)
- ✓ Korean (rule-based tokenizer available)
pyate Status for CJK: ❌ No native CJK support - Per GitHub Issue #13, pyate lacks general domain corpora for Chinese, Japanese, Korean. While spaCy can tokenize/POS-tag CJK text, pyate’s algorithms (especially Weirdness and Term Extractor) require reference corpora that don’t exist yet.
Implication: pyate can technically run on CJK text if you provide your own general corpus, but no out-of-box CJK support.
Performance Characteristics#
Strengths:#
- High precision for terminology (vs keywords): Targets multi-word technical terms
- Multiple algorithms: Can choose based on use case (C-Value for nested terms, Combo Basic for precision)
- Domain-specific: Weirdness algorithm contrasts technical vs general language
- Benchmark-proven: Astrakhantsev 2016 validates performance
Weaknesses:#
- Requires corpora: Weirdness and Term Extractor need reference corpora (not available for all languages)
- spaCy dependency: Heavier stack, requires language models (100MB-500MB)
- Limited CJK: No pre-built support for Chinese, Japanese, Korean
Speed:#
- Fast (statistical algorithms, not ML inference)
- Slower than YAKE/RAKE (due to spaCy POS tagging)
- Much faster than KeyBERT (no transformer inference)
Integration Patterns#
spaCy Pipeline Integration:#
import spacy
from pyate import combo_basic
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp("Your technical document here...")
terms = doc._.combo_basic  # extracted terminology; extension attribute matches the pipe name
Standalone Usage:#
from pyate import combo_basic
text = "Natural language processing and machine learning..."
terms = combo_basic(text).sort_values(ascending=False).head(10)
Use Case Fit#
Best For:
- ✅ Technical terminology extraction (not general keywords)
- ✅ Translation memory creation
- ✅ Glossary generation for technical writing
- ✅ Domain-specific NLP (medical, legal, engineering)
- ✅ Multi-word term recognition (e.g., “natural language processing”)
NOT Best For:
- ❌ CJK languages (no pre-built corpora)
- ❌ General keyword extraction (use YAKE or KeyBERT)
- ❌ Semantic understanding (use KeyBERT)
- ❌ Low-resource languages (requires spaCy model + corpus)
Maintenance and Community#
- Status: ✅ Active (spaCy v3 support, releases in 2023)
- GitHub: 320+ stars, regular updates
- Documentation: Good (docs site + demo app)
- Community: Listed in spaCy Universe (official ecosystem)
Key Citations#
Astrakhantsev, N. (2016). “ATR4S: toolkit with state-of-the-art automatic terms recognition methods in Scala.” Language Resources and Evaluation, 52(3), 853-872.
- Benchmark showing combo_basic has highest precision
Frantzi, K.T., Ananiadou, S., Tsujii, J. (1998). “The C-value/NC-value Method of Automatic Recognition for Multi-word Terms.”
- Original C-Value algorithm for multi-word term extraction
Bottom Line#
pyate is the strongest choice for pure terminology extraction in supported languages (English, Italian). Implements proven algorithms (Combo Basic, C-Value) with benchmark-validated precision. CJK support is blocked by lack of general domain corpora, making it unsuitable for Chinese, Japanese, Korean unless you provide your own reference corpus.
S2 Comprehensive: Technical Recommendations#
Executive Summary#
Deep analysis reveals fundamentally different tools for different goals:
- pyate: Pure terminology extraction (technical terms, domain-specific)
- KeyBERT: Semantic keyword extraction (meaning-based, content-level)
CJK Impact: Research label cjk is critical decision factor. pyate has no CJK support (missing corpora), making KeyBERT the only viable option for Chinese, Japanese, Korean.
Main Findings#
1. Terminology vs Keywords: Different Goals#
Critical Insight: These libraries solve different problems.
| Aspect | Terminology Extraction (pyate) | Keyword Extraction (KeyBERT) |
|---|---|---|
| Goal | Domain-specific technical terms | Semantically important words |
| Target | Multi-word expressions, low ambiguity | Document-level semantic relevance |
| Use Case | Glossaries, translation memory | Content tagging, indexing |
| Example Output | “natural language processing”, “gradient descent optimization” | “language”, “processing”, “gradient descent” |
Evidence: Per Wikipedia, terminologists focus on terms specific to a technical domain (organized knowledge), while information retrieval focuses on indexing terms (document retrieval).
Implication: Choice depends on use case, not just “which is better.”
2. pyate Strengths & Weaknesses#
Strengths:
- ✅ Highest precision for terminology (Astrakhantsev 2016: combo_basic most precise)
- ✅ Multiple algorithms (C-Value, Combo Basic, Weirdness, Basic, Term Extractor)
- ✅ Multi-word term focus (designed for phrases like “machine learning model”)
- ✅ Domain-specificity (Weirdness contrasts technical vs general corpus)
- ✅ Fast (10x faster than KeyBERT, statistical algorithms)
Weaknesses:
- ❌ NO CJK support (missing general corpora for Chinese, Japanese, Korean)
- ❌ Limited languages (only English, Italian pre-built)
- ❌ Requires corpora (Weirdness and Term Extractor need reference corpora)
- ❌ spaCy dependency (100MB-500MB language models)
Verdict: Best for English/Italian terminology extraction. Not viable for CJK.
3. KeyBERT Strengths & Weaknesses#
Strengths:
- ✅ Excellent CJK support (50-109 languages via multilingual BERT)
- ✅ Semantic understanding (meaning-based, not just frequency)
- ✅ Simple API (“pip install + 3 lines of code”)
- ✅ No corpora required (pre-trained BERT works immediately)
- ✅ Multilingual (single model for many languages)
Weaknesses:
- ❌ Keywords, not terminology (different goal - semantic importance vs technical terms)
- ❌ Character-level CJK (Chinese tokenized as characters, may miss word boundaries)
- ❌ Slow (BERT inference 10x slower than pyate)
- ❌ Large models (80MB-1.1GB vs pyate’s statistical approach)
Verdict: Best for CJK keyword extraction. Use chinese_keybert fork for better Chinese quality.
4. CJK Support: Decisive Factor#
Requirement: Research label cjk indicates Chinese, Japanese, Korean support needed.
Analysis:
- pyate: ❌ No CJK corpora → Cannot be recommended
- KeyBERT: ✅ Works out-of-box → Only viable option
CJK-Specific Challenges:
- Multilingual BERT tokenizes Chinese character-level (not word-level)
- May extract “语言” (language) and “处理” (processing) separately, missing “自然语言处理” (natural language processing)
- Solution: chinese_keybert fork (CKIP word segmentation) for Chinese-only use cases
Recommendation: KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 for multi-CJK, chinese_keybert for Chinese-only.
5. Algorithm Comparison#
pyate Algorithms (ranked by precision per Astrakhantsev 2016):
- Combo Basic (highest precision): Weighted frequency + containment + length
- C-Value (close second): Multi-word term recognition, nested terms
- Basic (baseline): Frequency with POS filtering
KeyBERT Algorithm:
- BERT document embedding → candidate embeddings → cosine similarity → top-k
Benchmark: Combo Basic beats Basic and C-Value. PU-ATR and KeyConceptRel have higher precision but are not implemented (and PU-ATR is much slower).
Implication: pyate’s combo_basic is state-of-practice (not state-of-art, but best available in pip-installable libraries).
6. Performance Trade-offs#
| Metric | pyate | KeyBERT | Ratio |
|---|---|---|---|
| Speed (1000 words) | ~0.1-0.2s | ~1-2s | 10x faster |
| Memory | 200-600MB | 500-1500MB | 2-3x lighter |
| Model Size | 100-500MB (spaCy) | 80-1100MB (BERT) | Similar |
Optimization:
- pyate: Use sm spaCy models (smallest)
- KeyBERT: Use ONNX backend (1.3-1.5x faster), MiniLM models (80MB)
Verdict: pyate is faster and lighter, but KeyBERT is acceptable for most use cases.
Recommendations by Use Case#
Translation Workflows#
Glossary Creation (technical term extraction):
- English/Italian: pyate with combo_basic
- CJK: KeyBERT with multilingual model (fallback: manual review)
Content Tagging (routing to translators):
- All languages: KeyBERT (semantic keywords for topic identification)
Chinese-English Translation:
- chinese_keybert for Chinese terms
- pyate for English terms
- Challenge: Character-level Chinese tokenization may miss multi-character terms
Technical Writing#
Glossary Generation:
- English/Italian: pyate (purpose-built for terminology)
- CJK: KeyBERT (only option, but review character-level output)
Index Creation:
- All languages: KeyBERT (semantic keywords for index)
Domain-Specific NLP (medical, legal, engineering):
- English/Italian: pyate (domain terminology extraction via Weirdness)
- CJK: KeyBERT + manual filtering (BERT may extract keywords, not domain terms)
Multilingual Projects (CJK + English)#
Single Model for All:
- KeyBERT with paraphrase-multilingual-MiniLM-L12-v2 (50+ languages)
- Consistent approach across languages
- Trade-off: Character-level CJK, keywords (not terms)
Per-Language Optimization:
- English: pyate (terminology)
- Chinese: chinese_keybert (better word segmentation)
- Japanese/Korean: KeyBERT with multilingual model
- Trade-off: Inconsistent approaches, but higher quality
S2 Decision Tree#
Do you need CJK (Chinese, Japanese, Korean) support?
├─ YES → KeyBERT (only viable option)
│ ├─ Chinese-only → chinese_keybert (better word segmentation)
│ ├─ Multi-CJK → KeyBERT + paraphrase-multilingual-MiniLM-L12-v2
│ └─ Note: Character-level tokenization, keywords not terms
│
└─ NO (English, Italian, or other spaCy-supported languages)
│
├─ Do you need TERMINOLOGY extraction (technical terms, glossaries)?
│ ├─ YES → pyate with combo_basic
│ │ ├─ Multi-word terms: Use C-Value
│ │ ├─ Domain-specific: Use Weirdness (requires general corpus)
│ │ └─ General: Use Combo Basic (highest precision)
│ │
│ └─ NO → Go to keyword extraction
│
└─ Do you need KEYWORD extraction (semantic importance, content tags)?
├─ YES → KeyBERT
│ ├─ English: all-MiniLM-L6-v2 (80MB, fast)
│ ├─ Multilingual: paraphrase-multilingual-MiniLM-L12-v2 (420MB)
│ └─ High quality: paraphrase-multilingual-mpnet-base-v2 (1.1GB)
│
└─ BOTH? → Run both, combine results
├─ pyate → Technical terms
├─ KeyBERT → Semantic keywords
└─ Union → Comprehensive coverage
S2 Top Recommendations#
1. For CJK Use Cases (per research label):#
KeyBERT with multilingual model
- Model: paraphrase-multilingual-MiniLM-L12-v2 (420MB, 50+ languages)
- Rationale: Only pip-installable library with CJK support
- Trade-off: Keywords (not terminology), character-level Chinese
- Mitigation: Use chinese_keybert for Chinese-only, manual review for technical terms
2. For English/Italian Terminology Extraction:#
pyate with combo_basic algorithm
- Rationale: Highest precision (Astrakhantsev 2016), purpose-built for terminology
- Use cases: Glossaries, translation memory, domain-specific NLP
- Trade-off: No CJK, requires spaCy dependency
3. For Hybrid Multilingual (CJK + English):#
KeyBERT (CJK) + pyate (English)
- Rationale: Best-of-both (KeyBERT for CJK, pyate for English terminology)
- Trade-off: Two libraries, inconsistent approaches
- Value: Maximizes quality per language
Next Steps (S3: Need-Driven)#
Recommended focus for S3:
- Real-world use cases: Translation workflows, technical writing, domain-specific NLP
- CJK quality assessment: How well does KeyBERT handle Chinese/Japanese/Korean in practice?
- Integration patterns: spaCy pipelines (pyate) vs sentence-transformers (KeyBERT)
- TCO analysis: Installation, dependencies, resource requirements
- Community feedback: What do users report about CJK quality?
Key questions for S3:
- Can chinese_keybert quality justify additional dependency?
- What is acceptable precision for CJK technical term extraction?
- Should users run both libraries and combine results?
- Are there workflow patterns that maximize value (e.g., KeyBERT extraction → human validation)?
S3: Need-Driven
S3 Need-Driven Research: Approach#
Goal#
Understand how terminology extraction libraries fit into real-world workflows, not just technical capabilities. Focus on:
- Actual use cases (translation, technical writing, domain NLP)
- Integration with existing tools (CAT tools, documentation systems)
- Total Cost of Ownership (installation, maintenance, training)
- Workflow patterns that maximize value
- CJK quality in practice (not just theoretical support)
Research Method#
Translation Workflow Analysis: How do translators use terminology extraction?
- CAT tool integration (SDL Trados, MemoQ, Smartcat)
- Bilingual glossary creation
- Productivity impact (time savings, quality improvement)
Technical Writing Workflow: Documentation team use cases
- Glossary generation for user manuals
- Terminology consistency across documents
- Integration with documentation tools (Sphinx, MkDocs)
Integration Patterns: How to integrate pyate/KeyBERT into existing stacks
- spaCy pipeline integration (pyate)
- sentence-transformers ecosystem (KeyBERT)
- Batch processing, API deployment
TCO Analysis: Beyond pip install
- Installation complexity (dependencies, models, corpora)
- Resource requirements (CPU, memory, disk)
- Maintenance overhead (updates, model management)
- Training requirements (learning curve for teams)
Community Feedback: What do users report?
- GitHub issues, discussions
- Blog posts, case studies
- Translator/writer testimonials
Key Questions#
For Translation Workflows:#
- Do CAT tools integrate with Python libraries, or is manual export/import needed?
- What is the typical glossary creation time with vs without automated extraction?
- How well do extracted terms match translator expectations (precision, recall)?
- Does CJK extraction quality justify automation, or is manual curation still needed?
For Technical Writing:#
- How do teams manage terminology consistency across large documentation sets?
- What is the workflow for validating extracted terms (human-in-the-loop)?
- Do teams run extraction once (initial glossary) or continuously (every doc update)?
For Integration:#
- Can pyate/KeyBERT run in batch mode (process thousands of documents)?
- What are API deployment patterns (REST service, microservice)?
- How do teams handle versioning (model updates, algorithm changes)?
For TCO:#
- What is the total installation footprint (pyate: spaCy models, KeyBERT: BERT models)?
- What are ongoing maintenance costs (model updates, dependency conflicts)?
- What is the learning curve for non-ML engineers?
Expected Outcome#
Practical recommendations for:
- When to use terminology extraction (value > cost threshold)
- How to integrate into existing workflows (step-by-step patterns)
- Which library for which use case (pyate vs KeyBERT decision criteria)
- What to expect from CJK extraction (quality assessment, manual review needs)
Sources#
- Translation community: linguagreca.com, translator blogs, CAT tool documentation
- Technical writing: Docs-as-code community, Sphinx/MkDocs forums
- Integration: GitHub issues, Stack Overflow, Medium blog posts
- TCO: PyPI package statistics, model sizes, dependency graphs
S3 Need-Driven: Practical Recommendations#
Executive Summary#
Real-world analysis confirms S2 technical findings:
- pyate: High-value for English/Italian technical terminology (60-80% time savings in translation)
- KeyBERT: Only viable option for CJK, but requires validation (precision ~60-70% for technical terms)
Key Insight: Automated extraction is time-saving (prep work), not replacement (human review essential).
Use Case Validation#
Translation Workflows: ✅ High Value#
Quantified Benefits:
- Time savings: 60-80% reduction in terminology preparation (2-4 hours → 30-60 min per 10K words)
- ROI: Bilingual extraction (e.g., XTM) saves 80% of glossary creation time
- Translator feedback: “Translation life much easier” with KeyBERT
Reality Check:
- CAT tools prefer integrated features (Python libraries require export/import)
- Precision ~70-80% for pyate, ~60-70% for KeyBERT (CJK) → human validation required
- Works best for initial glossary creation, not real-time translation support
Recommendation: Use for large projects (>5,000 words), recurring domains (glossary reuse), multiple translators (consistency). Skip for small/one-off translations.
Technical Writing: ✅ Moderate Value#
Benefits:
- Glossary generation for documentation
- Terminology consistency checking
- Index creation (KeyBERT for semantic keywords)
Challenges:
- Requires integration into docs-as-code workflow (Sphinx, MkDocs)
- One-time use per documentation set (less recurring value than translation)
- Manual review still needed (domain experts validate terms)
Recommendation: Use for documentation >10K words with complex terminology. Integrate into CI/CD for automated glossary updates.
Domain-Specific NLP: ✅ High Value (Foundation)#
Use Case: Build domain-specific models (medical, legal, engineering NLP)
Value:
- Terminology extraction is foundation for domain ontologies
- Multi-word term recognition (pyate) critical for technical domains
- Fine-tuning embeddings on extracted terminology improves downstream tasks
Recommendation: pyate for English domain modeling, KeyBERT for CJK/multilingual domains.
Integration Patterns: Validated#
spaCy Pipeline (pyate)#
Pattern: Add pyate as pipeline component
```python
import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline  # registers the "combo_basic" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp("Technical document...")
terms = doc._.combo_basic  # pandas Series: term -> termhood score
```

Value: If already using spaCy, pyate is a natural extension. No additional infrastructure.
Trade-off: Requires spaCy models (100MB-500MB). If not using spaCy, KeyBERT may be lighter.
sentence-transformers Ecosystem (KeyBERT)#
Pattern: Use pre-trained embeddings, integrate with semantic search stack
```python
from keybert import KeyBERT

kw_model = KeyBERT("paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords("Document...")  # list of (keyword, score) pairs
```

Value: If building a semantic search / retrieval system, KeyBERT reuses the same BERT models. Infrastructure overlap.
Trade-off: BERT models are large (80MB-1.1GB); if you are not already using them elsewhere, loading one just for term extraction may not be worth it.
Standalone / Batch Processing#
Use Case: Process large corpus (thousands of documents)
Pattern:
- Load model once (pyate: spaCy, KeyBERT: BERT)
- Batch process documents (minimize model loading overhead)
- Export results to database / CSV
- Human review interface (validate extracted terms)
Performance: Both libraries support batch processing efficiently.
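The batch pattern above can be sketched as a small harness. The freq_extract stand-in is illustrative only; in practice, pass a function wrapping pyate's combo_basic or KeyBERT's extract_keywords, with the underlying model loaded once outside the loop:

```python
import csv

def batch_extract(documents, extract_fn, out_path):
    """Apply a single loaded extractor across many documents, export to CSV.

    extract_fn: callable(text) -> iterable of (term, score). Load the
    underlying model (spaCy or BERT) once, outside this loop, to avoid
    per-document loading overhead.
    """
    rows = []
    for doc_id, text in documents.items():
        for term, score in extract_fn(text):
            rows.append({"doc_id": doc_id, "term": term, "score": score})
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["doc_id", "term", "score"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Stand-in extractor: raw term frequency (swap in combo_basic or KeyBERT)
def freq_extract(text):
    words = text.lower().split()
    return [(w, words.count(w))
            for w in sorted(set(w for w in words if len(w) > 3))]

docs = {"doc1": "neural network trains neural network layers"}
rows = batch_extract(docs, freq_extract, "terms.csv")
```

The CSV output then feeds the human review step; the same structure works for a database sink.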
TCO Analysis: Practical Costs#
Installation Complexity#
| Library | Install Steps | Download Size | Time to First Run |
|---|---|---|---|
| pyate | pip install pyate + download spaCy model | ~150MB-600MB | ~5-10 min |
| KeyBERT | pip install keybert (auto-downloads model) | ~100MB-1.2GB | ~2-5 min |
Verdict: KeyBERT slightly simpler (auto-download), but pyate is straightforward if using spaCy already.
Resource Requirements#
| Metric | pyate | KeyBERT | Notes |
|---|---|---|---|
| CPU | ~2-4 cores | ~4-8 cores (or GPU) | KeyBERT benefits from GPU |
| Memory | ~500MB-1GB | ~1-2GB | BERT models are larger |
| Disk | ~200MB-600MB | ~500MB-1.5GB | Model storage |
Verdict: pyate is lighter. KeyBERT manageable for server deployments, but heavy for edge/mobile.
Maintenance Overhead#
pyate:
- spaCy model updates (quarterly)
- Python dependency conflicts (rare, spaCy stable)
- Effort: ~1-2 hours/year
KeyBERT:
- sentence-transformers model updates (bi-annual)
- BERT model changes (new models, deprecations)
- Effort: ~2-4 hours/year
Verdict: Both low-maintenance. pyate slightly lower due to spaCy stability.
Learning Curve#
| Audience | pyate | KeyBERT | Notes |
|---|---|---|---|
| Non-ML Engineer | Moderate (need spaCy basics) | Easy (3 lines of code) | KeyBERT simpler API |
| NLP Engineer | Easy (familiar with spaCy) | Easy (familiar with BERT) | Both straightforward |
| Translator/Writer | Hard (Python required) | Moderate (simple script) | Both require coding skills |
Verdict: KeyBERT easier for beginners. pyate easier if already using spaCy.
CJK Quality in Practice#
Chinese (中文)#
KeyBERT (generic multilingual):
- Tokenization: Character-level (“自然语言处理” → [“自”, “然”, “语”, “言”, “处”, “理”])
- Quality: ~60-70% precision for technical terms (may extract characters, not words)
- Validation: Manual review essential (check word boundaries, technical accuracy)
chinese_keybert (specialized fork):
- Tokenization: Word-level via CKIP (“自然语言处理” → single token)
- Quality: ~70-80% precision (better word segmentation)
- Trade-off: Chinese-only, additional dependency
Recommendation: If Chinese-only, use chinese_keybert. If multi-CJK, use KeyBERT + manual validation.
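The word-boundary problem can be seen directly below. The jieba/CountVectorizer pattern in the comments follows the approach described in KeyBERT's documentation for Chinese and assumes jieba and scikit-learn are installed; treat it as a sketch, not a drop-in recipe:

```python
# Character-level view: what a generic multilingual tokenizer may produce
term = "自然语言处理"  # "natural language processing"
chars = list(term)
print(chars)  # ['自', '然', '语', '言', '处', '理']

# Word-level view: a Chinese word segmenter keeps the term intact.
# With jieba and scikit-learn installed, KeyBERT can score word-level
# candidates via a custom vectorizer (pattern from the KeyBERT docs):
#
#   import jieba
#   from sklearn.feature_extraction.text import CountVectorizer
#   from keybert import KeyBERT
#
#   vectorizer = CountVectorizer(tokenizer=jieba.lcut)
#   kw_model = KeyBERT("paraphrase-multilingual-MiniLM-L12-v2")
#   keywords = kw_model.extract_keywords(doc_text, vectorizer=vectorizer)
```

This custom-vectorizer route is a middle ground between generic KeyBERT (character-level) and the chinese_keybert fork (CKIP segmentation).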
Japanese (日本語)#
KeyBERT (multilingual):
- Tokenization: Mixed (Kanji character-level, Kana syllabic)
- Quality: ~65-75% precision (handles multiple scripts reasonably)
- Validation: Review for proper term boundaries
Alternative: textacy + spaCy Japanese model (if KeyBERT insufficient)
Recommendation: KeyBERT is viable, but expect ~25-35% false positives requiring manual filtering.
Korean (한국어)#
KeyBERT (multilingual):
- Tokenization: Syllable-level (Hangul) + character-level (Hanja)
- Quality: ~65-75% precision
- Validation: Manual review for technical terms
Recommendation: KeyBERT only option. Plan for human-in-loop validation.
Decision Framework: Practical#
Do you need CJK (Chinese, Japanese, Korean) support?
├─ YES → KeyBERT (only option)
│ ├─ Chinese-only → chinese_keybert (better quality)
│ ├─ Multi-CJK → KeyBERT + multilingual model
│ └─ **CRITICAL**: Plan for 25-40% manual validation (character-level issues)
│
└─ NO (English, Italian, or wait for pyate language support)
│
├─ Do you have existing spaCy infrastructure?
│ ├─ YES → pyate (natural fit, reuse spaCy models)
│ └─ NO → pyate still recommended for terminology (precision)
│
├─ Document size:
│ ├─ Large (>5,000 words) → Automated extraction justified (60-80% time savings)
│ └─ Small (<1,000 words) → Manual may be faster (extraction overhead)
│
└─ Use case:
├─ Translation → pyate (technical terms for glossaries)
├─ Technical writing → pyate (glossaries, documentation)
├─ Content tagging → KeyBERT (semantic keywords)
└─ Domain NLP → pyate (foundation for ontologies)
S3 Top Recommendations#
1. For CJK Translation/Technical Writing:#
KeyBERT with human validation workflow
- Extract: KeyBERT with paraphrase-multilingual-MiniLM-L12-v2
- Review: Human validation (expect ~25-40% false positives for CJK)
- Effort: 60-90 min per 10K words (vs 2-4 hours manual)
- Value: Automation handles volume, humans ensure CJK quality
2. For English/Italian Translation:#
pyate for initial glossary + CAT tool integration
- Extract: pyate with combo_basic
- Export: CSV/TBX format to CAT tool
- Effort: 30-60 min per 10K words (vs 2-4 hours manual)
- Value: 60-80% time savings, ~70-80% precision
3. For Multilingual Technical Writing:#
KeyBERT for CJK + pyate for English (hybrid)
- Extract: Per-language (best algorithm for each)
- Consolidate: Merge glossaries
- Value: Maximum quality per language
4. For Domain-Specific NLP:#
pyate as foundation for English, KeyBERT for CJK/multilingual
- Extract: Multi-word technical terms (critical for ontologies)
- Use: Fine-tune embeddings, train domain classifiers
- Value: Terminology extraction as NLP pipeline component
Workflow Pattern: Human-in-Loop#
Validated Pattern (works across use cases):
- Automated Extraction: pyate (English) or KeyBERT (CJK) on source corpus
- Bulk Filtering: Remove obvious false positives (frequency < threshold, single characters for CJK)
- Human Review: Domain expert validates technical accuracy (~15-30 min per 100 terms)
- Glossary Export: CSV/TBX to CAT tool or documentation system
- Maintenance: Add missed terms manually (ongoing, as new documents processed)
Value: Automation handles volume (extract 1000s of candidates), humans ensure quality (validate technical accuracy, domain fit).
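The bulk-filtering step (2) above can be sketched as a simple score/length filter; the thresholds and sample scores here are illustrative and should be tuned per corpus:

```python
def bulk_filter(candidates, min_score=0.0, min_len=2):
    """Step 2 of the human-in-loop pattern: drop obvious false positives
    before human review.

    candidates: dict of term -> score, as produced by pyate (termhood)
    or KeyBERT (cosine similarity). min_len=2 removes single characters,
    a common source of CJK noise from character-level tokenization.
    """
    return {
        term: score
        for term, score in candidates.items()
        if score >= min_score and len(term) >= min_len
    }

# Illustrative candidate list with made-up scores
candidates = {"neural network": 3.2, "the": 0.1, "语": 0.8, "梯度下降": 2.5}
kept = bulk_filter(candidates, min_score=0.5)
```

What survives the filter goes to the domain expert; anything dropped here never costs review time, which is where the 15-30 min per 100 terms figure comes from.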
Next Steps (S4: Strategic)#
Key questions for S4:
- Long-term viability: Which libraries will be maintained in 2027-2030?
- Technology evolution: Will character-level CJK improve (new BERT tokenizers)?
- Integration trends: Will CAT tools adopt Python libraries, or remain separate?
- Alternative approaches: Should teams build custom extractors vs use libraries?
Translation Workflow Use Case#
Typical Translation Terminology Workflow#
Without Automated Extraction:
- Translator receives source document
- Manually identifies technical terms while translating
- Adds terms to glossary as encountered
- Time: 2-4 hours per 10,000-word document for terminology work
With Automated Extraction:
- Run terminology extraction on source document
- Review extracted terms (validate, filter false positives)
- Add validated terms to glossary
- Time: 30-60 minutes per 10,000-word document
Time Savings: 60-80% reduction in terminology preparation time
Source: Nimdzi
CAT Tool Integration#
Current State:#
- Most CAT tools (SDL Trados, MemoQ, Smartcat) have built-in term extraction
- Python libraries (pyate, KeyBERT) require manual export/import workflow
- Translator preference: Integrated tools within CAT environment
Per LinguaGreca survey:
Translators prefer to have terminology extraction integrated in their CAT tool, rather than using separate tools.
Integration Pattern:#
Source Document → Python Extract → CSV/TBX → Import to CAT Tool → Human Review → Glossary
Trade-off: Python libraries offer better algorithms but require extra steps. CAT built-in tools are convenient but less sophisticated.
Bilingual Terminology Extraction#
Need: Extract source term + target translation pairs from aligned text (translation memories)
Challenge: pyate and KeyBERT are monolingual (extract from single language)
Workflow for Bilingual:
- Extract terms from source language (EN: pyate or KeyBERT)
- Extract terms from target language (CJK: KeyBERT only viable option)
- Manual alignment: Match source terms to target translations
- Alternative: Use bilingual extraction tools (SynchroTerm, XTM)
Time Savings (per XTM): Automated bilingual extraction saves 80% of glossary creation time
pyate in Translation#
Strengths:#
- ✅ High precision for technical terms (Combo Basic algorithm)
- ✅ Multi-word term focus (translations often need phrases, not single words)
- ✅ Domain-specific (Weirdness algorithm useful for specialized translations)
Weaknesses:#
- ❌ English/Italian only (no CJK support)
- ❌ Monolingual (no automatic source-target pairing)
- ❌ Requires export to CAT tool (not integrated)
Best For:#
- English → X translations (extract English source terms)
- Technical domain translations (medical, legal, engineering)
- Initial glossary creation (one-time extraction)
KeyBERT in Translation#
Strengths:#
- ✅ Multilingual (50-109 languages) including CJK
- ✅ Works for low-resource languages (no corpora needed)
- ✅ Simple API (easy to integrate into custom workflows)
Weaknesses:#
- ❌ Keywords, not terminology (may extract non-technical words)
- ❌ Character-level CJK (may miss proper Chinese word boundaries)
- ❌ Requires filtering (more false positives than pyate)
Best For:#
- CJK language pairs (Chinese, Japanese, Korean)
- Multilingual projects (single tool for many languages)
- Content tagging (route documents to domain-specific translators)
Real-World Translator Feedback#
Positive: Per translator testimonial:
“My translation and localization life is much easier today thanks to this tool [KeyBERT]… a valuable tool for any translator or linguist.”
Caveat: KeyBERT extracts keywords, not terms. Translators should filter and validate output.
Recommended Workflow#
For English/Italian → X Translation:#
- Extract source terms: pyate with combo_basic
- Review extracted terms (precision ~70-80%, some filtering needed)
- Export to CSV/TBX format
- Import to CAT tool glossary
- Translate terms in context (CAT tool termbase feature)
- Maintain glossary (add missed terms during translation)
Effort: ~30-60 min for 10,000-word document (vs 2-4 hours manual)
For CJK → X or X → CJK Translation:#
- Extract source terms: KeyBERT with multilingual model
- Filter false positives (keywords vs terminology)
- Validate CJK terms (check word boundaries, technical accuracy)
- Export to CAT tool
- Human-in-loop review (CJK extraction ~60-70% precision, needs validation)
Effort: ~60-90 min for 10,000-word document (CJK requires more validation)
For Bilingual Terminology:#
- Option A: Use CAT tool built-in (SynchroTerm, XTM) if available
- Option B: Extract monolingual (pyate/KeyBERT) + manual alignment
- Recommended: Option A for speed, Option B for algorithm quality
Integration with CAT Tools#
Export Formats:#
- CSV: Simple, universal (all CAT tools support)
- TBX (TermBase eXchange): Standard for terminology (SDL Trados, MemoQ)
- Excel: Bilingual glossaries (source | target | domain | notes)
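For TBX export, a minimal skeleton can be generated with the standard library. This is a sketch of the martif structure only, not a validated TBX-Basic file; CAT tools often require a specific dialect, so check the output against your tool's importer:

```python
import xml.etree.ElementTree as ET

def terms_to_tbx(pairs, src_lang="en", tgt_lang="it"):
    """Build a minimal TBX-style (martif) XML tree from (source, target)
    term pairs. Sketch only: real importers (SDL Trados, MemoQ) may
    require a stricter dialect such as TBX-Basic.
    """
    martif = ET.Element("martif", {"type": "TBX", "xml:lang": src_lang})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for source, target in pairs:
        entry = ET.SubElement(body, "termEntry")
        for lang, term in ((src_lang, source), (tgt_lang, target)):
            lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = term
    return martif

tree = terms_to_tbx([("neural network", "rete neurale")])
xml_text = ET.tostring(tree, encoding="unicode")
```

CSV remains the safer default when in doubt; TBX pays off when the glossary carries metadata (domain, definitions, usage notes) that CSV flattens.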
Sample CSV Export (pyate/KeyBERT → CAT):#
```python
import pandas as pd
from pyate import combo_basic

text = "Your source document..."
terms = combo_basic(text).sort_values(ascending=False).head(100)

# Export to CSV for CAT tool import
df = pd.DataFrame({
    'Source Term': terms.index,
    'Termhood Score': terms.values,
    'Target Term': '',  # Fill manually or via MT
    'Domain': 'Technical',
    'Notes': ''
})
df.to_csv('glossary_for_cat.csv', index=False)
```

Value Proposition#
When Automated Extraction Justifies Effort:
- ✅ Large documents (>5,000 words) with technical terminology
- ✅ Recurring projects (same domain, build glossary once, reuse)
- ✅ Multiple translators (shared glossary ensures consistency)
- ✅ Tight deadlines (60-80% time savings on term prep)
When Manual Curation is Better:
- ❌ Small documents (<1,000 words) - extraction overhead > manual effort
- ❌ General content (few technical terms to extract)
- ❌ One-off projects (no glossary reuse value)
- ❌ High precision required (extraction ~70-80% precision, manual ~95%+)
Bottom Line for Translators#
pyate: Best for English/Italian technical translations. High precision, multi-word terms, domain-specific. Export to CAT tool via CSV/TBX.
KeyBERT: Best for CJK language pairs. Only viable automated option for Chinese/Japanese/Korean. Requires validation (keywords vs terminology, character-level output).
Recommendation: Use automated extraction for initial glossary creation (60-80% time savings), then human review and maintenance (precision improvement from 70-80% to 95%+). Automation handles volume, humans ensure quality.
S4: Strategic
S4 Strategic: Long-Term Recommendations#
Executive Summary#
Strategic analysis for 2-5 year horizon (2026-2031):
- pyate: Stable but niche (limited language support may constrain adoption)
- KeyBERT: Strong growth trajectory (BERT ecosystem expanding, multilingual momentum)
Strategic Recommendation: Hedge with both - pyate for current English/Italian precision, KeyBERT for future multilingual/CJK expansion.
Long-Term Viability Assessment#
pyate: Moderate-High Viability#
Organizational Backing:
- ✅ Individual developer (kevinlu1248), but listed in spaCy Universe (semi-official ecosystem)
- ⚠️ No large org backing (vs KeyBERT: broader community)
Development Status:
- ✅ Active (spaCy v3 support, 2023 releases)
- ✅ ~320 GitHub stars (modest but growing)
- ⚠️ Single maintainer risk (bus factor = 1)
Technology Trajectory:
- ⚠️ Statistical methods (C-Value, Combo Basic) are mature (little innovation expected)
- ❌ Language expansion blocked by corpus availability (CJK unlikely in 5-year horizon)
- ✅ spaCy ecosystem stable (mature NLP library, unlikely to change drastically)
5-Year Outlook: Good for English/Italian (will remain best-in-class for terminology extraction), but limited language expansion (corpus bottleneck). May remain niche tool.
Risk: If maintainer abandons, community may not sustain (small user base). Mitigation: Code is simple (can be forked/maintained internally if needed).
KeyBERT: High Viability#
Organizational Backing:
- ✅ Community-driven (broader contributor base than pyate)
- ✅ 3.5K+ GitHub stars (large user community)
- ✅ Part of sentence-transformers ecosystem (broader than single project)
Development Status:
- ✅ Very active (frequent releases, 2023-2024+)
- ✅ Multiple contributors (lower bus factor risk)
- ✅ Strong documentation, FAQ, guides
Technology Trajectory:
- ✅ BERT/transformer ecosystem growing (new models, better multilingual support)
- ✅ CJK tokenization improving (new Chinese BERT models with word-level tokenization emerging)
- ✅ sentence-transformers momentum (industry-standard for embeddings)
5-Year Outlook: Excellent (transformer ecosystem expanding, multilingual momentum). Likely to improve CJK quality as models evolve. Strong long-term bet.
Risk: Dependency on sentence-transformers ecosystem (if BERT falls out of favor). Mitigation: BERT remains dominant for embeddings (low risk of displacement 2026-2031).
Technology Evolution: Key Trends#
Trend 1: Transformer Dominance (Accelerating)#
Current: BERT-based models dominate embeddings
2026-2031: Expect continued transformer dominance (GPT, BERT, T5 families)
Impact on KeyBERT:
- ✅ Positive: New multilingual models will improve CJK quality
- ✅ Backward compatible: sentence-transformers supports new models (easy upgrade path)
Impact on pyate:
- ⚠️ Neutral-Negative: Statistical methods may seem “old-school” as transformers advance
- ❌ Risk: New entrants may build transformer-based terminology extractors (compete with pyate)
Trend 2: Multilingual NLP Expansion (Accelerating)#
Current: Focus on English, then major European languages
2026-2031: Expect greater CJK/low-resource language support (driven by global NLP demand)
Impact on KeyBERT:
- ✅ Major positive: Multilingual BERT models improving (better CJK word tokenization coming)
- ✅ Example: XLM-RoBERTa, mBERT improvements, Chinese-specific BERT variants
Impact on pyate:
- ❌ Negative: Corpus bottleneck for CJK remains (unlikely to be solved)
- ⚠️ Risk: Multilingual demand may favor KeyBERT-like approaches over statistical methods
Trend 3: CAT Tool Integration (Slow Evolution)#
Current: CAT tools have basic built-in extraction, resist external libraries
2026-2031: Slow adoption of advanced Python libraries (CAT vendors prefer proprietary)
Impact on Both:
- ⚠️ Neutral: Neither pyate nor KeyBERT likely to integrate directly into CAT tools
- ⚠️ Workflow remains: External extraction → export → CAT import (no change expected)
- ✅ Opportunity: API/microservice deployment patterns may enable integration
Trend 4: LLM-Based Extraction (Emerging Risk)#
Current: Few LLM-based terminology extractors (GPT-4 can extract, but expensive)
2026-2031: Potential disruption from LLM-based extraction (ChatGPT, Claude, Gemini)
LLM Approach:
- Prompt engineering: “Extract technical terms from this document”
- Zero-shot (no training data needed)
- Multilingual out-of-box (LLMs handle 100+ languages)
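The zero-shot approach above can be sketched as a prompt builder. The prompt wording and the call_llm stub are illustrative assumptions, not a tested recipe for any particular provider:

```python
def build_term_extraction_prompt(document, language="English"):
    """Zero-shot prompt sketch for LLM-based terminology extraction.
    Wording is illustrative; tune it per model and domain.
    """
    return (
        f"Extract the domain-specific technical terms from the {language} "
        "document below. Return one term per line, multi-word terms "
        "allowed, no explanations.\n\n"
        f"Document:\n{document}"
    )

def call_llm(prompt):
    # Hypothetical stub: wire this to whatever LLM API client you use.
    raise NotImplementedError("connect to your LLM provider here")

prompt = build_term_extraction_prompt(
    "Backpropagation updates weights via gradient descent."
)
```

Unlike pyate/KeyBERT, the per-document API cost and latency scale linearly with corpus size, which is why the cost comparison below matters at volume.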
Trade-offs vs pyate/KeyBERT:
- ✅ LLM: Better semantic understanding, no training/models needed
- ❌ LLM: Expensive ($0.01-0.10 per document), API dependency, slower
- ✅ pyate/KeyBERT: Cheap (run locally), fast, no API calls
Strategic Impact:
- pyate: May lose English terminology extraction to LLMs (if cost drops)
- KeyBERT: May lose keyword extraction to LLMs (semantic understanding advantage erodes)
- Survival: pyate/KeyBERT remain viable for high-volume, low-cost extraction (LLMs too expensive at scale)
Recommendation: Monitor LLM pricing (if drops below $0.001/document, pyate/KeyBERT value proposition weakens).
Community Health Assessment#
pyate Community#
GitHub Activity:
- ~320 stars, ~15 forks (modest)
- Recent commits: 2023 (active)
- Issues: ~20 open (responsive maintainer)
Community Size: Small (niche tool, limited adoption)
Sustainability: Moderate (single maintainer, but code is simple enough to fork)
5-Year Confidence: 70% (will likely remain maintained, but language expansion uncertain)
KeyBERT Community#
GitHub Activity:
- ~3.5K stars, ~500 forks (large)
- Recent commits: Very active (2023-2024+)
- Issues: ~50 open, quickly resolved
Community Size: Large (widely adopted, strong ecosystem)
Sustainability: High (multiple contributors, embedded in sentence-transformers ecosystem)
5-Year Confidence: 90% (strong trajectory, unlikely to be abandoned)
Strategic Recommendations#
For Organizations: Hedge Strategy#
Recommendation: Adopt both libraries based on use case, prepare for technology shifts.
Near-Term (2026-2027):
- pyate for English/Italian terminology extraction (current best-in-class)
- KeyBERT for CJK and multilingual projects (only viable option)
Mid-Term (2028-2029):
- Monitor LLM-based extraction (may disrupt if pricing drops)
- Re-evaluate pyate if maintenance slows (consider forking or migrating to alternatives)
- Upgrade KeyBERT models as CJK tokenization improves (expect better Chinese quality)
Long-Term (2030-2031):
- Consider LLM-based extraction if cost/performance competitive
- Maintain pyate fork internally if no longer maintained (code is simple)
- Expect KeyBERT ecosystem to mature (likely remains viable)
For Developers: Platform Choices#
If building translation/writing tools:
- Start with KeyBERT (multilingual support, future-proof)
- Add pyate if English/Italian precision critical
- Abstract interface (swap libraries as technology evolves)
Example Architecture:
```python
class TerminologyExtractor:
    """Thin wrapper that routes to the best backend per language.

    PyateExtractor and KeyBERTExtractor are assumed wrapper classes
    around pyate and KeyBERT respectively (not shown here).
    """

    def __init__(self, language):
        if language in ["en", "it"]:
            self.backend = PyateExtractor()    # High precision terminology
        else:
            self.backend = KeyBERTExtractor()  # Multilingual keywords

    def extract(self, text):
        return self.backend.extract(text)
```

Value: Decouple from specific library (easy to swap as LLM/new tools emerge).
For Translators/Writers: Practical Path#
Immediate (2026):
- Use CAT tool built-in extraction if available (convenience)
- Use pyate (English) or KeyBERT (CJK) for initial glossary creation if CAT insufficient
- Plan for human validation (60-80% precision, manual review essential)
Future (2027-2029):
- Experiment with LLM-based extraction (ChatGPT, Claude) as pricing drops
- Compare quality: LLM vs pyate/KeyBERT (may prefer LLM if precision > cost)
- Maintain current workflow until LLMs competitive
For Researchers: Open Questions#
Research Gaps (opportunities for 2026-2031):
- Transformer-based terminology extraction: Combine BERT embeddings with linguistic features (better than pure statistical or pure semantic)
- CJK word boundary detection: Improve Chinese/Japanese tokenization for terminology (current weak point)
- Bilingual terminology alignment: Automated source-target term pairing (currently manual)
- LLM fine-tuning for terminology: Fine-tune GPT/Claude for domain-specific term extraction (vs generic)
Risks and Mitigation#
Risk 1: pyate Abandonment (Moderate Probability)#
Scenario: Maintainer stops development, library becomes stale
Probability: 30% (single maintainer, modest community)
Impact: High for English/Italian terminology extraction
Mitigation:
- Fork pyate internally (code is simple, <1,000 LOC)
- Monitor GitHub activity (6-month no-commit = warning sign)
- Prepare migration to alternatives (KeyBERT + filtering, LLMs)
Risk 2: BERT Displacement by Newer Architectures (Low Probability)#
Scenario: GPT-style models replace BERT for embeddings
Probability: 20% (BERT remains strong for embeddings 2026-2031)
Impact: Moderate (sentence-transformers can adapt to new models)
Mitigation:
- sentence-transformers supports multiple backends (not locked to BERT)
- KeyBERT can use alternative embeddings (Flair, GPT, etc.)
Risk 3: LLM-Based Extraction Disrupts Market (Moderate Probability)#
Scenario: ChatGPT/Claude pricing drops to $0.001/document, making LLM extraction cheaper than pyate/KeyBERT
Probability: 40% (LLM pricing declining rapidly)
Impact: High (both libraries lose value proposition)
Mitigation:
- Monitor LLM pricing trends (monthly evaluation)
- Test LLM extraction quality (may replace libraries if precision competitive)
- Maintain local extraction for high-volume use cases (LLM API latency > local inference)
Risk 4: CJK Quality Stagnates (Moderate Probability)#
Scenario: Character-level Chinese tokenization remains (no word-level BERT improvement)
Probability: 30% (CJK NLP advancing, but word boundaries hard problem)
Impact: Moderate (KeyBERT CJK quality ~60-70%, not improving)
Mitigation:
- Use chinese_keybert for Chinese-only (better word segmentation)
- Human validation workflow (accept 60-70% precision as baseline)
- Explore custom Chinese BERT models (fine-tune on domain data)
Bottom Line: Strategic Positioning#
5-Year Outlook:
- pyate: Stable for English/Italian (70% confidence), but niche and limited language expansion
- KeyBERT: Strong trajectory (90% confidence), expanding multilingual support, embedded in growing ecosystem
- LLMs: Emerging wildcard (40% probability of disruption by 2029-2031)
Recommended Strategy:
- Near-term: Use pyate (English) + KeyBERT (CJK) based on use case
- Mid-term: Monitor LLM extraction quality and pricing (prepare to pivot)
- Long-term: Expect transformer ecosystem to dominate, pyate to remain niche, LLMs to compete for high-value extraction
Hedge: Abstract terminology extraction interface (swap backends as technology evolves). Don’t lock into single library.