CVE-2021-43854
📋 TL;DR
CVE-2021-43854 is a regular expression denial of service (ReDoS) vulnerability in NLTK's tokenization code in versions prior to 3.6.5. An attacker can craft a long, malicious input that makes tokenization consume excessive CPU time and degrades the service. Applications that pass untrusted input to the PunktSentenceTokenizer class or to the sent_tokenize or word_tokenize functions are affected.
💻 Affected Systems
- NLTK (Natural Language Toolkit), versions prior to 3.6.5
📦 What is this software?
NLTK by NLTK
⚠️ Risk & Real-World Impact
Worst Case
Complete service unavailability due to resource exhaustion, potentially affecting downstream applications that depend on NLTK tokenization.
Likely Case
Degraded performance and increased response times for applications processing user-provided text, leading to partial service disruption.
If Mitigated
Minimal impact with proper input validation and version upgrades, maintaining normal service functionality.
🎯 Exploit Status
Exploitation requires crafting specific input patterns but doesn't require authentication or special privileges.
🛠️ Fix & Mitigation
✅ Official Fix
Patch Version: 3.6.5 and later
Vendor Advisory: https://github.com/nltk/nltk/security/advisories/GHSA-f8m6-h2c7-8h9x
Restart Required: No
Instructions:
1. Upgrade NLTK using pip (quote the requirement so the shell does not treat > as a redirect): pip install --upgrade "nltk>=3.6.5"
2. Verify installation: pip show nltk
3. Test tokenization functions with sample inputs to ensure functionality.
🔧 Temporary Workarounds
Input Length Limitation
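Limit the maximum length of input passed to the vulnerable tokenization functions to bound execution time.
# Python example
MAX_INPUT_LENGTH = 10000  # upper bound on characters accepted per input
# user_input is the untrusted text received by the application
if len(user_input) > MAX_INPUT_LENGTH:
    raise ValueError('Input too long')
# Then call tokenization functions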
🧯 If You Can't Patch
- Implement strict input validation and length limits on all user-provided text before tokenization.
- Deploy rate limiting and monitoring to detect abnormal resource consumption patterns.
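One way to apply the validation step consistently is to route every tokenization call through a single guard. The sketch below is illustrative only: safe_tokenize and MAX_INPUT_LENGTH are hypothetical names, the limit of 10,000 characters is an assumption to tune per application, and the tokenizer is passed in as a callable so the example runs without NLTK installed.

```python
from typing import Callable, List

# Hypothetical limit; tune it to the longest input the application legitimately needs.
MAX_INPUT_LENGTH = 10_000


def safe_tokenize(
    text: str,
    tokenizer: Callable[[str], List[str]],
    max_length: int = MAX_INPUT_LENGTH,
) -> List[str]:
    """Validate untrusted text, then hand it to the given tokenizer."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    if len(text) > max_length:
        # Reject rather than truncate, so callers notice over-length input.
        raise ValueError(f"Input too long: {len(text)} > {max_length} characters")
    return tokenizer(text)


# str.split stands in for an NLTK tokenizer in this sketch.
tokens = safe_tokenize("a short sample sentence", str.split)
```

In an application that uses NLTK, the second argument would presumably be nltk.sent_tokenize or nltk.word_tokenize instead of str.split.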
🔍 How to Verify
Check if Vulnerable:
Check the NLTK version: python -c "import nltk; print(nltk.__version__)". If the version is lower than 3.6.5, the system is vulnerable.
Check Version:
python -c "import nltk; print('NLTK version:', nltk.__version__)"
Verify Fix Applied:
After upgrading, verify that the version is 3.6.5 or later and test tokenization with varied inputs to confirm normal performance.
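The version comparison can also be automated, for example in a deployment check. The following is a minimal sketch using only the standard library; parse_version and is_vulnerable are hypothetical helper names, and the parser deliberately ignores any pre-release suffix (so "3.6.5rc1" is treated like "3.6.5"), which is a simplifying assumption.

```python
import re
from typing import Tuple

FIXED_VERSION = (3, 6, 5)  # patch version stated in this advisory


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '3.6.5' -> (3, 6, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_vulnerable(version: str) -> bool:
    """Return True if the version is lower than the fixed version."""
    parsed = parse_version(version)
    # Pad short versions so that '3.7' compares as (3, 7, 0).
    padded = parsed + (0,) * (len(FIXED_VERSION) - len(parsed))
    return padded < FIXED_VERSION
```

With NLTK installed, the check would be called as is_vulnerable(nltk.__version__).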
📡 Detection & Monitoring
Log Indicators:
- Unusually long execution times for tokenization functions
- High CPU usage by Python processes running NLTK
- Application timeouts or degraded performance
Network Indicators:
- Increased response times for text processing endpoints
- Service degradation patterns
SIEM Query:
Processes with high CPU usage AND command line containing 'python' AND (nltk OR tokenize)
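To make the "unusually long execution times" indicator visible in application logs, tokenization calls can be timed and slow ones reported. The sketch below is one possible approach, not part of NLTK: log_slow_calls, the logger name, and the thresholds are all assumptions, and str.split stands in for the real tokenizer so the example runs without NLTK. Note that it only reports a slow call after it returns; it does not interrupt it.

```python
import functools
import logging
import time

logger = logging.getLogger("tokenization")

# Hypothetical threshold; choose one based on normal latency in your service.
SLOW_CALL_SECONDS = 1.0


def log_slow_calls(threshold: float = SLOW_CALL_SECONDS):
    """Decorator that logs a warning when the wrapped call exceeds the threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold:
                    logger.warning(
                        "%s took %.2fs (threshold %.2fs)",
                        func.__name__, elapsed, threshold,
                    )
        return wrapper
    return decorator


@log_slow_calls(threshold=0.5)
def tokenize(text):
    # Stand-in for nltk.word_tokenize so the sketch runs without NLTK installed.
    return text.split()
```

The resulting warnings can then feed the same alerting pipeline as the CPU and timeout indicators listed above.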
🔗 References
- https://github.com/nltk/nltk/commit/1405aad979c6b8080dbbc8e0858f89b2e3690341
- https://github.com/nltk/nltk/issues/2866
- https://github.com/nltk/nltk/pull/2869
- https://github.com/nltk/nltk/security/advisories/GHSA-f8m6-h2c7-8h9x