CVE-2021-43854

CVSS v3.1: 7.5 (HIGH)

📋 TL;DR

CVE-2021-43854 is a regular expression denial of service (ReDoS) vulnerability in NLTK's tokenization functions. Attackers can craft malicious input to cause excessive CPU consumption and service degradation. Users of NLTK's PunktSentenceTokenizer, sent_tokenize, or word_tokenize functions with untrusted input are affected.

💻 Affected Systems

Products:
  • NLTK (Natural Language Toolkit)
Versions: All versions prior to 3.6.6
Operating Systems: All platforms running Python
Default Config Vulnerable: ⚠️ Yes
Notes: Only affects systems using PunktSentenceTokenizer, sent_tokenize, or word_tokenize functions with untrusted input.

📦 What is this software?

NLTK (Natural Language Toolkit) is a widely used open-source Python library for natural language processing, providing tokenizers, taggers, parsers, and corpus resources. Its sent_tokenize and word_tokenize functions are common entry points in text-preprocessing pipelines, which is why this vulnerability frequently sits directly behind user-facing input.

⚠️ Risk & Real-World Impact

🔴 Worst Case

Complete service unavailability due to resource exhaustion, potentially affecting downstream applications that depend on NLTK tokenization.

🟠 Likely Case

Degraded performance and increased response times for applications processing user-provided text, leading to partial service disruption.

🟢 If Mitigated

Minimal impact with proper input validation and version upgrades, maintaining normal service functionality.

🌐 Internet-Facing: HIGH
🏢 Internal Only: MEDIUM

🎯 Exploit Status

Public PoC: ⚠️ Yes
Weaponized: LIKELY
Unauthenticated Exploit: ⚠️ Yes
Complexity: LOW

Exploitation requires only submitting a crafted input string to a vulnerable tokenization function; no authentication or special privileges are needed.
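The mechanics of ReDoS can be illustrated with a generic backtracking-prone pattern. This is a simplified stand-in for demonstration, not NLTK's actual Punkt regex:

```python
import re

# Hypothetical illustration of catastrophic backtracking: the nested
# quantifier in '(a+)+' forces a backtracking regex engine to try
# exponentially many ways of splitting a run of 'a's before it can
# conclude that no match exists.
EVIL_PATTERN = re.compile(r"^(a+)+$")

def never_matches(n: int) -> bool:
    """An input of n 'a's followed by 'b' cannot match, but rejecting it
    costs roughly 2**n backtracking attempts."""
    return EVIL_PATTERN.match("a" * n + "b") is not None

# Small n returns quickly; each additional character roughly doubles the
# work, so attacker-controlled input length translates directly into CPU time.
print(never_matches(15))  # False
```

This is why a length cap (see the workaround below) is an effective stopgap: it bounds the exponent an attacker can reach.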

🛠️ Fix & Mitigation

✅ Official Fix

Patch Version: 3.6.6 and later

Vendor Advisory: https://github.com/nltk/nltk/security/advisories/GHSA-f8m6-h2c7-8h9x

Restart Required: No

Instructions:

1. Upgrade NLTK using pip: pip install --upgrade "nltk>=3.6.6" (quote the requirement so the shell does not treat >= as redirection)
2. Verify installation: pip show nltk
3. Test tokenization functions with sample inputs to ensure functionality.
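For deployments that should refuse to start on a vulnerable NLTK, a startup check along these lines can help. This is a sketch: the parsing assumes plain dotted numeric version strings, and the commented-out usage assumes nltk is importable in your environment:

```python
def nltk_is_patched(version: str, fixed=(3, 6, 6)) -> bool:
    """Return True if the given NLTK version string is at or past the
    fixed release. Assumes standard X.Y.Z numeric versions."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    # Pad short versions like "3.7" so tuple comparison is well-defined.
    parts += (0,) * (3 - len(parts))
    return parts >= fixed

# Hypothetical usage at application startup:
# import nltk
# if not nltk_is_patched(nltk.__version__):
#     raise RuntimeError("Vulnerable NLTK version; upgrade to >= 3.6.6")
```

Tuple comparison keeps the check dependency-free; for pre-release or non-numeric version strings, prefer packaging.version.parse instead.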

🔧 Temporary Workarounds

Input Length Limitation

Applies to: all platforms

Limit the maximum input length passed to the vulnerable tokenization functions to bound worst-case execution time.

# Python example
MAX_INPUT_LENGTH = 10000  # tune to your application's needs

def safe_tokenize(user_input):
    if len(user_input) > MAX_INPUT_LENGTH:
        raise ValueError('Input too long')
    from nltk import word_tokenize  # or sent_tokenize
    return word_tokenize(user_input)

🧯 If You Can't Patch

  • Implement strict input validation and length limits on all user-provided text before tokenization.
  • Deploy rate limiting and monitoring to detect abnormal resource consumption patterns.
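Beyond validation and rate limiting, a hard time cap on each tokenization call bounds the damage from any input that slips through. A minimal sketch using SIGALRM follows; it is Unix-only, must run in the main thread, and the tokenizer shown in the usage comment is a placeholder for your actual sent_tokenize/word_tokenize call:

```python
import signal

class TokenizeTimeout(Exception):
    """Raised when a tokenization call exceeds its time budget."""

def _on_alarm(signum, frame):
    raise TokenizeTimeout("tokenization exceeded time budget")

def with_timeout(func, arg, seconds=2):
    """Run func(arg), aborting with TokenizeTimeout after `seconds`.
    Unix-only: relies on SIGALRM and must be called from the main thread."""
    old = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return func(arg)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)

# Hypothetical usage: tokens = with_timeout(nltk.word_tokenize, user_text)
```

On platforms without SIGALRM (e.g. Windows), running tokenization in a worker process with a join timeout achieves the same bound.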

🔍 How to Verify

Check if Vulnerable:

Check NLTK version: python -c "import nltk; print(nltk.__version__)" - if the version is below 3.6.6, the system is vulnerable.

Check Version:

python -c "import nltk; print('NLTK version:', nltk.__version__)"

Verify Fix Applied:

After upgrading, verify the version is >= 3.6.6 and test the tokenization functions with representative inputs to confirm normal performance.

📡 Detection & Monitoring

Log Indicators:

  • Unusually long execution times for tokenization functions
  • High CPU usage by Python processes running NLTK
  • Application timeouts or degraded performance

Network Indicators:

  • Increased response times for text processing endpoints
  • Service degradation patterns

SIEM Query:

Processes with high CPU usage AND command line containing 'python' AND (nltk OR tokenize)

🔗 References

  • GitHub Security Advisory: https://github.com/nltk/nltk/security/advisories/GHSA-f8m6-h2c7-8h9x
  • NVD entry: https://nvd.nist.gov/vuln/detail/CVE-2021-43854
