CVE-2024-0243

8.1 HIGH

📋 TL;DR

This CVE describes a Server-Side Request Forgery (SSRF) vulnerability in LangChain's RecursiveUrlLoader where an attacker controlling the initial crawled website can trick the crawler into fetching content from arbitrary external domains despite the prevent_outside=True setting. This affects any application using vulnerable versions of LangChain's recursive URL loader functionality.

💻 Affected Systems

Products:
  • langchain-ai/langchain
Versions: all versions before commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22
Operating Systems: all
Default Config Vulnerable: ⚠️ Yes
Notes: Only affects applications that use RecursiveUrlLoader with prevent_outside=True; the flaw lies in the URL parsing logic that is supposed to enforce that setting.

📦 What is this software?

LangChain is a widely used open-source framework for building applications on top of large language models. Its RecursiveUrlLoader document loader takes a starting URL and recursively follows links up to a configured depth to ingest pages as documents; the prevent_outside=True option is intended to restrict the crawl to the starting domain.

⚠️ Risk & Real-World Impact

🔴 Worst Case

Attackers could use the crawler as a proxy to access internal network resources, perform port scanning, or retrieve sensitive data from systems that trust the crawler's IP address.

🟠 Likely Case

Data exfiltration from internal services, unauthorized access to cloud metadata endpoints, or fetching malicious content that could lead to further exploitation.

🟢 If Mitigated

Limited to accessing only publicly available external resources, still potentially enabling information gathering or content injection.

🌐 Internet-Facing: HIGH
🏢 Internal Only: MEDIUM

🎯 Exploit Status

Public PoC: ⚠️ Yes
Weaponized: LIKELY
Unauthenticated Exploit: ⚠️ Yes
Complexity: LOW

Exploit requires control over the initial crawled website content. The vulnerability is well-documented with public proof-of-concept in the bug bounty report.

🛠️ Fix & Mitigation

✅ Official Fix

Patch Version: Commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22 or later

Vendor Advisory: https://github.com/langchain-ai/langchain/pull/15559

Restart Required: No

Instructions:

1. Update LangChain to a version containing commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22 or later
2. The fix lives in libs/community/langchain_community/document_loaders/recursive_url_loader.py; confirm your installed copy includes it
3. No restart is required for Python applications

🔧 Temporary Workarounds

Implement custom URL validation

Platforms: all

Add URL validation before passing URLs to RecursiveUrlLoader to ensure each URL matches an expected domain:

# Python code to validate URLs before crawling
from urllib.parse import urlparse

def validate_url(url, allowed_domains):
    host = (urlparse(url).hostname or "").lower()
    # Require an exact match or a dot-separated subdomain; a bare endswith()
    # check would also accept look-alike hosts such as "evilexample.com"
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
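Worth noting: a plain suffix comparison on the host is bypassable, since "evilexample.com".endswith("example.com") is true. The sketch below (is_allowed_host is an illustrative helper, not part of LangChain's API) demonstrates the stricter exact-match-or-subdomain rule:

```python
# Illustrative host check; is_allowed_host is a hypothetical helper,
# not a LangChain function.
from urllib.parse import urlparse

def is_allowed_host(url, allowed_domains):
    host = (urlparse(url).hostname or "").lower()
    # Accept only an exact match or a dot-separated subdomain
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_allowed_host("https://example.com/page", ["example.com"]))   # True
print(is_allowed_host("https://docs.example.com/", ["example.com"]))  # True
# A bare endswith() check would wrongly accept this look-alike host:
print(is_allowed_host("https://evilexample.com/", ["example.com"]))   # False
```

The same rule generalizes to any crawler front end: validate the parsed hostname, never the raw URL string, since prefixes and suffixes of the full URL are attacker-controllable.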

Use allowlist approach

Platforms: all

Maintain an explicit allowlist of domains that may be crawled instead of relying on the prevent_outside parameter:

# Example allowlist implementation (illustrative domains)
from urllib.parse import urlparse
allowed_domains = ['example.com', 'trusted.org']
# Keep only URLs whose host is on the allowlist before passing them to the loader
safe_urls = [u for u in urls if urlparse(u).hostname in allowed_domains]

🧯 If You Can't Patch

  • Implement network-level restrictions to limit crawler's outbound connections
  • Monitor crawler activity for unexpected external domain requests
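The first bullet can also be approximated in-process. The sketch below (is_public_target is an illustrative helper, not a LangChain API) resolves each hostname before any request is made and rejects targets in private, loopback, or link-local ranges, which covers cloud metadata endpoints such as 169.254.169.254:

```python
# Illustrative pre-request guard: resolve the host and reject internal targets.
import ipaddress
import socket
from urllib.parse import urlparse

def is_public_target(url):
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable: refuse rather than guess
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

# is_public_target("http://169.254.169.254/latest/meta-data/")  -> False
# is_public_target("http://127.0.0.1/")                         -> False
```

Note this check is advisory only: DNS rebinding can still change what the host resolves to between the check and the request, so network-level egress filtering remains the stronger control.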

🔍 How to Verify

Check if Vulnerable:

Check if your langchain_community/document_loaders/recursive_url_loader.py file contains the vulnerable URL parsing logic from before commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22

Check Version:

pip show langchain-community | grep Version
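The same lookup can be scripted with the standard library (a sketch; note this CVE is keyed to a commit, not a release number, so compare against whichever release your tree maps that commit to):

```python
# Return the installed distribution's version, or None if it is not installed.
from importlib.metadata import version, PackageNotFoundError

def installed_version(dist_name):
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

print(installed_version("langchain-community"))  # version string if installed, else None
```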

Verify Fix Applied:

Verify the recursive_url_loader.py file includes the fix from PR #15559 with proper URL domain validation

📡 Detection & Monitoring

Log Indicators:

  • Crawler accessing unexpected external domains
  • URLs with mismatched domains in crawl logs

Network Indicators:

  • Outbound HTTP requests from crawler to unexpected domains
  • Requests to internal IP ranges from crawler
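Both indicator classes can be checked offline against crawl logs. In the sketch below, flag_suspicious is an illustrative helper and the domain list is an assumption; it flags any URL whose host is off the expected list or is a private or link-local IP literal:

```python
# Illustrative log triage: flag off-allowlist hosts and internal IP literals.
import ipaddress
from urllib.parse import urlparse

def flag_suspicious(urls, expected_domains):
    flagged = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        try:
            ip = ipaddress.ip_address(host)
            internal = ip.is_private or ip.is_loopback or ip.is_link_local
        except ValueError:
            internal = False  # host is a name, not an IP literal
        expected = any(host == d or host.endswith("." + d) for d in expected_domains)
        if internal or not expected:
            flagged.append(url)
    return flagged

logs = [
    "https://example.com/docs/a",
    "http://169.254.169.254/latest/meta-data/",
    "https://evil.test/exfil",
]
print(flag_suspicious(logs, ["example.com"]))
# -> ['http://169.254.169.254/latest/meta-data/', 'https://evil.test/exfil']
```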

SIEM Query:

source="crawler_logs" (NOT url="*expected-domain.com*" OR url="*169.254.169.254*")

(Splunk-style example; substitute your expected crawl domains and add terms for your internal IP ranges.)
