CVE-2024-0243
📋 TL;DR
This CVE describes a Server-Side Request Forgery (SSRF) vulnerability in LangChain's RecursiveUrlLoader where an attacker controlling the initial crawled website can trick the crawler into fetching content from arbitrary external domains despite the prevent_outside=True setting. This affects any application using vulnerable versions of LangChain's recursive URL loader functionality.
💻 Affected Systems
- langchain-ai/langchain
📦 What is this software?
LangChain is an open-source framework for building applications powered by large language models, developed by LangChain (langchain-ai). The affected component, RecursiveUrlLoader, is a document loader that recursively crawls a website starting from a root URL.
⚠️ Risk & Real-World Impact
Worst Case
Attackers could use the crawler as a proxy to access internal network resources, perform port scanning, or retrieve sensitive data from systems that trust the crawler's IP address.
Likely Case
Data exfiltration from internal services, unauthorized access to cloud metadata endpoints, or fetching malicious content that could lead to further exploitation.
If Mitigated
Limited to accessing only publicly available external resources, still potentially enabling information gathering or content injection.
🎯 Exploit Status
Exploitation requires control over the content of the initially crawled website. The vulnerability is well documented, with a public proof of concept in the huntr bug bounty report.
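The bypass is typical of prefix-based URL checks. As an illustrative sketch (not LangChain's actual code), a naive startswith comparison accepts attacker-controlled hosts whose URL merely begins with the trusted base URL, while comparing parsed hostnames does not:

```python
from urllib.parse import urlparse

base_url = "https://docs.example.com"

def naive_check(link):
    # Naive guard: accepts any URL whose text starts with the base URL
    return link.startswith(base_url)

def host_check(link):
    # Safer guard: compare the parsed hostname exactly
    return urlparse(link).hostname == urlparse(base_url).hostname

malicious = "https://docs.example.com.attacker.net/exfil"
print(naive_check(malicious))  # True  -- the prefix check is fooled
print(host_check(malicious))   # False -- hostname comparison rejects it
```

The hostnames `docs.example.com` and `base_url` here are placeholders; the point is that string-prefix checks on URLs are not a domain boundary.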
🛠️ Fix & Mitigation
✅ Official Fix
Patch Version: Commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22 or later
Vendor Advisory: https://github.com/langchain-ai/langchain/pull/15559
Restart Required: No system restart, but long-running Python processes must be restarted to load the patched module
Instructions:
1. Update langchain-community to a release containing commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22 or later
2. Confirm that libs/community/langchain_community/document_loaders/recursive_url_loader.py matches the patched version
3. Restart any long-running Python processes so they pick up the updated code
🔧 Temporary Workarounds
Implement custom URL validation

Add URL validation before passing URLs to RecursiveUrlLoader to ensure that only expected domains are crawled:

```python
# Validate a URL against an allowlist of domains before crawling
from urllib.parse import urlparse

def validate_url(url, allowed_domains):
    host = urlparse(url).hostname or ""
    # Match the exact domain or a subdomain; a bare endswith() check would
    # also accept look-alike hosts such as "evilexample.com"
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```
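Domain validation alone does not stop an attacker-controlled DNS name from resolving to an internal address. A complementary sketch (a hypothetical helper, not part of LangChain) rejects URLs that resolve to private, loopback, or link-local ranges such as the cloud metadata endpoint:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def resolves_to_public_ip(url):
    """Return True only if every resolved address is public (hypothetical helper)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Reject RFC 1918, loopback, and link-local (e.g. 169.254.169.254)
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True
```

Note that this check is advisory: DNS rebinding can still change the answer between validation and fetch, so network-level egress controls remain the stronger defense.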
Use an allowlist approach

Maintain an explicit allowlist of domains that may be crawled, rather than relying solely on the prevent_outside parameter:

```python
# Example allowlist implementation: filter candidate URLs before handing
# them to the loader (candidate_urls is whatever URL set you plan to crawl)
allowed_domains = ['example.com', 'trusted.org']
urls_to_crawl = [u for u in candidate_urls if validate_url(u, allowed_domains)]
```
🧯 If You Can't Patch
- Implement network-level restrictions to limit crawler's outbound connections
- Monitor crawler activity for unexpected external domain requests
🔍 How to Verify
Check if Vulnerable:
Check if your langchain_community/document_loaders/recursive_url_loader.py file contains the vulnerable URL parsing logic from before commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22
Check Version:
pip show langchain-community | grep Version
Verify Fix Applied:
Verify the recursive_url_loader.py file includes the fix from PR #15559 with proper URL domain validation
📡 Detection & Monitoring
Log Indicators:
- Crawler accessing unexpected external domains
- URLs with mismatched domains in crawl logs
Network Indicators:
- Outbound HTTP requests from crawler to unexpected domains
- Requests to internal IP ranges from crawler
SIEM Query (pseudo-syntax; adapt to your SIEM's query language):
source="crawler_logs" AND (url NOT CONTAINS "expected-domain.com" OR url CONTAINS "internal-ip")
🔗 References
- https://github.com/langchain-ai/langchain/commit/bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22
- https://github.com/langchain-ai/langchain/pull/15559
- https://huntr.com/bounties/370904e7-10ac-40a4-a8d4-e2d16e1ca861