jina-youtube-howto.md
8.9 KB
YouTube Video Content Extractor using Jina.ai
This script demonstrates how to extract YouTube video content using jina.ai summarizer, bypassing YouTube's restrictions on automated access.
How It Works
Jina.ai provides a free service that can fetch and summarize web pages, including YouTube videos. When you append a YouTube URL to https://r.jina.ai/http://, it:
- Fetches the YouTube page content
- Extracts video metadata, description, and available text content
- Returns it in clean markdown format
- Bypasses YouTube's JavaScript requirements and bot detection
Usage Examples
Basic Usage
# Extract video content
curl "https://r.jina.ai/http://www.youtube.com/watch?v=VIDEO_ID"
# Example with your video
curl "https://r.jina.ai/http://www.youtube.com/watch?v=Tauucda-NV4"
Python Script
import requests
import json
import sys
def extract_youtube_content(video_url):
"""Extract YouTube video content using jina.ai"""
# Remove any existing protocol and add jina.ai prefix
clean_url = video_url.replace("https://", "").replace("http://", "")
jina_url = f"https://r.jina.ai/http://{clean_url}"
try:
response = requests.get(jina_url, timeout=30)
response.raise_for_status()
# Parse the markdown content
content = response.text
# Extract key information
title = extract_title(content)
description = extract_description(content)
views = extract_views(content)
return {
'title': title,
'description': description,
'views': views,
'full_content': content
}
except Exception as e:
return {'error': str(e)}
def extract_title(content):
"""Extract video title from content"""
lines = content.split('\n')
for line in lines:
if line.strip().startswith('# ') and 'YouTube' not in line:
return line.strip('# ').strip()
return "Unknown"
def extract_description(content):
"""Extract video description"""
lines = content.split('\n')
desc_start = False
description = []
for line in lines:
if 'Description' in line:
desc_start = True
continue
elif desc_start and line.strip():
if line.startswith('### ') or line.startswith('['):
break
description.append(line.strip())
return '\n'.join(description)
def extract_views(content):
"""Extract view count"""
import re
views_match = re.search(r'(\d+)\s*views', content)
return views_match.group(1) if views_match else "Unknown"
# Usage
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: python youtube_extractor.py <youtube_url>")
sys.exit(1)
video_url = sys.argv[1]
result = extract_youtube_content(video_url)
if 'error' in result:
print(f"Error: {result['error']}")
else:
print(f"Title: {result['title']}")
print(f"Views: {result['views']}")
print(f"Description:\n{result['description']}")
Shell Script
#!/bin/bash
# YouTube Content Extractor using Jina.ai
# Usage: ./extract_youtube.sh <youtube_url>
if [ $# -eq 0 ]; then
echo "Usage: $0 <youtube_url>"
exit 1
fi
YOUTUBE_URL="$1"
JINA_URL="https://r.jina.ai/http://${YOUTUBE_URL#https://}"
echo "Extracting content from: $YOUTUBE_URL"
echo "========================================"
curl -s "$JINA_URL" | \
sed -n '/Description:/,$p' | \
head -n -1
echo "========================================"
Advanced Usage for MCP Integration
Integration with AI Analysis
def analyze_video_with_ai(video_url, ai_client):
"""Extract video content and analyze with AI"""
# Extract content
content = extract_youtube_content(video_url)
if 'error' in content:
return content
# Prepare analysis prompt
prompt = f"""
Analyze this YouTube video content:
Title: {content['title']}
Description: {content['description']}
Views: {content['views']}
Provide insights on:
1. Main topic/subject
2. Key actions demonstrated
3. Technical details shown
4. Notable results or outcomes
"""
# Send to AI for analysis
analysis = ai_client.generate(prompt)
return {
'video_data': content,
'analysis': analysis
}
Batch Processing
def process_video_list(video_urls):
"""Process multiple YouTube videos"""
results = []
for url in video_urls:
print(f"Processing: {url}")
result = extract_youtube_content(url)
results.append(result)
# Rate limiting
time.sleep(1)
return results
# Example usage
video_urls = [
"https://www.youtube.com/watch?v=Tauucda-NV4",
"https://www.youtube.com/watch?v=ANOTHER_VIDEO_ID"
]
results = process_video_list(video_urls)
Integration with Moqui MCP
MCP Service for Video Analysis
<service verb="analyze" noun="YouTubeVideo" authenticate="true">
<description>Analyze YouTube video content using jina.ai</description>
<in-parameters>
<parameter name="videoUrl" required="true" type="String"/>
</in-parameters>
<out-parameters>
<parameter name="analysis" type="Map"/>
</out-parameters>
<actions>
<script><![CDATA[
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
// Extract content using jina.ai
def cleanUrl = videoUrl.replace("https://", "").replace("http://", "")
def jinaUrl = "https://r.jina.ai/http://${cleanUrl}"
def connection = new URL(jinaUrl).openConnection()
def response = connection.inputStream.text
// Parse response (simplified)
def lines = response.split('\n')
def title = "Unknown"
def description = ""
for (line in lines) {
if (line.startsWith('# ') && title == "Unknown") {
title = line.replace('# ', '').replace(' - YouTube', '').trim()
}
if (line.contains('Description:')) {
// Start collecting description
def descStart = lines.indexOf(line) + 1
description = lines[descStart..-1].join('\n').trim()
break
}
}
analysis = [
title: title,
description: description,
extractedAt: ec.user.nowTimestamp,
source: 'jina.ai'
]
]]></script>
</actions>
</service>
Limitations and Considerations
What Jina.ai Extracts
-
Video title and metadata -
Video description text -
View count and upload date -
Channel information -
Related videos (titles only)
What It Doesn't Extract
-
Actual video content/transcript -
Audio from video -
Visual frames or screenshots -
Comments (requires login)
Rate Limiting
- Jina.ai is a free service - implement rate limiting
- Add delays between requests
- Cache results when possible
Error Handling
- Check for 403 errors (private/deleted videos)
- Handle network timeouts
- Validate YouTube URL format
Security and Privacy
Data Handling
- Only processes publicly available YouTube metadata
- No authentication required
- Content is extracted via third-party service
Usage Guidelines
- Respect YouTube's Terms of Service
- Don't circumvent paywalls or private content
- Use for legitimate research/analysis purposes
Alternative Services
If jina.ai is unavailable, similar services include:
-
https://r.jina.ai/http://URL(primary) -
https://r.jina.ai/http://URL&format=json(JSON format) - Custom scrapers (more complex)
Troubleshooting
Common Issues
- 403 Errors: Video is private or deleted
- Empty Content: Video has no description
- Rate Limiting: Too many requests too quickly
- Network Issues: Connection timeouts
Debug Mode
def debug_extract(video_url):
"""Debug version with detailed logging"""
print(f"Original URL: {video_url}")
clean_url = video_url.replace("https://", "").replace("http://", "")
jina_url = f"https://r.jina.ai/http://{clean_url}"
print(f"Jina URL: {jina_url}")
try:
response = requests.get(jina_url, timeout=30)
print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.text)}")
if response.status_code == 200:
print("✅ Success!")
else:
print("❌ Failed!")
except Exception as e:
print(f"❌ Exception: {e}")
This approach turns the jina.ai trick into a reusable skill for extracting YouTube video metadata and descriptions for analysis, documentation, or integration with other systems.