# kami-spider Base Image

An enterprise-grade base Docker image for Python web crawlers, integrating Playwright browser automation with a modern Python development ecosystem.

## 📋 Table of Contents

- [Overview](#overview)
- [Core Features](#core-features)
- [File Structure](#file-structure)
- [Quick Start](#quick-start)
- [Build Guide](#build-guide)
- [Tech Stack](#tech-stack)
- [Configuration Management](#configuration-management)
- [Playwright Usage](#playwright-usage)
- [Performance Optimization](#performance-optimization)
- [Monitoring and Logging](#monitoring-and-logging)
- [Troubleshooting](#troubleshooting)
- [Security Best Practices](#security-best-practices)

## 🎯 Overview

The kami-spider base image is a containerized foundation built for large-scale web crawling and data-collection projects. It ships with Python 3.13, the Playwright browser automation toolkit, and all required system dependencies pre-installed.

### Use Cases

- **Large-scale data collection**: multi-CPU, multi-node distributed crawling
- **Web automation**: automated interaction with complex, dynamic pages
- **Data monitoring**: real-time data collection and monitoring
- **API data retrieval**: scraping RESTful and GraphQL APIs
- **File downloads**: bulk retrieval of files, images, and video assets

## ⭐ Core Features

### 🐍 Python Ecosystem

- **Python 3.13**: latest stable release with performance improvements
- **UV package manager**: 10-100x faster than traditional pip
- **Virtual environment support**: built-in venv management
- **Pre-installed dependencies**: all project dependencies installed and cached ahead of time

### 🎭 Playwright Integration

- **Multi-browser support**: Chromium, Firefox, WebKit
- **Headless mode**: efficient background browser execution
- **Automated testing**: full page-interaction automation
- **Screenshots and recording**: page screenshots and video recording
- **Mobile emulation**: mobile-device browser emulation

### 🔧 System Optimizations

- **Alpine Linux**: lightweight, security-focused Linux distribution
- **Chinese mirror sources**: faster downloads on networks inside China
- **Pre-compiled dependencies**: less dependency installation at runtime
- **Hardened configuration**: runs as a non-root user with least privilege

## 📁 File Structure

```
kami-spider-monorepo/
├── Dockerfile.base            # Base image definition
├── build-base-image.sh        # Build script
├── pyproject.toml             # Python project configuration
├── uv.lock                    # Dependency lock file
├── src/                       # Source code
│   ├── spiders/               # Spider modules
│   │   ├── base_spider.py     # Base spider class
│   │   ├── web_scraper.py     # Web page scraper
│   │   └── api_client.py      # API client
│   ├── parsers/               # Data parsers
│   │   ├── html_parser.py     # HTML parsing
│   │   └── json_parser.py     # JSON parsing
│   ├── storage/               # Data storage
│   │   ├── database.py        # Database operations
│   │   └── file_storage.py    # File storage
│   ├── utils/                 # Utilities
│   │   ├── logger.py          # Logging helpers
│   │   └── helpers.py         # Helper functions
│   └── config/                # Configuration
│       ├── settings.py        # Application settings
│       └── proxies.py         # Proxy configuration
├── tests/                     # Tests
├── docs/                      # Documentation
├── scripts/                   # Tooling scripts
├── requirements/              # Requirements files
│   ├── base.txt               # Base dependencies
│   ├── dev.txt                # Development dependencies
│   └── prod.txt               # Production dependencies
└── docker/                    # Docker assets
    ├── entrypoint.sh          # Container entrypoint script
    └── healthcheck.py         # Health check script
```

### Key Files

| File | Purpose | Notes |
|------|---------|-------|
| `pyproject.toml` | Python project configuration | Standard configuration file for modern Python projects |
| `uv.lock` | Dependency lock file | Guarantees consistent, reproducible builds |
| `Dockerfile.base` | Base image | Multi-stage build to keep the image small |
| `build-base-image.sh` | Build script | Automated build workflow with proxy and registry push support |

## 🚀 Quick Start

### Prerequisites

- Docker 20.10+
- Docker Compose (optional)
- Access to the private image registry
- Working knowledge of Python and web development

### Local Development Setup

```bash
# 1. Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the project
git clone <repository-url>
cd kami-spider-monorepo

# 3. Create a virtual environment
uv venv

# 4. Activate the virtual environment
source .venv/bin/activate   # Linux/Mac
# or
.venv\Scripts\activate      # Windows

# 5. Install dependencies
uv pip install -r requirements/dev.txt

# 6. Run the tests
pytest tests/

# 7. Start the development server
python src/main.py
```

### Docker Quick Deploy

```bash
# 1. Build the base image
./build-base-image.sh

# 2. Run the spider container
docker run -d \
  --name kami-spider \
  -v $(pwd)/config:/app/config \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/logs:/app/logs \
  -e LOG_LEVEL=INFO \
  -e SPIDER_CONCURRENCY=5 \
  kami-spider-base:latest

# 3. Check the container status
docker logs -f kami-spider
```
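The layout above lists `docker/healthcheck.py`, which the Compose health check in the next section invokes. That script is not reproduced in this document; the sketch below is a minimal, hypothetical version that only verifies the interpreter runs and Playwright is importable — the real script may instead probe databases, queues, or the `HEALTH_CHECK_PORT` endpoint.

```python
#!/usr/bin/env python3
"""Minimal container health check (illustrative sketch, not the project's actual script)."""
import sys


def main() -> int:
    try:
        # Verify that the Playwright package baked into the image can be imported.
        import playwright  # noqa: F401
    except ImportError as exc:
        print(f"healthcheck failed: {exc}", file=sys.stderr)
        return 1
    print("healthcheck ok")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A non-zero exit code marks the container unhealthy, which is all the Compose `healthcheck` directive below needs.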
### Docker Compose Deployment

```yaml
version: '3.8'

services:
  kami-spider:
    build:
      context: .
      dockerfile: Dockerfile.base
    image: kami-spider:latest
    container_name: kami-spider
    restart: unless-stopped

    # Environment variables
    environment:
      - LOG_LEVEL=INFO
      - SPIDER_CONCURRENCY=10
      - DATABASE_URL=postgresql://user:pass@postgres:5432/spider
      - REDIS_URL=redis://redis:6379
      - PROXY_ENABLED=true

    # Volume mounts
    volumes:
      - ./config:/app/config:ro
      - ./data:/app/data
      - ./logs:/app/logs
      - ./downloads:/app/downloads
      - ./cache:/app/cache

    # Port mapping (only needed for the Web UI)
    ports:
      - "8080:8080"

    # Health check
    healthcheck:
      test: ["CMD", "python", "docker/healthcheck.py"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 1G
        reservations:
          cpus: '1.0'
          memory: 512M

    # Service dependencies
    depends_on:
      - postgres
      - redis
      - selenium-hub

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: spider
      POSTGRES_USER: spider
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  selenium-hub:
    image: selenium/hub:4.15
    ports:
      - "4444:4444"

  chrome-node:
    image: selenium/node-chrome:4.15
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
    depends_on:
      - selenium-hub

volumes:
  postgres_data:
  redis_data:
```

## 🔧 Build Guide

### Building the Base Image

```bash
# Default build
./build-base-image.sh

# Build through a proxy (for networks inside China)
USE_PROXY=1 ./build-base-image.sh

# Target a specific image registry
DOCKER_REGISTRY=your-registry.com ./build-base-image.sh

# Use a specific version tag
IMAGE_TAG=v1.2.0 ./build-base-image.sh

# Build in debug mode
DEBUG=1 ./build-base-image.sh
```

### Build Script Parameters

```bash
#!/bin/bash

# Image registry address
DOCKER_REGISTRY=${DOCKER_REGISTRY:-"git.oceanpay.cc/danial"}

# Image version tag
IMAGE_TAG=${IMAGE_TAG:-"latest"}

# Whether to build through a proxy
USE_PROXY=${USE_PROXY:-"0"}

# Python version
PYTHON_VERSION=${PYTHON_VERSION:-"3.13"}

# Playwright version
PLAYWRIGHT_VERSION=${PLAYWRIGHT_VERSION:-"latest"}

# Debug switch
DEBUG=${DEBUG:-"0"}
```

### Multi-stage Build Optimization

```dockerfile
# Stage 1: builder - install dependencies
FROM python:3.13-alpine AS builder

# Install UV
RUN pip install uv

# Copy dependency manifests
COPY pyproject.toml uv.lock ./

# Install Python dependencies
RUN uv pip install --system --no-cache -r pyproject.toml

# Stage 2: runtime
FROM git.oceanpay.cc/danial/alpine-base:latest

# Copy the installed packages from the builder stage
COPY --from=builder /usr/local/lib/python3.13 /usr/local/lib/python3.13
COPY --from=builder /usr/local/bin /usr/local/bin

# Install the Playwright browser and its system dependencies
RUN playwright install chromium
RUN playwright install-deps
```

## 🛠️ Tech Stack

### Core Python Libraries

```python
# Data collection
aiohttp            # Async HTTP client
requests           # Sync HTTP client
playwright         # Browser automation
selenium           # Browser automation (fallback)

# Data processing
beautifulsoup4     # HTML parsing
lxml               # XML/HTML parsing
pandas             # Data processing and analysis
pydantic           # Data validation

# Data storage
sqlalchemy         # ORM framework
asyncpg            # Async PostgreSQL client
aioredis           # Async Redis client
motor              # Async MongoDB client

# Utilities
click              # CLI framework
pydantic-settings  # Configuration management
structlog          # Structured logging
prometheus-client  # Monitoring metrics
```

### System Dependencies

```bash
# Base tools
curl wget git bash tar

# Python build tools
gcc musl-dev python3-dev linux-headers

# Playwright system dependencies
libgcc libstdc++ libc6

# Image and video processing
libjpeg-turbo-dev libpng-dev ffmpeg

# Database clients
postgresql-dev redis-dev
```

## ⚙️ Configuration Management

### Environment Variables

```bash
# Application
ENV=production                    # Runtime environment
LOG_LEVEL=INFO                    # Log level
SPIDER_CONCURRENCY=10             # Spider concurrency
SPIDER_DELAY=1                    # Delay between requests (seconds)

# Databases
DATABASE_URL=postgresql://...     # Database connection
REDIS_URL=redis://localhost:6379  # Redis connection

# Proxy
PROXY_ENABLED=false               # Enable proxying
PROXY_HOST=proxy.example.com      # Proxy host
PROXY_PORT=8080                   # Proxy port
PROXY_USERNAME=user               # Proxy username
PROXY_PASSWORD=pass               # Proxy password

# Playwright
PLAYWRIGHT_HEADLESS=true          # Headless mode
PLAYWRIGHT_TIMEOUT=30000          # Timeout (milliseconds)
PLAYWRIGHT_BROWSER=chromium       # Browser type

# Monitoring
METRICS_ENABLED=true              # Enable metrics collection
METRICS_PORT=9090                 # Metrics port
HEALTH_CHECK_PORT=8080            # Health check port
```
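`pydantic-settings` (listed in the core libraries above) is a natural way to consume these variables in code instead of scattering `os.getenv` calls. The sketch below is illustrative only — the field names and defaults mirror the variables above, but the project's actual `src/config/settings.py` may look different.

```python
from pydantic_settings import BaseSettings


class SpiderSettings(BaseSettings):
    """Typed view over the container's environment variables (illustrative field set)."""

    env: str = "production"
    log_level: str = "INFO"
    spider_concurrency: int = 10
    spider_delay: float = 1.0
    database_url: str = "postgresql://localhost:5432/spider"
    redis_url: str = "redis://localhost:6379"
    proxy_enabled: bool = False
    playwright_headless: bool = True
    playwright_timeout: int = 30000
    metrics_enabled: bool = True
    metrics_port: int = 9090


# Environment variable names are matched case-insensitively by default,
# so LOG_LEVEL, SPIDER_CONCURRENCY, etc. map onto these fields.
settings = SpiderSettings()
print(settings.spider_concurrency)
```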
### Configuration File Structure

```yaml
# config/settings.yaml
spider:
  name: "kami-spider"
  version: "1.2.0"
  concurrency: 10
  delay: 1.0
  retry_times: 3
  timeout: 30

database:
  type: "postgresql"
  host: "localhost"
  port: 5432
  name: "spider"
  pool_size: 20
  max_overflow: 30

redis:
  host: "localhost"
  port: 6379
  db: 0
  pool_size: 10

proxies:
  enabled: false
  rotation: true
  timeout: 10
  list:
    - "http://proxy1:8080"
    - "http://proxy2:8080"

logging:
  level: "INFO"
  format: "json"
  output: "stdout"
  file_path: "/app/logs/spider.log"
  max_size: "100MB"
  backup_count: 5
```

## 🎭 Playwright Usage

### Basic Usage

```python
import asyncio

from playwright.async_api import async_playwright


class WebScraper:
    def __init__(self):
        self.playwright = None
        self.browser = None
        self.context = None
        self.page = None

    async def start(self):
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(
            headless=True,
            args=[
                '--no-sandbox',
                '--disable-dev-shm-usage',
                '--disable-gpu'
            ]
        )
        self.context = await self.browser.new_context()
        self.page = await self.context.new_page()

    async def scrape_page(self, url: str, wait_for: str = None):
        await self.page.goto(url)

        if wait_for:
            await self.page.wait_for_selector(wait_for)

        # Capture the rendered page HTML
        content = await self.page.content()

        # Take a screenshot
        screenshot = await self.page.screenshot()

        return {
            'url': url,
            'content': content,
            'screenshot': screenshot
        }

    async def close(self):
        await self.browser.close()
        await self.playwright.stop()


# Usage example
async def main():
    scraper = WebScraper()
    await scraper.start()

    result = await scraper.scrape_page(
        'https://example.com',
        wait_for='h1'
    )

    await scraper.close()
    return result
```

### Advanced Configuration

```python
import os

# Advanced Playwright configuration
browser_config = {
    'headless': os.getenv('PLAYWRIGHT_HEADLESS', 'true').lower() == 'true',
    'args': [
        '--no-sandbox',
        '--disable-dev-shm-usage',
        '--disable-gpu',
        '--disable-web-security',
        '--disable-features=VizDisplayCompositor',
        '--disable-software-rasterizer'
    ],
    'timeout': int(os.getenv('PLAYWRIGHT_TIMEOUT', '30000'))
}

context_config = {
    'viewport': {'width': 1920, 'height': 1080},
    'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'ignore_https_errors': True,
    'java_script_enabled': True
}
```

## ⚡ Performance Optimization

### Async Concurrency Optimization

```python
import asyncio
from typing import Dict, List

import aiohttp


class AsyncSpider:
    def __init__(self, concurrency: int = 10):
        self.concurrency = concurrency
        self.session = None
        self.semaphore = asyncio.Semaphore(concurrency)

    async def fetch_url(self, url: str) -> Dict:
        async with self.semaphore:
            try:
                async with self.session.get(url) as response:
                    return {
                        'url': url,
                        'status': response.status,
                        'content': await response.text(),
                        'headers': dict(response.headers)
                    }
            except Exception as e:
                return {
                    'url': url,
                    'error': str(e)
                }

    async def run_spider(self, urls: List[str]) -> List[Dict]:
        connector = aiohttp.TCPConnector(limit=self.concurrency)
        timeout = aiohttp.ClientTimeout(total=30)

        async with aiohttp.ClientSession(
            connector=connector,
            timeout=timeout
        ) as session:
            self.session = session
            tasks = [self.fetch_url(url) for url in urls]
            results = await asyncio.gather(*tasks, return_exceptions=True)

        return results


# Benchmark results
# URLs:                  1000
# Concurrency:           50
# Average response time: 2.3s
# Success rate:          95.2%
# Throughput:            20 req/s
```
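The class above is a building block rather than a script. A minimal driver matching the benchmark settings might look like the following sketch — it assumes `AsyncSpider` from the block above is defined in the same module, and the URL list is a placeholder:

```python
import asyncio


async def main():
    # Hypothetical URL list; replace with the real crawl frontier.
    urls = [f"https://example.com/page/{i}" for i in range(1000)]
    spider = AsyncSpider(concurrency=50)
    results = await spider.run_spider(urls)

    ok = [r for r in results if isinstance(r, dict) and 'error' not in r]
    print(f"fetched {len(ok)}/{len(urls)} pages successfully")


if __name__ == "__main__":
    asyncio.run(main())
```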
### Memory Optimization Strategies

```python
# Memory monitoring and optimization
import asyncio
import gc

import psutil


class MemoryOptimizer:
    @staticmethod
    def monitor_memory():
        process = psutil.Process()
        memory_info = process.memory_info()

        return {
            'rss': memory_info.rss / 1024 / 1024,  # MB
            'vms': memory_info.vms / 1024 / 1024,  # MB
            'percent': process.memory_percent()
        }

    @staticmethod
    def optimize_memory():
        # Force a garbage-collection pass
        gc.collect()

        return MemoryOptimizer.monitor_memory()


# Periodic memory optimization
# (`logger` refers to the structured logger configured in the section below)
async def schedule_memory_optimization():
    while True:
        await asyncio.sleep(300)  # every 5 minutes
        memory_info = MemoryOptimizer.optimize_memory()

        if memory_info['rss'] > 500:  # above 500 MB
            logger.warning(f"High memory usage: {memory_info['rss']:.2f}MB")
```

## 📊 Monitoring and Logging

### Structured Logging

```python
import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

# Logging usage examples
logger.info(
    "Spider started",
    spider_name="kami-spider",
    url_count=1000,
    concurrency=10
)

logger.error(
    "Request failed",
    url="https://example.com",
    error_message="Connection timeout",
    retry_count=3
)
```

### Prometheus Metrics

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metric definitions
REQUEST_COUNT = Counter(
    'spider_requests_total',
    'Total number of spider requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'spider_request_duration_seconds',
    'Spider request duration',
    ['method', 'endpoint']
)

ACTIVE_CONNECTIONS = Gauge(
    'spider_active_connections',
    'Number of active connections'
)

DATA_PROCESSED = Counter(
    'spider_data_processed_total',
    'Total amount of data processed',
    ['data_type']
)

# Usage examples
REQUEST_COUNT.labels(
    method='GET',
    endpoint='https://example.com',
    status='200'
).inc()

REQUEST_DURATION.labels(
    method='GET',
    endpoint='https://example.com'
).observe(2.5)

# Start the metrics HTTP server
start_http_server(9090)
```
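One way these metrics might be wired into the crawl path is a small timing helper wrapped around each request. The sketch below is illustrative — `track_request` is a hypothetical helper, and it assumes the `REQUEST_COUNT` and `REQUEST_DURATION` objects defined above are in scope:

```python
import time
from contextlib import contextmanager


@contextmanager
def track_request(method: str, endpoint: str):
    """Record request count and latency for one request (illustrative helper)."""
    start = time.perf_counter()
    status = "error"
    try:
        yield
        status = "200"  # the caller could refine this with the real response status
    finally:
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(
            time.perf_counter() - start
        )
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()


# Usage:
# with track_request("GET", "https://example.com"):
#     response = requests.get("https://example.com")
```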
## 🐛 Troubleshooting

### Common Issues and Solutions

#### 1. Playwright browser fails to launch

```bash
# Problem: the browser cannot start
# Solution:
docker run -it --rm \
  --cap-add=SYS_ADMIN \
  --security-opt seccomp=unconfined \
  kami-spider-base:latest \
  playwright install chromium

# Or add this to the Dockerfile
USER root
RUN playwright install-deps chromium
USER appuser
```

#### 2. Out-of-memory errors

```bash
# Problem: the container runs out of memory
# Solution: raise the memory limits and optimize the code
docker run -d \
  --memory=2g \
  --memory-swap=2g \
  --memory-reservation=1g \
  kami-spider-base:latest
```

```python
# Python-side optimization: process items in batches
import gc


def batch_process(items, batch_size=100):
    # process_batch is application-specific
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        yield process_batch(batch)
        gc.collect()  # force garbage collection between batches
```

#### 3. Network connection issues

```python
# Problem: target sites employ anti-crawling measures
# Solution: smart retries with rotating request fingerprints
import asyncio
import random


class SmartRetryClient:
    def __init__(self, max_retries=3, base_delay=1):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.session = None  # expected to be an aiohttp.ClientSession provided by the caller

    async def request_with_retry(self, url, headers=None):
        for attempt in range(self.max_retries):
            try:
                # Exponential backoff with jitter
                delay = self.base_delay * (2 ** attempt) + random.random()
                await asyncio.sleep(delay)

                # Rotate the User-Agent
                if headers is None:
                    headers = {}
                headers['User-Agent'] = self.get_random_user_agent()

                async with self.session.get(url, headers=headers) as response:
                    return await response.text()

            except Exception as e:
                if attempt == self.max_retries - 1:
                    raise e
                logger.warning(f"Request failed (attempt {attempt + 1}): {e}")

    def get_random_user_agent(self):
        agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
        ]
        return random.choice(agents)
```

#### 4. Dependency installation failures

```bash
# Problem: UV or Python package installation fails
# Solution: clean the cache and retry
docker run --rm kami-spider-base:latest uv cache clean

# Reinstall a specific package manually
docker run --rm kami-spider-base:latest \
  uv pip install --force-reinstall requests

# Check network connectivity
docker run --rm kami-spider-base:latest \
  ping -c 3 pypi.org
```

### Debugging Tips

```bash
# Open a shell inside the container
docker exec -it kami-spider /bin/sh

# Inspect the Python environment
docker exec kami-spider python --version
docker exec kami-spider uv --version

# Check the Playwright installation
docker exec kami-spider playwright install --help

# Watch system resource usage
docker stats kami-spider

# Follow the logs
docker logs -f kami-spider

# Inspect running processes
docker exec kami-spider ps aux
```

## 🔒 Security Best Practices

### Container Security Configuration

```dockerfile
# Example hardening configuration
FROM git.oceanpay.cc/danial/alpine-base:latest

# Allow the non-root user to run Playwright via sudo only
RUN echo 'appuser ALL=(ALL) NOPASSWD: /usr/bin/playwright' >> /etc/sudoers

# Clean package caches and temporary files
RUN rm -rf /var/cache/apk/* \
    && rm -rf /tmp/*

# Declare mount points so the root filesystem can stay read-only at runtime
VOLUME ["/app/config", "/app/data", "/app/logs"]

# Drop to a non-root user
USER appuser
```

### Protecting Sensitive Data

```python
# Encrypting configuration values
import os

from cryptography.fernet import Fernet


class SecureConfig:
    def __init__(self, key: str = None):
        self.key = key or os.getenv('ENCRYPTION_KEY')
        if not self.key:
            raise ValueError("ENCRYPTION_KEY is not set")
        self.cipher = Fernet(self.key.encode())

    def encrypt(self, data: str) -> str:
        return self.cipher.encrypt(data.encode()).decode()

    def decrypt(self, encrypted_data: str) -> str:
        return self.cipher.decrypt(encrypted_data.encode()).decode()


# Using encrypted configuration values
config = SecureConfig()
db_password = config.decrypt(os.getenv('ENCRYPTED_DB_PASSWORD'))
```

## 📈 Version Information

- **Current version**: v1.2.0
- **Python version**: 3.13
- **UV version**: latest
- **Playwright version**: latest
- **Alpine version**: 3.18
- **Build date**: 2024-12-17
- **Maintained by**: kami-spider development team

## 📞 Support and Contributing

### Getting Help

- **Project repository**: [GitHub Repository](https://github.com/yourcompany/kami-spider)
- **Issue tracker**: [Issues](https://github.com/yourcompany/kami-spider/issues)
- **Discussions**: [Discussions](https://github.com/yourcompany/kami-spider/discussions)
- **Documentation**: [Documentation](https://docs.yourcompany.com/kami-spider)

### Contribution Guide

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

The finished image is tagged `kami-spider-base:latest`. For production deployments, use a specific version tag so rollouts stay reproducible, and update the base image and dependencies regularly to keep up with security fixes and performance improvements.
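For example, a pinned tag might be produced and pushed like this — the registry path and tag simply reuse the `DOCKER_REGISTRY` and `IMAGE_TAG` defaults from the Build Guide, and the exact naming in your environment may differ:

```bash
# Re-tag the latest build with an explicit version and push it (paths are illustrative)
docker tag kami-spider-base:latest git.oceanpay.cc/danial/kami-spider-base:v1.2.0
docker push git.oceanpay.cc/danial/kami-spider-base:v1.2.0
```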