What is OpenClaw?
OpenClaw is an open-source web crawling framework designed for data collection and analysis. It offers a simple, easy-to-use API and supports distributed crawling, anti-bot evasion, data cleaning, and more.

System Requirements
- Python 3.7+
- Memory: at least 2 GB RAM
- Network connection
Installation
Installing with pip
pip install openclaw
Installing from source
git clone https://github.com/openclaw/openclaw.git
cd openclaw
pip install -r requirements.txt
python setup.py install
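To sanity-check the installation, you can try importing the package and printing its version; note that the __version__ attribute is an assumption (common for Python packages, but not documented here):

# Verify the install (assumes the package exposes __version__)
python -c "import openclaw; print(openclaw.__version__)"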
Basic Usage
1 Creating your first spider
from openclaw import Spider, Request

class MySpider(Spider):
    name = "my_first_spider"

    def start_requests(self):
        # Starting URL
        yield Request("https://example.com")

    def parse(self, response):
        # Extract data from the page
        title = response.css('h1::text').get()
        yield {
            'title': title,
            'url': response.url
        }

# Run the spider
if __name__ == "__main__":
    spider = MySpider()
    spider.run()
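Save this as, say, myspider.py and run it directly with python myspider.py; the Common Commands section below shows the equivalent CLI invocation, openclaw run myspider.py.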
2 Configuring a spider
from openclaw import Spider, Request

class ConfigSpider(Spider):
    name = "config_demo"

    # Basic configuration
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/page1", "https://example.com/page2"]

    # Delay between requests (seconds)
    download_delay = 2

    # Number of concurrent requests
    concurrent_requests = 5

    def parse(self, response):
        # Extract links and follow them
        for link in response.css('a::attr(href)').getall():
            yield Request(response.urljoin(link), callback=self.parse_page)

    def parse_page(self, response):
        # Page parsing logic goes here
        pass
Advanced Features
1 Data processing pipelines
import json

from openclaw import Pipeline

class CleanDataPipeline(Pipeline):
    def process_item(self, item, spider):
        # Clean the scraped data (guard against missing/None titles)
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item

class SaveToJSONPipeline(Pipeline):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Collect items as they come in
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Write everything out once the spider finishes
        with open('output.json', 'w') as f:
            json.dump(self.items, f, indent=2)
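Defining a pipeline class is not enough on its own; it has to be registered. Following the config.yaml format shown in the Configuration File section below, a registration sketch might look like this (the myproject.pipelines module path is a placeholder for wherever your classes actually live):

pipelines:
  - "myproject.pipelines.CleanDataPipeline"
  - "myproject.pipelines.SaveToJSONPipeline"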
2 Using middleware
from openclaw import Middleware

class CustomMiddleware(Middleware):
    def process_request(self, request, spider):
        # Add a custom request header
        request.headers['User-Agent'] = 'MyCustomAgent/1.0'
        return request

    def process_response(self, response, spider):
        # Inspect the response before it reaches the spider
        if response.status == 403:
            spider.logger.warning(f"Blocked: {response.url}")
        return response
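Like pipelines, middleware classes take effect once they are listed under middlewares in config.yaml, as shown in the Configuration File section below.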
Hands-on Example: Crawling a News Site
from openclaw import Spider, Request

class NewsSpider(Spider):
    name = "news_crawler"

    def start_requests(self):
        urls = [
            'https://news.example.com/tech',
            'https://news.example.com/business'
        ]
        for url in urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        # Extract article links from the category page
        article_links = response.css('.article-list a::attr(href)').getall()
        for link in article_links:
            yield Request(
                response.urljoin(link),
                callback=self.parse_article,
                meta={'category': response.url}
            )

    def parse_article(self, response):
        yield {
            'title': response.css('h1.article-title::text').get(),
            'content': ' '.join(response.css('.article-content p::text').getall()),
            'author': response.css('.author-name::text').get(),
            'publish_date': response.css('.publish-date::text').get(),
            'category': response.meta['category'],
            'url': response.url
        }

    def process_exception(self, request, exception, spider):
        # Error handling for failed requests
        self.logger.error(f"Request failed: {request.url}, error: {exception}")
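To try this end to end, run the spider and then dump the collected items with the commands from the Common Commands section below (openclaw run, followed by openclaw export -f json -o output.json).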
Configuration File
Create a config.yaml:
spider:
  name: "my_spider"
  download_delay: 1
  concurrent_requests: 3
  user_agent: "OpenClaw/1.0"

database:
  enabled: true
  type: "sqlite"
  path: "./data.db"

middlewares:
  - "openclaw.middlewares.RetryMiddleware"
  - "myproject.middlewares.CustomMiddleware"

pipelines:
  - "openclaw.pipelines.ValidationPipeline"
  - "myproject.pipelines.CleanDataPipeline"
Common Commands
# Run a spider
openclaw run myspider.py

# Run with a config file
openclaw run myspider.py -c config.yaml

# Check spider status
openclaw status

# Export data
openclaw export -f json -o output.json

# Run distributed with 4 workers
openclaw run --distributed --workers 4
Debugging Tips
1 Debugging with the shell
from openclaw.shell import inspect_response

# Debug from inside a parse method
def parse(self, response):
    inspect_response(response)  # drops into an interactive shell
    # ... continue writing parsing code here
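Inside the shell you can test selector expressions against the live response, for example response.css('h1::text').get(), and only copy them into your spider once they return what you expect.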
2 Logging configuration
import logging

# Set the framework's log level
logging.getLogger('openclaw').setLevel(logging.DEBUG)

# Customize the log format
logging.basicConfig(
    format='%(asctime)s [%(name)s] %(levelname)s: %(message)s',
    level=logging.INFO
)
Best Practices
- Obey robots.txt: set ROBOTSTXT_OBEY = True (see the sketch after this list)
- Set reasonable delays: avoid putting load on the target site
- Handle errors: implement a complete exception-handling mechanism
- Deduplicate data: use the built-in DupFilter middleware
- Manage resources: close database connections and file handles promptly
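Here is a minimal sketch pulling the politeness settings together in one spider. It assumes ROBOTSTXT_OBEY can be set as a class attribute in the same way download_delay and concurrent_requests are set in the configuration example above; check your OpenClaw version for the exact setting name:

from openclaw import Spider

class PoliteSpider(Spider):
    name = "polite_spider"

    # Assumption: robots.txt compliance is toggled per spider;
    # the setting name comes from the best-practices list above
    ROBOTSTXT_OBEY = True

    # Throttle requests so the target site isn't overloaded
    download_delay = 2
    concurrent_requests = 3

The built-in DupFilter middleware from the deduplication item would then be enabled through the middlewares list in config.yaml.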
Common Problems and Solutions
Q: Crawling is too slow?
A: Tune the concurrent_requests and download_delay parameters.

Q: Running into anti-bot measures?
A: Use a proxy middleware or rotate the User-Agent (see the sketch below).

Q: Memory usage is too high?
A: Enable a data processing pipeline and save items as they are scraped instead of holding everything in memory.

Q: How do I resume an interrupted crawl?
A: Enable the JOBDIR setting to persist crawl state.
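For the anti-bot question, here is a sketch of a proxy and User-Agent rotation middleware following the CustomMiddleware pattern from the middleware section. The request.meta['proxy'] convention and the proxy URLs are assumptions, so adapt them to however your OpenClaw version wires up proxies:

import random

from openclaw import Middleware

class RotatingProxyMiddleware(Middleware):
    # Hypothetical proxy pool; substitute real endpoints
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]
    # A few desktop User-Agent strings to rotate through
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Assumption: the downloader honors request.meta['proxy']
        request.meta['proxy'] = random.choice(self.PROXIES)
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return request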
Learning Resources
This tutorial covers OpenClaw from basic to advanced usage. Start with a simple spider and work your way up to the more complex features, and always comply with the target site's crawling policies and applicable laws and regulations.