酷谷的谷子

Figure: proxy pool architecture diagram

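Why build a proxy pool?

IP bans are routine in crawler development, and any single free proxy is short-lived, slow, and unstable. The fix is a proxy pool that collects, verifies, and retires proxies automatically, so a working proxy is always on hand.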

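1. Architecture design

Collector → Proxy pool ← Validator
                ↓
           API service → Your crawler

Core components:
- Collector: periodically scrapes proxies from free proxy sites
- Scheduler: manages the timed collection and verification jobs
- Storage layer: keeps proxy data in Redis (Hash structure)
- API service: exposes an HTTP interface from which crawlers fetch proxies
- Validator: periodically checks whether each proxy is still alive and removes dead ones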

2. Data storage design

Proxies are stored in a Redis Hash. The field is the proxy address and the value is a JSON string:

# Storage format
{
    "use_proxy": {
        "127.0.0.1:8080": '{"proxy": "127.0.0.1:8080", "https": true, "source": "sslproxies", "check_count": 10, "last_status": true, "fail_count": 0}',
        ...
    }
}

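Key fields:
- `proxy`: the proxy address
- `https`: whether the proxy supports HTTPS
- `source`: where the proxy came from (helps trace a bad source)
- `check_count`: number of verifications run so far
- `last_status`: result of the most recent verification
- `fail_count`: number of consecutive failures (the proxy is removed once this passes the threshold)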
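The way these fields change after each verification can be sketched as below. This is a minimal illustration, not the author's implementation: the helper names `new_record`, `apply_check`, and `to_hash_value` are hypothetical, and `MAX_FAIL = 3` is assumed from the "3 consecutive failures" rule in the operational notes.

```python
import json

MAX_FAIL = 3  # assumed threshold: remove after 3 consecutive failures


def new_record(proxy, source, https=False):
    """Build a fresh record with the fields used in the storage format."""
    return {
        "proxy": proxy,
        "https": https,
        "source": source,
        "check_count": 0,
        "last_status": True,
        "fail_count": 0,
    }


def apply_check(record, ok):
    """Return (updated_record, keep) after one verification result."""
    updated = dict(record)
    updated["check_count"] += 1
    updated["last_status"] = ok
    # A success resets the consecutive-failure counter
    updated["fail_count"] = 0 if ok else updated["fail_count"] + 1
    keep = updated["fail_count"] < MAX_FAIL
    return updated, keep


def to_hash_value(record):
    """Serialize a record into the JSON string stored as the hash value."""
    return json.dumps(record)
```

When `keep` comes back `False`, the caller would delete the field from the `use_proxy` hash; otherwise it would write `to_hash_value(updated)` back under the same proxy address.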

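3. Collector implementation

Each collector is a function that returns a list of proxies, so several sources can be collected side by side: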

import requests
from lxml import etree


def collect_sslproxies():
    """Collect proxies from sslproxies.org and the other listed pages."""
    urls = [
        "https://www.sslproxies.org/",
        "https://www.us-proxy.org/",
        "https://free-proxy-list.net/"
    ]
    proxies = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            html = etree.HTML(resp.content.decode('utf-8', errors='ignore'))
            rows = html.xpath('//table/tbody/tr')
            for row in rows:
                # The first two cells of each row are taken as IP and port
                cols = row.xpath('./td/text()')
                if len(cols) >= 2:
                    proxy = f"{cols[0]}:{cols[1]}"
                    proxies.append(proxy)
        except Exception:
            # Skip a source that fails to load or parse; the others still run
            continue
    return proxies
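Because a collector only checks that a row has at least two cells, malformed entries and duplicates across sources can slip through. The sketch below shows one way to run several collectors, filter, and de-duplicate. The names `is_valid_proxy` and `run_collectors` are hypothetical, and using the collector's function name as the `source` tag is an assumption.

```python
import re

# Matches "a.b.c.d:port" with a dotted IPv4 host
PROXY_RE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}:\d{1,5}$")


def is_valid_proxy(proxy):
    """Check the ip:port shape and the numeric ranges of each part."""
    if not PROXY_RE.match(proxy):
        return False
    host, port = proxy.rsplit(":", 1)
    octets_ok = all(0 <= int(part) <= 255 for part in host.split("."))
    return octets_ok and 0 < int(port) <= 65535


def run_collectors(collectors):
    """Call each collector; return unique (proxy, source) pairs in order."""
    seen = set()
    result = []
    for collector in collectors:
        try:
            found = collector()
        except Exception:
            # One broken source should not stop the others
            continue
        for proxy in found:
            proxy = proxy.strip()
            if is_valid_proxy(proxy) and proxy not in seen:
                seen.add(proxy)
                result.append((proxy, collector.__name__))
    return result
```

A call such as `run_collectors([collect_sslproxies])` would then return pairs ready to be verified and written to the pool.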

Proxy sources (tested and working)

| Source | URL | Type | Daily live proxies |
|--------|-----|------|--------------------|
| sslproxies.org | https://sslproxies.org | HTTP/HTTPS | 100+ |
| us-proxy.org | https://us-proxy.org | HTTP/HTTPS | 80+ |
| free-proxy-list.net | https://free-proxy-list.net | HTTP/HTTPS | 150+ |
| proxy-list.download | https://proxy-list.download | HTTP/HTTPS/SOCKS | 200+ |
| proxyscrape.com | https://proxyscrape.com | HTTP/SOCKS4/SOCKS5 | 500+ |

4. Validator implementation

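import requests


def verify_proxy(proxy):
    """Check that the proxy works for both HTTP and HTTPS targets."""
    test_urls = [
        ("http://httpbin.org/ip", "http"),
        ("https://qq.com", "https"),
    ]
    for url, scheme in test_urls:
        try:
            resp = requests.get(
                url,
                # The proxy itself is reached over plain HTTP;
                # HTTPS targets are tunnelled through it via CONNECT
                proxies={scheme: f"http://{proxy}"},
                timeout=5
            )
            if resp.status_code != 200:
                return False
        except requests.RequestException:
            return False
    return True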

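Double-check rule: a proxy passes only if both the HTTP and the HTTPS check succeed. Many free proxies only forward plain HTTP and do not support HTTPS tunnelling.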
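The validator checks one proxy per call, and with a 5-second timeout a large pool takes a long time to verify sequentially. A thread pool is one common way to run the checks concurrently. The sketch below is an assumption about how that could look, not the author's code: `verify_all` is a hypothetical helper, and the default of 20 workers follows the thread count suggested in the operational notes.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_all(proxies, verify, max_workers=20):
    """Run verify(proxy) concurrently and return {proxy: bool}."""
    proxies = list(proxies)
    if not proxies:
        return {}

    def safe_verify(proxy):
        # Treat any unexpected error as a failed check
        try:
            return bool(verify(proxy))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input list
        results = list(pool.map(safe_verify, proxies))
    return dict(zip(proxies, results))
```

Passing the checker in as `verify` keeps the helper independent of the network, so it can be exercised with a fake function before being pointed at `verify_proxy`.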

5. API service

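from flask import Flask, jsonify
from redis import Redis

app = Flask(__name__)
# The snippet uses `redis` without defining it; a local client is assumed here
redis = Redis(host='localhost', port=6379, decode_responses=True)


@app.route('/get/')
def get_proxy():
    """Return one random usable proxy."""
    proxy = redis.hgetall('use_proxy')
    # Filter the valid entries and return one at random
    ...


@app.route('/count/')
def count():
    """Return proxy pool statistics."""
    total = redis.hlen('use_proxy')
    return jsonify({"total": total, "status": "running"})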
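The filtering step elided in `get_proxy` could look roughly like the sketch below. It is an assumption, not the author's code: `pick_proxy` is a hypothetical helper, and "valid" is taken to mean a record whose `last_status` is true, optionally restricted to HTTPS-capable proxies.

```python
import json
import random


def pick_proxy(raw, need_https=False, rng=random):
    """Pick one usable record from an HGETALL result, or None if none qualify."""
    candidates = []
    for value in raw.values():
        # redis-py returns bytes unless decode_responses=True
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        try:
            record = json.loads(value)
        except (TypeError, ValueError):
            continue  # skip values that are not valid JSON
        if not isinstance(record, dict) or not record.get("last_status"):
            continue
        if need_https and not record.get("https"):
            continue
        candidates.append(record)
    return rng.choice(candidates) if candidates else None
```

Inside the route, the result of `redis.hgetall('use_proxy')` would be passed to `pick_proxy`, and the chosen record (or an empty object when the pool has no live proxy) returned through `jsonify`.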

6. Integrating the proxy pool into a crawler

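import requests


class ProxyPool:
    def get_proxy(self):
        try:
            resp = requests.get("http://localhost:5010/get/", timeout=3)
            data = resp.json()
            return data.get("proxy")
        except (requests.RequestException, ValueError):
            # Pool unreachable or response is not JSON
            return self._get_fallback()

    def _get_fallback(self):
        # Placeholder: this method is not shown in the article; no proxy is returned
        return None

    def get_count(self):
        try:
            resp = requests.get("http://localhost:5010/count/", timeout=3)
            return resp.json().get("total", 0)
        except (requests.RequestException, ValueError):
            return 0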
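In a crawler, a fetched proxy may still die between verification and use, so the request is usually retried with a freshly drawn proxy. The sketch below illustrates that pattern under stated assumptions: `fetch_with_rotation` is a hypothetical helper, and `get_proxy` and `do_request` are callables supplied by the caller (for example `ProxyPool().get_proxy` and a thin wrapper around `requests.get`).

```python
def fetch_with_rotation(url, get_proxy, do_request, max_attempts=3):
    """Try the request up to max_attempts times, drawing a new proxy each time."""
    last_error = None
    for _ in range(max_attempts):
        proxy = get_proxy()  # may be None when the pool is empty
        try:
            return do_request(url, proxy)
        except Exception as exc:
            # Treat any failure as a bad proxy and move on to the next one
            last_error = exc
    raise RuntimeError(
        f"all {max_attempts} attempts failed for {url}"
    ) from last_error
```

With requests, `do_request` could be a function that calls `requests.get(url, proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"}, timeout=10)` and raises on a non-200 status.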

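7. Operational notes

1. Collection rate: send at most one request every 2 seconds, so the source sites do not ban you
2. Verification threads: 20 concurrent threads is a reasonable balance between throughput and stability
3. Removal threshold: a proxy is removed automatically after 3 consecutive failures
4. Redis persistence: enable RDB snapshots to guard against data loss
5. Public reachability: bind the API service to 0.0.0.0, but pay attention to inbound firewall rules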
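The collection-rate rule can be enforced with a small throttle that is called before every request to a source. This is a minimal sketch under assumptions: the `Throttle` class is hypothetical, and the 2-second default comes from the first note in the list. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between consecutive calls."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: sleep off the difference
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A collector would create one `Throttle()` per source and call `throttle.wait()` immediately before each `requests.get`.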

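Summary

A production-ready proxy pool is not complicated. It comes down to three parts: collection, storage, and verification. Free proxy sites serve as the collection sources, Redis handles storage, scheduled jobs handle verification, and Flask provides the API. I have run this setup with a pool of over a thousand proxies; if you are building crawlers too, the same approach is worth a try.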
