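Once a website goes live, it is common to put anti-crawling restrictions in place to keep crawlers from hammering it. In web development, Nginx usually acts as the gateway and router, so this article shows how to use Nginx for anti-crawling in two ways: User-Agent filtering and rate limiting.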
User-Agent
cat /etc/nginx/agent_deny
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
    return 403;
}

# Reject requests that use any method other than GET or POST
if ($request_method !~ ^(GET|POST)$) {
    return 403;
}
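To see which clients these rules would turn away before deploying them, the matching logic can be approximated offline. The following is a minimal Python sketch, not part of the Nginx setup: `status_for` is a hypothetical helper, the 200 result merely stands in for "the request would be passed on", and Python's `re` module is assumed to behave like Nginx's PCRE for these simple patterns.

```python
import re

# Patterns copied from agent_deny. Nginx's ~* operator is a case-insensitive match.
DENY_PATTERNS = [
    re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE),
    re.compile(
        r"FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|"
        r"CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|"
        r"Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|"
        r"lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|"
        r"EasouSpider|Ezooms|^$",
        re.IGNORECASE,
    ),
]

# The method check uses !~ in Nginx, which is case-sensitive.
ALLOWED_METHODS = re.compile(r"^(GET|POST)$")


def status_for(method: str, user_agent: str) -> int:
    """Hypothetical helper: the status the rules above would yield (200 = passed on)."""
    if any(p.search(user_agent) for p in DENY_PATTERNS):
        return 403
    if not ALLOWED_METHODS.search(method):
        return 403
    return 200


if __name__ == "__main__":
    browser = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")
    for method, ua in [("GET", "Scrapy/2.11"), ("GET", ""), ("HEAD", browser), ("GET", browser)]:
        print(method, repr(ua[:20]), status_for(method, ua))
```

Note that the `^$` alternative rejects requests with an empty or missing User-Agent, and that the method rule also rejects HEAD requests, which some monitoring tools and legitimate clients rely on.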
RateLimit
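Traditional rate-limiting algorithms fall into two families: token bucket and leaky bucket. Nginx's implementation differs slightly from the classic leaky bucket. A leaky bucket can handle excess traffic in two ways, traffic shaping and traffic policing: the core idea of traffic shaping is "wait", while the core idea of traffic policing is "drop". Once the bucket is full, the common options are:
- Hold back the incoming water, wait until part of the water in the bucket has leaked out, and then let the incoming water through (shaping)
- Discard the overflowing water outright (policing)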
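The difference between the two options can be illustrated with a small simulation. This is a minimal Python sketch of a textbook leaky bucket, not Nginx's implementation; the function name `leaky_bucket`, its parameters, and the `"shape"`/`"police"` mode names are assumptions made for illustration.

```python
def leaky_bucket(arrivals, rate, capacity, mode):
    """Simulate a leaky bucket.

    arrivals: sorted arrival times in seconds
    rate:     units leaked per second
    capacity: maximum level of the bucket
    mode:     "shape" (wait for room) or "police" (drop the overflow)
    Returns a list of (arrival_time, admit_time), with admit_time None for drops.
    """
    level = 0.0   # current amount of water in the bucket
    last = 0.0    # time at which `level` was last updated
    result = []
    for t in arrivals:
        # In shaping mode an earlier, delayed request may have moved the clock forward.
        now = max(t, last)
        level = max(0.0, level - (now - last) * rate)
        last = now
        if level + 1 > capacity:
            if mode == "police":
                result.append((t, None))  # overflow is discarded
                continue
            # Shaping: wait just long enough for one unit of room to leak out.
            now += (level + 1 - capacity) / rate
            level = capacity - 1
            last = now
        level += 1
        result.append((t, now))
    return result


if __name__ == "__main__":
    burst = [0, 0, 0, 0]  # four requests arriving at the same instant
    print(leaky_bucket(burst, rate=1, capacity=2, mode="police"))
    print(leaky_bucket(burst, rate=1, capacity=2, mode="shape"))
```

With four simultaneous arrivals, a leak rate of 1 per second, and a capacity of 2, policing admits the first two requests and drops the rest, while shaping admits all four but delays the third and fourth by one and two seconds.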
limit_req_zone $binary_remote_addr zone=ratelimit:10m rate=2r/s;
limit_req_status 429;

server {
    listen 80;
    server_name localhost;
    charset utf-8;

    location / {
        limit_req zone=ratelimit burst=20 nodelay;
        include agent_deny;
        proxy_pass http://127.0.0.1:8080/java-web;
    }
}
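To get a feel for what `rate=2r/s` with `burst=20 nodelay` means for a single client IP, the accounting can be modelled roughly as follows. This is a simplified Python sketch based on the documented behaviour of `limit_req`, not a copy of Nginx's code: the class name `LimitReq` is hypothetical, it tracks one key only (Nginx keeps one such state per `$binary_remote_addr`), and it ignores Nginx's millisecond granularity.

```python
class LimitReq:
    """Simplified model of limit_req with nodelay for a single key (one client IP)."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate      # sustained requests per second (rate=2r/s)
        self.burst = burst    # requests tolerated above the rate (burst=20)
        self.excess = 0.0     # how far the client is currently above the rate
        self.last = None      # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            excess = 0.0      # first request seen for this key
        else:
            # The excess drains at `rate` per second; each request adds one.
            excess = max(0.0, self.excess - self.rate * (now - self.last) + 1)
        if excess > self.burst:
            return False      # Nginx would answer with limit_req_status (429 here)
        # Only accepted requests update the stored state.
        self.excess, self.last = excess, now
        return True


if __name__ == "__main__":
    limiter = LimitReq(rate=2, burst=20)
    accepted = sum(limiter.allow(0.0) for _ in range(30))
    print("accepted out of 30 simultaneous requests:", accepted)
    print("one more after 0.5 s:", limiter.allow(0.5))
```

Under this model, 30 simultaneous requests from one IP yield 21 accepted (one at the base rate plus the burst of 20) and 9 rejected with 429; after that, one slot frees up every 0.5 seconds. Because of `nodelay`, accepted requests are forwarded immediately instead of being spaced out at 2 r/s.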
References