Blocking Crawlers

Published: 2024-08-23

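One day I noticed that the CDN hit rate had dropped a little. A closer look at the logs showed a large number of 404 responses, and digging further revealed that many of those requests carried a User-Agent that appeared to be Baidu's spider: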

compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html
Requests like this one were asking for files that do not exist, which is what produced the 404s.
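The log check described above can be sketched roughly as follows. This is only an illustration: it assumes the logs use nginx's default "combined" format, and the function name `count_404_by_agent` is my own. CDN logs often use a different layout, in which case the pattern needs adjusting.

```python
import re
from collections import Counter

# Assumption: nginx "combined" log format. Adjust the pattern for other layouts.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_404_by_agent(lines):
    """Return a Counter mapping each User-Agent to its number of 404 responses."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            counts[match.group("agent")] += 1
    return counts
```

Passing an open log file to `count_404_by_agent` and printing `.most_common(10)` on the result lists the User-Agents responsible for the most 404s.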

First, block it:

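You can place a robots.txt file in the site's root directory. robots.txt is the first file a well-behaved search engine crawler requests when it visits a site, and it tells the spider which files on the server it may crawl. Compliance is voluntary, so it only affects crawlers that honor it. To save effort, the generator at
http://tool.chinaz.com/robots/
can produce a robots.txt file for the current search engines according to the rules you want.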

User-agent: *
Disallow: /
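As a quick sanity check, the two rules can be fed to Python's standard `urllib.robotparser` to see how a compliant crawler would read them. This is a minimal sketch. The helper name `allowed` is hypothetical, and a real crawler's parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# With "Disallow: /" every path is off limits to every agent.
print(allowed(ROBOTS_TXT, "Baiduspider", "/index.html"))
```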

Add the following block to the nginx configuration:

 

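location = /robots.txt {
  # Serve the rules as plain text. default_type sets the Content-Type here;
  # also adding it via add_header would send a second, duplicate header.
  default_type text/plain;
  charset utf-8;
  return 200 "User-agent: *\nDisallow: /\n";
}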

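Whether a request really comes from Baidu's spider still has to be verified. Baidu's FAQ describes how: when a request arrives with a Baiduspider User-Agent, do a reverse DNS lookup on the IP with the host command. Genuine Baiduspider hostnames follow the pattern *.baidu.com or *.baidu.jp; anything else can be treated as a fake spider.
For example: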
host 123.125.71.60

60.71.125.123.in-addr.arpa domain name pointer baiduspider-123-125-71-60.crawl.baidu.com.

On Windows, use nslookup:

nslookup
Default Server: public1.alidns.com
Address: 223.5.5.5

> 123.125.71.60
Server: public1.alidns.com
Address: 223.5.5.5

Name: baiduspider-123-125-71-60.crawl.baidu.com
Address: 123.125.71.60
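The same check can be scripted. The sketch below uses only Python's standard `socket` module, and the function names are my own. The suffix rule comes from the Baidu FAQ quoted above. The forward lookup at the end is an extra precaution that the FAQ does not mention: whoever controls an IP's reverse zone can put any name in a PTR record.

```python
import socket

# Suffixes from the Baidu FAQ rule quoted above.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_baidu_hostname(hostname):
    """Return True if the hostname ends in .baidu.com or .baidu.jp."""
    return hostname.rstrip(".").lower().endswith(BAIDU_SUFFIXES)


def is_real_baiduspider(ip):
    """Reverse-resolve `ip` and check the hostname. Requires working DNS."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not is_baidu_hostname(hostname):
        return False
    # Extra step (my addition, not in the FAQ): the name must resolve back to the same IP.
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips
```

Calling `is_real_baiduspider("123.125.71.60")` performs the same lookup as the host and nslookup commands above. The result depends on live DNS, so it may change over time.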

云上的小雨滴

https://blog.icmpy.net/2024/082340682.html