For a client campaign on Yiche news pages, the ad click goes to Gridsum (国双), jumps from there to Miaozhen (秒针), and is then 302-redirected to a Yiche campaign page. We found that Gridsum's counts were far higher than Miaozhen's, and that in the time windows where the Gridsum clicks were generated the volume clearly did not follow a normal traffic pattern. Our suspicion is a crawler: it fetched the Gridsum link, but the command it ran, curl, had neither CURLOPT_FOLLOWLOCATION set nor the -L flag added, so the redirect was never followed and the request never reached Miaozhen or the campaign page.
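The hypothesis is easy to state in code. Below is a minimal sketch in plain Java (not the crawler's actual code), assuming only that the monitoring URL answers with a 302 as described: with redirect following disabled the request stops at Gridsum, with it enabled the request continues down the chain. The curl reproduction further below shows the same contrast.

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectCheck {
    public static void main(String[] args) throws Exception {
        // The masked Gridsum monitoring link from above, used only for illustration.
        URL url = new URL("https://c.gridsumdissector.com/r/?gid=gad_44x_ethrvgop&ck=733x&adk=11464x");

        // Like curl without -L: do not follow the 302, so only Gridsum logs a hit
        // and the response body is the tracking pixel rather than the campaign page.
        HttpURLConnection noFollow = (HttpURLConnection) url.openConnection();
        noFollow.setInstanceFollowRedirects(false);
        System.out.println("no follow: " + noFollow.getResponseCode()
                + " -> " + noFollow.getHeaderField("Location"));

        // Like curl -L: follow redirects, so the request also reaches the next hop(s).
        HttpURLConnection follow = (HttpURLConnection) url.openConnection();
        follow.setInstanceFollowRedirects(true);
        System.out.println("follow: " + follow.getResponseCode()
                + " " + follow.getContentType());
    }
}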

We reproduced it with curl. (Note that the URL is unquoted, so the shell splits the command at each & and runs the pieces as background jobs, which is why job lines such as [1] 20140 appear; only the part before the first & reaches curl.)

curl https://c.gridsumdissector.com/r/?gid=gad_44x_ethrvgop&ck=733x&adk=11464x&autorefresh=__AUTOREFRESH__&yiche_did=__YCDID__
[1] 20140
[2] 20141
[3] 20142
[4] 20143
[2]   Done                    ck=7335
christen@iZ28x0bwff6Z:~/app$ Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.

curl warns that the response body is binary. Calling curl from Java (see the demo below) prints the following:

line=GIF89a����,D;
GIF89a����,D;

After adding the -L flag, curl follows the redirects and the HTML structure is retrieved normally (the unquoted URL again spawns shell background jobs):

$ curl -L https://c.gridsumdissector.com/r/?gid=gad_445_ethrvgop&ck=7335&adk=114646&autorefresh=__AUTOREFRESH__&yiche_did=__YCDID__
[1] 20270
[2] 20271
[3] 20272
[4] 20273
[2]   Done                    ck=7335

Partial HTML snippet of the Yiche campaign page:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" class="npage">

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1,user-scalable=no, minimal-ui" />
    <meta name="format-detection" content="telephone=no" />
    <meta name="apple-mobile-web-app-capable" content="yes" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <title>AITO汽车</title>
    <meta name="description" itemprop="description"
        content="易车为您提供最新赛力斯专题报道,了解赛力斯汽车的动态,是您浏览汽车信息和选爱车的第一网络媒体平台,更多精彩尽在易车。">
    <meta name="keywords" content="赛力斯专题,赛力斯,新车专题,易车专题,易车" />

In the Gridsum logs, the requests that account for the gap with Miaozhen all come from the same browser, Chrome 93.0, which makes the crawler suspicion even stronger.
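For reference, once the raw requests are exported, this check is a simple aggregation by User-Agent. A minimal sketch, assuming a hypothetical tab-separated export named gridsum_requests.tsv with the User-Agent string in the third column (the actual Gridsum log format is not shown here):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;

public class UaCount {
    public static void main(String[] args) throws Exception {
        // Count requests per User-Agent; a single UA (e.g. Chrome 93.0) dominating
        // the discrepant time window is the pattern described above.
        Map<String, Long> byUa = Files.lines(Path.of("gridsum_requests.tsv"))
                .map(line -> line.split("\t", -1))
                .filter(cols -> cols.length >= 3)
                .collect(Collectors.groupingBy(cols -> cols[2], Collectors.counting()));

        byUa.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}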

Java demo that shells out to curl:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class CurlTest {
    // Monitoring link (Gridsum); its 302 chain ends at the Yiche campaign page.
    static String[] cmdParts1 = {"curl", "-H", "Cache-Control: max-age=0", "--compressed", "https://c.gridsumdissector.com/r/?gid=gad_44x_ethrvgop&ck=733x&adk=11464x&autorefresh=__AUTOREFRESH__&yiche_did=__YCDID__"};
    // Campaign page (Yiche) requested directly.
    static String[] cmdParts2 = {"curl", "-H", "Cache-Control: max-age=0", "--compressed", "https://adtopic.yiche.com/zhuanti/seres_221220/"};

    public static void main(String[] args) {
        // Monitoring link: prints the binary GIF body because curl runs without -L.
        // Note: joining the parts and passing them to Runtime.exec(String) re-splits
        // the command on whitespace, so the "Cache-Control: max-age=0" header value
        // is mangled; execCmdParts below avoids this by keeping the argument array.
        System.out.println(execCmd(String.join(" ", cmdParts1)));
        // Campaign page: fetches the HTML directly.
        System.out.println(execCmd(String.join(" ", cmdParts2)));
    }

    // Safer variant: ProcessBuilder receives the argument array as-is, so header
    // values containing spaces stay intact (not called by main in this demo).
    public static String execCmdParts(String[] cmdParts) {
        ProcessBuilder process = new ProcessBuilder(cmdParts);
        Process p;
        try {
            p = process.start();
            BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
            StringBuilder builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line);
                builder.append(System.getProperty("line.separator"));
            }

            return builder.toString();
        } catch (IOException e) {
            System.out.print("error");
            e.printStackTrace();
        }

        return null;
    }

    // Runs the command via Runtime.exec(String), which tokenizes it on whitespace.
    private static String execCmd(String command) {
        StringBuilder output = new StringBuilder();
        Process p;
        try {
            p = Runtime.getRuntime().exec(command);
            BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("line=" + line);
                output.append(line).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return output.toString();
    }

}
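To reproduce the working case programmatically, the smallest change is to pass -L to curl in the argument array, so the process follows the 302 chain and returns the campaign-page HTML instead of the GIF bytes. A sketch under that assumption (cmdPartsFollow is a name introduced here, not part of the original demo):

    // Same as cmdParts1, plus -L so curl follows redirects.
    static String[] cmdPartsFollow = {"curl", "-L", "-H", "Cache-Control: max-age=0", "--compressed",
            "https://c.gridsumdissector.com/r/?gid=gad_44x_ethrvgop&ck=733x&adk=11464x&autorefresh=__AUTOREFRESH__&yiche_did=__YCDID__"};

    // Calling the array-based variant keeps the header value intact:
    // System.out.println(execCmdParts(cmdPartsFollow));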

Conclusion: a crawler can reproduce this data pattern, but more corroborating detail would be needed before the traffic can be definitively classified as crawler traffic. Meanwhile, Gridsum failed to identify the crawler and flagged nothing abnormal, which points to a gap in its detection.

Last modified: January 5, 2023