Recently I suddenly wanted to scrape some images from pixiv, and that is how this article came about.

The problem

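At first I tried to access the API with the https://github.com/upbit/pixivpy project. My server is located in mainland China, so it has to get around SNI inspection before it can reach pixiv's servers.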

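Just as I was about to start writing code, I found that the library's SNI bypass no longer worked. I had no choice but to write one myself, hence this article.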

Since I was writing it myself anyway, and the neighbouring https://github.com/Mikubill/pixivpy-async still had not shipped anything SNI-related, I decided to build it on top of aiohttp.

Investigating the problem

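A bit of research shows that Pixiv is blocked in two ways: DNS poisoning and SNI filtering. The DNS part can be handled with a trustworthy resolver, and the workaround I found for SNI is simply not to send the SNI at all.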

Implementing a secure DNS resolver

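According to the documentation, aiohttp lets you plug in a custom resolver. Using aiohttp's built-in resolver code and pixivpy as references, I wrote a resolver that uses Cloudflare's DoH (DNS over HTTPS) service: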

import json
import logging
import re
import socket
from typing import List, Dict, Any

import aiohttp
from aiohttp.abc import AbstractResolver


class ByPassResolver(AbstractResolver):

    async def resolve(self, host: str, port: int, family: int) -> List[Dict[str, Any]]:
        new_host = host
        ips = await self.require_appapi_hosts(new_host)
        result = []

        for i in ips:
            result.append({
                "hostname": host,
                "host": i,
                "port": port,
                "family": family,
                "proto": 0,
                "flags": socket.AI_NUMERICHOST | socket.AI_NUMERICSERV,
            })
        return result

    async def close(self) -> None:
        pass

    async def require_appapi_hosts(self, hostname, timeout=3) -> List[str]:
        """
        Query the real IP addresses through Cloudflare's DNS over HTTPS.
        """
        URLS = (
            "https://cloudflare-dns.com/dns-query",
            "https://1.0.0.1/dns-query",
            "https://1.1.1.1/dns-query",
            "https://[2606:4700:4700::1001]/dns-query",
            "https://[2606:4700:4700::1111]/dns-query",
        )
        params = {
            "ct": "application/dns-json",
            "name": hostname,
            "type": "A",
            "do": "false",
            "cd": "false",
        }
        # Raw string so that "\d" and "\." are not treated as string escapes.
        pattern = re.compile(
            r"((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.){3}(1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)")

        # Try each DoH endpoint in turn and return the first usable answer.
        for url in URLS:
            try:
                async with aiohttp.ClientSession() as session:
                    async with session.get(
                            url, params=params,
                            timeout=aiohttp.ClientTimeout(total=timeout)) as rsp:
                        response = await rsp.text()
                        obj = json.loads(response)
                        result = []
                        for i in obj["Answer"]:
                            ip = i["data"]
                            # Keep only IPv4 addresses (skips CNAME answers).
                            if pattern.match(ip) is not None:
                                result.append(ip)
                        print(result)
                        return result

            except Exception as e:
                logging.exception(e)

        # Every endpoint failed: return an empty list so resolve() can still iterate.
        return []

It is that simple, but we are not done yet. As it turns out later, this code does not actually work...
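The answer-parsing step of the resolver can be looked at in isolation. The following is only a minimal sketch: it assumes the response body follows the `application/dns-json` shape, with an `Answer` list whose entries carry a numeric `type` and a `data` field. The function name `extract_ipv4` and the sample payload (including `cdn.example.net`) are made up for illustration. It uses the standard `ipaddress` module instead of a regular expression to validate addresses.

```python
import ipaddress
import json
from typing import List

A_RECORD = 1  # DNS record type number for IPv4 address (A) records


def extract_ipv4(doh_body: str) -> List[str]:
    """Return the IPv4 addresses found in a DoH JSON response body."""
    obj = json.loads(doh_body)
    addresses = []
    # "Answer" may be missing when the name does not resolve, so default to [].
    for answer in obj.get("Answer", []):
        if answer.get("type") != A_RECORD:
            continue  # e.g. CNAME records carry a hostname, not an address
        try:
            addresses.append(str(ipaddress.IPv4Address(answer["data"])))
        except (KeyError, ValueError):
            continue  # malformed entry: skip it rather than fail the lookup
    return addresses


# Hypothetical payload, shaped like a DoH JSON answer (not a real capture).
sample = json.dumps({
    "Status": 0,
    "Answer": [
        {"name": "www.pixivision.net", "type": 5, "data": "cdn.example.net."},
        {"name": "cdn.example.net", "type": 1, "data": "210.140.131.199"},
        {"name": "cdn.example.net", "type": 1, "data": "not-an-ip"},
    ],
})
print(extract_ipv4(sample))
```

Keeping the parsing in a pure function like this makes it possible to test it without any network access.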

Not sending the SNI

SNI is handled by the ssl package. To use your own SSL settings in aiohttp, you can write something like this:

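import ssl

import aiohttp

ssl_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)

# Turn off verification of the server certificate.
# check_hostname has to be disabled before verify_mode can be set to CERT_NONE.
ssl_ctx.check_hostname = False
ssl_ctx.verify_mode = ssl.CERT_NONE

connector = aiohttp.TCPConnector(ssl=ssl_ctx)
client = aiohttp.ClientSession(connector=connector)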

However, I could not find any SNI-related setting on SSLContext, so I kept looking and came across the following code:

# asyncio.base_events.BaseEventLoop.create_connection

if server_hostname is None and ssl:
    # Use host as default for server_hostname.  It is an error
    # if host is empty or not set, e.g. when an
    # already-connected socket was passed or when only a port
    # is given.  To avoid this error, you can pass
    # server_hostname='' -- this will bypass the hostname
    # check.  (This also means that if host is a numeric
    # IP/IPv6 address, we will attempt to verify that exact
    # address; this will probably fail, but it is possible to
    # create a certificate for a specific IP address, so we
    # don't judge it here.)
    if not host:
        raise ValueError('You must set server_hostname '
                         'when using ssl without a host')
    server_hostname = host

According to this comment, passing server_hostname='' is enough to connect straight to the IP without sending the SNI.

So how do we get server_hostname=''? Look back at the DNS code above: all that is needed is for the returned dict to contain "hostname": "".

That takes care of the SNI problem.
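Whether a server name is really left out of the handshake can be checked offline. The sketch below builds a TLS ClientHello in memory with `ssl.MemoryBIO` and looks for the hostname in the raw bytes. It assumes that an empty server_hostname ends up as "no server name" by the time it reaches the ssl layer, which is modelled here by passing `None`; the helper name `client_hello` is made up for this example.

```python
import ssl
from typing import Optional


def client_hello(server_hostname: Optional[str]) -> bytes:
    """Build a TLS ClientHello in memory and return its raw bytes."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    # Memory BIOs stand in for the network: nothing is actually sent.
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the ClientHello has been written to `outgoing`, and the
        # handshake is now waiting for a server reply that never arrives.
        pass
    return outgoing.read()


name = b"app-api.pixiv.net"
print(name in client_hello("app-api.pixiv.net"))  # SNI extension carries the name
print(name in client_hello(None))                 # no server name given
```

The SNI extension is sent in plaintext, which is why a simple substring search over the ClientHello bytes is enough for this check.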

Connecting to Pixiv

Everything looked fine, but was it really? During testing, an error was mercilessly thrown in my face:

aiohttp.client_exceptions.ClientConnectorSSLError: Cannot connect to host app-api.pixiv.net:443 ssl:default [[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1123)]

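It is an SSL error. Checking the resolved IP showed that it belongs to a Cloudflare CDN node, and nodes of this kind may not accept an empty SNI. So I had to go and find a different node to use.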

Thanks to the https://github.com/Notsfsssf/pixez-flutter project, I really did find an IP that accepts connections without an SNI:

https://github.com/Notsfsssf/pixez-flutter/blob/master/lib/er/hoster.dart

static Map<String, dynamic> _constMap = {
  "app-api.pixiv.net": "210.140.131.199",
  "oauth.secure.pixiv.net": "210.140.131.199",
  "i.pximg.net": "210.140.92.144",
  "s.pximg.net": "210.140.92.143",
  "doh": "1.0.0.1",
};

210.140.131.199? Judging from past experience, there is certainly more than one node like this. Time to bring out Shodan.

The certificate information for port 443 contained the following:

DNS:*.pixiv.net, DNS:pixiv.me, DNS:public-api.secure.pixiv.net, DNS:oauth.secure.pixiv.net, DNS:www.pixivision.net, DNS:fanbox.cc, DNS:*.fanbox.cc, DNS:pixiv.net
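To see which hostnames this certificate covers, the SAN entries can be matched against a name by hand. This is a simplified sketch of wildcard matching (a `*` is only accepted as the whole leftmost label), not a full implementation of certificate hostname verification; `covered_by` is a name invented for this example.

```python
from typing import Iterable

# SAN entries copied from the certificate information above.
SAN_ENTRIES = [
    "*.pixiv.net", "pixiv.me", "public-api.secure.pixiv.net",
    "oauth.secure.pixiv.net", "www.pixivision.net", "fanbox.cc",
    "*.fanbox.cc", "pixiv.net",
]


def covered_by(hostname: str, san_entries: Iterable[str]) -> bool:
    """Return True if hostname matches one of the SAN entries."""
    host_labels = hostname.lower().split(".")
    for entry in san_entries:
        entry_labels = entry.lower().split(".")
        # A wildcard stands for exactly one label, so label counts must agree.
        if len(entry_labels) != len(host_labels):
            continue
        if entry_labels[0] == "*":
            if entry_labels[1:] == host_labels[1:]:
                return True
        elif entry_labels == host_labels:
            return True
    return False


for host in ("app-api.pixiv.net", "www.pixivision.net", "i.pximg.net"):
    print(host, covered_by(host, SAN_ENTRIES))
```

Under these rules both app-api.pixiv.net (through `*.pixiv.net`) and www.pixivision.net are covered by the same certificate, while i.pximg.net is not.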

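It looks like this server hosts quite a few services. Regular pixiv users may know that www.pixivision.net can be reached directly and resolves normally. If we use the IP address of www.pixivision.net in place of the address of app-api.pixiv.net, can we get around the SNI filtering and connect directly?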

Modify the resolver above as follows:


new_host = host
if host == "app-api.pixiv.net":
    new_host = "www.pixivision.net"

Done! The connection to the server succeeded directly, and I also got a list of server addresses:

['210.140.131.226', '210.140.131.223', '210.140.131.218', '210.140.131.199', '210.140.131.201']
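As more hostnames need the same treatment, the chain of `if` statements can be replaced by a lookup table. This is just one possible way to organise it; `HOST_ALIASES` and `lookup_name` are names invented for this sketch, and the two entries are the hosts redirected in this article.

```python
# Hosts whose DNS lookup is redirected to a directly reachable name
# served by the same machines.
HOST_ALIASES = {
    "app-api.pixiv.net": "www.pixivision.net",
    "www.pixiv.net": "www.pixivision.net",
}


def lookup_name(host: str) -> str:
    """Return the name to resolve for host; unknown hosts are left unchanged."""
    return HOST_ALIASES.get(host, host)


print(lookup_name("app-api.pixiv.net"))
print(lookup_name("example.com"))
```

Inside the resolver, `new_host = lookup_name(host)` would then take the place of the conditional assignments.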

Final code

import asyncio
import json
import logging
import re
import socket
import ssl
from typing import List, Dict, Any

import aiohttp
from aiohttp.abc import AbstractResolver


class ByPassResolver(AbstractResolver):

    async def resolve(self, host: str, port: int, family: int) -> List[Dict[str, Any]]:
        # Resolve a directly reachable name served by the same machines.
        new_host = host
        if host == "app-api.pixiv.net":
            new_host = "www.pixivision.net"
        if host == "www.pixiv.net":
            new_host = "www.pixivision.net"

        ips = await self.require_appapi_hosts(new_host)
        result = []

        for i in ips:
            result.append({
                # An empty hostname means no SNI is sent during the handshake.
                "hostname": "",
                "host": i,
                "port": port,
                "family": family,
                "proto": 0,
                "flags": socket.AI_NUMERICHOST | socket.AI_NUMERICSERV,
            })
        return result

    async def close(self) -> None:
        pass

    async def require_appapi_hosts(self, hostname, timeout=3) -> List[str]:
        """
        Query the real IP addresses through Cloudflare's DNS over HTTPS.
        """
        URLS = (
            "https://cloudflare-dns.com/dns-query",
            "https://1.0.0.1/dns-query",
            "https://1.1.1.1/dns-query",
            "https://[2606:4700:4700::1001]/dns-query",
            "https://[2606:4700:4700::1111]/dns-query",
        )
        params = {
            "ct": "application/dns-json",
            "name": hostname,
            "type": "A",
            "do": "false",
            "cd": "false",
        }
        # Raw string so that "\d" and "\." are not treated as string escapes.
        pattern = re.compile(
            r"((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.){3}(1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)")

        # Try each DoH endpoint in turn and return the first usable answer.
        for url in URLS:
            try:
                async with aiohttp.ClientSession() as session:
                    async with session.get(
                            url, params=params,
                            timeout=aiohttp.ClientTimeout(total=timeout)) as rsp:
                        response = await rsp.text()
                        obj = json.loads(response)
                        result = []
                        for i in obj["Answer"]:
                            ip = i["data"]
                            # Keep only IPv4 addresses (skips CNAME answers).
                            if pattern.match(ip) is not None:
                                result.append(ip)
                        print(result)
                        return result

            except Exception as e:
                logging.exception(e)

        # Every endpoint failed: return an empty list so resolve() can still iterate.
        return []


def get_bypass_client() -> aiohttp.ClientSession:
    ssl_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Certificate checks are disabled; check_hostname must be cleared first.
    ssl_ctx.check_hostname = False
    ssl_ctx.verify_mode = ssl.CERT_NONE
    connector = aiohttp.TCPConnector(ssl=ssl_ctx, resolver=ByPassResolver())
    client = aiohttp.ClientSession(connector=connector)
    return client


async def test():
    # "async with" closes the session even if the request raises.
    async with get_bypass_client() as client:
        async with client.get(
                "https://www.pixiv.net/ajax/search/artworks/%E7%99%BE%E5%90%88?word=%E7%99%BE%E5%90%88&order=date_d&mode=all&p=99999990&s_mode=s_tag&type=all&lang=zh") as rsp:
            print(await rsp.json())


if __name__ == '__main__':
    asyncio.run(test())