Beginner-Level Web Scraping

1. Scraping specific content from a specified webpage

The target webpage

https://ask.39.net/news/2685-1.html


The goal

I want to scrape the questions and their answers from this page.

Putting together the scraping code

In the browser, open the developer tools (F12), switch to the Network tab, and refresh the page. Locate the request for 2685-1.html, right-click it, and choose Copy → Copy as cURL. This copies the complete request, including all of its headers and cookies.

Convert the copied content into Python code

Open https://spidertools.cn/#/curl2Request and paste the copied cURL command into the input box on the left. The site automatically converts the curl command into the equivalent Python code on the right; copy that code out.


Run the Python script

Save the copied Python code as test01.py and run it (for example, python test01.py).


The printed content comes out garbled. The cause is an incorrect encoding setting: it is not clear which encoding requests assumed for the returned text, but it certainly does not match the encoding the server actually uses for this page. The fix is to find the page's real encoding, which turns out to be utf-8; you can see it in the charset of the Content-Type response header in the developer tools, or in the <meta charset> tag of the page source.

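You can also check from Python itself: requests exposes both the encoding it guessed from the response headers and the encoding it detects from the response body. A minimal sketch (assuming the server answers a plain GET; if it does not, reuse the headers and cookies from test01.py):

import requests

url = "https://ask.39.net/news/2685-1.html"
response = requests.get(url)

# Encoding requests picked from the Content-Type header (may be missing or wrong)
print(response.encoding)
# Encoding detected by inspecting the response body itself
print(response.apparent_encoding)  # should report utf-8 for this page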

Then make a small change to test01.py: add the response.encoding = 'utf-8' line marked with a comment in the script below.

import requests


headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "no-cache",
    "pragma": "no-cache",
    "priority": "u=0, i",
    "referer": "https://ask.39.net/index.htm",
    "sec-ch-ua": "\"Not;A=Brand\";v=\"99\", \"Google Chrome\";v=\"139\", \"Chromium\";v=\"139\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36"
}
cookies = {
    "asp_furl": "https%3A%2F%2Fwww.google.com.hk%2F",
    "_ga": "GA1.2.1531224323.1757682130",
    "Hm_lvt_6e8573fc07ff21285f41cba3fb1618af": "1757682131",
    "HMACCOUNT": "61545DE2F4546981",
    "Hm_lvt_9840601cb51320c55bca4fa0f4949efe": "1757682131",
    "track39id": "f1e216cf465a1effd01fe0d4d910a097",
    "JSESSIONID": "BACB897BB0974672B36BD0E03882948B",
    "Hm_lpvt_9840601cb51320c55bca4fa0f4949efe": "1757728380",
    "Hm_lpvt_6e8573fc07ff21285f41cba3fb1618af": "1757728380"
}
url = "https://ask.39.net/news/2685-1.html"
response = requests.get(url, headers=headers, cookies=cookies)

response.encoding = 'utf-8'  # add this line
print(response.text)
print(response)

# Optional: write the returned HTML to test.html so it can be inspected later
with open("test.html", "w", encoding="utf-8") as f:
    f.write(response.text)
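
The goal stated at the beginning was to extract the questions and their answers, while the script above only saves the raw HTML. A minimal parsing sketch with BeautifulSoup is shown below; the CSS selector used here is an assumption about the page structure and needs to be adjusted after inspecting the saved test.html.

from bs4 import BeautifulSoup

# Parse the HTML saved by test01.py
with open("test.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# NOTE: "p.p_title a" is a hypothetical selector for the question links;
# replace it with whatever markup the real page uses.
for link in soup.select("p.p_title a"):
    question = link.get_text(strip=True)   # question text
    href = link.get("href")                # link to the page holding the answers
    print(question, href)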
