Target page: https://gptstore.ai/gpts/categories/finance
Pagination follows this pattern:
https://gptstore.ai/_next/data/S9vKNrHo4K82xWjuXpw-O/en/gpts/categories/finance.json?slug=finance&page=2
https://gptstore.ai/_next/data/S9vKNrHo4K82xWjuXpw-O/en/gpts/categories/finance.json?slug=finance&page=3
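The pattern above can be captured in a small helper. This is a minimal sketch: `build_page_url` and `BASE_URL` are hypothetical names, not part of the original script. The path segment `S9vKNrHo4K82xWjuXpw-O` looks like a Next.js build ID, which typically changes when a site is redeployed, so it may need to be copied again from the browser's network panel.

```python
from urllib.parse import urlencode

# Endpoint observed in the browser; the build ID segment is an assumption
# that holds only until the site is redeployed.
BASE_URL = (
    "https://gptstore.ai/_next/data/S9vKNrHo4K82xWjuXpw-O"
    "/en/gpts/categories/finance.json"
)


def build_page_url(page: int, slug: str = "finance") -> str:
    """Return the JSON endpoint URL for one page of a category."""
    query = urlencode({"slug": slug, "page": page})
    return f"{BASE_URL}?{query}"


# Build the URLs for pages 2 and 3 without sending any request
urls = [build_page_url(p) for p in (2, 3)]
```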
The site is dynamic and returns JSON data:
{
  "pageProps": {
    "gpts": [
      {
        "name": "Finance Consultant",
        "description": "An expert consultant with realtime stock and crypto information",
        "logo": "https://files.oaiusercontent.com/file-dBLUY66YVfjBxi9EgTkau08C?se=2123-10-26T23%3A11%3A45Z&sp=r&sv=2021-08-06&sr=b&rscc=max-age%3D31536000%2C%20immutable&rscd=attachment%3B%20filename%3D152e9d55-44cf-440b-aa86-db587d948007.png&sig=dq8VXuDXcDz%2Bc3IzyzbQzGTQb3OldexX9hO5PX4Hq8A%3D",
        "gptId": "uj0goHTqVH-finance-consultant",
        "gptUrl": "https://chatgpt.com/g/g-0XpYXF4Kg-finance-consultant",
        "conversionCount": 1000,
        "authorName": "http://gptpersonalize.com",
        "pScore": 0,
        "star": 3.75
      },
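To make the nesting concrete, here is a rough sketch of how the records can be reached: the list sits under "pageProps" and then "gpts". The sample string is a trimmed copy of the excerpt above, with closing brackets added only so that it parses. `extract_gpts` is a hypothetical helper name.

```python
import json

# Trimmed copy of the response excerpt; closing brackets added for illustration
sample = """
{"pageProps": {"gpts": [
  {"name": "Finance Consultant",
   "gptId": "uj0goHTqVH-finance-consultant",
   "conversionCount": 1000,
   "star": 3.75}
]}}
"""


def extract_gpts(payload: dict) -> list:
    """Return the list under pageProps -> gpts, or an empty list if absent."""
    return payload.get("pageProps", {}).get("gpts", [])


gpts = extract_gpts(json.loads(sample))
# The keys of a record become the column names of the spreadsheet
columns = list(gpts[0].keys())
```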
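Enter the following prompt in DeepSeek:
You are a Python programming expert. Write a Python script that completes the following task, step by step:
Create a new Excel file on drive F: gptstoreaifinancegpts20240619.xlsx
Request URL: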
https://gptstore.ai/_next/data/S9vKNrHo4K82xWjuXpw-O/en/gpts/categories/finance.json?slug=finance&page={pagenumber}
Request method:
GET
Status code:
200 OK
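{pagenumber} starts at 1, increases by 1, and ends at 10;
Fetch the response, which is nested JSON data;
Get the value of the "gpts" key in the JSON data, which is a list of JSON objects;
For each JSON object, extract all key names and write them to the Excel header row, then write the corresponding values into the rows under those columns;
Save the Excel file;
Note: print a status message to the screen at every step;
Pause for 5-9 seconds after scraping each page;
Preprocess the JSON data so that nested dictionaries and lists are converted into a format suitable for Excel, for example by converting nested dictionaries to strings;
DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so use pd.concat instead.
Set the following request headers: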
Accept:
*/*
Accept-Encoding:
gzip, deflate, br, zstd
Accept-Language:
zh-CN,zh;q=0.9,en;q=0.8
Priority:
u=1, i
Referer:
https://gptstore.ai/gpts/categories/finance
Sec-Ch-Ua:
"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
Sec-Ch-Ua-Mobile:
?0
Sec-Ch-Ua-Platform:
"Windows"
Sec-Fetch-Dest:
empty
Sec-Fetch-Mode:
cors
Sec-Fetch-Site:
same-origin
User-Agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
X-Nextjs-Data:
1
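The headers above are pasted from the browser's developer tools as alternating name and value lines. A minimal sketch of turning that text into a dictionary that can be passed to requests follows. `parse_devtools_headers` is a hypothetical helper, and the sketch assumes every header name sits on its own line ending in a colon, directly followed by its value line. Only a few of the headers are repeated here to keep it short.

```python
# A shortened copy of the pasted header text (name line, then value line)
RAW_HEADERS = """Accept:
*/*
Referer:
https://gptstore.ai/gpts/categories/finance
Sec-Ch-Ua-Mobile:
?0
X-Nextjs-Data:
1"""


def parse_devtools_headers(raw: str) -> dict:
    """Pair each 'Name:' line with the value line that follows it."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    headers = {}
    # Even positions hold names, odd positions hold values
    for name, value in zip(lines[0::2], lines[1::2]):
        headers[name.rstrip(":")] = value
    return headers


headers = parse_devtools_headers(RAW_HEADERS)
```

The resulting dictionary has the same shape as the hand-written `headers` dictionary in the script.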
Source code:
import json
import random
import time

import pandas as pd
import requests

# Request headers copied from the browser
headers = {
    "Accept": "*/*",
    # br and zstd are left out: requests only decodes them when optional packages are installed
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Priority": "u=1, i",
    "Referer": "https://gptstore.ai/gpts/categories/finance",
    "Sec-Ch-Ua": '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "X-Nextjs-Data": "1",
}

# Collect one DataFrame per page and combine them once at the end
frames = []

# Iterate over pages 1 to 10
for page_number in range(1, 11):
    print(f"Scraping page {page_number}...")
    url = f"https://gptstore.ai/_next/data/S9vKNrHo4K82xWjuXpw-O/en/gpts/categories/finance.json?slug=finance&page={page_number}"
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 200:
        data = response.json()
        # Extract the list of GPT records
        items = data["pageProps"]["gpts"]
        rows = []
        for item in items:
            # Serialize nested dicts and lists to JSON strings so Excel can store them
            flat_item = {}
            for key, value in item.items():
                if isinstance(value, (dict, list)):
                    flat_item[key] = json.dumps(value, ensure_ascii=False)
                else:
                    flat_item[key] = value
            rows.append(flat_item)
        if rows:
            frames.append(pd.DataFrame(rows))
        print(f"Page {page_number}: {len(rows)} records")
    else:
        print(f"Request failed, status code: {response.status_code}")
    # Pause for a random 5-9 seconds after each page
    time.sleep(random.uniform(5, 9))

# Combine all pages with pd.concat (DataFrame.append was removed in pandas 2.0)
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Save to the Excel file (writing .xlsx requires the openpyxl package)
excel_file = "F:/gptstoreaifinancegpts20240619.xlsx"
df.to_excel(excel_file, index=False)
print(f"Data saved to {excel_file}")
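The flattening step can also be pulled out into a small function so it can be checked without any network access. This is only a sketch: `flatten_item` is a hypothetical name, and the "tags" field in the sample record is made up to show how a nested list is handled. It does not appear in the site's response excerpt.

```python
import json


def flatten_item(item: dict) -> dict:
    """Serialize nested dicts and lists to JSON strings; keep scalars as-is."""
    flat = {}
    for key, value in item.items():
        if isinstance(value, (dict, list)):
            # ensure_ascii=False keeps non-ASCII text readable in the cell
            flat[key] = json.dumps(value, ensure_ascii=False)
        else:
            flat[key] = value
    return flat


# "tags" is an invented field used only to exercise the nested case
record = {"name": "Finance Consultant", "star": 3.75, "tags": ["stocks", "crypto"]}
row = flatten_item(record)
```

Every value in `row` is a scalar or a string, so a list of such rows can be passed to `pd.DataFrame` without further work.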