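The resulting URL:
https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid=384801460&date=2021-08-08
In this URL, oid is the id of the video's danmaku link, and the date parameter is the date selected a moment ago. To get all of the video's danmaku, only the date parameter needs to change. The date values can be taken from the danmaku date URL above, or constructed by hand. The date list is returned as JSON, while the seg.so response is a binary payload rather than JSON, which is why the code extracts the danmaku text with a regular expression.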
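As a rough sketch of the "construct it by hand" option, the snippet below builds one seg.so URL per calendar day of a month. The helper name `build_history_urls` is made up for illustration, and it simply assumes every day of the month is worth requesting; days without danmaku would just return nothing useful.

```python
import calendar

# Endpoint taken from the captured URL above
HISTORY_SEG_URL = "https://api.bilibili.com/x/v2/dm/web/history/seg.so"


def build_history_urls(oid, month):
    """Return one seg.so URL per calendar day of `month` (format 'YYYY-MM').

    Hypothetical helper: it does not check which days actually have danmaku.
    """
    year, mon = (int(part) for part in month.split("-"))
    days_in_month = calendar.monthrange(year, mon)[1]
    return [
        f"{HISTORY_SEG_URL}?type=1&oid={oid}&date={year:04d}-{mon:02d}-{day:02d}"
        for day in range(1, days_in_month + 1)
    ]


urls = build_history_urls("384801460", "2021-08")
print(len(urls))  # number of days in August 2021
print(urls[7])    # the URL for 2021-08-08
```

Using the date index endpoint is usually the safer choice, since it only lists days that really have danmaku.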
Hands-on code
import re

import pandas as pd
import requests


def data_resposen(url):
    """Send a GET request with a login cookie and a browser user-agent."""
    headers = {
        "cookie": "your cookie",  # replace with the cookie of a logged-in session
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    }
    resposen = requests.get(url, headers=headers)
    return resposen


def main(oid, month):
    # Fetch every date in the given month that has danmaku
    url = f'https://api.bilibili.com/x/v2/dm/history/index?type=1&oid={oid}&month={month}'
    list_data = data_resposen(url).json()['data'] or []
    print(list_data)
    rows = []
    for date in list_data:
        urls = f'https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid={oid}&date={date}'
        # Pull every run of Chinese characters out of the response text
        text = re.findall("[\u4E00-\u9FA5]+", data_resposen(urls).text)
        for e in text:
            print(e)
            rows.append(e)
    # Build the table once instead of concatenating one row at a time
    df = pd.DataFrame({'弹幕': rows})
    df.to_csv('弹幕.csv', encoding='utf-8', index=False, mode='a')


if __name__ == '__main__':
    oid = '384801460'  # id of the video's danmaku link
    month = '2021-08'  # month to scrape, in YYYY-MM format
    main(oid, month)
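To make the extraction step a little more concrete, here is a small standalone sketch of pulling runs of Chinese characters out of mixed text. The sample string is invented for illustration and only imitates a payload with control bytes and Latin text around the danmaku; it is not real API output.

```python
import re

# Matches one or more consecutive characters in the common CJK range
CHINESE_RUN = re.compile(r"[\u4E00-\u9FA5]+")


def extract_chinese(text):
    """Return every run of Chinese characters found in `text`."""
    return CHINESE_RUN.findall(text)


# Invented sample: control bytes, digits and Latin letters around two danmaku
sample = "\x0a\x1e2021abc哥斯拉\x00xyz太好看了!!123"
print(extract_chinese(sample))  # ['哥斯拉', '太好看了']
```

Note the trade-off: this approach drops digits, Latin letters, punctuation and emoji, and a danmaku that mixes them with Chinese text gets split into several pieces.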
Results:

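The comments of a Bilibili video sit at the bottom of the page. With the browser's developer tools open, just scroll down and the data packets will load: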

The real URLs:
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550479&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1&_=1629012090500
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550483&jsonp=jsonp&next=2&type=1&oid=589656273&mode=3&plat=1&_=1629012513080
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550484&jsonp=jsonp&next=3&type=1&oid=589656273&mode=3&plat=1&_=1629012803039
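The URLs differ in the next parameter and in the _ and callback parameters. Of these, _ is a timestamp and callback is a noise parameter, so both can simply be removed. The next parameter is 0 in the first URL, 2 in the second and 3 in the third, so the first request always uses next=0 and the value increases by one from the second request onward, starting at 2. The response is in JSON format.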
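The parameter cleanup described above can be sketched with the standard library. The helper names `strip_noise_params` and `next_values` are made up for illustration, and the pattern 0, 2, 3, ... is only what the three captured URLs suggest.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters identified above as removable: the JSONP callback and the timestamp
NOISE_PARAMS = {"callback", "_"}


def strip_noise_params(url):
    """Return `url` with the callback and _ query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in NOISE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def next_values(count):
    """Return the first `count` values of next: 0, then 2, 3, 4, ..."""
    return [0] + list(range(2, count + 1))


captured = ("https://api.bilibili.com/x/v2/reply/main"
            "?callback=jQuery1720034332372316460136_1629011550479"
            "&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1&_=1629012090500")
print(strip_noise_params(captured))
print(next_values(3))  # [0, 2, 3]
```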
Hands-on code
import pandas as pd
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
rows = []
try:
    a = 1
    while True:
        # URL with the unneeded parameters removed:
        # the first request uses next=0, later requests use next=2, 3, ...
        next_value = 0 if a == 1 else a
        url = f'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={next_value}&type=1&oid=589656273&mode=3&plat=1'
        print(url)
        html = requests.get(url, headers=headers).json()
        replies = html['data']['replies']
        if not replies:  # no more comments, stop paging
            break
        for i in replies:
            rows.append({
                '用户名称': i['member']['uname'],  # user name
                '用户性别': i['member']['sex'],  # user gender
                '用户id': i['mid'],  # user id
                'vip等级': i['member']['level_info']['current_level'],  # user level
                '用户评论': i['content']['message'].replace('\n', ''),  # comment text
                '评论点赞次数': i['like'],  # number of likes on the comment
                '评论时间': i['ctime'],  # comment time
            })
        a += 1  # move on to the next page
except Exception as e:
    print(e)
# Save whatever was collected, even if a request failed part-way
df = pd.DataFrame(rows)
df.to_csv('奥运会.csv', encoding='utf-8', index=False)
print(df.shape)
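The comment time collected above is stored as a raw number. Assuming ctime is a Unix timestamp in seconds (an assumption, not something the response documents), a small sketch for turning it into a readable Beijing-time string could look like this. The helper name `format_ctime` and the sample value are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset for Beijing time
BEIJING = timezone(timedelta(hours=8))


def format_ctime(ctime):
    """Format a Unix timestamp in seconds as 'YYYY-MM-DD HH:MM:SS' (UTC+8)."""
    return datetime.fromtimestamp(ctime, tz=BEIJING).strftime("%Y-%m-%d %H:%M:%S")


# Illustrative value only, not taken from a real comment
print(format_ctime(1629011550))  # 2021-08-15 15:12:30
```

With pandas, the same conversion can be applied to the whole 评论时间 column before saving.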
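Results are shown below. The scraped content does not include second-level comments (replies to comments); if you need them, you can scrape them yourself, and the steps are much the same: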

This part uses the film Godzilla vs. Kong (《哥斯拉大战金刚》) as an example to show how to scrape danmaku and comments from iQiyi videos!
Page URL:
https://www.iqiyi.com/v_19rr0m845o.html
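Danmaku: analyzing the page
As with Bilibili, getting iQiyi danmaku means opening the developer tools and capturing packets. This yields a br-compressed file that can be downloaded directly by clicking on it, and its content is binary data. One data packet is loaded for every minute of video played: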
