An introduction to crawling the Toutiao (今日头条) app
I have been busy with work lately, mostly crawling news and information content. The sites I have crawled include Toutiao (今日头条), Phoenix (凤凰), NetEase (网易), and Tencent (腾讯); this post is a summary of crawling these large sites.
1. You must know how to configure a mobile packet-capture tool, otherwise you cannot capture the API endpoints reliably.
2. Look for patterns in the captured endpoints.
3. Be clear about which fields you actually need.
4. Write the crawler.
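The code snippets later in this post are fragments pulled out of my spider, so they refer to things like self.headers. If you want to follow along, you can assume they live inside methods of a small class roughly like the sketch below; the class name and the User-Agent value are only placeholders of mine, use whatever you capture from your own device.

import json
import requests

class NewsSpider:
    def __init__(self):
        # request headers captured from the app; the User-Agent below is just a placeholder
        self.headers = {
            'User-Agent': 'okhttp/3.10.0',
        }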
Through the captured endpoints I found all of the categories:
classify_url = 'https://is.snssdk.com/article/category/get_subscribed/v4/?iid=45032656046&device_id=43306941482&ac=wifi&channel=update&aid=13&app_name=news_article&version_code=693&version_name=6.9.3&device_platform=android&ab_version=425531%2C511489%2C512527%2C421244%2C486953%2C494121%2C513028%2C519225%2C239095%2C500091%2C467914%2C170988%2C493249%2C398175%2C519895%2C442127%2C374116%2C437000%2C478532%2C517767%2C489317%2C501961%2C519804%2C276206%2C519509%2C459645%2C500387%2C416055%2C510641%2C392461%2C470730%2C495896%2C378451%2C471406%2C510754%2C519795%2C516760%2C509305%2C512393%2C512914%2C468954%2C271178%2C424178%2C326524%2C326532%2C496389%2C508197%2C345191%2C519949%2C516309%2C518639%2C515800%2C489801%2C510935%2C455646%2C424176%2C214069%2C497615%2C507003%2C482355%2C510710%2C519295%2C442255%2C519259%2C519017%2C520601%2C512958%2C489514%2C280447%2C520688%2C281294%2C513401%2C325616%2C515839%2C498551%2C520553%2C386888%2C520089%2C498375%2C516137%2C513578%2C467513%2C515673%2C513283%2C444465%2C304488%2C261581%2C403270%2C484178%2C457480%2C502680%2C512027%2C510536&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_group=94570%2C102754%2C181429&ab_feature=94570%2C102754&abflag=3&ssmix=a&device_type=NX563J&device_brand=nubia&language=zh&os_api=25&os_version=7.1.1&uuid=864460031530349&openudid=f1082e56b1908c9c&manifest_version_code=692&resolution=1080*1920&dpi=480&update_version_code=69305&_rticket=1538042842567&fp=GSTqFS4MLrx7FlPZc2U1Flx7P24M&tma_jssdk_version=1.3.0.1&pos=5r_-9Onkv6e_eBEKeScxeCUfv7G_8fLz-vTp6Pn4v6esrKuzr6WpqKSxv_H86fTp6Pn4v6eupLOlrqmtqqSxv_zw_O3e9Onkv6e_eBEKeScxeCUfv7G__PD87dHy8_r06ej5-L-nrKyrs6mkrKWoqrG__PD87dH86fTp6Pn4v6eupLOkrKmqpKTg&rom_version=25&plugin=26894&ts=1538042842&as=a2d5ea8a7aed3bfbec7259&mas=00f531ef9a8037a65e770c80d5e613fbf128caa4888a605ed5'
Then I found the endpoint for the feed (list page):
base_url = 'https://is.snssdk.com/api/news/feed/v88/?list_count=17&category={}&refer=1&refresh_reason=5&session_refresh_idx=1&count=20&min_behot_time=1537635643&last_refresh_sub_entrance_interval=1538041336&loc_mode=0&loc_time=1537701890&latitude=39.834079&longitude=116.28459&city=%E5%8C%97%E4%BA%AC%E5%B8%82&tt_from=enter_auto&lac=4282&cid=7752303&plugin_enable=3&iid=45032656046&device_id=43306941482&ac=wifi&channel=update&aid=13&app_name=news_article&version_code=693&version_name=6.9.3&device_platform=android&ab_version=425531%2C511489%2C512527%2C421244%2C486953%2C494121%2C513028%2C519225%2C239095%2C500091%2C467914%2C170988%2C493249%2C398175%2C519895%2C442127%2C374116%2C437000%2C478532%2C517767%2C489317%2C501961%2C519804%2C276206%2C519509%2C459645%2C500387%2C416055%2C510641%2C392461%2C470730%2C495896%2C378451%2C471406%2C510754%2C519795%2C516760%2C509305%2C512393%2C512914%2C468954%2C271178%2C424178%2C326524%2C326532%2C496389%2C508197%2C345191%2C519949%2C516309%2C518639%2C515800%2C489801%2C510935%2C455646%2C424176%2C214069%2C497615%2C507003%2C482355%2C510710%2C519295%2C442255%2C519259%2C519017%2C520601%2C512958%2C489514%2C280447%2C520688%2C281294%2C513401%2C325616%2C515839%2C498551%2C520553%2C386888%2C520089%2C498375%2C516137%2C513578%2C467513%2C515673%2C513283%2C444465%2C510536%2C304488%2C261581%2C403270%2C484178%2C457480%2C502680%2C512027&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_group=94570%2C102754%2C181429&ab_feature=94570%2C102754&abflag=3&ssmix=a&device_type=NX563J&device_brand=nubia&language=zh&os_api=25&os_version=7.1.1&uuid=864460031530349&openudid=f1082e56b1908c9c&manifest_version_code=692&resolution=1080*1920&dpi=480&update_version_code=69305&_rticket=1538041336618&fp=GSTqFS4MLrx7FlPZc2U1Flx7P24M&tma_jssdk_version=1.3.0.1&pos=5r_-9Onkv6e_eBEKeScxeCUfv7G_8fLz-vTp6Pn4v6esrKuzr6WpqKSxv_H86fTp6Pn4v6eupLOlrqmtqqSxv_zw_O3e9Onkv6e_eBEKeScxeCUfv7G__PD87dHy8_r06ej5-L-nrKyrs6mkrKWoqrG__PD87dH86fTp6Pn4v6eupLOkrKmqpKTg&rom_version=25&plugin=26894&ts=1538041336&as=a2d56aba88bfab35ec7222&mas=00b339523bce59cab47cb99ee6d66e76d36864a4888a8080da&cp=58b0a9cfaa5f8q1'
Note: category={} is a placeholder for the category. The value to fill in can be taken from the category endpoint above.
The code for extracting those fields is as follows:
res = requests.get(classify_url)
html = json.loads(res.text)
datas = html['data']['data']
print(len(datas))
for data in datas:
    # channel name shown in the app
    column = data['name']
    print(column)
    # category field used in the feed URL
    category = data['category']
Then splice that value into the feed URL and you get the corresponding list page. Once you have the list page, the next step is to get the address of each detail page, which is also obtained from an endpoint.
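As a quick illustration of that splicing, here is a minimal sketch that builds one feed URL per category collected above (the variable names follow the snippets in this post):

# one feed URL per category found through classify_url
for data in datas:
    category = data['category']
    feed_url = base_url.format(category)  # the {} placeholder is replaced by the category
    print(feed_url[:100])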
This part is much simpler. There are several workable approaches; I will describe just one here. Using the packet-capture tool I found this endpoint:
text_url = "http://a3.pstatp.com/article/content/21/1/{}/{}/1/0/?iid=37457543399&device_id=55215909025&ac=wifi&channel=tengxun2&aid=13&app_name=news_article&version_code=682&version_name=6.8.2&device_platform=android&ab_version=261581%2C403271%2C197606%2C293032%2C405731%2C418881%2C413287%2C271178%2C357705%2C377637%2C326524%2C326532%2C405403%2C415915%2C409847%2C416819%2C402597%2C369470%2C239096%2C170988%2C416198%2C390549%2C404717%2C374117%2C416708%2C416648%2C265169%2C415090%2C330633%2C297058%2C410260%2C276203%2C413705%2C320832%2C397738%2C381405%2C416055%2C416153%2C401106%2C392484%2C385726%2C376443%2C378451%2C401138%2C392717%2C323233%2C401589%2C391817%2C346557%2C415482%2C414664%2C406427%2C411774%2C345191%2C417119%2C377633%2C413565%2C414156%2C214069%2C31211%2C414225%2C411334%2C415564%2C388526%2C280449%2C281297%2C325614%2C324092%2C357402%2C414393%2C386890%2C411663%2C361348%2C406418%2C252782%2C376993%2C418024&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_feature=102749%2C94563&abflag=3&ssmix=a&device_type=MI+3C&device_brand=Xiaomi&language=zh&os_api=19&os_version=4.4.4&uuid=99000549116036&openudid=efcc6d4284c6c458&manifest_version_code=682&resolution=1080*1920&dpi=480&update_version_code=68210&_rticket=1532142082952&rom_version=miui_v7_5.12.4&plugin=32&pos=5r_88Pzt0fzp9Ono-fi_p66ps6-oraylqrG__PD87d706eS_p794Iw14KgN4JR-_sb_88Pzt0fLz-vTp6Pn4v6esrKqzrKSvqq6k4A%3D%3D&fp=z2T_L2mOLSxbFlHIPlU1FYweFzKe&ts=1532142082&as=a255cac5b2208bd2a23862&mas=00e35bc961329fe4e2da0242394f32b692264a2c00d8a582a8"
Note: the two {} placeholders also need to be filled in, and the value comes from the list page. The field is sometimes present in a feed item and sometimes missing, which is why I use exception handling.
The code for extracting this field is as follows:
res = requests.get(base_url.format(category), headers=self.headers)  # fill in the category first
html = json.loads(res.text)
print(res.status_code, '-------')
datas = html['data']
for data in datas:
    try:
        # id of the detail page
        group_id = (json.loads(data["content"]))["group_id"]
    except Exception:
        # some feed items carry no content / group_id
        group_id = 0
    if group_id != 0:
        print(group_id)
Next comes splicing together the detail-page address, and after that it is just a matter of extracting the title and the body; I will not say much about that here, since there is nothing technically demanding about it.
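For completeness, here is a minimal sketch of those last two steps. It assumes that both {} placeholders in text_url take the group_id extracted above and that the article sits under data with title and content fields; if your own capture shows different key names, adjust accordingly.

# build the detail-page address; I assume both placeholders take the group_id
detail_url = text_url.format(group_id, group_id)
res = requests.get(detail_url, headers=self.headers)
detail = json.loads(res.text)
# the key names below are an assumption; confirm them against your own capture
article = detail.get('data', {})
title = article.get('title')
content = article.get('content')  # the article body comes back as HTML
print(title)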
If you want the source code you can contact me, but I hope you will capture the traffic yourselves, find the endpoints, and finish it following the approach above. The main anti-crawling measure is a limit on how heavily the endpoints can be hit, so you also need to rotate the UA and the IP (a rough sketch of that follows below). More write-ups on other news apps are coming; thanks for following!
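On rotating the UA and the IP, here is a minimal sketch of picking a random User-Agent and proxy per request; the UA strings and the proxy address below are only placeholders, so supply a pool of your own.

import random

# placeholder pools; replace them with UA strings and proxies of your own
user_agents = [
    'Mozilla/5.0 (Linux; Android 7.1.1; NX563J) AppleWebKit/537.36',
    'Mozilla/5.0 (Linux; Android 4.4.4; MI 3C) AppleWebKit/537.36',
]
proxy_pool = [
    {'http': 'http://127.0.0.1:8888', 'https': 'http://127.0.0.1:8888'},
]

headers = {'User-Agent': random.choice(user_agents)}
proxies = random.choice(proxy_pool)
res = requests.get(base_url.format(category), headers=headers,
                   proxies=proxies, timeout=10)
print(res.status_code)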