Python Crawler Network Requests: requests (GET, POST)

Time: 2022-12-07 20:10:05

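Python network request modules

urllib

urllib is Python's built-in HTTP request library, so it can be used without any extra installation. It contains four modules.

The first module, request, is the most basic HTTP request module. It can simulate sending a request, just like typing a URL into a browser and pressing Enter: pass the URL and any extra parameters to the library function and the whole process is reproduced.

The second module, error, is the exception-handling module. If a request fails, we can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.

The third module, parse, is a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.

The fourth module, robotparser, parses a site's robots.txt file to determine which URLs may be crawled and which may not. In practice it is used less often.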
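
As a small illustration of the parse and robotparser modules described above, the sketch below splits and rebuilds a URL and checks two paths against a robots.txt policy. The host example.com and the robots.txt rules are made up for the example, and the rules are supplied as text so no network access is needed.

```python
from urllib.parse import urlparse, urljoin, urlencode
from urllib.robotparser import RobotFileParser

# Hypothetical URL used only for illustration
parts = urlparse("https://example.com/web?query=python")
scheme, host, path, query = parts.scheme, parts.netloc, parts.path, parts.query

# Join a relative path onto a base URL and append an encoded query string
search_url = urljoin("https://example.com/", "web") + "?" + urlencode({"query": "python"})

# Hypothetical robots.txt rules, parsed from text instead of being fetched
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
can_fetch_public = robots.can_fetch("*", "https://example.com/web")
can_fetch_private = robots.can_fetch("*", "https://example.com/private/data")

print(scheme, host, path, query)
print(search_url)
print(can_fetch_public, can_fetch_private)
```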

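urllib is relatively old, and its crawler-related operations are cumbersome and verbose to use.

The requests library appeared later and has gradually replaced urllib for this kind of work.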

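requests

Definition

The requests module: a third-party Python module for sending network requests. It is not part of the standard library and must be installed with pip.

Features

Powerful, simple and convenient to use, and efficient.

Purpose

Simulate a browser sending requests.

How a browser sends a request

Specify the URL, send an HTTP/HTTPS request (GET), and receive the response data.

How requests sends a request

Specify the URL, send an HTTP/HTTPS request (GET or POST), obtain the response data, and persist it to storage.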

Coding workflow with requests

Set up the environment

Install the requests module:

pip install requests

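Hands-on coding

Scraping the Sogou home page

Request the Sogou home page and write its HTML source to the file sogo.html.

import requests

if __name__ == "__main__":
    # NOTE: only the path of the URL is given here; prepend the site's scheme and host.
    url = '/'
    # Send a GET request and read the body as text
    response = requests.get(url=url)
    page_text = response.text
    # Persist the page source (the ./requests directory must already exist)
    with open('./requests/sogo.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('over')

When the saved file is opened it has no styling applied, because only the HTML was downloaded. That is fine: we only need the data.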

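UA spoofing

UA: User-Agent, the identity string of the client that sends the request.

UA detection (an anti-crawling mechanism): the portal site's server checks the User-Agent of each incoming request. If it identifies the client as a known browser, the request is treated as normal and the server does not reject it. Otherwise the request is treated as abnormal (a crawler), and the server is likely to reject it.

A normal request carries a browser User-Agent, such as the Chrome string used in the examples in this article.

UA spoofing (a counter to the anti-crawling mechanism): set the User-Agent of the crawler's request to that of a real browser. Without spoofing, the crawler may receive no data or garbled data.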
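
To see what UA spoofing changes, the following sketch compares the User-Agent that requests sends by default with the one sent when a headers dictionary is supplied. The host example.com is a placeholder, and the requests are only prepared, never sent, so the snippet runs offline.

```python
import requests

# Browser User-Agent string taken from the examples in this article
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')

session = requests.Session()

# prepare_request merges the session's default headers into the request
# without sending anything over the network.
plain = session.prepare_request(requests.Request('GET', 'http://example.com/'))
spoofed = session.prepare_request(
    requests.Request('GET', 'http://example.com/', headers={'User-Agent': BROWSER_UA})
)

print(plain.headers['User-Agent'])    # library default, starts with "python-requests/"
print(spoofed.headers['User-Agent'])  # the browser string defined above
```

The default value identifies the client as a script, which is what a UA check on the server side would look for.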

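Fetching Sogou search results with UA spoofing

import requests

if __name__ == "__main__":
    # Spoof a browser User-Agent so the request looks like it comes from Chrome
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    # NOTE: only the path of the URL is given here; prepend the site's scheme and host.
    url = '/web'
    kw = input('enter a keyword:')
    # Query-string parameters; requests encodes them into the URL
    param = {'query': kw}
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    with open('./requests/sogo.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('over')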
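
The params argument used above is turned into a query string by requests. A minimal offline sketch of that step, assuming the placeholder host example.com, prepares the request and inspects the resulting URL:

```python
import requests

# Placeholder host; the request is prepared but never sent
ascii_request = requests.Request('GET', 'http://example.com/web', params={'query': 'python'})
ascii_url = ascii_request.prepare().url
print(ascii_url)  # http://example.com/web?query=python

# Non-ASCII keywords are percent-encoded as UTF-8 bytes
chinese_request = requests.Request('GET', 'http://example.com/web', params={'query': '爬虫'})
chinese_url = chinese_request.prepare().url
print(chinese_url)
```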

Baidu Translate

After a keyword is typed into the UI, the page sends an AJAX request and the server responds with JSON data.

JSON data

{"errno":0,"data":[{"k":"dog","v":"n. \u72d7; \u8e69\u811a\u8d27; \u4e11\u5973\u4eba; \u5351\u9119\u5c0f\u4eba v. \u56f0\u6270; \u8ddf\u8e2a"},{"k":"DOG","v":"abbr. Data Output Gate \u6570\u636e\u8f93\u51fa\u95e8"},{"k":"doge","v":"n. \u5171\u548c\u56fd\u603b\u7763"},{"k":"dogm","v":"abbr. dogmatic \u6559\u6761\u7684; \u72ec\u65ad\u7684; dogmatism \u6559\u6761\u4e3b\u4e49; dogmatist"},{"k":"Dogo","v":"[\u5730\u540d] [\u9a6c\u91cc\u3001\u5c3c\u65e5\u5c14\u3001\u4e4d\u5f97] \u591a\u6208; [\u5730\u540d] [\u97e9\u56fd] \u9053\u9ad8"}]}

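Fetching the Baidu Translate result in Python: response.json() returns a parsed Python object (here a dict), while response.text, which is an attribute rather than a method, returns the raw string.

import requests
import json

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    kw = input('enter a keyword:')
    # Form data sent in the body of the POST request
    data = {'kw': kw}
    # NOTE: only the path of the URL is given here; prepend the site's scheme and host.
    url = '/sug'
    response = requests.post(url=url, data=data, headers=headers)
    # Parse the JSON body into a Python object
    json_data = response.json()
    print(json_data)
    # ensure_ascii=False keeps Chinese characters readable in the output file
    with open('../' + kw + '.json', 'w', encoding='utf-8') as fp:
        json.dump(json_data, fp=fp, ensure_ascii=False)
    print('over')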
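
Unlike params, the data argument of a POST request goes into the request body rather than the URL. The sketch below, again using the placeholder host example.com and a request that is prepared but not sent, shows where the kw field ends up:

```python
import requests

prepared = requests.Request('POST', 'http://example.com/sug', data={'kw': 'dog'}).prepare()

print(prepared.url)                      # http://example.com/sug
print(prepared.body)                     # kw=dog
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```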

The result is persisted to a JSON file.
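
The difference between the raw text and the parsed object, and the effect of ensure_ascii=False when saving, can be checked with the standard json module alone. The string below is a shortened version of the response body shown earlier; json.loads stands in here for what response.json() does with the body.

```python
import json

# Raw body as a server would send it: Chinese text arrives as \uXXXX escapes
raw_text = '{"errno":0,"data":[{"k":"dog","v":"n. \\u72d7"}]}'

parsed = json.loads(raw_text)  # str -> dict
first = parsed['data'][0]
print(type(raw_text).__name__, type(parsed).__name__)
print(first['k'], first['v'])

escaped = json.dumps(parsed)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(parsed, ensure_ascii=False)  # non-ASCII written as-is
print(escaped)
print(readable)
```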

Douban Movies

import requests
import json

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    params = {
        'type': '24',            # movie type; 24 = comedy
        'interval_id': '100:90',
        'action': '',
        'start': '0',            # index of the first movie to fetch
        'limit': '20',           # number of movies per request
    }
    # NOTE: only the path of the URL is given here; prepend the site's scheme and host.
    url = '/j/chart/top_list'
    response = requests.get(url=url, params=params, headers=headers)
    json_data = response.json()
    print(json_data)
    with open('./requests/douban.json', 'w', encoding='utf-8') as fp:
        json.dump(json_data, fp=fp, ensure_ascii=False)
    print('over')

Scraping KFC restaurant information

import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    data = {
        'cname': '',
        'pid': '',
        'keyword': '厦门',
        'pageIndex': '1',
        'pageSize': '10',
    }
    # NOTE: only the path of the URL is given here; prepend the site's scheme and host.
    url = '/kfccda/ashx/GetStoreList.ashx?op=keyword'
    response = requests.post(url=url, data=data, headers=headers)
    text_data = response.text
    print(text_data)
    # The body is already a string, so write it directly; passing it through
    # json.dump would wrap the whole document in quotes as one JSON string.
    with open('./requests/kendeji.json', 'w', encoding='utf-8') as fp:
        fp.write(text_data)
    print('over')

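Comprehensive case: drug administration cosmetics licences
Requirement

Get the detailed information of every company.

Implementation

First, request the drug administration's cosmetics page directly: http://scxk.:81/xk/

import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'http://scxk.:81/xk/'
    response = requests.get(url=url, headers=headers)
    # Print the page source to check whether it contains the company list
    print(response.text)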

The response does not contain the expected data list.

The data we want to scrape is loaded dynamically by an AJAX request.

Use the requests module to send the same POST request; the Content-Type in the response headers shows that the body is JSON.

import requests
import json

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'http://scxk.:81/xk/itownet/portalAction.do?method=getXkzsList'
    # Form data of the AJAX request that loads one page of the list
    data = {
        'on': 'true',
        'page': 1,
        'pageSize': 15,
        'productName': '',
        'conditionType': 1,
        'applyname': '',
    }
    response = requests.post(url=url, data=data, headers=headers)
    json_data = response.json()
    print(json_data)
    with open('./requests/huazhuanqin.json', 'w', encoding='utf-8') as fp:
        json.dump(json_data, fp=fp, ensure_ascii=False)
    print('over')

JSON data

{"filesize": "", "keyword": "", "list": [{"ID": "a86b953e03fa4cbcb18354588067f68f", "EPS_NAME": "南京天其美生物技术有限公司", "PRODUCT_SN": "苏妆0026", "CITY_CODE": "81", "XK_COMPLETE_DATE": {"date": 7, "day": 1, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1607270400000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-12-06", "QF_MANAGER_NAME": "江苏省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "9135MA1R9MDG61", "XC_DATE": "-12-07", "NUM_": 1}, {"ID": "b2afd7622312450db3d7fcc84fd37195", "EPS_NAME": "江苏海智生物医药有限公司", "PRODUCT_SN": "苏妆0025", "CITY_CODE": "81", "XK_COMPLETE_DATE": {"date": 7, "day": 1, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1607270400000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-12-06", "QF_MANAGER_NAME": "江苏省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91313025480985", "XC_DATE": "-12-07", "NUM_": 2}, {"ID": "5eb10afc74a2462c8e86652ec8d90a48", "EPS_NAME": "无锡邦士立生物科技有限公司", "PRODUCT_SN": "苏妆0013", "CITY_CODE": "82", "XK_COMPLETE_DATE": {"date": 4, "day": 5, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1607011200000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "-04-20", "QF_MANAGER_NAME": "江苏省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "9133355032183D", "XC_DATE": "-12-04", "NUM_": 3}, {"ID": "11e6d8199f7a4496a3caacf7a545de74", "EPS_NAME": "广东贝多宝新生物科技实业有限公司", "PRODUCT_SN": "粤妆0118", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 2, "day": 3, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1606838400000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "-05-13", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440514MA5105RK6Y", "XC_DATE": "-12-02", "NUM_": 4}, {"ID": "39213f625bf8425a8e871fb5b15e1dfa", "EPS_NAME": "吉林一正科技发展有限公司", "PRODUCT_SN": "吉妆0015", "CITY_CODE": "278", "XK_COMPLETE_DATE": {"date": 2, "day": 3, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1606838400000, "timezoneOffset": -480, "year": 120}, 
"XK_DATE": "-09-11", "QF_MANAGER_NAME": "吉林省食品药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91220394559795377J", "XC_DATE": "-12-02", "NUM_": 5}, {"ID": "86f41470eadd42048ba120b6a0b5ce69", "EPS_NAME": "广东璞言生物科技有限公司", "PRODUCT_SN": "粤妆0238", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 1, "day": 2, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1606752000000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-11-30", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440513MA54JAN31N", "XC_DATE": "-12-01", "NUM_": 6}, {"ID": "1bde7d3498c14463b53d708ea474e843", "EPS_NAME": "汕头市澄海区金泳乐化妆品有限公司", "PRODUCT_SN": "粤妆0582", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 1, "day": 2, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1606752000000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "-08-14", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440515730485070R", "XC_DATE": "-12-01", "NUM_": 7}, {"ID": "b47a8488b1a640f0b46cdc73d71fa244", "EPS_NAME": "常州伟博海泰生物科技有限公司", "PRODUCT_SN": "苏妆0086", "CITY_CODE": "91", "XK_COMPLETE_DATE": {"date": 1, "day": 2, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1606752000000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "-12-24", "QF_MANAGER_NAME": "江苏省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91320411076300122U", "XC_DATE": "-12-01", "NUM_": 8}, {"ID": "ae8324769a434638acc24bbb29753cde", "EPS_NAME": "众唯汇诚(辽宁)生物科技有限公司", "PRODUCT_SN": "辽妆0012", "CITY_CODE": "290", "XK_COMPLETE_DATE": {"date": 1, "day": 2, "hours": 0, "minutes": 0, "month": 11, "nanos": 0, "seconds": 0, "time": 1606752000000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2024-12-20", "QF_MANAGER_NAME": "辽宁省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91210500MA1004W54M", "XC_DATE": "-12-01", "NUM_": 9}, {"ID": "a5ff467ab1b648d590b30aa3cc8a2d56", "EPS_NAME": "广州天然国度生物科技有限公司", "PRODUCT_SN": "粤妆0236", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 30, "day": 1, "hours": 0, 
"minutes": 0, "month": 10, "nanos": 0, "seconds": 0, "time": 1606665600000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-11-29", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440101MA9ULP4U43", "XC_DATE": "-11-30", "NUM_": 10}, {"ID": "6005ea53473e4e9ea491bf776bd60e2e", "EPS_NAME": "名宇(广东)化妆品科技有限公司", "PRODUCT_SN": "粤妆0235", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 30, "day": 1, "hours": 0, "minutes": 0, "month": 10, "nanos": 0, "seconds": 0, "time": 1606665600000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-11-29", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440101MA9UKENC2T", "XC_DATE": "-11-30", "NUM_": 11}, {"ID": "b4437b636b5944eb9eadc1418f312b19", "EPS_NAME": "四川隆力奇实业有限公司", "PRODUCT_SN": "川妆0003", "CITY_CODE": "165", "XK_COMPLETE_DATE": {"date": 30, "day": 1, "hours": 0, "minutes": 0, "month": 10, "nanos": 0, "seconds": 0, "time": 1606665600000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "-08-18", "QF_MANAGER_NAME": "四川省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91510112794939158Q", "XC_DATE": "-11-30", "NUM_": 12}, {"ID": "f015717cb3364e47891c6f6140d9c7e4", "EPS_NAME": "丽雅国际生物科技(广州)有限公司", "PRODUCT_SN": "粤妆0237", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 27, "day": 5, "hours": 0, "minutes": 0, "month": 10, "nanos": 0, "seconds": 0, "time": 1606406400000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-11-26", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440101MA9UW9T12P", "XC_DATE": "-11-27", "NUM_": 13}, {"ID": "ccb44aae22554b8e9e60aad4fb6be2aa", "EPS_NAME": "广州大唐化妆品有限公司", "PRODUCT_SN": "粤妆0233", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 27, "day": 5, "hours": 0, "minutes": 0, "month": 10, "nanos": 0, "seconds": 0, "time": 1606406400000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-11-26", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440101MA9UM5B397", "XC_DATE": "-11-27", "NUM_": 14}, {"ID": "4cf125b03b9947a58600302c69ca4402", 
"EPS_NAME": "广州市倩雅丝精细化工有限公司", "PRODUCT_SN": "粤妆0693", "CITY_CODE": null, "XK_COMPLETE_DATE": {"date": 27, "day": 5, "hours": 0, "minutes": 0, "month": 10, "nanos": 0, "seconds": 0, "time": 1606406400000, "timezoneOffset": -480, "year": 120}, "XK_DATE": "2025-11-26", "QF_MANAGER_NAME": "广东省药品监督管理局", "BUSINESS_LICENSE_NUMBER": "91440111716373443G", "XC_DATE": "-11-27", "NUM_": 15}], "orderBy": "createDate", "orderType": "desc", "pageCount": 360, "pageNumber": 1, "pageSize": 15, "property": "", "totalCount": 5400}

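Two parts of this response matter for the next step: the ID of each record and, if dates are needed, the XK_COMPLETE_DATE object. The sketch below works on two records trimmed from the JSON above. It assumes that "time" is a Unix timestamp in milliseconds and that "timezoneOffset" is the number of minutes behind UTC (so -480 means UTC+8); the helper name completion_date is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Two records trimmed from the list response, keeping only the fields used here
payload = {
    "list": [
        {"ID": "a86b953e03fa4cbcb18354588067f68f",
         "XK_COMPLETE_DATE": {"time": 1607270400000, "timezoneOffset": -480}},
        {"ID": "5eb10afc74a2462c8e86652ec8d90a48",
         "XK_COMPLETE_DATE": {"time": 1607011200000, "timezoneOffset": -480}},
    ],
    "pageCount": 360,
}


def completion_date(record):
    """Convert XK_COMPLETE_DATE to an ISO date string (assumed field semantics)."""
    stamp = record["XK_COMPLETE_DATE"]
    # Assumption: offset is minutes behind UTC, so negate it to build the timezone
    tz = timezone(timedelta(minutes=-stamp["timezoneOffset"]))
    return datetime.fromtimestamp(stamp["time"] / 1000, tz=tz).date().isoformat()


ids = [record["ID"] for record in payload["list"]]
dates = [completion_date(record) for record in payload["list"]]
print(ids)
print(dates)  # ['2020-12-07', '2020-12-04']
```

Under these assumptions the results agree with the "date", "month" and "year" fields of the same records (month 11 being December when counted from zero, and year 120 being 2020 when counted from 1900).
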
On the detail page, the data is also loaded dynamically by AJAX.

URL: http://scxk.:81/xk/itownet/portal/dzpz.jsp?id=a86b953e03fa4cbcb18354588067f68f

Scraping the detailed information of one cosmetics company

import requests
import json

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'http://scxk.:81/xk/itownet/portalAction.do?method=getXkzsById'
    # The ID comes from the "ID" field of a record in the list response
    data = {'id': 'a86b953e03fa4cbcb18354588067f68f'}
    response = requests.post(url=url, data=data, headers=headers)
    json_data = response.json()
    print(json_data)
    with open('./requests/huazhuanqindetail.json', 'w', encoding='utf-8') as fp:
        json.dump(json_data, fp=fp, ensure_ascii=False)
    print('over')

JSON data

{"businessLicenseNumber": "9135MA1R9MDG61", "businessPerson": "仇德发", "certStr": "一般液态单元(护肤水类)", "cityCode": "", "countyCode": "", "creatUser": "", "createTime": "", "endTime": "", "epsAddress": "南京市江宁区乾德路2号4栋1楼(江宁高新园)", "epsName": "南京天其美生物技术有限公司", "epsProductAddress": "南京市江宁区乾德路2号4栋1楼(江宁高新园)", "id": "", "isimport": "N", "legalPerson": "仇德发", "offDate": "", "offReason": "", "parentid": "", "preid": "", "processid": "2028150712201bnhb9", "productSn": "苏妆0026", "provinceCode": "", "qfDate": "", "qfManagerName": "江苏省药品监督管理局", "qualityPerson": "何敏", "rcManagerDepartName": "江苏省药品监督管理局(南京检查分局)", "rcManagerUser": "江苏省药品监督管理局(南京检查分局)负责日常监管的人员", "startTime": "", "xkCompleteDate": null, "xkDate": "2025-12-06", "xkDateStr": "-12-07", "xkName": "张春平", "xkProject": "", "xkRemark": "", "xkType": "201"}

Putting it together: fetch the detailed company information for pages 1 to 50.

import requests
import json

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }

    # Step 1: collect the company IDs from list pages 1 to 50
    url = 'http://scxk.:81/xk/itownet/portalAction.do?method=getXkzsList'
    id_lists = []
    for pageid in range(1, 51):
        print(pageid)
        data = {
            'on': 'true',
            'page': pageid,
            'pageSize': 15,
            'productName': '',
            'conditionType': 1,
            'applyname': '',
        }
        response = requests.post(url=url, data=data, headers=headers)
        json_data = response.json()
        for dic in json_data['list']:
            id_lists.append(dic['ID'])

    # Step 2: fetch the detail record for each ID
    data_lists = []
    id_url = 'http://scxk.:81/xk/itownet/portalAction.do?method=getXkzsById'
    for company_id in id_lists:  # renamed from id to avoid shadowing the built-in
        print(company_id)
        id_data = {'id': company_id}
        list_data = requests.post(url=id_url, data=id_data, headers=headers).json()
        data_lists.append(list_data)

    # Step 3: persist all detail records to one JSON file
    with open('./requests/huazhuanqindetail.json', 'w', encoding='utf-8') as fp:
        json.dump(data_lists, fp=fp, ensure_ascii=False)
    print('over')
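
The paging loop above can also be separated from the network call, which makes it easier to test and lets it stop once the last page is reached. The sketch below is one possible refactoring, not the article's code: collect_ids and fetch_page are hypothetical names, and it assumes every list response carries the "list" and "pageCount" keys seen in the sample JSON.

```python
import time


def collect_ids(fetch_page, first_page=1, last_page=50, delay_seconds=0.0):
    """Collect the ID of every record from first_page to last_page.

    fetch_page is any callable that takes a page number and returns the
    parsed list response as a dict.
    """
    ids = []
    for page in range(first_page, last_page + 1):
        payload = fetch_page(page)
        ids.extend(item["ID"] for item in payload.get("list", []))
        # Stop early when the server reports that this was the last page
        if page >= payload.get("pageCount", last_page):
            break
        # Optional pause between requests to reduce load on the server
        if delay_seconds:
            time.sleep(delay_seconds)
    return ids


# Offline stand-in for the real request: three pages with one record each
def fake_fetch_page(page):
    return {"list": [{"ID": "id-%d" % page}], "pageCount": 3}


collected = collect_ids(fake_fetch_page, 1, 50)
print(collected)  # ['id-1', 'id-2', 'id-3']
```

In real use, fetch_page could wrap requests.post(url, data=..., headers=..., timeout=10), call raise_for_status() on the response, and return response.json().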
