Spider
[^]: Implemented in Python 3
The urllib.request module
Opens URLs over the HTTP, HTTPS, and FTP protocols; used mainly for HTTP.
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
data is the payload submitted to the URL via POST; timeout sets the timeout in seconds (it applies only to HTTP, HTTPS, and FTP connections); the ca-* arguments concern certificate validation (cafile and capath specify a set of trusted CA certificates for HTTPS requests).
P.S. Since Python 3.6, cafile, capath, and cadefault are deprecated in favor of context. Use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates.
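A minimal, illustrative sketch (not from the original post) of the recommended replacement: build an ssl context and hand it to urlopen through the context parameter instead of cafile/capath. The URL and timeout here are placeholders.

```python
import ssl
import urllib.request

# create_default_context() loads the system's trusted CA certificates
context = ssl.create_default_context()

try:
    response = urllib.request.urlopen('https://www.baidu.com',
                                      timeout=3, context=context)
    print(response.getcode())
except OSError as e:  # URLError, timeouts and TLS handshake failures are all OSError
    print('request failed:', e)
```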
The object returned by this function has the following methods:
geturl() returns the URL of the resource actually retrieved, commonly used to determine whether a redirect was followed.
getcode() returns the HTTP status code of the response (200: OK, 404: Not Found, 503: Service Unavailable).
info() returns the meta-information of the page as an email.message_from_string() instance.
Retrieving the Baidu home page:
__author__ = 'QCF'

import os
import platform
import time
import urllib.request

def clear():
    '''Clear the screen.'''
    print('A lot of content; next page in 3 s')
    time.sleep(3)
    OS = platform.system()
    if OS == 'Windows':
        os.system('cls')
    else:
        os.system('clear')

def linkBaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
        result = response.read().decode('utf-8')
    except Exception as e:
        print('Bad network address:', e)
        exit()
    with open('baidu.txt', 'w', encoding='utf-8') as fp:
        fp.write(result)
    print("URL info: response.geturl():%s" % response.geturl())
    print("Status code: response.getcode():%s" % response.getcode())
    print("Response headers: response.info():%s" % response.info())
    print("Page content saved to baidu.txt in the current directory")

if __name__ == '__main__':
    linkBaidu()
================= RESTART: F:/Python/Spyder/ConnectBaidu.py =================
URL info: response.geturl():http://www.baidu.com
Status code: response.getcode():200
Response headers: response.info():Bdpagetype: 1
Bdqid: 0xef5df86600013d61
Cache-Control: private
Content-Type: text/html
Cxy_all: baidu+1c8349b37b441e6932e8b8b6e4747690
Date: Fri, 25 Jan 2019 15:03:18 GMT
Expires: Fri, 25 Jan 2019 15:03:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=28A5143FAE268F8DB5005D86DECF2D35:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=28A5143FAE268F8DB5005D86DECF2D35; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1548428598; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=26524_1439_21110_28329_28414_20718; path=/; domain=.baidu.com
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked

Page content saved to baidu.txt in the current directory
If url = 'http://www.baidu.com' is changed to url = 'https://www.baidu.com', the output becomes:
================= RESTART: F:/Python/Spyder/ConnectBaidu.py =================
URL info: response.geturl():https://www.baidu.com
Status code: response.getcode():200
Response headers: response.info():Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Fri, 25 Jan 2019 15:03:32 GMT
Etag: "5c36c624-e3"
Last-Modified: Thu, 10 Jan 2019 04:12:20 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=8CE73187BDBE7A99BC73BBDDA28A698C; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1548428612; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close

Page content saved to baidu.txt in the current directory
The saved baidu.txt then contains only a small page that redirects back to HTTP:

<html>
<head>
    <script>
        location.replace(location.href.replace("https://", "http://"));
    </script>
</head>
<body>
    <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
Note: with open('baidu.txt', 'w', encoding='utf-8') as fp: is the Windows form; on Linux, with open('baidu.txt', 'w') as fp: is enough.
Why? Windows opens new files with the gbk codec by default. If the utf-8 text produced by result = response.read().decode('utf-8') is then written through the gbk codec, the write fails with UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 29527: illegal multibyte sequence.
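The failure mode can be reproduced without any network access. This sketch simply encodes the character named in the error message above with the gbk codec:

```python
# '\xbb' appears in Baidu's utf-8 page content but has no gbk encoding,
# which is exactly what the UnicodeEncodeError above complains about.
snippet = '\xbb'

try:
    snippet.encode('gbk')
except UnicodeEncodeError as e:
    print(e)  # 'gbk' codec can't encode character '\xbb' ...
```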
But if the url is HTTPS, this error does not occur. (Presumably because the HTTPS response is just the short, pure-ASCII redirect page shown above, which the gbk codec can encode without trouble.)
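One way to check this (a demonstration I added, not part of the original post): gbk is an ASCII superset, and the redirect page the HTTPS request returns contains only ASCII characters, so writing it through the gbk codec cannot fail.

```python
# A fragment of the redirect page saved in the https case
redirect = '<script>location.replace(location.href.replace("https://","http://"));</script>'

print(redirect.isascii())  # pure ASCII, so gbk can always encode it
# gbk encodes ASCII text byte-for-byte identically to the ascii codec
print(redirect.encode('gbk') == redirect.encode('ascii'))
```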
The urllib package contains four modules:
urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files
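The list above can be rounded out with a quick, illustrative sketch of the two modules not used so far, urllib.parse and urllib.robotparser (the URLs and robots.txt rules here are made up):

```python
from urllib import parse, robotparser

# urllib.parse splits a URL into its components
parts = parse.urlparse('http://www.baidu.com/s?wd=python')
print(parts.scheme)   # http
print(parts.netloc)   # www.baidu.com
print(parts.query)    # wd=python

# urllib.robotparser answers "may this path be crawled?" from robots.txt rules;
# parse() accepts the file's lines directly, so no network request is needed
rp = robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'http://example.com/private/page'))  # False
print(rp.can_fetch('*', 'http://example.com/index.html'))    # True
```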
(Reference: https://docs.python.org/3/library/urllib.html)