The Python urlparse Module

Date: 2019-11-22

Introduction to the urlparse module

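The urlparse module splits a URL into its components. It supports almost all common URL schemes, including file, ftp, http, https, imap, mailto, mms, news, nntp, prospero, rtsp, sftp, shttp, sip, svn+ssh, and telnet. In Python 3 the module lives in urllib.parse.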
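For code that has to run on both Python 2 and Python 3, the change of location can be handled with a fallback import. This is only a minimal sketch; the URL is a placeholder chosen for illustration.

```python
# Try the Python 3 location first, then fall back to the Python 2 module name.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

# Placeholder URL, used only to show that the import works.
result = urlparse('http://example.com/index.html')
print(result.scheme)  # http
print(result.netloc)  # example.com
```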

Function reference

1. urlparse()

>>> from urllib.parse import urlparse
>>> urls = urlparse('https://www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html')
>>> urls
ParseResult(scheme='https', netloc='www.cnblogs.com', path='/fuhj02/archive/2010/12/07/1898557.html', params='', query='', fragment='')
>>> dir(urls)
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_fields', '_hostinfo', '_make', '_replace', '_source', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'params', 'password', 'path', 'port', 'query', 'scheme', 'username']
>>> urls.hostname
'www.cnblogs.com'

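This function splits a URL string into six elements and returns them as a tuple. The six elements correspond to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each element is a string, and some may be empty. Besides these six elements, the returned object also has attributes such as username, password, hostname, and port. You can list all of its attributes and methods with Python's built-in dir() function.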

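Note: urlparse() recognizes a netloc only if it is introduced by //. Otherwise the input is treated as a relative URL and the host name ends up in path. For example: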

>>> another = urlparse('www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html')
>>> another
ParseResult(scheme='', netloc='', path='www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html', params='', query='', fragment='')
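For comparison, the following sketch parses the same address with and without a leading //. The variable names are made up for this illustration.

```python
from urllib.parse import urlparse

# With a leading //, the text up to the next / is recognized as the netloc.
with_slashes = urlparse('//www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html')
print(with_slashes.netloc)  # www.cnblogs.com
print(with_slashes.path)    # /fuhj02/archive/2010/12/07/1898557.html

# Without //, the whole string is treated as a relative path.
without_slashes = urlparse('www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html')
print(repr(without_slashes.netloc))  # ''
print(without_slashes.path)          # www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html
```

If input may arrive without a scheme, one option is to prepend // (or a default scheme) before parsing, depending on what the application expects.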

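In fact, the returned result is an instance of a tuple subclass (a named tuple). The class has the following read-only attributes: scheme, netloc, path, params, query, and fragment (also reachable by index 0 to 5), plus username, password, hostname, and port, which have no index.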
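The sketch below reads each of the result object's attributes from a single URL. The URL, including the user name and password, is hypothetical and chosen so that every component is non-empty.

```python
from urllib.parse import urlparse

# Hypothetical URL that fills every component.
parts = urlparse('https://alice:secret@example.com:8443/docs/page;v=1?q=url#top')

print(parts.scheme)    # https
print(parts.netloc)    # alice:secret@example.com:8443
print(parts.path)      # /docs/page
print(parts.params)    # v=1
print(parts.query)     # q=url
print(parts.fragment)  # top
print(parts.username, parts.password, parts.hostname, parts.port)
# alice secret example.com 8443

# The result is still a tuple, so indexing and unpacking work.
scheme, netloc, path, params, query, fragment = parts
print(parts[1] == parts.netloc)  # True

# Attributes are read-only: assignment raises AttributeError.
try:
    parts.scheme = 'http'
except AttributeError:
    print('attributes are read-only')

# _replace() returns a modified copy instead of changing the original.
http_parts = parts._replace(scheme='http')
print(http_parts.geturl())
# http://alice:secret@example.com:8443/docs/page;v=1?q=url#top
```

Note that port is returned as an integer (or None when absent), while the other attributes are strings or None.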

2. urlunparse()

>>> from urllib.parse import urlunparse
>>> urlunparse(urls)
'https://www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html'

This function reassembles the elements produced by urlparse() into a URL. Its argument can be any six-element tuple, not only a ParseResult.
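To illustrate the "any six-element tuple" point, here is a small sketch that builds a URL from a hand-written tuple. The host and query values are invented for the example.

```python
from urllib.parse import urlunparse

# Order: scheme, netloc, path, params, query, fragment.
# Empty strings mean "this component is absent".
components = ('https', 'www.example.com', '/search', '', 'q=urlparse', 'results')

url = urlunparse(components)
print(url)  # https://www.example.com/search?q=urlparse#results
```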

3. urlsplit()

>>> from urllib.parse import urlsplit
>>> urlsplit('https://www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html')
SplitResult(scheme='https', netloc='www.cnblogs.com', path='/fuhj02/archive/2010/12/07/1898557.html', query='', fragment='')

This function is similar to urlparse(), but it returns a five-element tuple that does not include params.
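The difference only becomes visible when the URL actually carries parameters. The sketch below uses a made-up URL with a ;type=a parameter to compare the two functions.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a parameter on the last path segment.
url = 'http://example.com/path;type=a?x=1#frag'

parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates the parameter from the path.
print(parsed.path, parsed.params)  # /path type=a

# urlsplit() leaves it inside the path.
print(split.path)                  # /path;type=a

print(len(parsed), len(split))     # 6 5
```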

4. urlunsplit(): this function recombines the elements produced by urlsplit() into a URL.
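As no example is given for urlunsplit(), here is a minimal sketch. It does a round trip through urlsplit() and then builds a URL from a hand-written five-element tuple whose values are invented for the example.

```python
from urllib.parse import urlsplit, urlunsplit

original = 'https://www.cnblogs.com/fuhj02/archive/2010/12/07/1898557.html'

# Round trip: split the URL and put it back together.
parts = urlsplit(original)
print(urlunsplit(parts) == original)  # True

# Any five-element tuple works: scheme, netloc, path, query, fragment.
rebuilt = urlunsplit(('https', 'www.example.com', '/index.html', 'page=2', ''))
print(rebuilt)  # https://www.example.com/index.html?page=2
```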

5. urljoin()

>>> from urllib.parse import urljoin
>>> urljoin('http://www.baidu.com', 'wenku.faq.html')
'http://www.baidu.com/wenku.faq.html'

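This function builds an absolute URL from a base URL and another URL, as shown above. Note: if the url argument is itself an absolute URL (that is, it starts with // or scheme://), its host name and/or scheme appear in the result instead of those of the base, as shown below: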

>>> urljoin('https://www.baidu.com/', 'https://blog.csdn.net/timeless_go/article/details/78489716')
'https://blog.csdn.net/timeless_go/article/details/78489716'
>>> urljoin('http://wiki.huihoo.com/wiki/', 'OpenERP#.E5.AE.89.E8.A3.85')
'http://wiki.huihoo.com/wiki/OpenERP#.E5.AE.89.E8.A3.85'
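A few more cases show how the form of the second argument and a trailing slash on the base affect the result. This is an illustrative sketch; all URLs are placeholders.

```python
from urllib.parse import urljoin

base = 'http://example.com/a/b/page.html'

# A relative name replaces the last path segment of the base.
sibling = urljoin(base, 'other.html')
print(sibling)    # http://example.com/a/b/other.html

# '..' climbs one directory level.
parent = urljoin(base, '../up.html')
print(parent)     # http://example.com/a/up.html

# A leading / replaces the whole path but keeps scheme and host.
rooted = urljoin(base, '/root.html')
print(rooted)     # http://example.com/root.html

# A leading // replaces the host and keeps only the scheme of the base.
other_host = urljoin(base, '//cdn.example.org/lib.js')
print(other_host) # http://cdn.example.org/lib.js

# A trailing slash on the base decides whether the last segment is kept.
no_slash = urljoin('http://example.com/a/b', 'c')
with_slash = urljoin('http://example.com/a/b/', 'c')
print(no_slash)   # http://example.com/a/c
print(with_slash) # http://example.com/a/b/c
```

The last two calls are the most common source of surprises: without the trailing slash, b is treated as a file name and replaced.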

The remaining functions are not covered one by one here; you can read the module's source code directly.