Parsing HTML Code with htmldom

Date: 2019-12-11

The language is Python 3.5 and the development environment is Windows.

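While using the HTMLParser library, I found that it could not correctly parse several levels of nested div elements, because those div elements in turn contain a elements and other elements.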

This appears to be a long-standing bug that has never been fixed:

https://sourceforge.net/p/nekohtml/bugs/98/
http://jericho.htmlparser.net//docs/javadoc/net/htmlparser/jericho/StartTag.html

So I looked for a new library, hoping it could build a DOM object model from HTML code the way JavaScript does. I did not find a perfect one, only this library:

http://thehtmldom.sourceforge.net/

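This is how I download the page source:

1. Open the URL in Firefox;

2. Press Ctrl+Shift+C to open the "DOM and Style Inspector";

3. Select the top-level html element, right-click, and choose to copy its outerHTML (the Outer HTML menu item);

4. Paste it into a text editor and save the file with UTF-8 encoding.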

Here is my code. My memory is poor, so I keep it here to copy from later:

#!/usr/bin/python
# -*- coding: utf-8 -*-


from htmldom import htmldom


url = 'file:///C:/Users/Microsoft/Desktop/analyse/ict_in_companies.html'
dom = htmldom.HtmlDom(url).createDom()


def returndict(estartup):
    if not isinstance(estartup, htmldom.HtmlNodeList):
        return None
    _returndict = {'market': '', 'name': '', 'link': '', 'pitch': '', 'raised': '', 'signal': '', 'joined': '', 'employee': '', 'stage': '', 'location': ''}
    try: 
        etext = estartup.children('div[class~=company]').first().children().first().children('div[class=text]').first()
        ename = etext.children('div[class=name]').first()
        elink = ename.children('a[class=startup-link]').first()
        epitch = etext.children('div[class=pitch]').first()
    except Exception as err:
        print('**** ERROR There is something wrong! ****')
        print(err)
    else:
        _returndict['name'] = ename.text().rstrip()
        _returndict['pitch'] = epitch.text().rstrip()
        _returndict['link'] = elink.attr('href').rstrip()
    try:
        ejoined = estartup.children('div[class~=joined]').first().children('div[class=value]').first()
        elocation = estartup.children('div[class~=location]').first().children('div[class=value]').first()
        emarket = estartup.children('div[class~=market]').first().children('div[class=value]').first()
        eemployee = estartup.children('div[class~=company_size]').first().children('div[class=value]').first()
        estage = estartup.children('div[class~=stage]').first().children('div[class=value]').first()
        eraised = estartup.children('div[class~=raised]').first().children('div[class=value]').first()
        esignal = estartup.children('div[class~=signal]').first().children('div[class=value]').first().children('img').first()
    except Exception as err:
        print('**** ERROR There is something wrong! ****')
        print(err)
    else:
        _returndict['joined'] = ejoined.text().rstrip()
        _returndict['location'] = elocation.text().rstrip()
        _returndict['market'] = emarket.text().rstrip()
        _returndict['employee'] = "'%s" % eemployee.text().rstrip()
        _returndict['stage'] = estage.text().rstrip()
        _returndict['raised'] = eraised.text().rstrip()
        _returndict['signal'] = esignal.attr('src')[37:38].rstrip()
    return _returndict


def returngbk(original):
    return original.encode('gbk', 'ignore')
    #return original.encode('utf-8')


ecompanies = dom.find('div[class~=frw44]')
lcompanies = []
for ecompany in ecompanies:
    estartup = ecompany.children(selector='div[class~=startup]', all_children=False).first()
    dcompany = returndict(estartup)
    if not dcompany: continue
    print('index: %d' % len(lcompanies))
    lcompanies.append(dcompany)


output = open('C:/Users/Microsoft/Desktop/a.csv', 'wb')
output.write(b'market\tname\tlink\tpitch\traised\tsignal\tjoined\temployee\tstage\tlocation\n')
for i in range(len(lcompanies)):
    print('Index: %d' % i)
    tcompany = (returngbk(lcompanies[i]['market']),
                returngbk(lcompanies[i]['name']),
                returngbk(lcompanies[i]['link']),
                returngbk(lcompanies[i]['pitch']),
                returngbk(lcompanies[i]['raised']),
                returngbk(lcompanies[i]['signal']),
                returngbk(lcompanies[i]['joined']),
                returngbk(lcompanies[i]['employee']),
                returngbk(lcompanies[i]['stage']),
                returngbk(lcompanies[i]['location']))
    data = b'%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % tcompany
    output.write(data)
output.close()

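A quirk of htmldom 2.0 is that every object it returns, whether from the find method, the children method, a for...in loop, or anything else, is always of type HtmlNodeList. This makes the library ambiguous to use.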

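For example, the children method should return the collection cs of child elements of the current element p, and iterating over it with for...in yields each child element c itself. But p (a single element), cs (a collection), and c (a single element) are all HtmlNodeList objects, which means they all expose the same methods and attributes.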

If cs contains only one c, c.attr(...) succeeds but cs.attr(...) does not. To refer to c, use cs.first(); that is, c is cs.first(). Do not confuse cs with c.

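Also, although the manual says the children method takes an argument that controls whether descendants are returned recursively, in practice it returns only the next level of children whether or not the argument is passed. So just treat that parameter as if it did not exist.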

Finally, in a call such as dom.find('div[class=classname]'), classname must not contain spaces or other special characters. If it does, the only option is the form dom.find('div[class~=assna]').

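This time, writing a file with the open function on Windows exposed a problem: some characters in the web page could not be converted to GBK-encoded bytes and written to the file, so the script stopped with an error.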

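So why was the text being converted to GBK before it was written; did my code do that? No, my code did not. But open in text mode falls back to the system's default encoding, and on Simplified Chinese Windows text files are saved as GBK by default, hence the conversion.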
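To check which encoding is being picked up, the short sketch below prints the locale's preferred encoding, which is what open uses in text mode on Python 3.5 when no encoding argument is given. The variable name default_encoding is only illustrative, and newer Python versions may behave differently.

```python
import locale

# Encoding that open() falls back to in text mode when no `encoding`
# argument is passed (Python 3.5 behaviour; an assumption for other versions).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

On a Simplified Chinese Windows installation this typically reports the GBK code page, while most Linux and macOS systems report UTF-8.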

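The HTML source is UTF-8, and some of its Unicode characters cannot be mapped to the GBK character set, which is what raises the error. The fix is to convert explicitly in the code with s.encode('gbk', 'ignore'), where s is the original string and 'ignore' is required: it means characters that cannot be converted are dropped. type(s.encode('gbk', 'ignore')) shows that the result is of type bytes, i.e. its __repr__() is b'...'. Since the data is already bytes, it is written to the file in binary mode, so open is called with 'wb'.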
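The behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not the script itself: the helper name to_gbk, the sample string, and the temporary file path are all made up for illustration.

```python
import os
import tempfile


def to_gbk(text):
    """Encode text as GBK bytes, dropping characters GBK cannot represent."""
    return text.encode('gbk', 'ignore')


# Hypothetical sample: an emoji has no GBK mapping, the other characters do.
sample = 'a\U0001F600中'

# Strict encoding (the default) raises, which is the error described above.
try:
    sample.encode('gbk')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

encoded = to_gbk(sample)  # bytes, with the emoji silently removed

# The data is already bytes, so the file is opened in binary mode.
path = os.path.join(tempfile.gettempdir(), 'gbk_sample.txt')
with open(path, 'wb') as f:
    f.write(encoded + b'\n')

with open(path, 'rb') as f:
    written = f.read()
```

Note that 'ignore' loses data without any warning; passing 'replace' instead would leave a ? in place of each dropped character, which makes the loss visible.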

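Of course, using s.encode('utf-8') together with open(path, 'wb') does not raise an error either. But on Windows most software still displays text as GBK, so UTF-8 encoded characters show up incorrectly until the encoding is manually set to UTF-8.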

For reference, here is a way to open a UTF-8 encoded .csv file in Excel without garbled characters:

https://jingyan.baidu.com/article/48a4205705c098a925250455.html

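I found that this approach works better, especially because special characters such as the euro sign are preserved, whereas they are lost when converting to GBK.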
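A related option, which is not what the linked article describes and is only a sketch under my own assumptions, is to write the file with Python's 'utf-8-sig' codec. It prepends a byte order mark, which Excel generally uses to detect UTF-8. The row data, field names, and output path below are hypothetical, and the sketch uses the csv module's default comma delimiter rather than tabs.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped company data.
rows = [{'name': 'Example Co', 'raised': '€1,000'}]
fieldnames = ['name', 'raised']

path = os.path.join(tempfile.gettempdir(), 'companies_utf8.csv')

# 'utf-8-sig' writes a BOM before the content; newline='' lets the csv
# module control line endings itself.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read the raw bytes back to confirm the BOM and the euro sign are present.
with open(path, 'rb') as f:
    raw = f.read()
```

With this approach nothing has to be dropped with 'ignore', so characters such as the euro sign stay in the file.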