Python BeautifulSoup4: problem retrieving script nodes

While scraping station names from 12306, I found that BeautifulSoup could not find the node whose src contains station_version.

The script tags sit outside </html>. Parsed with the 'lxml' parser, that part of the page is dropped; parsed with html5lib, it is kept.

  ...
<!-- 購物車 -->
<div style="display: none;" class="buy-cart"><div class="cart-hd"><span class="num">0</span>
</div>
<div class="cart-bd" style="display: none;"><div class="cart-bd-top"><h3><span id="hbTrainDate">候補購票需求列表</span>
<a id="hbClear" href="javascript:void(0)" shape="rect">[清空]</a>
</h3>
<a href="javascript:void(0)" class="close" shape="rect">×</a>
</div>
<div class="cart-bd-con"><ul class="cart-tlist"></ul>
</div>
<div class="cart-bd-ft"><p class="cart-ft-tips">一、候補訂單需求中可包含2個相鄰乘車日期,每一個乘車日期可包含2個不一樣「車次+席別」的組合需求。</p>
<p class="cart-ft-tips">二、排位是指您的訂單在待兌現訂單中的位置。當前排位僅供參考,實際排位以支付成功後爲準。</p>
<a id="hbSubmit" href="javascript:void(0)" class="btn72 fr" shape="rect">添加乘客</a>
</div>
</div>
</div>
</body>
</html>  # the soup obtained with 'lxml' ends here
<script type="text/javascript" src="/otn/resources/js/framework/station_name.js?station_version=1.9115" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/js/framework/favorite_name.js" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/merged/queryLeftTicket_end_js.js?scriptVersion=1.9158" xml:space="preserve"></script>
  ...

 

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup as bs
>>> url = "https://kyfw.12306.cn/otn/leftTicket/init?linktypeid=dc&fs=%E4%B8%87%E5%B7%9E,WYW&ts=%E8%A5%BF%E5%AE%89,XAY&date=2019-11-05&flag=N,N,Y"
>>> response = requests.get(url, timeout=10)
>>> response.encoding = 'utf-8'
>>> # Parse the same response text with both parsers
>>> soup_lxml = bs(response.text, 'lxml')
>>> soup_html5lib = bs(response.text, 'html5lib')
>>> response.close()
>>> # Look for any tag whose src attribute mentions station_version
>>> soup_lxml.find_all(src=re.compile(".*station_version.*"))
[]
>>> soup_html5lib.find_all(src=re.compile(".*station_version.*"))
[<script src="/otn/resources/js/framework/station_name.js?station_version=1.9115" type="text/javascript" xml:space="preserve"></script>]
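When a tag is missing from the soup, it can help to first check whether it is present in the raw markup at all, independently of any tree-building parser. The following is a minimal sketch using only the standard library's event-based html.parser.HTMLParser, which reports start tags as it encounters them in the stream, including those after </html>. The class name ScriptSrcCollector, the helper script_sources and the SAMPLE markup are illustrative assumptions, not part of the original post.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> start tag in the stream."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "script":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def script_sources(markup):
    """Return the src values of all <script> tags found in the raw markup."""
    collector = ScriptSrcCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


# Hypothetical, shortened stand-in for the page: a script tag placed after </html>
SAMPLE = (
    "<html><body><p>demo</p></body></html>\n"
    '<script type="text/javascript" '
    'src="/otn/resources/js/framework/station_name.js?station_version=1.9115"></script>\n'
)

found = [s for s in script_sources(SAMPLE) if "station_version" in s]
print(found)
```

If the src shows up here but not in the soup, the tag was present in the response and was lost during tree building, which points to the choice of parser rather than to the request.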

 

Original source: https://www.cnblogs.com/wawawawa-briefnote/p/11801636.html
