[python] Parsing XML files

Date: 2019-12-13

Tags: python, parsing, xml, files. Category: Python


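0. XML basics

Reference: http://www.runoob.com/xml/xml-tutorial.html

　　0A. XML characteristics

XML is designed to transport data, not to display it (in contrast to HTML)
It is self-describing
It has no predefined tags

　　0B. XML syntax

XML declaration (optional): <?xml version="1.0" encoding="UTF-8"?>
An XML document forms a tree structure: element; attribute; element content
A root element (required)
Every XML element must have a closing tag (unlike HTML)
Tags are case sensitive
Elements must be properly nested
Attribute values must be quoted
Entity references: "&lt;" <--> "<", "&gt;" <--> ">", "&amp;" <--> "&", "&apos;" <--> "'", "&quot;" <--> '"'
Comments: <!-- This is a comment -->
Whitespace is preserved
Line breaks are stored as LF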
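
The syntax rules above can be checked with a small sketch. It assumes Python 3 and the standard-library xml.dom.minidom parser; the sample document and its element names (note, body) are made up for illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical sample document; element and attribute names are illustrative.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!-- a comment -->\n'
    '<note id="n1">\n'
    '  <body>1 &lt; 2 &amp; "quoted"</body>\n'
    '</note>'
)

doc = parseString(xml_text)
root = doc.documentElement
body = root.getElementsByTagName("body")[0]

root_name = root.tagName               # the single root element
note_id = root.getAttribute("id")      # the quoted attribute value
body_text = body.firstChild.nodeValue  # entity references are decoded by the parser
first_child = root.firstChild          # whitespace between tags is kept as a Text node

# Improperly nested elements are rejected by the parser.
try:
    parseString("<a><b></a>")
    well_formed = True
except ExpatError:
    well_formed = False
```

Here body_text holds the decoded characters (<, & and the quotes), and first_child is the newline plus indentation that precedes the body element.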

　　0C. Main building blocks of XML vs. JSON

XML consists of elements, attributes and element content
JSON consists of objects, arrays, strings, numbers, booleans (true/false) and null
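
As a rough illustration of this difference, the sketch below reads the same made-up record from XML and from JSON using only the standard library. The record and its field names are assumptions for the example, not taken from any real schema.

```python
import json
from xml.dom.minidom import parseString

# Hypothetical record expressed both ways; names are illustrative only.
xml_text = '<user id="7"><name>Ann</name><active>true</active></user>'
json_text = '{"id": 7, "name": "Ann", "active": true, "tags": ["a", "b"], "note": null}'

user = parseString(xml_text).documentElement
record = json.loads(json_text)

# XML: everything arrives as text inside elements and attributes.
xml_id = user.getAttribute("id")
xml_active = user.getElementsByTagName("active")[0].firstChild.nodeValue

# JSON: values arrive already typed.
json_id = record["id"]
json_active = record["active"]
json_tags = record["tags"]
json_note = record["note"]
```

With XML the caller has to convert "7" and "true" itself, while json.loads already yields an int, a bool, a list and None.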

1. Parsing XML with DOM

Python has three main ways to parse XML:

SAX (Simple API for XML)
DOM (Document Object Model)
ElementTree (element tree)
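
To make the three options concrete, here is a minimal sketch that extracts the same attribute with each API. It assumes Python 3 and only standard-library modules; the sample XML and the TitleHandler class are hypothetical.

```python
import xml.sax
import xml.etree.ElementTree as ET
from xml.dom.minidom import parseString

# Hypothetical sample data.
xml_text = '<books><book title="A"/><book title="B"/></books>'


# SAX: event driven; the handler gets a callback for each start tag
# while the parser streams through the input.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "book":
            self.titles.append(attrs.getValue("title"))


handler = TitleHandler()
xml.sax.parseString(xml_text.encode("utf-8"), handler)
sax_titles = handler.titles

# DOM: the whole document is loaded into a tree of Node objects.
dom = parseString(xml_text)
dom_titles = [b.getAttribute("title") for b in dom.getElementsByTagName("book")]

# ElementTree: a lighter tree of Element objects.
et_root = ET.fromstring(xml_text)
et_titles = [b.get("title") for b in et_root.findall("book")]
```

All three lists contain the same titles. SAX never builds a tree, which tends to matter for large inputs, while DOM and ElementTree keep the parsed document in memory.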

References:

https://docs.python.org/2/library/xml.dom.minidom.html

https://docs.python.org/2/library/xml.dom.html

All object types in the DOM:

Interface	Section	Purpose
`DOMImplementation`	DOMImplementation Objects	Interface to the underlying implementation.
`Node`	Node Objects	Base interface for most objects in a document.
`NodeList`	NodeList Objects	Interface for a sequence of nodes.
`DocumentType`	DocumentType Objects	Information about the declarations needed to process a document.
`Document`	Document Objects	Object which represents an entire document.
`Element`	Element Objects	Element nodes in the document hierarchy.
`Attr`	Attr Objects	Attribute value nodes on element nodes.
`Comment`	Comment Objects	Representation of comments in the source document.
`Text`	Text and CDATASection Objects	Nodes containing textual content from the document.
`ProcessingInstruction`	ProcessingInstruction Objects	Processing instruction representation.

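The most important of these is the Node class: it is the base class for every component of an XML document. Its subclasses include Document, Element, Attr, Comment, Text and ProcessingInstruction
The getElementsByTagName() and getElementsByTagNameNS() methods return a collection of Node objects, i.e. a NodeList
DocumentType describes the document type declaration, which defines the document and the meaning and usage of its tags; for example, the header of a TestNG XML file: <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">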
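
A short sketch of these relationships, assuming Python 3 and xml.dom.minidom. The suite and test elements mirror the TestNG example but are otherwise invented. minidom records the DOCTYPE; this example does not depend on fetching the referenced DTD.

```python
from xml.dom.minidom import parseString, Node

# Hypothetical TestNG-like document.
xml_text = (
    '<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">'
    '<suite name="s"><test name="t1"/><test name="t2"/></suite>'
)
doc = parseString(xml_text)

# getElementsByTagName() returns a NodeList.
tests = doc.getElementsByTagName("test")
test_count = tests.length
first_test_name = tests.item(0).getAttribute("name")

# Document, Element and Attr objects are all Node instances.
attr = doc.documentElement.getAttributeNode("name")
is_node = [isinstance(n, Node) for n in (doc, doc.documentElement, attr)]

# The DOCTYPE declaration is exposed as a DocumentType object.
doctype_name = doc.doctype.name
doctype_system_id = doc.doctype.systemId
```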

DOM node types

Reference: http://www.w3school.com.cn/xmldom/dom_nodetype.asp

Node type	Named constant	Description	nodeName returns	nodeValue returns	nodeType returns
Element	ELEMENT_NODE	An element.	element name	null	1
Attr	ATTRIBUTE_NODE	An attribute.	attribute name	attribute value	2
Text	TEXT_NODE	Text content of an element or attribute.	#text	node content	3
CDATASection	CDATA_SECTION_NODE	A CDATA section in the document (text that the parser does not parse).	#cdata-section	node content	4
EntityReference	ENTITY_REFERENCE_NODE	An entity reference.	entity reference name	null	5
Entity	ENTITY_NODE	An entity.	entity name	null	6
ProcessingInstruction	PROCESSING_INSTRUCTION_NODE	A processing instruction.	target	node content	7
Comment	COMMENT_NODE	A comment.	#comment	comment text	8
Document	DOCUMENT_NODE	The whole document (the root of the DOM tree).	#document	null	9
DocumentType	DOCUMENT_TYPE_NODE	Provides an interface to the entities defined for the document.	doctype name	null	10
DocumentFragment	DOCUMENT_FRAGMENT_NODE	A lightweight Document object that holds part of a document.	#document-fragment	null	11
Notation	NOTATION_NODE	A notation declared in the DTD.	notation name	null	12
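
The table can be spot-checked with a small sketch, assuming Python 3 and xml.dom.minidom, where a null value shows up as None. The sample document is made up.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical document with text, a comment and a CDATA section.
doc = parseString('<a k="v">text<!--c--><![CDATA[raw]]></a>')
root = doc.documentElement
attr = root.getAttributeNode("k")


def describe(node):
    """Return (nodeType, nodeName, nodeValue) for a node."""
    return (node.nodeType, node.nodeName, node.nodeValue)


doc_info = describe(doc)
root_info = describe(root)
attr_info = describe(attr)
child_info = [describe(child) for child in root.childNodes]
```

The numeric codes match the named constants on xml.dom.Node, for example Node.ELEMENT_NODE is 1 and Node.TEXT_NODE is 3.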

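Note:

All data in XML, including attributes (Attr) and text content (Text), is a node. So do not read nodeValue directly on an element; get its first child with childNodes[0] and read childNodes[0].nodeValue instead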
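
The note above can be illustrated as follows. This is a sketch assuming Python 3 and xml.dom.minidom; the person document and the text_of helper are hypothetical and not part of the library.

```python
from xml.dom.minidom import parseString

# Hypothetical document; <nick/> is deliberately empty.
doc = parseString('<person><name>Alice</name><nick/></person>')
name = doc.getElementsByTagName("name")[0]
nick = doc.getElementsByTagName("nick")[0]

element_value = name.nodeValue             # an Element has no value of its own
text_value = name.childNodes[0].nodeValue  # the text lives in the child Text node


def text_of(element):
    """Hypothetical helper: join the Text children, so empty elements yield ''."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


name_text = text_of(name)
nick_text = text_of(nick)
```

Indexing childNodes[0] on an empty element such as nick would raise IndexError, which is why the helper iterates over the children instead.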

Commonly used methods

More methods: https://docs.python.org/2/library/xml.dom.html

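#1A. import minidom
from xml.dom.minidom import parse, parseString


#1B. Parse an XML file into a Document object
# parse() accepts either a file name or a file object
dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

# binary mode lets the parser honor the encoding declared in the file
with open('c:\\temp\\mydata.xml', 'rb') as datasource:
    dom2 = parse(datasource)  # parse an open file

#1C. Parse an XML string
dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')

#1D. Get the root element
root = dom1.documentElement

#1E. Get the nodes with the given tag name under a node; returns a NodeList
elements = root.getElementsByTagName("XXX")

#1F. Get the child nodes; returns a NodeList

elements = root.childNodes

#1G. Serialize a node and its children to XML (a str); an encoding can be given,
#    e.g. toxml("utf-8"), in which case Python 3 returns bytes
root.toxml()

# There are also methods for adding, removing and modifying nodes; see the official documentation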
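
As a sketch of the add, modify and remove operations mentioned in the last comment, the example below uses a few standard DOM methods available in xml.dom.minidom (createElement, createTextNode, setAttribute, appendChild, removeChild). The list and item elements are made up for illustration, and Python 3 is assumed.

```python
from xml.dom.minidom import parseString

# Hypothetical document to edit.
doc = parseString('<list><item id="1">a</item><item id="2">b</item></list>')
root = doc.documentElement
items = root.getElementsByTagName("item")
first, second = items[0], items[1]

# Add: new nodes are created by the Document, then attached to a parent.
new_item = doc.createElement("item")
new_item.setAttribute("id", "3")
new_item.appendChild(doc.createTextNode("c"))
root.appendChild(new_item)

# Modify: change the data of the Text node under the first item.
first.firstChild.data = "A"

# Remove: detach the second item from its parent.
root.removeChild(second)

result = root.toxml()
```

After these steps, result is the string <list><item id="1">A</item><item id="3">c</item></list>.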
