Source code: https://github.com/martinblech/xmltodict
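In day-to-day development you often need to convert between strings, XML, JSON, and dict objects. xmltodict, together with the approach described here, handles all of these conversions.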
XML file conversion workflow
Note: the code below only illustrates the logic; it cannot be run as-is.
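import gc
import json
import os
import sys
import time

import lxml
import xmltodict
from lxml import etree

# Parse the XML file incrementally, visiting only the listed tags.
# osmfile, fast_iter, process_element and maxline are not defined in this excerpt.
context = etree.iterparse(osmfile, tag=["node", "way", "relation"])
fast_iter(context, process_element, maxline)
...
# Serialize the XML element to a string
elem_data = etree.tostring(elem)
# Build a dict from the string
elem_dict = xmltodict.parse(elem_data)
# Produce a JSON string from the dict
elem_jsonStr = json.dumps(elem_dict)
# Load the JSON string back into a Python object
json_obj = json.loads(elem_jsonStr)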
Incremental XML parsing with etree (low resource usage): http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Library for converting XML strings to JSON-style objects: https://github.com/martinblech/xmltodict
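By default, xmltodict.parse() prefixes attribute names with @ and stores element text under the #text key. These characters cause problems in Spark queries, so they need to be removed. The following setting does that: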
xmltodict.parse(elem_data,attr_prefix="",cdata_key="")
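To see what this setting changes, here is a small comparison on a one-element document. The XML snippet and the variable names are made up for illustration. The third call uses cdata_key="text", which is not from the original post; it is shown only as a possible alternative, since an empty cdata_key stores the text under whatever key is given, here an empty string.

```python
import xmltodict

# Hypothetical one-element document with an attribute and text.
XML = '<plus a="complex">element as well</plus>'

default = xmltodict.parse(XML)                                   # keys use @ and #text
stripped = xmltodict.parse(XML, attr_prefix="", cdata_key="")    # setting from the post
named = xmltodict.parse(XML, attr_prefix="", cdata_key="text")   # assumed alternative

print(dict(default["plus"]))
print(dict(stripped["plus"]))
print(dict(named["plus"]))
```

Note that with an empty attr_prefix an attribute and a child element of the same name would collide, so this is best suited to documents where that cannot happen.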
To parse with an explicit encoding and error recovery, create the parser as follows:
magical_parser = lxml.etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object
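A short sketch of what recover=True buys you, assuming a deliberately malformed input (BROKEN is a made-up sample with an unclosed tag). The strict parser rejects it, while the recovering parser still returns a tree. The exact shape of the recovered tree depends on libxml2, so the code prints it rather than assuming it.

```python
import io

from lxml import etree

# Hypothetical malformed input: <b> is never closed.
BROKEN = b"<root><a>1</a><b>2</root>"

# The default (strict) parser refuses the document.
strict_failed = False
try:
    etree.fromstring(BROKEN)
except etree.XMLSyntaxError:
    strict_failed = True

# The recovering parser keeps whatever it can make sense of.
lenient = etree.XMLParser(encoding="utf-8", recover=True)
tree = etree.parse(io.BytesIO(BROKEN), lenient)
root = tree.getroot()

print(strict_failed)
print(etree.tostring(root))
```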
First convert the element to a string, then build a dict from it, and finally use json.dumps() to produce the JSON string.
elem_data = etree.tostring(elem)
elem_dict = xmltodict.parse(elem_data)
elem_jsonStr = json.dumps(elem_dict)
json.loads(elem_jsonStr) then turns the string back into a Python object that you can work with programmatically.
Using xmltodict
xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
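One practical consequence of the mapping in the "spec" above: a child tag that occurs once becomes a plain value, while a repeated tag becomes a list. The sketch below illustrates this with two made-up snippets (ONE and TWO). It also uses the force_list option of xmltodict.parse, which is not mentioned in the original post, as one way to always get a list.

```python
import xmltodict

# Hypothetical snippets: the same tag once and twice.
ONE = "<and><many>elements</many></and>"
TWO = "<and><many>elements</many><many>more elements</many></and>"

single = xmltodict.parse(ONE)["and"]["many"]        # plain string
repeated = xmltodict.parse(TWO)["and"]["many"]      # list of strings
# force_list makes the named tags always come back as lists.
forced = xmltodict.parse(ONE, force_list=("many",))["and"]["many"]

print(type(single).__name__, type(repeated).__name__, type(forced).__name__)
```

Forcing a list keeps downstream code from having to check whether it received a single value or several.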
By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:
>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True
It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:
>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True
xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
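The streaming call above needs the Discogs dump to run. A self-contained sketch of the same idea, using a tiny in-memory document (ARTISTS_XML is a made-up sample), could look like this. The callback receives the path to the current item and the item itself, and returns True to keep parsing.

```python
import io

import xmltodict

# Hypothetical stand-in for a large dump file.
ARTISTS_XML = b"""<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
</artists>"""

names = []


def handle_artist(path, artist):
    """Called once per element at depth 2; the full tree is never built."""
    names.append(artist["name"])
    return True  # keep going


# Any object with a read() method can be streamed, e.g. an open file.
xmltodict.parse(io.BytesIO(ARTISTS_XML), item_depth=2, item_callback=handle_artist)
print(names)
```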
It can also be used from the command line to pipe objects to a script like this:
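import sys, marshal
while True:
    try:
        # marshal needs a binary stream, hence sys.stdin.buffer on Python 3
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break
    print(article['title'])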
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...
Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz
And you reuse the dicts with every script that needs them:
$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...
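The caching idea above can also be sketched entirely in Python. This is only an illustration under assumptions: dump_items, load_items and SAMPLE are hypothetical, and the records are written as (path, item) pairs to mirror what the reader script earlier unpacks from each marshal record.

```python
import gzip
import marshal
import os
import tempfile


def dump_items(cache_path, items):
    """Write each (path, item) pair as one marshal record into a gzip file."""
    with gzip.open(cache_path, "wb") as fh:
        for record in items:
            marshal.dump(record, fh)


def load_items(cache_path):
    """Yield records back until the stream is exhausted."""
    with gzip.open(cache_path, "rb") as fh:
        while True:
            try:
                yield marshal.load(fh)
            except EOFError:
                return


# Hypothetical records shaped like (path, item) pairs.
SAMPLE = [
    ([("mediawiki", None), ("page", None)], {"title": "AccessibleComputing"}),
    ([("mediawiki", None), ("page", None)], {"title": "Anarchism"}),
]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "enwiki.dicts.gz")
    dump_items(cache, SAMPLE)  # done once
    titles = [item["title"] for _path, item in load_items(cache)]  # reused later

print(titles)
```

Keep in mind that the marshal format is specific to the Python version that wrote it, so a cache like this is best treated as a local, disposable artifact.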
You can also convert in the other direction, using the unparse() method:
>>> mydict = {
...     'response': {
...         'status': 'good',
...         'last_updated': '2014-02-16T23:10:12Z',
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
	<status>good</status>
	<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>
Text values for nodes can be specified with the cdata_key key in the python dict, while node properties can be specified with the attr_prefix prefixed to the key name in the python dict. The default value for attr_prefix is @ and the default value for cdata_key is #text.
>>> import xmltodict
>>>
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
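As a quick check of how attr_prefix and cdata_key work in both directions, the sketch below round-trips a dict through unparse() and parse(), and then does the same with a custom prefix and text key. The names "_" and "value" are arbitrary choices made for this example.

```python
import xmltodict

doc = {"text": {"@color": "red", "@stroke": "2", "#text": "This is a test"}}

# Default conventions: @ marks attributes, #text holds the element text.
xml = xmltodict.unparse(doc)
round_tripped = xmltodict.parse(xml)

# Same document, written with a custom prefix and text key (arbitrary names).
custom = {"text": {"_color": "red", "_stroke": "2", "value": "This is a test"}}
custom_xml = xmltodict.unparse(custom, attr_prefix="_", cdata_key="value")
custom_parsed = xmltodict.parse(custom_xml)  # read back with the defaults

print(xml)
print(custom_xml)
```

Both dicts describe the same XML, so parsing either output with the default settings gives back the @ and #text form.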
To install xmltodict, you just need to:
$ pip install xmltodict
There is an official Fedora package for xmltodict.
$ sudo yum install python-xmltodict
There is an official Arch Linux package for xmltodict.
$ sudo pacman -S python-xmltodict
There is an official Debian package for xmltodict.
$ sudo apt install python-xmltodict