Converting between XML, JSON, dict, and string in Python with xmltodict

xmltodict: converting between XML and JSON

Source code: https://github.com/martinblech/xmltodict

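During development you often need to convert between string, XML, JSON, and dict objects. xmltodict, together with the methods described here, covers all of these conversions.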

XML file conversion workflow

Note: the code below only illustrates the logic and cannot be run as is.

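import gc
import json
import os
import sys
import time

import lxml
import xmltodict
from lxml import etree

# Incrementally parse the XML file, visiting only the listed tags.
# osmfile, fast_iter, process_element and maxline are not defined in this snippet.
context = etree.iterparse(osmfile, tag=["node", "way", "relation"])
fast_iter(context, process_element, maxline)
...

# Serialize the XML element to a byte string
elem_data = etree.tostring(elem)

# Build a dict from the XML
elem_dict = xmltodict.parse(elem_data)

# Serialize the dict to a JSON string
elem_jsonStr = json.dumps(elem_dict)

# Parse the JSON string back into Python objects (dicts and lists)
json_obj = json.loads(elem_jsonStr)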

Incremental XML parsing

Incremental parsing of XML data with etree (low resource usage): http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Library for converting an XML string into a JSON-style object: https://github.com/martinblech/xmltodict

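By default, xmltodict.parse() prefixes attribute names with @ and stores element text under the #text key. These characters cause problems in Spark queries, so they need to be removed. The following setting does that: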

xmltodict.parse(elem_data,attr_prefix="",cdata_key="")
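
To see what the two parameters change, here is a small comparison on a made-up element. Note one deliberate difference from the line above: I pass cdata_key="text" instead of an empty string, because with an empty cdata_key the element text would, as far as I can tell, end up under an empty-string key. The sample XML and the key name "text" are assumptions for illustration.

```python
import xmltodict

# Hypothetical sample element with two attributes and some text.
xml = '<tag k="name" v="Main St">note</tag>'

# Default behaviour: attributes get "@", text goes under "#text".
default = xmltodict.parse(xml)

# Spark-friendly keys: no attribute prefix, text stored under "text".
plain = xmltodict.parse(xml, attr_prefix="", cdata_key="text")
```

With attr_prefix="" an attribute and a child element that share a name would map to the same key, so it is worth checking your data for such collisions before relying on this.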

Encoding and recovery of malformed XML files

As follows:

from io import StringIO

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object
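
A short sketch of the difference recover=True makes, using a deliberately broken document (the <b> element is never closed). The helper names and the sample bytes are hypothetical. How much of a damaged document survives depends on libxml2, so only the root tag is inspected here.

```python
import io

from lxml import etree

# Malformed on purpose: <b> has no closing tag.
broken = b"<root><a>1</a><b>2</root>"


def parse_strict(data):
    """Parse with a default parser, which rejects malformed input."""
    return etree.parse(io.BytesIO(data), etree.XMLParser(encoding="utf-8"))


def parse_lenient(data):
    """Parse with recover=True, which tries to salvage a tree."""
    parser = etree.XMLParser(encoding="utf-8", recover=True)
    return etree.parse(io.BytesIO(data), parser)


try:
    parse_strict(broken)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

root = parse_lenient(broken).getroot()
```

Recovery is best-effort, so it is probably wise to log which files needed it rather than trusting the salvaged tree blindly.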

First serialize the element to a string, then build a dict from it, and finally use json.dumps() to produce a JSON string.

elem_data = etree.tostring(elem)
elem_dict = xmltodict.parse(elem_data)
elem_jsonStr = json.dumps(elem_dict)

You can then call json.loads(elem_jsonStr) to get Python objects (dicts and lists) that you can work with programmatically.
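
Here is the same chain end to end on a literal XML byte string, so it can be run without an lxml element at hand. The sample data is made up. The variable names follow the snippet above.

```python
import json

import xmltodict

# Stand-in for etree.tostring(elem): a hypothetical OSM-style <way>.
elem_data = b'<way id="2"><nd ref="1"/><nd ref="3"/></way>'

elem_dict = xmltodict.parse(elem_data)   # XML -> dict
elem_jsonStr = json.dumps(elem_dict)     # dict -> JSON string
json_obj = json.loads(elem_jsonStr)      # JSON string -> plain dicts/lists

# Repeated child elements come back as a list.
refs = [nd["@ref"] for nd in json_obj["way"]["nd"]]
```

A child that appears only once is returned as a single dict rather than a one-item list, so code that iterates over children should handle both shapes.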

Using xmltodict

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}

Namespace support

By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:

>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """

>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True

It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:

>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True

Streaming mode

xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
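
The Discogs dump is not something you will have lying around, so here is a self-contained sketch of the same streaming call on a small in-memory document. The sample data and the names list are assumptions for illustration. item_depth and item_callback are the parameters used above.

```python
import io

import xmltodict

# Hypothetical miniature of an artists dump.
xml_bytes = b"""<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
</artists>"""

names = []


def handle_artist(path, artist):
    """Called once per element at depth 2, i.e. once per <artist>."""
    names.append(artist["name"])
    return True  # keep parsing


# In streaming mode the items are handed to the callback one by one
# instead of being collected into one big dict.
xmltodict.parse(io.BytesIO(xml_bytes), item_depth=2, item_callback=handle_artist)
```

item_depth=2 means "one level below the root", which is why each <artist> arrives as its own item.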

It can also be used from the command line to pipe objects to a script like this:

import sys, marshal
while True:
    # marshal data is binary, so read from the underlying byte stream
    _, article = marshal.load(sys.stdin.buffer)
    print(article['title'])

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...

Roundtripping

You can also convert in the other direction, using the unparse() method:

>>> mydict = {
...     'response': {
...             'status': 'good',
...             'last_updated': '2014-02-16T23:10:12Z',
...     }
... }

>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
    <status>good</status>
    <last_updated>2014-02-16T23:10:12Z</last_updated>
</response>

Text values for nodes can be specified with the cdata_key key in the python dict, while node properties can be specified with the attr_prefix prefixed to the key name in the python dict. The default value for attr_prefix is @ and the default value for cdata_key is #text.

>>> import xmltodict
>>> 
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
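
A quick way to check the roundtrip is to feed the output of unparse() straight back into parse() and compare. This sketch reuses the dict from the example above. The variable names xml_text and restored are mine.

```python
import xmltodict

mydict = {
    "text": {
        "@color": "red",
        "@stroke": "2",
        "#text": "This is a test",
    }
}

# dict -> XML string (with the default "@" and "#text" conventions)
xml_text = xmltodict.unparse(mydict)

# XML string -> dict again
restored = xmltodict.parse(xml_text)
```

The comparison only holds when both directions use the same attr_prefix and cdata_key, so if you changed them for parse(), pass the same values to unparse().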

Ok, how do I get it?

Using pypi

You just need to:

$ pip install xmltodict

RPM-based distro (Fedora, RHEL, …)

There is an official Fedora package for xmltodict.

$ sudo yum install python-xmltodict

Arch Linux

There is an official Arch Linux package for xmltodict.

$ sudo pacman -S python-xmltodict

Debian-based distro (Debian, Ubuntu, …)

There is an official Debian package for xmltodict.

$ sudo apt install python-xmltodict