Converting between XML, JSON, dict, and string in Python with xmltodict

Date: 2019-11-12


xmltodict: converting between XML and JSON

Source code: https://github.com/martinblech/xmltodict

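Development work often involves converting between string, XML, JSON, and dict objects. xmltodict, together with the methods described here, covers all of these conversions.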

`XML file conversion workflow`

Note: the code below only illustrates the logic and cannot be run as-is.

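import gc
import json
import os
import sys
import time

import lxml
from lxml import etree
import xmltodict

# Parse the XML file incrementally, visiting only the listed tags.
# osmfile, fast_iter, process_element and maxline are defined elsewhere.
context = etree.iterparse(osmfile, tag=["node", "way", "relation"])
fast_iter(context, process_element, maxline)
...

# Serialize the XML element to a string
elem_data = etree.tostring(elem)

# Build a dict from the serialized element
elem_dict = xmltodict.parse(elem_data)

# Produce a JSON string from the dict
elem_jsonStr = json.dumps(elem_dict)

# Parse the JSON string back into a JSON object (json.loads, not json.dumps)
json_obj = json.loads(elem_jsonStr)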
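
The workflow above can be sketched end to end on a small in-memory document. This is only a minimal sketch, not the author's script: it swaps lxml for the standard library's xml.etree.ElementTree so that only xmltodict has to be installed, and OSM_SAMPLE, WANTED_TAGS and iter_json_records are hypothetical names made up for the illustration.

```python
import io
import json
import xml.etree.ElementTree as ET

import xmltodict

# Made-up sample data standing in for a real OSM file.
OSM_SAMPLE = b"""<osm>
  <node id="1" lat="52.5" lon="13.4"><tag k="name" v="Berlin"/></node>
  <way id="7"><nd ref="1"/></way>
  <bounds minlat="0"/>
</osm>"""

WANTED_TAGS = {"node", "way", "relation"}


def iter_json_records(source):
    """Yield one JSON string per wanted element, parsing incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in WANTED_TAGS:
            continue  # children such as <tag> stay attached to their parent
        elem_data = ET.tostring(elem)           # element -> bytes
        elem_dict = xmltodict.parse(elem_data)  # bytes -> dict
        yield json.dumps(elem_dict)             # dict -> JSON string
        elem.clear()                            # release the element's contents


records = list(iter_json_records(io.BytesIO(OSM_SAMPLE)))
first = json.loads(records[0])                  # JSON string -> Python object
print(first["node"]["tag"]["@v"])
```

Clearing each element after it has been converted is what keeps memory use low on large files; the elements that are skipped here are small children that must survive until their parent is serialized.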

Parsing XML incrementally

Incremental parsing of XML data with etree (low resource usage): http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Library for converting XML strings to JSON objects: https://github.com/martinblech/xmltodict

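By default, xmltodict.parse() prefixes attribute names with @ and stores element text under the #text key. These characters cause problems in Spark queries and need to be removed, which the following settings do: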

xmltodict.parse(elem_data,attr_prefix="",cdata_key="")
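
To make the effect of these two options concrete, the sketch below parses the same element with the defaults and with custom settings. attr_prefix and cdata_key are the same xmltodict.parse() parameters used in the line above; choosing "text" for cdata_key is an assumption of this example, made because an empty cdata_key stores the element text under an empty-string key, which may be awkward as a field name.

```python
import xmltodict

xml = '<plus a="complex">element as well</plus>'

# Default behaviour: "@" marks attributes, "#text" holds the element text.
default = xmltodict.parse(xml)

# The settings from the article: no prefix, empty-string key for the text.
bare = xmltodict.parse(xml, attr_prefix="", cdata_key="")

# Hypothetical variant: give the text a named key instead of an empty one.
named = xmltodict.parse(xml, attr_prefix="", cdata_key="text")

print(default["plus"])
print(bare["plus"])
print(named["plus"])
```

With an empty prefix, an attribute and a child element (or the chosen text key) that share a name would collide, so this is worth checking against the actual schema.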

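Encoding and recovering malformed XML files

A parser created with recover=True tries to keep going when a file is malformed: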

from io import StringIO
from lxml import etree

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object

First serialize the element to a string, then build a dict from it, and finally use json.dumps() to produce a JSON string.

elem_data = etree.tostring(elem)
elem_dict = xmltodict.parse(elem_data)
elem_jsonStr = json.dumps(elem_dict)

Calling json.loads(elem_jsonStr) then creates a JSON object that can be manipulated programmatically.
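
The difference between the JSON string and the object parsed from it is easy to mix up, so here is a minimal standard-library sketch. The elem_dict literal is made up to resemble xmltodict output; it is not taken from a real file.

```python
import json

# Hand-written dict shaped like xmltodict output (hypothetical data).
elem_dict = {
    "node": {
        "@id": "1",
        "tag": [
            {"@k": "name", "@v": "Berlin"},
            {"@k": "place", "@v": "city"},
        ],
    }
}

elem_json_str = json.dumps(elem_dict)   # dict -> str, ready to write or send
json_obj = json.loads(elem_json_str)    # str -> dict/list structure again

# Only the parsed object can be indexed by key.
tags = {t["@k"]: t["@v"] for t in json_obj["node"]["tag"]}
print(type(elem_json_str).__name__, tags)
```

Calling json.dumps() a second time on the string would only wrap it in another layer of quoting, which is why json.loads() is the call needed here.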

`Using xmltodict`

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
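
One consequence of this mapping is that a repeated element such as many becomes a list, while a single occurrence becomes a plain string, so the shape of the result depends on the data. The sketch below shows the difference and uses xmltodict's force_list option to always get a list; the two sample documents are made up for the example.

```python
import xmltodict

one = "<and><many>elements</many></and>"
two = "<and><many>elements</many><many>more elements</many></and>"

# A single <many> child is returned as a string ...
single = xmltodict.parse(one)["and"]["many"]

# ... while repeated <many> children are collected into a list.
repeated = xmltodict.parse(two)["and"]["many"]

# force_list names the elements that should always be wrapped in a list.
forced = xmltodict.parse(one, force_list=("many",))["and"]["many"]

print(single, repeated, forced)
```

Code that iterates over such a field is usually simpler when the element is forced to a list up front.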

Namespace support

By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:

>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """

>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True

It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:

>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True

Streaming mode

xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
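
To experiment with streaming mode without downloading a dump, the same call can be run against a small in-memory document. This is a sketch under assumptions: ARTISTS_XML is made-up sample data, and the names and paths lists exist only to make the callback's arguments visible.

```python
import io

import xmltodict

# Made-up sample; a real dump would be opened as a (possibly gzipped) file.
ARTISTS_XML = (
    b"<artists>"
    b"<artist><name>A Perfect Circle</name></artist>"
    b"<artist><name>King Crimson</name></artist>"
    b"</artists>"
)

names = []
paths = []


def handle_artist(path, artist):
    # path: list of (element name, attributes) pairs from the root to this item
    # artist: the dict built for one element at depth 2
    paths.append([name for name, _attrs in path])
    names.append(artist["name"])
    return True  # a falsy return value makes xmltodict stop parsing


xmltodict.parse(io.BytesIO(ARTISTS_XML), item_depth=2, item_callback=handle_artist)
print(names)
```

Because each item is handed to the callback as soon as its closing tag is seen, the full document never has to be held in memory as one dict.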

It can also be used from the command line to pipe objects to a script like this:

import sys, marshal
while True:
    try:
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break
    print(article['title'])

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...
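
The reading side of such a pipe can be tried without a Wikipedia dump by writing a few marshalled (path, item) pairs into a buffer first. This is a standard-library sketch; read_items and the sample titles are hypothetical stand-ins for the real stream, and the path value is invented for the illustration.

```python
import io
import marshal


def read_items(stream):
    """Yield (path, item) pairs until the marshalled stream is exhausted."""
    while True:
        try:
            path, item = marshal.load(stream)
        except EOFError:
            return  # no more objects in the stream
        yield path, item


# Simulate the producer: one marshalled tuple per article.
buffer = io.BytesIO()
for title in ("AccessibleComputing", "Anarchism"):
    path = [("mediawiki", None), ("page", None)]
    marshal.dump((path, {"title": title}), buffer)
buffer.seek(0)

titles = [item["title"] for _path, item in read_items(buffer)]
print(titles)
```

In a real script the stream would be sys.stdin.buffer, since marshal reads bytes rather than text.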

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...

Roundtripping

You can also convert in the other direction, using the unparse() method:

>>> mydict = {
...     'response': {
...             'status': 'good',
...             'last_updated': '2014-02-16T23:10:12Z',
...     }
... }

>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
    <status>good</status>
    <last_updated>2014-02-16T23:10:12Z</last_updated>
</response>

Text values for nodes can be specified with the cdata_key key in the python dict, while node properties can be specified with the attr_prefix prefixed to the key name in the python dict. The default value for attr_prefix is @ and the default value for cdata_key is #text.

>>> import xmltodict
>>> 
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
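
A quick way to check that the two directions agree is to unparse a dict and parse the result again. This is a minimal sketch that assumes a Python version where dicts preserve insertion order (3.7 or later), which is why the attributes come out in the order they were written.

```python
import xmltodict

mydict = {
    "text": {
        "@color": "red",        # "@" prefix -> XML attribute
        "@stroke": "2",
        "#text": "This is a test",  # "#text" key -> element text
    }
}

xml = xmltodict.unparse(mydict)      # dict -> XML string
roundtrip = xmltodict.parse(xml)     # XML string -> dict again

print(xml)
```

The round trip is exact here because every value is a string; numbers or booleans in the input dict would come back as strings after parsing.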