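Background:
An optimized version of the algorithm from the blog post 《Python 解析樹狀結構文件》 (parsing tree-structured files with Python).
Core idea: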
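Create a list to hold parent-node information. Whenever a line of the form Tab(s) + name is read, store that parent node in prefixList[number of tabs]; in other words, prefixList[i] holds the parent node that is indented with i tabs.
When a line of the form Tab(s) + ptr is read, a leaf node has been reached. Its parents (the prefix) must be prefixList[0] + ... + prefixList[number of tabs - 1], so the final result is: prefix + current leaf information.
When a Tab(s) + name line is read again, one of the parent nodes has changed for the leaves that follow. Simply overwrite the corresponding prefixList[number of tabs], because no later node needs the old value stored there.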
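The idea above can be sketched as a small in-memory function. This is a minimal sketch, not the author's program: the name `flatten_tree`, the `max_depth` parameter and the two regular expressions are assumptions made for illustration, and it assumes names and values sit inside square brackets and that nesting is expressed with leading tab characters.

```python
import re

# Assumed line shapes: "name: [x]" for parents, "ptr->x-->[v]" for leaves.
NAME_RE = re.compile(r"name\s*:\s*\[(.*?)\]")
PTR_RE = re.compile(r"ptr\s*->\s*(.*?)\s*-->\s*\[(.*?)\]")


def flatten_tree(lines, max_depth=30):
    """Return one dotted path per leaf, e.g. '1.11.111[value]'."""
    prefix_list = [""] * max_depth  # prefix_list[i]: parent name at depth i
    result = []
    for line in lines:
        depth = len(line) - len(line.lstrip("\t"))  # number of leading tabs
        name = NAME_RE.search(line)
        if name:
            # Overwrite the slot: no later leaf needs the old value.
            prefix_list[depth] = name.group(1)
            continue
        leaf = PTR_RE.search(line)
        if leaf:
            parents = prefix_list[:depth]  # parents at depth 0 .. depth - 1
            path = ".".join(parents + [leaf.group(1)])
            result.append("%s[%s]" % (path, leaf.group(2)))
    return result


sample = [
    "name: [1]",
    "{",
    "\tname:[11]",
    "\t{",
    "\t\tptr->111-->[a]",
    "\t}",
    "\tname:[12]",
    "\t{",
    "\t\tptr->121-->[b]",
    "\t}",
    "}",
]
print(flatten_tree(sample))
```

Each line is visited exactly once, and the only state carried between lines is the prefix list, which is why there is no need to look back in the file.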
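Implementation:
To simulate a debug trace, create a text file 1.txt with the following content (nested lines are indented with tab characters):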
service[hi]
name: [1]
{
	name:[11]
	{
		name: [111]
		{
			ptr->1111-->[value0]
			ptr->1112-->[value1]
		}
		name: [112]
		{
			name: [1121]
			{
				ptr->111211-->[value2]
			}

		}
	}
	name:[12]
	{
		ptr->121-->[value3]
	}
	name:[13]
	{
		ptr->131-->[value4]
	}
}
service[Jeff]
name: [1]
{
	name:[11]
	{
		name: [111]
		{
			ptr->1111-->[value0]
			ptr->1112-->[value1]
		}
		name: [112]
		{
			name: [1121]
			{
				ptr->111211-->[value2]
			}

		}
	}
	name:[12]
	{
		ptr->121-->[value3]
	}
	name:[13]
	{
		ptr->131-->[value4]
	}
}
The parsing program is as follows:
1. common.py
'''
Created on 2012-5-28

@author: Jeff_Yu
'''

def getValue(string, key1, key2):
    """
    get the value between key1 and key2 in string
    """
    # start right after key1, whatever its length
    start = string.find(key1) + len(key1)
    # look for key2 only after key1 so an earlier match is ignored
    end = string.find(key2, start)

    return string[start:end]

def getFiledNum(string, key, begin):
    """
    get the number of key in string from begin position
    """
    keyNum = 0
    start = begin

    while True:
        index = string.find(key, start)
        if index == -1:
            break

        keyNum = keyNum + 1
        start = index + 1

    return keyNum
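As a side note, for the single-character keys used in this post, Python's built-in string methods give the same answers as the two helpers. A small sketch for comparison, under that assumption (the sample line is made up for illustration):

```python
line = "\t\tname: [111]\n"  # hypothetical line: two tabs, then a name

# Same count as getFiledNum(line, '\t', 0) for a one-character key.
tab_count = line.count("\t", 0)

# Same text as getValue(line, '[', ']'): what lies between the brackets.
value = line[line.find("[") + 1:line.find("]")]

print(tab_count, value)
```

Note that `str.count` counts non-overlapping matches, so the equivalence is only claimed here for one-character keys such as the tab.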
2. main.py
'''
Created on 2012-6-1

@author: Jeff_Yu
'''

import common

fileNameRead = "1.txt"
fileNameWrite = '%s%s' % ("Result_", fileNameRead)
writeList = []
# prefixList[i] holds the name of the parent node indented with i tabs
prefixList = ["0"] * 30

with open(fileNameRead, 'r') as fr:
    for data in fr:
        # find the Service Name
        if data.startswith("service"):
            # start with fresh prefixes for each service
            prefixList = ["0"] * 30
            writeList.append('%s\n' % data.rstrip('\n'))
            continue

        # find name: remember it as the parent at this depth
        if data.find("name") != -1:
            tabNumOfData = common.getFiledNum(data, '\t', 0)
            value = common.getValue(data, '[', ']')
            prefixList[tabNumOfData] = value + "."

        # find ptr: a leaf, so emit prefix + leaf
        if data.find("ptr") != -1:
            tabNumOfLeaf = common.getFiledNum(data, '\t', 0)
            valueOfLeaf = common.getValue(data, '[', ']')
            nameOfLeaf = common.getValue(data, '>', '-->')
            leafPartString = '%s[%s]' % (nameOfLeaf, valueOfLeaf)

            # the prefix is every parent at depth 0 .. tabNumOfLeaf - 1
            finalString = ''.join(prefixList[:tabNumOfLeaf]) + leafPartString

            # append line to writeList
            writeList.append(finalString + "\n")

# write writeList to result file
with open(fileNameWrite, 'w') as fw:
    fw.writelines(writeList)
Parsing result, Result_1.txt:
service[hi]
1.11.111.1111[value0]
1.11.111.1112[value1]
1.11.112.1121.111211[value2]
1.12.121[value3]
1.13.131[value4]
service[Jeff]
1.11.111.1111[value0]
1.11.111.1112[value1]
1.11.112.1121.111211[value2]
1.12.121[value3]
1.13.131[value4]
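The real trace files are more complex than this one. Because they involve company information, I won't post the actual implementation, but the core idea is the same as above.
This version is far more efficient: parsing a 5 MB file used to take over 2 minutes and now takes about 1 second.
This version makes the following optimizations:
1. String concatenation was changed to the form all = '%s%s%s%s' % (str0, str1, str2, str3).
2. The content to be written is collected in a list and written in one go at the end with f.writelines(list).
3. The algorithm reduces the number of file reads: useful information is saved as soon as it is read, so there is no need to go back and re-read the file.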
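Optimizations 1 and 2 can be illustrated together. The sketch below is an illustration under stated assumptions rather than the author's code: `build_lines` and its record tuples are hypothetical names, and an in-memory `io.StringIO` stands in for the result file.

```python
import io


def build_lines(records):
    """Format each (prefix, leaf, value) record with a single % operation."""
    lines = []
    for prefix, leaf, value in records:
        # One formatting step instead of a chain of + concatenations.
        lines.append('%s%s[%s]\n' % (prefix, leaf, value))
    return lines


records = [("1.11.", "111", "value0"), ("1.12.", "121", "value3")]
out = io.StringIO()  # stands in for open("Result_1.txt", "w")
out.writelines(build_lines(records))  # one batched write at the end
print(out.getvalue(), end="")
```

Collecting the lines first keeps the output step to a single call, which matches the "save in a list, write once" approach described in point 2.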