Writing another text-processing thingy in Python

Date: 2019-11-16


A friend ran into a bit of trouble, so I stepped up to help. Here is the situation:

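- Their business system gets its data from a mailbox;

- each email contains one record;

- the records are plain text, with fields separated by special characters;

- they need to pull every email out of the mailbox in bulk and put the records into an Excel file.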

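For Python, all of this is a piece of cake. (As it turned out, there were still a few small pitfalls that gave me a headache for quite a while.)

Being a beginner, I had no reason to start with Python 2, so I went straight to Python 3.

First comes fetching the mail. The mailbox does not support POP3, but luckily it supports IMAP. I looked it up, and Python 3 has a dedicated library for this (imaplib).

Next, the text has to be processed with regular expressions.

Generating an Excel file requires some third-party library; I looked for one but could not get it downloaded. So I kept it simple and generated a CSV file instead.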
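As an aside, Python's standard library ships a csv module that takes care of quoting, so fields containing commas, double quotes, or line breaks survive intact. Below is a minimal sketch; the function name rows_to_csv and the sample rows are made up for illustration and are not part of the program in this post.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text, letting csv.writer do the quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample: one field with a comma, one with a quote and a line break.
sample = [["id", "note"], ["1", "has, comma"], ["2", 'say "hi"\nsecond line']]
csv_text = rows_to_csv(sample)
```

When writing to a real file instead of a StringIO buffer, the csv documentation recommends opening the file with newline='' so that line endings are not translated twice.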

==============

def main():
    M = imaplib.IMAP4_SSL("my-host.com", 993)
    try:
        M.login('my-username', 'my-password')
    except Exception as e:
        print('login error: %s' % e)
        M.logout()   # close() is only valid after select(); logout() ends the session
        return

    try:
        M.select('INBOX', False)

        # result, message = M.select()
        # Tip: to find the "Essh" mails, use
        # typ, data = M.search(None, '(SUBJECT "Essh")')
        # The parentheses mark one search condition; several criteria can be
        # given together, e.g. FROM xxxx SUBJECT "aaa".
        # Note: the command must be wrapped in parentheses (learned the hard way).
        typ, data = M.search(None, 'ALL')

        msgList = data[0].split()
        print("total mails:" + str(len(msgList)))
        # first = msgList[0]
        # M.store(first, '-FLAGS', '(\\Seen)')      # mark as unread
        # M.store("1:*", '+FLAGS', '\\Deleted')     # flag all Trash as Deleted
        output = os.path.join(PATH, 'output.csv')

        next_id = int(read_config())   # index of the first unprocessed mail
        count = 0
        with open(output, 'w') as fp2:
            for idx in range(next_id, len(msgList)):
                if count >= 500:       # at most 500 mails per run
                    break
                print("curr id: " + str(idx) + '\n')
                typ, data = M.fetch(msgList[idx], '(RFC822)')
                deal_mail(data, fp2)
                count = count + 1
                next_id = idx + 1

        write_config(next_id)          # the next run starts here
        print("OK!")
    except Exception as e:
        print('imap error: %s' % e)
    finally:
        M.logout()

This is the main() part. It mainly connects to the IMAP server, fetches the mail, and calls the processing function.

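I found the interface that IMAP offers rather odd. Still, I did not fall into many traps, because the material available online is thorough. The search syntax and the commands for deleting mail and marking it read/unread are kept in the comments.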
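To make the search syntax from those comments concrete, here is a small sketch that needs no server. The helper names build_criteria and parse_search_data are made up for illustration; only the shape of the data (a list holding one bytes string of space-separated message numbers) follows what imaplib's search() hands back. Values containing double quotes are not escaped here.

```python
def build_criteria(**fields):
    """Build an IMAP SEARCH string such as '(FROM "a" SUBJECT "b")'.

    Hypothetical helper: each keyword becomes an upper-cased search key
    followed by a double-quoted value; no arguments means 'ALL'.
    """
    if not fields:
        return "ALL"
    parts = ['%s "%s"' % (key.upper(), value) for key, value in fields.items()]
    return "(" + " ".join(parts) + ")"


def parse_search_data(data):
    """Turn the data part of a search() reply into a list of message numbers."""
    # imaplib returns something like [b'1 2 3']; no matches gives [b''].
    return data[0].split()


criteria = build_criteria(subject="Essh")   # '(SUBJECT "Essh")'
numbers = parse_search_data([b"1 2 3"])     # [b'1', b'2', b'3']
```

With a live connection this would be used as typ, data = M.search(None, build_criteria(subject="Essh")). Because from is a Python keyword, a FROM criterion has to be passed as build_criteria(**{"from": "xxxx"}).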

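The logic is as follows: first read last_id, the sequence number reached in the previous run, then start from there and process 500 mails. After that, write the new last_id to the config file so that the next run can pick it up.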
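The checkpoint logic described above can be sketched without a mail server. In this sketch the function names, the batch_size parameter, and the handle callback are assumptions for illustration. The saved value is the index of the next unprocessed item, so nothing is handled twice and a run with no new items leaves the checkpoint unchanged.

```python
import os


def read_checkpoint(path):
    """Return the saved start index, or 0 if the file does not exist yet."""
    if not os.path.isfile(path):
        return 0
    with open(path) as fp:
        return int(fp.read().strip() or 0)


def write_checkpoint(path, next_index):
    """Persist the index of the next item to process."""
    with open(path, "w") as fp:
        fp.write(str(next_index))


def process_batch(items, path, handle, batch_size=500):
    """Handle at most batch_size items starting at the checkpoint.

    Returns the number of items handled in this run.
    """
    start = read_checkpoint(path)
    end = max(start, min(start + batch_size, len(items)))
    for idx in range(start, end):
        handle(items[idx])
    write_checkpoint(path, end)
    return end - start
```

Calling process_batch repeatedly with the same checkpoint file walks through the list in slices of batch_size until nothing is left.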

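While writing this program, the trouble I ran into most often concerned the str and bytes types. Much of the code found online was written for Python 2, so under Python 3 I kept hitting messages like these: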

cannot use a string pattern on a bytes-like object

write() argument must be str, not bytes

error: a bytes-like object is required, not 'str'

error: string argument without an encoding

error: cannot use a string pattern on a bytes-like object

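And so on... enough to make my head spin. For example, I wanted to replace half-width commas with full-width commas. It is about the simplest feature imaginable, yet it took me ages of trial and error:

This fails: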

content=content.replace(',', '，')
error: a bytes-like object is required, not 'str'

This fails too:

content=content.replace(',', bytes('，'))
error: string argument without an encoding

In the end, this one worked:

content=content.replace(bytearray(',','GBK'), '，'.encode('GBK'))

But when I went on to turn half-width double quotes into full-width quotation marks, the situation was different again:

matchObj = re.match( r'.*<body>(.*)</body>.*', content.decode('GBK'), re.M|re.I|re.S)
if matchObj:
    found=matchObj.group(1) # the mail body
    aa=found.split('#$') # split into individual fields
    aa[9]=aa[9].replace('"', '「') # Darn, the earlier approach is not needed here! content=content.replace(bytearray(',','GBK'), '，'.encode('GBK'))

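Phew... In any case, there is clearly something in here that I still do not fully understand, and it cost me a lot of detours. I am writing it down for my own benefit and in the hope that it helps others.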
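What ties these errors together is that Python 3 keeps bytes (raw data, as it arrives from the network) and str (decoded text) strictly apart, and methods such as replace() only accept arguments of the same type as the object they are called on. The sketch below, which assumes GBK-encoded input as in this post, shows both routes: staying in bytes, or decoding once and working with str from then on.

```python
raw = "a,b".encode("GBK")          # bytes, as a mail payload would arrive

# Route 1: stay in bytes; both arguments must then be bytes as well.
replaced_bytes = raw.replace(b",", "，".encode("GBK"))

# Route 2: decode once, then use ordinary str methods and str regex patterns.
text = raw.decode("GBK")
replaced_text = text.replace(",", "，")

# Mixing the two types is what raises the TypeError seen above.
try:
    raw.replace(",", "，")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

This is also why the quote replacement needed no encoding step: by that point the content had already been decoded for re.match, so aa[9] was a str and plain string arguments were the right ones.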

Here is the complete source code:

#-*- coding:UTF-8 -*-
import email
import email.utils
import imaplib
import os
import re

CONFIG_FILE='last_id.txt'
PATH=os.path.split(os.path.realpath(__file__))[0]

def main():
    M = imaplib.IMAP4_SSL("my-host.com", 993)
    try:
        M.login('my-username', 'my-password')
    except Exception as e:
        print('login error: %s' % e)
        M.logout()   # close() is only valid after select(); logout() ends the session
        return

    try:
        M.select('INBOX', False)

        # result, message = M.select()
        # Tip: to find the "Essh" mails, use
        # typ, data = M.search(None, '(SUBJECT "Essh")')
        # The parentheses mark one search condition; several criteria can be
        # given together, e.g. FROM xxxx SUBJECT "aaa".
        # Note: the command must be wrapped in parentheses (learned the hard way).
        typ, data = M.search(None, 'ALL')

        msgList = data[0].split()
        print("total mails:" + str(len(msgList)))
        # first = msgList[0]
        # M.store(first, '-FLAGS', '(\\Seen)')      # mark as unread
        # M.store("1:*", '+FLAGS', '\\Deleted')     # flag all Trash as Deleted
        output = os.path.join(PATH, 'output.csv')

        next_id = read_config()        # index of the first unprocessed mail
        count = 0
        with open(output, 'w') as fp2:
            for idx in range(next_id, len(msgList)):
                if count >= 500:       # at most 500 mails per run
                    break
                print("curr id: " + str(idx) + '\n')
                typ, data = M.fetch(msgList[idx], '(RFC822)')
                deal_mail(data, fp2)
                count = count + 1
                next_id = idx + 1

        write_config(next_id)          # the next run starts here
        print("OK!")
    except Exception as e:
        print('imap error: %s' % e)
    finally:
        M.logout()

def main2():
    # Offline test: parse one mail body saved in input2.txt
    in_path = os.path.join(PATH, 'input2.txt')
    output = os.path.join(PATH, 'output.csv')
    with open(in_path, 'rb') as fp, open(output, 'w') as fp2:
        pharse_content(fp2, fp.read(), '')

def get_mime_version(msg):
    if msg is not None:
        return email.utils.parseaddr(msg.get('mime-version', ''))[1]
    return ''

def get_message_id(msg):
    if msg is not None:
        return email.utils.parseaddr(msg.get('Message-ID', ''))[1]
    return ''

# Read the config file to get the position saved by the previous run;
# mail is read starting from this index.
def read_config():
    config_path = os.path.join(PATH, CONFIG_FILE)
    if os.path.isfile(config_path):
        with open(config_path) as _fp:
            return int(_fp.read().strip() or 0)
    return 0

# Write the position reached in this run to the config file for the next run.
def write_config(id):
    with open(os.path.join(PATH, CONFIG_FILE), 'w') as _fp:
        _fp.write(str(id))

def deal_mail(data, fp2):
    # data[0][1] holds the raw RFC822 message as bytes
    msg = email.message_from_bytes(data[0][1])
    messageid = get_message_id(msg)
    print(messageid)
    content = msg.get_payload(decode=True)   # bytes; None for multipart mails
    if content is None:
        print("No single-part body, skipped")
        return
    #print(content)
    pharse_content(fp2, content, messageid)

def pharse_content(fp2, content, messageid):
    # Replace the half-width , with the full-width ，
    # content=content.replace(',', '，')   # error: a bytes-like object is required, not 'str'
    # content=content.replace(',', bytes('，')) # error: string argument without an encoding
    content=content.replace(bytearray(',','GBK'), '，'.encode('GBK'))
    # print(content.decode('GBK'))

    # strinfo=re.compile(',')
    # content=strinfo.sub('，', content)  # error: cannot use a string pattern on a bytes-like object

    matchObj = re.match( r'.*<body>(.*)</body>.*', content.decode('GBK'), re.M|re.I|re.S)
    if matchObj:
        found=matchObj.group(1) # the mail body
        aa=found.split('#$')    # split into individual fields
        if len(aa) < 10:        # aa[9] is needed below
            print("Too few fields!!")
            return

        # Extract the phone number the complaint concerns.
        # Pattern to match: 申訴問題涉及號碼：18790912404;
        # (the literal has to match the mail text exactly)
        mobileObj=re.match(r'.*申訴問題涉及號碼：(.*);', aa[9], re.M|re.I|re.S)
        if mobileObj:
            mobile=mobileObj.group(1)
        else:
            mobile=''

        # Darn, the earlier approach is not needed here! content=content.replace(bytearray(',','GBK'), '，'.encode('GBK'))
        aa[9]=aa[9].replace('"', '「')

        # bb is the result array; it maps to the columns of the generated csv file
        bb=['']*40    # 40 elements, one per column
        bb[3]=aa[0]   #D
        bb[4]=aa[4]   #E
        bb[5]=mobile  #F
        bb[6]=aa[5]   #G
        bb[7]=aa[2]   #H
        bb[8]=aa[1]   #I
        bb[9]=aa[3]   #J
        bb[11]=aa[6]  #L
        bb[12]=aa[6]  #M
        bb[22]='網站' #W: source of the complaint; change it to the required type as needed
        bb[36]='"'+aa[9]+'"'  #AK: wrapped in "" so that multi-line text stays in one cell

        DELI=','
        # fp2.write("AAAAA,"+DELI.join(bb)+"\\n")
        fp2.write(DELI.join(bb)+"\n")
    else:
        print("No match!!")

if __name__ == '__main__':
    main()
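
One possible refinement of deal_mail, offered as a sketch rather than as part of the original program: parse the raw bytes directly and decode the body with the charset that the message itself declares, falling back to GBK when none is given. The function name extract_text and the fallback choice are assumptions; the sample message is built locally so the snippet runs without a mailbox.

```python
import email
from email.message import EmailMessage


def extract_text(raw_bytes, fallback="GBK"):
    """Parse raw RFC 822 bytes and return (message_id, decoded body text)."""
    msg = email.message_from_bytes(raw_bytes)
    # For a multipart mail take the first text/* part; otherwise the message itself.
    part = msg
    if msg.is_multipart():
        for candidate in msg.walk():
            if candidate.get_content_maintype() == "text":
                part = candidate
                break
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or fallback
    return msg.get("Message-ID", ""), payload.decode(charset, errors="replace")


# Hypothetical sample message, built locally for demonstration.
sample = EmailMessage()
sample["Message-ID"] = "<demo-1@example.com>"
sample.set_content("<body>字段一#$字段二</body>", charset="gbk", cte="base64")
message_id, body = extract_text(sample.as_bytes())
```

The returned body is already a str, so the regular expression and the replace() calls can work on it directly without any further encode or decode step.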
