第4章：文本處理

時間 2019-12-20

標籤文本處理简体版

原文原文鏈接

1.字符串常量css

1).字符串是不可變的有序集合html

Python不區分字符和字符串正則表達式

全部的字符都是字符串app

字符串是不可變的函數

字符串是字符的有序集合編碼

2).字符串函數spa

通用操做日誌

與大小寫相關的方法code

判斷類方法orm

字符串方法startswith和endswith

查找類函數

字符串操做方法

3).使用Python分析Apache的訪問日誌

from __future__ import print_function

ips = []
with open('access.log') as f:
    for line in f:
        ips.append(line.split()[0])

print(ips)
print("PV is {0}".format(len(ips)))
print("UV is {0}".format(len(set(ips))))

from __future__ import print_function
from collections import Counter

c = Counter()
with open('access.log') as f:
    for line in f:
        c[line.split()[6]] += 1

print(c)
print("Popular resources : {0}".format(c.most_common(10)))

from __future__ import print_function
from collections import Counter

d = {}
with open('access.log') as f:
    for line in f:
        key = line.split()[8]
        d.setdefault(key, 0)
        d[key] += 1

sum_requests = 0
error_requests = 0

for key, val in d.items():
    if int(key) >= 400:
        error_requests += val
    sum_requests += val

print("error rate: {0:.2f}%".format(error_requests * 100.0 / sum_requests))

4).字符串格式化

在Python中，存在兩種格式化字符串的方法，即%表達式和format函數

format函數纔是字符串格式化的將來

"{} is better than {}.".format('Beautiful', 'ugly')

2.正則表達式

1).利用re庫處理正則表達式

非編譯的正則表達式版本：
import re

def main():
    pattern = "[0-9]+"
    with open('data.txt') as f:
        for line in f:
            re.findall(pattern, line)

if __name__ == '__main__':
    main()

編譯的正則表達式版本：
import re

def main():
    pattern = "[0-9]+"
    re_obj = re.complile(pattern)
    with open('data.txt') as f:
        for line in f:
            re_obj.findall(line)

if __name__ == '__main__':
    main()

search函數與match函數用法幾乎同樣，區別在於前者在字符串的任意位置進行匹配，後者僅對字符串的開始部分進行匹配

2).經常使用的re方法

匹配類函數：re模塊中最簡單的即是findall函數

修改類函數：re模塊的sub函數相似於字符串的replace函數，只是sub函數支持正則表達式

3).案例：獲取HTML頁面中的全部超連接

import re
import requests

r = requests.get('https://news.ycombinator.com/')
mydata = re.findall('"(https?://.*?)"', r.content.decode('utf-8'))
print(mydata)

3.字符集編碼

1).UTF8編碼

2).Python2和Python3中的Unicode

把Unicode字符表示爲二進制數據有許多中辦法，最多見的編碼方式就是UTF-8

Unicode是表現形式，UTF-8是存儲形式

UTT-8是使用最普遍的編碼，但僅僅是Unicode的一種存儲形式

from __future__ import unicode_literals

4.Jiaja2模板

1).語法塊

在Jinja2中，存在三種語法：
控制結構{%%}
變量取值{{}}
註釋{##}
{% if users %}
    <ul>
    {% for user in users %}
        <li>{{ user.username }}</li>
    {% endfor %}
    </ul>
{% endif %}

2).Jinja2的繼承和Super函數

base.html頁面
<html lang= "en" >
<head>
    {% block head %｝
    <link rel ="stylesheet" href="style.css" />
    <title> {% block title %}｛% endblock %｝－ My Webpage</title>
    {% endblock %｝
</ head>
<body>
<div id="content">
  ｛% block content %}｛% endblock %}
</div>
</body >

index.html頁面
{% extends "base .html" %｝
{% block title %｝index｛% endblock %｝
{% block head %｝
{ { super () } }
<style type= 」 text/css 」 >
.important { color: #336699 ; }
</style>
{% endblock %｝
{% block content %｝
<hl ＞index</hl>
<p class ＝ "important"> Welcome on my awesome homepage. </p>
{% endblock %｝

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。