[Python] 02 - String

Date: 2019-11-18


Strings

Key points

The bytes type

In Python 3, bytes contains sequences of 8-bit values, str contains sequences of
Unicode characters. bytes and str instances can’t be used together with operators
(like > or +).

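Since Python 3, strings and the bytes type are completely separate. Strings are processed character by character, while bytes are processed byte by byte.

Creating bytes, and converting to and from strings:

# (1)
b = b''         # create an empty bytes object
b = bytes()     # create an empty bytes object

# (2)
b = b'hello'    # mark the literal hello as bytes directly

# (3)
b = bytes('string', encoding='utf-8')  # use the built-in bytes() to convert a string to bytes in the given encoding
b = 'string'.encode('utf-8')           # use the string's encode() method; the encoding defaults to utf-8

bytes.decode(encoding): decodes a bytes object into a string, using utf-8 by default.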
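The conversions above can be sketched as a round trip. This is a minimal illustration assuming UTF-8; the variable names are chosen here for the example only.

```python
# Round trip between str and bytes, assuming UTF-8.
text = 'h\u00e9llo'                 # a str of 5 Unicode characters (the second is e-acute)
data = text.encode('utf-8')         # a bytes object; the accented letter takes 2 bytes
restored = data.decode('utf-8')     # back to the original str

print(len(text), len(data))         # 5 6
print(restored == text)             # True

# Mixing the two types with + raises TypeError.
try:
    text + data
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print('TypeError:', exc)
```

The length difference shows the point of the separation: len() counts characters for str and bytes for bytes.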

Basic properties and features

Immutability

Strings cannot be changed in place. To modify one, go through a list: string --> list --> string
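A minimal sketch of that string --> list --> string route; the variable names are illustrative only.

```python
s = 'spam'

# Direct item assignment fails because str is immutable.
try:
    s[0] = 'S'
except TypeError as exc:
    print('TypeError:', exc)

chars = list(s)          # ['s', 'p', 'a', 'm'], a mutable copy
chars[0] = 'S'           # change the list in place
s2 = ''.join(chars)      # build a new string from the list
print(s2)                # Spam
```

The original string is untouched; join() creates a new object.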

Basic methods

Basics

S = 'Spam'

S.find('pa')

S.replace('pa', 'XYZ')

S.isalpha()

S.isdigit()

In [5]: dir(S)
Out[5]: 

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Viewing the documentation of a method:

help(S.replace)

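Using split

Stripping leading and trailing whitespace

Strip the leading and trailing whitespace first, then split.

>>> s = ' hello, world, hao,,123 '   # example input, assumed so that it matches the output below
>>> s.strip().split(',')
['hello', ' world', ' hao', '', '123']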

Splitting with the string's own split

Extract the content inside the brackets, as follows.

s = "hello boy<[www.baidu.com]>byebye"

print(s.split("[")[1].split("]")[0])
www.baidu.com

Splitting paths with the os module

The os.path.split() function

import os
print(os.path.split('/dodo/soft/python/'))  # path + filename
('/dodo/soft/python', '')
print(os.path.split('/dodo/soft/python'))
('/dodo/soft', 'python')

Splitting off the file extension

filepath, tmpfilename = os.path.split(fileUrl)

shotname, extension = os.path.splitext(tmpfilename)
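A runnable sketch of the two calls above. The path in file_url is hypothetical and chosen to show that splitext() only splits off the last extension.

```python
import os

file_url = '/dodo/soft/python/archive.tar.gz'   # hypothetical path, for illustration

filepath, tmpfilename = os.path.split(file_url)        # directory part, file name
shortname, extension = os.path.splitext(tmpfilename)   # name without the last extension, extension

print(filepath)      # /dodo/soft/python
print(tmpfilename)   # archive.tar.gz
print(shortname)     # archive.tar
print(extension)     # .gz
```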

The os module exposes os.path, which is dedicated to path and directory handling. os.sys is simply the sys module that os imports internally, not a real sub-module, so "import os.sys" fails; import sys directly.

import os

import sys

import os.path

Reading input

Reading line by line

Read the file one line at a time

with open('somefile', 'r') as f:
    for line in f:
        print(line, end='')
        
"""
Hello
World
Python
"""

Read everything into a list at once

with open('somefile','r') as f:
    content = list(f)
    print(content)

"""
['Hello\n', 'World\n', 'Python']
"""

list(f) above gives the same result as readlines():

with open('somefile','r') as f:
    content = f.readlines()
    print(content)

"""
['Hello\n', 'World\n', 'Python']
"""

Removing the newline characters automatically

with open('somefile','r') as f:
    content = f.read().splitlines()
    print(content)

"""
['Hello', 'World', 'Python']
"""

Alternatively, strip the trailing newline yourself with rstrip(); to strip leading whitespace as well, switch to strip():

with open('somefile','r') as f:
    content = [line.rstrip('\n') for line in f]
    print(content)

"""
['Hello', 'World', 'Python']
"""

Iterating with enumerate

Iterating over a list

>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print(i, element)


0 one
1 two
2 three

Iterating over the lines of a file

with open('somefile', 'r') as f:
    for number, line in enumerate(f,start=1):
        print(number, line, end='')

"""
1 Hello
2 World
3 Python
"""

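Printing output

External setup: the sys.stdout approach

(1) Set the destination --> (2) then print

Writing to the output stream directly

>>> import sys                          # Printing the hard way
>>> sys.stdout.write('hello world\n')   # prints to the screen by default
hello world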

Redirecting the output stream

C:\code> c:\python33\python
>>> import sys
>>> temp = sys.stdout                 # Save for restoring later
>>> sys.stdout = open('log.txt', 'a') # Redirect prints to a file
>>> print('spam')                     # Prints go to file, not here
>>> print(1, 2, 3)
>>> sys.stdout.close()                # Flush output to disk
>>> sys.stdout = temp                 # Restore original stream
>>> print('back here')                # Prints show up here again
back here
>>> print(open('log.txt').read())     # Result of earlier prints
spam
1 2 3

Internal setup: the print(file=log) approach (recommended)

log = open('log.txt', 'a')  # 3.X
print(x, y, z, file=log)    # Print to a file-like object
print(a, b, c)              # Print to original stdout

# Older versions
log = open('log.txt', 'a') # 2.X
print >> log, x, y, z      # Print to a file-like object
print a, b, c              # Print to original stdout

What if the log should be both displayed and saved?

For now, just write a function that does both kinds of printing.

from __future__ import print_function
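One possible shape for such a function, as a sketch for Python 3 only. The name log_print and its logfile parameter are hypothetical, and the demo writes to a temporary directory rather than to a real log.

```python
import os
import tempfile


def log_print(*args, logfile='log.txt', sep=' ', end='\n'):
    """Print to the screen and append the same text to a log file.

    Hypothetical helper: the name and the logfile parameter are assumptions.
    """
    print(*args, sep=sep, end=end)                    # to the original stdout
    with open(logfile, 'a') as log:
        print(*args, sep=sep, end=end, file=log)      # same text, appended to the file


# Demo: log into a throwaway directory and read the result back.
demo_log = os.path.join(tempfile.mkdtemp(), 'log.txt')
log_print('spam', 1, 2, 3, logfile=demo_log)
with open(demo_log) as f:
    saved = f.read()
```

Opening the file on every call is simple but slow for heavy output.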

The print function

Several formatting styles

(1) C-style format; (2) explicit index; (3) automatic index; (4) dict-based.

Styles 1-3
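No examples of the first three styles survive here, so the following is a minimal sketch of what they usually look like. It reuses the qty/food values from the dictionary example; the variable names are assumptions.

```python
qty, food = 1, 'spam'

c_style = '%d more %s' % (qty, food)            # (1) C-style formatting expression
by_index = '{0} more {1}'.format(qty, food)     # (2) explicit positional index
auto_index = '{} more {}'.format(qty, food)     # (3) automatic index

print(c_style)      # 1 more spam
print(by_index)     # 1 more spam
print(auto_index)   # 1 more spam
```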

Style 4

# Dictionary-Based Formatting Expressions

>>> '%(qty)d more %(food)s' % {'qty': 1, 'food': 'spam'}
'1 more spam'

String Formatting Expressions --> for details see: 268/1594

Pretty-printing numbers

(a) Number of decimal places to keep

(b) Field width of a number

print('%2d-%02d' % (3, 1))

 3-01
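A short sketch covering both points, decimal places and field width, with the % operator and with str.format(). The sample values are arbitrary.

```python
x = 3.14159

two_places = '%.2f' % x                        # (a) keep 2 decimal places
wide = '%8.3f' % x                             # (b) width 8, 3 decimal places, right-aligned
fmt_places = '{:.2f}'.format(x)                # same as (a) with str.format()
padded = '{:>6d}|{:06d}'.format(42, 42)        # space-padded vs zero-padded to width 6

print(two_places)    # 3.14
print(repr(wide))    # '   3.142'
print(fmt_places)    # 3.14
print(padded)        #     42|000042
```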

Other tips

- Inspecting ASCII codes

len(S)

ord('\n')  # get the ASCII code

chr(115)   # get the corresponding character

- \0: a binary zero byte

- Multi-line strings

>>> msg = """
aaaaaaaaaaaaa
bbb'''bbbbbbbbbb""bbbbbbb'bbbb
cccccccccccccc
"""
>>> msg
'\naaaaaaaaaaaaa\nbbb\'\'\'bbbbbbbbbb""bbbbbbb\'bbbb\ncccccccccccccc\n'

- Raw print

In [40]: r"C:\new\test.spm"
Out[40]: 'C:\\new\\test.spm'

- str vs repr

From: http://blog.csdn.net/u013961718/article/details/51100464

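str returns a string meant for people to read.
repr returns a string meant for the machine: for a string argument, the result carries an extra layer of quotes around the original.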
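A small sketch of the difference. The string value and the date (the post's own date) are used purely for illustration.

```python
import datetime

s = 'spam\n'
print(str(s))     # prints spam and then a line break
print(repr(s))    # prints 'spam\n' with quotes and the escape visible

d = datetime.date(2019, 11, 18)
human = str(d)       # '2019-11-18'
machine = repr(d)    # 'datetime.date(2019, 11, 18)'
print(human)
print(machine)
```

For many objects, repr() yields text that could be pasted back into the interpreter, while str() favours readability.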

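Logging functions

Think of logging as a more advanced way of printing, one suited to real projects.

Log levels

The five log levels

Ref: python logging: replacing print, writing to the console and redirecting to a file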

logging.DEBUG
logging.INFO
logging.WARNING
logging.ERROR
logging.CRITICAL
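The five constants are plain integers in increasing order of severity, and a logger drops anything below its configured level. A minimal sketch, where the logger name 'demo' is an arbitrary choice:

```python
import logging

levels = [logging.DEBUG, logging.INFO, logging.WARNING,
          logging.ERROR, logging.CRITICAL]
names = [logging.getLevelName(level) for level in levels]

print(levels)   # [10, 20, 30, 40, 50]
print(names)    # ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']

# A logger set to WARNING ignores DEBUG and INFO messages.
logger = logging.getLogger('demo')
logger.setLevel(logging.WARNING)
print(logger.isEnabledFor(logging.INFO))    # False
print(logger.isEnabledFor(logging.ERROR))   # True
```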

Configuring log output

Ref: Learning Python's logging module

import logging

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='myapp.log',
                    filemode='w')
# The logging.config module can also set up logging by loading a configuration file

logging.debug('This is debug message')
logging.info('This is info message')
logging.warning('This is warning message')

Logs are written to the file ./myapp.log

Contents of ./myapp.log:
Sun, 24 May 2009 21:48:54 demo2.py[line:11] DEBUG This is debug message
Sun, 24 May 2009 21:48:54 demo2.py[line:12] INFO This is info message
Sun, 24 May 2009 21:48:54 demo2.py[line:13] WARNING This is warning message

Logging to both a file and the screen

import logging

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='myapp.log',
                    filemode='w')

#################################################################################################
# Define a StreamHandler that prints messages of level INFO or higher to standard error, and add it to the root logger #
console = logging.StreamHandler()
console.setLevel(logging.INFO)

formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)
#################################################################################################

logging.debug('This is debug message')
logging.info('This is info message')
logging.warning('This is warning message')

Result:

Printed on the screen:
root        : INFO     This is info message
root        : WARNING  This is warning message


Contents of ./myapp.log:
Sun, 24 May 2009 21:48:54 demo2.py[line:11] DEBUG This is debug message
Sun, 24 May 2009 21:48:54 demo2.py[line:12] INFO This is info message
Sun, 24 May 2009 21:48:54 demo2.py[line:13] WARNING This is warning message

For the rest, see: 6. Unicode Strings, 160/1594 (omitted here)

Regular expressions - Regex

How regex engines work: [IR] XPath for Search Query

Tutorial: Regular Expressions 30-Minute Introduction (正則表達式30分鐘入門教程)

Basic usage

Using re.match

Typical uses: extracting information from strings and components from paths; it can replace split().

In [8]: import re
   ...: match = re.match('Hello[ \t]*(.*)world', 'Hello Python world')
   ...: match.group(1)
Out[8]: 'Python '
--------------------------------------------------------------------------------------
In [9]: match = re.match('[/:](.*)[/:](.*)[/:](.*)', '/usr/home:lumberjack')
   ...: match.groups()
Out[9]: ('usr', 'home', 'lumberjack')
--------------------------------------------------------------------------------------
In [10]: re.split('[/:]', '/usr/home/lumberjack')
Out[10]: ['', 'usr', 'home', 'lumberjack']

A filter-based framework

A simple skeleton:

def filter_mail(emails):
    return list(filter(fun, emails))  # 2. fun is a user-defined function returning True/False; it wraps an re call

if __name__ == '__main__':
    n = int(input())
    emails = []
    for _ in range(n):
        emails.append(input())        # 1. read the mail list

    filtered_emails = filter_mail(emails)
    filtered_emails.sort()            # 3. sort
    print(filtered_emails)

Matching email formats

Valid email addresses must follow these rules:

* It must have the username@websitename.extension format type.

* The username can only contain letters, digits, dashes and underscores.

* The website name can only have letters and digits.

* The maximum length of the extension is 3.

import re
re.search(r'^[A-Za-z0-9-_]+@[A-Za-z0-9]+\.\w?\w?\w$',s)
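Putting the skeleton and the pattern together, a possible sketch of the fun predicate. The pattern is the one above with the hyphen moved to the end of the character class and \w?\w?\w written as the equivalent \w{1,3}; the sample addresses are invented for the demo.

```python
import re

# Same rules as above: username of letters/digits/dashes/underscores,
# site name of letters/digits, extension of 1 to 3 word characters.
EMAIL_RE = re.compile(r'^[A-Za-z0-9_-]+@[A-Za-z0-9]+\.\w{1,3}$')


def fun(s):
    """Return True when s looks like a valid address under the rules above."""
    return EMAIL_RE.search(s) is not None


def filter_mail(emails):
    return list(filter(fun, emails))


# Invented sample data for illustration.
sample = ['lara@example.com', 'brian-23@example.com', 'britts_54@example.com',
          'bad@example.toolong', 'dot.name@example.com', 'no at sign']
filtered_emails = sorted(filter_mail(sample))
print(filtered_emails)
```

Note that \w also accepts digits and underscores in the extension; tighten it to [A-Za-z] if only letters should pass.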

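Regular expressions

Quantifiers and metacharacters

Quantifiers

Metacharacters

Common examples

Common string matching

# The word hi, then any number of any characters (except newline), and finally the word Lucy

\bhi\b.*\bLucy\b

# Match words starting with the letter a: first a word boundary (\b), then the letter a, then any number of letters, digits or underscores (\w*), and finally a word boundary (\b).

\ba\w*\b

# Match names ending in .tif

re.search(r".*\.tif$", f)

# Match one or more consecutive digits. + is a metacharacter similar to *; the difference is that * matches zero or more repetitions, while + matches one or more.

\d+

# Match words of exactly 6 characters.

\b\w{6}\b

# A QQ number must be 5 to 12 digits: start --> ^ ... $ <-- end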

^\d{5,12}$
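The patterns above can be tried out with the re module. A short sketch with invented sample strings:

```python
import re

# hi ... Lucy as whole words
hit = re.search(r'\bhi\b.*\bLucy\b', 'hi there, Lucy') is not None    # True
miss = re.search(r'\bhi\b.*\bLucy\b', 'him and Lucy') is not None     # False: "him" is not the word hi

# words starting with a
a_words = re.findall(r'\ba\w*\b', 'an apple a day')       # ['an', 'apple', 'a']

# names ending in .tif
tif_ok = re.search(r'.*\.tif$', 'scan.tif') is not None   # True
tif_no = re.search(r'.*\.tif$', 'scan.tiff') is not None  # False

# runs of digits, and words of exactly 6 characters
digits = re.findall(r'\d+', 'tel 010 88886666')           # ['010', '88886666']
six = re.findall(r'\b\w{6}\b', 'Python string regex')     # ['Python', 'string']

# QQ number: 5 to 12 digits, anchored at both ends
qq_results = [bool(re.match(r'^\d{5,12}$', n))
              for n in ('1234', '12345', '1234567890123')]  # [False, True, False]
```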

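Phone numbers

# Chinese phone numbers - simple version

0\d\d-\d\d\d\d\d\d\d\d      # improved version below
0\d{2}-\d{8}

# Match phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678.

  - First an escaped \(, which may appear 0 or 1 times (?),
  - then a 0 followed by 2 digits (\d{2}),
  - then one of ), - or a space, appearing once or not at all (?),
  - and finally 8 digits (\d{8})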

\(?0\d{2}[) -]?\d{8}

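However, it also matches "incorrect" formats such as 010)12345678 or (022-87654321.

So what can be done? -- Alternation (branch conditions)

# Match two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (e.g. 010-12345678), or a 4-digit area code with a 7-digit local number (e.g. 0376-2233445).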

0\d{2}-\d{8}|0\d{3}-\d{7}
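A sketch comparing the loose pattern with the alternation. The names loose and strict are chosen here for the example, and re.fullmatch() requires the whole string to match.

```python
import re

loose = re.compile(r'\(?0\d{2}[) -]?\d{8}')
strict = re.compile(r'0\d{2}-\d{8}|0\d{3}-\d{7}')

samples = ['(010)88886666', '022-22334455', '02912345678',
           '010)12345678', '(022-87654321', '0376-2233445']

# The loose pattern accepts the two malformed numbers as well.
loose_hits = [n for n in samples if loose.fullmatch(n)]

# The alternation accepts only the two hyphen-separated layouts.
strict_hits = [n for n in samples if strict.fullmatch(n)]

print(loose_hits)
print(strict_hits)   # ['022-22334455', '0376-2233445']
```

With re.search() instead of fullmatch(), wrap the alternation in a group and anchor it, since | has the lowest precedence.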

To be continued... more will be added as needed.
