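As Python becomes ever more widely used in machine learning and data science, the number of related Python libraries is growing quickly. But Python itself has one serious problem: Python 2 and Python 3 are incompatible with each other, and many open-source Python 2 libraries on GitHub do not support Python 3, so a large number of Python 2 users are reluctant to migrate.
Python 3 changes a great deal: it fixes many shortcomings of Python 2 and expands the standard library, for example with coroutine-related modules. Below are some of the features Python 3 offers, as reasons to move from Python 2 to Python 3.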
Anyone who has used Python 2 has probably used os.path to deal with all kinds of path problems, for example joining file paths with os.path.join(). In Python 3 you can use pathlib to do this conveniently:
    from pathlib import Path

    dataset = 'wiki_images'
    datasets_root = Path('/path/to/datasets/')

    train_path = datasets_root / dataset / 'train'
    test_path = datasets_root / dataset / 'test'

    for image_path in train_path.iterdir():
        with image_path.open() as f:  # note, open is a method of Path object
            # do something with an image
            pass
Compared with os.path.join(), pathlib is safer, more convenient and more readable. pathlib also has many other features:
    p.exists()
    p.is_dir()
    p.parts  # a property, not a method
    p.with_name('sibling.png')  # only change the name, but keep the folder
    p.with_suffix('.jpg')  # only change the extension, but keep the folder and the name
    p.chmod(mode)
    p.rmdir()
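To make the methods listed above concrete, here is a minimal sketch that exercises a few of them inside a temporary directory. The directory layout and the file name `cat.png` are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/train/cat.png
    image_path = Path(tmp) / 'train' / 'cat.png'
    image_path.parent.mkdir(parents=True)
    image_path.write_bytes(b'not a real image')

    exists = image_path.exists()             # we just wrote the file
    is_dir = image_path.parent.is_dir()      # 'train' is a directory
    last_parts = image_path.parts[-2:]       # parts is a property returning a tuple
    sibling = image_path.with_name('sibling.png')  # same folder, new file name
    as_jpg = image_path.with_suffix('.jpg')        # same folder and stem, new extension
```

Note that `with_name` and `with_suffix` only build new Path objects; nothing is renamed on disk.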
In PyCharm, type hints look like this:
[Screenshot: type hints in PyCharm]
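In complex projects, type hints help us avoid slips and type errors. In Python 2 this depended on the IDE inferring types; every IDE did it differently, and it was only inference, with nothing strictly enforced. Take the code below, for example: the argument could be a numpy.array, astropy.Table or astropy.Column, bcolz, cupy, mxnet.ndarray, and so on.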
    import numpy

    def repeat_each_entry(data):
        """ Each entry in the data is doubled
        <blah blah nobody reads the documentation till the end>
        """
        index = numpy.repeat(numpy.arange(len(data)), 2)
        return data[index]
The same code also accepts a pandas.Series argument, but then it goes wrong at run time:
repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5])) # returns Series with Nones inside
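And this is just one function. A large project has many functions like it, and the code can easily go off the rails. Knowing the argument types therefore matters a great deal in large projects. Python 2 has no way to state them, but Python 3 does: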
    from typing import Union

    def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
        ...
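At present, JetBrains' PyCharm, for example, already supports checking type hints, so if you use that IDE you get the checks from the IDE itself. If, like me, you use the Sublime Text editor, the third-party tool mypy can help.
PS: type hints currently do not support ndarrays/tensors very well.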
Normally, function annotations only help people understand the code and have no other effect. You can use enforce to check types at run time:
    import enforce
    from typing import List

    @enforce.runtime_validation
    def foo(text: str) -> None:
        print(text)

    foo('Hi')  # ok
    foo(5)     # fails

    @enforce.runtime_validation
    def any2(x: List[bool]) -> bool:
        return any(x)

    any ([False, False, True, False])  # True
    any2([False, False, True, False])  # True

    any (['False'])  # True
    any2(['False'])  # fails

    any ([False, None, "", 0])  # False
    any2([False, None, "", 0])  # fails
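To show roughly what a run-time checker has to do, here is a minimal sketch that uses only the standard library. It is not how enforce is implemented. The decorator name `check_types` is hypothetical, and it only understands annotations that are plain classes such as `str`, so generics like `List[bool]` are skipped.

```python
import functools
import inspect

def check_types(func):
    """Hypothetical helper: validate arguments against plain-class annotations."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            parameter = signature.parameters[name]
            expected = parameter.annotation
            if expected is inspect.Parameter.empty:
                continue  # no annotation, nothing to check
            if parameter.kind in (parameter.VAR_POSITIONAL, parameter.VAR_KEYWORD):
                continue  # *args / **kwargs are out of scope for this sketch
            # Only ordinary classes are checked; typing generics are ignored.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(f'{name} must be {expected.__name__}, '
                                f'got {type(value).__name__}')
        return func(*args, **kwargs)
    return wrapper

@check_types
def shout(text: str) -> str:
    return text.upper() + '!'
```

Calling `shout('hi')` works as usual, while `shout(5)` raises a `TypeError` before the function body runs.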
Matrix multiplication gets its own operator, @, as in the following code:
    # l2-regularized linear regression: || AX - b ||^2 + alpha * ||x||^2 -> min

    # Python 2
    X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(b))
    # Python 3
    X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ b)
With the @ operator, the code becomes more readable and easier to port between libraries such as numpy and tensorflow.
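The portability comes from the fact that `a @ b` is just a call to `a.__matmul__(b)` (Python 3.5+), so any library can support it. The sketch below illustrates this with a tiny `Matrix` class invented for this example; it is not taken from numpy or any other library.

```python
class Matrix:
    """Tiny illustrative matrix type (hypothetical, not from any library)."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Python 3.5+ translates `a @ b` into `a.__matmul__(b)`.
        columns = list(zip(*other.rows))
        return Matrix(
            [[sum(x * y for x, y in zip(row, col)) for col in columns]
             for row in self.rows]
        )

a = Matrix([[1, 2], [3, 4]])
identity = Matrix([[1, 0], [0, 1]])
product = a @ a
unchanged = a @ identity
```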
In Python 2, finding files recursively is not easy, even with the glob module. In Python 3 a wildcard does the job:
    import glob

    # Python 2
    found_images = \
        glob.glob('/path/*.jpg') \
      + glob.glob('/path/*/*.jpg') \
      + glob.glob('/path/*/*/*.jpg') \
      + glob.glob('/path/*/*/*/*.jpg') \
      + glob.glob('/path/*/*/*/*/*.jpg')

    # Python 3
    found_images = glob.glob('/path/**/*.jpg', recursive=True)
It works even better together with the pathlib module mentioned earlier:
    # Python 3
    found_images = pathlib.Path('/path/').glob('**/*.jpg')
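The following sketch checks that the two recursive forms find the same files. It builds a small throwaway tree in a temporary directory, and the file names are made up for the example.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical tree with images at three different depths.
    for relative in ['a.jpg', 'x/b.jpg', 'x/y/c.jpg', 'x/y/notes.txt']:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # '**' matches zero or more directory levels when recursive=True.
    with_glob = sorted(glob.glob(str(root / '**' / '*.jpg'), recursive=True))
    with_pathlib = sorted(str(p) for p in root.glob('**/*.jpg'))
    found_names = [Path(p).name for p in with_glob]
```

`Path.glob` returns a lazy generator, which is why the sketch materialises it before the temporary directory is removed.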
Printing to a specific file
    print >>sys.stderr, "critical error"      # Python 2
    print("critical error", file=sys.stderr)  # Python 3
Joining strings without str.join
    # Python 3
    print(*array, sep='\t')
    print(batch, epoch, loss, accuracy, time, sep='\t')
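The `file` and `sep` arguments can be combined. In this small sketch an `io.StringIO` buffer stands in for a log file or `sys.stderr`, and the sample values are arbitrary.

```python
import io

buffer = io.StringIO()          # stands in for a log file or sys.stderr
array = [0.5, 0.25, 0.125]

print(*array, sep='\t', file=buffer)                  # tab-separated values
print('epoch', 3, 'loss', 0.1, sep='=', file=buffer)  # any separator works

captured = buffer.getvalue()
```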
Overriding the print function
    # Python 3
    _print = print  # store the original print function
    def print(*args, **kargs):
        pass  # do something useful, e.g. store output to some file
Another example is the code below:
    import contextlib

    @contextlib.contextmanager
    def replace_print():
        import builtins
        _print = print  # saving old print function
        # or use some other function here
        builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
        yield
        builtins.print = _print

    with replace_print():
        <code here will invoke other print function>
Although the code above does manage to override print, this approach is not recommended.
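When the goal is only to capture or redirect what print writes, the standard library's `contextlib.redirect_stdout` (Python 3.4+) may be a gentler option than replacing `builtins.print`. A minimal sketch, with an in-memory buffer as the assumed destination:

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print('captured line')      # goes to buffer, not the terminal

print_output = buffer.getvalue()
```

Outside the `with` block, print writes to the normal standard output again.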
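The string formatting that Python 2 provides is still not good enough: it is verbose and cumbersome. We typically write something like this to output log information: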
    # Python 2
    print('{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
        batch=batch, epoch=epoch, total_epochs=total_epochs,
        acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
        avg_time=time / len(data_batch)
    ))

    # Python 2 (too error-prone during fast modifications, please avoid):
    print('{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
        batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
        time / len(data_batch)
    ))
The output is:
120 12 / 300 accuracy: 0.8180±0.4649 time: 56.60
The f-strings introduced in Python 3.6 make this much simpler:
    # Python 3.6+
    print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')
They are also very handy when writing queries or generating code snippets:
query = f"INSERT INTO STATION VALUES (13, '{city}', '{state}', {latitude}, {longitude})"
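f-strings accept the same format specifications as str.format and can embed arbitrary expressions. The sketch below uses made-up values for `batch`, `epoch` and `accuracy`.

```python
batch, epoch, total_epochs = 120, 12, 300   # hypothetical training state
accuracy = 0.818

# Width 3 right-aligns integers; 0.4f keeps four decimal places.
line = f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {accuracy:0.4f}'

# Expressions and conversions (!r) are allowed inside the braces.
debug = f'{epoch!r:>5}|{accuracy * 100:.1f}%'
```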
The following comparisons are illegal in Python 3:
    # All these comparisons are illegal in Python 3
    3 < '3'
    2 < None
    (3, 4) < (3, None)
    (4, 5) < [4, 5]

    # False in both Python 2 and Python 3
    (4, 5) == [4, 5]
Values of different types cannot be sorted:
sorted([2, '1', 3]) # invalid for Python 3, in Python 2 returns [2, 3, '1']
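If mixed data really does need sorting in Python 3, one option is to state the intended ordering with a key function. A small sketch with made-up values:

```python
values = [10, '9', 3]   # hypothetical mixed-type data

try:
    sorted(values)
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True      # Python 3 refuses to compare int with str

as_numbers = sorted(values, key=int)   # compare everything as integers
as_text = sorted(values, key=str)      # compare everything as strings
```

The two keys give different orders, which is exactly the ambiguity Python 3 declines to resolve silently.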
    s = '您好'
    print(len(s))
    print(s[:2])

Output:
    Python 2: 6\n��
    Python 3: 2\n您好.

    x = u'со'
    x += 'co'  # ok
    x += 'со'  # fail

The last snippet above fails in Python 2 but runs successfully in Python 3. All Python 3 strings are Unicode, which is very convenient for NLP. Some more examples:
    'a' < type < u'a'  # Python 2: True
    'a' < u'a'         # Python 2: False

    from collections import Counter
    Counter('Möbelstück')

    Python 2: Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})
    Python 3: Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})
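The difference in the Counter output comes down to characters versus bytes. The sketch below makes the distinction explicit in Python 3 by encoding the same word to UTF-8. The escapes `\u00f6` and `\u00fc` are ö and ü.

```python
from collections import Counter

word = 'M\u00f6belst\u00fcck'        # 'Möbelstück'
encoded = word.encode('utf-8')       # bytes, roughly what a Python 2 str held

char_count = len(word)               # counts characters (code points)
byte_count = len(encoded)            # ö and ü take two bytes each in UTF-8
letters = Counter(word)              # one entry per character
round_trip = encoded.decode('utf-8')
```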
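In CPython 3.6+, dict by default behaves much like OrderedDict: it preserves insertion order (the language guarantees this from Python 3.7 on).
    import json
    x = {str(i):i for i in range(5)}
    json.loads(json.dumps(x))
    # Python 2
    {u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
    # Python 3
    {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}
Likewise, the **kwargs dict keeps its items in the order in which the arguments were passed.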
    from collections import OrderedDict
    from torch import nn

    # Python 2
    model = nn.Sequential(OrderedDict([
        ('conv1', nn.Conv2d(1,20,5)),
        ('relu1', nn.ReLU()),
        ('conv2', nn.Conv2d(20,64,5)),
        ('relu2', nn.ReLU())
    ]))

    # Python 3.6+, how it *can* be done, not supported right now in pytorch
    model = nn.Sequential(
        conv1=nn.Conv2d(1,20,5),
        relu1=nn.ReLU(),
        conv2=nn.Conv2d(20,64,5),
        relu2=nn.ReLU()
    )
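The ordering guarantee can be checked without any deep-learning library. In this sketch `build_pipeline` is a hypothetical helper that simply reports the order in which its keyword arguments arrived.

```python
def build_pipeline(**stages):
    """Hypothetical helper: return stage names in the order they were passed."""
    # Python 3.6+ keeps **kwargs in call order.
    return list(stages)

order = build_pipeline(conv1='conv', relu1='relu', conv2='conv', relu2='relu')

layers = {}
layers['first'] = 1
layers['second'] = 2
layers['third'] = 3
insertion_order = list(layers)   # plain dicts keep insertion order too
```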
    # handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
    model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

    # picking two last values from a sequence
    *prev, next_to_last, last = values_history

    # This also works with any iterables, so if you have a function that yields e.g. qualities,
    # below is a simple way to take only last two values from a list
    *prev, next_to_last, last = iter_train(args)
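Here is a runnable version of the unpacking patterns above. The generator `iter_losses` and its values are invented for the example and stand in for a real training loop.

```python
def iter_losses():
    """Hypothetical training loop yielding one loss per epoch."""
    for loss in [0.9, 0.7, 0.5, 0.4]:
        yield loss

# The starred name collects everything not claimed by the other targets.
*earlier, next_to_last, last = iter_losses()
first, *rest = 'abc'
head, *middle, tail = range(5)
```

The starred target always receives a list, whatever the type of the iterable on the right.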
    # Python 2
    import cPickle as pickle
    import numpy
    print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
    # result: 23691675

    # Python 3
    import pickle
    import numpy
    len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
    # result: 8000162
The pickled output shrinks to roughly one third of its Python 2 size.
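One likely contributor to the difference is the default pickle protocol: Python 2 defaults to the ASCII-based protocol 0, while Python 3 defaults to a binary one. The sketch below compares the two on a plain list of floats, so it illustrates the protocol effect only and does not reproduce the numpy figures above.

```python
import pickle

data = [i / 7 for i in range(1000)]    # hypothetical payload of 1000 floats

# Protocol 0 is the text protocol that Python 2 used by default.
text_size = len(pickle.dumps(data, protocol=0))
binary_size = len(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

restored = pickle.loads(pickle.dumps(data))   # default protocol round trip
```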
    labels = <initial_value>

    predictions = [model.predict(data) for data, labels in dataset]

    # labels are overwritten in Python 2
    # labels are not affected by comprehension in Python 3
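The scoping rule is easy to verify in Python 3. In this sketch the dataset is a made-up list of pairs and `data.upper()` stands in for a model prediction.

```python
labels = 'initial value'
dataset = [('data0', 'label0'), ('data1', 'label1')]   # hypothetical pairs

# The comprehension has its own scope, so its `labels` does not leak out.
predictions = [data.upper() for data, labels in dataset]
labels_after_comprehension = labels

for data, labels in dataset:    # a plain for loop does rebind labels
    pass
labels_after_loop = labels
```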
    # Python 2
    class MySubClass(MySuperClass):
        def __init__(self, name, **options):
            super(MySubClass, self).__init__(name='subclass', **options)

    # Python 3
    class MySubClass(MySuperClass):
        def __init__(self, name, **options):
            super().__init__(name='subclass', **options)
Merging two dicts
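A self-contained version of the zero-argument `super()` call, with a hypothetical `Base`/`Child` pair in place of the classes above:

```python
class Base:
    def __init__(self, name, **options):
        self.name = name
        self.options = options

class Child(Base):
    def __init__(self, name, **options):
        # Zero-argument form, equivalent to super(Child, self).__init__(...)
        super().__init__(name='child:' + name, **options)

model = Child('net', depth=3)
```

Because the class name is no longer repeated, renaming `Child` does not require touching the `super()` call.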
    x = dict(a=1, b=2)
    y = dict(b=3, d=4)
    # Python 3.5+
    z = {**x, **y}
    # z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.
Python 3.5+ makes it easy to merge not only dicts but also lists and other collections:
    [*a, *b, *c]  # list, concatenating
    (*a, *b, *c)  # tuple, concatenating
    {*a, *b, *c}  # set, union

    # Python 3.5+
    do_something(**{**default_settings, **custom_settings})

    # Also possible, this code also checks there is no intersection between keys of dictionaries
    do_something(**first_args, **second_args)
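The two call styles differ when the dictionaries share a key. The sketch below shows both, with a hypothetical `describe` function and made-up settings.

```python
def describe(**settings):
    """Hypothetical function that just returns the settings it received."""
    return settings

default_settings = {'lr': 0.1, 'epochs': 10}
custom_settings = {'epochs': 50}

# Merging first: the later dict wins for duplicate keys.
merged = describe(**{**default_settings, **custom_settings})

try:
    # Unpacking both directly: a duplicate keyword is an error.
    describe(**default_settings, **custom_settings)
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True
```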
Python 2 provides two integer types, int and long, while Python 3 provides only one, int. For example:
    isinstance(x, numbers.Integral)  # Python 2, the canonical way
    isinstance(x, (long, int))       # Python 2
    isinstance(x, int)               # Python 3, easier to remember
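A quick check of the Python 3 behaviour, including one caveat worth knowing: `bool` is a subclass of `int`, so it also passes the test.

```python
import numbers

big = 2 ** 100                       # would have been a long in Python 2

is_plain_int = isinstance(big, int)
is_integral = isinstance(big, numbers.Integral)
bool_is_int = isinstance(True, int)  # bool subclasses int, which may surprise
float_is_int = isinstance(3.0, int)  # floats are not ints, even when whole
```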
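Python 3 offers many new features that make coding more convenient while also bringing better safety and higher performance, and the core developers have long recommended migrating to Python 3 as soon as possible. Of course, the cost of migration varies from system to system; I hope this article helps with your move from Python 2 to Python 3.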