Notes on Generator 1

Since English is really the programmer's mother tongue, I'll try writing blog posts in English.


Iterators

Iteration is the process of stepping through an iterable object; common iterable objects include dicts, strings, files, etc.

Iterating consumes the contents of the target object: once an iterator has been stepped through, it is exhausted.

Functions like sum(), min(), list() and tuple(), as well as the in operator, all iterate over their argument, so they exhaust any iterator they are given.

A list itself is already iterable; calling iter(item_list) returns an iterator over it, and repeated next() calls return the elements one by one.
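
A quick interactive check of both points, using the built-in next() (equivalent to calling .next() in Python 2): iter() returns an iterator over the list, next() pulls elements one at a time, and a consuming function such as sum() drains whatever is left.

>>> items = [1, 2, 3]
>>> it = iter(items)
>>> next(it)
1
>>> next(it)
2
>>> sum(it)        # consumes whatever is left in the iterator
3
>>> next(it)       # the iterator is now exhausted
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The list items itself is unchanged and can produce a fresh iterator at any time.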

Any object that implements __iter__() and next() (__next__() in Python 3) follows the iterator protocol and can be iterated over.
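
As a minimal sketch of that protocol, here is a hypothetical Countdown class (Python 2 method names; in Python 3, next() becomes __next__()) that can be used directly in a for loop:

class Countdown(object):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # an iterator simply returns itself
        return self

    def next(self):        # __next__ in Python 3
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

>>> for x in Countdown(3): print x,
...
3 2 1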

# in Operator

for x in obj:
    # statements

# What's inside

_iter = iter(obj)
while 1:
    try:
        x = _iter.next()
    except StopIteration:
        break
    # statements

Generator

A generator might be the easiest way to get an iterator.

def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1
# Note that the two lines below don't start running countdown's body; the body only runs once next() is called.
# yield produces n, then suspends the whole function until next() is called again.
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.next()
Counting down from 10
10
>>> x.next()
9
...
>>> x.next()
1
# Once the function body finishes, the next call to next() raises StopIteration.
>>> x.next()
Traceback (most recent call last):
 File "<stdin>", line 1, in ?
StopIteration
>>>

The Python 3.4 version is below.

def countdown(n):
    print("Counting down from", n)
    while n > 0:
        yield n
        n -= 1
    return 'exits'
>>> x = countdown(3)
>>> x
<generator object countdown at 0x101bd7288>
>>> next(x)
Counting down from 3
3
>>> next(x)
2
>>> next(x)
1
>>> next(x)
# In Python 3.4, a generator function can also return a value; when the generator is exhausted, that value is attached to the raised StopIteration (shown below as its message).
# A return with a value inside a generator is a SyntaxError in Python 2.7.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration: exits
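
That returned value can actually be captured. A small Python 3 sketch, reusing the countdown above: the value rides on the StopIteration exception's value attribute (a yield from expression would receive it as its result, too).

gen = countdown(3)
try:
    while True:
        print(next(gen))
except StopIteration as exc:
    # prints: generator returned: exits
    print('generator returned:', exc.value)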

Generators vs. Iterators

  • A generator function isn't an ordinary iterable object: calling it runs no code and simply returns a fresh generator.
  • Iterating over a generator is strictly one-time; once a full iteration is done, you have to call the generator function again to get a new one (see the sketch below).
  • Unlike generators, iterables such as list and dict can be iterated over any number of times.
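
A short interactive sketch of this one-pass behaviour:

>>> squares = (x*x for x in [1, 2, 3])
>>> sum(squares)
14
>>> sum(squares)           # the generator is already exhausted
0
>>> nums = [1, 2, 3]       # a list can be iterated again and again
>>> sum(nums), sum(nums)
(6, 6)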

Generator Expressions

Below, variable b is a generator.

>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8

When list a is very large, using the generator can save a lot of memory, simply because it doesn't store another big list in memory the way the list comprehension below does.

>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
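
To make the memory point concrete, here is a rough sketch using sys.getsizeof; note that it only measures the container object itself, not the elements it refers to.

import sys

a = range(1000000)            # a real list in Python 2
full = [2*x for x in a]       # builds and stores every result up front
lazy = (2*x for x in a)       # computes results one at a time on demand

print sys.getsizeof(full)     # grows with len(a): several megabytes here
print sys.getsizeof(lazy)     # a small generator object of constant size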

A generator example

Suppose we have a 1 GB access.log from nginx; the problem is to sum up the sizes of all the responses.

Every line of access.log looks like the one below:

xx.xx.xx.xx - - [01/Jul/2014:10:06:06 +0800] "GET /share/ajax/?image_id=xxx&user_id=xxx HTTP/1.1" 200 72 "http://www.baidu.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
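
The response size is the field at index 9 after splitting on spaces; split(' ', 11) caps the number of splits so the quoted user-agent string stays in one piece. Assuming the sample line above is stored in a string named line:

>>> line.split(' ', 11)[9]
'72'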

We have two solutions: one implemented with generator expressions, the other with a plain for loop.

import cProfile, pstats, StringIO

def gene():
    with open('access.log', 'r') as f:
        # field index 9 of each line is the response size ('-' when missing)
        lines = (line.split(' ', 11)[9] for line in f)
        sizes = (int(size) for size in lines if size != '-')
        print "Generators Result: ", sum(sizes)

pr = cProfile.Profile()
pr.enable()
gene()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


def loop():
    size_sum = 0
    with open('access.log', 'r') as f:
        # readlines() loads the entire file into memory before looping
        for line in f.readlines():
            size = line.split(' ', 11)[9]
            if size != '-':
                size_sum += int(size)
        print "Forloop Result: ", size_sum

pr = cProfile.Profile()
pr.enable()
loop()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


Sh4n3@Macintosh:~% python ger.py
Generators Result: 13678125506
         12481726 function calls in 41.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   41.487   41.487 ger.py:3(gene)
        1    1.864    1.864   41.487   41.487 {sum}
  4160297   17.209    0.000   39.623    0.000 ger.py:6(<genexpr>)
  4160713   11.972    0.000   22.414    0.000 ger.py:5(<genexpr>)
  4160712   10.442    0.000   10.442    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Forloop Result: 13678125506
         4160716 function calls in 142.672 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   84.979   84.979  142.672  142.672 ger.py:9(loop)
        1   47.609   47.609   47.609   47.609 {method 'readlines' of 'file' objects}
  4160712   10.084    0.000   10.084    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

So the results show that the generator version is about 3x faster than the for-loop version (41.5 s vs. 142.7 s).
