emacs org mode
+ emacs magit
+ bitbucket
+ python
. There must be some room for improvement.
The course uses R. I don't want to learn yet another similar language, so I'll look for the corresponding numpy and scipy solutions.
Getting and Cleaning Data
Raw data --> Processing scripts --> tidy data (often ignored in the classes but really important)
--> data analysis (covered in machine learning classes)
--> data communication
DataFrame.merge
DataFrame.join
in pandas
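A minimal sketch of the difference between the two (toy frames and column names are made up for illustration): `merge` does an SQL-style join on columns, while `join` aligns on the index by default.

```python
import pandas as pd

# hypothetical toy frames to contrast merge (column-based) and join (index-based)
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

# merge: SQL-style join on a shared column
merged = left.merge(right, on='key')   # columns: key, x, y

# join: aligns on the index, so set the key as index first
joined = left.set_index('key').join(right.set_index('key'))
```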
Code book? (⊙o⊙)…
Do Python and R handle significant digits well, so that you don't have to think about float versus double the way you do in C? In some extreme cases you would probably still need a library like sympy. A code book serves much the same role as the lab notebook in a wet lab. I'm glad I learned emacs org mode early on; it fits well here. But I had overlooked the importance of the "Info about the variables" part.
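A minimal sketch of why binary floats still need care, with the stdlib `Fraction` shown as one exact-arithmetic escape hatch (sympy's `Rational` plays a similar role):

```python
from fractions import Fraction

# binary floating point cannot represent 0.1 exactly
print(0.1 + 0.2 == 0.3)   # False

# exact rational arithmetic sidesteps the rounding issue
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))   # True
```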
If there are many features, and the features themselves carry real meaning, they need to be chosen carefully. I remember a talk where a financial firm used decision trees to build portfolios: the algorithm itself was run-of-the-mill, but the lecturer was tight-lipped about exactly which features were used.
"There are many stages to the design and analysis of a successful study. The last of these steps is the calculation of an inferential statistic such as a P value, and the application of a 'decision rule' to it (for example, P < 0.05). In practice, decisions that are made earlier in data analysis have a much greater impact on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled."
Leek, Jeffrey T., and Roger D. Peng. "Statistics: P values are just the tip of the iceberg." Nature 520.7549 (2015): 612.
I usually just use wget directly, but that is hard to integrate into a script. A few Python functions likely to be useful when downloading:
```python
# set up the env
os.path.dirname(os.path.realpath(__file__))
os.getcwd()
os.path.join()
os.chdir()
os.path.exists()
os.makedirs()

# download
urllib.request.urlretrieve()
urllib.request.urlopen()

# to tag your downloaded files
datetime.timezone()
datetime.datetime.now()

# an example
import shutil
import ssl
import urllib.request as ur

def download(myurl):
    """Download to the current directory and return the local file name."""
    fn = myurl.split('/')[-1]
    # note: this skips certificate verification; fine for a quick script,
    # but avoid it for anything sensitive
    context = ssl._create_unverified_context()
    with ur.urlopen(myurl, context=context) as response, open(fn, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return fn
```
pandas.read_csv()
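A tiny self-contained `read_csv` sketch (the data is inlined via `StringIO`; the column names are made up):

```python
import io
import pandas as pd

# made-up CSV text standing in for a downloaded file
csv_text = "name,zipcode\nAcme Diner,21231\nCafe Foo,21202\n"

# read_csv accepts any file-like object, not just a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (2, 2)
```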
Here is a very good introduction
Below are my summaries:
Python's standard library ships with xml.etree.ElementTree for parsing XML. ElementTree represents the whole XML document, and Element represents a single node.
The first element in every XML document is called the root element. An XML document can have only one root, so the following does not conform to the XML standard:
<foo></foo> <bar></bar>
Traverse recursively:
```python
import xml.etree.ElementTree as ET

# an exercise: find all elements with zipcode equal to 21231
xml_fn = download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
tree = ET.parse(xml_fn)
for child in tree.iter():
    if child.tag == 'zipcode' and child.text == '21231':
        print(child)
```
To the naked eye, JSON looks like a nested Python dict. The built-in json module is used much like pickle.
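A sketch of that pickle-like round trip with the built-in json module (the record contents are made up):

```python
import json

# made-up record with nested structure
record = {'name': 'Acme Diner', 'zipcodes': [21231, 21202]}

text = json.dumps(record)     # serialize, like pickle.dumps but to text
restored = json.loads(text)   # deserialize

assert restored == record     # the round trip preserves the nested structure
```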
Python makes a distinction between matching and searching. Matching looks only at the start of the target string, whereas searching looks for the pattern anywhere in the target.
Always use raw strings for regexes.
Character sets
sth like r'[A-Za-z_]'
would match an underscore or any uppercase or lowercase ASCII letter.
Characters that have special meanings in other regular expression contexts do not have special meanings within square brackets. The only character with a special meaning inside square brackets is a ^, and then only if it is the first character after the left (opening) bracket.
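A short sketch of the match-versus-search distinction, using a raw-string character set:

```python
import re

pattern = r'[A-Za-z_]+'   # raw string; the set matches letters or underscore

# match only looks at the start of the string
print(re.match(pattern, '123_abc'))   # None: the string starts with digits

# search finds the pattern anywhere
m = re.search(pattern, '123_abc')
print(m.group())                      # '_abc'
```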
```python
import pandas as pd

df = pd.DataFrame()  # your data here

# Look at a bit of the data
df.head()
df.tail()

# summary
df.describe()
df.quantile()

# cov and corr
# DataFrame's corr and cov methods return a full correlation or covariance
# matrix as a DataFrame, respectively.
# To calculate pairwise correlations between a DataFrame's columns or rows:
dset.corrwith(dset['<one col name>'])

# you can write your own analysis function and apply it to the dataframe, for example:
f = lambda x: x.max() - x.min()
df.apply(f, axis=1)
```
```python
df.dropna()
df.fillna(0)

# to modify in place
_ = df.fillna(0, inplace=True)

# fill the NaNs with the column means
# (or with predictions from e.g. naive Bayes)
data.fillna(data.mean())
```
Principles of Analytic Graphics
Show comparisons
If you build a model that makes predictions, also report the performance of a random-guess baseline.
Show causality, mechanism, explanation, systematic structure
Show multivariate data
The world is inherently multivariate
Integration of evidence
Simple Summaries of Data
Two dimensions
> 2 dimensions
Graphics File Devices
rnorm: generate random Normal variates with a given mean and standard deviation
dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
pnorm: evaluate the cumulative distribution function for a Normal distribution

d for density
r for random number generation
p for cumulative distribution
q for quantile function

Setting the random number seed with set.seed ensures reproducibility.
```r
> set.seed(1)
> rnorm(5)
```
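Following the plan above of mapping R onto numpy/scipy, a sketch of the equivalents (assuming numpy and scipy are installed):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)                     # like set.seed(1): reproducible
samples = rng.normal(loc=0.0, scale=1.0, size=5)   # rnorm(5)

density = norm.pdf(0.0)      # dnorm(0), i.e. 1/sqrt(2*pi)
cumulative = norm.cdf(0.0)   # pnorm(0) == 0.5
quantile = norm.ppf(0.975)   # qnorm(0.975), roughly 1.96
```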