Question:
I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with Python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory yet small enough to fit on a hard drive.
My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:
What are some best-practice workflows for accomplishing the following (a sketch follows the list):
- Loading flat files into a permanent, on-disk database structure
- Querying that database to retrieve data to feed into a pandas data structure
- Updating the database after manipulating pieces in pandas
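A minimal sketch of how those three steps could look with HDFStore; the file names ('consumers.csv', 'store.h5'), the chunk size, and the column names are hypothetical placeholders, not part of the original question:

    import pandas as pd

    # 1. Load a flat file into a permanent on-disk table, chunk by chunk,
    #    so the whole file never has to fit in memory.
    with pd.HDFStore('store.h5') as store:
        for chunk in pd.read_csv('consumers.csv', chunksize=50000):
            # data_columns makes these columns queryable on disk later
            store.append('df', chunk, data_columns=['line_of_business'])

    # 2. Query the store, pulling only the columns that fit in memory.
    with pd.HDFStore('store.h5') as store:
        subset = store.select('df', columns=['var1', 'var2'])

    # 3. Manipulate in pandas, then write the derived column back to the
    #    same file as its own table.
    subset['newvar'] = subset['var1'] * 2
    with pd.HDFStore('store.h5') as store:
        store.append('derived', subset[['newvar']])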
Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".
Edit -- an example of how I would like this to work:
- Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
- In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
- I would create new columns by performing various operations on the selected columns.
- I would then have to append these new columns into the database structure.
I am trying to find a best-practice way of performing these steps. Reading links about pandas and PyTables, it seems that appending a new column could be a problem.
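One reason for that: once an HDFStore table is created, PyTables cannot add a column to it in place. A common workaround, sketched below with the same hypothetical names as above, is to write derived columns to a separate table in the same file and rely on the shared row index to join them back:

    import pandas as pd

    with pd.HDFStore('store.h5') as store:
        # Pull only the columns needed to compute the new variable.
        base = store.select('df', columns=['var1', 'var2'])
        new_cols = pd.DataFrame({'newvar': base['var1'] + base['var2']},
                                index=base.index)
        # Derived columns live in their own table; the shared index
        # ties them back to the base table.
        store.append('derived_newvar', new_cols)

        # Recombine on demand: join aligns rows on that shared index.
        combined = store.select('df', columns=['var1']).join(
            store.select('derived_newvar'))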
Edit -- Responding to Jeff's questions specifically:
- I am building consumer credit risk models. The kinds of data include phone, SSN, and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have nearly 1,000 to 2,000 fields on average, of mixed data types: continuous, nominal, and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
- Typical operations involve combining several columns using conditional logic into a new, compound column. For example:

    if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'

  The result of these operations is a new column for every record in my dataset (a pandas sketch of this kind of operation appears after this list).
- Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
- A typical project file is usually about 1GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
- It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations. (See the report sketch below.)
- The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
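For what it's worth, the SAS-style if/elif in the list above maps naturally onto numpy.select; this is just an illustrative sketch using the placeholder names var1, var2, and newvar:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'var1': [3, 1, 1], 'var2': [0, 4, 0]})

    # Conditions are evaluated in order, mirroring if/elif; rows matching
    # neither condition fall through to the default value.
    conditions = [df['var1'] > 2, df['var2'] == 4]
    df['newvar'] = np.select(conditions, ['A', 'B'], default='')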
It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
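Finally, a sketch of the row-subsetting report described above, again under the assumed store layout (with 'line_of_business' declared as a data column so it can be filtered on disk):

    import pandas as pd

    with pd.HDFStore('store.h5') as store:
        # Pull only the Retail rows, and only the columns the report needs.
        retail = store.select('df',
                              where="line_of_business == 'Retail'",
                              columns=['var1', 'var2'])

    # Simple frequencies and descriptive statistics on the subset.
    print(retail['var1'].value_counts())
    print(retail.describe())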
Solution:
Reference 1:
https://stackoom.com/question/xqJF/使用熊貓的-大數據-工做流程
Reference 2:
https://oldbug.net/q/xqJF/Large-data-work-flows-using-pandas