Basic MongoDB operations from Python with pymongo, and exporting to CSV files

Date: 2019-11-13



1. Environment

Python: 3.6.1
Python IDE: PyCharm
OS: Windows 7

2. A simple example

import pymongo

# Address and port of the MongoDB service
mongo_url = "127.0.0.1:27017"

# Connect to MongoDB; with no argument the default is "localhost:27017"
client = pymongo.MongoClient(mongo_url)

# Select the database myDatabase
DATABASE = "myDatabase"
db = client[DATABASE]

# Select the collection (table) myDatabase.myCollection
COLLECTION = "myCollection"
db_coll = db[COLLECTION]

# Find the records in myCollection whose date field equals 2017-08-29,
# sorted by age in descending order
queryArgs = {'date':'2017-08-29'}
search_res = db_coll.find(queryArgs).sort('age',-1)
for record in search_res:
    print(f"_id = {record['_id']}, name = {record['name']}, age = {record['age']}")
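3. Key point

For read operations such as gathering statistics, use multiple threads where possible to save time. Keep an eye on the number of threads, though, because too many of them consume a lot of memory.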
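A minimal sketch of the bounded-thread idea, using the standard library's ThreadPoolExecutor. The helper name count_by_date and the stand-in data are hypothetical; in real use the callable would wrap a read against the collection, and a single MongoClient can be shared across the threads because PyMongo clients are thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor


def count_by_date(count_one, dates, max_workers=4):
    """Call count_one(date) for every date on a pool of at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input dates
        counts = list(pool.map(count_one, dates))
    return dict(zip(dates, counts))


# Stand-in for a real read such as:
#   lambda d: db_coll.count_documents({'date': d})
fake_counts = {'2017-08-01': 106, '2017-08-02': 98}
result = count_by_date(fake_counts.get, list(fake_counts))
print(result)
```

Capping max_workers is what keeps memory use predictable: the pool never runs more than that many reads at once, however many dates are queued.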
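4. MongoDB data types

MongoDB supports many data types, including:

String - the most commonly used type for storing data. Strings in MongoDB must be valid UTF-8.
Integer - stores numeric values. BSON has separate 32-bit and 64-bit integer types.
Boolean - stores a boolean (true / false) value.
Double - stores floating-point values.
Min/Max keys - compare a value against the lowest and highest BSON elements.
Array - stores an array, a list, or multiple values under one key.
Timestamp - useful for recording when a document was modified or added.
Object - used for embedded documents.
Null - stores a null value.
Symbol - used identically to a string, but generally reserved for languages that have a specific symbol type.
Date - stores a date and time in UNIX time format. You can specify your own date and time by creating a Date object and passing the day, month, and year into it.
Object ID - stores the document's ID.
Binary data - stores binary data.
Code - stores JavaScript code in a document.
Regular expression - stores a regular expression.

Unsupported data types:

Python's set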
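One way around the missing set type is to convert sets to lists before inserting. The sketch below is only an illustration: the helper name sets_to_lists is hypothetical, and it assumes the set elements can be sorted (sorting just makes the stored order deterministic).

```python
def sets_to_lists(value):
    """Recursively replace sets with sorted lists so the value can be stored."""
    if isinstance(value, (set, frozenset)):
        return sorted(sets_to_lists(v) for v in value)
    if isinstance(value, dict):
        return {k: sets_to_lists(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sets_to_lists(v) for v in value]
    return value


doc = {'_id': 'B01EYCLJ04', 'tags': {'audio', 'pro'}}
safe_doc = sets_to_lists(doc)
print(safe_doc)
```

The converted document can then be passed to insert_one; when reading it back, wrap the list in set(...) if set semantics are needed again.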
5. Operations on a collection (table)

import pymongo

# Address and port of the MongoDB service
mongo_url = "127.0.0.1:27017"

# Connect to MongoDB; with no argument the default is "localhost:27017"
client = pymongo.MongoClient(mongo_url)
# Select the database (here: amazon)
DATABASE = "amazon"
db = client[DATABASE]

# Select the collection (table): amazon.galance20170801
COLLECTION = "galance20170801"
db_coll = db[COLLECTION]
5.1. Finding records: find

(5.1.1) Specifying which fields to return
# Example 1: all fields
# select * from galance20170801
searchRes = db_coll.find()
# or: searchRes = db_coll.find({})

# Example 2: use a dict to specify which fields to show
# select _id,key from galance20170801
queryArgs = {}
projectionFields = {'_id':True, 'key':True} # specified with a dict
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B01EYCLJ04', 'key': 'pro audio'}

# Example 3: use a dict to specify which fields to drop
queryArgs = {}
projectionFields = {'_id':False, 'key':False} # specified with a dict
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'activity': False, 'avgStar': 4.3, 'color': 'Yellow & Black', 'date': '2017-08-01'}

# Example 4: use a list to specify which fields to show
# select _id,key,date from galance20170801
queryArgs = {}
projectionFields = ['key','date'] # specified with a list; _id is always returned in the results
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B01EYCLJ04', 'date': '2017-08-01', 'key': 'pro audio'}
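The list and dict forms of a projection can be tied together with a small helper. This is a sketch under stated assumptions: make_projection is a hypothetical name, and it relies on the fact that a list of field names is equivalent to a dict that includes each name, with _id returned unless it is explicitly excluded.

```python
def make_projection(fields, include_id=True):
    """Build a projection dict that includes the given fields."""
    projection = {name: True for name in fields}
    if not include_id:
        # _id is the one field that may be excluded alongside included fields
        projection['_id'] = False
    return projection


with_id = make_projection(['key', 'date'])
without_id = make_projection(['key', 'date'], include_id=False)
print(with_id)
print(without_id)
```

Either result can be passed as the projection argument of find, for example db_coll.find({}, projection=without_id).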
(5.1.2) Specifying query conditions
(5.1.2.1). Comparison: =, !=, >, <, >=, <=
$ne: not equal
$gt: greater than
$lt: less than
$lte: less than or equal
$gte: greater than or equal

# Example 1: equal
# select _id,key,sales,date from galance20170801 where key = 'TV & Video'
queryArgs = {'key':'TV & Video'}
projectionFields = ['key','sales','date']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': '0750699973', 'date': '2017-08-01', 'key': 'TV & Video', 'sales': 0}

# Example 2: not equal
# select _id,key,sales,date from galance20170801 where sales != 0
queryArgs = {'sales':{'$ne':0}}
projectionFields = ['key','sales','date']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B01M996469', 'date': '2017-08-01', 'key': 'stereos', 'sales': 2}

# Example 3: greater than
# where sales > 100
queryArgs = {'sales':{'$gt':100}}
# Result: {'_id': 'B010OYASRG', 'date': '2017-08-01', 'key': 'Sound Bar', 'sales': 124}

# Example 4: less than
# where sales < 100
queryArgs = {'sales':{'$lt':100}}
# Result: {'_id': 'B011798DKQ', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 0}

# Example 5: a range
# where sales > 50 and sales < 100
queryArgs = {'sales':{'$gt':50, '$lt':100}}
# Result: {'_id': 'B008D2IHES', 'date': '2017-08-01', 'key': 'Sound Bar', 'sales': 66}

# Example 6: a range with inclusive bounds
# where sales >= 50 and sales <= 100
queryArgs = {'sales':{'$gte':50, '$lte':100}}
# Result: {'_id': 'B01M6DHW26', 'date': '2017-08-01', 'key': 'radios', 'sales': 50}
(5.1.2.2). and
# Example 1: conditions on different fields
# where date = '2017-08-01' and sales = 100
queryArgs = {'date':'2017-08-01', 'sales':100}
# Result: {'_id': 'B01BW2YYYC', 'date': '2017-08-01', 'key': 'Video', 'sales': 100}

# Example 2: conditions on the same field
# where sales >= 50 and sales <= 100
# Correct: queryArgs = {'sales':{'$gte':50, '$lte':100}}
# Wrong:   queryArgs = {'sales':{'$gte':50}, 'sales':{'$lte':100}}  (a Python dict keeps only the last 'sales' key)
# Result: {'_id': 'B01M6DHW26', 'date': '2017-08-01', 'key': 'radios', 'sales': 50}
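Why the repeated-key form fails can be checked in plain Python, without a database: a dict literal with a duplicate key silently keeps only the last value. The variable names below are illustrative only; the $and form is shown as an alternative way to put two conditions on one field.

```python
# The second 'sales' entry overwrites the first, so the lower bound is lost
wrong = {'sales': {'$gte': 50}, 'sales': {'$lte': 100}}

# Both operators inside one condition dict for the field
right = {'sales': {'$gte': 50, '$lte': 100}}

# Equivalent explicit form using $and
explicit = {'$and': [{'sales': {'$gte': 50}}, {'sales': {'$lte': 100}}]}

print(wrong)
print(right)
```

Python raises no error for the duplicate key, which is why this mistake tends to show up only as unexpectedly broad query results.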
(5.1.2.3). or
# Example 1: "or" across different fields
# where date = '2017-08-01' or sales = 100
queryArgs = {'$or':[{'date':'2017-08-01'}, {'sales':100}]}
# Result: {'_id': 'B01EYCLJ04', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 0}

# Example 2: "or" on the same field
# where sales = 100 or sales = 120
queryArgs = {'$or':[{'sales':100}, {'sales':120}]}
# Result:
# {'_id': 'B00X5RV14Y', 'date': '2017-08-01', 'key': 'Chargers', 'sales': 120}
# {'_id': 'B0728GGX6Y', 'date': '2017-08-01', 'key': 'Glasses', 'sales': 100}
(5.1.2.4). in, not in, all
# Example 1: in
# where sales in (100,120)
# The values can be given as a list (the usual form) or as a tuple; pymongo encodes both as arrays.
# A "TypeError: unhashable type: 'list'" at this point usually comes from a typo that turns
# the inner dict into a set literal, e.g. {'$in', [100,120]} with a comma instead of a colon.
queryArgs = {'sales':{'$in': [100,120]}}
# Result:
# {'_id': 'B00X5RV14Y', 'date': '2017-08-01', 'key': 'Chargers', 'sales': 120}
# {'_id': 'B0728GGX6Y', 'date': '2017-08-01', 'key': 'Glasses', 'sales': 100}
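The "unhashable type: 'list'" error associated with $in can be reproduced without a database. The sketch below assumes the error came from a comma typed in place of the colon, which turns the inner dict into a set literal; that cause is a guess, but the Python behaviour itself is easy to verify.

```python
# Both spellings are ordinary dicts and are fine as query arguments
as_list = {'sales': {'$in': [100, 120]}}
as_tuple = {'sales': {'$in': (100, 120)}}

error_message = None
try:
    # Comma instead of colon: {'$in', [...]} is a set literal,
    # and a list cannot be a set element
    typo = {'sales': {'$in', [100, 120]}}
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The exception is raised while Python builds the query object, before pymongo is involved at all.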
# Example 2: not in
# where sales not in (100,120)
queryArgs = {'sales':{'$nin':(100,120)}}
# Result: {'_id': 'B01EYCLJ04', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 0}

# Example 3: match every value in the condition: all
# where sales = 100 and sales = 120
queryArgs = {'sales':{'$all':[100,120]}} # every value must match at once; $all is meant for array fields
# Result: no results

# Example 4: match every value in the condition: all
# where sales = 100 and sales = 100
queryArgs = {'sales':{'$all':[100,100]}} # every value must match at once
# Result: {'_id': 'B01BW2YYYC', 'date': '2017-08-01', 'key': 'Video', 'sales': 100}
(5.1.2.5). Whether a field exists
# Example 1: the field does not exist
# where rank2 is null
queryArgs = {'rank2':None} # matches documents where rank2 is missing or explicitly null
projectionFields = ['key','sales','date', 'rank2']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# Result: {'_id': 'B00ACOKQTY', 'date': '2017-08-01', 'key': '3D TVs', 'sales': 0}

# The same kind of check as a mongo shell command, using $exists
db.categoryAsinSrc.find({'isClawered': true, 'avgCost': {$exists: false}})

# Example 2: the field exists
# where rank2 is not null
queryArgs = {'rank2':{'$ne':None}}
projectionFields = ['key','sales','date','rank2']
searchRes = db_coll.find(queryArgs, projection = projectionFields).limit(100)
# Result: {'_id': 'B014I8SX4Y', 'date': '2017-08-01', 'key': '3D TVs', 'rank2': 4.0, 'sales': 0}
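The difference between a None condition and $exists is easy to blur, so here is a plain-Python emulation of the three conditions on top-level fields. It is only an illustration of the matching rules, not how the server evaluates queries, and the function names are hypothetical.

```python
def matches_null(doc, field):
    """Emulates {field: None}: the field is missing or explicitly null."""
    return doc.get(field) is None


def matches_not_null(doc, field):
    """Emulates {field: {'$ne': None}}: the field exists and is not null."""
    return doc.get(field) is not None


def matches_exists(doc, field, expected=True):
    """Emulates {field: {'$exists': expected}}: only presence matters."""
    return (field in doc) == expected


docs = [
    {'_id': 'a', 'rank2': 4.0},
    {'_id': 'b', 'rank2': None},
    {'_id': 'c'},
]
null_ids = [d['_id'] for d in docs if matches_null(d, 'rank2')]
not_null_ids = [d['_id'] for d in docs if matches_not_null(d, 'rank2')]
missing_ids = [d['_id'] for d in docs if matches_exists(d, 'rank2', False)]
print(null_ids, not_null_ids, missing_ids)
```

Document 'b' is the interesting case: it matches the None condition but not $exists: false, because the field is present with a null value.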
(5.1.2.6). Regular expression matching: $regex (SQL: like)
# Example 1: the key field contains the substring "audio"
# where key like "%audio%"
queryArgs = {'key':{'$regex':'.*audio.*'}}
# Result: {'_id': 'B01M19FGTZ', 'date': '2017-08-01', 'key': 'pro audio', 'sales': 1}
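When the substring comes from user input it may contain regex metacharacters, so escaping it first is safer. The helper below is a sketch: contains_query is a hypothetical name, and it relies on $regex matching anywhere in the string unless the pattern is anchored, which makes the surrounding '.*' optional.

```python
import re


def contains_query(field, text, ignore_case=False):
    """Build a 'field contains text' condition with the text escaped."""
    condition = {'$regex': re.escape(text)}
    if ignore_case:
        condition['$options'] = 'i'  # case-insensitive matching
    return {field: condition}


query = contains_query('key', 'audio')
literal_dot = contains_query('key', 'a.b', ignore_case=True)

# In an unanchored search, '.*audio.*' and 'audio' accept the same strings
same_match = bool(re.search('.*audio.*', 'pro audio')) == bool(re.search('audio', 'pro audio'))
print(query, literal_dot, same_match)
```

The resulting dict can be passed to find like any other condition, for example db_coll.find(contains_query('key', 'audio')).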
(5.1.2.7). An array must contain given elements: $all
# linkNameLst is an array; require it to contain the element 'Electronics, Computers & Office'.
db.getCollection("2018-01-24").find({'linkNameLst': {'$all': ['Electronics, Computers & Office']}})

# linkNameLst is an array; require it to contain both 'Wearable Technology' and 'Electronics, Computers & Office'.
db.getCollection("2018-01-24").find({'linkNameLst': {'$all': ['Wearable Technology', 'Electronics, Computers & Office']}})
(5.1.2.8). Querying by array size
There are two approaches.
Approach 1: use $where (very flexible, but slower)
# priceLst is an array; the goal is to query len(priceLst) < 3
db.getCollection("20180306").find({$where: "this.priceLst.length < 3"})
For $where, see the official documentation: http://docs.mongodb.org/manual/reference/operator/query/where/.
Approach 2: test whether the element at a given index exists (more efficient)
For example, len(priceLst) < 3 means that priceLst[2] does not exist:
# priceLst is an array; the goal is to query len(priceLst) < 3
db.getCollection("20180306").find({'priceLst.2': {$exists: 0}})
For example, len(priceLst) > 3 means that priceLst[3] exists:
# priceLst is an array; the goal is to query len(priceLst) > 3
db.getCollection("20180306").find({'priceLst.3': {$exists: 1}})
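The index-existence trick translates directly to pymongo query dicts. The two helpers below are a sketch with hypothetical names; they only build the condition, so they can be checked without a database.

```python
def array_shorter_than(field, n):
    """len(field) < n  <=>  the element at index n - 1 does not exist (n >= 1)."""
    return {f"{field}.{n - 1}": {'$exists': False}}


def array_longer_than(field, n):
    """len(field) > n  <=>  the element at index n exists."""
    return {f"{field}.{n}": {'$exists': True}}


shorter = array_shorter_than('priceLst', 3)
longer = array_longer_than('priceLst', 3)
print(shorter)
print(longer)
```

One caveat: the "shorter than" condition also matches documents that have no priceLst field at all, since a missing field has no element at any index. For an exact length, the $size operator is the direct alternative.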
(5.1.3) Limiting, sorting, and counting results
(5.1.3.1). Limiting the number of results: limit
# Example 1: sort by sales in descending order and take the first 100
# select top 100 _id,key,sales from galance20170801 where key = 'speakers' order by sales desc
queryArgs = {'key':'speakers'}
projectionFields = ['key','sales']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
topSearchRes = searchRes.sort('sales',pymongo.DESCENDING).limit(100)

(5.1.3.2). Sorting: sort
# Example 2: sales descending, rank ascending
# select _id,key,sales,rank from galance20170801 where key = 'speakers' order by sales desc,rank
queryArgs = {'key':'speakers'}
projectionFields = ['key','sales','rank']
searchRes = db_coll.find(queryArgs, projection = projectionFields)
# sortedSearchRes = searchRes.sort('sales',pymongo.DESCENDING) # a single field
sortedSearchRes = searchRes.sort([('sales', pymongo.DESCENDING),('rank', pymongo.ASCENDING)]) # several fields
# Result:
# {'_id': 'B000289DC6', 'key': 'speakers', 'rank': 3.0, 'sales': 120}
# {'_id': 'B001VRJ5D4', 'key': 'speakers', 'rank': 5.0, 'sales': 120}

(5.1.3.3). Counting: count
# Example 3: count the matching records
# select count(*) from galance20170801 where key = 'speakers'
queryArgs = {'key':'speakers'}
# Cursor.count() was removed in PyMongo 4; count_documents (PyMongo 3.7+) replaces find(...).count()
searchResNum = db_coll.count_documents(queryArgs)
# Result:
# 106
5.2. Adding records

5.2.1. Inserting a single record

# Example 1: specify _id; a duplicate _id raises an exception
ID = 'firstRecord'
insertDate = '2017-08-28'
count = 10
insert_record = {'_id':ID, 'endDate': insertDate, 'count': count}
insert_res = db_coll.insert_one(insert_record)
print(f"insert_id={insert_res.inserted_id}: {insert_record}")
# Result: insert_id=firstRecord: {'_id': 'firstRecord', 'endDate': '2017-08-28', 'count': 10}

# Example 2: do not specify _id; it is generated automatically
insertDate = '2017-10-10'
count = 20
insert_record = {'endDate': insertDate, 'count': count}
insert_res = db_coll.insert_one(insert_record)
print(f"insert_id={insert_res.inserted_id}: {insert_record}")
# Result: insert_id=59ad356d51ad3e2314c0d3b2: {'endDate': '2017-10-10', 'count': 20, '_id': ObjectId('59ad356d51ad3e2314c0d3b2')}

5.2.2. Bulk insert

# More efficient, but if you specify _id values they must not repeat
# ordered = True: stop at the first error and raise an exception
# ordered = False: continue past errors and raise an exception once all inserts have been attempted
insertRecords = [{'i':i, 'date':'2017-10-10'} for i in range(10)]
insertBulk = db_coll.insert_many(insertRecords, ordered = True)
print(f"insert_ids={insertBulk.inserted_ids}")
# Result: insert_ids=[ObjectId('59ad3ba851ad3e1104a4de6d'), ObjectId('59ad3ba851ad3e1104a4de6e'), ObjectId('59ad3ba851ad3e1104a4de6f'), ObjectId('59ad3ba851ad3e1104a4de70'), ObjectId('59ad3ba851ad3e1104a4de71'), ObjectId('59ad3ba851ad3e1104a4de72'), ObjectId('59ad3ba851ad3e1104a4de73'), ObjectId('59ad3ba851ad3e1104a4de74'), ObjectId('59ad3ba851ad3e1104a4de75'), ObjectId('59ad3ba851ad3e1104a4de76')]
5.3. Modifying records

# Update the record that matches the _id filter; if no record matches, insert it (upsert = True)
# item is assumed to be an existing dict-like record that has an '_id' key
updateFilter = {'_id': item['_id']}
updateRes = db_coll.update_one(filter = updateFilter,
                               update = {'$set': dict(item)},
                               upsert = True)
print(f"updateRes = matched:{updateRes.matched_count}, modified = {updateRes.modified_count}")

# Update some fields of the records that match a filter: i is an existing field, isUpdated is a new one
filterArgs = {'date':'2017-10-10'}
updateArgs = {'$set':{'isUpdated':True, 'i':100}}
updateRes = db_coll.update_many(filter = filterArgs, update = updateArgs)
print(f"updateRes: matched_count={updateRes.matched_count}, "
      f"modified_count={updateRes.modified_count} upserted_id={updateRes.upserted_id}")
# Result: updateRes: matched_count=8, modified_count=8 upserted_id=None
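The upsert pattern can be wrapped in a small function. This is a sketch rather than the author's code: upsert_by_id and RecordingCollection are hypothetical names, the function leaves _id out of $set because the filter already carries it, and it assumes the item has at least one field besides _id. The stand-in collection only records its arguments so the example runs without a server.

```python
def upsert_by_id(collection, item):
    """Update the document with item's _id, inserting it if it does not exist."""
    fields = {k: v for k, v in item.items() if k != '_id'}
    return collection.update_one({'_id': item['_id']}, {'$set': fields}, upsert=True)


class RecordingCollection:
    """Stand-in for a pymongo collection that records update_one arguments."""

    def __init__(self):
        self.calls = []

    def update_one(self, filter, update, upsert=False):
        self.calls.append((filter, update, upsert))


fake = RecordingCollection()
upsert_by_id(fake, {'_id': 'firstRecord', 'endDate': '2017-08-28', 'count': 10})
print(fake.calls)
```

With a real collection the call would be upsert_by_id(db_coll, item), and the returned result exposes matched_count, modified_count, and upserted_id as in the examples above.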

5.4. Deleting records

5.4.1. Deleting one record

# Example 1: the condition has the same form as a query condition
queryArgs = {'endDate':'2017-08-28'}
delRecord = db_coll.delete_one(queryArgs)
print(f"delRecord={delRecord.deleted_count}")
# Result: delRecord=1

5.4.2. Bulk delete

# Example 2: the condition has the same form as a query condition
queryArgs = {'i':{'$gt':5, '$lt':8}}
# db_coll.delete_many({}) # deletes every record in the collection
delRecord = db_coll.delete_many(queryArgs)
print(f"delRecord={delRecord.deleted_count}")
# Result: delRecord=2
6. Writing database documents to a CSV file

6.1. Standard code

Reading a CSV file
import csv

with open("phoneCount.csv", "r") as csvfile:
    reader = csv.reader(csvfile)
    # readlines is not needed here
    for line in reader:
        print(f"line = {line}, typeOfLine = {type(line)}, lenOfLine = {len(line)}")
# Output:
# line = ['850', 'rest', '43', 'NN'], typeOfLine = <class 'list'>, lenOfLine = 4
# line = ['9865', 'min', '1', 'CD'], typeOfLine = <class 'list'>, lenOfLine = 4
Writing a CSV file
# Standard template for exporting every record of a collection
import pymongo
import csv

# Set up the database
mongo_url = "127.0.0.1:27017"
DATABASE = "databaseName"
TABLE = "tableName"

client = pymongo.MongoClient(mongo_url)
db_des = client[DATABASE]
db_des_table = db_des[TABLE]

# Write the data to a CSV file
# Exporting directly from MongoBooster misaligns the columns as soon as some records lack a field

# newline='' prevents blank lines in the output; it is specific to Python 3
with open(f"{DATABASE}_{TABLE}.csv", "w", newline='') as csvfileWriter:
    writer = csv.writer(csvfileWriter)
    # Write the first row: the column (field) names
    fieldList = [
        "_id",
        "itemType",
        "field_1",
        "field_2",
        "field_3",
    ]
    writer.writerow(fieldList)

    allRecordRes = db_des_table.find()
    # Write the data rows
    for record in allRecordRes:
        print(f"record = {record}")
        recordValueLst = []
        for field in fieldList:
            # Fill missing fields with "None" so the columns stay aligned
            if field not in record:
                recordValueLst.append("None")
            else:
                recordValueLst.append(record[field])
        try:
            writer.writerow(recordValueLst)
        except Exception as e:
            print(f"write csv exception. e = {e}")
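The fill-in-missing-fields loop can also be expressed with csv.DictWriter from the standard library, which handles missing and extra keys itself. This is an alternative sketch, not the template above: write_records and the sample records are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_records(records, field_names, out):
    """Write dict records as CSV rows, one column per name in field_names."""
    writer = csv.DictWriter(
        out,
        fieldnames=field_names,
        restval="None",         # value written when a record lacks a field
        extrasaction="ignore",  # silently skip keys that are not in field_names
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


records = [
    {'_id': 'a1', 'itemType': 'book', 'field_1': 1, 'extra': 'dropped'},
    {'_id': 'a2', 'field_1': 2},
]
buffer = io.StringIO(newline='')
write_records(records, ['_id', 'itemType', 'field_1'], buffer)
print(buffer.getvalue())
```

For a real export, pass db_des_table.find() as records and a file opened with newline='' as out; the cursor yields dicts, so no conversion is needed.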
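6.2. Possible problems and solutions

6.2.1. Encoding problems when writing CSV files

Reference article: "Python UnicodeEncodeError: 'gbk' codec can't encode character" and how to fix it:
http://www.jb51.net/article/64816.htm

Key point: the encoding of the target file is the culprit behind this error. On Windows a newly opened file defaults to gbk, so the Python interpreter uses gbk to encode the text we fetched from the network. That text has already been decoded to Unicode, and any character that gbk cannot represent makes the write fail with the error above. The fix is to change the encoding of the target file.
Solution:
# The most recommended approach is to specify the encoding when opening the file:
with open(f"{DATABASE}_{TABLE}.csv", "w", newline='', encoding='utf-8') as csvfileWriter:
# On Windows the default encoding when writing a CSV file is 'gbk', while most data fetched from the web is 'utf-8', so some characters may be impossible to encode. For example: write csv exception. e = 'gbk' codec can't encode character '\xae' in position 80: illegal multibyte sequence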
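The failure and the fix can both be reproduced on any platform by naming the encodings explicitly. In this sketch the sample row is made up; the character '\xae' (the registered sign) is the one from the error message quoted above.

```python
import csv
import os
import tempfile

row = ['B01EYCLJ04', 'Brand\xae speaker']  # contains the registered sign

# gbk has no encoding for '\xae', which is what the error message reports
try:
    row[1].encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# With an explicit utf-8 encoding the same row round-trips unchanged
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(row)
    with open(path, newline='', encoding='utf-8') as f:
        read_back = next(csv.reader(f))

print(gbk_ok, read_back == row)
```

Passing encoding explicitly on both the write and the read side removes the dependence on the platform default.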
6.2.2. Blank lines when writing CSV files (every other line is blank)

Python 2.x
For a description and solution, see: https://www.cnblogs.com/China-YangGISboy/p/7339118.html
# Some research showed that the problem depends on how the file is opened: changing the mode to 'wb' makes it go away.
# In other words, in Python 2 CSV files should be read and written in binary mode.
with open('result.csv','wb') as cf:
    writer = csv.writer(cf)
    writer.writerow(['shader','file'])
    for key , value in result.items():
        writer.writerow([key,value])
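The real reason Python 2.x needs 'wb' mode:
When writing CSV in Python 2.x, the file must be opened with the 'b' flag, i.e. open('result.csv','wb'), otherwise every other line comes out blank. The reason is that csv.writer ends each row with '\r\n' (0x0D0A) by default, and a file opened in text mode on Windows translates every '\n' into '\r\n' as it writes. Each row therefore ends with '\r\r\n' (0x0D0D0A), which Windows shows as two line breaks. With the 'b' flag the file is written in binary mode and no extra 0x0D is added.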
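The doubled line ending can be reproduced in Python 3 on any platform with in-memory streams. In this sketch, newline='\r\n' on the text wrapper imitates what a default text-mode file does on Windows (translating each '\n' on write), and newline='' turns that translation off; csv_bytes is a hypothetical helper name.

```python
import csv
import io


def csv_bytes(newline):
    """Write one CSV row through a text layer and return the raw bytes produced."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding='utf-8', newline=newline)
    csv.writer(text).writerow(['shader', 'file'])
    text.flush()
    data = raw.getvalue()
    text.detach()  # keep the wrapper from closing the buffer on cleanup
    return data


translated = csv_bytes('\r\n')  # imitates the Windows text-mode default
untranslated = csv_bytes('')    # what newline='' gives
print(translated)
print(untranslated)
```

The first result carries an extra carriage return before the line feed, which is the blank row seen in the exported file; the second keeps the single '\r\n' that csv.writer emitted.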

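In addition, Python 2.x performs many implicit conversions between str and bytes, so even though CSV is a text format, writing it in binary mode works.

Python 3
In Python 3, str and bytes are clearly separated and there are no implicit conversions. CSV is a text format and the csv module does not support binary writes, so do not open the file in binary mode and do not convert the data to bytes.
For a description and solution, see: https://segmentfault.com/q/1010000006841656?_ea=1148776
# The solution is simply to set newline to an empty string
with open('result.csv', 'w', newline='') as csvfile:

Summary: the blank lines come from how the file is opened, and the fix depends on the Python version. Python 2.x requires 'wb'; Python 3.x requires 'w' together with the newline parameter.
Further reading, on converting between bytes and str in Python 3: http://www.jb51.net/article/105064.htm

---------------------
Author: Kosmoo
Source: CSDN
Original: https://blog.csdn.net/zwq912318834/article/details/77689568
Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.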
