DC Academy (DC學院): Databases

Date: 2019-11-13


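1. Text files versus databases

Text files: simple, and can be read and processed directly. To process one, the file has to be loaded into memory.
Databases: structured data storage that is easy to index.
Common databases: (1) SQL databases: Oracle, MS SQL Server, MySQL, SQLite; (2) NoSQL databases, common in distributed systems: MongoDB, Cassandra.

Database modes: client-server (MySQL) and file-based (SQLite). The Firefox browser, for example, uses a file-based database.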
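
To make the file-based mode concrete, here is a minimal sketch using Python's built-in sqlite3 module. The file name demo.sqlite and the notes table are made up for illustration; the point is only that the whole database lives in one ordinary file and no server process is involved.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name: a file-based database is just an ordinary file on disk.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

connection = sqlite3.connect(db_path)  # creates the file if it does not exist
connection.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
connection.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
connection.commit()
connection.close()

# Reopening the same file gives the data back; no server is running.
connection = sqlite3.connect(db_path)
rows = connection.execute("SELECT id, body FROM notes").fetchall()
connection.close()

file_exists = os.path.exists(db_path)
print(rows, file_exists)
```

A client-server database such as MySQL would instead need a host, a user, and a password to connect.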

2. Database operations with HeidiSQL

Operations covered: creating a database, importing data, querying data, inserting data, updating data, and deleting data.

Example dataset: Iris, downloaded from the UCI Machine Learning Repository.

Select, insert, update, and delete statements: select column_x from table where condition order by column_i [desc,asc], column

insert into table_name (cname1, cname2) values (v1, v2)
update table_name set column1=value1, column2=value2, ... where condition
delete from table where condition
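
The four statements can be tried without a MySQL server. The sketch below assumes an in-memory SQLite database as a stand-in for the HeidiSQL/MySQL setup, with a cut-down iris table and a few illustrative rows rather than the real dataset.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE iris (id INTEGER PRIMARY KEY, sepal_length REAL, species TEXT)"
)

# insert: add new rows
connection.executemany(
    "INSERT INTO iris (sepal_length, species) VALUES (?, ?)",
    [(5.5, "Iris-setosa"), (7.0, "Iris-versicolor"), (6.5, "Iris-virginica")],
)

# update: change a value in the rows that match the condition
connection.execute("UPDATE iris SET sepal_length = 5.0 WHERE species = 'Iris-setosa'")

# delete: remove the rows that match the condition
connection.execute("DELETE FROM iris WHERE species = 'Iris-virginica'")
connection.commit()

# select ... where ... order by
remaining = connection.execute(
    "SELECT species, sepal_length FROM iris "
    "WHERE sepal_length > 4 ORDER BY sepal_length DESC"
).fetchall()
connection.close()
print(remaining)
```

Only the two rows that survive the delete come back, largest sepal_length first.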

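Advanced operations:

(1) distinct (returns only the rows whose combination of selected columns is unique)
select distinct column1, column2 from table;
select distinct sepal_length, species from iris;
(2) Comparison operators: like, in, between
Used inside where to constrain a field or column.
where columnN like pattern -- the pattern may contain wildcards: an underscore (_) matches exactly one character, % matches zero or more characters
select * from iris where species like '%';
where column_name in (value1, value2, value3...) is equivalent to several conditions joined with or
where column_name between value1 and value2 (inclusive of both the start and end values)
select * from iris where sepal_length between 5 and 6 order by sepal_length;
select * from iris where sepal_length in (2, 3, 6);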
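
A small sketch of distinct, like, between, and in, again assuming SQLite in memory instead of MySQL and a handful of illustrative rows. The query helper is a hypothetical convenience function, not part of any library.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")
connection.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [
        (5.0, "Iris-setosa"),
        (5.0, "Iris-setosa"),      # duplicate pair, collapsed by DISTINCT
        (6.0, "Iris-versicolor"),
        (6.5, "Iris-virginica"),
    ],
)


def query(sql):
    """Hypothetical helper: run a statement and return all rows."""
    return connection.execute(sql).fetchall()


distinct_rows = query(
    "SELECT DISTINCT sepal_length, species FROM iris ORDER BY sepal_length"
)
# % matches any run of characters, so this keeps the species starting with Iris-v
like_rows = query(
    "SELECT DISTINCT species FROM iris WHERE species LIKE 'Iris-v%' ORDER BY species"
)
# BETWEEN includes both end values
between_rows = query(
    "SELECT sepal_length FROM iris WHERE sepal_length BETWEEN 5 AND 6 "
    "ORDER BY sepal_length"
)
# IN is shorthand for sepal_length = 6.0 OR sepal_length = 6.5
in_rows = query(
    "SELECT species FROM iris WHERE sepal_length IN (6.0, 6.5) ORDER BY sepal_length"
)
```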

(3) Aggregate functions: max, min, count, avg, sum
select min(column_name) from table;
select min(sepal_length), max(sepal_length), count(sepal_length), avg(sepal_length), sum(sepal_length) from iris;
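
As a quick check of what the aggregate functions return, the sketch below runs them on three illustrative values plus one NULL, using SQLite in memory as a stand-in for MySQL. Aggregates skip NULL values, which is why count(sepal_length) and count(*) differ.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE iris (sepal_length REAL)")
connection.executemany(
    "INSERT INTO iris VALUES (?)",
    [(5.0,), (6.0,), (7.0,), (None,)],  # None is stored as NULL
)

summary = connection.execute(
    "SELECT MIN(sepal_length), MAX(sepal_length), COUNT(sepal_length), "
    "COUNT(*), AVG(sepal_length), SUM(sepal_length) FROM iris"
).fetchone()
connection.close()
print(summary)
```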

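(4) group by
Splits the rows into groups by the value of a column and, combined with aggregate functions, computes the statistics separately for each group.
select
  min(sepal_length), max(sepal_length), count(sepal_length), avg(sepal_length),
  sum(sepal_length), count(species)
from iris
group by species;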
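
A per-species summary can be sketched as follows, assuming SQLite in memory and illustrative rows. Selecting the grouping column alongside the aggregates makes it clear which row belongs to which species.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")
connection.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [
        (5.0, "Iris-setosa"),
        (5.5, "Iris-setosa"),
        (6.0, "Iris-versicolor"),
        (7.0, "Iris-versicolor"),
        (6.5, "Iris-virginica"),
    ],
)

# One output row per species: count, min, max, and average sepal length
per_species = connection.execute(
    "SELECT species, COUNT(*), MIN(sepal_length), MAX(sepal_length), "
    "AVG(sepal_length) FROM iris GROUP BY species ORDER BY species"
).fetchall()
connection.close()

for row in per_species:
    print(row)
```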

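(5) Primary keys and indexes
Primary key: uniquely identifies one record in a table, e.g. a student ID or national ID number; it can also be a separately generated value.
Index: built on one or more fields of a table; it speeds up queries whose where clause has a condition on the indexed field.
The primary key is indexed by default. Conceptually, an index keeps the values sorted so that a lookup can use binary search instead of scanning every row.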
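
Whether a query actually uses an index can be inspected through the query planner. The sketch below assumes SQLite, whose EXPLAIN QUERY PLAN statement is SQLite-specific (MySQL has its own EXPLAIN); the index name idx_species and the plan helper are made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE iris (id INTEGER PRIMARY KEY, sepal_length REAL, species TEXT)"
)
connection.executemany(
    "INSERT INTO iris (sepal_length, species) VALUES (?, ?)",
    [(5.0, "Iris-setosa"), (6.0, "Iris-versicolor"), (6.5, "Iris-virginica")],
)


def plan(sql):
    """Hypothetical helper: return the planner's description of how sql would run."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # the description is the last column


lookup = "SELECT * FROM iris WHERE species = 'Iris-setosa'"

plan_without_index = plan(lookup)  # no index on species yet: the table is scanned
connection.execute("CREATE INDEX idx_species ON iris (species)")
plan_with_index = plan(lookup)     # the planner can now search via idx_species

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, so only the presence of the index name is worth relying on.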
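(6) Joining tables: join
select column_name(s)
from table1
inner join table2 on table1.column_name=table2.column_name;
inner join: the intersection of the two tables (only rows that match in both)
left join: every row of the left table, plus the matching rows from the right table
right join: every row of the right table, plus the matching rows from the left table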
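
The difference between inner join and left join shows up as soon as one row has no partner. This sketch assumes SQLite in memory and two hypothetical tables, samples and species_info; right join is left out because older SQLite versions do not support it, and it is the mirror image of a left join with the tables swapped.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, species TEXT)")
connection.execute("CREATE TABLE species_info (species TEXT, label TEXT)")
connection.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, "Iris-setosa"), (2, "Iris-virginica")],
)
# Only one species has an entry, so sample 2 has no match on the right side.
connection.execute("INSERT INTO species_info VALUES ('Iris-setosa', 'A')")

inner_rows = connection.execute(
    "SELECT samples.id, species_info.label FROM samples "
    "INNER JOIN species_info ON samples.species = species_info.species "
    "ORDER BY samples.id"
).fetchall()

left_rows = connection.execute(
    "SELECT samples.id, species_info.label FROM samples "
    "LEFT JOIN species_info ON samples.species = species_info.species "
    "ORDER BY samples.id"
).fetchall()
connection.close()

print(inner_rows)  # unmatched sample dropped
print(left_rows)   # unmatched sample kept, with NULL (None) for the missing label
```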

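3. Connecting to a database from Python

Install the package: pip install pymysql. Official documentation: https://pymysql.readthedocs.io/en/latest/user/examples.html

Steps:

1. Open a connection to the database.

2. Run the SQL statements (insert, delete, update, select), wrapping database field names in backticks. For operations that modify the database, such as inserts and deletes, call commit once after all the changes are written; only then are they actually applied to the database.

3. Close the database connection.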

import pymysql.cursors

# Connect to the database; DictCursor returns each row as a dict
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456root',
                             db='iris',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        # Read a single record; %s is a placeholder filled in by execute()
        sql = "SELECT * FROM `iris_with_id` WHERE `id` = %s"
        cursor.execute(sql, ('3',))
        result = cursor.fetchone()
        print(result)
        print(result['id'])
finally:
    # Always close the connection, even if the query fails
    connection.close()


Output (from a Python 2 run, which is why the strings carry a u prefix):

{u'Petal_width': 0.2, u'sepal_length': 4.7, u'id': 3, u'sepal_width': 3.2, u'petal_length': 1.3, u'species': u'Iris-setosa\r'}
3

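cursor.fetchone() returns only the next row of the query result (the first row when called right after execute), while cursor.fetchall() returns all the remaining rows. cursorclass=pymysql.cursors.DictCursor makes each row come back as a dictionary of column-name/value pairs.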
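
The fetchone/fetchall behaviour can be tried without a MySQL server. The sketch below is an assumption-laden stand-in: it uses the standard-library sqlite3 module instead of PyMySQL, and sqlite3.Row plays a role roughly comparable to DictCursor (rows addressable by column name). The table contents are illustrative.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row  # rows can be indexed by column name
connection.execute(
    "CREATE TABLE iris_with_id (id INTEGER PRIMARY KEY, species TEXT)"
)
connection.executemany(
    "INSERT INTO iris_with_id (species) VALUES (?)",
    [("Iris-setosa",), ("Iris-versicolor",), ("Iris-virginica",)],
)

cursor = connection.execute("SELECT id, species FROM iris_with_id ORDER BY id")
first = cursor.fetchone()      # consumes only the first row
rest = cursor.fetchall()       # returns whatever is left
after_end = cursor.fetchone()  # nothing left, so this is None

first_as_dict = dict(first)
rest_ids = [row["id"] for row in rest]
connection.close()

print(first_as_dict, rest_ids, after_end)
```

Calling fetchall() after fetchone() does not restart from the top; the cursor keeps its position in the result set.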

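4. Data cleaning with pandas and visualization with seaborn

Data cleaning has four parts:

Format conversion
- The form in which the raw data is stored is not necessarily suitable for processing in Python
- For example, the raw data stores times as strings, which need to be converted into Python data structures
Missing data
- Any record may be missing values for some of its attributes
- Strategies
  - Ignore the records that have missing data
  - Mark the value as "unknown"
  - Fill it in with the mean, the most frequent value, etc.
Abnormal data
- Values that contradict common sense
- For example, a number in the gender field, or an age above 100
Standardization
- In attributes that users type in freely, the same underlying value may be entered in different forms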
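
The missing-data strategies and the standardization step can be sketched in pandas as follows. The records are made up for illustration, and the -1 sentinel for "unknown" is an arbitrary choice, not something the course prescribes.

```python
import pandas as pd

# Hypothetical records: one missing age, and gender typed in three different ways.
records = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "gender": ["Male", "male ", "FEMALE"],
})

# Missing data, strategy 1: ignore the records with a missing value.
dropped = records.dropna(subset=["age"])

# Strategy 2: mark the value as unknown (here with a sentinel of -1).
flagged = records["age"].fillna(-1)

# Strategy 3: fill in the mean of the observed values.
filled = records["age"].fillna(records["age"].mean())

# Standardization: trim whitespace and unify case so equal values compare equal.
gender_clean = records["gender"].str.strip().str.lower()

print(len(dropped), flagged.tolist(), filled.tolist(), gender_clean.tolist())
```

Which strategy fits depends on the analysis: dropping rows loses data, while filling with the mean can hide how much was missing.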

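Data cleaning in practice: the Airbnb home-sharing dataset

Install the Python packages

Pandas: its core data structure is the DataFrame.

Seaborn: provides the visualization functions.

Example: the data source is https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data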

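# Read the data
import pandas
users = pandas.read_csv("train_users_2.csv")
# Start with a basic look at the data
users.describe()
# The first row of the file holds the column names and the index starts at 0; head(3) shows the first 3 rows, head() with no argument shows the first 5
users.head(3)
# The opposite of head: shows the last few rows of the dataset
users.tail()
users.shape  # (number of rows, number of columns)
users.loc[1:5, "age"]  # age values for the rows labelled 1 through 5 (both ends included)
# Format conversion; the format argument can be used to state the expected layout

users['date_account_created'] = pandas.to_datetime(users['date_account_created'])  # parse into a uniform datetime type
users["timestamp_first_active"] = pandas.to_datetime(users["timestamp_first_active"], format = "%Y%m%d%H%M%S")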
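
To see what the format argument does, here is a small sketch on made-up values shaped like the two date columns (they are not taken from the Airbnb file). Without format, pandas infers the layout; with format, each digit group is mapped explicitly.

```python
import pandas as pd

# Illustrative values only, in the same layout as the two columns.
created = pd.Series(["2010-06-28", "2011-05-25"])
first_active = pd.Series(["20090319043255", "20100105000000"])

# ISO-style dates are recognized without any hint.
created_parsed = pd.to_datetime(created)

# A compact digit string needs the layout spelled out:
# %Y year, %m month, %d day, %H hour, %M minute, %S second.
active_parsed = pd.to_datetime(first_active, format="%Y%m%d%H%M%S")

print(created_parsed.dt.month.tolist())
print(active_parsed[0])
```

Once parsed, the .dt accessor exposes the year, month, weekday, and other components for further analysis.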


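import seaborn
# Jupyter notebook magic: draw plots inline (only valid inside IPython/Jupyter)
%matplotlib inline
# Distribution of the raw ages, with missing values dropped
# (distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the closest newer equivalent)
seaborn.distplot(users['age'].dropna())
# Keep only plausible ages: above 10 and below 90
users_with_true_age = users[users["age"] < 90]
users_with_true_age = users_with_true_age[users_with_true_age["age"] > 10]
seaborn.distplot(users_with_true_age["age"])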
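
The two chained age filters can also be written as a single boolean mask. The sketch below uses a tiny made-up users frame, not the Airbnb data, and shows one possible variant that keeps every row but blanks out the implausible ages.

```python
import pandas as pd

# Made-up ages, including typical bad entries such as a birth year typed as an age.
users = pd.DataFrame({"age": [5.0, 25.0, 41.0, 105.0, 2014.0, None]})

# Same condition as the two chained filters: 10 < age < 90.
# Comparisons with a missing value are False, so that row is dropped as well.
is_plausible = (users["age"] > 10) & (users["age"] < 90)
plausible = users[is_plausible]

# Variant: keep all rows, but replace implausible ages with NaN.
masked = users["age"].where(is_plausible)

print(plausible["age"].tolist())
print(int(masked.notna().sum()), len(masked))
```

The masked variant is useful when the other columns of a row are still worth keeping.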
