Recently at work I ran into a problem: how to import a large amount of data (100MB+) into a remote MySQL server.
Attempt 1:
Use Statement with executeBatch, importing 1000 records per batch. Measured at 12 s per 1000 rows, which is slow.
For 1M inserts that means over four hours, and network conditions and database load can push the latency to 85 s per 1000 rows or worse.
A poor result.
Attempt 2:
Use PreparedStatement, which requires specifying the INSERT "template" up front.
Measured this way, the insert rate was a few dozen rows per second.
However, after setting rewriteBatchedStatements to true, all 780,000 rows were imported in under a minute. This is an approach that actually works.
Code:
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PreparedStatementTestMain {
    private static PreparedStatement ps;

    public static void main(String[] args) {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://remote-host/test?user=xxx&password=xxx");
            String sql = "insert into test values(?,?,?,?,?,?,?,?,?,?,?)";
            ps = conn.prepareStatement(sql);

            BufferedReader in = new BufferedReader(new FileReader("xxxx"));
            String line;
            int count = 0;
            while ((line = in.readLine()) != null) {
                count += 1;
                String[] values = line.split("\t", -1);
                // bind each tab-separated field as a string parameter
                for (int i = 1; i < values.length; i++) {
                    ps.setString(i, values[i]);
                }
                ps.addBatch();
                System.out.println("Line " + count);
            }
            ps.executeBatch();
            ps.close();
            in.close();
            conn.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
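Note that the URL in the code above does not yet enable the flag. A minimal sketch of the one change needed (host, database, and credentials are placeholders, as in the original post):

```java
// Connection URL from the code above, with the Connector/J property added.
public class BatchUrlExample {
    static final String URL =
            "jdbc:mysql://remote-host/test"
            + "?user=xxx&password=xxx"
            + "&rewriteBatchedStatements=true"; // lets the driver rewrite the batch into multi-row INSERTs

    public static void main(String[] args) {
        System.out.println(URL);
    }
}
```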
Attempt 3:
Use the mysqlimport tool. In testing, its speed was close to that of attempt 2 with rewriteBatchedStatements enabled. The prerequisite is that the data must already be saved as a file.
Another idea:
Multi-threaded inserts.
Test:
```java
import java.sql.*;
import java.util.Properties;
import java.util.concurrent.*;

public class TestMultiThreadInsert {
    private static final String dbClassName = "com.mysql.jdbc.Driver";
    private static final String CONNECTION = "jdbc:mysql://host/";
    private static final String USER = "x";
    private static final String PASSWORD = "xxx";
    private static final int THREAD_NUM = 10;

    private static void executeSQL(Connection conn, String sql) throws SQLException {
        Statement stmt = conn.createStatement();
        stmt.execute(sql);
        stmt.close();
    }

    private static void ResetEnvironment() throws SQLException {
        Properties p = new Properties();
        p.put("user", USER);
        p.put("password", PASSWORD);
        Connection conn = DriverManager.getConnection(CONNECTION, p);
        for (String query : new String[]{
                "USE test",
                "CREATE TABLE IF NOT EXISTS MTI (ID INT AUTO_INCREMENT PRIMARY KEY, MASSAGE VARCHAR(9) NOT NULL)",
                "TRUNCATE TABLE MTI"}) {
            executeSQL(conn, query);
        }
        conn.close();
    }

    private static void worker() {
        Properties properties = new Properties();
        properties.put("user", USER);
        properties.put("password", PASSWORD);
        try {
            // each worker holds its own connection; JDBC connections are not thread-safe
            Connection conn = DriverManager.getConnection(CONNECTION, properties);
            executeSQL(conn, "USE test");
            int inserted = 0;
            while (!Thread.interrupted()) {
                executeSQL(conn, "INSERT INTO MTI VALUES (NULL,'hello')");
                inserted++;
                System.out.println("Inserting " + inserted + " finished.");
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException {
        Class.forName(dbClassName);
        ResetEnvironment();
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_NUM);
        for (int i = 0; i < THREAD_NUM; i++) {
            executor.submit(new Runnable() {
                public void run() {
                    worker();
                }
            });
        }
        Thread.sleep(20000);
        executor.shutdownNow();
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            System.err.println("Pool did not terminate");
        }
    }
}
```
With 20 threads each inserting one row at a time: 2,923 rows in 20 seconds;
with 10 threads: 1,699 rows in 20 seconds;
with 1 thread: 330 rows in 20 seconds.
Prediction: combining multiple threads with PreparedStatement should raise the insert rate further.
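The combination predicted above can be sketched as follows. This is only an outline under assumed names (the URL, table layout, and row counts are placeholders, not from the original tests); the partitioning helper is runnable as-is, while the per-thread JDBC work is shown in comments so the sketch does not require a live server:

```java
import java.util.concurrent.*;

// Sketch: a thread pool where each worker would own one connection and
// flush a PreparedStatement batch every BATCH rows.
public class MultiThreadBatchInsert {
    static final int THREADS = 10;
    static final int BATCH = 1000;

    // Split `total` rows into contiguous [start, end) ranges, one per thread.
    static int[][] partition(int total, int threads) {
        int[][] ranges = new int[threads][2];
        int chunk = (total + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            ranges[t][0] = Math.min(t * chunk, total);
            ranges[t][1] = Math.min((t + 1) * chunk, total);
        }
        return ranges;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int[] range : partition(1_000_000, THREADS)) {
            pool.submit(() -> {
                System.out.println("would insert rows " + range[0] + ".." + range[1]);
                // Each worker would open its own connection and batch its range:
                //
                // try (Connection conn = DriverManager.getConnection(URL);
                //      PreparedStatement ps = conn.prepareStatement(
                //              "insert into test values(?,?)")) {
                //     for (int i = range[0]; i < range[1]; i++) {
                //         ps.setInt(1, i);
                //         ps.setString(2, "row" + i);
                //         ps.addBatch();
                //         if ((i - range[0] + 1) % BATCH == 0) ps.executeBatch();
                //     }
                //     ps.executeBatch(); // flush the tail of the batch
                // } catch (SQLException e) { e.printStackTrace(); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```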
But multi-threaded inserting unavoidably raises one question: write locks.
The program above does demonstrate that multi-threaded inserting works, but what is the logic behind it? It is worth unpacking.
In the code above, the multiple threads correspond to multiple connections (see https://dev.mysql.com/doc/refman/5.5/en/connection-threads.html). Multi-threading mainly speeds up statement submission; it does not create multiple execution threads inside the server. How the inserts are actually executed depends on how InnoDB (the storage engine in use here) handles them.
To get to the bottom of this, a search turned up these possibly related keywords:
1. io_threads (https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-multiple_io_threads.html),
2. locks,
3. insert buffer (https://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-performance-change_buffering.html).
On the insert buffer, the key is to understand this:
InnoDB uses the insert buffer to "trick" the database: for non-unique secondary indexes, modifications do not update the index leaf pages immediately. Instead, multiple updates to the same page are buffered and merged into a single update, turning random I/O into sequential I/O. This avoids the performance cost of random I/O and improves the database's write performance.
To understand that sentence, you first need to know what data structure InnoDB uses to store data.
A B+Tree!
There are plenty of explanations of B+Trees online, so I won't repeat them here.
As for io_threads: InnoDB uses multiple threads to handle reads and writes of data pages.
One question still hasn't been answered: locks!
(Official glossary) locking
The system of protecting a transaction from seeing or changing data that is being queried or changed by other transactions. The locking strategy must balance reliability and consistency of database operations (the principles of the ACID philosophy) against the performance needed for good concurrency. Fine-tuning the locking strategy often involves choosing an isolation level and ensuring all your database operations are safe and reliable for that isolation level.
To improve read performance, InnoDB implements its own read-write lock. Its design rules are:
1. At any moment, multiple threads may read a shared variable concurrently.
2. At any moment, only one thread may modify a shared variable.
3. While any thread is reading a variable, no thread may write it.
4. While a thread is modifying a variable, no other thread may read or write it (the holding thread itself may recursively re-acquire the lock).
5. If an rw_lock is held in read mode and a writer is already waiting, any new read request queues behind the pending write.
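As an analogy, Java's ReentrantReadWriteLock enforces essentially the same rules; the demo below illustrates the locking rules only, not InnoDB's actual latch implementation:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Analogy: concurrent readers (rule 1), exclusive writers (rules 2-4).
public class RwLockDemo {
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private static int counter = 0;

    static void write() {
        lock.writeLock().lock();   // exclusive: blocks all readers and other writers
        try { counter++; }
        finally { lock.writeLock().unlock(); }
    }

    static int read() {
        lock.readLock().lock();    // shared: many threads may hold this at once
        try { return counter; }
        finally { lock.readLock().unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) write(); });
            writers[i].start();
        }
        for (Thread t : writers) t.join();
        System.out.println(read()); // 4000: the write lock keeps increments from interleaving
    }
}
```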
Given these locks, how can multi-threaded writes still improve throughput?
One angle: make acquiring and releasing the mutual-exclusion locks cheaper!
How?
See http://www.2cto.com/database/201411/352586.html
and https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-latching.html:
On many platforms, atomic operations can often be used to synchronize the actions of multiple threads more efficiently than Pthreads. Each operation to acquire or release a lock can be done in fewer CPU instructions, wasting less time when threads contend for access to shared data structures. This in turn means greater scalability on multi-core platforms. On platforms where the GCC, Windows, or Solaris functions for atomic memory access are not available, InnoDB uses the traditional Pthreads method of implementing mutexes and read/write locks.
mutex
Informal abbreviation for "mutex variable". (Mutex itself is short for "mutual exclusion".) The low-level object that InnoDB uses to represent and enforce exclusive-access locks to internal in-memory data structures. Once the lock is acquired, any other process, thread, and so on is prevented from acquiring the same lock. Contrast with rw-locks, which InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.
rw-lock
The low-level object that InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures following certain rules. Contrast with mutexes, which InnoDB uses to represent and enforce exclusive access to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.
rw-lock types include s-locks (shared locks), x-locks (exclusive locks), and sx-locks (shared-exclusive locks).
An s-lock provides read access to a common resource.
An x-lock provides write access to a common resource while not permitting inconsistent reads by other threads.
An sx-lock provides write access to a common resource while permitting inconsistent reads by other threads. sx-locks were introduced in MySQL 5.7 to optimize concurrency and improve scalability for read-write workloads.
The following matrix summarizes rw-lock type compatibility.
|        | S          | SX         | X        |
|--------|------------|------------|----------|
| **S**  | Compatible | Compatible | Conflict |
| **SX** | Compatible | Conflict   | Conflict |
| **X**  | Conflict   | Conflict   | Conflict |
A follow-up:
Why does rewriteBatchedStatements improve speed this much?
One explanation: it lets the driver pack multiple INSERT statements into a single packet sent to the MySQL server, which greatly reduces network overhead.
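Conceptually, the driver rewrites a batch of single-row INSERTs into one multi-row INSERT, so the whole batch needs far fewer round trips. The helper below is an illustration of that idea, not Connector/J's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates the rewrite performed under rewriteBatchedStatements=true.
public class RewriteIllustration {
    static String rewrite(String prefix, List<String> rowValues) {
        // N batched statements collapse into one statement on the wire.
        return prefix + " " + String.join(",", rowValues);
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 1; i <= 3; i++) rows.add("(" + i + ",'msg" + i + "')");
        System.out.println(rewrite("INSERT INTO MTI VALUES", rows));
        // INSERT INTO MTI VALUES (1,'msg1'),(2,'msg2'),(3,'msg3')
    }
}
```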
Another explanation:
Rewriting Batches
「rewriteBatchedStatements=true」
Affects (Prepared)Statement.add/executeBatch()
Core concept - remove latency
Special treatment for prepared INSERT statements
——Mark Matthews - Sun Microsystems
PreparedStatement vs. Statement
The database precompiles the SQL statement (if the JDBC driver supports it); once prepared, the compiled statement can be reused by subsequent executions.