[Interview Question] Epic: Database Deduplication

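The problem: there are two 10 GB databases, each storing a set of strings. Some of the data is duplicated between the two. Merge them into a single database with the duplicates removed.

Constraint: only 4 GB of memory is available.

Example: DB1: cmu, ucb, stanford, nyu

         DB2: ucsb, ucb, ucsd, cmu

After merging, the result should be: DB: cmu, ucb, stanford, nyu, ucsb, ucsd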

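Approach: split DB1 into 5 smaller databases: DB11, DB12, DB13, DB14, DB15.

          Split DB2 into 5 smaller databases: DB21, DB22, DB23, DB24, DB25.

Each piece is then roughly 2 GB, so one piece from each side fits within the 4 GB memory limit.

Union DB11 with each of DB21, DB22, DB23, DB24, DB25 in turn, producing DB11Merge.

Union DB12 with each of DB21, DB22, DB23, DB24, DB25 in turn, producing DB12Merge.

......

Finally, merge DB11Merge, DB12Merge, DB13Merge, DB14Merge, DB15Merge together.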
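The chunked idea above can be sketched in Python. This is only a sketch under assumptions that are not in the original post: each database is taken to be a text file with one string per line, and instead of the positional five-way split it swaps in hash partitioning, so that equal strings from either source always land in the same bucket and each bucket can be deduplicated on its own in memory. The names merge_dedupe and bucket_of, and the bucket count, are hypothetical.

```python
import os
import tempfile
import zlib


def bucket_of(s, n_buckets):
    # Stable hash: the same string always maps to the same bucket.
    return zlib.crc32(s.encode("utf-8")) % n_buckets


def merge_dedupe(path_a, path_b, out_path, n_buckets=8):
    """Merge two one-string-per-line files into out_path without duplicates.

    Assumption: n_buckets is large enough that the distinct strings of any
    single bucket fit in memory.
    """
    tmpdir = tempfile.mkdtemp()
    bucket_paths = [os.path.join(tmpdir, "bucket_%d.txt" % i)
                    for i in range(n_buckets)]

    # Pass 1: stream both inputs and route every string to its bucket file.
    bucket_files = [open(p, "w", encoding="utf-8") for p in bucket_paths]
    try:
        for path in (path_a, path_b):
            with open(path, encoding="utf-8") as src:
                for line in src:
                    s = line.rstrip("\n")
                    if s:
                        bucket_files[bucket_of(s, n_buckets)].write(s + "\n")
    finally:
        for f in bucket_files:
            f.close()

    # Pass 2: deduplicate one bucket at a time and append it to the output.
    with open(out_path, "w", encoding="utf-8") as out:
        for p in bucket_paths:
            seen = set()
            with open(p, encoding="utf-8") as f:
                for line in f:
                    s = line.rstrip("\n")
                    if s not in seen:
                        seen.add(s)
                        out.write(s + "\n")
            os.remove(p)
    os.rmdir(tmpdir)
```

Because duplicates can only occur inside a single bucket, no cross-bucket comparison is needed, and the final output is simply the concatenation of the deduplicated buckets.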

The merge can be done with a statement like the following:

mysql> insert into merge select * from persons2;

1. How do I merge two tables in Access while removing duplicates?

ref: http://stackoverflow.com/questions/7615587/how-do-i-merge-two-tables-in-access-while-removing-duplicates

A UNION query returns only distinct rows. (There is also UNION ALL, but that would include duplicate rows, so you don't want it here.)

Here are the results of an experiment in MySQL:

mysql> select * from persons2;
+-----------+
| FirstName |
+-----------+
| zelin     |
| qihao     |
+-----------+
2 rows in set (0.00 sec)

mysql> select * from persons;
+-----------+
| FirstName |
+-----------+
| yu        |
| zhixu     |
| zelin     |
+-----------+
3 rows in set (0.00 sec)

mysql> select * from persons union select * from persons2;
+-----------+
| FirstName |
+-----------+
| yu        |
| zhixu     |
| zelin     |
| qihao     |
+-----------+
4 rows in set (0.00 sec)
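The difference between UNION and UNION ALL can be reproduced without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption, since the experiment above ran on MySQL) with the same two tables; the table name merged is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons  (FirstName TEXT);
    CREATE TABLE persons2 (FirstName TEXT);
    INSERT INTO persons  VALUES ('yu'), ('zhixu'), ('zelin');
    INSERT INTO persons2 VALUES ('zelin'), ('qihao');
""")

# UNION keeps one copy of 'zelin'; UNION ALL keeps both copies.
union_rows = conn.execute(
    "SELECT FirstName FROM persons UNION SELECT FirstName FROM persons2"
).fetchall()
union_all_rows = conn.execute(
    "SELECT FirstName FROM persons UNION ALL SELECT FirstName FROM persons2"
).fetchall()

# Store the deduplicated result in a new table with INSERT ... SELECT.
conn.execute("CREATE TABLE merged (FirstName TEXT)")
conn.execute(
    "INSERT INTO merged "
    "SELECT FirstName FROM persons UNION SELECT FirstName FROM persons2"
)
merged_count = conn.execute("SELECT COUNT(*) FROM merged").fetchone()[0]

print(len(union_rows), len(union_all_rows), merged_count)  # 4 5 4
```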

2. Join

While we're at it, here are a few SQL statements commonly used for merging tables:

http://www.w3schools.com/sql/sql_join.asp

An SQL JOIN clause is used to combine rows from two or more tables, based on a common field between them.

The most common type of join is the SQL INNER JOIN (simple join). An SQL INNER JOIN returns all rows from multiple tables where the join condition is met.

Let's look at a selection from the "Orders" table:

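+---------+------------+------------+
| OrderID | CustomerID | OrderDate  |
+---------+------------+------------+
| 10308   | 2          | 1996-09-18 |
| 10309   | 37         | 1996-09-19 |
| 10310   | 77         | 1996-09-20 |
+---------+------------+------------+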

Then, have a look at a selection from the "Customers" table:

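+------------+------------------------------------+----------------+---------+
| CustomerID | CustomerName                       | ContactName    | Country |
+------------+------------------------------------+----------------+---------+
| 1          | Alfreds Futterkiste                | Maria Anders   | Germany |
| 2          | Ana Trujillo Emparedados y helados | Ana Trujillo   | Mexico  |
| 3          | Antonio Moreno Taquería            | Antonio Moreno | Mexico  |
+------------+------------------------------------+----------------+---------+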

Notice that the "CustomerID" column in the "Orders" table refers to the "CustomerID" in the "Customers" table. The relationship between the two tables above is the "CustomerID" column.

Then, if we run the following SQL statement (that contains an INNER JOIN):

Example

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;


it will produce something like this:

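+---------+------------------------------------+------------+
| OrderID | CustomerName                       | OrderDate  |
+---------+------------------------------------+------------+
| 10308   | Ana Trujillo Emparedados y helados | 9/18/1996  |
| 10365   | Antonio Moreno Taquería            | 11/27/1996 |
| 10383   | Around the Horn                    | 12/16/1996 |
| 10355   | Around the Horn                    | 11/15/1996 |
| 10278   | Berglunds snabbköp                 | 8/12/1996  |
+---------+------------------------------------+------------+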
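The result above comes from the full sample database. As a smaller, self-contained check, the sketch below loads only the three Orders rows and three Customers rows shown earlier into an in-memory SQLite database (SQLite is an assumption here, used in place of the original database) and runs the same INNER JOIN. Only order 10308 refers to a customer that is present, so only that row comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        ContactName  TEXT,
        Country      TEXT
    );
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER,
        OrderDate  TEXT
    );
    INSERT INTO Customers VALUES
        (1, 'Alfreds Futterkiste', 'Maria Anders', 'Germany'),
        (2, 'Ana Trujillo Emparedados y helados', 'Ana Trujillo', 'Mexico'),
        (3, 'Antonio Moreno Taquería', 'Antonio Moreno', 'Mexico');
    INSERT INTO Orders VALUES
        (10308, 2,  '1996-09-18'),
        (10309, 37, '1996-09-19'),
        (10310, 77, '1996-09-20');
""")

# Same query as in the example above: rows without a match are dropped.
inner_rows = conn.execute("""
    SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
    FROM Orders
    INNER JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
""").fetchall()

print(inner_rows)
# [(10308, 'Ana Trujillo Emparedados y helados', '1996-09-18')]
```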

 


Different SQL JOINs

Before we continue with examples, here are the different types of SQL JOIN you can use:

    • INNER JOIN: Returns only the rows that have a match in BOTH tables
    • LEFT JOIN: Returns all rows from the left table, and the matched rows from the right table (NULL where there is no match)
    • RIGHT JOIN: Returns all rows from the right table, and the matched rows from the left table (NULL where there is no match)
    • FULL JOIN: Returns all rows from BOTH tables, matched where possible and filled with NULL where there is no match
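To make the difference between the join types concrete, here is a small sketch of a LEFT JOIN, again using an in-memory SQLite database as a stand-in and a trimmed-down version of the Orders and Customers samples (only the columns needed for the join are kept, which is a simplification). Orders whose customer is missing are kept, with NULL (None in Python) in the customer column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    INSERT INTO Customers VALUES
        (1, 'Alfreds Futterkiste'),
        (2, 'Ana Trujillo Emparedados y helados'),
        (3, 'Antonio Moreno Taquería');
    INSERT INTO Orders VALUES (10308, 2), (10309, 37), (10310, 77);
""")

# LEFT JOIN: every order is kept; unmatched orders get NULL for the name.
left_rows = conn.execute("""
    SELECT Orders.OrderID, Customers.CustomerName
    FROM Orders
    LEFT JOIN Customers ON Orders.CustomerID = Customers.CustomerID
    ORDER BY Orders.OrderID
""").fetchall()

# Swapping the table order gives the view a RIGHT JOIN would give:
# every customer is kept; customers without orders get NULL for the order.
swapped_rows = conn.execute("""
    SELECT Customers.CustomerID, Orders.OrderID
    FROM Customers
    LEFT JOIN Orders ON Orders.CustomerID = Customers.CustomerID
    ORDER BY Customers.CustomerID
""").fetchall()

print(left_rows)
print(swapped_rows)
```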

3. Full Join

MySQL has no FULL JOIN statement, so we need to emulate it with UNION:

mysql> SELECT * FROM persons LEFT JOIN persons2 ON persons.firstName=persons2.firstName UNION SELECT * FROM persons RIGHT JOIN persons2 ON persons.firstName=persons2.firstName;
+-----------+-----------+
| FirstName | FirstName |
+-----------+-----------+
| zelin     | zelin     |
| yu        | NULL      |
| zhixu     | NULL      |
| NULL      | qihao     |
+-----------+-----------+
4 rows in set (0.00 sec)
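The same emulation can be tried in a self-contained script. The sketch below uses an in-memory SQLite database rather than MySQL (an assumption), and because older SQLite versions lack RIGHT JOIN, the right-hand half is written as a LEFT JOIN with the two tables swapped, which selects the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons  (FirstName TEXT);
    CREATE TABLE persons2 (FirstName TEXT);
    INSERT INTO persons  VALUES ('yu'), ('zhixu'), ('zelin');
    INSERT INTO persons2 VALUES ('zelin'), ('qihao');
""")

# Left half: all rows of persons. Right half: all rows of persons2.
# UNION removes the matched row ('zelin', 'zelin') that both halves produce.
full_rows = conn.execute("""
    SELECT p.FirstName, q.FirstName
    FROM persons AS p
    LEFT JOIN persons2 AS q ON p.FirstName = q.FirstName
    UNION
    SELECT p.FirstName, q.FirstName
    FROM persons2 AS q
    LEFT JOIN persons AS p ON p.FirstName = q.FirstName
""").fetchall()

for row in full_rows:
    print(row)
```

Note that UNION also collapses rows that are legitimately identical, so this emulation matches a true FULL JOIN only when the joined rows are distinct, as they are here.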

4.  REPLACE Syntax

The REPLACE statement can also be used to remove duplicates, provided that the column we want to deduplicate on is set as the primary key.

REPLACE [LOW_PRIORITY | DELAYED]
    [INTO] tbl_name [(col_name,...)]
    {VALUES | VALUE} ({expr | DEFAULT},...),(...),...

Or:

REPLACE [LOW_PRIORITY | DELAYED]
    [INTO] tbl_name
    SET col_name={expr | DEFAULT}, ...

Or:

REPLACE [LOW_PRIORITY | DELAYED]
    [INTO] tbl_name [(col_name,...)]
    SELECT ...

REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 13.2.5, "INSERT Syntax".
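A minimal sketch of deduplicating with REPLACE follows. It runs on an in-memory SQLite database instead of MySQL (an assumption; SQLite accepts REPLACE INTO as a synonym for INSERT OR REPLACE), and the table name merged is hypothetical. Since FirstName is the primary key, loading both source tables leaves a single row per name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons  (FirstName TEXT);
    CREATE TABLE persons2 (FirstName TEXT);
    -- The primary key is what makes REPLACE overwrite instead of duplicate.
    CREATE TABLE merged   (FirstName TEXT PRIMARY KEY);

    INSERT INTO persons  VALUES ('yu'), ('zhixu'), ('zelin');
    INSERT INTO persons2 VALUES ('zelin'), ('qihao');

    REPLACE INTO merged (FirstName) SELECT FirstName FROM persons;
    -- 'zelin' already exists: the old row is deleted, the new one inserted.
    REPLACE INTO merged (FirstName) SELECT FirstName FROM persons2;
""")

merged_names = [row[0] for row in conn.execute(
    "SELECT FirstName FROM merged ORDER BY FirstName"
)]
print(merged_names)  # ['qihao', 'yu', 'zelin', 'zhixu']
```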
