Neo4j bulk import with neo4j-import


Bulk-importing data into Neo4j

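There are currently several main ways to insert data (adapted from the article "How to import large-scale data into Neo4j"):
Cypher CREATE statements: write one CREATE per record.
The Cypher LOAD CSV statement: convert the data to CSV and read it with LOAD CSV.
The official Java API: Batch Inserter.
The Batch Import tool, written by a community expert.
The official neo4j-import tool.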


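This post focuses on neo4j-import, the fastest of the official options. Preconditions for using it:

  • graph.db must be empty;
  • Neo4j must be stopped;
  • input must be CSV, in a fairly rigid format;
  • use case: the initial import;
  • node IDs must be unique.

Best suited for:

The initial load of a database. It cannot apply incremental updates.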
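The first precondition can be checked before the tool is run. The following is a minimal sketch, assuming the store lives in a directory such as data/databases/graph.db (the path used later in this post). The helper name import_target_is_empty is hypothetical and not part of Neo4j.

```python
from pathlib import Path


def import_target_is_empty(db_dir):
    """Return True if db_dir does not exist or contains no entries."""
    path = Path(db_dir)
    if not path.exists():
        return True
    # any() stops at the first entry, so a large store is not fully listed.
    return not any(path.iterdir())
```

If this returns False, the directory has to be cleared (or a different target chosen) before importing.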
   
   
   

  

Let's look at the official example: Use the Import tool


1 Basic Neo4j commands and parameters

1.1 Starting and stopping:

bin\neo4j start
bin\neo4j stop
bin\neo4j restart
bin\neo4j status
   
   
   

  

1.2 neo4j-admin parameters: controlling memory

Source: 10.5. Memory recommendations

1.2.1 memrec prints recommended memory settings

neo4j-admin memrec [--memory=<memory dedicated to Neo4j>] [--database=<name>]
   
   
   

  
Option       Default                              Description
--memory     The memory capacity of the machine   The amount of memory to allocate to Neo4j. Valid units are: k, K, m, M, g, G.
--database   graph.db                             The name of the database. This option will generate numbers for Lucene indexes, and for data volume and native indexes in the database. These can be used as an input into more detailed memory analysis.

Example:

$neo4j-home> bin/neo4j-admin memrec --memory=16g
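The --memory value is a number followed by one of the units listed above. Here is a small sketch of how such a value maps to bytes. It assumes binary multiples (1k = 1024 bytes), which the table above does not state. parse_memory is a hypothetical helper, not a Neo4j function.

```python
import re

# Assumed binary multipliers for the units k/K, m/M, g/G.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memory(value):
    """Convert a string such as '16g' or '512M' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG])", value.strip())
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit.lower()]
```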

1.2.2 Setting the page cache with --pagecache

The --pagecache option sets the page cache size for a single command:

bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --pagecache=4G

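That is, the page cache setting applies only while that one command runs.

1.3 neo4j-admin parameters: Dump and load databases - offline backup

Both operations require the database to be shut down. Reference: 10.7. Dump and load databases

Dump: write graph.db out to a .dump file

The database must be shut down.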

$neo4j-home> bin/neo4j-admin dump --database=graph.db --to=/backups/graph.db/2016-10-02.dump
$neo4j-home> ls /backups/graph.db
2016-10-02.dump
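The dump file above is named after the date it was taken. Below is a hedged sketch that builds the same command line for any date, using only the --database and --to options shown above. dump_command is a hypothetical helper, and the directory layout backup_root/database/date.dump is copied from the example.

```python
from datetime import date


def dump_command(database, backup_root, day):
    """Build the argv list for a dump named after the given date."""
    target = f"{backup_root}/{database}/{day.isoformat()}.dump"
    return ["bin/neo4j-admin", "dump", f"--database={database}", f"--to={target}"]


cmd = dump_command("graph.db", "/backups", date(2016, 10, 2))
```

The resulting list could be passed to subprocess.run(cmd, check=True) from the Neo4j home directory, with the database stopped.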
   
   
   

  

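Load: load the .dump file back in

It may appear to work without shutting the database down, but the example below stops Neo4j first, in line with the requirement noted above.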

$neo4j-home> bin/neo4j stop
Stopping Neo4j.. stopped
$neo4j-home> bin/neo4j-admin load --from=/backups/graph.db/2016-10-02.dump --database=graph.db --force
   
   
   

  

With --force, the load overwrites the target database if it already exists (any existing database gets overwritten).

1.4 neo4j-admin parameters: backup and restore - online backup

Reference: 6.2. Perform a backup

Online backup:

$neo4j-home> export HEAP_SIZE=2G
$neo4j-home> mkdir /mnt/backup
$neo4j-home> bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --pagecache=4G
   
   
   

  

The backup is written into the directory given by --backup-dir.

Incremental backup:

$neo4j-home> export HEAP_SIZE=2G
$neo4j-home> bin/neo4j-admin backup --from=192.168.1.34 --backup-dir=/mnt/backup --name=graph.db-backup --fallback-to-full=true --check-consistency=true --pagecache=4G
   
   
   

  

2 A simple demo

movies.csv.

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
   
   
   

  

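Here title is a property. Its values are double-quoted in this file, though quotes are only required when a value contains the delimiter. year:int is also a property, typed as an integer.
movieId:ID marks the column that holds each node's unique ID, and :LABEL gives the node's labels. Several labels are separated by the array delimiter (; by default), so Movie;Sequel puts both labels on the same node. :LABEL does not create an extra node.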
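The header convention (name:kind, with the name or the kind optional) can be made concrete with a short sketch. This is an illustration of the format described above, not the import tool's own parser. parse_header is a hypothetical name, and treating untyped fields as string follows the tool's documented default.

```python
def parse_header(header_line, delimiter=","):
    """Split an import header into (name, kind) pairs.

    kind is the text after the colon (e.g. 'ID', 'int', 'LABEL');
    a field without a colon is treated as a plain string property.
    """
    fields = []
    for field in header_line.split(delimiter):
        name, _, kind = field.partition(":")
        fields.append((name, kind or "string"))
    return fields


fields = parse_header("movieId:ID,title,year:int,:LABEL")
```

For the movies.csv header this yields one ID field, one string property, one int property, and an unnamed LABEL field.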
actors.csv.

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
   
   
   

  

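In actors.csv, :LABEL sets the label of each node. personId:ID must be unique, while :LABEL values may repeat across rows.
After loading, a label is not a separate node. It is a tag on the node, and each distinct label appears only once in the database's list of labels.

roles.csv.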

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
   
   
   

  

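Here role carries no colon marker, so it is a property, in this file a property of the relationship. Its values may be double-quoted or not. It is best to state the type explicitly, for example :int for integers or roles:string[] for an array of strings.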
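To illustrate what the type suffixes mean, here is a minimal sketch that converts a raw CSV cell according to its declared kind. It covers only int, string and their array forms, and it assumes the default array delimiter (;). convert is a hypothetical helper, not the tool's implementation.

```python
def convert(value, kind, array_delimiter=";"):
    """Convert a raw CSV cell to a Python value for the given kind."""
    if kind.endswith("[]"):
        # Array kinds: split on the array delimiter, convert each element.
        element_kind = kind[:-2]
        return [convert(v, element_kind, array_delimiter)
                for v in value.split(array_delimiter)]
    if kind == "int":
        return int(value)
    # Everything else is kept as a string in this sketch.
    return value
```

For example, convert("1999", "int") gives the integer 1999, and a cell declared as string[] is split into a list of strings.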

Run on Linux:

neo4j_home$ bin/neo4j-admin import --nodes import/movies.csv --nodes import/actors.csv --relationships import/roles.csv
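When there are many files, the command line can be assembled programmatically. This sketch uses only the --nodes and --relationships options shown above. import_command is a hypothetical helper, and the relative path bin/neo4j-admin assumes the working directory is the Neo4j home.

```python
def import_command(node_files, relationship_files):
    """Build the argv list for neo4j-admin import from lists of CSV paths."""
    cmd = ["bin/neo4j-admin", "import"]
    for path in node_files:
        cmd += ["--nodes", path]
    for path in relationship_files:
        cmd += ["--relationships", path]
    return cmd


cmd = import_command(
    ["import/movies.csv", "import/actors.csv"],
    ["import/roles.csv"],
)
```

The list can then be handed to subprocess.run(cmd, check=True) once Neo4j is stopped and the target database is empty.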
   
   
   

  

Older versions used neo4j-import for bulk import. Newer versions use neo4j-admin import.

Run on Windows:

neo4j-import.bat --into ../data/databases/graph.db --id-type string --nodes:attribute ../import/node_attribute.csv --relationships ../import/product_SecondLeaf.csv --relationships ../import/scene_isDemond.csv
   
   
   

  
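  • --into specifies the target database directory. The name can be changed between attempts.
  • --nodes:attribute: the part after nodes: is the label applied to every node in that file.
  • --id-type string: "The --id-type string is indicating that all :ID columns contain alphanumeric values (there is an optimization for numeric-only id's)." In other words, IDs may mix letters and digits rather than being purely numeric.

Finally, start Neo4j on Linux: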

./bin/neo4j start
   
   
   

  

Finally, start Neo4j on Windows:

neo4j.bat console
   
   
   

  

Interpreting errors raised during the import:

1 Error details are kept in bad.log

\data\databases\graph.db\bad.log
   
   
   

  

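An error mentioning "global id space" means a node is either undefined (referenced but missing) or defined more than once.

2 If node IDs are not unique, the import fails immediately with a "global id space" error and the rest of the import is aborted. Delete data/databases/graph.db and run the import again.


3 Other import scenarios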

Mainly from: B.2. Use the Import tool

3.1 Importing with different delimiters

If the data to import (here, a relationship file) looks like this:

:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
   
   
   

  

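then the field delimiter can be specified with --delimiter, alongside --array-delimiter for array values and --quote for the quote character.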

neo4j_home$ bin/neo4j-admin import --nodes import/movies2.csv --nodes import/actors2.csv --relationships import/roles2.csv --delimiter ";" --array-delimiter "|" --quote "'"
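Before importing, it can help to confirm that a file really parses with the intended delimiter and quote character. Below is a minimal sketch using Python's standard csv module on the sample rows above. read_rows is a hypothetical helper, and this check is independent of the import tool.

```python
import csv
import io

sample = (
    ":START_ID;role;:END_ID;:TYPE\n"
    "keanu;'Neo';tt0133093;ACTED_IN\n"
    "keanu;'Neo';tt0234215;ACTED_IN\n"
)


def read_rows(text, delimiter=";", quote="'"):
    """Parse CSV text using the given delimiter and quote character."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote)
    return list(reader)


rows = read_rows(sample)
```

Every row should come back with four fields and the quotes stripped from the role value. A row with a different field count suggests the wrong delimiter.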
   
   
   

  

3.2 Different files defining the same kind of node

movies5a.csv.

movieId:ID,title,year:int
tt0133093,"The Matrix",1999
   
   
   

  

sequels5a.csv.

movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003
   
   
   

  

actors5a.csv.

personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"
   
   
   

  

Command:

neo4j_home$ bin/neo4j-admin import --nodes:Movie import/movies5a.csv --nodes:Movie:Sequel import/sequels5a.csv --nodes:Actor import/actors5a.csv
   
   
   

  

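On the command line, --nodes:Movie gives every node in movies5a.csv the label Movie.
--nodes:Movie:Sequel gives every node in sequels5a.csv two labels, Movie and Sequel.

3.3 Defining the relationship type and relationship properties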

roles5b.csv.

:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
   
   
   

  

Command:

neo4j_home$ bin/neo4j-admin import --relationships:ACTED_IN import/roles5b.csv
   
   
   

  

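Here :ACTED_IN sets the relationship type to ACTED_IN for every row in the file, and the role column becomes a property on each relationship. A complete import would also pass the node files that define the referenced IDs with --nodes.

3.4 Splitting files to import more efficiently

Node file, header: movies4-header.csv.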

movieId:ID,title,year:int,:LABEL
   
   
   

  

Node file, data part 1: movies4-part1.csv.

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
   
   
   

  

Node file, data part 2: movies4-part2.csv.

tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
   
   
   

  

Relationship file, header: roles4-header.csv.

:START_ID,role,:END_ID,:TYPE
   
   
   

  

Relationship file, data part 1: roles4-part1.csv.

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
   
   
   

  

Relationship file, data part 2: roles4-part2.csv.

laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
   
   
   

  

Run:

neo4j_home$ bin/neo4j-admin import --nodes "import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv" --relationships "import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv"
   
   
   

  

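The header and the data are kept in separate files and passed in order (header, part 1, part 2) as one comma-separated list, so the data is imported in chunks.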
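A single large CSV can be cut into this header-plus-parts layout with a few lines of code. The sketch below splits on line boundaries, so it assumes no quoted field contains a newline. split_csv and its file-naming scheme (prefix-header.csv, prefix-partN.csv) are hypothetical and simply mirror the file names used above.

```python
from pathlib import Path


def split_csv(source, out_dir, prefix, rows_per_part):
    """Write <prefix>-header.csv plus <prefix>-partN.csv files.

    Returns the paths in the order used above: header first,
    then the data parts.
    """
    lines = Path(source).read_text(encoding="utf-8").splitlines()
    header, data = lines[0], lines[1:]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = [out / f"{prefix}-header.csv"]
    paths[0].write_text(header + "\n", encoding="utf-8")
    for start in range(0, len(data), rows_per_part):
        part = out / f"{prefix}-part{start // rows_per_part + 1}.csv"
        chunk = data[start:start + rows_per_part]
        part.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(part)
    return paths
```

Joining the returned paths with commas, for example ",".join(str(p) for p in paths), gives the value to pass after --nodes or --relationships.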

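3.5 Two node sets with overlapping IDs

This comes up often: two node sets use the same ID values (below, both movies and actors are numbered 1, 2, 3). Unless each set is given its own ID space, the import reports an error.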
movies7.csv.

movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
   
   
   

  

Here (Movie-ID) names the ID space that these IDs belong to.
actors7.csv.

personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
   
   
   

  

roles7.csv.

:START_ID(Actor-ID),role,:END_ID(Movie-ID)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3
   
   
   

  

Run:

neo4j_home$ bin/neo4j-admin import --nodes import/movies7.csv --nodes import/actors7.csv --relationships:ACTED_IN import/roles7.csv
   
   
   

  

In the relationship file, :START_ID(Actor-ID) and :END_ID(Movie-ID) state which ID space each end of the relationship refers to.
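One way to picture ID spaces is as a lookup keyed by the pair (ID space, ID) rather than by the ID alone. The sketch below is a conceptual illustration only and says nothing about how the import tool stores IDs internally. build_index is a hypothetical helper.

```python
def build_index(node_sets):
    """Map (id_space, id) -> node name for several node sets.

    node_sets maps an ID-space name to a dict of {id: name}.
    """
    index = {}
    for space, nodes in node_sets.items():
        for node_id, name in nodes.items():
            index[(space, node_id)] = name
    return index


index = build_index({
    "Movie-ID": {"1": "The Matrix", "2": "The Matrix Reloaded"},
    "Actor-ID": {"1": "Keanu Reeves", "2": "Laurence Fishburne"},
})

# A roles7.csv row such as 1,"Neo",1 resolves each end in its own space.
start = index[("Actor-ID", "1")]
end = index[("Movie-ID", "1")]
```

Because the key includes the space name, actor 1 and movie 1 never collide.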

3.6 Skipping errors: missing nodes

A bad relationship appears in the data:
roles8a.csv.

:START_ID,role,:END_ID,:TYPE
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
   
   
   

  

For example, the last row refers to a node, emil, that no node file defines.
In that case run:

neo4j_home$ bin/neo4j-admin import --nodes import/movies8a.csv --nodes import/actors8a.csv --relationships import/roles8a.csv --ignore-missing-nodes
   
   
   

  

--ignore-missing-nodes skips relationships that refer to missing nodes instead of aborting. The details are recorded in bad.log:

InputRelationship:
   source: roles8a.csv:11
   properties: [role, Emil]
   startNode: emil (global id space)
   endNode: tt0133093 (global id space)
   type: ACTED_IN
 referring to missing node emil
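Such rows can also be found before the import runs. Here is a minimal pre-check, assuming a single global ID space, comma-delimited files, and no multi-line fields. missing_endpoints is a hypothetical helper and not part of Neo4j.

```python
import csv
import io


def missing_endpoints(node_ids, relationships_text):
    """Return (line_number, missing_id) for relationship rows whose
    :START_ID or :END_ID is not in node_ids. Line 1 is the header."""
    reader = csv.DictReader(io.StringIO(relationships_text))
    problems = []
    for line_number, row in enumerate(reader, start=2):
        for column in (":START_ID", ":END_ID"):
            if row[column] not in node_ids:
                problems.append((line_number, row[column]))
    return problems
```

node_ids would be the set of all IDs collected from the node files. An empty result means every relationship has both of its nodes.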
   
   
   

  

3.7 Skipping errors: duplicate nodes

actors8b.csv.

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
   
   
   

  

The node file actors8b.csv contains a duplicate node ID: laurence.
Run:

neo4j_home$ bin/neo4j-admin import --nodes import/actors8b.csv --ignore-duplicate-nodes
   
   
   

  

--ignore-duplicate-nodes tells the tool to ignore duplicate nodes.
The duplicates are still reported in bad.log:

Id 'laurence' is defined more than once in global id space, at least at actors8b.csv:3 and actors8b.csv:5
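The same duplicates can be detected ahead of time. The sketch below counts the header as line 1, which matches the line numbers in the log line above. It assumes a comma-delimited file with no multi-line fields. duplicate_ids is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def duplicate_ids(nodes_text, id_column):
    """Return {id: [line numbers]} for IDs that occur more than once.
    Line 1 is the header."""
    seen = defaultdict(list)
    reader = csv.DictReader(io.StringIO(nodes_text))
    for line_number, row in enumerate(reader, start=2):
        seen[row[id_column]].append(line_number)
    return {node_id: lines for node_id, lines in seen.items() if len(lines) > 1}


actors = (
    'personId:ID,name,:LABEL\n'
    'keanu,"Keanu Reeves",Actor\n'
    'laurence,"Laurence Fishburne",Actor\n'
    'carrieanne,"Carrie-Anne Moss",Actor\n'
    'laurence,"Laurence Harvey",Actor\n'
)
dupes = duplicate_ids(actors, "personId:ID")
```

For the actors8b.csv content this reports laurence on lines 3 and 5.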
   
   
   

  
