Exploring Changes in A-Share Stock Correlations with Graph Machine Learning

Date: 2020-09-26


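In the earlier posts of this series [1,2], we showed how to use the Python graph analysis library NetworkX [3] together with Nebula Graph [4] to analyze the character relationship graph of Game of Thrones.

In this post, we show how to use the Java graph analysis library JGraphT [5], together with the drawing library mxGraph [6], to visually explore how the correlations between A-share stocks in different industries change over time.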

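Processing the dataset

The analysis method in this post mainly follows [7,8]. There are two datasets:

Stock data (vertex set)

We selected 160 stocks from the A-share market in order of stock code (excluding delisted and ST stocks). Each stock is modeled as a vertex with three properties: the stock code, the stock name, and the sector to which the China Securities Regulatory Commission (CSRC) assigns the listed company.

Table 1: Sample of the vertex set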

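Vertex ID	Stock code	Stock name	Sector
1	SZ0001	Ping An Bank	Finance
2	600000	SPD Bank	Finance
3	600004	Baiyun Airport	Transportation
4	600006	Dongfeng Motor	Automobile manufacturing
5	600007	China World Trade Center	Development zone
6	600008	Beijing Capital	Environmental protection
7	600009	Shanghai Airport	Transportation
8	600010	Baotou Steel	Steel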

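Stock relationships (edge set)

An edge has a single property: its weight. The weight represents how similar the businesses of the two listed companies at the edge's source and target vertices are. The similarity is computed following [7,8]: over a period of time (2014-01-01 to 2020-01-01), take the correlation $P_{ij}$ between the daily-return time series of stocks i and j, then define the distance between the two stocks (that is, the weight of the edge between the two vertices) as: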

$$l_{ij} = \sqrt{2(1-P_{ij})}$$

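Because $P_{ij}$ lies in [-1, 1], the distance defined this way takes values in [0, 2]. The larger the distance between two stocks, the lower the correlation between their returns.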
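The mapping from prices to edge weights can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, daily returns are taken as simple returns (the post does not say whether simple or log returns were used), and the price arrays are made-up sample values.

```java
import java.util.Arrays;

/** Sketch: Pearson correlation of two daily-return series and the derived distance. */
final class CorrelationDistance {

    /** Simple daily returns r_t = p_t / p_{t-1} - 1 (an assumption; log returns are also common). */
    static double[] dailyReturns(double[] prices) {
        double[] r = new double[prices.length - 1];
        for (int t = 1; t < prices.length; t++) {
            r[t - 1] = prices[t] / prices[t - 1] - 1.0;
        }
        return r;
    }

    /** Pearson correlation of two equally long series; yields NaN if either series is constant. */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Distance l_ij = sqrt(2 * (1 - P_ij)); the max() guards against tiny negative rounding errors. */
    static double distance(double p) {
        return Math.sqrt(Math.max(0.0, 2.0 * (1.0 - p)));
    }

    public static void main(String[] args) {
        // Made-up closing prices for two stocks over five trading days
        double[] a = dailyReturns(new double[] {10.0, 10.5, 10.2, 10.8, 11.0});
        double[] b = dailyReturns(new double[] {20.0, 20.2, 20.9, 20.5, 21.3});
        double p = correlation(a, b);
        System.out.println("P_ij = " + p + ", l_ij = " + distance(p));
    }
}
```

With this definition, perfectly correlated stocks are at distance 0, uncorrelated stocks at about 1.414, and perfectly anti-correlated stocks at distance 2.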

Table 2: Sample of the edge set

Source vertex ID	Target vertex ID	Edge weight
11	12	0.493257968
22	83	0.517027513
23	78	0.606206233
2	12	0.653692415
1	11	0.677631482
1	27	0.695705171
1	12	0.71124344
2	11	0.73581915
8	18	0.771556458
12	27	0.785046446
9	20	0.789606527
11	27	0.796009627
25	63	0.797218349
25	72	0.799230001
63	115	0.803534952

Together, the vertex set and the edge set form a graph, which can be stored in the graph database Nebula Graph.

JGraphT

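JGraphT is an open-source Java library. It provides efficient, general-purpose graph data structures as well as many useful algorithms for the most common graph problems:

Supports directed, undirected, weighted, and unweighted edges;
Supports simple graphs, multigraphs, and pseudographs;
Provides dedicated iterators for graph traversal (DFS, BFS, and others);
Provides a large number of common graph algorithms, such as path finding, isomorphism detection, coloring, common ancestors, tours, connectivity, matching, cycle detection, partitioning, cuts, flows, and centrality;
Imports from and exports to GraphViz [9] easily. The exported GraphViz files can be loaded into the visualization tool Gephi [10] for analysis and display;
Works easily with other drawing components and tools, such as JGraphX, mxGraph, and Guava Graphs Generators, to render the graph.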

Let's try it out. First, create a directed graph in JGraphT:

import org.jgrapht.*;
import org.jgrapht.graph.*;
import org.jgrapht.nio.*;
import org.jgrapht.nio.dot.*;
import org.jgrapht.traverse.*;

import java.io.*;
import java.net.*;
import java.util.*;

Graph<URI, DefaultEdge> g = new DefaultDirectedGraph<>(DefaultEdge.class);

Add the vertices:

URI google = new URI("http://www.google.com");
URI wikipedia = new URI("http://www.wikipedia.org");
URI jgrapht = new URI("http://www.jgrapht.org");

// add the vertices
g.addVertex(google);
g.addVertex(wikipedia);
g.addVertex(jgrapht);

Add the edges:

// add edges to create linking structure
g.addEdge(jgrapht, wikipedia);
g.addEdge(google, jgrapht);
g.addEdge(google, wikipedia);
g.addEdge(wikipedia, google);

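The graph database Nebula Graph

JGraphT typically uses local files as its data source. That is fine for studying static networks, but when the graph changes frequently (stock data, for example, changes every day), generating a brand-new static file and reloading it for every analysis becomes cumbersome. It is better to persist the whole evolution in a database and to load subgraphs or the full graph directly from that database for analysis. This post uses Nebula Graph as the graph database that stores the graph data.

Nebula-Java [11], the Java client of Nebula Graph, offers two ways to access Nebula Graph. One is to interact with the query engine layer [13] through the graph query language nGQL [12], which usually suits subgraph access with complex semantics. The other is to interact directly with the underlying storage layer (storaged) [14] through an API, which is used to fetch all vertices and edges. Besides accessing Nebula Graph itself, Nebula-Java also provides examples of interacting with Neo4j [15], JanusGraph [16], Spark [17], and others.

In this post, we access the storage layer (storaged) directly to fetch all vertices and edges. The following two interfaces read all vertex and edge data: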

// space is the name of the graph space to scan; returnCols lists the
// vertex/edge types to read and the properties to return for each of them.
// returnCols format: {tag1Name: prop1, prop2, tag2Name: prop3, prop4, prop5}
Iterator<ScanVertexResponse> scanVertex(
        String space, Map<String, List<String>> returnCols);

Iterator<ScanEdgeResponse> scanEdge(
        String space, Map<String, List<String>> returnCols);
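As a small illustration of the returnCols format described in the comment above, the maps could be built like this. The tag name "stock", the edge type name "relation", and all property names are hypothetical placeholders; the post does not state the actual schema, so substitute the names defined in your own graph space.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: building returnCols maps (tag or edge type name -> properties to read). */
final class ReturnColsExample {

    /** Columns to read for vertices; "stock" and its properties are hypothetical names. */
    static Map<String, List<String>> vertexCols() {
        Map<String, List<String>> returnCols = new HashMap<>();
        returnCols.put("stock", Arrays.asList("code", "name", "industry"));
        return returnCols;
    }

    /** Columns to read for edges; "relation" and "weight" are hypothetical names. */
    static Map<String, List<String>> edgeCols() {
        Map<String, List<String>> returnCols = new HashMap<>();
        returnCols.put("relation", Collections.singletonList("weight"));
        return returnCols;
    }

    public static void main(String[] args) {
        System.out.println(vertexCols());
        System.out.println(edgeCols());
    }
}
```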

Step 1: Initialize a client and a ScanVertexProcessor. The ScanVertexProcessor decodes the vertex data that is read:

MetaClientImpl metaClientImpl = new MetaClientImpl(metaHost, metaPort);
metaClientImpl.connect();
StorageClient storageClient = new StorageClientImpl(metaClientImpl);
Processor processor = new ScanVertexProcessor(metaClientImpl);

Step 2: Call the scanVertex interface, which returns an iterator over ScanVertexResponse objects:

Iterator<ScanVertexResponse> iterator =
        storageClient.scanVertex(spaceName, returnCols);

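Step 3: Keep reading the data in the ScanVertexResponse objects that the iterator yields until everything has been read. The vertex data is stored for now and added to the JGraphT graph structure later:

while (iterator.hasNext()) {
    ScanVertexResponse response = iterator.next();
    if (response == null) {
        log.error("Error occurs while scan vertex");
        break;
    }
    // Decode the raw response and collect the rows of the requested tag
    Result result = processor.process(spaceName, response);
    results.addAll(result.getRows(TAGNAME));
}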

Reading the edge data follows a similar flow.

Graph analysis in JGraphT

Step 1: Create an undirected weighted graph named graph in JGraphT:

Graph<String, MyEdge> graph = GraphTypeBuilder
        .undirected()
        .weighted(true)
        .allowingMultipleEdges(true)
        .allowingSelfLoops(false)
        .vertexSupplier(SupplierUtil.createStringSupplier())
        .edgeSupplier(SupplierUtil.createSupplier(MyEdge.class))
        .buildGraph();

Step 2: Add the vertex and edge data read from the Nebula Graph space in the previous section to graph:

for (VertexDomain vertex : vertexDomainList) {
    graph.addVertex(vertex.getVid().toString());
    stockIdToName.put(vertex.getVid().toString(), vertex);
}

for (EdgeDomain edgeDomain : edgeDomainList) {
    // addEdge returns the newly created edge, so the weight is set on exactly
    // that edge even though the graph allows multiple edges between two vertices
    MyEdge newEdge = graph.addEdge(edgeDomain.getSrcid().toString(), edgeDomain.getDstid().toString());
    graph.setEdgeWeight(newEdge, edgeDomain.getWeight());
}

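Step 3: Following the analysis in [7,8], apply Prim's minimum spanning tree algorithm to graph, and call the drawGraph wrapper to draw the result:

Prim's algorithm is a graph-theory algorithm that finds a minimum spanning tree in a weighted connected graph. In other words, the subset of edges it finds forms a tree that contains every vertex of the connected graph and whose total edge weight is minimal.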

// Type parameters added so that getEdges() returns a Set<MyEdge>
SpanningTreeAlgorithm.SpanningTree<MyEdge> pMST = new PrimMinimumSpanningTree<>(graph).getSpanningTree();
Legend.drawGraph(pMST.getEdges(), filename, stockIdToName);
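To make the idea behind Prim's algorithm concrete, here is a minimal plain-Java sketch. It is not JGraphT's implementation and is not what the post's code runs; the class and method names are hypothetical. The sample input uses the six Table 2 edges among stocks 1, 11, 12, and 27, renumbered 0 to 3 for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of Prim's algorithm on a small undirected weighted graph. */
final class PrimSketch {

    /** Undirected weighted edge between vertices u and v. */
    static final class Edge {
        final int u, v;
        final double w;
        Edge(int u, int v, double w) { this.u = u; this.v = v; this.w = w; }
    }

    /** Returns the edges of a minimum spanning tree of a connected graph with vertices 0..n-1. */
    static List<Edge> minimumSpanningTree(int n, List<Edge> edges) {
        List<Edge> tree = new ArrayList<>();
        if (n == 0) return tree;

        // Adjacency lists: every edge is reachable from both of its endpoints
        List<List<Edge>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (Edge e : edges) { adj.get(e.u).add(e); adj.get(e.v).add(e); }

        boolean[] inTree = new boolean[n];
        Comparator<Edge> byWeight = Comparator.comparingDouble(e -> e.w);
        PriorityQueue<Edge> frontier = new PriorityQueue<>(byWeight);

        // Grow the tree from vertex 0, always taking the cheapest edge that leaves it
        inTree[0] = true;
        frontier.addAll(adj.get(0));
        while (!frontier.isEmpty() && tree.size() < n - 1) {
            Edge e = frontier.poll();
            if (inTree[e.u] && inTree[e.v]) continue; // both ends already in the tree: would form a cycle
            int next = inTree[e.u] ? e.v : e.u;
            inTree[next] = true;
            tree.add(e);
            frontier.addAll(adj.get(next));
        }
        return tree;
    }

    /** Table 2 edges among stocks 1, 11, 12, 27, renumbered as 0, 1, 2, 3. */
    static List<Edge> sampleEdges() {
        return Arrays.asList(
                new Edge(1, 2, 0.493257968), new Edge(0, 1, 0.677631482),
                new Edge(0, 3, 0.695705171), new Edge(0, 2, 0.71124344),
                new Edge(2, 3, 0.785046446), new Edge(1, 3, 0.796009627));
    }

    public static void main(String[] args) {
        for (Edge e : minimumSpanningTree(4, sampleEdges())) {
            System.out.println(e.u + " - " + e.v + " : " + e.w);
        }
    }
}
```

Of the six candidate edges, only three survive in the tree, which is why the minimum spanning tree works as a filter that keeps the strongest correlations while still connecting every stock.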

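Step 4: The drawGraph method wraps the layout and the other drawing parameters. It renders stocks of the same sector in the same color and places stocks that are close in distance near each other.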

public class Legend {
    ...
    public static void drawGraph(Set<MyEdge> edges, String filename,
            Map<String, VertexDomain> idVertexMap) throws IOException {
        // Create the mxGraph instance that holds the drawing
        mxGraph graph = new mxGraph();
        Object parent = graph.getDefaultParent();

        // Batch all model changes between beginUpdate() and endUpdate()
        graph.getModel().beginUpdate();
        try {
            // Set the style
            mxStylesheet myStylesheet = graph.getStylesheet();
            graph.setStylesheet(setMsStylesheet(myStylesheet));

            Map<String, Object> idMap = new HashMap<>();          // vertex id -> mxGraph cell
            Map<String, String> industryColor = new HashMap<>();  // sector -> fill color
            int colorIndex = 0;

            for (MyEdge edge : edges) {
                Object src, dst;

                // Insert the source vertex the first time it is seen
                if (!idMap.containsKey(edge.getSrc())) {
                    VertexDomain srcNode = idVertexMap.get(edge.getSrc());
                    String nodeColor;
                    if (industryColor.containsKey(srcNode.getIndustry())) {
                        nodeColor = industryColor.get(srcNode.getIndustry());
                    } else {
                        nodeColor = COLOR_LIST[colorIndex++];
                        industryColor.put(srcNode.getIndustry(), nodeColor);
                    }
                    src = graph.insertVertex(parent, null, srcNode.getName(), 0, 0, 105, 50, "fillColor=" + nodeColor);
                    idMap.put(edge.getSrc(), src);
                } else {
                    src = idMap.get(edge.getSrc());
                }

                // Insert the target vertex the first time it is seen
                if (!idMap.containsKey(edge.getDst())) {
                    VertexDomain dstNode = idVertexMap.get(edge.getDst());
                    String nodeColor;
                    if (industryColor.containsKey(dstNode.getIndustry())) {
                        nodeColor = industryColor.get(dstNode.getIndustry());
                    } else {
                        nodeColor = COLOR_LIST[colorIndex++];
                        industryColor.put(dstNode.getIndustry(), nodeColor);
                    }
                    dst = graph.insertVertex(parent, null, dstNode.getName(), 0, 0, 105, 50, "fillColor=" + nodeColor);
                    idMap.put(edge.getDst(), dst);
                } else {
                    dst = idMap.get(edge.getDst());
                }

                graph.insertEdge(parent, null, "", src, dst);
            }

            log.info("vertice " + idMap.size());
            log.info("colorsize " + industryColor.size());

            // Lay out the vertices with the fast organic layout
            mxFastOrganicLayout layout = new mxFastOrganicLayout(graph);
            layout.setMaxIterations(2000);
            //layout.setMinDistanceLimit(10D);
            layout.execute(parent);
        } finally {
            graph.getModel().endUpdate();
        }

        // Create an image that can be saved using ImageIO
        BufferedImage image = createBufferedImage(graph, null, 1, Color.WHITE,
                true, null);

        // Save as JPEG
        File file = new File(filename);
        ImageIO.write(image, "JPEG", file);
    }
    ...
}
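In drawGraph, the sector-to-color lookup is written out twice, once for the source vertex and once for the target vertex. One possible way to factor it out is sketched below in plain Java. The class name and the palette values are hypothetical (the post does not show COLOR_LIST), and the modulo wrap-around is an added assumption so that the lookup cannot run past the end of the palette when there are more sectors than colors.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: assign one fill color per sector, reusing it for every stock of that sector. */
final class IndustryColors {

    // Hypothetical palette; the original COLOR_LIST is not shown in the post
    private static final String[] COLOR_LIST = {"#FFB3BA", "#BAE1FF", "#BAFFC9", "#FFFFBA"};

    private final Map<String, String> industryColor = new HashMap<>();

    /** Returns the color of a sector, taking the next palette entry the first time it is seen. */
    String colorOf(String industry) {
        String color = industryColor.get(industry);
        if (color == null) {
            // Wrap around instead of overflowing when sectors outnumber colors
            color = COLOR_LIST[industryColor.size() % COLOR_LIST.length];
            industryColor.put(industry, color);
        }
        return color;
    }

    public static void main(String[] args) {
        IndustryColors colors = new IndustryColors();
        System.out.println("Finance        -> " + colors.colorOf("Finance"));
        System.out.println("Transportation -> " + colors.colorOf("Transportation"));
        System.out.println("Finance again  -> " + colors.colorOf("Finance"));
    }
}
```

With such a helper, both branches in drawGraph could build their style string as "fillColor=" + colors.colorOf(node.getIndustry()).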

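Step 5: Generate the visualization:

In Figure 1, the color of each vertex represents the sector to which the CSRC assigns the listed company of that stock.

As the figure shows, stocks whose actual businesses are similar have gathered into clusters (for example, the expressway sector, the banking sector, and the airport and airline sector). Some stocks with no obvious relationship are also clustered together; the specific reasons would require studying those stocks individually.

Figure 1: Clustering computed from stock data from 2015-01-01 to 2020-01-01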

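Step 6: Further dynamic exploration based on different time windows

The conclusions in the previous section are mainly based on how the stocks clustered from 2015-01-01 to 2020-01-01. In this section we try something else: using a 2-year sliding time window and the same analysis method, we qualitatively explore whether the clusters change over time.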
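The window boundaries can be generated with java.time, as in the sketch below. The 1-year step is an assumption inferred from the figure captions in this section, and the class and method names are hypothetical. Each window would then be fed through the same correlation, distance, and minimum-spanning-tree pipeline.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Sketch: enumerate sliding time windows over a date range. */
final class SlidingWindows {

    /** Returns {start, end} pairs of windowYears length, advancing by stepYears, that fit in [from, to]. */
    static List<LocalDate[]> windows(LocalDate from, LocalDate to, int windowYears, int stepYears) {
        if (windowYears <= 0 || stepYears <= 0) {
            throw new IllegalArgumentException("window and step must be positive");
        }
        List<LocalDate[]> result = new ArrayList<>();
        for (LocalDate start = from;
             !start.plusYears(windowYears).isAfter(to);
             start = start.plusYears(stepYears)) {
            result.add(new LocalDate[] {start, start.plusYears(windowYears)});
        }
        return result;
    }

    public static void main(String[] args) {
        List<LocalDate[]> all = windows(LocalDate.of(2014, 1, 1), LocalDate.of(2020, 1, 1), 2, 1);
        for (LocalDate[] w : all) {
            System.out.println(w[0] + " to " + w[1]);
        }
    }
}
```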

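Figure 2: Clustering computed from stock data from 2014-01-01 to 2016-01-01

Figure 3: Clustering computed from stock data from 2015-01-01 to 2017-01-01

Figure 4: Clustering computed from stock data from 2016-01-01 to 2018-01-01

Figure 5: Clustering computed from stock data from 2017-01-01 to 2019-01-01

Figure 6: Clustering computed from stock data from 2018-01-01 to 2020-01-01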

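A rough analysis suggests that, as the time window moves, the stocks within some sectors (expressways, banks, airports and airlines, real estate, energy) stay well clustered, which means that the stocks within these sectors have remained highly correlated over time. In other sectors (manufacturing) the clustering keeps changing, which means that the correlations among those stocks keep changing.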

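Disclaimer

This post does not constitute investment advice, and the author does not hold any of the stocks mentioned in it.

Because of trading suspensions, circuit breakers, daily price limits, bonus issues and capitalization transfers, mergers and acquisitions, changes of main business, and similar events, the data processing may contain errors, which have not been checked one by one.

Due to limited time, this post uses only a sample of 160 stocks over the past 6 years and only one method, the minimum spanning tree, for clustering. In the future, larger datasets (for example US stocks, derivatives, or digital currencies) and more graph machine learning methods could be tried.

The code for this post is available at [18].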

References

[1] Analyzing the character relationships in Game of Thrones with NetworkX + Gephi + Nebula Graph (Part 1). https://nebula-graph.com.cn/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph/

[2] Analyzing the character relationships in Game of Thrones with NetworkX + Gephi + Nebula Graph (Part 2). https://nebula-graph.com.cn/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph-part-two/

[3] NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. https://networkx.github.io/

[4] Nebula Graph: A powerfully distributed, scalable, lightning-fast graph database written in C++. https://nebula-graph.io/

[5] JGraphT: a Java library of graph theory data structures and algorithms. https://jgrapht.org/

[6] mxGraph: JavaScript diagramming library that enables interactive graph and charting applications. https://jgraph.github.io/mxgraph/

[7] Bonanno, Giovanni & Lillo, Fabrizio & Mantegna, Rosario. (2000). High-frequency Cross-correlation in a Set of Stocks. arXiv.org, Quantitative Finance Papers. 1. 10.1080/713665554.

[8] Mantegna, R.N. Hierarchical structure in financial markets. Eur. Phys. J. B 11, 193–197 (1999).

[9] https://graphviz.org/

[10] https://gephi.org/

[11] https://github.com/vesoft-inc/nebula-java

[12] Nebula Graph Query Language (nGQL). https://docs.nebula-graph.io/manual-EN/1.overview/1.concepts/2.nGQL-overview/

[13] Nebula Graph Query Engine. https://github.com/vesoft-inc/nebula-graph

[14] Nebula-storage: A distributed consistent graph storage. https://github.com/vesoft-inc/nebula-storage

[15] Neo4j. www.neo4j.com

[16] JanusGraph. janusgraph.org

[17] Apache Spark. spark.apache.org.

[18] https://github.com/Judy1992/nebula_scan