TRINI: an adaptive load balancing strategy

TRINI: an adaptive load balancing strategy based on garbage collection for clustered Java systems

1. Introduction

  • GC comes with a cost: whenever it is triggered, GC affects system performance by pausing the programs involved.

  • Major GC (MaGC) usually causes the longest GC pauses.

  • Research shows that no single "best-fit-for-all" GC strategy exists, because GC behavior depends on the application's inputs and the system configuration.

  • GC is particularly sensitive to the heap size: even small changes to it can alter GC behavior.

  • It is commonly agreed that GC plays an important role in the performance of Java systems.

  • Core line of thinking:

    • Question: what techniques can be deployed so that the occurrence of MaGC events in the application nodes does not affect the performance of the cluster?

    • Solution: enhance a load balancer so that it selects nodes that are not expected to have a MaGC event in the immediate future.

  • The behavior of a load balancing strategy is heavily influenced by the accuracy of its balancing decisions and the amount of resources it uses.

    • A deep understanding of these factors is key to assessing the practicality of any load balancing strategy.

2. Background

  • Generational heap

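    • Objects are allocated to separate memory areas, called generations, according to their age. New objects are created in the youngest generation, since objects in younger generations typically have a lower survival rate than those in older ones. In other words, younger generations are more likely to contain garbage and are collected more frequently.

    • GC in the younger generation is called minor GC (MiGC). It is usually cheap and rarely causes performance problems. MiGC is also responsible for promoting sufficiently old live objects to the older generation, which means MiGC plays an important role in memory allocation in the older generation.

    • GC in the older generation is called major GC (MaGC). It is generally considered the most expensive type of GC because of its large impact on performance.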

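  • Garbage collection strategies

    • Three GC strategies:

      • serial GC: single-threaded; suited to the client JVM

      • parallel GC: multi-threaded; suited to the server JVM; optimized for throughput

      • concurrent GC: multi-threaded; suited to the server JVM; optimized for response time
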
  • Load balancing

    • Four load balancing strategies:

      • round robin

      • random

      • weighted round robin

      • weighted random
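The four baseline strategies can be sketched as small node selectors. This is a minimal Python illustration, not the load balancer used in the paper; the class names and the `weights` mapping are hypothetical, and the weighted round robin shown is the naive variant that repeats each node `weight` times per cycle.

```python
import itertools
import random
from typing import Dict, List, Optional


class RoundRobinBalancer:
    """Hands out nodes in a fixed cyclic order."""

    def __init__(self, nodes: List[str]) -> None:
        self._cycle = itertools.cycle(nodes)

    def next_node(self) -> str:
        return next(self._cycle)


class RandomBalancer:
    """Picks a node uniformly at random."""

    def __init__(self, nodes: List[str], rng: Optional[random.Random] = None) -> None:
        self._nodes = list(nodes)
        self._rng = rng or random.Random()

    def next_node(self) -> str:
        return self._rng.choice(self._nodes)


class WeightedRoundRobinBalancer:
    """Naive weighted round robin: each node appears `weight` times per cycle."""

    def __init__(self, weights: Dict[str, int]) -> None:
        expanded = [node for node, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def next_node(self) -> str:
        return next(self._cycle)


class WeightedRandomBalancer:
    """Picks a node at random with probability proportional to its weight."""

    def __init__(self, weights: Dict[str, float], rng: Optional[random.Random] = None) -> None:
        self._nodes = list(weights)
        self._weights = [weights[n] for n in self._nodes]
        self._rng = rng or random.Random()

    def next_node(self) -> str:
        return self._rng.choices(self._nodes, weights=self._weights, k=1)[0]
```

For example, `RoundRobinBalancer(["a", "b", "c"])` yields `a, b, c, a, ...`, and `WeightedRoundRobinBalancer({"a": 2, "b": 1})` yields `a, a, b, a, ...`.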

3. Related Work

3.1 Garbage collection optimisation

  • Prior work proposes new concurrent and parallel GC algorithms that have less impact on performance.

3.2 Memory forecasting

  • Goal of this paper

    • Forecast MaGC events and make this information available outside the JVM.

  • Proposed by others

    • Approaches that look for ways to invoke a GC.

    • An approach to estimate the number of dead objects at any time, information that a JVM could use to decide when to trigger a MaGC.

3.3 Distributed system optimisation

  • Our research work enhances a load balancer by considering the MaGC forecast in its decision layer. This way, the load balancer obtains additional knowledge about the JVM that it can use to control the workload of the system.

4. Garbage collection-aware load balancing strategy

4.1 Overview

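  • Objective: define a GC-aware load balancing strategy (TRINI) that can dynamically adjust to the specific GC characteristics of the underlying application.

  • The strategy enables the load balancer to forecast the occurrence of MaGC events with sufficient accuracy.

  • TRINI periodically retrieves information from the application nodes.

  • It finds the most suitable policy based on the application's GC characteristics.

  • It uses the selected policy to forecast MaGC events and balance the incoming load.

  • To achieve adaptivity, TRINI uses the MAPE-K model: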

      1. Monitoring element

        • obtains information

      2. Analysis element

        • evaluates whether any adaptation is required

      3. Plan element

      4. Execute element

      5. Knowledge element

        • supports the other elements

        • is fulfilled by the set of program families

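  • Program family

    • A set of similar programs that share common GC characteristics.

      • For example, program families defined by how long their MaGCs take.

    • Each program family has two attributes:

      1. An evaluation criterion

        • determines whether an application's GC behavior qualifies it for that family

      2. A policy

        • specifies the rules for GC forecasting and load balancing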

4.2 TRINI core process

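  • TRINI has a core process that coordinates its MAPE-K elements.

  • The core process is triggered as soon as the load balancer starts.

  • Initially, the core process uses a default policy, in which all available MiGC history is used to forecast MaGCs. The initial policy also takes into account any additional configuration supplied at startup, such as the load balancing algorithm, and is applied to all nodes.

  • Next, the monitoring and analysis loop starts in parallel on all nodes and runs until load balancing finishes:

    • Data samples are collected according to the program's GC characteristics (the same characteristics used to define the set of available program families).

    • Once collection is complete, the analysis process checks whether the current program family still fits the underlying GC characteristics. If it does not, the evaluation criteria of the other program families are evaluated to find a new program family.

    • The newly selected program family stays in use until the next evaluation phase. These processes retrieve their configuration from the program family database.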
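The monitor/analyse loop above can be sketched as follows. This is a minimal Python sketch under stated assumptions: a per-node sample is a dictionary of GC metrics, a program family is an evaluation criterion plus a policy reduced here to a single FWS value, and all names (`ProgramFamily`, `analyse`, `core_process`, `collect_sample`) are hypothetical. The notes do not describe TRINI's actual sample contents or policy rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Sample = Dict[str, float]  # hypothetical: GC metrics gathered for one node


@dataclass
class ProgramFamily:
    name: str
    qualifies: Callable[[Sample], bool]  # evaluation criterion
    fws: Optional[int]                   # policy: forecast window size (None = all history)


# Default policy: use all available MiGC history. Its criterion never matches,
# so the first analysis always searches the real families.
DEFAULT_FAMILY = ProgramFamily("default", lambda sample: False, None)


def analyse(current: ProgramFamily, families: List[ProgramFamily],
            sample: Sample) -> ProgramFamily:
    """Keep the current family if it still fits; otherwise look for one that does."""
    if current.qualifies(sample):
        return current
    for family in families:
        if family is not current and family.qualifies(sample):
            return family
    return current  # nothing fits better: keep the current policy


def core_process(nodes: List[str], families: List[ProgramFamily],
                 collect_sample: Callable[[str], Sample],
                 rounds: int) -> Dict[str, ProgramFamily]:
    policy = {node: DEFAULT_FAMILY for node in nodes}  # same initial policy for every node
    for _ in range(rounds):
        for node in nodes:  # TRINI runs this per node in parallel; sequential here for clarity
            sample = collect_sample(node)                           # monitor
            policy[node] = analyse(policy[node], families, sample)  # analyse + plan
        # execute: the load balancer would now forecast with policy[node].fws
    return policy
```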

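4.3 MaGA: a major garbage collection forecast algorithm

  • TRINI's most important capability: accurately forecasting the occurrence of MaGCs

    • achieved through the MaGA algorithm (presented in another paper by the authors)

  • How MaGA works:

    • It periodically retrieves GC and memory samples from the JVM to record the memory allocation activity in the Young and Old generations.

    • It uses recent history, bounded by a configurable forecast window size (FWS), to forecast the next MaGC event.

    • It predicts how much memory can be allocated in the Young Generation before the Old Generation is exhausted (exhausting the Old Generation triggers a MaGC):

      • The algorithm fits a linear regression model to the Old Generation history within the FWS. This estimates how fast OldGen grows relative to YoungGen allocation and, from that, extrapolates the point at which OldGen will exceed its maximum threshold and trigger a GC. The result is the amount of YoungGen allocation at which the next MaGC will occur.

      • The algorithm passes this YoungGen threshold to a second linear regression model, which extrapolates the YoungGen memory time series to predict when the next MaGC will occur.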
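The two-step extrapolation described above can be illustrated with plain least-squares fits. This is a minimal Python sketch of the idea as summarized in these notes, not the published MaGA implementation: the sample layout (timestamp, cumulative YoungGen allocation, OldGen usage), the `old_max` threshold parameter and all function names are assumptions.

```python
from typing import List, Optional, Tuple

# (time in seconds, cumulative bytes allocated in YoungGen, bytes used in OldGen)
Sample = Tuple[float, float, float]


def fit_line(xs: List[float], ys: List[float]) -> Optional[Tuple[float, float]]:
    """Ordinary least squares fit of y = a + b * x; returns (a, b), or None if x is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def forecast_next_magc(samples: List[Sample], old_max: float,
                       fws: Optional[int]) -> Optional[float]:
    """Return the predicted time of the next MaGC, or None if none can be extrapolated."""
    window = samples if fws is None else samples[-fws:]  # FWS bounds the history used
    if len(window) < 2:
        return None
    times = [s[0] for s in window]
    young = [s[1] for s in window]
    old = [s[2] for s in window]

    # Model 1: OldGen usage as a function of cumulative YoungGen allocation.
    model1 = fit_line(young, old)
    if model1 is None or model1[1] <= 0:
        return None  # OldGen is not growing: no MaGC in sight
    a1, b1 = model1
    young_threshold = (old_max - a1) / b1  # YoungGen level at which OldGen hits old_max

    # Model 2: cumulative YoungGen allocation as a function of time.
    model2 = fit_line(times, young)
    if model2 is None or model2[1] <= 0:
        return None
    a2, b2 = model2
    return (young_threshold - a2) / b2  # time at which YoungGen reaches the threshold
```

With `fws=None` the whole history is used, as in TRINI's default policy; a positive FWS keeps only the most recent samples.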

4.4 Garbage collection-aware load balancing algorithms

  • The four basic algorithms were turned into corresponding GC-aware algorithms:

    • round robin

    • random

    • weighted round robin

    • weighted random

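  • The main difference between the new algorithms and their original counterparts:

    • they perform an additional check when selecting the next node:

      • If the preselected node is expected to undergo a MaGC within a very short time, it is skipped and the other nodes are evaluated.

      • If all nodes are expected to undergo a MaGC in the very near future, the algorithm falls back to its original, non-GC-aware behavior.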
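The extra check can be layered on top of plain round robin. A minimal Python sketch, assuming the forecaster exposes, for each node, a predicted timestamp of its next MaGC and that "a very short time" is a configurable horizon; `GCAwareRoundRobin`, `forecast` and `horizon_s` are hypothetical names, not TRINI's API. The other three algorithms would wrap their own selection step the same way.

```python
import time
from typing import Callable, List, Optional

# Hypothetical forecast source: node -> predicted time of its next MaGC (None = no forecast).
Forecast = Callable[[str], Optional[float]]


class GCAwareRoundRobin:
    """Round robin that skips nodes whose next MaGC is forecast within `horizon_s` seconds."""

    def __init__(self, nodes: List[str], forecast: Forecast, horizon_s: float) -> None:
        self._nodes = list(nodes)
        self._forecast = forecast
        self._horizon_s = horizon_s
        self._next = 0  # position of the plain round-robin pointer

    def _magc_imminent(self, node: str, now: float) -> bool:
        predicted = self._forecast(node)
        return predicted is not None and 0.0 <= predicted - now <= self._horizon_s

    def next_node(self, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        n = len(self._nodes)
        start = self._next
        self._next = (self._next + 1) % n  # the plain round-robin pointer always advances by one
        for offset in range(n):            # additional check: skip nodes about to run a MaGC
            node = self._nodes[(start + offset) % n]
            if not self._magc_imminent(node, now):
                return node
        return self._nodes[start]          # every node is about to run a MaGC: plain round robin
```

Advancing the pointer by exactly one per request preserves the original rotation, so a skipped node is considered again on later requests instead of being starved.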

4.5 MiGC-CV program families

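  • Automatic selection of the FWS

    • A previous paper by the authors showed that the accuracy of MaGA is extremely sensitive to the FWS.

    • The FWS bounds the amount of knowledge used to forecast MaGCs (i.e., the size of the memory allocation history).

    • Experiments showed that there is no single FWS value that is optimal for all cases.

    • Another paper by the authors showed that, in general, the more historical data is available, the more accurately MaGA forecasts. However, this relationship is not monotonic: accuracy also goes through troughs as the FWS changes.

    • This behavior can be captured by MiGCCV.

    • MiGCCV is the coefficient of variation of the number of MiGCs that occur between MaGCs.

    • This makes MiGCCV a suitable classification criterion for grouping different program behaviors into different families.

      • For example, when the number of MiGCs between MaGCs varies widely (i.e., MiGCCV is large), a long history works poorly, because historical data cannot capture such large variations (of several orders of magnitude) in memory behavior. In contrast, using only recent history (i.e., a smaller FWS) improves forecast accuracy significantly.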
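The classification idea above amounts to mapping ranges of MiGCCV to window sizes, with larger variation getting a smaller FWS. A minimal Python sketch; the family boundaries and FWS values in `MIGC_CV_FAMILIES` are invented for illustration (these notes do not give the paper's values), and expressing the FWS as a number of retained samples is an assumption.

```python
from typing import List, Optional, Tuple

# Hypothetical MiGC-CV families: (exclusive upper bound on MiGCCV, FWS for that family).
# The bounds and window sizes are illustrative; they are not the paper's values.
MIGC_CV_FAMILIES: List[Tuple[float, Optional[int]]] = [
    (0.25, None),        # stable MiGC counts: use all available history
    (0.50, 8),
    (1.00, 4),
    (float("inf"), 2),   # highly variable MiGC counts: only the most recent history
]


def select_fws(migc_cv: float) -> Optional[int]:
    """Return the FWS of the first family whose MiGCCV range contains `migc_cv`."""
    if migc_cv < 0:
        raise ValueError("a coefficient of variation of counts cannot be negative")
    for upper_bound, fws in MIGC_CV_FAMILIES:
        if migc_cv < upper_bound:
            return fws
    raise ValueError("migc_cv must be a finite number")  # only reached for inf or NaN
```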

5. Experimental evaluation

5.1 Experiment #1 Generality assessment

  • TRINI was applied to four load balancing algorithms to assess its generality

load balancing algorithms (original --> GC-aware version developed)

  • round robin --> GC-round robin

  • weighted round robin --> GC-weighted round robin

  • random --> GC-random

  • weighted random --> GC-weighted random

test environment

  • 52 virtual machines

    • 50 application nodes

    • 1 load balancer

    • 1 load tester node (performance testing with Apache JMeter)

garbage collection strategies

The GC strategy is one of the main factors affecting GC behavior.

  • serial GC

  • parallel GC

  • concurrent GC

evaluation criteria

  • performance

    • throughput

    • response time

  • overhead

    • CPU (%)

    • memory (MB)

  • FA (forecast accuracy): three metrics were calculated

    • FE (forecast error)

    • MiGCAVG (the average number of MiGCs that occurred between two MaGC events)

      • captures the relationship between the heap size and the memory allocation required by an application (the major factors influencing GC)

      • The smaller MiGCAVG is, the more often MaGCs occur: the program's old generation is exhausted very frequently.

      • If MiGCAVG approaches 0, an out-of-memory error will occur.

    • MiGCCV (the coefficient of variation)

      • the standard deviation of the number of MiGCs between MaGC events, divided by MiGCAVG

      • used to compare the variability in memory usage across different programs
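Both statistics can be derived from a log of GC events. A minimal Python sketch, assuming the log is reduced to a sequence of `"minor"`/`"major"` markers; the function names are hypothetical, and using the population standard deviation (rather than the sample one) is a choice these notes do not specify.

```python
from statistics import mean, pstdev
from typing import Iterable, List, Optional, Tuple


def migc_counts(events: Iterable[str]) -> List[int]:
    """Count the MiGCs between each pair of consecutive MaGCs in a GC event log.

    Each event is "minor" or "major"; MiGCs before the first MaGC are ignored.
    """
    counts: List[int] = []
    current = 0
    seen_major = False
    for event in events:
        if event == "major":
            if seen_major:
                counts.append(current)
            seen_major = True
            current = 0
        elif event == "minor":
            current += 1
    return counts


def migc_avg_and_cv(events: Iterable[str]) -> Optional[Tuple[float, Optional[float]]]:
    """Return (MiGCAVG, MiGCCV), or None if fewer than two MaGCs were observed."""
    counts = migc_counts(events)
    if not counts:
        return None
    avg = mean(counts)
    # Coefficient of variation = standard deviation / mean (population std dev here).
    cv = pstdev(counts) / avg if avg > 0 else None  # undefined when MiGCAVG is 0
    return avg, cv
```

For the log `minor, major, minor, minor, major, minor, minor, minor, minor, major`, the counts are `[2, 4]`, so MiGCAVG is 3 and MiGCCV is 1/3.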

performance improvement

  • TRINI worked well irrespective of the GC strategy and the load balancing algorithm.

  • Memory behavior differed across the tested applications.

  • Analysis of MiGCCV behavior:

    • The smaller the MiGCCV, the higher the forecast accuracy.

overhead

  • overhead

    • in the application nodes

      • TRINI proved to be lightweight in terms of CPU and memory (the increase was very small).

      • This overhead is caused by the data gathering process.

    • in the load balancer node

      • relatively higher (compared with the application nodes)

      • overhead is independent of load balancing algorithms

5.2 Experiment #2 Scalability assessment

test environment

  • The cluster size was variable:

    • covering the range of 5 to 50 application nodes in increments of 5

    • the number of concurrent users was increased proportionally to the cluster size

    • 5-node --> 50 users

    • 10-node --> 100 users

    • and so on

performance

  • Hypotheses (confirmed):

    • Performance improvements should not degrade when the cluster size increases (not strictly).

    • The differences in improvement among the tested programs were due to the diversity of their memory/GC behavior.

overhead

  • cost in the application nodes

    • was minimal and relatively constant and independent of the cluster size

  • cost in the load balancer node

    • was dependent on the cluster size

5.3 Experiment #3 Reliability assessment

test environment

  • 50 nodes

  • duration of the test runs was increased from 1 to 24 hours

performance improvements

  • A breakdown of the behavior of each experimental configuration was carried out on an hourly basis:

    • the improvements remained stable over time

overhead

  • application node

    • minimal overhead (relatively constant)

  • load balancer node

    • higher (but quite steady)

      • The main contributor to this increase is the number of forecast processes, which is influenced not by time but by the size of the cluster.

      • The memory footprint remained stable.

      • Data older than the required FWS is automatically purged.
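Purging everything outside the FWS keeps the load balancer's memory bounded regardless of how long it runs. A minimal Python sketch, assuming the FWS is expressed as a number of samples; `SampleHistory` is a hypothetical name, and TRINI's actual storage is not described in these notes.

```python
from collections import deque
from typing import Deque, Generic, List, Optional, TypeVar

T = TypeVar("T")


class SampleHistory(Generic[T]):
    """Per-node sample buffer that keeps at most `fws` recent samples (None = unbounded)."""

    def __init__(self, fws: Optional[int]) -> None:
        self._samples: Deque[T] = deque(maxlen=fws)

    def add(self, sample: T) -> None:
        self._samples.append(sample)  # when full, the deque drops the oldest sample automatically

    def resize(self, fws: Optional[int]) -> None:
        """Apply a new FWS (e.g. after a family change), keeping the newest samples."""
        self._samples = deque(self._samples, maxlen=fws)

    def window(self) -> List[T]:
        return list(self._samples)
```

Growing the FWS later cannot bring back samples that were already purged; the window simply refills as new samples arrive.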

5.4 Final discussion for practitioners

  1. To estimate FA, MiGCCV has proven to be a useful metric.

  2. More GC-intensive applications benefit most from TRINI.

  3. In terms of the overhead introduced to the load balancer node, results have shown that the overhead usually grows relatively linearly with the cluster size.
