This article covers the following topics:
1. Introduction to SkyWalking:
SkyWalking: an open source observability platform to collect, analyze, aggregate and visualize data from services and cloud native infrastructures.
SkyWalking provides an easy way to keep you have a clear view of your distributed system, even across Cloud.
It is more like a modern APM, specially designed for cloud native, container based and distributed system.
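In short, SkyWalking is an open-source, modern APM (application performance management) platform that collects, analyzes, aggregates and visualizes data from services and cloud-native infrastructure, giving you a clear view of a distributed system, even one that spans multiple clouds.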
Why use SkyWalking?
SkyWalking provides solutions for observing and monitoring distributed system, in many different scenarios.
First of all, like traditional ways, SkyWalking provides auto instrument agents for service, such as Java, C# and Node.js.
At the same time, it provides manual instrument SDKs for Go(Not yet), C++(Not yet).
Also with more languages required, risks in manipulating codes at runtime, cloud native infrastructures grow more powerful,
SkyWalking could use Service Mesher infra probes to collect data for understanding the whole distributed system.
In general, it provides observability capabilities for service(s), service instance(s), endpoint(s).
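In short: besides automatic instrumentation agents for Java, C# and Node.js (with manual instrumentation SDKs for Go and C++ planned but not yet available), SkyWalking can collect data through service mesh probes. This matters because more and more languages must be supported, manipulating code at runtime carries risk, and cloud-native infrastructure keeps getting more capable. Overall, SkyWalking lets you observe three levels: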
Service: a service.
Service Instance: one instance of a service (a single service usually runs as multiple nodes).
Endpoint: one interface exposed by a service.
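The three levels nest: a service runs as several instances, and each endpoint belongs to a service. Below is a minimal Java sketch of that relationship (records require Java 16+). ObservationModel, its records, and the sample names (hello-service, the node names, the paths) are hypothetical and are not SkyWalking's internal model.

```java
import java.util.List;

// Hypothetical model of the three observation levels (not SkyWalking's classes).
class ObservationModel {
    // One running node (process) of a service.
    record ServiceInstance(String name) {}

    // One interface exposed by a service, e.g. an HTTP path.
    record Endpoint(String path) {}

    // A logical service together with its instances and endpoints.
    record Service(String name, List<ServiceInstance> instances, List<Endpoint> endpoints) {}

    static Service sample() {
        return new Service(
            "hello-service",
            List.of(new ServiceInstance("hello-service-node-1"), new ServiceInstance("hello-service-node-2")),
            List.of(new Endpoint("/hello/start1"), new Endpoint("/hello/status")));
    }

    public static void main(String[] args) {
        Service service = sample();
        System.out.printf("%s: %d instances, %d endpoints%n",
            service.name(), service.instances().size(), service.endpoints().size());
    }
}
```

Alarms and metrics in the UI are organized along these same three levels.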
2. Using SkyWalking:
Step 1: Download the package from the SkyWalking website, http://skywalking.apache.org/downloads/. The package structure is shown in the figure.
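Step 2: Start the SkyWalking collector service. The startup script is E:\apache-skywalking-apm-bin\bin\startup.sh (on Windows, use startup.bat in the same directory). Once it is running, open http://localhost:8080/ to see the SkyWalking UI.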
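Step 3: Start your application. Copy the skywalking-agent directory to wherever you need it. The probe consists of the entire directory, so do not change its structure. In agent.config, change agent.application_code=xxl-job to your own application name (depending on the SkyWalking version, the key may instead be agent.service_name).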
Add the JVM startup argument -javaagent:/path/to/skywalking-agent/skywalking-agent.jar, where the value is the absolute path to skywalking-agent.jar.
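Since the -javaagent value must be the absolute path of the jar inside the unmodified agent directory, it can help to derive it from the directory instead of typing it by hand. The following is a hypothetical helper (JavaAgentArgs is not part of SkyWalking); /opt/skywalking-agent and app.jar are placeholder paths.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper that builds the -javaagent argument for a SkyWalking agent directory.
class JavaAgentArgs {
    static String javaAgentArg(Path agentDir) {
        // The jar must stay inside its original directory; resolve it there and
        // turn it into an absolute, normalized path as the -javaagent value.
        Path jar = agentDir.resolve("skywalking-agent.jar").toAbsolutePath().normalize();
        return "-javaagent:" + jar;
    }

    // Full launch command for an application jar (the jar name is a placeholder).
    static List<String> launchCommand(Path agentDir, String appJar) {
        return List.of("java", javaAgentArg(agentDir), "-jar", appJar);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", launchCommand(Path.of("/opt/skywalking-agent"), "app.jar")));
    }
}
```

The resulting list can also be passed directly to a ProcessBuilder if the application is launched from Java.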
After these steps, call your application's endpoints and check whether the SkyWalking UI collects the call information.
The figure below shows the SkyWalking home page, which mainly displays global performance information.
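To verify that SkyWalking can discover the system topology (service dependencies), start four services whose endpoint paths are hello/start1, hello/start2, hello/start3 and hello/start4.
The dependencies between them are: start1 depends on start2, and start2 depends on start3 and start4.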
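Conceptually, such a topology can be assembled from the caller-to-callee pairs observed in the traces. Here is a minimal sketch, assuming each cross-service call yields one edge. TopologySketch is a plain adjacency map for illustration, not SkyWalking's topology analysis.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Simplified topology builder: each observed call contributes one caller -> callee edge.
class TopologySketch {
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    void recordCall(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new LinkedHashSet<>()).add(callee);
    }

    // All services reached, directly or transitively, by a request entering at `entry` (iterative DFS).
    Set<String> reachableFrom(String entry) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(entry);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (seen.add(node)) {
                for (String next : edges.getOrDefault(node, Set.of())) {
                    stack.push(next);
                }
            }
        }
        return seen;
    }

    Map<String, Set<String>> edges() {
        return edges;
    }

    public static void main(String[] args) {
        TopologySketch topology = new TopologySketch();
        // Dependencies from the example: start1 -> start2, start2 -> start3 and start4.
        topology.recordCall("start1", "start2");
        topology.recordCall("start2", "start3");
        topology.recordCall("start2", "start4");
        System.out.println(topology.edges());
        System.out.println(topology.reachableFrom("start1"));
    }
}
```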
After calling the start1 endpoint, SkyWalking shows the following project topology:
End-to-end trace performance page:
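By default, SkyWalking monitors the performance of these call types (layers): DB(1), RPC_FRAMEWORK(2), HTTP(3), MQ(4) and CACHE(5). Components that are not supported out of the box can be monitored by writing custom plugins.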
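The numbers in parentheses are the layer codes. A small Java enum mirroring those five values, with a lookup from code to layer, makes the mapping explicit. The name CallLayer and the fromCode method are hypothetical; in the Dubbo plugin code of this article, SkyWalking's own type shows up as SpanLayer (SpanLayer.asRPCFramework(span)).

```java
import java.util.Arrays;

// Hypothetical mirror of the layer codes listed above (not SkyWalking's class).
enum CallLayer {
    DB(1), RPC_FRAMEWORK(2), HTTP(3), MQ(4), CACHE(5);

    private final int code;

    CallLayer(int code) {
        this.code = code;
    }

    int code() {
        return code;
    }

    // Looks up a layer by its numeric code; unknown codes are rejected.
    static CallLayer fromCode(int code) {
        return Arrays.stream(values())
            .filter(layer -> layer.code == code)
            .findFirst()
            .orElseThrow(() -> new IllegalArgumentException("Unknown layer code: " + code));
    }

    public static void main(String[] args) {
        System.out.println(fromCode(3)); // prints "HTTP"
    }
}
```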
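Next, here is what calls to Dubbo and the database look like (service start2 queries the database and calls the Dubbo service provided by service 4):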
3. Integrating SkyWalking's traceId with logging components (log4j, logback, ELK, etc.):
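Taking logback as an example, add the following configuration to the logging XML file, and the traceId of the current context is automatically added to every log line (the layout class is provided by the apm-toolkit-logback-1.x dependency).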
```xml
<appender name="console" class="ch.qos.logback.core.ConsoleAppender">
    <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.TraceIdPatternLogbackLayout">
        <pattern>%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %tid - %msg%n</pattern>
    </layout>
</appender>
```
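The %tid conversion word is what puts the trace ID into each line: conceptually, the layout reads the trace ID of the current thread's tracing context while formatting each event. The sketch below simulates that with a ThreadLocal and String.format. TraceIdLogFormatter and the TID:N/A fallback for threads without a trace are assumptions for illustration, not the toolkit's code.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Simulation of a trace-ID-aware log layout (not the SkyWalking toolkit's implementation).
class TraceIdLogFormatter {
    // Per-thread trace ID, standing in for the agent's tracing context.
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();
    private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static void setTraceId(String traceId) {
        TRACE_ID.set(traceId);
    }

    static void clear() {
        TRACE_ID.remove();
    }

    // Mirrors the pattern "%d [%thread] %-5level %logger{36} - %tid - %msg".
    static String format(LocalDateTime time, String level, String logger, String msg) {
        String traceId = TRACE_ID.get();
        String tid = traceId == null ? "TID:N/A" : "TID:" + traceId;
        return String.format("%s [%s] %-5s %s - %s - %s",
            TIME.format(time), Thread.currentThread().getName(), level, logger, tid, msg);
    }

    public static void main(String[] args) {
        setTraceId("47.34.15657723821280001");
        System.out.println(format(LocalDateTime.now(), "INFO", "HelloController", "service1 logger with traceId"));
        clear();
    }
}
```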
As shown below, every node in the trace logs the same traceId. After spotting a slow trace in SkyWalking, you can search the logging system for its traceId and check whether any exceptions were logged.
Log printed by service 1:
2019-08-14 16:46:22 [http-nio-9091-exec-1] INFO c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service1 logger with traceId
Log printed by service 2:
2019-08-14 16:46:24 [http-nio-9092-exec-9] INFO c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service2 logger with traceId
Log printed by service 3:
2019-08-14 16:46:24 [http-nio-9093-exec-1] INFO c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service3 logger with traceId
Log printed by service 4:
2019-08-14 16:46:24 [http-nio-9094-exec-1] INFO c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service4 logger with traceId
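Because every service prints the same TID, correlating logs with a slow trace found in the UI comes down to filtering by that ID. Here is a minimal sketch, assuming the log format shown above. The regular expression and the TraceLogGrep name are hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Filters log lines by the trace ID printed through %tid (format assumed from the samples above).
class TraceLogGrep {
    private static final Pattern TID = Pattern.compile("TID:(\\S+)");

    // Returns the trace ID found in a line, or null when the line carries none.
    static String traceIdOf(String line) {
        Matcher m = TID.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Keeps only the lines that belong to the given trace.
    static List<String> linesForTrace(List<String> lines, String traceId) {
        return lines.stream()
            .filter(line -> traceId.equals(traceIdOf(line)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "2019-08-14 16:46:22 [http-nio-9091-exec-1] INFO c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service1 logger with traceId",
            "2019-08-14 16:46:24 [http-nio-9092-exec-9] INFO c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service2 logger with traceId");
        linesForTrace(logs, "47.34.15657723821280001").forEach(System.out::println);
    }
}
```

In practice the same filter is usually a query in the log platform (for example, a field search in ELK) rather than Java code.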
4. Using SkyWalking's alarm module:
The figure below shows the alarm page of the UI. Alarms can be monitored at three levels: service, service instance and endpoint (interface).
Alarm rules can be freely defined in the configuration file inside the installation package (apache-skywalking-apm-bin/config/alarm-settings.yml).
By default, services and service instances are monitored but endpoints are not, because, as the configuration file explains, active endpoint-related metric alarms cost more memory than service and service instance metric alarms, since there are far more endpoints than services and instances.
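The configuration below defines the alarm rules. SkyWalking can also call webhook URLs, configured under webhooks in the same file, so that your own notification service can send alerts such as SMS or email in time.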
```yaml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_avg_rule:
  #   metrics-name: endpoint_avg
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

#webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
```
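The comments in this file describe how period, count and silence-period interact. As a rough mental model, here is a simplified simulation that assumes one metric value per minute and supports only the ">" operator: it keeps the last period values, fires when at least count of them breach the threshold, then stays quiet for silence-period checks. AlarmRuleSketch is hypothetical and is not the OAP server's alarm implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified alarm-rule evaluator modeled on the period / count / silence-period fields.
// Assumes one value per minute and op ">"; not SkyWalking's implementation.
class AlarmRuleSketch {
    private final double threshold;
    private final int period;        // window length in minutes
    private final int count;         // breaches inside the window needed to fire
    private final int silencePeriod; // checks to stay silent after firing
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    AlarmRuleSketch(double threshold, int period, int count, int silencePeriod) {
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    // Feeds one per-minute value; returns true when an alarm fires.
    boolean check(double value) {
        window.addLast(value > threshold);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` minutes
        }
        if (silenceLeft > 0) {
            silenceLeft--;
            return false;
        }
        long breaches = window.stream().filter(b -> b).count();
        if (breaches >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Same numbers as service_resp_time_rule: threshold 1000 ms, period 10, count 3, silence 5.
        AlarmRuleSketch rule = new AlarmRuleSketch(1000, 10, 3, 5);
        double[] respTimes = {200, 1500, 300, 1200, 1800, 1600, 400};
        for (double t : respTimes) {
            System.out.println(t + " -> " + (rule.check(t) ? "ALARM" : "ok"));
        }
    }
}
```

With these inputs the rule fires on the fifth value (the third breach inside the window) and then stays silent.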
5. How SkyWalking works:
SkyWalking's overall architecture consists of three parts:
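The core of SkyWalking is the agent. The figure below shows in detail how the agents in each process work during a single call that spans multiple processes: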
The agent supports many clients and servers. The detailed list of supported plugins: https://github.com/apache/skywalking/blob/master/docs/en/setup/service-agent/java-agent/Supported-list.md
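Taking the interception of Dubbo requests as an example, here is how SkyWalking's Dubbo interception plugin is implemented. The plugin intercepts the invoke method of Dubbo's MonitorFilter class. As DubboInterceptor shows, it obtains Dubbo's context, RpcContext: on the consumer side, before the call is made, it adds SkyWalking's cross-process protocol header (sw:traceId) to the context, and on the provider side it reads the header back out. The plugin definition is: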
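```java
package org.apache.skywalking.apm.plugin.dubbo;

import net.bytebuddy.description.method.MethodDescription;
import net.bytebuddy.matcher.ElementMatcher;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.ConstructorInterceptPoint;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.InstanceMethodsInterceptPoint;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.ClassInstanceMethodsEnhancePluginDefine;
import org.apache.skywalking.apm.agent.core.plugin.match.ClassMatch;
import org.apache.skywalking.apm.agent.core.plugin.match.NameMatch;

import static net.bytebuddy.matcher.ElementMatchers.named;

public class DubboInstrumentation extends ClassInstanceMethodsEnhancePluginDefine {

    // Class to enhance: Dubbo's MonitorFilter.
    private static final String ENHANCE_CLASS = "com.alibaba.dubbo.monitor.support.MonitorFilter";

    // Interceptor that runs around the enhanced method.
    private static final String INTERCEPT_CLASS = "org.apache.skywalking.apm.plugin.dubbo.DubboInterceptor";

    @Override
    protected ClassMatch enhanceClass() {
        return NameMatch.byName(ENHANCE_CLASS);
    }

    @Override
    public ConstructorInterceptPoint[] getConstructorsInterceptPoints() {
        // Constructors are not intercepted.
        return null;
    }

    @Override
    public InstanceMethodsInterceptPoint[] getInstanceMethodsInterceptPoints() {
        return new InstanceMethodsInterceptPoint[] {
            new InstanceMethodsInterceptPoint() {
                @Override
                public ElementMatcher<MethodDescription> getMethodsMatcher() {
                    // Intercept MonitorFilter.invoke(...)
                    return named("invoke");
                }

                @Override
                public String getMethodsInterceptor() {
                    return INTERCEPT_CLASS;
                }

                @Override
                public boolean isOverrideArgs() {
                    return false;
                }
            }
        };
    }
}
```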
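DubboInstrumentation only declares which class (MonitorFilter) and method (invoke) to enhance and which interceptor handles it; the agent then weaves the interceptor's before/after/exception callbacks around that method through bytecode enhancement. The same around-advice order can be illustrated without an agent by swapping in a JDK dynamic proxy, which only works for interfaces. Filter, AroundInterceptor and the recording interceptor are hypothetical names for this sketch.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Around-interception illustrated with a JDK dynamic proxy (SkyWalking enhances bytecode instead).
class AroundInterceptSketch {
    // Hypothetical stand-in for an intercepted component such as MonitorFilter.
    interface Filter {
        String invoke(String request);
    }

    // Hypothetical contract modeled on beforeMethod / afterMethod / handleMethodException.
    interface AroundInterceptor {
        void before(Method method, Object[] args);
        Object after(Method method, Object[] args, Object ret);
        void onException(Method method, Object[] args, Throwable t);
    }

    // Wraps `target` so that calls to `methodName` run through the interceptor callbacks.
    static <T> T enhance(T target, Class<T> type, String methodName, AroundInterceptor interceptor) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (!method.getName().equals(methodName)) {
                return method.invoke(target, args); // unmatched methods pass straight through
            }
            interceptor.before(method, args);
            Object ret = null;
            try {
                ret = method.invoke(target, args);
            } catch (InvocationTargetException e) {
                interceptor.onException(method, args, e.getCause());
                throw e.getCause();
            } finally {
                // Runs on success and failure, so a span opened in before() can always be stopped.
                ret = interceptor.after(method, args, ret);
            }
            return ret;
        };
        return type.cast(Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type}, handler));
    }

    // Hypothetical interceptor that records the order of the callbacks.
    static AroundInterceptor recorder(List<String> events) {
        return new AroundInterceptor() {
            public void before(Method m, Object[] a) { events.add("before " + m.getName()); }
            public Object after(Method m, Object[] a, Object ret) { events.add("after " + m.getName()); return ret; }
            public void onException(Method m, Object[] a, Throwable t) { events.add("exception " + t.getMessage()); }
        };
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        Filter real = request -> "echo:" + request;
        Filter traced = enhance(real, Filter.class, "invoke", recorder(events));
        System.out.println(traced.invoke("hello")); // prints "echo:hello"
        System.out.println(events);                 // prints "[before invoke, after invoke]"
    }
}
```

Running after() in a finally block matches how DubboInterceptor is written: handleMethodException only records the error, while afterMethod is the one that calls ContextManager.stopSpan().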
The following code implements the Dubbo interceptor:
```java
package org.apache.skywalking.apm.plugin.dubbo;

import com.alibaba.dubbo.common.URL;
import com.alibaba.dubbo.rpc.Invocation;
import com.alibaba.dubbo.rpc.Invoker;
import com.alibaba.dubbo.rpc.Result;
import com.alibaba.dubbo.rpc.RpcContext;
import java.lang.reflect.Method;
import org.apache.skywalking.apm.agent.core.context.ContextCarrier;
import org.apache.skywalking.apm.agent.core.context.tag.Tags;
import org.apache.skywalking.apm.agent.core.context.CarrierItem;
import org.apache.skywalking.apm.agent.core.context.ContextManager;
import org.apache.skywalking.apm.agent.core.context.trace.AbstractSpan;
import org.apache.skywalking.apm.agent.core.context.trace.SpanLayer;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.EnhancedInstance;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstanceMethodsAroundInterceptor;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.MethodInterceptResult;
import org.apache.skywalking.apm.network.trace.component.ComponentsDefine;

/**
 * {@link DubboInterceptor} define how to enhance class {@link com.alibaba.dubbo.monitor.support.MonitorFilter#invoke(Invoker,
 * Invocation)}. the trace context transport to the provider side by {@link RpcContext#attachments}.but all the version
 * of dubbo framework below 2.8.3 don't support {@link RpcContext#attachments}, we support another way to support it.
 *
 * @author zhangxin
 */
public class DubboInterceptor implements InstanceMethodsAroundInterceptor {
    /**
     * <h2>Consumer:</h2> The serialized trace context data will
     * inject to the {@link RpcContext#attachments} for transport to provider side.
     * <p>
     * <h2>Provider:</h2> The serialized trace context data will extract from
     * {@link RpcContext#attachments}. current trace segment will ref if the serialize context data is not null.
     */
    @Override
    public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, MethodInterceptResult result) throws Throwable {
        Invoker invoker = (Invoker)allArguments[0];
        Invocation invocation = (Invocation)allArguments[1];
        RpcContext rpcContext = RpcContext.getContext();
        boolean isConsumer = rpcContext.isConsumerSide();
        URL requestURL = invoker.getUrl();

        AbstractSpan span;

        final String host = requestURL.getHost();
        final int port = requestURL.getPort();
        if (isConsumer) {
            final ContextCarrier contextCarrier = new ContextCarrier();
            span = ContextManager.createExitSpan(generateOperationName(requestURL, invocation), contextCarrier, host + ":" + port);
            //invocation.getAttachments().put("contextData", contextDataStr);
            //@see https://github.com/alibaba/dubbo/blob/dubbo-2.5.3/dubbo-rpc/dubbo-rpc-api/src/main/java/com/alibaba/dubbo/rpc/RpcInvocation.java#L154-L161
            CarrierItem next = contextCarrier.items();
            while (next.hasNext()) {
                next = next.next();
                rpcContext.getAttachments().put(next.getHeadKey(), next.getHeadValue());
            }
        } else {
            ContextCarrier contextCarrier = new ContextCarrier();
            CarrierItem next = contextCarrier.items();
            while (next.hasNext()) {
                next = next.next();
                next.setHeadValue(rpcContext.getAttachment(next.getHeadKey()));
            }

            span = ContextManager.createEntrySpan(generateOperationName(requestURL, invocation), contextCarrier);
        }

        Tags.URL.set(span, generateRequestURL(requestURL, invocation));
        span.setComponent(ComponentsDefine.DUBBO);
        SpanLayer.asRPCFramework(span);
    }

    @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, Object ret) throws Throwable {
        Result result = (Result)ret;
        if (result != null && result.getException() != null) {
            dealException(result.getException());
        }

        ContextManager.stopSpan();
        return ret;
    }

    @Override
    public void handleMethodException(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, Throwable t) {
        dealException(t);
    }

    /**
     * Log the throwable, which occurs in Dubbo RPC service.
     */
    private void dealException(Throwable throwable) {
        AbstractSpan span = ContextManager.activeSpan();
        span.errorOccurred();
        span.log(throwable);
    }

    /**
     * Format operation name. e.g. org.apache.skywalking.apm.plugin.test.Test.test(String)
     *
     * @return operation name.
     */
    private String generateOperationName(URL requestURL, Invocation invocation) {
        StringBuilder operationName = new StringBuilder();
        operationName.append(requestURL.getPath());
        operationName.append("." + invocation.getMethodName() + "(");
        for (Class<?> classes : invocation.getParameterTypes()) {
            operationName.append(classes.getSimpleName() + ",");
        }

        if (invocation.getParameterTypes().length > 0) {
            operationName.delete(operationName.length() - 1, operationName.length());
        }

        operationName.append(")");

        return operationName.toString();
    }

    /**
     * Format request url.
     * e.g. dubbo://127.0.0.1:20880/org.apache.skywalking.apm.plugin.test.Test.test(String).
     *
     * @return request url.
     */
    private String generateRequestURL(URL url, Invocation invocation) {
        StringBuilder requestURL = new StringBuilder();
        requestURL.append(url.getProtocol() + "://");
        requestURL.append(url.getHost());
        requestURL.append(":" + url.getPort() + "/");
        requestURL.append(generateOperationName(url, invocation));
        return requestURL.toString();
    }
}
```
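Stripped of the Dubbo and agent types, the consumer/provider hand-off in beforeMethod amounts to: serialize the current trace context into key/value pairs, carry them in the RPC attachments map, and rebuild the context on the other side before opening the entry span. Here is a minimal sketch, assuming a single hypothetical header key "sw-sketch" whose value joins trace ID, parent segment ID and parent span ID with '|'. The real header name and encoding differ between SkyWalking versions and are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified trace-context propagation through an RPC attachments map.
// The header key and value layout are hypothetical, not SkyWalking's wire format.
class PropagationSketch {
    static final String HEADER = "sw-sketch";

    // Minimal context a downstream service needs to link its segment to the caller.
    record TraceContext(String traceId, String parentSegmentId, int parentSpanId) {}

    // Consumer side (exit span): serialize the context into the outgoing attachments.
    static void inject(TraceContext ctx, Map<String, String> attachments) {
        attachments.put(HEADER, ctx.traceId() + "|" + ctx.parentSegmentId() + "|" + ctx.parentSpanId());
    }

    // Provider side (entry span): rebuild the context; null means there is no upstream trace.
    static TraceContext extract(Map<String, String> attachments) {
        String value = attachments.get(HEADER);
        if (value == null) {
            return null;
        }
        String[] parts = value.split("\\|");
        if (parts.length != 3) {
            return null; // malformed header: treat it as no upstream trace
        }
        return new TraceContext(parts[0], parts[1], Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        Map<String, String> attachments = new HashMap<>(); // stands in for RpcContext attachments
        inject(new TraceContext("47.34.15657723821280001", "segment-1", 2), attachments);
        System.out.println(attachments);
        System.out.println(extract(attachments));
    }
}
```

This is why the provider's entry span ends up in the same trace as the consumer's exit span, and why all four services in the logging example print the same TID.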
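When the call ends, afterMethod stops the span, and the span details are then sent to the collector (the data-collection backend). This happens in the stopSpan(AbstractSpan span) method of org.apache.skywalking.apm.agent.core.context.TracingContext. stopSpan always calls finish(), but finish() only completes and reports the segment once no span is left on the active span stack.
Here is the implementation of stopSpan: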
```java
@Override
public boolean stopSpan(AbstractSpan span) {
    AbstractSpan lastSpan = peek();
    if (lastSpan == span) {
        if (lastSpan instanceof AbstractTracingSpan) {
            AbstractTracingSpan toFinishSpan = (AbstractTracingSpan)lastSpan;
            if (toFinishSpan.finish(segment)) {
                pop();
            }
        } else {
            pop();
        }
    } else {
        throw new IllegalStateException("Stopping the unexpected span = " + span);
    }

    finish();

    return activeSpanStack.isEmpty();
}
```
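stopSpan relies on spans being stopped in strict LIFO order: only the span on top of the active stack may be stopped, and the return value reports whether the stack is now empty. A stripped-down simulation of that rule follows; SpanStackSketch is hypothetical, spans are reduced to names, and the async case where finish(segment) returns false is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the active-span stack in TracingContext (spans reduced to names).
class SpanStackSketch {
    private final Deque<String> activeSpanStack = new ArrayDeque<>();

    // Creating a span pushes it onto the stack.
    void createSpan(String span) {
        activeSpanStack.push(span);
    }

    // Mirrors the rule in stopSpan: only the most recently created span may be stopped.
    boolean stopSpan(String span) {
        String lastSpan = activeSpanStack.peek();
        if (lastSpan == null || !lastSpan.equals(span)) {
            throw new IllegalStateException("Stopping the unexpected span = " + span);
        }
        activeSpanStack.pop();
        // true means no span is active any more, so the segment can be finished and reported.
        return activeSpanStack.isEmpty();
    }

    public static void main(String[] args) {
        SpanStackSketch context = new SpanStackSketch();
        context.createSpan("entry /hello/start2");
        context.createSpan("exit dubbo");
        System.out.println(context.stopSpan("exit dubbo"));          // prints "false": entry span still open
        System.out.println(context.stopSpan("entry /hello/start2")); // prints "true": segment can finish
    }
}
```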
The logic that hands the data off for sending is in the finish method:
```java
/**
 * Finish this context, and notify all {@link TracingContextListener}s, managed by {@link
 * TracingContext.ListenerManager}
 */
private void finish() {
    if (isRunningInAsyncMode) {
        asyncFinishLock.lock();
    }
    try {
        if (activeSpanStack.isEmpty() && running && (!isRunningInAsyncMode || asyncSpanCounter.get() == 0)) {
            TraceSegment finishedSegment = segment.finish(isLimitMechanismWorking());
            /*
             * Recheck the segment if the segment contains only one span.
             * Because in the runtime, can't sure this segment is part of distributed trace.
             *
             * @see {@link #createSpan(String, long, boolean)}
             */
            if (!segment.hasRef() && segment.isSingleSpanSegment()) {
                if (!samplingService.trySampling()) {
                    finishedSegment.setIgnore(true);
                }
            }

            /*
             * Check that the segment is created after the agent (re-)registered to backend,
             * otherwise the segment may be created when the agent is still rebooting and should
             * be ignored
             */
            if (segment.createTime() < RemoteDownstreamConfig.Agent.INSTANCE_REGISTERED_TIME) {
                finishedSegment.setIgnore(true);
            }

            // Notify the listeners of the tracing context; the listeners send the data to the collector.
            TracingContext.ListenerManager.notifyFinish(finishedSegment);

            running = false;
        }
    } finally {
        if (isRunningInAsyncMode) {
            asyncFinishLock.unlock();
        }
    }
}
```
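finish() follows an observer pattern: once the span stack is empty, the segment is closed, possibly marked as ignored (a single-span segment with no upstream reference that fails the sampling check, or a segment created before the agent registered), and then every registered listener is notified; per the comment in the code, a listener is what sends the data to the collector. Below is a reduced sketch of that flow. Segment, the sampler and the listener list are hypothetical simplifications, not the agent's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

// Reduced model of TracingContext.finish() and listener notification (not the real classes).
class FinishSketch {
    static final class Segment {
        final boolean hasRef;   // linked to an upstream segment in another process
        final int spanCount;
        final long createTime;
        boolean ignore;

        Segment(boolean hasRef, int spanCount, long createTime) {
            this.hasRef = hasRef;
            this.spanCount = spanCount;
            this.createTime = createTime;
        }
    }

    private final List<Consumer<Segment>> listeners = new ArrayList<>();
    private final BooleanSupplier trySampling;  // stands in for samplingService.trySampling()
    private final long agentRegisteredTime;     // stands in for INSTANCE_REGISTERED_TIME

    FinishSketch(BooleanSupplier trySampling, long agentRegisteredTime) {
        this.trySampling = trySampling;
        this.agentRegisteredTime = agentRegisteredTime;
    }

    void addListener(Consumer<Segment> listener) {
        listeners.add(listener);
    }

    // Called once the active span stack is empty.
    void finish(Segment segment) {
        // A lone span without an upstream reference may not be part of a distributed trace: sample it.
        if (!segment.hasRef && segment.spanCount == 1 && !trySampling.getAsBoolean()) {
            segment.ignore = true;
        }
        // Segments created before the agent registered with the backend are ignored.
        if (segment.createTime < agentRegisteredTime) {
            segment.ignore = true;
        }
        // Listeners are always notified; each one decides what to do with ignored segments.
        for (Consumer<Segment> listener : listeners) {
            listener.accept(segment);
        }
    }

    public static void main(String[] args) {
        List<Segment> sent = new ArrayList<>();
        FinishSketch context = new FinishSketch(() -> false, 100L); // sampler rejects every lone span
        // Hypothetical reporter listener: only non-ignored segments are "sent".
        context.addListener(s -> { if (!s.ignore) sent.add(s); });
        context.finish(new Segment(true, 3, 200L));  // part of a distributed trace: kept
        context.finish(new Segment(false, 1, 200L)); // lone span rejected by sampling: ignored
        context.finish(new Segment(true, 2, 50L));   // created before registration: ignored
        System.out.println("segments sent: " + sent.size()); // prints "segments sent: 1"
    }
}
```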
6. Limitations of SkyWalking
Just effect frameworks or libraries.
Because of the changing codes by agents, it also means the codes are already known by agent plugin developers.
So, there is always a supported list in this kind of probes. Like SkyWalking Java agent supported list.

Across thread can't be supported all the time.
Like we said about in process propagation, most codes run in a single thread per request, especially business codes.
But in some other scenarios, they do things in different threads, such as job assignment, task pool or batch process.
Or some languages provide coroutine or similar thing like Goroutine, then developer could run async process with low payload, even been encouraged. In those cases, auto instrument will face problems.
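In summary:
1. Only known frameworks and libraries are supported. If a middleware you use is not supported yet, you have to write your own plugin.
2. Automatic instrumentation does not work across threads, for example in job assignment, task pools and batch processing.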
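The cross-thread limitation comes from tracing context being kept per thread. The sketch below uses a ThreadLocal as a stand-in for the agent's context: a task submitted to a thread pool does not see the caller's trace ID unless the caller captures it and the task restores it. The wrap helper is a hypothetical illustration of that manual capture-and-restore pattern, not a SkyWalking API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Why per-thread trace context does not follow work into a thread pool, plus a manual fix.
class CrossThreadSketch {
    // Stand-in for the agent's per-thread tracing context.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Hypothetical helper: capture the caller's trace ID now, restore it in the worker thread.
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker's previous state so pooled threads do not leak context.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    // Returns {trace ID seen by a plain task, trace ID seen by a wrapped task}.
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set("47.34.15657723821280001");
            Callable<String> readTraceId = TRACE_ID::get;
            String plain = pool.submit(readTraceId).get();         // worker thread has no context
            String wrapped = pool.submit(wrap(readTraceId)).get(); // context carried over explicitly
            return new String[] {plain, wrapped};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println("plain task sees: " + seen[0]);   // prints "plain task sees: null"
        System.out.println("wrapped task sees: " + seen[1]); // prints "wrapped task sees: 47.34.15657723821280001"
    }
}
```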