Logstash, an open source tool released by Elastic, is designed to ingest and transform data. It was originally built to be a log-processing pipeline to ingest logging data into ElasticSearch. Several versions later, it can do much more.
At its core, Logstash is a form of Extract-Transform-Load (ETL) pipeline. Unstructured log data is extracted, filters transform it, and the results are loaded into some form of data store.
Logstash can take a single line of text, such as a syslog message, and transform it into a much richer data structure.
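For illustration only (the values here are hypothetical), consider a failed-login line from sshd:
Sep 11 14:12:53 host.example.com sshd[12345]: Failed password for invalid user admin from 192.0.2.10 port 51772 ssh2
After the filter stage described below runs, that line might become an event along these lines, with sshd_action and sshd_tuple among the added fields:
{
  "@timestamp": "2017-09-11T14:12:53.000Z",
  "host": "host.example.com",
  "program": "sshd",
  "pid": "12345",
  "message": "Failed password for invalid user admin from 192.0.2.10 port 51772 ssh2",
  "sshd_action": "Failed password",
  "sshd_tuple": "invalid user admin from 192.0.2.10 port 51772 ssh2"
}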
Logstash has a three-stage pipeline implemented in JRuby:
The input stage plugins extract data. This can be from logfiles, a TCP or UDP listener, one of several protocol-specific plugins such as syslog or IRC, or even queuing systems such as Redis, AMQP, or Kafka. This stage tags incoming events with metadata surrounding where the events came from.
The filter stage plugins transform and enrich the data. This is the stage that produces the sshd_action and sshd_tuple fields in the example above. This is where you'll find most of Logstash's value.
The output stage plugins load the processed events into something else, such as ElasticSearch or another document database, or a queuing system such as Redis, AMQP, or Kafka. It can also be configured to communicate with an API. It is also possible to hook up something like PagerDuty to your Logstash outputs.
Have a cron job that checks whether your backups completed successfully? It can issue an alarm into the logging stream. This is picked up by an input, a filter configured to catch those events tags them, and a conditional output then knows the event is for it. This is how you can add alarms to scripts that would otherwise need to create their own notification layers, or that operate on systems that aren't allowed to communicate with the outside world.
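As a hedged sketch of that pattern (the program name backup_check, the message format, and the PagerDuty settings here are assumptions, not taken from the article's own examples):
filter {
  if [program] == "backup_check" and "status=failed" in [message] {
    mutate {
      add_tag => [ 'backup_alarm' ]
    }
  }
}
output {
  if "backup_alarm" in [tags] {
    pagerduty {
      # Assumed settings of the pagerduty { } output plugin; substitute your own integration key.
      service_key => "YOUR-PAGERDUTY-SERVICE-KEY"
      description => "Backup failed on %{host}"
    }
  }
}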
In general, each input runs in its own thread. The filter and output stages are more complicated. In Logstash 1.5 through 2.1, the filter stage had a configurable number of threads, with the output stage occupying a single thread. That changed in Logstash 2.2, when the filter-stage threads were built to handle the output stage. With one fewer internal queue to keep track of, throughput improved with Logstash 2.2.
If you're running an older version, it's worth upgrading to at least 2.2. When we moved from 1.5 to 2.2, we saw a 20-25% increase in overall throughput. Logstash also spent less time in wait states, so we used more of the CPU (47% vs 75%).
Logstash can take a single file or a directory for its configuration. If a directory is given, it reads the files in lexical order. This is important, as ordering is significant for filter plugins (we'll discuss that in more detail later).
Here is a bare Logstash config file:
input { }
filter { }
output { }
Each of these will contain zero or more plugin configurations, and there can be multiple blocks.
An input section can look like this:
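(A minimal reconstruction, assuming the stock syslog { } input plugin and the port and type values described just below.)
input {
  syslog {
    port => 514
    type => "syslog_server"
  }
}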
This tells Logstash to open the syslog { } plugin on port 514 and set the document type for each event coming in through that plugin to syslog_server. This plugin follows RFC 3164 only, not the newer RFC 5424.
Here is a slightly more complex input block:
# Pull in syslog data
input {
  file {
    path => [
      "/var/log/syslog",
      "/var/log/auth.log"
    ]
    type => "syslog"
  }
}

# Pull in application-log data. They emit data in JSON form.
input {
  file {
    path => [
      "/var/log/app/worker_info.log",
      "/var/log/app/broker_info.log",
      "/var/log/app/supervisor.log"
    ]
    exclude => "*.gz"
    type => "applog"
    codec => "json"
  }
}
This one uses two different input { } blocks to call different invocations of the file { } plugin: one tracks system-level logs, the other tracks application-level logs. Using two different input { } blocks means a Java thread is spawned for each one. On a multi-core system, different cores keep track of the configured files; if one thread blocks, the other will continue to function.
Both of these file { } blocks could be put into the same input { } block; they would simply run in the same thread. Logstash doesn't really care.
The filter section is where you transform your data into something that's newer and easier to work with. Filters can get quite complex. Here are a few examples of filters that accomplish different goals:
filter {
  if [program] == "metrics_fetcher" {
    mutate {
      add_tag => [ 'metrics' ]
    }
  }
}
In this example, if the program field, populated by the syslog plugin in the example input at the top, reads metrics_fetcher, then it tags the event metrics. This tag could be used in a later filter plugin to further enrich the data.
filter {
  if "metrics" in [tags] {
    kv {
      source => "message"
      target => "metrics"
    }
  }
}
This one runs only if metrics is in the list of tags. It then uses the kv { } plugin to populate a new set of fields based on the key=value pairs in the message field. These new keys are placed as sub-fields of the metrics field, allowing the text pages_per_second=42 faults=0 to become metrics.pages_per_second = 42 and metrics.faults = 0 on the event.
Why wouldn't you just put this in the same conditional that set the tag value? Because there are multiple ways an event could get the metrics tag; this way, the kv filter will handle them all.
Making sure the filter that sets the metrics tag is run before the conditional that checks for it is important. Here are guidelines to ensure your filter sections are optimally ordered:
Regularize your field data types. A field such as priority could be boolean, integer, or string. The mutate { } plugin is helpful here, as it has methods to coerce fields into specific data types.
Here are useful plugins to extract fields from long strings:
kv: turns strings like backup_state=failed progress=0.24 into fields you can perform operations on.
grok: if you need to turn the text The accounting backup failed into something that will pass if [backup_status] == 'failed', this will do it.
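A hedged sketch of that grok usage (the pattern and the backup_name field are illustrative, not from the article):
filter {
  grok {
    # "The accounting backup failed" becomes backup_name => "accounting"
    # and backup_status => "failed".
    match => { "message" => "^The %{WORD:backup_name} backup %{WORD:backup_status}$" }
  }
}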
Elastic would like you to send it all into ElasticSearch, but anything that can accept a JSON document, or the data structure it represents, can be an output. Keep in mind that events can be sent to multiple outputs. Consider this example of metrics:
output {
  # Send to the local ElasticSearch port, and rotate the index daily.
  elasticsearch {
    hosts => [
      "localhost",
      "logelastic.prod.internal"
    ]
    template_name => "logstash"
    index => "logstash-%{+YYYY.MM.dd}"
  }

  if "metrics" in [tags] {
    influxdb {
      host => "influx.prod.internal"
      db => "logstash"
      measurement => "appstats"
      # This next bit only works because it is already a hash.
      data_points => "%{metrics}"
      send_as_tags => [ 'environment', 'application' ]
    }
  }
}
Remember the metrics example above? This is how we can output it. The events tagged metrics will get sent to ElasticSearch in their full event form. In addition, the subfields under the metrics field on that event will be sent to InfluxDB, in the logstash database, under the appstats measurement. Along with the measurements, the values of the environment and application fields will be submitted as indexed tags.
There are a great many outputs. Here are some grouped by type:
There are many more output plugins.
A codec is a special piece of the Logstash configuration. We saw one used in the file { } example above.
# Pull in application-log data. They emit data in JSON form.
input {
  file {
    path => [
      "/var/log/app/worker_info.log",
      "/var/log/app/broker_info.log",
      "/var/log/app/supervisor.log"
    ]
    exclude => "*.gz"
    type => "applog"
    codec => "json"
  }
}
In this case, the file plugin was configured to use the json codec. This tells the file plugin to expect a complete JSON data structure on every line in the file. If your logs can be emitted in a structure like this, your filter stage will be much shorter than it would be if you had to grok, kv, and csv your way into enrichment.
The json_lines codec is different in that it will separate events based on newlines in the feed. This is most useful when using something like the tcp { } input, when the connecting program streams JSON documents without re-establishing the connection each time.
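A minimal sketch of that setup, assuming a tcp { } input on an arbitrary port (5140 here):
input {
  tcp {
    port => 5140
    codec => "json_lines"
  }
}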
The multiline codec gets a special mention. As the name suggests, this is a codec you can put on an input to reassemble a multi-line event, such as a Java stack dump, into a single event.
input {
  file {
    path => '/var/log/stackdumps.log'
    type => 'stackdumps'
    codec => multiline {
      pattern => "^\s"
      what => "previous"
    }
  }
}
This codec tells the file plugin to treat any log line that starts with white space as belonging to the previous line. The contents of that line are appended to the message field, preceded by a newline. Once it hits a log line that doesn't start with white space, it will close the event and submit it to the filter stage.
Warning: Due to the highly distributed nature of Logstash, the multiline codec needs to be run as close to the log source as possible. If it reads the file directly, that's perfect. If the events are coming through another system, such as a centralized syslog system, reassembly into a single event will be more challenging.
Logstash can scale from all-in-one boxes up to gigantic infrastructures that require complex event routing before events are processed to satisfy different business owners.
In this example, Logstash is running on each of the four application boxes. Each independent config sends processed events to a centralized ElasticSearch cluster. This can scale quite far, but it means your log-processing resources are competing with your application resources.
This example shows an existing centralized logging infrastructure based on Syslog that we are adding onto. Here, Logstash is installed on the centralized logging box and configured to consume the file output of rsyslog. The processed results are then sent into ElasticSearch.
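A hedged sketch of that second layout (the log path is an assumption about where rsyslog writes its file output; the ElasticSearch host reuses the one from the output example above):
input {
  file {
    # Wherever rsyslog is configured to write the consolidated log files.
    path => "/var/log/remote/*.log"
    type => "syslog"
  }
}
output {
  elasticsearch {
    hosts => [ "logelastic.prod.internal" ]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}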
Learn more in Jamie Riedesel's talk, S, M, and L Logstash Architectures: The Foundations, at LISA17, which will be held October 29-November 3 in San Francisco, California.