---
date: "2022-03-03T19:05:14Z"
title: Syslog to Telegram
aliases:
- /s/articles/2022/03/03/syslog-telegram.html
---
## Introduction
From time to time, I wish I could be made aware of failures earlier. There are two events, in particular,
that I want to know about very quickly, as they may impact service at AS8298:

1. _Open Shortest Path First_ (OSPF) adjacency removals. OSPF is a link-state protocol: when a physical
   link goes down, it knows that the peer (neighbor) is no longer reachable, and it can recompute paths
   to other routers fairly quickly. But if the link stays up while connectivity across it is interrupted,
   for example because there is a switch in the path, this can take a relatively long time to detect.
1. _Bidirectional Forwarding Detection_ (BFD) session timeouts. BFD sets up a rapid (for example every
   50ms, ie 20Hz) unidirectional UDP stream between two hosts. If a number of packets (for example
   40 packets, or 2 seconds worth) are not received, the link can be assumed to be dead.

Notably, [BIRD](https://bird.network.cz/), as many other vendors do, can combine the two. At IPng, each
OSPF adjacency is protected by BFD. What happens is that once an OSPF enabled link comes up, OSPF _Hello_
packets are periodically transmitted (with a period called a _Hello Timer_, typically once every
10 seconds). When a number of these are missed (a window called the _Dead Timer_, typically 40 seconds),
the neighbor is considered missing in action and the session is cleaned up.
To help recover from link failure faster than 40 seconds, a BFD session can be set up to any
neighbor that sends a _Hello_ packet. From then on, BFD will send a steady stream of UDP packets, and
expect the neighbor to do the same. If BFD detects a timeout, it can inform BIRD to take action well
before the OSPF _Dead Timer_ expires.
Some operators use very strict timers, for example 10ms intervals and 5 missed packets, or 50ms (!!) of
total timeout. But at IPng, in the typical example above, I instruct BFD to send packets every 50ms, and
time out after 40 missed packets, or two (2) seconds of link downtime. Considering BIRD+VPP converge a
full routing table in about 7 seconds, that gives me an end-to-end recovery time of under 10 seconds,
which is respectable, all the while avoiding triggering on false positives.
I'd like to be made aware of these events, which could signal a darkfiber cut, a WDM optic failure,
an EoMPLS (ie _Virtual Leased Line_ or VLL) failure, or a non-recoverable VPP dataplane crash. To a lesser
extent, being made explicitly aware of changes in BGP adjacencies to downstream (IP Transit customers) or
upstream (IP Transit providers) neighbors can be useful.
### Syslog NG
There are two parts to this. First, I want a (set of) central receiver servers, each of which will
receive messages from the routers in the field. I decide on three servers: the main one being
`nms.ipng.nl`, which runs LibreNMS, and two further read-only route collectors: `rr0.ddln0.ipng.ch` at
our own DDLN [colocation]({{< ref "2022-02-24-colo" >}}) in Zurich, and `rr0.nlams0.ipng.ch` running
at Coloclue in DCG, Amsterdam.
Of course, it would be a mistake to use UDP as a transport for messages that discuss potential network
outages. Having receivers in multiple places in the network does help a little bit, but I decide to
configure the server (and, later, the clients) to use TCP. This way, messages are queued to be sent,
and if the TCP connection has to be rerouted while the underlying network converges, I can be pretty
certain that the messages will arrive at the central logserver _eventually_.
#### Syslog Server
The configuration for each of the receiving servers is the same, and very straightforward:
```
$ cat << EOF | sudo tee /etc/syslog-ng/conf.d/listen.conf
template t_remote {
  template("\$ISODATE \$FULLHOST_FROM [\$LEVEL] \${PROGRAM}: \${MESSAGE}\n");
  template_escape(no);
};
source s_network_tcp {
  network( transport("tcp") ip("::") ip-protocol(6) port(601) max-connections(300) );
};
destination d_ipng { file("/var/log/ipng.log" template(t_remote) template-escape(no)); };
log { source(s_network_tcp); destination(d_ipng); };
EOF
$ sudo systemctl restart syslog-ng
```
First, I define a _template_ which logs in a consistent and predictable manner. Then, I configure a
_source_ which listens on IPv4 and IPv6 on TCP port 601 and allows more than the default 10
connections. I configure a _destination_ into a file, using the template. Finally, I tie the log source
to the destination, and restart `syslog-ng`.
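A quick smoke test for the listener is util-linux's `logger`, pointed at the server over TCP port 601
with RFC5424 framing (assuming `logger` is installed on the sending machine):

```
$ logger -n nms.ipng.nl -T -P 601 --rfc5424 "hello from nlams0"
```

The message should then show up in `/var/log/ipng.log`, formatted by the `t_remote` template.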
One thing that took me a while to realize is that for `syslog-ng`, the parser applied to incoming
messages is different depending on the port used ([ref](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/17)):
* 514, both TCP and UDP, for [RFC3164](https://datatracker.ietf.org/doc/html/rfc3164) (BSD-syslog) formatted traffic
* 601 TCP, for [RFC5424](https://datatracker.ietf.org/doc/html/rfc5424) (IETF-syslog) formatted traffic
* 6514 TCP, for TLS-encrypted traffic (of IETF-syslog messages)
After seeing malformed messages in the syslog, notably with duplicated host/program/timestamp fields, I
ultimately understood that this was because I was sending RFC5424 style messages to an RFC3164 enabled
port (514). Once I moved the transport to port 601, the parser matched and the loglines were correct.
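To illustrate the difference, here's roughly what the same logline looks like in each framing
(hand-constructed examples, not actual captures):

```
<31>Mar  3 19:05:14 nlams0 bird[1234]: bfd1: Session to 194.1.163.1 changed state        # RFC3164 (BSD-syslog)
<31>1 2022-03-03T19:05:14Z nlams0.ipng.ch bird 1234 - - bfd1: Session to 194.1.163.1 ... # RFC5424 (IETF-syslog)
```

An RFC3164 parser reading the RFC5424 variant treats most of that header as message payload, which is
exactly where the duplicated fields came from.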
And another detail -- I feel a little bit proud of myself for not forgetting to add a logrotate entry
for this new logfile, keeping 10 days worth of compressed logs:
```
$ cat << EOF | sudo tee /etc/logrotate.d/syslog-ng-ipng
/var/log/ipng.log
{
  rotate 10
  daily
  missingok
  notifempty
  delaycompress
  compress
  postrotate
    invoke-rc.d syslog-ng reload > /dev/null
  endscript
}
EOF
```
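A dry run with logrotate's debug flag confirms the new stanza parses cleanly, without actually rotating
anything:

```
$ sudo logrotate -d /etc/logrotate.d/syslog-ng-ipng
```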
I open up the firewall on these new syslog servers for TCP port 601, accepting traffic from any loopback
addresses on AS8298's network.
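What that looks like depends on the firewall in use; with nftables, for example, something along these
lines (assuming an existing `inet filter` table, and with illustrative prefixes rather than IPng's
actual loopback ranges):

```
$ sudo nft add rule inet filter input ip  saddr 194.1.163.0/24    tcp dport 601 accept
$ sudo nft add rule inet filter input ip6 saddr 2001:678:d78::/48 tcp dport 601 accept
```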
#### Syslog Clients
On the clients, I install `syslog-ng-core` (which avoids all of the extra packages). On the routers, I
have to make sure that the syslog-ng daemon runs in the `dataplane` namespace, otherwise it will not
have connectivity to send its messages. And, quite importantly, I should make sure that the
TCP connections are bound to the loopback address of the router, not to any arbitrary interface address,
as those interfaces could go down, rendering the TCP connection useless. So taking `nlams0.ipng.ch` as
an example, here's a configuration snippet:
```
$ sudo apt install syslog-ng-core
$ sudo sed -i -e 's,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,' \
    /lib/systemd/system/syslog-ng.service
$ LO4=194.1.163.32
$ LO6=2001:678:d78::8
$ cat << EOF | sudo tee /etc/syslog-ng/conf.d/remote.conf
destination d_nms_tcp { tcp("194.1.163.89" localip("$LO4") port(601)); };
destination d_rr0_nlams0_tcp { tcp("2a02:898:146::4" localip("$LO6") port(601)); };
destination d_rr0_ddln0_tcp { tcp("2001:678:d78:4::1:4" localip("$LO6") port(601)); };
filter f_bird { program(bird); };
log { source(s_src); filter(f_bird); destination(d_nms_tcp); };
log { source(s_src); filter(f_bird); destination(d_rr0_nlams0_tcp); };
log { source(s_src); filter(f_bird); destination(d_rr0_ddln0_tcp); };
EOF
$ sudo systemctl restart syslog-ng
```
Here, I simply create three _destination_ entries, one for each log-sink. Then I create a _filter_
that matches only the logs sent by BIRD. You can imagine that later, I can add other things
to this -- for example `keepalived` for VRRP failovers. Finally, I tie these together by applying
the filter to the source and sending the result to each syslog server.
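To double-check that the connections are established and indeed bound to the loopback address, I can
inspect the sockets inside the `dataplane` namespace:

```
$ sudo ip netns exec dataplane ss -tnp | grep ':601'
```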
So far, so good.
### Bird
For consistency (although not strictly necessary for the logging and further handling), I add
ISO date timestamping and enable logging to syslog in `/etc/bird/bird.conf`:
```
timeformat base iso long;
timeformat log iso long;
timeformat protocol iso long;
timeformat route iso long;
log syslog all;
```
And for the two protocols of interest, I add `debug { events };`. Note the `bfd on` stanza in the OSPF
interfaces -- this instructs BIRD to create a BFD session for each of the neighbors found on such an
interface, and, if BFD were to fail, to tear down the adjacency faster than the regular _Dead Timer_
would.
```
protocol bfd bfd1 {
  debug { events };
  interface "*" { interval 50 ms; multiplier 40; };
}

protocol ospf v2 ospf4 {
  debug { events };
  ipv4 { export filter ospf_export; import all; };
  area 0 {
    interface "loop0" { stub yes; };
    interface "xe1-3.100" { type pointopoint; cost 61; bfd on; };
    interface "xe1-3.200" { type pointopoint; cost 75; bfd on; };
  };
}
```
This will emit loglines for (amongst others) state changes on BFD neighbors and OSPF adjacencies.
There are a lot of messages to choose from, but I found that the following contain the minimally
needed information to convey links going down or up, both from BFD's point of view and from OSPF
and OSPFv3's. I can demonstrate this by making the link between Hippo and Rhino go down
(ie by shutting the switchport, or unplugging the cable). After that, I can see on `nms.ipng.nl`
that the logs start streaming in:
```
pim@nms:~$ tail -f /var/log/ipng.log | egrep '(ospf[46]|bfd1):.*changed state.*to (Down|Up|Full)'
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Up to Down
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Full to Down
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Up to Down
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Full to Down
2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Loading to Full
2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Init to Up
2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Loading to Full
2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Down to Up
```
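These transitions can be cross-checked on the router itself with BIRD's command line client (BIRD 2.x
syntax):

```
$ birdc show bfd sessions
$ birdc show ospf neighbors ospf4
```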
And there we have it: important events are detected and sent, using reliable TCP transport, to multiple
logging machines. These messages about BFD and OSPF adjacency changes now make it to a central place.
### Telegram Bot
{{< image width="150px" float="left" src="/assets/syslog-telegram/ptb-logo.png" alt="PTB" >}}
Of course I can go tail the logfile on one of the servers, but I think it'd be a bit more elegant to have
a computer do the pattern matching for me. One way might be to use the `syslog-ng` destination feature _program()_
([ref](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.22/administration-guide/43)),
which pipes these logs through a userspace process, receiving them on stdin and doing interesting things with
them, such as interacting with Telegram, the delivery mechanism of choice for IPng's monitoring systems.
Building such a Telegram enabled bot is very straightforward, thanks to the excellent documentation of the
Telegram API, and the existence of `python-telegram-bot` ([ref](https://github.com/python-telegram-bot/python-telegram-bot)).
However, to avoid tying my bot at the hip to `syslog-ng`, I decide to simply tail a number of
logfiles from the commandline (ie `pttb /var/log/*.log`) -- and here emerges the name of my little bot:
Python Telegram Tail Bot, or _pttb_ for short, which:
* Tails the syslog logstream from one or more files, ie `/var/log/ipng.log`
* Pattern matches on loglines, after which an `incident` is created
* Waits for a predefined number of seconds (which may be zero) to see if more loglines match, adding them to
  the `incident`
* Holds the `incident` against a list of known regular expression `silences`, throwing away those which
  aren't meant to be distributed
* Sends those incidents which aren't silenced to a predefined group chat (see the sketch below)
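
As a sketch of the idea (explicitly not the actual `pttb` code), the core loop might look something like
the following, using `python-telegram-bot`'s v13 synchronous API; the token, chat-id, trigger and grace
period are placeholder values:

```
#!/usr/bin/env python3
# Minimal sketch of the pttb idea -- not the actual pttb code. Tails one
# logfile, matches a single hardcoded trigger, coalesces matches for a grace
# period, and sends the resulting incident to a Telegram group chat.
import re
import time

import telegram  # python-telegram-bot, v13.x synchronous API

TOKEN = "123456:ABC..."  # placeholder bot token
CHAT_ID = -100123456789  # placeholder group chat id
TRIGGER = re.compile(r"(ospf[46]|bfd1):.*changed state")
GRACE = 5  # seconds to coalesce further matching loglines


def tail(path):
    """Yield new lines appended to path, like `tail -f`; yield None when idle."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                yield None  # no new data; lets the caller check its timers
                time.sleep(0.5)


def main():
    bot = telegram.Bot(token=TOKEN)
    incident, deadline = [], None
    for line in tail("/var/log/ipng.log"):
        if line is not None and TRIGGER.search(line):
            incident.append(line)
            deadline = deadline or time.time() + GRACE
        if deadline and time.time() >= deadline:
            bot.send_message(chat_id=CHAT_ID, text="\n".join(incident))
            incident, deadline = [], None


if __name__ == "__main__":
    main()
```

The real bot adds the silence list, multiple file and trigger support, and the Telegram command handling
described below.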
The bot should allow for the following features, based on a YAML configuration file (an example follows
after the list), which will allow it to be restarted and upgraded:

* A (mandatory) `TOKEN` to interact with the Telegram API
* A (mandatory) single `chat-id` - messages will be sent to this Telegram group chat
* An (optional) list of logline triggers, consisting of:
  * a regular expression to match in the logstream
  * a grace period to coalesce additional loglines of the same trigger into the incident
  * a description to send once the incident is sent to Telegram
* An (optional) list of silences, consisting of:
  * a regular expression to match any incident message data against
  * an expiry timestamp
  * a description carrying the reason for the silence
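
For illustration, such a configuration file might look like this (the field names are my guess at a
plausible shape, not necessarily pttb's exact schema):

```
token: "123456:ABC..."  # Telegram bot token (placeholder)
chat-id: -100123456789  # group chat to send incidents to
triggers:
- regexp: '(ospf[46]|bfd1):.*changed state'
  grace: 10  # seconds, coalesce further matches into the incident
  message: 'OSPF/BFD adjacency change'
silences:
- regexp: 'hippo.btl.ipng.ch'
  expiry: '2022-03-05T00:00:00Z'
  description: 'Hippo maintenance window'
```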
The bot will start up, announce itself on the `chat-id` group, and then listen on Telegram for the following
commands:
* **/help** - list the available commands
* **/trigger** - without parameters, list the current triggers
* **/trigger add &lt;regexp&gt; [duration] [&lt;message&gt;]** - set a trigger on a regular expression. Optionally,
  add a duration in seconds between [0..3600>, within which additional matched loglines will be added to the
  same incident, and an optional message to include in the Telegram alert.
* **/trigger del &lt;idx&gt;** - remove the trigger with that index (use /trigger to see the list).
* **/silence** - without parameters, list the current silences.
* **/silence add &lt;regexp&gt; [duration] [&lt;reason&gt;]** - set a silence on a regular expression, by default
  for 1d; optionally add a duration in the form of `[1-9][0-9]*([hdm])`, which defaults to hours (and can be days
  or minutes), and an optional reason for the silence.
* **/silence del &lt;idx&gt;** - remove the silence with that index (use /silence to see the list).
* **/stfu [duration]** - a shorthand for a silence with regular expression `.*`, which will suppress all
  notifications, with a duration as in the **/silence add** subcommand.
* **/stats** - show some runtime statistics, notably how many loglines were processed, how many incidents were
  created, and how many were sent or suppressed due to a silence.
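
As a usage example, the BFD flaps from the demo above might be caught, and then muted during a
maintenance on Hippo, with something like this (exact argument parsing is up to the bot):

```
/trigger add bfd1:.*changed.state.*to.Down 10 BFD session down
/silence add hippo.btl.ipng.ch 2d Hippo maintenance
/stfu 30m
```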
It will save its configuration file any time a silence or trigger is added or deleted. It will (obviously) then
start sending incidents to the `chat-id` group-chat when they occur.
## Results
And a few fun hours of hacking later, I submitted a first rough approximation of a useful syslog scanner
Telegram bot to [Github](https://github.com/pimvanpelt/python-telegram-tail-bot). It does seem to work,
although not all functions are implemented yet (I'll get them done in the month of March, probably):
{{< image src="/assets/syslog-telegram/demo-telegram.png" alt="PTTB" >}}
So now I'll be pretty quickly and elegantly kept up to date by this logscanner, in addition to my already
existing LibreNMS logging, monitoring and alerting. If you find this stuff useful, feel free to grab a copy
from [Github](https://github.com/pimvanpelt/python-telegram-tail-bot); the code is open source, licensed
under the liberal Apache 2.0 license, and based on the excellent work of
[Python Telegram Bot](https://github.com/python-telegram-bot/python-telegram-bot).