---
date: "2022-11-27T00:01:14Z"
title: Mastodon - Part 3 - statsd and Prometheus
aliases:
- /s/articles/2022/11/27/mastodon-3.html
---

# About this series

{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}

I have seen companies achieve great successes in the consumer internet and entertainment industries. I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to
privately operated ones.

In my [[first post]({{< ref "2022-11-20-mastodon-1" >}})], I shared some thoughts on how I installed a Mastodon instance for myself. In a
[[followup post]({{< ref "2022-11-24-mastodon-2" >}})] I talked about its overall architecture and how one might use Prometheus to monitor
vital backends like Redis, Postgres and Elastic. But Mastodon _itself_ is also an application which can provide a wealth of telemetry using
a protocol called [[StatsD](https://github.com/statsd/statsd)].

In this post, I'll show how I tie these all together in a custom **Grafana Mastodon dashboard**!
## Mastodon Statistics

I noticed in the [[Mastodon docs](https://docs.joinmastodon.org/admin/config/#statsd)] that there's a one-liner breadcrumb which might be
easy to overlook, as it doesn't give many details:

> `STATSD_ADDR`: If set, Mastodon will log some events and metrics into a StatsD instance identified by its hostname and port.

Interesting, but what is this **statsd**, precisely? It's a simple text-only protocol that allows applications to send key-value pairs in
the form of `<metricname>:<value>|<type>` strings, which carry statistics of a certain _type_ across the network, using either TCP or UDP.
Cool! To make use of these stats, I first add the `STATSD_ADDR` environment variable from the docs to my `.env.production` file, pointing
it at `localhost:9125`. This should make the Mastodon apps emit some statistics.
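In `.env.production` that is just one extra line (the port is my choice, where I'll point the exporter later, not a Mastodon default):

```
# .env.production -- send application statistics to a local statsd listener
STATSD_ADDR=localhost:9125
```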
I decide to take a look at those packets, instructing tcpdump to show the contents of the packets (using the `-A` flag). Considering my
destination is `localhost`, I also know which interface to tcpdump on (using the `-i lo` flag). My first attempt is a little bit noisy,
because the packet dump contains the [[IPv4 header](https://en.wikipedia.org/wiki/IPv4)] (20 bytes) and [[UDP
header](https://en.wikipedia.org/wiki/User_Datagram_Protocol)] (8 bytes) as well, but sure enough, if I start reading from the 28th byte
onwards, I get human-readable data, in a bunch of strings that start with `Mastodon`:

```
pim@ublog:~$ sudo tcpdump -Ani lo port 9125 | grep 'Mastodon.production.' | sed -e 's,.*Mas,Mas,'
Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.processing_time:16272|ms
Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.success:1|c
Mastodon.production.sidekiq.scheduled_size:25|g
Mastodon.production.db.tables.accounts.queries.select.duration:1.8323479999999999|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.total_duration:33.856679|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.db_time:2.3943890118971467|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.view_time:1|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c
...
```
**statsd** organizes its variable names in a dot-delimited tree hierarchy. I can clearly see some patterns in here, but why guess
when you're working with Open Source? Mastodon turns out to be using a popular Ruby library called the [[National Statsd
Agency](https://github.com/localshred/nsa)], a wordplay that I don't necessarily find all that funny. Naming aside though, this library
collects application level statistics in four main categories:

1. **:action_controller**: listens to the ActionController class that is extended into ApplicationControllers in Mastodon
1. **:active_record**: listens to any database (SQL) queries and emits timing information for them
1. **:active_support_cache**: records information regarding caching (Redis) queries, and emits timing information for them
1. **:sidekiq**: listens to Sidekiq middleware and emits information about queues, workers and their jobs

Using the library's [[docs](https://github.com/localshred/nsa)], I can clearly see the patterns described. For example, in the SQL
recorder, the format will be `{ns}.{prefix}.tables.{table_name}.queries.{operation}.duration` where _operation_ means one of the
classic SQL query types: **SELECT**, **INSERT**, **UPDATE**, and **DELETE**. Similarly, in the cache recorder, the format will be
`{ns}.{prefix}.{operation}.duration` where _operation_ denotes one of **read_hit**, **read_miss**, **generate**, **delete**, and so on.
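Putting those templates next to the packet dump above, they expand to concrete metric names like these (the cache example is inferred from the documented format rather than quoted from a dump):

```
{ns}.{prefix}.tables.{table_name}.queries.{operation}.duration
  -> Mastodon.production.db.tables.accounts.queries.select.duration

{ns}.{prefix}.{operation}.duration
  -> Mastodon.production.cache.read_hit.duration
```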
Reading a bit more of the Mastodon and **statsd** library code, I learn that for all variables emitted, the namespace `{ns}` is always a
combination of the application name and the Ruby on Rails environment, i.e. **Mastodon.production**, and the `{prefix}` is the collector name,
one of **web**, **db**, **cache** or **sidekiq**. If you're curious, the Mastodon code that initializes the **statsd** collectors lives
in `config/initializers/statsd.rb`. Alright, I conclude that this is all I need to know about the naming schema.

Moving along, **statsd** gives each variable name a [[metric type](https://github.com/statsd/statsd/blob/master/docs/metric_types.md)], which
can be counters **c**, timers **ms** and gauges **g**. In the packet dump above you can see examples of each of these. The counter type in
particular is a little bit different -- applications emit increments here -- in the case of the ActivityPub.InboxesController, it
merely signaled to increment the counter by 1, not the absolute value of the counter. This is actually pretty smart, because now any number
of workers/servers can all contribute to a global counter, by each just sending increments which are aggregated by the receiver.
As a small critique, I happened to notice that in the sidekiq datastream, some of what I think are _counters_ are actually modeled as
_gauges_ (notably the **processed** and **failed** jobs from the workers). I will have to remember that, but after observing for a few
minutes, I think I can see lots of nifty data in here.

## Prometheus

At IPng Networks, we use Prometheus as a monitoring observability tool. It's worth pointing out that **statsd** has a few options itself to
visualise data, but considering I already have lots of telemetry in Prometheus and Grafana (see my [[previous post]({{< ref "2022-11-24-mastodon-2" >}})]), I'm going to take a bit of a detour, and convert these metrics into the Prometheus _exposition format_, so that
they can be scraped on a `/metrics` endpoint just like the others. This way, I have all monitoring in one place and using one tool.
Monitoring is hard enough as it is, and having to learn multiple tools is _no bueno_ :)

### Statsd Exporter: overview

The community maintains a Prometheus [[Statsd Exporter](https://github.com/prometheus/statsd_exporter)] on GitHub. This tool, like many
others in the exporter family, will connect to a local source of telemetry, and convert these into the required format for consumption by
Prometheus. If left completely unconfigured, it will simply receive the **statsd** UDP packets on the Mastodon side, and export them
verbatim on the Prometheus side. This has a few downsides; notably, when new operations or controllers come into existence, I would
have to explicitly make Prometheus aware of them.
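For instance, with no mapping configuration the exporter essentially flattens the dots into underscores, so every table, controller and operation becomes its own metric name, along these lines (illustrative, not copied from my instance):

```
Mastodon_production_db_tables_accounts_queries_select_duration
Mastodon_production_web_ActivityPub_InboxesController_create_json_status_202
```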
I think we can do better. Specifically, because of the patterns noted above, I can condense the many metricnames from **statsd** into a few
carefully chosen Prometheus metrics, and add their variability into _labels_ in those time series. Taking SQL queries as an example, I see
that there's a metricname for each known SQL table in Mastodon (and there are many) and then for each table, a unique metric is created for
each of the four operations:

```
Mastodon.production.db.tables.{table_name}.queries.select.duration
Mastodon.production.db.tables.{table_name}.queries.insert.duration
Mastodon.production.db.tables.{table_name}.queries.update.duration
Mastodon.production.db.tables.{table_name}.queries.delete.duration
```
What if I could rewrite these by capturing the `{table_name}` into a label and, further observing that there are four query types (SELECT,
INSERT, UPDATE, DELETE), capturing those into an `{operation}` label, like so:

```
mastodon_db_operation_sum{operation="select",table="users"} 85.910
mastodon_db_operation_sum{operation="insert",table="accounts"} 112.70
mastodon_db_operation_sum{operation="update",table="web_push_subscriptions"} 6.55
mastodon_db_operation_sum{operation="delete",table="web_settings"} 9.668
mastodon_db_operation_count{operation="select",table="users"} 28790
mastodon_db_operation_count{operation="insert",table="accounts"} 610
mastodon_db_operation_count{operation="update",table="web_push_subscriptions"} 380
mastodon_db_operation_count{operation="delete",table="web_settings"} 4
```

This way, there are only two Prometheus metric names, **mastodon_db_operation_sum** and **mastodon_db_operation_count**. The first one counts
the cumulative time spent performing operations of that type on the table, and the second one counts the total number of queries of that
type on the table. If I take the **rate()** of the count variable, I will have queries-per-second, and if I divide the **rate()** of the
time spent by the **rate()** of the count, I will have a running average time spent per query over that time interval.
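In PromQL, those two building blocks look roughly like this (the label values are just examples):

```
# Queries-per-second for SELECTs on the statuses table, averaged over one minute
rate(mastodon_db_operation_count{operation="select",table="statuses"}[1m])

# Running average duration (ms) per SELECT on that table over the same window
rate(mastodon_db_operation_sum{operation="select",table="statuses"}[1m])
  /
rate(mastodon_db_operation_count{operation="select",table="statuses"}[1m])
```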
### Statsd Exporter: configuration

The Prometheus folks also thought of this, _quelle surprise_, and the exporter provides incredibly powerful transformation functionality
between the hierarchical tree-form of **statsd** and the multi-dimensional labeling format of Prometheus. This is called the [[Mapping
Configuration](https://github.com/prometheus/statsd_exporter#metric-mapping-and-configuration)], and it allows either globbing or regular
expression matching of the input metricnames, turning them into labeled output metrics. Building further on our example for SQL queries, I
can create a mapping like so:

```
pim@ublog:~$ cat << 'EOF' | sudo tee /etc/prometheus/statsd-mapping.yaml
mappings:
  - match: Mastodon\.production\.db\.tables\.(.+)\.queries\.(.+)\.duration
    match_type: regex
    name: "mastodon_db_operation"
    labels:
      table: "$1"
      operation: "$2"
EOF
```

This snippet will use a regular expression to match input metricnames, carefully escaping the dot-delimiters. Within the input, I match
two groups: the segment following `tables.` holds the variable SQL table name, and the segment following `queries.` captures the SQL operation.
Once this matches, the exporter will give the resulting variable in Prometheus simply the name `mastodon_db_operation`, and add two labels
with the results of the regexp capture groups.

This one mapping shown above will take care of _all of the metrics_ from the **database** collector, but there are three other collectors in
Mastodon's Ruby world. In the interest of brevity, I'll not bore you with them in this article, as this is mostly a rinse-and-repeat jobbie.
But I have attached a copy of the complete mapping configuration at the end of this article. With all of that hard work on mapping
completed, I can now start the **statsd** exporter and see its beautifully formed and labeled timeseries show up on port 9102, the default
[[assigned port](https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exporters.md)] for this exporter type.
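Starting the exporter is then a matter of pointing it at the mapping file and at the same UDP port that Mastodon sends to; roughly like below (check the exporter's README for the authoritative flag names and defaults):

```
pim@ublog:~$ ./statsd_exporter \
    --statsd.listen-udp="localhost:9125" \
    --statsd.mapping-config="/etc/prometheus/statsd-mapping.yaml" \
    --web.listen-address="localhost:9102"
```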
## Grafana

First let me start by saying I'm incredibly grateful to all the folks who have contributed to existing exporters and Grafana dashboards,
notably for [[Node Exporter](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)],
[[Postgres Exporter](https://grafana.com/grafana/dashboards/9628-postgresql-database/)],
[[Redis Exporter](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)],
[[NGINX Exporter](https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/README.md)], and [[ElasticSearch
Exporter](https://grafana.com/grafana/dashboards/14191-elasticsearch-overview/)]. I'm ready to make a modest contribution back to this
wonderful community of monitoring dashboards, in the form of a Grafana dashboard for Mastodon!

Writing these is pretty rewarding. I'll take some time to explain a few Grafana concepts, although this is not meant to be a tutorial at
all and honestly, I'm not that good at this anyway. A good dashboard design goes from a 30'000ft overview of the most vital stats (not
necessarily graphs, but using visual cues like colors), and gives more information in so-called _drill-down_ dashboards that allow a much
finer grained / higher resolution picture of a specific part of the monitored application.

Seeing as the collectors emit telemetry for four main parts of the application (remember, the `{prefix}` is one of **web**, **db**, **cache**, or
**sidekiq**), I will give the dashboard the same structure. Also, I will try my best not to invent new terminology: the application
developers have given their telemetry certain names, and I will stick to these as well. Building a dashboard this way, application developers as
well as application operators will more likely be talking about the same things.

#### Mastodon Overview

{{< image src="/assets/mastodon/mastodon-stats-overview.png" alt="Mastodon Stats Overview" >}}

In the **Mastodon Overview**, each of the four collectors gets one or two stats-chips to present their highest level vital signs on. For a
web application, this will largely be requests per second, latency and possibly errors served. For a SQL database, this is typically the
issued queries and their latency. For a cache, the types of operation and again the latency observed in those operations. For Sidekiq (the
background worker pool that performs certain tasks in a queue on behalf of the system or user), I decide to focus on units of work, latency
and queue sizes.

Setting up the Prometheus queries in Grafana that fetch the data I need for these is typically going to be one of two things:

1. **QPS**: Taking the rate of the monotonically increasing *_count* over, say, one minute gives me the average queries-per-second. Considering
the counters I created have labels that tell me what they are counting (for example in Puma, which API endpoint is being queried, and
what format that request is using), I can now elegantly aggregate those application-wide, like so:
> sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))

2. **Latency**: The metrics in Prometheus also aggregate a monotonically increasing *_sum*, which tells me the total time spent doing
those things. It's pretty easy to calculate the running average latency over the last minute, by simply dividing the rate of time spent
by the rate of requests served, like so:
> sum by (mastodon)(rate(mastodon_controller_duration_sum[1m])) <br /> / <br />
> sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))

To avoid clutter, I will leave the detailed full resolution view (like which _controller_ exactly, and what _format_ was queried, and which
_action_ was taken in the API) to a drilldown below. These two patterns are continued throughout the overview panel. Each QPS value is
rendered in dark blue, while the latency gets a special treatment on colors: I define a threshold which I consider "unacceptable", and then
create a few thresholds in Grafana to change the color as I approach that unacceptable max limit. By means of example, the Puma Latency
element I described above will have a maximum acceptable latency of 250ms. If the latency is above 40% of that, the color will turn yellow;
above 60% it'll turn orange, and above 80% it'll turn red. This provides a visual cue that something may be wrong.
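In the panel JSON, that boils down to percentage-mode threshold steps against the panel's max; something like the sketch below (field names recalled from Grafana's panel schema, so treat it as illustrative rather than a copy of my dashboard):

```
"fieldConfig": {
  "defaults": {
    "unit": "ms",
    "min": 0,
    "max": 250,
    "thresholds": {
      "mode": "percentage",
      "steps": [
        { "color": "blue",   "value": null },
        { "color": "yellow", "value": 40 },
        { "color": "orange", "value": 60 },
        { "color": "red",    "value": 80 }
      ]
    }
  }
}
```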
#### Puma Controllers

The APIs that Mastodon offers are served by a component called [[Puma](https://github.com/puma/puma)], a simple, fast, multi-threaded, and
highly parallel HTTP 1.1 server for Ruby/Rack applications. The application running in Puma typically defines endpoints as so-called
ActionControllers, which Mastodon expands on in a derived concept called ApplicationControllers, which each have a unique _controller_ name
(for example **ActivityPub.InboxesController**), an _action_ performed on them (for example **create**, **show** or **destroy**), and a
_format_ in which the data is handled (for example **html** or **json**). For each cardinal combination, a set of timeseries (counter, time
spent and latency quantiles) will exist. At the moment, there are about 53 API controllers, 8 actions, and 4 formats, which means there are
1'696 interesting metrics to inspect.

Drawing all of these in one graph quickly turns into an unmanageable mess, but there's a neat trick in Grafana: what if I could make these
variables selectable, and maybe pin them to exactly one value (for example, all information with a specific _controller_)? That would
greatly reduce the amount of data we have to show. To implement this, the dashboard can pre-populate a variable based on a Prometheus query.
By means of example, to find the possible values of _controller_, I might take a look at all Prometheus metrics with name
**mastodon_controller_duration_count** and search for labels within them with a regular expression, for example **/controller="([^"]+)"/**.

What this will do is select all values in the group `"([^"]+)"`, which may seem a little bit cryptic at first. The logic behind it is: first
create a group between parentheses `(...)`, and then within that group match a set of characters `[...]`, where the set is all characters
except the double-quote `[^"]`, and that is repeated one-or-more times with the `+` suffix. So this will precisely select the string between
the double-quotes in the label: `controller="whatever"` will return `whatever` with this expression.
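Concretely, each of these becomes a Grafana dashboard variable of the "Query" type, looking roughly like this (the variable name is my own choice):

```
Name:   controller
Type:   Query
Query:  mastodon_controller_duration_count
Regex:  /controller="([^"]+)"/
```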
{{< image width="400px" float="right" src="/assets/mastodon/mastodon-puma-details.png" alt="Mastodon Puma Controller Details" >}}

After creating three of these, one for _controller_, _action_ and _format_, three new dropdown selectors appear at the top of my dashboard.
I will allow any combination of selections, including "All" of them (the default). Then, if I wish to drill down, I can pin one or more of
these variables to narrow down the total number of timeseries to draw.
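Behind the scenes, the panels simply feed those variables back into their PromQL; a sketch of the "all, unless pinned" style of query could look like this:

```
sum by (controller, action, format) (
  rate(mastodon_controller_duration_count{
    controller=~"$controller", action=~"$action", format=~"$format"
  }[1m])
)
```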
Shown to the right are two examples, one with "All" timeseries in the graph, which shows at least which one(s) are outliers. In this case,
the orange trace in the top graph showing more than average operations is the so-called **ActivityPub.InboxesController**.

I can find this out by hovering over the orange trace: the tooltip will show me the current name and value. Then, selecting this in the top
navigation dropdown **Puma Controller**, Grafana will narrow down the data for me to only those relevant to this controller, which is super
cool.

#### Drilldown in Grafana

{{< image width="400px" float="right" src="/assets/mastodon/mastodon-puma-controller.png" alt="Mastodon Puma Controller Details" >}}

Where the graph down below (called _Action Format Controller Operations_) showed all 1'600 or so timeseries, selecting the one controller I'm
interested in shows me a much cleaner graph with only three timeseries -- take a look to the right. Just by playing around with this data, I'm
learning a lot about the architecture of this application!

For example, I know that the only _action_ on this particular controller seems to be **create**, and there are three available _formats_ in which
this create action can be performed: **all**, **html** and **json**. And using the graph above that got me started on this little journey, I now
know that the traffic spike was for `controller=ActivityPub.InboxesController, action=create, format=all`. Dope!
#### SQL Details

{{< image src="/assets/mastodon/mastodon-sql-details.png" alt="Mastodon SQL Details" >}}

While I already have a really great [[Postgres Dashboard](https://grafana.com/grafana/dashboards/9628-postgresql-database/)] (the one that
came with the Postgres _server_), it is also good to be able to see what the _client_ is experiencing. Here, we can drill down on two variables,
called `$sql_table` and `$sql_operation`. For each {table,operation}-tuple, the average, median and 90th/99th percentile latency are
available. So I end up with the following graphs and dials for tail latency: the top left graph shows me something interesting -- most
queries are SELECT, but the bottom graph shows me lots of tables (at the time of this article, Mastodon has 73 unique SQL tables). If I
wanted to answer the question "which table gets the most SELECTs", I can drill down first by setting the **SQL Operation** to **select**,
after which I see decidedly fewer traces in the _SQL Table Operations_ graph. Further analysis shows that the two tables that are read
from the most are **statuses** and **accounts**. When I drill down using the selectors at the top of Grafana's dashboard UI, the tail
latency is automatically filtered to only that which is selected. If I were to see very slow queries at some point in the future, it'll be
very easy to narrow down exactly which table and which operation is the culprit.
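The dials themselves are just lookups on the summary quantiles that the exporter emits for timer metrics. Assuming the exporter's default 0.5/0.9/0.99 quantiles, the 99th percentile for SELECTs on the currently pinned table would be something like:

```
mastodon_db_operation{operation="select", table=~"$sql_table", quantile="0.99"}
```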
#### Cache Details

{{< image src="/assets/mastodon/mastodon-cache-details.png" alt="Mastodon Cache Details" >}}

For the cache statistics collector, I learn there are a few different operations. Similar to Postgres, I already have a really cool
[[Redis Dashboard](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)], for which I can see
the Redis _server_ view. But in Mastodon, I can now also see the _client_ view, and see when any of these operations spike in either
queries/sec (left graph), latency (middle graph), or tail latency for common operations (the dials on the right). This is bound to come in
handy at some point -- I already saw one or two spikes in the **generate** operation (see the blue spike in the screenshot above), which is
something to keep an eye on.

#### Sidekiq Details

{{< image src="/assets/mastodon/mastodon-sidekiq-details.png" alt="Mastodon Sidekiq Details" >}}

The single most interesting thing in the Mastodon application is undoubtedly its _Sidekiq_ workers, the ones that do all sorts of system- and
user-triggered work such as distributing posts to federated servers, prefetching links and media, and calculating trending tags, posts and links.
Sidekiq is a [[producer-consumer](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem)] system where new units of work (called
jobs) are written to a _queue_ in Redis by a producer (typically Mastodon's webserver Puma, or another Sidekiq task that needs something to
happen at some point in the future), and then consumed by one or more pools which execute the _worker_ jobs.

There are several queues defined in Mastodon, and each _worker_ has a name, a _failure_ and _success_ rate, and a running tally of how much
_processing_time_ it has spent executing this type of work. Sidekiq consumes jobs in
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] order, and it has a finite number of workers (by default, a small
instance runs one worker process with 25 threads). If you're interested in this type of provisioning, [[Nora
Tindall](https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/)] wrote a great article about it.

This drill-down dashboard shows all of the Sidekiq _worker_ types known to Prometheus, which can be selected at the top of the dashboard in
the dropdown called **Sidekiq Worker**. The total number of worker jobs/second, as well as the running average time spent performing those jobs,
is shown in the first two graphs. The three dials show the median, 90th percentile and 99th percentile latency of the work being performed.
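Those graphs boil down to the same _sum/_count trick as before, now grouped per worker. Since the full mapping file isn't reproduced here, take the metric name below as a hypothetical one, derived from the `sidekiq.<worker>.processing_time` timers in the packet dump:

```
# Jobs per second, per Sidekiq worker type
sum by (worker) (rate(mastodon_sidekiq_worker_processing_time_count[1m]))

# Running average time (ms) spent per job, per worker type
sum by (worker) (rate(mastodon_sidekiq_worker_processing_time_sum[1m]))
  /
sum by (worker) (rate(mastodon_sidekiq_worker_processing_time_count[1m]))
```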
If all threads are busy, new work is left in the queue until a worker thread is available to execute the job. This will lead to a queue
delay on a busy server that is underprovisioned. For jobs that had to wait for an available thread to pick them up, the number of jobs per
queue, and the time in seconds that the jobs were waiting to be picked up by a worker, are shown in the two lists at the bottom right.

And with that, as [[North of the Border](https://www.youtube.com/@NorthoftheBorder)] would say: _"We're on to the glamour shots!"_.

## What's next

I promised two references that are needed to successfully hook up Prometheus and Grafana to the `STATSD_ADDR` configuration for
Mastodon's Rails environment, and here they are:

* The Statsd Exporter mapping configuration file: [[/etc/prometheus/statsd-mapping.yaml](/assets/mastodon/statsd-mapping.yaml)]
* The Grafana Dashboard: [[grafana.com/dashboards/](https://grafana.com/grafana/dashboards/17492-mastodon-stats/)]

**As a call to action:** if you are running a larger instance and would allow me to take a look and learn from you, I'd be very grateful.

I'm going to monitor my own instance for a little while, so that I can start to get a feeling for where the edges of performance cliffs are,
in other words: How slow is _too slow_? How much load is _too much_? In an upcoming post, I will take a closer look at alerting in Prometheus,
so that I can catch these performance cliffs and make human operators aware of them by means of alerts, delivered via Telegram or Slack.

By the way: If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit
more load / traffic on this instance will allow me to learn (and in turn, to share with others)!