## Alert Types
There are basically two types of alerting mode:

- Organic mode: you keep a running screen with visualizations of your important metrics within your field of view all day, and learn by osmosis what "normal" looks like so you can react when something abnormal appears
    - pros
        - easy to set up
        - great for catching complex conditions: your subconscious does all the work
    - cons
        - messy and unreliable
        - only works if you are in front of the screen
- Automated mode: your monitoring alerts you when conditions you defined are met
    - pros
        - reliable
        - customizable: you define the exact context you want and how you want to receive alerts
        - runs 24/7
    - cons
        - will only catch issues for conditions you explicitly defined
        - can't come up with ideas for new or better alerts on its own

Automated alerts have many advantages over simple organic monitoring: if you can define what nominal looks like (which can be done with simple statistical process control measures such as the [Nelson rules](https://en.wikipedia.org/wiki/Nelson_rules)), you can set up a system that is reliable, customizable, and running 24/7. Simple threshold-based alerts are reactive by nature, but their automation cuts response time down and enables operational agility when responding to threats. Statistics-based alerts let you be proactive: they notify you when something is not a problem yet but might become one in the future.
### Examples from the MulliganSecurity Infrastructure monitoring standard playbook
- Threshold-based: a [SMARTCTL](https://en.wikipedia.org/wiki/Smartctl) alert creating a notification when any hard drive within your infrastructure crosses a pre-failure threshold (wrapped into a full alerting rule just after these examples)

  ```
  smartctl_device_attribute{attribute_flags_long=~".*prefailure.*", attribute_value_type="value"}
  <=
  on (device, attribute_id, instance, attribute_name)
  smartctl_device_attribute{attribute_flags_long=~".*prefailure.*", attribute_value_type="thresh"}
  ```
- Statistical (anomaly detection): CPU spike or under-use

  ```
  cpu_percentage_use > (avg_over_time(cpu_percentage_use[5m]) + (3 * stddev_over_time(cpu_percentage_use[5m])))
  OR
  cpu_percentage_use < (avg_over_time(cpu_percentage_use[5m]) - (3 * stddev_over_time(cpu_percentage_use[5m])))
  ```
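To go from a query like these to an actual notification, you wrap the expression in an alerting rule. Here is a minimal sketch for the SMART pre-failure check, assuming a standard Prometheus + Alertmanager setup; the group name, alert name, `for` duration and labels are placeholders to adapt to your own conventions:

```
groups:
  - name: hardware-health                     # illustrative group name
    rules:
      - alert: DiskPrefailureThresholdCrossed # hypothetical alert name
        # Fires when any SMART pre-failure attribute value drops to or below
        # the vendor-defined threshold reported by the smartctl exporter
        expr: |
          smartctl_device_attribute{attribute_flags_long=~".*prefailure.*", attribute_value_type="value"}
          <=
          on (device, attribute_id, instance, attribute_name)
          smartctl_device_attribute{attribute_flags_long=~".*prefailure.*", attribute_value_type="thresh"}
        for: 5m                               # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Drive {{ $labels.device }} on {{ $labels.instance }} crossed a SMART pre-failure threshold"
```

The `for` clause is what keeps a single noisy scrape from paging you; the statistical CPU query above can be wrapped in exactly the same way.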
## Associated Risks
As your perimeter and infrastructure grow and you add more servers, your system complexity will shoot up exponentially. Simple organic alerting shows its limits when you have to correlate logs and behaviors across multiple systems.
That's why you need automated alerting: if an adversary decides to stealthily probe your infrastructure and you know what to look for, you will see their attempt for what it is. Choosing to remain in the dark about it is foolish at best, and irresponsible if you are part of an outfit, as your laziness will put others in harm's way.
### Real world attack scenario
#### Situation
You run a clandestine operation that requires the ability to serve a website over tor in a [highly available](../high_availability/index.md) manner.
#### Assets
To keep this example simple we will focus only on the website content as the asset to be protected
![](example_infra.png)
#### Threat model
You have a highly technical, state-backed adversary.
- Adversary objective:
    - render your content unavailable or make it untrusted
- Adversary methods:
    - Availability-based attacks: take the site down
        - tor service deanonymization techniques (from the [high availability](../high_availability/index.md) attacker playbook)
    - Integrity-based attacks: deface the site or introduce mistakes to break public trust in its content
        - AppSec-based attacks on the website itself, probing for vulnerabilities such as XSS to identify readers or SQLi to change site content or gain access
#### Threat-model based Alerting
When devising a monitoring plan, you must take the following into account:
- What application are you running?
    - we are running a website that interacts with users over HTTP through a tor onion service
- How do you know it is running correctly? (a liveness-rule sketch follows this list)
    - the systemd unit must be running
    - the associated systemd socket must be running
    - correct queries should receive answers with the following characteristics:
        - 200 status code
        - 95th percentile response time of at most Xms
- How do you monitor the application substrate (VPSes, networks)?
    - the Onionbalance node must be up and available
    - the Onionbalance node should receive and reply to queries for the highly available service descriptors
    - the Onionbalance node should have minimal network traffic besides that
    - the Onionbalance node and VPS servers should have minimal SSH traffic, and only through a tor onion service
    - Onionbalance 95th percentile response time of at most Yms
    - Vanguards warnings must remain off on this infrastructure in normal operating conditions
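For the liveness checks, here is a minimal sketch, assuming node_exporter runs with its systemd collector enabled and that the unit and socket are named `mywebsite.service` and `mywebsite.socket` (both names, like the alert name, are placeholders):

```
groups:
  - name: website-liveness              # illustrative group name
    rules:
      - alert: WebsiteUnitNotActive     # hypothetical alert name
        # node_systemd_unit_state comes from node_exporter's systemd collector
        # and equals 1 when the named unit is in the given state
        expr: node_systemd_unit_state{name=~"mywebsite\\.(service|socket)", state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} on {{ $labels.instance }} is not active"
```

Pair this with an `absent()`-style check such as the uptime alert below, otherwise a dead exporter silently hides the problem.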
##### Availability-based attacks
- To discover coordinated attempts at availability-based deanonymization against your infrastructure you should monitor your servers' uptime (Prometheus data source)

  ```
  absent(up{application="node",instance="myserver5496497897891561asdf.onion"})
  ```
- Percentile-based detection of performance-degradation attacks (Prometheus data source); a threshold sketch for this query follows below

  ```
  histogram_quantile(0.95, sum(rate(http_server_duration_milliseconds_bucket{http_method="GET",http_host="mycoolwebsite.onion"}[5m])) by (le, http_method,http_host))
  ```
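To alert on the percentile query you compare it against the latency budget you chose for Xms earlier; the sketch below assumes a purely illustrative 2000ms budget:

```
groups:
  - name: website-latency                   # illustrative group name
    rules:
      - alert: OnionServiceP95LatencyHigh   # hypothetical alert name
        # 95th percentile GET latency over the last 5 minutes, in milliseconds,
        # compared against an assumed 2000ms budget (replace with your real Xms)
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_duration_milliseconds_bucket{http_method="GET",http_host="mycoolwebsite.onion"}[5m]))
            by (le, http_method, http_host))
          > 2000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency on {{ $labels.http_host }} is above budget"
```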
##### Intrusion detection
- insider threat: track successful logins and session durations (Loki data source)

  ```
  {unit="systemd-logind.service", instance="$hostname"} |= `session` | regexp `.* session (?P<session>[0-9]+).*user (?P<user>[^\.]+)` | label_format session="{{.session}}", user="{{.user}}" | session != ""
  ```
- If the endpoint used to connect remotely over SSH gets discovered by the attacker and becomes the target of a brute-force attack (Loki data source); a Loki ruler sketch for this check closes out this section

  ```
  count_over_time({unit="sshd.service", instance="myserver"} |~ `.*invalid (user|password)`[24h]) > 0
  ```
- If you have deployed fail2ban and an appropriate telemetry exporter to monitor it (Prometheus data source), this query can give you a heads-up when you are under attack

  ```
  sum by (instance) (rate(f2b_jail_banned_current[5m]))
  ```

  Season this with statistical threshold detection, depending on how likely your administrators are to fat-finger their usernames.
- AppSec monitoring (Tempo data source, for traces): if your service collects distributed tracing data you can create alerts based on specific function durations, for example to discover whether an attacker has dropped a webshell in a traced function

  ```
  {duration>=10s && .service.name="my-interactive-website"}
  ```

Do note that Pyroscope should also be used for continuous profiling, but this is highly application-specific (e.g. monitoring critical functions for duration variation). You will want to create recording rules that build Prometheus metrics from your continuous profiling infrastructure so you can alert against those. Creating recording rules is out of scope for this tutorial, but they use the same language and tooling as alerting rules.
![](pyroscope_tracing.png)
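The log-based checks above can also be promoted to first-class alerts: Loki's ruler evaluates rules written in the same Prometheus-style format, with a LogQL metric query in the `expr` field. A minimal sketch for the SSH brute-force check, with the group and alert names invented for illustration:

```
groups:
  - name: ssh-bruteforce               # illustrative group name
    rules:
      - alert: SSHBruteForceAttempts   # hypothetical alert name
        # LogQL metric query: counts "invalid user/password" lines from sshd
        # over the last 24 hours; on a hidden SSH endpoint any hit is suspicious
        expr: |
          count_over_time({unit="sshd.service", instance="myserver"} |~ `.*invalid (user|password)`[24h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Failed SSH login attempts detected on {{ $labels.instance }}"
```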
### But alerting carries risk too!
Indeed. Today we will keep building on the [monitoring](../anonymous_server_monitoring/index.md) tutorial.