Falco Performance Testing
Special Thanks to Leonardo Grasso for assisting me
Agenda
This document shares the experience and explains the steps followed for performance testing Falco deployed with its Helm chart on a Kubernetes cluster. The goal is to establish a relation between the resources (CPU and memory) allocated to Falco and the number of syscalls per second it can handle.
Assumptions
The assumptions for the performance testing are as below:
Known events are the syscalls that will match the Falco rule and trigger an alert
Unknown events are the syscalls that will be discarded by Falco as they will not meet any condition in Falco rules
The load of unknown events should be kept to a minimum during the Falco performance testing, i.e. other activity on the host should be restricted as far as possible
The performance testing is done with the benchmark feature of the Falco event-generator tool
The Falco event-generator benchmark generates syscall events only, so there are no events of type k8s-audit during this benchmarking exercise
Please keep in mind that not all actions can be used for benchmarking, since some of them take too long to generate a high number of EPS. For example, k8saudit actions are not supposed to work, since those actions need some time to create Kubernetes resources. Also, some syscall actions sleep for a while (like syscall.ReadSensitiveFileUntrusted) and thus cannot be used.
EPS (Events Per Second) is the rate of events generated by the event-generator during a single round. An event does not correspond to a single syscall; it is a combination of multiple syscalls. For example, if the action targets a rule that is triggered when a file is created under /bin, the action will probably use 3 syscalls:
One to check the directory
One to create the file under /bin
Finally the last one to delete the file
However the event-generator counts only 1 event per action
INFO statistics cpu="38.1%" lost="2%" res_mem="38 MB" throughput="4371.6 EPS" virt_mem="1.1 GB"
Falco receives a lot of events, but not all of them trigger a rule, so the drop stats produced by the Falco process refer to the raw number of syscalls received (Falco cannot know whether the dropped syscalls would have matched a rule, since it lost them). The event-generator, on the other hand, knows exactly which events it generated, so it can say for sure: I sent 100 and received 50 back, thus 50% were lost
No custom rules were considered during performance testing; only the default Falco rules are triggered
The benchmarking feature was tested with only three types of actions:
ChangeThreadNamespace|ReadSensitiveFileUntrusted|WriteBelowBinaryDir
The event-generator round prior to the round in which drops were seen in the Falco logs was used to calculate the number of syscalls supported at a particular resource setup
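The "I sent 100 and received 50 back" reasoning in the assumptions above can be written down as a small calculation. This is a minimal sketch of that arithmetic only; the function name `lost_percentage` is made up for illustration and is not part of the event-generator.

```python
def lost_percentage(sent: int, received: int) -> float:
    """Share of generated events that never came back as Falco alerts, in percent."""
    if sent <= 0:
        raise ValueError("sent must be a positive number of events")
    return 100.0 * (sent - received) / sent


# The example from the text: 100 events sent, 50 alerts received back.
print(lost_percentage(100, 50))  # 50.0
```

Note that this figure is computed from events the generator itself produced, which is why it can differ from the drop percentage reported by Falco on raw syscalls.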
Falco Fun Facts:
Here are some facts about Falco that might come to mind during performance testing.
- Falco reads just one event at a time from the buffer, processes it and then discards it
- Falco receives a lot of events but not all of those trigger a rule
- Falco is designed to have low memory consumption, and memory usage is not strictly related to the EPS, since events are processed one by one (once an event is processed, it is discarded and the memory is freed)
Steps
Setup
Falco was deployed with the -s option and --stats-interval set to 1 second (1000 ms) in order to capture the total number of syscalls seen by Falco per second:
    -s <stats_file>          If specified, append statistics related to Falco's reading/processing of events to this file (only useful in live mode).
    --stats-interval <msec>  When using -s <stats_file>, write statistics every <msec> ms. This uses signals, so don't recommend intervals below 200 ms. Defaults to 5000 (5 seconds).
Sample Configuration:
    # Changes in Falco daemonset spec:
    containers:
      - args:
          - /usr/bin/falco
          - -s
          - /var/log/falco.txt
          - --stats-interval
          - "1000"
          - --cri
          - /run/containerd/containerd.sock
          - -K
          - /var/run/secrets/kubernetes.io/serviceaccount/token
          - -k
          - https://$(KUBERNETES_SERVICE_HOST)
          - -pk
Enable gRPC and relax the rate limiter in the Falco configuration by making the below changes in values.yaml
    grpc:
      enabled: true
      threadiness: 0
      # gRPC unix socket with no authentication
      unixSocketPath: "unix:///var/run/falco/falco.sock"
      # gRPC over the network (mTLS) / required when unixSocketPath is empty
      listenPort: 5060
      privateKey: "/etc/falco/certs/server.key"
      certChain: "/etc/falco/certs/server.crt"
      rootCerts: "/etc/falco/certs/ca.crt"

    # gRPC output service.
    # By default it is off.
    # By enabling this all the output events will be kept in memory until you read them with a gRPC client.
    # Make sure to have a consumer for them or leave this disabled.
    grpcOutput:
      enabled: true
    # A throttling mechanism implemented as a token bucket limits the
    # rate of Falco notifications. This throttling is controlled by the following configuration
    # options:
    #  - rate: the number of tokens (i.e. right to send a notification)
    #    gained per second. Defaults to 1.
    #  - max_burst: the maximum number of tokens outstanding. Defaults to 1000.
    #
    # With these defaults, Falco could send up to 1000 notifications after
    # an initial quiet period, and then up to 1 notification per second
    # afterward. It would gain the full burst back after 1000 seconds of
    # no activity.
    outputs:
      rate: "1000000000"
      maxBurst: "1000000000"
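To see why the default throttling would distort a benchmark, the token bucket described in the configuration comment can be simulated. The sketch below is a generic token bucket written for illustration; it is not Falco's implementation, and the class and function names are hypothetical. It only reproduces the behaviour the comment describes for `rate` and `max_burst`.

```python
class TokenBucket:
    """Generic token bucket: `rate` tokens gained per second, capped at `max_burst`."""

    def __init__(self, rate: float, max_burst: float) -> None:
        self.rate = rate
        self.max_burst = max_burst
        self.tokens = max_burst  # start full, i.e. after an initial quiet period
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the previous call.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.max_burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(bucket: TokenBucket, timestamps) -> int:
    """Number of notifications that pass the bucket at the given timestamps (seconds)."""
    return sum(1 for t in timestamps if bucket.allow(t))


# Defaults from the comment: rate 1, max_burst 1000.
default = TokenBucket(rate=1, max_burst=1000)
print(count_allowed(default, [0.0] * 5000))   # 1000 of 5000 alerts pass immediately
print(count_allowed(default, [10.0] * 5000))  # 10 more pass ten seconds later

# Values used for the benchmark: effectively no throttling.
relaxed = TokenBucket(rate=1_000_000_000, max_burst=1_000_000_000)
print(count_allowed(relaxed, [0.0] * 5000))   # 5000
```

With the defaults, most alerts generated during a benchmark round would be throttled and the event-generator would count them as lost, which is why both values are raised before testing.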
Download the event-generator binary from here. Extract the binary and use the command
event-generator list
to list the sample events this tool generates.
Run event-generator using the bench option with the below command
event-generator bench "ChangeThreadNamespace|ReadSensitiveFileUntrusted|WriteBelowBinaryDir" --loop --pid $(ps -ef | awk '$8=="falco" {print $2}')
event-generator bench
Benchmark for Falco
Synopsis
Benchmark a running Falco instance.
This command generates a high number of Event Per Second (EPS), to test the events throughput allowed by Falco. The number of EPS is controlled by the "--sleep" option: reduce the sleeping duration to increase the EPS. If the "--loop" option is set, the sleeping duration is halved on each round. The "--pid" option can be used to monitor the Falco process.
N.B.:
- the Falco gRPC Output must be enabled to use this command
- "outputs.rate" and "outputs.max_burst" values within the Falco configuration must be increased, otherwise EPS will be rate-limited by the throttling mechanism
- since not all actions can be used for benchmarking, only those actions matching the given regular expression are used
One common way to use this command is as following:
event-generator bench "ChangeThreadNamespace|ReadSensitiveFileUntrusted" --loop --sleep 10ms --pid $(pidof -s falco)
Warning: This command might alter your system. For example, some actions modify files and directories below /bin, /etc, /dev, etc. Make sure you fully understand what is the purpose of this tool before running any action.
event-generator bench [regexp] [flags]
Options
          --all   Run all actions, including those disabled by default
          --as string   Username to impersonate for the operation
          --as-group stringArray   Group to impersonate for the operation, this flag can be repeated to specify multiple groups.
          --cache-dir string   Default HTTP cache directory (default "$HOME/.kube/http-cache")
          --certificate-authority string   Path to a cert file for the certificate authority
          --client-certificate string   Path to a client certificate file for TLS
          --client-key string   Path to a client key file for TLS
          --cluster string   The name of the kubeconfig cluster to use
          --context string   The name of the kubeconfig context to use
          --grpc-ca string   CA root file path for connecting to a Falco gRPC server (default "/etc/falco/certs/ca.crt")
          --grpc-cert string   Cert file path for connecting to a Falco gRPC server (default "/etc/falco/certs/client.crt")
          --grpc-hostname string   Hostname for connecting to a Falco gRPC server (default "localhost")
          --grpc-key string   Key file path for connecting to a Falco gRPC server (default "/etc/falco/certs/client.key")
          --grpc-port uint16   Port for connecting to a Falco gRPC server (default 5060)
          --grpc-unix-socket string   Unix socket path for connecting to a Falco gRPC server (default "unix:///var/run/falco.sock")
      -h, --help   help for bench
          --humanize   Humanize values when printing statistics (default true)
          --insecure-skip-tls-verify   If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
          --kubeconfig string   Path to the kubeconfig file to use for CLI requests.
          --loop   Run in a loop
          --match-server-version   Require server version to match client version
      -n, --namespace string   If present, the namespace scope for this CLI request (default "default")
          --pid int   A process PID to monitor while benchmarking (e.g. the falco process)
          --polling-interval duration   Duration of gRPC APIs polling timeout (default 100ms)
          --request-timeout string   The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests. (default "0")
          --round-duration duration   Duration of a benchmark round (default 5s)
      -s, --server string   The address and port of the Kubernetes API server
          --sleep duration   The length of time to wait before running an action. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means no sleep. (default 100ms)
          --token string   Bearer token for authentication to the API server
          --user string   The name of the kubeconfig user to use
Options inherited from parent commands
      -c, --config string   Config file path (default $HOME/.falco-event-generator.yaml if exists)
          --logformat string   available formats: "text" or "json" (default "text")
      -l, --loglevel string   Log level (default "info")
Execution
Monitor the number of syscalls per second on the servers where Falco is deployed. In the sample output below, the values in the "cur" section are cumulative totals, whereas the values in the "delta" section are the values for that particular sample interval
Falco syscall logs:
    {"sample": 2961, "cur": {"events": 46226792, "drops": 86992, "preemptions": 0}, "delta": {"events": 10700, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2962, "cur": {"events": 46323843, "drops": 86992, "preemptions": 0}, "delta": {"events": 97051, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2963, "cur": {"events": 46696561, "drops": 86992, "preemptions": 0}, "delta": {"events": 372718, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2964, "cur": {"events": 47069599, "drops": 86992, "preemptions": 0}, "delta": {"events": 373038, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2965, "cur": {"events": 47419658, "drops": 86992, "preemptions": 0}, "delta": {"events": 350059, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2966, "cur": {"events": 47784238, "drops": 86992, "preemptions": 0}, "delta": {"events": 364580, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2967, "cur": {"events": 48134675, "drops": 102975, "preemptions": 0}, "delta": {"events": 350437, "drops": 15983, "preemptions": 0}, "drop_pct": 4.56088},
    {"sample": 2968, "cur": {"events": 48311955, "drops": 131484, "preemptions": 0}, "delta": {"events": 177280, "drops": 28509, "preemptions": 0}, "drop_pct": 16.0813},
    {"sample": 2969, "cur": {"events": 48323039, "drops": 131484, "preemptions": 0}, "delta": {"events": 11084, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2970, "cur": {"events": 48333847, "drops": 131484, "preemptions": 0}, "delta": {"events": 10808, "drops": 0, "preemptions": 0}, "drop_pct": 0},
    {"sample": 2971, "cur": {"events": 48342737, "drops": 131484, "preemptions": 0}, "delta": {"events": 8890, "drops": 0, "preemptions": 0}, "drop_pct": 0}
This way you can measure the syscall activity on your servers when idle and use it as a reference. You can also check whether any drops are happening before starting the event-generator
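The idle baseline can be read off the stats file with a few lines of Python. This is a rough sketch that assumes the file looks like the sample above, i.e. one JSON object per line with an optional trailing comma; the helper names `parse_stats` and `idle_baseline` are made up for illustration. The embedded lines are the three idle samples (2969 to 2971) from the log above.

```python
import json
from statistics import median

# The three idle samples from the Falco syscall logs above (abridged to one per line).
IDLE_LOG = """
{"sample": 2969, "cur": {"events": 48323039, "drops": 131484, "preemptions": 0}, "delta": {"events": 11084, "drops": 0, "preemptions": 0}, "drop_pct": 0},
{"sample": 2970, "cur": {"events": 48333847, "drops": 131484, "preemptions": 0}, "delta": {"events": 10808, "drops": 0, "preemptions": 0}, "drop_pct": 0},
{"sample": 2971, "cur": {"events": 48342737, "drops": 131484, "preemptions": 0}, "delta": {"events": 8890, "drops": 0, "preemptions": 0}, "drop_pct": 0}
"""


def parse_stats(text):
    """Parse stats lines; assumes one JSON object per line, optionally ending in a comma."""
    samples = []
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if line.startswith("{"):
            samples.append(json.loads(line))
    return samples


def idle_baseline(samples):
    """Return (median events per interval, total drops) over the given samples."""
    events = [s["delta"]["events"] for s in samples]
    drops = sum(s["delta"]["drops"] for s in samples)
    return median(events), drops


baseline, drops = idle_baseline(parse_stats(IDLE_LOG))
print(baseline, drops)  # 10808 0
```

With a 1 second stats interval the median is directly the idle syscalls per second, and a non-zero drop count here would mean Falco is already losing events before any benchmark load is applied.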
Start the event-generator tool and observe the statistics it prints for every round, while monitoring in parallel the number of syscalls per second in the Falco syscall output.
The event-generator cycle: in every round it generates load at a certain rate (EPS) and then rests. During the rest it calculates the statistics for the round just completed, and the syscalls also drop. In the next round it aims to double the rate by halving the sleep duration, and the same cycle repeats until it is stopped.
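The halving of the sleep duration under "--loop" can be tabulated to get a feel for how quickly the sleep shrinks. The sketch below assumes the first round starts at the default "--sleep" of 100ms; the function name `sleep_schedule` is hypothetical and the round numbering printed by the tool may be offset from the list index used here.

```python
def sleep_schedule(initial_sleep_us: float, rounds: int):
    """Sleep duration per round in microseconds, halved on every round."""
    return [initial_sleep_us / (2 ** i) for i in range(rounds)]


schedule = sleep_schedule(100_000, 15)  # 100ms default, expressed in microseconds
for i, sleep_us in enumerate(schedule, start=1):
    print(f"step {i:2d}: sleep = {sleep_us:.3f}µs")
```

Under that assumption the 14th and 15th entries come out at roughly 12.207µs and 6.104µs, in line with the sleep values logged for rounds #14 and #15 in the event-generator logs of this section.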
event-generator logs:
    INFO round #14 sleep="12.207µs"
    INFO resting...
    INFO syscall.ReadSensitiveFileUntrusted actual=14224 expected=14458 ratio=0.9838151888227971
    INFO syscall.WriteBelowBinaryDir actual=14219 expected=14456 ratio=0.9836054233536248
    INFO syscall.ChangeThreadNamespace actual=14208 expected=14457 ratio=0.9827765096493049
    INFO statistics cpu="68.0%" lost="1%" res_mem="36 MB" throughput="8674.2 EPS" virt_mem="1.1 GB"
    INFO
    INFO round #15 sleep="6.103µs"
    INFO resting...
    INFO syscall.ReadSensitiveFileUntrusted actual=15093 expected=16962 ratio=0.8898125221082419
    INFO syscall.WriteBelowBinaryDir actual=15080 expected=16963 ratio=0.8889936921535105
    INFO syscall.ChangeThreadNamespace actual=15058 expected=16961 ratio=0.8878014268026649
    INFO statistics cpu="72.6%" lost="11%" res_mem="36 MB" throughput="10177.2 EPS" virt_mem="1.1 GB"
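The per-action lines in these logs can be aggregated into a single loss figure per round. The following sketch parses the `actual=`/`expected=` pairs with a regular expression; it is an independent calculation, not the tool's own code, and the helper name `round_loss` is made up. How the event-generator rounds its own "lost" value is not stated in this document, so small differences from the logged figure are to be expected.

```python
import re

# Per-action lines of round #14 and round #15, copied from the logs above.
ROUND_14 = """
INFO syscall.ReadSensitiveFileUntrusted actual=14224 expected=14458 ratio=0.9838151888227971
INFO syscall.WriteBelowBinaryDir actual=14219 expected=14456 ratio=0.9836054233536248
INFO syscall.ChangeThreadNamespace actual=14208 expected=14457 ratio=0.9827765096493049
"""
ROUND_15 = """
INFO syscall.ReadSensitiveFileUntrusted actual=15093 expected=16962 ratio=0.8898125221082419
INFO syscall.WriteBelowBinaryDir actual=15080 expected=16963 ratio=0.8889936921535105
INFO syscall.ChangeThreadNamespace actual=15058 expected=16961 ratio=0.8878014268026649
"""

ACTION_LINE = re.compile(r"syscall\.(\w+) actual=(\d+) expected=(\d+)")


def round_loss(log_text: str) -> float:
    """Percentage of expected events not received, summed over all actions in a round."""
    actual = expected = 0
    for _name, a, e in ACTION_LINE.findall(log_text):
        actual += int(a)
        expected += int(e)
    if expected == 0:
        raise ValueError("no action lines found")
    return 100.0 * (expected - actual) / expected


print(round(round_loss(ROUND_14), 1))  # 1.7
print(round(round_loss(ROUND_15), 1))  # 11.1
```

The jump from under 2% to about 11% between the two rounds is the same signal the logged "lost" values (1% and 11%) give.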
As soon as you see drops in the "delta" section (sample 2967 in the Falco syscall logs above), stop the event-generator tool and capture the values of the event-generator round prior to it. (The drops occurred in round #15, as seen in the event-generator statistics above, so we consider the values for round #14.)
The total number of syscalls supported by Falco at the given resource setting can then be calculated by averaging the syscall values of the samples (2963, 2964, 2965 and 2966) prior to the sample where the drop occurred. The samples with low syscall counts (2961 and 2962) correspond to the resting period in the event-generator cycle
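The averaging step can be sketched in Python using the delta values from the Falco syscall logs above. The selection of "loaded" samples is done here with a simple heuristic that is an assumption of this sketch, not something prescribed by Falco: walk back from the first sample with drops and keep samples whose event count is at least half of the last drop-free sample. The name `sustained_rate` and the `load_fraction` parameter are hypothetical.

```python
from statistics import mean

# (sample, delta events, delta drops) taken from the Falco syscall logs above.
SAMPLES = [
    (2961, 10700, 0), (2962, 97051, 0), (2963, 372718, 0), (2964, 373038, 0),
    (2965, 350059, 0), (2966, 364580, 0), (2967, 350437, 15983), (2968, 177280, 28509),
]


def sustained_rate(samples, load_fraction=0.5):
    """Average events per interval over the loaded samples just before the first drop."""
    first_drop = next((i for i, s in enumerate(samples) if s[2] > 0), None)
    if not first_drop:  # None (no drops at all) or 0 (drops in the very first sample)
        raise ValueError("need at least one drop-free sample before the first drop")
    reference = samples[first_drop - 1][1]
    loaded = []
    # Walk backwards until the event count falls to resting-period levels.
    for number, events, _drops in reversed(samples[:first_drop]):
        if events < load_fraction * reference:
            break
        loaded.append((number, events))
    loaded.reverse()
    return [n for n, _ in loaded], mean(e for _, e in loaded)


used, rate = sustained_rate(SAMPLES)
print(used, rate)  # [2963, 2964, 2965, 2966] 365098.75
```

Since the stats interval was set to 1 second, the result reads directly as syscalls per second for this particular run.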
Observations
The observations here are for reference only; the intention of this document is to illustrate a process for Falco performance testing. It is highly recommended to carry out the performance testing in your own environment and use this data only as a reference. The environment under test is a Kubernetes cluster deployed on VMs hosted on OpenStack
CPU | Memory | Number of syscalls per second | EPS (rounded to lower limit) |
---|---|---|---|
500m (0.5) | 512 Mi | Up to 150K | 2800 |
1 | 512 Mi | Up to 250K | 4200 |
2 | 512 Mi | Up to 320K | 7000 |
Important points:
Adding or removing action types in the event-generator will have an impact on the EPS, but the total syscall value should still be in roughly the same range
If the EPS gets stuck in a certain range, review the types of events used in the event-generator to generate the syscall load. The event-generator runs actions sequentially (single-threaded), so the total time of one loop is the sum of the time needed to execute all three events; if one of them is slow, the whole loop is slow. Once the sleeping time reaches 0, the event-generator EPS cannot grow any further, since all the time in the loop is spent executing the actions. The result is an EPS that fluctuates within a certain range instead of increasing with every round
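The EPS ceiling described in the second point can be illustrated with a simple model of a single-threaded loop that sleeps before each action and then runs it. This is a back-of-the-envelope sketch, not the event-generator's code; the per-action execution times below are hypothetical numbers chosen only to show the effect of one slow action.

```python
def loop_eps(action_seconds, sleep_seconds):
    """EPS of a sequential loop that sleeps `sleep_seconds` before running each action."""
    loop_time = sum(duration + sleep_seconds for duration in action_seconds)
    return len(action_seconds) / loop_time


# Hypothetical execution times in seconds for three actions.
all_fast = [100e-6, 100e-6, 100e-6]
one_slow = [100e-6, 100e-6, 2800e-6]  # e.g. one action hitting slow disk I/O

for sleep in (1e-3, 1e-4, 1e-5, 0.0):
    print(f"sleep={sleep:g}s  all_fast={loop_eps(all_fast, sleep):8.0f} EPS"
          f"  one_slow={loop_eps(one_slow, sleep):8.0f} EPS")
```

In this model the EPS with no sleep is capped at the number of actions divided by the sum of their execution times: about 10000 EPS for the fast set but only about 1000 EPS when one action is slow, no matter how many more rounds are run.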
Example:
In the example below the EPS was not increasing at the rate it was supposed to, and even went down in one round. This is an indication that one of the events
ChangeThreadNamespace|ReadSensitiveFileUntrusted|WriteBelowBinaryDir
was causing the event-generator to slow down. In this case it was WriteBelowBinaryDir, as the server had an I/O issue on the root disk. After the problematic event WriteBelowBinaryDir was removed from the list, it worked fine.
    # event-generator bench "ChangeThreadNamespace|ReadSensitiveFileUntrusted|WriteBelowBinaryDir" --loop --grpc-unix-socket=unix:///var/run/falco/falco.sock --pid <Falco PID>
    INFO round #12 sleep="97.656µs"
    INFO resting...
    INFO syscall.WriteBelowBinaryDir actual=1696 expected=1696 ratio=1
    INFO syscall.ChangeThreadNamespace actual=1695 expected=1695 ratio=1
    INFO syscall.ReadSensitiveFileUntrusted actual=1696 expected=1696 ratio=1
    INFO statistics cpu="21.9%" lost="0%" res_mem="121 MB" throughput="1017.4 EPS" virt_mem="1.5 GB"
    INFO
    INFO round #13 sleep="48.828µs"
    INFO resting...
    INFO syscall.WriteBelowBinaryDir actual=1787 expected=1787 ratio=1
    INFO syscall.ChangeThreadNamespace actual=1788 expected=1788 ratio=1
    INFO syscall.ReadSensitiveFileUntrusted actual=1786 expected=1786 ratio=1
    INFO statistics cpu="23.6%" lost="0%" res_mem="121 MB" throughput="1072.2 EPS" virt_mem="1.5 GB"
    INFO
    INFO round #14 sleep="24.414µs"
    INFO resting...
    INFO syscall.WriteBelowBinaryDir actual=1614 expected=1614 ratio=1
    INFO syscall.ChangeThreadNamespace actual=1613 expected=1613 ratio=1
    INFO syscall.ReadSensitiveFileUntrusted actual=1615 expected=1615 ratio=1
    INFO statistics cpu="22.4%" lost="0%" res_mem="121 MB" throughput="968.4 EPS" virt_mem="1.5 GB"