~/e4b4.eu

Instrumentalizing Kong

It did not start with a whimper, but a bang... or something.

At my job, we use a Kong plugin. Specifically, kong-plugin-jwt-keycloak.
And a few weeks ago this plugin crashed Kong rather spectacularly (a synchronous HTTP client would cause Kong to query itself and deadlock). Anyway, just another day at the office: I swapped the synchronous HTTP client for OpenResty's default asynchronous one, verified that the fix was working, and pushed it down the promotion path.
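For context, the non-blocking flavour of that fix looks roughly like this. This is a sketch, not the actual patch: the fetch_jwks helper name is mine, but lua-resty-http's request_uri is the standard cosocket-based client in OpenResty land.

```lua
-- Sketch of a non-blocking HTTP call with lua-resty-http.
-- Cosockets yield to the nginx event loop while waiting on the socket,
-- instead of blocking the whole worker like a synchronous client would.
local http = require "resty.http"

-- fetch_jwks is a hypothetical helper name, not from the real plugin.
local function fetch_jwks(url)
  local httpc = http.new()
  httpc:set_timeout(2000)  -- milliseconds
  local res, err = httpc:request_uri(url, { method = "GET" })
  if not res then
    return nil, "failed to fetch JWKS: " .. (err or "unknown error")
  end
  return res.body
end
```

The crucial property is that the worker keeps serving other requests while this one is parked waiting for the response.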

Side-note: Kong Lua plugins are basically just an include from the *_by_lua blocks. They're standard OpenResty plugins with a couple of extra steps, and all the associated pitfalls. This makes any Lua Kong plugin extremely sensitive, as they run synchronously in the worker thread. If your plugin computes the first 10,000 prime numbers, then that worker is not processing anyone's request until it's done computing the first 10,000 prime numbers.
You should therefore strive to make your plugins return in a matter of microseconds if that worker is to handle any significant load.
Plugins written in languages other than Lua do not suffer from this issue: Kong talks to them over an IPC layer, so if your plugin crashes, deadlocks, or otherwise has performance issues, the routes it serves will stop working, but the plugin won't take the entire worker down with it.
These are very important caveats that, at the time of writing, are not at all made clear on the Kong PDK documentation page.

Anyway, this all pointed to the fact that maybe we should have more visibility into what this plugin is doing, and into the impact these synchronous (!) Lua methods have on the overall performance of the ingress.

Surely adding a few lines doing something along the lines of metric_summary.observe() would be trivial, right? Right...
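Concretely, all I wanted was something like this (the names are placeholders; observe() is the standard nginx-lua-prometheus histogram/summary API):

```lua
-- Hypothetical instrumentation: time the expensive part of the plugin
-- and record it. ngx.now() returns seconds as a float.
local start = ngx.now()
do_token_validation(conf)  -- placeholder for the plugin's real work
ngx.update_time()          -- refresh nginx's cached clock before reading it again
metric_summary:observe(ngx.now() - start)
```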

Please use docker-compose

This is a tangent, but the plugin we forked orchestrated its tests using make and docker run.

I like make (it underpins this entire website, to start with), but orchestrating Docker containers is not what it's built for. And there were 6 containers to keep track of just to run the test suite!

kong-db-create-postgres:
    @echo "Creating Kong DB"
    - @docker run --rm -d \
        --name ${KONG_DB_CONTAINER_NAME} \
        --net=host \
        -e POSTGRES_USER=${KONG_DB_USER} \
        -e POSTGRES_DB=${KONG_DB_NAME} \
        -e POSTGRES_PASSWORD=${KONG_DB_PASS} \
        ${POSTGRES_IMAGE}
    @$(MAKE) --no-print-directory CONTAINER=${KONG_DB_CONTAINER_NAME} PATTERN="database system is ready to accept connections" wait-for-log

And of course there were race conditions on startup, and the whole thing wouldn't work correctly if the wrong component was restarted.

If I were to rework this plugin beyond a couple of patches, I was going to need the ability to test it more efficiently. So I rebuilt the whole local deployment on docker-compose. Much better!

  db:
    image: *postgres_image
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${KONG_DB_USER:-kong}
      POSTGRES_DB: ${KONG_DB_NAME:-kong}
      POSTGRES_PASSWORD: ${KONG_DB_PASSWORD:-kong}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${KONG_DB_USER:-kong} -d ${KONG_DB_NAME:-kong}"]
      interval: 5s
      timeout: 20s
      retries: 20
      start_period: 10s
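The healthcheck is the real payoff: downstream services can gate on it declaratively instead of sleeping and grepping logs. A sketch of the matching kong service stanza (trimmed to the relevant bits; the image tag matches the version visible in the metrics output later, the rest of our actual stanza is omitted):

```yaml
  kong:
    image: kong:3.6.1
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy  # waits for pg_isready, not just "container started"
    environment:
      KONG_DATABASE: postgres
      KONG_PG_HOST: db
      KONG_PG_USER: ${KONG_DB_USER:-kong}
      KONG_PG_PASSWORD: ${KONG_DB_PASSWORD:-kong}
```

No more "wait-for-log" targets, and restarting any single component does the right thing.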

Let's finally add some metrics!

Right, so Kong already has a plugin which exposes Prometheus metrics.

Is there any more documentation than this page? Some documentation on how another plugin might hook into this subsystem to expose their own metrics maybe?
Nope! None whatsoever! Hence this blog post.

However, I knew it was possible, because people had already done exactly that.
Except the example in that "resolved" comment is woefully insufficient and does not even run: it relies on a metrics.lua file that is referenced but never provided.
Someone else in that thread posted something much more promising (here's the short version):

local exporter = require "kong.plugins.prometheus.exporter"
local prometheus

-- ...

function Handler:init_worker()
  -- Register your custom metrics with the Prometheus plugin
  prometheus = exporter.get_prometheus()
  if prometheus then
    prometheus:counter("example", "Example")
  else
    kong.log.err("Prometheus library is not available")
  end
end

Promising! The code runs without error, although we're not quite sure what we did, and we still don't have any metrics. Wait, where am I even supposed to get the metrics?

Getting the metrics

The prometheus plugin documentation tersely says:

curl -i http://localhost:8001/metrics

Alright, let's try that:

$ docker compose exec kong curl localhost:8001/metrics | grep -e kong_node_info -e kong_example
# HELP kong_node_info Kong Node metadata information
# TYPE kong_node_info gauge
kong_node_info{node_id="d77932cf-0f05-4ffa-be27-f1c80ed679f7",version="3.6.1"} 1
$ 

Alright, we've got something. No example metric anywhere to be found though. Mhmm. My code looks (spoiler alert: and is) correct and runs fine. I am hitting the metrics endpoint. Yet my metric isn't there.

this is fine dot webp

I scoured the internet. And I scoured, and I scoured. At this point, better men than me would have given up.

Getting the metrics to work

Now, as an interlude, I found (I'm pretty sure) this post which tipped me off to a partial solution:

server {
  server_name kong_prometheus_exporter;
  listen 0.0.0.0:9542;
  access_log off;
  location / {
    default_type text/plain;
      content_by_lua_block {
        local prometheus = require "kong.plugins.my-prometheus.exporter"
        prometheus:collect()
      }
  }

  location /nginx_status {
    internal;
    access_log off;
    stub_status;
  }
}

I added that in a snippet on the container, added nginx_http_include = /nginx-metrics.conf to my kong.conf, and... nope. I get metrics on localhost:9542/metrics, but still not my metrics. But hey, at least I kind-of understand how the metrics get to /metrics, and that's an improvement.

At this point, no choice, we've got to take the plunge into the upstream code to try to make some sense of things.

prometheus/exporter.lua's collect() function looks like this:

local function collect()
  ngx.header["Content-Type"] = "text/plain; charset=UTF-8"

  metric_data()
  -- ...
end

This is all well and good. It gets the metrics. It prints the metrics. Seems perfect, aside from the fact that my metrics are missing.

Getting my metrics to work

Still frame of Dumbledore ominously saying HOWEVER dot webp

prometheus.lua also has a collect() function??

-- Present all metrics in a text format compatible with Prometheus.
--
-- This function should be used to expose the metrics on a separate HTTP page.
-- It will get the metrics from the dictionary, sort them, and expose them
-- along with TYPE and HELP comments.
function Prometheus:collect()
  ngx.header["Content-Type"] = "text/plain"
  self:metric_data()
end

So the "documentation" (if we can call it that) on this method says it is what should be used to expose the metrics (yet the admin API uses the other one, which has no "documentation" at all). Looking at the implementation, it's basically the same function as the other one, except it's a method on the Prometheus object instead of a static function.

Still, we've come this far, and honestly we're quite tired at this point. Let's just use the bloody thing and see where it gets us.

Three guys trudging through a cave made out of books, from Franquin's comic Gaston Lagaffe

@@ -5,7 +5,8 @@
   location / {
     default_type text/plain;
       content_by_lua_block {
-        local prometheus = require "kong.plugins.my-prometheus.exporter"
+        local exporter = require "kong.plugins.prometheus.exporter"
+        local prometheus = exporter.get_prometheus()
         prometheus:collect()
       }
   }
$ docker compose exec kong curl localhost:9542/metrics | grep -e kong_node_info -e kong_example
# HELP kong_node_info Kong Node metadata information
# TYPE kong_node_info gauge
kong_node_info{node_id="d77932cf-0f05-4ffa-be27-f1c80ed679f7",version="3.6.1"} 1
# HELP example Example
# TYPE example counter
kong_example 1
$ 

The fact that it now works is almost insulting. Why? How? I go trudging through a mountain of undocumented code, after several hours of head-scratching I change something randomly, and now it works? As if we needed more proof that CS is not an exact science.

Why are there two competing implementations? And what's the difference? Well, if ya really want to dig through the mess, here goes:
The exporter.lua call sets local_only to not IS_PROMETHEUS_ENABLED. When local_only is true (which seems to be the case for the Admin API, for whatever reason), the use of the shared dict is disabled. And the shared dict is where our metrics are stored (trust me on that one; it takes a while to unravel that particular code path!).
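In shared-dict terms, my simplified mental model of that code path (this is not the literal upstream code) is:

```lua
-- Simplified mental model, not the upstream implementation. Metrics live
-- in a lua_shared_dict so that every nginx worker sees the same values:
local dict = ngx.shared.prometheus_metrics

-- a worker handling traffic increments a key...
dict:incr("kong_example", 1, 0)  -- third arg: initialise to 0 if the key is absent

-- ...and whichever worker happens to serve the metrics endpoint reads it back.
local value = dict:get("kong_example")
```

With local_only, increments stay in per-worker state instead, so a collector reading the shared dict never sees them.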

Anyways, I added a bunch of metrics, some tests, and here-a-we-go.
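For the record, the shape of those additions, with metric names made up for illustration (the real ones are work-specific), is roughly:

```lua
-- Sketch of the final setup; names and phases are illustrative.
local exporter = require "kong.plugins.prometheus.exporter"

local validation_time  -- histogram handle shared by the phases below

function Handler:init_worker()
  local prom = exporter.get_prometheus()
  if not prom then
    kong.log.err("Prometheus library is not available")
    return
  end
  -- histogram(name, description) from the underlying nginx-lua-prometheus library
  validation_time = prom:histogram("jwt_validation_seconds",
    "Time spent validating JWTs")
end

function Handler:access(conf)
  local start = ngx.now()
  -- ... the plugin's actual token validation happens here ...
  ngx.update_time()
  if validation_time then
    validation_time:observe(ngx.now() - start)
  end
end
```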

Conclusion?

We meant to use Kong as an API gateway (at least for some use cases, and validating tokens at the ingress level is a great win for security). However, between the naive implementation of Lua plugins making them prone to deadlocks and performance problems, and the dreadful (!) lack of documentation, this has definitely soured me on the whole thing.

At the very least, next time I write a Kong plugin it will be a Python one. They're less likely to cause big problems since the processing happens off-worker, and while the PDK is not well documented, at least the Python libraries tend to be! To expose metrics on another of our plugins, I just had to import prometheus_client and add a ServiceMonitor. Less fancy, but very straightforward and not worthy of a blog post.