+++
title = "Hitman: another fine essential sundry service from Nebcorp Heavy Industries and Sundries"
slug = "hitman"
date = "2024-03-31"
updated = "2024-03-31"
[taxonomies]
tags = ["software", "sundry", "proclamation", "90s", "hitman", "web"]
+++
# Hitman counts your hits, man.
Recently, someone in a community I'm part of asked the following:

> I was thinking about how we used to have website hit counters in the 2000s and I was wondering --
> has anyone put a hit counter on your personal website?

Some people had, it turns out, but many had not. Among the had-nots was me, and I decided to do
something about it. The bottom line up front is that you can see it in action right now at the
bottom of this very page, and if you want, check out the code
[here](https://git.kittencollective.com/nebkor/hitman); it's called Hitman!
## What's the problem?
Back in the day[^web1.0], there was basically only one way to have a website: you have a Linux box,
running the Apache webserver, with PHP enabled, and a MySQL database to hold state; this is your
classic LAMP stack, obviously. If this is your website, adding a visible counter is trivial: you
just use PHP to do server-side rendering of the count after a quick SQL query. And because this was
basically the only way to have a website, lots of "website operators" put hitcounters on their sites,
because why not?

But this is the year 2024, and we do things differently these days. This blog, for example, is built
with a "static site generator" called [Zola](https://www.getzola.org/), which means that there's no
server-side rendering, or any other kind of dynamic behavior from the backend. It's served by a
small Linux VPS that's running the [Caddy](https://caddyserver.com/) webserver, and costs about five
bucks a month to run. If I wanted to have a hitcounter, I'd have to do something non-traditional.
## What's the solution?
For me, it turned out to be a sidecar microservice for counting and reporting the hits. As usual
these days, my first instinct was to reach for [Axum](https://docs.rs/axum/latest/axum/), a framework
for building servers in Rust, and to use [SQLite](https://sqlite.org/) for the database. Caddy proxies
all requests to the hit-counting URL to Hitman, which listens only on localhost.
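
If you squint, the whole service is just one route. Here's a minimal sketch of that shape; this is
illustrative, not the actual Hitman source, and the `hits` table and its schema are made up (and
assumed to already exist):

```rust
// Minimal sketch of an Axum + SQLite hit-counting sidecar.
use axum::{
    extract::{Path, State},
    routing::get,
    Router,
};
use sqlx::SqlitePool;

async fn hit(Path(slug): Path<String>, State(db): State<SqlitePool>) -> String {
    // Record the hit, then return the running total for this slug.
    let _ = sqlx::query("INSERT INTO hits (slug) VALUES (?1)")
        .bind(&slug)
        .execute(&db)
        .await;
    let (count,): (i64,) = sqlx::query_as("SELECT COUNT(*) FROM hits WHERE slug = ?1")
        .bind(&slug)
        .fetch_one(&db)
        .await
        .unwrap_or((0,));
    count.to_string()
}

#[tokio::main]
async fn main() {
    let db = SqlitePool::connect("sqlite://hits.db").await.unwrap();
    // axum 0.7-style path capture.
    let app = Router::new().route("/hit/:slug", get(hit)).with_state(db);
    // Bind to localhost only; Caddy is the public face.
    let listener = tokio::net::TcpListener::bind("127.0.0.1:5000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```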
### That sounds simple
Ha ha, it does, doesn't it? And in the end, it actually kinda is. But there are a few nuances to
consider.
### Privacy
The less I know, the better, as far as I'm concerned, and I didn't see any reason for this project
to change that, but I'd need to track the IP of the requesting client in order to de-duplicate
views. Someone linked to [this
post](https://herman.bearblog.dev/how-bear-does-analytics-with-css/) about how the author uses a
notional CSS load to register a hit, and how they hash the IP with the date to keep the counts
down to one per IP per day. They're doing quite a bit more actual "analytics" than I'm interested
in, but I liked the hashing idea. They mention scrubbing the hashes from their DB every night to
pre-emptively satisfy an overzealous GDPR regulator[^logs], but I had a better idea: hash the
IP+date with a random number that is never disclosed, and is regenerated every time the server
restarts.

I wound up [hashing with the date +
hour](https://git.kittencollective.com/nebkor/hitman/src/commit/1617eae17448273114ca1b1d9277b3465986e9f1/src/main.rs#L79-L94),
along with the page, IP, and the secret. This buckets views to one per IP per page per hour, vs. the
once-per-day from the bearblog.
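
In sketch form, the dedup key looks something like this; the hash and the crates here are
stand-ins, and the linked source is the real deal:

```rust
// Illustrative dedup key. The secret is random, never disclosed, and
// regenerated on every restart, so the hashes can't be reversed into
// IPs, even by me.
use chrono::{DateTime, Utc};
use sha2::{Digest, Sha256};

fn view_key(secret: &[u8], slug: &str, ip: &str, now: DateTime<Utc>) -> String {
    // Truncating the timestamp to the hour buckets repeat views:
    // one count per IP, per page, per hour.
    let bucket = now.format("%Y-%m-%dT%H").to_string();
    let mut h = Sha256::new();
    h.update(secret);
    h.update(slug.as_bytes());
    h.update(ip.as_bytes());
    h.update(bucket.as_bytes());
    hex::encode(h.finalize())
}
```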
### Security?
I spent some time on this, but ultimately realized that there's
- not much I can do, but
- not much they can do, either.

The server [rejects remote
origins](https://git.kittencollective.com/nebkor/hitman/src/commit/1617eae17448273114ca1b1d9277b3465986e9f1/src/main.rs#L45-L48),
but the `Origin` header can be trivially forged. On the other hand, the worst someone could do is
add a bunch of junk to my DB, and I don't care about the data that much; this is all just for
funsies, anyway!
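
The check itself is about as dumb as it sounds; a sketch, not the linked code verbatim:

```rust
// Sketch of the Origin gate: anything that isn't exactly the expected
// origin gets a 403. Trivially forgeable, but it keeps honest browsers honest.
use axum::http::{HeaderMap, StatusCode};

fn check_origin(headers: &HeaderMap, allowed: &str) -> Result<(), StatusCode> {
    match headers.get("origin").and_then(|v| v.to_str().ok()) {
        Some(origin) if origin == allowed => Ok(()),
        _ => Err(StatusCode::FORBIDDEN),
    }
}
```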
Still, after writing this out, I realized that someone could send a bunch of junk slugs and hence
fill my disk from a single IP, so I [added a check against a set of allowed
slugs](https://git.kittencollective.com/nebkor/hitman/commit/89a985e96098731e5e8691fd84776c1592b6184b)
to guard against that. Beyond that, I'd need to start thinking about being robust against a targeted
and relatively sophisticated distributed attack, and it's definitely not worth it.
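
The guard is just a set-membership test against slugs known at startup; something like the
following, where the newline-delimited slug file is my own invention for illustration:

```rust
// Illustrative allow-list: load the known slugs once at startup and
// refuse to count anything that isn't in the set.
use std::collections::HashSet;

fn load_allowed_slugs(path: &str) -> std::io::Result<HashSet<String>> {
    Ok(std::fs::read_to_string(path)?
        .lines()
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect())
}

fn is_allowed(allowed: &HashSet<String>, slug: &str) -> bool {
    allowed.contains(slug)
}
```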
## The front end
I mentioned that this blog is made using Zola, a static site generator. Zola has a built-in
templating system, so the [following
bit](https://git.kittencollective.com/nebkor/blog/commit/87afa418b239419f551459e9cc5e838f9fac7ed6)
of HTML with inlined JavaScript is enough to register a hit and return the latest count:
```html
<div class="hias-footer">
  <p>There have been <span id="hitman-count">no</span> views of this page.</p>
</div>
<script defer>
  const hits = document.getElementById('hitman-count');
  fetch("/hit/{{ page.slug }}").then((resp) => {
    if (resp.ok) {
      return resp.text();
    } else {
      return "I don't even know how many";
    }
  }).then((data) => {
    hits.innerHTML = data;
  });
</script>
```
## Putting it all together
OK, all the pieces are laid out, so here's the actual setup on the backend:
### Caddy
The Caddy configuration has the following:
```
proclamations.nebcorp-hias.com {
    handle /hit/* {
        reverse_proxy localhost:5000
    }

    handle {
        <all the other routes on the site>
    }
}
```
This means that requests to, e.g., `https://proclamations.nebcorp-hias.com/hit/hitman` will register a
hit for this post, and return the number of views so far.
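
A quick smoke test looks like this (the count, of course, is whatever it happens to be at the time):

```
$ curl https://proclamations.nebcorp-hias.com/hit/hitman
23
```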
### systemd
I created a system user for the service, `hitman`, with a homedir in `/var/lib/hitman`, and added
the following systemd unit file into `/etc/systemd/system/hitman.service`:
```
[Unit]
Description=Hitman
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=exec
User=hitman
Group=hitman
ExecStart=/var/lib/hitman/hitman -e /var/lib/hitman/.env
Restart=on-failure
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full

[Install]
WantedBy=multi-user.target
```
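
Creating that user and enabling the unit is the usual dance, roughly (exact `useradd` flags may
vary by distro):

```
$ sudo useradd --system --home-dir /var/lib/hitman --create-home hitman
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now hitman.service
```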
This will ensure the hitman service is running after boot, and will be restarted if it crashes:
```
$ systemctl status hitman.service
● hitman.service - Hitman
     Loaded: loaded (/etc/systemd/system/hitman.service; enabled; preset: enabled)
     Active: active (running) since Sun 2024-03-31 12:12:14 PDT; 4h 0min ago
   Main PID: 46338 (hitman)
      Tasks: 2 (limit: 1018)
     Memory: 948.0K
        CPU: 53ms
     CGroup: /system.slice/hitman.service
             └─46338 /var/lib/hitman/hitman -e /var/lib/hitman/.env
```
### Hitman
Inside the `/var/lib/hitman` directory there's a `.env` file with the following content:
```
DATABASE_URL=sqlite:///${HOME}/.hitman.db
DATABASE_FILE=${HOME}/.hitman.db
LISTENING_ADDR=127.0.0.1
LISTENING_PORT=5000
HITMAN_ORIGIN=https://proclamations.nebcorp-hias.com
```
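
The `-e` flag in the `ExecStart` line points at that file. A sketch of how the binary might consume
it, assuming the `dotenvy` crate (the real Hitman may do this differently):

```rust
// Illustrative config loading: pull the env file into the process
// environment, then read the listening address back out of it.
use std::net::SocketAddr;

fn load_config(env_path: &str) -> SocketAddr {
    dotenvy::from_path(env_path).expect("couldn't read env file");
    let addr = std::env::var("LISTENING_ADDR").unwrap_or_else(|_| "127.0.0.1".into());
    let port = std::env::var("LISTENING_PORT").unwrap_or_else(|_| "5000".into());
    format!("{addr}:{port}").parse().expect("bad listening address")
}
```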
## Coda
When I got this working, a friend said, "Drat, that means I need to follow through on my goal to
write a little web-ring server." Something like two hours later, she had [a working
webring](https://erikarow.land/notes/gleam-webring), and indeed, if you look at the bottom of this
very page, you'll see the webring links; as she says, this Web 1.0 stuff is fun!
---
[^web1.0]: I think of the hitcounter era as the 90s, but that's because I'm older than the person
who asked the question.
[^logs]: They don't mention scrubbing IPs from their logs, but they do mention having logs, so clearly
the job to scrub the hit DB of hashes is just privacy kabuki.