analytics: filter known-bot User-Agents at /api/events (honest visitor counts)

Many modern crawlers (AI scrapers, headless Chrome, link-preview fetchers) run JS and
fire the visit/summary_viewed beacon, inflating "visitors" even though there's no
human discovery channel. Apply queries.is_bot_ua() at /api/events — the same filter
the load-error beacon uses — so honest bot UAs (GPTBot, AhrefsBot, headless Chrome,
python/curl, …) are dropped before recording. Response is identical so a bot can't
detect it. Counts read lower but truer going forward (past rows unchanged). Won't catch
UA-spoofing bots; that needs a heavier heuristic. Tests: bot UAs dropped, real browser
counted; existing event tests now send a real browser UA, since the default test-client
UA contains "python" and would otherwise be dropped.
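
For illustration only, a minimal sketch of the kind of User-Agent check described above. The real queries.is_bot_ua() is not shown in this commit, so the function name is_bot_ua_sketch, the token list, and the choice to treat an empty UA as a bot are all assumptions, not the project's actual implementation.

```python
import re

# Hypothetical token list: the real pattern set in queries.is_bot_ua() is not
# shown in this commit. These tokens cover the examples the message names
# (GPTBot, AhrefsBot, headless Chrome, python/curl).
_BOT_UA_RE = re.compile(
    r"bot|crawler|spider|headlesschrome|python|curl",
    re.IGNORECASE,
)


def is_bot_ua_sketch(ua: str) -> bool:
    """Return True when the UA is empty or contains a known-bot token.

    Treating an empty UA as a bot is an assumption made for this sketch.
    """
    return not ua or bool(_BOT_UA_RE.search(ua))


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "python-requests/2.31.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    ]
    for ua in samples:
        print(is_bot_ua_sketch(ua), ua)
```

A substring check like this only catches bots that identify themselves honestly, which matches the limitation noted above: a bot that sends a browser UA passes through.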

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jay
2026-06-30 11:19:51 -04:00
parent 27022108b4
commit ee43bb0df6
14 changed files with 33 additions and 7 deletions
+7 -2
@@ -1160,8 +1160,13 @@ def create_app() -> FastAPI:
     # --- Privacy-respecting first-party analytics -------------------------
     @app.post("/api/events")
-    def record_event(body: EventBody) -> dict:
-        if body.kind in _EVENT_KINDS:
+    def record_event(body: EventBody, request: Request) -> dict:
+        # Don't let crawlers inflate visitor/funnel counts. Many modern bots run JS and
+        # DO fire this beacon, so filter by User-Agent (same check the load-error beacon
+        # uses) — catches honest bot UAs (GPTBot, AhrefsBot, headless Chrome, …). The
+        # response is identical either way, so a bot can't tell it was dropped.
+        ua = request.headers.get("user-agent", "")
+        if body.kind in _EVENT_KINDS and not queries.is_bot_ua(ua):
             with get_conn() as conn:
                 conn.execute(
                     "INSERT OR IGNORE INTO events (kind, article_id, visitor_hash, day) "