analytics: filter known-bot User-Agents at /api/events (honest visitor counts)
Many modern crawlers (AI scrapers, headless Chrome, link-preview fetchers) run JS and fire the visit/summary_viewed beacon, inflating "visitors" even though there's no human discovery channel.

Apply queries.is_bot_ua() at /api/events, the same filter the load-error beacon uses, so honest (self-identifying) bot UAs (GPTBot, AhrefsBot, headless Chrome, python/curl, …) are dropped before recording. The response is identical either way, so a bot can't detect that it was dropped.

Counts will read lower but truer going forward; past rows are unchanged. This won't catch bots that spoof a browser UA; that needs a heavier heuristic.

Tests: bot UAs are dropped and a real browser is counted. Existing event tests now send a real UA, because the default client UA contains "python" and would otherwise be filtered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+7 -2
@@ -1160,8 +1160,13 @@ def create_app() -> FastAPI:
     # --- Privacy-respecting first-party analytics -------------------------
 
     @app.post("/api/events")
-    def record_event(body: EventBody) -> dict:
-        if body.kind in _EVENT_KINDS:
+    def record_event(body: EventBody, request: Request) -> dict:
+        # Don't let crawlers inflate visitor/funnel counts. Many modern bots run JS and
+        # DO fire this beacon, so filter by User-Agent (same check the load-error beacon
+        # uses) — catches honest bot UAs (GPTBot, AhrefsBot, headless Chrome, …). The
+        # response is identical either way, so a bot can't tell it was dropped.
+        ua = request.headers.get("user-agent", "")
+        if body.kind in _EVENT_KINDS and not queries.is_bot_ua(ua):
             with get_conn() as conn:
                 conn.execute(
                     "INSERT OR IGNORE INTO events (kind, article_id, visitor_hash, day) "
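For readers who want to see the shape of the check, here is a minimal sketch of a User-Agent filter and a handler that answers identically whether or not the event is stored. The commit does not show the body of queries.is_bot_ua(), so the token list, the empty-UA rule, the event kinds set, the in-memory store and the {"ok": True} response are all assumptions made for illustration, not the project's actual code.

```python
import re

# Hypothetical token list: the real queries.is_bot_ua() is not shown in this commit.
# Substring matching is deliberately naive and can misfire on unusual browser UAs.
_BOT_UA_PATTERN = re.compile(
    r"bot|crawl|spider|headlesschrome|python|curl|wget",
    re.IGNORECASE,
)

# Assumed event kinds, taken from the beacon names in the commit message.
_EVENT_KINDS = {"visit", "summary_viewed"}


def is_bot_ua(user_agent: str) -> bool:
    """Return True for an empty UA or one containing a known bot token."""
    if not user_agent.strip():
        return True  # assumption: a missing UA is treated as non-human
    return _BOT_UA_PATTERN.search(user_agent) is not None


def record_event(kind: str, user_agent: str, store: list) -> dict:
    """Store the event unless the UA looks like a bot; reply the same either way."""
    if kind in _EVENT_KINDS and not is_bot_ua(user_agent):
        store.append(kind)  # stand-in for the INSERT into the events table
    # Same response for recorded and dropped events, so a bot cannot tell the difference.
    return {"ok": True}
```

Because the sketch matches "python", a test client that sends its library's default User-Agent would be silently dropped, which is why the commit's existing event tests switch to a real browser UA.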