Images & visitor metrics — policy of record

Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.

Article images — source-level rights policy

We do not blanket-cache publisher images. Caching (re-hosting a copy on our origin) is opt-in per source; the safe default is to display the publisher's own image without copying it.

sources.image_policy (one of):

| policy | meaning | when to use |
|---|---|---|
| cache | Re-host a downscaled WebP (≤800px) at /api/img/<id> | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain, or our own/gov public-domain material |
| remote (default) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared: display, never copy |
| none | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |

Default for new/unknown sources is remote. Nothing is re-hosted until an admin sets a source to cache (admin → Sources → expand a source → Images selector, or POST /api/admin/sources/{id}/image-policy {"policy": "cache"}).
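
A minimal sketch of the admin call above, using only the standard library. The path and JSON body come from this doc; the base URL, the helper name build_policy_request, and the absence of auth headers are assumptions for illustration (real admin auth is deployment-specific and not shown here).

```python
import json
import urllib.request

# The three values of sources.image_policy listed in the table above.
VALID_POLICIES = {"cache", "remote", "none"}


def build_policy_request(base_url: str, source_id: int, policy: str) -> urllib.request.Request:
    """Build (but do not send) the POST that sets a source's image policy.

    Hypothetical helper: only the path and body shape are taken from the doc.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown image policy: {policy!r}")
    url = f"{base_url.rstrip('/')}/api/admin/sources/{source_id}/image-policy"
    body = json.dumps({"policy": policy}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# example.invalid is a placeholder host, not a real deployment.
req = build_policy_request("https://example.invalid", 42, "cache")
# urllib.request.urlopen(req) would send it; admin credentials would be needed.
```

Validating the policy client-side keeps a typo from ever reaching the endpoint, which matters because cache is the one value with rights implications.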

Why conservative: fair use is case-specific and balances four factors; "reduced + attributed + linked" is good practice but not permission, and the search-engine-thumbnail precedent (Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a universal thumbnail exemption. So re-hosting waits until a source has its own rights basis.

How resolution works

  • newsimg.display_url(article_id, image_policy, raw_url) returns the display URL: /api/img/<id> for cache, the publisher URL for remote, None for none/no-image.
  • Applied server-side in Article.from_row (feed/brief/history) and share.py (the /a/<id> page). og:image/twitter:image always reference the publisher's own image URL (a link, not a copy) — never our cached path — so social crawlers don't hit our endpoint.
  • The frontend uses the resolved image_url as-is; the hub probe-retry + ArticleCard onerror cover slow/failed loads for both cached and hotlinked URLs.
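
The resolution rule in the first bullet can be sketched as below. The signature mirrors newsimg.display_url as described here, but the body is an illustration rather than the actual implementation; in particular, treating an unrecognized policy value like none is an assumption.

```python
from typing import Optional


def display_url(article_id: int, image_policy: str, raw_url: Optional[str]) -> Optional[str]:
    """Resolve the URL the frontend should display for an article image."""
    if not raw_url:
        # No publisher image at all: nothing to show under any policy.
        return None
    if image_policy == "cache":
        # Re-hosted copy served from our origin.
        return f"/api/img/{article_id}"
    if image_policy == "remote":
        # Hotlink: display the publisher's own URL, never copy it.
        return raw_url
    # "none" (and, by assumption, any unknown value): no image.
    return None
```

For example, `display_url(7, "remote", "https://publisher.example/a.jpg")` returns the publisher URL unchanged, while the same call with `"cache"` returns `/api/img/7`.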

Cache mechanics (goodnews/newsimg.py)

  • The cycle owns all fetching (warm()), gated on image_policy='cache', under the cycle lock. GET /api/img/<id> serves cache hits only (never fetches), restricted to accepted + canonical articles whose source is cache. No SSRF/worker-exhaustion surface on the public endpoint.
  • _safe_fetch: http(s) only; _host_is_public is checked on every redirect hop (including redirects surfaced as HTTPError); body size is capped. A 4xx other than 429 is permanent (negative-cached via <sha1>.fail); 429, 5xx, and network errors are transient.
  • _encode: decoded raster → WebP only; rejects SVG and undecodable input; the pixel ceiling is enforced before decode (images with w*h > _MAX_PIXELS are rejected). Originals are never retained.
  • Bounded: hard size cap (default 1 GB, GOODNEWS_IMG_CACHE_CAP) with LRU eviction; .fail markers swept after _FAIL_TTL_S. data/img_cache/ is gitignored (runtime data).
  • Revocation: when a source leaves cache (set to remote/none in admin), the endpoint calls newsimg.purge_source() to delete that source's re-hosted copies immediately — they don't linger on disk. (Setting to cache just flips the flag; the cycle warms it.)
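
Two of the _safe_fetch rules above, sketched with the standard library. These are stand-ins, not the code in goodnews/newsimg.py: host_is_public and classify_status are hypothetical names, and using ipaddress's is_global as the definition of "public" is an assumption.

```python
import ipaddress
import socket


def host_is_public(host: str) -> bool:
    """True only if every address the host maps to is globally routable.

    Meant to be re-run on each redirect hop, so a public URL cannot
    bounce the fetcher to loopback, private, or link-local addresses.
    """
    try:
        # IP literals are checked directly, with no DNS lookup.
        return ipaddress.ip_address(host).is_global
    except ValueError:
        pass
    try:
        infos = socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        return False
    # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
    addrs = {info[4][0].split("%", 1)[0] for info in infos}
    return bool(addrs) and all(ipaddress.ip_address(a).is_global for a in addrs)


def classify_status(status: int) -> str:
    """Map an HTTP status to the outcome classes described above."""
    if 200 <= status < 300:
        return "ok"
    if status == 429 or status >= 500:
        return "transient"   # retry on a later cycle
    if 400 <= status < 500:
        return "permanent"   # negative-cache with a .fail marker
    # 1xx/3xx left over after redirect handling: assumed retryable.
    return "transient"
```

Requiring every resolved address to be public, rather than any one of them, is the stricter choice for hosts that resolve to a mix of public and internal addresses.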

Visitor metrics — Recorded visits vs Engaged readers

A JS-capable bot can trip the visit beacon, so the admin shows two numbers:

  • Recorded visits — raw count: one daily visit beacon per device. Known-bot User-Agents are filtered at /api/events (queries.is_bot_ua), but UA-spoofing bots still land here. Noisy.
  • Engaged readers: distinct visitor-days with deliberate activity (the honest number):
    • the gesture-gated engaged beacon (analytics.armEngaged, mirrored on the share page) — fires once/day only after ~8s visible and a real scroll/pointer/key/touch; or
    • a deliberate action: source_click, full_story, share_ub/copy_source/native_share, replace_used, paywall_replace, paywalled_source_open, not_today/less_like_this/hide_topic, or a game started/completed/shared.
    • Never counts auto-fired visit/summary_viewed/open, replace_none, or game *_arrival.
    • Defined by queries.ENGAGED_EVENT_KINDS; surfaced as visitors.engaged_today/d7/d30.
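
The counting rule above can be sketched as a set of distinct (visitor, day) pairs filtered by event kind. The kinds listed are the ones this doc names; the beacon's kind string "engaged" is an assumption, the game kinds are left out because their exact names are not given here, and the tuple layout is illustrative rather than the real schema.

```python
from datetime import date
from typing import Iterable, Tuple

# Illustrative subset of queries.ENGAGED_EVENT_KINDS (game kinds omitted;
# "engaged" is an assumed name for the gesture-gated beacon).
ENGAGED_EVENT_KINDS = frozenset({
    "engaged",
    "source_click", "full_story",
    "share_ub", "copy_source", "native_share",
    "replace_used", "paywall_replace", "paywalled_source_open",
    "not_today", "less_like_this", "hide_topic",
})

# (visitor_hash, day, kind): a hypothetical row shape.
Event = Tuple[str, date, str]


def engaged_visitor_days(events: Iterable[Event]) -> int:
    """Count distinct visitor-days with at least one engaged event."""
    return len({(v, d) for v, d, kind in events if kind in ENGAGED_EVENT_KINDS})


d1, d2 = date(2026, 7, 1), date(2026, 7, 2)
sample = [
    ("v1", d1, "visit"),         # auto-fired, never counts
    ("v1", d1, "source_click"),  # counts v1 on d1
    ("v1", d1, "full_story"),    # same visitor-day, no double count
    ("v2", d1, "replace_none"),  # excluded kind
    ("v3", d1, "engaged"),       # counts v3 on d1
    ("v1", d2, "share_ub"),      # counts v1 on d2
]
print(engaged_visitor_days(sample))  # 3
```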

Warm-up caveat: the engaged beacon began 2026-06-30, so rolling windows fill over time — a low engaged_d7/engaged_d30 is partly warm-up, NOT proof the gap to recorded visits was all bots. Compare d7 after a full week, d30 after thirty days. (Admin shows this note inline.)
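
A small sketch of the warm-up check. The start date comes from this doc; window_is_full is a hypothetical helper, and counting only whole days elapsed since the start (so the start day itself does not count as a full day) is an assumption.

```python
from datetime import date

# Date the engaged beacon began, per this doc.
ENGAGED_BEACON_START = date(2026, 6, 30)


def window_is_full(today: date, window_days: int,
                   start: date = ENGAGED_BEACON_START) -> bool:
    """True once a rolling window of window_days is fully covered by beacon data."""
    return (today - start).days >= window_days
```

Under that assumption, engaged_d7 is comparable from 2026-07-07 and engaged_d30 from 2026-07-30; before those dates a low value is at least partly warm-up.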

Privacy unchanged: only a salted visitor_hash is stored (no IP, no raw token, no fingerprint).
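
One way a salted visitor_hash could be derived, as an illustration only. This doc does not specify the scheme, so the HMAC-SHA256 construction, the function name, and the input (a per-device token that is itself never stored) are all assumptions.

```python
import hashlib
import hmac


def visitor_hash(token: str, salt: bytes) -> str:
    """Keyed hash of a visitor token; only this digest would be stored.

    Assumed scheme: HMAC-SHA256 with a server-side salt as the key.
    """
    return hmac.new(salt, token.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder values for illustration; a real salt would be a server secret.
h = visitor_hash("device-token-123", b"example-salt")
```

Keying the hash with a server-side salt means the stored value cannot be recomputed from a token alone, and the same token hashes differently under a different salt.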

Referrer suppression on remote images

Every on-site image request for a remote source sets referrerpolicy="no-referrer" so the publisher CDN doesn't get the referring URL: article cards, the share page, AND the homepage hero (converted from a CSS background-image to a real <img> — the retry probe sets probe.referrerPolicy='no-referrer' too). This hides the referrer, not the visitor's IP — any remote image necessarily exposes the IP to the CDN. For zero third-party image requests, the source must be none or explicitly cleared for local caching (cache).