upbeatBytes/docs/images-and-visitor-metrics.md
thejayman77 9d46e03ab8 docs: durable policy of record for images + visitor metrics (Codex close-out)
Encodes the source-level image-rights policy (cache/remote/none; default remote,
opt-in cache only for cleared sources) and the Recorded-visits vs Engaged-readers
metric, so the decisions live in the repo for future audits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 14:09:24 -04:00


Images & visitor metrics — policy of record

Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.

Article images — source-level rights policy

We do not blanket-cache publisher images. Caching (re-hosting a copy on our origin) is opt-in per source; the safe default is to display the publisher's own image without copying it.

sources.image_policy (one of):

| policy | meaning | when to use |
| --- | --- | --- |
| cache | Re-host a downscaled WebP (≤800px) at /api/img/<id> | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain, or our own/gov public-domain material |
| remote (default) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared (display, never copy) |
| none | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |

Default for new/unknown sources is remote. Nothing is re-hosted until an admin sets a source to cache (admin → Sources → expand a source → Images selector, or POST /api/admin/sources/{id}/image-policy {"policy": "cache"}).
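
As an illustration of the admin call above, here is a minimal sketch that builds (but does not send) the request. The endpoint path and JSON body come from this doc; the base URL, the helper name `build_image_policy_request`, and the client-side validation are assumptions for the example, and any admin authentication the deployment requires is not shown.

```python
import json
import urllib.request

# The three policy values listed in this doc.
ALLOWED_POLICIES = {"cache", "remote", "none"}


def build_image_policy_request(
    base_url: str, source_id: int, policy: str
) -> urllib.request.Request:
    """Build the POST that sets a source's image policy (hypothetical helper)."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unknown image policy: {policy!r}")
    url = f"{base_url.rstrip('/')}/api/admin/sources/{source_id}/image-policy"
    body = json.dumps({"policy": policy}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and source id; nothing is sent over the network here.
req = build_image_policy_request("https://example.invalid", 42, "cache")
```

Passing `req` to `urllib.request.urlopen` would send it; rejecting unknown policy values before the call keeps a typo from reaching the server.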

Why conservative: fair use is case-specific and balances four factors. "Reduced + attributed + linked" is good practice but not permission, and the search-engine-thumbnail precedent (Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a universal thumbnail exemption. So nothing is re-hosted until a source has its own rights basis.

How resolution works

  • newsimg.display_url(article_id, image_policy, raw_url) returns the display URL: /api/img/<id> for cache, the publisher URL for remote, None for none/no-image.
  • Applied server-side in Article.from_row (feed/brief/history) and share.py (the /a/<id> page). og:image/twitter:image always reference the publisher's own image URL (a link, not a copy) — never our cached path — so social crawlers don't hit our endpoint.
  • The frontend uses the resolved image_url as-is; the hub probe-retry + ArticleCard onerror cover slow/failed loads for both cached and hotlinked URLs.
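
The resolution rule in the first bullet can be sketched as below. This is not the actual newsimg code: the signature follows the doc, but the body is an assumption, including the choice to treat an unrecognized policy value like remote.

```python
from typing import Optional


def display_url(
    article_id: int, image_policy: str, raw_url: Optional[str]
) -> Optional[str]:
    """Resolve the URL the frontend should display for an article image."""
    # No publisher image, or policy "none": typographic cover, no URL.
    if not raw_url or image_policy == "none":
        return None
    # Cleared sources are served from our own cache endpoint.
    if image_policy == "cache":
        return f"/api/img/{article_id}"
    # "remote", and (by assumption) any unrecognized value: hotlink, never copy.
    return raw_url
```

Falling back to the publisher URL for unknown values keeps the failure mode on the display-only side, so a bad policy value can never cause an image to be re-hosted.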

Cache mechanics (goodnews/newsimg.py)

  • The cycle owns all fetching (warm()), gated on image_policy='cache' and run under the cycle lock. GET /api/img/<id> serves cache hits only (it never fetches) and is restricted to accepted, canonical articles whose source policy is cache. This leaves no SSRF or worker-exhaustion surface on the public endpoint.
  • _safe_fetch: http(s) only, _host_is_public checked on every redirect hop (HTTPError-based redirects are followed too), body size capped. Any 4xx other than 429 is permanent (negative-cached via <sha1>.fail); 429, 5xx, and network errors are transient.
  • _encode: decoded raster → WebP only; rejects SVG and undecodable input; the pixel ceiling is enforced before decode (images with w*h > _MAX_PIXELS are rejected). Originals are never retained.
  • Bounded: hard size cap (default 1 GB, GOODNEWS_IMG_CACHE_CAP) with LRU eviction; .fail markers swept after _FAIL_TTL_S. data/img_cache/ is gitignored (runtime data).
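
Three of the checks above are small enough to sketch in isolation. The function names, the `MAX_PIXELS` value, and the handling of leftover status codes are assumptions for illustration. The real _host_is_public presumably resolves the hostname first, whereas this sketch only classifies an already-resolved IP address.

```python
import ipaddress
from typing import Optional

# Hypothetical ceiling; the real _MAX_PIXELS value is not stated in this doc.
MAX_PIXELS = 40_000_000


def ip_is_public(addr: str) -> bool:
    """True only for globally routable, non-multicast addresses."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # not an IP literal at all
    return ip.is_global and not ip.is_multicast


def classify_status(status: Optional[int]) -> str:
    """Map a fetch outcome to 'ok', 'permanent', or 'transient'.

    `status` is the HTTP status code, or None for a network-level failure.
    """
    if status is None:
        return "transient"
    if 200 <= status < 300:
        return "ok"
    if status == 429 or status >= 500:
        return "transient"
    if 400 <= status < 500:
        return "permanent"  # would be negative-cached
    return "transient"  # assumption: anything else is retried later


def exceeds_pixel_ceiling(
    width: int, height: int, max_pixels: int = MAX_PIXELS
) -> bool:
    """Check dimensions read from the image header, before a full decode."""
    return width * height > max_pixels
```

Running the address check on every redirect hop matters because a public URL can redirect to a loopback or link-local target such as a cloud metadata endpoint.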

Visitor metrics — Recorded visits vs Engaged readers

A JS-capable bot can trip the visit beacon, so the admin shows two numbers:

  • Recorded visits: raw count, one daily visit beacon per device. Known-bot User-Agents are filtered at /api/events (queries.is_bot_ua), but UA-spoofing bots still land here, so this number is noisy.
  • Engaged readers: distinct visitor-days with deliberate activity (the honest number). A visitor-day qualifies through:
    • the gesture-gated engaged beacon (analytics.armEngaged, mirrored on the share page) — fires once/day only after ~8s visible and a real scroll/pointer/key/touch; or
    • a deliberate action: source_click, full_story, share_ub/copy_source/native_share, replace_used, paywall_replace, paywalled_source_open, not_today/less_like_this/hide_topic, or a game started/completed/shared.
    • Never counts auto-fired visit/summary_viewed/open, replace_none, or game *_arrival.
    • Defined by queries.ENGAGED_EVENT_KINDS; surfaced as visitors.engaged_today/d7/d30.
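
The difference between the two numbers can be sketched over an in-memory event list. The `Event` shape and both function names are assumptions, and the set below is only an illustrative subset of queries.ENGAGED_EVENT_KINDS: the kind name "engaged" for the gesture-gated beacon is a guess, and the game kinds are omitted because their exact names are not given here.

```python
from datetime import date
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    visitor_hash: str  # salted hash only, as in the privacy note
    day: date
    kind: str


# Illustrative subset of the deliberate-activity kinds listed above.
ENGAGED_EVENT_KINDS = frozenset({
    "engaged",  # assumed kind name for the gesture-gated beacon
    "source_click", "full_story",
    "share_ub", "copy_source", "native_share",
    "replace_used", "paywall_replace", "paywalled_source_open",
    "not_today", "less_like_this", "hide_topic",
})


def recorded_visits(events: Iterable[Event]) -> int:
    """Raw count of visit beacons (at most one per device per day)."""
    return sum(1 for e in events if e.kind == "visit")


def engaged_readers(events: Iterable[Event]) -> int:
    """Distinct (visitor, day) pairs with at least one engaged-kind event."""
    return len({
        (e.visitor_hash, e.day) for e in events if e.kind in ENGAGED_EVENT_KINDS
    })
```

Counting distinct visitor-days rather than events means a reader who clicks ten sources in one day still counts once, while auto-fired kinds such as visit or summary_viewed never count.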

Privacy unchanged: only a salted visitor_hash is stored (no IP, no raw token, no fingerprint).