diff --git a/docs/images-and-visitor-metrics.md b/docs/images-and-visitor-metrics.md
new file mode 100644
index 0000000..b2803c3
--- /dev/null
+++ b/docs/images-and-visitor-metrics.md
@@ -0,0 +1,62 @@
+# Images & visitor metrics — policy of record
+
+Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.
+
+## Article images — source-level rights policy
+
+We do **not** blanket-cache publisher images. Caching (re-hosting a copy on our origin) is
+**opt-in per source**; the safe default is to display the publisher's own image without copying it.
+
+`sources.image_policy` (one of):
+
+| policy | meaning | when to use |
+|---|---|---|
+| `cache` | Re-host a downscaled WebP (≤800px) at `/api/img/` | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain, or our own/gov public-domain material |
+| `remote` (**default**) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared — display, never copy |
+| `none` | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |
+
+**Default for new/unknown sources is `remote`.** Nothing is re-hosted until an admin sets a
+source to `cache` (admin → Sources → expand a source → Images selector, or
+`POST /api/admin/sources/{id}/image-policy {"policy": "cache"}`).
+
+Why conservative: fair use is case-specific and balances four factors; "reduced + attributed +
+linked" is good practice but **not permission**, and the search-engine-thumbnail precedent
+(Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a
+universal thumbnail exemption. So re-hosting waits on a per-source rights basis.
+
+### How resolution works
+- `newsimg.display_url(article_id, image_policy, raw_url)` returns the display URL: `/api/img/`
+  for `cache`, the publisher URL for `remote`, `None` for `none`/no-image.
+- Applied server-side in `Article.from_row` (feed/brief/history) and `share.py` (the `/a/`
+  page). `og:image`/`twitter:image` always reference the **publisher's own** image URL (a link,
+  not a copy) — never our cached path — so social crawlers don't hit our endpoint.
+- The frontend uses the resolved `image_url` as-is; the hub probe-retry + ArticleCard `onerror`
+  cover slow/failed loads for both cached and hotlinked URLs.
+
+### Cache mechanics (`goodnews/newsimg.py`)
+- **The cycle owns all fetching** (`warm()`), gated on `image_policy='cache'`, under the cycle lock.
+  `GET /api/img/` serves **cache hits only** (never fetches), restricted to accepted + canonical
+  articles whose source is `cache`. No SSRF/worker-exhaustion surface on the public endpoint.
+- `_safe_fetch`: http(s) only, `_host_is_public` on every redirect hop (HTTPError-based redirects
+  followed), body capped. 4xx≠429 = permanent (negative-cached via `.fail`); 429/5xx/network = transient.
+- `_encode`: decoded raster → WebP only; rejects SVG/undecodable; pixel ceiling enforced
+  (`w*h > _MAX_PIXELS`) **before** decode. Originals are never retained.
+- Bounded: hard size cap (default 1 GB, `GOODNEWS_IMG_CACHE_CAP`) with LRU eviction; `.fail`
+  markers swept after `_FAIL_TTL_S`. `data/img_cache/` is gitignored (runtime data).
+
+## Visitor metrics — Recorded visits vs Engaged readers
+
+A JS-capable bot can trip the visit beacon, so the admin shows two numbers:
+
+- **Recorded visits** — raw count: one daily `visit` beacon per device. Known-bot User-Agents are
+  filtered at `/api/events` (`queries.is_bot_ua`), but UA-spoofing bots still land here. Noisy.
+- **Engaged readers** — distinct visitor-day with **deliberate** activity (the honest number):
+  - the gesture-gated `engaged` beacon (`analytics.armEngaged`, mirrored on the share page) — fires
+    once/day only after ~8s visible **and** a real scroll/pointer/key/touch; or
+  - a deliberate action: `source_click`, `full_story`, `share_ub`/`copy_source`/`native_share`,
+    `replace_used`, `paywall_replace`, `paywalled_source_open`, `not_today`/`less_like_this`/`hide_topic`,
+    or a game `started`/`completed`/`shared`.
+  - **Never** counts auto-fired `visit`/`summary_viewed`/`open`, `replace_none`, or game `*_arrival`.
+  - Defined by `queries.ENGAGED_EVENT_KINDS`; surfaced as `visitors.engaged_today/d7/d30`.
+
+Privacy unchanged: only a salted `visitor_hash` is stored (no IP, no raw token, no fingerprint).
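
Review note (not part of the patch): the Recorded visits vs Engaged readers split described above can be sketched roughly as follows. This is a minimal illustration, not the project's code. `ENGAGED_KINDS` is a hand-copied stand-in for `queries.ENGAGED_EVENT_KINDS`, `count_visits_and_engaged` is a hypothetical helper, and the `(visitor_hash, day, kind)` tuple shape is an assumption about how events are stored. Game `started`/`completed`/`shared` events are left out because the doc does not give their stored kind names.

```python
from datetime import date

# Illustrative copy of the deliberate kinds listed in the doc; the real set
# lives in queries.ENGAGED_EVENT_KINDS. Game events are omitted (names unknown).
ENGAGED_KINDS = frozenset({
    "engaged",
    "source_click", "full_story",
    "share_ub", "copy_source", "native_share",
    "replace_used", "paywall_replace", "paywalled_source_open",
    "not_today", "less_like_this", "hide_topic",
})


def count_visits_and_engaged(events):
    """Return (recorded_visits, engaged_readers).

    `events` is an iterable of (visitor_hash, day, kind) tuples (assumed shape).
    Recorded visits is the raw count of `visit` events; engaged readers is the
    number of distinct (visitor_hash, day) pairs with at least one engaged kind.
    """
    recorded_visits = 0
    engaged_days = set()
    for visitor_hash, day, kind in events:
        if kind == "visit":
            recorded_visits += 1
        if kind in ENGAGED_KINDS:
            engaged_days.add((visitor_hash, day))
    return recorded_visits, len(engaged_days)


today = date(2026, 7, 1)
sample = [
    ("a1", today, "visit"),
    ("a1", today, "summary_viewed"),  # auto-fired, never counts as engaged
    ("b2", today, "visit"),
    ("b2", today, "engaged"),         # gesture-gated beacon
    ("b2", today, "source_click"),    # same visitor-day, still one engaged reader
    ("c3", today, "replace_none"),    # explicitly excluded kind
]
print(count_visits_and_engaged(sample))  # (2, 1)
```

Keying the engaged count on the visitor-day pair rather than on events is what keeps one busy reader from inflating the number.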