# Images & visitor metrics: policy of record

Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.

## Article images: source-level rights policy

We do **not** blanket-cache publisher images. Caching (re-hosting a copy on our origin) is **opt-in per source**; the safe default is to display the publisher's own image without copying it.

`sources.image_policy` (one of):

| policy | meaning | when to use |
|---|---|---|
| `cache` | Re-host a downscaled WebP (≤800px) at `/api/img/` | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain, or our own/gov public-domain material |
| `remote` (**default**) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared: display, never copy |
| `none` | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |

**Default for new/unknown sources is `remote`.** Nothing is re-hosted until an admin sets a source to `cache` (admin → Sources → expand a source → Images selector, or `POST /api/admin/sources/{id}/image-policy {"policy": "cache"}`).

Why conservative: fair use is case-specific and balances four factors; "reduced + attributed + linked" is good practice but **not permission**, and the search-engine-thumbnail precedent (Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a universal thumbnail exemption. So re-hosting waits on a per-source rights basis.

### How resolution works

- `newsimg.display_url(article_id, image_policy, raw_url)` returns the display URL: `/api/img/` for `cache`, the publisher URL for `remote`, `None` for `none`/no-image.
- Applied server-side in `Article.from_row` (feed/brief/history) and `share.py` (the `/a/` page). `og:image`/`twitter:image` always reference the **publisher's own** image URL (a link, not a copy), never our cached path, so social crawlers don't hit our endpoint.
- The frontend uses the resolved `image_url` as-is; the hub probe-retry + ArticleCard `onerror` cover slow/failed loads for both cached and hotlinked URLs.

### Cache mechanics (`goodnews/newsimg.py`)

- **The cycle owns all fetching** (`warm()`), gated on `image_policy='cache'`, under the cycle lock. `GET /api/img/` serves **cache hits only** (never fetches), restricted to accepted + canonical articles whose source is `cache`. No SSRF/worker-exhaustion surface on the public endpoint.
- `_safe_fetch`: http(s) only, `_host_is_public` on every redirect hop (HTTPError-based redirects followed), body capped. 4xx≠429 = permanent (negative-cached via `.fail`); 429/5xx/network = transient.
- `_encode`: decoded raster → WebP only; rejects SVG/undecodable; pixel ceiling enforced (`w*h > _MAX_PIXELS`) **before** decode. Originals are never retained.
- Bounded: hard size cap (default 1 GB, `GOODNEWS_IMG_CACHE_CAP`) with LRU eviction; `.fail` markers swept after `_FAIL_TTL_S`. `data/img_cache/` is gitignored (runtime data).
- **Revocation:** when a source leaves `cache` (set to `remote`/`none` in admin), the endpoint calls `newsimg.purge_source()` to delete that source's re-hosted copies **immediately**, so they don't linger on disk. (Setting *to* `cache` just flips the flag; the cycle warms it.)

## Visitor metrics: Recorded visits vs Engaged readers

A JS-capable bot can trip the visit beacon, so the admin shows two numbers:

- **Recorded visits**: raw count, one daily `visit` beacon per device. Known-bot User-Agents are filtered at `/api/events` (`queries.is_bot_ua`), but UA-spoofing bots still land here. Noisy.
- **Engaged readers**: distinct visitor-day with **deliberate** activity (the honest number):
  - the gesture-gated `engaged` beacon (`analytics.armEngaged`, mirrored on the share page), which fires once/day only after ~8s visible **and** a real scroll/pointer/key/touch; or
  - a deliberate action: `source_click`, `full_story`, `share_ub`/`copy_source`/`native_share`, `replace_used`, `paywall_replace`, `paywalled_source_open`, `not_today`/`less_like_this`/`hide_topic`, or a game `started`/`completed`/`shared`.
  - **Never** counts auto-fired `visit`/`summary_viewed`/`open`, `replace_none`, or game `*_arrival`.
  - Defined by `queries.ENGAGED_EVENT_KINDS`; surfaced as `visitors.engaged_today/d7/d30`.

**Warm-up caveat:** the `engaged` beacon began **2026-06-30**, so rolling windows fill over time; a low `engaged_d7`/`engaged_d30` is partly warm-up, NOT proof the gap to recorded visits was all bots. Compare `d7` after a full week, `d30` after thirty days. (Admin shows this note inline.)

Privacy unchanged: only a salted `visitor_hash` is stored (no IP, no raw token, no fingerprint).

### Referrer suppression on remote images

Every on-site image request for a `remote` source sets `referrerpolicy="no-referrer"` so the publisher CDN doesn't get the referring URL: article cards, the share page, AND the homepage hero (converted from a CSS `background-image` to a real `<img>`; the retry probe sets `probe.referrerPolicy='no-referrer'` too). This hides the *referrer*, **not** the visitor's IP: any remote image necessarily exposes the IP to the CDN. For zero third-party image requests, the source must be `none` or explicitly cleared for local caching (`cache`).
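
A minimal sketch of how the resolution rule and the referrer suppression fit together, not the actual `newsimg` code. The `display_url` signature comes from this doc. Everything else is an assumption made for illustration: the function body, the `/api/img/{article_id}` path shape (the doc only names the `/api/img/` prefix), the fallback of unknown policy values to `remote`, and the `render_img_tag` helper, which is hypothetical.

```python
from html import escape
from typing import Optional

# Assumption: cached copies are addressed by article id under this prefix.
# The doc only states the `/api/img/` prefix, not the full path shape.
CACHE_PREFIX = "/api/img/"


def display_url(article_id: int, image_policy: str, raw_url: Optional[str]) -> Optional[str]:
    """Resolve the image URL to display, following the policy table."""
    if not raw_url or image_policy == "none":
        return None  # no image: the caller shows the typographic cover
    if image_policy == "cache":
        return f"{CACHE_PREFIX}{article_id}"  # our re-hosted WebP copy
    # `remote`, plus (assumption) any unrecognized value: display, never copy.
    return raw_url


def render_img_tag(article_id: int, image_policy: str, raw_url: Optional[str]) -> str:
    """Hypothetical helper: build an <img> tag for the resolved URL."""
    url = display_url(article_id, image_policy, raw_url)
    if url is None:
        return ""
    attrs = [f'src="{escape(url, quote=True)}"']
    if image_policy != "cache":
        # Hotlinked image: keep the referring page URL away from the publisher CDN.
        # This hides the referrer only; the visitor's IP is still visible to the CDN.
        attrs.append('referrerpolicy="no-referrer"')
    return f"<img {' '.join(attrs)}>"


if __name__ == "__main__":
    publisher = "https://pub.example/photo.jpg"
    print(render_img_tag(7, "remote", publisher))
    print(render_img_tag(7, "cache", publisher))
```

In this sketch only hotlinked images carry `referrerpolicy`, since a cached image is served from our own origin and sends nothing to the publisher.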