Images & visitor metrics — policy of record
Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.
Article images — source-level rights policy
We do not blanket-cache publisher images. Caching (re-hosting a copy on our origin) is opt-in per source; the safe default is to display the publisher's own image without copying it.
sources.image_policy (one of):
| policy | meaning | when to use |
|---|---|---|
| `cache` | Re-host a downscaled WebP (≤800px) at `/api/img/<id>` | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain, or our own/gov public-domain material |
| `remote` (default) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared: display, never copy |
| `none` | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |
Default for new/unknown sources is remote. Nothing is re-hosted until an admin sets a
source to cache (admin → Sources → expand a source → Images selector, or
POST /api/admin/sources/{id}/image-policy {"policy": "cache"}).
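As an illustration of the admin call above, here is a minimal Python sketch that builds (but does not send) the request. The base URL, the integer source id, and the absence of auth headers are assumptions made for the example; only the path and the JSON body shape come from this doc.

```python
import json
import urllib.request

# Hypothetical base URL; the real admin host and any auth are not specified here.
BASE_URL = "https://example.invalid"

VALID_POLICIES = {"cache", "remote", "none"}


def build_image_policy_request(source_id: int, policy: str) -> urllib.request.Request:
    """Build the POST that sets a source's image policy, without sending it."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown image policy: {policy!r}")
    body = json.dumps({"policy": policy}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/admin/sources/{source_id}/image-policy",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_image_policy_request(42, "cache")
```

Validating the policy client-side keeps a typo from reaching the endpoint; the server remains the authority on which values are accepted.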
Why conservative: fair use is case-specific and balances four factors; "reduced + attributed + linked" is good practice but not permission, and the search-engine-thumbnail precedent (Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a universal thumbnail exemption. So re-hosting waits on a per-source rights basis.
How resolution works
- `newsimg.display_url(article_id, image_policy, raw_url)` returns the display URL: `/api/img/<id>` for `cache`, the publisher URL for `remote`, `None` for `none`/no-image.
- Applied server-side in `Article.from_row` (feed/brief/history) and `share.py` (the `/a/<id>` page). `og:image`/`twitter:image` always reference the publisher's own image URL (a link, not a copy), never our cached path, so social crawlers don't hit our endpoint.
- The frontend uses the resolved `image_url` as-is; the hub probe-retry + ArticleCard `onerror` cover slow/failed loads for both cached and hotlinked URLs.
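The resolution rule above can be sketched roughly as follows. This is not the actual body of `newsimg.display_url`; the type hints, the treatment of an empty `raw_url`, and the fallback of unrecognised policy values to `remote` are assumptions made for illustration.

```python
from typing import Optional


def display_url(article_id: int, image_policy: str, raw_url: Optional[str]) -> Optional[str]:
    """Resolve the URL the frontend should display for an article image."""
    if not raw_url or image_policy == "none":
        # No image: the frontend shows the typographic topic cover instead.
        return None
    if image_policy == "cache":
        # Cleared source: point at our re-hosted copy.
        return f"/api/img/{article_id}"
    # "remote" (the default). Assumption: unknown values also fall back to
    # hotlinking, since that path never copies the publisher's image.
    return raw_url
```

Falling back to the publisher URL keeps the conservative default: an unexpected policy value can never cause a re-hosted copy to be served.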
Cache mechanics (goodnews/newsimg.py)
- The cycle owns all fetching (`warm()`), gated on `image_policy='cache'`, under the cycle lock. `GET /api/img/<id>` serves cache hits only (never fetches), restricted to accepted + canonical articles whose source is `cache`. No SSRF/worker-exhaustion surface on the public endpoint.
- `_safe_fetch`: http(s) only, `_host_is_public` on every redirect hop (HTTPError-based redirects followed), body capped. 4xx other than 429 = permanent (negative-cached via `<sha1>.fail`); 429/5xx/network = transient.
- `_encode`: decoded raster → WebP only; rejects SVG/undecodable; pixel ceiling enforced (`w*h > _MAX_PIXELS`) before decode. Originals are never retained.
- Bounded: hard size cap (default 1 GB, `GOODNEWS_IMG_CACHE_CAP`) with LRU eviction; `.fail` markers swept after `_FAIL_TTL_S`. `data/img_cache/` is gitignored (runtime data).
- Revocation: when a source leaves `cache` (set to `remote`/`none` in admin), the endpoint calls `newsimg.purge_source()` to delete that source's re-hosted copies immediately, so they don't linger on disk. (Setting to `cache` just flips the flag; the cycle warms it.)
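Two of the rules above, the permanent/transient split and LRU eviction under a size cap, can be sketched as pure functions. The names `classify_failure` and `plan_lru_eviction` are hypothetical, as is the in-memory `entries` mapping; how `newsimg.py` actually tracks sizes and recency is not described in this doc.

```python
from typing import Dict, List, Optional, Tuple


def classify_failure(status: Optional[int]) -> str:
    """Classify a failed fetch as 'permanent' or 'transient'.

    status is the HTTP status code, or None for a network-level error.
    """
    if status is not None and 400 <= status < 500 and status != 429:
        return "permanent"  # would be negative-cached (the <sha1>.fail marker)
    # 429, 5xx, network errors; this sketch also treats anything else as transient.
    return "transient"


def plan_lru_eviction(entries: Dict[str, Tuple[int, float]], cap_bytes: int) -> List[str]:
    """Return cache keys to evict, least recently used first, until size <= cap.

    entries maps key -> (size_bytes, last_access_timestamp).
    """
    total = sum(size for size, _ in entries.values())
    evict: List[str] = []
    # Oldest access time first.
    for key, (size, _) in sorted(entries.items(), key=lambda kv: kv[1][1]):
        if total <= cap_bytes:
            break
        evict.append(key)
        total -= size
    return evict
```

For example, with entries `a` (600 bytes, oldest), `c` (400 bytes), and `b` (300 bytes, newest) and a cap of 800 bytes, only `a` needs to go: the remaining 700 bytes fit under the cap.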
Visitor metrics — Recorded visits vs Engaged readers
A JS-capable bot can trip the visit beacon, so the admin shows two numbers:
- Recorded visits (raw count): one daily `visit` beacon per device. Known-bot User-Agents are filtered at `/api/events` (`queries.is_bot_ua`), but UA-spoofing bots still land here. Noisy.
- Engaged readers (the honest number): distinct visitor-days with deliberate activity:
  - the gesture-gated `engaged` beacon (`analytics.armEngaged`, mirrored on the share page), which fires once/day only after ~8s visible and a real scroll/pointer/key/touch; or
  - a deliberate action: `source_click`, `full_story`, `share_ub`/`copy_source`/`native_share`, `replace_used`, `paywall_replace`, `paywalled_source_open`, `not_today`/`less_like_this`/`hide_topic`, or a game `started`/`completed`/`shared`.
  - Never counts auto-fired `visit`/`summary_viewed`/`open`, `replace_none`, or game `*_arrival`.
  - Defined by `queries.ENGAGED_EVENT_KINDS`; surfaced as `visitors.engaged_today`/`d7`/`d30`.
Warm-up caveat: the engaged beacon began 2026-06-30, so rolling windows fill over time —
a low engaged_d7/engaged_d30 is partly warm-up, NOT proof the gap to recorded visits was all
bots. Compare d7 after a full week, d30 after thirty days. (Admin shows this note inline.)
Privacy unchanged: only a salted visitor_hash is stored (no IP, no raw token, no fingerprint).
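The two numbers can be illustrated with a small sketch that counts distinct visitor-days over a list of events. The event tuple shape, the `count_visitor_days` helper, and the abbreviated kind set are assumptions for the example; the authoritative set is `queries.ENGAGED_EVENT_KINDS`, and the real queries may be shaped differently.

```python
from datetime import date
from typing import Iterable, Set, Tuple

# Illustrative subset only; see queries.ENGAGED_EVENT_KINDS for the real set.
ENGAGED_KINDS = {"engaged", "source_click", "full_story", "native_share", "replace_used"}

# Hypothetical event row: (visitor_hash, day, kind).
Event = Tuple[str, date, str]


def count_visitor_days(events: Iterable[Event], kinds: Set[str]) -> int:
    """Count distinct (visitor_hash, day) pairs with at least one event in kinds."""
    return len({(visitor, day) for visitor, day, kind in events if kind in kinds})


events = [
    ("h1", date(2026, 7, 1), "visit"),
    ("h1", date(2026, 7, 1), "engaged"),
    ("h1", date(2026, 7, 1), "source_click"),  # same visitor-day, counted once
    ("h2", date(2026, 7, 1), "visit"),
    ("h2", date(2026, 7, 2), "full_story"),
    ("h3", date(2026, 7, 1), "visit"),         # auto-fired only: never engaged
]

recorded = count_visitor_days(events, {"visit"})
engaged = count_visitor_days(events, ENGAGED_KINDS)
```

Here `h3` contributes to recorded visits but not to engaged readers, which is the gap the admin view is meant to expose. Only the salted hash appears in the rows, consistent with the privacy note above.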
Optional (not done) — homepage hero referrer
For remote images, article cards and the share page use <img referrerpolicy="no-referrer">, so
the publisher CDN doesn't get the referring URL. The homepage hero (.news-plate) is a CSS
background-image, which can't carry that policy, so it leaks the referrer (not the IP — that's
unavoidable for any remote image). Converting the hero to a real <img referrerpolicy="no-referrer">
would make it consistent. Deferred pending an owner decision (touches the cover/contain hero rendering).