docs: durable policy of record for images + visitor metrics (Codex close-out)
Encodes the source-level image-rights policy (cache/remote/none; default remote, opt-in cache only for cleared sources) and the Recorded-visits vs Engaged-readers metric, so the decisions live in the repo for future audits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,62 @@
# Images & visitor metrics — policy of record
Encodes the decisions from the 2026-06-30 Codex audit. Update this doc when the policy changes.
## Article images — source-level rights policy
We do **not** blanket-cache publisher images. Caching (re-hosting a copy on our origin) is
**opt-in per source**; the safe default is to display the publisher's own image without copying it.

`sources.image_policy` (one of):

| policy | meaning | when to use |
|---|---|---|
| `cache` | Re-host a downscaled WebP (≤800px) at `/api/img/<id>` | ONLY sources we've cleared: open license (CC etc.), explicit permission, public domain, or our own/gov public-domain material |
| `remote` (**default**) | Hotlink the publisher's image URL (with the frontend's graceful retry) | Anything not explicitly cleared: display, never copy |
| `none` | No image (typographic topic cover) | Sources whose terms don't support display, or known-bad imagery |

**Default for new/unknown sources is `remote`.** Nothing is re-hosted until an admin sets a
source to `cache` (admin → Sources → expand a source → Images selector, or
`POST /api/admin/sources/{id}/image-policy {"policy": "cache"}`).

Why conservative: fair use is case-specific and balances four factors; "reduced + attributed +
linked" is good practice but **not permission**, and the search-engine-thumbnail precedent
(Perfect 10 v. Amazon) turned on a specifically transformative image-search function, not a
universal thumbnail exemption. So re-hosting waits until a source has been cleared on its own
rights basis.
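For reference, the admin call that opts a source into `cache` might be built roughly as below. This is only a sketch: the endpoint path and JSON body come from this doc, while the base URL, the `Content-Type` header, the helper name `build_image_policy_request`, and the source id `42` are illustrative assumptions. Authentication is omitted because this doc does not describe it.

```python
import json
import urllib.request

# The three policy values documented in the table above.
VALID_POLICIES = ("cache", "remote", "none")


def build_image_policy_request(
    base_url: str, source_id: int, policy: str
) -> urllib.request.Request:
    """Build (but do not send) the admin request that sets a source's image policy.

    Hypothetical helper: only the path and the {"policy": ...} body are taken
    from the policy doc; everything else here is an assumption.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown image policy: {policy!r}")
    url = f"{base_url.rstrip('/')}/api/admin/sources/{source_id}/image-policy"
    body = json.dumps({"policy": policy}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed, not confirmed
    )


# Example: opt hypothetical source 42 into caching on a local dev server.
req = build_image_policy_request("http://localhost:8000", 42, "cache")
```

Sending it would be a `urllib.request.urlopen(req)` call with whatever admin credentials the deployment requires.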
### How resolution works
- `newsimg.display_url(article_id, image_policy, raw_url)` returns the display URL: `/api/img/<id>`
  for `cache`, the publisher URL for `remote`, `None` for `none`/no-image.
- Applied server-side in `Article.from_row` (feed/brief/history) and `share.py` (the `/a/<id>`
  page). `og:image`/`twitter:image` always reference the **publisher's own** image URL (a link,
  not a copy), never our cached path, so social crawlers don't hit our endpoint.
- The frontend uses the resolved `image_url` as-is; the hub probe-retry + ArticleCard `onerror`
  cover slow/failed loads for both cached and hotlinked URLs.
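The resolution rule in the first bullet can be sketched as follows. This is not the code in `goodnews/newsimg.py`; it is a minimal illustration of the documented mapping. Two details are assumptions: `article_id` is typed as an integer, and an unrecognized policy value falls back to `remote` behaviour (display, never copy).

```python
from typing import Optional


def display_url(
    article_id: int, image_policy: str, raw_url: Optional[str]
) -> Optional[str]:
    """Return the URL the frontend should display for an article image.

    Sketch of the documented mapping:
      cache  -> our re-hosted copy at /api/img/<id>
      remote -> the publisher's own URL (hotlink)
      none   -> no image at all
    """
    # No publisher image, or a source whose policy forbids display.
    if not raw_url or image_policy == "none":
        return None
    if image_policy == "cache":
        return f"/api/img/{article_id}"
    # "remote", plus (by assumption) any unrecognized value: link, never copy.
    return raw_url
```

For example, `display_url(7, "cache", "https://example.org/a.jpg")` yields `/api/img/7`, while the same call with `"remote"` returns the publisher URL unchanged.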
### Cache mechanics (`goodnews/newsimg.py`)
- **The cycle owns all fetching** (`warm()`), gated on `image_policy='cache'`, under the cycle lock.
  `GET /api/img/<id>` serves **cache hits only** (never fetches), restricted to accepted + canonical
  articles whose source is `cache`. Because it never fetches, the public endpoint exposes no SSRF or
  worker-exhaustion surface.
- `_safe_fetch`: http(s) only, `_host_is_public` checked on every redirect hop (HTTPError-based redirects
  followed), body capped. A 4xx other than 429 is permanent (negative-cached via `<sha1>.fail`);
  429, 5xx, and network errors are transient.
- `_encode`: decoded raster → WebP only; rejects SVG/undecodable; pixel ceiling enforced
  (`w*h > _MAX_PIXELS`) **before** decode. Originals are never retained.
- Bounded: hard size cap (default 1 GB, `GOODNEWS_IMG_CACHE_CAP`) with LRU eviction; `.fail`
  markers swept after `_FAIL_TTL_S`. `data/img_cache/` is gitignored (runtime data).
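Three of the guards listed above are simple enough to sketch in isolation. The helper names below (`ip_is_public`, `classify_failure`, `exceeds_pixel_ceiling`) are hypothetical and are not the functions in `newsimg.py`. In particular, the real `_host_is_public` presumably resolves a hostname first, whereas this sketch only judges an already-resolved IP literal, and the ceiling value is passed in rather than assumed.

```python
import ipaddress


def ip_is_public(addr: str) -> bool:
    """True only for globally routable addresses.

    Loopback, RFC 1918 private ranges, and link-local addresses (including
    the 169.254.169.254 metadata address) are all rejected.
    """
    return ipaddress.ip_address(addr).is_global


def classify_failure(status: int) -> str:
    """Map an HTTP error status to the documented retry behaviour."""
    if status < 400:
        raise ValueError(f"not an error status: {status}")
    if status == 429 or status >= 500:
        return "transient"  # retry on a later cycle
    return "permanent"      # 4xx other than 429: negative-cache it


def exceeds_pixel_ceiling(width: int, height: int, max_pixels: int) -> bool:
    """Check the declared dimensions before any full decode is attempted."""
    return width * height > max_pixels
```

Running the address check on every redirect hop, not only on the first URL, is what stops a public host from redirecting the fetcher to an internal address.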
## Visitor metrics — Recorded visits vs Engaged readers
A JS-capable bot can trip the visit beacon, so the admin shows two numbers:

- **Recorded visits**: raw count, one daily `visit` beacon per device. Known-bot User-Agents are
  filtered at `/api/events` (`queries.is_bot_ua`), but UA-spoofing bots still land here. Noisy.
- **Engaged readers**: distinct visitor-days with **deliberate** activity (the honest number):
  - the gesture-gated `engaged` beacon (`analytics.armEngaged`, mirrored on the share page), which
    fires once/day only after ~8s visible **and** a real scroll/pointer/key/touch; or
  - a deliberate action: `source_click`, `full_story`, `share_ub`/`copy_source`/`native_share`,
    `replace_used`, `paywall_replace`, `paywalled_source_open`, `not_today`/`less_like_this`/`hide_topic`,
    or a game `started`/`completed`/`shared`.
  - **Never** counts auto-fired `visit`/`summary_viewed`/`open`, `replace_none`, or game `*_arrival`.
  - Defined by `queries.ENGAGED_EVENT_KINDS`; surfaced as `visitors.engaged_today/d7/d30`.

Privacy unchanged: only a salted `visitor_hash` is stored (no IP, no raw token, no fingerprint).
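The difference between the two numbers can be illustrated with a small in-memory count. This is a sketch, not the query behind `visitors.engaged_today/d7/d30`. The event tuple shape `(visitor_hash, day, kind)`, the `_started`/`_completed`/`_shared` suffix convention for game events, and the `puzzle_*` game name are assumptions; the listed kinds are copied from this doc.

```python
from typing import Iterable, Tuple

Event = Tuple[str, str, str]  # (visitor_hash, day, kind): assumed shape

# Deliberate kinds named in the policy doc.
ENGAGED_EVENT_KINDS = frozenset({
    "engaged", "source_click", "full_story",
    "share_ub", "copy_source", "native_share",
    "replace_used", "paywall_replace", "paywalled_source_open",
    "not_today", "less_like_this", "hide_topic",
})
# Assumption: game events are named "<game>_started" and so on.
GAME_ENGAGED_SUFFIXES = ("_started", "_completed", "_shared")


def is_engaged_kind(kind: str) -> bool:
    """True for deliberate activity; auto-fired kinds never qualify."""
    return kind in ENGAGED_EVENT_KINDS or kind.endswith(GAME_ENGAGED_SUFFIXES)


def recorded_visits(events: Iterable[Event]) -> int:
    """Raw count of `visit` beacons (already one per device per day)."""
    return sum(1 for _, _, kind in events if kind == "visit")


def engaged_readers(events: Iterable[Event]) -> int:
    """Distinct (visitor_hash, day) pairs with at least one deliberate event."""
    return len({(v, day) for v, day, kind in events if is_engaged_kind(kind)})


events = [
    ("a", "2026-07-01", "visit"),
    ("a", "2026-07-01", "source_click"),
    ("a", "2026-07-01", "full_story"),      # same visitor-day, counted once
    ("b", "2026-07-01", "visit"),
    ("b", "2026-07-01", "summary_viewed"),  # auto-fired, never engaged
    ("c", "2026-07-01", "puzzle_arrival"),  # auto-fired game arrival
    ("c", "2026-07-02", "puzzle_started"),  # deliberate game start
]
```

Keying on the stored `visitor_hash` means the count needs nothing beyond what the privacy note already allows.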