11 Commits

Author SHA1 Message Date
thejayman77 dc23277b38 Read-time: full-article "Full story · ~N min" badge (Option B)
Replaces the gist-based read-time with the SOURCE article's full read time — the
contrast that sells the gist ("calm 1-min version here; ~10 min for the deep dive").

- goodnews/readtime.py: word_count_from_html (strips script/style/nav/header/
  footer/form/button/aside furniture before counting) + source_read_minutes
  (~225 wpm, 200-word floor, None when extraction looks failed/too thin).
- articles.source_words + read_checked_at columns (count only, never the body;
  fits the privacy posture). Idempotent migration.
- enrich.fetch_source_words + enrich_read_times: a bounded, retry-guarded cycle
  step (mirrors the image enrichers) that counts words for recent accepted
  articles. Only ever writes a real count; never overwrites good with zero. Wired
  into the cycle after recent-image enrichment.
- queries: source_words flows through _ARTICLE_COLUMNS; api exposes
  source_read_minutes on Article (null when unknown).
- home3: News card shows "Full story · ~N min", hidden entirely when null (no
  misleading "1 min").
- Tests: furniture stripping, threshold/rounding, enrich idempotency + no
  zero-overwrite, API null handling. 412 backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 08:09:00 -04:00
thejayman77 50dc2167cd Durable image quality: stop trusting feed thumbnails; cycle enriches Latest
Make "no blurry images" sustainable, not a one-off cleanup. RSS feed thumbnails
(~44% were ~90px) were stored at ingest and upscaled to mush, so new articles
would reintroduce them. Now image_url is filled ONLY by the quality-gated
og:image enrichment:

* insert_article no longer stores the feed image (was canonicalize_url(item...)).
* enrich_recent_images(): the cycle fetches a quality og:image for the newest
  accepted, imageless articles each run (bounded), keeping Latest photo-rich.
* Brief + on-open enrichment unchanged.

Net: every stored image is a validated, ≥450px og:image; the rest are clean
placeholders.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:55:57 -04:00
thejayman77 b134c2dab6 Image quality gate: reject too-small images (no more blurry thumbnails)
Some cards showed blurry photos — feed RSS thumbnails (~90×90, e.g. Phys.org's
/tmb/ path) that load fine but upscale to mush in the banner. Add a header-based
dimension parser (PNG/GIF/JPEG/WebP, stdlib only) and fold a minimum-size gate
(450×250) into the image validation, alongside the existing load check. Images
we can't measure (SVG/AVIF) still pass on content-type. A re-prune clears the
small ones already stored so those cards fall back to the clean placeholder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:40:04 -04:00
thejayman77 d472b63cbf Validate images actually load (fix overcounted coverage)
A stored og:image isn't proof it renders: signed/hotlink-protected URLs (e.g.
the Guardian's i.guim.co.uk) 401 on a direct browser load, so they counted
toward coverage yet always fell back. Now fetch_og_image confirms the image
truly returns 200 + image/* (requested no-referrer, same SSRF-safe redirect
handling) before storing it. Add prune_broken_images() to clear already-stored
URLs that no longer load, so coverage is honest and those cards show the
placeholder cleanly. The browser onerror→placeholder remains the final safety net.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 13:20:12 -04:00
thejayman77 403749e26f Images phase 1: attention-triggered og:image coverage
Tie image enrichment to attention (per review): when an article earns a summary
(i.e. a reader reached it), best-effort fetch a real og:image if it lacks one —
never blanket-fetch every ingested article. Adds:

* enrich_article_image() — single-article fetch, leaves existing images alone,
  retries an imageless article only after 7 days, stamps image_checked_at.
* generate_summary() calls it after caching (wrapped; never breaks summaries).
* enrich_summarized_images() + `goodnews enrich-images` CLI — slow background
  backfill of already-summarized, accepted, imageless articles.
* Quality gate: extend the generic-image skip list with data:/tracking-pixel/
  spacer markers (on top of the existing logo/placeholder + unbranded-BBC logic).

This is coverage only; display (editorial rhythm, tile treatment) comes next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:30:11 -04:00
thejayman77 acbc06a9e5 Use BBC's clean image variant (cpsprodpb) instead of the branded one
BBC's og:image comes from the "branded_news" CDN path with a "BBC NEWS" logo
baked into the picture (shows as "…EWS" once the hero crops it). The identical
photo is served under "cpsprodpb" with no logo, so rewrite branded_news →
cpsprodpb. Best of both: full-resolution hero, no burned-in branding. Re-enriched
recent briefs so live images swap over. 99 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 07:51:51 +00:00
thejayman77 2145622b59 Stop rejecting BBC's branded_news images (the blurry-hero bug)
og:image extraction rejected any URL containing "branded_news" as a generic share
image, but that's BBC's normal CDN path for real article photos. So every BBC hero
fell back to the 240px RSS thumbnail (blurry when shown large). Drop that marker;
keep the genuine placeholder markers (facebook-default, og-default, etc.). Updated
the test to assert BBC branded_news paths pass through. 99 tests pass.

(One-time: cleared image_checked_at on the 57 previously-checked articles and
re-enriched recent briefs so existing thumbnails upgrade to og:images.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 07:47:08 +00:00
thejayman77 6d5bcb13e5 Fix stale pinned-brief images; enrich all 7 + retry failures
Root cause (Codex audit): the client pins the brief by generated_at, but image
enrichment populates image_url AFTER the brief is built without bumping
generated_at — so a verbatim pinned copy stays imageless even once the server
has the image. The reclassify rebuilt the brief and the early pin stuck.

- Frontend: when reusing a pinned brief (same generated_at), refresh server-owned
  metadata by article id (esp. image_url) while preserving the user's order and
  replacements. Re-saves the merged view so it stays current.
- enrich_brief_images: default limit 5 -> 7 (any brief item can become the hero
  via the client fallback or a replace, so cover the whole brief).
- Don't cache image failures forever: retry brief items still missing an image
  after a TTL (retry_days=2) instead of stamping them imageless permanently.

Pairs with the hero image fallback (dd0087b). 99 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 07:38:54 +00:00
thejayman77 7e1dfd5b3c Reject branded/generic share images; hero prefers a clean illustrated story
- og:image enrichment now skips branded/generic share images (BBC 'branded_news'
  with its burned-in logo, NPR 'facebook-default', etc.) and keeps the first
  real article image — so no competitor logo lands on our hero. Cleared the few
  already-stored branded URLs so they re-enrich.
- Hero selection now prefers a gentle + readable story that also HAS a (clean)
  image, falling back to gentle-readable, then gentle. The lead is visual when
  possible, typographic otherwise — never branded.

(The '7 cards' report was a stale browser cache: the brief stores 7 and the
built JS requests 7; a hard refresh shows all seven.)

Tests: branded/generic image rejection (87 total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 13:03:20 +00:00
thejayman77 d8d665ee35 Crisp hero (prefer og:image), 7-card Highlights, no-recycle Replace + session History
- Hero blur fix: brief enrichment now prefers a page's og:image even when a
  feed thumbnail exists (feed thumbs are often tiny; the hero is shown large).
  Verified: BBC hero upgrades to the 1024px share image, ScienceDaily to 1920px.
- Today is now 'Highlights from Today' — hero + 6 (brief size 7), which also
  makes the secondary grid a balanced 3+3 instead of an orphaned 3+1.
- Replace now excludes every article seen this session (a client-side seen-set),
  so it never cycles back to something already shown.
- New session History panel (this tab only, no account): lists everything seen,
  including swapped-away stories, so they stay recoverable. Persistent
  history/favorites are tabled for sign-in later.

Tests: og:image upgrade of an existing feed image (86 total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:56:57 +00:00
thejayman77 9e8eddf46d Bounded hero-image enrichment (og:image for brief items only)
The grid stays typographic; the hero is the one intentional visual slot. At
brief-build time we fetch a hero-quality image for the daily five that lack one:
- enrich.py reads ONLY a page's <head> og:image/twitter:image and stores just
  the URL (never the body).
- SSRF-guarded: http(s) only, 6s timeout, 300KB cap, <=3 manual redirects each
  re-validated, and hosts rejected if any resolved address is private, loopback,
  link-local, multicast, reserved, or unspecified.
- image_checked_at column caches success AND failure, so an article is never
  retried forever.
- Wired into build-brief and cycle (brief items only, only if image missing and
  unchecked). Everything else stays metadata-only.
- Verified live: today's five all carry images (feed + enriched).

Tests: og:image parser, head-only scope, IP guard across internal ranges, and
enrich success + failure-caching (85 total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 12:37:41 +00:00