dc23277b38
Replaces the gist-based read time with the SOURCE article's full read time: the
contrast that sells the gist ("calm 1-min version here; ~10 min for the deep dive").
- goodnews/readtime.py: word_count_from_html (strips script/style/nav/header/
footer/form/button/aside furniture before counting) + source_read_minutes
(~225 wpm, 200-word floor, None when extraction looks failed/too thin).
- articles.source_words + read_checked_at columns (count only, never the body;
fits the privacy posture). Idempotent migration.
- enrich.fetch_source_words + enrich_read_times: a bounded, retry-guarded cycle
step (mirrors the image enrichers) that counts words for recent accepted
articles. Only ever writes a real count; never overwrites good with zero. Wired
into the cycle after recent-image enrichment.
- queries: source_words flows through _ARTICLE_COLUMNS; api exposes
source_read_minutes on Article (null when unknown).
- home3: News card shows "Full story · ~N min", hidden entirely when null (no
misleading "1 min").
- Tests: furniture stripping, threshold/rounding, enrich idempotency + no
zero-overwrite, API null handling. 412 backend tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
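The two storage guarantees in the message (an idempotent migration, and a write path that never replaces a good count with zero) could be sketched roughly as follows. This assumes a SQLite-backed `articles` table, which the commit does not confirm. The helper names `ensure_read_time_columns` and `record_source_words` are hypothetical; only the column names `source_words` and `read_checked_at` come from the message.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def ensure_read_time_columns(conn: sqlite3.Connection) -> None:
    """Add the two columns only if they are missing, so reruns are harmless."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(articles)")}
    for name, sql_type in (("source_words", "INTEGER"), ("read_checked_at", "TEXT")):
        if name not in existing:
            conn.execute(f"ALTER TABLE articles ADD COLUMN {name} {sql_type}")


def record_source_words(
    conn: sqlite3.Connection, article_id: int, words: Optional[int]
) -> None:
    """Always stamp the check time; store the count only when it is real."""
    now = datetime.now(timezone.utc).isoformat()
    if words and words > 0:
        conn.execute(
            "UPDATE articles SET source_words = ?, read_checked_at = ? WHERE id = ?",
            (words, now, article_id),
        )
    else:
        # Failed or empty fetch: remember that we tried, keep any earlier count.
        conn.execute(
            "UPDATE articles SET read_checked_at = ? WHERE id = ?",
            (now, article_id),
        )


# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles (id, title) VALUES (1, 'example')")

ensure_read_time_columns(conn)
ensure_read_time_columns(conn)  # second run finds both columns and does nothing

record_source_words(conn, 1, 1800)  # a real count is stored
record_source_words(conn, 1, 0)     # a later failed fetch leaves it untouched

stored_words = conn.execute(
    "SELECT source_words FROM articles WHERE id = 1"
).fetchone()[0]
columns = [row[1] for row in conn.execute("PRAGMA table_info(articles)")]
```

Stamping `read_checked_at` even on failure is one plausible way to get the "retry-guarded" behaviour: a caller can skip articles checked recently without losing a count recorded earlier.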
44 lines
1.7 KiB
Python
"""Estimate a SOURCE article's full read time from its fetched HTML.

We never store the publisher's body, only a word COUNT (metadata), to derive a
"Full story · ~N min" hint that contrasts with our one-minute gist. That tiny
detail sells the value: the calm summary now, the deep dive only if you want it.

Extraction is deliberately light (no readability parser yet): drop the obvious
non-article furniture (scripts, styles, nav, header, footer, forms, buttons,
asides), strip tags, count words. ~225 wpm offsets the boilerplate that still
slips through. Below a floor we assume failed/blocked extraction and return None
so the UI shows NO badge rather than a misleading "1 min".
"""
from __future__ import annotations

import re

_WPM = 225
_MIN_WORDS = 200  # below this → assume failed/too-thin extraction → no badge

# Blocks whose CONTENT is furniture, removed wholesale before counting.
_FURNITURE = re.compile(
    rb"<(script|style|noscript|template|svg|nav|header|footer|form|button|aside|select|option)\b[^>]*>.*?</\1>",
    re.IGNORECASE | re.DOTALL,
)
_TAGS = re.compile(rb"<[^>]+>")
_WS = re.compile(r"\s+")


def word_count_from_html(raw: bytes | None) -> int:
    """Rough article word count from raw HTML bytes, furniture stripped."""
    if not raw:
        return 0
    cleaned = _FURNITURE.sub(b" ", raw)
    text = _TAGS.sub(b" ", cleaned).decode("utf-8", "replace")
    return len(_WS.sub(" ", text).split())


def source_read_minutes(words: int | None) -> int | None:
    """Whole-minute estimate for the FULL article, or None when the count looks
    failed/too thin (so callers omit the badge instead of showing a wrong number)."""
    if not words or words < _MIN_WORDS:
        return None
    return max(2, round(words / _WPM))
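To make the furniture stripping and the threshold/rounding behaviour concrete, here is a small standalone sketch. It restates the two functions in compact form so it runs on its own; the `SAMPLE` page is a made-up illustration, not a fixture from the repository.

```python
import re
from typing import Optional

# Restated from the module so this sketch is self-contained.
_WPM = 225
_MIN_WORDS = 200
_FURNITURE = re.compile(
    rb"<(script|style|noscript|template|svg|nav|header|footer|form|button|aside|select|option)\b[^>]*>.*?</\1>",
    re.IGNORECASE | re.DOTALL,
)
_TAGS = re.compile(rb"<[^>]+>")


def word_count_from_html(raw: Optional[bytes]) -> int:
    if not raw:
        return 0
    cleaned = _FURNITURE.sub(b" ", raw)
    return len(_TAGS.sub(b" ", cleaned).decode("utf-8", "replace").split())


def source_read_minutes(words: Optional[int]) -> Optional[int]:
    if not words or words < _MIN_WORDS:
        return None
    return max(2, round(words / _WPM))


# Hypothetical page: only the text inside <p> should survive the stripping.
SAMPLE = (
    b"<html><head><style>p{color:red}</style>"
    b"<script>var x = 'not words';</script></head><body>"
    b"<nav><a href='/'>Home</a> <a>About</a></nav>"
    b"<p>Calm news travels <b>slowly</b> today.</p>"
    b"<footer>Copyright notice</footer></body></html>"
)

sample_words = word_count_from_html(SAMPLE)       # 5: style, script, nav, footer are dropped
thin_badge = source_read_minutes(sample_words)    # None: under the 200-word floor, no badge
floor_badge = source_read_minutes(200)            # 2: round(200 / 225) is 1, clamped up to 2
deep_dive_badge = source_read_minutes(2250)       # 10: 2250 / 225
```

The clamp to a minimum of 2 means the "Full story" badge never reads "1 min", so it always contrasts with the one-minute gist.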