v0.7.0: kiskegyed.hu parser, dual measurements, site links as URLs

- New parser for kiskegyed.hu: ingredients (with groups, dual measurements), instructions (ol > li > div), tags (section.tags)
- Dual measurement handling: "3 ek (70 g)" extracts alternate measurement to comment field
- Cross-site linking: kiskegyed→sobors links are followed to get full recipe (mirrors existing sobors→kiskegyed support)
- Supported sites now shown as clickable URLs in the import page
- supported_sites() returns dicts with name and url

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 18:45:25 +01:00
parent 0912311357
commit 20fabb84bf
4 changed files with 193 additions and 5 deletions
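The return-type change to `supported_sites()` noted in the commit message affects any caller that used to join plain strings. The following is a minimal sketch of the new shape, not the project's code: the `_PARSERS` contents here are hypothetical stand-ins (the real registry holds parser callables), and the `"example"` key exists only to show the `"#"` fallback.

```python
# Hypothetical stand-in for the parser registry: (key, parser) pairs.
_PARSERS = [("sobors", None), ("kiskegyed", None), ("example", None)]

# Subset of the URL map, for illustration only.
_SITE_URLS = {
    "sobors": "https://sobors.hu",
    "kiskegyed": "https://www.kiskegyed.hu",
}


def supported_sites() -> list[dict]:
    """Return one {"name", "url"} dict per registered parser key."""
    return [
        {"name": key + ".hu", "url": _SITE_URLS.get(key, "#")}
        for key, _ in _PARSERS
    ]


sites = supported_sites()
# Callers that previously joined strings now have to pick the name field.
names = ", ".join(site["name"] for site in sites)
```

A key without an entry in the URL map still gets a dict, with `"#"` as its URL, so a template that iterates over the list does not need a special case.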
@@ -1,5 +1,15 @@
 # Changelog
 ## v0.7.0 (2026-02-24)
 ### Added
 - Kiskegyed.hu parser: ingredients (with groups, dual measurements), instructions, tags
 - Cross-site recipe linking: kiskegyed→sobors and sobors→kiskegyed links are followed automatically
 - Dual measurement support: parenthesized alternate measurements (e.g. "3 ek (70 g)") extracted to comment field
 ### Changed
 - Supported sites list now shows clickable URLs instead of plain text
 ## v0.6.1 (2026-02-24)
 ### Added
@@ -44,6 +44,7 @@ Docker container for importing recipes from Hungarian websites into [Mealie](htt
 | streetkitchen.hu | Yes (with groups) | Yes (ol/ul/paragraph) | Yes | Yes (from JSON-LD categories) |
 | nosalty.hu | Yes (with groups) | Yes (with section headers) | Yes | Yes |
 | sobors.hu | Yes (with groups) | Yes (with section headers, follows linked recipes) | Yes | Yes |
 | kiskegyed.hu | Yes (with groups, dual measurements) | Yes (follows sobors.hu links) | Yes | Yes |
 | *Other sites* | Fallback (schema.org JSON-LD) | Fallback (schema.org JSON-LD) | Yes (og:image) | Fallback (schema.org keywords) |
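The "follows sobors.hu links" behaviour in the table comes down to picking the first absolute link that points at a different supported site. A rough sketch of that selection step follows, under stated assumptions: `find_cross_site_link` and `SUPPORTED_KEYS` are hypothetical names, and `urllib.parse.urlparse` stands in for the project's own host helper, which is not shown here.

```python
from urllib.parse import urlparse

# Assumed list of parser keys; the real project derives these from its registry.
SUPPORTED_KEYS = ["mindmegette", "streetkitchen", "nosalty", "sobors", "kiskegyed"]


def find_cross_site_link(hrefs, own_key):
    """Return the first absolute href whose host belongs to another supported site."""
    for href in hrefs:
        if not href.startswith("http"):
            continue  # relative links stay on the current site
        host = urlparse(href).netloc.lower()
        if own_key in host:
            continue  # link back to the site being parsed
        if any(key in host for key in SUPPORTED_KEYS if key != own_key):
            return href
    return None


link = find_cross_site_link(
    ["/relative", "https://www.kiskegyed.hu/x", "https://sobors.hu/recept/abc"],
    "kiskegyed",
)
```

Because the support is mirrored in both directions, a real implementation may also want a guard (for example a set of already visited URLs) so that two pages linking to each other cannot recurse indefinitely.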
 ### Mindmegette.hu Parser
@@ -96,6 +97,20 @@ Extracts data from the sobors.hu recipe pages:
 - **Article-style ingredient fallback**: Pages without the structured `div.hozzavalok-container` are parsed from article-body `h4` + `ul > li` plain text
 - **Tags**: `div.cikk-cimkek > ul.cikk-cimkek-list > li > a` (skips generic "Receptek" category)
 ### Kiskegyed.hu Parser
 Extracts data from kiskegyed.hu recipe pages:
 - **Title**: `h2` element (with ` - Kiskegyed` suffix stripped)
 - **Description**: `section#leadText > p`
 - **Image**: `og:image` meta tag
 - **Ingredients**: `div.recipe_ingredients` → `ul.list > li` items; group headers from `<p>` or `<p><em>` elements
 - **Ingredient groups**: `<p>Name:</p>` or `<p><em>A ...hez</em></p>` format
 - **Dual measurements**: "3 ek (70 g) búzafinomliszt" → qty: 3, unit: ek, food: búzafinomliszt, extra: 70 g
 - **Instructions**: `div.recipe_preparation > ol > li > div`
 - **Cross-site links**: Pages linking to sobors.hu are followed to get the full recipe
 - **Tags**: `section.tags > a > span` (# prefix stripped, "recept" filtered)
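The dual-measurement rule above can be illustrated with a small, simplified sketch. It is not the parser itself: `split_dual_measurement` is a hypothetical name, the pattern covers only the "quantity unit (alternate) food" shape, and lines that do not match are returned unchanged as the food.

```python
import re

# Quantity, unit, parenthesized alternate measurement, then the food name.
DUAL = re.compile(r"^([0-9][0-9.,/½¼¾-]*)\s+(\S+)\s+\(([^)]+)\)\s+(.+)$")


def split_dual_measurement(line: str) -> dict:
    """Split '3 ek (70 g) food' into qty, unit, food and the alternate as extra."""
    m = DUAL.match(line)
    if not m:
        # Simplification: no further fallbacks, the whole line becomes the food.
        return {"qty": "", "unit": "", "food": line, "extra": ""}
    qty, unit, alternate, food = (g.strip() for g in m.groups())
    return {"qty": qty, "unit": unit, "food": food, "extra": alternate}


parsed = split_dual_measurement("3 ek (70 g) búzafinomliszt")
```

Keeping the primary measurement in `qty`/`unit` and moving the alternate into `extra` means the imported recipe shows one authoritative amount while the second one is preserved as a note.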
 ### Generic Fallback Parser
 For unsupported sites, attempts extraction via:
@@ -70,9 +70,16 @@ def scrape(url: str) -> dict:
    return result
-def supported_sites() -> list[str]:
+def supported_sites() -> list[dict]:
-    """Return list of supported site hostname substrings."""
+    """Return list of supported sites with name and URL."""
-    return [s for s, _ in _PARSERS]
+    _SITE_URLS = {
        "mindmegette": "https://www.mindmegette.hu",
        "streetkitchen": "https://streetkitchen.hu",
        "nosalty": "https://www.nosalty.hu",
        "sobors": "https://sobors.hu",
        "kiskegyed": "https://www.kiskegyed.hu",
    }
    return [{"name": s + ".hu", "url": _SITE_URLS.get(s, "#")} for s, _ in _PARSERS]
 # ---------------------------------------------------------------------------
@@ -545,6 +552,162 @@ def _parse_sobors(soup: BeautifulSoup, url: str) -> dict:
    }
 # ---------------------------------------------------------------------------
 # kiskegyed.hu
 # ---------------------------------------------------------------------------
@_register("kiskegyed")
 def _parse_kiskegyed(soup: BeautifulSoup, url: str) -> dict:
    # Title: first h2 on the page; falls back to og:title, then <title>
    title = ""
    h2 = soup.find("h2")
    if h2:
        title = h2.get_text(strip=True)
    if not title:
        title = _og(soup, "og:title") or _text(soup.find("title"))
    if title:
        title = re.sub(r"\s*[-–|]\s*Kiskegyed.*$", "", title, flags=re.IGNORECASE).strip()
    # Description: section#leadText > p
    description = ""
    lead = soup.find("section", id="leadText")
    if lead:
        p = lead.find("p")
        if p:
            description = p.get_text(strip=True)
    if not description:
        description = _og(soup, "og:description") or ""
    image_url = _og(soup, "og:image")
    # --- Ingredients ---
    # Container: div.recipe_ingredients
    # Groups: <p>Name:</p> or <p><em>A ...hez</em></p>
    # Items: ul.list > li (plain text with optional <a> links)
    ingredients = []
    ing_container = soup.find("div", class_="recipe_ingredients")
    if ing_container:
        for el in ing_container.find_all(["p", "ul"]):
            if el.name == "p":
                group_text = el.get_text(strip=True).rstrip(":")
                # Skip the "Hozzávalók" ("Ingredients") header
                if not group_text or group_text.lower().startswith("hozzávalók"):
                    continue
                # Skip serving info like "4 személyre"
                if re.match(r"^\d+\s+személyre$", group_text):
                    continue
                ingredients.append({"group": group_text})
            elif el.name == "ul" and "list" in (el.get("class") or []):
                for li in el.find_all("li"):
                    line = li.get_text(strip=True)
                    if not line:
                        continue
                    qty, unit, food, extra = _parse_kiskegyed_ingredient(line)
                    ingredients.append({
                        "quantity": qty,
                        "unit": unit,
                        "food": food,
                        "extra": extra,
                    })
    # --- Instructions ---
    # Container: div.recipe_preparation > ol > li > div
    instructions = []
    linked_url = None
    prep_container = soup.find("div", class_="recipe_preparation")
    if prep_container:
        # Check for cross-link to another recipe site (e.g. sobors.hu)
        for a in prep_container.find_all("a", href=True):
            href = a["href"]
            if href.startswith("http") and "kiskegyed.hu" not in href:
                # Check if it points to a supported recipe site
                linked_host = _host(href)
                if any(s in linked_host for s, _ in _PARSERS if s != "kiskegyed"):
                    linked_url = href
                    break
        ol = prep_container.find("ol")
        if ol:
            for li in ol.find_all("li", recursive=False):
                div = li.find("div")
                txt = div.get_text(strip=True) if div else li.get_text(strip=True)
                if txt:
                    instructions.append(txt)
    # If there are at most two steps (empty, or just a pointer to the other
    # site), follow the linked recipe and use its instructions instead
    if linked_url and len(instructions) <= 2:
        try:
            linked_data = scrape(linked_url)
            if linked_data.get("instructions"):
                instructions = linked_data["instructions"]
            if not ingredients and linked_data.get("ingredients"):
                ingredients = linked_data["ingredients"]
        except Exception:
            # Best effort: if the linked page cannot be scraped, keep
            # whatever was parsed from this page
            pass
    # --- Tags ---
    # Container: section.tags > a > span (text starts with #)
    tags = []
    tag_section = soup.find("section", class_="tags")
    if tag_section:
        skip = {"recept", "receptek"}
        for a in tag_section.find_all("a"):
            span = a.find("span")
            tag_text = span.get_text(strip=True) if span else a.get_text(strip=True)
            tag_text = tag_text.lstrip("#").strip()
            if tag_text and tag_text.lower() not in skip:
                tags.append(tag_text)
    return {
        "title": title or "Ismeretlen recept",
        "description": description,
        "image_url": image_url,
        "ingredients": ingredients,
        "instructions": instructions,
        "tags": tags,
        "original_url": url,
    }
 def _parse_kiskegyed_ingredient(line: str) -> tuple[str, str, str, str]:
    """Parse a kiskegyed.hu ingredient line.
    Handles dual measurements like '3 ek (70 g) búzafinomliszt (BL 55)'
    → qty='3', unit='ek', food='búzafinomliszt', extra='70 g; BL 55'
    """
    extras = []
    # Try: qty unit (alt_measurement) food...
    m = re.match(
        r"^([0-9][0-9.,/½¼¾-]*)\s+(\S+)\s+\(([^)]+)\)\s+(.+)$", line
    )
    if m:
        qty = m.group(1).strip()
        unit = m.group(2).strip()
        extras.append(m.group(3).strip())
        food_raw = m.group(4).strip()
        # Extract trailing parenthesized note from food
        fm = re.match(r"^(.+?)\s*\(([^)]+)\)\s*$", food_raw)
        if fm:
            food_raw = fm.group(1).strip()
            extras.append(fm.group(2).strip())
        return (qty, unit, food_raw, "; ".join(extras))
    # Try: qty unit food...
    m2 = re.match(r"^([0-9][0-9.,/½¼¾-]*)\s+(\S+)\s+(.+)$", line)
    if m2:
        return (m2.group(1).strip(), m2.group(2).strip(), m2.group(3).strip(), "")
    # Try: qty food (e.g. "2 tojás")
    m3 = re.match(r"^([0-9][0-9.,/½¼¾-]*)\s+(.+)$", line)
    if m3:
        return (m3.group(1).strip(), "", m3.group(2).strip(), "")
    # No quantity (e.g. "ízlés szerint só")
    return ("", "", line, "")
 def _parse_sobors_article_ingredients(container, ingredients: list):
    """Parse article-style ingredients from sobors.hu (h4 headers + ul > li plain text)."""
    for el in container.find_all(["h4", "ul"]):
@@ -326,7 +326,7 @@
    <!-- Single import tab -->
    <div id="tabSingle">
        <p style="font-size:0.85rem;color:var(--text-dim);margin-bottom:0.8rem;">
-            Támogatott oldalak: <span class="supported-sites">{{ supported_sites | join(', ') }}</span> + egyéb (schema.org)
+            Támogatott oldalak: <span class="supported-sites">{% for s in supported_sites %}<a href="{{ s.url }}" target="_blank" style="color:var(--accent-light);text-decoration:none;">{{ s.name }}</a>{% if not loop.last %}, {% endif %}{% endfor %}</span> + egyéb (schema.org)
        </p>
        <div class="flex">
            <input type="url" id="recipeUrl" class="grow" style="margin-bottom:0"
@@ -341,7 +341,7 @@
    <!-- Bulk import tab -->
    <div id="tabBulk" style="display:none">
        <p style="font-size:0.85rem;color:var(--text-dim);margin-bottom:0.8rem;">
-            Támogatott oldalak: <span class="supported-sites">{{ supported_sites | join(', ') }}</span> + egyéb (schema.org)
+            Támogatott oldalak: <span class="supported-sites">{% for s in supported_sites %}<a href="{{ s.url }}" target="_blank" style="color:var(--accent-light);text-decoration:none;">{{ s.name }}</a>{% if not loop.last %}, {% endif %}{% endfor %}</span> + egyéb (schema.org)
        </p>
        <label for="bulkUrls">URL-ek (soronként egy)</label>