fix: group title on first ingredient + multi-site parser registry

- Fix ingredient groups creating empty entries in Mealie: set title field on the first ingredient after the group marker instead
- Refactor scraper with @_register decorator for URL-based site dispatch
- Update README with structured ingredients, groups, MEALIE_INTERNAL_URL

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -48,7 +48,8 @@ Extracts data from the Angular-rendered HTML:
 - **Title**: `og:title` meta tag, with ` | Mindmegette.hu` suffix stripped
 - **Description**: `og:description` meta tag
 - **Image**: `og:image` meta tag
-- **Ingredients**: `div.ingredients` → `div.ingredients-meta` rows, each containing `span.quantity`, `span.unit`, `span.name`, `span.extra`
+- **Ingredients**: `div.ingredients` → `div.ingredients-meta` rows, each containing `<strong>` (qty), `<span>` (unit), `<a class="ingredients-link">` (food), `<small>` (extra)
+- **Ingredient groups**: Multiple `div.ingredients` containers; group title via `<strong class="ingredients-group">`
 - **Instructions**: `mindmegette-wysiwyg-box` → `ol > li` elements
 
 ### Generic Fallback Parser
@@ -57,14 +58,22 @@ For unsupported sites, attempts extraction via:
 1. Schema.org JSON-LD `@type: Recipe` blocks (`recipeIngredient`, `recipeInstructions`)
 2. OpenGraph meta tags for title, description, image
 
+### Adding a New Site Parser
+
+1. Create a parser function in `app/scraper.py` with the `@_register("hostname")` decorator
+2. The function receives `(soup: BeautifulSoup, url: str)` and returns the standard recipe dict
+3. The hostname substring is matched against the URL — first match wins, unmatched URLs use the generic fallback
+
 ## Mealie API Integration
 
 The importer uses the Mealie REST API:
 
 1. **POST** `/api/recipes` — create a stub recipe (returns slug)
-2. **PATCH** `/api/recipes/{slug}` — populate ingredients, instructions, description, orgURL
+2. **PATCH** `/api/recipes/{slug}` — populate structured ingredients (with unit/food IDs), instructions, description, orgURL
 3. **PUT** `/api/recipes/{slug}/image` — upload the recipe image
 
+**Structured ingredients**: The client resolves unit and food names to Mealie database IDs. Missing units/foods are created automatically via the API. Ingredient groups are supported via the `title` field on the first ingredient of each group.
+
 Authentication uses a long-lived API token (Bearer header), created in Mealie at *Profile → API Tokens*.
 
 ## Configuration
@@ -83,7 +92,7 @@ All settings are persisted to `/data/config.json` (mounted as a Docker volume).
 ```yaml
 services:
   recipe-importer:
-    image: gitea.dooplex.hu/admin/recipe-importer:0.1.0
+    image: gitea.dooplex.hu/admin/recipe-importer:0.1.7
     container_name: recipe-importer
     restart: unless-stopped
     ports:
@@ -104,6 +113,7 @@ volumes:
 | `SECRET_KEY` | `recipe-importer-dev-key` | Flask session secret |
 | `DATA_DIR` | `/data` | Persistent storage path |
 | `VERSION` | `dev` | Shown in the UI navbar |
+| `MEALIE_INTERNAL_URL` | *(empty)* | Docker-internal Mealie URL (e.g. `http://mealie:9000`) to avoid Cloudflare hairpin |
 
 ## Building
 
@@ -120,7 +130,7 @@ The UI is in Hungarian and uses a dark theme. The workflow is:
 
 1. **Settings** (`/settings`) — Enter Mealie URL and API key, test connection
 2. **Import** (`/import`) — Paste a recipe URL, click "Beolvasás" (Scrape)
-3. **Review** — Edit the title, description, ingredients, instructions in the preview
+3. **Review** — Edit structured ingredients (4-column: quantity, unit, food, note), add/remove ingredient groups, edit instructions
 4. **Send** — Click "Importálás Mealie-be" to push to Mealie
 
 ## Tech Stack
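Note on `MEALIE_INTERNAL_URL`: the README row says the variable lets the importer reach Mealie over the Docker network instead of going out through Cloudflare and back. This commit does not show how the value is consumed, so the following is only a minimal sketch of one plausible fallback rule. The function name `effective_mealie_url` and the rule itself are assumptions, not code from this repository.

```python
import os


def effective_mealie_url(configured_url: str, env=os.environ) -> str:
    """Pick the base URL for API calls (hypothetical helper).

    Assumption: a non-empty MEALIE_INTERNAL_URL overrides the URL entered
    on the settings page; an empty or missing value leaves it untouched.
    """
    internal = env.get("MEALIE_INTERNAL_URL", "").strip()
    # Drop a trailing slash so paths like "/api/recipes" can be appended.
    return (internal or configured_url).rstrip("/")
```

With the variable unset, the configured public URL is used as before, which matches the documented default of *(empty)*.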
+11 -12
@@ -145,27 +145,26 @@ class MealieClient:
 
     def _build_payload(self, recipe: dict) -> dict:
         ingredients = []
+        pending_group = ""
         for item in recipe.get("ingredients", []):
             if isinstance(item, dict):
-                # Group header marker
+                # Group header marker — apply title to the next real ingredient
                 if "group" in item and "food" not in item:
-                    ingredients.append({
-                        "referenceId": str(uuid.uuid4()),
-                        "title": item["group"],
-                        "note": "",
-                        "isFood": False,
-                        "disableAmount": True,
-                    })
-                else:
-                    ingredients.append(self._build_ingredient(item))
+                    pending_group = item["group"]
+                    continue
+                ing = self._build_ingredient(item)
             else:
                 # Legacy: plain string
-                ingredients.append({
+                ing = {
                     "referenceId": str(uuid.uuid4()),
                     "note": str(item),
                     "isFood": False,
                     "disableAmount": True,
-                })
+                }
+            if pending_group:
+                ing["title"] = pending_group
+                pending_group = ""
+            ingredients.append(ing)
 
         instructions = []
         for text in recipe.get("instructions", []):
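Note on the `pending_group` change: to see its effect in isolation, here is a minimal standalone sketch of the same loop. `flatten_groups` and the sample field names (`quantity`, `unit`) are hypothetical, and the plain dict copy stands in for `_build_ingredient`, whose body is not part of this diff.

```python
import uuid


def flatten_groups(items: list) -> list[dict]:
    """Attach each group marker's title to the next real ingredient."""
    ingredients = []
    pending_group = ""
    for item in items:
        if isinstance(item, dict):
            # A marker carries "group" but no "food": remember it, emit nothing.
            if "group" in item and "food" not in item:
                pending_group = item["group"]
                continue
            # Stand-in for _build_ingredient: copy the fields through unchanged.
            ing = {"referenceId": str(uuid.uuid4()), **item}
        else:
            # Legacy plain-string ingredient.
            ing = {
                "referenceId": str(uuid.uuid4()),
                "note": str(item),
                "isFood": False,
                "disableAmount": True,
            }
        if pending_group:
            ing["title"] = pending_group
            pending_group = ""
        ingredients.append(ing)
    return ingredients


scraped = [
    {"group": "Dough"},
    {"food": "flour", "quantity": 500, "unit": "g"},
    {"food": "salt"},
    {"group": "Filling"},
    "2 apples",
]
result = flatten_groups(scraped)
# Three entries come out: flour (title "Dough"), salt, "2 apples" (title "Filling").
```

Two edge cases follow from the loop as written: a group marker with no ingredient after it is dropped, and of two consecutive markers only the second title survives.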
+27 -6
@@ -1,6 +1,7 @@
 """Recipe scraper — parses Hungarian recipe sites into a structured dict.
 
-Currently supported: mindmegette.hu
+Each supported site has a parser registered via _PARSERS.
+Unsupported sites fall back to generic schema.org / og-tag extraction.
 """
 
 import re
@@ -12,6 +13,19 @@ _HEADERS = {
     "Accept-Language": "hu-HU,hu;q=0.9,en;q=0.5",
 }
 
+# Maps a substring of the hostname to a parser function.
+# Order matters: first match wins.
+_PARSERS: list[tuple[str, "callable"]] = []
+
+
+def _register(host_substring: str):
+    """Decorator: register a parser for URLs whose hostname contains *host_substring*."""
+    def decorator(fn):
+        _PARSERS.append((host_substring, fn))
+        return fn
+    return decorator
+
+
 # ---------------------------------------------------------------------------
 # Public API
 # ---------------------------------------------------------------------------
@@ -39,11 +53,17 @@ def scrape(url: str) -> dict:
     soup = BeautifulSoup(resp.text, "lxml")
 
     host = _host(url)
-    if "mindmegette" in host:
-        return _parse_mindmegette(soup, url)
-    else:
-        # Fallback: try generic schema.org / og-tag extraction
-        return _parse_generic(soup, url)
+    for substring, parser in _PARSERS:
+        if substring in host:
+            return parser(soup, url)
+
+    # Fallback: try generic schema.org / og-tag extraction
+    return _parse_generic(soup, url)
+
+
+def supported_sites() -> list[str]:
+    """Return list of supported site hostname substrings."""
+    return [s for s, _ in _PARSERS]
 
 
 # ---------------------------------------------------------------------------
@@ -51,6 +71,7 @@ def scrape(url: str) -> dict:
 # ---------------------------------------------------------------------------
 
 
+@_register("mindmegette")
 def _parse_mindmegette(soup: BeautifulSoup, url: str) -> dict:
     title = _og(soup, "og:title") or _text(soup.find("title"))
     # Strip " | Mindmegette.hu" suffix
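Note on the parser registry: the sketch below puts the `_register` decorator, the dispatch loop and `supported_sites()` from this diff into one runnable file, to show how a second site would plug in. The `example-recipes` parser, the `dispatch` wrapper and the stub return values are hypothetical. `urlparse(...).hostname` stands in for the `_host` helper, which this diff does not show, and `soup` is passed as `None` because no HTML is parsed here.

```python
from urllib.parse import urlparse

# (hostname substring, parser) pairs; registration order is match order.
_PARSERS = []


def _register(host_substring):
    """Decorator: register a parser for hostnames containing host_substring."""
    def decorator(fn):
        _PARSERS.append((host_substring, fn))
        return fn
    return decorator


@_register("mindmegette")
def _parse_mindmegette(soup, url):
    return {"source": "mindmegette", "url": url}  # stub result


# Hypothetical second site, added the way the README steps describe.
@_register("example-recipes")
def _parse_example(soup, url):
    return {"source": "example-recipes", "url": url}  # stub result


def _parse_generic(soup, url):
    return {"source": "generic", "url": url}  # stub fallback


def dispatch(soup, url):
    """Return the first registered parser's result, else the generic one."""
    host = urlparse(url).hostname or ""
    for substring, parser in _PARSERS:
        if substring in host:
            return parser(soup, url)
    return _parse_generic(soup, url)


def supported_sites():
    return [s for s, _ in _PARSERS]


known = dispatch(None, "https://www.mindmegette.hu/recept/x")
added = dispatch(None, "https://example-recipes.test/abc")
other = dispatch(None, "https://unknown.test/")
```

Because the match is a plain substring test on the hostname, an unrelated host that happens to contain a registered substring would be routed to that parser too, so registered substrings are best kept specific.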