# Recipe Importer

Docker container for importing recipes from Hungarian websites into [Mealie](https://mealie.io/) and [Tandoor Recipes](https://tandoor.dev/).

**Problem**: Mealie's and Tandoor's built-in URL import cannot parse ingredients and instructions from Hungarian recipe sites like mindmegette.hu.

**Solution**: This container provides a web UI that scrapes Hungarian recipe pages with site-specific parsers, lets you review and edit the extracted data, then pushes it to Mealie and/or Tandoor via their REST APIs. Supports both single recipe import and bulk import of multiple URLs.

## Architecture

```
┌──────────────────────────────────────────────────────┐
│  recipe-importer container (:8000)                   │
│                                                      │
│  Flask + Gunicorn                                    │
│  ├── /settings      → Configure Mealie & Tandoor     │
│  ├── /import        → Single or bulk import          │
│  ├── /scrape        → AJAX: parse recipe HTML        │
│  ├── /send          → AJAX: push to Mealie API       │
│  ├── /send-tandoor  → AJAX: push to Tandoor API      │
│  ├── /tags          → AJAX: list tags from both      │
│  └── /health        → Health check                   │
│                                                      │
│  Modules:                                            │
│  ├── app/config.py  → JSON config persistence        │
│  ├── app/scraper.py → Site-specific parsers          │
│  ├── app/mealie.py  → Mealie REST API client         │
│  └── app/tandoor.py → Tandoor REST API client        │
└───────────────────┬──────────────┬───────────────────┘
                    │ HTTP         │ HTTP
                    ▼              ▼
            ┌──────────────┐ ┌───────────────┐
            │    Mealie    │ │    Tandoor    │
            │ POST /api/.. │ │ POST /api/..  │
            │ PUT  /api/.. │ │ PUT  /api/..  │
            └──────────────┘ └───────────────┘
```

## Supported Sites

| Site | Ingredients | Instructions | Image | Tags |
|------|:-----------:|:------------:|:-----:|:----:|
| mindmegette.hu | Yes | Yes | Yes | Yes |
| streetkitchen.hu | Yes (with groups) | Yes (ol/ul/paragraph) | Yes | Yes (from JSON-LD categories) |
| nosalty.hu | Yes (with groups) | Yes (with section headers) | Yes | Yes |
| sobors.hu | Yes (with groups) | Yes (with section headers, follows linked recipes) | Yes | Yes |
| kiskegyed.hu | Yes (with groups, dual measurements) | Yes (follows sobors.hu links) | Yes | Yes |
| gastrohobbi.hu | Yes (with groups) | Yes (with embedded lists) | Yes | Yes (from JSON-LD categories) |
| *Other sites* | Fallback (schema.org JSON-LD) | Fallback (schema.org JSON-LD) | Yes (og:image) | Fallback (schema.org keywords) |

### Mindmegette.hu Parser

Extracts data from the Angular-rendered HTML:

- **Title**: `og:title` meta tag, with ` | Mindmegette.hu` suffix stripped
- **Description**: `og:description` meta tag
- **Image**: `og:image` meta tag
- **Ingredients**: `div.ingredients` → `div.ingredients-meta` rows, each containing `<strong>` (qty), `<span>` (unit), `<a class="ingredients-link">` (food), `<small>` (extra)
- **Ingredient groups**: Multiple `div.ingredients` containers; group title via `<strong class="ingredients-group">`
- **Instructions**: `mindmegette-wysiwyg-box` → `ol > li` elements
- **Tags**: `<a class="tag">` elements inside `div.desktop-wrapper`

### Streetkitchen.hu Parser

Extracts data from the Next.js-rendered HTML:

- **Title**: `og:title` meta tag, with ` | Street Kitchen` suffix stripped
- **Description**: `og:description` meta tag
- **Image**: `og:image` meta tag (CDN URL)
- **Ingredients**: `div.grid.grid-cols-1` container → `div.my-2.flex` rows; quantity and unit are merged in the first `<div>` (split via regex), food in `<div class="font-bold">`, optional extra in a parenthesised `<div>`
- **Ingredient groups**: `<h5>` headers inside section divs (e.g. "Az előfőzéshez", "A sütéshez")
- **Instructions**: Three formats are handled: `<ol>` ordered list, `<ul>` unordered list, or plain `<p>` paragraphs (with optional `<strong>` section headers)
- **Tags**: `recipeCategory` field from JSON-LD `@graph` → `Recipe` object (comma-separated)

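The quantity and unit split mentioned for streetkitchen.hu can be sketched with a single regular expression. This is a minimal illustration, not the pattern used in `app/scraper.py`; the function name `split_qty_unit` and the regex itself are assumptions.

```python
import re

# Hypothetical pattern: a leading integer, decimal, or fraction, optionally
# followed by a hyphenated range ("2-3"), then whatever remains as the unit.
_QTY_UNIT = re.compile(
    r"^\s*(\d+(?:[.,/]\d+)?(?:\s*-\s*\d+(?:[.,/]\d+)?)?)\s*(.*)$"
)


def split_qty_unit(text: str) -> tuple[str, str]:
    """Split a merged 'quantity unit' string; return ('', text) if no quantity leads."""
    match = _QTY_UNIT.match(text)
    if match is None:
        return "", text.strip()
    # Drop inner spaces so "2 - 3" and "2-3" normalise to the same quantity.
    return match.group(1).replace(" ", ""), match.group(2).strip()
```

For example, `split_qty_unit("1/2 csomag")` yields the quantity `"1/2"` and the unit `"csomag"`, while free text such as "ízlés szerint" is returned unchanged as the unit with an empty quantity.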
### Nosalty.hu Parser

Extracts data from the nosalty.hu recipe pages:

- **Title**: `og:title` meta tag
- **Description**: Story text from `div#recipe-story > p` (nosalty has no dedicated description field)
- **Image**: `og:image` meta tag
- **Ingredients**: Scoped to `div#ingredients` to avoid per-serving/nutrition duplicates; `ul.m-list__list > li.m-list__item` rows with `<span>` (qty+unit), `<a class="a-link">` (food), optional trailing `<span>` (extra notes in parentheses)
- **Ingredient groups**: `<h3 class="m-list__title">` headers between `<ul>` lists
- **Instructions**: `div#select` → `ol.m-list__list > li.m-list__item` steps; optional `<h4 class="m-list__title">` section headers
- **Tags**: `<a class="m-tags__tagItem">` inside `div.p-recipe__attributeList`

### Sobors.hu Parser

Extracts data from the sobors.hu recipe pages:

- **Title**: `h3.recept_nev`
- **Description**: `og:description` meta tag
- **Image**: `og:image` meta tag
- **Ingredients**: `div.hozzavalok-container` → `section` elements with `ul > li`, each containing `span.mennyiseg` (qty), `span.mertekegyseg` (unit), `span.hozzavalo` (food)
- **Ingredient groups**: `section > h4` headers (e.g., "A szószhoz:", "A húsgolyókhoz:")
- **Instructions**: `div.recept_leiras` → `<p>` tags, with `<h3><strong>` section headers
- **Linked recipes**: Some pages link to another site (e.g. kiskegyed.hu) instead of showing full instructions. The parser detects external links in the instruction area and follows them to scrape the real recipe content.
- **Article-style ingredient fallback**: Pages without the structured `div.hozzavalok-container` are parsed from article-body `h4` + `ul > li` plain text
- **Tags**: `div.cikk-cimkek > ul.cikk-cimkek-list > li > a` (skips generic "Receptek" category)

### Kiskegyed.hu Parser

Extracts data from kiskegyed.hu recipe pages:

- **Title**: `h2` element (with ` - Kiskegyed` suffix stripped)
- **Description**: `section#leadText > p`
- **Image**: `og:image` meta tag
- **Ingredients**: `div.recipe_ingredients` → `ul.list > li` items; group headers from `<p>` or `<p><em>` elements
- **Ingredient groups**: `<p>Name:</p>` or `<p><em>A ...hez</em></p>` format
- **Dual measurements**: "3 ek (70 g) búzafinomliszt" → qty: 3, unit: ek, food: búzafinomliszt, extra: 70 g
- **Instructions**: `div.recipe_preparation > ol > li > div`
- **Cross-site links**: Pages linking to sobors.hu are followed to get the full recipe
- **Tags**: `section.tags > a > span` (# prefix stripped, "recept" filtered)

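The dual-measurement rule for kiskegyed.hu can be illustrated as below. This is a sketch under assumptions: the regex and the name `parse_dual_measurement` are hypothetical, and values are kept as strings rather than converted to numbers.

```python
import re
from typing import Optional

# Hypothetical pattern for lines like "3 ek (70 g) búzafinomliszt":
# quantity, unit, a parenthesised second measurement, then the food name.
_DUAL = re.compile(
    r"^(?P<qty>\d+(?:[.,/]\d+)?)\s+(?P<unit>\S+)\s+"
    r"\((?P<extra>[^)]*)\)\s+(?P<food>.+)$"
)


def parse_dual_measurement(line: str) -> Optional[dict]:
    """Return qty/unit/extra/food for a dual-measurement line, else None."""
    match = _DUAL.match(line.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}
```

Lines without a parenthesised second measurement return `None`, so a caller could fall back to the ordinary ingredient line parser.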
### GastroHobbi.hu Parser

Extracts data from gastrohobbi.hu recipe pages (WPBakery page builder layout):

- **Title**: `h1.mpcth-post-title > span.mpcth-color-main-border`
- **Description**: First `<p>` in the first `wpb_text_column` before the recipe columns; falls back to `og:description`
- **Image**: `og:image` meta tag
- **Ingredients**: Finds `h3` containing "Hozzávalók:", then walks sibling `<ul>` elements; items from `li > p` or `li` directly
- **Ingredient groups**: Plain `<h3>` elements between ingredient lists (e.g. "A csipetkéhez:")
- **Instructions**: `<p>` elements following the "Elkészítés:" `h3`; embedded `<ul>` items rendered as bullet points
- **Prep time**: Extracted from "Elkészítési idő:" `h3`, appended to description
- **Tags**: JSON-LD `Article.articleSection` array (site uses Article schema, not Recipe)

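Reading tags from `Article.articleSection` might look roughly like the sketch below. It assumes the text of one `<script type="application/ld+json">` element has already been extracted; the function name `article_sections` is hypothetical, and the handling of `@graph` wrappers is an assumption about the page rather than something confirmed by the source.

```python
import json


def article_sections(json_ld_text: str) -> list[str]:
    """Collect articleSection values from Article nodes in one JSON-LD body."""
    try:
        data = json.loads(json_ld_text)
    except json.JSONDecodeError:
        return []
    # JSON-LD may be a single object, a list of objects, or wrap nodes in "@graph".
    nodes = data if isinstance(data, list) else [data]
    tags: list[str] = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        for item in node.get("@graph", [node]):
            if not isinstance(item, dict) or item.get("@type") != "Article":
                continue
            sections = item.get("articleSection", [])
            if isinstance(sections, str):
                sections = [sections]
            tags.extend(s.strip() for s in sections if isinstance(s, str) and s.strip())
    return tags
```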
### Generic Fallback Parser

For unsupported sites, attempts extraction via:

1. Schema.org JSON-LD `@type: Recipe` blocks (`recipeIngredient`, `recipeInstructions`, `keywords`)
2. OpenGraph meta tags for title, description, image

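The JSON-LD half of the fallback could be sketched as follows. The function names and the returned dict keys are illustrative assumptions, not the project's actual recipe dict; the input is the text of one JSON-LD script element.

```python
import json
from typing import Any, Optional


def _find_recipe(data: Any) -> Optional[dict]:
    """Depth-first search for the first node whose @type includes 'Recipe'."""
    if isinstance(data, list):
        for item in data:
            found = _find_recipe(item)
            if found is not None:
                return found
    elif isinstance(data, dict):
        node_type = data.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Recipe" in types:
            return data
        return _find_recipe(data.get("@graph", []))
    return None


def recipe_fields(json_ld_text: str) -> Optional[dict]:
    """Pull ingredients, instructions and keywords out of a Recipe node."""
    try:
        recipe = _find_recipe(json.loads(json_ld_text))
    except json.JSONDecodeError:
        return None
    if recipe is None:
        return None
    steps = recipe.get("recipeInstructions", [])
    if isinstance(steps, str):
        steps = [steps]
    keywords = recipe.get("keywords", [])
    if isinstance(keywords, str):
        keywords = keywords.split(",")
    return {
        "ingredients": list(recipe.get("recipeIngredient", [])),
        # Steps may be plain strings or HowToStep objects carrying a "text" field.
        "instructions": [s.get("text", "") if isinstance(s, dict) else str(s) for s in steps],
        "tags": [k.strip() for k in keywords if k.strip()],
    }
```

Returning `None` when no `Recipe` node is found leaves room for the OpenGraph step to supply at least the title, description, and image.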
### Adding a New Site Parser

1. Create a parser function in `app/scraper.py` with the `@_register("hostname")` decorator
2. The function receives `(soup: BeautifulSoup, url: str)` and returns the standard recipe dict
3. The hostname substring is matched against the URL; the first match wins, and unmatched URLs use the generic fallback

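The registration and matching rules above can be sketched as a small registry. Only the decorator name `_register` and the `(soup, url)` signature come from this README; `_PARSERS`, `find_parser`, the example hostname, and the keys of the returned dict are placeholders, and `soup` is typed as `Any` so the sketch runs without BeautifulSoup installed.

```python
from typing import Any, Callable, Optional

Parser = Callable[[Any, str], dict]

# Registration order matters: the first matching hostname wins.
_PARSERS: list[tuple[str, Parser]] = []


def _register(hostname: str) -> Callable[[Parser], Parser]:
    """Decorator that records a parser under a hostname substring."""
    def decorator(func: Parser) -> Parser:
        _PARSERS.append((hostname, func))
        return func
    return decorator


def find_parser(url: str) -> Optional[Parser]:
    """Return the first parser whose hostname substring occurs in the URL."""
    for hostname, parser in _PARSERS:
        if hostname in url:
            return parser
    return None  # The caller would use the generic fallback here.


@_register("example-recipes.hu")
def parse_example(soup: Any, url: str) -> dict:
    # Placeholder body: a real parser would read fields from `soup`.
    return {"title": "", "ingredients": [], "instructions": [], "tags": []}
```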
## Bulk Import

The "Tömeges importálás" (Bulk Import) tab allows importing multiple recipes at once:

1. Paste one URL per line in the textarea
2. Choose a mode:
   - **Review mode**: edit each recipe before importing, with the option to switch to auto mode midway
   - **Auto mode**: scrape and import all recipes without manual review (with a tag option: import all tags or none)
3. Select target: Mealie, Tandoor, or both
4. Progress table tracks per-recipe status (pending, scraping, importing, done, error, skipped, duplicate)

All processing is done client-side, calling the existing `/scrape` and `/send` / `/send-tandoor` endpoints sequentially.

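The real bulk loop runs in the browser, but the sequential scrape-then-send flow can be illustrated in Python. This is a loose sketch: `bulk_import` and its callable parameters are hypothetical stand-ins for the HTTP calls to `/scrape` and `/send`, and only the "done" and "error" statuses from the progress table are modelled.

```python
from typing import Callable, Iterable


def bulk_import(
    urls: Iterable[str],
    scrape: Callable[[str], dict],
    send: Callable[[dict], object],
) -> dict:
    """Process URLs one at a time and record a final status per URL."""
    statuses: dict = {}
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # Ignore blank lines from the textarea.
        try:
            recipe = scrape(url)  # Stand-in for a call to /scrape
            send(recipe)          # Stand-in for /send or /send-tandoor
            statuses[url] = "done"
        except Exception:
            # One failing recipe must not stop the rest of the batch.
            statuses[url] = "error"
    return statuses
```

Processing one URL at a time keeps the progress table easy to follow and avoids sending many simultaneous requests to the source sites.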
## Mealie API Integration

The importer uses the Mealie REST API:

1. **POST** `/api/recipes`: create a stub recipe (returns slug)
2. **PATCH** `/api/recipes/{slug}`: populate structured ingredients (with unit/food IDs), instructions, description, orgURL
3. **PUT** `/api/recipes/{slug}/image`: upload the recipe image

**Structured ingredients**: The client resolves unit and food names to Mealie database IDs. Missing units/foods are created automatically via the API. Ingredient groups are supported via the `title` field on the first ingredient of each group.

Authentication uses a long-lived API token (Bearer header), created in Mealie at *Profile → API Tokens*.

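The group handling described above, where only the first ingredient of each group carries the `title`, can be sketched as a flattening step. The input shape (`title` plus `items` per group) and the function name are assumptions for illustration; resolving unit and food names to Mealie IDs is left out.

```python
def flatten_groups(groups: list) -> list:
    """Flatten grouped ingredients into one list, titling each group's first item."""
    flat = []
    for group in groups:
        for index, item in enumerate(group["items"]):
            entry = dict(item)  # Copy so the caller's data is not mutated.
            # Only the first ingredient of a group carries the group title.
            entry["title"] = group.get("title", "") if index == 0 else ""
            flat.append(entry)
    return flat
```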
## Tandoor API Integration

The importer uses the Tandoor REST API:

1. **POST** `/api/recipe/`: create the full recipe in one call (name, description, source_url, steps with nested ingredients)
2. **PUT** `/api/recipe/{id}/image/`: upload the recipe image

**Step-based ingredients**: Tandoor nests ingredients inside steps. All ingredients are attached to the first step. Units and foods are auto-created by name (no separate resolution needed). Ingredient groups use `is_header: true` on a header entry.

**Duplicate detection**: Before import, searches Tandoor by title and checks the `source_url` field to detect already-imported recipes.

Authentication uses an API token (Bearer header), created in Tandoor at *Settings → API Browser → Auth Token*.

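A rough sketch of how such a payload might be assembled follows. The top-level fields (`name`, `description`, `source_url`, `steps`, `keywords`) and `is_header` come from this README; the input shape, the function names, and the remaining ingredient field names (`amount`, `note`, `instruction`) are assumptions that should be checked against the Tandoor API schema of the target version.

```python
def build_tandoor_payload(recipe: dict) -> dict:
    """Build a recipe POST body with every ingredient attached to the first step."""
    ingredients = []
    for group in recipe.get("groups", []):
        if group.get("title"):
            # Assumed header shape: flagged with is_header, text carried in note.
            ingredients.append({"is_header": True, "note": group["title"],
                                "food": None, "unit": None, "amount": 0})
        for item in group["items"]:
            ingredients.append({
                "is_header": False,
                "food": {"name": item["food"]},
                "unit": {"name": item["unit"]} if item.get("unit") else None,
                "amount": item.get("qty", 0),
                "note": item.get("extra", ""),
            })
    # Guarantee at least one step so the ingredients have somewhere to live.
    instructions = recipe.get("instructions") or [""]
    steps = [
        {"instruction": text, "ingredients": ingredients if index == 0 else []}
        for index, text in enumerate(instructions)
    ]
    return {
        "name": recipe["title"],
        "description": recipe.get("description", ""),
        "source_url": recipe.get("source_url", ""),
        "keywords": [{"name": tag} for tag in recipe.get("tags", [])],
        "steps": steps,
    }


def already_imported(search_results: list, source_url: str) -> bool:
    """Duplicate check: does any search hit share the same source_url?"""
    return any(hit.get("source_url") == source_url for hit in search_results)
```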
## Tag Management

Tags are scraped from recipe pages and shown as editable chips in the UI. Users can:

- **Remove** scraped tags that are irrelevant
- **Search** existing tags from Mealie and Tandoor (fetched via `GET /tags` endpoint)
- **Add** custom tags by typing and pressing Enter

Tags are sent to both services on import:

- **Mealie**: Tags are created via `POST /api/organizers/tags` if they don't exist, then attached to the recipe in the PATCH payload
- **Tandoor**: Keywords are auto-created by including `keywords: [{"name": "..."}]` in the recipe POST

## Configuration

All settings are persisted to `/data/config.json` (mounted as a Docker volume).

| Setting | Description |
|---------|-------------|
| `mealie_url` | Full URL to Mealie instance (e.g. `https://mealie.example.com`) |
| `mealie_api_key` | Mealie API token |
| `tandoor_url` | Full URL to Tandoor instance (e.g. `https://recipes.example.com`) |
| `tandoor_api_key` | Tandoor API token |

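JSON persistence of these four settings could be sketched as below. The setting names and the `DATA_DIR` variable come from this README; the function names and the choice to merge stored values over empty defaults are assumptions, not a description of `app/config.py`.

```python
import json
import os
from pathlib import Path

DEFAULTS = {
    "mealie_url": "",
    "mealie_api_key": "",
    "tandoor_url": "",
    "tandoor_api_key": "",
}


def config_path() -> Path:
    """Location of config.json inside the persistent data directory."""
    return Path(os.environ.get("DATA_DIR", "/data")) / "config.json"


def load_config(path: Path) -> dict:
    """Read the config file, falling back to defaults if missing or corrupt."""
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        stored = {}
    if not isinstance(stored, dict):
        stored = {}
    return {**DEFAULTS, **stored}


def save_config(path: Path, config: dict) -> None:
    """Write the config as pretty-printed JSON, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```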
## Deployment

### Docker Compose

```yaml
services:
  recipe-importer:
    image: gitea.dooplex.hu/admin/recipe-importer:0.2.0
    container_name: recipe-importer
    restart: unless-stopped
    ports:
      - "8011:8000"
    volumes:
      - recipe-data:/data
    environment:
      - SECRET_KEY=change-me-in-production
      - MEALIE_INTERNAL_URL=http://mealie:9000
      - TANDOOR_INTERNAL_URL=http://tandoor:8080

volumes:
  recipe-data:
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SECRET_KEY` | `recipe-importer-dev-key` | Flask session secret |
| `DATA_DIR` | `/data` | Persistent storage path |
| `VERSION` | `dev` | Shown in the UI navbar |
| `MEALIE_INTERNAL_URL` | *(empty)* | Docker-internal Mealie URL (e.g. `http://mealie:9000`) to avoid Cloudflare hairpin |
| `TANDOOR_INTERNAL_URL` | *(empty)* | Docker-internal Tandoor URL (e.g. `http://tandoor:8080`) to avoid Cloudflare hairpin |

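One plausible reading of the internal URL variables is that, when set, they replace the configured public URL for API calls made from inside the container. The sketch below illustrates that reading; the function name and the precedence rule are assumptions, not confirmed behaviour.

```python
import os
from typing import Mapping


def effective_base_url(
    configured_url: str,
    env_var: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer the Docker-internal URL from the environment over the configured one."""
    internal = env.get(env_var, "").strip()
    # Strip the trailing slash so API paths can be appended uniformly.
    return (internal or configured_url).rstrip("/")
```

With `MEALIE_INTERNAL_URL=http://mealie:9000`, requests would go straight to the Mealie container rather than out through the public hostname and back.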
## Building

On the build server (kisfenyo@192.168.0.180):

```bash
cd ~/build/recipe-importer
./build.sh X.X.X --push
```

## Web UI

The UI is in Hungarian and uses a dark theme. The workflow is:

1. **Settings** (`/settings`): Configure the Mealie and/or Tandoor connection (URL + API key) and test each connection
2. **Import** (`/import`): Paste a recipe URL, click "Beolvasás" (Scrape)
3. **Review**: Edit structured ingredients (4-column: quantity, unit, food, note), add/remove ingredient groups, edit instructions, manage tags (add/remove/search existing)
4. **Send**: Click "Importálás Mealie-be" and/or "Importálás Tandoor-ba" to push to your configured services

## Tech Stack

- **Runtime**: Python 3.12 (slim)
- **Web framework**: Flask 3.1 + Gunicorn
- **HTML parsing**: BeautifulSoup 4 + lxml
- **HTTP client**: requests
- **Container**: ~60 MB image