# Recipe Importer
Docker container for importing recipes from Hungarian websites into [Mealie](https://mealie.io/) and [Tandoor Recipes](https://tandoor.dev/).
**Problem**: Mealie's and Tandoor's built-in URL import cannot parse ingredients and instructions from Hungarian recipe sites like mindmegette.hu.
**Solution**: This container provides a web UI that scrapes Hungarian recipe pages with site-specific parsers, lets you review and edit the extracted data, then pushes it to Mealie and/or Tandoor via their REST APIs.
## Architecture
```
┌──────────────────────────────────────────────────────┐
│  recipe-importer container (:8000)                   │
│                                                      │
│  Flask + Gunicorn                                    │
│  ├── /settings      → Configure Mealie & Tandoor     │
│  ├── /import        → Paste URL, scrape, review      │
│  ├── /scrape        → AJAX: parse recipe HTML        │
│  ├── /send          → AJAX: push to Mealie API       │
│  ├── /send-tandoor  → AJAX: push to Tandoor API      │
│  ├── /tags          → AJAX: list tags from both      │
│  └── /health        → Health check                   │
│                                                      │
│  Modules:                                            │
│  ├── app/config.py  → JSON config persistence        │
│  ├── app/scraper.py → Site-specific parsers          │
│  ├── app/mealie.py  → Mealie REST API client         │
│  └── app/tandoor.py → Tandoor REST API client        │
└───────────────────┬──────────────┬───────────────────┘
                    │ HTTP         │ HTTP
                    ▼              ▼
           ┌──────────────┐ ┌───────────────┐
           │ Mealie       │ │ Tandoor       │
           │ POST /api/.. │ │ POST /api/..  │
           │ PUT  /api/.. │ │ PUT  /api/..  │
           └──────────────┘ └───────────────┘
```
## Supported Sites
| Site | Ingredients | Instructions | Image | Tags |
|------|:-----------:|:------------:|:-----:|:----:|
| mindmegette.hu | Yes | Yes | Yes | Yes |
| streetkitchen.hu | Yes (with groups) | Yes (ol/ul/paragraph) | Yes | Yes (from JSON-LD categories) |
| nosalty.hu | Yes (with groups) | Yes (with section headers) | Yes | Yes |
| *Other sites* | Fallback (schema.org JSON-LD) | Fallback (schema.org JSON-LD) | Yes (og:image) | Fallback (schema.org keywords) |
### Mindmegette.hu Parser
Extracts data from the Angular-rendered HTML:
- **Title**: `og:title` meta tag, with ` | Mindmegette.hu` suffix stripped
- **Description**: `og:description` meta tag
- **Image**: `og:image` meta tag
- **Ingredients**: `div.ingredients` → `div.ingredients-meta` rows, each containing `<strong>` (qty), `<span>` (unit), `<a class="ingredients-link">` (food), `<small>` (extra)
- **Ingredient groups**: Multiple `div.ingredients` containers; group title via `<strong class="ingredients-group">`
- **Instructions**: `mindmegette-wysiwyg-box` → `ol > li` elements
- **Tags**: `<a class="tag">` elements inside `div.desktop-wrapper`
### Streetkitchen.hu Parser
Extracts data from the Next.js-rendered HTML:
- **Title**: `og:title` meta tag, with ` | Street Kitchen` suffix stripped
- **Description**: `og:description` meta tag
- **Image**: `og:image` meta tag (CDN URL)
- **Ingredients**: `div.grid.grid-cols-1` container → `div.my-2.flex` rows; quantity+unit merged in first `<div>` (split via regex), food in `<div class="font-bold">`, optional extra in parenthesised `<div>`
- **Ingredient groups**: `<h5>` headers inside section divs (e.g. "Az előfőzéshez", "A sütéshez")
- **Instructions**: Three formats handled — `<ol>` ordered list, `<ul>` unordered list, or plain `<p>` paragraphs (with optional `<strong>` section headers)
- **Tags**: `recipeCategory` field from JSON-LD `@graph` → `Recipe` object (comma-separated)
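
The quantity+unit split mentioned above could look roughly like the sketch below. The actual regex in `app/scraper.py` is not shown in this README, so the pattern and the `split_quantity_unit` name are assumptions for illustration only.

```python
import re

# Hypothetical pattern (not the one from app/scraper.py): a leading number,
# optionally followed by a decimal part, fraction or range, then the unit.
# Hungarian pages commonly use a decimal comma, e.g. "1,5 kg".
_QTY_UNIT = re.compile(r"^\s*(?P<qty>\d+(?:[.,/-]\d+)?)\s*(?P<unit>.*?)\s*$")


def split_quantity_unit(text: str) -> tuple[str, str]:
    """Split a merged 'quantity unit' string into (quantity, unit).

    Returns ("", text) when the string does not start with a number,
    for example "ízlés szerint" (to taste).
    """
    match = _QTY_UNIT.match(text)
    if match is None:
        return "", text.strip()
    return match.group("qty"), match.group("unit")
```

The quantity is kept as a string here so that the review step in the UI can show exactly what the page contained.
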
### Nosalty.hu Parser
Extracts data from the nosalty.hu recipe pages:
- **Title**: `og:title` meta tag
- **Description**: Story text from `div#recipe-story > p` (nosalty has no dedicated description field)
- **Image**: `og:image` meta tag
- **Ingredients**: Scoped to `div#ingredients` to avoid per-serving/nutrition duplicates; `ul.m-list__list > li.m-list__item` rows with `<span>` (qty+unit), `<a class="a-link">` (food), optional trailing `<span>` (extra notes in parentheses)
- **Ingredient groups**: `<h3 class="m-list__title">` headers between `<ul>` lists
- **Instructions**: `div#select` → `ol.m-list__list > li.m-list__item` steps; optional `<h4 class="m-list__title">` section headers
- **Tags**: `<a class="m-tags__tagItem">` inside `div.p-recipe__attributeList`
### Generic Fallback Parser
For unsupported sites, attempts extraction via:
1. Schema.org JSON-LD `@type: Recipe` blocks (`recipeIngredient`, `recipeInstructions`, `keywords`)
2. OpenGraph meta tags for title, description, image
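
As a rough illustration of step 1, the sketch below locates a `Recipe` object in JSON-LD script blocks, including ones nested under `@graph`. It uses only the standard library so it runs on its own; the project itself uses BeautifulSoup, and the `find_recipe_jsonld` name is hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def _is_recipe(node) -> bool:
    """True for dicts whose @type is "Recipe" or a list containing it."""
    types = node.get("@type") if isinstance(node, dict) else None
    return types == "Recipe" or (isinstance(types, list) and "Recipe" in types)


def find_recipe_jsonld(html: str) -> Optional[dict]:
    """Return the first schema.org Recipe object found in the page, if any."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the import
        flat = []
        for node in data if isinstance(data, list) else [data]:
            flat.append(node)
            graph = node.get("@graph") if isinstance(node, dict) else None
            if isinstance(graph, list):
                flat.extend(graph)
        for node in flat:
            if _is_recipe(node):
                return node
    return None
```

The returned dict would then supply `recipeIngredient`, `recipeInstructions` and `keywords`, with the OpenGraph tags filling in title, description and image.
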
### Adding a New Site Parser
1. Create a parser function in `app/scraper.py` with the `@_register("hostname")` decorator
2. The function receives `(soup: BeautifulSoup, url: str)` and returns the standard recipe dict
3. The hostname substring is matched against the URL — first match wins, unmatched URLs use the generic fallback
## Mealie API Integration
The importer uses the Mealie REST API:
1. **POST** `/api/recipes` — create a stub recipe (returns slug)
2. **PATCH** `/api/recipes/{slug}` — populate structured ingredients (with unit/food IDs), instructions, description, orgURL
3. **PUT** `/api/recipes/{slug}/image` — upload the recipe image
**Structured ingredients**: The client resolves unit and food names to Mealie database IDs. Missing units/foods are created automatically via the API. Ingredient groups are supported via the `title` field on the first ingredient of each group.
Authentication uses a long-lived API token (Bearer header), created in Mealie at *Profile → API Tokens*.
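
The group handling described above (a `title` on the first ingredient of each group) can be sketched as a small pure function. The input shape and the function name are assumptions, and in the real client `unit` and `food` would be replaced by the resolved Mealie objects with their IDs rather than plain names.

```python
def flatten_groups_for_mealie(groups: list[dict]) -> list[dict]:
    """Flatten ingredient groups into a flat list of ingredient entries.

    `groups` is a hypothetical intermediate shape, for example:
    [{"title": "A tésztához", "items": [{"quantity": 2, "unit": "dl", "food": "tej"}]}]
    """
    flat = []
    for group in groups:
        for index, item in enumerate(group["items"]):
            flat.append({
                "quantity": item.get("quantity"),
                "unit": item.get("unit"),    # placeholder: resolved to an ID-bearing object in practice
                "food": item.get("food"),    # placeholder: resolved to an ID-bearing object in practice
                "note": item.get("note", ""),
                # Only the first ingredient of a group carries the group title.
                "title": group.get("title", "") if index == 0 else "",
            })
    return flat
```

The resulting list would go into the PATCH payload from step 2 after unit and food resolution.
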
## Tandoor API Integration
The importer uses the Tandoor REST API:
1. **POST** `/api/recipe/` — create the full recipe in one call (name, description, source_url, steps with nested ingredients)
2. **PUT** `/api/recipe/{id}/image/` — upload the recipe image
**Step-based ingredients**: Tandoor nests ingredients inside steps. All ingredients are attached to the first step. Units and foods are auto-created by name (no separate resolution needed). Ingredient groups use `is_header: true` on a header entry.
**Duplicate detection**: Before import, searches Tandoor by title and checks the `source_url` field to detect already-imported recipes.
Authentication uses an API token (Bearer header), created in Tandoor at *Settings → API Browser → Auth Token*.
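
A minimal sketch of the step-based payload described above: all ingredients go on the first step, and group titles become header entries. The input dict shape and the `build_tandoor_payload` name are hypothetical, and the exact set of fields Tandoor requires may differ between versions, so treat the field list as an assumption to check against your instance's API browser.

```python
def build_tandoor_payload(recipe: dict) -> dict:
    """Build a recipe POST body with ingredients nested in the first step."""
    ingredients = []
    for group in recipe.get("ingredient_groups", []):
        if group.get("title"):
            # Group header entry: flagged with is_header, text carried in note.
            ingredients.append({
                "is_header": True, "note": group["title"],
                "food": None, "unit": None, "amount": 0,
            })
        for item in group["items"]:
            ingredients.append({
                "is_header": False,
                "food": {"name": item["food"]},  # created by name if missing
                "unit": {"name": item["unit"]} if item.get("unit") else None,
                "amount": item.get("quantity") or 0,
                "note": item.get("note", ""),
            })
    # Always emit at least one step so the ingredients have somewhere to live.
    instructions = recipe.get("instructions") or [""]
    steps = [
        {"instruction": text, "ingredients": ingredients if i == 0 else []}
        for i, text in enumerate(instructions)
    ]
    return {
        "name": recipe["title"],
        "description": recipe.get("description", ""),
        "source_url": recipe.get("url", ""),
        "keywords": [{"name": tag} for tag in recipe.get("tags", [])],
        "steps": steps,
    }
```

Storing the original page address in `source_url` is what makes the duplicate check possible on a later import.
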
## Tag Management
Tags are scraped from recipe pages and shown as editable chips in the UI. Users can:
- **Remove** scraped tags that are irrelevant
- **Search** existing tags from Mealie and Tandoor (fetched via the `GET /tags` endpoint)
- **Add** custom tags by typing and pressing Enter
Tags are sent to both services on import:
- **Mealie**: Tags are created via `POST /api/organizers/tags` if they don't exist, then attached to the recipe in the PATCH payload
- **Tandoor**: Keywords are auto-created by including `keywords: [{"name": "..."}]` in the recipe POST
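
For the Mealie side, deciding which tags still need to be created could be sketched like this. The function name is hypothetical, and the case-insensitive comparison is an assumption rather than documented behaviour of the importer.

```python
def tags_to_create(existing: list[str], wanted: list[str]) -> list[str]:
    """Return the wanted tag names that are not yet in `existing`.

    Comparison ignores case and surrounding whitespace (an assumption),
    and duplicates within `wanted` are reported only once.
    """
    known = {name.strip().casefold() for name in existing}
    missing = []
    for name in wanted:
        cleaned = name.strip()
        key = cleaned.casefold()
        if cleaned and key not in known:
            known.add(key)  # also de-duplicates within `wanted`
            missing.append(cleaned)
    return missing
```

Each returned name would then be sent to `POST /api/organizers/tags` before the recipe PATCH.
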
## Configuration
All settings are persisted to `/data/config.json` (mounted as a Docker volume).
| Setting | Description |
|---------|-------------|
| `mealie_url` | Full URL to Mealie instance (e.g. `https://mealie.example.com`) |
| `mealie_api_key` | Mealie API token |
| `tandoor_url` | Full URL to Tandoor instance (e.g. `https://recipes.example.com`) |
| `tandoor_api_key` | Tandoor API token |
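
A minimal sketch of JSON persistence for these four settings, assuming the file lives at `config.json` under `DATA_DIR`. The function names are hypothetical and the actual `app/config.py` may differ.

```python
import json
import os
from pathlib import Path

# The four documented settings, all empty by default.
DEFAULTS = {"mealie_url": "", "mealie_api_key": "", "tandoor_url": "", "tandoor_api_key": ""}


def config_path(data_dir=None) -> Path:
    """Resolve config.json under the given dir, DATA_DIR, or /data."""
    return Path(data_dir or os.environ.get("DATA_DIR", "/data")) / "config.json"


def load_config(data_dir=None) -> dict:
    """Load settings, falling back to defaults for a missing or broken file."""
    try:
        stored = json.loads(config_path(data_dir).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        stored = {}
    if not isinstance(stored, dict):
        stored = {}
    return {**DEFAULTS, **{k: v for k, v in stored.items() if k in DEFAULTS}}


def save_config(values: dict, data_dir=None) -> None:
    """Merge known keys into the stored config and write it back."""
    path = config_path(data_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    known = {k: v for k, v in values.items() if k in DEFAULTS}
    merged = {**load_config(data_dir), **known}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename over the old file so readers never see a partial write
```

Since the file holds API tokens, the mounted volume should be readable only by the container user.
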
## Deployment
### Docker Compose
```yaml
services:
  recipe-importer:
    image: gitea.dooplex.hu/admin/recipe-importer:0.2.0
    container_name: recipe-importer
    restart: unless-stopped
    ports:
      - "8011:8000"
    volumes:
      - recipe-data:/data
    environment:
      - SECRET_KEY=change-me-in-production
      - MEALIE_INTERNAL_URL=http://mealie:9000
      - TANDOOR_INTERNAL_URL=http://tandoor:8080
volumes:
  recipe-data:
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `SECRET_KEY` | `recipe-importer-dev-key` | Flask session secret |
| `DATA_DIR` | `/data` | Persistent storage path |
| `VERSION` | `dev` | Shown in the UI navbar |
| `MEALIE_INTERNAL_URL` | *(empty)* | Docker-internal Mealie URL (e.g. `http://mealie:9000`), used to avoid hairpinning through Cloudflare |
| `TANDOOR_INTERNAL_URL` | *(empty)* | Docker-internal Tandoor URL (e.g. `http://tandoor:8080`), used to avoid hairpinning through Cloudflare |
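
Reading these variables with the defaults from the table might look like the sketch below. The function names are hypothetical, and the rule that a non-empty internal URL takes precedence over the configured public URL for API calls is an assumption based on the descriptions above.

```python
import os
from typing import Mapping


def read_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Read the documented environment variables with their defaults."""
    return {
        "secret_key": environ.get("SECRET_KEY", "recipe-importer-dev-key"),
        "data_dir": environ.get("DATA_DIR", "/data"),
        "version": environ.get("VERSION", "dev"),
        "mealie_internal_url": environ.get("MEALIE_INTERNAL_URL", ""),
        "tandoor_internal_url": environ.get("TANDOOR_INTERNAL_URL", ""),
    }


def api_base_url(configured_url: str, internal_url: str) -> str:
    """Pick the base URL for API calls.

    Assumption: a non-empty internal URL wins, so requests stay on the
    Docker network instead of leaving and re-entering through the proxy.
    """
    return (internal_url or configured_url).rstrip("/")
```

For example, with `MEALIE_INTERNAL_URL=http://mealie:9000` set, API calls would target the container directly while the UI could still link to the public `mealie_url`.
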
## Building
On the build server (kisfenyo@192.168.0.180):
```bash
cd ~/build/recipe-importer
./build.sh X.X.X --push
```
## Web UI
The UI is in Hungarian and uses a dark theme. The workflow is:
1. **Settings** (`/settings`) — Configure Mealie and/or Tandoor connection (URL + API key), test each connection
2. **Import** (`/import`) — Paste a recipe URL, click "Beolvasás" (Scrape)
3. **Review** — Edit structured ingredients (4-column: quantity, unit, food, note), add/remove ingredient groups, edit instructions, manage tags (add/remove/search existing)
4. **Send** — Click "Importálás Mealie-be" and/or "Importálás Tandoor-ba" to push to your configured services
## Tech Stack
- **Runtime**: Python 3.12 (slim)
- **Web framework**: Flask 3.1 + Gunicorn
- **HTML parsing**: BeautifulSoup 4 + lxml
- **HTTP client**: requests
- **Container**: ~60 MB image