Initial wiki: sync, lists, consumers, workflow
+238
@@ -0,0 +1,238 @@
|
|||||||
|
# CI and Workflow
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Synchronisation is driven by a single Gitea Actions workflow,
|
||||||
|
`.gitea/workflows/sync.yml`. It runs on a schedule and on manual dispatch.
|
||||||
|
Each run executes the sync script, commits any changes to `blacklist` or
|
||||||
|
`blacklist.prev`, and pushes the commit to `main`.
|
||||||
|
|
||||||
|
There is no CI in the traditional sense for this repository -- no tests,
|
||||||
|
no build, no lint. The workflow's only job is to keep the blacklist in
|
||||||
|
sync with upstream.
|
||||||
|
|
||||||
|
## Workflow file
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
name: Sync blocklists from upstream
|
||||||
|
|
||||||
|
on:
|
||||||
|
schedule:
|
||||||
|
- cron: '0 4 */7 * *'
|
||||||
|
workflow_dispatch:
|
||||||
|
|
||||||
|
jobs:
|
||||||
|
sync:
|
||||||
|
runs-on: ubuntu-latest
|
||||||
|
steps:
|
||||||
|
- uses: actions/checkout@v3
|
||||||
|
|
||||||
|
- name: Fetch and merge upstream files
|
||||||
|
run: python3 scripts/merge_blocklists.py
|
||||||
|
|
||||||
|
- name: Commit and push if changed
|
||||||
|
run: |
|
||||||
|
git config user.name "gitea-actions"
|
||||||
|
git config user.email "actions@gitea"
|
||||||
|
git add .
|
||||||
|
git diff --staged --quiet || git commit -m "Sync blocklists from upstream"
|
||||||
|
git push
|
||||||
|
```
|
||||||
|
|
||||||
|
## Schedule
|
||||||
|
|
||||||
|
The cron expression `0 4 */7 * *` runs at 04:00 UTC on days 1, 8, 15, 22,
and 29 of each month. That is every 7 days within a month, but the
interval shortens at the month boundary: day 29 and day 1 of the next
month are only 1-3 days apart. In a 28-day February there is no day 29,
so the gap from day 22 to the next day 1 stays at 7 days.
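
The day-of-month arithmetic can be checked with a short sketch. It assumes the usual cron reading of `*/7` in the day-of-month field (start at day 1, step by 7). The helper names `run_days` and `gap_to_next_month` are illustrative and not part of the repository.

```python
import calendar


def run_days(year: int, month: int, step: int = 7) -> list[int]:
    """Days of the month matched by `*/step` in the day-of-month field."""
    last_day = calendar.monthrange(year, month)[1]
    return list(range(1, last_day + 1, step))


def gap_to_next_month(year: int, month: int, step: int = 7) -> int:
    """Days between the last run of this month and day 1 of the next month."""
    last_day = calendar.monthrange(year, month)[1]
    return last_day - run_days(year, month, step)[-1] + 1


# 31-day month, 28-day February, 29-day February, 30-day month
for year, month in [(2025, 1), (2025, 2), (2024, 2), (2025, 4)]:
    print(year, month, run_days(year, month), gap_to_next_month(year, month))
```

The gap is 3, 7, 1, and 2 days respectively for the four sample months.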
|
||||||
|
|
||||||
|
This cadence is deliberate: upstream Cleanuparr rarely changes the
|
||||||
|
blacklist, and running less frequently reduces noise in the commit
|
||||||
|
history. If upstream is updated and you want the change immediately,
|
||||||
|
use manual dispatch (see below) instead of waiting for the next scheduled
|
||||||
|
run.
|
||||||
|
|
||||||
|
### Changing the schedule
|
||||||
|
|
||||||
|
Edit the `cron` line in `.gitea/workflows/sync.yml`. Common alternatives:
|
||||||
|
|
||||||
|
| Cron expression | Meaning |
|
||||||
|
|---|---|
|
||||||
|
| `0 4 */7 * *` | Days 1, 8, 15, 22, and 29 of each month at 04:00 UTC (current) |
|
||||||
|
| `0 4 * * 1` | Every Monday at 04:00 UTC |
|
||||||
|
| `0 4 1 * *` | First day of every month at 04:00 UTC |
|
||||||
|
| `0 */6 * * *` | Every 6 hours |
|
||||||
|
|
||||||
|
All times are UTC. Gitea Actions does not support timezones in cron
|
||||||
|
expressions.
|
||||||
|
|
||||||
|
## Manual dispatch
|
||||||
|
|
||||||
|
The `workflow_dispatch` trigger lets you run the sync on demand from
|
||||||
|
the Gitea UI or via the API. Use this after editing `whitelist` if you
|
||||||
|
want the change to take effect immediately instead of waiting for the
|
||||||
|
next scheduled run.
|
||||||
|
|
||||||
|
### From the Gitea UI
|
||||||
|
|
||||||
|
1. Open the repository on Gitea.
|
||||||
|
2. Go to **Actions** -> **Sync blocklists from upstream**.
|
||||||
|
3. Click **Run workflow**.
|
||||||
|
4. Select branch `main`.
|
||||||
|
5. Click the confirm button.
|
||||||
|
|
||||||
|
The run appears in the Actions list within a few seconds and typically
|
||||||
|
completes in under a minute.
|
||||||
|
|
||||||
|
### From the API
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -X POST \
|
||||||
|
-H "Authorization: token YOUR_TOKEN" \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"ref": "main"}' \
|
||||||
|
https://git.hisp.no/api/v1/repos/arr/blocklists/actions/workflows/sync.yml/dispatches
|
||||||
|
```
|
||||||
|
|
||||||
|
The token needs `write:repository` scope for the `arr/blocklists` repo.
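
For scripted use, the same request can be built with the Python standard library. This is a minimal sketch that mirrors the `curl` call above; the function names `build_dispatch_request` and `dispatch` are illustrative, and the token is whatever you pass in.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
DISPATCH_URL = (
    "https://git.hisp.no/api/v1/repos/arr/blocklists"
    "/actions/workflows/sync.yml/dispatches"
)


def build_dispatch_request(token: str, ref: str = "main") -> urllib.request.Request:
    """Build (but do not send) the POST request that dispatches the workflow."""
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        DISPATCH_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def dispatch(token: str, ref: str = "main") -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_dispatch_request(token, ref), timeout=30) as resp:
        return resp.status
```

Call `dispatch(token)` with a token that has the scope described above. `urlopen` raises `urllib.error.HTTPError` if the server answers with an error status.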
|
||||||
|
|
||||||
|
## What the workflow does
|
||||||
|
|
||||||
|
### Step 1: checkout
|
||||||
|
|
||||||
|
Standard `actions/checkout@v3`. Checks out the repository at the current
|
||||||
|
HEAD of `main`. No submodules, no LFS, no special configuration.
|
||||||
|
|
||||||
|
### Step 2: fetch and merge
|
||||||
|
|
||||||
|
Runs `python3 scripts/merge_blocklists.py`. The script:
|
||||||
|
|
||||||
|
1. Fetches the upstream blacklist from
|
||||||
|
`https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist`.
|
||||||
|
2. Reads `blacklist.prev`, `blacklist`, and `whitelist` from the checked-out
|
||||||
|
repository.
|
||||||
|
3. Performs the three-way merge and whitelist subtraction.
|
||||||
|
4. Writes `blacklist` and `blacklist.prev` back to disk.
|
||||||
|
|
||||||
|
The script is idempotent: running it twice in a row with no upstream or
|
||||||
|
whitelist changes produces no diff on the second run.
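
The idempotence claim can be illustrated with the set formulas from the [Sync](Sync) page. This is a sketch, not the actual script: the `sync` function and the sample entries are hypothetical.

```python
def sync(upstream_new, upstream_prev, local, whitelist):
    """One sync run: returns (new blacklist, new blacklist.prev)."""
    custom = local - upstream_prev            # manual local additions
    result = (upstream_new | custom) - whitelist
    return result, set(upstream_new)


upstream = {"*.exe", "*.srt", "*sample.srt"}
whitelist = {"*.srt"}
local = {"*.exe", "*.nfo.gz"}                 # "*.nfo.gz" was added by hand
prev = {"*.exe"}

# The second run feeds the first run's output back in, with upstream unchanged.
first, prev_after_first = sync(upstream, prev, local, whitelist)
second, prev_after_second = sync(upstream, prev_after_first, first, whitelist)

print(sorted(first))
print(first == second and prev_after_first == prev_after_second)
```

The second run reproduces the first run's files exactly, so `git diff --staged --quiet` finds nothing to commit.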
|
||||||
|
|
||||||
|
See [Sync](Sync) for the full algorithm.
|
||||||
|
|
||||||
|
### Step 3: commit and push if changed
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git config user.name "gitea-actions"
|
||||||
|
git config user.email "actions@gitea"
|
||||||
|
git add .
|
||||||
|
git diff --staged --quiet || git commit -m "Sync blocklists from upstream"
|
||||||
|
git push
|
||||||
|
```
|
||||||
|
|
||||||
|
This commits and pushes only if the script actually changed something.
|
||||||
|
The `git diff --staged --quiet` check returns non-zero when there are
|
||||||
|
staged changes, which triggers the commit via `||`. If nothing changed,
|
||||||
|
`git commit` is skipped and the final `git push` is a no-op (push with
|
||||||
|
no local commits ahead of the remote).
|
||||||
|
|
||||||
|
The commit author is always `gitea-actions <actions@gitea>`, regardless
|
||||||
|
of who triggered the run. This makes automated syncs easy to distinguish
|
||||||
|
from human commits in the history.
|
||||||
|
|
||||||
|
## Permissions
|
||||||
|
|
||||||
|
The workflow runs with the default `GITHUB_TOKEN` (Gitea equivalent) that
|
||||||
|
Gitea Actions provides automatically. This token has write access to the
|
||||||
|
repository, which is necessary for the commit-and-push step. No additional
|
||||||
|
secrets are required.
|
||||||
|
|
||||||
|
No external API tokens are needed -- the upstream blacklist is fetched
|
||||||
|
from a public raw URL on `raw.githubusercontent.com` without
|
||||||
|
authentication.
|
||||||
|
|
||||||
|
## Monitoring
|
||||||
|
|
||||||
|
### Checking recent runs
|
||||||
|
|
||||||
|
Go to **Actions** -> **Sync blocklists from upstream** in the Gitea UI.
|
||||||
|
Each run shows:
|
||||||
|
|
||||||
|
- Status (success / failure)
|
||||||
|
- Trigger (schedule / manual dispatch)
|
||||||
|
- Commit created (if any)
|
||||||
|
- Full log output
|
||||||
|
|
||||||
|
### Reading the log
|
||||||
|
|
||||||
|
The Python script prints four summary lines per run. These appear in
|
||||||
|
the "Fetch and merge upstream files" step log:
|
||||||
|
|
||||||
|
```
|
||||||
|
[blacklist] Upstream added: [...]
|
||||||
|
[blacklist] Upstream removed: [...]
|
||||||
|
[blacklist] Custom preserved: [...]
|
||||||
|
[blacklist] Whitelist stripped: [...]
|
||||||
|
```
|
||||||
|
|
||||||
|
Use these to verify the sync behaved as expected. "Whitelist stripped"
|
||||||
|
should list every entry in your whitelist that was present in the upstream
|
||||||
|
blacklist at fetch time.
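
One plausible way to derive the four summary sets is sketched below. How the real script computes each line is an assumption here; only the labels and the set names come from this wiki. The `summarize` function and the sample data are hypothetical.

```python
def summarize(upstream_new, upstream_prev, local, whitelist):
    """Assumed derivation of the four log lines; the real script may differ."""
    custom = local - upstream_prev
    return {
        "Upstream added": sorted(upstream_new - upstream_prev),
        "Upstream removed": sorted(upstream_prev - upstream_new),
        "Custom preserved": sorted(custom),
        "Whitelist stripped": sorted(upstream_new & whitelist),
    }


summary = summarize(
    upstream_new={"*.exe", "*.srt", "*.lnk"},
    upstream_prev={"*.exe", "*.srt", "*.scr"},
    local={"*.exe", "*.scr", "*.nfo.gz"},
    whitelist={"*.srt"},
)
for label, entries in summary.items():
    print(f"[blacklist] {label}: {entries}")
```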
|
||||||
|
|
||||||
|
### Run history in git log
|
||||||
|
|
||||||
|
Every automated commit uses the same message, so filtering the history
|
||||||
|
is easy:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git log --author="gitea-actions" --oneline
|
||||||
|
```
|
||||||
|
|
||||||
|
Or to see commits that actually touched the blacklist:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git log --oneline -- blacklist
|
||||||
|
```
|
||||||
|
|
||||||
|
## Failure modes
|
||||||
|
|
||||||
|
### Upstream unreachable
|
||||||
|
|
||||||
|
If `raw.githubusercontent.com` is unreachable or returns an HTTP error
status (4xx or 5xx), `urllib.request.urlopen` raises an exception and the script
|
||||||
|
exits non-zero. The workflow fails at the "Fetch and merge upstream
|
||||||
|
files" step. No commit is made, no push happens. The repository state
|
||||||
|
is unchanged.
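
The failure path can be sketched as follows. This is an illustration under assumptions, not the script's actual code: `fetch_entries` and `run` are hypothetical names, and the real script may simply let the exception propagate, which also produces a non-zero exit.

```python
import sys
import urllib.error
import urllib.request


def fetch_entries(url: str, timeout: float = 30.0) -> set[str]:
    """Download a list and return its non-empty, stripped lines as a set."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def run(url: str) -> int:
    """Return a process exit code: 0 on success, 1 if the fetch fails."""
    try:
        entries = fetch_entries(url)
    except (urllib.error.URLError, TimeoutError) as exc:
        # HTTPError (4xx/5xx responses) is a subclass of URLError.
        print(f"fetch failed: {exc}", file=sys.stderr)
        return 1
    print(f"fetched {len(entries)} entries")
    return 0
```

A script entry point would end with `sys.exit(run(url))`, so a failed fetch fails the workflow step before any file is written.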
|
||||||
|
|
||||||
|
Retry the workflow manually once upstream is available again.
|
||||||
|
|
||||||
|
### Script error
|
||||||
|
|
||||||
|
If the sync script crashes (malformed upstream, disk full, etc.), the
|
||||||
|
step fails and no commit is made. Read the full step log to diagnose.
|
||||||
|
|
||||||
|
### Push rejected
|
||||||
|
|
||||||
|
If someone pushes to `main` between the checkout and the push, the push
|
||||||
|
is rejected (non-fast-forward). The workflow fails at the push step.
|
||||||
|
No data is lost -- the next scheduled run will fetch the latest state
|
||||||
|
and re-apply the sync.
|
||||||
|
|
||||||
|
### Commit is empty
|
||||||
|
|
||||||
|
This is not a failure. The `git diff --staged --quiet || git commit`
|
||||||
|
pattern explicitly skips the commit when nothing changed, and the
|
||||||
|
subsequent `git push` is a no-op. The workflow reports success.
|
||||||
|
|
||||||
|
## Disabling the scheduled run
|
||||||
|
|
||||||
|
To pause automatic syncing without removing the workflow entirely,
|
||||||
|
comment out the `schedule` section in `.gitea/workflows/sync.yml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
on:
|
||||||
|
# schedule:
|
||||||
|
# - cron: '0 4 */7 * *'
|
||||||
|
workflow_dispatch:
|
||||||
|
```
|
||||||
|
|
||||||
|
Manual dispatch still works. Uncomment to re-enable scheduling.
|
||||||
+181
@@ -0,0 +1,181 @@
|
|||||||
|
# Consumers
|
||||||
|
|
||||||
|
The blocklists are consumed by two tools in the ARR stack:
|
||||||
|
|
||||||
|
| Tool | Role | File consumed | Mode |
|
||||||
|
|---|---|---|---|
|
||||||
|
| qBittorrent | Download client | `blacklist` | Excluded file names |
|
||||||
|
| Cleanuparr | Media cleanup / malware blocker | `blacklist` or `whitelist` | Blacklist or whitelist mode |
|
||||||
|
|
||||||
|
Both tools read a remote text file over HTTPS, one glob pattern per line.
|
||||||
|
They refresh on their own schedule (qBittorrent on restart or manual
|
||||||
|
refresh; Cleanuparr on its configured interval).
|
||||||
|
|
||||||
|
## Raw URLs
|
||||||
|
|
||||||
|
Point consumers at the raw file URLs, not the Gitea blob viewer URLs:
|
||||||
|
|
||||||
|
```
|
||||||
|
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||||
|
https://git.hisp.no/arr/blocklists/raw/branch/main/whitelist
|
||||||
|
```
|
||||||
|
|
||||||
|
The `raw/branch/main/` path serves the file contents directly with the
|
||||||
|
correct `text/plain` content type. Using `src/branch/main/` instead serves
|
||||||
|
the HTML viewer page and will break the consumer.
|
||||||
|
|
||||||
|
## qBittorrent
|
||||||
|
|
||||||
|
qBittorrent has an **excluded file names** feature that skips files
|
||||||
|
matching any of the configured glob patterns when downloading a torrent.
|
||||||
|
There is no "included file names" or whitelist mode -- qBittorrent only
|
||||||
|
supports exclusion. This is why it consumes the merged `blacklist` and not
|
||||||
|
the `whitelist`.
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
1. Open **Options** (Tools -> Options, or Ctrl+,).
|
||||||
|
2. Go to **Downloads**.
|
||||||
|
3. Scroll to **Excluded file names**.
|
||||||
|
4. Enable the checkbox.
|
||||||
|
5. Set the URL to:
|
||||||
|
|
||||||
|
```
|
||||||
|
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||||
|
```
|
||||||
|
|
||||||
|
qBittorrent fetches the list on startup and whenever you click **Reload**
|
||||||
|
next to the field. There is no automatic refresh interval -- a restart or
|
||||||
|
manual reload is required to pick up changes.
|
||||||
|
|
||||||
|
### What qBittorrent does with the list
|
||||||
|
|
||||||
|
When a torrent is added, qBittorrent iterates the files inside it and
|
||||||
|
checks each filename against the excluded patterns. Matching files are
|
||||||
|
marked as "do not download" and will not be written to disk. The rest of
|
||||||
|
the torrent downloads normally.
|
||||||
|
|
||||||
|
This means the list operates at the **file level within a torrent**, not
|
||||||
|
the torrent level. A torrent containing `movie.mkv` and `movie.nor.srt`
|
||||||
|
would download both files if `*.srt` is in the whitelist (and thus not in
|
||||||
|
the blacklist), or just `movie.mkv` if `*.srt` were in the blacklist.
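
The file-level behaviour can be approximated with Python's `fnmatch` module. This is only a model: qBittorrent's own matching rules (case sensitivity, for instance) may differ, and `split_torrent_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def split_torrent_files(filenames, excluded_patterns):
    """Split filenames into (downloaded, skipped) by exclusion patterns."""
    kept, skipped = [], []
    for name in filenames:
        if any(fnmatchcase(name, pattern) for pattern in excluded_patterns):
            skipped.append(name)
        else:
            kept.append(name)
    return kept, skipped


files = ["movie.mkv", "movie.nor.srt"]

# "*.srt" whitelisted, so it is absent from the blacklist: both files download.
print(split_torrent_files(files, ["*.exe"]))

# "*.srt" still in the blacklist: only the video downloads.
print(split_torrent_files(files, ["*.exe", "*.srt"]))
```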
|
||||||
|
|
||||||
|
### Refreshing after a whitelist change
|
||||||
|
|
||||||
|
qBittorrent does not auto-refresh the list. After updating `whitelist`:
|
||||||
|
|
||||||
|
1. Wait for the next sync run (or dispatch the workflow manually).
|
||||||
|
2. In qBittorrent, open the excluded file names setting and click
|
||||||
|
**Reload**, or restart qBittorrent.
|
||||||
|
3. New torrents added from this point on will use the updated list.
|
||||||
|
Torrents already in the client are not retroactively changed.
|
||||||
|
|
||||||
|
## Cleanuparr
|
||||||
|
|
||||||
|
Cleanuparr supports two modes for its Malware Blocker and Blacklist Sync
|
||||||
|
features. The repository provides files suitable for both.
|
||||||
|
|
||||||
|
### Blacklist mode
|
||||||
|
|
||||||
|
In blacklist mode, Cleanuparr deletes any file matching a pattern in the
|
||||||
|
configured list.
|
||||||
|
|
||||||
|
Point it at the same URL as qBittorrent:
|
||||||
|
|
||||||
|
```
|
||||||
|
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||||
|
```
|
||||||
|
|
||||||
|
Because the whitelist has already been subtracted, this file will not
cause Cleanuparr to delete anything you have marked as "keep" in the
whitelist. The two tools therefore behave consistently without any
per-tool customisation.
|
||||||
|
|
||||||
|
### Whitelist mode
|
||||||
|
|
||||||
|
In whitelist mode, Cleanuparr keeps only files matching a pattern in the
|
||||||
|
configured list and deletes everything else.
|
||||||
|
|
||||||
|
Point it at:
|
||||||
|
|
||||||
|
```
|
||||||
|
https://git.hisp.no/arr/blocklists/raw/branch/main/whitelist
|
||||||
|
```
|
||||||
|
|
||||||
|
This is the more conservative choice: only the extensions explicitly
|
||||||
|
listed (video containers and subtitles) are allowed. Anything else --
|
||||||
|
including extensions that upstream has not yet flagged as malicious --
|
||||||
|
is deleted.
|
||||||
|
|
||||||
|
### Which mode to use
|
||||||
|
|
||||||
|
| Use case | Mode | Why |
|
||||||
|
|---|---|---|
|
||||||
|
| You trust upstream Cleanuparr's coverage and want to keep everything except known-bad | Blacklist | Lets through unusual-but-legitimate file types (e.g. exotic subtitle formats) |
|
||||||
|
| You only want a strict set of video + subtitle files on disk | Whitelist | Much stricter; deletes anything not explicitly listed |
|
||||||
|
| You want behaviour consistent with qBittorrent | Blacklist | Same source file, same semantics |
|
||||||
|
|
||||||
|
Blacklist mode is the recommended default because it matches the
|
||||||
|
qBittorrent side and avoids unexpected deletions of legitimate but
|
||||||
|
non-listed files.
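
The difference between the two modes comes down to one inverted condition. The sketch below models it with `fnmatch`; it is not Cleanuparr's implementation, and `should_delete`, the mode strings, and the sample file names are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def should_delete(filename: str, patterns: list[str], mode: str) -> bool:
    """Model of the two modes: delete on match, or delete on no match."""
    matched = any(fnmatchcase(filename, pattern) for pattern in patterns)
    if mode == "blacklist":
        return matched          # delete only known-bad files
    if mode == "whitelist":
        return not matched      # delete everything not explicitly allowed
    raise ValueError(f"unknown mode: {mode!r}")


files = ["movie.mkv", "movie.idx", "setup.exe"]
blacklist = ["*.exe"]
whitelist = ["*.mkv", "*.srt"]

print([f for f in files if should_delete(f, blacklist, "blacklist")])
print([f for f in files if should_delete(f, whitelist, "whitelist")])
```

The unlisted `movie.idx` survives in blacklist mode and is deleted in whitelist mode, which is the trade-off the table describes.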
|
||||||
|
|
||||||
|
## Keeping both consumers in sync
|
||||||
|
|
||||||
|
Both consumers ultimately read the whitelist (directly in Cleanuparr
|
||||||
|
whitelist mode, indirectly via subtraction in blacklist mode and in
|
||||||
|
qBittorrent). This means maintenance is centralised:
|
||||||
|
|
||||||
|
1. Add a line to `whitelist`.
|
||||||
|
2. Wait for the next sync run (or dispatch manually).
|
||||||
|
3. Both consumers honour the change after their next refresh.
|
||||||
|
|
||||||
|
There is no per-tool configuration drift because there is no per-tool
|
||||||
|
configuration to drift.
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### A file I whitelisted is still being blocked / deleted
|
||||||
|
|
||||||
|
Check each layer in order:
|
||||||
|
|
||||||
|
1. **Sync ran successfully?** Open the Gitea Actions page for the
|
||||||
|
repository and verify the most recent run is green and newer than
|
||||||
|
your whitelist commit.
|
||||||
|
2. **Blacklist was updated?** Read `blacklist` in Gitea and confirm your
|
||||||
|
whitelisted entry is not present.
|
||||||
|
3. **Consumer refreshed?** qBittorrent requires a manual reload or
|
||||||
|
restart. Cleanuparr refreshes on its own interval -- check its logs
|
||||||
|
to confirm it picked up the new file.
|
||||||
|
4. **Exact string match?** Whitelist entries must match blacklist entries
|
||||||
|
exactly. `*.srt` in whitelist does not strip `*sample.srt` from
|
||||||
|
blacklist. See [Lists](Lists) for pattern semantics.
|
||||||
|
|
||||||
|
### A file I did not whitelist is passing through
|
||||||
|
|
||||||
|
Check whether the pattern is in the blacklist at all:
|
||||||
|
|
||||||
|
1. Open `blacklist` in Gitea and search for the extension.
|
||||||
|
2. If it is not there, upstream does not block it either. You can add
|
||||||
|
it to `blacklist` directly (manual local addition, preserved by the
|
||||||
|
three-way merge) or file an upstream issue.
|
||||||
|
|
||||||
|
### Consumer returns 404
|
||||||
|
|
||||||
|
Verify the URL uses `raw/branch/main/`, not `src/branch/main/`:
|
||||||
|
|
||||||
|
```
|
||||||
|
# Correct
|
||||||
|
https://git.hisp.no/arr/blocklists/raw/branch/main/blacklist
|
||||||
|
|
||||||
|
# Wrong (serves HTML, not the file)
|
||||||
|
https://git.hisp.no/arr/blocklists/src/branch/main/blacklist
|
||||||
|
```
|
||||||
|
|
||||||
|
Also check the repository name and branch are correct
|
||||||
|
(`arr/blocklists`, `main`).
|
||||||
|
|
||||||
|
### Cleanuparr deletes subtitle files
|
||||||
|
|
||||||
|
Cleanuparr is running in whitelist mode against `blacklist`, which is
|
||||||
|
the wrong combination. Either switch it to blacklist mode (keep the URL),
|
||||||
|
or keep whitelist mode and point it at `whitelist` instead.
|
||||||
-1
@@ -1 +0,0 @@
|
|||||||
Welcome to the Wiki.
|
|
||||||
+206
@@ -0,0 +1,206 @@
|
|||||||
|
# Lists
|
||||||
|
|
||||||
|
## The two-file model
|
||||||
|
|
||||||
|
The repository contains exactly two data files. Each has a single, clear
|
||||||
|
role:
|
||||||
|
|
||||||
|
| File | Role | Source of truth | Edit it? |
|
||||||
|
|---|---|---|---|
|
||||||
|
| `blacklist` | Extensions blocked by downloaders and file cleaners | Upstream Cleanuparr, minus `whitelist` | Only for manual additions that upstream missed. Removals do not stick -- use `whitelist` instead |
|
||||||
|
| `whitelist` | Extensions that must never be blocked or deleted | Locally maintained, not synced from upstream | Yes. This is the main file you interact with |
|
||||||
|
|
||||||
|
`blacklist.prev` also exists in the repo but is not a data file -- it is
|
||||||
|
the three-way merge baseline used by the sync script. Never edit it.
|
||||||
|
|
||||||
|
## `blacklist`
|
||||||
|
|
||||||
|
The blacklist is the output file consumed by qBittorrent and (optionally)
|
||||||
|
Cleanuparr. It is regenerated on every sync as:
|
||||||
|
|
||||||
|
```
|
||||||
|
(upstream_new | custom_local_additions) - whitelist
|
||||||
|
```
|
||||||
|
|
||||||
|
Where `custom_local_additions` is detected by comparing the committed
|
||||||
|
`blacklist` against the previous upstream snapshot. See
|
||||||
|
[Sync](Sync) for the full algorithm.
|
||||||
|
|
||||||
|
### When to edit `blacklist` directly
|
||||||
|
|
||||||
|
In almost every case, you do not. The intended workflow is:
|
||||||
|
|
||||||
|
- To **remove** an entry (stop blocking it): add it to `whitelist`.
|
||||||
|
- To **add** an entry that upstream should also have: file an upstream
|
||||||
|
issue with Cleanuparr.
|
||||||
|
- To **add** an entry that is specific to your setup and not worth
|
||||||
|
upstreaming: edit `blacklist` directly. The three-way merge preserves
|
||||||
|
manual additions across syncs.
|
||||||
|
|
||||||
|
### When editing `blacklist` directly does not work
|
||||||
|
|
||||||
|
Removing a line from `blacklist` does not work as a removal mechanism.
|
||||||
|
The sync will re-add anything upstream has on the next run. If you want
|
||||||
|
something gone, put it in `whitelist`.
|
||||||
|
|
||||||
|
## `whitelist`
|
||||||
|
|
||||||
|
The whitelist is the locally-maintained allow list. It is the single source
|
||||||
|
of truth for "what must be kept." It is not synced from upstream -- any
|
||||||
|
changes you make are permanent until you change them again.
|
||||||
|
|
||||||
|
### Format
|
||||||
|
|
||||||
|
One glob pattern per line, sorted, no blank lines, no comments:
|
||||||
|
|
||||||
|
```
|
||||||
|
*.ass
|
||||||
|
*.avi
|
||||||
|
*.mkv
|
||||||
|
*.mp4
|
||||||
|
*.srt
|
||||||
|
*.ssa
|
||||||
|
*.sub
|
||||||
|
*.webm
|
||||||
|
```
|
||||||
|
|
||||||
|
The sort order is not enforced by the script but is the convention and
|
||||||
|
makes diffs easier to read.
|
||||||
|
|
||||||
|
### Semantics
|
||||||
|
|
||||||
|
Each line is treated as an exact string and subtracted from the blacklist.
|
||||||
|
See [Pattern matching](#pattern-matching) below for the details.
|
||||||
|
|
||||||
|
### Adding an entry
|
||||||
|
|
||||||
|
Edit `whitelist` in Gitea (or via a local clone and push), add the new
|
||||||
|
line, commit. The next sync run (or manual dispatch) will strip it from
|
||||||
|
the blacklist automatically.
|
||||||
|
|
||||||
|
You do not also need to remove it from the blacklist by hand -- the sync
|
||||||
|
does that.
|
||||||
|
|
||||||
|
### Removing an entry
|
||||||
|
|
||||||
|
Delete the line from `whitelist` and commit. The next sync will re-add
|
||||||
|
the entry to the blacklist if upstream still has it. If upstream no longer
|
||||||
|
has the entry, the entry stays gone (which is probably what you want).
|
||||||
|
|
||||||
|
## Pattern matching
|
||||||
|
|
||||||
|
The whitelist-to-blacklist exclusion uses **exact-string set subtraction**,
|
||||||
|
not glob matching. This is an intentional design choice that has two
|
||||||
|
important consequences.
|
||||||
|
|
||||||
|
### Exact entries are stripped
|
||||||
|
|
||||||
|
`*.srt` in the whitelist removes exactly the string `*.srt` from the
|
||||||
|
blacklist. If upstream has `*.srt` as a line, it gets removed. If upstream
|
||||||
|
does not have `*.srt`, nothing happens.
|
||||||
|
|
||||||
|
### Partial matches are not affected
|
||||||
|
|
||||||
|
`*.srt` in the whitelist strips only the identical blacklist entry:
|
||||||
|
|
||||||
|
| Blacklist entry | Stripped? | Why |
|
||||||
|
|---|---|---|
|
||||||
|
| `*.srt` | yes | Identical string |
|
||||||
|
| `*sample.srt` | no | Different string |
|
||||||
|
| `*.srt.bak` | no | Different string |
|
||||||
|
| `file.srt` | no | Different string |
|
||||||
|
|
||||||
|
This is what makes the whitelist safe to maintain. You can whitelist
|
||||||
|
`*.srt` to keep bundled subtitle files without accidentally unblocking
|
||||||
|
sample files or junk variants that happen to end in `.srt`.
|
||||||
|
|
||||||
|
### Why not glob matching
|
||||||
|
|
||||||
|
A glob-based exclusion would strip every blacklist entry that matches
`*.srt` as a pattern, which would also strip `*sample.srt` and `file.srt`.
That is usually not what you want -- sample files are genuine junk that
the blacklist should still remove.
|
||||||
|
|
||||||
|
Exact-string subtraction is also trivially simple to reason about: if the
|
||||||
|
line you want stripped is in the blacklist as the exact same string, put
|
||||||
|
that same string in the whitelist. Done.
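
The contrast between the two approaches can be shown in a few lines. The exact-string version is the set subtraction described on this page; the glob-based version is hypothetical and appears only for comparison. The sample entries are made up.

```python
from fnmatch import fnmatchcase

blacklist = {"*.srt", "*sample.srt", "*.srt.bak", "file.srt", "*.exe"}
whitelist = {"*.srt"}

# Exact-string subtraction: only the identical entry is removed.
exact = blacklist - whitelist

# Hypothetical glob-based exclusion: every entry that matches the
# whitelist line as a pattern is removed.
glob_based = {
    entry
    for entry in blacklist
    if not any(fnmatchcase(entry, pattern) for pattern in whitelist)
}

print(sorted(exact))
print(sorted(glob_based))
```

Exact subtraction keeps `*sample.srt` and `file.srt` blocked; the glob variant would drop both along with `*.srt`.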
|
||||||
|
|
||||||
|
## Examples
|
||||||
|
|
||||||
|
### Keeping Norwegian subtitle files
|
||||||
|
|
||||||
|
Scenario: torrents include `.srt` files as bundled Norwegian subtitles.
|
||||||
|
You want qBittorrent to download them, not strip them.
|
||||||
|
|
||||||
|
```
|
||||||
|
# whitelist entry
|
||||||
|
*.srt
|
||||||
|
```
|
||||||
|
|
||||||
|
After the next sync, `*.srt` is gone from `blacklist`. qBittorrent now
|
||||||
|
accepts `.srt` files from within torrents. `*sample.srt` remains blocked.
|
||||||
|
|
||||||
|
### Supporting AV1 in `.webm` containers
|
||||||
|
|
||||||
|
Scenario: you want qBittorrent to accept `.webm` AV1 releases, which are
|
||||||
|
currently blocked because the upstream blacklist treats `*.webm` as junk.
|
||||||
|
|
||||||
|
```
|
||||||
|
# whitelist entry
|
||||||
|
*.webm
|
||||||
|
```
|
||||||
|
|
||||||
|
After the next sync, `*.webm` is gone from `blacklist`. `.webm` torrents
|
||||||
|
download normally. `*sample.webm` remains blocked.
|
||||||
|
|
||||||
|
### Adding a site-specific junk extension
|
||||||
|
|
||||||
|
Scenario: a private tracker keeps injecting `*.nfo.gz` spam files that
|
||||||
|
upstream does not block.
|
||||||
|
|
||||||
|
```
|
||||||
|
# Edit blacklist directly, add the line:
|
||||||
|
*.nfo.gz
|
||||||
|
```
|
||||||
|
|
||||||
|
Commit and push. The next sync runs, the three-way merge sees
|
||||||
|
`*.nfo.gz` in `local - upstream_prev`, classifies it as a manual addition,
|
||||||
|
and preserves it through the merge. Subsequent syncs continue to preserve
|
||||||
|
it even as upstream evolves.
|
||||||
|
|
||||||
|
If upstream ever adds `*.nfo.gz` itself, the entry moves from "custom"
|
||||||
|
to "upstream" on the next sync -- still present, still blocked, just
|
||||||
|
sourced differently.
|
||||||
|
|
||||||
|
## What lives in each file right now
|
||||||
|
|
||||||
|
The whitelist ships with the extensions required for a normal media
|
||||||
|
stack with subtitles and AV1/webm releases:
|
||||||
|
|
||||||
|
```
|
||||||
|
*.ass - SubStation Alpha subtitles
|
||||||
|
*.avi - Audio Video Interleave container
|
||||||
|
*.mkv - Matroska container
|
||||||
|
*.mp4 - MPEG-4 container
|
||||||
|
*.srt - SubRip subtitles
|
||||||
|
*.ssa - SubStation Alpha subtitles
|
||||||
|
*.sub - MicroDVD / VobSub subtitles
|
||||||
|
*.webm - WebM container (AV1, VP9)
|
||||||
|
```
|
||||||
|
|
||||||
|
The blacklist contains whatever upstream Cleanuparr ships, minus everything
|
||||||
|
in the whitelist above. The actual contents change as upstream evolves --
|
||||||
|
check the file in Gitea for the current state.
|
||||||
|
|
||||||
|
## Consumer consequences
|
||||||
|
|
||||||
|
Changes to either file affect what consumers see:
|
||||||
|
|
||||||
|
| Change | Effect on qBittorrent | Effect on Cleanuparr (blacklist mode) | Effect on Cleanuparr (whitelist mode) |
|
||||||
|
|---|---|---|---|
|
||||||
|
| Add to `whitelist` | Stops blocking this extension | Stops deleting this extension | Starts allowing this extension |
|
||||||
|
| Remove from `whitelist` | Resumes blocking (if upstream has it) | Resumes deleting (if upstream has it) | Stops allowing this extension |
|
||||||
|
| Add to `blacklist` directly | Starts blocking this extension | Starts deleting this extension | No effect |
|
||||||
|
| Remove from `blacklist` directly | No effect (sync re-adds) | No effect (sync re-adds) | No effect |
|
||||||
|
|
||||||
|
See [Consumers](Consumers) for configuration details.
|
||||||
+208
@@ -0,0 +1,208 @@
|
|||||||
|
# Sync
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The sync process fetches the upstream Cleanuparr blacklist, preserves any
|
||||||
|
manual local additions, subtracts the locally-maintained whitelist, and
|
||||||
|
writes the result back to `blacklist`. It runs on a schedule (roughly every 7 days)
|
||||||
|
and on manual dispatch. All logic lives in `scripts/merge_blocklists.py`
|
||||||
|
(about 45 lines of pure Python standard library, no third-party deps).
|
||||||
|
|
||||||
|
## Inputs
|
||||||
|
|
||||||
|
The script reads four sources on every run:
|
||||||
|
|
||||||
|
| Source | Path | Role |
|
||||||
|
|---|---|---|
|
||||||
|
| Upstream | `https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist` | Current upstream state, fetched over HTTPS |
|
||||||
|
| Upstream snapshot | `blacklist.prev` | What upstream looked like on the previous sync (baseline) |
|
||||||
|
| Committed blacklist | `blacklist` | Current committed state, may contain manual local additions |
|
||||||
|
| Whitelist | `whitelist` | Locally-maintained entries to strip from the merged result |
|
||||||
|
|
||||||
|
All four are parsed the same way: one entry per non-empty line, stripped of
|
||||||
|
leading/trailing whitespace, loaded into a Python `set`.
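
The parsing rule is small enough to sketch in full. The function names `parse_entries` and `read_entries` are illustrative and may not match the script. Treating a missing file as an empty set follows the edge cases described later on this page.

```python
from pathlib import Path


def parse_entries(text: str) -> set[str]:
    """One entry per non-empty line, stripped of surrounding whitespace."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def read_entries(path: Path) -> set[str]:
    """Read a list file; a missing file is treated as an empty set."""
    if not path.exists():
        return set()
    return parse_entries(path.read_text(encoding="utf-8"))


print(sorted(parse_entries("*.exe\n\n  *.lnk  \n*.exe\n")))
```

Loading into a `set` also removes duplicate lines, which is why the sample above yields two entries.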
|
||||||
|
|
||||||
|
## Three-way merge
|
||||||
|
|
||||||
|
The script performs a classic three-way merge, git-style, using set
|
||||||
|
operations:
|
||||||
|
|
||||||
|
```
|
||||||
|
custom = local - upstream_prev
|
||||||
|
merged = upstream_new | custom
|
||||||
|
result = merged - whitelist
|
||||||
|
```
|
||||||
|
|
||||||
|
Each line does one specific job:
|
||||||
|
|
||||||
|
### `custom = local - upstream_prev`
|
||||||
|
|
||||||
|
Compute what was added locally. Anything in the committed `blacklist` that
|
||||||
|
was not in the previous upstream snapshot must be a manual local addition,
|
||||||
|
because the sync script is the only other thing that writes to `blacklist`
|
||||||
|
and it always produces a subset of `upstream_new | custom`. Tracking this
|
||||||
|
set lets the next sync re-apply those additions on top of the new upstream.
|
||||||
|
|
||||||
|
### `merged = upstream_new | custom`
|
||||||
|
|
||||||
|
Union the fresh upstream with the preserved local additions. Upstream
|
||||||
|
additions flow in (they appear in `upstream_new`), upstream removals flow
|
||||||
|
out (they were in `upstream_prev` but are not in `upstream_new`, and are
|
||||||
|
also not in `custom`), and manual local additions survive.
|
||||||
|
|
||||||
|
### `result = merged - whitelist`
|
||||||
|
|
||||||
|
Strip every entry that appears in the locally-maintained whitelist. This
|
||||||
|
is the step that enables local removals: an extension placed in `whitelist`
|
||||||
|
is always removed from the final `blacklist`, no matter how many times
|
||||||
|
upstream re-adds it.
|
||||||
|
|
||||||
|
After the merge the script writes `result` to `blacklist` and overwrites
|
||||||
|
`blacklist.prev` with `upstream_new` so the next run has a fresh baseline.
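
Put together, the merge and the write-back can be sketched like this. The set operations are the ones given above; the function names, the sorted output order, and the sample entries are assumptions for illustration.

```python
from pathlib import Path


def three_way_merge(upstream_new, upstream_prev, local, whitelist):
    """Apply the three set operations and return the new blacklist."""
    custom = local - upstream_prev      # manual local additions
    merged = upstream_new | custom      # fresh upstream plus local additions
    return merged - whitelist           # local removals via the whitelist


def write_entries(path: Path, entries: set[str]) -> None:
    """Write one entry per line (sorted here by assumption)."""
    path.write_text("".join(f"{e}\n" for e in sorted(entries)), encoding="utf-8")


upstream_prev = {"*.exe", "*.scr", "*.srt"}
upstream_new = {"*.exe", "*.lnk", "*.srt"}      # added *.lnk, dropped *.scr
local = {"*.exe", "*.scr", "*.nfo.gz"}          # *.nfo.gz added by hand
whitelist = {"*.srt"}

result = three_way_merge(upstream_new, upstream_prev, local, whitelist)
print(sorted(result))

# A real run would then persist both files:
#   write_entries(Path("blacklist"), result)
#   write_entries(Path("blacklist.prev"), upstream_new)
```

In this run `*.lnk` flows in, `*.scr` flows out, `*.nfo.gz` survives as a custom entry, and `*.srt` stays stripped.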
|
||||||
|
|
||||||
|
## Why a three-way merge
|
||||||
|
|
||||||
|
A simpler design would be `result = upstream_new - whitelist`, with no
|
||||||
|
`.prev` file and no custom tracking. That works for the common case but
|
||||||
|
drops an escape hatch: if you spot something upstream missed (a new
|
||||||
|
malware extension, a tracker-specific junk file) and add it directly to
|
||||||
|
`blacklist`, the next sync would silently drop it.
|
||||||
|
|
||||||
|
The three-way merge preserves those manual additions without requiring
|
||||||
|
them to live in a separate "additions" file. If you never add anything
|
||||||
|
directly, the `custom` set is empty on every run and the merge reduces to
|
||||||
|
`upstream_new - whitelist`. The overhead is one extra file (`blacklist.prev`)
|
||||||
|
and two set operations.
|
||||||
|
|
||||||
|
## Whitelist exclusion
|
||||||
|
|
||||||
|
The whitelist is subtracted with exact-string set subtraction, not pattern
|
||||||
|
matching. This has two important consequences:
|
||||||
|
|
||||||
|
### Exact entries are stripped
|
||||||
|
|
||||||
|
`*.srt` in `whitelist` strips exactly `*.srt` from the blacklist. Same for
|
||||||
|
`*.webm`, `*.mkv`, etc.
|
||||||
|
|
||||||
|
### Sample patterns are preserved
|
||||||
|
|
||||||
|
The upstream blacklist contains entries like `*sample.srt`, `*sample.webm`,
and `*sample.mkv` that block sample files, one entry per extension. These
are separate string entries from `*.srt` or `*.webm`, so whitelisting the
plain extension does not remove the sample-file variant. Sample files
continue to be blocked.
|
||||||
|
|
||||||
|
This is almost always the behaviour you want: subtitle files shipped inside
|
||||||
|
a release are kept, but standalone "sample.srt" clutter is still filtered.
|
||||||
|
|
||||||
|
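The two consequences above can be checked with a short sketch. The entries are sample data, and the `fnmatch`-based variant is included only as a contrast: it shows what pattern matching would do, which is not what the sync script does.

```python
import fnmatch

blacklist = {"*.srt", "*sample.srt", "*.webm", "*sample.webm", "*.exe"}
whitelist = {"*.srt", "*.webm"}

# What the sync does: exact-string set subtraction.
exact = blacklist - whitelist

# For contrast only (NOT what the script does): treating each whitelist
# entry as a glob pattern and dropping every blacklist entry it matches.
as_patterns = {
    entry
    for entry in blacklist
    if not any(fnmatch.fnmatchcase(entry, pattern) for pattern in whitelist)
}

print(sorted(exact))        # ['*.exe', '*sample.srt', '*sample.webm']
print(sorted(as_patterns))  # ['*.exe']
```

With exact subtraction the sample entries stay in the blacklist; under pattern matching they would have been stripped along with the plain extensions.
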
## The `.prev` file

`blacklist.prev` is a plain text snapshot of whatever `upstream_new` was on
the previous successful run. It has no special format, no metadata, and
should never be edited manually. The sync script rewrites it at the end of
every run.

It exists purely as the baseline for the `local - upstream_prev` step in
the three-way merge. Without it, the script could not distinguish "this
entry was in local because upstream had it" from "this entry was in local
because someone added it manually."

If `blacklist.prev` is missing (first run, or manually deleted), the script
treats the current `upstream_new` as the baseline. Manual additions already
in `blacklist` survive, because they are not in `upstream_new`. The cost is
that any entry upstream removed since `blacklist` was last written is also
classified as custom and kept: without the old snapshot the script cannot
tell the two apart. After such a run, review `blacklist` and delete any
stale entries by hand; the next sync will not bring them back.

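The role of the baseline is easiest to see by classifying the same `local` set with and without it. This is a sketch under assumptions: `classify_custom` is a hypothetical helper that mirrors the script's `custom = local - upstream_prev` step and its fallback to `upstream_new`, and the entries are made up.

```python
def classify_custom(local, upstream_prev, upstream_new):
    """Return the entries the merge would treat as manual additions."""
    # Mirrors the script's fallback: an empty or missing baseline is
    # replaced by the current upstream list.
    baseline = upstream_prev or upstream_new
    return local - baseline


local = {"*.exe", "*.old", "*.lnk"}  # "*.lnk" was added by hand
upstream_new = {"*.exe"}             # upstream has since dropped "*.old"

with_prev = classify_custom(local, {"*.exe", "*.old"}, upstream_new)
without_prev = classify_custom(local, set(), upstream_new)

print(sorted(with_prev))     # ['*.lnk']
print(sorted(without_prev))  # ['*.lnk', '*.old']
```

With the snapshot, only the manual entry counts as custom and `*.old` is dropped along with upstream. Without it, `*.old` is indistinguishable from a manual addition and is kept.
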
## Edge cases

### First run

`blacklist.prev` does not exist, `blacklist` may or may not exist.
`upstream_prev = upstream_new`, so `custom = local - upstream_new` (anything
in `local` that is not upstream). After the run, `.prev` exists and
subsequent runs use the normal path.

### Empty or missing whitelist

If `whitelist` is missing or empty, `whitelist = set()` and the subtraction
is a no-op. The merge degenerates to a plain upstream sync with local
additions preserved.

### Empty or missing blacklist

If `blacklist` is missing, `local = set()`, `custom = set()`, and
`result = upstream_new - whitelist`. Equivalent to a fresh install.

### Upstream removes an entry that is also in the whitelist

Harmless. `upstream_new` does not contain it, so `merged` does not contain
it, and the whitelist subtraction removes nothing (the entry was already
absent). The whitelist entry stays as a harmless no-op for future syncs.

### An entry appears in both whitelist and blacklist custom additions

You manually added `*.foo` to `blacklist` and also added `*.foo` to
`whitelist`. The whitelist wins: `*.foo` is in `custom`, survives the
union, then gets stripped by the final subtraction. The committed
`blacklist` will not contain `*.foo`. The custom entry is effectively
invisible until you remove `*.foo` from `whitelist`.

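Each edge case above can be reproduced with plain sets. The sketch below assumes a hypothetical `merge` helper that wraps the same three set operations as the sync script, with the file and network I/O left out; all entries are sample data.

```python
def merge(local, upstream_prev, upstream_new, whitelist):
    """Three-way merge on in-memory sets (no file or network access)."""
    if not upstream_prev:                # first run: no baseline yet
        upstream_prev = set(upstream_new)
    custom = local - upstream_prev
    merged = upstream_new | custom
    return merged - whitelist


upstream = {"*.exe", "*.scr"}

# First run: no .prev, so anything in local that is not upstream is custom.
first_run = merge({"*.exe", "*.lnk"}, set(), upstream, set())

# Empty or missing whitelist: the subtraction is a no-op.
no_whitelist = merge({"*.exe", "*.lnk"}, {"*.exe"}, upstream, set())

# Empty or missing blacklist: result is upstream_new - whitelist.
no_blacklist = merge(set(), {"*.exe"}, upstream, {"*.scr"})

# Upstream removed "*.srt", which is also whitelisted: nothing to strip.
removed_upstream = merge({"*.exe"}, {"*.exe", "*.srt"}, upstream, {"*.srt"})

# "*.foo" is both a custom addition and whitelisted: the whitelist wins.
both = merge({"*.exe", "*.foo"}, {"*.exe"}, upstream, {"*.foo"})

for name, result in [
    ("first run", first_run),
    ("no whitelist", no_whitelist),
    ("no blacklist", no_blacklist),
    ("removed upstream", removed_upstream),
    ("custom and whitelisted", both),
]:
    print(f"{name}: {sorted(result)}")
```
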
## Reporting

Each sync run logs four lines to the workflow output:

```
[blacklist] Upstream added: [...]
[blacklist] Upstream removed: [...]
[blacklist] Custom preserved: [...]
[blacklist] Whitelist stripped: [...]
```

These are sorted lists showing exactly what changed. Check the Actions run
log after any sync to see what happened, especially if a consumer reports
unexpected behaviour.

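The four lists can be derived from the same sets the merge already computes. The sketch below is one plausible way to do it, not the script's actual logging code: the helper name `report_lines` and the exact definition of each list are assumptions, so check `scripts/merge_blocklists.py` for the real logic.

```python
def report_lines(local, upstream_prev, upstream_new, whitelist):
    """Build the four log lines from the merge inputs (assumed definitions)."""
    custom = local - upstream_prev
    merged = upstream_new | custom
    sections = {
        "Upstream added": upstream_new - upstream_prev,
        "Upstream removed": upstream_prev - upstream_new,
        "Custom preserved": custom,
        "Whitelist stripped": merged & whitelist,
    }
    return [
        f"[blacklist] {label}: {sorted(entries)}"
        for label, entries in sections.items()
    ]


# Sample data for illustration only.
lines = report_lines(
    local={"*.exe", "*.old", "*.lnk"},
    upstream_prev={"*.exe", "*.old"},
    upstream_new={"*.exe", "*.bat"},
    whitelist={"*.bat"},
)
for line in lines:
    print(line)
```
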
## Full script

```python
import urllib.request

UPSTREAM_URL = "https://raw.githubusercontent.com/Cleanuparr/Cleanuparr/main/blacklist"
BLACKLIST = "blacklist"
BLACKLIST_PREV = "blacklist.prev"
WHITELIST = "whitelist"


def read_lines(path):
    """Return the non-empty, stripped lines of `path` as a set (empty if missing)."""
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()


def main():
    # Fetch the current upstream list.
    with urllib.request.urlopen(UPSTREAM_URL) as r:
        upstream_new = set(
            line.strip() for line in r.read().decode().splitlines() if line.strip()
        )

    # Fall back to the current upstream list when there is no baseline yet.
    upstream_prev = read_lines(BLACKLIST_PREV)
    if not upstream_prev:
        upstream_prev = upstream_new.copy()

    local = read_lines(BLACKLIST)
    whitelist = read_lines(WHITELIST)

    custom = local - upstream_prev   # manual additions since the last sync
    merged = upstream_new | custom   # latest upstream plus manual additions
    result = merged - whitelist      # whitelist entries are always removed

    with open(BLACKLIST, "w") as f:
        f.write("\n".join(sorted(result)) + "\n")

    # Save the new upstream snapshot as the baseline for the next run.
    with open(BLACKLIST_PREV, "w") as f:
        f.write("\n".join(sorted(upstream_new)) + "\n")
```

Logging and the `__main__` guard are omitted above for clarity. See
`scripts/merge_blocklists.py` in the repository for the full source.