# Note Automation System: Phase 1 Specification

**Version:** 0.2  
**Date:** 2026-03-17  
**Purpose:** Implementation reference for Claude Code  

---

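## 1. System Background and Overall Context

### 1.1 The Problem This System Solves

The user's notes are scattered across several tools (Memos, Joplin, Discord, local folders, and so on), so ideas written down in the past never accumulate and are never reused. The system has three goals:

1. **Frictionless capture**: notes enter the system automatically, whichever tool they were written in.
2. **Meaningful organization**: a tag system makes notes queryable across multiple dimensions.
3. **Passive review**: no new interface is built; review happens in the moment, during AI conversations.

### 1.2 CODE System Architecture

The system uses an Obsidian Vault as its knowledge operating system, with CODE as the guiding workflow: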

```
C - Capture    → 0_Capture/     Note bodies, stored permanently
O - Organize   → 1_Organize/    Tag indexes (links only, no content)
D - Distill    → 2_Distill/     LYT / Zettelkasten (Phase 2)
E - Express    → 3_Express/     Outputs (Phase 3)
```

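**Phase 1 implements only the automated C → O flow.**  
D and E are outside the scope of this specification.

### 1.3 Design Principles

- **One-way data flow**: C → O → D → E; data only moves downstream.
- **Idempotency**: running any capturer repeatedly yields the same result and never creates duplicate notes.
- **Traceability**: every link in O can be traced back to its source and original ID in C.
- **The vocabulary is alive**: the AI can pick tags from the existing vocabulary or propose new ones (status `proposed`), so the vocabulary is refined through use.
- **Review is not part of Phase 1**: `review_status` in SQLite defaults to `pending`; the bridging layer is deferred to Phase 1.5.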

---

## 2. Vault Folder Structure

**Platform:** local Windows machine  
**Vault root:** set by `vault_root` in `config.yaml` (absolute path)

```
{vault_root}/
│
├── 0_Capture/                        ← Note bodies, stored permanently
│   ├── memos-20260317-abc123.md
│   ├── folder-20260315-xyz789.md
│   └── ...
│
├── 1_Organize/                       ← Tag indexes: [[links]] only, no note content
│   ├── para/
│   │   ├── project/
│   │   │   ├── note-system.md
│   │   │   └── aia-website.md
│   │   ├── area/
│   │   │   └── iso27001.md
│   │   ├── resource/
│   │   │   └── laravel.md
│   │   └── archive.md
│   ├── lyt/
│   │   ├── moc/
│   │   │   ├── automation.md
│   │   │   └── knowledge-management.md
│   │   ├── concept.md
│   │   └── source.md
│   ├── status/
│   │   ├── fleeting.md
│   │   ├── developing.md
│   │   └── evergreen.md
│   └── date/
│       └── 2026/
│           └── 03/
│               └── 2026-03-17.md
│
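├── 2_Distill/                        ← Phase 2, currently empty
├── 3_Express/                        ← Phase 3, currently empty
│
└── _data/                            ← System data; Obsidian does not index this folder
    ├── config.yaml                   ← Main configuration file
    ├── vocabulary.yaml               ← Controlled vocabulary
    ├── run_log.jsonl                 ← Run log (see 7.3)
    ├── sources/                      ← Source Registry, one JSON file per Capturer (see 5.3)
    │   ├── memos.json
    │   └── folder.json
    └── db/
        └── notes.db                  ← SQLite, full TaggingEvent history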
```

> **Note:** Add the `_data/` folder to the excluded folders in Obsidian's settings so that the `.yaml` and `.db` files are not indexed.

---

## 3. Project Code Structure

```
{project_root}/           ← Kept separate from the Vault (for example, outside the Vault directory)
│
├── src/
│   ├── capturers/
│   │   ├── __init__.py
│   │   ├── base.py       ← BaseCapturer abstract base class
│   │   ├── memos.py      ← Memos Capturer
│   │   └── folder.py     ← Folder Capturer
│   │
│   ├── organizer/
│   │   ├── __init__.py
│   │   ├── tagger.py     ← AI tagging logic
│   │   └── indexer.py    ← Maintains the link lists in 1_Organize/
│   │
│   ├── db/
│   │   ├── __init__.py
│   │   ├── schema.py     ← Creates the SQLite schema
│   │   └── repository.py ← TaggingEvent CRUD
│   │
│   └── utils/
│       ├── __init__.py
│       ├── config.py     ← Loads config.yaml / vocabulary.yaml
│       └── paths.py      ← Path handling (Windows compatible)
│
├── run.py                ← CLI entry point
├── requirements.txt
└── README.md
```

---

## 4. Configuration File Specifications

### 4.1 `_data/config.yaml`

```yaml
# Vault root (absolute Windows path; use forward slashes or escaped backslashes)
vault_root: "C:/Users/YY/Documents/MyVault"

# AI tagging settings
tagging:
  enabled: true
  provider: "anthropic"        # anthropic | gemini | ollama | cli
  model: "claude-3-5-haiku-20241022"  # model name for the chosen provider
  api_key_env: "ANTHROPIC_API_KEY"    # the API key is read from this environment variable
  max_tags_per_note: 3
  # Used when provider = cli
  cli_command: "claude"        # or "codex"

# Enabled tag taxonomies (turn off the ones you do not need)
taxonomies:
  para: true
  lyt: true
  status: true
  date: true                   # the date taxonomy is always enabled

# Per-Capturer settings
capturers:
  memos:
    enabled: true
    site_url: "https://your-memos.example.com"
    api_token_env: "MEMOS_API_TOKEN"
    sync_strategy: "since_last"   # since_last | full
    content_filter:
      min_length: 10
      exclude_tags: []

  folder:
    enabled: true
    watch_paths:
      - "C:/Users/YY/Documents/joplin-export"
      - "C:/Users/YY/Desktop/notes-inbox"
    recursive: true
    marker_strategy: "frontmatter"  # adds captured: true to the source file's frontmatter
    move_after_capture: false        # true = move the source file away after capture
    extensions: [".md", ".txt"]

# Scheduling (for APScheduler or an external cron)
schedule:
  interval_minutes: 15
```
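The `api_key_env` and `api_token_env` settings hold the *name* of an environment variable, not the secret itself. A minimal sketch of that lookup, assuming the config has already been parsed into a dict; the helper name `resolve_secret` and the `env` parameter are illustrative and not part of this specification:

```python
import os
from typing import Mapping


def resolve_secret(section: dict, key_field: str,
                   env: Mapping[str, str] = os.environ) -> str:
    """Return the secret whose environment-variable name is stored in section[key_field]."""
    var_name = section[key_field]            # e.g. "ANTHROPIC_API_KEY"
    value = env.get(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name!r} is not set")
    return value


# Usage with a stand-in environment (real code would rely on os.environ).
tagging_cfg = {"provider": "anthropic", "api_key_env": "ANTHROPIC_API_KEY"}
fake_env = {"ANTHROPIC_API_KEY": "sk-example"}
api_key = resolve_secret(tagging_cfg, "api_key_env", env=fake_env)
```

Failing early with a clear message keeps a missing token from surfacing later as an opaque authentication error. Providers that need no key (for example `ollama` or `cli`) would presumably skip this step.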

### 4.2 `_data/vocabulary.yaml`

```yaml
# Controlled vocabulary
# The AI may only pick tags from taxonomies, or propose new tags into the proposed block

taxonomies:
  para:
    - "para/project/note-system"
    - "para/project/aia-website"
    - "para/project/wedding"
    - "para/project/hr-system"
    - "para/area/iso27001"
    - "para/area/media-literacy"
    - "para/resource/laravel"
    - "para/resource/python"
    - "para/resource/n8n"
    - "para/resource/obsidian"
    - "para/archive"

  lyt:
    - "lyt/moc/automation"
    - "lyt/moc/knowledge-management"
    - "lyt/moc/ai-education"
    - "lyt/concept"
    - "lyt/source"

  status:
    - "status/fleeting"
    - "status/developing"
    - "status/evergreen"

# New tags proposed by the AI; moved into taxonomies after human review
proposed: []
# Example format:
# proposed:
#   - tag: "para/project/new-thing"
#     proposed_by: "ai"
#     proposed_at: "2026-03-17"
#     rationale: "多則筆記提及此主題"
#     review_status: "pending"   # pending | approved | rejected
```
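The two files interact: `vocabulary.yaml` lists the tags, while the `taxonomies` switches in `config.yaml` decide which of them the AI may see. A small sketch of that filtering, assuming both files are already loaded into dicts; `enabled_tags` is a hypothetical helper name:

```python
def enabled_tags(vocabulary: dict, taxonomy_flags: dict) -> list[str]:
    """Collect the tags of every taxonomy that is switched on in config.yaml."""
    tags: list[str] = []
    for name, entries in vocabulary.get("taxonomies", {}).items():
        if taxonomy_flags.get(name, False):
            tags.extend(entries)
    return tags


vocabulary = {
    "taxonomies": {
        "para": ["para/project/note-system", "para/archive"],
        "lyt": ["lyt/concept"],
        "status": ["status/fleeting"],
    },
    "proposed": [],
}
flags = {"para": True, "lyt": False, "status": True, "date": True}
allowed = enabled_tags(vocabulary, flags)
```

The `date` flag has no vocabulary entries because date tags are derived from `created_at` by the system.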

---

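## 5. Data Structure Specifications

### 5.1 Note Body (`0_Capture/{note_id}.md`)

Each note is stored permanently in `0_Capture/`, with its `note_id` as the file name.

**Frontmatter specification:**

```yaml
---
note_id: "memos-20260317-abc123"           # Globally unique ID, format: {source}-{YYYYMMDD}-{first 8 characters of the original ID}
source: "memos"                             # Source identifier: memos | folder | joplin | discord
source_id: "abc123"                         # Original ID in the source system
source_url: "https://memos.example.com/m/abc123"  # Optional link to the original
created_at: "2026-03-17T09:30:00+08:00"    # Original creation time (ISO 8601, with time zone)
captured_at: "2026-03-17T10:00:00+08:00"   # Time this system captured the note
tags_raw: ["#work", "#idea"]               # Original tags from the source system, kept unmodified

# The O stage writes the fields below (the C stage leaves them empty)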
tags:
  - tag: "para/project/note-system"
    agent: "ai"
    applied_at: "2026-03-17T10:05:00+08:00"
    ai_rationale: "內容提及自動化筆記流程建構"
    review_status: "pending"
  - tag: "lyt/moc/automation"
    agent: "ai"
    applied_at: "2026-03-17T10:05:00+08:00"
    ai_rationale: "提及 n8n 和自動化工作流程"
    review_status: "pending"
---

{original note content in Markdown, preserved in full}
```

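**note_id naming rules:**

- Format: `{source}-{YYYYMMDD}-{first 8 characters of the original ID}`
- Examples: `memos-20260317-abc12345`, `folder-20260315-readme`
- If the original ID is a file name, drop the extension, replace spaces with underscores, and keep only alphanumeric characters and underscores.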
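The naming rules above can be sketched as follows. This is one possible reading, not the definitive implementation: the function names and the `is_filename` flag are hypothetical, and replacing disallowed characters with `_` (rather than dropping them) is an assumption that matches the `my-note.md` → `my_note` example in 5.3.

```python
import re
from datetime import datetime
from pathlib import PurePath


def sanitize_id(original_id: str, is_filename: bool = False) -> str:
    """Reduce an original ID to at most 8 characters of [A-Za-z0-9_]."""
    if is_filename:
        original_id = PurePath(original_id).stem   # drop the extension
    # Spaces and every other disallowed character become underscores (assumption).
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", original_id)
    return cleaned[:8]


def build_note_id(source: str, created_at: str, original_id: str,
                  is_filename: bool = False) -> str:
    """Assemble {source}-{YYYYMMDD}-{id part} from an ISO 8601 timestamp."""
    date_part = datetime.fromisoformat(created_at).strftime("%Y%m%d")
    return f"{source}-{date_part}-{sanitize_id(original_id, is_filename)}"


memos_id = build_note_id("memos", "2026-03-17T09:30:00+08:00", "abc12345xyz")
folder_id = build_note_id("folder", "2026-03-17T10:00:00+08:00", "my-note.md",
                          is_filename=True)
```

Because only the first 8 characters survive, two different source IDs sharing a prefix and a date would map to the same `note_id`; an implementation may want to detect that case.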

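### 5.2 Tag Index Files (`1_Organize/{tag_path}.md`)

One Markdown file per tag, containing only a list of Obsidian internal links. `{tag_path}` is the full tag, taxonomy prefix included (for example `para/project/note-system`).

**Format:**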

```markdown
---
tag: "para/project/note-system"
taxonomy: "para"
created_at: "2026-03-17"
note_count: 3
---

[[memos-20260317-abc12345]]
[[folder-20260315-readme]]
[[memos-20260310-xyz98765]]
```

**Date tag format (`1_Organize/date/YYYY/MM/YYYY-MM-DD.md`):**

```markdown
---
tag: "date/2026-03-17"
taxonomy: "date"
created_at: "2026-03-17"
note_count: 5
---

[[memos-20260317-abc12345]]
[[memos-20260317-def67890]]
[[folder-20260317-mynote]]
```

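**Index update rules:**
- When a note is added, append `[[note_id]]` to the end of the corresponding tag file.
- If the tag file does not exist, create it automatically (including frontmatter).
- Recompute `note_count` after every update.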

### 5.3 Source Registry (`_data/sources/`)

Each Capturer has its own JSON file that records its sync state.

**`_data/sources/memos.json`:**

```json
{
  "source": "memos",
  "last_synced_at": "2026-03-17T10:00:00+08:00",
  "last_cursor": "2026-03-17T09:30:00+08:00",
  "captured_ids": ["abc123", "def456"],
  "total_captured": 42
}
```

**`_data/sources/folder.json`:**

```json
{
  "source": "folder",
  "last_synced_at": "2026-03-17T10:00:00+08:00",
  "captured_files": [
    {
      "original_path": "C:/Users/YY/Documents/joplin-export/my-note.md",
      "note_id": "folder-20260317-my_note",
      "captured_at": "2026-03-17T10:00:00+08:00"
    }
  ],
  "total_captured": 15
}
```
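A registry like `memos.json` can be maintained with a few small helpers. The sketch below is an assumption-laden illustration: the function names are hypothetical, and `total_captured` is treated as a running counter incremented by the number of newly seen IDs.

```python
import json
from pathlib import Path


def load_registry(path: Path, source: str) -> dict:
    """Return the stored registry, or an empty one on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"source": source, "last_synced_at": None, "last_cursor": None,
            "captured_ids": [], "total_captured": 0}


def record_captured(registry: dict, source_ids: list[str], synced_at: str) -> int:
    """Add unseen IDs to the registry and return how many were new."""
    known = set(registry["captured_ids"])
    # dict.fromkeys removes duplicates within the batch while keeping order.
    new_ids = [sid for sid in dict.fromkeys(source_ids) if sid not in known]
    registry["captured_ids"].extend(new_ids)
    registry["total_captured"] += len(new_ids)
    registry["last_synced_at"] = synced_at
    return len(new_ids)


def save_registry(path: Path, registry: dict) -> None:
    """Write the registry back as UTF-8 JSON, creating _data/sources/ if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(registry, ensure_ascii=False, indent=2),
                    encoding="utf-8")
```

Recording the same ID twice leaves the registry unchanged, which is what makes a repeated run idempotent.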

### 5.4 SQLite Schema (`_data/db/notes.db`)

```sql
-- Basic note data
CREATE TABLE notes (
    note_id       TEXT PRIMARY KEY,
    source        TEXT NOT NULL,
    source_id     TEXT NOT NULL,
    source_url    TEXT,
    created_at    TEXT NOT NULL,
    captured_at   TEXT NOT NULL,
    file_path     TEXT NOT NULL,       -- path relative to vault_root
    content_hash  TEXT,                -- MD5, used to detect content changes
    UNIQUE(source, source_id)
);

-- TaggingEvent: full history of every tagging action
CREATE TABLE tagging_events (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    note_id       TEXT NOT NULL REFERENCES notes(note_id),
    tag           TEXT NOT NULL,
    agent         TEXT NOT NULL CHECK(agent IN ('ai', 'human')),
    applied_at    TEXT NOT NULL,
    ai_model      TEXT,                -- model identifier when agent='ai'
    ai_rationale  TEXT,                -- AI rationale for the tag (in Chinese)
    review_status TEXT NOT NULL DEFAULT 'pending'
                  CHECK(review_status IN ('pending', 'approved', 'rejected', 'discarded')),
    reviewed_at   TEXT,
    UNIQUE(note_id, tag)               -- one active record per note and tag
);

-- Vocabulary change history
CREATE TABLE vocabulary_events (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    tag           TEXT NOT NULL,
    action        TEXT NOT NULL CHECK(action IN ('proposed', 'approved', 'rejected', 'added_manually')),
    actor         TEXT NOT NULL CHECK(actor IN ('ai', 'human')),
    actioned_at   TEXT NOT NULL,
    rationale     TEXT
);

-- Indexes
CREATE INDEX idx_tagging_note_id ON tagging_events(note_id);
CREATE INDEX idx_tagging_status ON tagging_events(review_status);
CREATE INDEX idx_tagging_tag ON tagging_events(tag);
CREATE INDEX idx_notes_source ON notes(source);
CREATE INDEX idx_notes_created ON notes(created_at);
```
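The `UNIQUE(source, source_id)` constraint lets the database itself enforce idempotency. The sketch below shows one way to lean on it with `INSERT OR IGNORE`, using Python's built-in `sqlite3` module and an in-memory database; `insert_note` is a hypothetical helper, not the specified `repository.py` API. Note that SQLite only enforces `REFERENCES` clauses when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

NOTES_DDL = """
CREATE TABLE IF NOT EXISTS notes (
    note_id       TEXT PRIMARY KEY,
    source        TEXT NOT NULL,
    source_id     TEXT NOT NULL,
    source_url    TEXT,
    created_at    TEXT NOT NULL,
    captured_at   TEXT NOT NULL,
    file_path     TEXT NOT NULL,
    content_hash  TEXT,
    UNIQUE(source, source_id)
);
"""


def insert_note(conn: sqlite3.Connection, note: dict) -> bool:
    """Insert a note; return False if it was already present."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO notes "
        "(note_id, source, source_id, source_url, created_at, captured_at, "
        " file_path, content_hash) "
        "VALUES (:note_id, :source, :source_id, :source_url, :created_at, "
        " :captured_at, :file_path, :content_hash)",
        note,
    )
    conn.commit()
    return cur.rowcount == 1      # 0 rows changed means the insert was ignored


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(NOTES_DDL)
note = {
    "note_id": "memos-20260317-abc123", "source": "memos", "source_id": "abc123",
    "source_url": None, "created_at": "2026-03-17T09:30:00+08:00",
    "captured_at": "2026-03-17T10:00:00+08:00",
    "file_path": "0_Capture/memos-20260317-abc123.md", "content_hash": None,
}
first = insert_note(conn, note)
second = insert_note(conn, note)
```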

---

## 6. Module Implementation Specifications

### 6.1 `BaseCapturer` (`src/capturers/base.py`)

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class RawNote:
    source: str
    source_id: str
    content: str                    # Original content in Markdown
    created_at: str                 # ISO 8601
    source_url: Optional[str] = None
    # Original tags from the source system; default_factory avoids a None default
    tags_raw: list[str] = field(default_factory=list)

class BaseCapturer(ABC):

    @abstractmethod
    def fetch(self) -> list[RawNote]:
        """Pull new content from the source as a list of RawNote. Return only content not yet captured."""

    @abstractmethod
    def get_last_synced(self) -> Optional[str]:
        """Read the source registry and return the last sync cursor (a time or an ID)."""

    @abstractmethod
    def update_registry(self, notes: list[RawNote]) -> None:
        """Update the source registry with the captured IDs or the time cursor."""

    def to_note_id(self, raw: RawNote) -> str:
        """Build the globally unique note_id: {source}-{YYYYMMDD}-{first 8 characters of source_id}."""
        date_part = datetime.fromisoformat(raw.created_at).strftime("%Y%m%d")
        # Keep only ASCII letters, digits and underscores; anything else (spaces included) becomes "_".
        # An explicit ASCII class is used because \w also matches non-ASCII letters in Python 3.
        id_part = re.sub(r"[^A-Za-z0-9_]", "_", raw.source_id[:8])
        return f"{raw.source}-{date_part}-{id_part}"
```

### 6.2 `MemosCapturer` (`src/capturers/memos.py`)

**Behavior:**
- Calls the Memos API (`GET /api/v1/memos`) and filters by creation time (`createTime` in the response).
- `sync_strategy: since_last`: fetch only notes created after `last_synced_at`.
- `sync_strategy: full`: fetch everything and deduplicate by `source_id`.
- Skip notes shorter than `content_filter.min_length`.
- Skip notes carrying any tag listed in `exclude_tags`.
- Return `RawNote` objects whose `content` is the Memos `content` field (already Markdown).

**API reference:**
- Endpoint: `{site_url}/api/v1/memos`
- Auth: `Authorization: Bearer {api_token}`
- Response fields: `name` (ID), `content`, `createTime`, `tags`

**Registry format:** `_data/sources/memos.json` (see 5.3)
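The two `content_filter` rules can be expressed as a single predicate. This is a sketch under assumptions: `passes_filter` is a hypothetical name, length is measured on the stripped content, notes with exactly `min_length` characters are kept, and tags are compared as exact strings (whether Memos returns them with or without a leading `#` is not stated here).

```python
def passes_filter(content: str, tags: list[str],
                  min_length: int, exclude_tags: list[str]) -> bool:
    """Return True when a memo should be captured."""
    if len(content.strip()) < min_length:
        return False                      # too short to be worth keeping
    # Reject the memo if it shares any tag with the exclusion list.
    return not set(tags) & set(exclude_tags)


keep = passes_filter("A longer idea about automating notes", ["idea"], 10, ["private"])
too_short = passes_filter("ok", ["idea"], 10, [])
excluded = passes_filter("A private thought worth hiding", ["private"], 10, ["private"])
```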

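### 6.3 `FolderCapturer` (`src/capturers/folder.py`)

**Behavior:**
- Scans the folders in `watch_paths` for all files matching `extensions`.
- Uses the `captured_files` list to decide whether a file has already been captured (compared by original path).
- `marker_strategy: frontmatter`: adds `captured: true` and `captured_at` to the source file's frontmatter.
- `move_after_capture: true`: moves the source file to `{watch_path}/_captured/` after capture.
- Uses the file's **actual modification time** (`mtime`) as `created_at`.
- `source_id` is the source file's path relative to `watch_path`, with spaces replaced by underscores.

**Notes:**
- Handle Windows paths with `pathlib.Path` to keep the code cross-platform.
- Always read files as UTF-8; if decoding fails, retry with `cp950`.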
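The encoding rule above amounts to trying each codec in turn. A minimal sketch, with the hypothetical helper name `read_text_with_fallback`:

```python
from pathlib import Path


def read_text_with_fallback(path: Path,
                            encodings: tuple[str, ...] = ("utf-8", "cp950")) -> str:
    """Read a text file, trying each encoding in order until one decodes."""
    last_error: Exception = ValueError("no encodings given")
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError as exc:
            last_error = exc              # remember the failure, try the next codec
    raise last_error
```

UTF-8 goes first because it rejects most byte sequences that are not valid UTF-8, whereas `cp950` accepts a much wider range of bytes and could silently mis-decode a UTF-8 file if tried first.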

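### 6.4 `Tagger` (`src/organizer/tagger.py`)

**Behavior:**

1. Read `taxonomies` from `vocabulary.yaml` and keep only the taxonomies enabled under `taxonomies` in `config.yaml`.
2. Build the AI prompt from the template in this section and call the configured provider.
3. Parse the JSON returned by the AI.
4. Valid tags (present in the vocabulary) → create a `tagging_events` record with `review_status: pending`.
5. Invalid tags (not in the vocabulary) → add them to the `proposed` block of `vocabulary.yaml`, and also create a `tagging_events` record whose `tag` field is written as `proposed:{tag_name}`.

**AI prompt template (in Traditional Chinese, the language required for the rationales):**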

```
你是一個筆記標籤分類助手。請根據以下筆記內容，從提供的受控詞彙表中選擇最多 {max_tags} 個最相關的標籤。

## 受控詞彙表
{vocabulary_list}

## 筆記內容
{note_content}

## 回傳格式
只回傳 JSON，不要有其他文字：
{
  "selected_tags": ["tag1", "tag2"],
  "proposed_tags": [
    {"tag": "新標籤名稱", "rationale": "提議原因（繁體中文，30字內）"}
  ],
  "rationale": {
    "tag1": "選擇理由（繁體中文，30字內）",
    "tag2": "選擇理由（繁體中文，30字內）"
  }
}

規則：
- selected_tags 只能從受控詞彙表中選擇，不可自行發明
- 若詞彙表中沒有合適標籤，才使用 proposed_tags 提議
- 日期標籤由系統自動處理，不需選擇
- 若筆記過短或內容不明確，selected_tags 可以為空陣列
```

**Provider implementations:**

- `anthropic`: uses the `anthropic` Python SDK
- `gemini`: uses the `google-generativeai` SDK
- `ollama`: uses the `ollama` Python SDK to call the local endpoint
- `cli`: runs the `claude` or `codex` CLI via subprocess, passing the prompt on stdin and parsing JSON from stdout
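Whichever provider is used, steps 3 to 5 of the Tagger behavior come down to parsing the JSON reply and splitting it into accepted and proposed tags. The sketch below assumes the reply follows the prompt's JSON format; `classify_response` is a hypothetical name, and treating entries under `proposed_tags` the same way as unknown `selected_tags` is an assumption.

```python
import json


def classify_response(raw_json: str, allowed: set[str],
                      max_tags: int) -> tuple[list[dict], list[dict]]:
    """Split an AI reply into accepted tagging events and proposed ones."""
    data = json.loads(raw_json)
    rationale = data.get("rationale", {})
    accepted: list[dict] = []
    proposed: list[dict] = []
    for tag in data.get("selected_tags", []):
        if tag in allowed:
            if len(accepted) < max_tags:      # enforce max_tags_per_note
                accepted.append({"tag": tag, "ai_rationale": rationale.get(tag),
                                 "review_status": "pending"})
        else:
            # The model invented a tag: record it under the proposed: prefix.
            proposed.append({"tag": f"proposed:{tag}",
                             "ai_rationale": rationale.get(tag),
                             "review_status": "pending"})
    for item in data.get("proposed_tags", []):
        proposed.append({"tag": f"proposed:{item['tag']}",
                         "ai_rationale": item.get("rationale"),
                         "review_status": "pending"})
    return accepted, proposed


reply = json.dumps({
    "selected_tags": ["para/resource/python", "para/resource/rust"],
    "proposed_tags": [{"tag": "lyt/moc/testing", "rationale": "example"}],
    "rationale": {"para/resource/python": "example"},
})
accepted, proposed = classify_response(reply, {"para/resource/python"}, max_tags=3)
```

A real implementation would also need to handle replies that are not valid JSON, for instance by catching `json.JSONDecodeError` and logging the note as an error in the run log.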

### 6.5 `Indexer` (`src/organizer/indexer.py`)

**Behavior:**

1. Receive a `note_id` and its list of confirmed tags.
2. For each tag, check whether `1_Organize/{tag_path}.md` exists:
   - If not, create the file (including frontmatter) containing `[[{note_id}]]`.
   - If it does, append `[[{note_id}]]` to the end of the file (skip if the link is already there).
3. Update `note_count` in the tag file (recount the links).
4. Date tag: derive the date from `created_at` automatically and write to `1_Organize/date/YYYY/MM/YYYY-MM-DD.md`.

**Path conversion rules:**

```
tag: "para/project/note-system"
→ file path: {vault_root}/1_Organize/para/project/note-system.md

tag: "lyt/moc/automation"
→ file path: {vault_root}/1_Organize/lyt/moc/automation.md

date: "2026-03-17"
→ file path: {vault_root}/1_Organize/date/2026/03/2026-03-17.md
```
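The path rules and the idempotent link update can be sketched together. The function names are hypothetical, and for brevity this version rewrites the whole index file instead of appending to it; since the file holds only frontmatter and links, the result is the same.

```python
import re
from pathlib import Path

LINK_RE = re.compile(r"^\[\[(.+?)\]\]$", re.MULTILINE)


def tag_to_path(vault_root: Path, tag: str) -> Path:
    """Map a tag such as "para/project/note-system" or "date/2026-03-17" to its index file."""
    if tag.startswith("date/"):
        day = tag.split("/", 1)[1]                    # "2026-03-17"
        year, month, _ = day.split("-")
        return vault_root / "1_Organize" / "date" / year / month / f"{day}.md"
    parts = tag.split("/")
    return vault_root.joinpath("1_Organize", *parts[:-1], f"{parts[-1]}.md")


def add_link(index_file: Path, tag: str, taxonomy: str,
             note_id: str, today: str) -> int:
    """Ensure [[note_id]] is listed exactly once; return the new note_count."""
    links: list[str] = []
    created_at = today
    if index_file.exists():
        text = index_file.read_text(encoding="utf-8")
        links = LINK_RE.findall(text)
        match = re.search(r'^created_at: "(.+)"$', text, re.MULTILINE)
        if match:
            created_at = match.group(1)               # keep the original creation date
    if note_id not in links:
        links.append(note_id)
    header = (f'---\ntag: "{tag}"\ntaxonomy: "{taxonomy}"\n'
              f'created_at: "{created_at}"\nnote_count: {len(links)}\n---\n\n')
    index_file.parent.mkdir(parents=True, exist_ok=True)
    body = "\n".join(f"[[{link}]]" for link in links) + "\n"
    index_file.write_text(header + body, encoding="utf-8")
    return len(links)
```

Calling `add_link` twice with the same `note_id` leaves the file unchanged, which keeps the Indexer consistent with the idempotency principle in 1.3.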

---

## 7. Main Flow (`run.py`)

### 7.1 Pipeline Execution Order

```
run.py --stage all
  ↓
1. Load config.yaml and vocabulary.yaml
2. Initialize SQLite (create the schema if it does not exist)
3. Run every enabled Capturer
   ├── MemosCapturer.fetch()
   │   ├── Convert to the note schema (frontmatter + content)
   │   ├── Write 0_Capture/{note_id}.md
   │   ├── Insert into the SQLite notes table
   │   └── update_registry()
   └── FolderCapturer.fetch()
       └── (same as above)
4. Run the Organizer
   └── For each newly added note:
       ├── Tagger.tag(note)
       │   ├── Call the AI to get tags
       │   ├── Write tagging_events (SQLite)
       │   └── Update the frontmatter tags field of 0_Capture/{note_id}.md
       └── Indexer.index(note_id, tags)
           ├── Update 1_Organize/{tag}.md
           └── Update 1_Organize/date/...md
5. Print the run summary
```

### 7.2 CLI Interface

```bash
# Run the full pipeline (same as --stage all)
python run.py

# Run only the C stage (no tagging)
python run.py --stage capture

# Run only the O stage (tag notes already in 0_Capture that have no tags yet)
python run.py --stage organize

# Use a specific config path
python run.py --config "C:/path/to/_data/config.yaml"

# Force re-capture (ignore the registry)
python run.py --force

# Dry run (show what would happen without writing anything)
python run.py --dry-run
```

### 7.3 Run Log

After every run, append one line of JSON to `_data/run_log.jsonl`:

```json
{
  "run_at": "2026-03-17T10:00:00+08:00",
  "stage": "all",
  "captured": {"memos": 3, "folder": 1},
  "tagged": 4,
  "errors": [],
  "duration_seconds": 12.3
}
```
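Appending to a JSON Lines file needs only the standard library. A short sketch, with `append_run_log` as a hypothetical helper name:

```python
import json
from pathlib import Path


def append_run_log(log_path: Path, entry: dict) -> None:
    """Append one run summary as a single JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps any Chinese error messages readable in the log.
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Each run adds exactly one line, so the log can be read back by parsing the file line by line with `json.loads`.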

---

## 8. Dependencies (`requirements.txt`)

```
# Core
pydantic>=2.0
pydantic-settings>=2.0
python-frontmatter>=1.1
pyyaml>=6.0
# pathlib ships with the standard library (Python 3.4+), so the pathlib2 backport is not needed

# AI providers (install as needed)
anthropic>=0.25     # provider: anthropic
google-generativeai>=0.5  # provider: gemini
ollama>=0.2         # provider: ollama

# Scheduling (optional)
apscheduler>=3.10

# Tooling
loguru>=0.7
click>=8.1          # CLI interface
```

---

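## 9. Phase 1 Scope Boundaries

### In Scope

- `MemosCapturer`: full implementation, including the `since_last` sync strategy
- `FolderCapturer`: full implementation, including the frontmatter marker strategy
- `BaseCapturer` abstract base class
- `Tagger`: AI tagging with four providers (anthropic / gemini / ollama / cli)
- `Indexer`: maintains the tag index files in `1_Organize/`
- Dual-layer storage with SQLite (TaggingEvent history)
- Controlled vocabulary mechanism (select + propose)
- CLI interface (`run.py`)
- Source Registry deduplication

### Out of Scope

- Review interaction (every `review_status` defaults to `pending`; handled in Phase 1.5)
- D stage (Distill)
- E stage (Express)
- Discord / Mattermost Capturers (the architecture leaves room for them; not implemented in Phase 1)
- Vector-based semantic search
- Conflict resolution (the same concept appearing in multiple sources)
- Obsidian Plugin integration
- Web UI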
---

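## 10. How to Add a Capturer (for Future Reference)

Adding a new Capturer takes only four steps:

1. Add `{name}.py` under `src/capturers/`, subclassing `BaseCapturer`.
2. Implement `fetch()`, `get_last_synced()`, and `update_registry()`.
3. Add the matching settings to the `capturers` block of `_data/config.yaml`.
4. Add it to the Capturer initialization list in `run.py`.

No changes are needed to `Tagger`, `Indexer`, the SQLite schema, or any other module.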
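Steps 1 and 2 might look like the sketch below. `ListCapturer` is a purely hypothetical capturer that reads from an in-memory list and keeps its registry in memory; `RawNote` and `BaseCapturer` are trimmed stand-ins for the definitions in 6.1, repeated only so the example runs on its own.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RawNote:                          # trimmed stand-in for the dataclass in 6.1
    source: str
    source_id: str
    content: str
    created_at: str
    source_url: Optional[str] = None
    tags_raw: list[str] = field(default_factory=list)


class BaseCapturer(ABC):                # trimmed stand-in for the base class in 6.1
    @abstractmethod
    def fetch(self) -> list[RawNote]: ...

    @abstractmethod
    def get_last_synced(self) -> Optional[str]: ...

    @abstractmethod
    def update_registry(self, notes: list[RawNote]) -> None: ...


class ListCapturer(BaseCapturer):
    """Hypothetical capturer that reads notes from a list of dicts."""

    def __init__(self, items: list[dict]):
        self._items = items
        self._captured_ids: set[str] = set()
        self._cursor: Optional[str] = None

    def fetch(self) -> list[RawNote]:
        # Return only items whose ID has not been recorded yet.
        return [
            RawNote(source="list", source_id=item["id"],
                    content=item["text"], created_at=item["created_at"])
            for item in self._items
            if item["id"] not in self._captured_ids
        ]

    def get_last_synced(self) -> Optional[str]:
        return self._cursor

    def update_registry(self, notes: list[RawNote]) -> None:
        self._captured_ids.update(note.source_id for note in notes)
        if notes:
            # String comparison is enough here because all timestamps share one offset.
            self._cursor = max(note.created_at for note in notes)
```

A real Capturer would persist its registry to `_data/sources/{name}.json` rather than holding it in memory.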