feat(memory): add knowledge-graph tag reuse queries and capacity constraints

2026-06-07 13:35:32 +08:00
parent f818bd59f5
commit fb1c530358
1 changed files with 190 additions and 7 deletions
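@@ -46,6 +46,7 @@ AG Core has completed Phase 0 (LLM call cycle), Phase 1 (prompt engineering)
 | F10 | Message-entry-level eviction | P1 | When ConversationMemory reaches its limit, delete the oldest messages rather than only compacting content |
 | F11 | Recall-value-based eviction | P2 | Compute a "memory value" from recall frequency, match score, and recency, then evict the lowest-value memories |
 | F12 | Recall statistics | P2 | MemoryStore records every recall event (recall_count + score) for use by the eviction policy |
+| F13 | Tag reuse query | P1 | KnowledgeGraph exposes a find_tags() interface so callers can reuse existing tags when creating entities, avoiding a proliferation of synonymous tags |

 ### 2.2 Non-functional requirements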

@@ -173,21 +174,193 @@ pub struct InMemoryKnowledgeStore {
 ```rust
 #[async_trait]
 pub trait KnowledgeGraph: Send + Sync {
+    // Entity management
    async fn add_entity(&self, entity: GraphEntity) -> Result<(), MemoryError>;
    async fn get_entity(&self, id: &str) -> Result<Option<GraphEntity>, MemoryError>;
    async fn remove_entity(&self, id: &str) -> Result<(), MemoryError>;
+
+    // Relation management
    async fn add_relation(&self, relation: GraphRelation) -> Result<(), MemoryError>;
    async fn remove_relation(&self, source_id: &str, target_id: &str, relation_type: &str) -> Result<(), MemoryError>;
    async fn get_related(&self, entity_id: &str, depth: usize) -> Result<Vec<ScoredEntity>, MemoryError>;
+
+    // Retrieval
    async fn find_by_keywords(&self, keywords: &[String]) -> Result<Vec<GraphEntity>, MemoryError>;
+
+    // Tag management
+    async fn find_tags(&self, prefix: &str) -> Result<Vec<String>, MemoryError>;
+    async fn entity_count_by_tag(&self, tag: &str) -> Result<usize, MemoryError>;
+    async fn set_entity_tags(&self, entity_id: &str, tags: Vec<String>) -> Result<(), MemoryError>;
+    fn tag_constraints(&self) -> TagConstraints;
 }
 ```

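- `find_by_keywords()`: matches keywords against an entity's `name` (exact/prefix match) and `tags` (exact match); it does not fuzzy-search all fields
+- `find_by_keywords()`: matches keywords against an entity's `name` (prefix match) and `tags` (exact match); it does not fuzzy-search all fields
 - `get_related()`: after a matching entity is found, traverses related entities up to `depth` and also returns the connecting relations
+- `find_tags()`: prefix-matches existing tags so the Phase 4 Agent can make tag-reuse decisions
+- `entity_count_by_tag()`: returns the number of entities associated with a tag, to gauge how general the tag is
+- `set_entity_tags()`: replaces all of an entity's tags; the caller passes them in descending order of relevance and the implementation truncates the list to `max_tags_per_entity`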

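- `get_related()` traverses neighbor nodes to the specified depth
- `search_entities()` searches by entity name/type
+#### Tag conventions and generation strategy
+
+Tags are the retrieval bridge between `KnowledgeStore` (content) and `KnowledgeGraph` (entities). Both retrieval channels share the same set of keywords and cross-discover results through tag matching.
+
+##### Tag reuse principle
+
+Tags should not grow unchecked. **Reuse an existing tag whenever possible, and create a new tag only when no suitable one exists.** This guarantees that:
+
+- The same concept always uses the same tag (`"state-graph"` never coexists with a variant such as `"state-graph-framework"`)
+- Entities that share tags naturally form conceptual links, improving the connectivity of graph retrieval
+- The tag space converges, so retrieval efficiency does not degrade as the number of entities grows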
+
+##### Tag registration and reuse flow
+
+KnowledgeGraph provides tag query interfaces so that the caller (the Phase 4 Agent) can make reuse decisions:
+
+```rust
+#[async_trait]
+pub trait KnowledgeGraph: Send + Sync {
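+    // ... entity/relation management methods ...
+
+    /// Query existing tags (prefix match) to support tag-reuse decisions
+    async fn find_tags(&self, prefix: &str) -> Result<Vec<String>, MemoryError>;
+
+    /// Query the number of entities associated with a tag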
+    async fn entity_count_by_tag(&self, tag: &str) -> Result<usize, MemoryError>;
+}
+```
+
+During Ingest, the Phase 4 Agent follows this flow to create entity tags:
+
+```
+LLM extracts candidate tags → ["LangGraph", "StateGraph", "state-machine", "graph-framework"]
+                          │
+    ┌─────────────────────┼──────────────────────────────────┐
+    │ for each candidate: │                                  │
+    │                     │                                  │
+    │ graph.find_tags(candidate.to_lowercase())              │
+    │   │                                                    │
+    │   ├─ existing tag found → reuse the existing string    │
+    │   │   "state-machine" → already registered             │
+    │   │   use "state-machine" instead of a new tag         │
+    │   │                                                    │
+    │   └─ no match → register a new tag                     │
+    │       "graph-framework" → brand-new tag, register it   │
+    │                                                        │
+    └────────────────────────────────────────────────────────┘
+
+Final entity.tags = ["langgraph", "state-graph", "state-machine", "graph-framework"]
+                                                  ↑ reused an existing tag
+```
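
The reuse decision in the flow above can be sketched as follows. This is a minimal synchronous illustration, not the project's implementation: `TagRegistry`, `Resolution`, and `resolve` are hypothetical names, the real `find_tags` is async and returns a `Result`, and treating only an exact (case-insensitive) hit as "reuse" is an assumption, since the document leaves the handling of near-miss prefix hits to the Phase 4 Agent.

```rust
use std::collections::BTreeSet;

/// Outcome of resolving one candidate tag (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Resolution {
    Reused(String),
    Registered(String),
}

/// Hypothetical in-memory tag registry standing in for the tag side of `KnowledgeGraph`.
#[derive(Default)]
pub struct TagRegistry {
    tags: BTreeSet<String>,
}

impl TagRegistry {
    /// Synchronous stand-in for `find_tags`: every registered tag starting with `prefix`.
    pub fn find_tags(&self, prefix: &str) -> Vec<String> {
        self.tags
            .iter()
            .filter(|t| t.starts_with(prefix))
            .cloned()
            .collect()
    }

    /// Lowercase the candidate, reuse an identical existing tag, otherwise register it.
    pub fn resolve(&mut self, candidate: &str) -> Resolution {
        let normalized = candidate.trim().to_lowercase();
        let existing = self
            .find_tags(&normalized)
            .into_iter()
            .find(|t| *t == normalized);
        match existing {
            Some(tag) => Resolution::Reused(tag),
            None => {
                self.tags.insert(normalized.clone());
                Resolution::Registered(normalized)
            }
        }
    }
}
```

Returning an explicit `Resolution` makes it easy for the caller to log how often candidates were reused versus newly registered, which is one way to watch whether the tag space is converging.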
+
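+##### Tag capacity and refinement
+
+Each entity has an upper limit on its number of tags, so the list cannot grow without bound. When the limit is exceeded, the most relevant tags are kept.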
+
+```rust
+pub struct GraphEntity {
+    pub id: String,
+    pub name: String,
+    pub entity_type: String,
+    pub description: String,
+    pub tags: Vec<String>,          // retrieval tags, sorted by relevance in descending order (lower index = more important)
+}
+
+pub struct TagConstraints {
+    pub max_tags_per_entity: usize,  // maximum number of tags per entity, default 8
+}
+```
+
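+KnowledgeGraph provides an interface for replacing an entity's tag set. The caller (the Phase 4 Agent) sorts the tags by relevance before writing, and the implementation truncates the list automatically: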
+
+```rust
+#[async_trait]
+pub trait KnowledgeGraph: Send + Sync {
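+    // ... other methods ...
+
+    /// Replace all of an entity's tags.
+    /// The caller must ensure `tags` is already sorted by relevance in descending order.
+    /// The implementation truncates the list to max_tags_per_entity.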
+    async fn set_entity_tags(&self, entity_id: &str, tags: Vec<String>) -> Result<(), MemoryError>;
+
+    /// Get the tag capacity constraints
+    fn tag_constraints(&self) -> TagConstraints;
+}
+```
+
+Tag refinement flow (orchestrated by the Phase 4 Agent):
+
+```
+The LLM extracts candidate tags from the content (possibly 15-20)
+  │
+  ├─ 1. Score each tag's relevance to the entity
+  │     → "langgraph": 0.95 | "graph-framework": 0.3 | "computer": 0.1
+  │
+  ├─ 2. Sort by relevance in descending order
+  │     → ["langgraph", "state-graph", "state-machine", "graph", ...]
+  │
+  ├─ 3. Reuse existing tags (find_tags)
+  │     → replace synonymous candidates with existing tags
+  │
+  ├─ 4. Keep the top 8 (bounded by max_tags_per_entity)
+  │     → ["langgraph", "state-graph", "state-machine", "graph",
+  │        "agent", "workflow", "llm", "orchestration"]
+  │
+  └─ 5. set_entity_tags(entity_id, top_8)
+        → KnowledgeGraph truncates internally to max_tags_per_entity
+```
+
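+##### Tag rules
+
+| Rule | Description | Example |
+|------|------|------|
+| Atomicity | Each tag is a single standalone keyword with no spaces or punctuation other than hyphens | `"state-graph"` ✅ / `"State Graph framework"` ❌ |
+| Lowercase | Stored in lowercase; matching is case-insensitive | `"langgraph"` |
+| Singular preferred | Prefer the singular form | `"agent"` rather than `"agents"` |
+| Domain-scoped | 3-8 tags per entity, focused on retrievability rather than description | `["langgraph", "state-graph", "graph", "state-machine"]` |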
+
+**Matching behavior (find_by_keywords):**
+
+| Field | Match type | Notes |
+|------|---------|------|
+| `name` | **Prefix match** | `"lang"` matches `"LangGraph"` and `"LangChain"`; case-insensitive |
+| `tags` | **Exact match** | `"state-graph"` matches only `"state-graph"`, not `"state"` |
+| `entity_type` | Exact match (exact value) | `"framework"` matches `"framework"` |
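
The `name` and `tags` rows of the table can be illustrated with a small predicate. This is a sketch under stated assumptions rather than the actual `find_by_keywords()` body: `Entity` and `matches_keywords` are hypothetical names, `entity_type` matching is left out, and skipping empty keywords is an added guard that the document does not specify.

```rust
/// Minimal entity shape with only the fields used for keyword matching (hypothetical).
pub struct Entity {
    pub name: String,
    pub tags: Vec<String>,
}

/// Returns true when any keyword prefix-matches `name` or exactly matches a tag.
/// Both comparisons ignore case.
pub fn matches_keywords(entity: &Entity, keywords: &[&str]) -> bool {
    let name = entity.name.to_lowercase();
    keywords.iter().any(|kw| {
        let kw = kw.to_lowercase();
        // Assumption: an empty keyword would prefix-match every name, so skip it.
        if kw.is_empty() {
            return false;
        }
        name.starts_with(&kw) || entity.tags.iter().any(|t| t.to_lowercase() == kw)
    })
}
```

Keeping the tag comparison exact is what lets `"state"` miss an entity tagged `"state-graph"`, while the looser prefix rule on `name` still lets `"lang"` find `"LangGraph"`.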
+
+**How tags are generated (at the Phase 4 Agent level):**
+
+```
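+Ingest new content
+  │
+  ├─ LLM extraction (primary method)
+  │   → analyze the content → extract 3-8 key concepts as tags
+  │   → write to both KnowledgePage.tags and GraphEntity.tags
+  │
+  ├─ Rule-based extraction (auxiliary method)
+  │   → split entity.name on CamelCase / kebab-case boundaries
+  │   → "StateGraph" → ["state", "graph", "state-graph"]
+  │
+  └─ Tag inheritance (association sync)
+      → a KnowledgePage's tags are synced automatically to its associated GraphEntity
+      → when a page's tags change, the associated entity's tags are updated to match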
+```
+
+**The role of tags in retrieval:**
+
+```
+Search keywords: ["langgraph", "state-graph"]
+  │
+  ├─ KnowledgeStore matching: page title (prefix) / tags (exact) / summary (contains)
+  │   → hits KnowledgePage { tags: ["langgraph", "graph", "state-machine"] } ← exact tag match
+  │   → hits KnowledgePage { title: "LangGraph StateGraph Guide" } ← title prefix match
+  │
+  └─ KnowledgeGraph matching: entity name (prefix) / tags (exact)
+      → hits GraphEntity { name: "LangGraph", tags: ["langgraph", "agent-framework"] } ← both match
+      → hits GraphEntity { name: "StateGraph", tags: ["langgraph", "state-graph"] } ← exact tag match
+      → get_related("StateGraph") → traverses related entities such as "StateMachine" and "Graph"
+```
+
+`tags` serve as **precise retrieval anchors**, while `name` is a **loose retrieval entry point**; together they give recall both breadth and precision.

 #### InMemoryGraph: default knowledge graph implementation

@@ -383,6 +556,7 @@ pub struct GraphEntity {
    pub name: String,
    pub entity_type: String,        // "person" | "concept" | "project" | ...
    pub description: String,
+    pub tags: Vec<String>,          // retrieval tags (all lowercase, atomic words, sorted by relevance in descending order)
 }

 pub struct GraphRelation {
@@ -841,16 +1015,21 @@ async fn maybe_evict(&self) {
 - Unit tests: page CRUD, index consistency, search
 - Acceptance: `cargo build` + `cargo test` pass

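-### Step 4: KnowledgeGraph
+### Step 4: KnowledgeGraph (with tag matching)

 **File**: `src/memory/graph.rs`

- Define the `GraphEntity` and `GraphRelation` types
+- Define the `GraphEntity` (with `tags: Vec<String>`) and `GraphRelation` types
 - Define the `KnowledgeGraph` trait
 - Implement `InMemoryGraph`
  - BFS/DFS graph traversal
-  - search_entities matches by name/type
- Unit tests: entity CRUD, adding/removing relations, graph traversal, search
+  - `find_by_keywords()`: prefix-matches keywords against `name` and exact-matches them against `tags`, case-insensitively
+  - `find_tags(prefix)`: tag reuse query that prefix-matches existing tags
+  - `set_entity_tags()`: replaces all of an entity's tags, keeping the first `max_tags_per_entity` in the order given
+  - `tag_constraints()`: returns `TagConstraints { max_tags_per_entity: 8 }`
+  - Maintains an internal `tag_index: HashMap<String, HashSet<String>>` for fast tag-to-entity lookup
+- The tag rules are stated explicitly in the source as constants or doc comments (lowercase, atomic, reuse existing tags first, at most 8 per entity)
+- Unit tests: entity CRUD, adding/removing relations, tag matching (case-insensitivity, prefix vs. exact), tag reuse queries, tag capacity truncation, graph traversal
 - Acceptance: `cargo build` + `cargo test` pass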
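
The `tag_index` bookkeeping listed in Step 4 could look roughly like the sketch below. It is a simplified, synchronous illustration and not the planned `InMemoryGraph`: `TagIndexSketch` and `tags_of` are hypothetical names, entities are reduced to their tag lists, and lowercasing plus de-duplication inside `set_entity_tags` are assumptions layered on top of the truncation behavior the plan describes.

```rust
use std::collections::{HashMap, HashSet};

pub struct TagConstraints {
    pub max_tags_per_entity: usize,
}

/// Hypothetical tag-only slice of an in-memory graph.
pub struct TagIndexSketch {
    constraints: TagConstraints,
    entity_tags: HashMap<String, Vec<String>>,   // entity id -> ordered tags
    tag_index: HashMap<String, HashSet<String>>, // tag -> ids of entities carrying it
}

impl TagIndexSketch {
    pub fn new(max_tags_per_entity: usize) -> Self {
        Self {
            constraints: TagConstraints { max_tags_per_entity },
            entity_tags: HashMap::new(),
            tag_index: HashMap::new(),
        }
    }

    /// Replace all tags of an entity, keeping the first `max_tags_per_entity` in the given order.
    pub fn set_entity_tags(&mut self, entity_id: &str, tags: Vec<String>) {
        // Remove the entity from the index entries of its previous tags.
        if let Some(old_tags) = self.entity_tags.remove(entity_id) {
            for tag in old_tags {
                let now_empty = match self.tag_index.get_mut(&tag) {
                    Some(ids) => {
                        ids.remove(entity_id);
                        ids.is_empty()
                    }
                    None => false,
                };
                if now_empty {
                    self.tag_index.remove(&tag);
                }
            }
        }
        // Assumption: lowercase and de-duplicate while preserving the caller's order.
        let mut kept: Vec<String> = Vec::new();
        for tag in tags {
            let tag = tag.to_lowercase();
            if !kept.contains(&tag) {
                kept.push(tag);
            }
        }
        kept.truncate(self.constraints.max_tags_per_entity);
        for tag in &kept {
            self.tag_index
                .entry(tag.clone())
                .or_default()
                .insert(entity_id.to_string());
        }
        self.entity_tags.insert(entity_id.to_string(), kept);
    }

    /// Number of entities currently carrying `tag` (case-insensitive).
    pub fn entity_count_by_tag(&self, tag: &str) -> usize {
        self.tag_index
            .get(&tag.to_lowercase())
            .map_or(0, |ids| ids.len())
    }

    /// Ordered tags of an entity, or an empty list if the entity is unknown.
    pub fn tags_of(&self, entity_id: &str) -> Vec<String> {
        self.entity_tags.get(entity_id).cloned().unwrap_or_default()
    }
}
```

Cleaning up the old index entries before writing the new ones keeps `entity_count_by_tag` accurate after a replacement, which matters because the count is meant to signal how general a tag is.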

 ### Step 5: MemoryRetriever (keyword extraction + scoring) + module integration
@@ -900,6 +1079,10 @@ async fn maybe_evict(&self) {
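 - [ ] `ConversationMemory` correctly reuses the compaction logic from `llm::compact`
 - [ ] `KnowledgeStore` trait + `InMemoryKnowledgeStore` support page CRUD and index maintenance
 - [ ] `KnowledgeGraph` trait + `InMemoryGraph` support entity CRUD, relation management, and graph traversal
+- [ ] `KnowledgeGraph::find_tags(prefix)` supports prefix-match queries over existing tags
+- [ ] `KnowledgeGraph::set_entity_tags()` replaces all tags and truncates to `max_tags_per_entity`
+- [ ] No entity has more tags than `TagConstraints::max_tags_per_entity` (default 8)
+- [ ] The tag reuse principle is stated explicitly in the design document and in the source-code conventions so that downstream implementations can follow it
 - [ ] `MemoryRetriever` combines KnowledgeStore + KnowledgeGraph to return unified retrieval results
 - [ ] No embedding-related dependencies (no fastembed, pgvector, qdrant, or similar crates)
 - [ ] Module layout follows project conventions: `memory.rs` + `memory/` directory + `pub use` re-exports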