Pokemon Qwen Fine-Tuning Series (1): SFT Data Engineering in Practice, from Crawling to Trainable JSONL

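This post covers data only, not training hyperparameters. The core question: how do you reliably turn public web pages and API data into high-quality samples that can be used for SFT?

Series navigation: the next post is Pokemon Qwen Fine-Tuning Series (2): SFT Training in Practice.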

TL;DR

Stage	Input	Output	Key implementation
Crawl	`configs/sources.yaml`	`data/raw/<source>/*.{json,html}`	`crawl` + `BaseCrawler.run()`
Parse	Raw HTML/JSON	`data/processed/articles/<source>/*.json`	`parse_*` -> `Article`
Clean	Raw text fields	Clean `summary/main_content`	`clean_paragraph/clean_flavor`
SFT conversion	`Article`	Alpaca/Chat JSONL	`datasets/sft.py`
Quality filtering and dedup	Raw sample set	Filtered sample set	`passes_quality` + `exact_dedup` + `near_dedup`
Split and statistics	Final sample set	train/val/test + `DATASET_CARD.md`	`split.py`

1) Crawling: source configuration and compliant fetching

The crawl entry point is a single command:

pokemon-data crawl -c configs/sources.yaml

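configs/sources.yaml defines three kinds of sources:

source	Type	Current use	Notes
`pokeapi`	Structured JSON API	Primary data source	Covers pokemon/species/type/ability/move
`wikipedia`	HTML pages	Supplementary natural-language descriptions	Small in volume, but the text reads more naturally
`fandom`	HTML pages	Currently produces nothing	Disallowed by `robots.txt`; skipped by policy

What matters is not how much you can fetch, but whether the crawl is re-runnable, traceable, and compliant.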

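1.1 How the crawl layer stays re-runnable

BaseCrawler.run(resume=True) first checks whether the cache file (_cached_path) exists and skips the network request if it does.
So when you widen the ID range and re-run, IDs that were already fetched come from the cache and the upstream site is not hit again.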
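
The resume check described above can be sketched as follows. This is a minimal illustration, not the project's actual `BaseCrawler` code: `fetch_with_cache` and its parameters are hypothetical names, and the temp-file-then-rename step is an added assumption to avoid leaving half-written cache files.

```python
from pathlib import Path
from typing import Callable


def fetch_with_cache(cache_path: Path, fetch: Callable[[], bytes], resume: bool = True) -> bytes:
    """Return cached bytes when present; otherwise call fetch() and persist the result."""
    if resume and cache_path.exists():
        # Cache hit: no network request is made.
        return cache_path.read_bytes()

    payload = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file first, then rename, so an interrupted run
    # never leaves a truncated file that a later resume would trust.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    tmp_path.write_bytes(payload)
    tmp_path.replace(cache_path)
    return payload
```

Passing `resume=False` forces a refetch and overwrites the cached file, which is useful when an upstream record is known to have changed.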

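1.2 How the crawl layer stays compliant

utils/http.py and utils/robots.py enforce three constraints:

Check robots.txt before fetching; if the URL is disallowed, raise RobotsDisallowed.
Enforce a minimum request interval per host (MIN_REQUEST_INTERVAL, 1 second by default).
Retry failed requests only on transient network errors and 5xx responses; never blindly replay 4xx.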
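
A rough sketch of the three constraints, using only the standard library. `RobotsDisallowed` and `MIN_REQUEST_INTERVAL` are names from the text above; `check_robots`, `HostThrottle`, and `should_retry` are hypothetical helpers written for illustration, and the real utils/http.py and utils/robots.py may be organized quite differently.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

MIN_REQUEST_INTERVAL = 1.0  # seconds, the default quoted above


class RobotsDisallowed(Exception):
    """Raised when robots.txt forbids fetching a URL."""


def check_robots(url, robots_lines, user_agent="*"):
    """Raise RobotsDisallowed if the already-downloaded robots.txt lines forbid url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(user_agent, url):
        raise RobotsDisallowed(url)


class HostThrottle:
    """Enforce a minimum gap between consecutive requests to the same host."""

    def __init__(self, min_interval=MIN_REQUEST_INTERVAL, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url):
        host = urlsplit(url).netloc
        last = self._last.get(host)
        if last is not None:
            remaining = self._min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last[host] = self._clock()


def should_retry(status, network_error=False):
    """Retry on network errors and 5xx responses only; 4xx is never replayed."""
    return network_error or (status is not None and 500 <= status < 600)
```

The clock and sleep functions are injectable so the throttle can be tested without real waiting.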

2) Parsing and cleaning: turning raw files into `Article`

Parse entry point:

pokemon-data parse

This stage normalizes the raw files under data/raw/<source>/ into Article:

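from typing import Any

from pydantic import BaseModel

# Infobox is another model defined in the project; its definition is not shown here.
class Article(BaseModel):
    source: str              # e.g. "pokeapi"
    identifier: str          # e.g. "pokemon_1"
    url: str
    title: str
    summary: str
    main_content: str
    infobox: Infobox | None = None
    categories: list[str]
    extras: dict[str, Any]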

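2.1 Why `Article` is the key intermediate layer

Article hides the differences between sources (JSON API vs. HTML pages) behind one stable interface.
The SFT templates downstream work only against Article and never need to know where the data originally came from.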

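2.2 What cleaning does

The core chain in cleaning/text.py:

Function	Purpose
`clean_paragraph`	NFKC normalization, strips citation markers, strips `[edit]`, collapses whitespace
`strip_control`	Removes control characters
`dehyphenate`	Repairs words broken across lines (`inter-\nnational -> international`)
`reflow_flavor_text`	Repairs soft line breaks in flavor text
`clean_flavor`	`reflow_flavor_text` + `clean_paragraph`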
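
To make the table concrete, here is one plausible shape for these functions. The function names come from the table; the regular expressions, the exact order of steps inside `clean_paragraph`, and the choice to treat `\n`, `\r`, and `\f` as soft breaks in flavor text are assumptions made for this sketch, not the contents of cleaning/text.py.

```python
import re
import unicodedata

_CITATION = re.compile(r"\[\d+\]")                  # footnote markers such as [12]
_EDIT = re.compile(r"\[edit\]", re.IGNORECASE)      # wiki section edit links
_CONTROL = re.compile(r"[\x00-\x08\x0e-\x1f\x7f]")  # control chars, whitespace excluded
_SOFT_BREAK = re.compile(r"[\n\r\f]+")              # assumed soft breaks in flavor text


def strip_control(text: str) -> str:
    """Remove non-whitespace control characters."""
    return _CONTROL.sub("", text)


def dehyphenate(text: str) -> str:
    """Join a word that was hyphenated across a line break."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def clean_paragraph(text: str) -> str:
    """NFKC-normalize, drop citation and [edit] markers, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = _CITATION.sub("", text)
    text = _EDIT.sub("", text)
    text = strip_control(text)
    return re.sub(r"\s+", " ", text).strip()


def reflow_flavor_text(text: str) -> str:
    """Replace each run of soft line breaks with a single space."""
    return _SOFT_BREAK.sub(" ", text)


def clean_flavor(text: str) -> str:
    """reflow_flavor_text followed by clean_paragraph, as in the table."""
    return clean_paragraph(reflow_flavor_text(text))
```

Whitespace-like control characters are deliberately left for the whitespace-collapsing step, so a form feed between two sentences becomes a space instead of gluing the sentences together.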

3) SFT conversion: from `Article` to Alpaca / Chat

Conversion entry point:

pokemon-data build-sft \
  --out data/datasets/sft/pokemon_sft.jsonl \
  --format both \
  --split --dedup

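datasets/sft.py expands each Article into several kinds of training samples:

kind	Goal
`factoid`	Single-field Q&A; reinforces exact recall
`summary`	Summarization
`long_form`	Organizing long answers
`list`	List-style output
`structured`	Structured (JSON-style) output
`reasoning`	Light inference (e.g. type effectiveness)
`multi_turn`	Multi-turn conversation style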

3.1 Two output formats

Format	Structure	Typical use
Alpaca	`{instruction, input, output, source, identifier, kind}`	General-purpose SFT trainers
Chat	`{messages:[{role,content}...], source, identifier, kind}`	ChatML/ShareGPT pipelines

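3.1.1 Field reference

Alpaca fields

Field	Meaning	Example
`instruction`	The user's task instruction; defines what the model should do	`Give a brief Pokedex-style description of Flare-Boost.`
`input`	Optional extra context; an empty string when there is none	`""`
`output`	The expected answer (the supervision signal)	`Increases Special Attack to 1.5× when burned.`
`source`	Source tag, for per-source statistics and filtering	`pokeapi`
`identifier`	Unique ID of the source record, for traceability and dedup	`ability_138`
`kind`	Sample type label, for mixing ratios and training on slices	`summary`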

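Chat fields

Field	Meaning	Example
`messages`	Ordered array of messages that together form the conversation context	`[{"role":"user","content":"Tell me about Bulbasaur."}, ...]`
`role`	Message role (usually one of `system/user/assistant`)	`assistant`
`content`	Text of the current message	`grass/poison`
`source`	Source tag	`pokeapi`
`identifier`	Unique ID of the source record	`pokemon_1`
`kind`	Sample type label	`multi_turn`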
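
The two schemas carry the same information for single-turn samples, so one can be derived from the other. The sketch below shows one way to do that; `alpaca_to_chat` is a hypothetical helper, and the rule for folding a non-empty `input` into the user message is an assumption, since the source does not say how datasets/sft.py builds its Chat records.

```python
def alpaca_to_chat(record, system_prompt=None):
    """Convert a single-turn Alpaca record into the Chat schema."""
    user_text = record["instruction"]
    if record.get("input"):
        # Assumed convention: append the optional context below the instruction.
        user_text = f"{user_text}\n\n{record['input']}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": record["output"]})

    return {
        "messages": messages,
        "source": record["source"],
        "identifier": record["identifier"],
        "kind": record["kind"],
    }
```

The metadata fields (`source`, `identifier`, `kind`) are copied unchanged so that filtering, dedup, and splitting behave the same on both formats.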

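3.2 Why templates are "deterministic rewrites"

Templates are not sampled at random; _pick(options, identifier, intent) makes a deterministic choice.
The same identifier gets the same template on every re-run, which keeps the phrasing varied across identifiers while keeping builds reproducible.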
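
One common way to get this behavior is to hash the identifier and intent and use the digest as an index. The sketch below assumes that approach; the function is named `pick` (without the underscore) to make clear it is an illustration and not the project's `_pick`, whose internals are not shown in this post.

```python
import hashlib


def pick(options, identifier, intent):
    """Choose one option deterministically from (identifier, intent)."""
    if not options:
        raise ValueError("options must not be empty")
    digest = hashlib.sha256(f"{identifier}:{intent}".encode("utf-8")).digest()
    # The first 8 bytes give a large integer; modulo maps it onto the option list.
    index = int.from_bytes(digest[:8], "big") % len(options)
    return options[index]


templates = [
    "What type is {name}?",
    "Which type does {name} belong to?",
    "Tell me the type of {name}.",
]
question = pick(templates, "pokemon_1", "type").format(name="Bulbasaur")
```

Python's built-in `hash()` would not work here, because string hashing is salted per process and changes between runs unless `PYTHONHASHSEED` is fixed.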

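4) Quality filtering and dedup: avoiding "dirty supervision"

datasets/quality.py has three gates.

4.1 Quality filtering (`passes_quality`)

Rule	Default threshold
Instruction length	`10 <= len(instruction) <= 500`
Output length	Min 2 for short-answer kinds, 20 for all others; max 4000
Repetition ratio	`max_ngram_repeat_ratio <= 0.25`
Empty check	instruction/output must not be empty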
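
The rules in the table could be implemented roughly like this. The thresholds are the ones listed above; everything else is an assumption for the sketch: which kinds count as short-answer (`SHORT_KINDS`), the use of word 3-grams, and defining the repetition ratio as the share of n-grams that repeat an earlier one.

```python
SHORT_KINDS = {"factoid"}  # assumption: which kinds are treated as short-answer
MAX_NGRAM_REPEAT_RATIO = 0.25


def ngram_repeat_ratio(text, n=3):
    """Share of word n-grams that duplicate an earlier n-gram (0.0 = no repeats)."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def passes_quality(sample):
    """Apply the empty, length, and repetition checks from the table."""
    instruction = sample.get("instruction", "").strip()
    output = sample.get("output", "").strip()
    if not instruction or not output:
        return False
    if not 10 <= len(instruction) <= 500:
        return False
    min_output = 2 if sample.get("kind") in SHORT_KINDS else 20
    if not min_output <= len(output) <= 4000:
        return False
    return ngram_repeat_ratio(output) <= MAX_NGRAM_REPEAT_RATIO
```

The kind-dependent minimum is what lets a one-word `factoid` answer such as a Pokemon name through while still rejecting a one-word `summary`.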

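4.2 Exact dedup (`exact_dedup`)

instruction, input, and output are normalized and concatenated, and the sha256 of the result is the unique key.
When samples are fully identical, only one is kept.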
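
A minimal sketch of that keying scheme. The source does not say what "normalized" means, so lowercasing plus whitespace collapsing is assumed here, and the unit-separator character between fields is an added safeguard so that text cannot shift across field boundaries and collide. `dedup_key` is a hypothetical helper name.

```python
import hashlib


def _normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def dedup_key(sample):
    """sha256 over the normalized instruction, input, and output."""
    parts = [_normalize(sample.get(field, "")) for field in ("instruction", "input", "output")]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def exact_dedup(samples):
    """Keep the first sample for each key, preserving input order."""
    seen = set()
    kept = []
    for sample in samples:
        key = dedup_key(sample)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

Keeping the first occurrence and preserving order means the result depends only on the input order, which fits the reproducibility goal of the rest of the pipeline.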

4.3 Near-duplicate dedup (`near_dedup`)

Setting	Value
shingles	Character 5-grams
MinHash count	64
LSH band size	8
Near-duplicate threshold	Jaccard `>= 0.85`
Minimum length to apply	Output length `>= 40`
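
The parameters in the table map onto a standard MinHash + LSH pipeline. The sketch below is one possible reading, with several assumptions stated up front: "band size 8" is taken to mean 8 signature rows per band (so 64 / 8 = 8 bands), shingles are taken from the lowercased `output` field, hashing uses seeded BLAKE2b, and LSH candidates are confirmed with an exact Jaccard computation on the shingle sets. The project's `near_dedup` may differ on any of these points.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64
BAND_SIZE = 8        # assumed: rows per band, giving 8 bands
THRESHOLD = 0.85
MIN_OUTPUT_LEN = 40


def shingles(text, n=5):
    """Set of character n-grams over whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(seed, shingle):
    data = seed.to_bytes(2, "big") + shingle.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set):
    """Signature: the minimum hash of the set under each of NUM_HASHES seeds."""
    return tuple(min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def near_dedup(samples):
    """Drop samples whose output is a near-duplicate of an earlier kept output."""
    kept = []
    kept_shingles = {}           # index in kept -> shingle set
    buckets = defaultdict(list)  # (band start, band values) -> indices in kept
    for sample in samples:
        text = sample["output"]
        if len(text) < MIN_OUTPUT_LEN:
            kept.append(sample)  # too short to compare reliably
            continue
        sh = shingles(text)
        sig = minhash(sh)
        keys = [(b, sig[b:b + BAND_SIZE]) for b in range(0, NUM_HASHES, BAND_SIZE)]
        candidates = {i for key in keys for i in buckets.get(key, ())}
        if any(jaccard(sh, kept_shingles[i]) >= THRESHOLD for i in candidates):
            continue
        index = len(kept)
        kept.append(sample)
        kept_shingles[index] = sh
        for key in keys:
            buckets[key].append(index)
    return kept
```

The length floor matters for this dataset: short answers such as a type name or a Pokemon name legitimately repeat across many samples, and near-dedup on them would delete valid `factoid` data.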

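5) Deterministic splitting and the dataset card

datasets/split.py does not split by random shuffle; it routes each sample by hash.
The same sample lands in the same split across builds, which makes runs comparable and reproducible.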

r = \frac{\left(\text{sha256(key)}[:8] \bmod 10{,}000{,}000\right)}{10{,}000{,}000}

Routing rules:

Condition	split
`r < 0.95`	train
`0.95 <= r < 0.975`	val
`r >= 0.975`	test
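
The formula and routing table translate into a few lines of code. Two details are assumptions here because the post does not spell them out: `[:8]` is read as the first 8 hex characters of the digest, and the routing key is left as a plain string argument (it might be the identifier, or a hash of the sample content). `split_for` is a hypothetical name.

```python
import hashlib


def split_for(key):
    """Route a key to train/val/test using r = (sha256(key)[:8] mod 1e7) / 1e7."""
    prefix = hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]
    r = (int(prefix, 16) % 10_000_000) / 10_000_000
    if r < 0.95:
        return "train"
    if r < 0.975:
        return "val"
    return "test"
```

Because the split depends only on the key, adding new samples to the dataset never moves existing samples between splits, which a seeded shuffle cannot guarantee.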

When the build finishes, it writes:

File	Purpose
`pokemon_sft.train/val/test.jsonl`	Alpaca format
`pokemon_sft.chat.train/val/test.jsonl`	Chat format
`DATASET_CARD.md`	Source distribution, kind distribution, estimated token counts

6) One complete, reproducible command chain

# 1) Crawl
pokemon-data crawl -c configs/sources.yaml

# 2) Parse
pokemon-data parse

# 3) Build SFT data (Alpaca + Chat, with dedup and splitting)
pokemon-data build-sft \
  --out data/datasets/sft/pokemon_sft.jsonl \
  --format both \
  --split \
  --dedup

# 4) Build the high-quality chat subset (optional)
pokemon-data build-hq-chat \
  --out data/datasets/sft/pokemon_sft_hq.chat.jsonl \
  --target-size 10000 \
  --split

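7) Common data-side pitfalls

Symptom	Root cause	Fix
Large sample count but mediocre answer quality	`factoid` share is too high and the phrasing is uniform	Raise the share of `long_form/summary/multi_turn`
Evaluation results swing after a re-run	The split is unstable or data was overwritten	Use hash-based splitting and a fixed input directory
Model output sounds too templated	Too many near-duplicates from the same template family	Widen the template pool and tighten near-dedup
Broken words or odd line breaks in the text	Soft line breaks in flavor text were not repaired	Make sure the `clean_flavor` chain is actually applied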
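
The first pitfall in the table is easy to detect before training by measuring the kind mix. A small sketch, where `kind_shares` is a hypothetical helper that works on any list of samples carrying the `kind` field:

```python
from collections import Counter


def kind_shares(samples):
    """Fraction of samples per kind; an empty input yields an empty dict."""
    counts = Counter(sample["kind"] for sample in samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: count / total for kind, count in counts.items()}
```

Running this on the train split and comparing it against the intended mix makes a `factoid`-heavy build visible immediately, before any training compute is spent.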

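8) Representative samples from the dataset

Below are a few real samples (checked against the project's JSONL), covering different kinds and formats.

8.1 Alpaca: `factoid`

This kind trains exact recall: the user asks a narrow question, and the model must give a short, accurate answer directly.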

{"instruction":"Name a Pokemon that can have the ability 'Technician'.","input":"","output":"Meowth","source":"pokeapi","identifier":"ability_101","kind":"factoid"}

8.2 Alpaca: `summary`

This kind trains compression: explaining a concept clearly in one sentence and cutting verbose output.

{"instruction":"Give a brief Pokedex-style description of Flare-Boost.","input":"","output":"Increases Special Attack to 1.5× when burned.","source":"pokeapi","identifier":"ability_138","kind":"summary"}

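8.3 Alpaca: `reasoning`

This kind trains rule-based inference: instead of reciting a memorized answer, the model makes a one-step logical judgment from the type-effectiveness rules.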

{"instruction":"If I use a Fire-type move against a bug-type Pokemon, is it effective?","input":"","output":"Yes. Fire-type moves are super-effective against bug types, so damage is doubled.","source":"pokeapi","identifier":"type_10","kind":"reasoning"}

8.4 Alpaca: `structured`

This kind trains structured output: the model returns stable JSON with the requested fields so that downstream programs can consume it.

{"instruction":"Return a JSON object describing the move Scratch with fields: name, type, damage_class, power, accuracy, pp.","input":"","output":"{\"name\": \"Scratch\", \"type\": \"normal\", \"damage_class\": \"physical\", \"power\": 40, \"accuracy\": 100, \"pp\": 35}","source":"pokeapi","identifier":"move_10","kind":"structured"}

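8.5 Chat: `multi_turn`

This kind trains multi-turn context retention: follow-up questions must inherit the earlier context instead of being treated as new questions each turn.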

{"messages":[{"role":"system","content":"You are a knowledgeable assistant specialised in the Pokemon franchise. Answer based on canonical Pokedex and game data. Be concise and factual."},{"role":"user","content":"Tell me about Bulbasaur."},{"role":"assistant","content":"Bulbasaur is a grass/poison-type Pokemon."},{"role":"user","content":"What type is it?"},{"role":"assistant","content":"grass/poison"},{"role":"user","content":"What's its strongest base stat?"},{"role":"assistant","content":"Special Attack at 65."}],"source":"pokeapi","identifier":"pokemon_1","kind":"multi_turn"}

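Summary

The key to SFT data engineering is not fetching more data, but turning crawling, cleaning, conversion, quality checks, and splitting into a pipeline that is re-runnable, explainable, and reproducible.
Only when this chain is stable do later training iterations have a reliable foundation.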