Leveraging LLMs for Cloud Incident Extraction

Research Motivation

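The question is not "can an LLM extract?" but "how do we extract reliably, cheaply, and reproducibly?"

Cloud providers publish incident reports that include the service name, region, time, user symptoms, root cause, and similar information. These reports are natural-language text, however, and their formats differ widely across providers, which makes longitudinal analysis hard for researchers and SREs. The significance of this paper is that it turns "public incident text" into a structured data extraction task that can be evaluated, compared, and reused.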

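Challenge 1: No public labeled data

Existing cloud incident studies mostly rely on manual or rule-based extraction. The authors point out that Azure and GCP reports average more than 500 words, so manual processing is costly and hard to reproduce at scale.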

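Challenge 2: No methodological evaluation

LLMs are widely used for information extraction, but there is still no systematic comparison of how to design prompts, choose models, and measure cost and latency on cloud incident reports.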

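Challenge 3: Long-term statistics are blocked

Without stable extraction of the service, time, symptom, and root cause category, it is hard to answer operational questions such as "which services fail most often", "how long do incidents last", and "what is the user impact".

Table 1 · Overview of the cloud incident report datasets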

ID	Name	Period	# Rows	# Labeled	Avg. Words
1	AWS	2016–2022	774	150 (19%)	151
2	AZURE	2019–2024	127	95 (75%)	575
3	GCP	2016–2021	2,186	215 (10%)	533
	TOTAL	2016–2024	3,087	460 (15%)	-

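Table 1 of the original paper. Note the clear differences among the three sources: AWS reports are short while Azure and GCP reports are long; GCP has the most reports but the lowest labeled ratio.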

Mathematical Representation & Modeling

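Map a report $r_i$ to structured JSON, then pick the evaluation function by field type.

Task formalization

Given the text $r_i$ of the $i$-th incident report, the LLM outputs a structured result $\hat{y}_i$ conditioned on the prompt $p_s$ and the system prompt $p_{\mathrm{sys}}$: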

\[ \hat{y}_i = f_{\theta}(p_{\mathrm{sys}}, p_s, r_i), \quad \hat{y}_i \in \mathcal{Y}_{\mathrm{entity}} \times \mathcal{Y}_{\mathrm{class}} \times \mathcal{Y}_{\mathrm{text}} \]

Here $f_{\theta}$ is a candidate LLM and $p_s$ is one of the six prompt strategies. The output must be JSON. The core fields are service name, location, start/end time, timezone, service category, user symptom category, and user symptom, plus root cause category and root cause, which are available only for Azure.
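A minimal sketch of checking one JSON response against this field list. The eight common key names follow the Format block of the prompt template reproduced in this summary; the two Azure-only key names and the helper name `validate_response` are assumptions made for illustration, not the authors' code.

```python
import json
from typing import Optional

# Keys from the prompt's Format block; the two Azure-only key names are assumed.
REQUIRED_KEYS = [
    "service_name", "location", "service_category", "start_time",
    "end_time", "timezone", "user_symptom", "user_symptom_category",
]
AZURE_ONLY_KEYS = ["root_cause_category", "root_cause"]


def validate_response(raw: str, provider: str) -> tuple[Optional[dict], list[str]]:
    """Parse an LLM response and list missing keys; returns (record, problems)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level JSON value is not an object"]
    expected = list(REQUIRED_KEYS)
    if provider.upper() == "AZURE":
        expected += AZURE_ONLY_KEYS
    problems = [f"missing key: {key}" for key in expected if key not in record]
    return record, problems
```

An empty `problems` list means the response can go on to field-level scoring; anything else is a format failure worth counting separately from wrong values.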

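Evaluation functions

Entity fields and single-label classification fields use Exact Match (EM); the multi-label user symptom category uses token-level F1 (TK); the free-text user symptom and root cause use BERTScore (BS).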

\[ \mathrm{EM}(\hat{y}, y)=\mathbf{1}[\mathrm{norm}(\hat{y})=\mathrm{norm}(y)] \]

\[ P=\frac{|T(\hat{y})\cap T(y)|}{|T(\hat{y})|},\quad R=\frac{|T(\hat{y})\cap T(y)|}{|T(y)|},\quad \mathrm{F1}=\frac{2PR}{P+R} \]

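\[ \mathrm{Cost}=\frac{n_{\mathrm{in}}c_{\mathrm{in}}+n_{\mathrm{out}}c_{\mathrm{out}}}{10^6} \]

Here $\mathrm{norm}(\cdot)$ denotes string normalization before comparison, $T(\cdot)$ is the token set of a label string, $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ are the input and output token counts of one call, and $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ are the prices per $10^6$ tokens listed in Table 3, so Cost is in US dollars per call.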
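The EM and token-level F1 formulas above can be sketched in a few lines. The normalization rule (lowercase, collapse whitespace) and the tokenization (split on commas and whitespace) are assumptions, since this summary does not spell out $\mathrm{norm}$ or $T$; BERTScore is left out because it needs a pretrained model.

```python
def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 over token sets, i.e. T(.) in the formula."""
    pred_tokens = set(normalize(pred).replace(",", " ").split())
    gold_tokens = set(normalize(gold).replace(",", " ").split())
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0  # also covers empty predictions or empty gold labels
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For a prediction with two categories where only one is in the gold label, precision is 0.5 and recall is 1, which gives an F1 of 2/3; this is why over-predicting categories is penalized less harshly than under EM.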

Figure 1 · LLM-based extraction workflow

A. Sample Reports: K-means sampling and manual annotation →
B. Prompt Template: system prompt + user prompt →
C. Candidate LLMs: OpenAI / Claude / Gemini →
D. JSON Response: structured extraction result →
E. Evaluation: comparison against labeled results

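Redrawn from Figure 1 of the original paper. The final step, $F$, runs extraction and analysis over the full set of reports once a model has been selected.

Table 2 · Extracted fields, types, and evaluation metrics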

ID	Extracted Fields	Type	AWS	AZURE	GCP	Metric
1	Service Name	entity	✓	✓	✓	EM
2	Location	entity	✓	✓	✓	EM
3	Start Time	entity	✓	✓	✓	EM
4	End Time	entity	✓	✓	✓	EM
5	Timezone	entity	✓	✓	✓	EM
6	Service Category	class	✓	✓	✓	EM
7	Root Cause Category	class	-	✓	-	EM
8	User Symptom Category	multi-class	✓	✓	✓	TK
9	User Symptom	text	✓	✓	✓	BS
10	Root Cause	text	-	✓	-	BS

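The table symbols extracted from the original PDF were corrupted by encoding errors; the table here is reconstructed from the meaning of the body text. The paper stresses that some fields are extracted directly from the report, whereas the category fields are inferred by the LLM.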

Experimental Design

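Reproducing the experiments hinges on four things: data processing, annotation, prompt combinations, and model cost.

Data processing pipeline

Scrape incident reports in HTML form from the AWS, Azure, and GCP status pages.
Clean each provider's data separately, removing None values, duplicates, and invalid records, and unify the result into a mergeable structure.
Use K-means clustering to select representative sample datasets and reduce manual annotation cost.
Three researchers with a computer systems background annotate the samples: Annotators 1 and 2 label independently and reconcile, and Annotator 3 arbitrates when they cannot agree.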
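Two of the pipeline steps above, cleaning and annotation arbitration, can be sketched as follows. The row layout (`provider`, `text`), the whitespace-insensitive duplicate rule, and both function names are assumptions for illustration; the reconciliation discussion between Annotators 1 and 2 is reduced here to a simple equality check.

```python
from typing import Optional


def clean_reports(rows: list[dict]) -> list[dict]:
    """Drop rows with a missing or empty text field and duplicate texts."""
    seen: set[str] = set()
    cleaned = []
    for row in rows:
        text = row.get("text")
        if text is None or not text.strip():
            continue  # None values and empty (invalid) records
        key = " ".join(text.split())  # collapse whitespace before comparing
        if key in seen:
            continue  # duplicate report
        seen.add(key)
        cleaned.append({"provider": row.get("provider"), "text": key})
    return cleaned


def adjudicate(label_1: str, label_2: str, label_3: Optional[str] = None) -> str:
    """Keep the label Annotators 1 and 2 agree on; otherwise defer to Annotator 3."""
    if label_1 == label_2:
        return label_1
    if label_3 is None:
        raise ValueError("annotators disagree and no arbitration label was given")
    return label_3
```

K-means sampling is not shown because the features used for clustering are not stated in this summary.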

Model configuration · Table 3

Alias	Type / Name	Input (USD per 10^6 tokens)	Output (USD per 10^6 tokens)
GPT 4o	S / gpt-4o*	2.50	10.00
GPT 3.5	L / gpt-3.5-turbo*	0.50	1.50
Claude Sonnet 4	S / claude-sonnet-4-20250514	3.00	15.00
Claude 3.5	L / claude-3-5-haiku-20241022	0.80	4.00
Gemini 2.5	S / gemini-2.5-pro*	1.25	10.00
Gemini 2.0	L / gemini-2.0-flash*	0.10	0.40

L denotes lightweight and S denotes state-of-the-art. Models marked with * do not provide a time fingerprint, so the paper reports only the run date.
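As a worked example of the cost formula with the Table 3 prices, the sketch below prices a single call for each model. The token counts (1,000 input, 200 output) are hypothetical and are not measurements from the paper; `call_cost` is an illustrative helper name.

```python
# (input, output) prices in USD per 10^6 tokens, copied from Table 3.
PRICES = {
    "GPT 4o": (2.50, 10.00),
    "GPT 3.5": (0.50, 1.50),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude 3.5": (0.80, 4.00),
    "Gemini 2.5": (1.25, 10.00),
    "Gemini 2.0": (0.10, 0.40),
}


def call_cost(alias: str, n_in: int, n_out: int) -> float:
    """Cost in USD of one call: (n_in * c_in + n_out * c_out) / 10^6."""
    c_in, c_out = PRICES[alias]
    return (n_in * c_in + n_out * c_out) / 1_000_000


# Hypothetical token counts for one report; not measurements from the paper.
for alias in PRICES:
    print(f"{alias}: {call_cost(alias, n_in=1000, n_out=200):.6f} USD")
```

Because output tokens are priced several times higher than input tokens for every model in the table, a prompt that shortens the response can offset a longer input.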

Prompt Engineering · Table 4 + Figure 2

The authors define five prompt components (Task, CoT, Category, Examples, Format) and combine them into six strategies.

Label	Components	Meaning
Full-ZS	Task + CoT + Category + Format	Full instructions, no examples
Full-FS	Task + CoT + Category + Examples + Format	Full instructions with two few-shot examples
Basic-ZS	Task + Format	Minimal usable prompt
Basic-FS	Task + Examples + Format	Minimal prompt plus examples
CoT-ZS	Task + CoT + Format	Adds only step-by-step reasoning
Categ-ZS	Task + Category + Format	Adds only category definitions

Task

Analyze the incident report step by step to extract structured information.

CoT

1. Identify service name and location.
2. Select one service category from {service_category_lst}.
3. Extract symptom sentences, then select user symptom categories from {user_symp_lst}.
4. Identify start time, end time, timezone. Format times as "HH:MM:SS".

Category

The definition for user symptom category are: {user_symp_instruction}.

Format

Return JSON with keys:
service_name, location, service_category, start_time, end_time, timezone,
user_symptom, user_symptom_category.

Examples

Q: title: Amazon CloudWatch (Ireland) ... delayed CloudWatch metrics ...
A: {"service_name":"Amazon CloudWatch","location":"Ireland","service_category":"management","start_time":"10:26:00","end_time":"14:40:00","timezone":"PST","user_symptom_category":"DELAY", ...}

Reproduced from Figure 2 of the original paper. The actual templates vary slightly because report structure differs across providers.
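The six strategies can be assembled mechanically from the five components. The sketch below is one way to do it; joining everything into a single string, the blank-line separator, and the `Q:` prefix for the report are assumptions, since the split between system prompt and user prompt is not detailed here.

```python
# Component lists follow the strategy table (Table 4).
STRATEGIES = {
    "Full-ZS": ["Task", "CoT", "Category", "Format"],
    "Full-FS": ["Task", "CoT", "Category", "Examples", "Format"],
    "Basic-ZS": ["Task", "Format"],
    "Basic-FS": ["Task", "Examples", "Format"],
    "CoT-ZS": ["Task", "CoT", "Format"],
    "Categ-ZS": ["Task", "Category", "Format"],
}


def build_prompt(strategy: str, components: dict[str, str], report: str) -> str:
    """Join the components a strategy uses, then append the report text."""
    parts = [components[name] for name in STRATEGIES[strategy]]
    parts.append(f"Q: {report}")
    return "\n\n".join(parts)


# Abbreviated placeholder texts; the real component wording is shown above.
components = {
    "Task": "Analyze the incident report step by step to extract structured information.",
    "CoT": "1. Identify service name and location. ...",
    "Category": "The definition for user symptom category are: ...",
    "Examples": "Q: title: Amazon CloudWatch (Ireland) ...\nA: {...}",
    "Format": "Return JSON with keys: service_name, location, ...",
}
print(build_prompt("Basic-FS", components, "title: ..."))
```

Keeping the strategies as data makes it easy to add an ablation (for example Task + CoT + Examples + Format) without touching the assembly code.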

Results & Core Findings

Few-shot prompting often helps, but "a stronger model is better" does not always hold.

Table 5 · Prompt strategy comparison for GPT-3.5 on AWS

Field / Prompt	Metric	Full-ZS	Full-FS	Basic-ZS	Basic-FS	CoT-ZS	Categ-ZS
Service Name	EM	86.00	100.00	52.00	100.00	83.33	60.67
Location	EM	48.00	96.67	38.00	83.33	57.33	44.67
Start Time	EM	86.00	91.33	70.00	89.33	83.33	72.00
End Time	EM	83.33	86.00	64.00	86.00	78.67	71.33
Timezone	EM	98.67	98.67	98.67	98.67	98.67	98.67
Service Categ.	EM	77.33	71.33	64.67	76.00	80.00	61.33
User Symptom Categ.	TK	88.50	88.94	9.76	50.19	9.33	90.00
User Symptom	BS	84.33	92.90	84.79	93.79	83.09	81.63
Overall	Average	71.08	79.23	49.74	73.60	61.44	62.44

Finding 1: Full-FS is best overall, and Basic-FS improves on Basic-ZS by nearly 24 percentage points, which indicates that in-context examples are the single most effective component.
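The arithmetic behind Finding 1 can be checked directly from the Overall row of Table 5. The sketch below compares the three strategies that each add exactly one component to Basic-ZS; the variable names are illustrative.

```python
# "Overall" row of Table 5 (GPT-3.5 on AWS), copied as reported.
OVERALL = {
    "Full-ZS": 71.08, "Full-FS": 79.23, "Basic-ZS": 49.74,
    "Basic-FS": 73.60, "CoT-ZS": 61.44, "Categ-ZS": 62.44,
}

baseline = OVERALL["Basic-ZS"]
# Each of these strategies adds exactly one component to Basic-ZS.
single_component = {"Examples": "Basic-FS", "CoT": "CoT-ZS", "Category": "Categ-ZS"}
gains = {
    component: round(OVERALL[label] - baseline, 2)
    for component, label in single_component.items()
}
best = max(gains, key=gains.get)
print(gains, best)
```

The gains come out to 23.86 points for Examples, 12.70 for Category, and 11.70 for CoT, so examples alone recover most of the gap between Basic-ZS and Full-FS.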

Finding 2

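Few-shot prompting improves accuracy on most metadata fields, with an average gain of up to 17.34%. It is not stable for the service and root cause categories, though, and sometimes lowers accuracy.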

Finding 3

Lightweight models sometimes beat state-of-the-art models. On Azure, Gemini 2.0 with few-shot averages $80.60\%$, above the $77.90\%$ of Gemini 2.5 with few-shot.

Finding 4

Few-shot-CoT uses $1.5\text{–}2\times$ more input tokens but can have lower latency, because the examples constrain the output format and reduce output tokens.

Finding 5

The most expensive model is $50\text{–}60\times$ more expensive than the cheapest. On Azure, Claude Sonnet 4 with few-shot costs $190.54\times10^{-4}$ USD, whereas Gemini 2.0 with zero-shot/few-shot costs only $3.10/4.27\times10^{-4}$ USD.

Finding 6

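Practical selection advice: start with Gemini 2.0 or GPT-3.5 plus few-shot; consider GPT-4o or Gemini 2.5 only when there is a hard requirement for higher accuracy.

Key insight

LLM extraction is not a monotonic scaling problem. Report format, field type, few-shot examples, and output length jointly determine the accuracy-cost-latency trade-off.

Figure 3 · Accuracy-Cost trade-off (redrawn)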

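Panels: AWS, AZURE, GCP. Legend: zero-shot and few-shot points; dashed lines mark average cost and average accuracy.

Redrawn from the coordinate ranges and mean lines of Figure 3 in the original paper. The upper-left region represents higher accuracy at lower cost.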

Reviewer Comments

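My blunt take: the topic is well chosen, but the experiments stop at the first layer, a "prompt benchmark".

Strengths

The problem is real and the data is valuable. Public cloud incident reports are an underrated data source in AIOps research, and turning them into an evaluable task is an important step.
The experimental dimensions are complete: data source, field type, model, prompt, accuracy, latency, and cost are all covered, and the conclusions are directly useful for engineering choices.
The open data and toolchain follow FAIR, so follow-up work can actually reproduce and extend the study instead of stopping at a closed benchmark.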

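Weaknesses

The taxonomy leans closed-world. If the preset service, root cause, and symptom categories are not fine-grained enough, even a high EM can hide information loss.
Few-shot example selection is not adequately controlled. The authors acknowledge that the number and content of examples matter a great deal, but they do not systematically compare retrieval-based examples, diversity sampling, or hard cases.
Evaluation of "semantic correctness" is weak. BERTScore suits coarse similarity, but the factual consistency of incident root causes and symptoms calls for entailment checks or manual audits.
Confidence and failure-mode analysis is missing. Production systems care more about when to trust an extraction and when to hand it to a human than about average accuracy alone.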
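To make the last point concrete, here is a minimal sketch of routing an extracted record either to automatic acceptance or to human review. The `confidence` field, the `unknown` sentinel, and the 0.8 threshold are all hypothetical; nothing like this is reported in the paper.

```python
def route(record: dict, threshold: float = 0.8) -> str:
    """Return "accept" or "human_review" for one extracted record."""
    confidence = record.get("confidence")  # hypothetical model-reported score
    if confidence is None or confidence < threshold:
        return "human_review"
    if any(value == "unknown" for value in record.values()):
        return "human_review"  # the model abstained on at least one field
    return "accept"
```

Reporting the share of records sent to review alongside accuracy on the accepted ones would show how the extraction behaves as an operational tool, not only on average.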

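How I would improve it

First, add schema-constrained decoding or function calling to reduce JSON format errors and constrain field types. Second, use retrieval-based few-shot: for each report, retrieve labeled examples from a similar provider, service, and symptom instead of using two fixed examples. Third, do calibration: have the model output an evidence span, a confidence score, and an abstention, returning unknown when the evidence is insufficient. Fourth, extend evaluation from field-level to incident-level utility, for example duration prediction, service-risk ranking, and root-cause trend analysis, to show that structured extraction actually improves downstream operational decisions.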
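The retrieval-based few-shot idea can be sketched with nothing more than token overlap. Jaccard similarity stands in here for whatever retriever one would actually use (for example embeddings); the example layout with a `text` key and the function names are assumptions, and this is the reviewer's suggestion, not something the paper evaluates.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens of a report."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_examples(report: str, labeled: list[dict], k: int = 2) -> list[dict]:
    """Return the k labeled examples whose text is most similar to the report."""
    ranked = sorted(labeled, key=lambda ex: jaccard(report, ex["text"]), reverse=True)
    return ranked[:k]
```

With `k=2` this keeps the prompt the same length as the fixed two-example setup, so any accuracy change can be attributed to example choice and not to prompt size.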

One More Thing

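What is really interesting about this paper is that it hints at a "public incident intelligence layer".

If public incident reports from AWS, Azure, GCP, and LLM service providers were continuously scraped, structured, deduplicated, and attributed, and then combined with service dependency graphs and region topology, one could build a cross-cloud incident intelligence layer. It would be more than a paper dataset; it could become infrastructure for cloud reliability research, SRE risk modeling, and vendor SLA audits.

Reproducibility checklist

Download the authors' GitHub/Zenodo data; clean it per provider; sample with K-means; reuse the Table 2 schema; generate the six prompts; record JSON, tokens, and latency for the 6 models; compute field-level metrics with EM/TK/BS.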
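For the recording step in this checklist, a small harness like the one below is enough to capture the raw response, latency, and parse status per call. `fake_model` is a stand-in so the sketch runs offline; a real run would replace it with a provider client and would also store the token counts that the provider's API reports, which are not modeled here.

```python
import json
import time
from typing import Callable


def run_once(model_fn: Callable[[str], str], model: str, strategy: str, prompt: str) -> dict:
    """Call one model once and record the raw response, latency, and parsed JSON."""
    start = time.perf_counter()
    raw = model_fn(prompt)
    latency_s = time.perf_counter() - start
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None  # keep the raw text so format failures can be inspected
    return {
        "model": model, "strategy": strategy, "latency_s": latency_s,
        "raw": raw, "parsed": parsed,
    }


def fake_model(prompt: str) -> str:
    # Stand-in for a real API client so the sketch runs offline.
    return '{"service_name": "Amazon CloudWatch", "timezone": "PST"}'


record = run_once(fake_model, "stub", "Basic-ZS", "Q: ...")
print(json.dumps(record, indent=2))
```

Writing one such record per (report, model, strategy) triple keeps the scoring step separate from the API calls, so metrics can be recomputed without paying for the calls again.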

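Points most worth re-checking

Category definitions, few-shot examples, time normalization, timezone handling, BERTScore's tolerance of factual errors, and the cross-provider incomparability caused by root cause being available for Azure but missing for AWS and GCP.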
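Time normalization matters because Exact Match treats "10:26" and "10:26:00" as different answers. The sketch below normalizes a time string to the "HH:MM:SS" form the prompt asks for; the list of accepted input formats is an assumption and would need extending per provider, and timezone conversion is deliberately not attempted.

```python
from datetime import datetime
from typing import Optional

# Candidate input formats are assumptions; extend them per provider.
TIME_FORMATS = ["%H:%M:%S", "%H:%M", "%I:%M %p", "%I:%M:%S %p"]


def normalize_time(value: str) -> Optional[str]:
    """Return the time as "HH:MM:SS", or None if no known format matches."""
    cleaned = " ".join(value.strip().upper().split())
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%H:%M:%S")
        except ValueError:
            continue  # try the next candidate format
    return None
```

Applying the same function to both predictions and gold labels before scoring separates real extraction errors from formatting differences.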

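One-sentence conclusion

This work does not prove that LLMs are "smart"; it shows that with a constrained schema, suitable few-shot examples, and cost-aware evaluation, LLMs can be a practical component of incident-report data engineering.

Turning cloud incident reports into analyzable data with LLMs.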