XPath Syntax: Using HtmlAgilityPack in C#

Thursday, May 3, 2012

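XPath, the XML Path Language, is a language for addressing parts of an XML document.

XPath builds on the tree structure of XML and provides a way to locate nodes within that tree. It was originally proposed as a common syntax model shared between XPointer and XSL, but developers quickly adopted it as a small query language.

Sources and references: XPath language, XPath Axes, XML Path Language (XPath)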

Basic XPath syntax

para selects the para element children of the context node
* selects all element children of the context node
text() selects all text node children of the context node
@name selects the name attribute of the context node
@* selects all the attributes of the context node
para[1] selects the first para child of the context node
para[last()] selects the last para child of the context node
*/para selects all para grandchildren of the context node
/doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
chapter//para selects the para element descendants of the chapter element children of the context node
//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
//olist/item selects all the item elements in the same document as the context node that have an olist parent
. selects the context node
.//para selects the para element descendants of the context node
.. selects the parent of the context node
../@lang selects the lang attribute of the parent of the context node
para[@type="warning"] selects all para children of the context node that have a type attribute with value warning
para[@type="warning"][5] selects the fifth para child of the context node that has a type attribute with value warning
para[5][@type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
chapter[title] selects the chapter children of the context node that have one or more title children
employee[@secretary and @assistant] selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
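
A few of the expressions listed above can be tried on a small document. The following is only a minimal sketch: the sample XML, the class name XPathBasics and the helpers LoadSample and Select are made up for illustration, and it uses the System.Xml classes that ship with .NET rather than HtmlAgilityPack, so that it runs without extra packages. The XPath expressions themselves are the same in either library.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XPathBasics
{
    // Hypothetical sample document, invented for this illustration.
    public const string SampleXml = @"
<doc>
  <chapter>
    <title>Introduction</title>
    <para type='warning'>Back up first.</para>
    <para>Plain text.</para>
  </chapter>
  <chapter>
    <title>Usage</title>
    <section>
      <para type='warning'>Check the input.</para>
    </section>
  </chapter>
</doc>";

    // Parses the sample and returns its root <doc> element.
    public static XmlElement LoadSample()
    {
        var document = new XmlDocument();
        document.LoadXml(SampleXml);
        return document.DocumentElement;
    }

    // Evaluates xpath with `context` as the context node and returns the
    // text of each selected node (for an attribute, its value).
    public static string[] Select(XmlNode context, string xpath)
    {
        XmlNodeList nodes = context.SelectNodes(xpath);
        if (nodes == null) return new string[0];
        return nodes.Cast<XmlNode>().Select(n => n.InnerText).ToArray();
    }

    public static void Main()
    {
        XmlElement root = LoadSample(); // the context node is <doc>
        string[] expressions =
        {
            "chapter/title",                          // child steps
            "chapter[title='Introduction']/para[1]",  // predicate on a child's value, then position
            "chapter[1]/para[last()]",                // last para child of the first chapter
            "chapter//para",                          // para descendants at any depth
            "//para[@type='warning']",                // attribute test anywhere in the document
            "chapter/para/@type",                     // attribute nodes
            "/doc/chapter[2]/section[1]/para"         // absolute path from the document root
        };
        foreach (string xpath in expressions)
        {
            Console.WriteLine(xpath + " => " + string.Join(" | ", Select(root, xpath)));
        }
    }
}
```

Note the difference between chapter/para/@type, which only sees para elements that are direct children of a chapter, and //para[@type='warning'], which also finds the para nested inside a section.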

XPath axes

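ancestor selects all ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self selects all ancestors of the current node (parent, grandparent, and so on) and the current node itself
attribute selects all attributes of the current node
child selects all children of the current node
descendant selects all descendants of the current node (children, grandchildren, and so on)
descendant-or-self selects all descendants of the current node (children, grandchildren, and so on) and the current node itself
following selects all nodes in the document after the closing tag of the current node
following-sibling selects all siblings after the current node
namespace selects all namespace nodes of the current node
parent selects the parent of the current node
preceding selects all nodes in the document that end before the opening tag of the current node (its ancestors are excluded)
preceding-sibling selects all siblings before the current node
self selects the current node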
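
To see what each axis returns, it helps to evaluate them from one fixed node. The sketch below is an illustration under assumptions: the sample document, the class name XPathAxes and the helpers FindById and Ids are all hypothetical, and it again uses System.Xml from .NET instead of HtmlAgilityPack. Every element carries an id attribute so the results are easy to read, and the ids are sorted so the output does not depend on the order in which a reverse axis is iterated.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XPathAxes
{
    // Hypothetical sample: doc > chapter > three para siblings.
    public const string SampleXml =
        "<doc id='d'><chapter id='c1'>" +
        "<para id='p1'/><para id='p2'/><para id='p3'/>" +
        "</chapter></doc>";

    // Loads the sample and returns the element with the given id.
    public static XmlNode FindById(string id)
    {
        var document = new XmlDocument();
        document.LoadXml(SampleXml);
        return document.SelectSingleNode("//*[@id='" + id + "']");
    }

    // Returns the ids of the elements selected by xpath from `context`,
    // sorted alphabetically.
    public static string[] Ids(XmlNode context, string xpath)
    {
        XmlNodeList nodes = context.SelectNodes(xpath);
        if (nodes == null) return new string[0];
        return nodes.OfType<XmlElement>()
                    .Select(e => e.GetAttribute("id"))
                    .OrderBy(id => id, StringComparer.Ordinal)
                    .ToArray();
    }

    public static void Main()
    {
        XmlNode p2 = FindById("p2"); // context node: the middle para
        string[] axes =
        {
            "self::*", "parent::*", "ancestor::*", "ancestor-or-self::*",
            "preceding-sibling::*", "following-sibling::*",
            "preceding::*", "following::*", "child::*", "descendant::*"
        };
        foreach (string axis in axes)
        {
            Console.WriteLine(axis + " => " + string.Join(", ", Ids(p2, axis)));
        }
    }
}
```

From p2, preceding::* yields only p1 even though doc and chapter start earlier in the document, because ancestors are not part of the preceding axis.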

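Usage example

To extract particular nodes, first look for what makes them unique, or for attributes they have in common. I usually inspect pages with a Firefox add-on; similar tools exist for IE and Chrome.
In this example we search Google for xpath and want to capture the titles of the first 10 results, so we start by examining the structure of the results page.

Using the syntax explained above, the nodes can be extracted according to the page structure. Put the following C# code in your program: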

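using System;
using HtmlAgilityPack;

// Download and parse the Google results page for the query "xpath".
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.google.com/search?hl=en&q=xpath&oq=XPath");

// Each result title sits in an <h3 class="r"> element.
string xp_title = @"//h3[@class=""r""]";
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xp_title);

// SelectNodes returns null, not an empty collection, when nothing matches.
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.InnerText);
    }
}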

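The output is as follows:

Compared with regular expressions, XPath is easier to work with because it extracts data by following the page structure, whereas a regular expression matches against raw strings, which makes the matching harder to get right.
In special cases, however, regular expressions are still needed for fine-grained parsing.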
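
One way to combine the two is to let XPath pick the nodes by structure and then apply a regular expression only to the text inside each node. The sketch below assumes a made-up, well-formed snippet in which each result title ends with a year in parentheses; the class name TitleYears, the method ExtractYears and the sample markup are hypothetical. It uses System.Xml so that it runs without extra packages; with HtmlAgilityPack the same XPath expression would be passed to SelectNodes on the document node.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

public static class TitleYears
{
    // Hypothetical, well-formed sample markup for illustration only.
    public const string SampleXhtml = @"<div>
  <h3 class='r'>XPath Tutorial (2012)</h3>
  <h3 class='r'>XPath 1.0 Recommendation (1999)</h3>
  <h3>Sponsored link (2012)</h3>
</div>";

    // Matches a four-digit year wrapped in parentheses, e.g. "(2012)".
    private static readonly Regex YearInParens = new Regex(@"\((\d{4})\)");

    public static List<string> ExtractYears(string xhtml)
    {
        var document = new XmlDocument();
        document.LoadXml(xhtml);
        var years = new List<string>();

        // Step 1: XPath selects the title nodes by structure.
        XmlNodeList titles = document.SelectNodes("//h3[@class='r']");
        if (titles == null) return years;

        foreach (XmlNode title in titles)
        {
            // Step 2: the regular expression parses the text inside each node.
            Match match = YearInParens.Match(title.InnerText);
            if (match.Success) years.Add(match.Groups[1].Value);
        }
        return years;
    }

    public static void Main()
    {
        // Prints "2012, 1999"; the third h3 has no class='r' and is skipped.
        Console.WriteLine(string.Join(", ", ExtractYears(SampleXhtml)));
    }
}
```

Keeping the regular expression scoped to a single node's text means it never has to cope with tags or attributes, which is where pure string matching on HTML tends to go wrong.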
