hparser document

Date: 2020-05-30

Tags: hparser, document


github : https://github.com/chloro-pn/...html

This article serves as the documentation for hparser. It is divided into three parts.

The hparser query interface

Use the hparser class to parse an xhtml file:

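#include "hparser.h"
#include <exception>
#include <iostream>
#include <string>

int main() {
  // content holds UTF-8 encoded xhtml text; this sample document is only a placeholder.
  std::string content = "<html><body><p>hello</p></body></html>";
  // Parsing happens in the constructor; on failure it throws a parser_error exception.
  try {
    hparser doc(content);
    // Query doc here; it goes out of scope at the end of the try block.
  }
  catch(const std::exception& e) {
    std::cout << e.what() << std::endl;
  }
}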

The hparser class provides the following interface functions:
string_type global_notes() const;
Returns the text that lies outside the top-level tag, such as comments and the DOCTYPE declaration.

std::shared_ptr<element_type> get_root() const;
Returns the top-level element of the xhtml text in the DOM model, i.e. the <html>...</html> element.

std::vector<std::shared_ptr<element_type>> find_tag(std::string str) const;

std::vector<std::shared_ptr<element_type>> find_attr(std::string str) const;

std::vector<std::shared_ptr<element_type>> find_content(std::string str) const;

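These three query functions search the xhtml text by tag, attr and content respectively, and return the set of elements that satisfy the query. The input std::string may be a regex pattern; internally it is matched with std::regex. It is best to make sure the xhtml is ASCII-encoded. For the problems between std::regex and Unicode encodings, see: https://www.zhihu.com/questio...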
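To make the ASCII recommendation concrete, here is a minimal sketch of how std::regex treats UTF-8 input: it matches one char (one byte) at a time, not one code point. The helper name matches_n_units is invented for this illustration and is not part of hparser.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true when `text` is matched in full by a pattern made of exactly
// n "any character" wildcards, i.e. when std::regex sees n matching units.
bool matches_n_units(const std::string& text, std::size_t n) {
  assert(n > 0);
  std::string pattern(n, '.');  // builds ".", "..", "...", and so on
  return std::regex_match(text, std::regex(pattern));
}
```

With this helper, matches_n_units("a", 1) is true, but the single character "é" encoded in UTF-8 as "\xC3\xA9" only matches with n == 2, because std::regex counts its two bytes as two characters.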
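If you really do need to handle extended characters beyond ASCII with regex matching, or if the interfaces above are not expressive enough, use the following interface: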

std::vector<std::shared_ptr<element_type>> find(std::function<bool(std::shared_ptr<element_type> each)> func) const;

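The find function takes a callable whose parameter is a shared pointer to an element_type. Inspect the element and return true if it matches the query, false otherwise. Inside that function you can, of course, convert the element's text from utf8 to whatever encoding you need and then decide the match with a regular expression.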

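// Pseudocode: encode_cast, regex_by_encode and yourpattern are placeholders that you supply.
// h is an already-constructed hparser object.
auto check_func = [](std::shared_ptr<hparser::element_type> node)->bool {
  std::u32string tmp = encode_cast(node->content());
  bool found = regex_by_encode(tmp, yourpattern);
  return found;
};
auto result = h.find(check_func);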

The code above is not valid C++; it is shown only as pseudocode.
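For readers who want something that compiles, the following is a rough sketch of how a predicate-driven query of this shape can work. It does not use hparser: demo_element, demo_find, collect_if and make_demo_tree are hypothetical stand-ins, and the depth-first visiting order is an assumption rather than documented hparser behaviour.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for hparser::element_type; not the library's class.
struct demo_element {
  std::string tag;
  std::string content;
  std::vector<std::shared_ptr<demo_element>> children;
};

using demo_ptr = std::shared_ptr<demo_element>;

// Depth-first walk: collect every element for which pred returns true.
void collect_if(const demo_ptr& node,
                const std::function<bool(demo_ptr)>& pred,
                std::vector<demo_ptr>& out) {
  assert(node != nullptr);
  if (pred(node)) out.push_back(node);
  for (const auto& child : node->children) collect_if(child, pred, out);
}

// Same shape as find(): a predicate goes in, the matching elements come out.
std::vector<demo_ptr> demo_find(const demo_ptr& root,
                                const std::function<bool(demo_ptr)>& pred) {
  std::vector<demo_ptr> out;
  collect_if(root, pred, out);
  return out;
}

// Builds <html><p>hi</p><div><p>yo</p></div></html> for demonstration.
demo_ptr make_demo_tree() {
  auto p1 = std::make_shared<demo_element>(demo_element{"p", "hi", {}});
  auto p2 = std::make_shared<demo_element>(demo_element{"p", "yo", {}});
  auto div = std::make_shared<demo_element>(demo_element{"div", "", {p2}});
  return std::make_shared<demo_element>(demo_element{"html", "", {p1, div}});
}
```

Calling demo_find(make_demo_tree(), pred) with a predicate that tests e->tag == "p" yields the two p elements. Because the predicate is ordinary C++, any decoding or matching logic can live inside it.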

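The element_type type

element_type represents the notion of an element in xml/html under the Document Object Model (DOM). Elements form a tree; each element has a parent element (except the root), optional child elements, a tag, text content, and so on. The type provides the following interface: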

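// Public kv_type: the type of an element attribute, a key-value pair. Objects of this type are guaranteed to have the publicly accessible data members key_ and value_.
using kv_type = inner_kv_type;

// Returns the tag of this element.
string_type tag() const;

// Returns the content inside this element, excluding the content of child elements.
string_type content() const;

// Returns the number of attributes.
size_t attrs_size() const;

// Returns the number of child elements.
size_t childs_size() const;

// Returns the attribute at index; index starts at 0.
kv_type get_attr(size_t index) const;

// Returns the child element at index; index starts at 0.
std::shared_ptr<hparser::element_type> get_child(size_t index) const;

// Returns all attributes.
std::vector<kv_type> get_all_attrs() const;

// Returns all child elements.
std::vector<std::shared_ptr<hparser::element_type>> get_all_childs() const;

// Overloaded operator[]: looks up a value by attribute key; returns the empty value "" if the key does not exist.
string_type operator[](string_type str) const;

// Tells whether this element is the root element.
bool root() const;

// Returns the parent of this element; returns an empty std::shared_ptr if this element is the root.
std::shared_ptr<hparser::element_type> parent() const;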
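As an illustration of how root() and parent() combine, the sketch below walks from an element up to the root and builds a path of tags. demo_node is a hypothetical stand-in that only borrows the method names tag(), root() and parent() from the interface above; path_to is likewise invented for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in exposing three of the method names listed above.
class demo_node {
 public:
  demo_node(std::string tag, std::shared_ptr<demo_node> parent)
      : tag_(std::move(tag)), parent_(std::move(parent)) {}
  std::string tag() const { return tag_; }
  bool root() const { return parent_ == nullptr; }
  std::shared_ptr<demo_node> parent() const { return parent_; }

 private:
  std::string tag_;
  // Empty for the root. This stand-in stores no child links, so holding the
  // parent through a shared_ptr cannot create an ownership cycle here.
  std::shared_ptr<demo_node> parent_;
};

// Builds a path such as "html/body/a" by following parent() up to the root.
std::string path_to(std::shared_ptr<demo_node> node) {
  assert(node != nullptr);
  std::string path = node->tag();
  while (!node->root()) {
    node = node->parent();
    path = node->tag() + "/" + path;
  }
  return path;
}
```

The same loop shape should carry over to a find predicate that needs to test an ancestor further up than the direct parent.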

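Combining the find interface with the element_type class gives you very powerful and flexible queries; almost any query condition, however complex, can be expressed. For example: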

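// h is an already-constructed hparser object.
auto check_func = [](std::shared_ptr<hparser::element_type> node)->bool {
  // The parent must be a div; the root has no parent, so rule it out first.
  if(!node->root() && node->parent()->tag() == u8"div") {
    // The element must be an <a> whose only attribute is href.
    if(node->tag() == u8"a" && (*node)[u8"href"] != u8"" && node->attrs_size() == 1) {
      return true;
    }
  }
  return false;
};
auto result = h.find(check_func);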

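This query function requires that the returned element's parent has the tag div, and that the element itself has the tag a and carries only the attribute "href".
See how powerful the find interface is? You can build any query condition from the information element_type exposes. Next comes part three, encodings and regex matching, which extends what find can do even further :)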

The utf8_to_utf32/utf32_to_utf8 interfaces

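Handling encodings in C++ is really hard. The standard library's string is just a char array that carries no encoding information, whereas sensible regex matching should proceed code point by code point. As a result std::regex can handle ASCII but falls short on extended encodings. The standard library also offers wchar_t and std::wregex, but the size of wchar_t actually differs between platforms. Given that, first, we cannot encode by hand and store the result in wchar_t, because its size is not fixed; second, the standard library components for converting between std::string and std::wstring were deprecated in C++17. std::string and std::wstring have become two entirely separate worlds, with no non-deprecated standard way to convert between them.

C++11 introduced two new character types, char16_t and char32_t, which have well-defined sizes. I thought dawn had come: convert all external input to utf32 for internal representation, then run regexes with std::basic_regex<char32_t>. Every Unicode code point defined so far fits in a single char32_t, so all would be well. However, the std::basic_regex class template does not directly support char32_t; you have to implement std::regex_traits<char32_t> yourself (because char32_t is just a larger storage unit that carries no encoding information, and in my personal view regex processing should be built on top of a concrete encoding). See: https://stackoverflow.com/que...

Wisely, I stopped trying to fill this bottomless pit and instead provide utf8-utf32 conversion interfaces. If you have a regex library that supports utf32, you can convert the encoding through these interfaces and then do the regex matching.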

// Converts c to the native byte order; en is the current byte order of c.
// endian is an enum class; objects of this type can be set to
// endian::little_endian or endian::big_endian.
char32_t endian_cast(char32_t c, endian en);

// Converts the utf8 text in src to utf32; en is the byte order of the output, defaulting to the native byte order.
size_t utf8_to_utf32(const std::string& src, std::u32string& dst, endian en = local_endian().get());

// Converts the utf32 text in src to utf8; en is the byte order of the input, defaulting to the native byte order.
size_t utf32_to_utf8(const std::u32string& src, std::string& dst, endian en = local_endian().get());

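Both interfaces return the index of the first piece of text that failed to convert. On success, return_index == src.size() holds; otherwise src[return_index] and all input after it failed to convert. In summary, inside your query function you can convert utf8 to utf32 and then use a regex library that supports char32_t to do the matching.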
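The return-value contract is easier to see in code. The following is an independent sketch of a UTF-8 decoder that follows the same convention (return src.size() on success, otherwise the index where the first bad sequence begins). It is not hparser's implementation: demo_utf8_to_utf32 is a made-up name, the endian parameter is left out, and the output is simply in native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 in src into code points appended to dst.
// Returns src.size() on success, otherwise the index of the first byte of
// the first sequence that could not be decoded.
std::size_t demo_utf8_to_utf32(const std::string& src, std::u32string& dst) {
  dst.clear();
  std::size_t i = 0;
  while (i < src.size()) {
    const unsigned char lead = static_cast<unsigned char>(src[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return i;  // stray continuation byte or invalid lead byte

    if (i + len > src.size()) return i;  // sequence is cut off at the end
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(src[i + k]);
      if ((c & 0xC0) != 0x80) return i;  // expected a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong forms, surrogates and values above U+10FFFF.
    static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (len > 1 && cp < min_cp[len]) return i;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return i;

    dst.push_back(cp);
    i += len;
  }
  assert(i == src.size());
  return i;
}
```

A caller checks the result the same way as described above: if the returned index equals src.size(), dst holds the full text; otherwise dst holds only the code points decoded before the failing position.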
