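1. When converting between Simplified and Traditional Chinese, the simplest approach is to convert character by character. However, when one Simplified character maps to several Traditional characters (or one Traditional character to several Simplified ones), the conversion has to take semantics and the surrounding word into account. This raises a difficulty: how to extract the individual words from a long string, which is the problem of Chinese word segmentation.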
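To make the one-to-many problem concrete, here is a minimal sketch. It assumes two tiny, hypothetical lookup tables (`CHAR_MAP` and `PHRASE_MAP` are illustrative names, not part of any library) and that the input has already been segmented into words. A real converter would need far larger tables.

```python
# Hypothetical, deliberately tiny tables for illustration only.
# The Simplified character 发 can become either 發 or 髮 in Traditional Chinese;
# a character-level table has to pick one of them.
CHAR_MAP = {"头": "頭", "发": "發"}
PHRASE_MAP = {"头发": "頭髮", "发展": "發展"}


def convert_by_char(text):
    """Naive conversion: map every character independently."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_by_words(words):
    """Convert already-segmented words, preferring phrase-level entries."""
    return "".join(PHRASE_MAP.get(w) or convert_by_char(w) for w in words)


print(convert_by_char("头发"))             # 頭發 (wrong character for "hair")
print(convert_by_words(["头发", "发展"]))   # 頭髮發展
```

The phrase table only helps if the text has been split into words first, which is why segmentation comes before conversion.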
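2. Chinese word segmentation means splitting a sequence of Chinese characters into individual words; in other words, recombining a continuous sequence of characters into a sequence of words according to some specification. In English text, words are naturally delimited by spaces. In Chinese, only characters, sentences, and paragraphs can be separated by explicit delimiters; words have no formal delimiter. English also has to deal with dividing phrases, but at the word level Chinese is considerably more complex and difficult.
3. Existing segmentation algorithms fall into three broad categories: methods based on string matching, methods based on understanding, and methods based on statistics.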
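4. The more commonly used approach is string matching.
This approach is also called mechanical segmentation. Following some strategy, it matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds (a word has been recognized). By scan direction, string-matching segmentation divides into forward matching and backward matching; by which length is preferred, it divides into maximum (longest) matching and minimum (shortest) matching. Several commonly used mechanical segmentation methods are: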
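1) Forward maximum matching (scanning left to right);
2) Backward maximum matching (scanning right to left);
3) Minimum segmentation (minimizing the number of words cut from each sentence);
4) Bidirectional maximum matching (two scans, one left to right and one right to left).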
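The matching strategies listed above can be sketched in a few lines. This is a minimal illustration, not a production segmenter: the dictionary is a hypothetical toy set, `max_len` is an assumed upper bound on word length, and the tie-breaking rule in the bidirectional variant (fewer words, then fewer single-character tokens, then the backward result) is one common heuristic chosen here for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end of the string


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both scans and keep the result that looks better.

    Assumed heuristic: fewer words wins, then fewer single-character
    tokens; on a full tie the backward result is kept.
    """
    def score(words):
        return (len(words), sum(1 for w in words if len(w) == 1))

    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    return min([bwd, fwd], key=score)


# Hypothetical toy dictionary.
toy_dict = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命的起源"

print(forward_max_match(sentence, toy_dict))        # ['研究生', '命', '的', '起源']
print(backward_max_match(sentence, toy_dict))       # ['研究', '生命', '的', '起源']
print(bidirectional_max_match(sentence, toy_dict))  # ['研究', '生命', '的', '起源']
```

On this sentence the forward scan greedily takes 研究生 and leaves 命 stranded, while the backward scan recovers 生命. Comparing the two scans catches cases like this, though not every ambiguity.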
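Apple has supported Chinese word segmentation for a long time, and almost all of us use it every day. Think about it: when you long-press a passage of text on your phone, it usually selects the word under your finger. That is a perfect use case for word segmentation.
Apple provides a complete API; for the full picture, go straight to the documentation: CFStringTokenizer Reference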
1. System framework: <CoreFoundation.framework>
2. Header file: <CoreFoundation/CFStringTokenizer.h>
CFStringTokenizerRef CFStringTokenizerCreate(CFAllocatorRef alloc, CFStringRef string, CFRange range, CFOptionFlags options, CFLocaleRef locale) API_AVAILABLE(macos(10.5), ios(3.0), watchos(2.0), tvos(9.0));
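First parameter, alloc: usually NULL (the current default CFAllocator is used).
Second parameter, string: the string to tokenize, for example (__bridge CFStringRef)string.
Third parameter, range: the range of the string to tokenize, usually the whole string.
Fourth parameter, options: how the string is tokenized; the most practical choice is kCFStringTokenizerUnitWordBoundary. The header defines the following option values:
    kCFStringTokenizerUnitWord = 0,
    kCFStringTokenizerUnitSentence = 1,
    kCFStringTokenizerUnitParagraph = 2,
    kCFStringTokenizerUnitLineBreak = 3,
    kCFStringTokenizerUnitWordBoundary = 4,
    kCFStringTokenizerAttributeLatinTranscription = 1UL << 16,
    kCFStringTokenizerAttributeLanguage = 1UL << 17,
Fifth parameter, locale: the locale, which can specify a particular language or region; pass NULL to have the tokenizer identify it automatically.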
/*!
    @function CFStringTokenizerCreate
    @abstract Creates a tokenizer instance.
    @param alloc The CFAllocator which should be used to allocate memory for the tokenizer and its storage for values. This parameter may be NULL in which case the current default CFAllocator is used.
    @param string The string to tokenize.
    @param range The range of characters within the string to be tokenized. The specified range must not exceed the length of the string.
    @param options Use one of the Tokenization Unit options to specify how the string should be tokenized. Optionally specify one or more attribute specifiers to tell the tokenizer to prepare specified attributes when it tokenizes the string.
    @param locale The locale to specify language or region specific behavior. Pass NULL if you want tokenizer to identify the locale automatically.
    @result A reference to the new CFStringTokenizer.
*/
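The options argument combines exactly one tokenization unit with zero or more attribute specifiers. The arithmetic is sketched below in Python, with the constant values copied from the list above. The `UNIT_MASK` name and its 0xFFFF value are assumptions made purely for this illustration and do not come from the header.

```python
# Constant values as listed for CFStringTokenizer.h above.
kCFStringTokenizerUnitWord = 0
kCFStringTokenizerUnitSentence = 1
kCFStringTokenizerUnitParagraph = 2
kCFStringTokenizerUnitLineBreak = 3
kCFStringTokenizerUnitWordBoundary = 4
kCFStringTokenizerAttributeLatinTranscription = 1 << 16
kCFStringTokenizerAttributeLanguage = 1 << 17

# One unit, optionally OR-ed with attribute specifiers.
options = (kCFStringTokenizerUnitWordBoundary
           | kCFStringTokenizerAttributeLatinTranscription)
print(hex(options))  # 0x10004

# Hypothetical mask, for illustration only: the units are small integers
# (0 to 4), while the attributes sit at bits 16 and 17, so they never collide.
UNIT_MASK = 0xFFFF
print(options & UNIT_MASK)                                       # 4
print(bool(options & kCFStringTokenizerAttributeLatinTranscription))  # True
print(bool(options & kCFStringTokenizerAttributeLanguage))            # False
```

Because the units are plain integers rather than bit flags, OR-ing two units together (for example 1 | 2) would silently yield a different unit, so only one unit should be passed.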
CFStringTokenizerTokenType CFStringTokenizerAdvanceToNextToken(CFStringTokenizerRef tokenizer) API_AVAILABLE(macos(10.5), ios(3.0), watchos(2.0), tvos(9.0));
Pass in the tokenizer you created; each call advances to the next token, in string order.
/*!
    @function CFStringTokenizerAdvanceToNextToken
    @abstract Token enumerator.
    @param tokenizer The reference to CFStringTokenizer returned by CFStringTokenizerCreate.
    @result Type of the token if succeeded in finding a token and setting it as current token. kCFStringTokenizerTokenNone if failed in finding a token.
    @discussion If there is no preceding call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, it finds the first token in the range specified to CFStringTokenizerCreate. If there is a current token after successful call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, it proceeds to the next token. If succeeded in finding a token, set it as current token and return its token type. Otherwise invalidate current token and return kCFStringTokenizerTokenNone. The range and attribute of the token can be obtained by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (with type kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), its subtokens and (or) derived subtokens can be obtained by calling CFStringTokenizerGetCurrentSubTokens.
*/
CFRange CFStringTokenizerGetCurrentTokenRange(CFStringTokenizerRef tokenizer) API_AVAILABLE(macos(10.5), ios(3.0), watchos(2.0), tvos(9.0));
Returns the range, within the string, of the token found by the most recent call to CFStringTokenizerAdvanceToNextToken.
/*!
    @function CFStringTokenizerGetCurrentTokenRange
    @abstract Returns the range of current token.
    @param tokenizer The reference to CFStringTokenizer returned by CFStringTokenizerCreate.
    @result Range of current token, or {kCFNotFound,0} if there is no current token.
*/
// The string to tokenize
NSString *string = @"今天下雨了嗎?小明說下雨了,小紅說沒下雨。那麼,小明和小紅誰在說謊呢?";
NSMutableArray *keywords = [[NSMutableArray alloc] init];

// Create the tokenizer (NULL locale: let the tokenizer identify the language)
CFStringTokenizerRef ref = CFStringTokenizerCreate(NULL, (__bridge CFStringRef)string, CFRangeMake(0, string.length), kCFStringTokenizerUnitWordBoundary, NULL);

// Advance to the first token and fetch its range
CFStringTokenizerAdvanceToNextToken(ref);
CFRange range = CFStringTokenizerGetCurrentTokenRange(ref);

// Collect every token; once no token is left the range is {kCFNotFound, 0},
// so the length check ends the loop
while (range.length > 0) {
    NSString *keyWord = [string substringWithRange:NSMakeRange(range.location, range.length)];
    [keywords addObject:keyWord];
    NSLog(@"%@", keyWord);
    CFStringTokenizerAdvanceToNextToken(ref);
    range = CFStringTokenizerGetCurrentTokenRange(ref);
}
NSLog(@"keywords = %@", keywords);

// The tokenizer was obtained from a Create function, so release it
CFRelease(ref);
The output is as follows:
2017-10-20 14:09:23.569459+0800 TokenizerDemo[7220:227855] 今天
2017-10-20 14:09:23.569608+0800 TokenizerDemo[7220:227855] 下
2017-10-20 14:09:23.569742+0800 TokenizerDemo[7220:227855] 雨
2017-10-20 14:09:23.569844+0800 TokenizerDemo[7220:227855] 了
2017-10-20 14:09:23.570082+0800 TokenizerDemo[7220:227855] 嗎
2017-10-20 14:09:23.570207+0800 TokenizerDemo[7220:227855] ?
2017-10-20 14:09:23.570313+0800 TokenizerDemo[7220:227855] 小明
2017-10-20 14:09:23.570431+0800 TokenizerDemo[7220:227855] 說
2017-10-20 14:09:23.570522+0800 TokenizerDemo[7220:227855] 下雨
2017-10-20 14:09:23.570615+0800 TokenizerDemo[7220:227855] 了
2017-10-20 14:09:23.570695+0800 TokenizerDemo[7220:227855] ,
2017-10-20 14:09:23.570764+0800 TokenizerDemo[7220:227855] 小紅
2017-10-20 14:09:23.570860+0800 TokenizerDemo[7220:227855] 說
2017-10-20 14:09:23.570936+0800 TokenizerDemo[7220:227855] 沒
2017-10-20 14:09:23.571007+0800 TokenizerDemo[7220:227855] 下雨
2017-10-20 14:09:23.571117+0800 TokenizerDemo[7220:227855] 。
2017-10-20 14:09:23.571373+0800 TokenizerDemo[7220:227855] 那麼
2017-10-20 14:09:23.571529+0800 TokenizerDemo[7220:227855] ,
2017-10-20 14:09:23.571773+0800 TokenizerDemo[7220:227855] 小明
2017-10-20 14:09:23.572000+0800 TokenizerDemo[7220:227855] 和
2017-10-20 14:09:23.572235+0800 TokenizerDemo[7220:227855] 小紅
2017-10-20 14:09:23.572559+0800 TokenizerDemo[7220:227855] 誰
2017-10-20 14:09:23.573009+0800 TokenizerDemo[7220:227855] 在
2017-10-20 14:09:23.573432+0800 TokenizerDemo[7220:227855] 說謊
2017-10-20 14:09:23.573892+0800 TokenizerDemo[7220:227855] 呢
2017-10-20 14:09:23.574219+0800 TokenizerDemo[7220:227855] ?