UVa OJ 175 - Keywords

Date: 2019-11-08

Time limit: 3.000 seconds

Problem

Many researchers are faced with an ever increasing number of journal articles to read and find it difficult to locate papers of relevance to their particular lines of research. However, it is possible to subscribe to various services which claim that they will find articles that fit an "interest profile" that you supply, and pass them on to you. One simple way of performing such a search is to determine whether a pair of keywords occurs "sufficiently" close to each other in the title of an article. The threshold is determined by the researchers themselves, and refers to the number of words that may occur between the pair of keywords. Thus an archeologist interested in cave paintings could specify her profile as "0 rock art", meaning that she wants all titles in which the words "rock" and "art" appear with 0 words in between, that is next to each other. This would select not only "Rock Art of the Maori" but also "Pop Art, Rock, and the Art of Hang-glider Maintenance".

Write a program that will read in a series of profiles followed by a series of titles and determine which of the titles (if any) are selected by each of the profiles. A title is selected by a profile if at least one pair of keywords from the profile is found in the title, separated by no more than the given threshold. For the purposes of this program, a word is a sequence of letters, preceded by one or more blanks and terminated by a blank or the end of line marker.

Input

Input will consist of no more than 50 profiles followed by no more than 250 titles. Each profile and title will be numbered in the order of their appearance, starting from 1, although the numbers will not appear in the file.

Each profile will start with the characters "P:", and will consist of a number representing a threshold, followed by two or more keywords in lower case.

Each title will start with the characters "T:", and will consist of a string of characters terminated by "|". The character "|" will not occur anywhere in a title except at the end. No title will be longer than 255 characters, and if necessary it will flow on to more than one line. No line will be longer than eighty characters and each continuation line of a title will start with at least one blank. Line breaks will only occur between words.

All non-alphabetic characters are to be ignored, thus the title "Don't Rock -- the Boat as Metaphor in 1984" would be treated as "Dont Rock the Boat as Metaphor in" and "HP2100X" will be treated as "HPX". The file will be terminated by a line consisting of a single #.

Output

Output will consist of a series of lines, one for each profile in the input. Each line will consist of the profile number (the number of its appearance in the input) followed by ":", a blank space, and the numbers of the selected titles in numerical order, separated by commas and with no spaces.

Sample input

Sample output

Analysis

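This problem mainly tests the ability to model a search-and-match task; it actually has little to do with string processing. Keep the following points in mind: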

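All non-alphabetic characters are ignored;
Only blanks and line breaks separate words;
All words are handled in lower case;
Every two words in a profile count as a pair.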

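The first three rules effectively turn words into numbers: once every word is given a number in a table, each profile and each title becomes a sequence of numbers, each of which is the number of a word.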
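As a rough illustration of the first three rules, the sketch below splits a line on blanks, drops every non-letter and lower-cases the rest. The function name `normalizeWords` is hypothetical and does not appear in the solution; it is one possible way to apply the rules, not the author's routine.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text on blanks, discard non-letters, lower-case the rest.
// Tokens that contain no letters at all (such as "--" or "1984") produce no word.
std::vector<std::string> normalizeWords(const std::string &text)
{
	std::vector<std::string> words;
	std::string word;
	for (std::string::size_type i = 0; i <= text.size(); ++i) {
		// Treat the end of the string as one final separator.
		unsigned char c = (i < text.size()) ? static_cast<unsigned char>(text[i]) : ' ';
		if (c == ' ' || c == '\t' || c == '\n') {
			if (!word.empty()) {
				words.push_back(word);
				word.clear();
			}
		} else if (std::isalpha(c)) {
			word.push_back(static_cast<char>(std::tolower(c)));
		}
	}
	return words;
}
```

Applied to the title from the problem statement, "Don't Rock -- the Boat as Metaphor in 1984", this yields the seven words dont, rock, the, boat, as, metaphor, in.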

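1. Parse the profiles correctly and record every word that appears in them. Walk through all the words in the profiles and build a table that maps each word to a number (the word table); an STL map works as the data structure here. Then use the word table to normalize every profile and title; title words that are not in the table can be marked with -1.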
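A minimal sketch of such a word table follows. The names `WordTable`, `internKeyword`, `encodeTitle` and `NOT_KEYWORD` are made up for this illustration; the numbering from 1 and the -1 marker follow the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, unsigned short> WordTable;

// Marker for a title word that is not a keyword: -1 stored in an unsigned short.
const unsigned short NOT_KEYWORD = static_cast<unsigned short>(-1);

// Return the number of a keyword, assigning the next free number (from 1) on first sight.
unsigned short internKeyword(WordTable &table, const std::string &word)
{
	WordTable::iterator it = table.find(word);
	if (it != table.end())
		return it->second;
	assert(table.size() + 1 < NOT_KEYWORD); // keep real numbers distinct from the marker
	unsigned short id = static_cast<unsigned short>(table.size() + 1);
	table[word] = id;
	return id;
}

// Replace each (already normalized) title word by its number, or by NOT_KEYWORD.
std::vector<unsigned short> encodeTitle(const WordTable &table,
                                        const std::vector<std::string> &words)
{
	std::vector<unsigned short> ids;
	for (std::size_t i = 0; i < words.size(); ++i) {
		WordTable::const_iterator it = table.find(words[i]);
		ids.push_back(it != table.end() ? it->second : NOT_KEYWORD);
	}
	return ids;
}
```

With the profile "0 rock art", rock becomes 1 and art becomes 2, so the title words pop, art, rock are encoded as NOT_KEYWORD, 2, 1.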

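2. Build the word-pair data for the profiles. Each pair of words in a profile is really a triple: (word pair, threshold, profile number). Since step 1 already turned every word into a number, a word pair is just two integers. The total number of words certainly stays below ten thousand, and the order of the two words in a pair does not matter, so each number fits in two bytes: put the smaller number in the high two bytes and the larger one in the low two bytes to form a 4-byte integer that uniquely identifies the pair. The pair data for all profiles can then be stored in many ways; here a map is used whose key is the integer of the pair and whose value is a dynamic array of structs, each holding a threshold and the number of the profile it belongs to.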
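The packing described in step 2 can be sketched as below. `makePairKey` and `PairKey` are illustrative names; the point of the example is that the key is the same whichever order the two numbers arrive in.

```cpp
#include <algorithm>

typedef unsigned long PairKey; // at least 32 bits wide

// Pack two 16-bit word numbers into one key: smaller number in the high half,
// larger number in the low half, so (a, b) and (b, a) give the same key.
PairKey makePairKey(unsigned short a, unsigned short b)
{
	unsigned short smaller = std::min(a, b);
	unsigned short larger = std::max(a, b);
	// Widen before shifting so the shift is done on an unsigned long, not on an int.
	return (static_cast<PairKey>(smaller) << 16) | larger;
}
```

For example, words 1 and 2 give the key 0x10002 in either order.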

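3. Build the word-pair data for the titles. A title contains non-keywords (numbered -1) and keywords; for each title, find the shortest distance between every pair of keywords that occurs in it and record the result in a data structure. The structure mirrors the one in step 2, with each keyword pair forming a triple: (word pair, shortest distance, title number). A brute-force search is enough to find the shortest distances: visit the positions in the same order as bubble sort, which costs O(n²) per title. Once all the values are known, build a map whose key is the integer of the pair and whose value is a dynamic array of structs, each holding the shortest distance and the number of the title it belongs to.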
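The brute-force search of step 3 might look like the sketch below for a single title. For readability it keys the map on a `std::pair` of word numbers instead of the packed integer, and it stores the number of words between the two keywords; both choices, like the name `minGaps`, are assumptions of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::pair<unsigned short, unsigned short> WordPair; // (smaller number, larger number)

// For one encoded title, find the fewest words lying between each pair of keywords.
std::map<WordPair, std::size_t> minGaps(const std::vector<unsigned short> &ids)
{
	const unsigned short NOT_KEYWORD = static_cast<unsigned short>(-1);
	std::map<WordPair, std::size_t> gaps;
	for (std::size_t j = 0; j < ids.size(); ++j) {
		if (ids[j] == NOT_KEYWORD)
			continue;
		for (std::size_t k = j + 1; k < ids.size(); ++k) {
			if (ids[k] == NOT_KEYWORD)
				continue;
			WordPair key(std::min(ids[j], ids[k]), std::max(ids[j], ids[k]));
			std::size_t gap = k - j - 1; // words strictly between positions j and k
			std::map<WordPair, std::size_t>::iterator it = gaps.find(key);
			if (it == gaps.end())
				gaps[key] = gap;
			else if (gap < it->second)
				it->second = gap;
		}
	}
	return gaps;
}
```

In "Pop Art, Rock, and the Art of Hang-glider Maintenance", with rock numbered 1 and art numbered 2, the pair occurs once with no word in between and once with two words in between, so the recorded gap is 0.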

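4. Finally, compare the profile map with the title map: find the word pairs they have in common, then compare the arrays stored under each shared pair. Whenever a title's shortest distance, counted as the number of words between the two keywords, is less than or equal to a profile's threshold, link that title number to that profile number. Once all profile-title links have been found, sort them by profile number and print them.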
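For comparison, here is a deliberately simple sketch of step 4 that checks each profile against each title directly, rather than merging the two sorted arrays as the solution does. It assumes each title has already been reduced to a map from keyword pair to the fewest words in between; `ProfilePairs`, `selectTitles` and `formatLine` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<unsigned short, unsigned short> WordPair; // (smaller number, larger number)
typedef std::map<WordPair, std::size_t> GapMap;             // pair -> fewest words in between

// Hypothetical record for one profile: its threshold and all of its keyword pairs.
struct ProfilePairs
{
	std::size_t threshold;
	std::vector<WordPair> pairs;
};

// Return the 1-based numbers of the titles selected by one profile.
std::set<std::size_t> selectTitles(const ProfilePairs &profile, const std::vector<GapMap> &titles)
{
	std::set<std::size_t> selected;
	for (std::size_t t = 0; t < titles.size(); ++t) {
		for (std::size_t p = 0; p < profile.pairs.size(); ++p) {
			GapMap::const_iterator hit = titles[t].find(profile.pairs[p]);
			if (hit != titles[t].end() && hit->second <= profile.threshold) {
				selected.insert(t + 1);
				break; // one qualifying pair is enough to select the title
			}
		}
	}
	return selected;
}

// Format one output line: profile number, colon, blank, comma-separated title numbers.
std::string formatLine(std::size_t profileNo, const std::set<std::size_t> &selected)
{
	std::ostringstream out;
	out << profileNo << ": ";
	for (std::set<std::size_t>::const_iterator it = selected.begin(); it != selected.end(); ++it) {
		if (it != selected.begin())
			out << ',';
		out << *it;
	}
	return out.str();
}
```

Using a `std::set` keeps the title numbers sorted and free of duplicates, which is what the output format asks for. With at most 50 profiles and 250 titles the direct comparison is cheap, so the merge in the solution is an optimization, not a necessity.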

Solution

#include <algorithm>
#include <cctype>   // isdigit, isalpha, tolower
#include <cstdlib>  // atoi
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <utility>

typedef unsigned long ulong;
typedef unsigned short ushort;

// Holds a profile's threshold and its keywords converted to a sequence of numbers
struct PROFILE
{
	size_t nThreshold;
	std::vector<ushort> nArray;
};

// For a profile entry: the threshold and the profile index. For a title entry: the distance between the two keywords and the title index
struct INFO
{
	size_t nDist;
	size_t nIdx;
};

typedef std::vector<std::string> VECSTR;
typedef std::vector<ushort> ARRAY;
typedef std::vector<ARRAY> MATRIX;
typedef std::map<ulong, std::vector<INFO> > MAPINFO;
typedef std::pair<size_t, size_t> PAIR;

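// Pack the two word numbers of a keyword pair into one unsigned long:
// the smaller number goes into the high 16 bits, the larger into the low 16 bits
ulong MakeWordPair(ushort w1, ushort w2)
{
	return (w1 > w2) ? (w1 | (ulong(w2) << 16)) : (w2 | (ulong(w1) << 16));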
}

// Ordering used for sorting: overload operator <
bool operator < (const INFO &f1, const INFO &f2)
{
	return (f1.nDist < f2.nDist || (f1.nDist == f2.nDist && f1.nIdx < f2.nIdx));
}

// Equality used for de-duplication: overload operator ==
bool operator == (const INFO &f1, const INFO &f2)
{
	return (f1.nDist == f2.nDist && f1.nIdx == f2.nIdx);
}

int main(void)
{
	VECSTR profileStrs, titleStrs;
	for (std::string str; getline(std::cin, str) && str[0] != '#'; ) {
		// Read the input: a line starting with "P:" is a profile, one starting with "T:" is a title, and one starting with a blank or tab continues the previous title.
		switch(str[0]) {
		case 'P':
			profileStrs.push_back(std::string(str.begin() + 2, str.end()));
			break;
		case 'T':
			titleStrs.push_back(std::string(str.begin() + 2, str.end()));
			break;
		case ' ':
		case '\t':
			titleStrs.back() += str;
			break;
		}
	}
	std::map<std::string, ushort> wordTbl;	     // numbers every keyword; maps keyword -> number
	std::vector<PROFILE> arrProfile;	         // each profile's keywords converted to keyword numbers
	for (VECSTR::iterator i = profileStrs.begin(); i != profileStrs.end(); ++i) {
		i->push_back(' ');
		std::string::iterator iBeg = i->begin();
		// A profile is a threshold followed by keywords: scan to the first digit of the threshold
		for (; iBeg != i->end() && !isdigit(*iBeg); ++iBeg);
		// Scan to the end of the threshold, collecting its digits
		std::string strThre;
		std::string::iterator iEnd = iBeg;
		for (; iEnd != i->end() && isdigit(*iEnd); ++iEnd)
			strThre.push_back(*iEnd);
		// Store this profile's threshold and its sequence of keyword numbers
		arrProfile.push_back(PROFILE());
		PROFILE &cur = arrProfile.back();
		// Convert the threshold from text to a number
		cur.nThreshold = atoi(strThre.c_str());
		// Accumulates the keyword currently being read
		std::string word;   
		for (std::string::iterator j = iEnd; j != i->end(); ++j) {
			if (*j != ' ' && *j != '\t')
				word.push_back(*j);
			else if (!word.empty()) {
				// Add the keyword to the word table; new words are numbered from 1
				ushort &wordIdx = wordTbl[word];
				if (wordIdx == 0)
					wordIdx = wordTbl.size();
				// Append the keyword's number to this profile's sequence
				cur.nArray.push_back(wordIdx); 
				word.clear();
			}
		}
	}
	// The input lists keyword pairs per profile; invert that so each keyword pair maps to the profiles (and thresholds) that contain it
	MAPINFO profileTbl;
	for (std::vector<PROFILE>::iterator i = arrProfile.begin(); i != arrProfile.end(); ++i) {
		// Every two keywords of a profile form a keyword pair
		for (ARRAY::iterator j = i->nArray.begin(); j != i->nArray.end(); ++j) {
			for (ARRAY::iterator k = j + 1; k != i->nArray.end(); ++k) {
				INFO info = {i->nThreshold, size_t(i - arrProfile.begin())};
				profileTbl[MakeWordPair(*j, *k)].push_back(info);
			}
		}
	}

	MATRIX titleAry;
	for (VECSTR::iterator i = titleStrs.begin(); i != titleStrs.end(); ++i) {
		(*i)[i->size() - 1] = ' ';
		titleAry.push_back(ARRAY());
		std::string word;
		// Process the title as the problem requires: drop non-letters, then turn the title into a sequence of numbers. A word that is a keyword gets its number; any other word gets -1
		for (std::string::iterator j = i->begin(); j != i->end(); ++j) {
			char cTmp = tolower(*j);
			if (cTmp != ' ' && cTmp != '\t') {
				if (isalpha(cTmp))
					word.push_back(cTmp);
			}
			else if (!word.empty()) {
				std::map<std::string, ushort>::iterator idx = wordTbl.find(word);
				titleAry.back().push_back(idx != wordTbl.end() ? idx->second : -1);
				word.clear();
			}
		}
	}
	// A title may contain many keyword pairs: compute and store the distance of each pair
	MAPINFO titleTbl;
	for (MATRIX::iterator i = titleAry.begin(); i != titleAry.end(); ++i) {
		// For the current title, map each keyword pair to its shortest distance
		std::map<ulong, ushort> curWordmap;
		// Running j up to end() keeps the loop well-defined for a title with no words
		for (ARRAY::iterator j = i->begin(); j != i->end(); ++j) {
			if (*j != ushort(-1)) {
				for (ARRAY::iterator k = j + 1; k != i->end(); ++k) {
					if (*k != ushort(-1)) {
						// Both words are keywords: compute their distance and keep the minimum
						ushort nDist = k - j;
						ushort &nWord = curWordmap[MakeWordPair(*j, *k)];
						if (nWord == 0 || nDist < nWord)
							nWord = nDist;
					}
				}
			}
		}
		// Regroup the title's data so each keyword pair maps to a list of (distance, title index) entries
		for (std::map<ulong, ushort>::iterator j = curWordmap.begin(); j != curWordmap.end(); ++j) {
			INFO info = {j->second, size_t(i - titleAry.begin())};
			titleTbl[j->first].push_back(info);
		}
	}
	// Compare profiles and titles to determine which titles each profile selects
	std::vector<PAIR> result;
	for (MAPINFO::iterator i = profileTbl.begin(); i != profileTbl.end(); ++i) {
		std::vector<INFO> &curP = i->second;
		std::vector<INFO> &curT = titleTbl[i->first];
		// Check whether any title contains this keyword pair
		if (!curT.empty()) {
			// The pair occurs in both a profile and a title: sort the profile entries by threshold and remove duplicates
			std::sort(curP.begin(), curP.end());
			curP.erase(std::unique(curP.begin(), curP.end()), curP.end());
			std::sort(curT.begin(), curT.end());    // sort the title entries by distance
			for (std::vector<INFO>::iterator icurP = curP.begin(), icurT = curT.begin(); 
				icurP != curP.end() && icurT != curT.end();) {
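					// nDist holds a position difference, so nDist - 1 words lie between the two keywords. If that fits the current threshold, the title is selected by this profile and by every later one, whose thresholds are at least as large
					// Otherwise move on to the next, larger threshold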
				if (icurT->nDist - 1 <= icurP->nDist) {
					for (std::vector<INFO>::iterator j = icurP; j != curP.end(); ++j)
						result.push_back(std::make_pair(j->nIdx + 1, icurT->nIdx + 1));
					++icurT;
				}
				else
					++icurP;
			}
		}
		else
			result.push_back(std::make_pair(curP.front().nIdx + 1, 0));
	}
	// Sort and de-duplicate the (profile, title) pairs, then print one line per profile
	std::sort(result.begin(), result.end());
	result.erase(std::unique(result.begin(), result.end()), result.end());
	std::vector<PAIR>::iterator it = result.begin();
	for (size_t p = 1; p <= arrProfile.size(); ++p) {
		std::cout << p << ": ";
		bool bFirst = true;
		for (; it != result.end() && it->first == p; ++it) {
			// A title number of 0 is only a placeholder meaning "no title matched"
			if (it->second == 0)
				continue;
			if (!bFirst)
				std::cout << ',';
			std::cout << it->second;
			bFirst = false;
		}
		std::cout << std::endl;
	}
	return 0;
}