In computer science, the Knuth-Morris-Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re0examination of previously matched characters.
The algorithm was conceived in 1970 by Donald Knuth and Vaughan Pratt, and independelty by James H.Morris.css
The most straightforward algorithm is to look for a character match at successive values of the index m, the posisiotn in the string being searched, ie. S[m]. If the index m reaches the end of the string then there is no match, in which case the search is said to fial. At each position m the algorithm first checks for equality of the first character in the word being searched, I.e. S[m] = ? W[0]. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, I. The algorithm retrieves the character W[I] in the word being searched and checks for equality of the expression S[m+i] = ? W[I]. If all successive characters match in W at position m, then a match is found at that position in the search string.
Usually, the trail check will quickly reject the trail match. If the strings are uniformly distributed random letters, then the chance that charaters match is 1 in 26. In most cases, the tral check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26^2 ( 1 in 676). So if the characters are random, then the expected complexity of searching string S[] of length k is on the order of k comparisons or O(k). The expected performance is very good. If S[] is 1 billion characters and W[] is 100 characters, then the string search should complete after about one billion character comparisons.
The expected peroframnce is not guaranteed. If the strings are not random, then checking a trial m may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string S[] consists of 1 billion characters that are all A, and that the word W[] is 999 A characters terminating in a final B character. The simple string matching algorithm will now examine 1000 characters at each trail position before rejecting the match and advancing the trail position. The simple string search example would now take about 1000 character comparisons times 1 billion positions for 1 trillion character comparisons. If the length of W[] is n, then the worst-case performance is O(kn)html
The KMP algorithm has a better worst-case performance than the straightforward algorithm. KMP spends a little time precomputing a table, and then it uses that table to do an efficient search of the string in O(k)shell
The difference is that KMP makes use of previous match information that the straightforward algorithm does not.express
We assume the existence of a "partial match" table T, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that if we have a match starting at S[m] that fails when comparing S[m + I] to W[I], then the next possible match will start at index m + I - T[i] in S. This has two implications: first, T[0] = -1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the next character. and second, although the next possible match will begin at index m + I - T[i], as in the example above, we need not actually check any of the T[I] characters after that, so that we continue searching from W[T[I]]. The following is a sample pseudocode implementation of the KMP search algorithm.bash
algorithm kmp_search:dom
input: an array of characters, S (the text to be searched) an array of characters, W (the word sought) output: an integer (the zero-based position in S at which W is found) define variables: an integer, m <- 0 (the beginning of the current match in S)
egrep cs246 index.html
[]:= pattern
|:= or, h|c matches h or c. cat|dog matches cat or dog
usage: regexpattern file
* := 0 or more repetitions of the preceding subexpression
(cs)*246 := ex. 246, cs246, cscs246
cs*246 := ex. c246, cs246, css246
for * to not have a special meaning, escape it by using
cs\*246
?:= 0 or 1 of proceeding subexpression
ex. cs ?246:= cs246, cs 246 (cs )?246:= 246, cs 246
+:= 1 or more of the proceeding subexpression
.:= any one characters
.*: any number of any characters
ex. cs.*246:= matches lines that contain a string starting with cs & ending with 246oop
use ^ to indicate pattern should match start of line
use $ to indicate pattern must end with end of line
"^cs.*246$"
^(cs.*)246$:= 246, csb246, csbcsa246ui
e.g. Fetch lines of even length
(..)* is not enough as you need to say the pattern start and end with (..)*
correct: ^(..)*$
e.g. files in the current directory that contain exactly one a
ls| egrep "^[^a]a[^a]*$" if "" is not added then will be interpreted as a glopping item
words in the dictionary containing exactly 5 characters
egrep ".{5}" filethis
ls -l:= long listing of pwd
--rwxr--xr-- 1 naneem staff 28 Jan10 abcd.txt
type|user bits| group bits | others bits shortcuts owners group size last modified filename
w: write
x: execute as a program
r: read
group bits: permissions all others members of group have
user bits: permissions that the owner has
others bits: permissions that all others(people not in group staff in this case) hasspa
r for directory:= read contents
w for directory:= write contents
x for directory:= enters a directory
only the owner can change permissions
chmod mode file(s)
mode:= ownership class:= U_users, g_group, o_others, a_all
mode:= operator:= +, -, =
mode:= r, w, x
425
100, 010, 101
r--, -w-, r-x
vairables
$> x=1
variable=value
usage: $x (beter ${x})
In comparison: $(...) is the embedded command line
e.x. x=1
echo $x
y = 1, not working because of space
values in shell are stored as strings
dir=~/1171/a1
cd $dir
echo '$dir': single quote won't let embedded code to excecute
Script: text file with Linux commands that are executed as a program
#!/bin/bash:= shebang line change this file into command line file
e.x. File basic
chmod a+x basic
Usage: basic, doesn't work, tell the shell where to find it for example ./basic
tells the shell that basic is in the current directory
Arguments to scripts: e.g. is a word in the dictionary
usage: ./isItAWord($0) hello($1)
cat isItAWord:= egrep "^$1$" /usr/share/dict/words
usage () {
echo "Usage: $0 password" >&2 (find in piazza| redirecting) exit 1
}
Using $0 so that if someone change the file name, it still work
${#} := number of args to script
if [${#} -ne 1]; then
usage
fi
syntax for if: if [condition]; then statement elif [condition]; then statement else statement /n fi
egrep "^$1$" /usr/share/dict/words > /dev/null
Loops
e.g. Print numbers 1 to $1
x = 1
while [$x -le $1]; do
echo ${x} x = $((x+1))
done
e.x. for x in a b c d; do echo $x; done
File: RenameC
change all .C files to .cc
mv Hello.C Hello.cc
filename = Hello.C mv ${filename} ${filename%C}cc
for name in *.C; do
mv ${name} ${name%C}cc
done
eg. count the number of occurrence of $1 in file $2
count = 0
for word in $(cat $2); do
if [${word}] == $1]; then count = $((count + 1)) fi
doneecho ${count}usage: ./countWords "To be" ../sample.txt "" prevent the shell from treat to be as two arguments