Sequential Prediction
(Added categories, I hope their are the appropriate ones) |
(→Algorithm at work: Step by step example of the algorithm matching a string) |
||
Line 36: | Line 36: | ||
This results in an O(n) algorithm, than only checks all previous occurrences of the element being added. With clever data structures the algorithm can run very fast. | This results in an O(n) algorithm, than only checks all previous occurrences of the element being added. With clever data structures the algorithm can run very fast. | ||
+ | |||
+ | == Algorithm at work == | ||
+ | Lets see how the algorithm works for the input string ''ABABBAB'' which is created incrementally. We start with an empty string and a new letter in each step. We will represent each step with 3 lines, the first one is the index in the string (1-based), the second is the current string and the third is the longest match size that ends at that character. If the last character in the log doesn't match, we will use a "*" as a match size to represent a mismatch but in real implementation the real match sizes in there can have some value, but they are irrelevant because we don't check them directly at the next step. The character being added in each step is represent as a bold letter. | ||
+ | |||
+ | {| border="1" style="border-collapse: collapse; font-size: 85%; color: black" | ||
+ | |align="left"| | ||
+ | !Index: | ||
+ | !0 | ||
+ | !1 | ||
+ | !2 | ||
+ | !3 | ||
+ | !4 | ||
+ | !5 | ||
+ | !6 | ||
+ | !Comment | ||
+ | |- | ||
+ | |N=1 | ||
+ | |String: | ||
+ | |'''A''' | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |The string was empty, nothing match here. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |* | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |- | ||
+ | |N=2 | ||
+ | |String: | ||
+ | |A | ||
+ | |'''B''' | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |This is the first B, no matches found. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |* | ||
+ | |* | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |- | ||
+ | |N=3 | ||
+ | |String: | ||
+ | |A | ||
+ | |B | ||
+ | |'''A''' | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |When checking the previous A(index 0) it realizes there isn't a B before it so the match size is 1. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |1 | ||
+ | |* | ||
+ | |* | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |- | ||
+ | |N=4 | ||
+ | |String: | ||
+ | |A | ||
+ | |B | ||
+ | |A | ||
+ | |'''B''' | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |When checking the previous B(index 1) it realizes there ''is'' an A(index 0) before it with a match size of 1, so the match size is 1 + 1 = 2. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |* | ||
+ | |2 | ||
+ | |* | ||
+ | |* | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | | | ||
+ | |- | ||
+ | |N=5 | ||
+ | |String: | ||
+ | |A | ||
+ | |B | ||
+ | |A | ||
+ | |B | ||
+ | |'''B''' | ||
+ | | | ||
+ | | | ||
+ | |When checking the previous B(index 3) it realizes there isn't a B before so the match size is 1. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |* | ||
+ | |1 | ||
+ | |* | ||
+ | |1 | ||
+ | |* | ||
+ | | | ||
+ | | | ||
+ | |When checking further back the previous B(index 1) it realizes there isn't a B before so the match size is 1. | ||
+ | |- | ||
+ | |N=6 | ||
+ | |String: | ||
+ | |A | ||
+ | |B | ||
+ | |A | ||
+ | |B | ||
+ | |B | ||
+ | |'''A''' | ||
+ | | | ||
+ | |When checking the previous A(index 2) it realizes there ''is'' a B(index 1) before it with a match size of 1, so the match size is 1 + 1 = 2. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |1 | ||
+ | |* | ||
+ | |2 | ||
+ | |* | ||
+ | |* | ||
+ | |* | ||
+ | | | ||
+ | |When checking further back the previous A(index 0) it realizes there isn't a B before so the match size is 1. | ||
+ | |- | ||
+ | |N=7 | ||
+ | |String: | ||
+ | |A | ||
+ | |B | ||
+ | |A | ||
+ | |B | ||
+ | |B | ||
+ | |A | ||
+ | |'''B''' | ||
+ | |The B at index 4 doesn't have an A before it so the match size is 1. | ||
+ | |- | ||
+ | | | ||
+ | |Size: | ||
+ | |* | ||
+ | |2 | ||
+ | |* | ||
+ | |3 | ||
+ | |1 | ||
+ | |* | ||
+ | |* | ||
+ | |The Bs at indexes 3 and 1, have an A before them with match sizes of 2 and 1 respectively, so the new match sizes are 2 + 1 = 3 and 1 + 1 = 2. | ||
+ | |} | ||
+ | |||
== Implementation == | == Implementation == |
Revision as of 00:02, 4 October 2009
Sequential prediction is a very fast Pattern Matching algorithm, it has a linear running time. While it matches it is able to find all matches along with their match size. The main draw back is that the values matched need to be discrete and capped, that can be achieved just like in Symbolic Pattern Matching but we can use integers instead of characters in a String if preferred.
Contents |
Credits
I got the idea and basic algorithm from:
- Fri Mommersteeg - Eindhoven University of Technology. Article published in the book Game Programming Wisom, 2002 under article Pattern Recognition with Sequential Prediction.
I tried to keep most of the original author's idea, but I biased it a little bit for Robocode.
Basics
- As in other pattern matching algorithms, assume we have a sequence of size N called log. For simplicity we will assume it's indexed form 1 to N.
- Let a match be a pair {i, k} such that log(i + j) equals log(N - k + j) for all j from 0 to k-1. We will call k the size of match.
- Let a longest match be a match with maximal size.
If we can find a longest match we already have more information than most current pattern matching algorithms (except for Single-Tick Pattern Matching), which usually have one or more fixed sizes to try. And if we can find it faster, then it's even better.
Consider the binary string BABBABAABAAABBBBABBBAABA, in this case the longest match is BAABA and has a size of 5. The pattern starts at index 6 of the log, therefore the match is called {6, 5}. A naive algorithm would look for the previous A, and then check if the value to it's left is a B, and so on. This algorithm takes O(n^2) to find the longest match, which is too slow for even relatively small log sizes.
Better Idea
The key point is take advantage from the fact that the sequence is being generated incrementally. When a new element is added the complete sequence remains the same, except for the tail. Exploiting this property we can find the longest match very fast.
- Let a match size of occurrence of A be the size of a match ending in A.
- Suppose we know all the match sizes of occurrences of A.
- Suppose the log ends with A and that we are adding a B.
Can we use this information to find all match sizes of occurrences of B? Yes we can, and is very simple too. For all occurrences of B we have one of two cases:
- The B is not preceded by A.
- The B is preceded by A.
In the first case it is obvious that the match size is 1. For the second case things could be trickier, but since we now all match sizes of occurrences of A, then is not that hard.
The string is something like "...AB...A", prior to the addition and it will become "...AB...AB". Therefor is easy to see that the match size of B is the match size of A(we know this value) plus 1. This comes from extending the substring that ended in A to include the B as we know it will match the new tail.
What if we are adding an A instead of a B, does the algorithm work? The answer is yes as long as you traverse the occurrences of A in reverse order as they appeared.
Now the only problem is the supposition of having all match sizes of occurrences of A, but since we start with an empty log the initialization is trivial. After initialized the algorithm is self-sustaining, so the supposition gives no problems at all.
This results in an O(n) algorithm, than only checks all previous occurrences of the element being added. With clever data structures the algorithm can run very fast.
Algorithm at work
Lets see how the algorithm works for the input string ABABBAB which is created incrementally. We start with an empty string and a new letter in each step. We will represent each step with 3 lines, the first one is the index in the string (1-based), the second is the current string and the third is the longest match size that ends at that character. If the last character in the log doesn't match, we will use a "*" as a match size to represent a mismatch but in real implementation the real match sizes in there can have some value, but they are irrelevant because we don't check them directly at the next step. The character being added in each step is represent as a bold letter.
Index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | Comment | |
---|---|---|---|---|---|---|---|---|---|
N=1 | String: | A | The string was empty, nothing match here. | ||||||
Size: | * | ||||||||
N=2 | String: | A | B | This is the first B, no matches found. | |||||
Size: | * | * | |||||||
N=3 | String: | A | B | A | When checking the previous A(index 0) it realizes there isn't a B before it so the match size is 1. | ||||
Size: | 1 | * | * | ||||||
N=4 | String: | A | B | A | B | When checking the previous B(index 1) it realizes there is an A(index 0) before it with a match size of 1, so the match size is 1 + 1 = 2. | |||
Size: | * | 2 | * | * | |||||
N=5 | String: | A | B | A | B | B | When checking the previous B(index 3) it realizes there isn't a B before so the match size is 1. | ||
Size: | * | 1 | * | 1 | * | When checking further back the previous B(index 1) it realizes there isn't a B before so the match size is 1. | |||
N=6 | String: | A | B | A | B | B | A | When checking the previous A(index 2) it realizes there is a B(index 1) before it with a match size of 1, so the match size is 1 + 1 = 2. | |
Size: | 1 | * | 2 | * | * | * | When checking further back the previous A(index 0) it realizes there isn't a B before so the match size is 1. | ||
N=7 | String: | A | B | A | B | B | A | B | The B at index 4 doesn't have an A before it so the match size is 1. |
Size: | * | 2 | * | 3 | 1 | * | * | The Bs at indexes 3 and 1, have an A before them with match sizes of 2 and 1 respectively, so the new match sizes are 2 + 1 = 3 and 1 + 1 = 2. |
Implementation
- Let alphabet be the set of all possible values in the sequence.
- Let histogram be an array of size |alphabet| of lists of Entry.
class Entry { int index; // Index into the sequence. int value; // Discrete value being matched. int match_size; // Match size of this occurrence. Entry getPreviousEntry(); // Method to obtain the Entry that came right before this one. } function SequentialPrediction(int next_element) { List<Entry> ocurrences = histogram[next_element]; int longest_match_size = 0; int longest_match_tail = 0; for each entry in ocurrences do { Entry previous = entry.getPreviousEntry(); // Find the match size if ( previous.value == last_value ) entry.match_size = previous.match_size + 1; else entry.match_size = 1; // Update longest match if ( entry.match_size > longest_match_size ) { longest_match_size = entry.match_size; longest_match_tail = entry.index; } } UpdateSequence(next_element); UpdateHistogram(next_element); last_value = next_element; if ( longest_match_size > 0 ) { Start prediction at Sequence[longest_match_tail + 1]; } }
When iterating trough the entries in the occurrences list, remember to go trough them from last to first. Or you can add new entries to the beginning of the list when updating the histogram.
Multiple-Choice
Since the algorithm find all partial matches, instead of updating the index and size of the longest match we can use them all (or some of the best ones) for a multiple choice pattern matcher.
Weighting
For the multiple choice algorithm, choices may be weighted by one or more of the following with minimal complexity added:
- Match size.
- Distance to sequence tail (very good for fast adapting dodgers).
- Similar weighting as found in Dynamic Clustering algorithms.