Suffix Array

A suffix array is simply an array with the suffices of a string or array sorted. For example, the suffix array of the string "abracadabra" is ["a","abra","abracadabra","acadabra","adabra","bra","bracadabra","cadabra","dabra","ra","radabra"]. Since the suffix is entirely determined by the initial index, this is often stored as [10,7,0,3,5,8,1,4,6,9,2] to save space ($O(N)$ rather than $O(N^2)$).

Creation of a suffix array is a useful thing to be able to do quickly. Different methods of creating suffix arrays have different run times, ranging from $O(N)$ to $O(N^2\log N)$. For USACO, $O(N\log N)$ will always be sufficient as the faster algorithms are quite complex. The different methods for creating a suffix array are outlined below.

Creation Method Complexity
Naive $O(N^2\log N)$
Rabin-Karp Hashing $O(N(\log N)^2)$
Power of Two Creation (Comparison sorts) $O(N(\log N)^2)$
Power of Two Creation (Special sorts) $O(N\log N)$
Skew Sort $O(N\log N)$
Skew Sort (Radix Sort) $O(N)$
DFS on a suffix tree Various, as good as $O(N)$

# Creation Methods

## Naive Method

The Naive method is to simply take all the suffices and use an optimal comparison sort on them directly. However, since each comparison is $O(N)$ (it could have to check all characters for differences), the final runtime is $O(N^2\log N)$.

## Rabin-Karp Hashing

First, we do the $O(N)$ precomputatioon from Rabin-Karp Hashing which allows us to hash a substring in constant time. Then, in our comparison method, rather than checking linearly for the longest common prefix, we can binary search. For example, if we have the strings "abracadabra" and "abra", we start the binary search at half the shorter one: "ab" versus "ab". From Rabin-Karp hashing, we can check that these are the same in $O(1)$ time. We continue this way via binary search and find the longest common prefix in $O(\log N)$ time. Then we simply compare the next characters and return the result. With an $O(N\log N)$ comparison sort, this will run in $O(N(\log N)^2)$ time, which will often be sufficient since $\log N$ is small.

## Power of Two Creation

By way of example, consider the string "abracadabra".
We first append enough null characters after the string (which will be sorted before letters) so that we have a string twice as long: "abracadabra-----------".

Next, we want to sort all of the one character substrings. This is easily done in $O(N\log N)$ time. Now we want to store the order in which the indices appear, but with equal letters having the same value. So, for this example, we get [-,a,b,c,d,r], so the array that we store is [1,2,5,1,3,1,4,1,2,5,1,0,0,0,0,0,0,0,0,0,0,0].

Next, we sort the substrings of length 2. Suppose we have the substrings "xy" and "XY". First, we compare x and X. If they're the same, then we compare y and Y. We can do this quickly since we stored the value of all four in the previous step. These are O(1) comparisons.

We then continue sorting substrings of length $2^k$ by treating them as length 2 strings where the alphabet is all substrings of length $2^{k-1}$. This is why we padded the string. We want to be able to treat "bra" as the 8 character substring "bra-" at that step.

We do $O(\log N)$ sorts. If we use an optimal comparison sort, this gives a runtime of $O(N(\log N)^2)$. However, we can take advantage of the fact that we are sorting by lexicographical order in order to sort in $O(N)$ time.

So suppose that we have the array of length $2N$ containing the order of the substrings of length $2^{k-1}$. We do what is called a radix sort. You can use either Most Significant Digit (MSD) or Least Significant Digit (LSD) radix sorts, but I will outline LSD radix sort because it's easier. As an example, consider the following numbers:
[93, 12, 11, 52, 53, 89, 99, 56]
We first do a stable sort by the last digit:
[11, 12, 52, 93, 53, 56, 89, 99]
We then do a stable sort by the first digit:
[11, 12, 52, 53, 56, 89, 93, 99]
The numbers are now sorted.

Now, this has runtime $O(S(n)\cdot k)$ where $k$ is the length of the elements of the array and $S(n)$ is how long it takes to sort by a digit. In our case, $k$ is always $2$, so we can ignore it in our asymptotic analysis. Thus, we simply have to make $S(n) = O(N)$. To do this, we use a sort called bucket sort, which works as follows:

First, we create a bucket for each possible "letter" in our alphabet. If these are numbers, our "letters" are the digits. In the case of the suffix array, the "letters" are the previous substrings. We then iterate through the numbers, placing them in the correct bucket as we go. We then iterate over the buckets in order and recreate the array. By using a queue for each bucket, we obtain a stable sort. For example, consider sorting the array [93, 12, 11, 52, 53, 89, 99, 56] by last digit. We create a bucket for each possible digit. Then, iterating over the array, we obtain the following buckets:

0:
1: 11
2: 12, 52
3: 93, 53
4:
5:
6: 56
7:
8:
9: 89, 99

Recreating the array from the buckets, we get [11, 12, 52, 93, 53, 56, 89, 99], which is what we wanted in the radix sort above. This bucket sort is $O(N)$ when the number of buckets is also $O(N)$. This is the case in suffix arrays since we can make one bucket for each of the previous substrings. Since there are $O(N)$ indexes at which the substring can start, there are at most $O(N)$ "letters" in our alphabet. Thus, we have created the $O(N)$ sort we desire, and can then create a suffix array in $O(N\log N)$ time.

If we are careful about how we do things, we actually don't need to have buckets, as we can sort in-place (although an auxiliary array is needed). Instead of sorting within each group based on the suffix starting $2^k$ characters in front of each suffix, pre-compute the starting points of each group, then go through the suffices in order in the suffix array and insert the suffix beginning $2^k$ characters before each suffix in the appropriate group. This guarantees that they will be sorted and stay within the right groups (see the first implementation below).

## DFS on a suffix tree

We first create a suffix tree using various algorithms, which can be as fast as $O(N)$, but are usually $O(N\log N)$ when implemented in a contest. We then do an $O(N)$ DFS on the suffix tree to obtain our suffix array.

This is an $O(N\log N)$ suffix-array algorithm based on the "power of two" (doubling) method. It needs to store $6N$ integers in memory, so N up to about 500,000 will work given 16MB (an integer is 4 bytes).

#include <algorithm>
#include <stdio.h>

using namespace std;

#define MAX (200005)
//a bit longer than the size of the string we expect

int N,K; char c;
int arr[MAX], //the string
sa[MAX],  //suffix array
sa2[MAX], //auxiliary suffix array
grp[MAX], //keeps track of what group each thing is in
grp2[MAX], //auxiliary array
gs[MAX]; //keeps track of position to insert each group at

bool cmp(int i,int j){ return arr[i]<arr[j]; }

int main(){
FILE *fin = fopen("sa.in","r");
FILE *fout = fopen("sa.out","w");
fscanf(fin,"%d",&N); //size of string
for(int i=0;i<N;i++){
fscanf(fin,"%d",&arr[i]); //we assume for now the input is integsrs
sa[i]=i;
}

//divide the suffices into groups based on their first character
sort(sa,sa+N,cmp); //sort based on first character so we can determine the groups
int g = 0; gs[0]=0; //the first group starts at the beginning of the array
for(int i=1;i<N;i++){
if(arr[sa[i]]!=arr[sa[i-1]]){
g++; //if this suffix starts with a different character than the last one, start a new group
gs[g]=i;
}
grp[sa[i]]=g;
}

for(int d=1;d<N;d*=2){
/* Now repeatedly divide each group up based on the first
* 2^k characters. Within each group, we agree on the first
* 2^(k-1) characters, so sort within the group based on the
* next 2^(k-1) characters.
*/
for(int i=0;i<N;i++)
if(sa[i]>=d)
sa2[gs[grp[sa[i]-d]]++] = sa[i]-d;
/* We have to be careful to keep things within the same group.
* Go through each element in order in the suffix array, and
* insert the suffix starting 2^(k-1) characters behind it. This
* guarantees that we get everything within each group in order
* without having to worry about splitting up the groups.
*/

for(int i=N-d;i<N;i++)
sa2[gs[grp[i]]++] = i;
/* The last 2^(k-1) elements of the array now have to be sorted,
* since we missed them in the last loop. We consider the string
* to be terminated by a character that is lexicographically larger
* than everything else in the string, so contrary to convention 'a'
* comes after 'aa'. We can fix this if necessary by putting this loop
* before the other loop.
*/

g = 0; sa[0]=sa2[0]; grp2[sa[0]]=0; gs[0]=0;
for(int i=1;i<N;i++){ //now, re-assign groups to get ready for the next loop of the suffix array
if(grp[sa2[i]]!=grp[sa2[i-1]]||sa2[i]>=N-d||sa2[i-1]>=N-d||grp[sa2[i]+d]!=grp[sa2[i-1]+d]){
//the middle two statements above make sure we don't go beyond the bounds of the array
g++;
gs[g]=i; //gs keeps track of the beginning of each group
}
sa[i]=sa2[i]; grp2[sa[i]]=g;
}
for(int i=0;i<N;i++) grp[i]=grp2[i];
}

for(int i=0;i<N;i++) fprintf(fout,"%d\n",sa[i]);
fclose(fin); fclose(fout);
return 0;
}

page revision: 22, last edited: 18 May 2008 21:30