Oulipo (Pattern Matching) -- 编程驿站 -- a blog for programming enthusiasts

Oulipo

Time Limit: 5000ms, Special Time Limit:12500ms, Memory Limit:65536KB

Problem description

The French author Georges Perec (1938-1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:
Tout avait l’air normal, mais tout s’affirmait faux. Tout avait l’air normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman: sur son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit: la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison: tout avait l’air normal mais …
Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive ‘T’s is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {‘A’, ‘B’, ‘C’, …, ‘Z’} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:
One line with the word W, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with 1 <= |W| <= 10,000 (here |W| denotes the length of the string W).
One line with the text T, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with |W| <= |T| <= 1,000,000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN

Sample Output

1
3
0

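/// Solution report
The task is clear: count how many times a given substring occurs in a string. For example, with
AZA
AZAZAZA
the text clearly contains AZA three times, at starting indices (0, 2, 4).
The data set is fairly large and many people get Time Limit Exceeded. So did I at first; only when
I came back to the problem some time later did I find the algorithm used here. I won't go over
ordinary KMP matching. The key point is how the text index and the pattern index change after
a full match. Most people assume the text index goes back to the character that was aligned
with the second character of the pattern, and the pattern index is reset to 0. For example, with
AZA
AZAZAZA
matching starts at j (pattern) = 0, i (text) = 0, and the first match is complete at j=2, i=2.
The usual approach then resets j=0, i=1 and continues matching...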
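
For comparison, here is a minimal sketch of that "reset and retry" approach. The function name count_naive and its signature are my own choices for illustration, not taken from the original submission. In the worst case it performs on the order of |W| * |T| character comparisons, which with the limits above can reach roughly 10^10, so a Time Limit Exceeded verdict is not surprising.

```c
#include <stddef.h>
#include <string.h>

/* Naive counting (hypothetical helper): try every start position in the
   text and restart the pattern index at 0 each time. Overlapping
   occurrences are counted. */
int count_naive(const char *word, const char *text)
{
    size_t wlen = strlen(word), tlen = strlen(text);
    int count = 0;

    if (wlen == 0 || wlen > tlen)
        return 0;
    for (size_t start = 0; start + wlen <= tlen; ++start) {
        size_t j = 0;   /* pattern index is reset to 0 for each start */
        while (j < wlen && text[start + j] == word[j])
            ++j;
        if (j == wlen)  /* full match at this start position */
            ++count;
        /* the text index then falls back to start + 1 */
    }
    return count;
}
```

Calling count_naive("AZA", "AZAZAZA") returns 3, matching the sample, but the text characters are re-read for every start position.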

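A better approach for this problem: imagine an extra character appended to the end of the pattern
(it does not actually need to be stored; we only treat it as logically present) that matches no
character at all. Then, right after a full match in the text, the next comparison is a mismatch, so
we proceed as in the ordinary matching algorithm: the text index stays where it is and the pattern
index is taken from the next table. To get the next value of the imaginary character, simply
compute one more entry after the table for the pattern itself is finished. This way the text is
scanned only once.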
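
To make the extra table entry concrete, the following sketch builds the table for a word of length len and fills next[0..len], where next[len] is the entry for the imaginary trailing character. The name build_next and the caller-supplied buffer are assumptions made for this illustration; the word is assumed to be non-empty, as the problem guarantees.

```c
#include <string.h>

/* Fill next[0..len], len = strlen(word). next[len] belongs to the
   imaginary character after the end of the word. The caller must
   provide room for len + 1 ints (hypothetical helper). */
void build_next(const char *word, int *next)
{
    int len = (int)strlen(word);
    int i = 0, j = -1;

    next[0] = -1;               /* sentinel: no shorter prefix to try */
    while (i < len) {
        if (j == -1 || word[i] == word[j]) {
            ++i;
            ++j;
            next[i] = j;        /* the last write is next[len] */
        } else {
            j = next[j];        /* fall back to a shorter prefix */
        }
    }
}
```

For AZA this produces the table {-1, 0, 0, 1}. The last value, 1, says that after a full match the final A can already serve as the first character of the next occurrence, so matching continues with j=1 and the text index never moves backwards.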

Even so, it took 1078ms. I don't know how the XXms submissions do it; I'll post the code
if I get hold of it...

// my code is as follows:

#include <stdio.h>
#include <string.h>
char gStr[1000002],gDest[10002];
int  gNext[10002];

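// Build the KMP next table for str, plus one extra entry:
// next[strlen(str)], the next value of the imaginary trailing character
void getNext(char str[],int next[]){
    int i,j;  // j must be signed: it takes the sentinel value -1
    // i<len is enough: the last write is next[len].
    // Note that strlen(str) is re-evaluated on every iteration here.
    for(next[0]=j=-1,i=0; i<(int)strlen(str);){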
        if(j==-1||str[i]==str[j]){
            ++i;
            ++j;
            next[i]=j;
        }
        else
            j=next[j];
    } 
}

// Count the occurrences of gDest in gStr
int getCount(){
    int i, j, count;  
    for(count=i=j=0; gStr[i];){
        if(j==-1||gStr[i]==gDest[j]){
            ++i;
            ++j;
            if(gDest[j]=='\0'){
                count++;
                j=gNext[j];  // use the imaginary character's next value
            }
        }
        else
            j=gNext[j];
    }  
    return count;
}

int main(){
    int n;
    scanf("%d",&n);
    while(n--){
        // arrays already decay to char*, so no & is needed; widths bound the reads
        scanf("%10001s %1000001s",gDest,gStr);
        getNext(gDest,gNext);   
        printf("%d\n",getCount());
    }
    return 0;
}

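//// Optimized like this, it ran more than ten times faster!!!
//// (strlen is now called once instead of in the loop condition, and the input is read line by line)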

#include <stdio.h>
#include <string.h>
char gStr[1000002],gDest[10002];
int  gNext[10002];

void getNext(){
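    int i,j,len=(int)strlen(gDest);  // compute the length once, outside the loop
    // i<len is enough: the last write is gNext[len], the imaginary character's entry
    for(gNext[0]=j=-1,i=0; i<len;){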
        if(j==-1||gDest[i]==gDest[j]){
            ++i;
            ++j;
            gNext[i]=j;
        }
        else
            j=gNext[j];
    } 
}

int getCount(){
    int i, j, count;  
    for(count=i=j=0; gStr[i];){
        if(j==-1||gStr[i]==gDest[j]){
            ++i;++j;
            if(gDest[j]=='\0'){
                count++;
                j=gNext[j];
            }
        }
        else
            j=gNext[j];
    }  
    return count;
}

int main(){
    int n;
    scanf("%d",&n);
    getchar();
    while(n--){
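        // gets() was removed in C11; fgets() is its bounded replacement
        if(!fgets(gDest,sizeof gDest,stdin)||!fgets(gStr,sizeof gStr,stdin))
            break;
        gDest[strcspn(gDest,"\r\n")]='\0';  // fgets keeps the newline, so strip it
        gStr[strcspn(gStr,"\r\n")]='\0';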
        getNext();   
        printf("%d\n",getCount());
    }
    return 0;
}


Oulipo (Pattern Matching), posted 2007-10-08 07:46:00
