
Oulipo (pattern matching) 2007-10-08 07:46:00

Permalink: http://blog.pfan.cn/lingdlz/29948.html

Oulipo
Time Limit: 5000ms, Special Time Limit: 12500ms, Memory Limit: 65536KB
Problem description
The French author Georges Perec (1938-1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:
Tout avait l’air normal, mais tout s’affirmait faux. Tout avait l’air normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman: sur son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit: la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison: tout avait l’air normal mais …
Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive ‘T’s is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {‘A’, ‘B’, ‘C’, …, ‘Z’} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.


Input
The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:
One line with the word W, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with 1 <= |W| <= 10,000 (here |W| denotes the length of the string W).
One line with the text T, a string over {‘A’, ‘B’, ‘C’, …, ‘Z’}, with |W| <= |T| <= 1,000,000.

Output
For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Sample Input
3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN
Sample Output
1
3
0
 
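/// Solution report
The task is clear: count how many times a given word occurs as a substring of a text. For example, with
AZA
AZAZAZA the text clearly contains three occurrences of AZA, starting at indices (0, 2, 4).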
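To make the overlapping-occurrence example concrete, here is a minimal sketch that lists the start indices using the standard library's strstr. The helper name find_starts and its output-array interface are assumptions made for illustration; this is not the method used in the solution further down.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): writes up to max_out
   start indices of word in text into out, and returns the total number of
   occurrences, overlapping ones included. */
size_t find_starts(const char *text, const char *word,
                   size_t *out, size_t max_out) {
    size_t count = 0;
    const char *p = text;
    if (word[0] == '\0')          /* the problem guarantees |W| >= 1 */
        return 0;
    while ((p = strstr(p, word)) != NULL) {
        if (count < max_out)
            out[count] = (size_t)(p - text);
        ++count;
        ++p;  /* step by one character, not by strlen(word), to keep overlaps */
    }
    return count;
}
```

Called with the text AZAZAZA and the word AZA, this reports three occurrences at indices 0, 2 and 4. Its running time depends on the library's strstr, so it is meant only as a reference for checking results.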
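The input is fairly large and many people got Time Limit Exceeded. So did I at first; only when I came back to the problem some time later did I find the algorithm used here. I won't go over ordinary KMP matching. The key point is how the text index and the word index change after a complete match. Most people assume the text index should go back to the second position of the match just found, while the word index is reset to 0. For example:
AZA
AZAZAZA: matching starts with j (word index) = 0 and i (text index) = 0, and the match completes at j=2, i=2.
The usual approach then resets j=0, i=1 and continues matching...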
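The restart strategy just described can be sketched as follows. The function name count_naive is hypothetical, and the code is an illustration of the "go back and reset j to 0" idea, not the author's submission.

```c
#include <stddef.h>

/* Hypothetical name; illustrates the restart strategy only.
   After every attempt, successful or not, matching restarts one position
   to the right of the previous start, with the word index back at 0. */
long count_naive(const char *text, const char *word) {
    long count = 0;
    if (word[0] == '\0')          /* the problem guarantees |W| >= 1 */
        return 0;
    for (size_t start = 0; text[start] != '\0'; ++start) {
        size_t j = 0;
        /* text[start + j] == '\0' never equals a word character,
           so the scan cannot run past the end of the text */
        while (word[j] != '\0' && text[start + j] == word[j])
            ++j;
        if (word[j] == '\0')      /* reached the end of the word: full match */
            ++count;
    }
    return count;
}
```

In the worst case this performs on the order of |T| * |W| character comparisons, for instance when both strings are long runs of the same letter, which is exactly the kind of input the problem statement warns about.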
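A better approach for this problem: imagine one extra character appended to the end of the word (it does not have to be stored; we only treat it as logically present) that matches no character at all. Then, right after a complete match in the text, the next comparison is a mismatch, so we proceed exactly as in ordinary KMP: the text index stays where it is and the word index is taken from the next table. Obtaining the next value of this virtual character simply means computing one more entry after the table for the word itself is finished. This way the text only has to be scanned once.
Even so, it still took 1078ms. I don't know how the people with XX ms did it; I'll post their code when I get hold of it...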
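As a small worked example of the extra table entry, the sketch below builds the next table with one additional slot at index strlen(word) for the virtual character. The name build_next is an assumption introduced here; the logic follows the same -1-based convention as the code in this post.

```c
#include <string.h>

/* Hypothetical helper: fills next[0..len], where len = strlen(word).
   next[len] is the entry for the virtual character after the word:
   the length of the longest proper prefix of the word that is also a
   suffix of it. The caller must provide room for len + 1 ints. */
void build_next(const char *word, int *next) {
    int len = (int)strlen(word);
    int i = 0, j = -1;
    next[0] = -1;
    while (i < len) {
        if (j == -1 || word[i] == word[j]) {
            ++i;
            ++j;
            next[i] = j;
        } else {
            j = next[j];          /* fall back to a shorter prefix */
        }
    }
}
```

For the word AZA this yields next = {-1, 0, 0, 1}. The last value, 1, means that after a full match the word index continues from 1: the final A of the match is reused as the first A of a possible next occurrence, which is how the overlapping matches in AZAZAZA are found without moving the text index backwards.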
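// My code is as follows:
#include <stdio.h>
#include <string.h>
char gStr[1000002],gDest[10002];
int  gNext[10002];
// Build the KMP next table for str, plus one extra entry next[strlen(str)]
// for the virtual character after the end of the word.
void getNext(char str[],int next[]){
    int i,j;  // signed, because j takes the value -1
    // Note: strlen(str) is re-evaluated on every iteration of this loop.
    for(next[0]=j=-1,i=0; i<(int)strlen(str);){
        if(j==-1||str[i]==str[j]){
            ++i;
            ++j;
            next[i]=j;
        }
        else
            j=next[j];
    }
}
// Count the occurrences of gDest in gStr (overlaps included).
int getCount(){
    int i, j, count;
    for(count=i=j=0; gStr[i];){
        if(j==-1||gStr[i]==gDest[j]){
            ++i;
            ++j;
            if(gDest[j]=='\0'){
                count++;
                j=gNext[j];  // use the next value of the virtual character
            }
        }
        else
            j=gNext[j];
    }
    return count;
}
int main(){
    int n;
    if(scanf("%d",&n)!=1)
        return 0;
    while(n--){
        // Pass the arrays themselves (not their addresses) and bound the reads.
        if(scanf("%10001s %1000001s",gDest,gStr)!=2)
            break;
        getNext(gDest,gNext);
        printf("%d\n",getCount());
    }
    return 0;
}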
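//// Optimized like this, it ran more than ten times faster!
//// Two things change: the word length is computed once instead of in the loop condition, and the input is read line by line.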
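#include <stdio.h>
#include <string.h>
char gStr[1000004],gDest[10004];  // room for the "\r\n" that fgets may keep
int  gNext[10002];
// Build the next table for gDest, including the entry for the virtual character.
void getNext(){
    int i,j,len=(int)strlen(gDest);  // length computed once, outside the loop
    for(gNext[0]=j=-1,i=0; i<len;){
        if(j==-1||gDest[i]==gDest[j]){
            ++i;
            ++j;
            gNext[i]=j;
        }
        else
            j=gNext[j];
    }
}
// Count the occurrences of gDest in gStr (overlaps included).
int getCount(){
    int i, j, count;
    for(count=i=j=0; gStr[i];){
        if(j==-1||gStr[i]==gDest[j]){
            ++i;
            ++j;
            if(gDest[j]=='\0'){
                count++;
                j=gNext[j];  // use the next value of the virtual character
            }
        }
        else
            j=gNext[j];
    }
    return count;
}
// Read one line and strip the line ending; replaces gets(), which was removed in C11.
static int readLine(char *buf,int size){
    if(!fgets(buf,size,stdin))
        return 0;
    buf[strcspn(buf,"\r\n")]='\0';
    return 1;
}
int main(){
    int n,c;
    if(scanf("%d",&n)!=1)
        return 0;
    while((c=getchar())!='\n'&&c!=EOF)
        ;  // skip the rest of the line holding the case count
    while(n--){
        if(!readLine(gDest,sizeof gDest)||!readLine(gStr,sizeof gStr))
            break;
        getNext();
        printf("%d\n",getCount());
    }
    return 0;
}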
