How to Read Source Code, Part 4 (repost) -- 业余空间 -- 编程爱好者博客

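After parse_record has parsed the record, the program analyzes the date: it converts the month and the other date fields in the log into machine-readable values and stores them in log_rec.

    if ((i>=12)||(rec_min>59)||(rec_sec>59)||(rec_year<1990))
    {
       total_bad++;                /* if a bad date, bump counter      */
       if (verbose)
       {
          fprintf(stderr,"%s: %s [%lu]",
                  msg_bad_date,log_rec.datetime,total_rec);
          ......

If the date or time is invalid, the total_bad counter is incremented and an error message is printed to standard error.

    good_rec = 1;

    /* get current records timestamp (seconds since epoch) */
    req_tstamp=cur_tstamp;
    rec_tstamp=((jdate(rec_day,rec_month,rec_year)-epoch)*86400)+
                 (rec_hour*3600)+(rec_min*60)+rec_sec;

    /* Do we need to check for duplicate records? (incremental mode)   */
    if (check_dup)
    {
       /* check if less than/equal to last record processed            */
       if ( rec_tstamp <= cur_tstamp )
       {
          /* if it is, assume we have already processed and ignore it  */
          total_ignore++;
          continue;
       }
       else
       {
          /* if it isn't.. disable any more checks this run            */
          check_dup=0;
          /* now check if it's a new month                             */
          if (cur_month != rec_month)
          {
             clear_month();
             cur_sec   = rec_sec;          /* set current counters     */
             cur_min   = rec_min;
             cur_hour  = rec_hour;
             cur_day   = rec_day;
             cur_month = rec_month;
             cur_year  = rec_year;
             cur_tstamp= rec_tstamp;
             f_day=l_day=rec_day;          /* reset first and last day */
          }
       }
    }

    /* check for out of sequence records */
    if (rec_tstamp/3600 < cur_tstamp/3600)
    {
       if (!fold_seq_err && ((rec_tstamp+SLOP_VAL)/3600<cur_tstamp/3600) )
          { total_ignore++; continue; }
       else
       {
          rec_sec   = cur_sec;             /* if folding sequence      */
          rec_min   = cur_min;             /* errors, just make it     */
          rec_hour  = cur_hour;            /* look like the last       */
          rec_day   = cur_day;             /* good records timestamp   */
          rec_month = cur_month;
          rec_year  = cur_year;
          rec_tstamp= cur_tstamp;
       }
    }
    cur_tstamp=rec_tstamp;                 /* update current timestamp */

If the date and time are valid, the record is a good one: the good_rec flag is set to 1, and the program then checks the timestamp to see whether the record is a duplicate or out of sequence. The function jdate() appears here. We met it at the very start of the main program and skipped over it without digging in; it is left to the reader as an exercise. (Hint: the function maps a date to a unique number. The arithmetic above subtracts epoch from it and multiplies by 86400, so it is a day count. That makes it usable for detecting repeated timestamps, and it is a general-purpose function that can be reused in other programs.)

    /*********************************************/
    /* DO SOME PRE-PROCESS FORMATTING            */
    /*********************************************/

    /* fix URL field */
    cp1 = cp2 = log_rec.url;
    /* handle null '-' case here... */
    if (*++cp1 == '-') { *cp2++ = '-'; *cp2 = '\0'; }
    else
    {
       /* strip actual URL out of request */
       while ( (*cp1 != ' ') && (*cp1 != '\0') ) cp1++;
       if (*cp1 != '\0')
       {
          /* scan to begin of actual URL field */
          while ((*cp1 == ' ') && (*cp1 != '\0')) cp1++;
          /* remove duplicate / if needed */
          if (( *cp1=='/') && (*(cp1+1)=='/')) cp1++;
          while ((*cp1 != ' ')&&(*cp1 != '"')&&(*cp1 != '\0'))
             *cp2++ = *cp1++;
          *cp2 = '\0';
       }
    }

    /* un-escape URL */
    unescape(log_rec.url);

    /* check for service (ie: http://) and lowercase if found */
    if ( (cp2=strstr(log_rec.url,"://")) != NULL)
    {
       cp1=log_rec.url;
       while (cp1!=cp2)
       {
          if ( (*cp1>='A') && (*cp1<='Z')) *cp1 += 'a'-'A';
          cp1++;
       }
    }

    /* strip query portion of cgi scripts */
    cp1 = log_rec.url;
    while (*cp1 != '\0')
       if (!isurlchar(*cp1)) { *cp1 = '\0'; break; }
       else cp1++;
    if (log_rec.url[0]=='\0')
       { log_rec.url[0]='/'; log_rec.url[1]='\0'; }

    /* strip off index.html (or any aliases) */
    lptr=index_alias;
    while (lptr!=NULL)
    {
       if ((cp1=strstr(log_rec.url,lptr->string))!=NULL)
       {
          if ((cp1==log_rec.url)||(*(cp1-1)=='/'))
          {
             *cp1='\0';
             if (log_rec.url[0]=='\0')
                { log_rec.url[0]='/'; log_rec.url[1]='\0'; }
             break;
          }
       }
       lptr=lptr->next;
    }

    /* unescape referrer */
    unescape(log_rec.refer);
    ......

This section cleans up the URL string: it extracts the URL from the request, un-escapes it, lowercases the service prefix, strips the query portion, and removes index.html and its aliases. It is long. In my opinion it should be moved into a function for the sake of modularity, structure and reusability; left inline, it makes the main program body too long, which hurts readability and portability. Let us skip this tedious code and move on to the next part: post-processing.

    if (gz_log) gzclose(gzlog_fp);
    else if (log_fname) fclose(log_fp);

    if (good_rec)                             /* were any good records?   */
    {
       tm_site[cur_day-1]=dt_site;            /* If yes, clean up a bit   */
       tm_visit[cur_day-1]=tot_visit(sd_htab);
       t_visit=tot_visit(sm_htab);
       if (ht_hit > mh_hit) mh_hit = ht_hit;

       if (total_rec > (total_ignore+total_bad)) /* did we process any?   */
       {
          if (incremental)
          {
             if (save_state())                /* incremental stuff        */
             {
                /* Error: Unable to save current run data */
                if (verbose) fprintf(stderr,"%s ",msg_data_err);
                unlink(state_fname);
             }
          }
          month_update_exit(rec_tstamp);      /* calculate exit pages     */
          write_month_html();                 /* write monthly HTML file  */
          write_main_index();                 /* write main HTML file     */
          put_history();                      /* write history            */
       }

       end_time = times(&mytms);              /* display timing totals?   */
       if (time_me || (verbose>1))
       {
          printf("%lu %s ",total_rec, msg_records);
          if (total_ignore)
          {
             printf("(%lu %s",total_ignore,msg_ignored);
             if (total_bad) printf(", %lu %s) ",total_bad,msg_bad);
             else printf(") ");
          }
          else if (total_bad) printf("(%lu %s) ",total_bad,msg_bad);
          /* get processing time (end-start) */
          temp_time = (float)(end_time-start_time)/CLK_TCK;
          printf("%s %.2f %s", msg_in, temp_time, msg_seconds);
          /* calculate records per second */
          if (temp_time)
             i=( (int)( (float)total_rec/temp_time ) );
          else i=0;
          if ( (i>0) && (i<=total_rec) ) printf(", %d/sec ", i);
          else printf(" ");
       }

This section does the post-processing: it closes the log file, saves the state for incremental runs, writes the HTML reports and the history, and prints timing totals. I will skip the rest of the program in this article and leave it to interested readers to analyze, for two reasons:

1. The program is well structured in its earlier parts but becomes somewhat messy later on. The code is still fairly efficient, but it is not very reusable, and for reasons of space I will not explain it piece by piece.
2. While analyzing the earlier parts we already made some predictions about the later code and touched on parts of it. Readers can analyze the remainder themselves using the principles described above, as practice.

Finally, to close this article, here is a summary of the methods for analyzing source code that it has mentioned. An effective way to analyze a piece of source code is:

1. Read the documentation that comes with the source, such as the README in this example. The author wrote it in great detail. After reading it carefully, you can often find the relevant explanation in the README while reading the program, which simplifies the work.
2. If the source has a documentation directory, usually doc or docs, read that carefully too before reading the code, because those documents serve equally well as explanatory notes.
3. Start from the makefile and analyze the layering of the source: find out which file is the main program and which are function libraries. This helps a great deal in grasping the program structure quickly.
4. Start from the main function and read downward step by step. Simple functions whose meaning you can guess can be skipped. But pay close attention to the global variables the program uses (if it is a C program); you can copy the declarations of the key data structures into a text editor so that you can look them up at any time.
5. When analyzing function libraries (for C programs), note which functions are global and which are used only internally, and watch for the extern keyword. The same applies to variables. Analyze the internal functions first and then the external ones, because the internal functions are certainly called from the external ones.
6. Data structures deserve special emphasis. In a C program, all functions operate on the same data, and because there is little encapsulation, that data can appear anywhere in the program and be modified by any function. So pay attention to the definition and meaning of the data, to which functions operate on it, and to what changes they make.
7. While reading a program, it is best to check it into a version control system such as cvs, so that you can experiment with modifications to the source when needed; modifying the code yourself is a far better way to read a program than reading alone. After modifying and running the program, you can retrieve the original code from cvs and compare it with your changes (with the diff command). This shows you some of the strengths and weaknesses of the source and gives you real practice in programming.
8. While reading, make use of small tools that speed things up, such as searching, pattern matching and marks in vi, and grep and find, the two most powerful and most commonly used text search tools.

Programs that run from the command line under Unix/Linux follow some common patterns, which you can use as a reference while reading:

1. The beginning of the program usually parses the command line and assigns values to variables, arrays or structures according to the arguments. The rest of the program then acts differently depending on those variables.
2. After parsing the command line, the program prepares its data, typically clearing counters, zeroing structures and so on.
3. The body of the program contains some conditional compilation options; the corresponding parts can be found in the makefile.
4. Note how the program handles logging and what it does when debugging options are turned on. These are a great help when debugging the program.
5. Note how multiple threads operate on the data. (This does not arise in this example.)

Closing remarks: of course, this article does not cover every method and technique for reading source code. It does not discuss any auxiliary tools (apart from a simple text editor), nor does it discuss how to read object-oriented programs. I would like to leave these for a later discussion, and everyone is welcome to discuss these topics.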
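As a starting point for the jdate() exercise mentioned above, here is a minimal sketch of how a date can be mapped to a unique day count and then to a record timestamp. It is not Webalizer's own jdate(): the names day_number and record_timestamp are hypothetical, the day count uses a standard proleptic Gregorian conversion, and the epoch is assumed to be 1970-01-01, so the "- epoch" step from the original expression is already folded into the result.

```c
#include <assert.h>

/* Hypothetical stand-in for jdate(): days since 1970-01-01 for a
 * Gregorian calendar date. Argument order mirrors jdate(day, month, year). */
long day_number(int day, int month, int year)
{
    int y, era, yoe, mp, doy, doe;

    assert(month >= 1 && month <= 12);
    assert(day >= 1 && day <= 31);

    y   = year - (month <= 2);              /* count Jan/Feb with the previous year */
    era = (y >= 0 ? y : y - 399) / 400;     /* 400-year cycle index                 */
    yoe = y - era * 400;                    /* year within the cycle, 0..399        */
    mp  = (month > 2) ? month - 3 : month + 9;   /* March = 0 ... February = 11     */
    doy = (153 * mp + 2) / 5 + day - 1;     /* day of the March-based year          */
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; /* day within the cycle            */

    return era * 146097L + doe - 719468L;   /* shift so that 1970-01-01 is day 0    */
}

/* Seconds since the assumed epoch, following the shape of the original
 * expression: days*86400 + hours*3600 + minutes*60 + seconds. */
long record_timestamp(int day, int month, int year,
                      int hour, int min, int sec)
{
    return day_number(day, month, year) * 86400L
           + hour * 3600L + min * 60L + sec;
}
```

Because every date maps to a distinct, increasing number, two records can be compared with a plain integer comparison, which is what the duplicate check and the out-of-sequence check rely on.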

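The suggestion above, to move the URL cleanup out of the main loop and into a function, could look roughly like the following. This is only a sketch of the first step (pulling the URL out of the quoted request field). The name extract_url and its signature are hypothetical, and unlike the original it copies into a separate bounded buffer instead of rewriting the field in place.

```c
#include <stddef.h>

/* Hypothetical helper: copy the URL part of a quoted request field such as
 * "\"GET /a/b.html HTTP/1.0\"" into out, writing at most outsz-1 characters.
 * Returns 0 on success, -1 if the arguments are unusable. */
int extract_url(const char *request, char *out, size_t outsz)
{
    const char *src;
    size_t n = 0;

    if (request == NULL || out == NULL || outsz < 2) return -1;

    src = request;
    if (*src == '"') src++;                      /* skip the opening quote      */

    if (*src == '-') {                           /* null request logged as "-"  */
        out[0] = '-';
        out[1] = '\0';
        return 0;
    }

    while (*src != ' ' && *src != '\0') src++;   /* skip the request method     */
    while (*src == ' ') src++;                   /* skip the separating spaces  */
    if (src[0] == '/' && src[1] == '/') src++;   /* drop a duplicate leading /  */

    /* copy up to the next space, closing quote, end of string, or full buffer */
    while (*src != ' ' && *src != '"' && *src != '\0' && n + 1 < outsz)
        out[n++] = *src++;
    out[n] = '\0';

    if (n == 0) {                                /* an empty URL becomes "/"    */
        out[0] = '/';
        out[1] = '\0';
    }
    return 0;
}
```

With the step isolated like this, the main loop would shrink to a single call per stage, and each stage could be tested on its own with a handful of sample request strings.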

How to Read Source Code, Part 4 (repost), 2006-12-06 22:51:00
