Tuesday, November 13, 2012

Word Frequency Counter Using Heap

This is an assignment from our Data Structure class. However, the deadline was passed already, and here I post my code and the report. This may be an example code for heap algorithm.

The original problem of the assignment is to count the word in an article to see the frequency of each word. And show out the first k in the rank.


Algorithm

            Heap is used in this program, so that we don’t have to sort the strings by their frequencies before showing the result.
            There are two main operations for heap in this program. One is “add_if_exists()”, another is “add()”. “add_if_exists()” checks if the new input word is already in the heap or not. If yes, this function adds one on its frequency and bubble up this node so that the heap tree remain correct; if no, it returns false, and the program will call “add()”. On the other hand, “add()” is the function to add new word in the heap, which is not seem before.
            To implement the heap in this program, I built an array of structure nodes. For each node, there is a word (character array) and an integer, which records the frequency of the word.

Program Structure

Pseudo code as below,
while( read one word, exits if end of file ){
            if( input word is empty ) skip
            if( input word exists already ){
                        add input word to the current heap
                        reheap
}
else{
            add input word as a new node in the heap
}
}
while( k times ){
            remove the max node from the heap and show it as the result
}
            

Analysis

  • For adding the new word which is not in the heap yet, we don’t have to reheap, because the new word must come with frequency = 1 which becomes the smallest node in the heap.
  • Time Complexity: say n is the amount of words in the text file, in the worst case (all the n words is not the same), time complexity is O(n2). Most of the time spends on searching if the word is already in the heap or not. However, this can be improved be adding “hash table” in the program, which is not done yet my code.
  • Space Complexity: space used in this code is fixed. Adjust the size in the head of the code to different usage. Depends on the amount of words in the article, and the longest characters in one word.


Program Elements

Functions relates to the heap:
  • add_if_exist() Search the heap array to see if the new input word is already in the heap or not. If yes, add one on its frequency, then reheap and return TRUE; if no, return FALSE. 
  • add() Add the new input word at the end of the heap when it is sure that the word is not in the heap yet.
  • reheap() Recheck the heap tree. If it’s not correct, then fix it.
  • reheap_from()After adding one on the frequency of a curtain node, we bubble up the node to make sure the heap is correct.
  • remove_max()To move the root, we exchange the root with the smallest node then reheap the heap tree. Return the original root node. 
Functions relates to the heap tree in array:
  • int heap_left(int root) {return (root*2+1);} 
  • int heap_right(int root) {return (root*2+2);} 
  • int heap_parent(int child) {return (child-1)/2;}


Result


  • Test on Darwin Kernel Version 12.2.0, GCC 4.8.0. 
  • The test text file is downloaded for the assignment.

 
  • Testing on school’s Linux machine, FreeBSD 8.3-RELEASE-p3, GCC 4.2.2.
  • The test text file contains 33705 words, which is larger than the last one.


Other

  • To make the program run faster, we can add “-O3” as the argument of GCC, which ask the complier to do some optimisations for speed. Some versions of GCC even offer “-Ofast” selection. 
  • In the future improvement, the searching part, it can be done by using hash table, which tells the location of a word in the heap tree in the constant of time. Just like the “dictionary” in some programming languages, which is not built-in in C.

Code


/* 
 * Program    : Word Frequency Counter 
 * Author    : Heron Yang @ NCTU 
 * Class    : Data Structure 
 * Date        : 2012/11/06 
 * 
 * By using the heap tree in array, we can calculate 
 * the frequency on words in an article in a short 
 * time. 
 */ 

#include <stdio.h> 
#include <string.h> 
#include <stdlib.h> 
#include <time.h> 

#define FALSE    0
#define TRUE    1

#define LENGTH    100 
#define NODES    1000000 

/* 
 * Structure Declaration 
 * Word_node structure includes the original word, 
 * and its frequency in the article so far. 
 */ 
struct Word_node{ 
    char word[LENGTH]; 
    int freq; 
}heap[NODES]; 

int node_ind = 0; 

// Character Operation Related Functions 
int isDigit(char c); 
int isAlpha(char c); 
char toLower(char c); 

// Self-Completed Function Declarations 
int add_if_exists(char *input);
void add(char *input); 
void reheap_from(int child);
void reheap(int root); 
void swap_node(int a, int b);

// Functions for Heap Operations 
int heap_left(int root)        {return (root*2+1);} 
int heap_right(int root)    {return (root*2+2);} 
int heap_parent(int child)    {return (child-1)/2;} 

// I/O Related Functions 
int read_one_word(char *input);
void output_result(int k); 
void output_debug(int k); 
struct Word_node remove_max(); 

// main 
int main(int argc, char **argv){ 

    // setup file reading 
    FILE *fin = freopen(argv[1], "r", stdin); 
    int k = atoi(argv[2]);

    // read the words from file into the heap 
    char input[LENGTH] = {0};
    while(read_one_word(input)){ 
        // if the length is zero, skip it 
        if( !strlen(input) )    continue; 
        // add into the heap 
        if( !add_if_exists(input) ) 
            add(input); 
    } 

    // output the result here 
    output_result(k); 

    // output time information 
    float elapsed = (float) (clock()) / CLOCKS_PER_SEC;   
    printf("%f sec\n", elapsed); 

    // end here 
    fclose(fin); 
    return 0; 
} 

/* 
 * Function : output_result 
 * Output the whole result here. 
 * Also, to output the all the same rank nodes, we 
 * have to check if there still have nodes need to  
 * be poped out. 
 */ 
void output_result(int k){ 
    int t = 0; 
    struct Word_node out; 
    while(k--){ 
        out = remove_max(); 
        t = out.freq; 
        printf("(%s,%d)\n", out.word, out.freq); 
    } 
    out = remove_max(); 
    while(t == out.freq && node_ind){ 
        printf("(%s,%d)\n", out.word, out.freq); 
        out = remove_max(); 
    } 
} 

/* 
 * Function : output_debug 
 * Output the heap for debug purpose 
 */ 
void output_debug(int k){ 
    int i, t=0; 
    for(i=0 ; i<k ; i++){ 
        printf("(%s,%d)\n", heap[i].word, heap[i].freq); 
        t = heap[i].freq; 
    } 
    while(t == heap[i].freq && t){ 
        printf("(%s,%d)\n", heap[i].word, heap[i].freq); 
        i++; 
    } 
} 

/* 
 * Function : swap_node 
 * Swap two nodes in the heap structure. 
 */ 
void swap_node(int a, int b){ 
    int t; 
    char t_c[LENGTH]; 
     
    // swapping frequency 
    t = heap[a].freq; 
    heap[a].freq = heap[b].freq; 
    heap[b].freq = t; 

    // swapping word 
    strcpy(t_c, heap[a].word); 
    strcpy(heap[a].word, heap[b].word); 
    strcpy(heap[b].word, t_c); 
} 

/* 
 * Function : add 
 * After making sure that this word hasn't appeared, 
 * add this word into the structure array. 
 */ 
void add(char *input){ 
    heap[node_ind].freq = 1; 
    strcpy(heap[node_ind].word, input); 
    node_ind++; 
} 

/* 
 * Function : add_if_exists 
 * Search the existing structure array, add one on 
 * its frequency if the word is already in the array. 
 */ 
int add_if_exists(char *input){ 
    int i; 
    for(i=0 ; i<node_ind ; i++){ 
        if(!strcmp(heap[i].word, input)){ 
            heap[i].freq ++; 
            reheap_from(i); 
            return TRUE; 
        } 
    } 
    return FALSE; 
} 

/* 
 * Function : reheap_from 
 * After adding one on frequency on one node, we have  
 * to bubble up the node to make sure the heap is right. 
 */ 
void reheap_from(int child){ 
    int root = heap_parent(child); 
    if( heap[root].freq >= heap[child].freq || root <0 )    return; 
    swap_node(root, child); 

    reheap_from(root); 
} 

/* 
 * Function :  
 * Most important part of these program, which rebuild 
 * or check the heap is in the right condition or not. 
 */ 
void reheap(int root){ 
    int child = heap_left(root); 
    if( child >= node_ind )    return; 
    if( (heap[child].freq < heap[child+1].freq) && 
            (child < (node_ind-1)) )    child++; 
    if( heap[root].freq >= heap[child].freq )    return; 
    swap_node(root, child); 
    reheap(child); 
} 

/* 
 * Function : remove_max 
 * Pop out the max node in the heap. 
 * Method : Exchange the last one and the root, then 
 * reheap the tree to make the exchanged root to its 
 * right position. 
 */ 
struct Word_node remove_max(){ 
    node_ind --; 
    if(node_ind != 0){ 
        swap_node(0, node_ind); 
        reheap(0); 
    } 
    return heap[node_ind]; 
} 

/* 
 * Function : read_one_word 
 * Read in one word form the file, and store in the 
 * char array, input. 
 */ 
int read_one_word(char *input){ 
    char c; 
    int ind = 0; 
    while(scanf("%c", &c) != EOF){ 
        if( !isAlpha(c) && !isDigit(c) ){ 
            input[ind] = '\0'; 
            return TRUE; 
        } 
        input[ind++] = toLower(c); 
    } 
    return FALSE; 
} 

/* 
 * Function : isDigit 
 * Determine whether the character c is a digit. 
 */ 
int isDigit(char c){ 
    if( c>='0' && c<='9')    return TRUE;
    return FALSE; 
} 

/* Function : isAlpha 
 * Determine whether the character c is a alphabet. 
 */ 
int isAlpha(char c){ 
    if((c>='a' && c<='z')||(c>='A' && c<='Z'))    return TRUE;
    return FALSE; 
} 

/* 
 * Function : toLower 
 * Return the lower case of the input character. 
 * Method : By modifying the 6th bit of the character to zero, 
 * we obtain the lower case. (0010 0000)2 = (1<<5) 
 */ 
char toLower(char c){ 
    if(!isAlpha(c))    return c; 
    return (c|(1<<5)); 
}