
Inverted File Index

Time to sum this mess up.

But I'm too lazy to write it out again, so I'll just put the report here.

Inverted File Index Creation

The inverted file index, known as an IFI, is created to handle search queries efficiently. Due to the memory limit and the huge amount of data, there are two different ways to build an IFI on the original data. Here we call them "Memory Based Inversion" and "Sort Based Inversion".

Memory Based

We use a dynamic dictionary data structure (a B-tree or a hash table) to record distinct terms, each with a linked list of nodes storing document IDs and the frequency of occurrence of the term in that document.

The dictionary is maintained entirely in memory, which is why this approach is called Memory Based.

Here we use a B+ tree to manage the terms. We also implement the IFI with an STL map and compare the results of the two data structures.

Once all documents have been processed, the whole B+ tree is traversed, and each term is written out together with its list of documents and occurrence frequencies.

The process can be shown in pseudo-code:

function MEMORY-BASED-INVERSION(T):
    // Initialization
    Create an empty dictionary S

    // Phase one: collection of term appearances
    For each document D, d in the collection T:
        // d is the index of document D
        Parse D into index terms
        For each index term t in D:
            f = frequency of t in D
            If t is not in S:
                Insert t into S with an empty list
            Append (d, f) to the list corresponding to t in S

    // Phase two: output of inverted file
    For each term t in S:
        Start a new inverted file entry
        For each (d, f) in the list corresponding to t:
            Append (d, f) to the inverted file entry
        Write the inverted file entry to the final inverted file
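
Below is a minimal C++ sketch of the STL-map variant of this phase, assuming documents are already tokenized into strings; the names memoryBasedInversion and Posting, and the plain-text output format, are illustrative and not the actual report code.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // One posting: (document id, in-document frequency).
    struct Posting { int doc; int freq; };

    // Phase one: count term frequencies per document and append postings.
    // Phase two: iterate the map (sorted by term) and write one entry per term.
    void memoryBasedInversion(const std::vector<std::vector<std::string>>& docs,
                              FILE* out) {
        std::map<std::string, std::vector<Posting>> index;

        for (int d = 0; d < (int)docs.size(); ++d) {
            std::map<std::string, int> freq;          // term -> count within doc d
            for (const std::string& t : docs[d]) ++freq[t];
            for (const auto& [term, f] : freq)
                index[term].push_back({d, f});        // postings stay sorted by d
        }

        for (const auto& [term, postings] : index) {  // std::map iterates in term order
            std::fprintf(out, "%s", term.c_str());
            for (const Posting& p : postings)
                std::fprintf(out, " (%d,%d)", p.doc, p.freq);
            std::fprintf(out, "\n");
        }
    }

The same structure applies to the B+ tree version; only the dictionary container changes.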

Sort Based

The main problem with Memory Based Inversion is that it consumes too much memory, which makes handling really huge data difficult.

Therefore, we can put the records on disk instead of in memory.

To be specific, we write tuples consisting of the term number, document index, and frequency to a temporary file, and then run a merge sort on that file.

To sort the records (tuples), we define a comparison function: if the term \(t\) of one record is lexicographically smaller than that of another record, we call that record smaller.

If the terms are equal, we compare the document indices \(d\), and the record with the smaller index is considered smaller.

In each run we read \(k\) records from the temporary file, sort them, and write them out to a chunk file. A sketch of this run-generation step follows below.
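
Here is a C++ sketch of the comparison function and the run-generation step, assuming records are stored as fixed-size binary triples; the names Record, makeSortedChunks, and the chunk_%d.bin file naming are illustrative assumptions.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // One on-disk record: (term number, document index, frequency).
    struct Record { int term; int doc; int freq; };

    // Order by term number, breaking ties by document index,
    // exactly as the comparison function described above.
    bool recordLess(const Record& a, const Record& b) {
        if (a.term != b.term) return a.term < b.term;
        return a.doc < b.doc;
    }

    // Read up to k records at a time from the temp file, sort them in memory,
    // and write each sorted run to its own chunk file. Returns the chunk count.
    int makeSortedChunks(FILE* temp, size_t k) {
        std::vector<Record> buf(k);
        int chunks = 0;
        size_t got;
        while ((got = std::fread(buf.data(), sizeof(Record), k, temp)) > 0) {
            std::sort(buf.begin(), buf.begin() + got, recordLess);
            char name[32];
            std::snprintf(name, sizeof(name), "chunk_%d.bin", chunks++);
            FILE* out = std::fopen(name, "wb");
            std::fwrite(buf.data(), sizeof(Record), got, out);
            std::fclose(out);
        }
        return chunks;
    }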

After this process there will be \(\frac{n}{k}\) chunk files, where \(n\) is the total number of records.

Then we read exactly one record from each chunk file, select the minimum of these \(\frac{n}{k}\) records (using the comparison function specified above), and write it to the final inverted file.

Suppose the minimum record came from chunk file \(i\); we then read the next record from file \(i\).

Repeat until there are no more records.

We can see that the selection process can be maintained with a min-heap, as in the sketch below.
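
A C++ sketch of the min-heap merge, under the same assumed binary Record layout; the names mergeChunks and HeapItem are illustrative. std::tuple compares lexicographically, which matches the (term, document) ordering described above.

    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    struct Record { int term; int doc; int freq; };

    // (term, doc, freq, source chunk) -- the min-heap pops the record with the
    // smallest term, then the smallest doc.
    using HeapItem = std::tuple<int, int, int, int>;

    // Merge the sorted chunk files into one sorted stream of records.
    void mergeChunks(const std::vector<FILE*>& chunks, FILE* merged) {
        std::priority_queue<HeapItem, std::vector<HeapItem>,
                            std::greater<HeapItem>> heap;

        auto push = [&](int i) {                     // refill from chunk i
            Record r;
            if (std::fread(&r, sizeof(Record), 1, chunks[i]) == 1)
                heap.push({r.term, r.doc, r.freq, i});
        };

        for (int i = 0; i < (int)chunks.size(); ++i) push(i);  // one record per chunk

        while (!heap.empty()) {
            auto [term, doc, freq, i] = heap.top();
            heap.pop();
            Record r{term, doc, freq};
            std::fwrite(&r, sizeof(Record), 1, merged);   // emit the minimum record
            push(i);                                      // read the next record from chunk i
        }
    }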

The in-memory sort holds \(k\) records and the min-heap holds \(\frac{n}{k}\) entries, so if we choose \(k = \sqrt{n}\), both are \(O(\sqrt{n})\), which minimizes the peak memory usage.

Represented in pseudo-code:

function SORT-BASED-INVERSION(T):
    // Initialization
    Create an empty dictionary S
    Create an empty temporary file on disk

    // Process text and write to temporary file
    For each document D, d in the collection T:
        Parse D into index terms
        For each index term t in D:
            f = frequency of t in D
            If t is not in S:
                Insert t into S with a unique term number
            Write record (t, d, f) to the temporary file 
            // t is represented by its term number from S

    // Internal sorting to make runs
    Let k be the number of records that can be held in memory
    Repeat until no more unsorted records:
        Read k records from the temporary file
        Sort records by t (term number) 
        and for equal values of t, by d (document ID)
        Write the sorted run back to the temporary file

    // Merging
    Repeat pairwise merging of runs in the temporary file 
    until only one sorted run remains

    // Output inverted file
    For each term t in S:
        Start a new inverted file entry
        Read all triples (t, d, f) from the temporary file for term t
        Append these triples to the inverted file entry
        If required, compress the inverted file entry
        Write the entry to the final inverted file
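
For completeness, a sketch of the output phase under the same assumed Record layout: since the merged file is sorted by term number, a single sequential scan groups the postings of each term into one inverted-file entry. The function name writeInvertedFile and the text output format are illustrative; compression is omitted.

    #include <cstdio>

    struct Record { int term; int doc; int freq; };

    // Scan the fully sorted temp file once; consecutive records with the same
    // term number form one inverted-file entry.
    void writeInvertedFile(FILE* sorted, FILE* out) {
        Record r;
        int current = -1;                      // term number of the entry being built
        while (std::fread(&r, sizeof(Record), 1, sorted) == 1) {
            if (r.term != current) {           // a new term starts a new entry
                if (current != -1) std::fprintf(out, "\n");
                std::fprintf(out, "%d:", r.term);
                current = r.term;
            }
            std::fprintf(out, " (%d,%d)", r.doc, r.freq);
        }
        if (current != -1) std::fprintf(out, "\n");
    }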

Memory Usage and Total Time

On a huge data set we obtained the following results:

                         Memory Usage/MB    Total Time/s
Memory Based (map)       1720.4             256
Memory Based (B+ tree)   1744.4             376
Sort Based               34.8               510

We can see that Sort Based uses far less memory than Memory Based, while Memory Based is somewhat faster than Sort Based.

The reason the B+ tree version of Memory Based performs worse than the map version is that our B+ tree implementation is poorly optimized.