Information Retrieval System

15th September '19 – 21st January '20

For my final-year Text Processing assignment, I was asked to build an information retrieval system combining techniques we'd covered in lectures. This involved several stages. First, the source data needed to be cleaned. I normalised the inputs by stripping capitalisation and punctuation, and by trimming words down to their stems so that related words (e.g. politics and politician) could be matched by the same query. Next, I built inverted indexes for the inputs, which recorded how many times each word occurred within each document. These counts could be used to represent the documents and the query as vectors. Before comparing the vectors, I applied TF.IDF weighting. This weights each term by its frequency in the document against how common it is across the collection, which prioritises rarer terms. Finally, I applied a cosine similarity metric to compare each document with the query.
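To give a feel for how these stages fit together, here is a minimal Python sketch. It is not the code from the assignment: the function names (`tokenize`, `build_index`, `idf_weights`, `rank`) and the sample documents are made up for illustration, the IDF is assumed to be the plain log(N / document frequency) form, and stemming is left out to keep it short (a real system would likely plug in something like a Porter stemmer inside `tokenize`).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase, then keep only alphabetic runs, which drops punctuation.
    # Stemming is deliberately omitted in this sketch.
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    # Inverted index: term -> {doc_id: number of occurrences in that document}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def idf_weights(index, n_docs):
    # Assumed IDF form: log(N / df). Terms found in every document get weight 0.
    return {term: math.log(n_docs / len(postings))
            for term, postings in index.items()}


def rank(query, index, n_docs):
    idf = idf_weights(index, n_docs)

    # Squared length of each document's TF.IDF vector.
    doc_norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += (tf * idf[term]) ** 2

    # Dot product of the query vector with each document vector.
    dots = defaultdict(float)
    query_norm = 0.0
    for term, q_tf in Counter(tokenize(query)).items():
        if term not in index:
            continue  # unseen terms contribute nothing
        q_weight = q_tf * idf[term]
        query_norm += q_weight ** 2
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_weight * tf * idf[term]

    # Cosine similarity = dot product / (|query| * |document|).
    results = []
    for doc_id, dot in dots.items():
        denom = math.sqrt(query_norm) * math.sqrt(doc_norms[doc_id])
        if denom > 0:
            results.append((doc_id, dot / denom))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample collection, for illustration only.
docs = {
    "d1": "The cat sat on the mat.",
    "d2": "Dogs chase cats!",
    "d3": "Politics and the economy",
}
index = build_index(docs)
ranking = rank("cat mat", index, len(docs))
```

Here `ranking` is a list of `(doc_id, score)` pairs, best match first. Note that without stemming, "cat" and "cats" are treated as different terms, which is exactly the problem the stemming step is there to solve.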

Copyright © 2016-2020 Simon Fish