At a request from a client, I started to look at solutions to do full text search for all documents containing text. My first thought was that there must be a service out there where you can upload or reference files that will then index the file and allow you to query the data? Unfortunately I didn’t find one. Most of the indexing and searching services out there are for structured data. However, this does become helpful as you will see.
Since I had to come up with a more customised solution, I started to see if there were any libraries that contained functionality to parse documents containing text. This lead me to Apache Tika. “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types“. All the file types I needed to use were covered by Tika’s extensive list of usable file types.
Since I was working in Ruby on Rails I went looking for a gem that would utilise Tika’s toolkit for me to use in my project. The Yomu gem is a great wrapper for the toolkit and easily allows for the reading of data from any file passed to it.
The files I had to read were being stored on Amazon S3 so using Ruby’s ‘open-uri’ module I passed in the S3 url and read out the data, which allowed Yomu to extract the text.
data = open(s3_url).read
text = Yomu.read :text, data
My next step could have been to save the text to a database table along with an id for the document and then query the database and this would have sufficed as a solution. However, not to waste some of the research, I decided to use one of the indexing and search services that I previously referred to, namely Amazon Cloudsearch. Since the files were already hosted on S3 and the account was already set up for the project it seemed like the logical option.
All I had to do was create a new Cloudsearch document with an id and text field that I could subsequently query and return a list of ids containing the text I was searching for. The AWSCloudSearch gem seemed to be the most up-to-date gem and with a few lines of code I could create , search and remove documents from Cloudsearch. Here is an example of adding a document
ds = AWSCloudSearch::CloudSearch.new('your-domain-name-53905x4594jxty')
doc = AWSCloudSearch::Document.new(true)
doc.id = id
doc.lang = 'en'
batch = AWSCloudSearch::DocumentBatch.new
I integrated the search into my search box for the document list and, hey presto, documents containing the requested text were returned.