Full text search using Yomu and CloudSearch

At the request of a client, I started looking at solutions for full text search across all documents containing text. My first thought was that there must be a service out there where you can upload or reference files, which will then index each file and let you query the data. Unfortunately, I didn’t find one. Most of the indexing and search services out there are for structured data. However, this does become helpful, as you will see.

Since I had to come up with a more customised solution, I started to look for libraries that could parse documents containing text. This led me to Apache Tika. “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types.” All the file types I needed were covered by Tika’s extensive list.

Since I was working in Ruby on Rails I went looking for a gem that would utilise Tika’s toolkit for me to use in my project. The Yomu gem is a great wrapper for the toolkit and easily allows for the reading of data from any file passed to it.

For your PHP projects there is a library called PhpTikaWrapper available using Composer.

The files I had to read were stored on Amazon S3, so using Ruby’s ‘open-uri’ module I passed in the S3 URL and read out the data, which allowed Yomu to extract the text.

require 'open-uri'

data = open(s3_url).read
text = Yomu.read :text, data

My next step could have been to save the text to a database table along with an id for the document, and querying that table would have sufficed as a solution. However, not wanting to waste the earlier research, I decided to use one of the indexing and search services I previously referred to, namely Amazon CloudSearch. Since the files were already hosted on S3 and the account was already set up for the project, it seemed like the logical option.

All I had to do was create a new CloudSearch document with an id and a text field, which I could subsequently query to return a list of ids for documents containing the text I was searching for. The AWSCloudSearch gem seemed to be the most up-to-date gem, and with a few lines of code I could create, search and remove documents from CloudSearch. Here is an example of adding a document:

ds = AWSCloudSearch::CloudSearch.new('your-domain-name-53905x4594jxty')
doc = AWSCloudSearch::Document.new(true)
doc.id = id
doc.lang = 'en'
doc.add_field('text', text)
batch = AWSCloudSearch::DocumentBatch.new
batch.add_document doc

I integrated the search into my search box for the document list and, hey presto, documents containing the requested text were returned.
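The query side is just an HTTP GET against the domain’s search endpoint. Here is a minimal sketch of how I shaped it; the endpoint host is a placeholder for your own domain’s value, and the two helper methods are names I have made up for illustration:

```ruby
require 'json'
require 'uri'

# Hypothetical search endpoint -- replace the host (and region)
# with the one AWS shows for your own CloudSearch domain.
SEARCH_ENDPOINT = 'https://search-your-domain-53905x4594jxty.eu-west-1.cloudsearch.amazonaws.com'

# Build the URL for a free-text query.
def build_search_url(query)
  URI("#{SEARCH_ENDPOINT}/2013-01-01/search?q=#{URI.encode_www_form_component(query)}")
end

# Pull the matching document ids out of a CloudSearch-style JSON response.
def ids_from_response(json)
  JSON.parse(json)['hits']['hit'].map { |h| h['id'] }
end

# The actual request would be Net::HTTP.get(build_search_url('invoice'));
# the response body looks roughly like this:
sample = '{"hits":{"found":2,"hit":[{"id":"doc1"},{"id":"doc2"}]}}'
ids_from_response(sample) # => ["doc1", "doc2"]
```

Each returned id maps back to a document stored on S3, which is all the search box needs.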

Heroku timeout management

Recently I have been migrating a Rails application to Heroku. The application is a few years old and very large. Deploying to Heroku was fairly straightforward, but within a short time I realised that Heroku’s maximum 30-second timeout was going to be a problem.

Heroku’s timeout is completely non-negotiable and probably rightly so. Any application which is taking more than 30 seconds to load has some serious problems. Unfortunately, my application has some serious problems. Heroku compounds these problems in one swift move by using the swap file. Once an application starts using the swap file, processing time skyrockets.

So, here are a few ways I have used to try to alleviate the problem:

  1. Move any long or processor-intensive tasks to a worker dyno. Using the delayed_job gem, I was able to pass a number of intensive tasks over to a Heroku worker instance. The worker instance then churns away in the background, freeing up the web instance to continue serving the user. Also, by using the workless gem, I was able to switch off the worker dyno when it wasn’t required, saving those all-important $$.
  2. Pre-empt the Heroku timeout to rescue the application. If the Heroku timeout limit or your web server timeout limit is reached, you lose the ability to manage the problem within your application. By using the rack-timeout gem, you can catch the exception by setting the rack timeout to a second less than the web server timeout. You can then manage the exception within the application by logging data and displaying useful information to the user.
  3. Size your instance correctly to stop switching to the swap file. If you are using a web server like unicorn, you can set the number of concurrent processes for users accessing your application. But beware: each process will increase the memory usage on your dyno. New Relic will give you a good indication of the average memory usage of your application. If you set unicorn concurrency then you can work out your required instance size [average_memory * concurrency = instance_size]. Quick tip – the derailed_benchmarks gem shows the memory usage of your gems. It might help reduce your application size.
  4. If you hit the swap file, restart the dyno. Using the unicorn-worker-killer gem, you can set the dyno to restart if the memory quota is exceeded. This may also allow you to retry the page using the exception catching explained in ‘Step 2’. By doing redirect_to response.path in your exception rescue, you are effectively retrying the page with the newly restarted dyno.

Using these methods, you should be able to significantly reduce the impact of application timeouts if not avoid them completely.
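To make point 2 concrete, this is roughly the shape of the rack-timeout wiring. Note this is a sketch, not my exact code: the setter and the exception class name vary between rack-timeout versions, and the controller body is illustrative.

```ruby
# config/initializers/rack_timeout.rb
# One second below the 30-second limit so the exception fires
# inside the app before Heroku kills the request.
Rack::Timeout.timeout = 29

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  # Class name differs by rack-timeout version
  # (e.g. Rack::Timeout::Error in older releases).
  rescue_from Rack::Timeout::RequestTimeoutException do |e|
    Rails.logger.warn "Request timed out: #{request.path}"
    # Retry the page -- useful combined with the dyno restart in step 4.
    redirect_to request.path
  end
end
```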

Heroku remote database backup

Heroku’s new toolbelt commands have had a few updates which make copying backups a little trickier. Also, if you want to automate these commands with the scheduler, your rake task needs to authenticate before it can run them.

Here is my rake task to backup the database and copy the backup to another location:

namespace :pgbackup do
  desc "Database backup and copy"
  task :db_backup => :environment do

    heroku_server = ENV['APP_NAME']
    timestamp = Time.now.strftime('%Y%m%d%H%M%S')
    temp_path = "latest.dump"

    Bundler.with_clean_env do
      # Authenticate (credentials piped to the heroku login prompts) and capture a backup
      db_list = `echo "#{ENV['BACKUP_USERNAME']}\n#{ENV['BACKUP_PASSWORD']}" | heroku pg:backups capture -a #{heroku_server}`
      res = db_list.split("\n")[6]
      db_id = res.split(" ")[2].strip
      restore_url = `heroku pg:backups public-url #{db_id} -a "#{heroku_server}"`
      restore_url = URI.extract(restore_url)
      restore_url = restore_url[0]
      p `curl -o "#{temp_path}" "#{restore_url}"`

      file = YAML.load_file("#{Rails.root}/config/s3.yml")
      config = file[Rails.env]

      connection = Fog::Storage.new({
        :provider => 'AWS',
        :aws_access_key_id => config['access_key_id'],
        :aws_secret_access_key => config['secret_access_key'],
        :region => 'eu-west-1'
      })

      directory = connection.directories.get(ENV["APP_NAME"])

      file = directory.files.create(:key => "db-backups/#{timestamp}.dump", :body => File.open(temp_path), :public => false)
    end
  end
end


As you can see, I have used the fog gem to connect to S3. Fog can connect to other providers too, so you are not just limited to S3.

You will also have to make sure to add the following environment variables to Heroku: APP_NAME, BACKUP_USERNAME and BACKUP_PASSWORD.

The backup username and password are Heroku login details for an owner or collaborator account.
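These can be set with the toolbelt; the values below are placeholders:

```shell
heroku config:set APP_NAME=your-app-name \
  BACKUP_USERNAME=owner@example.com \
  BACKUP_PASSWORD=your-password \
  -a your-app-name
```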

Capistrano3-nginx permissions

Capistrano3 has a plugin for Nginx called capistrano3-nginx which allows you to manage Nginx using the cap command or through your deployment scripts, e.g. nginx:reload

However, if you try to run any of these commands on some systems, it may complain about not having permissions, or your deploy may hang while asking for a sudo password.

To get round this you need to give the deploying user permission to run a sudo command without needing a password. If you’re like me, the first time you hear that you think “that can’t be very secure”. Correct! Giving any user full sudo access without a password isn’t recommended.

To improve things we can restrict what commands the deploying user can run as sudo without a password. To do this, edit the sudoers file using the following command:

sudo visudo

Adding the following line will then allow the deploying user to run the required Nginx commands without the need for a password.

deploy ALL=NOPASSWD:/usr/sbin/service nginx *

You may need to adjust the Nginx service path if your set up is different. The ‘*’ just allows all commands on the Nginx service. You could restrict this further by only allowing specific commands like ‘restart’.
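For example, to permit only restart and reload (assuming the same service path as above), the sudoers line would look like this:

```
deploy ALL=NOPASSWD:/usr/sbin/service nginx restart, /usr/sbin/service nginx reload
```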

Once you have saved the changes you should be able to run Nginx service commands without the need to enter a password and Capistrano should in turn be able to do the same.

PostgreSQL: “Failed to build gem native extension” with the pg gem

I have been trying to deploy a Rails app with Capistrano and bundler and kept getting the “Failed to build gem native extension” error. After installing many, probably unnecessary, packages, I finally came across a post which pointed me in the right direction.

It turns out that the default pg_config location did not match the path that Capistrano was using. I found this by running

which pg_config

on the server and comparing it to the command being executed in the trace.

Since they didn’t match, I found a command to set the build option bundler uses for the pg gem:

bundle config build.pg --with-pg-config=/usr/bin/pg_config

Once run, bundler executed successfully while deploying.
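Putting the two together, a quick way to make bundler pick up whatever pg_config the server actually has (assuming pg_config is on the PATH):

```shell
# Find the real pg_config location, then point bundler's pg build at it
bundle config build.pg --with-pg-config=$(which pg_config)
```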

WARN Could not determine content-length of response body. PATCH

I had been getting this notice from WEBrick for a while, and the usual response was just “it’s not harming anything, just ignore it”.

Thankfully someone wasn’t happy with that answer and went looking for the source of the problem.

Edit httpresponse.rb around line 205, changing

if chunked? || @header['content-length']

to

if chunked? || @header['content-length'] || @status == 304 || @status == 204

Here is the patch

Unfortunately, if you are using rvm or rbenv you need to change each installed copy of WEBrick, e.g. ~/.rbenv/versions/{RUBY_VERSION}/lib/ruby/{X.Y.Z}/webrick/httpresponse.rb

Adding custom countries using Carmen in rails

I had been using a gem to provide countries for an app I’ve been building, but needed to create some custom country data. The gem didn’t support custom data, so I went searching and found a gem called Carmen, made even easier for my Rails project by the carmen-rails gem.

Once installed, I added carmen.rb to my config/initializers/ folder:
Carmen.append_data_path File.expand_path('../../', __FILE__)

Then added world.yml to the config/ folder:

- alpha_2_code: ND
  alpha_3_code: NID
  numeric_code: "999"
  type: country

Then updated config/locales/en.yml with a name for the new code:

en:
  world:
    nd:
      name: New country

Now I can reference:
Carmen::Country.named('New country')
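carmen-rails also provides form helpers, so the custom country appears in dropdowns too. A sketch (the form object and attribute name are illustrative):

```erb
<%# app/views/contacts/_form.html.erb %>
<%= f.country_select :country_code %>
```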

Ember Rails Basic Application Setup

I had some problems setting up an ember-rails project to do the most basic thing of displaying a model on the page. Without a working base to start from, I found it hard to do a lot of the more advanced things, as I didn’t know where the bugs were coming from. So, I decided to record the exact commands I used to create a core “working” application so that if I needed to start from scratch again I could do it easily. Here is the process:

rails _3.2.13_ new app_name -d postgresql
cd app_name

add ember gems to Gemfile:
gem 'ember-rails'
gem 'ember-source', '1.0.0.rc6'
gem 'handlebars-source', '1.0.0.rc4'

bundle install

Create your database (bundle exec rake db:create)

rails g ember:bootstrap -g --javascript-engine coffee

add the following to the development/production environment files:
config.ember.variant = :development

rails g model contact first_name:string last_name:string
rails g controller contacts index
rails g serializer contact first_name last_name
bundle exec rake db:migrate
rails runner "Contact.create(:first_name => 'Tim', :last_name => 'West')"

set routes in routes.rb:
resources :contacts
root :to => 'application#index'

send json from contacts_controller index:

def index
  render :json => Contact.all
end

Create a blank file app/views/application/index.html.erb

rails g ember:model contact first_name:string last_name:string
rails g ember:controller contacts index
rails g ember:view contacts
rails g ember:template contacts
rails g ember:route contacts

The view generator may not be necessary; I haven’t quite figured that out yet.

In application.js make sure the application variable is set to what you are going to use throughout the rest of the application, e.g. App = Ember.Application.create();

router.js.coffee :

App.Router.map ->
  @resource('contacts') # set contacts resource

# add index route to redirect to contacts
App.IndexRoute = Ember.Route.extend
  redirect: ->
    @transitionTo('contacts')

contacts.handlebars :

{{#each model}}
  {{firstName}} {{lastName}}
{{/each}}

contacts_route.js.coffee :

App.ContactsRoute = Ember.Route.extend
  model: ->
    return App.Contact.find()

Passing params through redirect_to in rails

I had an issue recently where I was requiring SSL for a page on a site which was also being passed parameters. However, when the redirect_to happened, all the parameters were lost and the page displayed the wrong information.

Here is the fix:

redirect_to({:protocol => 'https://'}.merge(params))
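This works because the protocol option is simply merged over the incoming params before redirect_to builds the URL, so nothing is dropped. A minimal sketch outside Rails, with a made-up params hash:

```ruby
# Simulated request params (in Rails this is the params hash).
params = { :controller => 'orders', :action => 'show', :id => '42' }

# Merge keeps every original parameter and adds the forced protocol.
options = { :protocol => 'https://' }.merge(params)

options[:protocol] # => "https://"
options[:id]       # => "42"
```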

Boot script for Passenger Standalone

In extension to THIS POST, I created a boot script so the application loads when the server is rebooted. Everything in {} needs to be replaced with your own values.

Create /etc/init.d/{YOUR_APP_NAME}
Add the following code:

#!/bin/sh
### BEGIN INIT INFO
# Provides: boot passenger in standalone
# Required-Start: 2 3 4 5
# Required-Stop: 0 1 6
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Start/stop app.name.com
### END INIT INFO

USER="{DEPLOYING_USER}" # e.g. www-data
RUBY_VERSION="{RUBY_VERSION}" # e.g. 1.9.3-p194
APP_NAME="{YOUR_APP_NAME}"
ENVIRONMENT="production"
SHARED="/home/$USER/$APP_NAME/shared" # adjust to your deploy layout

start() {
echo "Starting passenger"
/home/$USER/.rbenv/versions/$RUBY_VERSION/bin/passenger start --socket /tmp/$APP_NAME.socket -d --nginx-version 1.0.5 -e $ENVIRONMENT --pid-file $SHARED/pids/passenger.pid --log-file $SHARED/log/passenger.log --user $USER;
}

case "$1" in
  start) start ;;
esac

Save the file and make it executable (sudo chmod +x /etc/init.d/{YOUR_APP_NAME}).

Edit /etc/rc.local
Add the following line before the end:

su {DEPLOYING USER} -c "/etc/init.d/{YOUR_APP_NAME} start"


Make sure you have installed the passenger gem for whichever version of Ruby you are trying to run against.