Postgres full-text search is awesome, but without tuning, searching large columns can be slow. Introducing a tsvector column to cache lexemes, and using a trigger to keep those lexemes up to date, can improve the speed of full-text searches.
This article shows how to accomplish that in Rails.
Using pg_search with Rails
We want our users to search for products. Let’s add the pg_search gem to our Rails app:
gem "pg_search"
Then, configure it for our Product model:
class Product < ActiveRecord::Base
  include PgSearch

  pg_search_scope(
    :search,
    against: %i(
      description
      manufacturer_name
      name
    ),
    using: {
      tsearch: {
        dictionary: "english",
      }
    }
  )
end
This example shows configuration for full-text search of the products table’s description, manufacturer_name, and name columns. See Implementing Multi-Table Full Text Search with Postgres in Rails for an example of full-text search across multiple tables.
Our pg_search_scope is named :search, so we can invoke it with:
Product.search("wool")
We’re explicitly specifying the :tsearch option (which is the default Postgres full-text search) in order to use the english dictionary instead of the default simple dictionary.
What we get for a SQL query
Wonderful. We have full-text searching set up in minutes.
Now, what does our SQL query look like?
SELECT products.*
FROM products
INNER JOIN (
  SELECT
    products.id AS pg_search_id,
    (
      ts_rank(
        (
          to_tsvector('english', coalesce(products.description::text, '')) ||
          to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
          to_tsvector('english', coalesce(products.name::text, ''))
        ),
        (to_tsquery('english', ''' ' || 'wool' || ' ''')),
        ?
      )
    ) AS rank
  FROM products
  WHERE (
    (
      to_tsvector('english', coalesce(products.description::text, '')) ||
      to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
      to_tsvector('english', coalesce(products.name::text, ''))
    ) @@ (to_tsquery('english', ''' ' || 'wool' || ' '''))
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0
This is all pretty standard SQL plus a few cool functions: ts_rank, to_tsvector, and to_tsquery. The to_tsvector function is worth a closer look. It generates tsvector data types, which are “a sorted list of distinct lexemes.” Lexemes, in turn, are “words that have been normalized to make different variants of the same word look alike”.
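As a toy illustration of what “normalized” means, here is a deliberately naive Ruby sketch. It is not Postgres’s actual Snowball stemmer (the english dictionary does far more); it only lowercases and strips simple plural suffixes, which happens to reproduce the lexemes we’ll see below:

```ruby
# A deliberately naive sketch of lexeme normalization.
# Postgres's english dictionary uses a real Snowball stemmer;
# these two suffix rules only approximate it for simple plurals.
def naive_lexeme(word)
  word = word.downcase
  if word.end_with?("es")
    word.sub(/es\z/, "")
  elsif word.end_with?("s")
    word.sub(/s\z/, "")
  else
    word
  end
end

naive_lexeme("Sunglasses") # => "sunglass"
naive_lexeme("Kors")       # => "kor"
naive_lexeme("Michael")    # => "michael"
```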
For example, given the following product:
Product.create(
  description: "Michael Kors",
  name: "Sunglasses",
  manufacturer_name: "Michael Kors"
)
The tsvector looks like:
'kor':2,4,6 'michael':1,3,5 'sunglass':7
The resulting lexemes were “normalized to make different variants of the same word look alike” by lowercasing, removing suffixes, etc. The lexemes were sorted into a list, and the numbers represent each lexeme’s position in the original strings.
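To make that format concrete, here is a small Ruby sketch (a hypothetical helper, not part of pg_search or Postgres) that parses a tsvector string like the one above into a hash of lexeme to positions:

```ruby
# Parse a tsvector's text representation into { lexeme => [positions] }.
# Handles the simple quoted form shown above; not a full tsvector parser.
def parse_tsvector(tsv)
  tsv.split(" ").each_with_object({}) do |entry, result|
    lexeme, positions = entry.split(":")
    result[lexeme.delete("'")] = positions.split(",").map(&:to_i)
  end
end

parse_tsvector("'kor':2,4,6 'michael':1,3,5 'sunglass':7")
# => { "kor" => [2, 4, 6], "michael" => [1, 3, 5], "sunglass" => [7] }
```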
For tons of awesome examples and details on these three functions, see Postgres full-text search is Good Enough!
Caching tsvector lexemes
On a large products table, our searches may be slow. If so, we have some tuning options.
One option would be to cache the tsvectors using a materialized view. Read Caching with Postgres materialized views or Postgres full-text search is Good Enough! (again) for more information on materialized views with Postgres and Ruby.
Materialized views may be a good option for your data. One downside is that the entire view must be refreshed with:
REFRESH MATERIALIZED VIEW view_name;
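If you did go the materialized-view route, that refresh could be wrapped in a small rake task and triggered from cron or Heroku Scheduler. This is a hypothetical sketch; the searches view name and the task name are assumptions, not something pg_search generates:

```ruby
# lib/tasks/search.rake
# Hypothetical task to refresh a materialized view named "searches".
namespace :search do
  desc "Refresh the full-text search materialized view"
  task refresh: :environment do
    ActiveRecord::Base.connection.execute(
      "REFRESH MATERIALIZED VIEW searches;"
    )
  end
end
```

You could then schedule `rake search:refresh` to run daily.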
That may be a good fit in some scenarios, perhaps run daily as a cron or Heroku Scheduler job. In our case, we want a cache to be updated when a Product is created or updated.
Let’s edit our pg_search_scope:
  using: {
    tsearch: {
+     tsvector_column: "tsv",
    }
  }
Since we can’t dump a tsvector column to schema.rb, we need to switch to the SQL schema format in our config/application.rb:
config.active_record.schema_format = :sql
Remove the now-unnecessary db/schema.rb:
rm db/schema.rb
And generate a migration:
class AddTsvectorColumns < ActiveRecord::Migration
  def up
    add_column :products, :tsv, :tsvector
    add_index :products, :tsv, using: "gin"

    execute <<-SQL
      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
      ON products FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(
        tsv, 'pg_catalog.english', description, manufacturer_name, name
      );
    SQL

    now = Time.current.to_s(:db)
    update("UPDATE products SET updated_at = '#{now}'")
  end

  def down
    execute <<-SQL
      DROP TRIGGER tsvectorupdate
      ON products
    SQL

    remove_index :products, :tsv
    remove_column :products, :tsv
  end
end
This change introduces a tsv column of type tsvector to search against, a GIN index on the new column, a TRIGGER that recomputes tsv BEFORE INSERT OR UPDATE on products, and a backfill UPDATE for existing products, to keep the data in sync.
Postgres has a built-in tsvector_update_trigger function to make this easier.
The GIN index could alternatively be a GiST index. See the GIN vs. GiST tradeoffs.
Here’s the resulting query with the new tsvector-type column:
SELECT products.*
FROM products
INNER JOIN (
  SELECT
    products.id AS pg_search_id,
    (
      ts_rank(
        (products.tsv),
        (to_tsquery('english', ''' ' || 'wool' || ' ''')),
        0
      )
    ) AS rank
  FROM products
  WHERE (
    ((products.tsv) @@ (to_tsquery('english', ''' ' || 'wool' || ' ''')))
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0
We can see that our run-time to_tsvector function calls are gone, and our cached tsvector data in the GIN-indexed tsv column is queried instead.
We’ve now improved the speed of our queries by introducing a tsvector column to cache lexemes. The trigger will keep the lexemes up-to-date as products are created and updated, without any daily cron job to run.
Happy searching.