Action Text excels at capturing user-provided rich text content and rendering the way end-users intended by editing and storing HTML content. However, once that content is encoded and stored as HTML, it’s becomes indecipherable to search engines.
Wouldn’t it be great if the same database that stores the HTML content could find the proverbial needle in that content’s haystack?
Setting up
We’ll start with an out-of-the-box Rails installation generated by an rails new
command, and skip the majority of the application’s setup (including the Action Text installation steps).
If you’d like to read the rest of this article’s source code (including a test suite!), it can be found on GitHub.
Scaffolding our model
We’ll create our Article
model’s scaffolding with Rails’ model
generator to serve as a starting point for our extensions and customizations:
bin/rails generate model \ Article \ title:text \ --no-fixture bin/rails db:migrate
We’ll also add a has_rich_text :content
declaration to the model so that it accepts rich text content for its content
attribute:
--- b/app/models/article.rb +++ b/app/models/article.rb class Article < ApplicationRecord + has_rich_text :content end
Searching our content with SQL ILIKE
Postgres supports the LIKE and ILIKE keywords for loose pattern matching:
string LIKE pattern [ESCAPE escape-character]
The
LIKE
expression returnstrue
if the*string*
matches the supplied*pattern*
.If
pattern
does not contain percent signs or underscores, then thepattern
only represents thestring
itself; in that caseLIKE
acts like the equals operator. An underscore (_
) in pattern stands for (matches) any single character; a percent sign (%
) matches any sequence of zero or more characters.
To test it out, declare a with_content_containing
scope that joins onto the related ActionText::RichText model, and implemented with an ILIKE
query:
--- a/app/models/article.rb +++ b/app/models/article.rb class Article < ApplicationRecord has_rich_text :content + + scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") } + body ILIKE ? + SQL end
Using an ILIKE
clause achieves our desired outcome, but has some quirks and drawbacks. For instance, ILIKE
won’t account for any HTML characters, so while a search for "needle in the haystack"
matches content like <div>The text of this article matches has the needle in the haystack<div>
, it won’t match content with HTML interspersed (e.g. <div>The text of this article matches has the <strong>needle</strong> in the haystack</div>
.
Ignoring HTML debris
Let’s address the major drawback: HTML junk in our content.
To do so, we’ll add a plain_text_body
column to our action_text_rich_texts
table to mirror the existing body
column so that our ActionText::RichText
models can write to and search from it. We’ll create the migration file through Rails’ migration
generator:
# Created by: # # bin/rails generate migration AddPlainTextBodyToActionTextRichTexts \ # plain_text_body:text # class AddPlainTextBodyToActionTextRichTexts < ActiveRecord::Migration[7.0] def change add_column :action_text_rich_texts, :plain_text_body, :text end end
To write to the new action_text_rich_texts.plain_text_body
, we’ll declare a before_save
callback so that creating or updating an Article
instance transforms by invoking ActionText::RichText#to_plain_text
. The call to to_plain_text
will remove all HTML syntax, leaving behind only the text content of the body
to be written to plain_text_body
:
--- a/app/models/article.rb +++ b/app/models/article.rb class Article < ApplicationRecord has_rich_text :content + + before_save { content.plain_text_body = content.body.to_plain_text } scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") } body ILIKE ? SQL end
Querying with plain-text search
Now that we have a column free of HTML entities, we can point our existing query to it in support querying of values like:
<div> The text of this Article contains the <strong>needle</strong> in the haystack. </div>
While this is a major improvement in what’s possible to query, there are still some quirks and drawbacks. For instance, querying with ILIKE
does not account for “stop words” like “in”, “of”, or “the”. This means that while a query like "needle in the haystack"
will match "The text of this Article contains the needle in the haystack."
, "needle haystack"
will not.
--- a/app/models/article.rb +++ b/app/models/article.rb class Article < ApplicationRecord has_rich_text :content before_save { content.plain_text_body = content.body.to_plain_text } scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") } - body ILIKE ? + plain_text_body ILIKE ? SQL end
Querying with full-text search
We’ll need to ignore stop words and boost the performance of our query with text search vectors by transforming content
with to_tsvector
.
PostgreSQL full-text search works by comparing collections of text search tokens, and ignoring stop words like “in”, “of”, and “the”. To query a column for a search term, both the column and the query must be converted to text search vectors via to_tsvector
. We’ll convert the query string by passing it to websearch_to_tsquery
, and convert the plain_text_body
column by passing it to to_tsvector
:
--- a/app/models/article.rb +++ b/app/models/article.rb before_save { content.plain_text_body = content.body.to_plain_text } - scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") } - plain_text_body ILIKE ? + scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) } + to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?) SQL end
To improve querying performance, declare a Generalized Inverted Index (GIN) index on action_text_rich_texts.plain_text_body
to store the to_tsvector
return value:
class AddTsvectorIndexToActionTextRichTexts < ActiveRecord::Migration[7.0] def change add_index :action_text_rich_texts,"to_tsvector('english', plain_text_body)", using: :gin, name: "tsvector_body_idx" end end
Generalizing our search
While declaring the before_save
callback and with_content_containing
scope within the Article
model has utility, its value is limited to Article
instances despite the implementation being tied to ActionText::RichText
concepts.
Let’s generalize our full-text search wins by re-opening the ActionText::RichText
class through an ActiveSupport.on_load
hook, in the config/initializers/action_text_rich_text.rb
initializer, making it accessible to future models that have rich text:
ActiveSupport.on_load :action_text_rich_text do before_save { self.plain_text_body = body.to_plain_text } scope :with_body_containing, ->(query) { where <<~SQL, query } to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?) SQL end
After extracting the code to an initializer, we can change our scope :with_body_containing
implementation to depend on the ActionText::RichText.with_body_containing
scope:
--- a/app/models/article.rb +++ b/app/models/article.rb class Article < ApplicationRecord has_rich_text :content - before_save { content.plain_text_body = content.body.to_plain_text } - - scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) } - to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?) - SQL + scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.with_body_containing(query)) } end
Wrapping up
We’ve unlocked PostgreSQL full-text searching capabilities for content that was previously indecipherable to its search engine. To do so, we’ve extended existing Action Text concepts and classes to purge rich-text content of its HTML, encoded that content into a text-searchable vector, and stored it in a way that is performant to retrieve. We’ve done so with the tools already at our disposal: Active Record and PostgresSQL. All without adding any additional system-level dependencies or services!
Special thanks to George Claghorn, Javan Makhmali, and Sam Stephenson for their collective work on Action Text and Trix.