DEV Community

Mansa Keïta


Modeling semi-structured data in Rails

Relational databases are very powerful. Their power comes from their ability to...

  • Preserve data integrity with a predefined schema.
  • Make complex relationships through joins.

But sometimes we stumble across data that doesn't fit the relational model. We call this kind of data semi-structured data.
When this happens, the things that make relational databases powerful are the very things that get in our way and complicate our model instead of simplifying it.

That's why document databases exist: to model and store semi-structured data. However, if we choose a document database, we lose all the power of a relational database.

Luckily for us, relational databases like Postgres and MySQL now have good JSON support. So most of us won't need a document database like MongoDB, as it would be overkill. Most of the time, we only need to denormalize some parts of our model, so it makes more sense to use simple JSON columns for those parts than to go all-in and dump our beloved relational database for MongoDB.

In Rails, we can take full control over how our JSON data is stored in and retrieved from the database by using the Attributes API to serialize and deserialize it. So let's see how we can model semi-structured data in a more convenient way.

Use case: Dealing with bibliographic data

Let's say that we are building an app to help libraries build and manage an online catalog. When we're browsing through a catalog, we often see item information formatted like this:

    Author: Shakespeare, William, 1564-1616.
    Title: Hamlet / William Shakespeare.
    Description: xiii, 295 pages : illustrations ; 23 cm.
    Series: NTC Shakespeare series.
    Local Call No: 822.33 S52 S7
    ISBN: 0844257443
    Series Entry: NTC Shakespeare series.

But in the library world, data is produced and exchanged in this form:

    LDR 00815nam 2200289 a 4500
    001 ocm30152659
    003 OCoLC
    005 19971028235910.0
    008 940909t19941994ilua 000 0 eng
    010 $a92060871
    020 $a0844257443
    040 $aDLC$cDLC$dBKL$dUtOrBLW
    049 $aBKLA
    099 $a822.33$aS52$aS7
    100 1 $aShakespeare, William,$d1564-1616.
    245 10$aHamlet /$cWilliam Shakespeare.
    264 1$aLincolnwood, Ill. :$bNTC Pub. Group,$c[1994]
    264 4$c©1994.
    300 $axiii, 295 pages :$billustrations ;$c23 cm.
    336 $atext$btxt$2rdacontent.
    337 $aunmediated$bn$2rdamedia.
    338 $avolume$bnc$2rdacarrier.
    490 1 $aNTC Shakespeare series.
    830 0$aNTC Shakespeare series.
    907 $a.b108930609
    948 $aLTI 2018-07-09
    948 $aMARS

This is what we call a MARC (Machine-Readable Cataloging) record. That's how libraries describe the resources they own.

As you can see, that's really verbose! That's because in the library world, resources are described very precisely, in order to be "machine-readable".

For convenience, developers usually represent MARC data in JSON:

    {
      "leader": "00815nam 2200289 a 4500",
      "fields": [
        { "tag": "001", "value": "ocm30152659" },
        { "tag": "003", "value": "OCoLC" },
        { "tag": "005", "value": "19971028235910.0" },
        { "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
        { "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
        { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
        { "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" }] },
        { "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
        { "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" }] },
        { "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." }] },
        { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." }] },
        { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" }] },
        { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
        { "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." }] },
        { "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." }] },
        { "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." }] },
        { "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." }] },
        { "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
        { "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
        { "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
        { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
        { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
      ]
    }

By looking at this JSON representation, we can see that the data is...

  • Nested: A MARC record contains many fields, and most of them contain multiple subfields.
  • Dynamic: Some fields are repeatable ("264" and "948"), and so are subfields. The first fields have neither subfields nor indicators (they're called control fields).
  • Encapsulated: The meaning of subfields depends on the field they're in (take a look at the "a" subfield for example).
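
To make these three traits concrete, here's a small plain-Ruby sketch that inspects a trimmed-down copy of the record above (only four fields are kept, and the variable names are just for illustration):

```ruby
require "json"

# A shortened copy of the MARC-in-JSON record shown above.
record = JSON.parse(<<~JSON)
  {
    "leader": "00815nam 2200289 a 4500",
    "fields": [
      { "tag": "001", "value": "ocm30152659" },
      { "tag": "245", "indicator1": "1", "indicator2": "0",
        "subfields": [{ "code": "a", "value": "Hamlet" },
                      { "code": "c", "value": "William Shakespeare." }] },
      { "tag": "264", "indicator1": " ", "indicator2": "1",
        "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }] },
      { "tag": "264", "indicator1": " ", "indicator2": "4",
        "subfields": [{ "code": "c", "value": "©1994." }] }
    ]
  }
JSON

fields = record["fields"]

# Dynamic: tags that appear more than once in the same record.
repeated_tags = fields.map { |field| field["tag"] }
                      .tally
                      .select { |_tag, count| count > 1 }
                      .keys

# Dynamic: control fields carry a bare value and no subfields.
control_tags = fields.reject { |field| field.key?("subfields") }
                     .map { |field| field["tag"] }

# Encapsulated: the first "a" subfield of each tag holds a different kind of value.
meaning_of_a = fields.each_with_object({}) do |field, result|
  subfield = (field["subfields"] || []).find { |sub| sub["code"] == "a" }
  result[field["tag"]] ||= subfield["value"] if subfield
end

puts "repeated: #{repeated_tags.inspect}"
puts "control:  #{control_tags.inspect}"
puts "a means:  #{meaning_of_a.inspect}"
```

The "a" subfield is a title under "245" but a place of publication under "264", which is exactly why a single flat `subfields` table would be awkward to query.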

All those characteristics can be grouped into what we call: semi-structured data.

Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. - Wikipedia

A perfect example of that is HTML documents. An HTML document contains different types of tags which can be nested in multiple ways. It wouldn't make sense to model HTML documents with tables and columns. Imagine having to access nested tags through joins, considering that we could potentially have hundreds of them in a single HTML document. That's why we usually store this kind of data in a text field.

In our case, we're using JSON to represent MARC data. Luckily for us, we can store JSON data directly in relational databases like Postgres or MySQL:

    # config/initializers/inflections.rb
    ActiveSupport::Inflector.inflections(:en) do |inflect|
      inflect.acronym "MARC"
    end

    $ rails g model marc/record leader:string fields:json
    $ rails db:migrate

We can then create a MARC record like this:

    MARC::Record.create(
      leader: "00815nam 2200289 a 4500",
      fields: [
        { "tag": "001", "value": "ocm30152659" },
        { "tag": "003", "value": "OCoLC" },
        { "tag": "005", "value": "19971028235910.0" },
        { "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
        { "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
        { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
        { "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" }] },
        { "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
        { "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" }] },
        { "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." }] },
        { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." }] },
        { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" }] },
        { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
        { "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." }] },
        { "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." }] },
        { "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." }] },
        { "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." }] },
        { "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
        { "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
        { "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
        { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
        { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
      ]
    )

And access it this way:

    record = MARC::Record.first
    field = record.fields.find { |field| field["tag"] == "245" }
    subfield = field["subfields"].first
    subfield["value"]
    => "Hamlet"

It works, but...

  • It's not very convenient to access nested data this way.
  • We cannot easily attach logic to our JSON data without polluting our model.

What if we could interact with our JSON data the same way we do with ActiveRecord associations? Enter ActiveModel and the Attributes API!

First, we have to define a custom type which...

  • Maps JSON objects to ActiveModel-compliant objects.
  • Handles collections.

To do that, we'll add the following options to our type:

  • :class_name: The class name of an ActiveModel-compliant object.
  • :collection: Specifies whether the attribute is a collection. Defaults to false.

    class DocumentType < ActiveModel::Type::Value
      attr_reader :document_class, :collection

      def initialize(class_name:, collection: false)
        @document_class = class_name.constantize
        @collection = collection
      end

      # Turns raw attributes (or an array of them) into document objects
      def cast(value)
        # A new record has no JSON yet
        return if value.nil?

        if collection
          value.map { |attributes| process attributes }
        else
          process value
        end
      end

      def process(value)
        # Leave already-built documents untouched
        return value if value.is_a?(document_class)

        document_class.new(value)
      end

      def serialize(value)
        value.to_json unless value.nil?
      end

      def deserialize(json)
        # A NULL column has nothing to decode
        return if json.nil?

        value = ActiveSupport::JSON.decode(json)
        cast value
      end

      # Track changes
      def changed_in_place?(old_value, new_value)
        deserialize(old_value) != new_value
      end
    end
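
To see the cast/serialize/deserialize round trip in isolation, here's a dependency-free sketch of the same idea. `PlainSubfield` and `TinyDocumentType` are hypothetical stand-ins that I'm using for illustration; they are not part of ActiveModel, and the real type above relies on `ActiveModel::Type::Value` instead.

```ruby
require "json"

# Hypothetical stand-in for an ActiveModel-compliant document class.
PlainSubfield = Struct.new(:code, :value, keyword_init: true) do
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# Hypothetical, simplified take on DocumentType that runs without Rails.
class TinyDocumentType
  def initialize(document_class, collection: false)
    @document_class = document_class
    @collection = collection
  end

  # Raw attributes (or an array of them) become document objects.
  def cast(value)
    return if value.nil?

    @collection ? value.map { |attributes| build(attributes) } : build(value)
  end

  def serialize(value)
    value.to_json
  end

  def deserialize(json)
    return if json.nil?

    cast(JSON.parse(json))
  end

  # The stored JSON is compared with the objects currently in memory.
  def changed_in_place?(raw_old_value, new_value)
    deserialize(raw_old_value) != new_value
  end

  private

  def build(attributes)
    return attributes if attributes.is_a?(@document_class)

    @document_class.new(**attributes.transform_keys(&:to_sym))
  end
end

type = TinyDocumentType.new(PlainSubfield, collection: true)
raw = '[{"code":"a","value":"Hamlet"},{"code":"c","value":"William Shakespeare."}]'

subfields = type.deserialize(raw)
unchanged = type.changed_in_place?(raw, subfields)

subfields.first.value = "Romeo and Juliet"
changed = type.changed_in_place?(raw, subfields)

round_trip = type.serialize(type.deserialize(raw))
```

Change tracking works here only because `Struct` compares its members in `==`. That's the reason the document classes in this article define their own `==` as well.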

Let's register our type, since we're going to use it multiple times:

    # config/initializers/type.rb
    ActiveModel::Type.register(:document, DocumentType)
    ActiveRecord::Type.register(:document, DocumentType)

Now we can use it in our models:

    class MARC::Record < ApplicationRecord
      attribute :fields, :document,
        class_name: "MARC::Record::Field",
        collection: true

      def at(tag)
        fields.find { |field| field.tag == tag }
      end
    end

    class MARC::Record::Field
      include ActiveModel::Model
      include ActiveModel::Attributes
      include ActiveModel::Serializers::JSON

      attribute :tag, :string
      attribute :value, :string
      attribute :indicator1, :string
      attribute :indicator2, :string
      attribute :subfields, :document,
        class_name: "MARC::Record::Field::Subfield",
        collection: true

      # Control fields don't have indicators or subfields
      # (no "id" key here: this class doesn't define an id attribute)
      def attributes
        if control_field?
          { "tag" => tag, "value" => value }
        else
          {
            "tag" => tag,
            "indicator1" => indicator1,
            "indicator2" => indicator2,
            "subfields" => subfields
          }
        end
      end

      # Control fields use the tags "001" to "009"
      def control_field?
        /\A00\d\z/.match?(tag)
      end

      def at(code)
        subfields.find { |subfield| subfield.code == code }
      end

      alias [] at

      # Used to track changes
      def ==(other)
        other.is_a?(self.class) && attributes == other.attributes
      end
    end

    class MARC::Record::Field::Subfield
      include ActiveModel::Model
      include ActiveModel::Attributes
      include ActiveModel::Serializers::JSON

      attribute :code, :string
      attribute :value, :string

      # Used to track changes
      def ==(other)
        other.is_a?(self.class) && attributes == other.attributes
      end
    end

Let's test this in the console:

    record.at("245")["a"].value
    => "Hamlet"

    record.changed?
    => false

    record.at("245")["a"].value = "Romeo and Juliet"
    record.at("245")["a"].value
    => "Romeo and Juliet"

    record.changed?
    => true

Et voilà! Home-made associations!

Luckily, you won't need to implement this yourself, as the ActiveModel::Embedding gem does it for you (and more).

Here's how we can simplify our models:

    class MARC::Record < ApplicationRecord
      include ActiveModel::Embedding::Associations

      embeds_many :fields

      # ...
    end

    class MARC::Record::Field
      include ActiveModel::Embedding::Document

      # ...

      embeds_many :subfields

      # ...
    end

    class MARC::Record::Field::Subfield
      include ActiveModel::Embedding::Document

      # ...
    end

We can then code our views with nested attributes support out-of-the-box:

    <%# app/views/marc/records/_form.html.erb %>
    <%= form_with model: @record do |record_form| %>
      <% @record.fields.each do |field| %>
        <%= record_form.fields_for :fields, field do |field_fields| %>
          <%= field_fields.label :tag %>
          <%= field_fields.text_field :tag %>

          <% if field.control_field? %>
            <%= field_fields.text_field :value %>
          <% else %>
            <%= field_fields.text_field :indicator1 %>
            <%= field_fields.text_field :indicator2 %>

            <%= field_fields.fields_for :subfields do |subfield_fields| %>
              <%= subfield_fields.label :code %>
              <%= subfield_fields.text_field :code %>
              <%= subfield_fields.text_field :value %>
            <% end %>
          <% end %>
        <% end %>
      <% end %>

      <%= record_form.submit %>
    <% end %>
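
For context, a form like this usually submits its nested inputs under `*_attributes` keys indexed by position, which is the general Rails nested attributes convention. The sketch below shows how such a hash relates to our `fields` array. It assumes the gem follows that convention; the `params` hash and the `nested_attributes_to_documents` helper are hypothetical and only serve as an illustration.

```ruby
# Hypothetical params, shaped the way nested-attributes forms usually submit them.
params = {
  "fields_attributes" => {
    "0" => { "tag" => "001", "value" => "ocm30152659" },
    "1" => {
      "tag" => "245", "indicator1" => "1", "indicator2" => "0",
      "subfields_attributes" => {
        "0" => { "code" => "a", "value" => "Hamlet" }
      }
    }
  }
}

# Hypothetical helper: turns "<name>_attributes" hashes into plain arrays under "<name>".
def nested_attributes_to_documents(attributes)
  attributes.each_with_object({}) do |(key, value), result|
    if key.end_with?("_attributes")
      children = value.values.map { |child| nested_attributes_to_documents(child) }
      result[key.delete_suffix("_attributes")] = children
    else
      result[key] = value
    end
  end
end

documents = nested_attributes_to_documents(params)
```

With the gem, we shouldn't need a helper like this. The point is only to show what "nested attributes support" has to handle behind the scenes.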

We can even use validations:

    class MARC::Record < ApplicationRecord
      # ...

      validates :fields, presence: true
      validates_associated :fields
    end

    class MARC::Record::Field
      # ...

      validates :subfields, presence: true, unless: :control_field?
      validates_associated :subfields, unless: :control_field?
    end

    class MARC::Record::Field::Subfield
      # ...

      validates_presence_of :code, :value
    end

    record = MARC::Record.new
    record.valid?
    => false

    record.fields = [{ tag: "245" }]
    record.valid?
    => false

    record.at("245").subfields = [{ code: "a", value: "Ruby on Rails" }]
    record.valid?
    => true
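
If the cascade of rules is hard to picture, here's the same logic spelled out on plain hashes, without Rails. The `record_errors` and `field_errors` helpers and their messages are hypothetical; the real checks are done by the validators declared above.

```ruby
# Control fields use the tags "001" to "009".
CONTROL_TAG = /\A00\d\z/

# Hypothetical helper: a data field needs subfields, each with a code and a value.
def field_errors(field)
  return [] if CONTROL_TAG.match?(field[:tag].to_s)

  subfields = field[:subfields] || []
  errors = []
  errors << "subfields can't be blank" if subfields.empty?

  subfields.each do |subfield|
    errors << "code can't be blank" if subfield[:code].to_s.empty?
    errors << "value can't be blank" if subfield[:value].to_s.empty?
  end

  errors
end

# Hypothetical helper: a record needs fields, and every field must be valid.
def record_errors(record)
  fields = record[:fields] || []
  return ["fields can't be blank"] if fields.empty?

  fields.flat_map { |field| field_errors(field) }
end

empty_record = {}
incomplete_record = { fields: [{ tag: "245" }] }
complete_record = {
  fields: [{ tag: "245", subfields: [{ code: "a", value: "Ruby on Rails" }] }]
}

puts record_errors(empty_record).inspect
puts record_errors(incomplete_record).inspect
puts record_errors(complete_record).inspect
```

The three hashes mirror the three console steps above: no fields, a data field without subfields, and a complete field.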

We can use custom collections if we need to add custom behaviour:

    class MARC::Record::FieldCollection
      include ActiveModel::Embedding::Collecting
      include Enumerable

      def at(tag)
        find { |field| field.tag == tag }
      end

      def repeated?(field)
        # ...
      end

      # ...
    end

    class MARC::Record < ApplicationRecord
      include ActiveModel::Embedding::Associations

      embeds_many :fields, collection: "FieldCollection"

      delegate :at, :repeated?, to: :fields

      # ...
    end

    record = MARC::Record.first

    record.at("245")["a"].value
    => "Hamlet"

    record.repeated?("245")
    => false

    record.repeated?("264")
    => true
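
The body of `repeated?` is left out above, so here's one possible implementation in a plain-Ruby collection. This is only a sketch: `PlainField` and `SimpleFieldCollection` are hypothetical names, the collection doesn't use the gem's `Collecting` module, and I'm assuming `repeated?` receives a tag, as in the console session.

```ruby
# Hypothetical stand-in for a field document.
PlainField = Struct.new(:tag, :value, keyword_init: true)

# Hypothetical plain-Ruby collection (no gem involved).
class SimpleFieldCollection
  include Enumerable

  def initialize(fields)
    @fields = fields
  end

  # Enumerable only needs #each to provide find, count, map, etc.
  def each(&block)
    @fields.each(&block)
  end

  def at(tag)
    find { |field| field.tag == tag }
  end

  # A tag is repeated when more than one field carries it.
  def repeated?(tag)
    count { |field| field.tag == tag } > 1
  end
end

fields = SimpleFieldCollection.new([
  PlainField.new(tag: "245", value: "Hamlet"),
  PlainField.new(tag: "264", value: "[1994]"),
  PlainField.new(tag: "264", value: "©1994.")
])

puts fields.at("245").value
puts fields.repeated?("245")
puts fields.repeated?("264")
```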

We can use custom types if we need to cast the elements of a collection:

    class MARC::Record::FieldType < ActiveModel::Type::Value
      def cast(value)
        # ...
      end
    end

    class MARC::Record < ApplicationRecord
      include ActiveModel::Embedding::Associations

      embeds_many :fields, cast_type: "FieldType"

      # ...
    end
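
As an example of what such a `cast` could do, here's a plain-Ruby sketch that picks a different class depending on the tag. It's an assumption about how we might use the hook, not the gem's behaviour: `ControlField`, `DataField` and `SimpleFieldType` are hypothetical, and a real type would inherit from `ActiveModel::Type::Value` as shown above.

```ruby
# Hypothetical document classes for the two kinds of MARC fields.
ControlField = Struct.new(:tag, :value, keyword_init: true)
DataField = Struct.new(:tag, :indicator1, :indicator2, :subfields, keyword_init: true)

# Hypothetical type: casts one element of the collection at a time.
class SimpleFieldType
  CONTROL_TAG = /\A00\d\z/

  def cast(value)
    # Already-built objects (or nil) pass through untouched.
    return value unless value.is_a?(Hash)

    attributes = value.transform_keys(&:to_sym)

    if CONTROL_TAG.match?(attributes[:tag].to_s)
      ControlField.new(**attributes.slice(:tag, :value))
    else
      DataField.new(**attributes.slice(:tag, :indicator1, :indicator2, :subfields))
    end
  end
end

type = SimpleFieldType.new

control = type.cast({ "tag" => "001", "value" => "ocm30152659" })
data = type.cast({
  "tag" => "245", "indicator1" => "1", "indicator2" => "0",
  "subfields" => [{ "code" => "a", "value" => "Hamlet" }]
})
```

With two classes, control fields no longer need the `control_field?` branches in `attributes` and in the validations.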

So the next time you need to model semi-structured data in your Rails application...

Top comments (2)

Block Bench

Your text has a few minor grammar and spelling issues that could be refined for better clarity and readability. For example, "accross" should be "across," "don't" should be "doesn't" when referring to "data," and "ressources" should be corrected to "resources." Also, some sentences could flow more smoothly, like rewording “most of us won’t need to use a document database like MongoDB” to something clearer, such as “In most cases, MongoDB would be unnecessary.” Additionally, unnecessary commas, like the one after “database” in the Attributes API sentence, should be removed for better readability. If you're looking for a tool to help with structuring and optimizing digital models, consider exploring Blockbench.

Mansa Keïta

Firstly, thanks for the suggestions. These errors are part of my journey of learning English as a non-native and I'd like to keep them as a reference of how I've improved since then.
However, I don't think that's the best way to market your product, as it has nothing to do with writing, from what I can see.