Here is an example where using PostgreSQL as a document database quickly breaks down: indexing. Either use an RDBMS as a relational database, or use a document database. In the previous post, we saw that a B-Tree index can be created on a JSON path as long as that path contains no array, although relying on an expression-based index may already reduce performance. A document model, however, embeds arrays, and the indexing possibilities become even more limited when emulating a document database with SQL and JSONB.
Here is an example where users may have more than one e-mail address:
```sql
create table users (
  id bigserial primary key,
  data jsonb not null
);

insert into users (data) values (
  jsonb_build_object(
    'name', 'Homer Simpson',
    'email', jsonb_build_array(
      'donutlover@springfieldusa.com',
      'homerdoh@simpsons.com',
      'lazy.sofa.guy@tvcharacters.net'
    )
  )
);

INSERT INTO users (data)
SELECT jsonb_build_object(
  'name', 'u' || n::text,
  'email', jsonb_build_array(
    'u' || n::text || '@compuserve.com'
  )
)
FROM generate_series(1, 1000000) n;
```
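As a reminder of the previous post's approach, a B-Tree expression index works as long as the indexed JSON path holds a scalar. A minimal sketch, with a hypothetical index name:

```sql
-- B-Tree expression index on a scalar JSON path (no array involved)
CREATE INDEX idx_users_data_name ON users ((data->>'name'));

-- An equality predicate on the same expression can use this index
SELECT * FROM users WHERE data->>'name' = 'Homer Simpson';
```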
PostgreSQL has a JSON operator `@>` to find documents that contain a value in an array:
```sql
SELECT * FROM users
WHERE data->'email' @> '"donutlover@springfieldusa.com"';
```
An expression index on `(data->'email')` cannot be used for such a query, as the indexed value would be the whole array. To index each item, I need to create an inverted index:
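To make the limitation concrete, here is what that B-Tree expression index would look like; a sketch with a hypothetical index name, shown only to illustrate why it cannot help:

```sql
-- Hypothetical B-Tree index: it stores each e-mail array as one composite value,
-- so a containment (@>) search on individual array elements cannot use it
CREATE INDEX idx_users_data_email_btree ON users ((data->'email'));
```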
```sql
CREATE INDEX idx_users_data_email ON users
USING GIN ((data->'email') jsonb_path_ops);
```
I check that the index can be used for a query by e-mail:
```sql
set enable_seqscan to off;

explain (analyze, verbose, buffers, costs off, serialize text)
SELECT * FROM users
WHERE data->'email' @> '"donutlover@springfieldusa.com"';
```

```
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users (actual time=0.014..0.015 rows=1 loops=1)
   Output: id, data
   Recheck Cond: ((users.data -> 'email'::text) @> '"donutlover@springfieldusa.com"'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=3
   ->  Bitmap Index Scan on idx_users_data_email (actual time=0.006..0.006 rows=1 loops=1)
         Index Cond: ((users.data -> 'email'::text) @> '"donutlover@springfieldusa.com"'::jsonb)
         Buffers: shared hit=2
 Planning:
   Buffers: shared hit=27
 Planning Time: 0.169 ms
 Serialization: time=0.005 ms output=1kB format=text
 Execution Time: 0.034 ms
(13 rows)
```
This is fast because I was looking for an exact match. However, indexes are supposed to optimize more than that. A partial-match query, where I know only the prefix, is already complex to write, and it cannot use the index:
```sql
set enable_seqscan to off;

explain (analyze, verbose, buffers, costs off, serialize text)
SELECT * FROM users
WHERE EXISTS (
  SELECT 1
  FROM jsonb_array_elements_text(data->'email') AS email
  WHERE email LIKE 'donutlover@%'
);
```

```
                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.users (actual time=45.032..844.109 rows=1 loops=1)
   Output: users.id, users.data
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 1000000
   Buffers: shared hit=12346
   SubPlan 1
     ->  Function Scan on pg_catalog.jsonb_array_elements_text email (actual time=0.001..0.001 rows=0 loops=1000001)
           Function Call: jsonb_array_elements_text((users.data -> 'email'::text))
           Filter: (email.value ~~ 'donutlover@%'::text)
           Rows Removed by Filter: 1
 Planning Time: 0.071 ms
 Serialization: time=0.007 ms output=1kB format=text
 Execution Time: 844.523 ms
(13 rows)
```
If you want to index such a query in PostgreSQL, you need to forget about the JSONB structure and process it as plain text, with a text-search index using trigrams:
```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_users_data_email_trgm ON users
USING GIN ((data->>'email') gin_trgm_ops);
```
The JSON operator `->>` returns text, and `gin_trgm_ops` extracts trigrams from it to index with an inverted index. I can pre-filter on the array expanded as text, with `data->>'email'`, and search for the `'%"donutlover@%'` pattern with a LIKE. This uses the index but may return false positives, so I need to add my EXISTS with JSONB_ARRAY_ELEMENTS_TEXT to re-check before returning the result. Finally, here is the query that finds a user with an e-mail starting with "donutlover", and its execution plan:
```sql
explain (analyze, verbose, buffers, costs off, serialize text)
SELECT * FROM users
WHERE data->>'email' LIKE '%"donutlover@%'
AND EXISTS (
  SELECT 1
  FROM jsonb_array_elements_text(data->'email') AS email
  WHERE email LIKE 'donutlover@%'
);
```

```
                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users (actual time=0.035..0.036 rows=1 loops=1)
   Output: users.id, users.data
   Recheck Cond: ((users.data ->> 'email'::text) ~~ '%"donutlover@%'::text)
   Filter: EXISTS(SubPlan 1)
   Heap Blocks: exact=1
   Buffers: shared hit=13
   ->  Bitmap Index Scan on idx_users_data_email_trgm (actual time=0.015..0.015 rows=1 loops=1)
         Index Cond: ((users.data ->> 'email'::text) ~~ '%"donutlover@%'::text)
         Buffers: shared hit=12
   SubPlan 1
     ->  Function Scan on pg_catalog.jsonb_array_elements_text email (actual time=0.007..0.007 rows=1 loops=1)
           Function Call: jsonb_array_elements_text((users.data -> 'email'::text))
           Filter: (email.value ~~ 'donutlover@%'::text)
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.102 ms
 Serialization: time=0.003 ms output=1kB format=text
 Execution Time: 0.060 ms
(18 rows)
```
PostgreSQL can logically function as a document database, but nobody wants to go there given the complexity of indexing and querying it. If you use a SQL database, such data must be normalized, with two tables for the One-to-Many relationship between users and their e-mail addresses. Regular indexes then support LIKE predicates effectively. The query uses a join, and the query planner should start with the appropriate table when filters apply to both.
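A minimal sketch of that normalized design, with illustrative table and index names; `text_pattern_ops` lets a regular B-Tree index serve prefix LIKE predicates:

```sql
-- Hypothetical normalized schema: e-mail addresses in their own table
CREATE TABLE users_n (
  id   bigserial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE user_emails (
  user_id bigint NOT NULL REFERENCES users_n (id),
  email   text   NOT NULL,
  PRIMARY KEY (user_id, email)
);

-- text_pattern_ops makes the B-Tree usable for prefix LIKE searches
CREATE INDEX idx_user_emails_email ON user_emails (email text_pattern_ops);

-- The planner can start from whichever table has the more selective filter
SELECT u.*
FROM users_n u
JOIN user_emails e ON e.user_id = u.id
WHERE e.email LIKE 'donutlover@%';
```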
There are valid reasons to prefer a document model: its simplicity, alignment with business entities and application objects, and the ease of querying a flexible structure. The solution above is the opposite of that expected simplicity. In MongoDB, you don't even need to know whether the "email" field is a single value or an array:
```javascript
db.users.insertMany([
  {
    "_id": 1,
    "data": {
      "name": "Homer Simpson",
      "email": [
        "donutlover@springfieldusa.com",
        "homerdoh@simpsons.com",
        "lazy.sofa.guy@tvcharacters.net"
      ]
    }
  },
  {
    "_id": 2,
    "data": {
      "name": "Marge Simpson",
      "email": "marge@springfieldusa.com"
    }
  }
]);

// Insert one million
const bulkUsers = [];
for (let n = 3; n <= 1000002; n++) {
  bulkUsers.push({
    _id: n,
    data: {
      name: "u" + n,
      email: [ "u" + n + "@compuserve.com" ]
    }
  });
  // Insert in batches of 10,000 for efficiency
  if (bulkUsers.length === 10000) {
    db.users.insertMany(bulkUsers);
    bulkUsers.length = 0; // Clear the array
  }
}
// Insert any remaining documents
if (bulkUsers.length > 0) {
  db.users.insertMany(bulkUsers);
}
```
I create a simple index:
```javascript
db.users.createIndex({ "data.email": 1 });
```
The query is straightforward, whether it involves an exact search or a regular expression.
```javascript
db.users.find({ "data.email": "donutlover@springfieldusa.com" });
[
  {
    _id: 1,
    data: {
      name: 'Homer Simpson',
      email: [
        'donutlover@springfieldusa.com',
        'homerdoh@simpsons.com',
        'lazy.sofa.guy@tvcharacters.net'
      ]
    }
  }
]

db.users.find({ "data.email": { $regex: "^donutlover@" } });
[
  {
    _id: 1,
    data: {
      name: 'Homer Simpson',
      email: [
        'donutlover@springfieldusa.com',
        'homerdoh@simpsons.com',
        'lazy.sofa.guy@tvcharacters.net'
      ]
    }
  }
]
```
The execution plan shows the fastest possible access to the document:
```
...
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 1,
  totalKeysExamined: 2,
  totalDocsExamined: 1,
...
  stage: 'IXSCAN',
  nReturned: 1,
  executionTimeMillisEstimate: 0,
  works: 3,
  keyPattern: { 'data.email': 1 },
  indexName: 'data.email_1',
  isMultiKey: true,
  multiKeyPaths: { 'data.email': [ 'data.email' ] },
  direction: 'forward',
  indexBounds: {
    'data.email': [
      '["donutlover@", "donutloverA")',
      '[/^donutlover@/, /^donutlover@/]'
    ]
  },
  keysExamined: 2,
  seeks: 2,
...
```
With MongoDB, you don't have to choose between regular and inverted indexes or deal with their limitations. A single index on `{ 'data.email': 1 }` handles both scalar values and arrays. For arrays, MongoDB marks the index as multikey (`isMultiKey: true`) and retrieves documents containing values that match the filter. The index serves equality and range queries, and regular expressions with a known prefix are automatically optimized by the query planner into index bounds.
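For instance, the same multikey index can serve a range predicate. A minimal sketch, with illustrative bounds:

```javascript
// Range query on "data.email": served by the same { "data.email": 1 } index,
// matching documents where an e-mail value falls within the (illustrative) bounds
db.users.find({ "data.email": { $gte: "u1000@", $lt: "u1001@" } });
```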
When you hear that JSONB transforms PostgreSQL into a document database, consider trying simple queries beyond just equality predicates. Adding a MongoDB API on top of an SQL database addresses syntax complexity, but it does not resolve the limitations of the underlying indexes.