Full Rich Text Search with Slate and PostgreSQL 12

The following approach will work for all JSON content but was used here for full text search on content created with Slate. In the format Slate produces by default all plain text content is conveniently placed in fields with key text. Slate can emit plain text but here all content will be stored in a PostgreSQL jsonb column and we will be doing full texts search using a functional GIN index directly.

Postgres can produce tsvectors (the format postgres uses to store text search data) on actual json data with the json(b)_to_tsvector function, but this will index all values in the json document, so we need to do some filtering first. Postgres 12 introduced the SQL/JSON Path Language machinery which exactly allows for this.

Let’s consider the following content stored in our documents table:

[
    {
        "data": {
        },
        "type": "introduction",
        "nodes": [
            {
                "text": "This is a test document",
                "marks": [
                ],
                "object": "text"
            }
        ],
        "object": "block"
    },
    {
        "data": {
        },
        "type": "paragraph",
        "nodes": [
            {
                "text": "Just to show what content format we might have.",
                "marks": [
                ],
                "object": "text"
            }
        ],
        "object": "block"
    },
    {
        "data": {
        },
        "type": "conclusion",
        "nodes": [
            {
                "text": "I hope you get the general idea.",
                "marks": [
                ],
                "object": "text"
            }
        ],
        "object": "block"
    }
]

Jumping right to the good bit, jsonb_path_query_array(content, 'strict $.**.text') will produce jsonb:

["This is a test document", "Just to show what content format we might have.", "I hope you get the general idea."].

Note the strict keyword here, leaving it out will produce:

["This is a test document", "This is a test document", "Just to show what content format we might have.", "Just to show what content format we might have.", "I hope you get the general idea.", "I hope you get the general idea."].

According to the documentation strict mode should throw actual errors when a non-existing path is queried (which does not happen because of the explicit ** expansion (otherwise one might use 'strict $.** ? (exists(@.text)).text')), but it seems to me lax mode should not return double results. Anyway, whatever works.

Note that

select jsonb_path_query(content, 'strict $.** ? (exists(@.text))')->>'text' from documents;

produces nice rows of plain text content (as opposed to jsonb-text).

We can now do a full text search of our documents:

select * from documents where
    jsonb_to_tsvector('english',
                      jsonb_path_query_array(content, 'strict $.**.text'),
                      '["string"]')
    @@ to_tsquery('belgian & beer');

Note that json(b)_to_tsvector actually generates a tsvector for every element in our json array and then concatenates these, this might not be what we want. Results do look a bit different, with our original content:

to_tsvector('english', jsonb_path_query_array(content, 'strict $.**.text')::text)

gives

'content':10 'document':5 'format':11 'general':20 'get':18 'hope':16 'idea':21 'might':13 'show':8 'test':4

and

jsonb_to_tsvector('english', jsonb_path_query_array(content, 'strict $.**.text'), '["string"]')

gives

'content':11 'document':5 'format':12 'general':22 'get':20 'hope':18 'idea':23 'might':14 'show':9 'test':4

A quick check does not show any difference in performance.

To speed things up and allow for split second full rich text search on millions of records create a functional index:

CREATE INDEX ON paper USING gin(jsonb_to_tsvector('english', jsonb_path_query_array(content, 'strict $.**.text'), '["string"]'));

Note that you need to exactly reproduce the expression used in your index in your query to make use of this.

And that’s that!