The following approach will work for all JSON content but was used here for full text search on content created with Slate. In the format Slate produces by default all plain text content is conveniently placed in fields with key text
. Slate can emit plain text but here all content will be stored in a PostgreSQL jsonb
column and we will be doing full texts search using a functional GIN index directly.
Postgres can produce tsvector
s (the format postgres uses to store text search data) on actual json data with the json(b)_to_tsvector
function, but this will index all values in the json document, so we need to do some filtering first. Postgres 12 introduced the SQL/JSON Path Language machinery which exactly allows for this.
Let’s consider the following content
stored in our documents
table:
[
{
"data": {
},
"type": "introduction",
"nodes": [
{
"text": "This is a test document",
"marks": [
],
"object": "text"
}
],
"object": "block"
},
{
"data": {
},
"type": "paragraph",
"nodes": [
{
"text": "Just to show what content format we might have.",
"marks": [
],
"object": "text"
}
],
"object": "block"
},
{
"data": {
},
"type": "conclusion",
"nodes": [
{
"text": "I hope you get the general idea.",
"marks": [
],
"object": "text"
}
],
"object": "block"
}
]
Jumping right to the good bit, jsonb_path_query_array(content, 'strict $.**.text')
will produce jsonb
:
["This is a test document", "Just to show what content format we might have.", "I hope you get the general idea."].
Note the strict
keyword here, leaving it out will produce:
["This is a test document", "This is a test document", "Just to show what content format we might have.", "Just to show what content format we might have.", "I hope you get the general idea.", "I hope you get the general idea."].
According to the documentation strict
mode should throw actual errors when a non-existing path is queried (which does not happen because of the explicit **
expansion (otherwise one might use 'strict $.** ? (exists(@.text)).text'
)), but it seems to me lax
mode should not return double results. Anyway, whatever works.
Note that
select jsonb_path_query(content, 'strict $.** ? (exists(@.text))')->>'text' from documents;
produces nice rows of plain text
content (as opposed to jsonb-text
).
We can now do a full text search of our documents
:
select * from documents where
jsonb_to_tsvector('english',
jsonb_path_query_array(content, 'strict $.**.text'),
'["string"]')
@@ to_tsquery('belgian & beer');
Note that json(b)_to_tsvector
actually generates a tsvector
for every element in our json array and then concatenates these, this might not be what we want. Results do look a bit different, with our original content:
to_tsvector('english', jsonb_path_query_array(content, 'strict $.**.text')::text)
gives
'content':10 'document':5 'format':11 'general':20 'get':18 'hope':16 'idea':21 'might':13 'show':8 'test':4
and
jsonb_to_tsvector('english', jsonb_path_query_array(content, 'strict $.**.text'), '["string"]')
gives
'content':11 'document':5 'format':12 'general':22 'get':20 'hope':18 'idea':23 'might':14 'show':9 'test':4
A quick check does not show any difference in performance.
To speed things up and allow for split second full rich text search on millions of records create a functional index:
CREATE INDEX ON paper USING gin(jsonb_to_tsvector('english', jsonb_path_query_array(content, 'strict $.**.text'), '["string"]'));
Note that you need to exactly reproduce the expression used in your index in your query to make use of this.
And that’s that!