PostgreSQL Full-Text Search with NodeJS: Lite Version of Elastic Search in Postgres
Text Processing, Similarity Ranking, and Example Implementation with NodeJS
What is Full-Text Search?
Full-Text Search is text processing to identify documents in natural language and sort them based on relevance to keywords in a search query. Full-text search queries can find all documents that contain certain keywords and return them in the form of the most similar ones.
Full-text search overcomes the weaknesses of ordinary search queries by using textual operators (~, ~*, LIKE, and ILIKE). Searches with these operators do not support certain linguistics (including English) and are difficult to handle with regex, for example the words ‘butterfly’ and ‘butterflies’ should be the same (also for ‘read’ and ‘reading’). This type of search also requires quite a lot of effort to be made in the form of rankings which are usually needed to sort based on similarity. Additionally, it is very slow because it is not supported by indexing.
On the other hand, full-text search provides various solutions to these weaknesses. Full-text search can parse documents into tokens and divide them into several types (e.g. for words, numbers, complex words, etc.) so that they can be processed differently. This can also change words into lexemes (normalized words), such as changing them to lowercase, removing prefixes and suffixes, and removing stop words (unimportant words such as affixes). Lexemes can help in indexing and searching. The full-text search feature can also save documents that have been optimized for search purposes and ranking for proximity or similarity.
Full-text search supports various languages for text processing, such as English, Indonesian, Spanish, French, German, Russian, Chinese, and many more. You can also perform various additional configurations on the dictionary to control how tokens will be normalized, such as excluding certain stop words, combining synonyms and similar phrases, etc. You can find the complete documentation here: https://www.postgresql.org/docs/current/textsearch.html
TS Vector, TS Query in PostgreSQL
PostgreSQL provides 2 data types that can be used in full-text search, namely tsvector and tsquery. TS stands for Text Search. Tsvector is a collection of unique lexemes from words that have been normalized and then sorted alphabetically. Meanwhile, tsquery is a keyword or search pattern for the tsvector data type.
Tsvector Example
In tsvector you can use specific language to process the text. It will return collection of unique lexemes that has been normalized and sorted alphabetically. It also shown the position of each lexem in the original text.
Example 1
SELECT to_tsvector('english', 'Full-Text Search identify documents in natural language and sort them based on similarity to the keywords in a search query. Full-text search queries can scan all documents and return in the form of the most similar ones.');
The result only return sorted unique normalized word, without prefix, suffix, and affix. It also parsing word ‘full-text’ into ‘full-text’, ‘full’, and ‘text’.
Example 2
select to_tsvector('english', 'In an increasingly digital era, the existence of a website has become a necessity for any organization that wants to increase engagement, expand reach, and introduce itself to a wider audience. A website is not only the public face of an organization, but is also a vital information center for members, event participants, partners, and the general public. Lets examine the importance of a website for getting to know an organization through several commonly found features: 1. Organizational Management Information. A website provides the perfect platform to introduce members of an organizations management, be it the board of directors or the sections involved in carrying out various organizational activities. This information helps build trust and transparency among organizational members and the general public. 2. Event Registration and Participant Registration. With online event registration and participant registration features, a website simplifies the administration process and allows participants to register quickly and easily. It also helps organizations in event planning by providing an estimate of the number of participants who will attend. 3. Articles Related to Current Issues. As a reliable source of information, an organizations website can present articles related to current issues that are relevant to the organizations mission and goals. This helps increase public awareness of the issues being faced as well as the positions and efforts taken by organizations in responding to these issues. 4. Press Release. A website is an ideal place to publish press releases after events or organizational activities have been held. This allows organizations to share information about the success of the event, results achieved, documentation of activities, and even impressions from participants or other stakeholders. 5. Organization Gallery. Websites are an effective means of sharing important moments and building a visual narrative about organizational activities. This gallery can be in the form of visual documentation such as photos and videos of organizational activities. Through this organization gallery, the public can see directly the impact and contribution of the organization in various activities, which can improve the image and trust in the organization. With all these features, a website becomes not only an information platform, but also a powerful tool for building sustainable relationships with members, event participants, partners, and the general public. Therefore, it is important for every organization to pay attention to and manage their website well, so that they can utilize its full potential in strengthening and expanding the organizations positive impact on society.');
Result
Tsquery Example
With tsquery you can search some keywords with some pattern to the document that has been processed into tsvector. It will return boolean ‘true’ if keyword contain in the document, or ‘false’ otherwise.
SELECT to_tsvector('english', 'Full-Text Search identify documents in natural language and sort them based on similarity to the keywords in a search query. Full-text search queries can scan all documents and return in the form of the most similar ones.') @@ to_tsquery('language') as result;
It’s return ‘t’ or true due to the document contain keyword ‘language’ in its content.
Full-Text Search Example with NodeJS
I will show you example full text search in my latest project. I use NodeJS Express for the framework, Postgres database, and Sequelize ORM.
Table Structure
I have ‘search’ column with type tsvector. This column will use to store the tokenized and normalized text from the document (lexemes).
Insert Data
Insert some data for each column, except the ‘search’ column. You need to leave it empty because the contents must be generated using the tsvector query.
Test Convert Title to Tsvector
select id, title, to_tsvector('indonesian', title) from post;
Create Migration
In migration file, you have to include create table, generate ‘search’ column value, and create index for ‘search’ column.
Here I use setweight to sort column from the most match. It will help to build the search feature. In my query I set the matches from ‘title’ column with Indonesian dictionary is higher than matches from ‘title’ column with English dictionary. You can add weight only from A up to D. Inside the setweight, I convert the text to tsvector with dictionary based on language in the document.
You can also separate the configuration to generate ‘search’ column value to other migration file to update existing data. It will be useful if we add a ‘search’ column when the table already has a lot of data. So we need to update ‘search’ with data that already exists but has not been converted into a tsvector.
Then, add indexing for the ‘search’ column. It will faster the search process.
Create Trigger Migration
Create migration file to trigger updating value to ‘search’ column before insert event on post table. It will help to make sure the ‘search’ value is generated.
Create Search Query
For the search query, you have to add some config so it will return from the most similar to the least similar with the search keyword.
Here I define the other attributes I want to return from the query with destruct postAttr array. Then I add 1 important attribute for the search feature, named ‘rank’. This attribute contain rank from the ‘search’ column that matches with the keyword.
Besides, we have to convert the keyword into websearch tsquery and add the dictionary we use for the keyword.
Then, we have to add the tsquery keyword to the where condition. You can combine with other conditions too.
Lastly, don’t forget to order the result by ‘rank’ descending.
Search With Raw Query
Here the example when we use raw query instead of Sequelize ORM.
select id, title, content,
ts_rank(search, websearch_to_tsquery('indonesian', {keyword})) +
ts_rank(search, websearch_to_tsquery('english', {keyword})) +
ts_rank(search, websearch_to_tsquery('simple', {keyword}))
as rank
from post
where
search @@ websearch_to_tsquery('indonesian', {keyword})
or
search @@ websearch_to_tsquery('english', {keyword})
or
search @@ websearch_to_tsquery('simple', {keyword})
order by rank desc;
Test the Result
I will test with same keyword ‘kesehatan’.
Result with Raw Query
That result show post with id 26 is the most match with the keyword. It is because its title contain the keyword. While other rows only contain ‘kesehatan’ keywords in their body.
Result in API
Result in Application
You can also test my project application here: www.genbiupi.com
Conclusion
- Full-text search helps if you want to create a search feature without involving elastic search services
- Full-text search helps when using regular textual logical operators (LIKE, ILIKE, ~, ~*) and regex cannot meet data matching needs
- Full-text search is faster than searching with ordinary queries using matching operators
- Full-text search can recognize language-specific linguistics which helps when tokenizing text and normalizing words into lexemes
- Full-text search supports combining multiple columns and weighting for each column