Welcome to the “Understanding Semantic Search” article series. This is the first post in this series. Don’t miss our next articles on the timeline of search technology, key semantic search use cases, and our recommendation for your next enterprise search engine.
With the advancements in AI and Large Language Models (LLMs), query processing has rapidly shifted. Each new iteration of search has become more efficient and accurate, while also expanding to meet new needs. Today, the world of information retrieval is increasingly focused on semantic search.
In this article, we’ll break down what semantic search is and how similarity search — a type of semantic search — differs from keyword search. By examining the technology behind these search engines, we’ll also see why both search types remain beneficial in different situations.
Semantic Search Definition
Semantic search is an information retrieval technique that aims to determine the contextual meaning of a search query and the intent of the person running the search.
The road to semantic search started decades ago, when keyword search technology was augmented with synonyms and taxonomies. A keyword search engine records words and their positions in documents and builds an index of this information. When a user runs a search, the words in the query are compared against the words in the index to retrieve documents and sort them into search results. By adding synonyms into the mix, keyword search became more relevant and more flexible in interpreting queries. For example, by establishing that “a convertible is a type of car”, subsequent searches for “car” also return documents about convertibles.
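The mechanics above can be sketched in a few lines of Python. The corpus, synonym table, and document IDs below are invented for illustration; a real engine would also handle stemming, word positions, and relevance ranking.

```python
from collections import defaultdict

# Toy corpus; document IDs and texts are invented for illustration.
docs = {
    "doc1": "the convertible has a folding roof",
    "doc2": "this car seats five people",
    "doc3": "our warranty covers engine repairs",
}

# Taxonomy entry encoding "a convertible is a type of car".
synonyms = {"car": {"car", "convertible"}}

# Build an inverted index: word -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def keyword_search(query):
    """Expand each query word with its synonyms, then look them up in the index."""
    results = set()
    for word in query.split():
        for term in synonyms.get(word, {word}):
            results |= index.get(term, set())
    return sorted(results)
```

With the taxonomy in place, a search for “car” now also returns the convertible document, even though it never mentions the word “car”.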
But with the advent of language models came similarity search. Often, what people understand as semantic search is actually similarity search, a subset of semantic search. When indexing documentation, a similarity search engine doesn’t look at individual words but at fragments of text. These chunks are transformed into vectors that convey the meaning of each fragment. By also interpreting the meaning of the query and comparing it to the meaning of the fragments, a similarity search engine delivers results linked to the intent of the user, not to the words used.
Both types of searches order results by relevance ranking, preferably adding business rules and user preferences to contextualize and personalize the results.
Glossary of Related Terms
- Embeddings: These are mathematical representations that try to convey meaning in the form of a vector (a list) of numeric values. Embeddings are also called semantic vectors.
- Generative AI (GenAI): This is a category of AI that produces content — text, images, audio, code — that mimics human creativity. It learns patterns from training datasets and produces content based on this data in response to prompts (instructions).
- Tokenization: This is the process of splitting written text into short sequences of characters called tokens. For keyword search, one token is typically one word. But for similarity search and embedding creation, a token is just a sequence of letters that can be part of a word or span a word boundary.
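The tokenization distinction can be sketched in Python. The fixed-size subword scheme below is a deliberate simplification chosen for illustration; real embedding tokenizers (such as BPE) learn their vocabularies from data.

```python
def word_tokens(text):
    # Keyword-search style: one token per word.
    return text.lower().split()

def subword_tokens(text, size=4):
    # Toy subword scheme: fixed-size letter sequences that ignore word
    # boundaries, illustrating that embedding tokenizers do not map one
    # token to one word. Real tokenizers (e.g. BPE) learn their
    # vocabulary from data rather than cutting at fixed positions.
    stripped = text.lower().replace(" ", "")
    return [stripped[i:i + size] for i in range(0, len(stripped), size)]
```

The same phrase yields whole-word tokens in one scheme and letter sequences that straddle word boundaries in the other.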
How Does Similarity Search Work?
Similarity search works by delivering results based on how relevant the context and meaning of the content are to the search query.
Not Words, but Numbers
Similarity search uses embeddings (high-dimensional, dense vectors of numeric coordinates) to capture both the meaning of the text at indexing time and the intent of the query at run time.
Embeddings are created by the transformer architecture, the same technology used to train LLMs. Transformers process streams of text as short, manageable chunks, which are further divided into tokens, or sequences of letters. They can process any unstructured data (images, music chords, radio frequencies), not just text, but here we will focus on text-based similarity search. The transformer deciphers the meaning and context of the data in these tokens and encodes it as a unique numeric embedding.
However, it’s important to note that transformers have not always been the best architecture for creating embeddings, and the architecture of choice may change again in the future. The key to similarity search is creating embeddings, not using transformers.
Conceptual Mapping and Context
First, you need a centralized repository of company and product information. Once your knowledge base is established, it must be transformed into a database where all information is broken down into embeddings. Only then can your similarity search engine use this database to find relevant, similar information by running approximate nearest neighbor algorithms against the embedding of your query.
The search engine compares the coordinates of the query to those of the information in your knowledge base. The goal is to find embeddings with similar coordinates, which indicate the content is closely related to the search query. Finally, the search engine delivers the documentation behind these close embeddings as the search results.
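A minimal sketch of this comparison, assuming toy hand-written four-dimensional embeddings (real ones have hundreds of dimensions and come from an embedding model), and using brute-force cosine similarity in place of an approximate nearest neighbor index:

```python
import math

# Toy embeddings; the fragments and coordinates are invented for
# illustration. In practice these come from an embedding model.
knowledge_base = {
    "reset your password": [0.9, 0.1, 0.0, 0.2],
    "configure the firewall": [0.1, 0.8, 0.3, 0.0],
    "change account credentials": [0.8, 0.2, 0.1, 0.3],
}

def cosine_similarity(a, b):
    # Vectors with similar directions (close coordinates) score near 1.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query_embedding, top_k=2):
    """Brute-force nearest neighbor; ANN algorithms approximate this at scale."""
    scored = sorted(
        knowledge_base.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:top_k]]
```

A query embedding close to the “reset your password” coordinates retrieves that fragment (and the semantically related “change account credentials”) while the unrelated firewall fragment scores low.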
AI-Powered Similarity Search
Alongside the rise of similarity search is that of GenAI-powered tools, like chatbots. Combining similarity search with LLMs allows users to ask questions in natural language to explain what they’re looking for and enables the engine to respond in a conversational way too. Functioning like a ChatGPT tailored for your business, AI-powered search engines provide users with directly actionable replies. If the engine misunderstands part of the query, users can simply go back and forth to clarify their needs as they would with a support agent.
What is Lexical or Keyword Search?
The goal of lexical search is to match the keywords in the search query to documents or topics that contain the same terms. It’s a direct approach that is quick and efficient. Keyword matching (made fuzzier by stemming and synonyms) works best for short, simple searches where users know what terms to use or what documentation they are looking for.
What is TF-IDF?
Term frequency–inverse document frequency (TF-IDF) is a keyword search weighting scheme used to rank results. TF-IDF measures the importance of a word to a document within a corpus: how often the word appears in that document versus in the other documents. The most popular ranking model built on this concept is BM25.
TF-IDF refined the previously popular “bag of words” approach by adjusting the measure of importance with inverse document frequency, which looks at how common or rare a word is across the corpus.
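The computation can be sketched directly. The tiny corpus below is invented for illustration; note how a rare word like “mat” outscores a frequent word like “the” in the same document:

```python
import math

# Invented three-document corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

def tf_idf(term, doc_index):
    """Classic TF-IDF: term frequency in one document, scaled by how rare
    the term is across the corpus (inverse document frequency)."""
    words = docs[doc_index].split()
    tf = words.count(term) / len(words)                    # term frequency
    df = sum(1 for d in docs if term in d.split())         # document frequency
    idf = math.log(len(docs) / df) if df else 0.0          # inverse doc freq
    return tf * idf
```

Because “the” appears in most documents, its IDF is low and it contributes little to ranking, exactly the behavior that made TF-IDF an improvement over raw word counts.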
The Efficiency of Keyword Search
With the rise of Google, people learned to search with keywords rather than complete sentences, distilling their search intent into representative terms. Now, it’s often more efficient to enter concise keywords: when users know exactly what documentation they are looking for, this approach is more user-friendly and less time- and resource-consuming than typing long phrases. Moreover, similarity search isn’t useful when the query is just one or two words: there isn’t enough context to determine the meaning or intent of the search, or to compute a valid embedding of the query, so the results become less accurate.
By adding synonyms and taxonomies (both generic and business-specific), keyword search also gains semantic capabilities, offering more flexibility for searching.
The Value of Similarity Search
In contrast, similarity search is a valuable tool when the user doesn’t know exactly what content they’re looking for, or whether the documentation even exists. Other times, they need to explain the concept of what they’re searching for with queries like “what’s the name of the feature that allows you to drag-and-drop AI modules onto the page”. In this case, keyword search would be overwhelmed by too many words and unable to determine which are the most important. As a result, the search results would be inconsistent and littered with irrelevant hits.
Additionally, similarity search is ideal for conversational search contexts such as speaking to a chatbot. This is because keyword search is not designed to handle search history (past searches), while similarity search is, allowing users to refine their queries.
Finally, similarity search is better at understanding typos, nuances, and synonyms in queries.
Similarity Search Merged with Keyword Search
Semantic and keyword search are distinct yet complementary search engine models. They both add unique value and play specific roles in retrieving information. So even with the rise of similarity search, don’t throw out your keyword search engine!
The way of the future is to make both technologies work together, sometimes even at the same time, and merge the two result sets. This approach, called hybrid search, enhances relevance and is more robust to the way people search, whether with natural language expressions or just a few words.
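One common way to merge the two result sets is reciprocal rank fusion (RRF); the sketch below assumes two already-ranked lists of hypothetical document IDs:

```python
def reciprocal_rank_fusion(*rankings, k=60):
    """Merge several ranked result lists into one, a common approach for
    hybrid search. Each document scores 1 / (k + rank) per list; k=60 is
    the constant proposed in the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of the two engines for the same query.
keyword_results = ["doc_a", "doc_b", "doc_c"]
similarity_results = ["doc_b", "doc_d", "doc_a"]
merged = reciprocal_rank_fusion(keyword_results, similarity_results)
```

A document ranked well by both engines (like doc_b here) rises to the top of the merged list, which is precisely why hybrid search is robust to different query styles.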
From now on, we’ll use the term Semantic Search to refer to keyword search boosted with synonyms, to similarity search for natural language query processing, or to the combination of both in hybrid search technology. Just ensure that you have the right technology at hand.
Search Personalization and Engine Training
Both keyword and similarity search, and hence semantic search as a whole, face challenges around user personalization. Personalization is a separate layer that must be added on top of both types of search. While it’s not inherently part of semantic search, it matters: 70% of users expect companies to provide personalized customer support responses.
Semantic search engines are not self-adapting to improve over time; however, it is possible to update business rules and query categorization settings. For example, companies may set their search engines to prioritize version 10 documentation in the search results for geographical locations where version 11 is not yet available.
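A rule like that can be sketched as a post-ranking boost. Every field name, region, and boost factor below is hypothetical, not taken from any specific product:

```python
def apply_business_rules(results, region):
    """Hypothetical post-ranking rule: boost version 10 documentation in
    regions where version 11 is not yet available. All values here are
    illustrative assumptions."""
    v11_regions = {"us", "eu"}  # regions where version 11 has shipped

    def boosted_score(doc):
        score = doc["score"]
        if region not in v11_regions and doc["version"] == 10:
            score *= 1.5  # boost factor chosen arbitrarily for the sketch
        return score

    return sorted(results, key=boosted_score, reverse=True)
```

The engine’s relevance scores stay untouched; the business rule simply reorders results for the regions it targets, which is why such rules can be updated without retraining anything.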
Furthermore, to continually improve the user experience, companies must focus on training and measuring changes in relevance. Take Fluid Topics, for instance. We run regular relevancy tests to track any changes in our search models and make adjustments to ensure the results are always improving.
Eric Noulard, R&D Engineer at Fluid Topics, explained how we approach this testing to optimize our enterprise knowledge search results. “We approach non-regression training very scientifically. We do this by staying up to date on the latest research, because measuring relevance is a highly challenging and evolving discipline. Despite the ever-changing complications, I’m proud to say that we are ambitious in our approach. Fluid Topics is implementing cutting-edge practices to ensure our users’ search experiences don’t degrade over time.”
Conclusion
New developments in information retrieval are shining a spotlight on semantic search. These engines are shifting the possibilities of query processing to better understand complex, detailed search needs. As a result, searching for information is much more efficient in situations where previously, users would find themselves lost and confused. Don’t miss article two of the series, The Evolution of Search Engines, where we will explore key milestones in search innovations that have led us to today’s semantic search capabilities.