Mastering Website Content Indexing with Sitecore Search
👋Introduction
Enhancing search functionality and user experience on your website requires effective content indexing. This article shows you how to use Sitecore Search to index content from documents and websites, giving you the tools you need for effective content retrieval and discovery.
🤯Understanding Sitecore Search
Sitecore Search is a headless, AI-powered content discovery platform that helps you build predictive and custom search experiences across various content sources. To extract and index your content, the platform offers generic connectors that you can configure.
Sitecore Search and Sitecore Discover are different products, but they have an overlapping feature set and are built on ReflektionAI. Sitecore Discover is best suited to e-commerce applications (product results with personalization and recommendations), whereas Sitecore Search is best suited to content-driven applications.
You can find more details about what Sitecore Search is, its features, and its benefits at What is Sitecore Search?: A Definitive Introduction
🔎Website Content Indexing
Sitecore Search is a powerful SaaS product that lets you index content from any source, including documents, and search the indexed content from any application, built on any tech stack, through the API endpoint that Sitecore Search provides.
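As a rough illustration of calling the search endpoint from any tech stack, the sketch below only builds a request body and does not send it. The endpoint pattern, the field names (context, widget, rfk_id, keyphrase), and the widget ID rfk_1 are assumptions based on my understanding of the Sitecore Search REST API, so verify them against the API reference for your own domain before relying on them.

```python
import json

# Assumed endpoint pattern; the domain ID is a placeholder you must supply.
SEARCH_ENDPOINT = "https://discover.sitecorecloud.io/discover/v2/{domain_id}"


def build_search_request(keyphrase, page_uri="/search", country="us",
                         language="en", widget_id="rfk_1",
                         entity="content", limit=10):
    """Build a search request body (assumed shape, not sent anywhere)."""
    return {
        "context": {
            "page": {"uri": page_uri},
            "locale": {"country": country, "language": language},
        },
        "widget": {
            "items": [
                {
                    "entity": entity,       # which indexed entity to search
                    "rfk_id": widget_id,    # hypothetical search widget ID
                    "search": {
                        "content": {},      # ask for content attributes back
                        "query": {"keyphrase": keyphrase},
                        "limit": limit,
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    body = build_search_request("content indexing")
    print(SEARCH_ENDPOINT.format(domain_id="<your-domain-id>"))
    print(json.dumps(body, indent=2))
```

Any HTTP client can then POST this body with your API key, which is why the front end is free to use whatever stack it prefers.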
When you are working with the Sitecore content management system (traditional or headless), you can use the default search provider, Solr, for both internal Sitecore CMS search and end-user website search, or you can use Sitecore Search for end-user website search.
For a headless implementation based on Sitecore XM Cloud, you have two options: use the Sitecore Experience Edge GraphQL (GQL) endpoint for simple search use cases, or use Sitecore Search or another third-party search provider for end-user website search.
Indexing content is important because it lets users find the content they need easily, which improves website performance and the user experience.
🤔How to Index Website Content with Sitecore Search
Sitecore Search requires a data source so that it can ingest content into the Sitecore Search system. When you use Sitecore Search as a search provider, it looks up the searched keywords in the indexed data and returns AI-based, personalised search results (and recommendations) to the end user.
Sitecore Search gives you the flexibility to index different types of content in different ways, according to your requirements. With Sitecore Search, you can index content from HTML pages, API endpoints, documents, and more.
Please find below a diagram that explains the details of indexed content within the Sitecore Search system:
You can index your Sitecore XP, Sitecore XM, or Sitecore XM Cloud (XMC) website content into the Sitecore Search system in the following ways:
How do I index Sitecore XP or Sitecore XM website content into Sitecore Search?
If you are using a Sitecore XP or Sitecore XM based topology to build your website, either ASP.NET MVC or headless, then you can utilise Sitecore’s SaaS-based search provider, Sitecore Search, to index your website's data.
You can index data in the following ways:
Sitecore Search provides an API Push source in the form of the Sitecore Search Ingestion API. You can call it from custom pipeline code (Sitecore Publish Pipeline > Publish End Event) on the Sitecore XP or Sitecore XM CMS role to send data to the Sitecore Search system for indexing.
Sitecore Search provides the following pull sources, which can be used to crawl the data from your website or API endpoint:
API Crawler: If your content can only be accessed through an API endpoint, and the API returns JSON
Feed Crawler: Crawls feed files (CSV or JSON)
Web Crawler: If you have content in one locale, and all the content is accessible through a webpage
Advanced Web Crawler: If you need to index content in multiple languages or want to use JavaScript to extract attributes
It's recommended to use the Advanced Web Crawler to crawl the data (or content) from your website.
You also need to define the scope of the source that the crawler uses: the domains that the crawler is allowed to access, the URLs it must avoid, the maximum depth it should crawl to, and more.
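To make the idea of a crawl scope concrete, here is a minimal model of how allowed domains, exclusion patterns, and a maximum depth combine into a single yes/no decision per URL. This is not how Sitecore Search implements its crawler; the CrawlScope class, its field names, and the glob-style patterns are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass
class CrawlScope:
    """Hypothetical model of a source's scope settings."""
    allowed_domains: set
    exclusion_patterns: list = field(default_factory=list)  # glob-style paths
    max_depth: int = 2


def in_scope(url, depth, scope):
    """Return True if a URL found at the given link depth may be crawled."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in scope.allowed_domains:
        return False  # outside the allowed domains
    if depth > scope.max_depth:
        return False  # deeper than the crawler should go
    # Reject any path that matches an exclusion pattern.
    return not any(fnmatchcase(parsed.path, pattern)
                   for pattern in scope.exclusion_patterns)


if __name__ == "__main__":
    scope = CrawlScope({"www.example.com"}, ["/admin/*"], max_depth=2)
    for candidate, depth in [
        ("https://www.example.com/blog/post", 1),
        ("https://www.example.com/admin/users", 1),
        ("https://other.example.org/", 0),
    ]:
        print(candidate, in_scope(candidate, depth, scope))
```

Thinking about scope this way helps when a crawl returns fewer pages than expected: check the domain list first, then the depth limit, then the exclusions.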
How do I index Sitecore XM Cloud website content in Sitecore Search?
The XM Cloud SaaS solution’s Content Management role uses a Solr-based search instance, provisioned by the Sitecore SaaS solution, for internal search. This Solr-based search provider is not available for XM Cloud website search (front-end search). XM Cloud does not provide a Content Delivery role and instead uses Sitecore Experience Edge for content delivery, so there is no possibility of using the CM role's search indexes for content delivery.
Because content delivery in XM Cloud happens through Sitecore Experience Edge, you can build front-end applications in any tech stack, preferably using the Sitecore JSS Next.js SDK. You can't utilise the Solr search provider available to the CMS role as you used to in traditional Sitecore, where web indexes are updated on publish so that updated content is available for website search.
If you are using Sitecore XM Cloud to build your headless website, then you can utilise Sitecore’s SaaS based search provider, Sitecore Search, to index your website's data.
You can index data in the following ways:
You can use the Sitecore XM Cloud workflow notification to determine which content has changed state and send specific data to Sitecore Search via a webhook.
Sitecore XM Cloud provides three types of webhooks:
-
For our use case, we can use the webhook submit action: when an item moves to the approved state, the XM Cloud workflow webhook sends a notification with the required data to the specified endpoint.
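The receiving endpoint then has to turn that notification into an indexing call. The sketch below shows one way this mapping might look. The payload shape (NextState, DataItem) is a simplified, hypothetical version of what a workflow webhook sends, and the Ingestion API URL pattern and document body are assumptions that you should check against the Ingestion API documentation. The field names under "fields" are placeholders that must match attributes defined in your own Sitecore Search domain.

```python
# Assumed base URL for the Ingestion API; verify it for your environment.
INGESTION_BASE = "https://discover.sitecorecloud.io/ingestion/v1"


def to_ingestion_request(payload, domain_id, source_id,
                         entity="content", approved_state="Approved"):
    """Map a (simplified, hypothetical) workflow webhook payload to a
    description of an Ingestion API call, or None if nothing should be sent."""
    next_state = payload.get("NextState", {}).get("Name")
    if next_state != approved_state:
        return None  # only approved items are pushed for indexing

    item = payload["DataItem"]
    url = (f"{INGESTION_BASE}/domains/{domain_id}/sources/{source_id}"
           f"/entities/{entity}/documents")
    body = {
        "document": {
            "id": item["Id"],
            # Placeholder attribute names; use the ones from your domain.
            "fields": {"name": item["Name"], "language": item["Language"]},
        }
    }
    return {"method": "POST", "url": url, "body": body}


if __name__ == "__main__":
    sample = {
        "NextState": {"Name": "Approved"},
        "DataItem": {"Id": "item-123", "Name": "Home", "Language": "en"},
    }
    print(to_ingestion_request(sample, "<domain-id>", "<source-id>"))
```

Keeping the mapping in a pure function like this makes it easy to reuse whether the receiver is a serverless function or a recipe step in an integration platform.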
You can also try the Edge webhook’s OnUpdate event to get the updated data directly from Sitecore Experience Edge. You can find more details at Sitecore Experience Edge for XM > Webhook objects.
With this approach, we can utilise Sitecore Connect, a low-code/no-code integration platform, which will send the data to Sitecore Search for indexing.
You can create a custom Workato webhooks connector and use it in Sitecore Connect.
Please check more details about Webhooks at XM Cloud Forms and Sitecore Composable Digital Experience Platform
You can utilise the Sitecore Search Advanced Web Crawler to crawl the data (or content) from your website.
You can also utilise the Sitecore PowerShell Extensions (SPE) task scheduler to run scripts at a specific time. This means you can create an SPE script that checks the state of items and pushes those items to Sitecore Search using the Sitecore Search Ingestion API.
💡Sitecore XM Cloud is a SaaS platform, so we need to avoid deploying any custom code to the Sitecore XM CMS instance.
You can check out more details about Sitecore XM Cloud search options at Sitecore XM Cloud Search Options
📄Indexing Document Content with Sitecore Search
Yes, you heard correctly: Sitecore Search can also index PDF files. The Sitecore Search document extractors only support parsing HTML or JSON content, so to configure extraction you need to know the HTML structure of your PDF files.
You can check more details at Sitecore Search Indexing PDFs | Sitecore Documentation
👍Best Practices for Indexing with Sitecore Search
Please find below some tips and best practices for effective indexing:
🎯 You can use a single source to crawl content from different domains by using the Sitecore Search Web Crawler Settings > ALLOWED DOMAINS attribute
🎯 You can use the Sitecore Search trigger settings to define multiple triggers with different URLs
🎯 Use a larger number for MAX URLS
🎯 You can use a single source to define multiple trigger types with different URLs in the trigger settings, so that content is fetched from each URL defined there
🎯 Use different taggers to handle different sets of attributes within a single document extractor
🎯 Sitecore Search isn't a replacement for Solr in the Sitecore CMS. It's similar to other search providers such as Coveo, Algolia, or SearchStax: it is used to index content for end users and is not meant for Sitecore’s internal search (back-end search)
🎯 The reusable Sitecore Search Starter Kit code base, with Sitecore Search widgets, is available on GitHub, and the same open-source Sitecore Search Starter Kit code base is hosted here so that you can validate the Sitecore Search widgets and functionality
🎯 It's good to use the canonical URL as the ID to avoid duplicate items
🎯 One document cannot have more than one extractor. However, you can set up more than one extractor if each uses the URLs to Match field and targets separate documents. Otherwise, the last configured document extractor "wins" and is the one that will be run.
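The canonical URL tip above can be illustrated with a short sketch. When a page declares its own canonical URL, that value is the natural ID. The code below shows the underlying idea for the case where you only have the crawled URL: normalise it so that trivially different URLs for the same page collapse into one ID. Dropping the query string, fragment, and port, and hashing the result with SHA-1, are my own illustrative choices and not Sitecore Search behaviour. Do not drop the query string if your site uses it to select content.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_id(url):
    """Derive a stable document ID from a URL (illustrative normalisation)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()   # lower-case host, port dropped
    path = parts.path.rstrip("/") or "/"    # ignore a trailing slash
    # Query string and fragment are discarded (an assumption, see above).
    canonical = urlunsplit((parts.scheme.lower(), host, path, "", ""))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = canonical_id("https://Example.com/Blog/?utm_source=mail#top")
    b = canonical_id("https://example.com/Blog")
    print(a == b)  # both URLs map to the same document ID
```

With IDs derived this way, re-crawling the same page under a tracking URL updates the existing document instead of creating a duplicate.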
🤷♂️Sitecore Search FAQ
Is Experience Edge compatible with Sitecore Search?
Sitecore Experience Edge (XE) and Sitecore Search are different Sitecore SaaS products with no relationship between them, so the question of compatibility does not arise.
You can check more details about Sitecore Experience Edge at Quickstart guide - All about Sitecore Experience Edge ~ Amit's Blog and about Sitecore Search at What is Sitecore Search?: A Definitive Introduction
Can I use Sitecore Search with Sitecore XM Cloud?
Yes, you can use it to implement end-user search functionality, but not for Sitecore XM Cloud CMS internal search.
Can I use Solr to implement the search in Sitecore XM Cloud?
The XM Cloud SaaS solution’s Content Management role uses a Solr-based search instance, provisioned by the Sitecore SaaS solution, for internal search.
You can also set up your own Solr instance, ingest data into it for indexing, and use it to implement search functionality (front-end search) for end users in the head application (front end).
Can I use the Sitecore XM Cloud provided GraphQL (GQL) endpoint to implement the front-end search or end-user search?
Yes, you can, but only for simple search, not for advanced search functionality such as dynamic facets, search recommendations, boosting, etc.
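For a sense of what "simple search" means here, the sketch below builds a GraphQL request for a single field-contains-term lookup. The endpoint URL, the sc_apikey header mentioned in the comment, the search query shape (where, operator CONTAINS, total, results), and the field name title are assumptions based on my understanding of the Experience Edge schema. Check them in the GraphQL IDE for your own tenant. The code only builds the request and does not send it.

```python
import json

# Assumed Experience Edge delivery endpoint. Requests are expected to carry
# an API key header (assumed name: sc_apikey).
EDGE_ENDPOINT = "https://edge.sitecorecloud.io/api/graphql/v1"

# Assumed query shape: one predicate, no facets, no boosting.
SEARCH_QUERY = """
query SimpleSearch($field: String!, $term: String!, $first: Int!) {
  search(
    where: { name: $field, value: $term, operator: CONTAINS }
    first: $first
  ) {
    total
    results {
      id
      name
      url { path }
    }
  }
}
"""


def build_edge_search(term, field="title", first=10):
    """Build the JSON body for a simple Experience Edge search request."""
    return {
        "query": SEARCH_QUERY,
        "variables": {"field": field, "term": term, "first": first},
    }


if __name__ == "__main__":
    print(EDGE_ENDPOINT)
    print(json.dumps(build_edge_search("indexing")["variables"], indent=2))
```

Anything beyond predicates like this one, such as dynamic facets or relevance tuning, is where a dedicated search provider becomes the better fit.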
What type of capabilities do you need to validate while selecting the external search provider for Sitecore XM Cloud?
While selecting the external search provider, you should look for the following capabilities in your third-party search provider: required security compliance and certifications, easy integration, multilingual support, fit with your organizational tech stack, AI/ML capability for content boosting, the ability to index content from different content sources, out-of-the-box front-end search components, personalized search results, analytics reporting, and, last but not least, good customer support.
💡Conclusion
In this blog, we discussed in detail the options available for indexing website content for Sitecore XP, Sitecore XM, and Sitecore XM Cloud (XMC).
We also covered the search options available in a Sitecore XM Cloud (XMC)-based implementation.
By indexing content with Sitecore Search, you can empower users to discover and retrieve relevant content effortlessly. Implementing robust indexing strategies for both website and document content ensures a seamless search experience that enhances user engagement and satisfaction.
In upcoming blog posts, I will try to explain the different types of content indexing options in detail.
🙏Credit/References
Open AI | Indexing Content | Using the Ingestion API to add content to an index |
Best practices to configure a source | What is Sitecore Search?: A Definitive Introduction | Excalidraw |
HTML5ANIMATIONTOGIF.COM |