Extract metadata with Sitecore Search JavaScript Document Extractor

In the fast-paced world of digital content, efficiently discovering and retrieving information is crucial for delivering exceptional user experiences. Sitecore Search offers robust capabilities to extract, index, and retrieve content. Central to its functionality is the document extractor, a tool that transforms raw content into searchable index documents. Whether you're building a new website or enhancing an existing one, setting up a complex search feature requires careful attention to how metadata is stored and extracted. In this article, we will explore best practices for embedding detailed metadata within web pages and how to use Sitecore Search's JavaScript Document Extractor to index it.

🎩Optimizing Metadata Storage for Enhanced Search in Sitecore Headless CMS

Efficient metadata storage in a Sitecore Headless CMS for a Next.js website is vital for performance, SEO, and seamless content delivery. For complex search functionality, store metadata efficiently using structured data formats, consistent naming, clear hierarchy, tags, and regular updates. These practices ensure effective extraction and indexing with Sitecore Search:

1. Leverage Sitecore's Content Architecture

Use Template Fields for Metadata:
Define metadata fields (e.g., title, description, keywords, Open Graph tags) in your Sitecore templates. This ensures consistency and makes it easier for content authors to manage metadata.
Organize Metadata in Structured Data Models:
Group related metadata fields into logical sections (e.g., SEO, Social Media) within your templates. This simplifies content entry and retrieval.

Use custom fields in Sitecore to store specific metadata (e.g., articleTitle, authorName) relevant to your website's content and functionality, and render them dynamically in your Next.js components. This ensures all necessary information is captured and indexed.

2. Implement Structured Data (Schema.org)

Embed JSON-LD in Your Next.js Pages:
Use JSON-LD to add structured data (e.g., Article, Product, FAQ) to your pages. This improves SEO and helps search engines understand your content.

<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Article Title",
    "description": "Article Description",
    "author": {
      "@type": "Person",
      "name": "Author Name"
    }
  }
</script>
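The static JSON-LD above can also be generated from field values. The following is a minimal sketch, not the demo site's code: the helper name buildArticleJsonLd and its parameter names are hypothetical, and the sample values are the placeholders from the snippet above.

```javascript
// Hypothetical helper: builds the JSON-LD payload for an article page.
function buildArticleJsonLd({ title, description, authorName }) {
  const data = {
    '@context': 'https://schema.org',
    '@type': 'Article',
    headline: title,
    description: description,
    author: { '@type': 'Person', name: authorName },
  };
  // Escape "<" so a value such as "</script>" cannot close the script tag early.
  return JSON.stringify(data).replace(/</g, '\\u003c');
}

const jsonLd = buildArticleJsonLd({
  title: 'Article Title',
  description: 'Article Description',
  authorName: 'Author Name',
});
console.log(jsonLd);
```

In a Next.js component, the resulting string could be placed in a script tag of type application/ld+json through dangerouslySetInnerHTML, so the markup stays in sync with the Sitecore fields.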

3. Optimize Metadata for SEO

You can dynamically create the metadata or other attributes that you want to add to the Sitecore Search document. It's generally a good idea to make common metadata changes in the Layout.tsx file. Here, I am using the Layout.tsx file from the Sitecore.Demo.XMCloud.Verticals GitHub repository.

I updated some code in Layout.tsx to prevent build errors and to generate metadata only from the Layout.tsx file. This avoids duplicating metadata tag generation in different components like ArticleDetails.tsx and AuthorDetails.tsx:

// Layout.tsx => File path: https://github.com/Sitecore/Sitecore.Demo.XMCloud.Verticals/blob/main/src/sxastarter/src/Layout.tsx
        <meta property="og:site" content={layoutData?.sitecore?.context?.site?.name} />
        <meta name="description" content="A Verticals demo site."></meta>

// AuthorDetails.tsx => File path: https://github.com/Sitecore/Sitecore.Demo.XMCloud.Verticals/blob/main/src/sxastarter/src/components/PageContent/AuthorDetails.tsx
      <Head>
        <meta property="og:description" content={props.fields?.Bio?.value} />
        <meta property="og:name" content={props.fields?.Name?.value} />
        <meta property="og:title" content={props.fields?.Name?.value} />
        <meta property="og:image" content={props.fields?.Photo?.value?.src} />
        <meta property="og:type" content="author" />
      </Head>

In the files above, you can see that there is repeated code for common metadata properties like description. It's best to keep common metadata properties in one place and have content-type-specific properties in their respective components or code files.

I updated the code in the Layout.tsx file to put the common metadata properties in one place. I also fixed the lint errors that showed up during the build.

/**
 * This Layout is needed for Starter Kit.
 */
import React from 'react';
import Head from 'next/head';
import { Placeholder, LayoutServiceData, HTMLLink } from '@sitecore-jss/sitecore-jss-nextjs';
import config from 'temp/config';
import Scripts from 'src/Scripts';
import { ParallaxProvider } from 'react-scroll-parallax';
import type { Metadata } from 'next';

// Prefix public assets with a public URL to enable compatibility with Sitecore Experience Editor.
// If you're not supporting the Experience Editor, you can remove this.
const publicUrl = config.publicUrl;

interface LayoutProps {
  layoutData: LayoutServiceData;
  headLinks: HTMLLink[];
}

interface OpenGraphImage {
  url: string;
}

const Layout = ({ layoutData, headLinks }: LayoutProps): JSX.Element => {
  const { route } = layoutData.sitecore;
  const fields = route?.fields || {};
  const isPageEditing = layoutData.sitecore.context.pageEditing;
  const mainClassPageEditing = isPageEditing ? 'editing-mode' : 'prod-mode';
  const theme = layoutData.sitecore.context.theme as string;
  const contextSiteClass = `site-${theme?.toLowerCase()}`;

  //The getFirst200Words function is a utility function designed to extract the first 200 words from a given text string.
  const getFirst200Words = (text: string) => {
    return text.split(' ').slice(0, 200).join(' ');
  };

  // The getFieldValue function is a utility function designed to extract the value of a field.
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  const getFieldValue = (field: any) => {
    return typeof field?.value === 'string' ? field.value : '';
  };

  // The getImagePath function is a utility function designed to extract the image path from a field.
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  const getImagePath = (field: any) => {
    return typeof field?.value?.src === 'string' ? field.value.src : '';
  };

  // The pageDescription variable is used to extract the description from the fields of the current route. The getFieldValue function is used to extract the value of a field.
  const pageDescription =
    getFieldValue(fields.Excerpt) ||
    getFieldValue(fields.Bio) ||
    getFieldValue(fields.Text) ||
    getFieldValue(fields.Content) ||
    getFieldValue(fields.Abstract) ||
    '';

  // The pageImage variable is used to extract the image path from the fields of the current route
  const pageImage =
    getImagePath(fields.Thumbnail) ||
    getImagePath(fields.Photo) ||
    getImagePath(fields.BackgroundImage) ||
    getImagePath(fields.Content) ||
    '';

  // The metadata object is used to define the metadata for the current page.
  const metadata: Metadata = {
    title: getFieldValue(fields?.Title) || 'Page',
    description: pageDescription,
    openGraph: {
      title: getFieldValue(fields?.Title) || 'Page',
      description: pageDescription,
      images: pageImage ? [{ url: pageImage }] : [],
    },
  };

  // Join the public URL and the route name with a single slash.
  const canonicalUrl = `${publicUrl}/${route?.name ?? ''}`;
  return (
    <>
      <Scripts />
      <Head>
        <title>{getFieldValue(fields?.Title) || 'Page'}</title>
        <link rel="icon" href={`${publicUrl}/favicon.ico`} />
        <link rel="preconnect" href="https://fonts.googleapis.com" />
        <link rel="preconnect" href="https://fonts.gstatic.com" crossOrigin={'anonymous'} />
        <link rel="preconnect" href="https://cdnjs.cloudflare.com" />
        <link rel="canonical" href={canonicalUrl} />
        <meta property="og:site" content={layoutData?.sitecore?.context?.site?.name} />
        {/* The metadata?.openGraph?.title is used to set the title of the page.*/}
        <meta property="og:title" content={metadata?.openGraph?.title as string | undefined} />
        {/* The metadata?.openGraph?.description is used to set the description of the page.*/}
        <meta property="og:description" content={metadata?.openGraph?.description} />
        {/* The metadata?.openGraph?.images is used to set the image of the page.*/}
        {Array.isArray(metadata?.openGraph?.images) &&
          metadata?.openGraph?.images.map((image: OpenGraphImage, index) => (
            <meta key={index} property="og:image" content={image.url} />
          ))}
       . 
       . 
       . 
       . 

      </Head>

    </>
  );
};

export default Layout;

With this setup, you can control all the common metadata from a single location.😊

Also, I have defined specific types like OpenGraphImage to prevent the @typescript-eslint/no-explicit-any error and make the code more type-safe.
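The layout builds the canonical link by joining the public URL and the route name. As a rough sketch, a small helper can make that join tolerant of stray slashes on either side. The name buildCanonicalUrl and the example.com URL are assumptions for illustration and are not part of the demo repository.

```javascript
// Hypothetical helper: joins a base URL and a path with exactly one slash.
function buildCanonicalUrl(publicUrl, routePath) {
  const base = String(publicUrl || '').replace(/\/+$/, ''); // drop trailing slashes
  const path = String(routePath || '').replace(/^\/+/, ''); // drop leading slashes
  return path ? `${base}/${path}` : `${base}/`;
}

console.log(buildCanonicalUrl('https://www.example.com/', '/portfolio/my-page'));
```

Keeping the join in one function means the same rule applies wherever a canonical or og:url value is rendered.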

4. Handle Requirements of Complex Metadata for Sitecore Search

When you work on a large-scale, high-traffic, public-facing website or e-commerce application, you often need to pass many values from the backend to the frontend. These values are used for client-side manipulation or consumed by third-party systems (e.g., the Sitecore Search crawler) that read your public-facing pages.

A practical way to handle this is to store the data in metadata tags. If you want to combine multiple backend values into one tag, add a single meta tag (named application-details in the example below) and attach each value as a custom data- attribute.

        {/* The below meta tags are combination of the metadata and the fields of the current route, 
         and generally used to store the complex data and combine multiple fields into a single meta tag.*/}
        <meta
          name="application-details"
          content={layoutData?.sitecore?.context?.site?.name}
          data-siteName={layoutData?.sitecore?.context?.site?.name}
          data-itemId={route?.itemId}
          data-itemName={route?.name}
          data-itemTitle={getFieldValue(fields?.Title)}
          data-itemLanguage={route?.itemLanguage}
          data-itemPath={layoutData?.sitecore?.context?.itemPath}
          data-itemContent={getFirst200Words(getFieldValue(fields?.Content))}
          data-itemTemplateId={route?.templateId}
          data-itemTemplateName={route?.templateName}
          data-itemCategory={getFieldValue(fields?.Category)}
        />

If you want to read or update these attributes in a client-side script, native JavaScript offers the getAttribute and setAttribute methods, and jQuery offers the attr method, as shown below:

// Get a reference to the meta element in native JavaScript
const meta = document.getElementsByName('application-details')[0];

// Get the value of a specific data- attribute in native JavaScript
meta.getAttribute('data-itemid');

// Update the value of a specific data- attribute in native JavaScript
meta.setAttribute('data-itemid', 'new-value');

// Get a reference to the meta element in jQuery
$('meta[name="application-details"]');

// Get the value of a specific data- attribute in jQuery (jQuery objects expose attr, not getAttribute)
$('meta[name="application-details"]').attr('data-itemid');
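When a script needs all of the custom values at once, the attributes can be copied into a plain object. This is a minimal sketch: the helper name readDataAttributes is hypothetical, and it only assumes an element-like object whose attributes collection holds name and value pairs, as DOM elements do.

```javascript
// Hypothetical helper: copies every data-* attribute of an element into a plain object.
// It only needs an object with an `attributes` collection of { name, value } entries,
// which is what DOM elements expose.
function readDataAttributes(element) {
  const result = {};
  for (const attr of Array.from(element.attributes)) {
    if (attr.name.startsWith('data-')) {
      // Strip the "data-" prefix so "data-itemid" becomes the key "itemid".
      result[attr.name.slice('data-'.length)] = attr.value;
    }
  }
  return result;
}

// In a browser:
// const meta = document.querySelector('meta[name="application-details"]');
// console.log(readDataAttributes(meta).itemid);
```

In browsers, the built-in element.dataset property offers similar access; the helper above simply keeps the attribute names exactly as they appear in the markup.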

🔎How to Extract Metadata in Sitecore Search

Before we dive into fetching page metadata in Sitecore Search, let's first understand the role of the document extractor in Sitecore Search.

The Role of Document Extractors in Sitecore Search

Document extractors are very important in Sitecore Search. They look at URLs or documents from your content, pull out useful metadata, and make index documents. These index documents are added to the source's index, which helps provide fast and accurate search results.

How Document Extractors Work

The document extractor works on every page in your content system. It carefully examines the HTML structure, pulling out metadata and content to fill the index. This process makes sure all important information is captured and searchable.

To learn more about Sitecore Search and how to set up a document extractor, you can read my article, Boost Sitecore Search with Advanced Web Crawling and JavaScript Extraction:

https://enlightenwithamit.hashnode.dev/boost-sitecore-search-with-advanced-web-crawling-and-javascript-extraction

In our case, we will use a JavaScript document extractor, which allows us to write custom logic to extract the necessary attributes from the crawled source, such as metadata from the page. We then add these attributes to the index document, and Sitecore Search includes these index documents in the source's index, enabling users to get search results.

The complex metadata setup that we added in the Layout.tsx file will be rendered as:

<meta name="application-details" content="Services" data-sitename="Services" 
data-itemid="7fb88a43-25ab-479b-b7cb-26c545e810b7" 
data-itemname="a-digital-transformation-story-premiere-insurance" 
data-itemtitle="A digital transformation story: Premiere Insurance" 
data-itemlanguage="en" data-itempath="/portfolio/a-digital-transformation-story-premiere-insurance" 
data-itemcontent="In an era where technological advancements dictate the pace of industry evolution, Premiere Insurance embarked on a comprehensive digital transformation journey." 
data-itemtemplateid="60fe154b-e0bc-4fa8-b87c-55ab54015a49" 
data-itemtemplatename="Project Page" data-itemcategory="Branding">
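Note that the rendered attribute names are all lowercase even though the JSX used mixed case, because HTML attribute names are case-insensitive and parsers normalize them to lowercase. As a sketch of one way to keep the names predictable, the attributes can be generated from a plain object. The helper name toDataAttributes is hypothetical and not part of the demo repository.

```javascript
// Hypothetical helper: turns a plain object into lowercase data-* attribute names,
// matching how the attributes appear in the rendered HTML.
function toDataAttributes(values) {
  const attributes = {};
  for (const [key, value] of Object.entries(values)) {
    // Skip empty values so the tag carries only meaningful attributes.
    if (value === undefined || value === null || value === '') continue;
    attributes[`data-${key.toLowerCase()}`] = String(value);
  }
  return attributes;
}

console.log(toDataAttributes({ siteName: 'Services', itemLanguage: 'en', itemCategory: '' }));
```

The returned object could be spread onto the meta element in JSX, and the same lowercase names are the ones to use when reading the values back in the extractor.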

Now, in the JavaScript source code for extracting content, we will use the code below to extract the metadata and store it in the Sitecore Search indexed document:

function extract(request, response) {
    $ = response.body;
    const meta = $('meta[name="application-details"]');

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content'),
         // A custom attribute needs to be created in Sitecore Search.
        'itemid': meta.attr('data-itemid'),
         // A custom attribute needs to be created in Sitecore Search.
        'itemlanguage': meta.attr('data-itemlanguage'),
    }];
}

In the Sitecore Search Document Extractor code above, the way we access metadata differs from native DOM scripting. The $ object taken from response.body returns jQuery-style selections rather than DOM elements, so attributes are read with .attr('data-itemid'). If you call .getAttribute('data-itemid') on a selection, you'll get a "not a function" error.
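In jQuery-style APIs, .attr() returns undefined when the attribute is missing, which is why the || chains in the extractor fall through to the next candidate. A whitespace-only value would still win, though. The following sketch shows one way to guard against that; the helper name firstNonEmpty is hypothetical, and the commented usage assumes the selection object supports .first(), as jQuery-style libraries commonly do.

```javascript
// Hypothetical helper: returns the first argument that is a non-blank string, trimmed.
function firstNonEmpty(...candidates) {
  for (const candidate of candidates) {
    if (typeof candidate === 'string' && candidate.trim() !== '') {
      return candidate.trim();
    }
  }
  return '';
}

// Inside extract(), it could wrap the same lookups used above, for example:
// 'description': firstNonEmpty(
//   $('meta[name="description"]').attr('content'),
//   $('meta[property="og:description"]').attr('content'),
//   $('p').first().text()
// ),
console.log(firstNonEmpty(undefined, '   ', ' Services overview '));
```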

In the Sitecore Search Document Extractor code, you can also apply conditions and add content as key-value pairs to the Sitecore Search indexed document:

function extract(request, response) {
    $ = response.body;

    // Derive the product name from the crawled URL.
    const url = request.url.toLowerCase();
    let product = 'Content';
    if (url.indexOf('/portfolio/') !== -1) {
        product = 'Portfolio';
    } else if (url.indexOf('/insights/') !== -1) {
        product = 'Blogs';
    } else if (url.indexOf('/authors/') !== -1) {
        product = 'Authors';
    }

    const meta = $('meta[name="application-details"]');

    return [{
        'description': $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[property="og:title"]').attr('content') || $('title').text(),
        'type': 'XMC-Vertical-Services',
        'url': $('meta[property="og:url"]').attr('content'),
        // A custom attribute needs to be created in Sitecore Search.
        'itemid': meta.attr('data-itemid'),
        // A custom attribute needs to be created in Sitecore Search.
        'itemlanguage': meta.attr('data-itemlanguage'),
        'product': product,
    }];
}

In the Sitecore Search Document Extractor code above, I derive the product name for the document from the URL using if/else conditions. You can apply any JavaScript logic here to compute a custom attribute value, store it in the Sitecore Search document, and use it to build an advanced search interface for your end users.
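As the number of URL rules grows, the same mapping can be expressed as a lookup table. This is a sketch rather than the author's implementation: the names PRODUCT_BY_SEGMENT and productFromUrl are hypothetical, the example.com URL is a placeholder, and the segments and values are the ones used in the extractor above.

```javascript
// Hypothetical lookup table: URL segment -> value stored in the 'product' attribute.
const PRODUCT_BY_SEGMENT = [
  ['/portfolio/', 'Portfolio'],
  ['/insights/', 'Blogs'],
  ['/authors/', 'Authors'],
];

function productFromUrl(url) {
  const lowerUrl = String(url || '').toLowerCase();
  // The first matching segment wins, mirroring the order of the if/else chain.
  const match = PRODUCT_BY_SEGMENT.find(([segment]) => lowerUrl.includes(segment));
  return match ? match[1] : 'Content';
}

console.log(productFromUrl('https://www.example.com/Insights/some-post'));
```

Adding a new content type then only requires a new row in the table, and the default of 'Content' stays in one place.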

💡

If you don't expose the page URL through a metadata property, you can use request.url to get the URL of the crawled page and store it in the Sitecore Search document.
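A minimal sketch of that fallback is shown here. It returns only the fields needed to illustrate the idea, so a real extractor would also return the other attributes from the earlier examples. The $ variable is declared with const in this sketch, which is a stylistic assumption.

```javascript
function extract(request, response) {
  const $ = response.body;
  const metaUrl = $('meta[property="og:url"]').attr('content');

  return [{
    'name': $('meta[property="og:title"]').attr('content') || $('title').text(),
    // Fall back to the crawled URL when the page exposes no og:url tag.
    'url': metaUrl || request.url,
  }];
}
```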

⚡Benefits of Advanced Metadata Extraction

By using Sitecore Search's document extractors, especially the JavaScript version, organizations can:

Improve search relevance by including detailed metadata in index documents.
Enhance content discoverability through more accurate and comprehensive indexing.
Implement custom extraction logic to handle unique content structures.
Create a more robust and flexible search experience for end-users.

💡Conclusion

Sitecore Search's document extractors, especially the JavaScript Document Extractor, transform raw content into highly searchable index documents. By leveraging these tools, developers and content managers can enhance search capabilities on Sitecore-powered websites, improving user experiences and content discovery.

Efficient content extraction, indexing, and retrieval are crucial for delivering outstanding digital experiences. Efficient metadata storage for a Next.js website on Sitecore XM Cloud requires structured content architecture, optimized data fetching, and dynamic rendering. Implementing these best practices keeps metadata well organized, SEO-friendly, and performant, which improves both user experience and search engine rankings.
