Site search with javascript

Published 19 April 2014

Static sites are cool. You can configure everything on your machine, see exactly how it'll look before you upload, and you pay a pittance in hosting. With javascript services like Disqus and Clicky becoming more and more popular, the majority of blog tasks that would in the past have required some sort of server-side programming can be outsourced or run entirely in javascript on the user's computer. It's pretty awesome.

One thing that Real Blogs™ should have is search, and this is something that can be implemented in javascript. I've had a really basic system like this in place for a while, but after using it a couple of times I noticed that it was missing certain features I liked. So I completely revamped it.

(Note: I'm deploying my site with jekyll: your choice of site generator will heavily affect how you get your data into some sort of curated form, but the javascript on the front end should be relatively similar regardless of platform.)

Step 1: Extract your posts

The first thing you want to do is turn your posts from a series of files into something more searchable. I like a JSON database myself, as javascript reads that pretty well and it's not too hard to set up. I have a JSON file sitting in my _site's root that looks like this:

[
  {
    "title": "Sample post title",
    "body": "Sample body text",
    "category": "site",
    "date": "2012-04-10 00:00:00 +1200",
    "url": "/blog/sample-url.html"
  },
  {
    "title": "A second sample post title",
    "body": "More sample body text",
    "category": "code",
    "date": "2012-04-15 00:00:00 +1200",
    "url": "/blog/another-sample-url.html"
  }
]

I generate this database every time I build using a Generator in jekyll. It's a little complicated and requires a couple of classes, but I'll go through it slowly:

require "json"

module Jekyll

  class JSONPage < Page
    def initialize(site, name, data)
      @site = site
      @base = site.source
      @dir = ""
      @name = "#{name}.json"
      process(@name)

      # in place of read_yaml
      self.data = {}
      self.content = JSON.pretty_generate(data)
    end
  end


  module Generators
    class JSONDB < Generator

      def generate(site)
        # Generate JSON database of posts
        db = []

        site.posts.each do |p|
          payload = {
            "site" => {},
            "page" => p.to_liquid(Post::EXCERPT_ATTRIBUTES_FOR_LIQUID)
          }.deep_merge(site.site_payload)

          info = {
            filters: [Jekyll::Filters],
            registers: {
              site: site,
              page: payload['page']
            }
          }

          post_body = p.content # This will be unparsed
          post_body = p.converter(true).convert(post_body)
          post_body = Liquid::Template.parse(post_body).render!(payload, info)
          post_body = p.converter(false).convert(post_body)

          post_body = post_body.gsub(/<.*?>/,"")

          db << {
            "title"   => p.title,
            "body"    => post_body,
            "category"=> p.categories.join(","),
            "date"    => p.date,
            "url"     => p.url
          }
        end

        site.pages << JSONPage.new(site, "posts", db)
      end
    end
  end
end

The JSONPage class is a subclass of jekyll's Page, which takes the site and a name as well as some data to display. The page's content is generated using the JSON library. This bit is easy.

The hard bit comes in the JSONDB class, and even then it's mainly because of the way jekyll renders pages, combined with my own modifications to the jekyll library. Every generator's generate method gets called after posts and pages are read (but before they're rendered), and is passed the site variable, which holds pretty much everything we need. The assignment of variables to payload and info is merely recreating some of the internals of the Post class that aren't publicly accessible, and the four lines that follow run the post through markup conversion and liquid template insertion. This stuff is usually done automagically in jekyll, and this is where my hacks to the jekyll library make it somewhat trickier than normal. These lines would be replaced by the following in a stock jekyll install:

post_body = Liquid::Template.parse(p.content).render!(payload, info)
post_body = p.converter(false).convert(post_body)

Following this I remove all html (because later on I'll want to make excerpts of the post body, and this sucks when you have html everywhere), and convert to a simple hash (which is added to db). Once I've done this for every post, I pass the resultant array to a new JSONPage, which outputs it to the correct place.

The whole thing is probably pretty processor-intensive (we're effectively generating the site twice), but since you run it once, when you generate/deploy, it's not that big of a deal.

Step 2: Build a search

The actual search page on my site is just a text field (appropriately id'd, so I can access it with javascript), a button, an empty div for search results and a javascript library. Using jQuery I can pretty easily call my posts.json file into the program and use it to look for my query:

$("#searchButton").click(function(){
  var searchQuery = $("#searchField")[0].value;
  $.getJSON("/posts.json", function(msg){
    var matchingItems = msg.filter(function(i){
      return (
        i.title.indexOf(searchQuery) >= 0 ||
        i.body.indexOf(searchQuery)  >= 0
      );
    });
    var divContents = "<ul>";
    for (var i in matchingItems) {
      var item = matchingItems[i];
      divContents = divContents + "<li><a href=\""+item.url + "\">"+item.title+"</a></li>";
    }
    divContents = divContents + "</ul>"; // Don't forget to close the list
    $("#searchResults").html(divContents);
  });
  return false;
});

This method:

  1. Fetches the query from the searchField element
  2. Fetches data from /posts.json using AJAX[1]
  3. Filters the entries based on title and body content
  4. Displays an unordered list of matching entries, linking to their pages

It uses a minimal amount of jQuery to run, and if it weren't for other sections of my site using it I'd probably look at swapping it out for plain vanilla javascript.
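For what it's worth, the jQuery-free version isn't much longer. Here's a rough sketch using XMLHttpRequest directly; the element ids match the ones above, and the filter step is pulled out into a plain function (the names findMatches and runSearch are mine, not anything the site actually uses):

```javascript
// The filter step as a plain function: keep posts whose title or
// body contains the query.
function findMatches(posts, query) {
  return posts.filter(function (p) {
    return p.title.indexOf(query) >= 0 || p.body.indexOf(query) >= 0;
  });
}

// Fetch /posts.json without jQuery and render the matches.
function runSearch() {
  var query = document.getElementById("searchField").value;
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/posts.json");
  xhr.onload = function () {
    var posts = JSON.parse(xhr.responseText);
    var items = findMatches(posts, query).map(function (p) {
      return '<li><a href="' + p.url + '">' + p.title + "</a></li>";
    });
    document.getElementById("searchResults").innerHTML =
      "<ul>" + items.join("") + "</ul>";
  };
  xhr.send();
}
```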

Advanced geekery 1: Body extract

I like to have a little piece of the body of each post sitting under its title, to tell people what they're clicking on. My workflow for building this extract is:

  1. Find the first occurrence of my search term in the text.
  2. Pick up a decent chunk of the post on either side (only picking up whole words).
  3. Display this extract, with the search term bolded.

This is what it looks like in code:

for (var i in matchingItems) {
  var item = matchingItems[i];
  var termIndex = item.body.indexOf(searchQuery);
  if (termIndex < 0){ termIndex = 0 } // In case we don't spot it

  var start, end; // These store where the extract will start and end on the body
  if (termIndex < 250){start = 0} // I'm taking 250 characters on either side of the search term
  else {
    start = termIndex - 250;
    while (start > 0 && item.body[start-1] != " "){start--} //Decrement until we hit a space or the start of the text
  }

  var maxEnd = item.body.length;
  if (maxEnd - termIndex < 250){end = maxEnd} //Again, 250 chars to the right of the term
  else {
    end = termIndex + 250;
    while (end < maxEnd && item.body[end] != " "){end++} //As before, but we're headed up the string this time
  }
  var extract = item.body.substring(start, end);
  //Bold the search term!
  var regexpTerm = new RegExp(searchQuery, "gi"); //Global ignore-case
  extract = extract.replace(regexpTerm, "<b>$&</b>");
}

It's not really that complex - just a bit involved. I've actually farmed this out into its own class to keep my code relatively clean.
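Separated out, that logic might look something like this (the name makeExtract, the configurable radius, and the case-insensitive lookup are my assumptions, not necessarily how the site's own class does it):

```javascript
// Pull a window of whole words out of body, centred on the first
// occurrence of query, with the query bolded. radius is how many
// characters to take on either side.
function makeExtract(body, query, radius) {
  radius = radius || 250;
  var hit = body.toLowerCase().indexOf(query.toLowerCase());
  if (hit < 0) { hit = 0; } // In case we don't spot it, start at the top

  var start = hit - radius;
  if (start < 0) { start = 0; }
  while (start > 0 && body[start - 1] !== " ") { start--; } // back up to a word boundary

  var end = hit + query.length + radius;
  if (end > body.length) { end = body.length; }
  while (end < body.length && body[end] !== " ") { end++; } // extend to a word boundary

  var extract = body.substring(start, end);
  // Bold every occurrence of the query, ignoring case
  return extract.replace(new RegExp(query, "gi"), "<b>$&</b>");
}
```

Adding query.length to the end index means the 250-character window sits either side of the whole term, rather than being measured from its first character.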

Advanced geekery 2: Multiple search terms

Currently this code has a problem. If I search for "foo bar", I won't find all posts that contain either "foo" or "bar", nor will I find posts that contain "foo" and "bar": I'll only find posts that contain the exact phrase "foo bar". A lot of the time, that's not really what I'm going for.

In order to get proper boolean search going, I need to split my search term along the whitespace, and then interpret each search term individually. This isn't too hard, really. To split the search terms, I just need to run:

var searchTerms = $("#searchField")[0].value.split(" ");
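One small wrinkle: splitting on a single space leaves empty strings behind if the query has doubled or trailing spaces. Splitting on a whitespace run and filtering out empties is slightly more robust (a minor tweak of my own, not necessarily what the live site does):

```javascript
// Split on any run of whitespace and drop empty strings,
// so "  foo   bar " becomes ["foo", "bar"].
function splitTerms(query) {
  return query.split(/\s+/).filter(function (t) { return t.length > 0; });
}
```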

It's trickier to add boolean AND or OR modifiers, but it gets easier if we start compartmentalising our previous logic:

post_contains_string = function(post, string) {
  return (
    post.body.indexOf(string)  >= 0 ||
    post.title.indexOf(string) >= 0);
}

//Boolean AND
matches_all_terms = function(post, terms) {
  for (var i in terms) {
    var t = terms[i];
    if (!post_contains_string(post, t)){return false}
  }
  return true;
}

//Boolean OR
matches_any_term = function(post, terms) {
  for (var i in terms) {
    var t = terms[i];
    if (post_contains_string(post, t)){return true}
  }
  return false;
}
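Hooked into the earlier filter call, the OR version looks something like this (helpers repeated so the sketch stands alone; the sample posts are made up):

```javascript
// The helpers from above, plus the filter step that uses them.
function post_contains_string(post, string) {
  return post.body.indexOf(string) >= 0 || post.title.indexOf(string) >= 0;
}

function matches_any_term(post, terms) {
  for (var i = 0; i < terms.length; i++) {
    if (post_contains_string(post, terms[i])) { return true; }
  }
  return false;
}

var posts = [
  { title: "Foo post", body: "all about foo" },
  { title: "Bar post", body: "all about bar" },
  { title: "Other",    body: "nothing relevant" }
];
var searchTerms = ["foo", "bar"];

// Keep any post matching at least one term
var matchingItems = posts.filter(function (p) {
  return matches_any_term(p, searchTerms);
});
```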

The current state of 1klb search

Right now the search page assumes you want boolean OR searching, and can sort posts either by date (most recent first) or by relevance (most occurrences of search terms first). It uses code very similar (but not identical) to what I've posted above to do this, although I've added some classes to wrap up the behaviour.
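The relevance sort itself isn't shown above, but "most occurrences of search terms first" might be sketched like this (the function names are mine, not the site's):

```javascript
// Count total occurrences of every term in a post's title and body.
function relevance(post, terms) {
  var text = (post.title + " " + post.body).toLowerCase();
  var score = 0;
  for (var i = 0; i < terms.length; i++) {
    var term = terms[i].toLowerCase();
    var pos = text.indexOf(term);
    while (pos >= 0) {
      score++;
      pos = text.indexOf(term, pos + term.length);
    }
  }
  return score;
}

// Sort a copy of the matches, highest score first.
function sortByRelevance(posts, terms) {
  return posts.slice().sort(function (a, b) {
    return relevance(b, terms) - relevance(a, terms);
  });
}
```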

What next?

I'm not 100% happy with the "relevance" criterion for search sorting, and I wouldn't be surprised if this underwent some sort of change in the near future. I believe a post with one instance of every search term is probably more relevant, for example, than one with ten instances of just one search term, although this isn't currently how searches are weighted.
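One hypothetical way to encode that intuition: score primarily on how many distinct terms a post contains, and let raw occurrence counts only break ties. To be clear, this isn't what the site does today; it's just a sketch of the idea:

```javascript
// Score primarily by how many distinct terms appear at all, with
// total occurrences as a tiebreaker. The factor of 1000 is
// arbitrary; it just needs to exceed any plausible occurrence count.
function coverageScore(post, terms) {
  var text = (post.title + " " + post.body).toLowerCase();
  var distinct = 0, total = 0;
  for (var i = 0; i < terms.length; i++) {
    var term = terms[i].toLowerCase();
    var pos = text.indexOf(term), hits = 0;
    while (pos >= 0) { hits++; pos = text.indexOf(term, pos + term.length); }
    if (hits > 0) { distinct++; }
    total += hits;
  }
  return distinct * 1000 + total;
}
```

Under this scheme a post with one hit for each of two terms outscores a post with ten hits for just one of them.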

The introduction of actual boolean operators, or quotes to indicate "whole phrases", isn't necessarily out of the question, although this gets more complex as I'd have to identify the different parts of the search. The overall goal is that I (or others) can use the search feature to quickly find articles, and while boolean operators may help with that, I wonder whether the help they'd give would justify the trouble of implementing them.


  1. Or AJAJ, I guess, since we're returning JSON. ↩︎