In this post I want to share my experience of developing an integration testing tool for a search engine using the RSpec framework, which is popular among Ruby developers. My goal was to automate the generation of test cases as much as possible. I encountered several problems along the way, and I hope this post will keep readers from repeating my mistakes.

I’ve worked with several companies that use Elasticsearch for data indexing and searching, yet none of them covered their search requests with tests. I think the main reason for the lack of tests is the complexity of integrating Elasticsearch into a testing environment. The second reason is the enormous amount of work required to maintain document stubs for the necessary test cases. The situation has been almost the same in every project: a single search query is stubbed, then one spec checks that the system has sent a request and received the mocked data. The problem with such a spec is that it verifies the query was sent to Elastic, but it does not check the search results at all.

When I was working at Lobster, we decided to develop a tool that would solve these problems and allow us to cover search engine responses with specs. Lobster is a marketplace for user-generated photos and videos. Photographers connect their accounts from social networks or cloud storages such as Instagram, Flickr, Facebook and Dropbox to the marketplace; it then fetches their photos and videos, indexes them and displays them to buyers.

TL;DR

I thought it would be a great idea to develop a tool that records requests and responses from production as fixtures, replays the same requests against those fixtures in the test environment and asserts on the results. When the implementation was ready I realised that I was wrong, because I had forgotten about the TF-IDF model, which Elastic uses for scoring by default. Don’t use this approach if your product does not allow you to change the scoring model.

Search engine in a UGC marketplace

The main goal of every media marketplace is to make its content searchable. Nobody would be able to buy photos or videos if it weren’t possible to find them, no matter how beautiful those photos are. On social networks people usually provide a text description for their photos when publishing them. However, descriptions are not accurate, and they can be completely unrelated to the objects in the photos. The situation with cloud storages is even worse: services like Dropbox or Google Drive don’t have any descriptions for images at all. The only related information it is possible to get from cloud storages is EXIF metadata. It really helps if a photo from a cloud storage has latitude and longitude in its EXIF data, because with geo-coordinates it is at least possible to find the photo by its location name.

There are several ways to enhance metadata. The first is manual markup. The quality and accuracy would be ideal, but you need thousands of people doing it by hand, which is of course impossible for a small startup. Another way, which is faster and cheaper but less accurate, is image recognition with OpenCV and neural networks. With the help of computer vision algorithms we gathered the names of objects in photos, dominant colors, emotions, people’s age, gender, head position and facial features.

We kept gathering more and more metadata and indexing it in Elasticsearch. The number of different object features increased dramatically after an ML expert joined the team, and with it grew the number of filters and the complexity of the search engine. There were different scoring rules for different combinations of enabled and disabled filters. Although Elastic’s DSL is pleasant to work with, the search engine code became really hard to maintain, which sometimes led to incorrect ranking. In the end we decided to write specs for the search engine logic in RSpec and run them on a CI server to be sure that everything works as expected.


How to automate testing of a search engine?

Let’s describe a typical test case for a Search Engine Results Page (SERP): there are N documents in total; the test makes a query with several filters, parameters and a sorting order, then asserts that the response contains only M documents, filtered and sorted in the correct order. To test hundreds of combinations of filters we would need to write and maintain hundreds of document sets and hundreds of expected responses. If somebody changes the search logic, we would have to fix the broken sets and the expected responses. For SERP testing it would be almost impossible to maintain document stubs, because even a small change in the logic can affect almost all test cases.

I gave up on writing specs manually and came up with an idea that seemed excellent at first glance. I thought I could make a search query in the production environment and dump only the results from the first page, then restore the dump into Elastic in the test environment, send the same request there and compare the results. I assumed that the documents and their order would be the same, because I was going to use the same ranking algorithms in both environments.

In other words, I have a search function F(X) = Y: it accepts a set of documents X and returns a sorted subset of documents Y. I assumed that if F(N) = M, then F(M) = M, where N is the set of all documents in production.

Having made this assumption, I decided to automate the process of dumping and comparing results. At that point I was really pleased with the simplicity of the idea. My plan was to ask the content moderator for search response samples which we accept as valid search behaviour, dump them and build testing fixtures out of them. If somebody breaks the search logic, several test cases fail and we figure out the reason. If the changes in logic are intentional, we dump new search results from production using the new search logic and put them into the test environment.

Development

SERP dumper

I started with the development of the search dumper. The easiest way to implement such a thing in Rails is a rake task. It should be run in production and accept two arguments: search_request_ids, an array of IDs of search requests approved by the content moderator, and number_of_items, the number of documents to dump from each response. The dumper generates a search_dump.json file that contains the current configuration of the search engine in the search_config field and a data payload with all the request samples. Each request sample includes the filters and parameters in the request field, the sorted document IDs with their ranking scores in the sorted_results_ids field and the raw documents in the contents field. I planned to use the obtained dumps as fixtures and compare the document order produced by requests in the test environment with the results acquired in production.

I am not going to show a sample of the dumper’s code here, because it is tightly coupled with the classes and libraries we use to interact with Elastic, so there is little sense in doing so. The logic behind it is quite simple anyway: send the requests, serialize the responses to JSON and write the data to a file.
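
The shape of the dump file is worth sketching, though. Here is a rough illustration of what a parsed dump might look like (the field names follow the description above; all values are invented):

# search_dump.json after JSON.parse(..., symbolize_names: true); values are illustrative
{
  search_config: { tag_boost: 2.0, color_boost: 0.5 },     # current search engine settings
  data: [
    {
      request: { id: 42, query: 'cute dog', filters: { orientation: 'landscape' } },
      sorted_results_ids: [17, 3, 25],                      # expected order from production
      contents: [                                           # raw documents to index in the test cluster
        { document_id: 17, tags: %w[dog cute],  created_at: '2017-05-01' },
        { document_id: 3,  tags: %w[dog puppy], created_at: '2017-04-12' },
        { document_id: 25, tags: %w[dog],       created_at: '2017-03-30' }
      ]
    }
  ]
}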

Setting up Elasticsearch in a test environment

We need to run Elastic in the test environment to restore the dump and send queries, and we also need to clean it up after each run of the test suite. Setting this up was easier than I expected, because Elastic supports an in-memory cluster, which simplifies the task. There is no need to care about data cleaning if a new, clean, isolated cluster is started in memory for each test suite. This approach also prevents you from corrupting local data, which could happen if you decided to use the same cluster you use in development, just with a separate namespace for testing. There is even an elasticsearch-extensions gem that does all the dirty work of managing in-memory clusters.

Here is how to run and stop an in-memory cluster with RSpec, using the elasticsearch-extensions gem:

# spec/spec_helper.rb
require 'elasticsearch/extensions/test/cluster'

RSpec.configure do |config|
  # start an in-memory cluster for Elasticsearch as needed
  config.before :all, elasticsearch: true do
    unless Elasticsearch::Extensions::Test::Cluster.running?(on: 9250)
      Elasticsearch::Extensions::Test::Cluster.start(port: 9250, nodes: 1, timeout: 120)
    end
  end

  # stop the Elasticsearch cluster after the test run
  config.after :suite do
    if Elasticsearch::Extensions::Test::Cluster.running?(on: 9250)
      Elasticsearch::Extensions::Test::Cluster.stop(port: 9250, nodes: 1)
    end
  end
end

I don’t want to go deep into in-memory cluster configuration in this post. For those who are interested, here is a great article on this topic.

Specs generation

When I was satisfied with the in-memory cluster, I moved on to a generator for the specs. First of all, the generator should deserialize the search_dump.json file and create a search configuration in the database. The configuration is a record that stores information about the boosts and fields used in different queries. For each search case the generator should create a new index in the Elastic cluster, format the dumped documents into a bulk and index that bulk in Elastic. Bulk indexing saves a lot of time when you need to index many documents: interaction with Elastic goes through an HTTP API, and opening a new connection for every request takes time, so it makes sense to send as few requests as possible. For example, instead of sending 1000 requests to index 1000 documents, you can send a single bulk request with all the documents.
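
As a rough illustration with the low-level elasticsearch-ruby client (the index and type names here are invented), the difference looks like this:

require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9250')
documents = 1_000.times.map { |i| { document_id: i, tags: ['dog'] } }

# naive approach: one HTTP request per document
documents.each do |doc|
  client.index(index: 'contents', type: 'content', id: doc[:document_id], body: doc)
end

# bulk indexing: a single HTTP request for the whole batch
client.bulk(
  index: 'contents',
  type:  'content',
  body:  documents.map { |doc| { index: { _id: doc[:document_id], data: doc } } }
)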

Elastic is a near-real-time store, which means that in our case the next search request could return 0 results, because Elastic hadn’t had time to refresh the index before the query. That is why it’s important to force-refresh the index once the documents are indexed. When the index is refreshed, we send the query from the test sample and assert that the sorting order of the results is the same as in the sample.
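
Forcing a refresh is a single call on the low-level client; the repository helper used in the spec below does the equivalent with refresh_index!:

# make the newly indexed documents visible to search right away;
# 'contents' is the illustrative index name from the previous snippet
client.indices.refresh(index: 'contents')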

I know that there are many ways to interact with Elastic from Ruby, and search logic can be encapsulated in different ways too. I want to show the general idea behind the spec generator in RSpec, so I will use abstract class names. Let’s assume that your search logic is implemented in SearchClass. For this example I use the repository pattern from the official elasticsearch-rails gem to interact with Elasticsearch.

# spec/services/search_class_spec.rb
require 'rails_helper'
DUMP_NAME = 'search_dump.json'

RSpec.describe SearchClass, elasticsearch: true do
  let(:repository) { ElasticRepository.new }

  # create index for a test
  before :each do
    begin
      repository.create_index!(number_of_shards: 1)
      repository.refresh_index!
    rescue Elasticsearch::Transport::Transport::Errors::NotFound
    end
  end

  # delete index after a test
  after :each do
    begin
      repository.delete_index!
    rescue Elasticsearch::Transport::Transport::Errors::NotFound
    end
  end

  # load dump
  file = File.join(Rails.root, 'spec', 'fixtures', DUMP_NAME)
  dump = JSON.parse(File.read(file), symbolize_names: true)

  # iterate over test samples
  dump[:data].each do |sample|
    it "compares search results #{sample[:request][:id]}" do
      # load configurations for a search engine
      SearchConfiguration.create!(dump[:search_config])

      # build a batch for indexing
      batch = sample[:contents].map do |record|
        {
          index: {
            _id: record[:document_id],
            data: record
          }
        }
      end

      # use bulk indexing API to index the batch of documents
      repository.client.bulk(
        index: repository.index,
        type: repository.type,
        body: batch
      )
      repository.refresh_index!

      # build a search request
      request = SampleSearchRequest.new(sample[:request])

      # send search request
      results = described_class.new.perform(request)

      # compare results
      expect(results[:data].map(&:id)).to eq(sample[:sorted_results_ids])
    end
  end
end

When I tried to launch the specs, they didn’t work properly: the order of the documents was completely messed up.

What can go wrong?

Let’s get back to the initial assumption. I thought that I could make a query across N documents and get a sorted subset of documents M with the search function, F(N) = M. Then I was going to create a new index with these M documents and query it with the same search function F. I expected to get F(M) = M, because it seemed logical, but in practice I got F(M) = M', where M' contains the same documents as the subset M, sorted in a different order. The root of this mistake is the scoring model that Elasticsearch uses by default.

Elastic processes a search query in several steps. First it filters the documents; filtering is cheap, because it only determines whether documents satisfy the conditions of the query. Then Elastic applies scoring queries to the filtered document set; scoring assigns a weight to each document. Finally, Elastic sorts the documents by their scores and by any other fields set for the query.

To score documents, Elasticsearch uses TF-IDF: Term Frequency – Inverse Document Frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection. Term Frequency is the number of times a term occurs in the document. Inverse Document Frequency is the logarithm of the total number of documents in the index divided by the number of documents that contain the term.
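
In its textbook form (Lucene’s actual implementation adds smoothing and other refinements on top of this), the statistic is:

tf-idf(t, d) = tf(t, d) × idf(t),    where    idf(t) = log(N / n_t)

Here N is the total number of documents in the index and n_t is the number of documents that contain the term t.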

Full-text search and TF-IDF

Before going further, it’s important to understand why full-text search uses IDF. It shows how often a term appears across all documents in the collection: the more often it appears, the lower the weight of the term. Common terms like “and” or “the” contribute little to relevance, as they appear in most documents, while uncommon terms like “elastic” or “capybara” help to zoom in on the most interesting documents.

By default Elastic uses the Okapi BM25 scoring model. It is based on TF-IDF and adds field-length normalization, which gives extra precision to scoring: if a term is found in a field that contains 5 words, the document containing that field is considered more relevant than another document where the same term sits in a field with 10 words.
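
For reference, BM25 scores a document D against a query Q roughly like this (k1 and b are tunable parameters, 1.2 and 0.75 by default in Elastic); the |D| / avgdl factor is where field-length normalization enters:

score(D, Q) = Σ over t in Q of  idf(t) × f(t, D) × (k1 + 1) / (f(t, D) + k1 × (1 − b + b × |D| / avgdl))

Here f(t, D) is the frequency of the term t in D, |D| is the field length and avgdl is the average field length across the index.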

Let’s return to our problem: we have several million documents in production, but a test sample contains only M documents. Even though we send the same query, the Inverse Document Frequency values are different for the production index and for the small dumped document set. Different IDF values lead to different scores for the same documents, so the sorting order of the documents from the sample is not the same as the genuine order dumped from production. My initial assumption was wrong because I did not account for IDF in the scoring model.
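
A quick back-of-the-envelope example (the numbers are invented) shows how much the weights shift. Suppose “dog” occurs in 500,000 of 5,000,000 production documents and “cute” occurs in 50,000 of them, while in a dumped sample of 20 documents “dog” appears in 15 and “cute” in 5:

idf_production(dog)  = log(5,000,000 / 500,000) ≈ 2.3      idf_production(cute) = log(5,000,000 / 50,000) ≈ 4.6
idf_test(dog)        = log(20 / 15) ≈ 0.29                 idf_test(cute)       = log(20 / 5) ≈ 1.39

In production “cute” is worth about twice as much as “dog”; in the tiny test index it is worth almost five times as much, so documents matching different terms can easily swap places.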

The solution

At this point I was extremely upset and angry, because I had completely forgotten about TF-IDF when designing the tools for integrating Elasticsearch with RSpec. However, instead of rejecting the broken idea and building something that would work in a completely different way, we decided it was the perfect time to change the logic of our search engine.

Previously we had lots of cases when someone searching for photos came to the development team and said something like, “Our ranking is broken! I searched for cute dog; now look at this beautiful picture of a dog, it has both tags, cute and dog. It should be higher in the results than that other picture, which has only the dog tag from my query, because the first photo is more relevant.” In these cases I would usually run an explain query against Elastic and find out that the picture with the tags dog and cute also carried 50 more unrelated tags. People tend to add junk tags to their Instagram photos because they want to be discovered and get some likes. The photo that ranked higher had only two tags: dog and puppy. As I wrote earlier, the Okapi BM25 model uses field normalization: it treats a document with 2 matched terms among 50 as less important than a document with 1 matched term among 2. Because of that normalization, the photo with one matched tag got a higher score than the photo with two matched tags.

Actually this normalization makes sense: it works as built-in protection against tag spammers who try to cheat the search algorithm and push their photos into as many queries as possible. On the other hand, we are selling photos, not building a text search engine. For people who want to buy a photo, the most important thing is that it contains the objects from their search request. When people search for cute dog, they don’t really care about the other 48 tags if the photo is nice and contains a dog which is cute. Besides, IDF also breaks people’s expectations: they want to see photos with the desired objects and don’t care how frequent the tags are among all the photos in the database.

The second argument was the more persuasive and important one for us. Considering this, we decided to change the ranking model in the product and disable Okapi BM25. Even before the whole story with the specs, we had already suspected that TF-IDF was harmful for us in most scenarios. By disabling it, we not only fixed our testing system, but also made the search engine better from the perspective of a media marketplace. People buy photos, not tags or texts.

We switched from the Okapi BM25 model to constant_score queries. In this case Elastic evaluates the score of a document as the sum of preset scores for each matched field of the document. Since field normalization isn’t used anymore, it also makes sense to disable norms in the mappings with norms: false. This setting saves some disk space for the index.
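
A rough sketch of what this looks like in the query DSL and the mappings (the field names and boosts are invented; our real SearchClass builds queries like this from SearchConfiguration):

# each matched term contributes a fixed, preset boost to the score, regardless of
# term frequency, IDF or field length
query = {
  query: {
    bool: {
      should: [
        { constant_score: { filter: { term: { tags: 'dog' } },  boost: 2.0 } },
        { constant_score: { filter: { term: { tags: 'cute' } }, boost: 1.0 } }
      ]
    }
  }
}

# mappings: norms are no longer needed once field-length normalization is irrelevant
mappings = {
  properties: {
    tags: { type: 'text', norms: false }
  }
}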

Another important thing to mention is the implementation of the results comparator for the specs. With TF-IDF disabled, score deviation almost disappears: with TF-IDF enabled, documents usually have distinct float scores, but without it there are many documents with identical scores. This means that Elastic may return documents with the same score in a different order when the same query is sent several times. To solve this problem, it is possible to add another field to the sorting strategy besides the score, for example the date of creation. In this case, subsets of documents with the same score are additionally sorted by creation date.
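
A deterministic sort might look like this (created_at is an assumed field name):

# sort by score first, then break ties by creation date so that repeated runs
# of the same query return documents in a stable order
sort = [
  { _score:     { order: 'desc' } },
  { created_at: { order: 'desc' } }
]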

Continuous Integration

After switching from Okapi BM25 to constant_score the specs went green, but that’s not the end of the story. I mentioned earlier that another goal of this task was to run the specs on a CI service. In theory it was easy, but in practice I ran into a problem with custom analysis. We use synonym token filters for several search features. Each filter has its own set of synonyms, and some of the sets are extremely large. There are two ways to define synonym sets: the first is to define the synonyms inline in the index settings; however, if a set is too big, it’s recommended to define it in a file, otherwise the settings get bloated. Here comes the problem: we had plenty of files with synonyms, but since version 5 Elasticsearch requires a file path to be relative to its configuration directory. It is implemented this way to prevent Elastic from accessing directories outside its own. For development it’s fine to keep the synonym files in the repository and symlink them into the Elastic configuration directory, but it’s not possible to do something like that on CI. At the time we were using VexorCI, and their support is amazing: we provided them with the synonym files and they built a special Elasticsearch image for us, bundled with those files. After that we were able to run the Elastic specs on the CI service.
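
For reference, this is roughly how a file-based synonym filter is declared in the index settings; the path is resolved relative to the Elasticsearch config directory (the file name here is invented):

# analysis settings with a file-based synonym filter; on Elasticsearch 5+ the
# synonyms_path must point inside the Elasticsearch config directory
settings = {
  analysis: {
    filter: {
      tag_synonyms: {
        type: 'synonym',
        synonyms_path: 'analysis/tag_synonyms.txt'
      }
    },
    analyzer: {
      tags_with_synonyms: {
        tokenizer: 'standard',
        filter: %w[lowercase tag_synonyms]
      }
    }
  }
}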

Conclusion

I strongly advise against using this approach, because disabling TF-IDF isn’t something you usually do with a full-text search engine. I was lucky that we were building a product which sells photos; otherwise I would have had to throw the idea away.