<h1>How (not) to integrate Elasticsearch testing with RSpec</h1>
<p>Timur Yanberdin · DevLog · 2019-03-09</p>
<p>In this post I want to share my experience developing an integration
testing tool for a search engine with the <a href="http://rspec.info">RSpec</a> framework,
which is popular among Ruby developers. My goal was to automate the generation of
test cases as much as possible. I encountered several problems along the way,
and I hope this post will keep readers from repeating my mistakes.</p>
<p>I’ve worked with several companies that use Elasticsearch for indexing and
searching data, yet none of them covered their search requests with tests. I
think the main reason for this lack of tests is the complexity of integrating
Elasticsearch into a testing environment. The second reason is the enormous
amount of work required to maintain document stubs for the necessary test
cases. The situation has been almost the same in every project: a single
search query is stubbed, then one spec checks that the system has sent the request and
received the mocked data. Such a spec has a problem: it verifies that a query was
sent to Elastic, but it says nothing about the search results themselves.</p>
<p>When I was working at Lobster, we decided to develop a tool that would solve
these problems and allow us to cover search engine responses with specs.
Lobster is a marketplace for user-generated photos and videos. Photographers
connect their accounts from social networks and cloud storages (Instagram,
Flickr, Facebook, Dropbox, etc.) to the marketplace, which then fetches their photos
and videos, indexes them and displays them to buyers.</p>
<h3 id="tldr">TL;DR</h3>
<p>I thought it would be a great idea to develop a tool that records
requests and responses from production as fixtures, replays the same requests
against those fixtures in the test environment and asserts on the results.
When the implementation was ready I realised that I was wrong, because I had forgotten
about the TF-IDF model, which Elastic uses for scoring by default. Don’t use this
approach unless your product allows you to change the scoring model.</p>
<h3 id="search-engine-in-ugc-marketplace">Search engine in UGC marketplace</h3>
<p>The main goal for every media marketplace is to make its content searchable.
Nobody would be able to buy photos or videos if it weren’t possible to find them,
no matter how beautiful those photos are. In the case of social networks, people
usually provide a text description for their photos when publishing. However,
descriptions are often inaccurate and can be completely unrelated to the
objects in the photos. The situation with cloud storages is even worse: services
like Dropbox or Google Drive don’t have any descriptions for images at all. The
only related information that can be extracted from cloud storages is EXIF
metadata. It really helps when a photo from a cloud storage has latitude and
longitude in its EXIF data, because with geo-coordinates it is at least possible to
find the photo by its location name.</p>
<p>There are several ways to enhance metadata. The first is manual
markup. The quality and accuracy would be ideal, but you need thousands
of people to do it by hand, and hiring that many people is impossible for a
small startup. Another way, which is faster and cheaper but less
accurate, is image recognition with OpenCV and neural networks. With the help of
computer vision algorithms we gathered the names of objects in photos,
dominant colors, emotions, people’s age and gender, head position and facial features.</p>
<p>We kept gathering more and more metadata and indexing it in Elasticsearch.
The number of distinct object features increased dramatically after
an ML expert joined the team, and with it grew the
number of filters and the complexity of the search engine. There were different scoring
rules for different combinations of enabled and disabled filters. Although the
Elastic DSL is pleasant to work with, the search engine code became really hard to
maintain, which sometimes led to incorrect ranking. In the end we
decided to write specs for the search engine logic in RSpec and run them on a CI
server to be sure that everything works as expected.</p>
<p><img src="https://user-images.githubusercontent.com/854386/52372076-7da50980-2a57-11e9-947b-9797c7286677.jpg" alt="SERP" /></p>
<h3 id="how-to-automate-testing-of-a-search-engine">How to automate testing of a search engine?</h3>
<p>Let’s describe a typical test case for a Search Engine Results Page: there are <code class="language-plaintext highlighter-rouge">N</code> documents in total,
the test makes a query with several filters, parameters and a sorting order, then
it asserts that the response contains only <code class="language-plaintext highlighter-rouge">M</code> documents, filtered and sorted in the
correct order. To test hundreds of combinations of filters we would need to write and
maintain hundreds of document sets and hundreds of expected responses. If
somebody changes the search logic, we have to fix all the broken sets and expected
responses. For SERP testing it would be almost impossible to maintain document
stubs, because even a small change in the logic can affect almost all test
cases.</p>
<p>I gave up on writing specs manually and came up with an idea that seemed
excellent at first glance. I thought I could make a search query in the
production environment and dump only the results from the first page, then restore the
dump into Elastic in the test environment, send the same request there and compare the
results. I assumed that the documents and their order would be the same, because
I was going to use the same ranking algorithms in both environments.</p>
<p>In other words, I have a search function <code class="language-plaintext highlighter-rouge">F(X) = Y</code>: it accepts a set of
documents <code class="language-plaintext highlighter-rouge">X</code> and returns a sorted subset of documents <code class="language-plaintext highlighter-rouge">Y</code>.
I assumed that if <code class="language-plaintext highlighter-rouge">F(N) = M</code>, then <code class="language-plaintext highlighter-rouge">F(M) = M</code>, where <code class="language-plaintext highlighter-rouge">N</code> is the set of all documents in production.</p>
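<p>In a toy model where each document carries a fixed, document-local score, this assumption does hold. The snippet below is entirely hypothetical (made-up data and a made-up <code>search</code> function), and shows the property I was relying on:</p>

```ruby
# A toy search: filter by tag, then sort by a score stored on the
# document itself (hypothetical data, not the real engine).
def search(documents, term)
  documents
    .select { |doc| doc[:tags].include?(term) }
    .sort_by { |doc| -doc[:score] }
end

ALL_DOCS = [
  { id: 1, tags: %w[cute dog], score: 2.0 },
  { id: 2, tags: %w[dog],      score: 1.0 },
  { id: 3, tags: %w[cat],      score: 3.0 }
].freeze

top = search(ALL_DOCS, 'dog')  # F(N) = M
search(top, 'dog') == top      # => true: F(M) = M holds here
```

<p>The property only holds because the score lives on the document; it breaks as soon as the score depends on the whole corpus.</p>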
<p>Having made this assumption, I decided to
automate the process of dumping and comparing results. At that point I was
really pleased with the simplicity of the idea. My plan was to ask a content
moderator for search response samples that we accept as valid search
behaviour, dump them and then build testing fixtures from them. If somebody
breaks the search logic, several test cases fail and we figure out the reason. If
changes in the logic are expected, we dump new search results from production
using the new search logic and put them into the test environment.</p>
<h3 id="development">Development</h3>
<h4 id="serp-dumper">SERP dumper</h4>
<p>I started with the development of the search dumper; the easiest way to implement
such things in Rails is a rake task. It should run in production and accept
two arguments: an array of IDs of search requests approved by the content
moderator – <code class="language-plaintext highlighter-rouge">search_request_ids</code> – and the number of documents to dump from each response
– <code class="language-plaintext highlighter-rouge">number_of_items</code>. The search dumper generates a <code class="language-plaintext highlighter-rouge">search_dump.json</code> file that
contains the current configuration of the search engine in the <code class="language-plaintext highlighter-rouge">search_config</code> field
and a <code class="language-plaintext highlighter-rouge">data</code> payload with all the request samples. Each request
sample includes the filters and parameters in the <code class="language-plaintext highlighter-rouge">request</code> field, the sorted document IDs
with ranking scores in the <code class="language-plaintext highlighter-rouge">sorted_results_ids</code> field and the raw documents in the
<code class="language-plaintext highlighter-rouge">contents</code> field.
<img src="https://user-images.githubusercontent.com/854386/53920262-d07cdb80-406c-11e9-85e0-c372daa3199f.jpg" alt="dumper-schema" />
I planned to use the obtained dumps as fixtures and compare the document order
from requests in the test environment with the results acquired in
production.</p>
<p>I am not going to show the dumper’s code here, because it is tightly
coupled with the classes and libraries we use to interact with Elastic, so
there is little sense in doing so. The logic behind it is quite simple anyway: send the
requests, serialize the responses to JSON and write the data to a file.</p>
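<p>For the curious, the shape of that logic can be sketched as a pure function. Everything here – the names and the payload shape – is an assumption for illustration, not the original code:</p>

```ruby
require 'json'

# Sketch of the dump step: run the search for each approved request,
# keep the top N results and serialize everything into one fixture.
def build_search_dump(search_config, requests, number_of_items, &search)
  samples = requests.map do |request|
    docs = search.call(request).first(number_of_items)
    {
      request: request,
      sorted_results_ids: docs.map { |d| d[:id] },
      contents: docs
    }
  end
  { search_config: search_config, data: samples }
end

# Hypothetical usage from a rake task:
# dump = build_search_dump(config, approved_requests, 20) { |r| engine.perform(r) }
# File.write('search_dump.json', JSON.pretty_generate(dump))
```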
<h4 id="setting-up-elasticsearch-in-a-test-environment">Setting up Elasticsearch in a test environment</h4>
<p>We need to run Elastic in the test environment to restore the dump and send
queries. We also need to clean it after each run of the test suite. Setting it up
was easier than I expected, because Elastic supports an in-memory cluster, which
simplifies the task. There is no need to worry about data cleaning if a fresh,
isolated in-memory cluster starts for each test suite. This approach also
prevents you from corrupting local data, which could happen if
you used the same cluster as in development, merely with a
separate namespace for testing. There is even an <a href="https://github.com/elastic/elasticsearch-ruby/tree/master/elasticsearch-extensions#testcluster">elasticsearch-extensions
gem</a>
that does all the dirty work of managing in-memory clusters.</p>
<p>Here is how to run and stop an in-memory cluster with RSpec, using the
elasticsearch-extensions gem:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># spec/spec_helper.rb</span>
<span class="c1"># start an in-memory cluster for Elasticsearch as needed</span>
<span class="n">config</span><span class="p">.</span><span class="nf">before</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">elasticsearch: </span><span class="kp">true</span> <span class="k">do</span>
  <span class="k">unless</span> <span class="no">Elasticsearch</span><span class="o">::</span><span class="no">Extensions</span><span class="o">::</span><span class="no">Test</span><span class="o">::</span><span class="no">Cluster</span><span class="p">.</span><span class="nf">running?</span><span class="p">(</span><span class="ss">on: </span><span class="mi">9250</span><span class="p">)</span>
    <span class="no">Elasticsearch</span><span class="o">::</span><span class="no">Extensions</span><span class="o">::</span><span class="no">Test</span><span class="o">::</span><span class="no">Cluster</span><span class="p">.</span><span class="nf">start</span><span class="p">(</span><span class="ss">port: </span><span class="mi">9250</span><span class="p">,</span> <span class="ss">nodes: </span><span class="mi">1</span><span class="p">,</span> <span class="ss">timeout: </span><span class="mi">120</span><span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="c1"># stop elasticsearch cluster after test run</span>
<span class="n">config</span><span class="p">.</span><span class="nf">after</span> <span class="ss">:suite</span> <span class="k">do</span>
  <span class="k">if</span> <span class="no">Elasticsearch</span><span class="o">::</span><span class="no">Extensions</span><span class="o">::</span><span class="no">Test</span><span class="o">::</span><span class="no">Cluster</span><span class="p">.</span><span class="nf">running?</span><span class="p">(</span><span class="ss">on: </span><span class="mi">9250</span><span class="p">)</span>
    <span class="no">Elasticsearch</span><span class="o">::</span><span class="no">Extensions</span><span class="o">::</span><span class="no">Test</span><span class="o">::</span><span class="no">Cluster</span><span class="p">.</span><span class="nf">stop</span><span class="p">(</span><span class="ss">port: </span><span class="mi">9250</span><span class="p">,</span> <span class="ss">nodes: </span><span class="mi">1</span><span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>I don’t want to go deep into in-memory cluster configuration in this post. For
those who are interested, here is a great
<a href="https://medium.com/@rowanoulton/testing-elasticsearch-in-rails-22a3296d989">article</a>
on this topic.</p>
<h4 id="specs-generation">Specs generation</h4>
<p>Once I was satisfied with the in-memory cluster, I moved on to developing a generator
for the specs. First of all, the generator should deserialize the <code class="language-plaintext highlighter-rouge">search_dump.json</code>
file and create a search configuration in the database. The configuration is a
record that stores information about the boosts and fields used in
different queries. For each search case, the generator should create a new index in
the Elastic cluster. After that it should format the dumped documents into a bulk
payload and index it in Elastic. Bulk indexing saves a lot of time when you need to
index many documents: interaction with Elastic goes through the HTTP API, and it takes
time to open a new connection for each request, so it makes sense to send as few
requests as possible. For example, instead of sending 1000 requests to index
1000 documents, you can send a single bulk request with all the documents.</p>
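<p>The difference is easy to see in the payload: one action hash per document, all sent in a single request. A sketch with made-up field names:</p>

```ruby
# Build one bulk body instead of issuing a separate index request per
# document; each entry is an action hash the bulk API understands.
def bulk_body(records)
  records.map { |record| { index: { _id: record[:document_id], data: record } } }
end

batch = bulk_body([{ document_id: 7, tags: %w[dog cute] }])
# One HTTP round trip for the whole set:
# repository.client.bulk(index: 'photos', body: batch)
```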
<p>Elastic is a near-real-time store, which means that in our case the next search
request could return 0 results, because Elastic hadn’t had time to refresh the
index before the query. It’s important to force-refresh the index once the
documents are indexed. When the index is refreshed, we send the query from the
test sample and assert that the sorting order of the results is the same as in the
sample.</p>
<p>I know that there are many ways to interact with Elastic from Ruby, and search logic
can be encapsulated in different ways too. I want to show the general idea behind
the spec generator in RSpec, so I will use abstract class names. Let’s assume
that your search logic is implemented in <code class="language-plaintext highlighter-rouge">SearchClass</code>. For this example I use the
repository pattern from the official <a href="https://github.com/elastic/elasticsearch-rails">elasticsearch-rails
gem</a> to interact with
Elasticsearch.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># spec/services/search_class_spec.rb</span>
<span class="nb">require</span> <span class="s1">'rails_helper'</span>
<span class="no">DUMP_NAME</span> <span class="o">=</span> <span class="s1">'search_dump.json'</span>
<span class="no">RSpec</span><span class="p">.</span><span class="nf">describe</span> <span class="no">SearchClass</span><span class="p">,</span> <span class="ss">elasticsearch: </span><span class="kp">true</span> <span class="k">do</span>
  <span class="n">let</span><span class="p">(</span><span class="ss">:repository</span><span class="p">)</span> <span class="p">{</span> <span class="no">ElasticRepository</span><span class="p">.</span><span class="nf">new</span> <span class="p">}</span>

  <span class="c1"># create index for a test</span>
  <span class="n">before</span> <span class="ss">:each</span> <span class="k">do</span>
    <span class="k">begin</span>
      <span class="n">repository</span><span class="p">.</span><span class="nf">create_index!</span><span class="p">(</span><span class="ss">number_of_shards: </span><span class="mi">1</span><span class="p">)</span>
      <span class="n">repository</span><span class="p">.</span><span class="nf">refresh_index!</span>
    <span class="k">rescue</span> <span class="no">Elasticsearch</span><span class="o">::</span><span class="no">Transport</span><span class="o">::</span><span class="no">Transport</span><span class="o">::</span><span class="no">Errors</span><span class="o">::</span><span class="no">NotFound</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="c1"># delete index after a test</span>
  <span class="n">after</span> <span class="ss">:each</span> <span class="k">do</span>
    <span class="k">begin</span>
      <span class="n">repository</span><span class="p">.</span><span class="nf">delete_index!</span>
    <span class="k">rescue</span> <span class="no">Elasticsearch</span><span class="o">::</span><span class="no">Transport</span><span class="o">::</span><span class="no">Transport</span><span class="o">::</span><span class="no">Errors</span><span class="o">::</span><span class="no">NotFound</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="c1"># load dump</span>
  <span class="n">file</span> <span class="o">=</span> <span class="no">File</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="no">Rails</span><span class="p">.</span><span class="nf">root</span><span class="p">,</span> <span class="s1">'spec'</span><span class="p">,</span> <span class="s1">'fixtures'</span><span class="p">,</span> <span class="no">DUMP_NAME</span><span class="p">)</span>
  <span class="n">dump</span> <span class="o">=</span> <span class="no">JSON</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="no">File</span><span class="p">.</span><span class="nf">read</span><span class="p">(</span><span class="n">file</span><span class="p">),</span> <span class="ss">symbolize_names: </span><span class="kp">true</span><span class="p">)</span>

  <span class="c1"># iterate over test samples</span>
  <span class="n">dump</span><span class="p">[</span><span class="ss">:data</span><span class="p">].</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">sample</span><span class="o">|</span>
    <span class="n">it</span> <span class="s2">"compares search results </span><span class="si">#{</span><span class="n">sample</span><span class="p">[</span><span class="ss">:request</span><span class="p">][</span><span class="ss">:id</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span> <span class="k">do</span>
      <span class="c1"># load configurations for a search engine</span>
      <span class="no">SearchConfiguration</span><span class="p">.</span><span class="nf">create!</span><span class="p">(</span><span class="n">dump</span><span class="p">[</span><span class="ss">:search_config</span><span class="p">])</span>
      <span class="c1"># build a batch for indexing</span>
      <span class="n">batch</span> <span class="o">=</span> <span class="n">sample</span><span class="p">[</span><span class="ss">:contents</span><span class="p">].</span><span class="nf">map</span> <span class="k">do</span> <span class="o">|</span><span class="n">record</span><span class="o">|</span>
        <span class="p">{</span>
          <span class="ss">index: </span><span class="p">{</span>
            <span class="ss">_id: </span><span class="n">record</span><span class="p">[</span><span class="ss">:document_id</span><span class="p">],</span>
            <span class="ss">data: </span><span class="n">record</span>
          <span class="p">}</span>
        <span class="p">}</span>
      <span class="k">end</span>
      <span class="c1"># use bulk indexing API to index the batch of documents</span>
      <span class="n">repository</span><span class="p">.</span><span class="nf">client</span><span class="p">.</span><span class="nf">bulk</span><span class="p">(</span>
        <span class="ss">index: </span><span class="n">repository</span><span class="p">.</span><span class="nf">index</span><span class="p">,</span>
        <span class="ss">type: </span><span class="n">repository</span><span class="p">.</span><span class="nf">type</span><span class="p">,</span>
        <span class="ss">body: </span><span class="n">batch</span>
      <span class="p">)</span>
      <span class="n">repository</span><span class="p">.</span><span class="nf">refresh_index!</span>
      <span class="c1"># build a search request</span>
      <span class="n">request</span> <span class="o">=</span> <span class="no">SampleSearchRequest</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">sample</span><span class="p">[</span><span class="ss">:request</span><span class="p">])</span>
      <span class="c1"># send search request</span>
      <span class="n">results</span> <span class="o">=</span> <span class="n">described_class</span><span class="p">.</span><span class="nf">new</span><span class="p">.</span><span class="nf">perform</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
      <span class="c1"># compare results</span>
      <span class="n">expect</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="ss">:data</span><span class="p">].</span><span class="nf">map</span><span class="p">(</span><span class="o">&</span><span class="ss">:id</span><span class="p">)).</span><span class="nf">to</span> <span class="n">eq</span><span class="p">(</span><span class="n">sample</span><span class="p">[</span><span class="ss">:sorted_results_ids</span><span class="p">])</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>When I tried to launch the specs, they didn’t work properly: the order of documents was
completely messed up.</p>
<h3 id="what-can-go-wrong">What can go wrong?</h3>
<p>Let’s get back to the initial assumption. I thought that I could make a query
across <code class="language-plaintext highlighter-rouge">N</code> documents and get a sorted subset of documents <code class="language-plaintext highlighter-rouge">M</code> with the search function
<code class="language-plaintext highlighter-rouge">F(N) = M</code>. Then I was going to create a new index with these <code class="language-plaintext highlighter-rouge">M</code> documents and
query it with the same search function <code class="language-plaintext highlighter-rouge">F</code>. I expected to get <code class="language-plaintext highlighter-rouge">F(M) = M</code>, because that
seems logical, but in practice I got <code class="language-plaintext highlighter-rouge">F(M) = M'</code>, where <code class="language-plaintext highlighter-rouge">M'</code> contains the same
documents as the subset <code class="language-plaintext highlighter-rouge">M</code>, but sorted in a different order. The root of this mistake
is the scoring model that Elasticsearch uses by default.</p>
<p>Elastic processes a search query in several steps. First of all it filters
documents. Filtering is cheap, because it only determines whether documents satisfy the
conditions from the query. Then Elastic applies scoring queries to the
filtered document set; scoring assigns a weight to each document. In the end
Elastic sorts the documents by their scores and any other fields that were set for the
query. <img src="https://user-images.githubusercontent.com/854386/53922975-cbbd2500-4076-11e9-8480-6a43ae781821.jpg" alt="search query schema" /></p>
<p>To score documents, Elasticsearch uses
<a href="https://en.wikipedia.org/wiki/Tf–idf">TF-IDF</a>: Term Frequency – Inverse
Document Frequency. It is a numerical statistic intended to reflect how
important a word is to a document in a collection. Term Frequency is the number
of times a term occurs in the document. Inverse Document Frequency is the
logarithm of the total number of documents in the index divided by the number
of documents that contain the term.</p>
<h3 id="full-text-search-and-tf-idf">Full-text search and TF-IDF</h3>
<p>Before going further, it’s important to understand why full-text search uses
IDF. It reflects how often a term appears across all documents in the
collection: the more often, the lower the weight of the term. Common terms
like “and” or “the” contribute little to relevance, as they appear in most
documents, while uncommon terms like “elastic” or “capybara” help to zoom in on
the most interesting documents.</p>
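<p>A toy computation with the classic <code>idf = log(N / df)</code> formula makes this concrete (the three-document corpus is hypothetical):</p>

```ruby
# IDF of a term over a toy corpus: ubiquitous terms score 0,
# rare terms score high.
def idf(term, corpus)
  df = corpus.count { |doc| doc.include?(term) }
  Math.log(corpus.size.to_f / df)
end

CORPUS = [
  %w[the cute dog],
  %w[the lazy cat],
  %w[the capybara]
].freeze

idf('the', CORPUS)      # log(3/3) = 0.0, contributes nothing
idf('capybara', CORPUS) # log(3/1) ≈ 1.1, highly discriminative
```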
<p>By default Elastic uses the <a href="https://en.wikipedia.org/wiki/Okapi_BM25">Okapi BM25</a> scoring model. It is based on TF-IDF and also
adds field-length normalization, which gives extra precision to
scoring: if a term is found in a field that contains 5 words, the
document containing that field counts as more important than another document that
has a field with 10 words and the same term among them.</p>
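<p>The length effect can be isolated with the BM25 term-frequency weight (using the default <code>k1</code> and <code>b</code>; this is the textbook formula, not Lucene’s exact implementation):</p>

```ruby
# BM25 weight of a single term occurrence as a function of field length;
# k1 and b are Elasticsearch's defaults.
def bm25_tf_weight(tf, field_len, avg_len, k1: 1.2, b: 0.75)
  tf * (k1 + 1) / (tf + k1 * (1 - b + b * field_len / avg_len))
end

short_field = bm25_tf_weight(1, 5.0, 7.5)   # ≈ 1.16
long_field  = bm25_tf_weight(1, 10.0, 7.5)  # = 0.88
short_field > long_field # => true: the 5-word field wins
```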
<p>Let’s return to our problem: we have several million documents in production,
but a test sample contains only <code class="language-plaintext highlighter-rouge">M</code> documents. Even though we send the same
query, the Inverse Document Frequency values differ between the production index
and the small dumped document set. Different IDF values lead to
different scores for the same documents, so the sorting order for documents from the
sample won’t match the genuine order dumped from production.
My initial assumption was wrong because I did not account for IDF in the scoring model.</p>
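<p>Here is the failure mode in miniature, with the classic TF-IDF formula and a hypothetical corpus: the same document gets a different score once the index shrinks to the dumped subset:</p>

```ruby
# Classic tf * log(N / df) scoring; real Lucene scoring differs in
# details, but the corpus dependence is the same.
def tf_idf(term, doc, corpus)
  tf = doc.count(term)
  df = corpus.count { |d| d.include?(term) }
  tf * Math.log(corpus.size.to_f / df)
end

doc = %w[cute dog]
production = [%w[cute dog], %w[dog], %w[cat], %w[lazy cat]]
dumped     = [%w[cute dog], %w[dog]] # only the matching subset

tf_idf('dog', doc, production) # log(4/2) ≈ 0.69
tf_idf('dog', doc, dumped)     # log(2/2) = 0.0, same document, new score
```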
<h3 id="the-solution">The solution</h3>
<p>At this point I was extremely upset and angry, because I had completely forgotten about
TF-IDF while designing the tooling for integrating Elasticsearch with RSpec.
However, instead of rejecting the broken idea and building something
completely different, we decided that it was the perfect time to change
the logic of our search engine.</p>
<p>Previously we had lots of cases when someone who had searched for photos came
to the development team and said something like, “Our ranking is broken! I’ve
searched for <code class="language-plaintext highlighter-rouge">cute dog</code>; now look at this beautiful picture of a dog, it has
both tags, <code class="language-plaintext highlighter-rouge">cute</code> and <code class="language-plaintext highlighter-rouge">dog</code>. It should rank higher than another
picture that has only the <code class="language-plaintext highlighter-rouge">dog</code> tag from my query, because the first photo is
more relevant to my query.” In these cases I
usually run an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html">explain
query</a>
against Elastic and find out that the picture with the tags <code class="language-plaintext highlighter-rouge">dog</code> and <code class="language-plaintext highlighter-rouge">cute</code> also
contains 50 more unrelated tags. People tend to add junk tags to their photos
on Instagram, because they want to be discovered and get some likes. As for the
photo that ranked higher, it had only two tags: <code class="language-plaintext highlighter-rouge">dog</code> and <code class="language-plaintext highlighter-rouge">puppy</code>. As
I wrote earlier, the Okapi BM25 model uses field-length normalization: it treats a
document with 2 matched terms among 50 as less important than a document
with 1 matched term among 2. Because of this normalization, the photo with
1 matched tag scored higher than the photo with two matched tags.</p>
<p>Actually this normalization makes sense: it’s a built-in protection against
tag spammers who try to cheat the search algorithm and push their photos into as
many queries as possible. On the other hand, we are selling photos, not
building a text search engine. For people who want to buy a photo the most
important thing is that it contains the objects from their search request.
When people search for <code class="language-plaintext highlighter-rouge">cute dog</code>, they
don’t really care about the other 50 tags if the photo is nice and contains a dog
that is cute. Besides, IDF also defies people’s expectations: they want to
see photos with the desired objects and don’t care how frequent the tags are
among all photos in the database.</p>
<p>The second argument was the more persuasive and important one for us. Considering this,
we decided to change the ranking model in the product and disable Okapi
BM25. Even before the whole story with the specs, we already suspected that TF-IDF was
harmful for us in most scenarios. By disabling it, we not only fixed our testing
system, but also made the search engine better from the perspective of a media
marketplace: people buy photos, not tags or texts.</p>
<p>We switched from the Okapi BM25 model to
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.6/query-dsl-constant-score-query.html"><code class="language-plaintext highlighter-rouge">constant_score</code></a>.
In this case Elastic evaluates the score of a document as the sum of preset scores
for each matched field of the document. Since field normalization isn’t used
anymore, it makes sense to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/norms.html">disable
norms</a>
in the mappings with <code class="language-plaintext highlighter-rouge">norms: false</code>. This setting also saves some disk space for the
index.</p>
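<p>As request bodies, the two changes look roughly like this (the index, fields and boost values are made up for the example):</p>

```ruby
# Mapping: keep the field searchable but drop length-normalization data.
mapping = {
  properties: {
    tags: { type: 'text', norms: false }
  }
}

# Query: every matched tag contributes a fixed, preset boost to the
# score instead of a TF-IDF/BM25 value.
query = {
  query: {
    bool: {
      should: [
        { constant_score: { filter: { term: { tags: 'dog' } },  boost: 2.0 } },
        { constant_score: { filter: { term: { tags: 'cute' } }, boost: 1.0 } }
      ]
    }
  }
}
```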
<p>Another important thing to mention is the implementation of the results comparator
for the specs. With TF-IDF disabled, score deviation almost disappears. When
TF-IDF is enabled, documents usually have distinct float scores, but
without it there are many documents with identical scores. This means that
Elastic may return documents with the same score in a different order when the same
query is sent several times. To solve this problem, you can add
another field to the sorting strategy besides the score, for example the date of creation.
In this case, subsets of documents with the same score are additionally sorted
by creation date.</p>
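<p>A quick model of why the tiebreaker helps, with made-up documents; the second sort key turns an arbitrary order into a deterministic one:</p>

```ruby
# Three documents with identical constant scores; sorting by score alone
# leaves their relative order up to the engine.
docs = [
  { id: 1, score: 2.0, created_at: 300 },
  { id: 2, score: 2.0, created_at: 100 },
  { id: 3, score: 2.0, created_at: 200 }
]

ordered = docs.sort_by { |d| [-d[:score], -d[:created_at]] }
ordered.map { |d| d[:id] } # => [1, 3, 2], newest first within a score

# The equivalent Elasticsearch sort clause would be something like:
# sort: [{ _score: 'desc' }, { created_at: 'desc' }]
```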
<h3 id="continuous-integration">Continuous Integration</h3>
<p>After switching from Okapi BM25 to <code class="language-plaintext highlighter-rouge">constant_score</code> the specs went green, but
that’s not the end of the story. Earlier I mentioned that another goal for this
task was to run the specs in a CI service. In theory it was easy, but in practice I ran into
a problem with custom tokenizers. We use synonym tokenizers for several search
features. Each tokenizer has its own set of synonyms, and some of them are extremely
large. There are two ways to define synonym sets for tokenizers. The first
is defining the synonyms directly in the mappings. However, if your set is too big, it’s
recommended to define it in a file, otherwise the mappings get bloated. Here
comes the problem: we had plenty of files with synonyms, but since version 5,
Elasticsearch requires a file’s path to be relative to its configuration
directory. It’s implemented this way to prevent Elastic from accessing
directories outside of its own. For development, it’s convenient to store the
synonym files in the repository and symlink them into the Elastic configuration
directory. However, it’s not possible to do something like that in CI. At the
time we were using VexorCI, and their support is amazing: we provided them with the
synonym files and they built a special Elasticsearch image for us,
bundled with the synonym files. After that we were able to run the specs for
Elastic in the CI service.</p>
<h3 id="conclusion">Conclusion</h3>
<p>I strongly advise against using this approach, because disabling TF-IDF isn’t something
you usually do with a full-text search engine. I was lucky that
we were building a product that sells photos; otherwise I would have thrown the
idea away.</p>

<h1>How I shoot in the foot with a case operator in Ruby</h1>
<p>Timur Yanberdin · DevLog · 2017-08-26</p>
<p>Before I started learning Ruby in 2010, I had been programming in Pascal, Delphi and
C++ to solve ACM-like problems. All of those programming languages have a
<code class="language-plaintext highlighter-rouge">switch</code>/<code class="language-plaintext highlighter-rouge">case</code> operator, and in all of them it works pretty straightforwardly:
it compares a variable/object/result of an expression against several values and
decides which branch to execute.
When I first needed one in Ruby, I just looked up the
<code class="language-plaintext highlighter-rouge">case</code> syntax. I thought, what could possibly go wrong with the <code class="language-plaintext highlighter-rouge">case</code> operator?</p>
<p>Recently I worked with the Dropbox API to migrate an application to API v2.0. I
used an <a href="https://github.com/Jesus/dropbox_api">open-source library</a> that is
being developed by the community. Although most of the primary endpoints and
features are covered by this gem, I found that it ignores the
<code class="language-plaintext highlighter-rouge">media_info</code> field of <code class="language-plaintext highlighter-rouge">DropboxApi::Metadata::File</code>. This field was critically
important for my task, so I decided to modify the gem and send a pull request.</p>
<p>First of all, I needed to add <code class="language-plaintext highlighter-rouge">Hash</code> type casting to the <code class="language-plaintext highlighter-rouge">force_cast</code> method:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">force_cast</span><span class="p">(</span><span class="n">object</span><span class="p">)</span>
<span class="k">if</span> <span class="vi">@type</span> <span class="o">==</span> <span class="no">String</span>
<span class="n">object</span><span class="p">.</span><span class="nf">to_s</span>
<span class="k">elsif</span> <span class="vi">@type</span> <span class="o">==</span> <span class="no">Time</span>
<span class="no">Time</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="n">object</span><span class="p">)</span>
<span class="k">elsif</span> <span class="vi">@type</span> <span class="o">==</span> <span class="no">Integer</span>
<span class="n">object</span><span class="p">.</span><span class="nf">to_i</span>
<span class="k">elsif</span> <span class="vi">@type</span> <span class="o">==</span> <span class="no">Symbol</span>
<span class="n">object</span><span class="p">[</span><span class="s2">".tag"</span><span class="p">].</span><span class="nf">to_sym</span>
<span class="k">elsif</span> <span class="vi">@type</span> <span class="o">==</span> <span class="ss">:boolean</span>
<span class="n">object</span><span class="p">.</span><span class="nf">to_s</span> <span class="o">==</span> <span class="s2">"true"</span>
<span class="k">elsif</span> <span class="vi">@type</span><span class="p">.</span><span class="nf">ancestors</span><span class="p">.</span><span class="nf">include?</span> <span class="no">DropboxApi</span><span class="o">::</span><span class="no">Metadata</span><span class="o">::</span><span class="no">Base</span>
<span class="vi">@type</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">object</span><span class="p">)</span>
<span class="k">else</span>
<span class="k">raise</span> <span class="no">NotImplementedError</span><span class="p">,</span> <span class="s2">"Can't cast `</span><span class="si">#{</span><span class="vi">@type</span><span class="si">}</span><span class="s2">`"</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Instead of writing yet another <code class="language-plaintext highlighter-rouge">elsif</code> branch, I decided to rewrite this method with a
<code class="language-plaintext highlighter-rouge">case</code> operator, because all the comparisons are made against the same object,
<code class="language-plaintext highlighter-rouge">@type</code>. Code written with the <code class="language-plaintext highlighter-rouge">case</code> operator is easier to read: from
the first line it is obvious that every branch compares against the same argument.</p>
<p>At first sight the last <code class="language-plaintext highlighter-rouge">elsif</code> looks like a problem, but
a lambda comes to the rescue: <code class="language-plaintext highlighter-rouge">Proc#===</code> calls the proc with the tested value, so a lambda can serve as a <code class="language-plaintext highlighter-rouge">when</code> condition.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">force_cast</span><span class="p">(</span><span class="n">object</span><span class="p">)</span>
<span class="k">case</span> <span class="vi">@type</span>
<span class="k">when</span> <span class="no">String</span>
<span class="n">object</span><span class="p">.</span><span class="nf">to_s</span>
<span class="k">when</span> <span class="no">Time</span>
<span class="no">Time</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="n">object</span><span class="p">)</span>
<span class="k">when</span> <span class="no">Integer</span>
<span class="n">object</span><span class="p">.</span><span class="nf">to_i</span>
<span class="k">when</span> <span class="no">Symbol</span>
<span class="n">object</span><span class="p">[</span><span class="s2">".tag"</span><span class="p">].</span><span class="nf">to_sym</span>
<span class="k">when</span> <span class="ss">:boolean</span>
<span class="n">object</span><span class="p">.</span><span class="nf">to_s</span> <span class="o">==</span> <span class="s2">"true"</span>
<span class="k">when</span> <span class="o">-></span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span> <span class="n">t</span><span class="p">.</span><span class="nf">ancestors</span><span class="p">.</span><span class="nf">include?</span><span class="p">(</span><span class="no">DropboxApi</span><span class="o">::</span><span class="no">Metadata</span><span class="o">::</span><span class="no">Base</span><span class="p">)</span> <span class="p">}</span>
<span class="vi">@type</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">object</span><span class="p">)</span>
<span class="k">else</span>
<span class="k">raise</span> <span class="no">NotImplementedError</span><span class="p">,</span> <span class="s2">"Can't cast `</span><span class="si">#{</span><span class="vi">@type</span><span class="si">}</span><span class="s2">`"</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>When I tested the result of my refactoring, I found that it was broken: the
<code class="language-plaintext highlighter-rouge">case</code> didn’t match any branch and raised a <code class="language-plaintext highlighter-rouge">NotImplementedError</code>. It was
the first time I actually read how <code class="language-plaintext highlighter-rouge">case</code> works in the official Ruby documentation.</p>
<p>After all these years of programming in Ruby I discovered that
<code class="language-plaintext highlighter-rouge">case</code> doesn’t compare the argument with <code class="language-plaintext highlighter-rouge">==</code>; it applies the <code class="language-plaintext highlighter-rouge">===</code> operator to it.
This “threequals” operator is not a plain equality check:
<code class="language-plaintext highlighter-rouge">a === b</code> asks whether the object on
the right side belongs to the set described by the left side. In some cases this is
really simple, for example</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sr">/qwe/</span> <span class="o">===</span> <span class="s1">'qwerty'</span> <span class="c1">#=> true</span>
</code></pre></div></div>
<p>because the regular
expression <code class="language-plaintext highlighter-rouge">/qwe/</code> describes the set of all strings it matches, and <code class="language-plaintext highlighter-rouge">'qwerty'</code> is one of them. Another
example:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">1</span><span class="o">..</span><span class="mi">10</span><span class="p">)</span> <span class="o">===</span> <span class="mi">4</span> <span class="c1">#=> true</span>
</code></pre></div></div>
<p>The integer 4 is included in the range 1..10.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Integer</span> <span class="o">===</span> <span class="mi">4</span> <span class="c1">#=> true</span>
</code></pre></div></div>
<p>This is also true, because 4 belongs to the set of all possible integers.
It is important to know that the <code class="language-plaintext highlighter-rouge">===</code> operator is not symmetric.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">4</span> <span class="o">===</span> <span class="no">Integer</span> <span class="c1">#=> false</span>
</code></pre></div></div>
<p>Let’s move on to the tricky part:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">4</span> <span class="o">===</span> <span class="mi">4</span> <span class="c1">#=> true</span>
</code></pre></div></div>
<p>Ruby is an object-oriented programming language: any class can provide its own
implementation of a method.
Let’s check the <code class="language-plaintext highlighter-rouge">===</code> implementation for a <code class="language-plaintext highlighter-rouge">Fixnum</code> via pry <code class="language-plaintext highlighter-rouge">pry(main)> $ 4.===</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">From:</span> <span class="n">numeric</span><span class="p">.</span><span class="n">c</span> <span class="p">(</span><span class="n">C</span> <span class="n">Method</span><span class="p">)</span><span class="o">:</span>
<span class="n">Owner</span><span class="o">:</span> <span class="n">Fixnum</span>
<span class="n">Visibility</span><span class="o">:</span> <span class="n">public</span>
<span class="n">Number</span> <span class="n">of</span> <span class="n">lines</span><span class="o">:</span> <span class="mi">15</span>
<span class="k">static</span> <span class="n">VALUE</span>
<span class="nf">fix_equal</span><span class="p">(</span><span class="n">VALUE</span> <span class="n">x</span><span class="p">,</span> <span class="n">VALUE</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span> <span class="k">return</span> <span class="n">Qtrue</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">FIXNUM_P</span><span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="k">return</span> <span class="n">Qfalse</span><span class="p">;</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">RB_TYPE_P</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">T_BIGNUM</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">rb_big_eq</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">RB_TYPE_P</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">T_FLOAT</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">rb_integer_float_eq</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">else</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">num_equal</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now it is clear: in Ruby, <code class="language-plaintext highlighter-rouge">Fixnum</code> objects treat <code class="language-plaintext highlighter-rouge">===</code> as a simple equality check.
As for a human interpretation, it can be read as: “this
4 belongs to the set whose only element is 4”.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Integer</span> <span class="o">===</span> <span class="no">Integer</span> <span class="c1">#=> false</span>
</code></pre></div></div>
<p>The root of my problem is hidden in this line! I used <code class="language-plaintext highlighter-rouge">case</code> to compare
classes, but it doesn’t work the way I expected.
Let’s jump straight to the implementation of <code class="language-plaintext highlighter-rouge">$ Integer.===</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">From:</span> <span class="n">object</span><span class="p">.</span><span class="n">c</span> <span class="p">(</span><span class="n">C</span> <span class="n">Method</span><span class="p">)</span><span class="o">:</span>
<span class="n">Owner</span><span class="o">:</span> <span class="n">Module</span>
<span class="n">Visibility</span><span class="o">:</span> <span class="n">public</span>
<span class="n">Number</span> <span class="n">of</span> <span class="n">lines</span><span class="o">:</span> <span class="mi">5</span>
<span class="k">static</span> <span class="n">VALUE</span>
<span class="nf">rb_mod_eqq</span><span class="p">(</span><span class="n">VALUE</span> <span class="n">mod</span><span class="p">,</span> <span class="n">VALUE</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">rb_obj_is_kind_of</span><span class="p">(</span><span class="n">arg</span><span class="p">,</span> <span class="n">mod</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It uses the standard <code class="language-plaintext highlighter-rouge">kind_of?</code> check; recalling the set analogy, the class
<code class="language-plaintext highlighter-rouge">Integer</code> is not itself an instance of <code class="language-plaintext highlighter-rouge">Integer</code>, so it doesn’t belong to its own set.</p>
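<p>The asymmetry is easy to confirm in an irb session:</p>

```ruby
# Module#=== asks "is the right-hand side an instance of me?"
Integer === 4        #=> true:  4 is an instance of Integer
Integer === Integer  #=> false: the class Integer is an instance of Class, not of Integer

# which is why a `when SomeClass` branch never matches
# when the tested value is the class itself
```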
<p>That’s why my implementation of <code class="language-plaintext highlighter-rouge">force_cast</code> fails: the <code class="language-plaintext highlighter-rouge">case</code>
operator isn’t suitable for comparing classes. In the end, I had to revert my
refactoring and add another <code class="language-plaintext highlighter-rouge">elsif</code> condition.</p>Timur YanberdinNested objects in ElasticSearch: hidden and dangerous2017-07-23T13:24:50+00:002017-07-23T13:24:50+00:00/dev/2017/07/23/nested-objects-in-elastic-search-hidden-and-dangerous<p>Last week it took me a lot of time to debug a broken query that depends on an array of objects. The problem was really simple, but I ended up reindexing all the data. It can easily be avoided if you know about one exception to the rule of how arrays work in ES.</p>
<p>To create an index in ES, a mapping declaring the fields and their datatypes should be provided. Regardless of the datatype, any field can be treated as an array of elements of the selected type. For example, if the field <code class="language-plaintext highlighter-rouge">tag</code> is declared with the <code class="language-plaintext highlighter-rouge">text</code> datatype, you can store there either a single string or an array of strings. No additional actions are needed to search for a value inside an array. The query:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"tag"</span><span class="p">:</span><span class="w"> </span><span class="s2">"film"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>matches both of the following documents:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w"> </span><span class="nl">"tag"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"film"</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"tag"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"film"</span><span class="p">,</span><span class="w"> </span><span class="s2">"grain"</span><span class="p">]</span><span class="w"> </span><span class="p">}</span></code></pre></figure>
<p>However, it doesn’t work that way with objects. ES can also handle the <code class="language-plaintext highlighter-rouge">object</code> datatype as an array of objects, but by default they can’t be queried independently as separate objects. And here comes my problem.</p>
<p>Let’s assume that we have objects with two fields: the first contains a tag name that describes a document, and the second contains a probability value for this tag (0-100). For example: <code class="language-plaintext highlighter-rouge">[{ "name" : "dog", "probability" : 93 }, { "name" : "fur", "probability" : 80 }]</code>. I need to match all documents that have a tag with a probability greater than or equal to 90. Based on the assumption that ES stores the objects as an array, I wrote the query:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"must"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"tag.name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"fur"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"range"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"tag.probability"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"gte"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">90</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">bool must</code> syntax means that both nested conditions must be true. I expected this query to match documents that have an object with <code class="language-plaintext highlighter-rouge">tag == fur</code> and <code class="language-plaintext highlighter-rouge">probability >= 90</code>. However, the document from the example was matched, despite the probability of the ‘fur’ tag being 80.</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"tag"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"dog"</span><span class="p">,</span><span class="w"> </span><span class="nl">"probability"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">93</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"fur"</span><span class="p">,</span><span class="w"> </span><span class="nl">"probability"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">80</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>After several debugging sessions I found out that, by default, ES flattens an array of objects into separate arrays of its fields’ values. The previous document is represented as:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"tag.name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"dog"</span><span class="p">,</span><span class="w"> </span><span class="s2">"fur"</span><span class="p">],</span><span class="w">
</span><span class="nl">"tag.probability"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mi">93</span><span class="p">,</span><span class="w"> </span><span class="mi">80</span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The query matches the document because ‘fur’ appears in <code class="language-plaintext highlighter-rouge">tag.name</code> and there is a value of at least 90 in the <code class="language-plaintext highlighter-rouge">tag.probability</code> field. This behaviour isn’t correct for my case.</p>
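<p>The false match can be mimicked in a few lines of Ruby: once the objects are flattened, the two conditions run against independent arrays, so they can be satisfied by different objects.</p>

```ruby
# The flattened representation of the example document
doc = {
  "tag.name"        => ["dog", "fur"],
  "tag.probability" => [93, 80]
}

# Each condition checks its own array, losing the pairing
# between a tag's name and its probability
matches = doc["tag.name"].include?("fur") &&
          doc["tag.probability"].any? { |p| p >= 90 }

matches #=> true, even though "fur" actually has probability 80
```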
<p>To query objects as whole entities in ElasticSearch, the array of objects should be indexed with the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html">nested datatype</a>.
In this case each object is stored as a separate hidden document and can be queried independently with a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html">nested query</a>:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"nested"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"tag"</span><span class="p">,</span><span class="w">
</span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"must"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"tag.name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fur"</span><span class="w"> </span><span class="p">}},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"tag.probability"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"gte"</span><span class="p">:</span><span class="w"> </span><span class="mi">90</span><span class="w"> </span><span class="p">}}}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>This way gives the correct result, but if you have already set the datatype to <code class="language-plaintext highlighter-rouge">object</code>, or you have <code class="language-plaintext highlighter-rouge">dynamic mapping</code> enabled (it also maps the datatype to <code class="language-plaintext highlighter-rouge">object</code>), then you can’t just update your mapping: you are doomed to reindex all the data from scratch.</p>
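<p>For reference, the declaration of the <code class="language-plaintext highlighter-rouge">nested</code> datatype in the mapping might look roughly like this (a sketch; the field names follow the example above, and in older ES versions the <code class="language-plaintext highlighter-rouge">properties</code> block sits under a mapping type name):</p>

```json
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "nested",
        "properties": {
          "name":        { "type": "text" },
          "probability": { "type": "integer" }
        }
      }
    }
  }
}
```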
<p>At this point I thought my problem was solved. I was naive. After several days of reindexing with the <code class="language-plaintext highlighter-rouge">nested object</code> datatype, I realized that the number of documents in the index had increased by 700% and the disk space consumed by the index had grown by 130%. These new mysterious documents were the hidden nested objects. Even without them I had a large amount of data stored in the index. But the worst consequence was a performance regression: search queries took almost 10 times longer than before reindexing.</p>
<p>Such a performance regression was unacceptable to me, so I came up with another solution. Since the threshold for <code class="language-plaintext highlighter-rouge">tag.probability</code> was fixed for all queries and I use another database as the main storage, I could move the <code class="language-plaintext highlighter-rouge">tag.probability >= 90</code> part of the query to the indexing stage. I decided to declare a field <code class="language-plaintext highlighter-rouge">trusted_tags</code> in the mapping and serialize into it only the tags that have <code class="language-plaintext highlighter-rouge">probability >= 90</code>. With this approach there is no need for an additional <code class="language-plaintext highlighter-rouge">probability</code> condition in the query; the correct documents are matched with:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"trusted_tags"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fur"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
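<p>The indexing-stage filtering can be sketched in Ruby. The method name and document shape here are hypothetical; only the <code class="language-plaintext highlighter-rouge">probability >= 90</code> rule comes from the text:</p>

```ruby
# Hypothetical pre-indexing step: keep only the names of tags
# whose probability passes the fixed threshold (assumed: 90)
TRUSTED_THRESHOLD = 90

def trusted_tags(tags)
  tags.select { |t| t["probability"] >= TRUSTED_THRESHOLD }
      .map { |t| t["name"] }
end

tags = [
  { "name" => "dog", "probability" => 93 },
  { "name" => "fur", "probability" => 80 }
]

trusted_tags(tags) #=> ["dog"]
```

<p>The resulting array would be serialized into the <code class="language-plaintext highlighter-rouge">trusted_tags</code> field of the indexed document, so the search side never needs to look at probabilities.</p>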
<p>Despite the fact that it works like a charm, I had to reindex all the data from scratch again. In the end, I categorically wouldn’t recommend using the <code class="language-plaintext highlighter-rouge">nested object</code> datatype if the number of nested documents is much bigger than the number of actual indexed documents.</p>Timur YanberdinWhy one shouldn’t use ElasticSearch as a data storage2017-04-22T07:53:53+00:002017-04-22T07:53:53+00:00/dev/2017/04/22/elastic-search-is-not-db<p>It’s easy to notice that the popularity of ElasticSearch has been growing fast. Almost every new project involving full-text search and scoring prefers it to Sphinx. A lot of people who are new to full-text search frameworks have tried ES in their projects. Last year I had the pleasure of drinking a beer with two different groups of Ruby developers, and in both of them someone said: “We have used ElasticSearch to store our data and we got so many troubles with it”. To be honest, it generally isn’t a good idea to use ES as the main storage for your project. Let me explain why.</p>
<p>ElasticSearch is a search engine: it stores documents in indices spread across the shards of a cluster’s nodes. Indices represent collections of documents with similar characteristics. In ElasticSearch, a document is a JSON object that contains different fields. ES can’t scan all documents sequentially to handle search queries, because that would be too slow, so it uses a search index. To understand how it works, you should know the basics of indexing.</p>
<p>The lowest level of the search abstraction in ES is the inverted index: a data structure that maps every word occurring in the documents to the documents that contain it. Let’s assume we have these 3 sentences:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>sentence</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>This is my cat</td>
</tr>
<tr>
<td>2</td>
<td>I like your cat</td>
</tr>
<tr>
<td>3</td>
<td>Here is my number</td>
</tr>
</tbody>
</table>
<p>Now we need to lowercase words and build an inverted index for them:</p>
<table>
<thead>
<tr>
<th>term</th>
<th>sentence_id</th>
</tr>
</thead>
<tbody>
<tr>
<td>this</td>
<td>1</td>
</tr>
<tr>
<td>is</td>
<td>1, 3</td>
</tr>
<tr>
<td>my</td>
<td>1, 3</td>
</tr>
<tr>
<td>cat</td>
<td>1, 2</td>
</tr>
<tr>
<td>i</td>
<td>2</td>
</tr>
<tr>
<td>like</td>
<td>2</td>
</tr>
<tr>
<td>your</td>
<td>2</td>
</tr>
<tr>
<td>here</td>
<td>3</td>
</tr>
<tr>
<td>number</td>
<td>3</td>
</tr>
</tbody>
</table>
<p>For example, with the help of this index the word “cat” can be found in sentences 1 and 2. From this example it is clear that the original data isn’t used when finding document ids with an inverted index. By default, ElasticSearch stores the JSON data in the <code class="language-plaintext highlighter-rouge">_source</code> field and returns it in search responses. The source data can be used for highlighting, reindexing or displaying to the user. The <code class="language-plaintext highlighter-rouge">_source</code> parameter can be set to <code class="language-plaintext highlighter-rouge">false</code>; in that case the data has to be retrieved from a different source, which is possible only if document ids in ES match document ids in the other storage. For example, one queries “cat” and gets ids [1, 2] in the result; a separate query with ids 1 and 2 is then sent to the document storage. This approach avoids data duplication and saves some space, but adds a small overhead for the extra query.</p>
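<p>The inverted index above can be sketched in a few lines of Ruby (a toy illustration, not how Lucene stores segments on disk):</p>

```ruby
sentences = {
  1 => "This is my cat",
  2 => "I like your cat",
  3 => "Here is my number"
}

# Build the inverted index: term => list of sentence ids containing it
inverted = Hash.new { |h, k| h[k] = [] }
sentences.each do |id, text|
  text.downcase.split.uniq.each { |term| inverted[term] << id }
end

inverted["cat"] #=> [1, 2]
inverted["my"]  #=> [1, 3]
```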
<p>Technically ElasticSearch can be used as the main data storage, but doing so has some unpleasant disadvantages.</p>
<p>To make your data searchable, ElasticSearch needs to know what type of data each field contains and how it should be indexed. The process of defining a schema with data types and indexing algorithms is called mapping. The schema should be provided when the index is created. When you pass your data to the ES index, Lucene, which sits under the hood of ES, builds many immutable inverted index segments. During a search, Lucene searches every segment and merges the results. Let’s look at a mapping example; it is written in an Elixir DSL, but the pure JSON looks pretty much the same:</p>
<figure class="highlight"><pre><code class="language-elixir" data-lang="elixir"> <span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="ss">index:</span> <span class="s2">"MyIndex"</span><span class="p">,</span> <span class="ss">type:</span> <span class="s2">"MyType"</span><span class="p">]</span>
<span class="n">settings</span> <span class="k">do</span>
<span class="n">analysis</span> <span class="k">do</span>
<span class="n">analyzer</span> <span class="s2">"standard_snowball"</span><span class="p">,</span>
<span class="p">[</span>
<span class="ss">filter:</span> <span class="p">[</span><span class="s2">"lowercase"</span><span class="p">,</span> <span class="s2">"stop"</span><span class="p">,</span> <span class="s2">"snowball"</span><span class="p">],</span>
<span class="ss">tokenizer:</span> <span class="s2">"standard"</span>
<span class="p">]</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">mappings</span> <span class="ss">_source:</span> <span class="p">%{</span><span class="ss">enabled:</span> <span class="no">false</span><span class="p">}</span> <span class="k">do</span>
<span class="n">indexes</span> <span class="s2">"tags"</span><span class="p">,</span> <span class="ss">type:</span> <span class="s2">"string"</span><span class="p">,</span> <span class="ss">analyzer:</span> <span class="s2">"standard_snowball"</span>
<span class="n">indexes</span> <span class="s2">"name"</span><span class="p">,</span> <span class="ss">type:</span> <span class="s2">"string"</span><span class="p">,</span> <span class="ss">analyzer:</span> <span class="s2">"standard_snowball"</span>
<span class="n">indexes</span> <span class="s2">"created_time"</span><span class="p">,</span> <span class="ss">type:</span> <span class="s2">"date"</span>
<span class="n">indexes</span> <span class="s2">"username"</span><span class="p">,</span> <span class="ss">type:</span> <span class="s2">"string"</span><span class="p">,</span> <span class="ss">index:</span> <span class="s2">"not_analyzed"</span>
<span class="k">end</span></code></pre></figure>
<p>Here I’ve defined an index named “MyIndex” with type “MyType”. I’ve also created an analyzer called “standard_snowball”: it uses the <code class="language-plaintext highlighter-rouge">standard</code> tokenizer to break sentences into tokens, then lowercases them with the <code class="language-plaintext highlighter-rouge">lowercase</code> filter, removes stop words with the <code class="language-plaintext highlighter-rouge">stop</code> filter and stems them with the <code class="language-plaintext highlighter-rouge">snowball</code> filter. I use this analyzer for the “tags” and “name” string fields. It is important to note that it is applied both to the source data and to search queries against these fields. The search-time analyzer can be changed at any time, but it is impossible to change the analyzer for data that has already been analyzed. In most cases it is also impossible to change a field’s datatype: a mapping can’t be updated to convert a string field into a date field, and it isn’t even possible to rename a field in an existing mapping. If you need to add a new field, however, you can update the mapping and the new data will be indexed properly.</p>
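To make the analysis chain concrete, here is a toy Ruby simulation of what such an analyzer does to a sentence. The stop-word list and the suffix-stripping “stemmer” are deliberately naive stand-ins for the real <code class="language-plaintext highlighter-rouge">stop</code> and <code class="language-plaintext highlighter-rouge">snowball</code> filters; only the shape of the pipeline is the point.

```ruby
# A crude approximation of the real stop filter's word list.
STOP_WORDS = %w[a an the i is are your].freeze

def naive_stem(token)
  # Crude suffix stripping, only to illustrate the idea of stemming.
  token.sub(/(ing|ed|s)\z/, "")
end

def analyze(text)
  text.scan(/\w+/)                         # "standard"-style tokenizer
      .map(&:downcase)                     # lowercase filter
      .reject { |t| STOP_WORDS.include?(t) } # stop filter
      .map { |t| naive_stem(t) }           # snowball-like stemming
end

analyze("I like your cats")  # => ["like", "cat"]
```

Because the same chain runs at index time and at query time, a search for “cats” and a document containing “cat” both end up as the token “cat” and therefore match.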
<p>To change the analyzers of existing data, all of it has to be reindexed. If <code class="language-plaintext highlighter-rouge">_source</code> is set to <code class="language-plaintext highlighter-rouge">false</code>, the main data storage has to be used for reindexing; if the source is stored, a new index can be created and the data reindexed from the old one. This process is usually quite slow, but it can be done with zero downtime. Of course reindexing isn’t an unbeatable challenge, but it’s better to keep in mind that it can cause some pain.</p>
<p>Unfortunately, Elasticsearch has some disadvantages that can’t be easily worked around when it is used as a storage. First of all, it isn’t <a href="https://en.wikipedia.org/wiki/ACID">ACID</a>. It doesn’t support transactions, so it isn’t possible to roll back the first stored document if saving the second one fails.</p>
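A small sketch of why the lack of transactions hurts. The hash here stands in for an index; when the second of two related writes fails, the first one is already persisted and there is nothing to roll back.

```ruby
# Stand-in for an index; in real life these would be two index requests.
store = {}

begin
  store[1] = { name: "order" }      # first document saved successfully
  raise "index rejected document"   # second save fails...
  store[2] = { name: "order_line" } # ...so this line is never reached
rescue RuntimeError
  # No transaction to roll back: document 1 is already persisted,
  # leaving the data inconsistent unless we clean it up by hand.
end

store.keys  # => [1]
```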
<p>As I said earlier, it stores documents as JSON, so it is schemaless and non-relational. This means no join queries, no migrations and none of the other features we love in relational databases. Yes, Elasticsearch has a parent-child relationship, but it comes with certain <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-parent-field.html#_parent_child_restrictions">restrictions</a>. Maybe it’s a bit personal, but after more than a year of working with MongoDB I have my reasons to dislike NoSQL.</p>
<p>Another problem is deep pagination. Because Elasticsearch is distributed, paginating to distant pages requires a lot of memory; there is even a default limit on it. If an index is spread across 5 shards and someone needs the first 20 documents, all of them may be located on one shard or divided between 2-5 shards. To produce the result, each shard retrieves its first 20 documents and passes them to the coordinating node, which sorts the 100 documents and returns only the first 20. If the user jumps to page 1001, the system needs to return the documents at positions 20001 to 20020. To make that happen, each shard has to find its top 20020 documents, the coordinating node has to sort 100100 documents in total, and then it drops the first 100080 of them. To avoid this overkill, the query API has a <code class="language-plaintext highlighter-rouge">search_after</code> parameter: it accepts the sort values of the last document on the previous page (an id or another unique value) and retrieves the next page from there. It’s a nice solution for paginating over all data page by page, but it’s useless if the user needs to jump to a specific page.</p>
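The arithmetic above generalizes neatly; here is a back-of-the-envelope cost function for <code class="language-plaintext highlighter-rouge">from</code>/<code class="language-plaintext highlighter-rouge">size</code> pagination in a distributed index. The function and its names are mine, not an Elasticsearch API.

```ruby
# With from/size pagination, every shard must return its own top
# (from + size) hits, and the coordinating node sorts all of them
# before discarding everything except the requested page.
def pagination_cost(shards:, page:, per_page:)
  window    = page * per_page   # from + size for this page
  fetched   = shards * window   # docs pulled from all shards combined
  discarded = fetched - per_page # sorted and then thrown away
  { per_shard: window, sorted_total: fetched, discarded: discarded }
end

pagination_cost(shards: 5, page: 1, per_page: 20)
# per_shard: 20, sorted_total: 100, discarded: 80

pagination_cost(shards: 5, page: 1001, per_page: 20)
# per_shard: 20020, sorted_total: 100100, discarded: 100080
```

The cost grows linearly with the page number on every shard at once, which is exactly why jumping deep into the result set is so expensive.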
<p>As for security, it is true that old versions of Elasticsearch allowed anyone who could connect to the cluster to run any request, but current versions have <a href="https://www.elastic.co/guide/en/x-pack/current/xpack-security.html">authorization via X-Pack</a>, so it’s not a problem anymore.</p>
<p>Also, some people complain about robustness. As for me, I’ve never run into <code class="language-plaintext highlighter-rouge">OutOfMemory</code> errors, so I can’t comment on this topic. Maybe I just don’t have that much data.</p>
<p>If you’re OK with these drawbacks, you can try using Elasticsearch as a primary database, but I can’t recommend it. What’s more, even its authors don’t promote it as a storage.</p>