Explore kopf, a very useful Elasticsearch plugin.

Learn how to organise your data.

Indexing, Mapping, Templating

Today we are going to explore Elasticsearch a bit more and the way it works. First, you need to understand how data is stored, and the inverted indexing process behind it.

It’s quite simple to explain with a schema (coming from the official Elastic documentation):

[Figure: inverted index schema, from the Elasticsearch documentation]

In a classical SQL database, you would have saved your data as shown on the left: 3 entries in your database, with 1, 2 and 3 as keys and an associated text. To invert the index, as Elasticsearch does, each word of your document becomes a key (a token in the ELK world).

Each token is associated with a frequency and references to the documents it appears in. For instance, the word “the” appears 2 times: once in document 2 and once in document 3.
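To make this concrete, here is a toy sketch (plain Python with made-up documents, not how Elasticsearch implements it internally) that builds such an inverted index:

# toy inverted index: token -> ids of the documents it appears in
docs = {
  1: "quick brown foxes",
  2: "the lazy dog",
  3: "over the lazy dog",
}

inverted = {}
for doc_id, text in docs.items():
  for token in text.lower().split():
    inverted.setdefault(token, []).append(doc_id)

print inverted["the"]       # [2, 3]: "the" appears twice, in documents 2 and 3
print len(inverted["the"])  # its frequency: 2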

That’s why, in the previous blog post, we removed every white space, slash and dash from the field name:

station["fields"]["name"]=station["fields"]["name"].replace(" ","_")
                                                   .replace("-","_")
                                                   .replace("/","_")

This way, the field name “034 – fontainas / fontainas” is indexed as the single token “034___fontainas___fontainas” and not as the separate tokens “034” and “fontainas”.
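You can verify the transformation in a quick Python session (with a plain hyphen in the name):

name = "034 - fontainas / fontainas"
print name.replace(" ", "_").replace("-", "_").replace("/", "_")
# prints: 034___fontainas___fontainas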

[Figure: Pie Panel with the number of available bikes per Villo station]

In this Pie Panel we display the sum of available bikes per station: we can see that there are 27 available bikes at the station 034___fontainas___fontainas (034 – fontainas / fontainas). If we remove the replacement of white spaces, dashes and slashes, the same Pie Panel looks like this:

[Figure: Pie Panel with the number of available bikes per Villo station, field analyzed]

… and loses most of its interest, since it’s not really useful to know that there are 349 available bikes in stations whose name contains the word ‘de’.

Well, you might think it’s annoying to have to do some replacement before inserting data into your Elasticsearch node, and you are right, it is annoying…

But don’t worry, the ELK team already has a solution for this kind of problem. Open your kopf console (localhost:9201/_plugin/kopf) and show the mapping associated with your index “villowithid”:

[Figure: kopf showing the mapping of your index]

You should see something like this:

[Figure: mapping of index villowithid]

This is the mapping of your index “villowithid”. Let’s see how the field we are focusing on is mapped:

"name": {
   "type": "string",
   "index": "not_analyzed"
    }
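If you don’t have kopf at hand, you can also fetch the same mapping with the Python client:

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['127.0.0.1:9201'])

# returns the full mapping of the index as a dict
print client.indices.get_mapping(index='villowithid')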

So far, no surprises: a “string” field is analyzed by default, which gives the one-token-per-word behaviour. But remember, we want our entire field to be indexed as one single token, as we achieved previously with the underscore trick.

So let’s modify the mapping.

Go to more -> create index, select villowithid in the combobox “load settings from existing index”, name the new index villowithid2, and modify the mapping of the field “name” by adding “index”: “not_analyzed”, as shown below:

[Figure: creation of a new index in kopf]
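If you prefer doing this from code rather than through the kopf UI, a minimal sketch with the Python client would look like this (only the “name” field is mapped explicitly; the settings loaded from the existing index are omitted):

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['127.0.0.1:9201'])

# create villowithid2 with "name" indexed as one single token
client.indices.create(index='villowithid2', body={
  "mappings": {
    "station": {
      "properties": {
        "fields": {
          "properties": {
            # the station name lives under fields.name in our documents
            "name": {"type": "string", "index": "not_analyzed"}
          }
        }
      }
    }
  }
})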

Once the form is filled in, don’t forget to click “Create”. You still have to modify the python code (lines 23 and 25) so the insert goes into index villowithid2 and not into the old one villowithid:

bulk_body += '{ "index" : { "_index" : "villowithid2", "_type" : "station","_id":"'+station["recordid"]+'"} }\n'

You can do the same for the index histvillo and create histvillo2. Once done, create the same Pie Panel in Kibana, and… magic, the field “name” is indexed as we want.

[Figure: Pie Panel with the number of available bikes per station, field not_analyzed]

Let’s go even deeper into indexing. From experience, we noticed that it’s really important to split your historized data into multiple indexes: it’s better for performance, and it makes deleting old records you don’t need anymore much easier.
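Once such a split is in place, dropping a whole day of history becomes a single, cheap operation instead of an expensive record-by-record deletion. A minimal sketch with the Python client, assuming daily indexes like the ones we are about to create:

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['127.0.0.1:9201'])

# delete one full day of history in one shot
client.indices.delete(index='histvillo-2016-10-31')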

First, let’s modify our python code to insert into a daily index:

from datetime import datetime
from elasticsearch import Elasticsearch
import urllib2
import json
import time

client = Elasticsearch(hosts=['127.0.0.1:9201'])

now = datetime.now()

def fetch_villo():
  url = 'http://opendata.bruxelles.be/api/records/1.0/search/?dataset=stations-villo-disponibilites-en-temps-reel&rows=1000&facet=banking&facet=bonus&facet=status&facet=contract_name'

  h = urllib2.urlopen(url)
  res = h.read()
  # strip stray "\u0" sequences before parsing the JSON
  res = res.replace("\u0", "")
  data = json.loads(res)

  bulk_body = ""

  for station in data["records"]:
    # no longer needed once the not_analyzed mapping is in place:
    #station["fields"]["name"]=station["fields"]["name"].replace(" ","_").replace("-","_").replace("/","_")
    jsondata = json.dumps(station)
    # str(now)[:10] keeps only the YYYY-MM-DD part: one index per day
    bulk_body += '{ "index" : { "_index" : "villowithid-'+str(now)[:10]+'", "_type" : "station","_id":"'+station["recordid"]+'"} }\n'
    bulk_body += jsondata+'\n'
    bulk_body += '{ "index" : { "_index" : "histvillo-'+str(now)[:10]+'", "_type" : "station"} }\n'
    bulk_body += jsondata+'\n'

  print "Bulk ready."
  client.bulk(body=bulk_body)
  print "Bulk gone."

for i in range(0,10):
  print '*'*80
  fetch_villo()
  time.sleep(30)
  print '*'*80

We create a variable “now” (the current date), and use str(now)[:10] (the “YYYY-MM-DD” part, e.g. “2016-10-31”) in both bulk header lines to get our indexes historized by day.

The second step is to create a template in kopf (http://localhost:9201/_plugin/kopf/#!/indexTemplates) to apply the same mapping (the one with “index”: “not_analyzed” on the field “name”) to every index whose name matches the template pattern.

Go to more -> index template, copy-paste the mapping you created previously into the mapping section, and type “histvillo-*” in the template section (as shown below), since our indexes will be named histvillo-2016-10-31, histvillo-2016-11-01, histvillo-2016-11-02…

This says that every index whose name matches histvillo-* (the star matches anything) will take the mapping defined in our template.

[Figure: kopf creation of a template]
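The same template can also be registered from code; here is a minimal sketch with the Python client (the template name histvillo_template is an arbitrary choice, and only the “name” field is mapped explicitly):

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['127.0.0.1:9201'])

# every index whose name matches histvillo-* gets this mapping at creation time
client.indices.put_template(name='histvillo_template', body={
  "template": "histvillo-*",
  "mappings": {
    "station": {
      "properties": {
        "fields": {
          "properties": {
            "name": {"type": "string", "index": "not_analyzed"}
          }
        }
      }
    }
  }
})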

Let’s save and run our python code. You should see 2 new indexes in your node:

[Figure: daily indexes]
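You can also check from code that they are there:

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['127.0.0.1:9201'])

# one line per index: health, status, name, number of documents, size...
print client.cat.indices()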

To be able to view your data in Kibana, you have to create a new index pattern (in Kibana) matching your new indexes; villowithid-* and histvillo-* should do the job.

[Figure: Kibana creation of a new index pattern]