Using Elasticsearch with Python Hosted on a Virtual Machine

Published in

Analytics Vidhya

6 min read · Aug 2, 2020

In this blog, we will go through the step-by-step process of hosting Elasticsearch on a virtual machine, and then connect to and use that server from Python.

About Elasticsearch

Elasticsearch is a distributed, open source search and analytics engine built on the Apache Lucene library. It was first released in 2010 and is suitable for all kinds of data, including textual, numeric, geospatial, structured, and unstructured. Elasticsearch exposes simple RESTful APIs for communication and is known for its efficiency and scalability. It is popularly used for building search engines for websites and apps, but can also be used for business and security analytics. Elasticsearch is written in Java, but has official clients for most major programming languages, including Python, which we are going to use for this demonstration. Companies that use Elasticsearch as part of their tech stack include Uber, Udemy, Slack, Netflix, and many more.

Setting Up the Virtual Machine

For demonstration purposes, we are going to set up Elasticsearch on an Ubuntu virtual machine running on our local machine, but the same steps can be followed for a machine in the cloud (e.g. a DigitalOcean droplet or an AWS EC2 instance). We are going to use VirtualBox to set up our virtual machine. Follow the steps in the VirtualBox installation guide listed in the References to set up Ubuntu.

Memory settings can be changed later, but it is assumed that the virtual machine has at least 1GB of memory. Open the settings of your virtual machine, go to Network, and set the Attached to option to Bridged Adapter.

Note: This is only necessary if you are using VirtualBox. A bridged adapter places the virtual machine on the same network as the host and gives it its own IP address, which lets us connect from the host machine using that IP address just as we would with a DigitalOcean droplet or an EC2 instance on AWS.

Installing Elasticsearch

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

This command ensures that packages from the Elasticsearch repository are trusted. Now, to add the Elasticsearch repository to the system, run the following command:

sudo sh -c 'echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" > /etc/apt/sources.list.d/elastic-7.x.list'

Next, we update the apt package lists and install Elasticsearch:

sudo apt-get update && sudo apt-get install elasticsearch

To ensure Elasticsearch starts automatically when the system boots, run the following command:

sudo systemctl enable elasticsearch.service

Configuring Elasticsearch

To access Elasticsearch from outside the virtual machine, we need to change its configuration. The server configuration settings are in the elasticsearch.yml file. To edit it, run the following command:

sudo nano /etc/elasticsearch/elasticsearch.yml

Set the network.host property to 0.0.0.0 and discovery.seed_hosts to an empty list:

network.host: 0.0.0.0
discovery.seed_hosts: []

This setting makes Elasticsearch listen on all network interfaces, so it accepts connections from other machines. This would be a security concern in a production setup, but it will suffice for demonstration purposes.

By default, Elasticsearch sets its heap size to 1GB. Since our virtual machine only has 1GB of memory available, we need to change this to at most half of our memory size, i.e. 512MB. To do this, we need to edit the jvm.options file. To edit the file, run the following command in the terminal:

sudo nano /etc/elasticsearch/jvm.options

And make the following changes:

-Xms512m
-Xmx512m
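As a rough illustration of the half-of-memory rule above, the two heap flags can be derived from the amount of RAM in the machine. This is only a sketch: the function name heap_flags is made up for this example and is not part of Elasticsearch.

```python
from typing import List


def heap_flags(total_ram_mb: int) -> List[str]:
    """Return JVM heap flags that use half of the given RAM.

    Hypothetical helper for illustration; it only applies the
    "at most half of memory" rule described in the text.
    """
    heap_mb = total_ram_mb // 2
    # Minimum (-Xms) and maximum (-Xmx) heap are set to the same value.
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


# For the 1GB virtual machine used in this blog:
for flag in heap_flags(1024):
    print(flag)
```

For a machine with 1024MB of memory this produces the same two lines we put in jvm.options.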

After making these changes, restart Elasticsearch with the following command:

sudo systemctl restart elasticsearch.service

To check if Elasticsearch is working properly, run the following command from the terminal in your virtual machine:

curl -X GET "localhost:9200"

This should return a JSON response containing the node name, cluster name, and version details.

To access the Elasticsearch cluster from the host, we need the IP address of our virtual machine. Look it up from the terminal of the virtual machine and copy it.

Now we switch over to our host operating system.
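Before going further, it can help to confirm that the virtual machine is reachable from the host. The sketch below uses only the Python standard library and assumes the cluster is served over plain HTTP on port 9200, as configured above. The helper names cluster_url and fetch_cluster_info are our own and are not part of any Elasticsearch package.

```python
import json
import urllib.request


def cluster_url(host: str, port: int = 9200, path: str = "") -> str:
    """Build the HTTP URL for an Elasticsearch endpoint (hypothetical helper)."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch_cluster_info(host: str, port: int = 9200, timeout: float = 5.0) -> dict:
    """GET the root endpoint, which answers with cluster details as JSON."""
    with urllib.request.urlopen(cluster_url(host, port), timeout=timeout) as resp:
        return json.load(resp)


# Example usage (replace IP_Address with the address copied from the VM):
# info = fetch_cluster_info("IP_Address")
# print(info["cluster_name"], info["version"]["number"])
```

If the call raises a connection error, the usual suspects are the network.host setting or the VirtualBox network adapter mode.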

To install the official Elasticsearch Python client on our host machine, use the following command:

pip install elasticsearch

Download the sample data file accounts.json provided in the Elasticsearch documentation.

The documents in this randomly generated dataset contain user account information like:

{
  "account_number": 0,
  "balance": 16623,
  "firstname": "Bradshaw",
  "lastname": "Mckenzie",
  "age": 29,
  "gender": "F",
  "address": "244 Columbus Place",
  "employer": "Euron",
  "email": "bradshawmckenzie@euron.com",
  "city": "Hobucken",
  "state": "CO"
}

Go to the directory where you downloaded the file and use the following curl command to upload all the documents in the JSON file to the index "bank" in the Elasticsearch cluster running on the virtual machine:

curl -H "Content-Type: application/x-ndjson" -XPOST "IP_Address:9200/bank/_bulk" --data-binary "@accounts.json"

(This command automatically creates the index named bank and infers the field mappings.)
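The _bulk endpoint expects newline-delimited JSON: each document is preceded by an action line, and the body must end with a newline. To make that layout concrete, here is a small sketch that builds such a body from a list of Python dictionaries. The function name to_bulk_body is our own, and using account_number as the document ID is an assumption made for this illustration.

```python
import json
from typing import Dict, Iterable


def to_bulk_body(docs: Iterable[Dict]) -> str:
    """Serialize documents into the action-line / source-line bulk layout."""
    lines = []
    for doc in docs:
        # Action line: index this document under the given ID (assumed here
        # to be the account number).
        lines.append(json.dumps({"index": {"_id": str(doc["account_number"])}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must be terminated by a final newline.
    return "\n".join(lines) + "\n"


print(to_bulk_body([{"account_number": 0, "balance": 16623}]), end="")
```

The accounts.json file already comes in this format, which is why it can be passed to curl as is.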

To check that the index named bank was created and all the documents were indexed, run the following command:

curl "IP_Address:9200/_cat/indices?v"

The output should list the bank index along with its health, status, and document count.

Searching Documents with Python

Now we will see how easily we can search and retrieve documents stored in our cluster with just a few lines of code. First, we connect to the Elasticsearch cluster running on our virtual machine using the code block below:

from elasticsearch import Elasticsearch

# Replace IP_Address with the IP address of the virtual machine
es = Elasticsearch([{'host': 'IP_Address', 'port': 9200}], timeout=100)

Now we build the query that will fetch results from the cluster. Suppose we want to search for accounts in the state with code "LA". The code for building this query in Python looks like this:

body = {
    "query": {
        "match_phrase": {
            "state": "LA"
        }
    }
}
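Since the query is just a Python dictionary, it can be wrapped in a small function so that the field and value are not hard-coded. This is a sketch rather than anything provided by the client library: match_phrase_body is a name we made up, and the optional size entry sets how many hits Elasticsearch returns.

```python
from typing import Any, Dict, Optional


def match_phrase_body(field: str, value: Any, size: Optional[int] = None) -> Dict[str, Any]:
    """Build a match_phrase search body for one field (hypothetical helper)."""
    body: Dict[str, Any] = {"query": {"match_phrase": {field: value}}}
    if size is not None:
        # Only include "size" when the caller asks for a specific page size.
        body["size"] = size
    return body


# Same query as the hand-written dictionary above.
body = match_phrase_body("state", "LA")
print(body)
```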

Next we search for the documents using the following line of code, passing in the query we created above

res = es.search(index="bank", body=body)

Running this query and printing res shows the raw response, a dictionary that includes the total number of matches and the matching documents.

The output shows that a total of 17 documents matched the query. Elasticsearch automatically assigns a relevance score to each document and sorts the results by that score.

You can access the list of retrieved documents by using the following code:

res["hits"]["hits"]
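Each entry in that list wraps the stored document in a _source field next to metadata such as _id and _score. The sketch below pulls out the total match count and the plain documents. The sample_res dictionary is a hand-written stand-in with made-up values, shaped like a 7.x response, so the example runs without a cluster; summarize_hits is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def summarize_hits(res: Dict) -> Tuple[int, List[Dict]]:
    """Return the total match count and the stored documents from a response."""
    total = res["hits"]["total"]["value"]
    docs = [hit["_source"] for hit in res["hits"]["hits"]]
    return total, docs


# Hand-written stand-in for a real response; all values are made up.
sample_res = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "a", "_score": 1.5, "_source": {"firstname": "Ann", "state": "LA"}},
            {"_id": "b", "_score": 1.5, "_source": {"firstname": "Bob", "state": "LA"}},
        ],
    }
}

total, docs = summarize_hits(sample_res)
print(total, [doc["firstname"] for doc in docs])
```

Note that a search returns only the top 10 hits by default, so the list can be shorter than the reported total unless a larger size is requested.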

There are many different ways to construct a query, but we will not go into much detail here, as the purpose of this blog is only to give an introduction to Elasticsearch and how it can be used in real-world scenarios. Hopefully, by following along, readers have gained an intuitive understanding of how Elasticsearch works and will explore further how to integrate it into their own projects.

Please feel free to share your opinions or ask any questions you might have in the comment section below.

References

https://itsfoss.com/install-linux-in-virtualbox

Index some documents | Elasticsearch Reference [7.8] | Elastic (www.elastic.co)

Python Elasticsearch Client - Elasticsearch 8.0.0 documentation (elasticsearch-py.readthedocs.io)