Simple Metadata Indexing through Grid for Apps
It’s time for the second article in our series about the event-driven framework that is part of our SDS object storage solution. If you missed the first article, you can read A technical introduction to Grid for Apps. Note that we will also be hosting a webinar about Grid for Apps on May 24 (sign up here to learn more about our data processing framework).
The first article discussed enriching object metadata by adding a new metadata tag to files on the fly. We can think of many useful types of metadata that we can add to an object. For example, it could be a license plate extracted from a speed control camera picture, a document type, the resolution and bit rate of a video, or a pattern found in a picture. We will tackle all these examples in the coming weeks, but let’s answer a question first. Since an SDS object storage solution like ours can store petabytes of data and billions of files, how can we make use of it? How can we find the specific object we are looking for?
This is actually easy. We can use our event-driven solution to index all the metadata attached to our files in Elasticsearch. We will use the same event we used in the previous article (triggered when a new file is uploaded) to push the file's metadata to Elasticsearch. You will then be able to query Elasticsearch for all the files matching any type of metadata.
Let's do it!
As in our previous article, we will use our Docker container image to easily spawn an OpenIO SDS environment. We will also use the Elasticsearch Docker image to deploy Elasticsearch.
Retrieve the latest Elasticsearch Docker image (5.4.0 as of this writing):
And start an Elasticsearch instance:
Retrieve the OpenIO SDS Docker image:
Start your new OpenIO SDS environment:
You should now be at the prompt with an OpenIO SDS instance up and running.
Next, we will configure the trigger, so that each time you add a new object, the metadata from the object will be pushed to Elasticsearch. Add the following content to the file /etc/oio/sds/OPENIO/oio-event-agent-0/oio-event-handlers.conf:
If you want to learn more about this configuration file, please refer to our previous blog post.
Then, restart the OpenIO event agent to apply the change:
Your event-driven system is now up and running. The next step is to write the script that will index the metadata into Elasticsearch. To do so, we first need to install the Elasticsearch Python module:
And we can now write the script. Let’s call it index-metadata.py:
You will have to change the IP address of the Elasticsearch instance in the script. In my case, the IP address of my machine was 192.168.99.1. Adjust it to match your environment.
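To make the indexing step more concrete, here is a minimal sketch of what such a script has to do: turn an event into a document and send it to Elasticsearch. It is not the original index-metadata.py. It talks to the Elasticsearch REST API with the standard library instead of the Elasticsearch Python module so that it runs without extra packages, and the event layout, the index name openio, and the type name object are assumptions made for illustration. Only the document fields name and properties come from the query used later in this article.

```python
import json
import urllib.request

# Address used in this article; change it according to your environment.
ES_URL = "http://192.168.99.1:9200"


def build_document(event):
    """Turn one event into the document we want to index.

    The event layout read here is hypothetical, not the exact OpenIO format.
    """
    url = event["url"]
    return {
        "account": url["account"],
        "container": url["container"],
        "name": url["path"],
        # Stored as a sub-object so it can be searched as properties.<key>
        "properties": dict(event.get("properties", {})),
    }


def index_document(doc, es_url=ES_URL, index="openio", doc_type="object"):
    """POST the document to Elasticsearch and return its JSON answer.

    The index and type names are assumptions. The /index/type path matches
    Elasticsearch 5.x; recent versions expect /index/_doc instead.
    """
    request = urllib.request.Request(
        "{}/{}/{}".format(es_url, index, doc_type),
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling index_document(build_document(event)) for each incoming event would send one document per uploaded object. Keeping the document construction in its own function makes it easy to check without a running Elasticsearch instance.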
Finally, launch the script in the background:
Please note that the script is written in Python, but you can write it in any other language.
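Because the script runs in the background, it has to keep consuming events and should not stop on the first malformed one. Here is a minimal sketch of such a loop, assuming events arrive as JSON strings from some iterable source. How events actually reach the script depends on the event agent configuration and is not shown here; run_worker and handle are hypothetical names.

```python
import json
import logging


def run_worker(raw_events, handle):
    """Decode each raw JSON event and pass it to handle().

    raw_events is any iterable of JSON strings (hypothetical source).
    Returns the number of events handled successfully.
    """
    handled = 0
    for raw in raw_events:
        try:
            event = json.loads(raw)
            handle(event)
            handled += 1
        except (ValueError, KeyError) as exc:
            # Malformed JSON or a missing field: log it and keep going
            logging.warning("skipping event: %s", exc)
    return handled
```

In practice, handle would be the function that builds the document and sends it to Elasticsearch.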
How does it work?
It’s time to add a new object to see if this works.
Using the OpenIO CLI, let’s upload the new object /etc/fstab to the container mycontainer in the account myaccount. We will also add the metadata type=configfile that will help to search for it in Elasticsearch.
Well done! You’ve just uploaded the file fstab, while, in the background, its metadata was indexed in Elasticsearch.
Now, we’ll query Elasticsearch, asking it to find all the objects that match the property configfile:
The query searches the fields name and properties.* ("fields": [ "name", "properties.*"]) for the value configfile ("query": "configfile"), and we obtain the following result:
All right, our newly uploaded file was detected by Elasticsearch as matching the request "query": "configfile".
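The same search can be run from Python. This is a sketch under stated assumptions: the "fields" and "query" keys come from the request above, while wrapping them in a query_string query, the index name openio, and the helper names are assumptions made for illustration. It uses the standard library to call the Elasticsearch _search endpoint.

```python
import json
import urllib.request


def build_search_body(term, fields=("name", "properties.*")):
    """Build a request body that looks for term in the given fields.

    Using a query_string query is an assumption; the article only shows
    the "fields" and "query" keys.
    """
    return {"query": {"query_string": {"fields": list(fields), "query": term}}}


def search(term, es_url="http://192.168.99.1:9200", index="openio"):
    """Run the search and return the stored documents that matched.

    The address is the one used in this article; the index name is hypothetical.
    """
    request = urllib.request.Request(
        "{}/{}/_search".format(es_url, index),
        data=json.dumps(build_search_body(term)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Each hit carries the indexed document under "_source"
    return [hit["_source"] for hit in result["hits"]["hits"]]
```

With the object uploaded earlier, search("configfile") should return the document describing fstab, provided it was indexed into the index queried here.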
Join us on May 24
As I mentioned above, this series of articles will demonstrate our Grid for Apps technology with some interesting use cases (image recognition and manipulation, pattern recognition, and more). So stay tuned!
We are also planning a webinar for May 24, and we’ll give you a glimpse of what you can expect from Grid for Apps in the near future. This will be the chance for you to ask all your questions about how this technology works and how you can implement it in your environment.
Want to know more about OpenIO SDS?
What are you waiting for?! Sign up and join us on May 24 for the "Run Applications Directly on the Storage Infrastructure" webinar!