09/04/2018

Data Augmentation explained, and its benefits at the edge

I've been writing about how to exploit the potential of object storage and serverless computing at the edge for a while now (here and here are a couple of examples). It's a really exciting topic and the use cases are amazing!

If you look at modern compute infrastructure and computing models, you can see how object storage, serverless, and even scale-out architectures can easily solve most of the challenges posed by edge computing and its integration with the cloud. Even more so, the ability to deploy the exact same technology at both ends makes everything less complex and more manageable at scale.

What is Data Augmentation?

Back to the topic of this post, data augmentation is a key aspect of modern analytics and cutting-edge applications. You might have found me talking about metadata enrichment in the past, which is a form of data augmentation specific to object stores because of their rich metadata capabilities. More generally, data augmentation is a way of adding value to data by attaching additional information from internal or external sources (for example, by analyzing the content and including the results in the object itself). It can be applied to any form of data where additional information helps provide more in-depth insight and increase its value.
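To make the idea concrete, here is a minimal sketch of content-based metadata enrichment. Everything in it is illustrative: the keyword extraction is a toy stand-in for real content analysis, and the metadata dictionary mimics the key-value metadata that object stores attach to each object.

```python
# Toy sketch of data augmentation via metadata enrichment.
# Real object stores expose similar per-object key-value metadata;
# the analysis step here is deliberately simplistic.

def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Toy content analysis: pick the most frequent long words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if len(w) > 4]
    freq: dict[str, int] = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return sorted(freq, key=freq.get, reverse=True)[:top_n]

def augment(object_body: str) -> dict:
    """Return the object plus enrichment metadata derived from its content."""
    return {
        "body": object_body,
        "metadata": {
            "keywords": extract_keywords(object_body),
            "length": len(object_body),
        },
    }

obj = augment("Surveillance footage showing vehicles crossing the square at night")
print(obj["metadata"]["keywords"])
```

The point is not the analysis itself but where the results end up: in metadata that travels with the object, so any later search or indexing step can use it without re-reading the content.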

Data is a key asset for any organization today, and increasing its value is becoming a form of investment for the future. Augmented data is easier to find and hence easier to reuse. In this sense, data is like money: you can stash cash under your mattress and watch it depreciate, becoming less and less usable over time, or you can choose to create more wealth by investing it consciously.

Automating Data Augmentation

The idea of data augmentation is not new, but in the past it was less relevant for businesses and more complicated to implement. Things are changing quickly: we have moved from GBs to TBs and PBs, and most of the data we create today is unstructured. Without the necessary context, a system to query this type of data, or at least a smart indexing system, all the information we create and store has a limited lifespan and its value decreases rapidly over time.

Examples are everywhere. Think about your home photos: today you store all your digital assets on a cloud service (Google Photos, iCloud Photos, etc.), which scans everything and groups photos by event, face, place, and so on. You can't remember everything, and you don't want to waste days looking for a nice picture (or video) you remember taking when you want to reuse it, right? It is no longer like the era of film rolls and hard copies, when every picture had a cost and you were careful before shooting; now photos are "free" and you take many more than you actually need (or at least I do). Today you just search for what you need and you can access, enjoy, re-print, share… or simply reuse it. And the same goes for enterprise data.

To be effective, data augmentation should happen during the ingestion process, exactly while you are storing the data. A process should run every time new data is added to the system; it doesn't matter whether it happens in real time or is queued to run asynchronously, as long as it's taken care of. This is why I think every modern storage system capable of managing files or objects should have embedded serverless capabilities. Each new event generated at the storage layer could be intercepted, and a function (a small piece of code) could analyze the content or query external sources to augment and enrich the data. And all of this would happen seamlessly and transparently to end users and applications.
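The event-driven pattern described above can be sketched in a few lines. The event shape and the enrichment hook here are assumptions, not a real API; actual triggers (AWS Lambda on S3 events, for instance, or custom storage middleware) differ in detail but follow the same shape: a new-object event fires, a small function inspects the content, and the result is written back as metadata.

```python
# Hypothetical sketch of serverless augmentation at the storage layer.
# The event dict and the in-memory "store" are simulations; a real
# deployment would receive events from the object store itself.

import json

def analyze_content(data: bytes) -> dict:
    """Placeholder content analysis; a real function might run OCR,
    image classification, or an external metadata lookup here."""
    return {"size_bytes": len(data), "starts_with_letter": data[:1].isalpha()}

def on_object_created(event: dict, store: dict) -> dict:
    """Triggered for each new object; returns the enrichment that would
    be written back to the object's metadata fields."""
    key = event["object_key"]
    enrichment = analyze_content(store[key])
    return {"object_key": key, "x-meta-enrichment": json.dumps(enrichment)}

# Simulated ingestion: storing an object fires the event handler.
store = {"photos/2018/square.jpg": b"fake image bytes"}
meta = on_object_created({"object_key": "photos/2018/square.jpg"}, store)
print(meta["x-meta-enrichment"])
```

Because the function runs per event, it works the same whether it is invoked synchronously at ingest time or drained from a queue later, which is exactly the real-time-versus-asynchronous flexibility mentioned above.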

Data Augmentation at the Edge

If it works at the core, in the cloud, data augmentation is even more effective at the edge! 

When data is generated at the edge, there is no reason to move it to the cloud before augmenting it: doing so is expensive and anything but efficient. Furthermore, knowing the value of data before moving it to the cloud enables decisions about whether it makes sense to use the bandwidth, or even whether it's worth spending cloud resources on it at all.

Again, storage+serverless is the key at the edge too. When new data is created it can be immediately analyzed, and relevant information can be added on the fly. Think about a smart surveillance camera, for example. The video stream could be analyzed while stored locally, and the system could decide what to do with it depending on the content. If your camera is placed in a city square, you might want to record everything but send to the cloud only the footage containing people and vehicles, optimizing bandwidth and cloud storage while sending out content augmented with the information needed to make it searchable.
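The camera scenario boils down to a small filtering policy at the edge. In this sketch, the detector is a stand-in for a real model running on the camera, and the label set and decision rule are assumptions; the point is simply that everything is archived locally while only relevant, already-labeled frames are queued for the cloud.

```python
# Illustrative edge-side filter for the surveillance-camera scenario.
# detect_objects fakes a detector so the example is self-contained.

INTERESTING = {"person", "vehicle"}

def detect_objects(frame_id: int) -> list[str]:
    """Stand-in detector: pretend even-numbered frames contain a vehicle."""
    return ["vehicle"] if frame_id % 2 == 0 else ["tree"]

def should_upload(labels: list[str]) -> bool:
    """Send to the cloud only frames whose content is worth the bandwidth."""
    return bool(INTERESTING & set(labels))

# Record everything locally; upload only the relevant, augmented frames.
local_archive, cloud_queue = [], []
for frame_id in range(6):
    labels = detect_objects(frame_id)
    local_archive.append(frame_id)
    if should_upload(labels):
        cloud_queue.append({"frame": frame_id, "labels": labels})

print(len(local_archive), len(cloud_queue))
```

Note that the frames sent upstream already carry their labels, so the cloud side receives data that is searchable on arrival instead of raw footage to be re-analyzed.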



Storage+serverless is a strong enabler for data augmentation. By offloading this process to the infrastructure, whether in the cloud or at the edge, data increases in value; even if we don't need it today, chances are it will be easier to find when we need it tomorrow.

Data augmentation simplifies and improves big data analytics, and it can support many cutting-edge applications such as machine learning. It also makes cloud-edge integration more efficient, especially if the same technology can be deployed at both ends.

Looking beyond analytics, data augmentation can benefit many other use cases. Take the media & entertainment industry, for example: video ingested into the storage system can be analyzed, indexed, and resampled, and information about copyright and content can be added to metadata fields, making the asset self-descriptive!


Learn a new concept of smarter storage