Take a minute to increase awareness and understanding of the terminology, context, and application of data and analytics.
Part five of a series.
Another term born out of the Big Data movement, Hadoop burst onto the scene as the need to process more and more data increased exponentially.
It’s defined as “an open-source software framework used for distributed storage and processing of datasets of big data using the MapReduce programming model. It consists of computer clusters built from commodity hardware.”
To put it simply, Hadoop enables large-scale, data-intensive deployments.
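To make the MapReduce idea concrete, here is a minimal sketch in plain Python (not Hadoop’s actual Java API): a map phase emits key/value pairs for each input record, and a reduce phase groups them by key and aggregates. The classic example is counting words.

```python
# A toy sketch of the MapReduce model in plain Python.
# Hadoop runs these phases in parallel across a cluster; here we
# run them in one process just to show the shape of the computation.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(map_phase(lines))
print(counts)  # "big" appears 3 times, "data" twice, the rest once
```

In a real Hadoop job the map and reduce functions run on many machines at once, with the framework handling the shuffle between them; the logic per record is this simple.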
As a result of its success, various vendors have deployed their own Hadoop distributions, and developers have contributed even more solutions that take advantage of Hadoop’s distributed processing, forming what is now touted as the Hadoop Ecosystem.
At its core, Hadoop and related technology form a distributed file or object storage system. The advantage of Hadoop is that data can be landed and stored raw in its natural form, without any alterations.
This employs the concept of ‘schema-on-read’, where the schema is applied as the data is pulled out of a central location, rather than as it goes in. This lends itself to increased flexibility and adaptability if the schema requirements evolve over time.
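A small illustration of schema-on-read, using hypothetical field names: raw records are landed exactly as they arrived (even with inconsistent fields), and a schema is imposed only at query time, so it can change without rewriting the stored data.

```python
# Toy schema-on-read example. The records and field names are
# made up for illustration; the point is that the raw data is
# stored untouched and the schema is applied on the way out.
import json

# Raw events landed as-is -- note the inconsistent fields and types.
raw_records = [
    '{"user": "alice", "amount": "19.99", "ts": "2021-01-05"}',
    '{"user": "bob", "amount": 5, "extra_field": true}',
]

def read_with_schema(records):
    # The schema (field names and types) is imposed at read time,
    # so it can evolve later without touching the landed files.
    for raw in records:
        rec = json.loads(raw)
        yield {
            "user": str(rec.get("user", "")),
            "amount": float(rec.get("amount", 0)),
        }

for row in read_with_schema(raw_records):
    print(row)
```

Contrast this with schema-on-write, where every record would have to match the target table’s schema before it could be stored at all.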
To use Hadoop technology most effectively and efficiently, it helps to compare Hadoop to the way you organize information on your own computer. Depending on how you want to access that information, you may organize your folders by year, month, subject area, user, or other means. The same applies to data landing in Hadoop.
To get the most out of your data, it is important to think about what questions you will ask of it, so it can be organized in a way that makes it easily accessible and understood.
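As a sketch of what that organization can look like, the snippet below builds date-partitioned landing paths from the questions you expect to ask (for example, “all orders for a given month”). The `/data/raw` prefix and the `year=`/`month=` convention are assumptions for illustration, not Hadoop defaults.

```python
# Hypothetical sketch of a partitioned landing layout: paths are
# derived from subject area and date so that a common question
# ("last month's orders") maps to a single directory.
from datetime import date

def landing_path(subject, day):
    # Partition by subject area, then year and month.
    # The path convention here is an assumption, not a standard.
    return f"/data/raw/{subject}/year={day.year}/month={day.month:02d}"

print(landing_path("orders", date(2021, 1, 5)))
# -> /data/raw/orders/year=2021/month=01
```

Picking the partition keys up front, based on how the data will be queried, is what keeps those queries from scanning the entire store.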
Hadoop was a significant catalyst of the big data revolution, and constant innovation will continue as more and more solutions are layered on top of the open source framework.