What is Big Data?
Big data refers to datasets, both structured and unstructured, that are so large or complex that they are difficult to process using traditional databases or data processing applications. The concept became mainstream in 2001, when analyst Doug Laney described data growth as three-dimensional (volume, velocity, and variety); those dimensions became known as the “Three V’s” model still used today to describe big data.
Three V’s Model
- Volume: The scale or quantity of data generated and stored. IBM estimates that 2.3 trillion gigabytes of data are generated every day, and many US companies store at least 100 terabytes of data.
- Velocity: The speed at which data streams in and the processing speed needed to keep up with demand. In a single trading session, the New York Stock Exchange captures 1 terabyte of data.
- Variety: The many different formats of data captured by different systems, ranging from structured records to social media posts, images, and video from across the internet.
Why is big data important?
The International Data Corporation (IDC) estimates that from 2005 to 2020, the amount of data created will grow 300-fold, reaching 40 trillion gigabytes (40 zettabytes). Advanced analytics can help you extract business value from big data through business intelligence tools, whether the goal is making smarter business decisions or optimizing product development.
How is big data stored?
There are two main approaches to storing big data: NoSQL databases and Hadoop.
NoSQL databases allow for non-relational storage of the disparate data types that make up big data. They differ from relational databases in that they are scalable, agile, and fault tolerant. NoSQL databases also vary in how they store data: common models include key-value, document, column, and graph stores. NoSQL databases often require ETL into a relational data warehouse before analytics tools, such as business intelligence software, can be used.
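To make the data-model differences concrete, here is a minimal Python sketch (with made-up customer records) contrasting a key-value store, where the value is an opaque blob fetched by key, with a document store, where the database understands the record's structure and can query individual fields:

```python
import json

# Key-value model: the store only knows keys; the value is an opaque
# string the application must parse itself. (Example data is hypothetical.)
kv_store = {
    "customer:1001": json.dumps({"name": "Ada", "city": "London"}),
}

# Document model: the store keeps structured documents, so nested
# fields can be filtered on directly.
doc_store = [
    {"_id": 1001, "name": "Ada", "city": "London",
     "orders": [{"sku": "A-17", "qty": 2}]},
]

# Key-value access: look up by key, then decode the value.
record = json.loads(kv_store["customer:1001"])

# Document access: query on a field inside the document.
londoners = [d for d in doc_store if d["city"] == "London"]

print(record["name"])   # → Ada
print(len(londoners))   # → 1
```

Real systems (e.g., Redis for key-value, MongoDB for documents) add indexing, replication, and query languages on top, but the underlying data-model distinction is the same.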
Apache’s Hadoop is an open source framework written in Java for distributed storage and processing of large datasets (also known as big data) spread across commodity hardware such as inexpensive servers. It can handle large volumes of structured, semi-structured, and unstructured data, including images, audio, and video files. Hadoop is not a database, so it does not use a query language like the SQL of relational databases. Instead, it uses MapReduce, a programming model that lets users retrieve and process data in parallel. Hadoop is distributed by vendors such as Cloudera, which often layer a SQL-like query language on top for ease of use.
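As a rough illustration of the MapReduce model (a Python sketch, not Hadoop’s actual Java API), here is the classic word-count example: the map step emits a (word, 1) pair for each word, a shuffle step groups pairs by key, and the reduce step sums each group. Hadoop runs these same steps distributed across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (Hadoop does this across nodes)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two toy "documents" standing in for files split across a cluster.
docs = ["big data is big", "data moves fast"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # → 2
```

Because the map and reduce functions are independent per record and per key, Hadoop can run them on different machines and combine the results, which is what makes the model scale to very large datasets.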