How To Work with Big Data

Contents

  1. Setting the scene

  2. What is Hadoop?


1. Setting the Scene

  1. Big data – the 4Vs
  2. What’s wrong with relational databases?
  3. How about massively parallel processing databases?
  4. Big data technologies
  5. Data is the lifeblood of any organization

The challenge…

  • The volume of data is growing exponentially
  • Magnitudes larger than a few years ago
  • Big data has become one of the most exciting technology trends in recent years
  • How to get business value out of massive datasets
  • This is the problem that big data technologies aim to solve

The Big Data 4Vs – Volume

One definition of big data relates to the volume of data.

  • Any dataset whose volume exceeds 10 petabytes
  • If stored in an RDBMS, it would have billions of rows

Some startling numbers from IBM research…

  • 100 terabytes (1 TB = 10¹² bytes) of data held by most companies in the US
  • 5 exabytes (1 EB = 10¹⁸ bytes) of data are created every day worldwide
  • 40 zettabytes (1 ZB = 10²¹ bytes) of data will be created by 2020 (estimated)

The Big Data 4Vs – Variety

Data comes in a wide variety of formats…

Structured data

  • Highly organized, fits into well-known enterprise data models
  • Typically stored in relational databases or spreadsheets
  • Can be queried using standard tools, e.g. SQL

Semi-structured data

  • Log files, CSV files, etc.
  • There is some degree of order, but not necessarily predictable

Unstructured data

  • E.g. ad-hoc messaging notes, tweets, video clips
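
To make the distinction concrete, here is a small, purely illustrative Java sketch (the SQL query, log line, and tweet are all invented): structured data is queried against a fixed schema, semi-structured data needs a little parsing, and unstructured data has no schema at all.

  public class DataVarietyExample {
      public static void main(String[] args) {
          // Structured: fixed schema in an RDBMS table, queried with SQL
          String sql = "SELECT customer_id, total FROM orders WHERE total > 100";
          System.out.println("Structured query: " + sql);

          // Semi-structured: a log line has some order, but no rigid schema
          String logLine = "2019-03-14T09:26:53Z INFO order=42 total=129.99";
          String[] fields = logLine.split(" ");            // crude split on whitespace
          System.out.println("Log level: " + fields[1]);   // prints "INFO"

          // Unstructured: free text with no schema at all
          String tweet = "Just watched the race - what a finish!";
          System.out.println("Mentions a race? " + tweet.contains("race"));
      }
  }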

The Big Data 4Vs – Velocity

Velocity refers to the speed at which data enters your system.

  • There are some industry sectors where it’s essential to gather, understand, and react to tremendous amounts of streaming data in real time – e.g. Formula 1, financial trading, etc.

With the IoT growing in importance, data is being generated at astonishing rates.

  • e.g. NYSE captures 1TB of trade info each trading session
  • e.g. there are approx. 20 billion network connections on the planet
  • e.g. an F1 car has approx. 150 sensors and can transmit 2GB of data in one lap! (and approx. 3TB over a race weekend)

The Big Data 4Vs – Veracity

Veracity refers to the verifiable correctness of data. It’s essential that you trust your data; otherwise, how can you make critical business decisions based on it?

Here are some more stats from IBM research:

  • Poor data quality costs the US approx. $3 trillion a year
  • In a survey, 1 in 3 business leaders said they don’t trust the info they use to make business decisions

What’s Wrong with Relational Databases? 

Standard relational databases can’t easily handle big data.

  • RDBMS technology was designed decades ago
  • At that time, very few organizations had terabytes (1 TB = 10¹² bytes) of data
  • Today, organizations can generate terabytes of data every day!

It’s not only the volume of data that causes a problem, but also the rate it’s being generated.

  • We need new technologies that can consume, process, and analyse large volumes of data quickly

How about Massively Parallel Processing DBs?

Key driving factors of big data technology:

  • Scalability
  • High availability
  • Fault tolerance
  • … all these things at a low cost

Several proprietary commercial products emerged over the decades to address these requirements.

  • Massively parallel processing (MPP) DBs, e.g. Teradata, Vertica
  • However, proprietary MPP products are expensive – not a general solution for everyone

Big Data Technologies

Big data technologies aim to address the issues we’ve just described.

  • Some of the most active open-source projects today relate to big data
  • Large corporates are making significant investments in big data technology

2. What is Hadoop?

  • Hadoop was one of the first open-source big data technologies
  • Scalable, fault-tolerant system for processing large datasets…
  • Across a cluster of commodity servers

Hadoop provides high availability and fault tolerance.

  • You don’t need to buy expensive hardware
  • Hadoop is well suited for batch processing and ETL (extract transform load) of large-scale data
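
To give a flavour of what such a batch job looks like, here is a minimal sketch of the classic word-count example, written against the standard Hadoop MapReduce Java API (the input and output paths come from the command line and are purely illustrative). The developer writes only the map and reduce logic below; Hadoop splits the input, runs the code across the cluster, and retries any tasks that fail.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);        // emit (word, 1) for each token
              }
          }
      }

      public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();                // add up the counts for this word
              }
              context.write(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

In practice the job is packaged as a jar, the input files already sit in HDFS, and the jar is submitted to the cluster with the hadoop jar command.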

Many organizations have replaced expensive commercial products with Hadoop.

  • Cost benefits – Hadoop is open source, runs on commodity h/w
  • Easily scalable – just add some more (relatively cheap) servers

Hadoop Design Goals

Hadoop uses a cluster of commodity servers for storing and processing large amounts of data.

  • Cheaper than using high-end powerful servers
  • Hadoop uses a scale-out architecture (rather than scale-up)
  • Hadoop is designed to work best with a relatively small number of huge files
  • Average file size in Hadoop is > 500MB
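
One reason for the “huge files” preference is the way HDFS stores data: each file is split into large blocks (128MB by default in current Hadoop releases), and each block is a separate unit of storage and processing, so a small number of very large files keeps the per-block bookkeeping low. As a minimal sketch using the standard FileSystem Java API (the path is made up), the following simply prints the block size the cluster would use:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockSizeCheck {
      public static void main(String[] args) throws Exception {
          // Connect to the cluster described by the local Hadoop configuration files
          FileSystem fs = FileSystem.get(new Configuration());
          // Block size (in bytes) that HDFS would use for new files under /data
          long blockSizeBytes = fs.getDefaultBlockSize(new Path("/data"));
          System.out.println("Default block size: " + (blockSizeBytes / (1024 * 1024)) + " MB");
      }
  }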

Hadoop implements fault tolerance through software.

  • Cheaper than implementing fault tolerance through hardware
  • Hadoop doesn’t rely on fault-tolerant servers
  • Hadoop assumes servers fail, and transparently handles failures

Developers don’t need to worry about handling hardware failures.

  • You can leave Hadoop to handle these messy details!
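
As a concrete illustration, here is a minimal sketch of a client writing a file into HDFS using the standard org.apache.hadoop.fs.FileSystem Java API (the path and contents are made up). Nothing in the code deals with replication or failed servers – HDFS replicates each block across the cluster and re-replicates it if a node dies, without the application getting involved.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
          // Connect to the cluster described by core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(new Configuration());
          // Write a file; block placement, replication and failure recovery
          // all happen behind this call, inside Hadoop itself
          try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
              out.writeUTF("Hadoop handles replication and node failures for us");
          }
      }
  }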

Moving code from one computer to another is much faster and more efficient than moving large datasets.

  • e.g. imagine you have a cluster of 50 computers with 1TB of data on each computer – what are the options for processing this data?

Option 1

Move the data to a very powerful server that can process 50TB of data.

  • Moving 50TB of data will take a long time, even on a fast network
  • Also, you’ll need expensive hardware to process data with this approach

Option 2

Move the code that processes the data to each computer in the 50-node cluster.

  • It’s a lot faster and more efficient than Option 1
  • Also, you don’t need high-end servers, which are expensive
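
A rough back-of-envelope comparison makes the gap clear. The figures below (a 10 Gb/s network and a 10MB application jar) are assumptions chosen purely for illustration:

  public class DataLocalityEstimate {
      public static void main(String[] args) {
          double linkBytesPerSec = 10e9 / 8;     // assumed 10 Gb/s network, i.e. ~1.25 GB/s
          double dataBytes = 50e12;              // Option 1: move 50 TB of data
          double codeBytes = 50 * 10e6;          // Option 2: copy a ~10 MB jar to 50 nodes

          System.out.printf("Option 1: ~%.1f hours just to move the data%n",
                  dataBytes / linkBytesPerSec / 3600);   // roughly 11 hours
          System.out.printf("Option 2: ~%.1f seconds to move the code%n",
                  codeBytes / linkBytesPerSec);          // well under a second
      }
  }

And those hours of copying come before any processing has even started, which is why Hadoop moves the code to the data rather than the other way round.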

 

Hadoop provides a framework that hides the complexities of writing distributed applications.

  • It’s a lot easier to write code that runs on a single computer, rather than writing distributed applications
  • There’s a much bigger pool of application developers who can write non-distributed applications

Want to know more? Try TalkIT’s training course in big data.

Copyright 2019 TalkIT Andy Olsen
