How To Work with Big Data

Contents

  1. Setting the scene

  2. What is Hadoop?


1. Setting the Scene

  1. Big data – the 4Vs
  2. What’s wrong with relational databases?
  3. How about massively parallel processing databases?
  4. Big data technologies
  5. Data is the lifeblood of any organization

The challenge…

  • The volume of data is growing exponentially
  • Magnitudes larger than a few years ago
  • Big data has become one of the most exciting technology trends in recent years
  • How to get business value out of massive datasets
  • This is the problem that big data technologies aim to solve

The Big Data 4Vs – Volume

One definition of big data relates to the volume of data.

  • Any dataset whose volume exceeds 10 petabytes
  • If stored in an RDBMS, it would have billions of rows

Some startling numbers from IBM research…

  • 100 terabytes (1 TB = 10¹² bytes) of data held by most companies in the US
  • 5 exabytes (1 EB = 10¹⁸ bytes) of data are created every day worldwide
  • 40 zettabytes (1 ZB = 10²¹ bytes) of data will be created by 2020 (estimated)

The Big Data 4Vs – Variety

Data comes in a wide variety of formats…

Structured data

  • Highly organized, fits into well-known enterprise data models
  • Typically stored in relational databases or spreadsheets
  • Can be queried using standard tools, e.g. SQL

Semi-structured data

  • Log files, CSV files, etc.
  • There is some degree of order, but not necessarily predictable

Unstructured data

  • E.g. ad-hoc messaging notes, tweets, video clips
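
To make the distinction concrete, here is a small, purely illustrative Java sketch (the SQL query, log line, and tweet are all invented): structured data is queried against a fixed schema, semi-structured data needs a little parsing, and unstructured data has no schema at all.

  public class DataVarietyExample {
      public static void main(String[] args) {
          // Structured: fixed schema in an RDBMS table, queried with SQL
          String sql = "SELECT customer_id, total FROM orders WHERE total > 100";
          System.out.println("Structured query: " + sql);

          // Semi-structured: a log line has some order, but no rigid schema
          String logLine = "2019-03-14T09:26:53Z INFO order=42 total=129.99";
          String[] fields = logLine.split(" ");            // crude split on whitespace
          System.out.println("Log level: " + fields[1]);   // prints "INFO"

          // Unstructured: free text with no schema at all
          String tweet = "Just watched the race - what a finish!";
          System.out.println("Mentions a race? " + tweet.contains("race"));
      }
  }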

The Big Data 4Vs – Velocity

Velocity refers to the speed at which data enters your system.

  • There are some industry sectors where it’s essential to gather, understand, and react to tremendous amounts of streaming data in real time – e.g. Formula 1, financial trading, etc.

With the IoT growing in importance, data is being generated at astonishing rates.

  • e.g. NYSE captures 1TB of trade info each trading session
  • e.g. there are approx. 20 billion network connections on the planet
  • e.g. an F1 car has approx. 150 sensors and can transmit 2GB of data in one lap! (and approx. 3TB over a race weekend)

The Big Data 4Vs – Veracity

Veracity refers to the verifiable correctness of data. It’s essential that you trust your data; otherwise, how can you make critical business decisions based on it?

Here are some more stats from IBM research:

  • Poor data quality costs the US approx. $3 trillion a year
  • In a survey, 1 in 3 business leaders said they don’t trust the info they use to make business decisions

What’s Wrong with Relational Databases? 

Standard relational databases can’t easily handle big data.

  • RDBMS technology was designed decades ago
  • At that time, very few organizations had terabytes (1 TB = 10¹² bytes) of data
  • Today, organizations can generate terabytes of data every day!

It’s not only the volume of data that causes a problem, but also the rate it’s being generated.

  • We need new technologies that can consume, process, and analyse large volumes of data quickly

How about Massively Parallel Processing DBs?

Key driving factors of big data technology:

  • Scalability
  • High availability
  • Fault tolerance
  • … all these things at a low cost

Several proprietary commercial products emerged over the decades to address these requirements.

  • Massively parallel processing (MPP) DBs, e.g. Teradata, Vertica
  • However, proprietary MPP products are expensive – not a general solution for everyone

Big Data Technologies

Big data technologies aim to address the issues we’ve just described.

  • Some of the most active open-source projects today relate to big data
  • Large corporates are making significant investments in big data technology

2. What is Hadoop?

  • Hadoop was one of the first open-source big data technologies
  • Scalable, fault-tolerant system for processing large datasets…
  • Across a cluster of commodity servers

Hadoop provides high availability and fault tolerance.

  • You don’t need to buy expensive hardware
  • Hadoop is well suited for batch processing and ETL (extract transform load) of large-scale data
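
To give a flavour of what such a batch job looks like, here is a minimal sketch of the classic word-count example, written against the standard Hadoop MapReduce Java API (the input and output paths come from the command line and are purely illustrative). The developer writes only the map and reduce logic below; Hadoop splits the input, runs the code across the cluster, and retries any tasks that fail.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);        // emit (word, 1) for each token
              }
          }
      }

      public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();                // add up the counts for this word
              }
              context.write(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

In practice the job is packaged as a jar, the input files already sit in HDFS, and the jar is submitted to the cluster with the hadoop jar command.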

Many organizations have replaced expensive commercial products with Hadoop.

  • Cost benefits – Hadoop is open source, runs on commodity h/w
  • Easily scalable – just add some more (relatively cheap) servers

Hadoop Design Goals

Hadoop uses a cluster of commodity servers for storing and processing large amounts of data.

  • Cheaper than using high-end powerful servers
  • Hadoop uses a scale-out architecture (rather than scale-up)
  • Hadoop is designed to work best with a relatively small number of huge files
  • Average file size in Hadoop is > 500MB
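
One reason for the “huge files” preference is the way HDFS stores data: each file is split into large blocks (128MB by default in current Hadoop releases), and each block is a separate unit of storage and processing, so a small number of very large files keeps the per-block bookkeeping low. As a minimal sketch using the standard FileSystem Java API (the path is made up), the following simply prints the block size the cluster would use:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockSizeCheck {
      public static void main(String[] args) throws Exception {
          // Connect to the cluster described by the local Hadoop configuration files
          FileSystem fs = FileSystem.get(new Configuration());
          // Block size (in bytes) that HDFS would use for new files under /data
          long blockSizeBytes = fs.getDefaultBlockSize(new Path("/data"));
          System.out.println("Default block size: " + (blockSizeBytes / (1024 * 1024)) + " MB");
      }
  }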

Hadoop implements fault tolerance through software.

  • Cheaper than implementing fault tolerance through hardware
  • Hadoop doesn’t rely on fault-tolerant servers
  • Hadoop assumes servers fail, and transparently handles failures

Developers don’t need to worry about handling hardware failures.

  • You can leave Hadoop to handle these messy details!
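
As a concrete illustration, here is a minimal sketch of a client writing a file into HDFS using the standard org.apache.hadoop.fs.FileSystem Java API (the path and contents are made up). Nothing in the code deals with replication or failed servers – HDFS replicates each block across the cluster and re-replicates it if a node dies, without the application getting involved.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
          // Connect to the cluster described by core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(new Configuration());
          // Write a file; block placement, replication and failure recovery
          // all happen behind this call, inside Hadoop itself
          try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
              out.writeUTF("Hadoop handles replication and node failures for us");
          }
      }
  }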

Moving code from one computer to another is much faster and more efficient than moving large datasets.

  • e.g. imagine you have a cluster of 50 computers with 1TB of data on each computer – what are the options for processing this data?

Option 1

Move the data to a very powerful server that can process 50TB of data.

  • Moving 50TB of data will take a long time, even on a fast network
  • Also, you’ll need expensive hardware to process data with this approach

Option 2

Move the code that processes the data to each computer in the 50-node cluster.

  • It’s a lot faster and more efficient than Option 1
  • Also, you don’t need high-end servers, which are expensive
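
A rough back-of-envelope comparison makes the gap clear. The figures below (a 10 Gb/s network and a 10MB application jar) are assumptions chosen purely for illustration:

  public class DataLocalityEstimate {
      public static void main(String[] args) {
          double linkBytesPerSec = 10e9 / 8;     // assumed 10 Gb/s network, i.e. ~1.25 GB/s
          double dataBytes = 50e12;              // Option 1: move 50 TB of data
          double codeBytes = 50 * 10e6;          // Option 2: copy a ~10 MB jar to 50 nodes

          System.out.printf("Option 1: ~%.1f hours just to move the data%n",
                  dataBytes / linkBytesPerSec / 3600);   // roughly 11 hours
          System.out.printf("Option 2: ~%.1f seconds to move the code%n",
                  codeBytes / linkBytesPerSec);          // well under a second
      }
  }

And those hours of copying come before any processing has even started, which is why Hadoop moves the code to the data rather than the other way round.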

 

Hadoop provides a framework that hides the complexities of writing distributed applications.

  • It’s a lot easier to write code that runs on a single computer, rather than writing distributed applications
  • There’s a much bigger pool of application developers who can write non-distributed applications

Want to know more? Try TalkIT’s training course in big data.

Copyright 2019 TalkIT Andy Olsen
