Data Science Essentials with PySpark  

This course introduces the principles and practice of data science using Python libraries. You will learn how to use popular Python data science libraries and how to create distributed big data solutions using PySpark. The essential features of data science are explored through practical examples.

Duration

5 days

Prerequisites

  • Approx. 6 months Python experience

What you’ll learn

  • Object-oriented Python programming
  • Functional Python programming
  • REST services and web sockets
  • Defining and using decorators
  • Asynchronous programming
  • Python data science techniques
  • Python Big Data
  • Getting Started with PySpark
  • PySpark data structures

Course details

Recap Essential Python Features – optional

  • Language Fundamentals
  • Functions
  • Data Structures
  • Defining and Using Packages
  • Additional Techniques

Object-Oriented Programming

  • Essential Concepts
  • Defining and Using a Class
  • Class-Wide Members

Additional Object-Oriented Techniques

  • A Closer Look at Attributes
  • Implementing Special Methods
  • Inheritance

Functional Programming

  • Functional Programming in Python
  • Higher Order Functions
  • Additional Techniques
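
To give a flavour of the higher-order functions topic, here is a minimal sketch using only the standard library. The helper names (`compose`, `double`, `increment`) and the sample values are assumptions made for illustration, not taken from the course material.

```python
from functools import reduce


def compose(*funcs):
    """Return a function that applies funcs to a value, left to right."""
    def composed(value):
        return reduce(lambda acc, f: f(acc), funcs, value)
    return composed


# Hypothetical steps, invented for this illustration.
def double(x):
    return x * 2


def increment(x):
    return x + 1


# A function built from other functions.
double_then_increment = compose(double, increment)

values = [1, 2, 3]
# map and filter both take a function as their first argument.
transformed = list(map(double_then_increment, values))
evens = list(filter(lambda x: x % 2 == 0, values))
```

Passing functions as arguments in this way is the same style that later reappears in RDD transformations.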

Decorators

  • Getting Started with Decorators
  • Additional Decorator Techniques
  • Parameterized Decorators
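
As an illustration of what a parameterized decorator looks like, the following sketch may help. The names `repeat` and `greet` are hypothetical and chosen only for this example.

```python
import functools


def repeat(times):
    """Parameterized decorator: call the wrapped function `times` times."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            # Collect the result of every call in a list.
            return [func(*args, **kwargs) for _ in range(times)]
        return wrapper
    return decorator


@repeat(times=3)
def greet(name):
    """Hypothetical function used only to show the decorator at work."""
    return f"Hello, {name}"


results = greet("Ada")
```

The extra outer function is what distinguishes a parameterized decorator from a plain one: `repeat(times=3)` runs first and returns the actual decorator that is applied to `greet`.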

Asynchronous Processing in Python

  • Getting Started with Asynchrony in Python
  • Creating Tasks to Run in Different Threads
  • Additional Task Techniques
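
One possible way to create tasks that run in different threads is the standard library's `concurrent.futures` module. The sketch below assumes that approach, which the course may or may not use; the `square` function is a hypothetical workload.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """Hypothetical task: return the result and the name of the worker thread."""
    return n * n, threading.current_thread().name


# Submit four tasks to a pool of two worker threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(square, n) for n in range(4)]
    # result() blocks until the corresponding task has finished.
    results = [future.result() for future in futures]

squares = [value for value, _ in results]
thread_names = {name for _, name in results}
```

Results are read back in submission order here, even though the tasks themselves may finish in any order.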

Getting Started with Python Data Science and NumPy

  • Introduction to Python Data Science
  • NumPy Arrays
  • Manipulating Array Elements
  • Manipulating Array Shape

NumPy Techniques

  • NumPy Universal Functions
  • Aggregations
  • Broadcasting
  • Manipulating Arrays using Boolean Logic
  • Additional Techniques

Getting Started with Pandas

  • Introduction to Pandas
  • Creating a Series
  • Using a Series
  • Creating a DataFrame
  • Using a DataFrame

Pandas Techniques

  • Universal Functions
  • Merging and Joining Datasets
  • A Closer Look at Joins

Working with Time Series Data

  • Introduction to Time Series Data
  • Indexing and Plotting Time Series Data
  • Testing Data for Stationarity
  • Making Data Stationary
  • Forecasting Time Series Data
  • Scaling Back the ARIMA Results

Case Study

  • Worked example of a real-world data science problem

Introduction to Big Data

  • Setting the scene
  • Introduction to Hadoop
  • Hadoop components

Getting Started with PySpark

  • Introduction to Spark
  • Spark architecture
  • Application execution
  • Using the Python Spark Shell

Using the PySpark API

  • Essential concepts
  • Creating an RDD
  • Working with RDDs

RDD Operations – Part 1

  • RDD transformations
  • RDD transformations on key-value pairs

RDD Operations – Part 2

  • RDD actions
  • Caching
  • Spark jobs – the big picture

Getting Started with Spark SQL

  • Overview of Spark SQL
  • Getting started with the Spark SQL API
  • Creating DataFrames from a data source
  • Creating DataFrames from an RDD

Spark SQL DataFrame Operations

  • Basic operations
  • Language-integrated query operations
  • RDD operations
  • Output operations

Appendix – Additional Big Data Technologies

  • Data serialization
  • Columnar storage
  • Messaging systems
  • NoSQL
