Applications Open now for January 2025 Batch | Applications Close: January 02, 2025 | Exam: February 23, 2025

Degree Level Course

Introduction to Big Data

This course will introduce students to the practical aspects of analytics at large scale, i.e., big data. It will start with a basic introduction to big data and cloud concepts spanning hardware, systems, and software, and then delve into the details of algorithm design and execution at large scale.

by Rangarajan Vasudevan

Course ID: BSCS4004

Course Credits: 4

Course Type: Elective

Pre-requisites: None

What you’ll learn

Introduction to Cloud Concepts: Cloud-Native architecture, serverless computing, message queues, PaaS, SaaS, IaaS
Introduction to Big Data concepts: divide-and-conquer, parallel algorithms, distributed virtualized storage, distributed resource management, real-time processing (see the short PySpark sketch after this list)
Data Processing Fundamentals: data formats, sources and their semantics, processing patterns for large data (the ETL vs ELT difference), processing and storage options on the cloud, lakehouse architecture
Technology deep-dive on open source as well as Google Cloud
Technologies covered: Spark (PySpark, Spark ML, Spark Streaming), SQL (Spark SQL), Kafka, Google Pub/Sub, Google Dataproc, Google Cloud Functions
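
The following is an illustrative sketch only, not course material: a minimal PySpark word count, the classic divide-and-conquer pattern, in which Spark splits the input across partitions, counts words in parallel, and merges the partial results. The input path is a placeholder.

# Illustrative sketch (not course material): divide-and-conquer aggregation in PySpark.
# Spark partitions the input, counts words in parallel, and merges partial results.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Placeholder path; any text file reachable by the cluster works.
lines = spark.read.text("gs://your-bucket/sample.txt")

word_counts = (
    lines
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

word_counts.show(10)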

Course structure & Assessments

11 weeks of coursework, weekly online assignments, 2 in-person invigilated quizzes, and 1 in-person invigilated end-term exam. For details of the standard course structure and assessments, visit the Academics page.

WEEK 1 Introduction: Big data concepts & GCP platform setup
WEEK 2 Cloud concepts: Cloud-native architecture, serverless computing, message queues, PaaS, SaaS, IaaS
WEEK 3 Types of Data: Data formats, sources & their semantics, processing & storage options on the cloud. Use of serverless to get started (e.g. Google Cloud Functions)
WEEK 4 Intro to Big Data Engineering: Hadoop and PySpark
WEEK 5 ELT: ETL, processing patterns for large data, ETL vs ELT, role of a scheduler
WEEK 6 SQL & NoSQL: For most analysis tasks, SQL is sufficient, and tools like Spark SQL let that familiarity carry over to big data solutions (see the Spark SQL sketch after this week list). Types of NoSQL, their evolution, best-fit options
WEEK 7 Streaming: Overview, fundamental concepts, walkthrough of Google Pub/Sub & Google Dataflow as example technologies
WEEK 8 Streaming: Kafka as another example of message queue technology & Spark Streaming (see the streaming sketch after this week list)
WEEK 9 Big Data ML: Dataproc with ML, including Spark ML (batch processing)
WEEK 10 Deep Learning with big data on the cloud
WEEK 11 Prep week for the final project, summarizing key concepts, with time for Q&A and clarifications
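
To make the Week 6 point concrete, here is an illustrative Spark SQL sketch (not course material): a DataFrame is registered as a temporary view and queried with ordinary SQL. The dataset path, view name, and column names are hypothetical.

# Illustrative sketch (not course material): familiar SQL, executed by Spark at scale.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# Hypothetical dataset; any tabular source (CSV, Parquet, etc.) works the same way.
orders = spark.read.parquet("gs://your-bucket/orders/")
orders.createOrReplaceTempView("orders")

# The same SQL you would write on a single-node database.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()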
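
Likewise for Weeks 7-8, the sketch below (illustrative only, not course material) reads a Kafka topic with Spark Structured Streaming and prints records to the console. The broker address and topic name are hypothetical, and the spark-sql-kafka connector must be available to the job.

# Illustrative sketch (not course material): consuming a Kafka topic with
# Spark Structured Streaming. Broker and topic names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers keys/values as bytes; cast the value to a string for processing.
payloads = events.select(F.col("value").cast("string").alias("payload"))

# Console sink for illustration; in practice the sink would be a table or another topic.
query = (
    payloads.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()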

About the Instructors

Rangarajan Vasudevan
Co-Founder & Chief Data Officer, Lentra.ai

Rangarajan Vasudevan is the Co-Founder & CDO of Lentra.ai, India’s fastest-growing lending cloud. He worked on “big data” and “data science” before they were fashionable, building data-native applications across industries and geographies for more than 15 years.


Ranga joined Lentra by way of the June 2022 acquisition of his company TheDataTeam, creators of the Cadenz.ai customer intelligence platform. Prior to founding TheDataTeam, Ranga served as Director, Big Data with Teradata Corporation’s international business unit. He joined Teradata via the acquisition of Aster Data Systems, where he was a founding engineer and co-invented a company-defining, patented pattern recognition algorithm. While at Teradata, he received both the Distinguished Engineer (R&D) and Consulting Excellence awards.

Ranga has degrees in Computer Science from the University of Michigan and IIT Madras.
