This course introduces students to the practical aspects of large-scale analytics, i.e. big data. It starts with a basic introduction to big data and cloud concepts spanning hardware, systems, and software, and then delves into the details of algorithm design and execution at large scale.
Introduction to Cloud concepts: cloud-native architecture, serverless computing, message queues, PaaS, SaaS, IaaS
Introduction to Big Data concepts: divide-and-conquer, parallel algorithms, distributed virtualized storage, distributed resource management, real-time processing
Data Processing Fundamentals: data formats, sources and their semantics, processing patterns for large data (the difference between ETL and ELT), processing and storage options on the cloud, lakehouse architecture
Technology deep-dive on Open Source as well as Google Cloud
Technologies covered: Spark (PySpark, Spark ML, Spark Streaming), SQL (Spark SQL), Kafka, Google Pub/Sub, Google Dataproc, Google Cloud Functions
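
As a flavour of the divide-and-conquer idea listed under Big Data concepts, the following is a minimal sketch in plain Python, not course material. It splits a list of text lines into chunks, counts words in each chunk independently, and merges the partial counts. The function names (`split_into_chunks`, `count_words`, `merge_counts`, `word_count`) and the default of four chunks are assumptions made for illustration; in a framework such as Spark the chunks would be partitions processed on separate workers.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, num_chunks):
    """Divide: partition the records into roughly equal chunks."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Conquer: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine: merge two partial results into one."""
    return left + right


def word_count(records, num_chunks=4):
    # Hypothetical driver: here the chunks are processed sequentially,
    # but each count_words call has no dependency on the others.
    chunks = split_into_chunks(records, num_chunks)
    partials = map(count_words, chunks)
    return reduce(merge_counts, partials, Counter())
```

Because each chunk is counted without reference to any other, the per-chunk step can be distributed freely; only the final merge needs to see all partial results.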