Programming for Big Data
- Lecture/Discussion: 3
The explosion of social media and the computerization of every aspect of social and economic activity have resulted in the creation of large volumes of mostly unstructured data: web logs, videos, speech recordings, photographs, e-mails, Tweets, and similar data. The key objective of this course is to familiarize students with the core information technologies used in manipulating, storing, and analyzing big data. We look at the basic tools for statistical analysis, R and Python, and at some key methods of machine learning. We review MapReduce techniques for parallel processing; Hadoop, an open source framework for running MapReduce on Internet-scale problems; and HDFS, the Hadoop Distributed File System. We teach Spark, which has emerged as the most important big data processing framework. We touch on tools, such as Hive, that provide SQL-like access to unstructured data. We analyze so-called NoSQL storage solutions, exemplified by Cassandra, for their critical features: speed of reads and writes, and the ability to scale to extreme volumes. We examine memory-resident databases (VoltDB, SciDB) and graph databases (Neo4j). Students gain the ability to initiate and design highly scalable systems that can accept, store, and analyze large volumes of unstructured data in batch mode and/or real time. Most lectures are presented using Java examples; some use Python and R.