COMP9313: Big Data Management Course Introduction
Lecturer in Charge • Lecturer: Yifang Sun • office: used to be K17-208, at home now… • email: yifangs@cse.unsw.edu.au • use [comp9313] in subject • Research interests • Database • High-dimensional data • Machine learning (Natural language processing) • Integration of DB and AI
Course Aims • Introduce the concepts behind Big Data • Introduce the core technologies used in managing large-scale data sets • MapReduce • Spark • … • Introduce technologies for developing solutions to large-scale data analytics problems • nearest neighbor search • machine learning with big data • …
Course Aims - cont. • Not possible to cover every aspect of big data management • We will focus on • concepts • algorithms • principles • We will not focus on • programming languages and APIs • specific platforms • Make use of tutorials and documentation on the Internet
Lectures • Delivered through pre-recorded videos • location: anywhere you like • time: anytime you like • links to videos posted on Piazza every Monday and Wednesday • email the LiC ASAP if you have no access to Piazza • Slides on course website • No Q&A sessions during lectures • Ask on Piazza or in online consultations • Schedule and length of lectures may vary based on the progress of the course • Note: watching every lecture is assumed.
Resources • Books • Hadoop: The Definitive Guide. Tom White. 4th Edition - O’Reilly Media • Learning PySpark. Tomasz Drabas and Denny Lee. Packt Publishing • Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park. • Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 3rd edition - Cambridge University Press • Online resources: • PySpark Tutorial • Spark Python API Docs • Online courses/tutorials on YouTube, Coursera, …
Prerequisites • Official prerequisites • Data Structures and Algorithms • Database Systems • Before commencing this course, you should • have experience with and good knowledge of algorithm design • have a solid background in database systems • have solid programming skills in Python • be familiar with Linux operating systems • have basic knowledge of linear algebra, probability theory and statistics • No previous experience necessary in • MapReduce/Spark • Parallel and distributed programming
Please do not enrol if you… • Don’t have COMP9024/9311 knowledge • Cannot produce a correct Python program on your own • Have poor time management • Are too busy to watch the lecture videos or do the labs • Otherwise, you are likely to perform badly in this subject
Assessment • One written assignment (20%) • Two programming projects (25% each) • Final exam (30%) • There’s no hurdle for any of the above components • All are individual tasks • All are submitted through give
Written Assignment • Exam-style questions • Computational, short answer • no essay, no multiple choice • Based on the lecture content • algorithms, principles, … • to assess your understanding, not memory • Late penalty • firm deadline • zero marks for late submission
Programming projects • Tentative topics • One on MapReduce + nearest neighbor search • One on PySpark + machine learning • Both results and source code will be checked. • Zero marks if your code cannot be run because of bugs. • Late penalty • 10% reduction of raw marks for the 1st day, 30% reduction per day for the following 3 days
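The late penalty above can be worked through as a small calculation. The sketch below is one possible reading only, assuming the reductions are cumulative percentages of the raw mark (10% after one day, 40% after two, 70% after three, 100% from the fourth day on). The function names and the cap at 100% are hypothetical and are not taken from the course specification.

```python
def late_penalty_percent(days_late: int) -> int:
    """Percentage of the raw mark deducted (assumed cumulative reading)."""
    if days_late <= 0:
        return 0
    # 10% for the first day, plus 30% for each following day.
    penalty = 10 + 30 * (days_late - 1)
    # Assumption: the deduction never exceeds the whole raw mark.
    return min(penalty, 100)


def penalised_mark(raw_mark: float, days_late: int) -> float:
    """Raw mark after applying the assumed late penalty."""
    return raw_mark * (100 - late_penalty_percent(days_late)) / 100


if __name__ == "__main__":
    # Show the deduction for a project with a raw mark of 20.
    for days in range(0, 5):
        print(days, late_penalty_percent(days), penalised_mark(20, days))
```

Under this reading, a project submitted four or more days late receives no marks.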
Final exam • Open book exam • Firm deadline • No supplementary exam will be given • Special consideration must be submitted prior to the start of the exam • More details on the way
Academic honesty and plagiarism • Zero tolerance for plagiarism • You will get 0 marks • Examples of misconduct: • Copying other students’ work • Letting other students copy your work • Copying from GitHub • Hiring a ghostwriter • … • I will not accept the following excuses: • “I left the lab with my screen unlocked” • “He stole it from my computer” • “I only gave my code to A. A didn’t use it but gave it to B” • …
Tentative course schedule
Week  Topic                                              Assignment/Project
1     Course Introduction and Introduction to Big Data
2     Hadoop MapReduce
3     Hadoop MapReduce
4     Nearest Neighbor Search                            Project 1
5     Spark                                              Assignment
6     Flexibility Week (no lecture)
7     Spark                                              Project 2
8     Machine Learning with PySpark
9     Data Stream + NoSQL
10    Revision and Exam Preparation
Labs • Labs to help you with programming and projects • nothing to submit, no marks • using IPython notebooks • Contents • 1 lab on setting up the environment • 1 lab on PySpark and MapReduce • 1 lab on NNS with MapReduce • 1 lab on Machine learning with PySpark
Consultations • Online Q&A discussions on Piazza • you are all encouraged to participate • Online consultation with tutor • 1pm – 2pm every Friday • using Zoom • room number and password on Piazza • Private online consultation with me • please book an appointment with me, including a brief description of your questions and [comp9313] in the subject line
General Recommendations • Make use of the LiC and tutors • don’t hesitate to ask questions • Make use of Piazza • read the notices on the course website and Piazza • participate in the discussions on Piazza • Make use of course materials • understand lecture slides • read specifications carefully • try the labs although they are not compulsory • Do not engage in misconduct
Your Feedback is Always Welcome • Please advise where I can improve after each lecture, through Piazza or by email • myExperience system