Mathematics of Big Data

Final Project


Description:

This is by far the largest component of the course. You will discover, explore, and attack a real-world problem of your choosing. There are three types of projects you can work on, listed below in order of increasing difficulty:

  • (1) Application of an existing algorithm to a new problem and potentially new data.
  • (2) Mathematical foundation or algorithm work. You can choose to a) develop a mathematical foundation for an existing machine learning algorithm, b) provide an interpretation of a deep learning algorithm in the spirit of explainable machine learning, or c) extend an existing algorithm or conceive a new one to solve some problem (this inherently includes the first option, because you will need to test the new algorithm on data).
  • (3) Theoretical work. For example, create a new convergence bound for a learning algorithm, or show that in some limit one learning algorithm becomes another.

These options also carry increasing risk. For example, you cannot turn in a paper saying you worked on a convergence bound for months with no results. Option two carries medium risk because creating a new algorithm involves building baselines to improve upon, so some results exist even if the new algorithm falls short. At any time during the course, feel free to discuss your problem with the instructor or TA and ask questions.

A complete past project is available for reference: see its project proposal, midterm report, midterm presentation, final report, and final presentation.

Requirements:

All of the requirements below must be satisfied in order to receive full credit for the project:

  • Partner:
    Maximum of 1 partner (we may allow 2 partners in exceptional cases, e.g., a huge coding project). All partners must contribute equally.
  • Dataset:
    You must use at least one dataset with at least half a million (500,000) data points as a significant part of your project.
  • Format:
    Your report must be submitted as a PDF in NIPS format. Note that this means you must use LaTeX with the NIPS style file. (NIPS, Neural Information Processing Systems, is one of the major machine learning conferences.)
    If you do not know how to use LaTeX, we recommend finding a partner who does.
  • Code Style:
    All code used in the production of your final report should be clean (suggested format) and placed in a public GitHub repository under the account of one of the partners. Place a footnote with the repository URL somewhere in your final PDF. Placing your code under an open-source license such as MIT is recommended but not required.
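
Before committing to a dataset, it may help to confirm that it meets the size requirement above. The following is a minimal sketch, assuming the data is stored as a CSV file with one data point per row; the function names, the header assumption, and the file name in the usage note are hypothetical and not part of the course materials.

```python
import csv

# Threshold taken from the dataset requirement: half a million data points.
MIN_DATA_POINTS = 500_000


def count_data_points(path, has_header=True):
    """Count non-blank rows in a CSV file, streaming so large files fit in memory."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header line if there is one
        # Blank lines are read as empty lists and are not counted.
        return sum(1 for row in reader if row)


def meets_requirement(n_points, minimum=MIN_DATA_POINTS):
    """Return True if the number of data points reaches the minimum."""
    return n_points >= minimum
```

For example, `meets_requirement(count_data_points("my_dataset.csv"))` would report whether a hypothetical file `my_dataset.csv` is large enough. Datasets in other formats (images, graphs, text corpora) need their own notion of a "data point", so check with the instructor or TA if the count is ambiguous.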

Due Dates:


  • Week 4:  Project Proposal
    Typed (LaTeX), one page maximum, explaining your problem, which datasets you are likely to use (you must find some candidates), who your partners are, and which methods (of those you know) you think you might use. Note that this is not 100% final, but it should be within some epsilon of your final project.

  • Week 9 in class:  Midterm Presentation
    10-12 minute presentation (plus 3 minutes for questions) detailing your progress towards your goal. The accompanying midterm write-up should be 6 to 8 pages for a 1-person group, 12 to 15 pages for a 2-person group, and 15 to 20 pages for a 3-person group.

  • Week 13:   Draft of Final Project Submission
    A typed (LaTeX) draft of the final report and all code written so far must be submitted. The draft should detail the progress of the final project, which is expected to be substantial. It is used to demonstrate what you have done so far and to show that you are ready for the final presentation. It does not need to follow the NIPS format (which is required for the final version). The code does not need to be fully clean and organized for this draft submission, but it is expected to be cleaned up for the final submission. You do not need to have the presentation slides ready for this submission.

  • Week 14 in class: Final Project Presentation
    The presentation should be as detailed as possible and should be about 10 to 30 minutes long.

  • Finals week: Final Project Submission
    The final project must be submitted electronically. The submission must include the LaTeX final report in NIPS format, all code written for the project, the dataset, the presentation (.pptx or .pdf), and any other files used. Only one copy of each item needs to be turned in per group. The submission must conform to the requirements above. If the dataset is too large to upload to GitHub, please contact the instructor or TA to arrange submission of the dataset.