Over the semester, each team will pick a business-related competition to research and critique. On the last day of class, the team will give a 15-minute presentation on its research project. The presentation should cover:
1. An overview of the competition: What is the objective? Where does the dataset come from? What are the key features?
2. A brief critique of selected publicly available Notebooks, in Python or R, for this competition. The team should critically evaluate other people's published work using concepts learned from the Machine Learning 1 and Machine Learning 2 coursework.
3. The team's own solution to the competition, built on what the team has researched. Explain why you believe the solution is the best, and what is new or different from the published Notebooks, in a conceptually rigorous way based on the Machine Learning 1 and Machine Learning 2 coursework. You must code your solution in R.
Notes
1. Most of the competitions provide separate train vs. test sets, with the Y labels from the test set hidden from you. In such cases, please just treat the train set as your entire dataset, and split your own train vs. test subsets.
2. Feel free to sample down the dataset if it's too large for you to handle on your laptop or in Colab. You can run R code in Colab to take advantage of its free GPU. To run R in Colab, go to Runtime => Change runtime type, which has an option to change from Python to R.
3. Your work on this Machine Learning project should follow Rules of Machine Learning: Best Practices for ML Engineering by Google.
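Notes 1 and 2 above can be sketched in base R roughly as follows. This is only a minimal illustration under stated assumptions, not a required workflow: the data frame `full`, its columns, the 500-row cap, and the 80/20 ratio are all hypothetical placeholders. In practice, `full` would be read from the competition's train file.

```r
# Minimal sketch: sample down a large dataset, then hold out your own test subset.
set.seed(42)  # fixed seed so the sampling and the split are reproducible

# Hypothetical stand-in for the competition's train set (the only labeled data).
full <- data.frame(
  x1 = rnorm(1000),
  x2 = runif(1000),
  y  = rbinom(1000, size = 1, prob = 0.3)
)

# Optional: sample down if the data is too large for your machine.
max_rows <- 500  # assumed cap; choose what your laptop or Colab can handle
if (nrow(full) > max_rows) {
  full <- full[sample(nrow(full), max_rows), ]
}

# Split the (possibly reduced) dataset into your own train and test subsets.
test_idx  <- sample(nrow(full), size = floor(0.2 * nrow(full)))
test_set  <- full[test_idx, ]
train_set <- full[-test_idx, ]

cat("train rows:", nrow(train_set), " test rows:", nrow(test_set), "\n")
```

Recording the seed, the sampling cap, and the split ratio in your documentation makes it easier for others to reproduce the same subsets.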
The team should create a GitHub page to host its work, including code and/or presentation (include PPT only if needed). The captain will submit the link to the TP2 GitHub page in Blackboard for grading by the time of the scheduled presentation.
Grading rubric (details available in Blackboard)
Problem Statement (10%): Clear, concise, accurate and focused statement of the problem being solved
Critiques (20%): Clear, accurate and thoughtful critiques of the existing body of work that focus on key issues of algorithms, datasets, or code, and/or their reproducibility. Structure your critiques based on this reference: Lones, M. A. (2021). How to avoid machine learning pitfalls: a guide for academic researchers. arXiv preprint arXiv:2108.02497.
Solution (50%): Thoughtful and comprehensive solution that addresses, in a logical manner, both the problem statement and the issues identified in the critiques. Proper data processing and exploration, and clear rationale for algorithm choices. Check the quality of your solution work using this reference: Lones, M. A. (2021). How to avoid machine learning pitfalls: a guide for academic researchers. arXiv preprint arXiv:2108.02497.
Reproducibility (10%): Comprehensive, clear, and focused explanation of the data exploration and processing work, rationale for the algorithm choices, and documentation of your code, based on this Machine Learning Reproducibility Checklist. Be sure to document the rationale for WHY you performed certain tasks or made certain decisions, and not just WHAT you did.
Presentation (10%): Clear, well-rehearsed, and concise presentation with laser focus on the key topics, key issues and key elements of the solution. Free of errors.
Each team should select a unique competition (i.e., no two teams, across the two class sections, can select the same competition). There are a total of 21 teams in Spring 2022. Topic choices are first come, first served. As soon as the team decides on a competition, the team captain should email the instructor to claim it. The instructor will post team topic claims on this webpage so everyone can see what's available and what's taken. The following is a list of potential topics. Teams are welcome to select a competition not on the list, as long as the topic is approved by the instructor.
Finance
The Winton Stock Market Challenge (Kaggle)
Predicting the Risk of Customer Credit Default (Kaggle)
Real Estate
Liberty Mutual Group: Property Inspection Prediction (Kaggle)
Marketing
Springleaf Marketing Response: Determine whether to send a direct mail piece to a customer (Kaggle)
Avito Context Ad Clicks: Predict if context ads will earn a user's click (Kaggle)
Retail/Sales
Travel
Food
Sports
Entertainment
Energy
HR