This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

ubco-mds-2020-labs/data-550-project2-group-4

DATA 550 Mini-Project 2

In lieu of quizzes in DATA 550, we will instead do two mini-projects. Each project will be done in pairs, and both members of the team will receive the same mark unless the work distribution was not roughly even (as determined by looking at the commits). In the first project, you will need to select a programming language, either Python or R, and you will be provided with a dataset. In the second project, you will use the other programming language and will be free to select your own dataset.

Project Instructions

Here are the instructions for your project:

  1. Do this mini-project in the same pairs you have selected for your labs. Take note of your group number (from Canvas) and follow all instructions in this section. (5 points)

  2. Choose a programming language, R or Python. If you select R, you must use ggplot2 as your plotting framework; if you choose Python, you must use Altair. (5 points)

  3. Select a dataset for your project with the following criteria and get it approved by the course TA or instructor:

Permission to use and distribute

  • Look for a Creative Commons license (e.g., a version 4.0 license such as CC BY 4.0) or a Public Domain dedication, and check that you can make the dataset publicly available
  • Do not use datasets that require authentication or API access

Data quality

  • Try to choose datasets that have no more than 5-10% missing values
  • Since we'll be doing linear regression, you should look for datasets that have quantitative measures
  • Ensure there are at least 2500 rows/observations in the dataset
  • Ensure there are at least 5 variables of potential interest in the dataset
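The data quality criteria above can be screened quickly before you ask for approval. The following is a minimal sketch using only the Python standard library, not an official grading script. The function names `check_dataset` and `load_rows`, the set of tokens treated as missing, and the use of 10% as the missing-value cutoff are all assumptions made for illustration. The column count is only a rough proxy for "variables of potential interest", which still needs your own judgment.

```python
import csv
import io

# Assumption: these strings (case-insensitive) count as missing values.
MISSING_TOKENS = {"", "na", "nan", "null", "none"}


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def check_dataset(rows, max_missing=0.10, min_rows=2500, min_columns=5):
    """Screen a dataset against the row, column, and missing-value criteria."""
    n_rows = len(rows)
    columns = list(rows[0].keys()) if rows else []
    total_cells = n_rows * len(columns)
    # Count cells that are absent or hold one of the missing tokens.
    missing = sum(
        1
        for row in rows
        for col in columns
        if (row.get(col) or "").strip().lower() in MISSING_TOKENS
    )
    missing_fraction = missing / total_cells if total_cells else 1.0
    return {
        "n_rows": n_rows,
        "n_columns": len(columns),
        "missing_fraction": missing_fraction,
        "enough_rows": n_rows >= min_rows,
        "enough_columns": len(columns) >= min_columns,
        "missing_ok": missing_fraction <= max_missing,
    }


# Tiny illustrative sample: 2 rows, 5 columns, 1 empty cell.
sample = "a,b,c,d,e\n1,2,3,4,5\n6,,8,9,10\n"
report = check_dataset(load_rows(sample))
print(report)
```

For a real dataset, you would read the file contents and pass them to `load_rows` instead of the sample string. This sketch does not check whether the columns are quantitative, so that still has to be confirmed by inspecting the data.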

Interesting (to you)

  • Make sure you have some basic interest in the subject matter!
  • There's nothing worse than doing a six-week project on the penguins or mtcars dataset (please don't pick those)

There are literally thousands of datasets available. I will point you to some high-quality sources (keep in mind that I have not personally checked every single dataset):

IMPORTANT: I strongly suggest you spend a bit of effort choosing a good dataset and interesting research questions as these mini-projects will likely be the basis of your DATA 551 and 552 project next block!

Requirements for the EDA:

  • Must include all eight steps of the EDA (Describe your dataset, Load the dataset, Explore your dataset, Initial thoughts, Wrangling, Research Questions, Data analysis & visualizations, Summary and conclusions)

  • Must include at least 3 visualizations, and no more than 6

  • Each visualization must adhere to the principles of effective visualizations as discussed in Lectures 4 and 5.

  • Comments on your EDA must be authentic and genuine, ideally in full sentences.

  1. Once you have done an EDA, you should come up with two follow-up research questions. (10 marks)

Each research question should fulfill ONE of these two criteria:

A) The RQ can be answered with this dataset but requires additional data processing or wrangling that is outside the scope of this mini-project; OR

B) The RQ cannot be answered with this dataset and requires another dataset. If you choose this criterion, make sure to describe what this hypothetical dataset would include (provide the column names at minimum).

Note: you do NOT need to answer the research questions! In this task we will only evaluate your ability to create research questions.

  1. Present and record the results of your EDA as a 5-minute video (we really mean it this time about the 5 minutes! That's in total!). (20 marks)

Important: I am not expecting any video editing, fancy equipment, or even a slide presentation. I want to hear the results of the analysis from you in under five minutes (!).

Keep it low-tech: I suggest you and your partner get on a Zoom call, share your screen with the Jupyter notebook, describe the analysis, and record the call. I should hear from both of you in the presentation, but it's up to you whether you show both the R and Python plots or just one notebook.

  1. Edit the contributions.md file in your repository to outline the contributions of each partner in the group. (20 marks)

About

data-550-project2-group-4 created by GitHub Classroom
