Notes, Code Samples, Exercises, and Project Instructions for the OKCoders Data Analytics Boot Camp.
This 8-week introductory bootcamp covers the basics of data analytics using SQL and Python. Over the 8 weeks you’ll learn how to work with a relational SQL database, how to break analytical questions down into small sub-problems, and how to solve each one with the Python programming language. The class will cover the basics of working with data, statistical modeling, and some applications of machine learning. You will complete two analytical projects that can be shown in a code portfolio. The bootcamp is intended for beginners, but experienced developers are welcome as well.
- SQL: The Structured Query Language (SQL) is the primary method for interacting with relational databases. A relational database is a place to house data that fits neatly into rows and columns (think Excel, but far more powerful and scalable). We will use SQL to interact with data, do simple data connection and formatting operations, and even run SQL from inside the analytics pipelines we will write in Python.
- SQLite: The specific relational database engine and environment we are going to use is SQLite (pronounced "Sequel Light"). SQLite stores an entire database in a single file. We will be able to use all major SQL commands with this lightweight database, as well as connect to it directly from Python to combine SQL and Python in the same pipeline.
- SQLite Browser: This program will be how we primarily interact with our SQLite database files and write SQL code to query various data sets. The SQLite Browser provides a great user interface for inspecting the different tables of our database, how they relate to one another, how they are formatted (the "schema"), and for writing queries with a live feedback loop on the query output.
- Slack: Slack is a messaging tool we will use to communicate inside and outside of the classroom. It is a souped-up instant messaging client that many organizations use for internal communications. It's super easy to use and you'll get the hang of it in no time. Though the default is a web interface, there are native apps for the desktop and for phones and tablets.
- Python: This is a very popular programming language that is well suited to data analysis and very easy to read. Lots of popular data analytics and data science platforms make first-class use of Python. Python also has a very active community of developers who have built freely available packages that do very powerful things right out of the box. It is a full, general-purpose programming language as well, so what you learn here can help you with other coding projects outside the data analytics domain. Python is a great language to learn because it does a very good job of scaling its own complexity: it does simple things very simply and only becomes more complex as you need to solve more complex problems. Here is a great 4 minute intro to "what is and why Python."
- Jupyter: Jupyter evolved from the IPython project and has become a very popular way of working with Python code. It is particularly good for the interactive programming you will do in data analysis, as well as for collaborative coding with multiple people.
- Company Employees: This is a fake set of simple data that represents some company employees for a fake company. Various versions of this data will exist across several different formats, which will help show how the same kind of data can be represented in multiple ways. We will work with this data in CSV, JSON, and SQL formats. This data will be used for the first, smaller project.
- Lahman: This is a well-known, very rich baseball dataset that is updated each year. We will be working with the SQL version of the data from the 2016 MLB Season. Documentation for this database can be found here. This will be the data that is used for the final project.
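To make the "same data, several formats" idea concrete, here is a minimal sketch using only the Python standard library. The column names and the three sample employees are made up for illustration and are not the class dataset; the point is only that CSV, JSON, and a SQLite table can all carry the same records once they are loaded.

```python
import csv
import io
import json
import sqlite3

# Hypothetical records; the real Company Employees data may use different columns.
RECORDS = [
    {"id": 1, "name": "Ada", "department": "Engineering", "salary": 105000},
    {"id": 2, "name": "Grace", "department": "Engineering", "salary": 98000},
    {"id": 3, "name": "Linus", "department": "Support", "salary": 61000},
]

CSV_TEXT = """id,name,department,salary
1,Ada,Engineering,105000
2,Grace,Engineering,98000
3,Linus,Support,61000
"""

JSON_TEXT = json.dumps(RECORDS)


def from_csv(text):
    """Parse CSV text. Every CSV value arrives as a string, so numbers are converted by hand."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row = dict(row)
        row["id"] = int(row["id"])
        row["salary"] = int(row["salary"])
        rows.append(row)
    return rows


def from_json(text):
    """Parse JSON text. JSON keeps numbers as numbers, so no conversion is needed."""
    return json.loads(text)


def from_sqlite(records):
    """Store records in an in-memory SQLite database, then read them back with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets each row be turned into a dict
    conn.execute(
        "CREATE TABLE employees "
        "(id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary INTEGER)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (:id, :name, :department, :salary)", records
    )
    rows = [
        dict(r)
        for r in conn.execute(
            "SELECT id, name, department, salary FROM employees ORDER BY id"
        )
    ]
    conn.close()
    return rows


csv_rows = from_csv(CSV_TEXT)
json_rows = from_json(JSON_TEXT)
sql_rows = from_sqlite(RECORDS)

# All three sources describe the same employees once loaded.
print(csv_rows == json_rows == sql_rows)
```

Note the one real difference between the formats in this sketch: CSV has no notion of data types, so the numeric columns have to be converted explicitly, while JSON and SQLite hand back integers directly.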
The details of the projects that we will build are still being finalized. Please stay tuned as this page updates closer to the beginning of class.
There are a ton of resources out there to learn data analytics, machine learning, and data science. Some are great, some are crap, and most are in the middle somewhere. These are the ones that I find to be the best place to get started and learn a ton. Almost everything here is free or has a free version with a premium for additional help/coaching.
- The Open Source Data Science Masters: As you complete the basics of this course, if you want to go into Data Science, this is a fantastic place to look around. It is like a meta-index of the best (and almost all free) resources on the internet for learning different Data Science skills.
- Udacity: Of all the online interactive education platforms that teach technical skills, I find Udacity to be far and away the best. While many models seek to scale up the "college class" idea, Udacity built a platform that more closely simulates a personal tutor. You are required to actually answer questions and program solutions to be able to move ahead, and you can take as much or as little time as you need to master a concept. Browse either their nanodegree concepts (which have paid elements), or look into their course catalog directly to learn anything you want to. If you are a little rusty on statistics, I would highly recommend their statistics courses, as well as their Machine Learning units.
Day | First Hour | Second Hour |
---|---|---|
Week 1: Getting Started with SQL | | |
Tue | Slack, GitHub Repo, Setting Up SQLite, Intro to SQL/Using the SQLite Browser | SELECT, FROM, WHERE clauses |
Thu | GROUP BY clause, INNER JOIN operations | JOIN operations |
Week 2: Starting Python | | |
Tue | Sub-Queries, CTEs, Answering Questions with SQL | Installing Python, Using Jupyter, Basic Python |
Thu | Python data types, slicing/indexing, control flow | Looping, functions, thinking in a programmatic way |
Week 3: More Python | | |
Tue | Building functionality with Python operations | Functions, Code re-use |
Thu | Importing Libraries, Loading Data, Dataframes | Beginning Dataframe operations |
Week 4: Breaking Problems Down | | |
Tue | Loading Data from files: CSV, JSON, SQLite | Compute New Columns, Aggregations |
Thu | Merging, Using Merges to solve problems | Solving Problems Systematically |
Week 5: First Data Analysis Project | | |
Tue | Review so far, Basic Plotting, Basic Markdown | Start work on analysis, survey dataset |
Thu | Solidify first project code and analysis | Present Analysis Findings |
Week 6: Beginning Modeling | | |
Tue | Intro to Modeling/Machine Learning, scikit-learn | How to Use/Interpret a statistical model |
Thu | Answering Questions with Statistical Models | ...more of that |
Week 7: Answering Questions with Models | | |
Week 8: The Big Project | | |
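As a small preview of the Week 1 topics, the sketch below runs a `WHERE` filter and a `GROUP BY` over an `INNER JOIN` against an in-memory SQLite database, driven from Python. The two tables and their contents are invented for illustration; in class the same kind of query would be typed into the SQLite Browser against the real data.

```python
import sqlite3

# Build a tiny throwaway database in memory (hypothetical tables and rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary INTEGER
    );
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Support');
    INSERT INTO employees VALUES
        (1, 'Ada', 1, 105000),
        (2, 'Grace', 1, 98000),
        (3, 'Linus', 2, 61000);
    """
)

# SELECT / FROM / WHERE: pick columns from one table and filter the rows.
well_paid = conn.execute(
    """
    SELECT name, salary
    FROM employees
    WHERE salary > 90000
    ORDER BY salary DESC
    """
).fetchall()

# INNER JOIN + GROUP BY: combine two tables, then summarize per department.
by_department = conn.execute(
    """
    SELECT d.dept_name, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
    FROM employees AS e
    INNER JOIN departments AS d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
    """
).fetchall()

conn.close()

print(well_paid)      # [('Ada', 105000), ('Grace', 98000)]
print(by_department)  # [('Engineering', 2, 101500.0), ('Support', 1, 61000.0)]
```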
We will go over this in class, so don't feel too much pressure to have this done before coming. But if you can, please go ahead and set up the main components on your machine to ensure we move along speedily.
If you are on a Windows machine, please visit the Slack Windows Download Center and download the appropriate version for your OS. If you are using a Mac, the best way to install Slack is directly from the Mac App Store on your machine. We will not sign into the class Slack channel until the first day of class, so there is no need to do anything other than just making sure you can launch the app itself at least once.
Please go to the SQLite Browser home page and download the Mac/Windows version as applicable to your machine. Like the other tools, the installation should be quite simple and I don't anticipate any complications. Please verify that you can open this application once installed. This is the only technology we will need for the SQL aspects of this course.
Chances are decent you already have Python installed on your computer. We will be using Python version 3.7 in class, and if you are comfortable installing it on your own, please feel free to go ahead. Everything else we will need for Python and Jupyter we will install in our first session.
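If you do install Python ahead of time, one quick way to see what you have is a short script like the sketch below. The function name `check_environment` is made up for this example, and treating 3.7 as a minimum (rather than an exact requirement) is an assumption on my part; we will sort out any differences in class.

```python
import sqlite3
import sys


def check_environment(minimum=(3, 7)):
    """Report the running Python version and the SQLite library bundled with it.

    The `minimum` default of 3.7 is an assumption based on the class version.
    """
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= minimum,
        "sqlite_version": sqlite3.sqlite_version,
    }


report = check_environment()
for key, value in report.items():
    print(key, value)
```

Save it as a `.py` file and run it with the `python` (or `python3`) command from a terminal. The `sqlite3` module ships with Python, so nothing extra needs to be installed for this check.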
Hi. I am your instructor, Frank D. Evans. I am a Data Engineer with Inspire Brands. I have worked in Data Analytics of one flavor or another for the past 10 years. My professional specialty is Graph Structure Machine Learning and Natural Language Processing. If you'd like to snoop on me, the best places to start are my LinkedIn, Twitter (which I don't use very often), GitHub, and this awesome image of a guy that's definitely not me, but is the first result if you Google "Frank Evans Felony".