# Soft Skills for Data Scientists {#SoftSkillsforDataScientists}
There are many university courses, online self-learning modules, and excellent books that teach technical skills. However, there are far fewer resources that discuss the soft skills necessary for data scientists. These soft skills are essential for data scientists to succeed in their careers, especially in the early stages. We want to introduce soft skills for data scientists before discussing technical components. In this chapter, we will also cover the project cycle and common pitfalls encountered in real-life data science projects.
## Comparison between Statistician and Data Scientist
Statistics as a scientific area can be traced back to 1749, and statistician as a career has been around for hundreds of years with well-established theory and application. Data scientist, in contrast, has become an attractive career only in the past few years, as data size and variety have grown beyond the traditional statistician's toolbox and computation power has increased rapidly. Statisticians and data scientists have a lot in common, but there are also significant differences, as highlighted in figure \@ref(fig:softskill1).
```{r softskill1, fig.cap = "Comparison of statistician and data scientist", out.width="80%", fig.asp=.75, fig.align="center", echo = FALSE}
knitr::include_graphics("images/softskill1.png")
```
Both statisticians and data scientists work closely with data. For typical traditional statisticians, the data set is usually a well-formatted text file with numbers (i.e., numerical variables) and labels (i.e., categorical variables). The data size is typically small enough to be loaded into a PC's memory or saved on a PC's hard disk. Compared to statisticians, data scientists need to deal with a greater variety of data:
- well-formatted data stored in a database system, with a size much larger than a PC's memory or hard disk;
- a huge amount of verbatim text, voice, image, and video;
- real-time streaming data and other types of records.
One unique power of statistics is to make statistical inferences based on a small set of data. Statisticians, especially in academia, usually spend most of their time developing models and don't need to put too much effort into data cleaning. However, data has become relatively abundant in recent years, and modeling is often a small part of the overall effort. Thanks to the active development of open source communities, fitting standard models is not far from button-pushing. Data scientists in industry instead spend a lot of time preprocessing and wrangling the data before feeding it to the model.
Unlike statisticians, data scientists often focus on delivering actionable results and sometimes need to deploy the model to the production system. The data available for model training can be too large to be processed on a single computer. Across the entire problem-solving cycle, statisticians are usually not well integrated with the production system where data is obtained in real time, while data scientists are more embedded in the production system and closer to the data generation procedures. In summary, statisticians focus more on modeling and usually bring data to models, while data scientists focus more on data and usually bring models to data.
## Beyond Data and Analytics
Data scientists usually have a good sense of data and analytics, but data science projects are much more than that. A data science project may involve people with different roles, especially in a large company:
- the business owner or leader who identifies the business problem and value;
- the data owner and computation resource/infrastructure owner from the IT department;
- a dedicated policy owner who makes sure the data and model comply with model governance, security and privacy guidelines, and laws;
- a dedicated engineering team to implement, maintain, and refresh the model;
- a program manager to ensure the data science project fits into the overall technical program development and to coordinate all involved parties to set periodic tasks so that the project meets the preset milestones and results.
The entire team will usually have multiple rounds of discussion on resource allocation among groups (i.e., who pays for the data science project) at the beginning of the project and during the project.
Effective communication and in-depth domain knowledge about the business problem are essential requirements for a successful data scientist. A data scientist may interact with people at various levels, from senior leaders who set the corporate strategies to front-line employees who do the daily work. A data scientist needs the capability to view the problem from 10,000 feet above the ground as well as down to the finest details. To convert a business question into a data science problem, a data scientist needs to communicate using language other people can understand and obtain the required information through formal and informal conversations.
In the entire data science project cycle, including defining, planning, developing, and implementing, every step needs a data scientist involved to ensure the whole team can correctly determine the business problem and reasonably evaluate the business value and success. Corporations are investing heavily in data science and machine learning, and there is a very high expectation of return on the investment.
However, it is easy to set an unrealistic goal and an inflated estimate of a data science project's business impact. The team's data scientist should lead and navigate the discussions to ensure the goal is backed by data and analytics, not wishful thinking. Many data science projects over-promise on business value and are too optimistic about the delivery timeline. These projects eventually fail by not delivering the pre-set business impact within the promised timeline. As data scientists, we need to identify these issues early and communicate with the entire team to ensure the project has realistic deliverables and a realistic timeline. The data science team also needs to work closely with data owners on different tasks: for example, identifying relevant internal and external data sources, evaluating the data's quality and relevance to the project, and working closely with the infrastructure team to understand the availability of computation resources (i.e., hardware and software). It is easy to create scalable computation resources through cloud infrastructure for a data science project. However, you need to evaluate the cost of dedicated computation resources and make sure it fits the budget.
In summary, data science projects are much more than data and analytics. A successful project requires a data scientist to lead many aspects of the project.
## Three Pillars of Knowledge
It is well known that there are three pillars of essential knowledge for a successful data scientist, as shown in figure \@ref(fig:threepillars).
(1) Analytics knowledge and toolsets
A successful data scientist needs to have a strong technical background in data mining, statistics, and machine learning. An in-depth understanding of modeling, combined with insight about the data, enables a data scientist to convert a business problem into a data science problem. Many chapters of this book focus on analytics knowledge and toolsets.
(2) Domain knowledge and collaboration
A successful data scientist needs in-depth domain knowledge to understand the business problem well. For any data science project, the data scientist needs to collaborate with other team members. Communication and leadership skills are critical for data scientists during the entire project cycle, especially when there is only one scientist on the project. That scientist needs to decide on the timeline and estimate the impact under uncertainty.
(3) (Big) data management and (new) IT skills
The last pillar is about the computation environment and model implementation in a big data platform. It used to be the most difficult one for a data scientist with a statistics background (i.e., lacking computer science knowledge or programming skills). The good news is that with the rise of big data platforms in the cloud, it is easier for a statistician to overcome this barrier. The "Big Data Cloud Platform" chapter of this book describes this pillar in detail.
```{r threepillars, fig.cap = "Three pillars of knowledge", out.width="80%", fig.asp=.75, fig.align="center", echo = FALSE}
knitr::include_graphics("images/softskill2.png")
```
## Data Science Project Cycle
A data science project has various stages. Many textbooks and blogs focus on one or two specific stages, and it is rare to see an end-to-end life cycle of a data science project. Getting a good grasp of the end-to-end process requires years of real-world experience, and seeing a holistic picture of the whole cycle helps data scientists better prepare for real-world applications. We will walk through the full project cycle in this section.
### Types of Data Science Projects
People often use the term data science project to describe any project that uses data to solve a business problem, including traditional business analytics, data visualization, or machine learning modeling. Here we limit our discussion to data science projects that involve data and some statistical or machine learning models, and exclude basic analytics and visualization. The business problem itself gives us the flavor of the project. We can view data as the raw ingredient to start with, and the machine learning model makes the dish. The types of data used and the final model development define the different kinds of data science projects.
#### Offline and Online Data
There are two types of data: offline and online. Offline data are historical data stored in databases or data warehouses. With the development of data storage techniques, the cost of storing a large amount of data is low. Offline data are versatile and rich in general (for example, websites may track and keep each user's mouse position, click, and typing information while the user is visiting the website). The data is usually stored in a distributed system, and it can be extracted in batches to create features used in model training.
Online data are real-time information that flows into models to trigger automatic actions. Real-time data can change frequently (for example, the keywords a customer is searching for can change at any given time). Capturing and using real-time online data requires integrating a machine learning model into the production infrastructure. This used to involve a steep learning curve for data scientists unfamiliar with computer engineering, but cloud infrastructure makes it much more manageable. Based on the offline and online data and model properties, we can separate data science projects into three categories, as described below.
#### Offline Training and Offline Application
This type of data science project addresses a specific business problem that needs to be solved once or multiple times, but the dynamic and disruptive nature of the business problem requires substantial work each time. One example of such a project is "whether a brand-new business workflow is going to improve efficiency." In this case, we often use internal/external offline data and business insight to build models. The final results are delivered as a report to answer the specific business question. It is similar to a traditional business intelligence project but with more focus on data and models. Sometimes the data size and model complexity are beyond the capacity of a single computer, and we need to use distributed storage and computation. Since the model uses historical data and the output is a report, there is no need for real-time execution. Usually, there is no run-time constraint on the machine learning model unless it runs beyond a reasonable time frame, such as a few days. We call this type of data science project an "offline training, offline application" project.
#### Offline Training and Online Application
Another type of data science project uses offline data for training and applies the trained model to real-time online data in the production environment. For example, we can use historical data to train a personalized advertisement recommendation model that provides real-time ad recommendations. The model training uses historical offline data. The trained model then takes customers' online real-time data as input features and runs in real time to provide an automatic action. The model training is very similar to the "offline training, offline application" project, but putting the trained model into production brings specific requirements: features used in offline training have to be available online in real time, and the model's online run time has to be short enough not to impact the user experience. In most cases, data science projects in this category create continuous and scalable business value, as the model could run millions of times a day. We will use this type of data science project to describe the typical data science project cycle from section \@ref(ProblemFormulationandProjectPlanningStage) to section \@ref(ProjectCycleSummary).
#### Online Training and Online Application
Some business problems are so dynamic that even yesterday's data is out of date. In this case, we can use online data to train the model and apply it in real time. We call this type of data science project "online training, online application." It requires high automation and low latency.
### Problem Formulation and Project Planning Stage {#ProblemFormulationandProjectPlanningStage}
A data-driven and fact-based planning stage is essential to ensure a successful data science project. With the recent big data and data science hype, there is a high demand for data science projects to create business value across different business sectors. Usually, the leaders of an organization are the ones who initiate data science project proposals. These top-down data science projects typically have high visibility, with some human and computation resources pre-allocated. However, it is crucial to understand the business problem first and align the goal across different teams, including:
(1) the business team, which may include members from the business operation team, business analytics, insight, and metrics reporting team;
(2) the technology team, which may include members from the database and data warehouse team, data engineering team, infrastructure team, core machine learning team, and software development team;
(3) the project, program, and product management team depending on the scope of the data science project.
To start the conversation, we can ask everyone on the team the following questions:
- What are the pain points in the current business operation?
- What data are available, and what are the quality and quantity of the data?
- What might be the most significant impacts of a data science project?
- Is there any positive or negative impact on other teams?
- What computation resources are available for model training and model execution?
- Can we define key metrics to compare and quantify business value?
- Are there any data security, privacy, and legal concerns?
- What are the desired milestones, checkpoints, and timeline?
- Is the final application online or offline?
- Are the data sources online or offline?
It typically takes a series of intense meetings and heated discussions to frame the project reasonably. After the planning stage, we should be able to define a set of key metrics related to the project, identify some offline and online data sources, request the needed computation resources, draft a tentative timeline and milestones, and form a team of data scientists, data engineers, software developers, a project manager, and members from business operations. Data scientists should play a significant role in these discussions. If data scientists do not lead the project formulation and planning, the project may not meet the desired timeline and milestones.
### Project Modeling Stage
Even though we already set some strategies, milestones, and timelines at the problem formulation and project planning stage, data science projects are dynamic, and there can be uncertainties along the road. As a data scientist, communicating any newly encountered difficulties or opportunities during the modeling stage to the entire team is essential to keep the data science project on track. Data cleaning, data wrangling, and exploratory data analysis are great starting points toward modeling with the available data sources identified at the planning stage. Meanwhile, abstracting the business problem into a set of statistical and machine learning problems is an iterative process. Business problems can rarely be solved by using just one statistical or machine learning model. Using a sequence of methods to decompose the business problem is one of the critical responsibilities of a senior data scientist. The process requires iterative rounds of discussion with the business and data engineering teams based on each iteration's new learnings. Each iteration includes both a data-related and a model-related part.
#### Data-Related Part
Data cleaning, data preprocessing, and feature engineering are related procedures that aim to create usable variables or features for statistical and machine learning models. A critical aspect of data-related procedures is to make sure the data source we are using is a good representation of the situation where the final trained model will be applied. An exact representation is rarely possible, and it is OK to start with a reasonable approximation. A data scientist must be clear about the assumptions, communicate the limitations of biased data to the team, and quantify their impact on the application. Sometimes the available data is not relevant to the business problem we want to solve, and we have to collect more relevant data before modeling.
#### Model-Related Part
There are different types of statistical and machine learning models, such as supervised learning, unsupervised learning, and causal inference. For each type, there are various algorithms, libraries, or packages readily available. To solve a business problem, we sometimes need to piece together a few methods at the model exploring and developing stage. This stage also includes model training, validation, and testing to ensure the model works well in the production environment (i.e., it generalizes well and does not overfit). Model selection follows Occam's razor: choose the simplest among a set of compatible models. Before we try complicated models, it is good to get some benchmarks from simple business rules, common-sense decisions, or standard models (such as random forest for classification and regression problems).
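To make the benchmarking step concrete, here is a minimal sketch in R. It is illustrative only: it assumes a data frame `dat` with a binary factor outcome `y` and a handful of feature columns, and it uses a plain logistic regression and a default random forest as baselines to beat before investing in anything more complex.
```{r benchmarksketch, eval = FALSE}
library(randomForest)

# Illustrative only: assume `dat` is a data frame with a
# binary factor outcome `y`; hold out 30% of rows for testing
set.seed(2023)
test_idx <- sample(nrow(dat), size = round(0.3 * nrow(dat)))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Benchmark 1: a simple logistic regression
glm_fit  <- glm(y ~ ., data = train, family = binomial)
glm_prob <- predict(glm_fit, test, type = "response")
glm_pred <- ifelse(glm_prob > 0.5, levels(train$y)[2], levels(train$y)[1])

# Benchmark 2: a standard random forest with default settings
rf_fit  <- randomForest(y ~ ., data = train)
rf_pred <- predict(rf_fit, test)

# Compare test accuracy; a more complicated model is worth the
# extra cost only if it clearly beats these baselines
mean(glm_pred == test$y)
mean(rf_pred == test$y)
```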
### Model Implementation and Post Production Stage {#ModelImplementationandPostProductionStage}
For offline application data science projects, the end product is often a detailed report with model results and output. However, for online application projects, a trained model is only halfway to the finish line. The offline data is stored and processed in a different environment from the online production environment. Building the online data pipeline and implementing machine learning models in a production environment requires lots of additional work. Even though recent advances in cloud infrastructure lower the barrier dramatically, it still takes effort to implement an offline model in the online production system. Before we promote the model to production, there are two more steps to go:
1. Shadow mode
2. A/B testing
A **shadow mode** is like an observation period: the data pipeline and machine learning models run fully functional, but we only record the model output without taking any action. Some people call it a proof of concept (POC). During the shadow mode, people frequently check the data pipeline and model and detect bugs such as timeouts, missing features, version conflicts (for example, Python 2 vs. Python 3), data type mismatches, etc.
Once the online model passes the shadow mode, A/B testing is the next stage. During A/B testing, all the incoming observations are randomly separated into two groups: control and treatment. The control group skips the machine learning model, while the treatment group goes through it. After that, people monitor a list of pre-defined key metrics over a specific period to compare the control and treatment groups. The differences in these metrics determine whether the machine learning model provides business value or not. Real applications can be complicated. For example, there can be multiple treatment groups, or hundreds, even thousands, of A/B tests run by different teams at any given time in the same production environment.
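As an illustration, a minimal A/B test comparison in R might look like the sketch below. The data here are simulated and the metric is a made-up placeholder; in practice, the random assignment and metric collection happen in the production system.
```{r abtestsketch, eval = FALSE}
# Randomly assign each incoming customer to control or treatment
set.seed(42)
n <- 10000
assignment <- sample(c("control", "treatment"), n, replace = TRUE)

# Placeholder: pretend `metric` is a pre-defined key metric
# (e.g., revenue per customer) collected during the test period,
# simulated here with a small treatment effect
metric <- rnorm(n, mean = ifelse(assignment == "treatment", 10.5, 10), sd = 2)

# Compare the key metric between the two groups
t.test(metric[assignment == "treatment"],
       metric[assignment == "control"])
```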
Once the A/B testing shows that the model provides significant business value, we can put it into full production. Ideally, the model runs as expected and continues to offer scalable value. However, the business can change: a machine learning model that works now can break tomorrow, and features available now may not be available tomorrow. We need a monitoring system to notify us when one or multiple features change. When the model performance degrades below a pre-defined level, we need to fine-tune the parameters and thresholds, re-train the model with more recent data, or add or remove features to improve model performance. Eventually, every model fails or retires, ideally with a pre-defined model retirement plan.
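A bare-bones version of such a performance check might look like the hypothetical R sketch below; the metric (AUC), threshold, and alerting mechanism are all placeholders for whatever the production system actually uses.
```{r monitorsketch, eval = FALSE}
# Hypothetical daily performance log; in practice this would be
# computed from recently labeled data in the production system
perf_log <- data.frame(
  date = seq(as.Date("2023-01-01"), by = "day", length.out = 30),
  auc  = runif(30, min = 0.70, max = 0.78)
)

auc_threshold <- 0.72  # pre-defined acceptable performance level

# Flag the days where performance degrades below the threshold;
# in production this check would trigger an automatic alarm and
# possibly a re-training job
alert_days <- perf_log[perf_log$auc < auc_threshold, ]
if (nrow(alert_days) > 0) {
  warning("Model AUC below threshold on ", nrow(alert_days),
          " day(s); consider re-training or feature checks.")
}
```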
### Project Cycle Summary {#ProjectCycleSummary}
The end-to-end data science project cycle is a complicated process that requires close collaboration among many teams. The data scientist, who may be the only scientist on the team, has to lead the planning discussion and model development based on the available data and communicate key assumptions and uncertainties. A data science project may fail at any stage, and a clear end-to-end view of the project cycle helps avoid some mistakes.
## Common Mistakes in Data Science
Data science projects can go wrong at different stages and in many ways. Most textbooks and online blogs focus on technical mistakes related to machine learning models, algorithms, or theories, such as detecting outliers and overfitting. It is important to avoid these technical mistakes. However, there are common systematic mistakes across data science projects that are rarely discussed in textbooks. In this section, we describe these common mistakes in detail so that readers can proactively identify and avoid them in their own data science projects.
### Problem Formulation Stage
The most challenging part of a data science project is problem formulation. Data science projects stem from the pain points of the business. The draft version of the project's goal is often relatively vague, without much quantification, or is the gut feeling of the leadership team. There are often multiple teams involved in the initial project formulation stage, and they have different views. It is easy to have misalignment across teams on resource allocation, milestone deliverables, and timeline. Data science team members with technical backgrounds are sometimes not even invited to the initial discussion at the problem formulation stage. It sounds ridiculous, but it is sadly true that a lot of resources are spent on **solving the wrong problem**, the number one systematic mistake in data science. Formulating a business problem into the right data science project requires an in-depth understanding of the business context, data availability and quality, computation infrastructure, and methodology to leverage the data to quantify business value.
We see people **over-promise about business value** all the time, another common mistake that can doom a project from the beginning. With big data and machine learning hype, leaders across many industries often have unrealistically high expectations of data science. This is especially true during enterprise transformations, when there is a strong push to adopt new technology to get value out of the data. The unrealistic expectations are based on assumptions that are way off the chart, made without checking the data availability, data quality, computation resources, and current best practices in the field. Even when there is some exploratory analysis by the data science team at the problem formulation stage, project leaders sometimes ignore their data-driven voice.
These two systematic mistakes undermine the organization's data science strategy. The higher the expectation, the bigger the disappointment when the project cannot deliver business value. Data and business context are essential to formulate the business problem and set a reachable business value. Having a strong data science leader with a broad technical background, and letting data scientists coordinate and drive the problem formulation and set realistic goals based on data and business context, helps avoid these mistakes.
### Project Planning Stage
Now suppose the data science project is formulated correctly with a reasonable expectation of the business value. The next step is to plan the project by allocating resources, setting up milestones and timelines, and defining deliverables. In most cases, project managers coordinate the different teams involved in the project and use agile project management tools similar to those in software development. Unfortunately, the project management team may not have experience with data science projects and hence fail to account for the uncertainties at the planning stage. The fundamental difference between data science projects and other projects leads to another common mistake: being **too optimistic about the timeline**. For example, data exploration and data preparation may take 60% to 80% of the total time for a given data science project, but people often don't realize that.
When a lot of data has already been collected across the organization, people assume we have enough data for everything. This leads to the mistake of being **too optimistic about data availability and quality**. We do not need "big data," but data that can help us solve the problem. The available data may be of low quality, and we need to put substantial effort into cleaning it before we can use it. There is "unexpected" work required to bring the right and relevant data to a specific data science project. To ensure smooth delivery of data science projects, we need to account for this "unexpected" work at the planning stage. Data scientists all know that data preprocessing and feature engineering are usually the most time-consuming parts of a data science project. However, people outside data science are often not aware of it, and we need to educate other team members and the leadership team.
### Project Modeling Stage
Finally, we start to look at the data and fit some models. One common mistake at this stage is **unrepresentative data**. A model trained on historical data may not generalize to the future. There is always a problem with biased or unrepresentative data. As data scientists, we need to use data as close as possible to the situation where the model will be applied and quantify the impact on the model output in production. Another mistake at this stage is **overfitting and obsession with complicated models**. Nowadays, we can easily get hundreds or even thousands of features, and machine learning models are getting more complicated. People can use open source libraries to try all kinds of models and are sometimes obsessed with complicated models, instead of choosing the simplest among a set of compatible models with similar results.
The data used to build the models is always somewhat biased or unrepresentative. Simpler models generalize better and have a higher chance of providing consistent business value once the model passes the test and is finally implemented in the production environment. The existing data and methods at hand may be insufficient to solve the business problem. In that case, we can try to collect more data, do feature engineering, or develop new models. However, if there is a fundamental gap between the data and the business problem, the data scientist must make the tough decision to unplug the project.
On the other hand, data science projects usually have high visibility and may be initiated by senior leadership. Even after the data science team has provided enough evidence that it can't deliver the expected business value, people may not want to stop the project, which leads to another common mistake at the modeling stage: **taking too long to fail**. The earlier we can stop a failing project, the better, because we can put valuable resources into other promising projects. A long data science project that is doomed to fail damages the data science strategy and hurts everyone involved.
### Model Implementation and Post Production Stage
Now suppose we have found a model that works great on the training and testing data. If it is an online application, we are only halfway there. The next step is to implement the model, which can sound like alien work for a data scientist without software engineering experience in production systems. The data engineering team can help with model production. However, as data scientists, we need to know the potential mistakes at this stage. One big mistake is **missing shadow mode and A/B testing** and assuming that the model performance at the training/testing stage stays the same in the production environment. Unfortunately, a model trained and evaluated on historical data almost never performs the same in the production environment. The data used in offline training may be significantly different from online data, and the business context may have changed. If possible, machine learning models in production should always go through shadow mode and A/B testing to evaluate their performance.
In the model training stage, people usually focus on model performance, such as accuracy, without paying too much attention to model execution time. When a model runs online in real time, each instance's total run time (i.e., model latency) should not impact the customer's user experience. Nobody wants to wait even one second to see the results after clicking the "search" button. In the production stage, feature availability is crucial to running a real-time model, and engineering resources are essential for model production. However, in traditional companies, it is common for a data science project to **fail to scale in real-time applications** due to a lack of computation capacity, engineering resources, or a non-tech culture and environment.
As the business problem evolves rapidly, the data and model in the production environment need to change accordingly, or the model's performance deteriorates over time. The online production environment is more complicated than model training and testing. For example, when we pull online features from different resources, some may be missing at a specific time; the model may time out; and various pieces of software can cause version problems. We need regular checkups during the entire life cycle of the model, from implementation to retirement. Unfortunately, people often don't set up a monitoring system for data science projects, which is another common mistake: **missing necessary online checkups**. It is essential to set up a monitoring dashboard and automatic alarms, and to create model tuning, re-training, and retirement plans.
### Summary of Common Mistakes
A data science project is a combination of art, science, and engineering, and it may fail in different ways. However, a data science project can provide significant business value if we put data and business context at the center of the project, get familiar with the data science project cycle, and proactively identify and avoid these potential mistakes. Here is a summary of the common mistakes:
- Solving the wrong problem
- Over-promising about business value
- Being too optimistic about the timeline
- Being too optimistic about data availability and quality
- Unrepresentative data
- Overfitting and obsession with complicated models
- Taking too long to fail
- Missing shadow mode and A/B testing
- Failing to scale in real-time applications
- Missing necessary online checkups