Skip to content

yolandameow/data_pipline_practice

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ML Pipeline Problem

This question will test some basic skills in cleaning data and building a machine learning pipeline.

The focus of this test is to evaluate:

  • Ability to quickly learn a new framework (luigi)
  • Ability to manipulate and process data (cleaning, processing, feature engineering)
  • Competency in software development

This test does not focus on modelling accuracy, ability to use a fancy model, or efficiency. It is mainly about the mechanics of building a proper machine learning pipeline.

Datasets

There are two files: airline_tweets.csv and cities.csv.

airline_tweets.csv has twitter data regarding airline sentiment augmented with some extra columns. The relevant columns are:

  • airline_sentiment: a string indicating if the tweet had positive, neutral or negative sentiment.
  • tweet_coord: is a string with form "[, ]" if a geo-coordinate exists for that tweet, or an empty string otherwise.

The cities.csv contains information about latitude and longitude for large cities. The relevant columns are:

  • name: The name of the city.
  • latitude: The latitude of the city.
  • longitude: The longitude of the city.

Problem

Build a basic ML pipeline using the luigi Python framework. The pipeline should clean the tweet data, prepare features for building a model, train a classifier and score using the model. The pipeline should have these steps:

  • CleanDataTask: Cleans the input tweet CSV file by removing any rows without valid geo-coordinates.
    • An invalid coordinate has either an empty tweet_coord column or is coordinate (0.0, 0.0).
  • TrainingDataTask: Extracts features/outcome variable in preparation for training a model.
    • This prepares the cleaned data into the exact form that is able to be fit by the model.
    • The "y" variable will be the multi-class sentiment (0, 1, 2 for negative, neutral and positive respectively).
    • The "X" variables will be the closest city to the "tweet_coord" using Euclidean distance.
    • You should use the cities.csv file to find the closest city.
    • You probably will need to one-hot encode the city names.
  • TrainModelTask: Trains a classifier to predict negative, neutral, positive based only on the input city.
    • Train a classifier that uses closest cities as features.
    • Dump the fitted model to the output file.
  • ScoreTask: Uses the scored model to compute the sentiment for each city.
    • Use the trained model to predict the probability/score for each city the negative, neutral and positive sentiment.
    • Output a sorted list of cities by the predicted positive sentiment score to the output file.

References

  • Luigi package: http://luigi.readthedocs.io/en/stable/

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published