From Python to Tidy R (and Back)

A Running List of Key Python Operations Translated to (Mostly) Tidy R

I frequently write code in both Python and R, and my team relies heavily on Tidyverse syntax. As a result, I often translate key Python operations (pandas, matplotlib, etc.) to tidy R (dplyr, ggplot2, etc.). I created this repo to ease that translation and to crowdsource a running directory of these translations.

This is just a start. Please feel free to share it, and to contribute or revise directly via pull requests or issues.

Note: I recommend using the native pipe operator (|>) when constructing piped operations in practice, instead of the magrittr pipe (%>%). However, I used the latter in this repo because the | in the native R pipe threw off formatting of the markdown tables.

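Table of Contents

- Key tasks
- Joining Data
- Iteration
- Iteration Over Lists
- String Operations
- Modeling and Machine Learning
- Network Modeling and Dynamics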

Key tasks

| Task / Operation | Python (Pandas) | Tidyverse (dplyr, tidyr, ggplot2) |
| --- | --- | --- |
| Data Loading | `import pandas as pd` | `library(readr)` |
| | `df = pd.read_csv('file.csv')` | `data <- read_csv('file.csv')` |
| Select Columns | `df[['col1', 'col2']]` | `data %>% select(col1, col2)` |
| Filter Rows | `df[df['col'] > 5]` | `data %>% filter(col > 5)` |
| Arrange Rows | `df.sort_values(by='col')` | `data %>% arrange(col)` |
| Mutate (Add Columns) | `df['new_col'] = df['col1'] + df['col2']` | `data %>% mutate(new_col = col1 + col2)` |
| Group and Summarize | `df.groupby('col').agg({'col2': 'mean'})` | `data %>% group_by(col) %>% summarize(mean_col2 = mean(col2))` |
| Pivot/Wide to Long | `pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')` | `data %>% pivot_longer(-id, names_to = 'variable', values_to = 'value')` (supersedes `gather()`) |
| Long to Wide/Pivot | `df.pivot(index='id', columns='variable', values='value')` | `data %>% pivot_wider(names_from = variable, values_from = value)` (supersedes `spread()`) |
| Data Visualization | Matplotlib, Seaborn, Plotly, etc. | ggplot2 |
| | `import matplotlib.pyplot as plt` | `library(ggplot2)` |
| | `plt.scatter(df['x'], df['y'])` | `ggplot(data, aes(x = x, y = y)) + geom_point()` |
| Data Reshaping | `pd.concat([df1, df2], axis=0)` | `bind_rows(df1, df2)` |
| | `pd.concat([df1, df2], axis=1)` | `bind_cols(df1, df2)` |
| String Manipulation | `df['col'].str.replace('a', 'b')` | `data %>% mutate(col = str_replace_all(col, 'a', 'b'))` |
| Date and Time | `pd.to_datetime(df['date_col'])` | `data %>% mutate(date_col = as.Date(date_col))` |
| Missing Data Handling | `df.dropna()` | `data %>% drop_na()` |
| Rename Columns | `df.rename(columns={'old_col': 'new_col'})` | `data %>% rename(new_col = old_col)` |
| Summary Statistics | `df.describe()` | `data %>% summary()` or `data %>% glimpse()` |
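
To show how several of these pandas operations combine in practice, here is a minimal sketch that chains a filter, a mutate-style assignment, and a grouped summary, followed by a wide-to-long-to-wide round trip. The toy data frame and its column names (`id`, `group`, `x`, `y`) are made up for illustration and are not part of this repo.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "group": ["a", "a", "b", "b"],
    "x": [2, 6, 8, 10],
    "y": [1, 1, 2, 3],
})

# filter -> mutate -> group_by + summarize, written as one method chain
summary = (
    df[df["x"] > 5]                            # like filter(x > 5)
    .assign(total=lambda d: d["x"] + d["y"])   # like mutate(total = x + y)
    .groupby("group", as_index=False)          # like group_by(group)
    .agg(mean_total=("total", "mean"))         # like summarize(mean_total = mean(total))
)

# Wide -> long with melt, then long -> wide again with pivot
long = pd.melt(df, id_vars=["id"], value_vars=["x", "y"],
               var_name="variable", value_name="value")
wide = long.pivot(index="id", columns="variable", values="value").reset_index()

print(summary)
print(wide)
```

Using `assign` instead of `df['new_col'] = ...` keeps the whole sequence in a single expression, which reads closer to a dplyr pipeline.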

Joining Data

This is the only table that includes SQL, since many of the dplyr join verbs were patterned on and named after SQL operations.

| Join Type | SQL | Python (Pandas) | R (dplyr) |
| --- | --- | --- | --- |
| Inner Join | `INNER JOIN` | `pd.merge(df1, df2, on='key')` | `inner_join(df1, df2, by='key')` |
| Left Join | `LEFT JOIN` | `pd.merge(df1, df2, on='key', how='left')` | `left_join(df1, df2, by='key')` |
| Right Join | `RIGHT JOIN` | `pd.merge(df1, df2, on='key', how='right')` | `right_join(df1, df2, by='key')` |
| Full Outer Join | `FULL OUTER JOIN` | `pd.merge(df1, df2, on='key', how='outer')` | `full_join(df1, df2, by='key')` |
| Cross Join | `CROSS JOIN` | `pd.merge(df1, df2, how='cross')` | `cross_join(df1, df2)` (dplyr 1.1.0 or later) |
| Anti Join | No dedicated keyword; use `NOT EXISTS` or a `LEFT JOIN` filtered on `IS NULL` | `df1[~df1['key'].isin(df2['key'])]` | `anti_join(df1, df2, by='key')` |
| Semi Join | No dedicated keyword; use `EXISTS` or `IN` | `df1[df1['key'].isin(df2['key'])]` | `semi_join(df1, df2, by='key')` |
| Self Join | `INNER JOIN` with the same table | `pd.merge(df, df, on='key')` | `inner_join(df, df, by='key')` |
| Multiple Key Join | `INNER JOIN` with multiple keys | `pd.merge(df1, df2, on=['key1', 'key2'])` | `inner_join(df1, df2, by=c('key1', 'key2'))` |
| Join with Renamed Columns | `INNER JOIN` with renamed columns | `pd.merge(df1.rename(columns={'col1': 'key'}), df2, on='key')` | `inner_join(rename(df1, key = col1), df2, by = 'key')` |
| Join with Complex Condition | `INNER JOIN` with complex conditions | `pd.merge(df1, df2, left_on='col2', right_on='col3').query('col1 > 10')` | `inner_join(df1, df2, by = join_by(col2 == col3)) %>% filter(col1 > 10)` |
| Join with Different Key Names | `INNER JOIN` with different key names | `pd.merge(df1, df2, left_on='key1', right_on='key2')` | `inner_join(df1, df2, by = c('key1' = 'key2'))` |
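
Semi and anti joins are the least obvious entries on the pandas side, so here is a small sketch comparing two ways of writing them. The frames `df1` and `df2` and their columns are hypothetical. One thing worth noting: filtering with `isin` never duplicates rows from the left table, whereas a plain inner merge does when the right table has repeated keys.

```python
import pandas as pd

# Hypothetical example tables; df2 deliberately repeats key 2.
df1 = pd.DataFrame({"key": [1, 2, 3, 4], "left_val": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"key": [2, 2, 4, 5], "right_val": ["w", "x", "y", "z"]})

# Inner join: key 2 appears twice because df2 has two matching rows.
inner = pd.merge(df1, df2, on="key")

# Semi join: rows of df1 that have at least one match in df2.
semi = df1[df1["key"].isin(df2["key"])]

# Anti join: rows of df1 with no match in df2.
anti = df1[~df1["key"].isin(df2["key"])]

# Same anti join via merge + indicator; de-duplicate the right keys first
# so that left rows are not repeated.
merged = pd.merge(df1, df2[["key"]].drop_duplicates(),
                  on="key", how="left", indicator=True)
anti_via_merge = merged.query('_merge == "left_only"').drop(columns="_merge")

# Cross join: every row of df1 paired with every row of df2.
cross = pd.merge(df1, df2, how="cross")

print(semi)
print(anti)
```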

Iteration

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and purrr) |
| --- | --- | --- |
| Iterate Over Rows | `for index, row in df.iterrows():` | `data %>% rowwise() %>% mutate(new_col = your_function(col))` |
| | `print(row['col1'], row['col2'])` | |
| Map Function to Column | `df['new_col'] = df['col'].apply(your_function)` | `data %>% mutate(new_col = map_dbl(col, your_function))` |
| Apply Function to Column | `df['new_col'] = your_function(df['col'])` | `data %>% mutate(new_col = your_function(col))` |
| Group and Map | `df.groupby('group_col').apply(your_function)` | `data %>% group_by(group_col) %>% nest() %>% mutate(new_col = map(data, your_function))` |
| Map Over List Column | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(., your_function)))` |
| Map with Anonymous Function | `df['new_col'] = df['col'].apply(lambda x: your_function(x))` | `data %>% mutate(new_col = map_dbl(col, ~your_function(.)))` |
| Map Multiple Columns | `df['new_col'] = df.apply(lambda row: your_function(row['col1'], row['col2']), axis=1)` | `data %>% mutate(new_col = pmap_dbl(list(col1, col2), your_function))` |
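
The difference between mapping a function element by element, applying it to a whole column, and applying it across rows is easier to see side by side. The sketch below assumes two placeholder helpers, `your_function` and `combine`, and a toy data frame; none of these names come from a real library.

```python
import pandas as pd


def your_function(value):
    """Hypothetical placeholder: doubles its input."""
    return value * 2


def combine(a, b):
    """Hypothetical two-argument helper for the row-wise case."""
    return a + b


df = pd.DataFrame({
    "group_col": ["a", "a", "b"],
    "col1": [1, 2, 3],
    "col2": [10, 20, 30],
})

# Element-wise map: the function is called once per value.
df["doubled"] = df["col1"].apply(your_function)

# Vectorized application: the function receives the whole Series at once.
df["doubled_vec"] = your_function(df["col1"])

# Row-wise map over several columns (axis=1 passes each row as a Series).
df["combined"] = df.apply(lambda row: combine(row["col1"], row["col2"]), axis=1)

# Group and map: iterate over (name, sub-frame) pairs from groupby.
group_totals = {name: int(g["col1"].sum()) for name, g in df.groupby("group_col")}

print(df)
print(group_totals)
```

When a function works on whole columns, the vectorized form is usually both shorter and faster than `apply`.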

Iteration Over Lists

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and purrr) |
| --- | --- | --- |
| Map Function Across List Column | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(., your_function)))` |
| Nested Map in List Column | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(., your_function)))` |
| Nested Map Across Columns | - | `data %>% mutate(new_col = map2(col1, col2, ~map(list(.x, .y), your_function)))` |
| Nested Map Within List Column | - | `data %>% mutate(new_col = map(list_col, ~map(., your_function)))` |
| Map Across Rows with Nested Map | - | `data %>% mutate(new_col = pmap(list(col1, col2), ~list(your_function(.x), your_function(.y))))` |
| Nested Map Within Nested List | - | `data %>% mutate(new_col = map(list_col, ~map(., ~map(., your_function))))` |
| Nested Map Across List of Lists | `df['new_col'] = df['list_col'].apply(lambda x: [list(map(your_function, i)) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(., ~map(., your_function))))` |
| Nested Map Across Rows and Lists | - | `data %>% mutate(new_col = pmap(list(col1, col2, col3), ~list(your_function(..1), your_function(..2), your_function(..3))))` |
| Map and Reduce Across List | `df['new_col'] = df['list_col'].apply(lambda x: reduce(your_function, x))` (needs `from functools import reduce`) | `data %>% mutate(new_col = map(list_col, ~reduce(., your_function)))` |
| Map and Reduce Across Rows | `df['new_col'] = df.apply(lambda row: reduce(your_function, row[['col1', 'col2']]), axis=1)` | `data %>% mutate(new_col = map2(col1, col2, ~reduce(list(.x, .y), your_function)))` |
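
The pandas entries for list columns rely on `reduce`, which lives in `functools` and has to be imported. Here is a minimal sketch of mapping and reducing over a list column, plus one level of extra nesting. The column names and the use of `operator.add` as the reducing function are assumptions chosen for illustration.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical data frame with a list column.
df = pd.DataFrame({"list_col": [[1, 2, 3], [4, 5], [6]]})

# Map a function over every element of each list.
df["squared"] = df["list_col"].apply(lambda xs: [x ** 2 for x in xs])

# Reduce each list to a single value (here: a running sum).
df["total"] = df["list_col"].apply(lambda xs: reduce(operator.add, xs))

# One more level of nesting: a list of lists per row.
nested = pd.DataFrame({"nested_col": [[[1, 2], [3]], [[4]]]})
nested["doubled"] = nested["nested_col"].apply(
    lambda outer: [[x * 2 for x in inner] for inner in outer]
)

print(df)
print(nested)
```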

String Operations

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and stringr) |
| --- | --- | --- |
| String Length | `df['col'].str.len()` | `data %>% mutate(new_col = str_length(col))` |
| Concatenate Strings | `df['new_col'] = df['col1'] + df['col2']` | `data %>% mutate(new_col = str_c(col1, col2))` |
| Split Strings | `df['col'].str.split(', ')` | `data %>% mutate(new_col = str_split(col, ', '))` |
| Substring | `df['col'].str.slice(0, 5)` | `data %>% mutate(new_col = str_sub(col, 1, 5))` |
| Replace Substring | `df['col'].str.replace('old', 'new')` | `data %>% mutate(new_col = str_replace_all(col, 'old', 'new'))` |
| Uppercase / Lowercase | `df['col'].str.upper()` | `data %>% mutate(new_col = str_to_upper(col))` |
| | `df['col'].str.lower()` | `data %>% mutate(new_col = str_to_lower(col))` |
| Strip Whitespace | `df['col'].str.strip()` | `data %>% mutate(new_col = str_trim(col))` |
| Check for Substring | `df['col'].str.contains('pattern')` | `data %>% mutate(new_col = str_detect(col, 'pattern'))` |
| Count Substring Occurrences | `df['col'].str.count('pattern')` | `data %>% mutate(new_col = str_count(col, 'pattern'))` |
| Find First Occurrence of Substring | `df['col'].str.find('pattern')` (0-based, `-1` if absent) | `data %>% mutate(new_col = str_locate(col, 'pattern')[, 1])` (1-based, `NA` if absent) |
| Extract Substring with Regex | `df['col'].str.extract(r'(\d+)')` | `data %>% mutate(new_col = str_extract(col, '(\\d+)'))` |
| Remove Duplicates in Strings | `df['col'].unique()` | `str_unique(data$col)` |
| Pad Strings | `df['col'].str.pad(width=10, side='right', fillchar='0')` | `data %>% mutate(new_col = str_pad(col, width = 10, side = 'right', pad = '0'))` |
| Truncate Strings | `df['col'].str.slice(0, 10)` | `data %>% mutate(new_col = str_sub(col, 1, 10))` |
| Title Case | `df['col'].str.title()` | `data %>% mutate(new_col = str_to_title(col))` |
| Join List of Strings | `'separator'.join(df['col'])` | `data %>% summarize(new_col = str_flatten(col, collapse = 'separator'))` |
| Remove Punctuation | - | `data %>% mutate(new_col = str_remove_all(col, '[[:punct:]]'))` |
| String Encoding/Decoding | `df['col'].str.encode('utf-8')` | `data %>% mutate(new_col = iconv(col, to = 'UTF-8'))` (base R) |
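
A few of these pandas string methods behave differently from their stringr counterparts, most visibly around indexing. The sketch below runs several of them on a made-up Series. In particular, `str.find` returns a 0-based position and `-1` when the substring is missing, while positions in R start at 1.

```python
import pandas as pd

# Hypothetical example strings.
s = pd.Series(["  Hello, World  ", "data science", "R and Python"])

stripped = s.str.strip()                 # trim leading/trailing whitespace
upper = stripped.str.upper()             # upper case
first_five = stripped.str.slice(0, 5)    # characters 0..4 (end is exclusive)
has_python = stripped.str.contains("Python", regex=False)
position = stripped.str.find("a")        # 0-based index, -1 if not found
padded = first_five.str.pad(width=8, side="right", fillchar="0")
joined = "; ".join(stripped)             # collapse the column into one string

print(position.tolist())
print(padded.tolist())
print(joined)
```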

Modeling and Machine Learning

| Task / Operation | Python (scikit-learn) | R (various packages) |
| --- | --- | --- |
| Data Preprocessing | `from sklearn.preprocessing import ...` | `library(caret)` |
| | `from sklearn.pipeline import Pipeline` | `library(glmnet)` |
| | `preprocessor = ...` | `preprocess <- preProcess(data, ...)` |
| Feature Scaling | `StandardScaler()` | `preProcess(data, method = c("center", "scale"))` |
| Feature Selection | `SelectKBest()` | `caret::sbf()`, `caret::rfe()` |
| Data Splitting | `train_test_split()` | `createDataPartition()` |
| Model Initialization | `model = ...()` | `model <- ...()` |
| Model Training | `model.fit(X_train, y_train)` | `model <- train(y ~ ., data = data)` |
| Model Prediction | `y_pred = model.predict(X_test)` | `y_pred <- predict(model, newdata)` |
| Model Evaluation | `accuracy_score(y_test, y_pred)` | `confusionMatrix(y_pred, y_true)` |
| Hyperparameter Tuning | `GridSearchCV()` | `train(..., tuneGrid = expand.grid(...))` |
| Cross-Validation | `cross_val_score()` | `trainControl(method = "cv")` |
| Model Pipelining | `pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])` | `model <- train(y ~ ., data = data, method = model, preProcess = c("center", "scale"), trControl = trainControl(method = "cv"))` |
| Feature Engineering | `from sklearn.preprocessing import ...` | `library(caret)` |
| | Custom feature transformers | Custom feature transformers |
| Handling Missing Data | `SimpleImputer()` | `preProcess(data, method = "medianImpute")` |
| Encoding Categorical Data | `OneHotEncoder()` | `dummyVars()` |
| Dimensionality Reduction | `PCA()` | `preProcess(data, method = "pca")` |
| Model Selection | `GridSearchCV()` | `caret::train()` |
| Ensemble Learning | Various ensemble methods | `caretEnsemble::caretStack()` |
| Regularization | Lasso, Ridge, Elastic Net, etc. | `glmnet()` |
| Model Interpretability | SHAP, Lime, etc. | DALEX, iml, etc. |
| Model Export/Serialization | joblib or pickle | `saveRDS` or other formats |
| Deploying Models | Web frameworks (e.g., Flask, Django) | Web frameworks (e.g., Shiny, Plumber) |
| Batch Scoring | Scripting or automation tools | R batch processing |
| Feature Scaling/Normalization | `StandardScaler()`, `MinMaxScaler()`, etc. | `scale()`, `preProcess(data, method = "range")`, etc. |
| Feature Selection with L1 Regularization | `SelectFromModel()`, `Lasso()` | `glmnet()`, `cv.glmnet()` |
| Handling Imbalanced Data | `RandomUnderSampler()`, `SMOTE()`, etc. (imbalanced-learn) | `caret::train()` with weights, or `trainControl(sampling = ...)` |
| Model Evaluation Metrics | `classification_report()`, `confusion_matrix()`, `mean_squared_error()`, etc. | `confusionMatrix()`, `postResample()`, `RMSE()`, etc. |
| Feature Importance | `.feature_importances_` (Random Forest, etc.) | `varImp()`, `vip()`, etc. |
| Model Persistence | joblib, pickle | `saveRDS()`, `save()`, `serialize()`, etc. |
| Time Series Forecasting | Prophet, ARIMA, ExponentialSmoothing, etc. | forecast, prophet, `auto.arima()`, etc. |
| Natural Language Processing (NLP) | nltk, spaCy, textblob, etc. | tm, quanteda, udpipe, tm.plugin.webmining, etc. |
| Deep Learning | Keras, TensorFlow, PyTorch, etc. | keras, tensorflow, torch, mxnet, etc. |
| Model Interpretation | SHAP, LIME, ELI5, etc. | DALEX, iml, iBreakDown, lime, etc. |
| Model Deployment in Production | Containers, cloud platforms (e.g., Docker, Kubernetes, AWS SageMaker) | Containers, Shiny, Plumber, APIs, cloud platforms |
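
To tie the scikit-learn entries together, here is a minimal end-to-end sketch: split the data, build a pipeline that scales features and fits a model, evaluate on the held-out set, and cross-validate. The choice of the built-in iris data and of `LogisticRegression` is an assumption made purely for illustration; any estimator could take its place.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data bundled with scikit-learn (chosen only for illustration).
X, y = load_iris(return_X_y=True)

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipelining: scaling and the model are fit together, so the scaler only
# ever sees training data.
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Training, prediction, evaluation
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 5-fold cross-validation on the training set
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

print(f"held-out accuracy: {accuracy:.3f}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```

The closest caret analogue is a single `train()` call with `preProcess` and `trControl` set, which bundles the same steps.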

Network Modeling and Dynamics

| Task / Operation | Python (NetworkX) | R (various packages) |
| --- | --- | --- |
| Network Creation | `G = nx.Graph()`, `G.add_node()`, `G.add_edge()` | `igraph::make_empty_graph()`, `add_vertices()`, `add_edges()` |
| Node and Edge Attributes | `G.nodes[node]['attribute'] = value`, `G.edges[edge]['attribute'] = value` | `V(graph)$attribute <- value`, `E(graph)$attribute <- value` |
| Network Visualization | `nx.draw(G)`, matplotlib for customization | `plot(graph)`, igraph, ggplot2, visNetwork, etc. |
| Network Measures | `nx.degree_centrality(G)`, `nx.betweenness_centrality(G)`, `nx.clustering(G)`, etc. | `degree()`, `betweenness()`, `transitivity()`, etc. |
| Community Detection | `nx.community.louvain_communities(G)`, `nx.community.girvan_newman(G)`, etc. | `cluster_louvain()`, `cluster_walktrap()`, `cluster_fast_greedy()`, `cluster_leading_eigen()`, etc. |
| Link Prediction | `nx.jaccard_coefficient(G)`, `nx.adamic_adar_index(G)`, `nx.preferential_attachment(G)`, etc. | `igraph::similarity(graph, method = "jaccard")`, `method = "invlogweighted"` (Adamic-Adar style), etc. |
| Network Filtering/Selection | `G.subgraph(nodes)` | `induced_subgraph(graph, vids)` |
| Network Embedding | node2vec, GraphSAGE, etc. | `embed_adjacency_matrix()`, `embed_laplacian_matrix()` (spectral embeddings in igraph), etc. |
| Network Simulation | `nx.erdos_renyi_graph()`, `nx.watts_strogatz_graph()`, etc. | `igraph::sample_gnp()`, `igraph::sample_smallworld()`, etc. |
| Network Analysis Pipelines | Custom pipelines using NetworkX, Pandas, and other libraries | Custom pipelines using igraph, dplyr, and other packages |
| Dynamic Network Analysis | dynetx for dynamic networks | tsna for temporal networks, dyngraph for dynamic graphs, etc. |
| Geospatial Network Analysis | osmnx for urban network analysis and routing | stplanr for transport planning, spatnet for spatial network analysis, etc. |
| Network Modeling for Machine Learning | Integration of network data with scikit-learn, PyTorch, etc. | Combining igraph or custom network features with caret, glmnet, keras, etc. |
| Community Visualization | Visualization of detected communities using network layouts | `igraph::plot.igraph()` with community coloring |
| Path Analysis | Shortest paths, k-shortest paths, and all simple paths (`nx.shortest_path()`, `nx.shortest_simple_paths()`, `nx.all_simple_paths()`) | `shortest_paths()`, `all_simple_paths()` |
| Centrality Analysis | Closeness centrality, eigenvector centrality, Katz centrality, etc. | `closeness()`, `eigen_centrality()`, `alpha_centrality()`, etc. |
| Structural Role Analysis | Structural equivalence, equivalence-based roles | `sna::sedist()`, `sna::equiv.clust()`, etc. |
| Network Robustness Analysis | Network attack simulations, robustness metrics | `delete_vertices()` followed by `components()`, etc. |
| Temporal Network Analysis | Temporal networks, evolving networks | dynnet package for dynamic networks, temporal extensions of igraph functions |
| Multiplex Network Analysis | Analyzing multiple layers of networks | multiplex package for multiplex networks, mgm package for mixed graphical models |
| Network Alignment | Aligning nodes in two or more networks | netAlign package for network alignment, gmatch package for graph matching |
| Dynamic Community Detection | Detecting evolving communities over time | dynCOMM for dynamic community detection |
| Network Generative Models | Generating networks from various models (e.g., ER, BA, etc.) | `igraph::sample_gnm()`, `igraph::sample_pa()`, `igraph::sample_degseq()`, etc. |
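
As a concrete reference for the NetworkX column, here is a small sketch that builds a graph, sets a node attribute, and computes a few of the measures listed above. The four-node graph and the `role` attribute are made up for illustration.

```python
import networkx as nx

# Build a small undirected graph (hypothetical nodes and edges).
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")])

# Node attribute
G.nodes["a"]["role"] = "hub"

# Network measures
degree = nx.degree_centrality(G)    # degree divided by (n - 1)
clustering = nx.clustering(G)       # local clustering coefficient per node

# Path analysis
path = nx.shortest_path(G, source="a", target="d")

# Filtering: induced subgraph on a subset of nodes
sub = G.subgraph(["a", "b", "c"])

print(degree)
print(path)
print(sub.number_of_edges())
```

In igraph, the rough equivalents would be `degree()`, `transitivity(type = "local")`, `shortest_paths()`, and `induced_subgraph()`.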