diff --git a/caveat/3d.Rmd b/caveat/3d.Rmd index dcdc203..413d825 100644 --- a/caveat/3d.Rmd +++ b/caveat/3d.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/3d.png" mytitle: "The issue with 3D in data visualization" -mydisqus: "3d" +pathSlug: "3d" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -51,7 +51,7 @@ Here is a 3d barplot. You are probably familiar with this kind of figure since i # 3D Pie charts: Don't *** -Nothing is worse than a [pie chart](https://www.data-to-viz.com/caveat/pie.html) in dataviz, except a 3D pie chart. Data to viz offers a whole post on [pie chart issues](https://www.data-to-viz.com/caveat/pie.html). +Nothing is worse than a [pie chart](https://www.data-to-viz.com/caveat/pie.html) in dataviz, except a 3D pie chart. Data to viz offers a whole post on [pie chart issues](https://www.data-to-viz.com/caveat/pie.html). Adding 3D makes it even worse since it distorts reality. Indeed, the part at the back looks smaller than the one at the front, which is not the case. 
@@ -104,10 +104,10 @@ data$color <- mycolors[ as.numeric(data$Species) ] # Plot par(mar=c(0,0,0,0)) -plot3d( - x=data$`Sepal.Length`, y=data$`Sepal.Width`, z=data$`Petal.Length`, - col = data$color, - type = 's', +plot3d( + x=data$`Sepal.Length`, y=data$`Sepal.Width`, z=data$`Petal.Length`, + col = data$color, + type = 's', radius = .1, xlab="Sepal Length", ylab="Sepal Width", zlab="Petal Length") ``` diff --git a/caveat/annotation.Rmd b/caveat/annotation.Rmd index f0f0ec5..1aae16a 100644 --- a/caveat/annotation.Rmd +++ b/caveat/annotation.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/annotate.png" mytitle: "Annotation is crucial for your dataviz" -mydisqus: "annotation" +pathSlug: "annotation" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -40,7 +40,7 @@ library(babynames) library(viridis) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Mary","Emma", "Ida", "Ashley", "Amanda", "Jessica", "Patricia", "Linda", "Deborah", "Dorothy", "Betty", "Helen")) %>% filter(sex=="F") diff --git a/caveat/area_hard.Rmd b/caveat/area_hard.Rmd index ae98b36..3650b57 100644 --- a/caveat/area_hard.Rmd +++ b/caveat/area_hard.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/bubble_hard.png" mytitle: "Area is a poor metaphor" -mydisqus: "area_hard" +pathSlug: "area_hard" output: html_document: template: template_caveat.html @@ -13,14 +13,14 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html ---



-The human eye does not perform well when it has to translate areas to numeric values. Let's consider the following five bubbles. Try to rank them by decreasing area. You will probably agree that this is possible, but takes some time. +The human eye does not perform well when it has to translate areas to numeric values. Let's consider the following five bubbles. Try to rank them by decreasing area. You will probably agree that this is possible, but takes some time. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=7, fig.height=1} # Libraries @@ -71,7 +71,7 @@ This does not mean that area must never been used to represent a numeric variabl #Going further *** -- Using bubbles: [scaling to radius or area](https://www.data-to-viz.com/caveat/radius_or_area.html)? +- Using bubbles: [scaling to radius or area](https://www.data-to-viz.com/caveat/radius_or_area.html)? - Data visualization: basic principles. [link](http://paldhous.github.io/ucb/2016/dataviz/week2.html) diff --git a/caveat/aspect_ratio.Rmd b/caveat/aspect_ratio.Rmd index 8c4f250..48e5d05 100644 --- a/caveat/aspect_ratio.Rmd +++ b/caveat/aspect_ratio.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/aspect_ratio.png" mytitle: "Mind the aspect ratio" -mydisqus: "aspect_ratio" +pathSlug: "aspect_ratio" output: html_document: template: template_caveat.html @@ -13,14 +13,14 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html ---

-`Aspect ratio` is defined as the ratio of the width to the height of a graphic. It can have a strong impact on the insights gained from your graphic.
+`Aspect ratio` is defined as the ratio of the width to the height of a graphic. It can have a strong impact on the insights gained from your graphic.
diff --git a/caveat/bin_size.Rmd b/caveat/bin_size.Rmd index 0de5624..e1a41b5 100644 --- a/caveat/bin_size.Rmd +++ b/caveat/bin_size.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/bin_size.png" mytitle: "Play with your histogram bin size" -mydisqus: "bin_size" +pathSlug: "bin_size" output: html_document: template: template_caveat.html @@ -13,13 +13,13 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html ---

-A [histogram](https://www.data-to-viz.com/graph/histogram.html) takes as input a numeric variable and cuts it into several bins. The number of observations in each bin is represented by the height of the bar. It is a very common type of graphic and most tools select a bin size value by default.
+A [histogram](https://www.data-to-viz.com/graph/histogram.html) takes as input a numeric variable and cuts it into several bins. The number of observations in each bin is represented by the height of the bar. It is a very common type of graphic and most tools select a bin size value by default.

However, this bin size choice can have a strong impact on the chart insight. Let's look at the distribution of [Airbnb night prices on the French Riviera](http://www.data-to-viz.com/story/OneNum.html): @@ -45,7 +45,7 @@ data %>% ) ``` -The price ranges between 10 and 300 euros, with most of the apartments ranging between 60 and 150 euros per night. In this chart, prices are cut in several 10 euro bins: between 0 and 10 euros a night, between 10 and 20, and so on. This is represented on the X-axis. Then, the number of apartments per bin is counted and represented by the Y-axis. +The price ranges between 10 and 300 euros, with most of the apartments ranging between 60 and 150 euros per night. In this chart, prices are cut in several 10 euro bins: between 0 and 10 euros a night, between 10 and 20, and so on. This is represented on the X-axis. Then, the number of apartments per bin is counted and represented by the Y-axis.
diff --git a/caveat/boxplot.Rmd b/caveat/boxplot.Rmd index b0d13d7..fdba1e9 100644 --- a/caveat/boxplot.Rmd +++ b/caveat/boxplot.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/boxplot.png" mytitle: "The Boxplot and its pitfalls" -mydisqus: "boxplot" +pathSlug: "boxplot" output: html_document: template: template_caveat.html @@ -27,7 +27,7 @@ A boxplot gives a nice summary of one or more numeric variables. A boxplot is co - The ends of the box shows the upper (Q3) and lower (Q1) [quartiles](https://en.wikipedia.org/wiki/Quartile). If the third quartile is 15, it means that 75% of the observation are lower than 15. - The difference between Quartiles 1 and 3 is called the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range) (IQR) - The extreme line shows Q3+1.5xIQR to Q1-1.5xIQR (the highest and lowest value excluding outliers). -- Dots (or other markers) beyond the extreme line shows potntial outliers. +- Dots (or other markers) beyond the extreme line shows potntial outliers.
diff --git a/caveat/calculation_error.Rmd b/caveat/calculation_error.Rmd index 9dd4aa6..4dadceb 100644 --- a/caveat/calculation_error.Rmd +++ b/caveat/calculation_error.Rmd @@ -1,6 +1,6 @@ --- myimage1: "../img/mistake/calculation_error.png" -mydisqus: "calculation_error" +pathSlug: "calculation_error" mytitle: "Calculation errors" output: html_document: @@ -13,9 +13,9 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html - + --- diff --git a/caveat/circular_bar_yaxis.Rmd b/caveat/circular_bar_yaxis.Rmd index 09cbfa4..d7f7593 100644 --- a/caveat/circular_bar_yaxis.Rmd +++ b/caveat/circular_bar_yaxis.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/circular_bar_yaxis.png" mytitle: "Circular barplot and distortion" -mydisqus: "circular_bar_yaxis" +pathSlug: "circular_bar_yaxis" output: html_document: template: template_caveat.html diff --git a/caveat/circular_barplot_accordeon.Rmd b/caveat/circular_barplot_accordeon.Rmd index 60bc9e9..78fa41e 100644 --- a/caveat/circular_barplot_accordeon.Rmd +++ b/caveat/circular_barplot_accordeon.Rmd @@ -1,6 +1,6 @@ --- myimage1: "../img/mistake/calculation_error.png" -mydisqus: "circular_bar_yaxis" +pathSlug: "circular_bar_yaxis" mytitle: "Mind the radial bar charts" output: html_document: @@ -13,9 +13,9 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html - + ---

@@ -52,7 +52,7 @@ data %>% xlab("") + ylab("") + coord_polar(theta = "y") + - ylim(0,15000) + ylim(0,15000) ``` The good thing about this kind of graphic is that it is quite eye-catching. However, because the bars are plotted on different radial points of the polar axis, they have different radii and cannot be compared by their lengths. A bar on the outside will be longer by construction than one on the inside, even with an equal value. diff --git a/caveat/color_com_nothing.Rmd b/caveat/color_com_nothing.Rmd index 1fb886d..1723d41 100644 --- a/caveat/color_com_nothing.Rmd +++ b/caveat/color_com_nothing.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/color_com_nothing.png" mytitle: "Don't use color if they communicate nothing" -mydisqus: "color_com_nothing" +pathSlug: "color_com_nothing" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -35,7 +35,7 @@ data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/ # create random color palette mycolors <- colors()[sample(1:400, nrow(data))] - + # Barplot data %>% filter(!is.na(Value)) %>% diff --git a/caveat/connect_your_dot.Rmd b/caveat/connect_your_dot.Rmd index 03aaecb..e8338ee 100644 --- a/caveat/connect_your_dot.Rmd +++ b/caveat/connect_your_dot.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/connect_your_dot.png" mytitle: "Connect your dots when the X-axis is ordered" -mydisqus: "connect_your_dot" +pathSlug: "connect_your_dot" output: html_document: self_contained: FALSE @@ -14,7 +14,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- diff --git a/caveat/consistency.Rmd b/caveat/consistency.Rmd index 3ffe2f0..e2c49f1 100644 --- a/caveat/consistency.Rmd +++ b/caveat/consistency.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/consistency_sev_graph.png" mytitle: "Consistency 
between charts" -mydisqus: "consistency" +pathSlug: "consistency" output: html_document: template: template_caveat.html @@ -13,14 +13,14 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html ---

-Let's consider a small report where you present several graphics to your audience. The report is composed of: +Let's consider a small report where you present several graphics to your audience. The report is composed of: - a [barplot](https://www.data-to-viz.com/graph/barplot.html) showing the total amount of money spent on the five products of the company - a [line plot](https://www.data-to-viz.com/graph/line.html) showing the evolution of the money generated by these five products in the last ten years: @@ -58,11 +58,11 @@ a <- data %>% data <- data.frame( year = rep( seq(1,10), 5 ), product = rep(LETTERS[1:5], each=10 ), - value = c( - seq(1,10) + sample( 1:3, 10, replace = TRUE), - seq(5,14) + sample( 4:10, 10, replace=TRUE), - seq(10,1)*2 + sample( 3:5, 10, replace=TRUE), - seq(20,11) + sample( 12:17, 10, replace=TRUE), + value = c( + seq(1,10) + sample( 1:3, 10, replace = TRUE), + seq(5,14) + sample( 4:10, 10, replace=TRUE), + seq(10,1)*2 + sample( 3:5, 10, replace=TRUE), + seq(20,11) + sample( 12:17, 10, replace=TRUE), seq(1,10) + sample( 40:10, 10)) ) @@ -124,11 +124,11 @@ a <- data %>% data <- data.frame( year = rep( seq(1,10), 5 ), product = rep(LETTERS[1:5], each=10 ), - value = c( - seq(1,10) + sample( 1:3, 10, replace = TRUE), - seq(5,14) + sample( 4:10, 10, replace=TRUE), - seq(10,1)*2 + sample( 3:5, 10, replace=TRUE), - seq(20,11) + sample( 12:17, 10, replace=TRUE), + value = c( + seq(1,10) + sample( 1:3, 10, replace = TRUE), + seq(5,14) + sample( 4:10, 10, replace=TRUE), + seq(10,1)*2 + sample( 3:5, 10, replace=TRUE), + seq(20,11) + sample( 12:17, 10, replace=TRUE), seq(1,10) + sample( 40:10, 10)) ) diff --git a/caveat/counter_intuitive.Rmd b/caveat/counter_intuitive.Rmd index aef9abf..a699cd7 100644 --- a/caveat/counter_intuitive.Rmd +++ b/caveat/counter_intuitive.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/counter_intuitive.png" mytitle: "Don't be counter intuitive" -mydisqus: "counter_intuitive" +pathSlug: "counter_intuitive" output: 
html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -25,7 +25,7 @@ People are used to seeing data displayed in a usual and logical way. Not respect #Reversing axis *** -The Y-axis almost always rises from the bottom to the top of the graphic. Here is an example with a reverse Y-axis. It comes from an article in [business insider](https://www.businessinsider.in/This-Chart-Shows-What-Happened-To-Gun-Deaths-In-Florida-After-Stand-Your-Ground-Was-Enacted/articleshow/30635752.cms) and describes the evolution of gun deaths in Florida. +The Y-axis almost always rises from the bottom to the top of the graphic. Here is an example with a reverse Y-axis. It comes from an article in [business insider](https://www.businessinsider.in/This-Chart-Shows-What-Happened-To-Gun-Deaths-In-Florida-After-Stand-Your-Ground-Was-Enacted/articleshow/30635752.cms) and describes the evolution of gun deaths in Florida. *Disclaimer*: I found this image on [KD Nuggets](https://www.kdnuggets.com/2016/02/common-data-visualization-mistakes.html) diff --git a/caveat/cut_y_axis.Rmd b/caveat/cut_y_axis.Rmd index 4885d48..fdbab39 100644 --- a/caveat/cut_y_axis.Rmd +++ b/caveat/cut_y_axis.Rmd @@ -1,6 +1,6 @@ --- myimage1: "../img/mistake/cut_y_axis.png" -mydisqus: "cut_y_axis" +pathSlug: "cut_y_axis" mytitle: "To cut or not to cut (the Y axis)" output: html_document: @@ -13,9 +13,9 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html - + ---

@@ -61,7 +61,7 @@ By its design, a bar graph emphasizes the absolute magnitude of values associate
-*Read more*: +*Read more*: - Have a look to this [#SWD challenge](http://www.storytellingwithdata.com/blog/2018/3/9/bring-on-the-bar-charts) by storytelling with data: you will see that most of the entry ordered their barplot. - Read more about [barplot]() and [lollipop plot]() diff --git a/caveat/declutter.Rmd b/caveat/declutter.Rmd index 50b9700..259dee7 100644 --- a/caveat/declutter.Rmd +++ b/caveat/declutter.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/declutter_your_graphic.png" mytitle: "Decluttering your chart" -mydisqus: "declutter" +pathSlug: "declutter" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -21,13 +21,13 @@ output:
-Getting rid of all the unnecessary elements can greatly improve the quality and impact of your chart. First, the chart will be cleaner and thus more likely to be read by people. Second, it will allow people to target directly what is important on the chart, and thus to get your point. +Getting rid of all the unnecessary elements can greatly improve the quality and impact of your chart. First, the chart will be cleaner and thus more likely to be read by people. Second, it will allow people to target directly what is important on the chart, and thus to get your point. -Here is a good example that takes a cluttered graphic from [viz.wtf](http://viz.wtf) and gets rid of the unnecessary elements. This example comes from the website [Storytelling with data](http://www.storytellingwithdata.com/blog/2017/3/29/declutter-this-graph) by [Cole Nussbaumer Knaflic](http://www.storytellingwithdata.com/about/). +Here is a good example that takes a cluttered graphic from [viz.wtf](http://viz.wtf) and gets rid of the unnecessary elements. This example comes from the website [Storytelling with data](http://www.storytellingwithdata.com/blog/2017/3/29/declutter-this-graph) by [Cole Nussbaumer Knaflic](http://www.storytellingwithdata.com/about/). #Initial graphic *** -The idea of the chart is to show that women tend to begin Christmas shopping earlier than men: +The idea of the chart is to show that women tend to begin Christmas shopping earlier than men:
diff --git a/caveat/dual_axis.Rmd b/caveat/dual_axis.Rmd index 0bfd791..f5526a3 100644 --- a/caveat/dual_axis.Rmd +++ b/caveat/dual_axis.Rmd @@ -1,6 +1,6 @@ --- myimage1: "../img/mistake/dual_axis.png" -mydisqus: "dual_axis" +pathSlug: "dual_axis" mytitle: "The issue with dual axis" output: html_document: @@ -13,9 +13,9 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html - + ---

@@ -28,10 +28,10 @@ output: http://emschuch.github.io/Planned-Parenthood/ and from http://www.politifact.com/truth-o-meter/statements/2015/oct/01/jason-chaffetz/chart-shown-planned-parenthood-hearing-misleading-/ - + initially published on http://www.aul.org - - + + Data here https://docs.google.com/spreadsheets/d/1vzkuzSi2S-JO0m0VolflzxaMN9wUfjw6Slek97_6L4s/edit#gid=0 @@ -84,15 +84,15 @@ aval steps <- list() p <- plot_ly() for (i in 1:11) { - p <- add_lines(p,x=aval[i][[1]]$x, y=aval[i][[1]]$y, visible = aval[i][[1]]$visible, - name = aval[i][[1]]$name, type = 'scatter', mode = 'lines', hoverinfo = 'name', + p <- add_lines(p,x=aval[i][[1]]$x, y=aval[i][[1]]$y, visible = aval[i][[1]]$visible, + name = aval[i][[1]]$name, type = 'scatter', mode = 'lines', hoverinfo = 'name', line=list(color='00CED1'), showlegend = FALSE) step <- list(args = list('visible', rep(FALSE, length(aval))), method = 'restyle') - step$args[[2]][i] = TRUE - steps[[i]] = step -} + step$args[[2]][i] = TRUE + steps[[i]] = step +} # add slider control to plot p <- p %>% diff --git a/caveat/error_bar.Rmd b/caveat/error_bar.Rmd index 72d063e..81afaec 100644 --- a/caveat/error_bar.Rmd +++ b/caveat/error_bar.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/error_bar.png" mytitle: "The issue with error bars" -mydisqus: "error_bar" +pathSlug: "error_bar" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -37,7 +37,7 @@ data <- data.frame( value=sample(seq(4,15),5), sd=c(1,0.2,3,2,4) ) - + # Plot ggplot(data) + geom_bar( aes(x=name, y=value), stat="identity", fill="#69b3a2", alpha=0.7, width=0.5) + @@ -49,7 +49,7 @@ ggplot(data) + ) + ggtitle("A barplot with error bar") + xlab("") - + ``` @@ -84,7 +84,7 @@ Thus, the same barplot with error bars can in fact tell very different stories, #What is an error bar? 
*** -The second issue with error bars is that they are used to show `different metrics`, and it is not always clear which one is being shown. Three different types of values are commonly used for error bars, sometimes giving very different results. Here is an overview of their definitions and how to calculate them on a simple vector in R. +The second issue with error bars is that they are used to show `different metrics`, and it is not always clear which one is being shown. Three different types of values are commonly used for error bars, sometimes giving very different results. Here is an overview of their definitions and how to calculate them on a simple vector in R. - [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) (SD) represents the amount of dispersion of the variable. Calculated as the root square of the variance ```{r, eval=FALSE} @@ -110,22 +110,22 @@ Here is an application of these 3 metrics to the famous [Iris dataset](https://s ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=10, fig.height=4} # Data -data <- iris %>% select(Species, Sepal.Length) - +data <- iris %>% select(Species, Sepal.Length) + # Calculates mean, sd, se and ci my_sum <- data %>% group_by(Species) %>% - summarise( + summarise( n=n(), mean=mean(Sepal.Length), sd=sd(Sepal.Length) ) %>% mutate( se=sd/sqrt(n)) %>% mutate( ic=se * qt((1-0.05)/2 + .5, n-1)) - + # Standard deviation p1 <- ggplot(my_sum) + - geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) + + geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) + geom_errorbar( aes(x=Species, ymin=mean-sd, ymax=mean+sd), width=0.4, colour="black", alpha=0.9, size=1) + ggtitle("standard deviation") + theme( @@ -134,10 +134,10 @@ p1 <- ggplot(my_sum) + theme_ipsum() + xlab("") + ylab("Sepal Length") - + # Standard Error p2 <- ggplot(my_sum) + - geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, 
width=0.6) + + geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) + geom_errorbar( aes(x=Species, ymin=mean-se, ymax=mean+se),width=0.4, colour="black", alpha=0.9, size=1) + ggtitle("standard error") + theme( @@ -146,10 +146,10 @@ p2 <- ggplot(my_sum) + theme_ipsum() + xlab("") + ylab("Sepal Length") - + # Confidence Interval p3 <- ggplot(my_sum) + - geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) + + geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) + geom_errorbar( aes(x=Species, ymin=mean-ic, ymax=mean+ic), width=0.4, colour="black", alpha=0.9, size=1) + ggtitle("confidence interval") + theme( @@ -172,7 +172,7 @@ It is quite obvious that the 3 metrics report very different visualizations and *** It is better to avoid error bars as much as you can. Of course it is not possible if you only have summary statistics. But if you know the individual data points, show them. Several workarounds are possible. The [boxplot with jitter](http://www.data-to-viz.com/caveat/boxplot.html) is a good one for a relatively small amount of data. The [violin plot](https://www.data-to-viz.com/graph/violin.html) is another possibility if you have a large sample size to display. 
- + ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=10} data %>% ggplot( aes(x=Species, y=Sepal.Length)) + @@ -185,7 +185,7 @@ data %>% theme_ipsum() + xlab("") + ylab("Sepal Length") -``` +``` diff --git a/caveat/faceting.Rmd b/caveat/faceting.Rmd index 5fff76d..e3ac0a2 100644 --- a/caveat/faceting.Rmd +++ b/caveat/faceting.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/small_multiple.png" mytitle: "A few thoughts about small multiples" -mydisqus: "faceting" +pathSlug: "faceting" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -39,12 +39,12 @@ library(hrbrthemes) library(plotly) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Amanda", "Jessica", "Patricia", "Deborah", "Dorothy", "Helen")) %>% filter(sex=="F") # Plot -p <- data %>% +p <- data %>% ggplot( aes(x=year, y=n, fill=name, text=name)) + geom_area( ) + scale_fill_viridis(discrete = TRUE) + @@ -55,7 +55,7 @@ p <- data %>% ggplotly(p, tooltip="text") ``` -*Note*: This graphic is interactive: hover an area to know the underlying name. +*Note*: This graphic is interactive: hover an area to know the underlying name. 
diff --git a/caveat/grouped_bar.Rmd b/caveat/grouped_bar.Rmd index 7c95959..9d75205 100644 --- a/caveat/grouped_bar.Rmd +++ b/caveat/grouped_bar.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/ungrouped_grouped_barplot.png" mytitle: "Grouped barplot must be grouped" -mydisqus: "grouped_bar" +pathSlug: "grouped_bar" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -37,12 +37,12 @@ library(babynames) library(viridis) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Anna", "Mary")) %>% filter(sex=="F") # A grouped barplot -data %>% +data %>% filter(year %in% c(1950, 1960, 1970, 1980, 1990, 2000)) %>% mutate(year=as.factor(year)) %>% mutate( nameYear = paste(year, name, sep=" - ")) %>% @@ -70,12 +70,12 @@ library(babynames) library(viridis) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Anna", "Mary")) %>% filter(sex=="F") # A grouped barplot -data %>% +data %>% filter(year %in% c(1950, 1960, 1970, 1980, 1990, 2000)) %>% mutate(year=as.factor(year)) %>% ggplot( aes(x=year, y=n, fill=name)) + diff --git a/caveat/hard_label.Rmd b/caveat/hard_label.Rmd index b8a42c2..169908a 100644 --- a/caveat/hard_label.Rmd +++ b/caveat/hard_label.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/hard_label.png" mytitle: "A note on long labels" -mydisqus: "hard_label" +pathSlug: "hard_label" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- diff --git a/caveat/mental_calculation.Rmd b/caveat/mental_calculation.Rmd index 6b7201e..e6d28c5 100644 --- a/caveat/mental_calculation.Rmd +++ b/caveat/mental_calculation.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/mental_workout.png" mytitle: "Mental arithmetic in 
dataviz" -mydisqus: "mental_calculation" +pathSlug: "mental_calculation" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -54,7 +54,7 @@ data %>% panel.grid.minor = element_blank(), legend.position = c(0.9, 0.9), ) + - ylab("# of people") + + ylab("# of people") + xlab("Hour of day") ``` @@ -71,8 +71,8 @@ Now, what if somebody asks you: *** To answer these questions, your audience must think hard and will probably be confused. -- Is it where the marker `A` is, when the number of people entering the shop starts decreasing? -- Or marker `B` where more people are leaving than entering? +- Is it where the marker `A` is, when the number of people entering the shop starts decreasing? +- Or marker `B` where more people are leaving than entering? - Or `C` where the number of people leaving decreases? Instead of forcing the reader to make the calculation, it is probably better to represent the number of people in the shop directly: @@ -91,7 +91,7 @@ data %>% theme( panel.grid.minor = element_blank() ) + - ylab("# of people") + + ylab("# of people") + xlab("Hour of day") ``` diff --git a/caveat/multi_distribution.Rmd b/caveat/multi_distribution.Rmd index eb4038f..18936c4 100644 --- a/caveat/multi_distribution.Rmd +++ b/caveat/multi_distribution.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/multi_distribution.png" mytitle: "Too many distributions" -mydisqus: "multi_distribution" +pathSlug: "multi_distribution" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -21,7 +21,7 @@ output:
-Comparing the distributions of several numeric variables is a common task in dataviz. The distribution of a variable can be represented using a [histogram](http://www.data-to-viz.com/graph/histogram.html) or a [density chart](http://www.data-to-viz.com/graph/density.html), and it is very tempting to represent many distributions on the same axis. +Comparing the distributions of several numeric variables is a common task in dataviz. The distribution of a variable can be represented using a [histogram](http://www.data-to-viz.com/graph/histogram.html) or a [density chart](http://www.data-to-viz.com/graph/density.html), and it is very tempting to represent many distributions on the same axis. Here is an example showing how people perceive probability vocabulary. On the [/r/samplesize](https://www.reddit.com/r/SampleSize/) thread of reddit, questions like *What probability would you assign to the phrase "Highly likely"* were asked. Here is the distribution of the score given by people to each question: @@ -36,7 +36,7 @@ library(patchwork) # Load dataset from github data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",") -data <- data %>% +data <- data %>% gather(key="text", value="value") %>% mutate(text = gsub("\\.", " ",text)) %>% mutate(value = round(as.numeric(value),0)) diff --git a/caveat/order_data.Rmd b/caveat/order_data.Rmd index cb8f7e8..d61b449 100644 --- a/caveat/order_data.Rmd +++ b/caveat/order_data.Rmd @@ -1,6 +1,6 @@ --- myimage1: "../img/mistake/order_data.png" -mydisqus: "order_data" +pathSlug: "order_data" mytitle: "Why you should order your data" output: html_document: @@ -84,7 +84,7 @@ The figure is now way more insightful, with France being the third biggest expor #Conclusion *** -Reordering your data is an easy step you should always consider when building a chart. 
Of course, sometimes the order of groups must be set by their features and not their values, like the months of the year, but it's worth thinking about it. +Reordering your data is an easy step you should always consider when building a chart. Of course, sometimes the order of groups must be set by their features and not their values, like the months of the year, but it's worth thinking about it.
*Read more*: diff --git a/caveat/overplotting.Rmd b/caveat/overplotting.Rmd index 3487233..fb0191f 100644 --- a/caveat/overplotting.Rmd +++ b/caveat/overplotting.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/overplotting.png" mytitle: "How to avoid overplotting" -mydisqus: "overplotting" +pathSlug: "overplotting" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -40,7 +40,7 @@ library(patchwork) a <- data.frame( x=rnorm(20000, 10, 1.2), y=rnorm(20000, 10, 1.2), group=rep("A",20000)) b <- data.frame( x=rnorm(20000, 14.5, 1.2), y=rnorm(20000, 14.5, 1.2), group=rep("B",20000)) c <- data.frame( x=rnorm(20000, 9.5, 1.5), y=rnorm(20000, 15.5, 1.5), group=rep("C",20000)) -data <- do.call(rbind, list(a,b,c)) +data <- do.call(rbind, list(a,b,c)) data %>% ggplot( aes(x=x, y=y)) + @@ -157,7 +157,7 @@ data %>% legend.position="none", plot.title = element_text(size=12) ) + - ggtitle('Behavior of the group B') + ggtitle('Behavior of the group B') ``` diff --git a/caveat/pie.Rmd b/caveat/pie.Rmd index 5125f2d..a1cd7de 100644 --- a/caveat/pie.Rmd +++ b/caveat/pie.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/pie.png" mytitle: "The issue with pie chart" -mydisqus: "pie" +pathSlug: "pie" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -52,7 +52,7 @@ plot_pie <- function(data, vec){ ggplot(data, aes(x="name", y=value, fill=name)) + geom_bar(width = 1, stat = "identity") + coord_polar("y", start=0, direction = -1) + - scale_fill_viridis(discrete = TRUE, direction=-1) + + scale_fill_viridis(discrete = TRUE, direction=-1) + geom_text(aes(y = vec, label = rev(name), size=4, color=c( "white", rep("black", 4)))) + scale_color_manual(values=c("black", "white")) + theme_ipsum() + @@ -65,7 +65,7 
@@ ggplot(data, aes(x="name", y=value, fill=name)) + ) + xlab("") + ylab("") - + } plot_pie(data1, c(10,35,55,75,93)) @@ -90,7 +90,7 @@ Now, let's represent exactly the same data using a [barplot](https://www.data-to plot_bar <- function(data){ ggplot(data, aes(x=name, y=value, fill=name)) + geom_bar( stat = "identity") + - scale_fill_viridis(discrete = TRUE, direction=-1) + + scale_fill_viridis(discrete = TRUE, direction=-1) + scale_color_manual(values=c("black", "white")) + theme_ipsum() + theme( @@ -112,7 +112,7 @@ c <- plot_bar(data3) a + b + c ``` -As you can see on this barplot, there is a heavy difference between the three pie plots with a hidden pattern that you definitely don't want to miss when you tell your story. +As you can see on this barplot, there is a heavy difference between the three pie plots with a hidden pattern that you definitely don't want to miss when you tell your story. #And often made even worse @@ -162,30 +162,30 @@ library(treemap) # Plot treemap(data, - + # data index="Country", vSize="Value", type="index", - + # Main title="", palette="Dark2", # Borders: - border.col=c("black"), - border.lwds=1, - + border.col=c("black"), + border.lwds=1, + # Labels fontsize.labels=0.5, fontcolor.labels="white", - fontface.labels=1, - bg.labels=c("transparent"), - align.labels=c("left", "top"), + fontface.labels=1, + bg.labels=c("transparent"), + align.labels=c("left", "top"), overlap.labels=0.5, inflate.labels=T # If true, labels are bigger when rectangle is bigger. - + ) ``` diff --git a/caveat/radius_or_area.Rmd b/caveat/radius_or_area.Rmd index 9126a61..cbf81ee 100644 --- a/caveat/radius_or_area.Rmd +++ b/caveat/radius_or_area.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/radius_or_area.png" mytitle: "Scaling to radius or area?" 
-mydisqus: "radius_or_area" +pathSlug: "radius_or_area" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -26,7 +26,7 @@ A usual practice in data visualization consists of `scaling` a graphic component #Example *** -Here is an example coming from the Barack Obama’s [State of the Union](https://www.youtube.com/watch?v=kl2g40GoRxg) speech in 2011. It shows the 2010 Gross Domestic Product of 5 countries, each value being represented by a circle. The radius of each circle has been scaled based on the size of each nation’s economy. +Here is an example coming from the Barack Obama’s [State of the Union](https://www.youtube.com/watch?v=kl2g40GoRxg) speech in 2011. It shows the 2010 Gross Domestic Product of 5 countries, each value being represented by a circle. The radius of each circle has been scaled based on the size of each nation’s economy.

@@ -65,7 +65,7 @@ The United States still has the biggest economy, but the difference from other c #Conclusion *** -When working with 2d objects, the scaling must be done using the area and not the radius. Furthermore, note that areas are a poor metaphor of values, being poorly perceived by human eyes. It must be used only when better visuals have already been used on the graphic (like in [bubble plot](https://www.data-to-viz.com/graph/bubble.html)). In this case, a barplot would probably have done a better job. +When working with 2d objects, the scaling must be done using the area and not the radius. Furthermore, note that areas are a poor metaphor of values, being poorly perceived by human eyes. It must be used only when better visuals have already been used on the graphic (like in [bubble plot](https://www.data-to-viz.com/graph/bubble.html)). In this case, a barplot would probably have done a better job. diff --git a/caveat/simpson.Rmd b/caveat/simpson.Rmd index 0050c44..2de0f9c 100644 --- a/caveat/simpson.Rmd +++ b/caveat/simpson.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/simpson.png" mytitle: "The simpson's paradox" -mydisqus: "simpson" +pathSlug: "simpson" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -79,7 +79,7 @@ ggplot(data, aes(x=x, y=y, color=group)) + theme_ipsum() ``` -Here, we understand that the positive correlation was due to a difference `between groups`. Actually, the correlation coefficient is even negative if each group is considered separately. +Here, we understand that the positive correlation was due to a difference `between groups`. Actually, the correlation coefficient is even negative if each group is considered separately. This is the Sympson's paradox: the trend between two different variables reverses when a third variable is included. 
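The reversal described in the `simpson.Rmd` context lines above can be reproduced with simulated data. The following is a minimal sketch, not the code used on the page: the group names, the offsets in `centres`, the slope of -0.8, and the noise levels are all assumptions chosen for illustration.

```r
# Hypothetical data: three groups whose centres rise together,
# while y decreases with x inside each group.
set.seed(42)
n <- 100
centres <- c(A = 0, B = 5, C = 10)  # assumed group offsets

sim <- do.call(rbind, lapply(names(centres), function(g) {
  x <- rnorm(n, mean = centres[[g]], sd = 1)
  # Negative slope within the group, centred on the group offset
  y <- centres[[g]] - 0.8 * (x - centres[[g]]) + rnorm(n, sd = 0.3)
  data.frame(x = x, y = y, group = g)
}))

# Correlation when the grouping variable is ignored
overall <- cor(sim$x, sim$y)

# Correlation computed separately inside each group
within <- sapply(split(sim, sim$group), function(d) cor(d$x, d$y))

print(overall)
print(within)
```

With these assumed parameters the pooled correlation is positive while each within-group correlation is negative, which is the sign flip the text calls Simpson's paradox.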
diff --git a/caveat/spaghetti.Rmd b/caveat/spaghetti.Rmd index cbfb91f..674d88b 100644 --- a/caveat/spaghetti.Rmd +++ b/caveat/spaghetti.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/spaghetti.png" mytitle: "The Spaghetti plot" -mydisqus: "spaghetti" +pathSlug: "spaghetti" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -35,7 +35,7 @@ library(DT) library(plotly) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Mary","Emma", "Ida", "Ashley", "Amanda", "Jessica", "Patricia", "Linda", "Deborah", "Dorothy", "Betty", "Helen")) %>% filter(sex=="F") diff --git a/caveat/spider.Rmd b/caveat/spider.Rmd index bed14c2..eec421f 100644 --- a/caveat/spider.Rmd +++ b/caveat/spider.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/section/Spider150.png" mytitle: "The Radar chart and its caveats" -mydisqus: "spider" +pathSlug: "spider" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -52,16 +52,16 @@ data <- rbind(rep(20,10) , rep(0,10) , data) # Custom the radarChart ! par(mar=c(0,0,0,0)) -radarchart( data, axistype=1, +radarchart( data, axistype=1, #custom polygon - pcol=rgb(0.2,0.5,0.5,0.9) , pfcol=rgb(0.2,0.5,0.5,0.5) , plwd=4 , + pcol=rgb(0.2,0.5,0.5,0.9) , pfcol=rgb(0.2,0.5,0.5,0.5) , plwd=4 , #custom the grid cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, #custom labels - vlcex=0.8 + vlcex=0.8 ) ``` @@ -88,16 +88,16 @@ colors_border=c( rgb(0.2,0.5,0.5,0.9), rgb(0.8,0.2,0.5,0.9) ) colors_in=c( rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4) ) # Custom the radarChart ! 
-radarchart( data, axistype=1, +radarchart( data, axistype=1, #custom polygon - pcol=colors_border , pfcol=colors_in , plwd=4, plty=1 , + pcol=colors_border , pfcol=colors_in , plwd=4, plty=1 , #custom the grid cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=1.1, #custom labels - vlcex=0.8 + vlcex=0.8 ) # Legend @@ -130,17 +130,17 @@ par(mfrow=c(2,3)) for(i in 1:6){ # Custom the radarChart ! - radarchart( data[c(1,2,i+2),], axistype=1, - + radarchart( data[c(1,2,i+2),], axistype=1, + #custom polygon - pcol=colors_border[i] , pfcol=colors_in[i] , plwd=4, plty=1 , - + pcol=colors_border[i] , pfcol=colors_in[i] , plwd=4, plty=1 , + #custom the grid cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, - + #custom labels vlcex=0.8, - + #title title=mytitle[i] ) @@ -169,16 +169,16 @@ data <-rbind(rep(20,10) , rep(0,10) , data) # Custom the radarChart ! par(mar=c(0,0,0,0)) -p1 <- radarchart( data, axistype=1, +p1 <- radarchart( data, axistype=1, #custom polygon - pcol=rgb(0.2,0.5,0.5,0.9) , pfcol=rgb(0.2,0.5,0.5,0.5) , plwd=4 , + pcol=rgb(0.2,0.5,0.5,0.9) , pfcol=rgb(0.2,0.5,0.5,0.5) , plwd=4 , #custom the grid cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, #custom labels - vlcex=1.3 + vlcex=1.3 ) # Barplot @@ -219,7 +219,7 @@ data <- rbind(rep(20,10) , rep(0,10) , data) # Change order: data2 <- data[,sample(1:10,10, replace=FALSE)] data3 <- data[,sample(1:10,10, replace=FALSE)] - + # Custom the radarChart ! 
par(mar=c(0,0,0,0)) par(mfrow=c(1,3)) @@ -319,12 +319,12 @@ data <-rbind(rep(20,10) , rep(0,10) , data) rownames(data) <- c("-", "--", "John", "Angli", "Baptiste", "Alfred") # Barplot -data <- data %>% slice(c(3:6)) %>% - t() %>% - as.data.frame() %>% - add_rownames() %>% - arrange(V1) %>% - mutate(rowname=factor(rowname, rowname)) %>% +data <- data %>% slice(c(3:6)) %>% + t() %>% + as.data.frame() %>% + add_rownames() %>% + arrange(V1) %>% + mutate(rowname=factor(rowname, rowname)) %>% gather(key=name, value=mark, -1) #Recode @@ -351,13 +351,13 @@ data %>% ggplot( aes(x=rowname, y=mark)) + library(GGally) data <- iris -data %>% +data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", - showPoints = TRUE, + showPoints = TRUE, title = "Parallel Coordinate Plot for the Iris Data", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum() ``` diff --git a/caveat/stacking.Rmd b/caveat/stacking.Rmd index 51dfaaa..cd808f0 100644 --- a/caveat/stacking.Rmd +++ b/caveat/stacking.Rmd @@ -1,7 +1,7 @@ --- myimage1: "../img/mistake/stacking.png" mytitle: "The issue with stacking" -mydisqus: "stacking" +pathSlug: "stacking" output: html_document: template: template_caveat.html @@ -13,7 +13,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -22,7 +22,7 @@ output: #What is stacking *** -`Stacking` is a process where a chart is broken up across more than one categoric variables which make up the whole. Each item of the categoric variable is represented by a shaded area. These areas are stacked on top of one another. +`Stacking` is a process where a chart is broken up across more than one categoric variables which make up the whole. Each item of the categoric variable is represented by a shaded area. These areas are stacked on top of one another. Here is an example with a [stacked area chart](https://www.data-to-viz.com/graph/stackedarea.html). 
It shows the evolution of baby name occurence in the US between 1880 and 2015. Six first names are represented on top of one another. @@ -36,12 +36,12 @@ library(hrbrthemes) library(plotly) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Amanda", "Jessica", "Patricia", "Deborah", "Dorothy", "Helen")) %>% filter(sex=="F") # Plot -p <- data %>% +p <- data %>% ggplot( aes(x=year, y=n, fill=name, text=name)) + geom_area( ) + scale_fill_viridis(discrete = TRUE) + @@ -52,7 +52,7 @@ p <- data %>% ggplotly(p, tooltip="text") ``` -*Note*: This graphic is interactive: hover an area to know the underlying name. +*Note*: This graphic is interactive: hover an area to know the underlying name. `Stacking` is a common practice in dataviz. It occurs on three main types of graphic that are highly related: area charts, barplots and streamcharts: @@ -93,7 +93,7 @@ The efficiency of stacked area graph is discussed and it must be used with care. #Example: mental arithmetic *** -In the previous graphic, try to find out how many times the name Dorothy was given in 1920. +In the previous graphic, try to find out how many times the name Dorothy was given in 1920. It is not trivial to find it out using the previous chart. You have to mentally do 75000 - 37000 which is hard. If you want to convey a message efficiently, you don't want the audience to perform mental arithmetic. 
@@ -115,7 +115,7 @@ don <- data.frame( ) #plot -don %>% +don %>% ggplot( aes(x=x, y=value, fill=group)) + geom_area( ) + scale_fill_viridis(discrete = TRUE) + @@ -131,7 +131,7 @@ Now let's plot just the green group to find out: ```{r, fig.align="center", message=FALSE, warning=FALSE, fig.width=6} #plot -don %>% +don %>% filter(group=="B") %>% ggplot( aes(x=x, y=value, fill=group)) + geom_area( fill="#22908C") + diff --git a/caveat/template_caveat.html b/caveat/template_caveat.html index 8883198..eac472b 100644 --- a/caveat/template_caveat.html +++ b/caveat/template_caveat.html @@ -1,607 +1,577 @@ - - - - - - - -$mytitle$ - - - - - - - - - - - - - - - - - - - - - -
[template_caveat.html: the HTML markup of this hunk was lost in extraction. Text surviving in both the removed and the added version: the page title `$mytitle$`, the tagline "A collection of common dataviz caveats by Data-to-Viz.com", the "Abstract" heading, and Pandoc template directives for myimage1, theme, author-meta, date-meta, header-includes, highlightjs, highlighting-css, abstract, css, kable-scroll, navbar, code_menu, toc_float, include-before, title, subtitle, author (name, affiliation, email), date, toc, body, include-after and mathjax-url.]
diff --git a/caveat/template_caveat_old.html b/caveat/template_caveat_old.html index b2d3624..89e3d93 100644 --- a/caveat/template_caveat_old.html +++ b/caveat/template_caveat_old.html @@ -1,607 +1,572 @@ - - - - - - - -$mytitle$ - - - - - - - - - - - - - - - - - - - - - -
[template_caveat_old.html: the HTML markup of this hunk was lost in extraction. Text surviving in both the removed and the added version: the page title `$mytitle$`, the tagline "A collection of common dataviz caveats by Data-to-Viz.com", the "Abstract" heading, and Pandoc template directives for myimage1, theme, author-meta, date-meta, header-includes, highlightjs, highlighting-css, abstract, css, kable-scroll, navbar, code_menu, toc_float, include-before, title, subtitle, author (name, affiliation, email), date, toc, body, include-after and mathjax-url.]
diff --git a/graph/arc.Rmd b/graph/arc.Rmd index 7c25710..10e4543 100644 --- a/graph/arc.Rmd +++ b/graph/arc.Rmd @@ -1,10 +1,10 @@ --- myimage: "ArcSmal.png" -mydisqus: "arc" +pathSlug: "arc" mytitle: "Arc diagram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -28,8 +28,8 @@ output: # Definition {#definition} *** -An `arc diagram` is a special kind of [network graph](https://www.data-to-viz.com/graph/network.html). It is consituted by `nodes` that represent entities and by `links` that show relationships between entities. In arc diagrams, nodes are displayed along a `single axis` and links are represented with arcs. - +An `arc diagram` is a special kind of [network graph](https://www.data-to-viz.com/graph/network.html). It is consituted by `nodes` that represent entities and by `links` that show relationships between entities. In arc diagrams, nodes are displayed along a `single axis` and links are represented with arcs. + Here is a simple example. 
Five links between 6 nodes are represented using a [2d network diagram](https://www.data-to-viz.com/graph/network.html) (left) and an arc diagram (right) @@ -53,7 +53,7 @@ links=data.frame( mygraph <- graph_from_data_frame(links) # Make the usual network diagram -p1 <- ggraph(mygraph) + +p1 <- ggraph(mygraph) + geom_edge_link(edge_colour="black", edge_alpha=0.3, edge_width=0.2) + geom_node_point( color="#69b3a2", size=5) + geom_node_text( aes(label=name), repel = TRUE, size=8, color="#69b3a2") + @@ -61,10 +61,10 @@ p1 <- ggraph(mygraph) + theme( legend.position="none", plot.margin=unit(rep(2,4), "cm") - ) + ) # Make a cord diagram -p2 <- ggraph(mygraph, layout="linear") + +p2 <- ggraph(mygraph, layout="linear") + geom_edge_arc(edge_colour="black", edge_alpha=0.3, edge_width=0.2) + geom_node_point( color="#69b3a2", size=5) + geom_node_text( aes(label=name), repel = FALSE, size=8, color="#69b3a2", nudge_y=-0.1) + @@ -72,7 +72,7 @@ p2 <- ggraph(mygraph, layout="linear") + theme( legend.position="none", plot.margin=unit(rep(2,4), "cm") - ) + ) p1 + p2 ``` @@ -106,10 +106,10 @@ Here is an example showing the co-authorship network of a researcher. 
[Vincent R dataUU <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/13_AdjacencyUndirectedUnweighted.csv", header=TRUE) # Transform the adjacency matrix in a long format -connect <- dataUU %>% +connect <- dataUU %>% gather(key="to", value="value", -1) %>% mutate(to = gsub("\\.", " ",to)) %>% - na.omit() + na.omit() # Number of connection per person c( as.character(connect$from), as.character(connect$to)) %>% @@ -127,13 +127,13 @@ com <- walktrap.community(mygraph) #max(com$membership) #Reorder dataset and make the graph -coauth <- coauth %>% +coauth <- coauth %>% mutate( grp = com$membership) %>% arrange(grp) %>% mutate(name=factor(name, name)) # keep only 10 first communities -coauth <- coauth %>% +coauth <- coauth %>% filter(grp<16) # keep only this people in edges @@ -149,7 +149,7 @@ mycolor <- colormap(colormap=colormaps$viridis, nshades=max(coauth$grp)) mycolor <- sample(mycolor, length(mycolor)) # Make the graph -ggraph(mygraph, layout="linear") + +ggraph(mygraph, layout="linear") + geom_edge_arc(edge_colour="black", edge_alpha=0.2, edge_width=0.3, fold=TRUE) + geom_node_point(aes(size=n, color=as.factor(grp), fill=grp), alpha=0.5) + scale_size_continuous(range=c(0.5,8)) + @@ -161,7 +161,7 @@ ggraph(mygraph, layout="linear") + plot.margin=unit(c(0,0,0.4,0), "null"), panel.spacing=unit(c(0,0,3.4,0), "null") ) + - expand_limits(x = c(-1.2, 1.2), y = c(-5.6, 1.2)) + expand_limits(x = c(-1.2, 1.2), y = c(-5.6, 1.2)) ``` @@ -175,7 +175,7 @@ ggraph(mygraph, layout="linear") + # Variation {#variation} *** -A possible variation in arc diagrams consists to make the links wider when the connection is stronger. To do so you need a `weighted network` where each connection as a weight. +A possible variation in arc diagrams consists to make the links wider when the connection is stronger. To do so you need a `weighted network` where each connection as a weight.
@@ -208,7 +208,7 @@ The order of nodes is the key for arc diagrams. See the following figure showing ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=15, fig.height=7} #Reorder dataset randomly -coauth <- coauth %>% +coauth <- coauth %>% slice( sample(c(1:nrow(coauth)), nrow(coauth))) # Create a graph object with igraph @@ -219,7 +219,7 @@ mycolor <- colormap(colormap=colormaps$viridis, nshades=max(coauth$grp)) mycolor <- sample(mycolor, length(mycolor)) # Make the graph -ggraph(mygraph, layout="linear") + +ggraph(mygraph, layout="linear") + geom_edge_arc(edge_colour="black", edge_alpha=0.2, edge_width=0.3, fold=TRUE) + geom_node_point(aes(size=n, color=as.factor(grp), fill=grp), alpha=0.5) + scale_size_continuous(range=c(0.5,8)) + @@ -231,7 +231,7 @@ ggraph(mygraph, layout="linear") + plot.margin=unit(c(0,0,0.4,0), "null"), panel.spacing=unit(c(0,0,3.4,0), "null") ) + - expand_limits(x = c(-1.2, 1.2), y = c(-5.6, 1.2)) + expand_limits(x = c(-1.2, 1.2), y = c(-5.6, 1.2)) ``` @@ -277,7 +277,7 @@ ggraph(mygraph, layout="linear") +

Edge bundling

Show connections between entities organized in a hierarchy.

-
+
diff --git a/graph/area.Rmd b/graph/area.Rmd index 5bbecb4..d13d645 100644 --- a/graph/area.Rmd +++ b/graph/area.Rmd @@ -1,10 +1,10 @@ --- myimage: "AreaSmall.png" -mydisqus: "area" +pathSlug: "area" mytitle: "Area chart" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -27,12 +27,12 @@ output: # Definition {#definition} *** -An `area chart` is really similar to a [line chart](https://www.data-to-viz.com/graph/line.html) and represents the evolution of a numeric variable. Basically, the X axis represents time or an ordered variable, and the Y axis gives the value of another variable. Data points are connected by straight line segments and the area between the x axis and the line is filled in with color or shading. +An `area chart` is really similar to a [line chart](https://www.data-to-viz.com/graph/line.html) and represents the evolution of a numeric variable. Basically, the X axis represents time or an ordered variable, and the Y axis gives the value of another variable. Data points are connected by straight line segments and the area between the x axis and the line is filled in with color or shading.
-The following example shows the evolution of the [bitcoin price](https://www.data-to-viz.com/story/TwoNumOrdered.html) between April 2013 and April 2018. Data comes from the [CoinMarketCap](https://www.data-to-viz.com/story/TwoNumOrdered.html) website. +The following example shows the evolution of the [bitcoin price](https://www.data-to-viz.com/story/TwoNumOrdered.html) between April 2013 and April 2018. Data comes from the [CoinMarketCap](https://www.data-to-viz.com/story/TwoNumOrdered.html) website. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=5, fig.width=10} # Libraries @@ -72,7 +72,7 @@ Area chart can also be used to show the evolution of `several variables`. The mo ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=8} # Load dataset from github -don <- babynames %>% +don <- babynames %>% filter(name %in% c("Ashley", "Amanda", "Mary", "Deborah", "Dorothy", "Betty", "Helen", "Jennifer", "Shirley")) %>% filter(sex=="F") diff --git a/graph/barplot.Rmd b/graph/barplot.Rmd index 64f9375..65789a0 100644 --- a/graph/barplot.Rmd +++ b/graph/barplot.Rmd @@ -1,10 +1,10 @@ --- myimage: "BarSmall.png" -mydisqus: "barplot" +pathSlug: "barplot" mytitle: "Barplot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -86,12 +86,12 @@ library(babynames) library(viridis) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Ashley", "Amanda", "Jessica", "Patricia", "Linda", "Deborah", "Dorothy", "Betty", "Helen")) %>% filter(sex=="F") # A grouped barplot -data %>% +data %>% filter(name %in% c("Ashley", "Patricia", "Betty", "Helen")) %>% filter(year %in% c(1920, 1960, 2000)) %>% mutate(year=as.factor(year)) %>% @@ -100,7 +100,7 @@ data %>% 
scale_fill_viridis(discrete=TRUE, name="") + theme_ipsum() + ylab("Number of baby") - + ``` @@ -108,7 +108,7 @@ Instead of puting the bars one beside each other it is possible to stack them, r ```{r, warning=FALSE, message=FALSE, fig.width=8, fig.align="center" } # A grouped barplot -data %>% +data %>% filter(name %in% c("Ashley", "Patricia", "Betty", "Helen")) %>% filter(year %in% c(1920, 1960, 2000)) %>% mutate(year=as.factor(year)) %>% @@ -188,9 +188,9 @@ ggplot(tmp, aes(x=as.factor(id), y=Value)) + # Note that id is a factor. I axis.text = element_blank(), axis.title = element_blank(), panel.grid = element_blank(), - plot.margin = unit(rep(-1,4), "cm") + plot.margin = unit(rep(-1,4), "cm") ) + - coord_polar(start = 0) + + coord_polar(start = 0) + geom_text(data=label_tmp, aes(x=id, y=Value+200, label=Country ), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_tmp$angle, hjust=label_tmp$hjust, inherit.aes = FALSE ) + geom_text( aes(x=24, y=8000, label="Who sells more weapons?"), color="black", inherit.aes = FALSE, data = data.frame()) ``` @@ -204,9 +204,9 @@ ggplot(tmp, aes(x=as.factor(id), y=Value)) + # Note that id is a factor. I # Common mistakes {#mistake} *** -- Do not confound barchart with [histogram](https://www.data-to-viz.com/graph/histogram.html). A histogram has only a numeric variable as input and shows its distribution. +- Do not confound barchart with [histogram](https://www.data-to-viz.com/graph/histogram.html). A histogram has only a numeric variable as input and shows its distribution. -- [Order your bars](http://www.data-to-viz.com/caveat/order_data.html). If the levels of your categoric variable have no obvious order, order the bars following their values. +- [Order your bars](http://www.data-to-viz.com/caveat/order_data.html). If the levels of your categoric variable have no obvious order, order the bars following their values. - Several values per group? 
[Don't use a barplot](http://www.data-to-viz.com/caveat/error_bar.html). Even with error bars, it hides information and other type of graphic like [boxplot](https://www.data-to-viz.com/caveat/boxplot.html) or [violin](https://www.data-to-viz.com/graph/violin.html) are much more appropriate. diff --git a/graph/bubble.Rmd b/graph/bubble.Rmd index cab32e7..cc511aa 100644 --- a/graph/bubble.Rmd +++ b/graph/bubble.Rmd @@ -1,10 +1,10 @@ --- myimage: "BubblePlotSmall.png" -mydisqus: "bubble" +pathSlug: "bubble" mytitle: "Bubble plot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -147,7 +147,7 @@ p2 <- data %>% scale_color_viridis(discrete=TRUE) + scale_y_log10() + theme_ipsum() + - theme(legend.position="none") + theme(legend.position="none") p3 <- data %>% mutate(pop=pop/1000000) %>% @@ -158,7 +158,7 @@ p3 <- data %>% scale_color_viridis(discrete=TRUE) + scale_y_log10() + theme_ipsum() + - theme(legend.position="none") + theme(legend.position="none") grid.arrange(p2,p3, ncol=2) ``` diff --git a/graph/bubblemap.Rmd b/graph/bubblemap.Rmd index 4190024..c81b788 100644 --- a/graph/bubblemap.Rmd +++ b/graph/bubblemap.Rmd @@ -1,10 +1,10 @@ --- myimage: "BubbleMapSmall.png" -mydisqus: "bubblemap" +pathSlug: "bubblemap" mytitle: "Bubble map" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -29,7 +29,7 @@ output: # Definition {#definition} *** -A `bubble map` uses circles of different size to represent a numeric value on a territory. 
It displays one bubble per geographic coordinate, or one bubble per region (in this case the bubble is usually displayed in the baricentre of the region). +A `bubble map` uses circles of different size to represent a numeric value on a territory. It displays one bubble per geographic coordinate, or one bubble per region (in this case the bubble is usually displayed in the baricentre of the region). Here is an example showing the geographic position of about [200k tweets](https://www.data-to-viz.com/story/GPSCoordWithoutValue.html) containing the hashtags #surf, #windsurf or #kitesurf. See more about this project [here](https://www.data-to-viz.com/story/GPSCoordWithoutValue.html). @@ -72,7 +72,7 @@ p <- data %>% xlim(-180,180) + ylim(-60,80) + scale_x_continuous(expand = c(0.006, 0.006)) + - coord_equal() + coord_equal() # Save at PNG ggsave("IMG/Surfer_bubble.png", width = 36, height = 15.22, units = "in", dpi = 90) @@ -111,25 +111,25 @@ Interactivity is appreciated for bubble maps. It allows to zoom on a specific pa ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=4, fig.width=9, cache=TRUE} # Library library(leaflet) - + # load example data (Fiji Earthquakes) + keep only 100 first lines data(quakes) quakes = head(quakes, 100) - + # Create a color palette with handmade bins. mybins=seq(4, 6.5, by=0.5) mypalette = colorBin( palette="YlOrBr", domain=quakes$mag, na.color="transparent", bins=mybins) - + # Prepare the text for the tooltip: mytext=paste("Depth: ", quakes$depth, "
", "Stations: ", quakes$stations, "
", "Magnitude: ", quakes$mag, sep="") %>% lapply(htmltools::HTML) - + # Final Map -leaflet(quakes) %>% - addTiles() %>% +leaflet(quakes) %>% + addTiles() %>% setView( lat=-27, lng=170 , zoom=4) %>% addProviderTiles("Esri.WorldImagery") %>% - addCircles(~long, ~lat, + addCircles(~long, ~lat, fillColor = ~mypalette(mag), fillOpacity = 0.7, color="white", radius=~sqrt(depth)*3000, stroke=FALSE, weight = 1, label = mytext, labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto") diff --git a/graph/cartogram.Rmd b/graph/cartogram.Rmd index 2bec5d0..5e2c34b 100644 --- a/graph/cartogram.Rmd +++ b/graph/cartogram.Rmd @@ -1,10 +1,10 @@ --- myimage: "CartogramSmall.png" -mydisqus: "cartogram" +pathSlug: "cartogram" mytitle: "Cartogram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -28,7 +28,7 @@ output: # Definition {#definition} *** -A `cartogram` is a map in which the geometry of regions is `distorted` in order to convey the information of an alternate variable. The region area will be inflated or deflated according to its numeric value. +A `cartogram` is a map in which the geometry of regions is `distorted` in order to convey the information of an alternate variable. The region area will be inflated or deflated according to its numeric value. Most of the time, a cartogram is also a [choropleth map](https://www.data-to-viz.com/graph/choropleth.html) where regions are colored according to a numeric variable (not necessarily the one use to build the cartogram). @@ -47,7 +47,7 @@ afr=wrld_simpl[wrld_simpl$REGION==2,] # Usual choropleth map: spdf_fortified <- tidy(afr) -spdf_fortified = spdf_fortified %>% left_join(. 
, afr@data, by=c("id"="ISO3")) +spdf_fortified = spdf_fortified %>% left_join(. , afr@data, by=c("id"="ISO3")) p1 <- ggplot() + geom_polygon(data = spdf_fortified, aes(fill = POP2005/1000000, x = long, y = lat, group = group) , size=0, alpha=0.9) + theme_void() + @@ -55,7 +55,7 @@ p1 <- ggplot() + labs( title = "Real boundaries" ) + ylim(-35,35) + theme( - text = element_text(color = "#22211d"), + text = element_text(color = "#22211d"), plot.title = element_text(size= 22, hjust=0.5, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")), legend.position = c(0.2, 0.26) ) + @@ -63,15 +63,15 @@ p1 <- ggplot() + # construct a cartogram using the population in 2005 afr_cartogram <- cartogram(afr, "POP2005", itermax=5) - + # It is a new geospatial object: we can use all the usual techniques on it! Let's start with a basic ggplot2 chloropleth map: spdf_fortified <- tidy(afr_cartogram) -spdf_fortified = spdf_fortified %>% left_join(. , afr_cartogram@data, by=c("id"="ISO3")) +spdf_fortified = spdf_fortified %>% left_join(. , afr_cartogram@data, by=c("id"="ISO3")) ggplot() + geom_polygon(data = spdf_fortified, aes(fill = POP2005, x = long, y = lat, group = group) , size=0, alpha=0.9) + coord_map() + theme_void() - + # As seen before, we can do better with a bit of customization p2 <- ggplot() + geom_polygon(data = spdf_fortified, aes(fill = POP2005/1000000, x = long, y = lat, group = group) , size=0, alpha=0.9) + @@ -80,7 +80,7 @@ p2 <- ggplot() + labs( title = "Cartogram" ) + ylim(-35,35) + theme( - text = element_text(color = "#22211d"), + text = element_text(color = "#22211d"), plot.title = element_text(size= 22, hjust=0.5, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")), legend.position = c(0.2, 0.26) ) + @@ -92,7 +92,7 @@ ggsave(p2, filename="IMG/cartogram2.png", dpi=100, width=7, height=7) ```
- +
img @@ -116,7 +116,7 @@ The above maps illustrate the difference between real african country boundaries # What for *** -Cartogram aims to correct the bias that can be observed in a [choropleth map](https://www.data-to-viz.com/graph/choropleth.html): when a variable is aggregated per region, a region with very few data points will look as important as a region with many data points. +Cartogram aims to correct the bias that can be observed in a [choropleth map](https://www.data-to-viz.com/graph/choropleth.html): when a variable is aggregated per region, a region with very few data points will look as important as a region with many data points. For instance, imagine you display the average salary per region on your choropleth map. A region with 3 inhabitants with a huge area will have more importance on your map than a small one with 3,000 inhabitants, what induces a strong bias. The cartogram aims to reduce this bias. @@ -132,31 +132,31 @@ For instance, imagine you display the average salary per region on your chorople library(tidyverse) library(geojsonio) library(RColorBrewer) - + # Hexbin available in the geojson format here: https://team.carto.com/u/andrew/tables/andrew.us_states_hexgrid/public/map. Download it and load it in R: spdf <- geojson_read("us_states_hexgrid.geojson.json", what = "sp") spdf@data = spdf@data %>% mutate(google_name = gsub(" \\(United States\\)", "", google_name)) - + # Load the population per states (source: https://www.census.gov/data/tables/2017/demo/popest/nation-total.html) pop=read.table("https://www.r-graph-gallery.com/wp-content/uploads/2018/01/pop_US.csv", sep=",", header=T) pop$pop = pop$pop / 1000000 - + # merge both spdf@data = spdf@data %>% left_join(., pop, by=c("google_name"="state")) - + # Compute the cartogram, using this population information cartogram <- cartogram(spdf, 'pop') - + # First look! 
plot(cartogram) - + # tidy data to be drawn by ggplot2 (broom library of the tidyverse) carto_fortified <- tidy(cartogram, region = "google_name") -carto_fortified = carto_fortified %>% left_join(. , cartogram@data, by=c("id"="google_name")) - +carto_fortified = carto_fortified %>% left_join(. , cartogram@data, by=c("id"="google_name")) + # Calculate the position of state labels centers <- cbind.data.frame(data.frame(gCentroid(cartogram, byid=TRUE), id=cartogram@data$iso3166_2)) - + # plot ggplot() + geom_polygon(data = carto_fortified, aes(fill = pop, x = long, y = lat, group = group) , size=0.05, alpha=0.9, color="black") + @@ -168,8 +168,8 @@ ggplot() + legend.position = c(0.5, 0.9), legend.direction = "horizontal", text = element_text(color = "#22211d"), - plot.background = element_rect(fill = "#f5f5f9", color = NA), - panel.background = element_rect(fill = "#f5f5f9", color = NA), + plot.background = element_rect(fill = "#f5f5f9", color = NA), + panel.background = element_rect(fill = "#f5f5f9", color = NA), legend.background = element_rect(fill = "#f5f5f9", color = NA), plot.title = element_text(size= 22, hjust=0.5, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")), ) + diff --git a/graph/chord.Rmd b/graph/chord.Rmd index 688d7b3..9053bea 100644 --- a/graph/chord.Rmd +++ b/graph/chord.Rmd @@ -1,10 +1,10 @@ --- myimage: "ChordSmall.png" -mydisqus: "chord" +pathSlug: "chord" mytitle: "Chord diagram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -66,49 +66,49 @@ mycolor <- mycolor[sample(1:10)] # Base plot chordDiagram( - x = data_long, + x = data_long, grid.col = mycolor, transparency = 0.25, directional = 1, - direction.type = c("arrows", "diffHeight"), + direction.type = c("arrows", "diffHeight"), diffHeight 
= -0.04, - annotationTrack = "grid", + annotationTrack = "grid", annotationTrackHeight = c(0.05, 0.1), - link.arr.type = "big.arrow", - link.sort = TRUE, + link.arr.type = "big.arrow", + link.sort = TRUE, link.largest.ontop = TRUE) # Add text and axis circos.trackPlotRegion( - track.index = 1, - bg.border = NA, + track.index = 1, + bg.border = NA, panel.fun = function(x, y) { - + xlim = get.cell.meta.data("xlim") sector.index = get.cell.meta.data("sector.index") - - # Add names to the sector. + + # Add names to the sector. circos.text( - x = mean(xlim), - y = 3.2, - labels = sector.index, - facing = "bending", + x = mean(xlim), + y = 3.2, + labels = sector.index, + facing = "bending", cex = 0.8 ) # Add graduation on axis circos.axis( - h = "top", - major.at = seq(from = 0, to = xlim[2], by = ifelse(test = xlim[2]>10, yes = 2, no = 1)), - minor.ticks = 1, + h = "top", + major.at = seq(from = 0, to = xlim[2], by = ifelse(test = xlim[2]>10, yes = 2, no = 1)), + minor.ticks = 1, major.tick.percentage = 0.5, labels.niceFacing = FALSE) } ) - + ``` -*Note*: this plot is made using the circlize library, and very strongly inspired from the [Migest package](https://github.com/cran/migest) from [Gui J. Abel](http://guyabel.com). Read more about this story [here](https://www.data-to-viz.com/story/AdjacencyMatrix.html). +*Note*: this plot is made using the circlize library, and very strongly inspired from the [Migest package](https://github.com/cran/migest) from [Gui J. Abel](http://guyabel.com). Read more about this story [here](https://www.data-to-viz.com/story/AdjacencyMatrix.html). @@ -123,8 +123,8 @@ Chord diagrams are eye catching and quite popular in data visualization. They al - One asymetric arc per pair - Two arcs per pair - -- Bipartite: nodes are grouped in a few categories. Connections go *between* categories but not *within* categories. In my opinion [sankey diagrams](https://www.data-to-viz.com/graph/sankey.html) are more adapted in this situation. 
+ +- Bipartite: nodes are grouped in a few categories. Connections go *between* categories but not *within* categories. In my opinion [sankey diagrams](https://www.data-to-viz.com/graph/sankey.html) are more adapted in this situation. @@ -204,7 +204,7 @@ chorddiag(m, groupColors = groupColors, groupnamePadding = 20)

Edge bundling

Show connections between entities organized in a hierarchy.

-
+
diff --git a/graph/choropleth.Rmd b/graph/choropleth.Rmd index a7bc269..bf3b772 100644 --- a/graph/choropleth.Rmd +++ b/graph/choropleth.Rmd @@ -1,10 +1,10 @@ --- myimage: "ChoroplethSmall.png" -mydisqus: "choropleth" +pathSlug: "choropleth" mytitle: "Choropleth map" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -32,7 +32,7 @@ A `choropleth map` displays divided geographical areas or regions that are colou
-Here is an example describing the distribution of restaurants in the south of france. +Here is an example describing the distribution of restaurants in the south of france. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=10} # Libraries @@ -72,19 +72,19 @@ p <- ggplot(spdf_fortified) + scale_fill_viridis(direction=-1, trans = "log", breaks=c(1,5,10,20,50,100), name="Number of restaurant", guide = guide_legend( keyheight = unit(3, units = "mm"), keywidth=unit(12, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) ) + labs( title = "South of France Restaurant concentration", - subtitle = "Number of restaurant per city district", + subtitle = "Number of restaurant per city district", caption = "Data: INSEE | Creation: Yan Holtz | r-graph-gallery.com" ) + theme( - text = element_text(color = "#22211d"), - plot.background = element_rect(fill = "#f5f5f2", color = NA), - panel.background = element_rect(fill = "#f5f5f2", color = NA), + text = element_text(color = "#22211d"), + plot.background = element_rect(fill = "#f5f5f2", color = NA), + panel.background = element_rect(fill = "#f5f5f2", color = NA), legend.background = element_rect(fill = "#f5f5f2", color = NA), - + plot.title = element_text(size= 15, hjust=0.01, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")), plot.subtitle = element_text(size= 12, hjust=0.01, color = "#4e4d47", margin = margin(b = -0.1, t = 0.43, l = 2, unit = "cm")), plot.caption = element_text( size=8, color = "#4e4d47", margin = margin(b = 0.3, r=-99, unit = "cm") ), - + legend.position = c(0.7, 0.09) ) + coord_sf(datum = NA) @@ -95,7 +95,7 @@ p *Note*: Boundaries of city districts come from [here](https://github.com/gregoiredavid/france-geojson). Number of restaurant per district comes from [here](https://www.insee.fr/fr/statistiques). -*Important Note*: Here, the absolute number of restaurant per district is shown. 
Keep in mind that an important bias is present: districts with large area and / or high number of inhabitants are more prone to have a lot of restaurants. +*Important Note*: Here, the absolute number of restaurant per district is shown. Keep in mind that an important bias is present: districts with large area and / or high number of inhabitants are more prone to have a lot of restaurants. diff --git a/graph/circularbarplot.Rmd b/graph/circularbarplot.Rmd index c482773..47cb573 100644 --- a/graph/circularbarplot.Rmd +++ b/graph/circularbarplot.Rmd @@ -1,10 +1,10 @@ --- myimage: "CircularBarplotSmall.png" -mydisqus: "circularbarplot" +pathSlug: "circularbarplot" mytitle: "Circular Barplot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -87,15 +87,15 @@ ggplot(tmp, aes(x=as.factor(id), y=Value)) + # Note that id is a factor. I axis.text = element_blank(), axis.title = element_blank(), panel.grid = element_blank(), - plot.margin = unit(rep(-1,4), "cm") + plot.margin = unit(rep(-1,4), "cm") ) + - coord_polar(start = 0) + + coord_polar(start = 0) + geom_text(data=label_tmp, aes(x=id, y=Value+200, label=Country ), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_tmp$angle, hjust=label_tmp$hjust, inherit.aes = FALSE ) + geom_text( aes(x=24, y=8000, label="Who sells more weapons?"), color="black", inherit.aes = FALSE) ``` -*Note*: +*Note*: - Here no Y scale is displayed since exact values are written on each bar. - More representation of this dataset are available [here](http://www.data-to-viz.com/story/OneNumOneCat.html), with further explanation. 
@@ -128,41 +128,41 @@ to_add$group=rep(levels(data$group), each=empty_bar) data=rbind(data, to_add) data=data %>% arrange(group) data$id=seq(1, nrow(data)) - + # Get the name and the y position of each label label_data=data number_of_bar=nrow(label_data) angle= 90 - 360 * (label_data$id-0.5) /number_of_bar # I substract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0) label_data$hjust<-ifelse( angle < -90, 1, 0) label_data$angle<-ifelse(angle < -90, angle+180, angle) - + # prepare a data frame for base lines -base_data=data %>% - group_by(group) %>% - summarize(start=min(id), end=max(id) - empty_bar) %>% - rowwise() %>% +base_data=data %>% + group_by(group) %>% + summarize(start=min(id), end=max(id) - empty_bar) %>% + rowwise() %>% mutate(title=mean(c(start, end))) - + # prepare a data frame for grid (scales) grid_data = base_data grid_data$end = grid_data$end[ c( nrow(grid_data), 1:nrow(grid_data)-1)] + 1 grid_data$start = grid_data$start - 1 grid_data=grid_data[-1,] - + # Make the plot p = ggplot(data, aes(x=as.factor(id), y=value, fill=group)) + # Note that id is a factor. If x is numeric, there is some space between the first bar - + geom_bar(aes(x=as.factor(id), y=value, fill=group), stat="identity", alpha=0.5) + - + # Add a val=100/75/50/25 lines. I do it at the beginning to make sur barplots are OVER it. 
geom_segment(data=grid_data, aes(x = end, y = 80, xend = start, yend = 80), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 60, xend = start, yend = 60), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 40, xend = start, yend = 40), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 20, xend = start, yend = 20), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + - + # Add text showing the value of each 100/75/50/25 lines annotate("text", x = rep(max(data$id),4), y = c(20, 40, 60, 80), label = c("20", "40", "60", "80") , color="grey", size=3 , angle=0, fontface="bold", hjust=1) + - + geom_bar(aes(x=as.factor(id), y=value, fill=group), stat="identity", alpha=0.5) + ylim(-100,120) + theme_minimal() + @@ -171,17 +171,17 @@ p = ggplot(data, aes(x=as.factor(id), y=value, fill=group)) + # Note that axis.text = element_blank(), axis.title = element_blank(), panel.grid = element_blank(), - plot.margin = unit(rep(-1,4), "cm") + plot.margin = unit(rep(-1,4), "cm") ) + - coord_polar() + + coord_polar() + geom_text(data=label_data, aes(x=id, y=value+10, label=individual, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_data$angle, inherit.aes = FALSE ) + - + # Add base line information geom_segment(data=base_data, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) + geom_text(data=base_data, aes(x = title, y = -18, label=group), hjust=c(1,1,0,0), colour = "black", alpha=0.8, size=4, fontface="bold", inherit.aes = FALSE) - + p - + ``` @@ -200,10 +200,10 @@ data=data.frame( value2=sample( seq(10,100), 60, replace=T), value3=sample( seq(10,100), 60, replace=T) ) - + # Transform data in a tidy format (long format) -data = data %>% gather(key = "observation", value="value", -c(1,2)) - +data = data %>% gather(key = 
"observation", value="value", -c(1,2)) + # Set a number of 'empty bar' to add at the end of each group empty_bar=2 nObsType=nlevels(as.factor(data$observation)) @@ -213,44 +213,44 @@ to_add$group=rep(levels(data$group), each=empty_bar*nObsType ) data=rbind(data, to_add) data=data %>% arrange(group, individual) data$id=rep( seq(1, nrow(data)/nObsType) , each=nObsType) - + # Get the name and the y position of each label label_data= data %>% group_by(id, individual) %>% summarize(tot=sum(value)) number_of_bar=nrow(label_data) angle= 90 - 360 * (label_data$id-0.5) /number_of_bar # I substract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0) label_data$hjust<-ifelse( angle < -90, 1, 0) label_data$angle<-ifelse(angle < -90, angle+180, angle) - + # prepare a data frame for base lines -base_data=data %>% - group_by(group) %>% - summarize(start=min(id), end=max(id) - empty_bar) %>% - rowwise() %>% +base_data=data %>% + group_by(group) %>% + summarize(start=min(id), end=max(id) - empty_bar) %>% + rowwise() %>% mutate(title=mean(c(start, end))) - + # prepare a data frame for grid (scales) grid_data = base_data grid_data$end = grid_data$end[ c( nrow(grid_data), 1:nrow(grid_data)-1)] + 1 grid_data$start = grid_data$start - 1 grid_data=grid_data[-1,] - + # Make the plot -p = ggplot(data) + - +p = ggplot(data) + + # Add the stacked bar geom_bar(aes(x=as.factor(id), y=value, fill=observation), stat="identity", alpha=0.5) + scale_fill_viridis(discrete=TRUE) + - + # Add a val=100/75/50/25 lines. I do it at the beginning to make sur barplots are OVER it. 
geom_segment(data=grid_data, aes(x = end, y = 0, xend = start, yend = 0), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 50, xend = start, yend = 50), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 100, xend = start, yend = 100), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 150, xend = start, yend = 150), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 200, xend = start, yend = 200), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + - + # Add text showing the value of each 100/75/50/25 lines annotate("text", x = rep(max(data$id),5), y = c(0, 50, 100, 150, 200), label = c("0", "50", "100", "150", "200") , color="grey", size=2 , angle=0, fontface="bold", hjust=1) + - + ylim(-150,max(label_data$tot, na.rm=T)) + theme_minimal() + theme( @@ -258,13 +258,13 @@ p = ggplot(data) + axis.text = element_blank(), axis.title = element_blank(), panel.grid = element_blank(), - plot.margin = unit(rep(-1,4), "cm") + plot.margin = unit(rep(-1,4), "cm") ) + coord_polar() + - + # Add labels on top of each bar geom_text(data=label_data, aes(x=id, y=tot+10, label=individual, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=1, angle= label_data$angle, inherit.aes = FALSE ) + - + # Add base line information geom_segment(data=base_data, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) + geom_text(data=base_data, aes(x = title, y = -18, label=group), hjust=c(1,1,0,0), colour = "black", alpha=0.8, size=4, fontface="bold", inherit.aes = FALSE) @@ -300,18 +300,18 @@ to_add$group=rep(levels(data$group), each=empty_bar) data=rbind(data, to_add) data=data %>% arrange(group) data$id=seq(1, nrow(data)) - + # Get the name and the y position of each label label_data=data 
number_of_bar=nrow(label_data) angle= 90 - 360 * (label_data$id-0.5) /number_of_bar # I substract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0) label_data$hjust<-ifelse( angle < -90, 1, 0) label_data$angle<-ifelse(angle < -90, angle+180, angle) - + # Make the plot p = ggplot(data, aes(x=as.factor(id), y=value, fill=group)) + # Note that id is a factor. If x is numeric, there is some space between the first bar - + geom_bar(aes(x=as.factor(id), y=value, fill=group), stat="identity", alpha=0.5) + ylim(-10,120) + theme_minimal() + @@ -320,20 +320,20 @@ p = ggplot(data, aes(x=as.factor(id), y=value, fill=group)) + # Note that axis.text = element_blank(), axis.title = element_blank(), panel.grid = element_blank(), - plot.margin = unit(rep(-1,4), "cm") + plot.margin = unit(rep(-1,4), "cm") ) + - coord_polar() + - geom_text(data=label_data, aes(x=id, y=value+10, label=individual, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_data$angle, inherit.aes = FALSE ) - + coord_polar() + + geom_text(data=label_data, aes(x=id, y=value+10, label=individual, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_data$angle, inherit.aes = FALSE ) + p - + ``` - Works only if you have many levels to display (> ~40) and a clear pattern - Keep displaying a Y axis all along the circle. -- [Order your bars](http://www.data-to-viz.com/caveat/order_data.html). If the levels of your categoric variable have no obvious order, order the bars following their values. +- [Order your bars](http://www.data-to-viz.com/caveat/order_data.html). If the levels of your categoric variable have no obvious order, order the bars following their values. - Several values per group? [Don't use a barplot](http://www.data-to-viz.com/caveat/error_bar.html). 
Even with error bars, it hides information and other type of graphic like [boxplot](https://www.data-to-viz.com/caveat/boxplot.html) or [violin](https://www.data-to-viz.com/graph/violin.html) are much more appropriate. diff --git a/graph/circularpacking.Rmd b/graph/circularpacking.Rmd index 093a3c9..e70f64d 100644 --- a/graph/circularpacking.Rmd +++ b/graph/circularpacking.Rmd @@ -1,10 +1,10 @@ --- myimage: "CircularPackingSmall.png" -mydisqus: "circularpacking" +pathSlug: "circularpacking" mytitle: "Circular Packing" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -29,7 +29,7 @@ output: # Definition {#definition} *** -`Circular packing` or circular treemap allows to visualize a hierarchic organization. It is an equivalent of a [treemap](https://www.data-to-viz.com/graph/treemap.html) or a [dendrogram](https://www.data-to-viz.com/graph/dendrogram.html), where each node of the tree is represented as a circle and its sub-nodes are represented as circles inside of it. The size of each circle can be proportional to a specific value, what gives more insight to the plot. +`Circular packing` or circular treemap allows to visualize a hierarchic organization. It is an equivalent of a [treemap](https://www.data-to-viz.com/graph/treemap.html) or a [dendrogram](https://www.data-to-viz.com/graph/dendrogram.html), where each node of the tree is represented as a circle and its sub-nodes are represented as circles inside of it. The size of each circle can be proportional to a specific value, what gives more insight to the plot. Here is an example showing the [repartition of the world population](https://www.data-to-viz.com/story/SevCatOneNumNestedOneObsPerGroup.html) of 250 countries. 
The world is divided in continent (group), regions (subgroup), and countries. Countries are considered as leaves: they are at the end of the branches. @@ -53,7 +53,7 @@ data <- data %>% filter(Continent!="") %>% droplevels() library(data.tree) data$pathString <- paste("world", data$Continent, data$Region, data$Country, sep = "/") population <- as.Node(data) - + # You can custom the minimum and maximum value of the color range. circlepackeR(population, size = "Pop", color_min = "hsl(56,80%,80%)", color_max = "hsl(341,30%,40%)") ``` @@ -69,7 +69,7 @@ circlepackeR(population, size = "Pop", color_min = "hsl(56,80%,80%)", color_max # What for *** -Circle packing is not recommended if you need to precisely compare values of group. Indeed, it is hard for the human eye to [translate an area into an accurate number](https://www.data-to-viz.com/caveat/area_hard.html). If you need accuracy, use a [barplot](https://www.data-to-viz.com/graph/barplot.html) or a [lollipop](https://www.data-to-viz.com/graph/lollipop.html) plot instead. +Circle packing is not recommended if you need to precisely compare values of group. Indeed, it is hard for the human eye to [translate an area into an accurate number](https://www.data-to-viz.com/caveat/area_hard.html). If you need accuracy, use a [barplot](https://www.data-to-viz.com/graph/barplot.html) or a [lollipop](https://www.data-to-viz.com/graph/lollipop.html) plot instead. However, circular packing shows very well how groups are organised in subgroups. It uses the space a bit less efficiently than a [treemap](https://www.data-to-viz.com/graph/treemap.html), but the hierarchy gets very neat. @@ -86,18 +86,18 @@ When using circular packing I really like to remove the first or two first level library(ggraph) library(igraph) library(viridis) - + # We need a data frame giving a hierarchical structure. 
Let's consider the flare dataset: edges=flare$edges vertices = flare$vertices mygraph <- graph_from_data_frame( edges, vertices=vertices ) - + # Second one: add 2 first levels -ggraph(mygraph, layout = 'circlepack', weight=size) + +ggraph(mygraph, layout = 'circlepack', weight=size) + geom_node_circle(aes(fill = as.factor(depth), color = as.factor(depth) )) + scale_fill_manual(values=c("0" = "white", "1" = "white", "2" = magma(4)[2], "3" = magma(4)[3], "4"=magma(4)[4])) + scale_color_manual( values=c("0" = "white", "1" = "white", "2" = "black", "3" = "black", "4"="black") ) + - theme_void() + + theme_void() + theme(legend.position="FALSE") ``` diff --git a/graph/connectedscatter.Rmd b/graph/connectedscatter.Rmd index f21022e..eca9312 100644 --- a/graph/connectedscatter.Rmd +++ b/graph/connectedscatter.Rmd @@ -1,10 +1,10 @@ --- myimage: "ScatterConnectedSmall.png" -mydisqus: "connectedscatter" +pathSlug: "connectedscatter" mytitle: "Connected Scatterplot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -29,9 +29,9 @@ output: # Definition {#definition} *** -A `connected scatterplot` displays the evolution of a numeric variable. Data points are represented by a dot and connected by straight line segments. It often shows a trend in data over intervals of time: a time series. Basically it is the same as a [line plot](https://www.data-to-viz.com/graph/line.html) in most of the cases, except that individual observation are highlighted. +A `connected scatterplot` displays the evolution of a numeric variable. Data points are represented by a dot and connected by straight line segments. It often shows a trend in data over intervals of time: a time series. 
Basically it is the same as a [line plot](https://www.data-to-viz.com/graph/line.html) in most of the cases, except that individual observation are highlighted. -The following example shows the evolution of the [bitcoin price](https://www.data-to-viz.com/story/TwoNumOrdered.html) in April 2018. Data comes from the [CoinMarketCap](https://www.data-to-viz.com/story/TwoNumOrdered.html) website. +The following example shows the evolution of the [bitcoin price](https://www.data-to-viz.com/story/TwoNumOrdered.html) in April 2018. Data comes from the [CoinMarketCap](https://www.data-to-viz.com/story/TwoNumOrdered.html) website. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=4, fig.width=7} @@ -127,7 +127,7 @@ We can first visualize the evolution of both names using a usual line plot with library(babynames) # Load dataset -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Ashley", "Amanda")) %>% filter(sex=="F") @@ -141,7 +141,7 @@ data %>% theme_ipsum() ``` -This is an accurate way to visualize the information. However, it would be much harder to build it if both variables would not share the same unit. In this case, it would require a dual axis line chart that is known to be very misleading. +This is an accurate way to visualize the information. However, it would be much harder to build it if both variables would not share the same unit. In this case, it would require a dual axis line chart that is known to be very misleading. 
The connected scatterplot can be a good workaround in this situation: @@ -158,13 +158,13 @@ tmp <- data %>% # data for date tmp_date <- tmp %>% sample_frac(0.3) -tmp%>% +tmp%>% ggplot(aes(x=Amanda, y=Ashley, label=year)) + geom_point(color="#69b3a2") + geom_text_repel(data=tmp_date) + - geom_segment(color="#69b3a2", + geom_segment(color="#69b3a2", aes( - xend=c(tail(Amanda, n=-1), NA), + xend=c(tail(Amanda, n=-1), NA), yend=c(tail(Ashley, n=-1), NA) ), arrow=arrow(length=unit(0.3,"cm")) @@ -181,7 +181,7 @@ Here the history of both names is obvious. They were not popular at all in 1972 - This graph is not adapted for all audience. At least, you need to educate the audience with progressive explanation to make it impactful. -*Going further*: +*Going further*: - *The Connected Scatterplot for Presenting Paired Time Series* by [Haroz et al](http://steveharoz.com/research/connected_scatterplot/). - A nice and famous example of story telling by the [New York Times](https://archive.nytimes.com/www.nytimes.com/interactive/2012/09/17/science/driving-safety-in-fits-and-starts.html?smid=tw-share) @@ -223,7 +223,7 @@ p1 + p2 ``` - + - If you need to compare the evolution of 2 different variables, do not use [dual axis](https://www.data-to-viz.com/caveat/dual_axis.html). Indeed dual axis can show very different results depending on what range you apply to the axis. [Read more about it](https://www.data-to-viz.com/caveat/dual_axis.html). - Mind the [spaghetti chart](https://www.data-to-viz.com/caveat/spaghetti.html): too many lines make the chart unreadable. - Think about the [aspect ratio](https://www.data-to-viz.com/caveat/aspect_ratio.html) of the graphic, extreme ratio make the chart unreadable. 
diff --git a/graph/correlogram.Rmd b/graph/correlogram.Rmd index 4c5f102..bf442c8 100644 --- a/graph/correlogram.Rmd +++ b/graph/correlogram.Rmd @@ -1,10 +1,10 @@ --- myimage: "CorrelogramSmall.png" -mydisqus: "correlogram" +pathSlug: "correlogram" mytitle: "Correlogram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -48,7 +48,7 @@ library(reticulate) import seaborn as sns df = sns.load_dataset('iris') import matplotlib.pyplot as plt - + # Basic correlogram sns_plot = sns.pairplot(df) sns_plot.savefig("IMG/correlogram1.png") diff --git a/graph/dendrogram.Rmd b/graph/dendrogram.Rmd index 1079a24..d8b6513 100644 --- a/graph/dendrogram.Rmd +++ b/graph/dendrogram.Rmd @@ -1,10 +1,10 @@ --- myimage: "DendrogramSmall.png" -mydisqus: "dendrogram" +pathSlug: "dendrogram" mytitle: "Dendrogram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -43,37 +43,37 @@ library(colormap) library(kableExtra) options(knitr.table.format = "html") -# create a data frame +# create a data frame data=data.frame( level1="CEO", level2=c( rep("boss1",4), rep("boss2",4)), level3=paste0("mister_", letters[1:8]) ) - + # transform it to a edge list! 
edges_level1_2 = data %>% select(level1, level2) %>% unique %>% rename(from=level1, to=level2) edges_level2_3 = data %>% select(level2, level3) %>% unique %>% rename(from=level2, to=level3) edge_list=rbind(edges_level1_2, edges_level2_3) - + # Now we can plot that mygraph <- graph_from_data_frame( edge_list ) -ggraph(mygraph, layout = 'dendrogram', circular = FALSE) + +ggraph(mygraph, layout = 'dendrogram', circular = FALSE) + geom_edge_diagonal() + geom_node_point(color="#69b3a2", size=3) + geom_node_text( - aes( label=c("CEO", "Manager", "Manager", LETTERS[8:1]) ), - hjust=c(1,0.5, 0.5, rep(0,8)), + aes( label=c("CEO", "Manager", "Manager", LETTERS[8:1]) ), + hjust=c(1,0.5, 0.5, rep(0,8)), nudge_y = c(-.02, 0, 0, rep(.02,8)), nudge_x = c(0, .3, .3, rep(0,8)) ) + theme_void() + coord_flip() + - scale_y_reverse() + scale_y_reverse() ```
-Two type of dendrogram exist, resulting from 2 types of dataset: +Two type of dendrogram exist, resulting from 2 types of dataset: - A `hierarchic` dataset provides the links between nodes explicitely. Like above. - The result of a `clustering` algorythm can be visualized as a dendrogram. @@ -92,14 +92,14 @@ The following example shows the hierarchy of a company. The CEO is the **root no library(ggraph) library(igraph) library(tidyverse) - -# create a data frame + +# create a data frame data <- data.frame( level1="CEO", level2=c( rep("boss1",4), rep("boss2",4)), level3=paste0("mister_", letters[1:8]) ) - + # transform it to a edge list! edges_level1_2 <- data %>% select(level1, level2) %>% unique %>% rename(from=level1, to=level2) edges_level2_3 <- data %>% select(level2, level3) %>% unique %>% rename(from=level2, to=level3) @@ -107,7 +107,7 @@ edge_list <- rbind(edges_level1_2, edges_level2_3) # Now we can plot that mygraph <- graph_from_data_frame( edge_list ) -ggraph(mygraph, layout = 'dendrogram', circular = FALSE) + +ggraph(mygraph, layout = 'dendrogram', circular = FALSE) + geom_edge_diagonal() + geom_node_point() + theme_void() @@ -152,7 +152,7 @@ tmp %>% kable() %>%
-It is possible to perform [hierarchical cluster analysis](https://en.wikipedia.org/wiki/Hierarchical_clustering) on this set of dissimilarities. Basically, this statistical method seeks to build a `hierarchy` of clusters: it tries to group sample that are close one from another. +It is possible to perform [hierarchical cluster analysis](https://en.wikipedia.org/wiki/Hierarchical_clustering) on this set of dissimilarities. Basically, this statistical method seeks to build a `hierarchy` of clusters: it tries to group sample that are close one from another. The result can be seen as a dendrogram: @@ -179,7 +179,7 @@ As expected, cities that are in same geographic area tend to be `clusterized` to A common task consists to compare the result of a clustering with an expected result. For instance, we can check if the countries are indeed grouped in continent using a color bar: ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=9} # Create a color vector with continent -continent <- c("Europe", "South America", "Africa", "Asia", "Africa", "South America", "North America", "Asia", "North America", +continent <- c("Europe", "South America", "Africa", "Asia", "Africa", "South America", "North America", "Asia", "North America", "Europe", "Europe","Europe", "North America", "Asia", "South America", "North America", "Europe", "North America", "Europe", "South America", "Europe", "North America", "Asia", "Europe", "Asia", "Asia", "Europe", "North America" @@ -214,23 +214,23 @@ Many variations exist for dendrogram. It can be horizontal or vertical as shown library(ggraph) library(igraph) library(tidyverse) -library(RColorBrewer) +library(RColorBrewer) set.seed(1) # create a data frame giving the hierarchical structure of your individuals d1=data.frame(from="origin", to=paste("group", seq(1,10), sep="")) d2=data.frame(from=rep(d1$to, each=10), to=paste("group", seq(1,100), sep="_")) edges=rbind(d1, d2) - + # create a vertices data.frame. 
One line per object of our hierarchy vertices = data.frame( - name = unique(c(as.character(edges$from), as.character(edges$to))) , + name = unique(c(as.character(edges$from), as.character(edges$to))) , value = runif(111) -) +) # Let's add a column with the group of each name. It will be useful later to color points vertices$group = edges$from[ match( vertices$name, edges$to ) ] - - + + #Let's add information concerning the label we are going to add: angle, horizontal adjustement and potential flip #calculate the ANGLE of the labels vertices$id=NA @@ -238,14 +238,14 @@ myleaves=which(is.na( match(vertices$name, edges$from) )) nleaves=length(myleaves) vertices$id[ myleaves ] = seq(1:nleaves) vertices$angle= 90 - 360 * vertices$id / nleaves - + # calculate the alignment of labels: right or left # If I am on the left part of the plot, my labels have currently an angle < -90 vertices$hjust<-ifelse( vertices$angle < -90, 1, 0) - + # flip angle BY to make them readable vertices$angle<-ifelse(vertices$angle < -90, vertices$angle+180, vertices$angle) - + # Create a graph object mygraph <- graph_from_data_frame( edges, vertices=vertices ) @@ -253,7 +253,7 @@ mygraph <- graph_from_data_frame( edges, vertices=vertices ) mycolor <- colormap(colormap = colormaps$viridis, nshades = 6, format = "hex", alpha = 1, reverse = FALSE)[sample(c(1:6), 10, replace=TRUE)] # Make the plot -ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + +ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + geom_edge_diagonal(colour="grey") + scale_edge_colour_distiller(palette = "RdPu") + geom_node_text(aes(x = x*1.15, y=y*1.15, filter = leaf, label=name, angle = angle, hjust=hjust, colour=group), size=2.7, alpha=1) + diff --git a/graph/density2d.Rmd b/graph/density2d.Rmd index 17f193a..c207203 100644 --- a/graph/density2d.Rmd +++ b/graph/density2d.Rmd @@ -1,10 +1,10 @@ --- myimage: "2dDensitySmall.png" -mydisqus: "density2d" +pathSlug: "density2d" mytitle: "2D density plot" output: 
html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -49,14 +49,14 @@ library(reticulate) import numpy as np import matplotlib.pyplot as plt from scipy.stats import gaussian_kde as kde - + # Create data: 200 points data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200) x, y = data.T - + # Create a figure with 6 plot areas fig, axes = plt.subplots(ncols=6, nrows=1, figsize=(21, 5)) - + # Everything starts with a Scatterplot axes[0].set_title('Scatterplot') axes[0].plot(x, y, 'ko') @@ -65,24 +65,24 @@ axes[0].plot(x, y, 'ko') nbins = 20 axes[1].set_title('Hexbin') axes[1].hexbin(x, y, gridsize=nbins, cmap=plt.cm.BuGn_r) - + # 2D Histogram axes[2].set_title('2D Histogram') axes[2].hist2d(x, y, bins=nbins, cmap=plt.cm.BuGn_r) - + # Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents k = kde(data.T) xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j] zi = k(np.vstack([xi.flatten(), yi.flatten()])) - + # plot a density axes[3].set_title('Calculate Gaussian KDE') axes[3].pcolormesh(xi, yi, zi.reshape(xi.shape), cmap=plt.cm.BuGn_r) - + # add shading axes[4].set_title('2D Density with shading') axes[4].pcolormesh(xi, yi, zi.reshape(xi.shape), shading='gouraud', cmap=plt.cm.BuGn_r) - + # contour axes[5].set_title('Contour') axes[5].pcolormesh(xi, yi, zi.reshape(xi.shape), shading='gouraud', cmap=plt.cm.BuGn_r) @@ -118,7 +118,7 @@ library(patchwork) a <- data.frame( x=rnorm(20000, 10, 1.2), y=rnorm(20000, 10, 1.2), group=rep("A",20000)) b <- data.frame( x=rnorm(20000, 14.5, 1.2), y=rnorm(20000, 14.5, 1.2), group=rep("B",20000)) c <- data.frame( x=rnorm(20000, 9.5, 1.5), y=rnorm(20000, 15.5, 1.5), group=rep("C",20000)) -data <- do.call(rbind, list(a,b,c)) +data <- do.call(rbind, list(a,b,c)) p1 <- data 
%>% ggplot( aes(x=x, y=y)) + @@ -145,11 +145,11 @@ p1 + p2 # Variation {#variation} *** -2d distribution is one of the rare cases where using 3d can be worth it. +2d distribution is one of the rare cases where using 3d can be worth it.
-It is possible to transform the [scatterplot](https://www.data-to-viz.com/graph/scatter) information in a grid, and count the number of data points on each position of the grid. Then, instead of representing this number by a graduating color, the `surface plot` use 3d to represent dense are higher than others. +It is possible to transform the [scatterplot](https://www.data-to-viz.com/graph/scatter) information in a grid, and count the number of data points on each position of the grid. Then, instead of representing this number by a graduating color, the `surface plot` use 3d to represent dense are higher than others.
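Reviewer note on the paragraph above: the grid-counting step it describes (cut the scatterplot area into cells, then count the points per cell) can be sketched in plain Python. This is only an illustration under assumptions; the helper name `grid_counts`, the unit-square range and the toy points are hypothetical, and the page itself relies on plotting libraries for this step.

```python
from collections import Counter


def grid_counts(points, x_range, y_range, nbins):
    """Count how many (x, y) points fall in each cell of an nbins x nbins grid."""
    (xmin, xmax), (ymin, ymax) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue  # ignore points outside the grid
        # Map each coordinate to a cell index; the upper edge goes in the last cell
        col = min(int((x - xmin) / (xmax - xmin) * nbins), nbins - 1)
        row = min(int((y - ymin) / (ymax - ymin) * nbins), nbins - 1)
        counts[(row, col)] += 1
    return counts


# Hypothetical toy data: three points close together, one isolated
points = [(0.1, 0.1), (0.2, 0.15), (0.05, 0.2), (0.9, 0.8)]
counts = grid_counts(points, (0.0, 1.0), (0.0, 1.0), nbins=2)
```

The resulting count per cell is what a 2D histogram maps to color and what a surface plot maps to height.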
diff --git a/graph/donut.Rmd b/graph/donut.Rmd index 8310095..b1229d8 100644 --- a/graph/donut.Rmd +++ b/graph/donut.Rmd @@ -1,10 +1,10 @@ --- myimage: "DougnutSmall.png" -mydisqus: "donut" +pathSlug: "donut" mytitle: "Donut plot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -44,7 +44,7 @@ data <- data.frame( category=c("A", "B", "C"), count=c(10, 60, 30) ) - + # Compute percentages data$fraction <- data$count / sum(data$count) diff --git a/graph/edge_bundling.Rmd b/graph/edge_bundling.Rmd index 6dd8a4d..5b4e48a 100644 --- a/graph/edge_bundling.Rmd +++ b/graph/edge_bundling.Rmd @@ -1,10 +1,10 @@ --- myimage: "BundleSmall.png" -mydisqus: "edge_bundling" +pathSlug: "edge_bundling" mytitle: "Hierarchical edge bundling" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -31,7 +31,7 @@ output:
-Step 1: Let's consider the hierarchy of the [Flare](http://flare.prefuse.org) ActionScript visualization library. The elements of its library are organized in several folder, like query, data, scale... Each folder is then subdivided in subfolders and so on. The hierarchy can be visualized as a [dendrogram](https://www.data-to-viz.com/graph/dendrogram.html) as follow: +Step 1: Let's consider the hierarchy of the [Flare](http://flare.prefuse.org) ActionScript visualization library. The elements of its library are organized in several folder, like query, data, scale... Each folder is then subdivided in subfolders and so on. The hierarchy can be visualized as a [dendrogram](https://www.data-to-viz.com/graph/dendrogram.html) as follow: ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=6} # Libraries @@ -62,9 +62,9 @@ mygraph <- graph_from_data_frame(edges, vertices = vertices) # The connection object must refer to the ids of the leaves: from = match( connections$from, vertices$name) to = match( connections$to, vertices$name) - + # Basic dendrogram -ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + +ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + geom_edge_link(size=0.4, alpha=0.1) + geom_node_text(aes(x = x*1.01, y=y*1.01, filter = leaf, label=shortName, angle = angle, hjust=hjust), size=1.5, alpha=1) + coord_fixed() + @@ -76,7 +76,7 @@ ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + expand_limits(x = c(-1.2, 1.2), y = c(-1.2, 1.2)) ``` -Step 2: now consider another level of information. Some elements of the library have dependencies: basically they call other elements when they are used. A naive approach to represent this connection would be to draw a straight line (left). Instead, hierarchical edge bundling uses a curve that follows the hierarchy link between the 2 elements (right). +Step 2: now consider another level of information. 
Some elements of the library have dependencies: basically they call other elements when they are used. A naive approach to represent this connection would be to draw a straight line (left). Instead, hierarchical edge bundling uses a curve that follows the hierarchy link between the 2 elements (right). ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=10} @@ -85,9 +85,9 @@ from_head = match( connections$from, vertices$name) %>% head(1) to_head = match( connections$to, vertices$name) %>% head(1) # Basic dendrogram -p1 <- ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + +p1 <- ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + geom_edge_link(size=0.4, alpha=0.1) + - geom_conn_bundle(data = get_con(from = from_head, to = to_head), alpha = 1, colour="#69b3a2", width=2, tension=0) + + geom_conn_bundle(data = get_con(from = from_head, to = to_head), alpha = 1, colour="#69b3a2", width=2, tension=0) + geom_node_text(aes(x = x*1.01, y=y*1.01, filter = leaf, label=shortName, angle = angle, hjust=hjust), size=1.5, alpha=1) + coord_fixed() + theme_void() + @@ -97,9 +97,9 @@ p1 <- ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + ) + expand_limits(x = c(-1.2, 1.2), y = c(-1.2, 1.2)) -p2 <- ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + +p2 <- ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + geom_edge_link(size=0.4, alpha=0.1) + - geom_conn_bundle(data = get_con(from = from_head, to = to_head), alpha = 1, colour="#69b3a2", width=2, tension=0.9) + + geom_conn_bundle(data = get_con(from = from_head, to = to_head), alpha = 1, colour="#69b3a2", width=2, tension=0.9) + geom_node_text(aes(x = x*1.01, y=y*1.01, filter = leaf, label=shortName, angle = angle, hjust=hjust), size=1.5, alpha=1) + coord_fixed() + theme_void() + @@ -119,8 +119,8 @@ p1 + p2 ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=7, fig.width=7} # Make the plot -ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + - 
geom_conn_bundle(data = get_con(from = from, to = to), alpha = 0.1, colour="#69b3a2") + +ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + + geom_conn_bundle(data = get_con(from = from, to = to), alpha = 0.1, colour="#69b3a2") + geom_node_text(aes(x = x*1.01, y=y*1.01, filter = leaf, label=shortName, angle = angle, hjust=hjust), size=1.5, alpha=1) + coord_fixed() + theme_void() + @@ -144,7 +144,7 @@ ggraph(mygraph, layout = 'dendrogram', circular = TRUE) + # What for *** -Hierarchical edge bundling reduces visual clutter and also visualizes implicit adjacency edges between parent nodes that are the result of explicit adjacency edges between their respective child nodes. Furthermore, hierarchical edge bundling is a generic method which can be used in conjunction with existing tree visualization techniques. +Hierarchical edge bundling reduces visual clutter and also visualizes implicit adjacency edges between parent nodes that are the result of explicit adjacency edges between their respective child nodes. Furthermore, hierarchical edge bundling is a generic method which can be used in conjunction with existing tree visualization techniques. Here is an example showing the same dataset with and without the use of bundling. The use of straight line on the left results in a cluttered figure that makes impossible to read the connection. The use of bundling on the right makes a neat figure: @@ -218,7 +218,7 @@ Because I love this kind of graphic so much, I feel like displaying a few exampl

Chord diagram

A circular layout used to display weighted relationships between entities through arcs.

-
+
diff --git a/graph/heatmap.Rmd b/graph/heatmap.Rmd index c38432d..60faf34 100644 --- a/graph/heatmap.Rmd +++ b/graph/heatmap.Rmd @@ -1,10 +1,10 @@ --- myimage: "HeatmapSmall.png" -mydisqus: "heatmap" +pathSlug: "heatmap" mytitle: "Heatmap" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -31,7 +31,7 @@ output: # Definition {#definition} *** -A `heatmap` is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking a data table from above. +A `heatmap` is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking a data table from above. Here is an example showing 8 general features like population or life expectancy for about 30 countries in 2015. Data come from the French National Institute of [Demographic Studies](https://www.ined.fr/en/everything_about_population/data/all-countries/?lst_continent=908&lst_pays=926). 
@@ -44,12 +44,12 @@ library(plotly) # d3heatmap is not on CRAN yet, but can be found here: https://github.com/talgalili/d3heatmap library(d3heatmap) -# Load data +# Load data data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/multivariate.csv", header = T, sep = ";") colnames(data) <- gsub("\\.", " ", colnames(data)) # Select a few country -data <- data %>% +data <- data %>% filter(Country %in% c("France", "Sweden", "Italy", "Spain", "England", "Portugal", "Greece", "Peru", "Chile", "Brazil", "Argentina", "Bolivia", "Venezuela", "Australia", "New Zealand", "Fiji", "China", "India", "Thailand", "Afghanistan", "Bangladesh", "United States of America", "Canada", "Burundi", "Angola", "Kenya", "Togo")) %>% arrange(Country) %>% mutate(Country = factor(Country, Country)) @@ -64,9 +64,9 @@ mat <- as.matrix(mat) #d3heatmap(mat, scale="column", dendrogram = "none", width="800px", height="80Opx", colors = "Blues") library(heatmaply) -p <- heatmaply(mat, +p <- heatmaply(mat, dendrogram = "none", - xlab = "", ylab = "", + xlab = "", ylab = "", main = "", scale = "column", margins = c(60,100,40,20), @@ -102,9 +102,9 @@ A heatmap is really useful to display a `general view` of numerical data, not to A heatmap is also useful to display the result of `hierarchical clustering`. Basically, clustering checks which countries tend to have the same features on their numeric variables, and therefore which countries are similar. The usual way to represent the result is to use [dendrograms](https://www.data-to-viz.com/graph/dendrogram.html). This type of chart can be drawn around the heatmap: ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=6} -p <- heatmaply(mat, +p <- heatmaply(mat, #dendrogram = "row", - xlab = "", ylab = "", + xlab = "", ylab = "", main = "", scale = "column", margins = c(60,100,40,20), @@ -130,7 +130,7 @@ p <- heatmaply(mat,
```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=6, echo=FALSE} p -``` +```
Here, Burundi and Angola are grouped together. Indeed they are two countries in strong expansion, with a lot of children per woman but still a strong mortality rate. @@ -159,10 +159,10 @@ my_color <- my_color[as.numeric(as.factor(cont))] # NOTE: this does not work: #my_color <- my_color[as.numeric(as.factor(data$Continent))] -# -p <- heatmaply(mat, +# +p <- heatmaply(mat, dendrogram = "row", - xlab = "", ylab = "", + xlab = "", ylab = "", main = "", scale = "column", margins = c(60,100,40,20), @@ -186,7 +186,7 @@ p <- heatmaply(mat, - + - For a static heatmap, a common practice is to display the exact value of each cell in numbers. Indeed, it is hard to translate a color into a precise number. - Heatmaps can also be used for time series where there is a regular pattern in time. diff --git a/graph/hexbinmap.Rmd b/graph/hexbinmap.Rmd index 952809f..6e2e822 100644 --- a/graph/hexbinmap.Rmd +++ b/graph/hexbinmap.Rmd @@ -1,10 +1,10 @@ --- myimage: "MapHexbinSmall.png" -mydisqus: "hexbinmap" +pathSlug: "hexbinmap" mytitle: "Hexbin map" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -170,7 +170,7 @@ data %>% # What for *** -Hexbin or grid map has an `advantage` over usual [choropleth maps](https://www.data-to-viz.com/graph/choropleth.html). In choropleths, a large polygon’s data looks more emphasized just because of its size, what introduces a bias. Here with hexbin, each region is represented equally dismissing the bias. +Hexbin or grid map has an `advantage` over usual [choropleth maps](https://www.data-to-viz.com/graph/choropleth.html). In choropleths, a large polygon’s data looks more emphasized just because of its size, what introduces a bias. Here with hexbin, each region is represented equally dismissing the bias. There’s a `drawback` to this format though. Map readers generally recognize a geographic area by it’s shape and orientation to other areas. 
For instance, the geography of the US is well known and people easily identify different regions. In hexbin maps, these landmarks do not exist anymore what can confuse the audience. One solution for this is to choose a basemap that uses labels on top of your data layer. diff --git a/graph/histogram.Rmd b/graph/histogram.Rmd index ddad991..fb6613d 100644 --- a/graph/histogram.Rmd +++ b/graph/histogram.Rmd @@ -1,10 +1,10 @@ --- myimage: "HistogramSmall.png" -mydisqus: "histogram" +pathSlug: "histogram" mytitle: "Histogram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -30,7 +30,7 @@ output: # Definition {#definition} *** -A histogram is an accurate graphical representation of the distribution of a numeric variable. It takes as input numeric variables only. The variable is cut into several bins, and the number of observation per bin is represented by the height of the bar. +A histogram is an accurate graphical representation of the distribution of a numeric variable. It takes as input numeric variables only. The variable is cut into several bins, and the number of observation per bin is represented by the height of the bar. Here is an example showing the distribution of the night price of Rbnb appartements in the south of France. Price range is divided per 10 euros interval. 
For example, there are slightly less than 750 appartements with a night price between 100 and 110 euros: ```{r, warning=FALSE, message=FALSE, fig.align="center"} @@ -110,13 +110,13 @@ A common variation of the histogram is the mirror histogram: it puts face to fac ```{r, fig.align="center", fig.width=7, warning=FALSE, message=FALSE} data <- data.frame( - x = rnorm(1000), + x = rnorm(1000), y = rnorm(1000, mean=2) ) - -data %>% - ggplot( aes(x) ) + - geom_histogram( aes(x = x, y = ..density..), binwidth = diff(range(data$x))/30, fill="#69b3a2" ) + + +data %>% + ggplot( aes(x) ) + + geom_histogram( aes(x = x, y = ..density..), binwidth = diff(range(data$x))/30, fill="#69b3a2" ) + geom_label( aes(x=4.8, y=0.25, label="variable1"), color="#69b3a2") + geom_histogram( aes(x = y, y = -..density..), binwidth = diff(range(data$x))/30, fill= "#404080") + geom_label( aes(x=4.8, y=-0.25, label="variable2"), color="#404080") + diff --git a/graph/line.Rmd b/graph/line.Rmd index 1aef46f..a23d2b0 100644 --- a/graph/line.Rmd +++ b/graph/line.Rmd @@ -1,10 +1,10 @@ --- myimage: "LineSmall.png" -mydisqus: "line" +pathSlug: "line" mytitle: "Line chart" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -31,7 +31,7 @@ A `line chart` or line graph displays the evolution of one or several numeric va
-The following example shows the evolution of the [bitcoin price](https://www.data-to-viz.com/story/TwoNumOrdered.html) between April 2013 and April 2018. Data comes from the [CoinMarketCap](https://www.data-to-viz.com/story/TwoNumOrdered.html) website. +The following example shows the evolution of the [bitcoin price](https://www.data-to-viz.com/story/TwoNumOrdered.html) between April 2013 and April 2018. Data comes from the [CoinMarketCap](https://www.data-to-viz.com/story/TwoNumOrdered.html) website. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=5, fig.width=10} # Libraries @@ -67,10 +67,10 @@ Line chart can be used to show the evolution of one (like above) or several vari ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=5, fig.width=10} # Load dataset from github -don <- babynames %>% +don <- babynames %>% filter(name %in% c("Ashley", "Patricia", "Helen")) %>% filter(sex=="F") - + # Plot don %>% ggplot( aes(x=year, y=n, group=name, color=name)) + @@ -144,7 +144,7 @@ p1 + p2 ``` - + - If you need to compare the evolution of 2 different variables, do not use [dual axis](https://www.data-to-viz.com/caveat/dual_axis.html). Indeed dual axis can show very different results depending on what range you apply to the axis. [Read more about it](https://www.data-to-viz.com/caveat/dual_axis.html). - Mind the [spaghetti chart](https://www.data-to-viz.com/caveat/spaghetti.html): too many lines make the chart unreadable. - Think about the [aspect ratio](https://www.data-to-viz.com/caveat/aspect_ratio.html) of the graphic, extreme ratio make the chart unreadable. 
diff --git a/graph/lollipop.Rmd b/graph/lollipop.Rmd index 0ef3cdf..5b9e016 100644 --- a/graph/lollipop.Rmd +++ b/graph/lollipop.Rmd @@ -1,10 +1,10 @@ --- myimage: "LollipopSmall.png" -mydisqus: "lollipop" +pathSlug: "lollipop" mytitle: "Lollipop chart" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -131,15 +131,15 @@ The `Cleveland dotplot` is a handy variation, allowing to compare the value of 2 # Create data (could be way easier but it's late) value1 <- abs(rnorm(26))*2 don <- data.frame( - x=LETTERS[1:26], - value1=value1, - value2=value1+1+rnorm(26, sd=1) + x=LETTERS[1:26], + value1=value1, + value2=value1+1+rnorm(26, sd=1) ) %>% - rowwise() %>% - mutate( mymean = mean(c(value1,value2) )) %>% - arrange(mymean) %>% + rowwise() %>% + mutate( mymean = mean(c(value1,value2) )) %>% + arrange(mymean) %>% mutate(x=factor(x, x)) - + # With a bit more style ggplot(don) + @@ -164,7 +164,7 @@ Note that with a number of subgroups between 3 and ~7 this type of lollipop plot # Create data (could be way easier but it's late) value1 <- abs(rnorm(6))*2 don <- data.frame( - x=LETTERS[1:24], + x=LETTERS[1:24], val=c( value1, value1+1+rnorm(6, 14,1) ,value1+1+rnorm(6, sd=1) ,value1+1+rnorm(6, 12, 1) ), grp=rep(c("grp1", "grp2", "grp3", "grp4"), each=6) ) %>% @@ -199,7 +199,7 @@ ggplot(don) + # Common mistakes {#mistake} *** -- [Order your groups](http://www.data-to-viz.com/caveat/order_data.html). If the levels of your categoric variable have no obvious order, order the bars following their values. +- [Order your groups](http://www.data-to-viz.com/caveat/order_data.html). If the levels of your categoric variable have no obvious order, order the bars following their values. 
- If for whatever reason your bars must remain unsorted, it is probably better to use a barplot instead. Lollipop would be harder to read. diff --git a/graph/map.Rmd b/graph/map.Rmd index 4a5c69e..70814fb 100644 --- a/graph/map.Rmd +++ b/graph/map.Rmd @@ -1,10 +1,10 @@ --- myimage: "Map150.png" -mydisqus: "map" +pathSlug: "map" mytitle: "Background Map" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- diff --git a/graph/network.Rmd b/graph/network.Rmd index 03e29c2..71f0e65 100644 --- a/graph/network.Rmd +++ b/graph/network.Rmd @@ -1,10 +1,10 @@ --- myimage: "NetworkSmall.png" -mydisqus: "network" +pathSlug: "network" mytitle: "Network diagram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -28,7 +28,7 @@ output: # Definition {#definition} *** -`Network diagrams` (also called Graphs) show interconnections between a set of entities. Each entity is represented by a `Node` (or vertice). Connections between nodes are represented through `links` (or edges). +`Network diagrams` (also called Graphs) show interconnections between a set of entities. Each entity is represented by a `Node` (or vertice). Connections between nodes are represented through `links` (or edges). Here is an example showing the co-authors network of [Vincent Ranwez](https://sites.google.com/site/ranwez/), a researcher who's my previous supervisor. Basically, people having published at least one research paper with him are represented by a node. If two people have been listed on the same publication at least once, they are connected by a link. 
@@ -46,7 +46,7 @@ library(networkD3) dataUU <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/13_AdjacencyUndirectedUnweighted.csv", header=TRUE) # Transform the adjacency matrix in a long format -connect <- dataUU %>% +connect <- dataUU %>% gather(key="to", value="value", -1) %>% na.omit() @@ -61,7 +61,7 @@ colnames(coauth) <- c("name", "n") graph=simpleNetwork(connect) # Plot -simpleNetwork(connect, +simpleNetwork(connect, Source = 1, # column number of source Target = 2, # column number of target height = 880, # height of frame area in pixels @@ -97,7 +97,7 @@ Four main types of network diagram exist, according to the features of data inpu

-`Undirected and Unweighted`
+`Undirected and Unweighted`

Tom, Cherelle and Melanie live in the same house. They are connected but no direction and no weight. ```{r, warning=FALSE, message=FALSE, fig.align="center"} # Create data @@ -111,7 +111,7 @@ network=graph_from_adjacency_matrix(data) # Plot it # Make the graph -ggraph(network) + +ggraph(network) + geom_edge_link(edge_colour="black", edge_alpha=0.3, edge_width=0.2) + geom_node_point( color="#69b3a2", size=5) + geom_node_text( aes(label=name), repel = TRUE, size=8, color="#69b3a2") + @@ -119,7 +119,7 @@ ggraph(network) + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ```
@@ -136,12 +136,12 @@ set.seed(1) data=matrix(sample(0:3, 25, replace=TRUE), nrow=5) data[lower.tri(data)] <- NA colnames(data)=rownames(data)=LETTERS[1:5] - + # Transform it in a graph format network=graph_from_adjacency_matrix(data, weighted = TRUE) # Make the graph -ggraph(network) + +ggraph(network) + geom_edge_link( aes(edge_width=E(network)$weight), edge_colour="black", edge_alpha=0.3) + geom_node_point( color="#69b3a2", size=5) + geom_node_text( aes(label=name), repel = TRUE, size=8, color="#69b3a2") + @@ -149,7 +149,7 @@ ggraph(network) + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ```
@@ -166,12 +166,12 @@ set.seed(10) data=matrix(sample(0:1, 25, replace=TRUE), nrow=5) diag(data) = NA colnames(data)=rownames(data)=LETTERS[1:5] - + # Transform it in a graph format network=graph_from_adjacency_matrix(data) # Make the graph -ggraph(network) + +ggraph(network) + geom_edge_link(edge_colour="black", edge_alpha=0.8, edge_width=0.2, arrow = arrow(angle=20)) + geom_node_point( color="#69b3a2", size=3) + geom_node_text( aes(label=name), repel = TRUE, size=6, color="#69b3a2") + @@ -179,7 +179,7 @@ ggraph(network) + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ```
@@ -196,12 +196,12 @@ set.seed(10) data=matrix(sample(0:3, 16, replace=TRUE), nrow=4) diag(data) <- NA colnames(data)=rownames(data)=LETTERS[1:4] - + # Transform it in a graph format network=graph_from_adjacency_matrix(data, weighted=TRUE) # Make the graph -ggraph(network) + +ggraph(network) + geom_edge_link(edge_colour="black", edge_alpha=0.3, aes(edge_width=E(network)$weight) , arrow=arrow()) + scale_edge_width(range=c(1,3)) + geom_node_point( color="#69b3a2", size=3) + @@ -210,7 +210,7 @@ ggraph(network) + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ```
@@ -221,7 +221,7 @@ ggraph(network) +

-*Note*: as you can observe on the examples above, directed graphs are quite hard to represent using this type of visualization. More appropriate techniques exist to represent flows, like [Sankey diagram](https://www.data-to-viz.com/graph/sankey.html) or [chord diagram](https://www.data-to-viz.com/graph/chord.html). +*Note*: as you can observe on the examples above, directed graphs are quite hard to represent using this type of visualization. More appropriate techniques exist to represent flows, like [Sankey diagram](https://www.data-to-viz.com/graph/sankey.html) or [chord diagram](https://www.data-to-viz.com/graph/chord.html). @@ -253,7 +253,7 @@ Probably the most widely used algorithm, using a force-directed method. dataUU <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/13_AdjacencyUndirectedUnweighted.csv", header=TRUE) # Transform the adjacency matrix in a long format -connect <- dataUU %>% +connect <- dataUU %>% gather(key="to", value="value", -1) %>% na.omit() @@ -268,7 +268,7 @@ colnames(coauth) <- c("name", "n") mygraph <- graph_from_data_frame( connect, vertices = coauth ) # Make the graph -ggraph(mygraph, layout="fr") + +ggraph(mygraph, layout="fr") + #geom_edge_density(edge_fill="#69b3a2") + geom_edge_link(edge_colour="black", edge_alpha=0.2, edge_width=0.3) + geom_node_point(aes(size=n, alpha=n)) + @@ -276,7 +276,7 @@ ggraph(mygraph, layout="fr") + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ```
@@ -288,7 +288,7 @@ ggraph(mygraph, layout="fr") + A force-directed graph layout toolbox focused on real-world large-scale graphs ```{r, warning=FALSE, message=FALSE, fig.align="center"} # Make the graph -ggraph(mygraph, layout="drl") + +ggraph(mygraph, layout="drl") + #geom_edge_density(edge_fill="#69b3a2") + geom_edge_link(edge_colour="black", edge_alpha=0.2, edge_width=0.3) + geom_node_point(aes(size=n, alpha=n)) + @@ -296,7 +296,7 @@ ggraph(mygraph, layout="drl") + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ``` @@ -306,7 +306,7 @@ ggraph(mygraph, layout="drl") +
This is what happens if node positions is set up randomly ```{r, warning=FALSE, message=FALSE, fig.align="center"} -ggraph(mygraph, layout="igraph", algorithm="randomly") + +ggraph(mygraph, layout="igraph", algorithm="randomly") + #geom_edge_density(edge_fill="#69b3a2") + geom_edge_link(edge_colour="black", edge_alpha=0.2, edge_width=0.3) + geom_node_point(aes(size=n, alpha=n)) + @@ -314,7 +314,7 @@ ggraph(mygraph, layout="igraph", algorithm="randomly") + theme( legend.position="none", plot.margin=unit(rep(1,4), "cm") - ) + ) ``` @@ -365,7 +365,7 @@ ggraph(mygraph, layout="igraph", algorithm="randomly") +

Edge bundling

Show connections between entities organized in a hierarchy.

- + diff --git a/graph/parallel.Rmd b/graph/parallel.Rmd index b6f2030..91d5183 100644 --- a/graph/parallel.Rmd +++ b/graph/parallel.Rmd @@ -1,10 +1,10 @@ --- myimage: "Parallel1Small.png" -mydisqus: "parallel" +pathSlug: "parallel" mytitle: "Parallel coordinates plot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -29,7 +29,7 @@ output: # Definition {#definition} *** -`Parallel plot` or parallel coordinates plot allows to compare the feature of several individual observations (`series`) on a set of numeric variables. Each vertical bar represents a variable and often has its own scale. (The units can even be different). Values are then plotted as series of lines connected across each axis. +`Parallel plot` or parallel coordinates plot allows to compare the feature of several individual observations (`series`) on a set of numeric variables. Each vertical bar represents a variable and often has its own scale. (The units can even be different). Values are then plotted as series of lines connected across each axis.
@@ -47,13 +47,13 @@ library(viridis) data <- iris # Plot -data %>% +data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", - showPoints = TRUE, + showPoints = TRUE, title = "Parallel Coordinate Plot for the Iris Data", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -77,16 +77,16 @@ A parallel plot allows to study the features of samples for `several quantitativ In the graphic above flower features were grouped in species, and all variables were normalized and sharing the same unit (cm). Here is another example where diamonds are compared for 4 variables that share different units, like the price in $ or depth in %. Note the use of scaling to be able to compare them. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=9} -diamonds %>% +diamonds %>% sample_n(10) %>% ggparcoord( - columns = c(1,5:7), - groupColumn = 2, + columns = c(1,5:7), + groupColumn = 2, #order = "anyClass", - showPoints = TRUE, + showPoints = TRUE, title = "Diamonds features", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -105,14 +105,14 @@ Here is an overview of the parallel coordinates features you can play with: ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=7, fig.width=10} # Plot -p1 <- data %>% +p1 <- data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", scale="globalminmax", - showPoints = TRUE, + showPoints = TRUE, title = "No scaling", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -120,15 +120,15 @@ p1 <- data %>% plot.title = element_text(size=10) ) + xlab("") - -p2 <- data %>% + +p2 <- data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", scale="uniminmax", - showPoints = TRUE, + showPoints = TRUE, title = "Standardize to Min = 0 and Max = 1", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -138,14 +138,14 
@@ p2 <- data %>% xlab("") -p3 <- data %>% +p3 <- data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", scale="std", - showPoints = TRUE, + showPoints = TRUE, title = "Normalize univariately (substract mean & divide by sd)", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -155,14 +155,14 @@ p3 <- data %>% xlab("") -p4 <- data %>% +p4 <- data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", scale="center", - showPoints = TRUE, + showPoints = TRUE, title = "Standardize and center variables", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -180,13 +180,13 @@ p1 + p2 + p3 + p4 + plot_layout(ncol = 2) - *Axis order* - optimizing the order of vertical axis can decrease the `clutter` of your parallel plot. Basically, the goal is to minimize the number of cross between series. On the next figure, the left plot is much harder to understand the the right one. Only variable order is different. 
```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=4, fig.width=9} # Plot -p1 <- data %>% +p1 <- data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = c(1:4), - showPoints = TRUE, + showPoints = TRUE, title = "Original", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -194,14 +194,14 @@ p1 <- data %>% plot.title = element_text(size=10) ) + xlab("") - -p2 <- data %>% + +p2 <- data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", - showPoints = TRUE, + showPoints = TRUE, title = "Re-ordered", alphaLines = 0.3 - ) + + ) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme( @@ -210,7 +210,7 @@ p2 <- data %>% ) + xlab("") -p1 + p2 +p1 + p2 ``` @@ -220,13 +220,13 @@ p1 + p2 - *Highlighting* - a parallel plot being a [line plot](https://www.data-to-viz.com/graph/line.html), the main caveat is the [spaghetti chart](https://www.data-to-viz.com/caveat/spaghetti.html) where too many lines overlap, making the chart unreadable. Several workaround exist as described in [this page](https://www.data-to-viz.com/caveat/spaghetti.html). 
A solution is to highlight a specific sample or a specific group of interest: ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=4, fig.width=7} # Plot -data %>% +data %>% ggparcoord( columns = 1:4, groupColumn = 5, order = "anyClass", - showPoints = TRUE, + showPoints = TRUE, title = "Original", alphaLines = 0.3 - ) + + ) + scale_color_manual(values=c( "#69b3a2", "grey", "grey") ) + theme_ipsum()+ theme( @@ -234,7 +234,7 @@ data %>% plot.title = element_text(size=10) ) + xlab("") - + ``` diff --git a/graph/ridgeline.Rmd b/graph/ridgeline.Rmd index b75856e..aa4ddd5 100644 --- a/graph/ridgeline.Rmd +++ b/graph/ridgeline.Rmd @@ -1,10 +1,10 @@ --- myimage: "Joyplot150.png" -mydisqus: "ridgeline" +pathSlug: "ridgeline" mytitle: "Ridgeline plot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -28,7 +28,7 @@ output: # Definition {#definition} *** -A Ridgeline plot (sometimes called Joyplot) shows the distribution of a numeric value for several groups. Distribution can be represented using [histograms](http://www.data-to-viz.com/graph/histogram.html) or [density plots](http://www.data-to-viz.com/graph/density.html), all aligned to the same horizontal scale and presented with a slight overlap. +A Ridgeline plot (sometimes called Joyplot) shows the distribution of a numeric value for several groups. Distribution can be represented using [histograms](http://www.data-to-viz.com/graph/histogram.html) or [density plots](http://www.data-to-viz.com/graph/density.html), all aligned to the same horizontal scale and presented with a slight overlap.
@@ -44,7 +44,7 @@ library(viridis) # Load dataset from github data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",") -data <- data %>% +data <- data %>% gather(key="text", value="value") %>% mutate(text = gsub("\\.", " ",text)) %>% mutate(value = round(as.numeric(value),0)) %>% @@ -69,7 +69,7 @@ data %>% ``` -**Disclaimer**: This idea originally comes from a publication of the [CIA](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/art15.html) which resulted in this [figure](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/fig18.gif/image.gif). Then, [Zoni Nation](https://github.com/zonination) cleaned the reddit dataset and built [graphics with R](https://github.com/zonination/perceptions). +**Disclaimer**: This idea originally comes from a publication of the [CIA](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/art15.html) which resulted in this [figure](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/fig18.gif/image.gif). Then, [Zoni Nation](https://github.com/zonination) cleaned the reddit dataset and built [graphics with R](https://github.com/zonination/perceptions). # What for @@ -114,7 +114,7 @@ ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`, fill = ..x. legend.position="none", panel.spacing = unit(0.1, "lines"), strip.text.x = element_text(size = 8) - ) + ) ``` - See more variations in the [R graph gallery](https://www.r-graph-gallery.com/ridgeline-plot/). 
diff --git a/graph/sankey.Rmd b/graph/sankey.Rmd index 1d41f2e..14f9406 100644 --- a/graph/sankey.Rmd +++ b/graph/sankey.Rmd @@ -1,10 +1,10 @@ --- myimage: "SankeySmall.png" -mydisqus: "sankey" +pathSlug: "sankey" mytitle: "Sankey diagram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -28,7 +28,7 @@ output: # Definition {#definition} *** -A `Sankey Diagram` is a visualisation technique that allows to display flows. Several entities (`nodes`) are represented by rectangles or text. Their links are represented with arrow or arcs that have a width proportional to the importance of the flow. +A `Sankey Diagram` is a visualisation technique that allows to display flows. Several entities (`nodes`) are represented by rectangles or text. Their links are represented with arrow or arcs that have a width proportional to the importance of the flow. Here is an example displaying the number of people migrating from one country (left) to another (right). Data used comes from this [scientific publication](https://onlinelibrary.wiley.com/doi/abs/10.1111/imre.12327). @@ -55,9 +55,9 @@ data_long$target <- paste(data_long$target, " ", sep="") # From these flows we need to create a node data frame: it lists every entities involved in the flow nodes <- data.frame(name=c(as.character(data_long$source), as.character(data_long$target)) %>% unique()) - + # With networkD3, connection must be provided using id, not using real name like in the links dataframe.. So we need to reformat it. 
-data_long$IDsource=match(data_long$source, nodes$name)-1 +data_long$IDsource=match(data_long$source, nodes$name)-1 data_long$IDtarget=match(data_long$target, nodes$name)-1 # prepare colour scale @@ -66,7 +66,7 @@ ColourScal ='d3.scaleOrdinal() .range(["#FDE725FF","#B4DE2CFF","#6DCD59FF","#35B # Make the Network sankeyNetwork(Links = data_long, Nodes = nodes, Source = "IDsource", Target = "IDtarget", - Value = "value", NodeID = "name", + Value = "value", NodeID = "name", sinksRight=FALSE, colourScale=ColourScal, nodeWidth=40, fontSize=13, nodePadding=20) ``` @@ -90,18 +90,18 @@ Sankey diagrams are used to show weighted networks, i.e. flows. It can happen wi ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=6, fig.width=9} # Load package library(networkD3) - + # Load energy projection data URL <- "https://cdn.rawgit.com/christophergandrud/networkD3/master/JSONdata/energy.json" Energy <- jsonlite::fromJSON(URL) - + # Now we have 2 data frames: a 'links' data frame with 3 columns (from, to, value), and a 'nodes' data frame that gives the name of each node. # Thus we can plot it sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source", Target = "target", Value = "value", NodeID = "name", units = "TWh", fontSize = 12, nodeWidth = 30) - + ``` @@ -164,7 +164,7 @@ If you're interested to see more examples, there is a [whole website about it](h

Edge bundling

Show connections between entities organized in a hierarchy.

- + diff --git a/graph/scatter.Rmd b/graph/scatter.Rmd index b4af834..fbfdbfc 100644 --- a/graph/scatter.Rmd +++ b/graph/scatter.Rmd @@ -1,10 +1,10 @@ --- myimage: "ScatterPlotSmall.png" -mydisqus: "scatter" +pathSlug: "scatter" mytitle: "Scatter plot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -45,7 +45,7 @@ library(viridis) data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/2_TwoNum.csv", header=T, sep=",") %>% dplyr::select(GrLivArea, SalePrice) # plot -data %>% +data %>% ggplot( aes(x=GrLivArea, y=SalePrice/1000)) + geom_point(color="#69b3a2", alpha=0.8) + ggtitle("Ground living area partially explains sale price of apartments") + @@ -60,7 +60,7 @@ data %>% # What for *** -A scatterplot is made to study the relationship between 2 variables. Thus it is often accompanied by a [correlation coefficient](https://en.wikipedia.org/wiki/Correlation_coefficient) calculation, that usually tries to measure the `linear relationship`. +A scatterplot is made to study the relationship between 2 variables. Thus it is often accompanied by a [correlation coefficient](https://en.wikipedia.org/wiki/Correlation_coefficient) calculation, that usually tries to measure the `linear relationship`.
@@ -94,7 +94,7 @@ Interactivity is a real plus for scatterplot. It allows to `zoom` on a specific # Plotly allows to turn any ggplot2 graphic interactive library(plotly) -p <- data %>% +p <- data %>% mutate(text=paste("Apartment Number: ", seq(1:nrow(data)), "\nLocation: New York\nAny other information you need..", sep="")) %>% ggplot( aes(x=GrLivArea, y=SalePrice/1000, text=text)) + geom_point(color="#69b3a2", alpha=0.8) + diff --git a/graph/stackedarea.Rmd b/graph/stackedarea.Rmd index ee8dbcd..167791c 100644 --- a/graph/stackedarea.Rmd +++ b/graph/stackedarea.Rmd @@ -1,10 +1,10 @@ --- myimage: "StackedAreaSmall.png" -mydisqus: "stackedarea" +pathSlug: "stackedarea" mytitle: "Stacked Area Graph" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -31,7 +31,7 @@ A `stacked area chart` is the extension of a basic [area chart](https://www.data
-The following example shows the evolution of baby name frequencies in the US between 1880 and 2015. +The following example shows the evolution of baby name frequencies in the US between 1880 and 2015.
@@ -45,12 +45,12 @@ library(hrbrthemes) library(plotly) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Ashley", "Amanda", "Jessica", "Patricia", "Linda", "Deborah", "Dorothy", "Betty", "Helen")) %>% filter(sex=="F") # Plot -p <- data %>% +p <- data %>% ggplot( aes(x=year, y=n, fill=name, text=name)) + geom_area( ) + scale_fill_viridis(discrete = TRUE) + @@ -74,7 +74,7 @@ The efficiency of stacked area graph is [discussed](https://www.data-to-viz.com/ - stacked area graph are `appropriate` to study the evolution of the `whole` and the `relative proportions` of each group. Indeed, the top of the areas allows to visualize how the whole behaves, like for a classic area chart. -- however they are not appropriate to study the evolution of each `individual group`: it is very hard to substract the height of other groups at each time point. For a more accurate but less attractive figure, consider a [line chart](https://www.data-to-viz.com/graph/line.html) or [area chart](https://www.data-to-viz.com/graph/area.html) using small multiple. +- however they are not appropriate to study the evolution of each `individual group`: it is very hard to substract the height of other groups at each time point. For a more accurate but less attractive figure, consider a [line chart](https://www.data-to-viz.com/graph/line.html) or [area chart](https://www.data-to-viz.com/graph/area.html) using small multiple. This website dedicates a whole [page about stacking](https://www.data-to-viz.com/caveat/stacking.html) and its potential pitfalls, [visit it](https://www.data-to-viz.com/caveat/stacking.html) to go further. @@ -87,12 +87,12 @@ A variation of the stacked area graph is the `percent stacked area graph`. 
It is ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=5, fig.width=10} -p <- data %>% +p <- data %>% # Compute the proportions: group_by(year) %>% mutate(freq = n / sum(n)) %>% ungroup() %>% - + # Plot ggplot( aes(x=year, y=freq, fill=name, color=name, text=name)) + geom_area( ) + diff --git a/graph/streamgraph.Rmd b/graph/streamgraph.Rmd index d1e5b65..c3f2e22 100644 --- a/graph/streamgraph.Rmd +++ b/graph/streamgraph.Rmd @@ -1,10 +1,10 @@ --- myimage: "StreamSmall.png" -mydisqus: "streamgraph" +pathSlug: "streamgraph" mytitle: "Streamgraph" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -27,13 +27,13 @@ output: # Definition {#definition} *** -A Stream graph is a type of [stacked area chart](https://www.data-to-viz.com/graph/stackedarea.html). It displays the evolution of a numeric value (Y axis) following another numeric value (X axis). This evolution is represented for several groups, all with a distinct color. +A Stream graph is a type of [stacked area chart](https://www.data-to-viz.com/graph/stackedarea.html). It displays the evolution of a numeric value (Y axis) following another numeric value (X axis). This evolution is represented for several groups, all with a distinct color. Contrary to a stacked area, there is no corner: edges are rounded what gives this nice impression of flow. Moreover, areas are usually displaced around a central axis, resulting in a flowing and organic shape.
-The following example shows the evolution of baby name frequencies in the US between 1880 and 2015. +The following example shows the evolution of baby name frequencies in the US between 1880 and 2015. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=3, fig.width=10} # Libraries @@ -43,12 +43,12 @@ library(streamgraph) # Load dataset from github -data <- babynames %>% +data <- babynames %>% filter(name %in% c("Ashley", "Amanda", "Jessica", "Patricia", "Linda", "Deborah", "Dorothy", "Betty", "Helen")) %>% filter(sex=="F") # Plot -data %>% +data %>% streamgraph(key="name", value="n", date="year") %>% sg_fill_brewer("BuPu") @@ -61,7 +61,7 @@ data %>% # What for *** -Streamchart are good to study the `relative proportions` of the whole. However they are bad to study the evolution of each `individual group`: it is very hard to substract the height of other groups at each time point. For a more accurate but less attractive figure, consider a [line chart](https://www.data-to-viz.com/graph/line.html) or [area chart](https://www.data-to-viz.com/graph/area.html) using small multiple. +Streamchart are good to study the `relative proportions` of the whole. However they are bad to study the evolution of each `individual group`: it is very hard to substract the height of other groups at each time point. For a more accurate but less attractive figure, consider a [line chart](https://www.data-to-viz.com/graph/line.html) or [area chart](https://www.data-to-viz.com/graph/area.html) using small multiple. Stream chart gets really useful when displayed in an interactive mode: highlighting a group gives you directly an insight of its evolution. 
@@ -74,7 +74,7 @@ Even if areas are usually displaced around a central axis, it is possible to dis ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=3, fig.width=10} # Plot -data %>% +data %>% streamgraph(key="name", value="n", date="year", offset="zero") %>% sg_fill_brewer("BuPu") ``` @@ -85,7 +85,7 @@ It also possible to create a percent streamchart where the proportion of each gr ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=3, fig.width=10} # Plot -data %>% +data %>% streamgraph(key="name", value="n", date="year", offset="expand") %>% sg_fill_brewer("BuPu") ``` diff --git a/graph/sunburst.Rmd b/graph/sunburst.Rmd index 8eafffc..e3ee2a7 100644 --- a/graph/sunburst.Rmd +++ b/graph/sunburst.Rmd @@ -1,10 +1,10 @@ --- myimage: "SunburstSmall.png" -mydisqus: "sunburst" +pathSlug: "sunburst" mytitle: "Sunburst" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html diff --git a/graph/template_datatoviz.html b/graph/template_datatoviz.html index 4ce67e7..6351083 100644 --- a/graph/template_datatoviz.html +++ b/graph/template_datatoviz.html @@ -40,7 +40,7 @@ /> diff --git a/graph/treemap.Rmd b/graph/treemap.Rmd index c9232e3..da20c4b 100644 --- a/graph/treemap.Rmd +++ b/graph/treemap.Rmd @@ -1,10 +1,10 @@ --- myimage: "TreeSmall.png" -mydisqus: "treemap" +pathSlug: "treemap" mytitle: "Treemap" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -29,9 +29,9 @@ output: # Definition {#definition} *** -A `Treemap` displays `hierarchical` data as a set of nested rectangles. Each group is represented by a rectangle, which area is proportional to its value. 
Using color schemes and or interactivity, it is possible to represent several dimensions: groups, subgroups etc. +A `Treemap` displays `hierarchical` data as a set of nested rectangles. Each group is represented by a rectangle, which area is proportional to its value. Using color schemes and or interactivity, it is possible to represent several dimensions: groups, subgroups etc. -Here is an example describing the [world population](https://www.data-to-viz.com/story/SevCatOneNumNestedOneObsPerGroup.html) of 250 countries. The world is divided in continent (group), continent are divided in regions (subgroup), and regions are divided in countries. In this tree structure, countries are considered as leaves: they are at the end of the branches. +Here is an example describing the [world population](https://www.data-to-viz.com/story/SevCatOneNumNestedOneObsPerGroup.html) of 250 countries. The world is divided in continent (group), continent are divided in regions (subgroup), and regions are divided in countries. In this tree structure, countries are considered as leaves: they are at the end of the branches. 
```{r, warning=FALSE, message=FALSE, fig.align="center", fig.height=4} # libraries @@ -45,28 +45,28 @@ colnames(data) <- c("Continent", "Region", "Country", "Pop") # Plot p <- treemap(data, - + # data index=c("Continent", "Region", "Country"), vSize="Pop", type="index", - + # Main title="", palette="Dark2", # Borders: - border.col=c("black", "grey", "grey"), - border.lwds=c(1,0.5,0.1), - + border.col=c("black", "grey", "grey"), + border.lwds=c(1,0.5,0.1), + # Labels fontsize.labels=c(0.7, 0.4, 0.3), fontcolor.labels=c("white", "white", "black"), - fontface.labels=1, - bg.labels=c("transparent"), - align.labels=list( c("center", "center"), c("left", "top"), c("right", "bottom")), - overlap.labels=0.5#, inflate.labels=T - + fontface.labels=1, + bg.labels=c("transparent"), + align.labels=list( c("center", "center"), c("left", "top"), c("right", "bottom")), + overlap.labels=0.5#, inflate.labels=T + ) ``` @@ -94,7 +94,7 @@ Treemaps have the advantage to make efficient use of space, what makes them usef # Variation {#variation} *** -The main variation of treemaps concerns the use of interactivity. It is advised to use it if you have more than 2 or 3 levels of organization to display. Indeed, treemap get cluttered in this situation otherwise. +The main variation of treemaps concerns the use of interactivity. It is advised to use it if you have more than 2 or 3 levels of organization to display. Indeed, treemap get cluttered in this situation otherwise. In the figure below, clicking on a group zooms on it a reveals the underlying structure. Hint: click on the title to come back to the previous level of the hierarchy. 
diff --git a/graph/venn.Rmd b/graph/venn.Rmd index 8951c0c..3b95ddb 100644 --- a/graph/venn.Rmd +++ b/graph/venn.Rmd @@ -1,10 +1,10 @@ --- myimage: "VennSmall.png" -mydisqus: "venn" +pathSlug: "venn" mytitle: "Venn Diagram" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -45,26 +45,26 @@ library(tm) library(proustr) # Load dataset from github -data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/14_SeveralIndepLists.csv", header=TRUE) +data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/14_SeveralIndepLists.csv", header=TRUE) to_remove <- c("_|[0-9]|\\.|function|^id|script|var|div|null|typeof|opts|if|^r$|undefined|false|loaded|true|settimeout|eval|else|artist") data <- data %>% filter(!grepl(to_remove, word)) %>% filter(!word %in% stopwords('fr')) %>% filter(!word %in% proust_stopwords()$word) # library library(VennDiagram) - + #cMake the plot venn.diagram( x = list( - data %>% filter(artist=="booba") %>% select(word) %>% unlist() , - data %>% filter(artist=="nekfeu") %>% select(word) %>% unlist() , + data %>% filter(artist=="booba") %>% select(word) %>% unlist() , + data %>% filter(artist=="nekfeu") %>% select(word) %>% unlist() , data %>% filter(artist=="georges-brassens") %>% select(word) %>% unlist() ), category.names = c("Booba (1995)" , "Nekfeu (663)" , "Brassens (471)"), filename = 'IMG/venn.png', output = TRUE , imagetype="png" , - height = 480 , - width = 480 , + height = 480 , + width = 480 , resolution = 300, compression = "lzw", lwd = 1, @@ -89,7 +89,7 @@ venn.diagram(
-Here, it is easy to understand that Booba used 1995 unique words in the dataset. 44 of them were also used by Brassens *and* Nekfeu, 126 only shared with Nekfeu only. +Here, it is easy to understand that Booba used 1995 unique words in the dataset. 44 of them were also used by Brassens *and* Nekfeu, 126 only shared with Nekfeu only. @@ -98,7 +98,7 @@ Here, it is easy to understand that Booba used 1995 unique words in the dataset. # What for *** -A venn diagram makes a really good work to study the intersection between 2 or 3 sets. It becomes very hard to read with more groups than that and thus must be avoided. +A venn diagram makes a really good work to study the intersection between 2 or 3 sets. It becomes very hard to read with more groups than that and thus must be avoided. Here is a famous example: a six-set venn diagram published in [Nature](https://www.nature.com/articles/nature11241) that shows the relationship between the banana’s genome and the genome of five other species. @@ -121,7 +121,7 @@ Even if this figure is quite attractive, it is really hard to extract any inform # Variation {#variation} *** -To visualize the intersection between more than 3 sets, the best option is to use a [UpSet plot](http://caleydo.org/tools/upset/). +To visualize the intersection between more than 3 sets, the best option is to use a [UpSet plot](http://caleydo.org/tools/upset/). Here is an example provided by the [UpsetR](https://github.com/hms-dbmi/UpSetR) R library that displays the banana genome information seen before. The total size of each set is represented on the left barplot. Every possible intersection is represented by the bottom plot, and their occurence is shown on the top barplot. 
diff --git a/graph/violin.Rmd b/graph/violin.Rmd index 77b9b13..e168604 100644 --- a/graph/violin.Rmd +++ b/graph/violin.Rmd @@ -1,10 +1,10 @@ --- myimage: "ViolinSmall.png" -mydisqus: "violin" +pathSlug: "violin" mytitle: "Violin plot" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -41,7 +41,7 @@ library(viridis) # Load dataset from github data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",") -data <- data %>% +data <- data %>% gather(key="text", value="value") %>% mutate(text = gsub("\\.", " ",text)) %>% mutate(value = round(as.numeric(value),0)) %>% @@ -63,7 +63,7 @@ data %>% ylab("Assigned Probability (%)") ``` -**Disclaimer**: This idea originally comes from a publication of the [CIA](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/art15.html) which resulted in this [figure](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/fig18.gif/image.gif). Then, [Zoni Nation](https://github.com/zonination) cleaned the reddit dataset and built [graphics with R](https://github.com/zonination/perceptions). +**Disclaimer**: This idea originally comes from a publication of the [CIA](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/art15.html) which resulted in this [figure](https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/fig18.gif/image.gif). 
Then, [Zoni Nation](https://github.com/zonination) cleaned the reddit dataset and built [graphics with R](https://github.com/zonination/perceptions). @@ -119,12 +119,12 @@ data %>% # Load dataset from github data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/10_OneNumSevCatSubgroupsSevObs.csv", header=T, sep=",") %>% mutate(tip = round(tip/total_bill*100, 1)) - + # Grouped data %>% mutate(day = fct_reorder(day, tip)) %>% mutate(day = factor(day, levels=c("Thur", "Fri", "Sat", "Sun"))) %>% - ggplot(aes(fill=sex, y=tip, x=day)) + + ggplot(aes(fill=sex, y=tip, x=day)) + geom_violin(position="dodge", alpha=0.5, outlier.colour="transparent") + scale_fill_viridis(discrete=T, name="") + theme_ipsum() + @@ -149,7 +149,7 @@ data %>% ```{r, fig.align='center', fig.height=6, fig.width=8, warning=FALSE} # Load dataset from github data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",") -data <- data %>% +data <- data %>% gather(key="text", value="value") %>% mutate(text = gsub("\\.", " ",text)) %>% mutate(value = round(as.numeric(value),0)) %>% diff --git a/graph/wordcloud.Rmd b/graph/wordcloud.Rmd index 63488a8..24fb530 100644 --- a/graph/wordcloud.Rmd +++ b/graph/wordcloud.Rmd @@ -1,10 +1,10 @@ --- myimage: "WordCloudSmall.png" -mydisqus: "wordcloud" +pathSlug: "wordcloud" mytitle: "Wordcloud" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_datatoviz.html @@ -16,7 +16,7 @@ output: number_section: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -43,7 +43,7 @@ library(tm) library(proustr) # Load dataset from github -data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/14_SeveralIndepLists.csv", header=TRUE) +data <- 
read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/14_SeveralIndepLists.csv", header=TRUE) to_remove <- c("_|[0-9]|\\.|function|^id|script|var|div|null|typeof|opts|if|^r$|undefined|false|loaded|true|settimeout|eval|else|artist") data <- data %>% filter(!grepl(to_remove, word)) %>% filter(!word %in% stopwords('fr')) %>% filter(!word %in% proust_stopwords()$word) @@ -108,8 +108,8 @@ data %>% ``` - - + + # Variation {#variation} *** diff --git a/story/AdjacencyMatrix.Rmd b/story/AdjacencyMatrix.Rmd index 9a09acb..2c1aaa2 100644 --- a/story/AdjacencyMatrix.Rmd +++ b/story/AdjacencyMatrix.Rmd @@ -4,11 +4,11 @@ myimage2: "Network150.png" myimage3: "Heatmap150.png" myimage4: "Sankey150.png" myimage5: "Chord150.png" -mydisqus: "AdjacencyMatrix" +pathSlug: "AdjacencyMatrix" mytitle: "Researchers network and migration flows" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -20,7 +20,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -79,7 +79,7 @@ dataUU %>% head(3) %>% select(1:4) %>% kable() %>% # Chord diagram *** -A chord diagram is a good way to represent migration flows. It works well if your data are directed and weighted like for migration flows between country. +A chord diagram is a good way to represent migration flows. It works well if your data are directed and weighted like for migration flows between country. Disclaimer: this plot is made using the circlize library, and very strongly inspired from the [Migest package](https://github.com/cran/migest) from [Gui J. Abel](http://guyabel.com), who is also the author of the migration [dataset](https://www.oeaw.ac.at/fileadmin/subsites/Institute/VID/PDF/Publications/Working_Papers/WP2016_02.pdf) used here. 
@@ -106,46 +106,46 @@ mycolor <- mycolor[sample(1:10)] # Base plot chordDiagram( - x = data_long, + x = data_long, grid.col = mycolor, transparency = 0.25, directional = 1, - direction.type = c("arrows", "diffHeight"), + direction.type = c("arrows", "diffHeight"), diffHeight = -0.04, - annotationTrack = "grid", + annotationTrack = "grid", annotationTrackHeight = c(0.05, 0.1), - link.arr.type = "big.arrow", - link.sort = TRUE, + link.arr.type = "big.arrow", + link.sort = TRUE, link.largest.ontop = TRUE) # Add text and axis circos.trackPlotRegion( - track.index = 1, - bg.border = NA, + track.index = 1, + bg.border = NA, panel.fun = function(x, y) { - + xlim = get.cell.meta.data("xlim") sector.index = get.cell.meta.data("sector.index") - - # Add names to the sector. + + # Add names to the sector. circos.text( - x = mean(xlim), - y = 3.2, - labels = sector.index, - facing = "bending", + x = mean(xlim), + y = 3.2, + labels = sector.index, + facing = "bending", cex = 0.8 ) # Add graduation on axis circos.axis( - h = "top", - major.at = seq(from = 0, to = xlim[2], by = ifelse(test = xlim[2]>10, yes = 2, no = 1)), - minor.ticks = 1, + h = "top", + major.at = seq(from = 0, to = xlim[2], by = ifelse(test = xlim[2]>10, yes = 2, no = 1)), + minor.ticks = 1, major.tick.percentage = 0.5, labels.niceFacing = FALSE) } ) - + ``` In my opinion this is a powerful way to display information. Major flows are easy to detect, like the migration from South Asia towards Westa Asia, or Africa to Europe. Moreover, for each continent it is quite easy to quantify the proportion of people leaving and arriving. 
@@ -178,9 +178,9 @@ data_long$target <- paste(data_long$target, " ", sep="") # From these flows we need to create a node data frame: it lists every entities involved in the flow nodes <- data.frame(name=c(as.character(data_long$source), as.character(data_long$target)) %>% unique()) - + # With networkD3, connection must be provided using id, not using real name like in the links dataframe.. So we need to reformat it. -data_long$IDsource=match(data_long$source, nodes$name)-1 +data_long$IDsource=match(data_long$source, nodes$name)-1 data_long$IDtarget=match(data_long$target, nodes$name)-1 # prepare colour scale @@ -189,7 +189,7 @@ ColourScal ='d3.scaleOrdinal() .range(["#FDE725FF","#B4DE2CFF","#6DCD59FF","#35B # Make the Network sankeyNetwork(Links = data_long, Nodes = nodes, Source = "IDsource", Target = "IDtarget", - Value = "value", NodeID = "name", + Value = "value", NodeID = "name", sinksRight=FALSE, colourScale=ColourScal, nodeWidth=40, fontSize=13, nodePadding=20) ``` @@ -207,9 +207,9 @@ The [heatmap](https://www.data-to-viz.com/graph/heatmap.html) is another great a ```{r, fig.align="center", fig.width=6, fig.height=6, warning=FALSE, message=FALSE} library(heatmaply) -p <- heatmaply(data, +p <- heatmaply(data, dendrogram = "none", - xlab = "", ylab = "", + xlab = "", ylab = "", main = "", scale = "column", margins = c(60,100,40,20), @@ -250,7 +250,7 @@ tmp <- tmp[which(rowSums(tmp)>3), which(colSums(tmp)>3)] # Heatmap p <- heatmaply(tmp, dendrogram = "both", - xlab = "", ylab = "", + xlab = "", ylab = "", main = "", scale = "none", margins = c(60,100,40,20), @@ -278,9 +278,9 @@ p # Network *** -Since an adjacency matrix is a `network structure`, it is possible to build a [network graph](https://www.data-to-viz.com/graph/network.html). In a network graph, each entity is represented as a `node`, and each connection as an `edge`. 
+Since an adjacency matrix is a `network structure`, it is possible to build a [network graph](https://www.data-to-viz.com/graph/network.html). In a network graph, each entity is represented as a `node`, and each connection as an `edge`. -In my opinion, this type of representation makes more sense when the connections are `unweighted`, since drawing edges with different sizes tends to clutter the figure and make it unreadable. +In my opinion, this type of representation makes more sense when the connections are `unweighted`, since drawing edges with different sizes tends to clutter the figure and make it unreadable. Thus, here is an application of this chart type to the coauthor network. Researchers are the nodes, represented as dots. If 2 researchers have published at least one scientific paper together, they are connected. The node size is proportionnal to the number of coauthors. @@ -291,10 +291,10 @@ Thus, here is an application of this chart type to the coauthor network. Researc ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=12, fig.height=9} # Transform the adjacency matrix in a long format -connect <- dataUU %>% +connect <- dataUU %>% gather(key="to", value="value", -1) %>% mutate(to = gsub("\\.", " ",to)) %>% - na.omit() + na.omit() # Number of connection per person c( as.character(connect$from), as.character(connect$to)) %>% @@ -310,13 +310,13 @@ mygraph <- graph_from_data_frame( connect, vertices = coauth, directed = FALSE ) com <- walktrap.community(mygraph) #Reorder dataset and make the graph -coauth <- coauth %>% +coauth <- coauth %>% mutate( grp = com$membership) %>% arrange(grp) %>% mutate(name=factor(name, name)) # keep only 10 first communities -coauth <- coauth %>% +coauth <- coauth %>% filter(grp<16) # keep only this people in edges @@ -332,7 +332,7 @@ mycolor <- colormap(colormap=colormaps$viridis, nshades=max(coauth$grp)) mycolor <- sample(mycolor, length(mycolor)) # Make the graph -ggraph(mygraph) + +ggraph(mygraph) + 
geom_edge_link(edge_colour="black", edge_alpha=0.2, edge_width=0.3, fold=TRUE) + geom_node_point(aes(size=n, color=as.factor(grp), fill=grp), alpha=0.9) + scale_size_continuous(range=c(0.5,8)) + @@ -344,7 +344,7 @@ ggraph(mygraph) + plot.margin=unit(c(0,0,0,0), "null"), panel.spacing=unit(c(0,0,0,0), "null") ) + - expand_limits(x = c(-1.2, 1.2), y = c(-1.2, 1.2)) + expand_limits(x = c(-1.2, 1.2), y = c(-1.2, 1.2)) ``` @@ -373,10 +373,10 @@ Instead of using a custom algorithm to position each nodes, it is possible to pl ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=8, fig.height=8} # Transform the adjacency matrix in a long format -connect <- dataUU %>% +connect <- dataUU %>% gather(key="to", value="value", -1) %>% mutate(to = gsub("\\.", " ",to)) %>% - na.omit() + na.omit() # Number of connection per person c( as.character(connect$from), as.character(connect$to)) %>% @@ -392,13 +392,13 @@ mygraph <- graph_from_data_frame( connect, vertices = coauth, directed = FALSE ) com <- walktrap.community(mygraph) #Reorder dataset and make the graph -coauth <- coauth %>% +coauth <- coauth %>% mutate( grp = com$membership) %>% arrange(grp) %>% mutate(name=factor(name, name)) # keep only 10 first communities -coauth <- coauth %>% +coauth <- coauth %>% filter(grp<16) # keep only this people in edges @@ -421,7 +421,7 @@ mycolor <- colormap(colormap=colormaps$viridis, nshades=max(coauth$grp)) mycolor <- sample(mycolor, length(mycolor)) # Make the graph -ggraph(mygraph, layout="circle") + +ggraph(mygraph, layout="circle") + geom_edge_link(edge_colour="black", edge_alpha=0.2, edge_width=0.3, fold=FALSE) + geom_node_point(aes(size=n, color=as.factor(grp), fill=grp), alpha=0.9) + scale_size_continuous(range=c(0.5,8)) + @@ -433,7 +433,7 @@ ggraph(mygraph, layout="circle") + plot.margin=unit(c(0,0,0,0), "null"), panel.spacing=unit(c(0,0,0,0), "null") ) + - expand_limits(x = c(-1.2, 1.2), y = c(-1.2, 1.2)) + expand_limits(x = c(-1.2, 1.2), y = c(-1.2, 1.2)) ``` @@ 
-452,7 +452,7 @@ An arc diagram follows the same concept, but displays nodes along a single axis ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=15, fig.height=7} # Make the graph -ggraph(mygraph, layout="linear") + +ggraph(mygraph, layout="linear") + geom_edge_arc(edge_colour="black", edge_alpha=0.2, edge_width=0.3, fold=TRUE) + geom_node_point(aes(size=n, color=as.factor(grp), fill=grp), alpha=0.5) + scale_size_continuous(range=c(0.5,8)) + @@ -464,7 +464,7 @@ ggraph(mygraph, layout="linear") + plot.margin=unit(c(0,0,0.4,0), "null"), panel.spacing=unit(c(0,0,3.4,0), "null") ) + - expand_limits(x = c(-1.2, 1.2), y = c(-5.6, 1.2)) + expand_limits(x = c(-1.2, 1.2), y = c(-5.6, 1.2)) ``` @@ -481,7 +481,7 @@ ggraph(mygraph, layout="linear") + - + diff --git a/story/GPSCoordWithValue.Rmd b/story/GPSCoordWithValue.Rmd index 92e2279..1dee902 100644 --- a/story/GPSCoordWithValue.Rmd +++ b/story/GPSCoordWithValue.Rmd @@ -1,10 +1,10 @@ --- myimage1: "BubbleMap150.png" -mydisqus: "GPSCoordWithValue" +pathSlug: "GPSCoordWithValue" mytitle: "The biggest UK cities" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -16,7 +16,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -71,14 +71,14 @@ UK <- map_data("world") %>% filter(region=="UK") # Easy to make it interactive! library(plotly) - + # plot p=data %>% - + arrange(desc(pop)) %>% mutate( name=factor(name, unique(name))) %>% mutate( mytext=paste("City: ", name, "\n", "Population: ", pop, sep="")) %>% # This prepare the text displayed on hover. 
- + # Makte the static plot calling this text: ggplot() + ggplot2::annotate("text", x = 1, y = 56.3, label="1000 biggest cities in the UK", colour = "black", size=4, alpha=1) + diff --git a/story/GPSCoordWithoutValue.Rmd b/story/GPSCoordWithoutValue.Rmd index cfd501f..44c7a06 100644 --- a/story/GPSCoordWithoutValue.Rmd +++ b/story/GPSCoordWithoutValue.Rmd @@ -2,11 +2,11 @@ myimage1: "Choropleth150.png" myimage2: "MapHexbin150.png" myimage3: "Cartogram150.png" -mydisqus: "GPSCoordWithoutValue" +pathSlug: "GPSCoordWithoutValue" mytitle: "Where surfers live" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -18,7 +18,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -86,7 +86,7 @@ p <- data %>% xlim(-180,180) + ylim(-60,80) + scale_x_continuous(expand = c(0.006, 0.006)) + - coord_equal() + coord_equal() ggsave(p, file="IMG/Surfer_position.png", width = 36, height = 15.22, units = "in", dpi = 90) @@ -122,7 +122,7 @@ To create a [hexbin map](https://www.data-to-viz.com/graph/hexbinmap.html), the data %>% filter(homecontinent=='Europe') %>% - ggplot( aes(x=homelon, y=homelat)) + + ggplot( aes(x=homelon, y=homelat)) + geom_hex(bins=59) + ggplot2::annotate("text", x = -27, y = 72, label="Where people tweet about #Surf", colour = "black", size=5, alpha=1, hjust=0) + ggplot2::annotate("segment", x = -27, xend = 10, y = 70, yend = 70, colour = "black", size=0.2, alpha=1) + @@ -130,21 +130,21 @@ data %>% xlim(-30, 70) + ylim(24, 72) + scale_fill_viridis( - trans = "log", + trans = "log", breaks = c(1,7,54,403,3000), - name="Tweet # recorded in 8 months", - guide = guide_legend( keyheight = unit(2.5, units = "mm"), keywidth=unit(10, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) + name="Tweet # recorded in 8 months", + guide = guide_legend( keyheight = unit(2.5, units = "mm"), 
keywidth=unit(10, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) ) + ggtitle( "" ) + theme( legend.position = c(0.8, 0.09), legend.title=element_text(color="black", size=8), text = element_text(color = "#22211d"), - plot.background = element_rect(fill = "#f5f5f2", color = NA), - panel.background = element_rect(fill = "#f5f5f2", color = NA), + plot.background = element_rect(fill = "#f5f5f2", color = NA), + panel.background = element_rect(fill = "#f5f5f2", color = NA), legend.background = element_rect(fill = "#f5f5f2", color = NA), plot.title = element_text(size= 13, hjust=0.1, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")), - ) + ) ``` @@ -171,7 +171,7 @@ Note that this is very close from an [2d histogram map](https://www.data-to-viz. ```{r, warning=FALSE, message=FALSE, fig.align="center", fig.width=9, fig.height=6} # Make the hexbin map with the geom_hex function -ggplot(data, aes(x=homelon, y=homelat)) + +ggplot(data, aes(x=homelon, y=homelat)) + geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.3) + geom_bin2d(bins=100) + ggplot2::annotate("text", x = 175, y = 80, label="Where people tweet about #Surf", colour = "black", size=4, alpha=1, hjust=1) + @@ -179,21 +179,21 @@ ggplot(data, aes(x=homelon, y=homelat)) + theme_void() + ylim(-70, 80) + scale_fill_viridis( - trans = "log", + trans = "log", breaks = c(1,7,54,403,3000), - name="Tweet # recorded in 8 months", - guide = guide_legend( keyheight = unit(2.5, units = "mm"), keywidth=unit(10, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) + name="Tweet # recorded in 8 months", + guide = guide_legend( keyheight = unit(2.5, units = "mm"), keywidth=unit(10, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) ) + ggtitle( "" ) + theme( legend.position = c(0.8, 0.09), legend.title=element_text(color="black", size=8), text = element_text(color = "#22211d"), - plot.background = 
element_rect(fill = "#f5f5f2", color = NA), - panel.background = element_rect(fill = "#f5f5f2", color = NA), + plot.background = element_rect(fill = "#f5f5f2", color = NA), + panel.background = element_rect(fill = "#f5f5f2", color = NA), legend.background = element_rect(fill = "#f5f5f2", color = NA), plot.title = element_text(size= 13, hjust=0.1, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")), - ) + ) ``` diff --git a/story/OneCatSevOrderedNum.Rmd b/story/OneCatSevOrderedNum.Rmd index 7bdef21..f2c440c 100644 --- a/story/OneCatSevOrderedNum.Rmd +++ b/story/OneCatSevOrderedNum.Rmd @@ -3,11 +3,11 @@ myimage1: "Line150.png" myimage2: "Area150.png" myimage3: "StackedArea150.png" myimage4: "Stream150.png" -mydisqus: "OneCatSevOrderedNum" +pathSlug: "OneCatSevOrderedNum" mytitle: "Evolution of baby names in the US" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html diff --git a/story/OneNum.Rmd b/story/OneNum.Rmd index ca3cd1f..f85ec17 100644 --- a/story/OneNum.Rmd +++ b/story/OneNum.Rmd @@ -2,10 +2,10 @@ myimage1: "DensitySmall.png" myimage2: "HistogramSmall.png" mytitle: "Airbnb prices on the french riviera" -mydisqus: "OneNum" +pathSlug: "OneNum" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -17,7 +17,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -63,7 +63,7 @@ data %>% head(6) %>% kable() %>% #Histogram *** -The most common way to represent a unique numeric variable is with a histogram. Basically, the numeric variable is cut in several `bins`: between 0 and 10 euros a night, between 10 and 20 and so on. This is represented on the X axis. Then, the number of apartments per bin is counted and represented on the Y axis. 
+The most common way to represent a unique numeric variable is with a histogram. Basically, the numeric variable is cut in several `bins`: between 0 and 10 euros a night, between 10 and 20 and so on. This is represented on the X axis. Then, the number of apartments per bin is counted and represented on the Y axis. Here, it appears that about 500 appartments have a price between 80 and 90 euros. A histogram is a convenient way to visualize the data: it allows us to understand its `distribution`. @@ -106,7 +106,7 @@ There is a huge difference difference between these 2 histograms. Actually a few A variation of the histogram is the density plot, which is basically a smoothed version of the histogram. It represents a `kernel density estimate` of the variable. As seen for the bin size of the histogram, it is important to try several values for the `bandwidth` argument for the same reason:
- +
```{r, fig.align="center"} data %>% diff --git a/story/OneNumOneCat.Rmd b/story/OneNumOneCat.Rmd index 411c9d3..e880a36 100644 --- a/story/OneNumOneCat.Rmd +++ b/story/OneNumOneCat.Rmd @@ -4,11 +4,11 @@ myimage2: "CircularBarplot150.png" myimage3: "Lollipop150.png" myimage4: "Tree150.png" myimage5: "CircularPacking150.png" -mydisqus: "OneNumOneCat" +pathSlug: "OneNumOneCat" mytitle: "Who sells more weapons?" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -20,7 +20,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -155,9 +155,9 @@ ggplot(tmp, aes(x=as.factor(id), y=Value)) + # Note that id is a factor. I axis.text = element_blank(), axis.title = element_blank(), panel.grid = element_blank(), - plot.margin = unit(rep(-1,4), "cm") + plot.margin = unit(rep(-1,4), "cm") ) + - coord_polar(start = 0) + + coord_polar(start = 0) + geom_text(data=label_tmp, aes(x=id, y=Value+200, label=Country ), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_tmp$angle, hjust=label_tmp$hjust, inherit.aes = FALSE ) + geom_text( aes(x=24, y=8000, label="Who sells more weapons?"), color="black", inherit.aes = FALSE) ``` @@ -169,7 +169,7 @@ Please note that in this case the [circular barplot](https://www.data-to-viz.com #Treemap *** -A [treemap](https://www.data-to-viz.com/graph/treemap.html) represents each entity as a rectangle, with an area that is proportional to the numeric variable of the dataset. +A [treemap](https://www.data-to-viz.com/graph/treemap.html) represents each entity as a rectangle, with an area that is proportional to the numeric variable of the dataset. It is a good way to show a general overview of the data organization and is probably more eye-catching than the previous barplot. However, it is less precise in the sense that it is harder to make accurate comparisons between groups. 
```{r, fig.align="center", fig.width=9} # Package @@ -177,30 +177,30 @@ library(treemap) # Plot treemap(data, - + # data index="Country", vSize="Value", type="index", - + # Main title="", palette="Dark2", # Borders: - border.col=c("black"), - border.lwds=1, - + border.col=c("black"), + border.lwds=1, + # Labels fontsize.labels=0.5, fontcolor.labels="white", - fontface.labels=1, - bg.labels=c("transparent"), - align.labels=c("left", "top"), + fontface.labels=1, + bg.labels=c("transparent"), + align.labels=c("left", "top"), overlap.labels=0.5, inflate.labels=T # If true, labels are bigger when rectangle is bigger. - + ) ``` @@ -216,7 +216,7 @@ library(ggraph) library(igraph) library(tidyverse) library(viridis) - + # We need a data frame giving a hierarchical structure. Let's consider the flare dataset: tmp <- data %>% filter(!is.na(Value)) edges <- data.frame( @@ -226,7 +226,7 @@ edges <- data.frame( vertices = rbind(tmp, data.frame(Country="o", Value=1)) %>% mutate(name=Country) mygraph <- graph_from_data_frame( edges, vertices=vertices ) -ggraph(mygraph, layout = 'circlepack', weight="Value") + +ggraph(mygraph, layout = 'circlepack', weight="Value") + geom_node_circle( aes(fill=name)) + scale_fill_viridis(discrete=TRUE) + geom_node_label(aes(label=name, size=Value)) + diff --git a/story/OneNumOneCatSeveralObs.Rmd b/story/OneNumOneCatSeveralObs.Rmd index cae181b..eca2506 100644 --- a/story/OneNumOneCatSeveralObs.Rmd +++ b/story/OneNumOneCatSeveralObs.Rmd @@ -5,10 +5,10 @@ myimage3: "Box1Small.png" myimage4: "ViolinSmall.png" myimage5: "JoyplotSmall.png" mytitle: "Perception of probability" -mydisqus: "OneNumOneCatSeveralObs" +pathSlug: "OneNumOneCatSeveralObs" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -20,7 +20,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -52,11 +52,11 @@ library(viridis) # 
Load dataset from github data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",") -data <- data %>% +data <- data %>% gather(key="text", value="value") %>% mutate(text = gsub("\\.", " ",text)) %>% mutate(value = round(as.numeric(value),0)) - + # show data data %>% sample_n(8) %>% kable(row.names = FALSE) %>% @@ -146,7 +146,7 @@ data %>% ``` -However if you have more than ~4 groups this technique does not work: the graphic would become too cluttered. Thus it is a better practice to use small multiple: +However if you have more than ~4 groups this technique does not work: the graphic would become too cluttered. Thus it is a better practice to use small multiple: ```{r, fig.align='center', fig.height=7, fig.width=8, warning=FALSE} data %>% mutate(text = fct_reorder(text, value)) %>% diff --git a/story/OneNumSevCatSubgroupOneObsPerGroup.Rmd b/story/OneNumSevCatSubgroupOneObsPerGroup.Rmd index 053ac0b..e963a16 100644 --- a/story/OneNumSevCatSubgroupOneObsPerGroup.Rmd +++ b/story/OneNumSevCatSubgroupOneObsPerGroup.Rmd @@ -4,10 +4,10 @@ myimage2: "Parallel1Small.png" myimage3: "SpiderSmall.png" myimage4: "ScatterPlotSmall.png" mytitle: "The gender wage gap" -mydisqus: "OneNumSevCatSubgroupOneObsPerGroup" +pathSlug: "OneNumSevCatSubgroupOneObsPerGroup" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -19,7 +19,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -79,7 +79,7 @@ data %>% filter(Country %in% with4$Country) %>% mutate(Country = fct_reorder(Country, Value)) %>% mutate(TIME=factor(TIME, levels = c("2000", "2005", "2010", "2015"))) %>% - ggplot(aes(fill=as.factor(TIME), y=Value, x=Country)) + + ggplot(aes(fill=as.factor(TIME), y=Value, x=Country)) + geom_bar(position="dodge", stat="identity") + scale_fill_viridis(discrete=T, name="") + 
coord_flip() + @@ -97,7 +97,7 @@ data %>% filter(Country %in% with4$Country) %>% mutate(Country = fct_reorder(Country, Value)) %>% mutate(TIME=factor(TIME, levels = c("2000", "2005", "2010", "2015"))) %>% - ggplot(aes(fill=Country, y=Value, x=as.factor(TIME))) + + ggplot(aes(fill=Country, y=Value, x=as.factor(TIME))) + geom_bar(position="dodge", stat="identity") + scale_fill_viridis(discrete=T, name="") + coord_flip() + @@ -125,7 +125,7 @@ data %>% filter(Country != "OECD - Average") %>% mutate(label = if_else(TIME == max(TIME) & Country %in% grp1, as.character(Country), NA_character_)) %>% mutate(label2 = if_else(TIME == min(TIME) & Country %in% grp2, as.character(Country), NA_character_)) %>% - ggplot( aes(x=as.factor(TIME), y=Value, color=Country, group=Country)) + + ggplot( aes(x=as.factor(TIME), y=Value, color=Country, group=Country)) + geom_point() + geom_line() + geom_label_repel( aes(label=label), nudge_x = 0.3, hjust=0, na.rm = TRUE, segment.colour="grey") + @@ -167,7 +167,7 @@ data %>% filter(TIME %in% c(2000, 2015)) %>% mutate(label = if_else(TIME == max(TIME) & Country %in% grp1, as.character(Country), NA_character_)) %>% mutate(label2 = if_else(TIME == min(TIME) & Country %in% grp2, as.character(Country), NA_character_)) %>% - ggplot( aes(x=as.factor(TIME), y=Value, color=Country, group=Country)) + + ggplot( aes(x=as.factor(TIME), y=Value, color=Country, group=Country)) + geom_point() + geom_line() + geom_label_repel( aes(label=label), nudge_x = 0.3, hjust=0, na.rm = TRUE, segment.colour="grey") + @@ -202,7 +202,7 @@ p <- data %>% theme(legend.position="none") + xlab("Gender wage gap in 2000 (%)") + ylab("Gender wage gap in 2015 (%)") - + ggplotly(p, tooltip="text") ``` diff --git a/story/OneNumSevCatSubgroupSevObsPerGroup.Rmd b/story/OneNumSevCatSubgroupSevObsPerGroup.Rmd index 123c459..f06bc61 100644 --- a/story/OneNumSevCatSubgroupSevObsPerGroup.Rmd +++ b/story/OneNumSevCatSubgroupSevObsPerGroup.Rmd @@ -3,11 +3,11 @@ myimage1: "DensitySmall.png" 
myimage2: "Box1Small.png" myimage3: "ViolinSmall.png" myimage4: "HistogramSmall.png" -mydisqus: "OneNumSevCatSubgroupSevObsPerGroup" +pathSlug: "OneNumSevCatSubgroupSevObsPerGroup" mytitle: "How much do people tip?" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -19,7 +19,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -73,8 +73,8 @@ The most common way to represent that kind of dataset is probably the grouped [b Here, it looks like there is not much difference in tip values from one day to the other in average, except a slight increase on sunday. Moreover, it looks like females tend to tip more than males on friday. Note that individual data points are presented using `jittering`, what allows to detect more particular pattern and assess the sample size of each group. ```{r, fig.width=8, fig.height=6, fig.align="center", warning=FALSE} # Counts the number of value per group and subgroup -counts = data %>% - group_by(day, sex) %>% +counts = data %>% + group_by(day, sex) %>% summarize( n=n(), median=median(tip) @@ -84,7 +84,7 @@ counts = data %>% data %>% mutate(day = fct_reorder(day, tip)) %>% mutate(day = factor(day, levels=c("Thur", "Fri", "Sat", "Sun"))) %>% - ggplot(aes(fill=sex, y=tip, x=day)) + + ggplot(aes(fill=sex, y=tip, x=day)) + geom_boxplot(position=position_dodge2(preserve = "total"), alpha=0.5, outlier.colour="transparent", varwidth = TRUE) + geom_point(color="grey", size=1, width=0.1, position=position_jitterdodge() , alpha=0.4) + scale_fill_viridis(discrete=T, name="") + @@ -103,7 +103,7 @@ In the above chart categories are grouped by day. 
It is possible to build the sa data %>% mutate(day = fct_reorder(day, tip)) %>% mutate(day = factor(day, levels=c("Thur", "Fri", "Sat", "Sun"))) %>% - ggplot(aes(fill=day, y=tip, x=sex)) + + ggplot(aes(fill=day, y=tip, x=sex)) + geom_boxplot(position="dodge", alpha=0.5, outlier.colour="transparent") + geom_point(color="grey", size=1, width=0.1, position=position_jitterdodge() , alpha=0.4) + scale_fill_viridis(discrete=T, name="") + @@ -125,7 +125,7 @@ The [violin plot](https://www.data-to-viz.com/graph/violin.html) can be used exa data %>% mutate(day = fct_reorder(day, tip)) %>% mutate(day = factor(day, levels=c("Thur", "Fri", "Sat", "Sun"))) %>% - ggplot(aes(fill=sex, y=tip, x=day)) + + ggplot(aes(fill=sex, y=tip, x=day)) + geom_violin(position="dodge", alpha=0.5, outlier.colour="transparent") + scale_fill_viridis(discrete=T, name="") + theme_ipsum() + @@ -147,7 +147,7 @@ Small multiple is a powerful technique that can be used with that kind of data. data %>% mutate(day = fct_reorder(day, tip)) %>% mutate(day = factor(day, levels=c("Thur", "Fri", "Sat", "Sun"))) %>% - ggplot(aes(x=tip)) + + ggplot(aes(x=tip)) + geom_histogram(bins=20, fill="#69b3a2", color="white") + facet_grid(sex~time) + theme_ipsum() + diff --git a/story/SevCatOneNumNestedOneObsPerGroup.Rmd b/story/SevCatOneNumNestedOneObsPerGroup.Rmd index d992705..e6486d1 100644 --- a/story/SevCatOneNumNestedOneObsPerGroup.Rmd +++ b/story/SevCatOneNumNestedOneObsPerGroup.Rmd @@ -3,11 +3,11 @@ myimage1: "Lollipop150.png" myimage2: "CircularBarplot150.png" myimage3: "Tree150.png" myimage4: "CircularPacking150.png" -mydisqus: "SevCatOneNumNestedOneObsPerGroup" +pathSlug: "SevCatOneNumNestedOneObsPerGroup" mytitle: "Visualizing the world population" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -19,7 +19,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html 
--- @@ -30,11 +30,11 @@ output:


-This document gives a few suggestions to analyse a `nested` or `hierarchical` dataset in which a numeric value is available for each leaf. This kind of data has an origin node that gives birth to subsequent nodes and so on until the final leaves. +This document gives a few suggestions to analyse a `nested` or `hierarchical` dataset in which a numeric value is available for each leaf. This kind of data has an origin node that gives birth to subsequent nodes and so on until the final leaves.
-Take the world population of 250 countries as an example. The world is divided into continents (group), continents are divided into regions (subgroup), and regions are divided into countries. In this tree structure, countries are considered as leaves: they are at the end of the branches. +Take the world population of 250 countries as an example. The world is divided into continents (group), continents are divided into regions (subgroup), and regions are divided into countries. In this tree structure, countries are considered as leaves: they are at the end of the branches.
@@ -71,29 +71,29 @@ A [treemap](https://www.data-to-viz.com/graph/treemap.html) represents each node library(treemap) p <- treemap(data, - + # data index=c("Continent", "Region", "Country"), vSize="Pop", type="index", - + # Main title="", palette="Dark2", # Borders: - border.col=c("black", "grey", "grey"), - border.lwds=c(1,0.5,0.1), - + border.col=c("black", "grey", "grey"), + border.lwds=c(1,0.5,0.1), + # Labels fontsize.labels=c(0.7, 0.4, 0.3), fontcolor.labels=c("white", "white", "black"), - fontface.labels=1, - bg.labels=c("transparent"), - align.labels=list( c("center", "center"), c("left", "top"), c("right", "bottom")), + fontface.labels=1, + bg.labels=c("transparent"), + align.labels=list( c("center", "center"), c("left", "top"), c("right", "bottom")), overlap.labels=0.5, - inflate.labels=T - + inflate.labels=T + ) ``` @@ -124,8 +124,8 @@ d3tree2( p , rootname = "General" )
```{r, fig.align="center", fig.width=9, fig.height=6, warning=FALSE, message=FALSE} # Libraries -library(circlepackeR) - +library(circlepackeR) + # Remove a few problematic lines data <- data %>% filter(Continent!="") %>% droplevels() @@ -134,7 +134,7 @@ data <- data %>% filter(Continent!="") %>% droplevels() library(data.tree) data$pathString <- paste("world", data$Continent, data$Region, data$Country, sep = "/") population <- as.Node(data) - + # You can custom the minimum and maximum value of the color range. circlepackeR(population, size = "Pop", color_min = "hsl(56,80%,80%)", color_max = "hsl(341,30%,40%)") ``` @@ -145,7 +145,7 @@ circlepackeR(population, size = "Pop", color_min = "hsl(56,80%,80%)", color_max # Lollipop *** -A [lollipop](https://www.data-to-viz.com/graph/lollipop.html) plot is basically a [barplot](https://www.data-to-viz.com/graph/barplot.html), where the bar is transformed in a line and a dot. It shows the relationship between a numeric and a categoric variable. +A [lollipop](https://www.data-to-viz.com/graph/lollipop.html) plot is basically a [barplot](https://www.data-to-viz.com/graph/barplot.html), where the bar is transformed in a line and a dot. It shows the relationship between a numeric and a categoric variable. It can be a good option if your interested in the value of each leaf, but do not really care about the hierarchy of the dataset. For instance, if you wonder which country as the highest population size. Here, China and India pop out clearly. @@ -193,11 +193,11 @@ data %>% legend.position="none" ) + xlab("") + - ylab("Population (M)") + ylab("Population (M)") ``` - + # Circular version *** Note that it is possible to make a [circular version](https://www.data-to-viz.com/graph/circularbarplot.html) of your barplot or lollipop plot. In my opinion, this kind of representation works especially well when you have several groups and obvious patterns. 
Indeed, it suits the world population dataset not too bad: @@ -223,7 +223,7 @@ to_add$group=rep(levels(data$group), each=empty_bar) data=rbind(data, to_add) data=data %>% arrange(group) data$id=seq(1, nrow(data)) - + # Get the name and the y position of each label label_data=data number_of_bar=nrow(label_data) @@ -234,36 +234,36 @@ label_data$individual <- gsub("Democratic Republic of the Congo", "R. D. Congo", label_data$value[which(label_data$individual == "Nigeria")] <- 130 # prepare a data frame for base lines -base_data=data %>% - group_by(group) %>% - summarize(start=min(id), end=max(id) - empty_bar) %>% - rowwise() %>% +base_data=data %>% + group_by(group) %>% + summarize(start=min(id), end=max(id) - empty_bar) %>% + rowwise() %>% mutate(title=mean(c(start, end))) %>% mutate(group = gsub(" Africa", "", group)) %>% mutate(group = gsub("ern", "", group)) - + # prepare a data frame for grid (scales) grid_data = base_data grid_data$end = grid_data$end[ c( nrow(grid_data), 1:nrow(grid_data)-1)] + 1 grid_data$start = grid_data$start - 1 grid_data=grid_data[-1,] - + # Make the plot p = ggplot(data, aes(x=as.factor(id), y=value, fill=group)) + # Note that id is a factor. If x is numeric, there is some space between the first bar - + # Main bars geom_bar(aes(x=as.factor(id), y=value, fill=group), stat="identity", alpha=0.5) + scale_fill_viridis(discrete=T) + - + # Add a val=100/75/50/25 lines. I do it at the beginning to make sur barplots are OVER it. 
geom_segment(data=grid_data, aes(x = end, y = 80, xend = start, yend = 80), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 60, xend = start, yend = 60), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 40, xend = start, yend = 40), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + geom_segment(data=grid_data, aes(x = end, y = 20, xend = start, yend = 20), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) + - + # Add text showing the value of each 100/75/50/25 lines ggplot2::annotate("text", x = rep(max(data$id),4), y = c(20, 40, 60, 80), label = c("20", "40", "60", "80") , color="grey", size=3 , angle=0, fontface="bold", hjust=1) + - + geom_bar(aes(x=as.factor(id), y=value, fill=group), stat="identity", alpha=0.5) + ylim(-70,180) + theme_minimal() + @@ -274,16 +274,16 @@ p = ggplot(data, aes(x=as.factor(id), y=value, fill=group)) + # Note that panel.grid = element_blank(), plot.margin = unit(c(-3,-5,-5,-5), "cm") ) + - coord_polar() + + coord_polar() + geom_text(data=label_data, aes(x=id, y=value+10, label=individual, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_data$angle, inherit.aes = FALSE ) + - + # Add base line information geom_segment(data=base_data, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) + geom_text(data=base_data, aes(x = title, y = -15, label=group), hjust=c(1,1,0.5,0,0), colour = "black", alpha=0.8, size=3, fontface="bold", inherit.aes = FALSE) - + p ``` - + diff --git a/story/SeveralIndepLists.Rmd b/story/SeveralIndepLists.Rmd index 1267e5a..850aca4 100644 --- a/story/SeveralIndepLists.Rmd +++ b/story/SeveralIndepLists.Rmd @@ -3,10 +3,10 @@ myimage1: "VennSmall.png" myimage2: "LollipopSmall.png" myimage3: "WordCloudSmall.png" mytitle: "Comparing raper lyrics" -mydisqus: "SeveralIndepLists" +pathSlug: 
"SeveralIndepLists" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -18,7 +18,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -44,7 +44,7 @@ options(knitr.table.format = "html") library(proustr) # Load dataset from github -data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/14_SeveralIndepLists.csv", header=TRUE) +data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/14_SeveralIndepLists.csv", header=TRUE) to_remove <- c("_|[0-9]|\\.|function|^id|script|var|div|null|typeof|opts|if|^r$|undefined|false|loaded|true|settimeout|eval|else|artist") data <- data %>% filter(!grepl(to_remove, word)) %>% filter(!word %in% stopwords('fr')) %>% filter(!word %in% proust_stopwords()$word) @@ -85,7 +85,7 @@ c %>% head(6) %>% kable(row.names=FALSE) %>% #Wordcloud *** -If some words are repeated in the dataset, the first thing to do is probably to find out what are the most frequent ones. A common way to do so is to build a [wordcloud](https://www.data-to-viz.com/graph/wordcloud.html): each word is written with a size proportionnal to its frequency. +If some words are repeated in the dataset, the first thing to do is probably to find out what are the most frequent ones. A common way to do so is to build a [wordcloud](https://www.data-to-viz.com/graph/wordcloud.html): each word is written with a size proportionnal to its frequency. 
```{r, fig.align="center", fig.height=6, warning=FALSE, message=FALSE} # The wordcloud 2 library is the best option for wordcloud in R @@ -145,20 +145,20 @@ Once the most frequent words are known, it is of interest to know how many words ```{r, warning=FALSE, message=FALSE, results = "hide"} #upload library library(VennDiagram) - + #Make the plot venn.diagram( x = list( - data %>% filter(artist=="booba") %>% select(word) %>% unlist() , - data %>% filter(artist=="nekfeu") %>% select(word) %>% unlist() , + data %>% filter(artist=="booba") %>% select(word) %>% unlist() , + data %>% filter(artist=="nekfeu") %>% select(word) %>% unlist() , data %>% filter(artist=="georges-brassens") %>% select(word) %>% unlist() ), category.names = c("Booba" , "Nekfeu" , "Brassens"), filename = 'venn.png', output = TRUE , imagetype="png" , - height = 480 , - width = 480 , + height = 480 , + width = 480 , resolution = 300, compression = "lzw", lwd = 1, @@ -185,7 +185,7 @@ venn.diagram( *** This section needs improvements: - + - introduction of upset plot - same number of word per artist - more ideas to come diff --git a/story/SeveralNum.Rmd b/story/SeveralNum.Rmd index 20f42c0..d1eed67 100644 --- a/story/SeveralNum.Rmd +++ b/story/SeveralNum.Rmd @@ -5,11 +5,11 @@ myimage3: "ViolinSmall.png" myimage4: "JoyplotSmall.png" myimage5: "HeatmapSmall.png" myimage6: "DendrogramSmall.png" -mydisqus: "SeveralNum" +pathSlug: "SeveralNum" mytitle: "Eleven features for 32 cars" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -21,7 +21,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -112,7 +112,7 @@ dend %>% set("branches_k_color", value = c("#69b3a2", "#404080", "orange"), k = 3) %>% plot(horiz=TRUE, axes=FALSE) abline(v = 350, lty = 2) - + ``` Here, the dendrogram informs us that the Mercedes 280 and the Mercedes 280C have similar 
features, what makes sense. Basically, it gives an idea of group of cars that are similar one another. @@ -124,7 +124,7 @@ See more about it [here](https://www.data-to-viz.com/graph/dendrogram.html). #Heatmap *** -The [heatmap](https://www.data-to-viz.com/graph/heatmap.html) is often used in complement of a dendrogram. It is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking a data table from above. +The [heatmap](https://www.data-to-viz.com/graph/heatmap.html) is often used in complement of a dendrogram. It is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking a data table from above. In addition of a dendrogram, it allows to understand why samples ore features are grouped together. ```{r, fig.align="center", fig.width=8, message=FALSE, warning=FALSE} @@ -133,7 +133,7 @@ d3heatmap(data, k_row = 4, k_col = 2, scale = "column") ``` The heatmap above allows to understand why cars are split in 2 main clusters. For instance the weight (`wt`) is much higher for the group on top than for the other. 
- + diff --git a/story/ThreeNum.Rmd b/story/ThreeNum.Rmd index 12106ca..1f35c7a 100644 --- a/story/ThreeNum.Rmd +++ b/story/ThreeNum.Rmd @@ -3,11 +3,11 @@ myimage1: "Density150.png" myimage2: "Box1150.png" myimage3: "BubblePlot150.png" myimage4: "3d150.png" -mydisqus: "ThreeNum" +pathSlug: "ThreeNum" mytitle: "Life expectancy, gdp per capita and population size" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -19,7 +19,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -46,7 +46,7 @@ library(gapminder) data <- gapminder %>% filter(year=="2007") %>% select(-year) # show data -data %>% head(6) %>% +data %>% head(6) %>% mutate(gdpPercap=round(gdpPercap,0)) %>% mutate(pop=round(pop/1000000,2)) %>% mutate(lifeExp=round(lifeExp,1)) %>% @@ -89,7 +89,7 @@ p2 <- data %>% scale_color_viridis(discrete=TRUE) + scale_y_log10() + theme_ipsum() + - theme(legend.position="none") + theme(legend.position="none") p3 <- data %>% mutate(pop=pop/1000000) %>% @@ -100,7 +100,7 @@ p3 <- data %>% scale_color_viridis(discrete=TRUE) + scale_y_log10() + theme_ipsum() + - theme(legend.position="none") + theme(legend.position="none") grid.arrange(p2,p3, ncol=2) @@ -173,12 +173,12 @@ A specific use case where three numeric columns are displayed is the grid system ```{r, warning=FALSE, message=FALSE} # prepare the dataset: -don <- volcano +don <- volcano colnames(don) <- seq(1,ncol(don)) don <- don %>% as.tibble() %>% mutate(lat=seq(1,nrow(don)) ) %>% - gather(key="long", value="altitude", -lat) + gather(key="long", value="altitude", -lat) # show data don %>% head(6) %>% kable() %>% diff --git a/story/TwoNum.Rmd b/story/TwoNum.Rmd index 7cfef4c..b0b86e2 100644 --- a/story/TwoNum.Rmd +++ b/story/TwoNum.Rmd @@ -4,10 +4,10 @@ myimage2: "Box1Small.png" myimage3: "ScatterPlotSmall.png" myimage4: "2dDensitySmall.png" mytitle: "Apartment 
price vs ground living area." -mydisqus: "TwoNum" +pathSlug: "TwoNum" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -19,7 +19,7 @@ output: number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- @@ -65,7 +65,7 @@ data %>% head(6) %>% kable() %>% As usual when working with numeric variables, it is always a good practice to check their distributions. Here Prices and Ground living areas are on two different scales so it makes sense to study them in two different graphics. This can be done using a [histogram]() or a [density plot](). ```{r, fig.align="center", out.width = '90%', fig.height=3} -p1 <- data %>% +p1 <- data %>% ggplot( aes(x=GrLivArea)) + geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9, bins=50) + ggtitle("Ground living area distribution") + @@ -75,7 +75,7 @@ p1 <- data %>% ) + xlab('area') -p2 <- data %>% +p2 <- data %>% ggplot( aes(x=SalePrice/1000)) + geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9, bins=50) + ggtitle("Sale price distribution") + @@ -98,7 +98,7 @@ This allows to understand that most of the prices range between 100 and 300 k\$ *** The next step is to study the relationship between the 2 variables. Basically to explore if there is a correlation between sale price and living area. The first chart type to try in this case is the [scatterplot](). ```{r, fig.align="center"} -data %>% +data %>% ggplot( aes(x=GrLivArea, y=SalePrice/1000)) + geom_point(color="#69b3a2", alpha=0.8) + ggtitle("Ground living area partially explains sale price of apartments") + @@ -114,7 +114,7 @@ It is quite obvious that there is a relationship between prices and ground livin #Improving the scatter plot {.tabset} *** -The previous graphic convey most of the information efficiently. 
Still, there are a few customizations that can be done to make the chart even more insightful: +The previous graphic convey most of the information efficiently. Still, there are a few customizations that can be done to make the chart even more insightful: - Adding a `trend line` with confidence interval to illustrate and clarify the relationship - Using an `interactive` version to get more information concerning each data point. @@ -125,7 +125,7 @@ The previous graphic convey most of the information efficiently. Still, there ar ##Trend Help the reader seing the trend on the chart by showing it explicitely. Several models exist to show a trend. A [linear regression](https://en.wikipedia.org/wiki/Linear_regression) is used on the left plot, and a [local regression](https://en.wikipedia.org/wiki/Local_regression) is used on the right. Showing the confindence interval is a good practice as well. ```{r, fig.align="center", out.width = '90%', fig.height=3} -p <- data %>% +p <- data %>% ggplot( aes(x=GrLivArea, y=SalePrice/1000)) + geom_point(color="#69b3a2", alpha=0.8) + theme_ipsum() + @@ -151,7 +151,7 @@ Scatter plot is probably the chart type for which it makes the most sense to use # Plotly allows to turn any ggplot2 graphic interactive library(plotly) -p <- data %>% +p <- data %>% mutate(text=paste("Apartment Number: ", seq(1:nrow(data)), "\nLocation: New York\nAny other information you need..", sep="")) %>% ggplot( aes(x=GrLivArea, y=SalePrice/1000, text=text)) + geom_point(color="#69b3a2", alpha=0.8) + @@ -170,7 +170,7 @@ ggplotly(p, tooltip="text") ##Marginal distribution If the number of data points on the scatterplot is high, it is a good practice to display the marginal distributions arount the graphic. 
```{r} -data %>% +data %>% ggplot( aes(x=GrLivArea, y=SalePrice)) + geom_point() #%>% #ggMarginal(type="histogram") @@ -184,20 +184,20 @@ The most common pitfall with scatterplot is overplotting: when the sample size g ```{r, fig.width=8, fig.height=8, fig.align="center", warning=FALSE, message=FALSE} # code for all graphics: -p <- data %>% +p <- data %>% ggplot( aes(x=GrLivArea, y=SalePrice/1000)) + theme_ipsum() + theme( plot.title = element_text(size=12) ) + ylab('Sale price (k$)') + - xlab('Ground living area') + xlab('Ground living area') # Reduce dot size -p1 <- p + geom_point(color="#69b3a2", alpha=0.8, size=0.2) + ggtitle("Dot size") +p1 <- p + geom_point(color="#69b3a2", alpha=0.8, size=0.2) + ggtitle("Dot size") # Use density estimate -p2 <- p + geom_density2d(color="#69b3a2") + ggtitle("Density 2d: contour") +p2 <- p + geom_density2d(color="#69b3a2") + ggtitle("Density 2d: contour") # Use density estimate (area) p3 <- p + stat_density_2d(aes(fill = ..level..), geom = "polygon") + ggtitle("Density 2d: area") + theme(legend.position="none") @@ -218,14 +218,14 @@ p4 <- p + # Hexbin p5 <- p + geom_hex() + scale_fill_viridis() + - theme(legend.position="none") + + theme(legend.position="none") + ggtitle("Hexbin") # 2d histogram p6 <- p + geom_bin2d( ) + scale_fill_viridis( ) + - theme(legend.position="none") + - ggtitle("2d histogram") + theme(legend.position="none") + + ggtitle("2d histogram") p1 + p2 + p3 + p4 + p5 + p6 + plot_layout(ncol = 2) ``` diff --git a/story/TwoNumOrdered.Rmd b/story/TwoNumOrdered.Rmd index 658c27c..fc24c84 100644 --- a/story/TwoNumOrdered.Rmd +++ b/story/TwoNumOrdered.Rmd @@ -2,11 +2,11 @@ myimage1: "Line150.png" myimage2: "Area150.png" myimage3: "ScatterConnected150.png" -mydisqus: "TwoNumOrdered" +pathSlug: "TwoNumOrdered" mytitle: "Evolution of the bitcoin price" output: html_document: - self_contained: false + self_contained: false mathjax: default lib_dir: libs template: template_story.html @@ -18,7 +18,7 @@ output: 
number_sections: FALSE df_print: "paged" code_folding: "hide" - includes: + includes: after_body: footer.html --- diff --git a/story/template_story.html b/story/template_story.html index 990ab5d..63b1f80 100644 --- a/story/template_story.html +++ b/story/template_story.html @@ -1,676 +1,659 @@ - - - - - - - - - - - - - - - - - - -
-
- -

-

-

$mytitle$

-
- A few data analytics ideas from Data-to-Viz.com -
-

- - $if(myimage1)$ - - $endif$ - - $if(myimage2)$ - - $endif$ - - $if(myimage3)$ - + - $endif$ - - $if(myimage5)$ - - $endif$ - - $if(myimage6)$ - - $endif$ - - $if(myimage7)$ - - $endif$ - - $if(myimage8)$ - - $endif$ - - $if(myimage9)$ - - $endif$ - - $if(myimage10)$ - - $endif$ - -
- - -$if(theme)$ -$else$ - -$endif$ - -$for(author-meta)$ - -$endfor$ - -$if(date-meta)$ - -$endif$ - -$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$ - -$for(header-includes)$ -$header-includes$ -$endfor$ - -$if(highlightjs)$ - -$if(theme)$ - -$endif$ - -$endif$ - -$if(highlighting-css)$ - - -$if(theme)$ - -$endif$ -$endif$ - -$if(abstract)$ - -$endif$ - -$if(theme)$ - -$endif$ - -$for(css)$ - -$endfor$ - - - - - -$if(theme)$ - - -$if(kable-scroll)$ - -$endif$ - -$if(navbar)$ - - - - -$endif$ - -
- - - - - -$if(code_menu)$ - - -$endif$ - - - -$if(toc_float)$ - - - - - - - - -
-
-
-
-
- -
- -$endif$ - -$endif$ - -$for(include-before)$ -$include-before$ -$endfor$ - -$if(theme)$ -
- -$if(code_menu)$ -
- - -
- -$endif$ - -$endif$ - -$if(title)$ -

$title$

-$if(subtitle)$ -

$subtitle$

-$endif$ -$for(author)$ -$if(author.name)$ -

$author.name$

-$if(author.affiliation)$ -
-$author.affiliation$
$endif$ -$if(author.email)$ -$author.email$ -
-$endif$ -$else$ -

$author$

-$endif$ -$endfor$ -$if(date)$ -

$date$

-$endif$ -$if(abstract)$ -
-

Abstract

-$abstract$ -
-$endif$ -$endif$ - -$if(theme)$ -
-$endif$ - -$if(toc_float)$ -$else$ -$if(toc)$ -
-$toc$ -
-$endif$ -$endif$ - -$body$ - - - -

Going further

-
-

You can learn more about each type of graphic presented in this story in the dedicated sections. Click the icon below:

- $if(myimage1)$ - - $endif$ - - $if(myimage2)$ - - $endif$ - - $if(myimage3)$ - - $endif$ - - $if(myimage4)$ - - $endif$ - - $if(myimage5)$ - - $endif$ - - $if(myimage6)$ - - $endif$ - - $if(myimage7)$ - - $endif$ - - $if(myimage8)$ - - $endif$ - - $if(myimage9)$ - - $endif$ - - $if(myimage10)$ - - $endif$ - - - - - -
-

Comments

-
-

Any thoughts on this? Found any mistake? Have another way to show the data? Please drop me a word on Twitter or in the comment section below:

-
-
-
- - +> + + + + + + + + +
+
+ +

+

+

$mytitle$

+
+
+ A few data analytics ideas from + Data-to-Viz.com +
+
+

+ + $if(myimage1)$ + + + $endif$ $if(myimage2)$ + + + $endif$ $if(myimage3)$ + + + $endif$ $if(myimage4)$ + + + $endif$ $if(myimage5)$ + + + $endif$ $if(myimage6)$ + + + $endif$ $if(myimage7)$ + + + $endif$ $if(myimage8)$ + + + $endif$ $if(myimage9)$ + + + $endif$ $if(myimage10)$ + + + $endif$ +
+ + $if(theme)$ $else$ + + $endif$ $for(author-meta)$ + + $endfor$ $if(date-meta)$ + + $endif$ + + $if(title-prefix)$$title-prefix$ - $endif$$pagetitle$ + + $for(header-includes)$ $header-includes$ $endfor$ $if(highlightjs)$ + + $if(theme)$ + + $endif$ + + $endif$ $if(highlighting-css)$ + + + $if(theme)$ + + $endif$ $endif$ $if(abstract)$ + + $endif$ $if(theme)$ + + $endif$ $for(css)$ + + $endfor$ + + + + $if(theme)$ + + + $if(kable-scroll)$ + + $endif$ $if(navbar)$ + + + + + $endif$ + +
+ + + + + $if(code_menu)$ + + + $endif$ $if(toc_float)$ + + + + + + +
+
+
+
+ +
+ $endif$ $endif$ $for(include-before)$ $include-before$ $endfor$ + $if(theme)$ +
+ $if(code_menu)$ +
+ + +
+ + $endif$ $endif$ $if(title)$ +

$title$

+ $if(subtitle)$ +

$subtitle$

+ $endif$ $for(author)$ $if(author.name)$ +

$author.name$

+ $if(author.affiliation)$ +
+ $author.affiliation$
$endif$ $if(author.email)$ + $author.email$ +
+ $endif$ $else$ +

$author$

+ $endif$ $endfor$ $if(date)$ +

$date$

+ $endif$ $if(abstract)$ +
+

Abstract

+ $abstract$ +
+ $endif$ $endif$ $if(theme)$ +
+ $endif$ $if(toc_float)$ $else$ $if(toc)$ +
$toc$
+ $endif$ $endif$ $body$ + + +

Going further

+
+

+ You can learn more about each type of graphic presented in this + story in the dedicated sections. Click the icon below: +

+ $if(myimage1)$ + + + $endif$ $if(myimage2)$ + + + $endif$ $if(myimage3)$ + + + $endif$ $if(myimage4)$ + + + $endif$ $if(myimage5)$ + + + $endif$ $if(myimage6)$ + + + $endif$ $if(myimage7)$ + + + $endif$ $if(myimage8)$ + + + $endif$ $if(myimage9)$ + + + $endif$ $if(myimage10)$ + + + $endif$ + + +
+

Comments

+
+

+ Any thoughts on this? Found any mistake? Have another way to show + the data? Please drop me a word on + Twitter or in the + comment section below: +

+
+
+
+ + +
+ + + $for(include-after)$ $include-after$ $endfor$ $if(theme)$ + $if(toc_float)$ +
- - - - - - -$for(include-after)$ -$include-after$ -$endfor$ - - -$if(theme)$ - -$if(toc_float)$ -
-
-$endif$ - -
- - -$endif$ - -$if(mathjax-url)$ - - -$endif$ - - + $endif$ +
+ + + $endif$ $if(mathjax-url)$ + + + $endif$ +