From 6003d35ef922b1ab42a8655b3e1d0795c324dc4a Mon Sep 17 00:00:00 2001 From: "Marco A. Lopez-Sanchez" Date: Mon, 11 Mar 2024 20:01:32 +0100 Subject: [PATCH] Update grain_size_analysis.ipynb --- grain_size_tools/grain_size_analysis.ipynb | 57 ++++++++++++++++------ 1 file changed, 43 insertions(+), 14 deletions(-) diff --git a/grain_size_tools/grain_size_analysis.ipynb b/grain_size_tools/grain_size_analysis.ipynb index 83e0b0b..164babb 100644 --- a/grain_size_tools/grain_size_analysis.ipynb +++ b/grain_size_tools/grain_size_analysis.ipynb @@ -60,15 +60,7 @@ "source": [ "## Loading the data\n", "\n", - "TODO\n", - "\n", - "```reStructuredText\n", - "sep # Delimiter/separator to use.\n", - "header # Row number(s) to use as the column names. By default it takes the first row as the column names (header=0). If there is no columns names in the file you must set header=None\n", - "skiprows # Number of lines to skip at the start of the file (an integer).\n", - "na_filter # Detect missing value markers. False by default.\n", - "sheet_name # Only for excel files, the excel sheet name either a number or the full name of the sheet.\n", - "```" + "The first step is to read the data. For this we will use the Pandas function ``read_csv`` (Pandas is imported as ``pd``) as follows:" ] }, { @@ -82,6 +74,22 @@ "dataset = pd.read_csv('DATA\data_set.txt', sep='\\t')" ] }, + { + "cell_type": "markdown", + "id": "9c23c16e", + "metadata": {}, + "source": [ + "The above example loads the data into a variable named ``dataset``, assumes that the data is stored in a file named ``data_set.txt`` located within the ``DATA`` folder, and that the column separator (or delimiter) is a tab, denoted as ``\\t``. By default the ``read_csv`` method assumes that the delimiter/separator is a comma, so for any other delimiter you should pass it explicitly using ``sep``, as in the example. The ``read_csv`` method also assumes that the data is stored in a text-like file (e.g. csv, txt, tsv, dat, etc.) 
and that the first line contains the column names. Other useful parameters that can be passed within the parentheses to load text files with a more complex layout are:\n", + "\n", + "- ``header``: Row number(s) in the text file to use as the column names. By default the first row is used as the column names (``header=0``). If there are no column names in the file, you must set ``header=None``.\n", + "- ``skiprows``: Number of lines to skip/ignore at the beginning of the file (an integer). \n", + "- ``na_filter``: When set to ``True``, missing values in the dataset are detected automatically. ``True`` by default. \n", + "\n", + "For more details on this method, or on loading other file types (e.g. Excel files), see the script documentation.\n", + "\n", + "Once the data has been loaded, the first thing to do is to view it to check that it has loaded correctly, as in the example below." + ] + }, { "cell_type": "code", "execution_count": 3, @@ -321,7 +329,7 @@ "ECD = 2 \\sqrt{\\text{area} / \\pi}\n", "$$\n", "\n", - "To add a new column with ECDs" + "To add a new column with ECDs (we will name it ``ECD``)" ] }, { @@ -472,6 +480,7 @@ } ], "source": [ + "# show the first rows of the data\n", "dataset.head()" ] }, @@ -484,7 +493,7 @@ "\n", "## Grain size statistics\n", "\n", - "The ``summarise()`` method is responsible for describing the population of grain sizes. By default, this method returns several common averages (central tendency estimators) with corresponding confidence intervals at the 95% level (2-sigma) and several statistical parameters describing the distribution of grain sizes. The ``summasize()`` function will automatically choose the most optimal methods for estimating confidence intervals for each of the averages (see documentation for details). If necessary, you can modify various parameters of this function to return only the types of averages you are interested in, or to change the confidence level reported among others. 
For more details, see the documentation at https://github.com/marcoalopez/GrainSizeTools/wiki/2.-Quantifying-grain-size-populations-using-GrainSizeTools-Script or type ``summarize()?`` in a cell and run it." + "The ``summarize()`` method is responsible for describing the population of grain sizes. By default, this method returns (1) **several common averages** (central tendency estimators) with corresponding confidence intervals at the 95% level (2-sigma) and (2) **several statistical parameters describing the distribution of grain sizes**. The method will automatically select the most suitable method for estimating the confidence interval of each average (see documentation for details)." ] }, { @@ -544,9 +553,11 @@ "id": "1fd110fb", "metadata": {}, "source": [ + "If necessary, you can modify the behaviour of this function using various parameters, for example to ignore a certain type of average or to change the confidence level of the averages. For more details, see the documentation at https://github.com/marcoalopez/GrainSizeTools/wiki/2.-Quantifying-grain-size-populations-using-GrainSizeTools-Script or type ``summarize()?`` in a cell and run it.\n", + "\n", "## Plotting grain size populations\n", "\n", - "All the statistical parameters calculated above only make sense if your population is unimodal. Therefore, it is imperative to **always display the grain size distribution**. The default plot for this is to show the distribution on a linear scale. The ``plot.distribution()`` function takes care of this. By default it shows the distribution using the histogram and the kernel density estimator, as well as the location of the various averages. This function also allows you to change some default values, among them the type of graph (histogram, kde) abd the type of average(s) to be displayed and to change various histogram or kde adjustment parameters, use the ``?`` help command or the script documentation for specific details. 
You can also modify the parameters of the figure itself, such as the axis labels, their size, etc. Some examples are commented on below and there are also examples in the documentation." + "All the statistical parameters calculated above only make sense if your population is unimodal. It is therefore imperative to **always display the grain size distribution**. The default plot for this displays the distribution on a linear scale. The ``plot.distribution()`` function takes care of this. By default it shows the distribution using the histogram and the Kernel Density Estimator or KDE (i.e. the continuous line), as well as the location of the various averages. This function also allows you to change some default values, including the type of plot (histogram, kde) or the type of average(s) to be displayed. You can also adjust the bin size of the histogram or the kernel of the KDE; use the ``?`` help command or the script documentation for specific details. The parameters of the plot itself, such as the axis labels, the font size, etc., can be modified as well. Some examples are commented on below and there are more examples in the documentation." ] }, { @@ -583,8 +594,8 @@ "fig1, axe = plot.distribution(dataset['ECD'])\n", "\n", "# uncomment the lines below (remove the # at the begining) to modify the figure defaults\n", - "#axe.set_xlabel('diameters $\\mu$m') # modify x label\n", - "#axe.set_ylabel('probability') # modify y label" + "#axe.set_xlabel('diameters $\\mu$m', fontsize=14) # modify x label\n", + "#axe.set_ylabel('probability', fontsize=14) # modify y label" ] }, { @@ -606,6 +617,14 @@ "## Testing lognormality" ] }, + { + "cell_type": "markdown", + "id": "5f30a2b7", + "metadata": {}, + "source": [ + "Sometimes it can be helpful to test whether the data follow or deviate from a lognormal distribution, for example to find out whether the data are suitable for the two-step stereological method or which confidence interval method is best. 
The script uses two methods to test whether the grain size distribution follows a lognormal distribution. One is a visual method called the quantile-quantile (q-q) plot and the other is a quantitative test called the [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test). To do this we use the function ``test_lognorm`` as follows" + ] + }, { "cell_type": "code", "execution_count": 17, @@ -639,6 +658,16 @@ "fig2, axe = plot.qq_plot(dataset['ECD'], figsize=(6, 5))" ] }, + { + "cell_type": "markdown", + "id": "89b00b44", + "metadata": {}, + "source": [ + "The Shapiro-Wilk test returns two values: the test statistic and the p-value. This test considers the distribution to be lognormal when the p-value is greater than 0.05.\n", + "\n", + "The q-q plot is a visual test: when the points fall right on the reference line, the data are lognormally distributed. The q-q plot has the advantage over the Shapiro-Wilk test that it shows where the distribution deviates from lognormality (if it deviates). In the example above, we can see that it deviates mainly at the extremes, which is quite common in grain size populations. The deviation in the lower part of the grain size distribution is usually due to the resolution limit of the acquisition system, which cannot measure some of the smallest grains in the population. Deviation in the upper range is usually due to insufficient sample size (although this is not always the case). As the probability of measuring grains in this range is lower, it is more affected by non-representative sample sizes. The message here is that even if the Shapiro-Wilk test is negative, the quantile-quantile plot may indicate that this is due to other limiting factors, not that the population does not approximately follow a lognormal pattern." + ] + }, { "cell_type": "code", "execution_count": null,