Analyzing Data
Once you have imported data and organized it into FrameViews you want to analyze, you can easily extract individual columns as IReadOnlyLists that can be handed to the column-oriented statistical analysis methods in the Meta.Numerics.Statistics namespace.
For these examples, we will use a data frame created from data available on the internet:
using System;
using System.IO;
using System.Net;
using Meta.Numerics.Data;
FrameTable table;
Uri url = new Uri("https://raw.githubusercontent.com/dcwuser/metanumerics/master/Examples/Data/example.csv");
WebRequest request = WebRequest.Create(url);
using (WebResponse response = request.GetResponse()) {
    using (StreamReader reader = new StreamReader(response.GetResponseStream())) {
        table = FrameTable.FromCsv(reader);
    }
}
FrameView view = table.WhereNotNull();
Notice that we have used the WhereNotNull() method to create a view scrubbed of nulls that we will use in the analysis.
The basic column-access APIs are illustrated here:
// Get the column with (zero-based) index 4.
FrameColumn column4 = view.Columns[4];
// Get the column named "Height".
FrameColumn heightsColumn = view.Columns["Height"];
// Even easier way to get the column named "Height".
FrameColumn alsoHeightsColumn = view["Height"];
From the FrameColumn object, you can get column properties such as Count and StorageType. You can also get individual column values, but only as objects, which is not how most statistics APIs want them. To get column values as an IReadOnlyList of the type you need, use the As method:
using System.Collections.Generic;
// Get heights as list of doubles
IReadOnlyList<double> heights = view["Height"].As<double>();
This technique of extracting a column as a list of values of a particular type is used repeatedly in the examples below. It is the key to passing data-frame data into statistical analysis APIs.
Note that when you access a column of data, you are not incurring any copy costs; you are just creating a read-only view into the existing storage. If you do need an independent, writeable copy, you can always use methods like ToList() and ToArray() from the System.Linq namespace to obtain one.
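The difference between a read-only view and an independent copy can be sketched with plain .NET types. This is a minimal illustration, not library code: the array below is a hypothetical stand-in for the list returned by view["Height"].As<double>(), and the values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the list returned by view["Height"].As<double>().
IReadOnlyList<double> heights = new double[] { 171.0, 165.5, 180.2 };

// IReadOnlyList<T> exposes no setter, so heights[0] = 0.0 would not compile.
// ToList() allocates new storage, so edits to the copy leave the source untouched.
List<double> copy = heights.ToList();
copy[0] = 0.0;

Console.WriteLine($"Original first value: {heights[0]}");
Console.WriteLine($"Copied first value: {copy[0]}");
```

The copy costs one allocation and one pass over the data, so it is worth making only when you actually need to modify the values.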
Here is code that produces summary statistics for height:
using Meta.Numerics.Statistics;
SummaryStatistics summary = new SummaryStatistics(view["Height"].As<double>());
Console.WriteLine($"Count = {summary.Count}");
Console.WriteLine($"Mean = {summary.Mean}");
Console.WriteLine($"Standard Deviation = {summary.StandardDeviation}");
Console.WriteLine($"Skewness = {summary.Skewness}");
Console.WriteLine($"Estimated population mean = {summary.PopulationMean}");
Console.WriteLine($"Estimated population standard deviation = {summary.PopulationStandardDeviation}");
Of course, this is not all you can do with a column of data representing a univariate sample. The one-sample methods of the Univariate class allow you to extract arbitrary moments and percentiles, run statistical tests such as the z-test or Shapiro-Francia test of normality, fit to parameterized distributions, and more.
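As a rough sketch of a few of those one-sample methods, the code below computes a median, a percentile, a central moment, and a z-test. It assumes the Univariate extension methods are named Median, InverseLeftProbability, CentralMoment, and ZTest, as I recall them from the Meta.Numerics 4.x API; check the names against the version you have installed. The sample values and the reference mean and standard deviation are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

// Illustrative values only; in the tutorial this would be view["Height"].As<double>().
IReadOnlyList<double> sample = new double[] { 162.0, 168.5, 171.0, 174.5, 180.0, 185.5 };

// Order statistics: the median and the 90th percentile.
double median = sample.Median();
double p90 = sample.InverseLeftProbability(0.9);

// The second central moment of the sample.
double m2 = sample.CentralMoment(2);

// z-test against an assumed reference population (mean 170, standard deviation 10).
TestResult z = sample.ZTest(170.0, 10.0);

Console.WriteLine($"Median = {median}");
Console.WriteLine($"90th percentile = {p90}");
Console.WriteLine($"Second central moment = {m2}");
Console.WriteLine($"{z.Statistic.Name} = {z.Statistic.Value}, P = {z.Probability}");
```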
Let's see if the height differences between men and women are statistically significant by performing a t-test:
IReadOnlyList<double> maleHeights =
    view.Where<string>("Sex", s => s == "M").Columns["Height"].As<double>();
IReadOnlyList<double> femaleHeights =
    view.Where<string>("Sex", s => s == "F").Columns["Height"].As<double>();
TestResult test = Univariate.StudentTTest(maleHeights, femaleHeights);
Console.WriteLine($"{test.Statistic.Name} = {test.Statistic.Value}");
Console.WriteLine($"P = {test.Probability}");
Of course, this is not all you can do with columns of data representing distinct samples. Using the two-sample methods of the Univariate class, you can compare their medians using a Mann-Whitney test, or compare their distributions using a two-sample Kolmogorov-Smirnov test. You can compare more than two samples using ANOVA or Kruskal-Wallis tests.
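The two-sample tests mentioned above follow the same calling pattern as StudentTTest. The sketch below assumes the methods are named Univariate.MannWhitneyTest and Univariate.KolmogorovSmirnovTest and that each takes two lists of doubles; the data are invented for illustration rather than taken from the example file.

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

// Illustrative samples; in the tutorial these would be maleHeights and femaleHeights.
IReadOnlyList<double> a = new double[] { 170.2, 175.5, 181.0, 178.3, 183.1, 172.8 };
IReadOnlyList<double> b = new double[] { 160.4, 165.9, 158.2, 168.7, 163.0, 171.5 };

// Rank-based comparison of the two samples' locations.
TestResult mannWhitney = Univariate.MannWhitneyTest(a, b);
Console.WriteLine($"{mannWhitney.Statistic.Name} = {mannWhitney.Statistic.Value}");
Console.WriteLine($"P = {mannWhitney.Probability}");

// Comparison of the two empirical distributions as a whole.
TestResult ks = Univariate.KolmogorovSmirnovTest(a, b);
Console.WriteLine($"{ks.Statistic.Name} = {ks.Statistic.Value}");
Console.WriteLine($"P = {ks.Probability}");
```

Both tests avoid the normality assumption behind the t-test, which makes them a reasonable cross-check when the samples are small or skewed.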
Let's see how well we can predict weight as a linear function of height:
LinearRegressionResult fit =
    view["Weight"].As<double>().LinearRegression(view["Height"].As<double>());
Console.WriteLine($"Model weight = ({fit.Slope}) * height + ({fit.Intercept}).");
Console.WriteLine($"Model explains {fit.RSquared * 100.0}% of variation.");
Notice that we have used LinearRegression as an extension method on the output values. This allows the code to visually reflect the mathematical relationship implied by the model: outputs ~ function of inputs.
As with all our fitting routines, the returned fit parameters are of the type UncertainValue, which allows you to obtain error bars and confidence intervals as well as best-fit values.
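To illustrate how an UncertainValue might be unpacked, the sketch below fits a line to a small made-up data set and prints the slope with its error bar and a 95% confidence interval. It assumes UncertainValue exposes Value, Uncertainty, and ConfidenceInterval(P), and that the returned Interval has LeftEndpoint and RightEndpoint properties, as I recall from the Meta.Numerics namespace.

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics;
using Meta.Numerics.Statistics;

// Made-up heights (cm) and weights (kg), for illustration only.
IReadOnlyList<double> height = new double[] { 160.0, 165.0, 170.0, 175.0, 180.0, 185.0 };
IReadOnlyList<double> weight = new double[] { 55.0, 61.5, 64.0, 72.5, 75.0, 83.5 };

// Outputs ~ function of inputs, as in the tutorial's own example.
LinearRegressionResult fit = weight.LinearRegression(height);

// Best-fit value and one-sigma uncertainty of the slope.
UncertainValue slope = fit.Slope;
Console.WriteLine($"Slope = {slope.Value} +/- {slope.Uncertainty}");

// Interval expected to contain the true slope with 95% confidence.
Interval ci = slope.ConfidenceInterval(0.95);
Console.WriteLine($"95% CI = [{ci.LeftEndpoint}, {ci.RightEndpoint}]");
```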
Of course, linear regression is not the only sort of analysis you can perform on bivariate samples. Using other methods of the Bivariate class, you can compute covariances and other multivariate moments, and perform more complex regressions, including polynomial, non-linear, and logistic regression.
Here is some code that produces a contingency table summarizing test results for male and female subjects:
ContingencyTable<string, bool> contingency =
    Bivariate.Crosstabs(view["Sex"].As<string>(), view["Result"].As<bool>());
Console.WriteLine($"Male incidence: {contingency.ProbabilityOfColumnConditionalOnRow(true, "M")}");
Console.WriteLine($"Female incidence: {contingency.ProbabilityOfColumnConditionalOnRow(true, "F")}");
Console.WriteLine($"Log odds ratio = {contingency.Binary.LogOddsRatio}");
Once you have a contingency table, you can do all sorts of other analyses, including chi-square tests and Fisher exact tests.
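A sketch of those two tests follows. It assumes the table exposes a PearsonChiSquaredTest method and that the Binary view of a 2 x 2 table exposes FisherExactTest; both names are from my recollection of the 4.x API and should be verified. The counts are invented for illustration, and AddRows is a hypothetical helper defined only for this example.

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

// Build two parallel columns from made-up counts:
// M: 6 positive, 4 negative; F: 3 positive, 7 negative.
List<string> sex = new List<string>();
List<bool> result = new List<bool>();
void AddRows(string s, bool r, int count) {
    for (int i = 0; i < count; i++) { sex.Add(s); result.Add(r); }
}
AddRows("M", true, 6);
AddRows("M", false, 4);
AddRows("F", true, 3);
AddRows("F", false, 7);

ContingencyTable<string, bool> contingency = Bivariate.Crosstabs(sex, result);

// Chi-square test of independence between the row and column variables.
TestResult chiSquared = contingency.PearsonChiSquaredTest();
Console.WriteLine($"Chi-square P = {chiSquared.Probability}");

// Fisher exact test, available because the table is 2 x 2.
TestResult fisher = contingency.Binary.FisherExactTest();
Console.WriteLine($"Fisher exact P = {fisher.Probability}");
```

With cell counts this small, the Fisher exact test is usually the more trustworthy of the two, because the chi-square test relies on a large-sample approximation.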
Here is some code to fit a multi-linear logistic model that uses a computed BMI value and the subject's sex to predict the boolean test result:
using Meta.Numerics.Functions;
// BMI = weight (kg) / height (m) squared; Height is stored in cm.
view.AddComputedColumn("Bmi",
    r => ((double) r["Weight"]) / MoreMath.Sqr((double) r["Height"] / 100.0)
);
MultiLinearLogisticRegressionResult result =
    view["Result"].As<bool>().MultiLinearLogisticRegression(
        view["Bmi"].As<double>(),
        view["Sex"].As<string, double>(s => s == "M" ? 1.0 : 0.0)
    );
foreach (Parameter parameter in result.Parameters) {
    Console.WriteLine($"{parameter.Name} = {parameter.Estimate}");
}
This example illustrates several noteworthy points:
- We can use AddComputedColumn to create a computed column, and then use that column just like any other for statistical analysis.
- We can also use the overload of the As method that takes a conversion function to effectively create a computed column on the fly. We could have used AddComputedColumn instead, but for situations where the conversion is just for one specific API call (in this case, converting M or F into an indicator variable), using the overload may be easier.
- Notice again that the use of an extension method makes the code visually mirror the mathematical formulation of our model as output ~ function of inputs.
- When the parameters are printed, you can see that their names have been auto-magically lifted from the column names, which makes it easy to see which parameter is the coefficient of which column.
This isn't all you can do with multivariate samples. Using other methods of the Multivariate class, you can compute arbitrary moments, apply clustering algorithms, do principal component analysis, and do other kinds of regressions.
Our example data is not time series data, but see the time series analysis topic for examples of the kind of analysis you can do on time series data.
The type parameter of the As method need not be the same as the underlying storage type of the column. As long as the values can be converted to the target type, the call will succeed. You can, for example, read an int column as doubles in order to hand it to the Mean API, which expects doubles. You can also read a bool? column as bool, as long as you have filtered to a view that doesn't contain any null values. You can even read a bool as an int or double (or vice versa, as long as all values are 0 or 1), which makes it easier to work with indicator variables. And if all else fails, you can use either the converting As overload or AddComputedColumn to produce a collection of exactly the type required by your API.
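The int-to-double case can be sketched with a small in-memory table. This example assumes that FrameTable.FromCsv accepts any TextReader (here a StringReader) and that it infers an integer storage type for a column of whole numbers; the column name and values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Meta.Numerics.Data;
using Meta.Numerics.Statistics;

// A tiny, made-up CSV with a single whole-number column.
string csv = "Age\n34\n28\n45\n";

FrameTable table;
using (StringReader reader = new StringReader(csv)) {
    table = FrameTable.FromCsv(reader);
}

// Show the storage type that was inferred for the column.
Console.WriteLine($"Storage type = {table["Age"].StorageType}");

// Read the column as doubles, whatever its storage type, and pass it to Mean.
IReadOnlyList<double> ages = table["Age"].As<double>();
Console.WriteLine($"Mean age = {ages.Mean()}");
```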
There is much more. Look at the statistics tutorials to see even more kinds of statistical analysis you can do.