Descriptive statistics and data normalization with CNTK and C#

As you probably know CNTK is Microsoft Cognitive Toolkit for deep learning. It is open source library which is used by various Microsoft products. Also the CNTK is powerful library for developing custom ML solutions from various fields with different platforms and languages. What is also so powerful in the CNTK is the way of the implementation. In fact the library is implemented as series of computation graphs, which  is fully elaborated into the sequence of steps performed in a deep neural network training.

Each CNTK compute graph is created with set of nodes where each node represents numerical (mathematical) operation. The edges between nodes in the graph represent data flow between operations. Such a representation allows CNTK to schedule computation on the underlying hardware GPU or CPU. The CNTK can dynamically analyze the graphs in order to to optimize both latency and efficient use of resources. The most powerful part of this is the fact thet the CNTK can calculate derivation of any constructed set of operations, which can be used for efficient learning  process of the network parameters. The flowing image shows the core architecture of the CNTK. On the other hand, any operation can be executed on CPU or GPU with minimal code changes. In fact we can implement method which can automatically takes GPU computation if available. The CNTK is the first .NET library which provide .NET developers to develop GPU aware .NET applications.

What this exactly mean is that with this powerful library you can develop complex math computation directly to GPU in .NET using C#, which currently is not possible when using standard .NET library.

For this blog post I will show how to calculate some of basic statistics operations on data set.

Say we have data set with 4 columns (features) and 20 rows (samples). The C# implementation of this 2D array is show on the following code snippet:

static float[][] mData = new float[][] {
new float[] { 5.1f, 3.5f, 1.4f, 0.2f},
new float[] { 4.9f, 3.0f, 1.4f, 0.2f},
new float[] { 4.7f, 3.2f, 1.3f, 0.2f},
new float[] { 4.6f, 3.1f, 1.5f, 0.2f},
new float[] { 6.9f, 3.1f, 4.9f, 1.5f},
new float[] { 5.5f, 2.3f, 4.0f, 1.3f},
new float[] { 6.5f, 2.8f, 4.6f, 1.5f},
new float[] { 5.0f, 3.4f, 1.5f, 0.2f},
new float[] { 4.4f, 2.9f, 1.4f, 0.2f},
new float[] { 4.9f, 3.1f, 1.5f, 0.1f},
new float[] { 5.4f, 3.7f, 1.5f, 0.2f},
new float[] { 4.8f, 3.4f, 1.6f, 0.2f},
new float[] { 4.8f, 3.0f, 1.4f, 0.1f},
new float[] { 4.3f, 3.0f, 1.1f, 0.1f},
new float[] { 6.5f, 3.0f, 5.8f, 2.2f},
new float[] { 7.6f, 3.0f, 6.6f, 2.1f},
new float[] { 4.9f, 2.5f, 4.5f, 1.7f},
new float[] { 7.3f, 2.9f, 6.3f, 1.8f},
new float[] { 5.7f, 3.8f, 1.7f, 0.3f},
new float[] { 5.1f, 3.8f, 1.5f, 0.3f},};

If you want to play with CNTK and math calculation you need some knowledge from Calculus, as well as vectors, matrix and tensors. Also in CNTK any operation is performed as matrix operation, which may simplify the calculation process for you. In standard way, you have to deal with multidimensional arrays during calculations. As my knowledge currently there is no .NET library which can perform math operation on GPU, which constrains the .NET platform for implementation of high performance applications.

If we want to compute average value, and standard deviation for each column, we can do that with CNTK very easy way. Once we compute those values we can used them for normalizing the data set by computing standard score (Gauss Standardization).

The Gauss standardization is calculated by the flowing term: $nValue= \frac{X-\nu}{\sigma}$,
where X- is column values, $\nu$ – column mean, and $\sigma$– standard deviation of the column.

For this example we are going to perform three statistic operations,and the CNTK automatically provides us with ability to compute those values on GPU. This is very important in case you have data set with millions of rows, and computation can be performed in few milliseconds.

Any computation process in CNTK can be achieved in several steps:

1. Read data from external source or in-memory data,
2. Define Value and Variable objects.
3. Define Function for the calculation
4. Perform Evaluation of the function by passing the Variable and Value objects
5. Retrieve the result of the calculation and show the result.

All above steps are implemented in the following implementation:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using CNTK;
namespace DataNormalizationWithCNTK
{
class Program
{
static float[][] mData = new float[][] {
new float[] { 5.1f, 3.5f, 1.4f, 0.2f},
new float[] { 4.9f, 3.0f, 1.4f, 0.2f},
new float[] { 4.7f, 3.2f, 1.3f, 0.2f},
new float[] { 4.6f, 3.1f, 1.5f, 0.2f},
new float[] { 6.9f, 3.1f, 4.9f, 1.5f},
new float[] { 5.5f, 2.3f, 4.0f, 1.3f},
new float[] { 6.5f, 2.8f, 4.6f, 1.5f},
new float[] { 5.0f, 3.4f, 1.5f, 0.2f},
new float[] { 4.4f, 2.9f, 1.4f, 0.2f},
new float[] { 4.9f, 3.1f, 1.5f, 0.1f},
new float[] { 5.4f, 3.7f, 1.5f, 0.2f},
new float[] { 4.8f, 3.4f, 1.6f, 0.2f},
new float[] { 4.8f, 3.0f, 1.4f, 0.1f},
new float[] { 4.3f, 3.0f, 1.1f, 0.1f},
new float[] { 6.5f, 3.0f, 5.8f, 2.2f},
new float[] { 7.6f, 3.0f, 6.6f, 2.1f},
new float[] { 4.9f, 2.5f, 4.5f, 1.7f},
new float[] { 7.3f, 2.9f, 6.3f, 1.8f},
new float[] { 5.7f, 3.8f, 1.7f, 0.3f},
new float[] { 5.1f, 3.8f, 1.5f, 0.3f},};
static void Main(string[] args)
{
//define device where the calculation will executes
var device = DeviceDescriptor.UseDefaultDevice();

//print data to console
Console.WriteLine($"X1,\tX2,\tX3,\tX4"); Console.WriteLine($"-----,\t-----,\t-----,\t-----");
foreach (var row in mData)
{
Console.WriteLine($"{row},\t{row},\t{row},\t{row}"); } Console.WriteLine($"-----,\t-----,\t-----,\t-----");

//convert data into enumerable list
var data = mData.ToEnumerable<IEnumerable<float>>();

//assign the values
var vData = Value.CreateBatchOfSequences<float>(new int[] {4},data, device);
//create variable to describe the data
var features = Variable.InputVariable(vData.Shape, DataType.Float);

//define mean function for the variable
var mean =  CNTKLib.ReduceMean(features, new Axis(2));//Axis(2)- means calculate mean along the third axes which represent 4 features

//map variables and data
var inputDataMap = new Dictionary<Variable, Value>() { { features, vData } };
var meanDataMap = new Dictionary<Variable, Value>() { { mean, null } };

//mean calculation
mean.Evaluate(inputDataMap,meanDataMap,device);
//get result
var meanValues = meanDataMap[mean].GetDenseData<float>(mean);

Console.WriteLine($""); Console.WriteLine($"Average values for each features x1={meanValues},x2={meanValues},x3={meanValues},x4={meanValues}");

//Calculation of standard deviation
var std = calculateStd(features);
var stdDataMap = new Dictionary<Variable, Value>() { { std, null } };
//mean calculation
std.Evaluate(inputDataMap, stdDataMap, device);
//get result
var stdValues = stdDataMap[std].GetDenseData<float>(std);

Console.WriteLine($""); Console.WriteLine($"STD of features x1={stdValues},x2={stdValues},x3={stdValues},x4={stdValues}");

//Once we have mean and std we can calculate Standardized values for the data
var gaussNormalization = CNTKLib.ElementDivide(CNTKLib.Minus(features, mean), std);
var gaussDataMap = new Dictionary<Variable, Value>() { { gaussNormalization, null } };
//mean calculation
gaussNormalization.Evaluate(inputDataMap, gaussDataMap, device);

//get result
var normValues = gaussDataMap[gaussNormalization].GetDenseData<float>(gaussNormalization);
//print data to console
Console.WriteLine($"-------------------------------------------"); Console.WriteLine($"Normalized values for the above data set");
Console.WriteLine($""); Console.WriteLine($"X1,\tX2,\tX3,\tX4");
Console.WriteLine($"-----,\t-----,\t-----,\t-----"); var row2 = normValues; for (int j = 0; j < 80; j += 4) { Console.WriteLine($"{row2[j]},\t{row2[j + 1]},\t{row2[j + 2]},\t{row2[j + 3]}");
}
Console.WriteLine(\$"-----,\t-----,\t-----,\t-----");
}

private static Function calculateStd(Variable features)
{
var mean = CNTKLib.ReduceMean(features,new Axis(2));
var remainder = CNTKLib.Minus(features, mean);
var squared = CNTKLib.Square(remainder);
//the last dimension indicate the number of samples
var n = new Constant(new NDShape(0), DataType.Float, features.Shape.Dimensions.Last()-1);
var elm = CNTKLib.ElementDivide(squared, n);
var sum = CNTKLib.ReduceSum(elm, new Axis(2));
var stdVal = CNTKLib.Sqrt(sum);
return stdVal;
}
}

public static class ArrayExtensions
{
public static IEnumerable<T> ToEnumerable<T>(this Array target)
{
foreach (var item in target)
yield return (T)item;
}
}
}

The output for the source code above should look like: Gauss Numeric Integrator Project After short post about numerical integration in 1D, I have decided to start with open source project for numerical integration for any function in 1,2 or 3D.  If you are interesting in numeric analysis you can find a lot of stuff on the internet. For this blog post I prepared Demo example which extends the previous post by providing integration for 2D function. By using this demo you can integrate any 2D function on area defined by 3, 4 or 8 points. That means you can perform numerical integration for 3 kind of finite elements: triangle, quadrilateral 4-point or 8-point element. Beside element type you can choose up to 7 Gaussian integration points.

Implementation

The project contains one ClassLibrary and one Windows application (see picture below). Class library contains implementation related to numeric integration: Shape function, partial derivatives of shape function, Jacobina calculation for every element type as well as Integration process.

Windows Application implements Form for defining function, and other parameters. The project uses Flee – fast .Net expression evaluator for evaluating string in to math expression. The Gauss Numeric Integrator can be downloaded from codeplex site at http://gauss.codeplex.com/

C# implementation of numerical integration by Gaussian Quadrature of 1D function

Update: Gauss Numeric Integrator as open source project hosted on codeplex: http://gauss.codeplex.com

In this post it will be presented C# implementation of numerical calculation of  integral within $(a,b)$ interval based on Gauss Quadrature method.

As you know definite integral can be expressed as: $\int_{a}^{b}{f(x)} dx$.

Graphically it represents the area below the function, as picture shows below. In many numerical methods like FEM, FVM, etc it is necessary to calculate integral numerically. There are several methods but the Gauss Quadrature is the most used one and popular.

Gaus Quadrature method of integration is based on the fact that if we make transformation of the function $f(x)$ between interval $(a,b)$ in to another function $\phi ( \xi )$ on interval $(-1,1)$ we can calculate approximate value of the integral on very simple way.

The picture below shows how we transform out starting function in to function defined on interval $(-1,1)$. If we make such a transformation numerically we can calculate integral on the following way: $\int_{a}^{b}{f(x)}dx = \frac{b-a}{2} \int_{-1}^{1}{f(\frac{b-a}{2} x_i+ \frac{a+b}{2})}dx = \frac{b-a}{2}\sum_{i=1}^{n} w_if(\frac{b-a}{2} x_i+ \frac{a+b}{2})$

The summation function is called the Legendre-Gauss quadrature rule because the abscissa $x_i$ in the Gauss quadrature function for $[-1,1]$ are defined as the roots of the Legendre polynomial for $n$: $P_n(x)= \frac{1}{2 \pi i} \oint (1-2tx+r^2)^{-1/2}t^{-n-1}$,

with the weights $w_i$ coming from the following function: $w_i= \frac{-2} {(1-x_i^2)[P'_n(x_i))]^2}$.

Solution for $x_i$ and $w_i$  are based on number of points $n$ which can be calculated by using several method implementation you can find on internet. In this demo values for $x_i$ and $w_i$  for n=2 to 64  is taken from this link.

C# Implementation

Gauss Quadrature 1D numeric integrator demo is shown on the picture below. It is a  small Windows Forms application by which you can calculate almost any definite integral. Demo uses Flee – Fast Lightweight Expression Evaluator for evaluating string in to math expression. As picture shows write down any function with x as an independent variable, define interval of integration, specify number of Gaussian points and press Calculate button.

Demo contains external file for Gaussian points and weights for n=2 to 64 points. So you can integrate any function with up to 64 Gaussian points which represent very large range of numerical application.

1. The first step in implementation was to implement loading Gaussian points from external files. The following implementation shows loading points to .NET object.

{
gaus1DValues = new List();
string buffer = "";
// open  file and retrieve the content
{
//read data in to buffer
}

//define the lines from file
var lines = (from l in buffer.Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
where l != '!' && l != '\r'
select l.IndexOf('!') == -1 ? l : l.Remove(l.IndexOf('!'))
).ToArray();

int counter=1;
for (int i=0; i<lines.Length; i++)
{
string line = lines[i].Replace("\r", "");
string str = "n=" + counter.ToString();
if (line == str)
{
i++;
Helpers h = new Helpers();
h.n = counter;
h.wi= new double[counter];
h.xi = new double[counter];
for (int j = 0; j < counter; j++)
{
var lns= lines[i].Split(new char[] { '\t' });
h.wi[j] = double.Parse(lns, CultureInfo.InvariantCulture);
var s = lns.Replace("\r","");
h.xi[j] = double.Parse(s, CultureInfo.InvariantCulture);
if(j+1<counter)
i++;
}
counter++;
}
}
return gaus1DValues;
}

As can be seen from the code above, all values are stored in list of object of  Helper class which contains Gaussian values for each integration point. When the user click on Calculate button, the following code is going to be executed:

try
{
var funstr = textFunction.Text;
function = context.CompileGeneric(funstr);
}
catch (Exception ex)
{
MessageBox.Show("Error in fun definition. ERROR: "+ex.Message);
label5.Text = "I=n/a";
return;
}

//perform integration
double retVal= Integrate1D(a,b,n);
label5.Text = string.Format("I={0:0.0000000000}", retVal);

Integrate1D(a,b,n); – method is implemented on the following way:

private double Integrate1D(double a, double b, int n)
{
double I = 0;
for (int i = 0; i < n; i++)
{
var ksi = 0.5 * (a + b) + 0.5 * (b - a) * gaus1DValues[n].xi[i];//f(1/2(a+b)+1/2*(a-b)xi)
var dx = 0.5 * (b - a);//det|J|
register.x = ksi;
var f = function.Evaluate();
//intgreation
I += f * dx * gaus1DValues[n].wi[i];
}

return I;
}

As you can see this is very easy implementation,by using great expression evaluator and predefined Gaussian points.

The next blog post will be implementation of integral for 2D problems similar we can see on classic FEA engineering problems of stress, fracture mechanic etc..