Descriptive statistics and data normalization with CNTK and C#


As you probably know CNTK is Microsoft Cognitive Toolkit for deep learning. It is open source library which is used by various Microsoft products. Also the CNTK is powerful library for developing custom ML solutions from various fields with different platforms and languages. What is also so powerful in the CNTK is the way of the implementation. In fact the library is implemented as series of computation graphs, which  is fully elaborated into the sequence of steps performed in a deep neural network training.

Each CNTK compute graph is created with set of nodes where each node represents numerical (mathematical) operation. The edges between nodes in the graph represent data flow between operations. Such a representation allows CNTK to schedule computation on the underlying hardware GPU or CPU. The CNTK can dynamically analyze the graphs in order to to optimize both latency and efficient use of resources. The most powerful part of this is the fact thet the CNTK can calculate derivation of any constructed set of operations, which can be used for efficient learning  process of the network parameters. The flowing image shows the core architecture of the CNTK.

On the other hand, any operation can be executed on CPU or GPU with minimal code changes. In fact we can implement method which can automatically takes GPU computation if available. The CNTK is the first .NET library which provide .NET developers to develop GPU aware .NET applications.

What this exactly mean is that with this powerful library you can develop complex math computation directly to GPU in .NET using C#, which currently is not possible when using standard .NET library.

For this blog post I will show how to calculate some of basic statistics operations on data set.

Say we have data set with 4 columns (features) and 20 rows (samples). The C# implementation of this 2D array is show on the following code snippet:

static float[][] mData = new float[][] {
new float[] { 5.1f, 3.5f, 1.4f, 0.2f},
new float[] { 4.9f, 3.0f, 1.4f, 0.2f},
new float[] { 4.7f, 3.2f, 1.3f, 0.2f},
new float[] { 4.6f, 3.1f, 1.5f, 0.2f},
new float[] { 6.9f, 3.1f, 4.9f, 1.5f},
new float[] { 5.5f, 2.3f, 4.0f, 1.3f},
new float[] { 6.5f, 2.8f, 4.6f, 1.5f},
new float[] { 5.0f, 3.4f, 1.5f, 0.2f},
new float[] { 4.4f, 2.9f, 1.4f, 0.2f},
new float[] { 4.9f, 3.1f, 1.5f, 0.1f},
new float[] { 5.4f, 3.7f, 1.5f, 0.2f},
new float[] { 4.8f, 3.4f, 1.6f, 0.2f},
new float[] { 4.8f, 3.0f, 1.4f, 0.1f},
new float[] { 4.3f, 3.0f, 1.1f, 0.1f},
new float[] { 6.5f, 3.0f, 5.8f, 2.2f},
new float[] { 7.6f, 3.0f, 6.6f, 2.1f},
new float[] { 4.9f, 2.5f, 4.5f, 1.7f},
new float[] { 7.3f, 2.9f, 6.3f, 1.8f},
new float[] { 5.7f, 3.8f, 1.7f, 0.3f},
new float[] { 5.1f, 3.8f, 1.5f, 0.3f},};

If you want to play with CNTK and math calculation you need some knowledge from Calculus, as well as vectors, matrix and tensors. Also in CNTK any operation is performed as matrix operation, which may simplify the calculation process for you. In standard way, you have to deal with multidimensional arrays during calculations. As my knowledge currently there is no .NET library which can perform math operation on GPU, which constrains the .NET platform for implementation of high performance applications.

If we want to compute average value, and standard deviation for each column, we can do that with CNTK very easy way. Once we compute those values we can used them for normalizing the data set by computing standard score (Gauss Standardization).

The Gauss standardization is calculated by the flowing term:

nValue= \frac{X-\nu}{\sigma},
where X- is column values, \nu – column mean, and \sigma– standard deviation of the column.

For this example we are going to perform three statistic operations,and the CNTK automatically provides us with ability to compute those values on GPU. This is very important in case you have data set with millions of rows, and computation can be performed in few milliseconds.

Any computation process in CNTK can be achieved in several steps:

1. Read data from external source or in-memory data,
2. Define Value and Variable objects.
3. Define Function for the calculation
4. Perform Evaluation of the function by passing the Variable and Value objects
5. Retrieve the result of the calculation and show the result.

All above steps are implemented in the following implementation:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using CNTK;
namespace DataNormalizationWithCNTK
{
    class Program
    {
       static float[][] mData = new float[][] {
        new float[] { 5.1f, 3.5f, 1.4f, 0.2f},
        new float[] { 4.9f, 3.0f, 1.4f, 0.2f},
        new float[] { 4.7f, 3.2f, 1.3f, 0.2f},
        new float[] { 4.6f, 3.1f, 1.5f, 0.2f},
        new float[] { 6.9f, 3.1f, 4.9f, 1.5f},
        new float[] { 5.5f, 2.3f, 4.0f, 1.3f},
        new float[] { 6.5f, 2.8f, 4.6f, 1.5f},
        new float[] { 5.0f, 3.4f, 1.5f, 0.2f},
        new float[] { 4.4f, 2.9f, 1.4f, 0.2f},
        new float[] { 4.9f, 3.1f, 1.5f, 0.1f},
        new float[] { 5.4f, 3.7f, 1.5f, 0.2f},
        new float[] { 4.8f, 3.4f, 1.6f, 0.2f},
        new float[] { 4.8f, 3.0f, 1.4f, 0.1f},
        new float[] { 4.3f, 3.0f, 1.1f, 0.1f},
        new float[] { 6.5f, 3.0f, 5.8f, 2.2f},
        new float[] { 7.6f, 3.0f, 6.6f, 2.1f},
        new float[] { 4.9f, 2.5f, 4.5f, 1.7f},
        new float[] { 7.3f, 2.9f, 6.3f, 1.8f},
        new float[] { 5.7f, 3.8f, 1.7f, 0.3f},
        new float[] { 5.1f, 3.8f, 1.5f, 0.3f},};
        static void Main(string[] args)
        {
            //define device where the calculation will executes
            var device = DeviceDescriptor.UseDefaultDevice();

            //print data to console
            Console.WriteLine($"X1,\tX2,\tX3,\tX4");
            Console.WriteLine($"-----,\t-----,\t-----,\t-----");
            foreach (var row in mData)
            {
                Console.WriteLine($"{row[0]},\t{row[1]},\t{row[2]},\t{row[3]}");
            }
            Console.WriteLine($"-----,\t-----,\t-----,\t-----");


            //convert data into enumerable list
            var data = mData.ToEnumerable<IEnumerable<float>>();

            
            //assign the values 
            var vData = Value.CreateBatchOfSequences<float>(new int[] {4},data, device);
            //create variable to describe the data
            var features = Variable.InputVariable(vData.Shape, DataType.Float);

            //define mean function for the variable
            var mean =  CNTKLib.ReduceMean(features, new Axis(2));//Axis(2)- means calculate mean along the third axes which represent 4 features
            
            //map variables and data
            var inputDataMap = new Dictionary<Variable, Value>() { { features, vData } };
            var meanDataMap = new Dictionary<Variable, Value>() { { mean, null } };

            //mean calculation
            mean.Evaluate(inputDataMap,meanDataMap,device);
            //get result
            var meanValues = meanDataMap[mean].GetDenseData<float>(mean);

            Console.WriteLine($"");
            Console.WriteLine($"Average values for each features x1={meanValues[0][0]},x2={meanValues[0][1]},x3={meanValues[0][2]},x4={meanValues[0][3]}");

            //Calculation of standard deviation
            var std = calculateStd(features);
            var stdDataMap = new Dictionary<Variable, Value>() { { std, null } };
            //mean calculation
            std.Evaluate(inputDataMap, stdDataMap, device);
            //get result
            var stdValues = stdDataMap[std].GetDenseData<float>(std);
            
            Console.WriteLine($"");
            Console.WriteLine($"STD of features x1={stdValues[0][0]},x2={stdValues[0][1]},x3={stdValues[0][2]},x4={stdValues[0][3]}");

            //Once we have mean and std we can calculate Standardized values for the data
            var gaussNormalization = CNTKLib.ElementDivide(CNTKLib.Minus(features, mean), std);
            var gaussDataMap = new Dictionary<Variable, Value>() { { gaussNormalization, null } };
            //mean calculation
            gaussNormalization.Evaluate(inputDataMap, gaussDataMap, device);

            //get result
            var normValues = gaussDataMap[gaussNormalization].GetDenseData<float>(gaussNormalization);
            //print data to console
            Console.WriteLine($"-------------------------------------------");
            Console.WriteLine($"Normalized values for the above data set");
            Console.WriteLine($"");
            Console.WriteLine($"X1,\tX2,\tX3,\tX4");
            Console.WriteLine($"-----,\t-----,\t-----,\t-----");
            var row2 = normValues[0];
            for (int j = 0; j < 80; j += 4)
            {
                Console.WriteLine($"{row2[j]},\t{row2[j + 1]},\t{row2[j + 2]},\t{row2[j + 3]}");
            }
            Console.WriteLine($"-----,\t-----,\t-----,\t-----");
        }

        private static Function calculateStd(Variable features)
        {
            var mean = CNTKLib.ReduceMean(features,new Axis(2));
            var remainder = CNTKLib.Minus(features, mean);
            var squared = CNTKLib.Square(remainder);
            //the last dimension indicate the number of samples
            var n = new Constant(new NDShape(0), DataType.Float, features.Shape.Dimensions.Last()-1);
            var elm = CNTKLib.ElementDivide(squared, n);
            var sum = CNTKLib.ReduceSum(elm, new Axis(2));
            var stdVal = CNTKLib.Sqrt(sum);
            return stdVal;
        }
    }

    public static class ArrayExtensions
    {
        public static IEnumerable<T> ToEnumerable<T>(this Array target)
        {
            foreach (var item in target)
                yield return (T)item;
        }
    }
}

The output for the source code above should look like:

Advertisements

Announcing GPdotNET v5 and related Book


https://www.igi-global.com/Images/Covers/9781522560050.png

After one year of writing and coding, finally I  can announce my two big achievements which are related to each other:

1. The fifth version of my open source project GPdotNET – genetic programming tool, and

2. The book:Optimized Genetic Programming Applications: Emerging Research and Opportunities, published by IGI-GLobal.

Along the book, I was developing GPdotNET application which is explained in Chapter 5. Actually the Chapter 5 described in depth all aspects of the application, with real world examples.

As can be seen GPdotNET v5 is completely rewritten application, with new logo and GUI. As Introduction of the application I have prepared several videos on youtube with quick explanation how to use some of the main modules in GPdotNET.

Using ANNdotNET – GUI tool to create CNTK based model for Iris data set


In this tutorial we are going to create and train Iris model using ANNdotNET.  ANNdotNET is windows application for creating and training CNTK based models without leaving GUI.

All procedures from downloading the data set, to exporting model, can be achieved in 6 steps.

1. Step: Download the data set file from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data.

2. Step: Open ANNdotNET application. Press New command, select Project 1 tree item and rename the project  into Iris Data Set.

2. Step: Select Data Command from Model Preparation ribbon group, Click File button from Import experimenal data dialog and select the recently downloaded file. Check Comma check box and press Import Data button.

3. Steps: Double click on Scaling for each column, and select MinMax normalization option from the popup ComboBox list. Double click on Type for the output column, and select Category, and 1:N for encoding. More information how to prepare data for ML you can find at https://bhrnjica.net/2018/03/01/data-preparation-tool-for-machine-learning/

4. Steps: Once the data is prepared Click Create Model Command and Model Settings panel is shown. Setup parameters as shown on the image below and click Run command.

5. Steps: Once the model is trained you can evaluate model by selecting Evaluate Command. Depending on the model type (regression, Binary or Multi class classification) The appropriate Evaluation dialog appears. Since this is multi class classification model, the Confusion matrix is shows, with micro and macron performance parameters.

6. Steps: For further analysis you can export model to Excel, or into ONNX. Also you can save the project which can later be opened and retrained again.

Note: Currently ANNdotNET is in alpha version, and more feature will come in near future.

Data Preparation Tool for Machine Learning


Regardless of machine learning library you use, the data preparation is the first and one of the most important step in developing predictive models. It is very often case that the data supposed to be used for the training is dirty with lot of unnecessary columns, full of missing values, un-formatted numbers etc. Before training the data must be cleaned and properly defined in order to get good model. This is known as data preparation. The data preparation consist of cleaning the data, defining features and labels, deriving the new features from the existing data, handling missing values, scaling the data etc.  It can be concluded that the total time we spend in ML modelling,the most of it is related to data preparation.

In this blog post I am going to present the simple tool which can significantly reduce the preparation time for ML. The tool simply loads the data in to GUI, and then the user can define all necessary information. Once the data is prepared user can store the data it to files which can be then directly imported into ML algorithm such as CNTK.

The following image shows the ML Data Preparation Tool main window.

From the image above, the data preparation can be achieved in several steps.

  1. Load dirty data into ML Prep Tool, by pressing Import Data button
  2. Transform the data by providing the flowing:
    1. Type – each column can be:
      1. Numeric – which holds continuous numeric values,
      2. Binary – which indicates two class categorical data,
      3. Category – which indicates categorical data with more than two classes,
      4. String – which indicate the column will not be part of training and testing data set,
    2. Encoding – in case of Binary and Category column type, the encoding must be defined. The flowing encoding is supported:
      1. Binary Encoding with (0,1) – first binary values will be 0, and second binary values will be 1.
      2. Binary encoding with (-1,1) – first binary values will be -1, and second binary values will be 1.
      3. Category Level- which each class treats as numeric value. In case of 3 categories(R,G, B), encoding will be (0,1,2)
      4. Category 1:N- implements One-Hot vector with N columns. In case of 3 categories(R,G, B), encoding will be R =  (1,0,0),G =  (0,1,0), B =  (0,0,1).
      5. Category 1:N-1(0) – implements dummy coding with N-1 columns. In case of 3 categories(R, G, B), encoding will be R =  (1,0),G =  (0,1), B =  (0,0).
      6. Category 1:N-1(-1) – implements dummy coding with N-1 columns. In case of 3 categories(R, G, B), encoding will be R =  (1,0),G =  (0,1), B =  (-1,-1).
    3. Variable – defines features and label. Only one label, and at least one features can be defined. Also the column can be defined as Ignore variable, which will skip that column.  The following options are sported:
      1. Input – which identifies the column as feature or predictor,
      2. Output – which identifies the column as label or model output.
    4. Scaling – defines column scaling. Two scaling options are supported:
      1. MinMax,
      2. Gauss Standardization,
    5. Missing Values – defines the replacement for the missing value withing the column. There are several options related to numeric and two options (Random and Mode ) for categorical type.
  3. Define the testing data set size by providing information of row numbers or percent.
  4. Define export options
  5. Press Export Button.

As can be seen this is straightforward workflow of data preparation.

Besides the general export options which can be achieved by selecting different delimiter options, you can export data set in to CNTK format, which is very handy if you play with CNTK.

After data transformations, the user need to check CNTK format int the export options and press Export in order to get CNTK training and testing files, which can be directly used in the code without any modifications.

Some of examples will be provided int he next blog post.

The project is hosted at GitHub, where the source code can be freely downloaded and used at this location: https://github.com/bhrnjica/MLDataPreparationTool .

In case you want only binaries, the release of version v1.0 is published here: https://github.com/bhrnjica/MLDataPreparationTool/releases/tag/v1.0

Advance Technology Days 13: Predavanje o C# 7.* kompajleru


Predavanje o novoj verziji C# 7 kompajlera prošla je vrlo uspješno, a mnogobrojna publika pokazala je da se ipak i ovakve teme mogu uraditi zanimljive i interesantne.

Sve prezentirano na predavanju moguće je preuzeti sa donjeg linka. Na linku je uključena prezentacijska datoteka i demo primjeri u C#:

Trikovi u C# 7.1 u službi objektnog orijentiranog i funkcionalnog programera

 

Vidimo se na ATD14

 

 

 

I am Weblica 2016 Speaker


weblica2016

Drugi put za redom u Čakovcu se održava konferencija o Web tehnologijama pod nazivom “Weblica“. Mjesto održavanja je Tehnološko-Inovacijski centar u Čakovcu, a konferencija je zakazana za subotu 14 maja.

Naziv konferencije, kako su organizatori naglasili, predstavlja asocijaciju na “Tiblicu” – tradicionalno međimursko jelo. Kompletna organizacija konferencije pripada MSCommunityu iz Čakovca, a na konferenciji će se naći predavači iz Hrvatske i regiona. Konferenciju sponzoriše INETA EUROPE te druge organizacije.

Po drugi put pripala mi je čast da budem dio ove konferencije na kojoj ću održati predavanje na temu Entity Framework Core 1.0 (EF Core). EF Core se trenutno nalazi u release candidate verziji, a dosad je pokupio više simpatija i interesovanje nego sve prethodne verzije. EF Core 1.0 donosi pregršt novosti iz područja backed razvoja, koji je po performansama superioran nad svojim prethodnikom EF 6.

Pored toga EF Core 1.0 dolazi uz .NET Core koji predstavlja novu platformu za razvoj Web, Desktop i Mobilnih aplikacije koje su prvenstveno neovisne od operativnog sistemu i uređaju na kojem se vrte. Po svojoj prirodi .NET Core se definiše  kao dio danas najpopularnije platforme Microsoft .NET Frameworka, koji je u open source verziji najmjenjen za razvoja aplikacija visokih performanci, neovisne o operativnom sisitemu, a koje se nativno izvršavaju uz pomoć .NET Native kompajlera.

Na predavanju će se kroz primjere na Linux-Ubuntu i Windowsima moći vidjeti neke od najznačajnijih osobina EF Core, a koje će se naći u konačnoj verziji koja je planirana da izađe u drugoj polovini ove godine.