Blog Archives

Testing and Validation CNTK models using C#


…continue from the previous post.
Once the model is built and the loss and validation functions meet our expectations, we need to validate and test the model using data which was not part of the training data set (unseen data). Model validation is very important because we want to see whether our model is trained well enough to evaluate unseen data approximately as well as the training data. A model which cannot predict the output on unseen data is called an overfitted model. Overfitting can happen when the model is trained long enough that it shows very high performance on the training data set but produces poor results on the testing data.
We will continue with the implementation from the previous two posts and implement model validation. After the model is trained, the model and the trainer are passed to the evaluation method. The evaluation method loads the testing data and calculates the output using the passed model. Then it compares the calculated (predicted) values with the outputs from the testing data set and calculates the accuracy. The following source code shows the evaluation implementation.

private static void EvaluateIrisModel(Function ffnn_model, Trainer trainer, DeviceDescriptor device)
{
    var dataFolder = "Data";//the file must be in the same folder as the program
    var testPath = Path.Combine(dataFolder, "testIris_cntk.txt");
    var featureStreamName = "features";
    var labelsStreamName = "label";

    //extract features and label from the model
    var feature = ffnn_model.Arguments[0];
    var label = ffnn_model.Output;

    //stream configuration to distinct features and labels in the file
    var streamConfig = new StreamConfiguration[]
        {
            new StreamConfiguration(featureStreamName, feature.Shape[0]),
            new StreamConfiguration(labelsStreamName, label.Shape[0])
        };

    // prepare testing data
    var testMinibatchSource = MinibatchSource.TextFormatMinibatchSource(
        testPath, streamConfig, MinibatchSource.InfinitelyRepeat, true);
    var featureStreamInfo = testMinibatchSource.StreamInfo(featureStreamName);
    var labelStreamInfo = testMinibatchSource.StreamInfo(labelsStreamName);

    int batchSize = 20;
    int miscountTotal = 0, totalCount = 0;
    while (true)
    {
        var minibatchData = testMinibatchSource.GetNextMinibatch((uint)batchSize, device);
        if (minibatchData == null || minibatchData.Count == 0)
            break;
        totalCount += (int)minibatchData[featureStreamInfo].numberOfSamples;

        // expected labels are in the mini batch data.
        var labelData = minibatchData[labelStreamInfo].data.GetDenseData<float>(label);
        var expectedLabels = labelData.Select(l => l.IndexOf(l.Max())).ToList();

        var inputDataMap = new Dictionary<Variable, Value>() {
            { feature, minibatchData[featureStreamInfo].data }
        };

        var outputDataMap = new Dictionary<Variable, Value>() {
            { label, null }
        };

        ffnn_model.Evaluate(inputDataMap, outputDataMap, device);
        var outputData = outputDataMap[label].GetDenseData<float>(label);
        var actualLabels = outputData.Select(l => l.IndexOf(l.Max())).ToList();

        int misMatches = actualLabels.Zip(expectedLabels, (a, b) => a.Equals(b) ? 0 : 1).Sum();

        miscountTotal += misMatches;
        Console.WriteLine($"Validating Model: Total Samples = {totalCount}, Mis-classify Count = {miscountTotal}");

        if (totalCount >= 20)
            break;
    }
    Console.WriteLine($"---------------");
    Console.WriteLine($"------TESTING SUMMARY--------");
    float accuracy = 1.0F - (float)miscountTotal / totalCount;
    Console.WriteLine($"Model Accuracy = {accuracy}");
}

The implemented method is called from the Training method implemented in the previous post:

 EvaluateIrisModel(ffnn_model, trainer, device);

As can be seen, the model validation shows that the model predicts the unseen data with high accuracy, which is shown in the following picture.

This was the last post in the series of blog posts about using feed-forward neural networks to train on the Iris data using CNTK and C#.

The full source code for all three samples can be found here.


Train Iris data by Batch using CNTK and C#


In the previous post we saw how to train an NN model by using a MinibatchSource. We should usually use it when we have a large amount of data. In the case of a small amount of data, all of it can be loaded into memory and passed to the trainer in each iteration. This blog post will implement this kind of feeding of the trainer.
We will reuse the previous implementation, so the previous source code can be the starting point. For data loading we have to define a new method. The Iris data is stored in text format like the following:

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa(1 0 0)
7.0,3.2,4.7,1.4,versicolor(0 1 0)
7.6,3.0,6.6,2.1,virginica(0 0 1)
...

The species column is encoded by the 1-of-N (one-hot) rule we have seen previously, so in the actual data file each row ends with three 0/1 columns (e.g. setosa becomes 1,0,0).
The method will read all the data from the file, parse it, and create two float arrays:

  • float[] feature, and
  • float[] label.

As can be seen, both arrays are 1D, which means all data is stored in 1D arrays, because CNTK requires so. Since the data is in 1D arrays, we should also provide the dimensionality of the data so that CNTK can resolve which values belong to each feature. The following listing shows how the Iris data is loaded into two 1D arrays returned as a tuple.

static (float[], float[]) loadIrisDataset(string filePath, int featureDim, int numClasses)
{
    var rows = File.ReadAllLines(filePath);
    var features = new List<float>();
    var label = new List<float>();
    for (int i = 1; i < rows.Length; i++)//start from 1 to skip the header row
    {
        var row = rows[i].Split(',');
        var input = new float[featureDim];
        for (int j = 0; j < featureDim; j++)
        {
            input[j] = float.Parse(row[j], CultureInfo.InvariantCulture);
        }
        var output = new float[numClasses];
        for (int k = 0; k < numClasses; k++)
        {
            int oIndex = featureDim + k;
            output[k] = float.Parse(row[oIndex], CultureInfo.InvariantCulture);
        }

        features.AddRange(input);
        label.AddRange(output);
    }

    return (features.ToArray(), label.ToArray());
}
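As a quick check of the flattened layout, calling the method for the Iris file with 4 features and 3 classes returns a features array of length rowCount * 4 and a label array of length rowCount * 3. A small sketch (using the same file path as later in this post):

//sketch: verify the flattened 1D layout of the loaded data
var dataSet = loadIrisDataset("Data/iris_with_hot_vector.csv", 4, 3);
Console.WriteLine($"features length = {dataSet.Item1.Length}"); //rowCount * 4
Console.WriteLine($"labels length   = {dataSet.Item2.Length}"); //rowCount * 3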

Once the data is loaded, we need to change only a small amount of the previous code in order to implement batching instead of using a MinibatchSource. At the beginning we define several variables for the NN model structure. Then we call loadIrisDataset and create xValues and yValues from the loaded arrays, as well as the feature and label input variables. Then we create a dictionary which connects the variables with the data values; we will pass it to the trainer later.
The next code is the same as in the previous version: it creates the NN model, the loss and evaluation functions, and the learning rate.

Then we create a loop of 800 iterations. Once the iteration count reaches the maximum value, the program outputs the model properties and terminates.
All of the above is implemented in the following code.

public static void TrainIriswithBatch(DeviceDescriptor device)
{
    //data file path
    var iris_data_file = "Data/iris_with_hot_vector.csv";

    //Network definition
    int inputDim = 4;
    int numOutputClasses = 3;
    int numHiddenLayers = 1;
    int hidenLayerDim = 6;
    int sampleSize = 130;

    //load data in to memory
    var dataSet = loadIrisDataset(iris_data_file, inputDim, numOutputClasses);

    // build a NN model
    //define input and output variable
    var xValues = Value.CreateBatch<float>(new NDShape(1, inputDim), dataSet.Item1, device);
    var yValues = Value.CreateBatch<float>(new NDShape(1, numOutputClasses), dataSet.Item2, device);

    //define the feature and label input variables
    var feature = Variable.InputVariable(new NDShape(1, inputDim), DataType.Float);
    var label = Variable.InputVariable(new NDShape(1, numOutputClasses), DataType.Float);

    //Combine variables and data in to Dictionary for the training
    var dic = new Dictionary<Variable, Value>();
    dic.Add(feature, xValues);
    dic.Add(label, yValues);

    //Build simple Feed Froward Neural Network model
    // var ffnn_model = CreateMLPClassifier(device, numOutputClasses, hidenLayerDim, feature, classifierName);
    var ffnn_model = createFFNN(feature, numHiddenLayers, hidenLayerDim, numOutputClasses, Activation.Tanh, "IrisNNModel", device);

    //Loss and error functions definition
    var trainingLoss = CNTKLib.CrossEntropyWithSoftmax(new Variable(ffnn_model), label, "lossFunction");
    var classError = CNTKLib.ClassificationError(new Variable(ffnn_model), label, "classificationError");

    // set learning rate for the network
    var learningRatePerSample = new TrainingParameterScheduleDouble(0.001125, 1);

    //define learners for the NN model
    var ll = Learner.SGDLearner(ffnn_model.Parameters(), learningRatePerSample);

    //define trainer based on ffnn_model, loss and error functions , and SGD learner
    var trainer = Trainer.CreateTrainer(ffnn_model, trainingLoss, classError, new Learner[] { ll });

    //Preparation for the iterative learning process
    //used 800 epochs/iterations. Batch size will be the same as sample size since the data set is small
    int epochs = 800;
    for (int i = 1; i <= epochs; i++)
    {
        //feed the whole data set to the trainer in each iteration
        trainer.TrainMinibatch(dic, device);

        //print progress
        printTrainingProgress(trainer, i, 50);
    }
    //Summary of training
    double acc = Math.Round((1.0 - trainer.PreviousMinibatchEvaluationAverage()) * 100, 2);
    Console.WriteLine($"------TRAINING SUMMARY--------");
    Console.WriteLine($"The model trained with the accuracy {acc}%");
}
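The helpers createFFNN and printTrainingProgress are reused from the previous post and are not repeated above. For readers jumping straight into this post, the following are minimal sketches of how they might look; the Activation enum also comes from the previous post, and these are assumptions, not the original implementations:

private static Function createFFNN(Variable input, int hiddenLayerCount, int hiddenLayerDim,
    int outputDim, Activation activation, string modelName, DeviceDescriptor device)
{
    //fully connected layer: out = W * x + b
    Func<Variable, int, Function> denseLayer = (x, outDim) =>
    {
        var W = new Parameter(NDShape.CreateNDShape(new int[] { outDim, x.Shape[0] }),
            DataType.Float, CNTKLib.GlorotUniformInitializer(), device, "w");
        var b = new Parameter(NDShape.CreateNDShape(new int[] { outDim }),
            DataType.Float, 0.0, device, "b");
        return CNTKLib.Plus(b, new Variable(CNTKLib.Times(W, x)));
    };

    //hidden layers; only the Tanh case of the Activation enum is sketched here
    var h = CNTKLib.Tanh(new Variable(denseLayer(input, hiddenLayerDim)));
    for (int i = 1; i < hiddenLayerCount; i++)
        h = CNTKLib.Tanh(new Variable(denseLayer(new Variable(h), hiddenLayerDim)));

    //linear output layer; softmax is applied later by CrossEntropyWithSoftmax
    //(the modelName argument is left unused in this sketch)
    return denseLayer(new Variable(h), outputDim);
}

private static void printTrainingProgress(Trainer trainer, int iteration, int frequency)
{
    //print the averaged loss and classification error of the last minibatch
    if (iteration % frequency == 0)
        Console.WriteLine($"Iteration: {iteration}, Loss: {trainer.PreviousMinibatchLossAverage()}, " +
            $"Error: {trainer.PreviousMinibatchEvaluationAverage()}");
}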

If we run the code, the output will be the same as we got from the previous blog post example:

The full source code with training data can be downloaded here.

Using CNTK 2.2 and Python to learn from Iris data


Now that we have set up CNTK 2.2 and Python, we can start with the first example. For the first time, we can take the Iris data. The data set has a categorical output value which contains three classes: Setosa, Virginica and Versicolor. The features consist of 4 real-valued inputs. The Iris data set can easily be found on the internet; one of the places is http://kaggle.com.

Usually, the Iris data is given in the following format:
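For reference, a few sample rows of the standard Iris CSV file, matching the listing used elsewhere in this series:

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
...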

Since we are going to use CNTK, we should prepare the data in the CNTK text file format, which is far from the format above. This format has a different structure and looks like the following:
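For comparison, here are the same rows in the CNTK text format; the |features and |label tags are assumed to match the StreamDef field names used in the reader below:

|features 5.1 3.5 1.4 0.2 |label 1 0 0
|features 7.0 3.2 4.7 1.4 |label 0 1 0
|features 7.6 3.0 6.6 2.1 |label 0 0 1
...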

The difference is obvious. Transforming the previous file format into the CNTK format took me several minutes, and now we can continue with the implementation.

First, let's implement a simple Python function to read the CNTK format. For the implementation we are going to use the CNTK MinibatchSource, which was specially developed to handle file data. The following Python code reads the file and returns a MinibatchSource.

import cntk

# The data in the file must satisfy the following format:
# |label 0 0 1 |features 2.1 7.0 2.2 1.3
# - each row consists of 4 features and a 3-component hot vector representing the iris flower
def create_reader(path, is_training, input_dim, num_label_classes):

    # create the streams separately for the label and for the features
    labelStream = cntk.io.StreamDef(field='label', shape=num_label_classes, is_sparse=False)
    featureStream = cntk.io.StreamDef(field='features', shape=input_dim, is_sparse=False)

    # create the deserializer by providing the file path and the related streams
    deserializer = cntk.io.CTFDeserializer(path, cntk.io.StreamDefs(labels=labelStream, features=featureStream))

    # create the mini batch source as the function's return value
    mb = cntk.io.MinibatchSource(deserializer, randomize=is_training,
                                 max_sweeps=cntk.io.INFINITELY_REPEAT if is_training else 1)
    return mb

The code above takes several arguments:

  • path – the file path where the data is stored,
  • is_training – a Boolean which indicates whether the data is used for training or testing; in case of training the data will be randomized,
  • input_dim, num_label_classes – the number of input features and the size of the output hot vector; these two arguments are important in order to properly parse the file.

The method first creates the two streams, which are passed as arguments when creating the deserializer, which in turn is used for the MinibatchSource creation. The function returns a MinibatchSource object which the trainer uses for data handling.

Once we have implemented the data reader, we need a Python function for model creation. For the Iris data set we are going to create a 4-50-3 feed-forward neural network, which consists of an input layer with 4 neurons, one hidden layer with 50 neurons, and an output layer with 3 neurons. The hidden layer uses the tanh activation function.

The function which creates the NN model looks like the following code snippet:

#model creation
# FFNN with one input, one hidden and one output layer 
def create_model(features, hid_dim, out_dim):
    #perform some initialization 
    with cntk.layers.default_options(init = cntk.glorot_uniform()):
        #hidden layer with hid_def number of neurons and tanh activation function
        h1=cntk.layers.Dense(hid_dim, activation= cntk.ops.tanh, name='hidLayer')(features)
        #output layer with out_dim neurons
        o = cntk.layers.Dense(out_dim, activation = None)(h1)
        return o

As can be seen, the Dense function creates a layer where the user has to specify the layer dimension, the activation function and the input variable. When the hidden layer is created, its input variable is set to the input data. The output layer is created with the hidden layer as its input.

One more helper function shows the progress of the learner. The following function takes three arguments and prints the current status of the trainer.

# Function that prints the training progress
def print_training_progress(trainer, mb, frequency):
    training_loss = "NA"
    eval_error = "NA"

    if mb%frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}%".format(mb, training_loss, eval_error*100))   
    return mb, training_loss, eval_error

Once we have implemented all three functions, we can start CNTK learning on the Iris data.

At the beginning, we have to specify some helper variables which we will use later.

#setting up the NN type
input_dim=4
hidden_dim = 50
num_output_classes=3
input = cntk.input_variable(input_dim)
label = cntk.input_variable(num_output_classes)

Create the reader for data batching.

# Create the reader to training data set
reader_train= create_reader("C:/sc/Offline/trainData_cntk.txt",True,input_dim, num_output_classes)

Then create the NN model, with Loss and Error functions:

#Create model and Loss and Error function
z = create_model(input, hidden_dim, num_output_classes)
loss = cntk.cross_entropy_with_softmax(z, label)
label_error = cntk.classification_error(z, label)

Then we define what the trainer looks like. The trainer will use a Stochastic Gradient Descent (SGD) learner, with a learning rate of 0.2:

# Instantiate the trainer object to drive the model training
learning_rate = 0.2
lr_schedule = cntk.learning_parameter_schedule(learning_rate)
learner = cntk.sgd(z.parameters, lr_schedule)
trainer = cntk.Trainer(z, (loss, label_error), [learner])

Now we need to define the parameters for learning and for showing the results.

# Initialize the parameters for the trainer
minibatch_size = 120 #mini batch size will be full data set
num_iterations = 20 #number of iterations 

# Map the data streams to the input and labels.
input_map = {
    label : reader_train.streams.labels,
    input : reader_train.streams.features
}
# Run the trainer on and perform model training
training_progress_output_freq = 1

plotdata = {"batchsize":[], "loss":[], "error":[]}

As can be seen, the batch size is set to the data set size, which is typical for small data sets. Since we defined the minibatch as the whole data set, the number of iterations can be very small, because the Iris data is very simple and the learner will find a good result very fast.

Running the trainer looks very simple. In each iteration, the reader loads the batch-size amount of data and passes it to the trainer. The trainer performs the learning process using the SGD learner and returns the loss and the error value for the current iteration. Then we call the print function to show the progress of the trainer.

for i in range(0, int(num_iterations)):
    # Read a mini batch from the training data file
    data = reader_train.next_minibatch(minibatch_size, input_map=input_map)
    trainer.train_minibatch(data)
    batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq)
    if not (loss == "NA" or error == "NA"):
        plotdata["batchsize"].append(batchsize)
        plotdata["loss"].append(loss)
        plotdata["error"].append(error)

Once the learning process is completed, we can present some results.

# Plot the training loss and the training error
import matplotlib.pyplot as plt

plt.figure(1)
plt.subplot(211)
plt.plot(plotdata["batchsize"], plotdata["loss"], 'b--')
plt.xlabel('Minibatch number')
plt.ylabel('Loss')
plt.title('Minibatch run vs. Training loss')

plt.subplot(212)
plt.plot(plotdata["batchsize"], plotdata["error"], 'r--')
plt.xlabel('Minibatch number')
plt.ylabel('Label Prediction Error')
plt.title('Minibatch run vs. Label Prediction Error')
plt.show()

We plot the loss and the prediction error, which can also be expressed as the total accuracy of the classifier. The following pictures show those graphs.
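If you would rather plot accuracy than the raw prediction error, the collected error values can be converted before plotting. A minimal sketch (the accuracy key is not part of the original plotdata dictionary):

# optional: convert the prediction error into classification accuracy in percent
plotdata["accuracy"] = [(1.0 - err) * 100 for err in plotdata["error"]]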

The last part of the ML procedure is testing, or validating, the model. For the Iris data set we prepared 20 samples which will be used for testing. The code is similar to the previous one, except that we call create_reader with a different file name. Then we evaluate the model against the test data, grab the loss and error values, and print them out.

# Create the reader for the test data
reader_test = create_reader("C:/sc/Offline/testData_cntk.txt", False, input_dim, num_output_classes)

test_input_map = {
    label  : reader_test.streams.labels,
    input  : reader_test.streams.features,
}

# Test data for trained model
test_minibatch_size = 20
num_samples = 20
num_minibatches_to_test = num_samples // test_minibatch_size
test_result = 0.0

for i in range(num_minibatches_to_test):
    data = reader_test.next_minibatch(test_minibatch_size, input_map=test_input_map)
    eval_error = trainer.test_minibatch(data)
    test_result = test_result + eval_error

# Average of evaluation errors of all test minibatches
print("Average test error: {0:.2f}%".format(test_result*100 / num_minibatches_to_test))

Full sample with python code and data set can be found here.

Using CNTK with Visual Studio 2017 and Python


The next few steps show how to install CNTK and the Python environment for Visual Studio 2017.

  1. First download the latest CNTK version from the official GitHub page, or just click on the following link: https://github.com/Microsoft/CNTK/releases

The release page shows the latest bits. Click on the CPU-only package, accept the license and download the zip file.

  2. Once you have the zip file on your PC, create the folder C:/local on disk and unzip the package into it.
  3. The next step performs the installation of the library as well as the installation of the related Python distribution, Anaconda 4.1.1.
  4. Open the C:\local\cntk\Scripts\install\windows path and run the install.bat file. You will need administrative rights in order to successfully install all required components.
  5. The following image shows the installation process:

  6. As can be seen, first you have to run the batch file (step 4), then press 1 and ENTER in order to continue with the installation process, and press 'y' to start downloading the required components.
  7. The installation process takes several minutes to complete. The first component to be installed is Anaconda 4.1.1, which is needed in order to set up CNTK.

  8. Once Anaconda is installed, the CNTK installation starts; it passes very quickly since we have already downloaded all the CNTK bits.

  9. Now that we have CNTK installed, the last installation step is the installation of the Visual Studio tools for Python.
  10. Run the Visual Studio 2017 Installer and, once the installer is shown, select the Python components similar to the picture below:

  11. Once the installation is completed, run Visual Studio 2017.
  12. From the Visual Studio 2017 Tools menu, select Python and then select Python Environments:

  13. From the Python Environments window, select Anaconda 4.1.1 and update the symbols DB by pressing the button pointed out in the image below:

  14. Once the environment is updated, press the “Make this the default environment for the new projects” option in order to apply the environment to future Python CNTK based projects.
  15. Also, the paths for Python and the Python scripts should be registered in the OS global environment variables.

  16. Once the previous steps are performed successfully, we can start writing CNTK-aware Python code in Visual Studio 2017.
  17. Open VS 2017 with the Anaconda 4.1.1 environment and type:

import cntk

print("CNTK version:", cntk.__version__)

  18. Similar output should appear.