Regardless of which machine learning library you use, data preparation is the first and one of the most important steps in developing predictive models. Very often the data intended for training is dirty: it has many unnecessary columns, is full of missing values, contains unformatted numbers, etc. Before training, the data must be cleaned and properly defined in order to get a good model. This is known as data preparation. Data preparation consists of cleaning the data, defining features and labels, deriving new features from the existing data, handling missing values, scaling the data, etc. It is fair to say that most of the total time we spend on ML modelling goes into data preparation.
In this blog post I am going to present a simple tool which can significantly reduce the preparation time for ML. The tool simply loads the data into a GUI, where the user can define all the necessary information. Once the data is prepared, the user can store it in files which can then be imported directly into an ML library such as CNTK.
The following image shows the ML Data Preparation Tool main window.
As the image above shows, data preparation can be achieved in several steps:
- Load the dirty data into ML Prep Tool by pressing the Import Data button.
- Transform the data by providing the following:
  - Type – each column can be:
    - Numeric – holds continuous numeric values,
    - Binary – indicates two-class categorical data,
    - Category – indicates categorical data with more than two classes,
    - String – indicates that the column will not be part of the training and testing data sets.
  - Encoding – for Binary and Category column types, the encoding must be defined. The following encodings are supported:
    - Binary Encoding with (0,1) – the first binary value is encoded as 0 and the second as 1.
    - Binary Encoding with (-1,1) – the first binary value is encoded as -1 and the second as 1.
    - Category Level – treats each class as a numeric value. In case of 3 categories (R, G, B), the encoding will be (0, 1, 2).
    - Category 1:N – implements a One-Hot vector with N columns. In case of 3 categories (R, G, B), the encoding will be R = (1,0,0), G = (0,1,0), B = (0,0,1).
    - Category 1:N-1(0) – implements dummy coding with N-1 columns. In case of 3 categories (R, G, B), the encoding will be R = (1,0), G = (0,1), B = (0,0).
    - Category 1:N-1(-1) – implements dummy coding with N-1 columns, where the last class is encoded as -1 in every column (a scheme often called effect coding). In case of 3 categories (R, G, B), the encoding will be R = (1,0), G = (0,1), B = (-1,-1).
  - Variable – defines the features and the label. Only one label and at least one feature can be defined. A column can also be defined as an Ignore variable, which skips that column. The following options are supported:
    - Input – identifies the column as a feature or predictor,
    - Output – identifies the column as the label or model output.
  - Scaling – defines column scaling. Two scaling options are supported:
    - Gauss Standardization,
  - Missing Values – defines the replacement for missing values within the column. There are several options for numeric types and two options (Random and Mode) for categorical types.
- Define the size of the testing data set by providing either a number of rows or a percentage.
- Define the export options.
- Press the Export button.
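To make the encoding options listed above more concrete, here is a minimal Python sketch that reproduces the (R, G, B) examples. It is not the tool's own code: the function names and the scheme labels (`level`, `one_hot`, `dummy_zero`, `dummy_minus_one`) are hypothetical names chosen for this illustration, and the order of the classes is assumed to be given by the caller.

```python
def encode_category(value, classes, scheme):
    """Encode one categorical value as a list of numbers.

    classes is the ordered list of distinct classes, e.g. ["R", "G", "B"].
    scheme is a hypothetical label for this sketch, not a name used by the tool.
    """
    index = classes.index(value)
    n = len(classes)
    if scheme == "level":
        # Category Level: the class position is used as the numeric value.
        return [index]
    if scheme == "one_hot":
        # Category 1:N: one column per class, 1 marks the active class.
        return [1 if i == index else 0 for i in range(n)]
    if scheme in ("dummy_zero", "dummy_minus_one"):
        # Category 1:N-1: the last class has no column of its own.
        if index == n - 1:
            fill = 0 if scheme == "dummy_zero" else -1
            return [fill] * (n - 1)
        return [1 if i == index else 0 for i in range(n - 1)]
    raise ValueError(f"unknown scheme: {scheme}")


def encode_binary(value, classes, low=0):
    """Encode a two-class value: the first class maps to low (0 or -1), the second to 1."""
    if len(classes) != 2:
        raise ValueError("binary encoding needs exactly two classes")
    return low if value == classes[0] else 1


if __name__ == "__main__":
    rgb = ["R", "G", "B"]
    for scheme in ("level", "one_hot", "dummy_zero", "dummy_minus_one"):
        print(scheme, [encode_category(c, rgb, scheme) for c in rgb])
```

The two N-1 variants differ only in how the last class is written, which is why they share one branch in the sketch.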
As can be seen, this is a straightforward data preparation workflow.
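For readers who want to see what the numeric steps of the workflow amount to, the sketch below shows missing-value replacement, Gauss standardization and the test split in plain Python. It rests on several assumptions: the mean is taken as one of the numeric replacement options, Gauss standardization is read as the usual z-score with the population standard deviation, and the test set is taken from the end of the data. None of these details, nor the function names, come from the tool itself.

```python
import statistics


def fill_missing_numeric(values):
    """Replace None with the mean of the present values (mean is an assumed option)."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def fill_missing_mode(values):
    """Replace None with the most frequent present value (the Mode option)."""
    present = [v for v in values if v is not None]
    mode = statistics.mode(present)
    return [mode if v is None else v for v in values]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std is an assumption
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def split_train_test(rows, test_rows=None, test_percent=None):
    """Split rows into (train, test) by row count or by percentage."""
    if (test_rows is None) == (test_percent is None):
        raise ValueError("give exactly one of test_rows or test_percent")
    if test_rows is None:
        test_rows = round(len(rows) * test_percent / 100)
    cut = len(rows) - test_rows
    return rows[:cut], rows[cut:]


if __name__ == "__main__":
    column = fill_missing_numeric([1.0, None, 3.0])
    print(standardize(column))
    print(split_train_test(list(range(10)), test_percent=20))
```

Taking the test rows from the end keeps the sketch deterministic; a real split may shuffle the rows first.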
Besides the general export options, which are set by selecting different delimiter options, you can export the data set into CNTK format, which is very handy if you work with CNTK.
After the data transformations, the user needs to check CNTK format in the export options and press Export in order to get CNTK training and testing files, which can be used directly in code without any modifications.
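As a rough illustration of what such a file looks like, the sketch below writes samples in the CNTK text format, where each line holds one sample and each stream is written as `|name` followed by space-separated values. The stream names `features` and `labels` and the function names are assumptions made for this example; the names the tool actually writes may differ.

```python
def to_ctf_line(features, label, feature_name="features", label_name="labels"):
    """Format one sample as a CNTK text format line.

    The stream names are assumptions for this sketch and must match
    whatever the reader on the CNTK side is configured to expect.
    """
    feature_text = " ".join(str(v) for v in features)
    label_text = " ".join(str(v) for v in label)
    return f"|{feature_name} {feature_text} |{label_name} {label_text}"


def to_ctf(samples):
    """Format (features, label) pairs as CTF text, one sample per line."""
    return "\n".join(to_ctf_line(f, l) for f, l in samples) + "\n"


if __name__ == "__main__":
    # Two samples with two numeric features and a one-hot encoded label.
    samples = [
        ([5.1, 3.5], [1, 0, 0]),
        ([6.2, 2.9], [0, 1, 0]),
    ]
    print(to_ctf(samples), end="")
```

The returned text can be written to the training and testing files with an ordinary `open(path, "w")` call.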
Some examples will be provided in the next blog post.
The project is hosted on GitHub, where the source code can be freely downloaded and used, at this location: .
In case you want only binaries, the release of version v1.0 is published here: