Naive Spot-Check of AI Algorithms

More Adventures in AI

A “Back to Basics” RESET

I found Jason Brownlee's work helpful for getting “Back to Basics” with AI.

Currently I am following his EXCELLENT Python Machine Learning Mini-Course.

In Lesson 9, he suggests that one should “Spot-Check Algorithms”.

I took a naive approach, using the code in the “Analysis Code” section below to generate the results in the “Analysis Grid” section below.

I am hoping to use this post to get information in the Comments Section about how to use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE.

Methodology

To generate the Algorithm Spot-Check Cases, I cross-multiplied:

  1. (3) Datasets
    1.1 Boston House Price Data
    1.2 Iris Data
    1.3 Pima Indians Diabetes Data
  2. (4) Models(Algorithms)
    2.1 KNeighborsRegressor
    2.2 LinearRegression
    2.3 LogisticRegression
    2.4 LinearDiscriminantAnalysis
  3. (3) Scorings
    3.1 accuracy
    3.2 neg_mean_squared_error
    3.3 neg_log_loss

for a total of 36 Spot-Check Cases.
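The cross multiplication above can be sketched with itertools.product. This is only an illustration of how the 36 cases come about: it uses the names as plain strings, not the actual model objects from the Analysis Code.

```python
import itertools

datasets = ["Boston House Price Data", "Iris Data", "Pima Indians Diabetes Data"]
models = ["KNeighborsRegressor", "LinearRegression",
          "LogisticRegression", "LinearDiscriminantAnalysis"]
scorings = ["accuracy", "neg_mean_squared_error", "neg_log_loss"]

# Cartesian product, in the same nesting order as the loops in the Analysis Code:
# dataset (outer), then model, then scoring (inner).
cases = list(itertools.product(datasets, models, scorings))

print(len(cases))   # 3 * 4 * 3 combinations
print(cases[0])
```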

I then ran each case against a 10-fold splitter,

    kfold = sklearn.model_selection.KFold(n_splits=10, random_state=7)

and scored it with

    results = sklearn.model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
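One thing worth checking about this setup: KFold without shuffle=True cuts the rows into consecutive blocks. The sketch below is pure Python, not scikit-learn itself, and it assumes the Iris file is in the standard UCI order (50 rows per class, sorted by class). Under that assumption most test folds contain a single class, which may be what is behind the “y_true contains only one label” error in the results.

```python
# Assumption: standard UCI iris.data ordering, 50 rows per class, sorted by class.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

n_splits = 10
fold_size = len(labels) // n_splits  # 150 rows divide evenly into folds of 15


def contiguous_test_folds(n_samples, size):
    """Yield (start, stop) index ranges, mimicking an unshuffled k-fold split."""
    for start in range(0, n_samples, size):
        yield start, start + size


# Which classes show up in each test fold?
classes_per_fold = [
    sorted(set(labels[start:stop]))
    for start, stop in contiguous_test_folds(len(labels), fold_size)
]

single_class_folds = sum(1 for classes in classes_per_fold if len(classes) == 1)
print(single_class_folds, "of", n_splits, "test folds contain a single class")
```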

Summary Analysis

Of the 36 Cases, 11 returned numerical results. The other 25 returned Errors, or Warnings that the Analysis Code turns into Errors via warnings.filterwarnings("error"). The full Analysis can be seen in the “Analysis Grid” section below.
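A tally like this can be made by parsing the “trying|...” lines that the Analysis Code prints. The sketch below is a hypothetical helper, not part of the original module, and it uses only four lines copied from the output as a sample.

```python
# Four lines copied from the recorded output, as a small sample.
sample_output = """\
trying|Boston House Price Data|KNeighborsRegressor|accuracy|Error: continuous is not supported
trying|Boston House Price Data|KNeighborsRegressor|neg_mean_squared_error|-107.28683898
trying|Iris Data|LogisticRegression|accuracy|0.88
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_log_loss|-0.485655330102
"""


def parse_line(line):
    """Split one output line; value is a float for a score, None for an error."""
    _, dataset, model, scoring, result = line.split("|", 4)
    try:
        value = float(result)
    except ValueError:
        value = None
    return dataset, model, scoring, value, result


rows = [parse_line(line) for line in sample_output.splitlines()
        if line.startswith("trying|")]
numeric = [row for row in rows if row[3] is not None]
errors = [row for row in rows if row[3] is None]

print(len(numeric), "numeric,", len(errors), "errors")
```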

Question Reiteration, Answer, Further Study

Question Reiteration

How can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?

I posted this question to Jason Brownlee in the Comments on his Python Machine Learning Mini-Course. The Answer below is his response to me there.

Answer

From Jason Brownlee in the Comments on Python Machine Learning Mini-Course:

Nice post and great question Joe.

Spot checking is to discover which algorithms look good on one given dataset. Not across datasets.

You may need to group algorithms by their expectations then prepare data for each group.

Most machine learning algorithms expect data to have numeric input values and an integer encoded or one hot encoded output value for classification. This is a good normalized view of a dataset to construct.

Here’s a tutorial that shows how to spot check 7 machine learning algorithms on one problem in Python, Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn.
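To make the “integer encoded or one hot encoded output value” idea concrete for myself, here is a minimal pure-Python sketch using a few Iris class labels. The variable names are my own. scikit-learn ships LabelEncoder and OneHotEncoder in sklearn.preprocessing for the same job, and those would be the natural choice in real code.

```python
# A handful of string class labels, as found in the last column of the Iris data.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# Integer encoding: map each distinct class name to 0, 1, 2, ...
classes = sorted(set(labels))
class_to_int = {name: index for index, name in enumerate(classes)}
integer_encoded = [class_to_int[name] for name in labels]

# One hot encoding: one column per class, with a single 1 marking the class.
one_hot_encoded = [
    [1 if index == code else 0 for index in range(len(classes))]
    for code in integer_encoded
]

print(integer_encoded)
print(one_hot_encoded)
```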

Further Study

From Jason’s Answer, above, I need to study:

  • Algorithm Groups
  • Algorithm Group Expectations
  • Data Preparation
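As a first stab at “Algorithm Groups” and their expectations, here is a rough rule of thumb written as code. Everything in it is my own assumption, inferred from the error messages in this post and not from scikit-learn documentation: regressors want a numeric target and a regression metric, while classifiers want class labels, and squared error on labels only works when the labels are numbers.

```python
import itertools

# Hand-written assumptions, inferred from the error messages in this post.
MODEL_KIND = {
    "KNeighborsRegressor": "regressor",
    "LinearRegression": "regressor",
    "LogisticRegression": "classifier",
    "LinearDiscriminantAnalysis": "classifier",
}
TARGET_KIND = {
    "Boston House Price Data": "continuous",
    "Iris Data": "string labels",
    "Pima Indians Diabetes Data": "numeric labels",
}
SCORINGS = ["accuracy", "neg_mean_squared_error", "neg_log_loss"]


def expected_to_run(dataset, model, scoring):
    """Guess whether a case should run WITHOUT any Data Preparation."""
    target = TARGET_KIND[dataset]
    if MODEL_KIND[model] == "regressor":
        # Regressors need a numeric target and a regression metric.
        return target != "string labels" and scoring == "neg_mean_squared_error"
    if target == "continuous":
        # Classifiers need class labels, not a continuous target.
        return False
    if scoring == "neg_mean_squared_error":
        # Squared error on predicted labels needs the labels to be numbers.
        return target == "numeric labels"
    return True  # accuracy and neg_log_loss on a classifier with class labels


expected = [case for case in itertools.product(TARGET_KIND, MODEL_KIND, SCORINGS)
            if expected_to_run(*case)]
print(len(expected), "cases expected to run")
```

This rule expects 14 cases to run, against the 11 that actually did. The three extra ones are all on Iris (LogisticRegression with neg_log_loss, and LinearDiscriminantAnalysis with accuracy and neg_log_loss), where the recorded errors look tied to the fold contents or to a Warning promoted to an Error, not to a type mismatch. That is a guess on my part.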

Analysis Code

#! python3
# -*- coding: utf-8 -*-
"""analyzeAlgorithmsMod.py module contains the AnalyzeAlgorithms class.
    It aids in the comparison of Machine Learning Algorithms(Models)
       for a particular dataset.
"""
import os, sys, shutil, time, datetime, urllib, tarfile, zipfile, csv, io, copy, itertools, six

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
import sklearn
import sklearn.preprocessing
import sklearn.linear_model

import sklearn.model_selection  # replaces sklearn.cross_validation (deprecated in 0.18, removed in 0.20)
import sklearn.neighbors        # needed for KNeighborsRegressor in test()

import sklearn.ensemble
import sklearn.metrics
import sklearn.discriminant_analysis
import textwrap
from pprint import pprint as pp

import warnings
warnings.filterwarnings("error")


def reportInModule(model, scoring, results):
    print("\n\n{model: '%s', scoring: '%s'}:\n    results summary: %.3f mean (%.3f) std" % (type(model).__name__, scoring, results.mean(), results.std(),))
    print("    sorted(results):")
    pp(sorted(results), indent=8)


class AnalyzeAlgorithms:
    """Aids in the comparison of Machine Learning Algorithms
       for a particular dataset.
           http://machinelearningmastery.com/python-machine-learning-mini-course/
    """

    @staticmethod
    def calcX_Y(dataFrame):
        array = dataFrame.values
        X = array[:, 0:dataFrame.shape[1]-1]
        Y = array[:, dataFrame.shape[1]-1]
        return X, Y

    def __init__(self, datasetInfoTupleList, modelList, scoringStrList, kfold):
        self.datasetInfoTupleList = datasetInfoTupleList # (datasetTitle, csvFilePath, delim_whitespace, columnNamesList)
        self.modelList = modelList
        self.scoringStrList = scoringStrList
        self.kfold = kfold

    def analyzeAlgorithms(self):

        for datasetInfoTuple in self.datasetInfoTupleList: # (datasetTitle, csvFilePath, delim_whitespace, columnNamesList)

            self.datasetTitle, self.csvFilePath, self.delim_whitespace, self.columnNamesList = datasetInfoTuple
            self.df = pd.read_csv(self.csvFilePath, delim_whitespace=self.delim_whitespace, names=self.columnNamesList)
            self.X, self.Y = AnalyzeAlgorithms.calcX_Y(self.df)

            for model in self.modelList:

                for scoring in self.scoringStrList:
                    print("trying|%s|%s|%s|"%(self.datasetTitle, type(model).__name__, scoring),end='')

                    results = self.genResults(model, scoring, self.X, self.Y, self.kfold)
                    self.report_short(datasetInfoTuple, model, scoring, results)

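    def genResults(self, model, scoring, X, Y, kfold):
        # Warnings are promoted to exceptions at module level, so they end up here as errors too.
        try:
            results = sklearn.model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
        except Exception as exc:  # not a bare except, so Ctrl-C still interrupts the run
            results = "Error: %s" % (exc,)
        return results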

    def report_short(self, datasetInfoTuple, model, scoring, results):
        # Print either the first line of the error message or the mean score.
        if isinstance(results, six.string_types):
            results_short = results.splitlines()[0]
        else:
            results_short = results.mean()
        print(results_short)

    def report(self, model, scoring, results):
        if isinstance(results, six.string_types):    # Error
            print("\n\n{model: '%s', scoring: '%s'}:\n    results ERROR: %s" % (type(model).__name__, scoring, results) )
        else:
            print("\n\n{model: '%s', scoring: '%s'}:\n    results summary: %.3f mean (%.3f) std" % (type(model).__name__, scoring, results.mean(), results.std(),))
            print("    sorted(results):")
            pp(sorted(results), indent=8)


def test():
    datasetInfoTupleList = [
        ('Boston House Price Data', # datasetTitle
             r'C:\BLA\BLA\BLA\data\BostonHousing\housing.data.txt', # csvFilePath
             True, # delim_whitespace for csv boolean
             ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'], # columnNames
         ),
        ('Iris Data',  # datasetTitle
             r'C:\BLA\BLA\BLA\data\iris\iris.data.txt', # csvFilePath
             False,  # delim_whitespace for csv boolean
             ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'class'],  # columnNames
        ),
        ('Pima Indians Diabetes Data',  # datasetTitle
             r'C:\BLA\BLA\BLA\data\PimaIndiansDiabetes\pima-indians-diabetes.data.txt',  # csvFilePath
             False,  # delim_whitespace for csv boolean
             ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'],  # columnNames
         ),

    ]

    modelList = [ sklearn.neighbors.KNeighborsRegressor(),
                  sklearn.linear_model.LinearRegression(),
                  sklearn.linear_model.LogisticRegression(),
                  sklearn.discriminant_analysis.LinearDiscriminantAnalysis(),
                  ]

    scoringStrList = [
                       'accuracy',
                       'neg_mean_squared_error',
                       'neg_log_loss',

                    ]
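    # Without shuffle=True the folds are consecutive blocks and random_state has no effect;
    # newer scikit-learn versions raise a ValueError for random_state without shuffle.
    kfold = sklearn.model_selection.KFold(n_splits=10, random_state=7)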

    analysis02 = AnalyzeAlgorithms(datasetInfoTupleList, modelList, scoringStrList, kfold)
    analysis02.analyzeAlgorithms()




def main():
    test()

if __name__ == '__main__':
    main()

output = r"""

C:\Python35\python.exe C:/BLA/BLA/MachineLearningMasteryPj/AnalyzeAlgorithmsPkg/analyzeAlgorithmsMod.py
C:\Python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
trying|Boston House Price Data|KNeighborsRegressor|accuracy|Error: continuous is not supported
trying|Boston House Price Data|KNeighborsRegressor|neg_mean_squared_error|-107.28683898
trying|Boston House Price Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Boston House Price Data|LinearRegression|accuracy|Error: continuous is not supported
trying|Boston House Price Data|LinearRegression|neg_mean_squared_error|-34.7052559445
trying|Boston House Price Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'
trying|Boston House Price Data|LogisticRegression|accuracy|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LogisticRegression|neg_mean_squared_error|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LogisticRegression|neg_log_loss|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LinearDiscriminantAnalysis|accuracy|Error: Unknown label type: (array([ 20.5,  25. ,  23.4,  18.9,  35.4,  24.7,  31.6,  23.3,  19.6,
trying|Boston House Price Data|LinearDiscriminantAnalysis|neg_mean_squared_error|Error: Unknown label type: (array([ 20.5,  25. ,  23.4,  18.9,  35.4,  24.7,  31.6,  23.3,  19.6,
trying|Boston House Price Data|LinearDiscriminantAnalysis|neg_log_loss|Error: Unknown label type: (array([ 20.5,  25. ,  23.4,  18.9,  35.4,  24.7,  31.6,  23.3,  19.6,
trying|Iris Data|KNeighborsRegressor|accuracy|Error: unsupported operand type(s) for /: 'str' and 'int'
trying|Iris Data|KNeighborsRegressor|neg_mean_squared_error|Error: unsupported operand type(s) for /: 'str' and 'int'
trying|Iris Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Iris Data|LinearRegression|accuracy|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LinearRegression|neg_mean_squared_error|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LinearRegression|neg_log_loss|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LogisticRegression|accuracy|0.88
trying|Iris Data|LogisticRegression|neg_mean_squared_error|Error: could not convert string to float: 'Iris-setosa'
trying|Iris Data|LogisticRegression|neg_log_loss|Error: y_true contains only one label (Iris-setosa). Please provide the true labels explicitly through the labels argument.
trying|Iris Data|LinearDiscriminantAnalysis|accuracy|Error: The priors do not sum to 1. Renormalizing
trying|Iris Data|LinearDiscriminantAnalysis|neg_mean_squared_error|Error: The priors do not sum to 1. Renormalizing
trying|Iris Data|LinearDiscriminantAnalysis|neg_log_loss|Error: The priors do not sum to 1. Renormalizing
trying|Pima Indians Diabetes Data|KNeighborsRegressor|accuracy|Error: Can't handle mix of binary and continuous
trying|Pima Indians Diabetes Data|KNeighborsRegressor|neg_mean_squared_error|-0.196342447027
trying|Pima Indians Diabetes Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Pima Indians Diabetes Data|LinearRegression|accuracy|Error: Can't handle mix of binary and continuous
trying|Pima Indians Diabetes Data|LinearRegression|neg_mean_squared_error|-0.162812506544
trying|Pima Indians Diabetes Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'
trying|Pima Indians Diabetes Data|LogisticRegression|accuracy|0.76951469583
trying|Pima Indians Diabetes Data|LogisticRegression|neg_mean_squared_error|-0.23048530417
trying|Pima Indians Diabetes Data|LogisticRegression|neg_log_loss|-0.492545522852
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|accuracy|0.773462064252
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_mean_squared_error|-0.226537935748
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_log_loss|-0.485655330102

Process finished with exit code 0


"""

Analysis Grid

Here’s the Analysis Grid copied from an Excel Spreadsheet.


| Dataset                    | Model                      | Scoring                | Result (Error or Mean of results)                                                                                               | Joe's Comment                                                                     |
|----------------------------|----------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| Boston House Price Data    | KNeighborsRegressor        | accuracy               | Error: continuous is not supported                                                                                              | Data Prep? (Boston, [KNeighborsRegressor, LinearRegression], accuracy)            |
| Boston House Price Data    | KNeighborsRegressor        | neg_mean_squared_error | -107.286839                                                                                                                     |                                                                                   |
| Boston House Price Data    | KNeighborsRegressor        | neg_log_loss           | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'                                                            | NG: (KNeighborsRegressor, neg_log_loss)                                           |
| Boston House Price Data    | LinearRegression           | accuracy               | Error: continuous is not supported                                                                                              | Data Prep? (Boston, [KNeighborsRegressor, LinearRegression], accuracy)            |
| Boston House Price Data    | LinearRegression           | neg_mean_squared_error | -34.70525594                                                                                                                    |                                                                                   |
| Boston House Price Data    | LinearRegression           | neg_log_loss           | Error: 'LinearRegression' object has no attribute 'predict_proba'                                                               | Data Prep? Or Model-Scoring?([Boston, Pima], LinearRegression,   neg_log_loss)    |
| Boston House Price Data    | LogisticRegression         | accuracy               | Error: Unknown label type: 'continuous'                                                                                         | Data Prep (Boston, LogisticRegression, *)                                         |
| Boston House Price Data    | LogisticRegression         | neg_mean_squared_error | Error: Unknown label type: 'continuous'                                                                                         | Data Prep (Boston, LogisticRegression, *)                                         |
| Boston House Price Data    | LogisticRegression         | neg_log_loss           | Error: Unknown label type: 'continuous'                                                                                         | Data Prep (Boston, LogisticRegression, *)                                         |
| Boston House Price Data    | LinearDiscriminantAnalysis | accuracy               | Error: Unknown label type: (array([ 20.5,  25. ,  23.4,  18.9,  35.4,  24.7,  31.6,  23.3,  19.6,                               | Data Prep (Boston, LinearDiscriminantAnalysis, *)                                 |
| Boston House Price Data    | LinearDiscriminantAnalysis | neg_mean_squared_error | Error: Unknown label type: (array([ 20.5,  25. ,  23.4,  18.9,  35.4,  24.7,  31.6,  23.3,  19.6,                               | Data Prep (Boston, LinearDiscriminantAnalysis, *)                                 |
| Boston House Price Data    | LinearDiscriminantAnalysis | neg_log_loss           | Error: Unknown label type: (array([ 20.5,  25. ,  23.4,  18.9,  35.4,  24.7,  31.6,  23.3,  19.6,                               | Data Prep (Boston, LinearDiscriminantAnalysis, *)                                 |
| Iris Data                  | KNeighborsRegressor        | accuracy               | Error: unsupported operand type(s) for /: 'str' and 'int'                                                                       | Data Prep: (Iris Data, KNeighborsRegressor,  [accuracy, neg_mean_squared_error] ) |
| Iris Data                  | KNeighborsRegressor        | neg_mean_squared_error | Error: unsupported operand type(s) for /: 'str' and 'int'                                                                       | Data Prep: (Iris Data, KNeighborsRegressor,  [accuracy, neg_mean_squared_error] ) |
| Iris Data                  | KNeighborsRegressor        | neg_log_loss           | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'                                                            | NG: (KNeighborsRegressor, neg_log_loss)                                           |
| Iris Data                  | LinearRegression           | accuracy               | Error: could not convert string to float: 'Iris-virginica'                                                                      | Data Prep: (Iris Data, LinearRegression, *)                                       |
| Iris Data                  | LinearRegression           | neg_mean_squared_error | Error: could not convert string to float: 'Iris-virginica'                                                                      | Data Prep: (Iris Data, LinearRegression, *)                                       |
| Iris Data                  | LinearRegression           | neg_log_loss           | Error: could not convert string to float: 'Iris-virginica'                                                                      | Data Prep: (Iris Data, LinearRegression, *)                                       |
| Iris Data                  | LogisticRegression         | accuracy               | 0.88                                                                                                                            |                                                                                   |
| Iris Data                  | LogisticRegression         | neg_mean_squared_error | Error: could not convert string to float: 'Iris-setosa'                                                                         |                                                                                   |
| Iris Data                  | LogisticRegression         | neg_log_loss           | Error: y_true contains only one label (Iris-setosa). Please provide the true labels explicitly through the labels argument.     |                                                                                   |
| Iris Data                  | LinearDiscriminantAnalysis | accuracy               | Error: The priors do not sum to 1. Renormalizing                                                                                |                                                                                   |
| Iris Data                  | LinearDiscriminantAnalysis | neg_mean_squared_error | Error: The priors do not sum to 1. Renormalizing                                                                                |                                                                                   |
| Iris Data                  | LinearDiscriminantAnalysis | neg_log_loss           | Error: The priors do not sum to 1. Renormalizing                                                                                |                                                                                   |
| Pima Indians Diabetes Data | KNeighborsRegressor        | accuracy               | Error: Can't handle mix of binary and continuous                                                                                | Data Prep? (Pima, [KNeighborsRegressor, LinearRegression], accuracy)              |
| Pima Indians Diabetes Data | KNeighborsRegressor        | neg_mean_squared_error | -0.196342447                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | KNeighborsRegressor        | neg_log_loss           | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'                                                            | NG: (KNeighborsRegressor, neg_log_loss)                                           |
| Pima Indians Diabetes Data | LinearRegression           | accuracy               | Error: Can't handle mix of binary and continuous                                                                                | Data Prep? (Pima, [KNeighborsRegressor, LinearRegression], accuracy)              |
| Pima Indians Diabetes Data | LinearRegression           | neg_mean_squared_error | -0.162812507                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LinearRegression           | neg_log_loss           | Error: 'LinearRegression' object has no attribute 'predict_proba'                                                               | Data Prep? Or Model-Scoring?([Boston, Pima], LinearRegression,   neg_log_loss)    |
| Pima Indians Diabetes Data | LogisticRegression         | accuracy               | 0.769514696                                                                                                                     |                                                                                   |
| Pima Indians Diabetes Data | LogisticRegression         | neg_mean_squared_error | -0.230485304                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LogisticRegression         | neg_log_loss           | -0.492545523                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | accuracy               | 0.773462064                                                                                                                     |                                                                                   |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | neg_mean_squared_error | -0.226537936                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | neg_log_loss           | -0.48565533                                                                                                                     |                                                                                   |