Naive Spot-Check of AI Algorithms

More Adventures in AI

A “Back to Basics” RESET

I found the work of Jason Brownlee to help me get “Back to Basics” with AI.

Currently I am following his EXCELLENT Python Machine Learning Mini-Course.

In Lesson 9, he suggests that one should “Spot-Check Algorithms”.

I took a naive approach, using the code in the “Analysis Code” section below to generate the results in the “Analysis Grid” section below.

I am hoping to use this post to gather information in the Comments Section about how to use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, and about which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE.

Methodology

To generate the Algorithm Spot-Check Cases, I took the cross product of:

  1. (3) Datasets
    1.1 Boston House Price Data
    1.2 Iris Data
    1.3 Pima Indians Diabetes Data
  2. (4) Models(Algorithms)
    2.1 KNeighborsRegressor
    2.2 LinearRegression
    2.3 LogisticRegression
    2.4 LinearDiscriminantAnalysis
  3. (3) Scorings
    3.1 accuracy
    3.2 neg_mean_squared_error
    3.3 neg_log_loss

for a total of 36 Spot-Check Cases.
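
The cross product above can be sketched with itertools.product. This is only an illustration: plain strings copied from the lists stand in for the real dataset and model objects.

```python
import itertools

# Names copied from the lists above; plain strings stand in for real objects.
datasets = ["Boston House Price Data", "Iris Data", "Pima Indians Diabetes Data"]
models = ["KNeighborsRegressor", "LinearRegression",
          "LogisticRegression", "LinearDiscriminantAnalysis"]
scorings = ["accuracy", "neg_mean_squared_error", "neg_log_loss"]

# Every (dataset, model, scoring) triple is one Spot-Check Case.
cases = list(itertools.product(datasets, models, scorings))

print(len(cases))
print(cases[0])
```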

I then ran each case with 10-fold cross-validation, using
kfold = sklearn.model_selection.KFold(n_splits=10, random_state=7)
and
results = sklearn.model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
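
Here is a minimal, self-contained sketch of one such run. Two things in it are assumptions and not part of my actual analysis: it uses a synthetic dataset from make_regression so that no CSV file is needed, and it passes shuffle=True to KFold, because recent scikit-learn releases reject random_state unless shuffling is enabled (the older release I ran allowed it).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features.
X, Y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

# shuffle=True is needed for random_state to be accepted in recent releases.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

model = LinearRegression()
results = cross_val_score(model, X, Y, cv=kfold, scoring="neg_mean_squared_error")

# One score per fold; neg_mean_squared_error is never positive.
print(len(results), results.mean(), results.std())
```

Note that shuffling changes which rows land in which fold, so scores from this sketch are not comparable with the numbers in my Analysis Grid.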

Summary Analysis

Of the 36 Cases, 11 returned numerical results. The other 25 raised Errors or Warnings (the code turns Warnings into Errors via warnings.filterwarnings("error")). The full Analysis can be seen in the “Analysis Grid” section below.
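
A tally like this can be sketched by parsing the "trying|...|" lines that the Analysis Code prints. The sketch below uses only four lines copied from the run output, and the helper name is_numeric is my own, so treat it as an illustration of the counting and not as the code that produced the numbers above.

```python
# A few lines copied from the run output shown in this post.
log_lines = [
    "trying|Boston House Price Data|KNeighborsRegressor|accuracy|Error: continuous is not supported",
    "trying|Boston House Price Data|KNeighborsRegressor|neg_mean_squared_error|-107.28683898",
    "trying|Iris Data|LogisticRegression|accuracy|0.88",
    "trying|Pima Indians Diabetes Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'",
]


def is_numeric(text):
    """Return True if the result field parses as a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


numeric, failed = [], []
for line in log_lines:
    # Fields: marker, dataset, model, scoring, result.
    _, dataset, model, scoring, result = line.split("|", 4)
    (numeric if is_numeric(result) else failed).append((dataset, model, scoring))

print(len(numeric), "numeric,", len(failed), "errors or warnings")
```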

Question Reiteration, Answer, Further Study

Question Reiteration

How can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, and which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?

I posted this question to Jason Brownlee in the Comments on Python Machine Learning Mini-Course. The Answer is his response to me there.

Answer

From Jason Brownlee in the Comments on Python Machine Learning Mini-Course:

Nice post and great question Joe.

Spot checking is to discover which algorithms look good on one given dataset. Not across datasets.

You may need to group algorithms by their expectations then prepare data for each group.

Most machine learning algorithms expect data to have numeric input values and an integer encoded or one hot encoded output value for classification. This is a good normalized view of a dataset to construct.

Here’s a tutorial that shows how to spot check 7 machine learning algorithms on one problem in Python, Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn.
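
To make Jason's first point concrete for myself, here is a minimal sketch of spot-checking several algorithms on ONE dataset. It is not taken from his tutorial. The synthetic data from make_classification, the choice of KNeighborsClassifier (the classifier counterpart of the regressor I used), and max_iter=1000 are all my assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption), standing in for one dataset.
X, Y = make_classification(n_samples=300, n_features=8, random_state=7)

# Only classifiers, all scored with the same classification metric.
models = [
    LogisticRegression(max_iter=1000),
    LinearDiscriminantAnalysis(),
    KNeighborsClassifier(),
]

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for model in models:
    results = cross_val_score(model, X, Y, cv=kfold, scoring="accuracy")
    scores[type(model).__name__] = results.mean()

# Print the models from best to worst mean accuracy.
for name, mean_accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print("%-28s %.3f" % (name, mean_accuracy))
```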

Further Study

From Jason’s Answer above, I need to study:

  • Algorithm Groups
  • Algorithm Group Expectations
  • Data Preparation
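
As a first step on Algorithm Groups, the four models from my Methodology can be sorted with scikit-learn's is_classifier and is_regressor helpers. The mapping from each group to scorings (scorings_by_group) is my own guess based on the errors in the Analysis Grid, not something Jason or the scikit-learn documentation states in this form.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor

models = [KNeighborsRegressor(), LinearRegression(),
          LogisticRegression(), LinearDiscriminantAnalysis()]

# Hypothetical pairing of each group with the scorings that seem to suit it.
scorings_by_group = {
    "classifier": ["accuracy", "neg_log_loss"],
    "regressor": ["neg_mean_squared_error"],
}

groups = {"classifier": [], "regressor": []}
for model in models:
    if is_classifier(model):
        groups["classifier"].append(type(model).__name__)
    elif is_regressor(model):
        groups["regressor"].append(type(model).__name__)

for group, names in groups.items():
    print(group, names, scorings_by_group[group])
```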

Analysis Code

#! python3
# -*- coding: utf-8 -*-
"""analyzeAlgorithmsMod.py module contains  AnalyzeAlgorithms class.
    It aids in the comparison of Machine Learning Algorithms(Models)
       for a particular dataset.
"""
import os, sys, shutil, time, datetime, urllib, tarfile, zipfile, csv, io, copy, itertools, six

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
import sklearn
import sklearn.preprocessing
import sklearn.linear_model

# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# sklearn.model_selection is its replacement.
import sklearn.model_selection
import sklearn.neighbors

import sklearn.ensemble
import sklearn.metrics
import sklearn.discriminant_analysis
import textwrap
from pprint import pprint as pp

import warnings
warnings.filterwarnings("error")


def reportInModule(model, scoring, results):
    print("\n\n{model: '%s', scoring: '%s'}:\n    results summary: %.3f mean (%.3f) std" % (type(model).__name__, scoring, results.mean(), results.std(),))
    print("    sorted(results):")
    pp(sorted(results), indent=8)


class AnalyzeAlgorithms:
    """Aids in the comparison of Machine Learning Algorithms
       for a particular dataset.

       Python Machine Learning Mini-Course
    """

    @staticmethod
    def calcX_Y(dataFrame):
        array = dataFrame.values
        X = array[:, 0:dataFrame.shape[1]-1]
        Y = array[:, dataFrame.shape[1]-1]
        return X, Y

    def __init__(self, datasetInfoTupleList, modelList, scoringStrList, kfold):
        self.datasetInfoTupleList = datasetInfoTupleList  # (datasetTitle, csvFilePath, delim_whitespace, columnNamesList)
        self.modelList = modelList
        self.scoringStrList = scoringStrList
        self.kfold = kfold

    def analyzeAlgorithms(self):
        for datasetInfoTuple in self.datasetInfoTupleList:
            # (datasetTitle, csvFilePath, delim_whitespace, columnNamesList)
            self.datasetTitle, self.csvFilePath, self.delim_whitespace, self.columnNamesList = datasetInfoTuple
            self.df = pd.read_csv(self.csvFilePath, delim_whitespace=self.delim_whitespace, names=self.columnNamesList)
            self.X, self.Y = AnalyzeAlgorithms.calcX_Y(self.df)
            for model in self.modelList:
                for scoring in self.scoringStrList:
                    print("trying|%s|%s|%s|"%(self.datasetTitle, type(model).__name__, scoring),end='')
                    results = self.genResults(model, scoring, self.X, self.Y, self.kfold)
                    self.report_short(datasetInfoTuple, model, scoring, results)

    def genResults(self, model, scoring, X, Y, kfold):
        try:
            results = sklearn.model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
        except:
            results = "Error: %s"%( sys.exc_info()[1] )
        return results

    def report_short(self, datasetInfoTuple, model, scoring, results):
        datasetTitle = datasetInfoTuple[0]
        if isinstance(results, six.string_types):
            results_short = results.splitlines()[0]
            print(results_short)
        else:
            results_short = results.mean()
            print(results_short)

    def report(self, model, scoring, results):
        if isinstance(results, six.string_types):  # Error
            print("\n\n{model: '%s', scoring: '%s'}:\n    results ERROR: %s" % (type(model).__name__, scoring, results) )
        else:
            print("\n\n{model: '%s', scoring: '%s'}:\n    results summary: %.3f mean (%.3f) std" % (type(model).__name__, scoring, results.mean(), results.std(),))
            print("    sorted(results):")
            pp(sorted(results), indent=8)


def test():
    datasetInfoTupleList = [
        ('Boston House Price Data',  # datasetTitle
         r'C:\BLA\BLA\BLA\data\BostonHousing\housing.data.txt',  # csvFilePath
         True,  # delim_whitespace for csv boolean
         ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'],  # columnNames
         ),
        ('Iris Data',  # datasetTitle
         r'C:\BLA\BLA\BLA\data\iris\iris.data.txt',  # csvFilePath
         False,  # delim_whitespace for csv boolean
         ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'class'],  # columnNames
         ),
        ('Pima Indians Diabetes Data',  # datasetTitle
         r'C:\BLA\BLA\BLA\data\PimaIndiansDiabetes\pima-indians-diabetes.data.txt',  # csvFilePath
         False,  # delim_whitespace for csv boolean
         ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'],  # columnNames
         ),
    ]
    modelList = [
        sklearn.neighbors.KNeighborsRegressor(),
        sklearn.linear_model.LinearRegression(),
        sklearn.linear_model.LogisticRegression(),
        sklearn.discriminant_analysis.LinearDiscriminantAnalysis(),
    ]
    scoringStrList = [
        'accuracy',
        'neg_mean_squared_error',
        'neg_log_loss',
    ]
    kfold = sklearn.model_selection.KFold(n_splits=10, random_state=7)
    analysis02 = AnalyzeAlgorithms(datasetInfoTupleList, modelList, scoringStrList, kfold)
    analysis02.analyzeAlgorithms()


def main():
    test()


if __name__ == '__main__':
    main()

output = """
C:\Python35\python.exe C:/BLA/BLA/MachineLearningMasteryPj/AnalyzeAlgorithmsPkg/analyzeAlgorithmsMod.py
C:\Python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
trying|Boston House Price Data|KNeighborsRegressor|accuracy|Error: continuous is not supported
trying|Boston House Price Data|KNeighborsRegressor|neg_mean_squared_error|-107.28683898
trying|Boston House Price Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Boston House Price Data|LinearRegression|accuracy|Error: continuous is not supported
trying|Boston House Price Data|LinearRegression|neg_mean_squared_error|-34.7052559445
trying|Boston House Price Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'
trying|Boston House Price Data|LogisticRegression|accuracy|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LogisticRegression|neg_mean_squared_error|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LogisticRegression|neg_log_loss|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LinearDiscriminantAnalysis|accuracy|Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6,
trying|Boston House Price Data|LinearDiscriminantAnalysis|neg_mean_squared_error|Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6,
trying|Boston House Price Data|LinearDiscriminantAnalysis|neg_log_loss|Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6,
trying|Iris Data|KNeighborsRegressor|accuracy|Error: unsupported operand type(s) for /: 'str' and 'int'
trying|Iris Data|KNeighborsRegressor|neg_mean_squared_error|Error: unsupported operand type(s) for /: 'str' and 'int'
trying|Iris Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Iris Data|LinearRegression|accuracy|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LinearRegression|neg_mean_squared_error|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LinearRegression|neg_log_loss|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LogisticRegression|accuracy|0.88
trying|Iris Data|LogisticRegression|neg_mean_squared_error|Error: could not convert string to float: 'Iris-setosa'
trying|Iris Data|LogisticRegression|neg_log_loss|Error: y_true contains only one label (Iris-setosa). Please provide the true labels explicitly through the labels argument.
trying|Iris Data|LinearDiscriminantAnalysis|accuracy|Error: The priors do not sum to 1. Renormalizing
trying|Iris Data|LinearDiscriminantAnalysis|neg_mean_squared_error|Error: The priors do not sum to 1. Renormalizing
trying|Iris Data|LinearDiscriminantAnalysis|neg_log_loss|Error: The priors do not sum to 1. Renormalizing
trying|Pima Indians Diabetes Data|KNeighborsRegressor|accuracy|Error: Can't handle mix of binary and continuous
trying|Pima Indians Diabetes Data|KNeighborsRegressor|neg_mean_squared_error|-0.196342447027
trying|Pima Indians Diabetes Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Pima Indians Diabetes Data|LinearRegression|accuracy|Error: Can't handle mix of binary and continuous
trying|Pima Indians Diabetes Data|LinearRegression|neg_mean_squared_error|-0.162812506544
trying|Pima Indians Diabetes Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'
trying|Pima Indians Diabetes Data|LogisticRegression|accuracy|0.76951469583
trying|Pima Indians Diabetes Data|LogisticRegression|neg_mean_squared_error|-0.23048530417
trying|Pima Indians Diabetes Data|LogisticRegression|neg_log_loss|-0.492545522852
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|accuracy|0.773462064252
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_mean_squared_error|-0.226537935748
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_log_loss|-0.485655330102

Process finished with exit code 0
"""
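
Several of the messages in this output mention a label type ('continuous', 'binary'). One way to look at how scikit-learn categorizes a target column is sklearn.utils.multiclass.type_of_target. The sketch below uses small hand-made arrays shaped like the last column of each dataset (an assumption, they are not the real data), and I am only guessing that checks of this kind are behind the errors above.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Small hand-made targets shaped like the three datasets' last columns.
boston_like = np.array([20.5, 25.0, 23.4, 18.9])  # house prices
iris_like = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
pima_like = np.array([0.0, 1.0, 1.0, 0.0])  # 0/1 class stored as floats

for name, y in [("boston_like", boston_like),
                ("iris_like", iris_like),
                ("pima_like", pima_like)]:
    print(name, type_of_target(y))
```

If a target comes back as 'continuous', classification models and classification scorings are unlikely to accept it without some Data Preparation.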

Analysis Grid

Here’s the Analysis Grid copied from an Excel Spreadsheet.


| Dataset                    | Model                      | Scoring                | Result (Error or Mean of results)                                                                                               | Joe's Comment                                                                     |
|----------------------------|----------------------------|------------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| Boston House Price Data    | KNeighborsRegressor        | accuracy               | Error: continuous is not supported                                                                                              | Data Prep? (Boston, [KNeighborsRegressor, LinearRegression], accuracy)            |
| Boston House Price Data    | KNeighborsRegressor        | neg_mean_squared_error | -107.286839                                                                                                                     |                                                                                   |
| Boston House Price Data    | KNeighborsRegressor        | neg_log_loss           | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'                                                            | NG: (KNeighborsRegressor, neg_log_loss)                                           |
| Boston House Price Data    | LinearRegression           | accuracy               | Error: continuous is not supported                                                                                              | Data Prep? (Boston, [KNeighborsRegressor, LinearRegression], accuracy)            |
| Boston House Price Data    | LinearRegression           | neg_mean_squared_error | -34.70525594                                                                                                                    |                                                                                   |
| Boston House Price Data    | LinearRegression           | neg_log_loss           | Error: 'LinearRegression' object has no attribute 'predict_proba'                                                               | Data Prep? Or Model-Scoring?([Boston, Pima], LinearRegression,   neg_log_loss)    |
| Boston House Price Data    | LogisticRegression         | accuracy               | Error: Unknown label type: 'continuous'                                                                                         | Data Prep (Boston, LogisticRegression, *)                                         |
| Boston House Price Data    | LogisticRegression         | neg_mean_squared_error | Error: Unknown label type: 'continuous'                                                                                         | Data Prep (Boston, LogisticRegression, *)                                         |
| Boston House Price Data    | LogisticRegression         | neg_log_loss           | Error: Unknown label type: 'continuous'                                                                                         | Data Prep (Boston, LogisticRegression, *)                                         |
| Boston House Price Data    | LinearDiscriminantAnalysis | accuracy               | Error: Unknown label type: (array([ 20.5,    25. ,  23.4,  18.9,    35.4,  24.7,  31.6,    23.3,  19.6,                         | Data Prep (Boston, LinearDiscriminantAnalysis *)                                  |
| Boston House Price Data    | LinearDiscriminantAnalysis | neg_mean_squared_error | Error: Unknown label type: (array([ 20.5,    25. ,  23.4,  18.9,    35.4,  24.7,  31.6,    23.3,  19.6,                         | Data Prep (Boston, LinearDiscriminantAnalysis *)                                  |
| Boston House Price Data    | LinearDiscriminantAnalysis | neg_log_loss           | Error: Unknown label type: (array([ 20.5,    25. ,  23.4,  18.9,    35.4,  24.7,  31.6,    23.3,  19.6,                         | Data Prep (Boston, LinearDiscriminantAnalysis *)                                  |
| Iris Data                  | KNeighborsRegressor        | accuracy               | Error: unsupported operand type(s) for /: 'str' and 'int'                                                                       | Data Prep: (Iris Data, KNeighborsRegressor,  [accuracy, neg_mean_squared_error] ) |
| Iris Data                  | KNeighborsRegressor        | neg_mean_squared_error | Error: unsupported operand type(s) for /: 'str' and 'int'                                                                       | Data Prep: (Iris Data, KNeighborsRegressor,  [accuracy, neg_mean_squared_error] ) |
| Iris Data                  | KNeighborsRegressor        | neg_log_loss           | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'                                                            | NG: (KNeighborsRegressor, neg_log_loss)                                           |
| Iris Data                  | LinearRegression           | accuracy               | Error: could not convert string to float: 'Iris-virginica'                                                                      | Data Prep: (Iris Data, LinearRegression, *)                                       |
| Iris Data                  | LinearRegression           | neg_mean_squared_error | Error: could not convert string to float: 'Iris-virginica'                                                                      | Data Prep: (Iris Data, LinearRegression, *)                                       |
| Iris Data                  | LinearRegression           | neg_log_loss           | Error: could not convert string to float: 'Iris-virginica'                                                                      | Data Prep: (Iris Data, LinearRegression, *)                                       |
| Iris Data                  | LogisticRegression         | accuracy               | 0.88                                                                                                                            |                                                                                   |
| Iris Data                  | LogisticRegression         | neg_mean_squared_error | Error: could not convert string to float: 'Iris-setosa'                                                                         |                                                                                   |
| Iris Data                  | LogisticRegression         | neg_log_loss           | Error: y_true contains only one label (Iris-setosa). Please provide the true labels explicitly through the labels argument.     |                                                                                   |
| Iris Data                  | LinearDiscriminantAnalysis | accuracy               | Error: The priors do not sum to 1. Renormalizing                                                                                |                                                                                   |
| Iris Data                  | LinearDiscriminantAnalysis | neg_mean_squared_error | Error: The priors do not sum to 1. Renormalizing                                                                                |                                                                                   |
| Iris Data                  | LinearDiscriminantAnalysis | neg_log_loss           | Error: The priors do not sum to 1. Renormalizing                                                                                |                                                                                   |
| Pima Indians Diabetes Data | KNeighborsRegressor        | accuracy               | Error: Can't handle mix of binary and continuous                                                                                | Data Prep? (Pima, [KNeighborsRegressor, LinearRegression], accuracy)              |
| Pima Indians Diabetes Data | KNeighborsRegressor        | neg_mean_squared_error | -0.196342447                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | KNeighborsRegressor        | neg_log_loss           | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'                                                            | NG: (KNeighborsRegressor, neg_log_loss)                                           |
| Pima Indians Diabetes Data | LinearRegression           | accuracy               | Error: Can't handle mix of binary and continuous                                                                                | Data Prep? (Pima, [KNeighborsRegressor, LinearRegression], accuracy)              |
| Pima Indians Diabetes Data | LinearRegression           | neg_mean_squared_error | -0.162812507                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LinearRegression           | neg_log_loss           | Error: 'LinearRegression' object has no attribute 'predict_proba'                                                               | Data Prep? Or Model-Scoring?([Boston, Pima], LinearRegression,   neg_log_loss)    |
| Pima Indians Diabetes Data | LogisticRegression         | accuracy               | 0.769514696                                                                                                                     |                                                                                   |
| Pima Indians Diabetes Data | LogisticRegression         | neg_mean_squared_error | -0.230485304                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LogisticRegression         | neg_log_loss           | -0.492545523                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | accuracy               | 0.773462064                                                                                                                     |                                                                                   |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | neg_mean_squared_error | -0.226537936                                                                                                                    |                                                                                   |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | neg_log_loss           | -0.48565533                                                                                                                     |                                                                                   |
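
One plausible reading of the "y_true contains only one label (Iris-setosa)" row: KFold without shuffling cuts the rows into contiguous blocks, so if the Iris file is sorted by class (an assumption about my CSV), some test folds hold a single class. The sketch below counts the distinct labels per test fold on made-up sorted labels and compares with StratifiedKFold, which I did NOT use in the analysis and swap in here only for contrast.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Iris-like labels (assumption): 150 rows sorted by class, 50 rows per class.
y = np.repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50)
X = np.zeros((150, 1))  # placeholder features; only the split matters here


def classes_per_test_fold(cv):
    """Count the distinct labels in each test fold produced by cv."""
    return [len(np.unique(y[test_idx])) for _, test_idx in cv.split(X, y)]


plain = classes_per_test_fold(KFold(n_splits=10))
stratified = classes_per_test_fold(StratifiedKFold(n_splits=10))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

If the plain KFold list contains folds with a single class, a scoring such as neg_log_loss has nothing to compare against in those folds, which would match the error in the grid.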
