More Adventures in AI
A “Back to Basics” RESET
Jason Brownlee's work has helped me get “Back to Basics” with AI.
I am currently following his EXCELLENT Python Machine Learning Mini-Course.
In Lesson 9, he suggests that one should “Spot-Check Algorithms”.
I took a naive approach: the code in the “Analysis Code” section below generated the results in the “Analysis Grid” section below.
I am hoping to use this post to gather information in the Comments Section about how to use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE.
Methodology
To generate the Algorithm Spot-Check Cases I cross multiplied:
- (3) Datasets
  1.1 Boston House Price Data
  1.2 Iris Data
  1.3 Pima Indians Diabetes Data
- (4) Models(Algorithms)
  2.1 KNeighborsRegressor
  2.2 LinearRegression
  2.3 LogisticRegression
  2.4 LinearDiscriminantAnalysis
- (3) Scorings
  3.1 accuracy
  3.2 neg_mean_squared_error
  3.3 neg_log_loss
for a total of 36 Spot-Check Cases.
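The cross multiplication above can be sketched with `itertools.product`. This is only an illustration, not the code that produced the results: the list names `datasets`, `models` and `scorings` are placeholders holding plain strings rather than real data or estimator objects.

```python
import itertools

# Placeholder names only; the real run used CSV files and estimator instances.
datasets = ["Boston House Price Data", "Iris Data", "Pima Indians Diabetes Data"]
models = ["KNeighborsRegressor", "LinearRegression",
          "LogisticRegression", "LinearDiscriminantAnalysis"]
scorings = ["accuracy", "neg_mean_squared_error", "neg_log_loss"]

# Every (dataset, model, scoring) triple: dataset varies slowest,
# scoring varies fastest, like three nested for loops.
cases = list(itertools.product(datasets, models, scorings))

print(len(cases))   # 3 * 4 * 3 = 36
print(cases[0])     # the first case that gets tried
```

The ordering matches three nested loops over datasets, then models, then scorings, which is also how the Analysis Code walks through the cases.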
I then ran each case through 10-fold cross-validation, using the splitter
kfold = sklearn.model_selection.KFold(n_splits=10, random_state=7)
and scoring each case with
results = sklearn.model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
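One thing worth checking with this splitter: without `shuffle=True`, KFold cuts the rows into consecutive blocks. The sketch below mimics that layout in plain Python. The helper `contiguous_folds` is hypothetical (it is not a scikit-learn function), and the label list assumes the Iris file is in its usual class-sorted order, 50 rows per class.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (start, stop) ranges for consecutive, unshuffled test folds.

    Hypothetical helper: mimics an unshuffled K-fold layout, where the
    first n_samples % n_splits folds are one row longer than the rest.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop

# Assumption: rows sorted by class, 50 per class, as in the usual Iris file.
labels = (["Iris-setosa"] * 50
          + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)

# Which classes show up in each of the 10 test folds?
fold_classes = [sorted(set(labels[a:b]))
                for a, b in contiguous_folds(len(labels), 10)]
for i, classes in enumerate(fold_classes):
    print(i, classes)

# Folds whose test rows all belong to one class.
single_class_folds = [i for i, c in enumerate(fold_classes) if len(c) == 1]
print(single_class_folds)
```

Under that assumption, most test folds contain a single class, which may be what is behind the `y_true contains only one label (Iris-setosa)` error in the output. Passing `shuffle=True` to KFold (which is also what gives `random_state=7` an effect) or using StratifiedKFold are the usual ways to avoid this.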
Summary Analysis
There were 11 out of 36 Cases that returned numerical results. The other 25 Cases returned Errors or Warnings. The full Analysis can be seen in the “Analysis Grid” section below.
Question Reiteration, Answer, Further Study
Question Reiteration
How can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?
I posted this question to Jason Brownlee in the Comments on Python Machine Learning Mini-Course. The Answer is his response to me there.
Answer
From Jason Brownlee in the Comments on Python Machine Learning Mini-Course:
Nice post and great question Joe.
Spot checking is to discover which algorithms look good on one given dataset. Not across datasets.
You may need to group algorithms by their expectations then prepare data for each group.
Most machine learning algorithms expect data to have numeric input values and an integer encoded or one hot encoded output value for classification. This is a good normalized view of a dataset to construct.
Here’s a tutorial that shows how to spot check 7 machine learning algorithms on one problem in Python, Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn.
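To make the encoding point in Jason's answer concrete, here is a minimal plain-Python sketch of integer encoding and one hot encoding a string class column such as the Iris labels. The function names `integer_encode` and `one_hot_encode` are my own placeholders; in practice scikit-learn's `LabelEncoder` in `sklearn.preprocessing` does the integer step.

```python
def integer_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

def one_hot_encode(codes, n_classes):
    """Turn integer codes into rows with a single 1 in the code's column."""
    return [[1 if col == code else 0 for col in range(n_classes)]
            for code in codes]

# A few made-up rows of a string class column.
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

codes, classes = integer_encode(labels)
one_hot = one_hot_encode(codes, len(classes))

print(classes)   # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(codes)     # [0, 2, 1, 0]
print(one_hot)   # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Encoding the class column this way is the kind of Data Preparation that would remove the `could not convert string to float` errors for Iris, although it does not by itself make a regressor a sensible choice for a classification problem.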
Further Study
From Jason’s Answer, above, I need to study:
- Algorithm Groups
- Algorithm Group Expectations
- Data Preparation
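As a first pass at Algorithm Groups and their expectations, the sketch below tags each dataset, model and scoring as either regression or classification and keeps only the cases where all three agree. The groupings are my assumptions about what each estimator and metric is designed for, and the names `dataset_task`, `model_task`, `scoring_task` and `is_sensible` are placeholders.

```python
import itertools

# Assumed groupings; each entry is tagged with the task it is meant for.
dataset_task = {
    "Boston House Price Data": "regression",         # continuous target (MEDV)
    "Iris Data": "classification",                   # three string class labels
    "Pima Indians Diabetes Data": "classification",  # binary 0/1 class
}
model_task = {
    "KNeighborsRegressor": "regression",
    "LinearRegression": "regression",
    "LogisticRegression": "classification",
    "LinearDiscriminantAnalysis": "classification",
}
scoring_task = {
    "accuracy": "classification",
    "neg_mean_squared_error": "regression",
    "neg_log_loss": "classification",  # also needs predict_proba on the model
}

def is_sensible(dataset, model, scoring):
    """True when dataset, model and scoring all belong to the same task."""
    task = dataset_task[dataset]
    return model_task[model] == task and scoring_task[scoring] == task

sensible = [case
            for case in itertools.product(dataset_task, model_task, scoring_task)
            if is_sensible(*case)]

print(len(sensible))
for case in sensible:
    print(case)
```

Under these assumptions only 10 of the 36 cases make sense in principle. That is not the same set as the 11 that returned numbers: four mismatched Pima cases still ran (neg_mean_squared_error happily scores 0/1 predictions), while three sensible Iris cases failed for other reasons.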
Analysis Code
#! python3
# -*- coding: utf-8 -*-
"""analyzeAlgorithmsMod.py module contains AnalyzeAlgorithms class.
It aids in the comparison of Machine Learning Algorithms(Models) for a particular dataset.
"""
import os, sys, shutil, time, datetime, urllib, tarfile, zipfile, csv, io, copy, itertools, six
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
import sklearn
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.cross_validation
# import sklearn.model_selection
import sklearn.ensemble
import sklearn.metrics
import sklearn.discriminant_analysis
import sklearn.preprocessing
import itertools
import textwrap
from pprint import pprint as pp
import warnings
warnings.filterwarnings("error")


def reportInModule(model, scoring, results):
    print("\n\n{model: '%s', scoring: '%s'}:\n results summary: %.3f mean (%.3f) std" % (type(model).__name__, scoring, results.mean(), results.std(),))
    print(" sorted(results):")
    pp(sorted(results), indent=8)


class AnalyzeAlgorithms:
    """Aids in the comparison of Machine Learning Algorithms for a particular dataset.
    http://machinelearningmastery.com/python-machine-learning-mini-course/
    """

    @staticmethod
    def calcX_Y(dataFrame):
        array = dataFrame.values
        X = array[:, 0:dataFrame.shape[1]-1]
        Y = array[:, dataFrame.shape[1]-1]
        return X, Y

    def __init__(self, datasetInfoTupleList, modelList, scoringStrList, kfold):
        self.datasetInfoTupleList = datasetInfoTupleList  # (datasetTitle, csvFilePath, delim_whitespace, columnNamesList)
        self.modelList = modelList
        self.scoringStrList = scoringStrList
        self.kfold = kfold

    def analyzeAlgorithms(self):
        for datasetInfoTuple in self.datasetInfoTupleList:  # (datasetTitle, csvFilePath, delim_whitespace, columnNamesList)
            self.datasetTitle, self.csvFilePath, self.delim_whitespace, self.columnNamesList = datasetInfoTuple
            self.df = pd.read_csv(self.csvFilePath, delim_whitespace=self.delim_whitespace, names=self.columnNamesList)
            self.X, self.Y = AnalyzeAlgorithms.calcX_Y(self.df)
            for model in self.modelList:
                for scoring in self.scoringStrList:
                    print("trying|%s|%s|%s|"%(self.datasetTitle, type(model).__name__, scoring),end='')
                    results = self.genResults(model, scoring, self.X, self.Y, self.kfold)
                    self.report_short(datasetInfoTuple, model, scoring, results)

    def genResults(self, model, scoring, X, Y, kfold):
        try:
            results = sklearn.model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
        except:
            results = "Error: %s"%( sys.exc_info()[1] )
        return results

    def report_short(self, datasetInfoTuple, model, scoring, results):
        datasetTitle = datasetInfoTuple[0]
        if isinstance(results, six.string_types):
            results_short = results.splitlines()[0]
            print(results_short)
        else:
            results_short = results.mean()
            print(results_short)

    def report(self, model, scoring, results):
        if isinstance(results, six.string_types):  # Error
            print("\n\n{model: '%s', scoring: '%s'}:\n results ERROR: %s" % (type(model).__name__, scoring, results) )
        else:
            print("\n\n{model: '%s', scoring: '%s'}:\n results summary: %.3f mean (%.3f) std" % (type(model).__name__, scoring, results.mean(), results.std(),))
            print(" sorted(results):")
            pp(sorted(results), indent=8)


def test():
    datasetInfoTupleList = [
        ('Boston House Price Data',  # datasetTitle
         r'C:\BLA\BLA\BLA\data\BostonHousing\housing.data.txt',  # csvFilePath
         True,  # delim_whitespace for csv boolean
         ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'],  # columnNames
         ),
        ('Iris Data',  # datasetTitle
         r'C:\BLA\BLA\BLA\data\iris\iris.data.txt',  # csvFilePath
         False,  # delim_whitespace for csv boolean
         ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'class'],  # columnNames
         ),
        ('Pima Indians Diabetes Data',  # datasetTitle
         r'C:\BLA\BLA\BLA\data\PimaIndiansDiabetes\pima-indians-diabetes.data.txt',  # csvFilePath
         False,  # delim_whitespace for csv boolean
         ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'],  # columnNames
         ),
    ]
    modelList = [
        sklearn.neighbors.KNeighborsRegressor(),
        sklearn.linear_model.LinearRegression(),
        sklearn.linear_model.LogisticRegression(),
        sklearn.discriminant_analysis.LinearDiscriminantAnalysis(),
    ]
    scoringStrList = [
        'accuracy',
        'neg_mean_squared_error',
        'neg_log_loss',
    ]
    kfold = sklearn.model_selection.KFold(n_splits=10, random_state=7)
    analysis02 = AnalyzeAlgorithms(datasetInfoTupleList, modelList, scoringStrList, kfold)
    analysis02.analyzeAlgorithms()


def main():
    test()


if __name__ == '__main__':
    main()

output = """
C:\Python35\python.exe C:/BLA/BLA/MachineLearningMasteryPj/AnalyzeAlgorithmsPkg/analyzeAlgorithmsMod.py
C:\Python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
trying|Boston House Price Data|KNeighborsRegressor|accuracy|Error: continuous is not supported
trying|Boston House Price Data|KNeighborsRegressor|neg_mean_squared_error|-107.28683898
trying|Boston House Price Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Boston House Price Data|LinearRegression|accuracy|Error: continuous is not supported
trying|Boston House Price Data|LinearRegression|neg_mean_squared_error|-34.7052559445
trying|Boston House Price Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'
trying|Boston House Price Data|LogisticRegression|accuracy|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LogisticRegression|neg_mean_squared_error|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LogisticRegression|neg_log_loss|Error: Unknown label type: 'continuous'
trying|Boston House Price Data|LinearDiscriminantAnalysis|accuracy|Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6,
trying|Boston House Price Data|LinearDiscriminantAnalysis|neg_mean_squared_error|Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6,
trying|Boston House Price Data|LinearDiscriminantAnalysis|neg_log_loss|Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6,
trying|Iris Data|KNeighborsRegressor|accuracy|Error: unsupported operand type(s) for /: 'str' and 'int'
trying|Iris Data|KNeighborsRegressor|neg_mean_squared_error|Error: unsupported operand type(s) for /: 'str' and 'int'
trying|Iris Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Iris Data|LinearRegression|accuracy|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LinearRegression|neg_mean_squared_error|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LinearRegression|neg_log_loss|Error: could not convert string to float: 'Iris-virginica'
trying|Iris Data|LogisticRegression|accuracy|0.88
trying|Iris Data|LogisticRegression|neg_mean_squared_error|Error: could not convert string to float: 'Iris-setosa'
trying|Iris Data|LogisticRegression|neg_log_loss|Error: y_true contains only one label (Iris-setosa). Please provide the true labels explicitly through the labels argument.
trying|Iris Data|LinearDiscriminantAnalysis|accuracy|Error: The priors do not sum to 1. Renormalizing
trying|Iris Data|LinearDiscriminantAnalysis|neg_mean_squared_error|Error: The priors do not sum to 1. Renormalizing
trying|Iris Data|LinearDiscriminantAnalysis|neg_log_loss|Error: The priors do not sum to 1. Renormalizing
trying|Pima Indians Diabetes Data|KNeighborsRegressor|accuracy|Error: Can't handle mix of binary and continuous
trying|Pima Indians Diabetes Data|KNeighborsRegressor|neg_mean_squared_error|-0.196342447027
trying|Pima Indians Diabetes Data|KNeighborsRegressor|neg_log_loss|Error: 'KNeighborsRegressor' object has no attribute 'predict_proba'
trying|Pima Indians Diabetes Data|LinearRegression|accuracy|Error: Can't handle mix of binary and continuous
trying|Pima Indians Diabetes Data|LinearRegression|neg_mean_squared_error|-0.162812506544
trying|Pima Indians Diabetes Data|LinearRegression|neg_log_loss|Error: 'LinearRegression' object has no attribute 'predict_proba'
trying|Pima Indians Diabetes Data|LogisticRegression|accuracy|0.76951469583
trying|Pima Indians Diabetes Data|LogisticRegression|neg_mean_squared_error|-0.23048530417
trying|Pima Indians Diabetes Data|LogisticRegression|neg_log_loss|-0.492545522852
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|accuracy|0.773462064252
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_mean_squared_error|-0.226537935748
trying|Pima Indians Diabetes Data|LinearDiscriminantAnalysis|neg_log_loss|-0.485655330102

Process finished with exit code 0
"""
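A note on reading the output: the Analysis Code calls `warnings.filterwarnings("error")`, which makes Python raise every warning as an exception. That is probably why the Iris LinearDiscriminantAnalysis rows show a warning message after "Error:". The sketch below reproduces the effect with the standard library only; `noisy_step` is a made-up stand-in for a library call, and the `UserWarning` category is an assumption.

```python
import warnings

def noisy_step():
    # Stand-in for a library call that warns but would otherwise succeed.
    warnings.warn("The priors do not sum to 1. Renormalizing", UserWarning)
    return "finished"

# With the warning ignored, the call returns its normal result.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    relaxed = noisy_step()

# With the "error" filter, the same warning is raised as an exception,
# so a try/except around the call reports it like a failure.
with warnings.catch_warnings():
    warnings.filterwarnings("error")
    try:
        strict = noisy_step()
    except UserWarning as exc:
        strict = "Error: %s" % exc

print(relaxed)
print(strict)
```

So some of the 25 non-numerical cases may be warnings that would still have produced a score had the filter been left at its default.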
Analysis Grid
Here’s the Analysis Grid copied from an Excel Spreadsheet.
| Dataset | Model | Scoring | Result (Error or Mean(result)) | Joe's Comment |
|---|---|---|---|---|
| Boston House Price Data | KNeighborsRegressor | accuracy | Error: continuous is not supported | Data Prep? (Boston, [KNeighborsRegressor, LinearRegression], accuracy) |
| Boston House Price Data | KNeighborsRegressor | neg_mean_squared_error | -107.286839 | |
| Boston House Price Data | KNeighborsRegressor | neg_log_loss | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba' | NG: (KNeighborsRegressor, neg_log_loss) |
| Boston House Price Data | LinearRegression | accuracy | Error: continuous is not supported | Data Prep? (Boston, [KNeighborsRegressor, LinearRegression], accuracy) |
| Boston House Price Data | LinearRegression | neg_mean_squared_error | -34.70525594 | |
| Boston House Price Data | LinearRegression | neg_log_loss | Error: 'LinearRegression' object has no attribute 'predict_proba' | Data Prep? Or Model-Scoring?([Boston, Pima], LinearRegression, neg_log_loss) |
| Boston House Price Data | LogisticRegression | accuracy | Error: Unknown label type: 'continuous' | Data Prep (Boston, LogisticRegression, *) |
| Boston House Price Data | LogisticRegression | neg_mean_squared_error | Error: Unknown label type: 'continuous' | Data Prep (Boston, LogisticRegression, *) |
| Boston House Price Data | LogisticRegression | neg_log_loss | Error: Unknown label type: 'continuous' | Data Prep (Boston, LogisticRegression, *) |
| Boston House Price Data | LinearDiscriminantAnalysis | accuracy | Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6, | Data Prep (Boston, LinearDiscriminantAnalysis *) |
| Boston House Price Data | LinearDiscriminantAnalysis | neg_mean_squared_error | Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6, | Data Prep (Boston, LinearDiscriminantAnalysis *) |
| Boston House Price Data | LinearDiscriminantAnalysis | neg_log_loss | Error: Unknown label type: (array([ 20.5, 25. , 23.4, 18.9, 35.4, 24.7, 31.6, 23.3, 19.6, | Data Prep (Boston, LinearDiscriminantAnalysis *) |
| Iris Data | KNeighborsRegressor | accuracy | Error: unsupported operand type(s) for /: 'str' and 'int' | Data Prep: (Iris Data, KNeighborsRegressor, [accuracy, neg_mean_squared_error] ) |
| Iris Data | KNeighborsRegressor | neg_mean_squared_error | Error: unsupported operand type(s) for /: 'str' and 'int' | Data Prep: (Iris Data, KNeighborsRegressor, [accuracy, neg_mean_squared_error] ) |
| Iris Data | KNeighborsRegressor | neg_log_loss | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba' | NG: (KNeighborsRegressor, neg_log_loss) |
| Iris Data | LinearRegression | accuracy | Error: could not convert string to float: 'Iris-virginica' | Data Prep: (Iris Data, LinearRegression, *) |
| Iris Data | LinearRegression | neg_mean_squared_error | Error: could not convert string to float: 'Iris-virginica' | Data Prep: (Iris Data, LinearRegression, *) |
| Iris Data | LinearRegression | neg_log_loss | Error: could not convert string to float: 'Iris-virginica' | Data Prep: (Iris Data, LinearRegression, *) |
| Iris Data | LogisticRegression | accuracy | 0.88 | |
| Iris Data | LogisticRegression | neg_mean_squared_error | Error: could not convert string to float: 'Iris-setosa' | |
| Iris Data | LogisticRegression | neg_log_loss | Error: y_true contains only one label (Iris-setosa). Please provide the true labels explicitly through the labels argument. | |
| Iris Data | LinearDiscriminantAnalysis | accuracy | Error: The priors do not sum to 1. Renormalizing | |
| Iris Data | LinearDiscriminantAnalysis | neg_mean_squared_error | Error: The priors do not sum to 1. Renormalizing | |
| Iris Data | LinearDiscriminantAnalysis | neg_log_loss | Error: The priors do not sum to 1. Renormalizing | |
| Pima Indians Diabetes Data | KNeighborsRegressor | accuracy | Error: Can't handle mix of binary and continuous | Data Prep? (Pima, [KNeighborsRegressor, LinearRegression], accuracy) |
| Pima Indians Diabetes Data | KNeighborsRegressor | neg_mean_squared_error | -0.196342447 | |
| Pima Indians Diabetes Data | KNeighborsRegressor | neg_log_loss | Error: 'KNeighborsRegressor' object has no attribute 'predict_proba' | NG: (KNeighborsRegressor, neg_log_loss) |
| Pima Indians Diabetes Data | LinearRegression | accuracy | Error: Can't handle mix of binary and continuous | Data Prep? (Pima, [KNeighborsRegressor, LinearRegression], accuracy) |
| Pima Indians Diabetes Data | LinearRegression | neg_mean_squared_error | -0.162812507 | |
| Pima Indians Diabetes Data | LinearRegression | neg_log_loss | Error: 'LinearRegression' object has no attribute 'predict_proba' | Data Prep? Or Model-Scoring?([Boston, Pima], LinearRegression, neg_log_loss) |
| Pima Indians Diabetes Data | LogisticRegression | accuracy | 0.769514696 | |
| Pima Indians Diabetes Data | LogisticRegression | neg_mean_squared_error | -0.230485304 | |
| Pima Indians Diabetes Data | LogisticRegression | neg_log_loss | -0.492545523 | |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | accuracy | 0.773462064 | |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | neg_mean_squared_error | -0.226537936 | |
| Pima Indians Diabetes Data | LinearDiscriminantAnalysis | neg_log_loss | -0.48565533 | |