Background
Although I have not thought about Artificial Intelligence (AI) since I was a student in Michael Arbib’s class studying for my M.S., when I became aware of H2O.ai.com, I decided it was time to jump in. 🙂
The following will be a chronicle of my adventures. 🙂
THIS IS A WORK IN PROGRESS
Big Data Hadoop vs Apache Spark
Downloads: (H2O vs Sparkling Water)
H2O.ai’s offerings, H2O and Sparkling Water, seemed to pose the question, “What Big Data platform should I choose, Hadoop or Apache Spark?” I have learned that they are not competitors. Katherine Noyes says in InfoWorld,
“They do different things. … Hadoop is essentially a distributed data infrastructure … Spark, on the other hand, is a data-processing tool that operates on those distributed data collections.”
OK. But which of H2O.ai’s Downloads, only 2 when I started, should I choose to investigate? I picked Sparkling Water because of a page explaining the AI “Classification” Use Case.
Goal: Install & Run PySparkling
Here are some notes on installing PySparkling on Windows 10.
Be prepared for (SysAdmin, SysAdmin, … more SysAdmin)!
Install Apache Spark (to use PySpark)
- Apache Spark needs to be installed first
- Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6
- > echo %PYSPARK_PYTHON% == C:\Python27\python.exe
- Test Run PySpark in the PySpark Shell
- > cd Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6
- > .\bin\pyspark.cmd
- Test with QuickStart N.B. Click the Python_Tab
- RESULT: OK
- Test Run PySpark as a Self-Contained Application
- Test with Self-Contained Application N.B. Click the Python_Tab
- RESULT: NO GOOD (ImportError: No module named pyspark)
Self-Contained PySpark RESULT
Here’s the Self-Contained RESULT with NO MODIFICATIONS of the sys.path

    """SimpleApp.py"""
    from pyspark import SparkContext

    logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
    sc = SparkContext("local", "Simple App")
    logData = sc.textFile(logFile).cache()

    numAs = logData.filter(lambda s: 'a' in s).count()
    numBs = logData.filter(lambda s: 'b' in s).count()

    print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-1-024ac6cc8be6> in <module>()
          1 """SimpleApp.py"""
    ----> 2 from pyspark import SparkContext
          3
          4 logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
          5 sc = SparkContext("local", "Simple App")

    ImportError: No module named pyspark
sys.path Analysis – PySpark Shell vs Plain Python27
    # This is pyspark shell sys.path
    pysparkShellSysPath = '''
    C:\Users\joeco\AppData\Local\Temp\spark-a00a2ab4-63a0-404f-b607-5f34c4206e76\userFiles-5a9b86bf-a518-4fef-b4de-b42005b143d5
    C:\Python27\lib\site-packages\pywebview-0.8.2-py2.7.egg
    C:\Python27\lib\site-packages\ouimeaux-0.7.9.post0-py2.7.egg
    C:\Python27\lib\site-packages\gevent_socketio-0.3.6-py2.7.egg
    C:\Python27\lib\site-packages\flask_restful-0.3.5-py2.7.egg
    C:\Python27\lib\site-packages\pysignals-0.1.2-py2.7.egg
    C:\Python27\lib\site-packages\pyyaml-3.11-py2.7-win-amd64.egg
    C:\Python27\lib\site-packages\requests-2.9.1-py2.7.egg
    C:\Python27\lib\site-packages\gevent-1.1rc3-py2.7-win-amd64.egg
    C:\Python27\lib\site-packages\gevent_websocket-0.9.5-py2.7.egg
    C:\Python27\lib\site-packages\pytz-2015.7-py2.7.egg
    C:\Python27\lib\site-packages\six-1.10.0-py2.7.egg
    C:\Python27\lib\site-packages\flask-0.10.1-py2.7.egg
    C:\Python27\lib\site-packages\aniso8601-1.1.0-py2.7.egg
    C:\Python27\lib\site-packages\greenlet-0.4.9-py2.7-win-amd64.egg
    C:\Python27\lib\site-packages\itsdangerous-0.24-py2.7.egg
    C:\Python27\lib\site-packages\werkzeug-0.11.3-py2.7.egg
    C:\Python27\lib\site-packages\python_dateutil-2.4.2-py2.7.egg
    C:\Python27\lib\site-packages\python_registry-1.1.0-py2.7.egg
    C:\Python27\lib\site-packages\enum34-1.1.2-py2.7.egg
    C:\Python27\lib\site-packages\speedtest_cli-0.3.4-py2.7.egg
    C:\Python27\lib\site-packages\midi-0.2.3-py2.7.egg
    C:\Python27\lib\site-packages\h2o_pysparkling_1.6-1.6.5-py2.7.egg
    C:\Python27\lib\site-packages\tabulate-0.7.5-py2.7.egg
    C:\Python27\lib\site-packages\future-0.15.2-py2.7.egg
    C:\Users\joeco\Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip
    C:\Users\joeco\Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6\python
    C:\Users\joeco\Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6
    C:\WINDOWS\SYSTEM32\python27.zip
    C:\Python27\DLLs
    C:\Python27\lib
    C:\Python27\lib\plat-win
    C:\Python27\lib\lib-tk
    C:\Python27
    C:\Python27\lib\site-packages
    C:\Python27\lib\site-packages\win32
    C:\Python27\lib\site-packages\win32\lib
    C:\Python27\lib\site-packages\Pythonwin
    C:\Python27\lib\site-packages\wx-3.0-msw
    '''

    # This is plain python 2.7 sys.path
    plainPythonSysPath = '''
    C:\Python27\lib\site-packages\pywebview-0.8.2-py2.7.egg
    C:\Python27\lib\site-packages\ouimeaux-0.7.9.post0-py2.7.egg
    C:\Python27\lib\site-packages\gevent_socketio-0.3.6-py2.7.egg
    C:\Python27\lib\site-packages\flask_restful-0.3.5-py2.7.egg
    C:\Python27\lib\site-packages\pysignals-0.1.2-py2.7.egg
    C:\Python27\lib\site-packages\pyyaml-3.11-py2.7-win-amd64.egg
    C:\Python27\lib\site-packages\requests-2.9.1-py2.7.egg
    C:\Python27\lib\site-packages\gevent-1.1rc3-py2.7-win-amd64.egg
    C:\Python27\lib\site-packages\gevent_websocket-0.9.5-py2.7.egg
    C:\Python27\lib\site-packages\pytz-2015.7-py2.7.egg
    C:\Python27\lib\site-packages\six-1.10.0-py2.7.egg
    C:\Python27\lib\site-packages\flask-0.10.1-py2.7.egg
    C:\Python27\lib\site-packages\aniso8601-1.1.0-py2.7.egg
    C:\Python27\lib\site-packages\greenlet-0.4.9-py2.7-win-amd64.egg
    C:\Python27\lib\site-packages\itsdangerous-0.24-py2.7.egg
    C:\Python27\lib\site-packages\werkzeug-0.11.3-py2.7.egg
    C:\Python27\lib\site-packages\python_dateutil-2.4.2-py2.7.egg
    C:\Python27\lib\site-packages\python_registry-1.1.0-py2.7.egg
    C:\Python27\lib\site-packages\enum34-1.1.2-py2.7.egg
    C:\Python27\lib\site-packages\speedtest_cli-0.3.4-py2.7.egg
    C:\Python27\lib\site-packages\midi-0.2.3-py2.7.egg
    C:\Python27\lib\site-packages\h2o_pysparkling_1.6-1.6.5-py2.7.egg
    C:\Python27\lib\site-packages\tabulate-0.7.5-py2.7.egg
    C:\Python27\lib\site-packages\future-0.15.2-py2.7.egg
    C:\WINDOWS\SYSTEM32\python27.zip
    C:\Python27\DLLs
    C:\Python27\lib
    C:\Python27\lib\plat-win
    C:\Python27\lib\lib-tk
    C:\Python27
    C:\Python27\lib\site-packages
    C:\Python27\lib\site-packages\win32
    C:\Python27\lib\site-packages\win32\lib
    C:\Python27\lib\site-packages\Pythonwin
    C:\Python27\lib\site-packages\wx-3.0-msw
    '''

    # print('hello', len( sorted(pysprkShellSysPath.splitlines()) ) )
    pysparksp = sorted(pysparkShellSysPath.splitlines())
    len(pysparksp)
    plainsp = sorted(plainPythonSysPath.splitlines())
    len(plainsp)

    for i in range( max(len(pysparksp),len(plainsp)) ):
        if i < len(pysparksp): print ('pyspk', pysparksp[i])
        if i < len(plainsp ): print ('plain', plainsp[i])
        print

    ----------------------OUTPUT---------------------------
    ('pyspk', '')
    ('plain', '')

    ('pyspk', 'C:\\Python27')
    ('plain', 'C:\\Python27')

    ('pyspk', 'C:\\Python27\\DLLs')
    ('plain', 'C:\\Python27\\DLLs')

    ('pyspk', 'C:\\Python27\\lib')
    ('plain', 'C:\\Python27\\lib')

    ('pyspk', 'C:\\Python27\\lib\\lib-tk')
    ('plain', 'C:\\Python27\\lib\\lib-tk')

    ('pyspk', 'C:\\Python27\\lib\\plat-win')
    ('plain', 'C:\\Python27\\lib\\plat-win')

    ('pyspk', 'C:\\Python27\\lib\\site-packages')
    ('plain', 'C:\\Python27\\lib\\site-packages')

    ('pyspk', 'C:\\Python27\\lib\\site-packages')
    ('plain', 'C:\\Python27\\lib\\site-packages')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\x07niso8601-1.1.0-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\x07niso8601-1.1.0-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\tabulate-0.7.5-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\tabulate-0.7.5-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\x0clask-0.10.1-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\x0clask-0.10.1-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\x0clask_restful-0.3.5-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\x0clask_restful-0.3.5-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\x0cuture-0.15.2-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\x0cuture-0.15.2-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\Pythonwin')
    ('plain', 'C:\\Python27\\lib\\site-packages\\Pythonwin')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\enum34-1.1.2-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\enum34-1.1.2-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\gevent-1.1rc3-py2.7-win-amd64.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\gevent-1.1rc3-py2.7-win-amd64.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\gevent_socketio-0.3.6-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\gevent_socketio-0.3.6-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\gevent_websocket-0.9.5-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\gevent_websocket-0.9.5-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\greenlet-0.4.9-py2.7-win-amd64.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\greenlet-0.4.9-py2.7-win-amd64.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\h2o_pysparkling_1.6-1.6.5-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\h2o_pysparkling_1.6-1.6.5-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\itsdangerous-0.24-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\itsdangerous-0.24-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\midi-0.2.3-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\midi-0.2.3-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\ouimeaux-0.7.9.post0-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\ouimeaux-0.7.9.post0-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\pysignals-0.1.2-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\pysignals-0.1.2-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\python_dateutil-2.4.2-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\python_dateutil-2.4.2-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\python_registry-1.1.0-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\python_registry-1.1.0-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\pytz-2015.7-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\pytz-2015.7-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\pywebview-0.8.2-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\pywebview-0.8.2-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\pyyaml-3.11-py2.7-win-amd64.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\pyyaml-3.11-py2.7-win-amd64.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\six-1.10.0-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\six-1.10.0-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\speedtest_cli-0.3.4-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\speedtest_cli-0.3.4-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\werkzeug-0.11.3-py2.7.egg')
    ('plain', 'C:\\Python27\\lib\\site-packages\\werkzeug-0.11.3-py2.7.egg')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\win32')
    ('plain', 'C:\\Python27\\lib\\site-packages\\win32')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\win32\\lib')
    ('plain', 'C:\\Python27\\lib\\site-packages\\win32\\lib')

    ('pyspk', 'C:\\Python27\\lib\\site-packages\\wx-3.0-msw')
    ('plain', 'C:\\Python27\\lib\\site-packages\\wx-3.0-msw')

    ('pyspk', 'C:\\Users\\joeco\\AppData\\Local\\Temp\\spark-a00a2ab4-63a0-404f-b607-5f34c4206e76\\userFiles-5a9b86bf-a518-4fef-b4de-b42005b143d5')
    ('plain', 'C:\\WINDOWS\\SYSTEM32\\python27.zip')

    ('pyspk', 'C:\\Users\\joeco\\Downloads\\ApacheSpark\\spark-1.6.2-bin-hadoop2.6')
    ('plain', 'equests-2.9.1-py2.7.egg')

    ('pyspk', 'C:\\Users\\joeco\\Downloads\\ApacheSpark\\spark-1.6.2-bin-hadoop2.6\\python')

    ('pyspk', 'C:\\Users\\joeco\\Downloads\\ApacheSpark\\spark-1.6.2-bin-hadoop2.6\\python\\lib\\py4j-0.9-src.zip')

    ('pyspk', 'C:\\WINDOWS\\SYSTEM32\\python27.zip')

    ('pyspk', 'equests-2.9.1-py2.7.egg')
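A side note on the output above: entries such as `\x07niso8601...`, `\x0clask...` and the stray `equests-2.9.1-py2.7.egg` show up because the paths were pasted into ordinary (non-raw) string literals, so `\a`, `\f`, `\t` and `\r` were read as escape sequences. Below is a minimal sketch of the same comparison using raw strings and set differences instead of a side-by-side print. The helper name `path_diff` is hypothetical, and the two short lists are illustrative stand-ins for the full listings, not the real data.

```python
def path_diff(left_text, right_text):
    """Return (only_in_left, only_in_right) for two newline-separated path lists."""
    left = set(line.strip() for line in left_text.splitlines() if line.strip())
    right = set(line.strip() for line in right_text.splitlines() if line.strip())
    return sorted(left - right), sorted(right - left)


# Raw strings (r'''...''') keep every backslash literal, so \r, \t, \a and \f
# in Windows paths are not turned into control characters.
shell_paths = r'''
C:\Python27\lib\site-packages\requests-2.9.1-py2.7.egg
C:\Python27\lib\site-packages\tabulate-0.7.5-py2.7.egg
C:\Users\joeco\Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6\python
'''

plain_paths = r'''
C:\Python27\lib\site-packages\requests-2.9.1-py2.7.egg
C:\Python27\lib\site-packages\tabulate-0.7.5-py2.7.egg
'''

only_in_shell, only_in_plain = path_diff(shell_paths, plain_paths)

print("Only in the PySpark shell:")
for path in only_in_shell:
    print("  " + path)

print("Only in plain Python:")
for path in only_in_plain:
    print("  " + path)
```

With set differences only the entries that differ are listed, which is easier to read than scanning sorted pairs.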
My Question: Am I missing some secret sauce to set the sys.path?
Answer: I don’t know yet.
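Judging only from the comparison above, the PySpark shell's sys.path has a few entries that plain Python lacks: the Spark home directory, its `python` subdirectory, and `python\lib\py4j-0.9-src.zip`. One thing worth trying, as an untested sketch, is adding the last two by hand before importing pyspark. The helper names `spark_python_paths` and `add_spark_to_sys_path` are hypothetical, and the sketch assumes SPARK_HOME is set and that the py4j zip sits under `python/lib` as it does in the listing. Whether this is the whole secret sauce remains an open question.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the Spark entries seen on the PySpark shell's sys.path.

    Hypothetical helper: <spark_home>/python plus any py4j source zip
    found under <spark_home>/python/lib.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend the Spark paths to sys.path, skipping ones already present."""
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    if home:
        add_spark_to_sys_path(home)
        print("\n".join(spark_python_paths(home)))
    else:
        print("SPARK_HOME is not set")
```

After calling `add_spark_to_sys_path`, the `from pyspark import SparkContext` line in SimpleApp.py would at least have somewhere to look for the module.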
Pursuing the Secret, sys.path, Sauce
This morning I went back to my Original Goal: run PySparkling. NOT PySpark, but PySparkling.
- I found a new download page for Sparkling Water, PySparkling’s Uncle? 🙂
- I chose my Spark version 1.6 out of [1.4, 1.5, 1.6]
- Went to the 1.6 Download Page, downloaded Sparkling Water and clicked on the Python Tab, which said “Get started with PySparkling”. Hallelujah! 🙂
Get started with PySparkling Steps
- Download Spark
  1.1 DONE
- Point SPARK_HOME to the existing installation of Spark and export variable MASTER.

    >echo %SPARK_HOME%
    ...\Downloads\ApacheSpark\spark-1.6.2-bin-hadoop2.6
    >echo %MASTER%
    local-cluster[3,2,1024]
    >
- From your terminal, run:
    # To start an interactive Python terminal
    bin/pysparkling
PROBLEM: No bin/pysparkling.cmd
BUT sparkling-env.cmd exists
- Could bin/pysparkling or bin/sparkling-env.cmd contain Secret sys.path Sauce?
- I Need a break from Sys Admin
- I Need TO CODE SOMETHING
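Looking back at the SPARK_HOME and MASTER step above: as I understand it, Spark's `local-cluster[N,cores,memory]` master string asks for N workers on the local machine, each with the given number of cores and memory in MB. The sketch below checks that both variables are set and splits that string apart. The function names are hypothetical, and the default value is simply the one echoed above.

```python
import os
import re

# Matches strings such as local-cluster[3,2,1024]
LOCAL_CLUSTER = re.compile(r"^local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]$")


def parse_local_cluster(master):
    """Split 'local-cluster[workers,cores,memory]' into a dict, or return None."""
    match = LOCAL_CLUSTER.match(master.strip())
    if match is None:
        return None
    workers, cores, memory_mb = (int(group) for group in match.groups())
    return {
        "workers": workers,
        "cores_per_worker": cores,
        "memory_mb_per_worker": memory_mb,
    }


def missing_variables(environ, names=("SPARK_HOME", "MASTER")):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    print("Missing:", missing_variables(os.environ))
    print(parse_local_cluster(os.environ.get("MASTER", "local-cluster[3,2,1024]")))
```

Passing the environment in as a plain mapping keeps the check easy to try out without touching the real system settings.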
MORE NEXT TIME! 🙂
TODO – FOLLOWING NEEDS WORK = UNPUBLISHED
PySparkling Installation
N.B. Click the Python_Tab
- Be careful of your python environment.
- I am running both Python 2 & 3.
- PySparkling needs Python 2.

    C:\Users\joeco>python
    Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pysparkling import Context
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python35\lib\site-packages\h2o_pysparkling_1.6-1.6.5-py3.5.egg\pysparkling\__init__.py", line 11, in <module>
        from pysparkling.context import H2OContext
      File "C:\Python35\lib\site-packages\h2o_pysparkling_1.6-1.6.5-py3.5.egg\pysparkling\context.py", line 142
        print self
                 ^
    SyntaxError: Missing parentheses in call to 'print'
    >>>

- Be careful of your version of Sparkling Water
- Be careful of your version of Spark
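On the Python-environment pitfall above: the SyntaxError comes from a Python 2 `print self` statement being parsed by Python 3.5. A small guard, run before the import, makes that failure mode explicit. This is only a sketch: the function names are hypothetical, and the "Python 2 only" rule is taken solely from the traceback for this 1.6.5 egg, not from any official statement.

```python
import sys


def is_python2(version_info=None):
    """Return True if the given (or current) interpreter is Python 2.x."""
    if version_info is None:
        version_info = sys.version_info
    return version_info[0] == 2


def check_interpreter():
    """Return a short message saying whether this interpreter looks suitable."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    if is_python2():
        return "Python %s: OK for this PySparkling egg" % version
    return "Python %s: switch to Python 2.7 before importing pysparkling" % version


if __name__ == "__main__":
    print(check_interpreter())
```

Running it with each installed interpreter shows quickly which `python` the command prompt is actually picking up.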