{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, you will learn how to run Linear Regressions. Run each cell in this notebook in order.  \n",
    "\n",
    "When you are asked to insert your own code, write your code in the cell provided and run the cell.  Correct any errors and rerun. When you have the result you want, move on and run the next cell.  Keep running each cell until you are again asked to add code yourself.\n",
    "\n",
    "To get started, run the cell below to load the code and the table you will need to use in the rest of the Notebook.\n",
    "\n",
    "Ignore the \"read_table is deprecateed\" warning message."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\datascience\\tables.py:132: FutureWarning: read_table is deprecated, use read_csv instead.\n",
      "  df = pandas.read_table(filepath_or_buffer, *args, **vargs)\n"
     ]
    }
   ],
   "source": [
    "path_data = 'http://www.millerjw.com/dom/mgmt462/pfiles/'\n",
    "from datascience import *\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plots\n",
    "plots.style.use('fivethirtyeight')\n",
    "# Read in usda data into a datascience table for use in this assignment\n",
    "elantra = Table.read_table(path_data + 'elantra.csv')\n",
    "# Make a copy of the usda data in pandas dataframe format \n",
    "elantrap = elantra.to_df()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the cell below. Seeing the table data with column headings will help you in the rest of this exercise. Remember that you can display a table by using print(table) or by coding a single line with the table name at the end of the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "    <thead>\n",
       "        <tr>\n",
       "            <th>Month</th> <th>Year</th> <th>ElantraSales</th> <th>Unemployment</th> <th>Queries</th> <th>CPI_energy</th> <th>CPI_all</th>\n",
       "        </tr>\n",
       "    </thead>\n",
       "    <tbody>\n",
       "        <tr>\n",
       "            <td>1    </td> <td>2010</td> <td>7690        </td> <td>9.7         </td> <td>153    </td> <td>213.377   </td> <td>217.466</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>1    </td> <td>2011</td> <td>9659        </td> <td>9.1         </td> <td>259    </td> <td>229.353   </td> <td>221.082</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>1    </td> <td>2012</td> <td>10900       </td> <td>8.2         </td> <td>354    </td> <td>244.178   </td> <td>227.666</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>1    </td> <td>2013</td> <td>12174       </td> <td>7.9         </td> <td>230    </td> <td>242.56    </td> <td>231.321</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>1    </td> <td>2014</td> <td>15326       </td> <td>6.6         </td> <td>232    </td> <td>247.575   </td> <td>234.933</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>2    </td> <td>2010</td> <td>7966        </td> <td>9.8         </td> <td>130    </td> <td>209.924   </td> <td>217.251</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>2    </td> <td>2011</td> <td>12289       </td> <td>9           </td> <td>266    </td> <td>232.188   </td> <td>221.816</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>2    </td> <td>2012</td> <td>13820       </td> <td>8.3         </td> <td>296    </td> <td>247.615   </td> <td>228.138</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>2    </td> <td>2013</td> <td>16219       </td> <td>7.7         </td> <td>239    </td> <td>252.639   </td> <td>232.599</td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>2    </td> <td>2014</td> <td>16393       </td> <td>6.7         </td> <td>240    </td> <td>246.389   </td> <td>235.169</td>\n",
       "        </tr>\n",
       "    </tbody>\n",
       "</table>\n",
       "<p>... (40 rows omitted)</p>"
      ],
      "text/plain": [
       "Month | Year | ElantraSales | Unemployment | Queries | CPI_energy | CPI_all\n",
       "1     | 2010 | 7690         | 9.7          | 153     | 213.377    | 217.466\n",
       "1     | 2011 | 9659         | 9.1          | 259     | 229.353    | 221.082\n",
       "1     | 2012 | 10900        | 8.2          | 354     | 244.178    | 227.666\n",
       "1     | 2013 | 12174        | 7.9          | 230     | 242.56     | 231.321\n",
       "1     | 2014 | 15326        | 6.6          | 232     | 247.575    | 234.933\n",
       "2     | 2010 | 7966         | 9.8          | 130     | 209.924    | 217.251\n",
       "2     | 2011 | 12289        | 9            | 266     | 232.188    | 221.816\n",
       "2     | 2012 | 13820        | 8.3          | 296     | 247.615    | 228.138\n",
       "2     | 2013 | 16219        | 7.7          | 239     | 252.639    | 232.599\n",
       "2     | 2014 | 16393        | 6.7          | 240     | 246.389    | 235.169\n",
       "... (40 rows omitted)"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "elantra\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "elantra_y = elantra.select('ElantraSales')\n",
    "elantra_x = elantra.select('Queries','CPI_energy','CPI_all')\n",
    "y= elantra_y.to_df()\n",
    "x = elantra_x.to_df()\n",
    "#print(y)\n",
    "#print(X)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.5,random_state = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
       "         normalize=False)"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "regression_model = LinearRegression()\n",
    "regression_model.fit(x_train, y_train)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = regression_model.predict(x_train)\n",
    "#y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.639869728416425"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import r2_score\n",
    "score = r2_score(y_train, y_pred)\n",
    "score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-67762.31038843562"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "intercept = regression_model.intercept_[0]\n",
    "intercept\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 30.72609602,  -7.38331952, 340.50962618]])"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regression_model.coef_\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's suppose we want to computer the average number of calories for all the rows in the usda table.  We can extract an entire table column into a Python array by using the column method. Then, we can use Python's capabilities to operate on an entire array. The code to do this is \n",
    "\n",
    "array_name = table.column('columnheader')\n",
    "\n",
    "Create an array named array_calories by extracting the Calories column from the USDA table and then display your array by coding array_calories as your second of two statements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the Python array functions is sum.  The code looks like this:\n",
    "variablename = sum(array_name)\n",
    "\n",
    "Code two more statements.  The first should sum up array_calories and put the result in variable s\n",
    "The second statement should display the sum."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step in the calculation is to divide the sum by the number of rows.  You already have the sum in variable s and the number of rows in variable r.   Calculate the average and place the result in variable a.  The display variable a."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Short variable names can make your code hard to understand so I only used them as temporary hold places.  If you are going to use a variable in more than a few localized places in your code use better names such as array_calories\n",
    "\n",
    "Normally, a calculation like this is done in a single step.  Example:\n",
    "sum(usda.column('Calories')) / usda.num_rows\n",
    "\n",
    "Paste the statement above into the cell below.  Then run it to see if you get the same result for the average that you got before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Excel can calculate averages without writing any code.  Why use Python?  Python code is easily repeatable.  If you get a new USDA dataset with a different number of rows and different values, you can run the Python code you created with one mouse click and you are done.  In Excel, you would need to manually open the csv file and then go to the bottom of the table and code the average function again. If there were 100 calculations, you would have to rewrite 100 calculations in Excel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose we want to find the maximum value of Cholesterol in the usda table.  First, I don't remember the column names so I can run the statement\n",
    "print(usda.labels)\n",
    "\n",
    "The maximum value of Cholesterol can calculated as follows:\n",
    "max_cholesterol = max(usda.column('Cholesterol'))\n",
    "\n",
    "I can display the entire row that contains the max_cholesterol value using:\n",
    "print(usda.where('Cholesterol', max_cholesterol))\n",
    "\n",
    "This could be done in a single statement:\n",
    "usda.where('Cholesterol', max(usda.column('Cholesterol')))\n",
    "\n",
    "Run the cell below to see the results.  Note thata there is more than one row that has the maximum value for cholesterol. Also note that while a print statement can be used anywhere, you get a better format by just putting a statement without an equal sign on the last line in a cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(usda.labels)\n",
    "max_cholesterol = max(usda.column('Cholesterol')) \n",
    "print(usda.where('Cholesterol', max_cholesterol))\n",
    "usda.where('Cholesterol', max(usda.column('Cholesterol')))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next cell, write code to display all rows in the usda table that have the maximum value for Sodium"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, all selections have been made with a simple version of the where method. In the next example, we want to create a subset of the usda table that contains only those rows that have above average sugar.\n",
    "\n",
    "Run the cell below to sort descending by the Sugar column.  Notice that there are many rows that contain nan (not a number) for Sugar.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "usda.sort('Sugar', descending = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To calculate average Sugar, we must have valid numbers in the Sugar Column.  Getting a subset of a table that contains only valid values in one of the columns is something that needs to be done frequently.  So we will write a function to do that job.  That function is:\n",
    "\n",
    "def valid_subset(tbl,col):\n",
    "    return tbl.where(col, are.strictly_between(0, max(tbl.column(col))))\n",
    "    \n",
    "This function selects only rows that have a value between zero and the highest value in the column.  Anything else must be \"not a number\".  The function returns a subset of the original table.  Run the cell below to see how the function works"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def valid_subset(tbl,col):\n",
    "    return tbl.where(col, are.strictly_between(0, max(tbl.column(col))))\n",
    "\n",
    "valid_sugar = valid_subset(usda,'Sugar')\n",
    "\n",
    "print(' original rows ', usda.num_rows, '  subset rows ', valid_sugar.num_rows)\n",
    "print(' dropped rows ', usda.num_rows - valid_sugar.num_rows)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have valid data, let's calculate the average like we did before.\n",
    "Here is an example of calulating an average from before:\n",
    "\n",
    "sum(usda.column('Calories')) / usda.num_rows\n",
    "\n",
    "This time, we want to use only valid data and calculate the average for Sugar \n",
    "Run the cell below to see how the average is calculated for sugar.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "avg_sugar = sum(valid_sugar.column('Sugar')) / valid_sugar.num_rows\n",
    "avg_sugar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we know the average for the sugar column, we can use it in the where method to select a subset that contains only rows with above average sugar content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sugar_above_average = valid_sugar.where('Sugar', are.above(avg_sugar))\n",
    "sugar_above_average"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The problem you need to solve is easier than what we have just done.  Your task is to create a subset of the usda table that contains only rows where 'Sodium' is greater than 10000.  Use table.where('columnname', are.above(number))\n",
    "Name your subset high_sodium. Place your work in the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your next task is to display the single row where the description contains 'CAVIAR'.  Use table.where('columnname', 'value') Do that in the cell below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have been using a tabe in datascience format so far.  Python has a similar structure called a dataframe.  The dataframe for the usda table is called usdap.  This dataframe was creataed from the usda.csv file in the first cell of code in this notebook.  \n",
    "\n",
    "Run the statement usdap.describe() to summarize the usda data. Use the cell below.\n",
    "The output may be useful in answering questions in your homework in Canvas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two statements below will\n",
    "1. Count rows that contain valid numbers in the Sodium column\n",
    "2. Count rows that do not contain valid numbers in the Sodium column\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "print(np.count_nonzero(~np.isnan(usda.column('Sodium'))))\n",
    "print(np.count_nonzero(np.isnan(usda.column('Sodium'))))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "stats.describe will produce descriptive statistics for an array.  In this case, the array contains data from the column 'Sugar'.\n",
    "\n",
    "Since we did not import the stats package in the first cell of this notebook, we must import it now so that we can use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy import stats \n",
    "  \n",
    "x = stats.describe(usda.column('Sugar')) \n",
    "print(x)\n",
    "   \n",
    "x = stats.describe(valid_sugar.column('Sugar')) \n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the blank cells below to help answer questions in Canvas.   For example you might want to \n",
    "1. Create a valid_sodium table using valid_sugar as a model\n",
    "2. Run stats.describe on the valid_sodium table for the column 'Sodium'\n",
    "3. Calculate the standard deviation (SD) by using the variance number from stats.describe as input to the calculation.\n",
    "Note: To calculate SD use np.sqrt(variance) where variance is the number you obtaied from stats.describe in Step 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}