.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import pandas as pd

    data = pd.read_csv(data_file)
    print(data)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

       NumRooms RoofType   Price
    0       NaN      NaN  127500
    1       2.0      NaN  106000
    2       4.0    Slate  178100
    3       NaN      NaN  140000

Data Preparation
----------------

In supervised learning, we train models to predict a designated *target*
value, given some set of *input* values. Our first step in processing
the dataset is to separate out columns corresponding to input versus
target values. We can select columns either by name or via
integer-location based indexing (``iloc``).
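
The two ways of selecting columns can be compared side by side. The
following is a minimal sketch, not part of the original pipeline: the
small ``df`` frame is a hypothetical stand-in that mirrors the housing
data printed above, and the ``*_by_name`` and ``*_by_pos`` variable
names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in that mirrors the housing data shown above.
df = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Selection by column name.
inputs_by_name = df[["NumRooms", "RoofType"]]
targets_by_name = df["Price"]

# Selection by integer position: first two columns versus the third.
inputs_by_pos = df.iloc[:, 0:2]
targets_by_pos = df.iloc[:, 2]

# Both approaches should pick out the same columns.
print(inputs_by_name.equals(inputs_by_pos))
print(targets_by_name.equals(targets_by_pos))
```

Selecting by name tends to be more robust when the column order of the
file may change, while ``iloc`` is convenient when only the positions
are known.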

You might have noticed that ``pandas`` replaced every CSV entry whose
value is ``NA`` with a special ``NaN`` (*not a number*) value. This can
also happen whenever an entry is empty, e.g., “3,,,270000”. These are
called *missing values* and they are the “bed bugs” of data science, a
persistent menace that you will confront throughout your career.
Depending upon the context, missing values might be handled either via
*imputation* or *deletion*. Imputation replaces missing values with
estimates of their values, while deletion simply discards either the
rows or the columns that contain missing values.
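
To make the deletion option concrete, here is a minimal sketch under
stated assumptions: ``df`` is again a hypothetical stand-in for the
housing data, and ``rows_kept`` and ``cols_kept`` are illustrative
names. It counts the missing entries per column and then applies
``dropna`` along each axis.

```python
import pandas as pd

# Hypothetical stand-in that mirrors the housing data shown above.
df = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Count missing entries per column before deciding how to handle them.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Deletion by row: keep only the rows that have no missing values.
rows_kept = df.dropna(axis=0)

# Deletion by column: keep only the columns that have no missing values.
cols_kept = df.dropna(axis=1)

print(rows_kept.shape)
print(cols_kept.columns.tolist())
```

On a dataset this small, deletion leaves a single row or a single
column, which is one reason imputation is often preferred when data is
scarce.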

Here are some common imputation heuristics. For categorical input
fields, we can treat ``NaN`` as a category. Since the ``RoofType``
column takes values ``Slate`` and ``NaN``, ``pandas`` can convert this
column into two columns ``RoofType_Slate`` and ``RoofType_nan``. A row
whose roof type is ``Slate`` will set values of ``RoofType_Slate`` and
``RoofType_nan`` to 1 and 0 (displayed as ``True`` and ``False``),
respectively. The converse holds for a row with a missing ``RoofType``
value.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
    inputs = pd.get_dummies(inputs, dummy_na=True)
    print(inputs)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

       NumRooms  RoofType_Slate  RoofType_nan
    0       NaN           False          True
    1       2.0           False          True
    2       4.0            True         False
    3       NaN           False          True

For missing numerical values, one common heuristic is to replace the
``NaN`` entries with the mean value of the corresponding column.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    inputs = inputs.fillna(inputs.mean())
    print(inputs)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

       NumRooms  RoofType_Slate  RoofType_nan
    0       3.0           False          True
    1       2.0           False          True
    2       4.0            True         False
    3       3.0           False          True

Conversion to the Tensor Format
-------------------------------

Now that all the entries in ``inputs`` and ``targets`` are numerical, we
can load them into a tensor (recall :numref:`sec_ndarray`).

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import torch

    X = torch.tensor(inputs.to_numpy(dtype=float))
    y = torch.tensor(targets.to_numpy(dtype=float))
    X, y

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (tensor([[3., 0., 1.],
             [2., 0., 1.],
             [4., 1., 0.],
             [3., 0., 1.]], dtype=torch.float64),
     tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    from mxnet import np

    X, y = np.array(inputs.to_numpy(dtype=float)), np.array(targets.to_numpy(dtype=float))
    X, y

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (array([[3., 0., 1.],
            [2., 0., 1.],
            [4., 1., 0.],
            [3., 0., 1.]], dtype=float64),
     array([127500., 106000., 178100., 140000.], dtype=float64))

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    from jax import numpy as jnp

    X = jnp.array(inputs.to_numpy(dtype=float))
    y = jnp.array(targets.to_numpy(dtype=float))
    X, y

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (Array([[3., 0., 1.],
            [2., 0., 1.],
            [4., 1., 0.],
            [3., 0., 1.]], dtype=float32),
     Array([127500., 106000., 178100., 140000.], dtype=float32))

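.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import tensorflow as tf

    X = tf.constant(inputs.to_numpy(dtype=float))
    y = tf.constant(targets.to_numpy(dtype=float))
    X, y

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (<tf.Tensor: shape=(4, 3), dtype=float64, numpy=
     array([[3., 0., 1.],
            [2., 0., 1.],
            [4., 1., 0.],
            [3., 0., 1.]])>,
     <tf.Tensor: shape=(4,), dtype=float64, numpy=array([127500., 106000., 178100., 140000.])>)
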
Discussion
----------

You now know how to partition data columns, impute missing values,
and load ``pandas`` data into tensors. In :numref:`sec_kaggle_house`,
you will pick up some more data processing skills. While this crash
course kept things simple, data processing can get hairy. For example,
rather than arriving in a single CSV file, our dataset might be spread
across multiple files extracted from a relational database. For
instance, in an e-commerce application, customer addresses might live in
one table and purchase data in another. Moreover, practitioners face
myriad data types beyond categorical and numeric, for example, text
strings, images, audio data, and point clouds. Oftentimes, advanced
tools and efficient algorithms are required in order to prevent data
processing from becoming the biggest bottleneck in the machine learning
pipeline. These problems will arise when we get to computer vision and
natural language processing. Finally, we must pay attention to data
quality. Real-world datasets are often plagued by outliers, faulty
measurements from sensors, and recording errors, which must be addressed
before feeding the data into any model. Data visualization tools such as
`seaborn