Dealing with Large Datasets

I.e., the data resides in external memory (e.g., on hard drives), a.k.a. out-of-core processing.

The same ideas also apply to data arriving from a network, sensor samples, etc.

  • Incremental learning - the algorithm consumes a batch of data at a time rather than all of it at once (see the sketch after this list)
  • Streaming - a separate program loads the data and feeds it into the algorithm
  • Reducing data - feature extraction/selection, hashing trick, dimensionality reduction, etc.
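
A minimal out-of-core sketch (assuming a hypothetical big_data.csv with a target column): pandas reads the file in fixed-size chunks and a scikit-learn model is updated incrementally with partial_fit(), so only one chunk is in memory at a time.

      # hypothetical file and column names; illustrates the chunk-at-a-time pattern
      import pandas as pd
      from sklearn.linear_model import SGDRegressor

      model = SGDRegressor()
      for chunk in pd.read_csv("big_data.csv", chunksize=10000):
          y = chunk["target"].values
          X = chunk.drop(columns="target").values
          model.partial_fit(X, y)   # incremental update on this chunk only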

Approaches to parallelism

  • joblib
  • Hashing trick (see the sketch after this list)
  • scikit-learn partial_fit()
  • Online learning
  • Streaming
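
For the hashing trick, scikit-learn's HashingVectorizer is one option: tokens are mapped to a fixed number of columns by hashing, so no vocabulary has to be built or kept in memory, and the transform is stateless (no fit() needed). A minimal sketch:

      from sklearn.feature_extraction.text import HashingVectorizer

      vec = HashingVectorizer(n_features=2**18)      # fixed output width, independent of the data
      docs = ["first batch of streaming text", "second batch of streaming text"]
      X = vec.transform(docs)                        # stateless: works batch by batch
      print(X.shape)                                 # (2, 262144) sparse matrix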

numpy.linalg

  • Uses BLAS and LAPACK to provide efficient low-level implementations of standard linear algebra algorithms.
  • Those libraries may be provided by NumPy itself using C versions of a subset of their reference implementations
  • but, when possible, highly optimized libraries that take advantage of specialized processor functionality are preferred.
  • Examples of such libraries are OpenBLAS, MKL (TM), and ATLAS.
  • Because those libraries are multithreaded and processor dependent, environment variables and external packages such as threadpoolctl may be needed to control the number of threads or specify the processor architecture (see the sketch below).

https://docs.scipy.org/doc/numpy/reference/routines.linalg.html
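
For example, threadpoolctl can cap the number of BLAS threads used inside a numpy.linalg call (a minimal sketch; assumes the threadpoolctl package is installed):

      import numpy as np
      from threadpoolctl import threadpool_limits

      A = np.random.randn(2000, 2000)
      with threadpool_limits(limits=1, user_api="blas"):   # single-threaded BLAS inside this block
          w = np.linalg.eigvals(A)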

Linear algebra on several matrices at once

  • As of version 1.8.0, several of the linear algebra routines are able to compute results for several matrices at once, if they are stacked into the same array.
  • This is indicated in the documentation via input parameter specifications such as a : (..., M, M) array_like.
  • This means that, for instance, given an input array with a.shape == (N, M, M), it is interpreted as a “stack” of N matrices, each of size M-by-M (see the sketch after this list).
  • A similar specification applies to return values; for instance the determinant has det : (...), and in this case the routine returns an array with det(a).shape == (N,).
  • This generalizes to linear algebra operations on higher-dimensional arrays: the last 1 or 2 dimensions of a multidimensional array are interpreted as vectors or matrices, as appropriate for each operation.
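
A small sketch of this stacked behavior:

      import numpy as np

      a = np.random.randn(4, 3, 3)     # a "stack" of N = 4 matrices, each 3-by-3
      dets = np.linalg.det(a)          # shape (4,): one determinant per matrix
      invs = np.linalg.inv(a)          # shape (4, 3, 3): one inverse per matrix
      print(dets.shape, invs.shape)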

Scikit-Learn

A popular Python ML toolbox; it has several modules relevant to this course:

  • $\verb|sklearn.covariance|$: Covariance Estimators
  • $\verb|sklearn.decomposition|$: Matrix Decomposition
  • $\verb|sklearn.linear_model|$: Linear Models
  • $\verb|sklearn.manifold|$: Manifold Learning
  • $\verb|sklearn.preprocessing|$: Preprocessing and Normalization

https://scikit-learn.org/stable/modules/classes.html

The Sklearn API

sklearn has an object-oriented interface. Most models/transforms/objects in sklearn are Estimator objects:

In [2]:
class Estimator(object):
  
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
    
model = Estimator()
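
The same fit/predict pattern with a concrete estimator (a minimal sketch using LinearRegression):

      import numpy as np
      from sklearn.linear_model import LinearRegression

      X = np.random.randn(100, 3)
      y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.randn(100)

      model = LinearRegression()
      model.fit(X, y)              # learn coefficients from X and y
      y_pred = model.predict(X)    # predict from features (here, the training features)
      print(model.coef_)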

Unsupervised - Transformer interface

Some estimators in the library implement this.
Unsupervised in this case refers to any method that does not need labels, including unsupervised classifiers, preprocessing (like tf-idf), dimensionality reduction, etc.

The transformer interface usually defines two additional methods:

  • model.transform: Given a fitted unsupervised model, transform the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns the transformed version of the input. Note: you need to fit() the model before you can transform data with it.
  • model.fit_transform: For some models you may not need to fit() and transform() separately; in these cases it is more convenient to do both at the same time (see the sketch after this list).
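
A minimal sketch with StandardScaler, a typical transformer (fit() learns per-column mean/std, transform() applies them):

      import numpy as np
      from sklearn.preprocessing import StandardScaler

      X = np.random.randn(100, 3) * 5 + 10

      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(X)    # fit and transform in one step
      X_again  = scaler.transform(X)        # reuse the fitted scaler on more data
      print(X_scaled.mean(axis=0), X_scaled.std(axis=0))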

partial_fit()

  • Used in place of the fit() method for incremental training
  • Applies to a single instance or a batch at a time
  • Updates the model's internal parameters

      est = SGDClassifier(...)
      est.partial_fit(X_train_1, y_train_1, classes=all_classes)  # classifiers must be given the full set of class labels on the first call
      est.partial_fit(X_train_2, y_train_2)
In [76]:
import numpy as np
from sklearn import linear_model

n_samples, n_features = 5000, 5

y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)

clf = linear_model.Ridge(alpha=0)

clf.fit(X, y) 
print(clf.coef_,clf.intercept_)
[-0.0160644  -0.01977289  0.0085692   0.0016327  -0.00955983] 0.004120867922031649
In [77]:
X@clf.coef_ + clf.intercept_
Out[77]:
array([-0.01666297,  0.01962315, -0.02500205, ...,  0.02935635,
        0.04442187,  0.02813384])
In [78]:
y
Out[78]:
array([ 1.90716894,  0.01640062, -0.09586444, ...,  1.12658549,
       -0.21788871,  0.98386351])
In [86]:
clfi = linear_model.SGDRegressor(alpha=0)
for k in range(0,1000):
    k_rand = np.random.randint(0,len(X)-10)
    clfi.partial_fit(X[k_rand:k_rand+10,:], y[k_rand:k_rand+10]) # incremental update on a random mini-batch of 10 samples
    if k%100==0:
        # note: the norm uses clf (the batch Ridge fit) as a fixed reference, so it stays constant
        print(clfi.coef_,clfi.intercept_,np.linalg.norm(X@clf.coef_ + clf.intercept_ - y))
[-0.03875006 -0.0061775  -0.0461975   0.02043416  0.00478203] [-0.00873014] 69.00954796242148
[-0.02145023  0.02916176 -0.00597924 -0.00624544  0.01222963] [0.03185706] 69.00954796242148
[ 0.01989001  0.03249868 -0.011325    0.02657303 -0.04167097] [0.04699459] 69.00954796242148
[ 0.00025748 -0.02033948 -0.01220119  0.02562552 -0.00033842] [0.02740075] 69.00954796242148
[-0.01409707 -0.07659565 -0.00462208  0.05957979  0.0042063 ] [0.00724766] 69.00954796242148
[-0.00485012 -0.03195498 -0.0177146   0.03209723  0.00536539] [0.00126791] 69.00954796242148
[-0.01612697 -0.02440589 -0.02230939 -0.01682379 -0.02080975] [-0.01813303] 69.00954796242148
[-0.04383133  0.00649149  0.02628502 -0.01349541  0.01671622] [-0.00081695] 69.00954796242148
[ 0.00435112 -0.03723621  0.01628536  0.00586213 -0.02188905] [-0.02691946] 69.00954796242148
[-0.0030079  -0.03191597 -0.00248743 -0.00211203 -0.05372476] [-0.01612772] 69.00954796242148
In [80]:
clfi = linear_model.SGDRegressor(alpha=0)
for k in range(0,10000-10):
    clfi.partial_fit(X[k:k+10,:], y[k:k+10]) # incremental update on a sliding mini-batch of 10 samples
    # NOTE: X only has 5000 rows, so the slice is empty once k reaches 5000 -- hence the error below
    if k%1000==0:
        print(clfi.coef_,clfi.intercept_)
[0.0154067  0.00465966 0.00438107 0.01999652 0.00370198] [0.01208624]
[-0.10507267 -0.10943268  0.11507153 -0.04308667  0.03831518] [0.09327391]
[ 0.00388918 -0.02699711  0.0976475  -0.02116281  0.0077606 ] [0.11682694]
[-0.08655026  0.062645   -0.07445008  0.00614652  0.0465812 ] [-0.01081084]
[ 0.01965576 -0.01328789 -0.07319845  0.08159582 -0.07790923] [0.00358737]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-80-286fcdbd75a6> in <module>
      1 clfi = linear_model.SGDRegressor(alpha=0)
      2 for k in range(0,10000-10):
----> 3     clfi.partial_fit(X[k:k+10,:], y[k:k+10]) # same as fit() for this case...
      4     if k%1000==0:
      5         print(clfi.coef_,clfi.intercept_)

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py in partial_fit(self, X, y, sample_weight)
   1154                                  learning_rate=self.learning_rate, max_iter=1,
   1155                                  sample_weight=sample_weight, coef_init=None,
-> 1156                                  intercept_init=None)
   1157 
   1158     def _fit(self, X, y, alpha, C, loss, learning_rate, coef_init=None,

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, sample_weight, coef_init, intercept_init)
   1100                      max_iter, sample_weight, coef_init, intercept_init):
   1101         X, y = check_X_y(X, y, "csr", copy=False, order='C', dtype=np.float64,
-> 1102                          accept_large_sparse=False)
   1103         y = y.astype(np.float64, copy=False)
   1104 

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    737                     ensure_min_features=ensure_min_features,
    738                     warn_on_dtype=warn_on_dtype,
--> 739                     estimator=estimator)
    740     if multi_output:
    741         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    568                              " minimum of %d is required%s."
    569                              % (n_samples, array.shape, ensure_min_samples,
--> 570                                 context))
    571 
    572     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 5)) while a minimum of 1 is required.

Is Spark Still Relevant: Spark vs Dask vs RAPIDS

https://www.youtube.com/watch?v=RRtqIagk93k

  • Dask is functionally equivalent to Spark
  • RAPIDS is orders of magnitude faster than both because it runs on GPUs
  • Spark = distributed dataframes on the JVM on CPUs
  • Dask = distributed dataframes on CPUs
  • RAPIDS = distributed dataframes on GPUs
  • Why are enterprises still using Spark?
    • risk (Spark is established, the others are brand new)
    • support, consulting (only one: Quansight), training

Ecosystem:

  • Dask - Graph: NetworkX; ML: Dask-ML, scikit-learn, XGBoost; DL: TensorFlow, Keras (see the sketch after this list)
  • Spark - Graph: GraphX; ML: MLlib, XGBoost; DL: ?
  • RAPIDS - Graph: cuGraph; ML: cuML; DL: Chainer, PyTorch, MXNet, DLPack
  • Visualization with Dask: Datashader + Dask
  • SQL: Dask has no SQL compiler, RAPIDS has BlazingSQL, Spark has Spark SQL
  • Streaming: Dask + Streamz, RAPIDS has cuStreamz, Spark has Spark Streaming
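
A minimal Dask DataFrame sketch (hypothetical file paths and column names), showing the pandas-like, lazily evaluated API:

      import dask.dataframe as dd

      df = dd.read_csv("data/part-*.csv")                    # lazy, partitioned dataframe
      result = df.groupby("key")["value"].mean().compute()   # nothing runs until .compute()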

Dask on Kubernetes

  • Anaconda - Dask Gateway

Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

https://pandas.pydata.org/pandas-docs/stable/index.html

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

DataFrame

  • a 2-dimensional labeled data structure with columns of potentially different types.
  • can think of it like a spreadsheet or SQL table, or a dict of Series objects
  • generally the most commonly used pandas object.
In [1]:
# https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html

import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print(df)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
In [2]:
pd.DataFrame(d, index=['d', 'b', 'a'])
Out[2]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
In [3]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[3]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN