Dealing with Large Datasets

I.e., the data resides in external memory (e.g., on hard drives), a.k.a. out-of-core processing.

The same ideas also apply to data arriving from a network, sensor samples, etc.

  • Incremental learning - the algorithm consumes a batch of data at a time rather than all of it at once (see the sketch after this list)
  • Streaming - a separate program loads the data and feeds it into the algorithm
  • Reducing data - feature extraction/selection, hashing trick, dimensionality reduction, etc.
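
A minimal out-of-core sketch (assuming a hypothetical big_data.csv with a target column): pandas reads the file in fixed-size chunks and a scikit-learn model is updated incrementally with partial_fit(), so only one chunk is in memory at a time.

      # hypothetical file and column names; illustrates the chunk-at-a-time pattern
      import pandas as pd
      from sklearn.linear_model import SGDRegressor

      model = SGDRegressor()
      for chunk in pd.read_csv("big_data.csv", chunksize=10000):
          y = chunk["target"].values
          X = chunk.drop(columns="target").values
          model.partial_fit(X, y)   # incremental update on this chunk only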

Approaches to parallelism

  • joblib
  • Hashing trick (see the sketch after this list)
  • scikit-learn partial_fit()
  • Online learning
  • Streaming
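
For the hashing trick, scikit-learn's HashingVectorizer is one option: tokens are mapped to a fixed number of columns by hashing, so no vocabulary has to be built or kept in memory, and the transform is stateless (no fit() needed). A minimal sketch:

      from sklearn.feature_extraction.text import HashingVectorizer

      vec = HashingVectorizer(n_features=2**18)      # fixed output width, independent of the data
      docs = ["first batch of streaming text", "second batch of streaming text"]
      X = vec.transform(docs)                        # stateless: works batch by batch
      print(X.shape)                                 # (2, 262144) sparse matrix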

numpy.linalg

  • Uses BLAS and LAPACK to provide efficient low-level implementations of standard linear algebra algorithms.
  • Those libraries may be provided by NumPy itself using C versions of a subset of their reference implementations
  • but, when possible, highly optimized libraries that take advantage of specialized processor functionality are preferred.
  • Examples of such libraries are OpenBLAS, MKL (TM), and ATLAS.
  • Because those libraries are multithreaded and processor dependent, environment variables and external packages such as threadpoolctl may be needed to control the number of threads or specify the processor architecture (see the sketch below).

https://docs.scipy.org/doc/numpy/reference/routines.linalg.html
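
For example, threadpoolctl can cap the number of BLAS threads used inside a numpy.linalg call (a minimal sketch; assumes the threadpoolctl package is installed):

      import numpy as np
      from threadpoolctl import threadpool_limits

      A = np.random.randn(2000, 2000)
      with threadpool_limits(limits=1, user_api="blas"):   # single-threaded BLAS inside this block
          w = np.linalg.eigvals(A)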

Linear algebra on several matrices at once

  • As of version 1.8.0, several of the linear algebra routines are able to compute results for several matrices at once, if they are stacked into the same array.
  • This is indicated in the documentation via input parameter specifications such as a : (..., M, M) array_like.
  • This means that, for instance, given an input array with a.shape == (N, M, M), it is interpreted as a “stack” of N matrices, each of size M-by-M (see the sketch after this list).
  • A similar specification applies to return values; for instance the determinant has det : (...), and in this case the routine returns an array with det(a).shape == (N,).
  • This generalizes to linear algebra operations on higher-dimensional arrays: the last 1 or 2 dimensions of a multidimensional array are interpreted as vectors or matrices, as appropriate for each operation.
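
A small sketch of this stacked behavior:

      import numpy as np

      a = np.random.randn(4, 3, 3)     # a "stack" of N = 4 matrices, each 3-by-3
      dets = np.linalg.det(a)          # shape (4,): one determinant per matrix
      invs = np.linalg.inv(a)          # shape (4, 3, 3): one inverse per matrix
      print(dets.shape, invs.shape)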

Scikit-Learn

A popular Python ML toolbox; it has several modules relevant to this course:

  • $\verb|sklearn.covariance|$: Covariance Estimators
  • $\verb|sklearn.decomposition|$: Matrix Decomposition
  • $\verb|sklearn.linear_model|$: Linear Models
  • $\verb|sklearn.manifold|$: Manifold Learning
  • $\verb|sklearn.preprocessing|$: Preprocessing and Normalization

https://scikit-learn.org/stable/modules/classes.html

The Sklearn API

sklearn has an object-oriented interface. Most models/transforms/objects in sklearn are Estimator objects:

In [2]:
class Estimator(object):
  
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
    
model = Estimator()
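
The same fit/predict pattern with a concrete estimator (a minimal sketch using LinearRegression):

      import numpy as np
      from sklearn.linear_model import LinearRegression

      X = np.random.randn(100, 3)
      y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.randn(100)

      model = LinearRegression()
      model.fit(X, y)              # learn coefficients from X and y
      y_pred = model.predict(X)    # predict from features (here, the training features)
      print(model.coef_)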

Unsupervised - Transformer interface

Some estimators in the library implement this.
Unsupervised in this case refers to any method that does not need labels, including unsupervised classifiers, preprocessing (like tf-idf), dimensionality reduction, etc.

The transformer interface usually defines two additional methods:

  • model.transform: Given a fitted unsupervised model, transform the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns the transformed version of the input. Note: you need to fit() the model before you can transform data with it.
  • model.fit_transform: For some models you may not need to fit() and transform() separately; in these cases it is more convenient to do both at the same time (see the sketch after this list).
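
A minimal sketch with StandardScaler, a typical transformer (fit() learns per-column mean/std, transform() applies them):

      import numpy as np
      from sklearn.preprocessing import StandardScaler

      X = np.random.randn(100, 3) * 5 + 10

      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(X)    # fit and transform in one step
      X_again  = scaler.transform(X)        # reuse the fitted scaler on more data
      print(X_scaled.mean(axis=0), X_scaled.std(axis=0))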

partial_fit()

  • Used in place of the fit() method for incremental training
  • Applies to a single instance or a batch at a time
  • Updates the model's internal parameters

      est = SGDClassifier(...)
      est.partial_fit(X_train_1, y_train_1, classes=all_classes)  # classifiers must be given the full set of class labels on the first call
      est.partial_fit(X_train_2, y_train_2)
In [76]:
import numpy as np
from sklearn import linear_model

n_samples, n_features = 5000, 5

y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)

clf = linear_model.Ridge(alpha=0)

clf.fit(X, y) 
print(clf.coef_,clf.intercept_)
[-0.0160644  -0.01977289  0.0085692   0.0016327  -0.00955983] 0.004120867922031649
In [77]:
X@clf.coef_ + clf.intercept_
Out[77]:
array([-0.01666297,  0.01962315, -0.02500205, ...,  0.02935635,
        0.04442187,  0.02813384])
In [78]:
y
Out[78]:
array([ 1.90716894,  0.01640062, -0.09586444, ...,  1.12658549,
       -0.21788871,  0.98386351])
In [86]:
clfi = linear_model.SGDRegressor(alpha=0)
for k in range(0,1000):
    k_rand = np.random.randint(0,len(X)-10)
    clfi.partial_fit(X[k_rand:k_rand+10,:], y[k_rand:k_rand+10]) # incremental update on a random mini-batch of 10 samples
    if k%100==0:
        # note: the norm uses clf (the batch Ridge fit) as a fixed reference, so it stays constant
        print(clfi.coef_,clfi.intercept_,np.linalg.norm(X@clf.coef_ + clf.intercept_ - y))
[-0.03875006 -0.0061775  -0.0461975   0.02043416  0.00478203] [-0.00873014] 69.00954796242148
[-0.02145023  0.02916176 -0.00597924 -0.00624544  0.01222963] [0.03185706] 69.00954796242148
[ 0.01989001  0.03249868 -0.011325    0.02657303 -0.04167097] [0.04699459] 69.00954796242148
[ 0.00025748 -0.02033948 -0.01220119  0.02562552 -0.00033842] [0.02740075] 69.00954796242148
[-0.01409707 -0.07659565 -0.00462208  0.05957979  0.0042063 ] [0.00724766] 69.00954796242148
[-0.00485012 -0.03195498 -0.0177146   0.03209723  0.00536539] [0.00126791] 69.00954796242148
[-0.01612697 -0.02440589 -0.02230939 -0.01682379 -0.02080975] [-0.01813303] 69.00954796242148
[-0.04383133  0.00649149  0.02628502 -0.01349541  0.01671622] [-0.00081695] 69.00954796242148
[ 0.00435112 -0.03723621  0.01628536  0.00586213 -0.02188905] [-0.02691946] 69.00954796242148
[-0.0030079  -0.03191597 -0.00248743 -0.00211203 -0.05372476] [-0.01612772] 69.00954796242148
In [80]:
clfi = linear_model.SGDRegressor(alpha=0)
for k in range(0,10000-10):
    clfi.partial_fit(X[k:k+10,:], y[k:k+10]) # incremental update on a sliding mini-batch of 10 samples
    # NOTE: X only has 5000 rows, so the slice is empty once k reaches 5000 -- hence the error below
    if k%1000==0:
        print(clfi.coef_,clfi.intercept_)
[0.0154067  0.00465966 0.00438107 0.01999652 0.00370198] [0.01208624]
[-0.10507267 -0.10943268  0.11507153 -0.04308667  0.03831518] [0.09327391]
[ 0.00388918 -0.02699711  0.0976475  -0.02116281  0.0077606 ] [0.11682694]
[-0.08655026  0.062645   -0.07445008  0.00614652  0.0465812 ] [-0.01081084]
[ 0.01965576 -0.01328789 -0.07319845  0.08159582 -0.07790923] [0.00358737]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-80-286fcdbd75a6> in <module>
      1 clfi = linear_model.SGDRegressor(alpha=0)
      2 for k in range(0,10000-10):
----> 3     clfi.partial_fit(X[k:k+10,:], y[k:k+10]) # same as fit() for this case...
      4     if k%1000==0:
      5         print(clfi.coef_,clfi.intercept_)

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py in partial_fit(self, X, y, sample_weight)
   1154                                  learning_rate=self.learning_rate, max_iter=1,
   1155                                  sample_weight=sample_weight, coef_init=None,
-> 1156                                  intercept_init=None)
   1157 
   1158     def _fit(self, X, y, alpha, C, loss, learning_rate, coef_init=None,

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, sample_weight, coef_init, intercept_init)
   1100                      max_iter, sample_weight, coef_init, intercept_init):
   1101         X, y = check_X_y(X, y, "csr", copy=False, order='C', dtype=np.float64,
-> 1102                          accept_large_sparse=False)
   1103         y = y.astype(np.float64, copy=False)
   1104 

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    737                     ensure_min_features=ensure_min_features,
    738                     warn_on_dtype=warn_on_dtype,
--> 739                     estimator=estimator)
    740     if multi_output:
    741         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\envs\PyTorch1_3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    568                              " minimum of %d is required%s."
    569                              % (n_samples, array.shape, ensure_min_samples,
--> 570                                 context))
    571 
    572     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 5)) while a minimum of 1 is required.

Is Spark Still Relevant: Spark vs Dask vs RAPIDS

https://www.youtube.com/watch?v=RRtqIagk93k

  • Dask is functionally equivalent to Spark
  • RAPIDS is orders of magnitude faster than both because it runs on GPUs
  • Spark = distributed dataframes on the JVM on CPUs
  • Dask = distributed dataframes on CPUs
  • RAPIDS = distributed dataframes on GPUs
  • Why are enterprises still using Spark?
    • risk (Spark is established, the others are brand new)
    • support, consulting (only one: Quansight), training

Ecosystem:

  • Dask - Graph: NetworkX; ML: Dask-ML, scikit-learn, XGBoost; DL: TensorFlow, Keras (see the sketch after this list)
  • Spark - Graph: GraphX; ML: MLlib, XGBoost; DL: ?
  • RAPIDS - Graph: cuGraph; ML: cuML; DL: Chainer, PyTorch, MXNet, DLPack
  • Visualization with Dask: Datashader + Dask
  • SQL: Dask has no SQL compiler, RAPIDS has BlazingSQL, Spark has Spark SQL
  • Streaming: Dask + Streamz, RAPIDS has cuStreamz, Spark has Spark Streaming
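
A minimal Dask DataFrame sketch (hypothetical file paths and column names), showing the pandas-like, lazily evaluated API:

      import dask.dataframe as dd

      df = dd.read_csv("data/part-*.csv")                    # lazy, partitioned dataframe
      result = df.groupby("key")["value"].mean().compute()   # nothing runs until .compute()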

Dask on Kubernetes

  • Anaconda - Dask Gateway

Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

https://pandas.pydata.org/pandas-docs/stable/index.html

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

DataFrame

  • a 2-dimensional labeled data structure with columns of potentially different types.
  • can think of it like a spreadsheet or SQL table, or a dict of Series objects
  • generally the most commonly used pandas object.
In [1]:
# https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html

import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print(df)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
In [2]:
pd.DataFrame(d, index=['d', 'b', 'a'])
Out[2]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
In [3]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[3]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN