BDS 754: Python for Data Science


Topic 15: Python Data Science Packages

Many popular packages (including popular AI frameworks) are built on NumPy or on "swappable" replacements for it.

Harris, Charles R., et al. "Array programming with NumPy." Nature 585.7825 (2020): 357-362.

Matplotlib¶

The most common Python visualization tool

  • "Create publication quality plots."
  • "Make interactive figures that can zoom, pan, update."
  • "Customize visual style and layout."
  • "Export to many file formats."
  • "Embed in JupyterLab and Graphical User Interfaces."

https://github.com/amueller/scipy-2017-sklearn/blob/master/notebooks/02.Scientific_Computing_Tools_in_Python.ipynb

http://matplotlib.org

Pyplot¶

A module within matplotlib which provides a relatively simple interface, modelled after MATLAB.

It can be used through direct "state-based" calls, the style closest to MATLAB:

In [27]:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 15, 0.1)
y = np.sin(x)
plt.figure()
plt.plot(x, y)
plt.show()

PyPlot Object-Oriented API¶

For much more control and many more features, you must directly access the matplotlib classes (such as Figure and Axes) via the object-oriented API.

This is usually necessary to get a plot to look exactly the way you want it.

https://matplotlib.org/stable/api/pyplot_summary.html

In [206]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

fruits = ['apple', 'blueberry', 'cherry', 'orange']
counts = [40, 100, 30, 55]
bar_labels = ['red', 'blue', '_red', 'orange']
bar_colors = ['tab:red', 'tab:blue', 'tab:red', 'tab:orange']

ax.bar(fruits, counts, label=bar_labels, color=bar_colors)
ax.set_ylabel('fruit supply')
ax.set_title('Fruit supply by kind and color')
ax.legend(title='Fruit color')
plt.show()
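
As a rough sketch of the extra control the object-oriented interface gives, the cell below places two plots side by side, each styled through its own Axes object. The data, titles, and colors are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# one figure containing two axes arranged in a single row
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# each Axes object is configured independently through its own methods
axes[0].plot(x, np.sin(x), color='tab:blue')
axes[0].set_title('sin(x)')
axes[0].set_xlabel('x')

axes[1].plot(x, np.cos(x), color='tab:red', linestyle='--')
axes[1].set_title('cos(x)')
axes[1].set_xlabel('x')

fig.tight_layout()  # reduce overlap between the two panels
```

With the state-based interface the same layout needs repeated calls to switch the "current" axes; here each panel is addressed explicitly.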

Inline Jupyter plots¶

Jupyter has a built-in magic function for the "matplotlib inline" mode, which draws plots directly inside the notebook. It should be on by default.

%matplotlib inline

Without this mode, or in plain IPython, call the .show() function to display the figure:

plt.show()

https://ipython.org/ipython-doc/3/interactive/magics.html

Line plots¶

Accepts various types of inputs (list, array, ...)

In [78]:
plt.plot([0,1,-2,3,-4,5]);

Plot x versus y

In [65]:
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y);

Colors automatically changed for multiple traces

In [54]:
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x));
plt.plot(x, np.sin(x)+0.5);
plt.plot(x, np.sin(x)+1);

Or select trace style

In [67]:
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x) ,    'ko');
plt.plot(x, np.sin(x)+0.5, 'rs');
plt.plot(x, np.sin(x)+1,   'g.');

Scatter plots¶

In [68]:
x = np.random.normal(size=500)
y = np.random.normal(size=500)
plt.scatter(x, y);

Images with imshow()¶

Inputs can be $M\times N$ matrices (displayed through a colormap) or $M\times N \times 3$ arrays representing color.

Note that the origin is at the top-left by default.

In [208]:
import numpy as np
import matplotlib.pyplot as plt

im = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])

plt.imshow(im);
plt.colorbar();
plt.xlabel('x')
plt.ylabel('y');

Contour plots¶

Note that the origin here is at the bottom-left by default.

In [71]:
plt.contour(im);
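
The difference in origin conventions is easy to trip over when overlaying contours on an image. A minimal sketch of one way to reconcile them, assuming you want the image to follow the contour convention, is to pass origin='lower' to imshow():

```python
import numpy as np
import matplotlib.pyplot as plt

im = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])

fig, (ax_default, ax_lower) = plt.subplots(1, 2, figsize=(6, 3))

# default behaviour: row 0 of the array is drawn at the top
ax_default.imshow(im)
ax_default.set_title("imshow default")

# origin='lower': row 0 is drawn at the bottom, matching contour()
ax_lower.imshow(im, origin='lower')
ax_lower.contour(im, colors='white')
ax_lower.set_title("origin='lower' + contour")
```

In the left panel the y axis runs downward (inverted); in the right panel it runs upward, so the contour lines line up with the pixels they describe.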

Loading examples¶

There are many more plot types available.

See the matplotlib gallery at http://matplotlib.org/gallery.html

To run an example, copy the Source Code link near the bottom of its page and put it in a notebook using the %load magic.

%load <link goes here>

Then run cell

In [75]:
%load http://matplotlib.org/mpl_examples/pylab_examples/ellipse_collection.py

Saving figure as image¶

Call plt.savefig(filename) after forming the plot. The file format is inferred from the filename extension.

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html

In [210]:
plt.plot([0,1,-2,3,-4,5]);
plt.savefig("test_savefig_example.pdf") # png, jpg, ...
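
For publication use, a couple of savefig options are often worth setting. The sketch below is illustrative only: the file names are hypothetical and a temporary directory is used so the working directory is not cluttered.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, -2, 3, -4, 5])
ax.set_xlabel('index')
ax.set_ylabel('value')

# hypothetical output locations inside a fresh temporary directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'example.png')
pdf_path = os.path.join(out_dir, 'example.pdf')

# dpi sets the raster resolution; bbox_inches='tight' trims surrounding whitespace
fig.savefig(png_path, dpi=300, bbox_inches='tight')
fig.savefig(pdf_path, bbox_inches='tight')  # vector format

print(os.path.getsize(png_path) > 0, os.path.getsize(pdf_path) > 0)
```

Calling savefig on the Figure object (rather than plt.savefig) makes it explicit which figure is being written when several are open.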

Beyond Matplotlib¶

Achieving nice-looking figures often requires a lot of code, since you must specify every detail of the figure carefully.

You may find yourself going back to research how to achieve each detail whenever it comes time to publish.

Alternatives:

  • Seaborn - for simple statistical plotting, built on matplotlib. https://seaborn.pydata.org/
  • Plotly - commercial product with free tier. Popular for interactive plots, enterprise use. https://plotly.com/python/
  • Bokeh - for interactive visualization. https://docs.bokeh.org/en/latest/

SciPy¶

Extends NumPy with additional tools for array computing and provides specialized data structures, such as sparse matrices and k-dimensional trees.

Wraps highly-optimized implementations written in low-level languages like Fortran, C, and C++.

https://scipy.org/

SciPy is a small collection of numerical modules¶

  • Numerical integration (scipy.integrate)
  • Optimization (scipy.optimize)
  • Interpolation (scipy.interpolate)
  • Signal Processing (scipy.signal)
  • Linear Algebra (scipy.linalg)
  • Statistics (scipy.stats)
  • File IO (scipy.io)
In [7]:
import scipy
printcols(dir(scipy),2)
LowLevelCallable   ndimage            
__numpy_version__  odr                
__version__        optimize           
cluster            show_config        
fft                signal             
fftpack            sparse             
integrate          spatial            
interpolate        special            
io                 stats              
linalg             test               
misc                                  
In [97]:
import numpy.linalg
#[fun for fun in dir(numpy.linalg) if not fun.startswith('_')]
In [96]:
import scipy.linalg
#[fun for fun in dir(scipy.linalg) if not fun.startswith('_')]
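
To give a feel for how these modules are used, here is a small sketch touching two of them. The integrand, matrix, and right-hand side are arbitrary examples; scipy.linalg.solve plays the same role as numpy.linalg.solve.

```python
import numpy as np
from scipy import integrate, linalg

# numerical integration: integral of sin(x) from 0 to pi (the exact value is 2)
value, abs_err = integrate.quad(np.sin, 0, np.pi)
print(value, abs_err)  # estimate and an estimate of its absolute error

# linear algebra: solve A @ x = b for x
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)                      # solution vector
print(np.allclose(A @ x, b))  # check that it satisfies the system
```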

scipy.stats¶

In [95]:
from scipy.stats import norm

print(norm.pdf(0))        
print(norm.cdf(1))     
print(norm.rvs(size=10)) # random samples
0.3989422804014327
0.8413447460685429
[ 0.85099468 -0.82656877 -0.09945668 -0.07761511 -0.59359535 -2.48334685
 -1.16963941  0.55575344  0.62577528  1.06376605]
In [217]:
#printcols([fun for fun in dir(scipy.stats) if not fun.startswith('_')],4)
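
Distributions in scipy.stats can also be "frozen" with fixed parameters and then queried. The following is a small sketch; the mean of 100 and standard deviation of 15 are arbitrary illustrative values.

```python
from scipy.stats import norm

# a "frozen" normal distribution with mean 100 and standard deviation 15
dist = norm(loc=100, scale=15)

print(dist.cdf(115))    # P(X <= 115), i.e. one standard deviation above the mean
print(dist.ppf(0.5))    # median; ppf is the inverse of cdf
print(norm.ppf(0.975))  # standard-normal quantile used for 95% intervals

# random_state makes the random samples reproducible
samples = dist.rvs(size=1000, random_state=0)
print(samples.mean(), samples.std())
```

The first value equals norm.cdf(1) from the cell above, since 115 is exactly one standard deviation above the mean.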

Optimization¶

In [81]:
from scipy.optimize import minimize
import numpy as np

def f(x):
    return (x - 3)**2

res = minimize(f, x0=0)
res      
Out[81]:
      fun: 2.5388963550532293e-16
 hess_inv: array([[0.5]])
      jac: array([-1.69666681e-08])
  message: 'Optimization terminated successfully.'
     nfev: 6
      nit: 2
     njev: 3
   status: 0
  success: True
        x: array([2.99999998])
In [212]:
printcols([fun for fun in dir(scipy.optimize) if not fun.startswith('_')],4)
BFGS                      brute                     fsolve                    nnls                      
Bounds                    check_grad                golden                    nonlin                    
HessianUpdateStrategy     cobyla                    lbfgsb                    optimize                  
LbfgsInvHessProduct       curve_fit                 least_squares             quadratic_assignment      
LinearConstraint          diagbroyden               leastsq                   ridder                    
NonlinearConstraint       differential_evolution    line_search               root                      
OptimizeResult            direct                    linear_sum_assignment     root_scalar               
OptimizeWarning           dual_annealing            linearmixing              rosen                     
RootResults               excitingmixing            linesearch                rosen_der                 
SR1                       fixed_point               linprog                   rosen_hess                
anderson                  fmin                      linprog_verbose_callback  rosen_hess_prod           
approx_fprime             fmin_bfgs                 lsq_linear                shgo                      
basinhopping              fmin_cg                   milp                      show_options              
bisect                    fmin_cobyla               minimize                  slsqp                     
bracket                   fmin_l_bfgs_b             minimize_scalar           test                      
brent                     fmin_ncg                  minpack                   tnc                       
brenth                    fmin_powell               minpack2                  toms748                   
brentq                    fmin_slsqp                moduleTNC                 zeros                     
broyden1                  fmin_tnc                  newton                                              
broyden2                  fminbound                 newton_krylov                                       
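
Several names in this listing are test utilities. As a sketch of a multi-variable problem, the cell below minimizes the Rosenbrock function (rosen) and supplies its analytic gradient (rosen_der); the starting point and the choice of BFGS are illustrative assumptions, not recommendations.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])  # a commonly used starting point for this test function

# rosen is the Rosenbrock function, rosen_der its analytic gradient
res = minimize(rosen, x0, jac=rosen_der, method='BFGS')

print(res.success)
print(res.x)    # should be close to the known minimum at [1, 1]
print(res.fun)  # should be close to 0
```

Passing jac avoids the finite-difference gradient estimates that minimize falls back on otherwise, which usually reduces the number of function evaluations.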

SkLearn¶

Machine learning functions. Previously the dominant machine learning toolbox.

It was falling out of favor as the field moved from single-CPU to GPU and distributed computing, but it is making a comeback with upgrades for scaling and parallelism.

https://scikit-learn.org/stable/index.html

The Supervised Learning "API"¶

  1. Given training data $(\mathbf x_{(i)},y_i)$ for $i=1,...,m$ --> lists of samples $\mathbf X$ and labels $\mathbf y$

  2. Choose a model $f(\cdot)$ where we want to make $f(\mathbf x_{(i)})\approx y_i$ (for all $i$) --> choose sklearn estimator to use

  3. Define a loss function $L(f(\mathbf x), y)$ to minimize by changing $f(\cdot)$ (in practice, by adjusting its weights) --> default choices for estimators, sometimes multiple options

In traditional machine learning the models were support vector machines, decision trees, various kinds of regression models (linear, polynomial, logistic).

In modern AI they are deep learning "architectures".

Unlike traditional statistics, machine learning is about building a tool to predict things, as opposed to doing science or data analysis. Once we have "fit" a model, we want to use it on new data to make decisions.

$$f(\mathbf x_{new})\approx ?$$

Most methods provide no uncertainty quantification: no p-values, confidence intervals, etc.

The Sklearn API¶

Scikit-learn provides a uniform API over many statistical and machine learning models.

model = ModelClass(...)   # constructor. specify hyperparameters
model.fit(X, y)           # estimate model parameters from data
y_pred = model.predict(X) # generate predictions from new data

Most models/transforms/objects in sklearn are Estimator objects

In [2]:
class Estimator(object):
  
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
    
model = Estimator()
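
The skeleton above is schematic. To make the fit/predict pattern concrete, here is a toy estimator that follows the same interface. MeanRegressor is a hypothetical class written for this illustration and is not part of scikit-learn.

```python
import numpy as np

class MeanRegressor:
    """Toy estimator (hypothetical): always predicts the training mean."""

    def fit(self, X, y):
        # "learning" here is just storing the mean of the training responses
        self.mean_ = float(np.mean(y))
        return self  # returning self allows chaining: model.fit(X, y).predict(X)

    def predict(self, X):
        # one prediction per row of X
        return np.full(len(X), self.mean_)

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

model = MeanRegressor().fit(X, y)
print(model.predict([[10], [20]]))
```

The trailing underscore in mean_ mirrors the scikit-learn convention for attributes learned during fit, such as coef_ and intercept_.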

Linear regression example¶

The model: $$ y = \beta_0 + \beta_1 x$$

Given data $(\mathbf x_{(i)},y_i)$, we want to fit the model, which means choose $\beta_0$ and $\beta_1$.

In [219]:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # features (list of lists convention)
y = np.array([2, 4, 5, 4, 5])             # response
In [220]:
from matplotlib.pyplot import *
figure(figsize=(5,2))
plot(X,y, '.-');
In [115]:
#help(LinearRegression)

Fitting the model¶

Scikit-learn will internally compute the least-squares solution.

In [118]:
model = LinearRegression() # construct estimator instance
model.fit(X, y); # fit parameters using data
In [123]:
print(model.coef_)      # slope
print(model.intercept_) # intercept
[0.6]
2.1999999999999993
In [124]:
model.score(X,y)      # R^2 implemented for this model
Out[124]:
0.6000000000000001
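
To see what the fit and the score amount to, the same numbers can be reproduced with plain NumPy. This is a sketch of the textbook least-squares computation, not a description of scikit-learn's internal code path.

```python
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# design matrix with a column of ones for the intercept: each row is [1, x]
A = np.hstack([np.ones((len(X), 1)), X])

# least-squares solution of A @ beta ≈ y
beta, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
beta0, beta1 = beta
print(beta0, beta1)  # intercept and slope

# R^2 = 1 - SS_res / SS_tot
y_hat = A @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

Up to floating-point rounding, the intercept, slope, and $R^2$ agree with model.intercept_, model.coef_, and model.score(X, y) above.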

Making predictions using the model¶

Here we're using the original training data to view how well it was fit

In [120]:
y_pred = model.predict(X)
y_pred
Out[120]:
array([2.8, 3.4, 4. , 4.6, 5.2])
In [221]:
model.intercept_ + model.coef_*X
Out[221]:
array([[2.8],
       [3.4],
       [4. ],
       [4.6],
       [5.2]])
In [128]:
figure(figsize=(5,2))
plot(X,y, '.-');
plot(X,y_pred, '.-');

We can apply our "trained" model to anything now

In [223]:
model.predict([[99]]) # note input format: a list of lists of feature values
Out[223]:
array([61.6])
In [225]:
model.predict([[-100]]) # note input format: a list of lists of feature values
Out[225]:
array([-57.8])
In [224]:
model.predict([[1e6]]) 
Out[224]:
array([600002.2])

Other notes¶

Extra info and variations for different methods are squeezed into the interface

  • model.predict_proba: For classifiers that have a notion of probability (or some measure of confidence in a prediction) this method returns those probabilities.
  • model.score: For both classification and regression models, this method returns some measure of how well the model fits the given data. In regression the default is typically R^2, and in classification it is accuracy.
  • Clustering methods - fit(X), note lack of y here, computes clusters.
  • Dim reduction (e.g., PCA) fit_transform(X) to compute the dimension transformation
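
A brief sketch of the first, second, and last of these variations follows. The six-point data set is made up, and LogisticRegression and PCA are used with default settings purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# tiny synthetic classification problem: class 0 on the left, class 1 on the right
X = np.array([[0.0, 0.1], [0.5, 0.2], [1.0, 0.0],
              [3.0, 0.1], [3.5, 0.3], [4.0, 0.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)  # shape (n_samples, n_classes); each row sums to 1
print(proba.round(2))
print(clf.score(X, y))        # mean accuracy for classifiers

# dimensionality reduction: project the 2 features onto 1 principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```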

Clustering example¶

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [227]:
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: two clusters
X = np.array([[1, 2], [1, 3], [2, 2], [2, 3], [-8, -7], [-8, -8], [-9, -7], [-9, -8]])

figure(figsize=(5,2))
scatter(X[:,0],X[:,1]);
In [232]:
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_ # cluster assignments
centers = kmeans.cluster_centers_ # cluster centers
print(centers)

figure(figsize=(5,2))
scatter(X[:,0],X[:,1]);
scatter(centers[:,0],centers[:,1]);
[[-8.5 -7.5]
 [ 1.5  2.5]]
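
A fitted clustering model can also assign new points to the learned clusters. The sketch below reuses the data above; the two new points are made up, and n_init is set explicitly because its default has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 3], [2, 2], [2, 3],
              [-8, -7], [-8, -8], [-9, -7], [-9, -8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# assign previously unseen points to the nearest learned cluster center
new_points = np.array([[0, 1], [-10, -10]])
new_labels = kmeans.predict(new_points)
print(new_labels)

# label numbers are arbitrary; what matters is which points share a label
print(new_labels[0] == kmeans.labels_[0])  # [0, 1] joins the cluster of [1, 2]
print(new_labels[1] == kmeans.labels_[4])  # [-10, -10] joins the cluster of [-8, -7]
```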

Modern Machine Learning with scikit-learn¶

A large focus of modern AI development is in "democratizing" the technologies.

"NVIDIA cuML is an open-source CUDA-X™ Data Science library that accelerates scikit-learn, UMAP, and HDBSCAN on GPUs—supercharging machine learning workflows with no code changes required. "

https://developer.nvidia.com/blog/scikit-learn-tutorial-beginners-guide-to-gpu-accelerated-ml-pipelines/

https://developer.nvidia.com/topics/ai/data-science/cuda-x-data-science-libraries/cuml

Pandas¶

Pandas is a library for tabular data analysis built on NumPy, taking a database-style view to n-dimensional arrays (with n=2).

Pandas adds labeled axes, heterogeneous columns, and database-like data manipulation.

https://pandas.pydata.org/

Dataframe as packaging for a 2D array¶

In [166]:
import numpy as np
import pandas as pd

arr = np.array([[1, 2],[3, 4]])
print(arr)
print('')
df = pd.DataFrame(arr, columns=["A", "B"], index=["r1", "r2"])
print(df)
[[1 2]
 [3 4]]

    A  B
r1  1  2
r2  3  4
In [161]:
df.to_numpy()
Out[161]:
array([[1, 2],
       [3, 4]])
In [200]:
df.describe()
Out[200]:
A B
count 2.000000 2.000000
mean 2.000000 3.000000
std 1.414214 1.414214
min 1.000000 2.000000
25% 1.500000 2.500000
50% 2.000000 3.000000
75% 2.500000 3.500000
max 3.000000 4.000000

Semantic labels enable semantic operations¶

In [192]:
df = pd.DataFrame([[1, 2],[3, 4]], columns=["A", "B"], index=["r1", "r2"])
print(df)
    A  B
r1  1  2
r2  3  4
In [193]:
df["A"]
Out[193]:
r1    1
r2    3
Name: A, dtype: int64
In [194]:
df["A"] + df["B"]
Out[194]:
r1    3
r2    7
dtype: int64

Access rows via .loc[]¶

In [198]:
print(df)
df.loc['r1']
    A  B
r1  1  2
r2  3  4
Out[198]:
A    1
B    2
Name: r1, dtype: int64
In [199]:
df.loc['r1']+df.loc['r2']
Out[199]:
A    4
B    6
dtype: int64
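
Note that .loc is indexed with square brackets and works on labels. As a short sketch of related selection styles on the same dataframe, .iloc selects by integer position and a boolean mask selects rows by condition:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"], index=["r1", "r2"])

print(df.loc["r1", "B"])    # label-based: row 'r1', column 'B'
print(df.iloc[0, 1])        # position-based: first row, second column (same cell)
print(df.loc[df["A"] > 1])  # boolean mask: rows where column A exceeds 1
```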

Relational alignment¶

In [178]:
df1 = pd.DataFrame([[1, 2],[3, 4]], columns=["A", "B"], index=["r1", "r2"])
df2 = pd.DataFrame([[.1, .2],[.3, .4]], columns=["B", "C"], index=["r1", "r2"])
print(df1)
print(df2)
    A  B
r1  1  2
r2  3  4
      B    C
r1  0.1  0.2
r2  0.3  0.4
In [179]:
print(df1+df2)
     A    B   C
r1 NaN  2.1 NaN
r2 NaN  4.3 NaN
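
Columns present in only one of the frames produce NaN because there is nothing to add them to. Two possible ways to handle this, sketched under the assumption that a missing column should either count as zero or be dropped:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"], index=["r1", "r2"])
df2 = pd.DataFrame([[.1, .2], [.3, .4]], columns=["B", "C"], index=["r1", "r2"])

# option 1: treat a column missing from one frame as 0 instead of producing NaN
total = df1.add(df2, fill_value=0)
print(total)

# option 2: keep only the columns the two frames have in common
shared = df1.columns.intersection(df2.columns)
print(df1[shared] + df2[shared])
```

Which option is appropriate depends on what a missing column means in your data; silently filling with zero can hide a genuine mismatch.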

Dataframe as packaging for collection of 1D arrays¶

Each column can have its own dtype and is stored as a pandas Series object.

In [233]:
df = pd.DataFrame({
    'age': np.array([23, 45, 31]),
    'name': ['Alice', 'Bob', 'Carol'], 
    'score': [88.5, 92.0, 79.5] }) 
print(df)
   age   name  score
0   23  Alice   88.5
1   45    Bob   92.0
2   31  Carol   79.5
In [234]:
df['age']
Out[234]:
0    23
1    45
2    31
Name: age, dtype: int32
In [235]:
type(df), type(df['age'])
Out[235]:
(pandas.core.frame.DataFrame, pandas.core.series.Series)
In [236]:
list(df['age'])
Out[236]:
[23, 45, 31]
In [238]:
df['age'].mean()
Out[238]:
33.0
In [239]:
df['age'].median()
Out[239]:
31.0
In [240]:
df.describe()
Out[240]:
age score
count 3.000000 3.000000
mean 33.000000 86.666667
std 11.135529 6.448514
min 23.000000 79.500000
25% 27.000000 84.000000
50% 31.000000 88.500000
75% 38.000000 90.250000
max 45.000000 92.000000
In [241]:
df.describe().loc['50%']
Out[241]:
age      31.0
score    88.5
Name: 50%, dtype: float64

Importing data¶

Many formats can be loaded directly into a dataframe. Besides those shown here, there are readers for HTML, the clipboard, Parquet, and Feather.

There are corresponding writer methods (.to_csv(), etc.) for saving.

df = pd.read_csv("data.csv")
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df = pd.read_json("data.json")
df = pd.read_pickle("data.pkl")

import sqlite3
conn = sqlite3.connect("db.sqlite")
df = pd.read_sql("SELECT * FROM my_table", conn)  # my_table: placeholder name ("table" is a reserved SQL keyword)
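
The calls above assume that files such as data.csv already exist. The following self-contained sketch shows a write/read round trip without touching the disk, using an in-memory text buffer and an in-memory SQLite database; the table name scores is a hypothetical choice.

```python
import io
import sqlite3
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol'],
                   'score': [88.5, 92.0, 79.5]})

# CSV round trip through an in-memory text buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row index as a column
buffer.seek(0)
df_csv = pd.read_csv(buffer)
print(df_csv)

# SQL round trip through an in-memory SQLite database
conn = sqlite3.connect(':memory:')
df.to_sql('scores', conn, index=False)  # 'scores' is a hypothetical table name
df_sql = pd.read_sql('SELECT * FROM scores WHERE score > 80', conn)
conn.close()
print(df_sql)
```

Pushing the filter into the SQL query means only the matching rows are loaded into the dataframe.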

Beyond Pandas¶

pandas was originally created for in-memory, single-threaded operation.

Packages like Polars and NVIDIA RAPIDS offer dataframe interfaces in a similar spirit while adding support for parallel, GPU-accelerated, or distributed operation.

Recap¶

  • numpy/scipy - lots of modules building on fast linear algebra ("array processing") operations.
  • matplotlib - figure generation. use examples/AI rather than trying to learn it
  • scikit-learn - the fit/predict API for a wide variety of models.
  • pandas - dataframe view of 2D arrays, with relational database functionality

These are the core packages for data science in the traditional single-CPU, in-RAM mode of operation.

Major subsequent packages build on them to overcome their limitations.