BDS 761: Data Science and Machine Learning I



Topic 1: Introduction

This topic:¶

  1. Class topics
  2. Syllabus
  3. Software installation
  4. Q&A Discussion

Reading:

  • https://jupyter.org/try-jupyter/notebooks/?path=notebooks/Intro.ipynb
  • https://herbsutter.com/welcome-to-the-jungle/
  • http://www.incompleteideas.net/IncIdeas/BitterLesson.html

I. Class topics¶

Course Theme¶

The key tools and approaches driving modern Artificial Intelligence and related fields:

  • Dealing with vast quantities of data
  • Algorithms that can scale linearly with data size
  • Algorithms that can take advantage of parallel processing

General Topic List¶

We will focus on methods and tools in the following broad areas:

  1. Matrix algebra methods and software
  2. Regression and intro to Deep Learning
  3. Scalable computation
  4. Scaling data size
  5. Embedding methods
  6. Basic methods in Machine Learning
  7. Text processing and Natural Language Processing

Objectives of this class¶

  • Be able to use "core" Python libraries in your research
  • Understand how state-of-the-art (SOTA) A.I. methods are broadly based on these same libraries
  • Be able to implement basic processing and a few machine learning algorithms from "scratch"
  • Generally understand what is going on in research publications

II. Syllabus Discussion¶

  • Homework and readings will be provided at the end of class or via an announcement later that evening. They are due at the beginning of the following class. Points will be deducted for showing up late.
  • No particular textbook needed
  • A computer is needed to participate in class.
  • Attendance not mandatory (?). Will attempt to record classes. Please do not come to class with anything contagious.
  • Academic integrity: you may discuss assignments verbally. Do not share work or copy fellow students' writing or code. Be very careful about basing your work on code from the internet.
  • Office hours TBD.

Course Information¶

  • Labs/Participation/Homework - 20%
  • Quizzes - 10%
  • Midterm - 30%
  • Final Exam - 40%

The point of lab/participation/homework is to encourage you to study and learn. These are easy points, but don't get overconfident or complacent: they are 20% of the grade.

The point of exams is to decide your grade.

Prerequisites: programming + math¶

  • Programming skills. We will be using Python. If the amount of work seems overwhelming, it is most likely due to a deficiency here. Tasks which should take minutes may take you hours if you get stuck trying to hack together an approach with Google or A.I.

  • Vector geometry

  • Matrix Algebra

  • Probability and statistics won't be used much

Prereqs exist for very good reasons.

Books¶

There is no required text. There is a vast supply of free resources online. Suggestions:

  • Introduction to Applied Linear Algebra, Boyd & Vandenberghe 2018, http://vmls-book.stanford.edu/

  • Speech and Language Processing, 3e, Jurafsky & Martin 2024. https://web.stanford.edu/~jurafsky/slp3/

  • Data science tutor GPT: https://chatgpt.com/g/g-SSBhmwHol-introductory-data-science-tutor

  • Some Python and linear algebra review material: https://www.keithdillon.com/index.php/bootcamp/

Academic Integrity, etc.¶

  • See student handbook. This is your contract.
  • Fairness will not be sacrificed for other noble causes
  • Big source of drama: students skipping class or not doing homework then being unhappy with exams they could not handle as a result

III. Software Installation¶

Jupyter - "notebooks" for inline code + LaTeX math + markup, etc.¶

A notebook is a single document containing a series of "cells", each containing code which can be run, or images and other documentation.

  • Run a cell via [shift] + [Enter] or the "play" button in the menu.

Running a cell executes its code and displays the result below, or renders its markup.

Notebooks can also run R or Julia (easily), or MATLAB, SQL, etc. (with increasing difficulty).

In [1]:
import datetime

print("This code is run right now (" + str(datetime.datetime.now()) + ")")

'hi'
This code is run right now (2025-08-26 14:41:26.932482)
Out[1]:
'hi'
In [3]:
x=1+2+2

print(x)
5
In [4]:
import numpy as np
In [9]:
np.random.randn(2,5)
Out[9]:
array([[ 1.24350758,  1.99906955, -0.3226366 , -0.98266019, -0.1309466 ],
       [-0.85026968, -0.35865037,  0.70637075,  1.06492839,  0.35220974]])
In [12]:
np.ones((2,2))
Out[12]:
array([[1., 1.],
       [1., 1.]])

Installation¶

First project: get Jupyter running and be able to import the listed tools.

Easiest to install via Anaconda. Preferably Python 3.

https://www.anaconda.com/download/

Highly recommended: make a separate environment for this class. Hot open-source tools change fast and constantly deprecate (i.e. break) old features.

conda install jupyter matplotlib numpy scipy scikit-learn pandas ...

Many other packages...
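
One way to check the first project is to try importing each tool and print its version. This is a minimal sketch, not an official checklist: the package list is taken from the conda command above, `check_packages` is a hypothetical helper name, and it assumes each package exposes a `__version__` attribute (it falls back to "unknown" otherwise).

```python
import importlib

# Import names for the packages in the conda command above.
# scikit-learn is imported as "sklearn"; jupyter is left out because it is
# the application hosting the notebook rather than a library you import.
PACKAGES = ["numpy", "scipy", "matplotlib", "sklearn", "pandas"]


def check_packages(names):
    """Return {name: version string, or None if the import failed}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


for name, version in check_packages(PACKAGES).items():
    print(f"{name:12s} {version if version is not None else 'NOT INSTALLED'}")
```

Any line reading NOT INSTALLED points to a package that still needs to be added to the class environment.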

Python Help Tips¶

  • Get help on a function or object via [shift] + [tab] after the opening parenthesis, e.g. function(
  • Can also get help by executing function?
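
The trailing `?` is specific to IPython/Jupyter, but similar information is available in any Python session through the standard library. A minimal sketch, assuming NumPy is installed; `summarize` is a hypothetical helper name introduced here for illustration.

```python
import inspect

import numpy as np


def summarize(func):
    """Return (signature, first docstring line) for a Python-level callable."""
    signature = str(inspect.signature(func))
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return signature, first_line


sig, summary = summarize(np.ones)
print("np.ones" + sig)   # the call signature
print(summary)           # one-line description from the docstring

# For the full documentation text, call help(np.ones).
```

Note that `inspect.signature` can raise `ValueError` for some functions implemented in C, so this helper is only a convenience for Python-level callables.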

IV. Q & A Discussion¶

  • Recording of classes
  • Job interests/plans?
  • Research topics?

V. Class Motivation¶

Hardware: the real driver¶

The supposed end of Moore's Law

Note the logarithmic axis.

drawing

CPU clock speeds stopped increasing at around 3 GHz.

This was countered by increasing the number of cores per chip.

Parallelism¶

Roughly a 100x increase in compute (due to more cores) in the 20 years since clock speeds stopped increasing.

To take advantage of this gain, we need algorithms and techniques that scale well.

Good examples: matrix multiplication, gradient descent

Bad examples: classical statistical methods, constrained optimization

Ironic twist: to use (vastly) more advanced computing hardware, must limit yourself to a subset of simple algorithms.
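
To make the two "good examples" concrete, here is a minimal sketch of gradient descent on a least-squares problem in which every step is just two matrix-vector products. The function name `least_squares_gd`, the problem sizes, and the fixed step size 1/||A||² are assumptions chosen for illustration, not a method prescribed by the course.

```python
import numpy as np


def least_squares_gd(A, b, num_steps=500):
    """Minimize 0.5 * ||A x - b||^2 by gradient descent with a fixed step."""
    # Step size 1/L, where L = ||A||_2^2 bounds the curvature of the objective.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_steps):
        residual = A @ x - b        # matrix-vector product
        gradient = A.T @ residual   # matrix-vector product
        x = x - step * gradient
    return x


rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
x_true = np.arange(1.0, 6.0)
b = A @ x_true                      # noiseless data, so the minimizer is x_true

x_hat = least_squares_gd(A, b)
print(np.max(np.abs(x_hat - x_true)))   # largest coordinate error
```

The loop body contains no branching or data-dependent control flow, only products with `A` and `A.T`, which is the kind of work that parallel hardware handles well.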

What are some domains that appear to have taken advantage of this scaling?

(based on 100x or better performance boost recently?)

Data Science

Artificial Intelligence

A.I. Example¶

Hugging Face Pipelines: Base class implementing NLP operations. Pipeline workflow is defined as a sequence of the following operations:

  • A tokenizer in charge of mapping raw textual input to tokens --> string (or other data format) processing
  • A model to make predictions from the inputs --> linear algebra
  • Some (optional) post processing for enhancing the model’s output --> misc

https://huggingface.co/docs/transformers/en/main_classes/pipelines
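
The three stages listed above can be sketched with a toy example. This is not the Hugging Face API: the vocabulary, labels, hand-picked weights, and the function names `tokenize`, `model`, `postprocess`, and `toy_pipeline` are all hypothetical, chosen only to show which stage is string processing and which is linear algebra.

```python
import numpy as np

# Hypothetical toy vocabulary and labels, for illustration only.
VOCAB = {"good": 0, "bad": 1, "movie": 2, "<unk>": 3}
LABELS = ["NEGATIVE", "POSITIVE"]

# Hand-picked toy weights: one row per vocabulary entry, one column per label.
WEIGHTS = np.array([
    [-2.0, 2.0],   # good
    [2.0, -2.0],   # bad
    [0.0, 0.0],    # movie
    [0.0, 0.0],    # <unk>
])


def tokenize(text):
    """Stage 1 (string processing): text -> bag-of-words count vector."""
    counts = np.zeros(len(VOCAB))
    for word in text.lower().split():
        counts[VOCAB.get(word, VOCAB["<unk>"])] += 1
    return counts


def model(counts):
    """Stage 2 (linear algebra): count vector -> one raw score per label."""
    return counts @ WEIGHTS


def postprocess(scores):
    """Stage 3 (misc): raw scores -> best label and its softmax probability."""
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return {"label": LABELS[best], "score": float(probs[best])}


def toy_pipeline(text):
    return postprocess(model(tokenize(text)))


print(toy_pipeline("good movie"))
print(toy_pipeline("bad bad movie"))
```

A real pipeline replaces the single matrix product in `model` with many layers of them, but the division of labor between the stages is the same.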

Behind the pipeline¶

drawing

https://huggingface.co/learn/nlp-course/chapter2/2

Inside the Model¶

drawing

Inside the GPU¶

General Matrix Multiplication (GEMM) ~ $C = \alpha AB + \beta C$

drawing

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
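
The GEMM formula above can be written out directly in NumPy. This is a minimal sketch for building intuition, not how a GPU library implements it; `gemm` and `gemm_loops` are hypothetical names, and the matrix sizes are arbitrary.

```python
import numpy as np


def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C."""
    return alpha * (A @ B) + beta * C


def gemm_loops(alpha, A, B, beta, C):
    """The same computation with explicit loops, to expose the arithmetic."""
    m, k = A.shape
    n = B.shape[1]
    out = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]
            out[i, j] = alpha * acc + beta * C[i, j]
    return out


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))

# The two versions should agree up to floating-point rounding.
print(np.allclose(gemm(2.0, A, B, 0.5, C), gemm_loops(2.0, A, B, 0.5, C)))
```

In the loop version, each of the m × n output entries depends only on one row of A and one column of B, so all of them can be computed independently. That independence is what lets the operation spread across many cores.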