BDS 761: Data Science and Machine Learning I

Topic 1: Introduction
I. Class topics¶
Course Theme¶
The key tools and approaches driving modern Artificial Intelligence, and related concerns:
Dealing with vast quantities of data
Algorithms that can scale linearly with data size
Algorithms that can take advantage of parallel processing
General Topic List¶
We will focus on methods and tools in the following broad areas:
- Matrix algebra methods and software
- Regression and intro to Deep Learning
- Scalable computation
- Scaling data size
- Embedding methods
- Basic methods in Machine Learning
- Text processing and Natural Language Processing
Objectives of this class¶
- Be able to use "core" Python libraries in your research
- Understand how SOTA A.I. methods are broadly based on these same libraries
- Be able to implement basic processing and a few machine learning algorithms from "scratch"
- Generally understand what is going on in research publications
II. Syllabus Discussion¶
- Homework and readings will be provided at the end of class or via announcement later that evening. Due at the beginning of the following class. Points will be deducted if you show up late.
- No particular textbook needed
- A computer is needed to participate in class.
- Attendance not mandatory (?). Will attempt to record classes. Please do not come to class with anything contagious.
- Academic integrity - can discuss verbally. Do not share work or copy fellow students' writing or code. Be very careful about basing your work on code from the internet.
- Office hours TBD.
Course Information¶
- Labs/Participation/Homework - 20%
- Quizzes - 10%
- Midterm - 30%
- Final Exam - 40%
Point of lab/participation/homework is to encourage you to study and learn. Easy points. Don't get overconfident or complacent: it is 20% of your grade.
Point of exams is to decide your grade.
Prerequisites: programming + math¶
Programming skills. We will be using Python. If the amount of work seems overwhelming, it is most likely due to a deficiency here. Tasks which should take minutes may take you hours if you get stuck trying to hack together an approach with Google or A.I.
Vector geometry
Matrix Algebra
Prob & Stat won't be used much
Prereqs exist for very good reasons.
Books¶
There is no required text. There is a vast supply of free resources online. Suggestions:
Introduction to Applied Linear Algebra, Boyd & Vandenberghe 2018, http://vmls-book.stanford.edu/
Speech and Language Processing, 3e, Jurafsky & Martin 2024. https://web.stanford.edu/~jurafsky/slp3/
data science tutor GPT: https://chatgpt.com/g/g-SSBhmwHol-introductory-data-science-tutor
some Python and linear algebra review material: https://www.keithdillon.com/index.php/bootcamp/
Academic Integrity, etc.¶
- See student handbook. This is your contract.
- Fairness will not be sacrificed for other noble causes
- Big source of drama: students skipping class or not doing homework then being unhappy with exams they could not handle as a result
III. Software Installation¶
Jupyter - "notebooks" for inline code + LaTeX math + markup, etc.¶
A single document containing a series of "cells", each containing code which can be run, or images and other documentation.
- Run a cell via
[shift] + [Enter]
or "play" button in the menu.

Will execute code and display result below, or render markup etc.
Can also use R or Julia (easily), Matlab, SQL, etc. (with increasing difficulty).
import datetime
print("This code is run right now (" + str(datetime.datetime.now()) + ")")
'hi'
This code is run right now (2025-08-26 14:41:26.932482)
'hi'
x=1+2+2
print(x)
5
import numpy as np
np.random.randn(2,5)
array([[ 1.24350758,  1.99906955, -0.3226366 , -0.98266019, -0.1309466 ],
       [-0.85026968, -0.35865037,  0.70637075,  1.06492839,  0.35220974]])
np.ones((2,2))
array([[1., 1.],
       [1., 1.]])
Installation¶
First project: get Jupyter running and be able to import listed tools
Easiest to install via Anaconda. Preferably Python 3.
https://www.anaconda.com/download/
Highly recommended to make a separate environment for class - hot open source tools change fast and deprecate (i.e. break) old features constantly
conda install jupyter matplotlib numpy scipy scikit-learn pandas ...
Many other packages...
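One way to check the "first project" is a short script that tries to import each tool and reports its version. This is a minimal sketch, not an official checklist: the `PACKAGES` list and the helper `check_packages` are illustrative names, and the import names are assumptions based on the conda command above (scikit-learn is imported as `sklearn`, and `jupyter_core` is used here as a stand-in for "Jupyter is installed").

```python
import importlib

# Import names assumed to correspond to the conda packages listed above.
# scikit-learn is imported as "sklearn"; "jupyter_core" stands in for Jupyter.
PACKAGES = ["jupyter_core", "matplotlib", "numpy", "scipy", "sklearn", "pandas"]


def check_packages(names):
    """Return {import_name: version string, or None if the import fails}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Fall back to "unknown" if a module does not expose __version__.
            report[name] = str(getattr(module, "__version__", "unknown"))
        except ImportError:
            report[name] = None
    return report


report = check_packages(PACKAGES)
for name, version in report.items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name:14s} {status}")
```

Any line reading NOT INSTALLED points at a package to add to the class environment.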
Python Help Tips¶
- Get help on a function or object via
[shift] + [tab]
after the opening parenthesis: function(

- Can also get help by executing
function?

IV. Q & A Discussion¶
- Recording of classes
- Job interests/plans?
- Research topics?
V. Class Motivation¶
Hardware: the real driver¶
The supposed end of Moore's Law
Note Logarithmic axis.

CPUs stopped getting faster clock speeds around 3 GHz
Countered by increasing number of cores on chip
Parallelism¶
100x increase in compute (due to more cores) in 20 years (since clock speed stopped)
To take advantage of this gain, need algorithms and techniques that can scale well
Good examples: matrix multiplication, gradient descent
Bad examples: classical statistical methods, constrained optimization
Ironic twist: to use (vastly) more advanced computing hardware, must limit yourself to a subset of simple algorithms.
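To make the "good examples" concrete, here is a minimal sketch of gradient descent for least-squares regression in which every step is a matrix-vector product, the kind of operation that parallel hardware handles well. The problem size, step size, and iteration count are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

# Synthetic data: noiseless targets, so the exact solution is w_true.
X = rng.standard_normal((n_samples, n_features))
w_true = rng.standard_normal(n_features)
y = X @ w_true

w = np.zeros(n_features)
step = 0.1  # assumed step size, small enough for this well-conditioned X
for _ in range(300):
    residual = X @ w - y                   # matrix-vector product
    gradient = X.T @ residual / n_samples  # another matrix-vector product
    w -= step * gradient

print("max error:", np.max(np.abs(w - w_true)))
```

Each iteration costs time proportional to the number of entries in `X`, so the work per step grows linearly with the data size.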
What are some domains that appear to have taken advantage of this scaling?
(based on 100x or better performance boost recently?)
Data Science
Artificial Intelligence
A.I. Example¶
Hugging Face Pipelines: Base class implementing NLP operations. Pipeline workflow is defined as a sequence of the following operations:
- A tokenizer in charge of mapping raw textual input to tokens --> string (or other data format) processing
- A model to make predictions from the inputs --> linear algebra
- Some (optional) post processing for enhancing model’s output --> misc

https://huggingface.co/docs/transformers/en/main_classes/pipelines
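The three stages can be illustrated with a toy stand-in written in plain NumPy. This is not the Hugging Face API: the vocabulary, labels, random untrained weights, and every function name below are hypothetical, chosen only to show where the string processing, the linear algebra, and the post-processing each happen.

```python
import numpy as np

# Hypothetical toy vocabulary and labels; not taken from any real model.
VOCAB = {"<unk>": 0, "good": 1, "bad": 2, "movie": 3, "very": 4}
LABELS = ["NEGATIVE", "POSITIVE"]

rng = np.random.default_rng(0)
EMBEDDINGS = rng.standard_normal((len(VOCAB), 8))  # one row per token
WEIGHTS = rng.standard_normal((8, len(LABELS)))    # untrained linear layer


def tokenize(text):
    """Stage 1 (string processing): map raw text to integer token ids."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def model(token_ids):
    """Stage 2 (linear algebra): embed tokens, average, apply a linear layer."""
    pooled = EMBEDDINGS[token_ids].mean(axis=0)
    return pooled @ WEIGHTS  # raw scores (logits), one per label


def postprocess(logits):
    """Stage 3 (misc): softmax, then report the most probable label."""
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return {"label": LABELS[int(probs.argmax())], "score": float(probs.max())}


def toy_pipeline(text):
    return postprocess(model(tokenize(text)))


result = toy_pipeline("very good movie")
print(result)
```

Because the weights are random, the predicted label is meaningless; only the data flow matters here. The sketch also assumes non-empty input.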
Inside the Model¶

Inside the GPU¶
General Matrix Multiplication (GEMM) ~ $C = \alpha AB + \beta C$

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
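The GEMM formula can be written out directly in NumPy. The sketch below compares a vectorized version against explicit loops to show the arithmetic involved; the function names `gemm` and `gemm_loops` and the small matrix sizes are illustrative assumptions, and real GPU libraries implement this very differently.

```python
import numpy as np


def gemm(alpha, A, B, beta, C):
    """Return alpha * A @ B + beta * C as a new array (C is not modified)."""
    return alpha * (A @ B) + beta * C


def gemm_loops(alpha, A, B, beta, C):
    """Same computation with explicit loops, to show the arithmetic."""
    m, k = A.shape
    _, n = B.shape
    out = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]
            out[i, j] = alpha * acc + beta * C[i, j]
    return out


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2))

fast = gemm(2.0, A, B, 0.5, C)
slow = gemm_loops(2.0, A, B, 0.5, C)
print(np.allclose(fast, slow))
```

Every output entry `out[i, j]` is computed independently of the others, which is why this operation maps so naturally onto many parallel cores.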