For this week, the homework is to make a basic cheatsheet for Python programming according to the following requirements:
Bring it to class next Tuesday (if class is cancelled, hold on to it until the following class meeting a week later).
If you haven't created your Python cheatsheet, please do so to be prepared for the next quiz. See the instructions from HW 1.
If you are rusty on linear algebra, you should work through the matrix multiplication practice by hand.
For this week, use your vector functions from class (scalar-vector multiply, vector add, and dot product) to build more complex operations.
Make a matrix-vector multiply function that internally calls your dot product function.
Similarly, make a matrix-matrix multiply function that internally calls your matrix-vector multiply function.
Note that you should define matrices as lists of rows, so you will also need a new function to extract columns from such matrices.
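For reference, here is a minimal sketch of how these pieces might fit together (the function names, and the inline dot product shown only so the sketch runs on its own, are illustrative rather than required):

def dot(u, v):
    # your dot product function from class (included only so the sketch is self-contained)
    return sum(ui * vi for ui, vi in zip(u, v))

def get_column(A, j):
    # extract column j from a matrix stored as a list of rows
    return [row[j] for row in A]

def mat_vec(A, x):
    # entry i of A @ x is the dot product of row i of A with x
    return [dot(row, x) for row in A]

def mat_mat(A, B):
    # column j of A @ B is A times column j of B; reassemble the result as a list of rows
    cols = [mat_vec(A, get_column(B, j)) for j in range(len(B[0]))]
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(A))]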
Give your results as a printout of your Jupyter notebook.
Use numpy to solve the "Exercises: linear algebra with python" questions in the python practice bootcamp.
Do your work in a Jupyter notebook and print it to PDF or HTML to either email or hand in.
Classification using norms:
For this you can use the Iris dataset again, or a more interesting dataset if you prefer.
For this assignment, use the numpy.linalg.norm function to classify samples within the dataset by comparing them to the other flowers.
Make a script that takes a chosen sample (a particular row of the data) and computes the distance between it and each of the other samples. Then predict its class as the class of the nearest sample (the corresponding value in the target vector).
Try this using the L1, L2, and Chebyshev (infinity) norms to compute the distance, as well as the "cosine distance" (based on the L2 norm).
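A minimal sketch of the nearest-neighbor step (sklearn is used here only to load the Iris data; the index i and the variable names are illustrative):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

i = 10                                    # index of the chosen sample
x = X[i]
others = np.delete(X, i, axis=0)          # every sample except the chosen one
labels = np.delete(y, i)

dists = np.linalg.norm(others - x, ord=2, axis=1)   # L2 distances to all other samples
predicted = labels[np.argmin(dists)]                # class of the nearest sample
print(predicted, y[i])

# Other norms: ord=1 (L1) and ord=np.inf (Chebyshev). Cosine distance can be computed as
# 1 - (others @ x) / (np.linalg.norm(others, axis=1) * np.linalg.norm(x))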
Feel free to ask AI for help, but use only numpy functions, and be sure you understand each step. Break your work into multiple steps and add a comment explaining what each step does.
Turn it in by emailing a printed PDF of your notebook.
For this homework we will work more thoroughly with the in-class Least Squares regression project to perform a better fit.
You can use the same dataset from class, or find a better one.
Add a bias (intercept) term by prepending a column of ones to your data matrix. For example, if
A = [[.22, .34], [1.5, 6.7], [.33, .01]]
we want to change it into:
A = [[1, .22, .34], [1, 1.5, 6.7], [1, .33, .01]]
You can do this using the np.concatenate() function (np.concat() is an alias for it in numpy 2.0+) and the np.ones() function.
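For example, a minimal sketch (assuming A is already a numpy array):

import numpy as np

A = np.array([[.22, .34], [1.5, 6.7], [.33, .01]])
ones = np.ones((A.shape[0], 1))             # one bias entry per row
A_bias = np.concatenate([ones, A], axis=1)  # prepend the column of ones
print(A_bias)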
See how this affects the residual (both the plot of r = A@x - b and the norm of r). Compare the results with and without the above bias term.
For each of the above parts, provide your conclusions based on what you see and how well it works. If you are using a classification dataset, meaning the target is an integer representing class value, see how well regression classifies the data by approximating the correct integer.
For this week, theoretically derive the regularized versions of the least-squares regression problems from the notes. Start from the optimization problem of maximizing the posterior distribution, then use Bayes' law and plug in the appropriate distributions for the likelihood and priors (hint: you don't care what p(y) is). Take the log and use the tricks we covered in class to reach the simple least-squares form we had in the notes, this time with a regularization term.
Do this using both the Gaussian prior on beta and the Laplace prior on beta. What is the meaning of the regularization parameter lambda in each case?
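As a reminder of the starting point (a sketch only; the notation here writes the data vector as b, which may differ slightly from the notes): by Bayes' law, and because p(b) does not depend on beta,

\hat{\beta} = \arg\max_{\beta} p(\beta \mid b)
            = \arg\max_{\beta} \frac{p(b \mid \beta)\, p(\beta)}{p(b)}
            = \arg\min_{\beta} \left[ -\log p(b \mid \beta) - \log p(\beta) \right].

From here, substitute the Gaussian likelihood for p(b | beta) and the Gaussian (or Laplace) density for the prior p(beta), then simplify.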
Show every step and give a short note explaining each step.
Hand your work in next class.
Email me or create a discussion on Canvas if you get stuck.
Go through each of the main methods we mentioned in class (split, join, and the major regex functions). Give a simple description and an example of use in a Jupyter notebook.
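For example, entries of the following kind (one short example per method; re.findall, re.sub, and re.split are shown here as the "major regex functions", which may not exactly match the set covered in class):

import re

s = "the quick brown fox"

print(s.split())                          # str.split: break a string into a list of words
print("-".join(["a", "b", "c"]))          # str.join: glue a list of strings together
print(re.findall(r"\b\w{5}\b", s))        # re.findall: all five-letter words
print(re.sub(r"\s+", "_", s))             # re.sub: replace runs of whitespace
print(re.split(r"[aeiou]", s))            # re.split: split on any vowel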
(a) Create a random string representing DNA with one million base pairs (each character chosen randomly from 'A', 'T', 'C', 'G'). This might take a while. Print the first 100 characters to show it worked.
(b) Create an RNA string by replacing every 'T' with a 'U'. Print the first 100 characters to show it worked.
(c) Create a second random RNA string of length 8. Print it.
(d) Perform an exact search for your small RNA string within the large one and find how many times it appears (it should be more than zero).
(e) List the locations where a match is found. What exactly do these numbers correspond to? (The middle of the match region? The beginning? The end?)
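A minimal sketch of parts (a)-(e) (random.choices and re.finditer are one reasonable choice here, not a requirement):

import random
import re

# (a) random DNA string with one million base pairs
dna = ''.join(random.choices('ATCG', k=1_000_000))
print(dna[:100])

# (b) RNA string: replace every 'T' with a 'U'
rna = dna.replace('T', 'U')
print(rna[:100])

# (c) a second random RNA string of length 8
probe = ''.join(random.choices('AUCG', k=8))
print(probe)

# (d)/(e) exact, non-overlapping matches of the short string and where they occur
positions = [m.start() for m in re.finditer(probe, rna)]
print(len(positions), positions[:10])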
Consider the following recursive implementation of the edit distance function from the notes:
def D(a, b):
    if a == '': return len(b)
    if b == '': return len(a)
    if a[-1] == b[-1]:
        delta = 0
    else:
        delta = 1
    return min(
        D(a[:-1], b) + 1,             # delete the last character of a
        D(a, b[:-1]) + 1,             # delete the last character of b
        D(a[:-1], b[:-1]) + delta     # substitute the last characters (free if they match)
    )
(a) Compute the edit distances and time taken (as above, without memoization) for: "hanning" versus "hamming", "chocolate" versus "anniversary", and "GGAAAATTT" versus "GGAAGAGGCTTGT".
(b) Now use the @lru_cache decorator before the function definition (import lru_cache from functools), and repeat the timing tests.
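A minimal sketch of part (b) (time.perf_counter is used for timing here; %timeit in Jupyter works just as well):

from functools import lru_cache
import time

@lru_cache(maxsize=None)    # memoize: cache results keyed on the (a, b) arguments
def D(a, b):
    if a == '': return len(b)
    if b == '': return len(a)
    delta = 0 if a[-1] == b[-1] else 1
    return min(D(a[:-1], b) + 1,
               D(a, b[:-1]) + 1,
               D(a[:-1], b[:-1]) + delta)

t0 = time.perf_counter()
print(D("chocolate", "anniversary"))
print("seconds:", time.perf_counter() - t0)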
Email me a printout of your notebook to turn it in.
Use the Google Ngram Viewer to estimate the probability that a book will contain the phrase "data science is great" in 2022.
Compare the result from searching for the phrase directly to the approximation using bigrams (i.e., P("data science is great") ≈ P("data") P("science" | "data") P("is" | "science") P("great" | "is"), with each factor estimated from Ngram counts).
Load the enterobacteria phage phiX174 (NC_001422.1) from NCBI into a string. Use the following code to load the data:
import requests
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {"db": "nuccore", "id": "NC_001422.1", "rettype": "fasta", "retmode": "text"}
ret = requests.get(url, params=params, timeout=60)
dna_0 = ret.text
Note that the result (dna_0) contains a header and linebreaks (it's basically a text file). Clean these out to get a simple string of base pairs.
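A minimal sketch of the cleaning step (assuming dna_0 is the string returned by the code above):

# drop the FASTA header line (it starts with '>') and remove the linebreaks
lines = dna_0.splitlines()
dna = ''.join(line.strip() for line in lines if not line.startswith('>'))
print(dna[:100], len(dna))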
Estimate the frequency of 'TATA' using bigram and unigram approximations. Compare it to the probability if the DNA were assumed to be random i.i.d. (i.e., each base pair has probability 1/4 and there are no conditional dependencies, so a 4-mer has probability (1/4)^4 = 1/256).
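A minimal sketch of the counting (assuming dna is the cleaned string from the previous step; the chain-rule factorizations used are P(TATA) ≈ P(T)P(A)P(T)P(A) for unigrams and P(T)P(A|T)P(T|A)P(A|T) for bigrams):

from collections import Counter

n = len(dna)
uni = Counter(dna)                                # single-character counts
bi = Counter(dna[i:i+2] for i in range(n - 1))    # adjacent-pair counts

def p(c):
    return uni[c] / n                 # unigram probability of character c

def p_cond(c2, c1):
    return bi[c1 + c2] / uni[c1]      # conditional probability P(c2 | c1)

p_uni = p('T') * p('A') * p('T') * p('A')
p_bi = p('T') * p_cond('A', 'T') * p_cond('T', 'A') * p_cond('A', 'T')
p_iid = 0.25 ** 4                     # i.i.d. baseline: (1/4)^4
print(p_uni, p_bi, p_iid)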
In this homework we will implement the Naive Bayes method for sentiment analysis.
Import the Stanford SST2 dataset from HuggingFace: https://huggingface.co/datasets/stanfordnlp/sst2
Use the HuggingFace datasets library to download the data.
Using the 'train' split of the dataset, find the most likely class for each of the following sentences by manually estimating the probabilities you need for each of the terms in the Naive Bayes expression:
'science is great'
'science sucks'
Compare the result you got to TextBlob's internal sentiment analysis function and to the distilbert sentiment analysis pipeline.
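A minimal sketch of loading the data and getting the two reference classifiers (the Naive Bayes estimation itself is the part you implement; the default sentiment-analysis pipeline downloads a distilbert model fine-tuned on SST-2):

from datasets import load_dataset
from textblob import TextBlob
from transformers import pipeline

sst2 = load_dataset("stanfordnlp/sst2")    # splits: train / validation / test
train = sst2["train"]                      # each row has 'sentence' and 'label' fields
print(train[0])

for text in ["science is great", "science sucks"]:
    print(text, "TextBlob polarity:", TextBlob(text).sentiment.polarity)

clf = pipeline("sentiment-analysis")
print(clf(["science is great", "science sucks"]))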
(a) Use numpy and matplotlib to generate images of 2D Normal distributions which match the slides labeled "Harder Exercises" and "Harder Exercises II".
(b) Generate 2D random data which matches one of the cases with correlation = 1 and one of the cases with correlation = 0, and plot their scatterplots.
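A minimal sketch of both parts (the mean and covariance below are placeholders; replace them with values matching the slides):

import numpy as np
import matplotlib.pyplot as plt

# (a) image of a 2D Normal density evaluated on a grid
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
pts = np.stack([xs - mu[0], ys - mu[1]], axis=-1)
Sinv = np.linalg.inv(Sigma)
quad = np.einsum('...i,ij,...j->...', pts, Sinv, pts)   # (x-mu)^T Sigma^{-1} (x-mu)
pdf = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
plt.imshow(pdf, origin='lower', extent=[-3, 3, -3, 3])
plt.show()

# (b) scatterplot of random samples drawn from the same distribution
# (correlation = 1 corresponds to a rank-1, singular covariance matrix)
samples = np.random.multivariate_normal(mu, Sigma, size=500)
plt.scatter(samples[:, 0], samples[:, 1], s=5)
plt.axis('equal')
plt.show()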
Implement the Graphical Lasso method to produce a sparse network relating features of the breast cancer dataset. Make one graph for benign tumors and one for malignant tumors. What differences do you notice that may be clinically important?
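A minimal sketch of the estimation step (using sklearn's GraphicalLassoCV and the built-in breast cancer dataset; drawing the actual network from the precision matrix is left to you):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.covariance import GraphicalLassoCV
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target                 # target: 0 = malignant, 1 = benign

for label, name in [(0, "malignant"), (1, "benign")]:
    Xc = StandardScaler().fit_transform(X[y == label])
    model = GraphicalLassoCV().fit(Xc)
    prec = model.precision_                   # sparse precision matrix
    edges = np.abs(prec) > 1e-6               # nonzero entries = graph edges
    print(name, "edges:", int(edges.sum() - edges.shape[0]) // 2)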
Using the same dataset from your in-class exercises, perform PCA manually by using the eigenvalue decomposition of the sample covariance matrix, and produce a 2D visualization. Compare the result to the sklearn PCA method.
Reassemble the covariance matrix using its eigenvalues and eigenvectors, but with all eigenvalues set to zero except the largest two. Examine how similar it is to the original.
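A minimal sketch of both steps (assuming X is your data matrix with samples as rows; np.linalg.eigh returns eigenvalues in ascending order, so the largest two are last):

import numpy as np
from sklearn.decomposition import PCA

Xc = X - X.mean(axis=0)                     # center the data
C = np.cov(Xc, rowvar=False)                # sample covariance matrix
vals, vecs = np.linalg.eigh(C)              # ascending eigenvalues, orthonormal eigenvectors

# manual PCA: project onto the two eigenvectors with the largest eigenvalues
W = vecs[:, -2:]
Z_manual = Xc @ W
Z_sklearn = PCA(n_components=2).fit_transform(X)   # compare (sign and column order may differ)

# reassemble the covariance keeping only the two largest eigenvalues
vals_trunc = np.zeros_like(vals)
vals_trunc[-2:] = vals[-2:]
C_approx = vecs @ np.diag(vals_trunc) @ vecs.T
print(np.linalg.norm(C - C_approx) / np.linalg.norm(C))   # relative difference from the original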