BDS 761: Data Science and Machine Learning I
Topic 6: Text & Natural Language Processing
This topic:¶
Text processing
- Character encodings
- Regular expressions
- Tokenization, segmentation, & stemming
- Approximate sequence matching
Language Modeling
- Probability Review
- $N$-gram Language modeling
- Naive Bayes
- Logistic Regression
Intro to Modern NLP
- Small NLP packages
- NLP with LLMs
Reading - Text processing:
- https://www.oreilly.com/library/view/fluent-python/9781491946237/ch04.html
- J&M Chapter 2, "Regular Expressions, Text Normalization, and Edit Distance"
Further reading - Dynamic Programming:
- "The Algorithm Design Manual, 3e," Chapter 8, Steven Skiena, 2020.
- OSU Molecular Biology Primer, Chapter 21: https://open.oregonstate.education/computationalbiology/chapter/bioinformatics-knick-knacks-and-regular-expressions/
Reading - Language Modeling:
- Probability Review http://j.mp/CG_prob_cheatsheet
- J&M Chapter 3 (Language Modeling with N-Grams)
- J&M Chapter 4 (Naive Bayes Classification and Sentiment)
- J&M Chapter 5 (Logistic Regression)
Reading - NLP:
- IBM: What is NLP (natural language processing)? https://www.ibm.com/topics/natural-language-processing
Motivation¶
Processing formatted records in varying text formats, such as converting different date formats ('01/01/24' vs 'Jan 1, 2024' vs '1 January 2024') to a single numerical variable
Processing survey data or health records in text format, using NLP to convert unstructured text to a categorical variable
Processing other sequential data such as DNA or biological signals
Modern A.I. (Large Language Models) is trained to solve NLP problems using large collections of text. The model incorporates general knowledge of any broadly understood topic, such as science, health, or psychology, and can be applied outside of NLP problems
Text Processing Levels¶
- Character
- Words
- Sentences / multiple words
- Paragraphs / multiple sentences
- Document
- Corpus / multiple documents
Source: Taming Text, p 9
Character¶
- Character encodings
- Case (upper and lower)
- Punctuation
- Numbers
Indicate characters by quotes (either single or double quotes work)
x = 'a'
y = '3'
z = '&'
q = '"'
print(x,y,z,q)
a 3 & "
Words¶
- Word segmentation: dividing text into words. Fairly easy for English and other languages that use whitespace; much harder for languages like Chinese and Japanese.
- Stemming: the process of shortening a word to its base or root form.
- Abbreviations, acronyms, and spelling. All help understand words.
In Python we store words in strings, which are sequences of characters. We will do this soon.
Sentences¶
- Sentence boundary detection: a well-understood problem in English, but is still not perfect.
- Phrase detection: San Francisco and quick red fox are examples of phrases.
- Parsing: breaking sentences down into subject-verb and other relationships often yields useful information about words and their relationships to each other.
- Combining the definitions of words and their relationships to each other to determine the meaning of a sentence.
Paragraphs¶
At this level, processing becomes more difficult in an effort to find deeper understanding of an author’s intent.
For example, algorithms for summarization often require being able to identify which sentences are more important than others.
Document¶
Similar to the paragraph level, understanding the meaning of a document often requires knowledge that goes beyond what’s contained in the actual document.
Authors often expect readers to have a certain background or possess certain reading skills.
Corpus¶
At this level, people want to quickly find items of interest as well as group related documents and read summaries of those documents.
Applications that can aggregate and organize facts and opinions and find relationships are particularly useful.
I. Character Encodings¶
Character Encodings - map character to binary¶
- ASCII - 7 bits
- char - 8-bit - a-z,A-Z,0-9,...
- multi-byte encodings for other languages
ascii(38)
'38'
str(38), float('38')
('38', 38.0)
chr(38)
'&'
Unicode¶
Unicode - "code points"¶
lookup table of unique numbers denoting every possible character
Encoding still needed -- various standards
- UTF-8 (dominant standard),16 - variable-length
- UTF-32 - 4-byte
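As a quick check of variable-length vs. fixed-width encodings, we can encode single characters with Python's built-in `str.encode` and count the bytes:

```python
# UTF-8 uses 1-4 bytes per code point, depending on the character
print(len('a'.encode('utf-8')))    # ASCII letter: 1 byte
print(len('é'.encode('utf-8')))    # accented Latin letter: 2 bytes
print(len('文'.encode('utf-8')))   # CJK character: 3 bytes
print(len('😀'.encode('utf-8')))   # emoji (outside the BMP): 4 bytes

# UTF-32 is fixed-width: every code point takes 4 bytes
# ('utf-32-le' avoids the 4-byte byte-order mark that plain 'utf-32' prepends)
print(len('a'.encode('utf-32-le')))   # 4 bytes
print(len('😀'.encode('utf-32-le')))  # 4 bytes
```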
chr(2^30+1) # careful: in Python ^ is bitwise XOR, not exponentiation, so this is chr(29)
'\x1d'
Mojibake¶

Incorrect, unreadable characters shown when computer software fails to show text correctly.
It is a result of text being decoded using an unintended character encoding.
Very common in Japanese websites, hence the name:
文字 (moji) "character" + 化け (bake) "transform"
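We can reproduce mojibake directly: encode text as UTF-8 bytes, then decode those bytes with the wrong encoding (Latin-1 here):

```python
# 'é' is two bytes in UTF-8 (0xC3 0xA9); decoding those bytes as Latin-1
# turns each byte into its own character: 'Ã' and '©'
garbled = 'café'.encode('utf-8').decode('latin-1')
print(garbled)  # cafÃ©

# re-encoding and decoding with the intended encoding recovers the text
fixed = garbled.encode('latin-1').decode('utf-8')
print(fixed)    # café
```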
II. String Processing and Regular Expressions¶
Background: Lists [item1, item2, item3]¶
- Sequence of values - order & repeats ok
- Mutable
- Concatenate lists with "+"
- Index with mylist[index] - note zero based
L = [1,2,3,4,5,6]
print(L)
print("length =",len(L))
print(L[0],L[1],L[2])
[1, 2, 3, 4, 5, 6] length = 6 1 2 3
[1,2,3,4,5][3]
4
Slices - mylist[start:end:step]¶
Matlabesque way to select sub-sequences from list
- If first index is zero, can omit - mylist[:end:step]
- If last index is length-1, can omit - mylist[::step]
- If step is 1, can omit mylist[start:end]
Make slices for even and odd indexed members of this list.
[1,2,3,4,5][:3]
[1, 2, 3]
[1,2,3,4,5][0:1]
[1]
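One possible answer to the even/odd exercise, using a step of 2 with different starting indices:

```python
L = [1, 2, 3, 4, 5, 6]
print(L[::2])   # even indices 0, 2, 4 -> [1, 3, 5]
print(L[1::2])  # odd indices 1, 3, 5 -> [2, 4, 6]
```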
Strings¶
A string is an immutable sequence of characters; it can be indexed and sliced like a list
s = 'Hello there'
print(s)
Hello there
s
'Hello there'
print(s[0],s[2])
H l
(s[0],s[2])
('H', 'l')
print(s[:2])
He
x = 'hello'
y = 'there'
z = '!'
print(x,y,z) # x,y,z is actually a tuple
hello there !
# the + operator concatenates strings, just as it concatenates lists
xyz = x+y+z
print(xyz)
hellothere!
spc = chr(32)
spc
' '
How do we fix the spacing in this sentence?
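One way to fix the spacing is to concatenate the space character explicitly, or to use join (covered next):

```python
x = 'hello'
y = 'there'
z = '!'
spc = chr(32)                 # the space character ' '
print(x + spc + y + z)        # hello there!
print(' '.join([x, y]) + z)   # hello there!
```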
Other useful operations¶
xyz = 'hello there'
print(xyz.split(' '))
['hello', 'there']
print(xyz.split())
['hello', 'there']
print(xyz.split('e'))
['h', 'llo th', 'r', '']
fname = 'smith_05_2024.txt'
root = fname.split('.')
root
['smith_05_2024', 'txt']
rootparts = root[0].split('_')
rootparts
['smith', '05', '2024']
name = rootparts[0]
name
'smith'
fname.split('.')[0].split('_')[0]
'smith'
mylist = xyz.split()
print(mylist)
['hello', 'there']
print(' '.join(mylist))
hello there
print('_'.join(mylist))
hello_there
from string import *
whos
Variable Type Data/Info
---------------------------------------
Formatter type <class 'string.Formatter'>
L list n=6
Template type <class 'string.Template'>
ascii_letters str abcdefghijklmnopqrstuvwxy<...>BCDEFGHIJKLMNOPQRSTUVWXYZ
ascii_lowercase str abcdefghijklmnopqrstuvwxyz
ascii_uppercase str ABCDEFGHIJKLMNOPQRSTUVWXYZ
capwords function <function capwords at 0x00000181E7B2DF80>
dat0 list n=3
dat1 list n=4
digits str 0123456789
hexdigits str 0123456789abcdefABCDEF
literal1 str calendar
literal2 str calandar
literal3 str celender
mylist list n=2
octdigits str 01234567
pattern2 str c[ae]l[ae]nd[ae]r
patterns str calendar|calandar|celender
printable str 0123456789abcdefghijklmno<...>/:;<=>?@[\]^_`{|}~ \n
punctuation str !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
re module <module 're' from 'C:\\Us<...>4\\Lib\\re\\__init__.py'>
s str Hello there
st str calendar foo calandar cal celender calli
string module <module 'string' from 'C:<...>_083124\\Lib\\string.py'>
sub_pattern str [ae]
whitespace str \n
x str hello
xyz str hello there
xyz2 str hellothere2!
y str there
z str !
Student Activity¶
Let's make a simple password generator function!
Your code should return something like this:
'kZmuSUVeVC'
'mGEsuIfl91'
'FEFsWwAgLM'
import random
import string
n = 10
pw = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(n))  # use _ so the loop variable does not shadow n
Regular Expressions ("regex")¶
Used in grep, awk, ed, perl, ...
A regular expression is a pattern written in a pattern-matching language, the "RE language".
It is a Domain Specific Language (DSL): powerful, but a deliberately limited language. Other examples of DSLs: SQL, Markdown.
from re import *
whos
Variable Type Data/Info
----------------------------------------
A RegexFlag re.ASCII
ASCII RegexFlag re.ASCII
DOTALL RegexFlag re.DOTALL
Formatter type <class 'string.Formatter'>
I RegexFlag re.IGNORECASE
IGNORECASE RegexFlag re.IGNORECASE
L RegexFlag re.LOCALE
LOCALE RegexFlag re.LOCALE
M RegexFlag re.MULTILINE
MULTILINE RegexFlag re.MULTILINE
Match type <class 're.Match'>
NOFLAG RegexFlag re.NOFLAG
Pattern type <class 're.Pattern'>
RegexFlag EnumType <flag 'RegexFlag'>
S RegexFlag re.DOTALL
Template type <class 'string.Template'>
U RegexFlag re.UNICODE
UNICODE RegexFlag re.UNICODE
VERBOSE RegexFlag re.VERBOSE
X RegexFlag re.VERBOSE
ascii_letters str abcdefghijklmnopqrstuvwxy<...>BCDEFGHIJKLMNOPQRSTUVWXYZ
ascii_lowercase str abcdefghijklmnopqrstuvwxyz
ascii_uppercase str ABCDEFGHIJKLMNOPQRSTUVWXYZ
capwords function <function capwords at 0x00000181E7B2DF80>
chars str abcdefghijklmnopqrstuvwxy<...>LMNOPQRSTUVWXYZ0123456789
compile function <function compile at 0x00000181E76CDB20>
dat0 list n=3
dat1 list n=4
digits str 0123456789
error type <class 're.error'>
escape function <function escape at 0x00000181E76CDD00>
findall function <function findall at 0x00000181E76CD9E0>
finditer function <function finditer at 0x00000181E76CDA80>
fullmatch function <function fullmatch at 0x00000181E76CD580>
hexdigits str 0123456789abcdefABCDEF
k int 9
literal1 str calendar
literal2 str calandar
literal3 str celender
match function <function match at 0x00000181E7634400>
mylist list n=2
n int 10
newchar str I
octdigits str 01234567
pattern2 str c[ae]l[ae]nd[ae]r
patterns str calendar|calandar|celender
printable str 0123456789abcdefghijklmno<...>/:;<=>?@[\]^_`{|}~ \n
punctuation str !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
purge function <function purge at 0x00000181E76CDBC0>
pw str Yi7RZhEMvI
r float 0.5457937230425709
random module <module 'random' from 'C:<...>_083124\\Lib\\random.py'>
re module <module 're' from 'C:\\Us<...>4\\Lib\\re\\__init__.py'>
s str Hello there
search function <function search at 0x00000181E76CD760>
split function <function split at 0x00000181E76CD940>
st str calendar foo calandar cal celender calli
string module <module 'string' from 'C:<...>_083124\\Lib\\string.py'>
sub function <function sub at 0x00000181E76CD800>
sub_pattern str [ae]
subn function <function subn at 0x00000181E76CD8A0>
template function <function template at 0x00000181E76CDC60>
whitespace str \n
x str hello
xyz str hello there
xyz2 str hellothere2!
y str there
z str !
Motivating example¶
Write a regex to match common misspellings of calendar: "calendar", "calandar", or "celender"
# Let's explore how to do this
# Patterns to match
dat0 = ["calendar", "calandar", "celender"]
# Patterns to not match
dat1 = ["foo", "cal", "calli", "calaaaandar"]
# Interleave them
st = " ".join([item for pair in zip(dat0, dat1) for item in pair])
st
'calendar foo calandar cal celender calli'
# You match it with literals
literal1 = 'calendar'
literal2 = 'calandar'
literal3 = 'celender'
patterns = "|".join([literal1, literal2, literal3])
patterns
'calendar|calandar|celender'
import re
print(re.findall(patterns, st))
['calendar', 'calandar', 'celender']
... a better way¶
Let's write it with regex language
sub_pattern = '[ae]'
pattern2 = sub_pattern.join(["c","l","nd","r"])
print(pattern2)
c[ae]l[ae]nd[ae]r
print(st)
re.findall(pattern2, st)
calendar foo calandar cal celender calli
['calendar', 'calandar', 'celender']
Regex Terms¶
- target string: This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.
- search expression: The pattern we use to find what we want. Most commonly called the regular expression.
- literal: A literal is any character we use in a search or matching expression, for example, to find ind in windows the ind is a literal string - each character plays a part in the search, it is literally the string we want to find.
- metacharacter: A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example "." means any character.
- escape sequence: An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal.
function(search_expression, target_string)¶
pick function based on goal (find all matches, replace matches, find first match, ...)
form search expression to account for variations in target we allow. E.g. possible misspellings.
- findall() - Returns a list containing all matches
- search() - Returns a Match object if there is a match anywhere in the string
- split() - Returns a list where the string has been split at each match
- sub() - Replaces one or many matches with a string
- match() apply the pattern at the start of the string
Metacharacters¶
special characters that have a unique meaning
[] A set of characters. Ex: "[a-m]"
\ Signals a special sequence, also used to escape special characters). Ex: "\d"
. Any character (except newline character). Ex: "he..o"
^ Starts with. Ex: "^hello"
$ Ends with. Ex: "world$"
* Zero or more occurrences. Ex: "aix*"
+ One or more occurrences. Ex: "aix+"
{} Specified number of occurrences. Ex: "al{2}"
| Either or. Ex: "falls|stays"
() Capture and group
Escape sequence "\"¶
A way of indicating that we want to use one of our metacharacters as a literal.
In a regular expression, an escape sequence is the metacharacter \ (backslash) placed in front of the metacharacter that we want to use as a literal.
Ex: If we want to find \file in the target string c:\file then we would need to use the search expression \\file (the \ we want to search for as a literal is preceded by an escaping \).
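A quick check of the backslash example (in Python source, raw strings r'...' keep us from needing a second layer of string escaping on top of the regex escaping):

```python
import re

target = r'c:\file'                      # the literal text  c:\file
matches = re.findall(r'\\file', target)  # \\ in the regex matches one literal backslash
print(matches)                           # one match: the text  \file
```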
Special Escape Sequences¶
- \A - specified characters at beginning of string. Ex: "\AThe"
- \b - specified characters at beginning or end of a word. Ex: r"\bain" r"ain\b"
- \B - specified characters present but NOT at beginning (or end) of word. Ex: r"\Bain" r"ain\B"
- \d - string contains digits (numbers from 0-9)
- \D - string DOES NOT contain digits
- \s - string contains a white space character
- \S - string DOES NOT contain a white space character
- \w - string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character)
- \W - string DOES NOT contain any word characters
- \Z - specified characters are at the end of the string
Set¶
a set of characters inside a pair of square brackets [] with a special meaning:
- [arn] one of the specified characters (a, r, or n) are present
- [a-n] any lower case character, alphabetically between a and n
- [^arn] any character EXCEPT a, r, and n
- [0123] any of the specified digits (0, 1, 2, or 3) are present
- [0-9] any digit between 0 and 9
- [0-5][0-9] any two-digit numbers from 00 and 59
- [a-zA-Z] any character alphabetically between a and z, lower case OR upper case
- [+] In sets, the characters +, *, ., |, (), $, {} have no special meaning, so [+] means any + character in the string
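Two of the less obvious set behaviors, negation with ^ and metacharacters losing their special meaning inside a set:

```python
import re

# [^arn] matches any single character EXCEPT a, r, or n
print(re.findall('[^arn]', 'barn'))   # only 'b' survives

# inside a set, + is just a literal plus sign
print(re.findall('[+]', '1+2=3'))
```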
Ex: Matching phone numbers¶
target_string = 'fgsfdgsgf 415-805-1888 xxxddd 800-555-1234'
pattern1 = '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
print(re.findall(pattern1,target_string))
['415-805-1888', '800-555-1234']
pattern2 = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(re.findall(pattern2,target_string))
['415-805-1888', '800-555-1234']
pattern3 = '\\d{3}-\\d{3}-\\d{4}'
print(re.findall(pattern3,target_string))
['415-805-1888', '800-555-1234']
\d{3}-\d{3}-\d{4} uses Quantifiers.
Quantifiers: allow you to specify how many times the preceding expression should match.
{n} is the exact quantifier.
print(re.findall('x?','xxxy'))
['x', 'x', 'x', '', '']
print(re.findall('x+','xxxy'))
['xxx']
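The remaining quantifiers behave similarly; * allows zero or more occurrences (so it can match the empty string), and {m,n} gives a bounded range:

```python
import re

# 'x*' matches greedily at each position, including zero-width (empty) matches
print(re.findall('x*', 'xxxy'))      # ['xxx', '', '']

# 'x{1,2}' matches one or two x's, greedily
print(re.findall('x{1,2}', 'xxxy'))  # ['xx', 'x']
```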
Capturing groups¶
Problem: You have odd line breaks in your text.
text = 'Long-\nterm problems with short-\nterm solutions.'
print(text)
Long- term problems with short- term solutions.
text.replace('-\n','\n')
'Long\nterm problems with short\nterm solutions.'
Solution: Write a regex to find the "dash with line break" and replace it with just a line break.
import re
# 1st Attempt
text = 'Long-\nterm problems with short-\nterm solutions.'
re.sub('(\\w+)-\\n(\\w+)', r'-', text)
'- problems with - solutions.'
Not right. We need capturing groups.
Capturing groups allow you to apply regex operators to the groups that have been matched by regex.
For example, suppose you wanted to list all the image files in a folder. You could use a pattern such as ^(IMG\d+\.png)$ to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern ^(IMG\d+)\.png$, which captures only the part before the period.
re.sub(r'(\w+)-\n(\w+)', r'\1-\2', text)
'Long-term problems with short-term solutions.'
The parentheses around the word characters (specified by \w) means that any matching text should be captured into a group.
The '\1' and '\2' specifiers refer to the text in the first and second captured groups.
"Long" and "term" are the first and second captured groups for the first match.
"short" and "term" are the first and second captured groups for the next match.
NOTE: 1-based indexing
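Groups can also be named with (?P&lt;name&gt;...), which reads better than counting positional \1, \2 backreferences; here is the same line-break fix rewritten with (hypothetical) group names:

```python
import re

text = 'Long-\nterm problems with short-\nterm solutions.'

# name the two captured word pieces instead of numbering them
fixed = re.sub(r'(?P<left>\w+)-\n(?P<right>\w+)', r'\g<left>-\g<right>', text)
print(fixed)  # Long-term problems with short-term solutions.
```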
III. Tokenization, segmentation, & stemming¶
Sentence segmentation:¶
Dividing a stream of language into component sentences.
Sentences can be defined as a set of words that is complete in itself, typically containing a subject and predicate.
Sentence segmentation is typically done using punctuation, particularly the full stop character ".", as a reasonable approximation.
Complications arise because punctuation is also used in abbreviations, which may or may not also terminate a sentence.
For example, Dr. Evil.
Example¶
A Confederacy Of Dunces
By John Kennedy Toole
A green hunting cap squeezed the top of the fleshy balloon of a head. The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once. Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs. In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.
sentence_1 = A green hunting cap squeezed the top of the fleshy balloon of a head.
sentence_2 = The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once.
sentence_3 = Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs.
sentence_4 = In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.
Code version 1¶
text = """A green hunting cap squeezed the top of the fleshy balloon of a head. The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once. Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs. In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress. """
import re
pattern = "|".join(['!', # end with "!"
'\\?', # end with "?"
'\\.\\D', # end with "." and the full stop is not followed by a digit
'\\.\\s']) # end with "." and the full stop is followed by a whitespace
print(pattern)
!|\?|\.\D|\.\s
re.split(pattern, text)
['A green hunting cap squeezed the top of the fleshy balloon of a head', 'The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once', 'Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs', 'In the shadow under the green visor of the cap Ignatius J', 'Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D', '', 'Holmes department store, studying the crowd of people for signs of bad taste in dress', '']
pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
re.split(pattern, text)
['A green hunting cap squeezed the top of the fleshy balloon of a head.', 'The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once.', 'Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs.', 'In the shadow under the green visor of the cap Ignatius J.', 'Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.', '']
Code by using a library¶
...next class
Tokenization¶
Breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens
The simplest way to tokenize is to split on white space
sentence1 = 'Sky is blue and trees are green'
sentence1.split(' ')
['Sky', 'is', 'blue', 'and', 'trees', 'are', 'green']
sentence1.split() # in fact it's the default
['Sky', 'is', 'blue', 'and', 'trees', 'are', 'green']
Sometimes you might also want to deal with abbreviations, hyphenations, punctuation, and other characters.
In those cases, you would want to use regex.
However, going through a sentence multiple times can be slow if the corpus is long
import re
sentence2 = 'This state-of-the-art technology is cool, isn\'t it?'
sentence2 = re.sub('-', ' ', sentence2)
sentence2 = re.sub('[,.?]', '', sentence2)  # inside a set, list the characters directly ("|" would be matched literally)
sentence2 = re.sub('n\'t', ' not', sentence2)
print(sentence2)
sentence2_tokens = re.split('\\s+', sentence2)
print(sentence2_tokens)
This state of the art technology is cool is not it ['This', 'state', 'of', 'the', 'art', 'technology', 'is', 'cool', 'is', 'not', 'it']
In this case, there are 11 tokens and the size of the vocabulary is 10
print('Number of tokens:', len(sentence2_tokens))
print('Vocabulary size:', len(set(sentence2_tokens)))
Number of tokens: 11 Vocabulary size: 10
Tokenization is a major component of modern language models and A.I., where tokens are defined more generally.
Morphemes¶
A morpheme is the smallest unit of language that has meaning. Two types:
- stems
- affixes (suffixes, prefixes, infixes, and circumfixes)
Example: "unbelievable"
What is the stem? What are the affixes?
"believe" is a stem.
"un" and "able" are affixes.
What we usually want to do in NLP preprocessing is obtain the stem by eliminating the affixes from a token.
Stemming¶
Stemming usually refers to a crude heuristic process that chops off the ends of words.
Ex: automates, automating and automatic could be stemmed to automat
Exercise: how would you implement this using regex? What difficulties would you run into?
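As a starting point for the exercise, here is a deliberately naive suffix-stripper (an illustrative sketch, not a real stemmer). It shows the difficulties immediately: "automatic" is missed because no suffix rule fires, while a short word like "bus" is over-stemmed:

```python
import re

def naive_stem(word):
    # chop one common suffix off the end of the word
    return re.sub(r'(es|ed|ing|s)$', '', word)

print(naive_stem('automates'))   # automat
print(naive_stem('automating'))  # automat
print(naive_stem('automatic'))   # automatic  (no rule fires)
print(naive_stem('bus'))         # bu         (over-stemming!)
```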
Lemmatization¶
Lemmatization aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
This is doing things properly with the use of a vocabulary and morphological analysis of words.
How are stemming and lemmatization similar/different?
Summary¶
- Tokenization separates words in a sentence
- You would normalize or process the sentence during tokenization to obtain sensible tokens
- These normalizations include:
- Replacing special characters such as ,.-=! with spaces using regex
- Lowercasing
- Stemming to remove the suffix of tokens to make tokens more uniform
- There are three commonly used stemmers: Porter, Snowball, and Lancaster
- Lancaster is the fastest and most aggressive; Snowball is a balance between speed and quality
Bioinformatics¶
Many analogous tasks in processing DNA and RNA sequences
- Finding exact matches for shorter sequence within long sequence
- Inexact or approximate matching?
IV. Approximate Sequence Matching¶
Exact Matching¶
Find places where pattern $P$ occurs within text $T$.
What python functions do this?
Also a very important problem, and not trivial for massive datasets.
Alignment - compare $P$ to same-length substring of $T$ at some starting point.
How many calculations will this take in the most naive approach possible?
Improving on the Naive Exact-matching algorithm¶
Naive approach: test all possible alignments for match.
Ideas for improvement:
- stop comparing given alignment at first mismatch.
- use result of prior alignments to shorten or skip subsequent alignments.
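A minimal sketch of naive exact matching with the first improvement applied (stop comparing a given alignment at the first mismatch):

```python
def naive_match(P, T):
    """Return the starting indices of every alignment where P exactly matches T."""
    hits = []
    for i in range(len(T) - len(P) + 1):  # try every possible alignment
        for j in range(len(P)):
            if T[i + j] != P[j]:
                break                     # early stop: abandon alignment at first mismatch
        else:
            hits.append(i)                # inner loop finished: full match at offset i
    return hits

print(naive_match('ana', 'banana'))  # [1, 3]
```

In the worst case this still performs on the order of |P| x |T| character comparisons, which answers the question above about the most naive approach.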
Approximate matching: Motivational problems¶
- Matching Regular Expressions to text efficiently
- Biological sequence alignment between different species
- Matching noisy trajectories through space
- Clustering sequences into a few groups of most similar classes
- Applying k Nearest Neighbors classification to sequences
Pre-filtering, Pruning, etc.¶
When performing a slow search algorithm over a large dataset, start with a fast algorithm with a poor false-positive rate (FPR) to reject the obvious mismatches.
Ex. BLAST (sequence alignment) in bioinformatics, followed by slow accurate structure alignment technique.
- K. Dillon and Y.-P. Wang, “On efficient meta-filtering of big data”, 2016
Correlation screening
- Fan, Jianqing, and Jinchi Lv. "Sure independence screening for ultrahigh dimensional feature space." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008): 849-911.
- Wu, Tong Tong, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. "Genome-wide association analysis by lasso penalized logistic regression." Bioinformatics 25.6 (2009): 714-721.
SAFE screening
- Ghaoui, Laurent El, Vivian Viallon, and Tarek Rabbani. "Safe feature elimination for the lasso and sparse supervised learning problems." arXiv preprint arXiv:1009.4219 (2010).
- Liu, Jun, et al. "Safe screening with variational inequalities and its application to lasso." arXiv preprint arXiv:1307.7577 (2013).
Rather than match vs no match, we now need a similarity score a.k.a. distance $d$(string1, string2)
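A standard such distance for strings is the Levenshtein edit distance: the minimum number of insertions, deletions, and substitutions needed to turn one string into the other. A minimal bottom-up sketch using dynamic programming (reviewed below):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, filled in bottom-up."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                # delete all i characters
    for j in range(n + 1):
        d[0][j] = j                # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (free on a match)
    return d[m][n]

print(edit_distance('kitten', 'sitting'))    # 3
print(edit_distance('calendar', 'calandar')) # 1 (one substitution)
```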
Matching time-series with varying timescales¶
How similar are these two curves, assuming we ignore varying timescales?
Example: we wish to determine location of hiker given altitude measurements during hike.
Note the amount of warping is often not itself the distance metric. We first "warp" the pattern, then compute the distance some other way, e.g. least mean squares (LMS). The final distance is the smallest LMS distance over all acceptable warpings.
Dynamic Time Warping
Dynamic Programming Review¶
Fibonacci sequence¶
\begin{align} f(n) &= f(n-1) + f(n-2) \\ f(0) &= 0 \\ f(1) &= 1 \\ \end{align}
Recursive calculation¶
Inefficient due to repeatedly calculating same terms
def fib_recursive(n):
if n == 0: return 0
if n == 1: return 1
return fib_recursive(n-1) + fib_recursive(n-2)
Each term requires two additional terms be calculated. Exponential time.
DP calculation¶
Intelligently plan terms to calculate and store ("cache"), e.g. in a table.
Each term requires one term be calculated.
DP Caching¶
Always plan out the data structure and calculation order.
- need to make sure you have sufficient space
- need to choose optimal strategy to fill in
The data structure to fill in for Fibonacci is trivial:
Optimal order is to start at bottom and work up to $n$, so always have what you need for next term.
What are caches?¶
Caches are storage for information to be used in the near future in a more accessible form.
The difference between dynamic programming and recursion¶
1) Direction
Recursion: Starts at the end/largest and divides into subproblems.
DP: Starts at the smallest and builds up a solution.
2) Amount of computation:
During recursion, the same sub-problems are solved multiple times.
DP is basically a memoization technique: it uses a table to store the results of sub-problems, so that if the same sub-problem is encountered again, the stored result is returned directly instead of being recalculated.
DP = apply common sense to do recursive problems more efficiently.
def fib_recursive(n):
if n == 0: return 0
if n == 1: return 1
return fib_recursive(n-1) + fib_recursive(n-2)
def fib_dp(n):
fib_seq = [0, 1]
for i in range(2,n+1):
fib_seq.append(fib_seq[i-1] + fib_seq[i-2])
return fib_seq[n]
Let's compare the runtime
%timeit -n4 fib_recursive(30)
390 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 4 loops each)
%timeit -n4 fib_dp(100)
14.5 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 4 loops each)
Algorithm Complexity¶
Complexity in algorithm theory refers to the runtime of an algorithm, usually described by a function which bounds the worst-case number of computations based on the input size.
Basically in data science we want linear complexity, meaning the computation time is a scalar multiple of the data size.
Exponential complexity is practically impossible. Such algorithms are replaced by rough approximations out of necessity.
Generator approach¶
def fib():
a, b = 0, 1
while True:
a, b = b, a+b
yield a
f = fib()
for i in range(10):
print(next(f))
1 1 2 3 5 8 13 21 34 55
from itertools import islice
help(islice)
Help on class islice in module itertools: class islice(builtins.object) | islice(iterable, stop) --> islice object | islice(iterable, start, stop[, step]) --> islice object | | Return an iterator whose next() method returns selected values from an | iterable. If start is specified, will skip all preceding elements; | otherwise, start defaults to zero. Step defaults to one. If | specified as another value, step determines how many values are | skipped between successive calls. Works like a slice() on a list | but returns an iterator. | | Methods defined here: | | __getattribute__(self, name, /) | Return getattr(self, name). | | __iter__(self, /) | Implement iter(self). | | __next__(self, /) | Implement next(self). | | __reduce__(...) | Return state information for pickling. | | __setstate__(...) | Set state information for unpickling. | | ---------------------------------------------------------------------- | Static methods defined here: | | __new__(*args, **kwargs) from builtins.type | Create and return a new object. See help(type) for accurate signature.
n = 100
next(islice(fib(), n-1, n))
354224848179261915075
%timeit -n4 next(islice(fib(), n-1, n))
The slowest run took 9.46 times longer than the fastest. This could mean that an intermediate result is being cached. 21.7 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 4 loops each)
Memoization¶
A technique used in computing to speed up programs. Temporarily stores the calculation results of processed input such as the results of function calls.
If the same input, or a function call with the same parameters, occurs again, the previously stored result can be reused and unnecessary calculation is avoided.
In many cases a simple array is used for storing the results, but lots of other structures can be used as well, such as associative arrays, called hashes in Perl or dictionaries in Python.
Source: http://www.amazon.com/Algorithm-Design-Manual-Steve-Skiena/dp/0387948600
def fib_recursive(n):
if n == 0: return 0
if n == 1: return 1
return fib_recursive(n-1) + fib_recursive(n-2)
def fib_dp(n):
cache = [0,1]
for i in range(2,n+1):
cache.append(cache[i-1]+cache[i-2])
return cache[n]
def memoize(f):
memo = {}
def helper(x):
if x not in memo:
memo[x] = f(x)
return memo[x]
return helper
fib_recursive_memoized = memoize(fib_recursive)
fib_recursive_memoized(9)
34
%timeit fib_recursive(9)
6.23 μs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit fib_recursive_memoized(9)
80.7 ns ± 1.49 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
from functools import lru_cache
@lru_cache()
def fib_recursive(n):
"Calculate nth Fibonacci number using recursion"
if n == 0: return 0
if n == 1: return 1
return fib_recursive(n-1) + fib_recursive(n-2)
%timeit fib_recursive(n)
55 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Memoization in Python with joblib¶
import joblib
dir(joblib)
['Logger', 'MemorizedResult', 'Memory', 'Parallel', 'PrintTime', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_cloudpickle_wrapper', '_memmapping_reducer', '_multiprocessing_helpers', '_parallel_backends', '_store_backends', '_utils', 'backports', 'compressor', 'cpu_count', 'delayed', 'disk', 'dump', 'effective_n_jobs', 'executor', 'expires_after', 'externals', 'func_inspect', 'hash', 'hashing', 'load', 'logger', 'memory', 'numpy_pickle', 'numpy_pickle_compat', 'numpy_pickle_utils', 'os', 'parallel', 'parallel_backend', 'parallel_config', 'pool', 'register_compressor', 'register_parallel_backend', 'register_store_backend', 'wrap_non_picklable_objects']
Exercise: Change making problem.¶
The objective is to determine the smallest number of pieces of currency of given denominations required to make change for a given amount.
For example, if the denominations are \$1 and \$2 and we need to make change for \$3, then we would use \$1 + \$2, i.e. 2 pieces of currency.
However, if the amount were \$4, we could use \$1+\$1+\$1+\$1, \$1+\$1+\$2, or \$2+\$2, and the minimum number of pieces would be 2 (\$2+\$2).
Solution: dynamic programming (DP).¶
The minimum number of coins required to make change for \$P is the minimum, over available coin values \$x, of the number of coins required to make change for \$P-x, plus 1 (+1 because we need one more coin to get from \$P-x to \$P).
This can be expressed mathematically as follows:
Assume we have $n$ distinct denominations, where denomination $i$ has value $v_i$. We can sort the denominations such that $v_1<v_2<v_3<\dots<v_n$
Let $C(p)$ denote the minimum number of pieces of currency required to make change for $\$p$
Using the principle of recursion: $C(p)=\min_i \left[C(p-v_i)\right]+1$, with base case $C(0)=0$
For example, assume we want to make 5, and $v_1=1, v_2=2, v_3=3$.
Therefore $C(5) = min(C(5-1)+1, C(5-2)+1, C(5-3)+1)$ $\Longrightarrow min(C(4)+1, C(3)+1, C(2)+1)$
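The recurrence above can be computed bottom-up. A minimal sketch (the function name `min_coins` is hypothetical):

```python
def min_coins(amount, denominations):
    """Minimum number of coins needed to make `amount`, via bottom-up DP.
    C[p] holds the answer for sub-amount p; C[0] = 0 coins."""
    INF = float("inf")
    C = [0] + [INF] * amount
    for p in range(1, amount + 1):
        for v in denominations:
            if v <= p and C[p - v] + 1 < C[p]:
                C[p] = C[p - v] + 1  # recurrence: C(p) = min_i C(p - v_i) + 1
    return C[amount]

print(min_coins(5, [1, 2, 3]))  # 2 (e.g. 2 + 3)
print(min_coins(4, [1, 2]))     # 2 (2 + 2)
```

Each sub-amount is solved once and cached in `C`, so the cost is O(amount × number of denominations).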
Exercise: Compute a polynomial over a list of points¶
How can we use redundancy here?
Mini-summary¶
In a recursive approach, your function recursively calls itself on smaller subproblems as needed until the calculation is done, potentially performing many redundant calculations.
With memoization, you keep a cache of these recursive calls and check whether the same call has been made before recalculating. There is still some risk, since memory needs and usage are not planned in advance.
With Dynamic Programming, you follow a bottom-up plan to produce the cache of results for smaller subproblems.
Edit Distance between two strings¶
Minimum number of operations needed to convert one string into the other
Ex. typo correction: how do we decide what "scool" was supposed to be?
Consider possibilities with lowest edit distance. "school" or "cool".
Hamming distance - operations consist only of substitutions of character (i.e. count differences)
Levenshtein distance - operations are the removal, insertion, or substitution of a character
"Fuzzy String Matching"
def hammingDistance(x, y):
''' Return Hamming distance between x and y '''
assert len(x) == len(y)
nmm = 0
for i in range(0, len(x)):
if x[i] != y[i]:
nmm += 1
return nmm
hammingDistance('brown', 'blown')
1
hammingDistance('cringe', 'orange')
2
Levenshtein distance (between strings $P$ and $T$)¶
Special case where the operations are the insertion, deletion, or substitution of a character
- Insertion – Insert a single character into pattern $P$ to help it match text $T$ , such as changing “ago” to “agog.”
- Deletion – Delete a single character from pattern $P$ to help it match text $T$ , such as changing “hour” to “our.”
- Substitution – Replace a single character from pattern $P$ with a different character in text $T$ , such as changing “shot” to “spot.”
Count the minimum number needed to convert $P$ into $T$.
Interchangeably called "edit distance".
Exercise: What are the Hamming and Edit distances?¶
\begin{align} T: \text{"The quick brown fox"} \\ P: \text{"The quick grown fox"} \\ \end{align}
\begin{align} T: \text{"The quick brown fox"} \\ P: \text{"The quik brown fox "} \\ \end{align}
Exercise: What are the Edit distances?¶
Comprehension check: give three different ways to transform $P$ into $T$ (not necessarily fewest operations)
Edit distance - Divide and conquer¶
How do we use simpler comparisons to perform more complex ones?
Use substring match results to compute
Consider starting from the first character and building up
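The standard dynamic-programming approach builds a table of edit distances between prefixes of $P$ and $T$; a minimal sketch:

```python
def edit_distance(p, t):
    """Levenshtein distance via DP: D[i][j] = distance between p[:i] and t[:j]."""
    m, n = len(p), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of p[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if p[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[m][n]

print(edit_distance("intention", "execution"))  # 5
print(edit_distance("scool", "school"))         # 1
```

Each cell depends only on its three neighbors above and to the left, so the whole table is filled bottom-up in O(mn) time.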
Probabilistic Language Modeling¶
I. Probability Review¶
Sample space¶
$\Omega$ = set of all possible outcomes of an experiment
Examples:
- $\Omega = \{HH, \ HT, \ TH, \ TT\}$ (discrete, finite)
- $\Omega = \{0, \ 1, \ 2, \ \dots\}$ (discrete, infinite)
- $\Omega = [0, \ 1]$ (continuous, infinite)
- $\Omega = \{ [0, 90), [90, 360) \}$ (discrete, finite)
Events¶
subset of the sample space. That is, any collection of outcomes forms an event.
Example:
Toss a coin twice. Sample space: $\Omega = \{HH, \ HT, \ TH, \ TT\}$
Let event $A$ be the event that there is **exactly one head**
We write: $A = \text{"exactly one head"}$
Then $A = \{HT, \ TH \}$
$A$ is a subset of $\Omega$, and we write $A \subset \Omega$
Combining Events: Union, Intersection and Complement¶
- Union of two events $A$ and $B$, called $A \cup B$ = the set of outcomes that belong either to $A$, to $B$, or to both. In words, $A \cup B$ means "A or B."
- Intersection of two events $A$ and $B$, called $A \cap B$ = the set of outcomes that belong both to $A$ and to $B$. In words, $A \cap B$ means “A and B.”
- Complement of an event $A$, called $A^c$ = the set of outcomes that do not belong to $A$. In words, $A^c$ means "not A."
Theorems:
- The probability of event "not A": $P(A^c) = 1 - P(A)$
- For any A and B (not necessarily disjoint): $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Notation note: $P(A,B) = P(A \text{ AND } B) = P(A \cap B)$
Venn Diagram¶
This sort of diagram representing events in a sample space is called a Venn diagram.
a) $A \cup B$ (e.g. 6 on either $die_1$ or $die_2$, or both)
b) $A \cap B$ (e.g. 6 on both $die_1$ and $die_2$)
c) $B \cap A^c$ (e.g. 6 on $die_2$ but not on $die_1$)
Axioms of Probability¶
- For any event $A$: $P(A) \geq 0$
- The probability of the entire sample space: $P(\Omega) = 1$
- For any countable collection $A_1, A_2,...$ of mutually exclusive events:
$$P(A_1\cup A_2 \cup \dots) = P(A_1) + P(A_2) + \dots$$
Law of Total Probability¶
If $B_1, B_2, \dots, B_k$ form a partition of $\Omega$, then $(A \cap B_1), (A \cap B_2), \dots, (A \cap B_k)$ form a partition of the set or event A.
The probability of event A is therefore the sum of its parts:
$$P(A) = P(A \cap B_1) + P(A \cap B_2) + \dots + P(A \cap B_k)$$
Counting¶
If experiment $A$ has $n$ possible outcomes, and experiment $B$ has $k$ possible outcomes, then there are $nk$ possible outcomes when you perform both experiments.
Example:¶
Let $A$ be the experiment "Flip a coin." Let $B$ be "Roll a die." Then $A$ has two outcomes, $H$ and $T$, and $B$ has six outcomes, $1,...,6$. The joint experiment, called "Flip a coin and roll a die" has how many outcomes?
Explain what this computation means in this case
Permutation¶
The number of $k$-permutations of $n$ distinguishable objects is $$^nP_k=n(n-1)(n-2)\dots(n-k+1) = \frac{n!}{(n-k)!}$$
The number of ways to select $k$ objects from $n$ distinct objects when different orderings constitute different choices
Example¶
I have five vases, and I want to put two of them on the table, how many different ways are there to arrange the vases?
Combination¶
If order doesn't matter...
- The number of ways to choose $k$ objects out of $n$ distinguishable objects is $$ ^nC_k = \binom{n}{k} = \frac{^nP_k}{k!}=\frac{n!}{k!(n-k)!}$$
Example¶
Q. How many ways are there to get 4 of a kind in a 5 card draw?
A. Break it down:
- Q. How many ways are there to choose the denomination of the 4 of a kind? A. $\binom{13}{1}$
- Q. How many ways can you fill the last card? A. $\binom{52-4}{1}$
- Total: $$\binom{13}{1} \binom{48}{1} = 13 \times 48 = 624$$
Example¶
Q. How many ways are there to get a full house in a 5 card draw?
A. The matching triple can be any of the 13 denominations, and the pair can be any of the remaining 12 denominations.
- Q. How many ways are there to select the suits of the matching triple? A. $13\binom{4}{3}$
- Q. How many ways are there to select the suits of the matching pair? A. $12\binom{4}{2}$
- Total: $$13\binom{4}{3} \times 12\binom{4}{2} = 13 \times 4 \times 12 \times 6 = 3,744$$
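These counts can be sanity-checked with Python's `math.comb`:

```python
from math import comb

# Four of a kind: choose the denomination, then any of the 48 remaining cards
four_kind = comb(13, 1) * comb(48, 1)
print(four_kind)  # 624

# Full house: denomination + suits for the triple, then for the pair
full_house = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)
print(full_house)  # 3744
```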
Conditional Probability: The probability that A occurs, given B has occurred¶
$$ P(A|B) = \frac{P(A\cap B)}{P(B)} $$
Joint probability : $P(A \cap B) = P(A|B) \times P(B)$
Law of Total Probability: If $B_1, \dots, B_k$ partition $S$, then for any event A,
$$ P(A) = \sum_{i = 1}^k P(A \cap B_i) = \sum_{i = 1}^k P(A | B_i) P(B_i) $$
Bayes' Theorem :¶
$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} = \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B|A^c) \times P(A^c)} $$
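A quick numeric illustration of the expanded-denominator form (all the probabilities below are made up, for a hypothetical screening test):

```python
# Hypothetical test: A = "has condition", B = "test is positive"
p_a = 0.01              # P(A), prior
p_b_given_a = 0.95      # P(B|A), sensitivity
p_b_given_not_a = 0.05  # P(B|A^c), false-positive rate

# Denominator via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.161
```

Even with an accurate test, the posterior stays small because the prior $P(A)$ is small.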
Chain Rule for Probability¶
We can write any joint probability as incremental product of conditional probabilities,
$ P(A_1 \cap A_2) = P(A_1)P(A_2 | A_1) $
$ P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2 | A_1)P(A_3 | A_2 \cap A_1) $
In general, for $n$ events $A_1, A_2, \dots, A_n$, we have
$ P (A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)P(A_2 | A_1) \dots P(A_n | A_{n-1} \cap \dots \cap A_1) $
Estimating probabilities¶
Frequentist perspective: probability of event is relative frequency
$$P(A) = \frac{\text{\# times $A$ occurs}}{\text{total \# experiments}}$$
$$P(A,B) = \frac{\text{\# times $A$ and $B$ occur together}}{\text{total \# experiments}}$$
$$P(A|B) = \text{?}$$
Problems with relative frequency¶
- Sampling may not be sufficiently random (or sufficiently large).
$$P(\text{candidate will win}) = \frac{\text{\# facebook users who support candidate}}{\text{total \# of facebook users}}?$$
- Very rare events.
$$P(\text{winning lottery}) = \frac{\text{\# times won lottery}}{\text{total \# times bought ticket}} = 0?$$
II. Language Modeling¶
Prerequisites¶
Probability basic concepts, axioms
Joint probability : $P(A \cap B) = P(A|B) \times P(B)$
Chain Rule for Probability $ P (A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)P(A_2 | A_1) \dots P(A_n | A_{n-1} \cap \dots \cap A_1) $
Law of total probability $P(A) = P(A \cap B_1) + P(A \cap B_2) + P(A \cap B_3) + P(A \cap B_4)$ where $B_i$ partition set of all possible outcomes
Notation shorthands $P(A \cap B) = P(A,B) = P(A \text{ AND } B)$
(Probabilistic) Language Modeling (The LM in LLM's)¶
Assign probabilities to sequences of "words": a model of language structure.
Most common word is "the".
Is the most common two-word sentence: "the the."?
"Their are..." vs. "There are..." vs. "They're are...". Only one makes sense.
Generally, predict next word in this sequence: "5, 4, 3, 2, _"
"Sequence" model¶
Note word order is built into the events.
$$P(\text{first word}=\text{"once"},\text{second word}=\text{"upon"},\text{third word}=\text{"a"},\text{fourth word}=\text{"time"})$$
Or with vector notation $P(\mathbf w)$, where $w_1$="once", $w_2$="upon", $w_3$="a", $w_4$="time".
Unless we choose to ignore relative location of words (Naive Bayes, Bag-of-words).
"Predictive" model¶
$$P(\text{fourth word}=\text{"time"}\,|\,\text{first word}=\text{"once"},\text{second word}=\text{"upon"},\text{third word}=\text{"a"})$$
Different versions of same information. Both are referred to as "language models".
Estimating sentence probabilities¶
$$P(\text{"The cat sat on the hat."}) = \frac{\text{\# times corpus contains sentence: "The cat sat on the hat."}}{\text{total \# of sentences in corpus}}?$$
$$P(\text{"...hat"} \,|\, \text{"The cat sat on the..."}) = ?$$
$$ P(A|B) = \frac{P(A\cap B)}{P(B)} $$
What are the events A and B here?
$$P(\text{"...hat"} \,|\, \text{"The cat sat on the..."}) = \frac{\text{\# times corpus contains sentence: "The cat sat on the hat."}}{\text{\# of sentences starting with "The cat sat on the..."}}?$$
Problem: most sentences will never appear in our corpus.
Exercise: Rare sentences¶
We need to assign a probability to every possible sentence
Suppose our vocabulary is (limited to) 10,000 words
How many possible 5-word sentences are there?
$(10,000)^5 = 10^{20}$ -- sampling with replacement
Relating joint probability $P(w_1w_2w_3)$ to conditional $P(w_n|w_1w_2)$¶
How do we do this?
Answer: use the Chain Rule of Probability
$$P(w_1w_2) = P(w_2|w_1)P(w_1)$$
$$P(\text{"the cat"}) = P(\text{"...cat"}|\text{"the..."})P(\text{"the..."})$$
$$P(w_1w_2w_3) = P(w_3|w_1w_2)P(w_1w_2)$$
\begin{align} P(\text{"the cat sat"}) &= P(\text{"...sat"}|\text{"the cat..."})P(\text{"the cat..."}) \\ &=P(\text{"...sat"}|\text{"the cat..."})P(\text{"...cat..."}|\text{"the..."})P(\text{"the..."}) \end{align}
The chain rule of probability¶
\begin{align} P(w_1w_2...w_n) &= P(w_n|w_1...w_{n-1})P(w_{n-1}|w_1...w_{n-2})...P(w_1) \\ &= \prod_{k=1}^n P(w_k|w^{k-1}_1) \end{align}
where we denote the sequence from $w_a$ to $w_b$, i.e., $w_aw_{a+1}...w_b$, as $w^{b}_a$
Exercise: apply the chain rule of probability to "the quick brown fox jumped over the lazy dog".
$N$-grams¶
Model order limited to length $N$ sequences:
So use previous $N-1$ words: $P(\text{ word }|\text{ preceding word(s) are ...})$
- Unigram
- Bigram
- Trigram
Exercise: list all possible unigrams from sentence "The cat sat on the hat."
Exercise: list all possible bigrams from sentence "The cat sat on the hat."
Exercise: Suppose we have a 1M-word corpus, and "The cat sat on the hat." appears 47 times. Estimate its probability.
$N$-gram approximation formally¶
\begin{align} P(w_n|w_{n-1}...w_1) &\approx P(w_n|w_{n-(N-1)}...w_{n-1}) \\ P(w_n|w^{n-1}_1) &\approx P(w_n|w^{n-1}_{n-N+1}) \end{align}
Note our new shorthand for sequences
Exercise: Write right-hand-side out in terms of $w_1,w_2,w_3$ for unigrams, bigrams, trigrams
Relating joint probability $P(w_1w_2w_3)$ to conditional $P(w_3|w_1w_2)$¶
How did we do this again?
$$P(w_1w_2w_3) = P(w_3|w_1w_2)P(w_1w_2)$$
\begin{align} P(\text{"the cat sat"}) &= P(\text{"...sat"}|\text{"the cat..."})P(\text{"the cat..."}) \\ &=P(\text{"...sat"}|\text{"the cat..."})P(\text{"...cat..."}|\text{"the..."})P(\text{"the..."}) \end{align}
Suppose we wish to use bigrams, how do we apply the bigram approximation here?
Bag-of-Words (BOW)¶
Convert a text string (like a document) into a vector of word frequencies by essentially summing up the one-hot encoded vectors for the words. Perhaps divide by total number of words.
Basically get a histogram for each document to use as a feature vector. The text becomes structured data.
What kind of $N$-gram does this use?
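A bag-of-words vector is essentially a word-count histogram over the vocabulary; a minimal sketch using `collections.Counter` (toy documents assumed):

```python
from collections import Counter

docs = ["the cat sat on the hat", "the dog ate the cat"]
# Vocabulary: sorted union of all words across documents
vocab = sorted(set(w for d in docs for w in d.split()))

def bow_vector(doc):
    """Count each vocabulary word's occurrences in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)                 # ['ate', 'cat', 'dog', 'hat', 'on', 'sat', 'the']
print(bow_vector(docs[0]))   # [0, 1, 0, 1, 1, 1, 2]
print(bow_vector(docs[1]))   # [1, 1, 1, 0, 0, 0, 2]
```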
Markov assumption¶
Future state only depends on present state (not past states)
How might this apply to word prediction?
$$P(\text{word}\,|\,\text{all previous words}) \approx P(\text{word}\,|\,\text{immediately preceding word only})$$
What kind of $N$-gram does this use?
Higher-order Markov processes: the probability of the next step depends on the current node and one or more previous nodes. These correspond to $N$-grams for $N>2$.
Bigram Probability Estimation¶
$$P(w_n|w_{n-1}) = \dfrac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w)} = \dfrac{C(w_{n-1}w_n)}{C(w_{n-1})}$$
where we used the Law of Total Probability: the counts of all bigrams starting with $w_{n-1}$ sum to the count of $w_{n-1}$.
Assume the following is our entire corpus:
What is P(I|<s>)? P(am|I)? P(Sam|am)? P(</s>|Sam) ?
What is the bigram estimate of the probability of <s>I am Sam</s>?
What is the direct estimate (no bigram approximation) for <s>I am Sam</s> using the corpus?
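A sketch of bigram estimation by counting, assuming the classic three-sentence toy corpus from J&M (the corpus and function name here are illustrative):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers (assumed example)
corpus = [["<s>", "I", "am", "Sam", "</s>"],
          ["<s>", "Sam", "I", "am", "</s>"],
          ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w, prev):
    """MLE bigram estimate: C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("I", "<s>"))   # 2/3
print(p_bigram("am", "I"))    # 2/3
print(p_bigram("Sam", "am"))  # 1/2
```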
$N$-gram Probability Estimation¶
Bigram Example - Berkeley Restaurant database¶
Notes¶
Trigrams are actually much more commonly used in practice, but bigrams are easier for examples
In practice, we work with log-probabilities instead to avoid underflow
Exercise: demonstrate how this calculation would be done with log-probabilities
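A quick illustration of why log-probabilities are needed (the token probabilities below are made up):

```python
import math

probs = [0.01] * 200  # 200 tokens, each with probability 0.01 (made-up)

# Multiplying directly underflows to zero in double precision
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing logs stays well within floating-point range
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -921.03
```

The sum of logs represents the same quantity ($10^{-400}$) that the direct product cannot.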
Model Evaluation¶
Extrinsic versus Intrinsic
Training set, Test set, and "Dev Set"
Use Training Set for computing counts of sequences, then compare to Test set.
Perplexity of a Language Model (PP)¶
Bigram version:
$$PP(W) = \sqrt[N]{\prod_{i=1}^N \dfrac{1}{P(w_i|w_{i-1})}}$$
Weighted average branching factor - number of words that can follow any given word, weighted by probability.
Note similarity to concept of information entropy.
Exercise: perplexity of digits¶
Assume each digit has probability of $\frac{1}{10}$, independent of prior digit. Compute PP.
What happens if probabilities vary from this uniform case?
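The perplexity computation can be sketched via log-probabilities; for the uniform-digit case, every token has probability $\frac{1}{10}$, so $PP = 10$ (the non-uniform probabilities below are made up):

```python
import math

def perplexity(probs):
    """PP = exp(-(1/N) * sum(log p_i)) for a sequence of token probabilities."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# Uniform digits: every digit has probability 1/10
print(perplexity([0.1] * 50))  # 10.0

# Non-uniform: a sequence mostly of one very likely digit
seq = [0.91] * 45 + [0.01] * 5
print(perplexity(seq))  # below 10 -- skewed probabilities lower perplexity
```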
Perplexity for comparing $N$-gram models¶
WSJ dataset: 20k word vocabulary, 1.5M word corpus.
- Unigram Perplexity: 962
- Bigram Perplexity: 170
- Trigram Perplexity: 109
Perplexity Notes¶
Overfitting and Bias cause misleadingly low PP
Shorter vocabulary causes higher PP, must compare using same size.
Using WSJ treebank corpus:
Note bias caused by corpus, importance of genre & dialect.
Sparse $N$-gram problem¶
Consider what happens to this calculation if any of our $N$-gram counts are zero in the corpus.
Closed Vocabulary System - Vocabulary is limited to certain words. Ex: phone system with limited options.
Open Vocabulary System - Possible unknown words - set to < UNK >. Typically treat as any other word.
Out-of-Vocabulary (OOV) - words encountered in the test set (or real application) which aren't in the training set.
Smoothing a.k.a. Discounting¶
Adjust $N$-gram counts to give positive counts in place of zeros, which reduces the counts of others.
Laplace a.k.a. Additive smoothing¶
https://en.wikipedia.org/wiki/Additive_smoothing
$$P(w_i) = \dfrac{C(w_i)}{N} \approx \dfrac{C(w_i)+1}{N+V}$$
$N$ = total number of words.
$V$ = size of vocabulary.
Adjusted count: $C^*(w_i) = \left(C(w_i)+1\right)\dfrac{N}{N+V}$
Exercise: compute $P(w_i)$ using the adjusted count.
Discount = reduction for words with nonzero counts = $d_c=\dfrac{C^*(w_i)}{C(w_i)}$
Before Laplace smoothing:
After Laplace smoothing:
Before Laplace smoothing:
After Laplace smoothing:
Note large reductions.
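A minimal sketch of add-one smoothing for unigram estimates (the counts and vocabulary size below are made up):

```python
from collections import Counter

counts = Counter({"the": 50, "cat": 10, "hat": 5})  # toy counts
V = 1000                   # vocabulary size (assumed)
N = sum(counts.values())   # 65 observed tokens

def p_laplace(w):
    """Add-one smoothed unigram probability: (C(w)+1) / (N+V)."""
    return (counts[w] + 1) / (N + V)

print(p_laplace("the"))      # (50+1)/1065
print(p_laplace("unicorn"))  # unseen word gets 1/1065 instead of zero

# Discount for a seen word: ratio of adjusted count C* to raw count
c_star = (counts["the"] + 1) * N / (N + V)
print(c_star / counts["the"])  # well below 1 -- a large reduction
```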
Backoff : If desired $N$-gram not available, use $(N-1)$-gram.
Stupid Backoff : perform backoff but don't bother adjusting normalization properly.
Interpolation : combine $N$-grams for different $N$.
Simple linear interpolation: linear combination of $N$-gram with $(N-1)$-gram and $(N-2)$-gram ... and unigram.
Note the need to adjust normalization (the denominator in the probability estimate) depending on the total number of $N$-grams used
Absolute Discounting¶
Church & Gale noticed in 1991 using AP Newswire dataset with 22M word training set and 22M word test set:
Bigram Absolute discounting with interpolated Backoff
$$ P_{Abs}(w_i|w_{i-1})= \dfrac{C(w_{i-1}w_i)-d}{\sum_v C(w_{i-1}v)}+\lambda(w_{i-1})P(w_i) $$ context-dependent weights: $\lambda$ is higher when the count is higher.
Continuation Probability¶
Consider $P(\text{kong})>P(\text{glasses})$, but $P(\text{reading glasses})>P(\text{reading kong})$
Bigrams should capture this, but when we don't have any in training set and need to do backoff, want to still maintain the effect.
Replace unigram probability with continuation probability.
Word appears in many different bigrams --> higher $P_{CONTINUATION}$
III. Naive Bayes Classification¶
Text Categorization¶
the task of assigning a label or category to an entire text or document
- sentiment analysis - positive or negative orientation that a writer expresses toward some object. A review of a movie, book, or product on the web expresses the author’s sentiment toward the product, while an editorial or political text expresses sentiment toward a candidate or political action.
- spam detection
- language identification
- authorship attribution
- topic classification
Example: sentiment analysis¶
Words like great, richly, awesome, pathetic, awful, and ridiculously are very informative cues:
- ...zany characters and richly applied satire, and some great plot twists
- It was pathetic. The worst part about it was the boxing scenes...
- ...awesome caramel sauce and sweet toasty almonds. I love this place!
- ...awful pizza and ridiculously overpriced...
(so a unigram model might work reasonably well)
Supervised machine learning¶
- have a data set of input observations, each associated with some correct output (a ‘supervision signal’).
- The goal of the algorithm is to learn how to map from a new observation to a correct output.
- Input: $x$, a.k.a. $d$ (document)
- Output: $y$, a.k.a. $c$ (class)
- Probabilistic classifier: output class probabilities, e.g. 99% chance of belonging to class 0, 1% to class 1.
- Generative classifiers: model data generated by class. Then use it to return class most likely to have generated data. Ex: Naive Bayes.
- Discriminative classifiers: learn model to directly discriminate classes given data, ex. logistic regression.
Naive Bayes = MAP estimate of class¶
\begin{align} \hat{c} &= \arg \max_c P(c|d) \\ &= \arg\max_c \dfrac{P(d|c)P(c)}{P(d)} \\ &= \arg\max_c P(d|c)P(c) \end{align}
$P(c|d)$ = posterior probability
$P(d|c)$ = likelihood of data
$P(c)$ = prior probability of class $c$
Naive assumption¶
\begin{align} P(d|c) &= P(w_1w_2...w_L|c) \\ &\approx P(w_1|c)P(w_2|c)...P(w_L|c) \end{align}
\begin{align} \arg\max_c P(d|c)P(c) \rightarrow \arg \max_c P(w_1|c)P(w_2|c)...P(w_L|c)P(c) \end{align}
Inference using model¶
$$ c_{NB} = \arg\max_c P(c) \prod_i P(w_i|c) $$
Where the product index $i$ runs over every word in document (including repeats).
Using log-probabilities¶
$$ c_{NB} = \arg\max_c \left\{ \log P(c) + \sum_i \log P(w_i|c) \right\} $$
Note this can be viewed as a linear classification technique: we take a linear combination of features (words) by applying a weight to each.
With the feature vector as a list of 1's (one entry per word occurrence), a different weight vector is needed for each class, and the vector is document-specific.
Or, with a bag-of-words representation, the feature vector is a histogram, and the same weight vectors can be applied to different documents.
\begin{align} c_{NB} &= \arg\max_c \left\{ \log P(c) + \sum_i \log P(w_i|c) \right\} \\ &= \arg\max_c \left\{ \log P(c) + \sum_{w \in V} N_w \log P(w|c) \right\} \\ \end{align} where we sum over all words in vocabulary, and apply weights.
Note that outputs aren't class probabilities. How could we make them into probabilities?
Bag-of-Words (BOW)¶
Convert a text string (like a document) into a vector of word frequencies $(N_1,N_2, ...)$ by essentially summing up the one-hot encoded vectors for the words. Perhaps divide by total number of words.
Basically get a histogram for each document to use as a feature vector. The text becomes structured data.
How does this relate to $N$-grams?
Estimating the probabilities¶
$$ P(c) = \text{probability of document belonging to class $c$} \approx \dfrac{\text{\# documents in corpus belonging to class $c$}}{\text{total \# documents in corpus}} $$
$$ P(w|c) = \text{probability of word $w$ appearing in document from class $c$} \approx \dfrac{\text{count of word $w$ in documents of class $c$}}{\text{total \# words in documents of class $c$}} $$
What probability rule are we using here?
Application notes¶
- Need smoothing (Laplace smoothing is popular for Naive Bayes) to deal with zeros.
- Unknown words typically can be ignored here.
- Stop words and frequent words like 'a' and 'the' also often ignored.
- special handling of negation for sentiment applications - include negated versions of words
Binary Naive Bayes : rather than using counts (and smoothing), just use a binary indicator of word presence or absence (1 or 2, rather than 0 or 1, to avoid zeros). Essentially just remove word repeats from each document.
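A minimal sketch of multinomial Naive Bayes with add-one smoothing, following the estimates above (the training corpus and function names are made up):

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training corpus: (document, class)
train = [("great plot and awesome twists", "pos"),
         ("richly applied satire", "pos"),
         ("pathetic boxing scenes", "neg"),
         ("awful and ridiculously overpriced", "neg")]

docs_per_class = Counter(c for _, c in train)      # for P(c)
word_counts = defaultdict(Counter)                 # for P(w|c)
for doc, c in train:
    word_counts[c].update(doc.split())
vocab = set(w for doc, _ in train for w in doc.split())

def score(doc, c):
    """log P(c) + sum_i log P(w_i|c), with add-one smoothing."""
    log_p = math.log(docs_per_class[c] / len(train))
    total = sum(word_counts[c].values())
    for w in doc.split():
        if w in vocab:  # unknown words are ignored
            log_p += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return log_p

def classify(doc):
    return max(docs_per_class, key=lambda c: score(doc, c))

print(classify("awesome satire"))  # pos
print(classify("awful scenes"))    # neg
```

Working in log space keeps the per-word products from underflowing, exactly as in the language-model case.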
Get word classes from online lexicons¶
Lists of positive vs negative words.
- General Inquirer (Stone et al., 1966),
- LIWC (Pennebaker et al., 2007),
- the opinion lexicon of Hu and Liu (2004a)
- the MPQA Subjectivity Lexicon (Wilson et al., 2005).
Summary: Naive Bayes versus Language Models¶
A language model models statistical relationships between words, and can be used to predict words of high overall probability for a string of text.
Naive Bayes models statistical relationships between words and classes, and is used to predict a class given words.
Introduction to Modern NLP¶
History¶
- Natural Language Processing is an old field of study in computer science and Artificial Intelligence research
- E.g. to make a program which can interact with people via natural language in text format
- Tasks range from basic data wrangling operations to advanced A.I.
- Many "canned" problems were posed for competitions and research
- Hardest major problems arguably solved very very recently by large language models
Canned problem examples¶
- Part-of-speech tagging
- Named entity recognition
- Sentiment analysis
- Machine Translation
Includes some of the tasks we solved last class
NLP Python Packages¶
Small libraries to solve the easier NLP problems and related string operations
May include crude solutions for the harder problems (e.g. low-accuracy speech recognition)
- NLTK
- TextBlob
- SpaCy
Python Natural Language Toolkit (NLTK)¶
Natural Language Toolkit (nltk) is a Python package for NLP
Pros: Common, Functionality
Cons: academic, slow, awkward API
import nltk
printcols(dir(nltk),3)
Download NLTK corpora (3.4GB)¶
#nltk.download('genesis')
#nltk.download('brown')
nltk.download('abc')
[nltk_data] Downloading package abc to
[nltk_data]     C:\Users\micro\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\abc.zip.
True
from nltk.corpus import abc
printcols(dir(abc),2)
_LazyCorpusLoader__args _LazyCorpusLoader__kwargs _LazyCorpusLoader__load _LazyCorpusLoader__name _LazyCorpusLoader__reader_cls __class__ __delattr__ __dict__ __dir__ __doc__ __eq__ __format__ __ge__ __getattr__ __getattribute__ __gt__ __hash__ __init__ __init_subclass__ __le__ __lt__ __module__ __name__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _unload subdir
print(abc.raw()[:500])
PM denies knowledge of AWB kickbacks The Prime Minister has denied he knew AWB was paying kickbacks to Iraq despite writing to the wheat exporter asking to be kept fully informed on Iraq wheat sales. Letters from John Howard and Deputy Prime Minister Mark Vaile to AWB have been released by the Cole inquiry into the oil for food program. In one of the letters Mr Howard asks AWB managing director Andrew Lindberg to remain in close contact with the Government on Iraq wheat sales. The Opposition's G
Tokenize¶
import nltk
help(nltk.tokenize)
Help on package nltk.tokenize in nltk:
NAME
nltk.tokenize - NLTK Tokenizer Package
DESCRIPTION
Tokenizers divide strings into lists of substrings. For example,
tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s) # doctest: +NORMALIZE_WHITESPACE
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization
models to be installed. NLTK also provides a simpler,
regular-expression based tokenizer, which splits text on whitespace
and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s) # doctest: +NORMALIZE_WHITESPACE
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
We can also operate at the level of sentences, using the sentence
tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)] # doctest: +NORMALIZE_WHITESPACE
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not
using an encoded version of the string (it may be necessary to
decode it first, e.g. with ``s.decode("utf8")``.
NLTK tokenizers can produce token-spans, represented as tuples of integers
having the same semantics as string slices, to support efficient comparison
of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s)) # doctest: +NORMALIZE_WHITESPACE
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
There are numerous ways to tokenize text. If you need more control over
tokenization, see the other methods provided in this package.
For further information, please see Chapter 3 of the NLTK book.
PACKAGE CONTENTS
api
casual
destructive
legality_principle
mwe
nist
punkt
regexp
repp
sexpr
simple
sonority_sequencing
stanford
stanford_segmenter
texttiling
toktok
treebank
util
FUNCTIONS
sent_tokenize(text, language='english')
Return a sentence-tokenized copy of *text*,
using NLTK's recommended sentence tokenizer
(currently :class:`.PunktSentenceTokenizer`
for the specified language).
:param text: text to split into sentences
:param language: the model name in the Punkt corpus
word_tokenize(text, language='english', preserve_line=False)
Return a tokenized copy of *text*,
using NLTK's recommended word tokenizer
(currently an improved :class:`.TreebankWordTokenizer`
along with :class:`.PunktSentenceTokenizer`
for the specified language).
:param text: text to split into words
:type text: str
:param language: the model name in the Punkt corpus
:type language: str
:param preserve_line: A flag to decide whether to sentence tokenize the text or not.
:type preserve_line: bool
FILE
c:\users\micro\anaconda3\envs\rise_083124\lib\site-packages\nltk\tokenize\__init__.py
# nltk.download('punkt_tab') # <--- may need to do this first, see error
from nltk.tokenize import word_tokenize
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
word_tokenize(s) # doctest: +NORMALIZE_WHITESPACE
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
from nltk.tokenize import word_tokenize
text1 = "It's true that the chicken was the best bamboozler in the known multiverse."
tokens = word_tokenize(text1)
print(tokens)
['It', "'s", 'true', 'that', 'the', 'chicken', 'was', 'the', 'best', 'bamboozler', 'in', 'the', 'known', 'multiverse', '.']
Stemming¶
Chopping off the ends of words.
from nltk import stem
porter = stem.porter.PorterStemmer()
porter.stem("cars")
'car'
porter.stem("octopus")
'octopu'
porter.stem("am")
'am'
"Stemmers"¶
There are 3 commonly used stemmers, each with slightly different rules for systematically replacing affixes in tokens. In general, the Lancaster stemmer stems the most aggressively, i.e. removes the most suffix material from tokens, followed by Snowball and Porter.
Porter Stemmer:
- The most commonly used and the most gentle of the stemmers
- The most computationally intensive of the algorithms (Though not by a very significant margin)
- The oldest stemming algorithm in existence
Snowball Stemmer:
- Universally regarded as an improvement over the Porter Stemmer
- Slightly faster computation time than the Porter Stemmer
Lancaster Stemmer:
- Very aggressive stemming algorithm
- With Porter and Snowball Stemmers, the stemmed representations are usually fairly intuitive to a reader
- With Lancaster Stemmer, shorter tokens that are stemmed will become totally obfuscated
- The fastest algorithm and will reduce the vocabulary
- However, if one desires more distinction between tokens, Lancaster Stemmer is not recommended
from nltk import stem
tokens = ['player', 'playa', 'playas', 'pleyaz']
# Define Porter Stemmer
porter = stem.porter.PorterStemmer()
# Define Snowball Stemmer
snowball = stem.snowball.EnglishStemmer()
# Define Lancaster Stemmer
lancaster = stem.lancaster.LancasterStemmer()
print('Porter Stemmer:', [porter.stem(i) for i in tokens])
print('Snowball Stemmer:', [snowball.stem(i) for i in tokens])
print('Lancaster Stemmer:', [lancaster.stem(i) for i in tokens])
Porter Stemmer: ['player', 'playa', 'playa', 'pleyaz']
Snowball Stemmer: ['player', 'playa', 'playa', 'pleyaz']
Lancaster Stemmer: ['play', 'play', 'playa', 'pleyaz']
Lemmatization¶
https://www.nltk.org/api/nltk.stem.wordnet.html
WordNet Lemmatizer
Provides 3 lemmatizer modes: _morphy(), morphy() and lemmatize().
lemmatize() is a permissive wrapper around _morphy(): it lemmatizes a word by picking the shortest of the possible lemmas found by the wordnet corpus reader's built-in _morphy function, and returns the input word unchanged if it cannot be found in WordNet.
from nltk.stem import WordNetLemmatizer as wnl
print('WNL Lemmatization:',wnl().lemmatize('solution'))
print('Porter Stemmer:', porter.stem('solution'))
WNL Lemmatization: solution
Porter Stemmer: solut
Edit distance¶
from nltk.metrics.distance import edit_distance
edit_distance('intention', 'execution')
5
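Under the hood, edit distance is computed with dynamic programming (see the Skiena chapter in the further reading). A minimal sketch of the classic Levenshtein recurrence, which matches `nltk`'s default behavior:

```python
def levenshtein(s, t):
    """Dynamic-programming edit distance: insert, delete, substitute each cost 1."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

print(levenshtein('intention', 'execution'))  # 5, matching nltk's edit_distance
```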
Textblob¶
https://textblob.readthedocs.io/en/dev/
"Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more."
# conda install conda-forge::textblob
import textblob
printcols(dir(textblob),3)
Blobber __license__ en PACKAGE_DIR __loader__ exceptions Sentence __name__ inflect TextBlob __package__ mixins Word __path__ np_extractors WordList __spec__ os __all__ __version__ parsers __author__ _text sentiments __builtins__ base taggers __cached__ blob tokenizers __doc__ compat translate __file__ decorators utils
from textblob import TextBlob
text1 = '''
It’s too bad that some of the young people that were killed over the weekend
didn’t have guns attached to their [hip],
frankly, where bullets could have flown in the opposite direction...
'''
text2 = '''
A President and "world-class deal maker," marveled Frida Ghitis, who demonstrates
with a "temper tantrum," that he can't make deals. Who storms out of meetings with
congressional leaders while insisting he's calm (and lines up his top aides to confirm it for the cameras).
'''
blob1 = TextBlob(text1)
blob2 = TextBlob(text2)
from nltk.corpus import abc
blob3 = TextBlob(abc.raw())
blob3.words[:50]
WordList(['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', 'Letters', 'from', 'John', 'Howard', 'and', 'Deputy', 'Prime', 'Minister', 'Mark', 'Vaile', 'to', 'AWB', 'have', 'been', 'released'])
from textblob import Word
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to [nltk_data] C:\Users\micro\AppData\Roaming\nltk_data... [nltk_data] Package wordnet is already up-to-date!
True
w = Word("cars")
w.lemmatize()
'car'
Word("octopi").lemmatize()
'octopus'
Word("am").lemmatize()
'am'
w = Word("litter")
w.definitions
['the offspring at one birth of a multiparous mammal', 'rubbish carelessly dropped or left about (especially in public places)', 'conveyance consisting of a chair or bed carried on two poles by bearers', 'material used to provide a bed for animals', 'strew', 'make a place messy by strewing garbage around', 'give birth to a litter of animals']
text = """A green hunting cap squeezed the top of the fleshy balloon of a head. The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once. Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs. In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress. """
blob = TextBlob(text)
blob.sentences
[Sentence("A green hunting cap squeezed the top of the fleshy balloon of a head."),
Sentence("The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once."),
Sentence("Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs."),
Sentence("In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.")]
#blob3.word_counts
blob3.word_counts['the'],blob3.word_counts['and'],blob3.word_counts['people']
(41626, 14876, 1281)
Sentiment Analysis¶
blob1.sentiment
Sentiment(polarity=-0.19999999999999996, subjectivity=0.26666666666666666)
blob2.sentiment
Sentiment(polarity=0.4, subjectivity=0.625)
# -1 = most negative, +1 = most positive
print(TextBlob("this is horrible").sentiment)
print(TextBlob("this is lame").sentiment)
print(TextBlob("this is awesome").sentiment)
print(TextBlob("this is x").sentiment)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=-0.5, subjectivity=0.75)
Sentiment(polarity=1.0, subjectivity=1.0)
Sentiment(polarity=0.0, subjectivity=0.0)
# Simple approaches to NLP tasks typically used keyword matching.
print(TextBlob("this is horrible").sentiment)
print(TextBlob("this is the totally not horrible").sentiment)
print(TextBlob("this was horrible").sentiment)
print(TextBlob("this was horrible but now isn't").sentiment)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=0.5, subjectivity=1.0)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=-1.0, subjectivity=1.0)
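A toy keyword matcher makes the underlying limitation explicit: it scores a text purely by counting words from fixed positive/negative lists (the word lists here are hypothetical), so negation and tense are invisible to it.

```python
# Hypothetical word lists; real systems used large curated lexicons
POSITIVE = {'awesome', 'great', 'good'}
NEGATIVE = {'horrible', 'lame', 'bad'}

def keyword_sentiment(text):
    """Crude polarity in [-1, 1]: (#positive - #negative) / #matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

print(keyword_sentiment("this is horrible"))      # -1.0
print(keyword_sentiment("this is not horrible"))  # still -1.0: negation is ignored
print(keyword_sentiment("this is x"))             # 0.0: no matched keywords
```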
SpaCy¶
https://github.com/explosion/spaCy
"spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products."
"spaCy comes with pretrained pipelines and currently supports tokenization and training for 70+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license."
# conda install conda-forge::spacy
import spacy
#dir(spacy)
Activity: Zipf's Law¶
Zipf's law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table: the 2nd most common word is half as common as the 1st, the 3rd is 1/3 as common, and so on.
For example:
| Word | Rank | Frequency |
|---|---|---|
| “the” | 1st | 30k |
| "of" | 2nd | 15k |
| "and" | 3rd | 7.5k |
Plot word frequencies¶
from nltk.corpus import genesis
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
# Count word frequencies in the corpus and sort in descending order
counts = Counter(genesis.words())
counts_sorted = sorted(counts.values(), reverse=True)
plt.plot(counts_sorted[:50]);
Does this conform to Zipf's Law? Why or why not?
Activity Part 2: List the most common words¶
Activity Part 3: Remove punctuation¶
from string import punctuation
punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
# `sample` holds the (word, count) pairs computed in the previous step
sample_clean = [item for item in sample if item[0] not in punctuation]
sample_clean
[('the', 4642),
('and', 4368),
('de', 3160),
('of', 2824),
('a', 2372),
('e', 2353),
('und', 2010),
('och', 1839),
('to', 1805),
('in', 1625)]
Activity Part 4: Null model¶
- Generate random text including the space character.
- Tokenize this string of gibberish
- Generate another plot of Zipf's
- Compare the two plots.
What do you make of Zipf's Law in the light of this?
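The null model above can be sketched as follows (uniformly random characters; the alphabet, string length, and seed are arbitrary choices):

```python
import random
import string
from collections import Counter

random.seed(0)  # reproducible gibberish
alphabet = string.ascii_lowercase + ' '  # include space for random word boundaries
gibberish = ''.join(random.choice(alphabet) for _ in range(100_000))

# Tokenize on whitespace and rank the "word" frequencies
tokens = gibberish.split()
null_counts_sorted = sorted(Counter(tokens).values(), reverse=True)
print(null_counts_sorted[:10])
```

Plotting `null_counts_sorted[:50]` the same way as the genesis counts lets you compare the two curves directly.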
How does your result compare to Wikipedia?¶

Modern NLP A.I. tasks using HuggingFace transformer class¶
Install the Transformers, Datasets, and Evaluate libraries to run this notebook.
Installation (7/8/24)¶
This repository is tested on Python 3.8+, Flax 0.4.1+, PyTorch 1.11+, and TensorFlow 2.6+.
Virtual environments: https://docs.python.org/3/library/venv.html
Venv user guide: https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/
- Create a virtual environment with the version of Python you're going to use, and activate it.
- Install at least one of Flax, PyTorch, or TensorFlow. Refer to the TensorFlow, PyTorch, and/or Flax and JAX installation pages for the specific installation command for your platform.
- Install transformers using pip as follows:
pip install transformers
...or using conda...
conda install conda-forge::transformers
NOTE: Installing transformers from the huggingface channel is deprecated.
Note 7/8/24: got an error when importing:
ImportError: huggingface-hub>=0.23.2,<1.0 is required for a normal functioning of this module, but found huggingface-hub==0.23.1.
Installing conda-forge::transformers above also installed huggingface_hub-0.23.1-py310haa95532_0 as a dependency. However, if you first run:
conda install conda-forge::huggingface_hub
it installs huggingface_hub-0.23.4. After this, conda-forge::transformers installs and imports without error.
Hugging Face Pipelines¶
Base class implementing NLP operations. A pipeline's workflow is defined as a sequence of the following operations:
- A tokenizer in charge of mapping raw textual input to tokens.
- A model to make predictions from the inputs.
- Some (optional) post-processing to enhance the model's output.
https://huggingface.co/docs/transformers/en/main_classes/pipelines
from transformers import pipeline
# First indicate the task. The model argument is optional.
pipe = pipeline("text-classification", model="FacebookAI/roberta-large-mnli")
pipe("This restaurant is awesome")
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]
Sentiment analysis¶
classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification. All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
classifier("I've been waiting for a HuggingFace course my whole life.")
[{'label': 'POSITIVE', 'score': 0.9598046541213989}]
classifier("I hate this so much!")
[{'label': 'NEGATIVE', 'score': 0.9994558691978455}]
"Zero-shot-classification"¶
classifier = pipeline("zero-shot-classification")
classifier(
"This is a course about the Transformers library",
candidate_labels=["education", "politics", "business"],
)
No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 130fb28 (https://huggingface.co/FacebookAI/roberta-large-mnli). Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 0%| | 0.00/688 [00:00<?, ?B/s]
C:\Users\micro\anaconda3\envs\HF_070824\lib\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\micro\.cache\huggingface\hub\models--FacebookAI--roberta-large-mnli. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development warnings.warn(message)
model.safetensors: 0%| | 0.00/1.43G [00:00<?, ?B/s]
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification. All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
tokenizer_config.json: 0%| | 0.00/25.0 [00:00<?, ?B/s]
vocab.json: 0%| | 0.00/899k [00:00<?, ?B/s]
merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]
{'sequence': 'This is a course about the Transformers library',
'labels': ['education', 'business', 'politics'],
'scores': [0.9562344551086426, 0.02697223611176014, 0.016793379560112953]}
Text generation¶
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2). Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 0%| | 0.00/665 [00:00<?, ?B/s]
C:\Users\micro\anaconda3\envs\HF_070824\lib\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\micro\.cache\huggingface\hub\models--openai-community--gpt2. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development warnings.warn(message)
model.safetensors: 0%| | 0.00/548M [00:00<?, ?B/s]
All PyTorch model weights were used when initializing TFGPT2LMHeadModel. All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
tokenizer_config.json: 0%| | 0.00/26.0 [00:00<?, ?B/s]
vocab.json: 0%| | 0.00/1.04M [00:00<?, ?B/s]
merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': "In this course, we will teach you how to do so, using techniques and techniques designed to assist in your own creative process and your own goals.\n\nHow do I know it's going to work?\n\nThere are many ways people say"}]
generator = pipeline("text-generation", model="distilgpt2")
generator(
"In this course, we will teach you how to",
max_length=30,
num_return_sequences=2,
)
All PyTorch model weights were used when initializing TFGPT2LMHeadModel. All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training. Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'In this course, we will teach you how to practice the magic of magic and how to do this.\n\n\nA big part of this course'},
{'generated_text': 'In this course, we will teach you how to build your own self-defense systems. The first step is to create the first self-defense system'}]
Fill-mask¶
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base). Using a pipeline without specifying a model name and revision in production is not recommended. All PyTorch model weights were used when initializing TFRobertaForMaskedLM. All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.
[{'score': 0.19619587063789368,
'token': 30412,
'token_str': ' mathematical',
'sequence': 'This course will teach you all about mathematical models.'},
{'score': 0.040526941418647766,
'token': 38163,
'token_str': ' computational',
'sequence': 'This course will teach you all about computational models.'}]
Named entity recognition¶
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english). Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 0%| | 0.00/998 [00:00<?, ?B/s]
model.safetensors: 0%| | 0.00/1.33G [00:00<?, ?B/s]
All PyTorch model weights were used when initializing TFBertForTokenClassification. All the weights of TFBertForTokenClassification were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.
tokenizer_config.json: 0%| | 0.00/60.0 [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/213k [00:00<?, ?B/s]
C:\Users\micro\anaconda3\envs\HF_070824\lib\site-packages\transformers\pipelines\token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="simple"` instead. warnings.warn(
[{'entity_group': 'PER',
'score': 0.9981694,
'word': 'Sylvain',
'start': 11,
'end': 18},
{'entity_group': 'ORG',
'score': 0.9796019,
'word': 'Hugging Face',
'start': 33,
'end': 45},
{'entity_group': 'LOC',
'score': 0.9932106,
'word': 'Brooklyn',
'start': 49,
'end': 57}]
Question answering¶
question_answerer = pipeline("question-answering")
question_answerer(
question="Where do I work?",
context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad). Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 0%| | 0.00/473 [00:00<?, ?B/s]
model.safetensors: 0%| | 0.00/261M [00:00<?, ?B/s]
All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering. All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.
tokenizer_config.json: 0%| | 0.00/49.0 [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/213k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/436k [00:00<?, ?B/s]
{'score': 0.6949759125709534, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
question_answerer(
question="How many years old am I?",
context="I was both in 1990. This is 2023. Hello.",
)
{'score': 0.8601788878440857, 'start': 28, 'end': 32, 'answer': '2023'}
Summarization¶
summarizer = pipeline("summarization")
summarizer(
"""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.
Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
)
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6). Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 0%| | 0.00/1.80k [00:00<?, ?B/s]
pytorch_model.bin: 0%| | 0.00/1.22G [00:00<?, ?B/s]
tokenizer_config.json: 0%| | 0.00/26.0 [00:00<?, ?B/s]
vocab.json: 0%| | 0.00/899k [00:00<?, ?B/s]
merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]
[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
Machine Translation¶
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
pytorch_model.bin: 0%| | 0.00/301M [00:00<?, ?B/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 3
      1 from transformers import pipeline
----> 3 translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
      4 translator("Ce cours est produit par Hugging Face.")
...
ValueError: This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer.
Installing sentencepiece (e.g. pip install sentencepiece or conda install conda-forge::sentencepiece) and restarting the kernel resolves this.
Bias in pretrained models¶
Historic and stereotypical outputs are often the statistically most likely in the large training datasets, which are scraped from the internet or from past decades of books.
BERT was trained on the English Wikipedia and BookCorpus datasets.
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result1 = unmasker("This man works as a [MASK].")
print('Man:',[r["token_str"] for r in result1])
result2 = unmasker("This woman works as a [MASK].")
print('Woman:',[r["token_str"] for r in result2])
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight'] - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Man: ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor'] Woman: ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
scores1 = [r['score'] for r in result1]
labels1 = [r["token_str"] for r in result1]
scores2 = [r['score'] for r in result2]
labels2 = [r["token_str"] for r in result2]
import matplotlib.pyplot as plt
plt.figure(figsize=(5, 2))
plt.bar(labels1, scores1);
plt.title('Men');
plt.figure(figsize=(5, 2))
plt.bar(labels2, scores2);
plt.title('Women');
result1
[{'score': 0.0751064345240593,
'token': 10533,
'token_str': 'carpenter',
'sequence': 'this man works as a carpenter.'},
{'score': 0.0464191772043705,
'token': 5160,
'token_str': 'lawyer',
'sequence': 'this man works as a lawyer.'},
{'score': 0.03914564475417137,
'token': 7500,
'token_str': 'farmer',
'sequence': 'this man works as a farmer.'},
{'score': 0.03280140459537506,
'token': 6883,
'token_str': 'businessman',
'sequence': 'this man works as a businessman.'},
{'score': 0.02929229475557804,
'token': 3460,
'token_str': 'doctor',
'sequence': 'this man works as a doctor.'}]