BDS 754: Python for Data Science



Topic 13: Text

This topic:¶

Text processing

  1. Character encodings
  2. Regular expressions
  3. Tokenization, segmentation, & stemming
  4. Approximate sequence matching

Intro to Modern NLP

  1. Small NLP packages
  2. NLP with LLMs

Reading - Text processing:

  • https://www.oreilly.com/library/view/fluent-python/9781491946237/ch04.html
  • J&M Chapter 2, "Regular Expressions, Text Normalization, and Edit Distance"

Further reading - Dynamic Programming:

  • "The Algorithm Design Manual, 3e," Chapter 8, Steven Skiena, 2020.
  • OSU Molecular Biology Primer, Chapter 21: https://open.oregonstate.education/computationalbiology/chapter/bioinformatics-knick-knacks-and-regular-expressions/

Reading - NLP:

  • IBM: What is NLP (natural language processing)? https://www.ibm.com/topics/natural-language-processing

Motivation¶

  • Processing formatted records which are in varying text formats, such as converting different date formats '01/01/24' vs 'Jan 1, 2024' vs '1 January 2024' to a single numerical variable

  • Processing survey data or health records in text format, using NLP to convert unstructured text to a categorical variable.

  • Processing other sequential data such as DNA or biological signals

  • Modern A.I. (Large Language Models) is trained to solve NLP problems using large collections of text. The model incorporates general knowledge of broadly understood topics, such as science, health, and psychology, and can be applied outside of NLP problems

Text Processing Levels¶

In computing, text is stored in strings (the data structure)

Strings are sequences of characters.

In Natural Language Processing (NLP), the levels are:

  1. Character
  2. Words
  3. Sentences / multiple words
  4. Paragraphs / multiple sentences
  5. Document
  6. Corpus / multiple documents

Source: Taming Text, p 9

Character¶

  • Character encodings
  • Case (upper and lower)
  • Punctuation
  • Numbers

Characters are indicated by quotes (either single or double works)

In [1]:
x = 'a'
y = '3'
z = '&'
q = '"'
print(x,y,z,q)
a 3 & "

NLP Methods: Words¶

  • Word segmentation: dividing text into words. Fairly easy for English and other languages that use whitespace; much harder for languages like Chinese and Japanese.
  • Stemming: the process of shortening a word to its base or root form.
  • Abbreviations, acronyms, and spelling. All help understand words.

NLP Methods: Sentences¶

  • Sentence boundary detection: a well-understood problem in English, but still not perfect.
  • Phrase detection: San Francisco and quick red fox are examples of phrases.
  • Parsing: breaking sentences down into subject-verb and other relationships often yields useful information about words and their relationships to each other.
  • Combining the definitions of words and their relationships to each other to determine the meaning of a sentence.

NLP Methods: Paragraphs¶

Methods to find deeper understanding of an author’s intent.

For example, algorithms for summarization often require being able to identify which sentences are more important than others.

NLP Methods: Document¶

Similar to the paragraph level, understanding the meaning of a document

Often requires knowledge that goes beyond what’s contained in the actual document.

Authors often expect readers to have a certain background or possess certain reading skills.

NLP Methods: Corpus¶

At this level, people want to quickly find items of interest as well as group related documents and read summaries of those documents.

Applications that can aggregate and organize facts and opinions and find relationships are particularly useful.

Information Retrieval: a related area here (which also uses modern NLP methods among others).


I. Character Encodings¶

Character Encodings - map character to binary¶

  • ASCII - 7 bits
  • char - 8-bit - a-z,A-Z,0-9,...
  • multi-byte encodings for other languages
In [42]:
ascii(38) # note: ascii() returns a printable string (repr), not a character code; use ord()/chr() for codes
Out[42]:
'38'
In [3]:
str(38), float('38')
Out[3]:
('38', 38.0)
In [44]:
chr(38)
Out[44]:
'&'

Unicode¶


Unicode¶

A lookup table of unique numbers ("code points"), one for every possible character across all languages.

An encoding is still needed to decide which subset to use and how to represent the code points in binary.

various standards:

  • UTF-8 (the dominant standard) and UTF-16 - variable-length
  • UTF-32 - fixed 4-byte
In [45]:
chr(2^30+1) # careful: ^ is XOR in Python (** is power), so this is chr(2 ^ 31) == chr(29)
Out[45]:
'\x1d'
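To see how encodings differ, a quick sketch using Python's built-in encode/decode (the example string and variable names are ours):

```python
s = "héllo"
b8 = s.encode("utf-8")       # é needs 2 bytes in UTF-8
b32 = s.encode("utf-32-be")  # every character takes exactly 4 bytes
print(len(s), len(b8), len(b32))  # 5 6 20
print(b8.decode("utf-8") == s)    # True - decoding with the right encoding round-trips
```

Five characters become 6 bytes in UTF-8 but 20 bytes in UTF-32, which is why UTF-8 dominates in practice.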

Mojibake¶

Incorrect, unreadable characters shown when computer software fails to show text correctly.

It is a result of text being decoded using an unintended character encoding.

Very common in Japanese websites, hence the name:
文字 (moji) "character" + 化け (bake) "transform"
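We can produce mojibake on purpose by decoding bytes with the wrong encoding (a sketch; the example string is ours):

```python
s = "文字化け"
b = s.encode("utf-8")        # encode correctly to bytes
wrong = b.decode("latin-1")  # decode with the WRONG encoding -> mojibake
print(wrong)                 # unreadable garbage characters
print(b.decode("utf-8"))     # decoding with the intended encoding recovers the text
```

Because latin-1 maps every byte to some character, no error is raised; the text is simply garbled.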


II. String Processing and Regular Expressions¶

Background: Lists [item1, item2, item3]¶

  • Sequence of values - order & repeats ok
  • Mutable
  • Concatenate lists with "+"
  • Index with mylist[index] - note zero based
In [35]:
L = [1,2,3,4,5,6]
print(L)
print("length =",len(L))
print(L[0],L[1],L[2])
[1, 2, 3, 4, 5, 6]
length = 6
1 2 3
In [36]:
[1,2,3,4,5][3]
Out[36]:
4

Slices - mylist[start:end:step]¶

Matlabesque way to select sub-sequences from list

  • If first index is zero, can omit - mylist[:end:step]
  • If last index is length-1, can omit - mylist[::step]
  • If step is 1, can omit mylist[start:end]

Make slices for even and odd indexed members of this list.

In [106]:
[1,2,3,4,5][:3]
Out[106]:
[1, 2, 3]
In [22]:
[1,2,3,4,5][0:1]
Out[22]:
[1]
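One answer to the even/odd exercise above, using the step argument (a sketch):

```python
L = [1, 2, 3, 4, 5]
evens = L[::2]   # elements at even indices 0, 2, 4
odds = L[1::2]   # elements at odd indices 1, 3
print(evens)  # [1, 3, 5]
print(odds)   # [2, 4]
```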

Strings¶

A sequence of characters (immutable, unlike a list)

In [39]:
s = 'Hello there'

print(s)
Hello there
In [49]:
print(s[0],s[2])
H l
In [47]:
(s[0],s[2])
Out[47]:
('H', 'l')
In [50]:
print(s[:2])
He
In [54]:
x = 'hello'
y = 'there'
z = '!'

print(x,y,z) # x,y,z is actually a tuple
hello there !
In [56]:
# addition concatenates lists or characters or strings

xyz = x+y+z 
print(xyz)
hellothere!
In [57]:
spc = chr(32)
spc
Out[57]:
' '

How do we fix the spacing in this sentence?
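One possible fix, concatenating with an explicit space character (a sketch):

```python
x, y, z = 'hello', 'there', '!'
spc = chr(32)           # the space character
fixed = x + spc + y + z
print(fixed)            # hello there!
```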

Other useful operations¶

https://docs.python.org/3/library/stdtypes.html

In [59]:
xyz = 'hello there'
print(xyz.split(' '))
['hello', 'there']
In [54]:
print(xyz.split())
['hello', 'there']
In [55]:
print(xyz.split('e'))
['h', 'llo th', 'r', '']
In [62]:
fname = 'smith_05_2024.txt'

root = fname.split('.')
root
Out[62]:
['smith_05_2024', 'txt']
In [64]:
rootparts = root[0].split('_')
rootparts
Out[64]:
['smith', '05', '2024']
In [65]:
name = rootparts[0]
name
Out[65]:
'smith'
In [66]:
fname.split('.')[0].split('_')[0]
Out[66]:
'smith'
In [67]:
mylist = xyz.split()
print(mylist)
['hello', 'there']
In [68]:
print(' '.join(mylist))
hello there
In [58]:
print('_'.join(mylist))
hello_there
In [85]:
from string import *
In [60]:
whos
Variable          Type        Data/Info
---------------------------------------
Formatter         type        <class 'string.Formatter'>
L                 list        n=6
Template          type        <class 'string.Template'>
ascii_letters     str         abcdefghijklmnopqrstuvwxy<...>BCDEFGHIJKLMNOPQRSTUVWXYZ
ascii_lowercase   str         abcdefghijklmnopqrstuvwxyz
ascii_uppercase   str         ABCDEFGHIJKLMNOPQRSTUVWXYZ
capwords          function    <function capwords at 0x00000181E7B2DF80>
dat0              list        n=3
dat1              list        n=4
digits            str         0123456789
hexdigits         str         0123456789abcdefABCDEF
literal1          str         calendar
literal2          str         calandar
literal3          str         celender
mylist            list        n=2
octdigits         str         01234567
pattern2          str         c[ae]l[ae]nd[ae]r
patterns          str         calendar|calandar|celender

punctuation       str         !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
re                module      <module 're' from 'C:\\Us<...>4\\Lib\\re\\__init__.py'>
s                 str         Hello there
st                str         calendar foo calandar cal celender calli
string            module      <module 'string' from 'C:<...>_083124\\Lib\\string.py'>
sub_pattern       str         [ae]

x                 str         hello
xyz               str         hello there
xyz2              str         hellothere2!
y                 str         there
z                 str         !

Example¶

Let's make a simple password generator function!

Your code should return something like this:
'kZmuSUVeVC'
'mGEsuIfl91'
'FEFsWwAgLM'

In [86]:
import random
import string

n = 10
pw = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(n))

Break this down and figure out how it works
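One way to break it down into named steps (a sketch; the intermediate variable names are ours):

```python
import random
import string

alphabet = string.ascii_letters + string.digits      # a-z, A-Z, 0-9
n = 10
chars = [random.choice(alphabet) for _ in range(n)]  # n independent random picks
pw = ''.join(chars)                                  # join the list of characters into a string
print(pw)
```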

Regular Expressions ("regex")¶

Used in grep, awk, ed, perl, ...

A regular expression is a pattern written in a pattern-matching language, the "RE language".

It is a Domain Specific Language (DSL): powerful but deliberately limited. Other examples: SQL, Markdown.

In [1]:
from re import *
In [2]:
whos
Variable     Type         Data/Info
-----------------------------------
A            RegexFlag    re.ASCII
ASCII        RegexFlag    re.ASCII
DOTALL       RegexFlag    re.DOTALL
I            RegexFlag    re.IGNORECASE
IGNORECASE   RegexFlag    re.IGNORECASE
L            RegexFlag    re.LOCALE
LOCALE       RegexFlag    re.LOCALE
M            RegexFlag    re.MULTILINE
MULTILINE    RegexFlag    re.MULTILINE
Match        type         <class 're.Match'>
Pattern      type         <class 're.Pattern'>
S            RegexFlag    re.DOTALL
U            RegexFlag    re.UNICODE
UNICODE      RegexFlag    re.UNICODE
VERBOSE      RegexFlag    re.VERBOSE
X            RegexFlag    re.VERBOSE
compile      function     <function compile at 0x000001F92F137430>
error        type         <class 're.error'>
escape       function     <function escape at 0x000001F92F1375E0>
findall      function     <function findall at 0x000001F92F137310>
finditer     function     <function finditer at 0x000001F92F1373A0>
fullmatch    function     <function fullmatch at 0x000001F92F137040>
match        function     <function match at 0x000001F92F08D700>
purge        function     <function purge at 0x000001F92F1374C0>
search       function     <function search at 0x000001F92F1370D0>
split        function     <function split at 0x000001F92F137280>
sub          function     <function sub at 0x000001F92F137160>
subn         function     <function subn at 0x000001F92F1371F0>
template     function     <function template at 0x000001F92F137550>

Motivating example¶

Write a regex to match common misspellings of calendar: "calendar", "calandar", or "celender"

In [65]:
# Let's explore how to do this

# Patterns to match
dat0 = ["calendar", "calandar", "celender"]

# Patterns to not match
dat1 = ["foo", "cal", "calli", "calaaaandar"] 

# Interleave them
st = " ".join([item for pair in zip(dat0, dat1) for item in pair])
In [66]:
st
Out[66]:
'calendar foo calandar cal celender calli'
In [67]:
# You match it with literals
literal1 = 'calendar'
literal2 = 'calandar'
literal3 = 'celender'

patterns = "|".join([literal1, literal2, literal3])

patterns
Out[67]:
'calendar|calandar|celender'
In [68]:
import re

print(re.findall(patterns, st))
['calendar', 'calandar', 'celender']

... a better way¶

Let's write it with regex language

In [109]:
sub_pattern = '[ae]'
pattern2 = sub_pattern.join(["c","l","nd","r"])

print(pattern2)
c[ae]l[ae]nd[ae]r
In [110]:
print(st)

re.findall(pattern2, st)
calendar foo calandar cal celender calli
Out[110]:
['calendar', 'calandar', 'celender']

Regex Terms¶

  • target string: This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.
  • search expression: The pattern we use to find what we want. Most commonly called the regular expression.
  • literal: A literal is any character we use in a search or matching expression, for example, to find ind in windows the ind is a literal string - each character plays a part in the search, it is literally the string we want to find.
  • metacharacter: A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example "." means any character.
  • escape sequence: An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal.

function(search_expression, target_string)¶

  1. pick function based on goal (find all matches, replace matches, find first match, ...)

  2. form search expression to account for variations in target we allow. E.g. possible misspellings.

  • findall() - Returns a list containing all matches
  • search() - Returns a Match object if there is a match anywhere in the string
  • split() - Returns a list where the string has been split at each match
  • sub() - Replaces one or many matches with a string
  • match() apply the pattern at the start of the string
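A quick tour of these functions on a toy string (a sketch):

```python
import re

s = "cat bat cat"
print(re.findall("cat", s))         # ['cat', 'cat']
print(re.search("bat", s).start())  # 4 - position of first match anywhere
print(re.split(" ", s))             # ['cat', 'bat', 'cat']
print(re.sub("cat", "dog", s))      # dog bat dog
print(re.match("bat", s))           # None - "bat" is not at the start
```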

Metacharacters¶

special characters that have a unique meaning

 [] A set of characters. Ex: "[a-m]"    

 \  Signals a special sequence; also used to escape special characters. Ex: "\d" 

 .  Any character (except newline character). Ex: "he..o" 

 ^  Starts with. Ex: "^hello"   

 $  Ends with. Ex: "world$"     

 *  Zero or more occurrences. Ex: "aix*"    

 +  One or more occurrences. Ex: "aix+"     

 {} Specified number of occurrences. Ex: "al{2}"    

 |  Either or. Ex: "falls|stays"    

 () Capture and group

Escape sequence "\"¶

A way of indicating that we want to use one of our metacharacters as a literal.

In a regular expression an escape sequence is metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal.

Ex: If we want to find \file in the target string c:\file then we would need to use the search expression \\file (each \ we want to search for as a literal (there are 2) is preceded by an escape sequence ).
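The example above can be checked directly (raw strings keep the backslashes literal in Python source):

```python
import re

target = r'c:\file'                   # raw string: c, colon, one backslash, "file"
print(re.findall(r'\\file', target))  # ['\\file'] - the pattern escapes the backslash
```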

Special Escape Sequences¶

  • \A - matches the specified characters at the beginning of the string. Ex: "\AThe"
  • \b - matches at the beginning or end of a word. Ex: r"\bain" r"ain\b"
  • \B - matches where the specified characters are present but NOT at the beginning (or end) of a word. Ex: r"\Bain" r"ain\B"
  • \d - matches any digit (0-9)
  • \D - matches any character that is NOT a digit
  • \s - matches any whitespace character
  • \S - matches any character that is NOT whitespace
  • \w - matches any word character (a-z, A-Z, digits 0-9, and the underscore _ character)
  • \W - matches any character that is NOT a word character
  • \Z - matches the specified characters at the end of the string
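A few of these in action (a sketch):

```python
import re

s = "The rain in Spain"
print(re.findall(r"ain\b", s))  # ['ain', 'ain'] - "ain" at the end of a word
print(re.findall(r"\bain", s))  # [] - no word starts with "ain"
print(re.findall(r"\d", s))     # [] - no digits present
print(re.findall(r"\s", s))     # the three whitespace characters
```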

Set¶

a set of characters inside a pair of square brackets [] with a special meaning:

  • [arn] one of the specified characters (a, r, or n) is present
  • [a-n] any lower case character alphabetically between a and n
  • [^arn] any character EXCEPT a, r, and n
  • [0123] any of the specified digits (0, 1, 2, or 3) is present
  • [0-9] any digit between 0 and 9
  • [0-5][0-9] any two-digit number from 00 to 59
  • [a-zA-Z] any character alphabetically between a and z, lower case OR upper case
  • [+] in sets, +, *, ., |, (), $, {} have no special meaning, so [+] means any + character in the string
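A few set patterns on a toy string (a sketch; the example string is ours):

```python
import re

s = "call 415-805-1888 or +1 800 555 1234"
print(re.findall(r"[0-5][0-9]", s))  # digit pairs whose first digit is 0-5
print(re.findall(r"[+]", s))         # ['+'] - inside a set, + is a literal
print(re.findall(r"[a-zA-Z]+", s))   # ['call', 'or']
```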

Ex: Matching phone numbers¶

In [125]:
target_string = 'fgsfdgsgf 415-805-1888 xxxddd 800-555-1234'

pattern1 = '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'  
print(re.findall(pattern1,target_string))
['415-805-1888', '800-555-1234']
In [126]:
pattern2 = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'  
print(re.findall(pattern2,target_string))
['415-805-1888', '800-555-1234']
In [127]:
pattern3 = '\\d{3}-\\d{3}-\\d{4}'  
print(re.findall(pattern3,target_string))
['415-805-1888', '800-555-1234']

\d{3}-\d{3}-\d{4} uses Quantifiers.

Quantifiers allow you to specify how many times the preceding expression should match.

{n} is the exact quantifier; {m,n} matches from m to n repetitions.

In [79]:
print(re.findall('x?','xxxy'))
['x', 'x', 'x', '', '']
In [80]:
print(re.findall('x+','xxxy'))
['xxx']
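The range form {m,n} works the same way, matching greedily (a sketch):

```python
import re

s = "x xx xxx xxxx"
print(re.findall(r"x{2}", s))    # ['xx', 'xx', 'xx', 'xx'] - every non-overlapping pair
print(re.findall(r"x{2,3}", s))  # ['xx', 'xxx', 'xxx'] - greedy: take 3 when possible
```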

Capturing groups¶

Problem: You have odd line breaks in your text.

In [129]:
text = 'Long-\nterm problems with short-\nterm solutions.'
print(text)
Long-
term problems with short-
term solutions.
In [130]:
text.replace('-\n','\n')
Out[130]:
'Long\nterm problems with short\nterm solutions.'

Solution: Write a regex to find the "dash with line break" and replace it with just a line break.

In [82]:
import re
In [84]:
# 1st Attempt
text = 'Long-\nterm problems with short-\nterm solutions.'
re.sub('(\\w+)-\\n(\\w+)', r'-', text)
Out[84]:
'- problems with - solutions.'

Not right. We need capturing groups.

Capturing groups allow you to apply regex operators to the groups that have been matched by regex.

For example, suppose you wanted to list all the image files in a folder. You could use a pattern such as ^(IMG\d+\.png)$ to capture and extract the full filename, but if you only wanted the filename without the extension, you could use ^(IMG\d+)\.png$, which captures only the part before the period.

In [86]:
re.sub(r'(\w+)-\n(\w+)', r'\1-\2', text)
Out[86]:
'Long-term problems with short-term solutions.'

The parentheses around the word characters (specified by \w) means that any matching text should be captured into a group.

The '\1' and '\2' specifiers refer to the text in the first and second captured groups.

"Long" and "term" are the first and second captured groups for the first match.
"short" and "term" are the first and second captured groups for the next match.

NOTE: 1-based indexing
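Two more small illustrations of groups (a sketch; the example strings are ours):

```python
import re

# swap two words using backreferences to groups 1 and 2
print(re.sub(r"(\w+) (\w+)", r"\2 \1", "hello there"))  # there hello

# on a Match object, group() retrieves captured text (group 0 is the whole match)
m = re.search(r"(\w+)-\n(\w+)", "short-\nterm")
print(m.group(1), m.group(2))  # short term
```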

In [63]:
# speed of regex versus naive implementation

import re
import time
import random
import string

# naive wildcard match
def naive_find(text, pattern):
    matches = []
    n = len(text)
    m = len(pattern)

    for i in range(n - m + 1):
        ok = True
        for j in range(m):
            if pattern[j] != "*" and text[i + j] != pattern[j]:
                ok = False
                break
        if ok:
            matches.append(i)
    return matches

# generate random text
letters = string.ascii_lowercase + " "
text = "".join(random.choice(letters) for _ in range(300000))

pattern = "th*s is"

# naive timing
t0 = time.time()
naive_matches = naive_find(text, pattern)
t1 = time.time()

# regex timing
t2 = time.time()
regex_pattern = pattern.replace("*", ".")
r = re.compile(regex_pattern)
regex_matches = [m.start() for m in r.finditer(text)]
t3 = time.time()

print("naive matches:", len(naive_matches))
print("regex matches:", len(regex_matches))
print("naive time:", t1 - t0)
print("regex time:", t3 - t2)
naive matches: 0
regex matches: 0
naive time: 0.13492965698242188
regex time: 0.0

Note: potential combinatoric explosion¶

With wildcards of varying lengths, it can still require combinatoric number of matches

In [66]:
import re
import time

# pathological regex pattern
pattern = re.compile(r'(a+)+b')

# input with no terminating 'b'
text = "a" * 24

t0 = time.time()
match = pattern.search(text)
t1 = time.time()

print('text:', text)
print('pattern:', pattern)
print("match:", match)
print("time:", t1 - t0)
text: aaaaaaaaaaaaaaaaaaaaaaaa
pattern: re.compile('(a+)+b')
match: None
time: 1.8922786712646484

Useful Tools:¶

  • Realtime regex engine
  • Regex tester
  • Regex cheatsheet
  • Python Regex checker

III. Tokenization, segmentation, & stemming¶

Sentence segmentation:¶

Dividing a stream of language into component sentences.

Sentences can be defined as a set of words that is complete in itself, typically containing a subject and predicate.

Sentence segmentation is typically done using punctuation, particularly the full stop character ".", as a reasonable approximation.

Complications arise because punctuation is also used in abbreviations, which may or may not also terminate a sentence.

For example, Dr. Evil.

Example¶

A Confederacy Of Dunces
By John Kennedy Toole

A green hunting cap squeezed the top of the fleshy balloon of a head. The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once. Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs. In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.

sentence_1 = A green hunting cap squeezed the top of the fleshy balloon of a head.

sentence_2 = The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once.

sentence_3 = Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs.

sentence_4 = In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.

Code version 1¶

In [131]:
text = """A green hunting cap squeezed the top of the fleshy balloon of a head. The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once. Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs. In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress. """
In [132]:
import re

pattern = "|".join(['!', # end with "!"
                    '\\?', # end with "?" 
                    '\\.\\D', # end with "." and the full stop is not followed by a number
                    '\\.\\s']) # end with "." and the full stop is followed by a whitespace

print(pattern)
!|\?|\.\D|\.\s
In [133]:
re.split(pattern, text)
Out[133]:
['A green hunting cap squeezed the top of the fleshy balloon of a head',
 'The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once',
 'Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs',
 'In the shadow under the green visor of the cap Ignatius J',
 'Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D',
 '',
 'Holmes department store, studying the crowd of people for signs of bad taste in dress',
 '']

Code version 2¶

http://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing

In [134]:
pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
re.split(pattern, text)
Out[134]:
['A green hunting cap squeezed the top of the fleshy balloon of a head.',
 'The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once.',
 'Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs.',
 'In the shadow under the green visor of the cap Ignatius J.',
 'Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.',
 '']

Tokenization¶

Breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens

The simplest way to tokenize is to split on white space

In [101]:
sentence1 = 'Sky is blue and trees are green'
sentence1.split(' ')
Out[101]:
['Sky', 'is', 'blue', 'and', 'trees', 'are', 'green']
In [135]:
sentence1.split() # in fact it's the default
Out[135]:
['Sky', 'is', 'blue', 'and', 'trees', 'are', 'green']

Sometimes you might also want to deal with abbreviations, hyphenation, punctuation, and other characters.

In those cases, you would want to use regex.

However, going through a sentence multiple times can be slow if the corpus is long.

In [104]:
import re

sentence2 = 'This state-of-the-art technology is cool, isn\'t it?'

sentence2 = re.sub('-', ' ', sentence2)
sentence2 = re.sub('[,|.|?]', '', sentence2)
sentence2 = re.sub('n\'t', ' not', sentence2)
print(sentence2)

sentence2_tokens = re.split('\\s+', sentence2)

print(sentence2_tokens)
This state of the art technology is cool is not it
['This', 'state', 'of', 'the', 'art', 'technology', 'is', 'cool', 'is', 'not', 'it']

In this case, there are 11 tokens and the size of the vocabulary is 10

In [105]:
print('Number of tokens:', len(sentence2_tokens))
print('Number of vocabulary:', len(set(sentence2_tokens)))
Number of tokens: 11
Number of vocabulary: 10

Tokenization is a major component of modern language models and A.I., where tokens are defined more generally.

Morphemes¶

A morpheme is the smallest unit of language that has meaning. Two types:

  1. stems
  2. affixes (suffixes, prefixes, infixes, and circumfixes)

Example: "unbelievable"

What is the stem? What are the affixes?

"believe" is a stem.
"un" and "able" are affixes.

What we usually want to do in NLP preprocessing is get the stem by eliminating the affixes from a token.

Stemming¶

Stemming usually refers to a crude heuristic process that chops off the ends of words.

Ex: automates, automating and automatic could be stemmed to automat

Exercise: how would you implement this using regex? What difficulties would you run into?
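One crude answer to the exercise (a sketch; `crude_stem` is our hypothetical helper, not a real stemmer such as Porter's algorithm, which applies ordered rewrite rules):

```python
import re

def crude_stem(word):
    # naive suffix stripping: chop one common ending off the end of the word
    return re.sub(r"(ing|es|ed|ion|ic|s)$", "", word)

print([crude_stem(w) for w in ["automates", "automating", "automatic"]])
# ['automat', 'automat', 'automat']
print(crude_stem("universes"))  # over-stemming: 'univers'
```

The difficulty is exactly what the last line shows: a purely regex-based stemmer over- and under-stems, because suffix patterns alone cannot tell morphemes from coincidental letter sequences.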

Lemmatization¶

Lemmatization aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

This is doing things properly with the use of a vocabulary and morphological analysis of words.

How are stemming and lemmatization similar/different?

Bioinformatics¶

Many analogous tasks in processing DNA and RNA sequences

  • Finding exact matches for shorter sequence within long sequence
  • Inexact or approximate matching?

IV. Approximate Sequence Matching¶

Exact Matching¶

Find places where pattern $P$ is found within text $T$.

What python functions do this?

A very important problem in its own right; not trivial for massive datasets.

Alignment - compare $P$ to same-length substring of $T$ at some starting point.


Use base python to perform this.

How many calculations will this take in the most naive approach possible?
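A minimal sketch in base Python (the function name is ours); the naive approach compares $P$ against every alignment, roughly $|P| \times (|T| - |P| + 1)$ character comparisons in the worst case:

```python
def naive_exact_match(P, T):
    # test P against every possible alignment of T
    m = len(P)
    return [i for i in range(len(T) - m + 1) if T[i:i+m] == P]

print(naive_exact_match("ana", "bananas"))  # [1, 3] - finds overlapping matches
print("bananas".find("ana"))               # 1 - built-in str.find returns only the first
```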

Improving on the Naive Exact-matching algorithm¶

Naive approach: test all possible alignments for match.

Ideas for improvement:

  • stop comparing given alignment at first mismatch.
  • use result of prior alignments to shorten or skip subsequent alignments.

Approximate matching: Motivational problems¶

  • Matching Regular Expressions to text efficiently
  • Biological sequence alignment between different species
  • Matching noisy trajectories through space
  • Clustering sequences into a few groups of most similar classes
  • Applying k Nearest Neighbors classification to sequences

Pre-filtering, Pruning, etc.¶

When performing a slow search algorithm over a large dataset, start with a fast algorithm with a high false-positive rate to reject obvious mismatches.

Ex: BLAST (fast sequence alignment) in bioinformatics, followed by a slow, accurate structure alignment technique.

  • K. Dillon and Y.-P. Wang, “On efficient meta-filtering of big data”, 2016

Correlation screening

  • Fan, Jianqing, and Jinchi Lv. "Sure independence screening for ultrahigh dimensional feature space." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008): 849-911.
  • Wu, Tong Tong, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. "Genome-wide association analysis by lasso penalized logistic regression." Bioinformatics 25.6 (2009): 714-721.

SAFE screening

  • Ghaoui, Laurent El, Vivian Viallon, and Tarek Rabbani. "Safe feature elimination for the lasso and sparse supervised learning problems." arXiv preprint arXiv:1009.4219 (2010).
  • Liu, Jun, et al. "Safe screening with variational inequalities and its application to lasso." arXiv preprint arXiv:1307.7577 (2013).

Rather than match vs no match, we now need a similarity score a.k.a. distance $d$(string1, string2)

Approximate matching of strings¶

Given spelling errors, determine most-likely name


Sequence Alignment given mutations¶

  • Needleman-Wunsch
  • Smith-Waterman

Matching time-series with varying timescales¶

How similar are these two curves, assuming we ignore varying timescales?


Example: we wish to determine the location of a hiker given altitude measurements during the hike.

Note the amount of warping is usually not itself the distance metric. We first "warp" the pattern, then compute the distance some other way, e.g. least squares; the final distance is the smallest such distance over all acceptable warpings.

Dynamic Time Warping
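A minimal DTW sketch (assuming an absolute-difference local cost; the function name is ours). It fills a dynamic programming table where each cell may "stretch" either sequence:

```python
def dtw(a, b):
    # D[i][j] = cost of the best warping of a[:i] onto b[:j]
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i-1] - b[j-1])        # local distance between samples
            D[i][j] = cost + min(D[i-1][j],    # repeat a sample of b
                                 D[i][j-1],    # repeat a sample of a
                                 D[i-1][j-1])  # advance both sequences
    return D[-1][-1]

# the same shape at two different timescales matches perfectly
print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0]))  # 0.0
```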

Edit Distance between two strings¶

Minimum number of operations needed to convert one string into other

Ex. typo correction: how do we decide what "scool" was supposed to be?

Consider possibilities with lowest edit distance. "school" or "cool".

Hamming distance - operations consist only of substitutions of character (i.e. count differences)

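Hamming distance is a one-liner, since we only count positional differences (a sketch; the function name is ours):

```python
def hamming(a, b):
    # defined only for equal-length strings: count positions that differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("shot", "spot"))        # 1
print(hamming("karolin", "kathrin"))  # 3
```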

Levenshtein distance (between strings $P$ and $T$)¶

Special case where the operations are the insertion, deletion, or substitution of a character

  • Insertion – Insert a single character into pattern $P$ to help it match text $T$ , such as changing “ago” to “agog.”
  • Deletion – Delete a single character from pattern $P$ to help it match text $T$ , such as changing “hour” to “our.”
  • Substitution – Replace a single character from pattern $P$ with a different character in text $T$ , such as changing “shot” to “spot.”

Count the minimum number needed to convert $P$ into $T$.

Interchangeably called "edit distance".

Exercise: What are the Hamming and Edit distances?¶

\begin{align} T: \text{"The quick brown fox"} \\ P: \text{"The quick grown fox"} \\ \end{align}
\begin{align} T: \text{"The quick brown fox"} \\ P: \text{"The quik brown fox "} \\ \end{align}

Exercise: What are the Edit distances?¶


Comprehension check: give three different ways to transform $P$ into $T$ (not necessarily fewest operations)

Edit distance - Divide and conquer¶

Recursively use substring match results to compute

$$ \text{Let } D[0,j] = j, \quad \text{and let } D[i,0] = i $$

$$ \text{Otherwise, let } D[i,j] = \min \begin{cases} D[i-1,j] + 1 \\ D[i,j-1] + 1 \\ D[i-1,j-1] + \delta(x[i-1],y[j-1]) \end{cases} $$

$$ \delta(a,b) = \begin{cases} 0 & \text{if } a=b \\ 1 & \text{otherwise} \end{cases} $$
In [44]:
# recursive implementation

def D(a, b):
    if a=='': return len(b)
    if b=='': return len(a)
    if a[-1] == b[-1]: 
        delta = 0 
    else:
        delta = 1
    return min(
        D(a[:-1], b) + 1,
        D(a, b[:-1]) + 1,
        D(a[:-1], b[:-1]) + delta
    )

a = "chocolate"
b = "anniversary"

import time
t0 = time.time(); d = D(a, b); t1 = time.time()
print("distance:", d, "time:", t1 - t0)
distance: 10 time: 3.675112247467041

Recall: Dynamic Programming and Memoization¶

Deal with repeatedly calculating same terms

In [65]:
def fib_recursive(n):
    if n == 0: return 0
    if n == 1: return 1
    return fib_recursive(n-1) + fib_recursive(n-2)

Each term requires two additional terms be calculated. Exponential time.

DP calculation¶

Intelligently plan terms to calculate and store ("cache"), e.g. in a table.


Each term requires one term be calculated.

DP Caching¶

Always plan out the data structure and calculation order.

  • need to make sure you have sufficient space
  • need to choose optimal strategy to fill in

The data structure to fill in for Fibonacci is trivial:

drawing

Optimal order is to start at bottom and work up to $n$, so always have what you need for next term.
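That bottom-up order can be sketched in a few lines, keeping only the two most recent terms (the function name `fib_dp` is my own):

```python
def fib_dp(n):
    """Bottom-up Fibonacci: start from the base cases and work up to n."""
    prev, curr = 0, 1  # fib(0), fib(1)
    for _ in range(n):
        prev, curr = curr, prev + curr  # each new term uses only the prior two
    return prev

print([fib_dp(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
```

Linear time and constant space, versus the exponential time of the naive recursion.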

In [47]:
# Dynamic programming version

def edit_distance(x, y):
    
    def delta(a, b):
        if a == b:
            return 0
        return 1

    m = len(x)
    n = len(y)

    # D has (m+1) rows and (n+1) columns
    D = []
    for i in range(m + 1):
        row = []
        for j in range(n + 1):
            row.append(0)
        D.append(row)

    # Let D[0,j] = j
    for j in range(n + 1):
        D[0][j] = j

    # Let D[i,0] = i
    for i in range(m + 1):
        D[i][0] = i

    # Otherwise, let D[i,j] = min(...)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + delta(x[i - 1], y[j - 1])
            )

    return D
In [53]:
a = "chocolate"
b = "anniversary"

import time
t0 = time.time(); 
d = edit_distance(a, b); 
t1 = time.time()
print("distance:", d[-1][-1], "time:", t1 - t0)
distance: 10 time: 0.0
In [57]:
x = "kitten"
y = "sitting"

import time
t0 = time.time(); 
D = edit_distance(x, y)
t1 = time.time()

for row in D:
    print(row)

print("\ndistance:", D[len(x)][len(y)], "time:", t1 - t0)
[0, 1, 2, 3, 4, 5, 6, 7]
[1, 1, 2, 3, 4, 5, 6, 7]
[2, 2, 1, 2, 3, 4, 5, 6]
[3, 3, 2, 1, 2, 3, 4, 5]
[4, 4, 3, 2, 1, 2, 3, 4]
[5, 5, 4, 3, 2, 2, 3, 4]
[6, 6, 5, 4, 3, 3, 2, 3]

distance: 3 time: 0.0015840530395507812

D[i,j] = min edits needed to convert x[:i] into y[:j]

Up: D[i−1,j]+1 - delete a character from x

Left: D[i,j−1]+1 - insert a character into x

Diagonal: D[i−1,j−1]+δ(x[i−1],y[j−1]) - match or substitute

drawing

The backtrace gives the edit by taking step with smallest value

drawing
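The backtrace can be sketched as follows, bundled with a compact version of the table construction so it runs standalone (the function and operation names here are my own):

```python
def edit_table(x, y):
    """Build the (m+1) x (n+1) edit-distance table, as above."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        D[0][j] = j
    for i in range(m + 1):
        D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    return D

def backtrace(x, y, D):
    """Walk from D[m][n] back to D[0][0], recording one optimal edit script."""
    ops, i, j = [], len(x), len(y)
    while i > 0 or j > 0:
        # diagonal step: match or substitute
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (x[i-1] != y[j-1]):
            ops.append("match " + x[i-1] if x[i-1] == y[j-1]
                       else "substitute %s->%s" % (x[i-1], y[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:  # up: delete from x
            ops.append("delete " + x[i-1])
            i -= 1
        else:                                     # left: insert into x
            ops.append("insert " + y[j-1])
            j -= 1
    return ops[::-1]  # operations were recorded back-to-front

x, y = "kitten", "sitting"
D = edit_table(x, y)
ops = backtrace(x, y, D)
print(ops)
# ['substitute k->s', 'match i', 'match t', 'match t',
#  'substitute e->i', 'match n', 'insert g']
```

Counting the non-match operations recovers the edit distance of 3 for kitten → sitting.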

Python's functools.lru_cache decorator adds memoization automatically:

https://docs.python.org/3/library/functools.html

In [68]:
from functools import lru_cache

@lru_cache()
def D(a, b):
    if a=='': return len(b)
    if b=='': return len(a)
    if a[-1] == b[-1]: 
        delta = 0 
    else:
        delta = 1
    return min(
        D(a[:-1], b) + 1,
        D(a, b[:-1]) + 1,
        D(a[:-1], b[:-1]) + delta
    )

a = "chocolate"
b = "anniversary"

import time
t0 = time.time(); d = D(a, b); t1 = time.time()
print("distance:", d, "time:", t1 - t0)
distance: 10 time: 0.0009992122650146484

diff¶

Application of edit distance and backtrace

Essentially the minimal list of edits which convert one sequence into another

Used for version control (git) - nodes are versions, edges are described with diffs

drawing
In [35]:
import difflib

old = "return x * x"
new = "return x ** 2"

# character-based diff
diff = difflib.ndiff(old, new)

for d in diff:
    print(d)
  r
  e
  t
  u
  r
  n
   
  x
   
  *
+ *
   
- x
+ 2
In [34]:
old_code = """
def square(x):
    return x * x
""".strip().splitlines()

new_code = """
def square(x):
    print("computing square")
    return x ** 2
""".strip().splitlines()

# line-based diff
diff = difflib.unified_diff(old_code,new_code,fromfile="old.py",tofile="new.py")
    
for line in diff:
    print(line)
--- old.py

+++ new.py

@@ -1,2 +1,3 @@

 def square(x):
-    return x * x
+    print("computing square")
+    return x ** 2
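difflib also exposes a similarity score built on the same matching machinery: SequenceMatcher.ratio() returns 2M/T, where M is the number of matching characters and T is the total length of both sequences. A quick sketch:

```python
import difflib

sm = difflib.SequenceMatcher(None, "kitten", "sitting")
print(sm.get_matching_blocks())  # matching runs: "itt" and "n", so M = 4
print(sm.ratio())                # 2*4 / (6+7) = 8/13, about 0.615
```

A ratio above roughly 0.6 is the conventional threshold for "close match" in difflib helpers such as get_close_matches.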

Introduction to Modern NLP¶

History¶

  • Natural Language Processing is an old field of study in computer science and Artificial Intelligence research
  • E.g. to make a program which can interact with people via natural language in text format
  • Tasks range from basic data wrangling operations to advanced A.I.
  • Many "canned" problems were posed for competitions and research
  • Hardest major problems were arguably solved only very recently by large language models
drawing

Canned problem examples¶

  • Part-of-speech tagging
  • Named entity recognition
  • Sentiment analysis
  • Machine Translation

Includes some of the tasks we solved last class

NLP Python Packages¶

Small libraries that solve the easier NLP problems and related string operations

May include crude solutions for the harder problems (e.g. low-accuracy speech recognition)

  • NLTK
  • TextBlob
  • SpaCy

Python Natural Language Toolkit (NLTK)¶

Natural Language Toolkit (nltk) is a Python package for NLP

Pros: widely used, broad functionality

Cons: academic orientation, slow, awkward API

In [69]:
import nltk

#printcols(dir(nltk),3)

Download NLTK corpora (3.4GB)¶

In [4]:
#nltk.download('genesis')
#nltk.download('brown')
nltk.download('abc')
[nltk_data] Downloading package abc to
[nltk_data]     C:\Users\micro\AppData\Roaming\nltk_data...
[nltk_data]   Package abc is already up-to-date!
Out[4]:
True
In [5]:
from nltk.corpus import abc

printcols(dir(abc),2)
_LazyCorpusLoader__args        __init__                       
_LazyCorpusLoader__kwargs      __init_subclass__              
_LazyCorpusLoader__load        __le__                         
_LazyCorpusLoader__name        __lt__                         
_LazyCorpusLoader__reader_cls  __module__                     
__class__                      __name__                       
__delattr__                    __ne__                         
__dict__                       __new__                        
__dir__                        __reduce__                     
__doc__                        __reduce_ex__                  
__eq__                         __repr__                       
__firstlineno__                __setattr__                    
__format__                     __sizeof__                     
__ge__                         __static_attributes__          
__getattr__                    __str__                        
__getattribute__               __subclasshook__               
__getstate__                   __weakref__                    
__gt__                         _unload                        
__hash__                       subdir                         
In [6]:
print(abc.raw()[:500])
PM denies knowledge of AWB kickbacks
The Prime Minister has denied he knew AWB was paying kickbacks to Iraq despite writing to the wheat exporter asking to be kept fully informed on Iraq wheat sales.
Letters from John Howard and Deputy Prime Minister Mark Vaile to AWB have been released by the Cole inquiry into the oil for food program.
In one of the letters Mr Howard asks AWB managing director Andrew Lindberg to remain in close contact with the Government on Iraq wheat sales.
The Opposition's G

Tokenize¶

In [7]:
import nltk
help(nltk.tokenize)
Help on package nltk.tokenize in nltk:

NAME
    nltk.tokenize - NLTK Tokenizer Package

DESCRIPTION
    Tokenizers divide strings into lists of substrings.  For example,
    tokenizers can be used to find the words and punctuation in a string:

        >>> from nltk.tokenize import word_tokenize
        >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
        ... two of them.\n\nThanks.'''
        >>> word_tokenize(s) # doctest: +NORMALIZE_WHITESPACE
        ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
        'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

    This particular tokenizer requires the Punkt sentence tokenization
    models to be installed. NLTK also provides a simpler,
    regular-expression based tokenizer, which splits text on whitespace
    and punctuation:

        >>> from nltk.tokenize import wordpunct_tokenize
        >>> wordpunct_tokenize(s) # doctest: +NORMALIZE_WHITESPACE
        ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
        'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

    We can also operate at the level of sentences, using the sentence
    tokenizer directly as follows:

        >>> from nltk.tokenize import sent_tokenize, word_tokenize
        >>> sent_tokenize(s)
        ['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
        >>> [word_tokenize(t) for t in sent_tokenize(s)] # doctest: +NORMALIZE_WHITESPACE
        [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
        ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

    Caution: when tokenizing a Unicode string, make sure you are not
    using an encoded version of the string (it may be necessary to
    decode it first, e.g. with ``s.decode("utf8")``.

    NLTK tokenizers can produce token-spans, represented as tuples of integers
    having the same semantics as string slices, to support efficient comparison
    of tokenizers.  (These methods are implemented as generators.)

        >>> from nltk.tokenize import WhitespaceTokenizer
        >>> list(WhitespaceTokenizer().span_tokenize(s)) # doctest: +NORMALIZE_WHITESPACE
        [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
        (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

    There are numerous ways to tokenize text.  If you need more control over
    tokenization, see the other methods provided in this package.

    For further information, please see Chapter 3 of the NLTK book.

PACKAGE CONTENTS
    api
    casual
    destructive
    legality_principle
    mwe
    nist
    punkt
    regexp
    repp
    sexpr
    simple
    sonority_sequencing
    stanford
    stanford_segmenter
    texttiling
    toktok
    treebank
    util

FUNCTIONS
    sent_tokenize(text, language='english')
        Return a sentence-tokenized copy of *text*,
        using NLTK's recommended sentence tokenizer
        (currently :class:`.PunktSentenceTokenizer`
        for the specified language).

        :param text: text to split into sentences
        :param language: the model name in the Punkt corpus

    word_tokenize(text, language='english', preserve_line=False)
        Return a tokenized copy of *text*,
        using NLTK's recommended word tokenizer
        (currently an improved :class:`.TreebankWordTokenizer`
        along with :class:`.PunktSentenceTokenizer`
        for the specified language).

        :param text: text to split into words
        :type text: str
        :param language: the model name in the Punkt corpus
        :type language: str
        :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
        :type preserve_line: bool

FILE
    c:\users\micro\anaconda3\envs\hf_110325\lib\site-packages\nltk\tokenize\__init__.py


In [8]:
# nltk.download('punkt_tab') # <--- may need to do this first, see error

from nltk.tokenize import word_tokenize
s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''
word_tokenize(s) # doctest: +NORMALIZE_WHITESPACE
Out[8]:
['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']
In [9]:
from nltk.tokenize import word_tokenize
text1 = "It's true that the chicken was the best bamboozler in the known multiverse."
tokens = word_tokenize(text1)
print(tokens)
['It', "'s", 'true', 'that', 'the', 'chicken', 'was', 'the', 'best', 'bamboozler', 'in', 'the', 'known', 'multiverse', '.']

Stemming¶

Chopping off the ends of words.

In [10]:
from nltk import stem

porter = stem.porter.PorterStemmer()
In [11]:
porter.stem("cars")
Out[11]:
'car'
In [12]:
porter.stem("octopus")
Out[12]:
'octopu'
In [13]:
porter.stem("am")
Out[13]:
'am'

"Stemmers"¶

There are 3 types of commonly used stemmers, each consisting of slightly different rules for systematically replacing affixes in tokens. In general, the Lancaster stemmer stems the most aggressively, i.e. removes the largest suffixes from tokens, followed by Snowball and then Porter.

  1. Porter Stemmer:

    • The most commonly used and the gentlest of the stemmers
    • The most computationally intensive of the algorithms (though not by a significant margin)
    • One of the oldest stemming algorithms still in wide use
  2. Snowball Stemmer:

    • Universally regarded as an improvement over the Porter Stemmer
    • Slightly faster computation time than the Porter Stemmer
  3. Lancaster Stemmer:

    • Very aggressive stemming algorithm
    • With Porter and Snowball Stemmers, the stemmed representations are usually fairly intuitive to a reader
    • With Lancaster Stemmer, shorter tokens that are stemmed will become totally obfuscated
    • The fastest algorithm, and reduces the vocabulary size the most
    • However, if one desires more distinction between tokens, Lancaster Stemmer is not recommended
In [14]:
from nltk import stem

tokens =  ['player', 'playa', 'playas', 'pleyaz'] 

# Define Porter Stemmer
porter = stem.porter.PorterStemmer()
# Define Snowball Stemmer
snowball = stem.snowball.EnglishStemmer()
# Define Lancaster Stemmer
lancaster = stem.lancaster.LancasterStemmer()

print('Porter Stemmer:', [porter.stem(i) for i in tokens])
print('Snowball Stemmer:', [snowball.stem(i) for i in tokens])
print('Lancaster Stemmer:', [lancaster.stem(i) for i in tokens])
Porter Stemmer: ['player', 'playa', 'playa', 'pleyaz']
Snowball Stemmer: ['player', 'playa', 'playa', 'pleyaz']
Lancaster Stemmer: ['play', 'play', 'playa', 'pleyaz']

Lemmatization¶

https://www.nltk.org/api/nltk.stem.wordnet.html

WordNet Lemmatizer

Provides 3 lemmatizer modes: _morphy(), morphy() and lemmatize().

lemmatize() is a permissive wrapper around _morphy(): it lemmatizes a word by picking the shortest of the possible lemmas, using the wordnet corpus reader's built-in _morphy function, and returns the input word unchanged if it cannot be found in WordNet.

In [15]:
from nltk.stem import WordNetLemmatizer as wnl
print('WNL Lemmatization:',wnl().lemmatize('solution'))

print('Porter Stemmer:', porter.stem('solution'))
WNL Lemmatization: solution
Porter Stemmer: solut

Edit distance¶

In [18]:
from nltk.metrics.distance import edit_distance 

edit_distance('intention', 'execution')
Out[18]:
5

Textblob¶

https://textblob.readthedocs.io/en/dev/

"Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more."

In [20]:
# conda install conda-forge::textblob

import textblob

printcols(dir(textblob),3)
Blobber        __license__    en             
PACKAGE_DIR    __loader__     exceptions     
Sentence       __name__       inflect        
TextBlob       __package__    mixins         
Word           __path__       np_extractors  
WordList       __spec__       os             
__all__        __version__    parsers        
__author__     _text          sentiments     
__builtins__   base           taggers        
__cached__     blob           tokenizers     
__doc__        compat         translate      
__file__       decorators     utils          
In [3]:
from textblob import TextBlob

text1 = '''
It’s too bad that some of the young people that were killed over the weekend 
didn’t have guns attached to their [hip], 
frankly, where bullets could have flown in the opposite direction...
'''

text2 = '''
A President and "world-class deal maker," marveled Frida Ghitis, who demonstrates 
with a "temper tantrum," that he can't make deals. Who storms out of meetings with 
congressional leaders while insisting he's calm (and lines up his top aides to confirm it for the cameras). 
'''

blob1 = TextBlob(text1)
blob2 = TextBlob(text2)

from nltk.corpus import abc

blob3 = TextBlob(abc.raw())
blob3.words[:50]
Out[3]:
WordList(['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', 'Letters', 'from', 'John', 'Howard', 'and', 'Deputy', 'Prime', 'Minister', 'Mark', 'Vaile', 'to', 'AWB', 'have', 'been', 'released'])
In [22]:
from textblob import Word

nltk.download('wordnet')
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\micro\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[22]:
True
In [23]:
w = Word("cars")
w.lemmatize()
Out[23]:
'car'
In [24]:
Word("octopi").lemmatize()
Out[24]:
'octopus'
In [25]:
Word("am").lemmatize() 
Out[25]:
'am'
In [26]:
w = Word("litter")
w.definitions
Out[26]:
['the offspring at one birth of a multiparous mammal',
 'rubbish carelessly dropped or left about (especially in public places)',
 'conveyance consisting of a chair or bed carried on two poles by bearers',
 'material used to provide a bed for animals',
 'strew',
 'make a place messy by strewing garbage around',
 'give birth to a litter of animals']
In [27]:
text = """A green hunting cap squeezed the top of the fleshy balloon of a head. The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once. Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs. In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress. """

blob = TextBlob(text)

blob.sentences
Out[27]:
[Sentence("A green hunting cap squeezed the top of the fleshy balloon of a head."),
 Sentence("The green earflaps, full of large ears and uncut hair and the fine bristles that grew in the ears themselves, stuck out on either side like turn signals indicating two directions at once."),
 Sentence("Full, pursed lips protruded beneath the bushy black moustache and, at their corners, sank into little folds filled with disapproval and potato chip crumbs."),
 Sentence("In the shadow under the green visor of the cap Ignatius J. Reilly’s supercilious blue and yellow eyes looked down upon the other people waiting under the clock at the D.H. Holmes department store, studying the crowd of people for signs of bad taste in dress.")]
In [28]:
#blob3.word_counts

blob3.word_counts['the'],blob3.word_counts['and'],blob3.word_counts['people']
Out[28]:
(41626, 14876, 1281)

Sentiment Analysis¶

In [29]:
blob1.sentiment
Out[29]:
Sentiment(polarity=-0.19999999999999996, subjectivity=0.26666666666666666)
In [30]:
blob2.sentiment
Out[30]:
Sentiment(polarity=0.4, subjectivity=0.625)
In [31]:
# -1 = most negative, +1 = most positive

print(TextBlob("this is horrible").sentiment)
print(TextBlob("this is lame").sentiment)
print(TextBlob("this is awesome").sentiment)
print(TextBlob("this is x").sentiment)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=-0.5, subjectivity=0.75)
Sentiment(polarity=1.0, subjectivity=1.0)
Sentiment(polarity=0.0, subjectivity=0.0)
In [32]:
# Simple approaches to NLP tasks typically used keyword matching. 

print(TextBlob("this is horrible").sentiment)
print(TextBlob("this is the totally not horrible").sentiment)
print(TextBlob("this was horrible").sentiment)
print(TextBlob("this was horrible but now isn't").sentiment)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=0.5, subjectivity=1.0)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=-1.0, subjectivity=1.0)
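The failure modes above are what keyword matching produces. A toy sketch of the approach (the word lists, scores, and one-word negation window are my own illustration, not TextBlob's actual lexicon):

```python
LEXICON = {"horrible": -1.0, "lame": -0.5, "awesome": 1.0}
NEGATORS = {"not", "isn't", "wasn't"}

def toy_polarity(text):
    """Average lexicon scores; flip the sign after a negator (one-word window)."""
    words = text.lower().split()
    scores = []
    for i, w in enumerate(words):
        if w in LEXICON:
            s = LEXICON[w]
            if i > 0 and words[i - 1] in NEGATORS:
                s = -s  # crude negation handling
            scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("this is horrible"))      # -1.0
print(toy_polarity("this is not horrible"))  # 1.0: negation flips the sign
print(toy_polarity("this was horrible but now is fine"))  # still -1.0: no tense awareness
```

Keyword matchers handle simple negation with such fixed windows but have no model of tense, scope, or context, which is exactly where the examples above break down.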

SpaCy¶

https://github.com/explosion/spaCy

"spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products."

"spaCy comes with pretrained pipelines and currently supports tokenization and training for 70+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license."

In [34]:
# conda install conda-forge::spacy

import spacy

#dir(spacy)

Activity: Zipf's Law¶

Zipf's law states that, given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table: the 2nd most common word appears half as often as the 1st, the 3rd one-third as often, and so on.

For example:

Word   Rank  Frequency
"the"  1st   30k
"of"   2nd   15k
"and"  3rd   10k
drawing
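Under an ideal Zipf distribution the rank-$r$ frequency is $f(r) = f(1)/r$. A quick sketch, assuming the table's top count of 30,000 (the function name `zipf_freq` is my own):

```python
def zipf_freq(f1, rank):
    """Ideal Zipf frequency: the rank-r word occurs f1 / r times."""
    return f1 / rank

f1 = 30_000
for rank in (1, 2, 3, 4):
    print(rank, zipf_freq(f1, rank))  # 30000, 15000, 10000, 7500
```

Real corpora only approximate this curve, which is what the activity below explores.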

Plot word frequencies¶

In [35]:
from nltk.corpus import genesis
from collections import Counter
import matplotlib.pyplot as plt
In [42]:
plt.figure(figsize = (5,3))
plt.plot(counts_sorted[:50]);

Does this conform to Zipf's Law? Why or why not?

Activity Part 2: List the most common words¶

Activity Part 3: Remove punctuation¶

In [44]:
from string import punctuation
In [45]:
punctuation
Out[45]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [ ]:
sample_clean = [item for item in sample if not item[0] in punctuation]
In [15]:
sample_clean
Out[15]:
[('the', 4642),
 ('and', 4368),
 ('de', 3160),
 ('of', 2824),
 ('a', 2372),
 ('e', 2353),
 ('und', 2010),
 ('och', 1839),
 ('to', 1805),
 ('in', 1625)]

Activity Part 4: Null model¶

  1. Generate random text including the space character.
  2. Tokenize this string of gibberish
  3. Generate another plot of Zipf's
  4. Compare the two plots.

What do you make of Zipf's Law in the light of this?

How does your result compare to Wikipedia?¶

Modern NLP A.I. tasks using HuggingFace transformer class¶

https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter1/section3.ipynb

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

Installation (7/8/24)¶

This repository is tested on Python 3.8+, Flax 0.4.1+, PyTorch 1.11+, and TensorFlow 2.6+.

Virtual environments: https://docs.python.org/3/library/venv.html

Venv user guide: https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/

  1. create a virtual environment with the version of Python you're going to use and activate it.
  2. install at least one of Flax, PyTorch, or TensorFlow. Please refer to TensorFlow installation page, PyTorch installation page and/or Flax and Jax installation pages regarding the specific installation command for your platform.
  3. install transformers using pip as follows:

pip install transformers

...or using conda...

conda install conda-forge::transformers

NOTE: Installing transformers from the huggingface channel is deprecated.

Note 7/8/24: got error when importing:

ImportError: huggingface-hub>=0.23.2,<1.0 is required for a normal functioning of this module, but found huggingface-hub==0.23.1.

Installing conda-forge::transformers above also installed huggingface_hub-0.23.1-py310haa95532_0 as a dependency. However, if you first run:

conda install conda-forge::huggingface_hub

it installs huggingface_hub-0.23.4. After that, installing conda-forge::transformers works and the import succeeds without error.

Hugging Face Pipelines¶

Base class implementing NLP operations. A pipeline workflow is defined as a sequence of the following operations:

  • A tokenizer in charge of mapping raw textual input to tokens.
  • A model to make predictions from the inputs.
  • Some (optional) post-processing for enhancing the model's output.
drawing

https://huggingface.co/docs/transformers/en/main_classes/pipelines

In [54]:
from transformers import pipeline

# first indicate the task; the model argument is optional
pipe = pipeline("text-classification", model="FacebookAI/roberta-large-mnli")

pipe("This restaurant is awesome")
Some weights of the model checkpoint at FacebookAI/roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Out[54]:
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]

Sentiment analysis¶

drawing
In [55]:
classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
Device set to use cpu
In [56]:
classifier("I've been waiting for a HuggingFace course my whole life.")
Out[56]:
[{'label': 'POSITIVE', 'score': 0.9598049521446228}]
In [57]:
classifier("I've been waiting for a HuggingFace course my whole life.")
Out[57]:
[{'label': 'POSITIVE', 'score': 0.9598049521446228}]
In [58]:
classifier("I hate this so much!")
Out[58]:
[{'label': 'NEGATIVE', 'score': 0.9994558691978455}]
In [60]:
classifier("This isn't horrible anymore.")
Out[60]:
[{'label': 'POSITIVE', 'score': 0.9929516911506653}]
In [59]:
classifier("This was horrible but isn't anymore.")
Out[59]:
[{'label': 'NEGATIVE', 'score': 0.9913051724433899}]
drawing

"Zero-shot-classification"¶

In [61]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json: 0.00B [00:00, ?B/s]
model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]
vocab.json: 0.00B [00:00, ?B/s]
merges.txt: 0.00B [00:00, ?B/s]
tokenizer.json: 0.00B [00:00, ?B/s]
Device set to use cpu
Out[61]:
{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445982933044434, 0.11197470128536224, 0.04342702403664589]}

Text generation¶

In [62]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[62]:
[{'generated_text': 'In this course, we will teach you how to read, write, and communicate. These lessons are divided into two areas:\n\n1.) Writing: Writing.\n\nWriting is an activity and is one of the most complex and rewarding skills in'}]
In [63]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[63]:
[{'generated_text': 'In this course, we will teach you how to be a full professional with your skills in the workplace and how to work in small, medium and large'},
 {'generated_text': 'In this course, we will teach you how to design a prototype. We offer guidance on what to do, and how to not only implement what you'}]
In [64]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Out[64]:
[{'score': 0.19198477268218994,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04209217056632042,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

Named entity recognition¶

In [14]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]
All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.
tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]
C:\Users\micro\anaconda3\envs\HF_070824\lib\site-packages\transformers\pipelines\token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="simple"` instead.
  warnings.warn(
Out[14]:
[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

Question answering¶

In [16]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Out[16]:
{'score': 0.6949759125709534, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
In [24]:
question_answerer(
    question="How many years old am I?",
    context="I was both in 1990. This is 2023. Hello.",
)
Out[24]:
{'score': 0.8601788878440857, 'start': 28, 'end': 32, 'answer': '2023'}
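The question-answering pipeline is extractive: it returns a span of the given context, never a computed value, which is why it answers '2023' rather than an age. Slicing the contexts at the returned offsets reproduces the answers (the second string is copied verbatim, including its 'both' typo, so the offsets line up):

```python
# Extractive QA answers are literal character spans of the context
context1 = "My name is Sylvain and I work at Hugging Face in Brooklyn"
print(context1[33:45])  # Hugging Face

context2 = "I was both in 1990. This is 2023. Hello."
print(context2[28:32])  # 2023 -- the model selects text; it cannot compute 2023 - 1990
```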

Summarization¶

drawing
In [2]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Out[2]:
[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

Machine Translation¶

In [5]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 3
      1 from transformers import pipeline
----> 3 translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
      4 translator("Ce cours est produit par Hugging Face.")

File ~\anaconda3\envs\HF_070824\lib\site-packages\transformers\pipelines\__init__.py:994, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
    991             tokenizer_kwargs = model_kwargs.copy()
    992             tokenizer_kwargs.pop("torch_dtype", None)
--> 994         tokenizer = AutoTokenizer.from_pretrained(
    995             tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
    996         )
    998 if load_image_processor:
    999     # Try to infer image processor from model or config name (if provided as str)
   1000     if image_processor is None:

File ~\anaconda3\envs\HF_070824\lib\site-packages\transformers\models\auto\tokenization_auto.py:913, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    911             return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    912         else:
--> 913             raise ValueError(
    914                 "This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed "
    915                 "in order to use this tokenizer."
    916             )
    918 raise ValueError(
    919     f"Unrecognized configuration class {config.__class__} to build an AutoTokenizer.\n"
    920     f"Model type should be one of {', '.join(c.__name__ for c in TOKENIZER_MAPPING.keys())}."
    921 )

ValueError: This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer.

Note: the Helsinki-NLP Marian translation models use a SentencePiece tokenizer, so the `sentencepiece` package must be installed (e.g. `pip install sentencepiece`) and the kernel restarted before this cell will run.

Bias in pretrained models¶

Historical and stereotypical associations are often the statistically most likely outputs, because the large training datasets are scraped from the internet or from past decades of books.

BERT was trained on the English Wikipedia and BookCorpus datasets.

In [22]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result1 = unmasker("This man works as a [MASK].")
print('Man:',[r["token_str"] for r in result1])

result2 = unmasker("This woman works as a [MASK].")
print('Woman:',[r["token_str"] for r in result2])
Man: ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
Woman: ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
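One crude way to see the disparity without any plotting: the two top-5 lists (copied from the output above) share no occupations at all:

```python
men = ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
women = ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
print(set(men) & set(women))  # set() -- the top-5 predictions are completely disjoint
```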
In [27]:
scores1 = [r['score'] for r in result1]
labels1 = [r["token_str"] for r in result1]
scores2 = [r['score'] for r in result2]
labels2 = [r["token_str"] for r in result2]

import matplotlib.pyplot as plt

plt.figure(figsize=(5, 2))
plt.bar(labels1, scores1)
plt.title('Men')

plt.figure(figsize=(5, 2))
plt.bar(labels2, scores2)
plt.title('Women')
In [29]:
result1
Out[29]:
[{'score': 0.0751064345240593,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'this man works as a carpenter.'},
 {'score': 0.0464191772043705,
  'token': 5160,
  'token_str': 'lawyer',
  'sequence': 'this man works as a lawyer.'},
 {'score': 0.03914564475417137,
  'token': 7500,
  'token_str': 'farmer',
  'sequence': 'this man works as a farmer.'},
 {'score': 0.03280140459537506,
  'token': 6883,
  'token_str': 'businessman',
  'sequence': 'this man works as a businessman.'},
 {'score': 0.02929229475557804,
  'token': 3460,
  'token_str': 'doctor',
  'sequence': 'this man works as a doctor.'}]
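The fill-mask pipeline returns its candidates already sorted by descending score, so the top prediction is simply the first element. A quick check on an abridged copy of the output above:

```python
# Abridged copy of result1 from the fill-mask output above
result1 = [
    {"score": 0.0751, "token_str": "carpenter"},
    {"score": 0.0464, "token_str": "lawyer"},
    {"score": 0.0391, "token_str": "farmer"},
    {"score": 0.0328, "token_str": "businessman"},
    {"score": 0.0293, "token_str": "doctor"},
]
scores = [r["score"] for r in result1]
assert scores == sorted(scores, reverse=True)  # descending by score
print(result1[0]["token_str"])  # carpenter
```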