Quiz 1: Strings with Python

Explain your work. All work must be your own.

For code questions, you do not need to give working code or use exact forms of functions if you cannot remember. You can use "pseudocode" involving standard programming structures and functions.

1. Zipf's law

Zipf's law states that the $n$th most common word in a corpus is roughly $1/n$ as common as the most common word. For example, the 10th most common word is used about 1/10 as often as the most common word (in English, probably 'the').

Given a string variable containing the text of a long corpus, describe how you would determine the frequency of each word.

In your answer, use only low-level Python (no packages).

For example, the first few characters of the string would be something like: "This enhances our commitment to open-source collaboration while providing additional protections for contributors and users alike. It provides a collection of working systems with different complexities."...

Write your code so that it operates on an input string variable mystring.

In [26]:
# Python code is given here.
# Pseudocode would be similar, with the exact Python function names and loop syntax
# replaced by rough equivalents.

mystring = "this This THIS this enhances our commitment to open-source collaboration while providing additional protections for contributors and users alike. It provides a collection of working systems with different complexities."

word_freqs = {} # maps each word to the number of times it appears
words = mystring.split() # split on any run of whitespace

for word in words:
    if word not in word_freqs:
        word_freqs[word] = 1 # initialize counter
    else:
        word_freqs[word] += 1 # increment counter

print(word_freqs) # not many repeats in this case; note that punctuation and case are still part of the words
{'this': 2, 'This': 1, 'THIS': 1, 'enhances': 1, 'our': 1, 'commitment': 1, 'to': 1, 'open-source': 1, 'collaboration': 1, 'while': 1, 'providing': 1, 'additional': 1, 'protections': 1, 'for': 1, 'contributors': 1, 'and': 1, 'users': 1, 'alike.': 1, 'It': 1, 'provides': 1, 'a': 1, 'collection': 1, 'of': 1, 'working': 1, 'systems': 1, 'with': 1, 'different': 1, 'complexities.': 1}
In [25]:
# handle punctuation and case

word_freqs = {}
words = mystring.split() # split on any run of whitespace, so no empty strings

for word in words:
    word = word.lower() # convert all to lowercase
    word = word.rstrip('.,?!') # chop off trailing punctuation if found
    if not word:
        continue # skip tokens that were punctuation only
    if word not in word_freqs:
        word_freqs[word] = 1
    else:
        word_freqs[word] += 1
        
print(word_freqs) # better (though won't be perfect)
{'this': 4, 'enhances': 1, 'our': 1, 'commitment': 1, 'to': 1, 'open-source': 1, 'collaboration': 1, 'while': 1, 'providing': 1, 'additional': 1, 'protections': 1, 'for': 1, 'contributors': 1, 'and': 1, 'users': 1, 'alike': 1, 'it': 1, 'provides': 1, 'a': 1, 'collection': 1, 'of': 1, 'working': 1, 'systems': 1, 'with': 1, 'different': 1, 'complexities': 1}
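
To connect the counts back to Zipf's law, one possible next step is to rank the words by count and compare each count with the 1/$n$ prediction. This is only a sketch using built-in functions: the small word_freqs dictionary below is made up for illustration (it is not from a real corpus), and the names ranked, top_count and comparison are introduced here, not part of the original answer.

```python
# Hypothetical counts for illustration only; in practice use the
# word_freqs dictionary built from mystring.
word_freqs = {'the': 60, 'of': 30, 'and': 20, 'to': 15}

# Sort (word, count) pairs by count, most common first
ranked = sorted(word_freqs.items(), key=lambda pair: pair[1], reverse=True)

top_count = ranked[0][1] # count of the most common word

comparison = []
for rank, (word, count) in enumerate(ranked, start=1):
    expected = top_count / rank # Zipf's prediction for this rank
    comparison.append((rank, word, count, expected))
    print(rank, word, count, expected)
```

On a real corpus the observed counts would only roughly follow the predicted column, since Zipf's law is an empirical approximation.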

2. Regex

Suppose you want to find all decimal numbers in some text, with no errors or exceptions, e.g., numbers of the form 123.456. Describe how regex can achieve this in as much detail as you can, and note the particular issues that come up.

Answer in plain language

Make a pattern that finds:

  1. a sequence of one or more digits (0-9)
  2. followed by a decimal point
  3. followed by another sequence of one or more digits (0-9).

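Possible issue: numbers may lack decimal places, e.g. "5.00" may be written as simply "5", so the pattern should also match plain integers. The decimal point must also be escaped, since an unescaped "." matches any character in regex.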

In [19]:
# python example

import re

mystring = "This 123.456 ... 78 ... 9."

pattern = r"\d+\.\d+|\d+" # raw string: decimal form first, then plain integers
re.findall(pattern, mystring)
Out[19]:
['123.456', '78', '9']
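
A few of the issues can be seen by varying the pattern. The sketch below is illustrative only: the test strings and the variable names are made up here, and the last pattern (allowing a number such as ".5" with no leading digits) is one possible extension, not a complete answer to every number format.

```python
import re

text = "123.456 78 9. .5"

# Decimal alternative first: a full decimal is tried before the integer fallback
decimal_first = re.findall(r"\d+\.\d+|\d+", text)
print(decimal_first) # ['123.456', '78', '9', '5']

# Integer alternative first: "123.456" is split into two separate matches
integer_first = re.findall(r"\d+|\d+\.\d+", text)
print(integer_first) # ['123', '456', '78', '9', '5']

# Unescaped dot: "." matches any character, so "12a34" is wrongly accepted
unescaped = re.findall(r"\d+.\d+", "12a34 5.6")
print(unescaped) # ['12a34', '5.6']

# Possible extension: allow zero leading digits so ".5" is kept whole
leading_dot = re.findall(r"\d*\.\d+|\d+", text)
print(leading_dot) # ['123.456', '78', '9', '.5']
```

The order of the alternatives matters because regex alternation takes the first branch that matches at a given position, not the longest one.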