Is white space tokenization enough?
In this assignment, you will use an online tokenization tool. Navigate to http://text-processing.com/demo/tokenize/ and try the following:
Enter several sample sentences (you can copy and paste them from the web or write your own) into the textbox where it says “tokenize text”. Your sentences should include at least one contraction and at least one compound word (if you don’t know what a compound word is, see here).
Sentences used:
- She doesn’t want to miss the fireworks at the festival.
- You could’ve warned me that the football game was canceled.
- He should’ve known that the shortcut would actually take longer.
Observe how the different tokenizers handle your text. Look carefully at the whitespace tokenizer and answer the following question: Are spaces sufficient to tokenize English language text? Why or why not? Cite examples from your test to support your conclusion.
No, spaces alone are not sufficient to tokenize English language text. Contractions contain multiple words with no space between them: the whitespace tokenizer left “doesn’t” and “could’ve” as single tokens, even though each combines two words (“does” + “n’t”, “could” + “’ve”). Punctuation also attaches to adjacent words, so the whitespace tokenizer produced tokens like “festival.” and “longer.” instead of separating the word from the final period. Compound words such as “fireworks”, “football”, and “shortcut” show the reverse problem: spaces give no signal about their internal parts, so a space-based tokenizer can never split them even when an application needs it to. Correct tokenization therefore requires rules that look deeper than whitespace.
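
The comparison can be reproduced offline with a minimal Python sketch. This assumes NLTK is installed (`pip install nltk`); `str.split()` stands in for the whitespace tokenizer, and NLTK’s `TreebankWordTokenizer` stands in for the demo site’s smarter tokenizers (the site’s exact implementations are not documented, so this is an approximation). The contractions below use straight apostrophes, which the Treebank rules expect:

```python
from nltk.tokenize import TreebankWordTokenizer

sentences = [
    "She doesn't want to miss the fireworks at the festival.",
    "You could've warned me that the football game was canceled.",
    "He should've known that the shortcut would actually take longer.",
]

treebank = TreebankWordTokenizer()
for sentence in sentences:
    # Whitespace tokenization: contractions stay whole and punctuation
    # sticks to the preceding word, e.g. "doesn't", "festival."
    print("whitespace:", sentence.split())
    # Treebank tokenization: splits contractions ("does" + "n't",
    # "could" + "'ve") and detaches the sentence-final period.
    print("treebank:  ", treebank.tokenize(sentence))
    print()
```

On the first sentence, the whitespace line ends with the token “festival.” (period attached), while the Treebank line ends with “festival” followed by a separate “.” token, and “doesn’t” becomes “does” + “n’t”.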
