Tokenization

We need to identify which units constitute the words of a sentence.

In Scala we can use string.split:
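For example, splitting on single spaces (a minimal sketch):

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  text.split(" ")  // produces the array shown below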

WrappedArray(Mr., Bob, Dobolina, is, thinkin', of, a, master, plan., Why, doesn't, he, quit?)

This isn't quite right: punctuation stays attached to the preceding word, as in "plan." and "quit?". Before we fix this we move to Wolfe document data structures.

Below, Document.fromString creates a document in which a single token spans the complete source string.
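A minimal sketch (the import path is an assumption about Wolfe's package layout):

  import ml.wolfe.nlp.Document  // assumed package path

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  val doc = Document.fromString(text)
  // doc now contains one sentence with a single token spanning the whole string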

[Interactive document view: the whole string "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?" forms a single token.]

Tokenization with Regular Expressions

We can use regular expressions to define where to split tokens.
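The simplest case splits on runs of whitespace; a sketch in plain Scala (Wolfe's own regex-based tokenizer follows the same idea):

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  text.split("\\s+")  // 13 tokens, punctuation still attached to the preceding words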

[Interactive document view: 13 tokens in one sentence; punctuation is still attached, as in "plan." and "quit?".]

We can use lookahead patterns to split at zero-width matches, so that punctuation becomes a token of its own.
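A sketch in plain Scala: "(?=[.?])" matches the zero-width position just before "." or "?", so the punctuation is split off but not consumed as a delimiter:

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  text.split("\\s+|(?=[.?])")  // 16 tokens: "plan." becomes "plan" and ".", "quit?" becomes "quit" and "?"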

[Interactive document view: 16 tokens in one sentence; "." and "?" are now tokens of their own, but so is the period of "Mr.".]

This is still not perfect: abbreviations such as "Mr." are split into two tokens, and contractions such as "doesn't" arguably should be split into "does" and "n't".

Exercise
Improve the regular expressions to overcome the two problems above.

Learning To Tokenize

Tokenization is relatively easy for English, but much harder for languages that do not separate words with whitespace, such as Japanese or Chinese, and for noisy domains such as social media text.

We can learn to tokenize using the sequence labelling models we discuss later.

Sentence Segmentation

Most NLP tools work on a per-sentence basis. To find sentence boundaries we can again use regular expressions.
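A naive segmenter in plain Scala splits after sentence-final punctuation; "(?<=[.?!])" is a zero-width lookbehind, so the punctuation stays with its sentence:

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  text.split("(?<=[.?!])\\s+")  // 3 "sentences": the period of "Mr." wrongly ends the first one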

[Interactive document view: 16 tokens split into 3 sentences; the period of "Mr." wrongly ends the first sentence.]

This is still bad, but we can use the default tokenizer and sentence segmenter to fix it.
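A sketch of how this could look; the component names below are assumptions, not confirmed Wolfe API:

  import ml.wolfe.nlp.{Document, Tokenizer, Segmenter}  // assumed package path and names

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  // hypothetical default pipeline components (names assumed)
  val doc = Segmenter.default(Tokenizer.default(Document.fromString(text)))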

[Interactive document view: 15 tokens in 2 sentences; "Mr." remains a single token and no longer ends a sentence.]

Exercise

Create a regular expression tokenizer and sentence segmenter that can process Hip-Hop lyrics from OHHLA. Use the Corpora.ohhla document collection and segment sentences at line breaks (one sentence per line).

Background Reading

  • Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.
  • Manning, Raghavan & Schütze, Introduction to Information Retrieval: Tokenization.