Tokenization

We need to identify the units that constitute the words of a sentence.

In Scala we can use string.split:
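For example (a minimal sketch; the sentence is reconstructed from the tokens shown below):

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  text.split(" ")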

  WrappedArray(Mr., Bob, Dobolina, is, thinkin', of, a, master, plan., Why, doesn't, he, quit?)

This isn't quite optimal: punctuation stays attached to words ("plan.", "quit?"), and contractions such as "doesn't" are not split. Before we fix this we move to Wolfe document data structures.

Below, Document.fromString creates a document in which a single token spans the complete source string.
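A minimal sketch, assuming the ml.wolfe.nlp package used in these notes (the import path and the tokens member are assumptions):

  import ml.wolfe.nlp._

  // fromString wraps the source string in a Document; initially a single
  // token spans the complete string (member names are assumed).
  val doc = Document.fromString("Mr. Bob Dobolina is thinkin' of a master plan.")
  doc.tokens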

Tokenization with Regular Expressions

We can use regular expressions to define where to split tokens.
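For instance, splitting on runs of whitespace (plain Scala, no extra library needed):

  val text = "Mr. Bob Dobolina is thinkin' of a master plan."
  // \s+ matches one or more whitespace characters, so multiple spaces,
  // tabs and newlines all count as a single token boundary.
  text.split("\\s+")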

We can use lookahead patterns to split at zero-width positions, for example just before punctuation, so that the punctuation becomes a token of its own without being consumed by the split.
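A sketch using such a lookahead:

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  // Split on whitespace, or at the zero-width position just before
  // . , ? or ! — the lookahead matches without consuming the punctuation.
  text.split("\\s+|(?=[.,?!])")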

This is still not perfect: abbreviations such as "Mr." are split apart, and contractions such as "doesn't" raise the question of how they should be represented (for example as does + n't).

Exercise
Improve the regular expressions to overcome the two problems above.
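One possible starting point (a sketch; it only addresses the abbreviation problem, and the hard-coded Mr lookbehind is purely illustrative):

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  // A negative lookbehind keeps the period of the abbreviation "Mr."
  // attached to it; everything else splits as before. Contraction
  // handling is left open.
  text.split("\\s+|(?<!Mr)(?=[.,?!])")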

Learning To Tokenize

Tokenization is relatively easy for English, but harder for languages such as Japanese or Chinese, which do not separate words with whitespace, and for noisy domains such as social media text.

We can learn to tokenize using the sequence labelling models we discuss later.

Sentence Segmentation

Most NLP tools work on a per-sentence basis. To find sentence splits we can again use regular expressions.
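For example, a lookbehind can split after sentence-final punctuation (a sketch; the pattern is illustrative):

  val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
  // Split at whitespace that is preceded by ., ? or !, keeping the
  // punctuation attached to the sentence it ends.
  text.split("(?<=[.?!])\\s+")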

This is still bad: the abbreviation "Mr." is wrongly treated as the end of a sentence. We can use the default tokenizer and sentence segmenter to fix this.

Exercise

Create a regular expression tokenizer and sentence segmenter that can process Hip-Hop lyrics from OHHLA. Use the Corpora.ohhla document collection and treat each line as a sentence.

Background Reading

  • Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.
  • Manning, Raghavan & Schütze, Introduction to Information Retrieval: Tokenization.