Tokenization

Before a program can process natural language, we need to identify the words that constitute a string of characters. This matters because the meaning of a text generally depends on the relations between the words in that text, so the first task is to determine which units constitute the words of a sentence.

By default, text on a computer is represented as String values. These store a sequence of characters (nowadays mostly encoded in UTF-8). The first step of an NLP pipeline is therefore to split the text into smaller units that correspond to the words of the language we are considering. In NLP we often refer to these units as tokens, and the process of extracting them is called tokenization. Tokenization is considered boring by most, but it is hard to overemphasize its importance: it is the first step in a long pipeline of NLP processors, and if this step goes wrong, all further steps suffer.

In Scala (and Java) a simple way to tokenize a text is the split method, which divides a string wherever a particular pattern matches. In the code below this pattern is simply a single whitespace character, which seems like a reasonable starting point for tokenizing English.

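The original notebook code is not reproduced here; the following is a minimal plain-Scala sketch of this first attempt:

```scala
// Split the raw string at single space characters; the result is an
// Array[String] holding one entry per whitespace-separated chunk.
val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
val tokens = text.split(" ")
```
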
The result is displayed as a WrappedArray with the following 13 tokens:
  1. Mr.
  2. Bob
  3. Dobolina
  4. is
  5. thinkin'
  6. of
  7. a
  8. master
  9. plan.
  10. Why
  11. doesn't
  12. he
  13. quit?

There are clearly shortcomings in this tokenization. However, before we address these we will switch to the Wolfe document data structures, which simplify downstream processing and also enable tailor-made rendering that will come in handy.

Notice that in Wolfe a document is always in a tokenized state, but the tokenization may be very coarse-grained, with the whole document forming a single token. Tokenization in this setting amounts to refining a given tokenization.

Below Document.fromString creates a document where a single token spans the complete source string.

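The Wolfe classes themselves are not shown here; the sketch below is a minimal, hypothetical model of the underlying idea, using made-up Token and Doc classes in which a token is simply a character span over the source string:

```scala
// Hypothetical stand-ins for the Wolfe document classes, for illustration only.
case class Token(source: String, start: Int, end: Int) {
  def word: String = source.substring(start, end)
}
case class Doc(source: String, tokens: Vector[Token])

// Mirrors the idea behind Document.fromString: a freshly created document
// has a single token that spans the complete source string.
def docFromString(source: String): Doc =
  Doc(source, Vector(Token(source, 0, source.length)))

val doc = docFromString("Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?")
// Tokenization will later refine doc.tokens into finer-grained spans.
```
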
(Rendered document: "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?", covered by a single token.)

Tokenization with Regular Expressions

Wolfe allows users to construct tokenizers from regular expressions that define the character sequence patterns at which to split tokens. Regular expressions are in general a powerful tool for NLP practitioners working with text, and they also come in handy when working with command-line tools such as grep. In the code below we use the simple pattern \\s, which matches any whitespace character.

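In plain Scala the same pattern can be applied directly with split; in Wolfe the pattern would instead be handed to its regex-based tokenizer, whose exact API is not shown here:

```scala
val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
// "\\s" matches any single whitespace character, so split breaks the
// string at every space.
val tokens = text.split("\\s")
// tokens now contains 13 elements, including "plan." and "quit?".
```
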
(Rendered document: the same text, now split into 13 tokens at whitespace; "plan." and "quit?" remain single tokens.)

One shortcoming of this tokenization is its treatment of punctuation: it considers "plan." a single token, whereas ideally we would like "plan" and "." to be distinct tokens (why?). To achieve this we need lookahead patterns that split at zero-length positions before and after punctuation. Note how we use Scala string interpolation to inject smaller patterns into larger ones.

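A plain-Scala sketch of such a pattern, assuming a punctuation class that covers only the period and question mark appearing in our example, with string interpolation assembling the full regular expression from the smaller pieces:

```scala
val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
// Punctuation we want as tokens of their own.
val punct = "[\\.\\?]"
// Split at whitespace, or at the zero-length positions just before (lookahead)
// and just after (lookbehind) a punctuation character.
val tokenPattern = s"\\s|(?=$punct)|(?<=$punct)"
val tokens = text.split(tokenPattern).filter(_.nonEmpty)
// "plan." now becomes the two tokens "plan" and ".".
```
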
(Rendered document: 16 tokens; the periods and the question mark are now tokens of their own.)

This still isn't perfect. First, "Mr." is split into two tokens, but it should be one. Second, and more subtly, many downstream linguistic processors (such as syntactic parsers) prefer contractions such as "doesn't" to be split into two tokens, "does" and "n't" (why?).

Exercise

Improve the regular expressions to overcome the two problems above.

Learning To Tokenize

For most English domains, powerful and robust tokenizers can be built using the simple pattern-matching approach shown above. However, in languages such as Japanese words are not separated by whitespace, which makes tokenization substantially more challenging. Even in certain English domains, such as biomedical papers, tokenization is non-trivial.

When tokenization is more challenging and difficult to capture in a few rules, a machine-learning-based approach can be useful. In a nutshell, we can treat the tokenization problem as a character classification problem or, if needed, as a sequential labelling problem, using the sequence models we discuss later.

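As a rough illustration of the character-classification view (a toy sketch, not a trained model, and every name below is made up for this example): each character position gets a boolean label indicating whether a new token starts there, and a classifier would predict these labels from features of the surrounding characters.

```scala
val text = "Mr. Bob Dobolina is thinkin' of a master plan."

// Toy gold labels: a token starts at every non-whitespace character that
// begins the string or follows a whitespace character. A real system would
// instead learn this decision from manually annotated data.
val tokenStarts: IndexedSeq[Boolean] = text.indices.map { i =>
  !text(i).isWhitespace && (i == 0 || text(i - 1).isWhitespace)
}

// Hypothetical feature function a classifier could use: a small window of
// characters around the current position.
def charFeatures(i: Int): Seq[String] = {
  val prev = if (i > 0) text(i - 1) else '^'
  val next = if (i + 1 < text.length) text(i + 1) else '#'
  Seq(s"curr=${text(i)}", s"prev=$prev", s"next=$next")
}
```
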
Sentence Segmentation

Many NLP tools work on a sentence-by-sentence basis, so the next preprocessing step is to segment streams of tokens into sentences. In most cases this is straightforward after tokenization, because we only need to split at sentence-ending punctuation tokens. In Wolfe this can be implemented with a Segmenter that splits a sentence at every token whose content matches a regular expression. If this expression only fires for punctuation, we get the desired behaviour.

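A plain-Scala sketch of the same idea (the actual Wolfe Segmenter operates on its document data structures; the helper below is made up and works on a simple token sequence): start a new sentence after every token that matches a sentence-ending punctuation pattern.

```scala
// Group a token sequence into sentences, closing a sentence after every
// token that fully matches the given punctuation pattern.
def segment(tokens: Seq[String], sentenceEnd: String = "[.?!]"): Seq[Seq[String]] =
  tokens.foldLeft(Vector(Vector.empty[String])) { (sentences, token) =>
    val extended = sentences.init :+ (sentences.last :+ token)
    if (token.matches(sentenceEnd)) extended :+ Vector.empty[String] else extended
  }.filter(_.nonEmpty)

// With the tokenization from above, the period of "Mr." still ends a sentence:
segment(Seq("Mr", ".", "Bob", "Dobolina", "is", "thinkin'", "of", "a", "master", "plan", "."))
// -> two sentences: ("Mr", ".") and ("Bob", "Dobolina", ..., "plan", ".")
```
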
(Rendered document: 16 tokens segmented into 3 sentences; the period of "Mr." ends the first sentence.)

This is in fact a bad segmentation, as we split at the period of "Mr.". However, once tokenization treats "Mr." as a single token, the problem disappears. The default Wolfe tokenizer and segmenter do exactly this. Below we use them to construct a pipeline of document processors, combined via Scala function composition with andThen.

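The actual pipeline code is not reproduced here; the sketch below only illustrates the andThen composition pattern, with ordinary Scala functions (made up for this example) standing in for the Wolfe document processors:

```scala
// Two stand-in processing stages (illustrative types, not the Wolfe API).
val tokenize: String => Seq[String] =
  _.split("\\s").filter(_.nonEmpty).toSeq
val countTokens: Seq[String] => Int = _.length

// andThen chains the stages into a single function, applied left to right.
val pipeline: String => Int = tokenize andThen countTokens
pipeline("Mr. Bob Dobolina is thinkin' of a master plan.") // 9
```
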
(Rendered document: 15 tokens in 2 sentences; "Mr." is now a single token and the text is split into the two intended sentences.)

Exercise

Create a regular expression tokenizer and sentence segmenter that can process Hip-Hop lyrics from OHHLA. Use the Corpora.ohhla document collection and treat each line as a sentence.

Background Reading

  • Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.
  • Manning, Raghavan & Schuetze, Introduction to Information Retrieval: Tokenization