We need to identify which units constitute the words of a sentence.
In Scala we can use String.split:
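For example, splitting on single spaces (the sample sentence is our own illustration):

```scala
val text = "Mr. Bob Dobolina is thinkin' of a master plan."
// Split on single spaces; punctuation stays attached to the words.
val tokens = text.split(" ")
// tokens: Array(Mr., Bob, Dobolina, is, thinkin', of, a, master, plan.)
```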
This isn't quite optimal: punctuation remains attached to the preceding words. Before we fix this, we move to Wolfe's document data structures.
Below, Document.fromString creates a document in which a single token spans the complete source string.
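A minimal sketch; the ml.wolfe.nlp import path is an assumption about where Wolfe's document classes live:

```scala
import ml.wolfe.nlp.Document // assumed package path for Wolfe's NLP data structures

val doc = Document.fromString("Mr. Bob Dobolina is thinkin' of a master plan.")
// The document holds one sentence whose single token spans
// the complete source string.
```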
We can use regular expressions to define where to split tokens. In particular, a lookahead matches a zero-width position, so we can split immediately before punctuation without consuming any characters.
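A sketch in plain Scala, with one possible pattern choice: split on whitespace, and additionally at the zero-width position a lookahead finds before punctuation:

```scala
val text = "Mr. Bob Dobolina is thinkin' of a master plan."
// "\\s" consumes whitespace; "(?=[.,?!])" matches the zero-width
// position before a punctuation mark, so the mark becomes a token
// of its own without any characters being consumed by the split.
val tokens = text.split("\\s|(?=[.,?!])")
// tokens: Array(Mr, ., Bob, Dobolina, is, thinkin', of, a, master, plan, .)
```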
This is still not perfect: abbreviations such as "Mr." are wrongly split apart, and contractions are not represented the way we may want them to be.
Tokenization is relatively easy for English, but considerably harder for languages written without whitespace, such as Chinese, and for specialised domains such as biomedical text. In such cases we can learn to tokenize using the sequence labelling models we discuss later.
Most NLP tools work on a per-sentence basis. To find sentence splits we can again use regular expressions:
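For instance, a naive rule that ends a sentence after ., ! or ? followed by whitespace (the example text is our own):

```scala
val text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he use it?"
// "(?<=[.!?])" is a zero-width lookbehind, so the terminator stays
// attached to the sentence it ends.
val sentences = text.split("(?<=[.!?])\\s+")
// sentences: Array(Mr., Bob Dobolina is thinkin' of a master plan., Why doesn't he use it?)
```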
This is still bad: the abbreviation "Mr." now opens a sentence of its own. We can, however, use the default tokenizer and segmenter to fix this.
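A sketch of how that could look; Tokenizer.default, Segmenter.default and the composition below are assumptions about Wolfe's API, not confirmed signatures:

```scala
import ml.wolfe.nlp.{Document, Segmenter, Tokenizer} // assumed import path

// Hypothetical default pipeline: tokenize first, then segment.
val pipeline = Tokenizer.default andThen Segmenter.default
val doc = pipeline(Document.fromString(
  "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he use it?"))
// A well-trained default pipeline should keep "Mr." inside the
// first sentence instead of starting a new one.
```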
Create a regular expression tokenizer and sentence segmenter that can process Hip-Hop lyrics from OHHLA. Use the Corpora.ohhla document collection and split sentences per line.