Welcome to this interactive book on Statistical Natural Language Processing (NLP). NLP is a field at the intersection of Computer Science, Artificial Intelligence (AI) and Linguistics, with the goal of enabling computers to solve tasks that require natural language understanding and/or generation. Such tasks are omnipresent in our day-to-day lives: think of Machine Translation, Automatic Question Answering or even basic Search. All of these tasks require the computer to process language in one way or another. But even if you ignore these practical applications, many people consider language to be at the heart of human intelligence, and this alone makes NLP (and its more linguistically motivated cousin, Computational Linguistics) important for AI.

Statistical NLP

NLP is a vast field with beginnings dating back to TODO, and it is of course difficult to give a full account of every aspect of it. Hence, this book focusses on a sub-field of NLP termed Statistical NLP (SNLP). In SNLP, computers aren't directly programmed to process language; instead, they learn how language should be processed based on the statistics of a corpus of natural language. For example, a statistical machine translation system's behaviour is affected by the statistics of a parallel corpus in which each document in one language is paired with its translation in another. This approach has dominated NLP research for almost two decades now, and has seen widespread adoption in industry too. Note that while Statistics and Machine Learning are, in general, quite different fields, for the purposes of this book we will mostly identify Statistical NLP with Machine Learning-based NLP.
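To give a first feel for what "the statistics of a corpus" can mean, here is a minimal sketch that counts words in a tiny corpus and turns the counts into relative frequencies. The corpus, the object name `CorpusCounts` and the helper `prob` are made up for this illustration. Real SNLP systems use far larger corpora and richer models, so treat this as a toy under those assumptions.

```scala
object CorpusCounts {
  // A tiny made-up corpus; each string stands in for one document.
  val corpus: List[String] = List("the cat sat", "the dog sat", "the cat ran")

  // Crude tokenization: split every document on single spaces.
  val tokens: List[String] = corpus.flatMap(_.split(" "))

  // How often does each word occur in the corpus?
  val counts: Map[String, Int] =
    tokens.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  // Relative frequency of a word; words never seen in the corpus get 0.0.
  def prob(word: String): Double =
    counts.getOrElse(word, 0).toDouble / tokens.size

  def main(args: Array[String]): Unit = {
    println(counts)
    println(prob("the"))
  }
}
```

Nothing in this code says which words are common in English. That information comes entirely from the data, which is the sense in which such a system learns from corpus statistics.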

Structure of this Book

We think that to understand and apply SNLP in practice one needs knowledge of the following:

  • Tasks (e.g. Machine Translation, Syntactic Parsing)
  • Methods (e.g. Discriminative Training, Linear Chain models)
  • Implementations (e.g. NLP data structures, efficient dynamic programming, Working with Scala?)

The book is hence organized along these three dimensions. For each dimension we provide a series of chapters that can be understood in isolation as much as possible. You can read the book linearly by following the Task axis. Each task chapter will feature links to methods useful for the task, and to implementation details that apply in the given context.


The best way to learn language processing with computers is to process language with computers. Hence, this book features interactive code blocks that we use to show NLP in practice, and that you can use to test and investigate methods and language. We use the Scala language throughout this book because of its concise syntax and type safety.

If you have programmed before, odds are you have processed language in one way or another. For example, you have probably accessed substrings before.
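As a minimal sketch of this kind of everyday language processing, the following Scala snippet extracts a substring and splits a sentence into words. The example sentence and the object name `SubstringExample` are our own placeholders, chosen only for illustration.

```scala
object SubstringExample {
  // A made-up example sentence.
  val sentence: String = "NLP is fun"

  // substring(begin, end) returns the characters from index `begin`
  // (inclusive) up to index `end` (exclusive).
  val firstWord: String = sentence.substring(0, 3)

  // Splitting on spaces is a very crude way to break text into words.
  val words: List[String] = sentence.split(" ").toList

  def main(args: Array[String]): Unit = {
    println(firstWord)
    println(words)
  }
}
```

Splitting on spaces already fails for punctuation and for languages that do not separate words with spaces.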