$\def\prob{p} \def\vocab{V} \def\params{\boldsymbol{\theta}} \def\argmax{\mathop{\arg\,\max}} \def\param{\theta} \def\bpi{\boldsymbol{\pi}} \def\balpha{\boldsymbol{\alpha}} \def\bbeta{\boldsymbol{\beta}} \def\perplexity{PP} \def\x{\mathbf{x}} \def\y{\mathbf{y}} \def\Xs{\mathcal{X}} \def\Ys{\mathcal{Y}} \def\perplexity{PP} \def\train{\mathcal{D}_\mathit{train}} \def\counts#1#2{\#_{#1}(#2)} \def\RR{\bf R} \def\length#1{\mathop{length}(#1)} \def\aligns{\mathbf{a}} \def\align{a} \def\source{\mathbf{s}} \def\ssource{s} \def\target{\mathbf{t}} \def\starget{t}$

Scala Crash Course, part 2: Collections

Collections

In Scala, collections are a set of useful classes and interfaces which enable you (efficient) data storage, and processing. They are divided in mutable and immutable structures (check more about this here).

Before going further, just a small word on performance: know your data structures! Whatever you are using, if you are concerned with performance, documentation is your best friend!

Lists

Scala lists are an ubiquitous data structure, in its essence a simple linked lists (#ScalaDoc), coming in both immutable and mutable flavour.Let's first construct a simple list:

List

All
work
and
no
play
makes
Jack
a
dull
boy
.

Very useful methods on lists are head and tail. Head of a list is the first element of a list:

All

whereas the tail of a list is the list following the first element:

List

work
and
no
play
makes
Jack
a
dull
boy
.

Lists allow quick addition of an element to the beginning to the list (prepending):

List

Jack Torrance:
All
work
and
no
play
makes
Jack
a
dull
boy
.

Concatenating two lists:

List

All
work
and
no
play
makes
Jack
a
dull
boy
.
Jack
!
What
...
are
you
?

A frequently useful method is getting unique elements from a list:

List

Breaking
the
law
,
breaking

For more details on lists, including other useful methods check the documentation or a random tutorial.

Immutable vs mutable list

Lists are immutable - they cannot be changed!

List

Can't
change
this
!

But you can convert it to an array and change:

Can change this !

Or even better, check scala.collection.mutable package for various mutable structures, like ListBuffer:

ListBuffer

Adding
elements
Oh, the Joy!

If you find yourself lost in immutable structures, check scala.collection.mutable

Case classes behave in the same manner:

Student

name John
surname Explosion

You cannot change m, but you CAN copy it with a change:

Student

name John
surname Brutal

Sets

Sets are data structures which store elements without an order and repetition.

An example of a set is given here:

Set

.
stirred
A
,
not
Shaken
martini

And some of the most useful methods on sets are union:

Set

is
try
.
do
stirred
A
or
There
,
not
Do
Shaken
martini
no

intersection:

Set

and set difference:

Set

is
try
do
or
There
Do
no

Maps

A map (also known as associative array or dictionary) is a collection of key-value pairs, such that each key appears exactly only once.

Map

saving save
tokens token
token token
occurring occur

Fetching the value of a specific key in the map:

token

The set of all the keys in a map:

Set

saving
tokens
token
occurring

Mutable map

Again, if you find yourself lost in the fact that you cannot add/change/update values in maps, check scala.collection.mutable package, which holds a mutable map:

Map

is 9
My 2
hovercraft 2

...and it is easy to add new elements...

Map

is 9
My 2
full 1
hovercraft 2

...another way of adding an element

Map

is 9
of 99
My 2
full 1
hovercraft 2

and it is easy to change elements (or again, another flavor of adding):

Map

is 9
of 99
My 2
eels 4
full 1
hovercraft 2

Tuples

Tuples are fixed-length lists in Scala (length up to 22 in Scala), denoted in a specific format:

Tuple2

_1 like
_2 2

You can access specific elements of that list (first, second) by using the following notation:

A tuple (N=2) is equivalent to a pair (as a key-value pair in the map above):

true

Option

Options are containers for optional values, which can contain values Some(X), if a value is present, and None if the value is missing. They are very useful to eliminate using null as a missing value.

In the following example, our map lemmas contains a method get which returns an optional value, whose specific value can then be accessed with a method get:

token

In case this option does not contain a value, get returns a None:

I'm sorry, Dave.

The getOrElse is particularly useful, as it enables you to either obtain a value of an option, or fall back to a default value (its parameter):

I'm afraid I can't do that.

Important methods on Collections

map

Applies a specific function on all the elements in a collection.

def map[B, Coll[B]](f: A => B): Coll[B]

The map method is one of the most frequently used methods.

In our case, having two sentences in a list:

List

Daisy, Daisy, give me your answer do.
I'm half crazy all for the love of you.

We define a function dyingHal and map it on each sentence in the list:

Da-sY, daiSy, give M- -our answer D-. i'M haLF craZY all For the LOVe oF You.

Let's do a short exercise through which we'll showcase the rest of the functions. Given a text:

It's not pining, it's passed on! This parrot is no more! It has ceased to be! It's expired and gone to meet its maker! This is a late parrot! It's a stiff! Bereft of life, it rests in peace! If you hadn't nailed it to the perch, it would be pushing up the daisies! It's rung down the curtain and joined the choir invisible! This is an ex-parrot!

Let's do some frequency analysis...

flatten

Flattens a collection (of collections...)

flatten[B]: Coll[B]

First, we'll split the text into sentences over exclamations points. This is a very bad way to do sentence segmentation, but you will learn better ways very soon. Afterwards, we'll split the sentences into words.

List

It's
not
pining,
it's
passed
on
This
parrot
is

The first split creates an array of strings (our sentences). The split inside a map results in an array of arrays of strings (our words, in sentences). As we will take a look only at sentences, we need to flatten everything to a single array.

flatMap

Applies a function that returns a sequence to a collection, and flattens the result.

flatMap[B, Coll[B]](f: A => Coll[B]): Coll[B]

We can do the same as previously by invoking flatMap:

List

It's
not
pining,
it's
passed
on
This
parrot
is

filter

Filters out a collection with a Boolean function.

filter(p: A => Boolean): Coll[A]

Seeing how our bad sentence splitting (take care of your sentence splitting!) creates some empty strings, we need to filter them out:

List

it's
not
pining,
it's
passed

We can also filter out other things, like stopwords:

List

It's
not
pining,
it's
passed
on
This
parrot
is
no

groupBy

Groups elements of a collection by a specific discriminator function, into key (the value of the descriminator function) and value (a list of all the elements of the starting collection which produce the same value of the descriminator function).

groupBy[K](f: A => K): Map[K, Coll[K]]

We will group our words by themselves:

Map

down
List
1. down
it's
List
1. it's
2. it's
3. it's
4. it's
5. it's
this
List
1. this
2. this
3. this

Exercise

Group words by their first letter.

mapValues

Applies a function to every value in a map.

mapValues[C](f: B => C): Map[A, C]

The reason we grouped the words by themselves is to count them up easily. We will do that with the mapValues function which applies a desired counting function over the values of our group map.

Map

down 1
it's 5
this 3
in 1
pining, 1
perch, 1
life, 1
is 3
expired 1
late 1

maxBy

Returns a maximum value in a collection.

maxBy[B](f: A => B): A

Out of curiosity, let's take a look at the most frequent word in our text by using maxBy:

Tuple2

_1 it's
_2 5

fold

The fold method comes in three similar flavours, fold, foldLeft and foldRight (check the differences between them here). In its essence, you can view these functions as iterators with an accumulator. Starting with a dedicated starting element, this function applies a function to a starting element and the first element. Then it applies the same function to the result and the second element, and so on...

foldLeft[B](z: B)(op: (B, A) => B): B

You can use, for example, foldLeft to iterate through the map of occurrences and calculate the total number of words:

and use that to calculate word frequencies:

Map

down 0.014705882352941176
it's 0.07352941176470588
this 0.04411764705882353
in 0.014705882352941176
pining, 0.014705882352941176
perch, 0.014705882352941176
life, 0.014705882352941176
is 0.04411764705882353
expired 0.014705882352941176
late 0.014705882352941176

foreach

As opposed to map, which applies a function to each element of a collection, foreach calls a non-returning procedure over each element, and does not result in a new collection, as map does.

foreach[U](f: A => U): Unit

So, if we'd want to check whether our frequencies sum to 1, we would do the following:

0.9999999999999989

Exercise

Do the same by using the foldLeft function.

StatNLP

StatNLP

Scala Crash Course, part 2: Collections

Collections

Lists

Immutable vs mutable list

Sets

Maps

Mutable map

Tuples

Option

Important methods on Collections

map

flatten

flatMap

filter

groupBy

mapValues

maxBy

fold

foreach