In Scala, collections are a set of classes and traits that let you store and process data efficiently. They are divided into mutable and immutable structures (see the Scala collections documentation for more on this).
Before going further, just a small word on performance: know your data structures! Whatever you are using, if you are concerned with performance, documentation is your best friend!
Scala lists are a ubiquitous data structure, in essence a simple linked list (see the ScalaDoc). List itself is immutable; mutable counterparts such as ListBuffer live in scala.collection.mutable. Let's first construct a simple list:
Two very useful methods on lists are head and tail. The head of a list is its first element:
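A minimal sketch of constructing a list. The element values are hypothetical, since the original example data is not shown here.

```scala
// Hypothetical example values.
val numbers: List[Int] = List(1, 2, 3, 2, 1)

// The same list built explicitly from cons cells, ending in the empty list Nil.
val sameNumbers: List[Int] = 1 :: 2 :: 3 :: 2 :: 1 :: Nil
```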
whereas the tail of a list is the list following the first element:
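Both accessors can be sketched on a small hypothetical list:

```scala
val numbers = List(1, 2, 3)   // hypothetical values

val first = numbers.head      // 1
val rest  = numbers.tail      // List(2, 3)

// head and tail throw on an empty list; headOption is a safe alternative to head.
val none = List.empty[Int].headOption   // None
```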
Lists allow quick addition of an element to the beginning of the list (prepending):
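As a sketch with hypothetical values, prepending creates a new list and leaves the original untouched:

```scala
val numbers = List(1, 2, 3)        // hypothetical values
val withZero = 0 :: numbers        // List(0, 1, 2, 3)
val alsoWithZero = 0 +: numbers    // same result; +: works on any Seq
```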
Concatenating two lists:
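A short sketch of two common ways to concatenate, using hypothetical lists:

```scala
val a = List(1, 2)
val b = List(3, 4)
val viaTripleColon = a ::: b   // List(1, 2, 3, 4); ::: is specific to List
val viaPlusPlus = a ++ b       // List(1, 2, 3, 4); ++ works on collections in general
```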
A frequently useful method is getting unique elements from a list:
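One way to do this is the distinct method, sketched here on hypothetical values:

```scala
val numbers = List(1, 2, 3, 2, 1)
val unique = numbers.distinct   // List(1, 2, 3); first occurrences are kept, in order
```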
For more details on lists, including other useful methods, check the documentation or a tutorial.
Lists are immutable - they cannot be changed!
But you can convert a list to an array and change the array's elements in place:
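A minimal sketch with hypothetical values. Note that the array is a copy, so the original list stays as it was.

```scala
val numbers = List(1, 2, 3)
val arr = numbers.toArray    // Array(1, 2, 3), a mutable copy
arr(0) = 42                  // update in place
// arr is now Array(42, 2, 3); numbers is still List(1, 2, 3)
```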
Or even better, check the scala.collection.mutable package for various mutable structures, like ListBuffer:
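A brief sketch of a ListBuffer being modified in place (hypothetical values, written for the REPL or a script):

```scala
import scala.collection.mutable.ListBuffer

val buffer = ListBuffer(1, 2, 3)
buffer += 4                  // append in place
0 +=: buffer                 // prepend in place
buffer(1) = 10               // replace the element at index 1
val frozen = buffer.toList   // List(0, 10, 2, 3, 4), an immutable snapshot
```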
If you find yourself lost in immutable structures, check scala.collection.mutable.
Case classes behave in the same manner:
You cannot change m, but you CAN copy it with a change:
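As an illustration, assume a hypothetical case class Movie. Only the name m comes from the text above; the class and its fields are made up for this sketch.

```scala
// Movie is a hypothetical case class used only for illustration.
case class Movie(title: String, year: Int)

val m = Movie("2001: A Space Odyssey", 1968)
// m.year = 2020   // does not compile: case class fields are vals by default

val remake = m.copy(year = 2020)   // a new instance; m itself is unchanged
```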
Sets are data structures which store elements with no guaranteed order and no duplicates.
An example of a set is given here:
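A minimal sketch with hypothetical elements:

```scala
val letters = Set("a", "b", "c", "a")   // the duplicate "a" is dropped
val size = letters.size                 // 3
val hasB = letters.contains("b")        // true
```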
And some of the most useful methods on sets are union:
and set difference:
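Both operations can be sketched on two small hypothetical sets:

```scala
val a = Set(1, 2, 3)
val b = Set(3, 4)
val union = a union b        // Set(1, 2, 3, 4); a | b is equivalent
val difference = a diff b    // Set(1, 2); a &~ b is equivalent
```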
A map (also known as an associative array or dictionary) is a collection of key-value pairs in which each key appears only once.
Fetching the value of a specific key in the map:
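A sketch of a lookup by key. The map is called lemmas, as in the Option example later in the text, but its entries here are hypothetical.

```scala
// Hypothetical entries: word form -> lemma.
val lemmas = Map("running" -> "run", "mice" -> "mouse")

val lemma = lemmas("running")   // "run"
// lemmas("cats") would throw NoSuchElementException, because the key is absent.
```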
The set of all the keys in a map:
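For instance, on a hypothetical map:

```scala
val lemmas = Map("running" -> "run", "mice" -> "mouse")   // hypothetical entries
val keys = lemmas.keySet   // Set("running", "mice")
```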
Again, if you find yourself lost in the fact that you cannot add/change/update values in maps, check the scala.collection.mutable package, which holds a mutable map:
...and it is easy to add new elements...
...another way of adding an element
and it is easy to change elements (or again, another flavor of adding):
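The mutable-map operations mentioned above can be sketched together. The entries are hypothetical and the snippet is written for the REPL or a script.

```scala
import scala.collection.mutable

val counts = mutable.Map("hal" -> 1)   // hypothetical entries
counts += ("dave" -> 2)                // add a new key-value pair
counts.put("frank", 3)                 // another way of adding; returns the previous value as an Option
counts("hal") = 10                     // change an existing key (or add it if absent)
```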
Tuples are fixed-length containers whose elements may have different types (up to 22 elements in Scala 2), written in parentheses:
You can access specific elements of a tuple (first, second) by using the following notation:
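A small sketch with hypothetical values, showing both the tuple syntax and the positional accessors:

```scala
val t = (1, "one", 1.0)   // a Tuple3[Int, String, Double]
val first = t._1          // 1
val second = t._2         // "one"
```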
A tuple of two elements (N=2) is a pair, like the key-value pairs in the map above:
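As a sketch (hypothetical values), the arrow syntax used for map entries builds an ordinary two-element tuple:

```scala
val pair = "running" -> "run"                  // syntactic sugar for ("running", "run")
val same = pair == (("running", "run"))         // true
val fromPairs = Map(pair, "mice" -> "mouse")    // a map is built from such pairs
```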
Options are containers for optional values: Some(X) if a value is present, and None if the value is missing. They are very useful for avoiding null as a marker for a missing value.
In the following example, our map lemmas has a method get which returns an Option; the value inside that Option can then be accessed with the Option's own get method:
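A minimal sketch, assuming hypothetical entries for lemmas:

```scala
val lemmas = Map("running" -> "run")   // hypothetical entries

val maybeLemma: Option[String] = lemmas.get("running")   // Some("run")
val lemma: String = maybeLemma.get                        // "run"
```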
If the key is not in the map, the map's get returns None (calling get on that None would throw a NoSuchElementException):
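The missing-key case, sketched with the same hypothetical map:

```scala
val lemmas = Map("running" -> "run")   // hypothetical entries

val missing: Option[String] = lemmas.get("cats")   // None
val isEmpty = missing.isEmpty                       // true
// missing.get would throw NoSuchElementException here.
```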
The getOrElse method is particularly useful, as it lets you either obtain the value of an option or fall back to a default value (its parameter):
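A short sketch of both outcomes, again with hypothetical entries and a hypothetical default value:

```scala
val lemmas = Map("running" -> "run")   // hypothetical entries

val found = lemmas.get("running").getOrElse("unknown")   // "run"
val fallback = lemmas.get("cats").getOrElse("unknown")   // "unknown"
// Map also offers the shortcut lemmas.getOrElse("cats", "unknown").
```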
The map method applies a function to every element of a collection and returns a new collection of the results.
map[B](f: A => B): Coll[B]
The map method is one of the most frequently used methods.
In our case, having two sentences in a list:
We define a function dyingHal and map it on each sentence in the list:
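A sketch of this step. The two sentences and the body of dyingHal are hypothetical stand-ins, since only the function name appears in the text.

```scala
// Hypothetical stand-ins for the two sentences.
val sentences = List("I can feel it", "My mind is going")

// Hypothetical body: lower-cases the sentence and lets it trail off.
def dyingHal(sentence: String): String = sentence.toLowerCase + "..."

val dying = sentences.map(dyingHal)
// List("i can feel it...", "my mind is going...")
```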
Let's do a short exercise through which we'll showcase the rest of the functions. Given a text:
Let's do some frequency analysis...
The flatten method collapses a collection of collections into a single collection.
flatten[B]: Coll[B]
First, we'll split the text into sentences at exclamation points. This is a very bad way to do sentence segmentation, but you will learn better ways very soon. Afterwards, we'll split the sentences into words.
The first split creates an array of strings (our sentences). The split inside a map results in an array of arrays of strings (our words, in sentences). As we will look only at words, we need to flatten everything into a single array.
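The three steps can be sketched as follows. The text is a hypothetical stand-in, since the original is not shown here.

```scala
// Hypothetical text.
val text = "Open the doors! I am sorry Dave! Open the doors now!"

val sentences: Array[String] = text.split("!")
val wordsPerSentence: Array[Array[String]] = sentences.map(_.split(" "))
val words: Array[String] = wordsPerSentence.flatten
// Sentences after the first start with a space, so splitting them on " "
// leaves an empty string at the front of their word arrays.
```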
The flatMap method applies a function that returns a collection to every element, and flattens the results into a single collection.
flatMap[B](f: A => Coll[B]): Coll[B]
We can do the same as previously by invoking flatMap:
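A sketch on a hypothetical text, showing that flatMap gives the same result as map followed by flatten:

```scala
// Hypothetical text.
val text = "Open the doors! I am sorry Dave! Open the doors now!"

val words: Array[String] = text.split("!").flatMap(_.split(" "))
val viaMapAndFlatten: Array[String] = text.split("!").map(_.split(" ")).flatten
```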
The filter method keeps only the elements of a collection for which a Boolean function (a predicate) returns true.
filter(p: A => Boolean): Coll[A]
Seeing how our bad sentence splitting (take care of your sentence splitting!) creates some empty strings, we need to filter them out:
We can also filter out other things, like stopwords:
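Both filters can be sketched like this. The word list and the stopword set are hypothetical.

```scala
// Hypothetical words, including the empty strings left over from splitting.
val words = List("", "Open", "the", "doors", "", "I", "am", "sorry")

val nonEmpty = words.filter(_.nonEmpty)
// List("Open", "the", "doors", "I", "am", "sorry")

val stopwords = Set("the", "i", "am")   // hypothetical stopword list
val content = nonEmpty.filterNot(w => stopwords.contains(w.toLowerCase))
// List("Open", "doors", "sorry")
```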
The groupBy method groups the elements of a collection by a discriminator function, producing a map from key (the value of the discriminator function) to value (a collection of all the elements of the starting collection which produce that value).
groupBy[K](f: A => K): Map[K, Coll[A]]
We will group our words by themselves:
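Grouping words by themselves amounts to using the identity function as the discriminator. A sketch on hypothetical words:

```scala
val words = List("open", "the", "doors", "open", "the", "open")   // hypothetical

val grouped: Map[String, List[String]] = words.groupBy(identity)
// Contains "open" -> List("open", "open", "open"), "the" -> List("the", "the"),
// "doors" -> List("doors"); the order of entries is not guaranteed.
```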
The mapValues method applies a function to every value in a map, leaving the keys unchanged.
mapValues[C](f: B => C): Map[A, C]
The reason we grouped the words by themselves is to count them up easily. We will do that with the mapValues function which applies a desired counting function over the values of our group map.
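A sketch of the counting step on hypothetical words. One caveat: since Scala 2.13, calling mapValues directly on a Map is deprecated in favour of .view.mapValues(...).toMap, so the call below may produce a deprecation warning on newer versions. The trailing toMap is there so that the result is a strict Map in either case.

```scala
val words = List("open", "the", "doors", "open", "the", "open")   // hypothetical
val grouped = words.groupBy(identity)

// Replace each group of identical words by its size.
val counts: Map[String, Int] = grouped.mapValues(_.size).toMap
```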
The maxBy method returns the element of a collection for which the given function yields the largest value.
maxBy[B](f: A => B): A
Out of curiosity, let's take a look at the most frequent word in our text by using maxBy:
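On a hypothetical map of counts, this could look like the following, where each element is a (word, count) pair and we compare by the count:

```scala
val counts = Map("open" -> 3, "the" -> 2, "doors" -> 1)   // hypothetical counts

val mostFrequent: (String, Int) = counts.maxBy(_._2)   // ("open", 3)
```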
The fold method comes in three similar flavours: fold, foldLeft and foldRight (see the Scala documentation for the differences between them). In essence, you can view these functions as iteration with an accumulator. foldLeft starts from a dedicated initial value, applies a function to that value and the first element, then applies the same function to the result and the second element, and so on.
foldLeft[B](z: B)(op: (B, A) => B): B
You can use, for example, foldLeft to iterate through the map of occurrences and calculate the total number of words:
and use that to calculate word frequencies:
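Both steps can be sketched on a hypothetical map of counts:

```scala
val counts = Map("open" -> 3, "the" -> 2, "doors" -> 1)   // hypothetical counts

// Start from 0 and add each word's count to the accumulator.
val total = counts.foldLeft(0) { case (acc, (_, n)) => acc + n }   // 6

// Relative frequency of each word.
val frequencies = counts.map { case (w, n) => (w, n.toDouble / total) }
```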
As opposed to map, which applies a function to each element of a collection, foreach calls a non-returning procedure over each element, and does not result in a new collection, as map does.
foreach[U](f: A => U): Unit
So, if we wanted to check whether our frequencies sum to 1, we would do the following:
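One possible sketch, using hypothetical frequencies and a mutable accumulator that foreach updates as a side effect (written for the REPL or a script):

```scala
val frequencies = Map("open" -> 0.5, "the" -> 0.25, "doors" -> 0.25)   // hypothetical

var sum = 0.0
frequencies.foreach { case (_, f) => sum += f }   // side effect only; nothing is returned

// Floating-point sums are best compared with a small tolerance.
val sumsToOne = math.abs(sum - 1.0) < 1e-9
```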