Advanced Streams

Grouping Results

So far, results were either a value or a collection
Sometimes, want to split result into groups
Example: Group all words with the same first letter together
Use
```
stream.collect(Collectors.groupingBy(function))
```
- The function produces a key for each element
- The result is a map
- Map values are collections of elements with the same key

Map<String, List<String>> groups = Stream.of(words)
   .collect(Collectors.groupingBy(
   w -> w.substring(0, 1))); // The function for extracting the keys

groups.get("a") is a list of all words starting with a
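A minimal, self-contained sketch of this grouping. The class name GroupingDemo, the helper groupByFirstLetter, and the sample words are made up for illustration. Note that a key with no matching words is simply absent from the map.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class GroupingDemo
{
   // Groups words by their first letter.
   // Assumes no word is empty; substring(0, 1) would throw on "".
   static Map<String, List<String>> groupByFirstLetter(String... words)
   {
      return Stream.of(words)
         .collect(Collectors.groupingBy(w -> w.substring(0, 1)));
   }

   public static void main(String[] args)
   {
      // Sample words are made up for illustration.
      Map<String, List<String>> groups =
         groupByFirstLetter("apple", "avocado", "banana", "cherry", "blueberry");
      System.out.println(groups.get("a")); // [apple, avocado]
      System.out.println(groups.get("z")); // null: no group was created
   }
}
```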

Processing Groups

Nice to split result into groups
Even nicer: Can process each group
Pass a collector to Collectors.groupingBy

Example: Group into sets, not lists

Map<String, Set<String>> groupOfSets = Stream.of(words)
   .collect(Collectors.groupingBy(
      w -> w.substring(0, 1), // The function for extracting the keys
      Collectors.toSet())); // The group collector

The groupingBy collector collects the stream into groups
The toSet collector collects each group into a set

Collecting Counts and Sums

Use Collectors.counting() to count the group values

Map<String, Long> groupCounts = Stream.of(words)
   .collect(Collectors.groupingBy(
      w -> w.substring(0, 1),
      Collectors.counting()));

groupCounts.get("a") is the number of words that start with an a
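A runnable sketch of counting per group, using a made-up class CountingDemo and sample words. It also shows one detail worth knowing: a letter that no word starts with has no entry at all, so get returns null rather than 0. Map.getOrDefault is one way to handle that.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CountingDemo
{
   // Counts how many words share each first letter.
   static Map<String, Long> countByFirstLetter(String... words)
   {
      return Stream.of(words)
         .collect(Collectors.groupingBy(
            w -> w.substring(0, 1),
            Collectors.counting()));
   }

   public static void main(String[] args)
   {
      // Sample words are made up for illustration.
      Map<String, Long> counts =
         countByFirstLetter("apple", "avocado", "banana");
      System.out.println(counts.get("a")); // 2
      System.out.println(counts.get("z")); // null: no group was created
      System.out.println(counts.getOrDefault("z", 0L)); // 0
   }
}
```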

To sum up some aspect of group values, use summingInt, summingDouble, summingLong:

Map<String, Long> groupSum = countries.collect(
   Collectors.groupingBy(
      c -> c.getContinent(), // The function for extracting the keys
      Collectors.summingLong(
         c -> c.getPopulation()))); // The function for getting the summands

groupSum.get("Asia") is the total population of Asian countries
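The snippet above assumes a Country class and a countries stream that are not shown. Here is a self-contained sketch with a hypothetical Country stand-in and placeholder populations (not real figures), just to show the summing collector at work:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SummingDemo
{
   // Hypothetical stand-in for the Country class that the text assumes.
   static class Country
   {
      private final String continent;
      private final long population;

      Country(String continent, long population)
      {
         this.continent = continent;
         this.population = population;
      }

      String getContinent() { return continent; }
      long getPopulation() { return population; }
   }

   // Sums the populations of all countries on the same continent.
   static Map<String, Long> populationByContinent(List<Country> countries)
   {
      return countries.stream()
         .collect(Collectors.groupingBy(
            Country::getContinent,
            Collectors.summingLong(Country::getPopulation)));
   }

   public static void main(String[] args)
   {
      // Placeholder populations, not real figures.
      List<Country> countries = Arrays.asList(
         new Country("Asia", 10),
         new Country("Asia", 20),
         new Country("Europe", 5));
      System.out.println(populationByContinent(countries).get("Asia")); // 30
   }
}
```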

Collecting Average, Maximum, Minimum

The Collectors methods averagingInt, averagingDouble, averagingLong work just like summingXxx
They return 0 for empty groups (not an Optional)

Average word length grouped by starting character:

Map<String, Double> groupAverages = Stream.of(words)
   .collect(Collectors.groupingBy(
      w -> w.substring(0, 1),
      Collectors.averagingInt(String::length)));

maxBy, minBy use a comparison function and return Optional results:

Map<String, Optional<String>> groupLongest = Stream.of(words)
   .collect(
      Collectors.groupingBy(
         w -> w.substring(0, 1), // The function for extracting the keys
         Collectors.maxBy(
            (v, w) -> v.length() - w.length()))); // The comparator function
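A runnable sketch of both collectors, with a made-up class GroupStatsDemo and sample words. As an alternative to the subtraction lambda, it builds the comparator with Comparator.comparingInt, which expresses the same ordering by length. Each group that groupingBy creates contains at least one element, so the Optional for an existing key should be present; a key with no words has no entry, and get returns null.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class GroupStatsDemo
{
   // Average word length per first letter.
   static Map<String, Double> averageLengths(String... words)
   {
      return Stream.of(words)
         .collect(Collectors.groupingBy(
            w -> w.substring(0, 1),
            Collectors.averagingInt(String::length)));
   }

   // Longest word per first letter, wrapped in an Optional by maxBy.
   static Map<String, Optional<String>> longestWords(String... words)
   {
      return Stream.of(words)
         .collect(Collectors.groupingBy(
            w -> w.substring(0, 1),
            Collectors.maxBy(Comparator.comparingInt(String::length))));
   }

   public static void main(String[] args)
   {
      // Sample words are made up for illustration.
      String[] words = { "apple", "avocado", "banana" };
      System.out.println(averageLengths(words).get("a")); // 6.0
      System.out.println(longestWords(words).get("a").orElse("")); // avocado
   }
}
```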

Exercise 1

Today, we go beyond what Codecheck can do. Fire up your favorite Java IDE and make a project Unit5.

At http://horstmann.com/heig-vd/spring2015/poo/unit5/movies.txt , there is a database of movie facts. Each movie has five lines that look like this:

Name: Five Easy Pieces
Year: 1970
Directed by: Bob Rafelson
Produced by: Bob Rafelson, Richard Wechsler, Harold Schneider
Actors: Jack Nicholson, Karen Black, Billy Green Bush, more...

First, let’s come up with a class that describes a movie. Add it to the project.
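One possible shape for such a class, as a sketch only. The constructor signature and the getTitle and getDirectors methods are inferred from how the code later in this exercise uses Movie; the remaining getters are assumptions, and your own design may differ.

```java
import java.util.List;

// A possible Movie class; fields mirror the five lines of each movie record.
class Movie
{
   private final String title;
   private final int year;
   private final List<String> directors;
   private final List<String> producers;
   private final List<String> actors;

   public Movie(String title, int year, List<String> directors,
      List<String> producers, List<String> actors)
   {
      this.title = title;
      this.year = year;
      this.directors = directors;
      this.producers = producers;
      this.actors = actors;
   }

   public String getTitle() { return title; }
   public int getYear() { return year; }
   public List<String> getDirectors() { return directors; }
   public List<String> getProducers() { return producers; }
   public List<String> getActors() { return actors; }
}
```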

Next, we need to read in the movies. Because we need to consume five input lines per movie, there is nothing to be gained by reading the input as a stream of lines. Instead, we just put the movies into an ArrayList:

public static List<Movie> readMovies(String url) throws IOException
{
   List<Movie> movies = new ArrayList<>();
   try (Scanner in = new Scanner(new URL(url).openStream()))
   {
      while (in.hasNextLine())
      {
         String nameLine = in.nextLine();
         String yearLine = in.nextLine();
         String directorsLine = in.nextLine();
         String producersLine = in.nextLine();
         String actorsLine = in.nextLine();
         movies.add(new Movie(getString(nameLine),
            Integer.parseInt(getString(yearLine)),
            getList(directorsLine),
            getList(producersLine),
            getList(actorsLine)));
      }
   }
   return movies;
}

Here, getString is a helper method that strips off the field header, and getList is a helper that breaks up a comma-separated list:

private static String getString(String line)
{
   int colon = line.indexOf(":");
   return line.substring(colon + 1).trim();
}
private static List<String> getList(String line)
{
   return Stream.of(getString(line).split(", "))
      .collect(Collectors.toList());
}

Now we can make a stream:

List<Movie> movieList = readMovies("http://horstmann.com/heig-vd/spring2015/poo/unit5/movies.txt");
Stream<Movie> movies = movieList.stream();

Let's do something interesting with the data. Are there any movie titles that start with X?

List<String> result1 = movieList.stream()
   .map(m -> m.getTitle())
   .filter(t -> t.startsWith("X"))
   .collect(Collectors.toList());

Let's find out how many movies start with a given letter. This is a typical use of groupingBy with a secondary collector:
```
Map<String, Long> firstLetters = movieList.stream()
   .collect(Collectors.groupingBy(
      m -> m.getTitle().substring(0, 1),
      Collectors.counting()));
```
How many movies start with the letter T?

There is a simple reason for that. What do you get from

movieList.stream()
   .filter(m -> m.getTitle().startsWith("The "))
   .count();

What other letters are common starting letters of movies? Why do you think that is?
Who is the most prolific director? This is not so easy to answer because a movie can have more than one director. Fortunately, this only happens with about five percent of movies, so let’s just pick the first one. There are, however, a number of movies with no directors in the data set. We filter those out first. Then we can group by the first director:
```
Map<String, List<Movie>> moviesByDirector = movieList.stream()
   .filter(m -> m.getDirectors().size() > 0)
   .collect(Collectors.groupingBy(
      m -> m.getDirectors().get(0)));
```
This map associates each director with a list of the movies that they directed. Unfortunately, that’s a large map. How many entries does it have?

Let's find the map entry with the longest list. There is no need to use streams.

String mostProlificDirector = Collections.max(
   moviesByDirector.entrySet(),
   Comparator.comparing(e -> e.getValue().size())).getKey();

Who is this director? Never heard of him? Google his name!

Which movies? To extract the titles, it is easy to use map with a stream:

List<String> titles = moviesByDirector.get(mostProlificDirector)
   .stream()
   .map(m -> m.getTitle())
   .collect(Collectors.toList());

Which of these would you like to see first?

Parallel Streams

Use parallelStream on a collection:

Stream<String> parallelWords = words.parallelStream();

Or use parallel on any stream:

Stream<String> parallelWords = Stream.of(wordArray).parallel();

When the terminal method executes, operations are parallelized
Intent: Same result as when run sequentially
Just faster because the work is distributed over available processors

Example:

long result = wordStream.parallel()
   .filter(w -> w.length() > 10)
   .count();

The underlying data is partitioned into n regions
The filtering and counting execute concurrently
When all counts are ready, they are combined
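A small sketch to check the "same result" claim yourself. The class ParallelCountDemo and its synthetic word list (binary representations of integers, chosen only so that lengths vary) are made up for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class ParallelCountDemo
{
   // Synthetic "words": binary representations of 0..n-1, so lengths vary.
   static List<String> sampleWords(int n)
   {
      return IntStream.range(0, n)
         .mapToObj(Integer::toBinaryString)
         .collect(Collectors.toList());
   }

   // Counts words longer than 10 characters, sequentially or in parallel.
   static long countLongWords(List<String> words, boolean parallel)
   {
      Stream<String> stream = parallel ? words.parallelStream() : words.stream();
      return stream.filter(w -> w.length() > 10).count();
   }

   public static void main(String[] args)
   {
      List<String> words = sampleWords(100_000);
      long sequential = countLongWords(words, false);
      long parallel = countLongWords(words, true);
      System.out.println(sequential == parallel); // true
   }
}
```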

Caution: Side Effects

Don't count like this:

int[] shortWords = new int[12];
words.parallelStream().forEach(
   s -> {
      if (s.length() < 12)
         shortWords[s.length()]++;
            // Error—race condition!
   });

Avoid side effects in the lambdas that you pass to the stream ops!

Correct solution:

Map<Integer, Long> shortWordCounts =
   words.parallelStream()
      .filter(s -> s.length() < 12)
      .collect(Collectors.groupingBy(
         String::length,
         Collectors.counting()));

Work with the stream library, not against it!
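One way to convince yourself that the collector-based version is safe: the group counts must add up to the number of words that pass the filter, on every run. The sketch below checks this with a made-up class SideEffectDemo and synthetic words (binary representations of integers).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SideEffectDemo
{
   // Side-effect-free counting: the collector merges partial results itself.
   static Map<Integer, Long> shortWordCounts(List<String> words)
   {
      return words.parallelStream()
         .filter(s -> s.length() < 12)
         .collect(Collectors.groupingBy(
            String::length,
            Collectors.counting()));
   }

   // Adds up all group counts.
   static long total(Map<Integer, Long> counts)
   {
      return counts.values().stream().mapToLong(Long::longValue).sum();
   }

   public static void main(String[] args)
   {
      // Synthetic words, made up for illustration.
      List<String> words = IntStream.range(0, 100_000)
         .mapToObj(Integer::toBinaryString)
         .collect(Collectors.toList());
      long expected = words.stream().filter(s -> s.length() < 12).count();
      for (int run = 0; run < 5; run++)
      {
         System.out.println(total(shortWordCounts(words)) == expected); // true
      }
   }
}
```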

Effective Parallelization

Streams from arrays and lists are ordered. Results are predictable, even on parallel streams
Use findAny instead of findFirst if you don't care about ordering

Call unordered to speed up limit or distinct:

Stream<String> sample = words.parallelStream().unordered().limit(n);
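A short sketch of both hints, using a made-up class UnorderedDemo. With findAny and unordered, which elements you get back may vary from run to run, so the code only relies on properties that hold for any choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class UnorderedDemo
{
   // Returns some word longer than 10 characters, not necessarily the first.
   static Optional<String> anyLongWord(List<String> words)
   {
      return words.parallelStream()
         .filter(w -> w.length() > 10)
         .findAny();
   }

   // Returns some n words (fewer if the list is shorter), in no particular order.
   static List<String> sample(List<String> words, int n)
   {
      return words.parallelStream()
         .unordered()
         .limit(n)
         .collect(Collectors.toList());
   }

   public static void main(String[] args)
   {
      // Sample words are made up for illustration.
      List<String> words = Arrays.asList(
         "stream", "parallelization", "collector", "concurrently", "map");
      System.out.println(anyLongWord(words).isPresent()); // true
      System.out.println(sample(words, 2).size()); // 2
   }
}
```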

Use groupingByConcurrent to speed up grouping if you don't care about the order in which the values are processed

Map<Integer, Long> wordCounts =
   words.parallelStream()
      .collect(
         Collectors.groupingByConcurrent(
            String::length,
            Collectors.counting()));
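A self-contained sketch with a made-up class ConcurrentGroupingDemo. The collector yields a ConcurrentMap, and with a counting downstream collector the counts should match those from the ordinary groupingBy; only the order in which elements are processed is unspecified.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

class ConcurrentGroupingDemo
{
   // Counts words by length; all threads update one shared concurrent map.
   static ConcurrentMap<Integer, Long> countByLength(List<String> words)
   {
      return words.parallelStream()
         .collect(Collectors.groupingByConcurrent(
            String::length,
            Collectors.counting()));
   }

   public static void main(String[] args)
   {
      // Sample words are made up for illustration.
      List<String> words = Arrays.asList("a", "bb", "cc", "ddd");
      Map<Integer, Long> plain = words.stream()
         .collect(Collectors.groupingBy(String::length, Collectors.counting()));
      System.out.println(countByLength(words).equals(plain)); // true
   }
}
```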

Lab Exercise 2

In the Unit5 project, make a new class Exercise2 with this code. Run the code. What do you get?
Run the code again. What happens? Try it a couple more times, just to be sure.
Now try it with the alternative solution. What happens?
Use this code to compute the sum of the map values:
```
System.out.println("Total: "
   + shortWordCounts.values().stream().mapToLong(n -> n).sum());
```

Lab Exercise 4

The file src.zip in the JDK directory contains the source for many of the library functions. In this lab, you will look inside this file to find the longest identifiers used by the library authors. And you'll parallelize the search.
Make a class Exercise4 and copy the time method from Exercise 3.
Add this Pair class to the project.

Add this method that reads in a file from a Path, yielding a pair of the path and the file:

public static Pair<Path, String> read(Path p)
{
   try
   {
      return Pair.of(p, new String(Files.readAllBytes(p)));
   }
   catch (IOException e)
   {
      return Pair.of(p, "");
   }
}

Add this method that finds the longest word in a string that holds the contents of a file. We split along non-letters (regex \PL+), and the rest is standard stream stuff.

public static String longestWord(String contents)
{
   return Stream.of(contents.split("\\PL+"))
      .max(Comparator.comparing(String::length))
      .orElse("");
}
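To see what the \PL+ split does, here is the same method wrapped in a made-up class LongestWordDemo with a few sample inputs. Since \PL matches anything that is not a letter, digits and underscores also act as separators, so an identifier such as MAX_VALUE is split into two words.

```java
import java.util.Comparator;
import java.util.stream.Stream;

class LongestWordDemo
{
   // Same logic as longestWord above: split on runs of non-letters,
   // then pick the longest piece.
   static String longestWord(String contents)
   {
      return Stream.of(contents.split("\\PL+"))
         .max(Comparator.comparing(String::length))
         .orElse("");
   }

   public static void main(String[] args)
   {
      // Sample inputs are made up for illustration.
      System.out.println(longestWord("int fooBar = baz42;")); // fooBar
      System.out.println(longestWord("MAX_VALUE")); // VALUE
      System.out.println(longestWord("").isEmpty()); // true
   }
}
```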

Now we'll traverse all files in the zip file. Don't worry about the details, but be amazed that Java can treat a zip file like a file system and walk through its contents.
```
String zippath = "/opt/jdk1.8.0/src.zip";
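// The (ClassLoader) cast is not needed on JDK 8, but it keeps the call unambiguous on newer JDKs
FileSystem zipfs = FileSystems.newFileSystem(Paths.get(zippath), (ClassLoader) null);

try (Stream<Path> entries = Files.walk(zipfs.getPath("/")))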
{
   ... 
}
```
However, you'll need to change the zippath variable to point to where your src.zip is located. On Windows, it's in c:\Program Files\Java\jdk1.8.0_xx. (Remember to use double backslashes in strings.) On Mac OS X, it's in something like /Library/Java/JavaVirtualMachines/jdk1.8.0_xx.jdk/Contents/Home. On Linux, if you installed the JDK yourself, you know where it is. If you use openjdk-8 on Ubuntu, you need to run sudo apt-get install openjdk-8-src, and look in /usr/lib/jvm/openjdk-8.
Before going on, make sure that everything works. Tag main as throws IOException. Add
```
time(() -> entries.count());
```
where the ... are. Run the program. You should get about 8220 files.

Now remove .count() and add these stream operations:

.filter(Files::isRegularFile)
.map(Exercise4::read)
.map(p -> Pair.of(p.first(), longestWord(p.second())))
.sorted(Comparator.comparing(p -> -p.second().length()))
.limit(10)
.collect(Collectors.toList())

What do they do? (In plain English or French, what's the high-level idea?)

Run the program. Marvel at the output. Isn't it amazing that someone named a method makeInfoOnlyServantCacheLocalClientRequestDispatcherFactory?
Now make the stream of entries parallel. Does it speed up the program? By how much? How does that compare with the speedup in Exercise 3? Why isn't it as good?