Advanced Streams

Grouping Results

Processing Groups

Collecting Counts and Sums

Collecting Average, Maximum, Minimum

Lab Exercise 1

  1. Today, we go beyond what Codecheck can do. Fire up your favorite Java IDE and make a project Unit5.
  2. At http://horstmann.com/heig-vd/spring2015/poo/unit5/movies.txt , there is a database of movie facts. Each movie has five lines that look like this:
    Name: Five Easy Pieces
    Year: 1970
    Directed by: Bob Rafelson
    Produced by: Bob Rafelson, Richard Wechsler, Harold Schneider
    Actors: Jack Nicholson, Karen Black, Billy Green Bush, more...
  3. First, let’s come up with a class that describes a movie. Add it to the project.
  4. Next, we need to read in the movies. Because we need to consume five input lines per movie, there is nothing to be gained by reading the input as a stream of lines. Instead, we just put the movies into an ArrayList:
    public static List<Movie> readMovies(String url) throws IOException
    {
       List<Movie> movies = new ArrayList<>();
       try (Scanner in = new Scanner(new URL(url).openStream()))
       {
          while (in.hasNextLine())
          {
             String nameLine = in.nextLine();
             String yearLine = in.nextLine();
             String directorsLine = in.nextLine();
             String producersLine = in.nextLine();
             String actorsLine = in.nextLine();
             movies.add(new Movie(getString(nameLine),
                Integer.parseInt(getString(yearLine)),
                getList(directorsLine),
                getList(producersLine),
                getList(actorsLine)));
          }
       }
       return movies;
    }
    Here, getString is a helper method that strips off the field header, and getList is a helper that breaks up a comma-separated list:
    private static String getString(String line)
    {
       int colon = line.indexOf(":");
       return line.substring(colon + 1).trim();
    }
    private static List<String> getList(String line)
    {
       return Stream.of(getString(line).split(", "))
          .collect(Collectors.toList());
    }
    
  5. Now we can make a stream:
    List<Movie> movieList = readMovies("http://horstmann.com/heig-vd/spring2015/poo/unit5/movies.txt");
    Stream<Movie> movies = movieList.stream();
  6. Let's do something interesting with the data. Are there any movie titles that start with X?
    List<String> result1 = movieList.stream()
       .map(m -> m.getTitle())
       .filter(t -> t.startsWith("X"))
       .collect(Collectors.toList());
  7. Let's find out how many movies start with a given letter. This is a typical use of groupingBy with a secondary collector:
    Map<String, Long> firstLetters = movieList.stream()
       .collect(Collectors.groupingBy(
          m -> m.getTitle().substring(0, 1),
          Collectors.counting()));
    How many movies start with the letter T?
  8. There is a simple reason for that. What do you get from the following?
    movieList.stream()
       .filter(m -> m.getTitle().startsWith("The "))
       .count();
  9. What other letters are common starting letters of movies? Why do you think that is?
  10. Who is the most prolific director? This is not so easy to answer because a movie can have more than one director. Fortunately, this only happens with about five percent of movies, so let’s just pick the first one. There are, however, a number of movies with no directors in the data set. We filter those out first. Then we can group by the first director:
    Map<String, List<Movie>> moviesByDirector = movieList.stream()
       .filter(m -> m.getDirectors().size() > 0)
       .collect(Collectors.groupingBy(
          m -> m.getDirectors().get(0)));
    This map associates each first-listed director with a list of the movies that they directed. Unfortunately, that’s a large map. How many entries does it have?
  11. Let's find the map entry with the longest list. There is no need to use streams.
    String mostProlificDirector = Collections.max(
       moviesByDirector.entrySet(),
       Comparator.comparing(e -> e.getValue().size())).getKey();
    Who is this director? Never heard of him? Google his name!
  12. Which movies did this director make? To extract the titles, map over a stream of the director's movies:
    List<String> titles = moviesByDirector.get(mostProlificDirector)
       .stream()
       .map(m -> m.getTitle())
       .collect(Collectors.toList());
    Which of these would you like to see first?
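The steps of this exercise can be tried without the network on a small in-memory list. The following is a minimal sketch, not the course's reference solution: the Movie class is one possible answer to step 3, and the class name MovieSketch, the helper names, and all sample movies except "Five Easy Pieces" are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical driver class; the names here are assumptions, not part of the lab.
class MovieSketch
{
   // Sample data: the first entry comes from the excerpt in step 2,
   // the others are invented placeholders, not real movies.
   static List<Movie> sampleMovies()
   {
      return Arrays.asList(
         new Movie("Five Easy Pieces", 1970,
            Arrays.asList("Bob Rafelson"),
            Arrays.asList("Bob Rafelson", "Richard Wechsler", "Harold Schneider"),
            Arrays.asList("Jack Nicholson", "Karen Black", "Billy Green Bush")),
         new Movie("The First Placeholder", 2000,
            Arrays.asList("Director A"),
            Collections.emptyList(), Collections.emptyList()),
         new Movie("The Second Placeholder", 2001,
            Arrays.asList("Director A", "Director B"),
            Collections.emptyList(), Collections.emptyList()),
         new Movie("Untitled Placeholder", 2002,
            Collections.emptyList(),
            Collections.emptyList(), Collections.emptyList()));
   }

   // Step 7: count the movies for each first letter of the title.
   static Map<String, Long> countByFirstLetter(List<Movie> movies)
   {
      return movies.stream()
         .collect(Collectors.groupingBy(
            m -> m.getTitle().substring(0, 1),
            Collectors.counting()));
   }

   // Steps 10 and 11 combined: skip movies without directors, count by
   // first director, and pick the key with the largest count.
   // Collections.max throws NoSuchElementException if no movie has a director.
   static String mostProlificDirector(List<Movie> movies)
   {
      Map<String, Long> counts = movies.stream()
         .filter(m -> !m.getDirectors().isEmpty())
         .collect(Collectors.groupingBy(
            m -> m.getDirectors().get(0),
            Collectors.counting()));
      return Collections.max(counts.entrySet(),
         Map.Entry.comparingByValue()).getKey();
   }

   public static void main(String[] args)
   {
      List<Movie> movies = sampleMovies();
      System.out.println(countByFirstLetter(movies));
      System.out.println(mostProlificDirector(movies));
   }
}

// One possible Movie class for step 3: an immutable holder for the five fields.
class Movie
{
   private final String title;
   private final int year;
   private final List<String> directors;
   private final List<String> producers;
   private final List<String> actors;

   public Movie(String title, int year, List<String> directors,
      List<String> producers, List<String> actors)
   {
      this.title = title;
      this.year = year;
      this.directors = directors;
      this.producers = producers;
      this.actors = actors;
   }

   public String getTitle() { return title; }
   public int getYear() { return year; }
   public List<String> getDirectors() { return directors; }
   public List<String> getProducers() { return producers; }
   public List<String> getActors() { return actors; }
}
```

Counting with a secondary collector avoids building the full lists of movies when only the sizes matter. Keep the List<Movie> version from step 10 if you also want the titles, as in step 12.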

Parallel Streams

Caution: Side Effects

Effective Parallelization

Lab Exercise 2

Lab Exercise 3

  1. Make a class Exercise3. To measure the execution time of an arbitrary code snippet, add this method:
    public static <T> void time(Callable<T> c)
    {
       Instant start = Instant.now();
       try
       {
          T result = c.call();
          System.out.println(result);
       } 
       catch (Exception e)
       {
          System.out.println(e);
       }
       long millis = Duration.between(start, Instant.now()).toMillis();
       System.out.printf("Elapsed time: %d milliseconds\n", millis);
    }
    
  2. Make a range of BigInteger numbers:
    BigInteger start = BigInteger.valueOf(2).pow(267);
    int length = 100000; // If you have a fast computer, increase this value
    List<BigInteger> bs = IntStream.range(1, length)
       .mapToObj(i -> BigInteger.valueOf(i).add(start))
       .collect(Collectors.toList());
  3. Then turn bs into a stream and find the prime numbers that contain the sequence 666:
    bs.stream()
       .filter(b -> b.isProbablePrime(100))
       .filter(b -> b.toString().contains("666"))
       .count()
  4. Pass this to time:
    time(() -> bs.stream()...count());
  5. How long does it run? (It's about 5 seconds on my laptop. If it's much faster on yours, increase length to 200000 and try again.)
  6. Now make the stream into a parallel stream and try again. Simply copy/paste the time command and make the stream parallel:
    time(() -> bs.stream().parallel()...count());
  7. Now we want to see one of the results. Change count to findAny on both the sequential and parallel streams. What are the timing results? Why?
  8. Run it a few times. Do you always get the same answer? If not, why not?
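The difference between count and findAny can be explored with a scaled-down version of this exercise. The sketch below is only an illustration under stated assumptions: the class name ParallelSketch and its helper methods are invented, the range is much shorter than in the lab so that it runs quickly, and the "666" filter is left out so that matches are more likely.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical helper class; not part of the lab materials.
class ParallelSketch
{
   // Same construction as in step 2, with the length as a parameter.
   static List<BigInteger> candidates(int length)
   {
      BigInteger start = BigInteger.valueOf(2).pow(267);
      return IntStream.range(1, length)
         .mapToObj(i -> BigInteger.valueOf(i).add(start))
         .collect(Collectors.toList());
   }

   // Returns a sequential or parallel stream over the same list.
   static Stream<BigInteger> stream(List<BigInteger> bs, boolean parallel)
   {
      return parallel ? bs.parallelStream() : bs.stream();
   }

   // count must look at every element, so both modes give the same number.
   static long countPrimes(List<BigInteger> bs, boolean parallel)
   {
      return stream(bs, parallel)
         .filter(b -> b.isProbablePrime(100))
         .count();
   }

   // findAny may stop at the first match that any thread finds, so a
   // parallel run is free to return a different element each time.
   static Optional<BigInteger> anyPrime(List<BigInteger> bs, boolean parallel)
   {
      return stream(bs, parallel)
         .filter(b -> b.isProbablePrime(100))
         .findAny();
   }

   public static void main(String[] args)
   {
      List<BigInteger> bs = candidates(5000);
      System.out.println("Sequential count: " + countPrimes(bs, false));
      System.out.println("Parallel count:   " + countPrimes(bs, true));
      System.out.println("Sequential any:   " + anyPrime(bs, false));
      System.out.println("Parallel any:     " + anyPrime(bs, true));
   }
}
```

Wrap each call in the time method from step 1 to compare the elapsed times on your own machine.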

Lab Exercise 4

  1. The file src.zip in the JDK directory contains the source for many of the library functions. In this lab, you will look inside this file to find the longest identifiers used by the library authors. And you'll parallelize the search.
  2. Make a class Exercise4 and copy the time method from Exercise 3.
  3. Add this Pair class to the project.
  4. Add this method that reads in a file from a Path, yielding a pair of the path and the file:
    public static Pair<Path, String> read(Path p)
    {
       try
       {
          return Pair.of(p, new String(Files.readAllBytes(p)));
       }
       catch (IOException e)
       {
          return Pair.of(p, "");
       }
    }
    
  5. Add this method that finds the longest word in a string that holds the contents of a file. We split along non-letters (regex \PL+), and the rest is standard stream stuff.

    public static String longestWord(String contents)
    {
       return Stream.of(contents.split("\\PL+"))
          .max(Comparator.comparing(String::length))
          .orElse("");
    }
    

  6. Now we'll traverse all files in the zip file. Don't worry about the details, but be amazed that Java can treat a zip file like a file system and walk through its contents.
    String zippath = "/opt/jdk1.8.0/src.zip";
    // The cast is not needed in Java 8, but newer JDKs reject a bare null as ambiguous
    FileSystem zipfs = FileSystems.newFileSystem(Paths.get(zippath), (ClassLoader) null);

    try (Stream<Path> entries = Files.walk(zipfs.getPath("/")))
    {
       ... 
    }
    
    You'll need to change the zippath variable to point to where your src.zip is located. On Windows, it's in c:\Program Files\Java\jdk1.8.0_xx. (Remember to use double backslashes in strings.) On Mac OS X, it's in something like /Library/Java/JavaVirtualMachines/jdk1.8.0_xx.jdk/Contents/Home. On Linux, if you installed the JDK yourself, you know where it is. If you use openjdk-8 on Ubuntu, you need to run sudo apt-get install openjdk-8-src, and look in /usr/lib/jvm/openjdk-8.
  7. Before going on, make sure that everything works. Tag main as throws IOException. Add
    time(() -> entries.count());
    where the ... are. Run the program. You should get about 8220 files.
  8. Now remove .count() and add these stream operations:
    .filter(Files::isRegularFile)
    .map(Exercise4::read)
    .map(p -> Pair.of(p.first(), longestWord(p.second())))
    .sorted(Comparator.comparing(p -> -p.second().length()))
    .limit(10)
    .collect(Collectors.toList())
    
    What do they do? (In plain English or French, what's the high-level idea?)
  9. Run the program. Marvel at the output. Isn't it amazing that someone named a method makeInfoOnlyServantCacheLocalClientRequestDispatcherFactory?
  10. Now make the stream of entries parallel. Does it speed up the program? By how much? How does that compare with the speedup in Exercise 3? Why isn't it as good?
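The pipeline from step 8 can be tested without a zip file by running it over a few strings held in memory. This is a rough sketch under stated assumptions: the Pair class shown here is a guess at the one provided with the lab (only of, first, and second are taken from the code above), and the class name LongestWordSketch, the sampleFiles and topLongest methods, and the sample contents are invented for illustration.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for Exercise4 that needs no file access.
class LongestWordSketch
{
   // Assumed shape of the lab's Pair class: an immutable pair of two values.
   static final class Pair<A, B>
   {
      private final A first;
      private final B second;

      private Pair(A first, B second)
      {
         this.first = first;
         this.second = second;
      }

      public static <A, B> Pair<A, B> of(A first, B second)
      {
         return new Pair<>(first, second);
      }

      public A first() { return first; }
      public B second() { return second; }
      @Override public String toString() { return "(" + first + ", " + second + ")"; }
   }

   // Same method as in step 5: split along non-letters, keep the longest piece.
   public static String longestWord(String contents)
   {
      return Stream.of(contents.split("\\PL+"))
         .max(Comparator.comparing(String::length))
         .orElse("");
   }

   // Invented sample "files": a name mapped to its contents.
   static Map<String, String> sampleFiles()
   {
      Map<String, String> files = new LinkedHashMap<>();
      files.put("A.java", "int veryLongIdentifierName = 1;");
      files.put("B.java", "x = y + zz");
      files.put("C.java", "String shortName;");
      return files;
   }

   // Mirrors step 8: pair each name with its longest word, sort by
   // descending word length, and keep the first n pairs.
   static List<Pair<String, String>> topLongest(
      Map<String, String> files, int n, boolean parallel)
   {
      Stream<Map.Entry<String, String>> entries = files.entrySet().stream();
      if (parallel) entries = entries.parallel();
      return entries
         .map(e -> Pair.of(e.getKey(), longestWord(e.getValue())))
         .sorted(Comparator.comparingInt(
            (Pair<String, String> p) -> p.second().length()).reversed())
         .limit(n)
         .collect(Collectors.toList());
   }

   public static void main(String[] args)
   {
      System.out.println(topLongest(sampleFiles(), 2, false));
      System.out.println(topLongest(sampleFiles(), 2, true));
   }
}
```

Here the sort uses comparingInt with reversed() instead of negating the length. Both give a descending order; the explicit parameter type on the lambda is needed so that the compiler can infer the type before reversed() is applied.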