The Logic of Computer Programming (93) - Functional Data Processing (Part 2)

Date: 2019-11-30


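This series has been supplemented, revised, and published as the book《Java編程的邏輯》(by 馬俊昌), from 機械工業出版社華章分社. It went on sale in January 2018 and has been well received by readers. It is available from major online and physical bookstores, including JD.com.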

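The previous section gave a first introduction to functional data processing in Java 8. For the collect method, we only demonstrated its most basic use. It has many more powerful capabilities; for example, it can group, count, and summarize, which gives you something like group by in SQL.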

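What exactly can it do, what is it useful for, how is it used, and how does it work? This section discusses these questions in detail. Let's first take a closer look at the collect method.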

Understanding collect

In the previous section, the code that filtered for the list of students scoring above 90 looked like this:

List<Student> above90List = students.stream()
        .filter(t->t.getScore()>90)
        .collect(Collectors.toList());

The final collect call looks a bit magical. How does it actually turn a Stream into a List<Student>? Let's look at the definition of collect:

<R, A> R collect(Collector<? super T, A, R> collector)

It takes a collector as its parameter. The type is Collector, an interface whose definition is essentially:

public interface Collector<T, A, R> {
    Supplier<A> supplier();
    BiConsumer<A, T> accumulator();
    BinaryOperator<A> combiner();
    Function<A, R> finisher();
    Set<Characteristics> characteristics();
}

In a sequential stream, collect interacts with these interface methods roughly as follows:

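// First call the factory method supplier to create a container of type A that holds the processing state
A container = collector.supplier().get();

// Then, for each element t in the stream, call the accumulator with the accumulated state container and the current element t
for (T t : data)
   collector.accumulator().accept(container, t);

// Finally call finisher to make any adjustment to the accumulated state (converting A to R) and return the result
return collector.finisher().apply(container);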
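The sequential interaction above can be turned into something runnable. The following is a minimal sketch, not the JDK's actual implementation: the class name and the helper collectSequentially are hypothetical names introduced here for illustration, and only the Collector interface methods come from the standard library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class SequentialCollectSketch {
    // Hypothetical helper (not part of the JDK): drives any Collector
    // over an Iterable, following the three steps described above.
    static <T, A, R> R collectSequentially(Iterable<T> data,
                                           Collector<? super T, A, R> collector) {
        // Step 1: create the mutable container that holds the state
        A container = collector.supplier().get();
        // Step 2: feed every element to the accumulator
        for (T t : data) {
            collector.accumulator().accept(container, t);
        }
        // Step 3: convert the container (type A) to the result (type R)
        return collector.finisher().apply(container);
    }

    static String demo() {
        List<String> words = Arrays.asList("a", "b", "c");
        List<String> copy = collectSequentially(words, Collectors.toList());
        String joined = collectSequentially(words, Collectors.joining("-"));
        return copy + " " + joined;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [a, b, c] a-b-c
    }
}
```

The same helper works with any standard collector, which illustrates that in the sequential case collect only needs the supplier, accumulator, and finisher.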

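combiner is only used in parallel streams, where it merges partial results. characteristics describes the traits of the collector, and callers of the Collector interface can use these traits for certain optimizations. Characteristics is an enum with three values: CONCURRENT, UNORDERED, and IDENTITY_FINISH. We explain their meaning briefly through examples later; for now they can be ignored.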

What exactly is Collectors.toList()? Here is the code:

public static <T>
Collector<T, ?, List<T>> toList() {
    return new CollectorImpl<>((Supplier<List<T>>) ArrayList::new, List::add,
                               (left, right) -> { left.addAll(right); return left; },
                               CH_ID);
}

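The implementation class is CollectorImpl, a private class inside Collectors. It is simple: it mainly defines two constructors that take functional arguments and assign them to internal fields. For toList: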

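The supplier is ArrayList::new, which creates an ArrayList as the container.
The accumulator is List::add, which adds each element encountered to the list.
The third argument is the combiner, which merges two partial results.
The fourth argument, CH_ID, is a static variable containing a single characteristic, IDENTITY_FINISH, which means the finisher has nothing to do and the accumulated container is returned as is.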

In other words, the pseudocode behind collect(Collectors.toList()) is:

List<T> container = new ArrayList<>();
for (T t : data)
   container.add(t);
return container;
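CollectorImpl is private, but a collector with the same three functions can be assembled through the public factory method Collector.of. The sketch below is an illustration under that assumption, not the JDK source; the class name and the method toArrayList are hypothetical. When no characteristics are passed to this overload of Collector.of, the resulting collector is treated as IDENTITY_FINISH.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

class ListCollectorSketch {
    // Hypothetical stand-in for Collectors.toList(), built from the
    // same supplier / accumulator / combiner triple described above.
    static <T> Collector<T, List<T>, List<T>> toArrayList() {
        return Collector.<T, List<T>>of(
                ArrayList::new,   // supplier: creates the container
                List::add,        // accumulator: adds one element
                (left, right) -> { left.addAll(right); return left; }); // combiner
    }

    static List<Integer> demo() {
        return Stream.of(3, 1, 2).collect(toArrayList());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [3, 1, 2]
    }
}
```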

Container collectors similar to toList include toSet, toCollection, and toMap. Let's look at each.

Container collectors

toSet

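toSet is used like toList, except that it removes duplicates, so we omit an example. The container behind toList is an ArrayList, and the container behind toSet is a HashSet. Its code is: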

public static <T>
Collector<T, ?, Set<T>> toSet() {
    return new CollectorImpl<>((Supplier<Set<T>>) HashSet::new, Set::add,
                               (left, right) -> { left.addAll(right); return left; },
                               CH_UNORDERED_ID);
}

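CH_UNORDERED_ID is a static variable with two characteristics. One is IDENTITY_FINISH, meaning the result is the very HashSet created by the supplier. The other is UNORDERED, meaning the collector does not preserve order, which is easy to understand because the underlying container is a HashSet.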

toCollection

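toCollection is a general-purpose container collector that works with any implementation of the Collection interface. It takes a factory method of type Supplier as its parameter. The code is: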

public static <T, C extends Collection<T>>
Collector<T, ?, C> toCollection(Supplier<C> collectionFactory) {
    return new CollectorImpl<>(collectionFactory, Collection<T>::add,
                               (r1, r2) -> { r1.addAll(r2); return r1; },
                               CH_ID);
}

For example, to remove duplicates while preserving the order of first appearance, you can use LinkedHashSet. The Collector can be created like this:

Collectors.toCollection(LinkedHashSet::new)
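To make the effect of the container choice concrete, here is a small sketch that collects the same stream into a LinkedHashSet and into a TreeSet. The class and method names and the sample strings are illustrative assumptions; only toCollection itself comes from the text above.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ToCollectionSketch {
    // Removes duplicates and keeps the order of first appearance
    static Set<String> distinctInEncounterOrder() {
        return Stream.of("b", "a", "b", "c", "a")
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // Removes duplicates and sorts by natural order
    static Set<String> distinctSorted() {
        return Stream.of("b", "a", "b", "c", "a")
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        System.out.println(distinctInEncounterOrder()); // [b, a, c]
        System.out.println(distinctSorted());           // [a, b, c]
    }
}
```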

toMap

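toMap converts a stream of elements into a Map. A Map has keys and values, so toMap needs at least two function parameters: one converts an element to a key, the other converts an element to a value. Its basic definition is: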

public static <T, K, U> Collector<T, ?, Map<K,U>> toMap(
    Function<? super T, ? extends K> keyMapper,
    Function<? super T, ? extends U> valueMapper)

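The result is a Map<K,U>. keyMapper converts an element to a key, and valueMapper converts an element to a value. For example, to convert a stream of students into a Map from student name to score: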

Map<String,Double> nameScoreMap = students.stream().collect(
        Collectors.toMap(Student::getName, Student::getScore));

Here, Student::getName is the keyMapper and Student::getScore is the valueMapper.

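In practice, you often need to convert a list of objects into a Map keyed by primary key so that later lookups by key are fast. For example, assuming the primary key of Student is id, the following converts a stream of students into a Map from student id to Student object: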

Map<String, Student> byIdMap = students.stream().collect(
        Collectors.toMap(Student::getId, t -> t));

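t->t is the valueMapper, meaning the value is the element itself. This function is used so often that the Function interface defines a static method, identity, for it. The code above can therefore be written as: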

Map<String, Student> byIdMap = students.stream().collect(
        Collectors.toMap(Student::getId, Function.identity()));

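The toMap above assumes that element keys are unique. If a key is repeated, it throws an exception (an IllegalStateException). For example: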

Map<String,Integer> strLenMap = Stream.of("abc","hello","abc").collect(
        Collectors.toMap(Function.identity(), t->t.length()));

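We want a Map from each string to its length, but because the stream contains the duplicate string "abc", the program throws an exception. In this situation we would like the program to ignore the later duplicate, and for that we can use another toMap overload: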

public static <T, K, U> Collector<T, ?, Map<K,U>> toMap(
    Function<? super T, ? extends K> keyMapper,
    Function<? super T, ? extends U> valueMapper,
    BinaryOperator<U> mergeFunction)    

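Compared with the previous toMap, it takes an extra parameter, mergeFunction, which resolves conflicts. When a new element is collected and its key already exists, the old value stored under that key and the new element's value are both passed to mergeFunction, and the value it returns is stored under the key.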

In the string-length example, the new and old values are identical, so either one can be used:

Map<String,Integer> strLenMap = Stream.of("abc","hello","abc").collect(
        Collectors.toMap(Function.identity(),
                t->t.length(), (oldValue,value)->value));
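The two behaviors can be compared side by side. The sketch below is illustrative only (class and method names are assumptions): the first method shows the two-argument toMap failing on a repeated key, and the second uses a merge function that returns the old value, which is one way to express "ignore later duplicates".

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DuplicateKeySketch {
    // Two-argument toMap: a repeated key is treated as an error
    static String collectWithoutMerge() {
        try {
            Map<String, Integer> m = Stream.of("abc", "hello", "abc")
                    .collect(Collectors.toMap(Function.identity(), String::length));
            return "collected " + m.size() + " entries";
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        }
    }

    // Three-argument toMap: returning oldValue keeps the first value seen
    static Map<String, Integer> collectKeepingFirst() {
        return Stream.of("abc", "hello", "abc")
                .collect(Collectors.toMap(Function.identity(), String::length,
                        (oldValue, value) -> oldValue));
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutMerge());        // IllegalStateException
        System.out.println(collectKeepingFirst().size()); // 2
    }
}
```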

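Sometimes we want to merge the new and old values. Take a contact list, for instance: for the same contact we want to combine the phone numbers, so mergeFunction can be defined as: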

BinaryOperator<String> mergeFunction = (oldPhone,phone)->oldPhone+","+phone;
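Here is how that merge function might be used end to end. This is a sketch under assumptions: the Contact class, its fields, and the sample data are hypothetical and exist only so the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MergePhonesSketch {
    // Hypothetical contact type, introduced only for this example
    static class Contact {
        final String name;
        final String phone;
        Contact(String name, String phone) { this.name = name; this.phone = phone; }
        String getName() { return name; }
        String getPhone() { return phone; }
    }

    static Map<String, String> phonesByName() {
        List<Contact> contacts = Arrays.asList(
                new Contact("laoma", "111"),
                new Contact("xiaoma", "222"),
                new Contact("laoma", "333"));
        // Phones of the same contact are joined with a comma
        return contacts.stream().collect(Collectors.toMap(
                Contact::getName, Contact::getPhone,
                (oldPhone, phone) -> oldPhone + "," + phone));
    }

    public static void main(String[] args) {
        System.out.println(phonesByName().get("laoma")); // 111,333
    }
}
```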

toMap also has a more general form:

public static <T, K, U, M extends Map<K, U>> Collector<T, ?, M> toMap(
    Function<? super T, ? extends K> keyMapper,
    Function<? super T, ? extends U> valueMapper,
    BinaryOperator<U> mergeFunction,
    Supplier<M> mapSupplier) 

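Compared with the previous toMap, there is one more parameter, mapSupplier, the factory method for the Map. For the two earlier overloads, mapSupplier is in fact HashMap::new. HashMap does not maintain any order; to keep elements in order of appearance, use LinkedHashMap instead, and to have the result sorted by key, use TreeMap.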
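The effect of mapSupplier is easiest to see by collecting the same stream twice. The following sketch assumes illustrative names (MapSupplierSketch, lengths) and sample words; it passes the map factory in as a parameter so that only the factory differs between the two calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MapSupplierSketch {
    // Collects word -> length into whatever Map the supplier creates
    static Map<String, Integer> lengths(Supplier<Map<String, Integer>> mapSupplier) {
        return Stream.of("hello", "abc", "world", "abc")
                .collect(Collectors.toMap(
                        Function.identity(),
                        String::length,
                        (oldValue, value) -> oldValue, // ignore repeated keys
                        mapSupplier));
    }

    public static void main(String[] args) {
        System.out.println(lengths(LinkedHashMap::new)); // {hello=5, abc=3, world=5}
        System.out.println(lengths(TreeMap::new));       // {abc=3, hello=5, world=5}
    }
}
```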

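toMap is mainly for sequential streams. For parallel streams, Collectors has a dedicated collector named toConcurrentMap, which uses ConcurrentHashMap internally. It is used in a similar way, so we do not discuss it further.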

String collectors

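Besides collecting elements into a container, another common operation is collecting them into a string. For example, to get all student names joined by commas, the traditional code looks like this: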

StringBuilder sb = new StringBuilder();
for(Student t : students){
    if(sb.length()>0){
        sb.append(",");
    }
    sb.append(t.getName());
}
return sb.toString();

For this common need, Collectors provides the joining collector:

public static Collector<CharSequence, ?, String> joining()
public static Collector<CharSequence, ?, String> joining(CharSequence delimiter)
public static Collector<CharSequence, ?, String> joining(
    CharSequence delimiter, CharSequence prefix, CharSequence suffix) 

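The first simply concatenates the elements, the second supports a delimiter, and the third is more general and can also add a prefix and a suffix to the whole result string. For example: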

String result = Stream.of("abc","老馬","hello")
        .collect(Collectors.joining(",", "[", "]"));
System.out.println(result);

The output is:

[abc,老馬,hello]

joining also uses StringBuilder internally. For example, the code of the first joining method is:

public static Collector<CharSequence, ?, String> joining() {
    return new CollectorImpl<CharSequence, StringBuilder, String>(
            StringBuilder::new, StringBuilder::append,
            (r1, r2) -> { r1.append(r2); return r1; },
            StringBuilder::toString, CH_NOID);
}

The supplier is StringBuilder::new, the accumulator is StringBuilder::append, the finisher is StringBuilder::toString, and CH_NOID means the set of characteristics is empty.
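For comparison with the StringBuilder loop shown earlier, the student-name case can be written with joining as sketched below. The minimal Student class here has only a name and is an assumption made to keep the example self-contained; since joining accepts CharSequence elements, the students are first mapped to their names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class JoiningSketch {
    // Minimal stand-in for the article's Student class (name only)
    static class Student {
        final String name;
        Student(String name) { this.name = name; }
        String getName() { return name; }
    }

    // Equivalent of the StringBuilder loop: names joined by commas
    static String names(List<Student> students) {
        return students.stream()
                .map(Student::getName)
                .collect(Collectors.joining(","));
    }

    static String demo() {
        return names(Arrays.asList(
                new Student("zhangsan"), new Student("lisi"), new Student("wangwu")));
    }

    // With an empty stream, only the prefix and suffix remain
    static String emptyBracketed() {
        return Stream.<String>empty().collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(demo());           // zhangsan,lisi,wangwu
        System.out.println(emptyBracketed()); // []
    }
}
```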

Grouping

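Grouping is similar to the group by clause in SQL. It assigns each element in the stream to a group, and each group can then be processed and collected further. Grouping is quite powerful, so we explain it step by step.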

To make the examples easier, we first modify the Student class, adding a field grade for the student's school grade and changing the constructor:

public Student(String name, String grade, double score) {
    this.name = name;
    this.grade = grade;
    this.score = score;
}

The sample list students becomes:

static List<Student> students = Arrays.asList(new Student[] {
        new Student("zhangsan", "1", 91d),
        new Student("lisi", "2", 89d),
        new Student("wangwu", "1", 50d),
        new Student("zhaoliu", "2", 78d),
        new Student("sunqi", "1", 59d)});            

Basic usage

The most basic grouping collector is:

public static <T, K> Collector<T, ?, Map<K, List<T>>>
    groupingBy(Function<? super T, ? extends K> classifier)

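The parameter is a classifier of type Function. It converts an element of type T into a value of type K, which is the grouping key. All elements with the same grouping key go into the same group and are placed in one list, so the return type is Map<K, List<T>>. For example, to group the student stream by grade: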

Map<String, List<Student>> groups = students.stream()
        .collect(Collectors.groupingBy(Student::getGrade));

The students are split into two groups. The first has key "1" and contains "zhangsan", "wangwu", and "sunqi"; the second has key "2" and contains "lisi" and "zhaoliu".

This code is roughly equivalent to:

Map<String, List<Student>> groups = new HashMap<>();
for (Student t : students) {
    String key = t.getGrade();
    List<Student> container = groups.get(key);
    if (container == null) {
        container = new ArrayList<>();
        groups.put(key, container);
    }
    container.add(t);
}
System.out.println(groups);

Clearly, groupingBy is much more concise and clear. But how is it implemented?

How it works

The code of groupingBy is:

public static <T, K> Collector<T, ?, Map<K, List<T>>>
groupingBy(Function<? super T, ? extends K> classifier) {
    return groupingBy(classifier, toList());
}

It calls a second groupingBy method, passing the toList collector. That method's code is:

public static <T, K, A, D>
Collector<T, ?, Map<K, D>> groupingBy(Function<? super T, ? extends K> classifier,
                                      Collector<? super T, A, D> downstream) {
    return groupingBy(classifier, HashMap::new, downstream);
}

This method takes a downstream collector as a parameter and passes it on to a more general method:

public static <T, K, D, A, M extends Map<K, D>>
Collector<T, ?, M> groupingBy(Function<? super T, ? extends K> classifier,
                              Supplier<M> mapFactory,
                              Collector<? super T, A, D> downstream)

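classifier is still the classifier, mapFactory is the factory method that returns the Map (HashMap::new by default), and downstream is the downstream collector, which collects the elements within each group.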

For the collector returned by the most general groupingBy method, the basic process of collecting elements is, in pseudocode:

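// First create the Map that holds the results
Map map = mapFactory.get();
for (T t : data) {
    // For each element, determine its group first
    K key = classifier.apply(t);
    // Find the container holding that group's results; if there is none,
    // have the downstream collector create one and put it in the Map
    A container = map.get(key);
    if (container == null) {
        container = downstream.supplier().get();
        map.put(key, container);
    }
    // Hand the element to the downstream collector (the per-group collector)
    downstream.accumulator().accept(container, t);
}
// Call the downstream collector's finisher to convert each group's result
for (Map.Entry entry : map.entrySet()) {
    entry.setValue(downstream.finisher().apply(entry.getValue()));
}
return map;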
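The pseudocode above relies on raw types. A type-safe, runnable variant might look like the sketch below. It is an illustration, not the JDK implementation: groupSequentially is a hypothetical helper, it fixes the map type to LinkedHashMap instead of taking a mapFactory, and it builds a second map for the finished results rather than replacing values in place.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class GroupingSketch {
    // Hypothetical helper mirroring the pseudocode; not the JDK code
    static <T, K, A, D> Map<K, D> groupSequentially(
            Iterable<T> data,
            Function<? super T, ? extends K> classifier,
            Collector<? super T, A, D> downstream) {
        // Intermediate map: group key -> downstream container (type A)
        Map<K, A> containers = new LinkedHashMap<>();
        for (T t : data) {
            K key = classifier.apply(t);
            A container = containers.computeIfAbsent(
                    key, k -> downstream.supplier().get());
            downstream.accumulator().accept(container, t);
        }
        // Apply the downstream finisher per group, converting A to D
        Map<K, D> result = new LinkedHashMap<>();
        for (Map.Entry<K, A> entry : containers.entrySet()) {
            result.put(entry.getKey(),
                    downstream.finisher().apply(entry.getValue()));
        }
        return result;
    }

    static String demo() {
        List<String> words = Arrays.asList("a", "bb", "cc", "d", "eee");
        Map<Integer, Long> countByLength =
                groupSequentially(words, String::length, Collectors.counting());
        Map<Integer, List<String>> wordsByLength =
                groupSequentially(words, String::length, Collectors.toList());
        return countByLength + " " + wordsByLength;
    }

    public static void main(String[] args) {
        System.out.println(demo());
        // {1=2, 2=2, 3=1} {1=[a, d], 2=[bb, cc], 3=[eee]}
    }
}
```

Swapping the downstream collector changes what each group holds without touching the grouping loop, which is the design idea behind the downstream parameter.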

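In the most basic groupingBy method, the downstream collector is toList, but it can be any other collector, even another groupingBy, which gives multi-level grouping. Let's look at more examples.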

Group counting and finding the maximum/minimum element

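Dividing elements into groups by some criterion and then counting each group, or finding each group's maximum or minimum element by some criterion, is a common need. Collectors provides corresponding collectors, which are generally used as downstream collectors, such as: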

// Count the elements
public static <T> Collector<T, ?, Long> counting()
// Find the maximum element
public static <T> Collector<T, ?, Optional<T>> maxBy(Comparator<? super T> comparator)
// Find the minimum element
public static <T> Collector<T, ?, Optional<T>> minBy(Comparator<? super T> comparator)

There is also a more general reduction collector named reducing, which we do not cover. Let's look at some examples.

To make the methods in Collectors easier to use, we import them statically by adding:

import static java.util.stream.Collectors.*;

To count the number of students in each grade:

Map<String, Long> gradeCountMap = students.stream().collect(
        groupingBy(Student::getGrade, counting()));

To count the occurrences of each word in a stream of words, keeping the words in order of first appearance:

Map<String, Long> wordCountMap =
        Stream.of("hello","world","abc","hello").collect(
            groupingBy(Function.identity(), LinkedHashMap::new, counting()));

To get the highest-scoring student in each grade:

Map<String, Optional<Student>> topStudentMap = students.stream().collect(
        groupingBy(Student::getGrade,
                maxBy(Comparator.comparing(Student::getScore))));

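Note that the value collected for each group is an Optional, not a Student. This is because the stream handled by maxBy may be empty. In our example that cannot happen, so to get a Student directly we can use another collector in Collectors, collectingAndThen, and call Optional's get method once the Optional is obtained: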

Map<String, Student> topStudentMap = students.stream().collect(
        groupingBy(Student::getGrade,
                collectingAndThen(
                        maxBy(Comparator.comparing(Student::getScore)),
                        Optional::get)));

We discuss collectingAndThen further below.
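The counting and maxBy examples above can be run together as follows. This is a sketch: the nested Student class and the STUDENTS list repeat the article's sample data so the block is self-contained, and the class name and the helper topName are illustrative assumptions.

```java
import static java.util.stream.Collectors.*;

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class GroupCountSketch {
    // Repeats the article's sample Student class for self-containment
    static class Student {
        final String name, grade;
        final double score;
        Student(String name, String grade, double score) {
            this.name = name; this.grade = grade; this.score = score;
        }
        String getName() { return name; }
        String getGrade() { return grade; }
        double getScore() { return score; }
    }

    static final List<Student> STUDENTS = Arrays.asList(
            new Student("zhangsan", "1", 91d), new Student("lisi", "2", 89d),
            new Student("wangwu", "1", 50d), new Student("zhaoliu", "2", 78d),
            new Student("sunqi", "1", 59d));

    static Map<String, Long> countByGrade() {
        return STUDENTS.stream().collect(
                groupingBy(Student::getGrade, counting()));
    }

    // maxBy yields an Optional; collectingAndThen unwraps it per group
    static Map<String, Student> topByGrade() {
        return STUDENTS.stream().collect(
                groupingBy(Student::getGrade,
                        collectingAndThen(
                                maxBy(Comparator.comparing(Student::getScore)),
                                Optional::get)));
    }

    static String topName(String grade) {
        return topByGrade().get(grade).getName();
    }

    public static void main(String[] args) {
        System.out.println(countByGrade().get("1")); // 3
        System.out.println(topName("1"));            // zhangsan
        System.out.println(topName("2"));            // lisi
    }
}
```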

Grouped numeric statistics

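Besides basic group counting, grouped numeric statistics are also common, such as the sum, average, and highest/lowest of student scores. For the int, long, and double types, Collectors provides dedicated collectors, such as: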

// Average; int and long have similar methods
public static <T> Collector<T, ?, Double>
    averagingDouble(ToDoubleFunction<? super T> mapper)
// Sum; long and double have similar methods
public static <T> Collector<T, ?, Integer>
    summingInt(ToIntFunction<? super T> mapper)
// Several summary statistics at once; int and double have similar methods
// LongSummaryStatistics includes count, max, min, sum, and average
public static <T> Collector<T, ?, LongSummaryStatistics>
    summarizingLong(ToLongFunction<? super T> mapper)

For example, to compute score statistics by grade:

Map<String, DoubleSummaryStatistics> gradeScoreStat =
    students.stream().collect(
            groupingBy(Student::getGrade,
                    summarizingDouble(Student::getScore)));
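Once collected, each DoubleSummaryStatistics can be queried through its getters. The sketch below repeats the article's sample data so it is self-contained; the class name and the describe helper are illustrative assumptions.

```java
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summarizingDouble;

import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;

class GradeStatsSketch {
    // Repeats the article's sample Student class for self-containment
    static class Student {
        final String name, grade;
        final double score;
        Student(String name, String grade, double score) {
            this.name = name; this.grade = grade; this.score = score;
        }
        String getGrade() { return grade; }
        double getScore() { return score; }
    }

    static final List<Student> STUDENTS = Arrays.asList(
            new Student("zhangsan", "1", 91d), new Student("lisi", "2", 89d),
            new Student("wangwu", "1", 50d), new Student("zhaoliu", "2", 78d),
            new Student("sunqi", "1", 59d));

    static Map<String, DoubleSummaryStatistics> statsByGrade() {
        return STUDENTS.stream().collect(
                groupingBy(Student::getGrade,
                        summarizingDouble(Student::getScore)));
    }

    // Reads several statistics of one grade from a single pass
    static String describe(String grade) {
        DoubleSummaryStatistics s = statsByGrade().get(grade);
        return "count=" + s.getCount() + ", max=" + s.getMax()
                + ", min=" + s.getMin() + ", avg=" + s.getAverage();
    }

    public static void main(String[] args) {
        System.out.println(describe("2"));
        // count=2, max=89.0, min=78.0, avg=83.5
    }
}
```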

Mapping within groups

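For the elements in each group, we may be interested not in the element itself but in part of its information. In the Stream API introduced in the previous section, Stream has a map method for transforming elements. Collectors provides a corresponding method, mapping, for the elements within a group: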

public static <T, U, A, R>
Collector<T, ?, R> mapping(Function<? super T, ? extends U> mapper,
    Collector<? super U, A, R> downstream)

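What is handed to the downstream collector is no longer the element itself but the result of applying the mapper function. For example, to group students by grade and get the list of student names for each grade: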

Map<String, List<String>> gradeNameMap =
        students.stream().collect(
                groupingBy(Student::getGrade,
                        mapping(Student::getName, toList())));
System.out.println(gradeNameMap);      

The output is:

{1=[zhangsan, wangwu, sunqi], 2=[lisi, zhaoliu]}

Processing group results (filter/sort/skip/limit)

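For grouped elements we can count, find the maximum/minimum element, compute numeric statistics, and transform (map) before collecting. Can we also sort, filter, and restrict the returned elements (skip/limit), as with the Stream API from the previous section? In Java 8, Collectors has no dedicated collectors for these, but it has a general-purpose method: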

public static<T,A,R,RR> Collector<T,A,RR> collectingAndThen(Collector<T,A,R> downstream, Function<R,RR> finisher)

This method takes a downstream collector and a finisher and returns a collector. Its main code is:

return new CollectorImpl<>(downstream.supplier(),
    downstream.accumulator(),
    downstream.combiner(),
    downstream.finisher().andThen(finisher),
    characteristics);

That is, it applies finisher on top of the downstream collector's result. With this finisher we can implement many things. Here are some examples.

To sort after collecting, we can define the following method:

public static <T> Collector<T, ?, List<T>> collectingAndSort(
        Collector<T, ?, List<T>> downstream,
        Comparator<? super T> comparator) {
    return Collectors.collectingAndThen(downstream, (r) -> {
        r.sort(comparator);
        return r;
    });
}

For example, to group students by grade with each group sorted by score from high to low, using this method:

Map<String, List<Student>> gradeStudentMap =
    students.stream().collect(
            groupingBy(Student::getGrade,
                    collectingAndSort(toList(),
                            Comparator.comparing(Student::getScore).reversed())));

For this requirement, you could also sort the stream first and then group.
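The sort-first alternative could be sketched as below. It relies on a sequential stream feeding elements to the collector in encounter order, so each group's list keeps the sorted order. The sample data repeats the article's list; the class and method names are illustrative assumptions, and names are collected instead of Student objects only to make the output easy to read.

```java
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class SortThenGroupSketch {
    // Repeats the article's sample Student class for self-containment
    static class Student {
        final String name, grade;
        final double score;
        Student(String name, String grade, double score) {
            this.name = name; this.grade = grade; this.score = score;
        }
        String getName() { return name; }
        String getGrade() { return grade; }
        double getScore() { return score; }
    }

    static final List<Student> STUDENTS = Arrays.asList(
            new Student("zhangsan", "1", 91d), new Student("lisi", "2", 89d),
            new Student("wangwu", "1", 50d), new Student("zhaoliu", "2", 78d),
            new Student("sunqi", "1", 59d));

    // Sort the whole stream by score (high to low), then group by grade
    static Map<String, List<String>> namesByGradeHighToLow() {
        return STUDENTS.stream()
                .sorted(Comparator.comparing(Student::getScore).reversed())
                .collect(groupingBy(Student::getGrade,
                        mapping(Student::getName, toList())));
    }

    public static void main(String[] args) {
        Map<String, List<String>> result = namesByGradeHighToLow();
        System.out.println(result.get("1")); // [zhangsan, sunqi, wangwu]
        System.out.println(result.get("2")); // [lisi, zhaoliu]
    }
}
```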

To filter after collecting, we can define the following method:

public static <T> Collector<T, ?, List<T>> collectingAndFilter(
        Collector<T, ?, List<T>> downstream,
        Predicate<T> predicate) {
    return Collectors.collectingAndThen(downstream, (r) -> {
        return r.stream().filter(predicate).collect(Collectors.toList());
    });
}

For example, to group students by grade and keep only the failing students (below 60) in each group, using this method:

Map<String, List<Student>> gradeStudentMap =
    students.stream().collect(
            groupingBy(Student::getGrade,
                    collectingAndFilter(toList(), t->t.getScore()<60)));

For this requirement, you could also filter the stream first and then group.
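The two orders are not fully interchangeable, which the following sketch tries to show. Filtering after grouping keeps a key for every grade, possibly with an empty list, while filtering before grouping drops grades that have no matching student. The sample data and the collectingAndFilter helper repeat the article's definitions; the class name and the two demo methods are illustrative assumptions.

```java
import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toList;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collector;

class FilterOrderSketch {
    // Repeats the article's sample Student class for self-containment
    static class Student {
        final String name, grade;
        final double score;
        Student(String name, String grade, double score) {
            this.name = name; this.grade = grade; this.score = score;
        }
        String getGrade() { return grade; }
        double getScore() { return score; }
    }

    static final List<Student> STUDENTS = Arrays.asList(
            new Student("zhangsan", "1", 91d), new Student("lisi", "2", 89d),
            new Student("wangwu", "1", 50d), new Student("zhaoliu", "2", 78d),
            new Student("sunqi", "1", 59d));

    // Same helper as in the article: filter the collected list
    static <T> Collector<T, ?, List<T>> collectingAndFilter(
            Collector<T, ?, List<T>> downstream, Predicate<T> predicate) {
        return collectingAndThen(downstream,
                r -> r.stream().filter(predicate).collect(toList()));
    }

    // Group first, then filter inside each group: every grade keeps a key
    static Map<String, List<Student>> filterAfterGrouping() {
        Collector<Student, ?, List<Student>> failing =
                collectingAndFilter(toList(), t -> t.getScore() < 60);
        return STUDENTS.stream().collect(groupingBy(Student::getGrade, failing));
    }

    // Filter first, then group: grades without failing students disappear
    static Map<String, List<Student>> filterBeforeGrouping() {
        return STUDENTS.stream()
                .filter(t -> t.getScore() < 60)
                .collect(groupingBy(Student::getGrade));
    }

    public static void main(String[] args) {
        System.out.println(filterAfterGrouping().get("2"));          // []
        System.out.println(filterBeforeGrouping().containsKey("2")); // false
    }
}
```

Which behavior is preferable depends on whether callers expect an entry for every grade.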

To return only a specific range of the collected results, we can define the following method:

public static <T> Collector<T, ?, List<T>> collectingAndSkipLimit(
        Collector<T, ?, List<T>> downstream, long skip, long limit) {
    return Collectors.collectingAndThen(downstream, (r) -> {
        return r.stream().skip(skip).limit(limit).collect(Collectors.toList());
    });
}

For example, to group students by grade and keep only the top two students in each group:

Map<String, List<Student>> gradeStudentMap =
    students.stream()
        .sorted(Comparator.comparing(Student::getScore).reversed())
        .collect(groupingBy(Student::getGrade,
                    collectingAndSkipLimit(toList(), 0, 2)));

This time we sorted the student stream first and then grouped it.

Partitioning

A special case of grouping is partitioning, which splits the stream into two groups by true/false. Collectors has dedicated partitioning methods:

public static <T> Collector<T, ?, Map<Boolean, List<T>>>
    partitioningBy(Predicate<? super T> predicate)
public static <T, D, A> Collector<T, ?, Map<Boolean, D>>
    partitioningBy(Predicate<? super T> predicate,
    Collector<? super T, A, D> downstream)    

The first uses toList() as its downstream collector; the second lets you specify one.

For example, to split students into two groups by whether they passed (a score of at least 60):

Map<Boolean, List<Student>> byPass = students.stream().collect(
    partitioningBy(t->t.getScore()>=60));

To partition by pass/fail and then compute the average score of each partition:

Map<Boolean, Double> avgScoreMap = students.stream().collect(
        partitioningBy(t->t.getScore()>=60,
            averagingDouble(Student::getScore)));    
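One practical difference from groupingBy is worth a quick check: to my knowledge, the map produced by partitioningBy contains both the true and the false key even when one side, or the whole stream, is empty. The sketch below tests that under the article's sample data; the class and method names are illustrative assumptions.

```java
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.partitioningBy;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

class PartitionSketch {
    // Repeats the article's sample Student class for self-containment
    static class Student {
        final String name, grade;
        final double score;
        Student(String name, String grade, double score) {
            this.name = name; this.grade = grade; this.score = score;
        }
        double getScore() { return score; }
    }

    static final List<Student> STUDENTS = Arrays.asList(
            new Student("zhangsan", "1", 91d), new Student("lisi", "2", 89d),
            new Student("wangwu", "1", 50d), new Student("zhaoliu", "2", 78d),
            new Student("sunqi", "1", 59d));

    // Number of students who passed (true) and failed (false)
    static Map<Boolean, Long> countByPass() {
        return STUDENTS.stream().collect(
                partitioningBy(t -> t.getScore() >= 60, counting()));
    }

    // Partitioning an empty stream still yields both keys
    static Map<Boolean, List<Student>> partitionOfEmptyStream() {
        return Stream.<Student>empty().collect(
                partitioningBy(t -> t.getScore() >= 60));
    }

    public static void main(String[] args) {
        System.out.println(countByPass().get(true));   // 3
        System.out.println(countByPass().get(false));  // 2
        System.out.println(partitionOfEmptyStream().get(true)); // []
    }
}
```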

Multi-level grouping

Both groupingBy and partitioningBy accept a downstream collector, and that downstream collector can itself be a grouping or partitioning collector.

For example, to group students by grade and then partition each group by whether they passed:

Map<String, Map<Boolean, List<Student>>> multiGroup =
        students.stream().collect(
                groupingBy(Student::getGrade,
                        partitioningBy(t->t.getScore()>=60)));    
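Nested collectors can also be built up step by step in local variables, which some readers find easier to follow than one deeply nested expression. The sketch below composes three levels (group by grade, partition by pass/fail, map to names) this way. It repeats the article's sample data; the class, method, and variable names are illustrative assumptions.

```java
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.partitioningBy;
import static java.util.stream.Collectors.toList;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;

class MultiLevelSketch {
    // Repeats the article's sample Student class for self-containment
    static class Student {
        final String name, grade;
        final double score;
        Student(String name, String grade, double score) {
            this.name = name; this.grade = grade; this.score = score;
        }
        String getName() { return name; }
        String getGrade() { return grade; }
        double getScore() { return score; }
    }

    static final List<Student> STUDENTS = Arrays.asList(
            new Student("zhangsan", "1", 91d), new Student("lisi", "2", 89d),
            new Student("wangwu", "1", 50d), new Student("zhaoliu", "2", 78d),
            new Student("sunqi", "1", 59d));

    static Map<String, Map<Boolean, List<String>>> namesByGradeAndPass() {
        // Innermost level: keep only the student's name
        Collector<Student, ?, List<String>> names =
                mapping(Student::getName, toList());
        // Middle level: split each grade into passed / failed
        Collector<Student, ?, Map<Boolean, List<String>>> byPass =
                partitioningBy(t -> t.getScore() >= 60, names);
        // Outer level: group by grade
        return STUDENTS.stream().collect(groupingBy(Student::getGrade, byPass));
    }

    public static void main(String[] args) {
        Map<String, Map<Boolean, List<String>>> result = namesByGradeAndPass();
        System.out.println(result.get("1").get(true));  // [zhangsan]
        System.out.println(result.get("1").get(false)); // [wangwu, sunqi]
        System.out.println(result.get("2").get(true));  // [lisi, zhaoliu]
        System.out.println(result.get("2").get(false)); // []
    }
}
```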

Summary

This section covered various collectors, including container collectors, string collectors, and grouping and partitioning collectors.

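Grouping and partitioning collectors accept a downstream collector that further collects the elements within each group or partition. The downstream collector can itself be a grouping or partitioning collector, which builds multi-level grouping. Some collectors are mainly used with grouping, such as counting, maxBy, minBy, and summarizingDouble.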

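mapping and collectingAndThen also accept a downstream collector. mapping transforms elements before handing them to the downstream collector, while collectingAndThen transforms the downstream collector's result. Combining them lets you build more flexible and powerful collectors.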

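This concludes our introduction to the Stream API for functional data processing in Java 8. The Stream API provides the common functions for processing collection data. With them, most common requirements can be implemented concisely, with far less code and better readability.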

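For concurrent programming, Java 8 also provides a new class, CompletableFuture. Similar to the Stream API's pipeline operations on collection data, CompletableFuture lets you perform pipeline-style operations on multiple asynchronous tasks. What exactly is it?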

(As with other chapters, all code for this section is at github.com/swiftma/pro…, in the package shuo.laoma.java8.c93)

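To be continued. For the latest articles, follow the WeChat official account「老馬說編程」. From beginner to advanced, explained in plain terms, Lao Ma explores with you the essence of Java programming and computer technology. Original work, all rights reserved.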
