Clojure Data Structures and Algorithms, Part 1: Implementing the LZ77 Algorithm

Date: 2019-11-20


The LZ77 Compression Algorithm

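The articles in this series are all based on the book [Clojure Data Structures and Algorithms Cookbook]. I am simply working through the code in the book and taking notes. If you have questions, please read the book first, and then we can discuss them together.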

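The LZ77 algorithm: while walking through a sequence of elements, look among the upcoming (not yet processed) elements for a run that also occurs among the past (already processed) elements, then replace that run with a pair of values, distance and length. Distance is how far back from the current element you have to go to reach the start of the earlier occurrence. Length is the number of elements in the repeated run.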
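To make distance and length concrete, here is a minimal sketch, not taken from the book, that resolves one pair back into the run it stands for. The helper name `referenced-run` and its `cursor` argument are assumptions made for this illustration.

```clojure
;; Hypothetical helper (not from the book): given the full input vector v and
;; the index `cursor` of the current element, return the earlier run that a
;; (distance, length) pair refers to.
(defn referenced-run
  [v cursor distance length]
  (let [start (- cursor distance)] ; walk `distance` elements back from the cursor
    (subvec v start (+ start length))))

(def sample ["a" "b" "c" "f" "a" "d" "a" "b" "c" "e"])

;; The run "a" "b" "c" at index 6 repeats the run that starts 6 elements
;; earlier, so it can be described by distance 6 and length 3.
(= (referenced-run sample 6 6 3)
   (subvec sample 6 9))
;; => true
```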
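Concepts:
Input stream: the stream of bytes to be compressed
Byte: the basic data element of the input stream
Coding position: the next element to be encoded, i.e. the current position
Look-ahead buffer: all elements from the current position to the end of the input stream (when actually searching, though, only a window's length of bytes is searched)
Window: the sequence of preceding elements used to encode the elements that follow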
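
As a rough illustration of how the window and the look-ahead buffer relate to the coding position, the sketch below splits an input vector at a given index. The function name `split-at-cursor` and the map keys are assumptions made for this example, and the window here holds exactly `window-size` elements, which may differ from how a particular implementation sizes it.

```clojure
;; Hypothetical helper: split `input` (a vector) at the coding position.
(defn split-at-cursor
  [input cursor window-size]
  {:window     (subvec input (max 0 (- cursor window-size)) cursor) ; up to window-size elements before the cursor
   :look-ahead (subvec input cursor)})                              ; everything from the cursor to the end

(split-at-cursor ["a" "b" "c" "f" "a" "d" "a" "b" "c" "e"] 6 4)
;; => {:window ["c" "f" "a" "d"], :look-ahead ["a" "b" "c" "e"]}
```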

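The LZ77 compression process:

1. At any moment the program is processing one particular element, which sits at the current position. Picture a window of size n: the n elements before the current position are in the window, and the elements from the current position to the end of the sequence are what the program still has to process.
2. Start from the first element of the byte sequence and keep going until the window is full. The window has to be filled first, so the first window-length bytes of the sequence are not encoded, because there is nothing earlier to refer back to.
3. Move on to the next element, sliding the window forward with it.
4. Search the window for the longest byte sequence that matches the input starting at the current position.
5. If a match is found, replace it with the distance-length pair described earlier, move forward by the length of the match, and repeat step 4.
6. If no match is found, go back to step 3.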

Here is the code:

(ns clojure-data-structures-and-argorithms.core
  (:require [clojure.set :as cset]))

;; First, the compression process

;;define testcase
(def testcase ["a" "b" "c" "f" "a" "d" "a" "b" "c" "e"])


(defn all-subvecs-from-beginning
  ;; Returns the set of all prefix vectors of v (those starting at index 0).
  ;; e.g. the prefixes of ["a" "b" "c"] are ["a"], ["a" "b"] and ["a" "b" "c"]
  [v]
  (set (map #(subvec v 0 %)
            (range 1 (inc (count v))))))
;;test
(all-subvecs-from-beginning testcase)
(all-subvecs-from-beginning ["a" "b" "c"])


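(defn all-subvecs
  ;; Returns every prefix of every suffix of v, i.e. the set of all distinct
  ;; contiguous subvectors of v. Used as the search space when matching.
  [v]
  (loop [result #{}
         remaining v]
    (if (seq remaining)
      (recur (into result (all-subvecs-from-beginning remaining)) ; add all prefixes of the current vector
             (into [] (rest remaining)))                          ; drop the first element and repeat until nothing is left
      result)))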

;; Find the longest prefix of look-ahead-buffer that also occurs in search-buffer
(defn longest-match-w-beginning
  ;; search-buffer: the dictionary used for encoding
  ;; look-ahead-buffer: the sequence being searched for, i.e. what still has to be encoded
  ;; Returns nil when there is no match.
  [search-buffer look-ahead-buffer]
  (let [;; every contiguous subvector of the search buffer
        all-left-chunks (all-subvecs search-buffer)
        ;; every prefix of the look-ahead buffer: encoding always starts at the
        ;; current position, so a match must begin with the current element
        all-right-chunks-from-beginning (all-subvecs-from-beginning look-ahead-buffer)
        ;; candidates are the runs present in both sets
        all-matches (cset/intersection all-right-chunks-from-beginning
                                       all-left-chunks)]
    ;; the longest candidate is the match we want
    (->> all-matches
         (sort-by count >)
         first)))

;;test
(longest-match-w-beginning ["a" "b" "c" "d"]
                           ["b" "c" "d" "a"])

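(defn pos-of-subvec
  ;; Returns the index of the first occurrence of sv in v, or nil if sv is
  ;; empty or does not occur in v.
  [sv v]
  {:pre [(<= (count sv)
             (count v))]}
  (loop [cursor 0] ; start scanning at the beginning of v
    (if (or (empty? sv)
            ;; stop once sv no longer fits in the rest of v; this also covers
            ;; an empty v and keeps subvec from running past the end
            (> (+ cursor (count sv)) (count v)))
      nil
      (if (= (subvec v cursor (+ (count sv) cursor))
             sv) ; compare sv with the same-length slice of v at cursor
        cursor
        (recur (inc cursor))))))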

;;test
(pos-of-subvec ["b" "c"] ["a" "c" "b" "c" "d"])

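(defn LZ77-STEP
  ;; Encodes one step: finds the longest match for the start of look-ahead in
  ;; window and returns a map with :distance, :length and the literal :char
  ;; that follows the match. With no match, :distance and :length are 0.
  [window look-ahead]
  (let [longest (longest-match-w-beginning window look-ahead)]
    (if-let [pos-subv-w (pos-of-subvec longest window)]
      (let [;; distance from the right edge of the window back to the start of the match
            distance (- (count window) pos-subv-w)
            pos-subv-l (pos-of-subvec longest look-ahead)
            ;; first element after the matched run in look-ahead (nil if the
            ;; match reaches the end of the input); it is emitted as a literal
            the-char (first (subvec look-ahead
                                    (+ pos-subv-l
                                       (count longest))))]
        {:distance distance
         :length (count longest)
         :char the-char})
      {:distance 0
       :length 0
       :char (first look-ahead)})))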

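(defn LZ77
  ;; Compresses bytes-array (a vector) into a vector of literals and
  ;; [distance length] pairs.
  [bytes-array window-size]
  (->> (loop [result []
              cursor 0
              window []
              look-ahead bytes-array]
         (if (empty? look-ahead)
           result
           (let [this-step-output (LZ77-STEP window look-ahead)
                 distance (:distance this-step-output)
                 length (:length this-step-output)
                 literal (:char this-step-output)
                 ;; skip the matched run plus the literal that follows it
                 raw-new-cursor (+ cursor
                                   length
                                   1)
                 new-cursor (min raw-new-cursor
                                 (count bytes-array))
                 ;; because of the inc, the window holds at most
                 ;; (dec window-size) elements
                 new-window (subvec bytes-array
                                    (max 0 (inc (- new-cursor
                                                   window-size)))
                                    new-cursor)
                 new-look-ahead (subvec bytes-array
                                        new-cursor)]
             (recur (conj result
                          [distance length]
                          literal)
                    new-cursor
                    new-window
                    new-look-ahead))))
       ;; drop the [0 0] pairs emitted for steps without a match
       (filter (partial
                not=
                [0 0]))
       ;; drop the nil literal emitted when a match reaches the end of the input
       (filter (comp
                not
                nil?))
       (into [])))

;; test
(LZ77 testcase 5)
;; => ["a" "b" "c" "f" [4 1] "d" [2 1] "b" "c" "e"]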
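
To check that a token stream of this shape carries enough information to recover the input, a decoder can be sketched as follows. This is not the book's decompressor: the name `lz77-decode-sketch` is hypothetical, and the sketch assumes literals are never vectors, so any vector token is read as a [distance length] pair.

```clojure
;; Hypothetical decoder sketch for a vector of literals and [distance length] pairs.
(defn lz77-decode-sketch
  [tokens]
  (reduce (fn [out token]
            (if (vector? token)
              ;; a pair: copy `length` elements starting `distance` back from the end
              (let [[distance length] token
                    start (- (count out) distance)]
                ;; copy one element at a time so that a run overlapping the
                ;; end of the output would still be reproduced correctly
                (reduce (fn [acc i] (conj acc (nth acc (+ start i))))
                        out
                        (range length)))
              ;; a literal: append it unchanged
              (conj out token)))
          []
          tokens))

(lz77-decode-sketch ["a" "b" "c" "f" [4 1] "d" [2 1] "b" "c" "e"])
;; => ["a" "b" "c" "f" "a" "d" "a" "b" "c" "e"]
```

The token vector in the example is the one that `(LZ77 testcase 5)` is expected to produce with the code in this article, so decoding it gives back `testcase`.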