Wanted: an archiver that supports Zstandard

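Zstandard is a wonderful piece of software with excellent performance, but for now it can only compress individual files. (In other words, it has no archiving feature.) I had been wishing for an archiver that supports Zstandard without actually doing anything about it, but a recent tweet by @__gfx__ got me interested again.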

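The problem would be solved if, say, Zstandard were added to the Zip file specification and an archiver supporting that specification were released, but there is no telling whether it will ever be added to the specification at all. Rather than wait for that to happen, I realized I could get roughly what I wanted by using Zip purely as an archive container with no compression of its own, and compressing each file with Zstandard beforehand.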
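As a rough sketch of that idea (this is not the actual zipstd script, and the function names `pack_zst_zip` and `unpack_zst_zip` are made up for illustration), the two steps could look something like the following, assuming the `zstd`, `zip` and `unzip` commands are installed:

```shell
#!/bin/sh
# Hypothetical helpers illustrating "Zstandard first, then Zip in store mode".
# Assumption: DIR contains no pre-existing .zst files and no file names
# with newlines in them.

# pack_zst_zip OUT.zip DIR
pack_zst_zip() {
    out=$1
    dir=$2
    # Compress every regular file; zstd writes FILE.zst next to FILE
    # and keeps the original. -3: level, -q: quiet, -f: overwrite.
    find "$dir" -type f ! -name '*.zst' -exec zstd -3 -q -f {} + || return 1
    # -0: store only (no Deflate), -@: read the file list from stdin.
    find "$dir" -type f -name '*.zst' | zip -0 -q -@ "$out" || return 1
    # Remove the intermediate .zst files.
    find "$dir" -type f -name '*.zst' -exec rm -f {} +
}

# unpack_zst_zip ARCHIVE.zip DEST
unpack_zst_zip() {
    archive=$1
    dest=$2
    unzip -q "$archive" -d "$dest" || return 1
    # -d: decompress, --rm: delete each .zst after it has been decompressed.
    find "$dest" -type f -name '*.zst' -exec zstd -d -q -f --rm {} +
}
```

With these, something like `pack_zst_zip silesia-zstd-3.zip silesia` followed by `unpack_zst_zip silesia-zstd-3.zip decompress-zipstd` would make a round trip. Because the Zip layer only stores the files, any ordinary unzip tool can still list and extract the `.zst` entries.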

7-Zip and Zstandard

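There was apparently a proposal to have 7-Zip support Zstandard compression, but it is still not supported, and an unofficial Zstandard-enabled build has appeared in the meantime. Still, archive files in a unilaterally extended format make me nervous, so this time I went with a simpler approach.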

Turning it into a shell script

So I wrote a shell script. While writing it I kept wanting one more feature after another, and what started as a one-liner ended up a fair size...

If you would like to try it, get it from https://github.com/imaya/zipstd !

Performance

I ran a quick performance measurement using the Silesia compression corpus, a corpus for lossless compression.

Compression: Zstandard + Zip

$ gtime -v zipstd -3 -P 1 -o silesia-zstd-3.zip silesia
zipstd: start compression
zipstd: Maximum number of Processes: 1
zipstd: [Directory: silesia]
silesia/dickens      : 36.24%   (10192446 => 3693846 bytes, silesia/dickens.zst) 
silesia/mozilla      : 36.27%   (51220480 => 18576312 bytes, silesia/mozilla.zst) 
silesia/mr           : 35.71%   (9970564 => 3560660 bytes, silesia/mr.zst)     
silesia/nci          :  8.56%   (33553445 => 2870655 bytes, silesia/nci.zst)   
silesia/ooffice      : 51.14%   (6152192 => 3146461 bytes, silesia/ooffice.zst) 
silesia/osdb         : 34.86%   (10085684 => 3515524 bytes, silesia/osdb.zst)  
silesia/reymont      : 29.51%   (6627202 => 1955962 bytes, silesia/reymont.zst) 
silesia/samba        : 23.53%   (21606400 => 5084702 bytes, silesia/samba.zst) 
silesia/sao          : 76.62%   (7251944 => 5556254 bytes, silesia/sao.zst)    
silesia/webster      : 29.46%   (41458703 => 12215621 bytes, silesia/webster.zst) 
silesia/x-ray        : 72.77%   (8474240 => 6166670 bytes, silesia/x-ray.zst)  
silesia/xml          : 11.96%   (5345280 => 639077 bytes, silesia/xml.zst)     
  adding: silesia/dickens.zst (stored 0%)
  adding: silesia/mozilla.zst (stored 0%)
  adding: silesia/mr.zst (stored 0%)
  adding: silesia/nci.zst (stored 0%)
  adding: silesia/ooffice.zst (stored 0%)
  adding: silesia/osdb.zst (stored 0%)
  adding: silesia/reymont.zst (stored 0%)
  adding: silesia/samba.zst (stored 0%)
  adding: silesia/sao.zst (stored 0%)
  adding: silesia/webster.zst (stored 0%)
  adding: silesia/x-ray.zst (stored 0%)
  adding: silesia/xml.zst (stored 0%)

	Command being timed: "zipstd -3 -P 1 -o silesia-zstd-3.zip silesia"
	User time (seconds): 2.44
	System time (seconds): 0.70
	Percent of CPU this job got: 91%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.43
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 13221888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 20196
	Voluntary context switches: 285
	Involuntary context switches: 3579
	Swaps: 0
	File system inputs: 15
	File system outputs: 12
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 2
	Page size (bytes): 4096
	Exit status: 0

Compression: Zip (Deflate)

$ gtime -v zip -r -o silesia-zip-deflate.zip silesia
  adding: silesia/ (stored 0%)
  adding: silesia/dickens (deflated 62%)
  adding: silesia/mozilla (deflated 63%)
  adding: silesia/mr (deflated 63%)
  adding: silesia/nci (deflated 90%)
  adding: silesia/ooffice (deflated 50%)
  adding: silesia/osdb (deflated 63%)
  adding: silesia/reymont (deflated 72%)
  adding: silesia/samba (deflated 75%)
  adding: silesia/sao (deflated 26%)
  adding: silesia/webster (deflated 71%)
  adding: silesia/x-ray (deflated 29%)
  adding: silesia/xml (deflated 87%)
	Command being timed: "zip -r -o silesia-zip-deflate.zip silesia"
	User time (seconds): 10.94
	System time (seconds): 0.14
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.17
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4292608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 366
	Voluntary context switches: 52
	Involuntary context switches: 2871
	Swaps: 0
	File system inputs: 4
	File system outputs: 6
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Size

$ ls -al
-rw-r--r--    1 imaya  staff  68229691  3 20  2003 silesia-zip-deflate.zip
-rw-r--r--    1 imaya  staff  66983710  7 22 22:38 silesia-zstd-3.zip

Decompression: Zstandard + Zip

$ gtime -v unzipstd -P 1 -o decompress-zipstd silesia-zstd-3.zip
unzipstd: start decompression
unzipstd: Maximum number of Processes: 1
[File: silesia-zstd-3.zip]
Archive:  silesia-zstd-3.zip
 extracting: decompress-zipstd/silesia/dickens.zst  
 extracting: decompress-zipstd/silesia/mozilla.zst  
 extracting: decompress-zipstd/silesia/mr.zst  
 extracting: decompress-zipstd/silesia/nci.zst  
 extracting: decompress-zipstd/silesia/ooffice.zst  
 extracting: decompress-zipstd/silesia/osdb.zst  
 extracting: decompress-zipstd/silesia/reymont.zst  
 extracting: decompress-zipstd/silesia/samba.zst  
 extracting: decompress-zipstd/silesia/sao.zst  
 extracting: decompress-zipstd/silesia/webster.zst  
 extracting: decompress-zipstd/silesia/x-ray.zst  
 extracting: decompress-zipstd/silesia/xml.zst  
decompress-zipstd/silesia/dickens.zst: 10192446 bytes                          
decompress-zipstd/silesia/mozilla.zst: 51220480 bytes                          
decompress-zipstd/silesia/mr.zst: 9970564 bytes                                
decompress-zipstd/silesia/nci.zst: 33553445 bytes                              
decompress-zipstd/silesia/ooffice.zst: 6152192 bytes                           
decompress-zipstd/silesia/osdb.zst: 10085684 bytes                             
decompress-zipstd/silesia/reymont.zst: 6627202 bytes                           
decompress-zipstd/silesia/samba.zst: 21606400 bytes                            
decompress-zipstd/silesia/sao.zst: 7251944 bytes                               
decompress-zipstd/silesia/webster.zst: 41458703 bytes                          
decompress-zipstd/silesia/x-ray.zst: 8474240 bytes                             
decompress-zipstd/silesia/xml.zst: 5345280 bytes                               

	Command being timed: "unzipstd -P 1 -o decompress-zipstd silesia-zstd-3.zip"
	User time (seconds): 1.01
	System time (seconds): 0.38
	Percent of CPU this job got: 88%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.58
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 9043968
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 39
	Minor (reclaiming a frame) page faults: 8318
	Voluntary context switches: 152
	Involuntary context switches: 2222
	Swaps: 0
	File system inputs: 38
	File system outputs: 20
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 1
	Page size (bytes): 4096
	Exit status: 0

Decompression: Zip (Deflate)

$ gtime -v unzip silesia-zip-deflate.zip -d decompress-zip
Archive:  silesia-zip-deflate.zip
   creating: decompress-zip/silesia/
  inflating: decompress-zip/silesia/dickens  
  inflating: decompress-zip/silesia/mozilla  
  inflating: decompress-zip/silesia/mr  
  inflating: decompress-zip/silesia/nci  
  inflating: decompress-zip/silesia/ooffice  
  inflating: decompress-zip/silesia/osdb  
  inflating: decompress-zip/silesia/reymont  
  inflating: decompress-zip/silesia/samba  
  inflating: decompress-zip/silesia/sao  
  inflating: decompress-zip/silesia/webster  
  inflating: decompress-zip/silesia/x-ray  
  inflating: decompress-zip/silesia/xml  
	Command being timed: "unzip silesia-zip-deflate.zip -d decompress-zip"
	User time (seconds): 1.63
	System time (seconds): 0.14
	Percent of CPU this job got: 94%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3358720
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 309
	Voluntary context switches: 96
	Involuntary context switches: 975
	Swaps: 0
	File system inputs: 33
	File system outputs: 9
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Performance summary

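Compression is fast (3.43 s versus 11.17 s wall-clock), and the result is smaller than zip at its default settings (66,983,710 versus 68,229,691 bytes). Decompression is fast either way, but it comes out slightly faster than zip (1.58 s versus 1.88 s).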
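To put rough numbers on that, the ratios can be worked out from the wall-clock times and archive sizes in the logs above. The helper names `ratio` and `saved_percent` below are just for illustration, and since these are single runs on one machine the results should be read as approximate:

```shell
#!/bin/sh
# ratio A B: print A / B with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# saved_percent SMALL LARGE: how much smaller SMALL is than LARGE, in percent.
saved_percent() {
    awk -v s="$1" -v l="$2" 'BEGIN { printf "%.2f\n", (1 - s / l) * 100 }'
}

# Compression wall-clock time: zip (Deflate) vs. Zstandard + Zip.
ratio 11.17 3.43                   # prints 3.26
# Decompression wall-clock time: unzip vs. unzipstd.
ratio 1.88 1.58                    # prints 1.19
# Archive size: Zstandard + Zip vs. zip (Deflate).
saved_percent 66983710 68229691    # prints 1.83
```

So in this run, compression was a bit more than 3x faster, decompression about 1.2x faster, and the archive about 1.8% smaller.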