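Wanted: an archiver that supports Zstandard
Zstandard is a great compressor with excellent performance, but for now it can only compress individual files; it has no archive feature. I had wanted an archiver with Zstandard support for a while without doing anything about it, and the tweet below from @__gfx__ recently rekindled my interest.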
In principle, it seems like the contents of a zip could just as well be compressed with something like zstd.
- FUJI Goro (@__gfx__), July 20, 2017
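If, say, Zstandard were added to the Zip file specification and an archiver implementing that specification were released, the problem would be solved, but there is no telling whether it will ever be added. Rather than wait for that, I realized that I could get roughly what I want by using Zip purely as an archive container, with its own compression turned off, and compressing each file with Zstandard beforehand.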
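The approach described above can be sketched in a few lines of shell. This is only a minimal illustration, not the actual zipstd script: the function name `zstd_then_store` and its arguments are hypothetical, and it assumes the `zstd` command and the Info-ZIP `zip` command are installed.

```shell
#!/bin/sh
# Minimal sketch (not the real zipstd): compress each file with zstd,
# then bundle the resulting .zst files with zip in store mode.
# zstd_then_store is a hypothetical name used only for illustration.
zstd_then_store() {
    src_dir=$1       # directory to archive
    out_zip=$2       # archive to create
    level=${3:-3}    # zstd compression level, default 3

    # Compress every regular file; zstd writes FILE.zst next to FILE
    # and keeps the original. -f overwrites a stale .zst, -q is quiet.
    find "$src_dir" -type f ! -name '*.zst' -exec zstd -q -f "-$level" {} \;

    # -0 stores entries without Deflate; -@ reads the file list from stdin.
    find "$src_dir" -type f -name '*.zst' | zip -q -0 "$out_zip" -@
}
```

Called as `zstd_then_store silesia silesia-zstd-3.zip 3`, this would produce an archive whose entries are all stored uncompressed `.zst` files. The sketch leaves the intermediate `.zst` files beside the originals and has no parallelism or error handling.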
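7-Zip and Zstandard
There was apparently a proposal to have 7-Zip support Zstandard compression, but it is not supported yet, and an unofficial Zstandard-enabled build has appeared. Still, archive files in an unofficially extended format make me nervous, so this time I took a simpler approach.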
Turning it into a shell script
So I wrote a shell script. As I worked on it I kept wanting one more feature after another, and what started as a one-liner ended up fairly large...
If you want to try it, get it from https://github.com/imaya/zipstd!
Performance
I ran a quick benchmark using the Silesia compression corpus, a corpus for evaluating lossless compression.
Compression: Zstandard + Zip
$ gtime -v zipstd -3 -P 1 -o silesia-zstd-3.zip silesia
zipstd: start compression
zipstd: Maximum number of Processes: 1
zipstd: [Directory: silesia]
silesia/dickens : 36.24% (10192446 => 3693846 bytes, silesia/dickens.zst)
silesia/mozilla : 36.27% (51220480 => 18576312 bytes, silesia/mozilla.zst)
silesia/mr : 35.71% (9970564 => 3560660 bytes, silesia/mr.zst)
silesia/nci : 8.56% (33553445 => 2870655 bytes, silesia/nci.zst)
silesia/ooffice : 51.14% (6152192 => 3146461 bytes, silesia/ooffice.zst)
silesia/osdb : 34.86% (10085684 => 3515524 bytes, silesia/osdb.zst)
silesia/reymont : 29.51% (6627202 => 1955962 bytes, silesia/reymont.zst)
silesia/samba : 23.53% (21606400 => 5084702 bytes, silesia/samba.zst)
silesia/sao : 76.62% (7251944 => 5556254 bytes, silesia/sao.zst)
silesia/webster : 29.46% (41458703 => 12215621 bytes, silesia/webster.zst)
silesia/x-ray : 72.77% (8474240 => 6166670 bytes, silesia/x-ray.zst)
silesia/xml : 11.96% (5345280 => 639077 bytes, silesia/xml.zst)
adding: silesia/dickens.zst (stored 0%)
adding: silesia/mozilla.zst (stored 0%)
adding: silesia/mr.zst (stored 0%)
adding: silesia/nci.zst (stored 0%)
adding: silesia/ooffice.zst (stored 0%)
adding: silesia/osdb.zst (stored 0%)
adding: silesia/reymont.zst (stored 0%)
adding: silesia/samba.zst (stored 0%)
adding: silesia/sao.zst (stored 0%)
adding: silesia/webster.zst (stored 0%)
adding: silesia/x-ray.zst (stored 0%)
adding: silesia/xml.zst (stored 0%)
Command being timed: "zipstd -3 -P 1 -o silesia-zstd-3.zip silesia"
User time (seconds): 2.44
System time (seconds): 0.70
Percent of CPU this job got: 91%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 13221888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 20196
Voluntary context switches: 285
Involuntary context switches: 3579
Swaps: 0
File system inputs: 15
File system outputs: 12
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 2
Page size (bytes): 4096
Exit status: 0
Compression: Zip (Deflate)
$ gtime -v zip -r -o silesia-zip-deflate.zip silesia
adding: silesia/ (stored 0%)
adding: silesia/dickens (deflated 62%)
adding: silesia/mozilla (deflated 63%)
adding: silesia/mr (deflated 63%)
adding: silesia/nci (deflated 90%)
adding: silesia/ooffice (deflated 50%)
adding: silesia/osdb (deflated 63%)
adding: silesia/reymont (deflated 72%)
adding: silesia/samba (deflated 75%)
adding: silesia/sao (deflated 26%)
adding: silesia/webster (deflated 71%)
adding: silesia/x-ray (deflated 29%)
adding: silesia/xml (deflated 87%)
Command being timed: "zip -r -o silesia-zip-deflate.zip silesia"
User time (seconds): 10.94
System time (seconds): 0.14
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4292608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 366
Voluntary context switches: 52
Involuntary context switches: 2871
Swaps: 0
File system inputs: 4
File system outputs: 6
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Size
$ ls -al
-rw-r--r-- 1 imaya staff 68229691 3 20 2003 silesia-zip-deflate.zip
-rw-r--r-- 1 imaya staff 66983710 7 22 22:38 silesia-zstd-3.zip
Decompression: Zstandard + Zip
$ gtime -v unzipstd -P 1 -o decompress-zipstd silesia-zstd-3.zip
unzipstd: start decompression
unzipstd: Maximum number of Processes: 1
[File: silesia-zstd-3.zip]
Archive: silesia-zstd-3.zip
extracting: decompress-zipstd/silesia/dickens.zst
extracting: decompress-zipstd/silesia/mozilla.zst
extracting: decompress-zipstd/silesia/mr.zst
extracting: decompress-zipstd/silesia/nci.zst
extracting: decompress-zipstd/silesia/ooffice.zst
extracting: decompress-zipstd/silesia/osdb.zst
extracting: decompress-zipstd/silesia/reymont.zst
extracting: decompress-zipstd/silesia/samba.zst
extracting: decompress-zipstd/silesia/sao.zst
extracting: decompress-zipstd/silesia/webster.zst
extracting: decompress-zipstd/silesia/x-ray.zst
extracting: decompress-zipstd/silesia/xml.zst
decompress-zipstd/silesia/dickens.zst: 10192446 bytes
decompress-zipstd/silesia/mozilla.zst: 51220480 bytes
decompress-zipstd/silesia/mr.zst: 9970564 bytes
decompress-zipstd/silesia/nci.zst: 33553445 bytes
decompress-zipstd/silesia/ooffice.zst: 6152192 bytes
decompress-zipstd/silesia/osdb.zst: 10085684 bytes
decompress-zipstd/silesia/reymont.zst: 6627202 bytes
decompress-zipstd/silesia/samba.zst: 21606400 bytes
decompress-zipstd/silesia/sao.zst: 7251944 bytes
decompress-zipstd/silesia/webster.zst: 41458703 bytes
decompress-zipstd/silesia/x-ray.zst: 8474240 bytes
decompress-zipstd/silesia/xml.zst: 5345280 bytes
Command being timed: "unzipstd -P 1 -o decompress-zipstd silesia-zstd-3.zip"
User time (seconds): 1.01
System time (seconds): 0.38
Percent of CPU this job got: 88%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.58
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9043968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 39
Minor (reclaiming a frame) page faults: 8318
Voluntary context switches: 152
Involuntary context switches: 2222
Swaps: 0
File system inputs: 38
File system outputs: 20
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 1
Page size (bytes): 4096
Exit status: 0
Decompression: Zip (Deflate)
$ gtime -v unzip silesia-zip-deflate.zip -d decompress-zip
Archive: silesia-zip-deflate.zip
creating: decompress-zip/silesia/
inflating: decompress-zip/silesia/dickens
inflating: decompress-zip/silesia/mozilla
inflating: decompress-zip/silesia/mr
inflating: decompress-zip/silesia/nci
inflating: decompress-zip/silesia/ooffice
inflating: decompress-zip/silesia/osdb
inflating: decompress-zip/silesia/reymont
inflating: decompress-zip/silesia/samba
inflating: decompress-zip/silesia/sao
inflating: decompress-zip/silesia/webster
inflating: decompress-zip/silesia/x-ray
inflating: decompress-zip/silesia/xml
Command being timed: "unzip silesia-zip-deflate.zip -d decompress-zip"
User time (seconds): 1.63
System time (seconds): 0.14
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3358720
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 309
Voluntary context switches: 96
Involuntary context switches: 975
Swaps: 0
File system inputs: 33
File system outputs: 9
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
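Performance summary
Compression is about three times faster than zip with its default settings (3.43 s versus 11.17 s elapsed), and the resulting archive is slightly smaller (66,983,710 versus 68,229,691 bytes). Decompression is fast either way, with zipstd slightly ahead (1.58 s versus 1.88 s elapsed).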
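For anyone who wants to check the comparison, the ratios can be recomputed from the logs above with a little awk. This is a rough sketch: the helper names `ratio` and `speedup` are made up for illustration, the byte counts and elapsed times are copied from the `ls` and `gtime` output in this post, and the original size is the sum of the twelve input sizes printed in the zipstd log.

```shell
#!/bin/sh
# Numbers copied from the gtime and ls output in this post.
original_bytes=211938580   # sum of the twelve input sizes in the zipstd log
zipstd_bytes=66983710      # silesia-zstd-3.zip
deflate_bytes=68229691     # silesia-zip-deflate.zip

# ratio PART WHOLE: print PART as a percentage of WHOLE, two decimals.
ratio() {
    awk -v p="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 100 * p / w }'
}

# speedup SLOW FAST: print SLOW divided by FAST, two decimals.
speedup() {
    awk -v s="$1" -v f="$2" 'BEGIN { printf "%.2f\n", s / f }'
}

echo "zipstd archive:        $(ratio "$zipstd_bytes" "$original_bytes")% of original"
echo "deflate archive:       $(ratio "$deflate_bytes" "$original_bytes")% of original"
echo "compression speedup:   $(speedup 11.17 3.43)x"
echo "decompression speedup: $(speedup 1.88 1.58)x"
```

These are single runs with one process (`-P 1`), so the timings are only a rough indication.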