[CMU 15-445 Intro to DB Sys] Lec05 Storage Models and DB Compression

소위 'row store'로서, 한 tuple에 대한 모든 attribute를 같은 페이지 상에 저장함. OLTP에 유리함(한 번의 fetch만으로 single tuple의 모든 attribute를 가져올 수 있음).
장점: update, insert, delete가 빠르고 OLTP에 좋음.
단점: table의 많은 부분을 scan하거나 attribute의 일부만을 보고자 하는 경우에 비효율적임(불필요한 attribute들도 다같이 가져와야 하기 때문). terrible memory locality. 여러 다양한 값들이 있기 때문에 압축이 어려움.

소위 'column store'로서, single attribute for all tuples가 연속적으로 저장됨.
tuple들을 식별하는 방법은 크게 다음의 두 가지가 있음.
- 1) fixed-length offsets: 각 값들은 일정한 길이의 offset 위치에 저장됨. 따라서 같은 tuple에 속하는 서로 다른 column의 값들은 같은 offset을 가짐. 단, 이를 위해서는 fixed length value여야 함.
- 2) embedded tuple ids: 각 값 내에 tuple id를 함께 저장함. 이에 따라 storage overhead 발생
장점: reduce the amount of I/O wasted(필요한 data만을 읽어오기 때문), better query processing because of increased locality and cached data reuse, better data compression
단점: insert, update, delete 등을 위해서는 tuple splitting/stitching이 필요해서 느려진다
사실 실제로 query들은 single column만을 사용하는 것이 아니라, 결국 여러 column들을 보면서 원래의 tuple을 만들어야 하는 경우가 많다. 그렇다면 어떻게 하여 column store의 장점을 살릴 수 있을까? -> PAX

기본구조: horizontally partition rows into groups; then vertically partition their attributes into columns(즉, row group으로 끊어준 뒤에, 각 그룹 내에서는 마치 column store처럼 column단위로 끊겨 있음)
이렇게 하면 row store에서 하나의 tuple을 구성하는 attribute들 간의 spatial locality를 살리는 동시에, columnar storage의 빠른 processing을 누릴 수 있다.
(내 생각) 반대로 끊어주는 것은 왜 안 먹힐까? 그러면 row store의 locality가 구현되지 않는다. 즉 이는 columnar storage의 특성을 약간 약화시키면서 row store의 장점을 가져온 것이라고 보아야 할 것임.

I/O가 항상 병목이 되곤 하는데, DBMS는 데이터를 압축함으로써 DRAM requirements를 줄일 수 있다. 하지만 speed와 compression ratio 간의 tradeoff가 나타난다.
목표: 1) fixed-length values를 만들어야 한다 2) 최대한 늦게 decompression해야 한다(late materialization) 3) lossless해야 한다(lossy compression을 하면 데이터를 잃어버릴 것)
Granularity: 1) block level(block of tuples for the same table) 2) tuple level(NSM only) 3) attribute level(single attribute value within one tuple) 4) columnar level(multiple values for one or more attributes stored for multiple tuples; DSM only)

general-purpose third-party library를 이용하여 압축함. MySQL InnoDB에서는 page 전체를 압축해서 저장해 두다가, DBMS가 read/modify할 때는 이를 decompress해야 함. 그렇기 때문에 모든 테이블을 하나의 block으로 압축하기 보다는 좀 더 작은 단위로 나누어서 압축하게 됨(limit the scope of the compression scheme).
그리고 이러한 naive scheme은 데이터의 의미를 따지지 않기 때문에, late materialization이 불가능함.

Run-Length Encoding(RLE): 같은 값이 연속적으로 나올 경우, 이를 triplet의 형태로 바꿔줌(값, offset, length). 같은 값들을 cluster하여 묶어줄 경우 compression ratio를 증가시킬 수 있음.
Bit-Packing Encoding: 실제로 데이터를 나타내는 데에 필요한 bit만 남겨둠.
Mostly Encoding: 위의 경우에서 '튀는' 숫자가 있을 수 있기 때문에, "mostly" 작은 숫자들이 있는 경우에 이들을 작은 data type으로 바꿔주고, 압축될 수 없는 것은 별도로 저장하도록 처리함.
Bitmap Encoding: attribute의 unique value별로 bitmap을 만들어서(일종의 one-hot 같은 형태) 표시해 준다. 이는 value cardinality가 낮을 때 사용할 만하다(bitmap의 크기는 linearly proportional to the cardinality)
Delta Encoding: 연속되는 값들 사이의 차이만을 기록함. base value는 별도로 존재하거나, in-line으로 저장할 수 있음. 여기에 RLE를 적용하여 추가적으로 압축할 수도 있음.
Dictionary Compression: 실제로 가장 많이 사용되는 압축 방식임. frequent value를 smaller fixed-length code로 바꿔주고, 이에 대한 mapping을 별도로 유지함. 인코딩된 값은 order-preserving해야 함.
- dictionary는 여러 방법으로 구현가능함. 1)array(값들을 정렬한 후에 byte offset을 세서 이를 압축된 데이터의 값으로 사용) 2)hash table 3)b+tree

[CMU 15-445 Intro to DB Sys] Lec10 Sorting and Aggregation (0)	2024.01.09
[CMU 15-445 Intro to DB Sys] Lec09 Index Concurrency (0)	2024.01.09
[CMU 15-445 Intro to DB Sys] Lec08 B+Tree Index (0)	2024.01.05
[CMU 15-445 Intro to DB Sys] Lec07 Hash Tables (0)	2024.01.04
[CMU 15-445 Intro to DB Sys] Lec06 Memory & Disk IO (0)	2024.01.04

봄은 항상 낯설다