Bigtable: A Distributed Storage System for Structured Data
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Bigtable is a massive, distributed database system created by Google to manage huge amounts of structured data. It's designed to be spread across thousands of standard servers and to scale up to petabytes (millions of gigabytes) of information. It is used by many Google products, including Web Indexing, Google Analytics, and Google Earth. While it acts like a database, it is not a traditional relational database (like SQL) and provides a more flexible data model.
Core Concepts of the Data Model
At its heart, Bigtable is a giant, sparse, multi-dimensional sorted map. This means it stores data using a simple but powerful structure:
- (row key, column key, timestamp) → value
Row Key: This is the primary way data is organized. All data in the table is automatically kept sorted lexicographically by its row key. This is a crucial feature, as developers can choose their row keys intelligently to keep related data physically close to each other, making reads very fast. For example, by reversing domain names (e.g.,
com.google.maps), all pages from the same domain are grouped together.Column Families: Columns are grouped into "column families". A column family must be created beforehand, but you can add any column within that family on the fly. This is the basic unit for things like access control and compression. A table might have a few column families but can have a virtually unlimited number of columns.
Timestamp: Every piece of data (a "cell") can have multiple versions, indexed by a timestamp. This allows for keeping a history of values. Bigtable can be configured to automatically garbage-collect old versions, for instance, by keeping only the last 3 versions of a web page.
System Architecture
Bigtable's architecture consists of three main components:
Client Library: This code is part of every application that uses Bigtable. Clients talk directly to the servers that hold the data, not through a central point.
Master Server: There is one master server that acts as a coordinator. Its main jobs are assigning chunks of tables (called "tablets") to servers, load balancing, and handling schema changes (like creating a new column family). The master is not a bottleneck because clients do not need it for reading or writing data.
Tablet Servers: These are the workhorses of the system. A cluster can have many tablet servers, and more can be added dynamically to handle more data or traffic. Each tablet server manages a set of "tablets" and handles all the read and write requests for them.
A tablet is a contiguous range of rows from a table. Tables are automatically split into multiple tablets as they grow, and these tablets are the unit of distribution and load balancing.
How Reads and Writes Work
Writes: When data is written, it is first written to a commit log for durability and then stored in an in-memory buffer called a memtable. This makes writes very fast.
Compactions: When the memtable fills up, its contents are written to a new, permanent, and immutable file on disk called an SSTable. This process is a "minor compaction". Over time, multiple
SSTables accumulate. A "merging compaction" reads a few
SSTables and the memtable and combines them into a new SSTable. A "major compaction" rewrites all
SSTables for a tablet into a single one, which is an effective way to reclaim space from deleted data.Reads: A read operation must consult the in-memory memtable and all relevant SSTables on disk and merge the results to present a unified, correct view of the data.
Important Refinements and Features
Locality Groups: Clients can group several column families into a "locality group". Each locality group is stored in its own separate
SSTable. This allows for more efficient reads, as an application that only needs data from one group doesn't have to read through the data for all other groups.Compression: Bigtable supports compressing data within SSTables. This works very well when similar data is grouped together by row key, achieving compression ratios as high as 10-to-1 for web page content.
Bloom Filters: To speed up reads, clients can request the creation of Bloom filters for SSTables. A Bloom filter is a data structure that can quickly determine if an
SSTable does not contain a specific row or column, which helps avoid unnecessary disk reads.MapReduce Integration: Bigtable can be used as both an input source and an output target for Google's MapReduce framework, enabling large-scale parallel data processing
