Categories: Azure
Posted by
mheydt on
11/17/2009 2:20 AM |
Why Partition?
- Classic: data volume, workload
- Cloud: cost, elasticity
Horizontal Partitioning (Sharding)
- Spread data across similar nodes
- Achieve massive scaleout
- Intra-partition queries easy
- Cross-partition hard
Vertical Partitioning
- Spread data across dissimilar nodes
- Frequent data in expensive indexed storage
- Large data in cheap storage
- Retrieving all data requires more than one query
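A minimal sketch of the two-query pattern described above. The two dicts are in-memory stand-ins for an indexed store and a cheap bulk store; the store names, function names, and the product example are all hypothetical.

```python
# Stand-ins for the two kinds of storage (hypothetical names).
indexed_store = {}  # small, frequently queried fields (expensive, indexed)
bulk_store = {}     # large payloads (cheap, fetched by key only)


def save_product(product_id, name, price, image_bytes):
    """Split one logical record across the two stores."""
    indexed_store[product_id] = {"name": name, "price": price}
    bulk_store[product_id] = image_bytes


def load_product(product_id):
    """Reassemble the full record; this costs one lookup per store."""
    record = dict(indexed_store[product_id])  # query 1: indexed fields
    record["image"] = bulk_store[product_id]  # query 2: large payload
    return record
```

Listing or searching products only touches `indexed_store`; the second lookup is paid only when the full record is needed.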
Hybrid
- Combination of horizontal and vertical partitioning
Table Storage Key Points
- Partitions auto balanced
- Partition key and row key = primary key
- Distributed queries are priced per transaction, not per CPU use, so they cost less than on things like EC2
- Continuation tokens
- queries without partition keys need these
- helps with cross partition results
- each call with the token is a transaction
- Key columns can be up to 1 KB, but 260 is the practical limit due to URIs
- Row key = partition key => just one partition (and no continuation tokens issued)
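The continuation-token loop can be sketched as follows. This does not use the real storage client: `query_page` is a hypothetical stand-in that pages over an in-memory list and hands back an opaque token, and the call counter mirrors the note that each call with a token is a separate transaction.

```python
def query_page(rows, page_size, token=None):
    """Hypothetical stand-in for one storage call.

    Returns (page, next_token); next_token is None when no rows remain.
    """
    start = token or 0
    end = start + page_size
    next_token = end if end < len(rows) else None
    return rows[start:end], next_token


def query_all(rows, page_size):
    """Follow continuation tokens until exhausted.

    Returns (all_rows, number_of_calls); every call is a billable transaction.
    """
    results, token, calls = [], None, 0
    while True:
        page, token = query_page(rows, page_size, token)
        calls += 1
        results.extend(page)
        if token is None:
            return results, calls
```

For ten rows and a page size of four, the loop makes three calls, so a query that spans many partitions can cost noticeably more than a single-partition lookup.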
Horizontal partitioning - SQL Azure
- For example, first char of last name is the heuristic for partitioning
- Partition for
- Data volume > 10GB
- Transaction throttling (non-deterministic): always code for retry
- All partitioning is up to the developer
- Partitions are not auto balanced
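A rough sketch of the two developer responsibilities noted above: routing by the first character of the last name, and retrying throttled operations. The shard names, the letter ranges, and the `ThrottledError` exception are assumptions made for illustration; a real application would catch whatever error its database driver raises when throttled.

```python
import random
import time

# Hypothetical letter ranges mapped to hypothetical shard names.
SHARD_RANGES = [("A", "H", "shard0"), ("I", "P", "shard1"), ("Q", "Z", "shard2")]


def shard_for(last_name):
    """Pick a shard from the first character of the last name."""
    first = last_name[:1].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    return "shard_other"  # digits, punctuation, non-Latin letters, empty names


class ThrottledError(Exception):
    """Hypothetical error representing a throttled transaction."""


def with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying with exponential backoff plus jitter on throttling."""
    for attempt in range(attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Because these partitions are not auto balanced, fixed letter ranges like the ones above can skew badly (far more surnames start with S than with X), which is one reason to consider the hashing approach.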
Choosing a partition key
- Natural keys (last name, ssn, ...)
- Modulo
Hashes
- Project one distribution into another
- Use a function that is a random distribution
- Do not use a crypto hash (overkill on CPU)
- Plenty of examples: tinyurl.com/part-hash
- Be careful with object.GetHashCode() (boxing might give different hashes for the same value when hashed more than once)
- Lots of hash stuff on codeplex
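As one illustration of a cheap, non-cryptographic hash, here is a sketch using 32-bit FNV-1a followed by a modulo. FNV-1a is my choice for the example, not something named in the session, and `partition_for` is a hypothetical helper. The GetHashCode() caution has a Python analogue: the built-in `hash()` for strings is randomized per process by default, so it is not stable enough for partitioning.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: fast, deterministic, not cryptographic."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by FNV prime, keep 32 bits
    return h


def partition_for(key: str, partition_count: int) -> int:
    """Project a key onto one of partition_count partitions."""
    return fnv1a_32(key.encode("utf-8")) % partition_count
```

The same key always lands on the same partition across processes and machines, which is the property a partition hash needs.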
Partition stability over time
- May need to change partition scheme
- Two options: repartition all data, or versioning partition scheme
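The versioning option might look something like the sketch below: each row lives in a partition tagged with the scheme version that placed it, and reads try the newest scheme first before falling back to older ones. The `SCHEMES` table, the partition counts, and the dict-based store are assumptions for illustration; `zlib.crc32` is used only as a convenient stable hash.

```python
import zlib

# Hypothetical partition schemes, keyed by version number.
SCHEMES = {
    1: lambda key: zlib.crc32(key.encode("utf-8")) % 4,   # original: 4 partitions
    2: lambda key: zlib.crc32(key.encode("utf-8")) % 16,  # later: 16 partitions
}
CURRENT_VERSION = 2


def locate(key, version=CURRENT_VERSION):
    """Return the (version, partition) pair for a key under a given scheme."""
    return version, SCHEMES[version](key)


def write(store, key, value, version=CURRENT_VERSION):
    """Store a value in the partition chosen by the given scheme version."""
    store.setdefault(locate(key, version), {})[key] = value


def read(store, key):
    """Try the newest scheme first, then fall back to older ones."""
    for version in sorted(SCHEMES, reverse=True):
        partition = store.get(locate(key, version), {})
        if key in partition:
            return partition[key]
    return None
```

This avoids a bulk repartition at the cost of extra lookups for rows that were written under an older scheme.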
Vertical partitioning
- Balance performance vs cost
- SQL Azure
- Fully indexable
- No query transaction charges
- $9.99/GB
- Azure storage
- ... missed this - slides went too fast
- Duplicated data can lower transaction costs on data
Azure tables != RDBMS
- Storage is cheap
- Cross-partition queries are resource intensive
Modeling Azure Tables
- Currently no secondary indexes
- build indexes yourself
- If associated data is small enough
- Save additional queries
- Duplicate data with each index
- Lots of worker roles to massage data into indexes
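A small sketch of a hand-built secondary index, with dicts standing in for tables. The table names, the email lookup, and the customer fields are hypothetical. The index row duplicates the small associated data (the name) so that a lookup by email needs no second query against the primary table.

```python
customers = {}           # primary table: (partition_key, row_key) -> entity
customers_by_email = {}  # hand-built index table, keyed by email


def add_customer(region, customer_id, name, email):
    """Write the entity, then write the index row with duplicated data."""
    customers[(region, customer_id)] = {"name": name, "email": email}
    customers_by_email[email] = {
        "region": region,            # enough to find the primary row
        "customer_id": customer_id,
        "name": name,                # duplicated to save an extra query
    }


def find_name_by_email(email):
    """Answer from the index row alone, without touching the primary table."""
    row = customers_by_email.get(email)
    return row["name"] if row else None
```

Here the index is written inline for brevity; in the approach from the notes, worker roles would build and refresh such index rows in the background.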
Summary
- Partitioning data is key to scaling cloud apps
- Horizontal partitioning for scale out
- Vertical for cost/performance
- Choose appropriate keys