Categories: Azure | Posted by mheydt on 11/17/2009 2:20 AM
Why Partition?
  • Classic: data volume, workload
  • Cloud: cost, elasticity
Horizontal Partitioning (Sharding)
  • Spread data across similar nodes
  • Achieve massive scale-out
  • Intra-partition queries easy
  • Cross-partition hard
Vertical Partitioning
  • Spread data across dissimilar nodes
  • Frequently accessed data in expensive, indexed storage
  • Large data in cheap storage
  • Retrieving all data requires more than one query
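To make the trade-off concrete, here is a minimal sketch of a vertically partitioned record. The two dictionaries are hypothetical in-memory stand-ins for an indexed store and a cheap bulk store; none of the names come from the session.

```python
# Hypothetical in-memory stand-ins for the two kinds of storage.
indexed_store = {}  # small, frequently read fields (expensive, indexed)
blob_store = {}     # large payloads (cheap, unindexed)


def save_document(doc_id, title, author, body):
    """Split one logical record across the two stores."""
    indexed_store[doc_id] = {"title": title, "author": author}
    blob_store[doc_id] = body


def load_summary(doc_id):
    """One lookup: enough for list views and searches."""
    return indexed_store[doc_id]


def load_full(doc_id):
    """Two lookups: the price of retrieving all the data."""
    record = dict(indexed_store[doc_id])
    record["body"] = blob_store[doc_id]
    return record


save_document("doc-1", "Partitioning notes", "mheydt", "x" * 100_000)
print(load_summary("doc-1")["title"])
print(len(load_full("doc-1")["body"]))
```

Most reads only touch the small store; only a full read pays for the second query.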
Hybrid
  • Combination of horizontal and vertical partitioning
Table Storage Key Points
  • Partitions auto balanced
  • Partition key + row key = primary key
  • Distributed queries are priced per transaction, not per CPU, so they cost less than on platforms like EC2
  • Continuation tokens
    • queries without partition keys need these
    • helps with cross partition results
    • each call with the token is a transaction
  • Key columns can be up to 1 KB, but 260 characters is the practical limit due to URI length
  • Row key = partition key => just one partition (and no continuation tokens issued)
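The continuation-token pattern can be sketched roughly as below. This is not the Azure client API: `query_page`, `TABLE`, and `PAGE_SIZE` are hypothetical, and the token is a plain offset where a real service would return an opaque value. The point is only that the caller loops until no token comes back, and that every call counts as a transaction.

```python
PAGE_SIZE = 2  # assumed page size, kept tiny so the example paginates

# Hypothetical table: partition key -> list of row keys.
TABLE = {
    "A": ["adams", "allen", "avery"],
    "B": ["baker", "brown"],
}


def query_page(token=None):
    """Return (rows, next_token); next_token is None when nothing is left."""
    rows = [(pk, rk) for pk in sorted(TABLE) for rk in TABLE[pk]]
    start = token or 0
    page = rows[start:start + PAGE_SIZE]
    next_start = start + PAGE_SIZE
    return page, (next_start if next_start < len(rows) else None)


def query_all():
    """Follow continuation tokens until exhausted, counting transactions."""
    results, token, transactions = [], None, 0
    while True:
        page, token = query_page(token)
        transactions += 1  # each call with a token is billed separately
        results.extend(page)
        if token is None:
            return results, transactions


rows, transactions = query_all()
print(len(rows), "rows in", transactions, "transactions")
```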
Horizontal partitioning - SQL Azure
  • For example, use the first character of the last name as the partitioning heuristic
  • Partition for
    • Data volume > 10GB
    • Transaction throttling (non-deterministic); always code for retry
  • All partitioning is up to the developer
  • Partitions are not auto balanced
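A minimal sketch of the two developer responsibilities above: routing a row to a shard by the first letter of the last name, and retrying when throttled. `ThrottledError`, the shard count, and the backoff values are assumptions made for illustration, not anything SQL Azure defines.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error standing in for a throttled connection."""


def shard_for(last_name, shard_count=4):
    """Map the first letter of a last name onto one of shard_count shards."""
    first = last_name.strip().upper()[:1]
    if not first.isalpha() or not first.isascii():
        return 0  # assumed catch-all shard for anything outside A-Z
    return (ord(first) - ord("A")) * shard_count // 26


def with_retry(operation, attempts=5, base_delay=0.01):
    """Run operation, retrying with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Usage: an operation that is throttled twice before succeeding.
calls = {"count": 0}


def flaky_insert():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottledError("throttled")
    return "ok"


print(shard_for("Heydt"), with_retry(flaky_insert))
```

Since these partitions are not auto balanced, a first-letter scheme will likely be skewed by how unevenly names are distributed, which is one motivation for the hash-based keys that follow.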
Choosing a partition key
  • Natural keys (last name, ssn, ...)
  • Modulo
Hashes
  • Project one distribution into another
  • Use a function whose output is randomly distributed
  • Do not use a crypto hash (overkill on CPU)
  • Plenty of examples: tinyurl.com/part-hash
  • Be careful using object.GetHashCode() (boxing might give different hashes for the same value when hashed more than once)
  • Lots of hash stuff on codeplex
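As one example of a cheap, non-cryptographic hash combined with modulo, here is a sketch using 32-bit FNV-1a. The session did not name a specific hash, so FNV-1a is my choice for illustration, and `partition_for` is a hypothetical helper.

```python
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def partition_for(key: str, partition_count: int) -> int:
    """Project an arbitrary key distribution onto partition_count buckets."""
    return fnv1a_32(key.encode("utf-8")) % partition_count


for name in ["adams", "baker", "heydt", "zimmer"]:
    print(name, partition_for(name, 8))
```

The same caution as with GetHashCode() applies in Python: the built-in `hash()` of a string is randomized per process by default, so it is not suitable for choosing a partition that has to stay stable across runs.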
Partition stability over time
  • May need to change partition scheme
  • Two options: repartition all data, or version the partition scheme
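The notes do not say how versioning was meant to work, so the following is only one possible reading: keep every scheme that was ever used, write with the newest, and on read check the newest first and fall back to older ones. `SCHEMES`, `store`, and the CRC32-based functions are all hypothetical.

```python
import zlib

# Hypothetical scheme registry: version -> function mapping a key to a partition.
SCHEMES = {
    1: lambda key: zlib.crc32(key.encode("utf-8")) % 4,
    2: lambda key: zlib.crc32(key.encode("utf-8")) % 16,
}
CURRENT_VERSION = 2

store = {}  # (version, partition) -> {key: value}


def write(key, value):
    """New writes always use the current scheme."""
    location = (CURRENT_VERSION, SCHEMES[CURRENT_VERSION](key))
    store.setdefault(location, {})[key] = value


def read(key):
    """Try the newest scheme first, then fall back to older ones."""
    for version in sorted(SCHEMES, reverse=True):
        location = (version, SCHEMES[version](key))
        if key in store.get(location, {}):
            return store[location][key]
    return None


# Data written under the old scheme stays readable without repartitioning.
store[(1, SCHEMES[1]("alice"))] = {"alice": "old value"}
print(read("alice"))
write("alice", "new value")
print(read("alice"))
```

The cost of this approach is extra lookups for data still living under an old scheme; repartitioning everything avoids that but means moving all the data at once.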
Vertical partitioning
  • Balance performance vs cost
  • SQL Azure
    • Fully indexable
    • No query transaction charges
    • $9.99/GB
  • Azure storage
    • ... missed this - slides went too fast
  • Duplicating data can lower transaction costs
Azure tables != RDBMS
  • Storage is cheap
  • Cross-partition queries are resource intensive
Modeling Azure Tables
  • Currently no secondary indexes
  • build indexes yourself
  • If associated data is small enough
    • Duplicate the data with each index entry
    • Saves additional queries
  • Lots of worker roles to massage data into indexes
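A rough sketch of a hand-built index with duplicated data. The dictionaries keyed by (partition key, row key) are hypothetical stand-ins for two tables; in a real table the city lookup would be a single-partition query rather than a scan of a dict.

```python
# Hypothetical in-memory tables keyed by (partition key, row key).
customers = {}          # primary table: ("customer", customer_id)
customers_by_city = {}  # index table:   (city, customer_id)


def add_customer(customer_id, name, city):
    """Write the entity, plus a full copy under the index key."""
    entity = {"id": customer_id, "name": name, "city": city}
    customers[("customer", customer_id)] = entity
    # The entity is small, so duplicate it instead of storing just a pointer.
    customers_by_city[(city, customer_id)] = dict(entity)


def find_by_city(city):
    """Answer from the index alone; no second query to the primary table."""
    return [e for (pk, _), e in sorted(customers_by_city.items()) if pk == city]


add_customer("c1", "Adams", "Denver")
add_customer("c2", "Baker", "Seattle")
add_customer("c3", "Clark", "Denver")
print([c["name"] for c in find_by_city("Denver")])
```

Here the index is written inline with the entity; the worker-role approach mentioned above would presumably do that second write asynchronously.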
Summary
  • Partitioning data is key to scaling cloud apps
  • Horizontal partitioning for scale-out
  • Vertical for cost/performance
  • Choose appropriate keys





