CS6400 Test4

8 Dec - by qkim0x01
Normalization
  • Normalize
    • Databases are forever
    • EER diagrams may go missing
    • You never know what’s in the database
  • What’s it all about?
    • Given a relation and a set of functional dependencies
    • Given an Email we know the Birthyear
      • Email -> BirthYear
    • Given an Email and Interest we know the SinceAge
      • Email, Interest -> SinceAge
    • How do we normalize the relation without information loss and so that the functional dependencies can be enforced?
  • The rules
    • No redundancy of facts
    • No cluttering of facts
    • Must preserve information
    • Must preserve functional dependencies
  • Not a relation – NF2
    • Multi-valued attributes
      • Putting all the multi-values into one entry is not a relation (values are not atomic)
    • Iterate over the multi-values and create one row per value -> now it’s a relation
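The flattening step above can be sketched in Python (illustrative data; the attribute names follow the notes' RegularUser example, but the record itself is made up):

```python
# A non-1NF (NF2) record: Interests is a multi-valued attribute
# holding (Interest, SinceAge) pairs.
nf2_record = {
    "Email": "user1@gatech.edu",
    "Interests": [("Reading", 5), ("Tennis", 12)],
}

# One row per interest: every attribute value is now atomic -> 1NF.
rows = [
    {"Email": nf2_record["Email"], "Interest": i, "SinceAge": s}
    for i, s in nf2_record["Interests"]
]

for row in rows:
    print(row)
```

Each output row repeats the Email, which is exactly the redundancy the later normal forms then clean up.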
  • Relation with problems
    • Redundancy
      • There are same entries for user1
      • Redundancy may cause inconsistency if an entry is updated
    • Insertion anomaly
      • If we insert a new RegularUser without any Interest, then we must insert NULL values for Interest and SinceAge
      • If we insert a pair of BirthYear and Salary, then we must insert NULL values for Email, Interest, SinceAge, and CurrentCity
    • Deletion anomaly
      • If we delete user12@gatech.edu, then we lose the fact that when BirthYear is 1974 then the salary is 38000
    • Update anomaly
      • If we update the CurrentCity of a RegularUser, then we must do it multiple times
      • If I update BirthYear, then we must update Salary
  • Information loss
    • Try to decompose a table into multiple tables so the functional dependencies can be enforced without problems
    • If we decompose RegularUser into two relations, then we could get too many rows back when we combine them
        • When we combine tables, we get more tuples than the relation is supposed to have
        • Getting more information does not reflect fact in reality
          • -> information loss
  • Dependency loss
    • If we decompose RegularUser into two relations, then we cannot enforce the functional dependencies that are split between the two relations
      • Since Email and Salary do not coexist in the same table, we cannot enforce the dependency between them
  • Perfect
    • No redundancy
      • No inconsistent data
    • No insertion anomalies
      • Because the facts represented by functional dependencies are separated into different tables
    • No deletion anomalies
      • Because you can delete one fact without deleting other facts for a particular user
    • No update anomalies
      • Because information is not redundant
    • No information loss
      • None of the joins will create additional rows
    • No dependency loss
      • A separate relation for each functional dependency
    • But, how do we get there?
  • Functional dependencies
    • Let X and Y be sets of attributes of R, Y is “functionally dependent” on X in R iff for each x in R.X, there is precisely one y in R.Y
      • Email -> CurrentCity
        • There is only one CurrentCity for one Email
      • Email, Interest -> SinceAge
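The definition above can be checked mechanically on a relation instance. A minimal sketch in Python (not from the lecture; the sample rows are invented to match the RegularUser attributes):

```python
def fd_holds(rows, lhs, rhs):
    """True iff lhs -> rhs holds in this relation instance:
    equal lhs values never map to different rhs values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if x in seen and seen[x] != y:
            return False  # same x, two different y values -> not a function
        seen[x] = y
    return True

rows = [
    {"Email": "u1@gt.edu", "Interest": "Tennis", "SinceAge": 12, "CurrentCity": "Atlanta"},
    {"Email": "u1@gt.edu", "Interest": "Music",  "SinceAge": 5,  "CurrentCity": "Atlanta"},
    {"Email": "u2@gt.edu", "Interest": "Tennis", "SinceAge": 9,  "CurrentCity": "Decatur"},
]

print(fd_holds(rows, ["Email"], ["CurrentCity"]))           # True
print(fd_holds(rows, ["Email"], ["SinceAge"]))              # False: u1 has two SinceAge values
print(fd_holds(rows, ["Email", "Interest"], ["SinceAge"]))  # True
```

Note the asymmetry: an instance can only refute an FD, not prove it holds for all possible data.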
  • Full functional dependencies
    • Let X and Y be sets of attributes of R, Y is “fully functionally dependent” on X in R iff Y is functionally dependent on X and Y is not functionally dependent on any proper subset of X
    • Email, Interest -> SinceAge
      • Fully dependent
      • Both Email and Interest are required
    • Email, Interest -> CurrentCity
      • Not a full dependency, because CurrentCity is dependent on Email alone
  • Functional Dependencies and Keys
    • We use “keys” to enforce full functional dependencies, X->Y
    • In a relation, the values of the keys are unique
    • That’s why it enforces a function!
    • Email, Interest -> SinceAge
      • make (Email, Interest) the key
  • Overview of Normal Forms
    • A relation with non-atomic values is in non-first normal form
      • NF2
    • 1NF > 2NF > 3NF > BCNF (Boyce-Codd normal form)
  • Normal Forms – Definitions
    • NF2
      • non-first normal form
    • 1NF
      • R is in 1NF iff all domain values are atomic
    • 2NF
      • R is in 2NF iff R is in 1NF and every nonkey attribute is fully dependent on the key
    • 3NF
      • R is in 3NF iff R is in 2NF and every nonkey attribute is non-transitively dependent on the key
    • BCNF
      • R is in BCNF iff every determinant is a candidate key
    • Determinant
      • A set of attributes on which some other attribute is fully functionally dependent
    • “All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF)”
  • 1NF to BCNF flow chart
    • 1NF
      • Email->CurrentCity
      • Email->BirthYear
      • Email,BirthYear -> Salary
      • Email, Interest -> SinceAge
    • BCNF
      • Email,Interest -> SinceAge
    • 2NF
      • Email-> CurrentCity, BirthYear
      • Email, BirthYear -> Salary
    • 3NF & BCNF
      • BirthYear -> Salary
    • 3NF & BCNF
      • Email -> CurrentCity, BirthYear
  • Compute with Functional Dependencies – Armstrong’s rules
    • Reflexivity
      • if Y is a subset of X, then X->Y
        • Email, Interest -> Interest
    • Augmentation
      • If X->Y, then WX->WY
        • If Email -> BirthYear, then
          • Email, Interest->BirthYear, Interest
    • Transitivity
      • If X->Y and Y->Z, then X->Z
        • Email-> BirthYear and BirthYear->Salary, then
          • Email->Salary
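Armstrong's rules are usually applied via the attribute closure X+: everything derivable from X. A small sketch (my own helper, not lecture code), using the FDs from the notes:

```python
def closure(attrs, fds):
    """Attribute closure X+ of attrs under fds, a list of
    (lhs, rhs) frozenset pairs. Repeatedly applies any FD whose
    left side is already contained in the result (transitivity)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [
    (frozenset({"Email"}), frozenset({"BirthYear"})),
    (frozenset({"BirthYear"}), frozenset({"Salary"})),
    (frozenset({"Email", "Interest"}), frozenset({"SinceAge"})),
]

# Email -> BirthYear and BirthYear -> Salary, so Salary lands in Email+.
print(closure({"Email"}, fds))
```

X is a (super)key exactly when its closure contains all attributes of R, which is how the normal-form tests below are checked in practice.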
  • How to guarantee lossless joins?
    • The join field must be a key in at least one of the relations!
      • If the join field is a key, there is no way to create spurious tuples
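Joining on a non-key field shows why the rule matters. A small demonstration in Python (invented data, and a naive natural join written just for this example):

```python
def natural_join(r, s, on):
    """Naive natural join of two lists of dict-rows on columns `on`."""
    return [
        {**a, **b}
        for a in r for b in s
        if all(a[c] == b[c] for c in on)
    ]

# Decompose on CurrentCity, which is NOT a key of either piece:
r = [{"Email": "u1@gt.edu", "CurrentCity": "Atlanta"},
     {"Email": "u2@gt.edu", "CurrentCity": "Atlanta"}]
s = [{"CurrentCity": "Atlanta", "BirthYear": 1980},
     {"CurrentCity": "Atlanta", "BirthYear": 1974}]

joined = natural_join(r, s, ["CurrentCity"])
print(len(joined))  # 4 rows reconstructed from 2 original users -> spurious tuples
```

Two original users come back as four rows, i.e. the decomposition was lossy. Had the join field been a key of r or s, each row on the other side would match at most one partner.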
  • How to guarantee preservation of FDs?
    • The meaning implied by the remaining functional dependencies must be the same!
    • Transitivity
      • Email->BirthYear, BirthYear->Salary
        • Even though I don’t see Email->Salary explicitly, it still holds
  • Email Interest
    • Email->CurrentCity, BirthYear, Salary & BirthYear->Salary
      • Email->CurrentCity, BirthYear
      • BirthYear->Salary
        • Lossless?
          • Yes: BirthYear is the key of the second relation
        • Dependency Preserving?
          • Yes: Email->BirthYear->Salary implies Email->Salary
  • 3NF and BCNF
    • There exists relations which can only be decomposed to 3NF, but not to BCNF, while being lossless and dependency preserving
      • It can only happen when the relation has overlapping keys
      • It never happens in practice
Efficiency
  • Computer Architecture
    • Main memory
      • RAM- volatile, fast, small and expensive
    • Secondary memory
      • DISK – permanent, slow, big, and cheap
    • Applications run by the CPU can only query and update data in main memory
    • Data must be written back to secondary memory after it is updated
    • Only a tiny fraction of a real database fits in main memory
  • Why should we care
    • Main memory access time <<< disk memory access time
    • Cost computation
      • Only disk I/O cost counts
      • CPU cost is ignored
  • Disk
    • 4 platters
    • 8 heads
    • 512 bytes/sector
  • Records
    • RegularUser{
      • Email varchar(50)
      • Sex char(1)
      • Birthdate date
      • CurrentCity varchar(50)
      • Hometown varchar(50)
      • …}
    • record size: 159 bytes
  • Blocks
    • Spanned
      • put start of record in a block, and put the rest in another block
    • Unspanned
      • Ignore leftover bytes smaller than a record (don’t split records across blocks)
    • Assume block size: 4K (+metadata)
    • filled: ~80% (~3200 bytes)
    • records/block: ~20
  • Files
    • Blocks linked by pointers
    • block pointer: 4 bytes
    • # records: 4 million
    • # blocks: ~200,000
    • file size: ~800MB
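The blocking and file-size numbers above follow from simple arithmetic. A sketch using the notes' figures (the ~80% fill factor gives ~3200 usable bytes per 4K block):

```python
record_size = 159          # bytes per RegularUser record
block_fill = 3200          # usable bytes of a 4K block at ~80% fill
n_records = 4_000_000

records_per_block = block_fill // record_size   # unspanned blocking factor
n_blocks = -(-n_records // records_per_block)   # ceiling division
file_size_mb = n_blocks * 4096 / 1024 / 1024

print(records_per_block)    # 20 records/block
print(n_blocks)             # 200,000 blocks
print(round(file_size_mb))  # ~781 MB, i.e. the notes' ~800 MB
```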
  • Assumptions
    • Page fault
      • Seek time: 3 ~ 8 ms
      • Rotational delay: 2~3 ms
      • Transfer Time: 0.5~1 ms
      • Total: 5 ~ 12 ms or 0.01 sec
    • Extent transfers from disk to memory, say 250 blocks, save seek time and rotational delay, but require more buffer space
    • with a page fault of 10 ms, 250 single-block transfers cost 2.5 sec
    • whereas one extent transfer costs ~0.260 sec
      • Savings from seek time and rotational delay
    • buffer management
      • LRU
        • excellent for merge joins
        • bad for nested loops joins
          • MRU might be better
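The extent arithmetic can be checked directly. A sketch, assuming the 10 ms page fault splits into roughly 9 ms of seek + rotational delay and 1 ms of transfer per block (my assumed split; the notes only give the totals):

```python
page_fault = 0.010   # s: seek + rotational delay + transfer for one block
blocks = 250

# 250 separate page faults, paying seek + rotation every time:
one_by_one = blocks * page_fault            # 2.5 s

# One extent: pay seek + rotation once, then transfer sequentially.
# Assumed split: ~9 ms seek+rotation, ~1 ms transfer per block.
extent = 0.009 + blocks * 0.001             # ~0.26 s

print(round(one_by_one, 3), round(extent, 3))
```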
  • Heap – unsorted file
    • Lookup time (average) :
      • N/2 where N = # data blocks
      • 200,000/2 * 0.01s = 16.6 min
  • Sorted file
    • Lookup time:
      • log2(N) for 2-ary search
      • 18 * 0.01s = 0.18s
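Both lookup costs fall out of the same block count. A sketch of the arithmetic (helper names are mine):

```python
import math

N = 200_000     # data blocks in the file
fault = 0.01    # s per block read (one page fault)

heap_avg = N / 2 * fault                          # scan half the file on average
sorted_lookup = math.ceil(math.log2(N)) * fault   # binary search over blocks

print(round(heap_avg / 60, 1))     # ~16.7 minutes for the heap
print(round(sorted_lookup, 2))     # 0.18 s for the sorted file
```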
  • Primary Index – point queries, and great for range queries
    • Sorted index for each file
    • log2(n) + 1 where n = #index blocks
    • fanout = ~60
      • Email (50) + pointer (4) = 54
      • 3200 / 54 = ~60 key/pointer pairs per index block
    • Sparse
      • #index blocks = 200,000 / 60 ≈ 3334
      • (log2(200,000/60) + 1) * 0.01 = 0.13s
    • Dense (all key/pointer pairs are in the index, not only one per data block)
      • #index blocks = 4,000,000 / 60 ≈ 66,667
      • (log2(4,000,000/60) + 1) * 0.01 = 0.17s
      • Better for finding max, min, or avg of the key values
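The sparse and dense figures come from one formula over the number of index entries. A sketch (function name is mine; following the notes, the log term is not rounded up before adding the data-block read):

```python
import math

fault = 0.01    # s per block read
fanout = 60     # (key, pointer) pairs per index block

def index_lookup(n_entries):
    """Binary search over the index blocks, plus one data-block read."""
    index_blocks = math.ceil(n_entries / fanout)
    return (math.log2(index_blocks) + 1) * fault

print(round(index_lookup(200_000), 2))    # sparse: one entry per data block -> 0.13 s
print(round(index_lookup(4_000_000), 2))  # dense: one entry per record -> 0.17 s
```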
  • Secondary Index – point queries only
    • Build index not on the key value, but on something else
    • Unsorted data, sorted index
      • -> cannot make it sparse because the data is unsorted
      • it has to be dense
    • Lookup Time
      • log2(n) + 1
    • Tricky for non-key fields
    • Build index blocks in scan order, then sort the index blocks by the indexed value
  • Multilevel index
    • Lookup Time
      • log_fanout(n) + 1
    • Sparse
      • (log60(3334) + 1) * 0.01s = 0.03s
    • Need to watch out for overflow (more level -> more sensitive)
      • A change at the bottom level might propagate to the upper levels
    • Dense bottom possible too
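Switching the base of the log from 2 to the fanout is the whole win of a multilevel index: each level costs one block read. A sketch with the notes' numbers:

```python
import math

fault = 0.01
fanout = 60
index_blocks = 3334   # sparse bottom level from the primary-index example

# One block read per level of the index, plus one for the data block.
levels = math.ceil(math.log(index_blocks, fanout))   # 2 levels suffice
lookup = (levels + 1) * fault

print(levels, round(lookup, 2))   # 2 levels -> 0.03 s
```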
  • Multilevel index B(balanced)+-Tree
    • A favorite implementation of a multi-level index
    • Insertion, deletion, and update operations are implemented to keep the tree balanced
    • Only rarely does an overflow at a lower level of the tree propagate more than 1-2 level up
  • Static Hashing
    • Hash key space much larger than address space
    • Good hash function:
      • Distribute values uniformly over the address space
      • Fill buckets as much as possible
      • Avoid collisions
        • resolve remaining collisions by chaining with a linked list
    • Lookup Time
      • Constant: (1~2) * 0.01s = 0.01 ~ 0.02s
        • if bucket directory is in main memory
      • Dynamic hashing expands bucket directory when needed
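A minimal static-hashing sketch with chaining (class and method names are mine; buckets stand in for disk blocks):

```python
class HashFile:
    """Static hashing: a fixed bucket directory, collisions chained in-bucket."""

    def __init__(self, n_buckets=8):
        # Fixed address space -> "static"; dynamic hashing would grow this.
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, record):
        self._bucket(key).append((key, record))

    def lookup(self, key):
        # One hash + one bucket scan: the constant 1-2 block reads from the notes.
        for k, rec in self._bucket(key):
            if k == key:
                return rec
        return None

h = HashFile()
h.insert("u1@gt.edu", {"CurrentCity": "Atlanta"})
print(h.lookup("u1@gt.edu"))
```

A good hash function keeps the chains short, so lookup stays constant regardless of file size, which is why hashing beats every index above for pure point queries.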
