Normalization
- Normalize
- Databases are forever
- EER diagrams may go missing
- You never know what's in the database
- What’s it all about?
- Given a relation and a set of functional dependencies
- Given an Email we know the Birthyear
- Given an Email and Interest we know the SinceAge
- Email, Interest -> SinceAge
- How do we normalize the relation without information loss and so that the functional dependencies can be enforced?
- The rules
- No redundancy of facts
- No cluttering of facts
- Must preserve information
- Must preserve functional dependencies
- Not a relation – NF2
- Multi-valued attributes
- Storing all the values for an entry in a single row is not a relation (values are not atomic)
- Flatten the multi-values into one row per value -> now it's a relation
- Relation with problems
- Redundancy
- The same facts are repeated across rows for user1
- Redundancy may cause inconsistency if an entry is updated
- Insertion anomaly
- If we insert a new RegularUser without any Interest, then we must insert NULL values for Interest and SinceAge
- If we insert a new pair of BirthYear and Salary, then we must insert NULL values for Email, Interest, SinceAge, and CurrentCity
- Deletion anomaly
- If we delete user12@gatech.edu, then we lose the fact that a BirthYear of 1974 corresponds to a Salary of 38000
- Update anomaly
- If we update the CurrentCity of a RegularUser, then we must do it in multiple rows
- If we update a BirthYear, then we must also update the corresponding Salary
- Information loss
- Try to decompose a table into multiple tables so the functional dependencies can be enforced without these problems
- If we decompose RegularUser into two relations, then we could get too many rows back when we combine them
- When we combine the tables again, we get more tuples than the relation is supposed to have
- The extra tuples do not reflect facts in reality
- Dependency loss
- If we decompose RegularUser into two relations, then we cannot enforce the functional dependencies that are split between the two relations
- Since Email and Salary do not coexist in the same table, we cannot enforce the dependency between them
- Perfect
- No redundancy
- No insertion anomalies
- Because the facts represented by the functional dependencies are separated into different tables
- No deletion anomalies
- Because you can delete one fact without deleting other facts about a particular user
- No update anomalies
- Because the information is not redundant
- No information loss
- None of the joins will create additional rows
- No dependency loss
- There is a separate relation for each functional dependency
- But, how do we get there?
- Functional dependencies
- Let X and Y be sets of attributes of R. Y is "functionally dependent" on X in R iff for each x in R.X there is precisely one y in R.Y
- Email -> CurrentCity
- There is only one CurrentCity for one Email
- Email, Interest -> SinceAge
- Full functional dependencies
- Let X and Y be sets of attributes of R. Y is "fully functionally dependent" on X in R iff Y is functionally dependent on X and Y is not functionally dependent on any proper subset of X
- Email, Interest -> SinceAge
- Fully dependent
- Both Email and Interest are required
- Email, Interest -> CurrentCity is not a full functional dependency
- Because CurrentCity depends on Email alone (see the sketch below)
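- A minimal Python sketch of both definitions, over a made-up RegularUser instance (rows and names are hypothetical; note a single instance can refute an FD but never prove it for all instances):

```python
from itertools import combinations

# Hypothetical RegularUser rows: (Email, Interest, SinceAge, CurrentCity)
rows = [
    ("u1@gt.edu", "Music",  4, "Atlanta"),
    ("u1@gt.edu", "Chess", 10, "Atlanta"),
    ("u2@gt.edu", "Music",  7, "Boston"),
]
COLS = {"Email": 0, "Interest": 1, "SinceAge": 2, "CurrentCity": 3}

def fd_holds(rows, X, Y):
    """X -> Y holds iff each X-value maps to exactly one Y-value."""
    seen = {}
    for r in rows:
        x = tuple(r[COLS[a]] for a in X)
        y = tuple(r[COLS[a]] for a in Y)
        if seen.setdefault(x, y) != y:
            return False
    return True

def fully_depends(rows, X, Y):
    """X -> Y is full iff it holds and no proper subset of X determines Y."""
    if not fd_holds(rows, X, Y):
        return False
    return not any(fd_holds(rows, list(sub), Y)
                   for k in range(1, len(X))
                   for sub in combinations(X, k))

print(fully_depends(rows, ["Email", "Interest"], ["SinceAge"]))      # True
print(fully_depends(rows, ["Email", "Interest"], ["CurrentCity"]))   # False: Email alone suffices
```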
- Functional Dependencies and Keys
- We use “keys” to enforce full functional dependencies, X->Y
- In a relation, the values of the keys are unique
- That’s why it enforces a function!
- Email, Interest -> SinceAge
- Make (Email, Interest) the key
- Overview of Normal Forms
- Everything that is not in first normal form is called non-first normal form (NF2)
- 1NF ⊃ 2NF ⊃ 3NF ⊃ BCNF (Boyce-Codd normal form)
- Normal Forms – Definitions
- NF2
- 1NF
- R is in 1NF iff all domain values are atomic
- 2NF
- R is in 2NF iff R is in 1NF and every nonkey attribute is fully dependent on the key
- 3NF
- R is in 3NF iff R is in 2NF and every nonkey attribute is non-transitively dependent on the key
- BCNF
- R is in BCNF iff every determinant is a candidate key
- Determinant
- A set of attributes on which some other attribute is fully functionally dependent
- “All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF)”
- Flow chart: decomposing from 1NF to BCNF
- 1NF
- Email->CurrentCity
- Email->BirthYear
- Email,BirthYear -> Salary
- Email, Interest -> SinceAge
- BCNF
- Email,Interest -> SinceAge
- 2NF
- Email-> CurrentCity, BirthYear
- Email, BirthYear -> Salary
- 3NF & BCNF
- Email -> CurrentCity, BirthYear
- 3NF & BCNF
- Email, BirthYear -> Salary
- Compute with Functional Dependencies – Armstrong’s rules
- Reflexivity
- If Y is a subset of X, then X->Y
- Email, Interest -> Interest
- Augmentation
- If X->Y, then WX->WY
- If Email -> BirthYear, then
- Email, Interest->BirthYear, Interest
- Transitivity
- If X->Y and Y->Z, then X->Z
- If Email->BirthYear and BirthYear->Salary, then Email->Salary (see the closure sketch below)
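- Repeatedly applying these rules yields the closure X+ of an attribute set X: every attribute X determines. A minimal sketch using the FDs from this section; the check at the end previews the BCNF test ("every determinant is a candidate key"):

```python
def closure(attrs, fds):
    """X+ : all attributes derivable from attrs via Armstrong's rules."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"Email"}, {"BirthYear"}),
       ({"BirthYear"}, {"Salary"}),
       ({"Email"}, {"CurrentCity"}),
       ({"Email", "Interest"}, {"SinceAge"})]

# Transitivity falls out of the closure computation:
print(closure({"Email"}, fds))
# -> {'Email', 'BirthYear', 'CurrentCity', 'Salary'}: Email -> Salary holds

# BCNF test: a determinant must be a candidate key (closure = all attributes).
ALL = {"Email", "Interest", "SinceAge", "CurrentCity", "BirthYear", "Salary"}
for lhs, _ in fds:
    print(sorted(lhs), "candidate key?", closure(lhs, fds) == ALL)
```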
- How to guarantee lossless joins?
- The join field must be a key in at least one of the relations!
- If the join field is a key, there is no way to create spurious tuples (illustrated below)
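- A toy illustration with made-up rows: joining two projections on CurrentCity, which is not a key of either piece, produces spurious tuples:

```python
# RegularUser(Email, CurrentCity, BirthYear); two users share a city.
rows = [("u1@gt.edu", "Atlanta", 1980),
        ("u2@gt.edu", "Atlanta", 1990)]

r1 = {(e, c) for e, c, b in rows}   # (Email, CurrentCity)
r2 = {(c, b) for e, c, b in rows}   # (CurrentCity, BirthYear)

rejoined = {(e, c, b) for e, c in r1 for c2, b in r2 if c == c2}
print(len(rows), len(rejoined))     # 2 vs 4: two spurious tuples appeared

# Joining on Email instead (a key of r1) could not inflate the result.
```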
- How to guarantee preservation of FDs?
- The meaning implied by the remaining functional dependencies must be the same!
- Transitivity
- Email->BirthYear, BirthYear->Salary
- Even though Email->Salary is not stated explicitly, it still holds
- Email Interest
- Email->CurrentCity, BirthYear, Salary & BirthYear->Salary
- Email->CurrentCity, BirthYear
- BirthYear->Salary
- Lossless?
- Dependency Preserving?
- Email->BirthYear->Salary, so Email->Salary still holds (yes)
- 3NF and BCNF
- There exist relations which can only be decomposed to 3NF, but not to BCNF, while staying lossless and dependency preserving
- It can only happen when the relation has overlapping keys
- It never happens in practice
Efficiency
- Computer Architecture
- Main memory
- RAM – volatile, fast, small, and expensive
- Secondary memory
- Disk – permanent, slow, big, and cheap
- Applications run by the CPU can only query and update data in main memory
- Data must be written back to secondary memory after it is updated
- Only a tiny fraction of a real database fits in main memory
- Why should we care?
- Main memory access time <<< disk memory access time
- Cost computation
- Only disk I/O cost counts
- CPU cost is ignored
- Disk
- 4 platters
- 8 read/write heads
- 512 bytes/sector
- …
- Records
- RegularUser{
- Email varchar(50)
- Sex char(1)
- Birthdate date
- CurrentCity varchar(50)
- Hometown varchar(50)
- …}
- record size: 159 bytes
- Blocks
- Spanned
- A record may start in one block and continue in the next
- Unspanned
- Leftover space at the end of a block that is too small for a whole record stays unused
- Assume block size: 4K (+metadata)
- filled: ~80% (~3200 bytes)
- records/block: ~20
- Files
- Blocks linked by pointers
- block pointer: 4 bytes
- # records: 4 million
- # blocks: ~200,000
- file size: ~800MB (see the arithmetic below)
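- Putting the assumed numbers together:

```python
block_size  = 4096         # bytes per block (plus metadata)
fill        = 0.80         # blocks are ~80% full (~3200 usable bytes)
record_size = 159          # bytes per RegularUser record
num_records = 4_000_000

records_per_block = int(block_size * fill // record_size)   # ~20
num_blocks = num_records // records_per_block               # ~200,000
file_size_mb = num_blocks * block_size / 1_000_000          # ~800 MB
print(records_per_block, num_blocks, file_size_mb)
```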
- Assumptions
- Page fault – the needed block is not in main memory and must be read from disk
- Seek time: 3 ~ 8 ms
- Rotational delay: 2~3 ms
- Transfer Time: 0.5~1 ms
- Total: 5 ~ 12 ms; assume ~0.01 sec per page fault
- Extent transfers move many blocks, say 250, from disk to memory at once; they save seek time and rotational delay but require more buffer space
- With a page fault of 10 ms each, transferring 250 blocks one at a time costs 2.5 sec
- An extent transfer costs ~0.260 sec
- The saving comes from paying seek time and rotational delay only once (worked below)
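- The arithmetic behind those two numbers:

```python
page_fault = 0.010   # s: seek + rotational delay + transfer for one block
transfer   = 0.001   # s: transfer time alone for a sequential block
blocks     = 250

one_at_a_time = blocks * page_fault             # 250 * 10 ms = 2.5 s
extent = page_fault + (blocks - 1) * transfer   # ~0.26 s: seek/rotation paid once
print(one_at_a_time, extent)
```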
- buffer management
- LRU
- excellent for merge joins
- bad for nested-loop joins (see the buffer-pool sketch below)
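- A minimal LRU buffer-pool sketch (class and method names are hypothetical; a real DBMS buffer manager is far more involved), just to show the recency bookkeeping:

```python
from collections import OrderedDict

class LRUBufferPool:
    """Keeps the `capacity` most recently used blocks in memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()               # block_id -> block contents

    def get(self, block_id):
        if block_id in self.pool:
            self.pool.move_to_end(block_id)     # hit: mark most recently used
            return self.pool[block_id]
        block = self._read_from_disk(block_id)  # miss: one page fault
        self.pool[block_id] = block
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)       # evict least recently used
        return block

    def _read_from_disk(self, block_id):
        return f"<block {block_id}>"            # stand-in for a real disk read
```

- Why LRU is pathological for nested-loop joins: the inner file is rescanned cyclically, so with a too-small pool the block just evicted is exactly the one needed next; a merge join reads each block once in order, which LRU handles well.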
- Heap – unsorted file
- Lookup time (average) :
- N/2 where N = # data blocks
- 200,000/2 * 0.01s = 16.6 min
- Sorted file
- Lookup time:
- log2(N) for 2-ary (binary) search
- 18 * 0.01s = 0.18s (both lookup costs computed below)
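- The two lookup costs, with the numbers assumed above:

```python
import math

N = 200_000        # data blocks
page_fault = 0.01  # s per block read

heap_avg = (N / 2) * page_fault                       # 1,000 s ~ 16.7 min
sorted_lookup = math.ceil(math.log2(N)) * page_fault  # 18 * 0.01 = 0.18 s
print(heap_avg / 60, sorted_lookup)
```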
- Primary Index – good for point queries and great for range queries
- Sorted index for each file
- log2(n) + 1 where n = #index blocks
- fanout = ~60
- Email (50) + pointer (4) = 54 bytes
- 3200 / 54 = ~60 (key, pointer) pairs per index block
- Sparse
- #index blocks = 200,000 / 60 ≈ 3,334
- (log2(200,000/60) + 1) * 0.01 = 0.13s
- Dense (every key appears in the index, not just the first key of each data block)
- #index blocks = 4,000,000 / 60 ≈ 66,667
- (log2(4,000,000/60) + 1) * 0.01 = 0.17s
- Better for finding max, min, or avg of the key values, since the index alone suffices (index arithmetic below)
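- The index arithmetic in one place (values assumed as above):

```python
import math

page_fault = 0.01
fanout = int(3200 // (50 + 4))             # ~60 (key, pointer) pairs per block

sparse_blocks = math.ceil(200_000 / 60)    # one entry per data block: ~3,334
dense_blocks  = math.ceil(4_000_000 / 60)  # one entry per record: ~66,667

sparse_lookup = (math.log2(sparse_blocks) + 1) * page_fault  # ~0.13 s
dense_lookup  = (math.log2(dense_blocks) + 1) * page_fault   # ~0.17 s
print(sparse_lookup, dense_lookup)
```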
- Secondary Index – point queries only
- Build index not on the key value, but on something else
- Unsorted data, sorted index
- Cannot be sparse because the data is unsorted
- It has to be dense
- Lookup Time
- Tricky for non-key fields
- Build the index entries in the order of the scan, then sort the index by value
- Multilevel index
- Lookup Time
- Sparse
- (log60(3334) + 1) * 0.01s = 0.03s
- Need to watch out for overflow (more levels -> more sensitive)
- A change at the bottom level may propagate to the upper levels
- A dense bottom level is possible too (cost arithmetic below)
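- The multilevel lookup cost: sparse upper levels stacked on the sparse bottom level:

```python
import math

page_fault = 0.01
fanout = 60
bottom_blocks = 3334                                 # sparse bottom index level

levels = math.ceil(math.log(bottom_blocks, fanout))  # log60(3334) ~ 2 levels
lookup = (levels + 1) * page_fault                   # +1 data block: 0.03 s
print(levels, lookup)
```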
- Multilevel index: B+-tree (balanced tree)
- A favorite implementation of a multi-level index
- Insertion, deletion, and update operations are implemented to keep the tree balanced
- Only rarely does an overflow at a lower level of the tree propagate more than 1-2 levels up
- Static Hashing
- Hash key space much larger than address space
- Good hash function:
- Distribute values uniformly over the address space
- Fill buckets as much as possible
- Avoid collisions
- Lookup Time
- Constant: (1~2) * 0.01s = 0.01 ~ 0.02s
- if bucket directory is in main memory
- Dynamic hashing expands the bucket directory when needed (a minimal static-hash sketch follows)
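- A minimal static-hash sketch (sizes are hypothetical; each bucket stands in for one disk block, with Python lists playing the role of overflow chains):

```python
NUM_BUCKETS = 1000
buckets = [[] for _ in range(NUM_BUCKETS)]   # fixed directory, one list per bucket

def bucket_of(key):
    # A good hash function spreads keys uniformly over the directory.
    return hash(key) % NUM_BUCKETS

def insert(key, record):
    buckets[bucket_of(key)].append((key, record))

def lookup(key):
    # One directory probe + one bucket read: constant, ~1-2 page faults.
    return [r for k, r in buckets[bucket_of(key)] if k == key]

insert("user12@gatech.edu", {"BirthYear": 1974})
print(lookup("user12@gatech.edu"))
```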