# CS6400 Test4

###### Normalization

- Why normalize?
- Databases are forever
- EER diagrams may go missing
- You never know what ends up in the database

- What’s it all about?
- Given a relation and a set of functional dependencies
- Given an Email we know the BirthYear
- Email -> BirthYear

- Given an Email and Interest we know the SinceAge
- Email, Interest -> SinceAge

- How do we normalize the relation without information loss and so that the functional dependencies can be enforced?
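These FDs can be checked mechanically against data: X -> Y holds iff no two rows agree on X but differ on Y. A minimal sketch in Python, using made-up rows for illustration:

```python
def holds(rows, X, Y):
    """Check whether the functional dependency X -> Y holds in rows.

    rows: list of dicts; X, Y: lists of attribute names.
    """
    seen = {}
    for r in rows:
        x = tuple(r[a] for a in X)
        y = tuple(r[a] for a in Y)
        if seen.setdefault(x, y) != y:
            return False  # the same X-value maps to two different Y-values
    return True

rows = [
    {"Email": "u1@x.com", "Interest": "Music", "SinceAge": 12, "BirthYear": 1984},
    {"Email": "u1@x.com", "Interest": "Golf",  "SinceAge": 20, "BirthYear": 1984},
    {"Email": "u2@x.com", "Interest": "Music", "SinceAge": 18, "BirthYear": 1991},
]

print(holds(rows, ["Email"], ["BirthYear"]))             # True
print(holds(rows, ["Email", "Interest"], ["SinceAge"]))  # True
print(holds(rows, ["Email"], ["SinceAge"]))              # False
```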

- The rules
- No redundancy of facts
- No cluttering of facts
- Must preserve information
- Must preserve functional dependencies

- Not a relation – NF2
- Multi-value
- Putting all the multi-values into a single entry is not a relation (values are not atomic)

- Repeat the rest of the row for each multi-value -> now it’s a relation

- Relation with problems
- Redundancy
- The same facts are stored multiple times for user1
- Redundancy may cause inconsistency if an entry is updated

- Insertion anomaly
- If we insert a new RegularUser without any Interest, then we must insert NULL values for Interest and SinceAge
- If we insert a pair of BirthYear and Salary, then we must insert NULL values for Email, Interest, SinceAge, and CurrentCity

- Deletion anomaly
- If we delete user12@gatech.edu, then we lose the fact that BirthYear 1974 corresponds to Salary 38000

- Update anomaly
- If we update the CurrentCity of a RegularUser, then we must do it in multiple places
- If I update BirthYear, then we must update Salary

- Information loss
- Try to decompose a table into multiple tables so that the functional dependencies can still be enforced
- If we decompose RegularUser into two relations, then we could get too many rows back when we combine them
- When we combine the tables, we get more tuples than the original is supposed to have
- The extra tuples do not reflect facts in reality
- -> information loss

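The spurious-tuple effect is easy to reproduce. A small sketch (made-up data) that splits a relation on a non-key attribute and joins it back:

```python
# Original relation: (Email, CurrentCity, Interest).
rows = [
    ("u1@x.com", "Austin", "Music"),
    ("u2@x.com", "Austin", "Golf"),
]

# Lossy decomposition: CurrentCity is not a key of either projection.
r1 = {(e, c) for e, c, _ in rows}   # (Email, CurrentCity)
r2 = {(c, i) for _, c, i in rows}   # (CurrentCity, Interest)

# Natural join back on CurrentCity.
joined = {(e, c, i) for e, c in r1 for c2, i in r2 if c == c2}

print(len(rows), len(joined))  # 2 4 — two of the four joined tuples are spurious
```

Both users live in Austin, so the join can no longer tell whose interest was whose: we get extra tuples, and the original facts are lost.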

- Dependency loss
- If we decompose RegularUser into two relations, then we cannot enforce the functional dependencies that are split between the two relations
- Since Email and Salary do not coexist in the same relation, we cannot enforce the dependency between them

- Perfect
- No redundancy
- No inconsistent data

- No insertion anomalies
- Because the facts represented by the functional dependencies are separated into different tables

- No deletion anomalies
- Because you can delete one fact without deleting other facts for a particular user

- No update anomalies
- Because the information is not redundant

- No information loss
- None of the joins will create additional rows

- No dependency loss
- A separate relation for each functional dependency

- But, how do we get there?

- Functional dependencies
- Let X and Y be sets of attributes of R, Y is “functionally dependent” on X in R iff for each x in R.X, there is precisely one y in R.Y
- Email -> CurrentCity
- There is only one CurrentCity for one Email

- Email, Interest -> SinceAge

- Full functional dependencies
- Let X and Y be sets of attributes of R. Y is “fully functionally dependent” on X in R iff Y is functionally dependent on X and Y is not functionally dependent on any proper subset of X
- Email, Interest -> SinceAge
- Fully dependent
- Both Email and Interest are required

- Email, Interest -> CurrentCity is not a full functional dependency
- Because CurrentCity is dependent on Email alone

- Functional Dependencies and Keys
- We use “keys” to enforce full functional dependencies, X->Y
- In a relation, the values of the keys are unique
- That’s why it enforces a function!
- Email, Interest -> SinceAge
- Make (Email, Interest) the key

- Overview of Normal Forms
- A structure with non-atomic (multi-valued) entries is called non-first normal form
- NF2

- 1NF > 2NF > 3NF > BCNF (Boyce-Codd Normal Form)

- Normal Forms – Definitions
- NF2
- non-first normal form

- 1NF
- R is in 1NF iff all domain values are atomic

- 2NF
- R is in 2NF iff R is in 1NF and every nonkey attribute is fully dependent on the key

- 3NF
- R is in 3NF iff R is in 2NF and every nonkey attribute is non-transitively dependent on the key

- BCNF
- R is in BCNF iff every determinant is a candidate key

- Determinant
- A set of attributes on which some other attribute is fully functionally dependent

- “All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF)”

- 1NF -> BCNF flow chart
- 1NF
- Email->CurrentCity
- Email->BirthYear
- Email,BirthYear -> Salary
- Email, Interest -> SinceAge

- BCNF
- Email,Interest -> SinceAge

- 2NF
- Email-> CurrentCity, BirthYear
- Email, BirthYear -> Salary

- 3NF & BCNF
- BirthYear -> Salary

- 3NF & BCNF
- Email -> CurrentCity, BirthYear

- Compute with Functional Dependencies – Armstrong’s rules
- Reflexivity
- if Y is part of X, then X->Y
- Email, Interest -> Interest

- Augmentation
- If X->Y, then WX->WY
- If Email -> BirthYear, then
- Email, Interest->BirthYear, Interest

- Transitivity
- If X->Y and Y->Z, then X->Z
- Email-> BirthYear and BirthYear->Salary, then
- Email->Salary

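Armstrong’s rules are usually applied through the attribute-closure algorithm: starting from X, keep adding the right-hand side of every FD whose left-hand side is already covered. A minimal sketch:

```python
def closure(attrs, fds):
    """Compute the attribute closure attrs+ under a set of FDs.

    fds: list of (lhs, rhs) pairs of frozensets. Adding rhs whenever
    lhs is covered, until a fixed point, is the standard algorithm.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [
    (frozenset({"Email"}), frozenset({"BirthYear"})),
    (frozenset({"BirthYear"}), frozenset({"Salary"})),
]
print(sorted(closure({"Email"}, fds)))  # Salary appears by transitivity
```

The same function settles lossless-join questions: a decomposition into R1 and R2 is lossless iff the closure of R1 ∩ R2 contains all of R1 or all of R2.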

- How to guarantee lossless joins?
- The join field must be a key in at least one of the relations!
- If the join field is a key, there is no way to create spurious tuples

- How to guarantee preservation of FDs?
- The meaning implied by the remaining functional dependencies must stay the same!
- Transitivity
- Email->BirthYear, BirthYear->Salary
- Even though Email->Salary is not listed explicitly, it still holds by transitivity


- Email->CurrentCity, BirthYear, Salary & BirthYear->Salary
- Email->CurrentCity, BirthYear
- BirthYear->Salary
- Lossless?
- Yes: BirthYear is the key of the second relation

- Dependency Preserving?
- Yes: Email->BirthYear->Salary still holds

- 3NF and BCNF
- There exist relations that can only be decomposed to 3NF, but not to BCNF, while remaining lossless and dependency preserving
- This can only happen when the relation has overlapping candidate keys

- It never happens in practice

###### Efficiency

- Computer Architecture
- Main memory
- RAM- volatile, fast, small and expensive

- Secondary memory
- Disk – permanent, slow, big, and cheap

- Applications run by the CPU can only query and update data in main memory
- Data must be written back to secondary memory after it is updated
- Only a tiny fraction of a real database fits in main memory

- Why should we care
- Main memory access time <<< disk memory access time
- Cost computation
- Only disk I/O cost counts
- CPU cost is ignored

- Disk
- 4 platters
- 8 heads
- 512 bytes/sector
- …

- Records
- RegularUser{
- Email varchar(50)
- Sex char(1)
- Birthdate date
- CurrentCity varchar(50)
- Hometown varchar(50)
- …}

- record size: 159 bytes

- Blocks
- Spanned
- the start of a record is stored in one block and the rest continues in the next block

- Unspanned
- leave the leftover bytes unused when they are smaller than a record

- Assume block size: 4K (+metadata)
- filled: ~80% (~3200 bytes)
- records/block: ~20

- Spanned
- Files
- Blocks linked by pointers
- block pointer: 4 bytes
- # records: 4 million
- # blocks: ~200,000
- file size: ~800MB
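These figures follow from simple arithmetic; a quick check in Python (block size and fill factor as assumed in these notes):

```python
# Reproduce the file-sizing arithmetic from the notes.
usable_bytes = 3200            # ~80% of a 4K block
record_size = 159              # bytes per RegularUser record
records_per_block = usable_bytes // record_size   # 20 (unspanned)
n_records = 4_000_000
n_blocks = n_records // records_per_block         # 200,000
file_size_mb = n_blocks * 4 // 1024               # ~781 MB, i.e. ~800 MB
print(records_per_block, n_blocks, file_size_mb)  # 20 200000 781
```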

- Assumptions
- Page fault
- Seek time: 3 ~ 8 ms
- Rotational delay: 2~3 ms
- Transfer Time: 0.5~1 ms
- Total: 5 ~ 12 ms, i.e., ~0.01 sec

- Extent transfers from disk to memory, say 250 blocks at a time, save seek time and rotational delay, but require more buffer space
- with a page fault of 10 ms per block, transferring 250 blocks one at a time costs 2.5 sec
- whereas a single extent transfer costs ~0.26 sec
- the saving comes from paying seek time and rotational delay only once
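The 2.5 sec vs. ~0.26 sec comparison works out as follows (assuming a 10 ms page fault and ~1 ms pure transfer time per sequential block):

```python
page_fault = 0.010   # s: seek + rotational delay + transfer for one block
transfer = 0.001     # s: transfer time alone for one sequential block
blocks = 250

one_at_a_time = blocks * page_fault             # 2.5 s: 250 separate page faults
extent = page_fault + (blocks - 1) * transfer   # ~0.259 s: seek/rotate once, then stream
print(one_at_a_time, round(extent, 3))
```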

- buffer management
- LRU
- excellent for merge joins
- bad for nested loops joins
- MRU might be better

- Heap – unsorted file
- Lookup time (average) :
- N/2 where N = # data blocks
- 200,000/2 * 0.01s = 16.6 min

- Sorted file
- Lookup time:
- log2(N) for 2-ary search
- 18 * 0.01s = 0.18s

- **Primary Index** – good for point queries and great for range queries
- Sorted index for the sorted file
- log2(n) + 1 where n = # index blocks
- fanout = ~60
- Email (50) + pointer (4) = 54 bytes
- 3200 / 54 = ~60 key/pointer pairs per index block

- Sparse
- #index blocks = 3334 = 200,000 / 60
- (log2(200,000/60) + 1) * 0.01 = 0.13s

- Dense (all key/pointer pairs are in the index, not only the first key of each data block)
- #index blocks = 66,666
- (log2(4,000,000/60) + 1) * 0.01 = 0.17s
- Better for finding max, min, or avg over the key values

- **Secondary Index** – point queries only
- Build the index not on the ordering key, but on some other attribute
- Unsorted data, sorted index
- -> cannot be sparse because the data is unsorted
- it has to be dense

- Lookup Time
- log2(n) + 1

- Tricky for non-key fields
- Build buckets of record pointers in scan order, then sort the index on the value

- Multilevel index
- Lookup Time
- log_fanout(n) + 1

- Sparse
- (log60(3334) + 1) * 0.01s = 0.03s

- Need to watch out for overflow (more levels -> more sensitive)
- A change at the bottom can propagate to the upper levels

- A dense bottom level is possible too
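The lookup times quoted in this section all come from the same model: number of block reads × page-fault time. A sketch reproducing them with the numbers from the running example:

```python
from math import ceil, log, log2

page_fault = 0.01     # seconds per block read
n_blocks = 200_000    # data blocks in the file
fanout = 60           # key/pointer pairs per index block

heap = (n_blocks / 2) * page_fault                     # ~1000 s (~16.7 min)
sorted_file = ceil(log2(n_blocks)) * page_fault        # 18 reads -> 0.18 s
sparse_index = (ceil(log2(n_blocks / fanout)) + 1) * page_fault       # ~0.13 s
multilevel = (ceil(log(n_blocks / fanout, fanout)) + 1) * page_fault  # ~0.03 s
print(heap, sorted_file, sparse_index, multilevel)
```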

- Multilevel index: the B+-tree (balanced tree)
- A favorite implementation of a multi-level index
- Insertion, deletion, and update operations are implemented to keep the tree balanced
- Only rarely does an overflow at a lower level of the tree propagate more than 1-2 levels up

- Static Hashing
- Hash key space much larger than address space
- Good hash function:
- Distribute values uniformly over the address space
- Fill buckets as much as possible
- Avoid collisions
- handle collisions with a linked list (chaining)

- Lookup Time
- Constant: (1~2) * 0.01s = 0.01 ~ 0.02s
- if bucket directory is in main memory

- Dynamic hashing expands bucket directory when needed

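A toy sketch of static hashing with chaining: the hash function maps a large key space onto a small, fixed bucket directory, and colliding records go into a per-bucket list (the linked list above). The bucket count and keys are made up:

```python
N_BUCKETS = 1000   # fixed address space, much smaller than the key space

def bucket_of(email):
    """Map a hash-key value onto a bucket address."""
    return hash(email) % N_BUCKETS

# Each bucket chains colliding records in a list.
buckets = [[] for _ in range(N_BUCKETS)]
for email in ["u1@x.com", "u2@x.com", "u3@x.com"]:
    buckets[bucket_of(email)].append(email)

# A lookup reads a single bucket: a constant 1-2 block accesses,
# provided the bucket directory itself fits in main memory.
found = "u2@x.com" in buckets[bucket_of("u2@x.com")]
print(found)  # True
```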