CS6210 – Advanced OS Final
L05. Distributed Systems
- Outline
- Distributed System Basics
- Lamport’s Clock
- Efficient Communication Software
- Application Interface to the kernel
- Inside the kernel
- End-to-end QoS via Active Networks
L05a. Distributed System Definitions
- Communication time >> Computation time
- Beliefs
- Processes are sequential
- Send time before receive time
- Happened Before relationship
L05b. Lamport Clock
- Lamport’s Logical clock
- Monotonic increase of own event time
- Ci(a) < Ci(b)
- Message receipt time greater than send time
- Ci(a) < Cj(d)
- Choose Cj(d) = max(Ci(a)+1, Cj+1)
- Need of total order
- Lamport’s Total order
- order by pid
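As a sketch, the clock-update rules above can be written in Python (a toy illustration; the class and method names are invented):

```python
class LamportClock:
    """Toy Lamport logical clock for one process."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        # Monotonic increase of own event time: Ci(a) < Ci(b)
        self.time += 1
        return self.time

    def send(self):
        # A send is a local event; its timestamp travels on the message
        return self.local_event()

    def receive(self, msg_time):
        # Receipt time must exceed both the send time and the local
        # clock: Cj(d) = max(Ci(a)+1, Cj+1)
        self.time = max(msg_time + 1, self.time + 1)
        return self.time
```

Lamport's total order then breaks timestamp ties by pid, e.g. by comparing (timestamp, pid) tuples.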
- Distributed Mutual Exclusion lock algorithm
- How do I know I have lock?
- My request on top of queue
- Received ack from all others, or a later lock request (an ack is coming)
- Correctness
- Messages arrive in order
- No message loss
- Q’s totally ordered
- Message Complexity
- Lock(L)
- 2(N-1)
- Unlock(L)
- (N-1)
- Better
- Defer ack if my request precedes yours
- Combine with unlock => 2(N-1)
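A minimal sketch of the lock condition and the message counts, assuming the in-order, lossless delivery stated above (all function names are invented):

```python
def has_lock(my_pid, queue, heard_from, n):
    """Sketch of Lamport's mutex condition: I hold the lock iff my
    request is at the top of the totally ordered queue (timestamp
    ties broken by pid) AND I have heard from every other process
    (an ack, or a later lock request implying the ack)."""
    top = min(queue)                      # queue holds (timestamp, pid)
    return top[1] == my_pid and len(heard_from) == n - 1

def messages_per_critical_section(n, deferred_acks=False):
    # Base algorithm: (N-1) requests + (N-1) acks + (N-1) unlocks.
    # Deferring acks and combining them with unlock saves (N-1).
    return 2 * (n - 1) if deferred_acks else 3 * (n - 1)
```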
- Lamport’s physical clock
- Physical clock conditions
- 1) bound on individual clock drift
- Individual drift is very small
- 2) bound on mutual drift
- IPC time and clock drift
- 1) Mutual drift is within IPC time
- 2) Individual clock drift is negligible compared to IPC time
- => m >= e / (1-k)
- Mutual clock drift (e)
- IPC time (m)
- Individual clock drift (k) (assume 0)
- m >= e
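The bound can be checked with one line of arithmetic (a hypothetical helper, with e the mutual drift and k the individual drift rate from the notes):

```python
def min_ipc_time(epsilon, kappa=0.0):
    """Minimum IPC time m satisfying m >= e / (1 - k), so that a
    message's receive timestamp always exceeds its send timestamp
    despite clock drift. With k ~ 0 this reduces to m >= e."""
    assert 0.0 <= kappa < 1.0
    return epsilon / (1.0 - kappa)
```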
L05c. Latency Limits
- Latency vs. Throughput
- RPC performance
- hw overhead
- sw overhead
- Focus
- How to reduce sw overhead?
- Components of RPC latency
- 1. Client Call
- 2. Controller latency
- 3. Time on Wire
- 4. Interrupt Handling
- 5. Server setup to execute call
- 6. Server execution & reply
- 7. Client setup to receive result & restore
- Source of overhead RPC
- Marshaling
- Data Copying
- Control Transfer
- Protocol Processing
- Marshaling and data copying
- Three copies
- Client, kernel, DMA to controller (hw action)
- Reduce the copy
- Direct write to kernel buffer
- Shared descriptors between client and kernel
- Contains start address and its length
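A toy illustration of the shared-descriptor idea in Python (the descriptor layout and function names are invented; real kernels do this with scatter-gather DMA):

```python
from collections import namedtuple

# A buffer descriptor shared between client and kernel: which client
# buffer, where the data starts, and its length.
Descriptor = namedtuple("Descriptor", ["buf", "start", "length"])

def dma_gather(descriptors):
    """What the controller conceptually does with shared descriptors:
    assemble the outgoing message directly from the client's buffers,
    skipping the intermediate kernel copy."""
    return b"".join(d.buf[d.start:d.start + d.length] for d in descriptors)
```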
- Control Transfer
- Potentially 4
- Reduce to 2 by overlap
- Or reduce it to 1
- Client doesn’t context switch at all
- Protocol Processing
- LAN reliable
- reduce latency
- Choices to reduce latency in transport
- No low level acks
- Result serves as ack
- Hardware check sum (not sw check sum)
- No client buffering since client blocked
- Just resend if RPC fails
- Overlap server side buffering with result transmission
L05d. Active Network
- Primary Issues
- Route a packet reliably and quickly to dest
- Intermediate hw routers just do routing-table look-up for the next hop
- Can we make the routers route smart?
- Make them active
- Make them execute code for the determination
- How & Who can write code?
- How to implement?
- OS takes the Quality of Service API as a hint for synthesizing code
- Problems
- Changing the OS is not trivial
- Routers are not open
- ANTS (Active Node Transfer System) toolkit
- If non-active router, just use IPhdr
- Active nodes only at edges
- Roadblocks
- Need to buy it from router vendors
- ANTS software routing cannot match throughput needed in Internet core
- Pros & Cons
- Pros
- Flexibility
- Potential for reducing network traffic
- Cons
- Protection threats
- Resource management threats
L05e. System from Components
- Use a component-based approach for large software
- Can we mimic this?
- Can we reuse software components?
- Easier to test and optimize
- The Big Picture
- Specification
- IOA (I/O Automata)
- C-like
- Code
- OCaml
- Object oriented
- nice complement to IOA
- efficient as C
- Good for component based design
- GC
- Marshaling args
- Optimize
- Nuprl
- Output verified to be functionally equivalent
- From Spec to Impl
- using IOA
- You can't execute it, but you can verify that the specification meets the desired properties
- There's no easy way to show the OCaml Impl is the same as the IOA specification
- From Impl to Opt
- OCaml unoptimized
- Nuprl code unoptimized
- Optimized Nuprl
- Optimized OCaml
- How to optimize protocol stack?
- Explicit memory management
- Avoid marshaling across layers
- Buffering in parallel with transmission
- …
- Nuprl to the rescue
- Static Opt
- Dynamic Opt
- Collapse layers
- Generate bypass code
L06. Distributed Object Technology
- Outline
- Spring OS
- Java RMI
- EJB
L06a. Spring OS
- Market place needs
- Large complex server software
- Object-based design
- Strong interface
- Isolation of state of an object from everything else
- Spring Approach
- Strong Interface
- Only expose what services are provided
- Open, flexible and extensible
- Integrate third party software into OS
- IDL
- Interface definition language
- You don't want to be tied to one language
- Nucleus – m-kernel of Spring
- Nucleus checks door permission
- Domain
- Door ID -> Door table
- Nucleus -> Door -> target domain
- Object Invocation Across the network
- Proxies invisible to client & server
- Client Domain -> Nucleus B (Door B) -> Proxy B -> Proxy A -> Nucleus A (Door A) -> Server Domain
- Secure Object Invocation
- check ACL before invocation
- one-time permission to printer
- Nucleus vs. Liedtke Abstractions
- Nucleus
- Thread, IPC
- Liedtke
- Thread, IPC, AS
- VM management in Spring
- Linear AS
- Memory objects
- The memory object abstraction allows a region of a VM to be associated with a backing file
- swap space, …
- Memory object specific paging
- Pager objects
- mapping memory objects to the cache representation
- Multiple pager objects can manage different regions of same AS
- -> Flexibility and power
- Summary
- Object oriented kernel
- Nucleus
- Thread & IPC
- m-kernel
- nucleus & AS
- Door & Door Table
- Basis for cross domain calls
- Object invocation and cross machine
- Virtual memory management
- AS object, memory object, External pages, cache object
- Motivation of object oriented design
- Reuse components
- Open design (through APIs) so that third-party software can be integrated into the kernel
- Errors in one object do not affect others in the kernel
- Software maintenance, bug fixing
- Subcontract
- Client -> C-stub -> C-subcontract -> S-subcontract -> S-stub -> Server
- Mechanism to hide the runtime object
- Contract between client and server is established through IDL
- Subcontract interface for stubs
L06b. Java RMI
- Java Distributed Object Model
- Much of the heavy lifting done under Java distributed runtime
- Remote Object
- Accessible from different AS
- Remote Interface
- Declaration of methods in a remote object
- Failure Semantics
- Similarities / Difference to local object
- Sim : Object references can be passed
- Diff : Params only as value/result
- First choice : Reuse of local implementation
- Heavy lifting by programmer
- Second choice: Reuse of “Remote”
- Heavy lifting done by java runtime
- Impl is visible right away when you instantiate it
- RMI Implementation – RRL (Remote Reference Layer)
- RRL
- do all the magic related to server
- (singleton, multiple server), server location
- (un)Marshaling
- Similar to subcontract
- RMI Implementation – Transport
- Endpt
- Protection domain
- Connection Manager of the transport layer of the Java Runtime System
L07. Design and Implementation of Distributed Systems
- Outline
- GMS
- How can we use peer memory for paging across the LAN?
- DSM
- Can we make the cluster appear like a shared memory machine?
- Software Implementation of distributed shared memory
- Create an OS abstraction that provides an illusion of shared memory on to the apps
- DFS
- How to use cluster memory for cooperative caching of files?
L07a. Global Memory Systems
- Geriatrics
- Epoch parameters
- T(ime) max duration
- M max replacement
- Each Epoch
- Send age of pages info to initiator
- Receive {Min age, Wi} for all i
- Initiator for next epoch
- node with max wi
- Implementation in Unix
- GMS integrated with OSF/1
- Accesses to anonymous pages & fs-mapped pages go through GMS on reads
- Pageout daemon gives GMS info on which pages to hand to a remote GMS
- Data structure
- PFD (Page Frame Directory)
- UID -> PFD -> PFN
- States : local private/ shared/ global private/ on disk
- GCD (Global Cache Directory)
- UID -> GCD -> Ni
- Partitioned hash table
- -> Dynamic
- POD (Page Ownership Directory)
- UID -> POD -> Ni
- Change when add/remove node
- Page Eviction
- POD -> update(UID, Ni) -> GCD
- POD -> put_page(UID) -> PFD
- Page daemon aggregates evicted pages and sends them at once
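A toy in-memory model of the lookup chain through these directories (the PFD/GCD/POD names come from the notes; the Python representation is invented):

```python
# UID -> GCD -> node -> PFD -> PFN, with POD recording page ownership.
class GmsNode:
    def __init__(self):
        self.pfd = {}          # PFD: UID -> (PFN, state)

class Cluster:
    def __init__(self, node_names):
        self.nodes = {name: GmsNode() for name in node_names}
        self.gcd = {}          # GCD: UID -> node thought to cache it
        self.pod = {}          # POD: UID -> owning node (changes only
                               #      when nodes are added/removed)

    def lookup(self, uid):
        """Resolve a UID to a page frame, or None on a global miss."""
        name = self.gcd.get(uid)
        if name is None:
            return None
        return self.nodes[name].pfd.get(uid)
```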
L07b. Distributed Shared Memory
- Memory Consistency and Cache Coherence
- Memory Consistency
- What is the model presented to the programmer?
- Answers “when” question
- When a shared mem location is modified, how soon is the change visible to other processes?
- Cache Coherence
- Answers “how” question
- “How” is the system implementing the model in the presence of private cache?
- How sw & hw implement the contract of memory consistency model
- Sequential Consistency (SC)
- Each r/w are atomic, and the program order should be preserved
- Sequential Memory model
- Do not distinguish between data r/w and synch r/w
- Coherence action on every r/w
- Release Consistency
- RC Memory model
- RC distinguish between data r/w and sync r/w
- sync only when lock release
- Advantage of RC over SC
- no waiting for coherence actions on every mem access
- Overlap computation work with communication
- better performance
- Lazy RC
- Coherence action at acquire
- Lazy pull model
- pro
- Fewer messages (no broadcast at release)
- RC broadcasts every time it releases, to sync
- LRC only needs to sync the data at the acquiring node on acquire
- con
- more acquire latency
- Software DSM
- Global virtual memory abstraction
- Address space partitioned
- Address equivalence
- Accessing memory address x is the same on all nodes
- Distributed ownership
- Page owner is responsible for the page's coherence
- Controlling DSM in hw has overhead
- controls every mem access
- Granularity is a page, to cooperate with the OS
- DSM software Impl layer
- Implements global virtual mem abstraction
- Knows the point of access to a page by a processor, who to contact, owner of the page
- Coop of OS and DSM
- Multiple reader, one writer
- Potential false sharing
- LRC with multi-writer coherence protocol
- Lock(L)
- mod(x,y,z) -> xd,yd,zd
- Lock(L) <- DSM invalidates pages modified by the previous lock holder
- r(x) <- Create the current version of x by fetching the page from the owner and xd from the previous lock user
- Make a twin copy to compute the diff, and free it afterwards
- The modified page is write-protected after release
- you need the lock to write it
- Diffs are periodically applied by a daemon
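A pure-Python stand-in for the twin/diff mechanism, with a list playing the role of a page (all names invented):

```python
def make_twin(page):
    # On the first write after acquire, the DSM write-protects the page
    # and the fault handler makes a twin copy
    return list(page)

def compute_diff(page, twin):
    # At release: diff the dirty page against its twin and keep only the
    # changed words; the twin can then be freed
    return [(i, new) for i, (new, old) in enumerate(zip(page, twin)) if new != old]

def apply_diff(page, diff):
    # At a later acquire, another node fetches and applies just the diff
    for i, value in diff:
        page[i] = value
    return page
```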
- Non page based DSM
- Library-based
- Structured DSM
L07c. Distributed File System
- DFS
- No central server
- I/O bandwidth increases cumulatively
- Distributed metadata management
- Bigger cumulative mem
- Cooperative caching among client memories
- RAID
- drawback
- Cost
- Small write problem
- To r/w a small file, have to access all disks
- Log structured FS
- Software RAID
- xFS
- Log-based striping
- Cooperative caching
- Dynamic management of data & metadata
- Subsetting storage servers
- Distributed log cleaning
- Traditional NFS
- Hotfile problem
- Bad for scalability
- Unix FS is not concerned about file sharing
- xFS
- Metadata management dynamically distributed
- Cooperative client file caching
- Log-based striping and stripe groups
- Subset servers into stripe groups
- Increase availability
- efficient log cleaning
- Parallelism
- Better in failure
- Cooperative Caching
- Cache coherence
- Single writer, multiple readers
- File block unit of coherence
- Invalidate all other cached copies of the block when a write req comes in
- The writing client gets token
- Manager revokes on other read reqs
- Log Cleaning
- After some overwrite ops, aggregate all "live" blocks into a new seg
- xFS does this in a distributed way
- both clients and servers
- Clients are responsible for knowing the utilization of their segments
- Each stripe group is responsible for cleaning within its servers
- Every stripe group has a leader
- xFS Data structure
- mmap
- FileDir
- i-map
- stripe group map
- Client writing a file
- update the manager with info about the log
L08. Failures and Recovery
- Outline
- LRVM
- Persistent memory layer in support of system service
- Eliminated all the heavy weight associated with truncation
- Require block I/O sync
- This is heavy weight
- How do we eliminate this?
- RioVista
- Performance conscious design of persistent memory
- QuickSilver
- Making recovery a first-class citizen in OS design
L08a. LRVM
- Purpose
- Goal
- Simplicity, Flexibility, Performance
- RVM Primitives
- Initialize
- map
- unmap
- begin_xact
- set_range
- end_xact
- abort_xact
- Optimization
- no-store mode in begin_xact()
- no_flush in end_xact()
- How Server uses the primitives
- Init
- Begin_xact()
- set_range()
- create undo
- write()
- end_xact()
- flush redo
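The server's usage pattern above can be mimicked with a toy in-memory sketch (the real LRVM API is in C; everything here, including the list standing in for the mapped segment, is invented for illustration):

```python
class MiniRVM:
    def __init__(self, segment):
        self.seg = segment        # mapped external data segment (a list)
        self.redo_log = []        # records that end_xact would flush
        self.undo = {}

    def begin_xact(self):
        self.undo = {}

    def set_range(self, start, length):
        # Create the in-memory undo image for the range about to change
        self.undo[(start, length)] = self.seg[start:start + length]

    def end_xact(self):
        # Record the new values of the declared ranges in the redo log
        # (flushing this log is the only forced I/O in LRVM)
        for (start, length) in self.undo:
            self.redo_log.append((start, self.seg[start:start + length]))
        self.undo = {}

    def abort_xact(self):
        # Restore the old image; the undo record is then discarded
        for (start, length), old in self.undo.items():
            self.seg[start:start + length] = old
        self.undo = {}
```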
- Implementation
- Redo log
- Forward displacement
- To find where to append to the log
- Reverse displacement
- Crash Recovery
- Read the redo log from disk and apply it to the external data seg on disk
- Log truncation
- Split the log into epochs
- Truncation works in parallel with forward processing
L08b. Rio Vista
- Two Problems
- Power failure
- Let’s use battery (UPS)
- Software crash
- Rio File Cache
- Program writes go through mmap to the file cache
- Don’t worry about calling fsync
- The memory is persistent because it’s battery backed
- Upshot
- No need to write on disk for persistency
- Write-back of files written by the application can be arbitrarily delayed
- (you don’t have to write .tmp file to disk)
- Vista-RVM on top of Rio
- No heavy lifting
- No disk I/O
- No redo log
- Crash Recovery
- Treat like abort
- Recover old image from undo
- Crash during crash recovery?
- Idempotency of recovery => no problem
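A sketch of why that works, assuming the undo images survive in battery-backed memory (names invented): recovery only re-applies old values, so re-running it after a crash during recovery changes nothing.

```python
def recover(segment, undo):
    """Vista-style undo-only recovery: restore each saved old image.
    Idempotent, because applying the same old values twice is a no-op."""
    for (start, length), old in undo.items():
        segment[start:start + length] = old
    return segment
```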
- Very Simple
- Why?
- No redo log or truncation code
- Checkpoint and recovery code simplified
- No group commit
- Upshot
- Simple like LRVM but more performant
- Make a portion of DRAM persistent
L08c. QuickSilver
- Cleaning up the state of orphan processes
- NFS is stateless
- So it can't clean up breadcrumbs
- IPC fundamental to System service
- Service-q
- A data structure created by the server to service clients
- Any client that knows about it can make a request
- Any server can service requests coming to it
- Globally unique for every service
- No message lost, no duplicate request
- = IPC guarantee
- Async client call
- Result buffered in the kernel, waiting for the client to come back
- Interchangeable client-server
- Building Distributed IPC & Transactions
- Transactions
- Secret sauce for recovery management
- Communication Manager
- Transaction Link
- Transaction Manager
- Root TM (owner of transaction)
- Coord for the transaction tree
- Client-Server can be unaware of transaction
- Mechanism is provided by Quicksilver, but it’s up to each service provider
- No extra overhead for the communication because it's part of IPC, even though TMs are on different nodes
- Transaction Management
- Coord can be different from owner
- Owner can change ownership
- If owner goes away, it’s hard to clean
- Distributed Transaction
- TM create log for checkpoint
- Efficient communication (tree structure)
- If one node fails, the tree doesn't abort right away
- Aborts only when the coordinator issues a termination request to clean up
- Commit initiated by coordinator