CS6210 – Advanced OS Final
L05. Distributed Systems
- Outline
- Distributed System Basics
- Lamport’s Clock
- Efficient Communication Software
- Application Interface to the kernel
- Inside the kernel
- End-to-end QoS via Active Networks
L05a. Distributed System Definitions
- Communication time >> Computation time
- Beliefs
- Process sequential
- Send time before receive time
- Happened Before relationship
L05b. Lamport Clock
- Lamport’s Logical clock
- Monotonic increase of own event time
- Ci(a) < Ci(b)
- Message receipt time greater than send time
- Ci(a) < Cj(d)
- Choose Cj(d) = max(Ci(a) + 1, Cj + 1)
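The two clock rules above can be sketched in a few lines (a sketch; the class and method names are illustrative):

```python
class LamportClock:
    """Minimal logical clock: monotonic local ticks, plus the
    max rule on message receipt so Ci(a) < Cj(d) always holds."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        # Ci(a) < Ci(b) for successive events within one process
        self.time += 1
        return self.time

    def send(self):
        # sending is a local event; the timestamp travels with the message
        return self.local_event()

    def receive(self, msg_time):
        # Cj(d) = max(Ci(a) + 1, Cj + 1): receipt time exceeds send time
        self.time = max(msg_time + 1, self.time + 1)
        return self.time
```

For example, if process i sends at time 1 while process j is still at 0, j stamps the receipt at 2, preserving the happened-before order.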
- Need of total order
- Lamport’s Total order
- order by pid
- Distributed mutual exclusion lock algorithm
- How do I know I have lock?
- My request on top of queue
- Received ack from all others, or a later lock request from them (an ack is coming)
- Correctness
- Messages arrive in order
- No message loss
- Q’s totally ordered
- Message Complexity
- Lock(L)
- 2(N-1)
- Unlock(L)
- (N-1)
- Better
- Defer ack if my request precedes yours
- Combine with unlock => 2(N-1)
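The "how do I know I have the lock?" condition can be sketched as a predicate over the local queue (a sketch; all names are illustrative, and queue entries are (timestamp, pid) pairs so ties break by pid, per Lamport's total order):

```python
def holds_lock(my_id, queue, acked_by, later_request_from, peers):
    """True when (1) my request is at the head of my totally ordered
    queue and (2) every other process has either acked my request or
    sent a later-timestamped request (an implicit ack)."""
    if not queue or min(queue)[1] != my_id:
        return False
    return all(p in acked_by or p in later_request_from
               for p in peers if p != my_id)
```

Counting messages: a lock costs 2(N-1) (N-1 requests plus N-1 acks) and an unlock another N-1; deferring acks and folding them into the unlock message brings the total down to 2(N-1).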
- Lamport’s physical clock
- Physical clock conditions
- 1) bound on individual clock drift
- Individual drift is very small
- 2) bound on mutual drift
- IPC time and clock drift
- 1) Mutual drift is within IPC time
- 2) Individual clock drift is negligible compared to IPC time
- => m >= e / (1-k)
- Mutual clock drift (e)
- IPC time (m)
- Individual clock drift (k) (assume 0)
- m >= e
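The bound above reduces to one line of arithmetic (a sketch; e, k, and m as defined in the notes):

```python
def min_ipc_time(e, k):
    """Lower bound on IPC time m such that a message can never appear
    to be received before it was sent, despite clock drift:
    m >= e / (1 - k)."""
    assert 0 <= k < 1, "individual drift rate must be below 1"
    return e / (1 - k)

# Illustrative numbers: with mutual drift e = 1 ms and negligible
# individual drift (k = 0), the bound collapses to m >= e.
print(min_ipc_time(1e-3, 0.0))   # 0.001
```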
L05c. Latency Limits
- Latency vs. Throughput
- RPC performance
- hw overhead
- sw overhead
- Focus
- How to reduce sw overhead?
- Components of RPC latency
- 1. Client Call
- 2. Controller latency
- 3. Time on Wire
- 4. Interrupt Handling
- 5. Server setup to execute call
- 6. Server execution & reply
- 7. Client setup to receive result & restore
- Source of overhead RPC
- Marshaling
- Data Copying
- Control Transfer
- Protocol Processing
- Marshaling and data copying
- Three copies
- Client, kernel, DMA to controller (hw action)
- Reduce the copy
- Direct write to kernel buffer
- Shared descriptors between client and kernel
- Contains start address and its length
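The shared-descriptor idea can be sketched like this (Python stands in for the kernel and DMA machinery; every name here is illustrative):

```python
from collections import namedtuple

# A descriptor shared between client and kernel: a buffer, a start
# offset, and a length, so the kernel never needs an intermediate copy.
Descriptor = namedtuple("Descriptor", ["buf", "offset", "length"])

def dma_gather(descriptors):
    """What the controller does at transmit time: one gather pass
    straight out of the client's buffers."""
    return b"".join(bytes(d.buf[d.offset:d.offset + d.length])
                    for d in descriptors)

args = [bytearray(b"hello"), bytearray(b"world!")]
descs = [Descriptor(args[0], 0, 5), Descriptor(args[1], 0, 6)]
print(dma_gather(descs))   # b'helloworld!'
```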
- Control Transfer
- Potentially 4
- Reduce to 2 by overlap
- Or reduce it to 1
- Client doesn’t context switch at all
- Protocol Processing
- LAN reliable
- reduce latency
- Choices to reduce latency in transport
- No low level acks
- Result serves as ack
- Hardware checksum (not sw checksum)
- No client buffering since client blocked
- Just resend if RPC fails
- Overlap server side buffering with result transmission
L05d. Active Network
- Primary Issues
- Route a packet reliably and quickly to dest
- Intermediate hw routers just do a routing-table look-up for the next hop
- Can we make the routers route smart?
- Make them active
- Make them execute code for the determination
- How & Who can write code?
- How to implement?
- OS takes the Quality of Service API as a hint for synthesizing code
- Problems
- Changing the OS is not trivial
- Routers are not open
- ANTS (Active Node Transfer System) toolkit
- If a router is non-active, just use the IP header
- Active nodes only at edges
- Roadblocks
- Need to buy it from router vendors
- ANTS software routing cannot match throughput needed in Internet core
- Pros & Cons
- Pros
- Flexibility
- Potential for reducing network traffic
- Cons
- Protection threats
- Resource management threats
L05e. System from Components
- Use component based approach in large software
- Can we mimic this?
- Can we reuse software components?
- Easier to test and optimize
- The Big Picture
- Specification
- IOA (I/O Automata)
- C-like
- Code
- OCaml
- Object oriented
- nice complement to IOA
- as efficient as C
- Good for component based design
- GC
- Marshaling args
- Optimize
- Nuprl
- Output verified to be functionally equivalent
- From Spec to Impl
- using IOA
- You can't execute it, but you can verify that the specification meets the desired properties
- There's no easy way to show the OCaml impl is the same as the IOA specification
- From Impl to Opt
- OCaml unoptimized
- Nuprl code unoptimized
- Optimized Nuprl
- Optimized Ocaml
- How to optimize protocol stack?
- Explicit memory management
- Avoid marshaling across layers
- Buffering in parallel with transmission
- …
- Nuprl to the rescue
- Static Opt
- Dynamic Opt
- Collapse layers
- Generate bypass code
L06. Distributed Object Technology
- Outline
- Spring OS
- Java RMI
- EJB
L06a. Spring OS
- Market place needs
- Large complex server software
- Object-based design
- Strong interface
- Isolation of state of an object from everything else
- Spring Approach
- Strong Interface
- Only expose what services are provided
- Open, flexible and extensible
- Integrate third party software into OS
- IDL
- Interface definition language
- You don't want to be tied to a language
- Nucleus – m-kernel of Spring
- Nucleus checks door permission
- Domain
- Door ID -> Door table
- Nucleus
- -> Door -> target domain
- Object Invocation Across the network
- Proxies invisible to client & server
- Client Domain -> Nucleus B (Door B) -> Proxy B -> Proxy A -> Nucleus A (Door A) -> Server Domain
- Secure Object Invocation
- check ACL before invocation
- one-time permission to printer
- Nucleus vs. Liedtke Abstractions
- Nucleus
- Thread, IPC
- Liedtke
- Thread, IPC, AS
- VM management in Spring
- Linear AS
- Memory objects
- The memory object abstraction allows a region of a VM to be associated with a backing file
- swap space, …
- Memory object specific paging
- Pager objects
- mapping memory objects to the cache representation
- Multiple pager objects can manage different regions of same AS
- -> Flexibility and power
- Summary
- Object oriented kernel
- Nucleus
- Thread & IPC
- m-kernel
- nucleus & AS
- Door & Door Table
- Basis for cross domain calls
- Object invocation across machines
- Virtual memory management
- AS object, memory object, external pagers, cache object
- Motivation of object oriented design
- Reuse components
- Open design (through APIs) so that third-party software can be integrated into the kernel
- Errors in one object do not affect the rest of the kernel
- Software maintenance, bug fixing
- Subcontract
- Client -> C-stub -> C-subcontract -> S-subcontract -> S-stub -> Server
- Mechanism to hide the runtime object
- Contract between client and server is established through IDL
- Subcontract interface for stubs
L06b. Java RMI
- Java Distributed Object Model
- Much of the heavy lifting done under Java distributed runtime
- Remote Object
- Accessible from different AS
- Remote Interface
- Declaration of methods in a remote object
- Failure Semantics
- Similarities / Difference to local object
- Sim : Object references can be passed
- Diff : Params only as value/result
- First choice : Reuse of local implementation
- Heavy lifting by programmer
- Second choice: Reuse of “Remote”
- Heavy lifting done by java runtime
- The impl is visible right away when you instantiate it
- RMI Implementation – RRL (Remote Reference Layer)
- RRL
- do all the magic related to server
- (singleton, multiple server), server location
- (un)Marshaling
- Similar to subcontract
- RMI Implementation – Transport
- Endpt
- Protection domain
- Connection Manager
- Part of the transport layer of the Java Runtime System
L07. Design and Implementation of Distributed Systems
- Outline
- GMS
- How can we use peer memory for paging across the LAN?
- DSM
- Can we make the cluster appear like a shared memory machine?
- Software Implementation of distributed shared memory
- Create an OS abstraction that provides an illusion of shared memory on to the apps
- DFS
- How to use cluster memory for cooperative caching of files?
L07a. Global Memory Systems
- Geriatrics
- Epoch parameters
- T(ime) max duration
- M max replacement
- Each Epoch
- Send age of pages info to initiator
- Receive {Min age, Wi} for all i
- Initiator for next epoch
- node with max wi
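The per-epoch bookkeeping can be sketched as follows (a sketch of the geriatrics idea; function and variable names are illustrative):

```python
def plan_epoch(ages_by_node, M):
    """Given each node's page ages, pick the M globally oldest pages
    for replacement this epoch, compute the per-node weights w_i
    (expected evictions), and choose the next initiator as the node
    with max w_i, since it does the most eviction work."""
    all_pages = sorted(((age, node) for node, ages in ages_by_node.items()
                        for age in ages), reverse=True)
    victims = all_pages[:M]          # the M oldest pages cluster-wide
    min_age = victims[-1][0]         # pages at least this old get evicted
    w = {n: 0 for n in ages_by_node}
    for _, node in victims:
        w[node] += 1
    next_initiator = max(w, key=w.get)
    return min_age, w, next_initiator
```

So a node holding mostly "old" (globally idle) pages both absorbs more evictions and inherits the coordinator role for the next epoch.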
- Implementation in Unix
- GMS integrated with OSF/1
- Accesses to anonymous pages & fs-mapped pages go through GMS on reads
- Pageout daemon gives info to GMS about which pages to hand off to remote GMS nodes
- Data structure
- PFD (Page Frame Directory)
- UID -> PFD -> PFN
- States : local private/ shared/ global private/ on disk
- GCD (Global Cache Directory)
- UID -> GCD -> Ni
- Partitioned hash table
- – > Dynamic
- POD (Page Ownership Directory)
- UID -> POD -> Ni
- Change when add/remove node
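The three-table lookup chain can be sketched with dictionaries standing in for the distributed tables (all names and values are illustrative):

```python
# POD maps a page UID to the node whose GCD partition covers it
# (POD changes only when nodes are added or removed); that node's
# GCD maps the UID to the node actually caching the page; that
# node's PFD maps the UID to a physical frame and its state.
POD = {"uid1": "nodeB"}
GCD = {"nodeB": {"uid1": "nodeC"}}                    # partitioned hash table
PFD = {"nodeC": {"uid1": ("pfn42", "global private")}}

def lookup(uid):
    gcd_node = POD[uid]                 # who knows where the page is?
    cache_node = GCD[gcd_node][uid]     # who actually has it?
    pfn, state = PFD[cache_node][uid]   # which frame, in what state?
    return cache_node, pfn, state

print(lookup("uid1"))   # ('nodeC', 'pfn42', 'global private')
```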
- Page Eviction
- POD -> update(UID, Ni) -> GCD
- POD -> put_page(UID) -> PFD
- Pageout daemon aggregates evicted pages and sends them in one batch
L07b. Distributed Shared Memory
- Memory Consistency and Cache Coherence
- Memory Consistency
- What is the model presented to the programmer?
- Answers “when” question
- When a shared mem location is modified, "how soon" is the change going to be visible to other processes?
- Cache Coherence
- Answers “how” question
- “How” is the system implementing the model in the presence of private cache?
- How sw & hw implement the contract of memory consistency model
- Sequential Consistency (SC)
- Each r/w is atomic, and program order is preserved
- Sequential Memory model
- Do not distinguish between data r/w and synch r/w
- Coherence action on every r/w
- Release Consistency
- RC Memory model
- RC distinguish between data r/w and sync r/w
- sync only when lock release
- Advantage of RC over SC
- no waiting for coherence actions on every mem access
- Overlap computation work with communication
- better performance
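A back-of-the-envelope count shows the advantage (an illustrative model, not the actual protocol):

```python
def coherence_actions(n_shared_writes, n_releases, model):
    """SC pays a coherence action (and the wait for it) on every
    shared write; RC batches them and pays only at lock release,
    letting computation overlap communication in between."""
    if model == "SC":
        return n_shared_writes
    if model == "RC":
        return n_releases
    raise ValueError(model)

# e.g. 100 shared writes spread over 2 critical sections:
print(coherence_actions(100, 2, "SC"), coherence_actions(100, 2, "RC"))   # 100 2
```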
- Lazy RC
- Coherence action at acquire
- Lazy pull model
- pro
- Fewer messages (no broadcast at release)
- RC broadcasts every time it releases, to sync
- LRC only needs to sync the data at the acquiring node
- con
- more acquire latency
- Software DSM
- Global virtual memory abstraction
- Address space partitioned
- Address equivalence
- Accessing memory location x means the same on all nodes
- Distributed ownership
- The page owner is responsible for that page's coherence
- Implementing DSM control in hw has overhead
- must track every mem access
- Granularity of a "page" to cooperate with the OS
- DSM software Impl layer
- Implements global virtual mem abstraction
- Knows, at the point of access to a page by a processor, whom to contact: the owner of the page
- Coop of OS and DSM
- Multiple reader, one writer
- Potential false sharing
- LRC with multi-writer coherence protocol
- Lock(L)
- mod(x,y,z) -> xd,yd,zd
- Lock(L) <- DSM invalidate pages modified by previous lock holder
- r(x) <- Create current version of x by fetching the page from owner and xd from previous lock user
- Make twin copy to compare diff, and free it after
- The modified page is write-protected after release
- you need lock to write it
- Periodically apply diffs by daemon
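The twin/diff mechanics can be sketched as follows (a sketch; a real DSM works on whole pages with run-length-encoded diffs):

```python
def make_twin(page):
    """At the first write to a write-protected page inside a critical
    section, the DSM makes a twin (a pristine copy of the page)."""
    return list(page)

def diff(twin, page):
    """At release, compute the diff (changed positions only) against
    the twin; the twin is freed afterwards."""
    return {i: v for i, (old, v) in enumerate(zip(twin, page)) if old != v}

def apply_diffs(page, diffs):
    """At the next acquire, the new lock holder fetches the diffs from
    previous writers and applies them to build the current version."""
    for d in diffs:
        for i, v in d.items():
            page[i] = v
    return page
```

Because only diffs move, multiple writers can modify disjoint parts of the same page without false-sharing ping-pong; a daemon periodically applies outstanding diffs and reclaims them.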
- Non page based DSM
- Library-based
- Structured DSM
L07c. Distributed File System
- DFS
- No central server
- I/O bandwidth increases cumulatively
- Distributed metadata management
- Bigger cumulative mem
- Cooperative caching among client memories
- RAID
- drawback
- Cost
- Small write problem
- To r/w a small file, you have to access all disks
- Log structured FS
- Software RAID
- xFS
- Log-based striping
- Cooperative caching
- Dynamic management of data & metadata
- Subsetting storage servers
- Distributed log cleaning
- Dynamic management
- Traditional NFS
- Hotfile problem
- Bad for scalability
- Unix FS is not concerned about file sharing
- xFS
- Metadata management dynamically distributed
- Cooperative client file caching
- Log-based striping and stripe groups
- Subset servers into stripe groups
- Increase availability
- efficient log cleaning
- Parallelism
- Better in failure
- Cooperative Caching
- Cache coherence
- Single writer, multiple readers
- File block unit of coherence
- Invalidate all other copies when a write request comes in
- The writing client gets token
- Manager revokes on other read reqs
- Log Cleaning
- After some overwrite ops, aggregate all "live" blocks into a new seg
- xFS does this in a distributed fashion
- both clients and servers participate
- Clients are responsible for knowing the utilization of the segments they wrote
- Each stripe group is responsible for cleaning within its own servers
- Every stripe group has a leader
- xFS Data structure
- mmap (manager map)
- FileDir
- i-map
- stripe group map
- Client writing a file
- update info about log file to manager
L08. Failures and Recovery
- Outline
- LRVM
- Persistent memory layer in support of system service
- Eliminated all the heavy weight associated with truncation
- Require block I/O sync
- This is heavy weight
- How do we eliminate this?
- RioVista
- Performance conscious design of persistent memory
- QuickSilver
- Making a recovery a first class citizen in OS design
L08a. LRVM
- Purpose
- Goal
- Simplicity, Flexibility, Performance
- RVM Primitives
- Initialize
- map
- unmap
- begin_xact
- set_range
- end_xact
- abort_xact
- Optimization
- no-store mode in begin_xact()
- no_flush in end_xact()
- How Server uses the primitives
- Init
- Begin_xact()
- set_range()
- create undo
- write()
- end_xact()
- flush redo
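The server-side pattern above can be sketched end to end (primitive names follow the notes; the in-memory segment and log are illustrative stand-ins for mmap'ed state and the disk log):

```python
class RVM:
    """Sketch of LRVM's transaction flow: set_range takes an undo copy
    so abort can roll back; end_xact flushes only redo records for the
    declared ranges, not the whole segment."""
    def __init__(self, segment):
        self.seg = segment          # mapped external data segment
        self.redo_log = []          # stand-in for the persistent redo log
        self.undo = None

    def begin_xact(self):
        self.undo = {}

    def set_range(self, start, length):
        # remember old contents so abort_xact can restore them
        self.undo[(start, length)] = self.seg[start:start + length]

    def end_xact(self):
        # flush redo records: only the ranges declared via set_range
        for (start, length) in self.undo:
            self.redo_log.append((start, self.seg[start:start + length]))
        self.undo = None

    def abort_xact(self):
        for (start, length), old in self.undo.items():
            self.seg[start:start + length] = old
        self.undo = None
```

A no-restore begin skips the undo copy (the caller promises not to abort), and a no-flush end skips the synchronous log force, trading a bounded persistence window for speed.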
- Implementation
- Redo log
- Forward displacement
- To help where to append the log
- Reverse displacement
- Crash Recovery
- Read the redo log from disk and apply it to the external data segment on disk
- Log truncation
- Split log into epochs
- Work in parallel in forward processing
L08b. Rio Vista
- Two Problems
- Power failure
- Let’s use battery (UPS)
- Software crash
- Rio File Cache
- Program writes go to an mmap'ed file cache
- Don’t worry about calling fsync
- The memory is persistent because it’s battery backed
- Upshot
- No need to write on disk for persistency
- Write-back of files written by an application can be arbitrarily delayed
- (you don’t have to write .tmp file to disk)
- Vista-RVM on top of Rio
- No heavy lifting
- No disk I/O
- No redo log
- Crash Recovery
- Treat like abort
- Recover old image from undo
- Crash during crash recovery?
- Idempotency of recovery => no problem
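Recovery-as-abort can be sketched as replaying undo records (a sketch; the undo log layout is illustrative):

```python
def recover(segment, undo_log):
    """Rio Vista-style crash recovery: treat the in-flight transaction
    like an abort by restoring before-images from the undo log, newest
    first. Restoring the same old bytes again is a no-op, so a crash
    during recovery is harmless (recovery is idempotent)."""
    for (start, old) in reversed(undo_log):
        segment[start:start + len(old)] = old
    undo_log.clear()
    return segment
```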
- Very simple
- Why?
- No redo log or truncation code
- Checkpoint and recovery code simplified
- No group
- Upshot
- Simple like LRVM but more performant
- Make a portion of DRAM persistent
L08c. QuickSilver
- Cleaning up state of orphan processes
- NFS is stateless
- So it can't clean up breadcrumbs
- IPC fundamental to System service
- Service-q
- A data structure created by server to service client
- Any client with knowledge of this queue can make a request
- Any server can service requests coming to this queue
- Globally unique for every service
- No message lost, no duplicate request
- = IPC guarantee
- Async client call
- Result buffered in kernel, waiting for the client to come back
- Interchangeable client-server
- Building Distributed IPC & Transactions
- Transactions
- Secret sauce for recovery management
- Communication Manager
- Transaction Link
- Transaction Manager
- Root TM (owner of transaction)
- Coord for the transaction tree
- Client-Server can be unaware of transaction
- Mechanism is provided by Quicksilver, but it’s up to each service provider
- No extra overhead for the communication because it's part of IPC, even though TMs are on different nodes
- Transaction Management
- Coord can be different from owner
- Owner can change ownership
- If owner goes away, it’s hard to clean
- Distributed Transaction
- TM create log for checkpoint
- Efficient communication (tree structure)
- Even though one node fails, the tree doesn’t abort right away
- Abort only when coord gives termination request to clean
- Commit initiated by coordinator