CS6210 – Advanced OS Final

30 Apr - by qkim0x01 - In /Operating_System

 

L05. Distributed Systems
  • Outline
    • Distributed System Basics
    • Lamport’s Clock
    • Efficient Communication Software
      • Application Interface to the kernel
      • Inside the kernel
    • End-to-End QoS via Active Networks

 

L05a. Distributed System Definitions
  • Communication time >> Computation time
  • Beliefs
    • Processes are sequential
    • Send time before receive time
  • Happened Before relationship
L05b. Lamport Clock
  • Lamport’s Logical clock
    • Monotonic increase of own event time
      • Ci(a) < Ci(b)
    • Message receipt time greater than send time
      • Ci(a) < Cj(d)
        • Choose Cj(d) = max(Ci(a)+1, Cj)
    • Need of total order
  • Lamport’s Total order
    • order by pid
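The two clock conditions plus the pid tie-break can be sketched as follows (names and structure are mine, not from the lecture):

```python
# Sketch of Lamport's logical clock with the pid tie-break for total order.
# Timestamps are (counter, pid) pairs; tuple comparison gives the total order.
class LamportClock:
    def __init__(self, pid):
        self.pid = pid          # used only to break timestamp ties
        self.time = 0

    def local_event(self):
        self.time += 1          # condition 1: monotonic within a process
        return (self.time, self.pid)

    def send(self):
        return self.local_event()

    def receive(self, ts):
        # Condition 2: receipt time strictly greater than send time,
        # i.e. Cj(d) = max(Ci(a) + 1, Cj + 1).
        self.time = max(ts[0], self.time) + 1
        return (self.time, self.pid)

p0, p1 = LamportClock(0), LamportClock(1)
a = p0.send()                   # event a on P0 -> (1, 0)
d = p1.receive(a)               # event d on P1 -> (2, 1), strictly after a
```

Because timestamps are `(time, pid)` tuples, ordinary tuple comparison yields the total order the mutual-exclusion algorithm below relies on.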
  • Distributed Mutual Exclusive lock algorithm
    • How do I know I have lock?
      • My request on top of queue
      • Received Ack from all others or later lock reqs (ack is coming)
    • Correctness
      • Messages arrive in order
      • No message loss
      • Q’s totally ordered
  • Message Complexity
    • Lock(L)
      • 2(N-1)
    • Unlock(L)
      • (N-1)
    • Better
      • Defer ack if my request precedes yours
        • Combine with unlock => 2(N-1)
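The "do I have the lock?" check above can be sketched with an invented helper (the function name and data shapes are mine):

```python
# Sketch of the lock-grant check in Lamport's distributed mutex:
# I hold the lock when (1) my request is at the head of my local queue,
# totally ordered by (timestamp, pid), AND (2) I have heard from every
# other process -- an ack, or a later request (which implies the ack is coming).
def have_lock(my_pid, queue, heard_from, all_pids):
    queue = sorted(queue)                      # total order by (ts, pid)
    at_head = bool(queue) and queue[0][1] == my_pid
    return at_head and set(all_pids) - {my_pid} <= set(heard_from)

# P0's request (ts=1) precedes P1's (ts=2); P0 has heard from both peers.
q = [(1, 0), (2, 1)]
p0_has_lock = have_lock(0, q, heard_from={1, 2}, all_pids={0, 1, 2})
p1_has_lock = have_lock(1, q, heard_from={0, 2}, all_pids={0, 1, 2})
```

P1 has heard from everyone but is not at the head of the queue, so only P0 may enter the critical section.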
  • Lamport’s physical clock
    • Physical clock conditions
      • 1) bound on individual clock drift
        • Individual drift is very small
      • 2) bound on mutual drift
  • IPC time and clock drift
    • 1) Mutual drift is within IPC time
    • 2) Individual clock drift is negligible compared to IPC time
    • => m >= e / (1-k)
      • Mutual clock drift (e)
      • IPC time (m)
      • Individual clock drift (k) (assume 0)
        • m >= e
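The bound above is just arithmetic, so a tiny sketch (helper name is mine) shows both the general and the simplified case:

```python
# Sketch of the IPC-time bound m >= e / (1 - k): with mutual clock drift e
# and individual drift rate k, anomalies are avoided when the IPC time m
# meets this bound.
def min_ipc_time(e, k=0.0):
    return e / (1.0 - k)

bound_simple = min_ipc_time(5.0)          # k assumed 0: bound reduces to m >= e
bound_drift = min_ipc_time(5.0, k=0.5)    # larger individual drift raises the bound
```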
L05c. Latency Limits
  • Latency vs. Throughput
    • RPC performance
      • hw overhead
      • sw overhead
      • Focus
        • How to reduce sw overhead?
  • Components of RPC latency
    • 1. Client Call
    • 2. Controller latency
    • 3. Time on Wire
    • 4. Interrupt Handling
    • 5. Server setup to execute call
    • 6. Server execution & reply
    • 7. Client setup to receive result & restore
  • Sources of RPC overhead
    • Marshaling
    • Data Copying
    • Control Transfer
    • Protocol Processing
  • Marshaling and data copying
    • Three copies
      • Client, kernel, DMA to controller (hw action)
    • Reduce the copy
      • Direct write to kernel buffer
      • Shared descriptors between client and kernel
        • Contains start address and its length
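A shared descriptor of the kind described above can be sketched like this (the class and layout are my illustration, not the paper's actual structure):

```python
# Sketch of a shared descriptor between client and kernel: the client records
# (start, length) chunks of its own buffer, and the kernel/controller can
# gather the data directly (e.g. via DMA) instead of copying it first.
class SharedDescriptor:
    def __init__(self):
        self.chunks = []                 # list of (start_offset, length)

    def add_chunk(self, start, length):
        self.chunks.append((start, length))

    def gather(self, memory):
        # What the controller would pick up: the described chunks, in order.
        return b"".join(memory[s:s + n] for s, n in self.chunks)

buf = b"HEADERpayload-bytesTRAILER"      # client's buffer, never copied
d = SharedDescriptor()
d.add_chunk(6, 13)                       # describe just the payload region
payload = d.gather(buf)
```

The client only writes a few words of metadata; the bulk data stays in place, which is the copy it saves.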
  • Control Transfer
    • Potentially 4
    • Reduce to 2 by overlap
    • Or reduce it to 1
      • Client doesn’t context switch at all
  • Protocol Processing
    • LAN reliable
      • reduce latency
    • Choices to reduce latency in transport
      • No low level acks
        • Result serves as ack
      • Hardware checksum (not sw checksum)
    • No client buffering since client blocked
      • Just resend if RPC fails
    • Overlap server side buffering with result transmission
L05d. Active Network
  • Primary Issues
    • Route a packet reliably and quickly to dest
    • Intermediate hw routers just do a routing-table look-up for the next hop
    • Can we make the routers route smartly?
      • Make them active
        • Make them execute code to determine the route
          • How & Who can write code?
  • How to implement?
    • OS takes the quality-of-service API as a hint for synthesizing code
    • Problems
      • Changing the OS is not trivial
      • Routers are not open
  • ANTS (Active Node Transfer System) toolkit
    • If a router is not active, just use the IP header
    • Active nodes only at edges
  • Roadblocks
    • Need to buy it from router vendors
    • ANTS software routing cannot match throughput needed in Internet core
  • Pros & Cons
    • Pros
      • Flexibility
      • Potential for reducing network traffic
    • Cons
      • Protection threats
      • Resource management threats
L05e. Systems from Components
  • Use component based approach in large software
    • Can we mimic this?
    • Can we reuse software components?
      • Easier to test and optimize
  • The Big Picture
    • Specification
      • IOA (I/O Automata)
        • C-like
    • Code
      • OCaml
        • Object oriented
        • nice complement to IOA
        • as efficient as C
        • Good for component based design
          • GC
          • Marshaling args
    • Optimize
      • Nuprl
        • Output verified to be functionally equivalent
  • From Spec to Impl
    • using IOA
      • You can’t execute it, but you can verify that the specification meets properties
    • There’s no easy way to show the OCaml impl is the same as the IOA specification
  • From Impl to Opt
    • OCaml unoptimized
    • Nuprl code unoptimized
    • Optimized Nuprl
    • Optimized Ocaml
  • How to optimize protocol stack?
    • Explicit memory management
    • Avoid marshaling across layers
    • Buffering in parallel with transmission
  • Nuprl to the rescue
    • Static Opt
    • Dynamic Opt
      • Collapse layers
      • Generate bypass code
L06. Distributed Object Technology
  • Outline
    • Spring OS
    • Java RMI
    • EJB
L06a. Spring OS
  • Market place needs
    • Large complex server software
  • Object-based design
    • Strong interface
    • Isolation of state of an object from everything else
  • Spring Approach
    • Strong Interface
      • Only expose what services are provided
    • Open, flexible and extensible
      • Integrate third party software into OS
      • IDL
        • Interface definition language
        • You don’t want to be tied to a language
  • Nucleus – m-kernel of Spring
    • Nucleus checks door permission
    • Domain
      • Door ID -> Door table
    • Nucleus
      • -> Door -> target domain
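The door mechanism above can be sketched in miniature (all names here are invented for illustration; Spring's real doors are kernel handles, not Python objects):

```python
# Sketch of Spring's door mechanism: a client domain's door table maps door
# IDs to doors; the nucleus validates the door and transfers control to the
# target domain's entry point.
class Domain:
    def __init__(self):
        self.door_table = {}             # door ID -> door handle

class Nucleus:
    def __init__(self):
        self.doors = {}                  # door handle -> (target domain, entry)

    def register_door(self, door, target_domain, entry_fn):
        self.doors[door] = (target_domain, entry_fn)

    def door_call(self, domain, door_id, arg):
        door = domain.door_table.get(door_id)
        if door is None or door not in self.doors:
            raise PermissionError("no door permission")   # nucleus checks
        _, entry = self.doors[door]
        return entry(arg)                # control transfer to target domain

nuc = Nucleus()
server, client = Domain(), Domain()
nuc.register_door("d1", server, lambda x: x.upper())
client.door_table[0] = "d1"              # client holds door ID 0 for this door
result = nuc.door_call(client, 0, "hi")
```

A domain can only invoke doors present in its own door table, which is how the nucleus enforces the permission check.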
  • Object Invocation Across the network
    • Proxies invisible to client & server
    • Client Domain -> Nucleus B (Door B) -> Proxy B -> Proxy A -> Nucleus A (Door A) -> Server Domain
  • Secure Object Invocation
    • check ACL before invocation
    • e.g., one-time permission to use a printer
  • Nucleus vs. Liedtke Abstractions
    • Nucleus
      • Thread, IPC
    • Liedtke
      • Thread, IPC, AS
  • VM management in Spring
    • Linear AS
    • Memory objects
      • The memory object abstraction allows a region of VM to be associated with a backing file
        • swap space, …
  • Memory object specific paging
    • Pager objects
      • mapping memory objects to the cache representation
      • Multiple pager objects can manage different regions of same AS
        • -> Flexibility and power
  • Summary
    • Object oriented kernel
      • Nucleus
        • Thread & IPC
      • m-kernel
        • nucleus & AS
      • Door & Door Table
        • Basis for cross domain calls
      • Object invocation and cross machine
      • Virtual memory management
        • AS object, memory object, External pages, cache object
  • Motivation of object oriented design
    • Reuse components
    • Open design (through APIs) so that third-party software can be integrated into the kernel
    • Errors in one object do not affect others in the kernel
    • Software maintenance, bug fixing
  • Subcontract
    • Client -> C-stub -> C-subcontract -> S-subcontract -> S-stub -> Server
    • Mechanism to hide the runtime object
    • Contract between client and server is established through IDL
  • Subcontract interface for stubs
L06b. Java RMI
  • Java Distributed Object Model
    • Much of the heavy lifting done under Java distributed runtime
    • Remote Object
      • Accessible from different AS
    • Remote Interface
      • Declaration of methods in a remote object
    • Failure Semantics
    • Similarities / Difference to local object
      • Sim : Object references can be passed
      • Diff : Params only as value/result
  • First choice : Reuse of local implementation
    • Heavy lifting by programmer
  • Second choice: Reuse of “Remote”
    • Heavy lifting done by java runtime
    • Impl is visible right away when it is instantiated
  • RMI Implementation – RRL (Remote Reference Layer)
    • RRL
      • Does all the magic related to the server
        • (singleton or multiple servers), server location
      • (un)Marshaling
      • Similar to subcontract
  • RMI Implementation – Transport
    • Endpt
      • Protection domain
    • Connection Manager
      • of the transport layer of Java Runtime System
L07. Design and Implementation of Distributed Systems
  • Outline
    • GMS
      • How can we use peer memory for paging across the LAN?
    • DSM
      • Can we make the cluster appear like a shared memory machine?
      • Software Implementation of distributed shared memory
      • Create an OS abstraction that provides an illusion of shared memory to the apps
    • DFS
      • How to use cluster memory for cooperative caching of files?
L07a. Global Memory Systems
  • Geriatrics
    • Epoch parameters
      • T(ime) max duration
      • M max replacement
    • Each Epoch
      • Send age of pages info to initiator
      • Receive {Min age, Wi} for all i
    • Initiator for next epoch
      • node with max wi
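The epoch computation above can be sketched as follows (the representation and helper name are my assumptions; the paper's weights w_i are each node's expected share of the M replacements):

```python
# Sketch of the GMS "geriatrics" epoch plan: each node reports the ages of
# its pages to the initiator; the initiator finds the M oldest pages overall,
# derives the cutoff age (min age) and each node's replacement share w_i,
# and hands the next epoch to the node with max w_i.
def plan_epoch(ages_by_node, M):
    pages = sorted(
        ((age, node) for node, ages in ages_by_node.items() for age in ages),
        reverse=True,                    # oldest pages first
    )
    victims = pages[:M]                  # the M pages expected to be replaced
    min_age = victims[-1][0]             # cutoff age for this epoch
    counts = {n: 0 for n in ages_by_node}
    for _, node in victims:
        counts[node] += 1
    w = {n: c / M for n, c in counts.items()}    # w_i: node i's share
    next_initiator = max(w, key=w.get)   # node with max w_i runs next epoch
    return min_age, w, next_initiator

ages = {"A": [10, 3], "B": [9, 8, 1], "C": [2]}
min_age, w, nxt = plan_epoch(ages, M=3)
```

Handing the next epoch to the node with the largest share makes sense because that node will field most of the replacement traffic anyway.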
  • Implementation in Unix
    • GMS integrated with OSF/1
      • Access to anonymous pages & fs-mapped pages goes through GMS on reads
    • Pageout daemon gives GMS info about what to hand off to a remote GMS
  • Data structure
    • PFD (Page Frame Directory)
      • UID -> PFD -> PFN
      • States : local private/ shared/ global private/ on disk
    • GCD (Global Cache Directory)
      • UID -> GCD -> Ni
      • Partitioned hash table
        • -> Dynamic
    • POD (Page Ownership Directory)
      • UID -> POD -> Ni
      • Change when add/remove node
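The lookup chain through these three structures can be sketched with plain dictionaries (the partition function and node names are toy assumptions):

```python
# Sketch of a GMS page lookup: the POD says which node holds the GCD
# partition for a UID; that node's GCD says which node's PFD holds the
# page; the PFD finally maps UID -> page frame number (PFN).
def lookup(uid, pod, gcd_partitions, pfds):
    gcd_node = pod(uid)                      # POD: which node holds this GCD part
    pfd_node = gcd_partitions[gcd_node][uid] # GCD: UID -> node holding the page
    return pfds[pfd_node][uid]               # PFD: UID -> PFN on that node

pod = lambda uid: uid % 2                    # toy partition function
gcd = {0: {4: "N1"}, 1: {7: "N2"}}           # partitioned hash table
pfds = {"N1": {4: 0x2A}, "N2": {7: 0x13}}

pfn = lookup(4, pod, gcd, pfds)
```

Partitioning the GCD this way is what makes it dynamic: adding or removing a node only changes the POD mapping, not every node's tables.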
  • Page Eviction
    • POD -> update(UID, Ni) -> GCD
    • POD -> put_page(UID) -> PFD
    • Pageout daemon aggregates pages to evict and sends them at once
L07b. Distributed Shared Memory
  • Memory Consistency and Cache Coherence
    • Memory Consistency
      • What is the model presented to the programmer?
      • Answers “when” question
        • When a shared mem location is modified, “how soon” is the change going to be visible to other processes?
    • Cache Coherence
      • Answers “how” question
        • “How” is the system implementing the model in the presence of private cache?
        • How sw & hw implement the contract of memory consistency model
  • Sequential Consistency (SC)
    • Each r/w is atomic, and program order should be preserved
  • Sequential Memory model
    • Do not distinguish between data r/w and synch r/w
      • Coherence action on every r/w
  • Release Consistency
  • RC Memory model
    • RC distinguish between data r/w and sync r/w
      • Coherence actions only at lock release
  • Advantage of RC over SC
    • no waiting for coherence actions on every mem access
      • Overlap computation work with communication
      • better performance
  • Lazy RC
    • Coherence action at acquire
    • Lazy pull model
    • pro
      • Fewer messages (no broadcast at release)
      • RC broadcasts every time a lock is released, to sync
      • LRC only needs to sync the data at the acquiring node
    • con
      • more acquire latency
  • Software DSM
    • Global virtual memory abstraction
      • Address space partitioned
      • Address equivalence
        • Accessing memory address x means the same thing on all nodes
      • Distributed ownership
        • Page owner is responsible for the page’s coherence
    • Control DSM in hw has overhead
      • control every mem access
    • Granularity is a page, to cooperate with the OS
    • DSM software Impl layer
      • Implements global virtual mem abstraction
      • Knows the point of access to a page by a processor, and whom to contact (the owner of the page)
      • Coop of OS and DSM
      • Multiple reader, one writer
        • Potential false sharing
  • LRC with multi-writer coherence protocol
    • Lock(L)
      • mod(x,y,z) -> xd,yd,zd
    • Lock(L) <- DSM invalidates pages modified by the previous lock holder
      • r(x) <- Create the current version of x by fetching the page from the owner and xd from the previous lock holder
    • Make a twin copy to compute the diff, and free it afterwards
    • The modified page is write-protected after release
      • You need the lock to write to it
    • Periodically apply diffs by daemon
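The twin/diff step can be sketched at word granularity (the page model and function names are my simplification of the protocol):

```python
# Sketch of the multi-writer twin/diff step in LRC: on the first write to a
# write-protected page, DSM makes a "twin" copy; at release, the diff (only
# the words that changed) is computed against the twin and the twin is freed.
def make_twin(page):
    return list(page)                    # copy taken on the first write fault

def compute_diff(twin, page):
    # {offset: new_value} for changed words only -- this is what gets shipped.
    return {i: v for i, (old, v) in enumerate(zip(twin, page)) if old != v}

def apply_diff(page, diff):
    for i, v in diff.items():
        page[i] = v

page = [0, 0, 0, 0]
twin = make_twin(page)
page[1], page[3] = 7, 9                  # writes inside the critical section
diff = compute_diff(twin, page)          # only offsets 1 and 3 changed
```

Shipping per-word diffs instead of whole pages is what lets multiple writers modify disjoint parts of the same page without false-sharing traffic.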
  • Non page based DSM
    • Library-based
    • Structured DSM

L07c. Distributed File System

  • DFS
    • No central server
    • I/O bandwidth increases cumulatively
    • Distributed metadata management
    • Bigger cumulative mem
    • Cooperation among client memories
  • RAID
    • drawback
      • Cost
      • Small write problem
        • To r/w a small file, you have to access all disks
  • Log structured FS
  • Software RAID
  • xFS
    • Log-based striping
    • Cooperative caching
    • Dynamic management of data & metadata
    • Subsetting storage servers
    • Distributed log cleaning
  • Dynamic management
    • Traditional NFS
      • Hotfile problem
        • Bad for scalability
      • Unix FS is not concerned about file sharing
    • xFS
      • Metadata management dynamically distributed
      • Cooperative client file caching
  • Log-based striping and stripe groups
    • Subset servers into stripe groups
      • Increase availability
      • efficient log cleaning
        • Parallelism
      • Better in failure
  • Cooperative Caching
    • Cache coherence
      • Single writer, multiple readers
      • File block unit of coherence
    • Invalidate all other cached copies when a write req comes in
      • The writing client gets token
      • Manager revokes on other read reqs
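The single-writer token protocol above can be sketched as a tiny manager (class and method names are invented for illustration):

```python
# Sketch of xFS cooperative-cache coherence at file-block granularity:
# a write request makes the manager invalidate all other cached copies and
# hand the write token to the writer; a later read revokes the token.
class BlockManager:
    def __init__(self):
        self.cached_at = set()           # clients holding a read copy
        self.writer = None               # client holding the write token

    def request_write(self, client):
        invalidated = self.cached_at - {client}
        self.cached_at = {client}        # only the writer's copy survives
        self.writer = client             # writer gets the token
        return invalidated               # copies the manager must invalidate

    def request_read(self, client):
        self.writer = None               # manager revokes the write token
        self.cached_at.add(client)

m = BlockManager()
m.request_read("C1")
m.request_read("C2")
revoked = m.request_write("C3")          # C1's and C2's copies get invalidated
```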
  • Log Cleaning
    • After some overwrite ops, aggregate all “live” blocks into a new seg
    • xFS does this in a distributed manner
      • Both clients and servers participate
      • Clients are responsible for knowing the utilization of their segments
      • Each stripe group is responsible for cleaning on the server side
        • Every stripe group has a leader
  • xFS Data structure
    • mmap
    • FileDir
    • i-map
    • stripe group map
  • Client writing a file
    • Update info about the log segment to the manager

 

L08. Failures and Recovery
  • Outline
    • LRVM
      • Persistent memory layer in support of system service
      • Eliminated all the heavyweight machinery associated with transactions
      • Requires synchronous block I/O
        • This is heavyweight
        • How do we eliminate this?
    • RioVista
      • Performance conscious design of persistent memory
    • QuickSilver
      • Making recovery a first-class citizen in OS design
L08a. LRVM
  • Purpose
  • Goal
    • Simplicity, Flexibility, Performance
  • RVM Primitives
    • Initialize
    • map
    • unmap
    • begin_xact
    • set_range
    • end_xact
    • abort_xact
  • Optimization
    • no-store mode in begin_xact()
    • no_flush in end_xact()
  • How Server uses the primitives
    • Init
    • Begin_xact()
    • set_range()
      • create undo
    • write()
    • end_xact()
      • flush redo
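How the primitives fit together can be sketched with a toy in-memory model (this is my illustration of the call sequence, not LRVM's actual implementation; the "segment" stands in for the external data segment on disk):

```python
# Sketch of the LRVM call pattern: set_range() snapshots an undo copy of the
# range; end_xact() records the new values as a redo record and "flushes" it
# to the external data segment; abort_xact() restores the undo copy.
class RVM:
    def __init__(self, external_data_seg):
        self.seg = external_data_seg     # persistent external data segment
        self.mem = None

    def map(self):
        self.mem = list(self.seg)        # map the segment into virtual memory
        return self.mem

    def begin_xact(self):
        self.undo, self.redo = None, []

    def set_range(self, base, nbytes):
        self.undo = (base, self.mem[base:base + nbytes])  # in-memory undo copy

    def end_xact(self):
        base, old = self.undo
        self.redo.append((base, self.mem[base:base + len(old)]))
        for off, new in self.redo:       # flush redo: apply to the segment
            self.seg[off:off + len(new)] = bytes(new)

    def abort_xact(self):
        base, old = self.undo            # throw away changes: restore undo
        self.mem[base:base + len(old)] = old

rvm = RVM(bytearray(b"aaaa"))
mem = rvm.map()
rvm.begin_xact()
rvm.set_range(1, 2)                      # declare the range before writing
mem[1], mem[2] = ord("X"), ord("Y")      # in-place writes to mapped memory
rvm.end_xact()                           # segment now reflects the changes
```

Note that only the declared range is logged, which is why `set_range()` must precede the writes.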
  • Implementation
    • Redo log
      • Forward displacement
        • To know where to append to the log
      • Reverse displacement
        • To traverse log records backwards
    • Crash Recovery
      • Read the redo log from disk and apply it to the external data segment on disk
    • Log truncation
      • Split the log into epochs
      • Works in parallel with forward processing
L08b. Rio Vista
  • Two Problems
    • Power failure
      • Let’s use battery (UPS)
      • Software crash
  • Rio File Cache
    • Program writes go to the file cache via mmap
    • Don’t worry about calling fsync
    • The memory is persistent because it’s battery backed
    • Upshot
      • No need to write on disk for persistency
      • Write-back of files written by applications can be arbitrarily delayed
        • (you don’t have to write .tmp file to disk)
  • Vista-RVM on top of Rio
    • No heavy lifting
    • No disk I/O
    • No redo log
  • Crash Recovery
    • Treat like abort
      • Recover old image from undo
    • Crash during crash recovery?
      • Idempotency of recovery => no problem
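Why a crash during recovery is harmless can be shown in miniature (log shape and names are my assumptions):

```python
# Sketch of Vista-style crash recovery: a crash mid-transaction is treated
# like an abort -- restore the before-image from the undo log kept in
# battery-backed memory. Restoring is idempotent, so a crash *during*
# recovery just means running the same loop again.
def recover(memory, undo_log):
    for base, before_image in undo_log:
        memory[base:base + len(before_image)] = before_image
    return memory

mem = bytearray(b"aXYa")                 # state after a crash mid-transaction
undo = [(1, b"aa")]                      # before-image saved at set_range time
recover(mem, undo)                       # back to the pre-transaction state
recover(mem, undo)                       # re-running changes nothing
```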
  • Simplicity
    • Why?
      • No redo log or truncation code
      • Checkpoint and recovery code simplified
      • No group commit
    • Upshot
      • Simple like LRVM but more performant
      • Make a portion of DRAM persistent
L08c. QuickSilver
  • Cleaning up the state of orphan processes
    • NFS is stateless
      • So it can’t clean up breadcrumbs
  • IPC fundamental to System service
    • Service-q
      • A data structure created by server to service client
      • Any client with knowledge of this queue can make a request
      • Any server can service requests coming to it
      • Globally unique for every service
      • No message lost, no duplicate request
        • = IPC guarantee
    • Async client call
      • Result buffered in kernel, which waits for the client to come back
    • Interchangeable client-server
  • Building Distributed IPC & Transactions
    • Transactions
      • Secret sauce for recovery management
    • Communication Manager
    • Transaction Link
    • Transaction Manager
    • Root TM (owner of transaction)
      • Coord for the transaction tree
    • Client-Server can be unaware of transaction
      • Mechanism is provided by Quicksilver, but it’s up to each service provider
    • No extra overhead for the communication because it’s part of IPC, even though TMs are on different nodes
  • Transaction Management
    • Coord can be different from owner
    • Owner can change ownership
      • If owner goes away, it’s hard to clean
  • Distributed Transaction
    • TM creates a log for checkpointing
    • Efficient communication (tree structure)
    • Even if one node fails, the tree doesn’t abort right away
      • Abort only when coord gives termination request to clean
  • Commit initiated by coordinator
