Patterns for Cache Optimizations on Multi-Processor Machines

13Mar - by qkim0x01 - 1 - In Private Notes
Abstract
  •  Programmers have to reason about..
    • existing cache issues and new cache issues
    • neglecting them leads to poorly performing programs that do not scale well
    • Illustrate common cache issues
  • Show how current tools aid programmers in reliably diagnosing these issues
  •  Goal
    • Incorporating performance issues directly into each pattern as forces
    • or by creating a supplementary pattern language for general performance optimization
1. Introduction
  • The First Rule of Program Optimization
    • Don’t do it
  • The second Rule of Program Optimization
    • Don’t do it yet!
  • They need to be able to reason about their parallel program
    • Understand how to structure their programs
    • What algorithms to use
    • Which libraries to use
    • How to optimize their programs when performance is not as expected
  • This paper is attempting to answer..
    • How much does a programmer need to know about comp arch?
      • to reason mem optimizations?
      • They think knowledge about cache lines is sufficient
    • What is useful way to illustrate performance issues?
      • By carefully select micro-benchmarks
        • easy to understand
        • clearly illustrate common memory problem
    • Do most programmers need to care about optimization?
      • They demonstrate the performance gains
2. Before We Begin
  • Optimizations can be very specific depends on..
    •  architecture
    • Compiler
    • OS
    • configuration
    • Optimization for one scenario might not work in a different scenario
  • Well-defined steps for optimizing a program
    • Create representative benchmarks
      • consider both the program and platform characteristics
    • Measure performance
      • with monitoring tools
    • Identify potential hotspots
    • Optimize those hotspots
    • Evaluate the impact
    • Repeat as necessary
  • Knowing what to do for each stage requires both experience and ingenuity
  • Different levels can be optimized
    •  system-level
    • app-level
    • arch-level
3. A View of Computer Memory
  • Memory access is non-negligible compared to processor’s speed
    • Known as memory wall
    • One of the major bottlenecks
    • So we rely on cache
      • Caches access memory in blocks called cache memory
  • Why od cache operate in cache line granularity?
    • spatial locality
      • Data close together tend to be accessed together
    • temporal locality
      • It is likely to access it in the future too

4. Familiar Problems with Surprising Cache Issues

  • TODO: example
    • This is bad example
      • because data_opeartion reads only the current data and value element
        • and writes only to the current data element
      • cause false sharing
        • The independent elements appear to be shared because they share cache line
    • Data Padding can fix the problem

5. Patterns

  • It is important to have a representative benchmark
    • They used TVune

5.1 Loop Interchange

5.1.1 Problem

  • Bad locality because of nested loop

5.1.3 Forces

  • There might bee arrays that you are accessing in different orders
    • Then you have to make a tradeoff between which array accesses to prioritize for cache performance
  • When they are loop-carried dependencies,
    • It is also issue of program correctness

5.1.4 Solution

  • Innermost loop to vary over the last dimension of your array
  • This method improves
    • Cache locality
    • Vectorization
      • Compiler optimization

5.2 Data Padding

5.2.1 Problem

One thought on “Patterns for Cache Optimizations on Multi-Processor Machines”

Leave a Reply

Your email address will not be published. Required fields are marked *