Patterns for Cache Optimizations on Multi-Processor Machines
Abstract
- Programmers have to reason about…
- existing cache issues and new cache issues
- neglecting them leads to poorly performing programs that do not scale well
- Illustrate common cache issues
- Show how current tools aid programmers in reliably diagnosing these issues
- Goal
- Incorporating performance issues directly into each pattern as forces
- or creating a supplementary pattern language for general performance optimization
1. Introduction
- The First Rule of Program Optimization
- Don’t do it
- The Second Rule of Program Optimization
- Don’t do it yet!
- Programmers need to be able to reason about their parallel programs
- Understand how to structure their programs
- What algorithms to use
- Which libraries to use
- How to optimize their programs when performance is not as expected
- This paper attempts to answer…
- How much does a programmer need to know about computer architecture?
- to reason about memory optimizations?
- The authors think knowledge about cache lines is sufficient
- What is a useful way to illustrate performance issues?
- By carefully selecting micro-benchmarks that are
- easy to understand
- and clearly illustrate common memory problems
- Do most programmers need to care about optimization?
- The authors demonstrate the performance gains
2. Before We Begin
- Optimizations can be very specific, depending on…
- architecture
- compiler
- OS
- configuration
- Optimization for one scenario might not work in a different scenario
- Well-defined steps for optimizing a program
- Create representative benchmarks (a minimal harness sketch follows this list)
- consider both the program and platform characteristics
- Measure performance
- with monitoring tools
- Identify potential hotspots
- Optimize those hotspots
- Evaluate the impact
- Repeat as necessary
- Knowing what to do for each stage requires both experience and ingenuity
- Different levels can be optimized
- system-level
- application-level
- architecture-level
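- A minimal sketch of steps 1–2 (create a representative benchmark, measure it); the kernel, array size, and run count are illustrative assumptions, not from the paper:

```c
/* Benchmark-harness sketch: time a suspected hotspot with a monotonic
 * clock and keep the best of several runs so one-off noise does not
 * skew the measurement. */
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
static double a[N];

/* Stand-in "hotspot": whatever part of the real program you suspect. */
static double kernel(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    volatile double sink = 0;  /* keeps the compiler from deleting kernel() */
    double best = 1e30;
    for (int run = 0; run < 5; run++) {
        double t0 = now_sec();
        sink += kernel();
        double t1 = now_sec();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("best of 5 runs: %.6f s (sink=%f)\n", best, sink);
    return 0;
}
```

- Hardware-event profilers such as VTune (used by the authors, Section 5) can then attribute cache misses to the hotspot the harness exposes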
3. A View of Computer Memory
- Memory access latency is non-negligible compared to the processor’s speed
- Known as memory wall
- One of the major bottlenecks
- So we rely on caches
- Caches access memory in blocks called cache lines
- Why do caches operate at cache-line granularity? (illustrated in the sketch after this list)
- spatial locality
- Data stored close together tend to be accessed together
- temporal locality
- Recently accessed data is likely to be accessed again soon
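- A small sketch of the two kinds of locality (function names and sizes are illustrative):

```c
#include <stddef.h>

#define N 1024

/* Spatial locality: a[i] and a[i+1] usually sit on the same cache line,
 * so a sequential scan costs roughly one memory fetch per line rather
 * than one per element. */
double sum_sequential(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Temporal locality: scale[] is re-read on every pass over a row, so
 * after the first row it is likely served from cache, not from memory. */
void scale_rows(double a[N][N], const double scale[N]) {
    for (size_t row = 0; row < N; row++)
        for (size_t col = 0; col < N; col++)
            a[row][col] *= scale[col];
}
```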
4. Familiar Problems with Surprising Cache Issues
- Example: each thread runs data_operation on its own elements of shared arrays (a sketch follows this list)
- This example performs badly even though it looks independent
- data_operation reads only the current data and value element
- and writes only to the current data element
- yet it causes false sharing
- The independent elements appear to be shared because they sit on the same cache line
- Data Padding can fix the problem (Section 5.2)
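- A minimal false-sharing sketch in the spirit of the data_operation example (the arrays, thread count, and iteration count are assumptions, not the paper’s actual code):

```c
/* Each thread updates only its own data element, yet adjacent long
 * elements share a 64-byte cache line, so every write invalidates the
 * line in the other cores' caches and the line ping-pongs between them. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 10000000L

long data[NTHREADS];   /* several adjacent longs fit in one cache line */
long value[NTHREADS];

/* Reads only data[i] and value[i], writes only data[i]: logically
 * independent per thread, but not independent at cache-line granularity. */
static void *data_operation(void *arg) {
    long i = (long)arg;
    for (long n = 0; n < ITERS; n++)
        data[i] += value[i];
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        value[i] = i + 1;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, data_operation, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (long i = 0; i < NTHREADS; i++)
        printf("data[%ld] = %ld\n", i, data[i]);
    return 0;
}
```

- Compile with -pthread; the threads never touch each other’s elements, so the slowdown relative to a padded layout is entirely a cache-coherence effect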
5. Patterns
- It is important to have a representative benchmark
- The authors used Intel VTune
5.1 Loop Interchange
5.1.1 Problem
- Bad cache locality because nested loops traverse the array in an order that does not match its memory layout
5.1.3 Forces
- There might be arrays that you are accessing in different orders
- Then you have to make a tradeoff between which array accesses to prioritize for cache performance
- When there are loop-carried dependencies,
- interchanging loops is also an issue of program correctness
5.1.4 Solution
- Make the innermost loop vary over the last dimension of your array (C arrays are row-major; see the sketch after this list)
- This method improves
- Cache locality
- Vectorization
- Compiler optimization
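- A before/after sketch of loop interchange (matrix size and function names are illustrative; C stores arrays row-major, so the last index should vary fastest):

```c
#define N 1024
double a[N][N];

/* Before: the inner loop varies the FIRST index, so consecutive
 * iterations are N * sizeof(double) bytes apart and touch a new cache
 * line almost every access. */
double sum_column_major(void) {
    double sum = 0.0;
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum += a[row][col];
    return sum;
}

/* After interchange: the inner loop varies the LAST index, so
 * consecutive iterations hit consecutive addresses on the same cache
 * line, and the compiler can vectorize the inner loop. */
double sum_row_major(void) {
    double sum = 0.0;
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum += a[row][col];
    return sum;
}
```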
5.2 Data Padding
5.2.1 Problem
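- False sharing, as described in Section 4: logically independent per-thread data ends up on the same cache line
- A minimal sketch of the padding fix, assuming a 64-byte cache line (the struct name is illustrative, not from the paper):

```c
/* Data-padding sketch: give each thread's element a full cache line of
 * its own, so a write by one thread no longer invalidates the line that
 * another thread is reading and writing. */
#include <stdalign.h>

#define CACHE_LINE 64  /* typical x86 line size; verify for your target */
#define NTHREADS 4

struct padded_long {
    alignas(CACHE_LINE) long v;  /* alignment pads the struct to 64 bytes */
};

_Static_assert(sizeof(struct padded_long) == CACHE_LINE,
               "exactly one element per cache line");

struct padded_long data[NTHREADS];
/* data_operation from Section 4 would now update data[i].v */
```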