SLIDE 1

Introduction to Cache Quality of service in Linux Kernel

Vikas Shivappa (vikas.shivappa@linux.intel.com)

SLIDE 2

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement
  • Future Work

SLIDE 3

Without Cache QoS

[Diagram: high-priority and low-priority apps run on cores C1, C2, C3 sharing the processor cache; the low-priority apps may get more of the cache.]

  • Noisy neighbour => degraded/inconsistent response => QoS difficulties
  • Cache contention with multithreading

SLIDE 4

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement
  • Future Work

SLIDE 5

Existing techniques

  • Mostly heuristics on real systems
  • No methodology to identify cache lines belonging to a particular thread
  • Lacks configurability by the OS

SLIDE 6

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement
  • Future Work

SLIDE 7

Why use the QoS framework?

[Diagram: the framework shields threads from the architectural details of ID management and scheduling.]

  • A lightweight yet powerful tool to manage cache
  • Without exposing a lot of architectural details

SLIDE 8

With Cache QoS

  • Help maximize performance and meet QoS requirements
  • In Cloud or Server Clusters
  • Mitigate jitter/inconsistent response times due to ‘Noisy neighbour’

[Diagram: high- and low-priority apps (user space) sit above the Kernel Cache QoS framework (kernel space), which drives the Intel QoS h/w support and the processor cache (hardware); the controls allocate the appropriate cache to high-priority apps.]

SLIDE 9

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement
  • Future Work

SLIDE 10

What is Cache QoS?

  • Cache Monitoring
    – cache occupancy per thread
    – perf interface
  • Cache Allocation
    – user can allocate overlapping subsets of cache to applications
    – cgroup interface

SLIDE 11

Cache lines → Thread ID (Identification)

  • Cache Monitoring

– RMID (Resource Monitoring ID)

  • Cache Allocation

– CLOSid (Class of service ID)

SLIDE 12

Representing cache capacity in Cache Allocation (example)

[Diagram: a capacity bitmask with bits Bn … B1 B0 mapping to cache ways Wk, W(k-1), … W1, W0.]

  • Cache capacity is represented using a 'cache bitmask'
  • However, the mappings are hardware implementation specific
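The bitmask arithmetic can be sketched as follows (illustrative Python, not kernel code; the sketch assumes bit i selects way Wi, but as noted above the real bit-to-way mapping is hardware implementation specific):

```python
def make_mask(num_ways, shift=0):
    """Build a contiguous capacity bitmask covering num_ways cache
    ways, starting at bit position shift."""
    return ((1 << num_ways) - 1) << shift

def ways_from_mask(cbm):
    """List the cache-way indices a capacity bitmask selects,
    under the illustrative assumption that bit i maps to way Wi."""
    return [i for i in range(cbm.bit_length()) if (cbm >> i) & 1]

# e.g. four ways starting at way 2 -> mask 0b111100, ways [2, 3, 4, 5]
```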

SLIDE 13

Bitmask → Class of Service IDs (CLOS)

Default bitmask (all CLOS ids have all of the cache):

        B7 B6 B5 B4 B3 B2 B1 B0
CLOS0    A  A  A  A  A  A  A  A
CLOS1    A  A  A  A  A  A  A  A
CLOS2    A  A  A  A  A  A  A  A
CLOS3    A  A  A  A  A  A  A  A

Overlapping bitmask (only contiguous bits allowed): CLOS0 keeps all eight ways, while CLOS1 is restricted to four ways, and CLOS2 and CLOS3 to two ways each, overlapping subsets of CLOS0's ways.
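The two constraints this example illustrates, contiguity of each mask and overlap between masks, can be sketched like this (illustrative Python, not kernel code):

```python
def is_contiguous(cbm):
    """A valid capacity bitmask must be a single run of set bits
    ("only contiguous bits")."""
    if cbm == 0:
        return False
    # Shift out trailing zeros; a single run of ones plus one is a
    # power of two, so ANDing with itself + 1 gives zero.
    shifted = cbm >> ((cbm & -cbm).bit_length() - 1)
    return ((shifted + 1) & shifted) == 0

def overlaps(cbm_a, cbm_b):
    """Two CLOS ids share cache ways iff their masks intersect."""
    return (cbm_a & cbm_b) != 0
```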

SLIDE 14

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement
  • Future Work

SLIDE 15

Kernel Implementation

[Architecture diagram]

  • User space: threads, with the cgroup fs (/sys/fs/cgroup) and perf as the user interfaces
  • Kernel space: the kernel QoS support, implementing cache allocation and cache monitoring
  • Hardware: Intel Xeon QoS support and the shared L3 cache, programmed via MSRs

The kernel configures the bitmask per CLOS (allocation configuration), sets the CLOS/RMID for a thread during context switch, and reads the event counter to obtain the monitored data.
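Setting the CLOS/RMID for a thread at context switch boils down to one MSR write; the value packing can be sketched as follows (Python toy model, not kernel code; the field layout follows Intel's SDM description of IA32_PQR_ASSOC, but the number of RMID/CLOSid bits actually implemented is model specific):

```python
IA32_PQR_ASSOC = 0xC8F  # MSR address, per the Intel SDM

def pqr_assoc_value(rmid, closid):
    """Pack a thread's RMID (low bits) and CLOSid (bits 63:32)
    into the value the scheduler would write to IA32_PQR_ASSOC
    on context switch."""
    return (closid << 32) | rmid
```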

SLIDE 16

Usage

Monitoring per-thread cache occupancy in bytes; allocating cache per thread through the cache bitmask.

A newly created cgroup inherits from its parent and is exposed to user land: CLOS = Parent.CLOS, bitmask = Parent.bitmask, tasks = empty.
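The cgroup-based usage can be sketched with hypothetical helpers (illustrative Python; the bitmask file name used here is an assumption, and the real interface comes from the cache-allocation patch series):

```python
import os

def create_cache_group(root, name, cbm):
    """Create a cache-allocation cgroup and set its capacity
    bitmask. The file name "intel_rdt.l3_cbm" is an assumption;
    check the patch series for the actual interface."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "intel_rdt.l3_cbm"), "w") as f:
        f.write("{:#x}\n".format(cbm))
    return path

def add_task(group_path, pid):
    """Append a PID to the group's tasks file; the kernel then
    applies the group's CLOS to that thread at context switch."""
    with open(os.path.join(group_path, "tasks"), "a") as f:
        f.write("{}\n".format(pid))
```

On a real system `root` would be the mounted cgroup hierarchy under /sys/fs/cgroup, and a new group would start with its parent's CLOS and bitmask as described above.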

SLIDE 17

Scenarios

  • Units that can be allocated cache:
    – Processes/tasks
    – Virtual machines (transfer all the VM's PIDs to one cgroup)
    – Containers (put the entire container into one cgroup)
  • Restrict the noisy neighbour
  • Fair cache allocation to resolve cache contention
SLIDE 18

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement
  • Future Work

SLIDE 19

Challenges

  • OpenStack usage
  • What if we run out of IDs?
  • What about scheduling overhead?
  • Doing monitoring and allocation together

SLIDE 20

OpenStack usage

[Diagram: applications → OpenStack dashboard → OpenStack services (compute, network, storage) → standard hardware with shared L3 caches; integration is work in progress.]

SLIDE 21

OpenStack usage …

[Diagram: OpenStack drives libvirt and other virt managers, which use the perf syscall to reach the kernel Cache QoS support underneath KVM, Xen, …]

Work has started, but is not yet stable, on adding the changes to Ceilometer (with Qiaowei, qiaowei.ren@intel.com).

SLIDE 22

What if we run out of IDs ?

  • Group tasks together (by process?)
  • Group cgroups with the same mask together
  • Return -ENOSPC
  • Postpone
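Grouping cgroups with the same mask amounts to refcounting CLOSids by bitmask, with -ENOSPC once the hardware ids are exhausted; an illustrative sketch (not the actual kernel code):

```python
ENOSPC = 28  # errno value for "no space left"

class ClosidAllocator:
    """Toy model: cgroups sharing a bitmask share one CLOSid."""

    def __init__(self, num_closids):
        self.free = list(range(num_closids))
        self.by_mask = {}  # cbm -> (closid, refcount)

    def get(self, cbm):
        if cbm in self.by_mask:
            closid, refs = self.by_mask[cbm]
            self.by_mask[cbm] = (closid, refs + 1)
            return closid
        if not self.free:
            return -ENOSPC  # out of hardware ids
        closid = self.free.pop(0)
        self.by_mask[cbm] = (closid, 1)
        return closid

    def put(self, cbm):
        closid, refs = self.by_mask[cbm]
        if refs == 1:
            del self.by_mask[cbm]
            self.free.append(closid)
        else:
            self.by_mask[cbm] = (closid, refs - 1)
```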

SLIDE 23

Scheduling performance

  • MSR read/write costs 250-300 cycles
  • Keep a cache of the last-written value; grouping helps!
  • Don't touch the MSR until the user actually creates a new cache mask
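The caching idea can be modeled as: remember what the CPU already holds and skip the expensive wrmsr when the incoming task would not change it (toy model, not kernel code):

```python
class PqrCache:
    """Per-CPU cache of the last CLOSid written to the MSR."""

    def __init__(self):
        self.current = None
        self.msr_writes = 0  # stand-in for the 250-300 cycle wrmsr

    def switch_to(self, closid):
        # Only pay the wrmsr cost when the association changes.
        if closid != self.current:
            self.current = closid  # wrmsr(IA32_PQR_ASSOC, ...)
            self.msr_writes += 1

cpu = PqrCache()
for closid in [0, 0, 1, 1, 1, 0]:
    cpu.switch_to(closid)
# Only 3 of the 6 context switches touched the MSR.
```

Grouping helps because tasks sharing a CLOSid switch among themselves without ever changing the cached value.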

SLIDE 24

Monitor and Allocate

  • RMID (monitoring) and CLOSid (allocation) are different IDs
  • Want to monitor and allocate the same set of tasks easily
    – perf cannot monitor the cache-alloc cgroup (?)

SLIDE 25

Agenda

  • Problem definition
  • Existing techniques
  • Why use Kernel QoS framework
  • Intel Cache QoS support
  • Kernel implementation
  • Challenges
  • Performance improvement and Future Work

SLIDE 26

Performance Measurement

  • Intel Xeon based server, 16GB RAM
  • 30MB L3, 24 logical processors (LPs)
  • RHEL 6.3
  • With and without cache allocation comparison
  • Controlled experiment
    – PCIe device generating an MSI interrupt; measure the time to respond
    – Also run memory-traffic-generating workloads (noisy neighbour)
  • Experiment does not use the current cache-alloc patch

SLIDE 27

Performance Measurement[1]

[Chart: latency improvements of 2.8x, 1.5x, and 1.3x.]

  • Minimum latency: 1.3x improvement; max latency: 1.5x improvement; avg latency: 2.8x improvement
  • Better consistency in response times, and less jitter and latency with the noisy neighbour

SLIDE 28

Patch status

Cache Monitoring: upstream in 4.1 (Matt Fleming, matt.fleming@intel.com)
Cache Allocation: under review (Vikas Shivappa, vikas.shivappa@intel.com)
Code/Data Prioritization: under review (Vikas Shivappa, vikas.shivappa@intel.com)
OpenStack integration (libvirt update): work started (Qiaowei, qiaowei.ren@intel.com)

SLIDE 29

Future Work

  • Performance improvement measurement
  • Code and data allocation separately
    – First patches shared on lkml
  • Monitor and allocate the same unit
  • OpenStack integration
  • Container usage

SLIDE 30

Acknowledgements

  • Matt Fleming (cache monitoring support, Intel SSG)
  • Will Auld (Architect and Principal Engineer, Intel SSG)
  • CSIG, Intel

SLIDE 31

References

  • [1] http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html

SLIDE 32

Questions?
