redo logging (finish) / distributed systems 1

idempotency
- logged operations should be okay to do twice = idempotent
- good example: set inode link count to 4
- bad example: increment inode link count
- good example: overwrite inode number X with new value
  - as long as last committed inode value in log is right…
- bad example: allocate new inode with particular contents
- good example: overwrite data block with new value
- bad example: append data to last used block of file
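
The difference between the good and bad link-count examples above can be checked by replaying a logged operation more than once, as recovery might after repeated crashes. A minimal sketch, assuming a hypothetical `struct inode` that holds only a link count; the function names are illustrative and not taken from any real filesystem:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified inode: only the field this example needs. */
struct inode {
    int link_count;
};

/* Idempotent: the log record stores the final value, so replaying is harmless. */
void op_set_link_count(struct inode *ino, int new_count) {
    assert(ino != NULL);
    ino->link_count = new_count;
}

/* Not idempotent: the result depends on how many times the record is replayed. */
void op_increment_link_count(struct inode *ino) {
    assert(ino != NULL);
    ino->link_count += 1;
}

/* Replays "set link count" `times` times and returns the resulting count. */
int replay_set(int start, int new_count, int times) {
    struct inode ino = { start };
    for (int i = 0; i < times; i++)
        op_set_link_count(&ino, new_count);
    return ino.link_count;
}

/* Replays "increment link count" `times` times and returns the resulting count. */
int replay_increment(int start, int times) {
    struct inode ino = { start };
    for (int i = 0; i < times; i++)
        op_increment_link_count(&ino);
    return ino.link_count;
}
```

Starting from a link count of 3, `replay_set(3, 4, 2)` gives the same answer as `replay_set(3, 4, 1)`, while `replay_increment(3, 2)` overshoots.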

redo logging summary
- write intended operation to the log
  - before ever touching ‘real’ data
  - in format that’s safe to do twice
- write marker to commit to the log
  - if exists, the operation will be done eventually
- actually update the real data
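
The three steps above can be sketched with an in-memory stand-in for a disk. This is an illustrative model only: the `struct disk`, the fixed sizes, and all function names are assumptions made for the example, and a real filesystem would also have to order and flush its disk writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_BLOCKS 8      /* hypothetical tiny "disk" */
#define LOG_CAPACITY 4    /* hypothetical log size */

struct log_entry { int block; int new_value; };

struct redo_log {
    struct log_entry entries[LOG_CAPACITY];
    int count;
    bool committed;       /* the commit marker */
};

struct disk {
    int blocks[NUM_BLOCKS];   /* the 'real' data */
    struct redo_log log;
};

void disk_init(struct disk *d) { memset(d, 0, sizeof *d); }

/* Step 1: record the intended overwrite in the log (idempotent format). */
bool log_append(struct disk *d, int block, int new_value) {
    assert(block >= 0 && block < NUM_BLOCKS);
    if (d->log.committed || d->log.count == LOG_CAPACITY) return false;
    d->log.entries[d->log.count].block = block;
    d->log.entries[d->log.count].new_value = new_value;
    d->log.count++;
    return true;
}

/* Step 2: write the commit marker. */
void log_commit(struct disk *d) { d->log.committed = true; }

/* Step 3: update the real data; safe to repeat because entries are overwrites. */
void log_apply(struct disk *d) {
    if (!d->log.committed) return;
    for (int i = 0; i < d->log.count; i++)
        d->blocks[d->log.entries[i].block] = d->log.entries[i].new_value;
}

/* After a crash: redo committed operations, discard uncommitted ones. */
void recover(struct disk *d) {
    if (d->log.committed) log_apply(d);
    d->log.count = 0;
    d->log.committed = false;
}
```

A crash before `log_commit` leaves the real data untouched after `recover`; a crash after it, even in the middle of `log_apply`, ends with the update fully applied.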

redo logging and filesystems
- filesystems that do redo logging are called journalling filesystems

exercise (1)
suppose OS performing operation of appending 100KB to a 100KB file X in directory Y and uses redo logging, ext2-like filesystem with 1KB blocks, 4B block pointers
part 1: what’s modified?
[A] free block map
[B] data blocks for file
[C] indirect blocks for file
[D] data blocks for directory
[E] inode for file
[F] inode for directory
[G] the log

exercise (2)
suppose OS performing operation of appending 100KB to a 100KB file X in directory Y and uses redo logging
part 2: crash happens after writing:
- log entries for entire operation
- free block map changes
- indirect blocks for file
…what is written after restart as part of this operation?
[A] free block map
[B] data blocks for file
[C] indirect blocks for file
[D] data blocks for directory
[E] inode for file
[F] inode for directory
[G] the log

lots of writing?
- entire log can be written sequentially
  - ideal for hard disk performance
  - also pretty good for SSDs
- no waiting for ‘real’ updates
  - application can proceed while updates are happening
  - files will be updated even if system crashes
- often better for performance!

degrees of consistency
- not all journalling filesystems use redo logging for everything
- some use it only for metadata operations
- some use it for both metadata and user data
- only metadata: avoids lots of duplicate writing
- metadata+user data: integrity of user data guaranteed

distributed systems
- multiple machines working together to perform a single task
- called a distributed system

some distributed systems models
- client/server: [figure: client 1, client 2, …, client N-1, client N, each connected to one server]
- peer-to-peer: [figure: nodes 1 through 7 connected to each other]

client/server model
[figure: client sends “GET /index.html” to server; server replies “index.html’s contents are …”]
- client(s): “sometimes on”
  - sends requests to server(s)
  - needs to know how to contact server
- server(s): “always on”
  - responds to client requests
  - never initiates contact with a client

layers of servers?
[figure: web client, web server, application server, database server, ad server]
- web server is also application server’s client

example: Wikipedia architecture
image by Timo Tijhof, via https://commons.wikimedia.org/wiki/File:Wikipedia_webrequest_flow_2015-10.png

example: Wikipedia architecture (zoom)
image by Timo Tijhof, via https://commons.wikimedia.org/wiki/File:Wikipedia_webrequest_flow_2015-10.png

peer-to-peer
- no always-on server everyone knows about
  - hopefully, no one bottleneck (“scalability”)
- any machine can contact any other machine
  - every machine plays an approx. equal role?
- set of machines may change over time

why distributed?
- multiple machine owners collaborating
- delegation of responsibility to other entity
- put (part of) service “in the cloud”
- combine many cheap machines to replace expensive machine
- easier to add incrementally
- redundancy: one machine can fail and system still works?

exercise
which are likely advantages of client/server model over peer-to-peer?
[A] easier to make whole system work despite failure of any machine
[B] easier to handle most machines being offline a majority of the time
[C] better suited to a mix of a few very big/high-performance and many small/low-performance machines

mailbox model
- mailbox abstraction: send/receive messages
- machine A: Send(B, “Hello”)
  - queue of messages waiting to be sent from sending program, holding B: “Hello”
- the network: knows how to get message to B
- machine B: Recv() = “Hello”
  - queue of messages not yet received by receiving program, holding B: “Hello”

what about servers?
- client/server model: server wants to reply to clients
- might want to send/receive multiple messages
- can build this with mailbox idea
  - send a ‘return address’
  - need to track related messages
- common abstraction that does this: the connection

extension: connections
- connections: two-way channel for messages
- extra operations: connect, accept
- machine A: Conn = Connect(B); Send(Conn, “2 + 2 = ?”); “4” = Recv(Conn)
- machine B: Conn = Accept(); “2 + 2 = ?” = Recv(Conn); Send(Conn, “4”)
- messages on the network, in order:
  - B: open connection to A?
  - A: connection to B OK!
  - B: (A, “2 + 2 = ?”)
  - A: (B, “4”)

connections versus pipes
- connections look kinda like two-direction pipes
- in fact, in POSIX will have the same API:
  - each end gets file descriptor representing connection
  - can use read() and write()

connections over mailboxes
- real Internet: mailbox-style communication
  - send packets to particular mailboxes
  - no guarantee on order, when received
  - no relationship between separate messages
- connections implemented on top of this
- full details: take networking (CS/ECE 4457)

connection missing pieces?
- how to specify the machine?
- multiple programs on one machine? who gets the message?

names and addresses
name (logical identifier)           address (location/how to locate)
hostname www.virginia.edu           IPv4 address 128.143.22.36
hostname mail.google.com            IPv4 address 216.58.217.69
hostname mail.google.com            IPv6 address 2607:f8b0:4004:80b::2005
filename /home/cr4bd/NOTES.txt      inode# 120800873 and device 0x2eh / 0x46d
variable counter                    memory address 0x7FFF9430
service name https                  port number 443

hostnames
- typically use domain name system (DNS) to find machine names
- maps logical names like www.virginia.edu
  - chosen for humans
  - hierarchy of names
- …to addresses the network can use to move messages
  - numbers
  - ranges of numbers assigned to different parts of the network
  - network routers know “send this range of numbers this way”

DNS: distributed database
[figure: my machine, my DNS server (ISP’s machine), root DNS server, .edu DNS server, virginia.edu DNS server, cs.virginia.edu]
- my machine asks my DNS server: www.cs.virginia.edu?
  - address for DNS server sent to my machine when it connected to network
- root DNS server answers: try .edu server at …
- the question www.cs.virginia.edu? is repeated down the hierarchy
- final answer: www.cs.virginia.edu = 128.143.67.11
- optimization: cache its address
  - .edu server doesn’t change much
  - check for updated version once in a while

IPv4 addresses
- 32-bit numbers
- typically written like 128.143.67.11
  - four 8-bit decimal values separated by dots
  - first part is most significant
  - same as 128 · 256^3 + 143 · 256^2 + 67 · 256 + 11 = 2 156 872 459
- organizations get blocks of IPs
  - e.g. UVa has 128.143.0.0–128.143.255.255
  - e.g. Google has 216.58.192.0–216.58.223.255 and 74.125.0.0–74.125.255.255 and 35.192.0.0–35.207.255.255
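
The dotted notation above is just a base-256 way of writing one 32-bit number. A small sketch of the conversion; the function name `ipv4_to_u32` is made up for this example, and real programs would normally use the standard `inet_pton()` instead:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parses "a.b.c.d" into one 32-bit number, first part most significant.
 * Returns 0 on success and -1 if the text is not four values in 0..255. */
int ipv4_to_u32(const char *text, uint32_t *out) {
    unsigned a, b, c, d;
    char extra;
    assert(text != NULL && out != NULL);
    /* The trailing %c detects leftover characters after the fourth part. */
    if (sscanf(text, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | (uint32_t)d;
    return 0;
}
```

For "128.143.67.11" this yields 2156872459, matching 128 · 256^3 + 143 · 256^2 + 67 · 256 + 11.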

selected special IPv4 addresses
- 127.0.0.0–127.255.255.255: localhost AKA loopback
  - the machine we’re on
  - typically only 127.0.0.1 is used
- 192.168.0.0–192.168.255.255 and 10.0.0.0–10.255.255.255 and 172.16.0.0–172.31.255.255
  - “private” IP addresses, not used on the Internet
  - commonly connected to Internet with network address translation
  - also 100.64.0.0–100.127.255.255 (but with restrictions)
- 169.254.0.0–169.254.255.255
  - link-local addresses: ‘never’ forwarded by routers

network address translation
- IPv4 addresses are kinda scarce
- solution: convert many private addrs. to one public addr.
- locally: use private IP addresses for machines
- outside: private IP addresses become a single public one
- commonly how home networks work (and some ISPs)

IPv6 addresses
- IPv6 like IPv4, but with 128-bit numbers
- written in hex, 16-bit parts, separated by colons (:)
- strings of 0s represented by double-colons (::)
- typically given to users in blocks of 2^80 or 2^64 addresses
  - no need for address translation?
- 2607:f8b0:400d:c00::6a = 2607:f8b0:400d:0c00:0000:0000:0000:006a = 2607f8b0400d0c00000000000000006a (base sixteen)
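
The expansion of the double-colon form above can be reproduced with the standard `inet_pton()` call, which parses the text into 16 raw bytes. A sketch; the helper names `ipv6_expand` and `ipv6_same_address` are invented for this example:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Writes the fully expanded form (8 groups of 4 hex digits) into out.
 * Returns 0 on success, -1 if text is not a valid IPv6 address. */
int ipv6_expand(const char *text, char out[40]) {
    struct in6_addr addr;
    assert(text != NULL && out != NULL);
    if (inet_pton(AF_INET6, text, &addr) != 1)
        return -1;
    const unsigned char *b = addr.s6_addr;
    for (int i = 0; i < 8; i++) {
        /* Each group is two bytes; snprintf writes 4 hex digits plus a NUL. */
        snprintf(out + i * 5, 5, "%02x%02x", (unsigned)b[2 * i], (unsigned)b[2 * i + 1]);
        out[i * 5 + 4] = (i < 7) ? ':' : '\0';
    }
    return 0;
}

/* Returns 1 if both strings name the same address, 0 if not, -1 on bad input. */
int ipv6_same_address(const char *x, const char *y) {
    char ex[40], ey[40];
    if (ipv6_expand(x, ex) != 0 || ipv6_expand(y, ey) != 0)
        return -1;
    return strcmp(ex, ey) == 0;
}
```

With this, `ipv6_expand("2607:f8b0:400d:c00::6a", buf)` fills `buf` with the long form shown on the slide, and `ipv6_same_address("::1", "0:0:0:0:0:0:0:1")` reports a match.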

selected special IPv6 addresses
- ::1 = localhost
- anything starting with fe80 = link-local addresses
  - never forwarded by routers

IPv4 addresses and routing tables
[figure: router connected to network 1, network 2, network 3]
if I receive data for…              send it to…
128.143.0.0–128.143.255.255         network 1
192.107.102.0–192.107.102.255       network 1
…                                   …
4.0.0.0–7.255.255.255               network 2
64.8.0.0–64.15.255.255              network 2
…                                   …
anything else                       network 3
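
The table above amounts to a range lookup with a default. A minimal sketch using only the four ranges that are visible on the slide; the `struct route` layout and the linear scan are assumptions for illustration (real routers typically match on address prefixes rather than scanning ranges):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds the 32-bit form of a.b.c.d, first part most significant. */
#define IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct route {
    uint32_t first, last;   /* inclusive address range */
    int network;            /* where to send matching data */
};

/* Only the entries shown on the slide; the elided rows are omitted. */
static const struct route ROUTES[] = {
    { IP(128, 143, 0, 0),   IP(128, 143, 255, 255), 1 },
    { IP(192, 107, 102, 0), IP(192, 107, 102, 255), 1 },
    { IP(4, 0, 0, 0),       IP(7, 255, 255, 255),   2 },
    { IP(64, 8, 0, 0),      IP(64, 15, 255, 255),   2 },
};

#define DEFAULT_NETWORK 3   /* "anything else" */

/* Returns the network number that data for dest should be sent to. */
int route_lookup(uint32_t dest) {
    for (size_t i = 0; i < sizeof ROUTES / sizeof ROUTES[0]; i++) {
        assert(ROUTES[i].first <= ROUTES[i].last);
        if (dest >= ROUTES[i].first && dest <= ROUTES[i].last)
            return ROUTES[i].network;
    }
    return DEFAULT_NETWORK;
}
```

For example, `route_lookup(IP(128, 143, 67, 11))` selects network 1, while an address outside every listed range falls through to network 3.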

port numbers
- we run multiple programs on a machine
  - IP addresses identifying machine: not enough
- so, add 16-bit port numbers
  - think: multiple PO boxes at address
- 0–49151: typically assigned for particular services
  - 80 = http, 443 = https, 22 = ssh, …
- 49152–65535: allocated on demand
  - default “return address” for client connecting to server

protocols
- protocol = agreement on how to communicate
- syntax (format of messages, etc.)
  - e.g. mailbox model: where does address go?
  - e.g. connection: where does return address go?
- semantics (meaning of messages: actions to take, etc.)
  - e.g. connection: when to consider connection created?

human protocol: telephone
caller: pick up phone
caller: check for service
caller: dial
caller: wait for ringing
callee: “Hello?”
caller: “Hi, it’s Casey…”
callee: “Hi, so how about …”
caller: “Sure, …”
… …
callee: “Bye!”
caller: “Bye!”
(both) hang up

layered protocols
- IP: protocol for sending data by IP addresses
  - mailbox model
  - limited message size
- UDP: send datagrams built on IP
  - still mailbox model, but with port numbers
- TCP: reliable connections built on IP
  - adds port numbers
  - adds resending data if error occurs
  - splits big amounts of data into many messages
- HTTP: protocol for sending files, etc. built on TCP

other notable protocols (transport layer)
- TLS: Transport Layer Security, built on TCP
  - like TCP, but adds encryption + authentication
- SSH: secure shell (remote login), built on TCP
- SCP/SFTP: secure copy/secure file transfer, built on SSH
- HTTPS: HTTP, but over TLS instead of TCP
- FTP: file transfer protocol
- …

sockets
- socket: POSIX abstraction of network I/O queue
  - any kind of network
  - can also be used between processes on same machine
- a kind of file descriptor

connected sockets
- sockets can represent a connection
- act like bidirectional pipe
client                                   server
(setup connection / get fds)             (setup connection / get fds)
write(fd, buffer, size)                  read(fd, buffer, size)
read(fd, buffer, size)                   write(fd, buffer, size)

echo client/server

    void client_for_connection(int socket_fd) {
        int n; char send_buf[MAX_SIZE]; char recv_buf[MAX_SIZE];
        while (prompt_for_input(send_buf, MAX_SIZE)) {
            n = write(socket_fd, send_buf, strlen(send_buf));
            if (n != strlen(send_buf)) {...error?...}
            n = read(socket_fd, recv_buf, MAX_SIZE);
            if (n <= 0) return; // error or EOF
            write(STDOUT_FILENO, recv_buf, n);
        }
    }

    void server_for_connection(int socket_fd) {
        int read_count, write_count; char request_buf[MAX_SIZE];
        while (1) {
            read_count = read(socket_fd, request_buf, MAX_SIZE);
            if (read_count <= 0) return; // error or EOF
            write_count = write(socket_fd, request_buf, read_count);
            if (read_count != write_count) {...error?...}
        }
    }
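
The echo code above treats a short `write()` as a possible error. One common alternative, shown here only as a sketch and not as part of the slide's code, is to loop until every byte has been written; the helper name `write_fully` is hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keeps calling write() until all `size` bytes are written.
 * Returns 0 on success, -1 on error (errno is left set by write()). */
int write_fully(int fd, const char *buf, size_t size) {
    assert(buf != NULL);
    size_t done = 0;
    while (done < size) {
        ssize_t n = write(fd, buf + done, size - done);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: just retry */
        if (n <= 0)
            return -1;          /* real error, or no progress possible */
        done += (size_t)n;      /* partial write: continue with the rest */
    }
    return 0;
}
```

Because sockets and pipes share the file descriptor API, the same helper works on either, for example `write_fully(socket_fd, send_buf, strlen(send_buf))`.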

sockets and server sockets
- socket() function: create socket fd
- listen(): turn socket into server socket
  - still has a file descriptor, but …
  - can only accept(): create normal socket
- server:
  - ss_fd = socket(…)
  - …
  - bind(ss_fd, addr, …)
  - listen(ss_fd, …)
  - fd = accept(ss_fd, …)
- client:
  - fd = socket(…)
  - connect(fd, addr, …) (request connection)
[figure: client socket and server's socket joined by a connection; server socket shown separately]
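
The call sequence above can be sketched with the standard POSIX socket functions. The choices below are assumptions for illustration: IPv4 only, the loopback address 127.0.0.1, a backlog of 8, and helper names invented for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fills in an IPv4 address structure for 127.0.0.1:port. */
static void loopback_addr(struct sockaddr_in *addr, uint16_t port) {
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = htons(port);
}

/* Server side: socket(), bind(), listen(). Port 0 lets the OS choose a port.
 * Returns the server socket fd, or -1 on error. */
int make_server_socket(uint16_t port) {
    struct sockaddr_in addr;
    int ss_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (ss_fd < 0) return -1;
    loopback_addr(&addr, port);
    if (bind(ss_fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(ss_fd, 8) < 0) {
        close(ss_fd);
        return -1;
    }
    return ss_fd;
}

/* Reports which port a bound socket ended up with, or -1 on error. */
int server_socket_port(int ss_fd) {
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    assert(ss_fd >= 0);
    if (getsockname(ss_fd, (struct sockaddr *)&addr, &len) < 0) return -1;
    return ntohs(addr.sin_port);
}

/* Client side: socket(), connect(). Returns the connected fd, or -1. */
int connect_to_loopback(uint16_t port) {
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    loopback_addr(&addr, port);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The server would then call `accept(ss_fd, NULL, NULL)` to obtain a normal connected socket for each client, and both ends can use `read()` and `write()` on their descriptors.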

connections in TCP/IP
- on network: connection identified by 5-tuple
  - (protocol=TCP, local IP addr., local port, remote IP addr., remote port)
  - used by OS to lookup “where is the file descriptor?”
- both ends always have an address+port
- what is the IP address, port number?
  - set with bind() function
  - typically always done for servers, not done for clients
  - system will choose default if you don’t

connections on my desktop
cr4bd@reiss-t3620 /zf14/cr4bd ; netstat --inet --inet6 --numeric
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
[listing of tcp connections in the ESTABLISHED and TIME_WAIT states; addresses shown include 128.143.67.91, 128.143.67.236, 172.27.98.20, and 127.0.0.1, with ports such as 22, 111, 631, and 2049]

client/server flow (one connection at a time)
- server:
  - create+configure server socket:
    - create server socket
    - bind to host:port
    - start listening for connections
  - setup pair of connection sockets (fd’s):
    - accept a new connection (get connection socket)
  - communicate:
    - read request from connection socket
    - write response to connection socket
  - close connection:
    - close connection socket
- client:
  - create client socket
  - connect socket to server hostname:port (gets assigned local host:port)
  - write request
  - read response
  - close socket
- shown here: client writes first, client/server takes turns
- real world? varies between protocols

client/server flow (multiple connections)
- server:
  - create server socket
  - bind to host:port
  - start listening for connections
  - accept a new connection (get connection socket)
  - spawn new process (fork) or thread per connection
    - read request from connection socket
    - write response to connection socket
    - close connection socket
- client:
  - create client socket
  - connect socket to server hostname:port (gets assigned local host:port)
  - write request
  - read response
  - close socket

backup slides

the xv6 journal
- xv6 log (one transaction):
  - log header (one sector)
    - number of blocks
    - location for first block
    - location for second block
    - …
  - first block (log copy)
  - second block (log copy)
  - …
  - (non-log blocks elsewhere on disk)
- number of blocks in log header:
  - start: num blocks = 0
  - non-0: committed
  - otherwise: not committed or no transaction
- transaction steps:
  1. write changed blocks
  2. write log header, number of blocks = N (commits transaction)
  3. write data
  4. clear log header, number of blocks = 0
  - steps 3 and 4 are redone on recovery (if number of blocks ≠ 0)
  - then ready for next transaction

what is a transaction?
- so far: each file update?
- faster to do batch of updates together
  - one log write finishes lots of things
  - don’t wait to write
- xv6 solution: combine lots of updates into one transaction
- only commit when…
  - no active file operation, or
  - not enough room left in log for more operations

redo logging problems
- doesn’t the log get infinitely big?
- writing everything twice?

limiting log size
- once transaction is written to real data, can discard
  - sometimes called “garbage collecting” the log
- may sometimes need to block to free up log space
  - perform logged updates before adding more to log
- hope: usually log cleanup happens “in the background”
