网络,OS,DB,分布式-一万字知识总结

网络模型

  • 电路交换
    • 不灵活 成本高 需要专有物理线路 需要建立专有连接 线路利用率低 没有转发机制
    • 容易受网络中断影响
  • 包交换
    • 更灵活 成本低 不需要专用线路 可以线路复用 线路利用率高
    • 不容易受网络中断影响
  • 我们的网络:**统计复用包交换网络**(统计复用 = 排队)
    • Internet:不可靠网络 IP-service model / IP best-effort network /最大努力网络
      • 包可以丢失、重复、重排 packets may be lost, duplicated, reordered
      • 无连接最大努力基于目的地转发 connectionless best-effort destination-based forwarding
    • 可靠:保证包能够不丢失 不重复 按序到达 无差错

传输过程和基本术语


  • 包在每个网络节点传输方式:上至下层层封装header 下至上层层解包
    • Message “报文” 应用层的名称
    • Segment “报文段” 起始于传输层的信息单元
    • Datagram “数据报 ” 使用udp等无连接服务网络层信息单元
    • Packet “分组/包” 起始于网络层的信息单元
    • Frame“帧” 起始于链路层的传输单元
    • bit “比特” 物理层单元
  • Links: 链路 连接节点的物理介质
  • 服务/接口: 不同层之间
  • 协议:平级的peer之间
  • 端对端 Client to Server 点对点 otherwise | 应用和传输层端对端
    • E2E总时延 = 排队时延 + 处理时延 + 传输时延 + 传播时延
    • P2P总时延 = 传输时延 + 传播时延
  • 连接建立断开:通常需要多次握手/挥手
  • 两端的 <ip, port>(即四元组)定义一个 tcp stream / udp message
  • 流量控制:保证接收方buffer能承受住 拥塞控制: 保证网络不过于堵塞
  • 可靠:保证包能够不丢失不重复按序到达无差错 重发丢失/损坏的包
  • 根据信息在传输线上的传送方向,分为以下三种通信方式:
    • 单工通信Simplex:单向传输
    • 半双工通信Half-duplex:双向交替传输
    • 全双工通信Duplex:双向同时传输
  • 局域网:多种不同结构
    • 局域网是一种典型的广播信道,主要特点是网络为一个单位所拥有,且地理范围和站点数目均有限。有多种局域网技术,其中以太网占主导地位。可以按照网络拓扑结构对局域网进行分类:星 环 直线
  • 以太网:星形结构局域网 中间使用集线器或交换机连接
  • MAC地址 链路层地址 48位 设备网卡的唯一标识 有多少个适配器=多少个mac地址 os可更换
  • 网络设备
    • Hub 集线器 [layer 1 物理层] 作用于物理层的 能使多个主机 创建一个广播频道(only floods)的设备 具备多个网口,专门实现多台计算机的互联作用。
    • Repeater 中继器**[layer 1 物理层]** 接收并重复信号
    • Switch 交换机 [layer 2 数据链路层] 收处理转发以太网帧到网络中的其他设备 会维护一个<mac link/接口> 表 “交换表” 表类似一个LRU缓存 因此能够实现mac地址识别=说它具有“学习能力”. 支持同时存在多个点对点连接。
    • Bridge 网桥**[layer 2 数据链路层]** 连接两个局域网
    • Router 路由 [layer 3 网络层] 根据routing table提供转发和路由两种功能 转发:将数据包移送到合适输出端 路由:决定数据包的路由路径。
    • Gateway 网关 [layer 3 网络层及以上] 两个不同网络之间的关口设备

网络IO模型

  • 同步
    • 阻塞IO 进程被阻塞让出CPU(相比轮询利用率更高) 内核代替进程完成操作后返回状态码 进程恢复
    • 非阻塞IO 浪费cpu 轮询 不可用则内核返回错误代码
    • IO复用: 让单个进程具有处理多个 I/O 事件的能力。又被称为 Event Driven I/O
      • 早期的实现
        • select 等待固定数量的fd中1个或多个成为就绪状态 适用于实时性要求高的情况
        • poll 等待一组描述符中1个成为就绪状态 适合实时性要求相对宽松情况
      • 更现代的事件驱动实现
        • epoll linux的io事件通知机制 监视多个fd等待任意一个可用 使用红黑树管理监视中的fd 适用于linux上管理大量长连接情况
          • epoll_create创建一个epoll实例 epoll_ctl把一个事件添加到epoll列表 epoll_wait等待一个io事件就绪 否则阻塞calling thread(事件循环示意代码见本节末尾)
          • 默认level triggered: epoll_wait监测到事件发生通知进程 进程可以不处理 下次调用时再处理
          • edge triggered发生事件立即通知 进程要立即处理
    • 信号驱动 使用较少 创建signal handler后非阻塞地等待 数据就绪后 os 发信号代表可以开始io 比非阻塞轮询cpu利用率更高
  • 异步: 进程调用aio_read后立即返回 当io完成后os内核向进程发信号
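
下面是一个 epoll(默认 level-triggered)事件循环的最小示意,对应上文 epoll_create/epoll_ctl/epoll_wait 的配合方式;假设 listen_fd 已由 socket/bind/listen 建好,错误处理从略:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);                      // 创建 epoll 实例
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   // 注册监听 fd

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // 阻塞等待就绪事件
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {          // 新连接到来
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev); // 默认 level-triggered
            } else {
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof buf);
                if (r <= 0) close(events[i].data.fd);      // 对端关闭或出错
                // 否则处理 buf 中的 r 字节;LT 模式下未读完下次 epoll_wait 仍会通知
            }
        }
    }
}
```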

分层架构

  • OSI architecture 7层协议 注重通信协议必要功能
    • 应用层:e2e 用户服务 HTTP FTP DNS DHCP 远程登录 邮件
    • 表示层:数据表示, 压缩, 加密
    • 会话层:建立及管理会话 SSL RPC
    • 传输层:e2e 为应用进程提供端对端通信+在应用层网络层之间multiplexing和demultiplexing TCP UDP
    • 网络层:p2p 无连接通讯,寻址,路由 控制数据在子网中的运行. IP ICMP RIP OSPF BGP ARP
    • 数据链路层: p2p 相邻网络节点/主机的数据的格式化与传输控制 LAN WAN LLC MAC ARP MTU
    • 物理层:点对点比特流传输 01=电流强弱 数模转换 模数转换
  • TCP/IP architecture 4层协议 关注计算机上的实现 保留传输层和网络层 其余上下合并
    • 应用层:应用+表示+会话
    • 传输层,网络层
    • 链路层:数据链路+物理

数据链路层

Data Link Layer 点对点相邻网络节点/主机的数据的格式化与传输控制 e.g. LAN WAN LLC MAC ARP MTU

  • 包装成帧Framing 把网络层传下来的包封装成帧 加个开始和结束
  • 点对点传输控制P2P transmission control: Logical link control (LLC)
    • Error Detection CRC checksum
    • Flow Control 流量控制(链路层的流控主要用于无线网络)
  • 广播控制Broadcast: Media access control (MAC)
    • Frame synchronization 帧的同步 clock based, character counting, byte stuffing. STX ETX
    • Channel sharing 信道共享方法:
      • 信道复用:时分,频分,码分
      • 交替:轮询,令牌传递
      • 随机访问 主要例子:Aloha, Ethernet
        • Ethernet MAC采用CSMA/CD协议 (Carrier Sense Multiple Access / Collision Detection) 载波监听多点接入/碰撞检测
          • 线路空闲立即发送 (Carrier Sense 载波监听)
          • 如果繁忙等待 “inter-frame gap” = 96 bit time
          • 冲突发现,发送 jam signal,进行二进制指数退避:第n次冲突从 {0,1,2,…,2^n−1} 中随机选 k,延迟 k*51.2 μs。collision domain 冲突域 = 1 RTT
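
二进制指数退避的计算可以用几行代码说明(仅为示意;51.2 μs 为以太网争用期,指数在第10次冲突后封顶是 802.3 的常见处理):

```c
#include <stdlib.h>

// 第 n 次冲突后:从 {0,1,...,2^min(n,10) - 1} 随机取 k,延迟 k 个 51.2 μs 争用期
double backoff_delay_us(int n_collisions) {
    int cap = n_collisions < 10 ? n_collisions : 10;   // 指数封顶
    long k = random() % (1L << cap);                   // k ∈ [0, 2^cap)
    return k * 51.2;                                   // 单位:微秒
}
```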

网络层

点对点无连接通讯,寻址,路由 控制数据在子网中的运行. 涉及协议: IP, ARP, ICMP, RIP, OSPF, BGP IP是个不可靠的协议因为无连接 它的可靠性需要通过上层如TCP实现

  • 寻址
    • IP: 沙漏结构的中点,连接异构网络,使之看起来统一
      • 地址系统
        • Class-based addressing (过去版本): 导致路由表中网络条目过多的问题
        • Subnetting and supernetting 子网与超网
          • 子网 方便管理和路由效率 subnet ip = subnet mask & host ip address
          • 超网 用于跨网路由 CIDR 无分类跨网地址系统
            • ip地址=网络前缀+主机号 128.14.35.7/20 表示前 20 位为网络前缀。
            • 意义:减少了路由表项 查找时采取最长前缀匹配原则选择哪一个(示意代码见本节末尾)
  • 路由
    • ARP:Broadcast & Responds路由器确定设备MAC
      • routers and hosts maintain a dynamic <IP, MAC> LRU cache called the ARP table. For an unknown IP, the router broadcasts an ARP request, and the host with that IP address replies with its MAC
    • Routing types and protocols
      • 网内路由Intra-domain
        • RIP Routing Information Protocol: rely on local computation in all nodes
          • Distance Vector Protocol (based on Bellman-ford)
        • OSPF Open Shortest Path First Protocol: no local computation; faster loop-free convergence (requires nonnegative edge weights)
          • Link state Protocol (based on Dijkstra’s shortest path)
      • 跨网路由Inter-domain: BGP Border Gateway Protocol
  • 报文的拆分重组 Fragmentation/Reassembly: 使得异构网络能够以最大传输大小传输包
    • 和tcp合作 tcp负责mtu discovery
      • TCP MTU Discovery: 置DF位后不断增大发送数据包大小,直到收到“数据包过大需要分片”的ICMP响应,从而得出路径mtu
    • 和数据链路层合作 mtu协议定义链路支持最大传输大小
  • 错误报告和控制 ICMP:**封装在 IP 数据报中**,但是不属于高层协议。
  • ping 用来测试两台主机之间的连通性 通过icmp echo实现
  • Traceroute 追踪一个数据包路径: 封装一个无法交付的udp包,利用IP协议的**“ttl”字段**:ttl从1开始逐跳递增,让从每个网关到某个主机路径上的节点依次引发ICMP超时响应。
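
CIDR 查找所用的最长前缀匹配可以用如下线性扫描示意(真实路由器会用 trie 等结构加速;struct route 等命名均为虚构):

```c
#include <stdint.h>

struct route { uint32_t prefix; int len; int out_port; };   // <前缀, 前缀长度, 出口>

// 为目的地址 dst 选择前缀最长的匹配项,返回出口(-1 = 无匹配,交给默认路由)
int lookup(const struct route *table, int n, uint32_t dst) {
    int best_port = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = table[i].len == 0 ? 0 : ~0u << (32 - table[i].len);
        if ((dst & mask) == (table[i].prefix & mask) && table[i].len > best_len) {
            best_len = table[i].len;
            best_port = table[i].out_port;
        }
    }
    return best_port;
}
```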

传输层TCP/UDP

端对端为应用进程提供端到端的通信服务 在应用层和网络层之间multiplexing和demultiplexing e.g. TCP UDP

  • UDP: bare-bone protocol
    • 无连接 不可靠 无流量拥塞控制 无时限吞吐量安全保证
    • 但更可控更自由 可以在应用层实现可靠传输: RUDP
    • UDP header:src port, dest port, header length, checksum
  • TCP: 面向连接 可靠 流量拥塞控制 基于字节流 全双工 差错校验
    • TCP header:
      • SRC,DST ports (16-bit each)
      • Sequence #序号, Ack #确认号(32-bit each) 序号:当前数据段第一个字节编号 tcp要用序号拼接数据 确认号:期望下个数据段第一个字节编号
      • Header length(data offset), reserve
      • Flags(indicate pkt types): SYN FIN ACK URG(紧急指针有效) PSH(不在接收方缓冲区等待) RST(重置连接)
      • Receive window (16-bit 本端接收窗口的剩余大小,告知对方) 用于流量控制flow control
      • Check sum: for error detection
      • Urgent Ptr: 与URG标志配合指出紧急数据的位置,多数情况被忽略
      • Options
    • 全双工:建立连接后可以双向收发数据
    • 以连接为导向: 意味着需要主动建立连接
    • TCP提供包级别的可靠传输 (IP最大努力网络不能保证)
      • UDP只提供位级别的可靠传输 由checksum实现 只能进行简单的检测看看数据是否污染
    • 如何实现可靠传输:不丢失 不重复 按序到达 无差错
      • 不丢失
        • 确定包收到了:ack机制
        • 确定包没收到:计时器/超时检测机制Retransmission Timeout 超时标准RTO要比RTT稍多一点且尽量接近RTT;重传时按karn’s algorithm指数退避:RTO新 = RTO旧*2
        • 丢包补救:快重传(3个重复ACK立即重传)+ 超时重传
      • 不重复 按序到达
        • 滑动窗口
      • 无差错
        • Check sum
    • 提供流量控制:确保接收方的buffer不会overflow
      • receiver发送的ack报文中的receiver window表示自己仍可缓存的容量 若超过这个限制 sender必须等待receiver的ack和更新。将窗口字段设置为 0,则sender停止发送数据。这实际上调整了sender发送窗口大小和发送速率。
    • 提供拥塞控制:与整个网络有关, 网络比较拥挤的时候控制发送方的窗口。增加一个congestion window CWND
      • send window (#unacknowledge pkts in flight) = min(CWND, RWND) 当cwnd更小时,我们就进入了一个拥塞控制的模式
        • 慢开始与拥塞避免
          • cwnd := 1, cwnd++ for each ack == cwnd*=2 each RTT 慢慢探测网络容量 指数增长 但是又被临界值限制 到达临界值后线性增长避免拥塞
          • cwnd >= ssthresh do congestion avoidance: cwnd++ for each RTT;
        • 快重传和快恢复: 快重传解决丢包重发问题(3个重复ACK=包丢失,从接收方期望的下一个序号立即重传);快恢复将临界值减半后线性增长拥塞窗口,适应性地避免网络拥塞:ssthresh = cwnd/2, cwnd = ssthresh,直接进入拥塞避免(状态演化的示意代码见下)
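
把上面慢开始/拥塞避免/快重传快恢复的规则写成状态更新函数,大致如下(单位为MSS,省略计时器与ACK计数等细节,仅为示意):

```c
typedef struct { double cwnd, ssthresh; } tcp_cc;

void on_ack(tcp_cc *c) {                 // 每收到一个新 ACK
    if (c->cwnd < c->ssthresh)
        c->cwnd += 1;                    // 慢开始:每个 ACK +1,相当于每 RTT 翻倍
    else
        c->cwnd += 1.0 / c->cwnd;        // 拥塞避免:累计起来每 RTT 约 +1
}

void on_timeout(tcp_cc *c) {             // 超时:视为严重拥塞
    c->ssthresh = c->cwnd / 2;
    c->cwnd = 1;                         // 重新慢开始
}

void on_triple_dupack(tcp_cc *c) {       // 3 个重复 ACK:快重传 + 快恢复
    c->ssthresh = c->cwnd / 2;
    c->cwnd = c->ssthresh;               // 直接进入拥塞避免
}
```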

三次握手 四次挥手

  • 为什么三次握手:TCP是一个全双工通信协议,也就是说它是双向收发的。初始序号两边都要随机选择:因为安全问题要避免TCP sequence prediction attack。所以双方都要告诉对方自己的初始序号(也就是通信双方数据原点的位置),所以至少要4次握手:SYN ACK SYN ACK;又因为tcp header的flags里SYN和ACK可以同时为1,中间两步可以合并,所以3次就够。

  • 第三次握手过程中sender可以piggyback data,而receiver只能在这次握手结束后才可以。

  • 在linux socket编程中,客户端执行connect()时,将触发三次握手。

  • 为什么四次挥手:因为tcp全双工+tcp半关闭。假如client主动关闭,那么直到client收到server的ack前,数据传输仍然可以从server到client进行;tcp是半关闭的:client到server方向关了但server到client方向还没关,这也是close-wait状态存在的原因:服务器端要等待最后这一波数据传输完成。这也解释了中间两次为什么不能像建立连接一样合并。当服务器没有数据要传时,它就可以通过fin来释放自己方向的连接了。

  • TIME_WAIT / 2MSL等待状态:

    • 确保server收到ack
      • server接受了ack不需要TIME_WAIT,因为它已经完成同步了可以释放资源了。
      • client必须等待确定server收到ack,否则client直接关闭server会可能收不到ack无法正常关闭。
      • ack最久要1MSL到达server或者最坏情况没收到。取上界=1MSL
      • 等待server回复的超时重传消息=最坏情况又1MSL
      • 最坏要2MSL client才知道ack有没有到达
    • 避免新旧连接混淆
  • 在linux socket编程中,任何一方执行close()操作即可产生挥手操作。

滑动窗口

  • TCP使用滑动窗口实现流量控制和乱序重排和去重 (虽然tcp byte-oriented 以下为了方便说明 改为包/sequence 而不是字节)
    • 流量控制:
      • Receive window用于接收方通知发送方自己还有可用缓冲区大小,发送方据此调整发送数据多少, 从而保证流量控制
    • sender缓存:已发送但未确认的包
    • receiver缓存:未按序到达的包
    • sender发送流水线化 提高信道利用率
    • receiver回复cumulative ack:
      • receiver只对窗口内最后一个按序到达的包进行确认 = 之前全部按序到达窗口向右滑动
      • 对于按序未到达的包 sender收不到ack超时重传
      • receiver收到该包,如果【该包+已缓存的未按序到达的包 】构成一段连续有序的segment 那么他们可以被交付给应用进程,窗口向右滑动
  • sender窗口布局(从左到右):[窗口外:已发送并收到确认] [已发送未确认] [可发送但未发送] [窗口外:不可发送]
  • receiver窗口布局(从左到右):[窗口外:按序到达已ack且交付] [期望中但未到达] [已缓存的未按序到达] [允许接收] [窗口外:不许接收]
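接收方“缓存乱序、累计确认、窗口右滑”的核心逻辑可以用下面的示意代码表达(以包序号代替字节序号,数组大小即窗口大小,仅为示意):

```c
#include <stdbool.h>

#define WIN 1024
static bool buffered[WIN];     // 以 seq % WIN 为下标,标记窗口内乱序缓存的包
static long rcv_next = 0;      // 期望的下一个 seq,即回复的累计 ACK

long on_segment(long seq) {
    if (seq >= rcv_next && seq < rcv_next + WIN)
        buffered[seq % WIN] = true;        // 窗口内:缓存(重复到达也只标记一次=去重)
    while (buffered[rcv_next % WIN]) {     // 左边界起连续的一段可以交付给应用
        buffered[rcv_next % WIN] = false;
        rcv_next++;                        // 窗口向右滑动
    }
    return rcv_next;                       // 累计确认:告知对方期望的下一个 seq
}
```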

拥塞控制

如果网络出现拥塞,分组将会丢失,此时发送方会继续重传,从而导致网络拥塞程度更高。因此当出现拥塞时,应当控制发送方的速率。这一点和流量控制很像,但是出发点不同。流量控制是为了让接收方能来得及接收,而拥塞控制是为了降低整个网络的拥塞程度。

慢开始与拥塞避免

  • Slow start: 初始 cwnd = 1,每收到1个ack cwnd++,cwnd指数增长:2、4、8 …
  • 为避免cwnd增长过快,设置一个慢开始门限 ssthresh,当 cwnd >= ssthresh 时,进入拥塞避免-每个轮次cwnd++ 线性增长
  • 若超时: 令 ssthresh = cwnd / 2,重新执行慢开始。

快重传与快恢复

  • 如何确定报文丢失:接收方只对最后一个收到的有序报文段进行确认。在发送方,如果收到三个重复确认,那么可以知道接收方下一个报文段丢失,此时执行快重传
  • 只是丢失个别报文段而不是网络拥塞:执行快恢复,令 ssthresh = cwnd / 2 ,cwnd = ssthresh,直接进入拥塞避免cwnd线性增长
  • SYN flood: client发第一个syn后下线,server收不到第三次握手的ack,会不断重发SYN+ACK直到超时。linux默认重试5次,间隔1+2+4+8+16+32=63秒后才断开连接,可被恶意利用
    • 防护:linux提供tcp_syncookies参数 当syn队列满了 server回发带SYN cookie的SYN+ACK 非恶意的client会带着正确的cookie回复ACK 从而正常建立连接

应用层

  • 2种架构: Client Server / Peer to Peer平级
  • Socket = 门, 传输层 = 走廊
  • 主机和进程可以被ip+端口定义
  • 远程登录: SSH TELNET
  • 邮件: 发送协议常用 SMTP,读取协议常用 POP3 和 IMAP。
  • 不同情况下应用有不同需求 数据丢失 vs 实时性 vs 吞吐量

常用端口

| 应用 | 应用层协议 | 端口号 | 传输层协议 | 备注 |
| ---- | ---- | ---- | ---- | ---- |
| 域名解析 | DNS | 53 | UDP/TCP | 长度超过 512 字节时使用 TCP |
| 动态主机配置协议 | DHCP | 67/68 | UDP | |
| 超文本传送协议 | HTTP | 80 | TCP | |
| 文件传送协议 | FTP | 20/21 | TCP | 数据连接 20,控制连接 21 |
| 远程终端协议 | TELNET | 23 | TCP | ssh = 22 |
| 简单邮件传送协议 | SMTP | 25 | TCP | |
| 邮件读取协议 | POP3 | 110 | TCP | |
| 网际报文存取协议 | IMAP | 143 | TCP | |

FTP

  • 主动模式:服务器主动建立数据连接(需要客户端配置自己的防火墙开放端口) 被动模式:服务器被动,客户端主动建立数据连接(只需服务器开放端口号,但开放过多端口削弱服务器安全性)

  • 2个TCP连接: TCP数据连接端口20, TCP控制连接端口21.

DHCP


只适用于动态ip分配的情形,如进入一个新的移动网络。 主机不知道自己ip地址 ask dhcp server, 配置好IP地址,子网掩码,网关 IP 地址.

  • client使用udp同子网内广播discover报文
  • dhcp server返回offer 包括多种选择
  • client选择一个ip 发送request
  • dhcp server返回ack

DNS

DNS 可以使用 **UDP 或者 TCP** 进行传输,**使用的端口号都为 53**。大多数情况下 DNS 使用 **UDP** 进行传输,这就要求域名解析器和域名服务器都必须**自己处理超时和重传**从而保证可靠性。在两种**特殊情况下会使用 TCP** 进行传输:
  • 如果返回的响应超过 512 字节(UDP 最大只支持 512 字节的数据)。
  • DNS zone transfer

DNS 负载均衡

同一主机在dns服务器里配置多个主机记录=多个不同服务器ip,dns服务器解析域名时会轮询,这就完成了简单的负载均衡。

HTTP HTTPS

  • URI 包含 URL 和 URN。
  • HTTP是无状态的
    • 请求互相独立, 类似事务
    • 编程,本地储存,cookie,session是用来提高用户体验的
  • HTTPS 并不是新协议,而是让 HTTP 先和 SSL(Secure Sockets Layer)通信,再由 SSL 和 TCP 通信,也就是说 HTTPS 使用了隧道进行通信。
    • SSL/TLS握手阶段使用非对称加密(如RSA:public key加密 private key解密)协商出对称会话密钥,之后的数据用对称加密传输
  • API
    • Get 从服务器获取数据
    • Post 提交数据给服务器
    • Head 获取报文头部
    • Put/Delete 上传文件/删除文件 不安全没有验证机制
    • Patch 部分修改资源
  • 状态码
    • 200 OK
    • 3XX 重定向
    • 4XX 客户端错误 如 404 NOT FOUND
    • 5XX 服务端错误


| HTTP 1.0 | HTTP 1.1 | HTTP 2.0 |
| ---- | ---- | ---- |
| 默认短连接: 一次TCP连接进行一次HTTP通信。如需使用长连接,使用 Connection : Keep-Alive | 默认长连接persistent connection: 建立一次 TCP 连接就能进行多次 HTTP 通信。断开连接需要client/server使用 Connection : close | Header压缩 |
| 请求发出后需要等待响应才能发下一个请求 | 引入流水线: 同一条长连接上不等待响应连续发出请求, 减少延迟 | 解决了流水线情况下的HOL blocking(类似饥饿现象) |
| N/A | 引入cookie保存状态信息 | N/A |

Get Post比较

  • get将请求信息放url里 post放报文体中
  • get符合幂等性安全性(只读) post不符合(因为是修改操作)
  • get可被缓存 绝大多数get请求都被cdn缓存 减少web服务器压力 可被储存到浏览器记录和书签中 而post不行
  • cookie: 存在于client边的保存用户状态信息的文本文件,由浏览器储存到client本地。 方便未来向服务器查询用户信息使用。渐渐被淘汰,现在新的浏览器直接支持将数据储存在client本地。存在浏览器中不加密会不安全。但可以减少服务器负担。
    • http request - http response & set-cookie
    • http request & cookie - http response
  • session:存在于server边,用于维护用户登录状态,关闭浏览器 一定时间后会失效
    • 由cookie实现
      • server分配给每个session唯一的jsessionID server从redis缓存中验证用户信息密码 key=session id
      • 返回http response set-cookie包含session id client收到后将cookie值存入浏览器 之后对sever的请求会包含该cookie值 = session id

Web页面请求过程

  • DHCP配置基本信息(假设主机最开始没有 IP 地址及其它信息)
    • DHCP udp broadcast - DHCP server offer - select IP - ack(assigned ip, dns server ip, 网关ip, 子网掩码)
    • 因为我们广播过了 ethernet learning switch knows host’s <MAC, link>
    • DHCP ack到达主机,主机得到ip地址 DNS服务器地址 网关IP 子网掩码 完成配置
  • 需要知道域名IP地址, 先看DNS缓存(browser-os-router-isp), 如果缓存没有, DNS根服务器查询IP 迭代或递归的 从顶级域名到次级域名
  • 得知ip地址,主机通过浏览器和服务器三次握手生成一个 TCP socket,向服务器发送HTTP GET
  • 层层路由 (通过IP获取设备MAC-ARP broadcast & responds) 路由取决于要不要跨网 涉及不同协议
  • 到达服务器,服务器从socket读取 HTTP GET 报文,回复 HTTP 响应报文,将 Web 页面内容放入报文主体中,发回给主机。
  • 浏览器收到 HTTP 响应报文后,抽取出 Web 页面内容,之后进行渲染,显示 Web 页面。

缓存

  • 使用代理服务器或client浏览器进行缓存
  • http 1.1 使用cache-control字段控制缓存 no-store no-cache(使用缓存先验证有效性) private public
    • max-age: 资源在缓存中可停留的最大时间(相对秒数) expires: 过期的绝对时间
  • 验证缓存是否有效:If-none-match: + etag: 资源的唯一标识 备用方案:last modified

OS

进程线程

  1. 进程是资源分配的基本单位:正在执行中的指令和上下文。实体:pcb
  2. 上下文:pcb里的内容 对应着 用户堆栈 内核堆栈 代码段 寄存器
  3. 进程切换=上下文切换成本高 IPC通信成本高且困难 故引入线程
  4. 线程同进程内一个执行单元/可调度实体 共享同进程资源:代码段,堆,pid, open fds, 当前目录, 用户和用户组id
  5. 线程独立内容:tid, 寄存器(pc, ip, esp),本地栈
  6. 用户内核线程管理取决于内核支不支持内核级线程:多对1,1对1
  7. 并行parallelism强调多核和硬件支持是并发的子集: 并发concurrency仅表示多线程/多进程交替执行,不要求同时,也不限定cpu数量 = 同时处理多任务 vs 轮流处理多任务
  8. fork COW
    • 实现:fork时把页表里的页标记为只读并对物理页做引用计数;只有当write实际发生时触发缺页,才分配新页进行拷贝
    • 其他例子:内存的按需清零——让一页全0的物理内存共享给所有malloc后尚未写入的引用,使得系统有更多可用虚拟内存
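
下面用一个最小的 fork 例子演示写时复制的可见行为:子进程写入前父子共享同一物理页,写入触发缺页后各自独立(COW 本身对用户态透明,这里只是示意):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 42;
    pid_t pid = fork();          // fork 后父子共享标记为只读的物理页,引用计数+1
    if (pid == 0) {
        x = 100;                 // 写入触发 page fault,内核此时才真正复制该页
        printf("child:  x = %d\n", x);   // 100
    } else {
        wait(NULL);
        printf("parent: x = %d\n", x);   // 仍是 42:父子互不影响
    }
    return 0;
}
```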

6种IPC方式

Pipe 匿名(非命名)管道,生命周期属于对应的父子进程

Named Pipe 命名管道(FIFO)不同于匿名管道之处在于它提供了一个与之关联的路径名,以 FIFO 的文件形式存在于文件系统中。这样,即使与 FIFO 的创建进程不存在亲缘关系的进程,只要可以访问该路径,就能够彼此通过 FIFO 相互通信;因此,不相关的进程也能通过 FIFO 交换数据。常用于客户-服务器应用程序中,FIFO 用作汇聚点,在客户进程和服务器进程之间传递数据。

```c
#include <sys/stat.h>   // mkfifo/mkfifoat 声明于此
int mkfifo(const char *path, mode_t mode);
int mkfifoat(int fd, const char *path, mode_t mode);
```

Message queue 消息队列,主要是提供一种异步通信方式:消息被保存到队列里,接收者有需要就取回。有很多开源的实现:Apache ActiveMQ RocketMQ。异步的缺点是接收者要进行polling轮询操作才能收到最近的消息。优点:比信号能传递更多信息;比管道多了有格式的消息,且不需要额外同步。
Semaphore
Shared Memory 实现困难 要os层面支持 而且要同步不同进程对内存的访问
Socket 不同机器
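
一个匿名管道父子进程通信的最小示意(父写子读;关闭写端后读端读到EOF):

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds);                        // fds[0] 读端, fds[1] 写端
    if (fork() == 0) {                // 子进程:只读
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("child got: %s\n", buf);
        close(fds[0]);
        return 0;
    }
    close(fds[0]);                    // 父进程:只写
    const char *msg = "hello via pipe";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);                    // 关闭写端,读端随后读到 EOF
    wait(NULL);
    return 0;
}
```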

调度

  • 合作机制:中断interrupt
    • 硬件设备中断:异步的到来 但会被同步 timer / io设备等
    • 软件中断:同步的 系统调用或异常(trap实现)。均有handler处理
  • cpu虚拟化:公平,高效,安全的分配计算资源
  • 安全机制:用户内核双态转换 通过系统调用作为interface 涉及cpu上下文切换
    • 系统调用由trap机制实现 通过索引syscall table呼叫对应trap
  • 调度: 在io等待时充分利用cpu;快速响应(如RR)往往意味着更高的周转时间,二者是tradeoff;可调度进程在ready queue里
    • 批处理:需求-高吞吐量/短周转时间
      • FCFS 非抢占 护送效应“短等长”
      • SJF 非抢占 长作业饥饿
      • STCF 抢占 贪婪地选剩余时间最短的 较优
    • 交互式系统:快速响应
      • RR 抢占 低响应时间 高周转时间 时间片太小 上下文切换开销大 太大 不能保证实时性
      • MLFQ 可抢占 同时保证吞吐量和响应性 多级队列 高优先级先执行,同级RR 用完时间片demote 周期性boost防饥饿
    • 实时系统: 一个确定时间段内响应
      • 多用抢占式保证及时响应

并发控制

  • cpu pipeline有效提高性能 但有pipeline hazards问题:

    • 数据依赖hazards | 分支hazards
  • 解决方式:cpu指令重排序/乱序执行 | 分支预测

  • 缓存不一致:多核cpu上多级缓存不同步

  • 内存可见性/一致性问题:缓存内容未同步至主存 cpu编译器指令重排 内存乱序访问

  • 内存屏障: 编译时防止指令重排 运行时保证内存可见性

  • 锁实现

    • 软件:困难 peterson算法 过时 现代处理器会乱序执行
    • 硬件:
      • 原子操作:
        • TestAndSet xchgl 更新并返回旧值供外层循环检测
        • CompareAndSwap CAS 满足期望值则更新。返回旧值供外层循环检测 cmpxchgl
        • FetchAndAdd xaddl 加上1个数 返回旧值供外层循环检测
      • 内存屏障
  • 锁类型:公平=无饥饿

    • 自旋忙等待:需搭配抢占式调度 适合多核短持锁时间 饥饿
    • yield自旋:自旋很短时间然后yield 减少了响应时间 饥饿
    • 排号自旋锁: 使用faa,新来拿号,不到回合自旋,fifo公平无饥饿 | 改进:排号yield(示意代码见本节末尾)
    • 队列锁: os支持 饥饿者扔到waiting queue挂起 锁可用再放回ready queue 公平
    • 混合锁: 定时自旋+队列 linux futex 公平
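
用 C11 原子操作把上面的 TestAndSet 自旋锁和排号自旋锁写成示意代码(仅为示意,省略了内存序调优):

```c
#include <stdatomic.h>
#include <sched.h>

// 1) TestAndSet 自旋锁:原子交换,拿到旧值 0 即获得锁(不公平,可能饥饿)
typedef struct { atomic_int flag; } tas_lock;

void tas_acquire(tas_lock *l) {
    while (atomic_exchange(&l->flag, 1) == 1)   // 旧值为 1 说明锁被占用
        sched_yield();                          // yield 自旋:让出 CPU 减少空转
}
void tas_release(tas_lock *l) { atomic_store(&l->flag, 0); }

// 2) 排号自旋锁:FetchAndAdd 取号,轮到自己的号才进入(FIFO 公平)
typedef struct { atomic_int next_ticket, now_serving; } ticket_lock;

void ticket_acquire(ticket_lock *l) {
    int my = atomic_fetch_add(&l->next_ticket, 1);   // 原子取号,返回旧值
    while (atomic_load(&l->now_serving) != my)
        ;                                            // 不到回合则自旋
}
void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);            // 叫下一个号
}
```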

条件变量,管程,信号量,协程

  • cv: 条件等待队列

    • wait(mutex,cv) 放锁挂起caller,醒后拿锁返回
    • signal(cv) 唤醒一个挂起的线程
  • sv:用于检测状态变化 从而进行wait/signal

  • monitor: mutex + cv + sv 任何时刻只有1个线程进入管程 属于高级同步原语

    • mesa: 唤醒和线程恢复执行非原子 sv可变 需while检测
    • hoare: 原子唤醒+恢复执行 sv不变 if检测即可
    • java在语言层级上支持通过管程机制同步 同步方式有两种:互斥lock synchronized 合作cv Object.wait( ) Object.notify( ) 每个java对象都有一个锁和cv
    • 2 cvs [empty + fill] + 1 mutex 解决pc问题(pthread示意代码见本节末尾)
      • p wait(empty, mutex) put(i) signal(fill)
      • c wait(fill, mutex) get() signal(empty)
  • semaphore: 初始非负的同步计数器 power(sem) = power(cv + lock)

    • wait: sem--,结果<0挂起,>=0通过
    • signal: sem++,唤醒一个挂起的线程
    • 0=占用 正数=还可有几个线程同时访问 负数=等待中线程数量
    • sem(1) = mutex, sem(0) = cv, sem(可用资源量) = 共享资源访问控制
    • 3 sem 解决pc问题
  • 2 sem(read lock rl + write lock wl)+ reader_counter 实现读写锁
    - 读锁:拿起rl保护计数器,reader_counter++,[若自己是第一个读者,拿起wl],放下rl;释放时对称:reader_counter--,最后一个读者放下wl
    - 写锁:拿起wl,释放时放下wl

  • 协程:用户态特殊函数 可挂起恢复 threadlocal不需要同步 多协程-串行执行
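
上面 “2 cvs + 1 mutex” 的生产者-消费者可以用 pthread 写成如下示意(mesa 语义,因此用 while 重查条件):

```c
#include <pthread.h>

#define CAP 16
static int buf[CAP], fill_idx, use_idx, count;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t empty = PTHREAD_COND_INITIALIZER;  // “有空位”
static pthread_cond_t fill  = PTHREAD_COND_INITIALIZER;  // “有数据”

void put(int v) {                           // 生产者
    pthread_mutex_lock(&m);
    while (count == CAP)                    // mesa:醒来后条件可能又变,必须 while
        pthread_cond_wait(&empty, &m);      // 放锁挂起,醒后重新拿锁
    buf[fill_idx] = v;
    fill_idx = (fill_idx + 1) % CAP;
    count++;
    pthread_cond_signal(&fill);             // 唤醒一个消费者
    pthread_mutex_unlock(&m);
}

int get(void) {                             // 消费者
    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&fill, &m);
    int v = buf[use_idx];
    use_idx = (use_idx + 1) % CAP;
    count--;
    pthread_cond_signal(&empty);            // 唤醒一个生产者
    pthread_mutex_unlock(&m);
    return v;
}
```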

死锁

  • 哲学家就餐:不可同时拿左手叉子 会造成环形等待 只让最后一个哲学家先拿右手叉子即可破解环形等待 | 原子化吃饭
  • 条件与预防 | os调度预防:银行家算法
    • 互斥: 获得资源的独占控制 | 预防:不使用锁
    • 占有且等待: 拿着这个还等别的 | 预防:一次性原子地申请所有锁(conservative 2PL)或大粒度锁
    • 不可抢占 | 预防: 得不到锁就释放资源
    • 环形等待 | 预防:决定好锁获得顺序
  • 解决:
    • 鸵鸟策略
    • 检测 | 恢复:抢占,回滚,kill,重启系统
      • 单个资源:资源waits-for图环判断(拓扑排序/dfs)
      • 多个资源:超时检测 标记清除[满足所有可满足+释放资源 长时间不可满足=死锁]

内存管理虚拟化

  • 动态加载: 调用时加载 lazy loading | 动态链接: 函数执行时lazy定位和链接

  • 静态重定位: 进程装入内存前一次性完成逻辑/物理地址的变换 固定不可变 不安全

  • 动态重定位:内存虚拟化 访问内存前转换

  • 内部碎片: 分配单元内部的浪费 如padding | 外部碎片:对于os的无法被分配的小内存洞

  • 虚拟化内存:物理内存通过分页分段 + on-demand paging映射为更大逻辑内存

  • 页表:映射关系 + 元信息

  • 分页:逻辑和物理空间分为固定大小页 4KB=blocksize of disk

    • 消除物理内存外部碎片,造成逻辑内存空间连续的假象
    • 按需置换提供更多可用虚拟内存
  • 分段:地址空间分段code, stack,heap 每个段分配一些固定大小页,每段都有对应页表

    • 模块化方便共享保护
  • 分页分表合用好处:段自由生长,模块化方便共享保护,稀疏地址空间有效减少页表大小

  • TLB: cpu上缓存 mmu组件 含最近访问页表项翻译内容 减少内存访问

  • MMU: cpu上负责逻辑物理地址转换的部分 table walk unit + 多个tlb

  • 页置换策略:按需加载:代价page fault 提前加载:只适用于sequential workload 可与用户暗示结合advise

    • OPT: 淘汰未来最长时间不用的 理论最优但不实际
    • FIFO: 进入内存最久的 可能把最常用的踢掉
    • LRU: 若workload合适 近似opt 但实现困难 要维护一个所有页面链表
    • Clock: LRU近似 遍历环形buffer ref_bit=0停 清除经过的1
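
Clock 置换“扫描环形buffer、清1停0”的逻辑如下(示意;命中某页时只需把它的 ref 置 1):

```c
#include <stdbool.h>

#define NFRAMES 64
struct frame { int page; bool ref; };   // ref = 最近是否被访问过
static struct frame frames[NFRAMES];    // 环形排列的物理页帧
static int hand;                        // 时钟指针

int clock_evict(void) {                 // 返回被淘汰的帧号
    for (;;) {
        if (frames[hand].ref) {
            frames[hand].ref = false;          // 给第二次机会,清除经过的 1
            hand = (hand + 1) % NFRAMES;
        } else {
            int victim = hand;                 // ref=0:近似“最近未使用”,淘汰
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
    }
}
```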

文件系统和Linux

  • 硬连接/实体链接: 两路径同inode号 不可链接目录/跨文件系统 ln file hardlink

  • 软连接/符号链接: data block指向路径 ln -s file softlink

  • fd_table - file table(各种访问模式read/write/r&w) - inode table | stdin out err 0 1 2

  • sigchld: 供父进程用wait/waitpid等子进程的信号

  • 孤儿进程:父进程先退出 子进程仍在运行

  • 僵尸进程:子进程先退出但其资源未被父进程回收。处理:kill父进程使之成为孤儿,init会回收资源。


数据库架构

  • 储存管理storage manager: 将数据库文件以数据库pages形式表示 管理pages读写 pages内可用空间使用 数据库的一页通常和os虚拟内存中的页或硬件的页不同 大小取决于数据库设置
    • 文件的表示形式:heapfile sortedfile 说的是pages内记录有序或无序
    • pages的管理:链表 或 page directory(一个包含所有pages元信息的特殊page)
    • 不同数据类型的表示 固定长度的整数浮点数 不定长度的VARCHAR TEXT BLOB。 数据库系统元信息表
    • 储存模型:row-store适合OLTP workloads, column-store适合OLAP workloads
  • 缓存管理 buffer manager 管理数据库缓存池(磁盘中读来的数据库页)
    • 实现:使用哈希表维护当前内存中的页 每个页被修改了就要标记为脏, 此外还有pin counter计数正在使用他的线程数量 正在使用中的页不可以被淘汰
    • 策略:LRU-k 普通的LRU和clock会存在sequential flooding问题:顺序扫描冲掉缓存中所有内容
    • 优化:顺序读取情况下根据query plan预读
  • 解析引擎SQL parser:创建语法树 制定logical query plan & physical query plan(如各种join算法:block nested loop join, hash join)
  • 优化器衡量不同plan的优劣: cost-based search, 抽样统计, 大的策略是尽量走索引。
  • 执行引擎: 按照query plan从上到下执行
  • 锁管理: HashMap<obj_name, <granted_set_txns, lock_mode, wait_queue>> + lock upgrade等多粒度锁控制
  • 日志管理:UNDO log: 回滚未完成/aborted事务 REDO log: 重做已提交事务,保证持久性
    • 失败-事务失败 系统失败
    • 策略:
      • shadow paging(COW)
      • Write-Ahead Logging 每个log entry:<txn_id, obj_id, before_val, after_val>
        • 周期性设置检查点将内存中内容flush到disk 作为redo undo操作新起点
        • physical logging 记录在数据库页的位置 logical logging 记录高层的操作如insert delete update和元数据
  • 容灾机制 ARIES: analyze WAL records + redo starting from a check point + undo uncommitted txns
  • 索引管理
  • 权限管理

Workloads: OLTP OLAP

  • OLTP online transaction processing (最常见)例如亚马逊购物车界面 只以事务方式简单查询语句访问/更新数据库的一 小范围数据 较少读 较多写 简单查询
  • OLAP analytical 对已经通过oltp取得的数据 通过复杂查询语句在大范围内进行数据分析 通常不涉及更新数据 数据通常以只读方式提供 大范围数据 较多读 较少写 复杂查询
  • HTAP hybrid transaction analytical 中间位置, 混合workload
  • oltp at frontend => extract&transform&load => olap at back end with datawarehouse
  • htap db: oltp and olap at same side complicated not easy to build

关系数据库设计理论

函数依赖,异常,范式

  • X函数依赖于Y 意味着Y唯一地决定X;若表中全部属性都函数依赖于Y,则Y是当前表的候选键

  • X完全函数依赖于Y意味着X不函数依赖于Y的任何真子集

  • X部分函数依赖Y = Y的一部分可以决定X Y作为键存在冗余

  • X->Z,Z->Y,X->Y是传递函数依赖

不符合范式的关系,会产生很多异常,范式可以解决异常:

  • 冗余数据栏目:例如某个信息在不同行出现了多次 该schema设计有问题
  • insert异常:因为冗余信息的存在 我们想插入一条记录 就要包括/编造这些无关冗余信息 这往往不现实
  • delete异常:删除一个含冗余信息的行,导致丢失同行其他关键信息。
  • update异常:只改了一条记录中的信息,其他记录中相同的信息忘了改。

通过无损分解消除没必要的函数依赖 产生更合理的schema 高级别范式依赖于低级别范式

  • 1NF:原子属性不可分。
  • 2NF:1NF+消除部分函数依赖 (每个非主属性完全函数依赖于主键)
  • 3NF:2NF +消除传递函数依赖(非主属性不存在“传递函数依赖”)

数据库并发控制

事务管理中的ACID原则

  • Atomicity 原子性 “all or none”: 事务所做操作要么全部提交要么全部失败回滚
    • 实现: 日志undo log, Shadow paging(COW)
  • Consistency 一致性 “looks correct”
    • 数据库一致性 仍遵循完整性约束 未来事务能看到过去事务造成的后果
    • 事务一致性 事务前后数据库内容一致 一致意味着多个事务访问相同数据得到相同结果
  • Isolation 隔离性 “lives alone”: 事务所作修改commit前对其他事务不可见
    • 两种类别的并发控制协议来保证隔离性
    • 悲观 从最初就避免问题的发生
    • 乐观 设定某种事务执行顺序,假设冲突发生不常见,发生后再处理 lazy
  • Durability 持久性 “survives failure”: 事务提交的修改应被持久化,即便系统崩溃也不受影响 redo log

串行情况下 原子性=>一致性 并行情况下 原子性+隔离性=>一致性

一致性冲突

  • 读写冲突:
    • 不可重复读 “non-repeatable read” 两个读的中间数据被另一个事务覆写并提交 导致第二次读的同一行值不同
      • transaction reads committed UPDATES from another transaction. The same row now has different values than it did when your transaction began.
      • A non-repeatable read occurs, when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.
    • 幻读 “phantom read” 两个读的中间被另一个事务插入/删除行并提交 导致第二次读返回的行不同
      • Phantom reads are similar but when reading from committed INSERTS and/or DELETES from another transaction. There are new rows or rows that have disappeared since you began the transaction.
      • A phantom read occurs when, in the course of a transaction, two identical queries are executed, and the collection of rows returned by the second query is different from the first.
    • 避免不可重复读锁行就足够,避免幻读需要锁表
  • 写读冲突: 脏读 “dirty reads” Reading uncommitted data 读了另一个事务没有提交的数据
  • 写写冲突:丢失修改 “lost updates” overwriting uncommitted data 一个事务覆盖另一个事务没有提交的数据

隔离级别


  • RU: 事务中的修改,即使没有提交,对其它事务也是可见的。
  • RC: 一个事务只能读取已经提交的事务所做的修改。
  • RR: 保证在同一个事务中多次读取同一数据的结果是一样的。
  • S: 强制事务串行执行,这样多个事务互不干扰,不会出现并发一致性问题。从MVCC并发控制退化为基于锁的并发控制。不区别快照读和当前读,所有的读操作都是当前读,读加读锁(S锁),写加写锁(X锁)。在该隔离级别下,读写冲突,因此并发性能急剧下降,在MySQL/InnoDB中不建议使用。
  • 冲突可串行:一个调度是冲突可串行的意味着可以将其转换为串行调度
    • 依赖关系图一定无环
    • 这就是SERIALIZABLE隔离级别:将事务完全调度为串行执行

悲观并发控制

二段锁

  • Pessimistic 悲观的 假设事务存在高度竞争 使用锁 因此安全程度高 但并发能力会有所限制 (not allow all serializable schedules)

  • 写锁互斥 X 读锁共享

  • 2PL 总是保证冲突可串行调度 always guarantees conflict-serializable schedules = guarantees serializable schedules

  • 二个阶段 “/\”

    • 生长期 只允许拿锁
    • 收缩期 一旦释放第一个锁 进入收缩期
      • 只允许放锁 或不放锁
      • 不允许拿锁
  • 问题: 脏读情况下的cascading abort。例如t1写 t2读,此时t1还没commit便abort,t2就不得不abort;如果t2也写了还没来得及提交而后面还有事务读,这个反应会连锁下去

    • 解决: Strong Strict 2PL (SS2PL) 一个事务提交/abort后才允许放锁 一次放完
  • 死锁处理:

    • 检测:
      • 后台周期性在waits-for graph中判环 / 超时检测
      • 选择一个victim回滚打破环,选择标准和rollback程度取决于具体设计
    • 预防:
      • 通过时间戳赋予优先级:时间戳越老优先级越高
      • 一个人得到锁后其他所有人进入等待队列 直接严格避免死锁发生

锁粒度和层次

MySQL 中提供了两种封锁粒度:行级锁以及表级锁。
应该尽量只锁定需要修改的那部分数据,而不是所有的资源。锁定的数据量越少,发生锁争用的可能就越小,系统的并发程度就越高。
但是加锁需要消耗资源,锁的各种操作(包括获取锁、释放锁、以及检查锁状态)都会增加系统开销。因此封锁粒度越小,系统开销就越大。
在选择封锁粒度时,需要在锁开销和并发程度之间做一个权衡。

  • 一个事务想要更新1billion tuples = ask for locks 1 billion times VERY SLOW
  • 我们需要锁的层次来展现 行级表级处分别发生了什么

如何更方便的进行多粒度封锁: 意向锁

  • 意向锁Intention lock: 表示想要对表加锁,与"锁提升机制"结合,细粒度锁太多时动态的分配可用的粗粒度锁 大大减少lock manager需要处理的锁的请求数量

  • 意向锁在高层如表级告诉我们下面有什么样的锁 通过兼容矩阵 可以从hierarchy高处快速判断能不能读/写下层具体行的数据

  • Intention-Shared (IS):下面某个地方有读锁
  • Intention-Exclusive (IX):下面某个地方有写锁
  • Shared+Intention-Exclusive (SIX): 整个子树可共享读,下面某个孩子有写锁

  • 表级锁兼容矩阵:
    | | X | IX | S | IS |
    | ---- | ---- | ------ | ---- | ------ |
    | X | | | | |
    | IX | | OK | | OK |
    | S | | | OK | OK |
    | IS | | OK | OK | OK |

  • 表和表:任意 IS/IX 锁之间兼容(意向锁只表示想要加锁,≠ 真正加锁)| S 和 IS 兼容

  • 表和行:表级 IX 和行级 X 兼容:两个事务先对同表加IX锁,然后可以对同表两个不同数据行加 X 锁

Mysql事务四级封锁: 强2PL+读写锁

Mysql InnoDB引擎采用强2PL,根据当前隔离级别自动加锁=隐式锁定

  • 一级 修改加写锁,事务结束放写锁 = Read Uncommitted级别:防止写写冲突问题lost updates
  • 二级 一级 + 读加读锁,读完放读锁 = Read Committed级别:防止脏读问题dirty reads
  • 三级 二级 + 读加读锁,事务结束放读锁 = Repeatable Reads级别:防止不可重复读问题non-repeatable reads
  • 四级 全表锁,事务结束放锁 = Serializable级别

显式锁定

获取读锁: SELECT ... LOCK IN SHARE MODE;
获取写锁: SELECT ... FOR UPDATE;

乐观并发控制

  • 不使用锁:对common case的一种优化, 大部分情况下事务只需持续很短时间,冲突没有那么容易发生。所以我们可以使用比较懒的策略,假设冲突不会发生,检测到发生再进行补救 如选择abort

根据ACID原则 并发情况下实现一致性需要原子性和隔离性, 原子性通过CAS等原子操作解决, 隔离性通过时间戳机制赋予事务执行顺序

目前有三种主流乐观并发控制策略

  • 基于时间戳的 事务用时间戳编号排序 最早OCC论文内容
    • 读写阶段:将更改存在私有空间
    • 验证阶段:验证是否和其他事务冲突再提交
    • 写阶段:验证成功 将私有空间更改同步至主空间 否则abort
  • snapshot isolation: 朴素mvcc
  • CAS: 不仅是一种策略 也是一种实现
    • 满足期望值(没有被占用)则更新。返回旧值供外层循环检测
    • 外层while CAS(flag, 0, 1) == 1: Old value is 1, lock held by someone else, spin and wait. Otherwise go ahead do modification
    • 硬件支持:x86架构下指令为cmpxchgl

MVCC多版本并发控制

  • 当前最常见的数据库并发控制方法. 非常适合读多写少的OLTP workload 朴素MVCC解决了在REPEATABLE READ和READ COMMITTED两个隔离级别下读同一行和写同一行的两个事务的并发问题。

  • 朴素的MVCC在数据库领域可以说是乐观的 是不用锁的 单纯使用快照方式实现读写分离 类似COW

mysql默认隔离级别为RR, InnoDB的mvcc是mvcc的改进版:引入next-key lock解决了朴素mvcc的幻读问题 能够实现RC RR S三个隔离级别

  • 即便是改进版MVCC也只处理读写/写读冲突,这正是四种isolation level所针对的问题;写写冲突 lost updates 一般通过first committer/first updater win 解决

  • 读=不加锁的非阻塞读 读旧版本快照 | 写(插入删除更新)=加锁更新一个最新的版本快照

  • 系统版本号SYS_ID:通过时间戳机制单调递增分配

  • 事务版本号TRX_ID:事务开始时的系统版本号

  • 每个事务都有这两个版本号并被记录在undo日志中

  • RC RR隔离级别下使用版本链 多个版本的快照存在undo日志中,日志通过回滚指针把一个数据行的所有快照连起来

  • MVCC 维护一个read view结构:当前未提交事务列表trx_min_id…trx_max_id。一个数据行快照的trx_id <min则可用 >max则不可用 位于中间则取决于是否在未提交列表中与当前隔离级别;若该快照不可用则要沿undo日志中版本链找前一个可用快照(可见性判断的示意代码见本节末尾)

  • 快照读select…不需要锁,当前读(数据修改insert/update/delete)操作需要锁

  • innodb为了解决rr级别下当前读(数据修改)的幻读问题,使用了next-key lock

    • 本质上是gap lock+ record lock
    • record lock锁定记录上的索引相当于一个单点锁 (使用unique index访问的情况)
    • gap lock锁定record lock单点之前的开区间(使用non-unique index / no index访问的情况)
    • (10, 15] 每一个next-key lock锁定一个左开右闭区间
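
read view 可见性判断的骨架大致如下(字段命名为示意,真实 InnoDB 的结构更复杂):

```c
#include <stdbool.h>

struct read_view {
    long min_trx_id, max_trx_id;   // 创建 read view 时活跃(未提交)事务的上下界
    const long *active; int n;     // 活跃事务id列表
};

// 判断 trx_id 版本的快照对该 read view 是否可见
bool snapshot_visible(long trx_id, const struct read_view *rv) {
    if (trx_id < rv->min_trx_id) return true;    // 早于所有活跃事务:已提交,可见
    if (trx_id > rv->max_trx_id) return false;   // 晚于 read view 创建:不可见
    for (int i = 0; i < rv->n; i++)              // 在活跃列表中 = 未提交:不可见
        if (rv->active[i] == trx_id) return false;
    return true;                                 // 不可见时沿版本链找前一个快照再判断
}
```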

索引和储存

B,B+,Hash,Bit Map

  • 哈希索引高效等值查询 位图索引适合静态low cardinality数据

  • B树平衡多路查找树 m度=m指针 m-1个键值(实际数据记录) 叶子层并非包括全部数据记录 节点半满压缩高度 节点内有序 叶子结点位于同层

  • b+树 叶子层(节点内外均有序)fixed-size数组的链表保存全部数据记录+非叶子层有序索引只存搜索键值 (储存指针多一个)

  • b+树与b树区别:

    • b+叶子层有序fixed-size数组链表同时支持高效等值查询和范围查询 (节点内有序可以2分搜索)和bulk-loading (每个固定gap选最小值作为搜索键值 从下到上建树)
    • 单叶子储存节点更多 高度压缩更明显 IO次数更少
    • 所有查询都要找到叶子结点 查询性能更稳定
  • 为什么适合储存:非平衡树可以直接排除(skew情况下退化为链表);红黑树只有2叉,高度太高、io次数太多,不适合文件系统;b+树结合b树优点——节点半满的多路平衡查找树,高度低IO次数少,同时叶子层支持外排序后bulk-loading,更适合文件系统和数据储存,更好利用磁盘预读性质

索引

为什么要使用索引?避免全表扫描,使用更好的索引-减少数据访问量

  • 索引键=索引项 搜索键:我们要查找的栏目名称
  • clustered index / primary index / sparse index 主索引:记录顺序和索引键相同 整个文件要按索引键排序 indexed-sequential file 只能有1个主索引。例子:有序文件/hash表上的索引
    • primary index指用含主键的集合作索引键的索引 一般情况下将primary index和clustered视为相同
    • 无论索引键是否有重复 可以是dense(trivial)也可以sparse 因为完全有序 可以每个disk block建一个索引指向第一个anchor record即可 找到合适位置后顺序扫描 因此一般默认为sparse
      • 有序= 高效的范围查找 IO次数少
      • 因为有序 插入删除更新 维护成本高:有空间直接插入 否则维护overflow chain 时间长了性能下降严重 因此多数数据库会定期重新组织记录结构将overflow chain展开 但这也要开销
  • non-clustered index / secondary index 辅助索引:索引键和记录顺序不同
    • must only be dense !
    • 索引有序 按辅助索引键排序 不给记录排序 那是主索引的事!!!
    • 可能涉及到不同磁盘区块 = 更多io + 寻道时间 更慢
    • 通常使用间接索引的方式避免重复key问题 这样最底层索引相同key有序排列 底层之上高层实现稀疏索引
    • sparse辅助索引没有意义 因为数据记录无序 什么都找不到 最后依赖全表扫描或主索引
    • 例:mysql中当查询列上没有辅助索引 就走主索引
  • dense index: 索引键出现在每个含搜索键值的记录上,
    • 有序情况下支持范围查询 二分查找
    • 有序情况下非常高效:1 disk IO = load everything into memory until exceed range
    • 无序情况:需要全表扫描 不适用 现在的索引尤其是b+tree都是有序索引
  • sparse index: 索引键出现在部分含搜索键值的记录上 必须是主索引 记录要按索引键顺序排序
    • 1 key - ptr pair per disk block 1st record = anchor record
    • 有序情况下 可用二分查找 more space/memory efficient
  • 多列联合索引
    • 通常策略:按最左边建立b+树有序索引 索引按优先级左到右字典序排序

Mysql索引

InnoDB支持 B+树 哈希 全文索引 MyISAM支持全文索引 空间数据索引

  • DDL部分 声明的key会自动创建辅助索引 主键会创建主索引
  • 联合索引匹配遵循最左匹配原则 idx1(a,b,c,d)
    • 从左向右匹配直到遇到范围查询<,>,between,like e.g. a=3,b=4,c>5,d=2 d会用不到索引
    • 若查询均为等值查询 = , in 则可以乱序 如查询 a=1, c=3,b=2 (a,b,c) 甚至可以省略前面部分 直接查询c = 3 查询优化器会自动修改
  • 原因:B+树有序索引会建在a上 剩余内容b,c,d在前者基础上依次排序 叶子层记录以有序数组形式存在叶节点内 有序查询找到a后依次向后按序匹配
    • b的有序是基于a的有序 因此等值查询不受影响
    • 当a为范围查询时,e.g. a>1 的结果是一个多条记录的范围 此时b相对于这个范围是无序的 无法使用索引
  • 索引是建立越多越好吗?
    • 数据量小的表不需要索引 增加额外开销
    • 数据修改要维护索引 = 更多维护成本

Mysql引擎

| InnoDB(默认) | MyISAM |
| ---- | ---- |
| 支持事务,适合OLTP:事务处理为主 小范围数据 简单查询 重复性高 增删查改写多于读 如购物车 | 不支持事务,适合OLAP:数据分析为主 大范围数据 复杂查询 重复性低 读多于写 如统计计算聚集函数 |
| 行锁设计 MVCC+next-key lock提供四个隔离级别和高并发性 | 表锁设计 |
| 支持外键 | 不支持外键 |
| 支持clustered index 全表按主键排序 | 不支持clustered index 索引文件单独储存:主键/其他方式所建索引的偏移量 数据库文件是无序的heapfile |
| 提供了插入缓冲、二次写、自适应哈希索引、预读等高性能和高可用的功能 | 缓冲池只缓冲索引文件不缓冲数据文件 |

SQL调优

索引调优基本heuristics

  • 修改sql尽量全走索引(覆盖索引)没索引的情况下考虑建合适的辅助索引
    • innoDB辅助索引不覆盖查询时会回表访问主索引
    • myisam只缓存索引,不覆盖时要额外读数据文件(系统调用与page cache置换,甚至全表扫描)
  • 选择性强的索引放前面,索引全覆盖的情况下尽量将范围查询移到末尾 遵循最左匹配原则的原理 避免查询不走索引
  • 不要把想走索引的搜索码放表达式里:如 actor_id + 1 = 5 不会走索引,应改写成 actor_id = 4
  • 性能: 1个多列索引>多个单列索引
  • 较长类型如blob, text, varchar 应使用前缀索引 只索引前几个字符

需要注意的

  • mysql optimizer会消除没有必要的数据行访问 因此不一定会走主索引或副索引

  • 找慢的query show variables找到系统变量long_query_time slow_query_log(默认off)

    • long_query_time 超过则加入slow query log, set slow_query_log为on
  • 查看query plan, explain query显示栏目

    • id=子查询执行顺序 type=找到数据行/join的方式: all全表扫描 index扫了index树
    • extra里面有各种状态比如using index = 索引完全覆盖了查询 using index for group-by部分使用索引等 using file-sort用了外排序(比如order by) using temporary 用了临时表
  • explain xxx force index(primary) / force index(xxx) 尝试使用别的索引

  • 加索引 alter table xxx add index idxname(字段名); 或者在ddl里create index xxx

查询优化

  • 切分大的事务(查询) 避免小的饥饿 或使用多粒度意向锁
  • 大连接查询切分为 单表查询 + 关联 更好利用已有缓存 提高自身缓存可用性 减少锁竞争

分布式和系统设计

模型和概论

  • 重要假设: 容错模型 假设我们可以相信节点(we write them)(若不相信需要使用拜占庭容错协议 blockchain)
  • 基本架构
    • shared everything: 单点
    • shared memory: no one does this, need OS support 多用于HPC超算中心
    • shared disk: common for cloud DBMS
    • shared nothing: most common. better performance, hard to manage ex. most nosql sys
    • 同构vs异构:集群内节点可否执行相同任务
  • 两种设计monolith vs microservice 组件耦合程度
    • monolith: 单机负责多种功能
      • 适合小团队
      • 简单: 更少解耦,维护
      • 快:少/无 RPC
      • 部署复杂
      • 单点故障波及范围大
    • microservice:单机负责单一功能 + 小型分布式缓存数据库 适合大型系统
      • easy to scale (easy for new team member)
      • decoupled : easy to reason about
      • hard to design: s1 only talks to s2 = they need to be merged
  • 两种拓展模型
| 水平扩展(更多机器) | 垂直拓展(更强机器) |
| ---- | ---- |
| 需要负载均衡 | N/A |
| 容错 可靠 | 单点故障 |
| RPC | IPC |
| 易数据不一致 | 通常一致 |
| scales well as users increase | 硬件限制 |
  • 优化方法

    • 优化单点能力:vertical scaling
    • 解决单点故障 horizontal scaling - 备用服务器转换 / replication
    • 多节点集群负载均衡 减少单点负荷: load balancer responsible
    • 微服务架构提高可拓展性: 解耦系统组件, 各司其责
    • 日志和评估系统
  • 事件驱动模型: ex. git, game server csgo headshot

    • publisher-subscriber: relies on message queues to pass events
      • Good: decoupled, easy to identify failure ONLY SPOF, easy to add / remove
      • Bad: extra level of indirection = slow, does not guarantee strong consistency, cost of learning and maintaining message queues

原则和协议

  • CAP: 分布式系统不能同时满足一致性 可用性 分区容忍性

    • Consistency一致性:保持数据一致和最新 Linearizability
    • Availability可用性:系统可用时间占总时间的比值 All up nodes can satisfy all requests
    • Partition tolerance 分区容忍性:局部网络故障消息丢失时能否继续提供一致性和可用性服务 still operates despite message loss
    • 分区容忍性一般必不可少 CAP实际上是在一致性和可用性CA间进行权衡
      • 选择CP时受k-safety限制,副本不足就必须停止服务,不得不放弃可用性
  • BASE理论 CAP的拓展 CA的协调结果 Basically Available, Soft State, Eventually Consistent

    • ACID要求强一致性(数据更新后 用户能够读取到最新值) 用于单机传统数据库
    • BASE要求最终一致性 中间软状态可以看情况适当妥协 保证始终提供核心基础服务即可
  • 分布式死锁检测 周期性union所有waits-for-graph

  • 如何实现共识 agree on commit abort ? quorum (voting)

  • 分布式事务协调 原子性提交相关协议2PC, Paxos, Raft, ZAB

    • 中心化: global traffic cop as middleware / coordinator route query as needed to partitions
      • 2PC: 相当于paxos的弱化版
        • prepare phase: 协调者询问参与者事务是否准备好提交 参与者投票Y/N 全票=提交

        • commit phase: 协调者决定结果 如果可以就都提交-否则回滚 参与者回复ack

        • 可以和 2PL结合

        • 问题

          • 不能解决dead coordinator问题 paxos 可以解决
          • 同步过程中阻塞问题 要等待其他参与者响应
          • 网络问题导致只有部分参与者commit ?
          • 一个节点失败=失败 容错率低
    • 去中心化: 节点自行组织产生分布式共识
      • 没有协调者
      • 服务器和master node交流 如果是同构架构 所有节点都可以是master node
      • query request发送到master node / partition
      • 提交时master node和其他节点交流safe to commit? 投票选择
        • paxos 分布式共识算法 无需全票 多数投票通过即通过 如果多数存活non-blocking 高容错 解决dead coordinator问题
          • proposer 提议提交 OK? 1个或多个proposer
      • acceptor reply 只需多数同意即可提交;acceptor见过更高编号n+1后会拒绝旧编号n的请求
          • learner: 被告知投票的结果,不参与投票过程。
          • 持续处理新来的proposal 即便之前的已经同意 也当即拒绝之前的proposal和commit请求
          • 解决饥饿问题:multi-paxos 选择1个leader, 规定若如果只有一个proposer propose阶段可以跳过 周期性选举当前leader
  • raft 选举主节点

组件和机制

  • 异步消息机制: message/task queue

    • 接受任务后分配给不同服务器,若花时间太长=dead,assign to next server to avoid duplication
  • 分布式缓存

    • 缓存:avoid network calls, repeated computation, db load
      • 置换策略不好或缓存太大都会导致缓存无用
      • thrashing: constantly input & output without using the result,类似sequential flooding?
      • server-side, client-side : fast but cache consistency?
      • 全局缓存global cache (redis): 解耦!微服务!
        • consistently accurate but slightly slower
        • can scale independently
      • write through: …on hit, update cache & database keep db always consistent but slow
      • write back: write to cache, mark dirty(inconsistent). lazily update db when dirty-cache entry is evicted db can be temporarily inconsistent but fast
      • hybrid: 重要的更新write through, 批缓存不重要的更新到cache, 低峰时间一次性同步到db
  • 分区 usually logical

    • 垂直切分/naive table partitioning :如果查询分布不均匀集中到一个点 单点计算能力可能扛不住
    • 水平切分/round-robin sharding / load balancing :query fragment分布到集群不同节点上作相同处理最后合并 有效缓解单数据库压力 多采用哈希取模方式划分
      • 可拓展、无重复的一致性哈希:集群增删一个节点后如何避免全量rehash冲掉我们的缓存?(示意代码见本节末尾)
        • 绕环线性探测到下一个节点
        • 节点少数据倾斜?每台服务器做多个虚拟节点replication 实现相对均匀分布
        • 哈希函数不好 服务器分布不均not uniformly distributed:
          • 完美哈希 使用哈希函数族(嵌套一致性哈希 + multilevel sharding )
    • 数据库的分库分表:大连接慢 可与主从复制结合提高分片容错 可与索引结合 consistency is hard 建议优先使用nosql等其他系统再考虑这种primitive方案
      • 表的垂直切分:一张表按列切分成多个表 分到不同库里
      • 表的水平切分:同一个表中的记录拆分到多个结构相同的表中 可以分布到集群不同节点上 配合服务器的负载均衡 减少单库压力
  • 复制 Replication

    • 创建冗余备用节点储存相同信息提高系统可用性

    • k-safety: 至少保持k个可用副本,#replicas < k 时 DBMS halts / goes offline

    • 主从架构

    • 更新集中到master

    • master更新内容到replica

      • 更新方式
        • 同步: 主等待从完成日志并回复ack 保证强一致性strong consistency 数据是即时的
        • 异步:不等待 主直接返回ack给client 保证最终一致性eventual consistency 数据snapshot
      • 更新timing
        • 连续的:生成日志=》直接发送
        • 提交时
      • 从者主动被动?
        • 主动:事务在从处自行执行 需要检测不同从的结果匹配
        • 被动:事务在1处执行后传递给所有replicas
    • 可允许只读事务访问replica

      • 可实现读写分离: 主服务器处理写操作以及实时性要求比较高的读操作,而从服务器处理读操作。从服务器可以分布到离用户较近的地方 降低延迟 提高用户体验
      • 采用快照隔离能保证弱一致性, 增加冗余提高可用性
    • master故障 选举新master

    • 多主人

      • replica和主人平级(没有主人=多主人)- 事务可在任何replica处更新 replica们自行使用原子性提交/分布式共识协议2PC/Paxos互相同步
    • 网络故障包丢失 分区paxos选主后 split-brain双主如何在网络恢复后同步?

      • 传统/newsql dbms:多数节点连接完好才允许更新 不存在这个问题
      • nosql:需要某种形式上的中心化处理
        • 乐观:选择可用性,牺牲部分一致性,之后在合适的时间自动或人为处理 hazelcast
        • 悲观:选择一致性,牺牲部分可用性,投票下线存在问题的网络分区 mongoDB
  • 总结:分区和复制通常一起使用:如水平切分的数据库分布到集群不同单点上 并作复制 复制品可以是slave(主从结构)也可以是master (distributed-consensus based).
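
上文一致性哈希(含虚拟节点)的查找逻辑可以用下面的线性扫描示意(真实实现会用有序结构做二分;hash32、NVNODE 等命名均为示意):

```c
#include <stdint.h>

#define NVNODE 4                           // 每台服务器的虚拟节点数(示意值)

struct vnode { uint32_t hash; int server; };  // 环上的一个(虚拟)节点

static uint32_t hash32(uint32_t x) {       // 任意混合哈希即可,这里用一个常见的示意实现
    x ^= x >> 16; x *= 0x45d9f3bU; x ^= x >> 16;
    return x;
}

// ring 按 hash 升序排序;key 顺时针落到第一个 hash 不小于它的节点
int locate(const struct vnode *ring, int n, uint32_t key) {
    uint32_t h = hash32(key);
    for (int i = 0; i < n; i++)
        if (ring[i].hash >= h) return ring[i].server;
    return ring[0].server;                 // 越过环尾回到起点
}
// 增删一台服务器只改动它的 NVNODE 个虚拟节点,只影响环上相邻区段的 key
```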

No-sql & in-memory DB

  • NoSQL: not-only relation: k-v, document(json, csv), graph cassandra redis mongodb

    • 优点
      • 插入删除涉及所有相关数据 此时使用nosql开销较低
      • schema比较灵活 方便更改
      • 内部支持水平切分 scales well
      • built for OLAP
    • 缺点
      • no acid guarantee
      • not read optimized join is hard 涉及大量扫描相关对象
      • no relation no constriants
  • In-memoryDB: very modern, lacks the durability guarantee of ACID, stores things in volatile memory. e.g. Oracle IMDB

计网总结

包交换 vs 电路交换网络

  • 电路交换
    • 需要建立连接 和 专用物理线路 线路利用率低 没有转发机制
    • 容易受网络中断影响
  • 包交换
    • 更灵活 不需要专用线路 可以线路复用 线路利用率高
    • 不容易受网络中断影响

时延

E2E总时延 = 排队时延 + 处理时延 + 传输时延 + 传播时延

P2P总时延 = 传输时延 + 传播时延

传输过程和基本术语

Message “报文” - Segment “报文段” - Datagram / Packet “数据报 / 包” - Frame“帧” - bit


https://blog.csdn.net/a3192048/article/details/84671340

  • Links: 链路 连接节点的物理介质

  • Service / Interface: between layering

  • Protocol: between peer

  • E2E Client to Server P2P otherwise

  • 根据信息在传输线上的传送方向,分为以下三种通信方式:

    • 单工通信Simplex:单向传输
    • 半双工通信Half-duplex:双向交替传输
    • 全双工通信Duplex:双向同时传输
  • 局域网:多种不同结构

    • 局域网是一种典型的广播信道,主要特点是网络为一个单位所拥有,且地理范围和站点数目均有限。有多种局域网技术,其中以太网占主导地位。可以按照网络拓扑结构对局域网进行分类:星 环 直线
  • 以太网:星形结构局域网 中间使用集线器或交换机连接

  • MAC地址 链路层地址 48位 设备网卡的唯一标识 有多少个适配器=多少个mac地址 os可更换

  • Switch交换机 Hub集线器 Router路由

    • Hub 集线器 [layer1] enables mulitple hosts to create a broadcast Channel (only floods) 作用于物理层的 能使多个主机 创建一个广播频道的设备 具备多个网口,专门实现多台计算机的互联作用。
    • Switch 交换机 [layer 2] 收处理转发以太网帧到网络中的其他设备 会维护一个<mac link/接口> 表 “交换表” 表类似一个LRU缓存 因此能够实现mac地址识别=说它具有“学习能力”. Can have simultaneous p2p connectivity between different hosts
    • Router 路由 [layer 3] 根据routing table提供转发和路由两种功能 转发:将数据包移送到合适输出端 路由:决定数据包的路由路径。
  • Bridge Repeater Gateway

    • Repeater 中继器**[layer 1]** receive and repeat/regenerate signal
    • Bridge 网桥**[layer 2]** 连接两个局域网
    • Gateway 网关 [layer 3] 两个不同网络之间的关口设备

IP-service model / IP best-effort network /最大努力网络

  • packets may be lost, duplicated, reordered

  • connectionless best-effort destination-based forwarding.

Layering & Architecture

  • OSI architecture 应用层传输层之间多了session(建立及管理会话) presentation(数据压缩、加密以及数据描述)
  • TCP/IP architecture 将物理层和数据链路层合并为“网络接口层”

Application Layer

E2E用户服务 HTTP FTP DNS DHCP 远程登录 邮件

Transport Layer

E2E 为应用进程提供端到端的通信服务 在应用层和网络层之间multiplexing和demultiplexing TCP UDP

Hide defects and limitations of the network

Fragmentation & reassembly

resend defect packets

  • TCP
    • Connection oriented: need to set up connection (has overhead)
    • Reliable: it will make sure packets get through, if lost/corrupted resend packet
    • Flow control: “making sure receiver not overwhelmed”
    • Congestion control: “making network is not overloaded”
  • UDP
    • Connectionless: no need to set up connection, if need to send just send
    • Unreliable best-effort
    • No Flow control
    • No Congestion control

No timing / throughput / security gurantee

Why should we even bother using UDP ? you have a lot more control when having a tradeoff

Network Layer

P2P Addressing, Routing, Congestion control 点对点寻址,路由,拥塞控制 Moving data between networks. 涉及协议: IP, ARP, ICMP and routing protocol

  • Addressing

    • IP: 沙漏结构的中点,将异构网络连接起来,使之看起来像个统一的网络

      • IP is an unreliable protocol because it does not guarantee the delivery of a datagram to its destination. The reliability must be provided by upper layer protocols like TCP. IP does not support flow control, retransmission, acknowledgement or error recovery.
      • 地址系统
        • Class-based addressing (过去版本): lead to problem of large #networks in routing table
        • Subnetting and supernetting 子网与超网
          • subnet ip = subnet mask & host ip address
          • Supernetting: classless interdomain routing CIDR 无分类跨网地址系统
            • ip地址=网络前缀+主机号 128.14.35.7/20 表示前 20 位为网络前缀。
            • 意义:减少了路由表项 查找时采取最长前缀匹配原则选择哪一个
      • Fragmentation/Reassembly 报文的拆分重组: enabling heterogenous system to transmit their own “max pkt” 和tcp合作 tcp负责mtu discovery
      • Error reporting and control (ICMP) 错误报告和控制
        • 封装在 IP 数据报中,但是不属于高层协议。
        • ping 用来测试两台主机之间的连通性 通过icmp echo实现
      • Traceroute 追踪一个数据包路径: 封装一个无法交付的udp包, 利用IP协议的“ttl”字段,尝试从每个网关到某个主机的路径引发ICMP 超时响应。
      • IP packet format
  • Routing

    • How does router figure out MAC

      • ARP: routers and hosts maintain a dynamic <IP, MAC> cache, this table is called the ARP table

        If an IP is not in the ARP table, broadcast an ARP request; the host that has that IP address will respond

    • Types & protocols used

      • Intra-domain
        • RIP Routing Information Protocol: rely on local computation in all nodes
          • Distance Vector Protocol (based on Bellman-ford)
        • OSPF Open Shortest Path First Protocol: no local computation; faster loop-free convergence
          • Link state Protocol (based on Dijkstra’s shortest path)
      • Inter-domain
        • BGP Border Gateway Protocol

Link Layer

P2P 相邻网络节点/主机的数据传输和控制 局域网(LAN) 广域网(WAN)

  • Framing 把网络层传下来的包封装成帧 加个开始和结束
  • P2P点对点服务: Logical link control (LLC)
    • Error Detection CRC checksum
    • Flow Control “making sure receiver not overwhelmed”
  • Broadcast广播服务: Media access control (MAC)
    • Frame synchronization 帧的同步 clock based, character counting, byte stuffing.
    • Channel sharing 信道共享- methods:
      • 信道复用:时分,频分,码分
      • 交替:轮询,令牌传递
      • 随机访问 主要例子:Aloha, Ethernet
        • Ethernet MAC采用CSMA/CD协议 (Carrier Sense Multiple Access / Collision Detection) 载波监听多点接入/碰撞检测
          • Only if line is idle, start to send immediately (Carrier Sense 载波监听)
          • if busy wait for “inter-frame gap” = 96 bit time
          • if collision detected, send jam signal, do binary exponential backoff “nth randomly choose k from {0,1,2,…,2^n-1}, then delay k*51.2 μs” 二进制指数后退 collision domain = 1 RTT

Physical Layer

P2P transmission of raw bits

Application Layer

  • 2 different Architecture: Client Server / Peer to Peer

  • Socket = Door

    • Transport Layer = hallway
  • IP + port determines host & process

  • Public domain protocols 公开协议(如下)| Proprietary protocols 专有协议 Skype … etc.

    • HTTP
    • FTP
    • SMTP
    • BitTorrent
  • 不同情况下应用有不同需求 data loss vs time-sensitive vs throughput

常用端口

| 应用 | 应用层协议 | 端口号 | 传输层协议 | 备注 |
| ---- | ---- | ---- | ---- | ---- |
| 域名解析 | DNS | 53 | UDP/TCP | 长度超过 512 字节时使用 TCP |
| 动态主机配置协议 | DHCP | 67/68 | UDP | |
| 超文本传送协议 | HTTP | 80 | TCP | |
| 文件传送协议 | FTP | 20/21 | TCP | 数据连接 20,控制连接 21 |
| 远程终端协议 | TELNET | 23 | TCP | ssh = 22 |
| 简单网络管理协议 | SNMP | 161/162 | UDP | |
| 简单邮件传送协议 | SMTP | 25 | TCP | |
| 邮件读取协议 | POP3 | 110 | TCP | |
| 网际报文存取协议 | IMAP | 143 | TCP | |

Web and HTTP

FTP

使用TCP进行连接, 使用2个连接来传输一个文件

  • 控制连接:服务器打开端口号 21 等待客户端的连接,客户端主动建立连接后,使用这个连接将客户端的命令传送给服务器,并传回服务器的应答。
  • 数据连接:用来传送一个文件数据

根据数据连接是否是服务器端主动建立,FTP 有主动和被动两种模式:

  • 主动模式:服务器端主动建立数据连接,服务器端端口号 20,客户端端口号1024 - 65535随机(因为 0~1023 是熟知端口号)。


  • 被动模式:服务器端被动,客户端主动建立数据连接,客户端端口号自己指定,服务器端的端口号随机。


主动模式要求客户端开放端口号给服务器端,需要去配置客户端的防火墙。被动模式只需要服务器端开放端口号即可,无需客户端配置防火墙。但是被动模式会导致服务器端的安全性减弱,因为开放了过多的端口号。

DNS


DNS 可以使用 UDP 或者 TCP 进行传输,使用的端口号都为 53。大多数情况下 DNS 使用 UDP 进行传输,这就要求域名解析器和域名服务器都必须自己处理超时和重传从而保证可靠性。在两种特殊情况下会使用 TCP 进行传输:

  • 如果返回的响应超过 512 字节(UDP 最大只支持 512 字节的数据)。
  • DNS zone transfer

DNS 负载均衡

同一主机在dns服务器里配置多个主机记录=多个不同服务器ip,dns服务器解析域名时会轮询,这就完成了简单的负载均衡。

DHCP

只适用于动态ip分配的情形,如进入一个新的移动网络。 主机不知道自己ip地址 ask dhcp server

DHCP (Dynamic Host Configuration Protocol) 是用于动态ip分配和配置的协议。

DHCP 配置的内容不仅是 IP 地址,还包括子网掩码、网关 IP 地址。

DHCP 工作过程如下:

  1. 客户端发送 Discover 报文,该报文的目的地址为 255.255.255.255:67,源地址为 0.0.0.0:68,被放入 UDP 中,该报文被广播到同一个子网的所有主机上。如果客户端和 DHCP 服务器不在同一个子网,就需要使用中继代理。
  2. DHCP 服务器收到 Discover 报文之后,发送 Offer 报文给客户端,该报文包含了客户端所需要的信息。因为客户端可能收到多个 DHCP 服务器提供的信息,因此客户端需要进行选择。
  3. 如果客户端选择了某个 DHCP 服务器提供的信息,那么就发送 Request 报文给该 DHCP 服务器。
  4. DHCP 服务器发送 Ack 报文,表示客户端此时可以使用提供给它的信息。


远程登录协议

SSH TELNET

电子邮件协议

发送协议常用 SMTP,读取协议常用 POP3 和 IMAP。

Web页面请求过程

1. DHCP 配置主机信息

  • 假设主机最开始没有 IP 地址以及其它信息,那么就需要先使用 DHCP 来获取。
  • 主机生成一个 DHCP 请求报文,并将这个报文放入具有目的端口 67 和源端口 68 的 UDP 报文段中。
  • 该报文段则被放入在一个具有广播 IP 目的地址(255.255.255.255) 和源 IP 地址(0.0.0.0)的 IP 数据报中。
  • 该数据报则被放置在 MAC 帧中,该帧具有目的地址 FF:FF:FF:FF:FF:FF,将广播到与交换机连接的所有设备。
  • 连接在交换机的 DHCP 服务器收到广播帧之后,不断地向上分解得到 IP 数据报、UDP 报文段、DHCP 请求报文,之后生成 DHCP ACK 报文,该报文包含以下信息:IP 地址、DNS 服务器的 IP 地址、默认网关路由器的 IP 地址和子网掩码。该报文被放入 UDP 报文段中,UDP 报文段又被放入 IP 数据报中,最后放入 MAC 帧中。
  • 该帧的目的地址是请求主机的 MAC 地址,因为交换机具有自学习能力,之前主机发送了广播帧之后就记录了 MAC 地址到其转发接口的交换表项,因此现在交换机就可以直接知道应该向哪个接口发送该帧。
  • 主机收到该帧后,不断分解得到 DHCP 报文。之后就配置它的 IP 地址、子网掩码和 DNS 服务器的 IP 地址,并在其 IP 转发表中安装默认网关。

2. ARP 解析 MAC 地址

  • 主机通过浏览器生成一个 TCP 套接字,套接字向 HTTP 服务器发送 HTTP 请求。为了生成该套接字,主机需要知道网站的域名对应的 IP 地址。
  • 主机生成一个 DNS 查询报文,该报文具有 53 号端口,因为 DNS 服务器的端口号是 53。
  • 该 DNS 查询报文被放入目的地址为 DNS 服务器 IP 地址的 IP 数据报中。
  • 该 IP 数据报被放入一个以太网帧中,该帧将发送到网关路由器。
  • DHCP 过程只知道网关路由器的 IP 地址,为了获取网关路由器的 MAC 地址,需要使用 ARP 协议。
  • 主机生成一个包含目的地址为网关路由器 IP 地址的 ARP 查询报文,将该 ARP 查询报文放入一个具有广播目的地址(FF:FF:FF:FF:FF:FF)的以太网帧中,并向交换机发送该以太网帧,交换机将该帧转发给所有的连接设备,包括网关路由器。
  • 网关路由器接收到该帧后,不断向上分解得到 ARP 报文,发现其中的 IP 地址与其接口的 IP 地址匹配,因此就发送一个 ARP 回答报文,包含了它的 MAC 地址,发回给主机。

3. DNS 解析域名

  • 知道了网关路由器的 MAC 地址之后,就可以继续 DNS 的解析过程了。
  • 网关路由器接收到包含 DNS 查询报文的以太网帧后,抽取出 IP 数据报,并根据转发表决定该 IP 数据报应该转发的路由器。
  • 因为路由器具有内部网关协议(RIP、OSPF)和外部网关协议(BGP)这两种路由选择协议,因此路由表中已经配置了网关路由器到达 DNS 服务器的路由表项。
  • 到达 DNS 服务器之后,DNS 服务器抽取出 DNS 查询报文,并在 DNS 数据库中查找待解析的域名。
  • 找到 DNS 记录之后,发送 DNS 回答报文,将该回答报文放入 UDP 报文段中,然后放入 IP 数据报中,通过路由器反向转发回网关路由器,并经过以太网交换机到达主机。

4. HTTP 请求页面

  • 有了 HTTP 服务器的 IP 地址之后,主机就能够生成 TCP 套接字,该套接字将用于向 Web 服务器发送 HTTP GET 报文。
  • 在生成 TCP 套接字之前,必须先与 HTTP 服务器进行三次握手来建立连接。生成一个具有目的端口 80 的 TCP SYN 报文段,并向 HTTP 服务器发送该报文段。
  • HTTP 服务器收到该报文段之后,生成 TCP SYN ACK 报文段,发回给主机。
  • 连接建立之后,浏览器生成 HTTP GET 报文,并交付给 HTTP 服务器。
  • HTTP 服务器从 TCP 套接字读取 HTTP GET 报文,生成一个 HTTP 响应报文,将 Web 页面内容放入报文主体中,发回给主机。
  • 浏览器收到 HTTP 响应报文后,抽取出 Web 页面内容,之后进行渲染,显示 Web 页面。

TCP/UDP

  • UDP header:src port, dest port, header length, checksum

  • TCP header:

    • SRC,DST ports (16-bit each)
    • Sequence #序号, Ack #确认号(32-bit each) 序号:当前数据段第一个字节编号 确认号:期望下个数据段第一个字节编号
    • Header length(data offset), reserve flags(indicate pkt types) SYN FIN ACK
    • Receive window (16-bits that specify how big my receive buffer is) important for flow control
    • Check sum, Urgent Ptr: can be used by app to indicate a receiving host need to pay attention about packets but most case is ignored.
    • Options
  • TCP

    • 全双工:代表建立连接后可以双向传送接受数据
    • 以连接为导向: 意味着需要主动建立连接
    • 可以提供包级别的可靠传输。什么是包级别的可靠传输:udp只提供位级别的可靠传输,由checksum实现,只能简单检测数据是否被污染;包级别的可靠传输在网络层(基本的ip-service model = best-effort destination-based forwarding)是不能被保证的,但很多实际应用又需要这个保证,所以就在传输层由tcp来做。保证包级别可靠传输主要靠两个机制:
      • 接收方发送确认ack:要求接收者每次收到数据都回复发送方说一声“我收到了”。只有确认机制还不够,它只能告诉我们包有没有收到,有问题的包还需要进行恢复
      • 计时器/超时检测机制:解决“什么时候重新发没收到的/有问题的包”。机制很简单:只要超时就重发;超时标准要比RTT稍微多一点且尽量接近RTT,重传时 RTO新 = RTO旧*2 by karn’s algorithm
    • 提供流量控制:确保接收方的buffer不会overflow
      • 接收方发送的确认报文中的窗口字段可以用来控制发送方窗口大小,从而影响发送方的发送速率。将窗口字段设置为 0,则发送方不能发送数据。
    • 提供拥塞控制:与整个网络有关, 网络比较拥挤的时候控制发送方的窗口。增加一个congestion window CWND
      • send window (#unacknowledge pkts in flight) = min(CWND, RWND) 当cwnd更小时,我们就进入了一个拥塞控制的模式
        • 慢开始与拥塞避免
          • cwnd := 1, cwnd++ for each ack == cwnd*=2 each RTT
          • cwnd >= ssthresh do congestion avoidance: cwnd++ for each RTT;
        • 快重传和快恢复 on dupack fast retransmit the next sequence from receiver side. Fast recover it by setting ssthresh = cwnd/2, cwnd = ssthresh, do congestion avoidance

三次握手建立连接 四次挥手关闭连接

  • 为什么需要三次握手:TCP是一个全双工通信协议,也就是说它是双向收发的。初始序号是两边都要随机选择, 这个主要是因为安全问题要避免TCP sequence prediction attack的。然后,因为两方都要告诉对面自己的初始序号 = 也就是通信双方数据原点的位置所以至少要4次握手。SYN ACK SYN ACK, 然后因为tcp header里reserve flags部分SYN ACK是可以同时为1的 中间两个步骤可以合并 所以3次就够。

  • 第三次握手过程中sender可以piggyback data,而receiver只能在这次握手结束后才可以。

  • 在socket编程中,客户端执行connect()时,将触发三次握手。

  • 四次挥手是因为tcp是双向的性质+tcp半关闭造成的。假如客户端主动关闭,那么直到客户端收到服务器的ack前,数据传输仍然可以从服务器到客户端进行,tcp是半关闭的client-server关了但server-client方向还没关,这也是为什么有close-wait状态的原因,服务器端要等待最后这一波数据传输的完成。所以这也解释了中间两次为什么不能像建立连接一样合并。当服务器没有数据要传时他就可以通过fin来释放自己方向的连接了。

  • TIME_WAIT / 2MSL等待状态: 首先被动方如果接受了ack不需要等,因为他已经完成同步了可以释放资源了。然后主动方必须要等因为他不知道自己发的ack对面收没收到。这里需要假设没收到。那么它需要等待来自对面的超时重传消息,这最坏情况要1MSL。然后他等到了的话又要发ack回去。所以主动方要等最坏情况一个来回2MSL。

  • 在socket编程中,任何一方执行close()操作即可产生挥手操作。

应用层面实现可靠传输-发收双方缓存工作原理:滑动窗口

  • Buffers on senders and receivers are operated as a “sliding window”. Senders maintain buffers of sent-but-unacknowledged packets. Receivers maintain buffers to assure non-duplicate, in-order delivery to applications.
  • 发送方缓存:已发送但未确认的包 接收方缓存:未按序到达的包,以保证去重和按序交付

窗口是缓存的一部分,用来暂时存放字节流。发送方和接收方各有一个窗口,接收方通过 TCP 报文段中的窗口字段告诉发送方自己的窗口大小,发送方根据这个值和其它信息设置自己的窗口大小。

发送窗口内的字节都允许被发送,接收窗口内的字节都允许被接收。如果发送窗口左部的字节已经发送并且收到了确认,那么就将发送窗口向右滑动一定距离,直到左部第一个字节不是已发送并且已确认的状态;接收窗口的滑动类似,接收窗口左部字节已经发送确认并交付主机,就向右滑动接收窗口。

接收窗口只会对窗口内最后一个按序到达的字节进行确认。发送方得到一个字节的确认之后,就知道这个字节之前的所有字节都已经被接收,窗口就可以右移了。如果发送方窗口迟迟收不到来自接收方的确认,就会超时重传。


拥塞控制

如果网络出现拥塞,分组将会丢失,此时发送方会继续重传,从而导致网络拥塞程度更高。因此当出现拥塞时,应当控制发送方的速率。这一点和流量控制很像,但是出发点不同。流量控制是为了让接收方能来得及接收,而拥塞控制是为了降低整个网络的拥塞程度。


1. 慢开始与拥塞避免

  • Slow start: 初始 cwnd = 1,每收到1个ack cwnd++,cwnd指数增长:2、4、8 …

  • 为避免cwnd增长过快,设置一个慢开始门限 ssthresh,当 cwnd >= ssthresh 时,进入拥塞避免,每个轮次cwnd++ 线性增长

  • 超时: 令 ssthresh = cwnd / 2,重新执行慢开始。

2. 快重传与快恢复

接收方只对最后一个收到的有序报文段进行确认。在发送方,如果收到三个重复确认,那么可以知道接收方下一个报文段丢失,此时执行快重传

只是丢失个别报文段,而不是网络拥塞,因此执行快恢复,令 ssthresh = cwnd / 2 ,cwnd = ssthresh,注意到此时直接进入拥塞避免。


HTTP

  • Stateless
    • Every request is completely independent
    • Similar to transaction
    • programming, local storage, cookies, sessions are used to create enhanced user experience
  • HTTPS
    • data sent is encrypted
    • SSL / TLS
    • install a certificate on web host.
  • HTTP Methods
    • GET: fetch data from server
    • POST: submit data to server
    • PUT: update data already on the server
    • DELETE: deletes data from server
  • Headers
    • General: Request URL, Method, Status code, remote address, referer policy
    • Response: server, set-cookie, content-type, content-length, date
    • Request: cookies, accept-xxx, content-type, content-length, authorization, user-agent, referrer

基本方法

  • Get 获取资源
  • Post 传输数据
  • Head 获取报文头部
  • Put/Delete 上传/删除文件 不安全没有验证机制
  • Patch 部分修改资源

状态码

200 OK

3XX 重定向

4XX 客户端错误 如 404 NOT FOUND

5XX 服务端错误

长连接短连接&流水线

  • HTTP 1.1之后采用长连接persistent connection 只需要建立一次 TCP 连接就能进行多次 HTTP 通信。如果要断开连接,需要由客户端或者服务器端提出断开,使用 Connection : close;在 HTTP/1.1 之前默认是短连接的,如果需要使用长连接,则使用 Connection : Keep-Alive

  • 默认情况下,HTTP 请求是按顺序发出的,下一个请求只有在当前请求收到响应之后才会被发出。由于受到网络延迟可能需要等待很长时间。

    流水线是在同一条长连接上连续发出请求,而不用等待响应返回,这样可以减少延迟。

Database Concurrency Summary

数据库并发控制

Concurrency control

ACID principle for transaction management

Atomicity 原子性 “all or none”

事务所做操作要么全部提交要么全部失败回滚

  • 实现: 日志, Shadow paging

Consistency 一致性 “looks correct”

  • 数据库一致性 仍遵循完整性约束 未来事务能看到过去事务造成的后果

  • 事务一致性 事务前后数据库内容一致 一致意味着多个事务访问相同数据得到相同结果

Isolation 隔离性 “lives alone”

事务所作修改提交前对其他事务不可见

  • 两种类别的并发控制协议来保证隔离性
    • 悲观 从最初就避免问题的发生
    • 乐观 假设冲突发生不常见,发生后再处理

Durability 持久性 “survives failure”

事务提交的修改应被持久化,即便系统崩溃也不受影响

Conflict Serializability in transaction scheduling

  • Schedules are equivalent to some serial schedule.
  • This is what (almost) every DBMS supports when you ask for the SERIALIZABLE isolation level.
  • Schedule S is conflict serializable if you are able to transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions.
  • Verify using either the swapping method or dependency graphs.
  • A schedule is conflict serializable if and only if its dependency graph is acyclic.

Consistency Conflicts in transaction management

  • 读写冲突:
    • 不可重复读 “non-repeatable read” 两个读的中间数据被另一个事务覆写并提交 导致第二次读的同一行值不同
      • transaction reads committed UPDATES from another transaction. The same row now has different values than it did when your transaction began.
      • A non-repeatable read occurs, when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.
    • 幻读 “phantom read” 两个读的中间被另一个事务插入/删除行并提交 导致第二次读返回的行不同
      • Phantom reads are similar but when reading from committed INSERTS and/or DELETES from another transaction. There are new rows or rows that have disappeared since you began the transaction.
      • A phantom read occurs when, in the course of a transaction, two identical queries are executed, and the collection of rows returned by the second query is different from the first.
    • 避免不可重复读锁行就足够,避免幻影读则需要锁表
  • 写读冲突: 脏读 “dirty reads” Reading uncommitted data 读了另一个事务没有提交的数据
  • 写写冲突:丢失修改 “lost updates” overwriting uncommitted data 一个事务覆盖另一个事务没有提交的数据

Pessimistic Concurrency Control

Two-Phase Locking CC protocol

  • Pessimistic 悲观的 assume there are a lot of contentions in transactions 因此安全程度高 但并发能力会有所限制 (not allow all serializable schedules)

  • Two basic types of locks & compatibility matrix

    • Write lock X (exclusive) Read lock S (shared)
  • 2PL always guarantees conflict-serializable schedules = guarantees serializable schedules

  • 二个阶段 “/\”

    • 生长期 只允许拿锁
    • 收缩期 一旦释放第一个锁 进入收缩期
      • 只允许释放锁 或不释放锁
      • 不允许拿锁
  • Problem: can have cascading aborts in case of dirty reads。例如t1写 t2读,此时t1还没commit便abort,t2就不得不abort;如果t2也写了还没来得及提交而后面还有事务读,这个反应会连锁下去

  • Solution: Strong Strict 2PL (SS2PL) 一个事务提交/abort后才允许放锁 一次放完

  • Deadlock handling:

    • Detection:
      • run a background task that periodically checks for cycles in the waits-for graph / 超时检测
      • selects a victim to rollback and breaks the cycle 选择标准和rollback程度取决于具体设计
    • Prevention:
      • assign priority based on timestamp older timestamp = higher priority
      • after comparison, one will get lock, the other waits, this essentially breaks deadlock.

Locking Granularities and hierarchy

MySQL 中提供了两种封锁粒度:行级锁以及表级锁。
应该尽量只锁定需要修改的那部分数据,而不是所有的资源。锁定的数据量越少,发生锁争用的可能就越小,系统的并发程度就越高。
但是加锁需要消耗资源,锁的各种操作(包括获取锁、释放锁、以及检查锁状态)都会增加系统开销。因此封锁粒度越小,系统开销就越大。
在选择封锁粒度时,需要在锁开销和并发程度之间做一个权衡。

  • 一个事务想要更新1billion tuples = ask for locks 1 billion times VERY SLOW
  • WE NEED a lock hierarchy that reflects what is down there (行级) at high level (表级)

如何更方便的进行多粒度封锁: 意向锁

  • 意向锁Intention lock: “lock escalation” 细粒度锁太多时动态的分配可用的粗粒度锁 大大减少lock manager需要处理的锁的请求数量

    Intention locks allow a higher level node to be locked in shared or exclusive mode without having to check all descendant nodes. If a node is in an intention mode, then explicit locking is being done at a lower level in the tree. 基本相当于告诉你下面有什么样的锁 通过兼容矩阵 可以从hierarchy高处如表级快速判断能不能读/写下层具体行的数据

  • Intention-Shared (IS):下面某个地方有读锁
  • Intention-Exclusive (IX):下面某个地方有写锁
  • Shared+Intention-Exclusive (SIX): 整个子树可共享读,下面某个孩子有写锁

  • 各种锁的兼容关系如下:

    | | X | IX | S | IS |
    | ---- | ---- | ------ | ---- | ------ |
    | X | | | | |
    | IX | | OK | | OK |
    | S | | | OK | OK |
    | IS | | OK | OK | OK |


    解释如下:
    • 任意 IS/IX 锁之间都是兼容的,因为它们只表示想要对表加锁,而不是真正加锁;
    • 这里兼容关系针对的是表级锁,而表级的 IX 锁和行级的 X 锁兼容,两个事务可以对两个数据行加 X 锁。(事务 T1 想要对数据行 R1 加 X 锁,事务 T2 想要对同一个表的数据行 R2 加 X 锁,两个事务都需要对该表加 IX 锁,但是 IX 锁是兼容的,并且 IX 锁与行级的 X 锁也是兼容的,因此两个事务都能加锁成功,对同一个表中的两个数据行做修改。)

三级封锁协议

  • 一级封锁协议 事务 T 要修改数据 A 时必须加 X 锁,直到 T 结束才释放锁。可以解决丢失修改问题
  • 二级封锁协议 在一级的基础上,要求读取数据 A 时必须加 S 锁,读取完马上释放 S 锁。可以解决读脏数据问题
  • 三级封锁协议 在二级的基础上,要求读取数据 A 时必须加 S 锁,直到事务结束了才能释放 S 锁。可以解决不可重复读的问题

MySQL 隐式与显示锁定

MySQL 的 InnoDB 存储引擎采用两段锁协议,会根据隔离级别在需要的时候自动加锁,并且所有的锁都是在同一时刻被释放,这被称为隐式锁定
InnoDB 也可以使用特定的语句进行显示锁定:

SELECT ... LOCK IN SHARE MODE;
SELECT ... FOR UPDATE;

Optimistic Concurrency Control

Ensure ordering using different methods. Assume they are good, handle problems later.

Basic Timestamp ordering (T/O) CC protocol

  • Basic idea: assign unique fixed monotonically increasing numeric value “timestamp” to determine commit order.

    • different schemes for implementing timestamp: system clock, logical counter…
  • Timestamp smaller TS(i) < TS(j) = transaction i appears before j.

  • Method:

    • Transaction r/w object without locks.
    • Every object X is tagged with timestamp of last txn that successfully did read / write. R-TS(X) / W-TS(X)
    • Reads
      • Check timestamp for every operations, if txn k tries to access an object from the future, TS(k) < W-TS(X), (timestamp order of k violates writer of X), k aborts and restarts with newer TS.
      • Otherwise allow txn k to read X, update R-TS(X) = max(R-TS(X), TS(k)), make a local copy of X to ensure repeatable reads for k. (subsequent read occurs in local copy)
    • Writes
      • If TS(k) < R-TS(X) or TS(k) < W-TS(X) abort and restart k.
        • optimization: in case of such a write we can simply ignore it and continue (“Thomas write rule”); this even allows the txn to commit. Somewhat useful but will not give us a conflict serializable schedule.
      • else allow k to write X and update W-TS(X), make a local copy of X to ensure repeatable reads for k. (subsequent read occurs in local copy)
  • Comments:

    • Can gurantee to generate a conflict serializable schedule if do not use “Thomas write rule”
    • No deadlocks because no “waiting” at all.
    • Can have starvation for long txns if shorter ones keep causing conflicts.
    • permits schedules that are not recoverable 可恢复事务: 他读到的所有事务在他之前commit
    • high overhead of copying data to local workspace and updating timestamps.
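
Basic T/O 的读写检查可以浓缩成下面的示意(忽略本地拷贝、提交与重启细节;字段命名为示意):

```c
#include <stdbool.h>

struct object { long r_ts, w_ts; /* ...value... */ };  // 最近成功读/写者的时间戳

// 事务 ts 读 X:不允许读“来自未来”的写;返回 false = abort 并以新 ts 重启
bool to_read(struct object *x, long ts) {
    if (ts < x->w_ts) return false;          // X 已被更晚的事务写过
    if (ts > x->r_ts) x->r_ts = ts;          // R-TS(X) = max(R-TS(X), TS)
    return true;                             // 随后读本地拷贝以保证可重复读
}

// 事务 ts 写 X:不允许覆盖已被更晚事务读或写过的值
bool to_write(struct object *x, long ts) {
    if (ts < x->r_ts || ts < x->w_ts) return false;    // abort & restart
    x->w_ts = ts;
    return true;
}
```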

Optimistic CC (OCC) protocol

  • Basic idea: No locks at all.

    • DBMS creates a private workspace for each transaction
    • any obj read is copied into thread local workspace for following r/w, perform modifications at local workspace
    • Validation: when commits, DBMS compares local workspace to see if it incurrs conflicts with other txns. Get a timestamp.
    • Atomic install to global database if no conflicts.
  • 3 Phases

      1. Read: make w/r modifications at private thread local workspace
      2. Validation: check if conflicts exist before commit, get a timestamp
      3. Write: valid => apply local changes to global database, abort & restart otherwise.
  • Serial Validation Explained

    • Maintain a global view of all active txns.
    • record w/r set while txns running and write into private workspace
    • execute validation and write phase in a protected critical section
    • when txn invokes COMMIT, DBMS checks existence of conflicts through methods like the following, it checks if 双方write set intersects, if so, abort (myself) .
      • Backward Validation: look at all older txns in the system
      • Forward Validation: look at all younger txns in the system.
  • Comments:

    • OCC Works well when #conflicts is low, transactions read heavy, mostly accessing disjoint subsets of data. In this case locking is wasteful.
    • High overhead of copying data to local space. Serial Validation/Write bottlenecks. Aborts more wasteful than 2PL because txns have already executed. The validation step also need latches when comparing read write sets because maybe T2 is writing while T1 is reading. latch contention when a lot of concurrent txns ?

Partition-based Timestamp ordering protocol

  • Idea:
    • partition db into disjoint subsets called partitions / shards so that we do not need locks.
    • Txns are queued and assigned timestamps on arrival, ordering them for serial execution at each partition. Within a partition there is NO concurrency: execution is single-threaded.
    • Each partition is protected by a single lock. A txn acquires a partition’s lock when it has the lowest timestamp in that partition’s queue, and starts once it holds the locks of all partitions it needs (r/w).
      • A txn can read anything in the partitions it has locked
      • writes occur in place, with an in-memory buffer to undo changes on abort.
      • abort & restart if it accesses a partition whose lock it does not hold.
    • Only check conflicts between txns within the same partition.
    • The finer-grained the partitions, the more parallelism we get on disjoint sets of data.
  • Comments:
    • Fast when the DBMS knows which partitions a txn needs before it starts, and most txns only need to access a single partition.
    • Multi-partition txns are slower, and some partitions can sit idle.

Phantom problem

  • The protocols above only deal with reads/updates of existing objects. With insertions / updates / deletions we can get the phantom problem.
    • 2PL cannot solve this because the new rows do not exist yet, so there is nothing to lock.
    • Predicate locking can solve it but is too hard to implement.
    • Index locking can solve it.

Isolation Level

The root cause of concurrency anomalies is broken transaction isolation, and the fix is concurrency control to guarantee isolation. Concurrency control can be implemented with locking, but locking requires users to manage it themselves, which is quite complex. A DBMS therefore provides transaction isolation levels, letting users handle concurrency issues in a much easier way.

isl

  • RU (Read Uncommitted): modifications made in a transaction are visible to other transactions even before commit.
  • RC (Read Committed): a transaction can only read modifications made by transactions that have already committed.
  • RR (Repeatable Read): reading the same data multiple times within one transaction returns the same result.
  • S (Serializable): transactions are forced to execute serially, so they cannot interfere and no concurrency anomalies occur. Concurrency control degrades from MVCC to lock-based: snapshot reads and current reads are no longer distinguished, every read is a current read, reads take shared (S) locks and writes take exclusive (X) locks. Reads and writes then conflict, so concurrency drops sharply; not recommended in MySQL/InnoDB.

Multi-version concurrency control

  • “Let writers make a “new” copy while readers use an appropriate “old” copy.”

  • The most common concurrency control approach in DBMSs. Very well suited to read-heavy OLTP workloads; provides RR-level isolation.

  • Read = read one version; write = create a new version. Version numbers are usually assigned via a timestamp mechanism.

  • Only write-write conflicts remain, resolved by first-committer-wins / first-updater-wins.

  • Under RC and RR a version chain is used: snapshots of the multiple versions live in the undo log, where rollback pointers link all snapshots of a data row together.

  • MVCC maintains a read-view structure; on every read it decides whether a given row snapshot is visible.

  • Under MVCC, snapshot reads (plain SELECT…) need no locks; current reads (inserts, deletes, updates) still take locks.

  • To fix phantoms for current reads under RR, MySQL uses next-key locks, essentially index-record locks + gap locks: not only the index entries are locked but also the gaps between them. Think of a large interval split into many small ones: locking the small intervals’ endpoints alone is not enough, the gaps must be locked too for safety.
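
A toy sketch of read-view visibility, loosely modeled on the InnoDB scheme described above; the fields are illustrative, not the real InnoDB structures:

```java
import java.util.Set;

// Toy sketch of MVCC read-view visibility (illustrative field names).
final class Version {
    long creatorTxnId;   // txn that created this version
    Version older;       // rollback pointer to the previous version
    Object value;
}

final class ReadView {
    long lowLimitId;        // txns with id >= this had not started when the view was made
    Set<Long> activeTxnIds; // txns still active when the view was made
    long myTxnId;

    boolean visible(Version v) {
        if (v.creatorTxnId == myTxnId) return true;     // my own change
        if (v.creatorTxnId >= lowLimitId) return false; // started after the view
        return !activeTxnIds.contains(v.creatorTxnId);  // must already be committed
    }

    // Walk the version chain until a visible snapshot is found.
    Object read(Version newest) {
        for (Version v = newest; v != null; v = v.older)
            if (visible(v)) return v.value;
        return null; // row did not exist for this view
    }
}
```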

关系数据库设计理论

函数依赖

Write A -> B to mean A uniquely determines B; equivalently, B is functionally dependent on A.

Full Functional Dependency: X is functionally dependent on Y but not functionally dependent on any proper subset of Y.

A Partial Functional Dependency occurs when a relation has a composite primary key (a primary key made up of multiple columns) and a non-key column is functionally dependent on a proper subset of those columns.

If A -> B and B -> C, then A -> C is a transitive functional dependency.

For example, let there be a relation R(Course, Sid, Sname, Fid, Schedule, Room, Marks).

Full Functional Dependencies: {Course, Sid} -> Sname, {Course, Sid} -> Marks, etc.

Partial Functional Dependencies: Course -> Schedule, Course -> Room

Anomalies

The student-course relation below has the functional dependency {Sno, Cname} -> {Sname, Sdept, Mname, Grade}, with key {Sno, Cname}. That is, once the student and the course are fixed, all other information is determined.

| Sno | Sname | Sdept | Mname | Cname | Grade |
| --- | --- | --- | --- | --- | --- |
| 1 | Student-1 | Dept-1 | Dean-1 | Course-1 | 90 |
| 2 | Student-2 | Dept-2 | Dean-2 | Course-2 | 80 |
| 2 | Student-2 | Dept-2 | Dean-2 | Course-1 | 100 |
| 3 | Student-3 | Dept-2 | Dean-2 | Course-2 | 95 |

A relation that violates the normal forms suffers from four main kinds of anomalies:

  • Redundant data: e.g., Student-2 appears twice.
  • Modification anomaly: information is updated in one record, but the same information in another record is left unchanged.
  • Deletion anomaly: deleting one piece of information loses other information. E.g., deleting Course-1 means deleting rows 1 and 3, losing Student-1’s information entirely.
  • Insertion anomaly: e.g., a student who has not yet enrolled in any course cannot be inserted.

Normal forms

Normalization theory exists to resolve the four anomalies above.

Higher normal forms build on lower ones; 1NF is the lowest.

1. First normal form (1NF)

Attributes are atomic and cannot be subdivided.

2. Second normal form (2NF)

1NF + no non-prime attribute is partially dependent on the key (every non-prime attribute is fully functionally dependent on the candidate key).

This can be achieved by decomposition.

| Sno | Sname | Sdept | Mname | Cname | Grade |
| --- | --- | --- | --- | --- | --- |
| 1 | Student-1 | Dept-1 | Dean-1 | Course-1 | 90 |
| 2 | Student-2 | Dept-2 | Dean-2 | Course-2 | 80 |
| 2 | Student-2 | Dept-2 | Dean-2 | Course-1 | 100 |
| 3 | Student-3 | Dept-2 | Dean-2 | Course-2 | 95 |

In the student-course relation above, {Sno, Cname} is the primary key, with the following functional dependencies:

  • Sno -> Sname, Sdept
  • Sdept -> Mname
  • Sno, Cname -> Grade

Grade is fully functionally dependent on the primary key; it carries no redundancy, since each student has a specific grade in each course.

Sname, Sdept, and Mname are only partially dependent on the primary key; when a student takes several courses, these values appear repeatedly, creating heavy redundancy.

After decomposition:

Relation-1

| Sno | Sname | Sdept | Mname |
| --- | --- | --- | --- |
| 1 | Student-1 | Dept-1 | Dean-1 |
| 2 | Student-2 | Dept-2 | Dean-2 |
| 3 | Student-3 | Dept-2 | Dean-2 |

with the functional dependencies:

  • Sno -> Sname, Sdept
  • Sdept -> Mname

Relation-2

| Sno | Cname | Grade |
| --- | --- | --- |
| 1 | Course-1 | 90 |
| 2 | Course-2 | 80 |
| 2 | Course-1 | 100 |
| 3 | Course-2 | 95 |

with the functional dependency:

  • Sno, Cname -> Grade (a full functional dependency)

3. Third normal form (3NF)

2NF + no non-prime attribute is transitively dependent on the key.

Relation-1 contains the following transitive functional dependency:

  • Sno -> Sdept -> Mname

It can be decomposed into the following two tables:

| Sno | Sname | Sdept |
| --- | --- | --- |
| 1 | Student-1 | Dept-1 |
| 2 | Student-2 | Dept-2 |
| 3 | Student-3 | Dept-2 |

| Sdept | Mname |
| --- | --- |
| Dept-1 | Dean-1 |
| Dept-2 | Dean-2 |

ER diagrams

  • One-to-many: a line with an arrow
  • One-to-one: a line with arrows on both ends
  • Many-to-many: a line without arrows

erdiagram

Representing a relationship that occurs multiple times

If an entity appears several times in a relationship, connect it with that many lines.

The figure below shows a course prerequisite relationship: the relationship involves two Course entities, the first being the prerequisite course and the second the follow-on course, so two lines are needed to represent it.


## Representing subclasses

# Indexing
  • the data itself can be physically stored in only one sort order

  • https://blog.csdn.net/jiadajing267/article/details/54581262

  • Clustered: the data is actually ordered the same way as the index key

  • Non-clustered: only a list of references

  • Index key: what the sort order is based on

  • B+ tree: the leaf level is a sorted array/linked list, the non-leaf levels form a balanced multi-way sorted index; supports both efficient equality and range queries (nodes are internally sorted, so binary search works within a node)

  • B+ tree vs B tree: B+ tree non-leaf nodes act as a multi-level index and hold no record pointers (saving space so each node stores more index entries); the leaf level is a sorted linked list of arrays

  • Why it suits storage: B+ tree nodes must be at least half full, and since it is a multi-way search tree each node holds many entries, so compared to a red-black tree the height reduction is dramatic. Lookup cost is directly tied to tree height; a B+ tree is very flat, greatly reducing the number of IOs, and it supports bulk-loading, making it well suited to file systems and data storage (see the height comparison sketch after this list)

  • Differences between InnoDB and MyISAM

    1. InnoDB supports transactions, MyISAM does not. This is one of the main reasons MySQL switched its default storage engine from MyISAM to InnoDB;

    2. InnoDB supports foreign keys, MyISAM does not. Converting an InnoDB table with foreign keys to MyISAM fails;

    3. InnoDB uses a clustered index, MyISAM a non-clustered one. In a clustered index the row data lives in the leaf nodes of the primary key index, so InnoDB requires a primary key, and lookups through it are very fast. Secondary index lookups, however, take two steps: first find the primary key, then fetch the row through the primary key index. The primary key should therefore stay small, since a large primary key inflates every other index. MyISAM is non-clustered: data and index files are separate, indexes store pointers into the data file, and the primary and secondary indexes are independent of each other.

    4. InnoDB does not store the exact row count, so select count(*) from table requires a full table scan. MyISAM keeps the row count in a variable and simply reads it out, which is very fast;

    5. InnoDB’s minimum lock granularity is the row; MyISAM’s is the table. In MyISAM a single update statement locks the whole table, blocking all other queries and updates, which limits concurrent access. This is another main reason for the default-engine switch;
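
Why the flat B+ tree height above matters in practice: a quick back-of-the-envelope comparison for 100M keys, with an assumed (illustrative) fanout of 100:

```java
// Rough height comparison for N keys: balanced binary tree vs B+ tree.
// The fanout of 100 is an assumption for illustration, not a measured value.
public class TreeHeight {
    public static void main(String[] args) {
        double n = 100_000_000;  // 100M keys
        double fanout = 100;     // assumed B+ tree fanout
        System.out.println("binary tree height: " + Math.ceil(Math.log(n) / Math.log(2)));      // ~27
        System.out.println("B+ tree height:     " + Math.ceil(Math.log(n) / Math.log(fanout))); // ~4
    }
}
```

At roughly one disk IO per level, that is about 27 block reads versus 4.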

Nosql

better suited for unstructured data

JVM

JVM architecture

  • Main function: load & execute java application

  • Process: edit -> javac myapp.java -> java myapp (creates a JVM instance)

  • 1WkqSg.png
  • 1WAGnA.png
  • components:

    • class loader: takes .class files as input, outputs bytecode to the execution engine
    • Runtime data areas
    • execution engine: executes bytecode by talking to the OS (may use native method calls that are translated to machine language)

1. Class Loader and its subsystem

  • 1WA2NV.png

Load

  • Load: Load bytecode into memory
    • can read from different sources: file system, socket
    • can throw ClassNotFoundException
    • Three types of class loaders:
      • bootstrap rt.jar: load java internal classes
      • extension jre/lib/ext: load classes from additional application jar in jre/lib
      • application CLASSPATH, -cp: load classes from specified path
  • Link phase
    • Verify: check bytecode compatibility with the JVM
    • Prepare: allocate memory for class (static) variables and initialize them to default values
    • Resolve: resolve symbolic references to other classes / constant-pool entries into actual references
      • can throw NoClassDefFoundError

Initialize

  • Initialize
    • Execute static code blocks ("static initializers")
    • actual initialization of static variables

When classes are initialized

  1. Active references
    The JVM spec does not mandate when loading must happen, but it strictly specifies exactly five situations in which a class must be initialized (loading, verification, and preparation happen first):
  • On the bytecode instructions new, getstatic, putstatic, or invokestatic, if the class has not been initialized yet. The most common cases generating these 4 instructions: instantiating an object with new; reading or writing a class’s static field (except static final fields whose values were folded into the constant pool at compile time); and calling a class’s static method.
  • When making reflective calls via the java.lang.reflect package, if the class has not been initialized.
  • When initializing a class whose superclass has not yet been initialized, the superclass is initialized first.
  • At JVM startup, the user designates a main class (the one containing main()); the JVM initializes that class first;
  • With JDK 1.7 dynamic language support, if a java.lang.invoke.MethodHandle instance resolves to a REF_getStatic, REF_putStatic, or REF_invokeStatic method handle whose class has not been initialized, that class is initialized first;
  2. Passive references
    The 5 scenarios above are called active references. Every other way of referencing a class does not trigger initialization and is called a passive reference. Common examples:
  • Referencing a superclass’s static field through a subclass does not initialize the subclass.

      System.out.println(SubClass.value);  // value is defined in SuperClass

  • Referencing a class through an array definition does not initialize the class. It does initialize the array class, a VM-generated subclass of Object that carries the array’s attributes and methods.

      SuperClass[] sca = new SuperClass[10];

  • Constants are copied into the calling class’s constant pool at compile time; there is no direct reference to the class defining the constant, so that class is not initialized.

      System.out.println(ConstClass.HELLOWORLD);

2. Runtime data areas

  • per thread means operations on thread-local storage are generally safe.
  • per JVM means operations on shared data areas such as Metaspace and the Heap are not thread-safe.

Per-thread

1WmTkn.png
PC Register
  • PC Register: program counter register
    • pointer to next instruction per thread
Java Stacks
  • Java stacks: chains of stack frames corresponding to each thread’s currently executing methods, per thread
    • use -Xss to set the stack size we want to maintain
    • when a stack runs out of memory, a StackOverflowError is thrown
Native method Stacks
  • Native method stacks: created per thread if native methods are used.
    • a thread’s method may call a native method (e.g., load a dll and run something); a native method stack is then created and you get a pointer to it

Per-JVM

1WZNA1.md.png
Metaspace / Method Area
  • Method Area (“PermGen space”, 64MB default): per-JVM metadata for classes, available for reflection.

    • use -XX:MaxPermSize to adjust the size if we need to store many more classes
    • Removed since Java 8
    • Now called: Metaspace
      • a separate memory region in the native operating system
      • no fixed limit now, can grow indefinitely, but a limit can be configured.
Heap
  • Heap: stores object data such as arrays and objects… per JVM.

    • now also holds the runtime constant pool and string pool
    • use -Xms for the minimum size, -Xmx for the maximum size if tuning is needed.
Direct Memory
  • Direct memory: JDK 1.4 introduced the NIO classes, which can allocate off-heap memory directly through native libraries and operate on it via a DirectByteBuffer object on the Java heap acting as the reference to that memory. In some scenarios this significantly improves performance by avoiding copying data back and forth between heap and off-heap memory.

3. Execution Engine

1WQdZF.png
  • Interpreter: interprets & executes bytecode and the related native operations
    • done by interacting with the Java Native Interface (JNI)
      • platform-dependent (native) libraries: e.g., in the Windows JRE /bin you will see .dll files, .so files on Unix platforms
  • JIT Compiler (Just-in-time): does not re-interpret instructions that will be executed again and again; compiles them on the fly and keeps the compiled code to avoid wasteful interpretation.
  • Hotspot profiler: gathers statistics on hot spots to guide the JIT compiler.
  • Garbage Collector (GC): cleans up unused classes / objects in the memory areas.

JVM GC

  • Intro
    • Memory leak: memory that is no longer needed is not released, due to poor memory management; Java can also have memory leaks
    • In C/C++, programmers are responsible for managing memory themselves, which easily leads to memory leaks if not handled properly
      • malloc() realloc() calloc() free() new and destructors
  • Basics
    • Live object = reachable (referenced by someone else)
    • dead object = unreachable (not referenced anywhere)
    • root node = main thread
    • Objects e.g. (new xxx) are allocated in the heap, static members, class definitions (metadata) are stored in Permgen / Metaspace
    • GC is carried out by a daemon thread "Garbage collector"
    • we cannot force GC to happen (System.gc() is only a hint)
    • a failed new allocation in the heap = java.lang.OutOfMemoryError
    • GC only handles objects allocated by Java’s new; it cannot close other resources, nor free memory allocated by C or other languages called from Java.
    • GC has multiple levels: level 0 is a full GC that also collects garbage in the OLD generation; level 1 and above are partial GCs that only collect the Young generation.
    • System.gc() does not guarantee a GC runs; it is merely a suggestion to the JVM, not a command.
    • when finalize is called, it means GC is about to reclaim that object’s memory

Directed Graph & Reachability

  • .NET garbage collection uses reference counting; Java instead uses a directed reachability graph: each thread in a Java program can be seen as a root of the graph, with directed edges from references on the stack to the referenced objects on the heap. If an object is reachable from a root, it is live; otherwise it can be collected. The advantage of this scheme is that it correctly handles cyclic references.

General GC steps: Mark Sweep Compact

  • Mark
    • Starts from root node (main) walks the object graph, marks reachable object as live.
  • Sweep/Delete
    • clean unreachable objects and reclaim memory
  • Compacting
    • arrange things in order: move objects around to avoid fragmentation = make memory contiguous.

Java GC: Generational collectors

Heap division

  • 1WtC4S.png
  • Young Generation
    • Eden space: new object();
    • Survivor space 1 / 2: survivors of each minor GC are moved back and forth between them
      • can help avoid compacting step
  • Old Generation
    • objects that survived at least “threshold” rounds of GC
    • if full, a Major GC is incurred

Minor vs Major GC

  • Minor:
    • only Young Generation
    • Responsive
  • Major:
    • run through entire heap
    • Long pause / latency
    • High throughput
    • Good for batch processing and database jobs that care about throughput > latency

Entire process

  1. new obj allocated in Eden
  2. Eden full = allocation fail for new obj
    1. Minor GC run, mark reachable objs
    2. move reachable obj to survivor 1
    3. Sweep unreachables
    4. set survivors’ counters to “1”, meaning they have survived 1 round
  3. Now Eden clear, allocate new objs… some in survivor 1 became unreachable… when Eden full,
    1. minor GC marks reachables and moves them to survivor 2, the empty one
    2. move reachables in S1 to S2 and increment their survivor counters. Moving objects around this way avoids the compacting step.
  4. Repeat: allocate in Eden, Eden full, mark and move all reachable (Eden or Sx) to the empty S, increment counters, sweep unreachables
  5. If counter hit threshold, promote them to Old Generation.
  6. If the Old Generation is near full, a Major GC runs across the entire heap (mark-sweep-compact): very time consuming, can lead to long pauses.

GC types & Usages

Basic Serial Collector

  • runs in single thread, used for basic applications

Concurrent collector (CMSC)

  • A thread that performs GC concurrently with the app.
  • The app does not stop and wait while the old generation is collected.
  • small pauses only during mark / remark; otherwise no pause (“concurrent”).
  • use when
    • more available memory
    • high number of CPUs
    • needs responsiveness

Parallel collector (PMSC)

  • use multiple CPU and multiple threads to do MSC (mark sweep compact).
  • only runs when heap is full / near full.
  • long pause when it runs.
  • use when
    • less available memory
    • less number of CPUs
    • needs high throughput & can withstand pauses

G1 collector (Garbage - first) 1.7+

  • Divide heap into small regions, each of them can be eden / survivor / old
  • Dynamically chooses the regions with the most garbage to collect first
    • More predictable GC pauses
    • Low pauses and fragmentation
    • Parallelism & concurrency together
    • Better heap utilization

Usage

1WBMZj.png

  • notice CMS is only in old generation.

Memory Leaks

  • Out-of-memory usually happens when, even after a GC of the OLD or Perm generation, there is still no space to hold new Java objects.

  • Setting references to unused objects to null is recommended to avoid temporary memory leaks: it marks them as collectible.

  • Common causes:

    1. Holders with long lifecycles: singleton-pattern class objects, and object references held by static collections. A temporary leak: the referenced objects are only GC’d once the owning object/class is released.

       static Vector v = new Vector();
       for (int i = 1; i < 100; i++) {
           Object o = new Object();
           v.add(o);
           o = null; // the Vector still references the object, so it cannot be collected
       }

    2. Forgetting to close resources: database connections, network connections, IO streams, etc. They become eligible for GC only after close() is called explicitly.

    3. Changing hash codes: once an object is stored in a HashSet, the fields that participate in computing its hash code must not be modified. Otherwise the object’s current hash no longer matches the one it had when it was inserted, so even calling contains with the object’s own current reference fails to find it in the HashSet, and the object cannot be removed from the set, causing a memory leak.

       public static void main(String[] args) {
           Set<Person> set = new HashSet<Person>();
           Person p1 = new Person("唐僧", "pwd1", 25);
           Person p2 = new Person("孙悟空", "pwd2", 26);
           Person p3 = new Person("猪八戒", "pwd3", 27);
           set.add(p1);
           set.add(p2);
           set.add(p3);
           System.out.println("total: " + set.size() + " elements!"); // prints: 3 elements
           p3.setAge(2);   // change p3’s age: p3’s hash code changes
           set.remove(p3); // remove fails now: memory leak
           set.add(p3);    // re-adding even succeeds
           System.out.println("total: " + set.size() + " elements!"); // prints: 4 elements
       }

Java reference strength: strong, soft, weak, phantom

  • Strong references are the ordinary, everyday kind. When memory runs low, the JVM would rather throw OutOfMemoryError and terminate than reclaim strongly referenced objects to free space.

  • If an object has only soft references, it is kept while memory is sufficient and reclaimed only when memory runs short.

  • Objects with only weak references have an even shorter lifetime: as soon as the collector scans its territory and finds an object with only weak references, it reclaims the object regardless of how much memory is free.

  • Phantom references are rarely used; their main purpose is tracking whether an object has been collected. A phantom reference does not affect the object’s lifetime, and the object cannot be obtained through it. An object holding only phantom references is as good as unreferenced and may be collected at any time.

  • These reference types live in java.lang.ref.SoftReference, java.lang.ref.WeakReference, and java.lang.ref.PhantomReference.
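
A minimal demonstration of the weak-reference behavior above; note that GC is only suggested, so the printed result is not strictly guaranteed on every JVM:

```java
import java.lang.ref.WeakReference;

// Weakly reachable objects are reclaimed once the last strong reference is gone.
public class RefDemo {
    public static void main(String[] args) {
        Object strong = new Object();
        WeakReference<Object> weak = new WeakReference<>(strong);
        strong = null;                  // drop the only strong reference
        System.gc();                    // a suggestion, not a command
        System.out.println(weak.get()); // typically prints null
    }
}
```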

finalize( )

  • do not use the finalizer manually: finalize()

    • guaranteed to be called at most ONCE, at the end of execution when GC runs

    • no guarantee that GC ever happens

    • do not rely on a finalizer for anything important

    • suppose we try to resurrect the object inside finalize(), e.g., by storing this into some reachable field

      • the object becomes live again at the end of GC = effectively a memory leak
  • when finalize is called, it means GC is about to reclaim the object’s memory

    • an object can be unreachable yet become reachable again after finalize runs: inside finalize, simply make some other handle point to this. But at the next collection finalize will not be called again, since it runs at most once.

      protected void finalize() {
          main.ref = this; // resurrect: make this object reachable again
      }

    • The collector cannot correctly reclaim memory allocated by code written in languages other than Java (e.g., JNI, memory from C’s new); in such cases the default finalize must be overridden to release and reclaim that memory properly (e.g., C needs delete).
    • finalize is not the equivalent of a C++ destructor: a destructor runs deterministically when the object’s lifetime ends, whereas when (or whether) finalize runs is highly uncertain.

    1. Uncertain call time: risk of wasted resources

    If scarce resources are released in finalize(), they may not actually be freed until much, much later, wasting the resource! Moreover, some objects (e.g., certain JDBC classes) carry memory-heavy resources, and delaying their release causes real performance problems.

    2. May never be called: risk of resource leaks

    In some situations finalize() is simply never invoked; for example, when the JVM exits, the finalize functions of objects still in memory may not run at all.

    Therefore cleanup work such as closing files and connections should not go into finalize; manage it explicitly in the program. Generally finalize should only reclaim memory owned by C/C++ code.

    3. Objects can be resurrected while finalize runs

    Resurrection is rare, but worth mentioning. Normally the collector calls an object’s finalize only once the object is invalid (unreferenced). But a programmer can, inside finalize(), save the object’s own reference (this) somewhere, effectively reviving it (the object is referenced again and no longer invalid).

    To guard against this, the collector has to re-check whether the object is still invalid after every finalize() call, which adds JVM overhead. Also, since the JDK documentation specifies that the JVM calls finalize() at most once per instance, when a resurrected instance truly dies later, its finalize() will not be called again. Odd, isn’t it?

    4. Catch your own exceptions

    As noted about the finalize() calling mechanism: an exception that escapes finalize is caught by the garbage collection thread and discarded, i.e., silently ignored. To prevent this, wrap any code in finalize() that might throw in your own try/catch.

    5. Mind thread safety

    finalize() is called by a garbage collection thread, not by your own code’s thread; different objects’ finalize() may even be called by different GC threads (e.g., when using a parallel collector). So when finalize() touches shared data, the usual thread-safety care is required.

Constant pool summary

  • Class-file constant pool: stores compile-time constants, symbolic references, and literals; lives inside the .class file.

  • Runtime constant pool: after class loading, the class-file constant pool’s data is placed into the runtime constant pool. “Constants” here include the boxed primitive caches (boxing caches no floating-point values, and for integers only -128 to 127) and Strings (String.intern() can force a String into the pool).

  • String constant pool: in the HotSpot VM, the global table recording interned strings is the StringTable, essentially a HashSet. Note it stores only references to java.lang.String instances, not the String contents themselves.

  • Both the runtime constant pool and the string constant pool live in the heap.
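
A small demonstration of the string pool and intern():

```java
// String literals come from the string pool; new String() creates a fresh heap object.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("hi"); // a new object on the heap
        String b = "hi";             // a literal, served from the string pool
        System.out.println(a == b);          // false: different references
        System.out.println(a.intern() == b); // true: intern returns the pooled reference
    }
}
```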

Other Notes

  1. Tune the heaps
1WspsU.png
  2. GC logging (inspect with a graphical tool) if we suspect GC is the cause of a performance problem:

       -verbose:gc
       -XX:+PrintGCDetails
       -Xloggc:gc.log

  3. jvisualvm: Visual GC plugin

Design Pattern

SOLID

  1. Single Responsibility
  2. Open Closed
  3. Liskov Substitution
  4. Interface Segregation: many specialized interfaces beat one general-purpose interface; split interface inheritance more finely!
  5. Dependency Inversion: high-level modules should not depend on low-level modules; both should depend on abstractions

Agile

Start with a basic prototype, then improve it iteratively, each version adding one complete feature.

Design Pattern

What design patterns are for:

  1. Patterns give developers a standard vocabulary and a shared mechanism for communication, helping developers understand and describe a given piece of code more clearly.
  2. Patterns make it easier to reuse successful designs and structures.
  3. Patterns deepen the understanding of object-oriented design, improve development efficiency, and save design cost.

Key patterns

Singleton

Lazy initialization vs Eager initialization

Ways the singleton guarantee can be broken

  • Serialization + deserialization produces a different object

    • override readResolve to change deserialization behavior
  • clone: similar to the case above

    • recommended: throw an exception directly (clone not supported)
    • or override clone to return the unique instance
  • reflection: change the constructor’s accessibility and build a new instance

  • multiple threads + lazy init

    • double-checked locking + volatile (fixes cache-coherence issues; “happens-before rule”: a write to a volatile variable happens before subsequent reads)

      private static volatile Singleton soleInstance = null;

      public static Singleton getInstance() {
          // double-checked locking
          if (soleInstance == null) {              // first check, without the lock
              synchronized (Singleton.class) {
                  if (soleInstance == null) {      // second check, holding the lock
                      soleInstance = new Singleton();
                  }
              }
          }
          return soleInstance;
      }

    • Double-checked locking alone is not enough: without volatile, thread 1 may be interrupted in the middle of new Singleton() with the object “only half initialized” (due to reordering); thread 2 could then see a non-null reference and return a partially initialized singleton.

    • Static inner holder class

      public static Singleton getInstance() {
          return Holder.INSTANCE;
      }

      static class Holder {
          static final Singleton INSTANCE = new Singleton();
      }
    • Strongest guarantee: use an enum, which addresses all the concerns above

      enum Singleton {
          INSTANCE;
          public String getConfiguration() { return "xxx"; }
      }

  • multiple class loaders + eager init:

    • no exact solution; checks can be added.

Factory patterns

Simple factory

    public class SimpleFactory {
        public Product createProduct(int type) {
            if (type == 1) {
                return new ConcreteProduct1();
            } else if (type == 2) {
                return new ConcreteProduct2();
            }
            return new ConcreteProduct();
        }
    }

Factory method pattern

1hpKqx.png

The parent class decides how instances are created, but not which concrete class to create; the concrete work is delegated entirely to subclasses.

Uses the template method pattern to build an instance-creating factory.

    public abstract class Factory {
        abstract public Product factoryMethod();
    }

    public class ConcreteFactory1 extends Factory {
        public Product factoryMethod() {
            return new ConcreteProduct1();
        }
    }

    public class ConcreteFactory2 extends Factory {
        public Product factoryMethod() {
            return new ConcreteProduct2();
        }
    }

Abstract factory

Define template methods that assemble abstract parts into abstract products = an abstract factory.

    public abstract class AbstractProductA {}
    public class ProductA1 extends AbstractProductA {}

    public abstract class AbstractFactory {
        abstract AbstractProductA createProductA();
        abstract AbstractProductB createProductB();
    }

    public class ConcreteFactory1 extends AbstractFactory {
        AbstractProductA createProductA() { return new ProductA1(); }
        AbstractProductB createProductB() { return new ProductB1(); }
    }

Proxy pattern

Use an intermediary proxy object to control access to another object, e.g., Remote Method Invocation (RMI) between two JVMs.

Others

Template method pattern

Delegate the concrete steps to subclasses: an abstract class with abstract methods defines the “template” of what subclasses must implement.

Builder pattern

Encapsulate an object’s construction process and allow building it step by step. StringBuilder

Prototype pattern

Create new objects by copying a prototype instance. Object.clone()

Iterator pattern

Provide a way to traverse a collection without exposing its internals.

Adapter pattern

Add an adapter class so an existing class can be reused. Arrays.asList

Decorator pattern

Separate added functionality from the object’s core functionality: attach features to an object dynamically. Similar to inheritance.
FileInputStream - BufferedInputStream
Without a buffer, each read() issues one IO operation;
with a buffer, the first read() fetches x bytes into the buffer, subsequent reads are served from the buffer, and when the buffer is exhausted another x bytes are fetched.
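
The JDK stream classes are the textbook example of this pattern; a minimal sketch (the file name is an arbitrary placeholder):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// BufferedInputStream decorates FileInputStream: same InputStream interface,
// with buffering added so most read() calls hit memory instead of disk.
public class DecoratorDemo {
    public static void main(String[] args) {
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
            int b;
            while ((b = in.read()) != -1) {
                // each read() is served from the internal buffer,
                // not a separate disk IO
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```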

MVC

A very popular architecture for web application.

mvc

Model

  • data related logic
  • interact with database sql / nosql
  • communicates with controller
  • can sometimes update the view (depends on framework)

View

  • What the end user sees (UI)
  • HTML/CSS
  • communicates with controller
  • can be passed dynamic values from the controller
  • template engines

Controller

  • receives input (from browser / view, url)
  • processes requests (GET POST PUT DELETE)
  • get data from model
  • passes data to the view

AOP Aspect Oriented Programming

  • Cross cutting concerns
    • having a common functionality such as logging, transaction, security
    • do we create a new class ?
    • No, we create an agent
    • define an aspect configuration: which aspect applies to which classes - dependency injection in Spring
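
A hedged sketch of such an aspect for the logging concern, using Spring AOP’s AspectJ annotations (the package in the pointcut is hypothetical):

```java
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.springframework.stereotype.Component;

// Cross-cutting logging concern applied declaratively; the target
// package com.example.service is an assumption for illustration.
@Aspect
@Component
public class LoggingAspect {
    // runs before every public method in the (hypothetical) service package
    @Before("execution(public * com.example.service..*.*(..))")
    public void logEntry() {
        System.out.println("entering a service method");
    }
}
```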

Database Management System - Indexing

Indexing

Basic concepts

  • search key: note that “key” here differs from the “key” in primary key, candidate key, and superkey; it is a generic term. For example, if a file has several indexes, it has several search keys. A file can have multiple indexes, each on a different search key.
  • Basic index types:
    • Ordered indices: based on sorting the values.
    • Hash indices: based on distributing values uniformly across hash buckets; the bucket a value belongs to is determined by a hash function.
  • Metrics for evaluating ordered vs. hash indices:
    • Access types: range queries? equality queries?
    • Access time: the time the technique needs to find a particular data item or set of items.
    • Insertion time: the time to insert a new data item, including finding the correct place for it and updating the index structure.
    • Deletion time: the time to delete a data item, including finding the item and updating the index structure.
    • Space overhead: the additional storage occupied by the index structure. Provided it is moderate, it is usually worth sacrificing some space for better performance. A file often needs multiple indexes.

Ordered indices

  • primary / clustered / clustering index

    • not necessarily built on the primary key; it can be on any search key.
    • the file’s records are stored in the same order as this search key!
  • secondary / non-clustering / non-clustered index

    • the file’s records are stored in a different order from this search key!
  • dense index:

    • dense clustering index: one index entry per search-key value in the file. The entry holds the search-key value and a pointer to the first data record with that value; the remaining records with the same value are stored sequentially right after it, since records are sorted by the search key.
    • in a dense non-clustering index, the index must store a list of pointers to all records with the same search-key value.
    • locates a record faster than a sparse index.
  • sparse index: index entries exist only for some of the search-key values

    • usable only when the index is a clustering index.
    • like a dense index, each entry holds a search-key value and a pointer to the first data record with that value. To locate a record, we find the entry with the largest search-key value less than or equal to the one we seek, then follow the pointer and scan forward through the file until we find the record.
    • requires less space and imposes less maintenance overhead for insertions/deletions.
  • Which one to choose?
    Sparse vs. dense is an access-time and space-overhead tradeoff.
    A good compromise is a sparse index with one index entry per block.
    The dominant cost of processing a database request is the time to bring a block from disk into main memory; once the block is in, scanning it is negligible. This minimizes block accesses while keeping the size of the index (and thus the space overhead) as small as possible.

多级索引

  • If the relation instead had 100,000,000 tuples, the index would instead occupy 1,000,000 blocks, or 4 gigabytes of space. Such large indices are stored as sequential files on disk.
  • If an index is small enough to be kept entirely in main memory, the search time to find an entry is low. However, if the index is so large that not all of it can be kept in memory, index blocks must be fetched from disk when required.
    The search for an entry in the index then requires several disk-block reads.
  • The process of searching a large index may be costly.
  • To locate a record, we first use binary search on the outer index to find the record for the largest search-key value less than or equal to the one that we desire. The pointer points to a block of the inner index. We scan this block until we find the record that has the largest search-key value less than or equal to the one that we desire. The pointer in this record points to the block of the file that contains the record for which we are looking.
  • In our example, an inner index with 10,000 blocks would require 10,000 entries in the outer index, which would occupy just 100 blocks. If we assume that the outer index is already in main memory, we would read only one index block for a search using a multilevel index, rather than the 14 blocks we read with binary search. As a result, we can perform 14 times as many index searches per second.
  • Indeed, we can repeat this process as many times as necessary. Indices with two or more levels are called multilevel indices. Searching for records with a multilevel index requires significantly fewer I/O operations than does searching for records by binary search.

索引更新

Regardless of what form of index is used, every index must be updated whenever a record is either inserted into or deleted from the file. Further, in case a record in the file is updated, any index whose search-key attribute is affected by the update must also be updated;
As a result we only need to consider insertion and deletion on an index, and do not need to consider updates explicitly.

  • Insertion: First, the system performs a lookup using the search-key value that appears in the record to be inserted. The actions the system takes next depend on whether the index is dense or sparse:

    • Dense indices:
    1. If the search-key value does not appear in the index, the system inserts an index entry with the search-key value in the index at the appropriate position.
    2. Otherwise the following actions are taken:
      • a. If the index entry stores pointers to all records with the same search- key value, the system adds a pointer to the new record in the index entry.
      • b. Otherwise, the index entry stores a pointer to only the first record with the search-key value. The system then places the record being inserted after the other records with the same search-key values.
    • Sparse indices: We assume that the index stores an entry for each block. If the system creates a new block, it inserts the first search-key value (in search-key order) appearing in the new block into the index. On the other hand, if the new record has the least search-key value in its block, the system updates the index entry pointing to the block; if not, the system makes no change to the index.
  • Deletion: To delete a record,the system first looks up the record to be deleted. The actions the system takes next depend on whether the index is dense or sparse:

    • Dense indices:
    1. If the deleted record was the only record with its particular search-key value, then the system deletes the corresponding index entry from the index.
    2. Otherwise the following actions are taken:
      • a. If the index entry stores pointers to all records with the same search- key value, the system deletes the pointer to the deleted record from the index entry.
      • b. Otherwise, the index entry stores a pointer to only the first record with the search-key value. In this case, if the deleted record was the first record with the search-key value, the system updates the index entry to point to the next record.
    • Sparse indices:
    1. If the index does not contain an index entry with the search-key value of the deleted record, nothing needs to be done to the index.
    2. Otherwise the system takes the following actions:
      • a. If the deleted record was the only record with its search key, the system replaces the corresponding index record with an index rec- ord for the next search-key value (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
      • b. Otherwise, if the index entry for the search-key value points to the record being deleted, the system updates the index entry to point to the next record with the same search-key value.

Insertion and deletion algorithms for multilevel indices are a simple extension of the scheme just described. On deletion or insertion, the system updates the lowest-level index as described. As far as the second level is concerned, the lowest-level index is merely a file containing records—thus, if there is any change in the lowest-level index, the system updates the second-level index as described. The same technique applies to further levels of the index, if there are any.

Secondary indices

A clustering/primary index may be sparse: search-key values it skips can still be reached by sequential scan.
Secondary/nonclustered indices must be dense: if a secondary index were sparse, sequential order would not be guaranteed, and records with the skipped search-key values could not be reached at all.
A secondary index on a candidate key looks just like a dense clustering index, except that the records pointed to by successive index values are not actually stored sequentially.
In general a secondary index can be structured differently from a clustering index: if the clustering index’s search key is not a candidate key, it suffices for the index to point to the first record with each search-key value, since the following records can be scanned sequentially. For a non-clustering index whose search key is not a candidate key, however, pointing only to the first record with each search-key value is not enough: the remaining records with the same value could be anywhere in the file, because the file is ordered by the clustering index’s search key.
Therefore, a secondary index must contain pointers to all the records.
We can use an extra level of indirection to implement secondary indices on non-candidate keys: pointers in the secondary index point to buckets, and the bucket pointers point to the file records.

  • Automatic index creation:
    • Most databases automatically create an index on the primary key (used, among other things, to check the primary-key constraint on new tuple insertions). Without a primary-key index, every tuple insert would have to read the whole relation to verify the primary-key constraint.
      We cannot store a file sorted both by the clustering index’s search key and by a secondary search key!
      Because the secondary key order differs from the physical key order, scanning in secondary-key order may require reading a new block from disk for every record, which is very slow!

The index update methods above (insertion and deletion) also apply to secondary indices; the actions taken are those described for dense indices storing a pointer to every record in the file. If a file has multiple indexes, every index must be updated whenever the file is modified.

Secondary indices speed up queries on keys other than the clustering index’s, but impose substantial extra overhead on database modifications. A database designer decides which secondary indices to build based on the estimated frequencies of queries and modifications.
So do not build a secondary index unless it is estimated to improve performance: a secondary index is a nonclustered index whose key order differs from the physical key order in the file, so the records referenced by different index entries may sit on different disk blocks, causing extra disk IO accesses! Used carelessly, it can perform terribly!

B+-tree indices and hash indices

Indexing works at block granularity; within a block, an offset is used.

Index types and basic characteristics

  • hash index: a hash function maps a key to the address of a bucket, with a possible overflow chain to manage collisions. Cheaper than a B+-tree if no overflow occurs: access is O(1 + #overflow buckets). The hallmark of a hash index is fast equality lookups; range queries are impossible.
  • bitmap indices suit static, low-cardinality (highly repetitive) attributes, and can be used as a compressed storage mechanism at the leaf nodes of B+-trees for values that occur very frequently.
  • B+-tree index

Motivation

  • The primary disadvantage of the index-sequential file organization is that performance degrades as the file grows. To overcome this deficiency, we can use a B+-tree index.

B+-tree index characteristics

An ordinary balanced binary tree is tall and thin; a B tree is fat and short because of its node size! A B-tree index supports both range and equality queries. For N records in the file, the path length is bounded by about log_{⌈m/2⌉}(N) disk block accesses, far better than the log2(N) of an ordinary balanced tree, massively reducing the number of disk IOs.

  • B tree: an m-way search tree (order m, fanout m, m-1 info fields) with additional constraints: all leaves at the same height; the root has at least 2 keys; every other node is at least half full (⌈m/2⌉) to minimize height. B-tree indices are similar to B+-tree indices; the primary distinction is that a B-tree eliminates the redundant storage of search-key values: a B-tree allows each search-key value to appear only once.
  • B+ tree: closer to a multilevel index. On top of the B tree, the non-leaf nodes form a sparse index that reduces disk page accesses and supports equality search; the leaf level copies the non-leaf keys in in-order sequence, holds the record pointers, and is chained into an ordered linked list, forming a dense & clustered index that supports range search.
    order = #pointer fields = p per node
    #key,value fields = p - 1 per node
    sizing: (p-1)(key_ptr_size + record_ptr_size) + p(block_ptr_size) <= blocksize (e.g., 512)
  • If the target position for an insert is full, recursively move the median key up (preserving in-order sequence) while splitting the node into predecessor and successor halves, always keeping nodes at least half full. Deletion: merge with the left or right sibling to satisfy the half-full constraint; splits can propagate up to the root. Key operations: split, coalesce, redistribute, merge.
  • Bottom-up construction of an empty B+-tree index: at each level, each node’s minimum value & pointer are used to create the next level’s entries, until the root is created.
  • sort first, then bulk-load (insert)

B+-tree index drawbacks

  • Long-term performance degradation: a drawback of a B+-tree index or file organization is that adjacent leaf nodes may end up in different regions of the disk. In the best case the leaf pointers follow the physical layout of the disk, so scanning the leaves is essentially a sequential disk scan; but as insert/delete/update operations accumulate, sequentiality is increasingly lost and the scan has to wait for disk seeks increasingly often. At that point the index should be rebuilt to restore sequentiality.
  • Secondary index update problem: splitting a node may require updating the secondary indices built on it. A leaf node may hold hundreds of records, each potentially referenced from a different place, so a leaf-node split may require tens or even hundreds of I/O operations to update all affected secondary indices, making it a very expensive operation. Solution: in secondary indices, in place of pointers to the indexed records, store the values of the primary-index search-key attributes. For example, with a primary index on the attribute ID of relation instructor, a secondary index on dept would store, with each department name, the list of instructor ID values of the corresponding records, instead of pointers to the records. Locating a record via the secondary index then takes two steps: first use the secondary index to find the primary-index search-key values, then use the primary index to find the corresponding records. This approach greatly reduces the cost of index updates due to file reorganization, although it increases the cost of accessing data through a secondary index.

B+-tree file organization

The B+-tree idea can also be used to build a B+-tree file organization: leaves store the file records themselves rather than pointers, so leaf nodes are larger and good space utilization matters more; this can be addressed by raising the node occupancy floor, e.g., to 2n/3.
B+-tree file organizations can also store large objects such as SQL clobs and blobs, which are often larger than a disk block, sometimes several gigabytes. Such an object can be split into a sequence of many small records and stored in a B+-tree file organization: the records can be numbered sequentially, or by the byte offset of the record within the large object, and the record number is used as the search key.

Hash Index

  • can be used, strictly speaking, only for secondary indices / file organization, because:

    • A hash index is never needed as a clustering index structure: if the file itself is organized by hashing, there is no need for a separate hash index structure on it.
    • Since a hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a clustering hash index on it.
  • open hashing with overflow chains for buckets: add a chained bucket on conflict as needed; this is the usual choice in databases

  • closed hashing: the set of buckets is fixed. Deletion is troublesome; typically used in compilers and assemblers, where there are only lookups and inserts.

    • choose another bucket when a bucket is full: linear probing, quadratic probing
    • the hash function must be chosen wisely or performance degrades.
  • static hashing: e.g., a linear congruential hash function with a fixed number of hash buckets, using overflow chains to manage contention

    • problem: performance degrades as the database grows in size unless the hash structure is reorganized (remapping everything & reconstructing the corresponding buckets is time-consuming and massive)
  • extendible/dynamic hashing (see the sketch after the links below):

    • copes with changes in database size by splitting and coalescing buckets as the database grows and shrinks. It does so with an extra level of directory that doubles in size when, during an insertion, a bucket overflows while its local depth equals the global depth (the bits no longer suffice to distinguish the overflowing bucket’s search values). A directory of size 2^k stores pointers to the hash buckets; after doubling, the numbering can follow a gray-code scheme. Directory entries are numbered by the last k bits, 0 to 2^k - 1.
    • hash function: take the last k bits, i.e., mod 2^k
    • global depth = k: the max number of bits needed to tell which bucket an entry belongs to in the directory
    • local depth: the number of bits of the directory numbering used to decide whether an entry belongs to this particular bucket
    • If a bucket overflows, it is split into two. The directory may or may not double, depending on whether the overflowing bucket’s local depth equaled the global depth before the split.
    • After a resize we do not necessarily create new buckets; existing directory entries are matched to the current buckets, so usually a bucket is pointed to by 2 or more entries. A bucket is pointed to by exactly 1 entry only right after the split caused by an insertion overflow.
    • https://www.youtube.com/watch?v=TtkN2xRAgv4&list=PLkZdHIQy-AeJjLbvcLO-rp1-eImyJvO2l
    • http://delab.csd.auth.gr/papers/ExtendibleHashing2017.pdf
    • http://www.cs.sfu.ca/CourseCentral/354/lxwu/notes/chapter11.pdf
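
A toy sketch of the extendible-hashing directory lookup described above (illustrative only; the split/doubling logic is omitted):

```java
// Bucket selection by the last k (= global depth) bits of the hash.
public class ExtendibleDirectory {
    int globalDepth = 2;                         // k
    int[] directory = new int[1 << globalDepth]; // 2^k entries, each holding a bucket id

    int bucketFor(Object key) {
        int h = key.hashCode();
        return directory[h & ((1 << globalDepth) - 1)]; // last k bits = h mod 2^k
    }
}
```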

Multiple key indices

  • in general a search key can have more than one attribute. A search key containing more than one attribute is referred to as a composite search key. The structure of the index is the same as that of any other index, the only difference being that the search key is not a single attribute, but rather is a list of attributes. The search key can be represented as a tuple of values, of the form (a1, . . . , an), where the indexed attributes are A1, . . . , An. The ordering of search-key values is the lexicographic ordering.
  • As an example. Consider an index on takes relation on composite search key(courseid,semester,year). This index is useful in finding all students who have registered for a particular course in a particular semester/year. An ordered index on multiple-key can be also used to answer some other queries efficiently.

Operating System - Persistence

CS537 - Operating System Summary Part 2 Persistence

Persistence

IO Device and Communication Protocol

Motivation

  • we want hardware that will let us plug in different devices
  • we want OS that can interact with different combinations of HW

Hardware support for IO device in system architecture

  • different buses have different speed, costs, size/volume of devices that need to be connected with
  • high speed buses are very expensive to manufacture
  • Hierarchical buses are a good solution
  • proprietary bus: 60GB/s on a 4-core system
  • General I/O bus: PCI…etc. 1-4GB/s
  • Peripheral I/O bus: disk devices, SCSI, SATA, USB, 100MB/s
  • Modern systems use a more hierarchical design with specialized chipsets and point-to-point interconnects for better performance. Example: Z270 chipset:
  • Dedicated Graphics bus: facilitate graphics intensive applications such as gaming, interactive web browser, and photo manipulations.
  • Higher performance devices connected via PCIe, NVMe persistent storage.
  • lower performance devices connected via USB, eSATA: modern sata standard, SSD: higher speed storage

OS’s view to Device & a canonical device

  • Interface: where OS reads/writes to, allow system to control its operations
  • Internal Structure (Varies depends on different devices & manufacture): microcontroller, extra memory, special-purpose chips…connection to cache/disk drives, graphics cards…

A canonical protocol OS writing to device

    while (STATUS == BUSY)
        ;                                 // spin
    Write data to DATA register
    Write command to COMMAND register
    while (STATUS == BUSY)
        ;                                 // spin
  • A simple polling protocol works but is sometimes inefficient. Timeline (A’s sys_write: the CPU spin-waits while the disk is busy, copies data & command, then waits for the command to execute before B can run):

        CPU : [A: spin-wait][copy data & command][spin-wait for command][B]
        DISK: [    busy    ][        busy       ][  busy: executing    ]
  • The policy of polling itself reduces CPU utilization when job processing time can be long
  • Using interrupt instead
    while (STATUS == BUSY)                // 1
        wait for interrupt;
    Write data to DATA register           // 2
    Write command to COMMAND register     // 3
    while (STATUS == BUSY)                // 4
        wait for interrupt;
  • Interrupt improves CPU utilization in this case.
  • Summary: Polling/Interrupt is a tradeoff.
    • Faster devices:
      • better to spin(poll) and keep waiting than taking interrupt overhead
    • Unknown device speed:
      • Hybrid approach (spin then use interrupts)
    • Better not to use interrupts when Floods of interrupts arrive:
      • Example: floods of requests to the NIC device of a webserver
      • can lead to livelock (always handling interrupts rather than doing actual works - user level processes to service some requests)
      • Better to ignore interrupts and use some polling to make some progress handling them and control what is happening in the system
    • Other improvement
      • Interrupt coalescing (batch together several interrupts into a single one): This reduces overhead of interrupts processing

Data Transfer Costs & More efficient data movement with DMA

  • Programmed I/O:
    • Programmed IO(PIO) is a method of transferring data between the CPU and a peripheral, such as a network adapter or an ATA storage device. Each data item transfer is initiated by an instruction in the program, involving the CPU for every transaction.
    • when using PIO to transfer a large chunk of data to a device. CPU is overburdened with trivial tasks of copying data from memory to device explicitly one word at a time. Poor CPU utilization!
    • Solution: Direct Memory Access (DMA)
      • CPU let a special purpose device “DMA engine” to copy data on behalf of it.
      • OS would program the DMA engine by telling it where the data lives inmemory, how much data to copy, and which device to send it to
      • CPU is thus free and OS can do something else: this improves both CPU and disk utilization, and improves the time of copying data into the data register in a device. (not command register because commands are usually very small in size).

    while (STATUS == BUSY)                // 1
        ;
    Write command to COMMAND register     // no step 2 anymore; 3
    while (STATUS == BUSY)                // 4
        ;

Methods of Device interactions

  • How OS communicates with Device? (Doesn’t matter much, both are used)
    • Special instructions
      • each device has a port
      • in/out instructions (x86) communicate with device
    • Memory-Mapped I/O
      • H/W maps registers into address space
        • example: eax ebx cpu register
      • loads/stores of mapped register sent to device
        • load/store eax/ebx

Device Drivers

  • Motivation: we want to keep device general and neutral as much as possible and hide the details of device interactions with OS subsystems.
  • Device driver: At the lowest level, a piece of software in the OS must know in detail how a device works. Any specifics of device interaction are encapsulated within.
  • Significance: Writing device driver for each device helps us abstract hardware and avoid writing different OS for different H/W combinations.
  • Example: we want a file system that works with SSD, USB, SATA
  • Problem: Many devices in a system! Each has its own protocol!
    • Drivers are 70% of Linux Source code and major causes of OS crashes.

Hard Disks

Hard Disk Interface and its view to OS/User

  • The abstraction to OS/User
sector 0
sector 1
sector 2
  • Disk has a sector-addressable address space
  • Appears as an array of sectors
  • Similar to Paging, sectors are typically 512 bytes/sector
  • Main operations: atomic read/write of a particular sector. On power failure there is a guarantee that a write to a sector either completed or did not happen at all.
  • Mechanical and slow

Internals & Performance Measure

  • Platter: a circular/disk shape entity with magnetic foam on both sides
  • Spindle: connected with motor to make platter spin.
  • RPM(Rotations Per Minutes): tells the rate of platter spinning. 10000RMP = 1 rotation/6ms
  • Surface: one side of a platter. Both sides can be written/read.
  • tracks: a ring of certain inner & outer radius, surface is divided into different tracks. Tracks are divided into numbered sectors. Each track in above graph has 8 sectors.
  • cylinder: stack of tracks across platters. This idea is useful when we want to do uniform operations on the same track of each surfaces.
  • Arm seeks over desired tracks, platter rotates! A head per surface for R/W!
  • Reading/Writing data from disks
    • Rotation Delay: the waiting time for the platter to rotate till the head is positioned at right sector on the single track. On average R/2.
    • Seek Time: the waiting time for disk arm to be positioned on the right track.
    • Transfer Time: actual time data is either read from or written to the surface.
    • Overall Time to Read/write
      • Time_IO = seek + rotation + transfer
      • IO rate (mainly used for comparing drives performance):
        • IO_rate = size to transfer / Time_IO
    • Summary
      • Seek cost (major): Function of cylinder distance. Not purely linear cost. Must accelerate, coast, decelerate, settle. Settling alone can take 0.5 - 2 ms. Entire seeks often takes 4 - 10 ms. Average seek = 1/3 of max seek.
      • Rotate cost (major): Depends on rotations per minute (RPM). 7200 RPM is common, 15000 RPM is high end. Average rotation = 1/2 of rotation delay
      • Transfer time: pretty fast. depends on RPM and sector density. 100+ MB/s is typical for maximum transfer rate.
  • IO time calculation Example
    • Find the time for 4KB random read for Cheetah (on average).
    • Solution:
      • Tseek = 4ms
      • Trotate: 15000 rotations / 60 s = 0.25 rotations/ms, i.e., 4 ms per rotation; on average half a rotation = 2 ms
      • Ttransfer = 4KB / (125MB/s) = 4/125 ms ≈ 0.03 ms, so Time_IO ≈ 4 + 2 + 0.03 ≈ 6 ms
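
A quick sanity check of these numbers (values taken from the text; 1 MB is approximated as 1000 KB for readability):

```java
// Average 4KB random-read time and effective IO rate for the Cheetah example.
public class DiskIoTime {
    public static void main(String[] args) {
        double seekMs = 4.0;                       // average seek time
        double rotateMs = (60_000.0 / 15_000) / 2; // 15000 RPM -> 4 ms/rotation; avg half = 2 ms
        double transferMs = 4.0 / 125.0;           // 4 KB at 125 MB/s ≈ 125 KB/ms
        double totalMs = seekMs + rotateMs + transferMs;
        double rateMBs = (4.0 / 1000.0) / (totalMs / 1000.0); // size / time
        System.out.printf("T_io ≈ %.2f ms, random-read rate ≈ %.2f MB/s%n", totalMs, rateMBs);
    }
}
```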

Workload Performance

  • Question: How does two kinds of workload affect performance?
    • Sequential: reads a large number of sectors consecutively from the disk, without jumping around.
    • Random: issues small (e.g., 4KB) reads to random locations on the disk.
    • Example:
      • What is throughput (IO rate) for sequential workload and random workload for Cheetah?
        • Sequential: 4ms seek + 2ms rotate + 1s transfer = 1.006s This means effective throughput is almost equal to 125MB/s
        • Random: IO time = 6ms, 4KB/6ms <<< 125MB/s much lower throughput than sequential access.
      • Conclusion: When at all possible, transfer data to and from disks in a sequential manner. If sequential is not possible, at least think about transferring data in large chunks: the bigger, the better. If I/O is done in little random pieces, I/O performance will suffer dramatically.

Some techniques manufacturers use to improve performance of disks

Track Skew (skewed layout):

  • How should sector number be laid out so that we can continue reading sequentially?
  • Goal: We want low overhead and seamless transformation from 15 to 16 when we want to read 16 after 15.
  • Solution with track skew method:
    • idea: overlapping seek and rotation
    • By the time the platter rotate, the head already place at position of 16.

Zones

  • Idea: outer tracks have more area available than inner tracks and thus can store more data. But we fixed sector size. So we can have non-uniform division: more sectors on outer tracks to utilize that space.
  • Zone bit recording: call collection of sectors as zone.

Cache inside drive

  • Idea: Drives may cache both reads and writes. (In addition to OS cache). Cache is not big (2MB-16MB)
  • Advantages of caching in drive for read:
    • Store recently read sectors. Fetch it from cache.
    • Read-ahead: read contents of entire track into cache. predictively facilitates sequential workload.
  • Advantages & Disadvantages of caching in drive for write:
    • Immediate reporting: the CPU doesn’t need to wait for the write to finish. The drive can acknowledge a write even before the write actually makes it to the magnetic medium.
    • Danger: cached data can be lost on power failure.
  • Other advantages: multiple outstanding requests
    • Tagged command queuing: Disk can reorder/schedule requests for better performance.

IO scheduler, scheduling policies and tradeoff

Motivation

Given a stream of I/O requests, in what order should they be served?

  • Example timeline: P1Read___P2Read___P3write__

Goal

  • OS should dispatch requests in certain order to the shared storage device disk.

Key Problem

  • Much different than CPU scheduling, Position of disk head relative to request position matters more than length of job
  • Example:
    • FCFS/FIFO: Assume seek+rotate = 10 ms for random request, How long (roughly) does the below workload take? Requests are given in sector numbers:
      • 300001, 700001, 300002, 700002, 300003, 700003 = 60ms because each time we need to seek and rotate
      • 300001, 300002, 300003, 700001, 700002, 700003 = 20ms 2 sequential pattern
        -This shows why IO scheduling is important.

Crux

  • we want to implement an algorithm that more closely approximates SJF by taking both seek and rotation into account.

SSTF (Shortest SEEK Time First)

  • Strategy: always choose the request that requires the least positioning time (seek + rotation)
  • Greedy algorithm (looks for the local optimum)
  • Implementation in the OS: use sector numbers as a substitute; order by nearest sector number first and try to issue requests that are close together.
  • Disadvantages: starvation!
    • ex. 30001,30002,…,70001(starved)
    • avoid starvation:
      • Scan/Elevator Algorithm: sweep back and forth, from one end of the disk to the other, serving requests as we pass each cylinder; sorts by cylinder number (track order); ignores rotation delays;
        • Example: input 101 201 102 301 203; output_order 101 201 301 203 102 (first bit track#)

        • This ensure for example the request at outermost track does not starve!

        • This is a “Best effort” work done on OS side -> logically like sorting

      • C-SCAN (Circular SCAN Algorithm): only sweep in one direction: 1->2->3, reset, 1->2->3
        • This is fairer than SCAN because in pure back-and-forth SCAN the middle cylinder (2) is served about twice as often on average as the peripheral ones (1 and 3).
  • Problem: SCAN and SSTF do not actually adhere as closely to the principle of SJF as they could. In particular, they ignore rotation.

SPTF(Shortest Positioning Time First)

  • Example: we get 2 requests one to 16, one to 8. 16 gets shorter seek but longer rotate, 8 has shorter rotation delay and longer seek. If seek time is higher than rotational delay in the disk in this example, then SSTF related policies are fine = we want to minimize seek time. So go to 16. Otherwise 8 is a better choice because we need to minimize rotation delay to have better performance.

  • This algorithm is complex, for simplicity, many OS only implements shortest seek time first

Where should IO scheduler go? OS vs Disk.

  • Disk: knows the disk geometry much better, but is fixed in hardware; updating the firmware or its scheduler is hard.
  • OS: knows which processes are requesting, so it can do weighted or fair sharing across processes. Scheduler software is easy to update. Can have multiple disk schedulers and switch schedulers based on the workload pattern.
  • Typical state-of-the-art approach: a simple scheduler on the OS side sorts requests by sector location, plus a more complex scheduler on the disk side that accounts for seek and rotation time to minimize things further. The OS typically sends multiple requests to the disk, so the disk scheduler can do some reordering among them.

How busy should we keep the disk?

    // Reads from a file descriptor 1KB at a time and processes the data.
    void reader(int fd) {
        char buf[1024];
        int rv;
        while ((rv = read(fd, buf, sizeof(buf))) != 0) {
            assert(rv);
            // takes a short time, e.g., 1ms
            process(buf, rv);
        }
    }

  • assume 2 processes calling read() with C-SCAN, and consider the workload pattern: 100 101 200 201 102 103 202. This pattern is possible because there may be a 1 ms gap between 101 and 102 in which the other thread/process can slip in. It is very inefficient. Should the OS always immediately submit the requests waiting on the queue, or wait to see whether more requests arrive (i.e., be non-work-conserving and let the disk sit idle briefly so it can make more efficient progress in the future)?

  • Work conservation (a trick used by linux scheduler)

    • not work conserving/violating work-conservation: might wait for a while to merge or get a better sequence of requests
    • Work conserving schedulers always try to do work if there’s work to be done (always run a request if resource is free)
      • Sometimes, it’s better to wait instead if system anticipates another request will arrive.
      • example: I/O Merging. OS coalesces several IO requests into ONE. Less IO overhead.

RAID

Motivation

  • Typical scenario (most file systems work with one disk):

        APP
        OS / FS
        Storage Device
  • sometimes we need many disks for reasons:

    • Capacity
    • Performance
    • Reliability
  • Solution 1: JBOD - Just a Bunch Of Disks

    • applications store data on different file systems, e.g., critical data the app decides to replicate
    • Downsides: apps need to know about multiple devices and must be rewritten; not easily deployable
  • Solution2 RAID - Redundant Array of Inexpensive (Independent) Disks

    • abstract multiple physical disks into one logical disk to OS
    • Advantages: transparent to apps, deployable Improved capacity, performance, and reliability!

Fault Model of RAID

  • Simple: Fail-stop model
  • Either works correctly or fails entirely
  • System can easily detect which part is not working
  • No silent failures, No corruptions, …etc…

General strategy of RAID - Mapping

  • Mapping blocks: build fast, large disk from smaller ones
  • Very similar to VM: go from virtual space to physical space by looking up TLB and pagetable
  • VM mapping is dynamic - mapping can change for example, when memory is free and is reallocated.
  • RAID mapping is static - translation is simple static calculation: no lookup

General strategy of RAID - Redundancy

  • Add even more disks for reliability
  • More redundancy == More fault tolerance
  • This is a tradeoff
    • Redundancy improves reliability (and maybe performance)
    • Deduplication improves space efficiency

RAID analysis

  • RAID level: different levels
  • Workload: types of reads/writes issued by app
  • Metric: capacity, reliability, performance
  • Analysis mode: given Workload, Raid level, determine/calculate Metric

RAID levels

  • RAID-0: Striping
  • RAID-1: Mirroring
  • RAID-4: Parity
  • RAID-5: Rotated parity
    We will not discuss levels 2, 3, and 6 in this class.

Workload

Metrics

Operating System - Concurrency

CS537 - Operating System Summary Part 2 Concurrency

Concurrency

Thread

Processes vs Thread

  • Process

    • Example: Chrome (process per tab)
    • Communicate via pipe() or similar
    • Pros: Don’t need new abstractions; good for security
    • Cons:
      • Cumbersome programming
      • High communication overheads
      • Expensive context switching
  • Thread

    • Multiple threads of same process share an address space
    • Divide large task across several cooperative threads
    • Communicate through shared address space
    • Shared: page directories, page tables, code segment
    • Not Shared: instruction pointer, stack
  • Multiple threads within a single process share:

    • Process ID (PID)
    • Address space: Code (instructions), Most data (heap)
    • Open file descriptors
    • Current working directory
    • User and group id
  • Each thread has its own

    • Thread ID (TID)
    • Set of registers, including Program counter and Stack pointer
    • Stack for local variables and return addresses (in same address space)

Common Programming Models

  • Producer/consumer

    • Multiple producer threads create data (or work) that is handled by one of the multiple consumer threads
  • Pipeline

    • Task is divided into series of subtasks, each of which is handled in series by a different thread
  • Defer work with background thread

    • One thread performs non-critical work in the background (when CPU idle)

User-level threads: Many-to-one

  • Idea

    • Implemented by user-level runtime libraries
    • Create, schedule, synchronize threads at user-level
    • OS is not aware of user-level threads
    • OS thinks each process contains only a single thread of control
  • Advantages

    • Does not require OS support; Portable
    • Can tune scheduling policy to meet application demands
    • Lower overhead thread operations since no system call
  • Disadvantages

    • Cannot leverage multiprocessors
    • Entire process blocks when one thread blocks

Kernel-level threads: One-to-one

  • Idea

    • OS provides each user-level thread with a kernel thread
    • Each kernel thread scheduled independently
    • Thread operations (creation, scheduling, synchronization) performed by OS
  • Advantages

    • Each kernel-level thread can run in parallel on a multiprocessor
    • When one thread blocks, other threads from process can be scheduled
  • Disadvantages

    • Higher overhead for thread operations
    • OS must scale well with increasing number of threads

Thread Schedule Examples

  • Assume M[0x123] = 100 initially; we want to increment it by 1 twice

  • Example 1

    Thread 1                              Thread 2
    mov 0x123, %eax  => %eax = 100
    add $0x1,  %eax  => %eax = 101
    mov %eax, 0x123  => M[0x123] = 101
                                          mov 0x123, %eax  => %eax = 101
                                          add $0x1,  %eax  => %eax = 102
                                          mov %eax, 0x123  => M[0x123] = 102
  • Example 2

    Thread 1                              Thread 2
    mov 0x123, %eax  => %eax = 100
                                          mov 0x123, %eax  => %eax = 100
                                          add $0x1,  %eax  => %eax = 101
                                          mov %eax, 0x123  => M[0x123] = 101
    add $0x1,  %eax  => %eax = 101
    mov %eax, 0x123  => M[0x123] = 101

    The final value is 101, not 102: one increment is lost.


Non-Determinism

  • Concurrency leads to non-deterministic results

    • Different results even with same inputs
    • race conditions
  • Whether bug manifests depends on CPU schedule!

  • How to program: imagine scheduler is malicious?!
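The non-determinism is easy to reproduce. A minimal sketch with pthreads (my own demo, not from the course notes): two threads increment a shared counter without a lock, and the printed result is usually less than the expected 2 × LOOPS.

```c
#include <pthread.h>
#include <stdio.h>

#define LOOPS 1000000
static volatile int counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < LOOPS; i++)
        counter++;              // load, add, store: not atomic
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d (expected %d)\n", counter, 2 * LOOPS);
    return 0;
}
```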

What do we want?

  • Want 3 instructions to execute as an uninterruptable group

  • That is, we want them to be atomic

```
mov 0x123, %eax
add $0x1,  %eax
mov %eax, 0x123
```
  • More general: Need mutual exclusion for critical sections

    • if thread A is in critical section C, thread B isn’t
    • (okay if other threads do unrelated work)

Synchronization

  • Build higher-level synchronization primitives in OS
  • Operations that ensure correct ordering of instructions across threads
  • Use help from hardware

Concurrency Objective

  • Mutual exclusion (e.g., A and B don’t run at same time)

    • solved with locks
  • Ordering (e.g., B runs after A does something)

    • solved with condition variables and semaphores

Summary

  • Concurrency is needed for high performance when using multiple cores
  • Threads are multiple execution streams within a single process or address space (share PID and address space, own registers and stack)
  • Context switches within a critical section can lead to non-deterministic bugs

Locks

Introduction

  • Goal: Provide mutual exclusion (mutex)

  • Atomic operation: No other instructions can be interleaved

  1. Allocate and Initialize

```c
// statically:
pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
// or dynamically:
pthread_mutex_t mylock;
pthread_mutex_init(&mylock, NULL);
```

  2. Acquire

    • Acquire exclusive access to lock
    • Wait if lock is not available (some other process in critical section)
    • Spin or block (relinquish CPU) while waiting

```c
Pthread_mutex_lock(&mylock);
```

  3. Release

    • Release exclusive access to lock; let another process enter critical section

```c
Pthread_mutex_unlock(&mylock);
```

Implementation Goals

  • Correctness
    • Mutual exclusion: only one thread in critical section at a time
    • Progress (deadlock-free): if several simultaneous requests, must allow one to proceed; deadlock happens when all threads are waiting for the lock
    • Bounded (starvation-free): must eventually allow each waiting thread to enter; the waiting time for the lock is bounded
  • Fairness: each thread waits for the same amount of time
  • Performance: CPU is not used unnecessarily

Spin Lock with Interrupts

  • Idea
    • Turn off interrupts for critical sections
    • Prevent dispatcher from running another thread
    • Code between interrupts executes atomically
  • Implementation code

```c
void acquire(lockT *l) {
    disableInterrupts();
}

void release(lockT *l) {
    enableInterrupts();
}
```
  • Disadvantages
    • Only works on uniprocessors
    • Process can keep control of CPU for arbitrary length
    • Cannot perform other necessary work

Spin Lock with Load + Store

  • Idea: uses a single shared lock variable

  • Implementation code

```c
// shared variable
boolean lock = false;

void acquire(boolean *lock) {
    while (*lock)
        ; // wait
    *lock = true;
}

void release(boolean *lock) {
    *lock = false;
}
```
  • Race condition

    Thread 1                    Thread 2
    while (*lock)
                                while (*lock)
    *lock = true;
                                *lock = true;

    • Both threads grab lock!
    • Problem: Testing lock and setting lock are not atomic

Spin Lock with xchg

  • xchg: Atomic exchange or test-and-set

```c
// return what was pointed to by addr;
// at the same time, store newval into addr
int xchg(int *addr, int newval) {
    int old = *addr;
    *addr = newval;
    return old;
}
```
  • Implementation code

```c
typedef struct {
    int flag;
} lock_t;

void init(lock_t *lock) {
    lock->flag = 0; // 0 => unlocked; 1 => locked
}

void acquire(lock_t *lock) {
    while (xchg(&lock->flag, 1) == 1) {
        // spin-wait (do nothing)
    }
    // exit loop when flag changed from 0 (unlocked) to 1 (locked)
}

void release(lock_t *lock) {
    lock->flag = 0; // set the flag to 0 (unlocked)
}
```
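In practice the atomic exchange is not written as plain C; it compiles down to an atomic instruction (lock xchg on x86). A hedged sketch using the GCC/Clang __atomic builtins (an assumption on my part; the notes above use pseudocode):

```c
// behaves like the xchg() pseudocode above, but is actually atomic
int xchg_builtin(int *addr, int newval) {
    return __atomic_exchange_n(addr, newval, __ATOMIC_SEQ_CST);
}
```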

Spin Lock with CAS

  • CAS: Compare and Swap

```c
// Atomic instruction:
// set *addr to newval when *addr == expected;
// return what was pointed to by addr
int CompareAndSwap(int *addr, int expected, int newval) {
    int actual = *addr;
    if (actual == expected)
        *addr = newval;
    return actual;
}
```
  • Implementation code

```c
void acquire(lock_t *lock) {
    while (CompareAndSwap(&lock->flag, 0, 1) == 1) {
        // spin-wait (do nothing)
    }
}

void release(lock_t *lock) {
    lock->flag = 0;
}
```
  • Exercise with xchg and CAS

    • Code

```c
int a = 1;
int b = xchg(&a, 2);
int c = CompareAndSwap(&b, 2, 3);
int d = CompareAndSwap(&b, 1, 3);
```
    • Result:

      After line:   a   b   c   d
      1             1   -   -   -
      2             2   1   -   -
      3             2   1   1   -
      4             2   3   1   1

Ticket Locks

  • Basic spinlocks are unfair

    • Scheduler is unaware of locks/unlocks!
  • Introduction to Ticket Locks

    • Idea: reserve each thread’s turn to use a lock.

    • Each thread spins until their turn.

    • Use new atomic primitive, fetch-and-add

```c
int FetchAndAdd(int *ptr) {
    int old = *ptr;
    *ptr = old + 1;
    return old;
}
```
    • Acquire: grab a ticket; spin while the thread's ticket != turn

    • Release: Advance to next turn

  • Example

    Time  Event       Ticket  Turn  Result
    1     A lock()    0       0     A runs
    2     B lock()    1       0     B spins until turn = 1
    3     C lock()    2       0     C spins until turn = 2
    4     A unlock()          1     B runs
    5     A lock()    3       1     A spins until turn = 3
    6     B unlock()          2     C runs
    7     C unlock()          3     A runs
    8     A unlock()          4
  • Ticket Lock Implementation

```c
typedef struct {
    int ticket;
    int turn;
} lock_t;

void lock_init(lock_t *lock) {
    lock->ticket = 0;
    lock->turn = 0;
}

void acquire(lock_t *lock) {
    int myturn = FetchAndAdd(&lock->ticket);
    while (lock->turn != myturn)
        ; // spin
}

void release(lock_t *lock) {
    FetchAndAdd(&lock->turn);
}
```

Ticket Lock with Yield

  • Spinlock Performance

    • Fast when…

      • many CPUs
      • locks held a short time
      • advantage: avoid context switch
    • Slow when…

      • one CPU
      • locks held a long time
      • disadvantage: spinning is wasteful
  • CPU Scheduler is Ignorant

    • CPU scheduler may run B, C, D instead of A even though B, C, D are waiting for A

  • Ticket Locks with Yield

```c
typedef struct {
    int ticket;
    int turn;
} lock_t;

void lock_init(lock_t *lock) {
    lock->ticket = 0;
    lock->turn = 0;
}

void acquire(lock_t *lock) {
    int myturn = FetchAndAdd(&lock->ticket);
    while (lock->turn != myturn) {
        yield(); // yield instead of spin
    }
}

void release(lock_t *lock) {
    FetchAndAdd(&lock->turn);
}
```
  • Yield instead of Spin

  • Time Comparison: Yield vs Spin
    • Assumption

      • Round robin scheduling, 10ms time slice
      • Process A, B, C, D, E, F, G, H, I, J in the system
    • Timeline

      • A: lock() … compute … unlock()
      • B: lock() … compute … unlock()
      • J: lock() … compute … unlock()
      • A: lock() … compute … unlock()
    • If A’s compute is 20ms long, starting at t = 0, when does B get lock with spin ?

      • 110 ms
        Timeline: A…J (100 ms), then A (10 ms), then B gets the lock
    • If B’s compute is 30ms long, when does C get lock with spin ?

      • 320 ms
        Timeline: A…J, A…J, A…J (300 ms), A (10 ms), B (10 ms), then C gets the lock
    • If context switch time = 1ms, when does B get lock with yield ?

      • 29 ms
        Timeline: A (10 ms), B…J yield (9 × 1 ms), A (10 ms), then B gets the lock

Queue Lock

  • Motivation

    • Time complexity of spinlock

      • Without yield: O(threads * time_slice)
      • With yield: O(threads * context_switch)
    • Even with yield, spinning is slow with high thread contention

  • Idea

    • Block and put thread on waiting queue instead of spinning
    • Remove waiting threads from scheduler ready queue
    • (e.g., park() and unpark(threadID))
    • Scheduler runs any thread that is ready
  • Example

    • Assumption

      • A & C each have 60 ms of work
      • A, B, D contend for lock
      • C not contending
      • Context switch + yield takes 5ms
    • Timeline

      Time      Event                    Running  Runnable    Waiting
      Initial                                     A, B, C, D
      0-20      A scheduled              A        B, C, D
      20-25     B scheduled & blocked             C, D, A     B
      25-45     C scheduled              C        D, A        B
      45-50     D scheduled & blocked             A, C        B, D
      50-70     A scheduled              A        C           B, D
      70-90     C scheduled              C        A           B, D
      90-110    A scheduled & finished   A        C           B, D
      110-130   C scheduled & finished   C        B           D
      130-150   B scheduled & finished   B        D
      150-170   D scheduled & finished   D
  • Incorrect Implementation

```c
typedef struct {
    bool lock = false;
    bool guard = false;
    queue_t q;
} LockT;

// 1. Grab guard
// 2. If lock is held, add to queue and park
// 3. If lock is not held, grab the lock
void acquire(LockT *l) {
    while (XCHG(&l->guard, true))
        ;
    if (l->lock) {
        qadd(l->q, tid);
        l->guard = false;
        park(); // blocked
    } else {
        l->lock = true;
        l->guard = false;
    }
}

// 1. Grab guard
// 2. If queue is empty, release the lock
// 3. If the queue is not empty, unpark head of queue
void release(LockT *l) {
    while (XCHG(&l->guard, true))
        ;
    if (qempty(l->q))
        l->lock = false;
    else
        unpark(qremove(l->q));
    l->guard = false;
}
```
  • Questions and Answers

    • Why is guard used?
      To ensure queue operations are thread-safe

    • Why is it OK to spin on guard?
      The critical section it protects is very short

    • In release(), why not set lock = false when unparking?
      lock == true is passed directly from one thread to the next

  • Race Condition for Previous Implementation

    Thread 1 (in acquire)              Thread 2 (in release)
    if (l->lock) {
        qadd(l->q, tid);
        l->guard = false;
                                       while (XCHG(&l->guard, true));
                                       if (qempty(l->q)) // false!!
                                       else unpark(qremove(l->q));
                                       l->guard = false;
    park(); // sleeps forever: the unpark already happened

  • Correct Implementation

```c
typedef struct {
    bool lock = false;
    bool guard = false;
    queue_t q;
} LockT;

void acquire(LockT *l) {
    while (XCHG(&l->guard, true))
        ;
    if (l->lock) {
        qadd(l->q, tid);
        setpark(pid); // notify of plan to park
        l->guard = false;
        park(); // blocked (returns immediately if unpark ran first)
    } else {
        l->lock = true;
        l->guard = false;
    }
}

void release(LockT *l) {
    while (XCHG(&l->guard, true))
        ;
    if (qempty(l->q))
        l->lock = false;
    else
        unpark(qremove(l->q));
    l->guard = false;
}
```
  • Time Comparison: Yield vs Blocking

    • Assumption

      • Round robin scheduling, 10ms time slice
      • Process A, B, C, D, E, F, G, H, I, J in the system
      • Context switch takes 1ms
    • Timeline

      • A: lock() … compute … unlock()
      • B: lock() … compute … unlock()
      • J: lock() … compute … unlock()
      • A: lock() … compute … unlock()
    • If A’s compute is 30ms long, starting at t = 0, when does B get lock with yield?

      • 48 ms
        Timeline: A (10 ms), B…J yield (9 ms), A (10 ms), B…J yield (9 ms), A (10 ms), then B gets the lock
    • If A’s compute is 30ms long, starting at t = 0, when does B get lock with blocking?

      • 39 ms
        Timeline: A (10 ms), B…J block (9 ms), A (10 ms), A (10 ms), then B gets the lock

Queue Lock vs Spin Lock

  • Each approach is better under different circumstances

  • Uniprocessor

    • Waiting process is scheduled -> Process holding lock isn't
    • Waiting process should always relinquish processor
    • Associate queue of waiters with each lock (as in previous implementation)
  • Multiprocessor

    • Waiting process is scheduled -> Process holding lock might be
    • Spin or block depends on how long, t, before lock is released
      • Lock released quickly -> Spin-wait
      • Lock released slowly -> Block
      • Quick and slow are relative to context-switch cost, C

Condition Variables

Ordering

  • Idea: Thread A runs after Thread B does something

  • Example: Join

```c
pthread_t p1, p2;
Pthread_create(&p1, NULL, mythread, "A");
Pthread_create(&p2, NULL, mythread, "B");
// join waits for the threads to finish
Pthread_join(p1, NULL);
Pthread_join(p2, NULL);
return 0;
```

Condition Variables

  • Condition Variable: queue of waiting threads

  • B waits for a signal on CV before running: wait(CV, …)

  • A sends signal to CV when time for B to run: signal(CV, …)

  • wait(cond_t *cv, mutex_t *lock)

    • assumes the lock is held when wait() is called
    • puts caller to sleep + releases the lock (atomically)
    • when awoken, reacquires lock before returning
  • signal(cond_t *cv)

    • wake a single waiting thread (if >= 1 thread is waiting)
    • if there is no waiting thread, just return, doing nothing

Join Attempt 1: No State

  • Code

```c
// parent
void thread_join() {
    Mutex_lock(&m);    // x
    Cond_wait(&c, &m); // y
    Mutex_unlock(&m);  // z
}

// child
void thread_exit() {
    Mutex_lock(&m);   // a
    Cond_signal(&c);  // b
    Mutex_unlock(&m); // c
}
```
  • Intended schedule

    Time    1  2  3  4  5  6
    Parent  x  y           z
    Child         a  b  c
  • Broken schedule

    Time    1  2  3  4  5
    Parent           x  y
    Child   a  b  c
    • Parent is stuck because nobody will call signal
  • Rule of Thumb 1

    • Keep state in addition to CV’s
    • CV’s are used to signal threads when state changes
    • If state is already as needed, thread doesn’t wait for a signal!

Join Attempt 2: No Mutex Lock

  • Code

```c
// parent
void thread_join() {
    Mutex_lock(&m); // w
    // if the child has already finished,
    // the parent doesn't need to wait
    if (done == 0)         // x
        Cond_wait(&c, &m); // y
    Mutex_unlock(&m);      // z
}

// child
void thread_exit() {
    done = 1;        // a
    Cond_signal(&c); // b
}
```
  • Intended schedule

    Time    1  2  3  4  5  6
    Parent  w  x  y        z
    Child            a  b
  • Broken schedule

    Time    1  2  3  4  5
    Parent  w  x        y
    Child         a  b
    • Parent is stuck again

Join Attempt 3: State + Mutex Lock

  • Code

```c
// parent
void thread_join() {
    Mutex_lock(&m);        // w
    if (done == 0)         // x
        Cond_wait(&c, &m); // y
    Mutex_unlock(&m);      // z
}

// child
void thread_exit() {
    Mutex_lock(&m);   // a
    done = 1;         // b
    Cond_signal(&c);  // c
    Mutex_unlock(&m); // d
}
```
  • Schedule

    Time    1  2  3  4  5  6  7  8
    Parent  w  x  y                 z
    Child            a  b  c  d
  • Rule of Thumb 2

    • Hold mutex lock while calling wait/signal
    • Ensures no race between interacting with state and wait/signal

Producer/Consumer Problem

  • Example: UNIX pipes
    • A pipe may have many writers and readers

    • Internally, there is a finite-sized buffer

    • Writers add data to the buffer

      • Writers have to wait if buffer is full
    • Readers remove data from the buffer

      • Readers have to wait if buffer is empty
    • Implementation:

      • reads/writes to buffer require locking
      • when buffers are full, writers must wait
      • when buffers are empty, readers must wait
               Start (consumer)
               |
     +---------v---------------------------+------+
Buf: |         |          data             |      |
     +---------+---------------------------^------+
                                           |
                                           End (producer)
  • Producer/Consumer Problem
    • Producers generate data (like pipe writers)

    • Consumers grab data and process it (like pipe readers)

    • Producer/consumer problems are frequent in systems (e.g. web servers)

    • General strategy use condition variables to:

      • make producers wait when buffers are full
      • make consumers wait when there is nothing to consume

P/C Attempt 1: One CV

  • Code

```c
// Producer: 1. grab the lock; 2. if the buffer is full, wait;
// 3. fill one slot; 4. signal; 5. release the lock
void *producer(void *arg) {
    for (int i = 0; i < loops; i++) {
        Mutex_lock(&m);           // p1
        if (numfull == max)       // p2
            Cond_wait(&cond, &m); // p3
        do_fill(i);               // p4
        Cond_signal(&cond);       // p5
        Mutex_unlock(&m);         // p6
    }
}

// Consumer: 1. grab the lock; 2. if the buffer is empty, wait;
// 3. consume one slot; 4. signal; 5. release the lock
void *consumer(void *arg) {
    while (1) {
        Mutex_lock(&m);           // c1
        if (numfull == 0)         // c2
            Cond_wait(&cond, &m); // c3
        int tmp = do_get();       // c4
        Cond_signal(&cond);       // c5
        Mutex_unlock(&m);         // c6
    }
}
```
  • Broken schedule

    Time  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
    P                             p1  p2  p4  p5  p6  p1  p2  p3
    C1    c1  c2  c3
    C2                c1  c2  c3                                  c4  c5
    • At time 16, the consumer's signal (c5) could wake Consumer 1 (another consumer) instead of the producer; with a single CV, the wrong thread may be woken

P/C Attempt 2: Two CVs

  • How to wake the right thread? Use two condition variables

  • Code

```c
void *producer(void *arg) {
    for (int i = 0; i < loops; i++) {
        Mutex_lock(&m);            // p1
        if (numfull == max)        // p2
            Cond_wait(&empty, &m); // p3
        do_fill(i);                // p4
        Cond_signal(&fill);        // p5
        Mutex_unlock(&m);          // p6
    }
}

void *consumer(void *arg) {
    while (1) {
        Mutex_lock(&m);           // c1
        if (numfull == 0)         // c2
            Cond_wait(&fill, &m); // c3
        int tmp = do_get();       // c4
        Cond_signal(&empty);      // c5
        Mutex_unlock(&m);         // c6
    }
}
```
  • Broken schedule

    Time  1   2   3   4   5   6   7   8   9   10  11  12
    P                 p1  p4  p5  p6
    C1    c1  c2  c3                                  c4
    C2                                c1  c4  c5  c6
    • At time 12, Consumer 1 wakes up but has nothing to read
    • Note: When signal() is called, the thread may not resume immediately

P/C Attempt 3: Two CVs with While

  • Idea: after waking up, recheck that the shared variable is still in the state you want

  • Code

```c
void *producer(void *arg) {
    for (int i = 0; i < loops; i++) {
        Mutex_lock(&m);            // p1
        while (numfull == max)     // p2
            Cond_wait(&empty, &m); // p3
        do_fill(i);                // p4
        Cond_signal(&fill);        // p5
        Mutex_unlock(&m);          // p6
    }
}

void *consumer(void *arg) {
    while (1) {
        Mutex_lock(&m);           // c1
        while (numfull == 0)      // c2
            Cond_wait(&fill, &m); // c3
        int tmp = do_get();       // c4
        Cond_signal(&empty);      // c5
        Mutex_unlock(&m);         // c6
    }
}
```
  • Rule of Thumb 3

    • Whenever a lock is acquired, recheck assumptions about state!
    • Another thread could grab lock in between signal and wakeup from wait
    • Note that some libraries also have “spurious wakeups”
    • (may wake multiple waiting threads at signal or at any time)

Summary

  • Rules of Thumb for CVs

    1. Keep state in addition to CV’s
    2. Always do wait/signal with lock held
    3. Whenever thread wakes from waiting, recheck state
  • wait(cond_t *cv, mutex_t *lock)

    • assumes the lock is held when wait() is called
    • puts caller to sleep + releases the lock (atomically)
    • when awoken, reacquires lock before returning
  • signal(cond_t *cv)

    • wake a single waiting thread (if >= 1 thread is waiting)
    • if there is no waiting thread, just return, doing nothing

Semaphores

Introduction

  • Condition variables have no state (other than waiting queue)

    • Programmer must track additional state
  • Semaphores have state: track integer value

    • State cannot be directly accessed by user program
    • But state determines behavior of semaphore operations

Semaphore Operations

  • Allocate and Initialize

```c
sem_init(sem_t *s, int initval) {
    s->value = initval;
}
```
    • User cannot read or write value directly after initialization
  • Wait or Test (sometimes called P(), from Dutch): sem_wait(sem_t*)

    • Decrement sem value; wait until the value of sem is >= 0
  • Signal or Post (sometimes called V(), from Dutch): sem_post(sem_t*)

    • Increment sem value, then wake a single waiter
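For reference, the real POSIX API is slightly different from the simplified sem_init above. A minimal sketch, assuming <semaphore.h> on Linux:

```c
#include <semaphore.h>

sem_t s;

void example(void) {
    sem_init(&s, 0, 1); // middle arg 0: shared between threads, not processes
    sem_wait(&s);       // P(): decrement, block if the value would drop below 0
    /* critical section */
    sem_post(&s);       // V(): increment, wake one waiter
}
```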

Build Lock from Semaphore

```c
typedef struct {
    sem_t sem;
} lock_t;

void init(lock_t *lock) {
    sem_init(&lock->sem, 1); // 1 => lock available
}

void acquire(lock_t *lock) {
    sem_wait(&lock->sem);
}

void release(lock_t *lock) {
    sem_post(&lock->sem);
}
```

Join with CV vs Semaphores

  • Join with Condition Variable

```c
// parent
void thread_join() {
    Mutex_lock(&m);        // w
    if (done == 0)         // x
        Cond_wait(&c, &m); // y
    Mutex_unlock(&m);      // z
}

// child
void thread_exit() {
    Mutex_lock(&m);   // a
    done = 1;         // b
    Cond_signal(&c);  // c
    Mutex_unlock(&m); // d
}
```
  • Join with Semaphores

```c
sem_t s;
sem_init(&s, 0);

void thread_join() {
    sem_wait(&s);
}

void thread_exit() {
    sem_post(&s);
}
```
  • Join with Semaphores Example 1

        s                       s                      s
      +---+   parent wait()   +---+   child post()   +---+
      | 0 |+----------------->| -1|+---------------->| 0 |
      +---+                   +---+                  +---+
                                ^                      ^
                                |                      |
                                Parent blocked         Parent resumes
      
  • Join with Semaphores Example 2

        s                       s                      s
      +---+   child post()    +---+  parent wait()   +---+
      | 0 |+----------------->| 1 |+---------------->| 0 |
      +---+                   +---+                  +---+
      

P/C: 1 Producer & 1 Consumer with Buffer of Size 1

  • Use 2 semaphores

    • emptyBuffer: Initialize to 1
    • fullBuffer: Initialize to 0
  • Producer

```c
while (1) {
    sem_wait(&emptyBuffer);
    Fill(&buffer);
    sem_signal(&fullBuffer);
}
```
  • Consumer

```c
while (1) {
    sem_wait(&fullBuffer);
    Use(&buffer);
    sem_signal(&emptyBuffer);
}
```
  • Example 1: Producer comes first

    Time     Current Thread  emptyBuffer  fullBuffer
    Initial                  1            0
    1        Producer        0            1
    2        Consumer        1            0
    3        Producer        0            1
  • Example 2: Consumer comes first

    Time     Current Thread  emptyBuffer  fullBuffer
    Initial                  1            0
    1        Consumer        1            -1
    2        Producer        0            0
    3        Consumer        1            0

P/C: 1 Producer & 1 Consumer with Buffer of Size N

  • Use 2 semaphores

    • emptyBuffer: Initialize to N
    • fullBuffer: Initialize to 0
  • Producer

```c
int i = 0;
while (1) {
    sem_wait(&emptyBuffer);
    Fill(&buffer[i]);
    i = (i + 1) % N;
    sem_signal(&fullBuffer);
}
```
  • Consumer

```c
int j = 0;
while (1) {
    sem_wait(&fullBuffer);
    Use(&buffer[j]);
    j = (j + 1) % N;
    sem_signal(&emptyBuffer);
}
```
  • Example 1: Producer comes first (N = 3)

    Time     Curr  emptyBuffer  fullBuffer  Note
    Initial        3            0
    1        P1    2            1           wait(emptyBuffer) + fill + signal(fullBuffer)
    2        P2    1            2           wait(emptyBuffer) + fill + signal(fullBuffer)
    3        P3    0            3           wait(emptyBuffer) + fill + signal(fullBuffer)
    4        P4    -1           3           wait(emptyBuffer)
    5        C1    0            2           wait(fullBuffer) + use + signal(emptyBuffer)
    6        C2    1            1           wait(fullBuffer) + use + signal(emptyBuffer)
    7        P4    0            2           fill + signal(fullBuffer)
  • Example 2: Two consumers come first (N = 3)

    Time     Curr  emptyBuffer  fullBuffer  Note
    Initial        3            0
    1        C1    3            -1          wait(fullBuffer)
    2        C2    3            -2          wait(fullBuffer)
    3        P     2            -1          wait(emptyBuffer) + fill + signal(fullBuffer)
    4        C1    3            -1          use + signal(emptyBuffer)

P/C: Multiple Producers & Consumers

  • Requirements

    • Each consumer must grab unique filled element
    • Each producer must grab unique empty element
  • Attempt 1

    • Producer

```c
while (1) {
    sem_wait(&emptyBuffer);
    my_i = findempty(&buffer);
    Fill(&buffer[my_i]);
    sem_signal(&fullBuffer);
}
```
    • Consumer

```c
while (1) {
    sem_wait(&fullBuffer);
    my_j = findfull(&buffer);
    Use(&buffer[my_j]);
    sem_signal(&emptyBuffer);
}
```
    • Problem: findfull and findempty are not thread-safe

  • Attempt 2

    • Producer

```c
while (1) {
    sem_wait(&mutex);
    sem_wait(&emptyBuffer);
    my_i = findempty(&buffer);
    Fill(&buffer[my_i]);
    sem_signal(&fullBuffer);
    sem_signal(&mutex);
}
```
    • Consumer

```c
while (1) {
    sem_wait(&mutex);
    sem_wait(&fullBuffer);
    my_j = findfull(&buffer);
    Use(&buffer[my_j]);
    sem_signal(&emptyBuffer);
    sem_signal(&mutex);
}
```
    • Problem

      • Deadlock: a consumer grabs the mutex and waits forever on fullBuffer
  • Attempt 3

    • Producer

```c
while (1) {
    sem_wait(&emptyBuffer);
    sem_wait(&mutex);
    my_i = findempty(&buffer);
    Fill(&buffer[my_i]);
    sem_signal(&mutex);
    sem_signal(&fullBuffer);
}
```
    • Consumer

```c
while (1) {
    sem_wait(&fullBuffer);
    sem_wait(&mutex);
    my_j = findfull(&buffer);
    Use(&buffer[my_j]);
    sem_signal(&mutex);
    sem_signal(&emptyBuffer);
}
```
    • Problem

      • Cannot operate on multiple buffer locations at the same time
      • Only one thread at a time can be filling or using a buffer
  • Attempt 4

    • Producer

```c
while (1) {
    sem_wait(&emptyBuffer);
    sem_wait(&mutex);
    my_i = findempty(&buffer);
    sem_signal(&mutex);
    Fill(&buffer[my_i]);
    sem_signal(&fullBuffer);
}
```
    • Consumer

```c
while (1) {
    sem_wait(&fullBuffer);
    sem_wait(&mutex);
    my_j = findfull(&buffer);
    sem_signal(&mutex);
    Use(&buffer[my_j]);
    sem_signal(&emptyBuffer);
}
```
    • Advantage

      • Works and increases concurrency; only finding a buffer is protected by mutex;
      • Filling or Using different buffers can proceed concurrently

Reader/Writer Locks

  • Idea

    • Let multiple reader threads grab lock (shared)
    • Only one writer thread can grab lock (exclusive)
      • No reader threads
      • No other writer threads
  • Code

```c
typedef struct _rwlock_t {
    sem_t lock;      // protects the readers count
    sem_t writelock; // held by the writer, or by the reader group
    int readers;     // number of active readers
} rwlock_t;

void rwlock_init(rwlock_t *rw) {
    rw->readers = 0;
    // initialize semaphores to 1, like a mutex
    sem_init(&rw->lock, 1);
    sem_init(&rw->writelock, 1);
}

void rwlock_acquire_readlock(rwlock_t *rw) {
    sem_wait(&rw->lock);
    rw->readers++;
    if (rw->readers == 1)
        sem_wait(&rw->writelock); // first reader locks out writers
    sem_post(&rw->lock);
}

void rwlock_release_readlock(rwlock_t *rw) {
    sem_wait(&rw->lock);
    rw->readers--;
    if (rw->readers == 0)
        sem_post(&rw->writelock); // last reader lets writers in
    sem_post(&rw->lock);
}

void rwlock_acquire_writelock(rwlock_t *rw) {
    sem_wait(&rw->writelock);
}

void rwlock_release_writelock(rwlock_t *rw) {
    sem_post(&rw->writelock);
}
```
  • Example

    Time     Thread  Action             lock  writelock  readers
    Initial                             1     1          0
    1        T1      acquire_readlock   1     0          1
    2        T2      acquire_readlock   1     0          2
    3        T3      acquire_writelock  1     -1         2
    4        T1      release_readlock   1     -1         1
    5        T2      release_readlock   1     0          0
    (values are after each operation completes; lock dips to 0 inside each read-lock operation and returns to 1)
  • Quiz 1

    • T1: acquire_readlock() => T1 running
    • T2: acquire_readlock() => T2 running
    • T3: acquire_writelock() => T3 blocked, waiting for write lock
  • Quiz 2

    • T6: acquire_writelock() => T6 running
    • T4: acquire_readlock() => T4 blocked, waiting for read lock
    • T5: acquire_readlock() => T5 blocked, waiting for read lock

Build Zemaphore from Lock and CV

```c
typedef struct {
    int value;
    cond_t cond;
    lock_t lock;
} zem_t;

void zem_init(zem_t *z, int value) {
    z->value = value;
    cond_init(&z->cond);
    lock_init(&z->lock);
}

// decrement, then wait until value >= 0
void zem_wait(zem_t *z) {
    lock_acquire(&z->lock);
    z->value--;
    while (z->value < 0)
        cond_wait(&z->cond, &z->lock); // wait releases and reacquires the lock
    lock_release(&z->lock);
}

// increment value, then wake a single waiter
void zem_post(zem_t *z) {
    lock_acquire(&z->lock);
    z->value++;
    cond_signal(&z->cond);
    lock_release(&z->lock);
}
```

Summary

  • Semaphores are equivalent to locks + condition variables

    • Can be used for both mutual exclusion and ordering
  • Semaphores contain state

    • How they are initialized depends on how they will be used
    • Init to 0: Join (1 thread must arrive first, then other)
    • Init to N: Number of available resources
  • sem_wait(): Decrement and waits until value >= 0

  • sem_post(): Increment value, then wake a single waiter (atomic)

  • Can use semaphores in producer/consumer and for reader/writer locks

Concurrency Bugs

Concurrency in Medicine: Therac-25 (1980’s)

“The accidents occurred when the high-power electron beam was activated
instead of the intended low power beam, and without the beam spreader plate
rotated into place. Previous models had hardware interlocks in place to prevent
this, but Therac-25 had removed them, depending instead on software interlocks
for safety. The software interlock could fail due to a race condition.”

“…in three cases, the injured patients later died.”

Concurrency Study

Atomicity: MySQL

  • Bug

```c
// Thread 1
if (thd->proc_info) {
    // ...
    fputs(thd->proc_info, /*...*/);
    // ...
}

// Thread 2
thd->proc_info = NULL;
```
  • Fix

```c
// Thread 1
pthread_mutex_lock(&lock);
if (thd->proc_info) {
    // ...
    fputs(thd->proc_info, /*...*/);
    // ...
}
pthread_mutex_unlock(&lock);

// Thread 2
pthread_mutex_lock(&lock);
thd->proc_info = NULL;
pthread_mutex_unlock(&lock);
```

Ordering: Mozilla

  • Bug

```c
// Thread 1
void init() {
    // ...
    mThread = PR_CreateThread(mMain, /*...*/);
    // ...
}

// Thread 2
void mMain(/*...*/) {
    // ...
    mState = mThread->State;
    // ...
}
```
  • Fix

```c
// Thread 1
void init() {
    // ...
    mThread = PR_CreateThread(mMain, /*...*/);

    pthread_mutex_lock(&mtLock);
    mtInit = 1;
    pthread_cond_signal(&mtCond);
    pthread_mutex_unlock(&mtLock);
    // ...
}

// Thread 2
void mMain(/*...*/) {
    // ...
    pthread_mutex_lock(&mtLock);
    while (mtInit == 0)
        pthread_cond_wait(&mtCond, &mtLock);
    pthread_mutex_unlock(&mtLock);

    mState = mThread->State;
    // ...
}
```

Deadlock

Definition

  • No progress can be made because two or more threads are waiting for the other to take some action and thus neither ever does

Example 1: Circular Dependency

  • Code

    Thread 1              Thread 2
    lock(&A);
                          lock(&B);
    lock(&B); (blocked)
                          lock(&A); (blocked)

  • Circular Dependency

    • Cycle in dependency graph -> possible to have deadlock
  • Fix Deadlock Code

    Thread 1              Thread 2
    lock(&A);
    lock(&B);
                          lock(&A); (blocked until Thread 1 releases)
                          lock(&B);
  • Non-Circular Dependency

Example 2: Encapsulation

  • Code
```c
set_t *set_intersection(set_t *s1, set_t *s2) {
    set_t *rv = malloc(sizeof(*rv));
    mutex_lock(&s1->lock);
    mutex_lock(&s2->lock);
    for (int i = 0; i < s1->len; i++) {
        if (set_contains(s2, s1->items[i])) {
            set_add(rv, s1->items[i]);
        }
    }
    mutex_unlock(&s2->lock);
    mutex_unlock(&s1->lock);
    return rv;
}
```
  • Deadlock scenario

    • Thread 1: rv = set_intersection(setA, setB);
    • Thread 2: rv = set_intersection(setB, setA);
  • Encapsulation

```c
if (m1 > m2) {
    // grab locks in high-to-low address order
    pthread_mutex_lock(m1);
    pthread_mutex_lock(m2);
} else {
    pthread_mutex_lock(m2);
    pthread_mutex_lock(m1);
}
```
    • Problem: Deadlock happens when m1 == m2

Deadlock Theory

  • Deadlocks can only happen with these four conditions:

    1. mutual exclusion
    2. hold-and-wait
    3. no preemption
    4. circular wait
  • Can eliminate deadlock by eliminating any one condition

1. Mutual Exclusion

  • Problem: Threads claim exclusive control of resources that they require

  • Strategy: Eliminate locks! Replace locks with atomic primitive

  • Lock-free add

    • Implement add using lock

```c
void add(int *val, int amt) {
    Mutex_lock(&m);
    *val += amt;
    Mutex_unlock(&m);
}
```
    • Atomic primitive CompareAndSwap

```c
int CompareAndSwap(int *address, int expected, int new) {
    if (*address == expected) {
        *address = new;
        return 1; // success
    }
    return 0; // failure
}
```
    • Implement add without lock

```c
void add(int *val, int amt) {
    int old;
    do {
        old = *val; // snapshot the current value
    } while (!CompareAndSwap(val, old, old + amt)); // retry if it changed
}
```
  • Wait-free Linked List Insert

    • Implement insert using lock

```c
void insert(int val) {
    node_t *n = malloc(sizeof(*n));
    n->val = val;
    lock(&m);
    n->next = head;
    head = n;
    unlock(&m);
}
```
    • Implement insert using while loop

```c
void insert(int val) {
    node_t *n = Malloc(sizeof(*n));
    n->val = val;
    do {
        n->next = head;
    } while (!CompareAndSwap(&head, n->next, n));
}
```

2. Hold and Wait

  • Problem: Threads hold resources allocated to them while waiting for additional resources

  • Strategy: Acquire all locks atomically once. Can release locks over time, but cannot acquire again until all have been released

  • How to do this? Use a meta lock:

```c
lock(&meta);
lock(&L1); /*...*/ lock(&L10);
unlock(&L10); /*...*/ unlock(&L1);
unlock(&meta);
```
  • Disadvantages

    • Locks are not fine-grained

3. No Preemption

  • Problem: Resources (e.g., locks) cannot be forcibly removed from the threads that are holding them

  • Strategy: if thread can’t get what it wants, release what it holds

```c
top:
    lock(A);
    if (trylock(B) == -1) { // try to lock B
        unlock(A);          // if that failed, also release A
        goto top;           // and retry from the beginning
    }
```
  • Disadvantages

    • Livelock: a situation in which two or more processes continuously change their states in response to each other without doing any useful work

Circular Wait

  • Circular chain of threads such that each thread holds a resource (e.g., lock)
    being requested by next thread in the chain.

  • Strategy:

    • decide which locks should be acquired before others
    • if A before B, never acquire A if B is already held!
    • document this, and write code accordingly
  • Works well if system has distinct layers

Concurrent Data Structures

Scalability Measure

  • N times as much work on N cores as done on 1 core.

  • Strong scalability

    • Fix the input size; adding cores should reduce the running time

    • e.g. Matrix multiplication: $A_{m\times n} \times B_{n\times d}$ requires $O(mnd)$ floating-point operations (FLOPs)

     Time 
       ^
       |
       |     **
       |     **
       |     **
       |     **
       |     **     **
       |     **     **
       |     **     **     **
       |     **     **     **     **
       +-----++-----++-----++-----++----> Number of Cores
       0     1      2      3      4
    
  • Weak scaling:

    • Increase input size with number of cores

    • e.g. Matrix multiplication

             A          B          FLOPs
    1 core   100 × 100  100 × 100  10^6
    2 cores  100 × 200  200 × 100  2 × 10^6
    3 cores  100 × 300  300 × 100  3 × 10^6
    4 cores  100 × 400  400 × 100  4 × 10^6
     Time 
       ^
       |
       |     **     **     **     **
       |     **     **     **     **
       |     **     **     **     **
       |     **     **     **     **
       |     **     **     **     **
       |     **     **     **     **
       |     **     **     **     **
       |     **     **     **     **
       +-----++-----++-----++-----++----> Number of Cores
       0     1      2      3      4
      

Counter

  • Non-thread-safe Counter

```c
typedef struct __counter_t {
    int value;
} counter_t;

void init(counter_t *c) {
    c->value = 0;
}
void increment(counter_t *c) {
    c->value++;
}
int get(counter_t *c) {
    return c->value;
}
```
    • Problem: two threads call increment at the same time
  • Thread-safe counter

```c
typedef struct __counter_t {
    int value;
    pthread_mutex_t lock;
} counter_t;

void init(counter_t *c) {
    c->value = 0;
    Pthread_mutex_init(&c->lock, NULL);
}
void increment(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    c->value++;
    Pthread_mutex_unlock(&c->lock);
}
int get(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    int value = c->value;
    Pthread_mutex_unlock(&c->lock); // unlock before returning
    return value;
}
```
  • Linearizability

    • Even if two threads execute in parallel on multiple cores, the effect that you see should be as if all of them are executed in some linear order.

    • Example: T1 and T2 call increment first, then T3 calls get.

    • Since T3 arrived after T1 and T2, we would want T3 to see the values after T1 and T2 have finished executing as if these were three instructions executed by a single processor

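A linearizable counter can also be sketched with atomic read-modify-write operations; here using GCC/Clang __atomic builtins (my illustration, not from the notes). Concurrent increments take effect in some total order, so a get() issued after both increments observes both.

```c
static int value = 0;

// each increment is a single atomic read-modify-write
void increment(void) { __atomic_fetch_add(&value, 1, __ATOMIC_SEQ_CST); }
int  get(void)       { return __atomic_load_n(&value, __ATOMIC_SEQ_CST); }
```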
  • The Underlying Problem

    • Ticket lock
```c
struct spinlock_t {
    int current_ticket; // turn
    int next_ticket;
};

void spin_lock(spinlock_t *lock) {
    int t = atomic_inc(lock->next_ticket);
    while (t != lock->current_ticket)
        ; // spin
}

void spin_unlock(spinlock_t *lock) {
    lock->current_ticket++;
}
```
    • If one thread holds the lock, all other threads must keep checking the lock holder's turn value
    • So each lock acquisition becomes more expensive as the machine grows from two to four to eight cores…

Approximate Counter (Sloppy Counter)

  • Motivation

    • With a standard thread-safe counter (strongest possible consistency), performance is poor with many threads; scalability is poor.
    • Cross-core messages are expensive on multicore systems (conclusion from "An Analysis of Linux Scalability to Many Cores", Boyd-Wickizer et al., OSDI 2010, which benchmarks Linux on a 48-core machine). With the ticket lock in Linux, if one core holds the lock, every other core must check with the lock-holding core for the next turn value. We want to reduce the number of cross-core messages, which are very expensive in this situation; one way is to relax consistency.
  • Idea

    • Maintain a counter per-core and a global counter
    • Global counter lock
    • Per-core locks if more than 1 thread per-core
  • Increment: update the local counter; once it reaches the threshold, flush it to the global counter

  • Read: return the global counter (may be slightly inaccurate)
  • Code

```c
typedef struct __counter_t {
    int global;                     // global count
    pthread_mutex_t glock;          // global lock
    int local[NUMCPUS];             // local count (per cpu)
    pthread_mutex_t llock[NUMCPUS]; // ... and locks
    int threshold;                  // update frequency
} counter_t;

// init: record threshold, init locks, init values
// of all local counts and global count
void init(counter_t *c, int threshold) {
    c->threshold = threshold;
    c->global = 0;
    pthread_mutex_init(&c->glock, NULL);

    for (int i = 0; i < NUMCPUS; i++) {
        c->local[i] = 0;
        pthread_mutex_init(&c->llock[i], NULL);
    }
}

// update: usually, just grab local lock and update local amount;
// once local count has risen by 'threshold', grab global lock
// and transfer local values to it
void update(counter_t *c, int threadID, int amt) {
    pthread_mutex_lock(&c->llock[threadID]);
    c->local[threadID] += amt; // assumes amt > 0
    if (c->local[threadID] >= c->threshold) { // transfer to global
        pthread_mutex_lock(&c->glock);
        c->global += c->local[threadID];
        pthread_mutex_unlock(&c->glock);
        c->local[threadID] = 0;
    }
    pthread_mutex_unlock(&c->llock[threadID]);
}

// get: just return global amount (which may not be perfect)
int get(counter_t *c) {
    pthread_mutex_lock(&c->glock);
    int val = c->global;
    pthread_mutex_unlock(&c->glock);
    return val; // only approximate!
}
```

Concurrent Linked List

  • First Attempt

```c
void List_Insert(list_t *L, int key) {
    pthread_mutex_lock(&L->lock);
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        pthread_mutex_unlock(&L->lock);
        return; // fail
    }
    new->key = key;
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
    return; // success
}
```
  • Better Implementation (Shorter Critical Section)

```c
void List_Insert(list_t *L, int key) {
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        return; // fail
    }
    pthread_mutex_lock(&L->lock);
    new->key = key;
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
    return; // success
}
```

Hash Table from List

  • Idea

    • Avoid contention by using a different lock per bucket: more fine-grained locking reduces cross-thread contention, which leads to better scaling with many threads
  • Code

```c
#define BUCKETS (101)

typedef struct __hash_t {
    list_t lists[BUCKETS];
} hash_t;

int Hash_Insert(hash_t *H, int key) {
    int bucket = key % BUCKETS;
    return List_Insert(&H->lists[bucket], key);
}
```

Concurrent Queue

  • Idea: use 2 locks (one for the head, one for the tail) so that enqueuing and dequeuing threads do not conflict with each other
  • One more thing to watch in the following implementation: when there is only one element, head and tail point to the same node, so care is needed (one option is grabbing both locks; the dummy-node trick sketched after the code avoids this)
```c
void Queue_Enqueue(queue_t *q, int value) {
    node_t *tmp = malloc(sizeof(node_t));
    assert(tmp != NULL);
    tmp->value = value;
    tmp->next = NULL;

    pthread_mutex_lock(&q->tailLock);
    q->tail->next = tmp;
    q->tail = tmp;
    pthread_mutex_unlock(&q->tailLock);
}

int Queue_Dequeue(queue_t *q, int *value) {
    pthread_mutex_lock(&q->headLock);
    node_t *tmp = q->head;
    node_t *newHead = tmp->next;
    if (newHead == NULL) {
        pthread_mutex_unlock(&q->headLock);
        return -1; // queue was empty
    }
    *value = newHead->value;
    q->head = newHead;
    pthread_mutex_unlock(&q->headLock);
    free(tmp);
    return 0;
}
```
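A hypothetical init routine for the queue above (assumed; not shown in the original notes): a dummy node guarantees that head and tail never alias a real element, which is what lets enqueue and dequeue hold separate locks.

```c
void Queue_Init(queue_t *q) {
    node_t *dummy = malloc(sizeof(node_t)); // dummy node: always at the head
    assert(dummy != NULL);
    dummy->next = NULL;
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->headLock, NULL);
    pthread_mutex_init(&q->tailLock, NULL);
}
```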

Summary

  • Simple approach: add a lock at the start and end of each method. Example: Java's synchronized keyword (public synchronized int get() { … }). This kind of synchronized keyword is standard in high-level languages, but it may reduce performance and multicore scalability.

  • Check for scalability – weak scaling, strong scaling

  • If you are not happy with the scalability properties, try to optimize by avoiding cross-thread, cross-core traffic, using methods such as:

    • Per-core (sloppy) counters: relax consistency
    • Buckets in a hash table: more fine-grained locks, less cross-thread contention
    • Keep critical sections small
    • Avoiding locks entirely (lock-free techniques) is faster than using locks

Boyer–Moore majority vote algorithm

Boyer–Moore majority vote algorithm

Boyer–Moore majority vote algorithm is an algorithm that finds the majority element of a given sequence (the element that appears more than ⌊n/2⌋ times, assuming it exists) in O(N) time and O(1) space.

Pseudocode

  • Initialize an element m and a counter i with i = 0
  • For each element x of the input sequence:
    • If i = 0, then assign m = x and i = 1
    • else if m = x, then assign i = i + 1
    • else assign i = i − 1
  • Return m
    Initialize major = arr[0] and counter = 1
    Iterate over every element of the sequence from index 1 to the end:
    if counter == 0, reset: set major to the current element and counter to 1
    else if major == the current element, increment counter
    else (the current element is not major and counter need not be reset) decrement counter

Example

Leetcode 169
Given an array of size n, find the majority element. The majority element is the element that appears more than ⌊ n/2 ⌋ times.
You may assume that the array is non-empty and the majority element always exist in the array.

```java
class Solution {
    public int majorityElement(int[] num) {
        int major = num[0], count = 1;
        for (int i = 1; i < num.length; i++) {
            if (count == 0) {
                count++;
                major = num[i];
            } else if (major == num[i]) {
                count++;
            } else {
                count--;
            }
        }
        return major;
    }
}
```
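When the majority element is not guaranteed to exist, the candidate produced by the vote needs a second verification pass. A minimal sketch in C, matching the rest of these notes (arr, n, major are illustrative names, not part of the solution above):

```c
// return 1 iff 'major' really appears more than n/2 times in arr[0..n-1]
int is_majority(const int *arr, int n, int major) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (arr[i] == major)
            count++;
    return count > n / 2;
}
```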

References:
  • https://en.wikipedia.org/wiki/Boyer–Moore_majority_vote_algorithm
  • https://www.zhihu.com/question/49973163/answer/235921864
