Monthly Archives: November 2019

2019.11.13 Wednesday: Code Worker

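Although I have worked in the IT industry all these years, never changing direction or even specialty,
and have been a programmer the whole time, I still believe that when I first entered the field after
graduation, I was far from being a programmer.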

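We cannot casually call someone a writer just because they have a writing habit. I write something down
every day, yet you still cannot call me a writer. I know exactly how much my writing is worth, or more
precisely, how little. As everyone knows, however unusual a person's interest in writing and however firm
their resolve, we only call them a writer once they have produced articles or books that are good enough.
How many it takes varies. For the exceptionally gifted, such as Margaret Mitchell, a single book,
Gone with the Wind, was enough to secure an unshakable place in literary history. As for the web-novel
authors who fill today's internet, we should more precisely call them web-novel text workers.
“Worker” sounds like a lower rank than “writer”, and inevitably so, because text workers sell articles
for a fee priced by length, say 100 yuan per thousand characters. However brilliant the content, however
profound the theme, however ornate the language, these articles are all charged by total length.
Translation is a similar trade: a junior translator is also paid by the word.
Only when these workers' efforts accumulate to the point where quantity turns into quality, and they
become well known and no longer charge by the word, do we call them writers or translators.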

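A programmer's work is hard to quantify.
You may have spent less time writing code, or a great deal of time refactoring, turning a thousand-line
file into five hundred lines, but unless someone understands that code as well as its author does, others
may not see the value of the work. So in the early days of software engineering, a programmer's work was
quantified by lines of code: the more lines in the commits you submitted, the greater your output was
considered to be and the higher your performance rating. It sounds like a joke, but the industry has
perhaps only recently begun to abandon this way of evaluating people.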

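Around the end of 2011, I had firmly made up my mind to become a programmer.
But as soon as I entered the industry, I ran into these “infuriating” performance-review problems.
I was puzzled at the time: “Why evaluate a programmer's work with a metric as obviously unreasonable as lines of code?”
Only when I turned the question on myself did I understand.
That is why I call my early self a “code worker”: the code I wrote had not yet produced any notable effect.
My work then was like the accumulation of quantity that a text worker goes through before growing into a “writer”.
The code from that period served mostly to train my coding skills, like the tens of thousands of hours a pianist spends striking keys before becoming a pianist.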

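Today I have long since moved past that “infuriating” question and no longer fret over whether a programmer's workload is judged by the amount of code.
In any profession, going from novice to top expert takes practice, and quantity has to accumulate before it turns into quality.
Setting flashy tricks aside and focusing on getting a “thing” done is what really lifts a practitioner's career.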

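Getting a “thing” done takes many kinds of ability.
In my own job, for example, I have to handle solution research, project planning, product design, issue tracking, staffing, and so on.
Coding comes last.
So I no longer insist on calling myself a “programmer”, because what I need to do goes far beyond the coding a programmer does.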

“Get things done” is the real goal of a profession.

Troubleshooting: haproxy TCP long-lived connections do not fail over

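Symptoms

The client 192.168.1.35:1234 establishes a long-lived connection to 192.168.1.100:81, where haproxy listens.

haproxy has two backends, 192.168.1.38:81 and 192.168.1.29:81.
Because it runs in non-transparent mode, it opens its own connection to a backend; assume this one goes to 192.168.1.29:81.

Once the client's long-lived connection is established, two keepalive connections are visible on haproxy: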

# netstat -anpto | grep "\:81" | grep keepalive
tcp        0      0 192.168.1.100:42554     192.168.1.29:81         ESTABLISHED 9943/haproxy     keepalive (30.28/0/0)
tcp        0      0 192.168.1.39:81         192.168.1.35:43620      ESTABLISHED 9943/haproxy     keepalive (30.28/0/0)

At this point, add an iptables rule on haproxy:

iptables -A OUTPUT -d 192.168.1.29 -m tcp -p tcp --dport 81 -j DROP

to simulate the backend going down.

When 192.168.1.29:81 becomes unreachable, haproxy quickly detects that the backend is down and updates its status:

root@i-hv6jj9ay:~# /usr/local/bin/lb-collect 'show status'
lbb-0vo2fiec|UP
lbb-811tmjow|UP
lbb-v8y2j191|UP 1/2
root@i-hv6jj9ay:~# /usr/local/bin/lb-collect 'show status'
lbb-0vo2fiec|UP
lbb-811tmjow|UP
lbb-v8y2j191|DOWN
root@i-hv6jj9ay:~#

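But the client's long-lived connection was not closed properly; it simply hung.

In other words, haproxy does not fail over in this long-lived connection scenario.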
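
The status lines above are easy to check from a script. A minimal sketch, assuming the name|STATE format seen in the sample output (lb-collect is a site-specific tool, and the function names below are hypothetical):

```python
from typing import Dict, List


def parse_lb_status(output: str) -> Dict[str, str]:
    """Map backend name -> state from lines shaped like 'lbb-xxxx|UP'."""
    status = {}
    for line in output.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # skip shell prompts and blank lines
        name, _, state = line.partition("|")
        status[name] = state
    return status


def down_backends(output: str) -> List[str]:
    """Return the names of backends whose state starts with DOWN."""
    return [name for name, state in parse_lb_status(output).items()
            if state.startswith("DOWN")]


if __name__ == "__main__":
    sample = "lbb-0vo2fiec|UP\nlbb-811tmjow|UP\nlbb-v8y2j191|DOWN\n"
    print(down_backends(sample))
```

Intermediate states such as "UP 1/2" are kept verbatim in the parsed state, so a caller can decide how to treat them.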

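Analysis

For long-lived connections, haproxy has a tunnel timeout setting.
It defines how long both ends of the connection may stay idle before the connection is considered dead, so it cannot be set too low either.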

  • timeout tunnel

    The tunnel timeout applies when a bidirectional connection is established
    between a client and a server, and the connection remains inactive in both
    directions. This timeout supersedes both the client and server timeouts once
    the connection becomes a tunnel. In TCP, this timeout is used as soon as no
    analyser remains attached to either connection (eg: tcp content rules are
    accepted). In HTTP, this timeout is used when a connection is upgraded (eg:
    when switching to the WebSocket protocol, or forwarding a CONNECT request
    to a proxy), or after the first response when no keepalive/close option is
    specified.

    Since this timeout is usually used in conjunction with long-lived connections,
    it usually is a good idea to also set “timeout client-fin” to handle the
    situation where a client suddenly disappears from the net and does not
    acknowledge a close, or sends a shutdown and does not acknowledge pending
    data anymore. This can happen in lossy networks where firewalls are present,
    and is detected by the presence of large amounts of sessions in a FIN_WAIT
    state.

  • on-marked-down

    Modify what occurs when a server is marked down.

    Currently one action is available:

    • shutdown-sessions: Shutdown peer sessions. When this setting is enabled,
      all connections to the server are immediately terminated when the server
      goes down. It might be used if the health check detects more complex cases
      than a simple connection status, and long timeouts would cause the service
      to remain unresponsive for too long a time. For instance, a health check
      might detect that a database is stuck and that there’s no chance to reuse
      existing connections anymore. Connections killed this way are logged with
      a ‘D’ termination code (for “Down”).

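      This option is disabled by default.

      Testing shows that for a keepalive long-lived connection, if the backend recovers
      within a certain time after going down (that is, within the tunnel timeout), the
      connection can carry on. So this option is not enabled by default.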

Solution

Configure default-server on-marked-down shutdown-sessions:

```
default-server on-marked-down shutdown-sessions
server  lbb-811tmjow 192.168.1.29:81 check inter 200 fall 1 rise 1 weight 1
server  lbb-v8y2j191 192.168.1.38:81 check inter 200 fall 1 rise 1 weight 1
```

Or configure it on one specific server line:

```
# default-server on-marked-down shutdown-sessions
server  lbb-811tmjow 192.168.1.29:81 check inter 200 fall 1 rise 1 weight 1
server  lbb-v8y2j191 192.168.1.38:81 check inter 200 fall 1 rise 1 weight 1 on-marked-down shutdown-sessions
```

This ensures that when a backend goes offline, every connection associated with it is closed immediately.
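
With shutdown-sessions in place, the client should see its connection closed instead of hanging, and can then reconnect through haproxy to reach a healthy backend. A minimal client-side sketch of detecting that close; peer_closed is a hypothetical helper for illustration and not part of haproxy:

```python
import socket


def peer_closed(sock: socket.socket, timeout: float = 1.0) -> bool:
    """Return True if the peer has closed or reset the connection.

    Peeks at one byte without consuming it. Note that this changes
    the socket's timeout setting as a side effect.
    """
    sock.settimeout(timeout)
    try:
        data = sock.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        return False  # nothing to read yet: connection is idle, not closed
    except ConnectionResetError:
        return True   # peer sent a reset
    return data == b""  # empty read means an orderly close


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(peer_closed(a, 0.1))  # peer still open
    b.close()
    print(peer_closed(a, 0.1))  # peer closed
```

A client loop could call this before reusing an idle connection and open a new one when it returns True.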

Troubleshooting: ssh to a CentOS host is slow or even times out

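Symptoms

Some production machines could never connect when ssh-ing to a CentOS jump host, while other clients could ssh to the same jump host normally.

With a timeout added, the command always timed out.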

root@host:~# timeout 5 ssh root@1.2.3.4 -p 2222
root@host:~#

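Analysis

Logging in to the CentOS server by another route, capturing packets, and inspecting connections showed the following:

  1. The three-way handshake actually succeeds, but after the connection is established the two ends keep exchanging traffic, as if transferring some data or confirming something:
# The first three packets show that the handshake completed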
22:23:46.194993 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [S], seq 4277261782, win 28690, options [mss 1510,nop,nop,TS val 470231038 ecr 0,nop,wscale 10], length 0
22:23:46.195046 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [S.], seq 3085339803, ack 4277261783, win 28960, options [mss 1460,nop,nop,TS val 186346509 ecr 470231038,nop,wscale 7], length 0
22:23:46.195133 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [.], ack 1, win 29, options [nop,nop,TS val 470231038 ecr 186346509], length 0
22:23:46.195329 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1:50, ack 1, win 29, options [nop,nop,TS val 470231039 ecr 186346509], length 49
22:23:46.195348 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [.], ack 50, win 227, options [nop,nop,TS val 186346510 ecr 470231039], length 0
22:23:46.207342 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [P.], seq 1:24, ack 50, win 227, options [nop,nop,TS val 186346522 ecr 470231039], length 23
22:23:46.207416 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [.], ack 24, win 29, options [nop,nop,TS val 470231051 ecr 186346522], length 0
22:23:46.207770 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 50:1458, ack 24, win 29, options [nop,nop,TS val 470231051 ecr 186346522], length 1408
22:23:46.208860 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [P.], seq 24:1664, ack 1458, win 249, options [nop,nop,TS val 186346523 ecr 470231051], length 1640
22:23:46.208945 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [.], ack 1664, win 32, options [nop,nop,TS val 470231052 ecr 186346523], length 0
22:23:46.210745 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1458:1506, ack 1664, win 32, options [nop,nop,TS val 470231054 ecr 186346523], length 48
22:23:46.221765 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [P.], seq 1664:1944, ack 1506, win 249, options [nop,nop,TS val 186346536 ecr 470231054], length 280
22:23:46.224305 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1506:1522, ack 1944, win 35, options [nop,nop,TS val 470231067 ecr 186346536], length 16
22:23:46.264170 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [.], ack 1522, win 249, options [nop,nop,TS val 186346579 ecr 470231067], length 0
22:23:46.264247 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1522:1566, ack 1944, win 35, options [nop,nop,TS val 470231107 ecr 186346579], length 44
22:23:46.264264 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [.], ack 1566, win 249, options [nop,nop,TS val 186346579 ecr 470231107], length 0
22:23:46.264340 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [P.], seq 1944:1988, ack 1566, win 249, options [nop,nop,TS val 186346579 ecr 470231107], length 44
22:23:46.264416 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1566:1626, ack 1988, win 35, options [nop,nop,TS val 470231108 ecr 186346579], length 60
22:23:46.264894 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [P.], seq 1988:2032, ack 1626, win 249, options [nop,nop,TS val 186346579 ecr 470231108], length 44
22:23:46.264984 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1626:1990, ack 2032, win 35, options [nop,nop,TS val 470231108 ecr 186346579], length 364
22:23:46.271079 IP 100.67.76.58.EtherNet/IP-1 > 10.16.150.17.53924: Flags [P.], seq 2032:2356, ack 1990, win 271, options [nop,nop,TS val 186346586 ecr 470231108], length 324
22:23:46.272353 IP 10.16.150.17.53924 > 100.67.76.58.EtherNet/IP-1: Flags [P.], seq 1990:2626, ack 2356, win 37, options [nop,nop,TS val 470231115 ecr 186346586], length 636
  2. netstat at this point also shows the connection as ESTABLISHED:
tcp        0     76 100.67.76.58:2222       10.16.150.17:53924      ESTABLISHED 6995/sshd: root [pr  on (0.20/0/0)
  3. Then I remembered that ssh has a verbose option, added it, and got the following log:
root@ap3ar03n03:~# timeout 5 ssh root@100.67.76.58 -p 2222 -v
OpenSSH_7.7p1 Ubuntu-4ubuntu0.1+pitrix1, OpenSSL 1.0.2g  1 Mar 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 55: Deprecated option "useroaming"
debug1: Connecting to 100.67.76.58 [100.67.76.58] port 2222.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/id_rsa type 0
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_xmss type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_7.7p1 Ubuntu-4ubuntu0.1+pitrix1
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1
debug1: match: OpenSSH_6.6.1 pat OpenSSH_6.6.1* compat 0x04000000
debug1: Authenticating to 100.67.76.58:2222 as 'root'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:MTFlctqbVPTZGqzDJbMEzgNUtZ3oKUheYmB45T2tuAc
debug1: Host '[100.67.76.58]:2222' is known and matches the ECDSA host key.
debug1: Found key in /root/.ssh/known_hosts:52
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: gssapi-keyex
debug1: No valid Key exchange context
debug1: Next authentication method: gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
...
# ...and it keeps looping like this indefinitely

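With -vvv you can see that the client keeps retrying gssapi-with-mic here. Searching for that keyword shows that quite a few people have hit this problem, though not this severely.

Most reports describe logging in to CentOS as slow, or with a noticeable pause, rather than what I saw: no connection at all, ending in a timeout.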
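
The repeated line in the verbose log is a usable signature for this failure. A rough sketch that flags it, assuming the ssh -v output has been captured into a string; the function name and the threshold of 3 are arbitrary choices for illustration:

```python
AUTH_LINE = "debug1: Authentications that can continue:"


def looks_like_auth_loop(log: str, threshold: int = 3) -> bool:
    """Return True if the 'Authentications that can continue' line
    appears at least `threshold` times in a row, with no other
    debug line in between."""
    run = 0
    for line in log.splitlines():
        if line.startswith(AUTH_LINE):
            run += 1
            if run >= threshold:
                return True
        else:
            run = 0  # any other line breaks the streak
    return False


if __name__ == "__main__":
    stuck = (AUTH_LINE + " publickey,gssapi-keyex,gssapi-with-mic\n") * 9
    print(looks_like_auth_loop(stuck))
```

A healthy login also prints this line, but normally with "Next authentication method" lines in between, which resets the counter.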

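Solution

There are two ways to solve this, on the client side or on the server side:

  1. Client side

    a. Add -o GSSAPIAuthentication=no to the ssh command line;

    b. or set it in the client's /etc/ssh/ssh_config.

  2. Server side

Edit /etc/ssh/sshd_config, turn GSSAPIAuthentication off, and restart the sshd service.

# If the value is yes, change it to no. Or simply comment out these two settings.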
# GSSAPIAuthentication no
# GSSAPICleanupCredentials no
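
Both fixes can be scripted. A minimal sketch under stated assumptions: disable_gssapi and ssh_command are hypothetical helpers, the config is assumed to use the plain "Keyword value" form with no Match blocks, only GSSAPIAuthentication is touched, and restarting sshd is left to the operator:

```python
import re
from typing import List

# Matches an active (uncommented) GSSAPIAuthentication line.
_GSSAPI_LINE = re.compile(r"^\s*GSSAPIAuthentication\s+\S+", re.IGNORECASE)


def disable_gssapi(config_text: str) -> str:
    """Return sshd_config text with GSSAPIAuthentication set to no.

    Active lines are rewritten in place. If none exists, the setting
    is added at the top of the file.
    """
    lines = config_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if _GSSAPI_LINE.match(line):
            lines[i] = "GSSAPIAuthentication no"
            found = True
    if not found:
        lines.insert(0, "GSSAPIAuthentication no")
    return "\n".join(lines) + "\n"


def ssh_command(host: str, port: int, user: str = "root") -> List[str]:
    """Build a client command line with GSSAPI authentication disabled."""
    return ["ssh", "-o", "GSSAPIAuthentication=no",
            "-p", str(port), f"{user}@{host}"]


if __name__ == "__main__":
    print(disable_gssapi("Port 2222\nGSSAPIAuthentication yes\n"), end="")
    print(" ".join(ssh_command("1.2.3.4", 2222)))
```

The command list can be passed to subprocess.run with a timeout to reproduce the original check without hanging.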

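Notes

GSSAPI: Generic Security Service Application Program Interface. GSSAPI itself is a set of APIs standardized by the IETF.
Its main and best-known implementation is based on Kerberos, and mentions of GSSAPI generally imply the Kerberos implementation.
GSSAPI is a generic interface to network security systems. It wraps the various client-server security mechanisms to hide the differences between their interfaces and make programming easier.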
