[02:36:09] FIRING: LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[02:41:09] RESOLVED: [2x] LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:26:52] 06Traffic, 10ops-magru: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10034399 (10Volans)
[09:50:14] 10netops, 06Infrastructure-Foundations, 06SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10034735 (10ayounsi)
[12:52:10] 06Traffic, 10ops-magru: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10035242 (10Fabfur) Do we have any evidence that the disk has not been manually removed/tampered?
[13:08:16] 10netops, 06Infrastructure-Foundations, 06SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10035314 (10cmooney) I've been playing with this a little on Netbox-Next, you can see the data here covering our existing GRE tunnels: https://netbox-next.wikimedia.org/vpn/tunnels/ Initia...
[13:46:53] FIRING: SystemdUnitFailed: wmf_auto_restart_benthos@haproxy_cache.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:52] fabfur: ^
[13:48:25] thanks yes I'm working on that, it's disabled
[15:18:30] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10035781 (10Papaul) @cmooney links removed. You can resolve the task if nothing else needs to be done.
[16:03:58] 10netops, 06Infrastructure-Foundations, 06SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10036015 (10cmooney) After discussing with @ayounsi on irc I've adjusted the approach: https://netbox-next.wikimedia.org/vpn/tunnels/ Principal decisions were: # We will use a group calle...
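As a rough sketch of how a SystemdUnitFailed alert like the one above is typically triaged, assuming only standard systemd tooling (the unit name is taken from the alert itself; the actual procedure is the check_systemd_state runbook linked in the alert):

    # show why the unit flagged by the alert failed, and its recent log output
    sudo systemctl status wmf_auto_restart_benthos@haproxy_cache.service
    sudo journalctl -u wmf_auto_restart_benthos@haproxy_cache.service --since "-1h"
    # list every unit currently in a failed state on the host
    systemctl list-units --state=failed
    # once the cause is handled, clear the failed state so the alert can resolve
    sudo systemctl reset-failed wmf_auto_restart_benthos@haproxy_cache.service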
[18:48:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp2031:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2031&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[18:50:00] hmmmm
[18:51:38] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 208.80.153.224:443 @ cp2031 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:51:53] FIRING: SystemdUnitFailed: haproxy.service on cp2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:04] depooled
[18:52:09] looking at what's up because something is
[18:53:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp2031:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2031&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[18:53:41] What was messing with haproxy?
[18:53:47] still looking
[18:55:57] er
[18:56:03] sorry, I didn't !log apparently
[18:56:13] I was trying something there, and didn't realize it would alert in here
[18:56:13] I guessed looking at the puppetboard :)
[18:56:16] Aug 01 18:46:51 cp2031 haproxy[3638139]: [ALERT] (3638139) : Current worker (2408212) exited with code 143 (Terminated)
[18:56:17] no worries
[18:56:21] it's depooled now fwiw
[18:56:25] ack
[18:56:28] I wasn't sure what was up and hence
[18:56:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 208.80.153.224:443 @ cp2031 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:56:53] RESOLVED: SystemdUnitFailed: haproxy.service on cp2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:01] hmmm 143 is SIGTERM, isn't it?
[18:57:17] vgutierrez: yeah, I did a systemctl restart with a config that wasn't parsing
[18:57:19] yes
[18:57:29] (while it was depooled)
[18:57:31] cdanis: a reload is enough btw
[18:57:43] and it checks the config syntax for you
[18:57:59] do you happen to know, is a reload enough even if you change stick-table definitions?
[18:58:27] cdanis: a reload ends up spawning a new haproxy process
[18:58:30] so yes
[18:58:34] cdanis: you can confirm with the API even but I think so
[18:58:44] (not that I am the expert in this)
[18:58:56] ah okay cool, does it do the trick where it passes sockets and state to the new process?
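To illustrate the reload path discussed above, a minimal sketch assuming the stock haproxy binary and its systemd unit (the config path is the common Debian default, not necessarily what the cp hosts actually use):

    # parse-check the config first; -c only validates, it does not start a listener
    sudo haproxy -c -f /etc/haproxy/haproxy.cfg
    # reload rather than restart: the master spawns new workers with the new config
    # and hands over the listening sockets, so established connections are not dropped
    sudo systemctl reload haproxy.service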
[18:59:04] yes
[18:59:05] yeah
[18:59:15] that's the default behavior for latest haproxy versions BTW
[18:59:18] cool
[18:59:18] we have found the reload is enough for most of our needs
[18:59:52] will do
[18:59:54] sorry for the noise :)
[18:59:58] no problem
[19:00:00] np at all
[19:00:04] * vgutierrez back to dying in the sofa
[19:00:15] vgutierrez: you shouldn't be here anyway. learn to disconnect
[19:00:17] :P
[19:00:24] got pinged by haproxy
[19:00:31] it's one of my pets ;P
[19:00:36] you ping on haproxy?
[19:00:51] not really... I got a few emails from the alerts
[19:00:52] cdanis: he has at least 20 confirmed words, ballpark
[19:01:01] and it's weird to see haproxy dying unexpectedly
[19:01:13] so that's why I came online
[19:01:31] it's even weirder to see cdanis messing up a configuration BTW
[19:01:37] that's not true at all
[19:01:43] i bang on things until they work
[19:01:51] * vgutierrez steals cdanis hammer
[19:02:16] 😅
[19:03:32] cdanis: I am sure you know but recently we wanted to figure out if the correct certs were reloaded or not and we wanted to do a restart
[19:03:49] so we did echo "show ssl cert" | sudo nc -U /run/haproxy/haproxy.sock
[19:03:59] and for this *I think*
[19:03:59] echo "show acl" | sudo nc -U /run/haproxy/haproxy.sock
[19:04:24] the admin socket API is alright
[19:04:37] yeah
[19:06:21] on old haproxy versions there's sometimes inconsistencies in the output if you use it to dump very large stick-table contents
[19:10:01] that fun I leave it to your hands :)
[23:55:49] 06Traffic, 13Patch-For-Review: Error message says "%error_body_content%" - https://phabricator.wikimedia.org/T371424#10037221 (10BCornwall) 05Open→03In progress p:05Triage→03High a:03CDobbins
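For the stick-table dumps mentioned above, the same runtime socket can be used; a sketch assuming the socket path quoted earlier in the conversation (the table name is a hypothetical example, not one taken from the log):

    # list the stick tables defined in the running config
    echo "show table" | sudo nc -U /run/haproxy/haproxy.sock
    # dump the entries of one table; "httpreqrate" is a made-up example name
    echo "show table httpreqrate" | sudo nc -U /run/haproxy/haproxy.sock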