[02:36:09] FIRING: LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3008 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[02:41:09] RESOLVED: [2x] LVSHighRX: Excessive RX traffic on lvs3008:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:26:52] 06Traffic, 10ops-magru: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10034399 (10Volans)
[09:50:14] 10netops, 06Infrastructure-Foundations, 06SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10034735 (10ayounsi)
[12:52:10] 06Traffic, 10ops-magru: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10035242 (10Fabfur) Do we have any evidence that the disk has not been manually removed/tampered?
[13:08:16] 10netops, 06Infrastructure-Foundations, 06SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10035314 (10cmooney) I've been playing with this a little on Netbox-Next, you can see the data here covering our existing GRE tunnels: https://netbox-next.wikimedia.org/vpn/tunnels/ Initia...
[13:46:53] FIRING: SystemdUnitFailed: wmf_auto_restart_benthos@haproxy_cache.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:52] fabfur: ^
[13:48:25] thanks yes I'm working on that, it's disabled
[15:18:30] 10netops, 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10035781 (10Papaul) @cmooney links removed. You can resolve the task if nothing else needs to be done.
[16:03:58] 10netops, 06Infrastructure-Foundations, 06SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10036015 (10cmooney) After discussing with @ayounsi on irc I've adjusted the approach: https://netbox-next.wikimedia.org/vpn/tunnels/ Principal decisions were: # We will use a group calle...
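As a rough sketch of how a SystemdUnitFailed alert like the one above is typically triaged, assuming only standard systemd tooling (the unit name is taken from the alert itself; the actual procedure is the check_systemd_state runbook linked in the alert):

    # show why the unit flagged by the alert failed, and its recent log output
    sudo systemctl status wmf_auto_restart_benthos@haproxy_cache.service
    sudo journalctl -u wmf_auto_restart_benthos@haproxy_cache.service --since "-1h"
    # list every unit currently in a failed state on the host
    systemctl list-units --state=failed
    # once the cause is handled, clear the failed state so the alert can resolve
    sudo systemctl reset-failed wmf_auto_restart_benthos@haproxy_cache.service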
[18:48:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp2031:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2031&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[18:50:00] hmmmm
[18:51:38] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 208.80.153.224:443 @ cp2031 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:51:53] FIRING: SystemdUnitFailed: haproxy.service on cp2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:04] depooled
[18:52:09] looking at what's up because something is
[18:53:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp2031:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2031&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[18:53:41] What was messing with haproxy?
[18:53:47] still looking
[18:55:57] er
[18:56:03] sorry, I didn't !log apparently
[18:56:13] I was trying something there, and didn't realize it would alert in here
[18:56:13] I guessed looking at the puppetboard :)
[18:56:16] Aug 01 18:46:51 cp2031 haproxy[3638139]: [ALERT] (3638139) : Current worker (2408212) exited with code 143 (Terminated)
[18:56:17] no worries
[18:56:21] it's depooled now fwiw
[18:56:25] ack
[18:56:28] I wasn't sure what was up and hence
[18:56:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 208.80.153.224:443 @ cp2031 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:56:53] RESOLVED: SystemdUnitFailed: haproxy.service on cp2031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:01] hmmm 143 is SIGTERM, isn't it?
[18:57:17] vgutierrez: yeah, I did a systemctl restart with a config that wasn't parsing
[18:57:19] yes
[18:57:29] (while it was depooled)
[18:57:31] cdanis: a reload is enough btw
[18:57:43] and it checks the config syntax for you
[18:57:59] do you happen to know, is a reload enough even if you change stick-table definitions?
[18:58:27] cdanis: a reload ends up spawning a new haproxy process
[18:58:30] so yes
[18:58:34] cdanis: you can confirm with the API even but I think so
[18:58:44] (not that I am the expert in this)
[18:58:56] ah okay cool, does it do the trick where it passes sockets and state to the new process?
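To illustrate the reload path discussed above, a minimal sketch assuming the stock haproxy binary and its systemd unit (the config path is the common Debian default, not necessarily what the cp hosts actually use):

    # parse-check the config first; -c only validates, it does not start a listener
    sudo haproxy -c -f /etc/haproxy/haproxy.cfg
    # reload rather than restart: the master spawns new workers with the new config
    # and hands over the listening sockets, so established connections are not dropped
    sudo systemctl reload haproxy.service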
[18:59:04] yes
[18:59:05] yeah
[18:59:15] that's the default behavior for latest haproxy versions BTW
[18:59:18] cool
[18:59:18] we have found the reload is enough for most of our needs
[18:59:52] will do
[18:59:54] sorry for the noise :)
[18:59:58] no problem
[19:00:00] np at all
[19:00:04] * vgutierrez back to dying in the sofa
[19:00:15] vgutierrez: you shouldn't be here anyway. learn to disconnect
[19:00:17] :P
[19:00:24] got pinged by haproxy
[19:00:31] it's one of my pets ;P
[19:00:36] you ping on haproxy?
[19:00:51] not really... I got a few emails from the alerts
[19:00:52] cdanis: he has at least 20 confirmed words, ballpark
[19:01:01] and it's weird to see haproxy dying unexpectedly
[19:01:13] so that's why I came online
[19:01:31] it's even weirder to see cdanis messing up a configuration BTW
[19:01:37] that's not true at all
[19:01:43] i bang on things until they work
[19:01:51] * vgutierrez steals cdanis hammer
[19:02:16] 😅
[19:03:32] cdanis: I am sure you know but recently we wanted to figure out if the correct certs were reloaded or not and we wanted to do a restart
[19:03:49] so we did echo "show ssl cert" | sudo nc -U /run/haproxy/haproxy.sock
[19:03:59] and for this *I think*
[19:03:59] echo "show acl" | sudo nc -U /run/haproxy/haproxy.sock
[19:04:24] the admin socket API is alright
[19:04:37] yeah
[19:06:21] on old haproxy versions there's sometimes inconsistencies in the output if you use it to dump very large stick-table contents
[19:10:01] that fun I leave it to your hands :)
[23:55:49] 06Traffic, 13Patch-For-Review: Error message says "%error_body_content%" - https://phabricator.wikimedia.org/T371424#10037221 (10BCornwall) 05Open→03In progress p:05Triage→03High a:03CDobbins
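For the stick-table dumps mentioned above, the same runtime socket can be used; a sketch assuming the socket path quoted earlier in the conversation (the table name is a hypothetical example, not one taken from the log):

    # list the stick tables defined in the running config
    echo "show table" | sudo nc -U /run/haproxy/haproxy.sock
    # dump the entries of one table; "httpreqrate" is a made-up example name
    echo "show table httpreqrate" | sudo nc -U /run/haproxy/haproxy.sock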