[07:57:21] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11275363 (cmooney) >>! In T405499#11273763, @ssingh wrote: > FWIW we have typically reimaged for this in the past. I am not suggesting, just sha...
[07:57:25] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11275365 (taavi)
[08:38:59] Traffic, SRE Observability: Package benthos/redpanda for trixie - https://phabricator.wikimedia.org/T407320 (Vgutierrez) NEW
[09:35:53] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11275677 (elukey) SSD firmwares updated on all cp hosts! So at this point we can try to reimage all hosts to trixie. For some reason cp2043 wasn't able to PX...
[10:21:25] Would I be ok to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195679/1 in a couple minutes ?
[10:23:45] +1
[10:24:00] no pending activities AFAIK
[10:24:12] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11275939 (elukey) @Jhancock.wm Hi! So I've reimaged cp2044 with Debian Trixie and everything went fine, we can proceed to reimage the rest with Trixie and see...
[10:24:51] Cool thanks
[10:24:58] I'll go make a coffee and hit that then
[11:21:09] Traffic, DC-Ops, ops-codfw, SRE: Create boot environment of Bullseye with a 6.1 kernel - https://phabricator.wikimedia.org/T405102#11276137 (MoritzMuehlenhoff) >>! In T405102#11273708, @ssingh wrote: > Traffic discussed this in the team meeting today. We decided that given the above blocker,...
[12:25:51] fabfur: I have a followup patch because I actually broke stuff do you have a minute to take a look?
[12:26:03] sure!
[12:26:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196416
[12:26:13] tyvm <3
[12:26:53] 👀
[12:27:43] +1 for me to proceed w/ deploy
[12:28:39] Cool ty <3
[13:10:24] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11276580 (ssingh) >>! In T405499#11275363, @cmooney wrote: >>>! In T405499#11273763, @ssingh wrote: >> FWIW we have typically reimaged for this...
[13:18:18] Traffic, Observability-Alerting: Port DNS icinga checks to Alertmanager - https://phabricator.wikimedia.org/T384425#11276608 (tappof)
[13:22:14] Domains, Traffic, DNS, Patch-For-Review: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11276624 (ssingh) Hi @SCampos-WMF: > I was also looped into a new request today. As part of the birthday initiative, the Fundraising t...
[13:24:29] elukey: thanks for all you and Jenn's work on the cp hosts in codfw
[13:24:38] not sure if you saw it but we decided to go ahead with trixie
[13:24:48] (and thanks to moritz's also for trying to backport the kernel!)
[13:26:09] sukhe: yep saw it! I reimaged one today with Trixie as test, so in theory Jenn should be able to do the rest without issues. The cp2043 host seems requiring some manual-dc-intervention, aside from it we are godo
[13:26:11] *good
[13:26:31] thanks elukey!
[13:26:36] cool. I just _assumed_ since it worked with bookworm that it will work with trixie
[13:26:43] but hardware doesn't work like that :]
[13:26:51] yep yep :D
[13:27:03] cp2044 is ready for inspection if you want to check it
[13:27:28] thanks, we will take a look
[13:27:35] but yeah, our plan so far is to start preparing for the upgrade
[13:27:41] but since we won't get to it on time anyway before the break
[13:27:49] we will most likely provision in January
[13:28:38] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:29:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:29:42] ^ reboots
[13:29:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp6009 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[13:29:51] hmm that needs a check for sure
[13:30:22] na all good
[13:33:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:34:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[13:34:43] RESOLVED: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp6009 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[14:06:26] bblack: hey, do you think you'll have time to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196154 today?
[14:07:21] Traffic, Observability-Logging: Package benthos/redpanda for trixie - https://phabricator.wikimedia.org/T407320#11276943 (colewhite) Do you happen to have a trixie host available that we can try the existing package on?
[14:09:52] netops, Infrastructure-Foundations, Toolforge, tools-infrastructure-team: Plan networking for Toolforge-on-Metal experiment - https://phabricator.wikimedia.org/T407140#11276967 (taavi) p:Triage→High
[14:10:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:15:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:24:20] Traffic, Observability-Logging: Package benthos/redpanda for trixie - https://phabricator.wikimedia.org/T407320#11277083 (ssingh) >>! In T407320#11276943, @colewhite wrote: > Do you happen to have a trixie host available that we can try the existing package on? If you meant //a// trixie host, feel free...
[14:51:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6011 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:56:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6011 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:07:54] fabfur: ^ can you manually downtime that alert please?
[15:08:08] that's actually me
[15:08:09] not him
[15:08:22] oh ok
[15:08:25] and I can but I am also trying to see when/why are they fired. because they are not fired in all cases
[15:08:30] so what causes them to be fired in some but not all
[15:08:41] timing? :)
[15:08:43] is it possible to downtime a specific alert while not fired?
[15:08:51] not a whole host I mean
[15:08:54] that's what the cookbook does :D
[15:08:58] yeah, timing but not sure why though
[15:09:08] sukhe: server reboot differences?
[15:09:22] vgutierrez: but in theory the host should be downtimed for the entirety of event
[15:09:29] so probably _when_ the alert is fired?
[15:10:08] that alert is defined on the host or on the lb?
[15:10:25] sukhe: how's the downtime defined?
[15:10:27] fabfur: in the host
[15:11:13] vgutierrez: we simply do what is in class Runner(SRELBBatchRunnerBase):
[15:11:15] I'm wondering if downtime it's expecting an `instance` label
[15:11:18] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/cdn/roll-reboot.py
[15:11:23] and that alert doesn't have that
[15:11:42] meaning? it's a per-host alert so the entire host should be downtimed?
[15:11:56] sukhe: how the host is downtimed? :)
[15:12:03] how the alert is linked to a specific host?
[15:12:07] by checking the instance label?
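(Editor's note) The hypothesis raised at 15:11:15 is that the downtime matches on an `instance` label that the alert may not carry. The sketch below illustrates that idea only; it is not how the cookbook or Alertmanager is known to be configured here. It emulates the `label_replace` expression pasted at 15:12:34 and an Alertmanager-style regex matcher in plain Python. The port `9100`, the assumption that the alert keeps only `hostname`, and the matcher patterns are all hypothetical.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Emulate PromQL label_replace on one label set (regex is fully anchored)."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: labels are returned unchanged
    out = dict(labels)
    # PromQL writes group references as $1; Python's expand() wants \g<1>.
    out[dst] = match.expand(re.sub(r"\$(\d+)", r"\\g<\1>", replacement))
    return out


def silence_matches(matchers, labels):
    """Every regex matcher must fully match; a missing label counts as ''."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in matchers.items()
    )


# Hypothetical raw series labels; only the expression itself comes from the log.
series = {"instance": "cp6012:9100"}
relabeled = label_replace(series, "hostname", "$1", "instance", "(.*):.*")

# Assumption: the alert ends up carrying `hostname` but not `instance`.
alert_labels = {"alertname": "LVSRealserverMSS", "hostname": relabeled["hostname"]}

# Hypothetical downtime keyed on `instance` versus one keyed on `hostname`.
by_instance = {"instance": r"cp6012(:[0-9]+)?"}
by_hostname = {"hostname": r"cp6012"}

print(relabeled)
print(silence_matches(by_instance, alert_labels))
print(silence_matches(by_hostname, alert_labels))
```

Under those assumptions a silence keyed on `instance` would never apply to the alert, while one keyed on `hostname` would. Whether the real downtime is built this way is exactly what the participants had not yet checked.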
[15:12:34] `label_replace(lvs_realserver_mss_value, "hostname", "$1", "instance", "(.*):.*")`
[15:12:42] probably that's messing with the downtime
[15:12:50] I don't recall offhand, I need to read the base class
[15:18:19] alert created
[15:18:19] if that's the problem it looks like we could replace the label and not rename it: https://w.wiki/FhHZ
[15:18:23] downtiming alert
[15:22:11] unless the downtiming regex expects a `:` on the instance value
[15:23:07] and of course it could be related to when the downtime is removed
[15:23:17] given that puppet needs to run before haproxy is able to start
[15:23:40] I mean the fact that it fires on some hosts and not others, then surely it can't be all the alert itself
[15:23:49] but I will need to look and I can't right now so I will stop guessing :P
[15:24:17] timing... alert requires 3 checks in a row that failed (3 minutes)
[15:32:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:32:45] lol sigh
[15:32:49] Traffic, Observability-Logging: Package benthos/redpanda for trixie - https://phabricator.wikimedia.org/T407320#11277492 (herron) Had a quick chat with @Vgutierrez and I've just copied the package to trixie-wikimedia ` benthos | 4.27.0-1 | trixie-wikimedia | amd64, source `
[15:33:25] and it disappeared
[15:33:34] ugh
[15:33:35] fun, and then when I tried to match it, it gave me nothing
[15:33:47] maybe we should involve o11y
[15:33:54] it starts to smell like a bug
[15:33:58] I will file something later
[15:36:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:37:25] FIRING: [4x] SystemdUnitFailed: haproxy.service on cp6012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:37:48] haproxy failure
[15:37:51] interesting
[15:37:52] hmm we are also getting the haproxy one
[15:38:06] looks fine of course
[15:38:09] so yeah, timing
[15:38:13] so that points to puppet run
[15:38:15] cool I will file something later (in a meeting)
[15:41:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:42:09] the other interesting thing is that last time we did these reboots
[15:42:16] we didn't have this much spam
[15:42:25] RESOLVED: [6x] SystemdUnitFailed: haproxy.service on cp6012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:28] or we just didn't look enough :P
[16:11:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.240:443 @ cp6005 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:13:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6013 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:13:56] I tried. it just doesn't take the silence.
[16:14:38] > No alerts matched
[16:14:38] lol
[16:16:09] Domains, Traffic, DNS, Patch-For-Review: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11277821 (Dzahn) Thank you for the explanation about the Google site, @SCampos-WMF I understand now. It's appreciated.
[16:16:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.240:443 @ cp6005 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:18:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6013 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:40:03] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11278009 (cmooney) @BCornwall @Jclark-ctr provided thinks go ok in the intervening perio...
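(Editor's note) The "timing" explanation from 15:23:07 to 15:24:17 can be made concrete with a small simulation. This is a simplified model, not the real alerting pipeline. The only figure taken from the log is "3 checks in a row (3 minutes)"; the per-minute samples, the silence length, and the function names are made up for illustration.

```python
def firing_minutes(failing, needed=3):
    """Return the minutes at which the alert is firing.

    The alert fires at a minute only if the check failed at that minute
    and at the needed-1 minutes immediately before it.
    """
    firing = []
    streak = 0
    for minute, failed in enumerate(failing):
        streak = streak + 1 if failed else 0
        if streak >= needed:
            firing.append(minute)
    return firing


def notifies(failing, silenced_until, needed=3):
    """True if the alert is firing at any minute the silence no longer covers.

    The silence is modeled as covering minutes [0, silenced_until).
    """
    return any(m >= silenced_until for m in firing_minutes(failing, needed))


F, T = False, True
# Hypothetical reboots: one host recovers quickly, one stays unhealthy longer
# (for example while waiting for a puppet run), one never fails 3 times in a row.
quick = [F, T, T, T, F, F, F, F]
slow = [F, T, T, T, T, T, T, F]
brief = [F, T, T, F, F, F, F, F]

for name, samples in (("quick", quick), ("slow", slow), ("brief", brief)):
    print(name, firing_minutes(samples), notifies(samples, silenced_until=5))
```

In this model only the slow host produces a notification, because it is still failing after the silence has been lifted. That would be consistent with the alert showing up for some reboots and not others, though the log does not confirm this is the actual cause.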
[16:41:27] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11278019 (cmooney) @BCornwall @Jclark-ctr provided thinks go ok in the intervening perio...
[16:48:26] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11278047 (Jclark-ctr) I am good for that day just let me know time in advance
[16:49:14] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11278052 (Jclark-ctr) Good for this day just let me know time
[16:52:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.240:443 @ cp6006 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[16:57:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.240:443 @ cp6006 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:12:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7007:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[17:12:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp6014 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[17:15:01] Traffic, SRE, FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11278220 (Jdlrobson-WMF) p:Triage→High
[17:15:39] Traffic, SRE, FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11278221 (Jdlrobson-WMF)
[17:33:24] Traffic, SRE, FY2025-26 WE3.3 Engaging core audiences: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11278338 (Jdlrobson-WMF)
[17:36:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6015 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:37:06] yeah I guess before the next round, we need to make sure this is fixed
[17:37:12] how? I don't know, I will try again
[17:37:31] I already silenced the entirety of this but it's not enough
[17:37:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp6015 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[17:41:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6015 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:46:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7015 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:47:40] FIRING: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp7007:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[17:51:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7015 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[17:52:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp6015 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[17:53:20] we will soon be done for today's round. and before the next, we will fix this downtiming. sorry about the noise folks
[18:12:13] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11278592 (RobH)
[18:17:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6016 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:18:25] FIRING: SystemdUnitFailed: haproxykafka.service on cp6016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:22:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.224:443 @ cp6016 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=drmrs&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:22:43] FIRING: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp6016 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[18:23:25] RESOLVED: SystemdUnitFailed: haproxykafka.service on cp6016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7016 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[18:32:40] FIRING: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp7007:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[18:34:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.240:443 @ cp7016 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[19:12:10] Traffic, DC-Ops, ops-magru: cp7007 hardware issues after reboot - https://phabricator.wikimedia.org/T407421 (ssingh) NEW
[19:12:40] RESOLVED: VarnishPrometheusExporterDown: Varnish Exporter on instance cp7007:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:12:43] RESOLVED: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp7007 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7007 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[19:13:22] Traffic, DC-Ops, ops-magru: cp7007 hardware issues after reboot - https://phabricator.wikimedia.org/T407421#11278777 (ssingh) a:BCornwall
[19:17:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp7007:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7007 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[19:17:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp7007:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=magru%20prometheus/ops&var-instance=cp7007 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[19:18:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7007 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7007 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[19:18:49] yeah expected, should get better soon (and it's depooled)
[19:20:29] interesting though the haproxykafka message. because a depooled host won't produce any messages anyway so we have to tune that
[19:27:48] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11278811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host cp2045.codfw.wmnet with OS bullseye
[19:31:37] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11278830 (Jhancock.wm)
[19:51:40] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11278889 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host cp2045.codfw.wmnet with OS bullseye executed with er...
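(Editor's note) The remark at 19:20:29, that a depooled host produces no messages and so the `HaproxyKafkaNoMessages` alert needs tuning, amounts to gating the "no messages" condition on pooled state. The snippet below is only a sketch of that logic in Python. The `HostState` type, the rate values, the threshold, and the host names other than cp7007 are hypothetical, and it says nothing about how the real alert rule is written.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    pooled: bool          # whether the host is currently serving traffic
    msgs_per_sec: float   # rate of produced messages (hypothetical metric)


def hosts_to_alert(states, min_rate=0.0):
    """Hosts that are pooled yet produce no more than min_rate messages/s.

    Depooled hosts are skipped: zero messages is expected for them.
    """
    return sorted(
        host
        for host, state in states.items()
        if state.pooled and state.msgs_per_sec <= min_rate
    )


# Hypothetical snapshot; cp7007 is depooled as stated at 19:18:49.
snapshot = {
    "cp7007": HostState(pooled=False, msgs_per_sec=0.0),
    "cp-example-a": HostState(pooled=True, msgs_per_sec=0.0),
    "cp-example-b": HostState(pooled=True, msgs_per_sec=120.0),
}

print(hosts_to_alert(snapshot))
```

With this gating, the depooled cp7007 would not be reported, while a pooled host that has gone silent still would be.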
[23:18:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7007 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7007 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages