[00:09:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075657 (owner: TrainBranchBot)
[00:27:00] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[00:27:32] here, looking
[00:27:48] here, same
[00:27:52] !incidents
[00:27:52] 5280 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[00:27:53] 5279 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[00:27:53] 5278 (RESOLVED) [2x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet)
[00:28:00] !ack 5280
[00:28:01] 5280 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[00:28:02] probably just going to repool eqiad but double-checking first to make sure that's actually that
[00:28:11] sirenbot: congrats on your first mile
[00:28:39] here as well
[00:28:51] lol
[00:28:55] I see a recovery?
[00:29:24] > Issued recovery for rule 'Primary outbound port utilisation over 80% #page' to transport 'alertmanager'
[00:29:44] doesn't seem to have actually registered anywhere though outside of that
[00:30:01] hmmm ... interesting
[00:31:16] in any case, same deal as before - brief spike on cr2-codfw:xe-1/1/1:0, on top of what's already elevated use
[00:31:18] we can keep an eye on it for a bit but IMO I think we can wait
[00:31:20] okay I'm pretty well convinced -- any objections to repooling eqiad? even if this clears, it looks like it'll be fragile overnight
[00:31:22] oh okay
[00:31:29] hm
[00:32:00] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[00:32:13] yeah
[00:32:23] so, this is the second instance of this, and if I (mentally) overlay ulsfo, codfw, eqiad CDN egress peaks, we have maybe 1-2h left
[00:32:26] that took a while though ~5 minutes
[00:32:35] that's a good point.
[00:32:49] also, this will presumably happen tomorrow too, and so on =/
[00:33:10] we haven't started any network maintenance or anything in eqiad yet, right?
[00:33:22] each of these is not terribly impactful (small blip in out-discards), so I'm kind of split on how to proceed
[00:33:24] yeah so I guess: 1) we either pool eqiad and move on 2) we address and look out for specific alerts as they come
[00:33:27] yep
[00:33:28] rzl: not that I'm aware of
[00:33:56] I think the one thing that concerns me a bit is stressing codfw <> eqiad transport links
[00:33:59] I guess the problem is neither of these alerts has been specifically actionable, we're riding right at the edge so we trip the alerts briefly but then they clear on their own
[00:34:00] the other (possibly crazy) thing to do might be to change the alert to page for 90% usage
[00:34:08] since RO services are depooled in eqiad
[00:34:29] if we get a big media scrape or something it'll really knock us over, but in that case all we'd do is repool anyway
[00:34:57] however, the presumably biggest "backend" elephant in the room is uploads, right? and luckily we've already repooled swift in eqiad :)
[00:35:01] oh, I guess we repooled swift so that's not-- yeah
[00:35:10] well, another more compelling example then, lol
[00:35:36] yeah, I'm pretty torn on this one :)
[00:36:11] I do think we should either repool or raise the threshold -- no point in keeping the alert if we're pre-deciding not to do anything
[00:36:34] (there's more nuance than that, around what we'd do if it stayed at 81% for longer, but I think the basic point is still right)
[00:37:56] or, taking a step back -- part of the point of keeping eqiad depooled was to find out if we have the capacity to do it without changing anything, and right now the answer is, no
[00:38:01] so in some sense that experiment is complete
[00:38:19] in a real emergency we wouldn't mind fiddling with the threshold, but we don't actually have to do that right now
[00:38:51] if that's the only concern right now, I would say we repool especially given the time, if nothing else
[00:38:52] if we want to come back tomorrow during the work day, make a thoughtful decision about it, and then depool again, we can always do that too
[00:38:54] (my .1 cents)
[00:40:13] I think the reasoning is solid: either change the alert to be more indicative of clear impact, or go ahead and repool, revisiting tomorrow whether we should go back to codfw-only
[00:40:36] ok then. ok to repool?
[00:40:39] any objections?
[00:40:51] 👍
[00:41:04] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: repooling due to repeated port utilization alerts, T370962]
[00:41:12] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: repooling due to repeated port utilization alerts, T370962]
[00:41:12] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962
[00:41:14] done
[00:41:26] man that cookbook is ncie
[00:41:30] *nice
[00:41:31] thanks sukhe
[00:41:33] sukhe: thank you, was just about to ask you to tag that task :)
[00:41:41] ha
[00:41:54] lemme pull up some transport link graphs
[00:43:36] fancy new (draft) gnmi ones: https://grafana.wikimedia.org/goto/T8jUf8RNg?orgId=1
[00:43:56] wowee
[00:43:59] fancy indeed!
[00:46:48] ah, and cr1-eqiad:et-1/1/2 <> cr1-codfw:et-1/0/2 is back in service
[00:46:55] that was down earlier today
[00:54:27] well, cr2-codfw:xe-1/1/1:0 is running much cooler
[00:54:56] and transport links seem to be well within "this is fine" territory
[00:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:56:42] this is not related :P
[00:56:46] heh
[00:57:56] yeah graphs look good. I think it was the right decision to repool :]
[00:57:57] yeah I don't see anything that still needs doing
[00:58:13] yeah, I think we're in a good spot
[00:58:19] I'm going to write up a quick summary of the current state and drop it in -sre
[00:58:21] we can come back tomorrow and make a plan
[00:58:23] thanks folks, hopefully see you tomorrow :)
[00:58:25] swfrench-wmf: thanks!
[00:58:26] SGTM
[00:58:27] ah thank you
[00:58:33] thank you both!
[01:07:04] (CR) RLazarus: [C:+2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1075638 (https://phabricator.wikimedia.org/T359127) (owner: RLazarus)
[01:17:40] FIRING: KubernetesRsyslogDown: rsyslog on ml-serve2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2005 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:18:36] ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10178169 (ssingh) Apologies for the long text that follows but the TL;DR is that we think that issues in `magru` are not confined to just the CPU on the affected hosts bu...
[01:22:40] RESOLVED: KubernetesRsyslogDown: rsyslog on ml-serve2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2005 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:23:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 862.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:28:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 863.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 812.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:43:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 812.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:59:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:07:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:07:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:43:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:43:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:54:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[03:56:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 898.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:59:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[04:01:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 838.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:01:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:01:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:03:18] ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10178217 (wiki_willy) Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there a...
[04:03:59] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:11:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 913.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:15:59] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 847.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:17:51] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:18:52] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:24:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 827.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:29:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 827.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:58:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T0600)
[06:00:05] marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:17] (PS1) Vgutierrez: varnish: support base64 encoded files in puppet catalog [puppet] - https://gerrit.wikimedia.org/r/1075765
[06:07:48] (PS2) Vgutierrez: varnish: support base64 encoded files in puppet catalog [puppet] - https://gerrit.wikimedia.org/r/1075765
[06:09:07] (PS3) Vgutierrez: varnish: support base64 encoded files in puppet catalog [puppet] - https://gerrit.wikimedia.org/r/1075765
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:33:01] (CR) Muehlenhoff: [C:+2] scap_proxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075507 (owner: Muehlenhoff)
[06:33:10] (CR) Vgutierrez: [C:-2] "do not merge, we aren't looking at the right RSA usage data at the moment. I'll add more context on the phabricator task" [puppet] - https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[06:38:25] (PS1) Muehlenhoff: mail::smarthost: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075774
[06:40:27] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1075607 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[06:43:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:43:07] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 17.2 [puppet] - https://gerrit.wikimedia.org/r/1075775 (https://phabricator.wikimedia.org/T375710)
[06:46:14] (CR) Muehlenhoff: [C:+2] mail::smarthost: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075774 (owner: Muehlenhoff)
[06:47:20] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075775 (https://phabricator.wikimedia.org/T375710) (owner: Jelto)
[06:48:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:48:50] (CR) Jelto: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 17.2 [puppet] - https://gerrit.wikimedia.org/r/1075775 (https://phabricator.wikimedia.org/T375710) (owner: Jelto)
[07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T0700).
[07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:07] !log apt-get clean on grafana2001 to free some space in the root partition
[07:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:28] o/
[07:01:51] here. Let's deploy first patch abijeet
[07:01:58] thanks, kart_
[07:03:28] (CR) Alexandros Kosiaris: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: Hnowlan)
[07:03:39] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro)
[07:03:40] (PS1) Elukey: docker_registry_ha: increase proxy timeouts to 300 (part 2) [puppet] - https://gerrit.wikimedia.org/r/1075779 (https://phabricator.wikimedia.org/T242604)
[07:04:23] (Merged) jenkins-bot: Translate: Add VirtualDomainsMapping [mediawiki-config] - https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro)
[07:04:59] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075521|Translate: Add VirtualDomainsMapping (T372287)]]
[07:05:02] (CR) Slyngshede: [C:+2] Minor UI improvements. [software/bitu] - https://gerrit.wikimedia.org/r/1075550 (https://phabricator.wikimedia.org/T375168) (owner: Slyngshede)
[07:05:04] T372287: Create new translate_message_group_subscriptions table on Wikimedia wikis with the Translate extension installed - https://phabricator.wikimedia.org/T372287
[07:07:44] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1075521|Translate: Add VirtualDomainsMapping (T372287)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:07:51] (Merged) jenkins-bot: Minor UI improvements. [software/bitu] - https://gerrit.wikimedia.org/r/1075550 (https://phabricator.wikimedia.org/T375168) (owner: Slyngshede)
[07:08:08] abijeet: any way to test the patch on mwdebug?
[07:08:20] abijeet: patch is available to test if possible.
[07:08:20] (PS1) Elukey: Set puppet 7 for registry1004 [puppet] - https://gerrit.wikimedia.org/r/1075781 (https://phabricator.wikimedia.org/T332016)
[07:08:22] (PS1) Elukey: Set puppet 7 for registry2004 [puppet] - https://gerrit.wikimedia.org/r/1075782 (https://phabricator.wikimedia.org/T332016)
[07:08:44] (CR) Elukey: [C:+2] docker_registry_ha: increase proxy timeouts to 300 (part 2) [puppet] - https://gerrit.wikimedia.org/r/1075779 (https://phabricator.wikimedia.org/T242604) (owner: Elukey)
[07:09:27] kart_, I'll do a quick sanity check
[07:12:08] (CR) Alexandros Kosiaris: [C:+2] Add role for wikikube-worker1240-1304 [puppet] - https://gerrit.wikimedia.org/r/1075576 (https://phabricator.wikimedia.org/T369744) (owner: Alexandros Kosiaris)
[07:13:08] kart_, looks good, we can continue
[07:13:21] sure
[07:13:25] !log kartik@deploy1003 abi, kartik: Continuing with sync
[07:16:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[07:16:42] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1075556 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[07:18:24] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075521|Translate: Add VirtualDomainsMapping (T372287)]] (duration: 13m 25s)
[07:18:29] T372287: Create new translate_message_group_subscriptions table on Wikimedia wikis with the Translate extension installed - https://phabricator.wikimedia.org/T372287
[07:18:38] (CR) Alexandros Kosiaris: [C:+2] Revert "Temporarily disable stunnel for the Puppet 7 migration of deployment hosts" [puppet] - https://gerrit.wikimedia.org/r/1072754 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:19:16] (CR) Elukey: [C:+2] Set puppet 7 for registry1004 [puppet] - https://gerrit.wikimedia.org/r/1075781 (https://phabricator.wikimedia.org/T332016) (owner: Elukey)
[07:19:43] akosiaris: shall I merge yours too?
[07:19:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1034.eqiad.wmnet
[07:20:18] abijeet_: first patch done. Should we go second one now?
[07:20:30] elukey: ah go ahead, thanks
[07:20:33] ack!
[07:20:50] (PS1) Muehlenhoff: Switch cloudcephosd1034 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075834 (https://phabricator.wikimedia.org/T349619)
[07:20:57] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry1004.eqiad.wmnet with OS bookworm
[07:21:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[07:22:43] kart_, yes
[07:22:47] cool.
[07:22:59] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075522 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:23:50] (Merged) jenkins-bot: Revert^2 "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075522 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:24:13] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075522|Revert^2 "Enable message group subscription feature for Test Wikipedia" (T372386)]]
[07:24:19] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[07:24:33] FIRING: KubernetesCalicoDown: wikikube-worker1254.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1254.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:25:01] that's expected ^
[07:25:16] nodes are coming into service over the next 20 minutes or so
[07:26:21] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1075522|Revert^2 "Enable message group subscription feature for Test Wikipedia" (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:26:32] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:27:24] abijeet_: patch ready to test
[07:27:56] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:28:22] kart_, thanks, checking
[07:28:58] (I can see Watch button!)
[07:29:33] FIRING: [11x] KubernetesCalicoDown: wikikube-worker1250.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:29:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:30:00] kart_, yup. Looks good I think. Give me 1 more minute.
[07:30:03] FIRING: [11x] KubernetesCalicoDown: wikikube-worker1250.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:30:09] Sure
[07:30:18] FIRING: [11x] KubernetesCalicoDown: wikikube-worker1250.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:30:23] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1074405 (https://phabricator.wikimedia.org/T374366) (owner: JMeybohm)
[07:31:02] kart_, ok to go ahead.
[07:31:24] (CR) Brouberol: Enable the usage of the Kubernetes executor (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075165 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[07:31:57] (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1034 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075834 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:32:09] (PS1) Slyngshede: C:idm disable automatic Django localization. [puppet] - https://gerrit.wikimedia.org/r/1075840
[07:32:39] Nice!
[07:32:43] !log kartik@deploy1003 kartik, abi: Continuing with sync
[07:32:48] (CR) JMeybohm: [C:+2] ferm: Make reload via ferm-status the default [puppet] - https://gerrit.wikimedia.org/r/1074405 (https://phabricator.wikimedia.org/T374366) (owner: JMeybohm)
[07:33:25] (PS2) Slyngshede: P:idp: On test host behind the load balancer, avoid exposing port 8080. [puppet] - https://gerrit.wikimedia.org/r/1073460
[07:34:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:34:33] FIRING: [16x] KubernetesCalicoDown: wikikube-worker1250.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:34:53] (CR) Slyngshede: P:idp: On test host behind the load balancer, avoid exposing port 8080. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1073460 (owner: Slyngshede)
[07:34:56] FIRING: SystemdUnitFailed: rsync-patches_module.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudcephosd1034.eqiad.wmnet
[07:35:25] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:36:06] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Connect - Telxius, AS12956/IPv4: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:36:46] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1073460 (owner: Slyngshede)
[07:36:53] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry1004.eqiad.wmnet with reason: host reimage
[07:37:12] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on registry1004.eqiad.wmnet with reason: host reimage
[07:37:50] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075522|Revert^2 "Enable message group subscription feature for Test Wikipedia" (T372386)]] (duration: 13m 37s)
[07:37:56] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[07:38:22] abijeet_: done.
[07:38:42] kart_, thanks
[07:38:54] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqia
[07:38:54] 01/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:39:33] FIRING: [17x] KubernetesCalicoDown: wikikube-worker1250.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:39:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:40:18] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:40:54] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqia
[07:40:54] 01/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:40:55] (PS1) Muehlenhoff: Don't pass ferm_status_restart in firewall class [puppet] - https://gerrit.wikimedia.org/r/1075841
[07:40:55] !log dcaro@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Run failed when reimaging cloudcephosd1039 and asked to run manually - dcaro@cumin1002 - T372814"
[07:41:00] !log dcaro@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Run failed when reimaging cloudcephosd1039 and asked to run manually - dcaro@cumin1002 - T372814"
[07:41:01] T372814: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814
[07:41:06] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Arelion, AS1299/IPv6: Connect - Arelion https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:41:08] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad
[07:41:08] 1/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:41:32] (PS2) JMeybohm: Don't pass ferm_status_restart in firewall class [puppet] - https://gerrit.wikimedia.org/r/1075841 (owner: Muehlenhoff)
[07:41:54] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqia
[07:41:54] 01/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:42:14] (PS3) JMeybohm: Don't pass ferm_status_restart in firewall class [puppet] - https://gerrit.wikimedia.org/r/1075841 (https://phabricator.wikimedia.org/T374366) (owner: Muehlenhoff)
[07:42:32] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 165 probes of 787 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:42:32] (CR) JMeybohm: [C:+1] Don't pass ferm_status_restart in firewall class [puppet] - https://gerrit.wikimedia.org/r/1075841 (https://phabricator.wikimedia.org/T374366) (owner: Muehlenhoff)
[07:42:46] (CR) Muehlenhoff: [C:+2] Don't pass ferm_status_restart in firewall class [puppet] - https://gerrit.wikimedia.org/r/1075841 (https://phabricator.wikimedia.org/T374366) (owner: Muehlenhoff)
[07:42:54] (CR) Slyngshede: [C:+2] P:idp: On test host behind the load balancer, avoid exposing port 8080.
[puppet] - 10https://gerrit.wikimedia.org/r/1073460 (owner: 10Slyngshede) [07:43:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [07:43:40] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [07:43:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:44:06] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:44:56] RESOLVED: SystemdUnitFailed: rsync-patches_module.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1034.eqiad.wmnet [07:48:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:48:42] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 85.45 ms [07:49:09] (03CR) 10Vgutierrez: [C:03+2] varnish: support base64 encoded files in puppet catalog [puppet] - 10https://gerrit.wikimedia.org/r/1075765 (owner: 10Vgutierrez) [07:49:33] FIRING: [7x] KubernetesCalicoDown: wikikube-worker1254.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:50:22] (03PS1) 10Jelto: sretest: test defs_from_etcd with new separate sets [puppet] - 10https://gerrit.wikimedia.org/r/1075842 (https://phabricator.wikimedia.org/T348734) [07:50:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1034.eqiad.wmnet [07:52:32] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 7 probes of 787 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:54:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1033.eqiad.wmnet [07:55:23] (03PS1) 10David Caro: cloudcephosd1040/41: force puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075844 (https://phabricator.wikimedia.org/T372814) [07:55:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:55:32] (03PS1) 10Muehlenhoff: Switch cloudcephosd1033 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075845 (https://phabricator.wikimedia.org/T349619) [07:55:33] (03CR) 10David Caro: [C:03+2] cloudcephosd1040/41: force puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075844 (https://phabricator.wikimedia.org/T372814) (owner: 10David Caro) [07:57:04] (03PS4) 10Alexandros Kosiaris: Add parsoidtest1001 preseed and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) [07:58:04] (03PS4) 10Alexandros Kosiaris: Switch scandium references to parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) [07:58:18] RECOVERY - BGP status on 
lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:58:31] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [07:58:42] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10178399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bul... [07:59:33] FIRING: [6x] KubernetesCalicoDown: wikikube-worker1266.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:00:05] brennen and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T0800). 
[08:02:13] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1033 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075845 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:04:33] RESOLVED: [6x] KubernetesCalicoDown: wikikube-worker1266.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:06:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1033.eqiad.wmnet [08:07:25] (03PS1) 10David Caro: cloudcephosd1025: take out of the pool [puppet] - 10https://gerrit.wikimedia.org/r/1075846 (https://phabricator.wikimedia.org/T348643) [08:07:40] (03CR) 10David Caro: [C:03+2] cloudcephosd1025: take out of the pool [puppet] - 10https://gerrit.wikimedia.org/r/1075846 (https://phabricator.wikimedia.org/T348643) (owner: 10David Caro) [08:07:43] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: connect to address 10.64.32.143 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [08:08:03] PROBLEM - Docker registry HTTPS interface certificate expiry on registry1004 is CRITICAL: connect to address 10.64.32.143 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [08:08:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1032.eqiad.wmnet [08:09:43] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Docker [08:09:49] RECOVERY - Docker registry HTTPS interface certificate expiry on registry1004 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Thu 24 Oct 2024 07:49:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Docker [08:09:56] FIRING: [2x] SystemdUnitFailed: docker-registry-ha-jwt.service on registry1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host registry1004.eqiad.wmnet with OS bookworm [08:12:45] (03PS1) 10Muehlenhoff: Switch cloudcephosd1032 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075847 (https://phabricator.wikimedia.org/T349619) [08:13:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075634 (https://phabricator.wikimedia.org/T374861) (owner: 10C. Scott Ananian) [08:13:44] (03CR) 10Muehlenhoff: [C:03+1] "Thus sounds like a gute Idee." 
[puppet] - 10https://gerrit.wikimedia.org/r/1075840 (owner: 10Slyngshede) [08:14:02] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1032 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075847 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:14:56] RESOLVED: [2x] SystemdUnitFailed: docker-registry-ha-jwt.service on registry1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:30] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [08:17:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1075842 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [08:19:19] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1035.eqiad.wmnet [08:19:42] (03PS2) 10Jelto: sretest: test defs_from_etcd with new separate sets [puppet] - 10https://gerrit.wikimedia.org/r/1075842 (https://phabricator.wikimedia.org/T348734) [08:20:01] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [08:20:20] (03CR) 10Jelto: sretest: test defs_from_etcd with new separate sets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075842 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [08:20:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1032.eqiad.wmnet [08:21:01] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, 
interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:14] (03PS1) 10Vgutierrez: varnish: Prepare for KA field on X-Connection-Properties [puppet] - 10https://gerrit.wikimedia.org/r/1075849 (https://phabricator.wikimedia.org/T375711) [08:21:15] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:21:22] (03CR) 10Alexandros Kosiaris: [C:03+2] Add parsoidtest1001 preseed and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [08:21:32] (03PS2) 10Vgutierrez: varnish: Prepare for KA field on X-Connection-Properties [puppet] - 10https://gerrit.wikimedia.org/r/1075849 (https://phabricator.wikimedia.org/T375711) [08:22:03] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:09] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:43] (03CR) 10Muehlenhoff: Add parsoidtest1001 preseed and site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [08:23:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:25:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1031.eqiad.wmnet [08:27:54] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1035.eqiad.wmnet [08:28:13] (03PS1) 10Muehlenhoff: Switch cloudcephosd1031 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075851 (https://phabricator.wikimedia.org/T349619) [08:29:02] (03PS1) 10Alexandros Kosiaris: Add lsw1-{e,f}{6,7} to common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075852 (https://phabricator.wikimedia.org/T369743) [08:30:04] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1031 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075851 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:32:26] (03PS2) 10Alexandros Kosiaris: Add lsw1-{e,f}{6,7} to common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075852 (https://phabricator.wikimedia.org/T369744) [08:32:54] (03CR) 10Ayounsi: [C:03+1] Add lsw1-{e,f}{6,7} to common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075852 (https://phabricator.wikimedia.org/T369744) (owner: 10Alexandros Kosiaris) [08:33:33] (03CR) 10Vgutierrez: "varnish XCP tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1075849 (https://phabricator.wikimedia.org/T375711) (owner: 10Vgutierrez) [08:34:56] (03CR) 10Alexandros Kosiaris: [C:03+2] Add parsoidtest1001 preseed and site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [08:35:50] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1036.eqiad.wmnet [08:36:51] (03CR) 10Alexandros Kosiaris: [C:03+2] Add lsw1-{e,f}{6,7} to common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075852 (https://phabricator.wikimedia.org/T369744) (owner: 10Alexandros 
Kosiaris) [08:37:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075842 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [08:38:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1031.eqiad.wmnet [08:40:08] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Add lsw1-{e,f}{6,7} to common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075852 (https://phabricator.wikimedia.org/T369744) (owner: 10Alexandros Kosiaris) [08:40:10] (03Merged) 10jenkins-bot: Add lsw1-{e,f}{6,7} to common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075852 (https://phabricator.wikimedia.org/T369744) (owner: 10Alexandros Kosiaris) [08:42:43] (03PS1) 10Vgutierrez: haproxy: Fix RSA usage reporting for TLSv1.3 [puppet] - 10https://gerrit.wikimedia.org/r/1075853 (https://phabricator.wikimedia.org/T375711) [08:44:03] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1036.eqiad.wmnet [08:44:19] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1037.eqiad.wmnet [08:46:33] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:46:39] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:46:40] (03CR) 10Slyngshede: [C:03+2] C:idm disable automatic Django localization. [puppet] - 10https://gerrit.wikimedia.org/r/1075840 (owner: 10Slyngshede) [08:47:17] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [08:47:18] (03CR) 10Elukey: [C:03+2] Set puppet 7 for registry2004 [puppet] - 10https://gerrit.wikimedia.org/r/1075782 (https://phabricator.wikimedia.org/T332016) (owner: 10Elukey) [08:47:36] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:48:21] !log deploy calico on all clusters to pick up the configuration for the lsw1-{e,f}{6,7} leaf switches. 
It is a noop in some clusters, but the config should be in-sync anyways [08:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:27] (03PS1) 10Abijeet Patro: Load styles for legacy message box markup [extensions/Translate] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075854 (https://phabricator.wikimedia.org/T375696) [08:49:05] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry2004.codfw.wmnet with OS bookworm [08:49:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Translate] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075854 (https://phabricator.wikimedia.org/T375696) (owner: 10Abijeet Patro) [08:50:12] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:50:23] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:50:24] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:50:50] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:50:52] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [08:51:04] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [08:51:05] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [08:51:33] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [08:51:34] !log akosiaris@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [08:52:01] !log akosiaris@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [08:52:03] !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:52:14] !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:52:15] !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [08:52:24] !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:52:32] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1037.eqiad.wmnet [08:53:20] (03PS1) 10Elukey: conftool-data: remove registry[12]003 from docker-registry's cfg [puppet] - 10https://gerrit.wikimedia.org/r/1075855 (https://phabricator.wikimedia.org/T332016) [08:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:58:55] !log dcaro@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [08:59:01] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10178564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye executed with errors... 
[08:59:23] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [08:59:30] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10178568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye [09:03:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:08:19] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry2004.codfw.wmnet with reason: host reimage [09:08:21] (03CR) 10Btullis: [C:03+1] "Great!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075165 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [09:09:41] (03CR) 10Brouberol: [C:03+2] Enable the usage of the Kubernetes executor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075165 (https://phabricator.wikimedia.org/T364389) (owner: 10Brouberol) [09:09:55] (03PS1) 10Vgutierrez: varnish: Use XCP KA field to report TLS auth data [puppet] - 10https://gerrit.wikimedia.org/r/1075858 (https://phabricator.wikimedia.org/T375711) [09:10:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:13] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) [09:10:34] (03CR) 10CI reject: [V:04-1] openstack: 
keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [09:11:05] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:11:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075855 (https://phabricator.wikimedia.org/T332016) (owner: 10Elukey) [09:11:31] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1030.eqiad.wmnet [09:11:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry2004.codfw.wmnet with reason: host reimage [09:12:27] PROBLEM - Host wikikube-worker1280 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:37] (03PS1) 10Muehlenhoff: Switch cloudcephosd1030 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075860 (https://phabricator.wikimedia.org/T349619) [09:13:15] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1030 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075860 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:14:57] RECOVERY - Host wikikube-worker1280 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:17:37] (03PS1) 10Alexandros Kosiaris: parsoidtest: Remove duplicate role assignment [puppet] - 10https://gerrit.wikimedia.org/r/1075863 (https://phabricator.wikimedia.org/T363399) [09:17:59] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [09:18:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:19:00] (03PS1) 10Brouberol: airflow: set webserver.base_url to fix Datahub->Airflow backlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075864 (https://phabricator.wikimedia.org/T375713) [09:19:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1030.eqiad.wmnet [09:21:05] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [09:21:24] !log installing exim security updates [09:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:11] (03PS2) 10Brouberol: airflow: set webserver.base_url to fix Datahub->Airflow backlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075864 (https://phabricator.wikimedia.org/T375713) [09:22:29] (03PS2) 10Vgutierrez: varnish: Use XCP KA field to report TLS auth data [puppet] - 10https://gerrit.wikimedia.org/r/1075858 (https://phabricator.wikimedia.org/T375711) [09:24:56] (03CR) 10Vgutierrez: "varnish XCP tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1075858 (https://phabricator.wikimedia.org/T375711) (owner: 10Vgutierrez) [09:26:22] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075864 (https://phabricator.wikimedia.org/T375713) (owner: 10Brouberol) [09:26:46] (03CR) 10Brouberol: [C:03+2] airflow: set webserver.base_url to fix Datahub->Airflow backlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075864 (https://phabricator.wikimedia.org/T375713) (owner: 10Brouberol) [09:27:53] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host registry2004.codfw.wmnet with OS bookworm [09:28:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:30:18] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=registry1003.eqiad.wmnet,service=docker-registry,dc=eqiad [09:30:27] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=registry2003.codfw.wmnet,service=docker-registry,dc=eqiad [09:30:39] (03CR) 10Elukey: [C:03+2] conftool-data: remove registry[12]003 from docker-registry's cfg [puppet] - 10https://gerrit.wikimedia.org/r/1075855 (https://phabricator.wikimedia.org/T332016) (owner: 10Elukey) [09:32:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:32:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:32:47] (03CR) 10Vgutierrez: [C:03+2] varnish: Prepare for KA field on X-Connection-Properties [puppet] - 10https://gerrit.wikimedia.org/r/1075849 (https://phabricator.wikimedia.org/T375711) (owner: 10Vgutierrez) [09:33:05] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts registry1003.eqiad.wmnet [09:35:56] (03CR) 10Arturo Borrero Gonzalez: 
[C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1075609 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [09:36:23] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) [09:37:15] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [09:37:19] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:38:10] (03CR) 10CI reject: [V:04-1] openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [09:38:53] (03PS1) 10Muehlenhoff: docker_registry: Drop support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1075871 [09:39:22] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075872 [09:40:18] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [09:40:24] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10178711 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye completed: - cloudce... 
[09:40:50] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: registry1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [09:41:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: registry1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [09:41:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:41:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry1003.eqiad.wmnet [09:41:54] (03PS1) 10Slyngshede: P:idm Assign CloudIdm to Infrastructure Foundation. [puppet] - 10https://gerrit.wikimedia.org/r/1075873 (https://phabricator.wikimedia.org/T375723) [09:42:25] (03CR) 10FNegri: [C:03+1] P:idm Assign CloudIdm to Infrastructure Foundation. [puppet] - 10https://gerrit.wikimedia.org/r/1075873 (https://phabricator.wikimedia.org/T375723) (owner: 10Slyngshede) [09:43:09] PROBLEM - Disk space on phab1004 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=phab1004&var-datasource=eqiad+prometheus/ops [09:43:11] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts registry2003.codfw.wmnet [09:43:37] (03CR) 10Slyngshede: [C:03+2] P:idm Assign CloudIdm to Infrastructure Foundation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1075873 (https://phabricator.wikimedia.org/T375723) (owner: 10Slyngshede) [09:43:50] (03PS8) 10EoghanGaffney: lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1072247 [09:44:19] (03CR) 10EoghanGaffney: [C:03+1] site: remove vrts1001 & vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/1075622 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [09:44:38] mvolz: can we delay like an hour? Something came up and I have to pick up the kid from school [09:45:00] (03PS1) 10Muehlenhoff: docker-registry: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1075874 (https://phabricator.wikimedia.org/T329529) [09:47:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075874 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [09:47:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075871 (owner: 10Muehlenhoff) [09:47:26] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:48:52] (03PS1) 10Brouberol: airflow: enable basic auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 [09:48:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:49:35] (03PS2) 10Brouberol: airflow: enable basic auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [09:50:31] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [09:50:42] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: registry2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [09:50:52] (03PS1) 10Muehlenhoff: Switch docker_registry to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1075876 (https://phabricator.wikimedia.org/T349619) [09:50:54] (03CR) 10Elukey: [C:03+1] "pcc seems to fail probably because the pcc facts are not up-to-date, so LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1075874 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [09:51:25] (03CR) 10Elukey: [C:03+1] Switch docker_registry to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1075876 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:51:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: registry2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [09:51:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:51:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry2003.codfw.wmnet [09:52:05] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye [09:52:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10178743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye [09:54:46] (03CR) 10Btullis: [C:03+1] admin-ng: add airflow namespaces to dse-k8s-eqiad 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [09:59:12] (03CR) 10Btullis: "I wonder if we should try to ensure that API access is only permitted from internal addresses. Are we weakening the security by making a p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1000) [10:01:53] PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:02:07] akosiaris: np [10:02:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075863 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [10:02:23] sorry, wasn't around, also dropping kid off at school! [10:02:44] (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) [10:03:09] RECOVERY - Disk space on phab1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=phab1004&var-datasource=eqiad+prometheus/ops [10:03:44] (03CR) 10Brouberol: "That's a good question. The issue with OIDC is that it requires a human being involved in the login flow, and we're dealing with API acces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [10:04:33] (03CR) 10CI reject: [V:04-1] openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [10:06:21] (03CR) 10Gmodena: mw-page-content-change-enrich: enable calico network policies. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) (owner: 10Gmodena) [10:06:55] RECOVERY - Host cr3-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.62 ms [10:09:07] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1038.eqiad.wmnet [10:09:53] (03PS1) 10Elukey: profile::docker::reporter: use the docker-registry's internal endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1075879 (https://phabricator.wikimedia.org/T348876) [10:11:28] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage [10:11:33] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4131/console" [puppet] - 10https://gerrit.wikimedia.org/r/1075879 (https://phabricator.wikimedia.org/T348876) (owner: 10Elukey) [10:12:02] (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix RSA usage reporting for TLSv1.3 [puppet] - 10https://gerrit.wikimedia.org/r/1075853 (https://phabricator.wikimedia.org/T375711) (owner: 10Vgutierrez) [10:12:52] (03CR) 10Elukey: [C:03+1] docker_registry: Drop support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1075871 (owner: 10Muehlenhoff) [10:15:21] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage [10:17:06] (03CR) 10Ebrahim: "Makes sense, done, https://meta.wikimedia.org/wiki/Meta:Babel#c-Ebrahim-20240926101000-Ebrahim-20240913041900 and thanks for bearing with " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [10:17:32] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1038.eqiad.wmnet [10:18:56] (03CR) 10Muehlenhoff: [C:03+2] docker_registry: Drop support for pre Bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1075871 (owner: 10Muehlenhoff) [10:21:08] !log start dry run of docker distribution GC on registry1004 (info in https://phabricator.wikimedia.org/T375645#10176397, you can find a root tmux session named as the task on the host to stop) [10:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:10] (03PS1) 10Vgutierrez: haproxy: Fix XCP value syntax [puppet] - 10https://gerrit.wikimedia.org/r/1075880 (https://phabricator.wikimedia.org/T375711) [10:22:47] (03CR) 10Muehlenhoff: [C:03+2] Switch docker_registry to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1075876 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:24:15] (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix XCP value syntax [puppet] - 10https://gerrit.wikimedia.org/r/1075880 (https://phabricator.wikimedia.org/T375711) (owner: 10Vgutierrez) [10:24:24] (03Abandoned) 10Muehlenhoff: Default nginx::profile to light flavour [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:27:14] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10178834 (10MoritzMuehlenhoff) [10:27:29] (03CR) 10Muehlenhoff: [C:03+2] docker-registry: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1075874 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [10:30:39] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:33:29] !log dcaro@cumin1002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye [10:33:36] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10178851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye completed: - cloudce... [10:34:06] (03PS1) 10Muehlenhoff: Add library hint for gsl [puppet] - 10https://gerrit.wikimedia.org/r/1075881 [10:34:06] (03PS1) 10Muehlenhoff: docker_registry: Fix package dependency [puppet] - 10https://gerrit.wikimedia.org/r/1075882 (https://phabricator.wikimedia.org/T329529) [10:34:35] (03PS2) 10Muehlenhoff: docker_registry: Fix package dependency [puppet] - 10https://gerrit.wikimedia.org/r/1075882 (https://phabricator.wikimedia.org/T329529) [10:35:54] (03CR) 10Muehlenhoff: [C:03+2] docker_registry: Fix package dependency [puppet] - 10https://gerrit.wikimedia.org/r/1075882 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [10:40:12] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:40:12] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:41:04] <_joe_> why is a cloud router paging us? [10:41:23] <_joe_> arturo, dcaro, dhinus any idea what's causing that? 
[10:41:34] <_joe_> also cc XioNoX topranks [10:41:43] <_joe_> !incidents [10:41:44] 5281 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [10:41:44] 5282 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [10:41:44] 5280 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [10:41:44] 5279 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [10:41:44] 5278 (RESOLVED) [2x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [10:42:01] dcaro: that's likely ongoing Ceph maintenance related, right? [10:42:02] _joe_: I have no idea why it is paging. I have not set that alert [10:42:18] yes, we are doing ceph maintenance [10:42:30] <_joe_> arturo: my question was if you know what could cause high ports utilization, and yeah ceph seems a likely culprit [10:42:45] we are currently rebalancing the cluster [10:43:01] this is the first time I see this alert, which is otherwise welcome, because we had problems in the past about port saturations [10:43:02] <_joe_> I'd like to hear from XioNoX or topranks if we can just "not worry" about potential issues for the rest of the infra [10:43:08] but I don't know why it is paging you [10:43:31] Cathal is out this week [10:43:32] <_joe_> arturo: that's ok, the port utilization alert by default pages the main SRE team, we clearly didn't separate alerts [10:43:40] <_joe_> moritzm: ack [10:43:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:58] On the cloudsw we’ve seen this before due to large ceph dataflows
[10:44:23] <_joe_> topranks: sorry if you're out, I didn't want to ping you [10:44:28] the ceph network in cloudsw should be a completely different network circuit, not affecting the rest of the infra [10:44:36] the current flows are caused by the https://phabricator.wikimedia.org/T348643 work [10:44:51] I think we can safely ignore - if WMCS can confirm that that is indeed the cause and the traffic is expected [10:44:51] np [10:44:59] <_joe_> ack! [10:45:03] <_joe_> thanks <3 [10:45:12] RESOLVED: Primary outbound port utilisation over 80% #page: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:45:12] RESOLVED: Primary inbound port utilisation over 80% #page: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:45:19] the longer term mitigation will be https://phabricator.wikimedia.org/T371501 [10:45:39] <_joe_> moritzm: do you think we should downtime that specific alert then? [10:46:28] I think so, but I can't really tell for how long, let's wait for dcaro to give an ETA [10:47:04] <_joe_> ah yeah that was gonna be my next question [10:48:05] this happened another time while doing similar Ceph rebalancing, and IIRC it did not cause issues to the rest of the infra [10:48:16] I don't think ceph maintenance is going to stop any time soon. 
We will be shuffling data for a while (days, weeks, even months) [10:48:28] I think the rebalancing will continue for potentially a long time yes [10:48:30] so maybe we should figure out how to separate the alert [10:49:56] looking at those alerts - the usage on cr2-codfw will not be due to cloud ceph usage, we should look at that [10:49:57] <_joe_> so I have a philosophical question: if it runs for so long, can we call it maintenance? :) [10:50:26] <_joe_> I think that's from last night? [10:50:51] <_joe_> and it's kinda expected we might have higher traffic in codfw right now that eqiad is fully depooled [10:51:06] and yes arturo we should separate, and probably think about how to perhaps ignore saturation if the usage is in the scavenger class [10:51:06] _joe_: sry yeah I scrolled too far back ignore that [10:51:22] (03PS1) 10Muehlenhoff: docker_registry: Fix one more occurance of a nginx-common dep [puppet] - 10https://gerrit.wikimedia.org/r/1075884 [10:51:52] ack [10:52:54] topranks: iirc the alerts were setup on librenms, and you had a task somewhere to give that a look? XD probably wig the move to gnmi we can move those to alert manager [10:52:57] (03CR) 10Muehlenhoff: [C:03+2] docker_registry: Fix one more occurance of a nginx-common dep [puppet] - 10https://gerrit.wikimedia.org/r/1075884 (owner: 10Muehlenhoff) [10:54:10] dcaro: yep, and the gnmi stats give us the per-qos class usage too which we might be able to factor in [10:55:18] (03PS1) 10Gmodena: EventStreamConfig: remove topic prefixes from dumps stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) [10:56:49] (03CR) 10CI reject: [V:04-1] EventStreamConfig: remove topic prefixes from dumps stream. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) (owner: 10Gmodena) [10:56:55] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for gsl [puppet] - 10https://gerrit.wikimedia.org/r/1075881 (owner: 10Muehlenhoff) [10:57:55] (03PS2) 10Gmodena: EventStreamConfig: remove topic prefixes from dumps stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) [10:58:54] !log prune now obsolete nginx packages from docker-registry hosts T329529 [10:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:00] T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 [11:01:10] akosiaris: everything okay? or should we postpone again? [11:01:21] Btw. The eta is until end next week, I'm adding 3 new osd nodes [11:03:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:03:37] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:04:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:04:37] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:04:55] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - 
https://phabricator.wikimedia.org/T329529#10178892 (10MoritzMuehlenhoff) [11:06:01] (03PS3) 10Vgutierrez: varnish: Use XCP KA field to report TLS auth data [puppet] - 10https://gerrit.wikimedia.org/r/1075858 (https://phabricator.wikimedia.org/T375711) [11:06:03] mvolz: I am around, thanks for accomodating [11:06:05] we can start [11:06:37] I definitely know how it is! I'll +2 [11:06:44] (03CR) 10Mvolz: [C:03+2] Revert^2 "Update Zotero to node 18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075517 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [11:07:52] (03Merged) 10jenkins-bot: Revert^2 "Update Zotero to node 18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075517 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [11:08:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:09:26] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [11:09:28] (03CR) 10Vgutierrez: [C:03+2] varnish: Use XCP KA field to report TLS auth data [puppet] - 10https://gerrit.wikimedia.org/r/1075858 (https://phabricator.wikimedia.org/T375711) (owner: 10Vgutierrez) [11:09:59] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:10:45] ok, so I'm getting internal server error on staging with the update [11:10:55] dcaro: we can also make the librenms alert about cloud switches only notify WMCS (and/or netops), right now we just put them all in the same pool [11:11:24] * akosiaris looking [11:11:29] interestingly, the search end point works for the isbn [11:11:37] (03PS1) 10Slyngshede: data.yaml: Extend aitolkyn to December 3rd. 
[puppet] - 10https://gerrit.wikimedia.org/r/1075887 [11:11:53] but the query for the probe fails. [11:12:12] XioNoX: sounds good to me too, whichever you prefer [11:12:53] (03CR) 10Sergio Gimeno: Drop support for the old impact variant (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) (owner: 10Cyndywikime) [11:12:55] curl returns internal server error after 15s I see [11:13:05] akosiaris: aha, example.com works. It's only when we try to scrape wikipedia that it fails. could this be some sort of internal ip shenanigans? [11:13:08] dcaro: that can be done quickly, porting the same kind of alerting to gnmi is going to take longer [11:13:15] curl -k -d 'http://www.example.com' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web [11:13:17] works [11:13:27] search works immediately I see as well. less than 800ms [11:13:30] curl -k -d 'http://www.example.com' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web [11:13:32] fails [11:13:34] err [11:13:46] curl -k -d 'https://en.wikipedia.org/wiki/Darth_Vader' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web [11:13:48] fails [11:14:49] hmmm [11:14:59] I wonder if I can dupe at home using localhost. [11:15:13] time curl -k -d 'https://www.politico.com' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web [11:15:13] Internal Server Error [11:15:13] real 0m0.284s [11:15:22] immediate failure [11:15:31] the wikipedia article takes 15s though [11:15:57] nytimes.com works 2.43s [11:16:18] cnn.com works, 2.635s [11:16:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075887 (owner: 10Slyngshede) [11:16:52] 2 Greek sites I tested, one returned internal server error at 759ms and the other one success at 4.3s [11:17:16] ip shenanigans that only affect requests to same ip?
[11:17:19] (03CR) 10Slyngshede: [C:03+2] data.yaml: Extend aitolkyn to December 3rd. [puppet] - 10https://gerrit.wikimedia.org/r/1075887 (owner: 10Slyngshede) [11:17:42] doesn't look like it. In every case those requests go via the urldownloader, they share the ip [11:17:53] huh [11:17:57] I think the wikipedia.org one might be the exception, lemme doublecheck [11:18:15] NO_PROXY: wikipedia.org,wiktionary.org,wikiquote.org,wikibooks.org,wikiquote.org,wikinews.org,wikisource.org,wikiversity.org,wikivoyage.org,www.wikidata.org,meta.wikimedia.org,commons.wikimedia.org,www.mediawiki.org [11:18:20] (03PS4) 10Arturo Borrero Gonzalez: openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) [11:18:38] unless something changed in newer node versions, it's being told for wikipedia.org to not go via urldownloader [11:18:51] every other URL has been going via urldownloader [11:19:05] maybe there's some nodejs change for handling those variables? [11:19:22] what was the previous nodejs version again? [11:19:28] 16 [11:20:11] amazing, the changelog is ofc on github [11:20:20] and github returns a pink unicorn right now [11:20:26] This page is taking too long to load. [11:20:33] (03CR) 10Vgutierrez: [C:04-2] "we need to wait till we collect a significant amount of data for RSA usage in TLSv1.3" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [11:20:35] off to another venue [11:21:20] i can see it... 
not sure what I'm looking for tbh [11:21:20] ah worked this time around [11:22:06] (03PS1) 10Effie Mouzeli: admin.yaml: add arthurtaylor to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/1075888 (https://phabricator.wikimedia.org/T373969) [11:22:17] no mention of the word proxy in 17 or 18 fwiw [11:22:33] let me doublecheck that it is indeed trying to reach out to the cdn [11:23:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:25:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:25:37] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:26:23] mvolz: heh, add a . to the end of wikipedia.org. [11:26:27] and it works [11:26:35] ....
[11:26:37] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:26:49] this starting to smell of IPv6 and DNS [11:27:23] PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:30:00] sigh, I see we 've already applied the envoy tls proxy ipv6 fix [11:30:07] so probably not that [11:30:28] yeah :/ [11:30:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:30:37] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:30:39] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:41] i was hoping that was it last time [11:32:25] RECOVERY - Host cr3-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.33 ms [11:38:24] mvolz: I 've enabled debug logging temporarily by hand [11:38:37] ooh [11:38:38] and I got this very "useful" piece of information from zotero [11:38:41] (1)(+0000023): Error: ETIMEDOUT [11:38:43] oh. [11:38:54] well, yes. [11:38:56] :P [11:39:03] that's when trying HTTP GET https://en.wikipedia.org/wiki/Darth_Vader [11:39:24] when adding the . 
well it outputs more stuff [11:40:02] you can see by kube_env zotero staging ; kubectl logs -l app=zotero -c zotero-staging [11:40:29] nothing useful though [11:40:51] (03PS1) 10Effie Mouzeli: admin.yaml: jiawang change groups [puppet] - 10https://gerrit.wikimedia.org/r/1075889 (https://phabricator.wikimedia.org/T373194) [11:44:07] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins group for jiawang - https://phabricator.wikimedia.org/T373379#10178955 (10jijiki) @mpopov @jwang is also present in `analytics_privatedata_users `, `analytics-product-users`, shall we remove them from those groups as well? [11:46:11] (03CR) 10Phuedx: [C:03+1] Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: 10Santiago Faci) [11:46:38] mvolz: oh I think I found it [11:46:47] gimme a sec [11:47:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075889 (https://phabricator.wikimedia.org/T373194) (owner: 10Effie Mouzeli) [11:47:26] 👍️ [11:48:09] (03CR) 10Slyngshede: [C:04-1] "I think you linked the wrong ticket." [puppet] - 10https://gerrit.wikimedia.org/r/1075889 (https://phabricator.wikimedia.org/T373194) (owner: 10Effie Mouzeli) [11:48:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075888 (https://phabricator.wikimedia.org/T373969) (owner: 10Effie Mouzeli) [11:49:01] (03CR) 10Slyngshede: [C:04-1] "Probably should have been: https://phabricator.wikimedia.org/T373379" [puppet] - 10https://gerrit.wikimedia.org/r/1075889 (https://phabricator.wikimedia.org/T373194) (owner: 10Effie Mouzeli) [11:49:39] mvolz: try it out, I 'll upload the patch [11:50:00] but it's absolutely specific to the check fwiw [11:50:06] and it was indeed IPv6 [11:50:11] huh ok [11:50:23] not just all of the proxied ones? 
[11:50:29] err un proxied [11:50:34] nodejs 18 switched to IPv6 by default for everything and we never allowed for some reason the IPv6 addresses for the wikipedia.org nedpoints [11:50:42] ooooh [11:50:42] endpoints* [11:51:35] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 07Kubernetes: "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#10178988 (10kamila) I randomly came across this, and I'm wondering _why_ we want to override this on... [11:52:09] !log jynus@cumin1002 dbctl commit (dc=all): 's4 weight tuning T375732', diff saved to https://phabricator.wikimedia.org/P69416 and previous config saved to /var/cache/conftool/dbconfig/20240926-115208-jynus.json [11:52:15] T375732: Post dc switchover mediawiki database weight tuning - https://phabricator.wikimedia.org/T375732 [11:53:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:55:17] (03PS1) 10Alexandros Kosiaris: zotero: Allow reaching out to text-lb over IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075890 (https://phabricator.wikimedia.org/T361728) [11:55:49] mvolz: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075890 [11:57:18] (03CR) 10Alexandros Kosiaris: [C:03+2] zotero: Allow reaching out to text-lb over IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075890 (https://phabricator.wikimedia.org/T361728) (owner: 10Alexandros Kosiaris) [11:57:42] huh. what happens if a website is ipv4? 
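The diagnosis given here (a Node.js 18 client trying IPv6 first while the egress rules for the NO_PROXY hosts only covered IPv4) can be illustrated with a small Python sketch. This is not Zotero's, Node's, or the chart's actual code: `order_addresses`, `is_allowed`, and the documentation-range addresses are hypothetical, and the one outside fact assumed is that Node.js 17 and later return DNS results in resolver order by default instead of sorting IPv4 first.

```python
import ipaddress


def order_addresses(addresses, result_order="verbatim"):
    """Return addresses in the order a client would try them.

    "verbatim"  keeps the resolver's order (an AAAA result may come first).
    "ipv4first" moves IPv4 ahead of IPv6, keeping relative order otherwise.
    """
    if result_order == "verbatim":
        return list(addresses)
    if result_order == "ipv4first":
        # sorted() is stable, so order within each address family is kept.
        return sorted(addresses, key=lambda a: ipaddress.ip_address(a).version)
    raise ValueError(f"unknown result_order: {result_order!r}")


def is_allowed(address, allowed_networks):
    """True if the address falls inside any allowed egress network."""
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in allowed_networks)


# Hypothetical values from the documentation ranges, not real endpoints.
resolved = ["2001:db8::1", "192.0.2.10"]
ipv4_only_rules = [ipaddress.ip_network("192.0.2.0/24")]
dual_stack_rules = ipv4_only_rules + [ipaddress.ip_network("2001:db8::/32")]

for order in ("ipv4first", "verbatim"):
    first = order_addresses(resolved, order)[0]
    print(order, first,
          "v4-only rules:", is_allowed(first, ipv4_only_rules),
          "dual-stack rules:", is_allowed(first, dual_stack_rules))
```

Under these assumptions, a client using the older IPv4-first ordering never notices the missing IPv6 rule, while a client using resolver order hits a blocked destination on its first attempt, which is consistent with the ETIMEDOUT seen earlier. Allowing the IPv6 range, as the patch for text-lb does, makes the first attempt pass.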
[11:57:56] (03CR) 10Mvolz: [C:03+2] zotero: Allow reaching out to text-lb over IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075890 (https://phabricator.wikimedia.org/T361728) (owner: 10Alexandros Kosiaris) [11:58:14] business as usual [11:58:24] (03Merged) 10jenkins-bot: zotero: Allow reaching out to text-lb over IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075890 (https://phabricator.wikimedia.org/T361728) (owner: 10Alexandros Kosiaris) [11:58:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:58:41] for IPv6 too btw. Everything else goes via urldownloaders and they handle that detail [11:58:59] it's everything in NO_PROXY that doesn't [11:59:20] anyway, let me deploy that change to staging and we should be good to go for the production DCs [11:59:51] (03CR) 10Urbanecm: [C:03+1] Drop support for the old impact variant (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) (owner: 10Cyndywikime) [11:59:57] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [12:00:04] jouncebot: nowandnext [12:00:04] For the next 0 hour(s) and 59 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1200) [12:00:04] In 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1300) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1200) [12:00:08] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [12:00:14] !log jynus@cumin1002 dbctl commit (dc=all): 's4 weight tuning T375732', diff saved to https://phabricator.wikimedia.org/P69417 and previous config saved to 
/var/cache/conftool/dbconfig/20240926-120013-jynus.json [12:00:33] I need to figure out why the . trick returned IPv6, but that's for later [12:00:35] er [12:00:38] T375732: Post dc switchover mediawiki database weight tuning - https://phabricator.wikimedia.org/T375732 [12:00:40] I need to figure out why the . trick returned *IPv4*, but that's for later [12:01:09] mvolz: tests look good to me, you are clear to proceed with eqiad and then codfw [12:01:20] eqiad is still depooled at the services level btw [12:01:37] ok [12:01:52] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [12:02:26] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [12:02:54] (03PS6) 10Brouberol: Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) [12:03:27] eqiad seems okay [12:04:20] should we do codfw or should we let eqiad cook for a bit? [12:05:51] doing codfw [12:06:12] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [12:06:30] yeah codfw sgtm [12:06:42] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [12:08:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:09:04] (03PS1) 10Muehlenhoff: Add support for centrally managing /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1075898 (https://phabricator.wikimedia.org/T309724) [12:09:06] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [12:09:27] (03CR) 10CI reject: [V:04-1] 
Add support for centrally managing /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1075898 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [12:11:06] !log jynus@cumin1002 dbctl commit (dc=all): 's4 weight tuning T375732', diff saved to https://phabricator.wikimedia.org/P69418 and previous config saved to /var/cache/conftool/dbconfig/20240926-121105-jynus.json [12:11:09] (03PS1) 10Elukey: Revert "docker_registry_ha: reduce maxentries' default to 25" [puppet] - 10https://gerrit.wikimedia.org/r/1075901 [12:11:11] T375732: Post dc switchover mediawiki database weight tuning - https://phabricator.wikimedia.org/T375732 [12:11:20] (03PS2) 10Muehlenhoff: Add support for centrally managing /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1075898 (https://phabricator.wikimedia.org/T309724) [12:11:34] mvolz: everything ok? [12:11:46] I don't see any alerts, so I assume yes [12:13:08] (03CR) 10CI reject: [V:04-1] Revert "docker_registry_ha: reduce maxentries' default to 25" [puppet] - 10https://gerrit.wikimedia.org/r/1075901 (owner: 10Elukey) [12:13:22] yup, just checking graphs to make extra sure, everything seems fine. [12:13:25] woo [12:13:51] OK to deploy cxserver ie: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1073789 ? mvolz akosiaris [12:13:55] (03PS2) 10Elukey: Revert "docker_registry_ha: reduce maxentries' default to 25" [puppet] - 10https://gerrit.wikimedia.org/r/1075901 [12:14:31] fine with me [12:14:56] Thanks! 
[12:14:56] !next [12:15:03] jouncebot: !next [12:15:10] sigh [12:15:15] :) [12:15:40] kart_: go ahead [12:16:13] jouncebot: next [12:16:13] In 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1300) [12:17:17] (03CR) 10KartikMistry: [C:03+2] Updated cxserver to 2024-09-18-104433-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073789 (https://phabricator.wikimedia.org/T375017) (owner: 10KartikMistry) [12:17:36] (03CR) 10Jelto: [C:03+2] profile::firewall: separate ipv4 and ipv6 in nftables BLOCKED_NETS [puppet] - 10https://gerrit.wikimedia.org/r/1075556 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:17:39] (03CR) 10Jelto: [C:03+2] sretest: test defs_from_etcd with new separate sets [puppet] - 10https://gerrit.wikimedia.org/r/1075842 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:18:20] (03Merged) 10jenkins-bot: Updated cxserver to 2024-09-18-104433-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073789 (https://phabricator.wikimedia.org/T375017) (owner: 10KartikMistry) [12:18:25] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1040.eqiad.wmnet [12:19:17] akosiaris: I see change in puppetca.crt.pem - is that fine to deploy? [12:19:55] (03CR) 10Btullis: [C:03+1] Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [12:19:55] (03CR) 10Elukey: [C:03+2] Revert "docker_registry_ha: reduce maxentries' default to 25" [puppet] - 10https://gerrit.wikimedia.org/r/1075901 (owner: 10Elukey) [12:20:18] let me have a look [12:20:52] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1075902 (owner: 10L10n-bot) [12:21:08] kart_: yeah [12:21:18] Thanks. Deploying. 
[12:21:34] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:21:44] (03CR) 10Cwhite: [C:03+2] opensearch: bump curator version to wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/1075591 (https://phabricator.wikimedia.org/T364190) (owner: 10Cwhite) [12:21:56] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:22:37] !log jynus@cumin1002 dbctl commit (dc=all): 's8 weight tuning T375732', diff saved to https://phabricator.wikimedia.org/P69419 and previous config saved to /var/cache/conftool/dbconfig/20240926-122237-jynus.json [12:22:44] T375732: Post dc switchover mediawiki database weight tuning - https://phabricator.wikimedia.org/T375732 [12:26:42] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1040.eqiad.wmnet [12:27:16] !log jynus@cumin1002 dbctl commit (dc=all): 's8 weight tuning T375732', diff saved to https://phabricator.wikimedia.org/P69420 and previous config saved to /var/cache/conftool/dbconfig/20240926-122715-jynus.json [12:27:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075898 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [12:27:57] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:27:59] (03CR) 10Btullis: [C:03+1] Redeploy postgresql-airflow-test-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [12:28:36] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:28:36] (03CR) 10Jelto: [C:03+2] profile::firewall: separate ipv4 and ipv6 in nftables BLOCKED_NETS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075556 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:28:40] (03CR) 10Brouberol: Redeploy postgresql-airflow-test-k8s (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [12:29:50] (03CR) 10Btullis: [C:03+1] Redeploy postgresql-airflow-test-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [12:29:56] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:30:35] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:30:41] (03Abandoned) 10Muehlenhoff: Install a Puppet generator to create a known hosts file for Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [12:31:54] !log Updated cxserver to 2024-09-18-104433-production (T375017, T374815, T374644) [12:31:59] !log installing glib2.0 bugfix updates from Bookworm point release [12:32:03] (03PS1) 10Jelto: Revert "sretest: test defs_from_etcd with new separate sets" [puppet] - 10https://gerrit.wikimedia.org/r/1075907 (https://phabricator.wikimedia.org/T348734) [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:08] T375017: Post-creation work for rskwiki - https://phabricator.wikimedia.org/T375017 [12:32:10] T374815: Post-creation work for kgewiki - https://phabricator.wikimedia.org/T374815 [12:32:10] T374644: Post-creation work for moswiki - https://phabricator.wikimedia.org/T374644 [12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:23] (03CR) 10Muehlenhoff: [C:03+1] Revert "sretest: test defs_from_etcd with new separate sets" [puppet] - 10https://gerrit.wikimedia.org/r/1075907 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:34:45] (03CR) 10Btullis: "Yes, I think that envoy seems like a good place to block external access to the API." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [12:36:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) (owner: 10Cyndywikime) [12:37:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1002.eqiad.wmnet [12:42:25] (03PS1) 10Muehlenhoff: Switch cloudcephosd1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075909 (https://phabricator.wikimedia.org/T349619) [12:45:52] (03PS1) 10Elukey: Add insetup config for new aux k8s ctrl/worker VMs [puppet] - 10https://gerrit.wikimedia.org/r/1075910 (https://phabricator.wikimedia.org/T375746) [12:46:50] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on vrts1001.eqiad.wmnet with reason: Decom [12:46:53] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on vrts1001.eqiad.wmnet with reason: Decom [12:47:25] !incidents [12:47:26] 5282 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [12:47:26] 5281 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [12:47:26] 5280 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [12:47:26] 5279 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [12:47:27] 5278 (RESOLVED) [2x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [12:47:56] !log aokoth@cumin1002 START - Cookbook sre.hosts.decommission for hosts vrts1001.eqiad.wmnet [12:48:30] I'll be doing early +2 for upcoming Translate backport deployment (which is in around 12 minutes from now() 
) [12:48:36] (03CR) 10AOkoth: [C:03+2] site: remove vrts1001 & vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/1075622 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [12:48:43] (03CR) 10KartikMistry: [C:03+2] Load styles for legacy message box markup [extensions/Translate] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075854 (https://phabricator.wikimedia.org/T375696) (owner: 10Abijeet Patro) [12:49:49] (03CR) 10Jelto: [C:03+2] Revert "sretest: test defs_from_etcd with new separate sets" [puppet] - 10https://gerrit.wikimedia.org/r/1075907 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:50:38] (03CR) 10Jelto: [C:03+2] Revert "profile::firewall: separate ipv4 and ipv6 in nftables BL..." [puppet] - 10https://gerrit.wikimedia.org/r/1075908 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:54:51] !log aokoth@cumin1002 START - Cookbook sre.dns.netbox [12:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:58:14] FIRING: JobUnavailable: Reduced availability for job sql_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:58:43] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: vrts1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [12:58:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:59:05] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075909 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:01:27] (03CR) 10DCausse: [C:03+1] mw-page-content-change-enrich: enable calico network policies. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) (owner: 10Gmodena) [13:02:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1075910 (https://phabricator.wikimedia.org/T375746) (owner: 10Elukey) [13:03:29] No ping about backport deployment? [13:03:36] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: vrts1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [13:03:36] !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:03:37] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts vrts1001.eqiad.wmnet [13:03:38] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1041.eqiad.wmnet [13:03:55] cscott: you can deploy your change. [13:05:14] Or Cyndywikime ^ [13:05:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1002.eqiad.wmnet [13:06:21] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10179370 (10MoritzMuehlenhoff) [13:06:58] i'm here [13:07:06] !log installing QT security updates [13:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:12] yeah, i was surprised not to get a ping. 
does that mean there's no deployer on call? [13:08:02] although i've got the permission bits to deploy, i haven't actually done a deploy in years (although I ran maintenance scripts a couple of weeks ago!) [13:08:03] Seems broken bot. [13:08:20] i can probably follow directions, but i'd prefer there to be someone more experienced around if things break. [13:08:30] jouncebot: nowandnext [13:08:31] For the next 0 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1300) [13:08:31] In 1 hour(s) and 51 minute(s): Southward Datacenter Switchover: Deployment server (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1500) [13:08:31] In 1 hour(s) and 51 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1500) [13:08:48] cscott: I can deploy your change if you want. [13:08:56] i'd appreciate it! [13:09:10] Let's do it. Starting.. [13:09:17] thanks kart_ ! [13:09:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075634 (https://phabricator.wikimedia.org/T374861) (owner: 10C. Scott Ananian) [13:10:01] cscott: fwiw backports are very simple nowadays -- the deployment calendar wiki links to for ex. https://deploy-commands.toolforge.org/bacc/1075634 [13:10:17] o/ [13:10:22] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to shn/fr/el/vi/it wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075634 (https://phabricator.wikimedia.org/T374861) (owner: 10C. 
Scott Ananian) [13:10:47] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075634|Deploy Parsoid Read Views to shn/fr/el/vi/it wikivoyage (T374861)]] [13:10:54] T374861: Deploy Parsoid Read Views to el/fr/it/shn/vi wikivoyage (week of Sep 23) - https://phabricator.wikimedia.org/T374861 [13:11:50] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1041.eqiad.wmnet [13:12:09] oooh deploy-commands is very neat. That combined with https://schedule-deployment.toolforge.org/ makes life very easy [13:13:21] !log kartik@deploy1003 kartik, cscott: Backport for [[gerrit:1075634|Deploy Parsoid Read Views to shn/fr/el/vi/it wikivoyage (T374861)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:05] cscott: available to test using mwdebug servers. Let me know once change is OK to deploy. [13:14:10] (03PS2) 10Effie Mouzeli: admin.yaml: jiawang change groups [puppet] - 10https://gerrit.wikimedia.org/r/1075889 (https://phabricator.wikimedia.org/T373379) [13:14:17] ok, testing [13:14:31] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1003.eqiad.wmnet [13:15:00] cdanis: indeed! life is much easier! 
:) [13:15:31] (03PS1) 10Muehlenhoff: Switch cloudcephosd1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075917 (https://phabricator.wikimedia.org/T349619) [13:15:32] (03CR) 10Effie Mouzeli: [C:03+2] admin.yaml: jiawang change groups [puppet] - 10https://gerrit.wikimedia.org/r/1075889 (https://phabricator.wikimedia.org/T373379) (owner: 10Effie Mouzeli) [13:15:43] (03CR) 10Effie Mouzeli: [C:03+2] admin.yaml: add arthurtaylor to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/1075888 (https://phabricator.wikimedia.org/T373969) (owner: 10Effie Mouzeli) [13:15:50] (03PS2) 10CDanis: coredns: add support for Service externalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075311 (https://phabricator.wikimedia.org/T344171) [13:15:50] (03PS1) 10CDanis: calico: add BGP communities to serviceExternalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075918 (https://phabricator.wikimedia.org/T344171) [13:17:38] kart_: looks good to me! thanks! [13:17:40] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:18:08] !log kartik@deploy1003 kartik, cscott: Continuing with sync [13:18:18] cscott: cool. going ahead. 
[13:18:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:20:13] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075917 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:20:37] (03PS1) 10Btullis: Switch to systemd::service for radosgw on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075919 (https://phabricator.wikimedia.org/T374477) [13:20:43] RESOLVED: JobUnavailable: Reduced availability for job sql_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:21:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:21:36] (03Merged) 10jenkins-bot: Load styles for legacy message box markup [extensions/Translate] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075854 (https://phabricator.wikimedia.org/T375696) (owner: 10Abijeet Patro) [13:22:30] am here :) [13:23:01] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:24:10] (03CR) 10Ssingh: "Looking good, almost there:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [13:25:14] scap bird says: "ssh: Could not resolve hostname snapshot1009.eqiad.wmnet: Name or service not known, ssh: Could not resolve hostname snapshot1008.eqiad.wmnet: Name or service not known" Known issue? 
[13:25:19] akosiaris: ^ [13:26:01] (03PS1) 10Ssingh: sre.dns.admin: sort DC list by DATACENTER_NUMBERING_PREFIX [cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 [13:26:20] (03CR) 10Alexandros Kosiaris: [C:03+2] parsoidtest: Remove duplicate role assignment [puppet] - 10https://gerrit.wikimedia.org/r/1075863 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [13:26:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:27:35] abijeet: next is our patch, but scap seems to be throwing errors and seems slower than usual :/ [13:27:56] dns issues can make things slow, dunno if that's what's going on [13:27:58] (03CR) 10Elukey: "I left some comments but overall it looks good to me. It is difficult to say if it works at first try, there may be some things to tweak, " [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [13:28:10] (03CR) 10Elukey: [C:03+2] Add insetup config for new aux k8s ctrl/worker VMs [puppet] - 10https://gerrit.wikimedia.org/r/1075910 (https://phabricator.wikimedia.org/T375746) (owner: 10Elukey) [13:28:36] kart_, ok. I'm around [13:28:38] also: "ssh: connect to host parsoidtest1001.eqiad.wmnet port 22: Connection timed out" [13:29:34] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454#10179477 (10ssingh) Another use case for this is the `sre.dns.admin` cookbook. ` sukhe@cumin1002:~$ sudo cookbook sre.dns.admin show => CURRENT STA... [13:29:40] (03CR) 10FNegri: "Looks good overall. 
I'm confused by the fact that these 3 lines are removed in PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [13:29:40] (03PS3) 10Cwhite: zuul: send stats to prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089) [13:29:49] akosiaris: the parsoidtest1001 issue seems like yours? cf T363402 [13:29:50] T363402: parsoidtest1001 implementation tracking - https://phabricator.wikimedia.org/T363402 [13:29:54] (03PS2) 10CDanis: calico: add BGP communities to serviceExternalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075918 (https://phabricator.wikimedia.org/T344171) [13:30:27] parsoidtest1001 is set to eventually replace scandium, but i don't think it's actually in-use yet. [13:31:03] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10179491 (10elukey) @Jhancock.wm Hi! You can leave sretest2001 to me since I have been doing some tests, it is surely my fault. I ran the provision cookbook on both, after setting them to "Planned" in netbox, and now... 
[13:31:43] snapshot1008 was decommissioned in ~June, not sure why it is still present in scap: T364455 [13:31:44] T364455: decommission snapshot1008.eqiad.wmnet - https://phabricator.wikimedia.org/T364455 [13:31:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1003.eqiad.wmnet [13:31:55] (i don't know anything about either of these, just searching phab) [13:32:30] cscott: that was a merge error, fix has just been merged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075863 [13:32:57] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1003.eqiad.wmnet [13:32:59] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [13:34:50] kart_: ^ seems like moritzm explained both the parsoidtest and the snapshot errors [13:36:05] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075634|Deploy Parsoid Read Views to shn/fr/el/vi/it wikivoyage (T374861)]] (duration: 25m 17s) [13:36:11] T374861: Deploy Parsoid Read Views to el/fr/it/shn/vi wikivoyage (week of Sep 23) - https://phabricator.wikimedia.org/T374861 [13:36:13] finally! [13:36:24] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker1003.eqiad.wmnet - elukey@cumin1002" [13:36:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker1003.eqiad.wmnet - elukey@cumin1002" [13:36:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:29] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1003.eqiad.wmnet on all recursors [13:36:30] cscott: Please check once again if everything is OK. 
[13:36:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1003.eqiad.wmnet on all recursors [13:36:45] abijeet: lets go with our patch. [13:36:55] (03CR) 10JHathaway: [C:03+1] Add support for centrally managing /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1075898 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [13:36:58] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker1003.eqiad.wmnet - elukey@cumin1002" [13:37:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker1003.eqiad.wmnet - elukey@cumin1002" [13:37:27] kart_: ok, checking (now w/o x-debug) [13:37:35] (03CR) 10FNegri: [C:03+2] "This has now be approved by all the teams involved, I'll merge it and apply it to the wiki replicas dbs." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [13:37:48] (03CR) 10CI reject: [V:04-1] sre.dns.admin: sort DC list by DATACENTER_NUMBERING_PREFIX [cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 (owner: 10Ssingh) [13:37:53] kart_, ok. 
ready [13:38:19] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075854|Load styles for legacy message box markup (T375696)]] [13:38:25] T375696: Translate wiki should use Codex markup not unsupported message markup - https://phabricator.wikimedia.org/T375696 [13:38:49] (03CR) 10JHathaway: [C:03+1] mirrors: Remove rsa-2048 certs from Apache config [puppet] - 10https://gerrit.wikimedia.org/r/1075617 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [13:39:57] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [13:40:15] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179581 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [13:40:39] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1075854|Load styles for legacy message box markup (T375696)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:45] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete rendering certs [puppet] - 10https://gerrit.wikimedia.org/r/1075152 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:41:00] (03PS1) 10Muehlenhoff: Remove obsolete config-master cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1075922 (https://phabricator.wikimedia.org/T357750) [13:41:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075922 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:42:25] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1003.eqiad.wmnet with OS bookworm [13:43:11] abijeet: ready for testing on mwdebug.. [13:43:49] kart_, ok checking [13:44:20] kart_, looks good. 
[13:44:50] cool [13:44:58] !log kartik@deploy1003 kartik, abi: Continuing with sync [13:45:07] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [13:45:22] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179597 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [13:45:34] kart_: looks good, thanks [13:45:40] (03CR) 10Dreamy Jazz: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [13:45:47] cscott: nice! [13:46:48] (03CR) 10Muehlenhoff: "The PCC failure is an unrelated glitch in the matrix..." [puppet] - 10https://gerrit.wikimedia.org/r/1075922 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:48:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:59] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [13:50:13] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179621 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... 
[13:53:23] (03CR) 10Ssingh: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 (owner: 10Ssingh) [13:54:56] FIRING: [2x] SystemdUnitFailed: apache2.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:36] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375459#10179652 (10VRiley-WMF) a:03VRiley-WMF [13:57:47] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1003.eqiad.wmnet with reason: host reimage [13:58:31] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375459#10179663 (10VRiley-WMF) 05Open→03Resolved Reseated cable and it seems to be communicating now. Will close this and monitor. [13:59:22] (03CR) 10Muehlenhoff: [C:03+2] Add support for centrally managing /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1075898 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [13:59:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1004.eqiad.wmnet [14:00:51] (03PS1) 10Muehlenhoff: Switch cloudcephosd1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075924 (https://phabricator.wikimedia.org/T349619) [14:02:45] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075924 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:02:51] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075854|Load styles for legacy message box markup (T375696)]] (duration: 24m 31s) [14:03:00] T375696: Translate wiki should use Codex markup not unsupported message markup - https://phabricator.wikimedia.org/T375696 [14:03:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
aux-k8s-worker1003.eqiad.wmnet with reason: host reimage [14:03:58] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [14:04:03] kart_: will you handle Cyndywikime [14:04:12] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179700 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [14:04:12] *change? [14:04:29] sergi0: yes. I can deploy it. [14:04:40] abijeet: our change is deployed, please recheck. [14:04:48] ok [14:05:00] Cyndywikime: let's deploy your change. [14:05:14] thanks kart_ [14:05:18] kart_: thank you! [14:05:36] kart_, works well. thank you. [14:06:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) (owner: 10Cyndywikime) [14:06:08] (03PS2) 10Btullis: Configure logging correctly for the radosgw service on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075919 (https://phabricator.wikimedia.org/T374477) [14:06:28] Is a change in initialise-labs test-able with wmdebug? [14:06:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1004.eqiad.wmnet [14:06:49] (03Merged) 10jenkins-bot: Drop support for the old impact variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075148 (https://phabricator.wikimedia.org/T350077) (owner: 10Cyndywikime) [14:07:03] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4133/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075919 (https://phabricator.wikimedia.org/T374477) (owner: 10Btullis) [14:07:42] Cyndywikime: done. 
[14:07:58] Cyndywikime: since it is beta only change, scap was faster! [14:08:29] kart_: thanks! :) [14:11:08] (03CR) 10Ottomata: "Added a comment, but looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [14:11:30] kart_: , testing [14:11:47] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [14:12:02] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179747 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [14:13:23] (03CR) 10Elukey: [C:03+1] Remove obsolete config-master cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1075922 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:13:35] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [14:14:21] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179759 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [14:15:46] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179767 (10fnegri) The cookbook is failing for a bunch of different reasons th... [14:17:02] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179781 (10Dreamy_Jazz) >>! In T371486#10179767, @fnegri wrote: > The cookbook... 
[14:17:45] Cyndywikime: "14:07:07 Skipping sync since all commits were beta/labs-only changes. Operation completed." [14:17:55] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375758 (10phaultfinder) 03NEW [14:19:56] RESOLVED: [2x] SystemdUnitFailed: apache2.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:04] sergi0: "initialise-labs test-able with wmdebug" no [14:20:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1003.eqiad.wmnet with OS bookworm [14:20:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker1003.eqiad.wmnet [14:23:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:24:56] FIRING: [2x] SystemdUnitFailed: apache2.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:25:49] (03CR) 10Kamila Součková: [C:03+1] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1073900 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [14:25:57] (03CR) 10Kamila Součková: [C:03+1] hieradata: update deployment_server to deploy2002 [puppet] - 10https://gerrit.wikimedia.org/r/1073894 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [14:26:13] kart_: changes not verifiable on Beta :( [14:27:41] (03PS1) 
10Vgutierrez: postfix: Add wikifunctions.org to the list of aliases [puppet] - 10https://gerrit.wikimedia.org/r/1075927 [14:27:49] AFAIK, beta changes just need to merge. No sync needed. [14:27:58] (03CR) 10Ssingh: [C:03+1] postfix: Add wikifunctions.org to the list of aliases [puppet] - 10https://gerrit.wikimedia.org/r/1075927 (owner: 10Vgutierrez) [14:28:02] (03CR) 10BBlack: [C:03+1] postfix: Add wikifunctions.org to the list of aliases [puppet] - 10https://gerrit.wikimedia.org/r/1075927 (owner: 10Vgutierrez) [14:28:31] (03CR) 10Vgutierrez: [C:03+2] postfix: Add wikifunctions.org to the list of aliases [puppet] - 10https://gerrit.wikimedia.org/r/1075927 (owner: 10Vgutierrez) [14:29:01] (03CR) 10JHathaway: [C:03+1] postfix: Add wikifunctions.org to the list of aliases [puppet] - 10https://gerrit.wikimedia.org/r/1075927 (owner: 10Vgutierrez) [14:29:18] (03PS1) 10FNegri: update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 [14:29:54] !log sudo cumin 'O:postfix::mx_in' 'run-puppet-agent' [14:29:57] (03PS1) 10Majavah: P:idp: Add cloud default for expose_tomcat [puppet] - 10https://gerrit.wikimedia.org/r/1075929 [14:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:18] (03CR) 10Arturo Borrero Gonzalez: "that's the file (/etc/keystone/keystone.conf) that is being templated, with the template being for example modules/openstack/templates/car" [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [14:30:21] (03CR) 10CI reject: [V:04-1] P:idp: Add cloud default for expose_tomcat [puppet] - 10https://gerrit.wikimedia.org/r/1075929 (owner: 10Majavah) [14:30:55] (03PS2) 10Majavah: P:idp: Add cloud default for expose_tomcat [puppet] - 10https://gerrit.wikimedia.org/r/1075929 [14:31:16] (03CR) 10CI reject: [V:04-1] P:idp: Add cloud default for expose_tomcat [puppet] - 10https://gerrit.wikimedia.org/r/1075929 (owner: 
10Majavah) [14:31:28] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1075929 (owner: 10Majavah) [14:31:41] (03PS2) 10FNegri: update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) [14:31:55] (03PS3) 10Majavah: P:idp: Add cloud default for expose_tomcat [puppet] - 10https://gerrit.wikimedia.org/r/1075929 [14:32:50] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [14:33:02] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179872 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [14:33:10] (03CR) 10Majavah: [C:03+2] P:idp: Add cloud default for expose_tomcat [puppet] - 10https://gerrit.wikimedia.org/r/1075929 (owner: 10Majavah) [14:34:50] 06SRE, 10Maps: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761 (10Ckepper) 03NEW [14:34:56] FIRING: [4x] SystemdUnitFailed: apache2.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:13] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Arelion, AS1299/IPv4: Connect - Arelion https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:40:12] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [14:40:12] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - 
Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [14:40:23] hello [14:40:26] !incidents [14:40:27] 5283 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [14:40:27] 5284 (UNACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [14:40:27] 5282 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [14:40:27] 5281 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [14:40:27] 5280 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [14:40:28] 5279 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [14:40:30] you can ignore :) [14:40:32] !ack 5283 [14:40:32] 5283 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [14:40:34] !ack 5284 [14:40:34] 5284 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [14:40:35] (03PS1) 10Gmodena: dse-k8s-service: add kafka-test brokers to flink app. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) [14:40:46] !incidents [14:40:47] 5283 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [14:40:47] 5284 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [14:40:47] 5282 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-e4-eqiad.mgmt.eqiad.wmnet) [14:40:47] 5281 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet) [14:40:47] 5280 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [14:40:48] 5279 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org) [14:41:04] Okay, I just saw Arzhel's comment. [14:41:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [14:41:42] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [14:41:54] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10179905 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [14:42:10] kart_>, how should i proceed, should we retry this again in a future deployment window? [14:42:46] (03PS2) 10Gmodena: dse-k8s-services: add kafka-test brokers to flink app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) [14:43:24] denisse: yeah! 
[14:43:38] (03PS3) 10Gmodena: dse-k8s-services: dump-reconcile: add kafka-test brokers to flink app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) [14:43:58] (03CR) 10CI reject: [V:04-1] update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) (owner: 10FNegri) [14:44:31] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 40 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:44:40] FYI, it's in the deployment calendar, but in about 15 minutes we'll start the switch of the deployment server from eqiad to codfw (deploy1003 to deploy2002). [14:45:12] RESOLVED: Primary outbound port utilisation over 80% #page: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [14:45:12] RESOLVED: Primary inbound port utilisation over 80% #page: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [14:45:28] (03PS3) 10FNegri: update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) [14:45:36] (03PS3) 10Gmodena: EventStreamConfig: remove topic prefixes from dump streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) [14:46:27] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:43] huh??? 
[14:47:56] sukhe: SEL has an unspecified critical bus error [14:48:03] "on a component at slot 3" [14:48:10] yeah thanks, will check after depool [14:48:13] (which we just did) [14:48:19] !log depool cp2037.codfw.wmnet [14:48:19] and "on a component at bus 174 device 0 function 0" [14:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:25] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [14:49:31] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 6 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:54:44] (03CR) 10Gmodena: Change New Eventschemas Git URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [14:56:54] (03CR) 10Snwachukwu: "Thank you Otto. The manual intervention would be done after this patch is merged right?" [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [14:57:06] (03PS1) 10Vgutierrez: ssl: Add digicert-2024 crt files [puppet] - 10https://gerrit.wikimedia.org/r/1075934 [14:57:42] (03CR) 10CI reject: [V:04-1] update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) (owner: 10FNegri) [14:58:17] (03PS2) 10Vgutierrez: ssl: Add digicert-2024 crt files [puppet] - 10https://gerrit.wikimedia.org/r/1075934 (https://phabricator.wikimedia.org/T368560) [14:58:23] (03PS1) 10Brouberol: airflow: introduce a way to define Airflow variables as values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075935 (https://phabricator.wikimedia.org/T375715) [14:59:02] (03CR) 10FNegri: "Exactly, I would also expect to see a diff in the python file... 
I think directories should be shown as well, but I'm not sure." [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [15:00:04] swfrench-wmf: How many deployers does it take to do Southward Datacenter Switchover: Deployment server deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1500). [15:00:05] brennen and jnuche: How many deployers does it take to do Train log triage deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1500). [15:00:07] (03CR) 10Ssingh: [C:03+1] ssl: Add digicert-2024 crt files [puppet] - 10https://gerrit.wikimedia.org/r/1075934 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:00:16] here o/ [15:00:20] (03PS2) 10Snwachukwu: Change New Eventschemas Git URLs [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) [15:00:41] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1075902 (owner: 10L10n-bot) [15:00:45] I'll be starting work to switch the deployment server to codfw shortly [15:01:05] 10ops-codfw, 06DC-Ops, 06Traffic: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766 (10ssingh) 03NEW [15:01:30] !log starting switchover day 3 deployment server switch - T370962 [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:42] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:01:53] (03CR) 10Snwachukwu: Change New Eventschemas Git URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [15:01:59] (03PS1) 10Vgutierrez: hiera: Deploy digicert-2024 [puppet] - 10https://gerrit.wikimedia.org/r/1075936 (https://phabricator.wikimedia.org/T368560) [15:02:12] (03PS2) 10Brouberol: airflow: introduce a way to define Airflow variables as values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075935 (https://phabricator.wikimedia.org/T375715) [15:02:31] (03PS2) 10Vgutierrez: hiera: Deploy digicert-2024 [puppet] - 10https://gerrit.wikimedia.org/r/1075936 (https://phabricator.wikimedia.org/T368560) [15:03:12] (03PS3) 10Brouberol: airflow: introduce a way to define Airflow variables as values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075935 (https://phabricator.wikimedia.org/T375715) [15:03:28] (03CR) 10Scott French: [C:03+2] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1073900 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [15:03:51] (03CR) 10Ssingh: [C:03+1] hiera: Deploy digicert-2024 [puppet] - 10https://gerrit.wikimedia.org/r/1075936 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:03:54] FIRING: [2x] 
KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:40] !log running authdns-update for deployment CNAME switch - T370962 [15:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:47] (03PS1) 10Vgutierrez: secrets: Add digicert-2024 dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/1075938 [15:05:56] (03CR) 10Ssingh: [C:03+1] "[-1, no cliche snake oil string]" [labs/private] - 10https://gerrit.wikimedia.org/r/1075938 (owner: 10Vgutierrez) [15:06:38] (03CR) 10Scott French: [C:03+2] hieradata: update deployment_server to deploy2002 [puppet] - 10https://gerrit.wikimedia.org/r/1073894 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [15:06:48] 10ops-codfw, 06DC-Ops, 06Traffic: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180060 (10ssingh) I think this server is out of warranty but I may be mistaken. 
[15:07:44] (03CR) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:11:15] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Arelion, AS1299/IPv4: Connect - Arelion https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:06] (03PS4) 10Brouberol: airflow: introduce a way to define Airflow variables as values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075935 (https://phabricator.wikimedia.org/T375715) [15:12:53] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075940 [15:16:31] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 73 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:17:17] !log jynus@cumin1002 dbctl commit (dc=all): 'depool db1206', diff saved to https://phabricator.wikimedia.org/P69425 and previous config saved to /var/cache/conftool/dbconfig/20240926-151716-jynus.json [15:18:38] brennen I'm around now for next hour if you want to unblock the deployment. Otherwise I can get a team member to deploy it later today. 
[15:21:22] (03CR) 10Brouberol: [C:03+2] Redeploy postgresql-airflow-test-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075138 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [15:21:31] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 7 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:22:05] !log jynus@cumin1002 dbctl commit (dc=all): 'repool db1206', diff saved to https://phabricator.wikimedia.org/P69426 and previous config saved to /var/cache/conftool/dbconfig/20240926-152204-jynus.json [15:22:43] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [15:23:02] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180139 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... 
[15:24:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:24:50] (03CR) 10Vgutierrez: [C:03+2] ssl: Add digicert-2024 crt files [puppet] - 10https://gerrit.wikimedia.org/r/1075934 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:24:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:25:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:25:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:26:28] (03CR) 10Vgutierrez: [V:03+2 C:03+2] secrets: Add digicert-2024 dummy files [labs/private] - 10https://gerrit.wikimedia.org/r/1075938 (owner: 10Vgutierrez) [15:27:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075936 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:28:09] (03PS1) 10Brouberol: allow airflow-deploy to deploy PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075942 (https://phabricator.wikimedia.org/T374950) [15:31:28] (03CR) 10CI reject: [V:04-1] allow airflow-deploy to deploy PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075942 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [15:31:59] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [15:32:08] (03CR) 10Vgutierrez: [C:03+2] hiera: Deploy digicert-2024 [puppet] - 10https://gerrit.wikimedia.org/r/1075936 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:32:13] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180195 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.wikireplicas.update-view... [15:32:16] (03PS4) 10FNegri: update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) [15:32:39] FYI, I'll be running a `scap sync-world` to test deployments on deploy2002.codfw.wmnet shortly [15:33:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:33:59] (03PS2) 10Brouberol: allow airflow-deploy to deploy PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075942 (https://phabricator.wikimedia.org/T374950) [15:34:17] !log swfrench@deploy2002 Started scap sync-world: No-op deployment to verify switchover - T370962 [15:34:23] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:38:18] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [15:38:35] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180226 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... 
[15:38:48] (03PS1) 10Vgutierrez: hiera: Deploy digicert-2024 take two [puppet] - 10https://gerrit.wikimedia.org/r/1075944 (https://phabricator.wikimedia.org/T368560) [15:38:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:39:17] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Arelion, AS1299/IPv4: Connect - Arelion https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:26] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075944 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:39:50] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [15:40:05] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180243 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... 
[15:40:07] (03CR) 10Btullis: [C:03+1] allow airflow-deploy to deploy PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075942 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [15:40:15] (03CR) 10Brouberol: [C:03+2] allow airflow-deploy to deploy PG clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075942 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [15:40:30] (03CR) 10Ssingh: [C:03+1] hiera: Deploy digicert-2024 take two [puppet] - 10https://gerrit.wikimedia.org/r/1075944 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:40:45] (03CR) 10Ebrahim: "Pppery agreed with the change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [15:40:48] FIRING: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:42:00] (03CR) 10Brouberol: [C:03+1] Configure logging correctly for the radosgw service on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075919 (https://phabricator.wikimedia.org/T374477) (owner: 10Btullis) [15:42:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:42:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:42:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:42:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:43:33] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 71 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:43:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10180265 (10VRiley-WMF) This drive has been replaced. Please let us know if there are any further issues. [15:44:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:44:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:44:28] (03CR) 10Xcollazo: EventStreamConfig: remove topic prefixes from dump streams. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) (owner: 10Gmodena) [15:45:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:46:13] (03CR) 10Xcollazo: [C:03+1] "LGTM, but I have no idea really!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [15:46:15] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4134/co" [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [15:46:28] (03CR) 10Btullis: [V:03+1 C:03+2] Configure logging correctly for the radosgw service on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075919 (https://phabricator.wikimedia.org/T374477) (owner: 10Btullis) [15:46:52] (03CR) 10Majavah: [C:03+1] update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) (owner: 10FNegri) [15:47:26] (03CR) 10Vgutierrez: [C:03+2] hiera: Deploy digicert-2024 take two [puppet] - 10https://gerrit.wikimedia.org/r/1075944 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [15:47:33] (03CR) 10Ottomata: "Hm, I'm not so sure about this. I understand the reasoning, but I think it might make Refine ingestion complicated. 
The event Refine job" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) (owner: 10Gmodena) [15:48:06] btullis: go ahead with my change if it pooped up on your puppet-merge [15:48:11] *popped [15:48:14] great typo [15:48:33] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 7 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:49:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10180297 (10Papaul) [15:51:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10180308 (10Papaul) [15:51:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10180305 (10Papaul) 05Open→03Resolved This is complete [15:51:06] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776 (10phaultfinder) 03NEW [15:52:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10180309 (10Papaul) 05Open→03Resolved This is complete [15:53:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:54:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:54:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:55:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:55:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:55:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:55:46] (03PS1) 10Btullis: Correct error in the rsyslog config for radosgw on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075948 (https://phabricator.wikimedia.org/T374477) [15:55:52] (03PS1) 10JMeybohm: Fix lsw1-{e,f}{6,7} in common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075949 (https://phabricator.wikimedia.org/T369744) [15:56:37] (03CR) 10Btullis: [C:03+2] Correct error in the rsyslog config for radosgw on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075948 (https://phabricator.wikimedia.org/T374477) (owner: 10Btullis) [15:56:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180345 (10Papaul) @Jhancock.wm can you please clear all the logs on this server and upgrade the BIOS and IDRAC please. tha... [15:59:42] (03CR) 10JMeybohm: [C:03+2] Fix lsw1-{e,f}{6,7} in common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075949 (https://phabricator.wikimedia.org/T369744) (owner: 10JMeybohm) [15:59:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:00:04] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:05] cwhite: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Switch Zuul StatsD Target deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1600). 
[16:00:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:00:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:00:33] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 55 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:00:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:02:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180393 (10Jhancock.wm) a:03Jhancock.wm [16:03:03] (03Merged) 10jenkins-bot: Fix lsw1-{e,f}{6,7} in common-bgp.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075949 (https://phabricator.wikimedia.org/T369744) (owner: 10JMeybohm) [16:03:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:03:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:04:07] (03CR) 10Cwhite: [C:03+2] zuul: send stats to prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [16:04:10] !log swfrench@deploy2002 Finished scap sync-world: No-op deployment to verify switchover - T370962 (duration: 29m 53s) [16:04:26] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [16:05:00] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:05:07] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:05:08] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[16:05:17] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Arelion, AS1299/IPv4: Connect - Arelion https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:05:25] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:05:25] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:05:25] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:05:27] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:05:29] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:05:40] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:05:41] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:05:51] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:06:02] (03PS1) 10Brouberol: airflow-test-k8s: run PG normally now that the data has been imported [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075952 (https://phabricator.wikimedia.org/T374950) [16:06:07] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:06:08] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:06:21] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:06:22] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:06:49] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:06:51] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[16:07:14] I realize we've run over a bit into the puppet window, though it looks like no patches are schedule [16:07:18] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:07:19] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:07:31] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:07:32] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [16:07:42] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:08:38] the test deployment on the new deployment has completed, but I have a couple of remaining items I'd like to check. please reach out before attempting to deploy. [16:08:45] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: run PG normally now that the data has been imported [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075952 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [16:08:51] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: run PG normally now that the data has been imported [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075952 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [16:09:04] !log `systemctl restart zuul` on contint1002 T233089 [16:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:10] T233089: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 [16:09:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [16:10:15] (03CR) 10FNegri: [C:03+2] update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) (owner: 10FNegri) [16:10:50] (03CR) 10BCornwall: [V:03+1 
C:03+2] varnish: Conditionally monitor vcl reloads [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [16:11:43] (03CR) 10BCornwall: [C:03+2] P:toolforge: proxy: Remove rsa-2048 certs from nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1075609 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [16:13:03] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10180446 (10Papaul) setup/configuration of both switches done. Just need to add the switches to monitoring was we have pfw1-codfw up. https://... [16:13:08] (03CR) 10BCornwall: [C:03+2] dumps: Remove rsa-2048 certs from nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1075610 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [16:13:30] (03CR) 10BCornwall: [C:03+2] ldap: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075607 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [16:13:46] (03CR) 10BCornwall: [C:03+2] mirrors: Remove rsa-2048 certs from Apache config [puppet] - 10https://gerrit.wikimedia.org/r/1075617 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [16:14:39] (03CR) 10BCornwall: [C:03+2] varnish: Set Cache-Control: no-transform header [puppet] - 10https://gerrit.wikimedia.org/r/917954 (https://phabricator.wikimedia.org/T218618) (owner: 10BCornwall) [16:14:43] RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [16:15:33] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:16:17] (03PS1) 10Pppery: Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) [16:17:03] (03CR) 10CI reject: [V:04-1] Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [16:17:11] (03PS2) 10Pppery: Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) [16:17:52] (03PS3) 10Pppery: Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) [16:17:56] (03CR) 10CI reject: [V:04-1] Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [16:18:17] (03PS4) 10Pppery: Missing.php: Improve detection of interwikis in certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) [16:20:45] (03CR) 10Pppery: Missing.php: Improve detection of interwikis in certain cases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [16:21:12] (03CR) 10Jdlrobson: [C:03+1] "Wahoo! Let's do this then!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [16:22:05] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:41] ^ this is depooled. 
downtiming as well [16:22:49] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins group for jiawang - https://phabricator.wikimedia.org/T373379#10180475 (10jwang) > is also present in analytics_privatedata_users , analytics-product-users, shall we remove them from those groups as well? Please keep me in... [16:22:59] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins group for jiawang - https://phabricator.wikimedia.org/T373379#10180473 (10mpopov) 05Open→03Resolved Thank you for updating membership, @jijiki! >>! In T373379#10178955, @jijiki wrote: > @mpopov @jwang is also present... [16:23:18] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp2037.codfw.wmnet with reason: T375766 [16:23:24] T375766: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766 [16:23:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2037.codfw.wmnet with reason: T375766 [16:25:01] (03Merged) 10jenkins-bot: update-views: improve filter handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1075928 (https://phabricator.wikimedia.org/T375760) (owner: 10FNegri) [16:25:13] alright, I believe things are working as expected on deploy2002 [16:25:56] (03PS1) 10JHathaway: wmfusercontent.org: add MX records [dns] - 10https://gerrit.wikimedia.org/r/1075960 [16:27:10] (03CR) 10Ssingh: [C:03+1] wmfusercontent.org: add MX records [dns] - 10https://gerrit.wikimedia.org/r/1075960 (owner: 10JHathaway) [16:27:18] brennen: jnuche: I see you're the deployers for the upcoming train window. FYI that: [16:27:18] 1. the active deployment server is now deploy2002 [16:27:19] 2. 
you *might* observe some non-critical errors in the final `php-fpm-restarts` phase of the deployment related to parse1001 and parse2001, which we're still investigating (looks like something related to stale dsh groups) [16:27:37] !log done with switchover day 3 deployment server switch - T370962 [16:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:51] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [16:27:51] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [16:29:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:29:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:30:26] !log jiawang@deploy2002 Started deploy [airflow-dags/analytics_product@8f6a9ed]: (no justification provided) [16:30:58] (03PS1) 10JHathaway: postfix: add wmfusercontent.org and wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/1075961 [16:31:17] !log jiawang@deploy2002 Finished deploy [airflow-dags/analytics_product@8f6a9ed]: (no justification provided) (duration: 01m 13s) [16:31:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [16:32:38] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [16:32:56] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180512 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... 
[16:34:07] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [16:34:17] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 2 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180514 (10ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-view... [16:34:57] jouncebot: now [16:34:57] For the next 0 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1600) [16:34:58] For the next 0 hour(s) and 25 minute(s): Switch Zuul StatsD Target (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1600) [16:35:07] (03CR) 10Ssingh: "wikimediafoundation.org seems to be in both in domain_aliases_generic and relay_domains I think and which is why it complains?" [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [16:39:30] brennen: i'm going to be updating scap this morning fyi. it contains a pretty big change to how MW scripts are executed but dancy and I have tested it quite thoroughly in train-dev [16:39:53] still, i will be sure to be around during the train window in case of trouble [16:42:43] dduvall: FYI, see note above about unrelated errors you might see in the terminal phases of a deployment (some stale dsh groups that need cleaned up) [16:43:18] ah ok. will syncing to those entries fail? [16:43:39] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10180539 (10RobH) Draft body of support request for magru temp investigation: https://docs.google.com/document/d/1T-XwSS_Rwfb9nfC1aHQW4AjptLjxiviyZfGFdFcowZY/edit?usp=shar... 
[16:43:44] `scap install-world` goes out to all targets i believe [16:44:06] dduvall: it just emits a non-fatal error when attempting to run php-fpm restarts on hosts it should not [16:44:11] so yeah, should be fine [16:44:28] got it, ok. ty! [16:44:30] (and may not be a problem at all if I fix it first :)) [16:44:35] :) [16:45:51] brett: did you intend to +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075609 but not merge it? [16:50:55] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180544 (10fnegri) 05Open→03Resolved p:05Triage→03High a:05Dreamy... [16:52:44] taavi: getting my ducks in a row. Thanks for the reminder :) [16:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:59:09] brennen: scratch that. no scap deployment today. dancy surfaced an issue! :) [16:59:53] (03PS37) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [16:59:56] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:05] bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1700) [17:00:14] (03CR) 10CDobbins: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [17:00:19] (03PS1) 10Btullis: Tweak the logging settings for radosgw on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075967 (https://phabricator.wikimedia.org/T374477) [17:00:28] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [17:00:37] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10180628 (10RobH) Inbound shipment ticket 00980858 for UPS 1Z20506Y0100053206 (already delivered today and got the shipment notice last night). Next step is sc... [17:01:13] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4135/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075967 (https://phabricator.wikimedia.org/T374477) (owner: 10Btullis) [17:01:42] (03CR) 10BCornwall: [C:03+1] wmfusercontent.org: add MX records [dns] - 10https://gerrit.wikimedia.org/r/1075960 (owner: 10JHathaway) [17:01:54] dduvall: ack [17:02:15] (03PS38) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [17:02:22] * bd808 will be deploying developer portal in this window [17:02:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180634 (10Jhancock.wm) firmware updated and event log cleared. 
[17:02:58] bd808: mind giving me a ping when finished? i have a backport to sling out pre-train. [17:03:29] !log removing downtime on cp2037 but still keeping it depooled: T375766 [17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:35] T375766: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766 [17:03:42] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2037.codfw.wmnet [17:03:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2037.codfw.wmnet [17:03:52] brennen: backport away. My stuff is just a k8s container version bump. Shouldn't be any conflict. [17:03:58] (03CR) 10Btullis: [V:03+1 C:03+2] Tweak the logging settings for radosgw on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1075967 (https://phabricator.wikimedia.org/T374477) (owner: 10Btullis) [17:04:00] cool cool, thx [17:04:48] brennen: would you mind holding on one minute while I apply a manual fix to the newly active deployment server? [17:04:55] swfrench-wmf: sure thing [17:05:09] just noticed deployment server had switched and was wondering [17:06:26] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1017.eqiad.wmnet [17:07:29] (03PS1) 10BCornwall: dotfiles: Update brett's configs [puppet] - 10https://gerrit.wikimedia.org/r/1075970 [17:07:37] brennen: thanks! you should be good to go now. 
this was a fix for the non-fatal error I mentioned in my mention to you above ^ [17:07:56] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-09-26-122625-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075971 (https://phabricator.wikimedia.org/T375211) [17:08:23] !log cleared contents of a stale scap dsh group resource on deploy2002 - T370962 [17:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:35] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [17:08:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:09:04] (03CR) 10BCornwall: [C:03+2] dotfiles: Update brett's configs [puppet] - 10https://gerrit.wikimedia.org/r/1075970 (owner: 10BCornwall) [17:09:05] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: sync [17:09:23] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: sync [17:09:52] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-09-26-122625-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075971 (https://phabricator.wikimedia.org/T375211) (owner: 10BryanDavis) [17:10:52] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-09-26-122625-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075971 (https://phabricator.wikimedia.org/T375211) (owner: 10BryanDavis) [17:12:13] 06SRE, 06Data-Engineering, 10Data-Services, 06Trust and Safety Product Team, and 3 others: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null - https://phabricator.wikimedia.org/T371486#10180670 
(10fnegri) > However, I checked and the globalblocks table in the... [17:12:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1017.eqiad.wmnet [17:13:02] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:13:23] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:13:36] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:14:04] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:14:14] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:14:52] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:17:03] swfrench-wmf: right on, thanks. [17:17:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [17:17:48] brennen: i am here if you need verification [17:19:53] Jdlrobson: cool, will let you know once ready for a check. [17:22:05] Jdlrobson: hrm, CI failures here? 
[17:22:42] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T375785 (10phaultfinder) 03NEW [17:23:45] brennen: mm [17:24:17] brennen: looks like some quibble failure [17:24:22] brennen: not related to patch [17:27:13] (03CR) 10Ebernhardson: [C:03+2] Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [17:27:48] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T375785#10180744 (10phaultfinder) [17:28:25] (03Merged) 10jenkins-bot: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [17:28:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:32:18] Jdlrobson: yeah, seems like T374830 [17:32:19] T374830: Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830 [17:32:29] brennen: 🥲 [17:35:56] short term, is this just a recheck situation? 
[17:36:32] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:36:41] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:39:21] brennen: hopefully [17:40:17] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:40:27] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:41:35] (03CR) 10Brennen Bearnes: [C:04-1] Limit a heading selector to Parsoid HTML [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [17:41:56] (03CR) 10Brennen Bearnes: [C:03+2] Limit a heading selector to Parsoid HTML [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [17:43:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:49:59] (03CR) 10CI reject: [V:04-1] Limit a heading selector to Parsoid HTML [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [17:50:12] brennen: hmm [17:50:30] (03CR) 10Brennen Bearnes: Limit a heading selector to Parsoid HTML [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [17:50:50] (03CR) 10Brennen Bearnes: [C:03+2] Limit a heading selector to Parsoid HTML [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [17:52:23] (03CR) 10Ottomata: "Yup, during our scheduled migration window." [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [17:53:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:56:30] (03PS29) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:56:48] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:57:30] (03PS30) 10CDobbins: sre.cdn.roll-restart: add rolling restart script for haproxy and pdns [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:57:53] (03CR) 10CDobbins: sre.cdn.roll-restart: add rolling restart script for haproxy and pdns (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:00:05] brennen and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T1800). [18:03:02] will move on to train once https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/1075652 is deployed. 
[18:03:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:07:54] !log rebooting msw-e1-eqiad for maintenance [18:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:16] (03PS31) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [18:10:03] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - No response from remote host 10.65.2.143 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:11:26] (03PS1) 10Ebernhardson: cirrus: Update staging to use codfw as prefix filter after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075982 [18:12:21] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375758#10180882 (10VRiley-WMF) a:03VRiley-WMF [18:13:26] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375758#10180883 (10VRiley-WMF) After troubleshooting the cables and seeing multiple issues with other servers. It was recommended to reboot the switch. Logged it and then proceeded to reboot. It looks like this has cleard up... 
[18:13:37] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375758#10180886 (10VRiley-WMF) 05Open→03Resolved [18:14:06] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T374897#10180887 (10VRiley-WMF) a:03VRiley-WMF [18:15:10] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update staging to use codfw as prefix filter after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075982 (owner: 10Ebernhardson) [18:15:59] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - elastic1089 - https://phabricator.wikimedia.org/T374897#10180889 (10VRiley-WMF) 05Open→03Resolved after troubleshooting this, we had to reboot the E1 management switch. This issue should be cleared up. [18:16:16] (03Merged) 10jenkins-bot: cirrus: Update staging to use codfw as prefix filter after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075982 (owner: 10Ebernhardson) [18:19:10] (03Merged) 10jenkins-bot: Limit a heading selector to Parsoid HTML [extensions/MobileFrontend] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1075652 (https://phabricator.wikimedia.org/T375701) (owner: 10Jdlrobson) [18:20:23] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:20:37] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:21:03] (03PS1) 10JHathaway: Move wikimediafoundation.org out of secret puppet [labs/private] - 10https://gerrit.wikimedia.org/r/1075984 [18:23:15] (03CR) 10Gmodena: "> Hm, I'm not so sure about this. I understand the reasoning, but I think it might make Refine ingestion complicated. The event Refine job" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) (owner: 10Gmodena) [18:23:45] (03Abandoned) 10Gmodena: EventStreamConfig: remove topic prefixes from dump streams. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075885 (https://phabricator.wikimedia.org/T368755) (owner: 10Gmodena) [18:23:51] !log repooling cp2037; downtimed removed for some time, looks good to repool: T375766 [18:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:57] T375766: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766 [18:24:20] kart_ / Cyndywikime: any idea what happened with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1075148 ? [18:24:23] (03CR) 10JHathaway: [C:03+2] Move wikimediafoundation.org out of secret puppet [labs/private] - 10https://gerrit.wikimedia.org/r/1075984 (owner: 10JHathaway) [18:24:24] (03CR) 10JHathaway: [V:03+2 C:03+2] Move wikimediafoundation.org out of secret puppet [labs/private] - 10https://gerrit.wikimedia.org/r/1075984 (owner: 10JHathaway) [18:24:40] appears to have been merged but not deployed? [18:25:07] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:30:59] (gah, i just realized this was a labs thing. proceeding.) [18:31:15] (am i firing on all cylinders today? oh, definitely.) [18:31:58] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1075652|Limit a heading selector to Parsoid HTML (T375701)]] [18:32:05] T375701: [Regression wmf24] All headings have borders and additional padding on mobile - https://phabricator.wikimedia.org/T375701 [18:42:53] (03CR) 10Bking: [C:03+2] admin-ng: add airflow namespaces to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [18:45:06] (03CR) 10Ssingh: "I think we are almost there. Just need to confirm depool_services for haproxy so let me get back to you on that." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:46:20] (03CR) 10JHathaway: "We had non-generic aliases in secret puppet, those have now been removed from the labs variety, I5fd09d9bda62aaa977e319c540e7c42c7b46388d" [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:46:23] (03Merged) 10jenkins-bot: admin-ng: add airflow namespaces to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [18:47:04] (03CR) 10Ssingh: "The Puppet 5 failure is expected? (I am asking!)" [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:47:43] (03CR) 10Ssingh: [C:03+1] postfix: add wmfusercontent.org and wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:48:25] (03CR) 10JHathaway: "yes, sorry, I use the `get` function which is only in 7, https://www.puppet.com/docs/puppet/7/function.html#get" [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:49:02] (03CR) 10Ssingh: [C:03+1] "TIL. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:49:55] (03CR) 10JHathaway: [C:03+2] wmfusercontent.org: add MX records [dns] - 10https://gerrit.wikimedia.org/r/1075960 (owner: 10JHathaway) [18:50:50] (03CR) 10JHathaway: [C:03+2] postfix: add wmfusercontent.org and wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/1075961 (owner: 10JHathaway) [18:53:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:03:35] (03PS2) 10Ahonc: Change votewiki language to Ukrainian. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075990 (https://phabricator.wikimedia.org/T302443) [19:03:50] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1075995 [19:04:52] !log brennen@deploy2002 brennen, jdlrobson: Backport for [[gerrit:1075652|Limit a heading selector to Parsoid HTML (T375701)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:04:52] !log brennen@deploy2002 Sync cancelled. [19:04:58] T375701: [Regression wmf24] All headings have borders and additional padding on mobile - https://phabricator.wikimedia.org/T375701 [19:05:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075990 (https://phabricator.wikimedia.org/T302443) (owner: 10Ahonc) [19:05:18] Jdlrobson: well that took longer than it should have. anyhow, please to verify if you're still around... [19:05:24] brennen: yep [19:05:26] (03PS12) 10BCornwall: varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:05:26] (03CR) 10BCornwall: varnish: Give 1% of views RSA cert warnings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:05:26] on it [19:05:33] thanks! [19:05:43] confirmed as fixed on the debug servers! 
[19:06:04] cool, going ahead [19:06:08] (03PS6) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:06:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:08:26] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075999 (https://phabricator.wikimedia.org/T373643) [19:08:28] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075999 (https://phabricator.wikimedia.org/T373643) (owner: 10TrainBranchBot) [19:09:16] argh. managed to cancel sync for that patch; i'm going ahead with group2 on the theory that should get things in the desired state. [19:09:44] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075999 (https://phabricator.wikimedia.org/T373643) (owner: 10TrainBranchBot) [19:13:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:15:38] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10181005 (10RobH) Opened ticket CS1011077 for the above updated google doc draft. 
[19:16:47] (03PS13) 10BCornwall: varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:20:13] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:20:57] (03CR) 10CDobbins: "Thank you; sounds good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [19:22:03] RECOVERY - MegaRAID on es1022 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:26:15] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1076006 [19:27:20] (03PS7) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:27:46] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:28:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:30:41] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.24 refs T373643 [19:30:47] T373643: 1.43.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T373643 [19:36:27] (03PS4) 10Herron: opentelemetry::collector: set default port and update template [puppet] - 10https://gerrit.wikimedia.org/r/1076006 [19:37:21] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4143/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1076006 (owner: 10Herron) [19:38:48] (03PS8) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:39:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:40:48] FIRING: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:41:39] (03PS9) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:42:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:44:40] (03PS10) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:45:15] (03PS11) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:45:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:45:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:46:02] (03PS12) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:46:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:46:22] (03CR) 10CI reject: [V:04-1] elasticsearch: 
monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:47:48] (03PS13) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:49:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:53:07] (03PS14) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:53:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:57:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:00:05] thcipriani, RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240926T2000). [20:00:05] derenrich, bpirkle, and Ahonc: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] o/ [20:00:13] I'm here [20:00:25] + [20:01:22] I can deploy [20:02:50] (03PS1) 10Ebernhardson: cirrus: Include a private wiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076017 [20:04:45] (03CR) 10Ebernhardson: [C:03+2] cirrus: Include a private wiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076017 (owner: 10Ebernhardson) [20:05:11] thanks [20:05:32] alright, I'm just going to do these in order since they're all config changes derenrich you're up [20:05:47] (03Merged) 10jenkins-bot: cirrus: Include a private wiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076017 (owner: 10Ebernhardson) [20:05:57] 👍 [20:06:51] aaand, right, we just moved deploy hosts :) [20:07:30] (03PS1) 10Ahmon Dancy: scap-master-sync: Fix cdb exclude [puppet] - 10https://gerrit.wikimedia.org/r/1076019 (https://phabricator.wikimedia.org/T297326) [20:08:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075333 (owner: 10DErenrich) [20:09:06] (03CR) 10CI reject: [V:04-1] scap-master-sync: Fix cdb exclude [puppet] - 10https://gerrit.wikimedia.org/r/1076019 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [20:09:30] (03Merged) 10jenkins-bot: Bump coverage of the add-a-fact quicksurvey to 0.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075333 (owner: 10DErenrich) [20:09:45] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1075333|Bump coverage of the add-a-fact quicksurvey to 0.2]] [20:09:48] Ahonc: I've never changed a wiki language before. Are there additional steps beyond this patch? [20:10:11] I am also do it first time :) [20:10:17] oh good :) [20:10:27] did you find docs for this? 
[20:10:51] that's what I was searching for and then I thought it'd be faster to just ask [20:11:07] Cladis told me how to do it [20:11:31] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:11:37] he tolad only to to commit on gerrit [20:11:43] *told [20:11:45] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:12:05] and then you should merge it [20:12:20] ok, let me keep searching a bit (and if Cladis is around and has pointers, that'd be helpful) [20:13:17] !log thcipriani@deploy2002 thcipriani, derenrich: Backport for [[gerrit:1075333|Bump coverage of the add-a-fact quicksurvey to 0.2]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:36] ^ derenrich any way to check? [20:13:59] (03PS39) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [20:14:11] not that i know of but it's such a simple change. i can't imagine it not working [20:14:41] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [20:14:59] 06SRE, 06Traffic, 10WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org - https://phabricator.wikimedia.org/T375795 (10Urbanecm_WMF) 03NEW [20:16:07] (03PS14) 10BCornwall: varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [20:16:19] ok, checked for explosions, logs look fine. derenrich continuing with sync. 
[20:16:23] !log thcipriani@deploy2002 thcipriani, derenrich: Continuing with sync [20:20:28] Ahonc: looking at one of the last times this was done it does just look like the deployer at that time sync'd and then purged the main page of votewiki, so let's try that. (CC urbanecm the deployer I'm stalking in the sal: https://sal.toolforge.org/production?p=1&q=&d=2020-11-09 :)) [20:20:34] 06SRE, 06Traffic, 10WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181160 (10bd808) [20:20:38] * urbanecm was summoned [20:21:13] urbanecm: had a patch for changing language in vote wiki, looks like you did this 4 years ago :D I've never done it before: any special considerations here? [20:21:45] looks like you sync'd and purged the cache for the main page and called it a day per the sal [20:21:46] 06SRE, 06Traffic, 10WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181157 (10bd808) I don't know if there is a task for this yet, but it is known. The bug here is that we c... [20:22:07] apart from the CDN cache having the old language for a while, none that would come to mind [20:22:28] all pages would be cached, i think i just purged main page to verify it was deployed and then left the rest to invalidate itself [20:22:51] got it, I figured you were doing that to check something [20:22:55] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:22:56] alright, thank you! 
[20:23:02] (03CR) 10Dduvall: [C:03+1] scap-master-sync: Fix cdb exclude [puppet] - 10https://gerrit.wikimedia.org/r/1076019 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [20:23:03] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:23:38] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1075333|Bump coverage of the add-a-fact quicksurvey to 0.2]] (duration: 13m 53s) [20:23:47] ^ derenrich all sync'd [20:23:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:23:55] great! thanks [20:24:07] sure thing bpirkle you're up [20:24:28] Ready [20:24:36] (03PS2) 10BPirkle: REST: Adjust REST Sandbox spec for new specs module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512) [20:25:03] 06SRE, 06Traffic, 10WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181161 (10Urbanecm_WMF) >>! In T375795#10181157, @bd808 wrote: > I don't know if there is a task for this... 
[20:25:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [20:26:20] (03Merged) 10jenkins-bot: REST: Adjust REST Sandbox spec for new specs module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [20:26:35] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1075335|REST: Adjust REST Sandbox spec for new specs module (T375512)]] [20:26:42] T375512: REST API Sandbox throwing 404 on test wiki - https://phabricator.wikimedia.org/T375512 [20:28:49] !log thcipriani@deploy2002 bpirkle, thcipriani: Backport for [[gerrit:1075335|REST: Adjust REST Sandbox spec for new specs module (T375512)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:10] ^ bpirkle should be live on mwdebug, check please [20:30:08] Looks as expected [20:30:33] bpirkle: cool, thanks for checking, continuing with sync [20:30:39] 06SRE, 06Traffic, 10WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181167 (10bd808) >>! In T375795#10181161, @Urbanecm_WMF wrote: > Interesting, good to know. This is fairl... 
[20:30:42] !log thcipriani@deploy2002 bpirkle, thcipriani: Continuing with sync [20:33:12] 06SRE, 10WikimediaDebug, 10wikitech.wikimedia.org: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181190 (10bd808) [20:36:00] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1075335|REST: Adjust REST Sandbox spec for new specs module (T375512)]] (duration: 09m 25s) [20:36:10] T375512: REST API Sandbox throwing 404 on test wiki - https://phabricator.wikimedia.org/T375512 [20:36:13] ^ bpirkle live everywhere [20:36:17] Ahonc: you're up [20:36:25] thank you! [20:37:28] (03CR) 10Ahmon Dancy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1076019 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [20:40:26] * Ahonc waits [20:40:49] * thcipriani reorganizing windows [20:41:38] 10SRE-swift-storage, 06Wikimedia Enterprise: Commonswiki file File:Byway in Hoth Wood - geograph.org.uk - 6765054.jpg not found - https://phabricator.wikimedia.org/T375797 (10prabhat) 03NEW [20:41:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075990 (https://phabricator.wikimedia.org/T302443) (owner: 10Ahonc) [20:41:50] (03CR) 10BCornwall: "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:42:26] (03Merged) 10jenkins-bot: Change votewiki language to Ukrainian. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075990 (https://phabricator.wikimedia.org/T302443) (owner: 10Ahonc) [20:42:40] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1075990|Change votewiki language to Ukrainian. 
(T302443)]] [20:42:46] T302443: Undertake Wikimedia Ukraine 2022 AGM elections on SecurePoll - https://phabricator.wikimedia.org/T302443 [20:43:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:44:53] !log thcipriani@deploy2002 thcipriani, ahonc: Backport for [[gerrit:1075990|Change votewiki language to Ukrainian. (T302443)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:22] hello! I've been working today on reusing wikimedia's check_bacula.py script and found some bugs in the file that were preventing it from working correctly https://gitlab.torproject.org/tpo/tpa/team/-/issues/41633#note_3084475 I don't have an account to submit a patch on phabricator.wikimedia.org so I wonder if I could get those fixes across to somebody somehow [20:49:11] Ahonc: looking good on the test servers to me, I'll go ahead with the sync [20:49:23] yes [20:50:08] !log thcipriani@deploy2002 thcipriani, ahonc: Continuing with sync [20:51:20] LeLutin: you upload it to gerrit and anyone can create an account. Gerrit/Phab help is in #wikimedia-releng and questions about check_bacula are probably for #wikimedia-data-persistence [20:52:44] LeLutin: ^ those are good recommendations. We don't use phabricator for code review, only tasks. For code changes, you'd use gerrit. You can set up an account via https://www.mediawiki.org/wiki/Developer_account and then login to gerrit with those credentials. If you've not used gerrit, the docs are: https://www.mediawiki.org/wiki/Gerrit [20:55:22] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1075990|Change votewiki language to Ukrainian. 
(T302443)]] (duration: 12m 42s) [20:55:28] T302443: Undertake Wikimedia Ukraine 2022 AGM elections on SecurePoll - https://phabricator.wikimedia.org/T302443 [20:55:41] LeLutin: the steps are (1) create a developer account and ensure you've added your ssh key (2) git clone ssh://gerrit.wikimedia.org:29418/operations/puppet (3) make your changes (4) git push origin HEAD:refs/for/production [20:56:03] or create an account and use the Gerrit UI... :D [20:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:56:33] Reedy: magic [20:56:47] Ahonc: looks live to me! [20:57:16] ok [20:57:19] thanks [20:58:28] Ahonc: thank you, hopefully your first backport experience was mostly an easy one :) [20:58:49] :) [21:00:12] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:02:07] thcipriani: how are we looking? i was hoping to get our new scap release deployed if now is an ok time [21:02:17] dduvall: all clear [21:02:26] yay [21:03:11] !log dduvall@deploy2002 Installing scap version "4.106.0" for 212 hosts [21:03:41] RhinosF1:, thcipriani: ok thanks. I'll try to get around to doing this soon [21:04:29] <3 [21:04:51] LeLutin: :), shout in any of the channels if you need help. I'm in both. I'll probably be in bed soon though. In Europe base. 
[21:07:28] !log dduvall@deploy2002 Installation of scap version "4.106.0" completed for 212 hosts [21:08:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:18:50] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@f6ea258]: Deploying to get fix for datahub ingestion [21:19:58] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@f6ea258]: Deploying to get fix for datahub ingestion (duration: 01m 08s) [21:28:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:20] 10SRE-swift-storage, 06Wikimedia Enterprise: Commonswiki file File:Byway in Hoth Wood - geograph.org.uk - 6765054.jpg not found - https://phabricator.wikimedia.org/T375797#10181584 (10prabhat) [21:35:06] 10SRE-swift-storage, 06Wikimedia Enterprise: Commonswiki recently updated files not found - https://phabricator.wikimedia.org/T375797#10181602 (10prabhat) [21:53:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:58:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:03:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:13:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:28:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:29:33] !log dduvall@deploy2002 Installing scap version "4.107.0" for 212 hosts [22:33:43] !log dduvall@deploy2002 Installation of scap version "4.107.0" completed for 212 hosts [22:48:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:53:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:53:09] (PS3) Scott French: service_node_spec.pb: avoid use of merge_config [puppet] - https://gerrit.wikimedia.org/r/1076040 [22:53:09] (CR) Scott French: "Hey Giuseppe - I realize this was a while ago, but I suspect you might be familiar with service::node et al." [puppet] - https://gerrit.wikimedia.org/r/1076040 (owner: Scott French) [22:53:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:13:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:38:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076046 [23:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076046 (owner: TrainBranchBot) [23:41:04] FIRING: PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:41:17] !log silenced KubernetesAPILatency for 24h for resource=~"(blockaffinities|ipamblocks)" site="eqiad" 
(f33e2c42-d921-4686-a761-59d43ad14d84) - T369744 [23:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:23] T369744: wikikube-worker1240 to wikikube-worker1304 implementation tracking - https://phabricator.wikimedia.org/T369744 [23:43:35] (PS1) Dduvall: Define $wmgLBFactoryConfigCallback in offline mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1076047 [23:45:09] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:45:19] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:45:19] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:45:44] (CR) Ahmon Dancy: [C:+1] Define $wmgLBFactoryConfigCallback in offline mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1076047 (owner: Dduvall)