[00:00:03] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:05] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 
(expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:06] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:07] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:07] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:07] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:08] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:08] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:09] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2024-10-16 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:07:54] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:08:16] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077119 (owner: 10TrainBranchBot) [00:21:02] the above alerts are expected. we will start rolling out the renewed cert on Oct 7 [00:24:16] thanks [00:24:42] I am going to ACK them otherwise the other hosts will also start alerting [00:25:47] esams, magru, eqsin, drmrs are the ones using digicert [00:33:00] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:40:10] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235 (10phaultfinder) 03NEW [00:53:00] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:11:04] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10194495 (10Jhancock.wm) @colewhite your servers are good now. 
task is still open for sretest [01:15:10] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10194500 (10phaultfinder) [01:18:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2005.codfw.wmnet with OS bookworm [01:20:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [01:20:17] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10194503 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [02:38:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [02:40:37] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10194558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed... 
[02:59:30] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:28] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:31:17] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10194593 (10Papaul) @ayounsi I will soon be setting up interfaces and assigning them to VLAN's. I wanted to know if we are keeping the same pro... [03:42:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10194601 (10Papaul) @ayounsi @cmooney I have been working on the migration process and put together the proposal below. I also had a meeting with @Jgreen and @... 
[04:01:01] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10194641 (10Papaul) [04:08:09] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:33:00] FIRING: Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [04:39:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:53:00] RESOLVED: Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [04:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:20:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10194663 (10phaultfinder) [05:39:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:37] (03PS1) 
10WMDE-Fisch: Improve sub-ref check to avoid false positives [extensions/Cite] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077145 (https://phabricator.wikimedia.org/T376242) [05:53:02] (03PS2) 10WMDE-Fisch: Improve sub-ref check to avoid false positives [extensions/Cite] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077145 (https://phabricator.wikimedia.org/T376242) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:19:13] (03CR) 10Jelto: "Thanks Antoine for the response! Then this is blocked until we verify that Zuul (and other tools) are no longer using RSA." [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [06:34:03] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10194752 (10Aklapper) > the line mentioning that enwp.org is in widespread use. If it is (it'd be good to see some stats If nobody provides such stats I again propose to decline this task. Folks are welcome to use https... 
[06:42:29] (03PS1) 10Muehlenhoff: Remove access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1077240 [06:45:55] (03PS2) 10RhinosF1: signups: fix a typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 [06:47:32] (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 (owner: 10RhinosF1) [06:47:46] (03CR) 10Muehlenhoff: [C:03+2] Remove access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1077240 (owner: 10Muehlenhoff) [06:48:39] !log root@cumin2002 START - Cookbook sre.idm.logout Logging AndyRussG out of all services on: 706 hosts [06:48:55] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AndyRussG out of all services on: 706 hosts [06:49:30] !log root@cumin2002 START - Cookbook sre.idm.logout Logging AndyRussG out of all services on: 1497 hosts [06:50:26] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AndyRussG out of all services on: 1497 hosts [06:54:10] Np moritzm :) [06:54:27] (03CR) 10Slyngshede: [C:03+2] signups: fix a typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 (owner: 10RhinosF1) [06:54:32] (03CR) 10Slyngshede: [C:03+1] signups: fix a typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 (owner: 10RhinosF1) [06:54:38] (03CR) 10Slyngshede: [C:03+2] signups: fix a typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 (owner: 10RhinosF1) [06:54:40] (03CR) 10Slyngshede: [V:03+2 C:03+2] signups: fix a typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 (owner: 10RhinosF1) [06:56:57] (03Merged) 10jenkins-bot: signups: fix a typo [software/bitu] - 10https://gerrit.wikimedia.org/r/1077241 (owner: 10RhinosF1) [07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T0700) [07:00:05] sfaci and bpirkle: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:02:02] I'm here! [07:03:06] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] "👍" [extensions/Cite] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077145 (https://phabricator.wikimedia.org/T376242) (owner: 10WMDE-Fisch) [07:09:09] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp[3071-3072].esams.wmnet with reason: HW maintenance [07:09:25] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[3071-3072].esams.wmnet with reason: HW maintenance [07:09:37] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10194783 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94132346-5cb8-4ed8-b2f6-868a8962928b) set by vgutierrez@cumin1002 for 4:00:00 on 2 host(s) and their services with reaso... 
[07:32:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [07:35:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [07:37:37] (03PS1) 10AikoChou: ml-services: update ref-quality isvc in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 (https://phabricator.wikimedia.org/T372405) [07:48:00] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [07:49:22] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [07:49:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [07:50:16] (03PS2) 10AikoChou: ml-services: update ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 (https://phabricator.wikimedia.org/T372405) [07:56:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [07:56:30] Hi Amir1, is the deployment backport window going to happen? [07:57:33] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10194826 (10elukey) It failed with: ` failed to garbage collect: failed to mark: swift: swift: failed to retrieve tags unknown repository name... 
[08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T0800) [08:06:35] o/ [08:07:25] sfaci: I will deploy your patch [08:08:09] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:08:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: 10Santiago Faci) [08:08:28] Thanks hashar! [08:09:10] (03Merged) 10jenkins-bot: Metrics Platform monotable: Base stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074396 (https://phabricator.wikimedia.org/T373967) (owner: 10Santiago Faci) [08:10:11] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1074396|Metrics Platform monotable: Base stream configuration (T373967)]] [08:10:14] T373967: MPIC: Create Metrics Platform base stream configuration - https://phabricator.wikimedia.org/T373967 [08:12:53] !log hashar@deploy2002 hashar, sfaci: Backport for [[gerrit:1074396|Metrics Platform monotable: Base stream configuration (T373967)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:13:22] sfaci: my guess is there is nothing to test and we can proceed? [08:13:46] Yes, I already tested what I needed. The change is working. Thanks!! 
[08:16:03] !log hashar@deploy2002 hashar, sfaci: Continuing with sync [08:20:39] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1074396|Metrics Platform monotable: Base stream configuration (T373967)]] (duration: 10m 27s) [08:20:41] T373967: MPIC: Create Metrics Platform base stream configuration - https://phabricator.wikimedia.org/T373967 [08:21:48] FIRING: PuppetFailure: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:23:24] (03CR) 10JMeybohm: [C:03+2] kubernetes: Migrate taint node-role.kubernetes.io/master [puppet] - 10https://gerrit.wikimedia.org/r/1077036 (https://phabricator.wikimedia.org/T334234) (owner: 10JMeybohm) [08:24:19] Database servers in cluster30 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds. 
[08:24:22] hmm [08:24:31] I have no clue what that means :D [08:24:58] that started at 00:30 UTC [08:25:26] and the SAL is out of order [08:28:27] hashar: cluster30 is es6 [08:29:00] !log Restarted stashbot based on instructions at https://wikitech.wikimedia.org/wiki/Tool:Stashbot [08:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:09] volans: I'll file a task eventually [08:32:21] !log added the taint node-role.kubernetes.io/control-plane:NoSchedule to all k8s apiservers - T334234 [08:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:24] T334234: Migrate to node-role.kubernetes.io/control-plane label/taint - https://phabricator.wikimedia.org/T334234 [08:36:42] !log removed the label node-role.kubernetes.io/master and the taint node-role.kubernetes.io/master:NoSchedule from all k8s apiservers - T334234 [08:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.197s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:41:50] https://phabricator.wikimedia.org/T376249 [08:44:38] (03CR) 10JMeybohm: [C:04-1] "This needs a chart version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077090 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [08:45:02] (03CR) 10Jcrespo: [C:03+1] "change looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1077082 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [08:46:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.197s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:51:42] (03PS1) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) [08:52:00] (03CR) 10CI reject: [V:04-1] P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [08:52:03] (03PS2) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) [08:52:34] (03CR) 10JMeybohm: "AIUI the current plan is to use the apiservers IPs for now in the upstream DNS. This means we need to have port 53tcp/udp open as nodeport" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [08:54:23] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@3b76c68]: (no justification provided) [08:55:12] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@3b76c68]: (no justification provided) (duration: 00m 52s) [08:55:52] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1001.eqiad.wmnet [08:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:57:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1001.eqiad.wmnet [08:57:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org 
[08:57:26] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-ctrl1001.eqiad.wmnet [08:57:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-ctrl1001.eqiad.wmnet [08:58:01] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4173/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [08:58:38] (03PS1) 10Muehlenhoff: Remove obsolete ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/1077319 [09:00:12] (03CR) 10Elukey: "Nit on the chosen port, the rest looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:00:27] (03CR) 10Elukey: [C:03+1] Remove obsolete ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/1077319 (owner: 10Muehlenhoff) [09:00:50] I am running the train [09:00:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org [09:03:50] (03PS8) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [09:04:01] (03PS3) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) [09:04:29] (03CR) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:04:44] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/1077319 (owner: 10Muehlenhoff) [09:06:48] RESOLVED: PuppetFailure: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:07:48] (03CR) 10Giuseppe Lavagetto: [C:03+2] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [09:08:15] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4174/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:08:39] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp[3071-3072].esams.wmnet [09:08:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp[3071-3072].esams.wmnet [09:09:35] (03CR) 10Alexandros Kosiaris: Initial commit of containerd puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:11:42] (03PS1) 10Elukey: Add basic config for aux-k8s-etcd100[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/1077320 (https://phabricator.wikimedia.org/T376253) [09:13:08] !log repooling cp3071 and cp3072 after HW maintenance - T374986 [09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:10] T374986: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986 [09:13:56] (03CR) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:14:36] (03CR) 10Elukey: [C:03+1] P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:14:46] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077321 (https://phabricator.wikimedia.org/T375656) [09:14:48] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077321 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [09:14:49] (03PS1) 10Jelto: wikidata-query-gui: add new service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) [09:14:56] (03CR) 10Elukey: [C:03+2] Add basic config for aux-k8s-etcd100[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/1077320 (https://phabricator.wikimedia.org/T376253) (owner: 10Elukey) [09:15:30] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077321 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [09:15:36] (03CR) 10Slyngshede: [C:03+2] P:prometheus::ops add ircstream prometheus job. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077316 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:16:25] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl1004.eqiad.wmnet [09:16:26] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:16:37] !log elukey@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [09:16:50] !log elukey@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host aux-k8s-ctrl1004.eqiad.wmnet [09:17:49] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1004.eqiad.wmnet [09:17:55] !log jynus@cumin1002 dbctl commit (dc=all): 'Set es2024 to weight 10 as the rest of es-rw hosts T376249', diff saved to https://phabricator.wikimedia.org/P69443 and previous config saved to /var/cache/conftool/dbconfig/20241002-091754-jynus.json [09:17:58] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:17:58] T376249: Wikimedia\Rdbms\DBUnexpectedError: Database servers in cluster30 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds. 
- https://phabricator.wikimedia.org/T376249 [09:19:02] so the train dies [09:19:12] :( [09:19:14] caught by httpbb tests which fail on every mwdebug server [09:19:15] :D [09:19:18] too bad [09:19:25] * hashar takes a couple of quarters of vacation [09:19:56] those httpbb tests failing look genuine [09:21:12] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd1004.eqiad.wmnet - elukey@cumin1002" [09:22:35] (03PS1) 10JMeybohm: kubernetes::worker_containerd: Fix registry_auth hiera key [labs/private] - 10https://gerrit.wikimedia.org/r/1077323 (https://phabricator.wikimedia.org/T362408) [09:22:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd1004.eqiad.wmnet - elukey@cumin1002" [09:22:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:22:40] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1004.eqiad.wmnet on all recursors [09:22:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1004.eqiad.wmnet on all recursors [09:23:14] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd1004.eqiad.wmnet - elukey@cumin1002" [09:23:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd1004.eqiad.wmnet - elukey@cumin1002" [09:25:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10195025 (10phaultfinder) [09:25:28] (03PS1) 10Hashar: Revert "group1 to 1.43.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077325 
(https://phabricator.wikimedia.org/T375656) [09:25:30] (03CR) 10Hashar: [C:03+2] Revert "group1 to 1.43.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077325 (https://phabricator.wikimedia.org/T375656) (owner: 10Hashar) [09:25:36] (03PS2) 10Jelto: wikidata-query-gui: add new service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) [09:26:13] (03Merged) 10jenkins-bot: Revert "group1 to 1.43.0-wmf.25" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077325 (https://phabricator.wikimedia.org/T375656) (owner: 10Hashar) [09:26:37] (03CR) 10JMeybohm: [V:03+2 C:03+2] kubernetes::worker_containerd: Fix registry_auth hiera key [labs/private] - 10https://gerrit.wikimedia.org/r/1077323 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:27:59] (03PS18) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [09:28:27] (03CR) 10JMeybohm: Initial commit of containerd puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:28:41] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:30:09] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation/Advancement/Community Growth/Community Resources" "Wikimedia Foundation/Advancement/Community Growth/Community Resources and Partnerships" "Zabe" --reason "per request [[:phab:T376246|T376246]]" [09:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:12] T376246: Request to move translatable page: :meta:Wikimedia Foundation/Advancement/Community Growth/Community Resources - https://phabricator.wikimedia.org/T376246 [09:30:24] (03PS1) 
10Slyngshede: Revert "P:prometheus::ops add ircstream prometheus job." [puppet] - 10https://gerrit.wikimedia.org/r/1077326 [09:31:08] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to [php-1.43.0-wmf.24]" - T375656 [09:31:11] T375656: 1.43.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T375656 [09:32:04] I rolledback the train due to T376255 [09:32:05] T376255: 1.43.0-wmf.25 breaks donate.wikimedia.org - https://phabricator.wikimedia.org/T376255 [09:33:52] (03CR) 10Slyngshede: [C:03+2] Revert "P:prometheus::ops add ircstream prometheus job." [puppet] - 10https://gerrit.wikimedia.org/r/1077326 (owner: 10Slyngshede) [09:34:47] (03CR) 10Gmodena: [C:03+2] services: page-content-change-enrich: set deployment value. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076680 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [09:35:43] (03Merged) 10jenkins-bot: services: page-content-change-enrich: set deployment value. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076680 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [09:37:15] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-etcd1004.eqiad.wmnet with OS bullseye [09:37:29] MediaWiki\Config\GlobalVarConfig::get: undefined option: 'ContributionTrackingFundraiserMaintenance' [09:37:30] :) [09:40:13] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:09] (03PS1) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) [09:42:43] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:42:45] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:43:30] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [09:43:32] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:44:07] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:44:09] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:44:48] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4175/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:47:23] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-etcd1004.eqiad.wmnet with reason: host reimage [09:49:05] (03CR) 10Elukey: [C:03+1] P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:50:36] hashar: o/ is it ok if I rollout a mw-config change or the train is still ongoing? [09:50:56] (03CR) 10Muehlenhoff: P:prometheus::ops add ircstream prometheus job. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:50:56] I rolled it back [09:51:01] so yeah go ahead I guess [09:51:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-etcd1004.eqiad.wmnet with reason: host reimage [09:52:02] super anks [09:52:04] *thanks [09:52:16] I am off for lunch [09:52:24] (03CR) 10Elukey: [C:03+2] services: add irc2003 to the MW's network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077003 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:53:19] !log elukey@deploy2002 Started scap sync-world: Add irc2003 to the network policies [09:54:54] !log elukey@deploy2002 Finished scap sync-world: Add irc2003 to the network policies (duration: 02m 15s) [09:55:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by elukey@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077004 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:55:34] (03CR) 10Alexandros Kosiaris: [C:03+1] Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:56:10] (03Merged) 10jenkins-bot: Add irc2003 to the irc settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077004 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [09:56:34] !log elukey@deploy2002 Started scap sync-world: Backport for [[gerrit:1077004|Add irc2003 to the irc settings (T376014)]] [09:56:37] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [09:58:56] !log elukey@deploy2002 elukey: Backport for [[gerrit:1077004|Add irc2003 to the irc settings (T376014)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:59:00] !log elukey@deploy2002 elukey: 
Continuing with sync [09:59:34] (03PS12) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1000) [10:03:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-etcd1004.eqiad.wmnet with OS bullseye [10:03:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1004.eqiad.wmnet [10:03:45] !log elukey@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077004|Add irc2003 to the irc settings (T376014)]] (duration: 07m 11s) [10:03:48] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [10:04:17] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-etcd1005.eqiad.wmnet [10:04:19] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [10:09:47] (03PS3) 10Jelto: wikidata-query-gui: add new service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) [10:11:23] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd1005.eqiad.wmnet - elukey@cumin1002" [10:11:28] (03PS1) 10Arturo Borrero Gonzalez: cloudlb2004-dev: use insetup role and add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1077337 (https://phabricator.wikimedia.org/T370678) [10:13:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-etcd1005.eqiad.wmnet - elukey@cumin1002" [10:13:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:13:10] !log elukey@cumin1002 
START - Cookbook sre.dns.wipe-cache aux-k8s-etcd1005.eqiad.wmnet on all recursors [10:13:13] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-etcd1005.eqiad.wmnet on all recursors [10:13:40] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd1005.eqiad.wmnet - elukey@cumin1002" [10:13:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-etcd1005.eqiad.wmnet - elukey@cumin1002" [10:16:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1045 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:17:06] (03PS2) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) [10:17:34] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-etcd1005.eqiad.wmnet with OS bullseye [10:19:49] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4176/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:21:19] (03PS3) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) [10:21:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1045 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:23:32] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4177/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:27:14] (03PS4) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) [10:27:39] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-etcd1005.eqiad.wmnet with reason: host reimage [10:28:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1045 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:29:12] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4178/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:31:52] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-etcd1005.eqiad.wmnet with reason: host reimage [10:32:09] (03PS13) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [10:32:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:33:25] (03PS5) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) [10:35:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Cite] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077145 (https://phabricator.wikimedia.org/T376242) (owner: 10WMDE-Fisch) [10:35:28] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4179/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:37:05] (03CR) 10Effie Mouzeli: [C:03+2] openstack: Stop running Wikitech jobs on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077041 (https://phabricator.wikimedia.org/T292707) (owner: 10Majavah) [10:37:30] (03PS1) 10Zabe: Use wgDonationInterfaceFundraiserMaintenance [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077344 (https://phabricator.wikimedia.org/T376255) [10:38:10] (03PS5) 10Volans: sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) [10:38:10] (03CR) 10Volans: "This v1 is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [10:38:14] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: Stop monitoring Wikitech on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077040 (https://phabricator.wikimedia.org/T292707) (owner: 10Majavah) [10:40:07] (03PS6) 10Slyngshede: P:prometheus::ops add ircstream prometheus job. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) [10:42:07] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4180/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:43:47] (03PS1) 10Zabe: Drop WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/1077345 (https://phabricator.wikimedia.org/T371592) [10:43:58] (03CR) 10Slyngshede: [V:03+1] P:prometheus::ops add ircstream prometheus job. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:46:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-etcd1005.eqiad.wmnet with OS bullseye [10:46:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-etcd1005.eqiad.wmnet [10:48:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1045 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:53:38] (03PS2) 10Ladsgroup: mariadb: Remove specific wikitech grants [puppet] - 10https://gerrit.wikimedia.org/r/1077082 (https://phabricator.wikimedia.org/T376129) [10:53:42] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Remove specific wikitech grants [puppet] - 10https://gerrit.wikimedia.org/r/1077082 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [10:55:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10195390 (10phaultfinder) [10:55:27] (03PS2) 10Ladsgroup: mariadb: Remove wikitech firewall holes [puppet] - 10https://gerrit.wikimedia.org/r/1077083 
(https://phabricator.wikimedia.org/T376129) [10:55:33] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Remove wikitech firewall holes [puppet] - 10https://gerrit.wikimedia.org/r/1077083 (https://phabricator.wikimedia.org/T376129) (owner: 10Ladsgroup) [10:55:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1045 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:57:04] <_joe_> !log restarted rsyslog on kubernetes1045 [10:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:38] (03PS1) 10Muehlenhoff: Update template to latest upstream update [puppet] - 10https://gerrit.wikimedia.org/r/1077348 [11:00:05] mvolz: #bothumor I ïżœ Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1100). 
[11:00:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1045 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:03:18] (03PS1) 10Zabe: reverse-proxy: Drop all public ips except cloudweb2002-dev.codfw.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077349 (https://phabricator.wikimedia.org/T292707) [11:03:38] (03PS1) 10Brouberol: Upgrade airflow to 2.10.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077350 (https://phabricator.wikimedia.org/T373210) [11:08:45] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudlb2004-dev: use insetup role and add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1077337 (https://phabricator.wikimedia.org/T370678) (owner: 10Arturo Borrero Gonzalez) [11:09:32] (03PS1) 10Muehlenhoff: Create a separate repo component to use the sse mode in ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1077351 (https://phabricator.wikimedia.org/T376014) [11:15:22] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:prometheus::ops add ircstream prometheus job. [puppet] - 10https://gerrit.wikimedia.org/r/1077329 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:15:43] (03PS1) 10Effie Mouzeli: cloudweb: remove wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) [11:16:43] (03CR) 10Arturo Borrero Gonzalez: "LGTM. Maybe `profile::openstack::eqiad1::nutcracker` can be deleted too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) (owner: 10Effie Mouzeli) [11:17:03] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1077351 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:18:30] (03CR) 10Muehlenhoff: [C:03+2] Create a separate repo component to use the sse mode in ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1077351 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:19:56] (03CR) 10Muehlenhoff: [C:03+2] Update template to latest upstream update [puppet] - 10https://gerrit.wikimedia.org/r/1077348 (owner: 10Muehlenhoff) [11:21:03] (03CR) 10Effie Mouzeli: "I am not sure if nutcracker was used on cloudweb solely for wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) (owner: 10Effie Mouzeli) [11:21:04] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) (owner: 10Effie Mouzeli) [11:21:08] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10195454 (10jcrespo) [11:23:50] (03PS5) 10Arturo Borrero Gonzalez: openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) [11:23:56] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [11:27:47] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "per the comment a few lines above, it is only for mediawiki." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) (owner: 10Effie Mouzeli) [11:33:14] (03PS1) 10Zabe: labswiki: Reduce revision-slots expiry to 60s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077359 (https://phabricator.wikimedia.org/T376129) [11:35:54] (03PS2) 10Effie Mouzeli: cloudweb: remove wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) [11:36:06] (03PS1) 10Btullis: wikireplicas: Remove abuse_filter_log view [puppet] - 10https://gerrit.wikimedia.org/r/1077360 (https://phabricator.wikimedia.org/T375751) [11:36:37] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) (owner: 10Effie Mouzeli) [11:37:41] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4181/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077360 (https://phabricator.wikimedia.org/T375751) (owner: 10Btullis) [11:39:07] (03CR) 10Ladsgroup: "just do all of s6 together 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077359 (https://phabricator.wikimedia.org/T376129) (owner: 10Zabe) [11:44:04] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10195505 (10jcrespo) Preliminary incident report: https://wikitech.wikimedia.org/w... [11:50:00] (03PS1) 10Slyngshede: P:prometheus::ops Enable ircstream collection. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077363 (https://phabricator.wikimedia.org/T376014) [11:52:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4182/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077363 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:52:46] (03CR) 10Kosta Harlan: [C:03+1] wikireplicas: Remove abuse_filter_log view [puppet] - 10https://gerrit.wikimedia.org/r/1077360 (https://phabricator.wikimedia.org/T375751) (owner: 10Btullis) [11:52:54] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:prometheus::ops Enable ircstream collection. [puppet] - 10https://gerrit.wikimedia.org/r/1077363 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:52:57] (03PS2) 10Zabe: s6: Reduce revision-slots cache expiry to 60s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077359 (https://phabricator.wikimedia.org/T183490) [11:53:39] (03CR) 10Ladsgroup: [C:03+1] s6: Reduce revision-slots cache expiry to 60s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077359 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [11:53:42] (03PS1) 10Urbanecm: ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077364 (https://phabricator.wikimedia.org/T376124) [11:53:51] (03PS1) 10Urbanecm: ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1077365 (https://phabricator.wikimedia.org/T376124) [11:54:06] (03CR) 10Effie Mouzeli: [C:03+2] cloudweb: remove wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1077355 (https://phabricator.wikimedia.org/T371378) (owner: 10Effie Mouzeli) [11:55:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:55:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:57:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:58:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:00:47] (03CR) 10Btullis: [V:03+1 C:03+2] wikireplicas: Remove abuse_filter_log view [puppet] - 10https://gerrit.wikimedia.org/r/1077360 (https://phabricator.wikimedia.org/T375751) (owner: 10Btullis) [12:01:19] (03CR) 10JMeybohm: [C:03+1] wikidata-query-gui: add new service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:02:40] (03CR) 10JMeybohm: [C:03+2] Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:02:44] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:03:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:03:59] (03CR) 10Zabe: [C:03+2] s6: Reduce revision-slots cache expiry to 60s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077359 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [12:04:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:04:57] (03Merged) 10jenkins-bot: s6: Reduce revision-slots cache expiry to 60s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077359 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [12:05:28] !log zabe@deploy2002 Started scap sync-world: 
Backport for [[gerrit:1077359|s6: Reduce revision-slots cache expiry to 60s (T183490 T376129)]] [12:05:35] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [12:05:35] T376129: Database clean ups after migration of wikitech to production - https://phabricator.wikimedia.org/T376129 [12:05:59] (03PS1) 10Muehlenhoff: When enabling eventstreams install ircstream from component [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) [12:06:18] (03CR) 10CI reject: [V:04-1] When enabling eventstreams install ircstream from component [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:06:35] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.update-views [12:06:36] !log btullis@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=93) [12:06:50] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.update-views [12:07:15] about to make a scary schema change on s7 [12:08:04] (03PS2) 10Muehlenhoff: When enabling eventstreams install ircstream from component [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) [12:08:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:08:27] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [12:08:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [12:08:38] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [12:08:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [12:08:43] done [12:08:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:09:06] !log zabe@deploy2002 zabe: Backport for [[gerrit:1077359|s6: Reduce revision-slots cache expiry to 60s (T183490 T376129)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:09:42] !log zabe@deploy2002 zabe: Continuing with sync [12:11:07] (03CR) 10David Caro: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [12:11:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:11:51] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [12:12:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:13:39] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bookworm [12:14:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:14:19] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077359|s6: Reduce revision-slots cache expiry to 60s (T183490 T376129)]] (duration: 08m 50s) [12:14:23] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [12:14:23] T376129: Database clean ups after migration of wikitech to production - https://phabricator.wikimedia.org/T376129 [12:16:38] (03PS1) 10Daimona Eaytoy: beta: Drop $wgCampaignEventsShowEventInvitationSpecialPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077370 (https://phabricator.wikimedia.org/T373442) [12:17:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:17:42] (03PS1) 10Daimona Eaytoy: prod: Drop $wgCampaignEventsShowEventInvitationSpecialPages 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077371 (https://phabricator.wikimedia.org/T373442) [12:17:44] (03CR) 10David Caro: alertmanager: fix WMCS template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [12:17:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:18:33] (03CR) 10Slyngshede: [C:03+1] "LGTM, assuming that we update the config in a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:21:38] (03CR) 10Ladsgroup: "haven't tested it. These came to mind." [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [12:24:21] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.474s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:24:40] FIRING: KubernetesRsyslogDown: rsyslog on parse2019:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2019 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:29:03] (03CR) 10CI reject: [V:04-1] ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1077365 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [12:29:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.474s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:29:44] (03CR) 10Brouberol: [C:03+1] Upgrade airflow to 2.10.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077350 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [12:29:56] (03CR) 10Brouberol: [C:03+2] Upgrade airflow to 2.10.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077350 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [12:31:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:32:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:35:23] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [12:38:35] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10195673 (10aborrero) >>! In T375847#10187153, @cmooney wrote: > @aborrero the network assignment is incorrect also. > [[ https://netbo... [12:39:06] jouncebot: refresh [12:39:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [12:39:08] I refreshed my knowledge about deployments. 
[12:39:10] jouncebot: nowandnext [12:39:10] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [12:39:10] In 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1300) [12:39:16] I am sneaking https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FundraiserLandingPage/+/1077344 [12:39:19] (03CR) 10Hashar: [C:03+2] Use wgDonationInterfaceFundraiserMaintenance [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077344 (https://phabricator.wikimedia.org/T376255) (owner: 10Zabe) [12:39:40] RESOLVED: KubernetesRsyslogDown: rsyslog on parse2019:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=parse2019 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:39:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077344 (https://phabricator.wikimedia.org/T376255) (owner: 10Zabe) [12:40:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077344 (https://phabricator.wikimedia.org/T376255) (owner: 10Zabe) [12:42:27] (03Merged) 10jenkins-bot: Use wgDonationInterfaceFundraiserMaintenance [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077344 (https://phabricator.wikimedia.org/T376255) (owner: 10Zabe) [12:42:54] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077344|Use wgDonationInterfaceFundraiserMaintenance (T376255)]] [12:42:57] T376255: 1.43.0-wmf.25 breaks donate.wikimedia.org: MediaWiki\Config\ConfigException: 
MediaWiki\Config\GlobalVarConfig::get: undefined option: 'ContributionTrackingFundraiserMaintenance' - https://phabricator.wikimedia.org/T376255 [12:43:46] (03CR) 10CDanis: "That's totally enough. I suggested the Daemonset approach because AFAICT Alex/you were opposed to the need for SNAT and an extra hop, but" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [12:44:50] (03CR) 10Hashar: [C:03+2] "I am starting CI ahead of the deployment window." [extensions/Cite] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077145 (https://phabricator.wikimedia.org/T376242) (owner: 10WMDE-Fisch) [12:45:01] (03PS3) 10Muehlenhoff: When enabling eventstreams install ircstream from component [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) [12:45:15] !log hashar@deploy2002 hashar, zabe: Backport for [[gerrit:1077344|Use wgDonationInterfaceFundraiserMaintenance (T376255)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:45:18] !log hashar@deploy2002 hashar, zabe: Continuing with sync [12:46:16] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10195699 (10aborrero) I guess next bits to test with neutron would be to enable north-south traffic, meaning working on these two ticke... [12:46:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:47:07] (03CR) 10Muehlenhoff: "Yeah, sure. That's all WIP anyway." [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:47:50] (03CR) 10FNegri: [C:03+1] "Apparently this is expected (see https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-sre/20241002.txt)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [12:48:29] (03CR) 10Gmodena: [C:03+2] dse-k8s-services: content_history: update docker image. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077047 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [12:49:25] (03Merged) 10jenkins-bot: dse-k8s-services: content_history: update docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077047 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [12:49:55] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: keystone: dont add default security rules via wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/1075859 (https://phabricator.wikimedia.org/T375111) (owner: 10Arturo Borrero Gonzalez) [12:49:55] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077344|Use wgDonationInterfaceFundraiserMaintenance (T376255)]] (duration: 07m 01s) [12:49:58] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10195719 (10cmooney) >>! In T375847#10195699, @aborrero wrote: > I guess next bits to test with neutron would be to enable north-south... [12:49:58] T376255: 1.43.0-wmf.25 breaks donate.wikimedia.org: MediaWiki\Config\ConfigException: MediaWiki\Config\GlobalVarConfig::get: undefined option: 'ContributionTrackingFundraiserMaintenance' - https://phabricator.wikimedia.org/T376255 [12:51:31] (03CR) 10CDanis: "Yes, I was going to do it after the other change was merged." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1077090 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [12:51:49] (03PS1) 10Muehlenhoff: Remove deployment access group from cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1077374 (https://phabricator.wikimedia.org/T371383) [12:52:12] (03CR) 10JMeybohm: "Sorry, there where quite a number of different approaches floating around. Maybe some of them got confused here and there. I think it woul" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [12:52:36] (03Abandoned) 10CDanis: CoreDNS chart changes to serve outside the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077043 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [12:54:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077374 (https://phabricator.wikimedia.org/T371383) (owner: 10Muehlenhoff) [12:55:44] there are a ton of labswiki circuit breaking errors in logspam-watch / mediawiki-errors, is that a known issue? 
[12:56:08] not a spike, but a pretty steady distribution over the past ~12 hours [12:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:56:48] (03CR) 10Gmodena: [C:03+1] changeprop: Enable PCS pregeneration without restbase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [12:57:31] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: UEFI test [12:57:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: UEFI test [12:57:38] Lucas_WMDE: can you note that on phab? [12:57:43] definitely worth investigating [12:57:50] cdanis: yeah I just found T376249 [12:57:51] T376249: Wikimedia\Rdbms\DBUnexpectedError: Database servers in cluster30 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds. - https://phabricator.wikimedia.org/T376249 [12:57:57] I think the error just moved from cluster30 to s6 [12:57:57] ahh grat [12:57:58] (03CR) 10Effie Mouzeli: [C:03+2] Remove deployment access group from cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/1077374 (https://phabricator.wikimedia.org/T371383) (owner: 10Muehlenhoff) [12:58:03] I’ll reopen it [12:58:08] thanks! [12:58:24] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10195740 (10cmooney) At a high level I think we need to: * Create an aggregate policy on //cloudsw1-b1-codfw// to generate 2a02:ec80:a100::/48 if par... 
[12:59:22] !log upload python3-aiohttp-sse-client 0.2.1-0 to apt.wikimedia.org bookworm/ircstream-sse component (needed by the eventstream feature branch of ircstream) T376014 [12:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:25] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [12:59:51] (03PS1) 10FNegri: wikireplicas.update-views: clean up removed tables [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [12:59:55] (03PS1) 10CDanis: Bug: T344171 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077376 (https://phabricator.wikimedia.org/T344171) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1300). [13:00:05] Daimona, WMDE-Fisch, and hashar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:21] \o [13:00:40] I can deploy! 
[13:00:43] o/ [13:00:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076797 (https://phabricator.wikimedia.org/T373821) (owner: 10Daimona Eaytoy) [13:00:55] I'm deep in a meeting and will be half commuting in a bit so happy for anyone doing the job [13:00:59] thanks Lucas_WMDE [13:01:01] let’s start with zhwiki CampaignEvents [13:01:08] WMDE-Fisch: zuul predicts at least 15 more minutes for CI anyway [13:01:24] I guess I can see if I’m able to reproduce that error [13:01:27] I'll be on my phone :-) [13:01:36] (03CR) 10FNegri: [C:04-1] "Unfortunately this won't work together with the table filter, because the script does not accept --table and --clean together:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) (owner: 10FNegri) [13:01:41] Should be easy to reproduce [13:01:42] (03CR) 10Zabe: Remove deployment access group from cloudweb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077374 (https://phabricator.wikimedia.org/T371383) (owner: 10Muehlenhoff) [13:02:19] (03Merged) 10jenkins-bot: [zhwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076797 (https://phabricator.wikimedia.org/T373821) (owner: 10Daimona Eaytoy) [13:02:27] hmph, I don’t see any indentation on the reference [13:02:47] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1076797|[zhwiki] Enable the CampaignEvents extension (T373821)]] [13:02:55] (03PS2) 10CDanis: coredns: support NodePort & bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077376 (https://phabricator.wikimedia.org/T344171) [13:02:56] Must be on a group0 wiki [13:03:00] ah [13:03:01] T373821: Enable CampaignEvents Extension on zhwiki - https://phabricator.wikimedia.org/T373821 [13:03:25] (03CR) 10Ladsgroup: Remove deployment access group from cloudweb (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/1077374 (https://phabricator.wikimedia.org/T371383) (owner: 10Muehlenhoff) [13:03:46] (03CR) 10Majavah: [C:03+1] "thanks, I'll take care of this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077349 (https://phabricator.wikimedia.org/T292707) (owner: 10Zabe) [13:04:09] * Lucas_WMDE looks for testwiki pages with references [13:04:39] ok, can confirm [13:05:01] \o/ [13:05:08] o/ [13:05:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1076797|[zhwiki] Enable the CampaignEvents extension (T373821)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:06:40] Daimona: can you test on zhwiki? [13:07:13] looking [13:07:38] AFAICT sysops gain the campaignevents-delete-registration permission, which wasn’t mentioned on the task – I don’t know about CE to know if that’s okay or not [13:07:45] (the rest of the diff at https://zh.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups|restrictions&format=json&formatversion=2 looks as expected to me) [13:09:12] (03PS1) 10Ayounsi: sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) [13:09:49] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:09:50] (03PS1) 10Effie Mouzeli: site.pp: add mc-misc1x hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077378 (https://phabricator.wikimedia.org/T371987) [13:10:52] LGTM, Daimona is the wikimedia sepcific part of the extension coming later? [13:11:00] cdanis: did you do something about those errors? logspam-watch looks like they went away a few minutes ago [13:11:15] It seems to be working, although it's claiming that a RL module doesn't exist. I assume it's just the cache. Also @HouseOfM invitation lists are not enabled here. 
I think that's correct but I'll bring this up later with the team. [13:11:27] What part? It should all be there. [13:11:31] (03CR) 10Giuseppe Lavagetto: [C:03+1] site.pp: add mc-misc1x hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077378 (https://phabricator.wikimedia.org/T371987) (owner: 10Effie Mouzeli) [13:11:44] Lucas_WMDE: I did not, I'm still catching up on IRC and email and on my coffee consumption [13:11:45] I'm not seeing the community list, assuming that's also correct [13:11:52] cdanis: okay, just curious ^^ [13:11:56] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: add mc-misc1x hosts [puppet] - 10https://gerrit.wikimedia.org/r/1077378 (https://phabricator.wikimedia.org/T371987) (owner: 10Effie Mouzeli) [13:12:00] maybe it was one of the people listening on that task then [13:12:28] (03Merged) 10jenkins-bot: Improve sub-ref check to avoid false positives [extensions/Cite] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077145 (https://phabricator.wikimedia.org/T376242) (owner: 10WMDE-Fisch) [13:12:29] That's expected, community list is not enabled in any production wiki. The RL thing also fixed itself in the meantime. So, looking good AFAICT. [13:12:55] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [13:12:56] ok! [13:13:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:14:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10195834 (10jijiki) @Jclark-ctr very sorry for missing your message. I added the hosts in `site.pp`, there is already a 'mc*' reference in `preseed.yaml` to mat... 
[13:15:17] (03PS1) 10Hashar: Revert "Use wgDonationInterfaceFundraiserMaintenance" [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077380 (https://phabricator.wikimedia.org/T376255) [13:15:35] (03CR) 10CDanis: [C:03+2] coredns: improve debuggability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077090 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [13:16:10] !log upload ircstream 0.13.0~dev+wmf1 to apt.wikimedia.org bookworm/ircstream-sse component (seperate build using the experimental eventstream feature branch of ircstream) T376014 [13:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:13] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [13:17:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076797|[zhwiki] Enable the CampaignEvents extension (T373821)]] (duration: 14m 45s) [13:17:35] T373821: Enable CampaignEvents Extension on zhwiki - https://phabricator.wikimedia.org/T373821 [13:18:04] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1076905 (owner: 10Muehlenhoff) [13:18:17] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1077145|Improve sub-ref check to avoid false positives (T376242)]] [13:18:29] T376242: Adding a new ref using VE will indent it in reuse-search - https://phabricator.wikimedia.org/T376242 [13:18:50] (03Merged) 10jenkins-bot: coredns: improve debuggability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077090 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [13:20:29] !log lucaswerkmeister-wmde@deploy2002 wmde-fisch, lucaswerkmeister-wmde: Backport for [[gerrit:1077145|Improve sub-ref check to avoid false positives (T376242)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:21:00] looks fixed to me [13:21:08] WMDE-Fisch: want to test as well or should I go ahead? [13:21:24] (03CR) 10Muehlenhoff: [C:03+2] When enabling eventstreams install ircstream from component [puppet] - 10https://gerrit.wikimedia.org/r/1077367 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:21:38] zhwiki looks good. Thanks Lucas! 
[13:21:53] \o/ [13:21:53] (03PS2) 10FNegri: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [13:22:34] :) [13:24:19] !log lucaswerkmeister-wmde@deploy2002 wmde-fisch, lucaswerkmeister-wmde: Continuing with sync [13:25:22] (03PS8) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) [13:25:44] (03PS3) 10FNegri: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [13:27:27] Lucas_WMDE: I did not deploy the Cite patch https://gerrit.wikimedia.org/r/1077145 [13:27:40] I merely +2ed to hvae it merged for the deployment window [13:27:49] hashar: I’m deploying that one right now [13:27:56] sorry for the mess! [13:28:38] the other "Use wgDonationInterfaceFundraiserMaintenance" is one I deployed just before the window as it is a train blocker [13:28:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077145|Improve sub-ref check to avoid false positives (T376242)]] (duration: 10m 32s) [13:28:50] no problem, I was also planning to +2 it early and then I saw you’d already done it ^^ [13:28:50] and I am talking about it with fundraising team :) [13:28:53] T376242: Adding a new ref using VE will indent it in reuse-search - https://phabricator.wikimedia.org/T376242 [13:28:55] so you can ignore it :] [13:28:57] yeah, I saw you also uploaded a revert of that one? [13:29:00] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10195922 (10Dzahn) One domain/line more in an existing list like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1069643/12/modules/ncredir/files/nc_redirects.dat won't make a big difference either way. But we als... 
[13:29:00] ok, I’ll ignore it then [13:29:03] (03PS4) 10FNegri: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [13:29:06] (03PS1) 10Muehlenhoff: Add a separate role for sse-enabled ircstream and a Hiera option [puppet] - 10https://gerrit.wikimedia.org/r/1077385 (https://phabricator.wikimedia.org/T376014) [13:29:12] ah, and the Cite deploy finished in the meantime [13:29:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077385 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:29:44] ok, so nothing else to deploy right now AFAICT [13:31:53] !log UTC afternoon backport+config window done [13:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:18] I’m done, hashar feel free to deploy fundraising stuff when needed :) [13:33:17] Lucas_WMDE sorry was just commuting in a crowded metro ^^ [13:33:31] np ^^ [13:33:35] let me know if it needs a revert after all [13:33:39] but as far as I could tell the fix worked [13:34:01] (03PS42) 10Ssingh: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [13:34:01] But thanks all seems fine. Yes. [13:34:25] (03PS43) 10Ssingh: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [13:34:38] <_Gerges> Hi everyone, I have basics in wikimedia operations and good experience in php, is there an active group I can join? 
[13:35:10] (03PS5) 10FNegri: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [13:35:30] (03PS6) 10FNegri: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [13:35:53] (03PS1) 10Slyngshede: P:ircstream allow config to switch between UDP and SSE. [puppet] - 10https://gerrit.wikimedia.org/r/1077386 (https://phabricator.wikimedia.org/T376014) [13:36:02] Lucas_WMDE: I got the fundraising patch deployed before the window :) [13:36:13] so I guess the window has been successful [13:36:17] thank you for having handled the patches! [13:37:04] (03PS6) 10Volans: sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) [13:37:17] (03Abandoned) 10Hashar: Revert "Use wgDonationInterfaceFundraiserMaintenance" [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077380 (https://phabricator.wikimedia.org/T376255) (owner: 10Hashar) [13:37:22] (03CR) 10Volans: "Thanks for the first pass, replies inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [13:39:23] (03PS44) 10Ssingh: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [13:40:17] (03PS7) 10FNegri: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) [13:40:28] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:52] (03CR) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:42:31] (03CR) 10Volans: wikireplicas.update-views: add --clean arg (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) (owner: 10FNegri) [13:44:04] (03CR) 10FNegri: wikireplicas.update-views: add --clean arg (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) (owner: 10FNegri) [13:45:38] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add a separate role for sse-enabled ircstream and a Hiera option [puppet] - 10https://gerrit.wikimedia.org/r/1077385 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:45:46] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) (owner: 10FNegri) [13:48:32] (03PS1) 10Hashar: Remove Maintenance check [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077390 (https://phabricator.wikimedia.org/T376255) [13:48:47] (03PS2) 10Slyngshede: P:ircstream allow config to switch between UDP and SSE. [puppet] - 10https://gerrit.wikimedia.org/r/1077386 (https://phabricator.wikimedia.org/T376014) [13:50:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10196019 (10phaultfinder) [13:51:28] (03CR) 10Brouberol: [C:03+1] dse-k8s: add kube_env config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1077096 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [13:52:10] (03CR) 10Bking: [C:03+2] dse-k8s: add kube_env config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1077096 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [13:52:56] (03PS1) 10Ammarpad: logos: Sync config.yaml and logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077392 (https://phabricator.wikimedia.org/T374430) [13:54:01] (03CR) 10Muehlenhoff: [C:03+1] "One nit inline, LGTM otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/1077386 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [13:54:04] (03PS1) 10Slyngshede: P:ircstream Allow enabling eventstream as a datasource. [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) [13:54:24] (03CR) 10CI reject: [V:04-1] P:ircstream Allow enabling eventstream as a datasource. [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [13:55:48] (03PS2) 10Slyngshede: P:ircstream Allow enabling eventstream as a datasource. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) [13:56:07] (03PS1) 10DCausse: rdf-streaming-updater: use SSL to connect to kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077396 (https://phabricator.wikimedia.org/T333373) [13:56:08] (03CR) 10CI reject: [V:04-1] P:ircstream Allow enabling eventstream as a datasource. [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [13:56:28] (03PS1) 10Elukey: Add basic config for irc[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/1077397 (https://phabricator.wikimedia.org/T376014) [13:56:53] (03CR) 10Hashar: [C:03+2] Remove Maintenance check [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077390 (https://phabricator.wikimedia.org/T376255) (owner: 10Hashar) [13:58:26] (03CR) 10Brouberol: "I could only see a final issue to fix and we should be all good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:59:44] (03PS3) 10Slyngshede: P:ircstream Allow enabling eventstream as a datasource. [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) [14:00:04] (03CR) 10CI reject: [V:04-1] P:ircstream Allow enabling eventstream as a datasource. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1400) [14:00:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077390 (https://phabricator.wikimedia.org/T376255) (owner: 10Hashar) [14:00:17] (03CR) 10Elukey: [C:03+2] Add basic config for irc[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/1077397 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:00:33] I am doing a deployment [14:00:48] (03Merged) 10jenkins-bot: Remove Maintenance check [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077390 (https://phabricator.wikimedia.org/T376255) (owner: 10Hashar) [14:01:14] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077390|Remove Maintenance check (T376255)]] [14:01:17] T376255: 1.43.0-wmf.25 breaks donate.wikimedia.org: MediaWiki\Config\ConfigException: MediaWiki\Config\GlobalVarConfig::get: undefined option: 'ContributionTrackingFundraiserMaintenance' - https://phabricator.wikimedia.org/T376255 [14:01:46] (03PS4) 10Slyngshede: P:ircstream Allow enabling eventstream as a datasource. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) [14:02:00] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: eqiad: 1 VM for ircstream-sse - https://phabricator.wikimedia.org/T376282 (10elukey) 03NEW [14:02:09] (03PS1) 10Urbanecm: labswiki: Disallow account autocreation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077399 [14:02:11] (03PS1) 10Ammarpad: hawiki: Add temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077400 (https://phabricator.wikimedia.org/T376049) [14:02:52] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: eqiad: 1 VM for ircstream-sse - https://phabricator.wikimedia.org/T376282#10196070 (10elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MFree avg | DFree | DFree avg | +-------+--... [14:02:58] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4185/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [14:03:09] (03PS2) 10Urbanecm: labswiki: Disallow account autocreation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077399 (https://phabricator.wikimedia.org/T161859) [14:03:13] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host irc1004.wikimedia.org [14:03:15] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [14:03:17] Amir1: ^^ this should help :) [14:03:38] !log hashar@deploy2002 hashar: Backport for [[gerrit:1077390|Remove Maintenance check (T376255)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:03:38] !log hashar@deploy2002 Sync cancelled. 
[14:03:44] oh no [14:03:58] I have pressed the return key [14:04:05] :/ [14:04:14] :( [14:04:14] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077390|Remove Maintenance check (T376255)]] [14:04:16] i've been there [14:04:33] scap should flush stdin before invoking `input()` [14:04:43] or whatever Python built-in is used to read user input [14:05:22] but I am tired of filing bugs since I know I am unable to then move them forward [14:05:26] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4186/co" [puppet] - https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: Slyngshede) [14:05:41] (PS5) Slyngshede: P:ircstream Allow enabling eventstream as a datasource. [puppet] - https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) [14:05:54] I need to stop the habit of pressing return in my terminals [14:06:01] (PS1) JMeybohm: kubelet/containerd: Fix runc config and kubelet systemd unit [puppet] - https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) [14:06:05] thank you!
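Note on the "flush stdin before `input()`" suggestion above: a minimal sketch of the idea in Python, assuming a POSIX terminal. The function names `discard_pending_input` and `confirm` are hypothetical, and nothing here is taken from scap's actual code; how scap really reads its confirmation prompt is not shown in this log.

```python
import sys

try:
    import termios  # POSIX only; not available on Windows
except ImportError:
    termios = None


def discard_pending_input(stream=None):
    """Drop keystrokes typed before the prompt was shown.

    Returns True if the terminal input queue was flushed, False if the
    stream is not a terminal (or termios is unavailable) and nothing was done.
    """
    stream = sys.stdin if stream is None else stream
    if termios is None or not stream.isatty():
        return False
    # TCIFLUSH discards data received by the terminal but not yet read.
    termios.tcflush(stream.fileno(), termios.TCIFLUSH)
    return True


def confirm(prompt, read=input, stream=None):
    """Hypothetical prompt helper: flush stale input, then ask the question."""
    discard_pending_input(stream)
    answer = read(prompt).strip().lower()
    return answer in ("y", "yes")
```

With something like this, a Return pressed while the sync is still running would be discarded instead of being consumed as the answer to the next prompt. Input piped from a file or another process is left untouched, since the flush only applies to terminals.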
[14:06:22] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc1004.wikimedia.org - elukey@cumin1002" [14:06:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4187/console" [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [14:06:26] !log hashar@deploy2002 hashar: Backport for [[gerrit:1077390|Remove Maintenance check (T376255)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:28] (03CR) 10Ladsgroup: [C:03+1] "You're the best <3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077399 (https://phabricator.wikimedia.org/T161859) (owner: 10Urbanecm) [14:06:28] T376255: 1.43.0-wmf.25 breaks donate.wikimedia.org: MediaWiki\Config\ConfigException: MediaWiki\Config\GlobalVarConfig::get: undefined option: 'ContributionTrackingFundraiserMaintenance' - https://phabricator.wikimedia.org/T376255 [14:06:29] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:06:30] testing [14:06:37] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [14:07:05] !log hashar@deploy2002 hashar: Continuing with sync [14:07:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc1004.wikimedia.org - elukey@cumin1002" [14:07:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:07:54] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache irc1004.wikimedia.org 
on all recursors [14:07:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) irc1004.wikimedia.org on all recursors [14:08:00] (03PS2) 10JMeybohm: kubelet/containerd: Fix runc config and kubelet systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) [14:08:23] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc1004.wikimedia.org - elukey@cumin1002" [14:08:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc1004.wikimedia.org - elukey@cumin1002" [14:09:07] (03PS3) 10JMeybohm: kubelet/containerd: Fix runc config and kubelet systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) [14:09:42] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:11:42] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077390|Remove Maintenance check (T376255)]] (duration: 07m 27s) [14:11:45] T376255: 1.43.0-wmf.25 breaks donate.wikimedia.org: MediaWiki\Config\ConfigException: MediaWiki\Config\GlobalVarConfig::get: undefined option: 'ContributionTrackingFundraiserMaintenance' - https://phabricator.wikimedia.org/T376255 [14:12:05] I ll resume the train later tonight [14:12:08] (03CR) 10Elukey: "Left a nit on the systemd override, but everything looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:12:09] hashar: ok for me to deploy sth? 
[14:12:19] urbanecm: yes [14:12:26] (03CR) 10Urbanecm: [C:03+2] labswiki: Disallow account autocreation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077399 (https://phabricator.wikimedia.org/T161859) (owner: 10Urbanecm) [14:12:28] let's go :) [14:12:31] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host irc1004.wikimedia.org with OS bookworm [14:12:36] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18), 13Patch-For-Review: eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10196102 (10bking) ACK, I will go ahead and provision a VM at 16 GB . As you can see from [[ https://g... [14:12:45] I will promote group1 later tonight [14:13:07] afk [14:13:09] (03Merged) 10jenkins-bot: labswiki: Disallow account autocreation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077399 (https://phabricator.wikimedia.org/T161859) (owner: 10Urbanecm) [14:13:11] Amir1: and the other thing: T376284 [14:13:12] T376284: Running fetchFieldValues on a numerical column returns strings - https://phabricator.wikimedia.org/T376284 [14:13:28] hashar: i am out much of this morning for a dental appointment, but can probably handle a group1 promotion around 1pm local if needed. [14:14:02] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077399|labswiki: Disallow account autocreation (T161859)]] [14:14:05] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [14:14:36] Thank you! 
[14:16:04] (03PS4) 10JMeybohm: kubelet/containerd: Fix runc config and kubelet systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) [14:16:07] (03CR) 10Brouberol: wdqs-categories: introduce VM for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [14:16:21] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1077399|labswiki: Disallow account autocreation (T161859)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:16:27] (03CR) 10JMeybohm: kubelet/containerd: Fix runc config and kubelet systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:16:35] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:17:03] doesn't seem to work... [14:17:05] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:17:10] rolling out anyway, will look later [14:18:19] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10196133 (10ssingh) Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks! [14:19:27] (03CR) 10Elukey: [C:03+1] "Ship it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:19:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [14:20:12] (03CR) 10JMeybohm: [C:03+2] kubelet/containerd: Fix runc config and kubelet systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1077401 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:20:13] (03CR) 10Bking: wdqs-categories: introduce VM for testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [14:20:22] (03CR) 10Bking: [C:03+2] wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) (owner: 10Bking) [14:20:42] (03CR) 10Ammarpad: "Without this change, the logo script cannot be properly run for this new patch I84546b1, because those committed arwiki changes will be re" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077392 (https://phabricator.wikimedia.org/T374430) (owner: 10Ammarpad) [14:21:04] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10196144 (10nisrael) @Reedy following up on this. Any update on what could be causing the emails to direct to that inbox? 
[14:21:40] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077399|labswiki: Disallow account autocreation (T161859)]] (duration: 07m 38s) [14:21:43] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [14:21:48] cross-posting, as I just noticed that my deployment calendar entry was deleted: [14:21:49] friendly reminder that in about an hour (15:00 UTC) we'll be doing the last part of the switchover, where we repool eqiad for services (already repooled for traffic). [14:21:49] just to minimize the number of changes happening at once, it would be preferable if deployments wrap up before 15:00, until things stabilize (I'll post here when all-clear). [14:22:16] I see the entry at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1500 o_O [14:22:26] (but good to remind people anyway ^^) [14:22:30] jouncebot: next [14:22:30] In 2 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1700) [14:22:31] yup, I just added it back :) [14:22:34] aha ^^ [14:22:37] jouncebot: reload [14:22:43] jouncebot: refresh [14:22:43] I refreshed my knowledge about deployments. 
[14:22:44] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:22:45] that was it [14:22:46] jouncebot: next [14:22:47] In 0 hour(s) and 37 minute(s): Southward Datacenter Switchover: Services + Traffic (Day 8) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1500) [14:22:47] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on irc1004.wikimedia.org with reason: host reimage [14:22:48] yay [14:23:07] TIL, reload! thanks, Lusas_WMDE :) [14:23:33] I think refresh is the right one FWIW, I just misremembered it ^^ [14:25:32] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10196151 (10Reedy) No, I don't have any access to check into this. It was just unclear from the report where she was actually (incorrectly) recieving them. [14:26:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on irc1004.wikimedia.org with reason: host reimage [14:27:32] (03CR) 10Ssingh: "Test runs for both with --service and without look good. I bumped the repool interval to allow the recdns to catch up. 
" [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [14:28:18] (03CR) 10Hnowlan: [C:03+1] changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [14:28:39] (03CR) 10FNegri: [C:03+2] wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) (owner: 10FNegri) [14:29:22] (03CR) 10Giuseppe Lavagetto: [C:03+2] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto) [14:29:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bookworm [14:30:26] !log bking@cumin2002 START - Cookbook sre.ganeti.makevm for new host wdqs-categories1001.eqiad.wmnet [14:30:28] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:34:27] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10196171 (10elukey) Backtracking a little before proceeding further, there are some things that I don't fully grasp. In most of the docs that... 
[14:35:25] (03PS1) 10Majavah: snapshots: Dump wikitech (labswiki) like any other wiki [puppet] - 10https://gerrit.wikimedia.org/r/1077403 (https://phabricator.wikimedia.org/T292707) [14:36:52] (03PS1) 10JMeybohm: kubernetes::worker_containerd: Fix registry_auth hiera key [labs/private] - 10https://gerrit.wikimedia.org/r/1077404 (https://phabricator.wikimedia.org/T362408) [14:37:38] (03PS2) 10Majavah: snapshots: Dump wikitech (labswiki) like any other wiki [puppet] - 10https://gerrit.wikimedia.org/r/1077403 (https://phabricator.wikimedia.org/T292707) [14:38:14] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10196176 (10kamila) [14:38:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:48] FIRING: PuppetFailure: Puppet has failed on kubestage2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:40:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host irc1004.wikimedia.org with OS bookworm [14:40:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc1004.wikimedia.org [14:41:08] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10196186 (10nisrael) What would be next steps for investigating this? 
[14:41:09] (03PS1) 10JMeybohm: kubelet/containerd: Fix registry authentication [puppet] - 10https://gerrit.wikimedia.org/r/1077406 (https://phabricator.wikimedia.org/T362408) [14:41:24] (03CR) 10JMeybohm: [V:03+2 C:03+2] kubernetes::worker_containerd: Fix registry_auth hiera key [labs/private] - 10https://gerrit.wikimedia.org/r/1077404 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:41:28] (03Merged) 10jenkins-bot: wikireplicas.update-views: add --clean arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1077375 (https://phabricator.wikimedia.org/T375751) (owner: 10FNegri) [14:42:58] (03PS1) 10Ladsgroup: tables-catalog: Add tables for Translate extension [puppet] - 10https://gerrit.wikimedia.org/r/1077407 (https://phabricator.wikimedia.org/T363581) [14:44:48] RESOLVED: PuppetFailure: Puppet has failed on kubestage2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:44:49] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM wdqs-categories1001.eqiad.wmnet - bking@cumin2002" [14:45:04] (03PS2) 10Ladsgroup: tables-catalog: Add tables for Translate extension [puppet] - 10https://gerrit.wikimedia.org/r/1077407 (https://phabricator.wikimedia.org/T363581) [14:45:09] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add tables for Translate extension [puppet] - 10https://gerrit.wikimedia.org/r/1077407 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:45:46] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM wdqs-categories1001.eqiad.wmnet - bking@cumin2002" [14:45:47] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:47] !log bking@cumin2002 START - Cookbook 
sre.dns.wipe-cache wdqs-categories1001.eqiad.wmnet on all recursors [14:45:50] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs-categories1001.eqiad.wmnet on all recursors [14:46:18] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM wdqs-categories1001.eqiad.wmnet - bking@cumin2002" [14:46:23] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM wdqs-categories1001.eqiad.wmnet - bking@cumin2002" [14:46:47] 06SRE, 06Infrastructure-Foundations, 06serviceops: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285 (10elukey) 03NEW [14:50:05] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: eqiad: 1 VM for ircstream-sse - https://phabricator.wikimedia.org/T376282#10196209 (10elukey) 05Open→03Resolved a:03elukey [14:51:07] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs-categories1001.eqiad.wmnet with OS bullseye [14:51:18] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10196230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs-categories... 
[14:56:34] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:57:39] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077406 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:59:25] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10196267 (10Aklapper) Likely #Infrastructure-Foundations ( https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering#Infrastructure_Foundation... [14:59:30] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:00:05] swfrench-wmf: #bothumor I ♥ Unicode. All rise for Southward Datacenter Switchover: Services + Traffic (Day 8) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1500). 
[15:00:33] here o/ will be starting work shortly - just checking on a couple of things [15:00:40] (03PS1) 10JMeybohm: kubelet: Remove --pod-infra-container-image when using containerd [puppet] - 10https://gerrit.wikimedia.org/r/1077412 (https://phabricator.wikimedia.org/T362408) [15:00:50] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [15:00:53] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [15:02:40] (03CR) 10Elukey: [C:03+1] kubelet/containerd: Fix registry authentication [puppet] - 10https://gerrit.wikimedia.org/r/1077406 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [15:03:57] (03CR) 10JMeybohm: [C:03+1] coredns: support NodePort & bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077376 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:04:14] (03CR) 10CDanis: [C:03+2] coredns: support NodePort & bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077376 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:04:16] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:05:00] (03CR) 10JMeybohm: [C:03+2] kubelet/containerd: Fix registry authentication [puppet] - 10https://gerrit.wikimedia.org/r/1077406 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [15:05:17] (03Abandoned) 10CDanis: calico: add BGP communities to serviceExternalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075918 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:07:37] (03Merged) 10jenkins-bot: coredns: support NodePort & bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077376 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:07:39] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
logging-hd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:07:44] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - T370962 [15:07:47] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:09:58] (03CR) 10Cwhite: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:12:12] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1075598 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:12:13] !log cdanis@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:12:19] (03CR) 10Xcollazo: [C:03+1] snapshots: Dump wikitech (labswiki) like any other wiki [puppet] - 10https://gerrit.wikimedia.org/r/1077403 (https://phabricator.wikimedia.org/T292707) (owner: 10Majavah) [15:12:19] (03PS4) 10FNegri: alertmanager: fix WMCS template [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) [15:12:53] !log cdanis@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:13:43] !log dancy@deploy2002 Installing scap version "4.108.0" for 210 hosts [15:14:13] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on registry1004.eqiad.wmnet with reason: testing [15:14:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry1004.eqiad.wmnet with reason: testing [15:15:29] (03PS1) 10Fomafix: Use ?? 
instead of default value in getRawVal() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077417 (https://phabricator.wikimedia.org/T376245) [15:15:35] (03PS1) 10CDanis: coredns: enable nodePort 53 everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077418 (https://phabricator.wikimedia.org/T344171) [15:18:04] !log dancy@deploy2002 Installation of scap version "4.108.0" completed for 210 hosts [15:18:12] (03CR) 10FNegri: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [15:19:47] (03PS5) 10FNegri: alertmanager: fix WMCS template [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) [15:19:50] (03CR) 10FNegri: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [15:20:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10196320 (10phaultfinder) [15:22:23] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10196335 (10DLynch) > If nobody provides such stats I again propose to decline this task. Folks are welcome to use https://w.wiki/ instead. Not really comprehensive, but just scanning [these google results](https://www... 
[15:22:44] !log dancy@deploy2002 Started scap sync-world: Testing T370934 [15:22:50] T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions - https://phabricator.wikimedia.org/T370934 [15:23:04] (03CR) 10CDanis: [C:03+2] "https://phabricator.wikimedia.org/P69448" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077418 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:23:04] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [15:24:09] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [15:25:29] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10196352 (10nisrael) Sounds good. Anything I can do from my end to help? Will Infrastructure team be able to continue discussing in this phab? [15:26:04] !log dancy@deploy2002 Finished scap sync-world: Testing T370934 (duration: 03m 19s) [15:27:18] (03Merged) 10jenkins-bot: coredns: enable nodePort 53 everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077418 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:27:22] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Datacenter Switchover - T370962 [15:27:29] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [15:27:37] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10196360 (10violetwtf) Popping in to mention that I haven't spoken to Thomas since March 2023 when I first opened this thread. 
Happy to reach back out if WMF reaches a decision to take the domain though. As of then, h... [15:28:14] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@3a7901e]: T375153 [15:28:16] T375153: ETL pipeline for Automoderator daily monitoring metrics - https://phabricator.wikimedia.org/T375153 [15:28:25] !log swfrench@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [15:28:28] !log swfrench@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [15:30:09] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@3a7901e]: T375153 (duration: 01m 59s) [15:31:31] !log cdanis@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:31:47] !log cdanis@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:32:10] (03CR) 10Herron: [C:03+1] alert: Remove the alert[12]001 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:32:16] (03CR) 10Herron: [C:03+1] alert: Remove the alert[12]001 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1075598 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:32:30] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:44] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291 (10cmooney) 03NEW p:05Triage→03Medium [15:32:46] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - 
https://phabricator.wikimedia.org/T376291#10196414 (10cmooney) [15:33:23] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:33:38] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:34:46] !log cdanis@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:35:28] !log cdanis@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:35:53] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:36:00] !log cdanis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:36:31] !log cdanis@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:36:50] !log cdanis@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:36:51] !log cdanis@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:37:30] RESOLVED: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:34] !log cdanis@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:38:00] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:38:12] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:38:51] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196446 (10cmooney) [15:39:48] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196447 (10cmooney) [15:40:55] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196467 (10cmooney) [15:41:06] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [15:41:11] (03PS1) 10Msz2001: Revert "wikimaniawiki: Update logos to 2024" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077422 (https://phabricator.wikimedia.org/T376292) [15:41:27] (03PS6) 10FNegri: alertmanager: fix WMCS template [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) [15:41:50] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [15:42:16] (03CR) 10FNegri: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [15:42:24] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196463 (10Volans) Is there plan to try to get away from the very long hardcoded lists in hiera? How often do you expect the data to change? This mi... [15:43:24] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:43:54] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:45:21] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [15:45:30] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:09] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [15:47:11] (03CR) 10David Caro: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [15:48:08] (03PS2) 10Msz2001: Revert "wikimaniawiki: Update logos to 2024" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077422 (https://phabricator.wikimedia.org/T376292) [15:48:33] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10196558 (10aborrero) here is a proposal: * 2a02:ec80:a100:fe01::/64 - cr1-codfw uplink * 2a02:ec80:a100:fe02::/64 - cr2-codfw uplink * 2a02:ec80:a10... [15:49:03] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196562 (10cmooney) >>! In T376291#10196463, @Volans wrote: > Is there plan to try to get away from the very long hardcoded lists in hiera? I'm mor... 
[15:49:30] FIRING: [4x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:40] (03PS4) 10RLazarus: deployment_server: Print logs command when mwscript-k8s --attach fails [puppet] - 10https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) [15:49:46] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196564 (10cmooney) [15:50:29] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10196561 (10CDanis) >>! In T376291#10196463, @Volans wrote: > Is there plan to try to get away from the very long hardcoded lists in hiera? No idea... [15:51:18] (03CR) 10Andrea Denisse: "Yes, this is to be merged after running the cookbook. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1075598 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:52:52] alright, all services have been repooled as of 15:27 and mediawiki service have been at full load since ~ 15:20 without issue. I think that's all-clear :) [15:53:53] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298 (10WRai-WMF) 03NEW [15:54:30] RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:13] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1076893 (https://phabricator.wikimedia.org/T369142) (owner: 10RLazarus) [15:57:47] (03PS9) 10Andrea Denisse: alert: Remove the alert[12]001 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) [15:58:12] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10196613 (10RobH) [15:58:20] (03CR) 10Andrea Denisse: alert: Remove the alert[12]001 hosts as alertmanagers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [15:58:39] (03PS2) 10Andrea Denisse: alert: Remove the alert[12]001 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1075598 (https://phabricator.wikimedia.org/T372607) [15:59:25] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10196625 (10RobH) >>! In T373993#10196133, @ssingh wrote: > Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks! The panels we... [15:59:49] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10196610 (10RobH) 05Open→03Resolved Thanks @Vgutierrez for the assist, I was ready to go to bed and they took over supporting the remote tech doing the cpu thermal paste swaps. This is no... 
[16:01:09] (03CR) 10Andrea Denisse: [C:03+2] alert: Remove the alert[12]001 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [16:01:32] jouncebot: nowandnext [16:01:32] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [16:01:32] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1700) [16:01:51] (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077364 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [16:02:06] (03CR) 10Urbanecm: [C:03+2] ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1077365 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [16:03:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs-categories1001.eqiad.wmnet with OS bullseye [16:03:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host wdqs-categories1001.eqiad.wmnet [16:03:22] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.update-views [16:03:36] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10196649 (10ayounsi) No interface range as each switch will be independent. [16:03:59] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10196660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs-categories1001... 
[16:04:16] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10196661 (10Papaul) Thanks [16:06:19] jouncebot: nowandnext [16:06:19] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [16:06:19] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1700) [16:06:51] urbanecm: think I could sneak in a config patch before your backports merge? [16:06:58] taavi: go ahead [16:07:15] thanks [16:08:03] (03CR) 10Majavah: [C:03+2] reverse-proxy: Drop all public ips except cloudweb2002-dev.codfw.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077349 (https://phabricator.wikimedia.org/T292707) (owner: 10Zabe) [16:08:07] (03PS3) 10Majavah: snapshots: Dump wikitech (labswiki) like any other wiki [puppet] - 10https://gerrit.wikimedia.org/r/1077403 (https://phabricator.wikimedia.org/T292707) [16:08:09] (03CR) 10Ladsgroup: [C:03+2] snapshots: Dump wikitech (labswiki) like any other wiki [puppet] - 10https://gerrit.wikimedia.org/r/1077403 (https://phabricator.wikimedia.org/T292707) (owner: 10Majavah) [16:08:12] (03CR) 10Ladsgroup: [V:03+2 C:03+2] snapshots: Dump wikitech (labswiki) like any other wiki [puppet] - 10https://gerrit.wikimedia.org/r/1077403 (https://phabricator.wikimedia.org/T292707) (owner: 10Majavah) [16:08:49] (03Merged) 10jenkins-bot: reverse-proxy: Drop all public ips except cloudweb2002-dev.codfw.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077349 (https://phabricator.wikimedia.org/T292707) (owner: 10Zabe) [16:09:34] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1077349|reverse-proxy: Drop all public ips except cloudweb2002-dev.codfw.wmnet (T292707)]] [16:09:37] T292707: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 [16:10:11] (03PS1) 10Bking: wdqs-categories: use correct 
insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1077427 (https://phabricator.wikimedia.org/T376079) [16:11:46] !log taavi@deploy2002 zabe, taavi: Backport for [[gerrit:1077349|reverse-proxy: Drop all public ips except cloudweb2002-dev.codfw.wmnet (T292707)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:11:59] !log taavi@deploy2002 zabe, taavi: Continuing with sync [16:12:45] (03PS1) 10Bartosz Dziewoński: Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077429 [16:12:58] if y'all are doing wikitech config fixes
 how about that one? ^ [16:13:20] urbanecm: I have made a patch to have scap discard any previous input before prompting a user for a question: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/468 . We will see :] [16:13:24] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:13:37] hashar: that sounds lovely! [16:13:42] i definitely had troubles with that before :) [16:13:52] CI doesn't like it though? [16:15:03] yeah style check :D [16:15:15] I haven't tested the code though [16:15:18] but that is the idea [16:16:10] (03CR) 10Ladsgroup: [C:03+1] Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077429 (owner: 10Bartosz Dziewoński) [16:16:35] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077349|reverse-proxy: Drop all public ips except cloudweb2002-dev.codfw.wmnet (T292707)]] (duration: 07m 01s) [16:16:38] T292707: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 [16:19:30] MatmaRex: I'm happy to but I also don't want to delay urbanecm's backport which is currently in CI :D [16:19:41] (03PS1) 10Bartosz Dziewoński: logging: Remove unused global $wmgMonologProcessors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077431 [16:19:41] (03PS1) 10Bartosz Dziewoński: Remove references to removed wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077432 [16:20:53] :D [16:21:50] taavi: we still have 10 mins or so in CI [16:21:59] i think you'd make it [16:22:02] the last scap took 7 mins so that's enough :D [16:22:05] (03CR) 10Majavah: [C:03+2] Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077429 (owner: 10Bartosz Dziewoński) [16:22:18] (03CR) 10Ladsgroup: [C:03+1] Remove references to removed wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077432 (owner: 10Bartosz Dziewoński) [16:22:34] (03CR) 10Ladsgroup: [C:03+1] logging: Remove unused global $wmgMonologProcessors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077431 (owner: 10Bartosz Dziewoński) [16:22:37] (03PS2) 10Bartosz Dziewoński: logging: Remove unused global $wmgMonologProcessors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077431 [16:22:50] (03CR) 10Majavah: [C:03+2] logging: Remove unused global $wmgMonologProcessors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077431 (owner: 10Bartosz Dziewoński) [16:22:54] (03Merged) 10jenkins-bot: Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077429 (owner: 10Bartosz Dziewoński) [16:23:04] (03PS2) 10Bartosz Dziewoński: Remove references to removed wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077432 [16:23:06] (03CR) 10Majavah: [C:03+2] Remove references to removed wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077432 (owner: 10Bartosz Dziewoński) [16:23:48] (03Merged) 10jenkins-bot: logging: Remove unused global $wmgMonologProcessors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077431 (owner: 10Bartosz Dziewoński) [16:23:53] (03Merged) 10jenkins-bot: Remove references to removed wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077432 (owner: 10Bartosz Dziewoński) [16:24:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [16:24:28] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1077429|Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains]], [[gerrit:1077431|logging: Remove unused global 
$wmgMonologProcessors]], [[gerrit:1077432|Remove references to removed wikitech.php]] [16:24:51] (03Abandoned) 10Muehlenhoff: ircstream: Switch to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1076991 (owner: 10Muehlenhoff) [16:25:52] i'm away for a bit, but these wikitech patches should be fine without me. thanks for deploying them [16:26:46] !log taavi@deploy2002 matmarex, taavi: Backport for [[gerrit:1077429|Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains]], [[gerrit:1077431|logging: Remove unused global $wmgMonologProcessors]], [[gerrit:1077432|Remove references to removed wikitech.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:27:11] !log taavi@deploy2002 matmarex, taavi: Continuing with sync [16:27:29] !log Running the sre.hosts.decommission cookbook on the alert1001, and alert2001 hosts - T372607 [16:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:31] T372607: Decommission the alert1001 and alert1002 hosts - https://phabricator.wikimedia.org/T372607 [16:27:36] !log denisse@cumin2002 START - Cookbook sre.hosts.decommission for hosts alert[1001,2001].wikimedia.org [16:27:48] I will run the train later tonight [16:28:00] well in 93 minutes [16:30:02] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10196866 (10kamila) @seanleong-WMDE Can you please confirm you've read https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#User_responsibilities ? Thank you! 
[16:31:40] !log btullis@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [16:31:42] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077429|Add wikitech.wikimedia.org to $wgCrossSiteAJAXdomains]], [[gerrit:1077431|logging: Remove unused global $wmgMonologProcessors]], [[gerrit:1077432|Remove references to removed wikitech.php]] (duration: 07m 13s) [16:31:48] urbanecm: made it :D [16:32:17] congrats :) [16:32:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077364 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [16:32:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1077365 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [16:33:11] !log start extensions/GlobalUsage/maintenance/refreshGlobalimagelinks.php on labswiki to backfill global usage information [16:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:37] zeroth minute is the longest one [16:36:16] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10196919 (10elukey) Third day :) * irc2003 is now configured in MediaWiki and it is get... 
[16:36:31] (03CR) 10Majavah: [C:04-1] "This needs to wait until labtestwikitech is gone too :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1077345 (https://phabricator.wikimedia.org/T371592) (owner: 10Zabe) [16:38:54] !log denisse@cumin2002 START - Cookbook sre.hosts.decommission for hosts alert[1001,2001].wikimedia.org [16:39:19] (03PS1) 10Giuseppe Lavagetto: git::replicated_local_repo: use ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/1077437 [16:40:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10196970 (10kamila) This also requires approval of one of the group approvers. (Tagging with #data-engineering for now, will ping individually if needed.) [16:41:19] (03Merged) 10jenkins-bot: ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077364 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [16:41:40] (03CR) 10CI reject: [V:04-1] git::replicated_local_repo: use ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/1077437 (owner: 10Giuseppe Lavagetto) [16:44:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10196987 (10kamila) p:05Triage→03Medium a:03kamila [16:45:35] (03Merged) 10jenkins-bot: ReassignMentees: Add additional logging [extensions/GrowthExperiments] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1077365 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [16:46:05] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077364|ReassignMentees: Add additional logging (T376124)]], [[gerrit:1077365|ReassignMentees: Add additional logging (T376124)]] [16:46:08] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [16:46:09] 
!log denisse@cumin2002 START - Cookbook sre.dns.netbox [16:46:12] FIRING: JobUnavailable: Reduced availability for job icinga in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:48:18] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1077364|ReassignMentees: Add additional logging (T376124)]], [[gerrit:1077365|ReassignMentees: Add additional logging (T376124)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:49:27] !log denisse@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: alert[1001,2001].wikimedia.org decommissioned, removing all IPs except the asset tag one - denisse@cumin2002" [16:50:05] !log denisse@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: alert[1001,2001].wikimedia.org decommissioned, removing all IPs except the asset tag one - denisse@cumin2002" [16:50:05] !log denisse@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:06] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts alert[1001,2001].wikimedia.org [16:51:01] (03CR) 10Andrea Denisse: [C:03+2] alert: Remove the alert[12]001 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1075598 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [16:52:27] RESOLVED: JobUnavailable: Reduced availability for job icinga in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:56:11] !log urbanecm@deploy2002 urbanecm: Continuing with sync [16:56:38] 10ops-codfw, 10ops-eqiad, 06DC-Ops, 
10decommission-hardware, and 3 others: Decommission the alert1001 and alert1002 hosts - https://phabricator.wikimedia.org/T372607#10197095 (10andrea.denisse) a:05andrea.denisse→03None [16:57:28] (03PS1) 10Majavah: dumps: Stop fetching custom Wikitech dumps [puppet] - 10https://gerrit.wikimedia.org/r/1077440 [16:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:58:51] !log btullis@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1700) [17:00:48] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077364|ReassignMentees: Add additional logging (T376124)]], [[gerrit:1077365|ReassignMentees: Add additional logging (T376124)]] (duration: 14m 42s) [17:01:10] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [17:01:45] !log btullis@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet [17:02:01] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [17:02:01] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=93) on VRTS host vrts1003.eqiad.wmnet [17:03:05] (03PS1) 10Kamila Součková: analytics_privatedata_users: add ifeatunnaobiwmde [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) [17:03:47] (03CR) 10Kamila Součková: [C:04-2] "DNM, waiting for group approval and NDA" [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [17:04:55] (03PS1) 10AOkoth: vrts: make phab task 
optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1077443 [17:05:53] (03PS9) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) [17:06:21] (03CR) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [17:06:51] (03PS2) 10Kamila Součková: analytics_privatedata_users: add seanleong-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) [17:09:01] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10197148 (10kamila) [17:17:15] (03CR) 10Bking: [V:03+2 C:03+2] "self-merging in the interest of time (I have an unconfigured VM sitting around now)" [puppet] - 10https://gerrit.wikimedia.org/r/1077427 (https://phabricator.wikimedia.org/T376079) (owner: 10Bking) [17:18:11] (03CR) 10AOkoth: "This should not break anything so will just merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077443 (owner: 10AOkoth) [17:18:14] (03CR) 10AOkoth: [C:03+2] vrts: make phab task optional [cookbooks] - 10https://gerrit.wikimedia.org/r/1077443 (owner: 10AOkoth) [17:18:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:20:53] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [17:22:30] !log aokoth@cumin1002 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS 
host vrts1003.eqiad.wmnet [17:22:57] FIRING: CertAlmostExpired: Certificate for service lsw1-e7-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e7-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:27:57] FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-e7-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:41:55] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:57] FIRING: [3x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:45:00] (03CR) 10Brouberol: [C:03+1] "Perfect!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [17:47:57] FIRING: [4x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:50:10] (03PS1) 10RLazarus: scap: Add a deprecation warning to classic mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) [17:50:54] (03PS2) 10RLazarus: scap: Add a deprecation warning to classic mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) [17:55:11] (03Abandoned) 10Mforns: hieradata::services_proxy::envoy.yaml: enable data-gateway listener [puppet] - 10https://gerrit.wikimedia.org/r/1076784 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [17:57:57] FIRING: [5x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T1800) [18:00:26] GOOD MORNING MEDIAWIKI [18:01:13] Good morning! 
[18:02:17] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077452 (https://phabricator.wikimedia.org/T375656) [18:02:18] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077452 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [18:03:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [18:03:03] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077452 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [18:03:06] if my theory hold true, this train is going to explode with bunch of exceptions, errors and regressions [18:03:12] if not, that dismiss my theory [18:03:31] which is that devs refrain from writing bugs the week before I run the train [18:03:40] and me running it this week was a last minute change [18:04:01] which will help experiment the theory and see whether it is disproven [18:04:56] * hashar mumbles something about scientific method, proof, hypothesis etc [18:06:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [18:10:07] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.25 refs T375656 [18:10:15] T375656: 1.43.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T375656 [18:11:35] (03CR) 10Dzahn: [C:04-1] "generally looks good but the UID number seems wrong to me" [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [18:12:40] (03CR) 10Btullis: [C:03+1] airflow: automatically inject the configuration checksum annotation on deployments 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1076755 (https://phabricator.wikimedia.org/T375886) (owner: 10Brouberol) [18:13:03] (03CR) 10Dzahn: [C:04-1] analytics_privatedata_users: add seanleong-wmde (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [18:14:41] (03CR) 10Dzahn: "the compiler doesn't seem to like it" [puppet] - 10https://gerrit.wikimedia.org/r/1076905 (owner: 10Muehlenhoff) [18:15:30] PHP Warning: Invalid argument supplied for foreach() [18:16:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [18:16:06] (03CR) 10Dzahn: [C:04-1] "Need to wait for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076905 I think" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:16:19] * hashar blames parser cache / json [18:16:23] (03CR) 10Btullis: [C:03+1] "LGTM, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076668 (https://phabricator.wikimedia.org/T365024) (owner: 10Brouberol) [18:17:09] (03CR) 10Dzahn: [C:04-1] "yea, ack. we gotta wait for the test as described by Jelto" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [18:18:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10197359 (10KFrancis) Hi all, confirming I have an NDA on file for Sean, thanks! [18:20:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10197363 (10Dzahn) Thanks, Katie! 
@kamila When WMDE staff is onboarded it normally always comes with the LDAP groups `nda`... [18:21:39] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.9.1 - T376256 [18:21:51] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.9.1 - T376256 (duration: 00m 12s) [18:23:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [18:23:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [18:23:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev'] [18:24:58] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197377 (10Dzahn) >>! In T332220#10195922, @Dzahn wrote: > One more domain/line .. won't make a big difference I have to add an important part here. Redirecting a domain (for example the various typo domains) to the... [18:25:44] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197378 (10BCornwall) > Not really comprehensive, but just scanning [these google results](https://www.google.com/search?q=%22enwp.org%22) I see a decent amount of usage of it across a range of applications, not just e... [18:28:05] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197381 (10Dzahn) >>! In T332220#10196335, @DLynch wrote: > I see a number of academic papers using it. This could be seen as unfortunate but to me it's a very good pro argument to take it over and ensure it keeps wo... [18:33:06] (03CR) 10Ssingh: [C:03+1] "Ready to be merged." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:33:15] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197398 (10violetwtf) > Folks are welcome to use https://w.wiki/ instead. > But using it as an active URL shortener AND not breaking existing URLs that are already in use is a whole project that isn't that cheap. When... [18:33:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:33:46] (03CR) 10Muehlenhoff: "The failing node fails for entirely unrelated reasons" [puppet] - 10https://gerrit.wikimedia.org/r/1076905 (owner: 10Muehlenhoff) [18:40:01] (03CR) 10CDobbins: [C:03+2] sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:41:23] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197419 (10Dzahn) > enwp.org does not operate like w.wiki and covers a separate use-case. Thanks! That's an important distinction. Indeed, if it's possible to rewrite everything just with a simple rewrite rule from... 
[18:42:57] FIRING: [7x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:45:42] (03CR) 10Dzahn: [C:03+1] profile::envoy: When adding rules based on nftables check for empty ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/1076905 (owner: 10Muehlenhoff) [18:49:02] (03PS2) 10Scott French: services_proxy: sets_sni: true on data-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1077456 (https://phabricator.wikimedia.org/T368035) [18:52:52] (03Merged) 10jenkins-bot: sre.dns.roll-restart: add rolling restart script for DNS boxes [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:53:01] (03CR) 10RLazarus: [C:03+1] services_proxy: sets_sni: true on data-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1077456 (https://phabricator.wikimedia.org/T368035) (owner: 10Scott French) [18:56:53] (03CR) 10Scott French: "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1077456 (https://phabricator.wikimedia.org/T368035) (owner: 10Scott French) [18:57:26] (03CR) 10Scott French: [C:03+2] services_proxy: sets_sni: true on data-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1077456 (https://phabricator.wikimedia.org/T368035) (owner: 10Scott French) [19:05:29] (03CR) 10BCornwall: [C:03+2] P:cache::haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh) [19:06:56] !log brett@cumin2002 conftool action : set/pooled=no; selector: name=cp4041.ulsfo.wmnet [19:09:30] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197554 (10DLynch) Yeah, calling it a "shortener service" is very misleading really. In practice it's literally just a way to not have to type out "en.wikipedia.org/wiki" because you can replace it with "enwp.org". The... 
[19:13:49] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet [19:14:29] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197583 (10Dzahn) Just keep in mind that, as far as I can tell, we wouldn't want the combination where WMF owns the domain while it points to Thomas' servers. So I think it's either Thomas keeps running the service a... [19:15:50] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [19:16:12] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [19:19:41] (03PS1) 10C. Scott Ananian: Turn on Parsoid Selective Update metrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) [19:21:18] !log cumin -b11 "A:cp" "run-puppet-agent --enable 'rolling out 1038884'" [19:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:14] (03PS1) 10Dzahn: add project language 'ann' (Obolo) [dns] - 10https://gerrit.wikimedia.org/r/1077461 (https://phabricator.wikimedia.org/T376332) [19:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10197614 (10phaultfinder) [19:25:05] (03CR) 10Dzahn: [C:03+2] "Obolo (or Andoni) is a major Cross River language of Nigeria." 
[dns] - 10https://gerrit.wikimedia.org/r/1077461 (https://phabricator.wikimedia.org/T376332) (owner: 10Dzahn) [19:26:59] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host mc-misc1001.eqiad.wmnet with OS bookworm [19:27:00] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host mc-misc1002.eqiad.wmnet with OS bookworm [19:27:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host mc-misc1001.eqiad.wmnet with OS bookworm [19:27:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host mc-misc1002.eqiad.wmnet with OS bookworm [19:33:29] (03PS1) 10Reedy: signups_signup.html: Remove extra full stop [software/bitu] - 10https://gerrit.wikimedia.org/r/1077465 (https://phabricator.wikimedia.org/T376334) [19:34:20] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197663 (10BCornwall) Sorry for the confusion, and thanks for pointing out that the domain is a simple redirection and not a shortener. What a detail to miss! Indeed, this should be simple enough to fit into our infras... 
[19:34:24] 06SRE, 10Domains, 06Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197665 (10BCornwall) 05Open→03In progress [19:36:56] (03PS1) 10BCornwall: ncredir: Add enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:37:47] (03PS2) 10BCornwall: ncredir: Add enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:38:10] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid Selective Update metrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) (owner: 10C. Scott Ananian) [19:38:39] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc1002.eqiad.wmnet with reason: host reimage [19:38:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc1001.eqiad.wmnet with reason: host reimage [19:39:02] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4188/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:39:21] (03PS1) 10Reedy: InitialiseSettings.php: Fix comment about $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077467 [19:39:38] (03CR) 10Dzahn: "If this also covers the example "https://enwp.org/URL_shortening redirects to https://en.wikipedia.org/wiki/URL_shortening" then it's good" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:40:32] (03CR) 10Dzahn: "Note though how the "/wiki/" part needs to be added.. so we might need a rewrite rule after all." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:41:48] (03PS3) 10BCornwall: ncredir: Add enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:42:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc1002.eqiad.wmnet with reason: host reimage [19:43:03] (03CR) 10Dzahn: [C:03+1] ncredir: Add enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:43:57] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10197677 (10ssingh) Thanks for filing this task! 1. So it seems like there is a possibility that this list (or rather, these lists) can be maintaine... [19:45:36] (03CR) 10Dzahn: [C:03+2] "this was for https://phabricator.wikimedia.org/T375762" [puppet] - 10https://gerrit.wikimedia.org/r/1077102 (owner: 10EoghanGaffney) [19:45:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc1001.eqiad.wmnet with reason: host reimage [19:46:49] (03CR) 10Violetwtf: "Please also add a rule for https://c.enwp.org/$1 -> https://commons.wikimedia.org/$1 in order to maintain backwards compatibility." [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:46:51] (03CR) 10BCornwall: "ncredir does handle this - the latest PS will redirect as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:48:05] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4189/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:48:44] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197699 (10violetwtf) I've reached out to Thomas and will notify here if/when I get a reply. I've also made a WMF developer account to comment on Gerrit to ensure we support c.enwp.org, which was... [19:51:13] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10197703 (10ssingh) Re: the point regarding snippets and using `INCLUDE`: I think that's not optional anyway -- we have to keep only one `10.in-addr.... [19:52:21] (03PS4) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:53:04] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197705 (10BCornwall) I've updated the CR to include c.enwp.org. 
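Note on the redirect discussed above: the review comments describe two mappings, https://enwp.org/URL_shortening to https://en.wikipedia.org/wiki/URL_shortening (the "/wiki/" part is added) and https://c.enwp.org/$1 to https://commons.wikimedia.org/$1. A minimal Python sketch of that mapping follows. It is only an illustration of the behaviour described in the comments, not the ncredir configuration from the patch; the `REDIRECTS` table and the `redirect_target` helper are hypothetical names.

```python
from urllib.parse import urlsplit

# Hypothetical table built from the review comments in the log;
# this is not the actual ncredir rule set.
REDIRECTS = {
    "enwp.org": "https://en.wikipedia.org/wiki",
    "c.enwp.org": "https://commons.wikimedia.org",
}


def redirect_target(url):
    """Return the URL a request would be redirected to, or None if the host is unknown."""
    parts = urlsplit(url)
    base = REDIRECTS.get(parts.hostname or "")
    if base is None:
        return None
    # Keep the request path as-is; for enwp.org this is what adds "/wiki/".
    target = base + (parts.path or "/")
    if parts.query:
        target += "?" + parts.query
    return target


if __name__ == "__main__":
    print(redirect_target("https://enwp.org/URL_shortening"))
    print(redirect_target("https://c.enwp.org/Example"))
```

Keeping the path untouched and only swapping the host prefix matches the "simple rewrite rule" framing in the task comments; anything beyond that (for example special handling of the bare domain) is not covered by the log and is left out here.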
[19:54:32] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10197706 (10BCornwall) [19:55:18] (03PS5) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:55:47] (03PS6) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) [19:55:53] (03CR) 10BCornwall: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:56:53] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4190/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:57:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:58:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:58:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc1002.eqiad.wmnet with OS bookworm [19:58:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197709 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host mc-misc1002.eqiad.wmnet with OS bookworm completed: - mc-misc1002 (**PASS**)... [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T2000). [20:00:05] Ammar, Ammar, and bpirkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] I [20:00:16] I'm here [20:00:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:01:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:01:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc1001.eqiad.wmnet with OS bookworm [20:01:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host mc-misc1001.eqiad.wmnet with OS bookworm completed: - mc-misc1001 (**PASS**)... [20:01:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197726 (10Jclark-ctr) [20:02:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197730 (10Jclark-ctr) a:05jijiki→03Jclark-ctr [20:04:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10197727 (10Jclark-ctr) 05Open→03Resolved @jijiki thanks for updating. 
they are finished now [20:14:30] I have a pair of patches I'd like to throw into the window if there's time [20:14:38] it looks like the window isn't *too* full [20:15:59] cscott: don't think anyone is actually here to carry out the deploy yet so I'd say probably not [20:16:21] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10197756 (10Papaul) The recommended Junos version for srx300 is 23.4R2-S2 as of 2024-9-10. Are we going for version 23 or 22? [20:16:26] Ammar and bpirkle are both here though [20:16:42] Yeah. Mine is very non-urgent so I may end up just waiting until next week. [20:18:30] CI seems very unhappy at the moment [20:19:22] cscott: https://phabricator.wikimedia.org/T374830 [20:20:50] https://integration.wikimedia.org/zuul/ says some mediawiki/core tests are currently running and overall looks healthy. [20:21:19] DNS issues in wmcs of course unrelated.. if they are still happening that's not good [20:26:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:35:07] mutante: that's not what I'm seeing at all: https://phabricator.wikimedia.org/T374830#10197775 [20:37:08] 10SRE-swift-storage, 06Wikimedia Enterprise: Commonswiki recently updated files not found - https://phabricator.wikimedia.org/T375797#10197785 (10Tgr) >>! In T375797#10186429, @Pppery wrote: > The first image seems to have started existing somehow. Some job on the MediaWiki side got delayed? There are two... [20:37:10] cscott: oh, you mean "Exception: Error cloning"? that does indeed sound like what RhinosF1 linked to :/ [20:38:18] it seems to go in bursts, but when it happens it takes down almost every patch running in CI. 
[20:41:57] (03PS1) 10RLazarus: deployment_server: Give mwscript-k8s --verbose more granular options [puppet] - 10https://gerrit.wikimedia.org/r/1077475 (https://phabricator.wikimedia.org/T341553) [20:48:31] cscott: I assume we can say this deployment window is abandoned. Cc bpirkle who had a patch too. [20:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241002T2100) [21:04:32] 06SRE, 06Infrastructure-Foundations, 06Traffic: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10197890 (10cmooney) >>! In T376291#10197677, @ssingh wrote: > * It seem the network data in `dns_reverse_zones.yaml` and the corresponding reverse... [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:17] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10197945 (10Dzahn) This membership should not be needed to login at Gerrit and ToolForge. It should just work. Unless it's about being able to merge / +2 in Gerrit in some repos, then it's needed. 
[21:17:31] (03CR) 10Scott French: [C:03+1] deployment_server: Give mwscript-k8s --verbose more granular options [puppet] - 10https://gerrit.wikimedia.org/r/1077475 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:22:10] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:28] !log phab1004 - link=$(/usr/bin/readlink -f /srv/phab) ; /usr/bin/git config -f /etc/gitconfig.d/10-phab-deploy-safedir.gitconfig --add safe.directory $link ; /bin/cat /etc/gitconfig.d/*.gitconfig > /etc/gitconfig - T360756 [21:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:30] T360756: Make config page display version information - https://phabricator.wikimedia.org/T360756 [21:47:57] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:49:28] 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: "list has X moderation requests waiting" email should provide a link - https://phabricator.wikimedia.org/T374694#10198053 (10Aklapper) [21:51:25] FIRING: SystemdUnitFailed: wdqs-categories.service on wdqs-categories1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:58] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs-categories1001.eqiad.wmnet with reason: T375687 [21:55:00] T375687: Test categories performance under Ganeti - https://phabricator.wikimedia.org/T375687 [21:55:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs-categories1001.eqiad.wmnet with 
reason: T375687 [22:11:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:16:12] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:16:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:12] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:26:25] jouncebot: nowandnext [22:26:25] No deployments scheduled for the next 7 hour(s) and 33 minute(s) [22:26:25] In 7 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0600) [22:26:25] In 7 hour(s) and 33 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0600) [22:27:25] (03PS1) 
10Urbanecm: logging: Enable logging for debug GrowthExperiments events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077484 (https://phabricator.wikimedia.org/T376124) [22:27:39] (03CR) 10Urbanecm: [C:03+2] logging: Enable logging for debug GrowthExperiments events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077484 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [22:28:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077484 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [22:28:24] (03Merged) 10jenkins-bot: logging: Enable logging for debug GrowthExperiments events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077484 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [22:28:39] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10198161 (10cmooney) That seems fine to me @aborrero thanks! 
[22:28:53] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077484|logging: Enable logging for debug GrowthExperiments events (T376124)]] [22:28:55] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [22:35:45] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077484|logging: Enable logging for debug GrowthExperiments events (T376124)]] (duration: 06m 52s) [22:35:48] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [22:36:32] (03PS1) 10Cathal Mooney: Delegate Kubernetes POD IP reverse ranges to k8s control-plane nodes [dns] - 10https://gerrit.wikimedia.org/r/1077486 (https://phabricator.wikimedia.org/T376291) [22:41:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.564s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:46:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.564s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:51:50] (03PS1) 10Hamish: bjnwiki: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077487 (https://phabricator.wikimedia.org/T375055) [22:52:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10198288 (10VRiley-WMF) [22:53:58] 10ops-eqiad, 
06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10198291 (10VRiley-WMF) Location: D5 U31 CableID 2576 Port 30 [23:07:01] (03PS1) 10Hamish: bjnwiktionary: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077488 (https://phabricator.wikimedia.org/T374898) [23:29:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10198380 (10phaultfinder) [23:37:41] (03PS1) 10Urbanecm: Revert "logging: Enable logging for debug GrowthExperiments events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077494 (https://phabricator.wikimedia.org/T376124) [23:38:20] (03CR) 10Urbanecm: [C:03+2] Revert "logging: Enable logging for debug GrowthExperiments events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077494 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [23:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077495 [23:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077495 (owner: 10TrainBranchBot) [23:39:05] (03Merged) 10jenkins-bot: Revert "logging: Enable logging for debug GrowthExperiments events" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077494 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [23:39:50] 06SRE-OnFire, 06Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10198393 (10Dzahn) [23:39:52] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077494|Revert "logging: Enable logging for debug GrowthExperiments events" (T376124)]] [23:39:55] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - 
https://phabricator.wikimedia.org/T376124 [23:41:58] (03PS1) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [23:43:23] (03PS1) 10Cwhite: logstash: put logging-hd200[4-5] in service [puppet] - 10https://gerrit.wikimedia.org/r/1077498 (https://phabricator.wikimedia.org/T375447) [23:46:16] (03PS1) 10Cwhite: logstash: put logging-hd100[4-5] in service [puppet] - 10https://gerrit.wikimedia.org/r/1077499 (https://phabricator.wikimedia.org/T375447) [23:46:59] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077494|Revert "logging: Enable logging for debug GrowthExperiments events" (T376124)]] (duration: 07m 07s) [23:47:02] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [23:51:42] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077495 (owner: 10TrainBranchBot) [23:53:31] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway)