[00:11:47] PROBLEM - Host db1145 is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/933668
[00:38:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/933668 (owner: TrainBranchBot)
[00:54:34] (PS1) Jdlrobson: WIP: Update logos where logos are available [mediawiki-config] - https://gerrit.wikimedia.org/r/933691
[00:57:19] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/933668 (owner: TrainBranchBot)
[01:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:22:22] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: installing (but not registering) magnum-ui
[01:24:21] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: installing (but not registering) magnum-ui (duration: 01m 58s)
[01:35:27] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: installing (but not registering) magnum-ui
[01:37:48] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: installing (but not registering) magnum-ui (duration: 02m 20s)
[01:52:09] RECOVERY - Check systemd state on analytics1067 is OK: OK - running:
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:49] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:59] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:43] (PS4) Jameel Kaisar: Update mappings for subregions of CA/US based on the Probenet data [dns] - https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318)
[04:06:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:11:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:11:07] PROBLEM - Router
interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:42] (PS1) KartikMistry: Update MinT to 2023-06-28-034912-production [deployment-charts] - https://gerrit.wikimedia.org/r/933698
[05:49:07] (PS4) Hashar: Fix wm-custom-links to show links in footer again [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: Paladox)
[05:51:33] (CR) Hashar: [C: +1] "I am pretty sure I have tested it when splitting the feature to a standalone javascript file which would imply that got broken while doing" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: Paladox)
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T0600)
[06:27:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:37:06] (PS1) Marostegui: production-m5.sql.erb: Add dbproxy1027 [puppet] - https://gerrit.wikimedia.org/r/933847 (https://phabricator.wikimedia.org/T337812)
[06:39:20] (CR) Marostegui: [C: +2] production-m5.sql.erb: Add dbproxy1027 [puppet] - https://gerrit.wikimedia.org/r/933847 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[06:40:38] (PS1) Marostegui: wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/933848 (https://phabricator.wikimedia.org/T337812)
[06:41:12] (PS1) Marostegui: dbproxy1024: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/933849
(https://phabricator.wikimedia.org/T337812)
[06:41:51] (CR) Marostegui: [C: +2] wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/933848 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[06:41:58] (CR) Marostegui: [C: +2] dbproxy1024: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/933849 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[06:42:24] !log Failover m1-master to dbproxy1024 T337812
[06:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:29] T337812: Productionize dbproxy10[22-27] - https://phabricator.wikimedia.org/T337812
[06:48:59] (PS1) Marostegui: wmnet: Failover m2-master to dbproxy1025 [dns] - https://gerrit.wikimedia.org/r/933850 (https://phabricator.wikimedia.org/T337812)
[06:50:22] (CR) Muehlenhoff: [C: +2] Don't reboot Ganeti master nodes [cookbooks] - https://gerrit.wikimedia.org/r/933482 (https://phabricator.wikimedia.org/T203964) (owner: Muehlenhoff)
[06:54:10] (PS1) Marostegui: dbproxy1025: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/933851 (https://phabricator.wikimedia.org/T337812)
[06:54:54] (CR) Marostegui: [C: +2] dbproxy1025: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/933851 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[07:00:04] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:15] nothing to do indeed
[07:06:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet
[07:07:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet
[07:07:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[07:08:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet
[07:08:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[07:09:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet
[07:12:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[07:12:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet
[07:18:58] (PS1) Muehlenhoff: sre.ganeti.drain-node: Pass -f to evacuate command [cookbooks] - https://gerrit.wikimedia.org/r/933852 (https://phabricator.wikimedia.org/T203964)
[07:20:01] (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/932818 (owner: PipelineBot)
[07:21:48] (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/932818 (owner: PipelineBot)
[07:24:31] (CR) Muehlenhoff: [C: +2] sre.ganeti.drain-node: Pass -f to evacuate command [cookbooks] - https://gerrit.wikimedia.org/r/933852 (https://phabricator.wikimedia.org/T203964) (owner: Muehlenhoff)
[07:26:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[07:26:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node
(exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet
[07:30:35] (CR) Hashar: [C: +1] contint: replace Apache 2.2 access control syntax for Jenkins proxy [puppet] - https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[07:32:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet
[07:34:24] PROBLEM - puppet last run on build2001 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:39:30] RECOVERY - puppet last run on build2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:49:59] (PS3) D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/930798
[07:53:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet
[08:00:05] brennen and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T0800).
[08:00:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[08:00:29] ops-eqiad, DBA, Data-Persistence-Backup, database-backups: db1145 crashed - https://phabricator.wikimedia.org/T340610 (jcrespo)
[08:01:44] (CR) Jaime Nuche: [C: +1] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[08:01:45] ops-eqiad, DBA, Data-Persistence-Backup, database-backups: db1145 crashed - https://phabricator.wikimedia.org/T340610 (jcrespo) This is a different DIMM than the one at T258249. Could you check if the host is under warranty and either request a replacement or search for one on spares? Thank you.
[08:04:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
[08:07:12] RECOVERY - Host db1145 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[08:07:21] !log Failover m2-master to dbproxy1025 T337812
[08:07:23] (CR) Marostegui: [C: +2] wmnet: Failover m2-master to dbproxy1025 [dns] - https://gerrit.wikimedia.org/r/933850 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[08:07:24] PROBLEM - MariaDB Replica IO: s4 on db1145 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:25] T337812: Productionize dbproxy10[22-27] - https://phabricator.wikimedia.org/T337812
[08:07:28] PROBLEM - MariaDB read only s4 on db1145 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:07:46] PROBLEM - MariaDB read only s5 on db1145 is CRITICAL: Could not connect to localhost:3315
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:07:52] PROBLEM - MariaDB Replica IO: s5 on db1145 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:08:16] PROBLEM - MariaDB Replica Lag: s5 on db1145 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:08:34] PROBLEM - mysqld processes on db1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:08:34] PROBLEM - MariaDB Replica SQL: s4 on db1145 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:08:38] PROBLEM - MariaDB Replica SQL: s5 on db1145 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:08:44] (CR) Jaime Nuche: [C: +1] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[08:09:58] (PS1) Marostegui: db1145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/933856
[08:09:59] jynus: ^
[08:11:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
[08:11:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
[08:11:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
[08:12:02] (CR) Marostegui: [C: +2] db1145: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/933856 (owner: Marostegui)
[08:12:19] ops-eqiad, DBA, Data-Persistence-Backup,
database-backups: db1145 crashed - https://phabricator.wikimedia.org/T340610 (Marostegui) Disabled notifications via https://gerrit.wikimedia.org/r/c/operations/puppet/+/933856
[08:12:44] ops-eqiad, DBA, Data-Persistence-Backup, database-backups: db1145 crashed - https://phabricator.wikimedia.org/T340610 (Marostegui) p:Triage→Medium
[08:14:17] (PS1) Marostegui: wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/933857 (https://phabricator.wikimedia.org/T337812)
[08:15:16] (PS1) Marostegui: dbproxy1027: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/933858 (https://phabricator.wikimedia.org/T337812)
[08:15:32] (CR) Marostegui: [C: +2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/933857 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[08:15:40] !log Failover m5-master to dbproxy1027 T337812
[08:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:44] T337812: Productionize dbproxy10[22-27] - https://phabricator.wikimedia.org/T337812
[08:15:50] (CR) Marostegui: [C: +2] dbproxy1027: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/933858 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[08:17:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
[08:22:56] (PS1) Btullis: Re-enable the use of TLS for datahub's database connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/933859 (https://phabricator.wikimedia.org/T329514)
[08:23:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet
[08:23:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet
[08:24:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet
[08:25:24] (CR) Btullis: [C:
+2] Re-enable the use of TLS for datahub's database connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/933859 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:25:38] ACKNOWLEDGEMENT - MariaDB Replica IO: s4 on db1145 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:25:38] ACKNOWLEDGEMENT - MariaDB Replica IO: s5 on db1145 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:25:38] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db1145 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:25:38] ACKNOWLEDGEMENT - MariaDB Replica SQL: s4 on db1145 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:25:38] ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db1145 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:25:38] ACKNOWLEDGEMENT - MariaDB read only s4 on db1145 is CRITICAL: Could not connect to localhost:3314 Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:25:39] ACKNOWLEDGEMENT - MariaDB read only s5 on db1145 is CRITICAL: Could not connect to localhost:3315 Marostegui Known https://phabricator.wikimedia.org/T340610
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:25:39] ACKNOWLEDGEMENT - mysqld processes on db1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui Known https://phabricator.wikimedia.org/T340610 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:26:10] (Merged) jenkins-bot: Re-enable the use of TLS for datahub's database connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/933859 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:28:37] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/928560 (owner: Volans)
[08:28:42] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[08:35:26] SRE, SRE-Access-Requests: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (gengh)
[08:36:24] (CR) Filippo Giunchedi: profile::pyrra::filesystem: add profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:38:29] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42059/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan)
[08:40:34] (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:40:49] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[08:42:17] (CR) Filippo Giunchedi: profile::pyrra::api: create profile (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[08:45:34] (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:59:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
[09:01:17] PROBLEM - Host netflow4002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:05:29] RECOVERY - Host netflow4002 is UP: PING OK - Packet loss = 0%, RTA = 71.21 ms
[09:06:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
[09:06:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
[09:06:29] (CR) Muehlenhoff: [C: +2] nftables: Also write out empty sets if no ipv4 or ipv6 addresses are present [puppet] - https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:08:39] !log failover ganeti master in codfw to ganeti4008
[09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:14] !log depool cp4037 for some ATS tests
[09:09:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
[09:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:13:32] (PS2) Hnowlan: trafficserver: add lua script for gateway routing [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678)
[09:14:16] (CR) Hnowlan: trafficserver: add lua script for gateway routing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan)
[09:15:16] (PS1) EoghanGaffney: doc: Add option to quickdatacopy for --ignore-missing-args [puppet] - https://gerrit.wikimedia.org/r/933864
[09:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:17:22] (CR) Arturo Borrero Gonzalez: [C: +1] "Late LGTM :-)" [puppet] - https://gerrit.wikimedia.org/r/933462 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:19:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet
[09:20:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:29:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
[09:29:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti
node ganeti4005.ulsfo.wmnet
[09:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:33:10] (PS1) Elukey: role::cache::text: add ores-legacy.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414)
[09:35:13] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42060/console" [puppet] - https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[09:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:36:07] (CR) Elukey: role::cache::text: add ores-legacy.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[09:40:57] (PS1) Elukey: role::ml_cache::storage: use java 11 for Cassandra [puppet] - https://gerrit.wikimedia.org/r/933867
[09:41:53] jouncebot: nowandthen
[09:41:56] (PS3) JMeybohm: Update all charts to mesh.configuration 1.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/933392 (https://phabricator.wikimedia.org/T337405)
[09:42:29] jouncebot: nowandnext
[09:42:29] For the next 0 hour(s) and 17 minute(s):
MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T0800)
[09:42:30] In 0 hour(s) and 17 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1000)
[09:42:50] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42061/console" [puppet] - https://gerrit.wikimedia.org/r/933867 (owner: Elukey)
[09:44:42] (PS1) Elukey: Add ores-legacy.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/933869 (https://phabricator.wikimedia.org/T330414)
[09:45:57] (CR) Vgutierrez: [C: -1] "CR looks good, deployed TLS material on the new endpoint needs to be fixed:" [puppet] - https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[09:46:29] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 8 hosts with reason: Decommissioning
[09:46:46] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:46:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 8 hosts with reason: Decommissioning
[09:47:40] (CR) Vgutierrez: [C: +1] [beta] Update wgCdnServersNoPurge to remove unused cache servers [mediawiki-config] - https://gerrit.wikimedia.org/r/933463 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:49:02] (PS1) Elukey: admin_ng: add extra SANs to the ores-legacy's TLS config for ml-serve [deployment-charts] -
https://gerrit.wikimedia.org/r/933870 (https://phabricator.wikimedia.org/T330414)
[09:51:11] jouncebot: nowandnext
[09:51:11] For the next 0 hour(s) and 8 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T0800)
[09:51:12] In 0 hour(s) and 8 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1000)
[09:51:42] (CR) TrainBranchBot: [C: +2] "Approved by fabfur@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/933463 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:51:46] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:51:54] (CR) Elukey: [C: +2] admin_ng: add extra SANs to the ores-legacy's TLS config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/933870 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[09:52:30] (Merged) jenkins-bot: [beta] Update wgCdnServersNoPurge to remove unused cache servers [mediawiki-config] - https://gerrit.wikimedia.org/r/933463 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:55:42] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:57:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:57:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:58:05] SRE, Bitu, Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (SLyngshede-WMF)
[09:58:22] SRE, Bitu, Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (SLyngshede-WMF) Open→In progress
[09:58:24] SRE, Bitu, Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (SLyngshede-WMF)
[09:58:29] SRE, Bitu, Infrastructure-Foundations: Consider reusing some wiki data sources for signup/restrictions - https://phabricator.wikimedia.org/T320806 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1000)
[10:01:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[10:02:14] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:03:37] (CR) Elukey: role::cache::text: add ores-legacy.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[10:05:05] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:08:12] (CR) Vgutierrez: trafficserver: add lua script for gateway routing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan)
[10:11:01] (CR) Effie Mouzeli: [C: +1] Update all charts to mesh.configuration 1.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/933392 (https://phabricator.wikimedia.org/T337405) (owner: JMeybohm)
[10:11:34] !log repool cp4037
[10:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:35] (CR) Hnowlan: trafficserver: add lua script for gateway routing (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan)
[10:15:52] https://wiki.debian.org/Python/LibraryStyleGuide?action=show&redirect=Python%2FPackaging
[10:16:41] (PS1) Btullis: Enable the datahub setup jobs for mysql and elasticsearch [deployment-charts] - https://gerrit.wikimedia.org/r/933873 (https://phabricator.wikimedia.org/T329514)
[10:18:12] (PS1) Elukey: ml-services: add new FQDNs to ores-legacy's ingress config [deployment-charts] - https://gerrit.wikimedia.org/r/933874 (https://phabricator.wikimedia.org/T330414)
[10:19:02] (CR) Vgutierrez: [C: +1] trafficserver: add lua script for gateway routing [puppet] - https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan)
[10:20:57] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet
[10:21:40] !log disabling puppet on A:cp-text for testing 933508
[10:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:55] (PS2) Elukey: ml-services: add new FQDNs to ores-legacy's ingress config [deployment-charts] - https://gerrit.wikimedia.org/r/933874 (https://phabricator.wikimedia.org/T330414)
[10:22:57] (CR) Hnowlan: [C: +2] trafficserver: add lua script for gateway
routing [puppet] - 10https://gerrit.wikimedia.org/r/933508 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [10:23:48] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good, per https://cassandra.apache.org/doc/latest/cassandra/getting_started/java11.html Java 11 should be fully supported in 4.1." [puppet] - 10https://gerrit.wikimedia.org/r/933867 (owner: 10Elukey) [10:24:01] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_cache::storage: use java 11 for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/933867 (owner: 10Elukey) [10:24:59] (03CR) 10Elukey: [C: 03+2] ml-services: add new FQDNs to ores-legacy's ingress config [deployment-charts] - 10https://gerrit.wikimedia.org/r/933874 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:28:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [10:29:28] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll restart to pick up Java 11 - elukey@cumin1001 [10:31:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:32:35] (03CR) 10Btullis: [C: 03+2] Enable the datahub setup jobs for mysql and elasticsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/933873 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:32:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:21] (03Merged) 10jenkins-bot: Enable the datahub setup jobs for mysql and elasticsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/933873 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:34:10] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw [10:34:14] (03CR) 10JMeybohm: [C: 03+2] Update all charts to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933392 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [10:34:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [10:35:08] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw [10:36:32] (03PS1) 10Elukey: ml-services: fix the docker image name for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/933877 [10:37:10] (03PS1) 10Muehlenhoff: Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/933879 [10:37:31] (03CR) 10Elukey: [C: 03+2] ml-services: fix the docker image name for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/933877 (owner: 10Elukey) [10:38:20] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on A:cp-text_codfw [10:38:48] (03PS1) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['defaultEntityNamespaces'] to false 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/933880 (https://phabricator.wikimedia.org/T291617) [10:39:31] (03CR) 10CI reject: [V: 04-1] Set $wgWBRepoSettings['defaultEntityNamespaces'] to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933880 (https://phabricator.wikimedia.org/T291617) (owner: 10Lucas Werkmeister (WMDE)) [10:39:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [10:40:01] (03Merged) 10jenkins-bot: Update all charts to mesh.configuration 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933392 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [10:40:03] (03PS1) 10Hnowlan: Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/933633 [10:41:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:41:19] (03PS2) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['defaultEntityNamespaces'] to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933880 (https://phabricator.wikimedia.org/T291617) [10:42:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [10:42:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [10:42:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:42:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:44:26] (03PS1) 10Fabfur: cache: Setting port 80 as default redirection port in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/933881 [10:44:31] (03CR) 10Elukey: role::cache::text: add ores-legacy.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:44:41] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:47:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll restart to pick up Java 11 - elukey@cumin1001 [10:47:17] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll restart to pick up Java 11 - elukey@cumin1001 [10:47:47] (03CR) 10Lucas Werkmeister (WMDE): "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887) (owner: 10Daniel Kinzler) [10:47:49] (03CR) 10Elukey: [V: 03+1 C: 03+2] "@Eevans: all works fine afaics on ml-cache nodes!" [puppet] - 10https://gerrit.wikimedia.org/r/933867 (owner: 10Elukey) [10:47:51] (03Abandoned) 10Lucas Werkmeister (WMDE): Allow only properties on Special:EntitiesWithoutLabel and Special:EntitiesWithoutDescription. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887) (owner: 10Daniel Kinzler) [10:48:15] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42063/console" [puppet] - 10https://gerrit.wikimedia.org/r/933864 (owner: 10EoghanGaffney) [10:49:58] (03PS2) 10Fabfur: cache: Setting port 80 as default redirection port in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/933881 (https://phabricator.wikimedia.org/T323557) [10:50:22] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:50:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:50:44] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:51:15] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:51:31] !log Migrating to rsync::quickdatacopy for deployment servers - T289857 [10:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:35] T289857: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 [10:51:44] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [10:52:01] (03CR) 10CI reject: [V: 04-1] cache: Setting port 80 as default redirection port in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/933881 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [10:52:09] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw [10:52:59] (03CR) 10Vgutierrez: [C: 03+1] role::cache::text: add ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/933866 
(https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:53:39] (03PS1) 10Muehlenhoff: Add a systemd timer to cleanup cookbooks_testing [puppet] - 10https://gerrit.wikimedia.org/r/933882 [10:54:22] (03PS3) 10Fabfur: cache: Setting port 80 as default redirection port in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/933881 (https://phabricator.wikimedia.org/T323557) [10:55:41] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:57:10] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:57:27] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:57:49] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:58:13] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:02:34] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:02:46] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:04:43] (03PS2) 10Muehlenhoff: Add a systemd timer to cleanup cookbooks_testing [puppet] - 10https://gerrit.wikimedia.org/r/933882 [11:04:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll restart to pick up Java 11 - elukey@cumin1001 [11:04:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [11:05:11] (03CR) 10CI reject: [V: 04-1] Add a systemd timer to cleanup cookbooks_testing [puppet] - 10https://gerrit.wikimedia.org/r/933882 (owner: 10Muehlenhoff) [11:06:07] (03PS3) 10Muehlenhoff: Add a systemd timer to cleanup cookbooks_testing [puppet] - 10https://gerrit.wikimedia.org/r/933882 [11:07:32] (03PS7) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [11:08:00] !log Reverting migration to rsync::quickdatacopy for deployment 
servers - T289857 [11:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:05] T289857: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 [11:08:47] (03PS1) 10Clément Goubert: Revert "deployment: Use rsync::quickdatacopy, enable encryption" [puppet] - 10https://gerrit.wikimedia.org/r/933634 [11:09:19] (03CR) 10Clément Goubert: [C: 03+2] "Self +2 ing to revert" [puppet] - 10https://gerrit.wikimedia.org/r/933634 (owner: 10Clément Goubert) [11:09:22] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Revert "deployment: Use rsync::quickdatacopy, enable encryption" [puppet] - 10https://gerrit.wikimedia.org/r/933634 (owner: 10Clément Goubert) [11:11:02] 10SRE, 10serviceops: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Clement_Goubert) Reverted because rsync::quickdatacopy wants fqdns, we're giving it IPs, nothing gets deployed. I will prepare a fix and we can try again. [11:12:22] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42064/console" [puppet] - 10https://gerrit.wikimedia.org/r/933881 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [11:14:22] (03CR) 10Vgutierrez: [C: 03+1] cache: Setting port 80 as default redirection port in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/933881 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [11:16:37] (03CR) 10Vgutierrez: [C: 03+1] Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/933633 (owner: 10Hnowlan) [11:16:55] (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/933633 (owner: 10Hnowlan) [11:18:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet [11:25:41] (03PS1) 10Volans: scripts/interface: fix VM and bridge 
interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/933886 (https://phabricator.wikimedia.org/T340190) [11:25:43] (03PS1) 10Jbond: merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) [11:26:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/933879 (owner: 10Muehlenhoff) [11:28:11] (03CR) 10CI reject: [V: 04-1] merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:28:51] (03PS2) 10Jbond: merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) [11:30:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/933886 (https://phabricator.wikimedia.org/T340190) (owner: 10Volans) [11:30:24] (03PS2) 10Arturo Borrero Gonzalez: team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) [11:31:17] (03CR) 10CI reject: [V: 04-1] merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:31:42] (03CR) 10Volans: [C: 03+2] scripts/interface: fix VM and bridge interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/933886 (https://phabricator.wikimedia.org/T340190) (owner: 10Volans) [11:32:18] (03Merged) 10jenkins-bot: scripts/interface: fix VM and bridge interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/933886 (https://phabricator.wikimedia.org/T340190) (owner: 10Volans) [11:32:21] (03CR) 10CI reject: [V: 04-1] team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 
(https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [11:32:49] (03CR) 10Volans: [C: 04-1] "See inline, I don't think this should happen." [puppet] - 10https://gerrit.wikimedia.org/r/933882 (owner: 10Muehlenhoff) [11:33:15] (03PS3) 10Arturo Borrero Gonzalez: team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) [11:33:22] !log volans@cumin2002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:33:31] !log volans@cumin2002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox-canary [11:33:36] !log volans@cumin2002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:33:41] !log volans@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [11:35:01] (03CR) 10CI reject: [V: 04-1] team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [11:36:31] (03PS3) 10Jbond: merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) [11:36:33] (03PS7) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [11:40:09] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/933879 (owner: 10Muehlenhoff) [11:40:37] (03PS2) 10Muehlenhoff: Point eqiad URL downloaders to bullseye host [dns] - 10https://gerrit.wikimedia.org/r/933441 (https://phabricator.wikimedia.org/T329945) [11:42:10] 10SRE, 10serviceops: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Clement_Goubert) a:03Clement_Goubert [11:42:28] 
10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10Clement_Goubert) [11:42:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42067/console" [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:42:49] 10SRE, 10serviceops: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Clement_Goubert) 05Open→03In progress [11:46:46] (03PS1) 10Volans: test-cookbook: fix typos in help message [puppet] - 10https://gerrit.wikimedia.org/r/933891 [11:46:48] (03PS1) 10Volans: test-cookbook: do not run as root [puppet] - 10https://gerrit.wikimedia.org/r/933892 [11:47:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [11:47:12] (03CR) 10Volans: [C: 04-1] "I've sent I87f9fb6e65f5ad56fb3d1fdc8543a1f2bf08dda1" [puppet] - 10https://gerrit.wikimedia.org/r/933882 (owner: 10Muehlenhoff) [11:49:01] (03CR) 10Muehlenhoff: [C: 03+2] Point eqiad URL downloaders to bullseye host [dns] - 10https://gerrit.wikimedia.org/r/933441 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [11:52:24] (03PS4) 10Jbond: merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) [11:52:26] (03PS8) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [11:53:00] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:54:50] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [11:54:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [11:55:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [11:57:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42069/console" [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:58:33] (03PS5) 10Jameel Kaisar: Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) [11:58:47] (03PS9) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [11:59:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42070/console" [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:59:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] merge_cli: migrate puppetmasteres to module version of merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/933887 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:59:54] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:00:19] (03PS10) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [12:00:42] (03CR) 10CI reject: [V: 04-1] merge_cli: Make the paths a parameter and add them to a config file [puppet] - 
10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:02:06] (03PS11) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [12:02:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42071/console" [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:04:16] (03PS12) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [12:07:58] (03PS13) 10Jbond: merge_cli: Make the paths a parameter and add them to a config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) [12:08:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42072/console" [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:11:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [12:13:25] (03PS1) 10Jcrespo: dbbackups: Change s4 and s5 eqiad backup sources to db1150 and db1216 [puppet] - 10https://gerrit.wikimedia.org/r/933895 (https://phabricator.wikimedia.org/T340610) [12:14:16] (03PS4) 10Arturo Borrero Gonzalez: team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) [12:14:31] (03CR) 10JMeybohm: [C: 03+2] envoyproxy: Add type URL to http and listener filters [puppet] - 10https://gerrit.wikimedia.org/r/933112 (https://phabricator.wikimedia.org/T337405) (owner: 10JMeybohm) [12:14:42] (03CR) 10Jbond: [C: 03+2] merge_cli: Make the paths a parameter and add them to a 
config file [puppet] - 10https://gerrit.wikimedia.org/r/933422 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:16:20] (03PS1) 10Jbond: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/933897 [12:16:21] (03CR) 10CI reject: [V: 04-1] team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [12:18:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [12:18:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [12:21:01] (03CR) 10Jbond: [C: 03+2] test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/933897 (owner: 10Jbond) [12:21:35] (03PS1) 10Jbond: Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/933635 [12:23:41] (03CR) 10Jbond: [C: 03+2] Revert "test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/933635 (owner: 10Jbond) [12:29:19] (03PS5) 10Arturo Borrero Gonzalez: team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) [12:29:51] !log failover ganeti master in eqsin to ganeti5007 [12:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:44] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:38:39] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [12:40:19] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): puppet-merge: add new puppetserveres to puppet merge - https://phabricator.wikimedia.org/T340635 (10jbond) [12:40:21] 10SRE, 10Bitu, 10Infrastructure-Foundations: Forgot my username - 
https://phabricator.wikimedia.org/T340636 (10SLyngshede-WMF) [12:40:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppet-merge: add new puppetserveres to puppet merge - https://phabricator.wikimedia.org/T340635 (10jbond) p:05Triage→03Medium [12:41:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [12:43:15] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10Reedy) [12:44:04] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "update email" functionality - https://phabricator.wikimedia.org/T340637 (10SLyngshede-WMF) [12:44:49] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2027*} and A:cp [12:46:14] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2027*} and A:cp [12:47:03] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [12:47:33] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): investigate using cfssl to provide a itermediate certificate for puppetserver - https://phabricator.wikimedia.org/T339913 (10jbond) 05Open→03Resolved a:03jbond [12:47:35] (03PS1) 10Jbond: puppetserver: add labs private repo [puppet] - 10https://gerrit.wikimedia.org/r/933900 (https://phabricator.wikimedia.org/T340635) [12:47:44] 10SRE, 10Bitu, 10Infrastructure-Foundations: Figure out an HA setup for the IDM - https://phabricator.wikimedia.org/T320605 (10SLyngshede-WMF) 05Open→03Resolved p:05Triage→03Medium [12:47:46] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [12:48:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42074/console" [puppet] - 10https://gerrit.wikimedia.org/r/933900 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [12:48:56] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [12:49:00] 10SRE, 10Bitu, 10Infrastructure-Foundations: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [12:49:49] 10SRE, 10Infrastructure-Foundations: Determine which sender address to use for email notification - https://phabricator.wikimedia.org/T335091 (10SLyngshede-WMF) 05Open→03Resolved p:05Low→03Medium a:03SLyngshede-WMF Currently using noc@ but is configurable. [12:49:51] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [12:53:33] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10SLyngshede-WMF) Small note: We store both cn, with username capitalized, to be in compliance with mediaWiki, otherwise users could not be authenticated on wikitech. uid con... 
[12:53:44] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10SLyngshede-WMF) 05In progress→03Resolved p:05Low→03Medium a:03SLyngshede-WMF [12:53:46] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [12:53:58] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and not P{cp2027*} and A:cp [12:54:33] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/933669 [12:54:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add labs private repo [puppet] - 10https://gerrit.wikimedia.org/r/933900 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [12:58:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [12:58:35] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/933669 (owner: 10PipelineBot) [12:59:16] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/933669 (owner: 10PipelineBot) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1300) [13:00:05] aanzx and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:49] * TheresNoTime unavailable [13:01:52] TheresNoTime: rude [13:02:02] (03CR) 10Jbond: "PCC looks good to me https://puppet-compiler.wmflabs.org/output/933192/42073/" [puppet] - 10https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: 10Ahmon Dancy) [13:02:22] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10MoritzMuehlenhoff) That's not really resolved? There'll be plenty of additional attributes needed when we actually get to add the core attributes? [13:02:46] Lucas_WMDE: Do you need to test it? Or just deploy? :P [13:02:48] I’m also not available yet, will be back later [13:02:54] I would appreciate a review ;) [13:02:59] (unless it got one and I didn’t see it yet) [13:03:03] * Lucas_WMDE afk for a bit longer [13:03:08] I'll deploy it, just wondering if you care about testing it [13:03:29] * Reedy tries the new magic [13:03:41] (03PS5) 10Jbond: puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T340635) [13:03:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by reedy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933632 (https://phabricator.wikimedia.org/T340609) (owner: 10Anzx) [13:04:11] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:04:18] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:04:24] (03CR) 10Jbond: [C: 03+2] Add 'git_tag' argument to git::clone [puppet] - 10https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: 10Ahmon Dancy) [13:04:54] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:02] !log btullis@cumin1001 START - 
Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:05:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [13:05:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [13:05:10] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:05:17] (03Merged) 10jenkins-bot: eowikisource: Add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933632 (https://phabricator.wikimedia.org/T340609) (owner: 10Anzx) [13:05:45] !log reedy@deploy1002 Started scap: Backport for [[gerrit:933632|eowikisource: Add project namespace alias (T340609)]] [13:05:49] T340609: Add a new namespace alias on Esperanto wikisource - https://phabricator.wikimedia.org/T340609 [13:07:25] !log reedy@deploy1002 reedy and anzx: Backport for [[gerrit:933632|eowikisource: Add project namespace alias (T340609)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:09:51] (03CR) 10Jbond: [C: 03+2] "i have applied this and ran puppet on deploy1002 and contint1002 all seemed fine" [puppet] - 10https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: 10Ahmon Dancy) [13:09:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3003.esams.wmnet [13:09:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and not P{cp2027*} and A:cp [13:10:20] (03CR) 10Jbond: [C: 03+2] puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/931275 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [13:10:46] Reedy: tested, looks good [13:10:59] aanzx: I've been doing this long enough I'm just deploying it ;) 
[13:11:13] Thanks Reedy [13:11:42] (03PS3) 10Reedy: Set $wgWBRepoSettings['defaultEntityNamespaces'] to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933880 (https://phabricator.wikimedia.org/T291617) (owner: 10Lucas Werkmeister (WMDE)) [13:12:29] !log add puppetserver to puppet-merge [13:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:04] !log reedy@deploy1002 Finished scap: Backport for [[gerrit:933632|eowikisource: Add project namespace alias (T340609)]] (duration: 08m 18s) [13:14:08] T340609: Add a new namespace alias on Esperanto wikisource - https://phabricator.wikimedia.org/T340609 [13:14:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by reedy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933880 (https://phabricator.wikimedia.org/T291617) (owner: 10Lucas Werkmeister (WMDE)) [13:15:20] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:15:36] (03CR) 10Fabfur: [V: 03+1 C: 03+2] cache: Setting port 80 as default redirection port in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/933881 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:15:57] (03Merged) 10jenkins-bot: Set $wgWBRepoSettings['defaultEntityNamespaces'] to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933880 (https://phabricator.wikimedia.org/T291617) (owner: 10Lucas Werkmeister (WMDE)) [13:16:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3003.esams.wmnet [13:16:23] !log reedy@deploy1002 Started scap: Backport for [[gerrit:933880|Set $wgWBRepoSettings['defaultEntityNamespaces'] to false (T291617)]] [13:16:25] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:16:27] T291617: WikibaseRepo and WikibaseClient should not require loading default or example 
settings files - https://phabricator.wikimedia.org/T291617 [13:16:54] * Lucas_WMDE back [13:17:13] Reedy: thanks! [13:17:21] the only thing to test on mwdebug is that nothing blows up ^^ [13:17:25] jbond: just did a `puppet-merge` on puppetmaster1001 and seems all fine (jfyi) [13:17:28] but I can test it when scap backport is ready [13:17:29] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10jnuche) [13:17:35] Lucas_WMDE: it's on its way to debug hosts [13:17:56] !log reedy@deploy1002 reedy and lucaswerkmeister-wmde: Backport for [[gerrit:933880|Set $wgWBRepoSettings['defaultEntityNamespaces'] to false (T291617)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:17:59] fabfur: thanks :) [13:18:03] or there now [13:18:28] * Lucas_WMDE looks [13:19:57] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:20:04] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:20:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3003.esams.wmnet [13:20:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3003.esams.wmnet [13:20:04] Reedy: everything working fine as far as I can tell [13:20:09] (unsurprisingly ^^) [13:20:22] famous last words [13:22:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3002.esams.wmnet [13:22:14] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] secrets: ssh: remove instance-puppet-user key [labs/private] - 10https://gerrit.wikimedia.org/r/875407 (owner: 10Majavah) [13:22:40] (03PS2) 10Reedy: Revert "Add to verify Mastodon account on mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926819 (owner: 
10Legoktm) [13:23:22] 10SRE, 10Traffic, 10envoy, 10serviceops: Refactor envoy.filters.http.router and envoy.filters.listener.tls_inspector - https://phabricator.wikimedia.org/T337405 (10JMeybohm) 05Open→03Resolved [13:23:31] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [13:23:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:25:24] (03PS6) 10Arturo Borrero Gonzalez: team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) [13:25:33] If anyone else has anything mw-config wise they want deploying... [13:25:43] !log reedy@deploy1002 Finished scap: Backport for [[gerrit:933880|Set $wgWBRepoSettings['defaultEntityNamespaces'] to false (T291617)]] (duration: 09m 19s) [13:25:47] T291617: WikibaseRepo and WikibaseClient should not require loading default or example settings files - https://phabricator.wikimedia.org/T291617 [13:25:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by reedy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926819 (owner: 10Legoktm) [13:25:56] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:26:04] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:49] (03Merged) 10jenkins-bot: Revert "Add to verify Mastodon account on mediawiki.org" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/926819 (owner: 10Legoktm) [13:27:15] !log reedy@deploy1002 Started scap: Backport for [[gerrit:926819|Revert "Add to verify Mastodon account on mediawiki.org"]] [13:27:33] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:27:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:28:07] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:28:11] sudo cumin 'A:dns-auth' 'disable-puppet "merging CR 926509"' [13:28:13] !log sudo cumin 'A:dns-auth' 'disable-puppet "merging CR 926509"' [13:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:25] (03CR) 10Ssingh: [C: 03+2] gdnsd: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/926509 (owner: 10Muehlenhoff) [13:28:43] !log reedy@deploy1002 legoktm and reedy: Backport for [[gerrit:926819|Revert "Add to verify Mastodon account on mediawiki.org"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:29:34] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppet-merge: add new puppetserveres to puppet merge - https://phabricator.wikimedia.org/T340635 (10jbond) We need to get the ssh key for the puppet servers to the puppet masters. 
[13:30:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [13:30:23] (03PS4) 10Reedy: Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:31:17] slyngs, Amir1: ^^ around? [13:31:23] (03PS5) 10Reedy: Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:31:29] vgutierrez: Yes [13:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3002.esams.wmnet [13:32:06] (03CR) 10Reedy: "Looks like some of it was already removed... 
Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:32:20] (03CR) 10Reedy: [C: 03+2] Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:33:21] (03PS1) 10Jbond: Revert "puppetserver: Add new puppet server to block" [puppet] - 10https://gerrit.wikimedia.org/r/933636 [13:33:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppetserver: Add new puppet server to block" [puppet] - 10https://gerrit.wikimedia.org/r/933636 (owner: 10Jbond) [13:33:49] (03Merged) 10jenkins-bot: Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737379 (owner: 10Matthias Mullie) [13:34:08] (03PS3) 10Reedy: Make $wgAccountCreationThrottle an array. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814176 (owner: 10Daniel Kinzler) [13:34:21] (03CR) 10Reedy: [C: 03+2] Make $wgAccountCreationThrottle an array. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814176 (owner: 10Daniel Kinzler) [13:34:54] (03PS2) 10Reedy: CentralAuth: Remove config that was deprecated in 1.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751740 (owner: 10Kosta Harlan) [13:35:01] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:01] (03Abandoned) 10Reedy: CentralAuth: Remove config that was deprecated in 1.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751740 (owner: 10Kosta Harlan) [13:35:03] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:35:04] (03Merged) 10jenkins-bot: Make $wgAccountCreationThrottle an array. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/814176 (owner: 10Daniel Kinzler) [13:35:09] (03PS1) 10Muehlenhoff: Remove urldownloader role from old buster servers [puppet] - 10https://gerrit.wikimedia.org/r/933904 (https://phabricator.wikimedia.org/T329945) [13:35:41] (03CR) 10Paladox: [C: 03+1] "@hashar works for me. Tested locally. I'm happy with this." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: 10Paladox) [13:35:57] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:36:06] !log reedy@deploy1002 Finished scap: Backport for [[gerrit:926819|Revert "Add to verify Mastodon account on mediawiki.org"]] (duration: 08m 51s) [13:36:08] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:36:34] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:36:43] (03PS4) 10Reedy: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697110 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [13:36:46] (03PS5) 10Reedy: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697110 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [13:37:56] !log remove puppetserver from puppet-merge [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:49] (03PS1) 10Aklapper: AVA: Make score.php not fail with Fatal Error after libphutil removal [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/933907 (https://phabricator.wikimedia.org/T340633) [13:38:54] !log 
sudo cumin 'A:dns-auth' 'enable-puppet "merging CR 926509"' [13:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:59] vgutierrez: I'm ooo. Alex is covering for rest of the week (and I take over next week) [13:39:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3002.esams.wmnet [13:39:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3002.esams.wmnet [13:39:15] Amir1: ack [13:39:26] (03PS2) 10Reedy: Remove configuration which is the same as the extension's default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [13:39:27] Why sirenbot is saying I'm oncall. Sigh [13:39:27] seems like sirenbot is toasted? [13:39:33] !log failover ganeti master in esams to ganeti3003 [13:39:33] (03CR) 10Reedy: [C: 03+2] Remove configuration which is the same as the extension's default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [13:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:55] https://usercontent.irccloud-cdn.com/file/SsKDwKJk/grafik.png [13:39:59] !oncall-now sre [13:40:00] Oncall now for team sre, rotation business_hours: [13:40:00] s.lyngs, A.mir1 [13:40:12] (03PS3) 10Reedy: Use sendemail limit instead of emailuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879864 (owner: 10Daniel Kinzler) [13:40:18] (03CR) 10Jaime Nuche: [C: 03+1] releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [13:40:23] (heh I like the anti-ping measure for sirenbot :p) [13:40:28] (03Merged) 10jenkins-bot: Remove configuration which is the same as the extension's default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [13:40:32] Amir1: it looks like sirenbot thinks 
you're oncall roday [13:40:34] *today [13:40:38] Did I mess up the override? [13:40:39] (03Abandoned) 10Reedy: group0 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874925 (https://phabricator.wikimedia.org/T325580) (owner: 10TrainBranchBot) [13:40:53] Amir1: and the same for the screenshot you've pasted [13:40:56] (03Abandoned) 10Reedy: build: Pin php-code-coverage so it doesn't dirty the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [13:41:00] (03CR) 10Hashar: [C: 03+2] "Cool!" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: 10Paladox) [13:41:04] (03Restored) 10Reedy: build: Pin php-code-coverage so it doesn't dirty the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [13:41:07] (03CR) 10Reedy: "needs rebasing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [13:41:32] (03PS4) 10Reedy: Use sendemail limit instead of emailuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879864 (owner: 10Daniel Kinzler) [13:41:36] (03CR) 10Reedy: [C: 03+2] Use sendemail limit instead of emailuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879864 (owner: 10Daniel Kinzler) [13:41:42] (03Merged) 10jenkins-bot: Fix wm-custom-links to show links in footer again [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/932641 (https://phabricator.wikimedia.org/T340372) (owner: 10Paladox) [13:41:51] (03PS4) 10Reedy: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 (owner: 10Phuedx) [13:41:56] https://usercontent.irccloud-cdn.com/file/leKGvtgG/cfc08690-0d3b-4299-86bf-5432984b2910.jpeg [13:42:20] (03Merged) 10jenkins-bot: Use sendemail limit instead of emailuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879864 (owner: 10Daniel Kinzler) 
[13:42:21] !log hashar@deploy1002 Started deploy [gerrit/gerrit@1ae182f]: Fix wm-custom-links to show links in footer again - T340372 [13:42:24] (03CR) 10Klausman: [C: 03+1] Add ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/933869 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:42:25] T340372: Custom Gerrit footer not displayed - https://phabricator.wikimedia.org/T340372 [13:42:29] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@1ae182f]: Fix wm-custom-links to show links in footer again - T340372 (duration: 00m 08s) [13:42:32] PROBLEM - ganeti-wconfd running on ganeti3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:42:45] vgutierrez: the screenshot is for yesterday [13:42:51] (03PS5) 10Reedy: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 (owner: 10Phuedx) [13:42:55] (03CR) 10Reedy: [C: 03+2] beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 (owner: 10Phuedx) [13:42:58] (03CR) 10Klausman: [C: 03+1] role::cache::text: add ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:43:38] (03Merged) 10jenkins-bot: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 (owner: 10Phuedx) [13:43:40] phabricator.wikimedia.org appears to be down [13:43:47] (03CR) 10Reedy: "needs manual rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [13:44:01] akosiaris: ^ [13:44:06] (03PS2) 10Reedy: InitialiseSettings.php: Change termbox url for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:44:16] (03CR) 10CI reject: [V: 
04-1] InitialiseSettings.php: Change termbox url for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914274 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [13:44:21] The alert is resolved [13:44:30] wfm [13:44:46] :/ [13:44:50] Afk for real [13:44:50] I just commented on https://phabricator.wikimedia.org/T340633 <- I think maybe that's related [13:45:03] Lucas_WMDE: -1 on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/914274 is that just sticky/not removed yet? [13:45:12] (03PS3) 10Reedy: private: Add readme.FatalErrorSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904670 (owner: 10Krinkle) [13:45:16] (03CR) 10Reedy: [C: 03+2] private: Add readme.FatalErrorSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904670 (owner: 10Krinkle) [13:45:20] * Lucas_WMDE looks [13:45:22] !help Phabricator is down/ [13:45:22] You're not allowed to perform this action. [13:45:22] want docs? ask for "!wm-bot". all keywords? try "@regsearch .*" [13:45:42] Skynet: it's not down for me [13:45:46] nor me [13:45:48] Skynet: not down for me either [13:45:53] works for me too [13:45:54] Well the rest of the internet is up for me. 
[13:45:55] idem [13:45:55] (03CR) 10Elukey: [C: 03+2] role::cache::text: add ores-legacy.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/933866 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:46:01] I'm in Europe [13:46:04] same here [13:46:08] Reedy: I believe it’s still blocked on that discussion to be had (https://phabricator.wikimedia.org/T334064#8823645) [13:46:08] Weird [13:46:12] * Lucas_WMDE also in europe [13:46:17] Skynet: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [13:46:18] slyngs: good catch [13:46:23] maybe it's related [13:46:35] Skynet: Phabricator did have a blurb 15 minutes ago [13:46:36] (03PS2) 10Reedy: [enwiktionary] add interface-editor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929020 (https://phabricator.wikimedia.org/T318436) (owner: 10Lupok) [13:46:38] (03Merged) 10jenkins-bot: private: Add readme.FatalErrorSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904670 (owner: 10Krinkle) [13:46:54] Not a single device is able to load Phabricator over here. [13:46:56] (03Abandoned) 10Reedy: [enwiktionary] add interface-editor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929020 (https://phabricator.wikimedia.org/T318436) (owner: 10Lupok) [13:47:04] The browser claims the page isn't responding. [13:47:15] (03PS2) 10Reedy: Remove "Create a book" link from sidebar on Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932649 (https://phabricator.wikimedia.org/T340274) (owner: 10Hamish) [13:47:23] (03CR) 10Reedy: [C: 03+2] Remove "Create a book" link from sidebar on Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932649 (https://phabricator.wikimedia.org/T340274) (owner: 10Hamish) [13:47:37] Lucas_WMDE: aha, danke [13:47:42] Clearing cookies, and session data didn't help. 
[13:48:08] (03Merged) 10jenkins-bot: Remove "Create a book" link from sidebar on Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932649 (https://phabricator.wikimedia.org/T340274) (owner: 10Hamish) [13:48:53] (03PS5) 10Reedy: Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [13:49:16] Skynet: can you tell us at least what IP gets phabricator.wikimedia.org for you? :) [13:49:24] Okay, definitely some kind of issue with networking. No idea if it's my ISP, or an issue with Phabricator, but using a VPN, I can get it to load. [13:49:51] (03PS2) 10Elukey: Add ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/933869 (https://phabricator.wikimedia.org/T330414) [13:49:53] I VPN'd into my private network in America. I'm not sharing that IP with you. ;-) [13:49:54] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:50:12] Skynet: Fair enough :-) [13:50:17] Skynet: err I don't want your IP, I need the IP of the DC that you're getting when you lookup phabricator.wikimedia.org [13:50:18] Skynet: what IP phabricator.wikimedia.org resolves to for you, not your own IP [13:50:28] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:50:50] vgutierrez: ah. Hold on, need to disconnect my VPN again. 
[13:51:01] (03CR) 10Ssingh: [C: 03+2] gdnsd: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926509 (owner: 10Muehlenhoff) [13:51:22] (03Abandoned) 10Ssingh: depool codfw (emergency patch, do not merge, testing new LVS host) [dns] - 10https://gerrit.wikimedia.org/r/928840 (owner: 10Ssingh) [13:51:27] (03Abandoned) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/927214 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [13:52:04] I get 91.198.174.192 [13:52:13] that's our Amsterdam DC, thanks [13:52:40] I also seem to be closest to esams (same IP here) [13:53:29] (03CR) 10Ssingh: [C: 03+1] Add ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/933869 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:54:01] (03CR) 10Reedy: [C: 03+2] Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [13:54:45] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:54:48] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:54:49] (03CR) 10Ssingh: "Merged in I6a0282943f8dbd82c194784318f16ab9d88389a6" [puppet] - 10https://gerrit.wikimedia.org/r/921095 (https://phabricator.wikimedia.org/T336973) (owner: 10BCornwall) [13:54:53] wfm via esams too [13:54:58] https://www.irccloud.com/pastebin/2KFMnD48/ [13:55:01] (03Abandoned) 10Ssingh: dnsbox: bind hc to pdns-recursor and gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/921095 (https://phabricator.wikimedia.org/T336973) (owner: 10BCornwall) [13:55:12] every ATS instances in text@esams is able to reach phabricator backends [13:55:26] (03Merged) 10jenkins-bot: Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [13:55:35] Skynet: if it's 
timing out for you... are you able to browse Wikipedia? [13:56:18] (03CR) 10Ssingh: "Merged in I6a0282943f8dbd82c194784318f16ab9d88389a6" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [13:56:25] (03Abandoned) 10Ssingh: wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [13:57:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3001.esams.wmnet [13:57:22] (03PS2) 10Reedy: build: Pin php-code-coverage so it doesn't dirty the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [13:57:37] !log reedy@deploy1002 Synchronized private: I62beb66a6d073cafee59a0420ccc4f54d46d1db8 (duration: 06m 22s) [13:57:44] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [13:57:53] (03CR) 10Reedy: [C: 03+2] build: Pin php-code-coverage so it doesn't dirty the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [13:58:35] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:18] (03Merged) 10jenkins-bot: build: Pin php-code-coverage so it doesn't dirty the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [14:02:52] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) Logging in wmcs-cookbooks is currently handled by the `SALLogger` class in [wmcs_libs/common.py](https:/... 
[14:03:35] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:03:49] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10andrea.denisse) [14:04:14] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in codfw - https://phabricator.wikimedia.org/T340596 (10andrea.denisse) 05Open→03Resolved [14:04:25] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:04:31] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:04:33] !log reedy@deploy1002 Synchronized wmf-config/: Various changes (duration: 06m 27s) [14:04:42] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:04:48] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:05:14] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10MoritzMuehlenhoff) Ok, that works for me [14:05:38] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in eqiad - https://phabricator.wikimedia.org/T340595 (10andrea.denisse) [14:06:07] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: 1 VM request for XHGui in eqiad - https://phabricator.wikimedia.org/T340595 (10andrea.denisse) 05Open→03Resolved [14:06:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for 
hosts ['an-test-worker1003.eqiad.wmnet'] [14:07:18] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:44] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet'] [14:08:11] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:08:27] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye [14:09:57] (03PS1) 10Jbond: puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) [14:10:11] (03CR) 10CI reject: [V: 04-1] puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:10:38] (03PS2) 10Jbond: puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) [14:12:35] (03CR) 10Jforrester: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 (owner: 10Jforrester) [14:13:02] (03CR) 10CI reject: [V: 04-1] puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:13:04] (03PS3) 10Jbond: puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) [14:14:49] (03PS4) 10Jbond: puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) [14:15:52] (03CR) 10Jforrester: "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [14:16:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42078/console" [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:17:16] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Jdforrester-WMF) Approved from my end. 
[14:17:25] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:17:33] (03CR) 10Jbond: [C: 03+1] puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:37] (03CR) 10Jbond: [V: 03+1] puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:19:01] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:19:07] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:21:07] vgutierrez: sorry, had to go AFK. No, Wikipedia is not loading for me either. [14:22:54] I can load it just fine when I VPN into my private network located in NC, USA [14:23:42] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Arnoldokoth) @thcipriani Can I safely assume this already has your go ahead since the same user is listed on (T339936) and you already approved on that ticket? 
[14:23:45] Skynet: could you access https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue , even with the VPN - there is some data there needed to see what the issue is (and you can send it to noc@ so you don't share it publicly here) [14:24:04] if not, https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue [14:25:29] (03PS1) 10Andrew Bogott: cloudcontrols: update host selection where we only want something on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [14:25:51] (03CR) 10CI reject: [V: 04-1] cloudcontrols: update host selection where we only want something on one cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 (owner: 10Andrew Bogott) [14:27:21] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Arnoldokoth) [14:27:55] ok, I think you are talking to vgutierrez privately, so that should be enough (probably) [14:28:17] (03PS1) 10Alexandros Kosiaris: parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) [14:30:16] (03CR) 10CI reject: [V: 04-1] parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) (owner: 10Alexandros Kosiaris) [14:33:23] (03PS2) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [14:33:47] (03CR) 10David Caro: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [14:33:50] (03PS1) 10Gmodena: page-content-change: fix error sink stream name. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) [14:33:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] team-wmcs: add openstack_apis_response.yaml [alerts] - 10https://gerrit.wikimedia.org/r/933477 (https://phabricator.wikimedia.org/T339152) (owner: 10Arturo Borrero Gonzalez) [14:35:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3001.esams.wmnet [14:35:09] (03PS1) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) [14:36:05] (03PS2) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) [14:38:05] (03PS1) 10Jbond: puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/933639 (https://phabricator.wikimedia.org/T340635) [14:39:03] (03PS2) 10Jbond: puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/933639 (https://phabricator.wikimedia.org/T340635) [14:42:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3001.esams.wmnet [14:42:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3001.esams.wmnet [14:45:05] (03PS2) 10Hnowlan: rest-gateway: add domain list for restbase parity, fix regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/933427 (https://phabricator.wikimedia.org/T324678) [14:45:20] (03PS2) 10Alexandros Kosiaris: parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) [14:45:30] (03PS3) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) [14:47:58] (03CR) 10Hnowlan: [C: 03+2] 
rest-gateway: add domain list for restbase parity, fix regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/933427 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [14:48:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42081/console" [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:48:35] (03CR) 10Dzahn: [C: 03+2] "Thanks! So that means it wasn't working regardless of my change. Probably all this time since envoy was added. Makes me think if it matter" [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [14:48:52] (03Merged) 10jenkins-bot: rest-gateway: add domain list for restbase parity, fix regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/933427 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [14:49:26] (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [14:52:42] (03PS4) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) [14:52:53] (03PS1) 10Ssingh: sites.yaml: add new dns host dns1004 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/933917 (https://phabricator.wikimedia.org/T326685) [14:53:12] (03PS1) 10Ssingh: dns1004: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/933918 (https://phabricator.wikimedia.org/T326685) [14:54:40] (03CR) 10CI reject: [V: 04-1] puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:56:34] (03CR) 10Ssingh: "This is a good test to check if the NTP/resolv.conf changes are 
automatically generated when Ia5bf77c134f20666a2f7c53105ef8b890221f534 is" [puppet] - 10https://gerrit.wikimedia.org/r/933918 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [14:59:54] (03CR) 10Elukey: [C: 03+2] Add ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/933869 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [15:00:03] (03CR) 10Muehlenhoff: [C: 03+1] parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) (owner: 10Alexandros Kosiaris) [15:01:32] (03CR) 10Elukey: [C: 03+1] parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) (owner: 10Alexandros Kosiaris) [15:02:19] (03PS5) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) [15:05:39] Skynet: Can you try again without the VPN? [15:06:15] (03PS3) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:06:18] !log Disable Vodafone DE BGP peering on cr2-esams to troubleshoot reports of users from Germany [15:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:08] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:08:22] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:11:10] (03PS1) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 [15:12:01] (03CR) 10CI reject: [V: 04-1] Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [15:12:27] (03CR) 10JHathaway: "@dcaro would love your opinion on this proposed approach, specifically whether 
it would affect cloud." [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:16:42] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1001.eqiad.wmnet with OS bullseye [15:16:50] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore1001.eqiad.wmnet with OS bullseye [15:17:44] (03CR) 10David Caro: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:18:28] (03CR) 10Ahmon Dancy: Add 'git_tag' argument to git::clone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933192 (https://phabricator.wikimedia.org/T218900) (owner: 10Ahmon Dancy) [15:18:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [15:18:51] (03PS2) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 [15:20:51] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:23:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [15:23:23] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [15:24:06] (03CR) 10David Caro: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:24:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host 
an-test-worker1003.eqiad.wmnet with OS bullseye [15:25:22] (03CR) 10David Caro: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:27:03] (03PS4) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:28:58] (03CR) 10David Caro: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:28:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:29:23] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:29:29] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1001.eqiad.wmnet with reason: host reimage [15:31:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:31:35] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:31:49] (03PS5) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:32:27] (03PS3) 10Alexandros Kosiaris: parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) [15:32:31] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1001.eqiad.wmnet with reason: host reimage [15:32:49] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:32:55] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:33:08] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T340550 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm new part arrived. 
new PSU was installed in the server and the out of warranty part was returned to the server it was borrowed from. the part that was broken has been taken to the do... [15:34:06] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42089/console" [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) (owner: 10Alexandros Kosiaris) [15:34:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2032.codfw.wmnet [15:34:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [15:34:58] (03PS6) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) [15:35:33] (03CR) 10CI reject: [V: 04-1] puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [15:35:53] (03PS6) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:35:58] (03PS1) 10Elukey: ml-services: update docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/933967 [15:37:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:37:18] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:39:47] (03PS7) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:40:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [15:40:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2031.codfw.wmnet [15:40:22] (03CR) 
10Alexandros Kosiaris: [V: 03+1 C: 03+2] parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) (owner: 10Alexandros Kosiaris) [15:40:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [15:40:36] (03CR) 10Dzahn: [C: 03+1] parsoid-vd: Don't try to restart it [puppet] - 10https://gerrit.wikimedia.org/r/933913 (https://phabricator.wikimedia.org/T257906) (owner: 10Alexandros Kosiaris) [15:41:16] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, and 2 others: db1145 crashed - https://phabricator.wikimedia.org/T340610 (10Jclark-ctr) a:03Jclark-ctr This device is out of warranty. I do have spares available. Is the server depooled? @jcrespo [15:42:47] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, and 2 others: db1145 crashed - https://phabricator.wikimedia.org/T340610 (10jcrespo) Yes! If you are ready, just go ahead and reset it/power it off as you see fit; service and notifications have been moved elsewhere and I will rebuild its data after servic... [15:43:49] (03PS1) 10Jbond: puppet::agent: set manage_puppet_ca_file false [puppet] - 10https://gerrit.wikimedia.org/r/933968 [15:44:24] (03PS8) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:44:57] (03CR) 10Ilias Sarantopoulos: "One typo other than that everything is correct!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/933967 (owner: 10Elukey) [15:45:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [15:45:51] (03PS2) 10Elukey: ml-services: update docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/933967 [15:45:57] (03CR) 10Elukey: ml-services: update docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/933967 (owner: 10Elukey) [15:46:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42092/console" [puppet] - 10https://gerrit.wikimedia.org/r/933968 (owner: 10Jbond) [15:48:03] (03CR) 10Elukey: [C: 03+2] ml-services: update docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/933967 (owner: 10Elukey) [15:48:10] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:48:44] (03PS9) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:49:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42095/console" [puppet] - 10https://gerrit.wikimedia.org/r/933968 (owner: 10Jbond) [15:49:40] (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/933968 (owner: 10Jbond) [15:50:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1001.eqiad.wmnet with OS bullseye [15:50:54] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore1001.eqiad.wmnet with OS bullseye 
completed: - sessionstore1001... [15:51:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [15:51:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [15:52:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:53:00] (03PS1) 10Jbond: README.release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933971 [15:53:11] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:53:18] (03CR) 10Dzahn: "+0.8 (+1 minus the nitpicks from Filippo are legit)" [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [15:53:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:53:28] (03CR) 10Jbond: "ready for review" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933971 (owner: 10Jbond) [15:53:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:53:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:54:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[15:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1002.eqiad.wmnet with OS bullseye [15:54:43] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore1002.eqiad.wmnet with OS bullseye [15:55:29] (03PS10) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [15:55:52] (03CR) 10CI reject: [V: 04-1] cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 (owner: 10Andrew Bogott) [15:56:15] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/933911 (https://phabricator.wikimedia.org/T289857) (owner: 10Clément Goubert) [15:57:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:57:45] 10SRE, 10Infrastructure-Foundations, 10netops: Connection errors from users on Vodafone DE (AS3209) [28.06.2023] - https://phabricator.wikimedia.org/T340670 (10cmooney) p:05Triage→03Low [15:58:53] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933971 (owner: 10Jbond) [15:59:01] (03PS1) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 [15:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:05] Deploy window Wikifunctions service staging deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1600) [16:00:37] (03PS11) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:00:49] (03CR) 10Clément Goubert: "What is the rationale behind not using ferm specific constants for a ferm module?" 
[puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [16:01:34] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:02:26] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro) [16:02:34] (03PS2) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 [16:02:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:04:30] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:07:21] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1002.eqiad.wmnet with reason: host reimage [16:08:01] (03PS1) 10Tchanders: Assign 'edit' right to the 'temp' group in dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933974 [16:08:24] (03PS2) 10Jforrester: wikifunctions: Add some more real sample values for limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/933618 (https://phabricator.wikimedia.org/T297314) [16:08:28] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Add some more real sample values for limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/933618 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [16:08:35] (03CR) 10Stef Dunlap: [C: 03+1] wikifunctions: Add some more real sample values for limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/933618 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [16:09:04] (03CR) 10CI reject: [V: 04-1] Assign 'edit' right to the 'temp' group in dewiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/933974 (owner: 10Tchanders) [16:09:32] (03Merged) 10jenkins-bot: wikifunctions: Add some more real sample values for limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/933618 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [16:09:54] (03CR) 10David Caro: "This runs also the functional tests on CI \o/" [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro) [16:10:08] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:10:15] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:10:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1002.eqiad.wmnet with reason: host reimage [16:11:09] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:17:30] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:18:27] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:19:55] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:20:15] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:22:04] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:22:16] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:22:22] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:23:29] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:23:33] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:23:50] (This is going great.) 
[16:25:53] (03PS1) 10Ottomata: Update approvers for analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/933976 [16:26:10] (03PS12) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:27:02] (03CR) 10CI reject: [V: 04-1] cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 (owner: 10Andrew Bogott) [16:27:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1002.eqiad.wmnet with OS bullseye [16:28:05] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore1002.eqiad.wmnet with OS bullseye completed: - sessionstore1002... [16:28:18] (03PS13) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:28:24] (03CR) 10Btullis: [C: 03+1] "Thanks. I'm happy to take on this approval role." 
[puppet] - 10https://gerrit.wikimedia.org/r/933976 (owner: 10Ottomata) [16:29:29] (03PS3) 10JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) [16:30:39] (03PS1) 10Jforrester: wikifunctions: Rev the version of our charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/933978 [16:30:41] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:30:47] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Rev the version of our charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/933978 (owner: 10Jforrester) [16:31:32] (03Merged) 10jenkins-bot: wikifunctions: Rev the version of our charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/933978 (owner: 10Jforrester) [16:31:36] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, and 2 others: db1145 crashed - https://phabricator.wikimedia.org/T340610 (10Jclark-ctr) @jcrespo replaced the failed DIMM [16:33:01] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1003.eqiad.wmnet with OS bullseye [16:33:08] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host sessionstore1003.eqiad.wmnet with OS bullseye [16:33:31] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:33:36] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:33:39] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:34:17] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:34:23] (03PS14) 10Andrew Bogott: 
cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:36:50] (03PS15) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:39:00] (03PS16) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:41:13] (03PS17) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:42:34] (03PS18) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:42:58] (03CR) 10CI reject: [V: 04-1] cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 (owner: 10Andrew Bogott) [16:45:59] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye [16:46:00] (03PS19) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:46:03] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1003.eqiad.wmnet with reason: host reimage [16:46:23] (03CR) 10CI reject: [V: 04-1] cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 (owner: 10Andrew Bogott) [16:48:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1003.eqiad.wmnet with reason: host reimage [16:53:10] (03PS20) 10Andrew Bogott: cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 [16:53:21] (03PS1) 10Btullis: Configure an-test-coord1002 to be a spare system [puppet] - 10https://gerrit.wikimedia.org/r/933982 (https://phabricator.wikimedia.org/T336062) [16:55:17] (03CR) 
10Btullis: [C: 03+2] Configure an-test-coord1002 to be a spare system [puppet] - 10https://gerrit.wikimedia.org/r/933982 (https://phabricator.wikimedia.org/T336062) (owner: 10Btullis) [16:56:59] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrols: update host selection for single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/933912 (owner: 10Andrew Bogott) [16:57:14] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-test-coord1002.eqiad.wmnet [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1700) [17:03:45] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [17:03:59] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10Eevans) [17:06:12] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-test-coord1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1001" [17:07:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1003.eqiad.wmnet with OS bullseye [17:07:32] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host sessionstore1003.eqiad.wmnet with OS bullseye completed: - sessionstore1003... 
[17:09:24] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10Eevans) [17:09:45] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Eevans) [17:09:49] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10Eevans) 05Open→03Resolved a:03Eevans macro-deployed [17:13:53] (03CR) 10Muehlenhoff: noc: Pass ports without ferm-specific service constants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [17:15:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-test-coord1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1001" [17:15:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-test-coord1002.eqiad.wmnet [17:16:41] (03PS1) 10Btullis: Remove the remaining reference to an-test-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/933985 (https://phabricator.wikimedia.org/T336062) [17:17:23] (03CR) 10Jbond: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [17:18:22] (03CR) 10Btullis: [C: 03+2] Remove the remaining reference to an-test-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/933985 (https://phabricator.wikimedia.org/T336062) (owner: 10Btullis) [17:19:48] 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (10BTullis) a:05BTullis→03Jclark-ctr [17:23:53] 
10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, and 2 others: db1145 crashed - https://phabricator.wikimedia.org/T340610 (10Jclark-ctr) 05Open→03Resolved [17:33:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [17:34:59] hmmm [17:35:00] that is doh1001 [17:35:01] looking [17:36:29] seems to have recovered [17:37:09] one day™ we will figure out this weird bug that seems to have no obvious causes/timing [18:00:05] brennen and jnuche: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1800) [18:00:05] brennen and jnuche: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1800). [18:00:35] o/ [18:01:32] !log train 1.41.0-wmf.15 ( [18:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:03] !log train 1.41.0-wmf.15 (T340243): no current blockers, rolling to group1. 
[18:02:06] !log train 1.41.0-wmf.15 ) [18:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:07] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243 [18:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:22] (that open paren was going to drive me to distraction for the rest of the day) [18:06:59] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933987 (https://phabricator.wikimedia.org/T340243) [18:07:01] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933987 (https://phabricator.wikimedia.org/T340243) (owner: 10TrainBranchBot) [18:07:44] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933987 (https://phabricator.wikimedia.org/T340243) (owner: 10TrainBranchBot) [18:14:37] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.15 refs T340243 [18:14:42] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243 [18:17:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:56] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.15 refs T340243 (duration: 06m 18s) [18:21:01] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243 [18:32:37] (03CR) 10Marostegui: [C: 03+1] dbbackups: Change s4 and s5 eqiad backup sources to db1150 and db1216 [puppet] - 10https://gerrit.wikimedia.org/r/933895 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [18:40:41] (03PS2) 10Hashar: zuul: remove mode/umask from config 
git clone [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) [18:40:43] (03PS5) 10Hashar: contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) [18:53:55] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:56:38] (03CR) 10JHathaway: puppetserver::git: add operations/private (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [18:57:52] (03CR) 10JHathaway: [C: 03+1] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [18:58:07] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [18:59:54] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/933968 (owner: 10Jbond) [19:01:54] (03CR) 10Dzahn: [C: 03+1] "very detailed commit message, positive change" [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:02:19] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:00] !log contint* - temp disabled puppet - deploying gerrit:927980 - related to git cloning zuul config on CI servers [19:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:36] (03CR) 10Dzahn: [C: 03+1] "19:02 < mutante> !log contint* - temp disabled puppet - deploying gerrit:927980 - related to git cloning zuul config on CI servers" [puppet] - 10https://gerrit.wikimedia.org/r/927980 
(https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:04:34] (03PS1) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [19:06:17] (03CR) 10Dzahn: [C: 03+1] "hey, side question, I noticed that the git state of "/etc/zuul/wikimedia/" is different on the 3 contint servers. each one is at a differe" [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:06:47] (03CR) 10CI reject: [V: 04-1] sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [19:07:04] (03CR) 10Dzahn: [C: 03+2] zuul: remove mode/umask from config git clone [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:08:34] (03CR) 10Dzahn: [C: 03+2] "contint2002 done before prod:" [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:11:58] (03CR) 10Dzahn: [C: 03+2] zuul: remove mode/umask from config git clone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:13:49] !log contint1002,2002,2001 - sudo chmod -R g-w /etc/zuul/wikimedia with deploying gerrit:927980 for T338277 [19:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:54] T338277: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277 [19:14:34] (03PS5) 10Jbond: puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) [19:14:58] (03CR) 10CI reject: [V: 04-1] puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [19:15:15] (03CR) 10Dzahn: [C: 03+2] "this is all done, deployed on 3 contint 
servers. did also run the chmod to remove group writable bit, puppet running" [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:16:27] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [19:19:14] (03PS6) 10Jbond: puppetserver::git: add operations/private [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) [19:31:13] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Dzahn) [19:35:27] (03PS1) 10Jdlrobson: Revert "Deprecate use of targets" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933642 [19:35:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:36:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:38:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:39:13] 10Puppet, 10Release-Engineering-Team, 10Patch-For-Review: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277 (10Dzahn) deployed change to /etc/zuul/wikimedia on contint2002, contint1002, finally contint2001. zuul class does not use umask parameter anymore f... [19:40:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:43:10] (03CR) 10JHathaway: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/933911 (https://phabricator.wikimedia.org/T289857) (owner: 10Clément Goubert) [19:43:51] jouncebot nowandnext [19:43:51] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T1800) [19:43:52] In 0 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T2000) [19:44:00] (03CR) 10Dzahn: "@Arnold, I think this is safe to merge next. The new URL points to the general VRTS page now regardless if the ClamAV section exists. and " [puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:44:04] (03CR) 10Dzahn: [C: 03+1] vrts: replace OTRS in Wikitech monitoring notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:44:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933642 (owner: 10Jdlrobson) [19:44:31] (03CR) 10Dzahn: [C: 03+1] "have you ever had to interact with clamAV / clamd on vrts ?" [puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:46:54] !log train 1.41.0-wmf.15 (T340243): deploying a revert for T127268 related deprecation logspam - this is likely to impinge on upcoming backport window, which currently has no patches. will update when finished. 
[19:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:00] T340243: 1.41.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T340243 [19:47:01] T127268: Dismantle ResourceLoader's "targets" system - https://phabricator.wikimedia.org/T127268 [19:49:47] (03CR) 10JHathaway: [C: 03+1] puppetserver::git: add operations/private (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond) [19:55:58] (03PS1) 10Jdlrobson: Load OCR code on editor page on mobile as well as desktop [extensions/Wikisource] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933643 (https://phabricator.wikimedia.org/T340679) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230628T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. [20:01:22] (note log message above.) 
[20:03:00] (03Merged) 10jenkins-bot: Revert "Deprecate use of targets" [core] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933642 (owner: 10Jdlrobson) [20:03:33] !log brennen@deploy1002 Started scap: Backport for [[gerrit:933642|Revert "Deprecate use of targets"]] [20:05:05] !log brennen@deploy1002 jdlrobson and brennen: Backport for [[gerrit:933642|Revert "Deprecate use of targets"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:10:57] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:933642|Revert "Deprecate use of targets"]] (duration: 07m 23s) [20:33:39] (03PS1) 10Ahmon Dancy: profile::local_dev::docker_publish: /srv/dev-images is safe directory [puppet] - 10https://gerrit.wikimedia.org/r/933994 (https://phabricator.wikimedia.org/T338277) [20:35:20] (03PS2) 10Ahmon Dancy: profile::local_dev::docker_publish: /srv/dev-images is safe directory [puppet] - 10https://gerrit.wikimedia.org/r/933994 (https://phabricator.wikimedia.org/T335354) [20:37:33] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/933994 (https://phabricator.wikimedia.org/T335354) (owner: 10Ahmon Dancy) [20:37:38] (03Abandoned) 10Jdlrobson: Load OCR code on editor page on mobile as well as desktop [extensions/Wikisource] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/933643 (https://phabricator.wikimedia.org/T340679) (owner: 10Jdlrobson) [20:42:41] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/933994/2032/" [puppet] - 10https://gerrit.wikimedia.org/r/933994 (https://phabricator.wikimedia.org/T335354) (owner: 10Ahmon Dancy) [20:58:46] (03CR) 10Dzahn: [C: 03+2] "per IRC chat - the error was noticed during deploy of unrelated change to the zuul config dir - thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/933994 (https://phabricator.wikimedia.org/T335354) (owner: 10Ahmon Dancy) [21:07:28] (03CR) 10Dzahn: [C: 03+2] "deployed on contint* and this fixed the puppet errors. thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/933994 (https://phabricator.wikimedia.org/T335354) (owner: 10Ahmon Dancy) [21:29:02] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: Vendor 0.61.0 as 0.61.0-wmf.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933210 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [21:30:03] (03Merged) 10jenkins-bot: opentelemetry-collector: Vendor 0.61.0 as 0.61.0-wmf.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/933210 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [21:33:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [21:38:04] (03PS2) 10Cwhite: opensearch: disable security plugin for both clusters [puppet] - 10https://gerrit.wikimedia.org/r/927772 (https://phabricator.wikimedia.org/T333732) [21:41:48] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team, and 2 others: codfw1dev: OpenStack services can only sort of talk to memacached on cloudcontrols - https://phabricator.wikimedia.org/T340488 (10Andrew) Some piece is still missing ` root@cloudcontrol2001-dev:~# telnet cloudcontrol2004-dev.private.co... 
[21:43:58] (03CR) 10Cwhite: [C: 03+2] opensearch: disable security plugin for both clusters [puppet] - 10https://gerrit.wikimedia.org/r/927772 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [21:44:39] (03PS1) 10Func: Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) [21:47:12] (03PS2) 10Func: Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) [21:47:50] (03CR) 10CI reject: [V: 04-1] Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) (owner: 10Func) [21:49:34] (03PS3) 10Func: Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) [21:50:17] (03PS1) 10Bartosz Dziewoński: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) [21:52:59] (03PS4) 10Func: Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) [22:22:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:36:48] (03CR) 10Volans: [C: 04-1] "I think there is a spurious file. LGTM as approach, some comment inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [22:50:51] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [22:51:16] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [23:08:51] RECOVERY - Check systemd state on ms-be1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:53] RECOVERY - Check systemd state on ms-be2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state