[00:03:48] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10119924 (Dzahn) Open→In progress
[00:04:44] (PS3) Scott French: P:etcd::tlsproxy: add support for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245)
[00:06:12] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[00:07:37] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070682 (owner: TrainBranchBot)
[00:08:25] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:35] (PS1) Dzahn: gerrit: add backup::host, gerrit::migration etc to insetup role [puppet] - https://gerrit.wikimedia.org/r/1070683 (https://phabricator.wikimedia.org/T372804)
[00:17:52] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1070683/3886/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1070683 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[00:19:37] RECOVERY - Host kubernetes2035 is UP: PING WARNING - Packet loss = 75%, RTA = 30.30 ms
[00:20:09] PROBLEM - SSH on kubernetes2035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:26:01] PROBLEM - Host kubernetes2035 is DOWN: PING CRITICAL - Packet loss = 100%
[00:26:12] (Abandoned) Amire80: Modify namespace translation for Mongolian (mn) [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068862 (https://phabricator.wikimedia.org/T366271) (owner: Amire80)
[00:51:56] FIRING: 
RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:00:51] FIRING: [4x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip6) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:03:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:08:25] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:13:19] SRE-tools, Infrastructure-Foundations, Spicerack: Unified pattern for RemoteHosts accessors in Spicerack - https://phabricator.wikimedia.org/T374073 (Scott_French) NEW
[01:14:57] (CR) Scott French: mediawiki: fetch active deployment host (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[01:15:13] (Abandoned) Scott French: mediawiki: fetch active deployment host [software/spicerack] - https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[01:18:01] (PS2) Bartosz Dziewoński: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1069320
[01:18:01] (PS3) Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - https://gerrit.wikimedia.org/r/1069321
[01:18:01] (PS3) Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716
[01:18:02] (PS1) Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1070685
[01:18:08] (PS2) Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - https://gerrit.wikimedia.org/r/1069344
[01:18:34] (CR) Bartosz Dziewoński: "This is the last one :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070685 (owner: Bartosz Dziewoński)
[01:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:13:04] PROBLEM 
- IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 103 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:18:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:08:09] (CR) Fabfur: [C:+1] "I think it's ok, if you want we can try the deploy on one depooled host first just to be extra-safe" [puppet] - https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway)
[04:39:45] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T374078 (Mlkmooeede92) NEW
[04:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:00:51] FIRING: [4x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip6) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:03:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:06:14] PROBLEM - Backup 
freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:06:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:44] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:23] (CR) Majavah: "yes in 2022, T302870" [puppet] - https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: Slyngshede)
[05:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:56] PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table srwiki.recentchanges: Index for table recentchanges is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log 
db1161-bin.002681, end_log_pos 71314887 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_rep
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:16:06] PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:16:09] SRE, MediaWiki-Engineering, MediaWiki-extensions-BounceHandler, Observability-Metrics, Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10120165 (Joe) The check is done using `Webrequest::getIP()` which uses `REMOTE_ADDR` as a source for the address, and then overr... 
[06:16:14] PROBLEM - MariaDB Replica Lag: s5 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:16:36] PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:16:36] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:23:30] SRE, collaboration-services, vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10120170 (Dzahn) There are at least 2 different aspects to this. One is "are WMF mail servers configured to route mails to this addre...
[06:23:40] SRE, collaboration-services, vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10120172 (Dzahn) ` [mx1001:~] $ sudo exim4 -bt mobile-ios-wikipedia@wikimedia.org mobile-ios-wikipedia@wikimedia.org router = gsuite_...
[06:25:57] (CR) Brouberol: [C:+1] "The generated diff looks good!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1070649 (https://phabricator.wikimedia.org/T361346) (owner: Santiago Faci)
[06:32:09] (PS1) Giuseppe Lavagetto: exim: fix VERP handling for mediawiki [puppet] - https://gerrit.wikimedia.org/r/1070805 (https://phabricator.wikimedia.org/T338761)
[06:32:30] (CR) CI reject: [V:-1] exim: fix VERP handling for mediawiki [puppet] - https://gerrit.wikimedia.org/r/1070805 (https://phabricator.wikimedia.org/T338761) (owner: Giuseppe Lavagetto)
[06:34:34] (CR) Brouberol: Add a profile::analytics::cluster::hdfs_file defined type (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[06:35:21] (PS2) Giuseppe Lavagetto: exim: fix VERP handling for mediawiki [puppet] - https://gerrit.wikimedia.org/r/1070805 (https://phabricator.wikimedia.org/T338761)
[06:37:30] (PS3) Slyngshede: P:idp Add keystone OIDC configuration [puppet] - https://gerrit.wikimedia.org/r/1070586
[06:41:08] (CR) Giuseppe Lavagetto: [C:+2] exim: fix VERP handling for mediawiki [puppet] - https://gerrit.wikimedia.org/r/1070805 (https://phabricator.wikimedia.org/T338761) (owner: Giuseppe Lavagetto)
[06:43:15] (Abandoned) Slyngshede: P:idp::client::httpd::site Increase LimitRequestFieldSize [puppet] - https://gerrit.wikimedia.org/r/1053535 (https://phabricator.wikimedia.org/T369205) (owner: Slyngshede)
[06:43:20] SRE, collaboration-services, vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10120180 (Krd) I would like to see the whole content of vrts_aliases.py and discuss (perhaps at a different venue) if things can be cle... 
[06:44:26] !log installing openssl security updates
[06:47:11] (PS1) Brouberol: WIP: airflow: implement SSO auth [deployment-charts] - https://gerrit.wikimedia.org/r/1070806 (https://phabricator.wikimedia.org/T368760)
[06:47:43] (Abandoned) Brouberol: WIP: airflow: implement SSO auth [deployment-charts] - https://gerrit.wikimedia.org/r/1070806 (https://phabricator.wikimedia.org/T368760) (owner: Brouberol)
[06:53:50] RECOVERY - MegaRAID on backup2003 is OK: OK: optimal, 1 logical, 24 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:57:14] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw
[06:58:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw
[06:59:23] (PS4) Muehlenhoff: Switch acmechief1001/2001 to insetup::buster [puppet] - https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799)
[06:59:43] (PS42) Brouberol: WIP: airflow: implement SSO auth [deployment-charts] - https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: Bking)
[07:00:04] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:14] Little late for the deployment.. 
[07:02:06] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad
[07:02:07] (PS2) KartikMistry: aswiki: Set MT threshold for CX to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/1070596 (https://phabricator.wikimedia.org/T369417)
[07:02:17] SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10120199 (MoritzMuehlenhoff) a:SLyngshede-WMF→MoritzMuehlenhoff
[07:02:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad
[07:03:01] (PS43) Brouberol: WIP: airflow: implement SSO auth [deployment-charts] - https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: Bking)
[07:03:04] (CR) David Caro: [V:+1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1067301 (owner: David Caro)
[07:05:15] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070596 (https://phabricator.wikimedia.org/T369417) (owner: KartikMistry)
[07:06:04] (Merged) jenkins-bot: aswiki: Set MT threshold for CX to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/1070596 (https://phabricator.wikimedia.org/T369417) (owner: KartikMistry)
[07:06:40] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1070596|aswiki: Set MT threshold for CX to 80% (T369417)]]
[07:06:43] T369417: Translation percentage change in Content Translation for Assamese - https://phabricator.wikimedia.org/T369417
[07:09:00] !log kartik@deploy1003 kartik: Backport for [[gerrit:1070596|aswiki: Set MT threshold for CX to 80% (T369417)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) 
[07:09:35] (PS1) Muehlenhoff: Temporarily remove puppetmaster1003 from rotation [puppet] - https://gerrit.wikimedia.org/r/1070813 (https://phabricator.wikimedia.org/T373888)
[07:11:01] (CR) Vgutierrez: [C:-2] "acmechief1001 currently is the acme-chief instance in charge of actually issuing the certificates from Let's Encrypt" [puppet] - https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:11:19] (CR) Slyngshede: [C:+2] MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - https://gerrit.wikimedia.org/r/1066750 (owner: Slyngshede)
[07:11:29] !log kartik@deploy1003 kartik: Continuing with sync
[07:11:43] (CR) DCausse: search: use mul fallback for manually-tuned search profiles (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) (owner: DCausse)
[07:11:51] (PS9) Arnaudb: mariadb: productionize db2221 db2222 [puppet] - https://gerrit.wikimedia.org/r/1068667 (https://phabricator.wikimedia.org/T373579)
[07:11:57] (CR) Slyngshede: MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - https://gerrit.wikimedia.org/r/1066750 (owner: Slyngshede)
[07:12:12] (PS2) Slyngshede: MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - https://gerrit.wikimedia.org/r/1066750
[07:14:45] (CR) Slyngshede: [C:+2] MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - https://gerrit.wikimedia.org/r/1066750 (owner: Slyngshede)
[07:15:59] !log disable puppet fleetwide to restart puppetdb jvms (without impacting ongoing puppet runs, getting noise etc..) 
[07:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:18] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070596|aswiki: Set MT threshold for CX to 80% (T369417)]] (duration: 09m 37s)
[07:16:21] T369417: Translation percentage change in Content Translation for Assamese - https://phabricator.wikimedia.org/T369417
[07:16:26] (CR) Arnaudb: [C:+2] mariadb: productionize db2221 db2222 [puppet] - https://gerrit.wikimedia.org/r/1068667 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[07:17:00] (Merged) jenkins-bot: MediaWiki: Remove the MediaWiki app and dependencies. [software/bitu] - https://gerrit.wikimedia.org/r/1066750 (owner: Slyngshede)
[07:24:29] (PS4) DCausse: search: use mul fallback for fine-tuned search profiles [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401)
[07:25:41] (PS1) Vgutierrez: acme-chief: Set acmechief1002 as the active [puppet] - https://gerrit.wikimedia.org/r/1070844 (https://phabricator.wikimedia.org/T365799)
[07:26:23] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070844 (https://phabricator.wikimedia.org/T365799) (owner: Vgutierrez)
[07:28:36] (CR) Vgutierrez: [C:-2] "blocked till I9c6049269470ebdb20ff6b2cf9778e1a354a25ab is merged and puppet runs on the affected hosts" [puppet] - https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[07:29:21] (PS2) Vgutierrez: acme-chief: Set acmechief1002 as active host [puppet] - https://gerrit.wikimedia.org/r/1070844 (https://phabricator.wikimedia.org/T365799)
[07:34:56] (PS1) Muehlenhoff: Remove Frantz Joseph from list of approvers [puppet] - https://gerrit.wikimedia.org/r/1070856
[07:39:30] !log restart puppetdb on puppetdb[12]003 to pick up the new jvm
[07:39:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:28] (CR) David Caro: [C:+1] Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) (owner: Andrew Bogott)
[07:46:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:46:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:46:52] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:13] (CR) Brouberol: WIP: airflow: implement SSO auth (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: Bking)
[07:49:28] (CR) Jgiannelos: changeprop: Enable PCS pregeneration without restbase (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[07:50:43] (PS44) Brouberol: WIP: airflow: implement SSO auth [deployment-charts] - https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: Bking)
[07:52:20] (CR) Elukey: "Quick question to double check - will this remove 1003 from the post-commit hooks (private and public)? 
If not we may have commits/puppet-" [puppet] - https://gerrit.wikimedia.org/r/1070813 (https://phabricator.wikimedia.org/T373888) (owner: Muehlenhoff)
[07:52:35] SRE, MediaWiki-Engineering, MediaWiki-extensions-BounceHandler, Observability-Metrics, Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10120254 (Joe) I still see the errors in the logs, and it's baffling. In fact, I've tried the command now listed in exim's config...
[07:53:28] (CR) Stevemunene: WIP: airflow: implement SSO auth (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: Bking)
[07:55:21] !log re-enabling puppet on all nodes after puppetdb restarts
[07:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:50] (PS1) Giuseppe Lavagetto: postfix: add X-Client-IP to VERP handler [puppet] - https://gerrit.wikimedia.org/r/1070858 (https://phabricator.wikimedia.org/T338761)
[07:56:11] (CR) CI reject: [V:-1] postfix: add X-Client-IP to VERP handler [puppet] - https://gerrit.wikimedia.org/r/1070858 (https://phabricator.wikimedia.org/T338761) (owner: Giuseppe Lavagetto)
[07:56:21] (PS2) Giuseppe Lavagetto: postfix: add X-Client-IP to VERP handler [puppet] - https://gerrit.wikimedia.org/r/1070858 (https://phabricator.wikimedia.org/T338761)
[07:58:36] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1070858 (https://phabricator.wikimedia.org/T338761) (owner: Giuseppe Lavagetto)
[08:00:04] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T0800)
[08:02:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
etherpad2002.codfw.wmnet
[08:03:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: provisionning db2221.codfw.wmnet - T373579
[08:03:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: provisionning db2221.codfw.wmnet - T373579
[08:03:10] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[08:03:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: provisionning db2221.codfw.wmnet - T373579
[08:03:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: provisionning db2221.codfw.wmnet - T373579
[08:05:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2168 in db2221 for T373579', diff saved to https://phabricator.wikimedia.org/P68685 and previous config saved to /var/cache/conftool/dbconfig/20240905-080540-arnaudb.json
[08:06:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:06:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad2002.codfw.wmnet
[08:06:27] (CR) Jelto: [V:+1 C:+2] profile::firewall::nftables_throttling: throttle connections not packets [puppet] - https://gerrit.wikimedia.org/r/1070591 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[08:08:03] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:08:14] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:09:09] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2168.codfw.wmnet onto db2221.codfw.wmnet
[08:16:59] (PS1) Jelto: 
rofile::firewall::nftables_throttling: fix timeout in trackinglist [puppet] - https://gerrit.wikimedia.org/r/1070861 (https://phabricator.wikimedia.org/T365259)
[08:18:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:19:49] (CR) Filippo Giunchedi: [C:+1] "LGTM, might need adjustment re: active host if this is merged after the alert2002 migration" [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[08:20:40] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_esams for 9.2.5-1wm2
[08:20:45] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3888/co" [puppet] - https://gerrit.wikimedia.org/r/1070861 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[08:21:42] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:21:56] (CR) Jelto: [V:+1 C:+2] rofile::firewall::nftables_throttling: fix timeout in trackinglist [puppet] - https://gerrit.wikimedia.org/r/1070861 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[08:23:17] (CR) Brouberol: WIP: airflow: implement SSO auth (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: Bking)
[08:24:49] (PS19) Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372)
[08:25:00] (CR) Giuseppe Lavagetto: [V:+1 C:+2] postfix: add X-Client-IP to VERP handler [puppet] - https://gerrit.wikimedia.org/r/1070858 
(https://phabricator.wikimedia.org/T338761) (owner: Giuseppe Lavagetto)
[08:25:01] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10120336 (cmooney) >>! In T371434#10119784, @Papaul wrote: > The diagram below will outline the cabling of the new Fundraising network devices > >...
[08:26:06] (CR) JMeybohm: [C:+2] renumber-node: Add --os parameter (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1070590 (owner: JMeybohm)
[08:26:20] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10120339 (cmooney) Open→Resolved a:cmooney
[08:28:07] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2088.codfw.wmnet
[08:28:24] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2088.codfw.wmnet with OS bookworm
[08:28:25] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120363 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumbering for host wikikube-w...
[08:28:34] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2088
[08:28:35] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120364 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2088.codfw.... 
[08:30:55] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:30:59] (PS1) Arnaudb: mariadb: productionize db2224 [puppet] - https://gerrit.wikimedia.org/r/1070864 (https://phabricator.wikimedia.org/T373579)
[08:31:06] (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configurator: Enabling prometheus monitoring for MPIC [deployment-charts] - https://gerrit.wikimedia.org/r/1070649 (https://phabricator.wikimedia.org/T361346) (owner: Santiago Faci)
[08:32:17] (Merged) jenkins-bot: Metrics Platform Instrument Configurator: Enabling prometheus monitoring for MPIC [deployment-charts] - https://gerrit.wikimedia.org/r/1070649 (https://phabricator.wikimedia.org/T361346) (owner: Santiago Faci)
[08:33:48] (PS1) Muehlenhoff: librenms: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1070865 (https://phabricator.wikimedia.org/T135991)
[08:34:07] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2088 - jayme@cumin1002"
[08:34:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2088 - jayme@cumin1002"
[08:34:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:34:11] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2088.codfw.wmnet 231.16.192.10.in-addr.arpa 1.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:34:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2088.codfw.wmnet 231.16.192.10.in-addr.arpa 1.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:34:16] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker2088 [08:34:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2088 [08:34:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2088 [08:36:21] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:37] PROBLEM - Host kubernetes2054 is DOWN: PING CRITICAL - Packet loss = 100% [08:36:43] (03CR) 10JMeybohm: [C:03+1] Temporarily disable stunnel for the Puppet 7 migration of deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1070236 (owner: 10Muehlenhoff) [08:45:16] (03PS1) 10JMeybohm: Rename/Renumber mw2434,2435 to wikikube-worker2089,2090 [puppet] - 10https://gerrit.wikimedia.org/r/1070866 (https://phabricator.wikimedia.org/T372878) [08:46:23] (03CR) 10Cathal Mooney: [C:03+2] Update prefix-lists for new private, global IPv6 ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1070589 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [08:46:28] (03CR) 10Vgutierrez: [C:03+2] acme-chief: Set acmechief1002 as active host [puppet] - 10https://gerrit.wikimedia.org/r/1070844 (https://phabricator.wikimedia.org/T365799) (owner: 10Vgutierrez) [08:46:53] (03Merged) 10jenkins-bot: Update prefix-lists for new private, global IPv6 ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1070589 (https://phabricator.wikimedia.org/T330153) (owner: 10Cathal Mooney) [08:49:08] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2434.codfw.wmnet [08:49:14] (03CR) 10JMeybohm: [C:03+2] Rename/Renumber 
mw2434,2435 to wikikube-worker2089,2090 [puppet] - 10https://gerrit.wikimedia.org/r/1070866 (https://phabricator.wikimedia.org/T372878) (owner: 10JMeybohm) [08:49:28] (03PS1) 10Elukey: redfish: allow 200 responses in chassis_reset [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) [08:49:46] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2434.codfw.wmnet [08:49:53] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2435.codfw.wmnet [08:50:15] (03PS2) 10Elukey: redfish: allow 200 responses in chassis_reset [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) [08:50:26] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2435.codfw.wmnet [08:51:18] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2088.codfw.wmnet with reason: host reimage [08:51:30] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from mw2434 to wikikube-worker2089 [08:51:47] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:52:01] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from mw2435 to wikikube-worker2090 [08:54:05] (03PS1) 10Santiago Faci: MPIC: Deploying to staging a new release (v0.1.4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070869 (https://phabricator.wikimedia.org/T361346) [08:54:43] !log acmechief1002 is now the acme-chief active host [08:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:05] !log jayme@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2088.codfw.wmnet with reason: host reimage [08:55:33] (03CR) 10Vgutierrez: [C:03+1] "acmechief1002 is now the active host, you can proceed :)" [puppet] - 10https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [08:56:18] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2434 to wikikube-worker2089 - jayme@cumin1002" [08:56:54] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:57:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2434 to wikikube-worker2089 - jayme@cumin1002" [08:57:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:57:26] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2089 [08:57:41] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1070870 (https://phabricator.wikimedia.org/T374086) [08:58:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host planet2003.codfw.wmnet [08:58:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host miscweb2003.codfw.wmnet [08:58:58] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1070871 (https://phabricator.wikimedia.org/T374087) [08:59:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2089 [08:59:40] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1070873 (https://phabricator.wikimedia.org/T374088) [08:59:59] (03CR) 10Slyngshede: [C:03+1] "Looks good. 
It was me who forgot to remove them from approvers." [puppet] - 10https://gerrit.wikimedia.org/r/1070856 (owner: 10Muehlenhoff) [09:00:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2434 to wikikube-worker2089 [09:00:18] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2435 to wikikube-worker2090 - jayme@cumin1002" [09:00:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2435 to wikikube-worker2090 - jayme@cumin1002" [09:00:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:00:23] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2090 [09:00:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from mw2434 to wik... [09:00:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2090 [09:01:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2435 to wikikube-worker2090 [09:01:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from mw2435 to wik... 
[09:01:56] (03PS1) 10David Caro: toolforge::prometheus: keep only the metrics we use for nginx [puppet] - 10https://gerrit.wikimedia.org/r/1070877 (https://phabricator.wikimedia.org/T370143) [09:02:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet2003.codfw.wmnet [09:02:10] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2089.codfw.wmnet [09:02:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120477 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [09:02:22] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2089.codfw.wmnet [09:02:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120478 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... 
[09:02:45] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2089.codfw.wmnet wikikube-worker2090.codfw.wmnet on all recursors [09:02:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb2003.codfw.wmnet [09:02:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2089.codfw.wmnet wikikube-worker2090.codfw.wmnet on all recursors [09:02:57] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2089.codfw.wmnet [09:03:07] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2089.codfw.wmnet [09:03:07] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120480 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [09:03:17] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:03:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120481 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... 
[09:03:38] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2089.codfw.wmnet [09:03:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host doc1003.eqiad.wmnet [09:03:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120482 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [09:03:59] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2089.codfw.wmnet with OS bookworm [09:04:09] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2089 [09:04:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [09:04:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host planet1003.eqiad.wmnet [09:04:24] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:04:40] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2090.codfw.wmnet [09:04:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120484 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... 
[09:05:00] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2090.codfw.wmnet with OS bullseye [09:05:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [09:06:04] (03PS1) 10Joely Rooke WMDE: Fix missing wikibase link in Minerva sidebar [skins/MinervaNeue] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070878 (https://phabricator.wikimedia.org/T66315) [09:06:25] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:07:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [skins/MinervaNeue] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070878 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [09:07:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc1003.eqiad.wmnet [09:07:46] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2089 - jayme@cumin1002" [09:07:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2089 - jayme@cumin1002" [09:07:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:07:51] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache 
wikikube-worker2089.codfw.wmnet 122.16.192.10.in-addr.arpa 2.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:07:54] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2089.codfw.wmnet 122.16.192.10.in-addr.arpa 2.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:07:54] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2089 [09:10:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet1003.eqiad.wmnet [09:10:48] (03CR) 10Brouberol: [C:03+1] MPIC: Deploying to staging a new release (v0.1.4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070869 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [09:10:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2089 [09:10:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2089 [09:11:40] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2090 [09:12:09] (03PS45) 10Stevemunene: WIP: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [09:12:15] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:12:29] (03PS1) 10Giuseppe Lavagetto: BounceHandler: add IPs for the new mx servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) [09:12:43] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2088.codfw.wmnet with OS bookworm [09:12:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - 
https://phabricator.wikimedia.org/T372878#10120495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [09:13:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2168.codfw.wmnet onto db2221.codfw.wmnet [09:14:03] (03CR) 10Muehlenhoff: "Fortunately not! The list of nodes to be addressed during puppet-merge is configured via /etc/puppet-merge/shell_config.conf and gets mana" [puppet] - 10https://gerrit.wikimedia.org/r/1070813 (https://phabricator.wikimedia.org/T373888) (owner: 10Muehlenhoff) [09:16:13] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2090 - jayme@cumin1002" [09:16:16] !log homer lsw1-b8-codfw* commit 'T372878' [09:16:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2090 - jayme@cumin1002" [09:16:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:16:18] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2090.codfw.wmnet 123.16.192.10.in-addr.arpa 3.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [09:16:21] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:16:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2090.codfw.wmnet 123.16.192.10.in-addr.arpa 3.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:16:23] !log jayme@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2090 [09:16:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host miscweb1003.eqiad.wmnet [09:16:53] (03CR) 10Volans: "you could use wmflib's @retry instead" [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: 10JHathaway) [09:16:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host doc2002.codfw.wmnet [09:18:03] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2090 [09:18:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2090 [09:18:15] (03CR) 10Alexandros Kosiaris: [C:03+1] BounceHandler: add IPs for the new mx servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:18:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2090 [09:18:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2090 [09:18:38] (03PS1) 10Jelto: etherpad: stop the prometheus exporter on the replica [puppet] - 10https://gerrit.wikimedia.org/r/1070881 (https://phabricator.wikimedia.org/T374083) [09:18:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_esams for 9.2.5-1wm2 [09:18:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:19:00] (03CR) 10MVernon: [C:03+1] "Checked thus:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:19:25] <_joe_> jouncebot: nowandnext [09:19:25] For the next 0 hour(s) and 40 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T0800) [09:19:25] In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1000) [09:19:38] <_joe_> is the train running? [09:19:40] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2088.codfw.wmnet [09:19:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2088.codfw.wmnet [09:19:43] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2088.codfw.wmnet [09:19:52] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10120515 (10JMeybohm) [09:19:57] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120516 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... 
[09:20:26] (03PS2) 10Giuseppe Lavagetto: BounceHandler: add IPs for the new mx servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) [09:20:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb1003.eqiad.wmnet [09:20:52] (03PS4) 10Hnowlan: sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 [09:20:53] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3889/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070881 (https://phabricator.wikimedia.org/T374083) (owner: 10Jelto) [09:20:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc2002.codfw.wmnet [09:21:25] (03CR) 10Giuseppe Lavagetto: "thanks, amended the comment!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:21:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: provisionning db2222.codfw.wmnet - T373579 [09:21:38] (03CR) 10Santiago Faci: [C:03+2] MPIC: Deploying to staging a new release (v0.1.4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070869 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [09:21:40] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [09:21:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: provisionning db2222.codfw.wmnet - T373579 [09:21:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2222.codfw.wmnet with reason: provisionning db2222.codfw.wmnet - T373579 [09:21:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1 day, 0:00:00 on db2222.codfw.wmnet with reason: provisionning db2222.codfw.wmnet - T373579 [09:22:40] (03Merged) 10jenkins-bot: MPIC: Deploying to staging a new release (v0.1.4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070869 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [09:23:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2168 in db2222 for T373579', diff saved to https://phabricator.wikimedia.org/P68686 and previous config saved to /var/cache/conftool/dbconfig/20240905-092339-arnaudb.json [09:24:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aphlict1002.eqiad.wmnet [09:24:45] (03CR) 10Giuseppe Lavagetto: BounceHandler: add IPs for the new mx servers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:24:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:24:57] (03CR) 10Hnowlan: sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 (owner: 10Hnowlan) [09:25:02] (03CR) 10Hnowlan: [C:03+2] sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 (owner: 10Hnowlan) [09:25:44] (03Merged) 10jenkins-bot: BounceHandler: add IPs for the new mx servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070879 (https://phabricator.wikimedia.org/T338761) (owner: 10Giuseppe Lavagetto) [09:25:57] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2168.codfw.wmnet onto db2222.codfw.wmnet [09:26:04] !log oblivian@deploy1003 Started scap sync-world: Backport for [[gerrit:1070879|BounceHandler: add IPs for the new mx servers (T338761)]] [09:26:07] T338761: Bouncehandler is broken - 
https://phabricator.wikimedia.org/T338761 [09:26:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host etherpad1004.eqiad.wmnet [09:28:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict1002.eqiad.wmnet [09:30:10] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:30:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1004.eqiad.wmnet [09:30:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aphlict2001.codfw.wmnet [09:32:28] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2089.codfw.wmnet with reason: host reimage [09:32:39] !log oblivian@deploy1003 oblivian: Backport for [[gerrit:1070879|BounceHandler: add IPs for the new mx servers (T338761)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:32:41] T338761: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761 [09:32:45] !log oblivian@deploy1003 oblivian: Continuing with sync [09:34:46] (03CR) 10Muehlenhoff: [C:03+2] Temporarily remove puppetmaster1003 from rotation [puppet] - 10https://gerrit.wikimedia.org/r/1070813 (https://phabricator.wikimedia.org/T373888) (owner: 10Muehlenhoff) [09:34:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict2001.codfw.wmnet [09:35:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2089.codfw.wmnet with reason: host reimage [09:36:05] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2090.codfw.wmnet with reason: host reimage [09:36:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [09:36:52] (03CR) 10Volans: [C:03+1] "LGTM, nice work!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:37:20] !log oblivian@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070879|BounceHandler: add IPs for the new mx servers (T338761)]] (duration: 11m 16s) [09:37:33] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [09:38:22] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-node: handle invalid/missing host [cookbooks] - 10https://gerrit.wikimedia.org/r/1070611 (owner: 10Hnowlan) [09:38:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2090.codfw.wmnet with reason: host reimage [09:39:23] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Unified pattern for RemoteHosts accessors in Spicerack - https://phabricator.wikimedia.org/T374073#10120608 (10Volans) Thanks for the task, we'll evaluate the various options and come up with a final proposal. [09:39:55] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10120624 (10BTullis) >>! In T368098#10116854, @BTullis wrote: > I have prepared a [[https://gerrit.wikimedia.org/r/1070558... 
[09:40:00] PROBLEM - Host gitlab-replica-a.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:41:49] (03PS12) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [09:41:50] (03PS16) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [09:41:51] 06SRE, 06collaboration-services, 10vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10120626 (10LSobanski) @Krd I created a separate task for this and linked to the full script in the description: {https://phabricator.wik... [09:42:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3890/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [09:42:47] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-text_esams for 9.2.5-1wm2 [09:42:48] (03CR) 10Volans: "Question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [09:43:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [09:44:02] is that gitlab-replica-a alert expected? [09:44:11] oh.. 
that answers my question [09:44:12] :) [09:45:02] RECOVERY - Host gitlab-replica-a.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:45:28] 06SRE, 06MediaWiki-Engineering, 10MediaWiki-extensions-BounceHandler, 10Observability-Metrics, 07Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10120647 (10Joe) I'm finally seeing bounces get processed in logstash https://logstash.wikimedia.org/goto/3d34190bb82088f19669b0c66... [09:45:32] (03CR) 10Volans: [C:03+1] "LGTM with the caveat to check if DELL's behaves correctly" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:47:14] yep expected and resolved :) [09:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:56] (03PS13) 10Btullis: Add a profile::analytics::cluster::hdfs_file defined type [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) [09:48:57] (03PS17) 10Btullis: Add some test secrets to an-test-master servers [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) [09:49:50] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3891/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070628 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [09:53:18] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:53:19] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:53:50] (03PS1) 10JMeybohm: kubernetes-generic: Alert on workers being unschedulable 
(cordoned) [alerts] - 10https://gerrit.wikimedia.org/r/1070887 [09:54:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2089.codfw.wmnet with OS bookworm [09:54:38] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmne... [09:54:41] (03PS2) 10JMeybohm: kubernetes-generic: Alert on workers being unschedulable (cordoned) [alerts] - 10https://gerrit.wikimedia.org/r/1070887 [09:59:32] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2089.codfw.wmnet [09:59:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2089.codfw.wmnet [09:59:36] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2089.codfw.wmnet [09:59:37] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:59:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120679 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host wikikube-worke... 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1000) [10:00:32] (03CR) 10Elukey: "Okok perfect, but at this point when we'll shutdown 1003 we'll have to tell people not to puppet merge or commit to private, otherwise I t" [puppet] - 10https://gerrit.wikimedia.org/r/1070813 (https://phabricator.wikimedia.org/T373888) (owner: 10Muehlenhoff) [10:01:34] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10120683 (10Volans) I need to recollect my old memories and check local branches, the hardest part IIRC are not the code changes but the grammar changes to support it. Do... [10:02:01] (03CR) 10Btullis: [C:04-1] "Thanks Andrew. All great points." [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:03:35] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2084.codfw.wmnet [10:03:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120684 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbering for host wikikube... [10:03:48] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2084.codfw.wmnet with OS bullseye [10:03:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2084.codf... 
[10:03:58] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2084 [10:04:05] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:06:18] (03PS46) 10Stevemunene: WIP: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [10:06:25] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:56] (03PS1) 10Hnowlan: Allow copyuploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) [10:07:06] (03PS2) 10Hnowlan: Allow copyuploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) [10:10:15] (03PS1) 10Brouberol: airflow: run git-sync before initdb in the init phase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070892 [10:11:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on puppetmaster1003.eqiad.wmnet with reason: hardware fix [10:12:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on puppetmaster1003.eqiad.wmnet with reason: hardware fix [10:12:09] (03CR) 10Stevemunene: [C:03+1] airflow: run git-sync before initdb in the init phase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070892 (owner: 10Brouberol) [10:12:25] (03CR) 10Brouberol: [C:03+2] airflow: run git-sync before initdb in the init phase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070892 (owner: 10Brouberol) [10:12:52] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow: run git-sync before initdb in the init phase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070892 (owner: 10Brouberol) [10:13:17] (03PS1) 10Volans: 
CHANGELOG: add changelogs for release v1.2.6 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1070894 [10:13:35] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10120731 (10MoritzMuehlenhoff) @VRiley-WMF puppetmaster1003 has been taken out of active duty and I've set downtime, you can proceed with the drive swap any time. [10:14:47] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2084 - hnowlan@cumin1002" [10:14:52] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2084 - hnowlan@cumin1002" [10:14:52] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:14:52] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2084.codfw.wmnet 170.16.192.10.in-addr.arpa 0.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:14:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2084.codfw.wmnet 170.16.192.10.in-addr.arpa 0.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:14:56] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2084 [10:15:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:16:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:16:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070891 
(https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:17:35] (03PS47) 10Brouberol: WIP: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [10:19:15] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1070877 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [10:19:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2084 [10:19:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2084 [10:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:21:20] PROBLEM - Host mw2319 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:44] PROBLEM - Host lists2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:14] RECOVERY - Host lists2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [10:29:33] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:29:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2168.codfw.wmnet onto db2222.codfw.wmnet [10:30:03] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:30:21] RESOLVED: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:45] (03PS1) 10Muehlenhoff: CAS: Disable memcached on idp-test [puppet] - 
10https://gerrit.wikimedia.org/r/1070899 (https://phabricator.wikimedia.org/T367487) [10:30:48] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7246 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [10:30:58] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief1001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7256 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [10:31:09] "that's expected" [10:32:30] FIRING: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:36:21] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2084.codfw.wmnet with reason: host reimage [10:36:25] (03CR) 10Brouberol: Add a profile::analytics::cluster::hdfs_file defined type (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1070617 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:38:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2084.codfw.wmnet with reason: host reimage [10:39:03] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v1.2.6 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1070894 (owner: 10Volans) [10:40:37] (03PS1) 10Clément Goubert: sre.k8s.renumber-node: Prompt for deploy puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 [10:40:37] (03PS1) 10Clément Goubert: sre.k8s.renumber-node: Log cookbook failure as error [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 [10:42:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-text_esams for 
9.2.5-1wm2 [10:44:47] (03PS6) 10Slyngshede: Permission approval/rejection [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 [10:47:34] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10120858 (10hnowlan) [10:48:19] (03CR) 10Ladsgroup: [C:03+1] mariadb: productionize db2224 [puppet] - 10https://gerrit.wikimedia.org/r/1070864 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [10:50:29] (03CR) 10Hnowlan: [C:03+1] sre.k8s.renumber-node: Log cookbook failure as error [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 (owner: 10Clément Goubert) [10:50:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070899 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [10:51:34] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [10:51:40] (03CR) 10AOkoth: [C:03+1] etherpad: stop the prometheus exporter on the replica [puppet] - 10https://gerrit.wikimedia.org/r/1070881 (https://phabricator.wikimedia.org/T374083) (owner: 10Jelto) [10:51:56] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [10:52:30] RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:53:59] (03PS3) 10Elukey: redfish: allow 200 responses in chassis_reset [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) [10:53:59] (03PS1) 10Elukey: tests: add more tests for Redfish's module change user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) [10:54:14] (03CR) 10Hnowlan: [C:03+1] "lgtm, one question" [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [10:55:31] (03CR) 10Elukey: tests: 
add more tests for Redfish's module change user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:55:33] jouncebot: now [10:55:33] For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1000) [10:55:37] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10120867 (10MatthewVernon) Further Data Persistence nodes (Ceph / Swift) in `C2`: |`C2` | moss-be2003 | needs maintenance mode setting (a... [10:55:37] jouncebot: next [10:55:37] In 1 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1200) [10:55:52] (03PS1) 10AOkoth: vrts: swap replica to new host [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) [10:55:55] (03CR) 10Slyngshede: "Fixed a few styling issues, and allow the feature to be selectively enabled." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 (owner: 10Slyngshede) [10:56:09] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.6 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1070894 (owner: 10Volans) [10:56:41] (03PS2) 10AOkoth: vrts: swap replica to new host [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) [10:56:47] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1070899 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [10:57:14] (03CR) 10Elukey: redfish: allow 200 responses in chassis_reset (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:57:45] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2084.codfw.wmnet with OS bullseye [10:57:45] (03PS3) 10AOkoth: vrts: swap replica to new host [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) [10:57:45] (03CR) 10Muehlenhoff: [C:03+2] Switch acmechief1001/2001 to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [10:57:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2084.codfw.wm... 
[10:58:07] !log homer lsw1-b3-codfw* commit [10:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:53] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2084.codfw.wmnet [11:00:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2084.codfw.wmnet [11:00:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2084.codfw.wmnet [11:01:05] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120886 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor... [11:03:02] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10120893 (10VRiley-WMF) a:05VRiley-WMF→03MoritzMuehlenhoff This drive has been replaced! Thanks! 
[11:05:18] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [11:05:58] (03CR) 10CI reject: [V:04-1] redfish: allow 200 responses in chassis_reset [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [11:07:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 365, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:36] (03PS2) 10Muehlenhoff: Switch acmechief to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1050244 (https://phabricator.wikimedia.org/T365799) [11:08:04] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db2198 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T374095 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:08:13] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [11:08:17] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:08:17] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095 (10ops-monitoring-bot) 03NEW [11:08:27] (03CR) 10Jelto: [V:03+1 C:03+2] etherpad: stop the prometheus exporter on the replica [puppet] - 10https://gerrit.wikimedia.org/r/1070881 (https://phabricator.wikimedia.org/T374083) (owner: 10Jelto) [11:09:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [11:09:19] (03PS2) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on deploy servers [cookbooks] - 
10https://gerrit.wikimedia.org/r/1070903 [11:09:19] (03PS2) 10Clément Goubert: sre.k8s.renumber-node: Log cookbook failure as error [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 [11:10:03] (03CR) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on deploy servers (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [11:11:00] (03PS3) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on deploy servers [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 [11:11:00] (03PS3) 10Clément Goubert: sre.k8s.renumber-node: Log cookbook failure as error [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 [11:12:06] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1070908/3893/" [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [11:12:26] (03CR) 10Muehlenhoff: [C:03+2] Remove Frantz Joseph from list of approvers [puppet] - 10https://gerrit.wikimedia.org/r/1070856 (owner: 10Muehlenhoff) [11:13:28] Krinkle: Re https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069258 does this need to be scheduled for deployment using the deployment calendar or can that be merged any time?
[11:14:22] (03CR) 10Vgutierrez: [C:03+1] Switch acmechief to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1050244 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:15:49] (03CR) 10Clément Goubert: [C:03+1] kubernetes-generic: Alert on workers being unschedulable (cordoned) [alerts] - 10https://gerrit.wikimedia.org/r/1070887 (owner: 10JMeybohm) [11:16:34] (03CR) 10JMeybohm: [C:03+1] sre.k8s.renumber-node: Run puppet on deploy servers [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [11:16:45] (03CR) 10JMeybohm: [C:03+2] eventgate: Replace end-to-end readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070257 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [11:16:49] PROBLEM - Host lists2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:17:02] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host etherpad2002.codfw.wmnet [11:17:07] (03CR) 10AOkoth: "New Replica: https://puppet-compiler.wmflabs.org/output/1070908/3894/" [puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [11:17:23] (03CR) 10AOkoth: "Old Replica." 
[puppet] - 10https://gerrit.wikimedia.org/r/1070908 (https://phabricator.wikimedia.org/T373420) (owner: 10AOkoth) [11:17:31] RECOVERY - Host lists2001 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [11:17:56] (03Merged) 10jenkins-bot: eventgate: Replace end-to-end readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070257 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [11:17:56] (03CR) 10JMeybohm: [C:03+2] kubernetes-generic: Alert on workers being unschedulable (cordoned) [alerts] - 10https://gerrit.wikimedia.org/r/1070887 (owner: 10JMeybohm) [11:18:07] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 447, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:36] (03CR) 10JMeybohm: [C:03+1] sre.k8s.renumber-node: Log cookbook failure as error [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 (owner: 10Clément Goubert) [11:19:11] (03Merged) 10jenkins-bot: kubernetes-generic: Alert on workers being unschedulable (cordoned) [alerts] - 10https://gerrit.wikimedia.org/r/1070887 (owner: 10JMeybohm) [11:19:51] (03CR) 10Slyngshede: [V:03+2 C:03+2] P:idp Add Keystone dummy secret [labs/private] - 10https://gerrit.wikimedia.org/r/1070588 (owner: 10Slyngshede) [11:19:53] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [11:19:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2090.codfw.wmnet with OS bullseye [11:19:59] !log ladsgroup@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s6 T374087 [11:20:02] T374087: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T374087 [11:20:37] !log ladsgroup@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s6 T374087 [11:20:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
etherpad2002.codfw.wmnet [11:21:06] (03CR) 10Clément Goubert: [C:03+1] Allow copyuploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:21:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2090.codfw.wmne... [11:21:22] !log ladsgroup@cumin2002 dbctl commit (dc=all): 'Set db2214 with weight 0 T374087', diff saved to https://phabricator.wikimedia.org/P68688 and previous config saved to /var/cache/conftool/dbconfig/20240905-112121-ladsgroup.json [11:21:44] !log homer cr*codfw* commit 'T372878' [11:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:47] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:22:00] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2090.codfw.wmnet [11:22:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2090.codfw.wmnet [11:22:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2090.codfw.wmnet [11:22:09] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [11:22:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120940 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host wikikube-worke... 
[11:22:12] jouncebot: nowandnext [11:22:12] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [11:22:12] In 0 hour(s) and 37 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1200) [11:22:21] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [11:23:08] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [11:23:50] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [11:24:55] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [11:25:13] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2029.codfw.wmnet [11:25:17] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2029.codfw.wmnet [11:25:20] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [11:25:21] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [11:25:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120948 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumbering for host wikikub... 
[11:25:39] (03CR) 10Btullis: Metrics Platform Instrument Configurator: Enabling prometheus monitoring for MPIC (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070649 (https://phabricator.wikimedia.org/T361346) (owner: 10Santiago Faci) [11:25:42] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [11:25:43] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [11:25:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2029.codfw.wmnet [11:26:02] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [11:26:20] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1070871 (https://phabricator.wikimedia.org/T374087) [11:26:23] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1070871 (https://phabricator.wikimedia.org/T374087) (owner: 10Gerrit maintenance bot) [11:26:25] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1070871 (https://phabricator.wikimedia.org/T374087) (owner: 10Gerrit maintenance bot) [11:26:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2029.codfw.wmnet with OS bullseye [11:26:49] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10120950 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2029.cod... 
[11:27:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2029 [11:27:48] !log Starting s6 codfw failover from db2129 to db2214 - T374087 [11:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:51] T374087: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T374087 [11:27:57] (03PS5) 10Samtar: CS: Load CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063847 (https://phabricator.wikimedia.org/T372527) [11:28:02] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [11:28:12] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host etherpad2002.codfw.wmnet [11:28:48] !log ladsgroup@cumin2002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T374087', diff saved to https://phabricator.wikimedia.org/P68689 and previous config saved to /var/cache/conftool/dbconfig/20240905-112846-ladsgroup.json [11:28:59] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [11:29:00] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:29:25] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [11:29:26] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [11:29:52] going to deploy two config patches, any objections? 
[11:30:05] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [11:31:19] (03CR) 10Volans: tests: add more tests for Redfish's module change user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [11:31:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063847 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [11:31:48] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:31:55] !log ladsgroup@cumin2002 dbctl commit (dc=all): 'Depool db2129 T374087', diff saved to https://phabricator.wikimedia.org/P68690 and previous config saved to /var/cache/conftool/dbconfig/20240905-113153-ladsgroup.json [11:32:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad2002.codfw.wmnet [11:32:40] (03Merged) 10jenkins-bot: CS: Load CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063847 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [11:32:55] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:33:00] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1063847|CS: Load CommunityRequests (T372527)]] [11:33:03] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527 [11:34:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:34:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:35:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:35:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2129.codfw.wmnet with reason: Maintenance [11:36:21] (CR) Volans: "I've seen multiple CRs for this cookbook, are those different developments or errors that were not catched using test-cookbook?" [cookbooks] - https://gerrit.wikimedia.org/r/1070904 (owner: Clément Goubert) [11:36:26] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:36:34] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:36:47] !log samtar@deploy1003 samtar: Backport for [[gerrit:1063847|CS: Load CommunityRequests (T372527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:36:54] !log samtar@deploy1003 samtar: Continuing with sync [11:37:09] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:38:17] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:38:25] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [11:38:28] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:38:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P68691 and previous config saved to /var/cache/conftool/dbconfig/20240905-113852-ladsgroup.json [11:38:54] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:39:25] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [11:39:26] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [11:39:32] (CR) Clément Goubert: "The first CR is a functionality change, this one is just a fix for error logging, that I didn't want to mix with the first one." [cookbooks] - https://gerrit.wikimedia.org/r/1070904 (owner: Clément Goubert) [11:39:41] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:40:09] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [11:40:10] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [11:41:21] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:41:33] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063847|CS: Load CommunityRequests (T372527)]] (duration: 08m 32s) [11:41:36] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527 [11:41:37] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:41:57] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [11:41:58] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [11:42:04] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070557 (https://phabricator.wikimedia.org/T372527) (owner: Samtar) [11:42:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:42:17] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [11:42:49] (Merged) jenkins-bot: IS: Enable CommunityRequests on Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1070557 (https://phabricator.wikimedia.org/T372527) (owner: Samtar) [11:43:08] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070557|IS: Enable CommunityRequests on Meta (T372527)]] [11:45:07] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10121019
(Ladsgroup) It's a backup source. There shouldn't be anything needed from our side. [11:45:09] !log samtar@deploy1003 samtar: Backport for [[gerrit:1070557|IS: Enable CommunityRequests on Meta (T372527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:45:45] !log samtar@deploy1003 samtar: Continuing with sync [11:45:51] (CR) Volans: [C:+2] Upstream release v1.2.6 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/1070912 (owner: Volans) [11:46:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gerrit2003.wikimedia.org [11:47:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:47:31] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1070586 (owner: Slyngshede) [11:48:25] (CR) Muehlenhoff: [C:+2] Switch acmechief to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1050244 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff) [11:50:18] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070557|IS: Enable CommunityRequests on Meta (T372527)]] (duration: 07m 09s) [11:50:20] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527 [11:52:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:52:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2003.wikimedia.org [11:52:51] (CR) Filippo Giunchedi: [C:+1] librenms: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1070865
(https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [11:53:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:53:52] (PS1) Cathal Mooney: Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64 [dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) [11:53:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P68692 and previous config saved to /var/cache/conftool/dbconfig/20240905-115357-ladsgroup.json [11:54:57] (CR) CI reject: [V:-1] Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64 [dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) (owner: Cathal Mooney) [11:55:47] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:55:50] (CR) Muehlenhoff: [C:+2] librenms: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1070865 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [11:57:45] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:58:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:58:19] (PS2) Cathal Mooney: Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64 [dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) [11:59:23] (CR) CI reject: [V:-1] Add include statement for new
IPv6 ptr range 2a02:ec80:ff00:101::/64 [dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) (owner: Cathal Mooney) [11:59:40] (PS1) Muehlenhoff: Revert "Temporarily remove puppetmaster1003 from rotation" [puppet] - https://gerrit.wikimedia.org/r/1070917 (https://phabricator.wikimedia.org/T373888) [11:59:58] (PS3) Cathal Mooney: Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64 [dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1200) [12:00:29] (PS2) Muehlenhoff: Revert "Temporarily remove puppetmaster1003 from rotation" [puppet] - https://gerrit.wikimedia.org/r/1070917 (https://phabricator.wikimedia.org/T373888) [12:00:51] (Merged) jenkins-bot: Upstream release v1.2.6 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/1070912 (owner: Volans) [12:01:41] (CR) Cathal Mooney: "apologies for self-merge, serviceops blocked on dns cookbook however."
[dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) (owner: Cathal Mooney) [12:01:44] (CR) Cathal Mooney: [C:+2] Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64 [dns] - https://gerrit.wikimedia.org/r/1070916 (https://phabricator.wikimedia.org/T330153) (owner: Cathal Mooney) [12:04:18] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2029.codfw.wmnet on all recursors [12:04:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2029.codfw.wmnet on all recursors [12:05:58] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2029 - cgoubert@cumin1002" [12:06:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2029 - cgoubert@cumin1002" [12:06:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:06:02] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2029.codfw.wmnet 199.16.192.10.in-addr.arpa 9.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:06:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2029.codfw.wmnet 199.16.192.10.in-addr.arpa 9.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:06:06] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2029 [12:06:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2029 [12:06:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2029 [12:09:04] !log
ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P68693 and previous config saved to /var/cache/conftool/dbconfig/20240905-120903-ladsgroup.json [12:09:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:13] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1070594 (owner: Slyngshede) [12:09:23] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:15] !log homer cr*codfw* commit 'T372878' [12:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:18] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [12:12:44] (CR) David Caro: [C:+2] toolforge::prometheus: keep only the metrics we use for nginx [puppet] - https://gerrit.wikimedia.org/r/1070877 (https://phabricator.wikimedia.org/T370143) (owner: David Caro) [12:18:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 363, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:20:03] !log eoghan@cumin1002 START - Cookbook sre.hosts.reboot-single for host lists1004.wikimedia.org [12:20:12] (PS1) Filippo Giunchedi: logging: add script to query for orphan traces [puppet] - https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) [12:21:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 5%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68694 and previous config saved to /var/cache/conftool/dbconfig/20240905-122108-arnaudb.json [12:22:47] (CR)
CI reject: [V:-1] logging: add script to query for orphan traces [puppet] - https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: Filippo Giunchedi) [12:23:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage [12:24:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P68695 and previous config saved to /var/cache/conftool/dbconfig/20240905-122408-ladsgroup.json [12:24:38] (PS4) Clément Goubert: sre.k8s.renumber-node: Refactor logging and error handling [cookbooks] - https://gerrit.wikimedia.org/r/1070904 [12:25:23] (CR) Clément Goubert: sre.k8s.renumber-node: Refactor logging and error handling (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1070904 (owner: Clément Goubert) [12:26:13] (CR) Arnaudb: [C:+2] mariadb: productionize db2224 [puppet] - https://gerrit.wikimedia.org/r/1070864 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb) [12:26:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage [12:26:52] dcaro: are you ok for me to merge your change? David Caro: toolforge::prometheus: keep only the metrics we use for nginx (47329da0ed) [12:27:06] arnaudb: oh yes please :) [12:27:09] lets go then!
[12:28:09] {{done}} [12:29:01] (PS2) Filippo Giunchedi: logging: add script to query for orphan traces [puppet] - https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) [12:29:53] (PS19) Andrew Bogott: Horizon: enable OIDC auth [puppet] - https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [12:30:03] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott) [12:31:00] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lists1004.wikimedia.org [12:33:43] Hm. That cookbook failure was: `ValueError: invalid literal for int() with base 10: "Warning: Permanently added the ECDSA host key for IP address '2620:0:861:3:208:80:154:81' to the list of known hosts.\n1725539433"` [12:34:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: provisionning db2224.codfw.wmnet - T373579 [12:34:50] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [12:35:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: provisionning db2224.codfw.wmnet - T373579 [12:35:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: provisionning db2224.codfw.wmnet - T373579 [12:35:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: provisionning db2224.codfw.wmnet - T373579 [12:35:19] (CR) Elukey: [C:+1] "While I am very ignorant on the custom script's specific, the logic looks good!"
[software/netbox-extras] - https://gerrit.wikimedia.org/r/1070670 (https://phabricator.wikimedia.org/T374024) (owner: Cathal Mooney) [12:36:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2124 in db2224 for T373579', diff saved to https://phabricator.wikimedia.org/P68696 and previous config saved to /var/cache/conftool/dbconfig/20240905-123619-arnaudb.json [12:36:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 15%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68697 and previous config saved to /var/cache/conftool/dbconfig/20240905-123627-arnaudb.json [12:38:15] (CR) Andrew Bogott: [C:+1] P:idp Add keystone OIDC configuration [puppet] - https://gerrit.wikimedia.org/r/1070586 (owner: Slyngshede) [12:38:29] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2124.codfw.wmnet onto db2224.codfw.wmnet [12:38:50] (PS16) Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [12:39:35] (PS3) Cathal Mooney: Fix some bugs with the move_server Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1070670 (https://phabricator.wikimedia.org/T374024) [12:41:02] PROBLEM - mailman3_runners on lists1004 is CRITICAL: PROCS CRITICAL: 15 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:46:55] (CR) Cathal Mooney: [C:+2] Fix some bugs with the move_server Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1070670 (https://phabricator.wikimedia.org/T374024) (owner: Cathal Mooney) [12:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:28] (PS1) Clément Goubert: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - https://gerrit.wikimedia.org/r/1070922 [12:49:34] (Merged) jenkins-bot: Fix some bugs with the move_server Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1070670 (https://phabricator.wikimedia.org/T374024) (owner: Cathal Mooney) [12:49:38] (CR) David Caro: [V:+1] spicerack: allow running by non-ops (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1067301 (owner: David Caro) [12:51:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68698 and previous config saved to /var/cache/conftool/dbconfig/20240905-125133-arnaudb.json [12:51:50] PROBLEM - Host mw2434 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:52:06] PROBLEM - Host mw2435 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:45] (PS4) Elukey: redfish: allow 200 responses in chassis_reset [software/spicerack] - https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) [12:53:45] (PS2) Elukey: tests: add more tests for Redfish's module change user [software/spicerack] - https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) [12:54:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2029.codfw.wmnet with OS bullseye [12:54:46] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121332
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2029.codfw.w... [12:54:49] !log homer lsw1-b6-codfw* commit 'T372878' [12:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [12:55:03] (PS1) David Caro: toolforge::prometheus: fix typo in config [puppet] - https://gerrit.wikimedia.org/r/1070933 [12:56:17] (PS3) Elukey: tests: add more tests for Redfish's module change user [software/spicerack] - https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) [12:56:20] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:56:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [12:56:39] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:56:47] (CR) David Caro: [C:+2] toolforge::prometheus: fix typo in config [puppet] - https://gerrit.wikimedia.org/r/1070933 (owner: David Caro) [12:56:47] (PS1) Arnaudb: mariadb: productionize db2225 [puppet] - https://gerrit.wikimedia.org/r/1070934 (https://phabricator.wikimedia.org/T373579) [12:57:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:58:19] (CR) Ladsgroup: [C:+1] mariadb: productionize db2225 [puppet] - https://gerrit.wikimedia.org/r/1070934 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb) [12:58:25] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2029.codfw.wmnet [12:58:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2029.codfw.wmnet [12:58:28] !log
cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2029.codfw.wmnet [12:58:42] SRE, Infrastructure-Foundations, netops, serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121341 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [12:59:32] (CR) Muehlenhoff: [C:+2] Temporarily disable stunnel for the Puppet 7 migration of deployment hosts [puppet] - https://gerrit.wikimedia.org/r/1070236 (owner: Muehlenhoff) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1300) [13:00:04] MatmaRex, joelyrookewmde, and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] hi! I'm here :) [13:00:28] (CR) Arnaudb: [C:+2] mariadb: productionize db2225 [puppet] - https://gerrit.wikimedia.org/r/1070934 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb) [13:00:40] o/ [13:00:40] hi [13:00:52] moritzm: I seem to have collided with your patch [13:00:53] my patches are just cleanup, they can go last if we have time [13:00:55] can I merge ?
:) [13:02:45] I can probably deploy in a few minutes fwiw [13:02:50] though some of the patches look scary [13:04:09] Lucas_WMDE: I can deploy if you want :) [13:04:19] sure ^^ [13:04:41] I have a stack of logging changes to review which were done by MatmaRex [13:05:08] but I feel intimidated at jumping in that boat :/ [13:05:12] anyway lets proceed [13:05:26] moritzm: I release the lock, I'll let you get at it if/when you're ready, my part can go through anytime :) [13:05:40] going to do hnowlan one [13:06:01] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan) [13:06:22] MatmaRex: I am happy to see you have scheduled the logging config changes already! Sorry for the lack of review on them so far :/ [13:06:27] Gerrit could not merge the change '1070891' as is and could require a rebase [13:06:28] grr [13:06:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 50%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68699 and previous config saved to /var/cache/conftool/dbconfig/20240905-130638-arnaudb.json [13:06:41] (PS2) Clément Goubert: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - https://gerrit.wikimedia.org/r/1070922 [13:06:44] (PS3) Hnowlan: Allow copyuploads on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) [13:06:49] thanks hashar [13:07:22] I am still conflicted as to whether I should change Gerrit config to reduce the amount of so called "conflicts" in wmf-config/InitialiseSettings.php [13:07:27] or maybe that file should be split [13:07:38] (CR) CI reject: [V:-1] tests: add more tests for Redfish's module change user [software/spicerack] - https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [13:07:50] hashar: no
problem, i'm not in a hurry. Krinkle reviewed those two so i went ahead [13:07:54] arnaudb: please merge along! [13:08:08] MatmaRex: I will deploy them so :) [13:08:15] (CR) Ebernhardson: [C:+1] search: use mul fallback for fine-tuned search profiles (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) (owner: DCausse) [13:08:18] (PS4) Elukey: tests: add more tests for Redfish's module change user [software/spicerack] - https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) [13:08:26] hashar: that's just because the repo is configured to require rebase instead of merge, right? (which is actually kind of nice) [13:08:39] (especially now that gerrit rebases "on behalf of" the original author) [13:08:57] (PS3) Bartosz Dziewoński: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile [mediawiki-config] - https://gerrit.wikimedia.org/r/1069320 [13:09:02] (PS4) Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - https://gerrit.wikimedia.org/r/1069321 [13:09:08] MatmaRex: because the repo does not have "Allow content merge" ticked. So anytime a change is merged that is touching InitialiseSettings.php, any other changes touching that file are marked as being in merge conflict
Ie Gerrit does not check whether it is mergeable as-is [13:09:42] and yeah auto rebasing is quite nice :] [13:09:51] joelyrookewmde: I will deploy your patch next [13:09:58] (03CR) 10TrainBranchBot: "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:10:10] * Lucas_WMDE around now btw [13:10:16] (meeting isn’t happening after all) [13:10:27] hashar: i've just snuck another config patch in if theres time [13:10:38] (03CR) 10Hashar: [C:03+2] Fix missing wikibase link in Minerva sidebar [skins/MinervaNeue] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070878 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [13:10:44] (03Merged) 10jenkins-bot: Allow copyuploads on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070891 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:10:45] great thanks! It's a small bugfix for an older ticket which is currently with our pilot wikis [13:10:48] (03CR) 10Elukey: tests: add more tests for Redfish's module change user (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:11:04] joelyrookewmde: that sounds awesome. I love seeing old tickets being fixed! [13:11:06] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070891|Allow copyuploads on test2wiki (T356241)]] [13:11:09] ebernhardson: no problems [13:11:09] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:11:10] (03CR) 10Elukey: [C:03+2] redfish: allow 200 responses in chassis_reset (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:11:30] oh [13:11:34] while things merges [13:11:41] do you know how to brag on LinkedIn? 
[13:11:55] i love bragging on linkedin [13:12:03] what's your task [13:12:54] comment on a post such as "Slack is the new generation" "Communication troubles how to solve them" [13:13:07] and comment something like: "I use IRC." [13:13:21] variant "I use IRC professionally." [13:13:25] free karma [13:13:26] anyway. [13:13:28] hahahhahaha [13:13:38] some people like to watch the world burn :P [13:13:57] I have to start doing this [13:14:36] 15:14:10 ⇐ joelyrookewmde quit (~joelyrook@user/joelyrookewmde) Quit: Client closed [13:14:36] 15:14:25 → joelyrookewmde joined (~joelyrook@user/joelyrookewmde) [13:14:40] yeah nothing replaces IRC [13:14:46] perfect timing [13:14:57] !log hashar@deploy1003 hnowlan, hashar: Backport for [[gerrit:1070891|Allow copyuploads on test2wiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:15:13] it wasn't the IRC fault I was just vigorously shaming linkedin users for their slack posts [13:15:14] hnowlan: does it need test? [13:15:34] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir-eqsin [13:16:12] hashar: I'll give it a look [13:16:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir-eqsin [13:17:25] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:17:43] hashar: looks good, thanks! [13:18:23] !log hashar@deploy1003 hnowlan, hashar: Continuing with sync [13:18:46] joelyrookewmde patch is 900 seconds away from merging ( https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1070878?tab=checks ) [13:18:47] ...
[13:19:05] ebernhardson: lets break search instead :) [13:19:27] ops-codfw, SRE, collaboration-services, DC-Ops, and 2 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10121449 (MatthewVernon) There are 4 swift servers in `C4` - ms-be2058 ms-be2064 ms-be2072 ms-be2077 ; they'll need checking afterwards.... [13:20:49] hashar: sadly this can't break anything, the config isn't used until the next train deploy [13:21:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2124.codfw.wmnet onto db2224.codfw.wmnet [13:21:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68700 and previous config saved to /var/cache/conftool/dbconfig/20240905-132144-arnaudb.json [13:22:52] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070891|Allow copyuploads on test2wiki (T356241)]] (duration: 11m 45s) [13:22:55] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:23:29] arnaudb: I've merged it now along [13:23:42] ebernhardson: how so that is a delayed breakage :] [13:23:43] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) (owner: DCausse) [13:23:51] (meanwhile I posted on linkedin https://www.linkedin.com/feed/update/urn:li:share:7237450137826996224/ ) [13:24:04] (Merged) jenkins-bot: redfish: allow 200 responses in chassis_reset [software/spicerack] - https://gerrit.wikimedia.org/r/1070868 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [13:24:10] (I should delete my account there really but that is another story) [13:24:29] (Merged) jenkins-bot: search: use mul fallback for fine-tuned search profiles [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449
(https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [13:24:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10121464 (10MatthewVernon) There are some impact Swift servers: - ms-be2054 and ms-be2078 and thanos-be2003 - these just need a quick c... [13:24:50] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060449|search: use mul fallback for fine-tuned search profiles (T371401)]] [13:24:53] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [13:25:38] (03PS48) 10Stevemunene: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [13:25:57] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir [13:26:53] !log hashar@deploy1003 hashar, dcausse: Backport for [[gerrit:1060449|search: use mul fallback for fine-tuned search profiles (T371401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:56] !log hashar@deploy1003 hashar, dcausse: Continuing with sync [13:27:48] joelyrookewmde: do you know how to test a change on the debug servers? [13:27:48] hashar: I’m intrigued by the urn in that LinkedIn URL [13:27:58] yep [13:28:04] i think [13:28:05] great! 
[13:28:09] let's find out [13:28:17] Lucas_WMDE: I don't know, maybe that is how they internally address the rendering of a resource [13:28:26] Lucas_WMDE: similar to how we use uri for the external storage [13:28:33] yeah, maybe [13:29:39] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [13:29:53] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 14s) [13:31:02] RECOVERY - Host kubernetes2010 is UP: PING WARNING - Packet loss = 90%, RTA = 0.18 ms [13:31:12] PROBLEM - SSH on kubernetes2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:31:19] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1060449|search: use mul fallback for fine-tuned search profiles (T371401)]] (duration: 06m 28s) [13:31:22] T371401: Adapt search ranking for mul language code - https://phabricator.wikimedia.org/T371401 [13:31:45] (03PS17) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:32:30] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:33:11] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:33:52] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10121496 (10MatthewVernon) These racks have the following Swift/Ceph nodes: - ms-fe2012 moss-fe2002 thanos-fe2003 (need depool beforehand... 
[13:34:21] (03CR) 10Elukey: [C:03+2] tests: add more tests for Redfish's module change user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:34:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir [13:35:47] joelyrookewmde: I will do your patch as soon as it merges ( watching https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php74/16995/console ) [13:35:59] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10121520 (10MatthewVernon) No affected swift/Ceph nodes in these racks. [13:36:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10121503 (10MatthewVernon) No Swift/Ceph nodes affected in this one. [13:36:43] (03CR) 10Muehlenhoff: "There's quite a few things which went wrong here; I think it would be better to revert entirely." 
[puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [13:36:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68701 and previous config saved to /var/cache/conftool/dbconfig/20240905-133649-arnaudb.json [13:36:56] (03PS4) 10Bartosz Dziewoński: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069320 [13:36:56] (03PS5) 10Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 [13:37:26] PROBLEM - Host kubernetes2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:33] I think I will just change the gerrit setting [13:37:37] but after that window [13:38:04] (03CR) 10Andrew Bogott: [C:03+2] P:idp Add keystone OIDC configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070586 (owner: 10Slyngshede) [13:39:10] (03Merged) 10jenkins-bot: Fix missing wikibase link in Minerva sidebar [skins/MinervaNeue] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070878 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [13:39:11] it will still require rebase to merge, right? 
[13:39:33] (03CR) 10Muehlenhoff: planet: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055996 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [13:39:47] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070878|Fix missing wikibase link in Minerva sidebar (T66315)]] [13:39:50] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [13:40:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10121536 (10MatthewVernon) There are these impacted Swift/Ceph nodes: - thanos-be2004 ms-be2056 ms-be2059 ms-be2073 ms-be2080... [13:40:09] MatmaRex: yeah but then Gerrit would compute the mergeability and if the change is mergeable it would no more be flagged as being in merge conflict [13:40:26] so upon submitting it, it will rebase it and resolve the "conflict" [13:40:34] I mean [13:40:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10121551 (10MatthewVernon) [13:40:43] technically it is a conflict (same file touched) [13:40:55] but it is resovable automatically [13:40:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10121554 (10MatthewVernon) [13:41:05] and I don't think we need to rereview what has changed in the file [13:41:09] well maybe we should rereview [13:41:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10121555 (10MatthewVernon) [13:41:30] joelyrookewmde: your change is in the pipes [13:41:35] 
10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10121561 (10MatthewVernon) [13:41:43] it can be risky with those long lists of wikis for different variables, you could easily have a patch apply in the wrong place [13:41:46] !log hashar@deploy1003 hashar, joelyrookewmde: Backport for [[gerrit:1070878|Fix missing wikibase link in Minerva sidebar (T66315)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:42:12] so requiring manual rebase seems like a good thing to me, so that you can at least see the final version of the patch that was applied [13:42:12] Changes synced to the testservers. (see https://wikitech.wikimedia.org/wiki/Mwdebug) [13:42:13] Please do any necessary checks before continuing. [13:42:17] joelyrookewmde: ^ :) [13:42:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: post db2224 clone repool', diff saved to https://phabricator.wikimedia.org/P68702 and previous config saved to /var/cache/conftool/dbconfig/20240905-134225-arnaudb.json [13:42:28] (instead of that happening in a merge commit that is not very visible) [13:43:10] (03PS18) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [13:43:47] (03PS49) 10Stevemunene: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [13:44:07] checking now [13:44:30] MatmaRex: I think forcing people to think about the rebase is why that repository does not auto resolve merges conflicts. That is the same on mediawiki-config [13:44:54] I think I removed it from puppet.git since we now have a way to test / review diffs of what a change is doing (via the puppet catalogue compiler) [13:45:08] (03CR) 10Btullis: [C:03+1] "Nice work. 
Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [13:45:50] (03Merged) 10jenkins-bot: tests: add more tests for Redfish's module change user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070907 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:46:09] (03PS1) 10Filippo Giunchedi: jaeger: set shorter span age [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070943 [13:46:39] hmmm does it matter which server I use? I am using K8s-mwdebug rn, checking on Hebrew and ukrainian wikipedia, but the change doesn't appear to be active [13:46:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [13:47:06] (03CR) 10CDanis: [C:03+1] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070943 (owner: 10Filippo Giunchedi) [13:47:11] hmm [13:47:28] I guess cause the rendering is cached? 
[13:47:46] (03CR) 10Elukey: [C:03+2] jaeger: add securityContext configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:47:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: provisionning db2225.codfw.wmnet - T373579 [13:47:50] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [13:47:51] ok will check a null edit [13:48:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: provisionning db2225.codfw.wmnet - T373579 [13:48:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: provisionning db2225.codfw.wmnet - T373579 [13:48:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: provisionning db2225.codfw.wmnet - T373579 [13:49:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2125 in db2225 for T373579', diff saved to https://phabricator.wikimedia.org/P68703 and previous config saved to /var/cache/conftool/dbconfig/20240905-134929-arnaudb.json [13:49:57] (03CR) 10Filippo Giunchedi: [C:03+2] jaeger: set shorter span age [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070943 (owner: 10Filippo Giunchedi) [13:50:53] (03CR) 10CDanis: [C:03+1] jaeger: add securityContext configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:51:01] (03PS1) 10Kamila Součková: kubernetes: Rename mw242[01] to wikikube-worker... 
[puppet] - 10https://gerrit.wikimedia.org/r/1070944 (https://phabricator.wikimedia.org/T372878) [13:51:40] (03PS20) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [13:51:40] (03PS1) 10Andrew Bogott: cloudweb: remove override of profile::idp::server_name [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) [13:52:07] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2125.codfw.wmnet onto db2225.codfw.wmnet [13:52:17] (03PS2) 10Andrew Bogott: cloudweb: remove override of profile::idp::server_name [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) [13:52:17] (03PS21) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [13:52:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:53:10] nope not there even after null edits [13:53:51] I guess I will have to debug this further but don't want to hold up deployment [13:54:02] well I can still deploy that fully [13:54:06] since it does not seem to cause issues [13:54:10] (03PS3) 10Andrew Bogott: cloudweb: remove override of profile::idp::server_name [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) [13:54:10] (03PS22) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [13:54:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:54:45] !log hashar@deploy1003 hashar, joelyrookewmde: Continuing with sync [13:54:48] sure, thanks [13:54:56] !log filippo@deploy1003 helmfile [aux-k8s-eqiad] 
START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:55:08] this way you can live debug it on the mwdebug servers :) [13:55:20] !log filippo@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:57:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 15%: post db2224 clone repool', diff saved to https://phabricator.wikimedia.org/P68704 and previous config saved to /var/cache/conftool/dbconfig/20240905-135731-arnaudb.json [13:57:50] (03PS4) 10Andrew Bogott: cloudweb: remove override of profile::idp::server_name [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) [13:57:50] (03PS23) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [13:57:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:58:04] I am waiting for the mediawiki related change to be fully deployed [13:58:19] then we will do MatmaRex logging changes [13:58:51] hashar: i'm around, but i'd also be fine with rescheduling them, up to you [13:59:00] well tomorrow is friday [13:59:02] after that is a week-end [13:59:08] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070878|Fix missing wikibase link in Minerva sidebar (T66315)]] (duration: 19m 21s) [13:59:11] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [13:59:11] who knows what will happen next week [13:59:15] I think it is fine to do now [13:59:17] heh [13:59:18] jouncebot: nowandnext [13:59:18] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1300) [13:59:18] In 1 hour(s) and 0 minute(s): Train log triage 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1500) [13:59:51] unless you rather have them deployed next week? [13:59:53] (03PS1) 10Hnowlan: Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) [14:00:01] QQ - do group 1 or 2 wikis have 1.43.0-wmf.21 right now? [14:00:23] joelyrookewmde: https://tools.wmflabs.org/versions/ [14:00:31] since that is the current train version, and it is post thursday lunchtime I would have thought yes but https://versions.toolforge.org/ says np [14:00:34] no* [14:00:52] ah [14:01:19] so that is dancy who is running the train this week and he is on the USA west coast. He will then promote group2 wikis later this evening (in 4 hours) [14:01:25] well group 1 [14:01:34] cause they did not get promoted yesterday, possibly due to a blocker [14:01:38] rightt [14:01:51] your backport was against wmf.21 ? [14:02:00] that completely explains why these changes do not affect my group 1 and 2 pilot wikis [14:02:02] yes [14:02:08] oops [14:02:16] throught it would already be in those groups [14:02:21] (03CR) 10Kamila Součková: [C:03+1] Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:02:23] sorry! I should have confirmed [14:02:54] (03CR) 10Effie Mouzeli: [C:03+1] Enable Copyupload-allowed-domains on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070948 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:03:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069320 (owner: 10Bartosz Dziewoński) [14:03:19] so tonight group 1 gets promoted, group 2 I guess would get wmf.21 on monday? 
[14:03:28] (03PS5) 10Andrew Bogott: cloudweb: correct override of profile::idp::server_name [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) [14:03:28] (03PS24) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [14:03:36] joelyrookewmde: or maybe all will be promoted tonight [14:03:42] (03PS1) 10Btullis: Enable IPv6 for the envoyproxy on DPE Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/1070949 (https://phabricator.wikimedia.org/T330153) [14:03:43] (03PS1) 10Btullis: Add the anycast VIP for radosgw to DPE Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/1070950 (https://phabricator.wikimedia.org/T330153) [14:03:45] joelyrookewmde: or group 2 is only done next monday [14:03:48] (03PS1) 10Arnaudb: mariadb: productionize db2227 [puppet] - 10https://gerrit.wikimedia.org/r/1070946 (https://phabricator.wikimedia.org/T373579) [14:03:49] it depends [14:03:50] hmm ok [14:03:56] but we can backport to wmf.20 if that applies [14:03:58] (03PS1) 10Muehlenhoff: grafana: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1070951 (https://phabricator.wikimedia.org/T135991) [14:04:00] (03Merged) 10jenkins-bot: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069320 (owner: 10Bartosz Dziewoński) [14:04:10] would it be better to do that now ? 
[14:04:19] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1069320|Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile]] [14:04:19] usually people leave the patch in master [14:04:34] and they check it when it is deployed the next week [14:04:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3895/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070950 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis) [14:04:46] which saves one the trouble of dealing with backports / deploy / verify etc [14:04:47] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070949 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis) [14:04:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:04:52] aka they let their fixes ride the train [14:05:05] yeah I see [14:05:08] but I don't mind doing the wmf.20 deployment if you can make a cherry pick for it [14:05:13] it is just long [14:05:19] would appreciate that if you have the time [14:05:31] let's do that :) [14:05:43] just cherry pick and give me the change number :] [14:05:53] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10121660 (10VRiley-WMF) a:03VRiley-WMF [14:05:59] awesome one sec [14:06:03] !log uploaded python3-wmflib_1.2.6 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [14:06:04] (03CR) 10Stevemunene: [C:03+2] airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [14:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:06] (03PS1) 10Joely Rooke WMDE: Fix missing wikibase 
link in Minerva sidebar [skins/MinervaNeue] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070953 (https://phabricator.wikimedia.org/T66315) [14:06:16] 1070953 [14:06:20] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1069320|Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:29] hashar: for my logging cleanup, i think you'll have to check that the testwiki extra log file is still being written after the change, since i don't have access to the servers where it lives. let me look that up [14:06:31] + $wmgExtraLogFile = '/tmp/wiki.log'; [14:06:33] * hashar whistles [14:06:43] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10121659 (10VRiley-WMF) Hi @BTullis we can replace this drive at any time. Although the LED on the drive isn't on, as long as we know the slot, that works for us. [14:07:10] (03Merged) 10jenkins-bot: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 10Bking) [14:07:15] hashar: yeah, i don't know if tht one is even reachable, i think we can ignore that [14:07:26] but the other one is https://wikitech.wikimedia.org/wiki/MediaWiki_UDP_logging : /home/wikipedia/logs/testwiki.log [14:07:33] not sure on which machine(s)… [14:07:35] hmm [14:07:43] /home/wikipedia/ no more exists [14:07:49] but yeah that would be the mwlog server [14:08:01] (03CR) 10Slyngshede: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:08:05] (03CR) 10Andrew Bogott: [C:03+2] cloudweb: correct override of profile::idp::server_name [puppet] - 10https://gerrit.wikimedia.org/r/1070945 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:08:10] mwlog1002.eqiad.wmnet in /srv/mw-log [14:08:18] the beta cluster equivalent is https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Log_Files "/srv/mw-log on deployment-mwlog02.deployment-prep.eqiad1.wikimedia.cloud" [14:08:33] my understanding is that is where the udp2log system live and were udp receives logs end up written to disk [14:08:47] yeah that is it [14:08:47] yep, that's the one [14:08:50] you don't have access? [14:09:07] i tried very hard to avoid getting production access so far ;) [14:09:08] you and gergo and your team should most probably get access on it, else how can you grep logs? [14:09:10] i might have to give up one day [14:09:12] ahaha [14:09:13] touché [14:09:17] * hashar giggles [14:09:19] i have access to logstash [14:09:27] * hashar promotes MatmaRex to MediaWiki SRE [14:09:27] which has everything useful anyway, as far as i can tell [14:09:41] grepping logs is so 1990 [14:09:51] oh [14:09:53] :> [14:09:56] we have a perl script to process them [14:09:59] cause we are so 1980's [14:10:08] :D [14:10:28] !log hashar@deploy1003 matmarex, hashar: Continuing with sync [14:10:42] normal logging to logstash seems to be still fine as well [14:10:44] -rw-r--r-- 1 udp2log udp2log 8903952483 Sep 5 14:10 /srv/mw-log/testwiki.log [14:10:50] and well that file is rather spammy [14:10:56] [Exception ErrorException] (/srv/mediawiki/php-1.43.0-wmf.21/includes/utils/FileContentsHasher.php:62) PHP Warning: filemtime(): stat failed for resources/lib/codex/mixins/button-layout-flush.less [14:11:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070031 
(https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:11:38] (03PS1) 10David Caro: aptrepo: add k8s 1.27 repos [puppet] - 10https://gerrit.wikimedia.org/r/1070954 (https://phabricator.wikimedia.org/T359641) [14:11:41] !log add interface qos schedulers on cr1-codfw T339850 [14:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:44] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [14:12:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: post db2224 clone repool', diff saved to https://phabricator.wikimedia.org/P68705 and previous config saved to /var/cache/conftool/dbconfig/20240905-141238-arnaudb.json [14:12:52] hashar  1070953 is the new cherry pick , wanted to make sure it didn't get lost in logs [14:13:03] MatmaRex: for `$wmgExtraLogFile = '/tmp/wiki.log';` I think the use case is when ones uses maintenance scripts [14:13:08] (03PS2) 10David Caro: aptrepo: add k8s 1.27 repos [puppet] - 10https://gerrit.wikimedia.org/r/1070954 (https://phabricator.wikimedia.org/T359641) [14:13:09] sorry if you already saw [14:13:17] so you can do `MW_DEBUG_LOCAL=1 mwscript shell.php --wiki=testwiki` [14:13:21] and tail /tmp/wiki.log [14:13:36] joelyrookewmde: better two notifications than none :] [14:13:39] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:13:44] btw i updated https://wikitech.wikimedia.org/w/index.php?title=MediaWiki_UDP_logging&diff=prev&oldid=2223146 [14:13:51] (03CR) 10Hashar: [C:03+2] Fix missing wikibase link in Minerva sidebar [skins/MinervaNeue] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070953 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [14:13:53] (03PS1) 10DCausse: wdqs: better isolation of categories related [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) [14:13:56] 
(03PS1) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [14:14:10] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:14:14] (03CR) 10CI reject: [V:04-1] wdqs: better isolation of categories related [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:14:19] (03CR) 10CI reject: [V:04-1] wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:14:35] hashar: oh, right [14:14:39] (03PS3) 10David Caro: aptrepo: add k8s 1.27 repos [puppet] - 10https://gerrit.wikimedia.org/r/1070954 (https://phabricator.wikimedia.org/T359641) [14:14:57] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069320|Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile]] (duration: 10m 37s) [14:14:59] (03PS1) 10Muehlenhoff: Add Cumin alias for ldap-maint [puppet] - 10https://gerrit.wikimedia.org/r/1070957 [14:15:24] (03CR) 10Slavina Stefanova: [C:03+1] aptrepo: add k8s 1.27 repos [puppet] - 10https://gerrit.wikimedia.org/r/1070954 (https://phabricator.wikimedia.org/T359641) (owner: 10David Caro) [14:15:31] (03CR) 10David Caro: [C:03+2] aptrepo: add k8s 1.27 repos [puppet] - 10https://gerrit.wikimedia.org/r/1070954 (https://phabricator.wikimedia.org/T359641) (owner: 10David Caro) [14:15:31] (03CR) 10FNegri: [C:03+1] aptrepo: add k8s 1.27 repos [puppet] - 10https://gerrit.wikimedia.org/r/1070954 (https://phabricator.wikimedia.org/T359641) (owner: 10David Caro) [14:15:55] MatmaRex: the testwiki logging bucket is still being written to by udp2log [14:16:13] (why did we wrote udp2log instead of relying on syslog?) [14:16:19] my bet is syslog is slow [14:16:21] nice. 
no catastrophe [14:16:37] and udp2log got written by Tim so it is necessarily simple/stable/easy/performant [14:16:48] congratulations on dropping that wgDebugLogFile [14:17:04] I think that is what I have struggled the most with in my patch in April [14:17:10] too many layers and sources of confusion [14:17:25] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:46] yeah, it took me also a long time to figure out what that really did [14:18:01] so next one ? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069321/ [14:18:16] wmgDefaultMonologHandlers is also a doozy. but that one is for later: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070685 [14:18:23] hashar: yep. that is labs-only [14:18:29] "(see CommonSettings.php, line 324 and 4414)" [14:18:36] (03PS2) 10Muehlenhoff: Add Cumin alias for ldap-maint [puppet] - 10https://gerrit.wikimedia.org/r/1070957 [14:18:38] that might not be true anymore if CommonSettings.php got changed [14:18:39] :D [14:18:52] oh right, let me fix that [14:18:55] I would rather quote the lines directly in the message [14:18:56] maybe [14:19:01] (03PS1) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [14:19:11] while still having a line number reference possibly. 
But that saves one from having to look them up [14:19:15] * hashar king of nitpick [14:19:35] (03PS6) 10Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 [14:19:40] hashar: fixed [14:19:56] i think i wanted the number to show that they are done in the specific order [14:20:00] one below the other [14:20:12] (03CR) 10Clément Goubert: [C:03+1] kubernetes: Rename mw242[01] to wikikube-worker... [puppet] - 10https://gerrit.wikimedia.org/r/1070944 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [14:20:23] ah [14:20:54] i'll add comments on these lines in gerrit [14:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:21:10] "handler was sending all of the carefully formatted logs right to /dev/null" [14:21:13] that is hilarious [14:21:13] :) [14:21:40] I probably wrote those lines that you are removing now [14:21:53] (03CR) 10Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 (owner: 10Bartosz Dziewoński) [14:22:12] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for ldap-maint [puppet] - 10https://gerrit.wikimedia.org/r/1070957 (owner: 10Muehlenhoff) [14:24:25] the fact that we set `$wmgExtraLogFile = '/dev/null';` is also a bit weird, why not just leave it unset? but i suppose it's a defense against misconfiguration. thanks to that, the beta cluster was just silently logging to nowhere instead of throwing exceptions. 
whether you think that's good or bad is up to you… [14:25:15] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/206399 [14:25:16] (03CR) 10Andrew Bogott: [C:03+2] Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:25:44] my guess is that we had $wgDebugLogFile set for MediaWiki to do the file based logging [14:25:46] for all requests [14:27:12] PROBLEM - MariaDB Replica SQL: s3 on db1240 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: mswiktionary. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:27:16] ha, indeed: "Also sets the default value of $wgDebugLogFile in prod to /dev/null which will avoid 'Missing stream uri, the stream can not be opened.' errors" [14:27:22] (03CR) 10Hashar: [C:03+2] "In the early day, I am pretty sure we ran beta cluster with `$wgDebugLogFile` enabled so we could watch what was happening." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 (owner: 10Bartosz Dziewoński) [14:27:25] (03PS19) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [14:27:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: post db2224 clone repool', diff saved to https://phabricator.wikimedia.org/P68706 and previous config saved to /var/cache/conftool/dbconfig/20240905-142743-arnaudb.json [14:27:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 (owner: 10Bartosz Dziewoński) [14:28:06] (03Merged) 10jenkins-bot: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 (owner: 10Bartosz Dziewoński) [14:28:40] 14:28:25 Skipping sync since all commits were beta/labs-only changes. Operation completed. [14:28:42] well [14:28:51] scap is becoming smarter and smarter [14:29:09] https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/511846/console [14:30:37] !log UTC afternoon backport window completed [14:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:58] I will have to plunge in the rest of the php logging code [14:31:41] oh no [14:31:45] I forgot joelyrookewmde patch! [14:31:58] ha i was just writing [14:32:03] !log UTC afternoon backport window not completed! https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1070953 pending [14:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:26] MatmaRex: we have some new logging error in the mediawiki-new-error dashboard https://logstash.wikimedia.org/app/dashboards#/view/c7013c90-a487-11ec-be91-b3435f0c0c49 [14:32:31] !log  UTC afternoon backport window not completed! 
[14:32:32] didn't know that was possible [14:32:39] oh [14:32:40] thanks so much [14:32:43] that comes from MinervaNeue [14:32:47] PHP Notice: Undefined index: id [14:32:53] from /srv/mediawiki/php-1.43.0-wmf.21/skins/MinervaNeue/includes/Skins/SkinMinerva.php(176) [14:32:53] #0 /srv/mediawiki/php-1.43.0-wmf.21/skins/MinervaNeue/includes/Skins/SkinMinerva.php(176): MWExceptionHandler::handleError(int, string, string, int, array) [14:32:53] #1 [internal function]: MediaWiki\Minerva\Skins\SkinMinerva::MediaWiki\Minerva\Skins\{closure}(array) [14:32:53] #2 /srv/mediawiki/php-1.43.0-wmf.21/skins/MinervaNeue/includes/Skins/SkinMinerva.php(177): array_filter(array, Closure) [14:33:26] looking, but i don't think that's mine. maybe just coincidence [14:33:37] no that definitely resembles mine [14:34:17] (03CR) 10Hashar: [C:04-2] "The wmf.21 patch causes a notice:" [skins/MinervaNeue] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070953 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [14:34:48] so yeah I think we can abandon that wmf.20 patch ^ [14:34:52] is it conflicting with the git history bc we already merged this commit to 21 [14:35:30] nop they are different branches [14:35:30] ok sadness [14:35:49] wmf.20 was cut before wmf.21 [14:36:00] so yeah there is a missing 'id' [14:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:13] happy to abandon it but not sure what the error is... and if it will occur again in the train deployment ? 
[14:36:32] the patch in master will also make its way into wmf.22 and be deployed next week, which will cause the same error [14:36:50] 06SRE, 06Infrastructure-Foundations, 10netops, 10probenet, 06Traffic: improve GeoDNS-to-edge mapping - https://phabricator.wikimedia.org/T316160#10121802 (10CDanis) [14:36:54] in `return $sitelink[ 'id' ] === 't-wikibase';` i think that $sitelink may also be an HTML string, rather than an array of data. i vaguely recall running into that in the past when i worked on skins [14:37:12] RECOVERY - MariaDB Replica SQL: s3 on db1240 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:37:41] 06SRE, 06Infrastructure-Foundations, 10netops, 10probenet, 06Traffic: improve GeoDNS-to-edge mapping - https://phabricator.wikimedia.org/T316160#10121808 (10CDanis) [14:37:53] (03PS1) 10Hashar: Revert "Fix missing wikibase link in Minerva sidebar" [skins/MinervaNeue] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070961 (https://phabricator.wikimedia.org/T66315) [14:38:20] RECOVERY - MariaDB Replica SQL: s5 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:57] joelyrookewmde: I have sent the revert for wmf.21 at https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1070961 and will push + deploy it [14:39:05] we would need a fix in master for next week :-] [14:39:11] thanks [14:39:38] I am sadly off next week and in a baby team with just me and one 1 month old developer [14:39:52] wondering whether I should also roll back the original change [14:40:24] joelyrookewmde: definitely yes [14:40:31] cause the same notice will happen next week [14:40:31] ok [14:40:35] that will end up being a train blocker [14:40:40] and we will then revert it :) [14:40:54] so yeah I am happy to +2 a revert in master [14:40:58] let's try and avoid that [14:41:00] !log
hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070961|Revert "Fix missing wikibase link in Minerva sidebar" (T66315)]] [14:41:03] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [14:41:09] you can then send it again against master to create a new change and mark it to think about it after your vacations! [14:41:18] the important point is [14:41:25] it is always ok to rollback and revert [14:41:43] (03CR) 10Filippo Giunchedi: [C:03+1] grafana: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1070951 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:41:48] and I express my sympathy for the patch you did :-] [14:41:49] (or maybe the 'id' is really missing from the array. i'm looking at some extensions that add things there: https://codesearch.wmcloud.org/deployed/?q=SidebarBeforeOutput and some of them seem to omit it, e.g. https://gerrit.wikimedia.org/g/mediawiki/extensions/ReportIncident/+/aa0c8116b23026e0dab5ede73ba76e10d06809cb/src/Hooks/Handlers/MainHooksHandler.php#52 ) [14:42:11] maybe the people from the web team can help with the review/test? [14:42:14] ok cool - I have not done a revert before - what action do I need to take in gerrit to make that happen? just a new commit removing the code? 
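The two guesses above (an entry that is an HTML string rather than an array, or an array with no 'id' key) can both be guarded against in the filter callback. What follows is only a minimal sketch under assumptions: `filterWikibaseLinks()` and the sample entries are hypothetical, and this is not the actual SkinMinerva code, which the log does not show beyond the quoted `return` statement.

```php
<?php
// Hypothetical helper: keep only the sidebar entries whose id is 't-wikibase',
// without emitting "Undefined index: id" for entries that lack the key.
function filterWikibaseLinks( array $entries ): array {
	return array_values( array_filter(
		$entries,
		static function ( $entry ): bool {
			// Skip raw HTML strings and arrays that have no 'id' key.
			return is_array( $entry ) && ( $entry['id'] ?? null ) === 't-wikibase';
		}
	) );
}

// Made-up sample data covering the three cases discussed in the channel.
$entries = [
	[ 'id' => 't-wikibase', 'text' => 'Wikidata item' ],
	[ 'text' => 'Report an incident' ], // array without an 'id' key
	'<li>raw html</li>',                // string instead of an array
];
$kept = filterWikibaseLinks( $entries );
```

The `??` operator returns null for a missing key instead of raising a notice, and the `is_array()` check covers the case where the entry is not an array at all.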
[14:42:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: post db2224 clone repool', diff saved to https://phabricator.wikimedia.org/P68707 and previous config saved to /var/cache/conftool/dbconfig/20240905-144249-arnaudb.json [14:43:02] !log hashar@deploy1003 hashar: Backport for [[gerrit:1070961|Revert "Fix missing wikibase link in Minerva sidebar" (T66315)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:43:05] tbh we had 3 people test it (because I am also a new dev) and it works locally (lol) I guess not all sitelinks have an id, just the wikibase one [14:43:07] (03Abandoned) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) (owner: 10EoghanGaffney) [14:43:46] (03Abandoned) 10Hashar: Fix missing wikibase link in Minerva sidebar [skins/MinervaNeue] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070953 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [14:43:56] joelyrookewmde: yeah possibly [14:43:58] and again [14:44:03] IT IS NORMAL FOR THING TO BREAK [14:44:05] :D [14:44:25] haha cheers [14:44:31] what is possible for the next iteration is to set a feature flag [14:44:45] joelyrookewmde: hashar already did it here, but yes, you can just do a `git revert` locally and push to gerrit; or you can click the big "REVERT" button it has on every merged patch [14:44:56] like $wgMinervaNeueAddBackWikibaseLink = false; [14:45:01] beautiful thank you MatmaRex [14:45:15] have your code hidden behind it ( if ( $wgMinervaNeueAddBackWikibaseLink ) ) [14:45:18] roll the code [14:45:27] then you can enable it on beta cluster with the feature flag [14:45:31] hashar we actually have a feature flag for the main feature [14:45:34] ah [14:45:37] great :-] [14:45:42] this is a bugfix highlighted by the pilot wiki we rolled out to [14:45:55] !log hashar@deploy1003 hashar: Continuing with 
sync [14:46:06] (03CR) 10Hnowlan: [C:03+1] sre.k8s.renumber-node: Run puppet on deploy servers [cookbooks] - 10https://gerrit.wikimedia.org/r/1070903 (owner: 10Clément Goubert) [14:46:07] so I have rolled back the wmf.21 patch by pushing it [14:46:08] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2420.codfw.wmnet [14:46:19] the wmf.20 patch never got merged because I had it at CR-2 and I have abandoned it [14:46:22] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2421.codfw.wmnet [14:46:30] so happy days, only the people using the Hebrew mobile website are affected, but yes, I will still try and get a wmf.22 fix out for them before I go on vacation [14:46:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2420.codfw.wmnet [14:46:44] any other action required from me? [14:46:45] for https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1070534 you can press REVERT at top right and I will be happy to +2 :) [14:46:51] (03CR) 10Kamila Součková: [C:03+2] kubernetes: Rename mw242[01] to wikikube-worker... [puppet] - 10https://gerrit.wikimedia.org/r/1070944 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [14:46:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2421.codfw.wmnet [14:46:57] (03CR) 10Muehlenhoff: [C:03+2] Revert "Temporarily remove puppetmaster1003 from rotation" [puppet] - 10https://gerrit.wikimedia.org/r/1070917 (https://phabricator.wikimedia.org/T373888) (owner: 10Muehlenhoff) [14:47:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121836 (10Jhancock.wm) [14:47:08] on the revert now [14:47:27] it is great it only affects a single wiki :) [14:47:30] kudos on that!
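The feature-flag approach suggested earlier in the conversation could look roughly like the sketch below. The flag name `$wgMinervaNeueAddBackWikibaseLink` is the illustrative name floated in the channel, not an existing MinervaNeue setting, and `addWikibaseLinkIfEnabled()` is a hypothetical helper; real MediaWiki code would normally read the flag through the skin's configuration object rather than a function argument.

```php
<?php
// Hypothetical flag, off by default so the new code path stays inert
// until it is enabled explicitly (for example on the beta cluster first).
$wgMinervaNeueAddBackWikibaseLink = false;

// Hypothetical helper: append the link only when the flag is enabled.
function addWikibaseLinkIfEnabled( array $links, bool $flagEnabled ): array {
	if ( $flagEnabled ) {
		$links[] = [ 'id' => 't-wikibase', 'text' => 'Wikidata item' ];
	}
	return $links;
}

$withFlagOff = addWikibaseLinkIfEnabled( [], $wgMinervaNeueAddBackWikibaseLink );
$withFlagOn = addWikibaseLinkIfEnabled( [], true );
```

With the flag off the function returns its input unchanged, so a problem in the new code can be switched off with a config change instead of a revert.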
[14:48:09] (03PS2) 10DCausse: wdqs: better isolation of categories related [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) [14:48:09] (03PS2) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [14:48:09] (03PS2) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [14:48:35] (03CR) 10CI reject: [V:04-1] wdqs: better isolation of categories related [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:48:37] (03CR) 10CI reject: [V:04-1] wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:48:56] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:48:58] RECOVERY - MariaDB Replica Lag: s5 on db1154 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:49:02] 1070963 here is the revert change [14:49:12] for 22 [14:49:24] RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:49:30] RECOVERY - MariaDB Replica Lag: s5 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:50:08] joelyrookewmde: here is the t-shirt that was handed to people that broke the site (and fixed it) in the early days 
https://commons.wikimedia.org/wiki/File:Framed_%22I_BROKE_WIKIPEDIA..._THEN_I_FIXED_IT!%22_T-shirt.jpg [14:50:27] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070961|Revert "Fix missing wikibase link in Minerva sidebar" (T66315)]] (duration: 09m 27s) [14:50:28] adorable. need. [14:50:28] hashar: o/ Good morning! [14:50:30] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [14:50:36] good morning! [14:50:43] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2420 to wikikube-worker2091 [14:50:50] I ran a bunch of backports over the last two hours :] [14:51:00] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:51:04] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Unified pattern for RemoteHosts accessors in Spicerack - https://phabricator.wikimedia.org/T374073#10121866 (10elukey) [14:51:29] !log deploying python3-wmflib 1.2.6 fleet-wide [14:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:37] yeah seriously thank you hashar you have put in the hours [14:51:41] joelyrookewmde: don't forget to note what matmarex linked above which I guess would help for the next iteration of the patch :) [14:51:45] and yeah pushing things [14:51:48] watching breakage [14:51:50] rolling back [14:51:53] or celebrating [14:51:58] that is literally my job description :] [14:52:02] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2421 to wikikube-worker2092 [14:52:11] congrats on the deployment [14:52:12] ! [14:52:31] (even if code ends up being rolled back, it at least got deployed) [14:52:40] and that is I guess to be celebrated [14:52:44] hahahaha [14:52:49] I assume you're dealing with `SkinMinerva:176 PHP Notice: Undefined index: id` ?
[14:52:54] all milestones are another mile [14:52:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:52:58] yes dancy [14:53:01] yes dancy [14:53:19] that is solved now and can be filtered out from the dashboard or ignored [14:53:37] there is something sketchy in some other area of our code base :/ [14:53:47] ok, so checking again: does anything else require action (apart from correctly fixing the bugfix)? [14:53:50] the good news is that it solves one of next week's train blockers! [14:54:02] *celebrations* [14:54:05] not from my side joelyrookewmde :] [14:54:09] gorgeous [14:54:23] peace out and cheers again [14:54:24] jouncebot: nowandnext [14:54:24] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [14:54:25] In 0 hour(s) and 5 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1500) [14:54:25] my best wishes in figuring out what is going on with the sitemap :] [14:54:26] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2420 to wikikube-worker2091 - kamila@cumin1002" [14:54:32] Is there a deployment running right now? [14:54:36] claime: you can proceed.
I have finished the backport window [14:54:39] cool [14:54:41] thanks [14:55:13] !log UTC afternoon backport window completed [14:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] thanks hashar [14:55:21] \o/ [14:55:31] !log depooling kubernetes nodes for T373096 - kubernetes2017 kubernetes2021 kubernetes2038 kubernetes2039 mw2335 mw2336 mw2337 mw2338 mw2412 mw2413 mw2414 mw2415 mw2416 mw2417 mw2418 mw2419 wikikube-worker2019 [14:55:33] (03PS3) 10DCausse: wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) [14:55:33] (03PS3) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [14:55:33] (03PS3) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [14:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:34] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [14:55:36] I guess I will spend my friday in those wmf-config logging patches [14:55:54] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2017.codfw.wmnet [14:55:59] (03CR) 10CI reject: [V:04-1] wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:56:04] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:56:07] (03CR) 10CI reject: [V:04-1] wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:56:28] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2017.codfw.wmnet [14:56:33] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2021.codfw.wmnet [14:57:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2021.codfw.wmnet [14:57:15] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2038.codfw.wmnet [14:57:16] (03CR) 10Ladsgroup: [C:03+1] mariadb: productionize db2227 [puppet] - 10https://gerrit.wikimedia.org/r/1070946 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [14:57:23] I caught someone's netbox hiera changes regarding 172.16.8.0/22 / cloud-flat-eqiad1, I assume I can proceed? [14:57:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2038.codfw.wmnet [14:57:55] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2039.codfw.wmnet [14:58:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: post db2224 clone repool', diff saved to https://phabricator.wikimedia.org/P68709 and previous config saved to /var/cache/conftool/dbconfig/20240905-145755-arnaudb.json [14:58:03] (03PS1) 10Muehlenhoff: stewards: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1070965 (https://phabricator.wikimedia.org/T135991) [14:58:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2039.codfw.wmnet [14:58:38] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2335.codfw.wmnet [14:58:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2420 to wikikube-worker2091 - kamila@cumin1002" [14:58:42] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [14:58:44] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2091 [14:59:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2335.codfw.wmnet [14:59:20] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2336.codfw.wmnet [14:59:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2091 [14:59:49] (03PS4) 10DCausse: wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) [14:59:49] (03PS4) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [14:59:49] (03PS4) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [14:59:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2336.codfw.wmnet [15:00:00] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2337.codfw.wmnet [15:00:05] dancy and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1500) [15:00:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2420 to wikikube-worker2091 [15:00:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2337.codfw.wmnet [15:00:38] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2421 to wikikube-worker2092 - kamila@cumin1002" [15:00:41] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 
13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121908 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2420 to wi... [15:00:42] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2338.codfw.wmnet [15:00:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2421 to wikikube-worker2092 - kamila@cumin1002" [15:00:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:44] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2092 [15:00:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2092 [15:01:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2338.codfw.wmnet [15:01:20] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2412.codfw.wmnet [15:01:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2421 to wikikube-worker2092 [15:01:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121910 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2421 to wi... 
[15:01:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2412.codfw.wmnet [15:01:57] (03PS1) 10Muehlenhoff: Extend MX Cumin aliases for new postfix roles [puppet] - 10https://gerrit.wikimedia.org/r/1070967 [15:01:57] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2413.codfw.wmnet [15:02:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2413.codfw.wmnet [15:02:39] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2414.codfw.wmnet [15:03:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2414.codfw.wmnet [15:03:15] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121912 (10Papaul) [15:03:18] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2415.codfw.wmnet [15:03:21] !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2091.codfw.wmnet [15:03:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121913 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by kamila@cumin1002 Renumber... 
[15:03:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2415.codfw.wmnet [15:03:52] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2091.codfw.wmnet with OS bullseye [15:03:55] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2416.codfw.wmnet [15:04:03] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2091 [15:04:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [15:04:15] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:04:17] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121920 (10Papaul) [15:04:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2416.codfw.wmnet [15:04:37] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2417.codfw.wmnet [15:05:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2417.codfw.wmnet [15:05:16] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2418.codfw.wmnet [15:05:19] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121924 (10Papaul) [15:05:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2418.codfw.wmnet [15:05:55] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2419.codfw.wmnet [15:06:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121929 (10Papaul) [15:06:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2419.codfw.wmnet [15:06:34] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2019.codfw.wmnet [15:07:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2019.codfw.wmnet [15:07:24] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp2035.codfw.wmnet with reason: T373096 [15:07:27] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [15:07:35] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121940 (10Papaul) @Dwisehaupt civi2002 and frpig2002 relocation complete. 
I am checking pay-lb2002 [15:07:37] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2035.codfw.wmnet with reason: T373096 [15:07:59] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2091 - kamila@cumin1002" [15:08:06] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp2036.codfw.wmnet with reason: T373096 [15:08:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2091 - kamila@cumin1002" [15:08:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:07] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2091.codfw.wmnet 9.0.192.10.in-addr.arpa 9.0.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:08:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2091.codfw.wmnet 9.0.192.10.in-addr.arpa 9.0.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:08:11] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2091 [15:08:19] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2036.codfw.wmnet with reason: T373096 [15:08:54] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp2035.ulsfo.wmnet [15:09:01] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp2036.ulsfo.wmnet [15:09:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:10:49] (03CR) 10Hnowlan: [C:03+1] sre.discovery.datacenter: update EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1070668 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [15:10:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10121964 (10Fabfur) Hosts cp203[5-6] downtimed and depooled [15:10:55] (03PS3) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 [15:11:02] jouncebot: nowandnext [15:11:02] For the next 0 hour(s) and 48 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1500) [15:11:02] In 0 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1600) [15:11:45] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10121968 (10Papaul) [15:12:13] I just realized I saw Horizon’s own login page (rather than the IDP/SSO one) for probably the last time in my life earlier today [15:12:16] woop woop \o/ [15:12:23] (03PS5) 10Clément Goubert: sre.k8s.renumber-node: Refactor logging and error handling [cookbooks] - 10https://gerrit.wikimedia.org/r/1070904 [15:12:23] (03PS4) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 [15:12:27] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10121954 (10MoritzMuehlenhoff) 05Open→03Resolved I've kicked off the RAID rebuild; it 
should complete in half an hour. I've also re-added puppetmaster1003 back to active duty. [15:12:29] (CR) DCausse: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [15:12:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2091 [15:12:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2091 [15:13:02] (CR) Muehlenhoff: [C:+2] grafana: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1070951 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [15:14:40] (PS20) Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [15:14:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:16:30] (CR) Alexandros Kosiaris: [C:+1] sre.discovery.datacenter: update EXCLUDED_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1070668 (https://phabricator.wikimedia.org/T372649) (owner: Scott French) [15:16:42] (CR) Volans: spicerack: allow running by non-ops (1
comment) [puppet] - https://gerrit.wikimedia.org/r/1067301 (owner: David Caro) [15:17:04] !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2092.codfw.wmnet [15:17:24] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121992 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by kamila@cumin1002 Renumber... [15:17:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2092.codfw.wmnet with OS bullseye [15:17:44] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2092 [15:17:45] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10121994 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [15:18:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2125.codfw.wmnet onto db2225.codfw.wmnet [15:18:13] SRE, collaboration-services, vrts: Dissociate/release old iOS and Android support email addresses (currently VRTS queues) - https://phabricator.wikimedia.org/T373485#10121999 (Dzahn) >>! In T373485#10120180, @Krd wrote: > I would like to see the whole content of vrts_aliases.py and discuss (perhaps a...
[15:18:35] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:18:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:19:25] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:07] !log prep lsw1-c2-codfw for server migration from asw-c2-codfw T373096 [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:11] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [15:21:44] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2092 - kamila@cumin1002" [15:21:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2092 - kamila@cumin1002" [15:21:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:21:48] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2092.codfw.wmnet 77.0.192.10.in-addr.arpa 7.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:21:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2092.codfw.wmnet 77.0.192.10.in-addr.arpa 7.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:21:52] !log kamila@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2092 [15:22:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2092 [15:22:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2092 [15:22:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070971 [15:22:08] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070971 (owner: TrainBranchBot) [15:25:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68710 and previous config saved to /var/cache/conftool/dbconfig/20240905-152534-arnaudb.json [15:27:09] SRE, MediaWiki-Engineering, MediaWiki-extensions-BounceHandler, Observability-Metrics, Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10122010 (colewhite) Open→Resolved a:Joe BounceHandler has signs of life again. Thanks @Joe! [15:28:45] FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:28:47] !log restart swift-proxy on ms-fe1014 T360913 [15:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:50] T360913: Swift proxy server misbehaviour (no longer calling `accept`?)
- https://phabricator.wikimedia.org/T360913 [15:30:40] !log prep lsw1-c3-codfw for server migration from asw-c3-codfw T373096 [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:43] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [15:31:34] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2091.codfw.wmnet with reason: host reimage [15:35:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2091.codfw.wmnet with reason: host reimage [15:36:03] RECOVERY - MD RAID on puppetmaster1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:36:23] (CR) JHathaway: "I wasn't aware of that library, happy to rework the patch if you think that would be valuable." [puppet] - https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: JHathaway) [15:37:40] (PS1) Hnowlan: k8s: rename mw232[012], kubernetes2031 for wikikube-workers [puppet] - https://gerrit.wikimedia.org/r/1070973 (https://phabricator.wikimedia.org/T372878) [15:38:17] (PS1) C.
Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) [15:38:19] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2031.codfw.wmnet [15:38:25] RECOVERY - MariaDB Replica SQL: s2 on db2197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:38:45] RESOLVED: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:38:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2031.codfw.wmnet [15:39:01] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2320.codfw.wmnet [15:39:17] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2092.codfw.wmnet with reason: host reimage [15:39:33] (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian) [15:39:47] (CR) Pppery: "This change should definitely go out in Tech News before being merged/deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C.
Scott Ananian) [15:40:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68711 and previous config saved to /var/cache/conftool/dbconfig/20240905-154040-arnaudb.json [15:41:46] (CR) Pppery: "To be clear, this isn't because I have any objection to the concept behind it, but it's a major change that people will notice, and I'd ra" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian) [15:41:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2092.codfw.wmnet with reason: host reimage [15:42:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2320.codfw.wmnet [15:42:19] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2321.codfw.wmnet [15:42:40] (CR) Scott French: "Thank you both!"
[cookbooks] - https://gerrit.wikimedia.org/r/1070668 (https://phabricator.wikimedia.org/T372649) (owner: Scott French) [15:42:43] (CR) Scott French: [C:+2] sre.discovery.datacenter: update EXCLUDED_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1070668 (https://phabricator.wikimedia.org/T372649) (owner: Scott French) [15:42:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2321.codfw.wmnet [15:42:58] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2322.codfw.wmnet [15:43:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2322.codfw.wmnet [15:48:10] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on backup2006.codfw.wmnet with reason: Move backup2006 uplink to lsw1-c2-codfw [15:48:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on backup2006.codfw.wmnet with reason: Move backup2006 uplink to lsw1-c2-codfw [15:48:45] (PS2) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) [15:48:46] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122089 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8726666c-096a-491c-b6d3-edc93e2996f1) set by cmoon... [15:49:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070971 (owner: TrainBranchBot) [15:49:26] (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C.
Scott Ananian) [15:52:22] (PS1) Santiago Faci: MPIC: Moving monitoring configuration from chart to helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1070977 (https://phabricator.wikimedia.org/T361346) [15:53:20] (PS2) Santiago Faci: MPIC: Moving monitoring configuration from chart to helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1070977 (https://phabricator.wikimedia.org/T361346) [15:53:33] (PS3) Santiago Faci: MPIC: Moving monitoring configuration from chart to helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1070977 (https://phabricator.wikimedia.org/T361346) [15:54:25] (CR) Clément Goubert: [C:+1] "LGTM, maybe just wait until https://phabricator.wikimedia.org/T373096 is done and we've repooled capacity to depool the hosts." [puppet] - https://gerrit.wikimedia.org/r/1070973 (https://phabricator.wikimedia.org/T372878) (owner: Hnowlan) [15:54:39] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:54:57] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:55:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68712 and previous config saved to /var/cache/conftool/dbconfig/20240905-155545-arnaudb.json [15:56:06] (Merged) jenkins-bot: sre.discovery.datacenter: update EXCLUDED_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1070668 (https://phabricator.wikimedia.org/T372649) (owner: Scott French) [15:57:14] (PS3) Brouberol: airflow: deploy the scheduler via a separate Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) [16:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1600).
[16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:37] (CR) Volans: [C:+1] "Thanks Scott for the huge effort, this has been a long lasting tech debt and I really appreciate you took the opportunity of the next swit" [cookbooks] - https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: Scott French) [16:02:49] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10122158 (Jclark-ctr) @MMuhlenhoff is there any reason we can not rack these in row e/f ? [16:02:51] !log homer cr*-codfw* commit 'T372878' [16:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:55] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:03:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070980 [16:03:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070980 (owner: TrainBranchBot) [16:03:50] (CR) Btullis: airflow: deploy the scheduler via a separate Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:04:53] (PS1) Cwhite: opensearch: enable curator debug logging [puppet] - https://gerrit.wikimedia.org/r/1070981 (https://phabricator.wikimedia.org/T364190) [16:05:22] (CR) CI reject: [V:-1] opensearch: enable curator debug logging [puppet] - https://gerrit.wikimedia.org/r/1070981 (https://phabricator.wikimedia.org/T364190) (owner: Cwhite) [16:05:31] (CR) Brouberol: airflow: deploy the scheduler via a separate Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070619
(https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:05:49] (PS2) Cwhite: opensearch: enable curator debug logging [puppet] - https://gerrit.wikimedia.org/r/1070981 (https://phabricator.wikimedia.org/T364190) [16:06:07] (CR) Cwhite: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070981 (https://phabricator.wikimedia.org/T364190) (owner: Cwhite) [16:06:45] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070980 (owner: TrainBranchBot) [16:06:55] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10122171 (jhathaway) >>! In T372666#10120683, @Volans wrote: > I need to recollect my old memories and check local branches, the hardest part IIRC are not the code chang... [16:07:09] (CR) Btullis: airflow: deploy the scheduler via a separate Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:08:40] (CR) Brouberol: airflow: deploy the scheduler via a separate Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:09:19] (CR) Brouberol: airflow: deploy the scheduler via a separate Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:10:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: post clone repool', diff saved to https://phabricator.wikimedia.org/P68713 and previous config saved to /var/cache/conftool/dbconfig/20240905-161051-arnaudb.json [16:13:25] (PS1) Cathal Mooney: Enable qos scheduling on cr1-codfw interfaces [homer/public] - https://gerrit.wikimedia.org/r/1070983
(https://phabricator.wikimedia.org/T339850) [16:14:09] (CR) Volans: "Some questions inline" [cookbooks] - https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: Scott French) [16:14:36] (CR) Cathal Mooney: [C:+2] Enable qos scheduling on cr1-codfw interfaces [homer/public] - https://gerrit.wikimedia.org/r/1070983 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney) [16:14:46] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) (owner: Scott French) [16:15:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070985 [16:15:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070985 (owner: TrainBranchBot) [16:15:19] (CR) Brouberol: airflow: deploy the scheduler via a separate Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:15:19] (PS4) Brouberol: airflow: deploy the scheduler via a separate Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) [16:15:22] (Merged) jenkins-bot: Enable qos scheduling on cr1-codfw interfaces [homer/public] - https://gerrit.wikimedia.org/r/1070983 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney) [16:15:43] (CR) Volans: [C:+1] "It would be n" [cookbooks] - https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: Scott French) [16:16:30] (CR) Volans: "Up to you, it's installed everywhere."
[puppet] - https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: JHathaway) [16:17:06] (PS5) Brouberol: airflow: deploy the scheduler via a separate Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) [16:19:25] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2141 db2144 db2150 db2169 db2186 db2191 - T370852', diff saved to https://phabricator.wikimedia.org/P68714 and previous config saved to /var/cache/conftool/dbconfig/20240905-162057-arnaudb.json [16:21:01] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [16:21:19] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373894#10122227 (Jhancock.wm) Open→Resolved a:Jhancock.wm alert cleared. [16:21:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: network maintenance T370852 [16:21:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: network maintenance T370852 [16:24:31] Anything going on that should prevent me from rolling the train to group1?
[16:26:21] (PS1) Cathal Mooney: Missing 'k' on shaper rate [homer/public] - https://gerrit.wikimedia.org/r/1070989 (https://phabricator.wikimedia.org/T339850) [16:26:39] topranks is about to do some network work; dunno if that's a likely blocker [16:27:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet [16:27:55] (CR) Cathal Mooney: [C:+2] Missing 'k' on shaper rate [homer/public] - https://gerrit.wikimedia.org/r/1070989 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney) [16:28:30] (Merged) jenkins-bot: Missing 'k' on shaper rate [homer/public] - https://gerrit.wikimedia.org/r/1070989 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney) [16:29:04] Emperor: thanks [16:29:22] dancy: I don't believe anything should adversely affect the train so I think you can proceed [16:29:29] Thanks! [16:30:01] (PS1) TrainBranchBot: group1 to 1.43.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1070991 (https://phabricator.wikimedia.org/T373640) [16:30:03] (CR) TrainBranchBot: [C:+2] group1 to 1.43.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1070991 (https://phabricator.wikimedia.org/T373640) (owner: TrainBranchBot) [16:30:08] !log moss-be2003 to maintenance mode T373096 [16:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:11] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [16:30:44] !log depool moss-fe2001 ms-fe2011 T373096 [16:30:45] (Merged) jenkins-bot: group1 to 1.43.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1070991 (https://phabricator.wikimedia.org/T373640) (owner: TrainBranchBot) [16:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:56] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from
asw to lsw - https://phabricator.wikimedia.org/T373096#10122267 (MatthewVernon) @cmooney all good to go from a Swift/Ceph perspective, thanks for your patience [16:36:00] (CR) Btullis: [C:+1] "Nice." [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol) [16:37:28] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122281 (cmooney) >>! In T373096#10122267, @MatthewVernon wrote: > @cmooney all good to go from a Swift/Ceph perspective, th... [16:37:45] (PS1) Alexandros Kosiaris: service: Remove php7.2 specific health check [puppet] - https://gerrit.wikimedia.org/r/1070993 [16:37:46] (PS1) Alexandros Kosiaris: deployment_server: Remove buster php-readline stanza [puppet] - https://gerrit.wikimedia.org/r/1070994 [16:37:46] (PS1) Alexandros Kosiaris: mediawiki::maintenance: Remove php72 remnant [puppet] - https://gerrit.wikimedia.org/r/1070995 [16:37:46] (PS1) Alexandros Kosiaris: tests: Bump various tests from php7.2 to php7.4 [puppet] - https://gerrit.wikimedia.org/r/1070996 [16:37:47] (PS1) Alexandros Kosiaris: apt: Remove mention of php72 component [puppet] - https://gerrit.wikimedia.org/r/1070997 [16:38:13] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.21 refs T373640 [16:38:15] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [16:38:24] !log cmooney@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: Move server uplinks codfw racks C2 [16:39:01] !log cmooney@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: Move server uplinks codfw racks C2 [16:39:16] ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C2
& C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122286 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cde90074-86b4-49ac-9878-436a5d041f2b) set by cmoon... [16:41:56] (CR) Alexandros Kosiaris: [C:+1] Temporarily disable stunnel for the Puppet 7 migration of deployment hosts [puppet] - https://gerrit.wikimedia.org/r/1070236 (owner: Muehlenhoff) [16:42:30] !log disabling puppet on cp nodes to test puppet change [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:42] !log move server uplinks codfw rack c2 from asw-c2-codfw to lsw1-c2-codfw T373096 [16:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:48] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [16:43:21] (CR) Clément Goubert: [C:+1] service: Remove php7.2 specific health check [puppet] - https://gerrit.wikimedia.org/r/1070993 (owner: Alexandros Kosiaris) [16:43:27] FIRING: [2x] SystemdUnitCrashLoop: logstash.service crashloop on apifeatureusage1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:43:27] (CR) Clément Goubert: [C:+1] deployment_server: Remove buster php-readline stanza [puppet] - https://gerrit.wikimedia.org/r/1070994 (owner: Alexandros Kosiaris) [16:43:40] (CR) Clément Goubert: [C:+1] mediawiki::maintenance: Remove php72 remnant [puppet] - https://gerrit.wikimedia.org/r/1070995 (owner: Alexandros Kosiaris) [16:43:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1070985 (owner: TrainBranchBot) [16:44:06] (CR) Clément Goubert: [C:+1] tests: Bump various tests from php7.2 to php7.4 [puppet] - https://gerrit.wikimedia.org/r/1070996 (owner: Alexandros Kosiaris) [16:44:24] (CR) Clément
Goubert: [C:+1] apt: Remove mention of php72 component [puppet] - https://gerrit.wikimedia.org/r/1070997 (owner: Alexandros Kosiaris) [16:44:26] (CR) JHathaway: [C:+2] puppet8: ensure type is binary [puppet] - https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:46:36] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 359, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2092.codfw.wmnet with OS bullseye [16:49:29] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2092.codfw.wmnet [16:49:33] !log dancy@deploy1003 Installing scap version "4.101.1" for 211 hosts [16:49:41] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10122317 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... [16:49:43] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10122318 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by kamila@cumin1002 Renumbering... [16:49:54] ops-eqiad, SRE, DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10122309 (BTullis) >>!
In T373800#10121659, @VRiley-WMF wrote: > Hi @BTullis we can replace this drive at any time. Although the LED on the drive isn't on, as long as we... [16:51:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2091.codfw.wmnet with OS bullseye [16:51:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:51:21] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10122337 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... [16:51:23] !log cmooney@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 35 hosts with reason: Move server uplinks codfw racks C3 [16:51:24] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on 35 hosts with reason: Move server uplinks codfw racks C3 [16:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:54:09] (CR) Bugreporter: "Per T45934 I will suggest a namespace number such as 3000 that is used by no other wikis. 120 is currently used in Wikidata as number of P" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C.
Scott Ananian) [16:54:38] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 441, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:54:48] jouncebot: next [16:54:48] In 0 hour(s) and 5 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1700) [16:54:48] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1700) [16:55:10] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2091.codfw.wmnet [16:55:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2091.codfw.wmnet [16:55:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2091.codfw.wmnet [16:55:24] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10122354 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by kamila@cumin1002 Renumbering... [16:55:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: kubernetes2035 (renamed to wikikube-worker2087) reporting "Comm Error: Backplane 0" - https://phabricator.wikimedia.org/T374019#10122351 (10Jhancock.wm) @JMeybohm power cycled it and reseated the connection between the system board and the backplane. looks like i... 
[16:56:07] (03PS1) 10Cwhite: logstash: temporarily disable useragent filter [puppet] - 10https://gerrit.wikimedia.org/r/1071002 [16:57:10] !log cmooney@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 34 hosts with reason: Move server uplinks codfw racks C3 [16:58:03] !log cmooney@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 34 hosts with reason: Move server uplinks codfw racks C3 [16:58:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122371 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=07e91a47-4c42-404a-bc7d-ad277bbf3e2b) set by cmoon... [16:58:41] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2091.codfw.wmnet [16:58:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2091.codfw.wmnet [16:59:01] (03CR) 10Bugreporter: "Also, you did not define a corresponding talk namespace, which will cases problem unless you unset it explicitly (e.g. https://gerrit.wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [16:59:03] (03CR) 10Cwhite: [C:03+2] logstash: temporarily disable useragent filter [puppet] - 10https://gerrit.wikimedia.org/r/1071002 (owner: 10Cwhite) [16:59:06] !log move server uplinks codfw rack c3 from asw-c3-codfw to lsw1-c3-codfw T373096 [16:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:09] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [17:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1700) [17:01:09] (03CR) 10Btullis: [C:03+1] admin_ng: add wikidata-query-gui service namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [17:03:09] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2035.codfw.wmnet [17:03:10] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2035.codfw.wmnet [17:03:15] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2036.codfw.wmnet [17:03:15] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2036.codfw.wmnet [17:03:37] (03PS1) 10Ebernhardson: search: Update Cirrus Saneitizer alert [alerts] - 10https://gerrit.wikimedia.org/r/1071004 [17:04:49] !log moss-be2003 exit maintenance mode T373096 [17:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:52] T373096: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 [17:05:13] !log pool moss-fe2001 ms-fe2011 T373096 [17:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:39] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp2035.ulsfo.wmnet [17:06:51] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp2036.ulsfo.wmnet [17:07:01] all done, thanks [17:08:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68715 and previous config saved to /var/cache/conftool/dbconfig/20240905-170800-arnaudb.json [17:08:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 25%: T370852', diff saved to 
https://phabricator.wikimedia.org/P68716 and previous config saved to /var/cache/conftool/dbconfig/20240905-170800-arnaudb.json [17:08:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 25%: T370852', diff saved to https://phabricator.wikimedia.org/P68717 and previous config saved to /var/cache/conftool/dbconfig/20240905-170801-arnaudb.json [17:08:06] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [17:08:20] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2017.codfw.wmnet [17:08:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2017.codfw.wmnet [17:08:27] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2021.codfw.wmnet [17:08:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2021.codfw.wmnet [17:08:30] !log Repooling kubernetes nodes after T373096 - kubernetes2017 kubernetes2021 kubernetes2038 kubernetes2039 mw2335 mw2336 mw2337 mw2338 mw2412 mw2413 mw2414 mw2415 mw2416 mw2417 mw2418 mw2419 wikikube-worker2019 [17:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:34] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2038.codfw.wmnet [17:08:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2038.codfw.wmnet [17:08:41] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2039.codfw.wmnet [17:08:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2039.codfw.wmnet [17:08:48] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2335.codfw.wmnet [17:08:50] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host mw2335.codfw.wmnet [17:08:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122430 (10MatthewVernon) Swift / Ceph back to normal, thanks! [17:08:56] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2336.codfw.wmnet [17:08:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2336.codfw.wmnet [17:09:02] (03CR) 10Bugreporter: "Note enwiki also has a number of pages with title Talk:MOS:xxx. They can not be fixed by namespaceDupes.php, but can be fixed by cleanupTi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [17:09:03] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2337.codfw.wmnet [17:09:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2337.codfw.wmnet [17:09:11] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2338.codfw.wmnet [17:09:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2338.codfw.wmnet [17:09:18] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2412.codfw.wmnet [17:09:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2412.codfw.wmnet [17:09:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122440 (10ABran-WMF) kudos @Jhancock.wm! 
d/p nodes are repooling [17:09:25] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2413.codfw.wmnet [17:09:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122429 (10cmooney) All links moved and all hosts now responding to ping again. Average interruption in the region of seconds... [17:09:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2413.codfw.wmnet [17:09:32] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2414.codfw.wmnet [17:09:34] (03CR) 10Pppery: "And you also need `--move-talk` on the namespaceDupes run." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [17:09:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2414.codfw.wmnet [17:09:39] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2415.codfw.wmnet [17:09:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2415.codfw.wmnet [17:09:46] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2416.codfw.wmnet [17:09:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2416.codfw.wmnet [17:09:53] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2417.codfw.wmnet [17:09:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2417.codfw.wmnet [17:10:00] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2418.codfw.wmnet [17:10:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
mw2418.codfw.wmnet [17:10:08] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2419.codfw.wmnet [17:10:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2419.codfw.wmnet [17:10:15] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2019.codfw.wmnet [17:10:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2019.codfw.wmnet [17:13:28] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:02] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:06] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:16:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [17:19:51] (03PS1) 10Ebernhardson: cirrus: Exclude wikitech from streaming updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071007 [17:20:53] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10122514 (10Papaul) @Dwisehaupt all good we had the netmask wrong [17:21:22] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10122515 (10Papaul) @Dwisehaupt all good we had the netmask wrong [17:23:05] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'db2150 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68718 and previous config saved to /var/cache/conftool/dbconfig/20240905-172305-arnaudb.json [17:23:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 50%: T370852', diff saved to https://phabricator.wikimedia.org/P68719 and previous config saved to /var/cache/conftool/dbconfig/20240905-172306-arnaudb.json [17:23:08] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [17:25:02] (03CR) 10Ebernhardson: [C:03+2] cirrus: Exclude wikitech from streaming updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071007 (owner: 10Ebernhardson) [17:25:13] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10122530 (10Papaul) @Dwisehaupt all good on this one as well [17:26:03] (03Merged) 10jenkins-bot: cirrus: Exclude wikitech from streaming updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071007 (owner: 10Ebernhardson) [17:28:06] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:28:12] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:32:51] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:33:09] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:38:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68720 and previous config saved to /var/cache/conftool/dbconfig/20240905-173810-arnaudb.json [17:38:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68721 and previous config saved to 
/var/cache/conftool/dbconfig/20240905-173810-arnaudb.json [17:38:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: T370852', diff saved to https://phabricator.wikimedia.org/P68722 and previous config saved to /var/cache/conftool/dbconfig/20240905-173811-arnaudb.json [17:38:14] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [17:40:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:43:27] FIRING: [2x] SystemdUnitCrashLoop: logstash.service crashloop on apifeatureusage1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:44:21] (03PS1) 10Cwhite: logstash: ensure agent is a string before passing to useragent filter [puppet] - 10https://gerrit.wikimedia.org/r/1071011 (https://phabricator.wikimedia.org/T374142) [17:52:18] (03PS2) 10Muehlenhoff: Remove puppet checkout on pybaltest [puppet] - 10https://gerrit.wikimedia.org/r/1047509 (https://phabricator.wikimedia.org/T365798) [17:53:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68723 and previous config saved to /var/cache/conftool/dbconfig/20240905-175315-arnaudb.json [17:53:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 100%: T370852', diff saved to https://phabricator.wikimedia.org/P68724 and previous config saved to /var/cache/conftool/dbconfig/20240905-175316-arnaudb.json [17:53:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 
100%: T370852', diff saved to https://phabricator.wikimedia.org/P68725 and previous config saved to /var/cache/conftool/dbconfig/20240905-175316-arnaudb.json [17:53:19] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [17:53:27] RESOLVED: [2x] SystemdUnitCrashLoop: logstash.service crashloop on apifeatureusage1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:53:58] (03PS3) 10Scott French: sre.switchdc.mediawiki: migrate to the class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) [17:53:58] (03PS3) 10Scott French: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) [17:53:58] (03PS3) 10Scott French: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) [17:53:58] (03PS3) 10Scott French: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) [17:58:16] PROBLEM - eventlogging Varnishkafka log producer on cp3068 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:59:18] RECOVERY - eventlogging Varnishkafka log producer on cp3068 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:59:21] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:59:35] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:00:04] dancy and 
hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T1800) [18:06:28] !log add interface qos scheduler config to remaining CRs T339850 [18:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:31] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [18:08:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10122695 (10cmooney) 05Open→03Resolved a:03cmooney [18:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:23:19] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071015 (https://phabricator.wikimedia.org/T373640) [18:23:20] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071015 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [18:24:04] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071015 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [18:28:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10122772 (10Dwisehaupt) @Papaul Thanks. Confirmed that I can get to the drac and will start building the host. [18:28:56] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10122770 (10Dwisehaupt) a:05Dwisehaupt→03None @Papaul Thanks. 
Confirmed that I can get to the drac and will start building the host. [18:29:07] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10122773 (10Dwisehaupt) @Papaul Thanks. Confirmed that I can get to the drac and will start building the host. [18:37:11] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.21 refs T373640 [18:37:14] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [18:38:26] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure: Requesting access to `contint-admins`, `contint-docker` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10122794 (10andrea.denisse) Hi @karapayneWMDE , do you approve of this request? [18:38:44] PROBLEM - Host lists2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:39:36] RECOVERY - Host lists2001 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [18:42:52] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10122807 (10KFrancis) Hi all, the NDA is complete! Please proceed with next steps. Thanks! [18:47:02] (03CR) 10Ebernhardson: "Looks like wikidata has been reindexed for both indexes in both prod clusters, but I'm not seeing the appropriate mappings created for com" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [18:48:28] (03CR) 10Scott French: "Thank you very much for the reviews!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: 10Scott French) [18:50:26] (03CR) 10Scott French: "Thanks for the review!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [18:51:04] PROBLEM - Host mw2420 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:20] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10122873 (10andrea.denisse) [19:02:34] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10122876 (10andrea.denisse) a:05KFrancis→03andrea.denisse [19:03:24] (03CR) 10Cathal Mooney: [C:03+1] Don't uninstall libnet-dns-perl when moving from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1070273 (https://phabricator.wikimedia.org/T373637) (owner: 10Muehlenhoff) [19:08:08] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1070950 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis) [19:09:03] (03CR) 10Cathal Mooney: [C:03+1] "Intent seems clear, I'm not an expert on envoy though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1070949 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis) [19:14:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:19:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:20:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [19:20:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10123018 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8... 
[19:21:25] !log add interface qos scheduler config to codfw switches T339850 [19:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:28] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [19:33:57] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage [19:36:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage [19:51:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [19:51:27] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153 (10Iflorez) 03NEW [19:51:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10123081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dse-k8s-wo... 
[19:53:05] (03PS1) 10JHathaway: puppet8: ensure cassandra passwords are defined [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) [19:53:25] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:55:50] (03CR) 10CI reject: [V:04-1] puppet8: ensure cassandra passwords are defined [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:59:57] (03PS2) 10JHathaway: puppet8: ensure cassandra passwords are defined [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T2000) [20:00:04] physikerwelt: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:00:36] I am here [20:01:23] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:01:47] (03CR) 10Dzahn: [C:03+2] stewards: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1070965 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:02:40] (03CR) 10CI reject: [V:04-1] puppet8: ensure cassandra passwords are defined [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:03:05] (03PS1) 10Andrea Denisse: ldap: Add JJMC89's wmf_prod SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1071022 (https://phabricator.wikimedia.org/T369314) [20:03:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10123100 (10andrea.denisse) [20:04:20] (03PS3) 10JHathaway: puppet8: ensure cassandra passwords are defined [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) [20:05:11] ^^^ bgp alert is a flap for durum4002, back up so seems ok [20:05:33] (03CR) 10Dzahn: [C:03+2] "timers/services have been created on both steward hosts and lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1070965 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:08:53] (03PS4) 10Scott French: sre.switchdc.mediawiki: migrate to the class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) [20:08:53] (03PS4) 10Scott French: sre.switchdc.mediawiki: add --task-id argument [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) 
[20:08:54] (03PS4) 10Scott French: sre.switchdc.mediawiki: use admin reason in puppet disable [cookbooks] - 10https://gerrit.wikimedia.org/r/1068898 (https://phabricator.wikimedia.org/T330273) [20:08:54] (03PS4) 10Scott French: sre.switchdc.mediawiki: record RO start/end in task [cookbooks] - 10https://gerrit.wikimedia.org/r/1068899 (https://phabricator.wikimedia.org/T330273) [20:10:10] (03CR) 10Dzahn: [C:03+2] planet: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055996 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:10:48] Can someone help me to get https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069258 deployed. It's supposed to do nothing, just cleaning up [20:11:28] (03CR) 10Scott French: sre.switchdc.mediawiki: migrate to the class API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1068896 (https://phabricator.wikimedia.org/T328908) (owner: 10Scott French) [20:13:55] (03CR) 10Dzahn: [C:03+2] planet: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055996 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:16:35] (03PS1) 10Dzahn: planet: drop firewall rule for http from localhost [puppet] - 10https://gerrit.wikimedia.org/r/1071024 [20:16:51] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071024" [puppet] - 10https://gerrit.wikimedia.org/r/1055996 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:17:42] physikerwelt: there might not be any deployers around… i could go ping people, but if this is just cleanup, perhaps it can wait until next window? (that'd be on monday) or is something else waiting on this change to be done? 
[20:18:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155 (10cmooney) 03NEW p:05Triage→03High [20:19:07] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:20:16] I'm not available on Monday, but it could be also deployed together with the main change "Enable native mathml rendering by default on group0 and test wikis in production" I just wanted to do it in two steps. Just in case something unexpected happens... [20:20:20] (03PS3) 10RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - 10https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) [20:20:20] (03CR) 10RLazarus: "I'm happy to wait until https://gerrit.wikimedia.org/r/1068897 is submitted and rebase on top -- let me know what makes the most sense." [cookbooks] - 10https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [20:23:10] MatmaRex: do you know if a deployment window is needed for that change? [20:23:47] I am around, btw.
[20:23:53] Feel free to deploy [20:23:58] (03CR) 10Scott French: sre.switchdc.mediawiki: add --task-id argument (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1068897 (https://phabricator.wikimedia.org/T330273) (owner: 10Scott French) [20:23:58] especially fixes [20:24:12] or lemme know what needs deployin' [20:24:41] physikerwelt: i don't think so, that could just be scheduled for the normal backport window, if you ask me [20:24:52] dancy: the backport window, if you have a couple of minutes :) just one no-op patch [20:24:59] jouncebot: now [20:24:59] For the next 0 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240905T2000) [20:25:50] physikerwelt: Do you have a way to validate the change once it hits testservers? [20:26:01] dancy: this is the patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069258 it would be awesome if you could do it [20:26:20] dancy: to some degree [20:26:44] ok [20:26:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [20:27:00] dancy: I mean it should do nothing, so I could only see if it goes wrong [20:27:47] Things not going wrong are what I care about the most, so that sounds good. [20:28:02] (or, better, noticing that something is going wrong). 
[20:28:11] (03CR) 10Dzahn: [C:03+2] phabricator: replace ferm::service with firewall::service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:28:35] (03PS3) 10Physikerwelt: Remove redundandant setting of $wgDefaultUserOptions['math'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) [20:28:46] (03CR) 10TrainBranchBot: "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [20:29:37] (03Merged) 10jenkins-bot: Remove redundandant setting of $wgDefaultUserOptions['math'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [20:29:52] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1069258|Remove redundandant setting of $wgDefaultUserOptions['math'] (T373703)]] [20:29:56] T373703: Enable native mathml rendering by default on group0 and test wikis in production - https://phabricator.wikimedia.org/T373703 [20:29:57] I don't think I have access to any wiki in the group called private, so that is the part I cannot test (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069258/2/wmf-config/InitialiseSettings.php#b7030), but I can test the rest [20:31:14] Timo signed off on the change w/ a detailed note so I'm not too worried. 
[20:31:53] !log dancy@deploy1003 dancy, physikerwelt: Backport for [[gerrit:1069258|Remove redundandant setting of $wgDefaultUserOptions['math'] (T373703)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:32:26] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123163 (10Iflorez) [20:33:04] (03CR) 10Dzahn: [C:03+2] phabricator: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:33:56] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071011 (https://phabricator.wikimedia.org/T374142) (owner: 10Cwhite) [20:35:40] dancy: i didn't spot any change using the test server [20:35:49] ok. proceeding [20:35:54] !log dancy@deploy1003 dancy, physikerwelt: Continuing with sync [20:37:03] (03CR) 10Cwhite: [C:03+2] logstash: ensure agent is a string before passing to useragent filter [puppet] - 10https://gerrit.wikimedia.org/r/1071011 (https://phabricator.wikimedia.org/T374142) (owner: 10Cwhite) [20:37:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155#10123168 (10cmooney) Should mention there is a good case for shutting down PyBal on lvs1019 now, so that no traffic uses this bad link (instead rou... 
[20:37:47] (03PS1) 10Ahmon Dancy: Set parser for image gallery in CampaignPageFormatter [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071027 (https://phabricator.wikimedia.org/T374146) [20:37:52] (03PS1) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) [20:38:13] (03CR) 10CI reject: [V:04-1] phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:38:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10123172 (10Jclark-ctr) [20:40:07] (03CR) 10C. Scott Ananian: "On the phab task I also suggested the we split this in half and do everywiki-except-for-enwiki first, make sure there aren't any unexpecte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. 
Scott Ananian) [20:40:29] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069258|Remove redundandant setting of $wgDefaultUserOptions['math'] (T373703)]] (duration: 10m 36s) [20:40:34] T373703: Enable native mathml rendering by default on group0 and test wikis in production - https://phabricator.wikimedia.org/T373703 [20:41:26] !log add interface qos scheduler config to eqiad switches T339850 [20:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:28] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [20:41:55] physikerwelt: All set [20:42:21] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123181 (10Iflorez) [20:44:20] !log dancy@deploy1003 Installing scap version "4.101.2" for 211 hosts [20:46:32] dancy: thank you. [20:46:47] !log dancy@deploy1003 install-world aborted: (duration: 02m 28s) [20:47:01] (03PS1) 10JHathaway: puppet8: account for unknown probe types [puppet] - 10https://gerrit.wikimedia.org/r/1071031 (https://phabricator.wikimedia.org/T372664) [20:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:31] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071031 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:49:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071027 (https://phabricator.wikimedia.org/T374146) (owner: 10Ahmon Dancy) [20:49:45] (03CR) 10C. 
Scott Ananian: "https://meta.wikimedia.org/w/index.php?title=Tech%2FNews%2F2024%2F37&diff=27410025&oldid=27409299" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: 10C. Scott Ananian) [20:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:54:40] (03PS1) 10JHathaway: geoip: add fake info [labs/private] - 10https://gerrit.wikimedia.org/r/1071034 [20:55:56] (03CR) 10JHathaway: [C:03+2] geoip: add fake info [labs/private] - 10https://gerrit.wikimedia.org/r/1071034 (owner: 10JHathaway) [20:56:02] (03CR) 10JHathaway: [V:03+2 C:03+2] geoip: add fake info [labs/private] - 10https://gerrit.wikimedia.org/r/1071034 (owner: 10JHathaway) [20:56:48] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10123246 (10jhathaway) [21:00:33] (03Merged) 10jenkins-bot: Set parser for image gallery in CampaignPageFormatter [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071027 (https://phabricator.wikimedia.org/T374146) (owner: 10Ahmon Dancy) [21:00:45] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1071027|Set parser for image gallery in CampaignPageFormatter (T374146)]] [21:00:51] T374146: PHP Deprecated: Use of ImageGalleryBase::setHeights without parser was deprecated in MediaWiki 1.43. 
[Called from MediaWiki\Extension\UploadWizard\CampaignPageFormatter::generateReadHtml] - https://phabricator.wikimedia.org/T374146 [21:02:41] !log dancy@deploy1003 dancy: Backport for [[gerrit:1071027|Set parser for image gallery in CampaignPageFormatter (T374146)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:04] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123284 (10Iflorez) [21:09:50] !log dancy@deploy1003 Sync cancelled. [21:10:38] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@b47c79e] (releasing): (no justification provided) [21:11:18] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@b47c79e] (releasing): (no justification provided) (duration: 00m 39s) [21:12:33] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ihurbain - https://phabricator.wikimedia.org/T373811#10123303 (10andrea.denisse) 05Open→03In progress [21:13:59] (03PS1) 10Physikerwelt: Enable native MathML by default on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) [21:14:06] (03PS1) 10JHathaway: puppet8: avoid relying on g10k::config_file being defined [puppet] - 10https://gerrit.wikimedia.org/r/1071038 (https://phabricator.wikimedia.org/T372664) [21:14:06] (03PS2) 10Bking: search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:15:02] !log add interface qos scheduler config to remaining switches T339850 [21:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:05] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [21:15:08] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123309 (10cmooney) Can you share the full terminal output from when you run this 
command? And also confirm with operating system (Linux, Mac etc) you are using? ` ssh -N stat9 -L 8888:127.0.0.1:8880 -v ` Thanks [21:16:03] PROBLEM - Check whether ferm is active by checking the default input chain on mw1354 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:16:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [21:16:26] (03PS1) 10Ahmon Dancy: Revert "Set parser for image gallery in CampaignPageFormatter" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071039 [21:16:51] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10123316 (10jhathaway) [21:17:08] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10123330 (10jhathaway) [21:17:37] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071038 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [21:22:13] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123337 (10Iflorez) With @Btullis help I've tried the below #try to see the output ssh bast1003.wikimedia.org -v result: machine is checking weird keys that are not in the .ssh folder and it says it's finding in... 
[21:23:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071039 (owner: 10Ahmon Dancy) [21:23:31] (03CR) 10Dzahn: "I tested it, still works without this rule. The nft ruleset has a rule to allow from loopback and traffic goes through that between envoy " [puppet] - 10https://gerrit.wikimedia.org/r/1071024 (owner: 10Dzahn) [21:24:29] (03CR) 10Dzahn: [C:03+2] phabricator: replace ferm::service with firewall::service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:26:34] (03CR) 10CI reject: [V:04-1] Revert "Set parser for image gallery in CampaignPageFormatter" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071039 (owner: 10Ahmon Dancy) [21:28:10] (03CR) 10Ahmon Dancy: [V:03+2 C:03+2] Revert "Set parser for image gallery in CampaignPageFormatter" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071039 (owner: 10Ahmon Dancy) [21:28:31] (03PS2) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) [21:29:01] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123349 (10Iflorez) >>! In T374153#10123309, @cmooney wrote: > Can you share the full terminal output from when you run this command? And also confirm with operating system (Linux, Mac etc) you are using? > ` >... 
[21:29:28] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1071039|Revert "Set parser for image gallery in CampaignPageFormatter"]] [21:31:29] !log dancy@deploy1003 dancy: Backport for [[gerrit:1071039|Revert "Set parser for image gallery in CampaignPageFormatter"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:03] !log dancy@deploy1003 Sync cancelled. [21:33:05] !log gitlab1004 systemctl list-units --state=failed listed wmf_auto_restart_ssh-gitlab.service but at the same time it's 'Service ssh-gitlab not present or not running'.(?). Did a systemctl reset-failed to clear monitoring and it doesn't seem to come back. T374106 [21:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:08] T374106: SystemdUnitFailed - gitlab1004 - wmf_auto_restart_ssh - https://phabricator.wikimedia.org/T374106 [21:34:42] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123362 (10Iflorez) checking config file reading: ` $ ssh -v stat9 OpenSSH_9.7p1, LibreSSL 3.3.6 debug1: Reading configuration data /Users/iflorez_1/.ssh/config debug1: /Users/iflorez_1/.ssh/config line 4: Appl... [21:37:06] (03PS3) 10Bking: search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:37:20] (03PS4) 10Bking: search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:38:25] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123379 (10cmooney) >>! In T374153#10123349, @Iflorez wrote: > Hello, @cmooney > Thank you for taking a look! No problem! > Want to give you a heads up that @btulllis is helping troubleshoot. Ok cool, well w... 
[21:45:57] RECOVERY - Check whether ferm is active by checking the default input chain on mw1354 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:48:13] (03PS5) 10Bking: search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw1396 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:49:49] (03CR) 10Ryan Kemper: [C:03+1] search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:50:00] (03CR) 10Bking: [C:03+2] search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:50:15] (03CR) 10Bking: [V:03+2 C:03+2] search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [21:52:04] !log dancy@deploy1003 Installing scap version "4.101.3" for 211 hosts [21:53:33] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123425 (10Dzahn) Both of the examples you gave contain "Entering interactive session". So it seems to me both are working, not just the second one. When running `ssh -N stat9 -L 8888:127.0.0.1:8880 -v` and i... 
[21:54:04] !log dancy@deploy1003 install-world aborted: (duration: 02m 00s) [21:54:53] !log dancy@deploy1003 Installing scap version "4.101.3" for 211 hosts [21:56:14] (03PS1) 10Cathal Mooney: Remove variable controlling what devices have interface qos added [homer/public] - 10https://gerrit.wikimedia.org/r/1071045 (https://phabricator.wikimedia.org/T339850) [21:57:30] !log bking@grafana1002 apply grizzly SLO dashboard updates slo-Search added slo-apigw updated P68729 [21:57:30] (03CR) 10Cathal Mooney: [C:03+2] Remove variable controlling what devices have interface qos added [homer/public] - 10https://gerrit.wikimedia.org/r/1071045 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [21:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:55] !log bking@grafana1002 apply grizzly SLO dashboard updates slo-Search added slo-apigw updated P68729 T328330 [21:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:58] T328330: Create SLI / SLO on Search update lag - https://phabricator.wikimedia.org/T328330 [21:58:06] (03Merged) 10jenkins-bot: Remove variable controlling what devices have interface qos added [homer/public] - 10https://gerrit.wikimedia.org/r/1071045 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [21:59:14] !log dancy@deploy1003 Installing scap version "4.101.3" for 211 hosts [22:03:32] !log dancy@deploy1003 Installation of scap version "4.101.3" completed for 211 hosts [22:05:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#10123484 (10cmooney) 05Open→03Resolved a:03cmooney [22:08:38] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#10123487 (10cmooney) 05Open→03Resolved [22:16:37] (03PS3) 10Dzahn: phabricator: syntax fixes 
for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) [22:18:53] RECOVERY - Check whether ferm is active by checking the default input chain on mw1396 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:21:55] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123518 (10Iflorez) Thank you @cmooney. Everything is working smoothly now. This ticket can be closed, thank you! [22:27:19] PROBLEM - Check whether ferm is active by checking the default input chain on mw1453 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:31:51] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123542 (10Dzahn) a:05Dzahn→03None [22:32:30] 06SRE: Accessing stat machines after computer resetting - https://phabricator.wikimedia.org/T374153#10123538 (10Dzahn) 05Open→03Resolved a:03Dzahn [22:38:07] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10123557 (10Papaul) [22:42:20] (03CR) 10Scott French: "I'm also happy to deal with the merge conflicts, as long as you're cool with reviewing the result when this class-API'd. 
So yeah, no stron" [cookbooks] - 10https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: 10RLazarus) [22:43:23] (03PS1) 10Andrea Denisse: ldap: Add ihurbain to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1071048 (https://phabricator.wikimedia.org/T373811) [22:43:30] (03CR) 10Andrea Denisse: [C:03+2] ldap: Add ihurbain to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1071048 (https://phabricator.wikimedia.org/T373811) (owner: 10Andrea Denisse) [22:44:25] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ihurbain - https://phabricator.wikimedia.org/T373811#10123576 (10andrea.denisse) [22:45:17] (03PS1) 10EoghanGaffney: lists: Set number of processes for mailman3_runner to minimum of 14 [puppet] - 10https://gerrit.wikimedia.org/r/1071049 [22:46:36] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ihurbain - https://phabricator.wikimedia.org/T373811#10123581 (10andrea.denisse) 05In progress→03Resolved Hi, I'm closing this task as resolved. Feel free to reopen it if there's anything else I can help with. [22:47:13] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10123584 (10andrea.denisse) >>! In T373927#10119259, @Aklapper wrote: >>>! In T373927#10118977, @andrea.denisse wrote: >> I'm unable to remove the from the #acl-Project-ad... [22:49:18] !log gerrit-replica.wikimedia.org (gerrit2002) - rebooting T373980 [22:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:21] T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed. 
- https://phabricator.wikimedia.org/T373980 [22:49:38] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10123588 (10andrea.denisse) [22:49:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit2002.wikimedia.org with reason: T373980 [22:50:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit2002.wikimedia.org with reason: T373980 [22:52:58] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10123591 (10andrea.denisse) 05In progress→03Resolved Hi @WMDE-leszek , the offboarding process for Manuel is complete. Feel free to reopen this task if there's any... [22:53:07] !log disable PyBal on lvs1019 to swing traffic to lvs1020 and allow for intrusive work to correct link errors T374155 [22:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:11] T374155: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155 [22:54:39] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1019.eqiad.wmnet with reason: Move traffic off lvs1019 to lvs1029 to troubleshoot faulty link [22:54:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: Move traffic off lvs1019 to lvs1029 to troubleshoot faulty link [22:54:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155#10123611 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01519f9e-2903-4b5b-b71f-e25b1467cc00) set by cmooney@cumin1002 for 2:0... 
[22:55:40] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10123603 (10andrea.denisse) 05Open→03Stalled Hi @KFrancis , can you please confirm NDA status for @philippe.saade.WMDE ? [22:55:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155#10123609 (10cmooney) @Jclark-ctr is on site so I will swing the live traffic to lvs1020 so we can investigate. [22:56:01] (03CR) 10Andrea Denisse: [C:03+2] ldap: Add JJMC89's wmf_prod SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1071022 (https://phabricator.wikimedia.org/T369314) (owner: 10Andrea Denisse) [22:56:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:57:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10123616 (10andrea.denisse) a:03andrea.denisse [22:57:15] RECOVERY - Check whether ferm is active by checking the default input chain on mw1453 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:00:37] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure: Requesting access to `contint-admins`, `contint-docker` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10123633 (10andrea.denisse) 05Open→03Stalled [23:01:17] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10123613 (10andrea.denisse) 05In progress→03Resolved a:05andrea.denisse→03None Hi, I'm closing this task as resolved. Feel free to reopen it if there's anyt... 
[23:09:52] 06SRE, 06Commons, 07Wikimedia-production-error: Cannot delete file on Commons: DBQueryError (File:Logo-headlinejabar.jpg) - https://phabricator.wikimedia.org/T373748#10123655 (10andrea.denisse) 05Open→03Resolved a:03andrea.denisse Hi, the referenced images were successfully deleted by @Yann : {F57... [23:09:54] 06SRE, 06Data-Engineering, 06Data-Platform, 06serviceops: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10123659 (10andrea.denisse) [23:11:35] 10SRE-swift-storage, 10MW-on-K8s, 06serviceops, 10Shellbox, 13Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10123661 (10tstarling) 05In progress→03Open [23:11:48] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10123676 (10KFrancis) @philippe.saade.WMDE Hello! Please send your full name and email address to kfrancis@wikimedia.org and I'll draft the NDA for you. Thanks! [23:13:56] FIRING: [2x] MaxConntrack: Max conntrack at 99.99% on ncredir6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [23:18:56] RESOLVED: [2x] MaxConntrack: Max conntrack at 99.99% on ncredir6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [23:34:10] (03CR) 10Krinkle: [C:03+1] "OK to deploy!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [23:35:01] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure: Requesting access to `contint-admins`, `contint-docker` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10123799 (10Dzahn) In addition to manager, @thcipriani is group approver for contint-admins and contint-docker groups. [23:35:03] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176 (10Papaul) 03NEW [23:36:12] !log re-enable PyBal on lvs1019 after fixing faulty link with replacement optic T374155 [23:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:15] T374155: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155 [23:37:28] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10123837 (10Papaul) [23:38:31] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10123839 (10Papaul) [23:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071053 [23:38:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071053 (owner: 10TrainBranchBot) [23:39:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw1495 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:43:04] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime 
for lvs1019.eqiad.wmnet [23:43:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1019.eqiad.wmnet [23:43:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1019 to ssw1-f1-eqiad - https://phabricator.wikimedia.org/T374155#10123855 (10cmooney) 05Open→03Resolved Ok so we replaced the optic on the lvs1019 side, and things seem to be good. Sent a test stream of...