[00:03:10] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1052827 (owner: 10TrainBranchBot) [00:12:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T367856)', diff saved to https://phabricator.wikimedia.org/P66019 and previous config saved to /var/cache/conftool/dbconfig/20240709-001250-marostegui.json [00:12:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [00:12:53] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [00:13:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [00:13:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66020 and previous config saved to /var/cache/conftool/dbconfig/20240709-001324-arnaudb.json [00:13:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [00:13:28] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [00:13:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [00:14:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest2001.codfw.wmnet [00:49:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest2001.codfw.wmnet [00:54:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance [00:54:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance [00:54:57] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Depooling db2217 (T367781)', diff saved to https://phabricator.wikimedia.org/P66021 and previous config saved to /var/cache/conftool/dbconfig/20240709-005456-arnaudb.json [00:55:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [00:57:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367781)', diff saved to https://phabricator.wikimedia.org/P66022 and previous config saved to /var/cache/conftool/dbconfig/20240709-005720-arnaudb.json [01:06:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:08:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.13 [core] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052835 (https://phabricator.wikimedia.org/T366958) [01:08:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.13 [core] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052835 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [01:11:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:12:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P66023 and previous config saved to /var/cache/conftool/dbconfig/20240709-011227-arnaudb.json [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:27:35] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P66024 and previous config saved to /var/cache/conftool/dbconfig/20240709-012735-arnaudb.json [01:31:37] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.13 [core] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052835 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [01:32:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:37:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:42:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367781)', diff saved to https://phabricator.wikimedia.org/P66025 and previous config saved to /var/cache/conftool/dbconfig/20240709-014242-arnaudb.json [01:42:46] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [01:43:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [01:48:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0200) [02:04:44] FIRING: 
OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:09:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:28:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:31:45] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:33:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [02:39:17] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:17] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see 
Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0300) [03:01:50] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052838 (https://phabricator.wikimedia.org/T366958) [03:01:52] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052838 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [03:02:31] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052838 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [03:02:59] !log mwpresync@deploy1002 Started scap sync-world: testwikis wikis to 1.43.0-wmf.13 refs T366958 [03:03:04] T366958: 1.43.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T366958 [03:15:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [03:20:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [03:25:31] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:25:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:29:31] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:30:33] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:41:35] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:41:37] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:46:41] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:47:43] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 141, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:53:51] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.13 refs T366958 (duration: 50m 52s) [03:53:54] T366958: 1.43.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T366958 [03:55:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T367856)', diff saved to https://phabricator.wikimedia.org/P66026 and previous config saved to /var/cache/conftool/dbconfig/20240709-035529-marostegui.json [03:55:33] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers 
(except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0400) [04:01:01] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.10 (duration: 00m 57s) [04:02:28] (03CR) 10Tim Starling: [C:03+1] Missing.php: check REQUEST_URI in addition to PATH_INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052165 (https://phabricator.wikimedia.org/T9496) (owner: 10Pppery) [04:02:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [04:07:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [04:10:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P66027 and previous config saved to /var/cache/conftool/dbconfig/20240709-041036-marostegui.json [04:16:53] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:16:53] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:21:53] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:22:03] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 141, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:25:44] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P66028 and previous config saved to /var/cache/conftool/dbconfig/20240709-042544-marostegui.json [04:34:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [04:40:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T367856)', diff saved to https://phabricator.wikimedia.org/P66029 and previous config saved to /var/cache/conftool/dbconfig/20240709-044051-marostegui.json [04:40:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:40:55] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:41:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:41:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:41:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:41:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T367856)', diff saved to https://phabricator.wikimedia.org/P66030 and previous config saved to /var/cache/conftool/dbconfig/20240709-044128-marostegui.json [04:49:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [04:58:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts 
with reason: Primary switchover s2 T369339 [04:58:05] T369339: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T369339 [04:58:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1222 with weight 0 T369339', diff saved to https://phabricator.wikimedia.org/P66031 and previous config saved to /var/cache/conftool/dbconfig/20240709-045814-marostegui.json [04:58:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T369339 [04:59:25] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1222 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1052200 (https://phabricator.wikimedia.org/T369339) (owner: 10Gerrit maintenance bot) [04:59:31] (03PS2) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1052201 (https://phabricator.wikimedia.org/T369339) [05:07:28] (03PS1) 10Marostegui: db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1052841 [05:07:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 414.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:08:01] (03CR) 10Marostegui: [C:03+2] db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1052841 (owner: 10Marostegui) [05:08:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:48] (03CR) 10Marostegui: [C:03+2] wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1052201 (https://phabricator.wikimedia.org/T369339) (owner: 10Gerrit maintenance bot) [05:17:11] !log Starting s2 eqiad failover from db1162 to db1222 - T369339 [05:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:14] T369339: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T369339 [05:17:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T369339', diff saved to https://phabricator.wikimedia.org/P66032 and previous config saved to /var/cache/conftool/dbconfig/20240709-051749-marostegui.json [05:18:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1222 to s2 primary and set section read-write T369339', diff saved to https://phabricator.wikimedia.org/P66033 and previous config saved to /var/cache/conftool/dbconfig/20240709-051814-marostegui.json [05:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1162 T369339', diff saved to https://phabricator.wikimedia.org/P66034 and previous config saved to /var/cache/conftool/dbconfig/20240709-051911-marostegui.json [05:20:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Long schema change [05:20:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Long schema change [05:20:42] !log Deploy schema change on s2 eqiad db1162 dbmaint T367856 [05:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:45] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:35:50] (03PS1) 
10Aklapper: phabricator: Clarify quarter of quarterly data for WMF QLS [puppet] - 10https://gerrit.wikimedia.org/r/1052844 [05:41:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:41:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:56:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0600) [06:00:04] marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0600). [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:19] 06SRE, 06Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9964011 (10Count_Count) @bd808 Thank you! I have started using the Wikimedia REST API instead (see https://phabricator.wikimedia.org/T369414#99... 
[06:06:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [06:15:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [06:21:58] PROBLEM - MariaDB Replica SQL: s2 on db1239 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:31:45] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [06:54:29] !log Start `foreachwikiindblist group2.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` in a tmux session [06:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [06:58:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522) (owner: 10Anzx) [06:59:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0700). 
[07:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:03:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:04:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:07:54] o/ [07:08:23] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3073.*} and A:cp [07:08:37] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on P{cp3073.*} and A:cp [07:10:22] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3073.*} and A:cp [07:12:25] !log fabfur@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-reboot (exit_code=1) rolling reboot on P{cp3073.*} and A:cp [07:13:51] Is anyone around to deploy your change? 
[07:14:17] anzx: [07:16:27] I can deploy your change anzx [07:20:00] (03CR) 10Dreamy Jazz: [C:03+2] jawiki: add throttle rule for edit-a-thon July 11-18, 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522) (owner: 10Anzx) [07:20:42] Dreamy_Jazz: ok [07:20:42] (03Merged) 10jenkins-bot: jawiki: add throttle rule for edit-a-thon July 11-18, 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522) (owner: 10Anzx) [07:21:36] (03PS1) 10Jelto: exclude phabricator.discovery.wmnet from ATSBackendErrorsHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1052850 (https://phabricator.wikimedia.org/T362401) [07:22:10] Dreamy_Jazz: I think you might need to reset the throttle cache too as it's less than 72 hours in advance [07:22:44] Thanks. Following steps on https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold, so will do that once I've done the deployment step. [07:22:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 125, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:22:54] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:23:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:24:16] RECOVERY - Host cp3073 is UP: PING OK - Packet loss = 0%, RTA = 80.66 ms [07:25:57] It looks like there isn't a range argument to that maintenance script, so I will need to run the maintenance script for each IP in the range? 
[07:25:59] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3073.*} and A:cp [07:26:25] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netbox-dev2002.codfw.wmnet [07:27:56] Dreamy_Jazz: I think `--ip 202.25.144.0/20 ` would work [07:30:07] I can't find an example of that in the server admin log (where a range was provided) and my understanding of the code is that it takes a single IP [07:30:51] (03PS1) 10Ayounsi: Ganeti RAPI replace netbox-dev2002 with 2003 [puppet] - 10https://gerrit.wikimedia.org/r/1052851 (https://phabricator.wikimedia.org/T336275) [07:30:56] !log dreamyjazz@deploy1002 Synchronized wmf-config/throttle.php: Deploying throttle change for T369522 (duration: 09m 50s) [07:30:59] T369522: Lift IP cap on 2024-07-11 and 2024-07-18 for Editation for jawiki - https://phabricator.wikimedia.org/T369522 [07:31:54] I'm hesitant to run the maintenance script 4000 times (once for each IP in the range) [07:32:42] !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3073.esams.wmnet [07:32:42] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3073.*} and A:cp [07:33:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:33:55] RhinosF1: Do you know if I can use a range for that maintenance script? 
[07:35:28] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [07:35:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:38:14] (03CR) 10Elukey: [C:03+2] merge_cli: fix a puppet-merge.sh comment [puppet] - 10https://gerrit.wikimedia.org/r/1052260 (https://phabricator.wikimedia.org/T366355) (owner: 10Elukey) [07:38:42] !log repool cp3073 [07:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:45] My understanding of that script is that it clears the throttle instead of updating the upper threshold. As such, considering that the maximum currently is 6 and the number of participants is 23, I think the limit of 50 should still provide enough space to avoid hitting the throttle without having to reset it. [07:39:18] As such, I will not run the maintenance script (considering I would need to run it 4000 times). [07:39:46] If anyone else has more experience with this script and wants to do it, then feel free. [07:40:03] anzx: Does that make sense to you? 
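The "4000 times" figure in the exchange above can be checked with a short calculation. The sketch below is not part of the log and does not touch the maintenance script. It only counts the addresses in the `202.25.144.0/20` range quoted at 07:27:56, using Python's standard `ipaddress` module. The helper name `range_size` is hypothetical, chosen for illustration.

```python
import ipaddress


def range_size(cidr: str) -> int:
    """Return how many addresses a CIDR range contains (hypothetical helper)."""
    return ipaddress.ip_network(cidr).num_addresses


# Range quoted in the discussion above.
network = ipaddress.ip_network("202.25.144.0/20")

# A /20 prefix leaves 32 - 20 = 12 host bits, so 2**12 addresses.
print(range_size("202.25.144.0/20"))  # 4096
print(network[0], network[-1])        # 202.25.144.0 202.25.159.255
```

If the script really accepts only a single IP, as understood at 07:30:07, covering the whole range would mean roughly 4096 separate runs, which matches the estimate given in the channel.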
[07:40:21] !log Morning UTC backport window done [07:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:41] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox-dev2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [07:40:42] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1052851 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:40:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:40:47] (03CR) 10Ayounsi: [C:03+2] Ganeti RAPI replace netbox-dev2002 with 2003 [puppet] - 10https://gerrit.wikimedia.org/r/1052851 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:40:58] Dreamy_Jazz: i will ask in afternoon backport window [07:41:05] 👍 [07:42:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox-dev2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [07:42:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:42:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox-dev2002.codfw.wmnet [07:45:40] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:45:45] (03PS2) 10Gmodena: beta: eventbus: enable instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) [07:46:31] (03CR) 10Elukey: [C:03+1] "LGTM! 
Left a comment, feel free to skip it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052776 (owner: 10Alexandros Kosiaris) [07:46:32] Dreamy_Jazz: no [07:46:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:49:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809#9964152 (10ayounsi) 05Open→03Declined Closing this task as afaik we haven't seen any issue in esams, and the proper path forward is tracked in {T367973}... [07:49:49] 06SRE, 06Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743#9964168 (10ayounsi) 05Open→03Declined Closing this task as afaik we haven't seen any issue in esams, and the proper path forward is tracked in {T367973}. Please re-open if you disagree. [07:51:02] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: adjust fr payments-listener endpoint monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: 10Filippo Giunchedi) [07:51:09] (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you Dallas!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: 10Filippo Giunchedi) [07:51:25] (03PS1) 10Jforrester: Revert "wikifunctions: Add addNestedMetadata to production orchestrator config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052906 (https://phabricator.wikimedia.org/T368892) [07:51:32] (03PS2) 10Jforrester: Revert "wikifunctions: Add addNestedMetadata to production orchestrator config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052906 (https://phabricator.wikimedia.org/T368892) [07:51:43] FIRING: [2x] OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:53:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:53:32] (03CR) 10Elukey: [C:03+1] "Left some comments but feel free to proceed with the test!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [07:53:48] jouncebot: nowandnext [07:53:48] For the next 0 hour(s) and 6 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0700) [07:53:48] In 0 hour(s) and 6 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0800) [07:54:24] (03CR) 10Jforrester: [C:03+2] Revert "wikifunctions: Add addNestedMetadata to production orchestrator config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052906 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester) [07:55:30] (03Merged) 10jenkins-bot: Revert "wikifunctions: Add addNestedMetadata to production orchestrator config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052906 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester) [07:55:58] * andre will be a few minutes late running the train [07:56:09] (03CR) 10Elukey: [C:03+1] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [07:56:26] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:43] RESOLVED: [2x] OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [07:57:36] (03CR) 10Filippo Giunchedi: [C:03+1] exclude phabricator.discovery.wmnet from ATSBackendErrorsHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1052850 
(https://phabricator.wikimedia.org/T362401) (owner: 10Jelto) [07:57:38] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [07:58:19] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [07:58:39] 06SRE, 06Data Products, 06Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9964185 (10Joe) 05Open→03Resolved a:03Joe I'll boldly consider this task resolved, please reopen it if the problem is still present. [07:59:31] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [07:59:47] (03CR) 10Filippo Giunchedi: [C:03+1] istio_sli_avail: alert if metric goes absent [alerts] - 10https://gerrit.wikimedia.org/r/1052784 (https://phabricator.wikimedia.org/T352756) (owner: 10Herron) [08:00:04] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0800) [08:01:15] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9964208 (10elukey) >>! In T354410#9961563, @Volans wrote: > @elukey do you know how much of a... 
[08:01:50] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [08:01:52] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [08:03:43] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [08:05:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [08:06:09] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#9964221 (10Joe) [08:07:46] I will now start promoting group0 wikis to 1.43.0-wmf.13 [08:09:01] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052910 (https://phabricator.wikimedia.org/T366958) [08:09:02] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052910 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [08:09:41] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052910 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [08:10:44] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [08:12:07] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [08:13:42] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add frontend proxy capability to LB 
[puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [08:17:25] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.13 refs T366958 [08:17:28] T366958: 1.43.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T366958 [08:20:44] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [08:22:43] (03PS1) 10Filippo Giunchedi: mobileapps: lower tracing sampling percentage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052912 (https://phabricator.wikimedia.org/T320563) [08:25:17] (03PS1) 10Elukey: role::puppetserver: deploy the gitpuppet admin group [puppet] - 10https://gerrit.wikimedia.org/r/1052914 (https://phabricator.wikimedia.org/T368023) [08:25:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [08:26:05] (03CR) 10Filippo Giunchedi: [C:03+2] mobileapps: lower tracing sampling percentage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052912 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:27:36] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [08:28:01] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [08:28:02] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [08:28:24] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [08:37:27] (03PS1) 10Filippo Giunchedi: admin: add eup to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1052921 (https://phabricator.wikimedia.org/T369500) 
[08:38:16] (03CR) 10CI reject: [V:04-1] admin: add eup to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1052921 (https://phabricator.wikimedia.org/T369500) (owner: 10Filippo Giunchedi) [08:38:35] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9964364 (10fgiunchedi) >>! In T369500#9963066, @EUwandu-WMF wrote: > Hello @fgiunchedi , Please can you check again to see if it works now? Here is the screenshot of my sign-in wit... [08:40:35] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9964365 (10fgiunchedi) >>! In T369500#9964364, @fgiunchedi wrote: >>>! In T369500#9963066, @EUwandu-WMF wrote: >> Hello @fgiunchedi , Please can you check again to see if it works... [08:42:18] (03CR) 10Klausman: [C:03+1] "I *think* this will work, but I am still not sure I understand PSA/PSS correctly. So let's merge it, try it in staging and see what color " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:42:26] (03CR) 10Klausman: [C:03+1] "Acknowledged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:44:20] (03CR) 10Filippo Giunchedi: [C:03+1] "Manager approval now present" [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566) (owner: 10Volans) [08:53:54] (03CR) 10Volans: [C:03+1] "approved on task" [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566) (owner: 10Volans) [08:54:43] (03CR) 10Jelto: [C:03+2] exclude phabricator.discovery.wmnet from ATSBackendErrorsHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1052850 (https://phabricator.wikimedia.org/T362401) (owner: 10Jelto) [08:55:29] (03PS1) 10Slyngshede: Styling: Allow the use of normal Codex tables. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1052923 [08:55:44] (03CR) 10Elukey: "It should work, but I'd check the diff with istioctl (current manifest and the new one) since the ML cluster has an extra complication, na" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:55:53] (03Merged) 10jenkins-bot: exclude phabricator.discovery.wmnet from ATSBackendErrorsHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1052850 (https://phabricator.wikimedia.org/T362401) (owner: 10Jelto) [08:57:29] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9964436 (10Volans) Ok, sounds good to me. Thanks for looking into this and yes there is no re... [08:57:29] (03PS1) 10Slyngshede: Permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 [08:58:18] (03CR) 10Filippo Giunchedi: [C:03+2] admin: add sharvaniharan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566) (owner: 10Volans) [09:02:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9964474 (10fgiunchedi) [09:03:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9964479 (10fgiunchedi) You should be able to access the dashboards in ~30 min from now, please confirm that is the case. Also please confirm you have re... 
[09:03:21] jouncebot: now [09:03:21] For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0800) [09:06:00] (03PS1) 10Filippo Giunchedi: admin: add lferreira [puppet] - 10https://gerrit.wikimedia.org/r/1052927 (https://phabricator.wikimedia.org/T369348) [09:06:45] !log restart purged @ cp3073 [09:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:53] (03CR) 10Alexandros Kosiaris: [C:04-1] "Took me a while to reason through all of these cryptic rules and how this patch alters them, but I think I finally figured it out." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [09:10:10] (03CR) 10Ayounsi: [C:03+1] admin: add lferreira [puppet] - 10https://gerrit.wikimedia.org/r/1052927 (https://phabricator.wikimedia.org/T369348) (owner: 10Filippo Giunchedi) [09:10:22] 06SRE, 10Thumbor: wikipedia-pl-sysop: local images fail to generate thumbnail - https://phabricator.wikimedia.org/T368945#9964497 (10fgiunchedi) p:05Triage→03Medium [09:10:32] (03CR) 10Filippo Giunchedi: [C:03+2] admin: add lferreira [puppet] - 10https://gerrit.wikimedia.org/r/1052927 (https://phabricator.wikimedia.org/T369348) (owner: 10Filippo Giunchedi) [09:10:52] (03PS1) 10Lucas Werkmeister (WMDE): termbox: update to 2024-07-09-084416-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052928 (https://phabricator.wikimedia.org/T368523) [09:12:07] anyone wanna +1 the version bump at ^ ? 
then I’d deploy it later today :) [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:18] (03CR) 10Lucas Werkmeister (WMDE): Add $wgMaxShellWallClockTime setting for shellbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [09:15:08] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9964504 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi @Lferreira you are now part of the `wmf` ldap group, I'm optimistically resolving the task, though please reopen if... [09:26:19] (03Abandoned) 10Elukey: role::puppetserver: deploy the gitpuppet admin group [puppet] - 10https://gerrit.wikimedia.org/r/1052914 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [09:26:24] !log cparle@deploy1002 Started deploy [airflow-dags/platform_eng@0e9b3ac]: (no justification provided) [09:26:57] !log cparle@deploy1002 Finished deploy [airflow-dags/platform_eng@0e9b3ac]: (no justification provided) (duration: 00m 32s) [09:27:08] (03CR) 10Klausman: [C:03+1] "Looks as expected:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:31:24] (03PS4) 10Clément Goubert: Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) [09:31:50] (03CR) 10Clément Goubert: "No you're right, done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:32:27] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9964576 (10Aklapper) 05Resolved→03Open I'll bite per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group [09:35:13] hashar: andre are you lot running the train or it remained blocked? [09:35:14] (03PS4) 10Clément Goubert: Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) [09:35:57] effie: all fine per https://versions.toolforge.org/ [09:36:00] hashar: andre I mean, are you done for this window? [09:36:08] oh sorry, yes, we are done [09:36:24] sorry, I'll take a note to be more explicit about this [09:36:28] andre: deployment calendar has a blocker marked [09:36:44] thus I saw the deployment, but not sure if you are done for now [09:37:04] effie: hmm, which URI are you looking at? [09:37:12] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T0800 [09:37:51] effie: the link to the blockers task is always there [09:37:55] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9964597 (10fgiunchedi) >>! In T369348#9964576, @Aklapper wrote: > I'll bite per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group doh, of course! thank... [09:38:07] ah that's the general blocker bug for tracking, interesting (and confusing, indeed) [09:38:09] that's just the weekly parent task that any blockers will be added to [09:38:25] oh my bad, I don't remember seeing it in the past, though who knows [09:38:43] hmm [09:39:04] ok sorry for the major confusion, andre ok for me to use the rest of your window? 
[09:39:17] andre: sorry I kind of missed you were running the train this week, I assumed it was run by our estimated USA colleagues [09:39:45] effie, yes, go ahead [09:39:49] cheers [09:39:56] 🍿 [09:39:58] hashar: no problem :) [09:40:16] (03PS1) 10Btullis: Configure analytics_meta MariaDB clients to connect to an-mariadb1002 [puppet] - 10https://gerrit.wikimedia.org/r/1052932 (https://phabricator.wikimedia.org/T365503) [09:40:23] andre: we can sync up tomorrow if you want. I am happy to share a coffee over a virtual meet [09:41:40] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9964599 (10Aklapper) 05Open→03Resolved Thanks. (And sorry, I know I could have done this myself but I don't know a better way to occasionally remind folks to skip muscle mem... [09:41:46] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3181/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052932 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis) [09:41:51] hashar, let's do tomorrow :) [09:42:17] (03CR) 10Elukey: "I just realized that the mesh's perms will be deployed via the kserve-inference chart settings, so we should be good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [09:46:17] (03PS1) 10Alexandros Kosiaris: Remove role::etcd::v3::kubernetes and hosts [puppet] - 10https://gerrit.wikimedia.org/r/1052933 (https://phabricator.wikimedia.org/T353464) [09:47:58] (03CR) 10Elukey: "Hi folks! 
I found this change while working on the private repository on puppetserver1001, that soon should replace /srv/private on puppe" [puppet] - 10https://gerrit.wikimedia.org/r/1015032 (owner: 10Majavah) [09:48:40] (03PS1) 10Elukey: Revert "P:puppetserver::git: do not mark directories as safe" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 [09:51:21] (03CR) 10CI reject: [V:04-1] Revert "P:puppetserver::git: do not mark directories as safe" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 (owner: 10Elukey) [09:53:26] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#9964670 (10elukey) While prepping for making a commit on puppetserver1001, I ended up filing a revert: https://gerrit.wikimedi... [09:54:29] (03PS1) 10Marostegui: Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052935 [09:54:47] (03CR) 10Kamila Součková: "Thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [09:55:11] (03PS2) 10Elukey: Revert "P:puppetserver::git: do not mark directories as safe" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 [09:55:13] (03CR) 10Marostegui: [C:03+2] Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052935 (owner: 10Marostegui) [09:55:30] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 (owner: 10Elukey) [09:55:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66035 and previous config saved to /var/cache/conftool/dbconfig/20240709-095538-root.json [09:56:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 36 hosts with reason: Primary switchover s1 T369515 [09:56:52] T369515: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T369515 [09:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2203 with weight 0 T369515', diff saved to https://phabricator.wikimedia.org/P66036 and previous config saved to /var/cache/conftool/dbconfig/20240709-095659-root.json [09:57:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 36 hosts with reason: Primary switchover s1 T369515 [09:57:32] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1052751 (https://phabricator.wikimedia.org/T369515) (owner: 10Gerrit maintenance bot) [09:58:16] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052933 (https://phabricator.wikimedia.org/T353464) (owner: 10Alexandros Kosiaris) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T1000) [10:00:05] (03CR) 10Effie Mouzeli: [C:03+2] 
mw-mcrouter: rollout to eqiad mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052741 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:01:19] (03Merged) 10jenkins-bot: mw-mcrouter: rollout to eqiad mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052741 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:01:44] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3182/console" [puppet] - 10https://gerrit.wikimedia.org/r/1015032 (owner: 10Majavah) [10:03:09] (03PS1) 10Btullis: Facilitate a role swap from an-mariadb1001 to an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052937 (https://phabricator.wikimedia.org/T365503) [10:03:21] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:04:46] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:10:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66037 and previous config saved to /var/cache/conftool/dbconfig/20240709-101043-root.json [10:11:42] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:00] (03PS3) 10Elukey: Revert "P:puppetserver::git: do not mark directories as safe" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 [10:14:17] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [10:14:17] (03PS4) 10Elukey: Revert "P:puppetserver::git: do not mark directories as safe" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 [10:14:30] (03PS5) 10Elukey: Partially Revert "P:puppetserver::git: do not mark 
directories as safe" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 [10:14:54] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9964826 (10SGupta-WMF) @mforns Thank you for verifying and raising the MR . I... [10:15:42] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3183/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052934 (owner: 10Elukey) [10:16:52] (03CR) 10Lucas Werkmeister (WMDE): Add $wgMaxShellWallClockTime setting for shellbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [10:25:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66038 and previous config saved to /var/cache/conftool/dbconfig/20240709-102549-root.json [10:29:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192 db1198 db1199 T365995', diff saved to https://phabricator.wikimedia.org/P66039 and previous config saved to /var/cache/conftool/dbconfig/20240709-102947-root.json [10:29:51] T365995: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995 [10:31:45] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:32:15] !log Starting s1 codfw failover from db2212 to db2203 - T369515 [10:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:32:18] T369515: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T369515 [10:32:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2203 to s1 primary T369515', diff saved to https://phabricator.wikimedia.org/P66040 and previous config saved to /var/cache/conftool/dbconfig/20240709-103238-root.json [10:33:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2212 T369515', diff saved to https://phabricator.wikimedia.org/P66041 and previous config saved to /var/cache/conftool/dbconfig/20240709-103331-root.json [10:34:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66042 and previous config saved to /var/cache/conftool/dbconfig/20240709-103409-root.json [10:37:12] !log Finished running maintenance scripts for T366781 [10:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:16] T366781: Run maintenance script to delete entries only for use when reading old on WMF wikis - https://phabricator.wikimedia.org/T366781 [10:38:41] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9964910 (10fgiunchedi) No worries at all and totally fair @Aklapper ! 
[10:40:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66043 and previous config saved to /var/cache/conftool/dbconfig/20240709-104054-root.json [10:42:55] (03PS1) 10Btullis: Temporarily pause gobblin to facilitate Hive/MariaDB maintenance [puppet] - 10https://gerrit.wikimedia.org/r/1052944 (https://phabricator.wikimedia.org/T365503) [10:42:57] (03PS1) 10Btullis: Revert the change to disable the gobbin timers on an-launcher [puppet] - 10https://gerrit.wikimedia.org/r/1052945 (https://phabricator.wikimedia.org/T365503) [10:45:47] (03PS1) 10Matthias Mullie: Re-introduce notices [extensions/UploadWizard] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052946 (https://phabricator.wikimedia.org/T369053) [10:46:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UploadWizard] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052946 (https://phabricator.wikimedia.org/T369053) (owner: 10Matthias Mullie) [10:48:37] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#9964939 (10elukey) Next steps: * Wait until a fix for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052934 is live and... 
[10:49:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66044 and previous config saved to /var/cache/conftool/dbconfig/20240709-104914-root.json [10:51:00] (03CR) 10CI reject: [V:04-1] Re-introduce notices [extensions/UploadWizard] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052946 (https://phabricator.wikimedia.org/T369053) (owner: 10Matthias Mullie) [10:54:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [10:54:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [10:54:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T367856)', diff saved to https://phabricator.wikimedia.org/P66045 and previous config saved to /var/cache/conftool/dbconfig/20240709-105454-marostegui.json [10:54:58] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:55:10] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1230 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1052948 (https://phabricator.wikimedia.org/T369616) [10:55:14] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1052949 (https://phabricator.wikimedia.org/T369616) [10:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66046 and previous config saved to /var/cache/conftool/dbconfig/20240709-105600-root.json [11:01:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:13] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers kubernetes1010.eqiad.wmnet, parse1011.eqiad.wmnet, parse1013.eqiad.wmnet, mw1492.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1386.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1415.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, mw1484.eqiad.wmnet, [11:02:13] 1.eqiad.wmnet, mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw1488.eqiad.wmnet, mw1370.eqiad.wmnet, mw1465.eqiad.wmnet, kubernetes1014.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1369.eqiad.wmnet, kubernetes1059.eqiad.wmnet, mw1486.eqiad.wmnet, mw1356.eqiad.wmnet, wikikube-worker1001.eqiad.wmnet, mw1483.eqiad.wmnet, mw1458.eqiad.wmnet, kubernetes1048.eqiad.wmnet, mw1371.eqiad.wmnet, parse1012.eqiad.wmnet, mw1431 [11:02:13] mnet, kubernetes1028.eqiad.wmnet, wikikube-worker1010.eqiad.wmnet, kubernetes1024.eqiad.wmnet, mw1464.eqiad.wmnet, parse1019.eqiad.wmnet, mw1391.eqiad.wmnet, kubernetes1056.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [11:02:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox-video_4080: Servers parse1011.eqiad.wmnet, mw1433.eqiad.wmnet, mw1380.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1386.eqiad.wmnet, parse1013.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1415.eqiad.wmnet, mw1484.eqiad.wmnet, mw1405.eqiad.wmnet, kubernetes1047.eqiad.wmnet, kubernetes103 [11:02:17] wmnet, mw1391.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, mw1389.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1395.eqiad.wmnet, kubernetes1033.eqiad.wmnet, kubernetes1014.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, mw1367.eqiad.wmnet, kubernetes1059.eqiad.wmnet, 
mw1469.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1356.eqiad.wmnet, wikikube-worker1001.eqiad.wmnet, wikikube-work [11:02:17] qiad.wmnet, mw1431.eqiad.wmnet, wikikube-worker1010.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1031.eqiad.wmnet, parse102 https://wikitech.wikimedia.org/wiki/PyBal [11:02:36] hnowlan ^ ? [11:02:37] erm, that would be me uploading something heheh. [11:02:39] hahaha [11:02:45] ouch [11:02:49] only used on testwiki, no need for alarm [11:02:55] looking into it [11:02:56] Did I break something? [11:03:09] kamila_: did you just upload something? [11:03:20] I uploaded a 100mb file :') [11:03:24] !incidents [11:03:24] 4841 (UNACKED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [11:03:27] I made an attempt and got a 503 xD [11:03:31] !ack 4841 [11:03:32] 4841 (ACKED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [11:04:02] sorry for the noise [11:04:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66047 and previous config saved to /var/cache/conftool/dbconfig/20240709-110420-root.json [11:04:48] so the concurrency in jobqueue might be an issue [11:04:56] because currently we only have 2 workers [11:04:57] Mhm [11:05:16] Right, it doesn't match [11:05:45] gonna bump replicas anyway [11:05:52] (03CR) 10Hnowlan: [C:03+2] shellbox-video: increase replicas, namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050375 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:09:10] (03Merged) 10jenkins-bot: shellbox-video: increase replicas, namespace resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050375 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:09:13] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:09:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:10:10] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:10:29] (03PS1) 10Aklapper: phabricator weekly changes email: Include EditEngine Form changes [puppet] - 10https://gerrit.wikimedia.org/r/1052953 (https://phabricator.wikimedia.org/T369548) [11:10:43] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:11:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66048 and previous config saved to /var/cache/conftool/dbconfig/20240709-111105-root.json [11:11:30] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:11:58] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:12:51] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:13:46] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:14:46] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:15:28] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:15:49] !log drained dse-k8s-worker1006.eqiad.wmnet ready for T365995 [11:15:49] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:52] T365995: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995 [11:16:04] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [11:17:19] !log set cephosd cluster into noout mode to prevent rebalancing for T365995 [11:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:24] !log depooled druid1010 for T365995 [11:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66049 and previous config saved to /var/cache/conftool/dbconfig/20240709-111925-root.json [11:25:43] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:26:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66050 and previous config saved to /var/cache/conftool/dbconfig/20240709-112611-root.json [11:26:13] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:28:14] !log Decommissioning lists1001 T331706 [11:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:16] T331706: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 [11:30:55] (03PS1) 10Hnowlan: shellbox-video: reduce replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052957 (https://phabricator.wikimedia.org/T356241) [11:32:29] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9965061 (10phaultfinder) [11:34:31] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2212 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66051 and previous config saved to /var/cache/conftool/dbconfig/20240709-113430-root.json [11:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:39:46] (03PS2) 10Btullis: Configure analytics_meta MariaDB clients to connect to an-mariadb1002 [puppet] - 10https://gerrit.wikimedia.org/r/1052932 (https://phabricator.wikimedia.org/T365503) [11:41:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3184/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052932 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis) [11:45:13] !log eoghan@cumin1002 START - Cookbook sre.hosts.decommission for hosts lists1001.wikimedia.org [11:49:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66052 and previous config saved to /var/cache/conftool/dbconfig/20240709-114935-root.json [11:54:31] !log eoghan@cumin1002 START - Cookbook sre.dns.netbox [11:54:38] (03PS1) 10EoghanGaffney: lists: Remove references to lists1001 after decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052959 [11:55:24] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9965147 (10Marostegui) databases are ready [11:56:59] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3185/console" [puppet] - 10https://gerrit.wikimedia.org/r/1052959 (owner: 10EoghanGaffney) [11:59:24] !log 
eoghan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002" [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T1200) [12:01:15] !log eoghan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002" [12:01:15] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:01:16] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lists1001.wikimedia.org [12:01:28] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9965158 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by eoghan@cumin1002 for hosts: `lists1001.wikimedia.org` - lists100... 
[12:04:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66053 and previous config saved to /var/cache/conftool/dbconfig/20240709-120440-root.json [12:04:43] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:07:54] (03PS1) 10Marostegui: db120[0-3]: Minor fix [puppet] - 10https://gerrit.wikimedia.org/r/1052963 [12:08:27] (03CR) 10Marostegui: [C:03+2] db120[0-3]: Minor fix [puppet] - 10https://gerrit.wikimedia.org/r/1052963 (owner: 10Marostegui) [12:10:07] (03PS2) 10EoghanGaffney: lists: Remove references to lists1001 after decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052959 (https://phabricator.wikimedia.org/T331706) [12:10:24] (03PS1) 10JMeybohm: Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) [12:10:34] (03CR) 10CI reject: [V:04-1] lists: Remove references to lists1001 after decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052959 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:11:11] (03CR) 10CI reject: [V:04-1] Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [12:11:17] (03PS3) 10EoghanGaffney: lists: Remove references to lists1001 after decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052959 (https://phabricator.wikimedia.org/T331706) [12:17:07] (03PS1) 10Btullis: Fail over hive and presto services to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1052965 (https://phabricator.wikimedia.org/T365993) [12:18:22] (03CR) 10Btullis: [C:03+2] Fail over hive and presto services to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1052965 
(https://phabricator.wikimedia.org/T365993) (owner: 10Btullis) [12:25:26] (03PS1) 10Filippo Giunchedi: pontoon: remove note re: sandbox/ branch [puppet] - 10https://gerrit.wikimedia.org/r/1052966 (https://phabricator.wikimedia.org/T352640) [12:31:40] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: remove note re: sandbox/ branch [puppet] - 10https://gerrit.wikimedia.org/r/1052966 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [12:41:59] (03PS2) 10JMeybohm: Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) [12:42:39] (03CR) 10CI reject: [V:04-1] Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [12:42:47] (03CR) 10Matthias Mullie: [C:03+1] "recheck" [extensions/UploadWizard] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1052946 (https://phabricator.wikimedia.org/T369053) (owner: 10Matthias Mullie) [12:44:43] (03CR) 10JMeybohm: "@hashar@free.fr can you help out with this? I think I completely misunderstood the purpose of the top level tox.ini. Can we not just have " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [12:49:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T367856)', diff saved to https://phabricator.wikimedia.org/P66054 and previous config saved to /var/cache/conftool/dbconfig/20240709-124907-marostegui.json [12:49:14] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:54:24] (03CR) 10Hashar: [C:03+1] "> How are we handling the service restart and make sure it's not forgotten?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [12:59:47] !log Restart Gerrit replica on gerrit2002 to apply a configuration change | T367505 [12:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:55] T367505: Use Gerrit 3.10 built-in log rotation - https://phabricator.wikimedia.org/T367505 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240709T1300). [13:00:05] kamila_, gmodena, and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:14] o/ [13:00:18] o/ [13:00:45] I can deploy! [13:01:18] o/ [13:01:35] kamila_ and/or gmodena: do you want to self-serve? (I think you might have deployment rights but I’m not sure from the puppet config ^^) [13:01:58] I have no idea how to do it :D [13:02:07] ok, then I can do it :) [13:02:09] Lucas_WMDE I could use some help, I have no idea how to deploy :( [13:02:17] can you ssh to deployment.eqiad.wmnet? [13:02:25] I've only done deploys (self-serve) on k8s [13:02:26] more training ;) [13:02:28] I can [13:02:36] then I think you could do the deployment if you want [13:02:40] let’s do kamila_ first [13:02:43] Lucas_WMDE: does IP range work for https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold asking for T369522 [13:02:44] T369522: Lift IP cap on 2024-07-11 and 2024-07-18 for Editation for jawiki - https://phabricator.wikimedia.org/T369522 [13:02:46] Lucas_WMDE ack [13:02:54] can I follow in a tmux or something Lucas_WMDE ? 
[13:02:56] (03PS4) 10Kamila Součková: Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) [13:03:02] sure, one moment [13:03:11] <3 [13:03:14] okay I have a tmux open [13:03:17] if you’re root you can attach, I think [13:03:22] otherwise… I don’t know how to do it [13:03:32] I feel like I shouldn’t make my tmux socket group-writable ^^ [13:04:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [13:04:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P66055 and previous config saved to /var/cache/conftool/dbconfig/20240709-130414-marostegui.json [13:04:20] (`sudo -u lucaswerkmeister-wmde tmux attach` if you can sudo, I assume) [13:04:36] for scap? [13:04:39] * Lucas_WMDE reads https://wikitech.wikimedia.org/wiki/Collaborative_tmux_sessions [13:04:45] Lucas_WMDE: thanks, got it [13:04:50] (03Merged) 10jenkins-bot: Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [13:05:31] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1052325|Add $wgMaxShellWallClockTime setting for shellbox (T356241)]] [13:05:34] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:05:41] one can spy scap log channel via https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 [13:05:41] o_O apparently chmod’ing the socket is actually expected? 
[13:05:51] eventually filter out the log.level=DEBUG messages [13:06:03] Lucas_WMDE: I'm already in the tmux, since yes, I can sudo [13:06:12] hashar's tip is also a good one [13:06:14] ok! [13:06:29] I'm just being annoyingly curious because I've never done a deploy myself, don't mind me :D [13:06:40] (I mean the bare metal part) [13:06:58] there’s precious little bare-metal left in this deployment too ^^ [13:07:13] and you basically just run `scap backport GERRIT-URL` and watch it do things [13:07:15] but my change is basically only relevant for those :D [13:07:22] that's neat [13:07:27] the backport feature is new-ish, right? [13:07:27] “Learn how manual deployment works and then don't do it. Use scap backport instead.” – https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers [13:07:31] ack :D [13:07:33] nice :D [13:07:40] I think `scap backport` is 6 months old or so [13:07:46] so yeah relatively new [13:08:04] !log lucaswerkmeister-wmde@deploy1002 kamila, lucaswerkmeister-wmde: Backport for [[gerrit:1052325|Add $wgMaxShellWallClockTime setting for shellbox (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:19] alright, the change should now be on mwdebug [13:08:23] can you test it there? 
[13:08:28] (I’m not sure if this change is testable…) [13:08:35] Lucas_WMDE: I'll check async, since it only affects videoscalers [13:08:42] ok, then let’s just sync [13:08:44] (03PS1) 10Effie Mouzeli: mw-*: fully rollout use of mw-mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052970 (https://phabricator.wikimedia.org/T346690) [13:08:46] !log lucaswerkmeister-wmde@deploy1002 kamila, lucaswerkmeister-wmde: Continuing with sync [13:08:47] yeah [13:09:20] (03PS1) 10GergesShamon: use text() instead of escaped() for msg recentchanges [skins/MinervaNeue] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052971 (https://phabricator.wikimedia.org/T352626) [13:09:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [skins/MinervaNeue] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052971 (https://phabricator.wikimedia.org/T352626) (owner: 10GergesShamon) [13:10:08] meh, I can’t find out when scap backport was announced [13:10:21] because searching for it in my emails yields a bunch of “Finished scap: Backport for…” results ^^ [13:10:27] :D [13:10:37] well the important part is that it's there now :D [13:11:15] huh, apparently at least a year https://phabricator.wikimedia.org/T279322 [13:11:17] how time flies ^^ [13:11:33] wat :D [13:12:31] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] beta: eventbus: enable instrumentation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [13:13:03] Lucas_WMDE I can ssh to deployment, but I'm not root [13:13:17] you don’t need root, I don’t have it either [13:13:27] AFAIK if you can SSH to it you should also have enough privileges to deploy [13:13:32] but let me peek at puppet again [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:00] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1052325|Add $wgMaxShellWallClockTime setting for shellbox (T356241)]] (duration: 08m 28s) [13:14:02] yeah, the deployment group includes *platform_engineering_members which includes gmodena [13:14:03] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:14:04] Lucas_WMDE I mean for tmux - but I saw I can follow along over logstash [13:14:10] okay, I see [13:14:15] yeah neither of us will be able to see each other’s tmux I think [13:14:29] kamila_: should be deployed now :) [13:14:40] thanks Lucas_WMDE <3 [13:15:03] gmodena: do you want to try the deployment in a tmux and I’ll follow along on logstash? [13:15:03] let's see if it fixed things or broke them some more :D [13:15:17] and if something goes wrong you can try to copy+paste the output or give me access to the tmux session somehow [13:15:45] Lucas_WMDE sure, let me bring up some deployment doc [13:16:09] Lucas_WMDE I could also share a screen over meet if that works for you [13:16:10] hashar: thanks for that logstash link btw, looks useful! 
[13:16:19] :) [13:16:22] (I now see it’s also one of the linked dashboards on the front page, so I won’t have to bookmark the url ^^) [13:16:30] gmodena: sure [13:16:46] !log dummy authdns-update run [13:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:13] (03PS3) 10Gmodena: beta: eventbus: enable instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) [13:19:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P66056 and previous config saved to /var/cache/conftool/dbconfig/20240709-131921-marostegui.json [13:19:30] (03PS4) 10Gmodena: beta: eventbus: enable instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) [13:19:39] (03CR) 10Gmodena: beta: eventbus: enable instrumentation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [13:19:51] (03CR) 10Lucas Werkmeister (WMDE): beta: eventbus: enable instrumentation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [13:20:38] gmodena: you can check that the config change works as expected in the diffConfig CI output https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/1224/console [13:20:43] (in this case it’s pretty simple ^^) [13:21:12] and if it looks okay to you, run `scap backport https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1051709` [13:21:35] Lucas_WMDE if you want to join https://meet.google.com/gao-vrfz-ffd