[00:06:23] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071978 (owner: TrainBranchBot)
[00:07:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:47] (CR) Stang: "Where's "oathauth-enable"?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[00:09:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:14:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:17:03] (PS1) Scott French: sre.switchdc.mediawiki: skip check_core_masters_in_sync in live-test [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649)
[00:34:12] SRE, MediaWiki-libs-BagOStuff, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-production-error: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786#10136181 (Krinkle)
[00:41:16] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:41:38]
PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:44:56] (CR) Hamish: "Arbcom is one of $wmgPrivilegedGroups and hence default true for oathauth-enable." [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[00:45:56] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:47:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68867 and previous config saved to /var/cache/conftool/dbconfig/20240911-004743-ladsgroup.json
[00:47:47] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:48:12] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:50:58] (CR) Stang: [C:+1] "thanks for clarification" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[00:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki -
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:00:51] (PS3) Hamish: Add arbcom group to zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455)
[01:02:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P68868 and previous config saved to /var/cache/conftool/dbconfig/20240911-010250-ladsgroup.json
[01:04:21] (CR) Stang: [C:+1] Add arbcom group to zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[01:07:25] FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:17:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P68869 and previous config saved to /var/cache/conftool/dbconfig/20240911-011758-ladsgroup.json
[01:33:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68870 and previous config saved to /var/cache/conftool/dbconfig/20240911-013305-ladsgroup.json
[01:33:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[01:33:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[01:33:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[01:33:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68871 and
previous config saved to /var/cache/conftool/dbconfig/20240911-013327-ladsgroup.json
[01:57:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:02:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:12:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[02:17:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[02:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:30:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68872 and previous config saved to /var/cache/conftool/dbconfig/20240911-023058-ladsgroup.json
[02:31:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:36:14] FIRING: JobUnavailable: Reduced availability for job
sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:44:41] (CR) Andrea Denisse: "Hi Cole, no, they're meant to be two different commits as one is for failing over from alert1001 to alert2002 (also setting up alert2002 a" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[02:45:04] (CR) Andrea Denisse: "Done" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[02:46:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P68873 and previous config saved to /var/cache/conftool/dbconfig/20240911-024605-ladsgroup.json
[02:46:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:47:08] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp1110 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:48:08] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp1110 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter
--no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:48:50] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:55:52] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P68874 and previous config saved to /var/cache/conftool/dbconfig/20240911-030112-ladsgroup.json
[03:01:52] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:02:25] FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:10:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:14:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL -
AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:16:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68875 and previous config saved to /var/cache/conftool/dbconfig/20240911-031621-ladsgroup.json
[03:16:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[03:16:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:16:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[03:16:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T371742)', diff saved to https://phabricator.wikimedia.org/P68876 and previous config saved to /var/cache/conftool/dbconfig/20240911-031643-ladsgroup.json
[03:17:00] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:35:08] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:36:26] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:41:10] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:41:28] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 70.42 ms
[03:51:14] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:19:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T371742)', diff saved to
https://phabricator.wikimedia.org/P68877 and previous config saved to /var/cache/conftool/dbconfig/20240911-041922-ladsgroup.json
[04:19:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[04:34:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P68878 and previous config saved to /var/cache/conftool/dbconfig/20240911-043429-ladsgroup.json
[04:49:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P68879 and previous config saved to /var/cache/conftool/dbconfig/20240911-044936-ladsgroup.json
[04:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:04:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T371742)', diff saved to https://phabricator.wikimedia.org/P68880 and previous config saved to /var/cache/conftool/dbconfig/20240911-050444-ladsgroup.json
[05:04:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[05:04:48] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis -
https://phabricator.wikimedia.org/T371742
[05:04:48] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[05:05:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T371742)', diff saved to https://phabricator.wikimedia.org/P68881 and previous config saved to /var/cache/conftool/dbconfig/20240911-050506-ladsgroup.json
[05:09:40] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:22:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:27:38] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1072001
[05:51:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki
centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:56:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T371742)', diff saved to https://phabricator.wikimedia.org/P68882 and previous config saved to /var/cache/conftool/dbconfig/20240911-060444-ladsgroup.json
[06:04:48] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[06:08:50] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:19:52] !log ladsgroup@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P68883 and previous config saved to /var/cache/conftool/dbconfig/20240911-061951-ladsgroup.json
[06:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:34:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P68884 and previous config saved to /var/cache/conftool/dbconfig/20240911-063458-ladsgroup.json
[06:36:03] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10136388 (ABran-WMF) Open→Resolved a:ABran-WMF thanks @Jhancock.wm for the follow up, will let you know if there is any issue
[06:36:16] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10136392 (ABran-WMF) p:Triage→Medium a:ABran-WMF→None
[06:40:20] (CR) Slyngshede: [C:+2] Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112 (owner: Slyngshede)
[06:41:29] (CR) Muehlenhoff: [C:+2] Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[06:42:48] (Merged) jenkins-bot: Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112 (owner: Slyngshede)
[06:50:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T371742)', diff saved to https://phabricator.wikimedia.org/P68885 and previous config saved to /var/cache/conftool/dbconfig/20240911-065005-ladsgroup.json
[06:50:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with
reason: Maintenance
[06:50:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[06:50:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T371742)', diff saved to https://phabricator.wikimedia.org/P68886 and previous config saved to /var/cache/conftool/dbconfig/20240911-065026-ladsgroup.json
[06:51:17] (PS1) Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1072105 (https://phabricator.wikimedia.org/T374512)
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T0700).
[07:00:05] sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:25] hello
[07:02:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 10%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68887 and previous config saved to /var/cache/conftool/dbconfig/20240911-070254-arnaudb.json
[07:06:15] (PS1) Slyngshede: P:idp_test: Enable permission requests on testing. [puppet] - https://gerrit.wikimedia.org/r/1072107
[07:07:22] (CR) Slyngshede: "NDA might not be the best group for testing, I'm open to other suggestions."
[puppet] - https://gerrit.wikimedia.org/r/1072107 (owner: Slyngshede)
[07:09:12] (PS1) Muehlenhoff: Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355)
[07:11:01] If no deployer is around, I can self-deploy
[07:11:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:40] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T374512
[07:11:55] T374512: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T374512
[07:12:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2140 with weight 0 T374512', diff saved to https://phabricator.wikimedia.org/P68888 and previous config saved to /var/cache/conftool/dbconfig/20240911-071205-arnaudb.json
[07:12:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T374512
[07:13:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2140 from API/vslow/dump T374512', diff saved to https://phabricator.wikimedia.org/P68889 and previous config saved to /var/cache/conftool/dbconfig/20240911-071335-arnaudb.json
[07:14:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:14:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 25%: post db2137 → db2237
repool', diff saved to https://phabricator.wikimedia.org/P68890 and previous config saved to /var/cache/conftool/dbconfig/20240911-071802-arnaudb.json
[07:18:27] (CR) TrainBranchBot: [C:+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) (owner: Sergio Gimeno)
[07:18:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:32] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:10] (CR) Arnaudb: [C:+2] mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1072105 (https://phabricator.wikimedia.org/T374512) (owner: Gerrit maintenance bot)
[07:19:14] (Merged) jenkins-bot: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) (owner: Sergio Gimeno)
[07:19:48] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1062416|EventStreamConfig and stream registration for homepage modules analytics (T370907)]]
[07:19:51] T370907: Metrics Platform Integration: Agree on a stream name convention - https://phabricator.wikimedia.org/T370907
[07:21:35] !log Starting s4 codfw failover from db2179 to db2140 - T374512
[07:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:38] T374512: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T374512
[07:22:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2140 to s4 primary T374512', diff saved to https://phabricator.wikimedia.org/P68891
and previous config saved to /var/cache/conftool/dbconfig/20240911-072210-arnaudb.json
[07:24:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[07:24:40] !log sgimeno@deploy1003 sgimeno: Backport for [[gerrit:1062416|EventStreamConfig and stream registration for homepage modules analytics (T370907)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:24:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 T374512', diff saved to https://phabricator.wikimedia.org/P68892 and previous config saved to /var/cache/conftool/dbconfig/20240911-072458-arnaudb.json
[07:26:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 T374512', diff saved to https://phabricator.wikimedia.org/P68893 and previous config saved to /var/cache/conftool/dbconfig/20240911-072612-arnaudb.json
[07:27:22] (CR) Muehlenhoff: "Let's use cn=idptest-users. This was a group we once created for an external pen test of CAS and basically only grants access to the puppe" [puppet] - https://gerrit.wikimedia.org/r/1072107 (owner: Slyngshede)
[07:27:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:24] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:51] (PS2) Muehlenhoff: Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355)
[07:29:08] !log sgimeno@deploy1003 sgimeno: Continuing with sync
[07:29:29] (PS2) Slyngshede: P:idp_test: Enable permission requests on testing.
[puppet] - https://gerrit.wikimedia.org/r/1072107
[07:29:33] ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10136469 (ABran-WMF) T374512 done. all remaining hosts are either non prod critical or depoolable
[07:29:59] (CR) Volans: sre.switchdc.mediawiki: skip check_core_masters_in_sync in live-test (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: Scott French)
[07:30:08] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[07:33:03] PROBLEM - MariaDB Replica SQL: s3 #page on db1166 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: kmwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:33:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 50%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68894 and previous config saved to /var/cache/conftool/dbconfig/20240911-073307-arnaudb.json
[07:33:10] checking
[07:33:14] !incidents
[07:33:15] 5157 (UNACKED) db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[07:33:15] 5156 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[07:33:15] 5155 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org)
[07:33:15] 5152 (RESOLVED) NELHigh sre (thanos-rule tcp.address_unreachable)
[07:33:16] 5151 (RESOLVED) ProbeDown sre (10.2.2.25 ip4 prometheus-https:443 probes/service http_prometheus-https_ip4 eqiad)
[07:33:18] !ack 5157
[07:33:18] 5157 (ACKED) db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[07:33:21] thanks vgutierrez
[07:33:28] (CR) Filippo Giunchedi: [C:+2] icinga: Add frlog2002 to monitoring [puppet] - https://gerrit.wikimedia.org/r/1071970 (https://phabricator.wikimedia.org/T372933) (owner: Dwisehaupt)
[07:33:44] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062416|EventStreamConfig and stream registration for homepage modules analytics (T370907)]] (duration: 13m 56s)
[07:33:47] T370907: Metrics Platform Integration: Agree on a stream name convention - https://phabricator.wikimedia.org/T370907
[07:34:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'prod issue kmwiki.pagelinks', diff saved to https://phabricator.wikimedia.org/P68895 and previous config saved to /var/cache/conftool/dbconfig/20240911-073420-arnaudb.json
[07:34:33] host depooled, rebuilding the index
[07:36:23] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:36:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: post fix', diff saved to https://phabricator.wikimedia.org/P68896 and previous config saved to /var/cache/conftool/dbconfig/20240911-073643-arnaudb.json
[07:36:45] host is repooling
[07:36:50] !resolve 5157
[07:36:50] 5157 (ACKED) db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[07:37:03] RECOVERY - MariaDB Replica SQL: s3 #page on db1166 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:38:56] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1071964 (owner: Jasmine)
[07:43:23] (PS1) Muehlenhoff: Remove obsolete geoip templates [puppet] - https://gerrit.wikimedia.org/r/1072137 (https://phabricator.wikimedia.org/T366355)
[07:46:55] (CR) Effie Mouzeli: "It is not related.
We are not setting activeDeadlineSeconds in the spec, but I will update the job module to support it" [deployment-charts] - https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: Kosta Harlan) [07:48:03] (PS1) JMeybohm: kafka-main: Replace kafka-main2003 with kafka-main2008 [puppet] - https://gerrit.wikimedia.org/r/1072138 (https://phabricator.wikimedia.org/T363210) [07:48:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 75%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68897 and previous config saved to /var/cache/conftool/dbconfig/20240911-074813-arnaudb.json [07:49:26] ops-codfw, SRE, DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10136495 (dcaro) [07:49:40] !log evacuating leadership for all partitions assigned to broker id 2003 on kafka-main-codfw - T363210 [07:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:43] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [07:49:53] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:51:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: post fix', diff saved to https://phabricator.wikimedia.org/P68898 and previous config saved to /var/cache/conftool/dbconfig/20240911-075149-arnaudb.json [07:52:49] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[2003,2008].codfw.wmnet with reason: Hardware refresh [07:53:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[2003,2008].codfw.wmnet with reason: Hardware refresh [07:53:10] !log ladsgroup@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db1235 (T371742)', diff saved to https://phabricator.wikimedia.org/P68899 and previous config saved to /var/cache/conftool/dbconfig/20240911-075310-ladsgroup.json [07:53:14] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:59:32] (PS1) Jelto: gitlab: rotate logfiles by date and size also in production [puppet] - https://gerrit.wikimedia.org/r/1072140 (https://phabricator.wikimedia.org/T374448) [08:01:10] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3949/co" [puppet] - https://gerrit.wikimedia.org/r/1072140 (https://phabricator.wikimedia.org/T374448) (owner: Jelto) [08:01:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:03:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 100%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68903 and previous config saved to /var/cache/conftool/dbconfig/20240911-080319-arnaudb.json [08:06:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: post fix', diff saved to https://phabricator.wikimedia.org/P68904 and previous config saved to /var/cache/conftool/dbconfig/20240911-080654-arnaudb.json [08:08:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P68905 and previous config saved to /var/cache/conftool/dbconfig/20240911-080817-ladsgroup.json [08:18:04] jouncebot: next [08:18:04] In 1 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1000) [08:19:48] (CR) Elukey: [C:+1] Puppet
frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [08:20:16] (CR) Elukey: [C:+1] Remove obsolete geoip templates [puppet] - https://gerrit.wikimedia.org/r/1072137 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [08:22:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: post fix', diff saved to https://phabricator.wikimedia.org/P68906 and previous config saved to /var/cache/conftool/dbconfig/20240911-082200-arnaudb.json [08:22:31] (PS1) Muehlenhoff: config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) [08:23:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P68907 and previous config saved to /var/cache/conftool/dbconfig/20240911-082324-ladsgroup.json [08:23:30] (PS2) Muehlenhoff: config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) [08:25:41] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm [08:25:50] SRE, serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10136524 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm [08:26:07] (CR) CI reject: [V:-1] config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [08:26:47] (CR) Elukey: [C:+1] config_master: Explicitly configure the
server from which Puppet changes are merged [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [08:27:28] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 201, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:27:58] (PS1) Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 [08:28:56] (PS2) Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 [08:29:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [08:29:42] (PS3) Muehlenhoff: config_master: Explicitly configure the server for Puppet merges [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) [08:35:20] (PS3) Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 [08:35:43] FIRING: JobUnavailable: Reduced availability for job dragonfly_supernode in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:35:58] (CR) Elukey: "My bad thanks for the patch! Added an alternative proposal, lemme know!"
[cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [08:36:53] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dragonfly-supernode1001.eqiad.wmnet with reason: host reimage [08:38:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T371742)', diff saved to https://phabricator.wikimedia.org/P68908 and previous config saved to /var/cache/conftool/dbconfig/20240911-083831-ladsgroup.json [08:38:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [08:38:37] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:38:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [08:38:52] (CR) Elukey: "Is it something that triggered a problem?
Because the weekly rebuild is healthy to pick up security upgrades for the OS, without it we'll " [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 (owner: Ilias Sarantopoulos) [08:38:54] (CR) Clément Goubert: sre.hosts.provision: Fix --no-users (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [08:39:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dragonfly-supernode1001.eqiad.wmnet with reason: host reimage [08:40:44] (CR) Clément Goubert: [C:+2] httpbb: Move wikifunctions to its own test suite [puppet] - https://gerrit.wikimedia.org/r/1071919 (https://phabricator.wikimedia.org/T374442) (owner: Clément Goubert) [08:41:14] (CR) Alexandros Kosiaris: [C:+1] mw-debug: add initial "next" release [deployment-charts] - https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) (owner: Scott French) [08:42:21] (CR) Elukey: sre.hosts.provision: Fix --no-users (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [08:42:22] (PS1) David Caro: cloudceph: add coludcephmon1006 to the pool [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) [08:45:18] (CR) Clément Goubert: sre.hosts.provision: Fix --no-users (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [08:45:31] (CR) David Caro: cloudceph: add coludcephmon1006 to the pool (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: David Caro) [08:46:33] (PS2) David Caro: cloudceph: add coludcephmon1006 to the pool [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005)
[08:46:52] (CR) Elukey: [C:+2] aux-services: update Docker images for Jaeger [deployment-charts] - https://gerrit.wikimedia.org/r/1071872 (https://phabricator.wikimedia.org/T369491) (owner: Elukey) [08:48:37] (PS1) Muehlenhoff: pontoon: Remove Puppet 5 specific settings no longer relevant [puppet] - https://gerrit.wikimedia.org/r/1072147 (https://phabricator.wikimedia.org/T366355) [08:49:22] (CR) JMeybohm: [C:+1] ipoid: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1071843 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [08:49:52] (CR) JMeybohm: [C:-1] "Cool, thanks" [deployment-charts] - https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: Kosta Harlan) [08:49:54] (CR) Klausman: "I concur with Luca that unless this causes a problem, we should keep doing weeklies. Since we (SRE) are working finding a way to expire un" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 (owner: Ilias Sarantopoulos) [08:50:43] RESOLVED: JobUnavailable: Reduced availability for job dragonfly_supernode in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:53:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm [08:53:31] SRE, serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10136579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
elukey@cumin1002 for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm completed: - dragonfly-supernode1001 (**PASS**... [08:53:47] (CR) David Caro: [C:-1] cloudceph: add coludcephmon1006 to the pool (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: David Caro) [08:54:21] (PS3) David Caro: cloudceph: add coludcephmon1006 to the pool [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) [08:54:26] (CR) David Caro: cloudceph: add coludcephmon1006 to the pool (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: David Caro) [08:55:47] (CR) Stevemunene: [C:+1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) (owner: Brouberol) [08:58:44] (CR) Brouberol: [C:+2] global_config: add the s3-eqiad-dpe external service [puppet] - https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) (owner: Brouberol) [08:58:49] (CR) Jelto: [V:+1 C:+2] gitlab: rotate logfiles by date and size also in production [puppet] - https://gerrit.wikimedia.org/r/1072140 (https://phabricator.wikimedia.org/T374448) (owner: Jelto) [09:00:12] (CR) Muehlenhoff: [C:+2] Remove obsolete geoip templates [puppet] - https://gerrit.wikimedia.org/r/1072137 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [09:00:48] brouberol, jelto: I'll merge your changes along, ok? [09:00:55] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM."
[puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: David Caro) [09:01:26] yes please go ahead 2a546bf3e8 :) brouberol was about to merge this also [09:01:49] (PS1) Effie Mouzeli: cronjobs: add support for activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 [09:02:07] (CR) Btullis: [C:+1] "Looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: Brouberol) [09:02:24] (PS1) Elukey: spark: force a rebuild to pick up OS package upgrades [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874) [09:02:35] (CR) Ilias Sarantopoulos: "No issue occurred, I had been thinking about the size and then I bumped into this special case for Spark images." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 (owner: Ilias Sarantopoulos) [09:02:46] (CR) CI reject: [V:-1] cronjobs: add support for activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 (owner: Effie Mouzeli) [09:02:50] (Abandoned) Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072143 (owner: Ilias Sarantopoulos) [09:02:55] ack, now all merged [09:02:59] thanks [09:03:41] (CR) Elukey: "Weekly rebuild are not happening due to https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/976663, so a manual " [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874) (owner: Elukey) [09:04:01] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [09:05:47] (PS2) Clément Goubert: sre.hosts.provision: Fix --no-users [cookbooks] -
https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) [09:05:53] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1071817 (https://phabricator.wikimedia.org/T374392) (owner: Cathal Mooney) [09:10:46] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [09:11:11] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [09:11:27] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:12:03] !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:12:59] (CR) Brouberol: [C:+2] airflow: store the connections.yaml content in a secret [deployment-charts] - https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: Brouberol) [09:14:57] (CR) David Caro: [C:+2] cloudceph: add coludcephmon1006 to the pool [puppet] - https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: David Caro) [09:18:56] (CR) Clément Goubert: sre.hosts.provision: Fix --no-users (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [09:20:26] (CR) Muehlenhoff: [C:+2] config_master: Explicitly configure the server for Puppet merges [puppet] - https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [09:21:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:22:08] (CR) Elukey: [C:+1] "Thanks!"
[cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [09:22:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:22:31] (CR) Clément Goubert: [C:+2] sre.hosts.provision: Fix --no-users [cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [09:25:30] (PS2) Effie Mouzeli: cronjobs: add support for activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 [09:26:25] (CR) CI reject: [V:-1] cronjobs: add support for activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 (owner: Effie Mouzeli) [09:27:26] (PS3) Effie Mouzeli: cronjobs: add support for activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 [09:30:05] (PS4) Effie Mouzeli: app.job: update to job 2.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 [09:30:19] (CR) Clément Goubert: [C:+1] mediawiki: parameterize PHP version via chart value [deployment-charts] - https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604) (owner: Scott French) [09:30:20] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [09:30:39] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [09:31:32] (PS3) Muehlenhoff: Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) [09:32:34] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [09:33:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:33:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:33:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:33:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [09:34:55] (Merged) jenkins-bot: sre.hosts.provision: Fix --no-users [cookbooks] - https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: Clément Goubert) [09:36:17] (CR) Btullis: [C:+1] airflow: enable s3 logging [deployment-charts] - https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: Brouberol) [09:37:02] (PS5) Brouberol: airflow: enable s3 logging [deployment-charts] - https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) [09:38:38] (CR) Brouberol: [C:+2] airflow: enable s3 logging [deployment-charts] - https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: Brouberol) [09:39:08] (PS1) Elukey: jaeger: swap securityContext with podSecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1072156 (https://phabricator.wikimedia.org/T369491) [09:39:49] (CR) CI reject: [V:-1] jaeger: swap securityContext with podSecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1072156 (https://phabricator.wikimedia.org/T369491) (owner: Elukey) [09:40:22] (CR) Filippo Giunchedi: "Looks like this change can be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071701 ?"
[puppet] - https://gerrit.wikimedia.org/r/1064828 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse) [09:41:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:42:15] !log depooling cp4037 to test haproxykafka (T374473) [09:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:18] T374473: Prepare puppet configuration to send haproxy logs to haproxykafka socket - https://phabricator.wikimedia.org/T374473 [09:42:41] (Abandoned) Elukey: jaeger: swap securityContext with podSecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1072156 (https://phabricator.wikimedia.org/T369491) (owner: Elukey) [09:42:43] (CR) Fabfur: [C:+2] cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: Fabfur) [09:42:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:42:56] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [09:43:00] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1072107 (owner: Slyngshede) [09:46:53] (PS1) Elukey: jaeger: set securityContext for the oauth sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/1072157 (https://phabricator.wikimedia.org/T369491) [09:52:56] (PS1) Fabfur: Fixed the haproxykafka uds path to reflect test configuration [puppet] - https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) [09:54:50] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM %request for poolcounter - https://phabricator.wikimedia.org/T374520 (elukey) NEW [09:55:32] (CR) CI reject: [V:-1] Fixed the haproxykafka uds path to reflect test configuration [puppet] - https://gerrit.wikimedia.org/r/1072158
(https://phabricator.wikimedia.org/T370668) (owner: Fabfur) [09:56:11] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM %request for poolcounter - https://phabricator.wikimedia.org/T374520#10136739 (elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MFree avg | DFree | DFree avg | +---... [09:56:19] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136740 (elukey) [09:56:52] (PS2) Fabfur: Fixed the haproxykafka uds path to reflect test configuration [puppet] - https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) [09:59:49] (PS3) Fabfur: cache:haproxykafka: fixed the haproxykafka uds path [puppet] - https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1000) [10:00:34] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136771 (elukey) @MoritzMuehlenhoff I'd proceed with the creation of `poolcounter2005` in row A if you are ok, using `sre.ganeti.makevm`.
[10:00:56] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136772 (elukey) [10:01:42] (PS4) Fabfur: cache:haproxy: fixed the haproxykafka uds path [puppet] - https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) [10:02:01] (CR) Vgutierrez: [C:+1] cache:haproxy: fixed the haproxykafka uds path [puppet] - https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) (owner: Fabfur) [10:05:41] (PS1) Dreamy Jazz: Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) [10:06:23] jouncebot: nowandnext [10:06:23] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1000) [10:06:23] In 0 hour(s) and 53 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1100) [10:07:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: Dreamy Jazz) [10:08:29] (CR) Fabfur: [C:+2] cache:haproxy: fixed the haproxykafka uds path [puppet] - https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) (owner: Fabfur) [10:14:03] (CR) CI reject: [V:-1] Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: Dreamy Jazz) [10:14:48] (PS1) Elukey: profile::docker::reporter: add gitlab images to k8s_rules.ini
[puppet] - https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) [10:15:20] (CR) Dreamy Jazz: "recheck" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: Dreamy Jazz) [10:19:06] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [10:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:21:09] (CR) Filippo Giunchedi: [C:+1] pontoon: Remove Puppet 5 specific settings no longer relevant [puppet] - https://gerrit.wikimedia.org/r/1072147 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [10:22:51] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [10:26:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [10:27:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [10:30:25] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136808 (MoritzMuehlenhoff) +1 [10:31:15] (CR) Muehlenhoff: [C:+2] pontoon: Remove Puppet 5 specific settings no longer relevant [puppet] - https://gerrit.wikimedia.org/r/1072147 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff) [10:33:14] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523 (cmooney) NEW p:Triage→Medium [10:35:35] (PS5) Effie Mouzeli: app.job:
update to job 2.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1072149 [10:37:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish) [10:38:49] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [10:38:57] (CR) Cathal Mooney: [C:+2] Trust DSCP markings from VMs on routed ganeti hypervisors [puppet] - https://gerrit.wikimedia.org/r/1071817 (https://phabricator.wikimedia.org/T374392) (owner: Cathal Mooney) [10:40:27] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [10:46:31] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392#10136850 (cmooney) Open→Resolved Patch merged, working as expected: ` cmooney@ganeti2033:~$ cat /etc/nftables/postrouting/05_trust-vm-ds... [10:47:30] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10136856 (ABran-WMF) I'll get to T374425 to get to T374421 and unblock this T374523 [10:48:18] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10136865 (cmooney) >>! In T374523#10136856, @ABran-WMF wrote: > I'll get to T374425 to get to T374421 and unblock this T374523 Thanks!
[10:48:31] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10136866 (cmooney) [10:48:32] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10136867 (cmooney) [10:49:05] (PS1) Ladsgroup: wmnet: Add pc5-master [dns] - https://gerrit.wikimedia.org/r/1072167 (https://phabricator.wikimedia.org/T374496) [10:50:21] !log repooling cp4037 to test haproxykafka (T374473) [10:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:24] T374473: Prepare puppet configuration to send haproxy logs to haproxykafka socket - https://phabricator.wikimedia.org/T374473 [10:50:28] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [10:50:48] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2205.codfw.wmnet [10:51:11] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [10:53:03] (CR) Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1069344 (owner: Bartosz Dziewoński) [10:55:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2205.codfw.wmnet [10:55:58] PROBLEM - MariaDB Replica Lag: s3 on db2205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 79746.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:05] (normal) [10:56:33] (PS1) Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - https://gerrit.wikimedia.org/r/1072168 [10:57:05] (CR) CI reject: [V:-1] dbtools: Add prep-dc-switchover.py [software] - https://gerrit.wikimedia.org/r/1072168 (owner: Ladsgroup) [10:59:21] !log fabfur@cumin1002 conftool action : set/pooled=yes;
selector: name=cp4037.ulsfo.wmnet [11:00:04] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1100). [11:01:13] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@19cd97a]: (no justification provided) [11:01:45] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@19cd97a]: (no justification provided) (duration: 00m 32s) [11:02:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:02:52] (03PS2) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 [11:03:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:03:19] (03CR) 10CI reject: [V:04-1] dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup) [11:04:39] (03PS3) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 [11:05:02] (03CR) 10CI reject: [V:04-1] dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup) [11:11:05] (03CR) 10Alexandros Kosiaris: [C:03+1] "I am mildly worried the regex might be a bit too broad, but that's mostly a worry I can't justify/quantify right now. 
Let's cross the brid" [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) (owner: 10Elukey) [11:14:57] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2227.codfw.wmnet onto db2205.codfw.wmnet [11:15:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:15:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:15:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T371742)', diff saved to https://phabricator.wikimedia.org/P68909 and previous config saved to /var/cache/conftool/dbconfig/20240911-111549-ladsgroup.json [11:15:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:18:15] (03CR) 10Jcrespo: [C:03+2] mariadb: productionize db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [11:18:15] (03CR) 10Ssingh: [C:03+1] wmnet: Add pc5-master [dns] - 10https://gerrit.wikimedia.org/r/1072167 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup) [11:18:32] (03CR) 10Jcrespo: [C:03+1] "Sanity check, looks fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [11:21:14] (03PS9) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [11:21:47] (03PS10) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [11:22:03] (03PS1) 10Muehlenhoff: Add an explicit Hiera variable to determine the active swift ring server [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) [11:22:05] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [11:23:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:23:12] (03PS4) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 [11:23:50] <_joe_> !log uploaded conftool 3.2.3 to apt [11:23:51] (03PS1) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) [11:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:53] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [11:25:05] (03PS2) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) [11:25:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dragonfly::supernode 
[11:25:55] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [11:26:58] (03PS1) 10Muehlenhoff: Switch dragonfly-supernode to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072173 (https://phabricator.wikimedia.org/T349619) [11:27:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:18] (03PS1) 10Hnowlan: php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) [11:30:46] (03PS1) 10Dreamy Jazz: IPInfoLogFormatter: Avoid unnecessary User object creation [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) [11:30:52] (03PS11) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [11:31:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz) [11:31:23] (03CR) 10Muehlenhoff: [C:03+2] Switch dragonfly-supernode to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072173 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:34:01] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [11:34:08] (03CR) 10Ladsgroup: 
[C:03+2] wmnet: Add pc5-master [dns] - 10https://gerrit.wikimedia.org/r/1072167 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup) [11:34:17] (03CR) 10JMeybohm: [C:03+2] kafka-main: Replace kafka-main2003 with kafka-main2008 [puppet] - 10https://gerrit.wikimedia.org/r/1072138 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [11:35:11] (03CR) 10Jcrespo: "❤️. Giving it a look." [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup) [11:35:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [11:37:47] (03PS1) 10Hamish: Remove redundant oathauth-enable flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) [11:37:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dragonfly::supernode [11:38:52] (03PS12) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [11:41:29] (03PS1) 10Hamish: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) [11:42:19] (03CR) 10Vgutierrez: [C:03+1] hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [11:43:00] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10137027 (10MoritzMuehlenhoff) [11:43:13] (03CR) 10JMeybohm: "Would you mind rebasing this on top of a verbatim copy of job 2.0.0 modules to make the actual diff visible?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli) [11:43:38] (03PS3) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) [11:44:03] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [11:45:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, if some repos tend to be too noisy, we can still add excludes" [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) (owner: 10Elukey) [11:45:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish) [11:46:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish) [11:48:37] (03CR) 10Stang: [C:03+1] Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish) [11:49:37] (03PS1) 10Muehlenhoff: Install poolcounter2005 with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072179 (https://phabricator.wikimedia.org/T332015) [11:54:38] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 481, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:57] (03CR) 10Bartosz Dziewoński: logging: Simplify extra debug logging configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 (owner: 
10Bartosz Dziewoński) [11:55:45] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr1-eqiad with reason: reconfigure equinix port into LAG [11:55:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr1-eqiad with reason: reconfigure equinix port into LAG [11:57:23] (03PS1) 10JMeybohm: kafka-main: Fix regex for kafka-main in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072182 (https://phabricator.wikimedia.org/T363210) [11:58:10] (03CR) 10JMeybohm: [C:03+2] kafka-main: Fix regex for kafka-main in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072182 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [11:58:47] (03PS3) 10Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 [11:58:52] matmarex and I are changing MediaWiki logging config https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1200 [11:59:07] !incidents [11:59:08] 5157 (RESOLVED) db1166 (paged)/MariaDB Replica SQL: s3 (paged) [11:59:08] 5156 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [11:59:08] 5155 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [11:59:08] 5152 (RESOLVED) NELHigh sre (thanos-rule tcp.address_unreachable) [11:59:22] hu? [11:59:43] vgutierrez: did you get a page for cr2-magru as well? [11:59:47] yeah this was the old page [11:59:52] nope [12:00:01] hi hashar :) [12:00:05] hashar and MatmaRex: Deploy window MediaWiki logging configuration tweaks (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1200) [12:00:07] sukhe: from when was that one? 
[12:00:12] which never resolved for some reason: T374401 [12:00:13] T374401: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401 [12:00:14] oh yeah [12:00:14] unsure what that page was - cr2-magru - router is up and online anyway, quick health check looks ok plus bgp stable for weeks etc [12:00:15] got here [12:00:35] ah, okay [12:00:48] I am going to mark this as resolved and we can carry on discussing in the task why victorops didn't do so [12:00:51] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [12:00:51] any objections? [12:00:59] oh great, topranks did it [12:01:00] I sort of shrugged off the previous time it happened, we'll need to take a closer look though [12:01:02] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 11s) [12:01:04] sukhe: yeah I already did [12:01:10] topranks: thanks [12:01:12] MatmaRex: I am checking your earlier comment :) [12:01:12] cool, thanks [12:01:15] just to stop any panic in its tracks [12:01:20] yep [12:02:43] (03PS1) 10Brouberol: airflow: fix the s3 logging integration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072183 (https://phabricator.wikimedia.org/T372787) [12:03:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [12:03:17] sukhe: oh sorry just realising this was the old page re-triggering ? [12:03:23] (reading scrollback) [12:05:10] (03CR) 10Hashar: logging: Replace 'blackhole' handler with no handlers at all (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński) [12:05:38] MatmaRex: so essentially +1 on removing that blackhole [12:05:45] I guess yesterday I wanted to double check [12:05:57] (03CR) 10Btullis: [C:03+1] "Awesome."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072183 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [12:06:18] (03PS5) 10Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 [12:06:18] (03PS4) 10Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 [12:06:19] (03PS4) 10Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 [12:06:30] that is me rebasing the whole series [12:06:37] (03CR) 10Brouberol: [C:03+2] airflow: fix the s3 logging integration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072183 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol) [12:06:41] topranks: yep! same old page [12:06:47] not a new one [12:06:59] ah ok [12:07:04] and I guess we can do the first one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069716 [12:07:07] hashar: no problem. we should probably test that in production again when actually enabling the default log channel [12:07:09] so the only question is why it didn't resolve [12:07:11] probably doesn't deserve too much more detective work on the cr / alerting side then [12:07:15] yeah [12:07:18] olly is looking into that [12:07:21] hashar: whenever you're ready [12:07:21] topranks: yeah [12:07:26] lets do [12:07:26] ok [12:07:27] cool [12:08:04] !log test bundling xe-3/0/6 into ae6 on cr1-eqiad T370696 [12:08:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 (owner: 10Bartosz Dziewoński) [12:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:34] hashar: these changes don't really have anything obvious to test on mwdebug btw. they all should have no effect. 
i think we can just look at logstash afterwards and verify that the volume of logs didn't change [12:08:48] yeah that was my idea [12:08:48] (03Merged) 10jenkins-bot: logging: Fix local variables leaking into global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 (owner: 10Bartosz Dziewoński) [12:08:49] :) [12:09:10] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1069716|logging: Fix local variables leaking into global scope]] [12:09:19] I think at some point I wanted to craft a CI job that would generate the logging configuration diff [12:09:40] well the whole diff actually [12:09:44] but that is not easily doable [12:09:58] maybe that patch makes it easier now [12:10:29] https://grafana.wikimedia.org/d/000000102/production-logging might be the best place to watch for breakage/log vanishing [12:10:56] hmm, maybe it could be included in the diffConfig jobs somehow? i don't know how that works [12:11:14] I think that one iterates over each db [12:11:16] but it looks like it just makes some JSON files and diffs them.
the logging config should be JSON-serializable too, so the same approach should work [12:11:17] but yeah possibly [12:11:20] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1069716|logging: Fix local variables leaking into global scope]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:12:26] pretty dashboard [12:12:35] !log restoring leadership for partitions assigned to broker id 2003 on kafka-main-codfw - T363210 [12:12:37] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [12:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:38] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [12:13:02] RECOVERY - MariaDB Replica Lag: s3 on db2205 is OK: OK slave_sql_lag Replication lag: 5.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:13:38] (03PS1) 10Hamish: Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) [12:14:02] what happened on 2024-09-05? :o https://phabricator.wikimedia.org/F57499437 [12:14:44] (03PS1) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 [12:15:13] !log hashar@deploy1003 matmarex, hashar: Continuing with sync [12:15:22] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:44] MatmaRex: some new mediawiki code landed / DBAs broke the infra?
:) [12:15:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2227.codfw.wmnet onto db2205.codfw.wmnet [12:16:01] and that graph is relative grr [12:16:27] anyway one can search in logstash [12:16:52] and the log volume logs below and log scale, so you can barely see that we double the number of WARNINGs [12:17:03] well, the wikis did not fall over, so it's not too bad [12:17:10] but i will try to find out what it was [12:17:11] yeah that dashboard is nice but has several usability problems indeed [12:17:22] are log-scale* [12:17:29] (03PS2) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 [12:17:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:18:01] Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}): {query} [12:18:01] Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met (actual: {actualSeconds}): {query} [12:18:19] roughly 720 000 of them per hour :) [12:18:25] !log re-activate Equinix IXP peers on cr1-eqiad T370696 [12:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:55] (03PS1) 10David Caro: typos: add colud to the list [puppet] - 10https://gerrit.wikimedia.org/r/1072188 [12:19:47] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069716|logging: Fix local variables leaking into global scope]] (duration: 10m 38s) [12:20:13] (03CR) 10Btullis: [C:03+1] "Looks good, thanks."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 (owner: 10Brouberol) [12:20:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T371742)', diff saved to https://phabricator.wikimedia.org/P68912 and previous config saved to /var/cache/conftool/dbconfig/20240911-122056-ladsgroup.json [12:21:00] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:21:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:22:04] it looks like we just got some exceptions like this: "JobQueueError: Could not enqueue jobs" [12:22:09] which is hopefully unrelated to the deploy [12:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 22.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:22:49] (03PS3) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 [12:23:01] yeah I think it is fine [12:23:28] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:23:29] started at 12:17 (6 minutes ago) [12:23:35] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:23:42] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:23:43] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:23:58] hmm [12:24:06] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:24:08] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:24:15] (03PS1) 10Jforrester: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072189 (https://phabricator.wikimedia.org/T374241) [12:24:19] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:24:21] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:24:27] (03PS1) 10Jforrester: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072190 (https://phabricator.wikimedia.org/T374241) [12:24:30] and it's still ongoing [12:24:42] what have we broke [12:24:47] also a lot of "The maximum execution time of 60 seconds was exceeded" [12:24:51] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:24:52] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:24:55] it still looks like a coincidence to me [12:25:06] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:25:08] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:25:10] (i'm looking here: https://logstash.wikimedia.org/goto/eea316bb32ebedad09d7e1640283513c) [12:25:24] (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [12:25:40] and it looks like the exceptions stopped happening [12:25:42] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[12:25:43] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:25:45] 🤷‍♂️ [12:25:56] (03PS4) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 [12:26:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:26:18] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:26:19] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:26:21] i don't know, maybe somebody just did some bot things too quickly? [12:26:34] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:26:35] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:26:40] well exception rate has exploded for sure https://grafana-rw.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m&from=now-1h&to=now&viewPanel=19 [12:26:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10137107 (10jcrespo) I will want to stop ms backups at codfw for backup2011 before it happens. No big deal if I don't do it (ju... [12:26:45] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:26:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10137114 (10jcrespo) I will want to stop ms backups at codfw for backup2007 before it happens. No big deal if I don't do it (ju... 
[12:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 23.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:27:34] (03CR) 10Brouberol: [C:03+2] airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 (owner: 10Brouberol) [12:27:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579 [12:27:37] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [12:27:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579 [12:27:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579 [12:27:52] 200 per minute is maybe a minor fire in the kitchen, not an explosion ;) [12:28:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579 [12:28:34] !log installing glibc bugfix updates from bookworm 12.7 point release [12:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:41] seriously [12:28:46] our whole infrastructure is crippled :/ [12:28:54] heh [12:29:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2138 in db2238 for T373579', diff saved to https://phabricator.wikimedia.org/P68913 and previous config saved to /var/cache/conftool/dbconfig/20240911-122910-arnaudb.json [12:29:12] and there are a bunch of messages in 
the `jsonTruncated` channel [12:29:21] which are log messages being too long to be parsed by the logging stack [12:29:25] so they end up mostly ignored [12:29:29] hiding real problems [12:29:29] grr [12:29:51] that is wikifunctions requests timeout, it happened last week already [12:30:21] yeah looks like it's all RequestTimeoutException with a realllllyyy long stack trace [12:30:38] yeah I think we had some talk about it on friday [12:31:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:31:50] (because it times out while validating some really big nested recursive structure, apparently) [12:31:57] yeah [12:32:04] hashar: anyway. i think we can proceed with the next patches, if you're happy with them [12:32:04] so that is more log spam we have to manage [12:32:23] the thousand jobs not enqueuing, I don't think it is related at all [12:32:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:33:02] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2138.codfw.wmnet onto db2238.codfw.wmnet [12:33:32] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:33:51] pff [12:34:00] I will proceed with the next one [12:34:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński) [12:34:26] !log
brouberol@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [12:35:00] sometimes I feel we could use #mediawiki-operations channel to cut from the rest of the wmf operations :] [12:35:07] (03Merged) 10jenkins-bot: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński) [12:35:10] or #wikimedia-mw-infra :D [12:35:24] hashar: btw the new rdbms warnings, i think they're all here: https://logstash.wikimedia.org/goto/1f5d398c8c6f7ceb2ea570bd57d22564 they're "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" with ExternalStoreDB in the stack trace [12:35:27] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1069344|logging: Replace 'blackhole' handler with no handlers at all]] [12:35:47] i wish the cookbook or whatever would not emit 5 log messages for every action. it's really difficult to read in the SAL later too [12:36:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P68914 and previous config saved to /var/cache/conftool/dbconfig/20240911-123603-ladsgroup.json [12:37:33] !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1069344|logging: Replace 'blackhole' handler with no handlers at all]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:37:41] !log hashar@deploy1003 matmarex, hashar: Continuing with sync [12:37:56] FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:38:12] (i'll file a bug for the rdbms warnings, they seem to not be known) [12:38:50] thanks [12:38:57] (03CR) 10Cathal Mooney: [C:03+1] "Good shout LGTM!"
[puppet] - 10https://gerrit.wikimedia.org/r/1071814 (owner: 10Muehlenhoff) [12:39:41] (03PS1) 10Jdrewniak: Enable Web team search suggestions survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) [12:39:53] (03PS2) 10Jdrewniak: Enable Web team search suggestions survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) [12:40:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [12:41:55] (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) [12:42:11] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069344|logging: Replace 'blackhole' handler with no handlers at all]] (duration: 06m 43s) [12:42:36] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:42:50] (03CR) 10Elukey: [C:03+2] Install poolcounter2005 with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072179 (https://phabricator.wikimedia.org/T332015) (owner: 10Muehlenhoff) [12:42:56] FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:44:36] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:44:59] hashar, MatmaRex: Thank you both for working on that! 
[12:47:11] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host poolcounter2005.codfw.wmnet [12:47:12] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [12:47:45] and the last one [12:48:28] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add gitlab images to k8s_rules.ini [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) (owner: 10Elukey) [12:49:45] (03CR) 10CDanis: [C:03+1] "FWIW upstream took a similar patch from me very quickly" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072157 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [12:49:57] (03PS5) 10Slyngshede: PermissionRequest validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 [12:50:18] thanks hashar. logging still looks happy after the last changes [12:50:32] and i filed https://phabricator.wikimedia.org/T374534 about the rdbms WARNING logs [12:50:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 (owner: 10Bartosz Dziewoński) [12:50:53] awesome [12:51:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P68915 and previous config saved to /var/cache/conftool/dbconfig/20240911-125110-ladsgroup.json [12:51:55] (03Merged) 10jenkins-bot: logging: Simplify extra debug logging configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 (owner: 10Bartosz Dziewoński) [12:51:58] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:52:11] (03CR) 10Elukey: "Left a note, lemme know what you think about it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1072171 
(https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [12:52:16] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070685|logging: Simplify extra debug logging configuration]] [12:52:27] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2005.codfw.wmnet - elukey@cumin1002" [12:52:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2005.codfw.wmnet - elukey@cumin1002" [12:52:33] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:33] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache poolcounter2005.codfw.wmnet on all recursors [12:52:36] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) poolcounter2005.codfw.wmnet on all recursors [12:53:03] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2005.codfw.wmnet - elukey@cumin1002" [12:53:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2005.codfw.wmnet - elukey@cumin1002" [12:53:26] (03CR) 10Elukey: [C:03+2] jaeger: set securityContext for the oauth sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072157 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [12:53:40] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:06] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host poolcounter2005.codfw.wmnet with OS bookworm [12:54:21] !log hashar@deploy1003 matmarex, hashar: Backport for 
[[gerrit:1070685|logging: Simplify extra debug logging configuration]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:54:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) (owner: 10Hamish) [12:54:31] (03CR) 10CDanis: [C:03+1] "lgtm!! thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi) [12:54:37] !log hashar@deploy1003 matmarex, hashar: Continuing with sync [12:55:10] (03PS1) 10AikoChou: admin_ng/LiftWing: add revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) [12:55:16] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync [12:55:36] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync [12:55:42] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:55:46] (03PS1) 10Ilias Sarantopoulos: (WIP) amd-pytorch: add vllm for ROCm to pytorch 2.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072194 (https://phabricator.wikimedia.org/T370149) [12:56:20] RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1010 is OK: SSL OK - Certificate kafka-jumbo1010.eqiad.wmnet valid until 2025-08-17 13:15:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [12:57:16] MatmaRex: I have changed the graph of logs by channels to use absolute values instead of relative/percentage them [12:57:24] and sorted them by number of entries (total) 
https://grafana-rw.wikimedia.org/d/000000102/mediawiki-production-logging [12:57:29] https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging [12:57:34] for the read-only link [12:57:54] for the severity, my guess is we would need to repeat the panel for each severity [12:57:58] hashar: nice [12:58:03] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [12:58:06] but when I query the severity values, I get a bunch of non sense values :/ [12:58:11] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [12:58:15] beside DEBUG/ERROR/INFO/NOTICE/WARNING [12:58:19] so hmm I don't know [12:58:20] (03CR) 10Muehlenhoff: Add an explicit Hiera variable to determine the active swift ring server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [12:58:36] oh the levels are graphed independently at the bottom [12:59:03] hashar: i think my next steps, later this week or next week, will be to enable the @default channel on testwiki, and if that doesn't break the world, then enable it everywhere (so basically, your original patch, just updated to fit my other changes) [12:59:10] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070685|logging: Simplify extra debug logging configuration]] (duration: 06m 53s) [12:59:19] jouncebot: nowandnext [12:59:20] For the next 0 hour(s) and 0 minute(s): MediaWiki logging configuration tweaks (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1200) [12:59:20] In 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1300) [12:59:30] heh, perfect timing [12:59:41] :D [12:59:56] so hello and welcome to the backport window [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1300). [13:00:05] JustHannah, Dreamy_Jazz, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] we might $have broken logging :] [13:00:12] \o [13:00:28] hashar: if you have the editor open already, want to also change the per-level charts to not be log-scale? so that spikes are actually visible? [13:00:43] (03CR) 10Dreamy Jazz: [C:03+2] Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz) [13:00:45] (03CR) 10Dreamy Jazz: [C:03+2] IPInfoLogFormatter: Avoid unnecessary User object creation [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz) [13:00:50] (i scheduled some unrelated no-op cleanup patches for the backport window) [13:01:08] MatmaRex: the problem is that debug have a large amount of entries so eg warning spiking would not show up at all [13:01:15] (03CR) 10Elukey: Add an explicit Hiera variable to determine the active swift ring server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [13:01:17] I think we need another graph that highlight the spikes/change of rates [13:01:38] hmm [13:01:41] (03PS2) 10Hokwelum: Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) [13:02:08] hashar: oh, i don't mean the "MW logs by severity" chart, i mean only the "MW logs (INFO)" etc. 
charts below [13:02:15] those that just have one data series on them [13:02:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:29] Can the window proceed then? [13:02:35] If logs are broken [13:02:56] Dreamy_Jazz: yeah, they're not broken :) [13:03:04] :D [13:03:07] Thanks [13:03:09] we're just tweaking a dashboard [13:03:19] https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging [13:03:28] Dreamy_Jazz: I'm here [13:03:28] (03PS1) 10Elukey: admin_ng: enforce PSS for the AUX cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072196 (https://phabricator.wikimedia.org/T369491) [13:03:34] Hello [13:04:02] Dreamy_Jazz: Please proceed [13:04:05] I'm sorry maybe I missed some messages, but why reschedule my deployments? [13:04:39] They should't have been [13:05:10] MatmaRex: Did you deliberately remove the other changes in the window? [13:05:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10137235 (10MoritzMuehlenhoff) [13:06:01] Dreamy_Jazz: aaargh. 
nopr [13:06:14] MatmaRex: ah true, I will change them [13:06:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T371742)', diff saved to https://phabricator.wikimedia.org/P68916 and previous config saved to /var/cache/conftool/dbconfig/20240911-130618-ladsgroup.json [13:06:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [13:06:21] (03CR) 10Dreamy Jazz: [C:03+2] Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) (owner: 10Hokwelum) [13:06:21] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:06:32] (03PS1) 10AikoChou: hiera/deployment-server: create revision-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) [13:06:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [13:06:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T371742)', diff saved to https://phabricator.wikimedia.org/P68917 and previous config saved to /var/cache/conftool/dbconfig/20240911-130639-ladsgroup.json [13:06:55] JustHannah: I can deploy your change. Can you test it? I see that the default is now `true`, but want to make sure it still works as expected. [13:07:00] (03Merged) 10jenkins-bot: Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) (owner: 10Hokwelum) [13:07:19] Dreamy_Jazz: i undid that. sorry, i guess i didn't notice that when editing [13:07:28] Thanks [13:07:40] Dreamy_Jazz: Yes I can test it! 
[13:07:41] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on poolcounter2005.codfw.wmnet with reason: host reimage [13:07:41] i'll schedule my cleanup for some other time. maybe one day the window will not be full [13:07:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:07:46] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:07:46] (03CR) 10Klausman: [C:03+1] admin_ng/LiftWing: add revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [13:07:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:07:57] hashar: thanks. 
and thanks for deploying :D [13:08:00] MatmaRex, thank you:) [13:08:45] (03CR) 10Dreamy Jazz: [C:03+2] Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish) [13:09:28] (03Merged) 10jenkins-bot: Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish) [13:09:30] (03CR) 10Klausman: [C:03+1] hiera/deployment-server: create revision-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [13:09:48] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:10:24] (03Merged) 10jenkins-bot: Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz) [13:10:29] (03CR) 10Dreamy Jazz: [C:03+2] Remove redundant oathauth-enable flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish) [13:10:50] (03CR) 10Klausman: [C:03+2] hiera/deployment-server: create revision-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [13:10:53] (03PS2) 10Hamish: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) [13:11:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on poolcounter2005.codfw.wmnet with reason: host reimage [13:11:57] (03CR) 10CDanis: [C:03+1] admin_ng: enforce PSS for the AUX cluster [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1072196 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:12:04] (03CR) 10Dreamy Jazz: [C:03+2] Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish) [13:12:07] (03CR) 10Elukey: [C:03+2] admin_ng: enforce PSS for the AUX cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072196 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:12:32] Dreamy_Jazz, and thank you lolll [13:12:45] (03Merged) 10jenkins-bot: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish) [13:12:46] Waiting for one change to finish gate-and-submit-wmf [13:12:50] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:12:57] Then will start on the config patches that I've +2'd [13:13:07] (03PS2) 10Hamish: Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) [13:13:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:13:38] sure np [13:14:00] (03CR) 10Dreamy Jazz: [C:03+2] Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) (owner: 10Hamish) [13:14:08] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.279 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:14:26] jan_drewniak: You around for the window? [13:14:39] (03Merged) 10jenkins-bot: Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) (owner: 10Hamish) [13:14:41] (03PS3) 10Jdrewniak: Enable Web team search suggestions survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) [13:14:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:15:01] Dreamy_Jazz: hey! Yes I'm here [13:15:12] I can deploy it. Can you test it? [13:15:13] Bit of a last minute addition [13:15:18] Yes [13:15:33] (03CR) 10Muehlenhoff: [C:03+2] nftables-compat-check: Don't flag dscp_default as needing conversion [puppet] - 10https://gerrit.wikimedia.org/r/1071814 (owner: 10Muehlenhoff) [13:16:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz) [13:16:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish) [13:16:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:16:35] (03CR) 10Klausman: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3952/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) 
(owner: 10AikoChou) [13:16:38] (03PS2) 10Hamish: Remove redundant oathauth-enable flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) [13:16:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz) [13:16:45] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish) [13:16:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:17:14] (03CR) 10Klausman: [C:03+2] admin_ng/LiftWing: add revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [13:17:20] (03Merged) 10jenkins-bot: Enable Web team search suggestions survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [13:17:25] (03Merged) 10jenkins-bot: Remove redundant oathauth-enable flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish) [13:17:42] Should be about 7 or so mins before the process starts - Still waiting on a slow test job. [13:17:51] Np [13:19:46] (03CR) 10Btullis: [C:03+1] "Many thanks." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874) (owner: 10Elukey) [13:20:44] (03Merged) 10jenkins-bot: admin_ng/LiftWing: add revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [13:21:10] MatmaRex: Do you want me to ping you if there is enough time in the window to do your logging changes? [13:21:19] jouncebot: nowandnext [13:21:19] For the next 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1300) [13:21:20] In 0 hour(s) and 38 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1400) [13:21:53] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [13:21:56] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [13:22:05] Dreamy_Jazz: thanks, but we definitely won't make it in 38 minutes, and i have to leave soon afterwards [13:22:14] these changes can wait, they do nothing :) [13:22:17] Okay. [13:22:21] :D [13:23:07] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:23:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [13:24:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: 10NMW03) [13:24:09] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:24:37] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536 (10MoritzMuehlenhoff) 03NEW [13:24:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10137289 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:25:20] Still waiting on test jobs - Watching it slowly process the tests is almost like watching paint dry :D [13:25:24] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:26:05] Soft fire makes sweet malt :0 [13:26:14] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:26:23] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:26:45] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:26:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host poolcounter2005.codfw.wmnet with OS bookworm [13:26:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host poolcounter2005.codfw.wmnet [13:26:57] (03PS1) 10Elukey: kubernetes: disable PSP for the AUX cluster [puppet] - 10https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) [13:27:16] (03Merged) 10jenkins-bot: IPInfoLogFormatter: Avoid unnecessary User object creation [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz) [13:27:22] (03PS1) 10Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) [13:27:33] (03PS13) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [13:27:41] !log dreamyjazz@deploy1003 Started scap sync-world: Backport 
for [[gerrit:1071838|Remove ResourceLoaderUseObjectCacheForDeps (T343492)]], [[gerrit:1072159|Generate special page name in English for central URLs (T374277)]], [[gerrit:1072176|IPInfoLogFormatter: Avoid unnecessary User object creation (T374526)]], [[gerrit:1071902|Add arbcom group to zhwiki (T374455)]], [[gerrit:1072177|Remove redundant oathauth-enable flag ( [13:27:41] T374528)]], [[gerrit:1072178|Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki (T374504)]], [[gerrit:1072186|Raise RelatedArticlesCardLimit to 9 in zhwikinews (T374323)]], [[gerrit:1072191|Enable Web team search suggestions survey (T373039)]] [13:27:48] T343492: Phase out SqlModuleDependencyStore - https://phabricator.wikimedia.org/T343492 [13:27:48] T374277: View full log does not work on wikis with language other than English - https://phabricator.wikimedia.org/T374277 [13:27:49] T374526: InvalidArgumentException: Invalid IP address error when loading IPInfo logs - https://phabricator.wikimedia.org/T374526 [13:27:49] T374455: Create the "arbcom" user group on zhwiki - https://phabricator.wikimedia.org/T374455 [13:27:50] T374528: Remove redundant oathauth-enable flag - https://phabricator.wikimedia.org/T374528 [13:27:50] T374504: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki - https://phabricator.wikimedia.org/T374504 [13:27:50] T374323: Raise RelatedArticlesCardLimit to 9 in zhwikinews - https://phabricator.wikimedia.org/T374323 [13:27:51] T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039 [13:28:31] (03CR) 10Muehlenhoff: Bird::anycast - allow BFD connections from router link-local IP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [13:29:28] !log brouberol@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [13:29:49] !log 
dreamyjazz@deploy1003 jdrewniak, hokwelum, dreamyjazz, hamishz: Backport for [[gerrit:1071838|Remove ResourceLoaderUseObjectCacheForDeps (T343492)]], [[gerrit:1072159|Generate special page name in English for central URLs (T374277)]], [[gerrit:1072176|IPInfoLogFormatter: Avoid unnecessary User object creation (T374526)]], [[gerrit:1071902|Add arbcom group to zhwiki (T374455)]], [[gerrit:1072177|Remove redundant oatha [13:29:49] uth-enable flag (T374528)]], [[gerrit:1072178|Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki (T374504)]], [[gerrit:1072186|Raise RelatedArticlesCardLimit to 9 in zhwikinews (T374323)]], [[gerrit:1072191|Enable Web team search suggestions survey (T373039)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:29:51] jan_drewniak: Hamishcz: JustHannah: Please test your changes, they are live on the test servers now. [13:30:19] Okay! [13:30:20] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [13:30:33] sure:) [13:30:47] (03CR) 10Elukey: "Chris: Last one I promise! :D" [puppet] - 10https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:32:15] (03CR) 10CDanis: [C:03+1] kubernetes: disable PSP for the AUX cluster [puppet] - 10https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:32:46] Dreamy_Jazz: yeah it's fine [13:32:52] Thanks! [13:33:00] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10137345 (10elukey) 05Open→03Resolved a:03elukey [13:33:13] MatmaRex: I have made a few more tweaks on https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m [13:33:48] hashar: i like it :D [13:34:12] My changes work. 
[13:34:53] Dreamy_Jazz, my changes are all fine for me [13:34:57] Thanks. [13:35:05] JustHannah: How is testing going? [13:35:17] (03PS1) 10Arnaudb: mariadb: prod dbproxy200[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) [13:35:17] (03CR) 10Arnaudb: "for a sanity check → those hosts are not due for "real" production before Manuel comes back and we run some tests" [puppet] - 10https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: 10Arnaudb) [13:35:26] (03PS14) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [13:36:11] : looks good! [13:36:18] Thanks. Proceeding. [13:36:20] !log dreamyjazz@deploy1003 jdrewniak, hokwelum, dreamyjazz, hamishz: Continuing with sync [13:36:31] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10137366 (10MoritzMuehlenhoff) [13:36:52] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10137367 (10MoritzMuehlenhoff) [13:36:59] (03PS1) 10Elukey: Swap poolcounter2003 with poolcounter2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) [13:37:53] (03CR) 10Elukey: "Hey folks, the host is up and running, it seems working fine but some validation from serviceops is needed :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [13:38:08] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [13:38:19] (03CR) 10Elukey: [C:03+2] kubernetes: disable PSP for the AUX cluster [puppet] - 
10https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:40:50] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [13:40:50] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071838|Remove ResourceLoaderUseObjectCacheForDeps (T343492)]], [[gerrit:1072159|Generate special page name in English for central URLs (T374277)]], [[gerrit:1072176|IPInfoLogFormatter: Avoid unnecessary User object creation (T374526)]], [[gerrit:1071902|Add arbcom group to zhwiki (T374455)]], [[gerrit:1072177|Remove redundant oathauth-enable flag [13:40:50] (T374528)]], [[gerrit:1072178|Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki (T374504)]], [[gerrit:1072186|Raise RelatedArticlesCardLimit to 9 in zhwikinews (T374323)]], [[gerrit:1072191|Enable Web team search suggestions survey (T373039)]] (duration: 13m 09s) [13:40:56] Deploys done. [13:40:57] T343492: Phase out SqlModuleDependencyStore - https://phabricator.wikimedia.org/T343492 [13:40:58] T374277: View full log does not work on wikis with language other than English - https://phabricator.wikimedia.org/T374277 [13:40:58] T374526: InvalidArgumentException: Invalid IP address error when loading IPInfo logs - https://phabricator.wikimedia.org/T374526 [13:40:58] T374455: Create the "arbcom" user group on zhwiki - https://phabricator.wikimedia.org/T374455 [13:40:59] T374528: Remove redundant oathauth-enable flag - https://phabricator.wikimedia.org/T374528 [13:40:59] T374504: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki - https://phabricator.wikimedia.org/T374504 [13:40:59] T374323: Raise RelatedArticlesCardLimit to 9 in zhwikinews - https://phabricator.wikimedia.org/T374323 [13:41:00] T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039 [13:41:49] wikitech.wikimedia.org might be broken? 
[13:42:17] Loading https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=edit&section=6 says "File not found" [13:42:39] Dreamy_Jazz, it works normal from my end [13:43:07] Apparently it doesn't work if you still have the mwdebug servers enabled [13:44:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [13:44:17] ah yes [13:44:24] For some reason https://wikitech.wikimedia.org/wiki/Deployments now always redirects me to https://foundation.wikimedia.org/wiki/Deployments [13:44:33] Even with the debug server off [13:45:00] Dreamy_Jazz: yep and you can’t also reload a page with the debug enabled too [13:45:26] Looks like it was a caching issue. Opening my developer tools fixed the redirect. [13:45:33] Anyway, not caused by the deployments so that is all good. [13:46:02] yes... https://wikitech.wikimedia.org/wiki/Main_Page always redirect me to https://foundation.wikimedia.org/wiki/Home [13:46:16] w/ that off [13:47:05] !log Afternoon UTC backport window done [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:53] (03PS1) 10Bartosz Dziewoński: logging: Default to log any error (on beta and group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) [13:48:21] (03PS15) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [13:48:34] (03CR) 10Subramanya Sastry: [C:03+1] Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: 10C. 
Scott Ananian) [13:48:43] (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [13:49:00] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on ganeti2012.codfw.wmnet with reason: Move ganeti2012 server uplink [13:49:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on ganeti2012.codfw.wmnet with reason: Move ganeti2012 server uplink [13:49:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10137421 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5ff7d01a-40d8-4196-9008-7bf9b79ea4e8) set by c... [13:51:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2138.codfw.wmnet onto db2238.codfw.wmnet [13:52:07] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:52:35] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [13:53:12] (03PS1) 10Ladsgroup: DNM: Add pc5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072208 (https://phabricator.wikimedia.org/T374496) [13:54:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540 (10ops-monitoring-bot) 03NEW [13:55:59] (03PS1) 10Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 [13:56:57] (03CR) 10JHathaway: [C:03+1] Puppet 
frontends: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff) [13:57:07] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3953/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh) [13:57:37] hashar: i updated your patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1018637 , want to un-WIP it? [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1400) [14:00:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.767 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:51] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:05:47] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [14:05:59] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 11s) [14:06:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10137512 (10cmooney) Still chasing Equinix to get this sorted, back-and-forth now with them for almost 2 weeks without an... [14:07:28] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374380#10137505 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:07:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:07:55] (03PS4) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) [14:09:45] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet [14:09:49] (03PS1) 10JMeybohm: Replace kafka-main2003 with kafka-main2008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) [14:10:15] (03CR) 10Volans: "post-merge suggestion" [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:10:36] (03CR) 10Vgutierrez: [C:03+1] hiera: disabling haproxy logging to 
socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [14:11:04] (03CR) 10Fabfur: [C:03+2] hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [14:11:41] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: add cookbook for GeoDNS pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:12:51] (03PS1) 10Ladsgroup: Revert "conftool-data: Remove pc5 for now" [puppet] - 10https://gerrit.wikimedia.org/r/1072213 [14:13:06] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2008.codfw.wmnet [14:13:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2008.codfw.wmnet [14:14:12] (03PS1) 10Ladsgroup: conftool: Add pc5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1072215 (https://phabricator.wikimedia.org/T374496) [14:14:25] !log reverted 1072172 and repooling cp4037 (T370668) [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:28] T370668: New software: haproxykafka - https://phabricator.wikimedia.org/T370668 [14:14:31] (03PS2) 10Ladsgroup: Revert "conftool-data: Remove pc5 for now" [puppet] - 10https://gerrit.wikimedia.org/r/1072213 (https://phabricator.wikimedia.org/T374496) [14:14:34] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [14:14:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T371742)', diff saved to https://phabricator.wikimedia.org/P68918 and previous config saved to /var/cache/conftool/dbconfig/20240911-141449-ladsgroup.json [14:14:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:17:24] (03CR) 10Elukey: 
[C:03+1] Replace kafka-main2003 with kafka-main2008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [14:17:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 10%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68919 and previous config saved to /var/cache/conftool/dbconfig/20240911-141732-arnaudb.json [14:17:54] (03CR) 10JMeybohm: [C:03+2] Replace kafka-main2003 with kafka-main2008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [14:19:42] (03Merged) 10jenkins-bot: Replace kafka-main2003 with kafka-main2008 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [14:19:59] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti2013.codfw.wmnet [14:20:13] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542 (10JMeybohm) 03NEW [14:20:18] (03CR) 10Ladsgroup: [C:03+2] conftool: Add pc5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1072215 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup) [14:20:24] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137567 (10JMeybohm) [14:20:26] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [14:20:35] (03PS3) 10Ladsgroup: Revert "conftool-data: Remove pc5 for now" [puppet] - 10https://gerrit.wikimedia.org/r/1072213 (https://phabricator.wikimedia.org/T374496) [14:20:38] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Revert "conftool-data: Remove pc5 for now" [puppet] - 10https://gerrit.wikimedia.org/r/1072213 
(https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup) [14:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:21:13] (03PS1) 10JHathaway: postfix: remove wikimedia.com domain from relay hosts [puppet] - 10https://gerrit.wikimedia.org/r/1072217 (https://phabricator.wikimedia.org/T374489) [14:21:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072217 (https://phabricator.wikimedia.org/T374489) (owner: 10JHathaway) [14:21:52] (03PS1) 10Arnaudb: mariadb: productionize db2229 [puppet] - 10https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) [14:21:52] (03CR) 10Arnaudb: "@Ladsgroup@gmail.com I've checked on https://fault-tolerance.toolforge.org/map?cluster=db-masters and it will be in the same rack as db223" [puppet] - 10https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [14:21:54] (03PS2) 10Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 [14:23:07] (03CR) 10Clément Goubert: "I'm unsure if it is the right solution, but... I don't really have another answer to this problem." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:24:45] (03CR) 10JHathaway: [C:03+2] postfix: remove wikimedia.com domain from relay hosts [puppet] - 10https://gerrit.wikimedia.org/r/1072217 (https://phabricator.wikimedia.org/T374489) (owner: 10JHathaway) [14:25:01] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2003.codfw.wmnet [14:26:08] (03PS1) 10Ladsgroup: conftool: Add pc5 to list of allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1072218 (https://phabricator.wikimedia.org/T374496) [14:26:17] (03CR) 10Jforrester: [C:03+1] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [14:26:37] (03PS2) 10Ladsgroup: conftool: Add pc5 to list of allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1072218 (https://phabricator.wikimedia.org/T374496) [14:26:42] (03CR) 10Ladsgroup: [V:03+2 C:03+2] conftool: Add pc5 to list of allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1072218 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup) [14:27:17] (03PS1) 10JMeybohm: Decom kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/1072219 (https://phabricator.wikimedia.org/T374542) [14:27:43] MatmaRex: amazing [14:27:59] (03PS2) 10Hnowlan: php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) [14:28:03] does beta still has some log/kibana ? 
[14:28:41] (03PS3) 10Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 [14:28:44] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070562 (https://phabricator.wikimedia.org/T369069) (owner: 10Sergio Gimeno) [14:29:15] (03CR) 10Hnowlan: "Yeah, I'm not either :( However, this is at least limited to shellbox-video for now. One option I considered here that is worth mentioning" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:29:41] hashar: say that again after we deploy it and it works ;) [14:29:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P68920 and previous config saved to /var/cache/conftool/dbconfig/20240911-142956-ladsgroup.json [14:30:18] hashar: i need to be off for today, but i'll schedule some deploys some time soon [14:30:21] (03CR) 10BBlack: [C:03+1] P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh) [14:30:22] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:30:34] MatmaRex: found it found it https://beta-logs.wmcloud.org so I guess I will ninja enable it on beta :) [14:30:37] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [14:30:37] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:30:37] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [14:30:37] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:30:38] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:30:38] !log jayme@deploy1003 helmfile [staging] START 
helmfile.d/services/rdf-streaming-updater: apply [14:30:42] and we can pair the deploy tomorrow [14:30:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:31:05] hashar: cool :D [14:31:12] !log last 7 helmfile deploys did not happen [14:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:21] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10137610 (10dcaro) >>! In T348643#10113626, @wiki_willy wrote: > Thanks @dcaro, sounds good. I'll bug them again abo... [14:32:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 25%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68921 and previous config saved to /var/cache/conftool/dbconfig/20240911-143237-arnaudb.json [14:32:44] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [14:32:50] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti2014.codfw.wmnet [14:32:57] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [14:33:15] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:34:31] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:34:33] !log jayme@deploy1003 helmfile [codfw] START 
helmfile.d/services/changeprop: apply [14:34:40] (03CR) 10Clément Goubert: [C:03+1] "Let's see if it works before adding more parameters. To clarify, this change will only impact `shellbox-video` because it's the only deplo" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:34:48] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:34:50] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:35:14] (03PS16) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) [14:36:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:10] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [14:37:59] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:38:00] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [14:38:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [14:38:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2003.codfw.wmnet [14:38:21] 
10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137625 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2003.codfw.wmnet` - kafka-main2003.codf... [14:38:22] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [14:38:30] (03CR) 10Jgiannelos: [C:04-1] "Blocking this until apps team gives us a wiki to start with. Ptwiki is used for some experiments at the moment so we might not want to ris" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [14:39:01] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [14:39:02] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:39:29] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137630 (10JMeybohm) [14:40:40] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:40:41] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:41:28] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:41:29] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:41:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Pool pc5 into production traffic (T374496)', diff saved to https://phabricator.wikimedia.org/P68922 and previous config saved to /var/cache/conftool/dbconfig/20240911-144147-ladsgroup.json [14:41:50] T374496: Bring pc5 into rotation - https://phabricator.wikimedia.org/T374496 [14:42:40] !log jayme@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/rdf-streaming-updater: apply [14:43:15] !log deployed changeprop-jobqueue changeprop cirrus-streaming-updater eventgate-main eventstreams mw-page-content-change-enrich rdf-streaming-updater for kafka connection string updates - T363210 [14:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:18] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [14:43:32] (03CR) 10Cathal Mooney: "Ok hopefully this looks a bit better now." [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [14:43:46] (03PS2) 10C. Scott Ananian: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) [14:44:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: 10C. Scott Ananian) [14:45:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P68923 and previous config saved to /var/cache/conftool/dbconfig/20240911-144504-ladsgroup.json [14:45:05] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:45:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:46:12] (03PS3) 10Hnowlan: php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) [14:46:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:46:44] (03CR) 10Clément Goubert: [C:03+1] php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:47:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 50%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68924 and previous config saved to /var/cache/conftool/dbconfig/20240911-144743-arnaudb.json [14:48:36] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2001.codfw.wmnet [14:48:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2001.codfw.wmnet [14:48:47] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2003.codfw.wmnet [14:48:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2003.codfw.wmnet [14:49:58] !log Depooling kubernetes2042.codfw.wmnet kubernetes2043.codfw.wmnet mw2350.codfw.wmnet mw2351.codfw.wmnet mw2352.codfw.wmnet mw2353.codfw.wmnet mw2354.codfw.wmnet mw2355.codfw.wmnet mw2356.codfw.wmnet mw2357.codfw.wmnet mw2359.codfw.wmnet parse2014.codfw.wmnet parse2015.codfw.wmnet wikikube-ctrl2002.codfw.wmnet wikikube-worker2020.codfw.wmnet wikikube-worker2021.codfw.wmnet [14:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:00] wikikube-worker2022.codfw.wmnet wikikube-worker2023.codfw.wmnet wikikube-worker2024.codfw.wmnet wikikube-worker2032.codfw.wmnet - T373101 [14:50:02] T373101: Migrate servers in codfw 
racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [14:50:05] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:50:22] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2042.codfw.wmnet [14:50:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2042.codfw.wmnet [14:51:00] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2043.codfw.wmnet [14:51:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2043.codfw.wmnet [14:51:41] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2350.codfw.wmnet [14:52:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2350.codfw.wmnet [14:52:20] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2351.codfw.wmnet [14:52:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2351.codfw.wmnet [14:53:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2352.codfw.wmnet [14:53:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2352.codfw.wmnet [14:53:41] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2353.codfw.wmnet [14:53:46] (03PS2) 10Hashar: logging: Default to log any error (on group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [14:53:46] (03PS1) 10Hashar: logging: Default to log any error (on beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072226 
(https://phabricator.wikimedia.org/T228838) [14:54:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2353.codfw.wmnet [14:54:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2354.codfw.wmnet [14:54:30] (03CR) 10Hashar: "I have moved beta to a standalone job https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072226 . I will deploy it immediatel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [14:54:55] (03CR) 10Hashar: [C:03+1] "Awesome work Bartosz thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński) [14:54:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2354.codfw.wmnet [14:55:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2355.codfw.wmnet [14:55:02] (03PS14) 10Hashar: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838) [14:55:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2355.codfw.wmnet [14:55:39] (03PS1) 10Superzerocool: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484) [14:55:44] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2356.codfw.wmnet [14:56:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2356.codfw.wmnet [14:56:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2357.codfw.wmnet [14:56:36] jouncebot: now [14:56:37] For the next 0 hour(s) and 3 minute(s): 
Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1400) [14:56:50] (03CR) 10Hashar: [C:03+2] logging: Default to log any error (on beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072226 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [14:56:55] ^ that is solely for beta [14:57:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2357.codfw.wmnet [14:57:05] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2359.codfw.wmnet [14:57:33] (03Merged) 10jenkins-bot: logging: Default to log any error (on beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072226 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [14:57:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2359.codfw.wmnet [14:57:48] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2014.codfw.wmnet [14:58:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2014.codfw.wmnet [14:58:27] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2015.codfw.wmnet [14:58:33] about to pool pc5 [14:58:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Pool pc5 into production traffic (T374496)', diff saved to https://phabricator.wikimedia.org/P68925 and previous config saved to /var/cache/conftool/dbconfig/20240911-145844-ladsgroup.json [14:58:48] T374496: Bring pc5 into rotation - https://phabricator.wikimedia.org/T374496 [15:00:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T371742)', diff saved to https://phabricator.wikimedia.org/P68926 and previous config saved to /var/cache/conftool/dbconfig/20240911-150011-ladsgroup.json [15:00:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 
12:00:00 on db2141.codfw.wmnet with reason: Maintenance [15:00:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:00:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [15:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2015.codfw.wmnet [15:01:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2002.codfw.wmnet [15:01:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2002.codfw.wmnet [15:01:50] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2020.codfw.wmnet [15:01:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2020.codfw.wmnet [15:02:27] (03PS3) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 [15:02:29] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2021.codfw.wmnet [15:02:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 75%: post db2138 → 
db2238 repool', diff saved to https://phabricator.wikimedia.org/P68927 and previous config saved to /var/cache/conftool/dbconfig/20240911-150249-arnaudb.json [15:03:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2021.codfw.wmnet [15:03:07] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2022.codfw.wmnet [15:03:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2022.codfw.wmnet [15:03:49] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2023.codfw.wmnet [15:04:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2023.codfw.wmnet [15:04:29] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2024.codfw.wmnet [15:04:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:05:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2024.codfw.wmnet [15:05:12] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2032.codfw.wmnet [15:05:46] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2032.codfw.wmnet [15:07:13] (03Abandoned) 10Ladsgroup: DNM: Add pc5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072208 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup) [15:08:35] 
(03CR) 10Volans: sre.dns.admin: add guardrails for depool of sites/resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh) [15:08:36] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti2014.codfw.wmnet [15:09:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:11:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:50] (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [15:15:54] (03CR) 10Volans: "You probably want to use the `verbatim_hosts` flag, see https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.aler" [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [15:17:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 100%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68928 and previous config saved to /var/cache/conftool/dbconfig/20240911-151754-arnaudb.json [15:18:08] (03CR) 10Volans: "LGTM, modulo fixing the current CI failures that are legit (just rebase your local checkout with master). 
Although I'm not sure if it ever" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [15:21:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=(cp2037|cp2038).codfw.wmnet [reason: depooling for T373101] [15:21:48] (03CR) 10Hnowlan: [V:03+2 C:03+2] php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:21:48] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [15:26:17] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@71141b8] (releasing): (no justification provided) [15:26:58] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@71141b8] (releasing): (no justification provided) (duration: 00m 41s) [15:27:24] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@4635fcb] (releasing): (no justification provided) [15:28:00] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@4635fcb] (releasing): (no justification provided) (duration: 00m 35s) [15:28:09] (03PS5) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) [15:28:09] (03PS2) 10Scott French: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) [15:28:55] (03CR) 10Scott French: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [15:29:28] (03PS1) 10DCausse: rdf-streaming-updater: use SSL to access kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231 [15:31:43] !log 
push server and vlan configuration to lsw1-c6-codfw with Homer to prep physical moves T373101 [15:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [15:33:05] (03CR) 10DCausse: [C:04-1] "sorry for the noise, it's not ready just realized that the consumers are still hardcoded to plaintext... needs a patch in the codebase 😞" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231 (owner: 10DCausse) [15:34:01] (03PS1) 10Hnowlan: php: fix minor indentation issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072232 [15:34:30] (03CR) 10Clément Goubert: [C:03+1] php: fix minor indentation issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072232 (owner: 10Hnowlan) [15:35:13] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on phab2002.codfw.wmnet with reason: nftables migration [15:35:23] (03CR) 10Hnowlan: [V:03+2 C:03+2] php: fix minor indentation issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072232 (owner: 10Hnowlan) [15:35:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on phab2002.codfw.wmnet with reason: nftables migration [15:35:33] !log phab2002 - rebooting for nftables migration [15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on moscovium.eqiad.wmnet with reason: nftables migration [15:36:42] !log moscovium - rebooting for nftables migration [15:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:48] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moscovium.eqiad.wmnet with reason: nftables migration [15:37:14] (03PS1) 10Hnowlan: Fix image name typo 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072234 [15:37:36] !log depooling thanos-fe2004.codfw.wmnet — T373101 [15:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:39] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [15:38:02] (03PS1) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [15:38:19] (03PS2) 10Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) [15:39:03] (03CR) 10Clément Goubert: [C:03+1] Fix image name typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072234 (owner: 10Hnowlan) [15:39:58] (03CR) 10CI reject: [V:04-1] sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [15:40:11] (03CR) 10Hnowlan: [V:03+2 C:03+2] Fix image name typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072234 (owner: 10Hnowlan) [15:41:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2115 db2116 db2127 db2167 db2168 db2179 db2180 db2210 es2022 es2038 - T370852', diff saved to https://phabricator.wikimedia.org/P68929 and previous config saved to /var/cache/conftool/dbconfig/20240911-154114-arnaudb.json [15:41:18] T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 [15:43:14] (03PS3) 10Scott French: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) [15:43:58] PROBLEM - Blazegraph Port for wdqs-blazegraph 
on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:43:58] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:44:04] earlier today I complained about jsontruncated messages coming from wikifunctions. That is T374241 :) [15:44:04] T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [15:44:16] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:44:38] PROBLEM - WDQS Main SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... 
not found on https://query-main.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:44:44] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:45:48] FIRING: PuppetFailure: Puppet has failed on wdqs2021:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:45:55] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10137884 (10ABran-WMF) depoolable hosts have been depooled https://phabricator.wikimedia.org/P68929 [15:46:25] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:51] FIRING: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:28] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2021.codfw.wmnet with OS bullseye [15:50:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2021.codfw.wmnet with reason: T373791 [15:50:24] T373791: 
Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791 [15:50:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2021.codfw.wmnet with reason: T373791 [15:55:21] !log moscovium - apt-get upgrade - installing new apache2 version and more package upgrades [15:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [15:56:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [15:56:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T371742)', diff saved to https://phabricator.wikimedia.org/P68930 and previous config saved to /var/cache/conftool/dbconfig/20240911-155608-ladsgroup.json [15:56:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:56:41] (03CR) 10Volans: [C:03+1] "Makes sense to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French) [16:00:10] (03CR) 10Lucas Werkmeister: [C:03+1] typos: add colud to the list [puppet] - 10https://gerrit.wikimedia.org/r/1072188 (owner: 10David Caro) [16:07:01] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 34 hosts with reason: Move server uplinks codfw racks C6 [16:07:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 34 hosts with reason: Move server uplinks codfw racks C6 [16:07:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10137966 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=6257a49b-1ea6-4675-9944-c5d85eb38288) set by cmooney@cumin1002 for... [16:07:45] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10137961 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:08:03] !log begin server uplink moves from asw-c6-codfw to lsw1-c6-codfw T373101 [16:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:07] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:08:59] * arnaudb grabs his popcorn [16:09:38] (03PS1) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [16:10:43] (03PS2) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [16:16:38] PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2135.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2135.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:49] ah [16:16:52] it's not been muted [16:18:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [16:18:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2135.codfw.wmnet with reason: network maintenance [16:18:16] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 0:30:00 on db2135.codfw.wmnet with reason: network maintenance [16:19:44] (03PS1) 10Hnowlan: shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213) [16:20:32] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 24 hosts with reason: Move server uplinks codfw racks C7 [16:20:38] RECOVERY - MariaDB Replica IO: m5 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:20:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 24 hosts with reason: Move server uplinks codfw racks C7 [16:21:07] !log bking@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: 8 [16:21:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10138029 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6458a64c-9bf9-4b09-a6e1-82f1e6f72fc3) set by cmooney@cumin1002 for... 
[16:21:19] !log bking@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: 8 (duration: 00m 12s) [16:21:46] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:22:00] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:23:00] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:23:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [16:23:16] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:40] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [16:25:43] T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791 [16:26:09] (03CR) 10EoghanGaffney: [C:03+2] lists: Set number of processes for mailman3_runner to minimum of 14 [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney) [16:27:50] (03PS1) 10EoghanGaffney: lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - 
10https://gerrit.wikimedia.org/r/1072247 [16:27:59] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2037.codfw.wmnet [16:27:59] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2037.codfw.wmnet [16:28:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2038.codfw.wmnet [16:28:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2038.codfw.wmnet [16:28:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10138056 (10cmooney) All hosts successfully moved and responding to ping again. [16:29:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:29:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:31:36] ^^ expected [16:31:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68931 and previous config saved to /var/cache/conftool/dbconfig/20240911-163137-arnaudb.json [16:31:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68932 and previous config saved to /var/cache/conftool/dbconfig/20240911-163142-arnaudb.json [16:31:43] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:31:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68933 and previous config saved to 
/var/cache/conftool/dbconfig/20240911-163147-arnaudb.json [16:31:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68934 and previous config saved to /var/cache/conftool/dbconfig/20240911-163152-arnaudb.json [16:31:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68935 and previous config saved to /var/cache/conftool/dbconfig/20240911-163157-arnaudb.json [16:32:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68936 and previous config saved to /var/cache/conftool/dbconfig/20240911-163202-arnaudb.json [16:32:05] (03PS4) 10Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 [16:32:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68937 and previous config saved to /var/cache/conftool/dbconfig/20240911-163207-arnaudb.json [16:32:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68938 and previous config saved to /var/cache/conftool/dbconfig/20240911-163212-arnaudb.json [16:32:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68939 and previous config saved to /var/cache/conftool/dbconfig/20240911-163217-arnaudb.json [16:32:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68940 and previous config saved to /var/cache/conftool/dbconfig/20240911-163222-arnaudb.json [16:33:10] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3954/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh) [16:34:18] (03PS1) 10CDanis: wikifunctions: enable tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072248 [16:34:20] !log Repooling kubernetes2042.codfw.wmnet kubernetes2043.codfw.wmnet mw2350.codfw.wmnet mw2351.codfw.wmnet mw2352.codfw.wmnet mw2353.codfw.wmnet mw2354.codfw.wmnet mw2355.codfw.wmnet mw2356.codfw.wmnet mw2357.codfw.wmnet mw2359.codfw.wmnet parse2014.codfw.wmnet parse2015.codfw.wmnet wikikube-ctrl2002.codfw.wmnet wikikube-worker2020.codfw.wmnet wikikube-worker2021.codfw.wmnet [16:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:22] wikikube-worker2022.codfw.wmnet wikikube-worker2023.codfw.wmnet wikikube-worker2024.codfw.wmnet wikikube-worker2032.codfw.wmnet - T373101 [16:34:33] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2042.codfw.wmnet [16:34:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2042.codfw.wmnet [16:34:40] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2043.codfw.wmnet [16:34:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2043.codfw.wmnet [16:34:44] !log pooling thanos-fe2004.codfw.wmnet — T373101 [16:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:47] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2350.codfw.wmnet [16:34:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2350.codfw.wmnet [16:34:51] (03CR) 10BBlack: [C:03+1] P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh) 
[16:34:55] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2351.codfw.wmnet [16:34:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2351.codfw.wmnet [16:35:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2352.codfw.wmnet [16:35:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2352.codfw.wmnet [16:35:09] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2353.codfw.wmnet [16:35:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2353.codfw.wmnet [16:35:16] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2354.codfw.wmnet [16:35:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2354.codfw.wmnet [16:35:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2355.codfw.wmnet [16:35:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2355.codfw.wmnet [16:35:30] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2356.codfw.wmnet [16:35:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2356.codfw.wmnet [16:35:37] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2357.codfw.wmnet [16:35:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2357.codfw.wmnet [16:35:41] (03PS2) 10CDanis: wikifunctions: enable tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) [16:35:43] !log disable now unused ports on asw-c6-codfw after server move T373101 [16:35:45] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node 
pool for host mw2359.codfw.wmnet [16:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2359.codfw.wmnet [16:35:52] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2014.codfw.wmnet [16:35:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2014.codfw.wmnet [16:35:59] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2015.codfw.wmnet [16:36:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2015.codfw.wmnet [16:36:06] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2002.codfw.wmnet [16:36:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2002.codfw.wmnet [16:36:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10138099 (10VRiley-WMF) a:03VRiley-WMF [16:36:13] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2020.codfw.wmnet [16:36:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2020.codfw.wmnet [16:36:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10138100 (10ABran-WMF) hosts are repooling [16:36:21] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2021.codfw.wmnet [16:36:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2021.codfw.wmnet [16:36:29] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker2022.codfw.wmnet [16:36:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2022.codfw.wmnet [16:36:36] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2023.codfw.wmnet [16:36:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2023.codfw.wmnet [16:36:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2024.codfw.wmnet [16:36:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2024.codfw.wmnet [16:36:50] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2032.codfw.wmnet [16:36:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2032.codfw.wmnet [16:37:34] (03CR) 10Giuseppe Lavagetto: [C:03+1] wikifunctions: enable tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:37:49] (03CR) 10CDanis: [C:03+2] wikifunctions: enable tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:38:50] (03Merged) 10jenkins-bot: wikifunctions: enable tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:39:12] jouncebot: nowandnext [16:39:12] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [16:39:12] In 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1700) [16:39:38] (03PS1) 10CDanis: mw-wikifunctions: tracing at 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072251 (https://phabricator.wikimedia.org/T320549) 
[16:39:54] (03CR) 10CDanis: [C:03+2] mw-wikifunctions: tracing at 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072251 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:40:11] lucaswerkmeister: I'm around if you'd still like to deploy those fatal-error patches today [16:40:18] sure! [16:40:29] I figured out the fatal-error.php password too [16:40:51] ah perfect [16:41:09] (03Merged) 10jenkins-bot: mw-wikifunctions: tracing at 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072251 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:41:44] (and as far as I could tell, the X-Request-Id response header is never sent, so I guess the condition in https://gerrit.wikimedia.org/g/operations/puppet/+/refs/changes/15/1071715/2/modules/profile/files/mediawiki/php/php7-fatal-error.php#104 is always false…) [16:42:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [16:42:29] (03PS1) 10AikoChou: ml-services: add ref-quality isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) [16:42:35] (but maybe that only affects /w/fatal-error.php and “real” fatal errors still get a chance to send that header. no idea) [16:43:16] hm, okay [16:43:29] (03PS1) 10CDanis: wikifunctions: no tracing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072253 [16:43:53] lucaswerkmeister: do you want to deploy and test these together, or one at a time? 
[16:44:18] together, I think [16:44:30] works for me [16:44:38] I don’t even know how to test the first change, presumably the request ID being unset Should Never Happen™ ^^ [16:44:40] (03CR) 10CDanis: [C:03+2] wikifunctions: no tracing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072253 (owner: 10CDanis) [16:45:25] (03CR) 10RLazarus: [C:03+2] errorpage: Remove redundant 'unknown' $reqId fallback [puppet] - 10https://gerrit.wikimedia.org/r/1071714 (owner: 10Lucas Werkmeister) [16:45:36] (03CR) 10RLazarus: [C:03+2] errorpage: Include request ID early in HTML source [puppet] - 10https://gerrit.wikimedia.org/r/1071715 (https://phabricator.wikimedia.org/T291192) (owner: 10Lucas Werkmeister) [16:45:39] (03Merged) 10jenkins-bot: wikifunctions: no tracing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072253 (owner: 10CDanis) [16:46:06] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:46:22] (waiting on puppet-merge) [16:46:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68941 and previous config saved to /var/cache/conftool/dbconfig/20240911-164644-arnaudb.json [16:46:47] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:46:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68942 and previous config saved to /var/cache/conftool/dbconfig/20240911-164648-arnaudb.json [16:46:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68943 and previous config saved to /var/cache/conftool/dbconfig/20240911-164653-arnaudb.json [16:46:56] in my experience, "never" is defined in the wikimedia world as "something that probably happens at least once an hour somewhere" :) [16:46:58] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68944 and previous config saved to /var/cache/conftool/dbconfig/20240911-164657-arnaudb.json [16:47:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68945 and previous config saved to /var/cache/conftool/dbconfig/20240911-164703-arnaudb.json [16:47:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68946 and previous config saved to /var/cache/conftool/dbconfig/20240911-164708-arnaudb.json [16:47:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68947 and previous config saved to /var/cache/conftool/dbconfig/20240911-164713-arnaudb.json [16:47:14] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:47:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68948 and previous config saved to /var/cache/conftool/dbconfig/20240911-164718-arnaudb.json [16:47:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68949 and previous config saved to /var/cache/conftool/dbconfig/20240911-164723-arnaudb.json [16:47:25] (now waiting on the puppet agent at deploy1003) [16:47:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68950 and previous config saved to /var/cache/conftool/dbconfig/20240911-164728-arnaudb.json [16:47:30] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:48:35] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:48:42] 
!bash in my experience, "never" is defined in the wikimedia world as "something that probably happens at least once an hour somewhere" :) [16:48:43] lucaswerkmeister: Stored quip at https://bash.toolforge.org/quip/Z5784ZEBFFSCpsJzvl4r [16:48:55] (hope you don’t mind, delete it if you do ^^) [16:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:51:59] cdanis: my helmfile deploy is picking up your wikifunctions tracing changes, is it okay if those go out? [16:52:04] rzl: please [16:52:11] 👍 [16:52:13] I hadn't started them because I didn't want to get in your way [16:52:22] swfrench-wmf++ for these good diffs appearing [16:52:43] oh is it better now?? [16:52:56] oh I just mean diffs appearing here at all [16:53:05] no news afaik, I just think it's neat that scap does that [16:53:11] ah yeah [16:53:21] nothing else unexpected here, off we go [16:53:22] !log rzl@deploy1003 Started scap sync-world: 1071714, 1071715 (T291192) [16:53:29] T291192: Update php-wmerrors page to include request ID - https://phabricator.wikimedia.org/T291192 [16:53:38] RECOVERY - mailman3_runners on lists1004 is OK: PROCS OK: 15 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:54:12] !log rzl@deploy1003 rzl: 1071714, 1071715 (T291192) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:54:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:54:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events 
(tcp.address_unreachable from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [16:54:32] * lucaswerkmeister looks up the mwdebug curl incantation [16:54:32] lucaswerkmeister, cdanis: at mwdebug, ready for testing [16:54:47] !incidents [16:54:47] 5158 (UNACKED) NELHigh sre (thanos-rule tcp.address_unreachable) [16:54:47] 5157 (RESOLVED) db1166 (paged)/MariaDB Replica SQL: s3 (paged) [16:54:48] 5156 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [16:54:48] 5155 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [16:55:01] arnoldokoth, herron: fyi I have a deploy in progress but it's only as far as mwdebug and almost certainly can't be related to that NEL page [16:55:13] I think that NEL page from GB is the same false positive as earlier this week [16:55:15] we had some of that recently [16:55:21] what cdanis said [16:55:27] !ack 5158 [16:55:28] 5158 (ACKED) NELHigh sre (thanos-rule tcp.address_unreachable) [16:55:31] rzl: seems to work, I see the HTML comment :) [16:55:36] I'll do something quickly to exclude that bogus hostname in the logstash exporter that backs the metric for the alert [16:55:41] lucaswerkmeister: sweet [16:55:48] ok sounds good cdanis thanks [16:56:00] er, after I have some lunch, since I just realize I haven't yet [16:56:07] but yeah before eod today :) [16:56:21] write both before and after food and compare :) [16:56:31] cdanis: do you want to test anything on that tracing change while it's at mwdebug, or should I just roll it everywhere? [16:56:39] rzl: roll [16:56:54] herron: any objection wrt that alert? 
[16:57:08] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=(cp2037|cp2038).codfw.wmnet [reason: done T373101] [16:57:12] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [16:57:13] rzl: negative sgtm [16:57:18] rzl: cdanis: Thanks. [16:57:21] 🚀 [16:57:23] !log rzl@deploy1003 rzl: Continuing with sync [16:58:24] !log rzl@deploy1003 Finished scap sync-world: 1071714, 1071715 (T291192) (duration: 07m 37s) [16:58:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T371742)', diff saved to https://phabricator.wikimedia.org/P68951 and previous config saved to /var/cache/conftool/dbconfig/20240911-165838-ladsgroup.json [16:58:42] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:59:39] !log sudo cumin "A:dnsbox" 'disable-puppet "merging CR 1072209"' [16:59:40] and I think that's yesterday's puppet window complete 😅 [16:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:49] (03PS1) 10Jdlrobson: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072255 [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1700) [17:00:21] (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh) [17:01:17] (and re what I wrote above about the X-Request-ID response header missing from the fatal-error response, it turns out Krinkle already figured that out three years ago, it works as expected but gets filtered out unless WikimediaDebug is used [17:01:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/721923/3#message-8147826063be8a55a599fb775df9f242d3e075ea) 
[17:01:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68952 and previous config saved to /var/cache/conftool/dbconfig/20240911-170149-arnaudb.json [17:01:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68953 and previous config saved to /var/cache/conftool/dbconfig/20240911-170153-arnaudb.json [17:01:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68954 and previous config saved to /var/cache/conftool/dbconfig/20240911-170158-arnaudb.json [17:02:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68955 and previous config saved to /var/cache/conftool/dbconfig/20240911-170203-arnaudb.json [17:02:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68956 and previous config saved to /var/cache/conftool/dbconfig/20240911-170208-arnaudb.json [17:02:12] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [17:02:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68957 and previous config saved to /var/cache/conftool/dbconfig/20240911-170213-arnaudb.json [17:02:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68958 and previous config saved to /var/cache/conftool/dbconfig/20240911-170218-arnaudb.json [17:02:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68959 and previous config saved to 
/var/cache/conftool/dbconfig/20240911-170223-arnaudb.json [17:02:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68960 and previous config saved to /var/cache/conftool/dbconfig/20240911-170228-arnaudb.json [17:02:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68961 and previous config saved to /var/cache/conftool/dbconfig/20240911-170233-arnaudb.json [17:05:05] !log sukhe@dns7001:~$ sudo systemctl restart ntpsec.service [17:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:46] PROBLEM - NTP peers on dns7001 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP [17:08:46] RECOVERY - NTP peers on dns7001 is OK: NTP OK: Offset 0.000833847 secs https://wikitech.wikimedia.org/wiki/NTP [17:09:52] (03PS1) 10Zabe: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) [17:12:36] (03PS2) 10Zabe: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490) [17:12:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [17:13:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P68962 and previous config saved to /var/cache/conftool/dbconfig/20240911-171346-ladsgroup.json [17:14:20] !log installing gtk+2.0 security updates on bookworm [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68963 and previous config saved to /var/cache/conftool/dbconfig/20240911-171655-arnaudb.json [17:16:59] T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 [17:17:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68964 and previous config saved to /var/cache/conftool/dbconfig/20240911-171700-arnaudb.json [17:17:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68965 and previous config saved to /var/cache/conftool/dbconfig/20240911-171704-arnaudb.json [17:17:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68966 and previous config saved to /var/cache/conftool/dbconfig/20240911-171709-arnaudb.json [17:17:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68967 and previous config saved to /var/cache/conftool/dbconfig/20240911-171714-arnaudb.json [17:17:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68968 and previous config saved to 
/var/cache/conftool/dbconfig/20240911-171719-arnaudb.json [17:17:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68969 and previous config saved to /var/cache/conftool/dbconfig/20240911-171724-arnaudb.json [17:17:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68970 and previous config saved to /var/cache/conftool/dbconfig/20240911-171729-arnaudb.json [17:17:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68971 and previous config saved to /var/cache/conftool/dbconfig/20240911-171734-arnaudb.json [17:17:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68972 and previous config saved to /var/cache/conftool/dbconfig/20240911-171739-arnaudb.json [17:17:40] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [17:17:42] RECOVERY - WDQS Main SPARQL on wdqs2021 is OK: HTTP OK: HTTP/1.1 200 OK - 785 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:17:47] T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791 [17:18:06] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:18:08] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:18:51] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:19:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [17:20:49] rzl: I forgot to say thanks, so thanks for deploying! \o/ [17:20:56] jouncebot: nowandnext [17:20:56] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1700) [17:20:56] In 0 hour(s) and 39 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1800) [17:21:55] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:56] (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:22:02] (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 
(https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:22:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:22:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:24:36] lucaswerkmeister: of course, any time! thanks for your flexibility [17:25:02] (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:25:17] (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:25:41] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]] [17:25:44] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [17:26:23] (03PS1) 10Zabe: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) [17:26:57] (03PS1) 10Zabe: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 
(https://phabricator.wikimedia.org/T183490) [17:27:00] (03PS1) 10Ssingh: P:ntp: bump check_interval to 5 mins [puppet] - 10https://gerrit.wikimedia.org/r/1072261 [17:27:10] (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:27:15] (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:27:53] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: add ref-quality isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [17:28:07] (03PS2) 10Ssingh: P:ntp: bump monitoring check_interval to 5 mins [puppet] - 10https://gerrit.wikimedia.org/r/1072261 [17:28:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P68973 and previous config saved to /var/cache/conftool/dbconfig/20240911-172852-ladsgroup.json [17:29:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:29:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:30:01] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-main,name=codfw [17:30:18] (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option for not deleting text row 
[extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:30:20] (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [17:30:39] (03PS2) 10AikoChou: ml-services: deploy ref-quality isvc in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) [17:30:44] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]], [[gerrit:1072259|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072260|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]] [17:30:47] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [17:32:07] (03CR) 10AikoChou: [C:03+2] "Thanks for the review! 
:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [17:33:04] (03Merged) 10jenkins-bot: ml-services: deploy ref-quality isvc in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou) [17:35:05] (03CR) 10Ssingh: [C:03+2] P:ntp: bump monitoring check_interval to 5 mins [puppet] - 10https://gerrit.wikimedia.org/r/1072261 (owner: 10Ssingh) [17:39:42] !log imported php-uuid_1.2.0-12+wmf11u1 into component/php81 - T372507 [17:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:46] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [17:43:50] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10138403 (10MoritzMuehlenhoff) [17:44:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T371742)', diff saved to https://phabricator.wikimedia.org/P68974 and previous config saved to /var/cache/conftool/dbconfig/20240911-174400-ladsgroup.json [17:44:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [17:44:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:44:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [17:44:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T371742)', diff saved to https://phabricator.wikimedia.org/P68975 and previous config saved to /var/cache/conftool/dbconfig/20240911-174422-ladsgroup.json [17:45:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10138408 
(10MoritzMuehlenhoff) [17:45:45] !log installing postgresql-15 security updates [17:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:12] (03PS1) 10Varnent: Updated license information from CC 3.0 to CC 4.0 per request from Legal. [puppet] - 10https://gerrit.wikimedia.org/r/1072265 [17:47:14] !log sukhe@dns7001:~$ sudo systemctl restart ntpsec.service [17:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:08] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:48:10] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:48:17] !log zabe@deploy1003 zabe: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]], [[gerrit:1072259|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072260|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]] [17:48:17] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:48:19] !log zabe@deploy1003 zabe: Continuing with sync [17:48:20] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [17:48:52] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:49:21] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:50:10] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10138420 (10MoritzMuehlenhoff) [17:50:13] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:50:19] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - 
https://phabricator.wikimedia.org/T374215#10138421 (10VRiley-WMF) After working with Dell on this issue for a while and they reviewed the logs, they don't see any issues with the Hardware. Would it be possible to reinstall the OS... [17:50:23] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:51:45] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:51:51] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:52:50] !log re-enable puppet on A:dnsbox and enable agent [17:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:55] !log re-enable puppet on A:dnsbox and [run] agent [17:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:45] (03CR) 10Cathal Mooney: [C:03+2] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney) [17:54:01] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10138444 (10MoritzMuehlenhoff) [17:58:41] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox [18:00:04] dduvall and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1800). 
[18:00:13] sorry, scap is still running, it's far slower than expected [18:01:37] PROBLEM - NTP peers on dns1004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP [18:02:06] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]], [[gerrit:1072259|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072260|migrateESRefToContentTable: Add option to dump tt: -> es: re [18:02:06] ference (T183490)]] (duration: 31m 21s) [18:02:10] * zabe done [18:02:10] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [18:04:02] sukhe: are you around - I'm unsure about the dns1004 ntp alert above? [18:04:10] topranks: yeah, all good [18:04:15] ok yeah [18:04:20] checking on the host all looks ok to me [18:04:27] thanks. we removed iburst today so the initial sync takes longer [18:04:38] I bumped the check_interval but I think it needs to be higher [18:04:40] will fix it [18:04:50] ok cool [18:04:57] remind me what iburst does again? [18:05:26] (03CR) 10Xcollazo: "I was under the impression that older revisions continue to be CC 3.0 rather than CC 4.0." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [18:05:29] offsets don't seem too bad, we obviously have a lowish threshold for it [18:05:30] so like when the ntpsec service on dns1004 starts, or we reimage the server or reboot [18:05:48] we send a burst of six packets for a faster sync vs the usual one [18:05:55] (03PS1) 10Jforrester: dumps: Update legal.html file to list different licences for Wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072268 [18:05:56] we debated quite a lot about this and decided to do away with it [18:06:16] zabe: all clear? [18:06:18] ok cool - good to refresh my memory on that thanks [18:06:24] dduvall: yep [18:06:27] thanks! [18:06:28] (03PS1) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) [18:06:29] we had it both for our servers (the dns boxes) and per-country pools and we removed both [18:06:37] RECOVERY - NTP peers on dns1004 is OK: NTP OK: Offset -0.001005084 secs https://wikitech.wikimedia.org/wiki/NTP [18:06:37] cool [18:07:29] it's still there on dns1004 in ntp.conf though [18:07:31] pool 0.us.pool.ntp.org iburst [18:07:48] not for the other dns servers though, just that one [18:09:24] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072270 (https://phabricator.wikimedia.org/T373641) [18:09:25] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072270 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [18:10:02] (03CR) 10Varnent: "I have pinged Shaun S in Legal via Slack to have him verify. 
He may differ to someone else within Legal who already has an account here to" [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [18:10:21] (03PS2) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) [18:11:01] topranks: thanks for pointing that out. it's weird because https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/1502b725372051a7d1e4a31d501b4e69ad393330%5E%21/#F1 [18:11:03] (03CR) 10Varnent: "To clarify, Shaun is Legal rep that made initial request for that information to be updated." [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [18:11:23] and https://puppetboard.wikimedia.org/report/dns1004.wikimedia.org/a4c829ca1019a6d345486767ef567e7b6237b574 [18:11:34] ah sorry yeah [18:11:41] you are looking at ntp.conf. the file should be ntpsec.conf [18:11:57] I will remove ntp.conf from everywhere [18:12:00] (03CR) 10Scott French: "Hugh, since you kindly reviewed the last patch series, could I ask you to take a look at this as well? Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [18:12:06] so look at /etc/ntpsec/ntp.conf [18:12:23] pool 0.us.pool.ntp.org [18:12:26] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072270 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot) [18:12:56] sukhe: cool yeah [18:13:05] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:13:13] probably ntp.conf is an artifact from before we used ntpsec? 
[18:13:15] yep [18:13:21] good to know [18:13:27] good shoutout though, I will remove all traces of it to avoid confusion [18:13:43] (which is what we have been doing, even though ntpsec was aliasing ntpd, we are setting ntpsec everywhere) [18:13:49] (03CR) 10Ladsgroup: "I can deploy it once the question is answered." [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [18:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:22:22] (03PS1) 10CDanis: NEL alerts: exclude common noise [puppet] - 10https://gerrit.wikimedia.org/r/1072271 (https://phabricator.wikimedia.org/T374563) [18:22:45] PROBLEM - NTP peers on dns1006 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP [18:23:11] ^ yeah, bumping this shortly [18:23:30] there's nothing broken as the syncs are spaced apart but yes, we should not alert as quick as we were before [18:25:28] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[18:25:53] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.22 refs T373641 [18:25:57] T373641: 1.43.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T373641 [18:27:45] RECOVERY - NTP peers on dns1006 is OK: NTP OK: Offset 0.000132206 secs https://wikitech.wikimedia.org/wiki/NTP [18:29:40] 🍿 [18:30:21] !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php test2wiki --skip text_table_cleanup/test2wiki text_table_dump/test2wiki --sleep 1 # T183490 [18:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:24] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [18:30:29] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:31:18] ok, forgot the --dump lol, restarted [18:32:44] (03CR) 10Stoyofuku-wmf: [C:03+1] "thank you!!" [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072255 (owner: 10Jdlrobson) [18:32:55] PROBLEM - NTP peers on dns2004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP [18:33:00] (03CR) 10CDanis: [C:04-1] "If you planned on reusing the same cert as is in production now, this won't work -- lists1004.wikimedia.org is not one of its SANs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072247 (owner: 10EoghanGaffney) [18:33:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072255 (owner: 10Jdlrobson) [18:33:34] zabe: i'm seeing a slew of errors from `migrateESRefToContentTable.php` [18:33:48] yep, already canceled [18:33:53] `PHP Warning: fwrite() expects parameter 1 to be resource, bool given` [18:34:13] do you need me to rollback wmf.22? [18:34:18] yes, and more interesting PHP Warning: `fopen(/home/zabe/text_table_dump/test2wiki): failed to open stream: Permission denied` [18:34:22] dduvall: nope [18:34:25] k [18:34:29] but thanks [18:34:33] np [18:35:32] (03PS1) 10Ssingh: P:ntp: bump retry_interval to 5 mins for NTP monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/1072273 [18:37:22] (03PS1) 10Jdlrobson: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072274 [18:37:35] (03Abandoned) 10Jdlrobson: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072255 (owner: 10Jdlrobson) [18:37:55] RECOVERY - NTP peers on dns2004 is OK: NTP OK: Offset -0.000446636 secs https://wikitech.wikimedia.org/wiki/NTP [18:39:28] zabe: just give 777 to the file :D [18:41:09] (03PS1) 10Ssingh: P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - 10https://gerrit.wikimedia.org/r/1072276 [18:41:38] (03CR) 10Ssingh: [C:03+2] P:ntp: bump retry_interval to 5 mins for NTP monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/1072273 (owner: 10Ssingh) [18:42:07] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3956/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [18:42:50] Amir1: yep :D [18:42:55] !log running agent on O:alerting_host [18:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:22] (03CR) 10CDanis: [C:03+1] ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [18:46:16] (03CR) 10Stoyofuku-wmf: [C:03+1] "We're not worried that this is a cherry pick of the cherry pick, right? Everything looks fine to me so approving" [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072274 (owner: 10Jdlrobson) [18:46:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072274 (owner: 10Jdlrobson) [18:47:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T371742)', diff saved to https://phabricator.wikimedia.org/P68976 and previous config saved to /var/cache/conftool/dbconfig/20240911-184750-ladsgroup.json [18:47:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:57:53] PROBLEM - NTP peers on dns2006 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP [18:59:27] RECOVERY - NTP peers on dns2006 is OK: NTP OK: Offset -0.000629282 secs https://wikitech.wikimedia.org/wiki/NTP [19:00:22] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:00:55] (03CR) 10Herron: [C:03+1] NEL alerts: exclude common noise [puppet] - 10https://gerrit.wikimedia.org/r/1072271 (https://phabricator.wikimedia.org/T374563) (owner: 10CDanis) [19:02:15] (03CR) 10CDanis: [C:03+2] NEL alerts: exclude common noise [puppet] - 10https://gerrit.wikimedia.org/r/1072271 (https://phabricator.wikimedia.org/T374563) (owner: 10CDanis) [19:02:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P68977 and previous config saved to /var/cache/conftool/dbconfig/20240911-190257-ladsgroup.json [19:05:22] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:07:50] ACKNOWLEDGEMENT - NTP peers on dns3003 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown Sukhbir Singh cookbook run https://wikitech.wikimedia.org/wiki/NTP [19:15:22] (03PS2) 10Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) [19:15:52] (03PS3) 10Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) [19:18:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P68978 and previous config saved to /var/cache/conftool/dbconfig/20240911-191805-ladsgroup.json [19:22:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object 
counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:29:19] (03PS1) 10Scott French: aptrepo: ffmpeg bullseye component [puppet] - 10https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502) [19:31:53] (03CR) 10Eevans: [C:03+2] puppet8: ensure cassandra passwords are defined [puppet] - 10https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:32:18] (03CR) 10DCausse: "lgtm, chart version needs to be updated I think" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [19:33:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T371742)', diff saved to https://phabricator.wikimedia.org/P68979 and previous config saved to /var/cache/conftool/dbconfig/20240911-193312-ladsgroup.json [19:33:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [19:33:17] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:33:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [19:33:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T371742)', diff saved to https://phabricator.wikimedia.org/P68980 and previous config saved to /var/cache/conftool/dbconfig/20240911-193335-ladsgroup.json [19:42:58] jouncebot: next [19:42:58] In 0 hour(s) and 17 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T2000) [19:46:37] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [19:46:40] !log gmodena@deploy1003 helmfile 
[codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:47:42] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [19:47:45] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:53:28] (03CR) 10BCornwall: [V:03+1] varnish: Conditionally monitor vcl reloads (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [19:56:29] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:56:32] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:59:44] (03CR) 10Jdlrobson: [C:03+1] Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T2000). [20:00:04] kimberly_sarabia, toyofuku, Nemoralis, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] Hi I'm here [20:00:21] o/ [20:00:57] (03PS8) 10Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) [20:01:23] (03PS12) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [20:01:40] here as well! [20:01:47] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [20:02:56] (03PS13) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [20:06:25] (03CR) 10Ebrahim: "I'm very sorry about that. Thanks for making this possible" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [20:07:19] hi - sorry to be late - i can deploy [20:07:33] thank you!! [20:07:52] (03PS14) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) [20:08:08] kimberly_sarabia: i'll start with yours! [20:08:17] thanks! [20:09:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [20:09:29] (03CR) 10Clare Ming: [C:03+2] Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072274 (owner: 10Jdlrobson) [20:09:31] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:09:35] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:09:49] (03Merged) 10jenkins-bot: Roll out appearance menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [20:10:05] toyofuku: i manually +2'd your MF backport - guessing it'll take a while to merge [20:10:10] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1059393|Roll out appearance menu and font size change to sister projects (T371020)]] [20:10:13] T371020: Roll out appearance menu and 
font size change to sister projects - https://phabricator.wikimedia.org/T371020 [20:10:21] Makes sense - thank you for thinking of that! [20:11:22] np! ya - it says 25 mins in zuul [20:11:28] rip [20:12:51] (03PS1) 10JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 [20:13:35] (03PS2) 10JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 [20:13:46] !log cjming@deploy1003 jdlrobson, cjming: Backport for [[gerrit:1059393|Roll out appearance menu and font size change to sister projects (T371020)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 (owner: 10JHathaway) [20:13:51] kimberly_sarabia: up on test servers if you want to verify - lmk if/when to sync [20:14:26] ok one moment [20:18:27] LGTM! [20:18:34] cjming: ^ [20:18:42] yay! syncing [20:18:44] !log cjming@deploy1003 jdlrobson, cjming: Continuing with sync [20:23:19] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1059393|Roll out appearance menu and font size change to sister projects (T371020)]] (duration: 13m 09s) [20:23:23] T371020: Roll out appearance menu and font size change to sister projects - https://phabricator.wikimedia.org/T371020 [20:24:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:24:12] kimberly_sarabia: should be live! 
[20:24:23] (03PS1) 10BCornwall: trafficserver: Conditionally set monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1072295 [20:24:27] toyofuku: doing your config patch next [20:24:35] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573 (10phaultfinder) 03NEW [20:24:35] thank you thank you [20:25:01] cjming: ty [20:25:01] (03PS4) 10Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) [20:26:10] (03CR) 10TrainBranchBot: "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:26:37] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10138926 (10Jclark-ctr) [Wed Sep 11 13:51:52 2024] sd 0:0:2:0: [sdc] tag#2137 CDB: Write(10) 2a 00 00 08 f0 10 00 00 08 00 [Wed Sep 11 13:51:52 2024] I/O error, dev sdc, sector 585744 op 0x1:(WRITE) flag... 
[20:26:52] (03Merged) 10jenkins-bot: Turn off feature flag to move donate link everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [20:27:13] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1071961|Turn off feature flag to move donate link everywhere (T373585)]] [20:27:17] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:29:08] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3957/console" [puppet] - 10https://gerrit.wikimedia.org/r/1072295 (owner: 10BCornwall) [20:30:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [20:30:43] !log cjming@deploy1003 cjming, toyofuku: Backport for [[gerrit:1071961|Turn off feature flag to move donate link everywhere (T373585)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:30:45] toyofuku: your config patch is up on mwdebug if you'd like to test - lmk if/when to sync [20:30:47] :eyes: [20:30:54] looking now! [20:31:41] (03PS2) 10BCornwall: trafficserver: no logging on disabled monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1072295 [20:32:05] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox [20:32:08] All good, thank you! 
[20:32:17] cool - syncing :) [20:32:22] !log cjming@deploy1003 cjming, toyofuku: Continuing with sync [20:32:59] toyofuku: i think your backport will finish merging in the next few mins - so perfect timing to do that one next [20:33:06] yayyy [20:33:37] Nemoralis: i'll plan on doing your patch afterwards [20:33:46] ok [20:35:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [20:35:10] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10138974 (10Jclark-ctr) jclark@prometheus1008:~$ for disk in $(lsblk -dn -o NAME); do echo "Device: /dev/$disk" udevadm info -q property -n /dev/$disk | grep -E "ID_SERIAL|ID_PATH" done Device: /... [20:35:57] (03Merged) 10jenkins-bot: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072274 (owner: 10Jdlrobson) [20:36:10] cjming: is running maintenance script (namespaceDupes) required? 
This patch will update Project: namespace [20:36:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T371742)', diff saved to https://phabricator.wikimedia.org/P68982 and previous config saved to /var/cache/conftool/dbconfig/20240911-203623-ladsgroup.json [20:36:27] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:36:37] cjming: i'm here too, sorry i'm late [20:36:54] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:36:56] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071961|Turn off feature flag to move donate link everywhere (T373585)]] (duration: 09m 42s) [20:36:56] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:36:59] T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585 [20:37:00] Nemoralis: i'm not sure actually - i don't think so somehow [20:37:16] cscott: no worries! you're early for your patch actually [20:37:55] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072274|Ensure that it is possible to override MFNamespacesWithLeadParagraphs]] [20:38:09] toyofuku: config patch should be live, deploying your backport now [20:38:16] Thank you! [20:38:19] yw! [20:38:46] (03PS3) 10NMW03: Update wgSitename for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) [20:38:57] looks like the other one just got merged [20:39:12] (03CR) 10Ssingh: [C:03+1] varnish: Conditionally monitor vcl reloads (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071935 (owner: 10BCornwall) [20:39:22] toyofuku: yes - i just scap backported that one - should be up on test servers here soon [20:39:32] Thank you! [20:40:23] (03CR) 10Ssingh: [C:03+1] "Check with vg once too but looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072295 (owner: 10BCornwall) [20:40:43] does anyone here know if we have to run namespace dupes script on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070347? I'm going to err on the side of not [20:42:33] !log cjming@deploy1003 jdlrobson, cjming: Backport for [[gerrit:1072274|Ensure that it is possible to override MFNamespacesWithLeadParagraphs]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:36] toyofuku: 2nd patch up on test servers if you want to check - please lmk if/when to sync [20:42:43] looking now! [20:42:50] doc page says "after adding a namespace (or interwiki prefix)". I am not sure [20:43:06] worked! thank you [20:43:14] nice - going live! [20:43:17] !log cjming@deploy1003 jdlrobson, cjming: Continuing with sync [20:43:56] Nemoralis: me neither [20:44:25] i feel like i've only had to run that script when the namespace dupes file was updated [20:45:36] (03PS3) 10JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 [20:45:43] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 (owner: 10JHathaway) [20:47:03] Nemoralis: i can run it afterwards - should be like 2 secs - i don't think it can hurt anything [20:47:08] sure [20:47:49] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072274|Ensure that it is possible to override MFNamespacesWithLeadParagraphs]] (duration: 09m 54s) [20:48:01] toyofuku: both your patches should be live! [20:48:08] thank you so much! [20:48:22] yw! 
[20:48:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: 10NMW03) [20:49:22] (03Merged) 10jenkins-bot: Update wgSitename for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: 10NMW03) [20:49:40] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1070347|Update wgSitename for tlywiki (T367009)]] [20:49:43] T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009 [20:51:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P68983 and previous config saved to /var/cache/conftool/dbconfig/20240911-205130-ladsgroup.json [20:51:37] !log cjming@deploy1003 cjming, nmw03: Backport for [[gerrit:1070347|Update wgSitename for tlywiki (T367009)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:51:58] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:52:28] Nemoralis: want to test? 
lmk when to sync [20:53:18] site name works fine, let me test namespace [20:56:14] oh, it looks like I forgot to update wgMetaNamespace [20:56:27] (03CR) 10Ebrahim: Enable the dark mode in Portal namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [20:56:37] I think you can continue to sync now, I will send another patch to update that [20:56:43] sure thing [20:56:45] !log cjming@deploy1003 cjming, nmw03: Continuing with sync [20:57:52] cscott: if you're still around i'll do your patch next [20:57:57] i'm here! [20:58:16] (03PS1) 10Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) [20:59:09] (03PS3) 10C. Scott Ananian: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T2100) [21:01:16] do i have time to squeeze in one more config patch before the abstract wikipedia folks have the window? [21:01:31] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070347|Update wgSitename for tlywiki (T367009)]] (duration: 11m 51s) [21:01:40] T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009 [21:01:45] thanks! [21:02:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: 10C. 
Scott Ananian) [21:02:33] Nemoralis: i ran the maint script - said there wasn't anything to fix fwiw [21:02:53] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: 10C. Scott Ananian) [21:03:10] cjming: probably because of wgMetaNamespace [21:03:14] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072192|Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage (T373229)]] [21:03:16] nevermind, thanks again [21:03:18] T373229: Deploy to next set of wikivoyages (ps,bn,hi,tr) week of Sep 9 - https://phabricator.wikimedia.org/T373229 [21:03:58] np! [21:05:19] !log cjming@deploy1003 cjming, cscott: Backport for [[gerrit:1072192|Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage (T373229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:25] cscott: up on test servers if you'd like to verify [21:05:28] ok, i'll check it out [21:06:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P68984 and previous config saved to /var/cache/conftool/dbconfig/20240911-210638-ladsgroup.json [21:06:52] cjming: looks good [21:06:58] awesome - syncing [21:07:02] !log cjming@deploy1003 cjming, cscott: Continuing with sync [21:11:33] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072192|Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage (T373229)]] (duration: 08m 19s) [21:11:37] T373229: Deploy to next set of wikivoyages (ps,bn,hi,tr) week of Sep 9 - https://phabricator.wikimedia.org/T373229 [21:12:06] thanks cjming ! 
[21:13:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [21:13:33] yw! [21:14:07] !log end of UTC late backport window [21:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [21:21:18] (03CR) 10Cwhite: "Ahhhh, I see what you mean now! The commit messages between the two are 97% identical - both mention activating the same services (Icinga," [puppet] - 10https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [21:21:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T371742)', diff saved to https://phabricator.wikimedia.org/P68985 and previous config saved to /var/cache/conftool/dbconfig/20240911-212145-ladsgroup.json [21:21:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [21:21:51] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:22:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [21:22:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68986 and previous config saved to 
/var/cache/conftool/dbconfig/20240911-212208-ladsgroup.json [21:22:18] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10139101 (10jhathaway) [21:22:38] (03CR) 10Cwhite: alert: Failover from alert1001 to alert2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [21:24:53] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:25:10] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:29:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:29:29] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:30:08] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:30:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10139158 (10Jclark-ctr) @ABran-WMF @wiki_willy I glanced at this for Val we need assistance troubleshooting from service owner. I was looking at console it is in emergency mode and n... 
[21:30:36] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:30:51] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:33:54] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ganeti10 - jclark@cumin1002" [21:33:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ganeti10 - jclark@cumin1002" [21:33:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:34:12] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:36:06] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10139152 (10Jdlrobson) [21:36:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139169 (10phaultfinder) [21:37:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:39:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:39:12] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:39:26] PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:40:03] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:40:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:40:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:40:56] PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:41:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:41:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:41:38] (03PS1) 10JHathaway: puppet8: add explicit typecast [puppet] - 10https://gerrit.wikimedia.org/r/1072301 (https://phabricator.wikimedia.org/T372664) [21:41:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:42:51] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072301 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [21:43:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:43:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART [21:44:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072189 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [21:44:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072190 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [21:44:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:45:13] (03PS3) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [21:45:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:47:02] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:47:14] (03PS1) 10JHathaway: puppet8: account for unknown probe types [puppet] - 10https://gerrit.wikimedia.org/r/1072303 (https://phabricator.wikimedia.org/T372664) [21:47:26] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072303 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [21:48:19] !log bking@deploy1003 test deploying flink operator in staging T373195 [21:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:23] T373195: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195 [21:48:57] !log jclark@cumin1002 START - 
Cookbook sre.network.configure-switch-interfaces for host ganeti1041 [21:49:47] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1040 [21:49:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1040 [21:50:04] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1039 [21:50:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1039 [21:50:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1041 [21:50:15] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1042 [21:50:18] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043 [21:50:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1042 [21:50:24] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1044 [21:50:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti1043 [21:51:05] (03Merged) 10jenkins-bot: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072189 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [21:51:18] (03Merged) 10jenkins-bot: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072190 (https://phabricator.wikimedia.org/T374241) (owner: 10Jforrester) [21:51:41] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1072189|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least 
temporarily (T374241)]], [[gerrit:1072190|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]] [21:51:44] T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [21:51:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1044 [21:52:10] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043 [21:53:26] (03CR) 10Jdlrobson: [C:03+1] Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [21:53:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1043 [21:53:32] (03PS4) 10Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) [21:53:34] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1045 [21:53:47] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1072189|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]], [[gerrit:1072190|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:54:08] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:54:18] (03PS4) 10JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 [21:54:19] (03PS15) 10Jdlrobson: Enable the dark mode in Portal namespace 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [21:54:26] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072290 (owner: 10JHathaway) [21:54:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1045 [21:54:44] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1046 [21:54:56] !log jforrester@deploy1003 jforrester: Continuing with sync [21:56:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1046 [21:56:13] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1047 [21:56:20] !log bking@deploy1003 test deploy of flink operator in staging cancelled with no changes T373195 [21:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:24] T373195: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195 [21:57:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1047 [21:57:30] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1048 [21:58:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1048 [21:58:44] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1049 [21:59:33] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072189|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]], [[gerrit:1072190|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]] (duration: 07m 51s) [21:59:36] T374241: wikifunctions.org failures in codfw with 
414 error - https://phabricator.wikimedia.org/T374241 [21:59:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1049 [21:59:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1050 [22:00:14] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:00:19] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:00:40] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:00:52] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:00:56] RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:01:02] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:01:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1050 [22:01:15] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1051 [22:01:18] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1052 [22:02:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1051 [22:03:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1052 [22:04:26] RECOVERY - Hadoop NodeManager on an-worker1100 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:07:04] (03PS1) 10Jforrester: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) [22:07:16] (03PS1) 10Jforrester: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) [22:09:28] PROBLEM - Webrequests Varnishkafka log producer on cp2027 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:10:28] RECOVERY - Webrequests Varnishkafka log producer on cp2027 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:11:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:12:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:13:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:13:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:14:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:14:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:48] (03CR) 10Jdlrobson: [C:03+1] Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [22:14:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:15:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:16:02] (03PS1) 10Ladsgroup: admin: Add Philippe Saade to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008) [22:16:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:16:41] (03PS16) 10Jdlrobson: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [22:17:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:09] (03CR) 10Jdlrobson: [C:03+1] "I restricted the liquid threads namespaces to the 5 wikis that still have it. This LGTM for deployment now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [22:17:17] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10139251 (10Ladsgroup) Almost ready, I need to check this https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group give me a bit. [22:17:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:18:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:18:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:18:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:19:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:19:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:19:11] (03PS2) 10Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439) [22:19:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:19:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set 
policy FORCE_RESTART [22:20:57] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:21:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:21:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1052.eqiad.wmnet with OS bookworm [22:21:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10139254 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm [22:26:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:26:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [22:26:10] 06SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582 (10MBinder_WMF) 03NEW [22:26:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:27:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T371742)', diff saved to 
https://phabricator.wikimedia.org/P68987 and previous config saved to /var/cache/conftool/dbconfig/20240911-222711-ladsgroup.json [22:27:15] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:27:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:27:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:28:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1041.eqiad.wmnet with OS bookworm [22:28:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10139277 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1041.eqiad.wmnet with OS bookworm [22:28:19] (03PS5) 10Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 [22:28:36] (03CR) 10CI reject: [V:04-1] Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 (owner: 10Jdlrobson) [22:29:57] (03PS1) 10Ladsgroup: admin: Add echukwukere to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386) [22:33:56] (03CR) 10CI reject: [V:04-1] SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:35:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) 
(owner: 10Jforrester) [22:35:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:36:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:36:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:38:11] (03Merged) 10jenkins-bot: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:38:36] (03Merged) 10jenkins-bot: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: 10Jforrester) [22:38:50] Finally. 
[22:38:56] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1072305|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]], [[gerrit:1072306|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]]
[22:39:00] T373830: Deprecated: Use of MediaWiki\Output\OutputPage::setCategoryLinks was deprecated [Called from MediaWiki\Specials\SpecialExpandTemplates::showHtmlPreview] - https://phabricator.wikimedia.org/T373830
[22:41:03] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1072305|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]], [[gerrit:1072306|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:41:50] !log jforrester@deploy1003 jforrester: Continuing with sync
[22:42:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P68988 and previous config saved to /var/cache/conftool/dbconfig/20240911-224218-ladsgroup.json
[22:46:24] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072305|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]], [[gerrit:1072306|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]] (duration: 07m 27s)
[22:46:29] T373830: Deprecated: Use of MediaWiki\Output\OutputPage::setCategoryLinks was deprecated [Called from MediaWiki\Specials\SpecialExpandTemplates::showHtmlPreview] - https://phabricator.wikimedia.org/T373830
[22:56:59] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10139315 (MBinder_WMF) {F57500778} config file attached after confirming with @Ladsgroup
[22:57:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P68989 and previous config saved to /var/cache/conftool/dbconfig/20240911-225726-ladsgroup.json
[23:03:08] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10139331 (Ladsgroup) I'm guessing but: - Instead of bast1002 or bast4003, use `bast4005.wikimedia.org` (depending on where you live). Otherwise, it'll...
[23:04:26] SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10139333 (Dzahn) > I tried phab1001.eqiad.wmnet and bast1002.eqiad.wmnet. Hi! The issue here is that these host names are outdated. Phabricator (Pho...
[23:12:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68990 and previous config saved to /var/cache/conftool/dbconfig/20240911-231233-ladsgroup.json
[23:12:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[23:12:37] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[23:12:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[23:12:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:13:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:13:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T371742)', diff saved to https://phabricator.wikimedia.org/P68991 and previous config saved to /var/cache/conftool/dbconfig/20240911-231311-ladsgroup.json
[23:13:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1052.eqiad.wmnet with OS bookworm
[23:13:50] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10139350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm executed w...
[23:18:38] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10139353 (Jclark-ctr) @Papaul i have updated bmc and bios with no change to server. can you assist with this last...
[23:18:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:19:15] (CR) Dzahn: "sorry, ignore my outdated comment" [puppet] - https://gerrit.wikimedia.org/r/1071964 (owner: Jasmine)
[23:19:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:22:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[23:24:10] (PS1) Dzahn: vrts: switch inactive host vrts2001 to nftables as firewall provider [puppet] - https://gerrit.wikimedia.org/r/1072313 (https://phabricator.wikimedia.org/T370677)
[23:25:06] (CR) Dzahn: [C:+1] "Thanks! made https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072313 to do it for just the inactive host as suggested" [puppet] - https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[23:28:51] (PS1) Jforrester: On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315
[23:29:37] (CR) CI reject: [V:-1] On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315 (owner: Jforrester)
[23:29:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139369 (phaultfinder)
[23:30:07] (CR) Dzahn: [V:+1 C:+2] gerrit: add backup::host, gerrit::migration etc to insetup role [puppet] - https://gerrit.wikimedia.org/r/1070683 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[23:37:58] (PS2) Dzahn: site: (WIP) try applying gerrit role on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804)
[23:38:40] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072316
[23:38:40] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072316 (owner: TrainBranchBot)
[23:39:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139376 (phaultfinder)
[23:56:34] (PS1) Andrea Denisse: alert: Enable the alert[12]002 hosts as alertmanagers [puppet] - https://gerrit.wikimedia.org/r/1072318 (https://phabricator.wikimedia.org/T372418)
[23:58:27] (CR) Dzahn: "remaining diff https://puppet-compiler.wmflabs.org/output/1063893/3959/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)