[00:06:23] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071978 (owner: TrainBranchBot)
[00:07:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:47] <wikibugs>	 (CR) Stang: "Where's "oathauth-enable"?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[00:09:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:14:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:17:03] <wikibugs>	 (PS1) Scott French: sre.switchdc.mediawiki: skip check_core_masters_in_sync in live-test [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649)
[00:34:12] <wikibugs>	 SRE, MediaWiki-libs-BagOStuff, MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-production-error: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786#10136181 (Krinkle)
[00:41:16] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:41:38] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:42:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:44:56] <wikibugs>	 (CR) Hamish: "Arbcom is one of $wmgPrivilegedGroups and hence default true for oathauth-enable." [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[00:45:56] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:47:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68867 and previous config saved to /var/cache/conftool/dbconfig/20240911-004743-ladsgroup.json
[00:47:47] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:48:12] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:28] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:50:58] <wikibugs>	 (CR) Stang: [C:+1] "thanks for clarification" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[00:51:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:00:51] <wikibugs>	 (PS3) Hamish: Add arbcom group to zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455)
[01:02:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P68868 and previous config saved to /var/cache/conftool/dbconfig/20240911-010250-ladsgroup.json
[01:04:21] <wikibugs>	 (CR) Stang: [C:+1] Add arbcom group to zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: Hamish)
[01:07:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:17:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P68869 and previous config saved to /var/cache/conftool/dbconfig/20240911-011758-ladsgroup.json
[01:33:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68870 and previous config saved to /var/cache/conftool/dbconfig/20240911-013305-ladsgroup.json
[01:33:07] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[01:33:09] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[01:33:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[01:33:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68871 and previous config saved to /var/cache/conftool/dbconfig/20240911-013327-ladsgroup.json
[01:57:31] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:02:31] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:12:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[02:17:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[02:20:57] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:30:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68872 and previous config saved to /var/cache/conftool/dbconfig/20240911-023058-ladsgroup.json
[02:31:04] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:36:14] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:30] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:44:41] <wikibugs>	 (CR) Andrea Denisse: "Hi Cole, no, they're meant to be two different commits as one is for failing over from alert1001 to alert2002 (also setting up alert2002 a" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[02:45:04] <wikibugs>	 (CR) Andrea Denisse: "Done" [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[02:46:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P68873 and previous config saved to /var/cache/conftool/dbconfig/20240911-024605-ladsgroup.json
[02:46:30] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:47:08] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp1110 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:48:08] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp1110 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:48:50] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:55:52] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:00:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P68874 and previous config saved to /var/cache/conftool/dbconfig/20240911-030112-ladsgroup.json
[03:01:52] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:02:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:10:56] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:14:58] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:16:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T371742)', diff saved to https://phabricator.wikimedia.org/P68875 and previous config saved to /var/cache/conftool/dbconfig/20240911-031621-ladsgroup.json
[03:16:23] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[03:16:26] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:16:36] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[03:16:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T371742)', diff saved to https://phabricator.wikimedia.org/P68876 and previous config saved to /var/cache/conftool/dbconfig/20240911-031643-ladsgroup.json
[03:17:00] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:35:08] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:36:26] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:41:10] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:41:28] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 70.42 ms
[03:51:14] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:19:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T371742)', diff saved to https://phabricator.wikimedia.org/P68877 and previous config saved to /var/cache/conftool/dbconfig/20240911-041922-ladsgroup.json
[04:19:26] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[04:34:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P68878 and previous config saved to /var/cache/conftool/dbconfig/20240911-043429-ladsgroup.json
[04:49:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P68879 and previous config saved to /var/cache/conftool/dbconfig/20240911-044936-ladsgroup.json
[04:51:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:04:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:38] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T371742)', diff saved to https://phabricator.wikimedia.org/P68880 and previous config saved to /var/cache/conftool/dbconfig/20240911-050444-ladsgroup.json
[05:04:46] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[05:04:48] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[05:04:48] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[05:05:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T371742)', diff saved to https://phabricator.wikimedia.org/P68881 and previous config saved to /var/cache/conftool/dbconfig/20240911-050506-ladsgroup.json
[05:09:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:22:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:27:38] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1072001
[05:51:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:56:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T371742)', diff saved to https://phabricator.wikimedia.org/P68882 and previous config saved to /var/cache/conftool/dbconfig/20240911-060444-ladsgroup.json
[06:04:48] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[06:08:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:19:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P68883 and previous config saved to /var/cache/conftool/dbconfig/20240911-061951-ladsgroup.json
[06:20:57] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:34:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P68884 and previous config saved to /var/cache/conftool/dbconfig/20240911-063458-ladsgroup.json
[06:36:03] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10136388 (ABran-WMF) Open→Resolved a:ABran-WMF thanks @Jhancock.wm for the follow up, will let you know if there is any issue
[06:36:16] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10136392 (ABran-WMF) p:Triage→Medium a:ABran-WMF→None
[06:40:20] <wikibugs>	 (CR) Slyngshede: [C:+2] Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112 (owner: Slyngshede)
[06:41:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Puppet agent: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1071885 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[06:42:48] <wikibugs>	 (Merged) jenkins-bot: Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112 (owner: Slyngshede)
[06:50:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T371742)', diff saved to https://phabricator.wikimedia.org/P68885 and previous config saved to /var/cache/conftool/dbconfig/20240911-065005-ladsgroup.json
[06:50:08] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[06:50:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[06:50:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T371742)', diff saved to https://phabricator.wikimedia.org/P68886 and previous config saved to /var/cache/conftool/dbconfig/20240911-065026-ladsgroup.json
[06:51:17] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1072105 (https://phabricator.wikimedia.org/T374512)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T0700).
[07:00:05] <jouncebot>	 sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:25] <sergi0>	 hello
[07:02:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 10%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68887 and previous config saved to /var/cache/conftool/dbconfig/20240911-070254-arnaudb.json
[07:06:15] <wikibugs>	 (PS1) Slyngshede: P:idp_test: Enable permission requests on testing. [puppet] - https://gerrit.wikimedia.org/r/1072107
[07:07:22] <wikibugs>	 (CR) Slyngshede: "NDA might not be the best group for testing, I'm open to other suggestions." [puppet] - https://gerrit.wikimedia.org/r/1072107 (owner: Slyngshede)
[07:09:12] <wikibugs>	 (PS1) Muehlenhoff: Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355)
[07:11:01] <sergi0>	 If no deployer is around, I can self-deploy
[07:11:28] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:40] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T374512
[07:11:55] <stashbot>	 T374512: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T374512
[07:12:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2140 with weight 0 T374512', diff saved to https://phabricator.wikimedia.org/P68888 and previous config saved to /var/cache/conftool/dbconfig/20240911-071205-arnaudb.json
[07:12:19] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T374512
[07:13:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2140 from API/vslow/dump T374512', diff saved to https://phabricator.wikimedia.org/P68889 and previous config saved to /var/cache/conftool/dbconfig/20240911-071335-arnaudb.json
[07:14:20] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:14:30] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 25%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68890 and previous config saved to /var/cache/conftool/dbconfig/20240911-071802-arnaudb.json
[07:18:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) (owner: Sergio Gimeno)
[07:18:32] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:32] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:10] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1072105 (https://phabricator.wikimedia.org/T374512) (owner: Gerrit maintenance bot)
[07:19:14] <wikibugs>	 (Merged) jenkins-bot: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) (owner: Sergio Gimeno)
[07:19:48] <logmsgbot>	 !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1062416|EventStreamConfig and stream registration for homepage modules analytics (T370907)]]
[07:19:51] <stashbot>	 T370907: Metrics Platform Integration: Agree on a stream name convention - https://phabricator.wikimedia.org/T370907
[07:21:35] <arnaudb>	 !log Starting s4 codfw failover from db2179 to db2140 - T374512
[07:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:38] <stashbot>	 T374512: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T374512
[07:22:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2140 to s4 primary T374512', diff saved to https://phabricator.wikimedia.org/P68891 and previous config saved to /var/cache/conftool/dbconfig/20240911-072210-arnaudb.json
[07:24:17] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[07:24:40] <logmsgbot>	 !log sgimeno@deploy1003 sgimeno: Backport for [[gerrit:1062416|EventStreamConfig and stream registration for homepage modules analytics (T370907)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:24:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 T374512', diff saved to https://phabricator.wikimedia.org/P68892 and previous config saved to /var/cache/conftool/dbconfig/20240911-072458-arnaudb.json
[07:26:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 T374512', diff saved to https://phabricator.wikimedia.org/P68893 and previous config saved to /var/cache/conftool/dbconfig/20240911-072612-arnaudb.json
[07:27:22] <wikibugs>	 (CR) Muehlenhoff: "Let's use cn=idptest-users. This was a group we once created for an external pen test of CAS and basically only grants access to the puppe" [puppet] - https://gerrit.wikimedia.org/r/1072107 (owner: Slyngshede)
[07:27:52] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:51] <wikibugs>	 (PS2) Muehlenhoff: Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355)
[07:29:08] <logmsgbot>	 !log sgimeno@deploy1003 sgimeno: Continuing with sync
[07:29:29] <wikibugs>	 (PS2) Slyngshede: P:idp_test: Enable permission requests on testing. [puppet] - https://gerrit.wikimedia.org/r/1072107
[07:29:33] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10136469 (ABran-WMF) T374512 done. all remaining hosts are either non prod critical or depoolable
[07:29:59] <wikibugs>	 (CR) Volans: sre.switchdc.mediawiki: skip check_core_masters_in_sync in live-test (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: Scott French)
[07:30:08] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[07:33:03] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 #page on db1166 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: kmwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:33:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 50%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68894 and previous config saved to /var/cache/conftool/dbconfig/20240911-073307-arnaudb.json
[07:33:10] <arnaudb>	 checking
[07:33:14] <vgutierrez>	 !incidents
[07:33:15] <sirenbot>	 5157 (UNACKED)  db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[07:33:15] <sirenbot>	 5156 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[07:33:15] <sirenbot>	 5155 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[07:33:15] <sirenbot>	 5152 (RESOLVED)  NELHigh sre (thanos-rule tcp.address_unreachable)
[07:33:16] <sirenbot>	 5151 (RESOLVED)  ProbeDown sre (10.2.2.25 ip4 prometheus-https:443 probes/service http_prometheus-https_ip4 eqiad)
[07:33:18] <vgutierrez>	 !ack 5157
[07:33:18] <sirenbot>	 5157 (ACKED)  db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[07:33:21] <arnaudb>	 thanks vgutierrez
[07:33:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] icinga: Add frlog2002 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1071970 (https://phabricator.wikimedia.org/T372933) (owner: 10Dwisehaupt)
[07:33:44] <logmsgbot>	 !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062416|EventStreamConfig and stream registration for homepage modules analytics (T370907)]] (duration: 13m 56s)
[07:33:47] <stashbot>	 T370907: Metrics Platform Integration: Agree on a stream name convention - https://phabricator.wikimedia.org/T370907
[07:34:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'prod issue kmwiki.pagelinks', diff saved to https://phabricator.wikimedia.org/P68895 and previous config saved to /var/cache/conftool/dbconfig/20240911-073420-arnaudb.json
[07:34:33] <arnaudb>	 host depooled, rebuilding the index
[07:36:23] <icinga-wm>	 PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:36:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: post fix', diff saved to https://phabricator.wikimedia.org/P68896 and previous config saved to /var/cache/conftool/dbconfig/20240911-073643-arnaudb.json
[07:36:45] <arnaudb>	 host is repooling
[07:36:50] <arnaudb>	 !resolve 5157
[07:36:50] <sirenbot>	 5157 (ACKED)  db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[07:37:03] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 #page on db1166 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:38:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071964 (owner: 10Jasmine)
[07:43:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete geoip templates [puppet] - 10https://gerrit.wikimedia.org/r/1072137 (https://phabricator.wikimedia.org/T366355)
[07:46:55] <wikibugs>	 (03CR) 10Effie Mouzeli: "It is not related. We are not setting activeDeadlineSeconds in the spec, but I will update the job module to support it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan)
[07:48:03] <wikibugs>	 (03PS1) 10JMeybohm: kafka-main: Replace kafka-main2003 with kafka-main2008 [puppet] - 10https://gerrit.wikimedia.org/r/1072138 (https://phabricator.wikimedia.org/T363210)
[07:48:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 75%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68897 and previous config saved to /var/cache/conftool/dbconfig/20240911-074813-arnaudb.json
[07:49:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10136495 (10dcaro)
[07:49:40] <jayme>	 !log evacuating leadership for all partitions assigned to broker id 2003 on kafka-main-codfw - T363210
[07:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:43] <stashbot>	 T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[07:49:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: post fix', diff saved to https://phabricator.wikimedia.org/P68898 and previous config saved to /var/cache/conftool/dbconfig/20240911-075149-arnaudb.json
[07:52:49] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[2003,2008].codfw.wmnet with reason: Hardware refresh
[07:53:04] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[2003,2008].codfw.wmnet with reason: Hardware refresh
[07:53:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T371742)', diff saved to https://phabricator.wikimedia.org/P68899 and previous config saved to /var/cache/conftool/dbconfig/20240911-075310-ladsgroup.json
[07:53:14] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[07:59:32] <wikibugs>	 (03PS1) 10Jelto: gitlab: rotate logfiles by date and size also in production [puppet] - 10https://gerrit.wikimedia.org/r/1072140 (https://phabricator.wikimedia.org/T374448)
[08:01:10] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3949/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072140 (https://phabricator.wikimedia.org/T374448) (owner: 10Jelto)
[08:01:56] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:03:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 100%: post db2137 → db2237 repool', diff saved to https://phabricator.wikimedia.org/P68903 and previous config saved to /var/cache/conftool/dbconfig/20240911-080319-arnaudb.json
[08:06:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: post fix', diff saved to https://phabricator.wikimedia.org/P68904 and previous config saved to /var/cache/conftool/dbconfig/20240911-080654-arnaudb.json
[08:08:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P68905 and previous config saved to /var/cache/conftool/dbconfig/20240911-080817-ladsgroup.json
[08:18:04] <elukey>	 jouncebot: next
[08:18:04] <jouncebot>	 In 1 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1000)
[08:19:48] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[08:20:16] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove obsolete geoip templates [puppet] - 10https://gerrit.wikimedia.org/r/1072137 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[08:22:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: post fix', diff saved to https://phabricator.wikimedia.org/P68906 and previous config saved to /var/cache/conftool/dbconfig/20240911-082200-arnaudb.json
[08:22:31] <wikibugs>	 (03PS1) 10Muehlenhoff: config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355)
[08:23:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P68907 and previous config saved to /var/cache/conftool/dbconfig/20240911-082324-ladsgroup.json
[08:23:30] <wikibugs>	 (03PS2) 10Muehlenhoff: config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355)
[08:25:41] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm
[08:25:50] <wikibugs>	 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10136524 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm
[08:26:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[08:26:47] <wikibugs>	 (03CR) 10Elukey: [C:03+1] config_master: Explicitly configure the server from which Puppet changes are merged [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[08:27:28] <icinga-wm>	 RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 201, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:58] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143
[08:28:56] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143
[08:29:06] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[08:29:42] <wikibugs>	 (03PS3) 10Muehlenhoff: config_master: Explicitly configure the server for Puppet merges [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355)
[08:35:20] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143
[08:35:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job dragonfly_supernode in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:35:58] <wikibugs>	 (03CR) 10Elukey: "My bad thanks for the patch! Added an alternative proposal, lemme know!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[08:36:53] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dragonfly-supernode1001.eqiad.wmnet with reason: host reimage
[08:38:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T371742)', diff saved to https://phabricator.wikimedia.org/P68908 and previous config saved to /var/cache/conftool/dbconfig/20240911-083831-ladsgroup.json
[08:38:33] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[08:38:37] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[08:38:47] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[08:38:52] <wikibugs>	 (03CR) 10Elukey: "Is it something that triggered a problem? Because the weekly rebuild is healthy to pick up security upgrades for the OS, without it we'll " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143 (owner: 10Ilias Sarantopoulos)
[08:38:54] <wikibugs>	 (03CR) 10Clément Goubert: sre.hosts.provision: Fix --no-users (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[08:39:23] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dragonfly-supernode1001.eqiad.wmnet with reason: host reimage
[08:40:44] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] httpbb: Move wikifunctions to its own test suite [puppet] - 10https://gerrit.wikimedia.org/r/1071919 (https://phabricator.wikimedia.org/T374442) (owner: 10Clément Goubert)
[08:41:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-debug: add initial "next" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071945 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French)
[08:42:21] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: Fix --no-users (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[08:42:22] <wikibugs>	 (03PS1) 10David Caro: cloudceph: add coludcephmon1006 to the pool [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005)
[08:45:18] <wikibugs>	 (03CR) 10Clément Goubert: sre.hosts.provision: Fix --no-users (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[08:45:31] <wikibugs>	 (03CR) 10David Caro: cloudceph: add coludcephmon1006 to the pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro)
[08:46:33] <wikibugs>	 (03PS2) 10David Caro: cloudceph: add coludcephmon1006 to the pool [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005)
[08:46:52] <wikibugs>	 (03CR) 10Elukey: [C:03+2] aux-services: update Docker images for Jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071872 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[08:48:37] <wikibugs>	 (03PS1) 10Muehlenhoff: pontoon: Remove Puppet 5 specific settings no longer relevant [puppet] - 10https://gerrit.wikimedia.org/r/1072147 (https://phabricator.wikimedia.org/T366355)
[08:49:22] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] ipoid: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071843 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli)
[08:49:52] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] "Cool, thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071752 (https://phabricator.wikimedia.org/T374414) (owner: 10Kosta Harlan)
[08:49:54] <wikibugs>	 (03CR) 10Klausman: "I concur with Luca that unless this causes a problem, we should keep doing weeklies. Since we (SRE) are working finding a way to expire un" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143 (owner: 10Ilias Sarantopoulos)
[08:50:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job dragonfly_supernode in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:51:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:53:19] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm
[08:53:31] <wikibugs>	 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10136579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host dragonfly-supernode1001.eqiad.wmnet with OS bookworm completed: - dragonfly-supernode1001 (**PASS**...
[08:53:47] <wikibugs>	 (03CR) 10David Caro: [C:04-1] cloudceph: add coludcephmon1006 to the pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro)
[08:54:21] <wikibugs>	 (03PS3) 10David Caro: cloudceph: add coludcephmon1006 to the pool [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005)
[08:54:26] <wikibugs>	 (03CR) 10David Caro: cloudceph: add coludcephmon1006 to the pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro)
[08:55:47] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[08:58:44] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] global_config: add the s3-eqiad-dpe external service [puppet] - 10https://gerrit.wikimedia.org/r/1071920 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[08:58:49] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: rotate logfiles by date and size also in production [puppet] - 10https://gerrit.wikimedia.org/r/1072140 (https://phabricator.wikimedia.org/T374448) (owner: 10Jelto)
[09:00:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete geoip templates [puppet] - 10https://gerrit.wikimedia.org/r/1072137 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[09:00:48] <moritzm>	 brouberol, jelto: I'll merge your changes along, ok?
[09:00:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro)
[09:01:26] <jelto>	 yes please go ahead 2a546bf3e8 :) brouberol was about to merge this also
[09:01:49] <wikibugs>	 (03PS1) 10Effie Mouzeli: cronjobs: add support for  activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149
[09:02:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[09:02:24] <wikibugs>	 (03PS1) 10Elukey: spark: force a rebuild to pick up OS package upgrades [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874)
[09:02:35] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "No issue occurred, I had been thinking about the size and then I bumped into this special case for Spark images." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143 (owner: 10Ilias Sarantopoulos)
[09:02:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cronjobs: add support for  activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli)
[09:02:50] <wikibugs>	 (03Abandoned) 10Ilias Sarantopoulos: remove pytorch from weekly rebuild [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072143 (owner: 10Ilias Sarantopoulos)
[09:02:55] <moritzm>	 ack, now all merged
[09:02:59] <jelto>	 thanks
[09:03:41] <wikibugs>	 (03CR) 10Elukey: "Weekly rebuild are not happening due to https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/976663, so a manual " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874) (owner: 10Elukey)
[09:04:01] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[09:05:47] <wikibugs>	 (03PS2) 10Clément Goubert: sre.hosts.provision: Fix --no-users [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372)
[09:05:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1071817 (https://phabricator.wikimedia.org/T374392) (owner: 10Cathal Mooney)
[09:10:46] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync
[09:11:11] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync
[09:11:27] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:12:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:12:59] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: store the connections.yaml content in a secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071908 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[09:14:57] <wikibugs>	 (03CR) 10David Caro: [C:03+2] cloudceph: add coludcephmon1006 to the pool [puppet] - 10https://gerrit.wikimedia.org/r/1072146 (https://phabricator.wikimedia.org/T374005) (owner: 10David Caro)
[09:18:56] <wikibugs>	 (03CR) 10Clément Goubert: sre.hosts.provision: Fix --no-users (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[09:20:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] config_master: Explicitly configure the server for Puppet merges [puppet] - 10https://gerrit.wikimedia.org/r/1072142 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[09:21:40] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:22:08] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[09:22:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:22:31] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] sre.hosts.provision: Fix --no-users [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[09:25:30] <wikibugs>	 (03PS2) 10Effie Mouzeli: cronjobs: add support for  activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149
[09:26:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cronjobs: add support for  activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli)
[09:27:26] <wikibugs>	 (03PS3) 10Effie Mouzeli: cronjobs: add support for  activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149
[09:30:05] <wikibugs>	 (03PS4) 10Effie Mouzeli: app.job: update to job 2.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149
[09:30:19] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki: parameterize PHP version via chart value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French)
[09:30:20] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply
[09:30:39] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply
[09:31:32] <wikibugs>	 (03PS3) 10Muehlenhoff: Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - 10https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355)
[09:32:34] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[09:33:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:33:31] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[09:33:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:33:44] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[09:34:55] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.provision: Fix --no-users [cookbooks] - 10https://gerrit.wikimedia.org/r/1071913 (https://phabricator.wikimedia.org/T365372) (owner: 10Clément Goubert)
[09:36:17] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[09:37:02] <wikibugs>	 (03PS5) 10Brouberol: airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787)
[09:38:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: enable s3 logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071909 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[09:39:08] <wikibugs>	 (03PS1) 10Elukey: jaeger: swap securityContext with podSecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072156 (https://phabricator.wikimedia.org/T369491)
[09:39:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] jaeger: swap securityContext with podSecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072156 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[09:40:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Looks like this change can be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071701 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1064828 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse)
[09:41:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:42:15] <fabfur>	 !log depooling cp4037 to test haproxykafka (T374473)
[09:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:18] <stashbot>	 T374473: Prepare puppet configuration to send haproxy logs to haproxykafka socket - https://phabricator.wikimedia.org/T374473
[09:42:41] <wikibugs>	 (03Abandoned) 10Elukey: jaeger: swap securityContext with podSecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072156 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[09:42:43] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache:haproxy: introduce extended logging on socket for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1071915 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur)
[09:42:49] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:42:56] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[09:43:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1072107 (owner: 10Slyngshede)
[09:46:53] <wikibugs>	 (03PS1) 10Elukey: jaeger: set securityContext for the oauth sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072157 (https://phabricator.wikimedia.org/T369491)
[09:52:56] <wikibugs>	 (03PS1) 10Fabfur: Fixed the haproxykafka uds path to reflect test configuration [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668)
[09:54:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM %request for poolcounter - https://phabricator.wikimedia.org/T374520 (10elukey) 03NEW
[09:55:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fixed the haproxykafka uds path to reflect test configuration [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[09:56:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM %request for poolcounter - https://phabricator.wikimedia.org/T374520#10136739 (10elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances |  MFree   | MFree avg |  DFree  | DFree avg | +---...
[09:56:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136740 (10elukey)
[09:56:52] <wikibugs>	 (03PS2) 10Fabfur: Fixed the haproxykafka uds path to reflect test configuration [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668)
[09:59:49] <wikibugs>	 (03PS3) 10Fabfur: cache:haproxykafka: fixed the haproxykafka uds path [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1000)
[10:00:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136771 (10elukey) @MoritzMuehlenhoff I'd proceed with the creation of `poolcounter2005` in row A if you are ok, using `sre.ganeti.makevm`.
[10:00:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136772 (10elukey)
[10:01:42] <wikibugs>	 (03PS4) 10Fabfur: cache:haproxy: fixed the haproxykafka uds path [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668)
[10:02:01] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache:haproxy: fixed the haproxykafka uds path [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[10:05:41] <wikibugs>	 (03PS1) 10Dreamy Jazz: Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277)
[10:06:23] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:06:23] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1000)
[10:06:23] <jouncebot>	 In 0 hour(s) and 53 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1100)
[10:07:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz)
[10:08:29] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache:haproxy: fixed the haproxykafka uds path [puppet] - 10https://gerrit.wikimedia.org/r/1072158 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[10:14:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz)
[10:14:48] <wikibugs>	 (03PS1) 10Elukey: profile::docker::reporter: add gitlab images to k8s_rules.ini [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432)
[10:15:20] <wikibugs>	 (03CR) 10Dreamy Jazz: "recheck" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz)
[10:19:06] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[10:20:57] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:21:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] pontoon: Remove Puppet 5 specific settings no longer relevant [puppet] - 10https://gerrit.wikimedia.org/r/1072147 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[10:22:51] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[10:26:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[10:27:01] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[10:30:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10136808 (10MoritzMuehlenhoff) +1
[10:31:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] pontoon: Remove Puppet 5 specific settings no longer relevant [puppet] - 10https://gerrit.wikimedia.org/r/1072147 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[10:33:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523 (10cmooney) 03NEW p:05Triage→03Medium
[10:35:35] <wikibugs>	 (03PS5) 10Effie Mouzeli: app.job: update to job 2.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149
[10:37:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish)
[10:38:49] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[10:38:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Trust DSCP markings from VMs on routed ganeti hypervisors [puppet] - 10https://gerrit.wikimedia.org/r/1071817 (https://phabricator.wikimedia.org/T374392) (owner: 10Cathal Mooney)
[10:40:27] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[10:46:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392#10136850 (10cmooney) 05Open→03Resolved Patch merged, working as expected: ` cmooney@ganeti2033:~$ cat /etc/nftables/postrouting/05_trust-vm-ds...
[10:47:30] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10136856 (10ABran-WMF) I'll get to T374425 to get to T374421 and unblock this T374523
[10:48:18] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10136865 (10cmooney) >>! In T374523#10136856, @ABran-WMF wrote: > I'll get to T374425 to get to T374421 and unblock this T374523  Thanks!
[10:48:31] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Move db2209 uplink from asw-c5-codfw to lsw1-c5-codfw - https://phabricator.wikimedia.org/T374523#10136866 (10cmooney)
[10:48:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10136867 (10cmooney)
[10:49:05] <wikibugs>	 (03PS1) 10Ladsgroup: wmnet: Add pc5-master [dns] - 10https://gerrit.wikimedia.org/r/1072167 (https://phabricator.wikimedia.org/T374496)
[10:50:21] <fabfur>	 !log repooling cp4037 to test haproxykafka (T374473)
[10:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:24] <stashbot>	 T374473: Prepare puppet configuration to send haproxy logs to haproxykafka socket - https://phabricator.wikimedia.org/T374473
[10:50:28] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[10:50:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2205.codfw.wmnet
[10:51:11] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[10:53:03] <wikibugs>	 (03CR) 10Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński)
[10:55:34] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2205.codfw.wmnet
[10:55:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on db2205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 79746.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:56:05] <arnaudb>	 (normal)
[10:56:33] <wikibugs>	 (03PS1) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168
[10:57:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup)
[10:59:21] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[11:00:04] <jouncebot>	 mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1100).
[11:01:13] <logmsgbot>	 !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@19cd97a]: (no justification provided)
[11:01:45] <logmsgbot>	 !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@19cd97a]: (no justification provided) (duration: 00m 32s)
[11:02:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:02:52] <wikibugs>	 (03PS2) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168
[11:03:04] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:03:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup)
[11:04:39] <wikibugs>	 (03PS3) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168
[11:05:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup)
[11:11:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] "I am mildly worried the regex might be a bit too broad, but that's mostly a worry I can't justify/quantify right now. Let's cross the brid" [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) (owner: 10Elukey)
[11:14:57] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2227.codfw.wmnet onto db2205.codfw.wmnet
[11:15:40] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[11:15:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance
[11:15:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T371742)', diff saved to https://phabricator.wikimedia.org/P68909 and previous config saved to /var/cache/conftool/dbconfig/20240911-111549-ladsgroup.json
[11:15:53] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[11:18:15] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mariadb: productionize db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb)
[11:18:15] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wmnet: Add pc5-master [dns] - 10https://gerrit.wikimedia.org/r/1072167 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup)
[11:18:32] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Sanity check, looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb)
[11:21:14] <wikibugs>	 (03PS9) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[11:21:47] <wikibugs>	 (03PS10) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[11:22:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Add an explicit Hiera variable to determine the active swift ring server [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355)
[11:22:05] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[11:23:04] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:23:12] <wikibugs>	 (03PS4) 10Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - 10https://gerrit.wikimedia.org/r/1072168
[11:23:50] <_joe_>	 !log uploaded conftool 3.2.3 to apt
[11:23:51] <wikibugs>	 (03PS1) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668)
[11:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[11:25:05] <wikibugs>	 (03PS2) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668)
[11:25:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dragonfly::supernode
[11:25:55] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[11:26:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch dragonfly-supernode to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072173 (https://phabricator.wikimedia.org/T349619)
[11:27:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:18] <wikibugs>	 (03PS1) 10Hnowlan: php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517)
[11:30:46] <wikibugs>	 (03PS1) 10Dreamy Jazz: IPInfoLogFormatter: Avoid unnecessary User object creation [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526)
[11:30:52] <wikibugs>	 (03PS11) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[11:31:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz)
[11:31:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch dragonfly-supernode to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072173 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:34:01] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[11:34:08] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] wmnet: Add pc5-master [dns] - 10https://gerrit.wikimedia.org/r/1072167 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup)
[11:34:17] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] kafka-main: Replace kafka-main2003 with kafka-main2008 [puppet] - 10https://gerrit.wikimedia.org/r/1072138 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm)
[11:35:11] <wikibugs>	 (03CR) 10Jcrespo: "❤️. Giving it a look." [software] - 10https://gerrit.wikimedia.org/r/1072168 (owner: 10Ladsgroup)
[11:35:48] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[11:37:47] <wikibugs>	 (03PS1) 10Hamish: Remove redundant oathauth-enable flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528)
[11:37:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dragonfly::supernode
[11:38:52] <wikibugs>	 (03PS12) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[11:41:29] <wikibugs>	 (03PS1) 10Hamish: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504)
[11:42:19] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[11:43:00] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10137027 (10MoritzMuehlenhoff)
[11:43:13] <wikibugs>	 (03CR) 10JMeybohm: "Would you mind rebasing this on top of a verbatim copy of job 2.0.0 modules to make the actual diff visible?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072149 (owner: 10Effie Mouzeli)
[11:43:38] <wikibugs>	 (03PS3) 10Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668)
[11:44:03] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw
[11:45:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, if some repos tend to be too noisy, we can still add excludes" [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) (owner: 10Elukey)
[11:45:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish)
[11:46:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish)
[11:48:37] <wikibugs>	 (03CR) 10Stang: [C:03+1] Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish)
[11:49:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Install poolcounter2005 with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072179 (https://phabricator.wikimedia.org/T332015)
[11:54:38] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 481, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:54:57] <wikibugs>	 (03CR) 10Bartosz Dziewoński: logging: Simplify extra debug logging configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 (owner: 10Bartosz Dziewoński)
[11:55:45] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr1-eqiad with reason: reconfigure equinix port into LAG
[11:55:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr1-eqiad with reason: reconfigure equinix port into LAG
[11:57:23] <wikibugs>	 (03PS1) 10JMeybohm: kafka-main: Fix regex for kafka-main in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072182 (https://phabricator.wikimedia.org/T363210)
[11:58:10] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] kafka-main: Fix regex for kafka-main in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072182 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm)
[11:58:47] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685
[11:58:52] <hashar>	 matmarex and I are changing MediaWiki logging config  https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1200
[11:59:07] <jayme>	 !incidents
[11:59:08] <sirenbot>	 5157 (RESOLVED)  db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[11:59:08] <sirenbot>	 5156 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[11:59:08] <sirenbot>	 5155 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[11:59:08] <sirenbot>	 5152 (RESOLVED)  NELHigh sre (thanos-rule tcp.address_unreachable)
[11:59:22] <jayme>	 hu?
[11:59:43] <jayme>	 vgutierrez: did you get a page for cr2-magru as well?
[11:59:47] <sukhe>	 yeah this was the old page
[11:59:52] <vgutierrez>	 nope
[12:00:01] <MatmaRex>	 hi hashar :)
[12:00:05] <jouncebot>	 hashar and MatmaRex: Deploy window MediaWiki logging configuration tweaks (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1200)
[12:00:07] <jayme>	 sukhe: from when was that one?
[12:00:12] <sukhe>	 which never resolved for some reason: T374401
[12:00:13] <stashbot>	 T374401: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401
[12:00:14] <vgutierrez>	 oh yeah
[12:00:14] <topranks>	 unsure what that page was - cr2-magru - router is up and online anyway, quick health check looks ok plus bgp stable for weeks etc 
[12:00:15] <vgutierrez>	 got here
[12:00:35] <jayme>	 ah, okay
[12:00:48] <sukhe>	 I am going to mark this as resolved and we can carry on discussing in the task why victorops didn't do so
[12:00:51] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow
[12:00:51] <sukhe>	 any objections?
[12:00:59] <sukhe>	 oh great, topranks did it
[12:01:00] <topranks>	 I sort of shrugged it off the previous time it happened, we'll need to take a closer look though 
[12:01:02] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 11s)
[12:01:04] <topranks>	 sukhe: yeah I already did 
[12:01:10] <sukhe>	 topranks: thanks
[12:01:12] <hashar>	 MatmaRex: I am checking your earlier comment :)
[12:01:12] <jayme>	 cool, thanks
[12:01:15] <topranks>	 just to stop any panic in its tracks 
[12:01:20] <sukhe>	 yep
[12:02:43] <wikibugs>	 (03PS1) 10Brouberol: airflow: fix the s3 logging integration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072183 (https://phabricator.wikimedia.org/T372787)
[12:03:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw
[12:03:17] <topranks>	 sukhe: oh sorry just realising this was the old page re-triggering ?
[12:03:23] <topranks>	 (reading scrollback) 
[12:05:10] <wikibugs>	 (03CR) 10Hashar: logging: Replace 'blackhole' handler with no handlers at all (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński)
[12:05:38] <hashar>	 MatmaRex: so essentially +1 on removing that blackhole
[12:05:45] <hashar>	 I guess yesterday I wanted to double check 
[12:05:57] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Awesome." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072183 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[12:06:18] <wikibugs>	 (03PS5) 10Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716
[12:06:18] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344
[12:06:19] <wikibugs>	 (03PS4) 10Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685
[12:06:30] <hashar>	 that is me rebasing the whole series
[12:06:37] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: fix the s3 logging integration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072183 (https://phabricator.wikimedia.org/T372787) (owner: 10Brouberol)
[12:06:41] <sukhe>	 topranks: yep! same old page
[12:06:47] <sukhe>	 not a new one
[12:06:59] <topranks>	 ah ok 
[12:07:04] <hashar>	 and I guess we can do the first one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069716
[12:07:07] <MatmaRex>	 hashar: no problem. we should probably test that in production again when actually enabling the default log channel
[12:07:09] <sukhe>	 so the only question is why it didn't resolve
[12:07:11] <topranks>	 probably doesn't deserve too much more detective work on the cr / alerting side then 
[12:07:15] <topranks>	 yeah 
[12:07:18] <sukhe>	 olly is looking into that 
[12:07:21] <MatmaRex>	 hashar: whenever you're ready
[12:07:21] <sukhe>	 topranks: yeah
[12:07:26] <hashar>	 lets do
[12:07:26] <topranks>	 ok
[12:07:27] <topranks>	 cool
[12:08:04] <topranks>	 !log test bundling xe-3/0/6 into ae6 on cr1-eqiad T370696
[12:08:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 (owner: 10Bartosz Dziewoński)
[12:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:34] <MatmaRex>	 hashar: these changes don't really have anything obvious to test on mwdebug btw. they all should have no effect. i think we can just look at logstash afterwards and verify that the volume of logs didn't change
[12:08:48] <hashar>	 yeah that was my idea
[12:08:48] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Fix local variables leaking into global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 (owner: 10Bartosz Dziewoński)
[12:08:49] <hashar>	 :)
[12:09:10] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1069716|logging: Fix local variables leaking into global scope]]
[12:09:19] <hashar>	 I think at some point I wanted to craft a CI job that would generate the logging configuration diff
[12:09:40] <hashar>	 well the whole diff actually
[12:09:44] <hashar>	 but that is not easily doable
[12:09:58] <hashar>	 maybe that patch makes it easier now 
[12:10:29] <hashar>	 https://grafana.wikimedia.org/d/000000102/production-logging might be the best place to watch for breakage/log vanishing
[12:10:56] <MatmaRex>	 hmm, maybe it could be included in the diffConfig jobs somehow? i don't know how that works
[12:11:14] <hashar>	 I think that one iterates over each db
[12:11:16] <MatmaRex>	 but it looks like it just makes some JSON files and diffs them. the logging config should be JSON-serializable too, so the same approach should work
[12:11:17] <hashar>	 but yeah possibly
[12:11:20] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1069716|logging: Fix local variables leaking into global scope]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:12:26] <MatmaRex>	 pretty dashboard
[12:12:35] <jayme>	 !log restoring leadership for partitions assigned to broker id 2003 on kafka-main-codfw - T363210
[12:12:37] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[12:12:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:38] <stashbot>	 T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[12:13:02] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on db2205 is OK: OK slave_sql_lag Replication lag: 5.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:13:38] <wikibugs>	 (03PS1) 10Hamish: Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323)
[12:14:02] <MatmaRex>	 what happened on 2024-09-05? :o https://phabricator.wikimedia.org/F57499437
[12:14:44] <wikibugs>	 (03PS1) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187
[12:15:13] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Continuing with sync
[12:15:22] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:15:44] <hashar>	 MatmaRex: some new mediawiki code landed / DBAs broke the infra? :)
[12:15:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2227.codfw.wmnet onto db2205.codfw.wmnet
[12:16:01] <hashar>	 and that graph is relative grr
[12:16:27] <hashar>	 anyway one can search in logstash 
[12:16:52] <MatmaRex>	 and the log volume logs below and log scale, so you can barely see that we double the number of WARNINGs
[12:17:03] <MatmaRex>	 well, the wikis did not fall over, so it's not too bad
[12:17:10] <MatmaRex>	 but i will try to find out what it was
[12:17:11] <hashar>	 yeah that dashboard is nice but has several usability problems indeed
[12:17:22] <MatmaRex>	 are log-scale*
[12:17:29] <wikibugs>	 (03PS2) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187
[12:17:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[12:18:01] <hashar>	 Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}): {query} 
[12:18:01] <hashar>	 Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met (actual: {actualSeconds}): {query} 
[12:18:19] <hashar>	 roughly 720 000 of them per hour :)
[12:18:25] <topranks>	 !log re-activate Equinix IXP peers on cr1-eqiad T370696
[12:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:55] <wikibugs>	 (03PS1) 10David Caro: typos: add colud to the list [puppet] - 10https://gerrit.wikimedia.org/r/1072188
[12:19:47] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069716|logging: Fix local variables leaking into global scope]] (duration: 10m 38s)
[12:20:13] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 (owner: 10Brouberol)
[12:20:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T371742)', diff saved to https://phabricator.wikimedia.org/P68912 and previous config saved to /var/cache/conftool/dbconfig/20240911-122056-ladsgroup.json
[12:21:00] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[12:21:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:22:04] <MatmaRex>	 it looks like we just got some exceptions like this: "JobQueueError: Could not enqueue jobs"
[12:22:09] <MatmaRex>	 which is hopefully unrelated to the deploy
[12:22:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 22.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:22:49] <wikibugs>	 (03PS3) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187
[12:23:01] <hashar>	 yeah I think it is fine
[12:23:28] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:23:29] <MatmaRex>	 started at 12:17 (6 minutes ago)
[12:23:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:23:42] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:23:43] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:23:58] <hashar>	 hmm
[12:24:06] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:24:08] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:24:15] <wikibugs>	 (03PS1) 10Jforrester: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072189 (https://phabricator.wikimedia.org/T374241)
[12:24:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:24:21] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:24:27] <wikibugs>	 (03PS1) 10Jforrester: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072190 (https://phabricator.wikimedia.org/T374241)
[12:24:30] <MatmaRex>	 and it's still ongoing
[12:24:42] <hashar>	 what have we broken
[12:24:47] <MatmaRex>	 also a lot of "The maximum execution time of 60 seconds was exceeded"
[12:24:51] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:24:52] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:24:55] <MatmaRex>	 it still looks like a coincidence to me
[12:25:06] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:25:08] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:25:10] <MatmaRex>	 (i'm looking here: https://logstash.wikimedia.org/goto/eea316bb32ebedad09d7e1640283513c)
[12:25:24] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: productionize db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1071883 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb)
[12:25:40] <MatmaRex>	 and it looks like the exceptions stopped happening
[12:25:42] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:25:43] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:25:45] <MatmaRex>	 🤷‍♂️
[12:25:56] <wikibugs>	 (03PS4) 10Brouberol: airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187
[12:26:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:26:18] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:26:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:26:21] <MatmaRex>	 i don't know, maybe somebody just did some bot things too quickly?
[12:26:34] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:26:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:26:40] <hashar>	 well exception rate has exploded for sure  https://grafana-rw.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m&from=now-1h&to=now&viewPanel=19
[12:26:43] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10137107 (10jcrespo) I will want to stop ms backups at codfw for backup2011 before it happens. No big deal if I don't do it (ju...
[12:26:45] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:26:46] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10137114 (10jcrespo) I will want to stop ms backups at codfw for backup2007 before it happens. No big deal if I don't do it (ju...
[12:27:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 23.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:27:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: introduce a values files common to all airflow instances in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072187 (owner: 10Brouberol)
[12:27:34] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579
[12:27:37] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[12:27:49] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579
[12:27:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579
[12:27:52] <MatmaRex>	 200 per minute is maybe a minor fire in the kitchen, not an explosion ;)
[12:28:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: provisionning db2238.codfw.wmnet - T373579
[12:28:34] <moritzm>	 !log installing glibc bugfix updates from bookworm 12.7 point release
[12:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:41] <hashar>	 seriously
[12:28:46] <hashar>	 our whole infrastructure is crippled :/
[12:28:54] <MatmaRex>	 heh
[12:29:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2138 in db2238 for T373579', diff saved to https://phabricator.wikimedia.org/P68913 and previous config saved to /var/cache/conftool/dbconfig/20240911-122910-arnaudb.json
[12:29:12] <hashar>	 and there are a bunch of messages in the `jsonTruncated` channel
[12:29:21] <hashar>	 which are log messages being too long to be parsed by the logging stack
[12:29:25] <hashar>	 so they end up mostly ignored
[12:29:29] <hashar>	 hiding real problems
[12:29:29] <hashar>	 grr
[12:29:51] <hashar>	 that is wikifunctions requests timing out, it happened last week already
[12:30:21] <MatmaRex>	 yeah looks like it's all RequestTimeoutException with a realllllyyy long stack trace
[12:30:38] <hashar>	 yeah I think we had some talk about it on friday
[12:31:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:31:50] <MatmaRex>	 (because it times out while validating some really big nested recursive structure, apparently)
[12:31:57] <hashar>	 yeah
[12:32:04] <MatmaRex>	 hashar: anyway. i think we can proceed with the next patches, if you're happy with them
[12:32:04] <hashar>	 so that is more log spam we have to manage
[12:32:23] <hashar>	 the thousand jobs not enqueuing, I don't think it is related at all
[12:32:56] <jinxer-wm>	 FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:33:02] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2138.codfw.wmnet onto db2238.codfw.wmnet
[12:33:32] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:33:51] <hashar>	 pff
[12:34:00] <hashar>	 I will proceed with the next one
[12:34:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński)
[12:34:26] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad
[12:35:00] <hashar>	 sometimes I feel we could use a #mediawiki-operations channel to cut from the rest of the wmf operations :]
[12:35:07] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński)
[12:35:10] <hashar>	 or #wikimedia-mw-infra :D
[12:35:24] <MatmaRex>	 hashar: btw the new rdbms warnings, i think they're all here: https://logstash.wikimedia.org/goto/1f5d398c8c6f7ceb2ea570bd57d22564 they're "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" with ExternalStoreDB in the stack trace
[12:35:27] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1069344|logging: Replace 'blackhole' handler with no handlers at all]]
[12:35:47] <MatmaRex>	 i wish the cookbook or whatever would not emit 5 log messages for every action. it's really difficult to read in the SAL later too
[12:36:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P68914 and previous config saved to /var/cache/conftool/dbconfig/20240911-123603-ladsgroup.json
[12:37:33] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1069344|logging: Replace 'blackhole' handler with no handlers at all]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:37:41] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Continuing with sync
[12:37:56] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:38:12] <MatmaRex>	 (i'll file a bug for the rdbms warnings, they seem to not be known)
[12:38:50] <hashar>	 thanks
[12:38:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "Good shout LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1071814 (owner: 10Muehlenhoff)
[12:39:41] <wikibugs>	 (03PS1) 10Jdrewniak: Enable Web team search suggestions survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039)
[12:39:53] <wikibugs>	 (03PS2) 10Jdrewniak: Enable Web team search suggestions survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039)
[12:40:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak)
[12:41:55] <wikibugs>	 (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229)
[12:42:11] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069344|logging: Replace 'blackhole' handler with no handlers at all]] (duration: 06m 43s)
[12:42:36] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:42:50] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Install poolcounter2005 with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1072179 (https://phabricator.wikimedia.org/T332015) (owner: 10Muehlenhoff)
[12:42:56] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:44:36] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:44:59] <James_F>	 hashar, MatmaRex: Thank you both for working on that!
[12:47:11] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host poolcounter2005.codfw.wmnet
[12:47:12] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[12:47:45] <hashar>	 and the last one 
[12:48:28] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add gitlab images to k8s_rules.ini [puppet] - 10https://gerrit.wikimedia.org/r/1072163 (https://phabricator.wikimedia.org/T373432) (owner: 10Elukey)
[12:49:45] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "FWIW upstream took a similar patch from me very quickly" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072157 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[12:49:57] <wikibugs>	 (03PS5) 10Slyngshede: PermissionRequest validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1060812
[12:50:18] <MatmaRex>	 thanks hashar. logging still looks happy after the last changes
[12:50:32] <MatmaRex>	 and i filed https://phabricator.wikimedia.org/T374534 about the rdbms WARNING logs
[12:50:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 (owner: 10Bartosz Dziewoński)
[12:50:53] <hashar>	 awesome
[12:51:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P68915 and previous config saved to /var/cache/conftool/dbconfig/20240911-125110-ladsgroup.json
[12:51:55] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Simplify extra debug logging configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070685 (owner: 10Bartosz Dziewoński)
[12:51:58] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:52:11] <wikibugs>	 (03CR) 10Elukey: "Left a note, lemme know what you think about it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[12:52:16] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1070685|logging: Simplify extra debug logging configuration]]
[12:52:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2005.codfw.wmnet - elukey@cumin1002"
[12:52:32] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter2005.codfw.wmnet - elukey@cumin1002"
[12:52:33] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:52:33] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache poolcounter2005.codfw.wmnet on all recursors
[12:52:36] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) poolcounter2005.codfw.wmnet on all recursors
[12:53:03] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2005.codfw.wmnet - elukey@cumin1002"
[12:53:08] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter2005.codfw.wmnet - elukey@cumin1002"
[12:53:26] <wikibugs>	 (03CR) 10Elukey: [C:03+2] jaeger: set securityContext for the oauth sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072157 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[12:53:40] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:54:06] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host poolcounter2005.codfw.wmnet with OS bookworm
[12:54:21] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Backport for [[gerrit:1070685|logging: Simplify extra debug logging configuration]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:54:30] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) (owner: 10Hamish)
[12:54:31] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "lgtm!! thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1070920 (https://phabricator.wikimedia.org/T372411) (owner: 10Filippo Giunchedi)
[12:54:37] <logmsgbot>	 !log hashar@deploy1003 matmarex, hashar: Continuing with sync
[12:55:10] <wikibugs>	 (03PS1) 10AikoChou: admin_ng/LiftWing: add revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902)
[12:55:16] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: sync
[12:55:36] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: sync
[12:55:42] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:55:46] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: (WIP) amd-pytorch: add vllm for ROCm to pytorch 2.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072194 (https://phabricator.wikimedia.org/T370149)
[12:56:20] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1010 is OK: SSL OK - Certificate kafka-jumbo1010.eqiad.wmnet valid until 2025-08-17 13:15:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[12:57:16] <hashar>	 MatmaRex: I have changed the graph of logs by channels to use absolute values instead of relative/percentage ones
[12:57:24] <hashar>	 and sorted them by number of entries (total)  https://grafana-rw.wikimedia.org/d/000000102/mediawiki-production-logging
[12:57:29] <hashar>	 https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging
[12:57:34] <hashar>	 for the read-only link
[12:57:54] <hashar>	 for the severity, my guess is we would need to repeat the panel for each severity
[12:57:58] <MatmaRex>	 hashar: nice
[12:58:03] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[12:58:06] <hashar>	 but when I query the severity values, I get a bunch of nonsense values :/
[12:58:11] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:58:15] <hashar>	 besides DEBUG/ERROR/INFO/NOTICE/WARNING
[12:58:19] <hashar>	 so hmm I don't know
[12:58:20] <wikibugs>	 (03CR) 10Muehlenhoff: Add an explicit Hiera variable to determine the active swift ring server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[12:58:36] <hashar>	 oh the levels are graphed independently at the bottom
[12:59:03] <MatmaRex>	 hashar: i think my next steps, later this week or next week, will be to enable the @default channel on testwiki, and if that doesn't break the world, then enable it everywhere (so basically, your original patch, just updated to fit my other changes)
[12:59:10] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070685|logging: Simplify extra debug logging configuration]] (duration: 06m 53s)
[12:59:19] <Dreamy_Jazz>	 jouncebot: nowandnext
[12:59:20] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): MediaWiki logging configuration tweaks (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1200)
[12:59:20] <jouncebot>	 In 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1300)
[12:59:30] <MatmaRex>	 heh, perfect timing
[12:59:41] <Dreamy_Jazz>	 :D
[12:59:56] <hashar>	 so hello and welcome to the backport window
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1300).
[13:00:05] <jouncebot>	 JustHannah, Dreamy_Jazz, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <hashar>	 we might have broken logging :]
[13:00:12] <Dreamy_Jazz>	 \o
[13:00:28] <MatmaRex>	 hashar: if you have the editor open already, want to also change the per-level charts to not be log-scale? so that spikes are actually visible?
[13:00:43] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz)
[13:00:45] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] IPInfoLogFormatter: Avoid unnecessary User object creation [extensions/IPInfo] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: 10Dreamy Jazz)
[13:00:50] <MatmaRex>	 (i scheduled some unrelated no-op cleanup patches for the backport window)
[13:01:08] <hashar>	 MatmaRex: the problem is that debug has a large number of entries so e.g. a warning spike would not show up at all
[13:01:15] <wikibugs>	 (03CR) 10Elukey: Add an explicit Hiera variable to determine the active swift ring server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072171 (https://phabricator.wikimedia.org/T366355) (owner: 10Muehlenhoff)
[13:01:17] <hashar>	 I think we need another graph that highlights the spikes/changes of rate
[13:01:38] <MatmaRex>	 hmm
[13:01:41] <wikibugs>	 (03PS2) 10Hokwelum: Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492)
[13:02:08] <MatmaRex>	 hashar: oh, i don't mean the "MW logs by severity" chart, i mean only the "MW logs (INFO)" etc. charts below
[13:02:15] <MatmaRex>	 those that just have one data series on them
[13:02:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:29] <Dreamy_Jazz>	 Can the window proceed then?
[13:02:35] <Dreamy_Jazz>	 If logs are broken
[13:02:56] <MatmaRex>	 Dreamy_Jazz: yeah, they're not broken :)
[13:03:04] <Dreamy_Jazz>	 :D
[13:03:07] <Dreamy_Jazz>	 Thanks
[13:03:09] <MatmaRex>	 we're just tweaking a dashboard
[13:03:19] <MatmaRex>	 https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging
[13:03:28] <JustHannah>	 Dreamy_Jazz: I'm here
[13:03:28] <wikibugs>	 (03PS1) 10Elukey: admin_ng: enforce PSS for the AUX cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072196 (https://phabricator.wikimedia.org/T369491)
[13:03:34] <Dreamy_Jazz>	 Hello
[13:04:02] <JustHannah>	 Dreamy_Jazz: Please proceed
[13:04:05] <Hamishcz1>	 I'm sorry maybe I missed some messages, but why reschedule my deployments? 
[13:04:39] <Dreamy_Jazz>	 They shouldn't have been
[13:05:10] <Dreamy_Jazz>	 MatmaRex: Did you deliberately remove the other changes in the window?
[13:05:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10137235 (10MoritzMuehlenhoff)
[13:06:01] <MatmaRex>	 Dreamy_Jazz: aaargh. nope
[13:06:14] <hashar>	 MatmaRex: ah true, I will change them
[13:06:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T371742)', diff saved to https://phabricator.wikimedia.org/P68916 and previous config saved to /var/cache/conftool/dbconfig/20240911-130618-ladsgroup.json
[13:06:20] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[13:06:21] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) (owner: 10Hokwelum)
[13:06:21] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[13:06:32] <wikibugs>	 (03PS1) 10AikoChou: hiera/deployment-server: create revision-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902)
[13:06:33] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance
[13:06:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T371742)', diff saved to https://phabricator.wikimedia.org/P68917 and previous config saved to /var/cache/conftool/dbconfig/20240911-130639-ladsgroup.json
[13:06:55] <Dreamy_Jazz>	 JustHannah: I can deploy your change. Can you test it? I see that the default is now `true`, but want to make sure it still works as expected.
[13:07:00] <wikibugs>	 (03Merged) 10jenkins-bot: Remove ResourceLoaderUseObjectCacheForDeps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071838 (https://phabricator.wikimedia.org/T343492) (owner: 10Hokwelum)
[13:07:19] <MatmaRex>	 Dreamy_Jazz: i undid that. sorry, i guess i didn't notice that when editing
[13:07:28] <Dreamy_Jazz>	 Thanks
[13:07:40] <JustHannah>	 Dreamy_Jazz: Yes I can test it!
[13:07:41] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on poolcounter2005.codfw.wmnet with reason: host reimage
[13:07:41] <MatmaRex>	 i'll schedule my cleanup for some other time. maybe one day the window will not be full
[13:07:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:07:46] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:07:46] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng/LiftWing: add revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou)
[13:07:56] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:07:57] <MatmaRex>	 hashar: thanks. and thanks for deploying :D
[13:08:00] <Hamishcz1>	 MatmaRex, thank you:)
[13:08:45] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish)
[13:09:28] <wikibugs>	 (03Merged) 10jenkins-bot: Add arbcom group to zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071902 (https://phabricator.wikimedia.org/T374455) (owner: 10Hamish)
[13:09:30] <wikibugs>	 (03CR) 10Klausman: [C:03+1] hiera/deployment-server: create revision-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou)
[13:09:48] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:10:24] <wikibugs>	 (03Merged) 10jenkins-bot: Generate special page name in English for central URLs [extensions/GlobalBlocking] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072159 (https://phabricator.wikimedia.org/T374277) (owner: 10Dreamy Jazz)
[13:10:29] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Remove redundant oathauth-enable flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: 10Hamish)
[13:10:50] <wikibugs>	 (03CR) 10Klausman: [C:03+2] hiera/deployment-server: create revision-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou)
[13:10:53] <wikibugs>	 (03PS2) 10Hamish: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504)
[13:11:11] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on poolcounter2005.codfw.wmnet with reason: host reimage
[13:11:57] <wikibugs>	 (03CR) 10CDanis: [C:03+1] admin_ng: enforce PSS for the AUX cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072196 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[13:12:04] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish)
[13:12:07] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: enforce PSS for the AUX cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072196 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[13:12:32] <Hamishcz>	 Dreamy_Jazz, and thank you lolll
[13:12:45] <wikibugs>	 (03Merged) 10jenkins-bot: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072178 (https://phabricator.wikimedia.org/T374504) (owner: 10Hamish)
[13:12:46] <Dreamy_Jazz>	 Waiting for one change to finish gate-and-submit-wmf
[13:12:50] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:12:52] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:12:57] <Dreamy_Jazz>	 Then will start on the config patches that I've +2'd
[13:13:07] <wikibugs>	 (PS2) Hamish: Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323)
[13:13:12] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:13:38] <Hamishcz>	 sure np
[13:14:00] <wikibugs>	 (CR) Dreamy Jazz: [C:+2] Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) (owner: Hamish)
[13:14:08] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:14:26] <Dreamy_Jazz>	 jan_drewniak: You around for the window?
[13:14:39] <wikibugs>	 (Merged) jenkins-bot: Raise RelatedArticlesCardLimit to 9 in zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1072186 (https://phabricator.wikimedia.org/T374323) (owner: Hamish)
[13:14:41] <wikibugs>	 (PS3) Jdrewniak: Enable Web team search suggestions survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039)
[13:14:44] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:15:01] <jan_drewniak>	 Dreamy_Jazz: hey! Yes I'm here
[13:15:12] <Dreamy_Jazz>	 I can deploy it. Can you test it?
[13:15:13] <jan_drewniak>	 Bit of a last minute addition
[13:15:18] <jan_drewniak>	 Yes
[13:15:33] <wikibugs>	 (CR) Muehlenhoff: [C:+2] nftables-compat-check: Don't flag dscp_default as needing conversion [puppet] - https://gerrit.wikimedia.org/r/1071814 (owner: Muehlenhoff)
[13:16:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: Dreamy Jazz)
[13:16:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: Hamish)
[13:16:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[13:16:35] <wikibugs>	 (CR) Klausman: [V:+1 C:+2] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3952/co" [puppet] - https://gerrit.wikimedia.org/r/1072197 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[13:16:38] <wikibugs>	 (PS2) Hamish: Remove redundant oathauth-enable flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528)
[13:16:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: Dreamy Jazz)
[13:16:45] <wikibugs>	 (CR) TrainBranchBot: "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: Hamish)
[13:16:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[13:17:14] <wikibugs>	 (CR) Klausman: [C:+2] admin_ng/LiftWing: add revision-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[13:17:20] <wikibugs>	 (Merged) jenkins-bot: Enable Web team search suggestions survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1072191 (https://phabricator.wikimedia.org/T373039) (owner: Jdrewniak)
[13:17:25] <wikibugs>	 (Merged) jenkins-bot: Remove redundant oathauth-enable flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1072177 (https://phabricator.wikimedia.org/T374528) (owner: Hamish)
[13:17:42] <Dreamy_Jazz>	 Should be about 7 or so mins before the process starts - Still waiting on a slow test job.
[13:17:51] <jan_drewniak>	 Np
[13:19:46] <wikibugs>	 (CR) Btullis: [C:+1] "Many thanks." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874) (owner: Elukey)
[13:20:44] <wikibugs>	 (Merged) jenkins-bot: admin_ng/LiftWing: add revision-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1072193 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[13:21:10] <Dreamy_Jazz>	 MatmaRex: Do you want me to ping you if there is enough time in the window to do your logging changes?
[13:21:19] <Dreamy_Jazz>	 jouncebot: nowandnext
[13:21:19] <jouncebot>	 For the next 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1300)
[13:21:20] <jouncebot>	 In 0 hour(s) and 38 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1400)
[13:21:53] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[13:21:56] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[13:22:05] <MatmaRex>	 Dreamy_Jazz: thanks, but we definitely won't make it in 38 minutes, and i have to leave soon afterwards
[13:22:14] <MatmaRex>	 these changes can wait, they do nothing :)
[13:22:17] <Dreamy_Jazz>	 Okay.
[13:22:21] <Dreamy_Jazz>	 :D
[13:23:07] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[13:23:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye
[13:24:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: NMW03)
[13:24:09] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[13:24:37] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536 (MoritzMuehlenhoff) NEW
[13:24:45] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10137289 (MoritzMuehlenhoff) p:Triage→Medium
[13:25:20] <Dreamy_Jazz>	 Still waiting on test jobs - Watching it slowly process the tests is almost like watching paint dry :D
[13:25:24] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[13:26:05] <Hamishcz>	 Soft fire makes sweet malt :0 
[13:26:14] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[13:26:23] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:26:45] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:26:57] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host poolcounter2005.codfw.wmnet with OS bookworm
[13:26:57] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host poolcounter2005.codfw.wmnet
[13:26:57] <wikibugs>	 (PS1) Elukey: kubernetes: disable PSP for the AUX cluster [puppet] - https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491)
[13:27:16] <wikibugs>	 (Merged) jenkins-bot: IPInfoLogFormatter: Avoid unnecessary User object creation [extensions/IPInfo] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072176 (https://phabricator.wikimedia.org/T374526) (owner: Dreamy Jazz)
[13:27:22] <wikibugs>	 (PS1) Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439)
[13:27:33] <wikibugs>	 (PS13) Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[13:27:41] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1071838|Remove ResourceLoaderUseObjectCacheForDeps (T343492)]], [[gerrit:1072159|Generate special page name in English for central URLs (T374277)]], [[gerrit:1072176|IPInfoLogFormatter: Avoid unnecessary User object creation (T374526)]], [[gerrit:1071902|Add arbcom group to zhwiki (T374455)]], [[gerrit:1072177|Remove redundant oathauth-enable flag (T374528)]], [[gerrit:1072178|Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki (T374504)]], [[gerrit:1072186|Raise RelatedArticlesCardLimit to 9 in zhwikinews (T374323)]], [[gerrit:1072191|Enable Web team search suggestions survey (T373039)]]
[13:27:48] <stashbot>	 T343492: Phase out SqlModuleDependencyStore - https://phabricator.wikimedia.org/T343492
[13:27:48] <stashbot>	 T374277: View full log does not work on wikis with language other than English - https://phabricator.wikimedia.org/T374277
[13:27:49] <stashbot>	 T374526: InvalidArgumentException: Invalid IP address error when loading IPInfo logs - https://phabricator.wikimedia.org/T374526
[13:27:49] <stashbot>	 T374455: Create the "arbcom" user group on zhwiki - https://phabricator.wikimedia.org/T374455
[13:27:50] <stashbot>	 T374528: Remove redundant oathauth-enable flag - https://phabricator.wikimedia.org/T374528
[13:27:50] <stashbot>	 T374504: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki - https://phabricator.wikimedia.org/T374504
[13:27:50] <stashbot>	 T374323: Raise RelatedArticlesCardLimit to 9 in zhwikinews - https://phabricator.wikimedia.org/T374323
[13:27:51] <stashbot>	 T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039
[13:28:31] <wikibugs>	 (CR) Muehlenhoff: Bird::anycast - allow BFD connections from router link-local IP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: Cathal Mooney)
[13:29:28] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad
[13:29:49] <logmsgbot>	 !log dreamyjazz@deploy1003 jdrewniak, hokwelum, dreamyjazz, hamishz: Backport for [[gerrit:1071838|Remove ResourceLoaderUseObjectCacheForDeps (T343492)]], [[gerrit:1072159|Generate special page name in English for central URLs (T374277)]], [[gerrit:1072176|IPInfoLogFormatter: Avoid unnecessary User object creation (T374526)]], [[gerrit:1071902|Add arbcom group to zhwiki (T374455)]], [[gerrit:1072177|Remove redundant oathauth-enable flag (T374528)]], [[gerrit:1072178|Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki (T374504)]], [[gerrit:1072186|Raise RelatedArticlesCardLimit to 9 in zhwikinews (T374323)]], [[gerrit:1072191|Enable Web team search suggestions survey (T373039)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:29:51] <Dreamy_Jazz>	 jan_drewniak: Hamishcz: JustHannah: Please test your changes, they are live on the test servers now.
[13:30:19] <JustHannah>	 Okay!
[13:30:20] <wikibugs>	 (CR) CI reject: [V:-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: Cathal Mooney)
[13:30:33] <Hamishcz>	 sure:)
[13:30:47] <wikibugs>	 (CR) Elukey: "Chris: Last one I promise! :D" [puppet] - https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) (owner: Elukey)
[13:32:15] <wikibugs>	 (CR) CDanis: [C:+1] kubernetes: disable PSP for the AUX cluster [puppet] - https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) (owner: Elukey)
[13:32:46] <jan_drewniak>	 Dreamy_Jazz: yeah it's fine
[13:32:52] <Dreamy_Jazz>	 Thanks!
[13:33:00] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for poolcounter - https://phabricator.wikimedia.org/T374520#10137345 (elukey) Open→Resolved a:elukey
[13:33:13] <hashar>	 MatmaRex: I have made a few more tweaks on https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m 
[13:33:48] <MatmaRex>	 hashar: i like it :D
[13:34:12] <Dreamy_Jazz>	 My changes work.
[13:34:53] <Hamishcz>	 Dreamy_Jazz, my changes are all fine for me
[13:34:57] <Dreamy_Jazz>	 Thanks.
[13:35:05] <Dreamy_Jazz>	 JustHannah: How is testing going?
[13:35:17] <wikibugs>	 (PS1) Arnaudb: mariadb: prod dbproxy200[5-8] [puppet] - https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380)
[13:35:17] <wikibugs>	 (CR) Arnaudb: "for a sanity check → those hosts are not due for "real" production before Manuel comes back and we run some tests" [puppet] - https://gerrit.wikimedia.org/r/1072195 (https://phabricator.wikimedia.org/T367380) (owner: Arnaudb)
[13:35:26] <wikibugs>	 (PS14) Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[13:36:11] <JustHannah>	 Dreamy_Jazz: looks good!
[13:36:18] <Dreamy_Jazz>	 Thanks. Proceeding.
[13:36:20] <logmsgbot>	 !log dreamyjazz@deploy1003 jdrewniak, hokwelum, dreamyjazz, hamishz: Continuing with sync
[13:36:31] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10137366 (MoritzMuehlenhoff)
[13:36:52] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10137367 (MoritzMuehlenhoff)
[13:36:59] <wikibugs>	 (PS1) Elukey: Swap poolcounter2003 with poolcounter2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015)
[13:37:53] <wikibugs>	 (CR) Elukey: "Hey folks, the host is up and running, it seems working fine but some validation from serviceops is needed :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[13:38:08] <wikibugs>	 (CR) CI reject: [V:-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: Cathal Mooney)
[13:38:19] <wikibugs>	 (CR) Elukey: [C:+2] kubernetes: disable PSP for the AUX cluster [puppet] - https://gerrit.wikimedia.org/r/1072202 (https://phabricator.wikimedia.org/T369491) (owner: Elukey)
[13:40:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage
[13:40:50] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071838|Remove ResourceLoaderUseObjectCacheForDeps (T343492)]], [[gerrit:1072159|Generate special page name in English for central URLs (T374277)]], [[gerrit:1072176|IPInfoLogFormatter: Avoid unnecessary User object creation (T374526)]], [[gerrit:1071902|Add arbcom group to zhwiki (T374455)]], [[gerrit:1072177|Remove redundant oathauth-enable flag (T374528)]], [[gerrit:1072178|Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki (T374504)]], [[gerrit:1072186|Raise RelatedArticlesCardLimit to 9 in zhwikinews (T374323)]], [[gerrit:1072191|Enable Web team search suggestions survey (T373039)]] (duration: 13m 09s)
[13:40:56] <Dreamy_Jazz>	 Deploys done.
[13:40:57] <stashbot>	 T343492: Phase out SqlModuleDependencyStore - https://phabricator.wikimedia.org/T343492
[13:40:58] <stashbot>	 T374277: View full log does not work on wikis with language other than English - https://phabricator.wikimedia.org/T374277
[13:40:58] <stashbot>	 T374526: InvalidArgumentException: Invalid IP address error when loading IPInfo logs - https://phabricator.wikimedia.org/T374526
[13:40:58] <stashbot>	 T374455: Create the "arbcom" user group on zhwiki - https://phabricator.wikimedia.org/T374455
[13:40:59] <stashbot>	 T374528: Remove redundant oathauth-enable flag - https://phabricator.wikimedia.org/T374528
[13:40:59] <stashbot>	 T374504: Allow ipblock-exempt-grantor to remove ipblock-exempt group flag on zhwiki - https://phabricator.wikimedia.org/T374504
[13:40:59] <stashbot>	 T374323: Raise RelatedArticlesCardLimit to 9 in zhwikinews - https://phabricator.wikimedia.org/T374323
[13:41:00] <stashbot>	 T373039: Set up quicksurveys for UI and non-UI experiments - https://phabricator.wikimedia.org/T373039
[13:41:49] <Dreamy_Jazz>	 wikitech.wikimedia.org might be broken?
[13:42:17] <Dreamy_Jazz>	 Loading https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=edit&section=6 says "File not found"
[13:42:39] <Hamishcz>	 Dreamy_Jazz, it works normally from my end
[13:43:07] <Dreamy_Jazz>	 Apparently it doesn't work if you still have the mwdebug servers enabled
[13:44:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage
[13:44:17] <Hamishcz>	 ah yes 
[13:44:24] <Dreamy_Jazz>	 For some reason https://wikitech.wikimedia.org/wiki/Deployments now always redirects me to https://foundation.wikimedia.org/wiki/Deployments
[13:44:33] <Dreamy_Jazz>	 Even with the debug server off
[13:45:00] <JustHannah>	 Dreamy_Jazz: yep, and you also can’t reload a page with debug enabled
[13:45:26] <Dreamy_Jazz>	 Looks like it was a caching issue. Opening my developer tools fixed the redirect.
[13:45:33] <Dreamy_Jazz>	 Anyway, not caused by the deployments so that is all good.
[13:46:02] <Hamishcz>	 yes... https://wikitech.wikimedia.org/wiki/Main_Page always redirects me to https://foundation.wikimedia.org/wiki/Home
[13:46:16] <Hamishcz>	 w/ that off 
[13:47:05] <Dreamy_Jazz>	 !log Afternoon UTC backport window done
[13:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:53] <wikibugs>	 (PS1) Bartosz Dziewoński: logging: Default to log any error (on beta and group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838)
[13:48:21] <wikibugs>	 (PS15) Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[13:48:34] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: C. Scott Ananian)
[13:48:43] <wikibugs>	 (CR) CI reject: [V:-1] Bird::anycast - allow BFD connections from router link-local IP [puppet] - https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: Cathal Mooney)
[13:49:00] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on ganeti2012.codfw.wmnet with reason: Move ganeti2012 server uplink
[13:49:13] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on ganeti2012.codfw.wmnet with reason: Move ganeti2012 server uplink
[13:49:24] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10137421 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5ff7d01a-40d8-4196-9008-7bf9b79ea4e8) set by c...
[13:51:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2138.codfw.wmnet onto db2238.codfw.wmnet
[13:52:07] <icinga-wm>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:52:35] <icinga-wm>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms
[13:53:12] <wikibugs>	 (PS1) Ladsgroup: DNM: Add pc5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1072208 (https://phabricator.wikimedia.org/T374496)
[13:54:15] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:54:58] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:55:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540 (ops-monitoring-bot) NEW
[13:55:59] <wikibugs>	 (PS1) Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - https://gerrit.wikimedia.org/r/1072209
[13:56:57] <wikibugs>	 (CR) JHathaway: [C:+1] Puppet frontends: Remove obsolete manage_puppet_ca_file code [puppet] - https://gerrit.wikimedia.org/r/1072108 (https://phabricator.wikimedia.org/T366355) (owner: Muehlenhoff)
[13:57:07] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3953/co" [puppet] - https://gerrit.wikimedia.org/r/1072209 (owner: Ssingh)
[13:57:37] <MatmaRex>	 hashar: i updated your patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1018637 , want to un-WIP it?
[14:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1400)
[14:00:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.767 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:00:51] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:00:51] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:02:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:05:47] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow
[14:05:59] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 11s)
[14:06:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10137512 (cmooney) Still chasing Equinix to get this sorted, back-and-forth now with them for almost 2 weeks without an...
[14:07:28] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374380#10137505 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:07:43] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:07:55] <wikibugs>	 (PS4) Fabfur: hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668)
[14:09:45] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet
[14:09:49] <wikibugs>	 (PS1) JMeybohm: Replace kafka-main2003 with kafka-main2008 [deployment-charts] - https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210)
[14:10:15] <wikibugs>	 (CR) Volans: "post-merge suggestion" [cookbooks] - https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[14:10:36] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: Fabfur)
[14:11:04] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: disabling haproxy logging to socket (haproxykafka) [puppet] - https://gerrit.wikimedia.org/r/1072172 (https://phabricator.wikimedia.org/T370668) (owner: Fabfur)
[14:11:41] <wikibugs>	 (CR) Ssingh: [C:+2] sre.dns.admin: add cookbook for GeoDNS pool/depool (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1060914 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[14:12:51] <wikibugs>	 (PS1) Ladsgroup: Revert "conftool-data: Remove pc5 for now" [puppet] - https://gerrit.wikimedia.org/r/1072213
[14:13:06] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2008.codfw.wmnet
[14:13:06] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2008.codfw.wmnet
[14:14:12] <wikibugs>	 (PS1) Ladsgroup: conftool: Add pc5 hosts [puppet] - https://gerrit.wikimedia.org/r/1072215 (https://phabricator.wikimedia.org/T374496)
[14:14:25] <fabfur>	 !log reverted 1072172 and repooling cp4037 (T370668)
[14:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:28] <stashbot>	 T370668: New software: haproxykafka - https://phabricator.wikimedia.org/T370668
[14:14:31] <wikibugs>	 (PS2) Ladsgroup: Revert "conftool-data: Remove pc5 for now" [puppet] - https://gerrit.wikimedia.org/r/1072213 (https://phabricator.wikimedia.org/T374496)
[14:14:34] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[14:14:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T371742)', diff saved to https://phabricator.wikimedia.org/P68918 and previous config saved to /var/cache/conftool/dbconfig/20240911-141449-ladsgroup.json
[14:14:53] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[14:17:24] <wikibugs>	 (CR) Elukey: [C:+1] Replace kafka-main2003 with kafka-main2008 [deployment-charts] - https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[14:17:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 10%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68919 and previous config saved to /var/cache/conftool/dbconfig/20240911-141732-arnaudb.json
[14:17:54] <wikibugs>	 (CR) JMeybohm: [C:+2] Replace kafka-main2003 with kafka-main2008 [deployment-charts] - https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[14:19:42] <wikibugs>	 (Merged) jenkins-bot: Replace kafka-main2003 with kafka-main2008 [deployment-charts] - https://gerrit.wikimedia.org/r/1072210 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[14:19:59] <logmsgbot>	 !log cmooney@cumin1002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti2013.codfw.wmnet
[14:20:13] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542 (JMeybohm) NEW
[14:20:18] <wikibugs>	 (CR) Ladsgroup: [C:+2] conftool: Add pc5 hosts [puppet] - https://gerrit.wikimedia.org/r/1072215 (https://phabricator.wikimedia.org/T374496) (owner: Ladsgroup)
[14:20:24] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137567 (JMeybohm)
[14:20:26] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet
[14:20:35] <wikibugs>	 (PS3) Ladsgroup: Revert "conftool-data: Remove pc5 for now" [puppet] - https://gerrit.wikimedia.org/r/1072213 (https://phabricator.wikimedia.org/T374496)
[14:20:38] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] Revert "conftool-data: Remove pc5 for now" [puppet] - https://gerrit.wikimedia.org/r/1072213 (https://phabricator.wikimedia.org/T374496) (owner: Ladsgroup)
[14:20:57] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:21:13] <wikibugs>	 (PS1) JHathaway: postfix: remove wikimedia.com domain from relay hosts [puppet] - https://gerrit.wikimedia.org/r/1072217 (https://phabricator.wikimedia.org/T374489)
[14:21:27] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072217 (https://phabricator.wikimedia.org/T374489) (owner: JHathaway)
[14:21:52] <wikibugs>	 (PS1) Arnaudb: mariadb: productionize db2229 [puppet] - https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579)
[14:21:52] <wikibugs>	 (CR) Arnaudb: "@Ladsgroup@gmail.com I've checked on https://fault-tolerance.toolforge.org/map?cluster=db-masters and it will be in the same rack as db223" [puppet] - https://gerrit.wikimedia.org/r/1072216 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[14:21:54] <wikibugs>	 (PS2) Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - https://gerrit.wikimedia.org/r/1072209
[14:23:07] <wikibugs>	 (CR) Clément Goubert: "I'm unsure if it is the right solution, but... I don't really have another answer to this problem." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[14:24:45] <wikibugs>	 (CR) JHathaway: [C:+2] postfix: remove wikimedia.com domain from relay hosts [puppet] - https://gerrit.wikimedia.org/r/1072217 (https://phabricator.wikimedia.org/T374489) (owner: JHathaway)
[14:25:01] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2003.codfw.wmnet
[14:26:08] <wikibugs>	 (03PS1) 10Ladsgroup: conftool: Add pc5 to list of allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1072218 (https://phabricator.wikimedia.org/T374496)
[14:26:17] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński)
[14:26:37] <wikibugs>	 (03PS2) 10Ladsgroup: conftool: Add pc5 to list of allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1072218 (https://phabricator.wikimedia.org/T374496)
[14:26:42] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] conftool: Add pc5 to list of allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1072218 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup)
[14:27:17] <wikibugs>	 (03PS1) 10JMeybohm: Decom kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/1072219 (https://phabricator.wikimedia.org/T374542)
[14:27:43] <hashar>	 MatmaRex: amazing
[14:27:59] <wikibugs>	 (03PS2) 10Hnowlan: php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517)
[14:28:03] <hashar>	 does beta still have some logs/kibana?
[14:28:41] <wikibugs>	 (03PS3) 10Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209
[14:28:44] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070562 (https://phabricator.wikimedia.org/T369069) (owner: 10Sergio Gimeno)
[14:29:15] <wikibugs>	 (03CR) 10Hnowlan: "Yeah, I'm not either :( However, this is at least limited to shellbox-video for now. One option I considered here that is worth mentioning" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[14:29:41] <MatmaRex>	 hashar: say that again after we deploy it and it works ;)
[14:29:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P68920 and previous config saved to /var/cache/conftool/dbconfig/20240911-142956-ladsgroup.json
[14:30:18] <MatmaRex>	 hashar: i need to be off for today, but i'll schedule some deploys some time soon
[14:30:21] <wikibugs>	 (03CR) 10BBlack: [C:03+1] P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh)
[14:30:22] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[14:30:34] <hashar>	 MatmaRex: found it: https://beta-logs.wmcloud.org so I guess I will ninja-enable it on beta :)
[14:30:37] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[14:30:37] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:30:37] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[14:30:37] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[14:30:38] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:30:38] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:30:42] <hashar>	 and we can pair the deploy tomorrow
[14:30:51] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:31:05] <MatmaRex>	 hashar: cool :D
[14:31:12] <jayme>	 !log last 7 helmfile deploys did not happen
[14:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:21] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10137610 (10dcaro) >>! In T348643#10113626, @wiki_willy wrote: > Thanks @dcaro, sounds good. I'll bug them again abo...
[14:32:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:32:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 25%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68921 and previous config saved to /var/cache/conftool/dbconfig/20240911-143237-arnaudb.json
[14:32:44] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[14:32:50] <logmsgbot>	 !log cmooney@cumin1002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti2014.codfw.wmnet
[14:32:57] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet
[14:33:15] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:34:31] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:34:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[14:34:40] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "Let's see if it works before adding more parameters. To clarify, this change will only impact `shellbox-video` because it's the only deplo" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[14:34:48] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[14:34:50] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:35:14] <wikibugs>	 (03PS16) 10Cathal Mooney: Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379)
[14:36:14] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:10] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[14:37:59] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:38:00] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[14:38:13] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[14:38:13] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2003.codfw.wmnet
[14:38:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137625 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2003.codfw.wmnet` - kafka-main2003.codf...
[14:38:22] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[14:38:30] <wikibugs>	 (03CR) 10Jgiannelos: [C:04-1] "Blocking this until apps team gives us a wiki to start with. Ptwiki is used for some experiments at the moment so we might not want to ris" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos)
[14:39:01] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[14:39:02] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[14:39:29] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2003.codfw.wmnet - https://phabricator.wikimedia.org/T374542#10137630 (10JMeybohm)
[14:40:40] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[14:40:41] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:41:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:41:29] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[14:41:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Pool pc5 into production traffic (T374496)', diff saved to https://phabricator.wikimedia.org/P68922 and previous config saved to /var/cache/conftool/dbconfig/20240911-144147-ladsgroup.json
[14:41:50] <stashbot>	 T374496: Bring pc5 into rotation - https://phabricator.wikimedia.org/T374496
[14:42:40] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:43:15] <jayme>	 !log deployed changeprop-jobqueue changeprop cirrus-streaming-updater eventgate-main eventstreams mw-page-content-change-enrich rdf-streaming-updater for kafka connection string updates - T363210
[14:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:18] <stashbot>	 T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[14:43:32] <wikibugs>	 (03CR) 10Cathal Mooney: "Ok hopefully this looks a bit better now." [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[14:43:46] <wikibugs>	 (03PS2) 10C. Scott Ananian: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229)
[14:44:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: 10C. Scott Ananian)
[14:45:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P68923 and previous config saved to /var/cache/conftool/dbconfig/20240911-144504-ladsgroup.json
[14:45:05] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:45:54] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:46:12] <wikibugs>	 (03PS3) 10Hnowlan: php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517)
[14:46:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:46:44] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[14:47:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 50%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68924 and previous config saved to /var/cache/conftool/dbconfig/20240911-144743-arnaudb.json
[14:48:36] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2001.codfw.wmnet
[14:48:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2001.codfw.wmnet
[14:48:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2003.codfw.wmnet
[14:48:47] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2003.codfw.wmnet
[14:49:58] <claime>	 !log Depooling kubernetes2042.codfw.wmnet kubernetes2043.codfw.wmnet mw2350.codfw.wmnet mw2351.codfw.wmnet mw2352.codfw.wmnet mw2353.codfw.wmnet mw2354.codfw.wmnet mw2355.codfw.wmnet mw2356.codfw.wmnet mw2357.codfw.wmnet mw2359.codfw.wmnet parse2014.codfw.wmnet parse2015.codfw.wmnet wikikube-ctrl2002.codfw.wmnet wikikube-worker2020.codfw.wmnet wikikube-worker2021.codfw.wmnet
[14:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:00] <claime>	 wikikube-worker2022.codfw.wmnet wikikube-worker2023.codfw.wmnet wikikube-worker2024.codfw.wmnet wikikube-worker2032.codfw.wmnet - T373101
[14:50:02] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[14:50:05] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[14:50:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2042.codfw.wmnet
[14:50:55] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2042.codfw.wmnet
[14:51:00] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2043.codfw.wmnet
[14:51:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2043.codfw.wmnet
[14:51:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2350.codfw.wmnet
[14:52:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2350.codfw.wmnet
[14:52:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2351.codfw.wmnet
[14:52:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2351.codfw.wmnet
[14:53:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2352.codfw.wmnet
[14:53:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2352.codfw.wmnet
[14:53:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2353.codfw.wmnet
[14:53:46] <wikibugs>	 (03PS2) 10Hashar: logging: Default to log any error (on group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński)
[14:53:46] <wikibugs>	 (03PS1) 10Hashar: logging: Default to log any error (on beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072226 (https://phabricator.wikimedia.org/T228838)
[14:54:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2353.codfw.wmnet
[14:54:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2354.codfw.wmnet
[14:54:30] <wikibugs>	 (03CR) 10Hashar: "I have moved beta to a standalone job https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1072226 . I will deploy it immediatel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński)
[14:54:55] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Awesome work Bartosz thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072207 (https://phabricator.wikimedia.org/T228838) (owner: 10Bartosz Dziewoński)
[14:54:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2354.codfw.wmnet
[14:55:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2355.codfw.wmnet
[14:55:02] <wikibugs>	 (03PS14) 10Hashar: logging: Default to log any error (all wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1018637 (https://phabricator.wikimedia.org/T228838)
[14:55:38] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2355.codfw.wmnet
[14:55:39] <wikibugs>	 (03PS1) 10Superzerocool: eswiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072227 (https://phabricator.wikimedia.org/T374484)
[14:55:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2356.codfw.wmnet
[14:56:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2356.codfw.wmnet
[14:56:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2357.codfw.wmnet
[14:56:36] <hashar>	 jouncebot: now
[14:56:37] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1400)
[14:56:50] <wikibugs>	 (03CR) 10Hashar: [C:03+2] logging: Default to log any error (on beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072226 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar)
[14:56:55] <hashar>	 ^ that is solely for beta
[14:57:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2357.codfw.wmnet
[14:57:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2359.codfw.wmnet
[14:57:33] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Default to log any error (on beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072226 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar)
[14:57:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2359.codfw.wmnet
[14:57:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2014.codfw.wmnet
[14:58:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2014.codfw.wmnet
[14:58:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse2015.codfw.wmnet
[14:58:33] <Amir1>	 about to pool pc5
[14:58:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Pool pc5 into production traffic (T374496)', diff saved to https://phabricator.wikimedia.org/P68925 and previous config saved to /var/cache/conftool/dbconfig/20240911-145844-ladsgroup.json
[14:58:48] <stashbot>	 T374496: Bring pc5 into rotation - https://phabricator.wikimedia.org/T374496
[15:00:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T371742)', diff saved to https://phabricator.wikimedia.org/P68926 and previous config saved to /var/cache/conftool/dbconfig/20240911-150011-ladsgroup.json
[15:00:13] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[15:00:23] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[15:00:26] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[15:00:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse2015.codfw.wmnet
[15:01:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2002.codfw.wmnet
[15:01:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2002.codfw.wmnet
[15:01:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2020.codfw.wmnet
[15:01:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:02:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2020.codfw.wmnet
[15:02:27] <wikibugs>	 (03PS3) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045
[15:02:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2021.codfw.wmnet
[15:02:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 75%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68927 and previous config saved to /var/cache/conftool/dbconfig/20240911-150249-arnaudb.json
[15:03:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2021.codfw.wmnet
[15:03:07] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2022.codfw.wmnet
[15:03:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2022.codfw.wmnet
[15:03:49] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2023.codfw.wmnet
[15:04:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2023.codfw.wmnet
[15:04:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2024.codfw.wmnet
[15:04:44] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:05:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2024.codfw.wmnet
[15:05:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2032.codfw.wmnet
[15:05:46] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2032.codfw.wmnet
[15:07:13] <wikibugs>	 (03Abandoned) 10Ladsgroup: DNM: Add pc5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072208 (https://phabricator.wikimedia.org/T374496) (owner: 10Ladsgroup)
[15:08:35] <wikibugs>	 (03CR) 10Volans: sre.dns.admin: add guardrails for depool of sites/resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 (owner: 10Ssingh)
[15:08:36] <logmsgbot>	 !log cmooney@cumin1002 END (ERROR) - Cookbook sre.ganeti.drain-node (exit_code=97) for draining ganeti node ganeti2014.codfw.wmnet
[15:09:44] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:11:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:15:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh)
[15:15:54] <wikibugs>	 (03CR) 10Volans: "You probably want to use the `verbatim_hosts` flag, see https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.aler" [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto)
[15:17:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 100%: post db2138 → db2238 repool', diff saved to https://phabricator.wikimedia.org/P68928 and previous config saved to /var/cache/conftool/dbconfig/20240911-151754-arnaudb.json
[15:18:08] <wikibugs>	 (03CR) 10Volans: "LGTM, modulo fixing the current CI failures that are legit (just rebase your local checkout with master). Although I'm not sure if it ever" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh)
[15:21:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=(cp2037|cp2038).codfw.wmnet [reason: depooling for T373101]
[15:21:48] <wikibugs>	 (03CR) 10Hnowlan: [V:03+2 C:03+2] php:common: sleep briefly when checking for busy workers [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072174 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[15:21:48] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[15:26:17] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@71141b8] (releasing): (no justification provided)
[15:26:58] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@71141b8] (releasing): (no justification provided) (duration: 00m 41s)
[15:27:24] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@4635fcb] (releasing): (no justification provided)
[15:28:00] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@4635fcb] (releasing): (no justification provided) (duration: 00m 35s)
[15:28:09] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020)
[15:28:09] <wikibugs>	 (03PS2) 10Scott French: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649)
[15:28:55] <wikibugs>	 (03CR) 10Scott French: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: 10Scott French)
[15:29:28] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: use SSL to access kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231
[15:31:43] <topranks>	 !log push server and vlan configuration to lsw1-c6-codfw with Homer to prep physical moves T373101
[15:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:46] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[15:33:05] <wikibugs>	 (03CR) 10DCausse: [C:04-1] "sorry for the noise, it's not ready just realized that the consumers are still hardcoded to plaintext... needs a patch in the codebase 😞" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231 (owner: 10DCausse)
[15:34:01] <wikibugs>	 (03PS1) 10Hnowlan: php: fix minor indentation issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072232
[15:34:30] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] php: fix minor indentation issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072232 (owner: 10Hnowlan)
[15:35:13] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on phab2002.codfw.wmnet with reason: nftables migration
[15:35:23] <wikibugs>	 (03CR) 10Hnowlan: [V:03+2 C:03+2] php: fix minor indentation issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072232 (owner: 10Hnowlan)
[15:35:28] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on phab2002.codfw.wmnet with reason: nftables migration
[15:35:33] <mutante>	 !log phab2002 - rebooting for nftables migration
[15:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:33] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on moscovium.eqiad.wmnet with reason: nftables migration
[15:36:42] <mutante>	 !log moscovium - rebooting for nftables migration
[15:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:48] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moscovium.eqiad.wmnet with reason: nftables migration
[15:37:14] <wikibugs>	 (03PS1) 10Hnowlan: Fix image name typo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072234
[15:37:36] <urandom>	 !log depooling thanos-fe2004.codfw.wmnet — T373101
[15:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:39] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[15:38:02] <wikibugs>	 (PS1) Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[15:38:19] <wikibugs>	 (PS2) Bking: flink-app: create a new label for selecting Calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195)
[15:39:03] <wikibugs>	 (CR) Clément Goubert: [C:+1] Fix image name typo [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072234 (owner: Hnowlan)
[15:39:58] <wikibugs>	 (CR) CI reject: [V:-1] sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: Scott French)
[15:40:11] <wikibugs>	 (CR) Hnowlan: [V:+2 C:+2] Fix image name typo [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072234 (owner: Hnowlan)
[15:41:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2115 db2116 db2127 db2167 db2168 db2179 db2180 db2210 es2022 es2038 - T370852', diff saved to https://phabricator.wikimedia.org/P68929 and previous config saved to /var/cache/conftool/dbconfig/20240911-154114-arnaudb.json
[15:41:18] <stashbot>	 T370852: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852
[15:43:14] <wikibugs>	 (PS3) Scott French: sre.switchdc.mediawiki: suppress check_core_masters_in_sync errors in live-test [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649)
[15:43:58] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:43:58] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:44:04] <hashar>	 earlier today I complained about jsontruncated messages coming from wikifunctions. That is T374241 :)
[15:44:04] <stashbot>	 T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241
[15:44:16] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:44:38] <icinga-wm>	 PROBLEM - WDQS Main SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query-main.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:44:44] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:45:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2021:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:45:55] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10137884 (ABran-WMF) depoolable hosts have been depooled https://phabricator.wikimedia.org/P68929
[15:46:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:51] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:49:28] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2021.codfw.wmnet with OS bullseye
[15:50:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2021.codfw.wmnet with reason: T373791
[15:50:24] <stashbot>	 T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791
[15:50:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2021.codfw.wmnet with reason: T373791
[15:55:21] <mutante>	 !log moscovium - apt-get upgrade - installing new apache2 version and more package upgrades
[15:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[15:56:01] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance
[15:56:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T371742)', diff saved to https://phabricator.wikimedia.org/P68930 and previous config saved to /var/cache/conftool/dbconfig/20240911-155608-ladsgroup.json
[15:56:11] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[15:56:41] <wikibugs>	 (CR) Volans: [C:+1] "Makes sense to me" [cookbooks] - https://gerrit.wikimedia.org/r/1071981 (https://phabricator.wikimedia.org/T372649) (owner: Scott French)
[16:00:10] <wikibugs>	 (CR) Lucas Werkmeister: [C:+1] typos: add colud to the list [puppet] - https://gerrit.wikimedia.org/r/1072188 (owner: David Caro)
[16:07:01] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 34 hosts with reason: Move server uplinks codfw racks C6
[16:07:30] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 34 hosts with reason: Move server uplinks codfw racks C6
[16:07:44] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10137966 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6257a49b-1ea6-4675-9944-c5d85eb38288) set by cmooney@cumin1002 for...
[16:07:45] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware, serviceops: decommission kafka-main2002.codfw.wmnet - https://phabricator.wikimedia.org/T374451#10137961 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[16:08:03] <topranks>	 !log begin server uplink moves from asw-c6-codfw to lsw1-c6-codfw T373101
[16:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:07] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:08:59] * arnaudb grabs his popcorn
[16:09:38] <wikibugs>	 (PS1) Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[16:10:43] <wikibugs>	 (PS2) Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[16:16:38] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2135.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2135.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:16:49] <arnaudb>	 ah
[16:16:52] <arnaudb>	 it's not been muted
[16:18:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[16:18:14] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2135.codfw.wmnet with reason: network maintenance
[16:18:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2135.codfw.wmnet with reason: network maintenance
[16:19:44] <wikibugs>	 (PS1) Hnowlan: shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1072246 (https://phabricator.wikimedia.org/T342213)
[16:20:32] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 24 hosts with reason: Move server uplinks codfw racks C7
[16:20:38] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m5 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:20:53] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 24 hosts with reason: Move server uplinks codfw racks C7
[16:21:07] <logmsgbot>	 !log bking@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: 8
[16:21:08] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10138029 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6458a64c-9bf9-4b09-a6e1-82f1e6f72fc3) set by cmooney@cumin1002 for...
[16:21:19] <logmsgbot>	 !log bking@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: 8 (duration: 00m 12s)
[16:21:46] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:22:00] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:23:00] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:23:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[16:23:16] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[16:25:43] <stashbot>	 T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791
[16:26:09] <wikibugs>	 (CR) EoghanGaffney: [C:+2] lists: Set number of processes for mailman3_runner to minimum of 14 [puppet] - https://gerrit.wikimedia.org/r/1071049 (owner: EoghanGaffney)
[16:27:50] <wikibugs>	 (PS1) EoghanGaffney: lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - https://gerrit.wikimedia.org/r/1072247
[16:27:59] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2037.codfw.wmnet
[16:27:59] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2037.codfw.wmnet
[16:28:03] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp2038.codfw.wmnet
[16:28:04] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2038.codfw.wmnet
[16:28:46] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10138056 (cmooney) All hosts successfully moved and responding to ping again.
[16:29:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:29:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:31:36] <inflatador>	 ^^ expected
[16:31:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68931 and previous config saved to /var/cache/conftool/dbconfig/20240911-163137-arnaudb.json
[16:31:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68932 and previous config saved to /var/cache/conftool/dbconfig/20240911-163142-arnaudb.json
[16:31:43] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:31:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68933 and previous config saved to /var/cache/conftool/dbconfig/20240911-163147-arnaudb.json
[16:31:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68934 and previous config saved to /var/cache/conftool/dbconfig/20240911-163152-arnaudb.json
[16:31:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68935 and previous config saved to /var/cache/conftool/dbconfig/20240911-163157-arnaudb.json
[16:32:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68936 and previous config saved to /var/cache/conftool/dbconfig/20240911-163202-arnaudb.json
[16:32:05] <wikibugs>	 (PS4) Ssingh: P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - https://gerrit.wikimedia.org/r/1072209
[16:32:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68937 and previous config saved to /var/cache/conftool/dbconfig/20240911-163207-arnaudb.json
[16:32:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68938 and previous config saved to /var/cache/conftool/dbconfig/20240911-163212-arnaudb.json
[16:32:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68939 and previous config saved to /var/cache/conftool/dbconfig/20240911-163217-arnaudb.json
[16:32:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 25%: T373101', diff saved to https://phabricator.wikimedia.org/P68940 and previous config saved to /var/cache/conftool/dbconfig/20240911-163222-arnaudb.json
[16:33:10] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3954/co" [puppet] - https://gerrit.wikimedia.org/r/1072209 (owner: Ssingh)
[16:34:18] <wikibugs>	 (PS1) CDanis: wikifunctions: enable tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1072248
[16:34:20] <claime>	 !log Repooling kubernetes2042.codfw.wmnet kubernetes2043.codfw.wmnet mw2350.codfw.wmnet mw2351.codfw.wmnet mw2352.codfw.wmnet mw2353.codfw.wmnet mw2354.codfw.wmnet mw2355.codfw.wmnet mw2356.codfw.wmnet mw2357.codfw.wmnet mw2359.codfw.wmnet parse2014.codfw.wmnet parse2015.codfw.wmnet wikikube-ctrl2002.codfw.wmnet wikikube-worker2020.codfw.wmnet wikikube-worker2021.codfw.wmnet
[16:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:22] <claime>	 wikikube-worker2022.codfw.wmnet wikikube-worker2023.codfw.wmnet wikikube-worker2024.codfw.wmnet wikikube-worker2032.codfw.wmnet - T373101
[16:34:33] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2042.codfw.wmnet
[16:34:35] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2042.codfw.wmnet
[16:34:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2043.codfw.wmnet
[16:34:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2043.codfw.wmnet
[16:34:44] <urandom>	 !log pooling thanos-fe2004.codfw.wmnet — T373101
[16:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2350.codfw.wmnet
[16:34:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2350.codfw.wmnet
[16:34:51] <wikibugs>	 (CR) BBlack: [C:+1] P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - https://gerrit.wikimedia.org/r/1072209 (owner: Ssingh)
[16:34:55] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2351.codfw.wmnet
[16:34:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2351.codfw.wmnet
[16:35:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2352.codfw.wmnet
[16:35:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2352.codfw.wmnet
[16:35:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2353.codfw.wmnet
[16:35:11] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2353.codfw.wmnet
[16:35:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2354.codfw.wmnet
[16:35:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2354.codfw.wmnet
[16:35:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2355.codfw.wmnet
[16:35:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2355.codfw.wmnet
[16:35:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2356.codfw.wmnet
[16:35:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2356.codfw.wmnet
[16:35:37] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2357.codfw.wmnet
[16:35:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2357.codfw.wmnet
[16:35:41] <wikibugs>	 (PS2) CDanis: wikifunctions: enable tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549)
[16:35:43] <topranks>	 !log disable now unused ports on asw-c6-codfw after server move T373101
[16:35:45] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw2359.codfw.wmnet
[16:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:47] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw2359.codfw.wmnet
[16:35:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2014.codfw.wmnet
[16:35:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2014.codfw.wmnet
[16:35:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host parse2015.codfw.wmnet
[16:36:01] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host parse2015.codfw.wmnet
[16:36:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2002.codfw.wmnet
[16:36:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2002.codfw.wmnet
[16:36:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10138099 (VRiley-WMF) a:VRiley-WMF
[16:36:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2020.codfw.wmnet
[16:36:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2020.codfw.wmnet
[16:36:16] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10138100 (ABran-WMF) hosts are repooling
[16:36:21] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2021.codfw.wmnet
[16:36:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2021.codfw.wmnet
[16:36:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2022.codfw.wmnet
[16:36:31] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2022.codfw.wmnet
[16:36:36] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2023.codfw.wmnet
[16:36:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2023.codfw.wmnet
[16:36:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2024.codfw.wmnet
[16:36:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2024.codfw.wmnet
[16:36:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2032.codfw.wmnet
[16:36:52] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2032.codfw.wmnet
[16:37:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] wikifunctions: enable tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) (owner: CDanis)
[16:37:49] <wikibugs>	 (CR) CDanis: [C:+2] wikifunctions: enable tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) (owner: CDanis)
[16:38:50] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: enable tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1072248 (https://phabricator.wikimedia.org/T320549) (owner: CDanis)
[16:39:12] <rzl>	 jouncebot: nowandnext
[16:39:12] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[16:39:12] <jouncebot>	 In 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1700)
[16:39:38] <wikibugs>	 (PS1) CDanis: mw-wikifunctions: tracing at 100% [deployment-charts] - https://gerrit.wikimedia.org/r/1072251 (https://phabricator.wikimedia.org/T320549)
[16:39:54] <wikibugs>	 (CR) CDanis: [C:+2] mw-wikifunctions: tracing at 100% [deployment-charts] - https://gerrit.wikimedia.org/r/1072251 (https://phabricator.wikimedia.org/T320549) (owner: CDanis)
[16:40:11] <rzl>	 lucaswerkmeister: I'm around if you'd still like to deploy those fatal-error patches today
[16:40:18] <lucaswerkmeister>	 sure!
[16:40:29] <lucaswerkmeister>	 I figured out the fatal-error.php password too
[16:40:51] <rzl>	 ah perfect
[16:41:09] <wikibugs>	 (Merged) jenkins-bot: mw-wikifunctions: tracing at 100% [deployment-charts] - https://gerrit.wikimedia.org/r/1072251 (https://phabricator.wikimedia.org/T320549) (owner: CDanis)
[16:41:44] <lucaswerkmeister>	 (and as far as I could tell, the X-Request-Id response header is never sent, so I guess the condition in https://gerrit.wikimedia.org/g/operations/puppet/+/refs/changes/15/1071715/2/modules/profile/files/mediawiki/php/php7-fatal-error.php#104 is always false…)
[16:42:12] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[16:42:29] <wikibugs>	 (PS1) AikoChou: ml-services: add ref-quality isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902)
[16:42:35] <lucaswerkmeister>	 (but maybe that only affects /w/fatal-error.php and “real” fatal errors still get a chance to send that header. no idea)
[16:43:16] <rzl>	 hm, okay
[16:43:29] <wikibugs>	 (PS1) CDanis: wikifunctions: no tracing in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1072253
[16:43:53] <rzl>	 lucaswerkmeister: do you want to deploy and test these together, or one at a time?
[16:44:18] <lucaswerkmeister>	 together, I think
[16:44:30] <rzl>	 works for me
[16:44:38] <lucaswerkmeister>	 I don’t even know how to test the first change, presumably the request ID being unset Should Never Happen™ ^^
[16:44:40] <wikibugs>	 (CR) CDanis: [C:+2] wikifunctions: no tracing in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1072253 (owner: CDanis)
[16:45:25] <wikibugs>	 (CR) RLazarus: [C:+2] errorpage: Remove redundant 'unknown' $reqId fallback [puppet] - https://gerrit.wikimedia.org/r/1071714 (owner: Lucas Werkmeister)
[16:45:36] <wikibugs>	 (CR) RLazarus: [C:+2] errorpage: Include request ID early in HTML source [puppet] - https://gerrit.wikimedia.org/r/1071715 (https://phabricator.wikimedia.org/T291192) (owner: Lucas Werkmeister)
[16:45:39] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: no tracing in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1072253 (owner: CDanis)
[16:46:06] <logmsgbot>	 !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:46:22] <rzl>	 (waiting on puppet-merge)
[16:46:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68941 and previous config saved to /var/cache/conftool/dbconfig/20240911-164644-arnaudb.json
[16:46:47] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:46:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68942 and previous config saved to /var/cache/conftool/dbconfig/20240911-164648-arnaudb.json
[16:46:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68943 and previous config saved to /var/cache/conftool/dbconfig/20240911-164653-arnaudb.json
[16:46:56] <bblack>	 in my experience, "never" is defined in the wikimedia world as "something that probably happens at least once an hour somewhere" :)
[16:46:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68944 and previous config saved to /var/cache/conftool/dbconfig/20240911-164657-arnaudb.json
[16:47:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68945 and previous config saved to /var/cache/conftool/dbconfig/20240911-164703-arnaudb.json
[16:47:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68946 and previous config saved to /var/cache/conftool/dbconfig/20240911-164708-arnaudb.json
[16:47:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68947 and previous config saved to /var/cache/conftool/dbconfig/20240911-164713-arnaudb.json
[16:47:14] <logmsgbot>	 !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:47:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68948 and previous config saved to /var/cache/conftool/dbconfig/20240911-164718-arnaudb.json
[16:47:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68949 and previous config saved to /var/cache/conftool/dbconfig/20240911-164723-arnaudb.json
[16:47:25] <rzl>	 (now waiting on the puppet agent at deploy1003)
[16:47:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 50%: T373101', diff saved to https://phabricator.wikimedia.org/P68950 and previous config saved to /var/cache/conftool/dbconfig/20240911-164728-arnaudb.json
[16:47:30] <logmsgbot>	 !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:48:35] <logmsgbot>	 !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:48:42] <lucaswerkmeister>	 !bash <bblack> in my experience, "never" is defined in the wikimedia world as "something that probably happens at least once an hour somewhere" :)
[16:48:43] <stashbot>	 lucaswerkmeister: Stored quip at https://bash.toolforge.org/quip/Z5784ZEBFFSCpsJzvl4r
[16:48:55] <lucaswerkmeister>	 (hope you don’t mind, delete it if you do ^^)
[16:51:57] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:51:59] <rzl>	 cdanis: my helmfile deploy is picking up your wikifunctions tracing changes, is it okay if those go out?
[16:52:04] <cdanis>	 rzl: please
[16:52:11] <rzl>	 👍
[16:52:13] <cdanis>	 I hadn't started them because I didn't want to get in your way
[16:52:22] <rzl>	 swfrench-wmf++ for these good diffs appearing
[16:52:43] <cdanis>	 oh is it better now??
[16:52:56] <rzl>	 oh I just mean diffs appearing here at all
[16:53:05] <rzl>	 no news afaik, I just think it's neat that scap does that
[16:53:11] <cdanis>	 ah yeah
[16:53:21] <rzl>	 nothing else unexpected here, off we go
[16:53:22] <logmsgbot>	 !log rzl@deploy1003 Started scap sync-world: 1071714, 1071715 (T291192)
[16:53:29] <stashbot>	 T291192: Update php-wmerrors page to include request ID - https://phabricator.wikimedia.org/T291192
[16:53:38] <icinga-wm>	 RECOVERY - mailman3_runners on lists1004 is OK: PROCS OK: 15 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:54:12] <logmsgbot>	 !log rzl@deploy1003 rzl: 1071714, 1071715 (T291192) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:54:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[16:54:18] <jinxer-wm>	 FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[16:54:32] * lucaswerkmeister looks up the mwdebug curl incantation
[16:54:32] <rzl>	 lucaswerkmeister, cdanis: at mwdebug, ready for testing
[16:54:47] <herron>	 !incidents
[16:54:47] <sirenbot>	 5158 (UNACKED)  NELHigh sre (thanos-rule tcp.address_unreachable)
[16:54:47] <sirenbot>	 5157 (RESOLVED)  db1166 (paged)/MariaDB Replica SQL: s3 (paged)
[16:54:48] <sirenbot>	 5156 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[16:54:48] <sirenbot>	 5155 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) global noc (cr2-eqord.wikimedia.org)
[16:55:01] <rzl>	 arnoldokoth, herron: fyi I have a deploy in progress but it's only as far as mwdebug and almost certainly can't be related to that NEL page
[16:55:13] <cdanis>	 I think that NEL page from GB is the same false positive as earlier this week
[16:55:15] <jynus>	 we had some of that recently
[16:55:21] <jynus>	 what cdanis said
[16:55:27] <herron>	 !ack 5158
[16:55:28] <sirenbot>	 5158 (ACKED)  NELHigh sre (thanos-rule tcp.address_unreachable)
[16:55:31] <lucaswerkmeister>	 rzl: seems to work, I see the HTML comment :)
[16:55:36] <cdanis>	 I'll do something quickly to exclude that bogus hostname in the logstash exporter that backs the metric for the alert
[16:55:41] <rzl>	 lucaswerkmeister: sweet
[16:55:48] <herron>	 ok sounds good cdanis thanks
[16:56:00] <cdanis>	 er, after I have some lunch, since I just realized I haven't yet
[16:56:07] <cdanis>	 but yeah before eod today :)
[16:56:21] <herron>	 write both before and after food and compare :)
[16:56:31] <rzl>	 cdanis: do you want to test anything on that tracing change while it's at mwdebug, or should I just roll it everywhere?
[16:56:39] <cdanis>	 rzl: roll
[16:56:54] <rzl>	 herron: any objection wrt that alert?
[16:57:08] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=(cp2037|cp2038).codfw.wmnet [reason: done T373101]
[16:57:12] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[16:57:13] <herron>	 rzl: negative sgtm
[16:57:18] <arnoldokoth>	 rzl: cdanis: Thanks.
[16:57:21] <rzl>	 🚀
[16:57:23] <logmsgbot>	 !log rzl@deploy1003 rzl: Continuing with sync
[16:58:24] <logmsgbot>	 !log rzl@deploy1003 Finished scap sync-world: 1071714, 1071715 (T291192) (duration: 07m 37s)
[16:58:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T371742)', diff saved to https://phabricator.wikimedia.org/P68951 and previous config saved to /var/cache/conftool/dbconfig/20240911-165838-ladsgroup.json
[16:58:42] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[16:59:39] <sukhe>	 !log sudo cumin "A:dnsbox" 'disable-puppet "merging CR 1072209"'
[16:59:40] <rzl>	 and I think that's yesterday's puppet window complete 😅 
[16:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:49] <wikibugs>	 (03PS1) 10Jdlrobson: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072255
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1700)
[17:00:21] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: drop backward compatibility for ntp (only use ntpsec) [puppet] - 10https://gerrit.wikimedia.org/r/1072209 (owner: 10Ssingh)
[17:01:17] <lucaswerkmeister>	 (and re what I wrote above about the X-Request-ID response header missing from the fatal-error response, it turns out Krinkle already figured that out three years ago, it works as expected but gets filtered out unless WikimediaDebug is used
[17:01:17] <lucaswerkmeister>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/721923/3#message-8147826063be8a55a599fb775df9f242d3e075ea)
[17:01:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68952 and previous config saved to /var/cache/conftool/dbconfig/20240911-170149-arnaudb.json
[17:01:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68953 and previous config saved to /var/cache/conftool/dbconfig/20240911-170153-arnaudb.json
[17:01:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68954 and previous config saved to /var/cache/conftool/dbconfig/20240911-170158-arnaudb.json
[17:02:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68955 and previous config saved to /var/cache/conftool/dbconfig/20240911-170203-arnaudb.json
[17:02:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68956 and previous config saved to /var/cache/conftool/dbconfig/20240911-170208-arnaudb.json
[17:02:12] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[17:02:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68957 and previous config saved to /var/cache/conftool/dbconfig/20240911-170213-arnaudb.json
[17:02:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68958 and previous config saved to /var/cache/conftool/dbconfig/20240911-170218-arnaudb.json
[17:02:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68959 and previous config saved to /var/cache/conftool/dbconfig/20240911-170223-arnaudb.json
[17:02:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68960 and previous config saved to /var/cache/conftool/dbconfig/20240911-170228-arnaudb.json
[17:02:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: T373101', diff saved to https://phabricator.wikimedia.org/P68961 and previous config saved to /var/cache/conftool/dbconfig/20240911-170233-arnaudb.json
[17:05:05] <sukhe>	 !log sukhe@dns7001:~$ sudo systemctl restart ntpsec.service
[17:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:46] <icinga-wm>	 PROBLEM - NTP peers on dns7001 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP
[17:08:46] <icinga-wm>	 RECOVERY - NTP peers on dns7001 is OK: NTP OK: Offset 0.000833847 secs https://wikitech.wikimedia.org/wiki/NTP
[17:09:52] <wikibugs>	 (03PS1) 10Zabe: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490)
[17:12:36] <wikibugs>	 (03PS2) 10Zabe: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490)
[17:12:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[17:13:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P68962 and previous config saved to /var/cache/conftool/dbconfig/20240911-171346-ladsgroup.json
[17:14:20] <moritzm>	 !log installing gtk+2.0 security updates on bookworm
[17:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68963 and previous config saved to /var/cache/conftool/dbconfig/20240911-171655-arnaudb.json
[17:16:59] <stashbot>	 T373101: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101
[17:17:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68964 and previous config saved to /var/cache/conftool/dbconfig/20240911-171700-arnaudb.json
[17:17:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68965 and previous config saved to /var/cache/conftool/dbconfig/20240911-171704-arnaudb.json
[17:17:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68966 and previous config saved to /var/cache/conftool/dbconfig/20240911-171709-arnaudb.json
[17:17:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68967 and previous config saved to /var/cache/conftool/dbconfig/20240911-171714-arnaudb.json
[17:17:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68968 and previous config saved to /var/cache/conftool/dbconfig/20240911-171719-arnaudb.json
[17:17:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68969 and previous config saved to /var/cache/conftool/dbconfig/20240911-171724-arnaudb.json
[17:17:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68970 and previous config saved to /var/cache/conftool/dbconfig/20240911-171729-arnaudb.json
[17:17:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68971 and previous config saved to /var/cache/conftool/dbconfig/20240911-171734-arnaudb.json
[17:17:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: T373101', diff saved to https://phabricator.wikimedia.org/P68972 and previous config saved to /var/cache/conftool/dbconfig/20240911-171739-arnaudb.json
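The staged repooling above (50%, then 75%, then 100% per host) can be summarized from the log lines themselves. The following is a minimal sketch, not part of dbctl: the regular expression and the `repool_progress` helper are hypothetical and were derived only from the message shape seen in this log, and the sample lines are shortened copies of the entries above.

```python
import re

# Assumed message shape, taken from the logmsgbot lines in this log;
# this is not a documented dbctl output format.
REPOOL_RE = re.compile(
    r"'(?P<host>\w+) \(re\)pooling @ (?P<pct>\d+)%: (?P<task>T\d+)'"
)


def repool_progress(lines):
    """Return {host: highest pooling percentage seen} for the given lines."""
    progress = {}
    for line in lines:
        match = REPOOL_RE.search(line)
        if match:
            host = match["host"]
            pct = int(match["pct"])
            progress[host] = max(pct, progress.get(host, 0))
    return progress


# Shortened sample lines (the trailing "diff saved to" part is omitted).
sample = [
    "!log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: T373101'",
    "!log arnaudb@cumin1002 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: T373101'",
    "!log arnaudb@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 50%: T373101'",
]
print(repool_progress(sample))
```

Keeping only the maximum per host makes it easy to spot a host that stalled below 100%.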
[17:17:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[17:17:42] <icinga-wm>	 RECOVERY - WDQS Main SPARQL on wdqs2021 is OK: HTTP OK: HTTP/1.1 200 OK - 785 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:17:47] <stashbot>	 T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791
[17:18:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:18:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:18:51] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:19:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[17:19:18] <jinxer-wm>	 RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[17:20:49] <lucaswerkmeister>	 rzl: I forgot to say thanks, so thanks for deploying! \o/
[17:20:56] <zabe>	 jouncebot: nowandnext
[17:20:56] <jouncebot>	 For the next 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1700)
[17:20:56] <jouncebot>	 In 0 hour(s) and 39 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1800)
[17:21:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:21:56] <wikibugs>	 (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:22:02] <wikibugs>	 (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:22:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:22:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:24:36] <rzl>	 lucaswerkmeister: of course, any time! thanks for your flexibility
[17:25:02] <wikibugs>	 (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072257 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:25:17] <wikibugs>	 (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072258 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:25:41] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]]
[17:25:44] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[17:26:23] <wikibugs>	 (03PS1) 10Zabe: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490)
[17:26:57] <wikibugs>	 (03PS1) 10Zabe: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490)
[17:27:00] <wikibugs>	 (03PS1) 10Ssingh: P:ntp: bump check_interval to 5 mins [puppet] - 10https://gerrit.wikimedia.org/r/1072261
[17:27:10] <wikibugs>	 (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:27:15] <wikibugs>	 (03CR) 10Zabe: [C:03+2] migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:27:53] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: add ref-quality isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou)
[17:28:07] <wikibugs>	 (03PS2) 10Ssingh: P:ntp: bump monitoring check_interval to 5 mins [puppet] - 10https://gerrit.wikimedia.org/r/1072261
[17:28:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P68973 and previous config saved to /var/cache/conftool/dbconfig/20240911-172852-ladsgroup.json
[17:29:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:29:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:30:01] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-main,name=codfw
[17:30:18] <wikibugs>	 (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option for not deleting text row [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072259 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:30:20] <wikibugs>	 (03Merged) 10jenkins-bot: migrateESRefToContentTable: Add option to dump tt: -> es: reference [extensions/WikimediaMaintenance] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1072260 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[17:30:39] <wikibugs>	 (03PS2) 10AikoChou: ml-services: deploy ref-quality isvc in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902)
[17:30:44] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]], [[gerrit:1072259|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072260|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]]
[17:30:47] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[17:32:07] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] "Thanks for the review! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou)
[17:33:04] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy ref-quality isvc in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072252 (https://phabricator.wikimedia.org/T371902) (owner: 10AikoChou)
[17:35:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] P:ntp: bump monitoring check_interval to 5 mins [puppet] - 10https://gerrit.wikimedia.org/r/1072261 (owner: 10Ssingh)
[17:39:42] <swfrench-wmf>	 !log imported php-uuid_1.2.0-12+wmf11u1 into component/php81 - T372507
[17:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:46] <stashbot>	 T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507
[17:43:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10138403 (10MoritzMuehlenhoff)
[17:44:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T371742)', diff saved to https://phabricator.wikimedia.org/P68974 and previous config saved to /var/cache/conftool/dbconfig/20240911-174400-ladsgroup.json
[17:44:02] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[17:44:04] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[17:44:15] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance
[17:44:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T371742)', diff saved to https://phabricator.wikimedia.org/P68975 and previous config saved to /var/cache/conftool/dbconfig/20240911-174422-ladsgroup.json
[17:45:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10138408 (10MoritzMuehlenhoff)
[17:45:45] <moritzm>	 !log installing postgresql-15 security updates
[17:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:12] <wikibugs>	 (03PS1) 10Varnent: Updated license information from CC 3.0 to CC 4.0 per request from Legal. [puppet] - 10https://gerrit.wikimedia.org/r/1072265
[17:47:14] <sukhe>	 !log sukhe@dns7001:~$ sudo systemctl restart ntpsec.service
[17:47:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:08] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[17:48:10] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[17:48:17] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]], [[gerrit:1072259|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072260|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:48:19] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[17:48:20] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[17:48:52] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[17:49:21] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[17:50:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10138420 (10MoritzMuehlenhoff)
[17:50:13] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[17:50:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10138421 (10VRiley-WMF) After working with Dell on this issue for a while and they reviewed the logs, they don't see any issues with the Hardware. Would it be possible to reinstall the OS...
[17:50:23] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[17:51:45] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[17:51:51] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[17:52:50] <sukhe>	 !log re-enable puppet on A:dnsbox and enable agent
[17:52:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:55] <sukhe>	 !log re-enable puppet on A:dnsbox and [run] agent
[17:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Bird::anycast - allow BFD connections from router link-local IP [puppet] - 10https://gerrit.wikimedia.org/r/1071858 (https://phabricator.wikimedia.org/T374379) (owner: 10Cathal Mooney)
[17:54:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10138444 (10MoritzMuehlenhoff)
[17:58:41] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox
[18:00:04] <jouncebot>	 dduvall and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T1800).
[18:00:13] <zabe>	 sorry, scap is still running, it's far slower than expected
[18:01:37] <icinga-wm>	 PROBLEM - NTP peers on dns1004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP
[18:02:06] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072257|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072258|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]], [[gerrit:1072259|migrateESRefToContentTable: Add option for not deleting text row (T183490)]], [[gerrit:1072260|migrateESRefToContentTable: Add option to dump tt: -> es: reference (T183490)]] (duration: 31m 21s)
[18:02:10] * zabe done
[18:02:10] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[18:04:02] <topranks>	 sukhe: are you around - I'm unsure about the dns1004 ntp alert above?
[18:04:10] <sukhe>	 topranks: yeah, all good
[18:04:15] <topranks>	 ok yeah 
[18:04:20] <topranks>	 checking on the host all looks ok to me
[18:04:27] <sukhe>	 thanks. we removed iburst today so the initial sync takes longer
[18:04:38] <sukhe>	 I bumped the check_interval but I think it needs to be higher
[18:04:40] <sukhe>	 will fix it
[18:04:50] <topranks>	 ok cool 
[18:04:57] <topranks>	 remind me what iburst does again?
[18:05:26] <wikibugs>	 (03CR) 10Xcollazo: "I was under the impression that older revisions continue to be CC 3.0 rather than CC 4.0." [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent)
[18:05:29] <topranks>	 offsets don't seem too bad, we obviously have a lowish threshold for it
[18:05:30] <sukhe>	 so like when ntpsec service on dns1004 starts or we reimage the server or reboot
[18:05:48] <sukhe>	 we send a burst of six packets for a faster sync vs the usual one
[18:05:55] <wikibugs>	 (03PS1) 10Jforrester: dumps: Update legal.html file to list different licences for Wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072268
[18:05:56] <sukhe>	 we debated quite a lot about this and decided to do away with it 
[18:06:16] <dduvall>	 zabe: all clear?
[18:06:18] <topranks>	 ok cool - good to refresh my memory on that thanks 
[18:06:24] <zabe>	 dduvall: yep
[18:06:27] <dduvall>	 thanks!
[18:06:28] <wikibugs>	 (03PS1) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602)
[18:06:29] <sukhe>	 we had it both for our servers (the dns boxes) and per-country pools and we removed both
[18:06:37] <icinga-wm>	 RECOVERY - NTP peers on dns1004 is OK: NTP OK: Offset -0.001005084 secs https://wikitech.wikimedia.org/wiki/NTP
[18:06:37] <topranks>	 cool
[18:07:29] <topranks>	 it's still there on dns1004 in ntp.conf though 
[18:07:31] <topranks>	 pool 0.us.pool.ntp.org iburst
[18:07:48] <topranks>	 not for the other dns servers though just that one 
[18:09:24] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072270 (https://phabricator.wikimedia.org/T373641)
[18:09:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072270 (https://phabricator.wikimedia.org/T373641) (owner: 10TrainBranchBot)
[18:10:02] <wikibugs>	 (03CR) 10Varnent: "I have pinged Shaun S in Legal via Slack to have him verify. He may defer to someone else within Legal who already has an account here to" [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent)
[18:10:21] <wikibugs>	 (03PS2) 10Scott French: php8.1: add php8.1-uuid to php8.1-cli and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602)
[18:11:01] <sukhe>	 topranks: thanks for pointing that out. it's weird because https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/1502b725372051a7d1e4a31d501b4e69ad393330%5E%21/#F1
[18:11:03] <wikibugs>	 (CR) Varnent: "To clarify, Shaun is Legal rep that made initial request for that information to be updated." [puppet] - https://gerrit.wikimedia.org/r/1072265 (owner: Varnent)
[18:11:23] <sukhe>	 and https://puppetboard.wikimedia.org/report/dns1004.wikimedia.org/a4c829ca1019a6d345486767ef567e7b6237b574
[18:11:34] <sukhe>	 ah sorry yeah
[18:11:41] <sukhe>	 you are looking at ntp.conf. the file should be ntpsec.conf
[18:11:57] <sukhe>	 I will remove ntp.conf from everywhere
[18:12:00] <wikibugs>	 (CR) Scott French: "Hugh, since you kindly reviewed the last patch series, could I ask you to take a look at this as well? Thanks!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1072269 (https://phabricator.wikimedia.org/T372602) (owner: Scott French)
[18:12:06] <sukhe>	 so look at /etc/ntpsec/ntp.conf
[18:12:23] <sukhe>	 pool 0.us.pool.ntp.org
[18:12:26] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.43.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1072270 (https://phabricator.wikimedia.org/T373641) (owner: TrainBranchBot)
[18:12:56] <topranks>	 sukhe: cool yeah 
[18:13:05] <icinga-wm>	 RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:13:13] <topranks>	 probably ntp.conf is an artifact from before we used ntpsec?
[18:13:15] <sukhe>	 yep
[18:13:21] <topranks>	 good to know 
[18:13:27] <sukhe>	 good shoutout though, I will remove all traces of it to avoid confusion
[18:13:43] <sukhe>	 (which is what we have been doing, even though ntpsec was aliasing ntpd, we are setting ntpsec everywhere)
[18:13:49] <wikibugs>	 (CR) Ladsgroup: "I can deploy it once the question is answered." [puppet] - https://gerrit.wikimedia.org/r/1072265 (owner: Varnent)
[18:20:57] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[18:22:22] <wikibugs>	 (PS1) CDanis: NEL alerts: exclude common noise [puppet] - https://gerrit.wikimedia.org/r/1072271 (https://phabricator.wikimedia.org/T374563)
[18:22:45] <icinga-wm>	 PROBLEM - NTP peers on dns1006 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP
[18:23:11] <sukhe>	 ^ yeah, bumping this shortly
[18:23:30] <sukhe>	 there's nothing broken as the syncs are spaced apart but yes, we should not alert as quickly as we were before
[18:25:28] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[18:25:53] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.22  refs T373641
[18:25:57] <stashbot>	 T373641: 1.43.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T373641
[18:27:45] <icinga-wm>	 RECOVERY - NTP peers on dns1006 is OK: NTP OK: Offset 0.000132206 secs https://wikitech.wikimedia.org/wiki/NTP
[18:29:40] <Amir1>	 🍿
[18:30:21] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php test2wiki --skip text_table_cleanup/test2wiki text_table_dump/test2wiki --sleep 1 # T183490
[18:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:24] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[18:30:29] <icinga-wm>	 RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:31:18] <zabe>	 ok, forgot the --dump lol, restarted
[18:32:44] <wikibugs>	 (CR) Stoyofuku-wmf: [C:+1] "thank you!!" [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072255 (owner: Jdlrobson)
[18:32:55] <icinga-wm>	 PROBLEM - NTP peers on dns2004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP
[18:33:00] <wikibugs>	 (CR) CDanis: [C:-1] "If you planned on reusing the same cert as is in production now, this won't work -- lists1004.wikimedia.org is not one of its SANs." [puppet] - https://gerrit.wikimedia.org/r/1072247 (owner: EoghanGaffney)
[18:33:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072255 (owner: Jdlrobson)
[18:33:34] <dduvall>	 zabe: i'm seeing a slew of errors from `migrateESRefToContentTable.php` 
[18:33:48] <zabe>	 yep, already canceled
[18:33:53] <dduvall>	 `PHP Warning: fwrite() expects parameter 1 to be resource, bool given`
[18:34:13] <dduvall>	 do you need me to rollback wmf.22?
[18:34:18] <zabe>	 yes, and more interesting PHP Warning: `fopen(/home/zabe/text_table_dump/test2wiki): failed to open stream: Permission denied`
[18:34:22] <zabe>	 dduvall: nope
[18:34:25] <dduvall>	 k
[18:34:29] <zabe>	 but thanks
[18:34:33] <dduvall>	 np
[18:35:32] <wikibugs>	 (PS1) Ssingh: P:ntp: bump retry_interval to 5 mins for NTP monitoring check [puppet] - https://gerrit.wikimedia.org/r/1072273
[18:37:22] <wikibugs>	 (PS1) Jdlrobson: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072274
[18:37:35] <wikibugs>	 (Abandoned) Jdlrobson: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072255 (owner: Jdlrobson)
[18:37:55] <icinga-wm>	 RECOVERY - NTP peers on dns2004 is OK: NTP OK: Offset -0.000446636 secs https://wikitech.wikimedia.org/wiki/NTP
[18:39:28] <Amir1>	 zabe: just give 777 to the file :D
[18:41:09] <wikibugs>	 (PS1) Ssingh: P:ntp and nagios_core: update check_ntp_peer to include stratum checks [puppet] - https://gerrit.wikimedia.org/r/1072276
[18:41:38] <wikibugs>	 (CR) Ssingh: [C:+2] P:ntp: bump retry_interval to 5 mins for NTP monitoring check [puppet] - https://gerrit.wikimedia.org/r/1072273 (owner: Ssingh)
[18:42:07] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3956/co" [puppet] - https://gerrit.wikimedia.org/r/1072276 (owner: Ssingh)
[18:42:50] <zabe>	 Amir1: yep :D
[18:42:55] <sukhe>	 !log running agent on O:alerting_host
[18:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:22] <wikibugs>	 (CR) CDanis: [C:+1] ripeatlas: add ping to wmf anchors check [alerts] - https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: Tiziano Fogli)
[18:46:16] <wikibugs>	 (CR) Stoyofuku-wmf: [C:+1] "We're not worried that this is a cherry pick of the cherry pick, right?  Everything looks fine to me so approving" [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072274 (owner: Jdlrobson)
[18:46:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072274 (owner: Jdlrobson)
[18:47:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T371742)', diff saved to https://phabricator.wikimedia.org/P68976 and previous config saved to /var/cache/conftool/dbconfig/20240911-184750-ladsgroup.json
[18:47:54] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[18:57:53] <icinga-wm>	 PROBLEM - NTP peers on dns2006 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP
[18:59:27] <icinga-wm>	 RECOVERY - NTP peers on dns2006 is OK: NTP OK: Offset -0.000629282 secs https://wikitech.wikimedia.org/wiki/NTP
[19:00:22] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:00:55] <wikibugs>	 (CR) Herron: [C:+1] NEL alerts: exclude common noise [puppet] - https://gerrit.wikimedia.org/r/1072271 (https://phabricator.wikimedia.org/T374563) (owner: CDanis)
[19:02:15] <wikibugs>	 (CR) CDanis: [C:+2] NEL alerts: exclude common noise [puppet] - https://gerrit.wikimedia.org/r/1072271 (https://phabricator.wikimedia.org/T374563) (owner: CDanis)
[19:02:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P68977 and previous config saved to /var/cache/conftool/dbconfig/20240911-190257-ladsgroup.json
[19:05:22] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:07:50] <icinga-wm>	 ACKNOWLEDGEMENT - NTP peers on dns3003 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown Sukhbir Singh cookbook run https://wikitech.wikimedia.org/wiki/NTP
[19:15:22] <wikibugs>	 (PS2) Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585)
[19:15:52] <wikibugs>	 (PS3) Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585)
[19:18:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P68978 and previous config saved to /var/cache/conftool/dbconfig/20240911-191805-ladsgroup.json
[19:22:12] <jinxer-wm>	 FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[19:29:19] <wikibugs>	 (PS1) Scott French: aptrepo: ffmpeg bullseye component [puppet] - https://gerrit.wikimedia.org/r/1072282 (https://phabricator.wikimedia.org/T374502)
[19:31:53] <wikibugs>	 (CR) Eevans: [C:+2] puppet8: ensure cassandra passwords are defined [puppet] - https://gerrit.wikimedia.org/r/1071020 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[19:32:18] <wikibugs>	 (CR) DCausse: "lgtm, chart version needs to be updated I think" [deployment-charts] - https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: Bking)
[19:33:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T371742)', diff saved to https://phabricator.wikimedia.org/P68979 and previous config saved to /var/cache/conftool/dbconfig/20240911-193312-ladsgroup.json
[19:33:15] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[19:33:17] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[19:33:28] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance
[19:33:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T371742)', diff saved to https://phabricator.wikimedia.org/P68980 and previous config saved to /var/cache/conftool/dbconfig/20240911-193335-ladsgroup.json
[19:42:58] <Nemoralis>	 jouncebot: next
[19:42:58] <jouncebot>	 In 0 hour(s) and 17 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T2000)
[19:46:37] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:46:40] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:47:42] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:47:45] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:53:28] <wikibugs>	 (CR) BCornwall: [V:+1] varnish: Conditionally monitor vcl reloads (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071935 (owner: BCornwall)
[19:56:29] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[19:56:32] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[19:59:44] <wikibugs>	 (CR) Jdlrobson: [C:+1] Turn off feature flag to move donate link everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T2000).
[20:00:04] <jouncebot>	 kimberly_sarabia, toyofuku, Nemoralis, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:18] <kimberly_sarabia>	 Hi I'm here 
[20:00:21] <Nemoralis>	 o/
[20:00:57] <wikibugs>	 (PS8) Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020)
[20:01:23] <wikibugs>	 (PS12) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[20:01:40] <toyofuku>	 here as well!
[20:01:47] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1072276 (owner: Ssingh)
[20:02:56] <wikibugs>	 (PS13) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[20:06:25] <wikibugs>	 (CR) Ebrahim: "I'm very sorry about that. Thanks for making this possible" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[20:07:19] <cjming>	 hi - sorry to be late - i can deploy
[20:07:33] <toyofuku>	 thank you!!
[20:07:52] <wikibugs>	 (PS14) Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380)
[20:08:08] <cjming>	 kimberly_sarabia: i'll start with yours!
[20:08:17] <kimberly_sarabia>	 thanks!
[20:09:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) (owner: Jdlrobson)
[20:09:29] <wikibugs>	 (CR) Clare Ming: [C:+2] Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072274 (owner: Jdlrobson)
[20:09:31] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:09:35] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[20:09:49] <wikibugs>	 (Merged) jenkins-bot: Roll out appearance menu and font size change to sister projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) (owner: Jdlrobson)
[20:10:05] <cjming>	 toyofuku: i manually +2'd your MF backport - guessing it'll take a while to merge
[20:10:10] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1059393|Roll out appearance menu and font size change to sister projects (T371020)]]
[20:10:13] <stashbot>	 T371020: Roll out appearance menu and font size change to sister projects  - https://phabricator.wikimedia.org/T371020
[20:10:21] <toyofuku>	 Makes sense - thank you for thinking of that!
[20:11:22] <cjming>	 np! ya - it says 25 mins in zuul
[20:11:28] <toyofuku>	 rip
[20:12:51] <wikibugs>	 (PS1) JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - https://gerrit.wikimedia.org/r/1072290
[20:13:35] <wikibugs>	 (PS2) JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - https://gerrit.wikimedia.org/r/1072290
[20:13:46] <logmsgbot>	 !log cjming@deploy1003 jdlrobson, cjming: Backport for [[gerrit:1059393|Roll out appearance menu and font size change to sister projects (T371020)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:13:46] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072290 (owner: JHathaway)
[20:13:51] <cjming>	 kimberly_sarabia: up on test servers if you want to verify - lmk if/when to sync
[20:14:26] <kimberly_sarabia>	 ok one moment
[20:18:27] <kimberly_sarabia>	 LGTM!
[20:18:34] <kimberly_sarabia>	 cjming: ^
[20:18:42] <cjming>	 yay! syncing
[20:18:44] <logmsgbot>	 !log cjming@deploy1003 jdlrobson, cjming: Continuing with sync
[20:23:19] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1059393|Roll out appearance menu and font size change to sister projects (T371020)]] (duration: 13m 09s)
[20:23:23] <stashbot>	 T371020: Roll out appearance menu and font size change to sister projects  - https://phabricator.wikimedia.org/T371020
[20:24:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[20:24:12] <cjming>	 kimberly_sarabia: should be live!
[20:24:23] <wikibugs>	 (PS1) BCornwall: trafficserver: Conditionally set monitoring [puppet] - https://gerrit.wikimedia.org/r/1072295
[20:24:27] <cjming>	 toyofuku: doing your config patch next
[20:24:35] <wikibugs>	 ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573 (phaultfinder) NEW
[20:24:35] <toyofuku>	 thank you thank you
[20:25:01] <kimberly_sarabia>	 cjming: ty
[20:25:01] <wikibugs>	 (PS4) Stoyofuku-wmf: Turn off feature flag to move donate link everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585)
[20:26:10] <wikibugs>	 (CR) TrainBranchBot: "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[20:26:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10138926 (Jclark-ctr) [Wed Sep 11 13:51:52 2024] sd 0:0:2:0: [sdc] tag#2137 CDB: Write(10) 2a 00 00 08 f0 10 00 00 08 00 [Wed Sep 11 13:51:52 2024] I/O error, dev sdc, sector 585744 op 0x1:(WRITE) flag...
[20:26:52] <wikibugs>	 (Merged) jenkins-bot: Turn off feature flag to move donate link everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1071961 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf)
[20:27:13] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1071961|Turn off feature flag to move donate link everywhere (T373585)]]
[20:27:17] <stashbot>	 T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585
[20:29:08] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3957/console" [puppet] - https://gerrit.wikimedia.org/r/1072295 (owner: BCornwall)
[20:30:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[20:30:43] <logmsgbot>	 !log cjming@deploy1003 cjming, toyofuku: Backport for [[gerrit:1071961|Turn off feature flag to move donate link everywhere (T373585)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:30:45] <cjming>	 toyofuku: your config patch is up on mwdebug if you'd like to test - lmk if/when to sync
[20:30:47] <Nemoralis>	 :eyes:
[20:30:54] <toyofuku>	 looking now!
[20:31:41] <wikibugs>	 (PS2) BCornwall: trafficserver: no logging on disabled monitoring [puppet] - https://gerrit.wikimedia.org/r/1072295
[20:32:05] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox
[20:32:08] <toyofuku>	 All good, thank you!
[20:32:17] <cjming>	 cool - syncing :)
[20:32:22] <logmsgbot>	 !log cjming@deploy1003 cjming, toyofuku: Continuing with sync
[20:32:59] <cjming>	 toyofuku: i think your backport will finish merging in the next few mins - so perfect timing to do that one next
[20:33:06] <toyofuku>	 yayyy
[20:33:37] <cjming>	 Nemoralis: i'll plan on doing your patch afterwards
[20:33:46] <Nemoralis>	 ok
[20:35:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[20:35:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10138974 (Jclark-ctr) jclark@prometheus1008:~$ for disk in $(lsblk -dn -o NAME); do     echo "Device: /dev/$disk"     udevadm info -q property -n /dev/$disk | grep -E "ID_SERIAL|ID_PATH" done Device: /...
[20:35:57] <wikibugs>	 (Merged) jenkins-bot: Ensure that it is possible to override MFNamespacesWithLeadParagraphs [extensions/MobileFrontend] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072274 (owner: Jdlrobson)
[20:36:10] <Nemoralis>	 cjming: is running maintenance script (namespaceDupes) required? This patch will update Project: namespace
[20:36:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T371742)', diff saved to https://phabricator.wikimedia.org/P68982 and previous config saved to /var/cache/conftool/dbconfig/20240911-203623-ladsgroup.json
[20:36:27] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[20:36:37] <cscott>	 cjming: i'm here too, sorry i'm late
[20:36:54] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:36:56] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1071961|Turn off feature flag to move donate link everywhere (T373585)]] (duration: 09m 42s)
[20:36:56] <logmsgbot>	 !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[20:36:59] <stashbot>	 T373585: Deploy new donation entry point - https://phabricator.wikimedia.org/T373585
[20:37:00] <cjming>	 Nemoralis: i'm not sure actually - i don't think so somehow
[20:37:16] <cjming>	 cscott: no worries! you're early for your patch actually
[20:37:55] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072274|Ensure that it is possible to override MFNamespacesWithLeadParagraphs]]
[20:38:09] <cjming>	 toyofuku: config patch should be live, deploying your backport now
[20:38:16] <toyofuku>	 Thank you!
[20:38:19] <cjming>	 yw!
[20:38:46] <wikibugs>	 (PS3) NMW03: Update wgSitename for tlywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009)
[20:38:57] <toyofuku>	 looks like the other one just got merged
[20:39:12] <wikibugs>	 (CR) Ssingh: [C:+1] varnish: Conditionally monitor vcl reloads (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071935 (owner: BCornwall)
[20:39:22] <cjming>	 toyofuku: yes - i just scap backported that one - should be up on test servers here soon
[20:39:32] <toyofuku>	 Thank you!
[20:40:23] <wikibugs>	 (CR) Ssingh: [C:+1] "Check with vg once too but looks good!" [puppet] - https://gerrit.wikimedia.org/r/1072295 (owner: BCornwall)
[20:40:43] <cjming>	 does anyone here know if we have to run namespace dupes script on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1070347? I'm going to err on the side of not
[20:42:33] <logmsgbot>	 !log cjming@deploy1003 jdlrobson, cjming: Backport for [[gerrit:1072274|Ensure that it is possible to override MFNamespacesWithLeadParagraphs]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:42:36] <cjming>	 toyofuku: 2nd patch up on test servers if you want to check - please lmk if/when to sync
[20:42:43] <toyofuku>	 looking now!
[20:42:50] <Nemoralis>	 doc page says "after adding a namespace (or interwiki prefix)". I am not sure
[20:43:06] <toyofuku>	 worked!  thank you
[20:43:14] <cjming>	 nice - going live!
[20:43:17] <logmsgbot>	 !log cjming@deploy1003 jdlrobson, cjming: Continuing with sync
[20:43:56] <cjming>	 Nemoralis: me neither
[20:44:25] <cjming>	 i feel like i've only had to run that script when the namespace dupes file was updated
[20:45:36] <wikibugs>	 (PS3) JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - https://gerrit.wikimedia.org/r/1072290
[20:45:43] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072290 (owner: JHathaway)
[20:47:03] <cjming>	 Nemoralis: i can run it afterwards - should be like 2 secs - i don't think it can hurt anything
[20:47:08] <Nemoralis>	 sure
[20:47:49] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072274|Ensure that it is possible to override MFNamespacesWithLeadParagraphs]] (duration: 09m 54s)
[20:48:01] <cjming>	 toyofuku: both your patches should be live!
[20:48:08] <toyofuku>	 thank you so much!
[20:48:22] <cjming>	 yw!
[20:48:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: NMW03)
[20:49:22] <wikibugs>	 (Merged) jenkins-bot: Update wgSitename for tlywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) (owner: NMW03)
[20:49:40] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1070347|Update wgSitename for tlywiki (T367009)]]
[20:49:43] <stashbot>	 T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009
[20:51:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P68983 and previous config saved to /var/cache/conftool/dbconfig/20240911-205130-ladsgroup.json
[20:51:37] <logmsgbot>	 !log cjming@deploy1003 cjming, nmw03: Backport for [[gerrit:1070347|Update wgSitename for tlywiki (T367009)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:51:58] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:52:28] <cjming>	 Nemoralis: want to test? lmk when to sync
[20:53:18] <Nemoralis>	 site name works fine, let me test namespace
[20:56:14] <Nemoralis>	 oh, it looks like I forgot to update wgMetaNamespace
[20:56:27] <wikibugs>	 (CR) Ebrahim: Enable the dark mode in Portal namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[20:56:37] <Nemoralis>	 I think you can continue to sync now, I will send another patch to update that
[20:56:43] <cjming>	 sure thing
[20:56:45] <logmsgbot>	 !log cjming@deploy1003 cjming, nmw03: Continuing with sync
[20:57:52] <cjming>	 cscott: if you're still around i'll do your patch next
[20:57:57] <cscott>	 i'm here!
[20:58:16] <wikibugs>	 (PS1) Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439)
[20:59:09] <wikibugs>	 (PS3) C. Scott Ananian: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229)
[21:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240911T2100)
[21:01:16] <cjming>	 do i have time to squeeze in one more config patch before the abstract wikipedia folks have the window?
[21:01:31] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070347|Update wgSitename for tlywiki (T367009)]] (duration: 11m 51s)
[21:01:40] <stashbot>	 T367009: Change namespace aliases for Talysh Wikipedia - https://phabricator.wikimedia.org/T367009
[21:01:45] <Nemoralis>	 thanks!
[21:02:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: C. Scott Ananian)
[21:02:33] <cjming>	 Nemoralis: i ran the maint script - said there wasn't anything to fix fwiw
[21:02:53] <wikibugs>	 (Merged) jenkins-bot: Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1072192 (https://phabricator.wikimedia.org/T373229) (owner: C. Scott Ananian)
[21:03:10] <Nemoralis>	 cjming: probably because of wgMetaNamespace
[21:03:14] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1072192|Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage (T373229)]]
[21:03:16] <Nemoralis>	 nevermind, thanks again
[21:03:18] <stashbot>	 T373229: Deploy to next set of wikivoyages (ps,bn,hi,tr) week of Sep 9 - https://phabricator.wikimedia.org/T373229
[21:03:58] <cjming>	 np!
[21:05:19] <logmsgbot>	 !log cjming@deploy1003 cjming, cscott: Backport for [[gerrit:1072192|Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage (T373229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:05:25] <cjming>	 cscott: up on test servers if you'd like to verify
[21:05:28] <cscott>	 ok, i'll check it out
[21:06:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P68984 and previous config saved to /var/cache/conftool/dbconfig/20240911-210638-ladsgroup.json
[21:06:52] <cscott>	 cjming: looks good
[21:06:58] <cjming>	 awesome - syncing
[21:07:02] <logmsgbot>	 !log cjming@deploy1003 cjming, cscott: Continuing with sync
[21:11:33] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072192|Deploy Parsoid Read Views to bn/hi/ps/tr wikivoyage (T373229)]] (duration: 08m 19s)
[21:11:37] <stashbot>	 T373229: Deploy to next set of wikivoyages (ps,bn,hi,tr) week of Sep 9 - https://phabricator.wikimedia.org/T373229
[21:12:06] <cscott>	 thanks cjming !
[21:13:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[21:13:33] <cjming>	 yw!
[21:14:07] <cjming>	 !log end of UTC late backport window
[21:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[21:21:18] <wikibugs>	 (CR) Cwhite: "Ahhhh, I see what you mean now! The commit messages between the two are 97% identical - both mention activating the same services (Icinga," [puppet] - https://gerrit.wikimedia.org/r/1071701 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[21:21:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T371742)', diff saved to https://phabricator.wikimedia.org/P68985 and previous config saved to /var/cache/conftool/dbconfig/20240911-212145-ladsgroup.json
[21:21:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[21:21:51] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[21:22:01] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[21:22:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68986 and previous config saved to /var/cache/conftool/dbconfig/20240911-212208-ladsgroup.json
[21:22:18] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10139101 (jhathaway)
[21:22:38] <wikibugs>	 (CR) Cwhite: alert: Failover from alert1001 to alert2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1071700 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[21:24:53] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:25:10] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[21:29:26] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:29:29] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[21:30:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:30:25] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10139158 (Jclark-ctr) @ABran-WMF  @wiki_willy   I glanced at this for Val  we need assistance troubleshooting from service owner.  I was looking at console it is in emergency mode and n...
[21:30:36] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[21:30:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:33:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ganeti10 - jclark@cumin1002"
[21:33:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ganeti10 - jclark@cumin1002"
[21:33:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:34:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:36:06] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10139152 (Jdlrobson)
[21:36:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139169 (phaultfinder)
[21:37:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:39:04] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:39:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:39:26] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:40:03] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:40:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:40:18] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:40:56] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:41:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:41:22] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:41:38] <wikibugs>	 (PS1) JHathaway: puppet8: add explicit typecast [puppet] - https://gerrit.wikimedia.org/r/1072301 (https://phabricator.wikimedia.org/T372664)
[21:41:48] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:42:51] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072301 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:43:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:43:25] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:44:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072189 (https://phabricator.wikimedia.org/T374241) (owner: Jforrester)
[21:44:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072190 (https://phabricator.wikimedia.org/T374241) (owner: Jforrester)
[21:44:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:45:13] <wikibugs>	 (PS3) Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[21:45:57] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:47:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:47:14] <wikibugs>	 (PS1) JHathaway: puppet8: account for unknown probe types [puppet] - https://gerrit.wikimedia.org/r/1072303 (https://phabricator.wikimedia.org/T372664)
[21:47:26] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072303 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:48:19] <inflatador>	 !log bking@deploy1003 test deploying flink operator in staging T373195
[21:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:23] <stashbot>	 T373195: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195
[21:48:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1041
[21:49:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1040
[21:49:52] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1040
[21:50:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1039
[21:50:08] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1039
[21:50:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1041
[21:50:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1042
[21:50:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043
[21:50:18] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1042
[21:50:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1044
[21:50:31] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti1043
[21:51:05] <wikibugs>	 (Merged) jenkins-bot: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072189 (https://phabricator.wikimedia.org/T374241) (owner: Jforrester)
[21:51:18] <wikibugs>	 (Merged) jenkins-bot: ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily [extensions/WikiLambda] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072190 (https://phabricator.wikimedia.org/T374241) (owner: Jforrester)
[21:51:41] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1072189|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]], [[gerrit:1072190|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]]
[21:51:44] <stashbot>	 T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241
[21:51:54] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1044
[21:52:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043
[21:53:26] <wikibugs>	 (CR) Jdlrobson: [C:+1] Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[21:53:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1043
[21:53:32] <wikibugs>	 (PS4) Bking: rdf-streaming-updater: switch to calico-based network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195)
[21:53:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1045
[21:53:47] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1072189|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]], [[gerrit:1072190|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:54:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:54:18] <wikibugs>	 (PS4) JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - https://gerrit.wikimedia.org/r/1072290
[21:54:19] <wikibugs>	 (PS15) Jdlrobson: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[21:54:26] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072290 (owner: JHathaway)
[21:54:41] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1045
[21:54:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1046
[21:54:56] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[21:56:09] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1046
[21:56:13] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1047
[21:56:20] <inflatador>	 !log bking@deploy1003 test deploy of flink operator in staging cancelled with no changes T373195
[21:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:24] <stashbot>	 T373195: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195
[21:57:26] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1047
[21:57:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1048
[21:58:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1048
[21:58:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1049
[21:59:33] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072189|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]], [[gerrit:1072190|ZObjectFactory::validatePersistentKeys: Disable use of JsonSchema, at least temporarily (T374241)]] (duration: 07m 51s)
[21:59:36] <stashbot>	 T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241
[21:59:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1049
[21:59:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1050
[22:00:14] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[22:00:19] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[22:00:40] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[22:00:52] <logmsgbot>	 !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[22:00:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:01:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:01:12] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1050
[22:01:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1051
[22:01:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1052
[22:02:34] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1051
[22:03:40] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1052
[22:04:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:07:04] <wikibugs>	 (PS1) Jforrester: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830)
[22:07:16] <wikibugs>	 (PS1) Jforrester: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830)
[22:09:28] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp2027 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:10:28] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp2027 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:11:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:12:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:13:10] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:13:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:14:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:14:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:14:14] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:14:48] <wikibugs>	 (CR) Jdlrobson: [C:+1] Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[22:14:49] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:15:31] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:16:02] <wikibugs>	 (PS1) Ladsgroup: admin: Add Philippe Saade to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1072308 (https://phabricator.wikimedia.org/T374008)
[22:16:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:16:41] <wikibugs>	 (PS16) Jdlrobson: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[22:17:04] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:17:09] <wikibugs>	 (CR) Jdlrobson: [C:+1] "I restricted the liquid threads namespaces to the 5 wikis that still have it. This LGTM for deployment now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[22:17:17] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for Philippe Saade - https://phabricator.wikimedia.org/T374008#10139251 (Ladsgroup) Almost ready, I need to check this https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group  give me a bit.
[22:17:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:18:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:18:39] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:18:41] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:19:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:19:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:19:11] <wikibugs>	 (PS2) Hamish: u4cwiki: create case and case_talk namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1072204 (https://phabricator.wikimedia.org/T374439)
[22:19:31] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:19:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:20:10] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:20:57] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:21:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:21:14] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1052.eqiad.wmnet with OS bookworm
[22:21:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10139254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm
[22:26:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:26:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[22:26:10] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582 (MBinder_WMF) NEW
[22:26:30] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:27:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68987 and previous config saved to /var/cache/conftool/dbconfig/20240911-222711-ladsgroup.json
[22:27:15] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[22:27:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:27:39] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:28:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1041.eqiad.wmnet with OS bookworm
[22:28:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10139277 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1041.eqiad.wmnet with OS bookworm
[22:28:19] <wikibugs>	 (PS5) Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041
[22:28:36] <wikibugs>	 (CR) CI reject: [V:-1] Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041 (owner: Jdlrobson)
[22:29:57] <wikibugs>	 (PS1) Ladsgroup: admin: Add echukwukere to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1072311 (https://phabricator.wikimedia.org/T374386)
[22:33:56] <wikibugs>	 (CR) CI reject: [V:-1] SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:35:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:35:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:36:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:36:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:38:11] <wikibugs>	 (Merged) jenkins-bot: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.22) - https://gerrit.wikimedia.org/r/1072305 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:38:36] <wikibugs>	 (Merged) jenkins-bot: SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() [core] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1072306 (https://phabricator.wikimedia.org/T373830) (owner: Jforrester)
[22:38:50] <James_F>	 Finally.
[22:38:56] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1072305|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]], [[gerrit:1072306|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]]
[22:39:00] <stashbot>	 T373830: Deprecated: Use of MediaWiki\Output\OutputPage::setCategoryLinks was deprecated [Called from MediaWiki\Specials\SpecialExpandTemplates::showHtmlPreview] - https://phabricator.wikimedia.org/T373830
[22:41:03] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1072305|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]], [[gerrit:1072306|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:41:50] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[22:42:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P68988 and previous config saved to /var/cache/conftool/dbconfig/20240911-224218-ladsgroup.json
[22:46:24] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072305|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]], [[gerrit:1072306|SpecialExpandTemplates: Replace use of deprecated OutputPage::addCategoryLinks() (T373830)]] (duration: 07m 27s)
[22:46:29] <stashbot>	 T373830: Deprecated: Use of MediaWiki\Output\OutputPage::setCategoryLinks was deprecated [Called from MediaWiki\Specials\SpecialExpandTemplates::showHtmlPreview] - https://phabricator.wikimedia.org/T373830
[22:56:59] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10139315 (MBinder_WMF) {F57500778}  config file attached after confirming with @Ladsgroup
[22:57:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P68989 and previous config saved to /var/cache/conftool/dbconfig/20240911-225726-ladsgroup.json
[23:03:08] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10139331 (Ladsgroup) I'm guessing but:  - Instead of bast1002 or bast4003, use `bast4005.wikimedia.org` (depending on where you live). Otherwise, it'll...
[23:04:26] <wikibugs>	 SRE: Production access has been approved but not able to log in, access was a long time ago so it's a new problem - https://phabricator.wikimedia.org/T374582#10139333 (Dzahn) > I tried phab1001.eqiad.wmnet and bast1002.eqiad.wmnet.  Hi!  The issue here is that these host names are outdated.  Phabricator (Pho...
[23:12:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T371742)', diff saved to https://phabricator.wikimedia.org/P68990 and previous config saved to /var/cache/conftool/dbconfig/20240911-231233-ladsgroup.json
[23:12:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[23:12:37] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[23:12:49] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance
[23:12:51] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:13:04] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:13:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T371742)', diff saved to https://phabricator.wikimedia.org/P68991 and previous config saved to /var/cache/conftool/dbconfig/20240911-231311-ladsgroup.json
[23:13:43] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1052.eqiad.wmnet with OS bookworm
[23:13:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10139350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm executed w...
[23:18:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10139353 (Jclark-ctr) @Papaul  i have updated bmc and bios with no change to server. can you assist with this last...
[23:18:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:19:15] <wikibugs>	 (CR) Dzahn: "sorry, ignore my outdated comment" [puppet] - https://gerrit.wikimedia.org/r/1071964 (owner: Jasmine)
[23:19:22] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:22:12] <jinxer-wm>	 FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[23:24:10] <wikibugs>	 (PS1) Dzahn: vrts: switch inactive host vrts2001 to nftables as firewall provider [puppet] - https://gerrit.wikimedia.org/r/1072313 (https://phabricator.wikimedia.org/T370677)
[23:25:06] <wikibugs>	 (CR) Dzahn: [C:+1] "Thanks! made https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072313 to do it for just the inactive host as suggested" [puppet] - https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[23:28:51] <wikibugs>	 (PS1) Jforrester: On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315
[23:29:37] <wikibugs>	 (CR) CI reject: [V:-1] On wikis with the Translate extension, allow thanking of translationreview log actions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072315 (owner: Jforrester)
[23:29:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139369 (phaultfinder)
[23:30:07] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] gerrit: add backup::host, gerrit::migration etc to insetup role [puppet] - https://gerrit.wikimedia.org/r/1070683 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[23:37:58] <wikibugs>	 (PS2) Dzahn: site: (WIP) try applying gerrit role on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804)
[23:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072316
[23:38:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1072316 (owner: TrainBranchBot)
[23:39:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10139376 (phaultfinder)
[23:56:34] <wikibugs>	 (PS1) Andrea Denisse: alert: Enable the alert[12]002 hosts as alertmanagers [puppet] - https://gerrit.wikimedia.org/r/1072318 (https://phabricator.wikimedia.org/T372418)
[23:58:27] <wikibugs>	 (CR) Dzahn: "remaining diff https://puppet-compiler.wmflabs.org/output/1063893/3959/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)