[00:00:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802875 (Aklapper) Please link to one specific example or proof where someone committed directly a change for "requested permission changes for user groups or extension requests/configur...
[00:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:09] <jinxer-wm>	 FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:07:05] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802878 (Justman10000) >>! In T393587#10802875, @Aklapper wrote: > Please link to one specific example or proof where someone committed directly a change for "requested permiss...
[00:08:45] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143212
[00:08:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143212 (owner: TrainBranchBot)
[00:08:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802879 (Justman10000) The question remains: who guarantees that I can submit patches faster than others? How can I prove myself if I don't have any options available?
[00:09:00] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_esams
[00:11:04] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802881 (Aklapper) You repeatedly talked about "commit directly" here so I assume that you can provide an example when people "committed directly" (whatever that phrase means)?
[00:13:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802882 (Aklapper) >>! In T393587#10802879, @Justman10000 wrote: > How can I prove myself if I don't have any options available?  I answered that already in T393587#10801870...
[00:14:31] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802885 (Aklapper) >>! In T393587#10802879, @Justman10000 wrote: > The question remains: who guarantees that I can submit patches faster than others?  I have no idea why someone should "...
[00:14:49] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_esams
[00:20:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802897 (Pppery) I'll at least do the courtesy of answering Andre's request, assuming by "commit directly" you mean commit without code review by others.  The Gerrit query https://gerrit...
[00:27:37] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143212 (owner: TrainBranchBot)
[00:46:19] <wikibugs>	 (PS1) Tim Starling: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1143224 (https://phabricator.wikimedia.org/T393601)
[00:46:39] <wikibugs>	 (PS1) Tim Starling: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1143225 (https://phabricator.wikimedia.org/T393601)
[00:49:14] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/cdd7853a7f90dbe96c6896c4f027cbc0f493d5266ad74f35dd0255c6eecfcd48/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:56:14] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[01:04:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1143224 (https://phabricator.wikimedia.org/T393601) (owner: Tim Starling)
[01:04:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1143225 (https://phabricator.wikimedia.org/T393601) (owner: Tim Starling)
[01:05:28] <wikibugs>	 (Merged) jenkins-bot: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1143224 (https://phabricator.wikimedia.org/T393601) (owner: Tim Starling)
[01:05:29] <wikibugs>	 (Merged) jenkins-bot: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1143225 (https://phabricator.wikimedia.org/T393601) (owner: Tim Starling)
[01:06:06] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1143224|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]], [[gerrit:1143225|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]]
[01:06:09] <stashbot>	 T393601: Sidebar donate link targets are always in English - https://phabricator.wikimedia.org/T393601
[01:09:14] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:12:16] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 9713 MB (3% inode=67%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[01:14:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:42] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1143224|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]], [[gerrit:1143225|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:37:46] <stashbot>	 T393601: Sidebar donate link targets are always in English - https://phabricator.wikimedia.org/T393601
[01:38:16] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Continuing with sync
[01:43:07] <wikibugs>	 (PS4) Scott French: P:mw::maint::temporary_accounts: purge_temporary_accounts to k8s [puppet] - https://gerrit.wikimedia.org/r/1143197 (https://phabricator.wikimedia.org/T385866)
[01:47:49] <wikibugs>	 (CR) RLazarus: mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[01:47:53] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[01:50:16] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[01:52:18] <logmsgbot>	 !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143224|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]], [[gerrit:1143225|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]] (duration: 46m 12s)
[01:52:21] <stashbot>	 T393601: Sidebar donate link targets are always in English - https://phabricator.wikimedia.org/T393601
[01:53:07] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[02:00:04] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:36:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:04:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10803050 (phaultfinder)
[03:24:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10803069 (phaultfinder)
[03:41:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:46:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:46:54] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:55:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:55:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:55:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[03:55:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:57:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:57:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:58:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:58:30] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:59:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:01:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:02:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-drmrs and 2620:0:860:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:04:09] <jinxer-wm>	 FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[04:06:30] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:10:20] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:11:18] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:46:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:54] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0600).
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:36:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:38:08] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:38:50] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:39:24] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:43:53] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1047.eqiad.wmnet
[06:44:00] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:44:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:44:40] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:45:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
[06:47:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9
[06:49:44] <moritzm>	 FYI, ml-etcd2001 will briefly go down for a Ganeti reboot
[06:49:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
[06:51:44] <icinga-wm>	 PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:54:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9
[06:55:30] <icinga-wm>	 RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[06:56:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
[06:56:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti2032:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:56:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
[06:56:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.9
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove obsolete videoscaler cergen certs [puppet] - https://gerrit.wikimedia.org/r/1138713 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[07:04:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.9
[07:04:35] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete videoscaler stub certs [labs/private] - https://gerrit.wikimedia.org/r/1143392 (https://phabricator.wikimedia.org/T360636)
[07:06:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[07:06:57] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet
[07:07:41] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete videoscaler stub certs [labs/private] - https://gerrit.wikimedia.org/r/1143392 (https://phabricator.wikimedia.org/T360636) (owner: Muehlenhoff)
[07:10:31] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10803287 (MoritzMuehlenhoff) A fixed package is now in bookworm-proposed-updates and will be part of the Bookworm 12.11 point rel...
[07:12:16] <icinga-wm>	 RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[07:12:55] <wikibugs>	 (PS5) Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350)
[07:27:29] <wikibugs>	 (CR) DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: DCausse)
[07:29:10] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Pass krb2002 to Kerberos clients again [puppet] - https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[07:39:35] <logmsgbot>	 !log fab@deploy1003 Started deploy [airflow-dags/research@4367417]: (no justification provided)
[07:39:37] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: DCausse)
[07:40:01] <wikibugs>	 (PS4) DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610)
[07:40:15] <logmsgbot>	 !log fab@deploy1003 Finished deploy [airflow-dags/research@4367417]: (no justification provided) (duration: 00m 40s)
[07:40:29] <logmsgbot>	 !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided)
[07:41:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:30] <dcausse>	 jouncebot: nowandnext
[07:43:30] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0700)
[07:43:30] <jouncebot>	 In 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0800)
[07:44:07] <wikibugs>	 (PS1) Jelto: gerrit: add more IPs to abusers [puppet] - https://gerrit.wikimedia.org/r/1143471 (https://phabricator.wikimedia.org/T393498)
[07:46:10] <logmsgbot>	 !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 05m 42s)
[07:47:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: DCausse)
[07:48:38] <wikibugs>	 (Merged) jenkins-bot: cirrus: explicitly route search traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: DCausse)
[07:48:42] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: add more IPs to abusers [puppet] - https://gerrit.wikimedia.org/r/1143471 (https://phabricator.wikimedia.org/T393498) (owner: Jelto)
[07:49:09] <logmsgbot>	 !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1129182|cirrus: explicitly route search traffic to codfw (T388610)]]
[07:49:12] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[07:51:50] <logmsgbot>	 !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided)
[07:52:32] <logmsgbot>	 !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 00m 42s)
[07:53:47] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) (owner: Slyngshede)
[07:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[07:55:52] <logmsgbot>	 !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1129182|cirrus: explicitly route search traffic to codfw (T388610)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:55:55] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[07:55:58] <logmsgbot>	 !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided)
[07:56:27] <logmsgbot>	 !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 00m 29s)
[07:57:07] <wikibugs>	 (CR) Elukey: [C:+2] profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: Elukey)
[07:57:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:58:04] <wikibugs>	 (CR) Giuseppe Lavagetto: "I'd like to understand better what are you trying to fix here. To be more explicit, do we have a case of a service exposing multiple ports" [puppet] - https://gerrit.wikimedia.org/r/1142693 (owner: Ebernhardson)
[07:58:16] <wikibugs>	 (CR) Slyngshede: [C:+2] Login: fix redirect on login [software/bitu] - https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) (owner: Slyngshede)
[07:59:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0800)
[08:00:56] <wikibugs>	 (Merged) jenkins-bot: Login: fix redirect on login [software/bitu] - https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) (owner: Slyngshede)
[08:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:00] <dcausse>	 sorry I'm still in the middle of a deploy 
[08:03:55] <logmsgbot>	 !log dcausse@deploy1003 dcausse: Continuing with sync
[08:04:09] <jinxer-wm>	 FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[08:05:22] <fabfur>	 !log depooling and disabling puppet on cp7001 to perform tests (T393671)
[08:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:25] <stashbot>	 T393671: Benchmark different options - https://phabricator.wikimedia.org/T393671
[08:06:05] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet
[08:12:28] <logmsgbot>	 !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129182|cirrus: explicitly route search traffic to codfw (T388610)]] (duration: 23m 19s)
[08:12:31] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[08:14:50] <dcausse>	 search flowing to opensearch@codfw, things look good afaics
[08:18:43] <dcausse>	 going to call this done, please let me know if you see anything weird related to search
[08:19:05] <dcausse>	 !log closing UTC morning backport window
[08:19:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] P:wmcs::instance: Don't install puppet-lint [puppet] - https://gerrit.wikimedia.org/r/1142535 (owner: Majavah)
[08:27:11] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::instance: Don't install puppet-lint [puppet] - https://gerrit.wikimedia.org/r/1142535 (owner: Majavah)
[08:37:03] <logmsgbot>	 !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: Testing in progress
[08:44:27] <wikibugs>	 (PS1) Jelto: gerrit: add more IPs to abusers [puppet] - https://gerrit.wikimedia.org/r/1143478 (https://phabricator.wikimedia.org/T393498)
[08:47:13] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: add more IPs to abusers [puppet] - https://gerrit.wikimedia.org/r/1143478 (https://phabricator.wikimedia.org/T393498) (owner: Jelto)
[08:47:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm
[08:48:03] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803532 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm
[08:48:29] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411)
[08:48:29] <wikibugs>	 (03CR) 10Vgutierrez: "varnish tests are happy in both text and upload." [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[08:49:32] <wikibugs>	 (03PS3) 10Volans: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:49:53] <wikibugs>	 (03PS6) 10Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168
[08:49:57] <wikibugs>	 (03CR) 10Volans: "I've added some entries in the host's hieradata file" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[08:52:49] <logmsgbot>	 !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apus-fe1003.eqiad.wmnet with OS bookworm
[08:52:53] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803544 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors: - apus-f...
[08:53:12] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm
[08:53:23] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803545 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm
[08:54:11] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411)
[08:56:06] <wikibugs>	 (03PS2) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411)
[08:56:37] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[09:01:07] <wikibugs>	 (03CR) 10Effie Mouzeli: "The issue that I0b412ef2aa7c2aac35747f3a4724848a3fee1df6 was submitted for, was addressed in I1f3a8f607e4864f9edbe2e4d949855385b430671." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli)
[09:01:38] <wikibugs>	 (03PS1) 10Volans: test-cookbook: expand help message [puppet] - 10https://gerrit.wikimedia.org/r/1143485
[09:01:38] <wikibugs>	 (03PS1) 10Volans: cumin: tweak insetup role report config [puppet] - 10https://gerrit.wikimedia.org/r/1143486
[09:01:38] <wikibugs>	 (03PS1) 10Volans: admin: add my own vim config [puppet] - 10https://gerrit.wikimedia.org/r/1143487
[09:02:35] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803567 (10MatthewVernon) @Jclark-ctr I've had a look at this, and the problem seems to be that it's failing to PXE boot at all - the reimage cookbook brings the host up fine (a...
[09:02:53] <logmsgbot>	 !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apus-fe1003.eqiad.wmnet with OS bookworm
[09:02:58] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors: - apus-f...
[09:04:14] <wikibugs>	 (03PS2) 10Volans: admin: add my own vim config [puppet] - 10https://gerrit.wikimedia.org/r/1143487
[09:05:46] <wikibugs>	 (03CR) 10Volans: [C:03+2] admin: add my own vim config [puppet] - 10https://gerrit.wikimedia.org/r/1143487 (owner: 10Volans)
[09:07:01] <wikibugs>	 (03PS1) 10Effie Mouzeli: Revert "admin: move jiji to ops-limited" [puppet] - 10https://gerrit.wikimedia.org/r/1143489
[09:09:07] <wikibugs>	 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10803576 (10jcrespo) @PMG: here are the files from backups. Would you reupload them to Commons...
[09:10:38] <wikibugs>	 (03PS4) 10Muehlenhoff: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380)
[09:10:41] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:15:16] <wikibugs>	 (03PS5) 10Volans: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:15:42] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:16:47] <wikibugs>	 (03PS1) 10Btullis: Bump the resources available to airflow kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143495 (https://phabricator.wikimedia.org/T388378)
[09:17:45] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676 (10dcaro) 03NEW
[09:17:55] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803598 (10dcaro)
[09:18:09] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803602 (10dcaro)
[09:18:19] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803604 (10dcaro)
[09:18:27] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803606 (10dcaro)
[09:20:02] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803608 (10dcaro) @ayounsi @cmooney feel free to use this task for this work, or link the one you are using for it, thanks!
[09:21:35] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Bump the resources available to airflow kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143495 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[09:21:53] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] P:mw::maint::temporary_accounts: purge_temporary_accounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143197 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French)
[09:22:26] <wikibugs>	 (03PS6) 10Volans: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:22:35] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:23:38] <wikibugs>	 (03Merged) 10jenkins-bot: Bump the resources available to airflow kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143495 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[09:23:50] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803615 (10cmooney) @dcaro is there another task for those servers to be installed/provisioned?  We have support for 25G in Eqiad racks E4 and F4, and codfw B1.  Not in eqiad C8/D5.  It's main...
[09:24:05] <wikibugs>	 (03PS1) 10Jelto: gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143500 (https://phabricator.wikimedia.org/T393498)
[09:25:22] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143500 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto)
[09:25:48] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803628 (10dcaro) >>! In T393676#10803615, @cmooney wrote: > @dcaro is there another task for those servers to be installed/provisioned?  It's linked as parent {T389851}, not yet bought, but w...
[09:25:54] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803630 (10cmooney) Sorry I found it.  I added a note about C8/D5 in eqiad.
[09:26:25] <Emperor>	 !log swift delete wikipedia-commons-local-public.e7 'e/e7/Hawkmoth_(Meganoton_nyctiphanes)_(8688240817).jpg' ms-fe1009 and ms-fe2009 T392658
[09:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:28] <stashbot>	 T392658: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658
[09:27:26] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10803632 (10MatthewVernon) @Sreejithk2000 done (apologies for the delay, I had some annua...
[09:28:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:30:28] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans)
[09:32:30] <wikibugs>	 (03CR) 10Volans: [C:03+1] "PCC seems finally happy, LGTM to start testing" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:33:08] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803636 (10dcaro) > We will need to support 25G on all of the racks, as we have to spread the nodes for high availability (especially critical if the hosts are that big)  This is a blocker to b...
[09:33:59] <wikibugs>	 (03CR) 10Jcrespo: "I believe there is not yet a bookworm transferpy package." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:35:41] <wikibugs>	 (03PS1) 10Hnowlan: Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236)
[09:36:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Initially we use profile::dbbackups::transfer::enabled: false so it won't get immediately installed." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:37:31] <wikibugs>	 (03CR) 10Volans: [C:03+1] "But I'm afraid `profile::dbbackups::transfer` installs `wmfbackups-remote` before checking the enabled false/true :(" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:38:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236) (owner: 10Hnowlan)
[09:41:02] <wikibugs>	 (03PS2) 10Hnowlan: Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236)
[09:42:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Oh indeed. I just had a look at the dependencies of wmfbackups-remote and transferpy and they have no specific dependencies not fulfilled " [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:43:47] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:45:04] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:45:34] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:46:04] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Not that you have to help me, but I would appreciate if you or anyone could help me at some point setting up bookworm CI for my python pack" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[09:51:24] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803654 (10dcaro)
[09:52:02] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803656 (10dcaro) p:05Triage→03High
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1000)
[10:01:13] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411)
[10:01:13] <wikibugs>	 (03PS3) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411)
[10:01:36] <wikibugs>	 (03CR) 10Volans: "I don't have any CI" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[10:03:35] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] Revert "admin: move jiji to ops-limited" [puppet] - 10https://gerrit.wikimedia.org/r/1143489 (owner: 10Effie Mouzeli)
[10:06:03] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143506
[10:11:30] <wikibugs>	 (03PS1) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333)
[10:12:14] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "The biggest issue with transferpy is that it assumes hosts use iptables, not nftables." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[10:13:11] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] P:mw::maintenance::refreshlinks: rename and prepare for mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French)
[10:14:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:15:34] <wikibugs>	 (03PS2) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333)
[10:19:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:21:13] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236) (owner: 10Hnowlan)
[10:22:20] <wikibugs>	 (03PS3) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333)
[10:23:24] <wikibugs>	 (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey)
[10:25:37] <wikibugs>	 (03CR) 10Jforrester: "Ack." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli)
[10:26:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Looks good, thank you !" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey)
[10:30:00] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] Change red to blue: blue=bad, green=good, yellow=yyyeah... [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 (owner: 10Cory Massaro)
[10:30:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:31:17] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[10:31:25] <wikibugs>	 (03Merged) 10jenkins-bot: Change red to blue: blue=bad, green=good, yellow=yyyeah... [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 (owner: 10Cory Massaro)
[10:40:16] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] P:mw::maintenance::refreshlinks: migrate s8 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French)
[10:40:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1165 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:42:09] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-cron: disable mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143517 (https://phabricator.wikimedia.org/T341555)
[10:42:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:15] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[10:45:46] <zabe>	 !log zabe@deploy1003:~$ mwscript-k8s --attach --  extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Wikimedia Foundation Board of Trustees" "Wikimedia Foundation/Board of Trustees" "Zabe" --reason "per request [[:phab:T393619|T393619]]"
[10:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:51] <stashbot>	 T393619: Request to move translatable page: Wikimedia Foundation Board of Trustees - https://phabricator.wikimedia.org/T393619
[10:46:36] <wikibugs>	 (03PS4) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782)
[10:47:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:53:09] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-cron: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555)
[10:56:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:57:20] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:59:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Can you please open a separate task for adding nftables support? I'm happy to help with that." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[11:01:07] <wikibugs>	 (03CR) 10Hnowlan: "Makes sense to me, I've added an `absent`." [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[11:15:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox
[11:15:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[11:15:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[11:19:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:19:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[11:19:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[11:20:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:21:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:21:26] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143528 (https://phabricator.wikimedia.org/T385782)
[11:21:28] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782)
[11:23:49] <wikibugs>	 (03PS1) 10Btullis: Reduce the limits on the default kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143531 (https://phabricator.wikimedia.org/T388378)
[11:25:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1192 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:25:40] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate db_lag_stats_reporter to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800)
[11:26:20] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:28:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
[11:32:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:32:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox
[11:32:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
[11:39:34] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Reduce the limits on the default kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143531 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[11:40:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:40:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
[11:41:40] <wikibugs>	 (03Merged) 10jenkins-bot: Reduce the limits on the default kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143531 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[11:41:44] <wikibugs>	 (03PS1) 10Muehlenhoff: transferpy: Build for Bookworm [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380)
[11:41:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] transferpy: Build for Bookworm [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[11:43:28] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[11:44:00] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[11:44:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
[11:44:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1192 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:43] <wikibugs>	 (03CR) 10Muehlenhoff: "The CI configured here isn't working, but the package built just fine on build2002, I'll import it next." [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[11:48:58] <icinga-wm>	 PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[11:50:58] <icinga-wm>	 RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[11:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:57:51] <moritzm>	 !log import transferpy 1.1+deb12u1 to bookworm-wikimedia T389380
[11:57:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:53] <stashbot>	 T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380
[11:59:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "wmfbackups-remote was already imported for Bookworm. I built transferpy for Bookworm and uploaded it to apt.wikimedia.org, so we should be" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1200)
[12:00:28] <wikibugs>	 (03PS8) 10Slyngshede: Initial implementation of VueJS frontend [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443)
[12:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:39] <wikibugs>	 10SRE-tools, 10database-backups, 10Infrastructure Security, 06Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692 (10jcrespo) 03NEW
[12:04:09] <jinxer-wm>	 FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:06:06] <wikibugs>	 SRE-tools, database-backups, Infrastructure Security, Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692#10804062 (jcrespo) @MoritzMuehlenhoff @Dzahn @FCeratto-WMF @MatthewVernon FYI
[12:07:14] <wikibugs>	 (CR) Jcrespo: [C:+1] "T393692" [puppet] - https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: Muehlenhoff)
[12:07:38] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10804065 (Jhancock.wm) service request submitted. i'll let you know when it gets here and is replaced.
[12:17:39] <wikibugs>	 SRE, Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10804140 (Vgutierrez) ` vgutierrez@carrot:~$ whois pywikipedia.org |grep -i "Name server" Name Server: ns061.auroradns.eu Name Server: ns062.auroradns.nl Name Server: ns063.auroradns...
[12:18:26] <wikibugs>	 (PS1) Btullis: Bump the heap allocated to YAN nodemanagers on the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695)
[12:18:40] <wikibugs>	 (PS1) Vgutierrez: Revert "acmechief: Add pywikipedia.org to the cert list" [puppet] - https://gerrit.wikimedia.org/r/1143552
[12:18:53] <wikibugs>	 (PS2) Btullis: Bump the heap allocated to YARN nodemanagers on the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695)
[12:19:06] <wikibugs>	 SRE-tools, database-backups, Infrastructure Security, Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692#10804144 (jcrespo) My suggestion for a fix would be to Split [[ https://phabricator.wikimedia.org/diffusion/OS...
[12:19:37] <wikibugs>	 (PS3) Btullis: Bump the heap allocated to YARN nodemanagers on the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695)
[12:20:22] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert "acmechief: Add pywikipedia.org to the cert list" [puppet] - https://gerrit.wikimedia.org/r/1143552 (owner: Vgutierrez)
[12:20:48] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5495/console" [puppet] - https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) (owner: Btullis)
[12:22:52] <MatmaRex>	 jouncebot: next
[12:22:52] <jouncebot>	 In 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1300)
[12:23:17] <MatmaRex>	 i have a maintenance script to run during the window, i hope someone can do it for me :)
[12:25:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org
[12:30:22] <wikibugs>	 SRE, Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10804180 (Vgutierrez) I've reverted https://gerrit.wikimedia.org/r/1137481 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143552) to avoid acme-chief attempting to issue a ce...
[12:31:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org
[12:32:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org
[12:36:08] <wikibugs>	 (PS1) Vgutierrez: hiera: Set acmechief_host to acmechief2002 [puppet] - https://gerrit.wikimedia.org/r/1143568
[12:36:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:39:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org
[12:40:02] <wikibugs>	 (CR) Stevemunene: [C:+1] "looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) (owner: Btullis)
[12:40:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) (owner: Slyngshede)
[12:41:28] <jinxer-wm>	 FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:45:12] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: Set acmechief_host to acmechief2002 [puppet] - https://gerrit.wikimedia.org/r/1143568 (owner: Vgutierrez)
[12:46:28] <jinxer-wm>	 RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:46:50] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Set acmechief_host to acmechief2002 [puppet] - https://gerrit.wikimedia.org/r/1143568 (owner: Vgutierrez)
[12:51:27] <wikibugs>	 (PS9) Slyngshede: Initial implementation of VueJS frontend [software/bitu] - https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1300).
[13:00:05] <jouncebot>	 MatmaRex and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:53] <Tchanders>	 o/
[13:01:07] <MatmaRex>	 hi
[13:01:24] <MatmaRex>	 i have a maintenance script to run during the window, i hope someone can do it for me :)
[13:01:40] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10804254 (MoritzMuehlenhoff)
[13:03:30] <moritzm>	 !log installing jetty9 security updates
[13:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:53] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good. Only enabling this for Bookworm seems fine, after all we use systemd::sysuser very little on Bullseye since it's hampered by T" [puppet] - https://gerrit.wikimedia.org/r/1141963 (owner: JHathaway)
[13:11:28] <Tchanders>	 I think I can deploy, but I haven't used spiderpig before. Do I have to use spiderpig or can I just use the deploy commands as before?
[13:12:27] <thcipriani>	 you can still use the regular commands, too
[13:12:42] <thcipriani>	 but spiderpig is an option
[13:14:12] <Tchanders>	 thcipriani: Thanks, going ahead
[13:16:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: Kosta Harlan)
[13:16:38] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make cumin1003 a Cumin node [puppet] - https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: Muehlenhoff)
[13:17:18] <wikibugs>	 (Merged) jenkins-bot: temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: Kosta Harlan)
[13:17:42] <logmsgbot>	 !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1142858|temp accounts: Remove AutopromoteOnce configuration (T393358)]]
[13:17:45] <stashbot>	 T393358: Temporary accounts: Remove autopromote configuration for temporary-account-viewer - https://phabricator.wikimedia.org/T393358
[13:20:23] <wikibugs>	 (CR) BBlack: [C:+1] "LGTM on the surface!" [dns] - https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: Ssingh)
[13:21:45] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10804369 (ssingh)
[13:24:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:24:32] <logmsgbot>	 !log tchanders@deploy1003 tchanders, kharlan: Backport for [[gerrit:1142858|temp accounts: Remove AutopromoteOnce configuration (T393358)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:24:34] <stashbot>	 T393358: Temporary accounts: Remove autopromote configuration for temporary-account-viewer - https://phabricator.wikimedia.org/T393358
[13:24:35] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10804385 (MoritzMuehlenhoff)
[13:27:16] <wikibugs>	 (PS1) Muehlenhoff: Switch the kadmin server to krb1002 [puppet] - https://gerrit.wikimedia.org/r/1143574 (https://phabricator.wikimedia.org/T390863)
[13:27:34] <logmsgbot>	 !log tchanders@deploy1003 tchanders, kharlan: Continuing with sync
[13:29:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:30:13] <wikibugs>	 (CR) Bking: [C:+2] wdqs-main: allow query.wikidata.org to hit main [puppet] - https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[13:30:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:32:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:10] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:34:12] <logmsgbot>	 !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142858|temp accounts: Remove AutopromoteOnce configuration (T393358)]] (duration: 16m 30s)
[13:34:16] <stashbot>	 T393358: Temporary accounts: Remove autopromote configuration for temporary-account-viewer - https://phabricator.wikimedia.org/T393358
[13:35:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:35:28] <jinxer-wm>	 RESOLVED: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:37:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:37:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:40:54] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:41:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:41:41] <wikibugs>	 (CR) Elukey: [C:+1] Switch the kadmin server to krb1002 [puppet] - https://gerrit.wikimedia.org/r/1143574 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[13:41:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:42:05] <volans>	 !log forced removal of db1246 from puppetdb to unblock reimage (was failing due to a puppet change in the meanwhile)
[13:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:22] <wikibugs>	 SRE, serviceops, Essential-Work, Release-Engineering-Team (Radar), Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10804452 (thcipriani)
[13:46:02] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:46:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:47:21] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Bump the heap allocated to YARN nodemanagers on the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) (owner: Btullis)
[13:47:24] <wikibugs>	 (CR) Eevans: [C:+2] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: Eevans)
[13:47:59] <Tchanders>	 If anyone is able to run MatmaRex's maintenance script, please go ahead
[13:50:20] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:50:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23): Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10804494 (Gehel)
[13:50:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23): Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10804495 (Gehel)
[13:51:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:51:59] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[1028-1030].eqiad.wmnet
[13:52:03] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm
[13:52:09] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10804504 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm
[13:52:25] <jinxer-wm>	 RESOLVED: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:36] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:55:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:56:10] <wikibugs>	 (PS8) Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411)
[13:56:22] <wikibugs>	 (CR) CDanis: [C:-1] logstash: calculate w3c generated timestamp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: Cwhite)
[13:57:36] <MatmaRex>	 James_F: hi, you around perhaps? want to run a maintenance script for me? we did a dry run during the hackathon. https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1300
[13:58:53] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:00:20] <James_F>	 MatmaRex: Sure, I have a meeting now but in half an hour?
[14:00:33] <MatmaRex>	 no hurry. thank you
[14:02:50] <wikibugs>	 (PS2) Cwhite: logstash: calculate w3c generated timestamp [puppet] - https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886)
[14:03:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.dns.netbox
[14:03:50] <wikibugs>	 (PS1) Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863)
[14:04:07] <wikibugs>	 (PS2) Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863)
[14:05:40] <wikibugs>	 (CR) CDanis: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: Cwhite)
[14:07:05] <wikibugs>	 (PS1) Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886)
[14:08:17] <wikibugs>	 (CR) Cwhite: logstash: calculate w3c generated timestamp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: Cwhite)
[14:09:00] <logmsgbot>	 eevans@cumin1002 decommission (PID 3987673) is awaiting input
[14:12:01] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[14:12:26] <wikibugs>	 (PS3) Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863)
[14:12:50] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[14:13:29] <wikibugs>	 (PS4) Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863)
[14:14:15] <wikibugs>	 (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1143585 (https://phabricator.wikimedia.org/T393714)
[14:14:59] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[14:15:36] <wikibugs>	 (PS1) Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333)
[14:15:37] <wikibugs>	 (PS1) Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[14:15:54] <wikibugs>	 (PS1) Bking: conftool: remove elastic row A hosts and add newly-reimaged hosts [puppet] - https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118)
[14:16:34] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) (owner: Bking)
[14:17:11] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1143585 (https://phabricator.wikimedia.org/T393714) (owner: DDesouza)
[14:18:10] <wikibugs>	 (PS2) Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886)
[14:18:10] <wikibugs>	 (PS2) Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333)
[14:18:10] <wikibugs>	 (PS2) Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[14:19:00] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:19:04] <wikibugs>	 (Merged) jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1143585 (https://phabricator.wikimedia.org/T393714) (owner: DDesouza)
[14:19:15] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[14:19:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:20:12] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:20:14] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1028-1030].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[14:20:32] <vgutierrez>	 topranks, XioNoX ^^
[14:20:37] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:20:38] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:21:06] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:21:07] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:21:36] <wikibugs>	 (CR) CI reject: [V:-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:21:38] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:21:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:23:06] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10804784 (Jelto) Yesterday, @jcrespo, @MatthewVernon, and I discussed backups for object storage. The discussion covered not only...
[14:23:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1028-1030].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002"
[14:23:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:23:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[1028-1030].eqiad.wmnet
[14:23:33] <topranks>	 vgutierrez: thanks yeah, issue on the Arelion cct there 
[14:23:49] <vgutierrez>	 ack
[14:24:05] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[14:24:22] <wikibugs>	 (CR) CI reject: [V:-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:24:30] <wikibugs>	 (CR) CI reject: [V:-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:25:37] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission restbase10[28-30].eqiad.wmnet - https://phabricator.wikimedia.org/T393617#10804794 (Eevans)
[14:25:49] <wikibugs>	 (PS3) Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333)
[14:25:49] <wikibugs>	 (PS3) Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[14:27:04] <wikibugs>	 (CR) CI reject: [V:-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:27:23] <wikibugs>	 (PS4) Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333)
[14:27:23] <wikibugs>	 (PS4) Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[14:27:23] <wikibugs>	 (PS3) Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886)
[14:27:26] <wikibugs>	 (CR) CI reject: [V:-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:28:37] <wikibugs>	 (CR) CI reject: [V:-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:28:42] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: Elukey)
[14:28:43] <wikibugs>	 (CR) CI reject: [V:-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[14:33:26] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782)
[14:34:19] <James_F>	 !log Running `foreachwiki extensions/Echo/maintenance/removeInvalidNotification.php --remove # T389673` for MatmaRex 
[14:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:22] <stashbot>	 T389673: Make it possible to remove extensions' event data from Echo tables after undeploying them - https://phabricator.wikimedia.org/T389673
[14:34:30] <wikibugs>	 (CR) CI reject: [V:-1] mw::maintenance: move updateMenteeData to upper level job [puppet] - https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[14:35:34] <wikibugs>	 (PS2) Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782)
[14:36:20] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm
[14:36:25] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10804877 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**)   - Removed from Puppet and...
[14:38:35] <wikibugs>	 (CR) Hnowlan: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5498/console" [puppet] - https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[14:40:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Disable httbb k8s tests on cumin1003 for now [puppet] - https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[14:43:34] <MatmaRex>	 James_F: tyvm. still running?
[14:43:45] <James_F>	 MatmaRex: Complete. Pasting the log now.
[14:43:58] <wikibugs>	 (PS1) Ssingh: P:dns:auth::update: add timer for monthly git maintenance run [puppet] - https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602)
[14:44:09] <jinxer-wm>	 FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:44:38] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5499/co" [puppet] - https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: Ssingh)
[14:45:06] <moritzm>	 !log imported ripe-atlas-sagan 1.3.1-1~wmf12u1 to apt.wikimedia.org/bookworm T389380
[14:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:10] <stashbot>	 T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380
[14:45:27] <moritzm>	 !log imported ripe-atlas-tools 2.3.0-3+wmf12u1 to apt.wikimedia.org/bookworm T389380
[14:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:25] <jinxer-wm>	 RESOLVED: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:45] <wikibugs>	 (03PS3) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782)
[14:51:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[14:52:58] <wikibugs>	 (03PS4) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782)
[14:53:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:53:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:54:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[14:54:09] <jinxer-wm>	 FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:54:43] <wikibugs>	 (03PS2) 10Ssingh: P:dns:auth::update: add timer for monthly git maintenance run [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602)
[14:55:23] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5501/co" [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh)
[14:57:41] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet
[14:57:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:00:05] <jouncebot>	 jeena and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1500)
[15:02:34] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looking good, check the inline suggestion" [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh)
[15:03:16] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet
[15:05:09] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:05:44] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] P:dns:auth::update: add timer for monthly git maintenance run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh)
[15:05:46] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns:auth::update: add timer for monthly git maintenance run [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh)
[15:05:46] <wikibugs>	 (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:07:57] <sukhe>	 !log sudo cumin -b1 -s10 'A:dnsbox' 'run-puppet-agent'
[15:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:01] <wikibugs>	 (03CR) 10Btullis: [C:03+1] conftool: remove elastic row A hosts and add newly-reimaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking)
[15:10:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:11:24] <wikibugs>	 (03CR) 10Scott French: mw-cron: enable monitoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli)
[15:11:37] <wikibugs>	 (03PS1) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595
[15:12:23] <wikibugs>	 (03PS2) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595
[15:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:13:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:13:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:13:57] <wikibugs>	 (03PS1) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597
[15:14:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805052 (10Papaul) I was having some issue to re-image this host but @Volans was able to help by removing the host from puppetdb. See below for error  ` $ sudo puppet lookup --render-as s --compile --...
[15:15:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (owner: 10BCornwall)
[15:16:59] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[15:17:12] <wikibugs>	 (03PS2) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597
[15:20:58] <wikibugs>	 (03CR) 10Effie Mouzeli: mw-cron: enable monitoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli)
[15:21:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:21:15] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: mw-cron: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli)
[15:21:32] <wikibugs>	 (03CR) 10Pppery: [C:03+1] "I would add `Bug: T388809` to this patch so it gets linked with the task, otherwise looks fine -- the Pywikibot people have now had a mont" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (owner: 10BCornwall)
[15:21:55] <wikibugs>	 (03PS3) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809)
[15:22:09] <wikibugs>	 (03PS4) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809)
[15:22:14] <wikibugs>	 (03PS3) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809)
[15:22:30] <wikibugs>	 (03PS5) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809)
[15:22:49] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall)
[15:24:11] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] "good point, thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[15:24:41] <wikibugs>	 (03PS5) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782)
[15:26:30] <wikibugs>	 (03PS5) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782)
[15:27:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:27:44] <wikibugs>	 (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:28:48] <wikibugs>	 (03PS6) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782)
[15:29:09] <jinxer-wm>	 FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1122-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:30:37] <swfrench-wmf>	 !incidents
[15:30:37] <sirenbot>	 6097 (ACKED)  db1247 (paged)/mysqld processes (paged)
[15:30:38] <sirenbot>	 6098 (ACKED)  db1247 (paged)/MariaDB Replica IO: s4 (paged)
[15:30:38] <sirenbot>	 6099 (ACKED)  db1247 (paged)/MariaDB Replica SQL: s4 (paged)
[15:30:38] <sirenbot>	 6100 (ACKED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[15:31:00] <swfrench-wmf>	 !resolve 6097
[15:31:00] <sirenbot>	 6097 (RESOLVED)  db1247 (paged)/mysqld processes (paged)
[15:31:03] <swfrench-wmf>	 !resolve 6098
[15:31:04] <sirenbot>	 6098 (RESOLVED)  db1247 (paged)/MariaDB Replica IO: s4 (paged)
[15:31:08] <swfrench-wmf>	 !resolve 6099
[15:31:08] <sirenbot>	 6099 (RESOLVED)  db1247 (paged)/MariaDB Replica SQL: s4 (paged)
[15:31:12] <swfrench-wmf>	 !resolve 6100
[15:31:12] <sirenbot>	 6100 (RESOLVED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[15:31:28] <wikibugs>	 (03PS5) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[15:31:28] <wikibugs>	 (03PS4) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886)
[15:31:58] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:32:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:32:32] <cdanis>	 swfrench-wmf: heh thanks
[15:32:43] <wikibugs>	 (03CR) 10Elukey: "Tried to run it, but run.sh returns to me:" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway)
[15:32:44] <swfrench-wmf>	 :)
[15:35:17] <wikibugs>	 (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:35:35] <_joe_>	 uh?
[15:35:39] <herron>	 !incidents
[15:35:40] <sirenbot>	 6100 (RESOLVED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[15:35:40] <rzl>	 👋 ack expired
[15:35:47] <_joe_>	 ahhh
[15:35:49] <wikibugs>	 (03PS6) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[15:35:49] <wikibugs>	 (03PS5) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886)
[15:36:12] <rzl>	 I've boldly resolved in VO
[15:36:22] <swfrench-wmf>	 rzl: thanks, yeah - I did as well
[15:36:58] <wikibugs>	 (03CR) 10Elukey: "Probably something wrong in the prev attempt, I see now:" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway)
[15:37:08] <swfrench-wmf>	 very strange, I received a notification for at least one of these that I'd already re-acked and resolved
[15:37:20] <swfrench-wmf>	 maybe I did that in the "wrong order" ?
[15:37:48] <logmsgbot>	 jhancock@cumin2002 netbox (PID 3066154) is awaiting input
[15:39:09] <jinxer-wm>	 RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1122-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:39:23] <wikibugs>	 (03CR) 10JHathaway: "looks much better!, you should be able to run further rake commands in /srv/workspace/puppet, e.g. bundle exec rake test" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway)
[15:39:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002"
[15:39:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002"
[15:39:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:40:28] <wikibugs>	 (03PS7) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333)
[15:41:39] <wikibugs>	 (03Abandoned) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey)
[15:41:43] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411)
[15:41:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey)
[15:42:33] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[15:43:35] <wikibugs>	 (03PS7) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782)
[15:43:49] <wikibugs>	 (03PS1) 10Majavah: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393487)
[15:44:13] <wikibugs>	 (03PS2) 10Majavah: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393487)
[15:44:22] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393487) (owner: 10Majavah)
[15:44:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[15:44:39] <wikibugs>	 (03PS2) 10Eevans: JBOD partman recipe for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544)
[15:44:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:45:31] <wikibugs>	 (03PS2) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411)
[15:46:15] <wikibugs>	 (03PS3) 10Eevans: JBOD partman recipe for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544)
[15:46:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047
[15:46:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047
[15:46:51] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[15:48:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:48:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Tested with Eric" [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[15:49:02] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411)
[15:50:22] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[15:50:51] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall)
[15:51:30] <wikibugs>	 (03CR) 10Bking: [C:03+2] conftool: remove elastic row A hosts and add newly-reimaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking)
[15:52:41] <wikibugs>	 (03CR) 10Elukey: "Please keep in mind that my knowledge of ruby and Rakefiles is horrible, but the change is sound. I left a couple of little comments, and " [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway)
[15:53:12] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[15:53:16] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5503/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:53:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:54:29] <wikibugs>	 (03PS1) 10Hnowlan: mw:sharded_periodic_job: use "command" instead of script [puppet] - 10https://gerrit.wikimedia.org/r/1143606
[15:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[15:57:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[15:57:29] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[15:57:37] <wikibugs>	 (03CR) 10Vgutierrez: "PCC output for non-NOOP can be seen on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143603/" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[16:00:04] <jouncebot>	 jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:02] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411)
[16:01:02] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411)
[16:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:01] <wikibugs>	 (03PS2) 10Hnowlan: mw:sharded_periodic_job: use "command" instead of script [puppet] - 10https://gerrit.wikimedia.org/r/1143606
[16:03:13] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-2] "do not merge till Ia3c34647675a728e06c02e0d6cb9b00a8911ca61 is merged and deployed CDN wide" [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[16:03:34] <wikibugs>	 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10805182 (10elukey) >>! In T391852#10796611, @herron wrote: >>>! In T391852#10796071, @elukey wrote: >> @herron @RLazarus There are a couple of logistical thin...
[16:04:51] <wikibugs>	 (03CR) 10Vgutierrez: "@abaso@wikimedia.org I'm not planning to merge this one immediately but we will need to deploy it before being able to split the CDN cache" [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[16:05:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:05:57] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5506/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143606 (owner: 10Hnowlan)
[16:06:39] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh)
[16:06:58] <wikibugs>	 (03PS5) 10Ssingh: type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839)
[16:07:09] <wikibugs>	 (03CR) 10Ssingh: "rebased, no code change" [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh)
[16:08:23] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh)
[16:08:43] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[16:09:47] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[16:10:24] <wikibugs>	 (03PS8) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782)
[16:10:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2048 to codfw - jhancock@cumin2002"
[16:10:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2048 to codfw - jhancock@cumin2002"
[16:11:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:11:12] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10805196 (10Jhancock.wm) heads up. dell doesn't want to replace it without more troubleshooting. since the idrac is not showing a failure. Gonna reseat the drive. Please let me know if it changes anything.
[16:11:44] <wikibugs>	 (03PS1) 10Andrew Bogott: designate policy.yaml: repair 'default' policy [puppet] - 10https://gerrit.wikimedia.org/r/1143610 (https://phabricator.wikimedia.org/T393679)
[16:11:45] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack common/servicetoken.erb: remove a misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/1143611
[16:11:45] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder: use 'cinder' service user rather than 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/1143612 (https://phabricator.wikimedia.org/T330759)
[16:13:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] designate policy.yaml: repair 'default' policy [puppet] - 10https://gerrit.wikimedia.org/r/1143610 (https://phabricator.wikimedia.org/T393679) (owner: 10Andrew Bogott)
[16:13:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2048
[16:13:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048
[16:14:13] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate db_lag_stats_reporter to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan)
[16:14:48] <icinga-wm>	 PROBLEM - mysqld processes #page on db1246 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[16:14:49] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:50] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:51] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 #page on db1246 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:14:51] <icinga-wm>	 PROBLEM - MariaDB read only s2 on db1246 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:15:03] <swfrench-wmf>	 !incidents
[16:15:04] <sirenbot>	 6105 (UNACKED)  db1246 (paged)/mysqld processes (paged)
[16:15:04] <sirenbot>	 6106 (UNACKED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[16:15:04] <sirenbot>	 6107 (UNACKED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[16:15:04] <sirenbot>	 6108 (UNACKED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[16:15:05] <jynus>	 :-(
[16:15:08] <vgutierrez>	 pope effect?
[16:15:14] <swfrench-wmf>	 !ack 6105
[16:15:15] <sirenbot>	 6105 (ACKED)  db1246 (paged)/mysqld processes (paged)
[16:15:16] <swfrench-wmf>	 !ack 6106
[16:15:17] <sirenbot>	 6106 (ACKED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[16:15:17] <sukhe>	 no, host specific I would say
[16:15:19] <swfrench-wmf>	 !ack 6107
[16:15:20] <sirenbot>	 6107 (ACKED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[16:15:21] <swfrench-wmf>	 !ack 6108
[16:15:22] <sirenbot>	 6108 (ACKED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[16:15:22] <sukhe>	 it's been bothering us for a while
[16:15:23] <volans>	 yes and was already broken
[16:15:26] <jynus>	 nope, I think there was some bad hardware there
[16:15:31] <volans>	 it's happy for the news
[16:15:32] <swfrench-wmf>	 there's a downtime for this host ...
[16:15:36] <swfrench-wmf>	 did the reimage clear it?
[16:15:48] <jynus>	 https://phabricator.wikimedia.org/T393296
[16:15:51] <federico3>	 did the downtime time out?
[16:15:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:16:01] <swfrench-wmf>	 no, I set it for 1w
[16:16:07] <jynus>	 I think Papaul is working on it?
[16:16:10] <swfrench-wmf>	 presumably the end of the reimage did something
[16:16:11] <swfrench-wmf>	 yes
[16:16:13] <papaul>	 yes
[16:16:14] <papaul>	 i am
[16:16:33] <papaul>	 i think it was downtime
[16:16:35] <papaul>	 for a week
[16:16:38] <jynus>	 what I do in these hw cases is set notifications as disabled in puppet
[16:16:38] <wikibugs>	 (03Abandoned) 10Ebernhardson: services_proxy: Support multiple ports on discovery dns services [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson)
[16:16:44] <wikibugs>	 (03CR) 10Ebernhardson: "We run three clusters per DC, each cluster runs on a distinct port. They are implemented by running two copies of the server on each bare " [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson)
[16:18:45] <volans>	 swfrench-wmf: yes the reimage removes the host
[16:18:58] <swfrench-wmf>	 volans: thanks for confirming, that makes sense, then
[16:18:59] <volans>	 so it disappears from icinga and then it's added again
[16:19:17] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3091523) is awaiting input
[16:19:19] <volans>	 and in general, if it's successful I think it also removes the downtime, which on icinga means all downtimes
[16:19:26] <volans>	 I suggest to power off the host
[16:19:27] <swfrench-wmf>	 about to hit enter on re-downtiming, unless folks have objections and would prefer the puppet route :)
[16:19:32] <volans>	 as it has already created too much noise
[16:19:39] <swfrench-wmf>	 volans: there's active work happening on the host
[16:19:46] <swfrench-wmf>	 i.e., it should be up
[16:20:22] <volans>	 the host begs to differ :D
[16:20:26] <volans>	 doesn't want to stay up
[16:21:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:22:27] <logmsgbot>	 !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host has crashed - T393296
[16:22:30] <wikibugs>	 SRE, cloud-services-team, Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10805238 (dcaro) This is not a blocker anymore in general, will still be needed the more big hosts we get, but can wait for the general 25G everywhere some time.
[16:22:31] <stashbot>	 T393296: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296
[16:22:34] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805239 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3e20d375-9adc-4351-ba8a-0bbdf71aba3b) set by swfrench@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with rea...
[16:23:09] <swfrench-wmf>	 re-downtimed
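A minimal sketch of the behaviour volans describes above, assuming a toy in-memory monitoring model. The `Monitoring` class, its methods, and the `reimage` helper are hypothetical and are not Icinga's or the cookbooks' real API; they only illustrate why a downtime set before a reimage is gone afterwards when the host is removed and added again.

```python
from dataclasses import dataclass, field


@dataclass
class Monitoring:
    """Toy stand-in for a monitoring system; not a real Icinga client."""
    hosts: set[str] = field(default_factory=set)
    downtimes: dict[str, str] = field(default_factory=dict)

    def add_host(self, host: str) -> None:
        self.hosts.add(host)

    def remove_host(self, host: str) -> None:
        # Assumption taken from the chat: removing a host drops its downtimes.
        self.hosts.discard(host)
        self.downtimes.pop(host, None)

    def set_downtime(self, host: str, reason: str) -> None:
        if host not in self.hosts:
            raise KeyError(f"unknown host: {host}")
        self.downtimes[host] = reason

    def is_downtimed(self, host: str) -> bool:
        return host in self.downtimes


def reimage(mon: Monitoring, host: str) -> None:
    """Hypothetical reimage: the host leaves monitoring, then is added again."""
    mon.remove_host(host)
    mon.add_host(host)


mon = Monitoring()
mon.add_host("db1246")
mon.set_downtime("db1246", "Host has crashed - T393296")
reimage(mon, "db1246")
print(mon.is_downtimed("db1246"))  # False: the downtime has to be set again
```

Under this model the re-downtime after the reimage is the expected follow-up step, which matches what was done here.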
[16:23:56] <swfrench-wmf>	 !incidents
[16:23:57] <sirenbot>	 6105 (ACKED)  db1246 (paged)/mysqld processes (paged)
[16:23:57] <sirenbot>	 6106 (ACKED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[16:23:57] <sirenbot>	 6107 (ACKED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[16:23:57] <sirenbot>	 6108 (ACKED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[16:24:07] <swfrench-wmf>	 !resolve 6105
[16:24:08] <sirenbot>	 6105 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[16:24:10] <swfrench-wmf>	 !resolve 6106
[16:24:10] <sirenbot>	 6106 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[16:24:13] <swfrench-wmf>	 !resolve 6107
[16:24:14] <sirenbot>	 6107 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[16:24:17] <swfrench-wmf>	 !resolve 6108
[16:24:17] <sirenbot>	 6108 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[16:25:46] <wikibugs>	 (CR) BCornwall: [C:+2] Revert "Add pywikipedia.org" [dns] - https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) (owner: BCornwall)
[16:25:54] <wikibugs>	 (PS4) BCornwall: Revert "Add pywikipedia.org" [dns] - https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809)
[16:25:57] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] Revert "Add pywikipedia.org" [dns] - https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) (owner: BCornwall)
[16:26:24] <hnowlan>	 jouncebot: nowandnext
[16:26:24] <jouncebot>	 For the next 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1600)
[16:26:24] <jouncebot>	 In 0 hour(s) and 33 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700)
[16:26:24] <jouncebot>	 In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700)
[16:26:43] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10805253 (akosiaris) +1 for what is worth.
[16:27:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm
[16:27:13] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm
[16:27:34] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[16:27:58] <wikibugs>	 (PS6) Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782)
[16:28:03] <logmsgbot>	 !log brett@dns1005 START - running authdns-update
[16:28:09] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[16:28:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10805256 (akosiaris) >>! In T393053#10782038, @RobH wrote: > Alex, >  > We didn't get racking details on the ordering task T392715, so we need to get them from...
[16:29:14] <logmsgbot>	 !log brett@dns1005 END - running authdns-update
[16:30:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm
[16:30:29] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm
[16:30:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10805262 (akosiaris) a:akosiaris→None >>! In T393054#10782085, @RobH wrote: > Alex, >  > We didn't get racking details on the ordering task T392714, so...
[16:30:34] <wikibugs>	 (CR) BCornwall: [C:+1] cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[16:30:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10805266 (akosiaris) a:akosiaris→None
[16:34:00] <wikibugs>	 (CR) BCornwall: [C:+1] trafficserver: Allow splitting the cache by HTTP header content [puppet] - https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[16:34:06] <wikibugs>	 (CR) BCornwall: [C:+1] hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[16:36:06] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805292 (Scott_French) I've re-added a 1w downtime, as the earlier one was removed as a side-effect of the reimage. If we expect the host to be powered on for ongoing work, and also expect that work...
[16:38:33] <wikibugs>	 (PS1) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[16:38:43] <wikibugs>	 (CR) Fabfur: [C:+1] cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[16:38:58] <wikibugs>	 (PS2) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[16:43:10] <wikibugs>	 (PS3) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[16:46:10] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10805304 (Vgutierrez) We could definitely use that kind of data :)
[16:48:14] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7001.magru.wmnet
[16:48:14] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7001.magru.wmnet
[16:48:58] <fabfur>	 !log repooling cp7001  (T393671)
[16:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:01] <stashbot>	 T393671: Benchmark different options - https://phabricator.wikimedia.org/T393671
[16:49:49] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet
[16:50:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqsin
[16:50:28] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqsin
[16:51:01] <wikibugs>	 (CR) Ssingh: trafficserver: Allow splitting the cache by HTTP header content (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[16:53:13] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container version to 2025-05-08-122500-production [deployment-charts] - https://gerrit.wikimedia.org/r/1143620
[17:00:05] <jouncebot>	 bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700).
[17:00:05] <jouncebot>	 swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700).
[17:01:33] <wikibugs>	 (CR) BryanDavis: [C:+2] developer-portal: Bump container version to 2025-05-08-122500-production [deployment-charts] - https://gerrit.wikimedia.org/r/1143620 (owner: BryanDavis)
[17:01:36] <wikibugs>	 (PS1) JHathaway: systemd::sysuser: don't run exec when absent [puppet] - https://gerrit.wikimedia.org/r/1143621
[17:01:44] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143621 (owner: JHathaway)
[17:02:03] <wikibugs>	 (CR) Ssingh: trafficserver: Allow splitting the cache by HTTP header content (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[17:02:06] <bd808>	 o/ I have a developer.wikimedia.org version bump to push out.
[17:02:08] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:02:54] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:03:06] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container version to 2025-05-08-122500-production [deployment-charts] - https://gerrit.wikimedia.org/r/1143620 (owner: BryanDavis)
[17:03:25] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: calculate w3c generated timestamp [puppet] - https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: Cwhite)
[17:03:33] <swfrench-wmf>	 o/ I'm holding off on my scheduled changes for the moment
[17:03:48] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3124560) is awaiting input
[17:04:44] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:04:58] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:05:00] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.327 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:05:16] <wikibugs>	 (PS4) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[17:05:19] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:05:32] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:05:48] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:05:54] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:06:07] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:06:32] <wikibugs>	 (CR) Majavah: [C:+1] systemd::sysuser: don't run exec when absent [puppet] - https://gerrit.wikimedia.org/r/1143621 (owner: JHathaway)
[17:09:06] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1111.eqiad.wmnet|name=cirrussearch1112.eqiad.wmnet|name=cirrussearch1113.eqiad.wmnet|name=cirrussearch1114.eqiad.wmnet|name=cirrussearch1115.eqiad.wmnet|name=cirrussearch1116.eqiad.wmnet|name=cirrussearch1117.eqiad.wmnet|name=cirrussearch1118.eqiad.wmnet|name=cirrussearch1119.eqiad.wmnet|name=cirrussearch1120.eqiad.wmnet|name=cirrussearch1121.eqiad.wmnet|name=cirrussearch1122.eqiad.wmnet|name=cirrussearch1123.eqiad.wmnet|name=cirrussearch1124.eqiad.wmnet|name=cirrussearch1125.eqiad.wmnet
[17:09:56] <wikibugs>	 (PS1) Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553)
[17:10:10] * bd808 is done with the WCMS & Tech Docs deploy window
[17:10:48] <wikibugs>	 (PS2) Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553)
[17:11:50] <wikibugs>	 (CR) Ebernhardson: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5507/co" [puppet] - https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[17:12:09] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1112.eqiad.wmnet|cirrussearch1113.eqiad.wmnet|cirrussearch1114.eqiad.wmnet|cirrussearch1115.eqiad.wmnet|cirrussearch1116.eqiad.wmnet|cirrussearch1117.eqiad.wmnet|cirrussearch1118.eqiad.wmnet|cirrussearch1119.eqiad.wmnet|cirrussearch1120.eqiad.wmnet|cirrussearch1121.eqiad.wmnet|cirrussearch1122.eqiad.wmnet|cirrussearch1123.eqiad.wmnet|cirrussearch1124.eqiad.wmnet|cirrussearch1125.eqiad.wmnet
[17:13:05] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm
[17:13:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:13:10] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err...
[17:13:12] <wikibugs>	 (CR) JHathaway: [C:+2] systemd::sysuser: don't run exec when absent [puppet] - https://gerrit.wikimedia.org/r/1143621 (owner: JHathaway)
[17:16:06] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10805388 (Sreejithk2000) Thanks for the help. File undeleted.
[17:19:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 12.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:19:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805398 (Jhancock.wm) @Papaul ganeti2047 tried to connect to the wrong puppetserver. failed there.    [8/10, retrying in 640.00s] Attempt to run 'spicerack.puppet...
[17:19:57] <jinxer-wm>	 FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:20:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[17:20:07] <cdanis>	 !incidents
[17:20:08] <sirenbot>	 6109 (UNACKED)  ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[17:20:08] <sirenbot>	 6108 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[17:20:08] <sirenbot>	 6107 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[17:20:08] <sirenbot>	 6106 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[17:20:09] <sirenbot>	 6105 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[17:20:13] <cdanis>	 !ack 6109
[17:20:13] <sirenbot>	 6109 (ACKED)  ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[17:20:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[17:21:11] <wikibugs>	 (PS3) Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553)
[17:21:17] <wikibugs>	 (PS5) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[17:21:22] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:21:24] <wikibugs>	 (CR) CI reject: [V:-1] search: Update dnsdisc envoy upstreams [puppet] - https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[17:21:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 4 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[17:21:44] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10805399 (Pppery) Open→Resolved a:MatthewVernon
[17:21:53] <wikibugs>	 (CR) CI reject: [V:-1] search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[17:23:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[17:23:10] <wikibugs>	 (PS1) CDanis: esams-- [dns] - https://gerrit.wikimedia.org/r/1143624
[17:23:32] <wikibugs>	 (CR) BBlack: [C:+1] esams-- [dns] - https://gerrit.wikimedia.org/r/1143624 (owner: CDanis)
[17:23:36] <wikibugs>	 (CR) Ladsgroup: [C:+1] esams-- [dns] - https://gerrit.wikimedia.org/r/1143624 (owner: CDanis)
[17:23:38] <wikibugs>	 (CR) Hnowlan: [C:+1] esams-- [dns] - https://gerrit.wikimedia.org/r/1143624 (owner: CDanis)
[17:23:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723 (thcipriani) NEW
[17:23:42] <wikibugs>	 (CR) CDanis: [V:+2 C:+2] esams-- [dns] - https://gerrit.wikimedia.org/r/1143624 (owner: CDanis)
[17:23:47] <wikibugs>	 (CR) MVernon: [C:+1] esams-- [dns] - https://gerrit.wikimedia.org/r/1143624 (owner: CDanis)
[17:23:49] <logmsgbot>	 !log cdanis@dns1004 START - running authdns-update
[17:24:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[17:24:18] <jinxer-wm>	 FIRING: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[17:24:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[17:24:34] <swfrench-wmf>	 !incidents
[17:24:34] <sirenbot>	 6109 (ACKED)  ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[17:24:34] <sirenbot>	 6110 (UNACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[17:24:34] <sirenbot>	 6108 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[17:24:35] <sirenbot>	 6107 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[17:24:35] <sirenbot>	 6106 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[17:24:35] <sirenbot>	 6105 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[17:24:37] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[17:24:40] <swfrench-wmf>	 !ack 6110
[17:24:40] <sirenbot>	 6110 (ACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[17:24:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:25:05] <cdanis>	 is that for esams again
[17:25:08] <swfrench-wmf>	 !incidents
[17:25:09] <sirenbot>	 6109 (ACKED)  ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[17:25:09] <sirenbot>	 6110 (ACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[17:25:09] <sirenbot>	 6108 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[17:25:09] <sirenbot>	 6107 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[17:25:09] <sirenbot>	 6106 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[17:25:10] <sirenbot>	 6105 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[17:25:12] <logmsgbot>	 !log cdanis@dns1004 END - running authdns-update
[17:25:21] <jinxer-wm>	 FIRING: [2x] PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[17:26:02] <cdanis>	 lol
[17:26:09] <akosiaris>	 full queues? ouch
[17:26:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:26:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:27:14] <wikibugs>	 (PS1) Bernard Wang: Remove eb_ab_test_enrollment schema [mediawiki-config] - https://gerrit.wikimedia.org/r/1143625
[17:27:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724 (thcipriani) NEW
[17:27:25] <wikibugs>	 (PS2) Bernard Wang: Remove web_ab_test_enrollment schema [mediawiki-config] - https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247)
[17:27:32] <wikibugs>	 (PS6) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[17:28:05] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on A:cp-upload_eqsin
[17:28:07] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on A:cp-text_eqsin
[17:28:12] <wikibugs>	 (CR) CI reject: [V:-1] search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[17:29:00] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:29:18] <jinxer-wm>	 FIRING: [5x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[17:29:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[17:29:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:30:21] <jinxer-wm>	 RESOLVED: [2x] PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[17:31:30] <jinxer-wm>	 RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 6 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[17:31:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:31:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:31:55] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:32:00] <bd808>	 jouncebot: refresh
[17:32:01] <jouncebot>	 I refreshed my knowledge about deployments.
[17:32:07] <sukhe>	 what's up with the puppetmaster?
[17:32:09] <bd808>	 jouncebot: next
[17:32:09] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7+Utc-0 Version (HOLD for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1800)
[17:32:14] <sukhe>	 is anyone working on that?
[17:32:47] <vgutierrez>	 puppetmaster got broken a few days ago iirc
[17:32:58] <vgutierrez>	 tls material not setting SNI as expected IIRC
[17:33:04] <sukhe>	 hmm ok
[17:33:07] <taavi>	 these are all from three days ago apparently
[17:33:10] <taavi>	 these alerts*
[17:33:23] <wikibugs>	 (PS7) Ebernhardson: search: add discovery records for secondary clusters [dns] - https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553)
[17:33:27] <vgutierrez>	 service owner will know better
[17:34:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[17:34:18] <jinxer-wm>	 RESOLVED: [5x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[17:34:38] <taavi>	 those are old alerts, and as far as I can tell not directly impacting anything atm, so I'm tempted to throw them into the "file a task and deal with it later" pile
[17:35:58] <wikibugs>	 (PS1) Hnowlan: mw-api-ext: bump replicas temporarily [deployment-charts] - https://gerrit.wikimedia.org/r/1143629
[17:36:12] <wikibugs>	 (PS1) CDanis: esams++ drmrs-- [dns] - https://gerrit.wikimedia.org/r/1143630
[17:36:29] <wikibugs>	 (CR) Ssingh: [C:+1] esams++ drmrs-- [dns] - https://gerrit.wikimedia.org/r/1143630 (owner: CDanis)
[17:36:44] <wikibugs>	 (CR) Scott French: [C:+1] esams++ drmrs-- [dns] - https://gerrit.wikimedia.org/r/1143630 (owner: CDanis)
[17:36:49] <wikibugs>	 (CR) CDanis: [V:+2 C:+2] esams++ drmrs-- [dns] - https://gerrit.wikimedia.org/r/1143630 (owner: CDanis)
[17:36:55] <logmsgbot>	 !log cdanis@dns1004 START - running authdns-update
[17:38:19] <logmsgbot>	 !log cdanis@dns1004 END - running authdns-update
[17:39:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:39:20] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[17:39:26] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[17:41:30] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Remove web_ab_test_enrollment schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: 10Bernard Wang)
[17:45:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[17:51:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805494 (10Papaul) @Scott_French thank you
[17:53:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805498 (10Papaul) @jjanhone both 47 and 48 were on the wrong puppetserver. Remove all yours ` sudo puppet cert --list Warning: `puppet cert` is deprecated and will...
[17:58:07] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[18:00:37] <cdanis>	 jouncebot: nowandnext
[18:00:38] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 59 minute(s)
[18:00:38] <jouncebot>	 In 0 hour(s) and 59 minute(s): MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900)
[18:01:51] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.162.0" for 2 host(s)
[18:03:39] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.162.0" completed for 2 hosts
[18:07:53] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553)
[18:08:13] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking)
[18:08:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10805541 (10Eevans) @cmassaro so if I understand correctly, this isn't really about access per se, but a request to have your key changed?  And (either way), we need to verify your ssh k...
[18:12:03] <wikibugs>	 (03PS2) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553)
[18:12:15] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking)
[18:13:52] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2048.codfw.wmnet with OS bookworm
[18:13:58] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with err...
[18:18:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10805587 (10Eevans) @Seddon Can you post the public key on one of your user pages (meta.w.o for example) for verification purposes?
[18:29:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp50[19-24].eqsin.wmnet} and A:cp
[18:29:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10805607 (10ssingh)
[18:30:32] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp50[27-32].eqsin.wmnet} and A:cp
[18:31:41] <wikibugs>	 (03PS1) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103)
[18:34:49] <wikibugs>	 (03CR) 10Jsn.sherman: "Hi Amir, I got started with the dblist; I just checked for wikis with the interface enabled to start with and piped them into a dblist lik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman)
[18:37:53] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "If there are no objections, I would like to merge this next week." [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh)
[18:38:44] <zabe>	 !log zabe@deploy1003:~$ mwscript-k8s --attach --  extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Wikimedia Foundation Board of Trustees/Call for feedback:2022 Board of Trustees election/Upcoming Call for Feedback about the Board of Trustees elections" "Wikimedia Foundation/Board of Trustees/Call for feedback:2022 Board of
[18:38:44] <zabe>	 Trustees election/Upcoming Call for Feedback about the Board of Trustees elections" "Zabe" --reason "per request [[:phab:T393619|T393619]]"
[18:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:46] <stashbot>	 T393619: Request to move translatable page: Wikimedia Foundation Board of Trustees - https://phabricator.wikimedia.org/T393619
[18:38:55] <zabe>	 meh too long
[18:39:36] <Reedy>	 lol
[18:42:18] <zabe>	 !log zabe@deploy1003:~$ mwscript-k8s --attach -- extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Wikimedia Foundation Board of Trustees/Call for feedback: Board of Trustees elections" "Wikimedia Foundation/Board of Trustees/Call for feedback: Board of Trustees elections" "Zabe" --reason "per request
[18:42:19] <zabe>	 [[:phab:T393619|T393619]]"
[18:42:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:38] <zabe>	 !log mwscript-k8s [...]moveTranslatableBundle.php metawiki "Wikimedia Foundation Board of Trustees/Call for feedback: Board of Trustees elections" "Wikimedia Foundation/Board of Trustees/Call for feedback: Board of Trustees elections" "Zabe" --reason "per request [[:phab:T393619|T393619]]"
[18:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:27] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" [puppet] - 10https://gerrit.wikimedia.org/r/1143606 (owner: 10Hnowlan)
[18:44:46] <jeena>	 Hello, I have been given the go ahead to start the train deploy now
[18:45:50] <zabe>	 !log move all translatable subpages of "Wikimedia Foundation Board of Trustees" to subpages of "Wikimedia Foundation/Board of Trustees" on metawiki (T393619)
[18:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm
[18:45:53] <stashbot>	 T393619: Request to move translatable page: Wikimedia Foundation Board of Trustees - https://phabricator.wikimedia.org/T393619
[18:45:59] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm
[18:46:00] <wikibugs>	 (03PS1) 10Ssingh: geo-maps: revert DE back to esams [dns] - 10https://gerrit.wikimedia.org/r/1143646
[18:46:01] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143647 (https://phabricator.wikimedia.org/T386223)
[18:46:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143647 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot)
[18:46:46] <wikibugs>	 (03CR) 10Scott French: [C:03+1] geo-maps: revert DE back to esams [dns] - 10https://gerrit.wikimedia.org/r/1143646 (owner: 10Ssingh)
[18:46:53] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] geo-maps: revert DE back to esams [dns] - 10https://gerrit.wikimedia.org/r/1143646 (owner: 10Ssingh)
[18:46:58] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[18:47:01] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143647 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot)
[18:48:21] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[18:49:57] <wikibugs>	 (03CR) 10Scott French: "Thank you! Once you re-add `interval` to the absented `periodic_job` (ugh ... mismatch in `Optional`-ness), I think this should tick all t" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[18:57:09] <wikibugs>	 (03PS1) 10Herron: thanos-rule: logstash_sli_availability:bool sum by (site) [puppet] - 10https://gerrit.wikimedia.org/r/1143650
[18:59:35] <wikibugs>	 (CR) BCornwall: varnish: Issue and handle WMF-Uniq cookie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[19:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900)
[19:01:23] <logmsgbot>	 !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.28  refs T386223
[19:01:26] <stashbot>	 T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223
[19:03:17] <wikibugs>	 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10805702 (10BCornwall) 05In progress→03Resolved pywikipedia.org is no longer being managed by our infra as the pywikibot project didn't express an interest in maintenance.
[19:04:29] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1003.eqiad.wmnet with reason: host reimage
[19:06:40] <wikibugs>	 (CR) Ebernhardson: cirrussearch: Add cluster-specific domain name as a SAN (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: Bking)
[19:08:08] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1003.eqiad.wmnet with reason: host reimage
[19:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:24:00] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:24:02] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[19:24:26] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:24:54] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:25:26] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733 (10RobH) 03NEW
[19:25:47] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733#10805769 (10RobH)
[19:26:28] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:26:57] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733#10805770 (10RobH) a:03fnegri Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving...
[19:27:32] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134)
[19:28:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[19:29:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:30:01] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:31:43] <wikibugs>	 (03PS1) 10Andrea Denisse: grafana: Enable dashboard sync between hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841)
[19:31:43] <wikibugs>	 (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1143660/5508/" [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[19:32:24] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:32:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:07] <logmsgbot>	 vriley@cumin1002 reimage (PID 95758) is awaiting input
[19:35:28] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:36:20] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:36:52] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.003 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[19:37:44] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] grafana: Enable dashboard sync between hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[19:40:02] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[19:40:28] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:41:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:42:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805795 (10Eevans)
[19:42:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from grafana.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=grafana.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[19:43:02] <swfrench-wmf>	 !incidents
[19:43:02] <sirenbot>	 6111 (UNACKED)  ATSBackendErrorsHigh cache_text sre (grafana.discovery.wmnet eqiad)
[19:43:03] <sirenbot>	 6110 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[19:43:03] <sirenbot>	 6109 (RESOLVED)  ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[19:43:03] <sirenbot>	 6108 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[19:43:03] <sirenbot>	 6107 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[19:43:03] <sirenbot>	 6106 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[19:43:04] <sirenbot>	 6105 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[19:43:06] <swfrench-wmf>	 !ack 6111
[19:43:07] <sirenbot>	 6111 (ACKED)  ATSBackendErrorsHigh cache_text sre (grafana.discovery.wmnet eqiad)
[19:43:29] <swfrench-wmf>	 investigation ongoing
[19:43:36] <swfrench-wmf>	 denisse: ^ FYI
[19:44:15] <denisse>	 Thank you, I'm investigating the issue.
[19:45:08] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805798 (10Eevans) @BWojtowicz-WMF can I get you t...
[19:45:41] <denisse>	 swfrench-wmf: Do you know if envoy is also used to access grafana-next?
[19:46:28] <denisse>	 If not, then I think that the issue with grafana may be related to this: https://phabricator.wikimedia.org/T393439
[19:47:35] <swfrench-wmf>	 denisse: alas, I do not know off hand. is grafana-next hosted differently? (e.g., on a different host or port?)
[19:48:46] <denisse>	 swfrench-wmf: My bad, I just realized envoy runs in the grafana host.
[19:48:52] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.002 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[19:49:16] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:49:18] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:49:20] <wikibugs>	 (03PS2) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675
[19:49:44] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:49:54] <swfrench-wmf>	 that looks promising!
[19:50:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805808 (10Eevans) @thcipriani Ok to add to deploy...
[19:50:08] <wikibugs>	 (CR) JHathaway: run_ci_locally.sh: use bind mounts for local runs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1142675 (owner: JHathaway)
[19:51:55] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:52:45] <swfrench-wmf>	 !incidents
[19:52:45] <sirenbot>	 6111 (ACKED)  ATSBackendErrorsHigh cache_text sre (grafana.discovery.wmnet eqiad)
[19:52:45] <sirenbot>	 6110 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[19:52:46] <sirenbot>	 6109 (RESOLVED)  ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[19:52:46] <sirenbot>	 6108 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[19:52:46] <sirenbot>	 6107 (RESOLVED)  db1246 (paged)/MariaDB Replica Lag: s2 (paged)
[19:52:46] <sirenbot>	 6106 (RESOLVED)  db1246 (paged)/MariaDB Replica SQL: s2 (paged)
[19:52:46] <sirenbot>	 6105 (RESOLVED)  db1246 (paged)/mysqld processes (paged)
[19:52:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from grafana.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=grafana.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[19:52:58] <swfrench-wmf>	 there it is :)
[19:53:07] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805823 (10Eevans)
[19:53:34] <swfrench-wmf>	 jouncebot: nowandnext
[19:53:34] <jouncebot>	 For the next 1 hour(s) and 6 minute(s): MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900)
[19:53:35] <jouncebot>	 In 0 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T2000)
[19:53:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805827 (10Eevans)
[19:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[19:55:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805835 (10Eevans)
[19:55:30] <swfrench-wmf>	 jeena: I see the train has rolled to group2. any objections if I were to sneak in a no-op scap run? (i.e., does not deploy images, just updates some bookkeeping)
[19:55:55] <jeena>	 swfrench-wmf: all good here, go ahead!
[19:55:57] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] grafana: Enable dashboard sync between hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse)
[19:57:05] <swfrench-wmf>	 jeena: great, thank you!
[19:57:40] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: switch mw-script main release to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137496 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French)
[19:59:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805870 (10thcipriani) >>! In T393595#10805807, @E...
[20:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:02] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fdd20feac10: Failed to establish a new connection: [Errno 113
[20:03:02] <icinga-wm>	 te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:03:25] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to switch mw-script/main to PHP 8.1 - T391057
[20:03:28] <stashbot>	 T391057: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057
[20:03:38] <logmsgbot>	 !log swfrench@deploy1003 Stopping before sync operations
[20:04:34] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134)
[20:04:44] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: drop unsupported fallback to PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/1137497 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French)
[20:05:02] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1708, active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 
[20:05:02] <icinga-wm>	 r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:05:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805878 (10Eevans)
[20:05:42] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[20:06:57] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper)
[20:11:17] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805887 (10Eevans) @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed {L3}?  And, while we're in the business of ticking boxes, can you have your manager...
[20:14:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805888 (10Eevans)
[20:14:57] <ryankemper>	 !log T388134 Beginning cutover of query.wikidata.org from `wdqs` to `wdqs-main`. Starting to see requests increase on wdqs-main (and decrease on wdqs) as expected. Rolling change to rest of cp text hosts. Traffic should be fully moved over in ~20 mins
[20:15:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:01] <stashbot>	 T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134
[20:16:35] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:16:35] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1003.eqiad.wmnet with OS bookworm
[20:16:41] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm completed: - apus-fe1003 (**PAS...
[20:16:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805891 (10Eevans)
[20:17:07] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805892 (10VRiley-WMF) 05Open→03Resolved This has been completed
[20:17:19] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805895 (10VRiley-WMF)
[20:19:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805903 (10Eevans) @VPuffetMichel: assuming you are @Esanders's manager, do we have your OK?
[20:21:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[20:21:19] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805904 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[20:21:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm
[20:21:28] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm
[20:23:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10805911 (Eevans) p:Triage→Medium
[20:24:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10805913 (Eevans) Open→In progress
[20:24:35] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805914 (Eevans)
[20:24:45] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805915 (Eevans) Open→In progress
[20:24:58] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805916 (Eevans) p:Triage→Medium
[20:25:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805917 (Eevans) Open→In progress p:Triage→Medium
[20:26:05] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805919 (Eevans) Open→In progress p:Triage→Medium
[20:27:57] <wikibugs>	 (PS1) Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134)
[20:29:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:30:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805923 (thcipriani) For clarity: I filed this task as a followup to a request for [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|spiderpig access]]. `deployment` membership is curren...
[20:30:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805925 (thcipriani) For clarity: I filed this task as a followup to a request for [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|spiderpig access]]. `deployment` membership is curre...
[20:32:04] <wikibugs>	 (PS2) Bking: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:32:09] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:32:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:32:26] <wikibugs>	 (PS3) Bking: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:32:30] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:32:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:33:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2048.codfw.wmnet with reason: host reimage
[20:33:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:33:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage
[20:36:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2048.codfw.wmnet with reason: host reimage
[20:36:54] <wikibugs>	 (PS1) Ahmon Dancy: Use buildkit wmf-v0.21.1 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/1143671 (https://phabricator.wikimedia.org/T393731)
[20:37:06] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:37:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage
[20:42:06] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:45:14] <wikibugs>	 (PS4) Bking: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:46:14] <wikibugs>	 (PS5) Bking: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:49:26] <wikibugs>	 (PS6) Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134)
[20:49:44] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp50[27-32].eqsin.wmnet} and A:cp
[20:51:02] <wikibugs>	 (PS7) Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134)
[20:51:11] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:52:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:52:56] <wikibugs>	 (PS8) Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134)
[20:54:03] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[20:54:06] <wikibugs>	 (PS9) Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134)
[20:54:25] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp50[19-24].eqsin.wmnet} and A:cp
[20:55:20] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3357775) is awaiting input
[20:55:23] <wikibugs>	 (CR) Aleksandar Mastilovic: "Here it is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143134" [puppet] - https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: Aleksandar Mastilovic)
[20:56:14] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_drmrs
[20:56:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_drmrs
[20:57:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:58:37] <wikibugs>	 (PS10) Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134)
[21:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T2100)
[21:00:26] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3357420) is awaiting input
[21:01:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T393368#10805985 (VRiley-WMF) Open→Resolved a:VRiley-WMF Rebalanced power.
[21:03:21] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[21:07:07] <wikibugs>	 (CR) Bking: [C:+2] wdqs-main: shift over hosts from full graph [puppet] - https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[21:17:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:23:03] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:23:08] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:27:58] <jinxer-wm>	 FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:27:58] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:29:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1012.eqiad.wmnet
[21:29:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards
[21:29:52] <stashbot>	 T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134
[21:32:58] <jinxer-wm>	 FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:33:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2007.codfw.wmnet
[21:34:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[21:35:01] <logmsgbot>	 !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1247.eqiad.wmnet with reason: Host has crashed - T393612
[21:35:04] <stashbot>	 T393612: db1247 crash or restart - 15:29 on 2025-05-07 - https://phabricator.wikimedia.org/T393612
[21:38:40] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs[2007,2013].codfw.wmnet,wdqs[1012-1014].eqiad.wmnet with reason: bringing hosts online with a data transfer
[21:40:40] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[21:43:20] <ryankemper>	 !log T388134 Cutover completed about an hour ago. Metrics look good; we're in the process of shifting over some of the old `wdqs` hosts to `wdqs-main` to increase capacity
[21:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:23] <stashbot>	 T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134
[21:43:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1013.eqiad.wmnet
[21:43:57] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  thanos-fe1005 - vriley@cumin1002"
[21:44:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1014.eqiad.wmnet
[21:44:16] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  thanos-fe1005 - vriley@cumin1002"
[21:44:17] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:44:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1015.eqiad.wmnet
[21:44:56] <tzatziki>	 !log removing 3 files for legal compliance
[21:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:01] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe1005
[21:45:43] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806065 (VRiley-WMF)
[21:46:18] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe1005
[21:47:05] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:47:58] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:48:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2010.codfw.wmnet
[21:48:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2011.codfw.wmnet
[21:48:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2012.codfw.wmnet
[21:48:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2013.codfw.wmnet
[21:50:23] <tzatziki>	 !log removing 1 file for legal compliance
[21:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:54:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2048.codfw.wmnet with OS bookworm
[21:54:45] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806098 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm completed: - gane...
[21:54:46] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:54:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2047.codfw.wmnet with OS bookworm
[21:54:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806099 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm completed: - gane...
[21:55:45] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806102 (Jhancock.wm) Open→Resolved
[21:56:06] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806105 (Jhancock.wm) @MoritzMuehlenhoff this is finally done. thanks for your patience!
[22:05:44] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:07:20] <wikibugs>	 (CR) Scott French: [C:+2] P:mw::maintenance::refreshlinks: rename and prepare for mw-cron [puppet] - https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[22:08:18] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:08:28] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806119 (VRiley-WMF)
[22:09:08] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:09:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:12:56] <wikibugs>	 (PS4) Scott French: P:mw::maintenance::refreshlinks: migrate s8 to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530)
[22:13:54] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[22:14:31] <jinxer-wm>	 RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:17:34] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1005.eqiad.wmnet with OS bullseye
[22:17:43] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806137 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1005.eqiad.wmnet with OS bullseye
[22:27:55] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[22:27:59] <stashbot>	 T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134
[22:28:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards
[22:31:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:25] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10806152 (RobH) url provided by support so i've uploaded the support collection report for their review
[22:47:34] <icinga-wm>	 PROBLEM - Disk space on arclamp2001 is CRITICAL: DISK CRITICAL - free space: /srv 10588 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[22:49:42] <icinga-wm>	 PROBLEM - Disk space on arclamp1001 is CRITICAL: DISK CRITICAL - free space: /srv 10628 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[23:06:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1012.eqiad.wmnet
[23:06:14] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84665MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[23:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:19:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2007.codfw.wmnet
[23:22:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2013.codfw.wmnet
[23:26:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1013.eqiad.wmnet
[23:30:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1015.eqiad.wmnet
[23:32:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:34:28] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:34:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2010.codfw.wmnet
[23:35:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1014.eqiad.wmnet
[23:35:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2011.codfw.wmnet
[23:36:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:48] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1005.eqiad.wmnet with OS bullseye
[23:37:56] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806241 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1005.eqiad.wmnet with OS bullseye e...
[23:37:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2012.codfw.wmnet
[23:37:58] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:39:00] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143693
[23:39:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143693 (owner: TrainBranchBot)
[23:39:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:44:28] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:51:03] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143693 (owner: TrainBranchBot)
[23:51:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag