[00:00:13] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802875 (10Aklapper) Please link to one specific example or proof where someone committed directly a change for "requested permission changes for user groups or extension requests/configur... [00:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:09] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:07:05] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802878 (10Justman10000) >>! In T393587#10802875, @Aklapper wrote: > Please link to one specific example or proof where someone committed directly a change for "requested permiss... [00:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143212 [00:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143212 (owner: 10TrainBranchBot) [00:08:49] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802879 (10Justman10000) The question remains: who guarantees that I can submit patches faster than others? How can I prove myself if I don't have any options available? 
[00:09:00] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_esams [00:11:04] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:11:30] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802881 (10Aklapper) You repeatedly talked about "commit directly" here so I assume that you can provide an example when people "committed directly" (whatever that phrase means)? [00:13:15] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802882 (10Aklapper) >>! In T393587#10802879, @Justman10000 wrote: > How can I prove myself if I don't have any options available? I answered that already in T393587#10801870... [00:14:31] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802885 (10Aklapper) >>! In T393587#10802879, @Justman10000 wrote: > The question remains: who guarantees that I can submit patches faster than others? I have no idea why someone should "... [00:14:49] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_esams [00:20:56] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802897 (10Pppery) I'll at least do the courtesy of answering Andre's request, assuming by "commit directly" you mean commit without code review by others. The Gerrit query https://gerrit... 
[00:27:37] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143212 (owner: 10TrainBranchBot) [00:46:19] (03PS1) 10Tim Starling: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143224 (https://phabricator.wikimedia.org/T393601) [00:46:39] (03PS1) 10Tim Starling: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143225 (https://phabricator.wikimedia.org/T393601) [00:49:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/cdd7853a7f90dbe96c6896c4f027cbc0f493d5266ad74f35dd0255c6eecfcd48/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:56:14] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [01:04:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143224 (https://phabricator.wikimedia.org/T393601) (owner: 10Tim Starling) [01:04:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143225 (https://phabricator.wikimedia.org/T393601) (owner: 10Tim Starling) [01:05:28] (03Merged) 10jenkins-bot: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143224 (https://phabricator.wikimedia.org/T393601) (owner: 10Tim Starling) [01:05:29] (03Merged) 
10jenkins-bot: Use CONTENTLANGUAGE rather than USERLANGUAGE [extensions/WikimediaMessages] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143225 (https://phabricator.wikimedia.org/T393601) (owner: 10Tim Starling) [01:06:06] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1143224|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]], [[gerrit:1143225|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]] [01:06:09] T393601: Sidebar donate link targets are always in English - https://phabricator.wikimedia.org/T393601 [01:09:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:12:16] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 9713 MB (3% inode=67%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [01:14:25] FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:29:25] RESOLVED: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:42] !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1143224|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]], [[gerrit:1143225|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [01:37:46] T393601: Sidebar donate link targets are always in English - 
https://phabricator.wikimedia.org/T393601 [01:38:16] !log tstarling@deploy1003 tstarling: Continuing with sync [01:43:07] (03PS4) 10Scott French: P:mw::maint::temporary_accounts: purge_temporary_accounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143197 (https://phabricator.wikimedia.org/T385866) [01:47:49] (03CR) 10RLazarus: mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [01:47:53] (03CR) 10RLazarus: [C:03+2] mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [01:50:16] (03Merged) 10jenkins-bot: mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [01:52:18] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143224|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]], [[gerrit:1143225|Use CONTENTLANGUAGE rather than USERLANGUAGE (T393601)]] (duration: 46m 12s) [01:52:21] T393601: Sidebar donate link targets are always in English - https://phabricator.wikimedia.org/T393601 [01:53:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [02:00:04] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:36:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10803050 (10phaultfinder) [03:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10803069 (10phaultfinder) [03:41:55] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:55:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:55:14] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:55:20] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[03:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [03:55:30] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:57:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:58:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:58:30] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:59:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:01:30] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:02:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-drmrs and 2620:0:860:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:04:09] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [04:06:30] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:10:20] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:11:18] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:46:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0600). [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:38:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:39:24] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:43:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1047.eqiad.wmnet [06:44:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:44:14] RECOVERY - 
mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:44:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:45:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [06:47:44] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [06:49:44] FYI, ml-etcd2001 will briefly go down for a Ganeti reboot [06:49:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [06:51:44] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [06:54:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [06:55:30] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms [06:56:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [06:56:55] FIRING: [19x] ProbeDown: Service ganeti2032:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:56:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [06:56:59] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [07:00:05] Amir1, Urbanecm, and awight: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:49] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete videoscaler cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1138713 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [07:04:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [07:04:35] (03PS1) 10Muehlenhoff: Remove obsolete videoscaler stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1143392 (https://phabricator.wikimedia.org/T360636) [07:06:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [07:06:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [07:07:41] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete videoscaler stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1143392 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [07:10:31] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10803287 (10MoritzMuehlenhoff) A fixed package is now in bookworm-proposed-updates and will be part of the Bookworm 12.11 point rel... 
[07:12:16] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [07:12:55] (03PS5) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) [07:27:29] (03CR) 10DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [07:29:10] (03CR) 10Muehlenhoff: [C:03+2] Pass krb2002 to Kerberos clients again [puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [07:39:35] !log fab@deploy1003 Started deploy [airflow-dags/research@4367417]: (no justification provided) [07:39:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [07:40:01] (03PS4) 10DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) [07:40:15] !log fab@deploy1003 Finished deploy [airflow-dags/research@4367417]: (no justification provided) (duration: 00m 40s) [07:40:29] !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided) [07:41:55] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:30] jouncebot: 
nowandnext [07:43:30] For the next 0 hour(s) and 16 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0700) [07:43:30] In 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0800) [07:44:07] (03PS1) 10Jelto: gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143471 (https://phabricator.wikimedia.org/T393498) [07:46:10] !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 05m 42s) [07:47:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [07:48:38] (03Merged) 10jenkins-bot: cirrus: explicitly route search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [07:48:42] (03CR) 10Jelto: [C:03+2] gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143471 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [07:49:09] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1129182|cirrus: explicitly route search traffic to codfw (T388610)]] [07:49:12] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [07:51:50] !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided) [07:52:32] !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 00m 42s) [07:53:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) (owner: 10Slyngshede) [07:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu 
synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [07:55:52] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1129182|cirrus: explicitly route search traffic to codfw (T388610)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:55:55] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [07:55:58] !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided) [07:56:27] !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 00m 29s) [07:57:07] (03CR) 10Elukey: [C:03+2] profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [07:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:58:04] (03CR) 10Giuseppe Lavagetto: "I'd like to understand better what are you trying to fix here. 
To be more explicit, do we have a case of a service exposing multiple ports" [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [07:58:16] (03CR) 10Slyngshede: [C:03+2] Login: fix redirect on login [software/bitu] - 10https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) (owner: 10Slyngshede) [07:59:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T0800) [08:00:56] (03Merged) 10jenkins-bot: Login: fix redirect on login [software/bitu] - 10https://gerrit.wikimedia.org/r/1142575 (https://phabricator.wikimedia.org/T391345) (owner: 10Slyngshede) [08:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:00] sorry I'm still in the middle of a deploy [08:03:55] !log dcausse@deploy1003 dcausse: Continuing with sync [08:04:09] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:05:22] !log depooling and disabling puppet on cp7001 to perform tests (T393671) [08:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:25] 
T393671: Benchmark differnet options - https://phabricator.wikimedia.org/T393671 [08:06:05] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [08:12:28] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129182|cirrus: explicitly route search traffic to codfw (T388610)]] (duration: 23m 19s) [08:12:31] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [08:14:50] search flowing to opensearch@codfw, things look good afaics [08:18:43] going to call this done, please let me know if you see anything weird related to search [08:19:05] !log closing UTC morning backport window [08:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:49] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] P:wmcs::instance: Don't install puppet-lint [puppet] - 10https://gerrit.wikimedia.org/r/1142535 (owner: 10Majavah) [08:27:11] (03CR) 10Majavah: [C:03+2] P:wmcs::instance: Don't install puppet-lint [puppet] - 10https://gerrit.wikimedia.org/r/1142535 (owner: 10Majavah) [08:37:03] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7001.magru.wmnet with reason: Testing in progress [08:44:27] (03PS1) 10Jelto: gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143478 (https://phabricator.wikimedia.org/T393498) [08:47:13] (03CR) 10Jelto: [C:03+2] gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143478 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [08:47:58] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm [08:48:03] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803532 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS 
bookworm [08:48:29] (03PS2) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) [08:48:29] (03CR) 10Vgutierrez: "varnish tests are happy in both text and upload." [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [08:49:32] (03PS3) 10Volans: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:49:53] (03PS6) 10Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 [08:49:57] (03CR) 10Volans: "I've added some entries in the host's hieradata file" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [08:52:49] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apus-fe1003.eqiad.wmnet with OS bookworm [08:52:53] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803544 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors: - apus-f... 
[08:53:12] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm [08:53:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803545 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm [08:54:11] (03PS1) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [08:56:06] (03PS2) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [08:56:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:01:07] (03CR) 10Effie Mouzeli: "The issue that I0b412ef2aa7c2aac35747f3a4724848a3fee1df6 was submitted for, was addressed in I1f3a8f607e4864f9edbe2e4d949855385b430671." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [09:01:38] (03PS1) 10Volans: test-cookbook: expand help message [puppet] - 10https://gerrit.wikimedia.org/r/1143485 [09:01:38] (03PS1) 10Volans: cumin: tweak insetup role report config [puppet] - 10https://gerrit.wikimedia.org/r/1143486 [09:01:38] (03PS1) 10Volans: admin: add my own vim config [puppet] - 10https://gerrit.wikimedia.org/r/1143487 [09:02:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803567 (10MatthewVernon) @Jclark-ctr I've had a look at this, and the problem seems to be that it's failing to PXE boot at all - the reimage cookbook brings the host up fine (a... 
[09:02:53] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apus-fe1003.eqiad.wmnet with OS bookworm [09:02:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10803571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors: - apus-f... [09:04:14] (03PS2) 10Volans: admin: add my own vim config [puppet] - 10https://gerrit.wikimedia.org/r/1143487 [09:05:46] (03CR) 10Volans: [C:03+2] admin: add my own vim config [puppet] - 10https://gerrit.wikimedia.org/r/1143487 (owner: 10Volans) [09:07:01] (03PS1) 10Effie Mouzeli: Revert "admin: move jiji to ops-limited" [puppet] - 10https://gerrit.wikimedia.org/r/1143489 [09:09:07] 10SRE-swift-storage, 06Commons, 10media-backups: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10803576 (10jcrespo) @PMG: here are the files from backups. Would you reupload them to Commons... 
[09:10:38] (03PS4) 10Muehlenhoff: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) [09:10:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:15:16] (03PS5) 10Volans: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:15:42] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:16:47] (03PS1) 10Btullis: Bump the resources available to airflow kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143495 (https://phabricator.wikimedia.org/T388378) [09:17:45] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676 (10dcaro) 03NEW [09:17:55] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803598 (10dcaro) [09:18:09] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803602 (10dcaro) [09:18:19] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803604 (10dcaro) [09:18:27] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803606 (10dcaro) [09:20:02] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803608 (10dcaro) @ayounsi @cmooney feel free to use this task for this work, or link the one you are using for it, thanks! 
[09:21:35] (03CR) 10Btullis: [C:03+2] Bump the resources available to airflow kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143495 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [09:21:53] (03CR) 10Hnowlan: [C:03+1] P:mw::maint::temporary_accounts: purge_temporary_accounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143197 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [09:22:26] (03PS6) 10Volans: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:22:35] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:23:38] (03Merged) 10jenkins-bot: Bump the resources available to airflow kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143495 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [09:23:50] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803615 (10cmooney) @dcaro is there another task for those servers to be installed/provisioned? We have support for 25G in Eqiad racks E4 and F4, and codfw B1. Not in eqiad C8/D5. It's main... [09:24:05] (03PS1) 10Jelto: gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143500 (https://phabricator.wikimedia.org/T393498) [09:25:22] (03CR) 10Jelto: [C:03+2] gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143500 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [09:25:48] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803628 (10dcaro) >>! In T393676#10803615, @cmooney wrote: > @dcaro is there another task for those servers to be installed/provisioned? 
It's linked as parent {T389851}, not yet bought, but w... [09:25:54] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803630 (10cmooney) Sorry I found it. I added a note about C8/D5 in eqiad. [09:26:25] !log swift delete wikipedia-commons-local-public.e7 'e/e7/Hawkmoth_(Meganoton_nyctiphanes)_(8688240817).jpg' ms-fe1009 and ms-fe2009 T392658 [09:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:28] T392658: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658 [09:27:26] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10803632 (10MatthewVernon) @Sreejithk2000 done (apologies for the delay, I had some annua... [09:28:19] (03CR) 10Muehlenhoff: [C:03+1] Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:30:28] (03CR) 10MVernon: [C:03+1] "LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans) [09:32:30] (03CR) 10Volans: [C:03+1] "PCC seems finally happy, LGTM to start testing" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:33:08] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803636 (10dcaro) > We will need to support 25G on all of the racks, as we have to spread the nodes for high availability (specially critical if the hosts are that big) This is a blocker to b... [09:33:59] (03CR) 10Jcrespo: "I believe there is not yet a bookworm transferpy package." 
[puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:35:41] (03PS1) 10Hnowlan: Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236) [09:36:10] (03CR) 10Muehlenhoff: [C:03+1] "Initially we use profile::dbbackups::transfer::enabled: false so it won't get immediately installed." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:37:31] (03CR) 10Volans: [C:03+1] "But I'm afraid `profile::dbbackups::transfer` installs `wmfbackups-remote` before checking the enabled false/true :(" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:38:29] (03CR) 10CI reject: [V:04-1] Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236) (owner: 10Hnowlan) [09:41:02] (03PS2) 10Hnowlan: Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236) [09:42:30] (03CR) 10Muehlenhoff: [C:03+1] "Oh indeed. 
I just had a look at the dependencies of wmfbackups-remote and transferpy and they have no specific dependencies not fulfilled " [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:43:47] (03CR) 10Jcrespo: [C:03+1] Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:45:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:45:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:46:04] (03CR) 10Jcrespo: [C:03+1] "Not that you have to help me, but I would appreciate if you or anyone could help me at some point setting up bookworm CI for my python pack" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [09:51:24] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803654 (10dcaro) [09:52:02] 06SRE, 06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10803656 (10dcaro) p:05Triage→03High [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1000) [10:01:13] (03PS3) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) [10:01:13] (03PS3) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [10:01:36] (03CR) 10Volans: "I don't have any CI" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:03:35] (03CR) 10Slyngshede: [C:03+1] 
Revert "admin: move jiji to ops-limited" [puppet] - 10https://gerrit.wikimedia.org/r/1143489 (owner: 10Effie Mouzeli) [10:06:03] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143506 [10:11:30] (03PS1) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) [10:12:14] (03CR) 10Jcrespo: [C:03+1] "The biggest issue with transferpy is that they assume hosts use iptables, not netfilter." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:13:11] (03CR) 10Hnowlan: [C:03+1] P:mw::maintenance::refreshlinks: rename and prepare for mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French) [10:14:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:15:34] (03PS2) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) [10:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:21:13] (03CR) 10Hnowlan: [C:03+2] Revert "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1143501 (https://phabricator.wikimedia.org/T393236) (owner: 10Hnowlan) [10:22:20] (03PS3) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) 
[10:23:24] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [10:25:37] (03CR) 10Jforrester: "Ack." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [10:26:58] (03CR) 10Filippo Giunchedi: [C:03+1] "Looks good, thank you !" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [10:30:00] (03CR) 10Jforrester: [C:03+2] Change red to blue: blue=bad, green=good, yellow=yyyeah... [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 (owner: 10Cory Massaro) [10:30:58] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:31:17] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:31:25] (03Merged) 10jenkins-bot: Change red to blue: blue=bad, green=good, yellow=yyyeah... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 (owner: 10Cory Massaro) [10:40:16] (03CR) 10Kamila Součková: [C:03+1] P:mw::maintenance::refreshlinks: migrate s8 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) (owner: 10Scott French) [10:40:20] PROBLEM - Hadoop NodeManager on an-worker1165 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:42:09] (03PS1) 10Effie Mouzeli: mw-cron: disable mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143517 (https://phabricator.wikimedia.org/T341555) [10:42:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:15] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:45:46] !log zabe@deploy1003:~$ mwscript-k8s --attach -- extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Wikimedia Foundation Board of Trustees" "Wikimedia Foundation/Board of Trustees" "Zabe" --reason "per request [[:phab:T393619|T393619]]" [10:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:51] T393619: Request to move translatable page: Wikimedia Foundation Board of Trustees - https://phabricator.wikimedia.org/T393619 [10:46:36] (03PS4) 10Hnowlan: mw::maintenance: move 
refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [10:47:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:53:09] (03PS1) 10Effie Mouzeli: mw-cron: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) [10:56:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:20] RECOVERY - Hadoop NodeManager on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:59:10] (03CR) 10Muehlenhoff: [C:03+1] "Can you please open a separate task for adding nftables support? I'm happy to help with that." [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:01:07] (03CR) 10Hnowlan: "Makes sense to me, I've added an `absent`." [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [11:15:10] !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [11:15:14] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. 
on all recursors [11:15:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [11:19:32] PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:19:45] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [11:19:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [11:20:20] PROBLEM - Hadoop NodeManager on an-worker1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:21:08] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:21:26] (03PS1) 10Hnowlan: mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143528 (https://phabricator.wikimedia.org/T385782) [11:21:28] (03PS1) 10Hnowlan: mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) [11:23:49] (03PS1) 10Btullis: Reduce the limits on the default kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143531 (https://phabricator.wikimedia.org/T388378) [11:25:22] PROBLEM - Hadoop NodeManager on an-worker1192 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:25:40] (03PS1) 10Hnowlan: mw::maintenance: migrate db_lag_stats_reporter to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) [11:26:20] RECOVERY - Hadoop NodeManager on an-worker1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:28:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet [11:32:08] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:32:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [11:32:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet [11:39:34] (03CR) 10Btullis: [C:03+2] Reduce the limits on the default kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143531 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [11:40:32] RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:40:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet [11:41:40] (03Merged) 10jenkins-bot: Reduce the limits on the default kubernetes pod operator tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143531 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) 
[11:41:44] (03PS1) 10Muehlenhoff: transferpy: Build for Bookworm [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) [11:41:55] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:25] (03CR) 10CI reject: [V:04-1] transferpy: Build for Bookworm [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:43:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:44:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:44:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet [11:44:22] RECOVERY - Hadoop NodeManager on an-worker1192 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:43] (03CR) 10Muehlenhoff: "The CI configured here isn't working, but the package built just fine on build2002, I'll import it next." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [11:48:58] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [11:50:58] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [11:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [11:57:51] !log import transferpy 1.1+deb12u1 to bookworm-wikimedia T389380 [11:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:53] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380 [11:59:21] (03CR) 10Muehlenhoff: [C:03+1] "wmfbackups-remote was already imported for Bookworm. 
I built transferpy for Bookworm and uploaded it to apt.wikimedia.org, so we should be" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1200) [12:00:28] (03PS8) 10Slyngshede: Initial implementation of VueJS frontend [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) [12:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:39] 10SRE-tools, 10database-backups, 10Infrastructure Security, 06Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692 (10jcrespo) 03NEW [12:04:09] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:06:06] 10SRE-tools, 10database-backups, 10Infrastructure Security, 06Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692#10804062 (10jcrespo) @MoritzMuehlenhoff @Dzahn @FCeratto-WMF @MatthewVernon FYI [12:07:14] (03CR) 10Jcrespo: [C:03+1] "T393692" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [12:07:38] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - 
https://phabricator.wikimedia.org/T393205#10804065 (10Jhancock.wm) service request submitted. i'll let you know when it gets here and is replaced. [12:17:39] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10804140 (10Vgutierrez) ` vgutierrez@carrot:~$ whois pywikipedia.org |grep -i "Name server" Name Server: ns061.auroradns.eu Name Server: ns062.auroradns.nl Name Server: ns063.auroradns... [12:18:26] (03PS1) 10Btullis: Bump the heap allocated to YAN nodemanagers on the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) [12:18:40] (03PS1) 10Vgutierrez: Revert "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1143552 [12:18:53] (03PS2) 10Btullis: Bump the heap allocated to YARN nodemanagers on the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) [12:19:06] 10SRE-tools, 10database-backups, 10Infrastructure Security, 06Infrastructure-Foundations: transfer.py fails when handling nftables-configured firewall - https://phabricator.wikimedia.org/T393692#10804144 (10jcrespo) My suggestion for a fix would be to Split [[ https://phabricator.wikimedia.org/diffusion/OS... 
[12:19:37] (03PS3) 10Btullis: Bump the heap allocated to YARN nodemanagers on the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) [12:20:22] (03CR) 10Vgutierrez: [C:03+2] Revert "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1143552 (owner: 10Vgutierrez) [12:20:48] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5495/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [12:22:52] jouncebot: next [12:22:52] In 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1300) [12:23:17] i have a maintenance script to run during the window, i hope someone can do it for me :) [12:25:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [12:30:22] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10804180 (10Vgutierrez) I've reverted https://gerrit.wikimedia.org/r/1137481 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143552) to avoid acme-chief attempting to issue a ce... 
[12:31:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [12:32:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [12:36:08] (03PS1) 10Vgutierrez: hiera: Set acmechief_host to acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/1143568 [12:36:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:39:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org [12:40:02] (03CR) 10Stevemunene: [C:03+1] "looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1143551 (https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [12:40:04] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) (owner: 10Slyngshede) [12:41:28] FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:45:12] (03CR) 10Ssingh: [C:03+1] hiera: Set acmechief_host to acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/1143568 (owner: 10Vgutierrez) [12:46:28] RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:46:50] (03CR) 10Vgutierrez: [C:03+2] hiera: Set acmechief_host to acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/1143568 (owner: 10Vgutierrez) [12:51:27] (03PS9) 10Slyngshede: Initial implementation of VueJS frontend [software/bitu] - 10https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do 
the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1300). [13:00:05] MatmaRex and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:53] o/ [13:01:07] hi [13:01:24] i have a maintenance script to run during the window, i hope someone can do it for me :) [13:01:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10804254 (10MoritzMuehlenhoff) [13:03:30] !log installing jetty9 security updates [13:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Only enabling this for Bookworm seems fine, after all we use systemd::sysuser very little on Bullseye since it's hampered by T" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [13:11:28] I think I can deploy, but I haven't used spiderpig before. Do I have to use spiderpig or can I just use the deploy commands as before? 
[13:12:27] you can still use the regular commands, too [13:12:42] but spiderpig is an option [13:14:12] thcipriani: Thanks, going ahead [13:16:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: 10Kosta Harlan) [13:16:38] (03CR) 10Muehlenhoff: [C:03+2] Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [13:17:18] (03Merged) 10jenkins-bot: temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: 10Kosta Harlan) [13:17:42] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1142858|temp accounts: Remove AutopromoteOnce configuration (T393358)]] [13:17:45] T393358: Temporary accounts: Remove autopromote configuration for temporary-account-viewer - https://phabricator.wikimedia.org/T393358 [13:20:23] (03CR) 10BBlack: [C:03+1] "LGTM on the surface!" 
[dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [13:21:45] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10804369 (10ssingh) [13:24:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:24:32] !log tchanders@deploy1003 tchanders, kharlan: Backport for [[gerrit:1142858|temp accounts: Remove AutopromoteOnce configuration (T393358)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:34] T393358: Temporary accounts: Remove autopromote configuration for temporary-account-viewer - https://phabricator.wikimedia.org/T393358 [13:24:35] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10804385 (10MoritzMuehlenhoff) [13:27:16] (03PS1) 10Muehlenhoff: Switch the kadmin server to krb1002 [puppet] - 10https://gerrit.wikimedia.org/r/1143574 (https://phabricator.wikimedia.org/T390863) [13:27:34] !log tchanders@deploy1003 tchanders, kharlan: Continuing with sync [13:29:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:30:13] (03CR) 10Bking: [C:03+2] wdqs-main: allow query.wikidata.org to hit main [puppet] - 10https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [13:30:28] FIRING: 
KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:32:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:34:12] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142858|temp accounts: Remove AutopromoteOnce configuration (T393358)]] (duration: 16m 30s) [13:34:16] T393358: Temporary accounts: Remove autopromote configuration for temporary-account-viewer - https://phabricator.wikimedia.org/T393358 [13:35:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:35:28] RESOLVED: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:37:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:37:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:40:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:41:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:41:41] (03CR) 10Elukey: [C:03+1] Switch the kadmin server to krb1002 [puppet] - 10https://gerrit.wikimedia.org/r/1143574 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:41:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:42:05] !log forced removal of db1246 from puppetdb to unblock reimage (was failing due to a puppet change in the meanwhile) [13:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:22] 06SRE, 06serviceops, 07Essential-Work, 10Release-Engineering-Team (Radar), 05Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10804452 (10thcipriani) [13:46:02] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:46:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:47:21] (03CR) 10Btullis: [V:03+1 C:03+2] Bump the heap allocated to YARN nodemanagers on the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1143551 
(https://phabricator.wikimedia.org/T393695) (owner: 10Btullis) [13:47:24] (03CR) 10Eevans: [C:03+2] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans) [13:47:59] If anyone is able to run MatmaRex's maintenance script, please go ahead [13:50:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:50:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10804494 (10Gehel) [13:50:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10804495 (10Gehel) [13:51:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:51:59] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[1028-1030].eqiad.wmnet [13:52:03] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [13:52:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10804504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm [13:52:25] RESOLVED: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:54:36] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:55:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:10] (03PS8) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [13:56:22] (03CR) 10CDanis: [C:04-1] logstash: calculate w3c generated timestamp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: 10Cwhite) [13:57:36] James_F: hi, you around perhaps? want to run a maintenance script for me? we did a dry run during the hackathon. https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1300 [13:58:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:20] MatmaRex: Sure, I have a meeting now but in half an hour? [14:00:33] no hurry. 
thank you [14:02:50] (03PS2) 10Cwhite: logstash: calculate w3c generated timestamp [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) [14:03:30] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [14:03:50] (03PS1) 10Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - 10https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) [14:04:07] (03PS2) 10Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - 10https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) [14:05:40] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: 10Cwhite) [14:07:05] (03PS1) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) [14:08:17] (03CR) 10Cwhite: logstash: calculate w3c generated timestamp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: 10Cwhite) [14:09:00] eevans@cumin1002 decommission (PID 3987673) is awaiting input [14:12:01] !log pt1979@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [14:12:26] (03PS3) 10Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - 10https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) [14:12:50] (03CR) 10CI reject: [V:04-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [14:13:29] (03PS4) 10Muehlenhoff: Disable httbb k8s tests on cumin1003 for now [puppet] - 10https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) [14:14:15] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143585 (https://phabricator.wikimedia.org/T393714) [14:14:59] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [14:15:36] (03PS1) 10Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) [14:15:37] (03PS1) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [14:15:54] (03PS1) 10Bking: conftool: remove elastic row A hosts and add newly-reimaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) [14:16:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [14:17:11] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143585 (https://phabricator.wikimedia.org/T393714) (owner: 10DDesouza) [14:18:10] (03PS2) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) [14:18:10] (03PS2) 10Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) [14:18:10] (03PS2) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [14:19:00] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:04] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143585 (https://phabricator.wikimedia.org/T393714) (owner: 10DDesouza) [14:19:15] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [14:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:20:12] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [14:20:14] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1028-1030].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [14:20:32] topranks, XioNoX ^^ [14:20:37] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:20:38] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:21:06] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:21:07] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:21:36] (03CR) 10CI reject: [V:04-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:21:38] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:21:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:06] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10804784 (10Jelto) Yesterday, @jcrespo, @MatthewVernon, and I discussed backups for object storage. The discussion covered not only... [14:23:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1028-1030].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [14:23:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:23:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[1028-1030].eqiad.wmnet [14:23:33] vgutierrez: thanks yeah, issue on the Arelion cct there [14:23:49] ack [14:24:05] (03CR) 10CI reject: [V:04-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [14:24:22] (03CR) 10CI reject: [V:04-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:24:30] (03CR) 10CI reject: [V:04-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:25:37] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission restbase10[28-30].eqiad.wmnet - https://phabricator.wikimedia.org/T393617#10804794 (10Eevans) [14:25:49] (03PS3) 10Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) [14:25:49] (03PS3) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [14:27:04] (03CR) 10CI reject: [V:04-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:27:23] (03PS4) 10Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) [14:27:23] (03PS4) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [14:27:23] (03PS3) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) [14:27:26] (03CR) 10CI reject: [V:04-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:28:37] (03CR) 10CI reject: [V:04-1] modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:28:42] (03CR) 10CI reject: [V:04-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [14:28:43] (03CR) 10CI reject: [V:04-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 
(https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:33:26] (03PS1) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [14:34:19] !log Running `foreachwiki extensions/Echo/maintenance/removeInvalidNotification.php --remove # T389673` for MatmaRex [14:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:22] T389673: Make it possible to remove extensions' event data from Echo tables after undeploying them - https://phabricator.wikimedia.org/T389673 [14:34:30] (03CR) 10CI reject: [V:04-1] mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:35:34] (03PS2) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [14:36:20] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm [14:36:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10804877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**) - Removed from Puppet and... [14:38:35] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5498/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:40:17] (03CR) 10Muehlenhoff: [C:03+2] Disable httbb k8s tests on cumin1003 for now [puppet] - 10https://gerrit.wikimedia.org/r/1143583 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [14:43:34] James_F: tyvm. 
still running? [14:43:45] MatmaRex: Complete. Pasting the log now. [14:43:58] (03PS1) 10Ssingh: P:dns:auth::update: add timer for monthly git maintenance run [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) [14:44:09] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:44:38] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5499/co" [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh) [14:45:06] !log imported ripe-atlas-sagan 1.3.1-1~wmf12u1 to apt.wikimedia.org/bookworm T389380 [14:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:10] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380 [14:45:27] !log imported ripe-atlas-tools 2.3.0-3+wmf12u1 to apt.wikimedia.org/bookworm T389380 [14:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:25] RESOLVED: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:45] (03PS3) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [14:51:51] (03CR) 10CI reject: [V:04-1] mw::maintenance: 
move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:52:58] (03PS4) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [14:53:14] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:20] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:02] (03CR) 10CI reject: [V:04-1] mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:54:09] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1121-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:54:43] (03PS2) 10Ssingh: P:dns:auth::update: add timer for monthly git maintenance run [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) [14:55:23] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5501/co" [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh) [14:57:41] !log jmm@cumin1003 START - Cookbook 
sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [14:57:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:00:05] jeena and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1500) [15:02:34] (03CR) 10Vgutierrez: [C:03+1] "looking good, check the inline suggestion" [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh) [15:03:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [15:05:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:05:44] (03CR) 10Ssingh: [V:03+1] P:dns:auth::update: add timer for monthly git maintenance run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh) [15:05:46] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns:auth::update: add timer for monthly git maintenance run [puppet] - 10https://gerrit.wikimedia.org/r/1143593 (https://phabricator.wikimedia.org/T393602) (owner: 10Ssingh) [15:05:46] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:07:57] !log sudo cumin -b1 -s10 'A:dnsbox' 'run-puppet-agent' [15:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] (03CR) 10Btullis: [C:03+1] conftool: remove elastic row A hosts and add newly-reimaged hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [15:10:41] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:11:24] (03CR) 10Scott French: mw-cron: enable monitoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli) [15:11:37] (03PS1) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 [15:12:23] (03PS2) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 [15:12:48] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:13:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:13:57] (03PS1) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 [15:14:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805052 (10Papaul) I was having some issue to re-image this host but @Volans was able to help by removing the host from puppetdb. See below for error ` $ sudo puppet lookup --render-as s --compile --... 
[15:15:06] (03CR) 10CI reject: [V:04-1] Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (owner: 10BCornwall) [15:16:59] (03CR) 10JHathaway: [C:03+2] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [15:17:12] (03PS2) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 [15:20:58] (03CR) 10Effie Mouzeli: mw-cron: enable monitoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli) [15:21:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:21:15] (03Abandoned) 10Effie Mouzeli: mw-cron: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143520 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli) [15:21:32] (03CR) 10Pppery: [C:03+1] "I would add `Bug: T388809` to this patch so it gets linked with the task, otherwise looks fine -- the Pywikibot people have now had a mont" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (owner: 10BCornwall) [15:21:55] (03PS3) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809) [15:22:09] (03PS4) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809) [15:22:14] (03PS3) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) [15:22:30] (03PS5) 10BCornwall: Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 
(https://phabricator.wikimedia.org/T388809) [15:22:49] (03CR) 10BCornwall: [V:03+2 C:03+2] Revert "ncmonitor: Add pywikipedia.org to ignored domains" [puppet] - 10https://gerrit.wikimedia.org/r/1143597 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [15:24:11] (03CR) 10JHathaway: [C:03+2] "good point, thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [15:24:41] (03PS5) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [15:26:30] (03PS5) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [15:27:34] (03CR) 10CI reject: [V:04-1] mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:27:44] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:28:48] (03PS6) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [15:29:09] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1122-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:30:37] !incidents [15:30:37] 6097 (ACKED) db1247 (paged)/mysqld processes (paged) [15:30:38] 6098 (ACKED) db1247 (paged)/MariaDB Replica IO: s4 (paged) [15:30:38] 6099 
(ACKED) db1247 (paged)/MariaDB Replica SQL: s4 (paged) [15:30:38] 6100 (ACKED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [15:31:00] !resolve 6097 [15:31:00] 6097 (RESOLVED) db1247 (paged)/mysqld processes (paged) [15:31:03] !resolve 6098 [15:31:04] 6098 (RESOLVED) db1247 (paged)/MariaDB Replica IO: s4 (paged) [15:31:08] !resolve 6099 [15:31:08] 6099 (RESOLVED) db1247 (paged)/MariaDB Replica SQL: s4 (paged) [15:31:12] !resolve 6100 [15:31:12] 6100 (RESOLVED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [15:31:28] (03PS5) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [15:31:28] (03PS4) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) [15:31:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:32:12] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:32:32] swfrench-wmf: heh thanks [15:32:43] (03CR) 10Elukey: "Tried to run it, but run.sh returns to me:" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [15:32:44] :) [15:35:17] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:35:35] <_joe_> uh? 
[15:35:39] !incidents [15:35:40] 6100 (RESOLVED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [15:35:40] 👋 ack expired [15:35:47] <_joe_> ahhh [15:35:49] (03PS6) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [15:35:49] (03PS5) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) [15:36:12] I've boldly resolved in VO [15:36:22] rzl: thanks, yeah - I did as well [15:36:58] (03CR) 10Elukey: "Probably something wrong in the prev attempt, I see now:" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [15:37:08] very strange, I received a notification for at least one of these that I'd already re-acked and resolved [15:37:20] maybe I did that in the "wrong order" ? [15:37:48] jhancock@cumin2002 netbox (PID 3066154) is awaiting input [15:39:09] RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1122-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:39:23] (03CR) 10JHathaway: "looks much better!, you should be able to run further rake commands in /srv/workspace/puppet, e.g. 
bundle exec rake test" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [15:39:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002" [15:39:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002" [15:39:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:28] (03PS7) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [15:41:39] (03Abandoned) 10Elukey: envoy: customize latency buckets [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1143507 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [15:41:43] (03PS1) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) [15:41:55] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:28] (03CR) 10CI reject: [V:04-1] admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [15:42:33] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:43:35] (03PS7) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to 
shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [15:43:49] (03PS1) 10Majavah: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393487) [15:44:13] (03PS2) 10Majavah: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393487) [15:44:22] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393487) (owner: 10Majavah) [15:44:39] (03CR) 10CI reject: [V:04-1] trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:44:39] (03PS2) 10Eevans: JBOD partman recipe for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544) [15:44:41] (03CR) 10CI reject: [V:04-1] mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:45:31] (03PS2) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) [15:46:15] (03PS3) 10Eevans: JBOD partman recipe for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544) [15:46:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [15:46:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [15:46:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) 
[15:48:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:48:11] (03CR) 10Filippo Giunchedi: [C:03+1] "Tested with Eric" [puppet] - 10https://gerrit.wikimedia.org/r/1142635 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:49:02] (03PS1) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) [15:50:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:50:51] (03CR) 10Vgutierrez: [C:03+1] Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [15:51:30] (03CR) 10Bking: [C:03+2] conftool: remove elastic row A hosts and add newly-reimaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143589 (https://phabricator.wikimedia.org/T391118) (owner: 10Bking) [15:52:41] (03CR) 10Elukey: "Please keep in mind that my knowledge of ruby and Rakefiles is horrible, but the change is sound. 
I left a couple of little comments, and " [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [15:53:12] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:53:16] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5503/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:53:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:54:29] (03PS1) 10Hnowlan: mw:sharded_periodic_job: use "command" instead of script [puppet] - 10https://gerrit.wikimedia.org/r/1143606 [15:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [15:57:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [15:57:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [15:57:37] (03CR) 10Vgutierrez: "PCC output for non-NOOP can be seen on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143603/" [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:00:04] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:02] (03PS2) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) [16:01:02] (03PS1) 10Vgutierrez: cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) [16:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:01] (03PS2) 10Hnowlan: mw:sharded_periodic_job: use "command" instead of script [puppet] - 10https://gerrit.wikimedia.org/r/1143606 [16:03:13] (03CR) 10Vgutierrez: [C:04-2] "do not merge till Ia3c34647675a728e06c02e0d6cb9b00a8911ca61 is merged and deployed CDN wide" [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:03:34] 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10805182 (10elukey) >>! In T391852#10796611, @herron wrote: >>>! In T391852#10796071, @elukey wrote: >> @herron @RLazarus There are a couple of logistical thin... 
[16:04:51] (03CR) 10Vgutierrez: "@abaso@wikimedia.org I'm not planning to merge this one immediately but we will need to deploy it before being able to split the CDN cache" [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:05:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:05:57] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5506/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143606 (owner: 10Hnowlan) [16:06:39] (03CR) 10Ssingh: [C:03+2] type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [16:06:58] (03PS5) 10Ssingh: type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) [16:07:09] (03CR) 10Ssingh: "rebased, no code change" [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [16:08:23] (03CR) 10Ssingh: [V:03+2 C:03+2] type65.py: add support for generation of additional HTTPS SvcParams [dns] - 10https://gerrit.wikimedia.org/r/1142631 (https://phabricator.wikimedia.org/T384839) (owner: 10Ssingh) [16:08:43] !log sukhe@dns1004 START - running authdns-update [16:09:47] !log sukhe@dns1004 END - running authdns-update [16:10:24] (03PS8) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [16:10:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2048 to codfw - jhancock@cumin2002" [16:10:59] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2048 to codfw - jhancock@cumin2002" [16:11:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:12] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10805196 (10Jhancock.wm) heads up. dell doesn't want to replace it without more troubleshooting. since the idrac is not showing a failure. Gonna reseat the drive. Please let me know if it changes anything. [16:11:44] (03PS1) 10Andrew Bogott: designate policy.yaml: repair 'default' policy [puppet] - 10https://gerrit.wikimedia.org/r/1143610 (https://phabricator.wikimedia.org/T393679) [16:11:45] (03PS1) 10Andrew Bogott: Openstack common/servicetoken.erb: remove a misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/1143611 [16:11:45] (03PS1) 10Andrew Bogott: cinder: use 'cinder' service user rather than 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/1143612 (https://phabricator.wikimedia.org/T330759) [16:13:25] (03CR) 10Andrew Bogott: [C:03+2] designate policy.yaml: repair 'default' policy [puppet] - 10https://gerrit.wikimedia.org/r/1143610 (https://phabricator.wikimedia.org/T393679) (owner: 10Andrew Bogott) [16:13:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2048 [16:13:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048 [16:14:13] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate db_lag_stats_reporter to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [16:14:48] PROBLEM - mysqld processes #page on db1246 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:14:49] PROBLEM - MariaDB Replica SQL: s2 #page 
on db1246 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:50] PROBLEM - MariaDB Replica Lag: s2 #page on db1246 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:51] PROBLEM - MariaDB Replica IO: s2 #page on db1246 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:51] PROBLEM - MariaDB read only s2 on db1246 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:15:03] !incidents [16:15:04] 6105 (UNACKED) db1246 (paged)/mysqld processes (paged) [16:15:04] 6106 (UNACKED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [16:15:04] 6107 (UNACKED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [16:15:04] 6108 (UNACKED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [16:15:05] :-( [16:15:08] pope effect? [16:15:14] !ack 6105 [16:15:15] 6105 (ACKED) db1246 (paged)/mysqld processes (paged) [16:15:16] !ack 6106 [16:15:17] 6106 (ACKED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [16:15:17] no, host specific I would say [16:15:19] !ack 6107 [16:15:20] 6107 (ACKED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [16:15:21] !ack 6108 [16:15:22] 6108 (ACKED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [16:15:22] it's been bothering us for a while [16:15:23] yes and was already broken [16:15:26] nope, I think there was some bad hardware there [16:15:31] it's happy for the news [16:15:32] there's a downtime for this host ... [16:15:36] did the reimage clear it? [16:15:48] https://phabricator.wikimedia.org/T393296 [16:15:51] did the downtime time out? 
[16:15:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:16:01] no, I set it for 1w [16:16:07] I think Papaul is working on it? [16:16:10] presumably the end of the reimage did something [16:16:11] yes [16:16:13] yes [16:16:14] i am [16:16:33] i think it was downtime [16:16:35] for a week [16:16:38] what I do in these hw cases is set it as notif disables on puppet [16:16:38] (03Abandoned) 10Ebernhardson: services_proxy: Support multiple ports on discovery dns services [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [16:16:44] (03CR) 10Ebernhardson: "We run three clusters per DC, each cluster runs on a distinct port. They are implemented by running two copies of the server on each bare " [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [16:18:45] swfrench-wmf: yes the reimage removes the host [16:18:58] volans: thanks for confirming, that makes sense, then [16:18:59] so it disappears from icinga and then it's added again [16:19:17] jhancock@cumin2002 reimage (PID 3091523) is awaiting input [16:19:19] an in general if it's successful I think it also removed the downtime that on icinga means all downtimes [16:19:26] I suggest to power off the host [16:19:27] about to hit enter on re-downtiming, unless folks have objections and would prefer the puppet route :) [16:19:32] as has creating already too much noise [16:19:39] volans: there's active work happening on the host [16:19:46] i.e., it should be up [16:20:22] the host begs to differ :D [16:20:26] doesn't want to stay up [16:21:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:22:27] !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host has crashed - T393296 [16:22:30] 06SRE, 
06cloud-services-team, 10Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10805238 (10dcaro) This is not a blocker anymore in general, will still be needed the more big hosts we get, but can wait for the general 25G everywhere some time. [16:22:31] T393296: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296 [16:22:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805239 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3e20d375-9adc-4351-ba8a-0bbdf71aba3b) set by swfrench@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with rea... [16:23:09] re-downtimed [16:23:56] !incidents [16:23:57] 6105 (ACKED) db1246 (paged)/mysqld processes (paged) [16:23:57] 6106 (ACKED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [16:23:57] 6107 (ACKED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [16:23:57] 6108 (ACKED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [16:24:07] !resolve 6105 [16:24:08] 6105 (RESOLVED) db1246 (paged)/mysqld processes (paged) [16:24:10] !resolve 6106 [16:24:10] 6106 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [16:24:13] !resolve 6107 [16:24:14] 6107 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [16:24:17] !resolve 6108 [16:24:17] 6108 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [16:25:46] (03CR) 10BCornwall: [C:03+2] Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [16:25:54] (03PS4) 10BCornwall: Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) [16:25:57] (03CR) 10BCornwall: [V:03+2 C:03+2] Revert "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1143595 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [16:26:24] jouncebot: nowandnext [16:26:24] For the next 0 hour(s) and 33 
minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1600) [16:26:24] In 0 hour(s) and 33 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700) [16:26:24] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700) [16:26:43] 06SRE, 06Infrastructure-Foundations, 10netops: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10805253 (10akosiaris) +1 for what is worth. [16:27:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm [16:27:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805254 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm [16:27:34] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:27:58] (03PS6) 10Hnowlan: mw::maintenance: move updateMenteeData to upper level job [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) [16:28:03] !log brett@dns1005 START - running authdns-update [16:28:09] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143590 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:28:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10805256 (10akosiaris) >>! In T393053#10782038, @RobH wrote: > Alex, > > We didn't get racking details on the ordering task T392715, so we need to get them from... 
[16:29:14] !log brett@dns1005 END - running authdns-update [16:30:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm [16:30:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm [16:30:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10805262 (10akosiaris) a:05akosiaris→03None >>! In T393054#10782085, @RobH wrote: > Alex, > > We didn't get racking details on the ordering task T392714, so... [16:30:34] (03CR) 10BCornwall: [C:03+1] cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:30:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10805266 (10akosiaris) a:05akosiaris→03None [16:34:00] (03CR) 10BCornwall: [C:03+1] trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:34:06] (03CR) 10BCornwall: [C:03+1] hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:36:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805292 (10Scott_French) I've re-added a 1w downtime, as the earlier one was removed as a side-effect of the reimage. If we expect the host to be powered on for ongoing work, and also expect that work... 
[16:38:33] (03PS1) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [16:38:43] (03CR) 10Fabfur: [C:03+1] cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:38:58] (03PS2) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [16:43:10] (03PS3) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [16:46:10] 06SRE, 06Infrastructure-Foundations, 10netops: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10805304 (10Vgutierrez) We could definitely use that kind of data :) [16:48:14] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp7001.magru.wmnet [16:48:14] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7001.magru.wmnet [16:48:58] !log repooling cp7001 (T393671) [16:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:01] T393671: Benchmark different options - https://phabricator.wikimedia.org/T393671 [16:49:49] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [16:50:02] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqsin [16:50:28] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqsin [16:51:01] (03CR) 10Ssingh: trafficserver: Allow splitting the cache by HTTP header content (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:53:13] 
(03PS1) 10BryanDavis: developer-portal: Bump container version to 2025-05-08-122500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143620 [17:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700). [17:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1700). [17:01:33] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container version to 2025-05-08-122500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143620 (owner: 10BryanDavis) [17:01:36] (03PS1) 10JHathaway: systemd::sysuser: don't run exec when absent [puppet] - 10https://gerrit.wikimedia.org/r/1143621 [17:01:44] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143621 (owner: 10JHathaway) [17:02:03] (03CR) 10Ssingh: trafficserver: Allow splitting the cache by HTTP header content (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [17:02:06] o/ I have a developer.wikimedia.org version bump to push out. 
[17:02:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:02:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:03:06] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2025-05-08-122500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143620 (owner: 10BryanDavis) [17:03:25] (03CR) 10Cwhite: [C:03+2] logstash: calculate w3c generated timestamp [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) (owner: 10Cwhite) [17:03:33] o/ I'm holding off on my scheduled changes for the moment [17:03:48] jhancock@cumin2002 reimage (PID 3124560) is awaiting input [17:04:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:04:58] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:05:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.327 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:16] (03PS4) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [17:05:19] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:05:32] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:05:48] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:05:54] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:06:07] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: 
apply [17:06:32] (03CR) 10Majavah: [C:03+1] systemd::sysuser: don't run exec when absent [puppet] - 10https://gerrit.wikimedia.org/r/1143621 (owner: 10JHathaway) [17:09:06] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1111.eqiad.wmnet|name=cirrussearch1112.eqiad.wmnet|name=cirrussearch1113.eqiad.wmnet|name=cirrussearch1114.eqiad.wmnet|name=cirrussearch1115.eqiad.wmnet|name=cirrussearch1116.eqiad.wmnet|name=cirrussearch1117.eqiad.wmnet|name=cirrussearch1118.eqiad.wmnet|name=cirrussearch1119.eqiad.wmnet|name=cirrussearch1120.eqiad.wmnet|name=cirru [17:09:06] ssearch1121.eqiad.wmnet|name=cirrussearch1122.eqiad.wmnet|name=cirrussearch1123.eqiad.wmnet|name=cirrussearch1124.eqiad.wmnet|name=cirrussearch1125.eqiad.wmnet [17:09:56] (03PS1) 10Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) [17:10:10] * bd808 is done with the WCMS & Tech Docs deploy window [17:10:48] (03PS2) 10Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) [17:11:50] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5507/co" [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:12:09] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1112.eqiad.wmnet|cirrussearch1113.eqiad.wmnet|cirrussearch1114.eqiad.wmnet|cirrussearch1115.eqiad.wmnet|cirrussearch1116.eqiad.wmnet|cirrussearch1117.eqiad.wmnet|cirrussearch1118.eqiad.wmnet|cirrussearch1119.eqiad.wmnet|cirrussearch1120.eqiad.wmnet|cirrussearch1121.eqiad.wmnet|cirrussearch1122.eqiad.wmnet|cirrussearch1123.eqiad.wmn [17:12:09] et|cirrussearch1124.eqiad.wmnet|cirrussearch1125.eqiad.wmnet 
[17:13:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm [17:13:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:13:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err... [17:13:12] (03CR) 10JHathaway: [C:03+2] systemd::sysuser: don't run exec when absent [puppet] - 10https://gerrit.wikimedia.org/r/1143621 (owner: 10JHathaway) [17:16:06] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10805388 (10Sreejithk2000) Thanks for the help. File undeleted. [17:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 12.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:19:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805398 (10Jhancock.wm) @Papaul ganeti2047 tried to connect to the wrong puppetserver. failed there. [8/10, retrying in 640.00s] Attempt to run 'spicerack.puppet... 
[17:19:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:20:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:20:07] !incidents [17:20:08] 6109 (UNACKED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams) [17:20:08] 6108 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [17:20:08] 6107 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [17:20:08] 6106 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [17:20:09] 6105 (RESOLVED) db1246 (paged)/mysqld processes (paged) [17:20:13] !ack 6109 [17:20:13] 6109 (ACKED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams) [17:20:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [17:21:11] (03PS3) 10Ebernhardson: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) [17:21:17] (03PS5) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [17:21:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:21:24] (03CR) 10CI reject: [V:04-1] search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:21:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 4 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [17:21:44] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10805399 (10Pppery) 05Open→03Resolved a:03MatthewVernon [17:21:53] (03CR) 10CI reject: [V:04-1] search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:23:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [17:23:10] (03PS1) 10CDanis: esams-- [dns] - 10https://gerrit.wikimedia.org/r/1143624 [17:23:32] (03CR) 10BBlack: [C:03+1] esams-- [dns] - 10https://gerrit.wikimedia.org/r/1143624 (owner: 10CDanis) [17:23:36] (03CR) 10Ladsgroup: [C:03+1] esams-- [dns] - 10https://gerrit.wikimedia.org/r/1143624 (owner: 10CDanis) [17:23:38] (03CR) 10Hnowlan: [C:03+1] esams-- [dns] - 10https://gerrit.wikimedia.org/r/1143624 (owner: 10CDanis) [17:23:40] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723 (10thcipriani) 03NEW 
[17:23:42] (03CR) 10CDanis: [V:03+2 C:03+2] esams-- [dns] - 10https://gerrit.wikimedia.org/r/1143624 (owner: 10CDanis) [17:23:47] (03CR) 10MVernon: [C:03+1] esams-- [dns] - 10https://gerrit.wikimedia.org/r/1143624 (owner: 10CDanis) [17:23:49] !log cdanis@dns1004 START - running authdns-update [17:24:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:24:18] FIRING: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from GB) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [17:24:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:24:34] !incidents [17:24:34] 6109 (ACKED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams) [17:24:34] 6110 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [17:24:34] 6108 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [17:24:35] 6107 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [17:24:35] 6106 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [17:24:35] 6105 (RESOLVED) db1246 (paged)/mysqld processes (paged) [17:24:37] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:24:40] !ack 6110 [17:24:40] 6110 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [17:24:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:05] is that for esams again [17:25:08] !incidents [17:25:09] 6109 (ACKED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams) [17:25:09] 6110 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [17:25:09] 6108 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [17:25:09] 6107 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [17:25:09] 6106 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [17:25:10] 6105 (RESOLVED) db1246 (paged)/mysqld processes (paged) [17:25:12] !log cdanis@dns1004 END - running authdns-update [17:25:21] FIRING: [2x] PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [17:26:02] lol [17:26:09] full queues? 
ouch [17:26:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:27:14] (03PS1) 10Bernard Wang: Remove eb_ab_test_enrollment schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 [17:27:15] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724 (10thcipriani) 03NEW [17:27:25] (03PS2) 10Bernard Wang: Remove web_ab_test_enrollment schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) [17:27:32] (03PS6) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [17:28:05] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on A:cp-upload_eqsin [17:28:07] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on A:cp-text_eqsin [17:28:12] (03CR) 10CI reject: [V:04-1] search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:29:00] FIRING: [14x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:18] FIRING: [5x] 
NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [17:29:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:29:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:21] RESOLVED: [2x] PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [17:31:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb6_443 has 6 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [17:31:42] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:31:51] RESOLVED: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:31:55] FIRING: [13x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:00] jouncebot: refresh [17:32:01] I refreshed my knowledge about deployments. [17:32:07] what's up with the puppetmaster? [17:32:09] jouncebot: next [17:32:09] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7+Utc-0 Version (HOLD for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1800) [17:32:14] is anyone working on that? [17:32:47] puppetmaster got broken a few days ago iirc [17:32:58] tls material not setting SNI as expected IIRC [17:33:04] hmm ok [17:33:07] these are all from three days ago apparently [17:33:10] these alerts* [17:33:23] (03PS7) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [17:33:27] service owner will know better [17:34:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:34:18] RESOLVED: [5x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [17:34:38] those are old alerts, and as far as I can tell not directly impacting 
anything atm, so I'm tempted to throw them into the "file a task and deal with it later" pile [17:35:58] (03PS1) 10Hnowlan: mw-api-ext: bump replicas temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143629 [17:36:12] (03PS1) 10CDanis: esams++ drmrs-- [dns] - 10https://gerrit.wikimedia.org/r/1143630 [17:36:29] (03CR) 10Ssingh: [C:03+1] esams++ drmrs-- [dns] - 10https://gerrit.wikimedia.org/r/1143630 (owner: 10CDanis) [17:36:44] (03CR) 10Scott French: [C:03+1] esams++ drmrs-- [dns] - 10https://gerrit.wikimedia.org/r/1143630 (owner: 10CDanis) [17:36:49] (03CR) 10CDanis: [V:03+2 C:03+2] esams++ drmrs-- [dns] - 10https://gerrit.wikimedia.org/r/1143630 (owner: 10CDanis) [17:36:55] !log cdanis@dns1004 START - running authdns-update [17:38:19] !log cdanis@dns1004 END - running authdns-update [17:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:39:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:39:26] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:41:30] (03CR) 10Jdlrobson: [C:03+1] Remove web_ab_test_enrollment schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: 10Bernard Wang) [17:45:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:51:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10805494 (10Papaul) @Scott_French thank you [17:53:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805498 (10Papaul) @jjanhone both 47 and 48 were on the wrong puppetserver. Remove all yours ` sudo puppet cert --list Warning: `puppet cert` is deprecated and will... 
[17:58:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:00:37] jouncebot: nowandnext [18:00:38] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [18:00:38] In 0 hour(s) and 59 minute(s): MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900) [18:01:51] !log dancy@deploy1003 Installing scap version "4.162.0" for 2 host(s) [18:03:39] !log dancy@deploy1003 Installation of scap version "4.162.0" completed for 2 hosts [18:07:53] (03PS1) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) [18:08:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [18:08:55] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10805541 (10Eevans) @cmassaro so if I understand correctly, this isn't really about access per say, but a request to have your key changed? And (either way), we need to verify your ssh k... 
[18:12:03] (03PS2) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) [18:12:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [18:13:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2048.codfw.wmnet with OS bookworm [18:13:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with err... [18:18:27] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10805587 (10Eevans) @Seddon Can you post the public key on one of you user pages (meta.w.o for example) for verification purposes? 
[18:29:30] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp50[19-24].eqsin.wmnet} and A:cp [18:29:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10805607 (10ssingh) [18:30:32] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp50[27-32].eqsin.wmnet} and A:cp [18:31:41] (03PS1) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) [18:34:49] (03CR) 10Jsn.sherman: "Hi Amir, I got started with the dblist; I just checked for wikis with the interface enabled to start with and piped them into a dblist lik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [18:37:53] (03CR) 10Ssingh: [V:03+1] "If there are no objections, I would like to merge this next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [18:38:44] !log zabe@deploy1003:~$ mwscript-k8s --attach -- extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Wikimedia Foundation Board of Trustees/Call for feedback:2022 Board of Trustees election/Upcoming Call for Feedback about the Board of Trustees elections" "Wikimedia Foundation/Board of Trustees/Call for feedback:2022 Board of [18:38:44] Trustees election/Upcoming Call for Feedback about the Board of Trustees elections" "Zabe" --reason "per request [[:phab:T393619|T393619]]" [18:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:46] T393619: Request to move translatable page: Wikimedia Foundation Board of Trustees - https://phabricator.wikimedia.org/T393619 [18:38:55] meh too long [18:39:36] lol [18:42:18] !log zabe@deploy1003:~$ mwscript-k8s --attach -- extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Wikimedia Foundation Board of Trustees/Call for feedback: Board of Trustees elections" "Wikimedia Foundation/Board of Trustees/Call for feedback: Board of Trustees elections" "Zabe" --reason "per request [18:42:19] [[:phab:T393619|T393619]]" [18:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:38] !log mwscript-k8s [...]moveTranslatableBundle.php metawiki "Wikimedia Foundation Board of Trustees/Call for feedback: Board of Trustees elections" "Wikimedia Foundation/Board of Trustees/Call for feedback: Board of Trustees elections" "Zabe" --reason "per request [[:phab:T393619|T393619]]" [18:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:27] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1143606 (owner: 10Hnowlan) [18:44:46] Hello, I have been given the go ahead to start the train deploy now [18:45:50] !log move all translateable subpages of "Wikimedia Foundation Board of Trustees" to subpages of "Wikimedia Foundation/Board of Trustees" on metawiki (T393619) [18:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:52] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm [18:45:53] T393619: Request to move translatable page: Wikimedia Foundation Board of Trustees - https://phabricator.wikimedia.org/T393619 [18:45:59] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm [18:46:00] (03PS1) 10Ssingh: geo-maps: revert DE back to esams [dns] - 10https://gerrit.wikimedia.org/r/1143646 [18:46:01] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143647 (https://phabricator.wikimedia.org/T386223) [18:46:03] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143647 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [18:46:46] (03CR) 10Scott French: [C:03+1] geo-maps: revert DE back to esams [dns] - 10https://gerrit.wikimedia.org/r/1143646 (owner: 10Ssingh) [18:46:53] (03CR) 10Ssingh: [C:03+2] geo-maps: revert DE back to esams [dns] - 10https://gerrit.wikimedia.org/r/1143646 (owner: 10Ssingh) [18:46:58] !log sukhe@dns1004 START - running authdns-update [18:47:01] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143647 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [18:48:21] !log sukhe@dns1004 
END - running authdns-update [18:49:57] (03CR) 10Scott French: "Thank you! Once you re-add `interval` to the absented `periodic_job` (ugh ... mismatch in `Optional`-ness), I think this should tick all t" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [18:57:09] (03PS1) 10Herron: thanos-rule: logstash_sli_availability:bool sum by (site) [puppet] - 10https://gerrit.wikimedia.org/r/1143650 [18:59:35] (03CR) 10BCornwall: varnish: Issue and handle WMF-Uniq cookie (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [19:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900) [19:01:23] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.28 refs T386223 [19:01:26] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [19:03:17] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10805702 (10BCornwall) 05In progress→03Resolved pywikipedia.org is no longer being managed by our infra as the pywikibot project didn't express an interest in maintenance. 
[19:04:29] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1003.eqiad.wmnet with reason: host reimage [19:06:40] (03CR) 10Ebernhardson: cirrussearch: Add cluster-specific domain name as a SAN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [19:08:08] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1003.eqiad.wmnet with reason: host reimage [19:12:48] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:24:00] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:02] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [19:24:26] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:24:54] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:25:26] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733 (10RobH) 03NEW [19:25:47] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733#10805769 (10RobH) [19:26:28] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:26:57] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733#10805770 (10RobH) a:03fnegri Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving... [19:27:32] (03PS1) 10Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) [19:28:39] (03CR) 10CI reject: [V:04-1] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [19:29:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:30:01] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:31:43] (03PS1) 10Andrea Denisse: grafana: Enable dashboard sync between hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) [19:31:43] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1143660/5508/" [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [19:32:24] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:32:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:33:07] vriley@cumin1002 reimage (PID 95758) is awaiting input [19:35:28] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:36:20] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:36:52] RECOVERY - Squid on install1004 is OK: TCP OK - 0.003 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [19:37:44] (03CR) 10BCornwall: [C:03+1] grafana: Enable dashboard sync between hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [19:40:02] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [19:40:28] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:41:55] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:46] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805795 (10Eevans) [19:42:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from grafana.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=grafana.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:43:02] !incidents [19:43:02] 6111 (UNACKED) ATSBackendErrorsHigh cache_text sre (grafana.discovery.wmnet eqiad) [19:43:03] 6110 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [19:43:03] 6109 (RESOLVED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams) [19:43:03] 6108 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [19:43:03] 6107 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [19:43:03] 6106 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [19:43:04] 6105 (RESOLVED) db1246 (paged)/mysqld processes (paged) [19:43:06] !ack 6111 [19:43:07] 6111 (ACKED) ATSBackendErrorsHigh cache_text sre (grafana.discovery.wmnet eqiad) [19:43:29] investigation ongoing [19:43:36] denisse: ^ FYI [19:44:15] Thank you, I'm investigating the issue. [19:45:08] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805798 (10Eevans) @BWojtowicz-WMF can I get you t... [19:45:41] swfrench-wmf: Do you know if envoy is also used to access grafana-next? [19:46:28] If not, then I think that the issue with grafana may be related to this: https://phabricator.wikimedia.org/T393439 [19:47:35] denisse: alas, I do not know off hand. is grafana-next hosted differently? (e.g., on a different host or port?) [19:48:46] swfrench-wmf: My bad, I just realized envoy runs in the grafana host.
[19:48:52] RECOVERY - Squid on install1004 is OK: TCP OK - 0.002 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [19:49:16] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:49:18] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:49:20] (03PS2) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675 [19:49:44] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:49:54] that looks promising! [19:50:05] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805808 (10Eevans) @thcipriani Ok to add to deploy... 
[19:50:08] (03CR) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [19:51:55] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:45] !incidents [19:52:45] 6111 (ACKED) ATSBackendErrorsHigh cache_text sre (grafana.discovery.wmnet eqiad) [19:52:45] 6110 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [19:52:46] 6109 (RESOLVED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams) [19:52:46] 6108 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [19:52:46] 6107 (RESOLVED) db1246 (paged)/MariaDB Replica Lag: s2 (paged) [19:52:46] 6106 (RESOLVED) db1246 (paged)/MariaDB Replica SQL: s2 (paged) [19:52:46] 6105 (RESOLVED) db1246 (paged)/mysqld processes (paged) [19:52:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from grafana.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=grafana.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:52:58] there it is :) [19:53:07] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805823 (10Eevans) [19:53:34] jouncebot: nowandnext [19:53:34] For the next 1 hour(s) and 6 minute(s): MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900) [19:53:35] In 0 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T2000) 
[19:53:53] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805827 (10Eevans) [19:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:55:27] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805835 (10Eevans) [19:55:30] jeena: I see the train has rolled to group2. any objections if I were to sneak in a no-op scap run? (i.e., does not deploy images, just updates some bookkeeping) [19:55:55] swfrench-wmf: all good here, go ahead! [19:55:57] (03CR) 10Andrea Denisse: [C:03+2] grafana: Enable dashboard sync between hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143660 (https://phabricator.wikimedia.org/T384841) (owner: 10Andrea Denisse) [19:57:05] jeena: great, thank you! [19:57:40] (03CR) 10Scott French: [C:03+2] hieradata: switch mw-script main release to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1137496 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [19:59:38] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805870 (10thcipriani) >>! In T393595#10805807, @E... [20:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (HOLD/pushed for Habemus papam traffic) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T1900) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:02] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fdd20feac10: Failed to establish a new connection: [Errno 113 [20:03:02] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:03:25] !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to switch mw-script/main to PHP 8.1 - T391057 [20:03:28] T391057: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057 [20:03:38] !log swfrench@deploy1003 Stopping before sync operations [20:04:34] (03PS2) 10Ryan Kemper: wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) [20:04:44] (03CR) 10Scott French: [C:03+2] deployment_server: drop unsupported fallback to PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/1137497 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [20:05:02] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1708, 
active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: [20:05:02] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:05:38] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805878 (10Eevans) [20:05:42] (03CR) 10BCornwall: [C:03+1] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:06:57] (03CR) 10Ryan Kemper: [C:03+2] wdqs: route query.wd.org to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1143659 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:11:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805887 (10Eevans) @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed {L3}? And, while we're in the business of ticking boxes, can you have your manager... [20:14:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805888 (10Eevans) [20:14:57] !log T388134 Beginning cutover of query.wikidata.org from `wdqs` to `wdqs-main`. Starting to see requests increase on wdqs-main (and decrease on wdqs) as expected. Rolling change to rest of cp text hosts. 
Traffic should be fully moved over in ~20 mins [20:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:01] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [20:16:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:16:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1003.eqiad.wmnet with OS bookworm [20:16:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm completed: - apus-fe1003 (**PAS... [20:16:50] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805891 (10Eevans) [20:17:07] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805892 (10VRiley-WMF) 05Open→03Resolved This has been completed [20:17:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10805895 (10VRiley-WMF) [20:19:36] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805903 (10Eevans) @VPuffetMichel: assuming you are @Esanders manager, do we have your OK? 
[20:21:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [20:21:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [20:21:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm [20:21:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10805905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm [20:23:19] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10805911 (10Eevans) p:05Triage→03Medium [20:24:06] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10805913 (10Eevans) 05Open→03In progress [20:24:35] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805914 (10Eevans) [20:24:45] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805915 (10Eevans) 05Open→03In progress [20:24:58] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & 
ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10805916 (10Eevans) p:05Triage→03Medium [20:25:35] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805917 (10Eevans) 05Open→03In progress p:05Triage→03Medium [20:26:05] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805919 (10Eevans) 05Open→03In progress p:05Triage→03Medium [20:27:57] (03PS1) 10Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) [20:29:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:30:04] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10805923 (10thcipriani) For clarity: I filed this task as a followup to a request for [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|spiderpig access]]. `deployment` membership is curren... [20:30:34] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10805925 (10thcipriani) For clarity: I filed this task as a followup to a request for [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|spiderpig access]]. `deployment` membership is curre... 
[20:32:04] (03PS2) 10Bking: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:32:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:32:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:26] (03PS3) 10Bking: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:32:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:32:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2048.codfw.wmnet with reason: host reimage [20:33:10] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage [20:36:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2048.codfw.wmnet 
with reason: host reimage [20:36:54] (03PS1) 10Ahmon Dancy: Use buildkit wmf-v0.21.1 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1143671 (https://phabricator.wikimedia.org/T393731) [20:37:06] FIRING: SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:37:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:40:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage [20:42:06] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:45:14] (03PS4) 10Bking: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:46:14] (03PS5) 10Bking: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:49:26] (03PS6) 10Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) [20:49:44] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp50[27-32].eqsin.wmnet} and A:cp [20:51:02] (03PS7) 10Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) [20:51:11] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:52:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:52:56] (03PS8) 10Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) [20:54:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:54:06] (03PS9) 10Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) [20:54:25] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp50[19-24].eqsin.wmnet} and A:cp [20:55:20] jhancock@cumin2002 reimage (PID 3357775) is awaiting input [20:55:23] (03CR) 10Aleksandar Mastilovic: "Here it is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143134" [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [20:56:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_drmrs [20:56:43] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_drmrs [20:57:21] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:58:37] (03PS10) 10Ryan Kemper: wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) [21:00:04] Deploy window Web Team deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250508T2100) [21:00:26] jhancock@cumin2002 reimage (PID 3357420) is awaiting input [21:01:25] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T393368#10805985 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Rebalanced power. [21:03:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [21:07:07] (03CR) 10Bking: [C:03+2] wdqs-main: shift over hosts from full graph [puppet] - 10https://gerrit.wikimedia.org/r/1143670 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [21:17:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:23:03] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:23:08] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:27:58] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the 
RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:27:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:29:39] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1012.eqiad.wmnet [21:29:50] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [21:29:52] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [21:32:58] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:33:48] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2007.codfw.wmnet [21:34:03] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [21:35:01] !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 
days, 0:00:00 on db1247.eqiad.wmnet with reason: Host has crashed - T393612 [21:35:04] T393612: db1247 crash or restart - 15:29 on 2025-05-07 - https://phabricator.wikimedia.org/T393612 [21:38:40] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs[2007,2013].codfw.wmnet,wdqs[1012-1014].eqiad.wmnet with reason: bringing hosts online with a data transfer [21:40:40] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:43:20] !log T388134 Cutover completed about an hour ago. Metrics look good; we're in the process of shifting over some of the old `wdqs` hosts to `wdqs-main` to increase capacity [21:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:23] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [21:43:47] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1013.eqiad.wmnet [21:43:57] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt thanos-fe1005 - vriley@cumin1002" [21:44:07] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1014.eqiad.wmnet [21:44:16] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt thanos-fe1005 - vriley@cumin1002" [21:44:17] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:44:21] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1015.eqiad.wmnet [21:44:56] !log removing 3 files for legal compliance [21:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:01] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe1005 [21:45:43] 
10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806065 (10VRiley-WMF) [21:46:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe1005 [21:47:05] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:47:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:48:00] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2010.codfw.wmnet [21:48:08] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2011.codfw.wmnet [21:48:22] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2012.codfw.wmnet [21:48:57] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2013.codfw.wmnet [21:50:23] !log removing 1 file for legal compliance [21:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:54:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2048.codfw.wmnet with OS bookworm [21:54:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - 
https://phabricator.wikimedia.org/T384838#10806098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm completed: - gane... [21:54:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:54:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2047.codfw.wmnet with OS bookworm [21:54:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806099 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm completed: - gane... [21:55:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806102 (10Jhancock.wm) 05Open→03Resolved [21:56:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10806105 (10Jhancock.wm) @MoritzMuehlenhoff this is finally done. thanks for your patience! 
[22:05:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:07:20] (CR) Scott French: [C:+2] P:mw::maintenance::refreshlinks: rename and prepare for mw-cron [puppet] - https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[22:08:18] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:08:28] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806119 (VRiley-WMF)
[22:09:08] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:09:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:12:56] (PS4) Scott French: P:mw::maintenance::refreshlinks: migrate s8 to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530)
[22:13:54] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[22:14:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:17:34] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1005.eqiad.wmnet with OS bullseye
[22:17:43] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806137 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1005.eqiad.wmnet with OS bullseye
[22:27:55] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[22:27:59] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134
[22:28:51] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1012.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards
[22:31:55] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:25] ops-esams, SRE, DC-Ops, Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10806152 (RobH) url provided by support so i've uploaded the support collection report for their review
[22:47:34] PROBLEM - Disk space on arclamp2001 is CRITICAL: DISK CRITICAL - free space: /srv 10588 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[22:49:42] PROBLEM - Disk space on arclamp1001 is CRITICAL: DISK CRITICAL - free space: /srv 10628 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[23:06:13] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1012.eqiad.wmnet
[23:06:14] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84665MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[23:12:48] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:19:34] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2007.codfw.wmnet
[23:22:36] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2013.codfw.wmnet
[23:26:27] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1013.eqiad.wmnet
[23:30:20] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1015.eqiad.wmnet
[23:32:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:34:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:34:55] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2010.codfw.wmnet
[23:35:02] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1014.eqiad.wmnet
[23:35:09] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2011.codfw.wmnet
[23:36:55] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:48] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1005.eqiad.wmnet with OS bullseye
[23:37:56] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806241 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1005.eqiad.wmnet with OS bullseye e...
[23:37:56] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2012.codfw.wmnet
[23:37:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:39:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143693
[23:39:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143693 (owner: TrainBranchBot)
[23:39:28] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:44:28] RESOLVED: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1015:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:51:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143693 (owner: TrainBranchBot)
[23:51:55] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag