[00:02:04] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1085495
[00:02:08] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1085496
[00:02:11] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1085497
[00:02:29] (Abandoned) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1085495 (owner: Ncmonitor)
[00:02:39] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1085496 (owner: Ncmonitor)
[00:02:57] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1085497 (owner: Ncmonitor)
[00:05:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P70828 and previous config saved to /var/cache/conftool/dbconfig/20241101-000506-ladsgroup.json
[00:05:25] FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:39] (PS6) Aude: Helm chart for the chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948)
[00:20:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P70829 and previous config saved to /var/cache/conftool/dbconfig/20241101-002013-ladsgroup.json
[00:35:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T376905)', diff saved to https://phabricator.wikimedia.org/P70830 and previous config saved to /var/cache/conftool/dbconfig/20241101-003520-ladsgroup.json
[00:35:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:35:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:35:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T376905)', diff saved to https://phabricator.wikimedia.org/P70831 and previous config saved to /var/cache/conftool/dbconfig/20241101-003546-ladsgroup.json
[00:37:36] RESOLVED: ProbeDown: Service aqs1022-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1022-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1085499
[00:38:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1085499 (owner: TrainBranchBot)
[00:45:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T376905)', diff saved to https://phabricator.wikimedia.org/P70832 and previous config saved to /var/cache/conftool/dbconfig/20241101-004514-ladsgroup.json
[00:54:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-presto1019.eqiad.wmnet']
[00:54:59] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1019.eqiad.wmnet']
[01:00:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P70833 and previous config saved to /var/cache/conftool/dbconfig/20241101-010021-ladsgroup.json
[01:05:31] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1085499 (owner: TrainBranchBot)
[01:07:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1019.eqiad.wmnet with OS bullseye
[01:08:51] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1085502
[01:08:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1085502 (owner: TrainBranchBot)
[01:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10283295 (phaultfinder)
[01:14:13] (CR) TChin: Add airflow connection conf for datahub (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: Ottomata)
[01:15:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P70834 and previous config saved to /var/cache/conftool/dbconfig/20241101-011528-ladsgroup.json
[01:22:11] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1019.eqiad.wmnet with reason: host reimage
[01:25:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1019.eqiad.wmnet with reason: host reimage
[01:30:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T376905)', diff saved to https://phabricator.wikimedia.org/P70835 and previous config saved to /var/cache/conftool/dbconfig/20241101-013035-ladsgroup.json
[01:30:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[01:30:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[01:31:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T376905)', diff saved to https://phabricator.wikimedia.org/P70836 and previous config saved to /var/cache/conftool/dbconfig/20241101-013102-ladsgroup.json
[01:39:16] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for aqs1022.eqiad.wmnet
[01:39:16] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1022.eqiad.wmnet
[01:39:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T376905)', diff saved to https://phabricator.wikimedia.org/P70837 and previous config saved to /var/cache/conftool/dbconfig/20241101-013926-ladsgroup.json
[01:40:44] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T378725
[01:40:47] T378725: Refresh aqs1013 w/ aqs1022 - https://phabricator.wikimedia.org/T378725
[01:40:59] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T378725
[01:42:55] !log Decommissioning Cassandra/aqs1013-{a,b} — T378725
[01:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1085502 (owner: TrainBranchBot)
[01:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[01:54:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P70838 and previous config saved to /var/cache/conftool/dbconfig/20241101-015433-ladsgroup.json
[01:59:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1019.eqiad.wmnet with OS bullseye
[02:06:12] SRE, Wikimedia-SVG-rendering, Upstream: SVG: Gaussian blur filter effect not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#10283406 (Chealer)
[02:07:05] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/2453917cb1b8b774989b94d6a33183b06fd02dbd1a1e487615960f58c8e86cf4/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:09:08] (PS1) RLazarus: mediawiki: Support copying text files into mw-script containers [deployment-charts] - https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230)
[02:09:17] (PS1) RLazarus: deployment_server: Add --file to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1085507 (https://phabricator.wikimedia.org/T376230)
[02:09:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P70839 and previous config saved to /var/cache/conftool/dbconfig/20241101-020940-ladsgroup.json
[02:14:11] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:24:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T376905)', diff saved to https://phabricator.wikimedia.org/P70840 and previous config saved to /var/cache/conftool/dbconfig/20241101-022447-ladsgroup.json
[02:24:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[02:25:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[02:27:05] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:37:36] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:36] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:25] RESOLVED: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:29:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:39:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:39:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241101T0600)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241101T0700)
[07:34:40] (PS1) Slyngshede: P:firewall remove Icinga conntrack check [puppet] - https://gerrit.wikimedia.org/r/1085515 (https://phabricator.wikimedia.org/T374827)
[08:31:08] ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800 (MatthewVernon) NEW
[08:31:18] ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800#10283593 (MatthewVernon) p:Triage→High
[09:04:17] (PS2) Slyngshede: Start migrating Netbox alerts from Icinga. [alerts] - https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694)
[09:17:29] (CR) Slyngshede: Start migrating Netbox alerts from Icinga. (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:44:36] (PS1) Ssingh: magru: set check_min_fe_mem false [puppet] - https://gerrit.wikimedia.org/r/1085569
[09:45:31] (CR) Ssingh: [C:+2] magru: set check_min_fe_mem false [puppet] - https://gerrit.wikimedia.org/r/1085569 (owner: Ssingh)
[09:46:41] !log sudo cumin -b4 "A:cp and A:magru" "run-puppet-agent" to pick up CR 1085569
[09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[10:12:23] (CR) Slyngshede: [C:+2] Disable LDAPPasswordValidator. [software/bitu] - https://gerrit.wikimedia.org/r/1084777 (owner: Slyngshede)
[10:15:04] (Merged) jenkins-bot: Disable LDAPPasswordValidator. [software/bitu] - https://gerrit.wikimedia.org/r/1084777 (owner: Slyngshede)
[10:15:51] (PS1) Kevin Bazira: ml-services: update article-country image in experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1085570 (https://phabricator.wikimedia.org/T371897)
[10:19:36] (CR) Ilias Sarantopoulos: [C:+2] ml-services: update article-country image in experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1085570 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[10:20:40] (Merged) jenkins-bot: ml-services: update article-country image in experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1085570 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[10:38:17] (PS1) Hamish: Cleanup for logo related file [mediawiki-config] - https://gerrit.wikimedia.org/r/1085572
[10:38:31] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:38:55] (CR) CI reject: [V:-1] Cleanup for logo related file [mediawiki-config] - https://gerrit.wikimedia.org/r/1085572 (owner: Hamish)
[10:46:03] PROBLEM - Host logstash1024 is DOWN: PING CRITICAL - Packet loss = 100%
[10:46:05] PROBLEM - Host moscovium is DOWN: PING CRITICAL - Packet loss = 100%
[10:47:36] FIRING: [2x] ProbeDown: Service logstash1024:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1024:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:51:50] (PS2) Hamish: Cleanup for logo related file [mediawiki-config] - https://gerrit.wikimedia.org/r/1085572
[10:52:36] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:53:14] (PS3) Hamish: Cleanup for logo related file [mediawiki-config] - https://gerrit.wikimedia.org/r/1085572
[10:55:23] FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:56:46] (Abandoned) Kosta Harlan: [WIP] mediamoderation: Add one-off job for processing the Commons backlog [puppet] - https://gerrit.wikimedia.org/r/1040150 (owner: Kosta Harlan)
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241101T0700)
[11:00:05] eoghan, jelto, arnoldokoth, and mutante: May I have your attention please! GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241101T1100)
[11:53:24] SRE-swift-storage, Data-Persistence, DC-Ops, Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10283919 (MatthewVernon) Thanks for the update @wiki_willy! Do I understand correctly from this that if we want to use these...
[12:03:35] (PS1) Hnowlan: admin_ng: set a very high quota for shellbox-video [deployment-charts] - https://gerrit.wikimedia.org/r/1085579 (https://phabricator.wikimedia.org/T356241)
[12:18:56] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:28:14] !log rebooting ganeti1025 as VMs are unresponsive and will not shutdown or move
[12:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:18] !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[12:37:36] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:39:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[12:42:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet
[12:42:36] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:42:36] FIRING: [3x] ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:43:25] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[12:43:30] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1025.eqiad.wmnet
[12:43:42] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[12:43:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet
[12:46:49] RECOVERY - Host logstash1024 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:47:21] RECOVERY - Host moscovium is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[12:48:56] RESOLVED: [3x] ProbeDown: Service ganeti1025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[12:50:23] RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:05:47] SRE, Infrastructure-Foundations, netops: ganeti1025 VMs unresponsive Nov 1 2024 - https://phabricator.wikimedia.org/T378809 (cmooney) NEW p:Triage→Medium
[13:20:51] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2190 gradually with 4 steps - Maint over
[13:33:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:34:03] SRE, Infrastructure-Foundations, netops: ganeti1025 VMs unresponsive Nov 1 2024 - https://phabricator.wikimedia.org/T378809#10284072 (cmooney)
[13:34:41] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10284086 (elukey)
[13:35:20] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10284087 (elukey) Open→Resolved Supermicro sent a new license for 1044 that worked, and I've ran successfully the provision cookbook.
[13:36:30] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10284091 (elukey) @VRiley-WMF @Jclark-ctr I fixed the issue with ganeti1044 and ran provision, all good! The rest of the nodes should be fine as well :)
[13:38:59] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:43:11] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:43:22] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:45:00] SRE, Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#10284099 (cmooney) We seen this (or at least something similar) today, see T378809.
[13:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[13:55:05] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bookworm
[13:57:16] SRE, Infrastructure-Foundations: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye - https://phabricator.wikimedia.org/T348730#10284140 (CDanis)
[14:05:51] (PS1) Cathal Mooney: Only try to find 'real' netmask for IPs if they are /32 or /128 [software/netbox-extras] - https://gerrit.wikimedia.org/r/1085590 (https://phabricator.wikimedia.org/T378751)
[14:06:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2190 gradually with 4 steps - Maint over
[14:11:15] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 614300536 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:12:15] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 42784 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:17:25] SRE-swift-storage, Data-Persistence, DC-Ops, Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10284181 (jhathaway) @MatthewVernon, during a sprint week myself and @ayounsi worked on adding EFI booting support. We are pr...
[14:27:19] SRE, Infrastructure-Foundations, netops: ganeti1025 VMs unresponsive Nov 1 2024 - https://phabricator.wikimedia.org/T378809#10284231 (CDanis) I'm pretty confident this is the same as T348730, and I think it would be okay to return ganeti1025 to service and close this task as a dup
[14:27:34] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host an-presto1020.eqiad.wmnet with OS bookworm
[14:29:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bullseye
[14:29:43] (PS1) Máté Szabó: Exclude temp account viewer autopromotions from RC [mediawiki-config] - https://gerrit.wikimedia.org/r/1085593 (https://phabricator.wikimedia.org/T377829)
[14:35:12] ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800#10284248 (Jhancock.wm) Open→Resolved a:Jhancock.wm new disk installed. let us know if you need further help. ty for the blink!
[14:35:13] SRE, Infrastructure-Foundations, netops: ganeti1025 VMs unresponsive Nov 1 2024 - https://phabricator.wikimedia.org/T378809#10284244 (cmooney) Open→Resolved a:cmooney >>! In T378809#10284231, @CDanis wrote: > I'm pretty confident this is the same as T348730, and I think it would be ok...
[14:35:25] ops-codfw, SRE, DC-Ops, serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10284251 (Jhancock.wm) I see the disk is still not blinking. does this still need attention?
[14:37:36] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:06] (CR) Tiziano Fogli: [C:+1] "LGTM, thank you." [alerts] - https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[14:40:53] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1020.eqiad.wmnet with OS bullseye
[14:54:04] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bullseye
[14:56:53] (PS1) Cwhite: phatality: restart opensearch-dashboards after plugin install [puppet] - https://gerrit.wikimedia.org/r/1085595 (https://phabricator.wikimedia.org/T342476)
[14:57:17] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10284297 (Jhancock.wm) In progress→Resolved a:Papaul→Jhancock.wm upgraded to 128Gb. powering up now. Please let us know if you need a...
[15:02:36] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet
[15:05:52] ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800#10284306 (ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: blkid issues
[15:18:15] (PS1) Thcipriani: Revert "Dummy commit for testing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1085597
[15:19:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2003.codfw.wmnet
[15:23:04] (PS1) Scott French: shellbox: add optional .spec.strategy override [deployment-charts] - https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241)
[15:23:42] ops-codfw, SRE, SRE-swift-storage, DC-Ops: disk (sde) failed on thanos-be2003 - https://phabricator.wikimedia.org/T378800#10284321 (MatthewVernon) Hm, the new disk has installed OK (and the fs is filling up), but xfs_admin operations are hanging on the drive. I'm not sure if that's just the F...
[15:23:59] (CR) Dreamy Jazz: Exclude temp account viewer autopromotions from RC (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1085593 (https://phabricator.wikimedia.org/T377829) (owner: Máté Szabó)
[15:26:35] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[15:46:43] ops-codfw, SRE, DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10284384 (Jhancock.wm) a:Jhancock.wm
[15:47:13] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[15:48:14] ops-codfw, SRE, DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10284382 (Jhancock.wm) i found this on supermicros forum. > IERR is a Processor Internal Error, a signal that indicates a Processor unrecoverable error or even a non-CP...
[15:51:13] (PS1) CDanis: deployment group: add aude [puppet] - https://gerrit.wikimedia.org/r/1085607 (https://phabricator.wikimedia.org/T372081)
[15:51:55] (CR) Hnowlan: [C:+1] shellbox: add optional .spec.strategy override [deployment-charts] - https://gerrit.wikimedia.org/r/1085598 (https://phabricator.wikimedia.org/T356241) (owner: Scott French)
[15:55:08] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1020.eqiad.wmnet with OS bullseye
[15:55:25] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1020.eqiad.wmnet']
[15:56:28] (CR) TrainBranchBot: [C:+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1085597 (owner: Thcipriani)
[15:57:09] (Merged) jenkins-bot: Revert "Dummy commit for testing" [mediawiki-config] - https://gerrit.wikimedia.org/r/1085597 (owner: Thcipriani)
[15:57:28] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1085597|Revert "Dummy commit for testing"]]
[16:00:05] !log thcipriani@deploy2002 thcipriani: Backport for [[gerrit:1085597|Revert "Dummy commit for testing"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:00:26] !log thcipriani@deploy2002 thcipriani: Continuing with sync
[16:02:17] (CR) CDanis: [C:+2] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) (owner: Aude)
[16:03:09] (Merged) jenkins-bot: Helm chart for the chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1085467 (https://phabricator.wikimedia.org/T376948) (owner: Aude)
[16:05:14] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085597|Revert "Dummy commit for testing"]] (duration: 07m 46s)
[16:05:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-presto1020.eqiad.wmnet']
[16:13:46] (CR) Thcipriani: [C:+1] deployment group: add aude [puppet] - https://gerrit.wikimedia.org/r/1085607 (https://phabricator.wikimedia.org/T372081) (owner: CDanis)
[16:13:56] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:16:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 16:00:00 on db2239.codfw.wmnet with reason: not yet in production
[16:16:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 16:00:00 on db2239.codfw.wmnet with reason: not yet in production
[16:17:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 16:00:00 on thanos-be2003.codfw.wmnet with reason: give it time for sde1 fs to backfill
[16:17:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 16:00:00 on thanos-be2003.codfw.wmnet with reason: give it time for sde1 fs to backfill
[16:18:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1020.eqiad.wmnet with OS bullseye
[16:33:03] (CR) Ottomata: [V:+1] Add airflow connection conf for datahub (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: Ottomata)
[16:33:08] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1020.eqiad.wmnet with reason: host reimage
[16:34:16] (PS1) MVernon: service::catalog: mark apus service as paging [puppet] - https://gerrit.wikimedia.org/r/1085617 (https://phabricator.wikimedia.org/T279621)
[16:35:01] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10284510 (phaultfinder)
[16:36:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1020.eqiad.wmnet with reason: host reimage
[16:36:38] (CR) MVernon: [C:+1] Deprecate system::role for Swift roles [puppet] - https://gerrit.wikimedia.org/r/1083158 (owner: Muehlenhoff)
[16:38:58] (CR) Scott French: "Thanks, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) (owner: RLazarus)
[16:41:59] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10284528 (Papaul) Last update from Supermicro is if the BIOS is set to UEFI mode even after replacing a disk in JBOD, the system should be able to add t...
[16:49:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10284558 (10phaultfinder) [16:51:23] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825 (10RobH) 03NEW [16:51:34] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10284584 (10RobH) [16:53:31] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10284594 (10RobH) a:03Andrew @andrew, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-... [16:54:23] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10284607 (10RobH) [16:58:02] (03CR) 10Scott French: "Thanks,Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1085507 (https://phabricator.wikimedia.org/T376230) (owner: 10RLazarus) [17:00:42] !log Ran `/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/all.dblist extensions/WikimediaEvents/maintenance/UpdatePeriodicMetrics.php --verbose` [17:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:48] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - ms-be1066 - https://phabricator.wikimedia.org/T378692#10284703 (10Dzahn) [17:07:24] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - ms-be1066 - https://phabricator.wikimedia.org/T378692#10284720 (10VRiley-WMF) 05Open→03Resolved [17:08:20] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - ms-be1066 - https://phabricator.wikimedia.org/T378692#10284719 (10VRiley-WMF) Reseated power supply. Everything is nominal. 
[17:14:08] (03PS1) 10Dreamy Jazz: Schedule daily runs of WikimediaEvents UpdatePeriodicMetrics.php [puppet] - 10https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) [17:14:37] (03PS2) 10Dreamy Jazz: Schedule daily runs of WikimediaEvents UpdatePeriodicMetrics.php [puppet] - 10https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) [17:21:08] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) (owner: 10Dreamy Jazz) [17:24:31] (03CR) 10RLazarus: mediawiki: Support copying text files into mw-script containers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) (owner: 10RLazarus) [17:38:30] (03PS2) 10RLazarus: mediawiki: Support copying text files into mw-script containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) [17:39:21] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10284858 (10Eevans) I've registered `cortobot` on liberachat. It's currently registered against my wikimedia email address, suggestions for something more generally available welcome (root@wikimedi... 
[17:39:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10284859 (10phaultfinder) [17:50:29] (03CR) 10Scott French: mediawiki: Support copying text files into mw-script containers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) (owner: 10RLazarus) [17:51:18] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10284914 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced pdu [17:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [18:04:10] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1020.eqiad.wmnet with OS bullseye [18:06:53] !log dancy@deploy2002 Installing scap version "4.120.0" for 1 hosts [18:06:55] (03CR) 10Jdrewniak: [C:03+2] Enable Chart progressive enhancement on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085492 (https://phabricator.wikimedia.org/T378206) (owner: 10Jdlrobson) [18:07:28] !log bking@cumin2002 START - Cookbook sre.puppet.renew-cert for an-presto1020.eqiad.wmnet: Renew puppet certificate - bking@cumin2002 [18:07:46] !log dancy@deploy2002 Installation of scap version "4.120.0" completed for 1 hosts [18:07:59] (03Merged) 10jenkins-bot: Enable Chart progressive enhancement on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085492 (https://phabricator.wikimedia.org/T378206) (owner: 10Jdlrobson) [18:09:25] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for an-presto1020.eqiad.wmnet: Renew puppet certificate - bking@cumin2002 [18:10:47] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['an-presto1018.eqiad.wmnet'] [18:11:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-presto1018.eqiad.wmnet'] [18:18:02] 06SRE-OnFire, 10Incident Tooling: Corto: ensure Phabricator tasks are created with correct default visibility & priority - https://phabricator.wikimedia.org/T376500#10285082 (10Eevans) @Aklapper would you be the right person to ask about access/visibility? We want our nascent irc bot to create new phab incide... [18:19:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:21:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:21:28] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:06] (03PS1) 10Aude: Update my (Aude) ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1085631 [18:25:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1039.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:25:52] (03CR) 10Aude: [C:03+1] "looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1085607 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [18:26:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:29:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:29:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:29:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:29:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:30:22] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:31:54] (03CR) 10CDanis: [C:03+2] Update my (Aude) ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1085631 (owner: 10Aude) [18:32:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1040.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:32:07] (03CR) 10CDanis: [C:03+2] deployment group: add aude [puppet] - 10https://gerrit.wikimedia.org/r/1085607 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [18:33:23] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:33:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:33:33] !log vriley@cumin1002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:34:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:34:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1042.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:34:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1043.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:35:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:35:15] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:35:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:35:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:38:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:38:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:39:09] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:39:12] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:39:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART 
[18:40:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:40:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:41:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:41:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:42:18] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:42:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:43:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:44:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:44:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:44:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:46:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:46:09] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:46:25] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host 
ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:46:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:47:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:49:24] (03PS1) 10Aude: Create releases for chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) [18:50:03] (03CR) 10Aude: [C:04-2] "need to update to the correct docker image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085633 (https://phabricator.wikimedia.org/T376948) (owner: 10Aude) [18:51:40] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:51:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:51:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:56:30] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1017.eqiad.wmnet'] [18:56:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-presto1017.eqiad.wmnet'] [18:56:59] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1016.eqiad.wmnet'] [19:02:50] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-presto1017.eqiad.wmnet'] [19:07:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-presto1016.eqiad.wmnet'] 
[19:09:06] PROBLEM - Host an-presto1017 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-presto1017.eqiad.wmnet'] [19:12:50] RECOVERY - Host an-presto1017 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [19:16:21] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye [19:31:11] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1016.eqiad.wmnet with reason: host reimage [19:34:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1016.eqiad.wmnet with reason: host reimage [19:47:14] !log bking@an-presto[1016:1020].eqiad.wmnet temporarily install perccli to check disk status without requiring reboot T374924 [19:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:30] T374924: Bring an-presto10[16-20] into service to replace an-presto100[1-5] - https://phabricator.wikimedia.org/T374924 [19:48:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:48:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:24] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10285435 (10lmata) >>! In T378650#10284858, @Eevans wrote: > I've registered `cortobot` on liberachat. It's currently registered against my wikimedia email address, suggestions for something more g... 
[20:07:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10285445 (10VRiley-WMF) [20:14:43] PROBLEM - Host an-presto1017 is DOWN: PING CRITICAL - Packet loss = 100% [20:16:51] RECOVERY - Host an-presto1017 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [20:17:36] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:33] (03PS1) 10Dzahn: wikistats: use the 'extinfo' update on every update run [puppet] - 10https://gerrit.wikimedia.org/r/1085643 (https://phabricator.wikimedia.org/T317241) [20:27:18] (03CR) 10Dzahn: [C:03+2] wikistats: use the 'extinfo' update on every update run [puppet] - 10https://gerrit.wikimedia.org/r/1085643 (https://phabricator.wikimedia.org/T317241) (owner: 10Dzahn) [20:27:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1016.eqiad.wmnet with OS bullseye [20:48:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:48:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:59] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854 (10bking) 03NEW [20:56:30] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - 
https://phabricator.wikimedia.org/T378854#10285577 (10bking) [20:56:35] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10285578 (10bking) [21:15:05] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10285611 (10Eevans) > [NOTICE] root@wikimedia.org has too many accounts registered. Whelp. [21:17:02] (03PS3) 10Cwhite: opensearch_dashboards: package provider must remove before install [puppet] - 10https://gerrit.wikimedia.org/r/1085486 (https://phabricator.wikimedia.org/T342476) [21:18:03] (03PS2) 10Cwhite: phatality: restart opensearch-dashboards after plugin install [puppet] - 10https://gerrit.wikimedia.org/r/1085595 (https://phabricator.wikimedia.org/T342476) [21:20:37] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10285626 (10Eevans) >>! In T378650#10285435, @lmata wrote: >>>! In T378650#10284858, @Eevans wrote: >> I've registered `cortobot` on liberachat. It's currently registered against my wikimedia email... [21:40:29] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10285667 (10lmata) >>! In T378650#10285611, @Eevans wrote: >> [NOTICE] root@wikimedia.org has too many accounts registered. > > Whelp. Boo >>! In T378650#10285626, @Eevans wrote: >>>! In T37865... [21:45:23] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10285674 (10Eevans) >>! In T370786#10169691, @Eevans wrote: >>>! In T370786#10023319, @hnowlan wrote: >> One of the big challenges I can see here is the use of compound words - currently we use la... 
[21:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [23:15:21] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10285773 (10Eevans) >>! In T378650#10285667, @lmata wrote: >>>! In T378650#10285611, @Eevans wrote: >>> [ ... ] >>>>>! In T378650#10284858, @Eevans wrote: >>>> I've registered `cortobot` on liberach...