[00:04:45] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [00:04:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [00:05:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [00:05:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [00:11:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Papaul) @Jclark-ctr when you are next on site can you please replace the DAC cable connecting cloudvirtlocal1001 to the switch? 
Thanks [00:17:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [00:18:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [00:39:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908663 [00:39:27] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908663 (owner: 10TrainBranchBot) [00:57:20] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/908663 (owner: 10TrainBranchBot) [01:01:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [01:02:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [01:07:48] !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [01:07:59] !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 10s) [01:08:07] (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:30] fab@deploy2002: Failed to log message to wiki. Somebody should check the error logs. 
[01:09:07] here, looking [01:09:20] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2371.codfw.wmnet, mw2393.codfw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw2361.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, m [01:09:20] fw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw2274.codfw.wmnet, mw2305.codfw.wmnet, mw2337.codfw.wmnet, mw2307.codfw.wmnet, mw2380.codfw.wmnet, mw2383.codfw.wmnet, mw2336.codfw.wmnet, mw2414.codfw.wmnet, mw2268.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:10:00] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2365.codfw.wmnet, mw2373.codfw.wmnet, mw2393.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw2371.codfw.wmnet, mw2274.codfw.wmnet, mw2438.codfw.wmnet, mw2414.codfw.wmnet, mw2305.codfw.wmnet, mw2337.codfw.wmnet, mw2383.codfw.wmnet, m [01:10:00] fw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmnet, mw2380.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw2336.codfw.wmnet, mw2361.codfw.wmnet, mw2269.codfw.wmnet, mw2268.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:10:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:10:56] RECOVERY - PyBal backends health 
check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:11:16] (MediaWikiLatencyExceeded) firing: (3) Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:11:36] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: ... [01:12:16] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:12:38] (ProbeDown) firing: (15) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:13:07] (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:14:03] o/ [01:14:21] seems to have all resolved now [01:14:30] we had a big spike in reads to s6 starting at 01:04 -- that eventually led to DB errors and appserver worker saturation [01:14:47] ok thanks. any action required on our end then? 
[01:15:05] still working on what those s6 reads were about -- doesn't look like it was driven by a traffic spike afaict [01:15:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:16:16] (MediaWikiLatencyExceeded) resolved: (3) Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [01:17:16] codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:19:08] I don't think anything is still broken, I just don't get what happened yet or whether it will happen again [01:20:01] that's getting quipped [01:37:27] !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [01:37:38] !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 11s) [01:45:48] There's definitely *something* still going on -- I got an error page accessing wikivoyage, and it was very slow to fetch the page on a reload. [01:46:01] I was about to say, it's on the upswing again [01:46:03] might page shortly [01:46:30] "Original error: upstream connect error or disconnect/reset before headers. 
reset reason: overflow" [01:47:02] was also slow again here, now fine [01:47:43] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [01:47:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [01:48:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:48:43] hello again [01:48:58] this one seems more severe, or just more visible [01:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:52:43] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [01:52:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable 
[01:53:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:10:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:30:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:28] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:47:14] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:55:34] PROBLEM - PHP7 rendering on parse2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[03:05:18] PROBLEM - Disk space on testreduce1001 is CRITICAL: DISK CRITICAL - free space: /srv/data 1783 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops [03:21:02] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann [03:21:02] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:22:36] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:24:34] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [03:26:08] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:41:36] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [03:43:08] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:45:50] RECOVERY - PHP7 rendering on parse2016 is OK: HTTP OK: HTTP/1.1 302 Found - 521 bytes in 8.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:51:46] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28 [03:51:46] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:53:16] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.16.80:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.80:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_% [03:53:16] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:53:16] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:54:48] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:59:32] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: deployment of ores-legacy app in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [04:04:22] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:06:38] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:22:10] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:22:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [04:22:50] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:23:15] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [04:27:02] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:27:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:35:48] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:37:26] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:41:36] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - 
https://phabricator.wikimedia.org/T332024 (10Krinkle) (I'm responding here in response to an email to the Performance Team.) This is an exciting project to see happen. We love measuring stuff and are happ... [04:45:32] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:52:58] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:55:18] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:57:52] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:12:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:56:38] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230414T0600) [06:00:48] (03PS1) 10Marostegui: mariadb: Remove db1107 from puppet [puppet] - 
10https://gerrit.wikimedia.org/r/908684 (https://phabricator.wikimedia.org/T334447) [06:01:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1107.eqiad.wmnet [06:06:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:06:30] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:07:01] (03CR) 10Ayounsi: [C: 03+2] Remove option to disable vcp_snmp_statistics [homer/public] - 10https://gerrit.wikimedia.org/r/904177 (owner: 10Ayounsi) [06:07:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db1107 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/908684 (https://phabricator.wikimedia.org/T334447) (owner: 10Marostegui) [06:07:39] (03Merged) 10jenkins-bot: Remove option to disable vcp_snmp_statistics [homer/public] - 10https://gerrit.wikimedia.org/r/904177 (owner: 10Ayounsi) [06:08:22] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1107.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:09:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1107.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:09:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:09:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1107.eqiad.wmnet [06:10:34] 10ops-eqiad, 
10decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (10Marostegui) This is ready for DC-Ops [06:11:15] 10ops-eqiad, 10decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (10Marostegui) a:05Marostegui→03Jclark-ctr [06:11:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:11:20] 10ops-eqiad, 10decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (10Marostegui) [06:11:58] (03PS1) 10Marostegui: install_server: Do not reimage db1225 [puppet] - 10https://gerrit.wikimedia.org/r/908685 [06:12:47] (03Abandoned) 10Ayounsi: cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - 10https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) (owner: 10Ayounsi) [06:16:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:16:48] PROBLEM - PHP7 rendering on parse2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:17:58] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_% [06:17:58] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:18:38] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1225 [puppet] - 10https://gerrit.wikimedia.org/r/908685 (owner: 10Marostegui) [06:19:22] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:20:36] (03PS1) 10Marostegui: site.pp: Add db1217 [puppet] - 10https://gerrit.wikimedia.org/r/908687 [06:20:55] (03CR) 10Ayounsi: [C: 03+1] "code and diff lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [06:21:03] (03CR) 10Marostegui: [C: 03+2] site.pp: Add db1217 [puppet] - 10https://gerrit.wikimedia.org/r/908687 (owner: 10Marostegui) [06:21:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10ayounsi) a:05ayounsi→03cmooney [06:25:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 T329352', diff saved to https://phabricator.wikimedia.org/P46679 and previous config saved to /var/cache/conftool/dbconfig/20230414-062553-marostegui.json [06:25:59] T329352: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352 [06:27:30] (03PS1) 10Marostegui: db1100: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908689 (https://phabricator.wikimedia.org/T329352) [06:27:32] (03PS1) 10Slyngshede: C:httpd move auto restart to class [puppet] - 
10https://gerrit.wikimedia.org/r/908690 [06:28:06] (03CR) 10Marostegui: [C: 03+2] db1100: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908689 (https://phabricator.wikimedia.org/T329352) (owner: 10Marostegui) [06:29:39] (03CR) 10CI reject: [V: 04-1] C:httpd move auto restart to class [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede) [06:30:06] 10SRE, 10Infrastructure-Foundations, 10netops: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (10cmooney) 05Open→03Resolved [06:30:13] RECOVERY - PHP7 rendering on parse2016 is OK: HTTP OK: HTTP/1.1 302 Found - 521 bytes in 8.938 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:30:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:18] (03CR) 10Stevemunene: [C: 03+2] Add referer_name field to druid pageviews hourly and daily tables turnilo [puppet] - 10https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224) (owner: 10Snwachukwu) [06:33:21] (03CR) 10Elukey: "Left two nits, the rest looks good! 
I'll create the new namespace and puppet configs :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [06:38:07] (03PS1) 10Marostegui: install_server: Do not reimage db1211 [puppet] - 10https://gerrit.wikimedia.org/r/908691 [06:38:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:38:45] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1211 [puppet] - 10https://gerrit.wikimedia.org/r/908691 (owner: 10Marostegui) [06:39:53] PROBLEM - PHP7 rendering on parse2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:39:58] (03CR) 10Elukey: Remove extra check on webrequest _SUCCESS files on HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [06:41:07] (03CR) 10Elukey: Prepare removal of systemd_timer check_webrequest_partitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [06:48:42] (03CR) 10Ayounsi: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [06:49:02] (03PS2) 10Slyngshede: C:httpd move auto restart to class [puppet] - 10https://gerrit.wikimedia.org/r/908690 [06:49:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (once NDA confirmed by Legal)" [puppet] - 10https://gerrit.wikimedia.org/r/908622 (https://phabricator.wikimedia.org/T333884) (owner: 10Dzahn) [06:51:04] (03CR) 10Muehlenhoff: "Not all use cases of the httpd class have auto restart enabled, e.g. 
on mw* we explicitly don't do it (but rather with a cookbook)." [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede) [06:51:49] (03CR) 10Muehlenhoff: "But we could add a parameter to the class to enable it." [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede) [06:52:30] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40677/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede) [06:52:47] (03Abandoned) 10Slyngshede: C:httpd move auto restart to class [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede) [06:55:57] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ayounsi) a:05ayounsi→03cmooney [06:58:03] (03PS3) 10Aqu: analytics: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) [06:58:26] (03PS2) 10Aqu: analytics: Prepare removal of systemd_timer check_webrequest_partitions [puppet] - 10https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) [06:59:18] (03CR) 10Aqu: "Thanks for the check Elukey" [puppet] - 10https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230414T0700) [07:03:57] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) a:03cmooney [07:04:12] RECOVERY - PHP7 rendering on parse2016 is OK: HTTP OK: HTTP/1.1 302 Found - 521 bytes in 8.963 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:05:07] (ProbeDown) firing: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:10:07] (ProbeDown) resolved: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:11:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:17:16] (MediaWikiLatencyExceeded) resolved: Average latency 
high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:26:20] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:27:28] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete pinning after recent toolsdb migration [puppet] - 10https://gerrit.wikimedia.org/r/907717 (owner: 10Muehlenhoff) [07:28:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff) [07:31:10] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:34:26] (03PS1) 10Slyngshede: Sphinx: Start work on documentation [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 [07:35:14] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Enable esitest on text@eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/908569 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [07:36:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:36:48] (03PS2) 10Slyngshede: Password reset - Allow users to request a password reset. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/900277 [07:39:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [07:39:39] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [07:40:14] (03PS6) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 [07:41:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:43:17] (03Abandoned) 10Slyngshede: Access Requests, allow users to request more permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/870747 (owner: 10Slyngshede) [07:44:30] (03PS7) 10Slyngshede: Read systems and approval rules from YAML file. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [07:45:46] (03CR) 10Filippo Giunchedi: dcops: add netdev duplex and speed checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [07:50:26] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:51:25] (03PS3) 10David Caro: build: add helper scripts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 [07:51:31] (03CR) 10David Caro: build: add helper scripts (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [07:54:47] (03CR) 10Slyngshede: [C: 03+2] LDAP attribute editor [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [07:54:49] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] LDAP attribute editor [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [07:55:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:55:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [07:55:25] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... 
[07:56:04] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:56:50] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:00:45] (03PS1) 10Muehlenhoff: Fix test for installing the puppet5 component [puppet] - 10https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495) [08:01:09] (03CR) 10CI reject: [V: 04-1] Fix test for installing the puppet5 component [puppet] - 10https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [08:01:14] (03CR) 10Filippo Giunchedi: "LGTM to my untrained eye!" [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [08:07:24] (03PS1) 10Filippo Giunchedi: webperf: fix puppet on arclamp* [puppet] - 10https://gerrit.wikimedia.org/r/908774 (https://phabricator.wikimedia.org/T334577) [08:07:32] slyngs: FYI ^ [08:08:50] (03PS2) 10Muehlenhoff: Fix test for installing the puppet5 component [puppet] - 10https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495) [08:10:25] (03PS1) 10Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) [08:10:48] (03CR) 10CI reject: [V: 04-1] analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: 10Aqu) [08:11:40] (03CR) 10Slyngshede: [C: 03+1] "Looks good. Surprised that Puppet didn't complain earlier." 
[puppet] - 10https://gerrit.wikimedia.org/r/908774 (https://phabricator.wikimedia.org/T334577) (owner: 10Filippo Giunchedi) [08:12:57] (03PS2) 10Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) [08:13:18] slyngs: thank you for the quick review! puppet did complain btw for arclamp hosts [08:13:23] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: fix puppet on arclamp* [puppet] - 10https://gerrit.wikimedia.org/r/908774 (https://phabricator.wikimedia.org/T334577) (owner: 10Filippo Giunchedi) [08:13:30] (03CR) 10Muehlenhoff: [C: 03+2] Fix test for installing the puppet5 component [puppet] - 10https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [08:13:44] godog: Ah, in that case I completely understand Puppet :-) [08:13:59] hehehe [08:14:02] moritzm: merged your change too [08:14:05] ack, thx [08:18:29] (03PS3) 10Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) [08:21:21] !log aborrero@apt2001:~ $ sudo -i reprepro --noskipold --component thirdparty/kubeadm-k8s-1-23 update buster-wikimedia (T298005) [08:21:23] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10fgiunchedi) [08:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:26] T298005: Upgrade Toolforge Kubernetes to version 1.23 - https://phabricator.wikimedia.org/T298005 [08:21:29] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [08:21:40] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:22:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [08:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:23:18] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:23:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [08:28:04] (03CR) 10JMeybohm: thumbor: make tmp-dir configurable, default disabled (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [08:28:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:31:35] (03CR) 10JMeybohm: thumbor: make tmp-dir configurable, default disabled (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [08:35:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [08:35:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [08:36:02] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [08:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:38:22] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:39:49] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech, 10wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10ItamarWMDE) [08:43:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:44:58] (03CR) 10Kamila Součková: [C: 03+1] rest-gateway: support for proton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [08:45:43] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Collect per-cgroup cpu/mem and 
other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [08:51:08] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:51:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [08:51:31] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [08:52:44] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:56:54] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:59:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:02:14] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) 05Open→03Resolved a:03Tgr La... 
[09:02:56] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40678/console" [puppet] - https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: Aqu) [09:03:05] (CR) Elukey: [V: +1 C: +2] analytics: Prepare removal of systemd_timer check_webrequest_partitions [puppet] - https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: Aqu) [09:04:04] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:05:41] (PS4) Jbond: dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [09:06:14] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert) [09:07:17] (CR) CI reject: [V: -1] dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond) [09:08:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:24] (PS5) Jbond: dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [09:08:54] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:12:18] parse2016 is kinda hammered [09:12:23] But it's not "down" down. [09:12:34] 72 load avg tho [09:12:35] well...
>5s for a request is down :) [09:12:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:01] vgutierrez: Oh I agree [09:13:03] parsoid has 9 nodes of 20 depooled in codfw [09:13:08] But I don't really see what I can do about it [09:13:16] can we pooled some? [09:13:17] *pool [09:13:34] Hmm why the heck are they depooled is the question >_> [09:15:25] Ok I don't see anything relevant [09:15:29] (in SAL [09:15:31] ) [09:15:43] yep.. I'm failing to find anything regarding the depooled servers [09:15:47] So I'd say yeah, we can repool them [09:16:16] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2002.codfw.wmnet with reason: systemd package upgrade [09:16:31] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2002.codfw.wmnet with reason: systemd package upgrade [09:16:32] They should still have gotten scap deployments but I'll run a pull on them just to be sure, and repool them [09:17:06] last reference to parsoid being depooled in codfw seems to be T327925 [09:17:07] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [09:17:33] (and that got repooled the very same day apparently) [09:17:35] yeah but it's not even those servers [09:17:49] well.. 
dc=codfw,cluster=parsoid [09:18:29] yeah but it's not even dc=codfw,cluster=parsoid - the servers in the task [09:18:32] It's a mish mash [09:19:18] dc=codfw,cluster=parsoid was the selector logged by conftool [09:21:01] !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [09:22:19] Ok they're all up to date on scap deployments, repooling [09:22:24] greaet [09:22:26] *great [09:22:39] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid [09:23:05] (PS6) Jbond: dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [09:23:18] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:23:44] It works much better when 50% of the cluster isn't depooled tbh [09:24:10] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:24:17] (PS7) Jbond: dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [09:25:23] (CR) CI reject: [V: -1] dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond) [09:25:39] (CR) Jbond: "updated" [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond) [09:26:11] (CR) Muehlenhoff: "First round of comments, but looks good in general" [software/bitu] - https://gerrit.wikimedia.org/r/899519 (owner: Slyngshede) [09:27:13] (PS8) Jbond: dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [09:30:20] (PS1) JMeybohm: admin_ng: Remove if-guards for k8s 1.16
[deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) [09:33:39] (CR) CI reject: [V: -1] admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm) [09:36:38] RECOVERY - Check systemd state on kubemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:14] I don't know what happened but there are also a **lot** of mw appservers depooled in codfw [09:38:45] vgutierrez: Looks like remnants from what happened around 1:00 UTC [09:39:17] I'll repool because we can't really go on with 89 mw appservers depooled, can we [09:41:34] (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm) [09:41:40] (PS9) Jbond: dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [09:42:34] claime: what do you mean?
we don't have any action on the SAL indicating that servers were depooled last night [09:42:59] vgutierrez: No you're right, I saw the alert for servers marked down, but they were not depooled [09:45:21] I'm going through SAL trying to find when they could have been depooled and coming up empty [09:45:37] !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [09:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:51:03] scap pull done on all of them, repooling [09:53:18] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2.*.codfw.wmnet,cluster=appserver [09:53:39] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2.*.codfw.wmnet,cluster=api_appserver [10:02:24] RECOVERY - Check whether ferm is active by checking the default input chain on kubemaster2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:02:44] (PS1) Muehlenhoff: Use signed-by notation for component/puppet5 [puppet] - https://gerrit.wikimedia.org/r/908789 (https://phabricator.wikimedia.org/T330495) [10:03:49] (CR) Jbond: "I think this is fine, however i would prefer it if we set this up as a module in gitlab so that we could add CI.
then add this module to " [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [10:04:38] (03PS1) 10Jameel Kaisar: Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) [10:05:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [10:05:39] (03CR) 10Jbond: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [10:05:41] (03PS2) 10Jameel Kaisar: Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) [10:06:02] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [10:06:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:39] (03PS1) 10Elukey: Add new images to support AMD GPUs on k8s [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908792 (https://phabricator.wikimedia.org/T333009) [10:08:29] !log kamila@deploy2002 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [10:16:03] (03CR) 10David Caro: build: add helper scripts (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [10:16:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:25] (03PS4) 10David Caro: build: add helper scripts 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 [10:17:01] (03CR) 10Muehlenhoff: [C: 03+2] Use signed-by notation for component/puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/908789 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:17:43] (03CR) 10JMeybohm: "hm..works on my machine 😄" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [10:20:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [10:26:46] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [10:30:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1120.eqiad.wmnet [10:33:14] (03PS1) 10Marostegui: mariadb: Remove db1120 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/908793 (https://phabricator.wikimedia.org/T334580) [10:36:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db1120 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/908793 (https://phabricator.wikimedia.org/T334580) 
(owner: 10Marostegui) [10:37:12] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [10:37:14] 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) The patch above fixes the problem, tested in mwdebug2001. Now I need someone to review and merge it, I'll depl... [10:39:08] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1120.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [10:40:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1120.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [10:40:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1120.eqiad.wmnet [10:41:18] 10ops-eqiad, 10decommission-hardware: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 (10Marostegui) [10:41:36] 10ops-eqiad, 10decommission-hardware: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 (10Marostegui) [10:43:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [10:43:22] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... 
[10:46:27] (03PS1) 10Kamila Součková: thumbor: correct comments around tmp_empty_dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 [10:49:18] !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [10:52:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:26] (03PS1) 10Jcrespo: dbbackups: Move s2 and s3 backups from db1102 to db1225 [puppet] - 10https://gerrit.wikimedia.org/r/908798 (https://phabricator.wikimedia.org/T334057) [11:05:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:20] (03PS1) 10Muehlenhoff: Pass -y --force-yes to puppet installation on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908799 (https://phabricator.wikimedia.org/T330495) [11:15:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:54] (03CR) 10Muehlenhoff: [C: 03+2] Pass -y --force-yes to puppet installation on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908799 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [11:22:34] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Vgutierrez) [11:26:01] (03CR) 10Hnowlan: [C: 03+1] thumbor: correct comments around tmp_empty_dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 
(owner: 10Kamila Součková) [11:27:19] (03CR) 10JMeybohm: [C: 03+1] "Nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 (owner: 10Kamila Součková) [11:30:00] (03CR) 10Vgutierrez: [C: 03+1] Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [11:32:41] (03PS2) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) [11:34:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [11:34:15] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10phaultfinder) [11:34:18] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [11:37:46] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Samwalton9) Huge thanks all! [11:37:52] (03CR) 10CI reject: [V: 04-1] admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [11:39:05] (03PS7) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 [11:41:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1109.eqiad.wmnet with reason: Maintenance [11:41:23] (03CR) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. 
(039 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede) [11:41:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1109.eqiad.wmnet with reason: Maintenance [11:41:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46680 and previous config saved to /var/cache/conftool/dbconfig/20230414-114148-ladsgroup.json [11:41:54] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [11:41:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [11:42:01] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 3 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) Thanks a lot! Can we make a user notice in th... 
[11:42:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [11:42:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46681 and previous config saved to /var/cache/conftool/dbconfig/20230414-114219-ladsgroup.json [11:43:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:43:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:43:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46682 and previous config saved to /var/cache/conftool/dbconfig/20230414-114356-ladsgroup.json [11:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:44:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T333332)', diff saved to https://phabricator.wikimedia.org/P46683 and previous config saved to /var/cache/conftool/dbconfig/20230414-114407-ladsgroup.json [11:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46684 and previous config saved to /var/cache/conftool/dbconfig/20230414-114429-ladsgroup.json [11:46:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T333332)', diff saved to https://phabricator.wikimedia.org/P46685 and previous config saved to /var/cache/conftool/dbconfig/20230414-114619-ladsgroup.json [11:50:27] !log jmm@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [11:50:31] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re... [11:50:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:28] 10Puppet, 10Infrastructure-Foundations: gitlab: test out gitlb actions with a stable puppet module - https://phabricator.wikimedia.org/T334723 (10jbond) p:05Triage→03Medium [11:53:17] (03PS1) 10Jbond: debian: move debian package to gitlab/vendored_modules [puppet] - 10https://gerrit.wikimedia.org/r/908805 (https://phabricator.wikimedia.org/T334723) [11:58:39] (03PS12) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [11:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P46686 and previous config saved to /var/cache/conftool/dbconfig/20230414-115903-ladsgroup.json [11:59:35] (03Abandoned) 10Elukey: Add new images to support AMD GPUs on k8s [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908792 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:59:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46687 and previous config saved to /var/cache/conftool/dbconfig/20230414-115935-ladsgroup.json [11:59:51] (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and 
vlan tags (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [12:01:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P46688 and previous config saved to /var/cache/conftool/dbconfig/20230414-120125-ladsgroup.json [12:03:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) [12:04:09] (03PS1) 10Jelto: install_server: configure root raid only on gitlab-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) [12:07:13] (03CR) 10Jelto: "I'm unable to find a recipe to configure two independent raids on four disks. 
This change mostly rolls back to https://gerrit.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:08:40] (03CR) 10Jelto: install_server: configure root raid only on gitlab-raid1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:09:53] (03CR) 10JMeybohm: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [12:13:43] (03CR) 10Ottomata: [C: 03+1] analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: 10Aqu) [12:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P46689 and previous config saved to /var/cache/conftool/dbconfig/20230414-121409-ladsgroup.json [12:14:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46690 and previous config saved to /var/cache/conftool/dbconfig/20230414-121442-ladsgroup.json [12:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P46691 and previous config saved to /var/cache/conftool/dbconfig/20230414-121632-ladsgroup.json [12:16:58] (03CR) 10Clément Goubert: "This change is ready for review." 
[alerts] - 10https://gerrit.wikimedia.org/r/908830 (owner: 10Clément Goubert) [12:20:46] (03PS1) 10Muehlenhoff: Install ruby-sorted-set on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495) [12:25:39] (03CR) 10Muehlenhoff: [C: 03+1] "That sounds like a reasonable compromise" [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:27:14] (03PS3) 10Clément Goubert: team-sre: add alert on mediawiki pooled percentage [alerts] - 10https://gerrit.wikimedia.org/r/908830 [12:27:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [12:28:15] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [12:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46692 and previous config saved to /var/cache/conftool/dbconfig/20230414-122915-ladsgroup.json [12:29:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1111.eqiad.wmnet with reason: Maintenance [12:29:21] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:29:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1111.eqiad.wmnet with reason: Maintenance [12:29:34] (03CR) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [12:29:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T333332)', diff saved to https://phabricator.wikimedia.org/P46693 and previous config saved to /var/cache/conftool/dbconfig/20230414-122939-ladsgroup.json [12:29:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46694 and previous config saved to /var/cache/conftool/dbconfig/20230414-122948-ladsgroup.json [12:29:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:30:03] (03CR) 10Jbond: [C: 03+1] Install ruby-sorted-set on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [12:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:30:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46695 and previous config saved to /var/cache/conftool/dbconfig/20230414-123011-ladsgroup.json [12:30:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T333332)', diff saved to https://phabricator.wikimedia.org/P46696 and previous config saved to /var/cache/conftool/dbconfig/20230414-123047-ladsgroup.json [12:31:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T333332)', diff saved to https://phabricator.wikimedia.org/P46697 and previous config saved to /var/cache/conftool/dbconfig/20230414-123138-ladsgroup.json [12:31:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:31:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:32:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46698 and previous config saved to /var/cache/conftool/dbconfig/20230414-123201-ladsgroup.json [12:32:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46699 and previous config saved to /var/cache/conftool/dbconfig/20230414-123221-ladsgroup.json [12:34:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46700 and previous config saved to /var/cache/conftool/dbconfig/20230414-123413-ladsgroup.json [12:34:50] (03CR) 10Muehlenhoff: Password reset - Allow users to request a password reset. (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede) [12:38:57] (03PS14) 10Jbond: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [12:40:31] (03CR) 10Jbond: opensearch_dashboards: add package provider (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [12:40:56] (03CR) 10Jbond: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [12:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P46701 and previous config saved to /var/cache/conftool/dbconfig/20230414-124553-ladsgroup.json [12:47:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46702 and previous config saved to /var/cache/conftool/dbconfig/20230414-124727-ladsgroup.json [12:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P46703 and previous config saved to /var/cache/conftool/dbconfig/20230414-124920-ladsgroup.json [12:51:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:27] (03CR) 10Muehlenhoff: "First pass of comments, this is looking good in general!" [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 (owner: 10Slyngshede) [12:58:40] (03PS3) 10Slyngshede: Password reset - Allow users to request a password reset. [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 [12:58:47] (03CR) 10Slyngshede: Password reset - Allow users to request a password reset. (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede) [13:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P46704 and previous config saved to /var/cache/conftool/dbconfig/20230414-130101-ladsgroup.json [13:01:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46705 and previous config saved to /var/cache/conftool/dbconfig/20230414-130234-ladsgroup.json [13:04:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P46706 and previous config saved to /var/cache/conftool/dbconfig/20230414-130426-ladsgroup.json [13:07:21] !log creating User:ANONYMOUS ACLs on kafka-test cluster https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_ACLs [13:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:02] !log granting IdempotentWrite on kafka jumbo-eqiad cluster to User:ANONYMOUS - this will allow for use of newer kafka producers that have enabled transactional writes by default. 
`kafka acls --add --allow-principal User:ANONYMOUS --cluster --operation IdempotentWrite` [13:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:16:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T333332)', diff saved to https://phabricator.wikimedia.org/P46707 and previous config saved to /var/cache/conftool/dbconfig/20230414-131607-ladsgroup.json [13:16:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1114.eqiad.wmnet with reason: Maintenance [13:16:13] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:16:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1114.eqiad.wmnet with reason: Maintenance [13:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46708 and previous config saved to /var/cache/conftool/dbconfig/20230414-131631-ladsgroup.json [13:17:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46709 and previous config saved to /var/cache/conftool/dbconfig/20230414-131739-ladsgroup.json [13:17:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:17:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:17:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:18:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:18:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46710 and previous config saved to /var/cache/conftool/dbconfig/20230414-131824-ladsgroup.json [13:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46711 and previous config saved to /var/cache/conftool/dbconfig/20230414-131932-ladsgroup.json [13:19:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:19:48] 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Ottomata) [13:19:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T333332)', diff saved to https://phabricator.wikimedia.org/P46712 and previous config saved to /var/cache/conftool/dbconfig/20230414-131956-ladsgroup.json [13:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46713 and previous config saved to /var/cache/conftool/dbconfig/20230414-132034-ladsgroup.json [13:22:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T333332)', diff saved to https://phabricator.wikimedia.org/P46714 and previous config saved to /var/cache/conftool/dbconfig/20230414-132208-ladsgroup.json 
[13:22:13] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:23:34] PROBLEM - Check systemd state on doc2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_php7.3-fpm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) @Papaul replaced dac cable [13:30:25] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:31:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:30] (03PS14) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [13:32:32] (03PS1) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [13:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P46715 and previous config saved to /var/cache/conftool/dbconfig/20230414-133245-ladsgroup.json [13:32:47] (03CR) 10David Caro: "Untested for now" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [13:34:58] (03PS1) 10Slyngshede: Fix bug where connection timeout is read as tuple. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845 [13:35:17] (03CR) 10CI reject: [V: 04-1] Fix bug where connection timeout is read as tuple. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845 (owner: 10Slyngshede) [13:35:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46716 and previous config saved to /var/cache/conftool/dbconfig/20230414-133540-ladsgroup.json [13:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P46717 and previous config saved to /var/cache/conftool/dbconfig/20230414-133714-ladsgroup.json [13:37:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:18] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) a:03fnegri [13:37:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS buster [13:37:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS buster [13:37:43] (03PS2) 10Slyngshede: Fix bug where connection timeout is read as tuple. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845 [13:42:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:42:10] 10SRE, 10Traffic, 10conftool, 10serviceops: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10Clement_Goubert) For future reference, this left 89 out of 280 appservers and 9 out of 20 parsoid servers depooled in codf... [13:42:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:42:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:42:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:42:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[13:44:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:45:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:45:03] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:45:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [13:45:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.404 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:46:43] (03PS1) 10Andrew Bogott: Move cloudvirtlocal1001 back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/908848 (https://phabricator.wikimedia.org/T334696) [13:47:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P46718 and previous config saved to /var/cache/conftool/dbconfig/20230414-134751-ladsgroup.json [13:48:27] (03PS2) 10Andrew Bogott: Move cloudvirtlocal1001 back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/908848 (https://phabricator.wikimedia.org/T334696) [13:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:49:19] (03CR) 10Andrew Bogott: [C: 
03+2] Move cloudvirtlocal1001 back to 'insetup' [puppet] - 10https://gerrit.wikimedia.org/r/908848 (https://phabricator.wikimedia.org/T334696) (owner: 10Andrew Bogott) [13:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46719 and previous config saved to /var/cache/conftool/dbconfig/20230414-135047-ladsgroup.json [13:51:05] (03CR) 10Slyngshede: Read systems and approval rules from YAML file. (037 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 (owner: 10Slyngshede) [13:51:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:51:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:52:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P46720 and previous config saved to /var/cache/conftool/dbconfig/20230414-135220-ladsgroup.json [13:53:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:53:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:56:37] (03PS15) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [13:57:53] (03CR) 10David Caro: "Tested manually by deleting and the letting it recreate a tool account, got the stats:" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David 
Caro) [14:00:13] (03PS1) 10Andrew Bogott: Nova/cloudvirtlocal: force replacement of /var/lib/nova/instances [puppet] - 10https://gerrit.wikimedia.org/r/908851 [14:02:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46721 and previous config saved to /var/cache/conftool/dbconfig/20230414-140258-ladsgroup.json [14:03:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1116.eqiad.wmnet with reason: Maintenance [14:03:05] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:03:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1116.eqiad.wmnet with reason: Maintenance [14:03:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [14:03:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [14:03:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:03:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:04:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T333332)', diff saved to https://phabricator.wikimedia.org/P46722 and previous config saved to /var/cache/conftool/dbconfig/20230414-140401-ladsgroup.json [14:04:38] (03CR) 10Andrew Bogott: [C: 03+2] Nova/cloudvirtlocal: force replacement of /var/lib/nova/instances [puppet] - 10https://gerrit.wikimedia.org/r/908851 (owner: 10Andrew Bogott) [14:05:14] 10SRE, 10Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (10ssingh) [14:05:54] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46723 and previous config saved to /var/cache/conftool/dbconfig/20230414-140553-ladsgroup.json [14:05:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:06:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46724 and previous config saved to /var/cache/conftool/dbconfig/20230414-140616-ladsgroup.json [14:07:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46725 and previous config saved to /var/cache/conftool/dbconfig/20230414-140725-ladsgroup.json [14:07:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [14:07:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [14:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T333332)', diff saved to https://phabricator.wikimedia.org/P46726 and previous config saved to /var/cache/conftool/dbconfig/20230414-140749-ladsgroup.json [14:09:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.365 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 6.897 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:10:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T333332)', 
diff saved to https://phabricator.wikimedia.org/P46727 and previous config saved to /var/cache/conftool/dbconfig/20230414-141002-ladsgroup.json [14:10:07] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:11:13] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:11:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [14:11:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:11:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:12:37] (03PS6) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [14:12:39] (03PS8) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [14:12:41] (03PS5) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [14:12:43] (03PS39) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:15:12] (03PS40) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:17:37] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 
(https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [14:18:01] (03PS41) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:19:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:19:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:21:15] !log rebooting list1001 for cpu bump [14:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.428 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46728 and previous config saved to /var/cache/conftool/dbconfig/20230414-142232-ladsgroup.json [14:25:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P46729 and previous config saved to /var/cache/conftool/dbconfig/20230414-142508-ladsgroup.json [14:25:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T333332)', diff saved to https://phabricator.wikimedia.org/P46730 and previous config saved to /var/cache/conftool/dbconfig/20230414-142518-ladsgroup.json [14:25:23] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:26:29] (03PS42) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 
10https://gerrit.wikimedia.org/r/895356 [14:27:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [14:29:30] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [14:29:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:29:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [14:30:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:30:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:31:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) Updated netbox and idracs on all three servers frbast1002: vlan:frack-bastion-eqiad ip:10.64.40.196 frmon1002: vlan:frack-administration-e... 
[14:32:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts pybal-test2001.codfw.wmnet [14:34:20] (03PS43) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:34:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:12] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:36:50] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:37:08] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mngmt dns fundrasing - jclark@cumin1001" [14:37:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46731 and previous config saved to /var/cache/conftool/dbconfig/20230414-143738-ladsgroup.json [14:38:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pybal-test2001.codfw.wmnet [14:38:12] 10SRE, 10Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `pybal-test2001.codfw.wmnet` - pybal-test2001.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanage... 
[14:38:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts pybal-test2002.codfw.wmnet [14:38:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mngmt dns fundrasing - jclark@cumin1001" [14:38:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:08] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:40:01] (03PS44) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P46732 and previous config saved to /var/cache/conftool/dbconfig/20230414-144014-ladsgroup.json [14:40:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P46733 and previous config saved to /var/cache/conftool/dbconfig/20230414-144024-ladsgroup.json [14:41:40] (03PS1) 10Ssingh: Remove outdated references to pybal-test200[12] [puppet] - 10https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T321309) [14:42:51] (03PS45) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:44:24] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) >>! In T332024#8780999, @Krinkle wrote: > (I'm responding here in response to an email to the Peformance Team.) > > This is an exciting project to se... 
[14:44:32] (03Merged) 10jenkins-bot: rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:46] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:46:19] (03PS46) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:47:56] (03CR) 10JMeybohm: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [14:48:16] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pybal-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:49:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pybal-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:49:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pybal-test2002.codfw.wmnet [14:49:41] 10SRE, 10Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `pybal-test2002.codfw.wmnet` - pybal-test2002.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanage... 
[14:50:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46734 and previous config saved to /var/cache/conftool/dbconfig/20230414-145245-ladsgroup.json [14:52:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:52:50] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:53:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:53:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:53:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:53:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46735 and previous config saved to /var/cache/conftool/dbconfig/20230414-145327-ladsgroup.json [14:54:21] (03PS2) 10Ssingh: Remove outdated references to pybal-test200[12] [puppet] - 10https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T334745) [14:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T333332)', diff saved to https://phabricator.wikimedia.org/P46736 and previous config saved to /var/cache/conftool/dbconfig/20230414-145521-ladsgroup.json [14:55:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:55:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P46737 and previous config saved to /var/cache/conftool/dbconfig/20230414-145531-ladsgroup.json [14:55:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46738 and previous config saved to /var/cache/conftool/dbconfig/20230414-145537-ladsgroup.json [14:55:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:55:39] (03CR) 10Kamila Součková: [C: 03+2] thumbor: correct comments around tmp_empty_dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 (owner: 10Kamila Součková) [14:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T333332)', diff saved to https://phabricator.wikimedia.org/P46739 and previous config saved to /var/cache/conftool/dbconfig/20230414-145544-ladsgroup.json [14:55:58] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet [14:56:44] (03PS3) 10Ssingh: Remove outdated references to pybal-test200[12] [puppet] - 10https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T334745) [14:57:41] (03CR) 10Ssingh: [C: 03+2] Remove outdated references to pybal-test200[12] [puppet] - 10https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T334745) (owner: 10Ssingh) [14:57:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T333332)', diff saved to https://phabricator.wikimedia.org/P46740 and previous config saved to /var/cache/conftool/dbconfig/20230414-145756-ladsgroup.json [14:58:03] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:00:06] (03PS1) 10Hnowlan: svg: use rsvg-convert output flag [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 
(https://phabricator.wikimedia.org/T334725) [15:00:29] (03PS3) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) [15:00:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:10] (03Merged) 10jenkins-bot: thumbor: correct comments around tmp_empty_dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 (owner: 10Kamila Součková) [15:04:28] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:04:39] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:05:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:10] (03PS4) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) [15:08:00] (03CR) 10CI reject: [V: 04-1] svg: use rsvg-convert output flag [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [15:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T333332)', diff saved to https://phabricator.wikimedia.org/P46741 and previous config saved to /var/cache/conftool/dbconfig/20230414-151037-ladsgroup.json [15:10:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:10:43] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:10:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with 
reason: Maintenance [15:10:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46742 and previous config saved to /var/cache/conftool/dbconfig/20230414-151043-ladsgroup.json [15:10:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:11:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:11:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46743 and previous config saved to /var/cache/conftool/dbconfig/20230414-151108-ladsgroup.json [15:12:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46744 and previous config saved to /var/cache/conftool/dbconfig/20230414-151216-ladsgroup.json [15:12:51] 10SRE, 10Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (10ssingh) 05Open→03Resolved a:03ssingh Hosts decommissioned and removed from Puppet. 
[15:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P46745 and previous config saved to /var/cache/conftool/dbconfig/20230414-151303-ladsgroup.json [15:14:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:14:45] (03PS5) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) [15:15:14] PROBLEM - DPKG on dse-k8s-worker1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:15:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:04] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10odimitrijevic) [15:24:39] (03PS6) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) [15:24:46] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:25:24] (03CR) 10JHathaway: [C: 03+1] Install ruby-sorted-set on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [15:25:40] PROBLEM - mailman3-web on lists1001 is CRITICAL: PROCS CRITICAL: 5 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46746 and previous config saved to 
/var/cache/conftool/dbconfig/20230414-152550-ladsgroup.json [15:26:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [15:26:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [15:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P46747 and previous config saved to /var/cache/conftool/dbconfig/20230414-152722-ladsgroup.json [15:28:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P46748 and previous config saved to /var/cache/conftool/dbconfig/20230414-152809-ladsgroup.json [15:32:50] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:36:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye [15:37:00] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1013.eqiad.wmnet with OS bullseye [15:38:53] (03Merged) 10jenkins-bot: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) 
(owner: 10JMeybohm) [15:38:55] (03CR) 10Hnowlan: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [15:40:46] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [15:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46749 and previous config saved to /var/cache/conftool/dbconfig/20230414-154056-ladsgroup.json [15:40:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:41:02] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:41:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:41:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46750 and previous config saved to /var/cache/conftool/dbconfig/20230414-154119-ladsgroup.json [15:42:26] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:42:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P46751 and previous config saved to /var/cache/conftool/dbconfig/20230414-154228-ladsgroup.json [15:43:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46752 and previous config saved to /var/cache/conftool/dbconfig/20230414-154316-ladsgroup.json [15:43:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:43:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46753 and previous config saved to /var/cache/conftool/dbconfig/20230414-154329-ladsgroup.json [15:43:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:43:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T333332)', diff saved to https://phabricator.wikimedia.org/P46754 and previous config saved to /var/cache/conftool/dbconfig/20230414-154339-ladsgroup.json [15:45:03] 10SRE-Unowned: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10HShaikh) [15:45:25] 10SRE-Unowned: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10HShaikh) [15:45:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T333332)', diff saved to https://phabricator.wikimedia.org/P46755 and previous config saved to /var/cache/conftool/dbconfig/20230414-154551-ladsgroup.json [15:46:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [15:52:51] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:52:57] !log hnowlan@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [15:53:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [15:55:10] (03PS1) 10Hnowlan: rest-gateway: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/908891 (https://phabricator.wikimedia.org/T334611) [15:55:43] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10fnegri) [15:57:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46756 and previous config saved to /var/cache/conftool/dbconfig/20230414-155735-ladsgroup.json [15:57:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:57:40] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:57:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46757 and previous config saved to /var/cache/conftool/dbconfig/20230414-155758-ladsgroup.json [15:58:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46758 and previous config saved to /var/cache/conftool/dbconfig/20230414-155835-ladsgroup.json [16:00:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P46759 and previous config saved to /var/cache/conftool/dbconfig/20230414-160058-ladsgroup.json [16:06:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host lvs1013.eqiad.wmnet with OS bullseye [16:06:17] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1013.eqiad.wmnet with OS bullseye completed: - lvs1013 (**PASS**) - Downtimed on Icinga/Aler... [16:12:04] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [16:13:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46760 and previous config saved to /var/cache/conftool/dbconfig/20230414-161341-ladsgroup.json [16:16:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P46761 and previous config saved to /var/cache/conftool/dbconfig/20230414-161604-ladsgroup.json [16:16:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Jgreen can you please confirm that you an not access those servers so you can take over the task? thanks [16:18:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jgreen) >>! In T319460#8782417, @Papaul wrote: > @Jgreen can you please confirm that you an not access those servers so you can take over the task? >... 
[16:20:00] (03CR) 10Pmiazga: rest-gateway: support for proton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [16:27:30] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [16:27:45] (03PS1) 10JHathaway: lists: Bump the number worker processes to 4 [puppet] - 10https://gerrit.wikimedia.org/r/908896 [16:28:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [16:28:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46762 and previous config saved to /var/cache/conftool/dbconfig/20230414-162848-ladsgroup.json [16:28:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:28:53] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:29:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:29:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46763 and previous config saved to /var/cache/conftool/dbconfig/20230414-162911-ladsgroup.json [16:30:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:30:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:31:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46764 and previous config saved to /var/cache/conftool/dbconfig/20230414-163110-ladsgroup.json [16:31:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:31:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46765 and previous config saved to /var/cache/conftool/dbconfig/20230414-163120-ladsgroup.json [16:31:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:31:32] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908896 (owner: 10JHathaway) [16:31:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:31:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:32:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [16:32:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [16:32:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T333332)', diff saved to https://phabricator.wikimedia.org/P46766 and previous config saved to /var/cache/conftool/dbconfig/20230414-163221-ladsgroup.json [16:34:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T333332)', diff saved to https://phabricator.wikimedia.org/P46767 and previous config saved to /var/cache/conftool/dbconfig/20230414-163434-ladsgroup.json [16:34:39] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:38:27] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:38:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:38:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:38:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [16:39:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:39:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [16:46:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46768 and previous config saved to /var/cache/conftool/dbconfig/20230414-164627-ladsgroup.json [16:47:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1015.eqiad.wmnet with OS bullseye [16:47:27] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1015.eqiad.wmnet with OS bullseye [16:49:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1169', diff saved to https://phabricator.wikimedia.org/P46769 and previous config saved to /var/cache/conftool/dbconfig/20230414-164940-ladsgroup.json [16:51:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:43] (03CR) 10Ladsgroup: [C: 03+1] "Thanks! We can merge it early next week?" [puppet] - 10https://gerrit.wikimedia.org/r/908896 (owner: 10JHathaway) [16:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46770 and previous config saved to /var/cache/conftool/dbconfig/20230414-165814-ladsgroup.json [16:58:19] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:59:29] 10ops-codfw, 10Data-Persistence-Backup: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10Jhancock.wm) 05Open→03Resolved We moved the port from ge-6/0/6 to ge-6/0/22. This should stop the errors. if they occur again we'll reinvestigate. 
[17:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage [17:01:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46771 and previous config saved to /var/cache/conftool/dbconfig/20230414-170133-ladsgroup.json [17:02:07] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:02:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
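Note on the alert entries above: the produce_canary_events.service check on an-launcher1002 alternates between PROBLEM and RECOVERY throughout this log. A minimal sketch of how the time spent in each PROBLEM state could be measured from the entries follows. The regular expression is inferred only from the entries visible here, the function name `flap_durations` is hypothetical, and the code assumes that a PROBLEM and its RECOVERY fall on the same day.

```python
import re
from datetime import datetime

# Shape of the alert entries as they appear in this log (an assumption drawn
# from the visible entries, not from any monitoring tool's documentation).
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] (?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is (?P<state>CRITICAL|WARNING|UNKNOWN|OK)\b"
)


def flap_durations(log_text):
    """Return (host, check, seconds) for each PROBLEM -> RECOVERY pair."""
    open_since = {}   # (check, host) -> time of the first unresolved PROBLEM
    durations = []
    for m in ALERT_RE.finditer(log_text):
        key = (m["check"], m["host"])
        t = datetime.strptime(m["time"], "%H:%M:%S")
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if several arrive before a RECOVERY.
            open_since.setdefault(key, t)
        elif key in open_since:
            delta = t - open_since.pop(key)
            durations.append((m["host"], m["check"], int(delta.total_seconds())))
    return durations


# Four entries copied from the log above.
SAMPLE = (
    "[16:51:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: "
    "CRITICAL - degraded: The following units failed: produce_canary_events.service "
    "https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state "
    "[17:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: "
    "OK - running: The system is fully operational "
    "https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state "
    "[17:05:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: "
    "CRITICAL - degraded: The following units failed: produce_canary_events.service "
    "https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state "
    "[17:15:35] RECOVERY - Check systemd state on an-launcher1002 is OK: "
    "OK - running: The system is fully operational "
    "https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state "
)

if __name__ == "__main__":
    for host, check, seconds in flap_durations(SAMPLE):
        print(f"{host}: {check!r} was in PROBLEM state for {seconds} s")
```

Because the pattern is applied with `finditer`, it does not matter whether each entry sits on its own line or several entries share one line, as they do in this log.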
[17:03:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage [17:04:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [17:04:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [17:04:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P46772 and previous config saved to /var/cache/conftool/dbconfig/20230414-170447-ladsgroup.json [17:05:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:05:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:05:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:51] (03PS1) 10DCausse: rdf-streaming-updater: use flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908900 [17:07:37] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [17:08:41] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [17:10:18] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:10:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [17:11:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:11:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:12:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P46773 and previous config saved to /var/cache/conftool/dbconfig/20230414-171320-ladsgroup.json [17:13:38] (03PS7) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [17:13:40] (03PS9) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [17:13:42] (03PS6) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [17:13:44] (03PS47) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [17:13:46] (03PS1) 10Jbond: ssl_ssl_ciphersuite: Add AES256-SHA256 to 
list of mid cipher [puppet] - 10https://gerrit.wikimedia.org/r/908902 [17:15:05] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:15:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [17:15:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1015.eqiad.wmnet with OS bullseye [17:15:52] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1015.eqiad.wmnet with OS bullseye completed: - lvs1015 (**PASS**) - Downtimed on Icinga/Aler... 
[17:16:11] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: use flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908900 (owner: 10DCausse) [17:16:20] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [17:16:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46774 and previous config saved to /var/cache/conftool/dbconfig/20230414-171638-ladsgroup.json [17:16:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:16:44] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:16:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46775 and previous config saved to /var/cache/conftool/dbconfig/20230414-171702-ladsgroup.json [17:17:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1014.eqiad.wmnet with OS bullseye [17:17:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1014.eqiad.wmnet with OS bullseye [17:18:22] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [17:18:58] (03CR) 10CI reject: [V: 04-1] ssl_ssl_ciphersuite: Add AES256-SHA256 to list of mid cipher [puppet] - 10https://gerrit.wikimedia.org/r/908902 (owner: 10Jbond) [17:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46776 and previous config saved to /var/cache/conftool/dbconfig/20230414-171911-ladsgroup.json [17:19:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T333332)', diff saved to https://phabricator.wikimedia.org/P46777 and previous config saved to /var/cache/conftool/dbconfig/20230414-171953-ladsgroup.json [17:19:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [17:20:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [17:20:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T333332)', diff saved to https://phabricator.wikimedia.org/P46778 and previous config saved to /var/cache/conftool/dbconfig/20230414-172016-ladsgroup.json [17:20:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:21:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:21:34] (03Merged) 10jenkins-bot: rdf-streaming-updater: use flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908900 (owner: 10DCausse) [17:22:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T333332)', diff saved to https://phabricator.wikimedia.org/P46779 and previous config saved to 
/var/cache/conftool/dbconfig/20230414-172229-ladsgroup.json [17:22:34] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:23:48] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:24:01] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:25:51] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:25:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [17:27:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirtlocal1001.eqiad.wmnet [17:28:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P46780 and previous config saved to /var/cache/conftool/dbconfig/20230414-172826-ladsgroup.json [17:29:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bullseye [17:29:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye [17:30:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46781 and previous config saved to 
/var/cache/conftool/dbconfig/20230414-173418-ladsgroup.json [17:36:33] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage [17:37:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P46782 and previous config saved to /var/cache/conftool/dbconfig/20230414-173734-ladsgroup.json [17:39:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072'] [17:39:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage [17:42:10] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1016.eqiad.wmnet with reason: host reimage [17:43:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46783 and previous config saved to /var/cache/conftool/dbconfig/20230414-174333-ladsgroup.json [17:43:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [17:43:38] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:43:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [17:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T333332)', diff saved to https://phabricator.wikimedia.org/P46784 and previous config saved to /var/cache/conftool/dbconfig/20230414-174356-ladsgroup.json [17:45:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1016.eqiad.wmnet with reason: host reimage [17:47:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet 
[17:49:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:49:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [17:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46785 and previous config saved to /var/cache/conftool/dbconfig/20230414-174924-ladsgroup.json [17:50:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P46786 and previous config saved to /var/cache/conftool/dbconfig/20230414-175242-ladsgroup.json [17:53:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1014.eqiad.wmnet with OS bullseye [17:53:50] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1014.eqiad.wmnet with OS bullseye completed: - lvs1014 (**PASS**) - Downtimed on Icinga/Aler... 
[17:57:01] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1016.eqiad.wmnet with OS bullseye [17:57:10] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye completed: - lvs1016 (**PASS**) - Downtimed on Icinga/Aler... [18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:07] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:03:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
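Note on the cookbook entries above: each run ends with an entry of the form `END (STATUS) - Cookbook NAME (exit_code=N)`. In this log PASS appears with exit_code=0, FAIL with 99 and ERROR with 97. A minimal sketch of how those outcomes could be tallied follows. The pattern is inferred from the entries visible here, and the names `END_RE` and `tally_cookbook_results` are hypothetical, not part of any cookbook tooling.

```python
import re
from collections import Counter

# Shape of cookbook completion entries as they appear in this log, e.g.
# "[18:03:07] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) ..."
END_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log (?P<user>[^@\s]+)@(?P<origin>\S+) "
    r"END \((?P<status>[A-Z]+)\) - Cookbook (?P<cookbook>\S+) "
    r"\(exit_code=(?P<code>\d+)\)"
)


def tally_cookbook_results(log_text):
    """Count (cookbook, status, exit_code) triples over all END entries."""
    counts = Counter()
    for m in END_RE.finditer(log_text):
        counts[(m["cookbook"], m["status"], int(m["code"]))] += 1
    return counts


# Three entries copied from the log above.
SAMPLE = (
    "[17:47:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp "
    "(exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet "
    "[17:57:01] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage "
    "(exit_code=0) for host lvs1016.eqiad.wmnet with OS bullseye "
    "[18:03:07] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage "
    "(exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye "
)

if __name__ == "__main__":
    for (cookbook, status, code), n in sorted(tally_cookbook_results(SAMPLE).items()):
        print(f"{cookbook}: {status} (exit_code={code}) x{n}")
```

Keying the counter on the status and the exit code together keeps the pairing observed in the log visible without assuming what each code means.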
[18:03:20] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) [18:03:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:03:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:04:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46787 and previous config saved to /var/cache/conftool/dbconfig/20230414-180430-ladsgroup.json [18:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:04:36] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:04:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:04:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [18:05:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [18:05:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [18:05:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [18:05:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance [18:06:00] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance [18:06:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46788 and previous config saved to /var/cache/conftool/dbconfig/20230414-180606-ladsgroup.json [18:06:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:37] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (10ssingh) Yet another data point if that helps: I am trying to merge the codfw LVS hiera definitions and ran into the following... [18:07:05] (03CR) 10Ssingh: "To be merged on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:07:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T333332)', diff saved to https://phabricator.wikimedia.org/P46789 and previous config saved to /var/cache/conftool/dbconfig/20230414-180748-ladsgroup.json [18:07:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:08:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T333332)', diff saved to https://phabricator.wikimedia.org/P46790 and previous config saved to /var/cache/conftool/dbconfig/20230414-180812-ladsgroup.json [18:08:57] !log doc1002, doc2001 - manually remove php7.3-fpm restart timers to fix T334735 and alerting - T322357 - systemctl stop 
wmf_auto_restart_php7.3-fpm.timer; systemctl stop wmf_auto_restart_php7.3-fpm.service; rm /lib/systemd/system/wmf_auto_restart_php7.3-fpm.* [18:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:03] T334735: fix PHP auto-restarts on doc hosts - https://phabricator.wikimedia.org/T334735 [18:09:04] T322357: OOUI PHP demos page is broken (again) - https://phabricator.wikimedia.org/T322357 [18:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T333332)', diff saved to https://phabricator.wikimedia.org/P46791 and previous config saved to /var/cache/conftool/dbconfig/20230414-181025-ladsgroup.json [18:10:31] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:11:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46792 and previous config saved to /var/cache/conftool/dbconfig/20230414-181123-ladsgroup.json [18:13:52] (03CR) 10Dzahn: [C: 03+2] "another follow-up was that the restart services were not removed by puppet and failed to restart missing php 7.3 which then caused monitor" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [18:16:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:49] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:17:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[18:18:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:18:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:19:37] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [18:21:15] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:22:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (10Papaul) When I run ` cookbook sre.hosts.dhcp --os bullseye cloudvirtlocal1001 ` i able to reboot the server as many time as i want and hit F12 and t... 
[18:25:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P46793 and previous config saved to /var/cache/conftool/dbconfig/20230414-182532-ladsgroup.json [18:25:58] (03PS48) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [18:26:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46794 and previous config saved to /var/cache/conftool/dbconfig/20230414-182629-ladsgroup.json [18:26:42] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [18:26:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:27:03] (03CR) 10JHathaway: lists: Bump the number worker processes to 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908896 (owner: 10JHathaway) [18:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:33:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T333332)', diff saved to https://phabricator.wikimedia.org/P46795 and previous config saved to /var/cache/conftool/dbconfig/20230414-183311-ladsgroup.json [18:33:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:33:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [18:35:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10JameelKaisar) First of all thank you Timo and Chris for the detailed information. ## Measurement domain - The shuffling the targets/domains part is implemen... [18:36:12] 10SRE, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) rsyncing of /srv/gerrit including /srv/gerrit/git and other things is STILL ongoing, it's hundreds of GB of ALL small files.. and rsync bandwith limited to make sure gerrit prod is not affec... [18:36:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [18:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P46796 and previous config saved to /var/cache/conftool/dbconfig/20230414-184038-ladsgroup.json [18:41:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46797 and previous config saved to /var/cache/conftool/dbconfig/20230414-184135-ladsgroup.json [18:48:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P46798 and previous config saved to /var/cache/conftool/dbconfig/20230414-184818-ladsgroup.json [18:51:54] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:52:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:52:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - 
https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye complete... [18:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T333332)', diff saved to https://phabricator.wikimedia.org/P46799 and previous config saved to /var/cache/conftool/dbconfig/20230414-185545-ladsgroup.json [18:55:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:55:50] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:56:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:56:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:56:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:56:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T333332)', diff saved to https://phabricator.wikimedia.org/P46800 and previous config saved to /var/cache/conftool/dbconfig/20230414-185630-ladsgroup.json [18:56:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46801 and previous config saved to /var/cache/conftool/dbconfig/20230414-185642-ladsgroup.json [18:56:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance [18:56:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance [18:57:06] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46802 and previous config saved to /var/cache/conftool/dbconfig/20230414-185705-ladsgroup.json [18:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T333332)', diff saved to https://phabricator.wikimedia.org/P46803 and previous config saved to /var/cache/conftool/dbconfig/20230414-185842-ladsgroup.json [18:59:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46804 and previous config saved to /var/cache/conftool/dbconfig/20230414-185921-ladsgroup.json [19:03:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P46805 and previous config saved to /var/cache/conftool/dbconfig/20230414-190324-ladsgroup.json [19:05:06] 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10RLazarus) [19:06:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:16] (03PS1) 10AOkoth: prometheus: delete migrated eventgate alerts [puppet] - 10https://gerrit.wikimedia.org/r/908917 (https://phabricator.wikimedia.org/T309009) [19:08:07] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10BBlack) It's awesome to see this moving along! One minor point: >> This would then be immediately queryable in Grafana by DC and Country code, where you c... 
[19:13:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P46806 and previous config saved to /var/cache/conftool/dbconfig/20230414-191348-ladsgroup.json [19:14:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46807 and previous config saved to /var/cache/conftool/dbconfig/20230414-191428-ladsgroup.json [19:15:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T333332)', diff saved to https://phabricator.wikimedia.org/P46808 and previous config saved to /var/cache/conftool/dbconfig/20230414-191831-ladsgroup.json [19:18:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [19:18:36] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:18:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [19:18:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T333332)', diff saved to https://phabricator.wikimedia.org/P46809 and previous config saved to /var/cache/conftool/dbconfig/20230414-191854-ladsgroup.json [19:20:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T333332)', diff saved to https://phabricator.wikimedia.org/P46810 and previous config saved to /var/cache/conftool/dbconfig/20230414-192001-ladsgroup.json [19:22:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P46811 and previous config saved to /var/cache/conftool/dbconfig/20230414-192855-ladsgroup.json [19:29:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46812 and previous config saved to /var/cache/conftool/dbconfig/20230414-192934-ladsgroup.json [19:31:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P46813 and previous config saved to /var/cache/conftool/dbconfig/20230414-193507-ladsgroup.json [19:44:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T333332)', diff saved to https://phabricator.wikimedia.org/P46814 and previous config saved to /var/cache/conftool/dbconfig/20230414-194401-ladsgroup.json [19:44:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [19:44:07] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:44:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [19:44:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T333332)', diff saved to https://phabricator.wikimedia.org/P46815 and previous config saved to /var/cache/conftool/dbconfig/20230414-194424-ladsgroup.json [19:44:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46816 and previous config saved to /var/cache/conftool/dbconfig/20230414-194441-ladsgroup.json [19:44:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [19:44:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [19:45:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46817 and previous config saved to /var/cache/conftool/dbconfig/20230414-194504-ladsgroup.json [19:46:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T333332)', diff saved to https://phabricator.wikimedia.org/P46818 and previous config saved to /var/cache/conftool/dbconfig/20230414-194637-ladsgroup.json [19:47:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46819 and previous config saved to /var/cache/conftool/dbconfig/20230414-194720-ladsgroup.json [19:50:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P46820 and previous config saved to /var/cache/conftool/dbconfig/20230414-195014-ladsgroup.json [20:01:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P46821 and previous config saved to /var/cache/conftool/dbconfig/20230414-200144-ladsgroup.json [20:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46822 and previous config saved to /var/cache/conftool/dbconfig/20230414-200226-ladsgroup.json [20:05:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46823 and previous config saved to /var/cache/conftool/dbconfig/20230414-200520-ladsgroup.json [20:05:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [20:05:25] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:05:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [20:05:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T333332)', diff saved to https://phabricator.wikimedia.org/P46824 and previous config saved to /var/cache/conftool/dbconfig/20230414-200543-ladsgroup.json [20:06:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T333332)', diff saved to https://phabricator.wikimedia.org/P46825 and previous config saved to /var/cache/conftool/dbconfig/20230414-200751-ladsgroup.json [20:15:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Jgreen I can check and let you know on the firmware update. 
[20:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P46826 and previous config saved to /var/cache/conftool/dbconfig/20230414-201650-ladsgroup.json [20:16:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Papaul) [20:17:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Tenant networking not working on cloudvirtlocal hosts - https://phabricator.wikimedia.org/T334694 (10Andrew) 05Open→03Resolved a:05Cmjohnson→03cmooney This was fixed by Cathal. [20:17:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) [20:17:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) [20:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46827 and previous config saved to /var/cache/conftool/dbconfig/20230414-201734-ladsgroup.json [20:21:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Papaul) 05Open→03Resolved Complete [20:22:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:19] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Papaul) [20:22:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P46828 
and previous config saved to /var/cache/conftool/dbconfig/20230414-202257-ladsgroup.json [20:26:59] (03PS15) 10Cwhite: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) [20:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:53] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Andrew) OK -- I'm not ready to get rid of the data on this server but it is fine to reboot it now. Thanks for waiting! [20:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T333332)', diff saved to https://phabricator.wikimedia.org/P46829 and previous config saved to /var/cache/conftool/dbconfig/20230414-203156-ladsgroup.json [20:31:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [20:32:04] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:32:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [20:32:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T333332)', diff saved to https://phabricator.wikimedia.org/P46830 and previous config saved to /var/cache/conftool/dbconfig/20230414-203220-ladsgroup.json [20:32:21] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [20:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46831 and previous config saved to /var/cache/conftool/dbconfig/20230414-203241-ladsgroup.json [20:32:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: 
Maintenance [20:32:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance [20:33:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46832 and previous config saved to /var/cache/conftool/dbconfig/20230414-203304-ladsgroup.json [20:33:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [20:35:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46833 and previous config saved to /var/cache/conftool/dbconfig/20230414-203520-ladsgroup.json [20:36:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:50] !log rebooting labstore1004 for mgmt interface issue [20:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P46834 and previous config saved to /var/cache/conftool/dbconfig/20230414-203804-ladsgroup.json [20:41:05] (03CR) 10Cwhite: opensearch_dashboards: add package provider (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:42:00] (03CR) 10Andrea Denisse: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [20:45:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:27] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46835 and previous config saved to /var/cache/conftool/dbconfig/20230414-205026-ladsgroup.json [20:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T333332)', diff saved to https://phabricator.wikimedia.org/P46836 and previous config saved to /var/cache/conftool/dbconfig/20230414-205310-ladsgroup.json [20:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [20:53:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [20:53:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T333332)', diff saved to https://phabricator.wikimedia.org/P46837 and previous config saved to /var/cache/conftool/dbconfig/20230414-205333-ladsgroup.json [20:55:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T333332)', diff saved to https://phabricator.wikimedia.org/P46838 and previous config saved to /var/cache/conftool/dbconfig/20230414-205541-ladsgroup.json [20:56:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Jgreen we will have to update the firmware on those. [20:57:21] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Papaul) 05Open→03Resolved rebooting the server fixed the issue. 
We can now resolve this [20:57:44] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Papaul) [20:58:28] 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10Krinkle) [20:59:59] (03CR) 10Cwhite: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [21:01:58] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [21:03:24] 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10Krinkle) > Category membership: queried using this [[ https://en.wikipedia.org/w/api.php?action=query&format=json&continue=&revids=1147464943&cllimit=max&i... 
[21:05:08] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [21:05:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46840 and previous config saved to /var/cache/conftool/dbconfig/20230414-210533-ladsgroup.json [21:06:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:53] (03PS1) 10EoghanGaffney: Only recurse if the directory is to be removed [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) [21:08:58] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10colewhite) [21:10:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P46841 and previous config saved to /var/cache/conftool/dbconfig/20230414-211048-ladsgroup.json [21:11:30] PROBLEM - Disk space on urldownloader1001 is CRITICAL: DISK CRITICAL - free space: / 283 MB (3% inode=89%): /tmp 283 MB (3% inode=89%): /var/tmp 283 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops [21:11:59] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40681/console" [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) (owner: 10EoghanGaffney) [21:12:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:13:33] (03PS2) 10EoghanGaffney: [gitlab/ssh] Only recurse if the directory is to be removed [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) [21:16:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46842 and previous config saved to /var/cache/conftool/dbconfig/20230414-212039-ladsgroup.json [21:20:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [21:20:45] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:20:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [21:21:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46843 and previous config saved to /var/cache/conftool/dbconfig/20230414-212102-ladsgroup.json [21:23:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46844 and previous config saved to /var/cache/conftool/dbconfig/20230414-212319-ladsgroup.json [21:25:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P46845 and previous config saved to /var/cache/conftool/dbconfig/20230414-212554-ladsgroup.json [21:26:14] (03Abandoned) 10Cwhite: 
profile: clean up ipsec aggregate check [puppet] - 10https://gerrit.wikimedia.org/r/632739 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [21:27:09] (03Abandoned) 10Cwhite: scb: enable statsd_exporter and add matching rules [puppet] - 10https://gerrit.wikimedia.org/r/484586 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:30:10] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 3 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Quiddity) 1) Hi, Re: User Notice - please could someone... [21:34:49] (03Abandoned) 10Cwhite: when configured to relay statsd traffic, send the raw []byte recieved toward the configured statsd endpoint [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/554544 (https://phabricator.wikimedia.org/T239833) (owner: 10Cwhite) [21:36:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:46] (03Abandoned) 10Cwhite: hiera: specify tlsproxy configuration for grafana [puppet] - 10https://gerrit.wikimedia.org/r/616811 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [21:38:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46846 and previous config saved to /var/cache/conftool/dbconfig/20230414-213825-ladsgroup.json [21:39:07] (03Abandoned) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [21:41:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T333332)', diff saved to https://phabricator.wikimedia.org/P46847 and previous config saved to 
/var/cache/conftool/dbconfig/20230414-214100-ladsgroup.json [21:41:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance [21:41:06] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:41:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance [21:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T333332)', diff saved to https://phabricator.wikimedia.org/P46848 and previous config saved to /var/cache/conftool/dbconfig/20230414-214123-ladsgroup.json [21:42:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T333332)', diff saved to https://phabricator.wikimedia.org/P46849 and previous config saved to /var/cache/conftool/dbconfig/20230414-214231-ladsgroup.json [21:44:02] 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10kaldari) @Ladsgroup - Is there any way that folks can manually purge thumbnails that didn't get regenerated (besides reup... 
[21:46:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:53:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46850 and previous config saved to /var/cache/conftool/dbconfig/20230414-215331-ladsgroup.json [21:57:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P46851 and previous config saved to /var/cache/conftool/dbconfig/20230414-215738-ladsgroup.json [21:59:07] (03PS2) 10Cwhite: logstash: ulogd remove copy network.transport to network.protocol [puppet] - 10https://gerrit.wikimedia.org/r/886857 (https://phabricator.wikimedia.org/T329195) [22:01:16] (03CR) 10Andrea Denisse: [C: 03+1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [22:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46852 and previous config saved to /var/cache/conftool/dbconfig/20230414-220838-ladsgroup.json [22:08:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [22:08:44] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:08:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [22:08:58] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [22:09:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [22:09:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46853 and previous config saved to /var/cache/conftool/dbconfig/20230414-220918-ladsgroup.json [22:11:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46854 and previous config saved to /var/cache/conftool/dbconfig/20230414-221134-ladsgroup.json [22:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P46855 and previous config saved to /var/cache/conftool/dbconfig/20230414-221244-ladsgroup.json [22:21:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:35] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) Sorry, it was a misclick. I removed the tag. 
[22:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46856 and previous config saved to /var/cache/conftool/dbconfig/20230414-222641-ladsgroup.json [22:27:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T333332)', diff saved to https://phabricator.wikimedia.org/P46857 and previous config saved to /var/cache/conftool/dbconfig/20230414-222750-ladsgroup.json [22:27:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [22:27:56] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:28:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [22:28:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T333332)', diff saved to https://phabricator.wikimedia.org/P46858 and previous config saved to /var/cache/conftool/dbconfig/20230414-222814-ladsgroup.json [22:29:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T333332)', diff saved to https://phabricator.wikimedia.org/P46859 and previous config saved to /var/cache/conftool/dbconfig/20230414-222921-ladsgroup.json [22:30:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:31:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46860 and previous config saved to /var/cache/conftool/dbconfig/20230414-224147-ladsgroup.json [22:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P46861 and previous config saved to /var/cache/conftool/dbconfig/20230414-224428-ladsgroup.json [22:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:44] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10KFrancis) @Dzahn I am confirming the signed NDA. Please proceed with the access request. Thank you!
[22:56:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46862 and previous config saved to /var/cache/conftool/dbconfig/20230414-225654-ladsgroup.json [22:56:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:56:59] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:57:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:57:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46863 and previous config saved to /var/cache/conftool/dbconfig/20230414-225717-ladsgroup.json [22:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46864 and previous config saved to /var/cache/conftool/dbconfig/20230414-225934-ladsgroup.json [22:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P46865 and previous config saved to /var/cache/conftool/dbconfig/20230414-225934-ladsgroup.json [23:01:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:58] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [23:12:32] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [23:14:40] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46866 and previous config saved to /var/cache/conftool/dbconfig/20230414-231440-ladsgroup.json [23:14:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T333332)', diff saved to https://phabricator.wikimedia.org/P46867 and previous config saved to /var/cache/conftool/dbconfig/20230414-231440-ladsgroup.json [23:14:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:14:48] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:14:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:15:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [23:15:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [23:15:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [23:15:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance [23:15:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [23:15:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance [23:15:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance [23:15:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: 
Maintenance [23:15:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T333332)', diff saved to https://phabricator.wikimedia.org/P46868 and previous config saved to /var/cache/conftool/dbconfig/20230414-231557-ladsgroup.json [23:17:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T333332)', diff saved to https://phabricator.wikimedia.org/P46869 and previous config saved to /var/cache/conftool/dbconfig/20230414-231707-ladsgroup.json [23:21:59] (03PS2) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - 10https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) [23:23:03] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - 10https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [23:25:18] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [23:26:52] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [23:27:59] (03PS2) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - 10https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) [23:28:50] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - 10https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [23:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46870 and previous config saved to 
/var/cache/conftool/dbconfig/20230414-232946-ladsgroup.json [23:32:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P46871 and previous config saved to /var/cache/conftool/dbconfig/20230414-233213-ladsgroup.json [23:32:57] (03PS2) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus5002 [puppet] - 10https://gerrit.wikimedia.org/r/907985 (https://phabricator.wikimedia.org/T309979) [23:33:32] (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:35:00] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Apply prometheus::pop role to prometheus5002 [puppet] - 10https://gerrit.wikimedia.org/r/907985 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [23:35:33] (JobUnavailable) firing: (15) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46872 and previous config saved to /var/cache/conftool/dbconfig/20230414-234453-ladsgroup.json [23:44:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:44:58] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:45:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:45:17] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46873 and previous config saved to /var/cache/conftool/dbconfig/20230414-234516-ladsgroup.json [23:45:33] (JobUnavailable) firing: (15) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:47:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P46874 and previous config saved to /var/cache/conftool/dbconfig/20230414-234720-ladsgroup.json [23:47:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46875 and previous config saved to /var/cache/conftool/dbconfig/20230414-234732-ladsgroup.json [23:47:50] (03PS49) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [23:50:32] (JobUnavailable) firing: (15) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:50:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:17] (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:55:47] (JobUnavailable) firing: (20) Reduced availability for job blackbox/icmp in 
ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:58:32] (JobUnavailable) resolved: (2) Reduced availability for job blackbox/pingthing in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:59:02] (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable