[00:04:45] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[00:04:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[00:05:34] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[00:05:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[00:11:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Papaul) @Jclark-ctr when you are next on site can you please replace the DAC cable connecting cloudvirtlocal1001 to the switch?  Thanks
[00:17:31] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[00:18:31] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[00:39:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/908663
[00:39:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/908663 (owner: TrainBranchBot)
[00:57:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/908663 (owner: TrainBranchBot)
[01:01:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[01:02:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[01:07:48] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided)
[01:07:59] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 10s)
[01:08:07] <jinxer-wm>	 (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:30] <stashbot>	 fab@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[01:09:07] <rzl>	 here, looking
[01:09:20] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2371.codfw.wmnet, mw2393.codfw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw2361.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, m
[01:09:20] <icinga-wm_>	 fw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw2274.codfw.wmnet, mw2305.codfw.wmnet, mw2337.codfw.wmnet, mw2307.codfw.wmnet, mw2380.codfw.wmnet, mw2383.codfw.wmnet, mw2336.codfw.wmnet, mw2414.codfw.wmnet, mw2268.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:10:00] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2365.codfw.wmnet, mw2373.codfw.wmnet, mw2393.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw2371.codfw.wmnet, mw2274.codfw.wmnet, mw2438.codfw.wmnet, mw2414.codfw.wmnet, mw2305.codfw.wmnet, mw2337.codfw.wmnet, mw2383.codfw.wmnet, m
[01:10:00] <icinga-wm_>	 fw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmnet, mw2380.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw2336.codfw.wmnet, mw2361.codfw.wmnet, mw2269.codfw.wmnet, mw2268.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:10:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:10:56] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:11:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:11:36] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:12:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: ...
[01:12:16] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:12:38] <jinxer-wm>	 (ProbeDown) firing: (15) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:07] <jinxer-wm>	 (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:14:03] <urandom>	 o/
[01:14:21] <sukhe>	 seems to have all resolved now
[01:14:30] <rzl>	 we had a big spike in reads to s6 starting at 01:04 -- that eventually led to DB errors and appserver worker saturation
[01:14:47] <sukhe>	 ok thanks. any action required on our end then?
[01:15:05] <rzl>	 still working on what those s6 reads were about -- doesn't look like it was driven by a traffic spike afaict
[01:15:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (3) Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: ...
[01:17:16] <jinxer-wm>	 codfw api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:19:08] <rzl>	 I don't think anything is still broken, I just don't get what happened yet or whether it will happen again
[01:20:01] <AntiComposite>	 that's getting quipped
[01:37:27] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided)
[01:37:38] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 11s)
[01:45:48] <Kemayo>	 There's definitely *something* still going on -- I got an error page accessing wikivoyage, and it was very slow to fetch the page on a reload.
[01:46:01] <rzl>	 I was about to say, it's on the upswing again
[01:46:03] <rzl>	 might page shortly
[01:46:30] <Kemayo>	 "Original error: upstream connect error or disconnect/reset before headers. reset reason: overflow"
[01:47:02] <AntiComposite>	 was also slow again here, now fine
[01:47:43] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[01:47:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[01:48:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:48:43] <sukhe>	 hello again
[01:48:58] <sukhe>	 this one seems more severe, or just more visible
[01:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:52:43] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[01:52:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[01:53:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:10:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:28] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:47:14] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:55:34] <icinga-wm_>	 PROBLEM - PHP7 rendering on parse2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:05:18] <icinga-wm_>	 PROBLEM - Disk space on testreduce1001 is CRITICAL: DISK CRITICAL - free space: /srv/data 1783 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops
[03:21:02] <icinga-wm_>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann
[03:21:02] <icinga-wm_>	 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:22:36] <icinga-wm_>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:24:34] <icinga-wm_>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[03:26:08] <icinga-wm_>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:41:36] <icinga-wm_>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[03:43:08] <icinga-wm_>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:45:50] <icinga-wm_>	 RECOVERY - PHP7 rendering on parse2016 is OK: HTTP OK: HTTP/1.1 302 Found - 521 bytes in 8.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:51:46] <icinga-wm_>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28
[03:51:46] <icinga-wm_>	 MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:53:16] <icinga-wm_>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.16.80:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.80:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%
[03:53:16] <icinga-wm_>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:53:16] <icinga-wm_>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:54:48] <icinga-wm_>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:59:32] <wikibugs>	 (CR) Kevin Bazira: [C: +1] ml-services: deployment of ores-legacy app in staging [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[04:04:22] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:06:38] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:22:10] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:22:16] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[04:22:50] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:23:15] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[04:27:02] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:27:42] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:35:48] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:37:26] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:41:36] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (Krinkle) (I'm responding here in response to an email to the Performance Team.)  This is an exciting project to see happen. We love measuring stuff and are happ...
[04:45:32] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:52:58] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:55:18] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:57:52] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:56:38] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230414T0600)
[06:00:48] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db1107 from puppet [puppet] - https://gerrit.wikimedia.org/r/908684 (https://phabricator.wikimedia.org/T334447)
[06:01:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1107.eqiad.wmnet
[06:06:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:06:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:07:01] <wikibugs>	 (CR) Ayounsi: [C: +2] Remove option to disable vcp_snmp_statistics [homer/public] - https://gerrit.wikimedia.org/r/904177 (owner: Ayounsi)
[06:07:17] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove db1107 from puppet [puppet] - https://gerrit.wikimedia.org/r/908684 (https://phabricator.wikimedia.org/T334447) (owner: Marostegui)
[06:07:39] <wikibugs>	 (Merged) jenkins-bot: Remove option to disable vcp_snmp_statistics [homer/public] - https://gerrit.wikimedia.org/r/904177 (owner: Ayounsi)
[06:08:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1107.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[06:09:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1107.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[06:09:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:09:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1107.eqiad.wmnet
[06:10:34] <wikibugs>	 ops-eqiad, decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (Marostegui) This is ready for DC-Ops
[06:11:15] <wikibugs>	 ops-eqiad, decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (Marostegui) a:Marostegui→Jclark-ctr
[06:11:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:11:20] <wikibugs>	 ops-eqiad, decommission-hardware: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 (Marostegui)
[06:11:58] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1225 [puppet] - https://gerrit.wikimedia.org/r/908685
[06:12:47] <wikibugs>	 (Abandoned) Ayounsi: cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp [homer/public] - https://gerrit.wikimedia.org/r/905946 (https://phabricator.wikimedia.org/T320508) (owner: Ayounsi)
[06:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:16:48] <icinga-wm_>	 PROBLEM - PHP7 rendering on parse2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:17:58] <icinga-wm_>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%
[06:17:58] <icinga-wm_>	 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:18:38] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1225 [puppet] - https://gerrit.wikimedia.org/r/908685 (owner: Marostegui)
[06:19:22] <icinga-wm_>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:20:36] <wikibugs>	 (PS1) Marostegui: site.pp: Add db1217 [puppet] - https://gerrit.wikimedia.org/r/908687
[06:20:55] <wikibugs>	 (CR) Ayounsi: [C: +1] "code and diff lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[06:21:03] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Add db1217 [puppet] - https://gerrit.wikimedia.org/r/908687 (owner: Marostegui)
[06:21:44] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (ayounsi) a:ayounsi→cmooney
[06:25:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 T329352', diff saved to https://phabricator.wikimedia.org/P46679 and previous config saved to /var/cache/conftool/dbconfig/20230414-062553-marostegui.json
[06:25:59] <stashbot>	 T329352: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352
[06:27:30] <wikibugs>	 (PS1) Marostegui: db1100: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/908689 (https://phabricator.wikimedia.org/T329352)
[06:27:32] <wikibugs>	 (PS1) Slyngshede: C:httpd move auto restart to class [puppet] - https://gerrit.wikimedia.org/r/908690
[06:28:06] <wikibugs>	 (CR) Marostegui: [C: +2] db1100: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/908689 (https://phabricator.wikimedia.org/T329352) (owner: Marostegui)
[06:29:39] <wikibugs>	 (CR) CI reject: [V: -1] C:httpd move auto restart to class [puppet] - https://gerrit.wikimedia.org/r/908690 (owner: Slyngshede)
[06:30:06] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) Open→Resolved
[06:30:13] <icinga-wm_>	 RECOVERY - PHP7 rendering on parse2016 is OK: HTTP OK: HTTP/1.1 302 Found - 521 bytes in 8.938 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:30:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:32:18] <wikibugs>	 (CR) Stevemunene: [C: +2] Add referer_name field to druid pageviews hourly and daily tables turnilo [puppet] - https://gerrit.wikimedia.org/r/908272 (https://phabricator.wikimedia.org/T334224) (owner: Snwachukwu)
[06:33:21] <wikibugs>	 (CR) Elukey: "Left two nits, the rest looks good! I'll create the new namespace and puppet configs :)" [deployment-charts] - https://gerrit.wikimedia.org/r/908191 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[06:38:07] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1211 [puppet] - https://gerrit.wikimedia.org/r/908691
[06:38:27] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:38:45] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1211 [puppet] - https://gerrit.wikimedia.org/r/908691 (owner: Marostegui)
[06:39:53] <icinga-wm_>	 PROBLEM - PHP7 rendering on parse2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:39:58] <wikibugs>	 (CR) Elukey: Remove extra check on webrequest _SUCCESS files on HDFS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[06:41:07] <wikibugs>	 (CR) Elukey: Prepare removal of systemd_timer check_webrequest_partitions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[06:48:42] <wikibugs>	 (CR) Ayounsi: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney)
[06:49:02] <wikibugs>	 (PS2) Slyngshede: C:httpd move auto restart to class [puppet] - https://gerrit.wikimedia.org/r/908690
[06:49:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (once NDA confirmed by Legal)" [puppet] - https://gerrit.wikimedia.org/r/908622 (https://phabricator.wikimedia.org/T333884) (owner: Dzahn)
[06:51:04] <wikibugs>	 (CR) Muehlenhoff: "Not all use cases of the httpd class have auto restart enabled, e.g. on mw* we explicitly don't do it (but rather with a cookbook)." [puppet] - https://gerrit.wikimedia.org/r/908690 (owner: Slyngshede)
[06:51:49] <wikibugs>	 (03CR) 10Muehlenhoff: "But we could add a parameter to the class to enable it." [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede)
[06:52:30] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40677/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede)
[06:52:47] <wikibugs>	 (03Abandoned) 10Slyngshede: C:httpd move auto restart to class [puppet] - 10https://gerrit.wikimedia.org/r/908690 (owner: 10Slyngshede)
[06:55:57] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ayounsi) a:ayounsi→cmooney
[06:58:03] <wikibugs>	 (03PS3) 10Aqu: analytics: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073)
[06:58:26] <wikibugs>	 (03PS2) 10Aqu: analytics: Prepare removal of systemd_timer check_webrequest_partitions [puppet] - 10https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073)
[06:59:18] <wikibugs>	 (03CR) 10Aqu: "Thanks for the check Elukey" [puppet] - 10https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230414T0700)
[07:03:57] <wikibugs>	 SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi) a:cmooney
[07:04:12] <icinga-wm_>	 RECOVERY - PHP7 rendering on parse2016 is OK: HTTP OK: HTTP/1.1 302 Found - 521 bytes in 8.963 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:05:07] <jinxer-wm>	 (ProbeDown) firing: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:10:07] <jinxer-wm>	 (ProbeDown) resolved: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:11:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:12:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:26:20] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:27:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete pinning after recent toolsdb migration [puppet] - https://gerrit.wikimedia.org/r/907717 (owner: Muehlenhoff)
[07:28:07] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff)
[07:31:10] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:34:26] <wikibugs>	 (03PS1) 10Slyngshede: Sphinx: Start work on documentation [software/bitu] - 10https://gerrit.wikimedia.org/r/908769
[07:35:14] <wikibugs>	 (CR) Vgutierrez: [C: +2] Revert "hiera: Enable esitest on text@eqsin" [puppet] - https://gerrit.wikimedia.org/r/908569 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez)
[07:36:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:36:48] <wikibugs>	 (03PS2) 10Slyngshede: Password reset - Allow users to request a password reset. [software/bitu] - 10https://gerrit.wikimedia.org/r/900277
[07:39:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[07:39:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[07:40:14] <wikibugs>	 (03PS6) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519
[07:41:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:43:17] <wikibugs>	 (03Abandoned) 10Slyngshede: Access Requests, allow users to request more permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/870747 (owner: 10Slyngshede)
[07:44:30] <wikibugs>	 (03PS7) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182
[07:45:46] <wikibugs>	 (CR) Filippo Giunchedi: dcops: add netdev duplex and speed checks (1 comment) [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond)
[07:50:26] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:51:25] <wikibugs>	 (03PS3) 10David Caro: build: add helper scripts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890
[07:51:31] <wikibugs>	 (CR) David Caro: build: add helper scripts (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/907890 (owner: David Caro)
[07:54:47] <wikibugs>	 (CR) Slyngshede: [C: +2] LDAP attribute editor [software/bitu] - https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: Slyngshede)
[07:54:49] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] LDAP attribute editor [software/bitu] - https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: Slyngshede)
[07:55:14] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:55:20] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[07:55:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Re...
[07:56:04] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:56:50] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:00:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix test for installing the puppet5 component [puppet] - 10https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495)
[08:01:09] <wikibugs>	 (CR) CI reject: [V: -1] Fix test for installing the puppet5 component [puppet] - https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[08:01:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM to my untrained eye!" [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[08:07:24] <wikibugs>	 (03PS1) 10Filippo Giunchedi: webperf: fix puppet on arclamp* [puppet] - 10https://gerrit.wikimedia.org/r/908774 (https://phabricator.wikimedia.org/T334577)
[08:07:32] <godog>	 slyngs: FYI ^
[08:08:50] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix test for installing the puppet5 component [puppet] - 10https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495)
[08:10:25] <wikibugs>	 (03PS1) 10Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707)
[08:10:48] <wikibugs>	 (CR) CI reject: [V: -1] analytics: Add purge job for webrequest data loss reports [puppet] - https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: Aqu)
[08:11:40] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good. Surprised that Puppet didn't complain earlier." [puppet] - https://gerrit.wikimedia.org/r/908774 (https://phabricator.wikimedia.org/T334577) (owner: Filippo Giunchedi)
[08:12:57] <wikibugs>	 (03PS2) 10Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707)
[08:13:18] <godog>	 slyngs: thank you for the quick review! puppet did complain btw for arclamp hosts
[08:13:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] webperf: fix puppet on arclamp* [puppet] - https://gerrit.wikimedia.org/r/908774 (https://phabricator.wikimedia.org/T334577) (owner: Filippo Giunchedi)
[08:13:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix test for installing the puppet5 component [puppet] - https://gerrit.wikimedia.org/r/908772 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[08:13:44] <slyngs>	 godog: Ah, in that case I completely understand Puppet :-)
[08:13:59] <godog>	 hehehe
[08:14:02] <godog>	 moritzm: merged your change too
[08:14:05] <moritzm>	 ack, thx
[08:18:29] <wikibugs>	 (03PS3) 10Aqu: analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707)
[08:21:21] <arturo>	 !log aborrero@apt2001:~ $ sudo -i reprepro --noskipold  --component thirdparty/kubeadm-k8s-1-23 update buster-wikimedia (T298005)
[08:21:23] <wikibugs>	 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10fgiunchedi)
[08:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:26] <stashbot>	 T298005: Upgrade Toolforge Kubernetes to version 1.23 - https://phabricator.wikimedia.org/T298005
[08:21:29] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi)
[08:21:40] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:22:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[08:23:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:23:18] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:23:31] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[08:28:04] <wikibugs>	 (CR) JMeybohm: thumbor: make tmp-dir configurable, default disabled (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[08:28:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:31:35] <wikibugs>	 (CR) JMeybohm: thumbor: make tmp-dir configurable, default disabled (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[08:35:14] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:35:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[08:36:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[08:38:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:38:22] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:39:49] <wikibugs>	 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech, 10wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10ItamarWMDE)
[08:43:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:44:58] <wikibugs>	 (CR) Kamila Součková: [C: +1] rest-gateway: support for proton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[08:45:43] <wikibugs>	 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi)
[08:51:08] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:51:26] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[08:51:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Re...
[08:52:44] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:56:54] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:59:14] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:02:14] <wikibugs>	 SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Tgr) Open→Resolved a:Tgr La...
[09:02:56] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40678/console" [puppet] - https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[09:03:05] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] analytics: Prepare removal of systemd_timer check_webrequest_partitions [puppet] - https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) (owner: Aqu)
[09:04:04] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:05:41] <wikibugs>	 (03PS4) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007)
[09:06:14] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[09:07:17] <wikibugs>	 (CR) CI reject: [V: -1] dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond)
[09:08:06] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:24] <wikibugs>	 (03PS5) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007)
[09:08:54] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers parse2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:12:18] <claime>	 parse2016 is kinda hammered
[09:12:23] <claime>	 But it's not "down" down.
[09:12:34] <claime>	 72 load avg tho
[09:12:35] <vgutierrez>	 well... >5s for a request is down :)
[09:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:13:01] <claime>	 vgutierrez: Oh I agree
[09:13:03] <vgutierrez>	 parsoid has 9 nodes of 20 depooled in codfw
[09:13:08] <claime>	 But I don't really see what I can do about it
[09:13:16] <vgutierrez>	 can we pooled some?
[09:13:17] <vgutierrez>	 *pool
[09:13:34] <claime>	 Hmm why the heck are they depooled is the question >_>
[09:15:25] <claime>	 Ok I don't see anything relevant
[09:15:29] <claime>	 (in SAL
[09:15:31] <claime>	 )
[09:15:43] <vgutierrez>	 yep.. I'm failing to find anything regarding the depooled servers
[09:15:47] <claime>	 So I'd say yeah, we can repool them
[09:16:16] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2002.codfw.wmnet with reason: systemd package upgrade
[09:16:31] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2002.codfw.wmnet with reason: systemd package upgrade
[09:16:32] <claime>	 They should still have gotten scap deployments but I'll run a pull on them just to be sure, and repool them
[09:17:06] <vgutierrez>	 last reference to parsoid being depooled in codfw seems to be T327925
[09:17:07] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[09:17:33] <vgutierrez>	 (and that got repooled the very same day apparently)
[09:17:35] <claime>	 yeah  but it's not even those servers
[09:17:49] <vgutierrez>	 well.. dc=codfw,cluster=parsoid
[09:18:29] <claime>	 yeah but it's not even dc=codfw,cluster=parsoid - the servers in the task
[09:18:32] <claime>	 It's a mish mash
[09:19:18] <vgutierrez>	 dc=codfw,cluster=parsoid was the  selector logged by conftool
[09:21:01] <logmsgbot>	 !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[09:22:19] <claime>	 Ok they're all up to date on scap deployments, repooling
[09:22:24] <vgutierrez>	 greaet
[09:22:26] <vgutierrez>	 *great
[09:22:39] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid
[09:23:05] <wikibugs>	 (03PS6) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007)
[09:23:18] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:23:44] <claime>	 It works much better when 50% of the cluster isn't depooled tbh
[09:24:10] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:24:17] <wikibugs>	 (03PS7) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007)
[09:25:23] <wikibugs>	 (CR) CI reject: [V: -1] dcops: add netdev duplex and speed checks [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond)
[09:25:39] <wikibugs>	 (03CR) 10Jbond: "updated" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond)
[09:26:11] <wikibugs>	 (03CR) 10Muehlenhoff: "First round of comments, but looks good in general" [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede)
[09:27:13] <wikibugs>	 (03PS8) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007)
[09:30:20] <wikibugs>	 (03PS1) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291)
[09:33:39] <wikibugs>	 (CR) CI reject: [V: -1] admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[09:36:38] <icinga-wm_>	 RECOVERY - Check systemd state on kubemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:14] <claime>	 I don't know what happened but there are also a **lot** of mw appservers depooled in codfw
[09:38:45] <claime>	 vgutierrez: Looks like remnants from what happened around 1:00 UTC
[09:39:17] <claime>	 I'll repool because we can't really go on with 89 mw appservers depooled, can we
[09:41:34] <wikibugs>	 (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[09:41:40] <wikibugs>	 (03PS9) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007)
[09:42:34] <vgutierrez>	 claime: what do you mean? we don't have any action on the SAL indicating that servers were depooled last night
[09:42:59] <claime>	 vgutierrez: No you're right, I saw the alert for servers marked down, but they were not depooled
[09:45:21] <claime>	 I'm going through SAL trying to find when they could have been depooled and coming up empty
[09:45:37] <logmsgbot>	 !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[09:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:51:03] <claime>	 scap pull done on all of them, repooling
[09:53:18] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2.*.codfw.wmnet,cluster=appserver
[09:53:39] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2.*.codfw.wmnet,cluster=api_appserver
[10:02:24] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubemaster2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:02:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Use signed-by notation for component/puppet5 [puppet] - 10https://gerrit.wikimedia.org/r/908789 (https://phabricator.wikimedia.org/T330495)
[10:03:49] <wikibugs>	 (03CR) 10Jbond: "I think this is fine, however i would prefer it if we set this up as a module in gitlab so that we could add CI.  then add this module to " [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[10:04:38] <wikibugs>	 (03PS1) 10Jameel Kaisar: Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028)
[10:05:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond)
[10:05:39] <wikibugs>	 (CR) Jbond: opensearch_dashboards: add package provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[10:05:41] <wikibugs>	 (03PS2) 10Jameel Kaisar: Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028)
[10:06:02] <wikibugs>	 (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[10:06:50] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:39] <wikibugs>	 (03PS1) 10Elukey: Add new images to support AMD GPUs on k8s [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908792 (https://phabricator.wikimedia.org/T333009)
[10:08:29] <logmsgbot>	 !log kamila@deploy2002 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[10:16:03] <wikibugs>	 (CR) David Caro: build: add helper scripts (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/907890 (owner: David Caro)
[10:16:12] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:25] <wikibugs>	 (03PS4) 10David Caro: build: add helper scripts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890
[10:17:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Use signed-by notation for component/puppet5 [puppet] - https://gerrit.wikimedia.org/r/908789 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:17:43] <wikibugs>	 (03CR) 10JMeybohm: "hm..works on my machine 😄" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[10:20:54] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[10:26:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[10:30:18] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:32:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1120.eqiad.wmnet
[10:33:14] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove db1120 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/908793 (https://phabricator.wikimedia.org/T334580)
[10:36:33] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove db1120 from puppet [puppet] - https://gerrit.wikimedia.org/r/908793 (https://phabricator.wikimedia.org/T334580) (owner: Marostegui)
[10:37:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[10:37:14] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) The patch above fixes the problem, tested in mwdebug2001. Now I need someone to review and merge it, I'll depl...
[10:39:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1120.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[10:40:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1120.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[10:40:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:40:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1120.eqiad.wmnet
[10:41:18] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 (10Marostegui)
[10:41:36] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 (10Marostegui)
[10:43:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[10:43:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Re...
[10:46:27] <wikibugs>	 (03PS1) 10Kamila Součková: thumbor: correct comments around tmp_empty_dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795
[10:49:18] <logmsgbot>	 !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[10:52:14] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:00:48] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:26] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Move s2 and s3 backups from db1102 to db1225 [puppet] - 10https://gerrit.wikimedia.org/r/908798 (https://phabricator.wikimedia.org/T334057)
[11:05:20] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:07:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Pass -y --force-yes to puppet installation on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908799 (https://phabricator.wikimedia.org/T330495)
[11:15:24] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Pass -y --force-yes to puppet installation on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908799 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[11:22:34] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Vgutierrez)
[11:26:01] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] thumbor: correct comments around tmp_empty_dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 (owner: 10Kamila Součková)
[11:27:19] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908795 (owner: 10Kamila Součková)
[11:30:00] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Handle h/2 coalescing issue for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908790 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[11:32:41] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291)
[11:34:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[11:34:15] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10phaultfinder)
[11:34:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[11:37:46] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Samwalton9) Huge thanks all!
[11:37:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[11:39:05] <wikibugs>	 (03PS7) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519
[11:41:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[11:41:23] <wikibugs>	 (03CR) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. (039 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/899519 (owner: 10Slyngshede)
[11:41:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[11:41:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46680 and previous config saved to /var/cache/conftool/dbconfig/20230414-114148-ladsgroup.json
[11:41:54] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[11:41:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[11:42:01] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 3 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) Thanks a lot! Can we make a user notice in th...
[11:42:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[11:42:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46681 and previous config saved to /var/cache/conftool/dbconfig/20230414-114219-ladsgroup.json
[11:43:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[11:43:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[11:43:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[11:43:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46682 and previous config saved to /var/cache/conftool/dbconfig/20230414-114356-ladsgroup.json
[11:44:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[11:44:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T333332)', diff saved to https://phabricator.wikimedia.org/P46683 and previous config saved to /var/cache/conftool/dbconfig/20230414-114407-ladsgroup.json
[11:44:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46684 and previous config saved to /var/cache/conftool/dbconfig/20230414-114429-ladsgroup.json
[11:46:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T333332)', diff saved to https://phabricator.wikimedia.org/P46685 and previous config saved to /var/cache/conftool/dbconfig/20230414-114619-ladsgroup.json
[11:50:27] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[11:50:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**)   - Re...
[11:50:44] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:28] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: gitlab: test out gitlab actions with a stable puppet module - https://phabricator.wikimedia.org/T334723 (10jbond) p:05Triage→03Medium
[11:53:17] <wikibugs>	 (03PS1) 10Jbond: debian: move debian package to gitlab/vendored_modules [puppet] - 10https://gerrit.wikimedia.org/r/908805 (https://phabricator.wikimedia.org/T334723)
[11:58:39] <wikibugs>	 (03PS12) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[11:59:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P46686 and previous config saved to /var/cache/conftool/dbconfig/20230414-115903-ladsgroup.json
[11:59:35] <wikibugs>	 (03Abandoned) 10Elukey: Add new images to support AMD GPUs on k8s [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908792 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey)
[11:59:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46687 and previous config saved to /var/cache/conftool/dbconfig/20230414-115935-ladsgroup.json
[11:59:51] <wikibugs>	 (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[12:01:18] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P46688 and previous config saved to /var/cache/conftool/dbconfig/20230414-120125-ladsgroup.json
[12:03:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui)
[12:04:09] <wikibugs>	 (03PS1) 10Jelto: install_server: configure root raid only on gitlab-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172)
[12:07:13] <wikibugs>	 (03CR) 10Jelto: "I'm unable to find a recipe to configure two independent raids on four disks. This change mostly rolls back to https://gerrit.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[12:08:40] <wikibugs>	 (03CR) 10Jelto: install_server: configure root raid only on gitlab-raid1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[12:09:53] <wikibugs>	 (03CR) 10JMeybohm: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[12:13:43] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] analytics: Add purge job for webrequest data loss reports [puppet] - 10https://gerrit.wikimedia.org/r/908777 (https://phabricator.wikimedia.org/T332707) (owner: 10Aqu)
[12:14:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P46689 and previous config saved to /var/cache/conftool/dbconfig/20230414-121409-ladsgroup.json
[12:14:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P46690 and previous config saved to /var/cache/conftool/dbconfig/20230414-121442-ladsgroup.json
[12:16:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P46691 and previous config saved to /var/cache/conftool/dbconfig/20230414-121632-ladsgroup.json
[12:16:58] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [alerts] - 10https://gerrit.wikimedia.org/r/908830 (owner: 10Clément Goubert)
[12:20:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Install ruby-sorted-set on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495)
[12:25:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "That sounds like a reasonable compromise" [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[12:27:14] <wikibugs>	 (03PS3) 10Clément Goubert: team-sre: add alert on mediawiki pooled percentage [alerts] - 10https://gerrit.wikimedia.org/r/908830
[12:27:17] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[12:28:15] <wikibugs>	 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)
[12:29:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46692 and previous config saved to /var/cache/conftool/dbconfig/20230414-122915-ladsgroup.json
[12:29:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[12:29:21] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[12:29:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[12:29:34] <wikibugs>	 (03CR) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[12:29:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T333332)', diff saved to https://phabricator.wikimedia.org/P46693 and previous config saved to /var/cache/conftool/dbconfig/20230414-122939-ladsgroup.json
[12:29:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46694 and previous config saved to /var/cache/conftool/dbconfig/20230414-122948-ladsgroup.json
[12:29:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[12:30:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Install ruby-sorted-set on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[12:30:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[12:30:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46695 and previous config saved to /var/cache/conftool/dbconfig/20230414-123011-ladsgroup.json
[12:30:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T333332)', diff saved to https://phabricator.wikimedia.org/P46696 and previous config saved to /var/cache/conftool/dbconfig/20230414-123047-ladsgroup.json
[12:31:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T333332)', diff saved to https://phabricator.wikimedia.org/P46697 and previous config saved to /var/cache/conftool/dbconfig/20230414-123138-ladsgroup.json
[12:31:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[12:31:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[12:32:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46698 and previous config saved to /var/cache/conftool/dbconfig/20230414-123201-ladsgroup.json
[12:32:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T333332)', diff saved to https://phabricator.wikimedia.org/P46699 and previous config saved to /var/cache/conftool/dbconfig/20230414-123221-ladsgroup.json
[12:34:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46700 and previous config saved to /var/cache/conftool/dbconfig/20230414-123413-ladsgroup.json
[12:34:50] <wikibugs>	 (03CR) 10Muehlenhoff: Password reset - Allow users to request a password reset. (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede)
[12:38:57] <wikibugs>	 (03PS14) 10Jbond: opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[12:40:31] <wikibugs>	 (03CR) 10Jbond: opensearch_dashboards: add package provider (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[12:40:56] <wikibugs>	 (03CR) 10Jbond: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[12:45:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P46701 and previous config saved to /var/cache/conftool/dbconfig/20230414-124553-ladsgroup.json
[12:47:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46702 and previous config saved to /var/cache/conftool/dbconfig/20230414-124727-ladsgroup.json
[12:49:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P46703 and previous config saved to /var/cache/conftool/dbconfig/20230414-124920-ladsgroup.json
[12:51:50] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:27] <wikibugs>	 (03CR) 10Muehlenhoff: "First pass of comments, this is looking good in general!" [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 (owner: 10Slyngshede)
[12:58:40] <wikibugs>	 (03PS3) 10Slyngshede: Password reset - Allow users to request a password reset. [software/bitu] - 10https://gerrit.wikimedia.org/r/900277
[12:58:47] <wikibugs>	 (03CR) 10Slyngshede: Password reset - Allow users to request a password reset. (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 (owner: 10Slyngshede)
[13:01:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P46704 and previous config saved to /var/cache/conftool/dbconfig/20230414-130101-ladsgroup.json
[13:01:12] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P46705 and previous config saved to /var/cache/conftool/dbconfig/20230414-130234-ladsgroup.json
[13:04:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P46706 and previous config saved to /var/cache/conftool/dbconfig/20230414-130426-ladsgroup.json
[13:07:21] <ottomata>	 !log creating User:ANONYMOUS ACLs on kafka-test cluster https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_ACLs
[13:07:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:02] <ottomata>	 !log granting IdempotentWrite on kafka jumbo-eqiad cluster to User:ANONYMOUS - this will allow for use of newer kafka producers that have enabled transactional writes by default.  `kafka acls  --add --allow-principal User:ANONYMOUS --cluster --operation IdempotentWrite`
[13:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:16:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T333332)', diff saved to https://phabricator.wikimedia.org/P46707 and previous config saved to /var/cache/conftool/dbconfig/20230414-131607-ladsgroup.json
[13:16:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[13:16:13] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:16:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[13:16:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46708 and previous config saved to /var/cache/conftool/dbconfig/20230414-131631-ladsgroup.json
[13:17:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46709 and previous config saved to /var/cache/conftool/dbconfig/20230414-131739-ladsgroup.json
[13:17:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[13:17:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[13:17:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:18:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:18:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46710 and previous config saved to /var/cache/conftool/dbconfig/20230414-131824-ladsgroup.json
[13:19:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T333332)', diff saved to https://phabricator.wikimedia.org/P46711 and previous config saved to /var/cache/conftool/dbconfig/20230414-131932-ladsgroup.json
[13:19:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[13:19:48] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Ottomata)
[13:19:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[13:19:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T333332)', diff saved to https://phabricator.wikimedia.org/P46712 and previous config saved to /var/cache/conftool/dbconfig/20230414-131956-ladsgroup.json
[13:20:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46713 and previous config saved to /var/cache/conftool/dbconfig/20230414-132034-ladsgroup.json
[13:22:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T333332)', diff saved to https://phabricator.wikimedia.org/P46714 and previous config saved to /var/cache/conftool/dbconfig/20230414-132208-ladsgroup.json
[13:22:13] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:23:34] <icinga-wm_>	 PROBLEM - Check systemd state on doc2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_php7.3-fpm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) @Papaul  replaced dac cable
[13:30:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:31:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:32:30] <wikibugs>	 (03PS14) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[13:32:32] <wikibugs>	 (03PS1) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955)
[13:32:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P46715 and previous config saved to /var/cache/conftool/dbconfig/20230414-133245-ladsgroup.json
[13:32:47] <wikibugs>	 (03CR) 10David Caro: "Untested for now" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[13:34:58] <wikibugs>	 (03PS1) 10Slyngshede: Fix bug where connection timeout is read as tuple. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845
[13:35:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Fix bug where connection timeout is read as tuple. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845 (owner: 10Slyngshede)
[13:35:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46716 and previous config saved to /var/cache/conftool/dbconfig/20230414-133540-ladsgroup.json
[13:37:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P46717 and previous config saved to /var/cache/conftool/dbconfig/20230414-133714-ladsgroup.json
[13:37:18] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:37:18] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): WMCS Cookbook Automation FY2022-23 Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) a:03fnegri
[13:37:30] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:37:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS buster
[13:37:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS buster
[13:37:43] <wikibugs>	 (03PS2) 10Slyngshede: Fix bug where connection timeout is read as tuple. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845
[13:42:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:42:10] <wikibugs>	 10SRE, 10Traffic, 10conftool, 10serviceops: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10Clement_Goubert) For future reference, this left 89 out of 280 appservers and 9 out of 20 parsoid servers depooled in codf...
[13:42:12] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:42:16] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:42:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:42:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[13:44:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:45:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:45:03] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:45:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[13:45:16] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.404 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:46:43] <wikibugs>	 (PS1) Andrew Bogott: Move cloudvirtlocal1001 back to 'insetup' [puppet] - https://gerrit.wikimedia.org/r/908848 (https://phabricator.wikimedia.org/T334696)
[13:47:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P46718 and previous config saved to /var/cache/conftool/dbconfig/20230414-134751-ladsgroup.json
[13:48:27] <wikibugs>	 (PS2) Andrew Bogott: Move cloudvirtlocal1001 back to 'insetup' [puppet] - https://gerrit.wikimedia.org/r/908848 (https://phabricator.wikimedia.org/T334696)
[13:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:49:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Move cloudvirtlocal1001 back to 'insetup' [puppet] - https://gerrit.wikimedia.org/r/908848 (https://phabricator.wikimedia.org/T334696) (owner: Andrew Bogott)
[13:50:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P46719 and previous config saved to /var/cache/conftool/dbconfig/20230414-135047-ladsgroup.json
[13:51:05] <wikibugs>	 (CR) Slyngshede: Read systems and approval rules from YAML file. (7 comments) [software/bitu] - https://gerrit.wikimedia.org/r/895182 (owner: Slyngshede)
[13:51:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:51:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[13:52:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P46720 and previous config saved to /var/cache/conftool/dbconfig/20230414-135220-ladsgroup.json
[13:53:32] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:53:44] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:56:37] <wikibugs>	 (PS15) David Caro: maintain-dbusers: use click for cli definition [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[13:57:53] <wikibugs>	 (CR) David Caro: "Tested manually by deleting and the letting it recreate a tool account, got the stats:" [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[14:00:13] <wikibugs>	 (PS1) Andrew Bogott: Nova/cloudvirtlocal: force replacement of /var/lib/nova/instances [puppet] - https://gerrit.wikimedia.org/r/908851
[14:02:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46721 and previous config saved to /var/cache/conftool/dbconfig/20230414-140258-ladsgroup.json
[14:03:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[14:03:05] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:03:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[14:03:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[14:03:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[14:03:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:03:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:04:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T333332)', diff saved to https://phabricator.wikimedia.org/P46722 and previous config saved to /var/cache/conftool/dbconfig/20230414-140401-ladsgroup.json
[14:04:38] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova/cloudvirtlocal: force replacement of /var/lib/nova/instances [puppet] - https://gerrit.wikimedia.org/r/908851 (owner: Andrew Bogott)
[14:05:14] <wikibugs>	 SRE, Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (ssingh)
[14:05:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46723 and previous config saved to /var/cache/conftool/dbconfig/20230414-140553-ladsgroup.json
[14:05:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[14:06:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[14:06:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46724 and previous config saved to /var/cache/conftool/dbconfig/20230414-140616-ladsgroup.json
[14:07:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46725 and previous config saved to /var/cache/conftool/dbconfig/20230414-140725-ladsgroup.json
[14:07:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[14:07:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[14:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T333332)', diff saved to https://phabricator.wikimedia.org/P46726 and previous config saved to /var/cache/conftool/dbconfig/20230414-140749-ladsgroup.json
[14:09:44] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.365 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:09:56] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 6.897 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:10:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T333332)', diff saved to https://phabricator.wikimedia.org/P46727 and previous config saved to /var/cache/conftool/dbconfig/20230414-141002-ladsgroup.json
[14:10:07] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:11:13] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:11:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[14:11:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:11:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:12:37] <wikibugs>	 (PS6) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[14:12:39] <wikibugs>	 (PS8) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[14:12:41] <wikibugs>	 (PS5) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326
[14:12:43] <wikibugs>	 (PS39) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:15:12] <wikibugs>	 (PS40) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:17:37] <wikibugs>	 (CR) CI reject: [V: -1] wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: Jbond)
[14:18:01] <wikibugs>	 (PS41) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:19:32] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:19:40] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:21:15] <claime>	 !log rebooting lists1001 for cpu bump
[14:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:26] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.428 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:22:32] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:22:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46728 and previous config saved to /var/cache/conftool/dbconfig/20230414-142232-ladsgroup.json
[14:25:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P46729 and previous config saved to /var/cache/conftool/dbconfig/20230414-142508-ladsgroup.json
[14:25:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T333332)', diff saved to https://phabricator.wikimedia.org/P46730 and previous config saved to /var/cache/conftool/dbconfig/20230414-142518-ladsgroup.json
[14:25:23] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:26:29] <wikibugs>	 (PS42) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:27:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage
[14:29:30] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage
[14:29:41] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:29:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[14:30:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:30:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:30:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[14:31:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Jclark-ctr) Updated netbox and idracs on all three servers  frbast1002: vlan:frack-bastion-eqiad ip:10.64.40.196 frmon1002: vlan:frack-administration-e...
[14:32:28] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts pybal-test2001.codfw.wmnet
[14:34:20] <wikibugs>	 (PS43) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:34:52] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:36:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[14:37:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mngmt dns fundrasing - jclark@cumin1001"
[14:37:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P46731 and previous config saved to /var/cache/conftool/dbconfig/20230414-143738-ladsgroup.json
[14:38:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pybal-test2001.codfw.wmnet
[14:38:12] <wikibugs>	 SRE, Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `pybal-test2001.codfw.wmnet` - pybal-test2001.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanage...
[14:38:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts pybal-test2002.codfw.wmnet
[14:38:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mngmt dns fundrasing - jclark@cumin1001"
[14:38:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:39:08] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: support for proton [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[14:40:01] <wikibugs>	 (PS44) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:40:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P46732 and previous config saved to /var/cache/conftool/dbconfig/20230414-144014-ladsgroup.json
[14:40:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P46733 and previous config saved to /var/cache/conftool/dbconfig/20230414-144024-ladsgroup.json
[14:41:40] <wikibugs>	 (PS1) Ssingh: Remove outdated references to pybal-test200[12] [puppet] - https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T321309)
[14:42:51] <wikibugs>	 (PS45) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:44:24] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (CDanis) >>! In T332024#8780999, @Krinkle wrote: > (I'm responding here in response to an email to the Peformance Team.) >  > This is an exciting project to se...
[14:44:32] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: support for proton [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[14:45:28] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[14:46:19] <wikibugs>	 (PS46) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[14:47:56] <wikibugs>	 (CR) JMeybohm: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond)
[14:48:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pybal-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:49:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pybal-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:49:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:49:37] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pybal-test2002.codfw.wmnet
[14:49:41] <wikibugs>	 SRE, Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `pybal-test2002.codfw.wmnet` - pybal-test2002.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanage...
[14:50:52] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46734 and previous config saved to /var/cache/conftool/dbconfig/20230414-145245-ladsgroup.json
[14:52:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[14:52:50] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:53:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[14:53:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[14:53:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[14:53:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46735 and previous config saved to /var/cache/conftool/dbconfig/20230414-145327-ladsgroup.json
[14:54:21] <wikibugs>	 (PS2) Ssingh: Remove outdated references to pybal-test200[12] [puppet] - https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T334745)
[14:55:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T333332)', diff saved to https://phabricator.wikimedia.org/P46736 and previous config saved to /var/cache/conftool/dbconfig/20230414-145521-ladsgroup.json
[14:55:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[14:55:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P46737 and previous config saved to /var/cache/conftool/dbconfig/20230414-145531-ladsgroup.json
[14:55:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46738 and previous config saved to /var/cache/conftool/dbconfig/20230414-145537-ladsgroup.json
[14:55:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[14:55:39] <wikibugs>	 (CR) Kamila Součková: [C: +2] thumbor: correct comments around tmp_empty_dir [deployment-charts] - https://gerrit.wikimedia.org/r/908795 (owner: Kamila Součková)
[14:55:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T333332)', diff saved to https://phabricator.wikimedia.org/P46739 and previous config saved to /var/cache/conftool/dbconfig/20230414-145544-ladsgroup.json
[14:55:58] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet
[14:56:44] <wikibugs>	 (PS3) Ssingh: Remove outdated references to pybal-test200[12] [puppet] - https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T334745)
[14:57:41] <wikibugs>	 (CR) Ssingh: [C: +2] Remove outdated references to pybal-test200[12] [puppet] - https://gerrit.wikimedia.org/r/908860 (https://phabricator.wikimedia.org/T334745) (owner: Ssingh)
[14:57:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T333332)', diff saved to https://phabricator.wikimedia.org/P46740 and previous config saved to /var/cache/conftool/dbconfig/20230414-145756-ladsgroup.json
[14:58:03] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:00:06] <wikibugs>	 (PS1) Hnowlan: svg: use rsvg-convert output flag [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725)
[15:00:29] <wikibugs>	 (PS3) JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291)
[15:00:36] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:10] <wikibugs>	 (Merged) jenkins-bot: thumbor: correct comments around tmp_empty_dir [deployment-charts] - https://gerrit.wikimedia.org/r/908795 (owner: Kamila Součková)
[15:04:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:04:39] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:05:06] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:10] <wikibugs>	 (PS4) JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291)
[15:08:00] <wikibugs>	 (CR) CI reject: [V: -1] svg: use rsvg-convert output flag [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: Hnowlan)
[15:10:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T333332)', diff saved to https://phabricator.wikimedia.org/P46741 and previous config saved to /var/cache/conftool/dbconfig/20230414-151037-ladsgroup.json
[15:10:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:10:43] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:10:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:10:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46742 and previous config saved to /var/cache/conftool/dbconfig/20230414-151043-ladsgroup.json
[15:10:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[15:11:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[15:11:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46743 and previous config saved to /var/cache/conftool/dbconfig/20230414-151108-ladsgroup.json
[15:12:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46744 and previous config saved to /var/cache/conftool/dbconfig/20230414-151216-ladsgroup.json
[15:12:51] <wikibugs>	 SRE, Traffic: Deprecate pybal test hosts pybal-test200[12] - https://phabricator.wikimedia.org/T334745 (ssingh) Open→Resolved a:ssingh Hosts decommissioned and removed from Puppet.
[15:13:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P46745 and previous config saved to /var/cache/conftool/dbconfig/20230414-151303-ladsgroup.json
[15:14:02] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[15:14:45] <wikibugs>	 (PS5) JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291)
[15:15:14] <icinga-wm_>	 PROBLEM - DPKG on dse-k8s-worker1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[15:15:16] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:04] <wikibugs>	 SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (odimitrijevic)
[15:24:39] <wikibugs>	 (PS6) JMeybohm: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291)
[15:24:46] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[15:25:24] <wikibugs>	 (CR) JHathaway: [C: +1] Install ruby-sorted-set on Bookworm [puppet] - https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[15:25:40] <icinga-wm_>	 PROBLEM - mailman3-web on lists1001 is CRITICAL: PROCS CRITICAL: 5 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:25:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P46746 and previous config saved to /var/cache/conftool/dbconfig/20230414-152550-ladsgroup.json
[15:26:52] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[15:26:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[15:27:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P46747 and previous config saved to /var/cache/conftool/dbconfig/20230414-152722-ladsgroup.json
[15:28:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P46748 and previous config saved to /var/cache/conftool/dbconfig/20230414-152809-ladsgroup.json
[15:32:50] <wikibugs>	 (CR) JMeybohm: [C: +2] admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:36:38] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye
[15:37:00] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1013.eqiad.wmnet with OS bullseye
[15:38:53] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Remove if-guards for k8s 1.16 [deployment-charts] - https://gerrit.wikimedia.org/r/908788 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:38:55] <wikibugs>	 (CR) Hnowlan: "recheck" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: Hnowlan)
[15:40:46] <icinga-wm_>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[15:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T333332)', diff saved to https://phabricator.wikimedia.org/P46749 and previous config saved to /var/cache/conftool/dbconfig/20230414-154056-ladsgroup.json
[15:40:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[15:41:02] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:41:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[15:41:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46750 and previous config saved to /var/cache/conftool/dbconfig/20230414-154119-ladsgroup.json
[15:42:26] <icinga-wm_>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[15:42:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P46751 and previous config saved to /var/cache/conftool/dbconfig/20230414-154228-ladsgroup.json
[15:43:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T333332)', diff saved to https://phabricator.wikimedia.org/P46752 and previous config saved to /var/cache/conftool/dbconfig/20230414-154316-ladsgroup.json
[15:43:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[15:43:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46753 and previous config saved to /var/cache/conftool/dbconfig/20230414-154329-ladsgroup.json
[15:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[15:43:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T333332)', diff saved to https://phabricator.wikimedia.org/P46754 and previous config saved to /var/cache/conftool/dbconfig/20230414-154339-ladsgroup.json
[15:45:03] <wikibugs>	 SRE-Unowned: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (HShaikh)
[15:45:25] <wikibugs>	 SRE-Unowned: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (HShaikh)
[15:45:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T333332)', diff saved to https://phabricator.wikimedia.org/P46755 and previous config saved to /var/cache/conftool/dbconfig/20230414-154551-ladsgroup.json
[15:46:28] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[15:52:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:52:57] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:53:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[15:55:10] <wikibugs>	 (PS1) Hnowlan: rest-gateway: fix typo [deployment-charts] - https://gerrit.wikimedia.org/r/908891 (https://phabricator.wikimedia.org/T334611)
[15:55:43] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team, netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (fnegri)
[15:57:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T333332)', diff saved to https://phabricator.wikimedia.org/P46756 and previous config saved to /var/cache/conftool/dbconfig/20230414-155735-ladsgroup.json
[15:57:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[15:57:40] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:57:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[15:57:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46757 and previous config saved to /var/cache/conftool/dbconfig/20230414-155758-ladsgroup.json
[15:58:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46758 and previous config saved to /var/cache/conftool/dbconfig/20230414-155835-ladsgroup.json
[16:00:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P46759 and previous config saved to /var/cache/conftool/dbconfig/20230414-160058-ladsgroup.json
[16:06:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS bullseye
[16:06:17] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1013.eqiad.wmnet with OS bullseye completed: - lvs1013 (**PASS**)   - Downtimed on Icinga/Aler...
[16:12:04] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[16:13:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P46760 and previous config saved to /var/cache/conftool/dbconfig/20230414-161341-ladsgroup.json
[16:16:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P46761 and previous config saved to /var/cache/conftool/dbconfig/20230414-161604-ladsgroup.json
[16:16:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Papaul) @Jgreen can you please confirm that you an not access those servers so you can take over the task?  thanks
[16:18:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Jgreen) >>! In T319460#8782417, @Papaul wrote: > @Jgreen can you please confirm that you an not access those servers so you can take over the task? >...
[16:20:00] <wikibugs>	 (CR) Pmiazga: rest-gateway: support for proton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[16:27:30] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[16:27:45] <wikibugs>	 (PS1) JHathaway: lists: Bump the number worker processes to 4 [puppet] - https://gerrit.wikimedia.org/r/908896
[16:28:31] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[16:28:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T333332)', diff saved to https://phabricator.wikimedia.org/P46762 and previous config saved to /var/cache/conftool/dbconfig/20230414-162848-ladsgroup.json
[16:28:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[16:28:53] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:29:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[16:29:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46763 and previous config saved to /var/cache/conftool/dbconfig/20230414-162911-ladsgroup.json
[16:30:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:30:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:31:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T333332)', diff saved to https://phabricator.wikimedia.org/P46764 and previous config saved to /var/cache/conftool/dbconfig/20230414-163110-ladsgroup.json
[16:31:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[16:31:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46765 and previous config saved to /var/cache/conftool/dbconfig/20230414-163120-ladsgroup.json
[16:31:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[16:31:32] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/908896 (owner: JHathaway)
[16:31:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[16:31:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[16:32:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[16:32:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[16:32:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T333332)', diff saved to https://phabricator.wikimedia.org/P46766 and previous config saved to /var/cache/conftool/dbconfig/20230414-163221-ladsgroup.json
[16:34:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T333332)', diff saved to https://phabricator.wikimedia.org/P46767 and previous config saved to /var/cache/conftool/dbconfig/20230414-163434-ladsgroup.json
[16:34:39] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:38:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:38:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:38:37] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:38:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[16:39:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:39:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[16:46:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46768 and previous config saved to /var/cache/conftool/dbconfig/20230414-164627-ladsgroup.json
[16:47:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1015.eqiad.wmnet with OS bullseye
[16:47:27] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1015.eqiad.wmnet with OS bullseye
[16:49:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P46769 and previous config saved to /var/cache/conftool/dbconfig/20230414-164940-ladsgroup.json
[16:51:40] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:43] <wikibugs>	 (CR) Ladsgroup: [C: +1] "Thanks! We can merge it early next week?" [puppet] - https://gerrit.wikimedia.org/r/908896 (owner: JHathaway)
[16:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46770 and previous config saved to /var/cache/conftool/dbconfig/20230414-165814-ladsgroup.json
[16:58:19] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:59:29] <wikibugs>	 ops-codfw, Data-Persistence-Backup: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (Jhancock.wm) Open→Resolved We moved the port from ge-6/0/6 to ge-6/0/22. This should stop the errors. if they occur again we'll reinvestigate.
[17:00:12] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage
[17:01:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P46771 and previous config saved to /var/cache/conftool/dbconfig/20230414-170133-ladsgroup.json
[17:02:07] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:02:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[17:03:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage
[17:04:08] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[17:04:16] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[17:04:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P46772 and previous config saved to /var/cache/conftool/dbconfig/20230414-170447-ladsgroup.json
[17:05:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:05:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:05:33] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:51] <wikibugs>	 (PS1) DCausse: rdf-streaming-updater: use flink 1.16.1 [deployment-charts] - https://gerrit.wikimedia.org/r/908900
[17:07:37] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[17:08:41] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[17:10:18] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:10:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[17:11:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:11:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:13:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P46773 and previous config saved to /var/cache/conftool/dbconfig/20230414-171320-ladsgroup.json
[17:13:38] <wikibugs>	 (PS7) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[17:13:40] <wikibugs>	 (PS9) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[17:13:42] <wikibugs>	 (PS6) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326
[17:13:44] <wikibugs>	 (PS47) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[17:13:46] <wikibugs>	 (PS1) Jbond: ssl_ssl_ciphersuite: Add AES256-SHA256 to list of mid cipher [puppet] - https://gerrit.wikimedia.org/r/908902
[17:15:05] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:15:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[17:15:35] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:15:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1015.eqiad.wmnet with OS bullseye
[17:15:52] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1015.eqiad.wmnet with OS bullseye completed: - lvs1015 (**PASS**)   - Downtimed on Icinga/Aler...
[17:16:11] <wikibugs>	 (CR) DCausse: [C: +2] rdf-streaming-updater: use flink 1.16.1 [deployment-charts] - https://gerrit.wikimedia.org/r/908900 (owner: DCausse)
[17:16:20] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:16:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T333332)', diff saved to https://phabricator.wikimedia.org/P46774 and previous config saved to /var/cache/conftool/dbconfig/20230414-171638-ladsgroup.json
[17:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[17:16:44] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:16:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[17:17:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46775 and previous config saved to /var/cache/conftool/dbconfig/20230414-171702-ladsgroup.json
[17:17:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1014.eqiad.wmnet with OS bullseye
[17:17:31] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1014.eqiad.wmnet with OS bullseye
[17:18:22] <wikibugs>	 (CR) CI reject: [V: -1] wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: Jbond)
[17:18:58] <wikibugs>	 (CR) CI reject: [V: -1] ssl_ssl_ciphersuite: Add AES256-SHA256 to list of mid cipher [puppet] - https://gerrit.wikimedia.org/r/908902 (owner: Jbond)
[17:19:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46776 and previous config saved to /var/cache/conftool/dbconfig/20230414-171911-ladsgroup.json
[17:19:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T333332)', diff saved to https://phabricator.wikimedia.org/P46777 and previous config saved to /var/cache/conftool/dbconfig/20230414-171953-ladsgroup.json
[17:19:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[17:20:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[17:20:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T333332)', diff saved to https://phabricator.wikimedia.org/P46778 and previous config saved to /var/cache/conftool/dbconfig/20230414-172016-ladsgroup.json
[17:20:51] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:21:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:21:34] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: use flink 1.16.1 [deployment-charts] - https://gerrit.wikimedia.org/r/908900 (owner: DCausse)
[17:22:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T333332)', diff saved to https://phabricator.wikimedia.org/P46779 and previous config saved to /var/cache/conftool/dbconfig/20230414-172229-ladsgroup.json
[17:22:34] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:23:48] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:24:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:25:51] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:25:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[17:27:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirtlocal1001.eqiad.wmnet
[17:28:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P46780 and previous config saved to /var/cache/conftool/dbconfig/20230414-172826-ladsgroup.json
[17:29:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bullseye
[17:29:45] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye
[17:30:25] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46781 and previous config saved to /var/cache/conftool/dbconfig/20230414-173418-ladsgroup.json
[17:36:33] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage
[17:37:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P46782 and previous config saved to /var/cache/conftool/dbconfig/20230414-173734-ladsgroup.json
[17:39:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072']
[17:39:47] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage
[17:42:10] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1016.eqiad.wmnet with reason: host reimage
[17:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46783 and previous config saved to /var/cache/conftool/dbconfig/20230414-174333-ladsgroup.json
[17:43:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[17:43:38] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:43:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[17:43:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T333332)', diff saved to https://phabricator.wikimedia.org/P46784 and previous config saved to /var/cache/conftool/dbconfig/20230414-174356-ladsgroup.json
[17:45:27] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1016.eqiad.wmnet with reason: host reimage
[17:47:50] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet
[17:49:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:49:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[17:49:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P46785 and previous config saved to /var/cache/conftool/dbconfig/20230414-174924-ladsgroup.json
[17:50:19] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P46786 and previous config saved to /var/cache/conftool/dbconfig/20230414-175242-ladsgroup.json
[17:53:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1014.eqiad.wmnet with OS bullseye
[17:53:50] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1014.eqiad.wmnet with OS bullseye completed: - lvs1014 (**PASS**)   - Downtimed on Icinga/Aler...
[17:57:01] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1016.eqiad.wmnet with OS bullseye
[17:57:10] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye completed: - lvs1016 (**PASS**)   - Downtimed on Icinga/Aler...
[18:00:45] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:07] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:03:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[18:03:20] <wikibugs>	 (PS1) Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (codfw) [puppet] - https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309)
[18:03:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:03:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:04:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T333332)', diff saved to https://phabricator.wikimedia.org/P46787 and previous config saved to /var/cache/conftool/dbconfig/20230414-180430-ladsgroup.json
[18:04:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[18:04:36] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:04:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[18:04:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[18:05:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[18:05:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[18:05:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[18:05:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[18:06:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[18:06:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46788 and previous config saved to /var/cache/conftool/dbconfig/20230414-180606-ladsgroup.json
[18:06:37] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:37] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (ssingh) Yet another data point if that helps: I am trying to merge the codfw LVS hiera definitions and ran into the following...
[18:07:05] <wikibugs>	 (CR) Ssingh: "To be merged on Monday." [puppet] - https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[18:07:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T333332)', diff saved to https://phabricator.wikimedia.org/P46789 and previous config saved to /var/cache/conftool/dbconfig/20230414-180748-ladsgroup.json
[18:07:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[18:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[18:08:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T333332)', diff saved to https://phabricator.wikimedia.org/P46790 and previous config saved to /var/cache/conftool/dbconfig/20230414-180812-ladsgroup.json
[18:08:57] <mutante>	 !log doc1002, doc2001 - manually remove php7.3-fpm restart timers to fix T334735 and alerting - T322357 - systemctl stop wmf_auto_restart_php7.3-fpm.timer; systemctl stop wmf_auto_restart_php7.3-fpm.service; rm /lib/systemd/system/wmf_auto_restart_php7.3-fpm.*
[18:09:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:03] <stashbot>	 T334735: fix PHP auto-restarts on doc hosts - https://phabricator.wikimedia.org/T334735
[18:09:04] <stashbot>	 T322357: OOUI PHP demos page is broken (again) - https://phabricator.wikimedia.org/T322357
[18:10:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T333332)', diff saved to https://phabricator.wikimedia.org/P46791 and previous config saved to /var/cache/conftool/dbconfig/20230414-181025-ladsgroup.json
[18:10:31] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:11:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46792 and previous config saved to /var/cache/conftool/dbconfig/20230414-181123-ladsgroup.json
[18:13:52] <wikibugs>	 (CR) Dzahn: [C: +2] "another follow-up was that the restart services were not removed by puppet and failed to restart missing php 7.3 which then caused monitor" [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[18:16:01] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:49] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:17:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed...
[18:18:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:18:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:19:37] <icinga-wm_>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[18:21:15] <icinga-wm_>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[18:22:23] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (Papaul) When I run  ` cookbook sre.hosts.dhcp --os bullseye cloudvirtlocal1001 ` I am able to reboot the server as many times as I want and hit F12 and t...
[18:25:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P46793 and previous config saved to /var/cache/conftool/dbconfig/20230414-182532-ladsgroup.json
[18:25:58] <wikibugs>	 (PS48) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[18:26:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46794 and previous config saved to /var/cache/conftool/dbconfig/20230414-182629-ladsgroup.json
[18:26:42] <wikibugs>	 (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[18:26:44] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[18:27:03] <wikibugs>	 (CR) JHathaway: lists: Bump the number worker processes to 4 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908896 (owner: JHathaway)
[18:30:13] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:33:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T333332)', diff saved to https://phabricator.wikimedia.org/P46795 and previous config saved to /var/cache/conftool/dbconfig/20230414-183311-ladsgroup.json
[18:33:17] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:33:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage
[18:35:18] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (JameelKaisar) First of all thank you Timo and Chris for the detailed information.  ## Measurement domain  - The shuffling the targets/domains part is implemen...
[18:36:12] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) rsyncing of /srv/gerrit including /srv/gerrit/git and other things is STILL ongoing, it's hundreds of GB of ALL small files.. and rsync bandwidth limited to make sure gerrit prod is not affec...
[18:36:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage
[18:40:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P46796 and previous config saved to /var/cache/conftool/dbconfig/20230414-184038-ladsgroup.json
[18:41:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P46797 and previous config saved to /var/cache/conftool/dbconfig/20230414-184135-ladsgroup.json
[18:48:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P46798 and previous config saved to /var/cache/conftool/dbconfig/20230414-184818-ladsgroup.json
[18:51:54] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[18:52:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[18:52:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye complete...
[18:55:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T333332)', diff saved to https://phabricator.wikimedia.org/P46799 and previous config saved to /var/cache/conftool/dbconfig/20230414-185545-ladsgroup.json
[18:55:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[18:55:50] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:56:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[18:56:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[18:56:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[18:56:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T333332)', diff saved to https://phabricator.wikimedia.org/P46800 and previous config saved to /var/cache/conftool/dbconfig/20230414-185630-ladsgroup.json
[18:56:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T333332)', diff saved to https://phabricator.wikimedia.org/P46801 and previous config saved to /var/cache/conftool/dbconfig/20230414-185642-ladsgroup.json
[18:56:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[18:56:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[18:57:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46802 and previous config saved to /var/cache/conftool/dbconfig/20230414-185705-ladsgroup.json
[18:58:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T333332)', diff saved to https://phabricator.wikimedia.org/P46803 and previous config saved to /var/cache/conftool/dbconfig/20230414-185842-ladsgroup.json
[18:59:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46804 and previous config saved to /var/cache/conftool/dbconfig/20230414-185921-ladsgroup.json
[19:03:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P46805 and previous config saved to /var/cache/conftool/dbconfig/20230414-190324-ladsgroup.json
[19:05:06] <wikibugs>	 SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (RLazarus)
[19:06:08] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:16] <wikibugs>	 (PS1) AOkoth: prometheus: delete migrated eventgate alerts [puppet] - https://gerrit.wikimedia.org/r/908917 (https://phabricator.wikimedia.org/T309009)
[19:08:07] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (BBlack) It's awesome to see this moving along!  One minor point:  >>   This would then be immediately queryable in Grafana by DC and Country code, where you c...
[19:13:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P46806 and previous config saved to /var/cache/conftool/dbconfig/20230414-191348-ladsgroup.json
[19:14:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46807 and previous config saved to /var/cache/conftool/dbconfig/20230414-191428-ladsgroup.json
[19:15:40] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T333332)', diff saved to https://phabricator.wikimedia.org/P46808 and previous config saved to /var/cache/conftool/dbconfig/20230414-191831-ladsgroup.json
[19:18:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[19:18:36] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[19:18:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[19:18:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T333332)', diff saved to https://phabricator.wikimedia.org/P46809 and previous config saved to /var/cache/conftool/dbconfig/20230414-191854-ladsgroup.json
[19:20:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T333332)', diff saved to https://phabricator.wikimedia.org/P46810 and previous config saved to /var/cache/conftool/dbconfig/20230414-192001-ladsgroup.json
[19:22:02] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:28:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P46811 and previous config saved to /var/cache/conftool/dbconfig/20230414-192855-ladsgroup.json
[19:29:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P46812 and previous config saved to /var/cache/conftool/dbconfig/20230414-192934-ladsgroup.json
[19:31:28] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P46813 and previous config saved to /var/cache/conftool/dbconfig/20230414-193507-ladsgroup.json
[19:44:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T333332)', diff saved to https://phabricator.wikimedia.org/P46814 and previous config saved to /var/cache/conftool/dbconfig/20230414-194401-ladsgroup.json
[19:44:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[19:44:07] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[19:44:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[19:44:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T333332)', diff saved to https://phabricator.wikimedia.org/P46815 and previous config saved to /var/cache/conftool/dbconfig/20230414-194424-ladsgroup.json
[19:44:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T333332)', diff saved to https://phabricator.wikimedia.org/P46816 and previous config saved to /var/cache/conftool/dbconfig/20230414-194441-ladsgroup.json
[19:44:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[19:44:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[19:45:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46817 and previous config saved to /var/cache/conftool/dbconfig/20230414-194504-ladsgroup.json
[19:46:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T333332)', diff saved to https://phabricator.wikimedia.org/P46818 and previous config saved to /var/cache/conftool/dbconfig/20230414-194637-ladsgroup.json
[19:47:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46819 and previous config saved to /var/cache/conftool/dbconfig/20230414-194720-ladsgroup.json
[19:50:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P46820 and previous config saved to /var/cache/conftool/dbconfig/20230414-195014-ladsgroup.json
[20:01:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P46821 and previous config saved to /var/cache/conftool/dbconfig/20230414-200144-ladsgroup.json
[20:02:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46822 and previous config saved to /var/cache/conftool/dbconfig/20230414-200226-ladsgroup.json
[20:05:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T333332)', diff saved to https://phabricator.wikimedia.org/P46823 and previous config saved to /var/cache/conftool/dbconfig/20230414-200520-ladsgroup.json
[20:05:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[20:05:25] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[20:05:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[20:05:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T333332)', diff saved to https://phabricator.wikimedia.org/P46824 and previous config saved to /var/cache/conftool/dbconfig/20230414-200543-ladsgroup.json
[20:06:24] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T333332)', diff saved to https://phabricator.wikimedia.org/P46825 and previous config saved to /var/cache/conftool/dbconfig/20230414-200751-ladsgroup.json
[20:15:52] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Papaul) @Jgreen I can check and let you know on the firmware update.
[20:16:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P46826 and previous config saved to /var/cache/conftool/dbconfig/20230414-201650-ladsgroup.json
[20:16:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Papaul)
[20:17:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team: Tenant networking not working on cloudvirtlocal hosts - https://phabricator.wikimedia.org/T334694 (Andrew) Open→Resolved a:Cmjohnson→cmooney This was fixed by Cathal.
[20:17:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Andrew)
[20:17:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Andrew)
[20:17:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P46827 and previous config saved to /var/cache/conftool/dbconfig/20230414-201734-ladsgroup.json
[20:21:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Papaul) Open→Resolved Complete
[20:22:12] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[20:22:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P46828 and previous config saved to /var/cache/conftool/dbconfig/20230414-202257-ladsgroup.json
[20:26:59] <wikibugs>	 (PS15) Cwhite: opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732)
[20:30:06] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:53] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (Andrew) OK -- I'm not ready to get rid of the data on this server but it is fine to reboot it now.  Thanks for waiting!
[20:31:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T333332)', diff saved to https://phabricator.wikimedia.org/P46829 and previous config saved to /var/cache/conftool/dbconfig/20230414-203156-ladsgroup.json
[20:31:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[20:32:04] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[20:32:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[20:32:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T333332)', diff saved to https://phabricator.wikimedia.org/P46830 and previous config saved to /var/cache/conftool/dbconfig/20230414-203220-ladsgroup.json
[20:32:21] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[20:32:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46831 and previous config saved to /var/cache/conftool/dbconfig/20230414-203241-ladsgroup.json
[20:32:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[20:32:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[20:33:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46832 and previous config saved to /var/cache/conftool/dbconfig/20230414-203304-ladsgroup.json
[20:33:16] <wikibugs>	 ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[20:35:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46833 and previous config saved to /var/cache/conftool/dbconfig/20230414-203520-ladsgroup.json
[20:36:26] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:50] <papaul>	 !log rebooting labstore1004 for mgmt interface issue
[20:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P46834 and previous config saved to /var/cache/conftool/dbconfig/20230414-203804-ladsgroup.json
[20:41:05] <wikibugs>	 (03CR) 10Cwhite: opensearch_dashboards: add package provider (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[20:42:00] <wikibugs>	 (03CR) 10Andrea Denisse: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[20:45:56] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46835 and previous config saved to /var/cache/conftool/dbconfig/20230414-205026-ladsgroup.json
[20:53:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T333332)', diff saved to https://phabricator.wikimedia.org/P46836 and previous config saved to /var/cache/conftool/dbconfig/20230414-205310-ladsgroup.json
[20:53:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[20:53:15] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[20:53:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[20:53:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T333332)', diff saved to https://phabricator.wikimedia.org/P46837 and previous config saved to /var/cache/conftool/dbconfig/20230414-205333-ladsgroup.json
[20:55:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T333332)', diff saved to https://phabricator.wikimedia.org/P46838 and previous config saved to /var/cache/conftool/dbconfig/20230414-205541-ladsgroup.json
[20:56:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Jgreen we will have to update the firmware on those.
[20:57:21] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Papaul) 05Open→03Resolved rebooting the server fixed the issue. We can now resolve this
[20:57:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Papaul)
[20:58:28] <wikibugs>	 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10Krinkle)
[20:59:59] <wikibugs>	 (03CR) 10Cwhite: opensearch_dashboards: add package provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[21:01:58] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[21:03:24] <wikibugs>	 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10Krinkle) > Category membership: queried using this [[ https://en.wikipedia.org/w/api.php?action=query&format=json&continue=&revids=1147464943&cllimit=max&i...
[21:05:08] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[21:05:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P46840 and previous config saved to /var/cache/conftool/dbconfig/20230414-210533-ladsgroup.json
[21:06:36] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:53] <wikibugs>	 (03PS1) 10EoghanGaffney: Only recurse if the directory is to be removed [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736)
[21:08:58] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10colewhite)
[21:10:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P46841 and previous config saved to /var/cache/conftool/dbconfig/20230414-211048-ladsgroup.json
[21:11:30] <icinga-wm_>	 PROBLEM - Disk space on urldownloader1001 is CRITICAL: DISK CRITICAL - free space: / 283 MB (3% inode=89%): /tmp 283 MB (3% inode=89%): /var/tmp 283 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops
[21:11:59] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40681/console" [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) (owner: 10EoghanGaffney)
[21:12:53] <jinxer-wm>	 (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:13:33] <wikibugs>	 (03PS2) 10EoghanGaffney: [gitlab/ssh] Only recurse if the directory is to be removed [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736)
[21:16:08] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46842 and previous config saved to /var/cache/conftool/dbconfig/20230414-212039-ladsgroup.json
[21:20:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[21:20:45] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[21:20:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[21:21:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46843 and previous config saved to /var/cache/conftool/dbconfig/20230414-212102-ladsgroup.json
[21:23:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46844 and previous config saved to /var/cache/conftool/dbconfig/20230414-212319-ladsgroup.json
[21:25:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P46845 and previous config saved to /var/cache/conftool/dbconfig/20230414-212554-ladsgroup.json
[21:26:14] <wikibugs>	 (03Abandoned) 10Cwhite: profile: clean up ipsec aggregate check [puppet] - 10https://gerrit.wikimedia.org/r/632739 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite)
[21:27:09] <wikibugs>	 (03Abandoned) 10Cwhite: scb: enable statsd_exporter and add matching rules [puppet] - 10https://gerrit.wikimedia.org/r/484586 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[21:30:10] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 3 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Quiddity) 1) Hi, Re: User Notice - please could someone...
[21:34:49] <wikibugs>	 (03Abandoned) 10Cwhite: when configured to relay statsd traffic, send the raw []byte recieved toward the configured statsd endpoint [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/554544 (https://phabricator.wikimedia.org/T239833) (owner: 10Cwhite)
[21:36:44] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:46] <wikibugs>	 (03Abandoned) 10Cwhite: hiera: specify tlsproxy configuration for grafana [puppet] - 10https://gerrit.wikimedia.org/r/616811 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[21:38:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46846 and previous config saved to /var/cache/conftool/dbconfig/20230414-213825-ladsgroup.json
[21:39:07] <wikibugs>	 (03Abandoned) 10Cwhite: provision loki on grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/616851 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite)
[21:41:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T333332)', diff saved to https://phabricator.wikimedia.org/P46847 and previous config saved to /var/cache/conftool/dbconfig/20230414-214100-ladsgroup.json
[21:41:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[21:41:06] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[21:41:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[21:41:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T333332)', diff saved to https://phabricator.wikimedia.org/P46848 and previous config saved to /var/cache/conftool/dbconfig/20230414-214123-ladsgroup.json
[21:42:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T333332)', diff saved to https://phabricator.wikimedia.org/P46849 and previous config saved to /var/cache/conftool/dbconfig/20230414-214231-ladsgroup.json
[21:44:02] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management, 10Patch-For-Review, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10kaldari) @Ladsgroup - Is there any way that folks can manually purge thumbnails that didn't get regenerated (besides reup...
[21:46:20] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:17] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:53:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P46850 and previous config saved to /var/cache/conftool/dbconfig/20230414-215331-ladsgroup.json
[21:57:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P46851 and previous config saved to /var/cache/conftool/dbconfig/20230414-215738-ladsgroup.json
[21:59:07] <wikibugs>	 (03PS2) 10Cwhite: logstash: ulogd remove copy network.transport to network.protocol [puppet] - 10https://gerrit.wikimedia.org/r/886857 (https://phabricator.wikimedia.org/T329195)
[22:01:16] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] opensearch_dashboards: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[22:08:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T333332)', diff saved to https://phabricator.wikimedia.org/P46852 and previous config saved to /var/cache/conftool/dbconfig/20230414-220838-ladsgroup.json
[22:08:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[22:08:44] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[22:08:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[22:08:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[22:09:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[22:09:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46853 and previous config saved to /var/cache/conftool/dbconfig/20230414-220918-ladsgroup.json
[22:11:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46854 and previous config saved to /var/cache/conftool/dbconfig/20230414-221134-ladsgroup.json
[22:12:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P46855 and previous config saved to /var/cache/conftool/dbconfig/20230414-221244-ladsgroup.json
[22:21:30] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:35] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) Sorry, it was a misclick. I removed the tag.
[22:26:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46856 and previous config saved to /var/cache/conftool/dbconfig/20230414-222641-ladsgroup.json
[22:27:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T333332)', diff saved to https://phabricator.wikimedia.org/P46857 and previous config saved to /var/cache/conftool/dbconfig/20230414-222750-ladsgroup.json
[22:27:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[22:27:56] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[22:28:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[22:28:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T333332)', diff saved to https://phabricator.wikimedia.org/P46858 and previous config saved to /var/cache/conftool/dbconfig/20230414-222814-ladsgroup.json
[22:29:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T333332)', diff saved to https://phabricator.wikimedia.org/P46859 and previous config saved to /var/cache/conftool/dbconfig/20230414-222921-ladsgroup.json
[22:30:47] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:31:04] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:52] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P46860 and previous config saved to /var/cache/conftool/dbconfig/20230414-224147-ladsgroup.json
[22:44:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P46861 and previous config saved to /var/cache/conftool/dbconfig/20230414-224428-ladsgroup.json
[22:45:24] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:51:50] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:55:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10KFrancis) @Dzahn I am confirming the signed NDA.  Please proceed with the access request.  Thank you!
[22:56:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T333332)', diff saved to https://phabricator.wikimedia.org/P46862 and previous config saved to /var/cache/conftool/dbconfig/20230414-225654-ladsgroup.json
[22:56:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:56:59] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[22:57:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:57:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46863 and previous config saved to /var/cache/conftool/dbconfig/20230414-225717-ladsgroup.json
[22:59:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46864 and previous config saved to /var/cache/conftool/dbconfig/20230414-225934-ladsgroup.json
[22:59:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P46865 and previous config saved to /var/cache/conftool/dbconfig/20230414-225934-ladsgroup.json
[23:01:26] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:58] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[23:12:32] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[23:14:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46866 and previous config saved to /var/cache/conftool/dbconfig/20230414-231440-ladsgroup.json
[23:14:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T333332)', diff saved to https://phabricator.wikimedia.org/P46867 and previous config saved to /var/cache/conftool/dbconfig/20230414-231440-ladsgroup.json
[23:14:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[23:14:48] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[23:14:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[23:15:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[23:15:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[23:15:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[23:15:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[23:15:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[23:15:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[23:15:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[23:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[23:15:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T333332)', diff saved to https://phabricator.wikimedia.org/P46868 and previous config saved to /var/cache/conftool/dbconfig/20230414-231557-ladsgroup.json
[23:17:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T333332)', diff saved to https://phabricator.wikimedia.org/P46869 and previous config saved to /var/cache/conftool/dbconfig/20230414-231707-ladsgroup.json
[23:21:59] <wikibugs>	 (03PS2) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - 10https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979)
[23:23:03] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - 10https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse)
[23:25:18] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[23:26:52] <icinga-wm_>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[23:27:59] <wikibugs>	 (03PS2) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - 10https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979)
[23:28:50] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - 10https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse)
[23:29:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P46870 and previous config saved to /var/cache/conftool/dbconfig/20230414-232946-ladsgroup.json
[23:32:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P46871 and previous config saved to /var/cache/conftool/dbconfig/20230414-233213-ladsgroup.json
[23:32:57] <wikibugs>	 (03PS2) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus5002 [puppet] - 10https://gerrit.wikimedia.org/r/907985 (https://phabricator.wikimedia.org/T309979)
[23:33:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:35:00] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Apply prometheus::pop role to prometheus5002 [puppet] - 10https://gerrit.wikimedia.org/r/907985 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse)
[23:35:33] <jinxer-wm>	 (JobUnavailable) firing: (15) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:44:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46872 and previous config saved to /var/cache/conftool/dbconfig/20230414-234453-ladsgroup.json
[23:44:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[23:44:58] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[23:45:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[23:45:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46873 and previous config saved to /var/cache/conftool/dbconfig/20230414-234516-ladsgroup.json
[23:45:33] <jinxer-wm>	 (JobUnavailable) firing: (15) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:47:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P46874 and previous config saved to /var/cache/conftool/dbconfig/20230414-234720-ladsgroup.json
[23:47:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T333332)', diff saved to https://phabricator.wikimedia.org/P46875 and previous config saved to /var/cache/conftool/dbconfig/20230414-234732-ladsgroup.json
[23:47:50] <wikibugs>	 (03PS49) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356
[23:50:32] <jinxer-wm>	 (JobUnavailable) firing: (15) Reduced availability for job bird in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:50:48] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:55:17] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:55:47] <jinxer-wm>	 (JobUnavailable) firing: (20) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:58:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job blackbox/pingthing in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:59:02] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable