[00:01:15] <icinga-wm>	 RECOVERY - Check systemd state on logstash1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:45] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/990185
[00:38:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/990185 (owner: TrainBranchBot)
[00:42:13] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:35] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:32] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/990185 (owner: TrainBranchBot)
[01:55:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:00:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:02:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:04:35] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:39:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:41] <wikibugs>	 (CR) Winston Sung: [C: -1] SiteMatrix config: Remove deprecated language codes from the list (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[03:09:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:44] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:21:43] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:41:44] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:23:57] <icinga-wm>	 PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:23:59] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:24:11] <icinga-wm>	 PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:50:33] <icinga-wm>	 PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[04:51:29] <icinga-wm>	 PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[04:52:17] <icinga-wm>	 PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:56:53] <icinga-wm>	 RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[04:57:19] <andrewbogott>	 !log restarting wikitech-static,  oom
[04:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:47] <icinga-wm>	 RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[04:58:21] <icinga-wm>	 RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2024-03-06 18:33:38 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:58:39] <icinga-wm>	 RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:58:41] <icinga-wm>	 RECOVERY - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2024-03-06 18:33:38 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:00:17] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26898 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:39:37] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:26:06] <wikibugs>	 (PS1) Marostegui: db2117: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/990434 (https://phabricator.wikimedia.org/T354506)
[06:28:37] <wikibugs>	 (CR) Marostegui: [C: +2] db2117: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/990434 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[07:10:19] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:33:01] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: 5th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/990586 (https://phabricator.wikimedia.org/T351503)
[07:47:05] <wikibugs>	 (PS4) Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694)
[07:54:51] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[07:57:49] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Docker
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T0800).
[08:00:05] <jouncebot>	 pfischer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:11] <pfischer>	 o/
[08:00:56] <wikibugs>	 (CR) DCausse: [C: +1] enable page_rerender for 5th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:02:46] <wikibugs>	 (CR) Slyngshede: [C: +2] Netfilter max connection tracking entires. [alerts] - https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:02:55] <pfischer>	 dcausse: if currently no one is around, I might as well self-deploy
[08:03:28] <dcausse>	 pfischer: I can deploy if you want
[08:04:05] <pfischer>	 Sure, thank you!
[08:04:36] <wikibugs>	 (Merged) jenkins-bot: Netfilter max connection tracking entires. [alerts] - https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:09:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:11:36] <wikibugs>	 (Merged) jenkins-bot: enable page_rerender for 5th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:12:02] <logmsgbot>	 !log dcausse@deploy2002 Started scap: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]]
[08:12:06] <stashbot>	 T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[08:13:36] <logmsgbot>	 !log dcausse@deploy2002 pfischer and dcausse: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:06] <wikibugs>	 (CR) JMeybohm: "Something else crossed my mind this morning: Running jobs via helmfile will result in one helm release per job run which will never be cle" [puppet] - https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[08:17:31] <logmsgbot>	 !log dcausse@deploy2002 pfischer and dcausse: Continuing with sync
[08:23:42] <logmsgbot>	 !log dcausse@deploy2002 Finished scap: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]] (duration: 11m 40s)
[08:23:47] <stashbot>	 T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[08:24:22] <dcausse>	 pfischer: deploy done
[08:26:57] <wikibugs>	 (PS1) Muehlenhoff: Extend access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/990590
[08:33:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/990590 (owner: Muehlenhoff)
[08:44:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] graphite: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/990053 (owner: Muehlenhoff)
[08:45:42] <logmsgbot>	 !log filippo@deploy2002 Started deploy [performance/arc-lamp@67389a0]: (no justification provided)
[08:45:47] <logmsgbot>	 !log filippo@deploy2002 Finished deploy [performance/arc-lamp@67389a0]: (no justification provided) (duration: 00m 05s)
[08:46:51] <wikibugs>	 (CR) Peter Fischer: [C: +2] Search update pipeline: 5th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/990586 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:48:17] <wikibugs>	 (Merged) jenkins-bot: Search update pipeline: 5th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/990586 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[09:03:59] <wikibugs>	 (PS12) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035)
[09:04:04] <wikibugs>	 (PS13) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035)
[09:14:45] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:15:00] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:15:31] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:15:59] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:16:15] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:16:30] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:18:15] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, LGTM overall" [alerts] - https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:19:10] <wikibugs>	 (CR) DCausse: Search update pipeline: update README (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/987494 (owner: Peter Fischer)
[09:24:13] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (JMeybohm) >>! In T354276#9443301, @KFrancis wrote: > Hi all, please provide Dima koushha's WMDE email address to kfrancis@wikimedia.org and I'll prepare the NDA.  Thank...
[09:27:25] <wikibugs>	 (PS1) Muehlenhoff: Update Mark's key with a new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/990594
[09:30:35] <wikibugs>	 (CR) Mark Bergsma: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/990594 (owner: Muehlenhoff)
[09:30:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Update Mark's key with a new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/990594 (owner: Muehlenhoff)
[09:37:56] <wikibugs>	 (PS1) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (2 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619)
[09:41:56] <wikibugs>	 (PS1) Ladsgroup: SecurePoll: Adding updated voterlist files [extensions/SecurePoll] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/990424 (https://phabricator.wikimedia.org/T349263)
[09:42:12] <Amir1>	 jouncebot: nowandnext
[09:42:12] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 17 minute(s)
[09:42:12] <jouncebot>	 In 1 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1100)
[09:42:19] <wikibugs>	 (CR) Ladsgroup: [C: +2] SecurePoll: Adding updated voterlist files [extensions/SecurePoll] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/990424 (https://phabricator.wikimedia.org/T349263) (owner: Ladsgroup)
[09:42:44] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when listening to TLS (2 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli)
[09:44:23] <wikibugs>	 (PS2) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (3 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619)
[09:45:28] <wikibugs>	 (Merged) jenkins-bot: SecurePoll: Adding updated voterlist files [extensions/SecurePoll] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/990424 (https://phabricator.wikimedia.org/T349263) (owner: Ladsgroup)
[09:46:12] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when listening to TLS (3 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli)
[09:46:43] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:990424|SecurePoll: Adding updated voterlist files (T349263)]]
[09:46:50] <stashbot>	 T349263: Create voter list for U4C Charter ratification vote - https://phabricator.wikimedia.org/T349263
[09:48:16] <wikibugs>	 (PS1) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (fix) [puppet] - https://gerrit.wikimedia.org/r/990597
[09:48:18] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:990424|SecurePoll: Adding updated voterlist files (T349263)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:48:37] <wikibugs>	 (PS2) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (fix) [puppet] - https://gerrit.wikimedia.org/r/990597
[09:50:02] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when listening to TLS (fix) [puppet] - https://gerrit.wikimedia.org/r/990597 (owner: Effie Mouzeli)
[09:50:59] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:47] <wikibugs>	 (PS1) Btullis: Bump the namenode heap value for the new nameservers [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573)
[09:53:45] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1107/co" [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[09:56:29] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[09:58:46] <wikibugs>	 (PS1) Btullis: Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573)
[09:58:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc-gp1002.eqiad.wmnet
[09:59:48] <wikibugs>	 (PS2) Btullis: Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573)
[10:02:43] <wikibugs>	 (PS1) Muehlenhoff: Switch mc-gp1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990602 (https://phabricator.wikimedia.org/T349619)
[10:02:48] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:990424|SecurePoll: Adding updated voterlist files (T349263)]] (duration: 16m 04s)
[10:02:52] <stashbot>	 T349263: Create voter list for U4C Charter ratification vote - https://phabricator.wikimedia.org/T349263
[10:04:15] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) to sum it up as it's a bit confusing to re-read everything:  | | puppet5 (db1139) | puppet 7 (db1133) | `mysql --ssl-ca wmf-ca-certif...
[10:05:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch mc-gp1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990602 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:07:11] <wikibugs>	 (CR) Brouberol: [C: +1] Bump the namenode heap value for the new nameservers [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:07:22] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Bump the namenode heap value for the new nameservers [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:08:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc-gp1002.eqiad.wmnet
[10:10:58] <wikibugs>	 (CR) Btullis: [C: +1] spark-history: set production retention to 60 days [deployment-charts] - https://gerrit.wikimedia.org/r/990034 (https://phabricator.wikimedia.org/T354927) (owner: Brouberol)
[10:11:22] <wikibugs>	 (CR) Brouberol: [C: +2] spark-history: set production retention to 60 days [deployment-charts] - https://gerrit.wikimedia.org/r/990034 (https://phabricator.wikimedia.org/T354927) (owner: Brouberol)
[10:11:51] <wikibugs>	 (PS1) Btullis: Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573)
[10:13:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1037.eqiad.wmnet
[10:13:40] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1108/co" [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:14:14] <wikibugs>	 (PS1) Brouberol: spark-history: update an-master hostnames [deployment-charts] - https://gerrit.wikimedia.org/r/990627 (https://phabricator.wikimedia.org/T332573)
[10:14:35] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:41] <wikibugs>	 (CR) Brouberol: [C: +1] Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:15:05] <wikibugs>	 (PS1) Muehlenhoff: Switch mc1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990628 (https://phabricator.wikimedia.org/T349619)
[10:15:27] <wikibugs>	 (PS2) Btullis: Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573)
[10:17:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch mc1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990628 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:17:28] <wikibugs>	 (CR) Brouberol: [C: +1] Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:18:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "See inline" [puppet] - https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: Herron)
[10:19:08] <wikibugs>	 (PS1) Btullis: Temporarily disable systemd jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/990629 (https://phabricator.wikimedia.org/T332573)
[10:21:10] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1109/co" [puppet] - https://gerrit.wikimedia.org/r/990629 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:21:27] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) as for the certificates side:  | | Puppet 7 ca.crt `puppet_rsa` | Puppet 5 ca.crt `palladium.eqiad.wmnet` | wmf-ca.crt `Wikimedia_Int...
[10:22:04] <wikibugs>	 (CR) Btullis: [C: +2] Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:22:36] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Temporarily disable systemd jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/990629 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:22:48] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) Those are tests from the orchestrator server I assume?
[10:24:31] <wikibugs>	 (CR) Vgutierrez: [C: -1] hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:27:42] <wikibugs>	 (CR) Vgutierrez: [C: -1] hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:30:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1037.eqiad.wmnet
[10:32:22] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) No, good catch! I forgot to add those results as well. Previous results were from the previously described tests.  From orchestrator...
[10:34:29] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) If db1133 gets fixed, that should mean that the new dbstores (1008, 1009) should pop up and get discovered automatically too.
[10:37:20] <wikibugs>	 (PS5) Klausman: profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - https://gerrit.wikimedia.org/r/989458
[10:37:59] <wikibugs>	 (CR) Klausman: [C: +2] profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - https://gerrit.wikimedia.org/r/989458 (owner: Klausman)
[10:38:06] <wikibugs>	 (CR) Klausman: [V: +2 C: +2] profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - https://gerrit.wikimedia.org/r/989458 (owner: Klausman)
[10:45:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[10:47:23] <wikibugs>	 (PS17) Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[10:48:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,produce_canary_events.service,refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:59] <moritzm>	 !log installing systemd bugfix updates from Bullseye point release
[10:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:11] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) ` root@dborch1001:/etc/ssl/certs# grep -i ca-certificates /etc/orchestrator.conf.json    "MySQLOrchestratorSSLCAFile": "/etc/ssl/cert...
[10:49:16] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1110/co" [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:51:19] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet
[10:53:11] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[10:53:59] <wikibugs>	 (CR) Vgutierrez: hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:58:45] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1100)
[11:03:07] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet
[11:06:23] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:25] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:25] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] <icinga-wm>	 PROBLEM - Check systemd state on analytics1076 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:28] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:29] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:30] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:31] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:32] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:33] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:34] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:36] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:40] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:42] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:43] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:07:00] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 97 hosts with reason: Bringing new nameservers into service
[11:08:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:08:22] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 97 hosts with reason: Bringing new nameservers into service
[11:08:33] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 8 hosts with reason: Bringing new nameservers into service
[11:08:53] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 8 hosts with reason: Bringing new nameservers into service
[11:09:00] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet
[11:09:43] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service,hadoop-mapreduce-historyserver.service,hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:01] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-master[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service
[11:10:09] <icinga-wm>	 PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:10:13] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: hive-metastore.service,hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:18] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-master[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service
[11:10:19] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:10:25] <icinga-wm>	 PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:10:26] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-coord[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service
[11:10:27] <icinga-wm>	 PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:10:43] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-coord[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service
[11:11:31] <wikibugs>	 (CR) Brouberol: [C: +1] Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[11:11:40] <wikibugs>	 (CR) Btullis: [C: +2] Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[11:12:45] <wikibugs>	 (PS3) Btullis: Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573)
[11:15:29] <wikibugs>	 SRE, Observability-Alerting, serviceops-radar, User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[11:16:52] <wikibugs>	 SRE, Cloud-VPS, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi)
[11:17:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:17:41] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:47] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:53] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:14] <wikibugs>	 SRE, Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (jcrespo)
[11:18:19] <icinga-wm>	 RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:36] <wikibugs>	 SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure-Foundations, database-backups: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (jcrespo) In progress→Resolved Backups worked over the weekend with no issues. Resolving.
[11:22:15] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:54] <wikibugs>	 (PS18) Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[11:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:30] <wikibugs>	 (PS6) Slyngshede: Ganeti memory pressure alerting. [alerts] - https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694)
[11:30:50] <wikibugs>	 (CR) Slyngshede: Ganeti memory pressure alerting. (3 comments) [alerts] - https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[11:31:04] <wikibugs>	 (CR) CI reject: [V: -1] hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[11:33:47] <icinga-wm>	 RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:33:51] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:03] <icinga-wm>	 RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:34:07] <icinga-wm>	 RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:37:21] <wikibugs>	 (CR) Btullis: [C: +1] spark-history: update an-master hostnames [deployment-charts] - https://gerrit.wikimedia.org/r/990627 (https://phabricator.wikimedia.org/T332573) (owner: Brouberol)
[11:37:26] <wikibugs>	 (CR) Brouberol: [C: +2] spark-history: update an-master hostnames [deployment-charts] - https://gerrit.wikimedia.org/r/990627 (https://phabricator.wikimedia.org/T332573) (owner: Brouberol)
[11:38:39] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[11:38:40] <wikibugs>	 (PS1) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (all hosts) [puppet] - https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619)
[11:39:30] <wikibugs>	 (PS2) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (all hosts) [puppet] - https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619)
[11:39:58] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1111/co" [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[11:40:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:03] <wikibugs>	 (PS1) AikoChou: ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987)
[11:41:13] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:13] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:13] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:15] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:17] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:19] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:21] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:24] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:25] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:27] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:29] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:35] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:39] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:40] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:41] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[11:41:43] <icinga-wm>	 RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:41:49] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:50] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:51] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:41:53] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:54] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:55] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:56] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:58] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:00] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:04] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:06] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:11] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:53] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:55] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:55] <icinga-wm>	 RECOVERY - Check systemd state on analytics1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:55] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:57] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:03] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:07] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:09] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:11] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:15] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:15] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:21] <icinga-wm>	 RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:23] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:23] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:25] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:27] <icinga-wm>	 RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:31] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:33] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:37] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:37] <icinga-wm>	 RECOVERY - Check systemd state on analytics1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:41] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:43] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:44:03] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:44:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:09] <icinga-wm>	 RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:41] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:49] <icinga-wm>	 RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:48:36] <wikibugs>	 (03PS1) 10Btullis: Set the old namenodes to be insetup [puppet] - 10https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573)
[11:50:22] <wikibugs>	 (03PS1) 10Btullis: Revert "Temporarily disable gobblin ingestion" [puppet] - 10https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573)
[11:51:42] <wikibugs>	 (03PS1) 10Btullis: Revert "Temporarily disable systemd jobs on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/990613 (https://phabricator.wikimedia.org/T332573)
[11:51:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Temporarily disable gobblin ingestion" [puppet] - 10https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[11:51:54] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1113/co" [puppet] - 10https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[11:52:16] <wikibugs>	 (03PS2) 10Btullis: Revert "Temporarily disable gobblin ingestion" [puppet] - 10https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573)
[11:52:40] <wikibugs>	 (03PS2) 10Btullis: Revert "Temporarily disable systemd jobs on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/990613 (https://phabricator.wikimedia.org/T332573)
[11:53:32] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Set the old namenodes to be insetup [puppet] - 10https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[11:54:34] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Temporarily disable gobblin ingestion" [puppet] - 10https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[11:58:09] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Temporarily disable systemd jobs on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/990613 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[12:00:25] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.remove-downtime for 92 hosts
[12:01:02] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 92 hosts
[12:01:08] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] hiera: add acls for heavy ratelimiting abusing ip from list (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur)
[12:02:32] <wikibugs>	 (03PS19) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[12:07:03] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619)
[12:07:47] <wikibugs>	 (03PS2) 10Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619)
[12:07:52] <wikibugs>	 (03PS4) 10Muehlenhoff: rsync: Remove support for auto_ferm and rename auto_nft [puppet] - 10https://gerrit.wikimedia.org/r/989444
[12:07:57] <wikibugs>	 (03PS1) 10Btullis: Enable monitoring for the new namenodes [puppet] - 10https://gerrit.wikimedia.org/r/990643 (https://phabricator.wikimedia.org/T332573)
[12:08:53] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1115/co" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur)
[12:09:48] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1116/console" [puppet] - 10https://gerrit.wikimedia.org/r/990643 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[12:13:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Automate BGP peering on MR routers towards core - https://phabricator.wikimedia.org/T354809 (10cmooney) 05Open→03Resolved a:03cmooney
[12:14:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989444 (owner: 10Muehlenhoff)
[12:14:32] <wikibugs>	 (03PS3) 10Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619)
[12:14:39] <wikibugs>	 (03CR) 10Ottomata: "This is in the wrong file.  It should be in helmfile.d/services/eventstreams/values.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman)
[12:16:08] <wikibugs>	 (03CR) 10Ottomata: update eventstream helm values.yaml file to include hard-coded list of redacted pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman)
[12:16:42] <wikibugs>	 (03CR) 10Ottomata: "M" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman)
[12:20:03] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) (owner: 10AikoChou)
[12:20:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to <restricted> for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10JMeybohm) >>! In T354049#9439064, @Aklapper wrote: > How / where was this account created? `ldapsearch -xxx cn="Arthur Taylor"` says `cn` and `sn` are `Arthur t...
[12:20:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice!" [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[12:20:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[12:21:44] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] memcached: provide the CA cert when listening to TLS (all hosts) [puppet] - 10https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[12:23:10] <effie>	 !log stopping puppet on all mediawiki memcached hosts (mc*, mc-gp*), puppet 7 migration in progress - T349619
[12:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:14] <stashbot>	 T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[12:25:19] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990666
[12:28:59] <wikibugs>	 (03PS20) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[12:30:38] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur)
[12:35:51] <wikibugs>	 (03PS4) 10Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619)
[12:36:39] <wikibugs>	 (03CR) 10Muehlenhoff: "insetup::data_engineering won't work for the old master nodes, the role defaults to Puppet 7 and we don't have Puppet 7 for Buster. But we" [puppet] - 10https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[12:37:02] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbstore1003.eqiad.wmnet
[12:37:47] <wikibugs>	 (03PS5) 10Effie Mouzeli: Switch Mediawiki memcache gutter clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619)
[12:39:33] <wikibugs>	 (03PS1) 10Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619)
[12:39:56] <effie>	 !log enable puppet on mc* hosts - T349619
[12:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:59] <stashbot>	 T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[12:42:44] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[12:46:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:12] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable monitoring for the new namenodes [puppet] - 10https://gerrit.wikimedia.org/r/990643 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[12:50:47] <wikibugs>	 (03CR) 10AikoChou: [C: 03+2] ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) (owner: 10AikoChou)
[12:51:49] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) (owner: 10AikoChou)
[12:53:39] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] hiera: add acls for heavy ratelimiting abusing ip from list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur)
[12:54:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mediawiki::memcached::gutter
[12:54:09] <wikibugs>	 (03PS1) 10Slyngshede: LDAP account creation, do not capitalize CN and SN. [software/bitu] - 10https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060)
[12:54:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki memcache gutter clusters to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli)
[12:55:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) (owner: 10Slyngshede)
[12:56:11] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2] LDAP account creation, do not capitalize CN and SN. [software/bitu] - 10https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) (owner: 10Slyngshede)
[12:56:14] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] LDAP account creation, do not capitalize CN and SN. [software/bitu] - 10https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) (owner: 10Slyngshede)
[12:56:27] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:28] <wikibugs>	 (03PS1) 10Btullis: Use insetup::buster for the old namenodes [puppet] - 10https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573)
[12:59:01] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002"
[12:59:08] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1118/co" [puppet] - 10https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[12:59:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mediawiki::memcached::gutter
[13:00:25] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002"
[13:00:25] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:00:26] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbstore1003.eqiad.wmnet
[13:01:56] <wikibugs>	 (03PS2) 10Btullis: Use insetup::buster for the old namenodes [puppet] - 10https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573)
[13:02:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:03:22] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet
[13:04:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (the lack of disabled notifications is an oversight, I'll fix that in a separate commit)" [puppet] - 10https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[13:05:00] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet
[13:05:16] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1119/co" [puppet] - 10https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis)
[13:09:48] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet
[13:10:10] <wikibugs>	 (03PS1) 10Aqu: Update statsd-exporter mappings for Airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/990688 (https://phabricator.wikimedia.org/T343232)
[13:12:11] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet
[13:12:24] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet
[13:13:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:56] <wikibugs>	 (03PS5) 10Muehlenhoff: rsync: Remove support for auto_ferm and rename auto_nft [puppet] - 10https://gerrit.wikimedia.org/r/989444
[13:17:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/989444 (owner: 10Muehlenhoff)
[13:19:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet
[13:19:27] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet
[13:21:55] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to <restricted> for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10ArthurTaylor) @JMeybohm I am able to login to https://wikitech.wikimedia.org/ with "Arthur taylor"
[13:26:17] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet
[13:31:03] <wikibugs>	 (03PS1) 10Jelto: miscweb: update design-strategy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990692 (https://phabricator.wikimedia.org/T350791)
[13:33:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch hadoop master/standby roles to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619)
[13:35:04] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:39:34] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[13:43:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] rsync: Remove support for auto_ferm and rename auto_nft [puppet] - 10https://gerrit.wikimedia.org/r/989444 (owner: 10Muehlenhoff)
[13:48:25] <wikibugs>	 (03PS3) 10Anzx: mywiki: create portal and draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424)
[13:48:38] <wikibugs>	 (03PS2) 10Anzx: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425)
[13:48:48] <wikibugs>	 (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[13:49:03] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:51:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Also default insetup::buster role disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/990695
[13:52:13] <wikibugs>	 (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[13:53:10] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:44] <wikibugs>	 (03PS17) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817)
[13:54:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1400).
[14:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <Lucas_WMDE>	 o/
[14:00:21] <anzx>	 o/
[14:01:34] <Lucas_WMDE>	 I can deploy
[14:01:34] <Lucas_WMDE>	 currently looking at the mywiki change
[14:01:46] <anzx>	 Ok
[14:02:13] <Lucas_WMDE>	 uhm
[14:02:22] <Lucas_WMDE>	 > The community was informed on 22-Nov-23 at here.
[14:02:31] <Lucas_WMDE>	 I don’t really like the verb “informed” there tbh
[14:02:43] <Lucas_WMDE>	 was there… no community discussion? not even a reply from anybody?
[14:02:59] * Lucas_WMDE checks how many active editors the wiki has
[14:04:55] <Lucas_WMDE>	 ok, it’s not a huge amount, but it’s not like Ninjastrikers is the only person on the whole wiki either
[14:07:29] <Lucas_WMDE>	 I’m looking at https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes#How_to_request_a_change now… is there a threshold for “a very small and low-activity community”
[14:07:31] <Lucas_WMDE>	 ?
[14:07:37] <Lucas_WMDE>	 because I wouldn’t call what we currently have a “consensus”
[14:08:07] <Lucas_WMDE>	 Ninjastrikers has certainly “given an opportunity for objections”, but I’m not sure if the wiki counts as small enough to apply that sentence to it
[14:09:37] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: update design-strategy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990692 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:09:56] <Lucas_WMDE>	 I’ll move on to the cawiki change for now
[14:10:03] <wikibugs>	 (03PS1) 10Ladsgroup: mediawiki: Use the new captcha [puppet] - 10https://gerrit.wikimedia.org/r/990697 (https://phabricator.wikimedia.org/T141490)
[14:10:11] <anzx>	 Ok
[14:10:56] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: update design-strategy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990692 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:11:58] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: increase limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/990699 (https://phabricator.wikimedia.org/T354870)
[14:13:14] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: 10Anzx)
[14:15:46] <wikibugs>	 (03PS3) 10Anzx: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425)
[14:16:26] <wikibugs>	 (03CR) 10Anzx: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: 10Anzx)
[14:18:26] <wikibugs>	 (03CR) 10Majavah: "Some post-merge comments. Is it intentional this alert is applied more widely than the existing Icinga check?" [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[14:19:41] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: 10Anzx)
[14:19:46] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: 10Anzx)
[14:20:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: 10Anzx)
[14:21:42] <wikibugs>	 (03Merged) 10jenkins-bot: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: 10Anzx)
[14:21:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:989747|cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (T354425)]]
[14:22:12] <stashbot>	 T354425: Changing autoconfirmed users rights in cawiki - https://phabricator.wikimedia.org/T354425
[14:22:23] <Lucas_WMDE>	 anyone else around to comment on the mywiki question above?
[14:22:40] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: increase limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/990699 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[14:23:01] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:23:26] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:23:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Backport for [[gerrit:989747|cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (T354425)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:24:01] <Lucas_WMDE>	 anzx: can you test cawiki on mwdebug?
[14:24:04] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:24:10] <Lucas_WMDE>	 though I’m not sure how autoconfirm stuff could be tested tb
[14:24:10] <anzx>	 Lucas_WMDE: checking 
[14:24:12] <Lucas_WMDE>	 *tbh
[14:24:14] <Lucas_WMDE>	 ok
[14:24:32] <wikibugs>	 (03PS1) 10Slyngshede: Bump version number to 0.0.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/990701
[14:24:40] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:25:31] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "The PCC lists `toolforge_hosts` as empty, but it seems to be a PCC-specific issue. Cherry-picking this to toolsbeta works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[14:25:34] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: increase limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/990699 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[14:25:39] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:25:55] <anzx>	 Lucas_WMDE: looks good 
[14:26:06] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:26:21] <Lucas_WMDE>	 anzx: I’m curious, what did you actually test? ^^
[14:27:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Continuing with sync
[14:28:04] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:28:22] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:32:15] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: move hba to grid-specific bastion profile [puppet] - 10https://gerrit.wikimedia.org/r/990702
[14:32:17] <wikibugs>	 (03PS1) 10Majavah: O:toolforge: add role for grid-less bastions [puppet] - 10https://gerrit.wikimedia.org/r/990703 (https://phabricator.wikimedia.org/T314665)
[14:32:19] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::shell_environ: remove packages not on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/990704
[14:33:34] <wikibugs>	 (03CR) 10Jelto: trafficserver: switch design.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:33:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:989747|cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (T354425)]] (duration: 11m 36s)
[14:33:40] <stashbot>	 T354425: Changing autoconfirmed users rights in cawiki - https://phabricator.wikimedia.org/T354425
[14:33:47] <wikibugs>	 (03CR) 10Jelto: miscweb/microsites: move monitoring of design to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[14:34:12] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: increase falcon-7b pod memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/990705 (https://phabricator.wikimedia.org/T354870)
[14:35:36] <Lucas_WMDE>	 alright, still no response regarding mywiki…
[14:35:43] <Lucas_WMDE>	 then I’ll decline to deploy that for now, sorry
[14:36:26] <anzx>	 Lucas_WMDE: i will stall that task, stating further community support needed
[14:36:30] <Lucas_WMDE>	 personally I’d like to see at least one support vote from someone else on the project (preferably one of the other active people from recentchanges)
[14:36:42] <Lucas_WMDE>	 though I’m not going to stop anyone else from deploying it either, in case someone else has different standards ^^
[14:36:47] <Lucas_WMDE>	 anzx: ok, thanks!
[14:36:56] <Lucas_WMDE>	 (my next question was going to be if I should write that on the task or you would ^^)
[14:37:22] <anzx>	 Lucas_WMDE: if you want you can, or i will 
[14:37:43] <Lucas_WMDE>	 not particularly… I’m fine with you doing it
[14:37:58] <anzx>	 Ok i will add comment 
[14:38:04] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: increase falcon-7b pod memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/990705 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[14:38:12] <Lucas_WMDE>	 ok thanks!
[14:38:17] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:04] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: increase falcon-7b pod memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/990705 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[14:39:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:43:26] <wikibugs>	 (03PS1) 10Klausman: profile::thanos: Add dummy filter for `le` label [puppet] - 10https://gerrit.wikimedia.org/r/990706
[14:46:24] <wikibugs>	 (03PS2) 10Klausman: profile::thanos: Add dummy filter for `le` label [puppet] - 10https://gerrit.wikimedia.org/r/990706
[14:47:36] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbstore1005.eqiad.wmnet
[14:49:48] <wikibugs>	 (03PS1) 10Btullis: Remove remaining references to dbstore100[35] [puppet] - 10https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923)
[14:54:15] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:56:45] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] profile::thanos: Add dummy filter for `le` label [puppet] - 10https://gerrit.wikimedia.org/r/990706 (owner: 10Klausman)
[14:59:55] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[15:03:04] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[15:04:29] <urbanecm>	 I'm going to deploy some beta patches to test T353225 deployment plan at the beta cluster.
[15:04:29] <stashbot>	 T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225
[15:05:32] <wikibugs>	 (03PS3) 10Urbanecm: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225)
[15:05:36] <wikibugs>	 (03CR) 10Urbanecm: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm)
[15:05:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm)
[15:06:48] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002"
[15:07:02] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm)
[15:11:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Remove remaining references to dbstore100[35] [puppet] - 10https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) (owner: 10Btullis)
[15:14:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm minor nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[15:22:04] <wikibugs>	 (03PS1) 10Reedy: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990711
[15:26:58] <hnowlan>	 !log depooled jobrunner mw1460 to repurpose as k8s node
[15:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:28] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "Looks good. It seems that dbstore1005 is also found in the mwaddlink repo: https://codesearch.wmcloud.org/search/?q=dbstore1005&files=&exc" [puppet] - 10https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) (owner: 10Btullis)
[15:33:18] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791)
[15:35:22] <jinxer-wm>	 (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:36:24] <wikibugs>	 (03PS2) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791)
[15:39:16] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:44:58] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[15:45:30] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) Hm, I stumbled upon something unexpected:   ` root@db1133:/etc/ssl/certs# mysql [snip] MariaDB [(none)]> select @@global.ssl_ca; +---...
[15:46:53] <wikibugs>	 (03PS3) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791)
[15:47:24] <wikibugs>	 (03CR) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[15:48:03] <wikibugs>	 (03PS1) 10Reedy: captchaloop: Generate old and new captchas [puppet] - 10https://gerrit.wikimedia.org/r/990715
[15:52:47] <wikibugs>	 (03PS5) 10Effie Mouzeli: (WIP2) mcrouter vanilla chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981461
[15:54:12] <wikibugs>	 (03Abandoned) 10Reedy: Revert "Workaround for GenerateFancyCaptcha not running as expected in prod" [puppet] - 10https://gerrit.wikimedia.org/r/606021 (https://phabricator.wikimedia.org/T230245) (owner: 10Reedy)
[15:54:15] <jinxer-wm>	 (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:54:15] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:54:16] <wikibugs>	 (03PS1) 10Peter Fischer: enable page_rerender for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990718 (https://phabricator.wikimedia.org/T351503)
[15:54:24] <wikibugs>	 (03PS4) 10Effie Mouzeli: (WIP2) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/982785 (https://phabricator.wikimedia.org/T346690)
[15:55:00] <wikibugs>	 (03PS2) 10Klausman: profile::thanos: Try and use explicit buckets to fix isio latency buckets [puppet] - 10https://gerrit.wikimedia.org/r/990708
[15:57:00] <wikibugs>	 (03PS3) 10Peter Fischer: Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 (https://phabricator.wikimedia.org/T354197)
[15:57:46] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[15:59:48] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:02:29] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] profile::thanos: Try and use explicit buckets to fix isio latency buckets [puppet] - 10https://gerrit.wikimedia.org/r/990708 (owner: 10Klausman)
[16:04:41] <wikibugs>	 (03PS4) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791)
[16:05:12] <wikibugs>	 (03CR) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:05:25] <wikibugs>	 (03PS4) 10Effie Mouzeli: (WIP) modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841
[16:05:40] <wikibugs>	 (03PS9) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847
[16:07:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] clouddumps: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990048 (owner: 10Muehlenhoff)
[16:08:42] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[16:18:32] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[16:18:52] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[16:29:12] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) I ran the following test: with a custom PKI, a server certificate generated with an intermediate CA and the CA bundle fed to Orchestr...
[16:30:05] <jouncebot>	 jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1630). nyaa~
[16:32:57] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: make 4 codfw jobrunner hosts k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791)
[16:37:44] <wikibugs>	 (03PS4) 10Reedy: mediawiki: Replace deprecated blacklist parameter in captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936)
[16:38:18] <wikibugs>	 (03CR) 10Reedy: [C: 03+1] "Fine to be merged at some point now..." [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy)
[16:40:52] <wikibugs>	 (03PS1) 10Majavah: P:openstack: nova::compute: restart libvirt api after changing TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/990724 (https://phabricator.wikimedia.org/T355067)
[16:45:35] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002"
[16:45:35] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:45:36] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbstore1005.eqiad.wmnet
[16:51:52] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[16:55:47] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to <restricted> for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10andrea.denisse) a:05andrea.denisse→03None
[17:00:48] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[17:00:55] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[17:02:19] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Docker
[17:02:47] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
[17:16:12] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "LGTM, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah)
[17:19:31] <Reedy>	 jouncebot: nowandnext
[17:19:31] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 40 minute(s)
[17:19:32] <jouncebot>	 In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800)
[17:19:32] <jouncebot>	 In 0 hour(s) and 40 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800)
[17:23:58] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[17:29:47] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[17:33:29] <wikibugs>	 (03PS1) 10Urbanecm: beta: Temporarily change default value for 3 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990730 (https://phabricator.wikimedia.org/T353225)
[17:33:50] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "beta only, no-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990730 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm)
[17:34:33] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Temporarily change default value for 3 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990730 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm)
[17:35:25] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[17:36:23] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990671
[17:36:25] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990672
[17:36:27] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990673
[17:48:42] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
[17:50:51] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:12] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons.
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800)
[18:00:04] <jouncebot>	 ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800).
[18:14:15] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:52] <urbanecm>	 okay...while testing the T353225 plan, i found a bug in UserOptionsManager. wonderful :D
[18:14:53] <stashbot>	 T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225
[18:37:51] <wikibugs>	 (03CR) 10Gmodena: update eventstream helm values.yaml file to include hard-coded list of redacted pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman)
[18:45:02] <wikibugs>	 (03CR) 10Dreamy Jazz: Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff)
[18:45:34] <wikibugs>	 (03CR) 10Dreamy Jazz: Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff)
[18:51:43] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: 10Cwhite)
[18:54:57] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:55:53] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:25] <wikibugs>	 (03CR) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[19:01:09] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:23:34] <tzatziki>	 !log creating the u4c2024_edits table on all wikis
[19:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:18] <wikibugs>	 (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[19:24:20] <wikibugs>	 (03PS18) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817)
[19:25:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[19:53:33] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:53:53] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:55:19] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T2100).
[21:00:05] <jouncebot>	 tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:02:12] <tgr>	 deploying
[21:03:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164 (owner: 10Gergő Tisza)
[21:11:10] <wikibugs>	 (03PS3) 10Gergő Tisza: Log emails in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164
[21:11:34] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164 (owner: 10Gergő Tisza)
[21:12:34] <wikibugs>	 (03Merged) 10jenkins-bot: Log emails in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164 (owner: 10Gergő Tisza)
[21:12:48] <logmsgbot>	 !log tgr@deploy2002 Started scap: Backport for [[gerrit:990164|Log emails in production]]
[21:14:19] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:990164|Log emails in production]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:15:42] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[21:16:02] <wikibugs>	 (03PS2) 10Reedy: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841)
[21:17:41] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:00] <logmsgbot>	 !log tgr@deploy2002 Finished scap: Backport for [[gerrit:990164|Log emails in production]] (duration: 09m 11s)
[21:23:37] <tgr>	 !log UTC late deploys done
[21:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:34] <wikibugs>	 (03PS3) 10Reedy: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841)
[21:28:48] <wikibugs>	 (03PS1) 10Reedy: wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841)
[21:29:37] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841) (owner: 10Reedy)
[21:30:22] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841) (owner: 10Reedy)
[21:36:22] <wikibugs>	 (03PS2) 10Reedy: wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841)
[21:36:57] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@9b6a69a]: (no justification provided)
[21:37:16] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Swap stringified class names in ConfirmEdit usages (duration: 06m 30s)
[21:37:25] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@9b6a69a]: (no justification provided) (duration: 00m 27s)
[21:38:15] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841) (owner: 10Reedy)
[21:39:09] <wikibugs>	 (03Merged) 10jenkins-bot: wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841) (owner: 10Reedy)
[21:44:03] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:58] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/: Fix more stringified class names (duration: 06m 29s)
[21:58:21] <wikibugs>	 (03CR) 10VolkerE: "@Jelto How would we verify and approve? What to look out for?" [puppet] - 10https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto)
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T2200).
[22:06:10] <wikibugs>	 (03PS19) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817)
[23:55:19] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:55:22] <wikibugs>	 (03PS2) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774)