[00:01:15] RECOVERY - Check systemd state on logstash1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/990185
[00:38:51] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/990185 (owner: TrainBranchBot)
[00:42:13] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:35] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/990185 (owner: TrainBranchBot)
[01:55:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:00:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:02:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s)
- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:04:35] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:39:15] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:41] (CR) Winston Sung: [C: -1] SiteMatrix config: Remove deprecated language codes from the list (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung)
[03:09:15] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:44] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads -
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:21:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:41:44] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:23:57] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:23:59] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:24:11] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:50:33] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[04:51:29] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[04:52:17] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:56:53] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[04:57:19] !log restarting wikitech-static, oom
[04:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:47] RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38
[04:58:21] RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2024-03-06 18:33:38 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:58:39] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK:
Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:58:41] RECOVERY - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is OK: SSL OK - Certificate status.wikimedia.org valid until 2024-03-06 18:33:38 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:00:17] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26898 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:39:37] PROBLEM - Query Service HTTP Port on wdqs1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:26:06] (PS1) Marostegui: db2117: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/990434 (https://phabricator.wikimedia.org/T354506)
[06:28:37] (CR) Marostegui: [C: +2] db2117: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/990434 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[07:10:19] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:33:01] (PS1) Peter Fischer: Search update pipeline: 5th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/990586 (https://phabricator.wikimedia.org/T351503)
[07:47:05] (PS4) Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - https://gerrit.wikimedia.org/r/987431
(https://phabricator.wikimedia.org/T350694)
[07:54:51] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[07:57:49] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Docker
[08:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T0800).
[08:00:05] pfischer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:11] o/
[08:00:56] (CR) DCausse: [C: +1] enable page_rerender for 5th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:02:46] (CR) Slyngshede: [C: +2] Netfilter max connection tracking entires. [alerts] - https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:02:55] dcausse: if currently no one is around, I might as well self-deploy
[08:03:28] pfischer: I can deploy if you want
[08:04:05] Sure, thank you!
[08:04:36] (Merged) jenkins-bot: Netfilter max connection tracking entires.
[alerts] - https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:09:06] (CR) TrainBranchBot: [C: +2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:11:36] (Merged) jenkins-bot: enable page_rerender for 5th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:12:02] !log dcausse@deploy2002 Started scap: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]]
[08:12:06] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[08:13:36] !log dcausse@deploy2002 pfischer and dcausse: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:06] (CR) JMeybohm: "Something else crossed my mind this morning: Running jobs via helmfile will result in one helm release per job run which will never be cle" [puppet] - https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[08:17:31] !log dcausse@deploy2002 pfischer and dcausse: Continuing with sync
[08:23:42] !log dcausse@deploy2002 Finished scap: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]] (duration: 11m 40s)
[08:23:47] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[08:24:22] pfischer: deploy done
[08:26:57] (PS1) Muehlenhoff: Extend access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/990590
[08:33:22] (CR) Muehlenhoff: [C: +2] Extend access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/990590 (owner: Muehlenhoff)
[08:44:13] (CR)
Muehlenhoff: [C: +2] graphite: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/990053 (owner: Muehlenhoff)
[08:45:42] !log filippo@deploy2002 Started deploy [performance/arc-lamp@67389a0]: (no justification provided)
[08:45:47] !log filippo@deploy2002 Finished deploy [performance/arc-lamp@67389a0]: (no justification provided) (duration: 00m 05s)
[08:46:51] (CR) Peter Fischer: [C: +2] Search update pipeline: 5th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/990586 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[08:48:17] (Merged) jenkins-bot: Search update pipeline: 5th batch page_rerender [deployment-charts] - https://gerrit.wikimedia.org/r/990586 (https://phabricator.wikimedia.org/T351503) (owner: Peter Fischer)
[09:03:59] (PS12) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035)
[09:04:04] (PS13) Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035)
[09:14:45] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:15:00] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:15:31] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:15:59] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:16:15] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:16:30] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:18:15] (CR) Filippo Giunchedi: "See inline, LGTM overall" [alerts] -
https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:19:10] (CR) DCausse: Search update pipeline: update README (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/987494 (owner: Peter Fischer)
[09:24:13] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (JMeybohm) >>! In T354276#9443301, @KFrancis wrote: > Hi all, please provide Dima koushha's WMDE email address to kfrancis@wikimedia.org and I'll prepare the NDA. Thank...
[09:27:25] (PS1) Muehlenhoff: Update Mark's key with a new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/990594
[09:30:35] (CR) Mark Bergsma: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/990594 (owner: Muehlenhoff)
[09:30:57] (CR) Muehlenhoff: [C: +2] Update Mark's key with a new ed25519 one [puppet] - https://gerrit.wikimedia.org/r/990594 (owner: Muehlenhoff)
[09:37:56] (PS1) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (2 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619)
[09:41:56] (PS1) Ladsgroup: SecurePoll: Adding updated voterlist files [extensions/SecurePoll] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/990424 (https://phabricator.wikimedia.org/T349263)
[09:42:12] jouncebot: nowandnext
[09:42:12] No deployments scheduled for the next 1 hour(s) and 17 minute(s)
[09:42:12] In 1 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1100)
[09:42:19] (CR) Ladsgroup: [C: +2] SecurePoll: Adding updated voterlist files [extensions/SecurePoll] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/990424 (https://phabricator.wikimedia.org/T349263) (owner: Ladsgroup)
[09:42:44] (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when
listening to TLS (2 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli)
[09:44:23] (PS2) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (3 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619)
[09:45:28] (Merged) jenkins-bot: SecurePoll: Adding updated voterlist files [extensions/SecurePoll] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/990424 (https://phabricator.wikimedia.org/T349263) (owner: Ladsgroup)
[09:46:12] (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when listening to TLS (3 hosts) [puppet] - https://gerrit.wikimedia.org/r/990596 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli)
[09:46:43] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:990424|SecurePoll: Adding updated voterlist files (T349263)]]
[09:46:50] T349263: Create voter list for U4C Charter ratification vote - https://phabricator.wikimedia.org/T349263
[09:48:16] (PS1) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (fix) [puppet] - https://gerrit.wikimedia.org/r/990597
[09:48:18] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:990424|SecurePoll: Adding updated voterlist files (T349263)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:48:37] (PS2) Effie Mouzeli: memcached: provide the CA cert when listening to TLS (fix) [puppet] - https://gerrit.wikimedia.org/r/990597
[09:50:02] (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when listening to TLS (fix) [puppet] - https://gerrit.wikimedia.org/r/990597 (owner: Effie Mouzeli)
[09:50:59] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:47] (PS1) Btullis: Bump the namenode
heap value for the new nameservers [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573)
[09:53:45] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1107/co" [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[09:56:29] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[09:58:46] (PS1) Btullis: Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573)
[09:58:51] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc-gp1002.eqiad.wmnet
[09:59:48] (PS2) Btullis: Update the hadoop nameservers [puppet] - https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573)
[10:02:43] (PS1) Muehlenhoff: Switch mc-gp1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990602 (https://phabricator.wikimedia.org/T349619)
[10:02:48] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:990424|SecurePoll: Adding updated voterlist files (T349263)]] (duration: 16m 04s)
[10:02:52] T349263: Create voter list for U4C Charter ratification vote - https://phabricator.wikimedia.org/T349263
[10:04:15] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) to sum it up as it's a bit confusing to re-read everything: | | puppet5 (db1139) | puppet 7 (db1133) | `mysql --ssl-ca wmf-ca-certif...
[10:05:02] (CR) Muehlenhoff: [C: +2] Switch mc-gp1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990602 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:07:11] (CR) Brouberol: [C: +1] Bump the namenode heap value for the new nameservers [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:07:22] (CR) Btullis: [V: +1 C: +2] Bump the namenode heap value for the new nameservers [puppet] - https://gerrit.wikimedia.org/r/990598 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:08:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc-gp1002.eqiad.wmnet
[10:10:58] (CR) Btullis: [C: +1] spark-history: set production retention to 60 days [deployment-charts] - https://gerrit.wikimedia.org/r/990034 (https://phabricator.wikimedia.org/T354927) (owner: Brouberol)
[10:11:22] (CR) Brouberol: [C: +2] spark-history: set production retention to 60 days [deployment-charts] - https://gerrit.wikimedia.org/r/990034 (https://phabricator.wikimedia.org/T354927) (owner: Brouberol)
[10:11:51] (PS1) Btullis: Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573)
[10:13:23] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1037.eqiad.wmnet
[10:13:40] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1108/co" [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:14:14] (PS1) Brouberol: spark-history: update an-master hostnames [deployment-charts] - https://gerrit.wikimedia.org/r/990627 (https://phabricator.wikimedia.org/T332573)
[10:14:35] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:41] (CR) Brouberol: [C: +1] Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:15:05] (PS1) Muehlenhoff: Switch mc1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990628 (https://phabricator.wikimedia.org/T349619)
[10:15:27] (PS2) Btullis: Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573)
[10:17:13] (CR) Muehlenhoff: [C: +2] Switch mc1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990628 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:17:28] (CR) Brouberol: [C: +1] Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:18:26] (CR) Filippo Giunchedi: [C: -1] "See inline" [puppet] - https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: Herron)
[10:19:08] (PS1) Btullis: Temporarily disable systemd jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/990629 (https://phabricator.wikimedia.org/T332573)
[10:21:10] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1109/co" [puppet] - https://gerrit.wikimedia.org/r/990629 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:21:27] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) as for the certificates side: | | Puppet 7 ca.crt `puppet_rsa` | Puppet 5 ca.crt `palladium.eqiad.wmnet` | wmf-ca.crt `Wikimedia_Int...
[10:22:04] (CR) Btullis: [C: +2] Temporarily disable gobblin ingestion [puppet] - https://gerrit.wikimedia.org/r/990605 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:22:36] (CR) Btullis: [V: +1 C: +2] Temporarily disable systemd jobs on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/990629 (https://phabricator.wikimedia.org/T332573) (owner: Btullis)
[10:22:48] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) Those are tests from the orchestrator server I assume?
[10:24:31] (CR) Vgutierrez: [C: -1] hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:27:42] (CR) Vgutierrez: [C: -1] hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:30:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1037.eqiad.wmnet
[10:32:22] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) No, good catch! I forgot to add those results as well. Previous results were from the previously described tests. From orchestrator...
[10:34:29] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) If db1133 gets fixed, that should mean that the new dbstores (1008, 1009) should pop up and get discovered automatically too.
[10:37:20] (PS5) Klausman: profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - https://gerrit.wikimedia.org/r/989458
[10:37:59] (CR) Klausman: [C: +2] profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - https://gerrit.wikimedia.org/r/989458 (owner: Klausman)
[10:38:06] (CR) Klausman: [V: +2 C: +2] profile::thanos: Remove latency histo bucket filter for istio RR [puppet] - https://gerrit.wikimedia.org/r/989458 (owner: Klausman)
[10:45:56] SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[10:47:23] (PS17) Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910)
[10:48:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,produce_canary_events.service,refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:59] !log installing systemd bugfix updates from Bullseye point release
[10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:11] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) ` root@dborch1001:/etc/ssl/certs# grep -i ca-certificates /etc/orchestrator.conf.json "MySQLOrchestratorSSLCAFile": "/etc/ssl/cert...
[10:49:16] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1110/co" [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:51:19] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet
[10:53:11] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[10:53:59] (CR) Vgutierrez: hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur)
[10:58:45] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1100)
[11:03:07] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet
[11:06:23] PROBLEM - Hadoop DataNode on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:25] PROBLEM - Hadoop DataNode on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:25] PROBLEM - Hadoop DataNode on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] PROBLEM - Hadoop DataNode on an-worker1095
is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] PROBLEM - Hadoop DataNode on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] PROBLEM - Check systemd state on analytics1076 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:28] PROBLEM - Hadoop DataNode on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:29] PROBLEM - Hadoop DataNode on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:30] PROBLEM - Hadoop DataNode on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:31] PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:32] PROBLEM - Hadoop DataNode on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:33] PROBLEM - Hadoop DataNode on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:34] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:35] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:36] PROBLEM - Hadoop DataNode on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:37] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:38] PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:39] PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:40] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units 
failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:41] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:42] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:43] PROBLEM - Hadoop DataNode on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:07:00] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 97 hosts with reason: Bringing new nameservers into service [11:08:05] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:08:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 97 hosts with reason: Bringing new nameservers into service [11:08:33] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 8 hosts with reason: Bringing new nameservers into service [11:08:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 8 hosts with reason: Bringing new nameservers into service [11:09:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet [11:09:43] PROBLEM - Check systemd state on 
an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service,hadoop-mapreduce-historyserver.service,hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:01] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-master[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service [11:10:09] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:10:13] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: hive-metastore.service,hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-master[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service [11:10:19] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:25] PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:10:26] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-coord[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service [11:10:27] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:10:43] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-coord[1001-1004].eqiad.wmnet with reason: Bringing new nameservers into service [11:11:31] (03CR) 10Brouberol: [C: 03+1] Update the hadoop nameservers [puppet] - 10https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:11:40] (03CR) 10Btullis: [C: 03+2] Update the hadoop nameservers [puppet] - 10https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:12:45] (03PS3) 10Btullis: Update the hadoop nameservers [puppet] - 10https://gerrit.wikimedia.org/r/990600 (https://phabricator.wikimedia.org/T332573) [11:15:29] 10SRE, 10Observability-Alerting, 10serviceops-radar, 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [11:16:52] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) [11:17:01] RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:19] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:17:41] RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:47] RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS 
OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:53] RECOVERY - Hadoop DataNode on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:53] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:14] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (10jcrespo) [11:18:19] RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:36] 10SRE, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10database-backups: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10jcrespo) 05In progress→03Resolved Backups worked over the weekend with no issues. Resolving. [11:22:15] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:54] (03PS18) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [11:30:09] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:30] (03PS6) 10Slyngshede: Ganeti memory pressure alerting. 
[alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) [11:30:50] (03CR) 10Slyngshede: Ganeti memory pressure alerting. (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:31:04] (03CR) 10CI reject: [V: 04-1] hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [11:33:47] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:33:51] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:03] RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:34:07] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:37:21] (03CR) 10Btullis: [C: 03+1] spark-history: update an-master hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/990627 (https://phabricator.wikimedia.org/T332573) (owner: 10Brouberol) [11:37:26] (03CR) 10Brouberol: [C: 03+2] spark-history: update an-master hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/990627 (https://phabricator.wikimedia.org/T332573) (owner: 10Brouberol) [11:38:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [11:38:40] (03PS1) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS (all hosts) [puppet] - 
10https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619) [11:39:30] (03PS2) 10Effie Mouzeli: memcached: provide the CA cert when listening to TLS (all hosts) [puppet] - 10https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619) [11:39:58] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1111/co" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [11:40:01] RECOVERY - Hadoop DataNode on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:03] (03PS1) 10AikoChou: ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) [11:41:13] RECOVERY - Hadoop DataNode on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:13] RECOVERY - Hadoop DataNode on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:13] RECOVERY - Hadoop DataNode on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:15] RECOVERY - Hadoop DataNode on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:17] RECOVERY - Hadoop DataNode on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:19] RECOVERY - Hadoop DataNode on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:21] RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:21] RECOVERY - Hadoop DataNode on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:23] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:23] RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:23] RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:24] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:25] RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:27] RECOVERY - Hadoop DataNode on 
an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:29] RECOVERY - Hadoop DataNode on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:33] RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:33] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:35] RECOVERY - Hadoop DataNode on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:37] RECOVERY - Check systemd state on an-worker1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:39] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:39] RECOVERY - Hadoop DataNode on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:40] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:41] RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [11:41:43] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:47] RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:47] RECOVERY - Check systemd state on an-worker1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:47] RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:49] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:41:49] RECOVERY - Hadoop DataNode on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:50] RECOVERY - Hadoop DataNode on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:51] RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:52] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:41:53] RECOVERY - Hadoop DataNode on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:54] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:55] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:56] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:57] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:58] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:59] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:00] RECOVERY - Hadoop DataNode on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:01] RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:02] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:03] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:04] RECOVERY - Hadoop DataNode on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:05] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:06] RECOVERY - Hadoop DataNode on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:09] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:11] RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:13] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:53] RECOVERY - Hadoop DataNode on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:55] RECOVERY - Hadoop DataNode on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:55] RECOVERY - Check systemd state on analytics1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:55] RECOVERY - Hadoop DataNode on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:57] RECOVERY - Hadoop DataNode on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:01] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:03] RECOVERY - Hadoop DataNode on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:07] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:07] RECOVERY - Hadoop DataNode on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:09] RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:09] RECOVERY - Hadoop DataNode on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:09] RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:11] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:15] RECOVERY - Hadoop DataNode on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:15] RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:17] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:17] RECOVERY - Check systemd state on 
an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:21] RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:23] RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:23] RECOVERY - Hadoop DataNode on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:23] RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:25] RECOVERY - Hadoop DataNode on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:27] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:27] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:31] RECOVERY - Hadoop DataNode on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:33] RECOVERY - 
Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:33] RECOVERY - Hadoop DataNode on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:37] RECOVERY - Hadoop DataNode on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:37] RECOVERY - Check systemd state on an-worker1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:37] RECOVERY - Check systemd state on analytics1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:39] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:41] RECOVERY - Hadoop DataNode on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:43] RECOVERY - Hadoop DataNode on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:45] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process 
with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:01] RECOVERY - Hadoop DataNode on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:44:03] RECOVERY - Hadoop DataNode on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:44:07] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:09] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:11] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:19] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:25] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:41] RECOVERY - Check 
systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:45] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:49] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:49] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:05] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:48:36] (PS1) Btullis: Set the old namenodes to be insetup [puppet] - https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) [11:50:22] (PS1) Btullis: Revert "Temporarily disable gobblin ingestion" [puppet] - https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573) [11:51:42] (PS1) Btullis: Revert "Temporarily disable systemd jobs on an-launcher1002" [puppet] - https://gerrit.wikimedia.org/r/990613 (https://phabricator.wikimedia.org/T332573) [11:51:46] (CR) CI reject: [V: -1] Revert "Temporarily disable gobblin ingestion" [puppet] - https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [11:51:54] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1113/co" [puppet] - https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [11:52:16] (PS2) 
Btullis: Revert "Temporarily disable gobblin ingestion" [puppet] - https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573) [11:52:40] (PS2) Btullis: Revert "Temporarily disable systemd jobs on an-launcher1002" [puppet] - https://gerrit.wikimedia.org/r/990613 (https://phabricator.wikimedia.org/T332573) [11:53:32] (CR) Btullis: [V: +1 C: +2] Set the old namenodes to be insetup [puppet] - https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [11:54:34] (CR) Btullis: [C: +2] Revert "Temporarily disable gobblin ingestion" [puppet] - https://gerrit.wikimedia.org/r/990612 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [11:58:09] (CR) Btullis: [C: +2] Revert "Temporarily disable systemd jobs on an-launcher1002" [puppet] - https://gerrit.wikimedia.org/r/990613 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [12:00:25] !log btullis@cumin1002 START - Cookbook sre.hosts.remove-downtime for 92 hosts [12:01:02] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 92 hosts [12:01:08] (CR) Fabfur: [V: +1] hiera: add acls for heavy ratelimiting abusing ip from list (3 comments) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur) [12:02:32] (PS19) Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [12:07:03] (PS1) Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) [12:07:47] (PS2) Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) [12:07:52] (PS4) Muehlenhoff: rsync: Remove support for auto_ferm and rename 
auto_nft [puppet] - https://gerrit.wikimedia.org/r/989444 [12:07:57] (PS1) Btullis: Enable monitoring for the new namenodes [puppet] - https://gerrit.wikimedia.org/r/990643 (https://phabricator.wikimedia.org/T332573) [12:08:53] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1115/co" [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur) [12:09:48] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1116/console" [puppet] - https://gerrit.wikimedia.org/r/990643 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [12:13:32] SRE, Infrastructure-Foundations, netops: Automate BGP peering on MR routers towards core - https://phabricator.wikimedia.org/T354809 (cmooney) Open→Resolved a:cmooney [12:14:11] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/989444 (owner: Muehlenhoff) [12:14:32] (PS3) Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) [12:14:39] (CR) Ottomata: "This is in the wrong file. 
It should be in helmfile.d/services/eventstreams/values.yaml" [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (owner: Htriedman) [12:16:08] (CR) Ottomata: update eventstream helm values.yaml file to include hard-coded list of redacted pages (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (owner: Htriedman) [12:16:42] (CR) Ottomata: "M" [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (owner: Htriedman) [12:20:03] (CR) Kevin Bazira: [C: +1] ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) (owner: AikoChou) [12:20:04] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (JMeybohm) >>! In T354049#9439064, @Aklapper wrote: > How / where was this account created? `ldapsearch -xxx cn="Arthur Taylor"` says `cn` and `sn` are `Arthur t... [12:20:08] (CR) Filippo Giunchedi: [C: +1] "LGTM, nice!" 
[alerts] - https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede) [12:20:17] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli) [12:21:44] (CR) Effie Mouzeli: [C: +2] memcached: provide the CA cert when listening to TLS (all hosts) [puppet] - https://gerrit.wikimedia.org/r/990635 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli) [12:23:10] !log stopping puppet on all mediawiki memcached hosts (mc*, mc-gp*), puppet 7 migration in progress - T349619 [12:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:14] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 [12:25:19] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/990666 [12:28:59] (PS20) Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [12:30:38] (CR) Fabfur: [V: +1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur) [12:35:51] (PS4) Effie Mouzeli: Switch Mediawiki memcache clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) [12:36:39] (CR) Muehlenhoff: "insetup::data_engineering won't work for the old master nodes, the role defaults to Puppet 7 and we don't have Puppet 7 for Buster. 
But we" [puppet] - https://gerrit.wikimedia.org/r/990637 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [12:37:02] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbstore1003.eqiad.wmnet [12:37:47] (PS5) Effie Mouzeli: Switch Mediawiki memcache gutter clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) [12:39:33] (PS1) Effie Mouzeli: Switch Mediawiki main memcache clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619) [12:39:56] !log enable puppet on mc* hosts - - T349619 [12:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:59] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 [12:42:44] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [12:46:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:12] (CR) Btullis: [V: +1 C: +2] Enable monitoring for the new namenodes [puppet] - https://gerrit.wikimedia.org/r/990643 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [12:50:47] (CR) AikoChou: [C: +2] ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) (owner: AikoChou) [12:51:49] (Merged) jenkins-bot: ml-services: update revertrisk-la batcher image on ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/990636 (https://phabricator.wikimedia.org/T352987) (owner: AikoChou) [12:53:39] (CR) Fabfur: [V: +1] hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur) [12:54:09] !log jmm@cumin2002 START - 
Cookbook sre.puppet.migrate-role for role: mediawiki::memcached::gutter [12:54:09] (PS1) Slyngshede: LDAP account creation, do not capitalize CN and SN. [software/bitu] - https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) [12:54:52] (CR) Muehlenhoff: [C: +2] Switch Mediawiki memcache gutter clusters to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990641 (https://phabricator.wikimedia.org/T349619) (owner: Effie Mouzeli) [12:55:54] (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) (owner: Slyngshede) [12:56:11] (CR) Slyngshede: [V: +2] LDAP account creation, do not capitalize CN and SN. [software/bitu] - https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) (owner: Slyngshede) [12:56:14] (CR) Slyngshede: [V: +2 C: +2] LDAP account creation, do not capitalize CN and SN. [software/bitu] - https://gerrit.wikimedia.org/r/990664 (https://phabricator.wikimedia.org/T355060) (owner: Slyngshede) [12:56:27] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:28] (PS1) Btullis: Use insetup::buster for the old namenodes [puppet] - https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) [12:59:01] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [12:59:08] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1118/co" [puppet] - https://gerrit.wikimedia.org/r/990665 
(https://phabricator.wikimedia.org/T332573) (owner: Btullis) [12:59:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mediawiki::memcached::gutter [13:00:25] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [13:00:25] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:00:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbstore1003.eqiad.wmnet [13:01:56] (PS2) Btullis: Use insetup::buster for the old namenodes [puppet] - https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) [13:02:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:03:22] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet [13:04:15] (CR) Muehlenhoff: [C: +1] "Looks good (the lack of disabled noticitions is an oversight, I'll fix that in a separate commit)" [puppet] - https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [13:05:00] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [13:05:16] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1119/co" [puppet] - https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) (owner: Btullis) [13:09:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet [13:10:10] (PS1) Aqu: Update 
statsd-exporter mappings for Airflow instances [puppet] - https://gerrit.wikimedia.org/r/990688 (https://phabricator.wikimedia.org/T343232) [13:12:11] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [13:12:24] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [13:13:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:56] (PS5) Muehlenhoff: rsync: Remove support for auto_ferm and rename auto_nft [puppet] - https://gerrit.wikimedia.org/r/989444 [13:17:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/989444 (owner: Muehlenhoff) [13:19:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [13:19:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet [13:21:55] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (ArthurTaylor) @JMeybohm I am able to login to https://wikitech.wikimedia.org/ with "Arthur taylor" [13:26:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet [13:31:03] (PS1) Jelto: miscweb: update design-strategy image version [deployment-charts] - https://gerrit.wikimedia.org/r/990692 (https://phabricator.wikimedia.org/T350791) [13:33:04] (PS1) Muehlenhoff: Switch hadoop master/standby roles to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) [13:35:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [13:39:34] SRE, 
SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [13:43:16] (CR) Muehlenhoff: [C: +2] rsync: Remove support for auto_ferm and rename auto_nft [puppet] - https://gerrit.wikimedia.org/r/989444 (owner: Muehlenhoff) [13:48:25] (PS3) Anzx: mywiki: create portal and draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) [13:48:38] (PS2) Anzx: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) [13:48:48] (CR) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [13:49:03] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[13:51:07] (PS1) Muehlenhoff: Also default insetup::buster role disable notifications [puppet] - https://gerrit.wikimedia.org/r/990695 [13:52:13] (CR) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [13:53:10] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:44] (PS17) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [13:54:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1400). [14:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:21] o/ [14:01:34] I can deploy [14:01:34] currently looking at the mywiki change [14:01:46] Ok [14:02:13] uhm [14:02:22] > The community was informed on 22-Nov-23 at here. [14:02:31] I don’t really like the verb “informed” there tbh [14:02:43] was there… no community discussion? not even a reply from anybody? 
[14:02:59] * Lucas_WMDE checks how many active editors the wiki has [14:04:55] ok, it’s not a huge amount, but it’s not like Ninjastrikers is the only person on the whole wiki either [14:07:29] I’m looking at https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes#How_to_request_a_change now… is there a threshold for “a very small and low-activity community” [14:07:31] ? [14:07:37] because I wouldn’t call what we currently have a “consensus” [14:08:07] Ninjastrikers has certainly “given an opportunity for objections”, but I’m not sure if the wiki counts as small enough to apply that sentence to it [14:09:37] (CR) Jelto: [C: +2] miscweb: update design-strategy image version [deployment-charts] - https://gerrit.wikimedia.org/r/990692 (https://phabricator.wikimedia.org/T350791) (owner: Jelto) [14:09:56] I’ll move on to the cawiki change for now [14:10:03] (PS1) Ladsgroup: mediawiki: Use the new captcha [puppet] - https://gerrit.wikimedia.org/r/990697 (https://phabricator.wikimedia.org/T141490) [14:10:11] Ok [14:10:56] (Merged) jenkins-bot: miscweb: update design-strategy image version [deployment-charts] - https://gerrit.wikimedia.org/r/990692 (https://phabricator.wikimedia.org/T350791) (owner: Jelto) [14:11:58] (PS1) Ilias Sarantopoulos: ml-services: increase limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/990699 (https://phabricator.wikimedia.org/T354870) [14:13:14] (CR) Lucas Werkmeister (WMDE): cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: Anzx) [14:15:46] (PS3) Anzx: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) [14:16:26] (CR) Anzx: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (1 comment) [mediawiki-config] - 
https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: Anzx) [14:18:26] (CR) Majavah: "Some post-merge comments. Is it intentional this alert is applied more widely than the existing Icinga check?" [alerts] - https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede) [14:19:41] (CR) Lucas Werkmeister (WMDE): [C: +1] cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: Anzx) [14:19:46] (PS4) Lucas Werkmeister (WMDE): cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: Anzx) [14:20:55] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: Anzx) [14:21:42] (Merged) jenkins-bot: cawiki: update wgAutoConfirmAge and wgAutoConfirmCount [mediawiki-config] - https://gerrit.wikimedia.org/r/989747 (https://phabricator.wikimedia.org/T354425) (owner: Anzx) [14:21:59] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:989747|cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (T354425)]] [14:22:12] T354425: Changing autoconfirmed users rights in cawiki - https://phabricator.wikimedia.org/T354425 [14:22:23] anyone else around to comment on the mywiki question above? 
[14:22:40] (CR) Klausman: [V: +2 C: +2] ml-services: increase limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/990699 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos) [14:23:01] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:23:26] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:23:42] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Backport for [[gerrit:989747|cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (T354425)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:24:01] anzx: can you test cawiki on mwdebug? [14:24:04] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:24:10] though I’m not sure how autoconfirm stuff could be tested tb [14:24:10] Lucas_WMDE: checking [14:24:12] *tbh [14:24:14] ok [14:24:32] (PS1) Slyngshede: Bump version number to 0.0.4 [software/bitu] - https://gerrit.wikimedia.org/r/990701 [14:24:40] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:25:31] (CR) Majavah: [V: +1] "The PCC lists `toolforge_hosts` as empty, but it seems to be a PCC-specific issue. Cherry-picking this to toolsbeta works as expected." [puppet] - https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: Majavah) [14:25:34] (Merged) jenkins-bot: ml-services: increase limitranges [deployment-charts] - https://gerrit.wikimedia.org/r/990699 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos) [14:25:39] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:25:55] Lucas_WMDE: looks good [14:26:06] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:26:21] anzx: I’m curious, what did you actually test? 
^^ [14:27:33] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Continuing with sync [14:28:04] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:28:22] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:32:15] (PS1) Majavah: P:toolforge: move hba to grid-specific bastion profile [puppet] - https://gerrit.wikimedia.org/r/990702 [14:32:17] (PS1) Majavah: O:toolforge: add role for grid-less bastions [puppet] - https://gerrit.wikimedia.org/r/990703 (https://phabricator.wikimedia.org/T314665) [14:32:19] (PS1) Majavah: P:toolforge::shell_environ: remove packages not on bullseye [puppet] - https://gerrit.wikimedia.org/r/990704 [14:33:34] (CR) Jelto: trafficserver: switch design.wikimedia.org to wikikube [puppet] - https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: Jelto) [14:33:36] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:989747|cawiki: update wgAutoConfirmAge and wgAutoConfirmCount (T354425)]] (duration: 11m 36s) [14:33:40] T354425: Changing autoconfirmed users rights in cawiki - https://phabricator.wikimedia.org/T354425 [14:33:47] (CR) Jelto: miscweb/microsites: move monitoring of design to monitoring profile [puppet] - https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: Jelto) [14:34:12] (PS1) Ilias Sarantopoulos: ml-services: increase falcon-7b pod memory [deployment-charts] - https://gerrit.wikimedia.org/r/990705 (https://phabricator.wikimedia.org/T354870) [14:35:36] alright, still no response regarding mywiki… [14:35:43] then I’ll decline to deploy that for now, sorry [14:36:26] Lucas_WMDE: i will stall that task , stating further community support needed [14:36:30] personally I’d like to see at least one support vote from someone else on the project (preferably one of the other active people from recentchanges) [14:36:42] 
though I’m not going to stop anyone else from deploying it either, in case someone else has different standards ^^ [14:36:47] anzx: ok, thanks! [14:36:56] (my next question was going to be if I should write that on the task or you would ^^) [14:37:22] Lucas_WMDE: if you want you can, or i will [14:37:43] not particularly… I’m fine with you doing it [14:37:58] Ok i will add comment [14:38:04] (CR) Klausman: [C: +2] ml-services: increase falcon-7b pod memory [deployment-charts] - https://gerrit.wikimedia.org/r/990705 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos) [14:38:12] ok thanks! [14:38:17] !log UTC afternoon backport+config window done [14:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:04] (Merged) jenkins-bot: ml-services: increase falcon-7b pod memory [deployment-charts] - https://gerrit.wikimedia.org/r/990705 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos) [14:39:15] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:26] (PS1) Klausman: profile::thanos: Add dummy filter for `le` label [puppet] - https://gerrit.wikimedia.org/r/990706 [14:46:24] (PS2) Klausman: profile::thanos: Add dummy filter for `le` label [puppet] - https://gerrit.wikimedia.org/r/990706 [14:47:36] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbstore1005.eqiad.wmnet [14:49:48] (PS1) Btullis: Remove remaining references to dbstore100[35] [puppet] - https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) [14:54:15] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:45] (CR) Klausman: [C: +2] profile::thanos: Add dummy filter for `le` label [puppet] - https://gerrit.wikimedia.org/r/990706 (owner: Klausman) [14:59:55] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:03:04] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:04:29] I'm going to deploy some beta patches to test T353225 deployment plan at the beta cluster. [15:04:29] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [15:05:32] (PS3) Urbanecm: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) [15:05:36] (CR) Urbanecm: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm) [15:05:47] (CR) Urbanecm: [C: +2] beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm) [15:06:48] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [15:07:02] (Merged) jenkins-bot: beta: Temporarily change default value for 4 Echo properties [mediawiki-config] - https://gerrit.wikimedia.org/r/987963 (https://phabricator.wikimedia.org/T353225) (owner: Urbanecm) [15:11:12] (CR) Marostegui: [C: +1] Remove remaining references to dbstore100[35] [puppet] - https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) (owner: Btullis) 
[15:14:11] (CR) Jbond: [C: +1] "lgtm minor nit inline" [puppet] - https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: Muehlenhoff) [15:22:04] (PS1) Reedy: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - https://gerrit.wikimedia.org/r/990711 [15:26:58] !log depooled jobrunner mw1460 to repurpose as k8s node [15:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:28] (CR) Brouberol: [C: +1] "Looks good. It seems that dbstore1005 is also found in the mwaddlink repo: https://codesearch.wmcloud.org/search/?q=dbstore1005&files=&exc" [puppet] - https://gerrit.wikimedia.org/r/990707 (https://phabricator.wikimedia.org/T351923) (owner: Btullis) [15:33:18] (PS1) Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) [15:35:22] (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:36:24] (PS2) Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) [15:39:16] (JobUnavailable) firing: (6) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:58] (CR) Effie Mouzeli: [C: -1] kubernetes: make jobrunner mw1460 a kubernetes worker (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [15:45:30] 
10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) Hm, I stumbled upon something unexpected: ` root@db1133:/etc/ssl/certs# mysql [snip] MariaDB [(none)]> select @@global.ssl_ca; +---... [15:46:53] (03PS3) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) [15:47:24] (03CR) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:48:03] (03PS1) 10Reedy: captchaloop: Generate old and new captchas [puppet] - 10https://gerrit.wikimedia.org/r/990715 [15:52:47] (03PS5) 10Effie Mouzeli: (WIP2) mcrouter vanilla chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981461 [15:54:12] (03Abandoned) 10Reedy: Revert "Workaround for GenerateFancyCaptcha not running as expected in prod" [puppet] - 10https://gerrit.wikimedia.org/r/606021 (https://phabricator.wikimedia.org/T230245) (owner: 10Reedy) [15:54:15] (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:15] (JobUnavailable) firing: (6) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:54:16] (03PS1) 10Peter Fischer: enable page_rerender for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990718 (https://phabricator.wikimedia.org/T351503) [15:54:24] (03PS4) 10Effie 
Mouzeli: (WIP2) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/982785 (https://phabricator.wikimedia.org/T346690) [15:55:00] (03PS2) 10Klausman: profile::thanos: Try and use explicit buckets to fix isio latency buckets [puppet] - 10https://gerrit.wikimedia.org/r/990708 [15:57:00] (03PS3) 10Peter Fischer: Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 (https://phabricator.wikimedia.org/T354197) [15:57:46] (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:59:48] (03CR) 10Effie Mouzeli: [C: 04-1] kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:02:29] (03CR) 10Klausman: [V: 03+2 C: 03+2] profile::thanos: Try and use explicit buckets to fix isio latency buckets [puppet] - 10https://gerrit.wikimedia.org/r/990708 (owner: 10Klausman) [16:04:41] (03PS4) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) [16:05:12] (03CR) 10Hnowlan: kubernetes: make jobrunner mw1460 a kubernetes worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:05:25] (03PS4) 10Effie Mouzeli: (WIP) modules/lamp: remove job_1.0.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/989841 [16:05:40] (03PS9) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980847 [16:07:52] (03CR) 10Muehlenhoff: [C: 03+2] clouddumps: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990048 (owner: 10Muehlenhoff) [16:08:42] (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: 
make jobrunner mw1460 a kubernetes worker [puppet] - 10https://gerrit.wikimedia.org/r/990713 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:18:32] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [16:18:52] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [16:29:12] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) I ran the following test: with a custom PKI, a server certificate generated with an intermediate CA and the CA bundle fed to Orchestr... [16:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1630). nyaa~ [16:32:57] (03PS1) 10Hnowlan: kubernetes: make 4 codfw jobrunner hosts k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/990723 (https://phabricator.wikimedia.org/T354791) [16:37:44] (03PS4) 10Reedy: mediawiki: Replace deprecated blacklist parameter in captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) [16:38:18] (03CR) 10Reedy: [C: 03+1] "Fine to be merged at some point now..." 
[puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [16:40:52] (03PS1) 10Majavah: P:openstack: nova::compute: restart libvirt api after changing TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/990724 (https://phabricator.wikimedia.org/T355067) [16:45:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbstore1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [16:45:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbstore1005.eqiad.wmnet [16:51:52] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [16:55:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10andrea.denisse) a:05andrea.denisse→03None [17:00:48] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [17:00:55] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [17:02:19] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Docker [17:02:47] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [17:16:12] (03CR) 10FNegri: [C: 03+1] "LGTM, one nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [17:19:31] jouncebot: nowandnext [17:19:31] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [17:19:32] In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800) [17:19:32] In 0 hour(s) and 40 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800) [17:23:58] !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [17:29:47] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [17:33:29] (03PS1) 10Urbanecm: beta: Temporarily change default value for 3 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990730 (https://phabricator.wikimedia.org/T353225) [17:33:50] (03CR) 10Urbanecm: [C: 03+2] "beta only, no-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990730 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [17:34:33] (03Merged) 10jenkins-bot: beta: Temporarily change default value for 3 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990730 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [17:35:25] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan) [17:36:23] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990671 [17:36:25] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990672 [17:36:27] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/990673 [17:48:42] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [17:50:51] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800) [18:00:04] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T1800). [18:14:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:52] okay...while testing the T353225 plan, i found a bug in UserOptionsManager. 
wonderful :D [18:14:53] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [18:37:51] (03CR) 10Gmodena: update eventstream helm values.yaml file to include hard-coded list of redacted pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [18:45:02] (03CR) 10Dreamy Jazz: Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff) [18:45:34] (03CR) 10Dreamy Jazz: Update associated email address for dreamyjazz (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988399 (https://phabricator.wikimedia.org/T353735) (owner: 10Muehlenhoff) [18:51:43] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) (owner: 10Cwhite) [18:54:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:53] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:25] (03CR) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:01:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:23:34] !log creating the u4c2024_edits table on all wikis [19:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:24:18] (CR) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[19:24:20] (PS18) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817)
[19:25:00] (CR) CI reject: [V: -1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[19:53:33] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:53:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:55:19] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T2100).
[21:00:05] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:02:12] deploying
[21:03:55] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/990164 (owner: Gergő Tisza)
[21:11:10] (PS3) Gergő Tisza: Log emails in production [mediawiki-config] - https://gerrit.wikimedia.org/r/990164
[21:11:34] (CR) TrainBranchBot: "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/990164 (owner: Gergő Tisza)
[21:12:34] (Merged) jenkins-bot: Log emails in production [mediawiki-config] - https://gerrit.wikimedia.org/r/990164 (owner: Gergő Tisza)
[21:12:48] !log tgr@deploy2002 Started scap: Backport for [[gerrit:990164|Log emails in production]]
[21:14:19] !log tgr@deploy2002 tgr: Backport for [[gerrit:990164|Log emails in production]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:15:42] !log tgr@deploy2002 tgr: Continuing with sync
[21:16:02] (PS2) Reedy: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841)
[21:17:41] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:00] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:990164|Log emails in production]] (duration: 09m 11s)
[21:23:37] !log UTC late deploys done
[21:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:34] (PS3) Reedy: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841)
[21:28:48] (PS1) Reedy: wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841)
[21:29:37] (CR) Reedy: [C: +2] CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[21:30:22] (Merged) jenkins-bot: CommonSettings: Swap stringified class names in ConfirmEdit usages [mediawiki-config] - https://gerrit.wikimedia.org/r/990711 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[21:36:22] (PS2) Reedy: wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841)
[21:36:57] !log fab@deploy2002 Started deploy [airflow-dags/research@9b6a69a]: (no justification provided)
[21:37:16] !log reedy@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Swap stringified class names in ConfirmEdit usages (duration: 06m 30s)
[21:37:25] !log fab@deploy2002 Finished deploy [airflow-dags/research@9b6a69a]: (no justification provided) (duration: 00m 27s)
[21:38:15] (CR) Reedy: [C: +2] wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[21:39:09] (Merged) jenkins-bot: wmf-config: Replace numerous stringified classes with use statements [mediawiki-config] - https://gerrit.wikimedia.org/r/990766 (https://phabricator.wikimedia.org/T251841) (owner: Reedy)
[21:44:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:58] !log reedy@deploy2002 Synchronized wmf-config/: Fix more stringified class names (duration: 06m 29s)
[21:58:21] (CR) VolkerE: "@Jelto How would we verify and approve? What to look out for?" [puppet] - https://gerrit.wikimedia.org/r/989834 (https://phabricator.wikimedia.org/T350791) (owner: Jelto)
[22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240115T2200).
[22:06:10] (PS19) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817)
[23:55:19] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:55:22] (PS2) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/990402 (https://phabricator.wikimedia.org/T349774)