[00:02:28] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:12:00] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:33:02] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:35:04] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[00:49:04] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:03:40] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:12:00] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:28:33] <wikibugs>	 (PS1) DDesouza: QuickSurveys: Add research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[01:29:50] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:32:10] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:42:22] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:55:26] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:57:44] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:57:59] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (tstarling)
[02:00:48] <wikibugs>	 (PS9) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:04:06] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (tstarling)
[02:07:56] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.17 [core] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/806966
[02:08:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.17 [core] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/806966 (owner: TrainBranchBot)
[02:23:18] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.17 [core] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/806966 (owner: TrainBranchBot)
[02:30:36] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:05] <wikibugs>	 (PS10) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:40:52] <wikibugs>	 (CR) CI reject: [V: -1] Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[02:43:39] <wikibugs>	 (PS11) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:47:28] <wikibugs>	 (CR) Tim Starling: "* PS9: rebase" [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[03:05:58] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:38] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:22] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:28] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:56] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:26:38] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:58] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:12] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:45:34] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:46:50] <wikibugs>	 (CR) Tim Starling: "This is pretty harmless, and once it is merged, we can benchmark it in production." [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[03:54:52] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:56:23] <wikibugs>	 (PS2) Tim Starling: Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[03:57:16] <wikibugs>	 (CR) CI reject: [V: -1] Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[04:02:50] <wikibugs>	 (PS3) Tim Starling: Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[04:04:10] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:13:30] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:15:27] <wikibugs>	 (PS3) Tim Starling: Set $wgCentralAuthTokenCacheType to mcrouter-master-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[04:21:11] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2022-06-21-035954-production [deployment-charts] - https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970)
[04:21:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:28] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[04:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[04:48:46] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:51:24] <icinga-wm>	 PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:07:36] <icinga-wm>	 RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[05:07:47] <wikibugs>	 (PS1) Marostegui: db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/806972
[05:10:13] <wikibugs>	 (CR) Marostegui: [C: +2] db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/806972 (owner: Marostegui)
[05:24:42] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:54] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:30:45] <wikibugs>	 (PS1) Marostegui: mariadb: Set innodb_max_dirty_pages_pct to 75 [puppet] - https://gerrit.wikimedia.org/r/806973 (https://phabricator.wikimedia.org/T308380)
[05:33:52] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Set innodb_max_dirty_pages_pct to 75 [puppet] - https://gerrit.wikimedia.org/r/806973 (https://phabricator.wikimedia.org/T308380) (owner: Marostegui)
[05:34:00] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) The amount of binlogs per day is also fine (not like parsercache which generates an insane amount of...
[05:48:48] <wikibugs>	 (PS1) Marostegui: Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/806525
[05:49:44] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/806525 (owner: Marostegui)
[05:54:14] <marostegui>	 !log Reboot db1132 and db1181 for kernel upgrade
[05:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:42] <wikibugs>	 (PS1) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:06:13] <wikibugs>	 (CR) CI reject: [V: -1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[06:09:46] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:11:48] <wikibugs>	 (PS2) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:12:40] <wikibugs>	 (CR) CI reject: [V: -1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[06:15:23] <wikibugs>	 (PS3) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:26:36] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:04] <wikibugs>	 (PS4) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:29:05] <wikibugs>	 (CR) CI reject: [V: -1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[06:31:05] <wikibugs>	 (PS5) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:35:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:39:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:25] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for ppena [puppet] - https://gerrit.wikimedia.org/r/807041
[06:53:24] <wikibugs>	 (CR) Slyngshede: [C: +2] admin: add taavi to contint-admins [puppet] - https://gerrit.wikimedia.org/r/806487 (https://phabricator.wikimedia.org/T309375) (owner: Dzahn)
[06:53:45] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:54:53] <wikibugs>	 (PS3) Slyngshede: zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[06:54:56] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[06:55:15] <wikibugs>	 (PS2) Muehlenhoff: Remove LDAP access for ppena [puppet] - https://gerrit.wikimedia.org/r/807041
[06:56:26] <wikibugs>	 (CR) Slyngshede: [C: +2] zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[06:58:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for ppena [puppet] - https://gerrit.wikimedia.org/r/807041 (owner: Muehlenhoff)
[06:59:06] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (taavi) Resolved→Open Hi @SLyngshede-WMF, please also add myself to the `ciadmin` ldap group as requested in the task description. Thanks!
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0700).
[07:00:04] <jouncebot>	 matthiasmullie and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:01] <wikibugs>	 (PS1) Slyngshede: WIP: Ganeti Prometheus exporter deployment [puppet] - https://gerrit.wikimedia.org/r/807043
[07:01:42] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (SLyngshede-WMF) @taavi Sorry, didn't spot that. I'll be right back :)
[07:01:54] <wikibugs>	 (CR) CI reject: [V: -1] WIP: Ganeti Prometheus exporter deployment [puppet] - https://gerrit.wikimedia.org/r/807043 (owner: Slyngshede)
[07:01:58] <matthiasmullie>	 o/
[07:04:23] <matthiasmullie>	 brb, nature calls
[07:04:49] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (SLyngshede-WMF) Open→Resolved @taavi You're now added to ciadmin, but let me know if something doesn't work.
[07:08:42] <matthiasmullie>	 b
[07:09:07] <matthiasmullie>	 I can deploy my own patch
[07:10:01] <wikibugs>	 (CR) Matthias Mullie: [C: +2] Add ImageSuggestions to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: Matthias Mullie)
[07:10:48] <wikibugs>	 (Merged) jenkins-bot: Add ImageSuggestions to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: Matthias Mullie)
[07:11:24] <kostajh>	 \o sorry to be late to the party
[07:11:41] <kostajh>	 matthiasmullie: let me know when you're done, I can deploy my patch
[07:12:01] <wikibugs>	 (CR) Slyngshede: [C: +2] prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:12:18] <matthiasmullie>	 kostajh: sure!
[07:12:57] <wikibugs>	 (PS2) Slyngshede: prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:17:38] <matthiasmullie>	 kostajh: all done, the floor is yours!
[07:17:45] <kostajh>	 matthiasmullie: cheers
[07:18:33] <wikibugs>	 (CR) Kosta Harlan: [C: +2] GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan)
[07:20:05] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:18] <kostajh>	 matthiasmullie: I don't see my patch on mediawiki-staging after git status && git fetch, have I done something wrong?
[07:21:44] <kostajh>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805766 says the gate pipeline succeeded, but gerrit also shows a merge conflict
[07:22:47] <matthiasmullie>	 kostajh: looks like it didn't merge
[07:22:50] <matthiasmullie>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805766
[07:23:04] <wikibugs>	 (PS3) Kosta Harlan: GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548)
[07:23:14] <kostajh>	 alright let's see if a rebase fixes it
[07:25:09] <wikibugs>	 (PS1) Matthias Mullie: [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711)
[07:26:02] <wikibugs>	 (CR) Kosta Harlan: "side note: it could be useful to make a phab task for this and tag this patch with it, for increased visibility and to have a place to gat" [puppet] - https://gerrit.wikimedia.org/r/806488 (owner: Ori)
[07:26:26] <wikibugs>	 (PS1) Matthias Mullie: [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807050 (https://phabricator.wikimedia.org/T302711)
[07:27:25] <kostajh>	 matthiasmullie: should I press "Submit"? Usually that happens on its own. cc Amir1 && urbanecm 
[07:27:52] <wikibugs>	 (CR) Matthias Mullie: [C: +2] GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan)
[07:28:19] <urbanecm>	 Good morning kostajh. Shouldn't be needed. 
[07:28:28] <matthiasmullie>	 kostajh: yeah, usually does it on its own; I guess it didn't because it already had +2 prior?
[07:28:38] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan)
[07:28:43] <urbanecm>	 Yeah, that'd do it. 
[07:28:49] <hashar>	 good morning
[07:28:55] <kostajh>	 hrm
[07:28:56] <kostajh>	 ok, thanks
[07:29:17] <matthiasmullie>	 IIRC, removing your own vote & reapplying +2 also kicks it off again
[07:31:34] <kostajh>	 sigh, I need to revert my patch, I didn't read back far enough in the relevant phab task
[07:31:46] <wikibugs>	 (PS1) Kosta Harlan: Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992
[07:32:02] <wikibugs>	 (PS1) Muehlenhoff: Remove reprepro config from releases* [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765)
[07:32:04] <wikibugs>	 (CR) Kosta Harlan: [C: +2] Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992 (owner: Kosta Harlan)
[07:32:30] <Amir1>	 kostajh: don't forget to rebase before hitting +2
[07:32:35] <Amir1>	 I usually do 
[07:33:02] <kostajh>	 Amir1: it says "Change is up to date with the target branch already (master) "
[07:33:18] <kostajh>	 (For the revert patch.)
[07:33:25] <wikibugs>	 (CR) Kosta Harlan: [C: +2] Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992 (owner: Kosta Harlan)
[07:33:39] <Amir1>	 So that's not why it doesn't merge it then 
[07:34:11] <wikibugs>	 (Merged) jenkins-bot: Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992 (owner: Kosta Harlan)
[07:34:13] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[07:35:05] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:56] <kostajh>	 alright, I'm done
[07:37:12] <kostajh>	 well, scap is still wrapping up its thing
[07:42:06] <wikibugs>	 (PS2) Slyngshede: zookeeper: remove absented zookeeper-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:42:30] <wikibugs>	 (CR) CI reject: [V: -1] zookeeper: remove absented zookeeper-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:44:59] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces ayounsi Telxius outage - https://phabricator.wikimedia.org/T311036 - The acknowledgement expires at: 2022-06-22 07:44:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:44:59] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 ayounsi Telxius outage - https://phabricator.wikimedia.org/T311036 - The acknowledgement expires at: 2022-06-22 07:44:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:44:59] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP ayounsi Telxius outage - https://phabricator.wikimedia.org/T311036 - The acknowledgement expires at: 2022-06-22 07:44:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:46:03] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:56] <hashar>	 kostajh: has your deploy completed ? ;)
[07:49:12] <kostajh>	 hashar: yes!
[07:49:29] <hashar>	 I will start the train dance in a few minutes so :]
[07:52:32] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." [puppet] - https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:52:49] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:19] <hashar>	 kostajh: I have closed and removed from train blockers a GrowthExperiments task from an earlier train "TypeError: Cannot read properties of undefined (reading 'dailyLimit')"   https://phabricator.wikimedia.org/T309768
[07:54:32] <hashar>	 looks like that got fixed in master/ wmf.16  and backported to wmf.15
[07:54:39] <hashar>	 I am going to roll wmf.17 which does include the fix
[07:54:45] <hashar>	 so I went bold and marked that one resolved
[07:54:49] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia ayounsi https://phabricator.wikimedia.org/T311038 - The acknowledgement expires at: 2022-06-22 07:54:30. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:18] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[07:56:19] <wikibugs>	 (CR) Slyngshede: [C: +2] sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:59:13] <wikibugs>	 (PS1) Slyngshede: C:dumps::web::dumpstatusfiles, convert to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673)
[07:59:23] <wikibugs>	 (PS2) Muehlenhoff: Remove reprepro config from releases* [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765)
[08:00:05] <jouncebot>	 hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0800).
[08:00:18] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/806286 (owner: JMeybohm)
[08:01:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove reprepro config from releases* [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[08:03:07] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:24] <wikibugs>	 (CR) Volans: Allow to dry-run SREBatchRunnerBase (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/806285 (owner: JMeybohm)
[08:04:49] <wikibugs>	 (PS1) Hashar: testwikis wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807058 (https://phabricator.wikimedia.org/T308070)
[08:04:51] <wikibugs>	 (CR) Hashar: [C: +2] testwikis wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807058 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:05:37] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807058 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:11:31] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:14:03] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:32] <apergos>	 what was wrong I wonder
[08:14:37] <moritzm>	 !log remove EOLed parsoid debs from releases.wikimedia.org T309765
[08:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:42] <stashbot>	 T309765: Retire the old Parsoid deb repository? - https://phabricator.wikimedia.org/T309765
[08:15:39] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:49] <wikibugs>	 (CR) Volans: [C: -1] "I think there is a small error, see details inline." [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[08:20:31] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 5.347e+06 ge 2.592e+05 ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[08:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[08:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1690100572056 and 1231592 seconds ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1743034916712 and 1289445 seconds ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1691134467920 and 1231494 seconds ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:26:49] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) Please let us know before proceeding with this as now db1131 is a master so we'd need to switch it back to become a single replica. So please let us know before hand with 2-3 days...
[08:28:02] <wikibugs>	 (PS1) Muehlenhoff: Retire releasers-parsoid group [puppet] - https://gerrit.wikimedia.org/r/807061 (https://phabricator.wikimedia.org/T309765)
[08:29:18] <marostegui>	 !log Reboot db1120 for kernel upgrade
[08:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (ayounsi) > Kindly be informed that we have logged your issue under ref 01420952, we will investigate and get back to you with our findings.
[08:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[08:45:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:47:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:47:47] <wikibugs>	 (CR) Volans: "replies inline" [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[08:49:42] <wikibugs>	 (CR) Jbond: [C: +1] icinga: ensure that the downtime was applied [software/spicerack] - https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: Volans)
[08:51:08] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/807061 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[08:52:52] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35941/console" [puppet] - https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:57:26] <elukey>	 !log copy package 'jvm-tools' from buster-wikimedia to bullseye-wikimedia on apt1001 - T310980
[08:57:40] <wikibugs>	 (CR) Volans: [C: +2] icinga: ensure that the downtime was applied [software/spicerack] - https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: Volans)
[08:58:36] <hashar>	 so testwiki got promoted, I am going to do group0 wikis
[08:59:22] <wikibugs>	 (PS1) Hashar: group0 wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807064 (https://phabricator.wikimedia.org/T308070)
[08:59:24] <wikibugs>	 (CR) Hashar: [C: +2] group0 wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807064 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:59:40] <wikibugs>	 (PS1) Elukey: aptrepo: add cassandra components to bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980)
[09:00:05] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807064 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[09:00:41] <wikibugs>	 (CR) CI reject: [V: -1] aptrepo: add cassandra components to bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:01:33] <wikibugs>	 (CR) Muehlenhoff: "Looks good, comments inline" [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[09:02:17] <wikibugs>	 (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:06:50] <wikibugs>	 (PS2) Slyngshede: memcached: migrate memkeys cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[09:08:11] <wikibugs>	 (CR) CI reject: [V: -1] icinga: ensure that the downtime was applied [software/spicerack] - https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: Volans)
[09:09:59] <wikibugs>	 (CR) Slyngshede: memcached: migrate memkeys cron to systemd timer job (4 comments) [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[09:11:16] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[09:11:25] <wikibugs>	 (PS1) Elukey: Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980)
[09:12:43] <wikibugs>	 (PS1) Muehlenhoff: Remove profile::releases::upload and related classes [puppet] - https://gerrit.wikimedia.org/r/807069 (https://phabricator.wikimedia.org/T309765)
[09:13:19] <marostegui>	 !log dbmaint s8@codfw T310011
[09:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:23] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:13:32] <wikibugs>	 (PS3) Slyngshede: memcached: migrate memkeys cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[09:14:33] <wikibugs>	 (CR) Elukey: "I haven't tested the tools but the changes look straightforward to me. If the changes are good we can cherry pick the commit in the debian" [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:18:01] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35943/console" [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[09:19:22] <wikibugs>	 (PS1) Muehlenhoff: Remove aptrepo spec test [puppet] - https://gerrit.wikimedia.org/r/807071
[09:20:24] <marostegui>	 !log dbmaint s8@eqiad T310011
[09:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:28] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:20:56] <wikibugs>	 (PS1) Volans: doc: fix intersphinx links [software/spicerack] - https://gerrit.wikimedia.org/r/807074
[09:21:52] <urbanecm>	 jouncebot: nowandnext
[09:21:53] <jouncebot>	 For the next 0 hour(s) and 38 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0800)
[09:21:53] <jouncebot>	 In 3 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300)
[09:21:53] <jouncebot>	 In 3 hour(s) and 38 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300)
[09:22:43] <wikibugs>	 (CR) Elukey: [C: +1] Remove aptrepo spec test [puppet] - https://gerrit.wikimedia.org/r/807071 (owner: Muehlenhoff)
[09:23:05] <urbanecm>	 hashar: looks like the train deployment is done; would it be fine for me to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/806407, or should I wait the ~40 minutes?
[09:23:20] <hashar>	 urbanecm: go for it :)
[09:23:23] <urbanecm>	 thanks!
[09:23:37] <wikibugs>	 (CR) Urbanecm: [C: +2] Add a throttle rule for a Czech course [mediawiki-config] - https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) (owner: Urbanecm)
[09:23:40] <wikibugs>	 (PS2) Urbanecm: Add a throttle rule for a Czech course [mediawiki-config] - https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885)
[09:23:43] <hashar>	 and thank you for checking with me!
[09:23:45] <wikibugs>	 (CR) Urbanecm: [C: +2] Add a throttle rule for a Czech course [mediawiki-config] - https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) (owner: Urbanecm)
[09:23:47] <wikibugs>	 (CR) Muehlenhoff: Apply 2to3 to migrate the code to Python3 (2 comments) [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:24:22] <urbanecm>	 no problem :)
[09:25:01] <wikibugs>	 (Merged) jenkins-bot: Add a throttle rule for a Czech course [mediawiki-config] - https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) (owner: Urbanecm)
[09:25:36] <wikibugs>	 (PS2) Elukey: Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980)
[09:25:48] <wikibugs>	 (CR) Elukey: "Thanks!" [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:25:50] <wikibugs>	 (PS8) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[09:25:52] <wikibugs>	 (PS17) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[09:28:37] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, there are some subtleties 2to3 won't catch, but those will be found during ml-cache ramp-up." [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:31:09] <urbanecm>	 okay, scap sync-file completed, but logmsgbot is not here :/
[09:31:28] <urbanecm>	 can an SRE follow https://wikitech.wikimedia.org/wiki/Logmsgbot#Restart to restart it please?
[09:31:52] <urbanecm>	 !log 09:29:23 Synchronized wmf-config/throttle.php: 7c9f6a561b2b4b5c5db063bad83bd23e9cbac347: Add a throttle rule for a Czech course (T310885) (duration: 03m 34s) #manually logging in logmsgbot's absence
[09:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:59] <stashbot>	 T310885: Request a throttle lift for Czech course for students – 2022-06-23 - https://phabricator.wikimedia.org/T310885
[09:32:22] <taavi>	 is it just me or have irc bots hosted on our networks recently been more unstable than usual?
[09:32:35] <urbanecm>	 I'm not sure. perhaps?
[09:32:51] <wikibugs>	 (CR) Jbond: Netbox stats, set scrape interval to 1h (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:36:50] <wikibugs>	 (CR) Filippo Giunchedi: "I see why we'd want to store less samples for non-changing data, though scrape intervals larger than 2m AFAIK are to be avoided (details a" [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:37:32] <wikibugs>	 (CR) Ayounsi: Netbox stats, set scrape interval to 1h (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:37:56] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, I'll let Eric comment authoritatively though" [puppet] - https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: Cwhite)
[09:38:02] <wikibugs>	 (CR) Jbond: admin: Temporarily disable legoktm's access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806489 (owner: Legoktm)
[09:39:39] <wikibugs>	 (CR) Jbond: [C: +1] Fix typoes found by Junoser (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/806857 (owner: Ayounsi)
[09:39:41] <urbanecm>	 jbond: hi, can I trouble you to rotate logmsgbot? https://wikitech.wikimedia.org/wiki/Logmsgbot#Restart it's not here and not logging deployments :/
[09:39:53] <wikibugs>	 (CR) Volans: [C: +2] doc: fix intersphinx links [software/spicerack] - https://gerrit.wikimedia.org/r/807074 (owner: Volans)
[09:40:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Untested but LGTM! Thank you" [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) (owner: Cwhite)
[09:40:42] <wikibugs>	 (CR) Ayounsi: Netbox stats, set scrape interval to 1h (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:43:02] <wikibugs_>	 (CR) Vgutierrez: [C: +1] service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[09:43:33] <wikibugs_>	 (CR) Filippo Giunchedi: Netbox stats, set scrape interval to 1h (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:43:54] <wikibugs_>	 (PS1) Muehlenhoff: Remove mailman-admins [puppet] - https://gerrit.wikimedia.org/r/807078
[09:44:14] <wikibugs_>	 SRE, Data-Engineering, Traffic, Patch-For-Review, User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (phuedx) >>! In T306181#8013301, @Ottomata wrote: > Thanks ben!  Seconded. Thanks for all of your w...
[09:44:28] <wikibugs_>	 (CR) Elukey: [C: +2] service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[09:44:29] <wikibugs_>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/806430 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[09:47:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove aptrepo spec test [puppet] - https://gerrit.wikimedia.org/r/807071 (owner: Muehlenhoff)
[09:48:20] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff)
[09:49:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! Nicely done" [alerts] - https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[09:49:32] <wikibugs>	 (CR) Btullis: [C: +1] "Also LGTM but will defer to Eric." [puppet] - https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: Cwhite)
[09:49:45] <wikibugs>	 (Merged) jenkins-bot: doc: fix intersphinx links [software/spicerack] - https://gerrit.wikimedia.org/r/807074 (owner: Volans)
[09:50:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove references [puppet] - https://gerrit.wikimedia.org/r/806426 (owner: Muehlenhoff)
[09:52:07] <wikibugs>	 (CR) Filippo Giunchedi: Netbox: add monitoring to dns.git endpoint (2 comments) [puppet] - https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: Ayounsi)
[09:52:10] <wikibugs>	 (PS4) Volans: icinga: ensure that the downtime was applied [software/spicerack] - https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447)
[09:52:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[09:52:59] <wikibugs>	 (CR) Jbond: [C: +1] Fix typoes found by Junoser (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/806857 (owner: Ayounsi)
[09:54:02] <wikibugs>	 (PS2) Jbond: aptrepo: add cassandra components to bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:54:39] <wikibugs>	 (CR) Filippo Giunchedi: "Sorry I'm lagging a bit behind testing this, I can say for sure though that 'confd' package isn't in Bullseye so this change will fail" [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[09:54:41] <wikibugs>	 (CR) Jbond: "looks like moritz removed this spec test so have rebased (lgtm otherwise)" [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:55:16] <wikibugs>	 (PS2) Ayounsi: Netbox stats, set scrape interval to 2m [puppet] - https://gerrit.wikimedia.org/r/806422
[09:55:26] <wikibugs>	 (CR) Elukey: "Thanks John!" [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:55:40] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[09:56:26] <wikibugs>	 (CR) Jbond: [C: +1] Netbox stats, set scrape interval to 2m [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:56:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Netbox stats, set scrape interval to 2m [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[09:57:21] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:59:05] <wikibugs>	 (PS1) Btullis: Update the container image used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/807081 (https://phabricator.wikimedia.org/T310629)
[09:59:11] <wikibugs>	 (CR) Ayounsi: [C: +2] Netbox stats, set scrape interval to 2m [puppet] - https://gerrit.wikimedia.org/r/806422 (owner: Ayounsi)
[10:00:59] <wikibugs>	 (CR) Ayounsi: [C: +2] Fix typoes found by Junoser [homer/public] - https://gerrit.wikimedia.org/r/806857 (owner: Ayounsi)
[10:03:00] <wikibugs>	 (CR) JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[10:05:33] <wikibugs>	 (CR) JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[10:06:21] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[10:10:37] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:51] <wikibugs>	 (CR) Btullis: Add a host's confctl pooled status and weight per service to prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[10:15:06] <wikibugs>	 (CR) Btullis: [C: +2] Update the container image used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/807081 (https://phabricator.wikimedia.org/T310629) (owner: Btullis)
[10:15:13] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:17:16] <wikibugs>	 SRE-tools, Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (JMeybohm)
[10:17:27] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:17:45] <wikibugs>	 (CR) JMeybohm: Allow to dry-run SREBatchRunnerBase (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/806285 (owner: JMeybohm)
[10:19:26] <wikibugs>	 (Merged) jenkins-bot: Update the container image used for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/807081 (https://phabricator.wikimedia.org/T310629) (owner: Btullis)
[10:24:56] <wikibugs>	 (CR) Volans: "question inline" [puppet] - https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: Ayounsi)
[10:25:18] <wikibugs>	 (PS1) Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - https://gerrit.wikimedia.org/r/807090
[10:26:15] <wikibugs>	 (CR) Volans: [C: -1] "You need to import RemoteExecutionError from spicerack" [cookbooks] - https://gerrit.wikimedia.org/r/807090 (owner: Muehlenhoff)
[10:26:50] <wikibugs>	 (CR) Btullis: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: Hnowlan)
[10:27:26] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] memcached: migrate memkeys cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[10:27:41] <wikibugs>	 (PS2) Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - https://gerrit.wikimedia.org/r/807090
[10:28:03] <wikibugs>	 (CR) Klausman: [C: +1] aptrepo: add cassandra components to bullseye-wikimedia [puppet] - https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[10:28:44] <wikibugs>	 (CR) Klausman: [C: +1] Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: Elukey)
[10:30:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:30:12] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:30:35] <wikibugs>	 (CR) CI reject: [V: -1] sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - https://gerrit.wikimedia.org/r/807090 (owner: Muehlenhoff)
[10:31:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:32:20] <wikibugs>	 (PS4) Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831)
[10:32:41] <wikibugs>	 (PS3) Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - https://gerrit.wikimedia.org/r/807090
[10:33:19] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2022-06-21-035954-production [deployment-charts] - https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970)
[10:34:02] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/inference-staging https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:34:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (jcrespo) I can take care of install and puppet changes if firmware/boot is taken care of -it that helps speed it up.  We would like to have 100...
[10:34:22] <wikibugs>	 (PS1) Ayounsi: Prometheus: temporarily disable the Netbox job [puppet] - https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048)
[10:34:34] <vgutierrez>	 elukey: ^^
[10:35:18] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) (owner: Ayounsi)
[10:36:40] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/inference-staging https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:36:55] <wikibugs>	 (CR) Filippo Giunchedi: Netbox: add monitoring to dns.git endpoint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: Ayounsi)
[10:36:58] <vgutierrez>	 elukey: I'm assuming that's triggered by 'cluster=ml_staging,service=kubesvc' not having any server
[10:36:59] <wikibugs>	 (CR) Slyngshede: [C: +1] dumps: migrate cron of dumps-exception-checker to systemd timer (2 comments) [puppet] - https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[10:37:07] <wikibugs>	 (CR) Slyngshede: [C: +2] dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[10:37:41] <vgutierrez>	 Jun 21 10:37:15 puppetmaster1001 confd[19313]: 2022-06-21T10:37:15Z puppetmaster1001 /usr/bin/confd[19313]: ERROR 100: Key not found (/conftool/v1/pools/codfw/ml_staging/kubesvc) [582380]
[10:37:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Prometheus: temporarily disable the Netbox job [puppet] - 10https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi)
[10:37:44] <vgutierrez>	 looks like it
[10:38:48] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Prometheus: temporarily disable the Netbox job [puppet] - 10https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi)
[10:38:52] * kart_ updating cxserver
[10:39:12] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-06-21-035954-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970) (owner: 10KartikMistry)
[10:39:24] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] osm: migrate import_waterlines cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/781050 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[10:39:38] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] C:tilerator::regen fix logging and rename service. [puppet] - 10https://gerrit.wikimedia.org/r/805829 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[10:41:49] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35944/console" [puppet] - 10https://gerrit.wikimedia.org/r/781050 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[10:42:03] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff)
[10:42:17] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I don't know how it will affect Phabricator though :)" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:42:22] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2022-06-21-035954-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970) (owner: 10KartikMistry)
[10:42:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff)
[10:42:54] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[10:44:45] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[10:44:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:12] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[10:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:54] <wikibugs>	 (03CR) 10JMeybohm: "Hey Jesse, do you have some time to do a review of this by chance?" [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[10:47:06] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:47:17] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[10:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:28] <wikibugs>	 (03PS2) 10JMeybohm: Add helm-state-metrics helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714)
[10:47:30] <wikibugs>	 (03PS2) 10JMeybohm: Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714)
[10:47:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[10:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:58] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[10:48:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:26] <wikibugs>	 (03PS1) 10Muehlenhoff: squid: Harden config, we don't use Gopher anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807093
[10:48:49] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[10:48:52] <wikibugs>	 (03CR) 10JMeybohm: Add helm-state-metrics helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[10:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:49:35] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[10:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:56] <kart_>	 !log Updated cxserver to 2022-06-21-035954-production (T307970)
[10:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:01] <stashbot>	 T307970: Deploy Flores Machine Translation in a new set of Languages - https://phabricator.wikimedia.org/T307970
[10:57:44] <volans>	 !log deleting netbox getstats.GetDeviceStats job results - T311048
[10:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:50] <stashbot>	 T311048: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048
[10:59:47] <wikibugs>	 (03PS2) 10JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661)
[10:59:49] <wikibugs>	 (03PS3) 10JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661)
[10:59:51] <wikibugs>	 (03PS1) 10Muehlenhoff: squid/url downloaders: Drop Gopher in ACLs, not used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807094
[11:00:22] <wikibugs>	 (03PS1) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095
[11:01:56] <wikibugs>	 (03PS2) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048)
[11:02:34] <wikibugs>	 (03PS3) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048)
[11:02:41] <wikibugs>	 (03CR) 10JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[11:03:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Retire releasers-parsoid group [puppet] - 10https://gerrit.wikimedia.org/r/807061 (https://phabricator.wikimedia.org/T309765) (owner: 10Muehlenhoff)
[11:10:45] <wikibugs>	 (03PS1) 10Klausman: net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195)
[11:11:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:16:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:17:35] <wikibugs>	 (03PS1) 10Jbond: WIP: make the export title much more unique [puppet] - 10https://gerrit.wikimedia.org/r/807097
[11:18:05] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[11:21:16] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:32:45] <wikibugs>	 (03PS1) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048)
[11:34:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: smokeping: stop targetting cr devices, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860)
[11:34:47] <wikibugs>	 (03PS2) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048)
[11:35:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35946/console" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[11:35:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff)
[11:36:23] <wikibugs>	 (03PS4) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048)
[11:37:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[11:37:47] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10jbond)
[11:39:18] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:40:51] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:tilerator::regen fix logging and rename service. [puppet] - 10https://gerrit.wikimedia.org/r/805829 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:41:08] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:41:50] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:41:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 for testing', diff saved to https://phabricator.wikimedia.org/P29934 and previous config saved to /var/cache/conftool/dbconfig/20220621-114151-root.json
[11:41:54] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:57] <wikibugs>	 (03PS2) 10Zabe: osm: remove absented import_waterlines cron [puppet] - 10https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673)
[11:42:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for testing', diff saved to https://phabricator.wikimedia.org/P29935 and previous config saved to /var/cache/conftool/dbconfig/20220621-114216-root.json
[11:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 for testing', diff saved to https://phabricator.wikimedia.org/P29936 and previous config saved to /var/cache/conftool/dbconfig/20220621-114232-root.json
[11:42:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:39] <wikibugs>	 (03PS2) 10Zabe: memcached: remove absented memkeys cron [puppet] - 10https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673)
[11:42:56] <wikibugs>	 (03PS3) 10Zabe: sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673)
[11:43:24] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[11:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[11:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:27] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[11:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:32] <wikibugs>	 (03PS1) 10Zabe: dumps: remove absented dumps-exception-checker cron [puppet] - 10https://gerrit.wikimedia.org/r/807101 (https://phabricator.wikimedia.org/T273673)
[11:48:30] <wikibugs>	 (03PS3) 10Zabe: zookeeper: remove absented zookeeper-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673)
[11:50:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the reviews -- nothing substantial should change I think, I'll try and deploy the patch next week!" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[11:50:03] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "To be tested but logic sgtm!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[11:51:47] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593)
[11:55:35] <wikibugs>	 (03CR) 10Ayounsi: P:netbox: add dynamic config back to config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[11:55:37] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35947/console" [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto)
[11:55:55] <wikibugs>	 (03PS2) 10Jbond: wmflib::resource::export: make exported resource titles more unique [puppet] - 10https://gerrit.wikimedia.org/r/807097
[11:56:24] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Fix bridge detection logic and provide guidance what do you [cookbooks] - 10https://gerrit.wikimedia.org/r/807105
[11:59:04] <mbsantos>	 !log mbsantos@maps2009 imposm-removebackup-import (T305845)
[11:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:11] <stashbot>	 T305845: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845
[12:00:40] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "This may be a solution for filling docker cache on gitlab-runner nodes." [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto)
[12:00:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35948/console" [puppet] - 10https://gerrit.wikimedia.org/r/807097 (owner: 10Jbond)
[12:01:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Fix bridge detection logic and provide guidance what do you [cookbooks] - 10https://gerrit.wikimedia.org/r/807105 (owner: 10Muehlenhoff)
[12:02:19] <wikibugs>	 (03PS1) 10MSantos: maps: re-enable tile generation cron in codfw [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845)
[12:05:42] <wikibugs>	 (03PS2) 10MSantos: maps: re-enable tile generation cron in codfw [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845)
[12:05:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[12:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:02] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[12:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I ran into this while debugging sth else in Pontoon, please let me know what you think (and related https://gerrit.wikimedia.org/r/c/opera" [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi)
[12:06:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: re-enable tile generation cron in codfw [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845) (owner: 10MSantos)
[12:07:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "No worries! Thank you for the review" [puppet] - 10https://gerrit.wikimedia.org/r/806377 (owner: 10Filippo Giunchedi)
[12:07:27] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: update hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/806377
[12:12:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[12:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:12:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: base: include profile::pontoon::base (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806374 (owner: 10Filippo Giunchedi)
[12:12:35] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[12:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:36] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:19:50] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[12:20:58] <wikibugs>	 (03PS5) 10Jbond: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi)
[12:23:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35950/console" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi)
[12:23:48] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib::resource::export: make exported resource titles more unique [puppet] - 10https://gerrit.wikimedia.org/r/807097 (owner: 10Jbond)
[12:25:06] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, see inline for the discussed point" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[12:25:38] <moritzm>	 !log reset logster-csp/logster-badpass-priv on mwlog1002, these were removed from Puppet
[12:25:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:25:50] <icinga-wm>	 RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[12:29:58] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10MoritzMuehlenhoff)
[12:30:10] <wikibugs>	 (03PS3) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048)
[12:30:13] <wikibugs>	 (03CR) 10Jbond: P:netbox: add dynamic config back to config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[12:30:25] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10MoritzMuehlenhoff) 05Open→03Resolved ganeti4004 has been added to the ganeti/ulsfo cluster now. Cluster is currently rebalancing.
[12:32:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi)
[12:32:14] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:45] <wikibugs>	 (03PS1) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[12:33:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] pontoon: add metricsinfra_prometheus_nodes to settings [puppet] - 10https://gerrit.wikimedia.org/r/806379 (owner: 10Filippo Giunchedi)
[12:34:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[12:35:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi)
[12:35:46] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "You should be able to leave this variable empty as long as `prometheus_nodes` is set up correctly." [puppet] - 10https://gerrit.wikimedia.org/r/806379 (owner: 10Filippo Giunchedi)
[12:35:57] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi)
[12:36:09] <wikibugs>	 (03PS2) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[12:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[12:37:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[12:37:34] <wikibugs>	 (03CR) 10Jelto: "small question inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[12:39:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi)
[12:39:46] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
[12:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:50] <wikibugs>	 (03PS2) 10Filippo Giunchedi: wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378
[12:40:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet
[12:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: pontoon: add metricsinfra_prometheus_nodes to settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806379 (owner: 10Filippo Giunchedi)
[12:43:44] <moritzm>	 !log installing python-bottle security updates
[12:43:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: rework prometheus settings in its own file [puppet] - 10https://gerrit.wikimedia.org/r/806379
[12:48:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, question inline" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[12:48:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline for non-blocking comment" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi)
[12:50:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet
[12:50:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet
[12:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:30] <wikibugs>	 (03CR) 10Jbond: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[12:52:30] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
[12:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
[12:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:52] <wikibugs>	 (03PS4) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048)
[12:54:07] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[12:54:12] <wikibugs>	 (03CR) 10Jbond: P:netbox: add dynamic config back to config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[12:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:54:22] <wikibugs>	 (03PS6) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574)
[12:54:51] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Sounds good, we can later on investigate offloading the caches to Swift/S3 ;)" [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto)
[12:55:07] <wikibugs>	 (03PS3) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[12:55:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35952/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[12:56:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[12:56:28] <wikibugs>	 (03CR) 10Muehlenhoff: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[12:56:39] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "On integration and contint* machines we do some pruning via ::profile::docker::prune , but given Gitlab provides its own clear cache syste" [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto)
[12:56:48] <moritzm>	 !log installing haproxy security updates on stretch
[12:56:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet
[12:57:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:23] <wikibugs>	 (03PS4) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[12:59:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[12:59:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet
[12:59:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:55] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Change is ready for review again, addressing the optional nits: using external and merging the BGP configurations." [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[12:59:57] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] rpc: Remove unused RunJobs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300).
[13:00:04] <jouncebot>	 duesen, xsavitar, Lucas_WMDE, itamarWMDE, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:04] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300)
[13:00:11] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC for centrallog and dns: https://puppet-compiler.wmflabs.org/pcc-worker1003/35882/" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:01:38] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
[13:01:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:52] <wikibugs>	 (03PS5) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[13:02:28] <wikibugs>	 (03PS6) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[13:02:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
[13:02:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:23] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Interestingly enough, using "external" results in bird2 complaining:" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:03:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[13:04:13] <MichaelG_WMDE>	 Lucas will be here in a second, slight IRC client trouble
[13:04:21] <itamarWMDE>	 o/ here, and Lucas_WMDE is on his way
[13:05:04] <moritzm>	 !log installing Linux 5.10.120-1~bpo10+1 on buster hosts with backports kernel
[13:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:22] <duesen>	 o/
[13:05:27] <koi>	 o/
[13:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:05:49] <duesen>	 xSavitar will probably not come, his power went out an hour ago
[13:06:30] <wikibugs>	 (03PS7) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[13:06:49] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Interesting, seems like reverting patchset 5 is probably a good idea :)" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:07:10] <duesen>	 I may need some hand holding with deploying my "config" patch. It's removing an unused endpoint. No idea how to test this.
[13:08:28] <xSavitar>	 o/
[13:08:51] <xSavitar>	 duesen, we can deploy
[13:09:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[13:09:42] <duesen>	 xSavitar: 
[13:09:48] <xSavitar>	 Here are the deploy commands: https://deploy-commands.toolforge.org/bacc/805775
[13:09:58] <xSavitar>	 Doing that and testing should be fine.
[13:10:12] <xSavitar>	 duesen ^^
[13:10:24] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Idea might be ok (modulo race conditions), but the implementation has errors." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[13:10:28] <wikibugs>	 (03PS8) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[13:12:37] <duesen>	 urbanecm, awight: can Derick and I go ahead with the deployment?
[13:12:50] <urbanecm>	 duesen: go ahead if you feel comfortable. 
[13:12:54] <urbanecm>	 i can deploy if you're not
[13:13:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
[13:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[13:13:28] <duesen>	 urbanecm: we'll go ahead
[13:13:34] <urbanecm>	 okay. ping me if i can help :)
[13:13:47] <xSavitar>	 <3 urbanecm 
[13:14:23] <wikibugs>	 (03PS9) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118
[13:14:30] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet
[13:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:20] <wikibugs>	 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10MoritzMuehlenhoff) >>! In T67270#8012925, @Legoktm wrote: > Can we clarify what the goal here is? More recently I've been good about throwing a GP...
[13:16:10] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+2] rpc: Remove unused RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01)
[13:16:28] <duesen>	 I forgot to merge the patch beforehand, will take a couple of minutes
[13:16:31] <duesen>	 it's a config patch though
[13:16:44] <Lucas_WMDE>	 hello
[13:16:47] <Lucas_WMDE>	 I think I *finally* made it here
[13:16:52] <Lucas_WMDE>	 sorry I’m late
[13:16:54] <duesen>	 hey Lucas_WMDE !
[13:17:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124
[13:17:18] <Lucas_WMDE>	 urbanecm: you’re deploying?
[13:17:19] <wikibugs>	 (03Merged) 10jenkins-bot: rpc: Remove unused RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01)
[13:17:22] <itamarWMDE>	 Welcome Lucas_WMDE! :D
[13:17:31] <urbanecm>	 Lucas_WMDE: duesen and xSavitar are
[13:17:33] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:17:34] <urbanecm>	 I'm just standing by
[13:18:14] <duesen>	 ok, patch merged
[13:18:14] <Lucas_WMDE>	 ok
[13:18:28] <wikibugs>	 (03CR) 10Slyngshede: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[13:20:09] <duesen>	 I pulled the patch to mwdebug1001.
[13:20:14] <Lucas_WMDE>	 I unfortunately have a meeting starting in 10 minutes, so I might deploy my Lexeme Lua patch in the break after the backport+config window
[13:20:14] <duesen>	 There is nothing to test, really.
[13:20:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris)
[13:20:28] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35955/console" [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[13:22:58] <duesen>	 I'll scap now. Worst that can happen is job execution breaking...
[13:23:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:57] <icinga-wm>	 PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7293 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:24:08] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang)
[13:24:19] <duesen>	 urbanecm: uh, how do you scap a file deletion?
[13:24:20] <duesen>	 #
[13:24:26] <duesen>	 ...for config
[13:24:32] <urbanecm>	 duesen: scap the folder the file is in
[13:24:36] <urbanecm>	 (well, was)
[13:24:39] <duesen>	 kk!
[13:24:59] <duesen>	 running
[13:25:04] <wikibugs>	 (03PS2) 10Ori: varnish: sort query parameters on the Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/806488 (https://phabricator.wikimedia.org/T138093)
[13:25:06] <wikibugs>	 (03CR) 10Slyngshede: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[13:26:30] <wikibugs>	 (03CR) 10Elukey: "There is a bit missing IIUC, but the rest looks good! After this change I'd file another one to change profile::pki::multirootca and add t" [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[13:27:05] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124
[13:28:01] <duesen>	 ...still going...
[13:28:12] <urbanecm>	 yeah, it takes a couple of minutes these days :/
[13:28:38] <logmsgbot>	 !log daniel@deploy1002 Synchronized rpc/: Config: [[gerrit:805775|rpc: Remove unused RunJobs.php (T175146 T243096)]] (duration: 03m 45s)
[13:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:45] <stashbot>	 T243096: Jobrunner monitoring still calles /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096
[13:28:45] <stashbot>	 T175146: [RfC] Move RunJobs.php to the mediawiki (core) repository - https://phabricator.wikimedia.org/T175146
[13:29:06] <duesen>	 ok, sync is done. I'm seeing nothing suspicious on logstash so far.
[13:30:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris)
[13:30:24] <wikibugs>	 (03CR) 10Ori: [C: 03+2] varnish: sort query parameters on the Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/806488 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[13:30:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:01] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[13:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:07] * Lucas_WMDE afk for 30 minutes
[13:32:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet
[13:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:27] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7714 seconds old and 217 bytes Btullis T310293 - running on standby server temporarily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:32:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[13:32:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:19] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[13:35:22] <duesen>	 urbanecm, Lucas_WMDE: all good.
[13:35:52] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[13:37:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:04] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10MatthewVernon)
[13:38:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet
[13:38:33] <duesen>	 itamarWMDE, koi: do you want to deploy now?
[13:38:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:59] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet
[13:40:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:03] <wikibugs>	 (03PS2) 10Klausman: net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195)
[13:40:13] <wikibugs>	 (03CR) 10Hokwelum: [C: 03+1] "This looks good" [puppet] - 10https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[13:40:16] <koi>	 I couldn't, could anyone help me
[13:40:24] <wikibugs>	 (03CR) 10Klausman: net: Add network config setup for ML staging k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[13:41:09] <wikibugs>	 (03CR) 10Ayounsi: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:41:46] <duesen>	 koi: urbanecm  should be able to help. 
[13:41:48] <xSavitar>	 urbanecm, do you want to take on the other patches?
[13:41:57] <urbanecm>	 sure
[13:41:59] <urbanecm>	 jouncebot: now
[13:41:59] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300)
[13:41:59] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300)
[13:42:08] <xSavitar>	 Thank you urbanecm <3
[13:42:29] <urbanecm>	 i'm not sure we can deploy all patches though
[13:43:34] <itamarWMDE>	 duesen: Don't have prod deployment access, if that's what you're asking
[13:43:49] <urbanecm>	 yeah, I'll try to deploy what i can
[13:43:50] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:43:58] <urbanecm>	 we don't have a lot of time though
[13:44:24] <wikibugs>	 (03PS3) 10Urbanecm: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[13:44:28] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[13:44:47] <wikibugs>	 (03PS3) 10Urbanecm: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[13:44:54] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[13:45:07] <itamarWMDE>	 Thank you urbanecm
[13:45:14] <wikibugs>	 (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[13:45:39] <wikibugs>	 (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[13:46:02] <urbanecm>	 itamarWMDE: pulled to mwdebug1001, can you check please?
[13:46:04] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet
[13:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[13:46:45] <wikibugs>	 (03PS1) 10Muehlenhoff: aptly: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807127 (https://phabricator.wikimedia.org/T308013)
[13:46:47] <wikibugs>	 (03PS1) 10Muehlenhoff: aptrepo: Add a few missing SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013)
[13:46:49] <wikibugs>	 (03PS1) 10Muehlenhoff: grafana: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013)
[13:46:52] <wikibugs>	 (03PS1) 10Muehlenhoff: smokeping: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013)
[13:47:02] <wikibugs>	 (03CR) 10Hokwelum: [C: 03+1] "This looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/807101 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[13:47:20] <wikibugs>	 (03PS7) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574)
[13:48:03] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:48:04] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35956/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:48:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:07] <icinga-wm>	 PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:07] <taavi>	 umh, did something just break?
[13:48:16] <wikibugs>	 (03PS5) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048)
[13:48:19] <jinxer-wm>	 (ProbeDown) firing: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:48:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:27] <icinga-wm>	 PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:29] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet
[13:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:37] <urbanecm>	 taavi: i didn't touch anything yet
[13:48:48] <urbanecm>	 but let me check, there were some deployments
[13:48:55] <itamarWMDE>	 urbanecm: connection seems slow here, sorry for delay
[13:48:59] <urbanecm>	 no problem
[13:49:17] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[13:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] aptrepo: add cassandra components to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey)
[13:49:30] <wikibugs>	 (03PS8) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574)
[13:50:13] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35957/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:50:17] <wikibugs>	 (03CR) 10Jbond: "thanks updated" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond)
[13:50:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:33] <urbanecm>	 looks like a temporary issue taavi 
[13:50:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[13:50:45] <XioNoX>	 great
[13:50:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:51:00] <Emperor>	 huh, I got paged, despite not being on duty
[13:51:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:51:31] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6027 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[13:51:33] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:52:01] <volans>	 Emperor: we're still paging everyone
[13:52:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:52:07] <wikibugs>	 (03CR) 10ArielGlenn: [C: 04-1] C:snapshot::dumps::timechecker convert cron to timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[13:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:13] <itamarWMDE>	 urbanecm: seems good to me
[13:52:33] <urbanecm>	 itamarWMDE: thanks, but not deploying atm, seems we're in the middle of a problem
[13:52:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:52:57] <wikibugs>	 (03CR) 10ArielGlenn: [C: 04-1] "I am a little uneasy about creating a new wrapper script just for this one thing; is there no nicer way to pass in multiple commands?" [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[13:52:58] <Emperor>	 FE
[13:52:59] <icinga-wm>	 RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:53:18] <jinxer-wm>	 (ProbeDown) resolved: (10) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:53:33] <icinga-wm>	 RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:53:42] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Based on Arzhel's last comment, ready for (final?) review again. Changes: neighbor external. Distinct BGP blocks as required by bird2." [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[13:53:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:54:17] <wikibugs>	 (03PS1) 10Klausman: hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195)
[13:54:28] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet
[13:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:52] <itamarWMDE>	 urbanecm: no worries, was so busy trying to test I didn't notice, thanks :D
[13:55:03] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:55:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[13:55:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:56:13] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[13:56:14] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite)
[13:57:58] * Lucas_WMDE back
[13:58:02] <urbanecm>	 itamarWMDE: np. i'll finish the deployment or revert later, depending on how the incident goes. thanks for the test.
[13:58:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:58:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:16] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey)
[14:00:27] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey)
[14:02:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[14:24:31] <papaul>	 !log ongoing maintenance on ps1-a2-codfw
[14:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:23] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:32:13] <icinga-wm>	 PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:36] <XioNoX>	 papaul: ^
[14:33:05] <icinga-wm>	 PROBLEM - Host lvs2007 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:13] <icinga-wm>	 PROBLEM - Host kubernetes2005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:35] <icinga-wm>	 PROBLEM - Host ms-be2044 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:57] <icinga-wm>	 PROBLEM - Host kafka-logging2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:57] <icinga-wm>	 PROBLEM - Host ping2002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:34:09] <icinga-wm>	 PROBLEM - Host elastic2055 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:09] <icinga-wm>	 PROBLEM - Host ganeti2030 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:17] <wikibugs>	 (PS3) Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - https://gerrit.wikimedia.org/r/807124
[14:34:43] <icinga-wm>	 PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:43] <icinga-wm>	 PROBLEM - Host grafana2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:53] <icinga-wm>	 PROBLEM - Host netbox2002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:54] <icinga-wm>	 PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:57] <icinga-wm>	 PROBLEM - Host ms-fe2009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:57] <icinga-wm>	 PROBLEM - Host authdns2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:59] <icinga-wm>	 PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:59] <icinga-wm>	 PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:01] <icinga-wm>	 PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:05] <icinga-wm>	 PROBLEM - Host urldownloader2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:07] <icinga-wm>	 PROBLEM - Host ms-be2029 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:09] <icinga-wm>	 PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:09] <icinga-wm>	 PROBLEM - Host rpki2002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:17] <icinga-wm>	 PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:18] <jinxer-wm>	 (ProbeDown) firing: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:35:21] <icinga-wm>	 PROBLEM - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:21] <icinga-wm>	 PROBLEM - Host deneb is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:25] <icinga-wm>	 PROBLEM - Host ganeti2029 is DOWN: PING CRITICAL - Packet loss = 100%
[14:35:53] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:36:06] <duesen>	 oh oh...
[14:36:47] <icinga-wm>	 PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[14:36:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:36:55] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:37:19] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64602/IPv4: Connect - kubernetes-codfw, AS64600/IPv4: Connect - PyBal, AS64602/IPv6: Active - kubernetes-codfw, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:37:21] <icinga-wm>	 PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:37:21] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-a-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:17] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[14:38:28] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:38:39] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:38:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:39:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2305.codfw.wmnet, mw2380.codfw.wmnet, mw2389.codfw.wmnet, mw2387.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2396.codfw.wmnet, mw2304.codfw.wmnet, mw2295.codfw.wmnet, mw2252.codfw.wmnet, mw2251.codfw.wmnet, mw2299.codfw.wmnet, mw2306.codfw.wmnet are marked down but pooled https://wikitech.wikimedia
[14:39:13] <icinga-wm>	 i/PyBal
[14:39:35] <icinga-wm>	 RECOVERY - Host urldownloader2001 is UP: PING WARNING - Packet loss = 90%, RTA = 33.32 ms
[14:39:35] <icinga-wm>	 RECOVERY - Host ms-be2029 is UP: PING WARNING - Packet loss = 50%, RTA = 33.10 ms
[14:39:35] <icinga-wm>	 RECOVERY - Host rpki2002 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[14:39:35] <icinga-wm>	 RECOVERY - Host ms-be2051 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms
[14:39:37] <icinga-wm>	 RECOVERY - Host netbox2002 is UP: PING OK - Packet loss = 0%, RTA = 33.31 ms
[14:39:37] <icinga-wm>	 RECOVERY - Host authdns2001 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[14:39:37] <icinga-wm>	 RECOVERY - Host thanos-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms
[14:39:37] <icinga-wm>	 RECOVERY - Host ms-be2044 is UP: PING OK - Packet loss = 0%, RTA = 34.83 ms
[14:39:37] <icinga-wm>	 RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 34.75 ms
[14:39:39] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw-a-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:39:39] <icinga-wm>	 RECOVERY - Host kubernetes2005 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms
[14:39:39] <icinga-wm>	 RECOVERY - Host ganeti2029 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms
[14:39:39] <icinga-wm>	 RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms
[14:39:39] <icinga-wm>	 RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms
[14:39:41] <icinga-wm>	 RECOVERY - Host lvs2007 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms
[14:39:45] <icinga-wm>	 RECOVERY - Host ping2002 is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms
[14:39:45] <icinga-wm>	 RECOVERY - Host kafka-logging2001 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms
[14:39:45] <icinga-wm>	 RECOVERY - Host elastic2055 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms
[14:39:45] <icinga-wm>	 RECOVERY - Host grafana2001 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[14:39:49] <icinga-wm>	 RECOVERY - Host ms-fe2009 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms
[14:39:55] <icinga-wm>	 RECOVERY - Host ganeti2030 is UP: PING OK - Packet loss = 0%, RTA = 31.75 ms
[14:39:55] <icinga-wm>	 RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms
[14:40:01] <icinga-wm>	 RECOVERY - Host deneb is UP: PING OK - Packet loss = 0%, RTA = 31.77 ms
[14:40:31] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:40:55] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:40:57] <icinga-wm>	 RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms
[14:40:59] <icinga-wm>	 RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 31.87 ms
[14:41:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:41:06] <icinga-wm>	 RECOVERY - Host ms-be2040 is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms
[14:41:23] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:41:29] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:41:49] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:41:53] <icinga-wm>	 RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:43:45] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:44:53] <XioNoX>	 papaul: was that expected? ^
[14:44:55] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:45:09] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:30] <papaul>	 XioNoX: no, the second power cable for asw was not plugged all the way in
[14:45:34] <jinxer-wm>	 (ProbeDown) resolved: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:45:41] <papaul>	 so I bumped into it and it got disconnected
[14:45:41] <XioNoX>	 ok
[14:46:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:36] <wikibugs>	 (CR) BCornwall: [C: +2] data-engineering: add varnishkafka delivery errors [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[14:46:43] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:46:46] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[14:46:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:47:03] <jinxer-wm>	 (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:47:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:48:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] acme_chief: Remove old buster IDP hosts [puppet] - https://gerrit.wikimedia.org/r/805140 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[14:49:33] <icinga-wm>	 PROBLEM - Host ms-be2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:50:09] <wikibugs>	 (PS2) BCornwall: Traffic: Port over purged lag/queue monitors [alerts] - https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723)
[14:51:06] <wikibugs>	 (PS1) Muehlenhoff: Remove old buster IDPs from Puppet [puppet] - https://gerrit.wikimedia.org/r/807140
[14:51:27] <icinga-wm>	 RECOVERY - Host ms-be2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.20 ms
[14:53:22] <jinxer-wm>	 (Emergency syslog message) resolved: Device asw-a-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[14:53:43] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:29] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:55:00] <wikibugs>	 (CR) BCornwall: [C: +2] Traffic: Port over purged lag/queue monitors [alerts] - https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[14:55:06] <wikibugs>	 (PS1) Urbanecm: Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3)" [mediawiki-config] - https://gerrit.wikimedia.org/r/807148 (https://phabricator.wikimedia.org/T304328)
[14:55:10] <wikibugs>	 (PS1) Urbanecm: Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3)" [mediawiki-config] - https://gerrit.wikimedia.org/r/807149 (https://phabricator.wikimedia.org/T304328)
[14:55:17] <wikibugs>	 (CR) Urbanecm: [C: +2] Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3)" [mediawiki-config] - https://gerrit.wikimedia.org/r/807148 (https://phabricator.wikimedia.org/T304328) (owner: Urbanecm)
[14:55:26] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3)" [mediawiki-config] - https://gerrit.wikimedia.org/r/807149 (https://phabricator.wikimedia.org/T304328) (owner: Urbanecm)
[14:55:30] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3)" [mediawiki-config] - https://gerrit.wikimedia.org/r/807148 (https://phabricator.wikimedia.org/T304328) (owner: Urbanecm)
[14:56:05] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove old buster IDPs from Puppet [puppet] - https://gerrit.wikimedia.org/r/807140 (owner: Muehlenhoff)
[14:57:59] <wikibugs>	 (PS1) Ssingh: dnsdist: override unit to set ProtectSystem to strict [puppet] - https://gerrit.wikimedia.org/r/807142
[14:58:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:01] <wikibugs>	 (CR) Elukey: [C: +1] hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[14:59:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:59:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:59:25] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35958/console" [puppet] - https://gerrit.wikimedia.org/r/807142 (owner: Ssingh)
[14:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:57] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:23] <papaul>	 !log PDU swap for rack a2 complete 
[15:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:23] <wikibugs>	 (PS1) Majavah: sonofgridengine: grid_configurator: ignore non-ACTIVE instances [puppet] - https://gerrit.wikimedia.org/r/807143
[15:05:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:05:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:35] <wikibugs>	 (PS2) Klausman: hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195)
[15:05:51] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35960/console" [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[15:06:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:06:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:06:58] <moritzm>	 !log installing avahi security updates
[15:06:58] <wikibugs>	 SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (jbond) >  In such cases it might make sense to align such files by relicensing to Apache 2 starting of with the obligatory IANAL :).  My understan...
[15:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:36] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35961/console" [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[15:09:30] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35962/console" [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[15:09:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:09:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:12:08] <wikibugs>	 (CR) Vgutierrez: [C: +1] hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[15:13:00] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) the varnish-mmap-count situation could be resolved with https://github.com/prometheus/proc...
[15:13:57] <Lucas_WMDE>	 jouncebot: now
[15:13:57] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[15:14:14] <wikibugs>	 (PS2) Majavah: sonofgridengine: grid_configurator: ignore non-ACTIVE instances [puppet] - https://gerrit.wikimedia.org/r/807143
[15:14:18] <Lucas_WMDE>	 if the incident is resolved for now, I’d like to deploy one config change that didn’t make it during the backport window
[15:14:24] <Lucas_WMDE>	 if anyone wants to object to that, shout :)
[15:15:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:35] <logmsgbot>	 !log klausman@cumin1001 conftool action : help; selector: name=ml-staging2001
[15:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:09] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2001.codfw.wmnet
[15:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:14] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2002.codfw.wmnet
[15:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:23] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging-ctrl2002.codfw.wmnet
[15:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:48] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[15:18:22] <wikibugs>	 (CR) Jbond: [C: +2] P:netbox: add dynamic config back to config file [puppet] - https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: Jbond)
[15:18:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:30] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Enable Lexeme Lua access everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593)
[15:21:42] <Lucas_WMDE>	 ^ about to deploy this
[15:23:41] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "diffConfig looks good, let’s go" [mediawiki-config] - https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) (owner: Lucas Werkmeister (WMDE))
[15:24:01] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:24:09] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:24:30] <wikibugs>	 (Merged) jenkins-bot: Enable Lexeme Lua access everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) (owner: Lucas Werkmeister (WMDE))
[15:25:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/807142 (owner: Ssingh)
[15:25:20] <Lucas_WMDE>	 testing on mwdebug1001
[15:25:33] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 67 connections established with conf2004.codfw.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal
[15:26:01] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 87 connections established with conf2004.codfw.wmnet:4001 (min=88) https://wikitech.wikimedia.org/wiki/PyBal
[15:26:28] <Lucas_WMDE>	 seems to work fine, syncing
[15:26:47] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/weight=1; selector: name=ml-staging2001.codfw.wmnet
[15:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:54] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/weight=1; selector: name=ml-staging2002.codfw.wmnet
[15:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:59] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.58:30443]) https://wikitech.wikimedia.org/wiki/PyBal
[15:26:59] <wikibugs>	 (PS4) Krinkle: mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[15:27:01] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.58:30443]) https://wikitech.wikimedia.org/wiki/PyBal
[15:27:14] <Lucas_WMDE>	 holding
[15:27:28] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2002.codfw.wmnet
[15:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:40] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2001.codfw.wmnet
[15:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:19] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.58:30443]) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal
[15:28:24] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 67 connections established with conf2004.codfw.wmnet:4001 (min=68) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal
[15:28:29] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 87 connections established with conf2004.codfw.wmnet:4001 (min=88) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal
[15:28:42] <Lucas_WMDE>	 ok, I’m continuing
[15:30:02] <wikibugs>	 (PS4) Krinkle: Set $wgCentralAuthTokenCacheType to mcrouter-master-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[15:30:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:10] <klausman>	 !log Restarting pybal on lvs2010 
[15:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:31] <wikibugs>	 (PS2) Volans: Revert "ganeti-netbox-sync: Add netbox 3.2 support" [software/netbox-extras] - https://gerrit.wikimedia.org/r/805869 (https://phabricator.wikimedia.org/T296452)
[15:30:33] <wikibugs>	 (PS6) Volans: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - https://gerrit.wikimedia.org/r/802178
[15:30:35] <wikibugs>	 (PS9) Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446)
[15:30:54] <Lucas_WMDE>	 scap errors on 2 hosts
[15:31:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:31:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:12] <Lucas_WMDE>	 I have to do another sync anyways, I’ll see if that one works better, then the php-fpm restart should be covered by that
[15:32:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:38] <dancy>	 Lucas_WMDE: What are the errors?
[15:32:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806877|Enable Lexeme Lua access everywhere (T309593)]] (1/2) (duration: 03m 51s)
[15:32:58] <Lucas_WMDE>	 issues connecting to lvs, apparently
[15:32:59] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2320.codfw.wmnet]) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal
[15:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:03] <stashbot>	 T309593: enable Lexeme Lua access on remaining Wikimedia projects - https://phabricator.wikimedia.org/T309593
[15:33:04] <Lucas_WMDE>	 so I guess that could be related to what klausman is doing?
[15:33:13] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[15:33:13] <dancy>	 Lucas: Same as https://phabricator.wikimedia.org/T310835  ?
[15:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:26] <wikibugs>	 (PS1) Ssingh: Add sukhe to super-user for router configuration [homer/public] - https://gerrit.wikimedia.org/r/807145
[15:33:28] <Lucas_WMDE>	 not quite the same I think
[15:33:36] <Lucas_WMDE>	 but the “free opcache” is also in the output at least
[15:33:47] <Lucas_WMDE>	 I can paste the console output later, I’ll do the second sync first
[15:33:53] <dancy>	 ok thanks
[15:33:55] <Lucas_WMDE>	 unless you want me to wait?
[15:33:59] <icinga-wm>	 RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[15:34:03] <dancy>	 no, go ahead
[15:34:05] <Lucas_WMDE>	 ok
[15:34:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (ayounsi) Open→Resolved a:ayounsi > This should be fixed.  Looks like it was a configuration failure during the planned migration PWIC218882.3. Confirmed resolved.
[15:34:25] <Lucas_WMDE>	 (second sync is only IS-labs, so main prod effect should be to restart php-fpm on the remaining two hosts)
[15:34:30] <Lucas_WMDE>	 scap running
[15:34:37] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet
[15:34:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:52] <wikibugs>	 (CR) Volans: "I've cleanup netbox-next and run the script for all clusters:" [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[15:36:53] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[15:37:01] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 88 connections established with conf2004.codfw.wmnet:4001 (min=88) https://wikitech.wikimedia.org/wiki/PyBal
[15:37:03] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:37:39] <klausman>	 !log restarting pybal on lvs2009
[15:37:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:806877|Enable Lexeme Lua access everywhere (T309593)]] (2/2) (duration: 03m 28s)
[15:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:52] <Lucas_WMDE>	 no error this time fyi dancy 
[15:38:13] <dancy>	 OK thanks
[15:38:47] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[15:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:58] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
[15:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:35] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:39:57] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 68 connections established with conf2004.codfw.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal
[15:40:01] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:40:01] <Lucas_WMDE>	 dancy: I’ve put the output in a private paste for now https://phabricator.wikimedia.org/P29939
[15:40:14] <Lucas_WMDE>	 it’s probably okay to make public, feel free to copy it into a task somewhere
[15:41:28] <Lucas_WMDE>	 I think there’s two issues there – the failed connection to lvs2010 (understandable if that was being worked on at the moment), and the fact that https://gerrit.wikimedia.org/g/operations/puppet/+/c8cb4a1796d5ff22803c171c277943eebecb8ee7/modules/conftool/files/safe-service-restart.py#333 throws an error if `status` was never assigned
[15:42:20] <wikibugs>	 (PS5) Krinkle: Set $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[15:42:29] <wikibugs>	 (CR) Krinkle: [C: +1] mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[15:42:34] <wikibugs>	 (CR) Krinkle: [C: +1] Set $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[15:43:26] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807124 (owner: Alexandros Kosiaris)
[15:45:13] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:45:35] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:46:13] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:46:13] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:46:27] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:47:06] <wikibugs>	 (CR) Jbond: [C: +1] net: Add network config setup for ML staging k8s [puppet] - https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[15:47:33] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) Ticket 1-218053856766 opened for the loopback test.  > Support, >  > We need to test our cross-connection 20676697-A, which terminates into our panel @ PP:0603:1087235 - 15/16 and from that into our rou...
[15:47:48] <wikibugs>	 (PS1) Majavah: P:toolforge::checker: add buster endpoints [puppet] - https://gerrit.wikimedia.org/r/807168 (https://phabricator.wikimedia.org/T277653)
[15:47:50] <wikibugs>	 (PS1) Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653)
[15:47:52] <wikibugs>	 (PS1) Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653)
[15:51:52] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 52.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:52:08] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 36.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:52:46] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[15:52:47] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet
[15:52:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:56] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM will also need a follow up patch to remove old files, variables and resources" [puppet] - https://gerrit.wikimedia.org/r/807118 (owner: Slyngshede)
[15:53:39] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807127 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:54:15] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet
[15:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:24] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 76.95 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:54:26] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:55:12] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 82.16 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:55:19] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:55:54] <logmsgbot>	 !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2048.codfw.wmnet
[15:55:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:28] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:57:27] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
[15:57:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:19] <wikibugs>	 (CR) Jbond: [C: +2] wmflib::service: Reject empty string values [puppet] - https://gerrit.wikimedia.org/r/806208 (owner: Jbond)
[15:59:26] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet
[15:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:06] <jouncebot>	 jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1600).
[16:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:54] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet
[16:00:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:09] <wikibugs>	 (PS1) Papaul: Add new pdu model for ps1-a2-codfw [puppet] - https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957)
[16:02:03] <wikibugs>	 (CR) CI reject: [V: -1] Add new pdu model for ps1-a2-codfw [puppet] - https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957) (owner: Papaul)
[16:02:26] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35965/console" [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: Hnowlan)
[16:02:49] <wikibugs>	 (PS25) Jbond: Add a host's confctl pooled status and weight per service to prometheus [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[16:03:20] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:03:52] <wikibugs>	 (CR) Jbond: Add a host's confctl pooled status and weight per service to prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[16:04:17] <wikibugs>	 (PS2) Papaul: Add new pdu model for ps1-a2-codfw [puppet] - https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957)
[16:05:02] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
[16:05:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:39] <wikibugs>	 (CR) Papaul: [C: +2] Add new pdu model for ps1-a2-codfw [puppet] - https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957) (owner: Papaul)
[16:06:07] <wikibugs>	 (Abandoned) Filippo Giunchedi: base: include profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806374 (owner: Filippo Giunchedi)
[16:06:54] <wikibugs>	 (CR) Hnowlan: [V: +1 C: +2] cassandra: load grants files upon change [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: Hnowlan)
[16:07:11] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: add profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806373
[16:07:13] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: enable SD for stack observability [puppet] - https://gerrit.wikimedia.org/r/806376
[16:07:15] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - https://gerrit.wikimedia.org/r/806375
[16:07:19] <wikibugs>	 (PS4) Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - https://gerrit.wikimedia.org/r/807124
[16:07:28] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:07:46] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-a2-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[16:07:50] <wikibugs>	 (CR) Jbond: [C: +2] wmflib: update kernel_details to also include kernel.unprivileged_userns_clone [puppet] - https://gerrit.wikimedia.org/r/806425 (owner: Jbond)
[16:08:00] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:10:09] <wikibugs>	 (PS5) Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - https://gerrit.wikimedia.org/r/807124
[16:10:16] <wikibugs>	 (CR) CI reject: [V: -1] pontoon: add profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806373 (owner: Filippo Giunchedi)
[16:10:25] <wikibugs>	 (CR) Dzahn: "So.. there is a parameter "severity". and the default is "critical". This is what they mean:" [puppet] - https://gerrit.wikimedia.org/r/806476 (owner: Dzahn)
[16:11:03] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35967/console" [puppet] - https://gerrit.wikimedia.org/r/807124 (owner: Alexandros Kosiaris)
[16:11:47] <wikibugs>	 (PS1) Jbond: P:sretest: Add original title parameter to sretest import/export [puppet] - https://gerrit.wikimedia.org/r/807173
[16:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:12:14] <wikibugs>	 (CR) Filippo Giunchedi: "both http://checker.tools.wmflabs.org/grid/continuous/buster and http://checker.tools.wmflabs.org/grid/start/buster yield 404 for me, expe" [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[16:13:30] <wikibugs>	 (CR) Jbond: [C: +2] P:sretest: Add original title parameter to sretest import/export [puppet] - https://gerrit.wikimedia.org/r/807173 (owner: Jbond)
[16:13:33] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] P:sretest: Add original title parameter to sretest import/export [puppet] - https://gerrit.wikimedia.org/r/807173 (owner: Jbond)
[16:13:43] <wikibugs>	 (PS2) Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653)
[16:13:45] <wikibugs>	 (PS2) Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653)
[16:13:58] <wikibugs>	 (PS1) David Caro: openstack.vendordata: reduce timeout so it retries [puppet] - https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930)
[16:14:08] <wikibugs>	 (CR) Majavah: icinga::monitor::toollabs: replace stretch with buster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[16:14:20] <wikibugs>	 SRE, serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (dancy) Pinging @JMeybohm and @Dzahn for support.
[16:14:39] <wikibugs>	 (CR) Dzahn: "Do you guys see an existing list of teams? I asked about that and whether there are plans for another severity level "paging"." [puppet] - https://gerrit.wikimedia.org/r/806476 (owner: Dzahn)
[16:16:03] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: add profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806373
[16:16:05] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: enable SD for stack observability [puppet] - https://gerrit.wikimedia.org/r/806376
[16:16:07] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - https://gerrit.wikimedia.org/r/806375
[16:16:45] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:16:49] <wikibugs>	 (CR) Volans: "reply inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: Jbond)
[16:17:59] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "-1 while endpoints exist/parent change is deployed, can be merged afterwards" [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[16:18:34] <wikibugs>	 (CR) Filippo Giunchedi: "Went with the ENC approach, PTAL" [puppet] - https://gerrit.wikimedia.org/r/806373 (owner: Filippo Giunchedi)
[16:21:37] <wikibugs>	 (PS1) Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - https://gerrit.wikimedia.org/r/807176
[16:21:37] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:22:27] <wikibugs>	 (PS1) Ahmon Dancy: profile::mediawiki::deployment::server: Rename a variable [puppet] - https://gerrit.wikimedia.org/r/807178
[16:23:17] <wikibugs>	 (CR) David Caro: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/807168 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[16:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:26:07] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860)
[16:26:32] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: Jelto)
[16:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[16:27:57] <icinga-wm>	 PROBLEM - Host ms-be1055 is DOWN: PING CRITICAL - Packet loss = 100%
[16:28:51] <icinga-wm>	 RECOVERY - Host ms-be1055 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms
[16:29:12] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[16:29:19] <wikibugs>	 (PS6) Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048)
[16:29:37] <wikibugs>	 (CR) Jbond: getstats: Delete old versions of this report before running (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: Jbond)
[16:30:06] <wikibugs>	 (CR) CI reject: [V: -1] getstats: Delete old versions of this report before running [software/netbox-extras] - https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: Jbond)
[16:30:30] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860)
[16:31:25] <wikibugs>	 SRE, ops-codfw, Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[16:32:33] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[16:32:43] <icinga-wm>	 PROBLEM - SSH on ms-be1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:32:43] <wikibugs>	 (CR) Filippo Giunchedi: "PCC full diff https://puppet-compiler.wmflabs.org/pcc-worker1003/35969/prometheus1005.eqiad.wmnet/fulldiff.html" [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[16:34:43] <icinga-wm>	 RECOVERY - SSH on ms-be1055 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:34:51] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[16:36:55] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:37:48] <wikibugs>	 (PS3) Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653)
[16:37:52] <wikibugs>	 (PS3) Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653)
[16:37:56] <wikibugs>	 (PS1) Majavah: P:toolforge::checker: add missing endpoint config [puppet] - https://gerrit.wikimedia.org/r/807182 (https://phabricator.wikimedia.org/T277653)
[16:38:25] <icinga-wm>	 PROBLEM - Host ms-be1055 is DOWN: PING CRITICAL - Packet loss = 100%
[16:38:49] <wikibugs>	 (PS2) Majavah: P:toolforge::checker: add missing endpoint config [puppet] - https://gerrit.wikimedia.org/r/807182 (https://phabricator.wikimedia.org/T277653)
[16:38:51] <icinga-wm>	 RECOVERY - Host ms-be1055 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[16:38:58] <wikibugs>	 (PS4) Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653)
[16:39:02] <wikibugs>	 (PS4) Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653)
[16:39:19] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:33] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster
[16:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster
[16:41:41] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:43:19] <wikibugs>	 (CR) Ahmon Dancy: "PCC results (no changes): https://puppet-compiler.wmflabs.org/pcc-worker1002/35971/" [puppet] - https://gerrit.wikimedia.org/r/807178 (owner: Ahmon Dancy)
[16:45:50] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet
[16:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:03] <wikibugs>	 (CR) Jgiannelos: [C: -1] "Lets hold on this for now since we need to manually bootstrap tile storage with fresh tiles." [puppet] - https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845) (owner: MSantos)
[16:49:25] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:54:05] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:59:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson) @BTullis Can you confirm raid configuration and partman recipe to use please?
[17:01:31] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster
[17:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w...
[17:02:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster
[17:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster
[17:03:53] <icinga-wm>	 PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:06:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[17:09:49] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic1049.eqiad.wmnet
[17:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:48] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[17:14:34] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster
[17:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w...
[17:15:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp2001.wikimedia.org
[17:15:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:51] <wikibugs>	 SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog, User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (Jgiannelos) Just a quick correction on the numbers: the current production container size is ~40M objects not ~12M (i was countin...
[17:19:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[17:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:54] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host elastic1049.eqiad.wmnet
[17:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:31] <icinga-wm>	 PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:23:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2001.wikimedia.org
[17:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp2001.wikimedia.org` - idp2001.wikimedia.org (**PASS**)   - Downtimed host on Icing...
[17:24:21] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: override unit to set ProtectSystem to strict [puppet] - 10https://gerrit.wikimedia.org/r/807142 (owner: 10Ssingh)
[17:24:41] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search.service Brian_King This should have cleared by now, looking closer at the alert rules. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp1001.wikimedia.org
[17:26:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:58] <wikibugs>	 (03PS1) 10Majavah: Remove stretch support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/807184 (https://phabricator.wikimedia.org/T277653)
[17:30:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[17:30:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:45] <wikibugs>	 (03PS1) 10Cmjohnson: Add netboot.cfg and site.pp for an-presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835)
[17:36:37] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: service override (improves 54f018dc5) [puppet] - 10https://gerrit.wikimedia.org/r/807188
[17:36:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add netboot.cfg and site.pp for an-presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835) (owner: 10Cmjohnson)
[17:37:26] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35972/console" [puppet] - 10https://gerrit.wikimedia.org/r/807188 (owner: 10Ssingh)
[17:37:44] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: service override (improves 54f018dc5) [puppet] - 10https://gerrit.wikimedia.org/r/807188 (owner: 10Ssingh)
[17:37:52] <wikibugs>	 (03CR) 10Muehlenhoff: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[17:38:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:38:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1001.wikimedia.org
[17:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp1001.wikimedia.org` - idp1001.wikimedia.org (**PASS**)   - Downtimed host on Icing...
[17:41:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10MoritzMuehlenhoff)
[17:42:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete
[17:43:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[17:48:03] <wikibugs>	 (03PS2) 10Cmjohnson: Add netboot.cfg and site.pp for an-presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835)
[17:48:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Remove profile::releases::upload and related classes [puppet] - 10https://gerrit.wikimedia.org/r/807069 (https://phabricator.wikimedia.org/T309765) (owner: 10Muehlenhoff)
[17:50:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "glad to remove this, it did raise some support requests before afair" [puppet] - 10https://gerrit.wikimedia.org/r/807069 (https://phabricator.wikimedia.org/T309765) (owner: 10Muehlenhoff)
[17:50:58] <wikibugs>	 (03PS1) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620)
[17:51:00] <wikibugs>	 (03CR) 10Thcipriani: [C: 04-1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[17:52:08] <wikibugs>	 (03CR) 10Thcipriani: [C: 04-1] "-1 as it needs a private key added to puppet secrets before it will work correctly" [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[17:52:58] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Add netboot.cfg and site.pp for an-presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835) (owner: 10Cmjohnson)
[17:54:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson)
[17:55:27] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:56:13] <wikibugs>	 (03PS2) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620)
[17:56:15] <wikibugs>	 (03CR) 10Thcipriani: [C: 04-1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[17:56:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.345 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:56:59] <wikibugs>	 (03CR) 10Ahmon Dancy: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[17:57:01] <wikibugs>	 (03PS3) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620)
[17:57:03] <wikibugs>	 (03CR) 10Thcipriani: [C: 04-1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[17:57:36] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[17:57:43] <wikibugs>	 (03PS7) 10Majavah: P:toolforge::grid::cronrunner: sync crontabs between hosts [puppet] - 10https://gerrit.wikimedia.org/r/805848 (https://phabricator.wikimedia.org/T284767)
[17:57:45] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::grid::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/807194 (https://phabricator.wikimedia.org/T284767)
[17:58:29] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10thcipriani) Talked to @LSobanski and he asked for some clarification on the steps we need root help with.  Here are all the steps Release Engineerin...
[17:59:45] <wikibugs>	 (03CR) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[18:00:05] <jouncebot>	 hashar and brennen: That opportune time is upon us again. Time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1800).
[18:01:45] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] Keyholder: add new agent for trainbranchbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[18:02:55] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:05:00] <wikibugs>	 (03PS7) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[18:06:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney)
[18:07:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) @btullis can you confirm what the raid configuration is supposed to be please.   2 SSD Raid 1? and Raid 10 th...
[18:07:31] <brennen>	 o/ - train was rolled to group0 earlier, no current blockers, logs fairly clean at a glance.  currently nothing to do for this window.
[18:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[18:07:47] <wikibugs>	 (03PS8) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299)
[18:09:00] <wikibugs>	 (03CR) 10Jdlrobson: QuickSurveys: Add research-incentive to jawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[18:10:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede)
[18:10:51] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks Arzhel, tried to address in latest patchset let me know what you think.  Went with the try/except as what the filter returns is dif" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney)
[18:14:04] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::grid::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/807194 (https://phabricator.wikimedia.org/T284767)
[18:14:06] <wikibugs>	 (03PS8) 10Majavah: P:toolforge::grid::cronrunner: sync crontabs between hosts [puppet] - 10https://gerrit.wikimedia.org/r/805848 (https://phabricator.wikimedia.org/T284767)
[18:20:03] <wikibugs>	 (03CR) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[18:25:57] <wikibugs>	 (03CR) 10Muehlenhoff: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[18:26:00] <wikibugs>	 (03PS10) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[18:29:09] <wikibugs>	 (03CR) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[18:30:03] <wikibugs>	 (03PS11) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[18:34:24] <wikibugs>	 (03PS2) 10Dzahn: mediawiki::deployment::server: Rename $deploy_ensure to $secondary_deploy_ensure [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy)
[18:34:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mediawiki::deployment::server: Rename $deploy_ensure to $secondary_deploy_ensure [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy)
[18:38:33] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:39:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop confirmed on deploy1002/2002 in prod" [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy)
[18:40:55] <wikibugs>	 (03PS4) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620)
[18:41:20] <wikibugs>	 (03CR) 10Thcipriani: Keyholder: add new agent for trainbranchbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani)
[18:43:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:47:28] <wikibugs>	 (03PS4) 10Aaron Schulz: Use $region for default mcrouter routes [puppet] - 10https://gerrit.wikimedia.org/r/654330
[18:48:18] <ryankemper>	 !log T301461 `ryankemper@miscweb1002:~$ sudo systemctl reload apache2`
[18:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:24] <stashbot>	 T301461: Investigate cache issues after WDQS UI deployments - https://phabricator.wikimedia.org/T301461
[18:55:49] <wikibugs>	 (03PS1) 10Ryan Kemper: query_service: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243)
[18:56:31] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) (owner: 10Ryan Kemper)
[18:56:33] <ryankemper>	 !log T301461 `ryankemper@miscweb1002:~$ sudo systemctl reload apache2` failed due to syntax error, patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/807200
[18:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:37] <stashbot>	 T301461: Investigate cache issues after WDQS UI deployments - https://phabricator.wikimedia.org/T301461
[18:56:54] <wikibugs>	 (03PS2) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[19:04:26] <wikibugs>	 (03PS1) 10Dzahn: alertmanager: create receivers for serviceops-collab [puppet] - 10https://gerrit.wikimedia.org/r/807201
[19:06:04] <wikibugs>	 (03PS2) 10Dzahn: alertmanager: create receivers for serviceops-collab [puppet] - 10https://gerrit.wikimedia.org/r/807201
[19:10:04] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "first needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/807201 but keeping it separate" [puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn)
[19:12:21] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:02] <wikibugs>	 (03PS2) 10Ryan Kemper: query_service: fix syntax error in apache config [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243)
[19:14:25] <wikibugs>	 (03PS1) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015)
[19:15:49] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "Ah, sorry for not catching this!" [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) (owner: 10Ryan Kemper)
[19:20:38] <perryprog>	 mediawiki.org down? Just got "upstream connect error or disconnect/reset before headers. reset reason: overflow"
[19:20:45] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: copy aqs info field to error.message [puppet] - 10https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite)
[19:21:11] <perryprog>	 it and enwiki now loading for me but very very slowly
[19:21:50] <perryprog>	 Seems to maybe be okay now though?
[19:22:03] <urandom>	 !log replicating Cassandra `system_auth` keyspace to codfw -- T307641
[19:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:07] <stashbot>	 T307641: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641
[19:22:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[19:38:02] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.9.5" for 558 hosts
[19:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:22] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.9.5" completed for 558 hosts
[19:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:44] <logmsgbot>	 !log dancy@deploy1002 backport aborted:  (duration: 00m 10s)
[19:38:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:21] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:42:36] <wikibugs>	 (03CR) 10Jdlrobson: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[19:42:38] <wikibugs>	 (03CR) 10Ahmon Dancy: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche)
[19:45:05] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:38] <wikibugs>	 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Ok, they can place the loop back 1 hop away on sg1 side of things and asked if they could do so today while on the call.  I advised not yet, as we haven't drained that of traffic.  @ayounsi or @cmooney:...
[19:47:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jgreen) @Cmjohnson is there any update on these machines?
[19:48:02] <wikibugs>	 (03PS1) 10Cwhite: logstash: disable aqs high log rate mitigations [puppet] - 10https://gerrit.wikimedia.org/r/807208 (https://phabricator.wikimedia.org/T310760)
[19:50:43] <wikibugs>	 (03PS1) 10Ottomata: Set krb: present for ori [puppet] - 10https://gerrit.wikimedia.org/r/807209 (https://phabricator.wikimedia.org/T311088)
[19:52:01] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: disable aqs high log rate mitigations [puppet] - 10https://gerrit.wikimedia.org/r/807208 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite)
[19:52:55] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Set krb: present for ori [puppet] - 10https://gerrit.wikimedia.org/r/807209 (https://phabricator.wikimedia.org/T311088) (owner: 10Ottomata)
[19:54:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[19:55:02] <wikibugs>	 (03CR) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T2000).
[20:00:05] <jouncebot>	 koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:16] <koi>	 hi
[20:01:06] <urbanecm>	 hi, i can deploy today
[20:01:28] <urbanecm>	 koi: re https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805840, did you run the automated update via tox to ensure the PNGs are up to date?
[20:02:09] <koi>	 I did, and found the generated file is even larger than the existing one
[20:02:27] <koi>	 you can see that in my PS1
[20:02:54] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "logos in project-logos should be maintained through /logos/config.yaml and via tox. can you please update the yaml config to generate the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:03:31] <urbanecm>	 that's weird
[20:04:02] <urbanecm>	 but it looks like you're right
[20:04:09] <wikibugs>	 (03PS3) 10Urbanecm: zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang)
[20:04:14] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang)
[20:04:34] <wikibugs>	 (03PS2) 10Urbanecm: fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) (owner: 10Stang)
[20:04:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) (owner: 10Stang)
[20:04:47] <wikibugs>	 (03CR) 10Stang: zhwikibooks: Add zh-hant variant logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:04:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[20:05:11] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/806375 (owner: 10Filippo Giunchedi)
[20:05:40] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] "Looks good to me, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/806375 (owner: 10Filippo Giunchedi)
[20:05:48] <wikibugs>	 (03Merged) 10jenkins-bot: zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang)
[20:06:13] <wikibugs>	 (03Merged) 10jenkins-bot: fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) (owner: 10Stang)
[20:06:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] zhwikibooks: Add zh-hant variant logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:06:58] <urbanecm>	 koi: the patches i merged are at mwdebug1001, please check
[20:07:16] <koi>	 looking
[20:07:28] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] pontoon: enable SD for stack observability [puppet] - 10https://gerrit.wikimedia.org/r/806376 (owner: 10Filippo Giunchedi)
[20:07:49] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:08:13] <koi>	 LGTM(the sandboxlink one)
[20:08:56] <urbanecm>	 and the other one?
[20:10:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:41] <koi>	 you mean the logo on zh_classical? Sorry, but I don't know how to check that
[20:11:03] <wikibugs>	 (03PS4) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099)
[20:11:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:43] <urbanecm>	 koi: sorry, didn't realize it's a no-op patch
[20:11:58] <urbanecm>	 you could check the logo's still there, but it's impossible for that patch to break something, so, syncing
[20:12:01] <urbanecm>	 (both)
[20:12:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:12:38] <wikibugs>	 (03PS1) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211
[20:13:35] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3f70e302e11756d9704acc86c45b3d7aabf31c4d: fawiktionary: Enable SandboxLink extension (T308505) (duration: 03m 37s)
[20:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:40] <stashbot>	 T308505: Activate SandboxLink extensions for fa.wiktionary - https://phabricator.wikimedia.org/T308505
[20:14:51] <wikibugs>	 (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079)
[20:14:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:15:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:26] <wikibugs>	 (03PS2) 10Urbanecm: zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) (owner: 10Stang)
[20:16:07] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) (owner: 10Stang)
[20:16:53] <wikibugs>	 (03CR) 10Stang: zhwikibooks: Add zh-hant variant logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:17:01] <wikibugs>	 (03Merged) 10jenkins-bot: zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) (owner: 10Stang)
[20:18:36] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 721e413fff4e797626c7c5e8433130f341310af0: zh_classicalwiki: Declare commons files for logo (1/2) (duration: 03m 30s)
[20:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:00] <wikibugs>	 (03PS3) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079)
[20:20:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:21:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:04] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized logos/config.yaml: 721e413fff4e797626c7c5e8433130f341310af0: zh_classicalwiki: Declare commons files for logo (2/2) (duration: 03m 28s)
[20:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:53] <wikibugs>	 (03PS4) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079)
[20:23:57] <urbanecm>	 koi: so, most patches done. for the last one, i don't really want to add yet another file w/o an SVG equivalent (we should be converging on HD logos, and not adding more non-HD files is a way to get there, eventually).
[20:25:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35973/gitlab-runner1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[20:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:26:35] <icinga-wm>	 PROBLEM - Check systemd state on mw1406 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[20:26:51] <icinga-wm>	 RECOVERY - AQS root url on aqs2001 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[20:26:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:27] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:27:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:27:49] <wikibugs>	 (CR) Dzahn: [C: +2] "This config file was created by puppet:" [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[20:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:58] <koi>	 yeah, I know there's a goal of replacing all logos w/o 1.5x and 2x support, but I think such a variant is indeed needed; many Chinese-related sites handle logo variants in a not very elegant way
[20:28:24] <koi>	 Like zhwikibooks; actually such a file should have been in the repository many years ago
[20:28:31] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: alertmanager use logsource as source for host.name field [puppet] - https://gerrit.wikimedia.org/r/806430 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[20:28:31] <wikibugs>	 SRE, DNS, Traffic, WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8007136, @Dzahn wrote: > There are incoming redirects into policy.wikimedia.org: >  > https://wikimedia....
[20:28:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:41] <wikibugs>	 (CR) Jdlrobson: [C: +1] "(this looks ready to backport to me!)" [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[20:30:08] <koi>	 BTW urbanecm, did you pull https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/806941 on mwdebug1001?
[20:30:36] <urbanecm>	 koi: sorry, i thought we already did that one. my mistake.
[20:30:44] <urbanecm>	 looks i only merged it
[20:30:48] <urbanecm>	 pulled to mwdebug1001 now
[20:30:49] <urbanecm>	 can you check?
[20:30:54] <urandom>	 cwhite: I've (re)enabled one of those aqs nodes that made so much noise last week.  afaict things seem OK, but just in case there is something I'm not seeing...
[20:31:01] <wikibugs>	 (CR) CI reject: [V: -1] QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[20:31:02] <koi>	 looking
[20:31:31] <cwhite>	 urandom: thanks for the heads up.  I'll watch for issues
[20:32:06] <koi>	 LGTM
[20:32:37] <wikibugs>	 (CR) David Caro: [C: +2] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/807182 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[20:32:41] <wikibugs>	 (PS1) BCornwall: traffic: Port over ATS restart alert [alerts] - https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723)
[20:33:16] <urbanecm>	 koi: thanks, syncing
[20:37:01] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b42e57d75ec6b0536493fa073805a0bcb066aef1: zhwikiquote: Disable local upload (T311017) (duration: 03m 43s)
[20:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:08] <stashbot>	 T311017: Disable local upload for Chinese Wikiquote - https://phabricator.wikimedia.org/T311017
[20:37:15] <urbanecm>	 koi: okay, that should be everything
[20:37:17] <urbanecm>	 anything else?
[20:38:02] <koi>	 nothing except the logo for zhwikibooks, what should I do for that?
[20:38:48] <urbanecm>	 get a SVG for it 🙂
[20:39:24] <koi>	 0_o
[20:39:51] <wikibugs>	 (Abandoned) Stang: zhwikibooks: Add zh-hant variant logo [mediawiki-config] - https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[20:40:09] <koi>	 that's all
[20:41:10] <urbanecm>	 okay, then see you later :)
[20:41:41] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:43:06] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab/acme_chief: remove gitlab2001 from list of (passive) hosts [puppet] - https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[20:43:31] <koi>	 urbanecm: Sorry to bother again, I would like to have some input for T106068 (its a config change), where's the suggested place for me to go and like posting a notice for that?
[20:43:31] <stashbot>	 T106068: [DisableAccount] Remove "inactive" user group - https://phabricator.wikimedia.org/T106068
[20:43:38] <koi>	 *it's
[20:44:25] <wikibugs>	 (CR) Dzahn: [C: +1] "this goes last after the cookbook. other changes go before the cookbook" [puppet] - https://gerrit.wikimedia.org/r/806864 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[20:48:41] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:42] <urbanecm>	 koi: if i understand the issue correctly, you want to remove the inactive group and before doing so, you want to ensure the group's not used for anything
[20:50:45] <urbanecm>	 is that right?
[20:51:49] <koi>	 yep, as far as I know at least one private site has this issue - someone in the inactive group but not blocked
[20:53:06] <wikibugs>	 (PS1) Cwhite: logstash: restore logging to the ecs-test partition [puppet] - https://gerrit.wikimedia.org/r/807216 (https://phabricator.wikimedia.org/T310760)
[20:53:25] <urbanecm>	 koi: in that case, user-notice is your friend. tag the task with #user-notice and add a comment summarizing the impact in "plain English" (including stuff you'd like people to check for). once it has gone through tech news, we can wait for a while and, assuming no issues are raised, i'd be comfortable with going ahead
[20:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:54:20] <urbanecm>	 if there are any particular issues to check for before group removal (such as the presence of unblocked users in the group), feel free to define those issues in a separate comment -- i can run a private wiki-wide SQL query to check where the issue is present (and we can check with those responsible for whichever private wikis are affected, in addition to a tech news entry)
[20:54:57] <urbanecm>	 does that make sense?
[20:55:32] <koi>	 um, should this task therefore be protected, as there might be a kind of security risk there, like data leaking?
[20:55:45] <koi>	 yeah, pretty clear, thanks a lot
[20:56:37] <urbanecm>	 koi: well, so long as you only describe potential issues, we'd be fine. i can paste the results of whichever query i run in a private paste, or we can create a separate task for coordination that requires a private discussion space
[20:57:21] <koi>	 clear to me, doing
[20:57:44] <urbanecm>	 okay, great. feel free to ping me if i can help :)
[21:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:07:12] <wikibugs>	 (PS1) David Caro: openstack.vendordata: Allow downgrading packages too [puppet] - https://gerrit.wikimedia.org/r/807221 (https://phabricator.wikimedia.org/T309930)
[21:07:56] <wikibugs>	 (CR) David Caro: [C: +2] openstack.vendordata: reduce timeout so it retries [puppet] - https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[21:08:22] <wikibugs>	 (CR) David Caro: [C: +2] openstack.vendordata: Allow downgrading packages too [puppet] - https://gerrit.wikimedia.org/r/807221 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[21:17:16] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: restore logging to the ecs-test partition [puppet] - https://gerrit.wikimedia.org/r/807216 (https://phabricator.wikimedia.org/T310760) (owner: Cwhite)
[21:19:00] <wikibugs>	 SRE, DNS, Traffic, WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Dzahn) >>! In T310738#8017973, @Varnent wrote: > @Dzahn - is that doable? I am not sure if we have redirected to web.archive.org...
[21:21:18] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, some minor nits. It's unfortunate that helm doesn't have an api, and that you need to drag in so many dependencies, but I don't" [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[21:24:18] <wikibugs>	 (CR) Scardenasmolinar: [C: +1] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: Eigyan)
[21:28:19] <wikibugs>	 (CR) JHathaway: [C: +1] "I suppose another way would be to invoke the help binary directly & consume its json output?" [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[21:31:12] <wikibugs>	 (CR) Dzahn: "great, if it works, of course. but please check that realtime notifications in phab still work after this. (aphlict). I don't think we hav" [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[21:33:33] <wikibugs>	 (PS5) MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099)
[21:40:14] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] "Note that I am taking this over from Jaime while he is out." [puppet] - https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: Jaime Nuche)
[22:06:13] <wikibugs>	 SRE, Traffic-Icebox: Set CORS headers on error pages? - https://phabricator.wikimedia.org/T270526 (BCornwall) a:BCornwall Would love some pointers on where to start; I'll eventually find my way to the right place but it always helps to have an experienced set of hands to guide. :)
[22:06:33] <wikibugs>	 ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (RobH) Open→Resolved a:RobH
[22:06:36] <wikibugs>	 SRE, Traffic: Set CORS headers on error pages? - https://phabricator.wikimedia.org/T270526 (BCornwall)
[22:07:45] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Traffic-Icebox, IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (BCornwall) a:BCornwall
[22:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[22:07:58] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Traffic, IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (BCornwall)
[22:20:20] <wikibugs>	 SRE, ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (RobH) Stalled→Declined so they are listed on the pdus but not normalized.  We're not going to burn on-site remote hands to do this, and we'll just get this done when we update/migrate hardwa...
[22:21:32] <wikibugs>	 SRE, ops-esams: trace qfx5100-spare[12]-esams power cables - https://phabricator.wikimedia.org/T244914 (RobH) Open→Resolved a:RobH they are spare and not powered or cabled, which is why no entry...  should have closed this after realizing this months ago but forgot.
[22:28:54] <wikibugs>	 (PS1) BCornwall: Delete git-setup script [dns] - https://gerrit.wikimedia.org/r/807229
[22:32:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:33:22] <wikibugs>	 (CR) BCornwall: "This is a pretty opinionated CR, so I apologize if it's not helpful. It appears that this script hasn't seen any review/update since 2014 " [dns] - https://gerrit.wikimedia.org/r/807229 (owner: BCornwall)
[22:36:27] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:36:41] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:43:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:50:52] <wikibugs>	 SRE, DNS, Traffic, WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8018203, @Dzahn wrote: >>>! In T310738#8017973, @Varnent wrote: >> @Dzahn - is that doable? I am not sur...
[23:10:21] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:32] <wikibugs>	 (PS2) DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015)
[23:14:20] <wikibugs>	 (CR) DDesouza: [C: +1] "Fixed issues causing CodeSniffer to throw warnings" [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[23:16:59] <wikibugs>	 (PS3) DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[23:17:27] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:48] <wikibugs>	 (CR) DDesouza: [C: +1] "Fixed issues that would cause CodeSniffer to throw warnings." [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[23:22:10] <wikibugs>	 (CR) Tim Starling: [C: +2] "Seems like a good cleanup to me" [puppet] - https://gerrit.wikimedia.org/r/654330 (owner: Aaron Schulz)
[23:45:29] <wikibugs>	 (CR) Aaron Schulz: [V: +1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[23:50:40] <wikibugs>	 (CR) Jdlrobson: "Code looks good, but as discussed you'll want to backport https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/807202 first and" [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: DDesouza)
[23:51:08] <wikibugs>	 (CR) Aaron Schulz: [C: +1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[23:51:20] <wikibugs>	 (PS6) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)