[00:02:28] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:12:00] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:33:02] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:35:04] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell -
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[00:49:04] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:03:40] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:12:00] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:28:33] (PS1) DDesouza: QuickSurveys: Add research-incentive to jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015)
[01:29:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:32:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:42:22] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:55:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:57:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:57:59] SRE, Traffic-Icebox, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (tstarling)
[02:00:48] (PS9) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:04:06] SRE, Traffic-Icebox, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (tstarling)
[02:07:56] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.17 [core] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/806966
[02:08:02] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.17 [core] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/806966 (owner: TrainBranchBot)
[02:23:18] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.17 [core] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/806966 (owner: TrainBranchBot)
[02:30:36] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed:
cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:05] (PS10) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:40:52] (CR) CI reject: [V: -1] Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[02:43:39] (PS11) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[02:47:28] (CR) Tim Starling: "* PS9: rebase" [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[03:05:58] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:38] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:22] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:28] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:26:38] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:58] PROBLEM - Check systemd state on
thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:12] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:45:34] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:46:50] (CR) Tim Starling: "This is pretty harmless, and once it is merged, we can benchmark it in production." [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[03:54:52] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:10] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:56:23] (PS2) Tim Starling: Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[03:57:16] (CR) CI reject: [V: -1] Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[04:02:50] (PS3) Tim Starling: Add "mcrouter-master-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron
Schulz)
[04:04:10] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:14] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:13:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:15:27] (PS3) Tim Starling: Set $wgCentralAuthTokenCacheType to mcrouter-master-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[04:21:11] (PS1) KartikMistry: Update cxserver to 2022-06-21-035954-production [deployment-charts] - https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970)
[04:21:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:28] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:41] (CertManagerCertNotReady) firing: Certificate
istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[04:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[04:48:46] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:51:24] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:07:36] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[05:07:47] (PS1) Marostegui: db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/806972
[05:10:13] (CR) Marostegui: [C: +2] db1132: Disable notifications [puppet] -
https://gerrit.wikimedia.org/r/806972 (owner: Marostegui)
[05:24:42] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:54] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:30:45] (PS1) Marostegui: mariadb: Set innodb_max_dirty_pages_pct to 75 [puppet] - https://gerrit.wikimedia.org/r/806973 (https://phabricator.wikimedia.org/T308380)
[05:33:52] (CR) Marostegui: [C: +2] mariadb: Set innodb_max_dirty_pages_pct to 75 [puppet] - https://gerrit.wikimedia.org/r/806973 (https://phabricator.wikimedia.org/T308380) (owner: Marostegui)
[05:34:00] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) The amount of binlogs per day is also fine (not like parsercache which generates an insane amount of...
[05:48:48] (PS1) Marostegui: Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/806525
[05:49:44] (CR) Marostegui: [C: +2] Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/806525 (owner: Marostegui)
[05:54:14] !log Reboot db1132 and db1181 for kernel upgrade
[05:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:42] (PS1) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:06:13] (CR) CI reject: [V: -1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[06:09:46] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:11:48] (PS2) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:12:40] (CR) CI reject: [V: -1] mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[06:15:23] (PS3) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:26:36] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:04] (PS4) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:29:05] (CR) CI reject: [V: -1] mcrouter: Add stats route for fast increment [puppet] -
https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[06:31:05] (PS5) Tim Starling: mcrouter: Add stats route for fast increment [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)
[06:35:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:39:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:25] (PS1) Muehlenhoff: Remove LDAP access for ppena [puppet] - https://gerrit.wikimedia.org/r/807041
[06:53:24] (CR) Slyngshede: [C: +2] admin: add taavi to contint-admins [puppet] - https://gerrit.wikimedia.org/r/806487 (https://phabricator.wikimedia.org/T309375) (owner: Dzahn)
[06:53:45] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:54:53] (PS3) Slyngshede: zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[06:54:56] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[06:55:15] (PS2) Muehlenhoff: Remove LDAP access for ppena [puppet] - https://gerrit.wikimedia.org/r/807041
[06:56:26] (CR) Slyngshede: [C: +2] zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[06:58:52] (CR) Muehlenhoff: [C:
+2] Remove LDAP access for ppena [puppet] - https://gerrit.wikimedia.org/r/807041 (owner: Muehlenhoff)
[06:59:06] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (taavi) Resolved→Open Hi @SLyngshede-WMF, please also add myself to the `ciadmin` ldap group as requested in the task description. Thanks!
[07:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0700).
[07:00:04] matthiasmullie and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:01] (PS1) Slyngshede: WIP: Ganeti Prometheus exporter deployment [puppet] - https://gerrit.wikimedia.org/r/807043
[07:01:42] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (SLyngshede-WMF) @taavi Sorry, didn't spot that. I'll be right back :)
[07:01:54] (CR) CI reject: [V: -1] WIP: Ganeti Prometheus exporter deployment [puppet] - https://gerrit.wikimedia.org/r/807043 (owner: Slyngshede)
[07:01:58] o/
[07:04:23] brb, nature calls
[07:04:49] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (SLyngshede-WMF) Open→Resolved @taavi You're now added to ciadmin, but let me know if something doesn't work.
[07:08:42] b
[07:09:07] I can deploy my own patch
[07:10:01] (CR) Matthias Mullie: [C: +2] Add ImageSuggestions to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: Matthias Mullie)
[07:10:48] (Merged) jenkins-bot: Add ImageSuggestions to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: Matthias Mullie)
[07:11:24] \o sorry to be late to the party
[07:11:41] matthiasmullie: let me know when you're done, I can deploy my patch
[07:12:01] (CR) Slyngshede: [C: +2] prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:12:18] kostajh: sure!
[07:12:57] (PS2) Slyngshede: prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:17:38] kostajh: all done, the floor is yours!
[07:17:45] matthiasmullie: cheers
[07:18:33] (CR) Kosta Harlan: [C: +2] GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan)
[07:20:05] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:18] matthiasmullie: I don't see my patch on mediawiki-staging after git status && git fetch, have I done something wrong?
[07:21:44] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805766 says the gate pipeline succeeded, but gerrit also shows a merge conflict
[07:22:47] kostajh: looks like it didn't merge
[07:22:50] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805766
[07:23:04] (PS3) Kosta Harlan: GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548)
[07:23:14] alright let's see if a rebase fixes it
[07:25:09] (PS1) Matthias Mullie: [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711)
[07:26:02] (CR) Kosta Harlan: "side note: it could be useful to make a phab task for this and tag this patch with it, for increased visibility and to have a place to gat" [puppet] - https://gerrit.wikimedia.org/r/806488 (owner: Ori)
[07:26:26] (PS1) Matthias Mullie: [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807050 (https://phabricator.wikimedia.org/T302711)
[07:27:25] matthiasmullie: should I press "Submit"? Usually that happens on its own. cc Amir1 && urbanecm
[07:27:52] (CR) Matthias Mullie: [C: +2] GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan)
[07:28:19] Good morning kostajh. Shouldn't be needed.
[07:28:28] kostajh: yeah, usually does it on its own; I guess it didn't because it already had +2 prior?
[07:28:38] (Merged) jenkins-bot: GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766 (https://phabricator.wikimedia.org/T304548) (owner: Kosta Harlan)
[07:28:43] Yeah, that'd do it.
[07:28:49] good morning
[07:28:55] hrm
[07:28:56] ok, thanks
[07:29:17] IIRC, removing your own vote & reapplying +2 also kicks it off again
[07:31:34] sigh, I need to revert my patch, I didn't read back far enough in the relevant phab task
[07:31:46] (PS1) Kosta Harlan: Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992
[07:32:02] (PS1) Muehlenhoff: Remove reprepro config from releases* [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765)
[07:32:04] (CR) Kosta Harlan: [C: +2] Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992 (owner: Kosta Harlan)
[07:32:30] kostajh: don't forget to rebase before hitting +2
[07:32:35] I usually do
[07:33:02] Amir1: it says "Change is up to date with the target branch already (master) "
[07:33:18] (For the revert patch.)
[07:33:25] (CR) Kosta Harlan: [C: +2] Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992 (owner: Kosta Harlan)
[07:33:39] So that's not why it doesn't merge it then
[07:34:11] (Merged) jenkins-bot: Revert "GrowthExperiments: Enable link recommendation on aswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/806992 (owner: Kosta Harlan)
[07:34:13] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[07:35:05] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:56] alright, I'm done
[07:37:12] well, scap is still wrapping up its thing
[07:42:06] (PS2) Slyngshede: zookeeper: remove absented zookeeper-cleanup cron [puppet] -
https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:42:30] (CR) CI reject: [V: -1] zookeeper: remove absented zookeeper-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:44:59] ACKNOWLEDGEMENT - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces ayounsi Telxius outage - https://phabricator.wikimedia.org/T311036 - The acknowledgement expires at: 2022-06-22 07:44:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:44:59] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 ayounsi Telxius outage - https://phabricator.wikimedia.org/T311036 - The acknowledgement expires at: 2022-06-22 07:44:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:44:59] ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP ayounsi Telxius outage - https://phabricator.wikimedia.org/T311036 - The acknowledgement expires at: 2022-06-22 07:44:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:46:03] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:56] kostajh: has your deploy completed ? ;)
[07:49:12] hashar: yes!
[07:49:29] I will start the train dance in a few minutes so :]
[07:52:32] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline."
[puppet] - https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:52:49] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:19] kostajh: I have closed and removed from train blockers a GrowthExperiments task from an earlier train "TypeError: Cannot read properties of undefined (reading 'dailyLimit')" https://phabricator.wikimedia.org/T309768
[07:54:32] looks like that got fixed in master/ wmf.16 and backported to wmf.15
[07:54:39] I am going to roll wmf.17 which does include the fix
[07:54:45] so I went bold and marked that one resolved
[07:54:49] ACKNOWLEDGEMENT - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia ayounsi https://phabricator.wikimedia.org/T311038 - The acknowledgement expires at: 2022-06-22 07:54:30. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:18] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[07:56:19] (CR) Slyngshede: [C: +2] sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:59:13] (PS1) Slyngshede: C:dumps::web::dumpstatusfiles, convert to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673)
[07:59:23] (PS2) Muehlenhoff: Remove reprepro config from releases* [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765)
[08:00:05] hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0800).
[08:00:18] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/806286 (owner: JMeybohm)
[08:01:31] (CR) Muehlenhoff: [C: +2] Remove reprepro config from releases* [puppet] - https://gerrit.wikimedia.org/r/807052 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff)
[08:03:07] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:24] (CR) Volans: Allow to dry-run SREBatchRunnerBase (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/806285 (owner: JMeybohm)
[08:04:49] (PS1) Hashar: testwikis wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807058 (https://phabricator.wikimedia.org/T308070)
[08:04:51] (CR) Hashar: [C: +2] testwikis wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807058 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:05:37] (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807058 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:11:31] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:14:03] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:32] what was wrong I wonder
[08:14:37] !log remove EOLed parsoid debs from releases.wikimedia.org T309765
[08:14:41]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:42] T309765: Retire the old Parsoid deb repository? - https://phabricator.wikimedia.org/T309765
[08:15:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:49] (CR) Volans: [C: -1] "I think there is a small error, see details inline." [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[08:20:31] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:26:38] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 5.347e+06 ge 2.592e+05 ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[08:26:38] ACKNOWLEDGEMENT - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[08:26:38] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1690100572056 and 1231592 seconds ayounsi https://phabricator.wikimedia.org/T311039
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:26:38] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1743034916712 and 1289445 seconds ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:26:38] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1691134467920 and 1231494 seconds ayounsi https://phabricator.wikimedia.org/T311039 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [08:26:49] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) Please let us know before proceeding with this as now db1131 is a master so we'd need to switch it back to become a single replica. So please let us know before hand with 2-3 days... [08:28:02] (03PS1) 10Muehlenhoff: Retire releasers-parsoid group [puppet] - 10https://gerrit.wikimedia.org/r/807061 (https://phabricator.wikimedia.org/T309765) [08:29:18] !log Reboot db1120 for kernel upgrade [08:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:27] 10SRE, 10Infrastructure-Foundations, 10netops: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (10ayounsi) > Kindly be informed that we have logged your issue under ref 01420952, we will investigate and get back to you with our findings. 
[08:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [08:45:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:47:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:47:47] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [08:49:42] (03CR) 10Jbond: [C: 03+1] icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: 10Volans) [08:51:08] (03CR) 10Slyngshede: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/807061 (https://phabricator.wikimedia.org/T309765) (owner: 10Muehlenhoff) [08:52:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35941/console" [puppet] - 10https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:57:26] !log copy package 'jvm-tools' from buster-wikimedia to bullseye-wikimedia on apt1001 - T310980 [08:57:40] (03CR) 10Volans: [C: 03+2] icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: 10Volans) [08:58:36] so testwiki got promoted, I am going to do group0 wikis [08:59:22] (03PS1) 10Hashar: group0 wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807064 (https://phabricator.wikimedia.org/T308070) [08:59:24] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807064 (https://phabricator.wikimedia.org/T308070) (owner: 10Hashar) [08:59:40] (03PS1) 10Elukey: aptrepo: add cassandra components to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) [09:00:05] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807064 (https://phabricator.wikimedia.org/T308070) (owner: 10Hashar) [09:00:41] (03CR) 10CI reject: [V: 04-1] aptrepo: add cassandra components to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:01:33] (03CR) 10Muehlenhoff: "Looks good, 
comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:02:17] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:06:50] (03PS2) 10Slyngshede: memcached: migrate memkeys cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:08:11] (03CR) 10CI reject: [V: 04-1] icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: 10Volans) [09:09:59] (03CR) 10Slyngshede: memcached: migrate memkeys cron to systemd timer job (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:11:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:11:25] (03PS1) 10Elukey: Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) [09:12:43] (03PS1) 10Muehlenhoff: Remove profile::releases::upload and related classes [puppet] - 10https://gerrit.wikimedia.org/r/807069 (https://phabricator.wikimedia.org/T309765) [09:13:19] !log dbmaint s8@codfw T310011 [09:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:13:32] (03PS3) 10Slyngshede: memcached: migrate memkeys cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:14:33] (03CR) 10Elukey: "I haven't tested the tools but the changes look straightforward to me. 
If the changes are good we can cherry pick the commit in the debian" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:18:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35943/console" [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:19:22] (03PS1) 10Muehlenhoff: Remove aptrepo spec test [puppet] - 10https://gerrit.wikimedia.org/r/807071 [09:20:24] !log dbmaint s8@eqiad T310011 [09:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:28] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:20:56] (03PS1) 10Volans: doc: fix intersphinx links [software/spicerack] - 10https://gerrit.wikimedia.org/r/807074 [09:21:52] jouncebot: nowandnext [09:21:53] For the next 0 hour(s) and 38 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0800) [09:21:53] In 3 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300) [09:21:53] In 3 hour(s) and 38 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300) [09:22:43] (03CR) 10Elukey: [C: 03+1] Remove aptrepo spec test [puppet] - 10https://gerrit.wikimedia.org/r/807071 (owner: 10Muehlenhoff) [09:23:05] hashar: looks like traindeployment is done; would it be fine for me to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/806407, or should i wait the ~40 minutes? [09:23:20] urbanecm: go for it :) [09:23:23] thanks! 
[09:23:37] (03CR) 10Urbanecm: [C: 03+2] Add a throttle rule for a Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) (owner: 10Urbanecm) [09:23:40] (03PS2) 10Urbanecm: Add a throttle rule for a Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) [09:23:43] and thank you to have checked with me! [09:23:45] (03CR) 10Urbanecm: [C: 03+2] Add a throttle rule for a Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) (owner: 10Urbanecm) [09:23:47] (03CR) 10Muehlenhoff: Apply 2to3 to migrate the code to Python3 (032 comments) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:24:22] no problem :) [09:25:01] (03Merged) 10jenkins-bot: Add a throttle rule for a Czech course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806407 (https://phabricator.wikimedia.org/T310885) (owner: 10Urbanecm) [09:25:36] (03PS2) 10Elukey: Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) [09:25:48] (03CR) 10Elukey: "Thanks!" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:25:50] (03PS8) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [09:25:52] (03PS17) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [09:28:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, there are some subtleties 2to3 won't catch, but those will be found during ml-cache ramp-up." 
[debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:31:09] okay, scap sync-file completed, but logmsgbot is not here :/ [09:31:28] can a SRE follow https://wikitech.wikimedia.org/wiki/Logmsgbot#Restart to restart it please? [09:31:52] !log 09:29:23 Synchronized wmf-config/throttle.php: 7c9f6a561b2b4b5c5db063bad83bd23e9cbac347: Add a throttle rule for a Czech course (T310885) (duration: 03m 34s) #manually logging in logmsgbot's absence [09:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:59] T310885: Request a throttle lift for Czech course for students – 2022-06-23 - https://phabricator.wikimedia.org/T310885 [09:32:22] is it just me or have irc bots hosted on our networks recently been more unstable than usual? [09:32:35] I'm not sure. perhaps? [09:32:51] (03CR) 10Jbond: Netbox stats, set scrape interval to 1h (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:36:50] (03CR) 10Filippo Giunchedi: "I see why we'd want to store less samples for non-changing data, though scrape intervals larger than 2m AFAIK are to be avoided (details a" [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:37:32] (03CR) 10Ayounsi: Netbox stats, set scrape interval to 1h (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:37:56] (03CR) 10Filippo Giunchedi: "LGTM overall, I'll let Eric comment authoritatively though" [puppet] - 10https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [09:38:02] (03CR) 10Jbond: admin: Temporarily disable legoktm's access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806489 (owner: 10Legoktm) [09:39:39] (03CR) 10Jbond: [C: 03+1] Fix typoes found by Junoser (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/806857 (owner: 10Ayounsi) [09:39:41] jbond: hi, can i trouble you to 
rotate logmsgbot? https://wikitech.wikimedia.org/wiki/Logmsgbot#Restart it's not here and logging deployments :/ [09:39:53] (03CR) 10Volans: [C: 03+2] doc: fix intersphinx links [software/spicerack] - 10https://gerrit.wikimedia.org/r/807074 (owner: 10Volans) [09:40:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) (owner: 10Cwhite) [09:40:42] (03CR) 10Ayounsi: Netbox stats, set scrape interval to 1h (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:43:02] (03CR) 10Vgutierrez: [C: 03+1] service::catalog: Add inference-staging service [puppet] - 10https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:43:33] (03CR) 10Filippo Giunchedi: Netbox stats, set scrape interval to 1h (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:43:54] (03PS1) 10Muehlenhoff: Remove mailman-admins [puppet] - 10https://gerrit.wikimedia.org/r/807078 [09:44:14] 10SRE, 10Data-Engineering, 10Traffic, 10Patch-For-Review, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10phuedx) >>! In T306181#8013301, @Ottomata wrote: > Thanks ben! Seconded. Thanks for all of your w... [09:44:28] (03CR) 10Elukey: [C: 03+2] service::catalog: Add inference-staging service [puppet] - 10https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:44:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/806430 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [09:47:23] (03CR) 10Muehlenhoff: [C: 03+2] Remove aptrepo spec test [puppet] - 10https://gerrit.wikimedia.org/r/807071 (owner: 10Muehlenhoff) [09:48:20] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [09:49:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Nicely done" [alerts] - 10https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [09:49:32] (03CR) 10Btullis: [C: 03+1] "Also LGTM but will defer to Eric." [puppet] - 10https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [09:49:45] (03Merged) 10jenkins-bot: doc: fix intersphinx links [software/spicerack] - 10https://gerrit.wikimedia.org/r/807074 (owner: 10Volans) [09:50:45] (03CR) 10Muehlenhoff: [C: 03+2] Remove references [puppet] - 10https://gerrit.wikimedia.org/r/806426 (owner: 10Muehlenhoff) [09:52:07] (03CR) 10Filippo Giunchedi: Netbox: add monitoring to dns.git endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [09:52:10] (03PS4) 10Volans: icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) [09:52:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [09:52:59] (03CR) 10Jbond: [C: 03+1] Fix typoes found by Junoser (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/806857 (owner: 10Ayounsi) [09:54:02] (03PS2) 10Jbond: aptrepo: add cassandra components to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:54:39] (03CR) 10Filippo Giunchedi: "Sorry I'm lagging a bit behind testing this, I can say for sure though that 'confd' package isn't in Bullseye so this change will fail" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [09:54:41] (03CR) 10Jbond: "looks like moritz removed this spec test so have rebased (lgtm otherwise)" [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:55:16] (03PS2) 10Ayounsi: Netbox stats, set scrape interval to 2m [puppet] - 10https://gerrit.wikimedia.org/r/806422 [09:55:26] (03CR) 10Elukey: "Thanks John!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:55:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [09:56:26] (03CR) 10Jbond: [C: 03+1] Netbox stats, set scrape interval to 2m [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:56:40] (03CR) 10Filippo Giunchedi: [C: 03+1] Netbox stats, set scrape interval to 2m [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [09:57:21] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:59:05] (03PS1) 10Btullis: Update the container image used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/807081 (https://phabricator.wikimedia.org/T310629) [09:59:11] (03CR) 10Ayounsi: [C: 03+2] Netbox stats, set scrape interval to 2m [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [10:00:59] (03CR) 10Ayounsi: [C: 03+2] Fix typoes found by Junoser [homer/public] - 10https://gerrit.wikimedia.org/r/806857 (owner: 10Ayounsi) [10:03:00] (03CR) 10JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:05:33] (03CR) 10JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:06:21] PROBLEM - Check systemd state on kubernetes1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - 
https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [10:10:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:51] (03CR) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [10:15:06] (03CR) 10Btullis: [C: 03+2] Update the container image used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/807081 (https://phabricator.wikimedia.org/T310629) (owner: 10Btullis) [10:15:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:17:16] 10SRE-tools, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10JMeybohm) [10:17:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:17:45] (03CR) 10JMeybohm: Allow to dry-run SREBatchRunnerBase (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/806285 (owner: 10JMeybohm) [10:19:26] (03Merged) 10jenkins-bot: Update the container image used for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/807081 (https://phabricator.wikimedia.org/T310629) (owner: 10Btullis) [10:24:56] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [10:25:18] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 [10:26:15] (03CR) 10Volans: [C: 04-1] "You need to import RemoteExecutionError from spicerack" [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff) [10:26:50] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [10:27:26] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] memcached: migrate memkeys cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:27:41] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 [10:28:03] (03CR) 10Klausman: [C: 03+1] aptrepo: add cassandra components to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [10:28:44] (03CR) 10Klausman: [C: 03+1] Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [10:30:02] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve 
announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:30:12] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:30:35] (03CR) 10CI reject: [V: 04-1] sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff) [10:31:22] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:20] (03PS4) 10Ayounsi: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) [10:32:41] (03PS3) 10Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 [10:33:19] (03PS2) 10KartikMistry: Update cxserver to 2022-06-21-035954-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970) [10:34:02] PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/inference-staging https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:34:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10jcrespo) I can take care of install and puppet changes if firmware/boot is taken care of -it that helps speed it up. We would like to have 100... 
[10:34:22] (03PS1) 10Ayounsi: Prometheus: temporarily disable the Netbox job [puppet] - 10https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) [10:34:34] elukey: ^^ [10:35:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi) [10:36:40] PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/inference-staging https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:36:55] (03CR) 10Filippo Giunchedi: Netbox: add monitoring to dns.git endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [10:36:58] elukey: I'm assuming that's triggered by 'cluster=ml_staging,service=kubesvc' not having any server [10:36:59] (03CR) 10Slyngshede: [C: 03+1] dumps: migrate cron of dumps-exception-checker to systemd timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:37:07] (03CR) 10Slyngshede: [C: 03+2] dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:37:41] Jun 21 10:37:15 puppetmaster1001 confd[19313]: 2022-06-21T10:37:15Z puppetmaster1001 /usr/bin/confd[19313]: ERROR 100: Key not found (/conftool/v1/pools/codfw/ml_staging/kubesvc) [582380] [10:37:42] (03CR) 10Filippo Giunchedi: [C: 03+1] Prometheus: temporarily disable the Netbox job [puppet] - 10https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi) [10:37:44] looks like it [10:38:48] (03CR) 10Ayounsi: [C: 03+2] Prometheus: temporarily disable the Netbox job [puppet] - 10https://gerrit.wikimedia.org/r/807091 (https://phabricator.wikimedia.org/T311048) (owner: 10Ayounsi) 
[10:38:52] * kart_ updating cxserver [10:39:12] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-06-21-035954-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970) (owner: 10KartikMistry) [10:39:24] (03CR) 10Slyngshede: [C: 03+2] osm: migrate import_waterlines cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/781050 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:39:38] (03CR) 10Hnowlan: [C: 03+1] C:tilerator::regen fix logging and rename service. [puppet] - 10https://gerrit.wikimedia.org/r/805829 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [10:41:49] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35944/console" [puppet] - 10https://gerrit.wikimedia.org/r/781050 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:42:03] (03CR) 10Muehlenhoff: sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff) [10:42:17] (03CR) 10Hashar: [C: 03+1] "I don't know how it will affects Phabricator though :)" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:42:22] (03Merged) 10jenkins-bot: Update cxserver to 2022-06-21-035954-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/806970 (https://phabricator.wikimedia.org/T307970) (owner: 10KartikMistry) [10:42:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff) [10:42:54] RECOVERY - Check systemd state on kubernetes1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:29] (03CR) 10Filippo Giunchedi: Add a host's confctl pooled status and weight per service to prometheus (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [10:44:45] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [10:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:12] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [10:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:54] (03CR) 10JMeybohm: "Hey Jesse, do you have some time to do a review of this by chance?" [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [10:47:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:47:17] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [10:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:28] (03PS2) 10JMeybohm: Add helm-state-metrics helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) [10:47:30] (03PS2) 10JMeybohm: Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) [10:47:35] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[10:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:58] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [10:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] (03PS1) 10Muehlenhoff: squid: Harden config, we don't use Gopher anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807093 [10:48:49] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [10:48:52] (03CR) 10JMeybohm: Add helm-state-metrics helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [10:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:06] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:49:35] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [10:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:56] !log Updated cxserver to 2022-06-21-035954-production (T307970) [10:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:01] T307970: Deploy Flores Machine Translation in a new set of Languages - https://phabricator.wikimedia.org/T307970 [10:57:44] !log deleting netbox getstats.GetDeviceStats job results - T311048 [10:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:50] T311048: Netbox DB is growing out of control - https://phabricator.wikimedia.org/T311048 [10:59:47] (03PS2) 10JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) [10:59:49] (03PS3) 10JMeybohm: sre.k8s.reboot-node: 
Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) [10:59:51] (03PS1) 10Muehlenhoff: squid/url downloaders: Drop Gopher in ACLs, not used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807094 [11:00:22] (03PS1) 10Jbond: getstats: Delete old ve5rsions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 [11:01:56] (03PS2) 10Jbond: getstats: Delete old ve5rsions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [11:02:34] (03PS3) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [11:02:41] (03CR) 10JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [11:03:51] (03CR) 10Muehlenhoff: [C: 03+2] Retire releasers-parsoid group [puppet] - 10https://gerrit.wikimedia.org/r/807061 (https://phabricator.wikimedia.org/T309765) (owner: 10Muehlenhoff) [11:10:45] (03PS1) 10Klausman: net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) [11:11:50] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:16:24] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:17:35] (03PS1) 10Jbond: WIP: make the export title much more unique [puppet] - 10https://gerrit.wikimedia.org/r/807097 [11:18:05] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [11:21:16] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:32:45] (03PS1) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) [11:34:17] (03PS1) 10Filippo Giunchedi: smokeping: stop targetting cr devices, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) [11:34:47] (03PS2) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) [11:35:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35946/console" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [11:35:59] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Also catch RemoteExecutionError in trunking check [cookbooks] - 10https://gerrit.wikimedia.org/r/807090 (owner: 10Muehlenhoff) [11:36:23] (03PS4) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [11:37:11] (03CR) 10Jbond: [V: 03+1] "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/807099 
(https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [11:37:47] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10jbond) [11:39:18] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:51] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:tilerator::regen fix logging and rename service. [puppet] - 10https://gerrit.wikimedia.org/r/805829 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:41:08] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:41:50] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:41:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 for testing', diff saved to https://phabricator.wikimedia.org/P29934 and previous config saved to /var/cache/conftool/dbconfig/20220621-114151-root.json [11:41:54] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:57] (03PS2) 10Zabe: osm: remove absented import_waterlines cron [puppet] - 10https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673) [11:42:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for testing', diff saved to https://phabricator.wikimedia.org/P29935 and previous config saved to /var/cache/conftool/dbconfig/20220621-114216-root.json [11:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:32] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 for testing', diff saved to https://phabricator.wikimedia.org/P29936 and previous config saved to /var/cache/conftool/dbconfig/20220621-114232-root.json [11:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:39] (03PS2) 10Zabe: memcached: remove absented memkeys cron [puppet] - 10https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673) [11:42:56] (03PS3) 10Zabe: sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) [11:43:24] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [11:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [11:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [11:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:32] (03PS1) 10Zabe: dumps: remove absented dumps-exception-checker cron [puppet] - 10https://gerrit.wikimedia.org/r/807101 (https://phabricator.wikimedia.org/T273673) [11:48:30] (03PS3) 10Zabe: zookeeper: remove absented zookeeper-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) [11:50:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the reviews -- nothing substantial should change I think, I'll try and deploy the patch next week!" 
[puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:50:03] (03CR) 10Ayounsi: [C: 03+1] "To be tested but logic sgtm!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [11:51:47] (03PS1) 10Jelto: gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) [11:55:35] (03CR) 10Ayounsi: P:netbox: add dynamic config back to config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [11:55:37] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35947/console" [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [11:55:55] (03PS2) 10Jbond: wmflib::resource::export: make exported resource titles more unique [puppet] - 10https://gerrit.wikimedia.org/r/807097 [11:56:24] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Fix bridge detection logic and provide guidance what do you [cookbooks] - 10https://gerrit.wikimedia.org/r/807105 [11:59:04] !log mbsantos@maps2009 imposm-removebackup-import (T305845) [11:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:11] T305845: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 [12:00:40] (03CR) 10Jelto: [V: 03+1] "This may be a solution for filling docker cache on gitlab-runner nodes." 
[puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [12:00:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35948/console" [puppet] - 10https://gerrit.wikimedia.org/r/807097 (owner: 10Jbond) [12:01:47] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Fix bridge detection logic and provide guidance what do you [cookbooks] - 10https://gerrit.wikimedia.org/r/807105 (owner: 10Muehlenhoff) [12:02:19] (03PS1) 10MSantos: maps: re-enable tile generation cron in codfw [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845) [12:05:42] (03PS2) 10MSantos: maps: re-enable tile generation cron in codfw [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845) [12:05:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [12:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [12:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:23] (03CR) 10Filippo Giunchedi: "I ran into this while debugging sth else in Pontoon, please let me know what you think (and related https://gerrit.wikimedia.org/r/c/opera" [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi) [12:06:53] (03CR) 10CI reject: [V: 04-1] maps: re-enable tile generation cron in codfw [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845) (owner: 10MSantos) [12:07:21] (03CR) 10Filippo Giunchedi: [C: 03+2] "No worries! 
Thank you for the review" [puppet] - 10https://gerrit.wikimedia.org/r/806377 (owner: 10Filippo Giunchedi) [12:07:27] (03PS2) 10Filippo Giunchedi: pontoon: update hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/806377 [12:12:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [12:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:12:17] (03CR) 10Filippo Giunchedi: base: include profile::pontoon::base (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806374 (owner: 10Filippo Giunchedi) [12:12:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [12:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:36] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:19:50] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [12:20:58] (03PS5) 10Jbond: Netbox: add monitoring to dns.git endpoint [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [12:23:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35950/console" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [12:23:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib::resource::export: make exported resource 
titles more unique [puppet] - 10https://gerrit.wikimedia.org/r/807097 (owner: 10Jbond) [12:25:06] (03CR) 10Volans: [C: 03+1] "LGTM, see inline for the discussed point" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [12:25:38] !log reset logster-csp/logster-badpass-priv on mwlog1002, these were removed from Puppet [12:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:25:50] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [12:29:58] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10MoritzMuehlenhoff) [12:30:10] (03PS3) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) [12:30:13] (03CR) 10Jbond: P:netbox: add dynamic config back to config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [12:30:25] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10MoritzMuehlenhoff) 05Open→03Resolved ganeti4004 has been 
added to the ganeti/ulsfo cluster now. Cluster is currently rebalancing. [12:32:01] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [12:32:14] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:45] (03PS1) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [12:33:59] (03CR) 10Jbond: [C: 03+1] pontoon: add metricsinfra_prometheus_nodes to settings [puppet] - 10https://gerrit.wikimedia.org/r/806379 (owner: 10Filippo Giunchedi) [12:34:42] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [12:35:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi) [12:35:46] (03CR) 10Majavah: [C: 04-1] "You should be able to leave this variable empty as long as `prometheus_nodes` is set up correctly." 
[puppet] - 10https://gerrit.wikimedia.org/r/806379 (owner: 10Filippo Giunchedi) [12:35:57] (03CR) 10Majavah: [C: 03+1] wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi) [12:36:09] (03PS2) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [12:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [12:37:03] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [12:37:34] (03CR) 10Jelto: "small question inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [12:39:44] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378 (owner: 10Filippo Giunchedi) [12:39:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet [12:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] (03PS2) 10Filippo Giunchedi: wmcs: add default for metricsinfra_prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/806378 [12:40:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet [12:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:43] (03CR) 10Filippo Giunchedi: pontoon: add metricsinfra_prometheus_nodes to settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806379 (owner: 10Filippo Giunchedi) [12:43:44] !log 
installing python-bottle security updates [12:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:02] (03PS2) 10Filippo Giunchedi: pontoon: rework prometheus settings in its own file [puppet] - 10https://gerrit.wikimedia.org/r/806379 [12:48:41] (03CR) 10Volans: [C: 03+1] "LGTM, question inline" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [12:48:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline for non-blocking comment" [puppet] - 10https://gerrit.wikimedia.org/r/806405 (https://phabricator.wikimedia.org/T310831) (owner: 10Ayounsi) [12:50:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet [12:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet [12:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:30] (03CR) 10Jbond: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [12:52:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet [12:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet [12:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:52] (03PS4) 10Jbond: P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) [12:54:07] (03CR) 10Ssingh: [V: 03+1] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/805874 
(https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [12:54:12] (03CR) 10Jbond: P:netbox: add dynamic config back to config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [12:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:54:22] (03PS6) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) [12:54:51] (03CR) 10Hashar: [C: 03+1] "Sounds good, we can later on investigate offloading the caches to Swift/S3 ;)" [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [12:55:07] (03PS3) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [12:55:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35952/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [12:56:04] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [12:56:28] (03CR) 10Muehlenhoff: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [12:56:39] (03CR) 10Hashar: [C: 03+1] "On integration and contint* machines we do some pruning via ::profile::docker::prune , but given Gitlab provides its own clear cache syste" [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [12:56:48] !log installing haproxy security updates on 
stretch [12:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet [12:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:23] (03PS4) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [12:59:28] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [12:59:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet [12:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:55] (03CR) 10Ssingh: [V: 03+1] "Change is ready for review again, addressing the optional nits: using external and merging the BGP configurations." [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [12:59:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] rpc: Remove unused RunJobs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300). [13:00:04] duesen, xsavitar, Lucas_WMDE, itamarWMDE, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300) [13:00:11] (03CR) 10Ssingh: [V: 03+1] "PCC for centrallog and dns: https://puppet-compiler.wmflabs.org/pcc-worker1003/35882/" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:01:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet [13:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:52] (03PS5) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [13:02:28] (03PS6) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [13:02:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet [13:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:23] (03CR) 10Ssingh: [V: 03+1] "Interestingly enough, using "external" results in bird2 complaining:" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:03:25] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [13:04:13] Lucas will be here in a second, slight IRC client trouble [13:04:21] o/ here, and Lucas_WMDE is on his way [13:05:04] !log installing Linux 5.10.120-1~bpo10+1 on buster hosts with backports kernel [13:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:22] o/ [13:05:27] o/ [13:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:05:49] xSavitar will probably not come, his power went out an hour ago [13:06:30] (03PS7) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [13:06:49] (03CR) 10Ssingh: [V: 03+1] "Interesting, seems like reverting patchset 5 is probably a good idea :)" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:07:10] I may need some hand holding with deploying my "config" patch. It's removing an unused endpoint. No idea how to test this. [13:08:28] o/ [13:08:51] duesen, we can deploy [13:09:20] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [13:09:42] xSavitar: [13:09:48] Here are the deploy commands: https://deploy-commands.toolforge.org/bacc/805775 [13:09:58] Doing that and testing should be fine. [13:10:12] duesen ^^ [13:10:24] (03CR) 10Volans: [C: 04-1] "Idea might be ok (modulo race conditions), but the implementation has errors." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [13:10:28] (03PS8) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [13:12:37] urbanecm, awight: can Derick and me go ahead with the deployment? [13:12:50] duesen: go ahead if you feel comfortable. [13:12:54] i can deploy if you're not [13:13:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet [13:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:26] (03CR) 10CI reject: [V: 04-1] C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [13:13:28] urbanecm: we'll go ahead [13:13:34] okay.
ping me if i can help :) [13:13:47] <3 urbanecm [13:14:23] (03PS9) 10Slyngshede: C:base::puppet move Puppet to Systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/807118 [13:14:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet [13:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:20] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10MoritzMuehlenhoff) >>! In T67270#8012925, @Legoktm wrote: > Can we clarify what the goal here is? More recently I've been good about throwing a GP... [13:16:10] (03CR) 10Daniel Kinzler: [C: 03+2] rpc: Remove unused RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01) [13:16:28] I forgot to merge the patch beforehand, will take a couple of minutes [13:16:31] it's a config patch though [13:16:44] hello [13:16:47] I think I *finally* made it here [13:16:52] sorry I’m late [13:16:54] hey Lucas_WMDE ! [13:17:13] (03PS1) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 [13:17:18] urbanecm: you’re deploying? [13:17:19] (03Merged) 10jenkins-bot: rpc: Remove unused RunJobs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01) [13:17:22] Welcome Lucas_WMDE! 
:D [13:17:31] Lucas_WMDE: duesen and xSavitar are [13:17:33] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:17:34] I'm just standing by [13:18:14] ok, patch merged [13:18:14] ok [13:18:28] (03CR) 10Slyngshede: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [13:20:09] I pulled the patch to mwdebug1001. [13:20:14] I unfortunately have a meeting starting in 10 minutes, so I might deploy my Lexeme Lua patch in the break after the backport+config window [13:20:14] There is nothing to test, really. [13:20:25] (03CR) 10CI reject: [V: 04-1] prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris) [13:20:28] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35955/console" [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [13:22:58] I'll scap now. Worst that can happen is job execution breaking... [13:23:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:57] PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7293 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:24:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang) [13:24:19] urbanecm: uh, how do you scap a file deletion? [13:24:20] # [13:24:26] ...for config [13:24:32] duesen: scap the folder the file is in [13:24:36] (well, was) [13:24:39] kk!
[13:24:59] running [13:25:04] (03PS2) 10Ori: varnish: sort query parameters on the Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/806488 (https://phabricator.wikimedia.org/T138093) [13:25:06] (03CR) 10Slyngshede: C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [13:26:30] (03CR) 10Elukey: "There is a bit missing IIUC, but the rest looks good! After this change I'd file another one to change profile::pki::multirootca and add t" [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:27:05] (03PS2) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 [13:28:01] ...still going... [13:28:12] yeah, it takes a couple of minutes those days :/ [13:28:38] !log daniel@deploy1002 Synchronized rpc/: Config: [[gerrit:805775|rpc: Remove unused RunJobs.php (T175146 T243096)]] (duration: 03m 45s) [13:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:45] T243096: Jobrunner monitoring still calles /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096 [13:28:45] T175146: [RfC] Move RunJobs.php to the mediawiki (core) repository - https://phabricator.wikimedia.org/T175146 [13:29:06] ok, sync is done. I'm seeing nothing suspicious on logstash so far. 
[13:30:16] (03CR) 10CI reject: [V: 04-1] prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris) [13:30:24] (03CR) 10Ori: [C: 03+2] varnish: sort query parameters on the Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/806488 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [13:30:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet [13:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:07] * Lucas_WMDE afk for 30 minutes [13:32:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet [13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:27] ACKNOWLEDGEMENT - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7714 seconds old and 217 bytes Btullis T310293 - running on standby server temporarily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:32:27] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [13:32:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:19] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:35:22] urbanecm, 
Lucas_WMDE: all good. [13:35:52] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:37:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:04] 10SRE-swift-storage, 10Infrastructure-Foundations: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10MatthewVernon) [13:38:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet [13:38:33] itamarWMDE, koi: do you want to deploy now? [13:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet [13:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:03] (03PS2) 10Klausman: net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) [13:40:13] (03CR) 10Hokwelum: [C: 03+1] "This looks good" [puppet] - 10https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:40:16] I couldn't, could anyone help me [13:40:24] (03CR) 10Klausman: net: Add network config setup for ML staging k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:41:09] (03CR) 10Ayounsi: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:41:46] koi: urbanecm should be able to help. [13:41:48] urbanecm, do you want to take on the other patches? 
[13:41:57] sure [13:41:59] jouncebot: now [13:41:59] For the next 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300) [13:41:59] For the next 0 hour(s) and 18 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1300) [13:42:08] Thank you urbanecm <3 [13:42:29] i'm not sure we can deploy all patches though [13:43:34] duesen: Don't have prod deployment access, if that's what you're asking [13:43:49] yeah, I'll try to deploy what i can [13:43:50] (03CR) 10Ssingh: [V: 03+1] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:43:58] we don't have a lot of time though [13:44:24] (03PS3) 10Urbanecm: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:44:28] (03CR) 10Urbanecm: [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:44:47] (03PS3) 10Urbanecm: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:44:54] (03CR) 10Urbanecm: [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:45:07] Thank you urbanecm [13:45:14] (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] -
10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:45:39] (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:46:02] itamarWMDE: pulled to mwdebug1001, can you check please? [13:46:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet [13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:15] (03CR) 10Elukey: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:46:45] (03PS1) 10Muehlenhoff: aptly: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807127 (https://phabricator.wikimedia.org/T308013) [13:46:47] (03PS1) 10Muehlenhoff: aptrepo: Add a few missing SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013) [13:46:49] (03PS1) 10Muehlenhoff: grafana: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013) [13:46:52] (03PS1) 10Muehlenhoff: smokeping: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013) [13:47:02] (03CR) 10Hokwelum: [C: 03+1] "This looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807101 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:47:20] (03PS7) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) [13:48:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:48:04] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35956/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:48:05] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:48:05] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:48:07] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:48:07] umh, did something just break? 
[13:48:16] (03PS5) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [13:48:19] (ProbeDown) firing: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:48:19] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:48:27] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:48:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet [13:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:37] taavi: i didn't touch anything yet [13:48:48] but let me check, there were some deployments [13:48:55] urbanecm: connection seems slow here, sorry for delay [13:48:59] no problem [13:49:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet [13:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:28] (03CR) 10Elukey: [C: 03+2] aptrepo: add cassandra components to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/807065 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [13:49:30] (03PS8) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) [13:50:13] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35957/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:50:17] (03CR) 10Jbond: "thanks updated" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [13:50:17] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:50:17] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:50:17] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:50:33] looks like a temporary issue taavi [13:50:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [13:50:45] great [13:50:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:51:00] huh, I got paged, despite not being on duty [13:51:21] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:51:31] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6027 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:51:33] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:52:01] Emperor: we're still paging everyone [13:52:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:52:07] (03CR) 10ArielGlenn: [C: 04-1] C:snapshot::dumps::timechecker convert cron to timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:13] urbanecm: seems good to me [13:52:33] itamarWMDE: thanks, but not deploying atm, seems we're in a middle of a problem [13:52:49] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:52:57] (03CR) 10ArielGlenn: [C: 04-1] "I am a little uneasy about creating a new wrapper script just for this one thing; is there no nicer way to pass in multiple commands?" [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:52:58] FE [13:52:59] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:53:18] (ProbeDown) resolved: (10) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:33] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:53:42] (03CR) 10Ssingh: [V: 03+1] "Based on Arzhel's last comment, ready for (final?) review again. Changes: neighbor external.
Distinct BGP blocks as required by bird2." [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:53:43] RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:17] (03PS1) 10Klausman: hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) [13:54:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet [13:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:52] urbanecm: no worries, was so busy trying to test I didn't notice, thanks :D [13:55:03] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:55:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [13:55:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:56:13] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:56:14] (03CR) 10Eevans: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [13:57:58] * Lucas_WMDE back [13:58:02] itamarWMDE: np. i'll finish the deployment or revert later, depending on how the incident goes. thanks for the test. [13:58:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:58:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:16] (03CR) 10Eevans: [C: 03+2] Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [14:00:27] (03CR) 10Eevans: [V: 03+2 C: 03+2] Apply 2to3 to migrate the code to Python3 [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/807068 (https://phabricator.wikimedia.org/T310980) (owner: 10Elukey) [14:02:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [14:24:31] !log on going maintenance on ps1-a2-codfw [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:23] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:32:13] PROBLEM - Host 
ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:32:36] papaul: ^ [14:33:05] PROBLEM - Host lvs2007 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:13] PROBLEM - Host kubernetes2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:35] PROBLEM - Host ms-be2044 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:57] PROBLEM - Host kafka-logging2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:57] PROBLEM - Host ping2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:09] PROBLEM - Host elastic2055 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:09] PROBLEM - Host ganeti2030 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:17] (03PS3) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 [14:34:43] PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:43] PROBLEM - Host grafana2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:53] PROBLEM - Host netbox2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:54] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:57] PROBLEM - Host ms-fe2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:57] PROBLEM - Host authdns2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:59] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:59] PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:01] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:05] PROBLEM - Host urldownloader2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:07] PROBLEM - Host ms-be2029 is 
DOWN: PING CRITICAL - Packet loss = 100% [14:35:09] PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:09] PROBLEM - Host rpki2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:17] PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:18] (ProbeDown) firing: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:21] PROBLEM - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:21] PROBLEM - Host deneb is DOWN: PING CRITICAL - Packet loss = 100% [14:35:25] PROBLEM - Host ganeti2029 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:36:06] oh oh... 
[14:36:47] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:55] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:37:19] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64602/IPv4: Connect - kubernetes-codfw, AS64600/IPv4: Connect - PyBal, AS64602/IPv6: Active - kubernetes-codfw, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:37:21] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:37:21] PROBLEM - Juniper virtual chassis ports on asw-a-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:37:45] (JobUnavailable) firing: Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:17] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:38:28] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:38:39] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:38:58] (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:39:13] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2305.codfw.wmnet, mw2380.codfw.wmnet, mw2389.codfw.wmnet, mw2387.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2396.codfw.wmnet, mw2304.codfw.wmnet, mw2295.codfw.wmnet, mw2252.codfw.wmnet, mw2251.codfw.wmnet, mw2299.codfw.wmnet, mw2306.codfw.wmnet are marked down but pooled https://wikitech.wikimedia [14:39:13] i/PyBal [14:39:35] RECOVERY - Host urldownloader2001 is UP: PING WARNING - Packet loss = 90%, RTA = 33.32 ms [14:39:35] RECOVERY - Host ms-be2029 is UP: PING WARNING - Packet loss = 50%, RTA = 33.10 ms [14:39:35] RECOVERY - Host rpki2002 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:39:35] RECOVERY - Host ms-be2051 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [14:39:37] RECOVERY - Host netbox2002 is UP: PING OK - Packet loss = 0%, RTA = 33.31 ms [14:39:37] RECOVERY - Host authdns2001 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [14:39:37] RECOVERY - Host thanos-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [14:39:37] RECOVERY - Host ms-be2044 is UP: PING OK - Packet loss = 0%, RTA = 34.83 ms [14:39:37] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 34.75 ms [14:39:39] RECOVERY - Juniper virtual chassis ports on asw-a-codfw is OK: OK: UP: 28 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:39:39] RECOVERY - Host kubernetes2005 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms [14:39:39] RECOVERY - Host ganeti2029 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [14:39:39] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [14:39:39] RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [14:39:41] RECOVERY - Host lvs2007 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [14:39:45] RECOVERY - Host ping2002 is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms [14:39:45] RECOVERY - Host kafka-logging2001 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [14:39:45] RECOVERY - Host elastic2055 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:39:45] RECOVERY - Host grafana2001 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [14:39:49] RECOVERY - Host ms-fe2009 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [14:39:55] RECOVERY - Host ganeti2030 is UP: PING OK - Packet loss = 0%, RTA = 31.75 ms [14:39:55] RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [14:40:01] RECOVERY - Host deneb is UP: PING OK - Packet loss = 0%, RTA = 31.77 ms [14:40:31] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:40:55] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:57] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [14:40:59] RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 31.87 ms [14:41:06] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:41:06] RECOVERY - Host ms-be2040 is UP: PING OK - Packet loss = 0%, RTA = 31.89 ms [14:41:23] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:41:29] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:53] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:43:45] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:44:53] papaul: was that expected? ^ [14:44:55] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:45:09] PROBLEM - Check systemd state on kubernetes2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:30] XioNoX: no second power cable for asw was not plug all the way in [14:45:34] (ProbeDown) resolved: Service sessionstore:8081 has failed probes (http_sessionstore_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:41] so bump into it and got disconnected [14:45:41] ok [14:46:09] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:36] (03CR) 10BCornwall: [C: 03+2] data-engineering: add varnishkafka delivery errors [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [14:46:43] (JobUnavailable) resolved: (5) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:46] (Emergency syslog message) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:46:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:47:03] (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:47:11] (KubernetesCalicoDown) resolved: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:48:31] (03CR) 10Muehlenhoff: [C: 03+2] acme_chief: Remove old buster IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/805140 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [14:49:33] PROBLEM - Host ms-be2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:50:09] (03PS2) 10BCornwall: Traffic: Port over purged lag/queue monitors [alerts] - 10https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723) [14:51:06] (03PS1) 10Muehlenhoff: Remove old buster IDPs from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/807140 [14:51:27] RECOVERY - Host ms-be2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.20 ms [14:53:22] (Emergency syslog message) resolved: Device asw-a-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:53:43] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:29] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:55:00] (03CR) 10BCornwall: [C: 03+2] Traffic: Port over purged lag/queue monitors [alerts] - 10https://gerrit.wikimedia.org/r/806332 
(https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [14:55:06] (03PS1) 10Urbanecm: Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807148 (https://phabricator.wikimedia.org/T304328) [14:55:10] (03PS1) 10Urbanecm: Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807149 (https://phabricator.wikimedia.org/T304328) [14:55:17] (03CR) 10Urbanecm: [C: 03+2] Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807148 (https://phabricator.wikimedia.org/T304328) (owner: 10Urbanecm) [14:55:26] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807149 (https://phabricator.wikimedia.org/T304328) (owner: 10Urbanecm) [14:55:30] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807148 (https://phabricator.wikimedia.org/T304328) (owner: 10Urbanecm) [14:56:05] RECOVERY - Check systemd state on kubernetes2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:40] (03CR) 10Muehlenhoff: [C: 03+2] Remove old buster IDPs from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/807140 (owner: 10Muehlenhoff) [14:57:59] (03PS1) 10Ssingh: dnsdist: override unit to set ProtectSystem to strict [puppet] - 10https://gerrit.wikimedia.org/r/807142 [14:58:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:01] (03CR) 10Elukey: [C: 03+1] hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - 
10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:59:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:59:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:59:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35958/console" [puppet] - 10https://gerrit.wikimedia.org/r/807142 (owner: 10Ssingh) [14:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] !log PDU swap for rack a2 complete [15:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:23] (03PS1) 10Majavah: sonofgridengine: grid_configurator: ignore non-ACTIVE instances [puppet] - 10https://gerrit.wikimedia.org/r/807143 [15:05:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:35] (03PS2) 10Klausman: hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) [15:05:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35960/console" [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) 
(owner: 10Klausman) [15:06:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:06:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:58] !log installing avahi security updates [15:06:58] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10jbond) > In such cases it might make sense to align such files by relicensing to Apache 2 starting of with the obligatory IANAL :). My understan... 
[15:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:36] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35961/console" [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:09:30] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35962/console" [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:09:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:12:08] (03CR) 10Vgutierrez: [C: 03+1] hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:13:00] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q4), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) the varnish-mmap-count situation could be resolved with https://github.com/prometheus/proc... 
[15:13:57] jouncebot: now [15:13:57] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [15:14:14] (03PS2) 10Majavah: sonofgridengine: grid_configurator: ignore non-ACTIVE instances [puppet] - 10https://gerrit.wikimedia.org/r/807143 [15:14:18] if the incident is resolved for now, I’d like to deploy one config change that didn’t make it during the backport window [15:14:24] if anyone wants to object to that, shout :) [15:15:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:35] !log klausman@cumin1001 conftool action : help; selector: name=ml-staging2001 [15:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:09] !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2001.codfw.wmnet [15:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:14] !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2002.codfw.wmnet [15:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:23] !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging-ctrl2002.codfw.wmnet [15:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:48] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Switch ML staging inference endpoint to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/807133 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:18:22] (03CR) 10Jbond: [C: 03+2] P:netbox: add dynamic config back to config file [puppet] - 10https://gerrit.wikimedia.org/r/807099 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [15:18:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:30] (03PS2) 10Lucas 
Werkmeister (WMDE): Enable Lexeme Lua access everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) [15:21:42] ^ about to deploy this [15:23:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "diffConfig looks good, let’s go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) (owner: 10Lucas Werkmeister (WMDE)) [15:24:01] PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:24:09] PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:24:30] (03Merged) 10jenkins-bot: Enable Lexeme Lua access everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) (owner: 10Lucas Werkmeister (WMDE)) [15:25:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807142 (owner: 10Ssingh) [15:25:20] testing on mwdebug1001 [15:25:33] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 67 connections established with conf2004.codfw.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [15:26:01] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 87 connections established with conf2004.codfw.wmnet:4001 (min=88) https://wikitech.wikimedia.org/wiki/PyBal [15:26:28] seems to work fine, syncing [15:26:47] !log klausman@puppetmaster1001 conftool action : set/weight=1; selector: name=ml-staging2001.codfw.wmnet [15:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:54] !log klausman@puppetmaster1001 conftool action : set/weight=1; selector: name=ml-staging2002.codfw.wmnet [15:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:59] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.58:30443]) https://wikitech.wikimedia.org/wiki/PyBal [15:26:59] (03PS4) 10Krinkle: mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [15:27:01] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.58:30443]) https://wikitech.wikimedia.org/wiki/PyBal [15:27:14] holding [15:27:28] !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2002.codfw.wmnet [15:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:40] !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-staging2001.codfw.wmnet [15:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:19] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is 
CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.58:30443]) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal [15:28:24] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 67 connections established with conf2004.codfw.wmnet:4001 (min=68) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal [15:28:29] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 87 connections established with conf2004.codfw.wmnet:4001 (min=88) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal [15:28:42] ok, I’m continuing [15:30:02] (03PS4) 10Krinkle: Set $wgCentralAuthTokenCacheType to mcrouter-master-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [15:30:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:10] !log Restarting pybal on lvs2010 [15:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:31] (03PS2) 10Volans: Revert "ganeti-netbox-sync: Add netbox 3.2 support" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805869 (https://phabricator.wikimedia.org/T296452) [15:30:33] (03PS6) 10Volans: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 [15:30:35] (03PS9) 10Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [15:30:54] scap errors on 2 hosts [15:31:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:31:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[15:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:12] I have to do another sync anyways, I’ll see if that one works better, then the php-fpm restart should be covered by that [15:32:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:38] Lucas_WMDE: What are the errors? [15:32:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806877|Enable Lexeme Lua access everywhere (T309593)]] (1/2) (duration: 03m 51s) [15:32:58] issues connecting to lvs, apparently [15:32:59] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2320.codfw.wmnet]) Klausman Setting up LVS for inference-staging (ML team) https://wikitech.wikimedia.org/wiki/PyBal [15:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:03] T309593: enable Lexeme Lua access on remaining Wikimedia projects - https://phabricator.wikimedia.org/T309593 [15:33:04] so I guess that could be related to what klausman is doing? [15:33:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet [15:33:13] Lucas: Same as https://phabricator.wikimedia.org/T310835 ? [15:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:26] (03PS1) 10Ssingh: Add sukhe to super-user for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/807145 [15:33:28] not quite the same I think [15:33:36] but the “free opcache” is also in the output at least [15:33:47] I can paste the console output later, I’ll do the second sync first [15:33:53] ok thanks [15:33:55] unless you want me to wait? 
[15:33:59] RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:34:03] no, go ahead [15:34:05] ok [15:34:15] 10SRE, 10Infrastructure-Foundations, 10netops: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (10ayounsi) 05Open→03Resolved a:03ayounsi > This should be fixed. Looks like it was a configuration failure during the planned migration PWIC218882.3. Confirmed resolved. [15:34:25] (second sync is only IS-labs, so main prod effect should be to restart php-fpm on the remaining two hosts) [15:34:30] scap running [15:34:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet [15:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:52] (03CR) 10Volans: "I've cleanup netbox-next and run the script for all clusters:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [15:36:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [15:37:01] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 88 connections established with conf2004.codfw.wmnet:4001 (min=88) https://wikitech.wikimedia.org/wiki/PyBal [15:37:03] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:37:39] !log restarting pybal on lvs2009 [15:37:40] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:806877|Enable Lexeme Lua access everywhere (T309593)]] (2/2) (duration: 03m 28s) [15:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:52] no error this time fyi dancy [15:38:13] OK thanks [15:38:47] !log 
mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet [15:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:58] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet [15:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:35] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:39:57] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 68 connections established with conf2004.codfw.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [15:40:01] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:40:01] dancy: I’ve put the output in a private paste for now https://phabricator.wikimedia.org/P29939 [15:40:14] it’s probably okay to make public, feel free to copy it into a task somewhere [15:41:28] I think there’s two issues there – the failed connection to lvs2010 (understandable if that was being worked on at the moment), and the fact that https://gerrit.wikimedia.org/g/operations/puppet/+/c8cb4a1796d5ff22803c171c277943eebecb8ee7/modules/conftool/files/safe-service-restart.py#333 throws an error if `status` was never assigned [15:42:20] (03PS5) 10Krinkle: Set $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [15:42:29] (03CR) 10Krinkle: [C: 03+1] mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [15:42:34] (03CR) 10Krinkle: [C: 03+1] Set 
$wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [15:43:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris) [15:45:13] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:45:35] RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:46:13] RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:46:13] RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:46:27] RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference-staging on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:47:06] (03CR) 10Jbond: [C: 03+1] net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:47:33] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Ticket 1-218053856766 opened for the loopback test. > Support, > > We need to test our cross-connection 20676697-A, which terminates into our panel @ PP:0603:1087235 - 15/16 and from that into our rou... 
[15:47:48] (03PS1) 10Majavah: P:toolforge::checker: add buster endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807168 (https://phabricator.wikimedia.org/T277653) [15:47:50] (03PS1) 10Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) [15:47:52] (03PS1) 10Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) [15:51:52] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 52.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:52:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 36.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:52:46] (Device rebooted) firing: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [15:52:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet [15:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:56] (03CR) 10Jbond: [C: 03+1] "LGTM will also need a follow up patch to remove old files, variables and resources" [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [15:53:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807127 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:54:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:24] 
RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 76.95 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:54:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:55:12] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 82.16 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:55:19] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:55:54] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2048.codfw.wmnet [15:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:57:27] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet [15:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:19] (03CR) 10Jbond: [C: 03+2] wmflib::service: Reject empty string values [puppet] - 10https://gerrit.wikimedia.org/r/806208 (owner: 10Jbond) [15:59:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet [15:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:06] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1600). 
[16:00:06] No Gerrit patches in the queue for this window AFAICS. [16:00:54] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet [16:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:09] (03PS1) 10Papaul: Add new pdu model for ps1-a2-codfw [puppet] - 10https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957) [16:02:03] (03CR) 10CI reject: [V: 04-1] Add new pdu model for ps1-a2-codfw [puppet] - 10https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [16:02:26] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35965/console" [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [16:02:49] (03PS25) 10Jbond: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [16:03:20] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:03:52] (03CR) 10Jbond: Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [16:04:17] (03PS2) 10Papaul: Add new pdu model for ps1-a2-codfw [puppet] - 10https://gerrit.wikimedia.org/r/807171 (https://phabricator.wikimedia.org/T309957) [16:05:02] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet [16:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:39] (03CR) 10Papaul: [C: 03+2] Add new pdu model for ps1-a2-codfw [puppet] - 10https://gerrit.wikimedia.org/r/807171 
(https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [16:06:07] (03Abandoned) 10Filippo Giunchedi: base: include profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806374 (owner: 10Filippo Giunchedi) [16:06:54] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [16:07:11] (03PS2) 10Filippo Giunchedi: pontoon: add profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806373 [16:07:13] (03PS2) 10Filippo Giunchedi: pontoon: enable SD for stack observability [puppet] - 10https://gerrit.wikimedia.org/r/806376 [16:07:15] (03PS2) 10Filippo Giunchedi: pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/806375 [16:07:19] (03PS4) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 [16:07:28] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:07:46] (Device rebooted) resolved: Device ps1-a2-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [16:07:50] (03CR) 10Jbond: [C: 03+2] wmflib: update kernel_details to also include kernel.unprivileged_userns_clone [puppet] - 10https://gerrit.wikimedia.org/r/806425 (owner: 10Jbond) [16:08:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:09] (03PS5) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 [16:10:16] (03CR) 10CI reject: [V: 04-1] pontoon: add profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi) [16:10:25] (03CR) 10Dzahn: "So.. there is a parameter "severity". and the default is "critical". This is what they mean:" [puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn) [16:11:03] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35967/console" [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris) [16:11:47] (03PS1) 10Jbond: P:sretest: Add original title parameter to sretest import/export [puppet] - 10https://gerrit.wikimedia.org/r/807173 [16:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:12:14] (03CR) 10Filippo Giunchedi: "both http://checker.tools.wmflabs.org/grid/continuous/buster and http://checker.tools.wmflabs.org/grid/start/buster yield 404 for me, expe" [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:13:30] (03CR) 10Jbond: [C: 03+2] P:sretest: Add original title parameter to sretest import/export [puppet] - 10https://gerrit.wikimedia.org/r/807173 (owner: 10Jbond) [16:13:33] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:sretest: Add original title parameter to sretest import/export [puppet] - 10https://gerrit.wikimedia.org/r/807173 (owner: 10Jbond) [16:13:43] (03PS2) 10Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - 10https://gerrit.wikimedia.org/r/807169 
(https://phabricator.wikimedia.org/T277653) [16:13:45] (03PS2) 10Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) [16:13:58] (03PS1) 10David Caro: openstack.vendordata: reduce timeout so it retries [puppet] - 10https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) [16:14:08] (03CR) 10Majavah: icinga::monitor::toollabs: replace stretch with buster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:14:20] 10SRE, 10serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10dancy) Pinging @JMeybohm and @Dzahn for support. [16:14:39] (03CR) 10Dzahn: "Do you guys see an existing list of teams? I asked about that and whether there are plans for another severity level "paging"." [puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn) [16:16:03] (03PS3) 10Filippo Giunchedi: pontoon: add profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806373 [16:16:05] (03PS3) 10Filippo Giunchedi: pontoon: enable SD for stack observability [puppet] - 10https://gerrit.wikimedia.org/r/806376 [16:16:07] (03PS3) 10Filippo Giunchedi: pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/806375 [16:16:45] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:49] (03CR) 10Volans: "reply inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [16:17:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "-1 while endpoints exist/parent change is deployed, can be merged afterwards" [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:18:34] (03CR) 10Filippo Giunchedi: "Went with the ENC approach, PTAL" [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi) [16:21:37] (03PS1) 10Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/807176 [16:21:37] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:22:27] (03PS1) 10Ahmon Dancy: profile::mediawiki::deployment::server: Rename a variable [puppet] - 10https://gerrit.wikimedia.org/r/807178 [16:23:17] (03CR) 10David Caro: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807168 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:26:07] (03PS1) 10Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) [16:26:32] (03CR) 10Dzahn: [C: 03+1] gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [16:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [16:27:57] PROBLEM - Host ms-be1055 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:51] RECOVERY - Host ms-be1055 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [16:29:12] (03CR) 10CI reject: [V: 04-1] prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [16:29:19] (03PS6) 10Jbond: getstats: Delete old versions of this report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) [16:29:37] (03CR) 10Jbond: getstats: Delete old versions of this report before running (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [16:30:06] (03CR) 10CI reject: [V: 04-1] getstats: Delete old versions of this 
report before running [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807095 (https://phabricator.wikimedia.org/T311048) (owner: 10Jbond) [16:30:30] (03PS2) 10Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) [16:31:25] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [16:32:33] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [16:32:43] PROBLEM - SSH on ms-be1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:32:43] (03CR) 10Filippo Giunchedi: "PCC full diff https://puppet-compiler.wmflabs.org/pcc-worker1003/35969/prometheus1005.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [16:34:43] RECOVERY - SSH on ms-be1055 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:34:51] PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [16:36:55] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:48] (03PS3) 10Majavah: icinga::monitor::toollabs: 
replace stretch with buster [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) [16:37:52] (03PS3) 10Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) [16:37:56] (03PS1) 10Majavah: P:toolforge::checker: add missing endpoint config [puppet] - 10https://gerrit.wikimedia.org/r/807182 (https://phabricator.wikimedia.org/T277653) [16:38:25] PROBLEM - Host ms-be1055 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:49] (03PS2) 10Majavah: P:toolforge::checker: add missing endpoint config [puppet] - 10https://gerrit.wikimedia.org/r/807182 (https://phabricator.wikimedia.org/T277653) [16:38:51] RECOVERY - Host ms-be1055 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [16:38:58] (03PS4) 10Majavah: icinga::monitor::toollabs: replace stretch with buster [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) [16:39:02] (03PS4) 10Majavah: P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) [16:39:19] PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster [16:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster [16:41:41] RECOVERY - Check systemd state on ms-be1055 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:19] (03CR) 10Ahmon Dancy: "PCC results (no changes): https://puppet-compiler.wmflabs.org/pcc-worker1002/35971/" [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy) [16:45:50] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:03] (03CR) 10Jgiannelos: [C: 04-1] "Lets hold on this for now since we need to manually bootstrap tile storage with fresh tiles." [puppet] - 10https://gerrit.wikimedia.org/r/807108 (https://phabricator.wikimedia.org/T305845) (owner: 10MSantos) [16:49:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:54:05] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:59:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Cmjohnson) @BTullis Can you confirm raid configuration and partman recipe to use please? 
[17:01:31] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster [17:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:37] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w... [17:02:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster [17:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:07] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster [17:03:53] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:06:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [17:09:49] !log bking@cumin1001 START - Cookbook
sre.hosts.reboot-single for host elastic1049.eqiad.wmnet [17:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:48] SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul) [17:14:34] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster [17:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:39] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w... [17:15:10] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp2001.wikimedia.org [17:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:51] SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog, User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (Jgiannelos) Just a quick correction on the numbers: the current production container size is ~40M objects not ~12M (i was countin...
[17:19:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [17:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:54] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host elastic1049.eqiad.wmnet [17:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:31] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2001.wikimedia.org [17:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:31] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp2001.wikimedia.org` - idp2001.wikimedia.org (**PASS**) - Downtimed host on Icing... [17:24:21] (CR) Ssingh: [V: +1 C: +2] dnsdist: override unit to set ProtectSystem to strict [puppet] - https://gerrit.wikimedia.org/r/807142 (owner: Ssingh) [17:24:41] ACKNOWLEDGEMENT - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search.service Brian_King This should have cleared by now, looking closer at the alert rules.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:51] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp1001.wikimedia.org [17:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:58] (PS1) Majavah: Remove stretch support [software/tools-webservice] - https://gerrit.wikimedia.org/r/807184 (https://phabricator.wikimedia.org/T277653) [17:30:42] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [17:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:45] (PS1) Cmjohnson: Add netboot.cfg and site.pp for an-presto hosts [puppet] - https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835) [17:36:37] (PS1) Ssingh: dnsdist: service override (improves 54f018dc5) [puppet] - https://gerrit.wikimedia.org/r/807188 [17:36:46] (CR) CI reject: [V: -1] Add netboot.cfg and site.pp for an-presto hosts [puppet] - https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835) (owner: Cmjohnson) [17:37:26] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35972/console" [puppet] - https://gerrit.wikimedia.org/r/807188 (owner: Ssingh) [17:37:44] (CR) Ssingh: [V: +1 C: +2] dnsdist: service override (improves 54f018dc5) [puppet] - https://gerrit.wikimedia.org/r/807188 (owner: Ssingh) [17:37:52] (CR) Muehlenhoff: base: create profile to allow unprivileged userns, use it on gitlab_runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn) [17:38:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:38:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1001.wikimedia.org [17:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:36]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:39] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp1001.wikimedia.org` - idp1001.wikimedia.org (**PASS**) - Downtimed host on Icing... [17:41:01] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (MoritzMuehlenhoff) [17:42:21] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete [17:43:06] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/807118 (owner: Slyngshede) [17:48:03] (PS2) Cmjohnson: Add netboot.cfg and site.pp for an-presto hosts [puppet] - https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835) [17:48:18] (CR) Dzahn: [C: +2] Remove profile::releases::upload and related classes [puppet] - https://gerrit.wikimedia.org/r/807069 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff) [17:50:19] (CR) Dzahn: [C: +2] "glad to remove this, it did raise some support requests before afair" [puppet] - https://gerrit.wikimedia.org/r/807069 (https://phabricator.wikimedia.org/T309765) (owner: Muehlenhoff) [17:50:58] (PS1) Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) [17:51:00] (CR) Thcipriani: [C: -1] Keyholder: add new agent for trainbranchbot [puppet] - https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: Thcipriani) [17:52:08] (CR) Thcipriani: [C: -1] "-1 as it needs a private key added to puppet secrets before it
will work correctly" [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [17:52:58] (03CR) 10Cmjohnson: [C: 03+2] Add netboot.cfg and site.pp for an-presto hosts [puppet] - 10https://gerrit.wikimedia.org/r/807187 (https://phabricator.wikimedia.org/T306835) (owner: 10Cmjohnson) [17:54:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) [17:55:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:13] (03PS2) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) [17:56:15] (03CR) 10Thcipriani: [C: 04-1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [17:56:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.345 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:59] (03CR) 10Ahmon Dancy: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [17:57:01] (03PS3) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) [17:57:03] (03CR) 10Thcipriani: [C: 04-1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [17:57:36] (03CR) 10Ahmon Dancy: [C: 03+1] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 
10Thcipriani) [17:57:43] (03PS7) 10Majavah: P:toolforge::grid::cronrunner: sync crontabs between hosts [puppet] - 10https://gerrit.wikimedia.org/r/805848 (https://phabricator.wikimedia.org/T284767) [17:57:45] (03PS1) 10Majavah: P:toolforge::grid::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/807194 (https://phabricator.wikimedia.org/T284767) [17:58:29] 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10thcipriani) Talked to @LSobanski and he asked for some clarification on the steps we need root help with. Here are all the steps Release Engineerin... [17:59:45] (03CR) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [18:00:05] hashar and brennen: That opportune time is upon us again. Time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T1800). 
[18:01:45] (03CR) 10Ahmon Dancy: [C: 03+1] Keyholder: add new agent for trainbranchbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [18:02:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:05:00] (03PS7) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [18:06:08] (03CR) 10CI reject: [V: 04-1] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [18:07:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) @btullis can you confirm what the raid configuration is supposed to be please. 2 SSD Raid 1? and Raid 10 th... [18:07:31] o/ - train was rolled to group0 earlier, no current blockers, logs fairly clean at a glance. currently nothing to do for this window. 
[18:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [18:07:47] (03PS8) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [18:09:00] (03CR) 10Jdlrobson: QuickSurveys: Add research-incentive to jawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [18:10:28] (03CR) 10Dzahn: [C: 03+1] C:base::puppet move Puppet to Systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807118 (owner: 10Slyngshede) [18:10:51] (03CR) 10Cathal Mooney: "Thanks Arzhel, tried to address in latest patchset let me know what you think. Went with the try/except as what the filter returns is dif" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [18:14:04] (03PS2) 10Majavah: P:toolforge::grid::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/807194 (https://phabricator.wikimedia.org/T284767) [18:14:06] (03PS8) 10Majavah: P:toolforge::grid::cronrunner: sync crontabs between hosts [puppet] - 10https://gerrit.wikimedia.org/r/805848 (https://phabricator.wikimedia.org/T284767) [18:20:03] (03CR) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [18:25:57] (03CR) 10Muehlenhoff: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [18:26:00] (03PS10) 10Dzahn: base: create profile to allow unprivileged userns, use 
it on gitlab_runners [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) [18:29:09] (03CR) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [18:30:03] (03PS11) 10Dzahn: base: create profile to allow unprivileged userns, use it on gitlab_runners [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) [18:34:24] (03PS2) 10Dzahn: mediawiki::deployment::server: Rename $deploy_ensure to $secondary_deploy_ensure [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy) [18:34:55] (03CR) 10Dzahn: [C: 03+2] mediawiki::deployment::server: Rename $deploy_ensure to $secondary_deploy_ensure [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy) [18:38:33] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:39:14] (03CR) 10Dzahn: [C: 03+2] "noop confirmed on deploy1002/2002 in prod" [puppet] - 10https://gerrit.wikimedia.org/r/807178 (owner: 10Ahmon Dancy) [18:40:55] (03PS4) 10Thcipriani: Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) [18:41:20] (03CR) 10Thcipriani: Keyholder: add new agent for trainbranchbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [18:43:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:47:28] (03PS4) 10Aaron Schulz: Use $region for default mcrouter routes [puppet] - 10https://gerrit.wikimedia.org/r/654330 [18:48:18] 
!log T301461 `ryankemper@miscweb1002:~$ sudo systemctl reload apache2` [18:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:24] T301461: Investigate cache issues after WDQS UI deployments - https://phabricator.wikimedia.org/T301461 [18:55:49] (03PS1) 10Ryan Kemper: query_service: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) [18:56:31] (03CR) 10Gehel: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) (owner: 10Ryan Kemper) [18:56:33] !log T301461 `ryankemper@miscweb1002:~$ sudo systemctl reload apache2` failed due to syntax error, patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/807200 [18:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:37] T301461: Investigate cache issues after WDQS UI deployments - https://phabricator.wikimedia.org/T301461 [18:56:54] (03PS2) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [19:04:26] (03PS1) 10Dzahn: alertmanager: create receivers for serviceops-collab [puppet] - 10https://gerrit.wikimedia.org/r/807201 [19:06:04] (03PS2) 10Dzahn: alertmanager: create receivers for serviceops-collab [puppet] - 10https://gerrit.wikimedia.org/r/807201 [19:10:04] (03CR) 10Dzahn: [C: 04-1] "first needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/807201 but keeping it separate" [puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn) [19:12:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:02] (03PS2) 10Ryan Kemper: query_service: fix syntax error in apache config [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) [19:14:25] 
(03PS1) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) [19:15:49] (03CR) 10RLazarus: [C: 03+1] "Ah, sorry for not catching this!" [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) (owner: 10Ryan Kemper) [19:20:38] mediawiki.org down? Just got "upstream connect error or disconnect/reset before headers. reset reason: overflow" [19:20:45] (03CR) 10Cwhite: [C: 03+2] logstash: copy aqs info field to error.message [puppet] - 10https://gerrit.wikimedia.org/r/806484 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [19:21:11] it and enwiki now loading for me but very very slowly [19:21:50] Seems to maybe be okay now though? [19:22:03] !log replicating Cassandra `system_auth` keyspace to codfw -- T307641 [19:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:07] T307641: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 [19:22:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [19:38:02] !log dancy@deploy1002 Installing scap version "4.9.5" for 558 hosts [19:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:22] !log dancy@deploy1002 Installation of scap version "4.9.5" completed for 558 hosts [19:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:44] !log dancy@deploy1002 backport aborted: (duration: 00m 10s) [19:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:21] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:42:36] (03CR) 10Jdlrobson: QuickSurveys: Deploy 
research-incentive to jawiki on Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [19:42:38] (03CR) 10Ahmon Dancy: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [19:45:05] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:38] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Ok, they can place the loop back 1 hop away on sg1 side of things and asked if they could do so today while on the call. I advised not yet, as we haven't drained that of traffic. @ayounsi or @cmooney:... [19:47:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jgreen) @Cmjohnson is there any update on these machines? 
[19:48:02] (03PS1) 10Cwhite: logstash: disable aqs high log rate mitigations [puppet] - 10https://gerrit.wikimedia.org/r/807208 (https://phabricator.wikimedia.org/T310760) [19:50:43] (03PS1) 10Ottomata: Set krb: present for ori [puppet] - 10https://gerrit.wikimedia.org/r/807209 (https://phabricator.wikimedia.org/T311088) [19:52:01] (03CR) 10Cwhite: [C: 03+2] logstash: disable aqs high log rate mitigations [puppet] - 10https://gerrit.wikimedia.org/r/807208 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [19:52:55] (03CR) 10Ottomata: [C: 03+2] Set krb: present for ori [puppet] - 10https://gerrit.wikimedia.org/r/807209 (https://phabricator.wikimedia.org/T311088) (owner: 10Ottomata) [19:54:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [19:55:02] (03CR) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T2000). [20:00:05] koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] hi [20:01:06] hi, i can deploy today [20:01:28] koi: ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805840, did you run the automated update via tox to ensure the png's are up to date? 
[20:02:09] I do, and found generated file is even larger than the one existed [20:02:27] you could see my PS1 [20:02:54] (03CR) 10Urbanecm: [C: 04-1] "logos in project-logos should be maintained through /logos/config.yaml and via tox. can you please update the yaml config to generate the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:03:31] that's weird [20:04:02] but looks you're right [20:04:09] (03PS3) 10Urbanecm: zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang) [20:04:14] (03CR) 10Urbanecm: [C: 03+2] zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang) [20:04:34] (03PS2) 10Urbanecm: fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) (owner: 10Stang) [20:04:37] (03CR) 10Urbanecm: [C: 03+2] fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) (owner: 10Stang) [20:04:47] (03CR) 10Stang: zhwikibooks: Add zh-hant variant logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:04:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [20:05:11] (03CR) 10Andrea Denisse: [C: 03+2] pontoon: fix race between SD/dnsmasq and resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/806375 (owner: 10Filippo Giunchedi) [20:05:40] (03CR) 10Andrea Denisse: [C: 03+2] "Looks good to me, thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/806375 (owner: 10Filippo Giunchedi) [20:05:48] (03Merged) 10jenkins-bot: zh_classicalwiki: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805840 (owner: 10Stang) [20:06:13] (03Merged) 10jenkins-bot: fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) (owner: 10Stang) [20:06:27] (03CR) 10Urbanecm: [C: 04-1] zhwikibooks: Add zh-hant variant logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:06:58] koi: the patches i merged are at mwdebug1001, please check [20:07:16] looking [20:07:28] (03CR) 10Andrea Denisse: [C: 03+2] pontoon: enable SD for stack observability [puppet] - 10https://gerrit.wikimedia.org/r/806376 (owner: 10Filippo Giunchedi) [20:07:49] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:13] LGTM(the sandboxlink one) [20:08:56] and the other one? [20:10:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:41] you mean the logo on zh_classical? 
Sorry but don't know how to check that [20:11:03] (03PS4) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) [20:11:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:43] koi: sorry, didn't realize it's a no-op patch [20:11:58] you could check the logo's still there, but it's impossible for that patch to break something, so, syncing [20:12:01] (both) [20:12:14] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:12:38] (03PS1) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 [20:13:35] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3f70e302e11756d9704acc86c45b3d7aabf31c4d: fawiktionary: Enable SandboxLink extension (T308505) (duration: 03m 37s) [20:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:40] T308505: Activate SandboxLink extensions for fa.wiktionary - https://phabricator.wikimedia.org/T308505 [20:14:51] (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) [20:14:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:26] (03PS2) 10Urbanecm: 
zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) (owner: 10Stang) [20:16:07] (03CR) 10Urbanecm: [C: 03+2] zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) (owner: 10Stang) [20:16:53] (03CR) 10Stang: zhwikibooks: Add zh-hant variant logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:17:01] (03Merged) 10jenkins-bot: zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) (owner: 10Stang) [20:18:36] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 721e413fff4e797626c7c5e8433130f341310af0: zh_classicalwiki: Declare commons files for logo (1/2) (duration: 03m 30s) [20:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:00] (03PS3) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) [20:20:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:04] !log urbanecm@deploy1002 Synchronized 
logos/config.yaml: 721e413fff4e797626c7c5e8433130f341310af0: zh_classicalwiki: Declare commons files for logo (2/2) (duration: 03m 28s) [20:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:53] (03PS4) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) [20:23:57] koi: so, most patches done. for the last one, i don't really want to add yet another file w/o a SVG equivalent (we should be converging to a HD'ed logos, and not adding more non-HD files is a way to get there, eventually). [20:25:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35973/gitlab-runner1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [20:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:26:35] PROBLEM - Check systemd state on mw1406 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [20:26:51] RECOVERY - AQS root url on aqs2001 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [20:26:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:27:00] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:27] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:49] (03CR) 10Dzahn: [C: 03+2] "This config file was created by puppet:" [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [20:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:58] yeah I know there's such goal for replacing all logo w/o 1.5x and 2x support, but I thought such variant is indeed need, many Chinese related sites treat logo variant in a not pretty elegant way [20:28:24] Like zhwikibooks, actually such file should exist inside repository many years before [20:28:31] (03CR) 10Cwhite: [C: 03+2] logstash: alertmanager use logsource as source for host.name field [puppet] - 10https://gerrit.wikimedia.org/r/806430 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:28:31] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) >>! In T310738#8007136, @Dzahn wrote: > There are incoming redirects into policy.wikimedia.org: > > https://wikimedia.... 
[20:28:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:41] (03CR) 10Jdlrobson: [C: 03+1] "(this looks ready to backport to me!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:30:08] BTW urbanecm, did you pull https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/806941 on mwdebug1001? [20:30:36] koi: sorry, i thought we already did that one. my mistake. [20:30:44] looks i only merged it [20:30:48] pulled to mwdebug1001 now [20:30:49] can you check? [20:30:54] cwhite: I've (re)enabled one of those aqs nodes that made so much noise last week. afaict things seem OK, but just in case there is something I'm not seeing... [20:31:01] (03CR) 10CI reject: [V: 04-1] QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:31:02] looking [20:31:31] urandom: thanks for the heads up. I'll watch for issues [20:32:06] LGTM [20:32:37] (03CR) 10David Caro: [C: 03+2] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807182 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [20:32:41] (03PS1) 10BCornwall: traffic: Port over ATS restart alert [alerts] - 10https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723) [20:33:16] koi: thanks, syncing [20:37:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b42e57d75ec6b0536493fa073805a0bcb066aef1: zhwikiquote: Disable local upload (T311017) (duration: 03m 43s) [20:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:08] T311017: Disable local upload for Chinese Wikiquote - https://phabricator.wikimedia.org/T311017 [20:37:15] koi: okay, that should be everything [20:37:17] anything else? [20:38:02] nothing except the logo for zhwikibooks, what should I do for that? [20:38:48] get a SVG for it 🙂 [20:39:24] 0_o [20:39:51] (03Abandoned) 10Stang: zhwikibooks: Add zh-hant variant logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:40:09] that's all [20:41:10] okay, then see you later :) [20:41:41] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:06] (03CR) 10Dzahn: [C: 03+1] gitlab/acme_chief: remove gitlab2001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [20:43:31] urbanecm: Sorry to bother again, I would like to have some input for T106068 (its a config change), where's the suggested place for me to go and like posting a notice for that? [20:43:31] T106068: [DisableAccount] Remove "inactive" user group - https://phabricator.wikimedia.org/T106068 [20:43:38] *it's [20:44:25] (03CR) 10Dzahn: [C: 03+1] "this goes last after the cookbook. 
other changes go before the cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/806864 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [20:48:41] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:42] koi: if i understand the issue correctly, you want to remove the inactive group and before doing so, you want to ensure the group's not used for anything [20:50:45] is that right? [20:51:49] yep, as I know at least one private site has such issue - someone inside the inactive group but not blocked [20:53:06] (03PS1) 10Cwhite: logstash: restore logging to the ecs-test partition [puppet] - 10https://gerrit.wikimedia.org/r/807216 (https://phabricator.wikimedia.org/T310760) [20:53:25] koi: in that case, user-notice is your friend. tag the task with #user-notice and add a comment summarizing the impact in "plain English" (including stuff you'd like people to check for). once it went through tech news, we can wait for a while and assuming no issues are raised, i'd be comfortable with going ahead [20:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:54:20] if there are any particular issues to check for before group removal (such as, presence of unblocked users in the group), feel free to define those issues in a separate comment -- i can run a private wiki-wide SQL query to check where the issue is present (and we can check with those responsible for whichever private wikis is affected in addition to a tech news entry) [20:54:57] does that make sense? 
[20:55:32] um, should this task therefore be protected as there might be kind of security risk of that, like data leaking [20:55:45] yeah, pretty clear, thanks a lot [20:56:37] koi: well, so long as you only describe potential issues, we'd be fine. i can paste the results of whichever query i run in a private paste, or we can create a separate task for coordination that requires a private discussion space [20:57:21] clear to me, doing [20:57:44] okay, great. feel free to ping me if i can help :) [21:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:07:12] (03PS1) 10David Caro: openstack.vendordata: Allow downgrading packages too [puppet] - 10https://gerrit.wikimedia.org/r/807221 (https://phabricator.wikimedia.org/T309930) [21:07:56] (03CR) 10David Caro: [C: 03+2] openstack.vendordata: reduce timeout so it retries [puppet] - 10https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [21:08:22] (03CR) 10David Caro: [C: 03+2] openstack.vendordata: Allow downgrading packages too [puppet] - 10https://gerrit.wikimedia.org/r/807221 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [21:17:16] (03CR) 10Cwhite: [C: 03+2] logstash: restore logging to the ecs-test partition [puppet] - 10https://gerrit.wikimedia.org/r/807216 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [21:19:00] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) >>! In T310738#8017973, @Varnent wrote: > @Dzahn - is that doable? I am not sure if we have redirected to web.archive.org... [21:21:18] (03CR) 10JHathaway: [C: 03+1] "looks good, some minor nits. 
It's unfortunate that helm doesn't have an api, and that you need to drag in so many dependencies, but I don't" [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [21:24:18] (03CR) 10Scardenasmolinar: [C: 03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan) [21:28:19] (03CR) 10JHathaway: [C: 03+1] "I suppose another way would be to invoke the help binary directly & consume its json output?" [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [21:31:12] (03CR) 10Dzahn: "great, if it works, of course. but please check that realtime notifications in phab still work after this. (aphlict). I don't think we hav" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [21:33:33] (03PS5) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) [21:40:14] (03CR) 10Ahmon Dancy: [C: 03+1] "Note that I am taking this over from Jaime while he is out." [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [22:06:13] 10SRE, 10Traffic-Icebox: Set CORS headers on error pages? - https://phabricator.wikimedia.org/T270526 (10BCornwall) a:03BCornwall Would love some pointers on where to start; I'll eventually find my way to the right place but it always helps to have an experienced set of hands to guide. :) [22:06:33] 10ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (10RobH) 05Open→03Resolved a:03RobH [22:06:36] 10SRE, 10Traffic: Set CORS headers on error pages?
- https://phabricator.wikimedia.org/T270526 (10BCornwall) [22:07:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic-Icebox, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10BCornwall) a:03BCornwall [22:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [22:07:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10BCornwall) [22:20:20] 10SRE, 10ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (10RobH) 05Stalled→03Declined so they are listed on the pdus but not normalized. We're not going to burn on-site remote hands to do this, and we'll just get this done when we update/migrate hardwa... [22:21:32] 10SRE, 10ops-esams: trace qfx5100-spare[12]-esams power cables - https://phabricator.wikimedia.org/T244914 (10RobH) 05Open→03Resolved a:03RobH they are spare and not powered or cabled, which is why no entry... should have closed this after realizing this months ago but forgot. [22:28:54] (03PS1) 10BCornwall: Delete git-setup script [dns] - 10https://gerrit.wikimedia.org/r/807229 [22:32:01] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:33:22] (03CR) 10BCornwall: "This is a pretty opinionated CR, so I apologize if it's not helpful.
It appears that this script hasn't seen any review/update since 2014 " [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall) [22:36:27] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:36:41] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:43:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:50:52] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) >>! In T310738#8018203, @Dzahn wrote: >>>! In T310738#8017973, @Varnent wrote: >> @Dzahn - is that doable? I am not sur... 
[23:10:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:32] (03PS2) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) [23:14:20] (03CR) 10DDesouza: [C: 03+1] "Fixed issues causing CodeSniffer to throw warnings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [23:16:59] (03PS3) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [23:17:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:48] (03CR) 10DDesouza: [C: 03+1] "Fixed issues that would cause CodeSniffer to throw warnings."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [23:22:10] (03CR) 10Tim Starling: [C: 03+2] "Seems like a good cleanup to me" [puppet] - 10https://gerrit.wikimedia.org/r/654330 (owner: 10Aaron Schulz) [23:45:29] (03CR) 10Aaron Schulz: [V: 03+1] mcrouter: Add stats route for fast increment [puppet] - 10https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [23:50:40] (03CR) 10Jdlrobson: "Code looks good, but as discussed you'll want to backport https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/807202 first and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [23:51:08] (03CR) 10Aaron Schulz: [C: 03+1] mcrouter: Add stats route for fast increment [puppet] - 10https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [23:51:20] (03PS6) 10Tim Starling: mcrouter: Add stats route for fast increment [puppet] - 10https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662)